How Do I Enable Models to Learn Unsupervised Domain-Specific Knowledge If the Data Volume Is Insufficient for Incremental Pre-training?
Generally, incremental pre-training is recommended to enable models to learn domain-specific knowledge. However, pre-training requires a large amount of data. If the volume of your unsupervised documents is too small to meet the pre-training requirements, you can instead convert the documents into supervised data, mix the converted domain-specific data with the target task data, and then fine-tune the model.
The following are some solutions for converting unsupervised data into supervised data for your reference:
- Rule-based construction: You can construct supervised data by adopting some simple rules. For example:
Table 1 Common rules for constructing supervised data from unsupervised data

| Rule scenario | Description |
| --- | --- |
| Text generation: generate a paragraph from a title, keywords, or an introduction. | If your unsupervised document contains structured information such as a title, keywords, or an introduction, set the supervised question to "Please generate a text of no less than xx words based on title xxx/keyword xxx/introduction xxx," and set the answer to a paragraph that meets the requirements. |
| Continuation: write a complete paragraph from the first sentence or first paragraph. | If your unsupervised document contains no structured information, set the supervised question to "The following is the first sentence of an article: xxx/first paragraph of an article: xxx. Please continue writing a text of no less than xx words based on the above sentence/paragraph," and set the answer to a paragraph that meets the requirements. |
| Expansion: expand one sentence or one paragraph into a complete paragraph. | If your unsupervised document contains no structured information, set the supervised question to "The following is a sentence of an article: xxx/a paragraph of an article: xxx. Please expand the above sentence/paragraph into a text of no less than xx words," and set the answer to a paragraph that meets the requirements. |
| Blank filling: randomly mask one or more words, sentences, or paragraphs, and then fill in the blanks. | If your unsupervised document contains no structured information, set the supervised question to "Some words/sentences/paragraphs are missing in the following article: xxx. Please complete the missing information based on the article content," and set the answer to the missing information. |
Rule-based construction is fast and low-cost, but the resulting data lacks diversity.
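As a concrete sketch of the blank-filling rule, the snippet below masks one sentence of a paragraph and assembles it into a supervised Q&A pair. The prompt wording follows Table 1; splitting sentences on "." and the `make_fill_in_blank` helper name are simplifications for illustration.

```python
import random

def make_fill_in_blank(text: str, seed: int = 0) -> dict:
    """Build a blank-filling Q&A pair by masking one sentence of `text`."""
    # Naive sentence split on "."; a real pipeline would use a proper segmenter.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    idx = rng.randrange(len(sentences))
    answer = sentences[idx]
    masked = sentences.copy()
    masked[idx] = "____"
    question = (
        "Some sentences are missing in the following article: "
        + ". ".join(masked) + ". "
        + "Please complete the missing information based on the article content."
    )
    return {"question": question, "answer": answer}

pair = make_fill_in_blank(
    "Solar cells convert light. Inverters convert DC. Grids carry AC."
)
```

The same pattern extends to the word- and paragraph-level variants by changing the unit that gets masked.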
- LLM generalization: You can invoke an LLM (for example, a Pangu basic function model of any specification) to obtain supervised data. A common method is to slice the unsupervised text by chapter, paragraph, or character count, have the model generate Q&A pairs from each fragment, and then assemble the fragments, questions, and answers into supervised data. Models can generate rich, diverse data, but the cost is higher.
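A minimal sketch of the slicing-and-assembly step is shown below. `generate_qa` is a hypothetical callable wrapping whatever LLM you invoke (it must return a question/answer pair for a text fragment); the chunker here packs paragraphs up to a character budget.

```python
def split_into_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Slice unsupervised text by paragraph, packing paragraphs up to max_chars."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 <= max_chars:
            current += "\n\n" + para      # paragraph still fits: merge it
        elif not current:
            current = para                # start a new chunk
        else:
            chunks.append(current)        # chunk full: flush and start over
            current = para
    if current:
        chunks.append(current)
    return chunks

def build_qa_dataset(text: str, generate_qa) -> list[dict]:
    """Assemble fragments, questions, and answers into supervised samples.

    generate_qa(chunk) -> (question, answer) is a hypothetical LLM call.
    """
    samples = []
    for chunk in split_into_chunks(text):
        question, answer = generate_qa(chunk)
        samples.append({"context": chunk, "question": question, "answer": answer})
    return samples
```

Chunking by paragraph rather than a hard character cut keeps each fragment coherent, which tends to produce better Q&A pairs from the model.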
When you convert unsupervised data into supervised data, ensure data diversity as much as possible: apply different rule scenarios to different texts, or even construct the same text into multiple different rule scenarios.
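To illustrate constructing the same text into multiple rule scenarios, the sketch below builds both a continuation pair and an expansion pair from one paragraph. The prompt wording follows Table 1; the first-sentence split and the word counts are illustrative placeholders.

```python
def diversify(paragraph: str) -> list[dict]:
    """Construct one paragraph into two different rule scenarios."""
    # Naive first-sentence split; a real pipeline would use a proper segmenter.
    first, _, rest = paragraph.partition(". ")
    samples = []
    if rest:  # continuation: first sentence -> remainder of the paragraph
        samples.append({
            "question": ("The following is the first sentence of an article: "
                         f"{first}. Please continue writing a text of no less "
                         "than 100 words based on the above sentence."),
            "answer": rest,
        })
    # expansion: one sentence -> the full paragraph
    samples.append({
        "question": (f"The following is a sentence of an article: {first}. "
                     "Please expand the above sentence into a text of no less "
                     "than 200 words."),
        "answer": paragraph,
    })
    return samples
```

Each source paragraph thus yields several distinct supervised samples instead of one.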
Models of different specifications support different sequence lengths. When you convert unsupervised data into supervised data, ensure that the data length complies with the model's length limit.
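A simple guard for the length limit can be sketched as below. Character count is used here only as a stand-in; for real data you would measure tokens with your model's tokenizer, and the `max_chars` budget is a hypothetical value.

```python
def filter_by_length(samples: list[dict], max_chars: int) -> list[dict]:
    """Keep only samples whose question + answer fit the model's length limit.

    Character count is an approximation; measure tokens with the model's
    tokenizer when preparing real training data.
    """
    return [
        s for s in samples
        if len(s["question"]) + len(s["answer"]) <= max_chars
    ]

short_sample = {"question": "Q?", "answer": "A."}
too_long = {"question": "Q" * 600, "answer": "A" * 600}
kept = filter_by_length([short_sample, too_long], max_chars=1024)
```

Filtering (or truncating) before fine-tuning avoids silently feeding the model inputs that get cut off mid-sample.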