Preparing Software Packages, Weights, and Training Datasets
Before training, upload the model weights, software packages, and training datasets to the specified directory on the Lite Server.
| File | Recommended Directory | Description |
|---|---|---|
| Model weights | /mnt/sfs_turbo/model/{Model name} | Define /mnt/sfs_turbo as the variable ${work_dir}. |
| AscendCloud software package (including the AscendCloud-LLM code package; for details, see Software Package Structure) | /mnt/sfs_turbo | - |
| Training dataset | /mnt/sfs_turbo/training_data | - |

The /mnt/sfs_turbo directory serves as the default working directory for mounting SFS Turbo to the host. Simply upload your model weights, software packages, and training datasets to the server.
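The commands in the following steps reference this working directory through the ${work_dir} variable. A minimal way to define it in the shell session, assuming SFS Turbo is already mounted at /mnt/sfs_turbo on the Lite Server host, is:

```bash
# Assumption: SFS Turbo is already mounted at /mnt/sfs_turbo on the Lite Server host.
export work_dir=/mnt/sfs_turbo
echo ${work_dir}   # should print /mnt/sfs_turbo
```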
Step 1: Uploading the Code Package and Weight File
- Upload the AscendCloud-LLM-xxx.zip package to the host and decompress it. For details about how to obtain the package, see Table 5. The decompression details are as follows:

  cd ${work_dir}
  unzip AscendCloud-*.zip && unzip ./AscendCloud-LLM-*.zip

- Upload the weight file to the Lite Server. The weight file must be in Hugging Face format. For details about how to obtain the open-source weight file, see Supported Models.
- The weight file must be stored in the specified directory on the disk. Ensure that the model file and weight file (such as the LFS file) have been completely downloaded; an illustrative download example is shown after this list.

  cd ${work_dir}
  mkdir -p model/{model_name}

- Modify the weight (tokenizer) file for these models based on the selected framework and model type. For details, see the tokenizer file description.
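As one example of the weight upload step, the snippet below clones a Hugging Face-format weight repository with Git LFS and checks that the large weight files finished downloading. Qwen/Qwen2-7B is only a placeholder; use the open-source weights listed in Supported Models for your model.

```bash
cd ${work_dir}/model
# Placeholder repository; replace with the model listed in Supported Models.
git clone https://huggingface.co/Qwen/Qwen2-7B
cd Qwen2-7B
git lfs pull          # ensure LFS-managed weight files are fully downloaded
ls -lh *.safetensors  # sizes should match those shown on the model page
```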
Step 2: Uploading Data to a Specified Directory
The data requirements vary according to the framework. Place the expected training data in the ${work_dir}/training_data directory. The procedure is as follows:
- Create the training_data directory.

  cd ${work_dir}
  mkdir training_data

- Obtain the datasets and upload the specified data to the ${work_dir}/training_data directory.
  - Method 1: Download data by referring to Training Data Description.
  - Method 2: Some datasets have been preset in the software package and can be used directly.

    tar -zxvf ${work_dir}/llm_train/AscendFactory/data.tgz
    cp ${work_dir}/llm_train/AscendFactory/data/* ${work_dir}/training_data

  - Method 3: Use datasets that you have already processed yourself.
- Place the raw data or processed data by following this structure:

  ${work_dir}
  |── training_data
       |── alpaca_en_demo.json    # Original code dataset
       |── identity.json          # Original code dataset
       ...
       |── alpaca_gpt4_data.json  # Custom sample dataset

- [LLaMA-Factory framework] Preprocess the data.
- If data needs to be preprocessed, update the data/dataset_info.json file in the code directory. The command below is for reference. For details about the dataset file format and configuration, see README_zh.md.
vim dataset_info.json
The new parameters are as follows:
"alpaca_gpt4_data": { "file_name": "alpaca_gpt4_data.json" },The following figure shows an example.

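If the custom file follows the standard alpaca format (instruction/input/output fields), the entry can also declare an explicit column mapping. The snippet below is illustrative only; confirm the exact keys against README_zh.md.

```json
"alpaca_gpt4_data": {
  "file_name": "alpaca_gpt4_data.json",
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output"
  }
}
```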
- If data preprocessing is not required, the data preparation is complete.
- [VeRL framework] Preprocess data.
The VeRL framework requires that all datasets used for training be preprocessed.
- Copy the content from VeRL Data Processing Sample Script to the dataset_demo.py file locally according to the model type, and modify datasets.load_dataset in the script (a simplified sketch of such a script is shown after this list).

  dataset = datasets.load_dataset("xxx/xxx/xxx")  # Replace xxx/xxx/xxx with the full or relative path to your dataset folder or file.
- Download the datasets.

  git clone https://huggingface.co/datasets/hiyouga/geometry3k   # Multimodal dataset
  git clone https://huggingface.co/datasets/openai/gsm8k          # LLM dataset
- Run the following command on the local PC to convert the dataset:
python dataset_demo.py --local_dir=/data/verl-workdir/data/xxx/
--local_dir: path of the dataset generated after data processing

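For orientation, here is a simplified sketch of what a dataset_demo.py conversion script can look like, using the gsm8k dataset downloaded above. The record fields shown (prompt, reward_model) are illustrative assumptions; take the exact schema for your model type from the VeRL Data Processing Sample Script.

```python
# Simplified, illustrative sketch of dataset_demo.py. The exact record schema
# must come from the VeRL Data Processing Sample Script for your model type.
import argparse
import os

import datasets


def make_record(example, idx):
    # Illustrative mapping of one raw gsm8k sample into prompt/reward fields.
    return {
        "prompt": [{"role": "user", "content": example["question"]}],
        "reward_model": {"style": "rule", "ground_truth": example["answer"]},
    }


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_dir", default="/data/verl-workdir/data/gsm8k/",
                        help="Output directory for the converted parquet files")
    args = parser.parse_args()

    # Replace the path with the full or relative path to your dataset folder or file.
    dataset = datasets.load_dataset("openai/gsm8k", "main")

    train = dataset["train"].map(make_record, with_indices=True)
    test = dataset["test"].map(make_record, with_indices=True)

    os.makedirs(args.local_dir, exist_ok=True)
    train.to_parquet(os.path.join(args.local_dir, "train.parquet"))
    test.to_parquet(os.path.join(args.local_dir, "test.parquet"))
```

Running the conversion command shown above (python dataset_demo.py --local_dir=...) then writes train.parquet and test.parquet under the directory given by --local_dir.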
- [MindSpeed framework] The MindSpeed framework handles data conversion automatically using the settings in the training YAML file, eliminating the need for manual effort.
In a multi-node setup, only the rank_0 node handles data preprocessing and weight conversion. Ensure the original dataset, weights, and results are stored in a shared directory.