Updated on 2025-11-04 GMT+08:00

Preparing Software Packages, Weights, and Training Datasets

Before training, upload the model weights, software packages, and training datasets to the specified directory on the Lite Server.

Table 1 Uploading files and directories

  • Model weights
    Recommended directory: /mnt/sfs_turbo/model/{Model name}
  • AscendCloud software package (includes the AscendCloud-LLM code package; for details, see Software Package Structure)
    Recommended directory: /mnt/sfs_turbo
  • Training dataset
    Recommended directory: /mnt/sfs_turbo/training_data

  Description (applies to all items): Define /mnt/sfs_turbo as the variable ${work_dir}.

The /mnt/sfs_turbo directory is the default working directory where SFS Turbo is mounted on the host. Simply upload the model weights, software package, and training datasets to this directory on the server.
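
The setup can be sketched as follows. This is a minimal example, assuming SFS Turbo is already mounted at /mnt/sfs_turbo; adjust the mount point to your environment.

    # Assumption: SFS Turbo is already mounted at /mnt/sfs_turbo on the Lite Server
    export work_dir=/mnt/sfs_turbo
    df -h ${work_dir}                                # confirm the SFS Turbo file system is mounted
    mkdir -p ${work_dir}/model ${work_dir}/training_data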

Step 1: Uploading the Code Package and Weight File

  1. Upload the AscendCloud-LLM-xxx.zip package to the host and decompress it. For details about how to obtain the package, see Table 5. Decompress the package as follows:
    cd ${work_dir}
    unzip AscendCloud-*.zip && unzip ./AscendCloud-LLM-*.zip
  2. Upload the weight file to the Lite Server. The weight file must be in Hugging Face format. For details about how to obtain the open-source weight file, see Supported Models.
  3. Create the specified directory on the disk and store the weight files there. Ensure that the model files and weight files (such as the LFS files) have been downloaded completely. (Steps 1 to 3 are also sketched end to end after this list.)
    cd ${work_dir}
    mkdir -p model/{model_name}
  4. For the following models, modify the weight (tokenizer) file based on the selected framework and model type. For details, see the tokenizer file description.
    • LlaMA-Factory: glm4-9b model and InternVL2_5 models
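
For reference, the following sketch runs steps 1 to 3 end to end. The model name Qwen2-7B and the use of git-lfs are placeholders only; substitute a model from Supported Models and your own download method, and verify that every weight file downloaded completely.

    cd ${work_dir}
    unzip AscendCloud-*.zip && unzip ./AscendCloud-LLM-*.zip
    # Placeholder example: pull Hugging Face weights with git-lfs into the expected directory
    mkdir -p ${work_dir}/model && cd ${work_dir}/model
    git lfs install
    git clone https://huggingface.co/Qwen/Qwen2-7B
    ls -lh Qwen2-7B          # weight files should be full size, not LFS pointer stubs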

Step 2: Uploading Data to a Specified Directory

The data requirements vary according to the framework. Place the expected training data in the ${work_dir}/training_data directory. The procedure is as follows:

  1. Create the training_data directory.
    cd ${work_dir}
    mkdir -p training_data
  2. Obtain the datasets and upload the specified data to the ${work_dir}/training_data directory.
    • Method 1: Download data by referring to Training Data Description.
    • Method 2: Some datasets have been preset in the software package and can be used directly.
      tar -zxvf ${work_dir}/llm_train/AscendFactory/data.tgz
      cp  ${work_dir}/llm_train/AscendFactory/data/*  ${work_dir}/training_data
    • Method 3: Use datasets that you have already processed.
  3. Place the raw or processed data according to the following structure.
    ${work_dir}
      |── training_data
           |── alpaca_en_demo.json                   # Dataset preset in the code package
           |── identity.json                         # Dataset preset in the code package
           |── ...
           |── alpaca_gpt4_data.json                 # Custom sample dataset
  4. [LlaMA-Factory framework] Preprocess data.
    • If the data needs to be preprocessed, update the data/dataset_info.json file in the code directory. The command below is for reference; a non-interactive sketch also appears after this procedure. For details about the dataset file format and configuration, see README_zh.md.
      vim dataset_info.json

      The new parameters are as follows:

      "alpaca_gpt4_data": {
          "file_name": "alpaca_gpt4_data.json"
      },

    • If data preprocessing is not required, the data preparation is complete.
  5. [VeRL framework] Preprocess data.
    The VeRL framework requires that all datasets used for training be preprocessed.
    1. Based on the model type, copy the content from VeRL Data Processing Sample Script into a local dataset_demo.py file, and modify the datasets.load_dataset call in the script. (The full sequence is also sketched after this procedure.)
      dataset = datasets.load_dataset("xxx/xxx/xxx")  # Replace xxx/xxx/xxx with the full or relative path to your dataset folder or file.
    2. Download the datasets.
      git clone https://huggingface.co/datasets/hiyouga/geometry3k # Multimodal dataset
      git clone https://huggingface.co/datasets/openai/gsm8k # LLM dataset
    3. Run the following command on the local PC to convert the dataset:
      python dataset_demo.py --local_dir=/data/verl-workdir/data/xxx/

      --local_dir: path of the dataset generated after data processing
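
The following sketch complements step 4 by registering the custom dataset entry in dataset_info.json without opening an editor. It assumes jq is available and that dataset_info.json sits under ${work_dir}/llm_train/AscendFactory/data; adjust the path to your code directory.

    cd ${work_dir}/llm_train/AscendFactory/data      # assumed location of dataset_info.json
    jq '. + {"alpaca_gpt4_data": {"file_name": "alpaca_gpt4_data.json"}}' dataset_info.json \
        > dataset_info.json.tmp && mv dataset_info.json.tmp dataset_info.json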
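
For step 5, the commands below sketch the full VeRL preparation sequence using the gsm8k example; the output directory is illustrative, and dataset_demo.py is the script copied from the VeRL Data Processing Sample Script.

    git clone https://huggingface.co/datasets/openai/gsm8k                # LLM dataset example
    python dataset_demo.py --local_dir=/data/verl-workdir/data/gsm8k/     # convert to the format expected by VeRL
    ls /data/verl-workdir/data/gsm8k/                                      # inspect the converted output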

The MindSpeed framework handles data conversion automatically using settings in the training YAML file, eliminating the need for manual effort.

In a multi-node setup, only the rank_0 node handles data preprocessing and weight conversion. Ensure the original dataset, weights, and results are stored in a shared directory.