How Do I Adjust Training Parameters to Maximize the Pangu Model Performance?
There is no standard answer for selecting model fine-tuning parameters; the appropriate adjustments vary by scenario. In general, fine-tuning parameters are affected by the following factors:
- Difficulty of the target task: If the target task is simple, the model learns it easily and a small number of training epochs can achieve good results. Conversely, a complex task may require more training epochs.
- Data volume: More fine-tuning data generally approximates the true data distribution more closely, so you can set learning_rate and batch_size to larger values to improve training efficiency. If the amount of fine-tuning data is relatively small, set learning_rate and batch_size to smaller values to avoid overfitting.
- Parameter scale of the foundation model: If the parameter scale is small, you can set learning_rate and batch_size to larger values to improve training efficiency. For a large model, use smaller values to prevent memory overflow. One way to encode these heuristics is sketched after this list.
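As a rough illustration of how these factors interact, the sketch below encodes them as a simple selection function. Everything in it is an assumption for demonstration purposes: the function name suggest_params, the data-volume and model-size thresholds, and the concrete starting values are hypothetical, not official Pangu guidance.

```python
# Illustrative heuristic that encodes the factors above as starting
# points for learning_rate and batch_size. The function name and the
# thresholds are hypothetical, not part of the Pangu tooling.

def suggest_params(num_samples: int, model_params_billions: float) -> dict:
    """Return rough starting hyperparameters for fine-tuning.

    num_samples: number of fine-tuning examples.
    model_params_billions: foundation-model size in billions of parameters.
    """
    small_data = num_samples < 10_000          # hypothetical threshold
    large_model = model_params_billions >= 10  # hypothetical threshold
    if small_data or large_model:
        # Small dataset or large model: smaller values to avoid
        # overfitting and memory overflow.
        return {"learning_rate": 1e-5, "batch_size": 4}
    # Plenty of data and a smaller model: larger values for efficiency.
    return {"learning_rate": 1e-4, "batch_size": 8}

print(suggest_params(num_samples=5_000, model_params_billions=38))
# {'learning_rate': 1e-05, 'batch_size': 4}
```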
The following table lists the recommended values and descriptions of some fine-tuning parameters.
| Training Parameter | Value Range | Recommended Value | Description |
|---|---|---|---|
| epoch | 1 to 50 | 2/4/8/10 | The number of epochs is the number of complete passes through the training dataset. More epochs give the model more iterations to learn from the data, but too many may cause overfitting; too few may cause underfitting. Adjust the value based on the task difficulty and data volume: in general, use more epochs for a difficult task or a small dataset, and fewer epochs otherwise. If you lack fine-tuning experience, start with the provided default and adjust it based on how the model converges during training. |
| batch_size | ≥ 1 | 4/8 | The batch size is the number of training samples read per batch. A larger batch size speeds up training but consumes more memory and may make convergence difficult or cause overfitting. A smaller batch size consumes less memory but slows convergence and adds noise to training. In general, use a smaller batch size when the data volume is small or the model's parameter scale is large, and a larger one otherwise. If you lack fine-tuning experience, start with the provided default and adjust it based on the actual training behavior. |
| learning_rate | 0 to 1 | 1e-6 to 5e-4 | The learning rate is a hyperparameter that controls how much the weights are updated at each gradient descent step. If it is too high, the model may overshoot the optimum and diverge; if it is too low, the model converges very slowly. In general, use a smaller learning rate when the data volume is small or the model's parameter scale is large, and a larger one otherwise. If you lack fine-tuning experience, start with the provided default and adjust it based on how the model converges during training. |
| learning_rate_decay_ratio | 0 to 1 | 0.01 to 0.1 | The learning rate decay ratio sets the minimum learning rate reached during training: Minimum learning rate = Learning rate × Learning rate decay ratio. |
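To make the decay formula and the recommended values concrete, here is a minimal Python sketch. The config dictionary keys mirror the table above, but the dictionary itself, the decayed_lr helper, and the linear schedule are illustrative assumptions, not the actual Pangu fine-tuning API.

```python
# Minimal sketch of a fine-tuning configuration using values from the
# table above. The config dict and decayed_lr helper are illustrative,
# not the actual Pangu API.

config = {
    "epoch": 4,                        # 1 to 50; more for hard tasks or small datasets
    "batch_size": 8,                   # >= 1; smaller for large models or small datasets
    "learning_rate": 1e-4,             # 1e-6 to 5e-4
    "learning_rate_decay_ratio": 0.1,  # 0 to 1; sets the schedule's floor
}

# Minimum learning rate = Learning rate x Learning rate decay ratio
min_lr = config["learning_rate"] * config["learning_rate_decay_ratio"]

def decayed_lr(step: int, total_steps: int) -> float:
    """Linearly decay from the initial learning rate down to min_lr.

    Linear decay is just one possible schedule; the key point is that
    the learning rate never drops below min_lr.
    """
    progress = min(step / max(total_steps, 1), 1.0)
    return config["learning_rate"] - (config["learning_rate"] - min_lr) * progress

print(f"min_lr = {min_lr:.1e}")                          # min_lr = 1.0e-05
print(f"lr at step 50/100 = {decayed_lr(50, 100):.2e}")  # 5.50e-05
```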
Again, there is no standard answer for parameter selection; adjust the values to fit your actual task.