Updated on 2025-06-30 GMT+08:00

Basic Concepts

Concepts Related to Large Models

| Concept | Description |
|---|---|
| LLM | Large language models (LLMs) are a category of foundation models pre-trained on immense amounts of data. Pre-training means that a model is first trained on a general task and then continuously fine-tuned on downstream tasks, which improves its accuracy on those tasks. A large-scale pre-trained model is a pre-trained model whose parameter count reaches the 100-billion or even trillion level. Such models generalize better and can accumulate industry knowledge and retrieve information more efficiently and accurately. |
| Token | A token is the smallest unit of text a model can work with; it can be a whole word, part of a word, or one or more characters. An LLM converts input and output text into tokens; at each step, it produces a probability distribution over its vocabulary and samples the next token from that distribution. Some compound words are split based on semantics; for example, "overweight" consists of two tokens: "over" and "weight". Taking Pangu N1 models as an example, one token represents approximately 0.75 English words or 1.5 Chinese characters. For the word-to-token ratios of each model series, see Table 1 and the estimation sketch after it. |

Table 1 Word-to-token ratios

| Model Specifications | English Word-to-Token Ratio (words per token) | Chinese Character-to-Token Ratio (characters per token) |
|---|---|---|
| N1 series models | 0.75 | 1.5 |
| N2 series models | 0.88 | 1.24 |
| N4 series models | 0.75 | 1.5 |
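The ratios in Table 1 can be used for rough token-count estimates, for example to check a prompt against a context limit before calling a model. The sketch below is a minimal illustration in Python; the function name, the whitespace-based word counting, and the CJK character range are assumptions for illustration, and the model's actual tokenizer should be used when exact counts matter.

```python
# Rough token estimation from the word-to-token ratios in Table 1.
# "0.75 English words per token" means tokens ~= words / 0.75.

RATIOS = {
    # series: (English words per token, Chinese characters per token)
    "N1": (0.75, 1.5),
    "N2": (0.88, 1.24),
    "N4": (0.75, 1.5),
}

def estimate_tokens(text: str, series: str = "N1") -> int:
    """Crude estimate only; use the real tokenizer for exact counts."""
    words_per_token, chars_per_token = RATIOS[series]
    chinese_chars = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    english_words = sum(1 for w in text.split() if any(c.isalpha() and c.isascii() for c in w))
    return round(english_words / words_per_token + chinese_chars / chars_per_token)

print(estimate_tokens("Large language models convert text into tokens."))  # -> 9
```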

Training

Table 2 Training-related concepts

| Concept | Description |
|---|---|
| Self-supervised learning | Self-supervised learning (SSL) is a subset of unsupervised learning that uses pretext tasks to derive supervision signals from unlabeled data. These pretext tasks are self-generated challenges that the model solves in order to learn representations that are valuable for downstream tasks. SSL requires no additional manually labeled data because the supervisory signal comes from the data itself. |
| Supervised learning | Supervised learning is a machine learning task that infers a function from labeled training data to make predictions. Each sample in the labeled training data consists of an input and its expected output. |
| LoRA | Low-rank adaptation (LoRA) is a fine-tuning technique that freezes a model's pre-trained weights and updates only a small set of additional low-rank parameters. It significantly reduces the computational resources and time required for fine-tuning while keeping performance at or near that of full fine-tuning. A minimal sketch follows this table. |
| Overfitting | Overfitting occurs when a model fits the training data so closely that it fails to generalize to new data. |
| Underfitting | Underfitting occurs when a model performs poorly even on the training data because it is too simplistic to capture the data's underlying patterns. |
| Loss function | A loss function, often written as L(Y, f(x)), is a non-negative real-valued function that measures the error between the predicted value f(x) and the actual value Y for a sample x. A smaller loss generally indicates a more robust model. |
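To make LoRA concrete, here is a minimal sketch of a LoRA-adapted linear layer, assuming PyTorch; the rank, scaling factor, and class name are illustrative choices, not the implementation of any particular library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: y = Wx + (alpha / r) * B(Ax), with W frozen."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # pre-trained weights stay frozen
        self.base.bias.requires_grad_(False)
        # Only the low-rank factors A and B are trained:
        # r * (in + out) parameters instead of in * out.
        self.lora_a = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(1024, 1024)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total}")  # ~16k of ~1.07M parameters
```

Zero-initializing B means the adapted layer starts out identical to the frozen base layer, so fine-tuning begins from the pre-trained behavior.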

Inference

Table 3 Inference-related concepts

| Concept | Description |
|---|---|
| Temperature | The temperature parameter controls the randomness and creativity of the text generated by a generative language model. It rescales the logits fed into the model's softmax output layer. A higher temperature flattens the probability distribution over candidate tokens (their probabilities become more uniform), so less likely tokens have a better chance of being selected, which increases the diversity of the generated text; a lower temperature sharpens the distribution, making the output more deterministic. See the sampling sketch after this table. |
| Diversity and consistency | Diversity and consistency are two important dimensions for evaluating text generated by LLMs. Diversity refers to how much the outputs generated by a model differ from one another. Consistency refers to how stable the model's outputs are when it is given the same input repeatedly. |
| Repetition penalty | Repetition penalty is a technique used in model training or text generation to discourage repeating tokens that have appeared recently in the output. During training, a penalty term for repetitive output can be added to the loss function (the function that drives model optimization), so that generating repetitive tokens increases the loss; during generation, the scores of already generated tokens can be reduced before sampling. Both push the model toward more diverse tokens. |
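To show how temperature and the repetition penalty act on a model's raw scores (logits) at generation time, here is a minimal sampling sketch, assuming NumPy; the divide-or-multiply penalty convention and all numbers are illustrative.

```python
import numpy as np

def sample_next_token(logits, generated, temperature=1.0, repetition_penalty=1.2, rng=None):
    """Penalize repeated tokens, rescale by temperature, then sample."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float).copy()

    # Repetition penalty: shrink the score of tokens that already appeared,
    # nudging the model toward tokens it has not produced recently.
    for t in set(generated):
        logits[t] = logits[t] / repetition_penalty if logits[t] > 0 else logits[t] * repetition_penalty

    # Temperature: divide logits before the softmax. T > 1 flattens the
    # distribution (more diverse output); T < 1 sharpens it (more deterministic).
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5, -1.0]
print(sample_next_token(logits, generated=[0], temperature=0.7))
```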

Prompt Engineering

Table 4 Concepts related to prompt engineering

| Concept | Description |
|---|---|
| Prompt | A prompt is the text used to interact with an AI model. It tells the model what content to generate. |
| CoT | Chain-of-thought (CoT) is a method that simulates human problem-solving. It uses a series of natural-language reasoning steps to work, step by step, from the input to the final conclusion. An illustrative prompt follows this table. |
| Self-Instruct | Self-Instruct is a method for aligning pre-trained language models with instructions. With Self-Instruct, a language model generates its own instruction data without relying on extensive manual annotation. |
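To illustrate CoT in practice, here is a minimal few-shot chain-of-thought prompt; the questions, wording, and format are illustrative assumptions rather than a prescribed prompt format for any specific model.

```python
# A few-shot CoT prompt: the worked example demonstrates step-by-step
# reasoning, which the model is then likely to imitate for the new question.
cot_prompt = """\
Q: A warehouse has 23 boxes. 20 are shipped out and 6 more arrive. How many boxes are left?
A: Let's think step by step.
1. Start with 23 boxes.
2. Shipping out 20 leaves 23 - 20 = 3 boxes.
3. Receiving 6 more gives 3 + 6 = 9 boxes.
The answer is 9.

Q: A library has 45 books. It lends out 12 and receives 8 donations. How many books does it have now?
A: Let's think step by step.
"""
# Sent to an LLM, this prompt encourages the model to continue the same
# step-by-step pattern before stating the final answer (41).
```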