
Inference Service Accuracy Evaluation

This chapter introduces three accuracy evaluation tools: OpenCompass, Simple-evals, and MME.

  • OpenCompass supports evaluation schemes for over 20 Hugging Face and API models, more than 70 datasets, and approximately 400,000 questions, enabling one-stop evaluation of various large language models (LLMs).
  • Simple-evals is a lightweight library for evaluating language models. It can be used to evaluate datasets such as MMLU, GPQA, DROP, MGSM, and HumanEval. This tool is designed for online evaluation and uses the OpenAI API by default.
  • MME is suitable for accuracy testing of multimodal models. Currently supported models include qwen2-vl-2B, qwen2-vl-7B, qwen2-vl-72B, qwen2.5-vl-7B, qwen2.5-vl-32B, qwen2.5-vl-72B, internvl2.5-26B, InternVL2-Llama3-76B-AWQ, and gemma3-27B.

Using the OpenCompass Accuracy Evaluation Tool for LLM Accuracy Evaluation

Use OpenCompass for online service accuracy evaluation.

  1. Prepare the OpenCompass runtime environment using conda (recommended).
    conda create --name opencompass python=3.10 -y
    conda activate opencompass
  2. Install OpenCompass.
    git clone https://github.com/open-compass/opencompass
    cd opencompass
    pip install -e .
  3. Download supported datasets.
    # Download datasets to the data/ directory.
    wget https://github.com/open-compass/opencompass/releases/download/0.2.2.rc1/OpenCompassData-core-20240207.zip
    unzip OpenCompassData-core-20240207.zip

    You can also query supported models and datasets using the following command:

    python tools/list_configs.py [PATTERN1] [PATTERN2] [...]

    If no parameters are provided, it lists all model configurations in configs/models and all dataset configurations in configs/datasets.

    You can pass any number of parameters, and the script will list all configurations related to the provided strings, supporting fuzzy search and * wildcard matching. For example, the following command will list all configurations related to mmlu and llama:

    python tools/list_configs.py mmlu llama
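
    Wildcard patterns work as well; quoting the pattern prevents the shell from expanding the asterisk before it reaches the script. For example, the following hypothetical pattern lists the GSM8K dataset configurations:

    python tools/list_configs.py "gsm8k*"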
  4. Build configuration files.

    OpenCompass allows you to describe a complete experiment in a configuration file and run it directly with run.py. Configuration files are written in Python and must include the datasets and models fields.

    Create a new example.py file in opencompass/examples/. This configuration file imports the required dataset and model configurations using inheritance and combines the datasets and models fields in the desired format.

    from mmengine.config import read_base
    from opencompass.models import OpenAI

    with read_base():
        from opencompass.configs.datasets.gsm8k.gsm8k_gen import gsm8k_datasets

    datasets = gsm8k_datasets

    models = [dict(
        abbr='Qwen3-32B-W8A8',
        type=OpenAI,
        path='Qwen3-32B-W8A8',
        tokenizer_path='/Qwen/Qwen3-32B-W8A8',
        key='EMPTY',
        openai_api_base='http://127.0.0.1:8091/v1/chat/completions',
        temperature=0.6,
        query_per_second=1,
        max_out_len=31744,
        max_seq_len=31744,
        batch_size=8
    )]

    Some generative datasets have a default max_out_len=512 configuration, which may truncate results before the answer is fully generated, leading to low scores. You can update the datasets configuration in the example.py configuration file.

    Example:

    gsm8k_datasets[0]["infer_cfg"]["inferencer"].pop("max_out_len")
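
    If you prefer to keep an explicit limit instead of removing it, the same key can simply be raised; a minimal sketch (4096 is an arbitrary placeholder value):

    gsm8k_datasets[0]["infer_cfg"]["inferencer"]["max_out_len"] = 4096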

    Parameters:

    • abbr: Model abbreviation.
    • type: Model type.
    • path: Registered model name.
    • tokenizer_path: Tokenizer directory (defaults to path if not specified).
    • key: Model access key.
    • openai_api_base: Model service address.
    • temperature: Generation temperature.
    • query_per_second: Service request rate.
    • max_out_len: Maximum output length.
    • max_seq_len: Maximum input length.
    • batch_size: Batch size.
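
    Because models is a list, several endpoints can be compared in a single run by adding more entries. A minimal sketch, in which the second abbreviation, path, and port are placeholders for another deployed service:

    models = [
        # Each dict describes one OpenAI-compatible endpoint to evaluate.
        dict(abbr='Qwen3-32B-W8A8', type=OpenAI, path='Qwen3-32B-W8A8',
             key='EMPTY', openai_api_base='http://127.0.0.1:8091/v1/chat/completions',
             temperature=0.6, query_per_second=1, max_out_len=31744,
             max_seq_len=31744, batch_size=8),
        dict(abbr='Qwen3-32B-BF16', type=OpenAI, path='Qwen3-32B',
             key='EMPTY', openai_api_base='http://127.0.0.1:8092/v1/chat/completions',
             temperature=0.6, query_per_second=1, max_out_len=31744,
             max_seq_len=31744, batch_size=8),
    ]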
  5. Run the accuracy test task.
    python run.py examples/example.py -w ./outputs/demo

    More parameters in run.py:

    The following evaluation-related parameters can help you configure inference tasks more effectively for your environment; an example combining several of them follows the list:

    • -w outputs/demo: Directory to save evaluation logs and results. In this case, the experiment results will be saved to outputs/demo/{TIMESTAMP}.
    • -r {TIMESTAMP/latest}: Reuses existing inference results and skips completed tasks. If a timestamp is provided, it will reuse the results from that timestamp in the workspace path; if latest or nothing is specified, it will reuse the latest results in the specified workspace path.
    • --mode all: Specifies a particular stage of the task.
      • all: (default) Performs full evaluation, including inference and evaluation.
      • infer: Performs inference on each dataset.
      • eval: Evaluates based on inference results.
      • viz: Displays evaluation results only.
    • --max-num-workers: Maximum number of parallel tasks.
    • --debug: Runs tasks in debug mode. Tasks will be executed sequentially and output will be printed in real-time, which is useful for troubleshooting and ideal for initial task execution.
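
    For example, the following invocation reuses the latest inference results in the workspace and reruns only the evaluation stage:

    python run.py examples/example.py -w ./outputs/demo -r latest --mode eval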
  6. Evaluate the results.

    After the script runs, the test results are output to the terminal. All run outputs are directed to the outputs/demo/ directory, with the following structure:

    outputs/demo/
    ├── 20250220_120000
    ├── 20250220_183030     # Each experiment in a separate folder
    │   ├── configs         # Dumped configuration files for recording. If different experiments are rerun in the same experiment folder, multiple configurations may be retained.
    │   ├── logs            # Log files for the inference and evaluation stages
    │   │   ├── eval
    │   │   └── infer
    │   ├── predictions   # Inference results for each task
    │   ├── results       # Evaluation results for each task
    │   └── summary       # Summary evaluation results for a single experiment
    ├── ...
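
    If you want to collect the result files programmatically, the sketch below assumes only the directory layout shown above and the -w ./outputs/demo workspace from step 5; it locates the latest experiment folder and lists its summary files:

    from pathlib import Path

    workspace = Path("outputs/demo")
    # Experiment folders are timestamped, so the lexicographically largest name is the latest run.
    latest = max(p.name for p in workspace.iterdir() if p.is_dir())
    # List the summary files produced for that experiment.
    for summary_file in sorted((workspace / latest / "summary").iterdir()):
        print(summary_file)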

Using the Simple-evals Accuracy Evaluation Tool for LLM Accuracy Evaluation

Simple-evals is an online service accuracy evaluation tool that works with any serving framework that exposes an OpenAI-compatible API.

  1. Install the Simple-evals evaluation tool. Create a new Python environment by cloning the conda environment provided in the package; the following uses an environment named accuracy as an example.
    conda create -n accuracy --clone python-3.11.10
    conda activate accuracy
    cd xxx/simple_evals
    bash build.sh
  2. Run an online accuracy evaluation task.
    # Debug mode
    python simple_evals.py --model $model --dataset gpqa \
    --served-model-name $served_model_name \
    --url http://localhost:$port/v1 \
    --max-tokens 128 \
    --temperature 0.6 \
    --num-threads 32 \
    --debug
    
    # Full run
    python simple_evals.py --model $model --dataset gpqa \
    --served-model-name $served_model_name \
    --url http://localhost:$port/v1 \
    --max-tokens 16384 \
    --temperature 0.6 \
    --num-threads 32

    Parameters:

    • model: The model to be evaluated, which affects the generated file name. For example, Qwen3-32B will generate gpqa_Qwen3-32B_20250719_130703.json and gpqa_Qwen3-32B_20250719_130703.html in the results/ directory. The JSON file records the score, and the HTML file can be used to view detailed results.
    • dataset: The dataset to be evaluated, supporting mmlu, gpqa, mgsm, drop, and humaneval.
    • served_model_name: The model name exposed by the OpenAI-compatible service being evaluated; it must be included in each request.
    • port: Both local and online services can be evaluated. For a local service, the URL is generally http://localhost:8080/v1, where 8080 is the port; for an online service, the provider supplies the URL and served_model_name in OpenAI format. Note that the URL should end with /v1.
    • max-tokens: The maximum number of tokens to generate. As the output becomes longer with the support of chain-of-thought or other features, it is recommended to set this to 16384.
    • temperature: Affects the generated results; it is recommended to keep it unchanged.
    • num-threads: The number of concurrent requests sent to the service. Within the supported range, a higher concurrency reduces the time required. The recommended value is 32.
    • debug: Because a full evaluation consumes significant resources, a debug mode is provided that sends only a small number of requests to exercise the entire pipeline. It is recommended for the first run to quickly verify that the installation works.
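
    Putting the parameters together, a hypothetical local setup could define the variables used in the commands above as follows (model name and port are placeholders for your own service):

    model=Qwen3-32B
    served_model_name=Qwen3-32B
    port=8080

    The start command shown earlier in this step can then be run unchanged.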
  3. Result description

    The results are written to the results/ directory under simple_evals: scores are saved in JSON files, and detailed results are saved in HTML files. In gpqa_Qwen3-32B_20250719_130703.json, gpqa is the evaluated dataset, Qwen3-32B is the LLM being evaluated, and 20250719_130703 is the timestamp of the evaluation run.
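
    If you need the score in a script, the JSON record can be loaded directly. A minimal sketch using the example file name above (the content is printed as-is, without assuming a particular schema):

    import json

    # File name pattern: results/<dataset>_<model>_<timestamp>.json, as described above.
    with open("results/gpqa_Qwen3-32B_20250719_130703.json") as f:
        print(json.dumps(json.load(f), indent=2))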

Using the MME Accuracy Evaluation Tool for Multimodal Model Accuracy Evaluation

  1. Obtain the MME dataset.

    Obtain the MME evaluation dataset and upload it to the directory llm_tools/llm_evaluation/mme_eval/data/eval/.

  2. Obtain the accuracy test code, which is located in the llm_tools/llm_evaluation/mme_eval directory of the AscendCloud-LLM code package. The directory structure is as follows:
    mme_eval
    ├──metric.py        # MME accuracy test script
    ├──MME.sh           # Script to run MME
  3. Run the MME accuracy test script.
    export MODEL_PATH=/data/nfs/model/InternVL2-8B/ 
    export MME_PATH=/llm_tools/llm_evaluation/mme_eval/data/eval/MME
    export MODEL_TYPE=internvl2
    export OUTPUT_NAME=internvl2-8B 
    export ASCEND_RT_VISIBLE_DEVICES="0:1:2:3:4:5:6:7"
    bash MME.sh

    Parameters:

    • MODEL_PATH: Path to the model weights. The default value is empty.
    • MME_PATH: Path to the MME dataset. The default value is the current path.
    • MODEL_TYPE: Model type. Currently supported model types include llava, llava-next, minicpm, qwen-vl, internvl2, qwen2-vl, and llava-onevision.
    • OUTPUT_NAME: Name of the output result file. The default value is llava.
    • ASCEND_RT_VISIBLE_DEVICES: Devices used by the model service instances; multiple instances and model parallelism are supported, for example 0,1:2,3 (see the example after this list). The default is device 0.
    • QUANTIZATION: Quantization option. If not provided, the default value is None (quantization is not enabled). Supported values include w4a16, which requires corresponding weights.
    • GPU_MEMORY_UTILIZATION: Fraction of device memory that the service is allowed to use. The parameter name of the original vLLM is reused. The default value is 0.9.
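
    For example, a hypothetical layout with two service instances of four devices each, assuming that colons separate instances and commas separate the devices within an instance (consistent with the 0,1:2,3 example above), would be:

    export ASCEND_RT_VISIBLE_DEVICES="0,1,2,3:4,5,6,7"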

    After the script runs, the test results are output to the terminal.