Updated on 2025-09-04 GMT+08:00

How Do I Adjust the Threshold of memcpy in x86_64?

Background

The copy size at which glibc's memcpy switches to non-temporal stores is controlled by the tunable x86_non_temporal_threshold. This threshold has a significant impact on memory bandwidth, so you can adjust it to improve memory copy performance for your workload.

Method

The glibc community recommends setting the threshold to three quarters of the L3 cache size:

export GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=$(($(getconf LEVEL3_CACHE_SIZE) * 3 / 4))
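The one-line setting above assumes that getconf reports a non-zero L3 cache size. On some systems, such as virtual machines that hide the cache topology, it may report 0 or nothing, which would set the threshold to 0. The following is a minimal sketch that guards against that case. The helper name nt_threshold is hypothetical and used only for illustration.

```shell
#!/bin/sh
# Compute 3/4 of a cache size given in bytes (integer arithmetic).
# nt_threshold is a hypothetical helper name, not a glibc interface.
nt_threshold() {
    echo $(( $1 * 3 / 4 ))
}

# getconf may print 0, nothing, or a non-numeric value when the
# cache size is not available on this system.
l3=$(getconf LEVEL3_CACHE_SIZE 2>/dev/null || true)
case "$l3" in
    ''|*[!0-9]*) l3=0 ;;
esac

if [ "$l3" -gt 0 ]; then
    # Note: this overwrites any tunables already set in GLIBC_TUNABLES.
    GLIBC_TUNABLES="glibc.cpu.x86_non_temporal_threshold=$(nt_threshold "$l3")"
    export GLIBC_TUNABLES
    echo "GLIBC_TUNABLES=$GLIBC_TUNABLES"
else
    echo "L3 cache size not reported; leaving GLIBC_TUNABLES unchanged" >&2
fi
```

The variable only affects processes started after it is exported. With glibc 2.33 or later, running the dynamic loader with the --list-tunables option (on many x86_64 systems the loader is /lib64/ld-linux-x86-64.so.2, but the path is an assumption here) should show the value that new processes will see.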

memcpy

In glibc-2.34, memcpy and memmove share a similar implementation, which is described in a comment in the following glibc source file:

sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S

/* memmove/memcpy/mempcpy is implemented as:
   1. Use overlapping load and store to avoid branch.
   2. Load all sources into registers and store them together to avoid
      possible address overlap between source and destination.
   3. If size is 8 * VEC_SIZE or less, load all sources into registers
      and store them together.
   4. If address of destination > address of source, backward copy
      4 * VEC_SIZE at a time with unaligned load and aligned store.
      Load the first 4 * VEC and last VEC before the loop and store
      them after the loop to support overlapping addresses.
   5. Otherwise, forward copy 4 * VEC_SIZE at a time with unaligned
      load and aligned store.  Load the last 4 * VEC and first VEC
      before the loop and store them after the loop to support
      overlapping addresses.
   6. If size >= __x86_shared_non_temporal_threshold and there is no
      overlap between destination and source, use non-temporal store
      instead of aligned store.  */

As described in item 6 above, when the copy size reaches __x86_shared_non_temporal_threshold (the internal variable set by the x86_non_temporal_threshold tunable) and the source and destination do not overlap, non-temporal stores are used instead of aligned stores. Non-temporal stores use instructions such as movntdq to write data directly to memory without going through the CPU caches. For a large copy, the destination data would not stay in the cache anyway, so skipping the cache reads and writes makes non-temporal stores more suitable than aligned stores.
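To make the role of the threshold concrete, the path selection listed in the source comment can be modeled as a small shell function. This is only a simplified sketch of items 3 to 6 of that comment, not the actual glibc logic, which is written in assembly and handles more cases. The function name copy_path, its arguments, and the order in which the conditions are checked are assumptions made for illustration.

```shell
#!/bin/sh
# Simplified model of the path selection listed in the glibc comment.
# copy_path is a hypothetical helper; the real code has more cases.
# Arguments: size vec_size threshold overlap(0|1) dst_above_src(0|1)
copy_path() {
    size=$1 vec=$2 threshold=$3 overlap=$4 dst_above_src=$5
    if [ "$size" -le $(( 8 * vec )) ]; then
        echo "registers"        # item 3: load everything, then store
    elif [ "$size" -ge "$threshold" ] && [ "$overlap" -eq 0 ]; then
        echo "non-temporal"     # item 6: stores bypass the cache
    elif [ "$dst_above_src" -eq 1 ]; then
        echo "backward"         # item 4: 4 * VEC_SIZE per iteration
    else
        echo "forward"          # item 5: 4 * VEC_SIZE per iteration
    fi
}

# Example: 32-byte vectors and a 24 MiB threshold (3/4 of a 32 MiB L3).
copy_path 128 32 25165824 0 0        # small copy
copy_path 1048576 32 25165824 0 1    # 1 MiB, below the threshold
copy_path 33554432 32 25165824 0 0   # 32 MiB, no overlap
copy_path 33554432 32 25165824 1 1   # 32 MiB, overlapping buffers
```

In this model, lowering the threshold moves more mid-sized copies onto the non-temporal path, while raising it keeps them on the cached forward or backward paths.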