
Ceph Tuning

Updated at: Sep 30, 2021 GMT+08:00

Ceph Configuration Tuning

  • Purpose

    Adjust the Ceph configuration items to fully utilize the hardware performance of the system.

  • Procedure

    You can modify Ceph configuration parameters by editing the /etc/ceph/ceph.conf file.

    For example, to change the number of copies to 4, you can add osd_pool_default_size = 4 to the /etc/ceph/ceph.conf file and run the systemctl restart ceph.target command to restart the Ceph daemon process for the change to take effect.

    The preceding operations take effect only on the current Ceph node. You need to modify the ceph.conf file on all Ceph nodes and restart the Ceph daemon process for the modification to take effect on the entire Ceph cluster.
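
    For example, on one node the workflow looks like the following minimal sketch (the parameter and value are the example above; osd.0 is only a placeholder daemon name for the optional verification step):

        # /etc/ceph/ceph.conf (excerpt)
        [global]
        osd_pool_default_size = 4

        # Restart the Ceph daemons on this node so the change takes effect
        systemctl restart ceph.target

        # Optionally confirm the running value through a daemon's admin socket
        ceph daemon osd.0 config get osd_pool_default_size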

    Table 1 lists the optimization items.

    Table 1 Ceph parameter configuration

    Each entry below lists the parameter name, a description, the default value, and the suggested value.

    [global]

    osd_pool_default_min_size

    Minimum number of replicas that must be available for a PG to accept I/O. Setting a smaller value allows a PG in the degraded state to continue serving I/O.

    Default value: 0

    Suggestion: Set this parameter to 1.

    cluster_network

    Network segment, separate from the public network, that is used for OSD replication and data balancing. Configuring it relieves pressure on the public network.

    Default value: none

    Suggestion: Set this parameter to a dedicated replication network segment, for example, 192.168.4.0/24.

    osd_pool_default_size

    Number of replicas for objects in a storage pool.

    Default value: 3

    Suggestion: Set this parameter to 3.

    mon_max_pg_per_osd

    Maximum number of PGs per OSD before the monitor raises a warning. You can increase the value to allow more PGs per OSD.

    Default value: 250

    Suggestion: Set this parameter to 3000.

    mon_max_pool_pg_num

    Maximum number of PGs allowed in a single pool before a warning is raised. You can increase the value to allow more PGs per pool.

    Default value: 65536

    Suggestion: Set this parameter to 300000.

    debug_none, debug_lockdep, debug_context, debug_crush, debug_mds, debug_mds_balancer, debug_mds_locker, debug_mds_log, debug_mds_log_expire, debug_mds_migrator, debug_buffer, debug_timer, debug_filer, debug_striper, debug_objecter, debug_rados, debug_rbd, debug_rbd_mirror, debug_rbd_replay, debug_journaler, debug_objectcacher, debug_client, debug_osd, debug_optracker, debug_objclass, debug_filestore, debug_journal, debug_ms, debug_mon, debug_monc, debug_paxos, debug_tp, debug_auth, debug_crypto, debug_finisher, debug_reserver, debug_heartbeatmap, debug_perfcounter, debug_rgw, debug_civetweb, debug_javaclient, debug_asok, debug_throttle, debug_refs, debug_xio, debug_compressor, debug_bluestore, debug_bluefs, debug_bdev, debug_kstore, debug_rocksdb, debug_leveldb, debug_memdb, debug_kinetic, debug_fuse, debug_mgr, debug_mgrc, debug_dpdk, debug_eventtrace

    Disable the debugging function for these subsystems to reduce log printing overheads.

    Suggestion: Set each of these parameters to 0/0.

    throttler_perf_counter

    Performance counters for throttles, which are enabled by default and can be used to check whether a throttle threshold is a bottleneck. Because the counters themselves add overhead, you are advised to disable them once the optimal settings have been found.

    Default value: True

    Suggestion: Set this parameter to False.

    ms_dispatch_throttle_bytes

    Maximum size (in bytes) of messages waiting to be dispatched. You are advised to increase the value to improve message processing efficiency.

    Default value: 104857600

    Suggestion: Set this parameter to 2097152000.

    ms_bind_before_connect

    Bind the connection to a local address before connecting, which helps balance traffic across multiple network ports.

    Default value: False

    Suggestion: Set this parameter to True.

    [client]

    rbd_cache

    RBD client cache switch. When the cache is disabled, client writes are not cached and effectively behave as write-through.

    Default value: True

    Suggestion: Set this parameter to False.

    [osd]

    osd_max_write_size

    Maximum size (in MB) of data that can be written by an OSD at a time

    Default value: 90

    Suggestion: Set this parameter to 256.

    osd_client_message_size_cap

    Maximum size (in bytes) of client message data that an OSD can hold in memory

    Default value: 524288000

    Suggestion: Set this parameter to 1073741824.

    osd_map_cache_size

    Size of the cache (in MB) that stores the OSD map

    Default value: 50

    Suggestion: Set this parameter to 1024.

    bluestore_rocksdb_options

    RocksDB configuration parameter

    Default value:

        compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2

    Suggestion:

        compression=kNoCompression,max_write_buffer_number=64,min_write_buffer_number_to_merge=32,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,write_buffer_size=4MB,target_file_size_base=4MB,max_background_compactions=16,level0_file_num_compaction_trigger=64,level0_slowdown_writes_trigger=128,level0_stop_writes_trigger=256,max_bytes_for_level_base=6GB,compaction_threads=8,flusher_threads=4,compaction_readahead_size=2MB

    bluestore_csum_type

    Checksum algorithm used by BlueStore. Setting it to none disables checksum calculation to reduce overhead.

    Default value: crc32c

    Suggestion: Set this parameter to none.

    mon_osd_full_ratio

    Percentage of used drive space when an OSD is considered to be full. When the data volume exceeds this percentage, all read and write operations are stopped until the drive space is expanded or data is cleared so that the percentage of used drive space is less than the value.

    Default value: 0.95

    Suggestion: Set this parameter to 0.97.

    mon_osd_nearfull_ratio

    Percentage of used drive space at which an OSD is considered nearly full. When the data volume exceeds this percentage, an alarm is generated indicating that the space is about to be exhausted.

    Default value: 0.85

    Suggestion: Set this parameter to 0.95.

    osd_min_pg_log_entries

    Lower limit of the number of PG logs

    Default value: 3000

    Suggestion: Set this parameter to 10.

    osd_max_pg_log_entries

    Upper limit of the number of PG logs

    Default value: 3000

    Suggestion: Set this parameter to 10.

    bluestore_cache_meta_ratio

    Ratio of BlueStore cache allocated to metadata.

    Default value: 0.4

    Suggestion: Set this parameter to 0.8.

    bluestore_cache_kv_ratio

    Ratio of BlueStore cache allocated to key/value data.

    Default value: 0.4

    Suggestion: Set this parameter to 0.2.
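
    For reference, the following sketch collects the values suggested in Table 1 into a ceph.conf excerpt. The cluster_network segment is only an example and must match your environment, and every value should be validated against your own hardware and workload; bluestore_rocksdb_options is omitted here for brevity (use the suggested string above).

        [global]
        osd_pool_default_min_size = 1
        osd_pool_default_size = 3
        cluster_network = 192.168.4.0/24      # example segment for OSD replication traffic
        mon_max_pg_per_osd = 3000
        mon_max_pool_pg_num = 300000
        throttler_perf_counter = false
        ms_dispatch_throttle_bytes = 2097152000
        ms_bind_before_connect = true
        debug_ms = 0/0                        # repeat 0/0 for the other debug_* parameters listed above

        [client]
        rbd_cache = false

        [osd]
        osd_max_write_size = 256
        osd_client_message_size_cap = 1073741824
        osd_map_cache_size = 1024
        bluestore_csum_type = none
        mon_osd_full_ratio = 0.97
        mon_osd_nearfull_ratio = 0.95
        osd_min_pg_log_entries = 10
        osd_max_pg_log_entries = 10
        bluestore_cache_meta_ratio = 0.8
        bluestore_cache_kv_ratio = 0.2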

Optimizing the PG Distribution

  • Purpose

    Adjust the number of PGs on each OSD to balance the load on each OSD.

  • Procedure

    By default, Ceph allocates eight PGs/PGPs to each storage pool. When creating a storage pool, run the ceph osd pool create {pool-name} {pg-num} {pgp-num} command to specify the number of PGs/PGPs. For an existing storage pool, run the ceph osd pool set {pool_name} pg_num {pg-num} and ceph osd pool set {pool_name} pgp_num {pgp-num} commands to change the number of PGs/PGPs. After the modification, run the ceph osd pool get {pool_name} pg_num/pgp_num command to check the number of PGs/PGPs in the storage pool.

    The default value of ceph balancer mode is none. You can run the ceph balancer mode upmap command to change it to upmap. The Ceph balancer function is disabled by default; run the ceph balancer on or ceph balancer off command to enable or disable it.
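
    As a sketch, for a hypothetical pool named volumes on a cluster with 12 OSDs and 3 replicas (12 x 100 / 3 = 400, rounded up to 512 using the pg_num formula in Table 2 below):

        # Create a new pool with the calculated PG/PGP count
        ceph osd pool create volumes 512 512

        # Or adjust an existing pool
        ceph osd pool set volumes pg_num 512
        ceph osd pool set volumes pgp_num 512

        # Verify the values
        ceph osd pool get volumes pg_num
        ceph osd pool get volumes pgp_num

        # Enable the balancer in upmap mode and check the distribution
        ceph balancer mode upmap
        ceph balancer on
        ceph balancer eval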

    Table 2 describes the PG distribution parameters.

    Table 2 PG distribution parameters


    pg_num

    Total PGs = (Total number of OSDs x 100) / max_replication_count

    Round up the result to the nearest power of 2. For example, with 12 OSDs and 3 replicas, 12 x 100 / 3 = 400, which rounds up to 512.

    Default value: 8

    Symptom: A warning is displayed if the number of PGs is insufficient.

    Suggestion: Calculate the value based on the formula.

    pgp_num

    Set the number of PGPs to be the same as that of PGs.

    Default value: 8

    Symptom: The number of PGPs should be the same as the number of PGs.

    Suggestion: Calculate the value based on the formula.

    ceph_balancer_mode

    Enable the balancer plug-in and set the plug-in mode to upmap.

    Default value: none

    Symptom: If the number of PGs is unbalanced, some OSDs may be overloaded and become bottlenecks.

    Recommended value: upmap

    • The number of PGs carried by each OSD must be the same or close. Otherwise, some OSDs may be overloaded and become bottlenecks. The balancer plug-in can be used to optimize the PG distribution. You can run the ceph balancer eval or ceph pg dump command to view the PG distribution.
    • Run the ceph balancer mode upmap and ceph balancer on commands to automatically balance and optimize Ceph PGs. Ceph adjusts the distribution of a few PGs every 60 seconds. Run the ceph balancer eval or ceph pg dump command to view the PG distribution. If the PG distribution does not change, the distribution is optimal.
    • The PG distribution of each OSD affects the load balancing of write data. In addition to optimizing the number of PGs corresponding to each OSD, the distribution of the primary PGs also needs to be optimized. That is, the primary PGs need to be distributed to each OSD as evenly as possible.

Binding OSDs to CPU Cores

  • Purpose

    Bind each OSD process to a fixed CPU core.

  • Procedure

    Add osd_numa_node = <NUM> to the /etc/ceph/ceph.conf file.

    Table 3 describes the optimization items.

    Table 3 OSD core binding parameters


    [osd.n]

    osd_numa_node

    Bind the osd.n daemon process to a specified idle NUMA node, which is a node other than the nodes that process the NIC software interrupt.

    This parameter has no default value.

    Symptom: If OSD processes and NIC interrupts are handled by the same CPUs, those CPUs may be overloaded.

    Suggestion: To balance the CPU load pressure, avoid running each OSD process and NIC interrupt process (or other processes with high CPU usage) on the same NUMA node.

    • The Ceph OSD daemon process and NIC software interrupt process must run on different NUMA nodes. Otherwise, CPU bottlenecks may occur when the network load is heavy. By default, Ceph evenly allocates OSD processes to all CPU cores. You can add the osd_numa_node parameter to the ceph.conf file to avoid running each OSD process and NIC interrupt process (or other processes with high CPU usage) on the same NUMA node.
    • Optimizing the Network Performance describes how to bind NIC software interrupts to the CPU cores of the NUMA node to which the NIC belongs. When the network load is heavy, the usage of the CPU cores bound to the software interrupts is high. Therefore, you are advised to set osd_numa_node to a NUMA node different from that of the NIC, as shown in the example after this list. For example, run the cat /sys/class/net/<port name>/device/numa_node command to query the NUMA node of the NIC. If the NIC belongs to NUMA node 2, set osd_numa_node = 0 or osd_numa_node = 1 to prevent the OSD and NIC software interrupt from using the same CPU core.
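
    Putting this together, a minimal sketch follows. The NIC name enp125s0f0 and the daemon ID osd.0 are placeholders for your environment:

        # Query the NUMA node of the NIC that carries Ceph traffic
        cat /sys/class/net/enp125s0f0/device/numa_node

        # If the NIC reports NUMA node 2, pin the OSD to another node in /etc/ceph/ceph.conf:
        #   [osd.0]
        #   osd_numa_node = 0

        # Restart the daemon so that the binding takes effect
        systemctl restart ceph-osd@0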
