Load Imbalance of Cluster Instances

Scenarios

It is common that load is imbalanced between shard nodes in a cluster instance. If the shard key is incorrectly selected, no chunk is preset, and the load balancing speed between shard nodes is lower than the data insertion speed, load imbalance may occur.

This section describes how to fix load imbalance.

Fault Locating

Connect to a database from the client.

Run the following command to check the shard information:

sh.status()

mongos> sh.status() 
 \--- Sharding Status --- 
   sharding version: { 
         "_id" : 1, 
         "minCompatibleVersion" : 5, 
         "currentVersion" : 6, 
         "clusterId" : ObjectId("60f9d67ad4876dd0fe01af84") 
   } 
   shards: 
         {  "_id" : "shard_1",  "host" : "shard_1/172.16.51.249:8637,172.16.63.156:8637",  "state" : 1 } 
         {  "_id" : "shard_2",  "host" : "shard_2/172.16.12.98:8637,172.16.53.36:8637",  "state" : 1 } 
   active mongoses: 
         "4.0.3" : 2 
   autosplit: 
         Currently enabled: yes 
   balancer: 
         Currently enabled:  yes 
         Currently running:  yes 
         Collections with active migrations: 
                 test.coll started at Wed Jul 28 2021 11:40:41 GMT+0000 (UTC) 
         Failed balancer rounds in last 5 attempts:  0 
         Migration Results for the last 24 hours: 
                 300 : Success 
   databases: 
         {  "_id" : "test",  "primary" : "shard_2",  "partitioned" : true,  "version" : {  "uuid" : UUID("d612d134-a499-4428-ab21-b53e8f866f67"),  "lastMod" : 1 } } 
                 test.coll 
                         shard key: { "_id" : "hashed" } 
                         unique: false 
                         balancing: true 
                         chunks: 
                                 shard_1 20 
                                 shard_2 20

databases lists databases for which you enable enableSharding.
test.coll is the collection namespace. test indicates the name of the database where the collection is located, and coll indicates the name of the collection for which sharding is enabled.
shard key is the shard key of the previous collection. _id: indicates that the shard is hashed based on _id. _id: -1 indicates that the shard is sharded based on the range of _id.
chunks indicates the distribution of shards.

Analyze the shard information based on the query result in 2.
1. If no shard information is queried, the collections are not sharded.
  Run the following command to enable sharding:
```
mongos> sh.enableSharding("<database>") 
mongos> use admin 
mongos> db.runCommand({shardcollection:"<database>.<collection>",key:{"keyname":<value> }})
```
2. If an improper shard key is selected, the load may be imbalanced. For example, if a large number of requests are processed on a range of shards, the load on these shards is heavier than other shards, causing load imbalance.
  You can redesign the shard key, for example, changing ranged sharding to hashed sharding.
```
mongos> db.runCommand({shardcollection:"<database>.<collection>",key:{"keyname":<value> }})
```
  - If a sharding mode is determined, it cannot be changed easily. The sharding mode must be fully considered in the design phase.
3. If a large amount of data is inserted and the data volume exceeds the load capacity of a single shard, shard imbalance occurs and the storage usage of the primary shard is too high.
  You can run the following command to check the network connection of the server and check whether the amount of data transmitted by each network adapter reaches the upper limit.
```
sar -n DEV 1  //1 is the interval.
Average: IFACE  rxpck/s  txpck/s    rxkB/s    txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil 
Average:    lo  1926.94  1926.94  25573.92  25573.92    0.00    0.00     0.00    0.00 
Average:  A1-0     0.00     0.00      0.00      0.00    0.00    0.00     0.00    0.00 
Average:  A1-1     0.00     0.00      0.00      0.00    0.00    0.00     0.00    0.00 
Average:  NIC0     5.17     1.48      0.44      0.92    0.00    0.00     0.00    0.00 
Average:  NIC1     0.00     0.00      0.00      0.00    0.00    0.00     0.00    0.00 
Average:  A0-0  8173.06 92420.66  97102.22 133305.09    0.00    0.00     0.00    0.00 
Average:  A0-1 11431.37  9373.06 156950.45    494.40    0.00    0.00     0.00    0.00 
Average:  B3-0     0.00     0.00      0.00      0.00    0.00    0.00     0.00    0.00 
Average:  B3-1     0.00     0.00      0.00      0.00    0.00    0.00     0.00    0.00
```
  - rxkB/s is the number of KBs received per second.
  - txkB/s is the number of KBs sent per second.
  After the check is complete, press Ctrl+Z to exit.
  
  If the network load is too high, analyze MQL statements, optimize the roadmap, reduce bandwidth consumption, and increase specifications to expand network throughput.
  - Check whether there are sharded collections that do not carry ShardKey. In this case, requests are broadcast, which increases the bandwidth consumption.
  - Control the number of concurrent threads on the client to reduce the network bandwidth traffic.
  - If the problem persists, increase instance specifications in a timely manner. High-specification nodes can provide higher network throughput.