
Sharding

You can shard a large collection in a sharded cluster instance. Sharding distributes data across different machines to make full use of the storage space and compute capability of each shard.

Number of Shards

The following is an example using database mytable, collection mycoll, and the field name as the shard key.

  1. Log in to a sharded cluster instance using Mongo Shell.
  2. Check whether a collection has been sharded.

    use <database>
    db.<collection>.getShardDistribution()

    Example:

    use mytable
    db.mycoll.getShardDistribution()

  3. Enable sharding for the databases that belong to the cluster instance.

    • Method 1
      sh.enableSharding("<database>") 

      Example:

      sh.enableSharding("mytable")
    • Method 2
      use admin 
      db.runCommand({enablesharding:"<database>"})
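      Example (applying the same sample database mytable as in Method 1):

      use admin
      db.runCommand({enablesharding:"mytable"})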

  4. Shard a collection.

    • Method 1
      sh.shardCollection("<database>.<collection>",{"<keyname>":<value> })

      Example:

      sh.shardCollection("mytable.mycoll",{"name":"hashed"},false,{numInitialChunks:5})
    • Method 2
      use admin
      db.runCommand({shardcollection:"<database>.<collection>",key:{"<keyname>":<value>}})
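
      Example (applying the same sample namespace mytable.mycoll and the hashed shard key name as in Method 1):

      use admin
      db.runCommand({shardcollection:"mytable.mycoll",key:{"name":"hashed"}})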

    Table 1 Parameter description

    Parameter

    Description

    <database>

    Database name.

    <collection>

    Collection name.

    <keyname>

    Shard key.

    Data in the collection is distributed across shards based on the values of this field. Select a proper shard key for the collection based on your service requirements. For details, see Selecting a Shard Key.

    <value>

    The index type of the shard key field.
    • 1: ascending index
    • -1: descending index
    • hashed: hashed sharding is used. Hashed sharding provides more even data distribution across the sharded cluster.

    For details, see sh.shardCollection().

    numInitialChunks

    Optional. The minimum number of chunks to create initially when an empty collection is sharded with a hashed shard key.

  5. Check the data storage status of the database on each shard.

    sh.status()

Selecting a Shard Key

  • Background

    In a sharded cluster, the collection is the basic unit of sharding. Data in a collection is partitioned by the shard key, a field of the collection that determines how documents are distributed across shards. If you do not select a proper shard key, cluster performance may deteriorate and the execution of sharding statements may be blocked.

    Once the shard key is determined, it cannot be changed. If no existing field is suitable as the shard key, you need to select a sharding policy and migrate your data to a new collection that is sharded with a proper key.

  • Characteristics of proper shard keys
    • All inserts, updates, and deletes are evenly distributed to all shards in a cluster.
    • Shard key values are sufficiently distributed.
    • Scatter-gather queries are rare.

    If the selected shard key does not have all of the preceding characteristics, the read and write scalability of the cluster is affected. For example, if the workload of find() operations is unevenly distributed across the shards, hot shards will appear. Similarly, if your write load (inserts, updates, and deletes) is not uniformly distributed across your shards, you could end up with a hot shard. Therefore, adjust the shard key based on service requirements, such as read/write patterns, frequently queried fields, and the data being written.

    After existing data is sharded, if the filter of an update request does not contain the shard key and the request specifies upsert:true or multi:false, the update request reports an error and returns the message "An upsert on a sharded collection must contain the shard key and have the simple collation.".
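
    For example, the following is a minimal sketch of an upsert that includes the shard key, assuming the mytable.mycoll collection sharded on the name field from the previous section (the document values and the score field are hypothetical):

    use mytable
    // The filter contains the shard key "name", so the upsert can be routed to a single shard.
    db.mycoll.updateOne(
        { name: "alice" },          // hypothetical shard key value
        { $set: { score: 90 } },    // hypothetical field to update
        { upsert: true }
    )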

  • Judgment criteria
    You can use the dimensions provided in Table 2 to determine whether the selected shard keys meet your service requirements:
    Table 2 Reasonable shard keys

    Identification Criteria

    Description

    Cardinality

    Cardinality refers to the number of distinct values of the shard key, which determines how finely data can be divided into chunks. For example, if you record the student information of a school and use the age as the shard key, all students of the same age are stored in a single chunk, which may affect the performance and manageability of your clusters. A much better shard key would be the student number because it is unique. A student-number shard key has a relatively large cardinality and therefore ensures even data distribution.

    Write distribution

    If a large number of write operations are performed in the same period of time, you want your write load to be evenly distributed over the shards in the cluster. If the data distribution policy is ranged sharding, a monotonically increasing shard key will guarantee that all inserts go into a single shard.

    Read distribution

    Similarly, if a large number of read operations are performed in the same period, you want your read load to be evenly distributed over the shards in a cluster to fully utilize the computing performance of each shard.

    Targeted read

    The dds mongos query router can perform either a targeted query (query only one shard) or a scatter/gather query (query all of the shards). The only way for the dds mongos to be able to target a single shard is to have the shard key present in the query. Therefore, you need to pick a shard key that will be available for use in the common queries while the application is running. If you pick a synthetic shard key, and your application cannot use it during typical queries, all of your queries will become scatter/gather, thus limiting your ability to scale read load.
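
    For example, the following is a minimal sketch of checking whether a query is targeted, assuming the mytable.mycoll collection is sharded on the name field (the age field and the query values are hypothetical). In the explain() output, a targeted query generally lists a single shard, while a scatter/gather query lists all shards.

    use mytable
    // The query contains the shard key, so dds mongos can target a single shard.
    db.mycoll.find({ name: "alice" }).explain()
    // The query filters on a non-shard-key field, so dds mongos must query all shards.
    db.mycoll.find({ age: 20 }).explain()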

Choosing a Distribution Policy

A sharded cluster can store a collection's data on multiple shards. Data is distributed across shards based on the shard key values of the documents in the collection.

There are two data distribution policies: ranged sharding and hashed sharding. For details about how to specify a policy when sharding a collection, see step 4 in Number of Shards.

The following describes the advantages and disadvantages of the two methods.

  • Ranged sharding

    Range-based sharding divides data into contiguous ranges determined by the shard key values. If you think of the shard key as a number line stretching from negative infinity to positive infinity, each shard key value is a point on that line. The line is divided into small, separate segments, and each chunk contains the data whose shard key falls within a certain range.

    Figure 1 Distribution of data

    As shown in the preceding figure, field x is the shard key for ranged sharding. Its value range is [minKey, maxKey] and the values are integers. The value range is divided into multiple chunks, and each chunk (usually 64 MB) contains a small segment of data. For example, chunk 1 contains all documents in the range [minKey, -75]. All data of a chunk is stored on the same shard, which means each shard contains multiple chunks. In addition, the chunk information of each shard is stored on the config server, and dds mongos evenly distributes chunks based on the workload of each shard.

    Ranged sharding can easily serve range queries. For example, if you need to query documents whose shard key is in the range [-60, 20], dds mongos only needs to forward the request to chunk 2.

    However, if the shard key values are monotonically increasing or decreasing, newly inserted documents are likely to be written to the same chunk, which limits the scalability of write operations. For example, if _id is used as the shard key, the high-order bits of the automatically generated _id values increase monotonically.
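
    For illustration, the following is a minimal sketch of ranged sharding and a range query that can be routed to a subset of chunks, using a hypothetical collection mytable.orders with the field x shown in Figure 1:

    sh.shardCollection("mytable.orders", { "x": 1 })

    use mytable
    // Only the chunks whose ranges overlap [-60, 20] need to be queried.
    db.orders.find({ x: { $gte: -60, $lte: 20 } })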

  • Hashed sharding

    Hashed sharding computes the hash value (a 64-bit integer) of a single field and uses that hashed value as the basis for partitioning data across your sharded cluster. Hashed sharding provides more even data distribution across the sharded cluster because documents with similar shard key values are unlikely to be stored in the same chunk.

    Figure 2 Distribution of data

    Hashed sharding distributes documents evenly and pseudo-randomly across chunks, which fully expands the write capability and makes up for the deficiency of ranged sharding. However, a query over a range of shard key values must be distributed to all backend shards to obtain the documents that meet the conditions, resulting in low query efficiency.
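
    For illustration, the following is a minimal sketch that contrasts query routing under hashed sharding, using a hypothetical collection mytable.events sharded on the same field x:

    sh.shardCollection("mytable.events", { "x": "hashed" })

    use mytable
    // An equality match on the shard key is hashed and routed to a single shard.
    db.events.find({ x: 42 })
    // A range query on the shard key must be broadcast to all shards.
    db.events.find({ x: { $gte: -60, $lte: 20 } })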