
Enabling Sharding on a Database

You can shard a large collection in a sharded cluster instance. Sharding distributes data across different machines to make full use of the storage space and compute capability of each shard.

The operations described in this section apply to cluster instances only.

Procedure

The following example uses the database mytable and the collection mycoll, with the field name as the shard key.

  1. Log in to the sharded cluster instance using the mongo shell.
  2. Enable sharding for the databases that belong to the cluster instance.
    • Method 1
      sh.enableSharding("<database>") 
    • Method 2
      use admin 
      db.runCommand({enablesharding:"<database>"})

    Example:

    sh.enableSharding("mytable")
  3. Shard a collection.
    • Method 1
      sh.shardCollection("<database>.<collection>",{"<keyname>":<value>})
    • Method 2
      use admin
      db.runCommand({shardcollection:"<database>.<collection>",key:{"<keyname>":<value>}})
      • <database>: indicates the database name.
      • <collection>: indicates the collection name.
      • <keyname>: indicates the shard key, which is used to shard data.
      • <value>: indicates the index type used for the shard key.
        • 1: ascending index (ranged sharding)
        • -1: descending index (ranged sharding)
        • hashed: hashed sharding, which provides more even data distribution across the sharded cluster.

      For details, see sh.shardCollection().

      Example:

      sh.shardCollection("mytable.mycoll",{"name":"hashed"})
  4. Check the data storage status of the database on each shard.
    sh.status()

Selecting a Shard Key

Each shard stores collections as its basic unit, and the data in a collection is partitioned by the shard key. The shard key is a field that exists in every document of the collection and is used to distribute documents across shards. If you do not select a proper shard key, cluster performance may deteriorate and the execution of sharding statements may be blocked.

Once the shard key is determined, it cannot be changed. If no existing field is suitable as a shard key, you need to select a new sharding policy and migrate the data to a new collection that is sharded on a more suitable key.
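One quick signal of whether a field can serve as a shard key is its cardinality over a representative sample of documents: a field with few distinct values caps how many chunks the data can ever be split into. A minimal sketch in JavaScript (the field names studentNo and age and the sample data are hypothetical):

```javascript
// Count the distinct values a candidate shard key takes over a sample.
// A low count means the data can only ever be split into that many groups.
function cardinality(docs, field) {
  return new Set(docs.map((d) => d[field])).size;
}

// Hypothetical sample: 1,000 students, ages 6-17, unique student numbers.
const students = Array.from({ length: 1000 }, (_, i) => ({
  studentNo: i + 1,
  age: 6 + (i % 12),
}));

console.log(cardinality(students, "age"));       // 12: at most 12 groups
console.log(cardinality(students, "studentNo")); // 1000: fine-grained splits
```

Running such a check against production-like data before sharding is cheap compared with migrating a collection whose shard key turned out to be too coarse.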

The ideal shard key is as follows:

  • Writes (inserts, updates, and deletes) are evenly distributed across all shards in the cluster.
  • The shard key has a sufficiently large number of distinct values.
  • Scatter-gather queries are rare.

If the selected shard key lacks any of these features, the read and write scalability of the sharded cluster suffers. For example, if the workload of find() operations is unevenly distributed across the shards, hot shards are generated. Similarly, if the write load (inserts, updates, and deletes) is not uniformly distributed across the shards, you can end up with a hot shard. Therefore, adjust the shard key based on service requirements, such as read/write patterns and the data that is most frequently queried and written.
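The hot-shard effect can be simulated. The sketch below inserts 100 documents with a monotonically increasing key and counts where they land under range-based chunking versus a hashed key; the chunk boundaries and the modular "hash" are toy stand-ins, not MongoDB's real md5-based hashed index:

```javascript
// Four chunks; under ranged sharding a key belongs to the chunk whose
// boundary range contains it. The boundary values are arbitrary toy values.
const boundaries = [250, 500, 750];

function rangeChunk(key) {
  let c = 0;
  while (c < boundaries.length && key >= boundaries[c]) c++;
  return c;
}

// Toy stand-in for a hashed shard key (MongoDB really hashes with md5).
function hashedChunk(key) {
  return ((key * 7919) % 104729) % 4;
}

const hits = { range: [0, 0, 0, 0], hashed: [0, 0, 0, 0] };
for (let key = 900; key < 1000; key++) { // monotonically increasing inserts
  hits.range[rangeChunk(key)]++;
  hits.hashed[hashedChunk(key)]++;
}

console.log(hits.range);  // [0, 0, 0, 100]: every write hits the last chunk
console.log(hits.hashed); // the same writes spread over several chunks
```

The range-sharded counters show the failure mode the paragraph above describes: a monotonically increasing key funnels every insert into the chunk holding the highest range, and therefore into a single shard.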

You can use the following dimensions to determine whether the selected shard keys meet your service requirements:

  • Cardinality

    Cardinality refers to how finely data can be divided into chunks. For example, suppose you need to record the student information of a school and you choose age as the shard key because ages fall into a small set of ranges. All students of the same age are then stored in a single chunk, which can impair the performance and manageability of the cluster. A much better shard key is the student number, because it is unique: its large cardinality ensures that the data can be split and distributed evenly.

  • Write Distribution

    If a large number of write operations are performed in the same period of time, you want your write load to be evenly distributed over the shards in the cluster. If the data distribution policy is range sharding, a monotonically increasing shard key will guarantee that all inserts go into a single shard.

  • Read Distribution

    Similarly, if a large number of read operations are performed in the same period, you want your read load to be evenly distributed over the shards in your cluster to fully utilize the computing performance of each shard.

  • Read Targeting

    The mongos query router can perform either a targeted query (query only one shard) or a scatter/gather query (query all of the shards). The only way for the mongos to be able to target a single shard is to have the shard key present in the query. Therefore, you need to pick a shard key that will be available for use in the common queries while the application is running. If you pick a synthetic shard key, and your application cannot use it during typical queries, all of your queries will become scatter/gather, thus limiting your ability to scale read load.
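The targeting rule above can be stated as a tiny predicate: mongos can route a query to a single shard only when the query supplies the shard key. The sketch below uses hypothetical field names and ignores the real mongos refinement that a query on a prefix of a compound shard key can also be targeted:

```javascript
// Decide whether a query can be routed to one shard or must be broadcast.
// Simplification: every field of the shard key must appear in the query
// (real mongos can also target on a prefix of a compound shard key).
function routing(shardKeyFields, queryFields) {
  const q = new Set(queryFields);
  return shardKeyFields.every((f) => q.has(f)) ? "targeted" : "scatter-gather";
}

console.log(routing(["name"], ["name", "age"])); // "targeted"
console.log(routing(["name"], ["age"]));         // "scatter-gather"
console.log(routing(["zip", "name"], ["name"])); // "scatter-gather"
```

The third case is the trap described above: a synthetic or compound key that the application's typical queries never supply turns every read into a scatter-gather operation.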

Choosing a Distribution Policy

A sharded cluster can store a collection's data on multiple shards. You can distribute data based on the shard keys of documents in the collection.

Currently, there are two data distribution policies: ranged sharding and hashed sharding. For details about specifying a policy when sharding a collection, see step 3 in Procedure.

The following describes the advantages and disadvantages of the two methods.

  • Ranged Sharding

    Ranged sharding divides data into contiguous ranges determined by the shard key values. If you imagine the shard key as a number line stretching from negative infinity to positive infinity, each shard key value is a point on that line, and each chunk is a small, separate segment of the line that contains the data whose shard key falls within a certain range.

    Figure 1 Distribution of data

    As shown in the preceding figure, field x is the shard key for ranged sharding. Its value range is [minKey, maxKey] and its values are integers. The value range is divided into multiple chunks, and each chunk (64 MB by default) contains a small segment of the data. For example, chunk 1 contains all documents whose shard key is in the range [minKey, -75]. All data of a chunk is stored on the same shard, so each shard contains multiple chunks. In addition, the metadata describing each chunk is stored on the config server, and chunks are distributed evenly among the shards based on the workload of each shard.

    Ranged sharding easily serves range queries. For example, to query documents whose shard key is in the range [-60,20], mongos only needs to forward the request to chunk 2.

    However, if the shard key values of newly inserted documents are monotonically increasing or decreasing, the new documents are likely to fall into the same chunk, limiting the cluster's write scalability. For example, if _id is used as the shard key, the high-order bits of automatically generated _id values are ascending.

  • Hashed Sharding

    Hashed sharding computes the hash value (a 64-bit integer) of a single field and uses it as the index value; this value is used as the shard key to partition data across the sharded cluster. Hashed sharding provides more even data distribution because documents with similar shard key values are unlikely to be stored in the same chunk.

    Figure 2 Distribution of data

    Hashed sharding distributes documents evenly at random across the chunks, which maximizes write scalability and makes up for the deficiency of ranged sharding. However, a query over a range of shard key values must be broadcast to all shards to collect the matching documents, resulting in low query efficiency.
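The trade-off between the two policies for range queries can be sketched as follows. The chunk boundaries (-75, 25, 175) are assumed from the Figure 1 example and are purely illustrative; under ranged sharding a range query touches only the contiguous chunks that overlap it, whereas under hashed sharding it would have to visit every chunk:

```javascript
// Upper bounds (inclusive) of chunks 1..3; chunk 4 runs up to maxKey.
// The boundary values are illustrative, matching the Figure 1 example.
const upper = [-75, 25, 175];

function chunkOf(key) {
  let i = 0;
  while (i < upper.length && key > upper[i]) i++;
  return i + 1; // chunks are numbered from 1
}

// Under ranged sharding, a range query [lo, hi] visits a contiguous run
// of chunks; under hashed sharding it would have to visit all of them.
function chunksForRange(lo, hi) {
  const first = chunkOf(lo), last = chunkOf(hi);
  return Array.from({ length: last - first + 1 }, (_, i) => first + i);
}

console.log(chunksForRange(-60, 20)); // [2]: mongos targets chunk 2 only
console.log(chunksForRange(0, 200));  // [2, 3, 4]: wider range, more chunks
```

This is why the policy choice should follow the query pattern: workloads dominated by range scans favor ranged sharding, while insert-heavy workloads with point lookups favor hashed sharding.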