Suggestions on Sharding of Collections

Scenario

As an MRS cluster administrator, you can adjust the number of Solr instances and the number of shards in a collection to improve Solr indexing performance.

Procedure

Adjust the number of Solr instances.
If there are eight data nodes, you are advised to plan them as follows:

When the collection data is stored on the local disk:
- Two nodes host the SolrServerAdmin role. SolrServerN (N ranges from 1 to 5) is also deployed on each of these nodes, so the two nodes run 12 instances in total.
- On each of the other six nodes, SolrServerN (N ranges from 1 to 5) is deployed, giving 30 instances in total.
- A total of 42 instances are involved.
When the index data is stored on HDFS:
- Two nodes host the SolrServerAdmin role. SolrServerN (N is 1 or 2) is also deployed on each of these nodes, so the two nodes run six instances in total.
- On each of the remaining six nodes, SolrServerN (N is 1, 2, or 3) is deployed, giving 18 instances in total.
- A total of 24 instances are involved.
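
The instance totals above follow from simple arithmetic. The sketch below reproduces them, assuming that each SolrServerAdmin node runs one SolrServerAdmin instance in addition to its SolrServerN instances. The function name and its parameters are illustrative only and are not part of MRS.

```python
def count_instances(admin_nodes, other_nodes, servers_per_admin_node, servers_per_other_node):
    """Return (instances on admin nodes, instances on other nodes, total)."""
    # Assumption: each admin node runs one SolrServerAdmin plus its SolrServerN instances.
    on_admin = admin_nodes * (1 + servers_per_admin_node)
    on_other = other_nodes * servers_per_other_node
    return on_admin, on_other, on_admin + on_other


# Eight data nodes: two with SolrServerAdmin, six without.
local_disk = count_instances(2, 6, servers_per_admin_node=5, servers_per_other_node=5)
hdfs = count_instances(2, 6, servers_per_admin_node=2, servers_per_other_node=3)

print("local disk:", local_disk)  # (12, 30, 42)
print("HDFS:", hdfs)              # (6, 18, 24)
```
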
Set the maximum memory that each instance can occupy to 4 GB. Provided that the nodes have sufficient memory, this improves indexing performance.

To adjust the memory usage, perform the following steps:
1. Log in to Manager.
2. Choose Cluster > Name of the desired cluster > Service > Solr > Configuration.
3. Adjust the SOLR_GC_OPTS parameter of each instance. The value -Xms2G -Xmx4G sets the initial JVM heap size to 2 GB and the maximum heap size to 4 GB. Change it to -Xms4G -Xmx4G so that the heap is fixed at 4 GB.
4. Click Save. In the dialog box that is displayed, click OK to restart the service for the configuration to take effect.
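
Before raising the heap size, it can help to estimate how much heap memory all instances on one node may claim. The following sketch parses the -Xms and -Xmx flags from an option string and multiplies the maximum heap by the number of instances on a node. The helper names are hypothetical, and the result covers the JVM heap only, not the total memory used by each process.

```python
import re

# Conversion factors from JVM size suffixes to GB.
_UNIT_TO_GB = {"k": 1 / (1024 * 1024), "m": 1 / 1024, "g": 1.0}


def heap_bounds_gb(opts):
    """Extract (initial, maximum) heap size in GB from a JVM option string."""
    found = {}
    for flag, size, unit in re.findall(r"-(Xms|Xmx)(\d+)([kKmMgG])", opts):
        found[flag] = int(size) * _UNIT_TO_GB[unit.lower()]
    return found.get("Xms"), found.get("Xmx")


def node_heap_gb(opts, instances_on_node):
    """Upper bound on the heap claimed by all instances on one node."""
    _, max_heap = heap_bounds_gb(opts)
    if max_heap is None:
        raise ValueError("no -Xmx flag found in options")
    return max_heap * instances_on_node


print(heap_bounds_gb("-Xms2G -Xmx4G"))    # (2.0, 4.0)
print(node_heap_gb("-Xms4G -Xmx4G", 6))   # 24.0
```

In the local-disk plan, a node that hosts SolrServerAdmin and SolrServer1 to SolrServer5 runs six instances, so a 4 GB heap per instance can claim up to 24 GB of heap on that node.
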
Adjust the number of shards of a Solr collection.
When creating a full-text search collection for HDFS and HBase data (for details, see Common Service Operations About Solr), you are advised to use the following scheme:

Each instance hosts one shard replica, and each shard has two replicas (that is, the replication factor is 2). With a replication factor of 2 and 42 instances, create a collection that contains 21 shards. The two replicas of each shard are placed on instances on two different nodes, so each of the 42 instances holds exactly one replica.

Based on the preceding scheme, each collection has 42 shard replicas, one per instance (42 instances are deployed in this example). A larger number of shards generally benefits indexing performance. However, the additional interaction between instances consumes more system resources.

When the index data is stored on HDFS, the preceding rule also applies. That is, each collection has 24 shard replicas, one per instance (24 instances are deployed in that plan), which corresponds to 12 shards at a replication factor of 2.
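
The one-replica-per-instance rule can be sketched as a small calculation: the shard count is the number of instances divided by the replication factor. The example also builds a request URL for the standard Apache Solr Collections API (action=CREATE with the name, numShards, and replicationFactor parameters). The host, port, and collection name are placeholders. Whether your MRS cluster accepts this request directly, and which authentication it requires, is an assumption to verify against Common Service Operations About Solr. The code only builds the URL and sends nothing.

```python
from urllib.parse import urlencode


def plan_shards(instances, replication_factor=2):
    """Shard count that places exactly one replica on every instance."""
    if instances % replication_factor:
        raise ValueError("instances must be a multiple of the replication factor")
    return instances // replication_factor


def create_collection_url(base_url, name, instances, replication_factor=2):
    """Build (but do not send) a Collections API CREATE request URL."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": plan_shards(instances, replication_factor),
        "replicationFactor": replication_factor,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


print(plan_shards(42))  # 21 shards for the 42-instance plan
print(plan_shards(24))  # 12 shards for the 24-instance plan

# Placeholder host and collection name, for illustration only.
print(create_collection_url("http://solr.example.com:8983/solr", "demo_collection", 42))
```
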