
Optimization Suggestions on Solr over HDFS

Scenario

This section describes how an MRS cluster administrator can optimize the environment configuration when using Solr over HDFS.

Prerequisites

  • The HDFS, Solr, and Yarn services have been installed, and INDEX_STORED_ON_HDFS of the Solr service is set to TRUE on Manager.
  • Preparations for using Solr over HDFS have been completed. For details, see Solr over HDFS.

Procedure

When Solr over HDFS is used, you can optimize the configuration from the following aspects:

  • Disk and network planning
    • During disk partitioning, allocate a dedicated disk or disk partition to ZooKeeper. If ZooKeeper and HDFS share the same disk, frequent data access and config set access will cause ZooKeeper to stop responding when the data volume is large.
    • Mount multiple single disks or multiple RAID 0 groups to multiple HDFS data directories. Otherwise, disk I/O becomes a bottleneck when MapReduce tasks process a large amount of data.
    • Generally, NIC bonding (bond0) is used for networking. Set the networking mode based on site requirements; a quick bond status check is shown below.
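
    For example, after the network is configured, you can check the bonding status on each node as follows (assuming the bond interface is named bond0):

      cat /proc/net/bonding/bond0
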
  • Operating system optimization

    When Solr reads data from HDFS, a large number of temporary (ephemeral) ports are used. If the output of the netstat -anp | grep DataNode Port | wc -l command is greater than 4096, a large number of connections remain in the TIME_WAIT state and the task fails. To avoid this problem, perform the following operations:

    Log in to Manager, choose Cluster > Name of the desired cluster > Service > HDFS > Configuration, and check the value of dfs.datanode.port, which is the DataNode Port used in the preceding command.

    1. Log in to each DataNode as user root and run the following commands:

      vi /etc/sysctl.conf

      Add the following information:

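      # These settings enable SYN cookies, allow sockets in the TIME_WAIT state to be reused, and shorten the FIN timeout so that temporary ports are released faster.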
      net.ipv4.tcp_syncookies = 1
      net.ipv4.tcp_tw_reuse = 1
      net.ipv4.tcp_tw_recycle = 1
      net.ipv4.tcp_fin_timeout = 30
      net.ipv4.tcp_timestamps = 1
    2. Save the modification and exit. Run the sysctl -p command to load the configuration file.
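
    For example (an optional verification), you can run the following commands to confirm that the kernel parameters have taken effect and to recheck the number of connections to the DataNode port:

      sysctl net.ipv4.tcp_tw_reuse net.ipv4.tcp_fin_timeout

      netstat -anp | grep DataNode Port | wc -l
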
  • Solr instance deployment

    When collections are stored on HDFS, ensure that Solr instances are deployed on the same nodes as the DataNodes.

    When Solr reads collection files from HDFS through HDFSDirectoryFactory, it acts as an HDFS client, so optimize the HDFS client configuration on the Solr side first. If Solr and other components need to be deployed on the same node, you are advised to deploy only one Solr instance on each node.

  • Optimizing the performance of MapReduce tasks submitted by Solr

    For details, see Optimizing Node Configuration.

    Set the configuration parameters of the Yarn service based on the following suggestions, and then restart the Yarn service:

    • mapreduce.task.timeout: 1800000 (The default value is 60000. You can increase the value when a large amount of data needs to be processed.)
    • yarn.nodemanager.resource.cpu-vcores: 24 (The default value is 8. You can set this value to 1 to 2 times the total number of vCPUs on the current node when a large amount of data needs to be processed.)
    • yarn.nodemanager.resource.memory-mb: Set this parameter for each NodeManager. Check the memory usage of each node on the host page, and set the value to the idle memory minus 8 GB, or to 75% of the total memory, as illustrated in the example after this list.
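
    For example (figures for illustration only), on a node with 128 GB of total memory and about 100 GB of idle memory, set yarn.nodemanager.resource.memory-mb to about 100 GB minus 8 GB, that is, 92 GB (94208 MB); the alternative rule of 75% of the total memory gives 96 GB (98304 MB).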
  • Adjusting the HDFS cache to improve collection performance

    Solr allocates 10% to 20% of the available system memory to its HDFS-related cache.

    For example, when HDFS and Solr run on a host with 128 GB of memory, 12.8 GB to 25.6 GB of memory is used as the HDFS cache. As the collection size increases, adjust this cache to maintain optimal performance. The cache size is configured by modifying the solrconfig.xml file.

    Adjust the allocated cache size as follows:

    1. Log in to the node where the Solr client is installed as user root, go to the Solr client installation directory, and run the following commands:

      source bigdata_env

      kinit solr

    2. Run the following commands to obtain the config set and open the solrconfig.xml file:

      solrctl confset --get confWithSchema /home/solr/

      vi /home/solr/conf/solrconfig.xml

    3. Change the value of solr.hdfs.blockcache.slab.count.

      The size of each slab is 128 MB, so 18 slabs occupy about 2.3 GB of memory. This configuration applies to each shard; if a host runs six shards, about 13.8 GB of memory is occupied in total.

      In this example, the parameter value can be changed as follows:

      <int name="solr.hdfs.blockcache.slab.count">${solr.hdfs.blockcache.slab.count:18}</int>
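
      For example (an illustrative calculation only), to allocate about 12.8 GB, that is, 10% of a 128 GB host, across six shards, each shard can use roughly 2.1 GB, which corresponds to about 17 slabs (17 x 128 MB is roughly 2.1 GB):

      <int name="solr.hdfs.blockcache.slab.count">${solr.hdfs.blockcache.slab.count:17}</int>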
    4. Run the following command to upload the modified config set:

      solrctl confset --update confWithSchema /home/solr/