Updated on 2024-11-29 GMT+08:00

Solr over HBase

Scenario

HBaseIndexer provides batch indexing, incremental indexing, and real-time indexing. For batch and incremental indexing, MapReduce jobs read data from HBase and create indexes in Solr. The difference lies in how data is read: batch indexing is based on the scan mode and scans the entire table on every run, whereas incremental indexing is based on a rowkey list. Incremental indexing requires Loader (or BulkLoader), which saves the rowkey list to a specified HDFS directory; only new or updated data is then scanned, making incremental indexing more efficient than batch indexing. Real-time indexing is based on the HBase replication mechanism: indexes are created in Solr as data is imported into HBase, but its efficiency is lower than that of batch and incremental indexing.

If HBase already contains a large amount of data, only batch indexing and incremental indexing can be used to synchronize the data to Solr.

For details about the related commands, see Shell Client Operation Commands.

Permission management is required in security mode. For details, see Solr User Permission Configuration and Management.

  • The major function of HBaseIndexer is to create indexes for data stored in HBase tables. HBase serves as the storage end for raw data, and Solr serves as the storage end for index data. Therefore, storing the raw values of the indexed HBase columns in Solr is not recommended. If the raw values of these columns must be stored in Solr, insert all column values corresponding to one row key into the HBase table in a single operation; otherwise, some column values may not be stored in Solr.
  • For real-time indexing, the unified search feature is recommended. For details, see HBase Full-Text Index.
  • For hbase-indexer operations, the default Solr user has all management permissions, and other users have only the read permission (the list-indexers operation).

Prerequisites

  • The HDFS, Solr, Yarn, and HBase services have been installed. INDEX_STORED_ON_HDFS is set to TRUE and SOLR_INDEX_LOCAL_STORAGE_DIR is left blank on Manager.
  • The value of hbase.rpc.protection in the Solr client configuration file hbase-site.xml must be the same as the value of hbase.rpc.protection on the HBase server; otherwise, the hbase-indexer task will fail. If the values are inconsistent, download the client again or manually change the client value to match the server value.
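The consistency check in the second prerequisite can be scripted. The sketch below shows one way to extract hbase.rpc.protection from an hbase-site.xml so that the client and server values can be compared; the file is created inline here as a sample, and on a real client node the path would typically be the client's own hbase-site.xml (an assumption, not a value taken from a live cluster):

```shell
# Sketch: pull hbase.rpc.protection out of an hbase-site.xml.
# An inline sample file stands in for the real client copy
# (e.g. /opt/client/Solr/hbase-indexer/conf/hbase-site.xml, assumed path).
cat > /tmp/hbase-site-sample.xml <<'EOF'
<configuration>
  <property>
    <name>hbase.rpc.protection</name>
    <value>privacy</value>
  </property>
</configuration>
EOF

# Print the value that follows the hbase.rpc.protection property name.
grep -A1 '<name>hbase.rpc.protection</name>' /tmp/hbase-site-sample.xml \
  | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
```

Running the same extraction against both the client and server files and comparing the two values makes the mismatch easy to spot before any hbase-indexer task is started.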

Procedure

  1. Log in to the node where the Solr client is installed as user root.
  2. Download the Solr, HDFS, Yarn, and HBase clients and install them in a specified directory, for example, /opt/client.
  3. Run the following command to go to the client installation directory:

    cd /opt/client

  4. Run the following command to configure environment variables:

    source bigdata_env

  5. Check whether multiple Solr services are installed.

    • If yes, when you use the client to connect to a specific Solr service, run a command to load environment variables of the service. For example, run the source Solr-1/component_env command to load Solr-1 service variables.
    • If no, skip this step.

  6. If the cluster is in security mode, authenticate the user. For a normal cluster, user authentication is not required.

    kinit solr

  7. Create the configuration file required by HBaseIndexer. Go to the Solr/hbase-indexer/conf directory in the client installation directory and run the vi user.xml command to create the user.xml file:

    <?xml version="1.0"?>
    <indexer table="indexdemo" mapping-type="row" read-row="never">
      <field name="firstname_s" value="info:firstname"/>
      <field name="lastname_s" value="info:lastname"/>
      <field name="age_i" value="info:age"/>
      <param name="zookeeper.znode.parent" value="/hbase"/>
    </indexer>
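The field names in user.xml end in the suffixes _s and _i. With a typical Solr config set such as confWithSchema, suffixes like these are usually resolved through dynamic field rules in the managed-schema file, along the lines of the sketch below; the exact definitions and field types depend on the config set actually used:

```xml
<!-- Sketch of dynamic field rules that could map the *_s and *_i
     suffixes used in user.xml; the real managed-schema of the
     config set may define them differently. -->
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_i" type="pint"   indexed="true" stored="true"/>
```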

  8. Create an HBase table.

    $ hbase shell
    hbase> create 'indexdemo', {NAME => 'info', REPLICATION_SCOPE => '1'}

    The table property REPLICATION_SCOPE => '1' applies only to real-time indexing and does not affect batch or incremental indexing; it must be configured for real-time indexing. The columns to be indexed must belong to a column family. Indexes cannot be created for independent columns.

  9. Add data to the HBase table based on the fields defined in the configuration file.

    $ hbase shell
    hbase> put 'indexdemo','zhangsan','info:firstname','zhang'
    hbase> put 'indexdemo','zhangsan','info:lastname','san'
    hbase> put 'indexdemo','zhangsan','info:age','26'
    hbase> quit

  10. Run the following command to create a Solr collection:

    Example:

    solrctl collection --create coll-indexdemo -c confWithSchema -s 3 -r 1

    For details, see Shell Client Operation Commands.

  11. Create an indexer.

    For example, run the following command to create an indexer named userindexer:

    hbase-indexer add-indexer -n userindexer -c /opt/client/Solr/hbase-indexer/conf/user.xml -cp solr.zk=192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181/solr -cp solr.collection=coll-indexdemo -cp solr.http.socket.timeout=120000

    Replace coll-indexdemo with the actual name of the target collection. The HTTP request timeout defaults to 120000 ms and can be changed through the solr.http.socket.timeout parameter.

    You can run the following command to view the created indexer:

    hbase-indexer list-indexers

    An indexer must be created before batch indexing, incremental indexing, or real-time indexing is executed.

  12. Start an index task.

    • When the HBase multi-instance function is enabled, the --hbase-indexer-file parameter must be added to the batch indexing and incremental indexing commands, for example, --hbase-indexer-file /opt/client/Solr/hbase-indexer/conf/user.xml. The param element must exist in the configuration file (for the format, see step 7); it specifies the HBase instance to which the table belongs. If the parameter is not added, the default HBase instance of HBaseIndexer is used to execute the indexing task. When the HBase multi-instance function is disabled, this parameter is not required.
    • In multi-instance mode, ensure that the Solr service and the associated HBase service share the same ZooKeeper before starting the indexing task. Otherwise, the error message "Failed to get cluster ID" may be displayed.
    • When the data volume of an indexing task is large, SocketTimeoutException may occur. In this case, set the --http-socket-timeout parameter (in milliseconds) when starting the task, for example, --http-socket-timeout 600000. If this parameter is not specified, the default value 120000 is used.
    1. Batch indexing

      If a large amount of data exists in HBase, run the following command to create a collection on Solr:

      hbase-indexer batch-indexer --hbase-indexer-zk 192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181 --hbase-indexer-name userindexer --output-dir hdfs://hacluster/tmp/solr --go-live --overwrite-output-dir -v --reducers 3 --zk-host 192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181/solr

      Do not run the index task repeatedly while batch indexing is in progress. Batch indexing merges index files to improve commit efficiency, and in this mode the unique ID of each data record is not parsed individually. Therefore, if duplicate data is imported, duplicate index entries are created and the same index is stored two or more times at the bottom layer. The inflated index count is visible on Solr, but because the duplicate documents share the same ID, a query for an ID still finds a corresponding index and returns a result. For details about data conflicts caused by duplicate indexes, see https://wiki.apache.org/solr/MergingSolrIndexes.

      If a batch indexing task fails, delete the collection used for batch indexing, create a new collection, and import the data again in batches.

      Note: Batch indexing does not support collections that use the implicit routing mode.

    2. Incremental indexing

      Use Loader to incrementally import data to HBase, with the storage type set to HBASE_BULKLOAD. The row keys are stored in HDFS as files: set the Loader configuration item record.hbase.rowkey to true so that the row keys are recorded, and check hbase.rowkey.output.path for their location. For details about how to use Loader, see Using Loader. Then run the following command:

      hbase-indexer batch-indexer --hbase-indexer-zk 192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181 --hbase-indexer-name userindexer --output-dir hdfs://hacluster/tmp/solr --go-live --overwrite-output-dir -v --reducers 1 --zk-host 192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181/solr --rowkey-dir hdfs://hacluster/user/loader/hbase/rowkey/output/HBase_table/job_xxx

      • hdfs://hacluster/user/loader/hbase/rowkey/output indicates the storage path of the row keys. To view it, run the HDFS command hadoop fs -ls /user/loader/hbase/rowkey/output. For details about HDFS commands, see the sections in Using HDFS.
      • HBase_table indicates the HBase table name. If a user-defined namespace is used in HBase (for example, namespace:HBase_table), Loader replaces the colon (:) with a number sign (#) in the generated HDFS directory, because HDFS does not support directory names containing colons. In this case, the row key storage path is hdfs://hacluster/user/loader/hbase/rowkey/output/namespace#HBase_table/job_xxx.
      • job_xxx indicates the name of the latest indexing job executed for the table. You can view the name on the Yarn web UI.
    3. Real-time indexing

      Ensure that REPLICATION_SCOPE of the column family is set to 1 and that data is submitted to HBase in putlist mode (the data import method is not otherwise restricted; you can use a tool or write an HBase client program to import the data). In addition, ensure that the indexed attribute of the fields to be indexed is set to true in the managed-schema file of the collection config set. Adjust <autoSoftCommit> in the solrconfig.xml file of the collection config set so that data becomes searchable soon after its indexes are created in Solr; modify the commit time as required, noting that too small a value reduces indexing efficiency. After the preceding configuration is complete, import data to HBase and check whether the corresponding indexes are created on the Solr page. You can also run the hbase-indexer replication-status command to view the HBase replication progress. For details, see Shell Client Operation Commands.
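The <autoSoftCommit> element mentioned above controls how quickly newly indexed documents become searchable. A minimal sketch of the element as it might appear in solrconfig.xml is shown below; the 15-second value is only an illustration, not a recommendation from this guide, and should be tuned against the indexing-efficiency trade-off described above:

```xml
<!-- solrconfig.xml: soft-commit new documents within ~15 s so they
     become searchable (illustrative value; smaller values make data
     visible sooner but reduce indexing efficiency). -->
<autoSoftCommit>
  <maxTime>15000</maxTime>
</autoSoftCommit>
```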

  13. Choose Cluster > Services > Solr and confirm that all Solr instances are working properly. On the Solr web UI, click any SolrServerAdmin link. On the Solr Admin page that is displayed, click Collection Selector and select coll-indexdemo. Run a query to view the data indexed from the HBase table into the Solr collection.
  14. Delete indexers.

    When HBaseIndexer is no longer used, delete the indexers in a timely manner. If you need to clear the data in HBase and Solr, delete the indexers first and then perform the related operations in HBase and Solr.