
Solr over HDFS

Scenario

Solr supports batch importing data stored in HDFS to create indexes for a collection. You can run a shell command to start a batch import task.

Currently, the supported data formats include CSV, XML, EML, HTML, TXT, DOC, PDF, XLS, XLSX, JPG, PNG, and TIF.

Prerequisites

  • The HDFS, Solr, and Yarn services have been installed. INDEX_STORED_ON_HDFS is set to TRUE and SOLR_INDEX_LOCAL_STORAGE_DIR is left blank on Manager.
  • Data for Solr over HDFS must be stored in HDFS. Data on local disks is not supported.
  • If the cluster is in common mode, disable HDFS permission verification to avoid errors caused by insufficient permissions. For MRS 3.x or later: log in to Manager, choose Cluster > Name of the desired cluster > Services > HDFS > Configurations > All Configurations, and search for the dfs.namenode.acls.enabled and dfs.permissions.enabled parameters.
    • dfs.namenode.acls.enabled specifies whether to enable the HDFS ACL. The default value is true, indicating that the ACL is enabled. Change the value to false.
    • dfs.permissions.enabled specifies whether to enable permission checks for HDFS. The default value is true, indicating that permission checks are enabled. Change the value to false. After the modification, the owner, owner group, and permissions of directories and files in HDFS remain unchanged.

    Click Save, and then click OK. HDFS permission verification is disabled after the system displays a message indicating that the operation is complete.
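
    To confirm the values that a client actually sees, you can query them with the standard hdfs getconf command. This is only a quick check and assumes the client configuration has been refreshed after the change; the parameter names are the ones configured above.

    hdfs getconf -confKey dfs.permissions.enabled
    hdfs getconf -confKey dfs.namenode.acls.enabled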

Initial Preparations

  1. Ensure that the Solr, HDFS, and Yarn clients have been installed in a directory, for example, /opt/client.
  2. Copy the XML configuration files. Switch to the Yarn client installation directory and copy the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml files in the /opt/client/Yarn/config/ directory to the /opt/client/Solr/hdfs-indexer/conf directory on the Solr client. These XML files are mandatory for accessing HDFS and MapReduce. A command sketch that combines steps 2 to 4 is provided after this list.
  3. Access the client installation directory and run the source bigdata_env command to import environment variables.
  4. Run the following command to verify MapReduceIndexerTool:

    hdfs-indexer --help

    If the following information is displayed, the dependency package is configured correctly and MapReduceIndexerTool can be used properly:

    FI02XB:/opt/client # hdfs-indexer --config /opt/client/Solr/hdfs-indexer/conf  jar /opt/client/Solr/lib/solr-map-reduce-*.jar -libjars "$SOLR_HADOOP_LIBJAR"  --  help 
     WARNING: Use "yarn jar" to launch YARN applications. 
     SLF4J: Class path contains multiple SLF4J bindings. 
     SLF4J: Found binding in [jar:file:/opt/client/HDFS/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] 
     SLF4J: Found binding in [jar:file:/opt/client/Solr/hbase-indexer/lib/slf4j-log4j12-1.7.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]   
     SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 
     SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 
     usage: hadoop [GenericOptions]... jar solr-map-reduce-*.jar  
            [--help] --output-dir HDFS_URI [--input-list URI] --morphline-file FILE [--morphline-id STRING] 
            [--update-conflict-resolver FQCN] [--mappers INTEGER] [--reducers INTEGER] [--max-segments INTEGER] 
            [--fair-scheduler-pool STRING] [--dry-run] [--log4j FILE] [--verbose] [--show-non-solr-cloud] 
            [--zk-host STRING] [--go-live] [--collection STRING] [--go-live-threads INTEGER] [HDFS_URI [HDFS_URI ...]] 
      
     MapReduce batch job driver that takes a morphline and  creates  a  set  of  Solr index shards from a set of input 
     files and writes the indexes into  HDFS,  in  a  flexible,  scalable  and fault-tolerant manner. It also supports 
     merging the output shards into a set of  live  customer  facing  Solr servers, typically a SolrCloud. The program 
     proceeds in several consecutive MapReduce based phases, as follows:     
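
    The preparation steps 2 to 4 can be run together as follows. This is only a sketch that assumes the clients are installed in /opt/client; adjust the paths to match your environment.

    cd /opt/client
    # Copy the XML configuration files required for accessing HDFS and MapReduce
    cp Yarn/config/core-site.xml Yarn/config/hdfs-site.xml \
       Yarn/config/mapred-site.xml Yarn/config/yarn-site.xml \
       Solr/hdfs-indexer/conf/
    # Import the environment variables
    source bigdata_env
    # Verify that MapReduceIndexerTool can be invoked
    hdfs-indexer --help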

Procedure

According to the format of the data stored in HDFS, configure a corresponding morphlines configuration file (for example, ReadCSVContainer.conf). A CSV file is used as an example in this section. For details about the configuration files, see http://kitesdk.org/docs/1.1.0/morphlines/morphlines-reference-guide.html and https://github.com/kite-sdk/kite/tree/master/kite-morphlines/kite-morphlines-core/src/test/resources.

  1. Run the following command to go to the client installation directory:

    cd /opt/client

  2. Run the following command to configure environment variables:

    source bigdata_env

  3. Check whether multiple Solr services are installed.

    • If yes, when you use the client to connect to a specific Solr service, run a command to load environment variables of the service. For example, run the source Solr-1/component_env command to load Solr-1 service variables.
    • If no, skip this step.

  4. If the cluster is in security mode, authenticate the user. For a normal cluster, user authentication is not required.

    kinit solr

  5. Run the following command to download the default config set of Solr to the local PC:

    solrctl confset --get confWithHDFS /opt/client/Solr/hdfs-indexer/config-csv/

  6. Go to the /opt/client/Solr/hdfs-indexer/config-csv/ directory where the default config set of Solr was downloaded, and run vi conf/managed-schema to modify the managed-schema configuration file based on the fields required by the actual data.

    For example, if you want to create an index for fieldA to fieldG, modify the managed-schema file as follows:

    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
          <field name="timestamp" type="tlong" indexed="true" stored="true" multiValued="false" docValues="true" /> 
          <field name="fieldA" type="text_ngram" indexed="true" stored="true" multiValued="false" /> 
              <field name="fieldB" type="text_ngram" indexed="true" stored="true" multiValued="false" /> 
              <field name="fieldC" type="text_ngram" indexed="true" stored="true" multiValued="false" /> 
              <field name="fieldD" type="text_ngram" indexed="true" stored="true" multiValued="false" /> 
               <field name="fieldE" type="text_ngram" indexed="true" stored="true" multiValued="false" /> 
              <field name="fieldF" type="text_ngram" indexed="true" stored="true" multiValued="false" /> 
           <field name="fieldG" type="text_ngram" indexed="true" stored="true" multiValued="false" /> 
        <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100"> 
           <analyzer type="index"> 
             <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/> 
           </analyzer> 
           <analyzer type="query"> 
             <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/> 
           </analyzer> 
           </fieldType>

    An example of the managed-schema file is provided in the Solr/hdfs-indexer/m-conf/conf-model-Forcsv directory of the client installation directory. You can refer to it when modifying the file.

  7. After the configuration is modified, upload the configuration set to ZooKeeper.

    solrctl confset --create conf_test_100 /opt/client/Solr/hdfs-indexer/config-csv/

  8. Create a collection using a specified config set.

    solrctl collection --create collection_hdfs_100 -s 3 -c conf_test_100 -r 1 -m 10 -S true

  9. Upload the data for which indexes need to be created to the specified directory in HDFS. For details, see the section Using HDFS. For example:

    hdfs dfs -put /opt/client/Solr/hdfs-indexer/m-conf/conf-model-Forcsv/testfile-csv.txt hdfs://hacluster/user/solr/testfile-csv.txt

    An example CSV file is provided in the Solr/hdfs-indexer/m-conf/conf-model-Forcsv directory of the client installation directory.
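
    To confirm that the file has been uploaded, you can list the target directory with the standard HDFS command (the path below is the example path used above; replace it with your own):

    hdfs dfs -ls hdfs://hacluster/user/solr/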

  10. Change the values of collection and zkHost in the /opt/client/Solr/hdfs-indexer/m-conf/conf-model-Forcsv/ReadCSVContainer.conf configuration file to the actual values. The file format is as follows:

    SOLR_LOCATOR : { 
        collection: collection_hdfs_100  // Change it to the actual collection name.
        zkHost : "192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181/solr"  // Change it to the actual zkHost value of the cluster. You can view the value on the Solr Dashboard.
        batchSize : 10000 
    }
    
    morphlines : [ 
       { 
         id : pretest 
         importCommands : ["org.kitesdk.**", "org.apache.solr.**"] 
         commands : [ 
           { 
             readCSV { 
               separator : ";" 
               columns : [timestamp,fieldA,fieldB,fieldC,fieldD,fieldE,fieldF,fieldG] 
               quoteChar : "\"" 
               charset : UTF-8 
             } 
           } 
           { 
             generateUUID { 
               field : id 
             } 
           } 
           { 
             sanitizeUnknownSolrFields { 
               solrLocator : ${SOLR_LOCATOR} 
             } 
           } 
           { logDebug { format : "output record: {}", args : ["@{}"] } } 
           { 
             loadSolr {          
               solrLocator : ${SOLR_LOCATOR} 
             } 
           } 
         ] 
       } 
     ]     
    Table 1 Description of the ReadCSVContainer.conf configuration file

    Parameters in SOLR_LOCATOR:

    • collection: Name of the collection created in Solr
    • zkHost: ZooKeeper connection information
    • batchSize: Number of data entries submitted to Solr in each batch

    Parameters in morphlines:

    • id: User-defined identifier of the morphline
    • importCommands: Referenced classes
    • separator: Column separator
    • columns: Collection fields that the columns of the data map to
    • quoteChar: Quote character used to enclose field content
    • charset: Encoding format
    • generateUUID: Generates the value of the uniqueKey field of the collection
    • logDebug: Format of the debug log output

    The following is a CSV file sample:

    1120420594561;SnVIeDqGwtAptrVzmbfQLCmde2BsDFISCG4;PwWbFHlTkeEUduwElwWZap01g;75U7OAQyy2JPMNmXWwe9SpGw;aphcCkOqA3iwRAu0QMrrEf6rLcdMqjL;s9C5R1pmHofdN1XOAr9NJ;oGO3z9A6sNSusRlxXfuaUDCH7P5cptIIo;sKe84RHByT3vYww73tY04szq91DKlh3s1o9
    1107591105847;HPp2JjFCFQ6BRxWFM17Iqye8Hl;xHxV249tLJ5Pl3HCcbYAPxe9RWh;qJb8esxvlAM06TZI5egmxECfMJD;Bp5OBAPT8GZufPOhuHY6LVb;uUKRJG7EGBh8SrvYzmXjjWdf97Thk2CPRAkORd;OVWhxibc47RblUx9vj6VkZAsz18kuBnNYIoPnlN;TjVBIv6FqRVKzeuwYmInxeFiM

    The example contains two records. In each record, semicolons (;) are used to separate the fields that correspond to the timestamp, fieldA, fieldB, fieldC, fieldD, fieldE, fieldF, and fieldG.
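
    Before starting the index task, you can quickly confirm that every line of the data file contains the expected eight semicolon-separated fields. This is only an optional local check using awk; the file path is that of the sample file shown above.

    awk -F';' '{ print NR": "NF" fields" }' /opt/client/Solr/hdfs-indexer/m-conf/conf-model-Forcsv/testfile-csv.txt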

  11. Run the following command to start an index creation task:

    nohup hdfs-indexer --morphline-file /opt/client/Solr/hdfs-indexer/m-conf/conf-model-Forcsv/ReadCSVContainer.conf --collection collection_hdfs_100 --go-live $HDFS_URI &
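
    For example, to create indexes for the sample file uploaded in 9, the command would look as follows (a sketch; replace the morphline file, collection name, and HDFS path with your own values):

    nohup hdfs-indexer --morphline-file /opt/client/Solr/hdfs-indexer/m-conf/conf-model-Forcsv/ReadCSVContainer.conf --collection collection_hdfs_100 --go-live hdfs://hacluster/user/solr/testfile-csv.txt &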

    Do not run index tasks repeatedly on the same data during batch indexing. To improve efficiency, batch indexing merges index files instead of parsing the unique ID of each data record, so importing duplicate data generates duplicate index entries: the same document is stored twice or more at the bottom layer and appears multiple times when you query the index in Solr. However, if you query by index ID, only one result is returned, because Solr returns a result as soon as the ID is matched, which causes a data conflict. For details about data conflicts caused by duplicate indexing, see https://wiki.apache.org/solr/MergingSolrIndexes.

    You are advised to use a new collection for each batch index creation task. In this way, if a task fails, you can directly delete the collection to roll back and clear the data.
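
    A sketch of such a rollback is shown below. It assumes that your solrctl client provides a --delete subcommand for collections; check the solrctl help output for the exact syntax in your version.

    solrctl collection --delete collection_hdfs_100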

    Note: In the ReadCSVContainer.conf configuration file, the collection and zkHost parameters must be consistent with the actual environment. The HDFS_URI parameter in the index creation command must be set to the actual HDFS directory. Batch index creation does not support collections that use the implicit routing mode.

    Table 2 Parameters in the Solr over HDFS index creation command

    • --config: Directory of the configuration files used by the index task. The default directory is ${SOLR_HOME}/hdfs-indexer/conf. You can also specify one.
    • jar: Path of the JAR file used by the index task. The default path is ${SOLR_HOME}/lib/solr-map-reduce-${VERSION}.jar. You can also specify one.
    • --libjars: Path of the dependency JAR files. The default path is ${SOLR_HADOOP_LIBJAR}. You can also specify one.
    • --morphline-file: Path of the morphline configuration file. The path must be specified by the user.
    • --output-dir: Output path of the generated indexes. The default value is hdfs://hacluster/tmp/solr. You can also specify one.
    • --http-socket-timeout: HTTP socket timeout in the go-live phase, in milliseconds. The default value is 120000. When there is a large amount of data, you are advised to increase this value to prevent socket timeouts.
    • --mappers: Number of mappers for the index task. The default value is -1. You can also specify a value.
    • --reducers: Number of reducers for the index task. The default value is -1. You can also specify a value.
    • --zk-host: Solr path in ZooKeeper. The default value is zk-host. You can also specify a value.
    • --collection: Name of the Solr collection. The value must be specified by the user.
    • --go-live: Whether to merge the generated index data into the live Solr cluster. If this parameter is not specified, the created index cannot be queried in Solr.
    • HDFS_URI: HDFS path of the source data to be indexed, for example, hdfs://hacluster/user/solr/testfile-csv.txt. The value must be specified by the user.

    You can query the index task progress and logs in the Yarn service. For details, see Using YARN.
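
    You can also list running applications from the client with the standard YARN command and identify the job by its user and start time (a quick check only):

    yarn application -list -appStates RUNNING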

    The preceding operations apply to a simple test environment. In a production environment, the hardware and component configurations need to be optimized accordingly.