Solr Rich Text Indexing

Scenario

Solr supports rich text indexing through the web UI. Files in formats such as CSV, XML, EML, HTML, TXT, DOC, PDF, XLS, XLSX, JPG, PNG, and TIF that are stored on the local disk can be indexed by specifying parameters that match the request handler configured in solrconfig.xml.

Prerequisites

The Solr service is running properly, and a requestHandler that supports rich text indexing is configured in the solrconfig.xml file.

Procedure

Confirm that the requestHandler that supports rich text indexing is configured in the solrconfig.xml file.

<!-- Solr Cell Update Request Handler
     http://wiki.apache.org/solr/ExtractingRequestHandler
-->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
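
The check in this step can also be scripted. The following is a minimal sketch, assuming the solrconfig.xml content is available as a string; the helper names find_extract_handler and handler_defaults and the SAMPLE fragment are illustrative and not part of Solr.

```python
import xml.etree.ElementTree as ET


def find_extract_handler(solrconfig_xml, name="/update/extract"):
    """Return the <requestHandler> element with the given name, or None."""
    root = ET.fromstring(solrconfig_xml)
    for handler in root.iter("requestHandler"):
        if handler.get("name") == name:
            return handler
    return None


def handler_defaults(handler):
    """Collect the <str> entries under <lst name="defaults"> into a dict."""
    defaults = {}
    for lst in handler.findall("lst"):
        if lst.get("name") == "defaults":
            for item in lst.findall("str"):
                defaults[item.get("name")] = item.text
    return defaults


# Hypothetical, shortened solrconfig.xml fragment used only for illustration.
SAMPLE = """<config>
  <requestHandler name="/update/extract" startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
    </lst>
  </requestHandler>
</config>"""

handler = find_extract_handler(SAMPLE)
print("handler configured:", handler is not None)
print("defaults:", handler_defaults(handler))
```

If find_extract_handler returns None, the handler is not configured under that name and the prerequisite is not met.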

Fill in the index parameters as shown in the following figure.
Handler parameter: literal.id=d1&uprefix=attr_&fmap.content=attr_content
- literal.<fieldname>=<value>: Creates a field with the specified value. The field can hold multiple values if it is defined as multivalued in the schema.
- uprefix=<prefix>: Prefixes all fields that are not defined in the schema with the given prefix. This is very useful when combined with dynamic field definitions. Example: uprefix=ignored_ effectively ignores all unknown fields generated by Tika, provided that the schema contains <dynamicField name="ignored_*" type="ignored"/>.
- fmap.<source_field>=<target_field>: Maps (moves) one field name to another. Example: fmap.content=text will cause the content field normally generated by Tika to be moved to the text field.
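
The handler parameter string above can be assembled programmatically. This is a minimal sketch using only the Python standard library; the function name build_handler_params and its default values are illustrative assumptions taken from the example parameter line.

```python
from urllib.parse import urlencode


def build_handler_params(doc_id, unknown_prefix="attr_", content_field="attr_content"):
    """Build the handler parameter string for the extract request.

    literal.id   sets the id field of the indexed document
    uprefix      prefixes fields that are not defined in the schema
    fmap.content moves the Tika-generated content field to content_field
    """
    params = [
        ("literal.id", doc_id),
        ("uprefix", unknown_prefix),
        ("fmap.content", content_field),
    ]
    return urlencode(params)


print(build_handler_params("d1"))
# prints: literal.id=d1&uprefix=attr_&fmap.content=attr_content
```

Using urlencode rather than string concatenation keeps values with special characters, such as spaces in a literal field value, correctly escaped.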

Submit the index command. The indexed information can then be retrieved through the Query operation of the corresponding collection.
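
The same check can be made outside the web UI by querying the collection for the document id. The sketch below only builds the query URL, assuming the collection exposes Solr's standard /select handler; the function name build_select_url and the host, port, and collection values are placeholders.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, doc_id):
    """Build a query URL that looks up one document by its id field."""
    query = urlencode({"q": f"id:{doc_id}", "wt": "json"})
    return f"{base_url}/solr/{collection}/select?{query}"


# Placeholder host, port, and collection name.
print(build_select_url("https://localhost:8983", "collName", "d1"))
```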

Corresponding REST API request:

Request URL:
https://ip:solr_port/solr/collName/update/extract?commitWithin=1000&boost=1.0&overwrite=true&wt=json&literal.id=d3&uprefix=attr_&fmap.content=attr_content

Request Method: POST

Request Headers:
Accept:application/json, text/plain, */*
Accept-Encoding:gzip, deflate, br
Accept-Language:zh,en;q=0.8,zh-CN;q=0.6
Connection:keep-alive
Content-Length:400378
Content-Type:multipart/form-data; boundary=----WebKitFormBoundaryW45dtKs8K3BymOjP

Request Payload:
------WebKitFormBoundaryW45dtKs8K3BymOjP
Content-Disposition: form-data; name="file"; filename="text.docx"
Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
------WebKitFormBoundaryW45dtKs8K3BymOjP--
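
The request above can be reproduced from a script. The following is a minimal sketch using only the Python standard library; it builds the multipart POST request but does not send it. The function name build_extract_request, the boundary prefix, the host and port, and the file content are illustrative assumptions, and only a subset of the URL parameters from the captured request is included.

```python
import uuid
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(base_url, collection, doc_id, filename, file_bytes, content_type):
    """Build a multipart/form-data POST request for /update/extract."""
    params = urlencode([
        ("commitWithin", "1000"),
        ("overwrite", "true"),
        ("wt", "json"),
        ("literal.id", doc_id),
        ("uprefix", "attr_"),
        ("fmap.content", "attr_content"),
    ])
    url = f"{base_url}/solr/{collection}/update/extract?{params}"

    # A random boundary that separates the parts of the multipart body.
    boundary = "----SketchBoundary" + uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        file_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Accept": "application/json",
    }
    return Request(url, data=body, headers=headers, method="POST")


# Placeholder host, port, collection, and file content.
req = build_extract_request(
    "https://localhost:8983", "collName", "d3", "text.docx", b"example bytes",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
)
print(req.get_method(), req.full_url)
```

The returned object can be passed to urllib.request.urlopen to submit the file. Authentication and certificate handling depend on the cluster and are not covered by this sketch.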
