Updated on 2024-11-29 GMT+08:00

Solr Rich Text Indexing

Scenario

Solr supports rich text indexing through the web UI. Files stored on the local disk in formats such as CSV, XML, EML, HTML, TXT, DOC, PDF, XLS, XLSX, JPG, PNG, and TIF can be indexed by passing specific parameters to the request handler configured in solrconfig.xml.

Prerequisites

The Solr service is running properly, and a requestHandler that supports rich text indexing is configured in the solrconfig.xml file.

Procedure

  1. Confirm that the requestHandler that supports rich text indexing is configured in the solrconfig.xml file.

    <!-- Solr Cell Update Request Handler
         http://wiki.apache.org/solr/ExtractingRequestHandler
    -->
    <requestHandler name="/update/extract"
                    startup="lazy"
                    class="solr.extraction.ExtractingRequestHandler">
      <lst name="defaults">
        <str name="lowernames">true</str>
        <str name="uprefix">ignored_</str>
        <!-- capture link hrefs but ignore div attributes -->
        <str name="captureAttr">true</str>
        <str name="fmap.a">links</str>
        <str name="fmap.div">ignored_</str>
      </lst>
    </requestHandler>
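
The check in this step can also be scripted. The following is a minimal sketch, not part of the product, that parses a solrconfig.xml fragment with the Python standard library and reports whether a handler named /update/extract is present. The function name find_extract_handler and the inline SAMPLE_CONFIG string are illustrative assumptions; in practice, read the actual solrconfig.xml of your collection instead.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; in practice, load the real solrconfig.xml content.
SAMPLE_CONFIG = """
<config>
  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>
</config>
"""


def find_extract_handler(config_xml, handler_name="/update/extract"):
    """Return the handler's default parameters as a dict, or None if it is absent."""
    root = ET.fromstring(config_xml.strip())
    for handler in root.iter("requestHandler"):
        if handler.get("name") == handler_name:
            # Collect every <str> entry under <lst name="defaults">.
            return {
                item.get("name"): (item.text or "").strip()
                for item in handler.findall("./lst[@name='defaults']/str")
            }
    return None


defaults = find_extract_handler(SAMPLE_CONFIG)
print("handler configured:", defaults is not None)
print("defaults:", defaults)
```

If the function returns None, the handler is not configured and rich text indexing requests to /update/extract are not expected to work.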

  2. Fill in the index parameters as shown in the following figure.

    Handler parameter: literal.id=d1&uprefix=attr_&fmap.content=attr_content

    • literal.<fieldname>=<value>: Creates a field with the specified value. The field can hold multiple values if it is defined as multivalued in the schema.
    • uprefix=<prefix>: Adds the given prefix to all fields that are not defined in the schema. This is very useful when combined with dynamic field definitions. Example: uprefix=ignored_ effectively ignores all unknown fields generated by Tika, provided that the schema contains <dynamicField name="ignored_*" type="ignored"/>.
    • fmap.<source_field>=<target_field>: Maps (moves) one field name to another. Example: fmap.content=text moves the content field normally generated by Tika to the text field.
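
To make the structure of the handler parameter string concrete, the following minimal sketch assembles it from its three parts with the Python standard library. The helper name build_handler_params is a hypothetical name introduced only for illustration; the parameter names and values are the ones shown in this step.

```python
from urllib.parse import urlencode


def build_handler_params(doc_id, unknown_prefix, field_map):
    """Assemble literal.id, uprefix, and fmap.* parameters into one query string."""
    params = [("literal.id", doc_id), ("uprefix", unknown_prefix)]
    for source_field, target_field in field_map.items():
        # Each mapping becomes fmap.<source_field>=<target_field>.
        params.append((f"fmap.{source_field}", target_field))
    return urlencode(params)


handler_params = build_handler_params("d1", "attr_", {"content": "attr_content"})
print(handler_params)
# prints: literal.id=d1&uprefix=attr_&fmap.content=attr_content
```

Using urlencode rather than manual string concatenation keeps values that contain spaces or special characters correctly escaped.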

  3. Submit the index command. You can then run a query on the corresponding collection to view the indexed information.

    The equivalent REST API request is as follows:

    Request URL:
    https://ip:solr_port/solr/collName/update/extract?commitWithin=1000&boost=1.0&overwrite=true&wt=json&literal.id=d3&uprefix=attr_&fmap.content=attr_content
    
    Request Method:POST
    
    Request Headers:
    Accept:application/json, text/plain, */*
    Accept-Encoding:gzip, deflate, br
    Accept-Language:zh,en;q=0.8,zh-CN;q=0.6
    Connection:keep-alive
    Content-Length:400378
    Content-Type:multipart/form-data; boundary=----WebKitFormBoundaryW45dtKs8K3BymOjP
    
    Request Payload
    ------WebKitFormBoundaryW45dtKs8K3BymOjP
    Content-Disposition: form-data; name="file"; filename="text.docx"
    Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
    ------WebKitFormBoundaryW45dtKs8K3BymOjP--
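
The request above can be reproduced outside the browser. The following is a minimal sketch, using only the Python standard library, that builds the URL, headers, and multipart body for such a request without sending it. The function name build_extract_request, the file name sample.docx, and the placeholder file content are assumptions made for illustration; ip, solr_port, and collName are the placeholders from the request URL above and must be replaced with real values.

```python
import uuid
from urllib.parse import urlencode


def build_extract_request(base_url, collection, params, filename, file_bytes, content_type):
    """Return (url, headers, body) for a multipart POST to /update/extract."""
    url = f"{base_url}/solr/{collection}/update/extract?{urlencode(params)}"
    # A random boundary that separates the parts of the multipart body.
    boundary = f"----PythonFormBoundary{uuid.uuid4().hex}"
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        file_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        "Accept": "application/json",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Content-Length": str(len(body)),
    }
    return url, headers, body


params = [
    ("commitWithin", "1000"),
    ("boost", "1.0"),
    ("overwrite", "true"),
    ("wt", "json"),
    ("literal.id", "d3"),
    ("uprefix", "attr_"),
    ("fmap.content", "attr_content"),
]
url, headers, body = build_extract_request(
    "https://ip:solr_port",  # placeholder host and port, as in the request URL above
    "collName",              # placeholder collection name
    params,
    "sample.docx",           # hypothetical file name
    b"placeholder file content",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
)
print(url)
print(headers["Content-Type"])
```

To actually submit the request, the three returned values could be passed to urllib.request.Request(url, data=body, headers=headers, method="POST") and then to urllib.request.urlopen. Any authentication or certificate settings that the cluster requires are not covered by this sketch.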