Solr Rich Text Indexing
Scenario
Solr supports rich text indexing through the web UI. Files stored on the local disk in formats such as CSV, XML, EML, HTML, TXT, DOC, PDF, XLS, XLSX, JPG, PNG, and TIF can be indexed by passing parameters that match the requestHandler configured in solrconfig.xml.
Prerequisites
The Solr service is running properly, and a requestHandler that supports rich text indexing is configured in the solrconfig.xml file.
Procedure
- Confirm that the requestHandler that supports rich text indexing is configured in the solrconfig.xml file.
<!-- Solr Cell Update Request Handler
     http://wiki.apache.org/solr/ExtractingRequestHandler -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>
- Fill in the index parameters as shown in the following figure.
Handler parameter: literal.id=d1&uprefix=attr_&fmap.content=attr_content
- literal.<fieldname>=<value>: Creates a field with the specified value. The field may hold multiple values if it is defined as multivalued in the schema.
- uprefix=<prefix>: Prefix all fields that are not defined in the schema with the given prefix. This is very useful when combined with dynamic field definitions. Example: uprefix=ignored_ would effectively ignore all unknown fields generated by Tika given the example schema contains <dynamicField name="ignored_*" type="ignored"/>.
- fmap.<source_field>=<target_field>: Maps (moves) one field name to another. Example: fmap.content=text will cause the content field normally generated by Tika to be moved to the text field.
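The handler parameters above are ordinary URL query parameters, so they can be assembled programmatically. The following is a minimal sketch in Python using only the standard library. The function name build_extract_url and its arguments are hypothetical and chosen for illustration; the host, port, and collection name are the same placeholders used elsewhere on this page.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, collection, doc_id,
                      uprefix="attr_", content_field="attr_content",
                      extra=None):
    """Assemble an /update/extract URL (hypothetical helper, not a Solr API)."""
    params = {
        "literal.id": doc_id,           # literal.<fieldname>=<value>
        "uprefix": uprefix,             # prefix for fields missing from the schema
        "fmap.content": content_field,  # move Tika's "content" field to this field
    }
    if extra:
        # Optional additional parameters, for example {"wt": "json"}.
        params.update(extra)
    return f"{base_url}/solr/{collection}/update/extract?{urlencode(params)}"


# Placeholder host, port, and collection name, as used in this document.
url = build_extract_url("https://ip:solr_port", "collName", "d1")
print(url)
```

With the default arguments, the query string matches the Handler parameter value shown in the procedure.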
- Submit the index request. You can then view the indexed information by running a query on the corresponding collection under the Query operation.
The corresponding REST API request is as follows:
Request URL: https://ip:solr_port/solr/collName/update/extract?commitWithin=1000&boost=1.0&overwrite=true&wt=json&literal.id=d3&uprefix=attr_&fmap.content=attr_content
Request Method: POST
Request Headers:
Accept: application/json, text/plain, */*
Accept-Encoding: gzip, deflate, br
Accept-Language: zh,en;q=0.8,zh-CN;q=0.6
Connection: keep-alive
Content-Length: 400378
Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryW45dtKs8K3BymOjP
Request Payload:
------WebKitFormBoundaryW45dtKs8K3BymOjP
Content-Disposition: form-data; name="file"; filename="text.docx"
Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
------WebKitFormBoundaryW45dtKs8K3BymOjP--
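The request above can also be sent from a script instead of the web UI. The following is a minimal sketch in Python using only the standard library. The helper names build_multipart and post_document are hypothetical. The sketch assumes the cluster accepts an unauthenticated POST; any authentication or TLS certificate settings required by your deployment are omitted and would need to be added.

```python
import os
import uuid
import urllib.request


def build_multipart(filename, data, content_type, boundary=None):
    """Build a single-file multipart/form-data body (hypothetical helper).

    Returns the body bytes and the matching Content-Type header value.
    """
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_document(url, path, content_type):
    """POST one local file to an /update/extract URL and return the response text.

    Assumption: no authentication is needed; adapt this for secured clusters.
    """
    with open(path, "rb") as fh:
        body, header = build_multipart(os.path.basename(path), fh.read(), content_type)
    request = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": header, "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

A call such as post_document(url, "text.docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document") would upload the file, where url is the Request URL shown above with the placeholders replaced by real values.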