Index a folder of multilanguage documents in Solr with Tika

Previous posts in the series

Everything works, but now the requirements have changed: documents can be in more than one language (Italian and English in my scenario), and we want to do the simplest thing that could possibly work. First of all, I change the schema of the core in Solr to support language-specific fields with wildcards.

Figure 1: Configuration of the Solr core to support multiple languages.
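The screenshot is not reproduced here, so as a rough sketch, wildcard fields of this kind are usually declared in the schema with dynamicField entries. The field type names text_it and text_en are assumptions on my part; use whatever language-specific types your schema defines.

```xml
<!-- Sketch only: the type names text_it and text_en are assumed, not taken from the original schema -->
<!-- Any field whose name ends in _it is analyzed as Italian -->
<dynamicField name="*_it" type="text_it" indexed="true" stored="true" multiValued="true" />
<!-- Any field whose name ends in _en is analyzed as English -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true" multiValued="true" />
```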

This is a simple modification: all fields are indexed, stored (for highlighting), and multivalued. Now we can use another interesting feature of Solr + Tika, an update processor that identifies the language of every document that gets indexed. This time we need to modify the solrconfig.xml file: locate the /update request handler section and change it this way.

  
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <!-- Run every update through the chain named "langid" -->
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>
<!-- The chain name must match the update.chain value above -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <bool name="langid">true</bool>
      <!-- Fields used to detect the language -->
      <str name="langid.fl">title,content</str>
      <!-- Field that receives the detected language code -->
      <str name="langid.langField">lang</str>
      <!-- Language assumed when detection fails -->
      <str name="langid.fallback">en</str>
      <!-- Map title/content to title_xx/content_xx, keeping the originals -->
      <bool name="langid.map">true</bool>
      <bool name="langid.map.keepOrig">true</bool>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

I use a TikaLanguageIdentifierUpdateProcessorFactory to identify the language of documents. It runs for every document that gets indexed, because it is injected into the update request processor chain. The configuration is simple, and you can find full details in the Solr wiki. Basically, I want it to analyze both the title and content fields of the document and to enable field mapping. This means that if a document is detected as Italian, it will contain content_it and title_it fields, not only the content and title fields. Thanks to the earlier modification of the Solr schema, which matches each dynamicField with the correct language, all content_xx fields are indexed with the analysis for the right language.
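To make the mapping concrete, here is a rough sketch of the fields a document might end up with when it is detected as Italian. The values are placeholders of mine, not output from a real index, and the exact shape depends on your schema.

```xml
<!-- Illustrative only: values are placeholders, not output from a real index -->
<doc>
  <!-- Written by the language identifier (langid.langField) -->
  <field name="lang">it</field>
  <!-- Original fields, kept because langid.map.keepOrig is true -->
  <field name="title">(original title)</field>
  <field name="content">(original text extracted by Tika)</field>
  <!-- Language-specific copies created by langid.map -->
  <field name="title_it">(original title)</field>
  <field name="content_it">(original text extracted by Tika)</field>
</doc>
```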

This approach consumes more memory and disk space, because for each field I store the original content as well as the localized copy, but it is needed for highlighting and it keeps my core simple to use.

Now I want to be able to search this multilanguage core. Basically, I have two choices:

  • Identify the language of the terms in the query and query the correct field.
  • Query all the fields with OR.

Since detecting the language of the terms used in a query gives a lot of false positives, the second technique sounds better. Suppose you want to find the Italian term “tipografia”. You can issue the query content_it:tipografia OR content_en:tipografia. Everything works as expected, as you can see in the following picture.

Figure 2: Sample search in all content fields.

Now, if you want highlights in the results, you must specify all the localized fields; you cannot simply use the content field. As an example, if I ask to highlight the result of the previous query using the original content field, I get no highlights.

Figure 3: No highlighting found if you use the original Content field.

This happens because the match in the document was not an exact match: I asked for the word tipografia, but in my document the match is on the term tipografo. Thanks to language-specific indexing, Solr is able to match through stemming; this is typical full-text search. The problem is that when it is time to highlight, if you specify the content field, Solr cannot find any match for the word tipografia in it, so you get no highlights.

To avoid the problem, you should specify all the localized fields in the hl parameters. This has no drawback, because a single document has only one non-null localized field, and the result is the expected one:

Figure 4: If you specify localized content fields you can have highlighting even with a full-text match.

In this example, when it is time to highlight, Solr will use both content_it and content_en. In my document content_en is empty, but Solr is able to find a match in content_it and to highlight it with the original content, because content_it has stored="true" in the configuration.
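If you would rather not pass the hl parameters on every request, one option is to set them as defaults on the search handler in solrconfig.xml. This is a minimal sketch, assuming the standard /select handler; it is not taken from my actual configuration, and you would merge these defaults into the handler you already have.

```xml
<!-- Sketch only: merge these defaults into your existing search handler -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Turn highlighting on by default -->
    <str name="hl">true</str>
    <!-- Highlight on the localized, stored fields instead of content -->
    <str name="hl.fl">content_it,content_en</str>
  </lst>
</requestHandler>
```

Because these are only defaults, a request can still override hl.fl when it needs to highlight a different set of fields.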

Clearly, using a single core with multiple fields can slow down performance a little, but it is probably the easiest way to index multilanguage files automatically with Tika and Solr.