Some of our SearchStax clients index websites that use multiple languages. We were recently asked how to enable Solr indexing of Mandarin on a cloud platform. (This post describes indexing Traditional Chinese characters. It is also possible to use Simplified Chinese by following a similar series of steps. Contact us at support@searchstax.com for an example.)
Step 1. Obtain Configuration Files
Step 2. Add the Required Library
Update the solrconfig.xml file by adding the following lines after the existing lib declarations.
<!-- Traditional Chinese library -->
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex="lucene-analyzers-icu-\d.*\.jar" />
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex="icu4j-\d.*\.jar" />
<!-- Traditional Chinese library - END -->
Step 3. Update the Schema
A. Create a new field type in the managed-schema file that uses the ICU tokenizer (solr.ICUTokenizerFactory), which is provided by the libraries added in Step 2.
<fieldType name="text_mandarin" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
B. Create a field that uses this field type.
<field name="text_man" type="text_mandarin" multiValued="true" indexed="true" stored="true"/>
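To check the new field, you can index a small test document that puts Traditional Chinese text into text_man. The following is a minimal sketch in Solr's XML update format. It assumes your schema's unique key field is named id (as in the default configset), and the document values are purely illustrative. Because text_man is multiValued, the field element may be repeated.

```xml
<!-- Hypothetical test document; assumes the uniqueKey field is "id". -->
<add>
  <doc>
    <field name="id">zh-hant-test-1</field>
    <!-- Each repeated element becomes one value of the multiValued field. -->
    <field name="text_man">我們使用 Solr 索引繁體中文網站。</field>
    <field name="text_man">這是第二個測試句子。</field>
  </doc>
</add>
```

A message like this would typically be posted to the collection's /update handler, followed by a commit. A query against text_man should then match terms from the indexed sentences if the field type is configured as described above.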