lucene-linguistics/README.md
Linguistics implementation based on Apache Lucene.
Features:
components.

Build:
mvn clean test -U package
To compile the configuration classes so that IntelliJ doesn't complain:
- right-click pom.xml
- then Maven
- then Generate Sources and Update Folders

Add <component> to services.xml of your application package, e.g.:

    <component id="com.yahoo.language.lucene.LuceneLinguistics" bundle="lucene-linguistics">
      <config name="com.yahoo.language.lucene.lucene-analysis">
        <configDir>linguistics</configDir>
        <analysis>
          <item key="en">
            <tokenizer>
              <name>standard</name>
            </tokenizer>
            <tokenFilters>
              <item>
                <name>reverseString</name>
              </item>
            </tokenFilters>
          </item>
        </analysis>
      </config>
    </component>

Add the component to the container clusters that specify <document-processing/> and/or <search>.
And then package and deploy, e.g.:
(mvn clean -DskipTests=true -U package && vespa deploy -w 100)
Read the Lucene docs of subclasses of:
- TokenizerFactory
- CharFilterFactory
- TokenFilterFactory

For example, the docs for the tokenizer StandardTokenizerFactory include this config snippet:

    <fieldType name="text_stndrd" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="255"/>
      </analyzer>
    </fieldType>

Then go to the <a href="https://github.com/apache/lucene/blob/17c13a76c87c6246f32dd7a78a26db04401ddb6e/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizerFactory.java#L36" data-proofer-ignore>
source code</a> of the class on GitHub.
Copy the value of the public static final String NAME constant into the <name> element, and note the parameter names used to configure the tokenizer (in this case only maxTokenLength).

    <tokenizer>
      <name>standard</name>
      <config>
        <item key="maxTokenLength">255</item>
      </config>
    </tokenizer>

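
The translation from a Solr-style snippet to the <tokenizer> config above is mechanical, and a JDK-only Java sketch can illustrate it. SolrToVespaSketch and its methods are hypothetical helpers, not part of Vespa or Lucene. The name argument still has to be copied by hand from the factory's NAME constant, and values are not XML-escaped.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class SolrToVespaSketch {

    /** Collects every attribute except "class" from a single Solr-style XML element. */
    static Map<String, String> extractParams(String solrElementXml) {
        try {
            Element element = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(solrElementXml)))
                    .getDocumentElement();
            Map<String, String> params = new TreeMap<>(); // sorted for stable output
            NamedNodeMap attributes = element.getAttributes();
            for (int i = 0; i < attributes.getLength(); i++) {
                String key = attributes.item(i).getNodeName();
                if (!key.equals("class")) {
                    params.put(key, attributes.item(i).getNodeValue());
                }
            }
            return params;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not a well-formed XML element", e);
        }
    }

    /** Renders a <tokenizer> fragment; name is the factory's NAME value, looked up manually. */
    static String toVespa(String name, Map<String, String> params) {
        StringBuilder sb = new StringBuilder("<tokenizer>\n  <name>" + name + "</name>\n");
        if (!params.isEmpty()) {
            sb.append("  <config>\n");
            // Assumption: keys and values need no XML escaping.
            params.forEach((k, v) ->
                    sb.append("    <item key=\"").append(k).append("\">").append(v).append("</item>\n"));
            sb.append("  </config>\n");
        }
        return sb.append("</tokenizer>\n").toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = extractParams(
                "<tokenizer class=\"solr.StandardTokenizerFactory\" maxTokenLength=\"255\"/>");
        System.out.print(toVespa("standard", params));
    }
}
```

The same approach should carry over to token filters and char filters, since only the enclosing element differs.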
On application startup, the AnalyzerFactory constructor logs the available analysis components.
The analysis components are discovered through the Java Service Provider Interface (SPI).
To add more analysis components, it should be enough to add a Lucene analyzer dependency to your application package pom.xml,
or to register services and create classes directly in the application package.
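
As a rough illustration of how SPI discovery works in general, the JDK-only sketch below registers a provider in a META-INF/services file and finds it with java.util.ServiceLoader. ComponentFactory and ReverseStringFactory are hypothetical stand-ins, not Lucene classes, and the temporary directory only simulates a jar on the classpath. In an application package the registration file would normally be a static resource named after the fully qualified service type.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

class SpiDiscoverySketch {

    /** Hypothetical service type, standing in for a Lucene factory base class. */
    public interface ComponentFactory {
        String name();
    }

    /** Hypothetical provider; SPI needs a public class with a public no-arg constructor. */
    public static class ReverseStringFactory implements ComponentFactory {
        @Override
        public String name() {
            return "reverseString";
        }
    }

    /** Lists the names of all providers that the given class loader can discover. */
    static List<String> discover(ClassLoader loader) {
        List<String> names = new ArrayList<>();
        for (ComponentFactory factory : ServiceLoader.load(ComponentFactory.class, loader)) {
            names.add(factory.name());
        }
        return names;
    }

    /**
     * Simulates a jar that registers a provider: a temp directory containing
     * META-INF/services/<service type name> with one provider class name per line.
     */
    static ClassLoader loaderWithRegistration() {
        try {
            Path root = Files.createTempDirectory("spi-sketch");
            Path services = Files.createDirectories(root.resolve("META-INF").resolve("services"));
            Files.writeString(services.resolve(ComponentFactory.class.getName()),
                              ReverseStringFactory.class.getName() + "\n");
            URL[] urls = {root.toUri().toURL()};
            return new URLClassLoader(urls, SpiDiscoverySketch.class.getClassLoader());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Without a registration file nothing is found; with it, the provider is listed.
        System.out.println(discover(SpiDiscoverySketch.class.getClassLoader()));
        System.out.println(discover(loaderWithRegistration()));
    }
}
```

The point of the design is that the consumer never references the provider class directly: adding a dependency that ships its own registration file is enough for its components to be discovered.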
Lucene analyzers can use various resource files, e.g. for stopwords, synonyms, etc.
The configDir configuration parameter controls where these files are loaded from.
These files are relative to the application package root directory.
If configDir is not specified, the files are loaded from the classpath.
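
The lookup rule described above can be sketched in plain Java as follows. This is not the component's actual code: ResourceLookupSketch is a hypothetical class, and resolving the file name directly against configDir is an assumption made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

class ResourceLookupSketch {

    /** Opens a resource from configDir when it is set, otherwise from the classpath. */
    static InputStream open(Optional<Path> configDir, String fileName) throws IOException {
        if (configDir.isPresent()) {
            // Assumption: the file name is resolved directly against configDir.
            return Files.newInputStream(configDir.get().resolve(fileName));
        }
        InputStream in = ResourceLookupSketch.class.getClassLoader().getResourceAsStream(fileName);
        if (in == null) {
            throw new IOException("Not found on classpath: " + fileName);
        }
        return in;
    }

    /** Reads the whole resource as UTF-8 text, wrapping I/O failures in an unchecked exception. */
    static String readText(Optional<Path> configDir, String fileName) {
        try (InputStream in = open(configDir, fileName)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, readText(Optional.of(Path.of("linguistics")), "stopwords.txt") would read a hypothetical stopwords file from the linguistics directory, while passing Optional.empty() falls back to the classpath.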
These projects: