<!-- Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. -->

Vespa Lucene Linguistics

A linguistics implementation based on Apache Lucene.

Features:

  • a list of default analyzers per language;
  • building custom analyzers through the configuration of the linguistics component;
  • building custom analyzers in Java code and declaring them as components.

Development

Build:

shell
mvn clean test -U package

To compile the configuration classes so that IntelliJ doesn't complain:

  • right-click on pom.xml
  • then Maven
  • then Generate Sources and Update Folders

Usage

Add a <component> to the services.xml of your application package, e.g.:

xml
<component id="com.yahoo.language.lucene.LuceneLinguistics" bundle="lucene-linguistics">
  <config name="com.yahoo.language.lucene.lucene-analysis">
    <configDir>linguistics</configDir>
    <analysis>
      <item key="en">
        <tokenizer>
          <name>standard</name>
        </tokenizer>
        <tokenFilters>
          <item>
            <name>reverseString</name>
          </item>
        </tokenFilters>
      </item>
    </analysis>
  </config>
</component>

Add the component to every container cluster that has <document-processing/> and/or <search> specified.
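As a rough sketch of the placement, the fragment below shows the component nested inside a container cluster that declares both <document-processing/> and <search/>. The cluster id `default` is a hypothetical name chosen for illustration, and the content cluster that a complete services.xml would also contain is left out.

```xml
<services version="1.0">
  <!-- "default" is a hypothetical cluster id -->
  <container id="default" version="1.0">
    <component id="com.yahoo.language.lucene.LuceneLinguistics" bundle="lucene-linguistics">
      <config name="com.yahoo.language.lucene.lucene-analysis">
        <configDir>linguistics</configDir>
        <!-- the <analysis> section from the example above goes here -->
      </config>
    </component>
    <document-processing/>
    <search/>
  </container>
</services>
```

If the application has several container clusters, the same component declaration would be repeated in each cluster that processes documents or serves queries.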

Then package and deploy, e.g.:

shell
(mvn clean -DskipTests=true -U package && vespa deploy -w 100)

Configuration of Lucene Analyzers

Read the Lucene docs of subclasses of:

  • TokenizerFactory
  • CharFilterFactory
  • TokenFilterFactory

E.g. the docs of the tokenizer StandardTokenizerFactory contain this config snippet:

xml
 <fieldType name="text_stndrd" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="255"/>
   </analyzer>
 </fieldType>

Then go to the <a href="https://github.com/apache/lucene/blob/17c13a76c87c6246f32dd7a78a26db04401ddb6e/lucene/core/src/java/org/apache/lucene/analysis/standard/StandardTokenizerFactory.java#L36" data-proofer-ignore>source code</a> of the class on GitHub. Copy the value of the public static final String NAME into <name>, and note the parameter names used to configure the tokenizer (in this case only maxTokenLength):

xml
<tokenizer>
  <name>standard</name>
  <config>
    <item key="maxTokenLength">255</item>
  </config>
</tokenizer>

On application startup, the AnalyzerFactory constructor logs the available analysis components.

The analysis components are discovered through the Java Service Provider Interface (SPI). To add more analysis components, it should be enough to add a Lucene analyzer dependency to the pom.xml of your application package, or to register the services and create the classes directly in the application package.
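For illustration, a dependency of that kind in the application package pom.xml might look like the fragment below. The choice of `lucene-analysis-stempel` (Lucene's Polish analysis module) is only an example, and the `${lucene.version}` property is an assumption: it would have to be defined in your pom.xml and should match the Lucene version that the lucene-linguistics bundle is built against.

```xml
<dependency>
  <groupId>org.apache.lucene</groupId>
  <!-- example module only; any Lucene analysis module is added the same way -->
  <artifactId>lucene-analysis-stempel</artifactId>
  <!-- assumed property, not defined by this README -->
  <version>${lucene.version}</version>
</dependency>
```

After redeploying, the startup log of the AnalyzerFactory is a convenient place to check that the additional components were discovered.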

Resource files

The Lucene analyzers can use various resource files, e.g. for stopwords, synonyms, etc. The configDir configuration parameter controls where to load these files from. These files are relative to the application package root directory.

If configDir is not specified, the files are loaded from the classpath.
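As a sketch of how a resource file might be wired in, the fragment below configures Lucene's stop filter (NAME `stop`, with the `words` and `ignoreCase` parameters of StopFilterFactory). The file path `en/stopwords.txt` is hypothetical and is assumed to live under the directory named by configDir. The `<config>` element for parameters mirrors the tokenizer snippet earlier in this README; verify the exact element name against the lucene-analysis config definition of your Vespa version.

```xml
<config name="com.yahoo.language.lucene.lucene-analysis">
  <configDir>linguistics</configDir>
  <analysis>
    <item key="en">
      <tokenizer>
        <name>standard</name>
      </tokenizer>
      <tokenFilters>
        <item>
          <name>stop</name>
          <config>
            <!-- hypothetical file: linguistics/en/stopwords.txt in the application package -->
            <item key="words">en/stopwords.txt</item>
            <item key="ignoreCase">true</item>
          </config>
        </item>
      </tokenFilters>
    </item>
  </analysis>
</config>
```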

Inspiration

These projects: