SMILE NLP — Tokenizers and Text Splitters


Tokenization is the foundational step in almost every NLP pipeline: raw text must be broken into discrete units (tokens, sentences, paragraphs) before any further processing can take place. The smile.nlp.tokenizer package provides a clean three-level hierarchy of splitters together with multiple implementations suited for different use cases.


Package Overview

| Level | Interface | Implementations |
|---|---|---|
| Paragraph | ParagraphSplitter | SimpleParagraphSplitter |
| Sentence | SentenceSplitter | SimpleSentenceSplitter, BreakIteratorSentenceSplitter |
| Word | Tokenizer | SimpleTokenizer, PennTreebankTokenizer, BreakIteratorTokenizer |

All three interfaces extend java.util.function.Function<String, String[]>, so any splitter can be used directly in a Java stream pipeline.
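
To illustrate why extending Function makes a splitter usable in streams, here is a minimal JDK-only sketch. The WordSplitter interface and the whitespace splitter are hypothetical stand-ins, and the default apply method that delegates to split is an assumption about how such an interface could be wired, not a copy of SMILE's source.

```java
import java.util.Arrays;
import java.util.function.Function;

public class FunctionSplitterSketch {
    /** Hypothetical stand-in for a splitter interface; not SMILE's actual source. */
    interface WordSplitter extends Function<String, String[]> {
        String[] split(String text);

        // Assumed bridge: route Function.apply to split so streams can call it.
        @Override
        default String[] apply(String text) {
            return split(text);
        }
    }

    // A trivial whitespace splitter, used only for illustration.
    static final WordSplitter WHITESPACE = text -> text.trim().split("\\s+");

    static String[][] tokenizeAll(String[] sentences) {
        // The splitter is passed directly where a Function is expected.
        return Arrays.stream(sentences).map(WHITESPACE).toArray(String[][]::new);
    }

    public static void main(String[] args) {
        String[][] tokens = tokenizeAll(new String[] {"Hello world", "How are you"});
        System.out.println(Arrays.deepToString(tokens));
        // [[Hello, world], [How, are, you]]
    }
}
```

Because split is the only abstract method, such an interface can also be implemented with a lambda, as the WHITESPACE constant shows.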

A supporting dictionary class, EnglishAbbreviations, is used internally by the English-specific implementations to avoid mis-splitting abbreviation periods.


Interfaces

Tokenizer

```java
public interface Tokenizer extends Function<String, String[]> {
    String[] split(String text);   // tokenize text into words/tokens
}
```

SentenceSplitter

```java
public interface SentenceSplitter extends Function<String, String[]> {
    String[] split(String text);   // segment text into sentences
}
```

ParagraphSplitter

```java
public interface ParagraphSplitter extends Function<String, String[]> {
    String[] split(String text);   // segment text into paragraphs
}
```

Because all three interfaces extend Function<String, String[]>, you can compose them with standard Java functional utilities:

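```java
import java.util.Arrays;
import smile.nlp.tokenizer.*;

// Sample input; any single-paragraph string works
String text = "Hello world. How are you?";

SentenceSplitter splitter = SimpleSentenceSplitter.getInstance();
Tokenizer tokenizer       = new SimpleTokenizer();

// Use as Function in a stream
String[] sentences = splitter.apply(text);
String[][] tokens  = Arrays.stream(sentences)
        .map(tokenizer)
        .toArray(String[][]::new);
```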

Word Tokenizers

SimpleTokenizer

SimpleTokenizer is the recommended general-purpose English word tokenizer. It handles contractions, possessives, punctuation, and abbreviation-final periods sensibly.

Key behaviours:

  • Splits most punctuation from adjoining words.

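  • Expands negative contractions to their full forms and splits other contractions into separate tokens:

    | Input | Output tokens |
    |---|---|
    | won't | will not |
    | can't | can not |
    | shan't | shall not |
    | cannot | can not |
    | weren't | were not |
    | 'tisn't | it is not |
    | I'm | I 'm |
    | he'll | he 'll |
    | gonna | gon na |
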
  • Keeps the period attached to an abbreviation. At the end of a sentence, etc. stays etc., and an additional . token is emitted for the sentence terminator.

  • Commas inside numbers (2,500) are not split.

  • Thread safety: instances are independent and thread-safe (no shared mutable state; each call is stateless beyond the compiled Pattern constants).

Basic usage

```java
import smile.nlp.tokenizer.SimpleTokenizer;

SimpleTokenizer tokenizer = new SimpleTokenizer();

String[] tokens = tokenizer.split(
    "Dr. Smith won't attend the conference, but she'll send her notes.");

System.out.println(java.util.Arrays.toString(tokens));
// [Dr., Smith, will, not, attend, the, conference, ,, but, she, 'll, send, her, notes, .]
```

Numeric and punctuation edge cases

```java
import java.util.Arrays;
import smile.nlp.tokenizer.SimpleTokenizer;

SimpleTokenizer tokenizer = new SimpleTokenizer();

// Commas inside numbers are not split
System.out.println(Arrays.toString(tokenizer.split("Population is 2,500,000.")));
// [Population, is, 2,500,000, .]

// Ellipsis is separated
System.out.println(Arrays.toString(tokenizer.split("Wait... then go.")));
// [Wait, ..., then, go, .]
```
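
The comma rule can be pictured with a small regular-expression sketch. This is a JDK-only illustration of one way to separate commas while leaving digit-grouping commas alone. The class name and pattern are assumptions for this example and are not taken from SimpleTokenizer's implementation.

```java
import java.util.Arrays;

public class NumberCommaSketch {
    // Matches a comma unless it sits directly between two digits.
    private static final String COMMA_OUTSIDE_NUMBER = ",(?!\\d)|(?<!\\d),";

    static String[] tokenize(String text) {
        // Surround matching commas with spaces, then split on whitespace.
        String spaced = text.replaceAll(COMMA_OUTSIDE_NUMBER, " , ");
        return spaced.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("It costs 2,500, or more")));
        // [It, costs, 2,500, ,, or, more]
    }
}
```

The two lookarounds together mean a comma is only protected when it has a digit on both sides, so a comma that merely follows a number is still emitted as its own token.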

PennTreebankTokenizer

PennTreebankTokenizer follows the tokenization conventions of the Penn Treebank corpus. It is a singleton (use PennTreebankTokenizer.getInstance()) and is the standard choice when your downstream models (e.g., HMMPOSTagger) were trained on Penn Treebank data.

Key differences from SimpleTokenizer:

| Input | SimpleTokenizer | PennTreebankTokenizer |
|---|---|---|
| won't | will not | wo n't |
| can't | can not | ca n't |
| 'tisn't | it is not | 't is n't |

The Penn Treebank convention keeps the contracted negative n't as a separate morpheme; SimpleTokenizer expands to natural English forms instead.
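
As a rough illustration of the n't convention, the following JDK-only sketch detaches n't from the preceding letters with a single regular expression. It covers only this one rule, ignores punctuation, and is an assumption-laden toy rather than the logic PennTreebankTokenizer actually uses.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class NegationSplitSketch {
    // A letter followed by "n't" at the end of a word; case-insensitive.
    private static final Pattern NOT_CONTRACTION =
            Pattern.compile("(?i)([a-z])(n't)\\b");

    static String[] tokenize(String text) {
        // Insert a space before "n't", then split on whitespace.
        String spaced = NOT_CONTRACTION.matcher(text).replaceAll("$1 $2");
        return spaced.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("They couldn't have known")));
        // [They, could, n't, have, known]
        System.out.println(Arrays.toString(tokenize("I won't and can't")));
        // [I, wo, n't, and, ca, n't]
    }
}
```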

Basic usage

```java
import smile.nlp.tokenizer.PennTreebankTokenizer;

PennTreebankTokenizer tokenizer = PennTreebankTokenizer.getInstance();

String[] tokens = tokenizer.split("They couldn't have known.");
System.out.println(java.util.Arrays.toString(tokens));
// [They, could, n't, have, known, .]
```

When to use

  • Use PennTreebankTokenizer when feeding tokens to models trained on Penn Treebank data (including HMMPOSTagger).
  • Use SimpleTokenizer for all other English NLP tasks where natural-English token forms are preferred.

BreakIteratorTokenizer

BreakIteratorTokenizer wraps Java's java.text.BreakIterator for word segmentation. It supports any locale supported by the JVM, making it the right choice for non-English text.

⚠️ Not thread-safe. BreakIterator maintains internal state; each thread must create its own instance.
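
The internal state mentioned above is the text and cursor position that a java.text.BreakIterator holds between calls. The JDK-only sketch below shows the underlying iteration pattern. It is not BreakIteratorTokenizer's source, and the class and method names are hypothetical. Dropping whitespace-only segments is a choice made for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class BreakIteratorWordSketch {
    static String[] words(String text, Locale locale) {
        // A fresh iterator per call keeps this method safe to call concurrently.
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text); // the iterator now stores the text and a cursor
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (!piece.trim().isEmpty()) { // skip whitespace-only segments
                tokens.add(piece);
            }
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("Hello, world!", Locale.ENGLISH)));
        // [Hello, ,, world, !]
    }
}
```

If two threads shared one iterator, a setText call from one thread could reset the cursor while the other is still looping, which is why a shared instance is unsafe.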

Basic usage

```java
import smile.nlp.tokenizer.BreakIteratorTokenizer;
import java.util.Locale;

// Default locale
BreakIteratorTokenizer tokenizer = new BreakIteratorTokenizer();
System.out.println(java.util.Arrays.toString(tokenizer.split("Hello, world!")));

// Explicit locale
BreakIteratorTokenizer frTokenizer = new BreakIteratorTokenizer(Locale.FRENCH);
System.out.println(java.util.Arrays.toString(frTokenizer.split("Bonjour, le monde!")));
```

Multi-threaded use

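```java
import smile.nlp.tokenizer.BreakIteratorTokenizer;

// One tokenizer per thread, created lazily on first access
ThreadLocal<BreakIteratorTokenizer> tlTokenizer =
        ThreadLocal.withInitial(BreakIteratorTokenizer::new);

// In each thread (text is the String to tokenize):
BreakIteratorTokenizer tokenizer = tlTokenizer.get();
String[] tokens = tokenizer.split(text);
```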

Sentence Splitters

SimpleSentenceSplitter

SimpleSentenceSplitter is the recommended English sentence splitter. It is a singleton that uses a set of regular-expression heuristics to handle the hardest cases:

  • A . after a known abbreviation (Mr., Dr., etc., vs., …) is not treated as a sentence boundary.
  • . followed by a lowercase letter is not a boundary.
  • . at the end of the string or before a newline is always a boundary.
  • ? and ! are always boundaries.
  • Treats carriage returns as whitespace (expects paragraph-segmented input).

Assumes input has already been split into paragraphs. Feed each paragraph individually for best results.
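
The heuristics listed above can be sketched in a few lines of plain Java. This is a simplified illustration under stated assumptions: the abbreviation set is a tiny hand-picked sample, the class name is hypothetical, and the real SimpleSentenceSplitter uses its own regular expressions and a much larger dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SentenceHeuristicSketch {
    // Small sample dictionary, stored without the trailing period.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("Mr", "Mrs", "Dr", "Prof", "etc", "vs"));

    static List<String> split(String paragraph) {
        List<String> sentences = new ArrayList<>();
        if (paragraph.trim().isEmpty()) {
            return sentences;
        }
        String[] words = paragraph.trim().split("\\s+");
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(words[i]);
            boolean last = i == words.length - 1;
            // The end of the input always closes the current sentence.
            if (last || isBoundary(words[i], words[i + 1])) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        return sentences;
    }

    private static boolean isBoundary(String word, String next) {
        if (word.endsWith("?") || word.endsWith("!")) {
            return true;  // ? and ! always end a sentence
        }
        if (!word.endsWith(".")) {
            return false;
        }
        String stem = word.substring(0, word.length() - 1);
        if (ABBREVIATIONS.contains(stem)) {
            return false; // period belongs to a known abbreviation
        }
        // A period followed by a lowercase letter is not a boundary.
        return !Character.isLowerCase(next.charAt(0));
    }

    public static void main(String[] args) {
        split("Dr. Smith arrived. He spoke at 3 p.m. sharp! Really?")
                .forEach(System.out::println);
        // Dr. Smith arrived.
        // He spoke at 3 p.m. sharp!
        // Really?
    }
}
```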

Basic usage

```java
import smile.nlp.tokenizer.SimpleSentenceSplitter;

SimpleSentenceSplitter splitter = SimpleSentenceSplitter.getInstance();

String paragraph =
    "Dr. Smith attended the conf. in Jan. He presented his findings. "
  + "Was the result surprising? Absolutely!";

for (String sentence : splitter.split(paragraph)) {
    System.out.println(sentence);
}
// Dr. Smith attended the conf. in Jan.
// He presented his findings.
// Was the result surprising?
// Absolutely!
```

Thread safety

SimpleSentenceSplitter is a stateless singleton and is thread-safe.


BreakIteratorSentenceSplitter

BreakIteratorSentenceSplitter wraps java.text.BreakIterator for sentence segmentation. Like BreakIteratorTokenizer, it supports any locale.

⚠️ Not thread-safe. Create one instance per thread.

Basic usage

```java
import smile.nlp.tokenizer.BreakIteratorSentenceSplitter;
import java.util.Locale;

// Default locale
BreakIteratorSentenceSplitter splitter = new BreakIteratorSentenceSplitter();

// Specific locale
BreakIteratorSentenceSplitter deSplitter =
        new BreakIteratorSentenceSplitter(Locale.GERMAN);

for (String sentence : deSplitter.split("Das ist ein Test. Und noch ein Satz.")) {
    System.out.println(sentence);
}
// Das ist ein Test.
// Und noch ein Satz.
```

Paragraph Splitter

SimpleParagraphSplitter

SimpleParagraphSplitter is a singleton that segments text into paragraphs by splitting on one or more blank lines. A blank line is any line containing only whitespace characters.

It also handles the Unicode paragraph separator character (U+2029).
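
The blank-line rule can be approximated with one regular expression. The sketch below is a JDK-only illustration: the pattern, the trimming of results, and the class name are assumptions made for this example and may differ from what SimpleParagraphSplitter does internally.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class ParagraphSplitSketch {
    // Either: one or more line breaks, each optionally followed by spaces or
    // tabs, and then a further line break (at least one whitespace-only line);
    // or: the Unicode paragraph separator U+2029.
    private static final Pattern PARAGRAPH_BREAK =
            Pattern.compile("(?:\\r?\\n[ \\t]*)+\\r?\\n|\\u2029");

    static String[] split(String text) {
        return Arrays.stream(PARAGRAPH_BREAK.split(text))
                .map(String::trim)
                .filter(p -> !p.isEmpty()) // drop empty leading/trailing pieces
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String doc = "First.\n\nSecond line one\nline two.\n \n\nThird.\u2029Fourth.";
        System.out.println(split(doc).length);
        // 4
    }
}
```

A single line break inside a paragraph does not match the pattern, so "Second line one\nline two." stays together as one paragraph.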

Basic usage

```java
import smile.nlp.tokenizer.SimpleParagraphSplitter;

SimpleParagraphSplitter splitter = SimpleParagraphSplitter.getInstance();

String document =
    "First paragraph with multiple sentences. It continues here.\n\n"
  + "Second paragraph begins after the blank line.\n\n"
  + "Third paragraph.";

for (String para : splitter.split(document)) {
    System.out.println("PARAGRAPH: " + para);
}
// PARAGRAPH: First paragraph with multiple sentences. It continues here.
// PARAGRAPH: Second paragraph begins after the blank line.
// PARAGRAPH: Third paragraph.
```

SimpleParagraphSplitter is stateless and thread-safe.


English Abbreviations — EnglishAbbreviations

EnglishAbbreviations is a package-private interface that exposes a static dictionary of common English abbreviations loaded from the classpath resource abbreviations_en.txt. It is used internally by SimpleSentenceSplitter and PennTreebankTokenizer to avoid splitting on abbreviation periods.

The dictionary includes titles (Mr, Mrs, Dr, Prof), calendar items (Jan, Feb, Mon, Tue), geographic terms (Ave, Blvd, St), Latin abbreviations (etc, vs, cf, al), and more. It is not directly accessible from outside the package.
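
Since the built-in dictionary cannot be reached from outside the package, a custom splitter needs its own abbreviation set. The sketch below shows one way to load such a set with the JDK alone. The one-entry-per-line file layout and all names here are assumptions for illustration. The actual format of abbreviations_en.txt is not described in this document.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Set;
import java.util.stream.Collectors;

public class AbbreviationSetSketch {
    /** Reads one abbreviation per line (assumed layout), skipping blank lines. */
    static Set<String> load(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            return reader.lines()
                    .map(String::trim)
                    .filter(line -> !line.isEmpty())
                    .collect(Collectors.toSet());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A StringReader stands in for a reader over a resource file.
        Set<String> abbreviations = load(new StringReader("Mr\nDr\n\netc\n"));
        System.out.println(abbreviations.contains("Dr"));    // true
        System.out.println(abbreviations.contains("Smith")); // false
    }
}
```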


Complete Pipeline Example

A typical NLP preprocessing pipeline works in three stages: paragraph → sentence → token.

```java
import smile.nlp.tokenizer.*;
import smile.nlp.stemmer.PorterStemmer;
import smile.nlp.pos.*;

// ── Splitters & tokenizer ────────────────────────────────────────────
ParagraphSplitter paragraphSplitter = SimpleParagraphSplitter.getInstance();
SentenceSplitter  sentenceSplitter  = SimpleSentenceSplitter.getInstance();
// HMMPOSTagger expects Penn Treebank tokens, so use the matching tokenizer
Tokenizer         tokenizer         = PennTreebankTokenizer.getInstance();
HMMPOSTagger      tagger            = HMMPOSTagger.getDefault();

ThreadLocal<PorterStemmer> tlStemmer = ThreadLocal.withInitial(PorterStemmer::new);

// ── Input document ───────────────────────────────────────────────────
String document =
    "Alan Turing was a British mathematician. "
  + "He proposed the Turing test in 1950.\n\n"
  + "His work laid the foundation for computer science.";

// ── Pipeline ─────────────────────────────────────────────────────────
PorterStemmer stemmer = tlStemmer.get();

for (String paragraph : paragraphSplitter.split(document)) {
    for (String sentence : sentenceSplitter.split(paragraph)) {
        String[] tokens = tokenizer.split(sentence);
        PennTreebankPOS[] tags = tagger.tag(tokens);

        for (int i = 0; i < tokens.length; i++) {
            if (tags[i].open) { // content word
                String stem = stemmer.stem(tokens[i].toLowerCase());
                System.out.printf("%-20s %-6s %s%n", tokens[i], tags[i], stem);
            }
        }
        System.out.println();
    }
}
```

Choosing the Right Implementation

Word tokenizer

| Scenario | Recommended |
|---|---|
| General English text | SimpleTokenizer |
| Penn Treebank / pre-trained NLP models | PennTreebankTokenizer |
| Non-English or multilingual | BreakIteratorTokenizer |

Sentence splitter

| Scenario | Recommended |
|---|---|
| English text (production use) | SimpleSentenceSplitter |
| Multilingual / locale-sensitive | BreakIteratorSentenceSplitter |

Paragraph splitter

| Scenario | Recommended |
|---|---|
| Any text with blank-line paragraph boundaries | SimpleParagraphSplitter |

Thread-Safety Summary

| Class | Thread-safe? | Notes |
|---|---|---|
| SimpleTokenizer | ✅ Yes | Stateless after construction |
| PennTreebankTokenizer | ✅ Yes | Stateless singleton |
| BreakIteratorTokenizer | ❌ No | BreakIterator is not thread-safe; use ThreadLocal |
| SimpleSentenceSplitter | ✅ Yes | Stateless singleton |
| BreakIteratorSentenceSplitter | ❌ No | BreakIterator is not thread-safe; use ThreadLocal |
| SimpleParagraphSplitter | ✅ Yes | Stateless singleton |

API Quick-Reference

```java
import java.util.Arrays;
import java.util.Locale;
import smile.nlp.tokenizer.*;

// ── Word tokenizers ──────────────────────────────────────────────────
Tokenizer simple    = new SimpleTokenizer();                    // thread-safe
Tokenizer ptb       = PennTreebankTokenizer.getInstance();      // singleton, thread-safe
Tokenizer biTok     = new BreakIteratorTokenizer();             // per-thread
Tokenizer biTokFr   = new BreakIteratorTokenizer(Locale.FRENCH);// locale-aware

String[] tokens = simple.split("He won't go.");
// [He, will, not, go, .]

// ── Sentence splitters ───────────────────────────────────────────────
SentenceSplitter ss  = SimpleSentenceSplitter.getInstance();    // singleton
SentenceSplitter bis = new BreakIteratorSentenceSplitter();     // per-thread
SentenceSplitter bde = new BreakIteratorSentenceSplitter(Locale.GERMAN);

String[] sentences = ss.split("Hello world. How are you?");
// [Hello world., How are you?]

// ── Paragraph splitter ───────────────────────────────────────────────
ParagraphSplitter ps = SimpleParagraphSplitter.getInstance();   // singleton
String[] paragraphs  = ps.split("Para one.\n\nPara two.");
// [Para one., Para two.]

// ── As Function in streams ───────────────────────────────────────────
String[][] allTokens = Arrays.stream(sentences)
        .map(simple)                    // Tokenizer IS a Function<String,String[]>
        .toArray(String[][]::new);
```

Notes and Caveats

  • Input assumptions: SimpleSentenceSplitter and the English word tokenizers (SimpleTokenizer and PennTreebankTokenizer) assume the input is a single paragraph with no embedded paragraph breaks. Split the document with SimpleParagraphSplitter first and process each paragraph separately.
  • Sentence-final abbreviations: SimpleSentenceSplitter consults EnglishAbbreviations to avoid splitting on abbreviation periods, but the dictionary is not exhaustive. Domain-specific abbreviations may require a custom splitter.
  • Penn Treebank conventions: if you use PennTreebankTokenizer, make sure your downstream models (taggers, parsers) were trained on Penn Treebank tokenized data. Mixing conventions reduces accuracy.
  • Locale: BreakIterator-based classes are locale-aware but rely on the break rules and locale data bundled with the JVM. Results may vary across JVM vendors and versions.
  • Empty tokens: all implementations filter out blank tokens, so the returned String[] never contains an empty string.

SMILE — © 2010-2026 Haifeng Li. GNU GPL licensed.