SMILE NLP — Tokenizers and Text Splitters

Tokenization is the foundational step in almost every NLP pipeline: raw text must be broken into discrete units (tokens, sentences, paragraphs) before any further processing can take place. The smile.nlp.tokenizer package provides a clean three-level hierarchy of splitters together with multiple implementations suited for different use cases.

Package Overview

Level	Interface	Implementations
Paragraph	`ParagraphSplitter`	`SimpleParagraphSplitter`
Sentence	`SentenceSplitter`	`SimpleSentenceSplitter`, `BreakIteratorSentenceSplitter`
Word	`Tokenizer`	`SimpleTokenizer`, `PennTreebankTokenizer`, `BreakIteratorTokenizer`

All three interfaces extend java.util.function.Function<String, String[]>, so any splitter can be used directly in a Java stream pipeline.

A supporting dictionary class, EnglishAbbreviations, is used internally by the English-specific implementations to avoid mis-splitting abbreviation periods.

Interfaces

`Tokenizer`

java

public interface Tokenizer extends Function<String, String[]> {
    String[] split(String text);   // tokenize text into words/tokens
}

`SentenceSplitter`

java

public interface SentenceSplitter extends Function<String, String[]> {
    String[] split(String text);   // segment text into sentences
}

`ParagraphSplitter`

java

public interface ParagraphSplitter extends Function<String, String[]> {
    String[] split(String text);   // segment text into paragraphs
}

Because all three interfaces extend Function<String, String[]>, you can pass any splitter wherever a Function is expected and compose it with standard Java functional utilities:

java

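import java.util.Arrays;
import smile.nlp.tokenizer.*;

SentenceSplitter splitter = SimpleSentenceSplitter.getInstance();
Tokenizer tokenizer       = new SimpleTokenizer();

String text = "Hello world. How are you?";

// Split into sentences, then pass the tokenizer itself as the Function to map()
String[] sentences = splitter.apply(text);
String[][] tokens  = Arrays.stream(sentences)
        .map(tokenizer)
        .toArray(String[][]::new);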
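The relationship between split and apply can be illustrated without the library. The sketch below is an assumption-laden stand-in, not the SMILE source: the nested Tokenizer interface is a hypothetical mirror of the shape described above, WHITESPACE is a deliberately naive tokenizer, and tokenCounter is an illustrative helper showing composition through Function.andThen.

```java
import java.util.Arrays;
import java.util.function.Function;

public class SplitterComposition {
    /** Hypothetical stand-in that mirrors the shape of the Tokenizer interface. */
    interface Tokenizer extends Function<String, String[]> {
        String[] split(String text);

        // Delegating apply() to split() is what lets a splitter act as a Function.
        @Override
        default String[] apply(String text) {
            return split(text);
        }
    }

    /** Naive whitespace tokenizer, used only for illustration. */
    static final Tokenizer WHITESPACE = text -> text.trim().split("\\s+");

    /** Composes a tokenizer with a post-processing step via Function.andThen. */
    static Function<String, Integer> tokenCounter(Tokenizer tokenizer) {
        return tokenizer.andThen(tokens -> tokens.length);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(WHITESPACE.apply("Hello brave new world")));
        // [Hello, brave, new, world]
        System.out.println(tokenCounter(WHITESPACE).apply("Hello brave new world"));
        // 4
    }
}
```

Because split is the only abstract method in this stand-in, a lambda or method reference can also serve as a tokenizer, which is convenient in tests.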

Word Tokenizers

`SimpleTokenizer`

SimpleTokenizer is the recommended general-purpose English word tokenizer. It handles contractions, possessives, punctuation, and abbreviation-final periods sensibly.

Key behaviours:

Splits most punctuation from adjoining words.
Expands negative contractions to their full forms and splits other clitics into separate tokens (see the table below).
Keeps abbreviation-terminal periods attached: a sentence-final etc. stays etc., and a separate . token is emitted for the end of the sentence.
Commas inside numbers (2,500) are not split.
Thread safety: instances are independent and thread-safe (no shared mutable state; each call is stateless beyond the compiled Pattern constants).

Input	Output tokens
`won't`	`will not`
`can't`	`can not`
`shan't`	`shall not`
`cannot`	`can not`
`weren't`	`were not`
`'tisn't`	`it is not`
`I'm`	`I 'm`
`he'll`	`he 'll`
`gonna`	`gon na`
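The rewrites in the table can be approximated with a few ordered regular expressions. This is a minimal sketch under stated assumptions, not the SimpleTokenizer implementation: the class name, the pattern set, and the rule order are all illustrative, and punctuation handling is left out entirely.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class ContractionSketch {
    // Irregular negatives must be rewritten before the generic n't rule.
    private static final Pattern WONT  = Pattern.compile("\\b([Ww])on't\\b");
    private static final Pattern SHANT = Pattern.compile("\\b([Ss])han't\\b");
    private static final Pattern CANT  = Pattern.compile("\\b([Cc])an't\\b");
    // Generic negative: weren't -> were not, couldn't -> could not.
    private static final Pattern NT = Pattern.compile("\\b(\\p{Alpha}+)n't\\b");
    // Remaining clitics are detached rather than expanded: she'll -> she 'll.
    private static final Pattern CLITIC =
            Pattern.compile("\\b(\\p{Alpha}+)('ll|'m|'re|'ve|'d|'s)\\b");

    static String[] tokenize(String text) {
        String s = WONT.matcher(text).replaceAll("$1ill not");
        s = SHANT.matcher(s).replaceAll("$1hall not");
        s = CANT.matcher(s).replaceAll("$1an not");
        s = NT.matcher(s).replaceAll("$1 not");
        s = CLITIC.matcher(s).replaceAll("$1 $2");
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("They won't say she'll go")));
        // [They, will, not, say, she, 'll, go]
    }
}
```

The ordering matters: if the generic n't rule ran first, won't would become wo not instead of will not.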

Basic usage

java

import smile.nlp.tokenizer.SimpleTokenizer;

SimpleTokenizer tokenizer = new SimpleTokenizer();

String[] tokens = tokenizer.split(
    "Dr. Smith won't attend the conference, but she'll send her notes.");

System.out.println(java.util.Arrays.toString(tokens));
// [Dr., Smith, will, not, attend, the, conference, ,, but, she, 'll, send, her, notes, .]

Numeric and punctuation edge cases

java

import java.util.Arrays;
import smile.nlp.tokenizer.SimpleTokenizer;

SimpleTokenizer tokenizer = new SimpleTokenizer();

// Commas inside numbers are not split
System.out.println(Arrays.toString(tokenizer.split("Population is 2,500,000.")));
// [Population, is, 2,500,000, .]

// Ellipsis is separated
System.out.println(Arrays.toString(tokenizer.split("Wait... then go.")));
// [Wait, ..., then, go, .]
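One way to keep commas inside numbers intact is to detach a comma only when it lacks a digit on at least one side. The sketch below illustrates that idea with lookarounds; it is an assumption about how such a rule could be written, not the pattern SimpleTokenizer actually uses, and the class and method names are hypothetical.

```java
import java.util.Arrays;

public class NumberCommaSketch {
    /** Pads a comma with spaces unless it sits between two digits, then splits on whitespace. */
    static String[] splitCommas(String text) {
        // (?<!\d),  matches a comma not preceded by a digit
        // ,(?!\d)   matches a comma not followed by a digit
        String spaced = text.replaceAll("(?<!\\d),|,(?!\\d)", " , ");
        return spaced.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitCommas("It cost 2,500 dollars, not 3,000")));
        // [It, cost, 2,500, dollars, ,, not, 3,000]
    }
}
```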

`PennTreebankTokenizer`

PennTreebankTokenizer follows the tokenization conventions of the Penn Treebank corpus. It is a singleton (use PennTreebankTokenizer.getInstance()) and is the standard choice when your downstream models (e.g., HMMPOSTagger) were trained on Penn Treebank data.

Key differences from SimpleTokenizer:

Input	`SimpleTokenizer`	`PennTreebankTokenizer`
`won't`	`will not`	`wo n't`
`can't`	`can not`	`ca n't`
`'tisn't`	`it is not`	`'t is n't`

The Penn Treebank convention keeps the contracted negative n't as a separate morpheme; SimpleTokenizer expands to natural English forms instead.
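The Penn Treebank treatment of n't can be sketched with a single substitution that detaches the negative clitic without rewriting the stem. This is an illustrative approximation only; the real tokenizer applies many more rules, and the class name and pattern here are assumptions.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PennNegationSketch {
    // Greedy letters followed by n't; backtracking leaves the stem as wo, ca, could, ...
    private static final Pattern NEGATION = Pattern.compile("\\b(\\p{Alpha}+)(n't)\\b");

    static String[] tokenize(String text) {
        return NEGATION.matcher(text).replaceAll("$1 $2").trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("They couldn't say why he won't go")));
        // [They, could, n't, say, why, he, wo, n't, go]
    }
}
```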

Basic usage

java

import smile.nlp.tokenizer.PennTreebankTokenizer;

PennTreebankTokenizer tokenizer = PennTreebankTokenizer.getInstance();

String[] tokens = tokenizer.split("They couldn't have known.");
System.out.println(java.util.Arrays.toString(tokens));
// [They, could, n't, have, known, .]

When to use

Use PennTreebankTokenizer when feeding tokens to models trained on Penn Treebank data (including HMMPOSTagger).
Use SimpleTokenizer for all other English NLP tasks where natural-English token forms are preferred.

`BreakIteratorTokenizer`

BreakIteratorTokenizer wraps Java's java.text.BreakIterator for word segmentation. It supports any locale supported by the JVM, making it the right choice for non-English text.

⚠️ Not thread-safe. BreakIterator maintains internal state; each thread must create its own instance.

Basic usage

java

import smile.nlp.tokenizer.BreakIteratorTokenizer;
import java.util.Locale;

// Default locale
BreakIteratorTokenizer tokenizer = new BreakIteratorTokenizer();
System.out.println(java.util.Arrays.toString(tokenizer.split("Hello, world!")));

// Explicit locale
BreakIteratorTokenizer frTokenizer = new BreakIteratorTokenizer(Locale.FRENCH);
System.out.println(java.util.Arrays.toString(frTokenizer.split("Bonjour, le monde!")));
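For readers unfamiliar with java.text.BreakIterator, the following sketch shows the boundary-walking loop that a wrapper of this kind is likely to use. It relies only on the JDK; the class name, the words helper, and the choice to drop whitespace-only segments are assumptions rather than a description of the SMILE source.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorWords {
    /** Walks word boundaries and keeps every segment that is not pure whitespace. */
    static List<String> words(String text, Locale locale) {
        // A fresh iterator per call avoids sharing its mutable cursor between threads.
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (!piece.trim().isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ENGLISH));
        // [Hello, ,, world, !]
    }
}
```

The iterator keeps its text and current position as instance state, which is why a single instance cannot be shared safely across threads.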

Multi-threaded use

java

ThreadLocal<BreakIteratorTokenizer> tlTokenizer =
        ThreadLocal.withInitial(BreakIteratorTokenizer::new);

// In each thread:
BreakIteratorTokenizer tokenizer = tlTokenizer.get();
String[] tokens = tokenizer.split(text);

Sentence Splitters

`SimpleSentenceSplitter`

SimpleSentenceSplitter is the recommended English sentence splitter. It is a singleton that uses a set of regular-expression heuristics to handle the hardest cases:

A . after a known abbreviation (Mr., Dr., etc., vs., …) is not treated as a sentence boundary.
. followed by a lowercase letter is not a boundary.
. at the end of the string or before a newline is always a boundary.
? and ! are always boundaries.
Treats carriage returns as whitespace (expects paragraph-segmented input).

Assumes input has already been split into paragraphs. Feed each paragraph individually for best results.
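The heuristics listed above can be sketched as a single left-to-right scan. The code below is a simplified illustration, not the regular-expression implementation in SimpleSentenceSplitter: the abbreviation set is a tiny hypothetical sample, and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SentenceHeuristicSketch {
    // Tiny illustrative sample; a real abbreviation dictionary is far larger.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("Mr", "Mrs", "Dr", "Prof", "etc", "vs"));

    static List<String> split(String paragraph) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < paragraph.length(); i++) {
            char c = paragraph.charAt(i);
            if (c != '.' && c != '?' && c != '!') continue;
            // ? and ! always end a sentence; a period must pass the checks below.
            if (c == '.' && !isBoundary(paragraph, start, i)) continue;
            String sentence = paragraph.substring(start, i + 1).trim();
            if (!sentence.isEmpty()) sentences.add(sentence);
            start = i + 1;
        }
        String tail = paragraph.substring(start).trim();
        if (!tail.isEmpty()) sentences.add(tail);
        return sentences;
    }

    private static boolean isBoundary(String text, int start, int dot) {
        // Rule: a period after a known abbreviation is not a boundary.
        int wordStart = dot;
        while (wordStart > start && Character.isLetter(text.charAt(wordStart - 1))) wordStart--;
        if (ABBREVIATIONS.contains(text.substring(wordStart, dot))) return false;
        // Rule: a period followed by a lowercase letter is not a boundary.
        int next = dot + 1;
        while (next < text.length() && Character.isWhitespace(text.charAt(next))) next++;
        return next >= text.length() || !Character.isLowerCase(text.charAt(next));
    }

    public static void main(String[] args) {
        split("Dr. Smith arrived. Was it late? Yes!").forEach(System.out::println);
        // Dr. Smith arrived.
        // Was it late?
        // Yes!
    }
}
```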

Basic usage

java

import smile.nlp.tokenizer.SimpleSentenceSplitter;

SimpleSentenceSplitter splitter = SimpleSentenceSplitter.getInstance();

String paragraph =
    "Dr. Smith attended the conf. in Jan. He presented his findings. "
  + "Was the result surprising? Absolutely!";

for (String sentence : splitter.split(paragraph)) {
    System.out.println(sentence);
}
// Dr. Smith attended the conf. in Jan.
// He presented his findings.
// Was the result surprising?
// Absolutely!

Thread safety

SimpleSentenceSplitter is a stateless singleton and is thread-safe.

`BreakIteratorSentenceSplitter`

BreakIteratorSentenceSplitter wraps java.text.BreakIterator for sentence segmentation. Like BreakIteratorTokenizer, it supports any locale.

⚠️ Not thread-safe. Create one instance per thread.

Basic usage

java

import smile.nlp.tokenizer.BreakIteratorSentenceSplitter;
import java.util.Locale;

// Default locale
BreakIteratorSentenceSplitter splitter = new BreakIteratorSentenceSplitter();

// Specific locale
BreakIteratorSentenceSplitter deSplitter =
        new BreakIteratorSentenceSplitter(Locale.GERMAN);

for (String sentence : deSplitter.split("Das ist ein Test. Und noch ein Satz.")) {
    System.out.println(sentence);
}
// Das ist ein Test.
// Und noch ein Satz.
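The per-thread requirement can also be met at the BreakIterator level. The sketch below uses only the JDK and assumes a fixed German locale; the class name and the sentences helper are illustrative and are not part of SMILE.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ThreadLocalSentences {
    // Each thread lazily receives its own iterator, so no cursor state is shared.
    private static final ThreadLocal<BreakIterator> GERMAN =
            ThreadLocal.withInitial(() -> BreakIterator.getSentenceInstance(Locale.GERMAN));

    static List<String> sentences(String text) {
        BreakIterator it = GERMAN.get();
        it.setText(text);

        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Segments include trailing whitespace, so trim before collecting.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) out.add(sentence);
        }
        return out;
    }

    public static void main(String[] args) {
        sentences("Das ist ein Test. Und noch ein Satz.").forEach(System.out::println);
        // Das ist ein Test.
        // Und noch ein Satz.
    }
}
```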

Paragraph Splitter

`SimpleParagraphSplitter`

SimpleParagraphSplitter is a singleton that segments text into paragraphs by splitting on one or more blank lines. A blank line is any line containing only whitespace characters.

It also handles the Unicode paragraph separator character (U+2029).

Basic usage

java

import smile.nlp.tokenizer.SimpleParagraphSplitter;

SimpleParagraphSplitter splitter = SimpleParagraphSplitter.getInstance();

String document =
    "First paragraph with multiple sentences. It continues here.\n\n"
  + "Second paragraph begins after the blank line.\n\n"
  + "Third paragraph.";

for (String para : splitter.split(document)) {
    System.out.println("PARAGRAPH: " + para);
}
// PARAGRAPH: First paragraph with multiple sentences. It continues here.
// PARAGRAPH: Second paragraph begins after the blank line.
// PARAGRAPH: Third paragraph.

SimpleParagraphSplitter is stateless and thread-safe.
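The blank-line rule and the U+2029 case can be expressed as one regular expression. This is a minimal sketch of the described behaviour, not the pattern used inside SimpleParagraphSplitter; the class name, the BREAK pattern, and the trimming of results are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ParagraphSketch {
    // A line break followed by one or more whitespace-only lines, or the
    // Unicode paragraph separator U+2029.
    private static final Pattern BREAK =
            Pattern.compile("\\r?\\n(?:[ \\t]*\\r?\\n)+|\\u2029");

    static List<String> split(String text) {
        return Arrays.stream(BREAK.split(text))
                .map(String::trim)
                .filter(p -> !p.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(split("First.\n\nSecond.\n \t\nThird."));
        // [First., Second., Third.]
    }
}
```

A single newline inside a paragraph does not match the pattern, so hard-wrapped lines stay in the same paragraph.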

English Abbreviations — `EnglishAbbreviations`

EnglishAbbreviations is a package-private interface that exposes a static dictionary of common English abbreviations loaded from the classpath resource abbreviations_en.txt. It is used internally by SimpleSentenceSplitter and PennTreebankTokenizer to avoid splitting on abbreviation periods.

The dictionary includes titles (Mr, Mrs, Dr, Prof), calendar items (Jan, Feb, Mon, Tue), geographic terms (Ave, Blvd, St), Latin abbreviations (etc, vs, cf, al), and more. It is not directly accessible from outside the package.
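Since the dictionary itself is not accessible, a project that needs its own abbreviation list has to load one separately. The sketch below shows one plausible way to do that; the one-entry-per-line file format, the class name, and the load helper are all assumptions, and nothing here reflects how EnglishAbbreviations reads its resource. A classpath resource could be supplied by wrapping Class.getResourceAsStream in an InputStreamReader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class AbbreviationDictionarySketch {
    /** Reads one abbreviation per line (assumed format), skipping blank lines. */
    static Set<String> load(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            return reader.lines()
                    .map(String::trim)
                    .filter(line -> !line.isEmpty())
                    .collect(Collectors.toCollection(TreeSet::new));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Set<String> dictionary = load(new StringReader("Mr\nDr\n\netc\n"));
        System.out.println(dictionary.contains("Dr"));    // true
        System.out.println(dictionary.contains("Smith")); // false
    }
}
```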

Complete Pipeline Example

A typical NLP preprocessing pipeline works in three stages: paragraph → sentence → token.

java

import smile.nlp.tokenizer.*;
import smile.nlp.stemmer.PorterStemmer;
import smile.nlp.pos.*;

// ── Splitters & tokenizer ────────────────────────────────────────────
ParagraphSplitter paragraphSplitter = SimpleParagraphSplitter.getInstance();
SentenceSplitter  sentenceSplitter  = SimpleSentenceSplitter.getInstance();
Tokenizer         tokenizer         = PennTreebankTokenizer.getInstance(); // matches the tagger's Penn Treebank conventions
HMMPOSTagger      tagger            = HMMPOSTagger.getDefault();

ThreadLocal<PorterStemmer> tlStemmer = ThreadLocal.withInitial(PorterStemmer::new);

// ── Input document ───────────────────────────────────────────────────
String document =
    "Alan Turing was a British mathematician. "
  + "He proposed the Turing test in 1950.\n\n"
  + "His work laid the foundation for computer science.";

// ── Pipeline ─────────────────────────────────────────────────────────
PorterStemmer stemmer = tlStemmer.get();

for (String paragraph : paragraphSplitter.split(document)) {
    for (String sentence : sentenceSplitter.split(paragraph)) {
        String[] tokens = tokenizer.split(sentence);
        PennTreebankPOS[] tags = tagger.tag(tokens);

        for (int i = 0; i < tokens.length; i++) {
            if (tags[i].open) { // content word
                String stem = stemmer.stem(tokens[i].toLowerCase());
                System.out.printf("%-20s %-6s %s%n", tokens[i], tags[i], stem);
            }
        }
        System.out.println();
    }
}
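The same three-stage flow can be written as a stream pipeline. The sketch below is library-free so that it runs on its own: PARAGRAPHS, SENTENCES, and TOKENS are naive regex stand-ins (assumptions, far cruder than the SMILE splitters), each shaped like Function<String, String[]> so that a real splitter could be dropped into the same position.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PipelineSketch {
    // Naive stand-ins for the three splitter levels (illustration only).
    static final Function<String, String[]> PARAGRAPHS = text -> text.split("\\n\\s*\\n");
    static final Function<String, String[]> SENTENCES  = p -> p.split("(?<=[.?!])\\s+");
    static final Function<String, String[]> TOKENS     = s -> s.split("\\s+|(?=[.?!,])");

    /** paragraph -> sentence -> token, flattened to one token array per sentence. */
    static List<String[]> tokenizedSentences(String document) {
        return Arrays.stream(PARAGRAPHS.apply(document))
                .flatMap(paragraph -> Arrays.stream(SENTENCES.apply(paragraph)))
                .map(TOKENS)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String document =
            "Alan Turing was a British mathematician. "
          + "He proposed the Turing test in 1950.\n\n"
          + "His work laid the foundation for computer science.";

        for (String[] tokens : tokenizedSentences(document)) {
            System.out.println(Arrays.toString(tokens));
        }
        // [Alan, Turing, was, a, British, mathematician, .]
        // [He, proposed, the, Turing, test, in, 1950, .]
        // [His, work, laid, the, foundation, for, computer, science, .]
    }
}
```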

Choosing the Right Implementation

Word tokenizer

Scenario	Recommended
General English text	`SimpleTokenizer`
Penn Treebank / pre-trained NLP models	`PennTreebankTokenizer`
Non-English or multilingual	`BreakIteratorTokenizer`

Sentence splitter

Scenario	Recommended
English text (production use)	`SimpleSentenceSplitter`
Multilingual / locale-sensitive	`BreakIteratorSentenceSplitter`

Paragraph splitter

Scenario	Recommended
Any text with blank-line paragraph boundaries	`SimpleParagraphSplitter`

Thread-Safety Summary

Class	Thread-safe?	Notes
`SimpleTokenizer`	✅ Yes	Stateless after construction
`PennTreebankTokenizer`	✅ Yes	Stateless singleton
`BreakIteratorTokenizer`	❌ No	`BreakIterator` is not thread-safe; use `ThreadLocal`
`SimpleSentenceSplitter`	✅ Yes	Stateless singleton
`BreakIteratorSentenceSplitter`	❌ No	`BreakIterator` is not thread-safe; use `ThreadLocal`
`SimpleParagraphSplitter`	✅ Yes	Stateless singleton

API Quick-Reference

java

// ── Word tokenizers ──────────────────────────────────────────────────
Tokenizer simple    = new SimpleTokenizer();                    // thread-safe
Tokenizer ptb       = PennTreebankTokenizer.getInstance();      // singleton, thread-safe
Tokenizer biTok     = new BreakIteratorTokenizer();             // per-thread
Tokenizer biTokFr   = new BreakIteratorTokenizer(Locale.FRENCH);// locale-aware

String[] tokens = simple.split("He won't go.");
// [He, will, not, go, .]

// ── Sentence splitters ───────────────────────────────────────────────
SentenceSplitter ss  = SimpleSentenceSplitter.getInstance();    // singleton
SentenceSplitter bis = new BreakIteratorSentenceSplitter();     // per-thread
SentenceSplitter bde = new BreakIteratorSentenceSplitter(Locale.GERMAN);

String[] sentences = ss.split("Hello world. How are you?");
// [Hello world., How are you?]

// ── Paragraph splitter ───────────────────────────────────────────────
ParagraphSplitter ps = SimpleParagraphSplitter.getInstance();   // singleton
String[] paragraphs  = ps.split("Para one.\n\nPara two.");
// [Para one., Para two.]

// ── As Function in streams ───────────────────────────────────────────
String[][] allTokens = Arrays.stream(sentences)
        .map(simple)                    // Tokenizer IS a Function<String,String[]>
        .toArray(String[][]::new);

Notes and Caveats

Input assumptions: SimpleSentenceSplitter and the English word tokenizers (SimpleTokenizer and PennTreebankTokenizer) assume the input is a single paragraph with no embedded paragraph breaks. Split the text with SimpleParagraphSplitter first.
Sentence-final abbreviations: SimpleSentenceSplitter consults EnglishAbbreviations to avoid splitting on abbreviation periods, but the dictionary is not exhaustive. Domain-specific abbreviations may require a custom splitter.
Penn Treebank conventions: if you use PennTreebankTokenizer, make sure your downstream models (taggers, parsers) were trained on Penn Treebank tokenized data. Mixing conventions reduces accuracy.
Locale: BreakIterator-based classes are locale-aware but rely on the locale data bundled with the JVM. Results may vary across JVM vendors.
Empty tokens: all implementations filter out blank tokens, so the returned String[] never contains an empty string.
