nlp/TAXONOMY.md
The smile.nlp.taxonomy package provides a lightweight, in-memory rooted tree data structure
for representing taxonomies (also called concept hierarchies or IS-A hierarchies). The
package contains three public classes:
| Class | Role |
|---|---|
Taxonomy | The tree itself — construction, navigation, and serialization |
Concept | A single node, carrying a synonym set (synset) of keywords |
TaxonomicDistance | Edge-counting distance and three semantic-similarity measures |
Typical use cases include:
Every node in the tree is a Concept. A concept can be named (one or more keywords are
registered to it) or anonymous (no keywords, used as a structural grouping node). Multiple
keywords that refer to the same concept form a synset (synonym set) — exactly like WordNet.
animal ← named concept, keyword "animal"
mammal, warm-blooded ← named concept with two synonyms
dog, canine ← named concept with two synonyms
cat, feline
reptile
snake
All keywords inside a taxonomy must be unique: attempting to register a keyword that already
exists throws IllegalArgumentException.
A Taxonomy always has exactly one root node. The root may be anonymous or named.
Every other Concept has exactly one parent and zero or more children. The depth of the
root is 0; the height of a leaf is 0.
// 1. Create the taxonomy with an optional root keyword
Taxonomy taxonomy = new Taxonomy("entity");
Concept root = taxonomy.getRoot();
// 2. Add children with addChild(String...)
Concept physical = root.addChild("physical");
Concept living = physical.addChild("living");
Concept animal = living.addChild("animal");
Concept plant = living.addChild("plant");
// 3. Create a Concept with synonyms using the Concept constructor
Concept mammal = new Concept(animal, "mammal", "warm-blooded");
Concept dog = new Concept(mammal, "dog", "canine");
Concept cat = new Concept(mammal, "cat", "feline");
Taxonomy.of(String) parses a simple indented-text format, which is convenient for
configuration files or tests:
# lines starting with '#' are comments
animal
mammal, warm-blooded
dog, canine
cat, feline
reptile
snake
Rules:
String text = """
animal
mammal, warm-blooded
dog, canine
cat, feline
reptile
snake
""";
Taxonomy t = Taxonomy.of(text);
Concept mammal = t.getConcept("mammal");
Concept canine = t.getConcept("canine");
// dog and canine are the same concept node
assert t.getConcept("dog") == t.getConcept("canine");
Concept c = taxonomy.getConcept("dog"); // null if not found
boolean exists = taxonomy.contains("cat");
List<String> all = taxonomy.getConcepts(); // all registered keywords
int depth = taxonomy.depth("dog"); // 0 = root; -1 if not found
int height = taxonomy.height(); // max depth of any node
int size = taxonomy.size(); // number of keywords (named concepts)
int nodes = taxonomy.nodeCount(); // total nodes, including anonymous ones
Concept dog = taxonomy.getConcept("dog");
boolean leaf = dog.isLeaf();
int depth = dog.depth(); // depth from root
int height = dog.height(); // height of sub-tree
int subtreeSize = dog.subtreeSize(); // keyword count in sub-tree
Set<String> keywords = dog.keywords();
List<Concept> children = dog.children();
List<Concept> siblings = dog.siblings();
List<Concept> fromRoot = dog.getPathFromRoot();
List<Concept> toRoot = dog.getPathToRoot();
taxonomy.isAncestor("mammal", "dog"); // true
taxonomy.isDescendant("dog", "mammal"); // true — symmetric helper
dog.isAncestorOf(mammal); // false
dog.isDescendantOf(mammal); // true
Concept lca = taxonomy.lowestCommonAncestor("dog", "snake");
// → "animal" (the deepest node that is an ancestor of both)
The LCA of a node with itself is the node itself.
List<Concept> path = taxonomy.shortestPath("dog", "snake");
// dog → mammal → animal → reptile → snake
// path.size() == distance + 1
The path goes up from the first concept to the LCA and then down to the second concept.
// All nodes at a given depth
List<Concept> level2 = taxonomy.level(2);
// Breadth-first order (root first)
List<Concept> bfsOrder = taxonomy.bfs();
// Depth-first pre-order traversal with a consumer
taxonomy.forEach(node -> System.out.println(node));
// All leaf nodes
List<Concept> leaves = taxonomy.leaves();
// All keywords under a given concept
List<String> subWords = taxonomy.subtree("mammal");
Concept dog = taxonomy.getConcept("dog");
dog.addKeywords("hound", "pooch"); // add synonyms
dog.removeKeyword("pooch"); // remove one synonym
Concept newNode = parent.addChild("newConcept"); // returns the new Concept
// Or create and attach later:
Concept extra = new Concept(parent, "extra", "additional");
// Remove a child (and its entire sub-tree)
parent.removeChild(child); // also removes all descendant keywords from index
Taxonomy.toString() renders the tree using Unicode box-drawing characters:
Taxonomy t = Taxonomy.of("""
animal
mammal
dog
cat
reptile
snake
""");
System.out.println(t);
Output:
[animal]
├── [mammal]
│ ├── [dog]
│ └── [cat]
└── [reptile]
└── [snake]
Anonymous nodes are shown as (anon).
TaxonomicDistance implements smile.math.distance.Distance<Concept> and wraps a Taxonomy
to provide four measures.
TaxonomicDistance td = new TaxonomicDistance(taxonomy);
The number of edges on the shortest path between two concepts through their LCA:
d(a, b) = depth(a) + depth(b) − 2 × depth(LCA(a, b))
double dist = td.d("dog", "snake"); // by keyword
double dist = td.d(dogConcept, snakeConcept); // by Concept object
Properties: symmetric, 0 iff identical, integer-valued.
The raw distance divided by the diameter of the taxonomy (2 × height), yielding a value in
[0, 1]:
double nd = td.normalizedDistance("dog", "snake"); // 0.0 → 1.0
Proposed by Wu & Palmer (1994), based on the depth of the LCA relative to the depths of the two concepts:
sim(a, b) = 2 × depth(LCA) / (depth(a) + depth(b))
Returns 1 when both concepts are identical, and approaches 0 when the LCA is near the root.
double sim = td.wuPalmer("dog", "cat"); // 0.0 → 1.0
Proposed by Leacock & Chodorow (1998), combining edge-counting distance with the overall
taxonomy height H:
raw = −log(d(a, b) / (2 × H))
norm = raw / log(2 × H) ∈ [0, 1]
Higher similarity means fewer edges between the two concepts. Returns 1 for identical concepts.
double sim = td.leacockChodorow("dog", "cat"); // 0.0 → 1.0
The raw score is normalized by dividing by log(2H) so that it always falls in [0, 1].
Proposed by Lin (1998), using information content (IC) as a proxy for specificity. When no external corpus is available, depth in the taxonomy serves as the IC proxy:
IC(c) = −log((depth(c) + 1) / (H + 1))
sim(a, b) = 2 × IC(LCA) / (IC(a) + IC(b))
Deeper (more specific) concepts have higher IC. Returns 1 for identical concepts, 0 when both concepts are at the root and the taxonomy has a single level.
double sim = td.lin("dog", "cat"); // 0.0 → 1.0
| Measure | Formula | Corpus needed? | Notes |
|---|---|---|---|
| Wu-Palmer | 2·d(LCA) / (d(a)+d(b)) | No | Simple, fast, widely used |
| Leacock-Chodorow | −log(dist/(2H)) normalized | No | Sensitive to path length |
| Lin | 2·IC(LCA) / (IC(a)+IC(b)) | No (depth proxy) | Information-theoretic foundation |
import smile.nlp.taxonomy.*;
public class TaxonomyDemo {
public static void main(String[] args) {
// Build the taxonomy from indented text
String text = """
entity
physical
living
animal
mammal, warm-blooded
dog, canine
cat, feline
whale
reptile
snake
lizard
non-living
rock
water
abstract
number
concept
""";
Taxonomy tax = Taxonomy.of(text);
// Visualize
System.out.println(tax);
// Navigate
System.out.println("Height : " + tax.height());
System.out.println("Size : " + tax.size() + " keywords");
System.out.println("Nodes : " + tax.nodeCount());
// LCA
Concept lca = tax.lowestCommonAncestor("dog", "snake");
System.out.println("LCA(dog, snake) = " + lca); // [animal]
// Path
tax.shortestPath("dog", "snake")
.forEach(c -> System.out.print(c + " → "));
System.out.println();
// Distances and similarities
TaxonomicDistance td = new TaxonomicDistance(tax);
System.out.printf("d(dog, cat) = %.1f%n", td.d("dog", "cat"));
System.out.printf("normalized(dog, cat) = %.3f%n", td.normalizedDistance("dog", "cat"));
System.out.printf("Wu-Palmer(dog, cat) = %.3f%n", td.wuPalmer("dog", "cat"));
System.out.printf("Leacock-Chodorow(dog,cat) = %.3f%n", td.leacockChodorow("dog", "cat"));
System.out.printf("Lin(dog, cat) = %.3f%n", td.lin("dog", "cat"));
System.out.printf("%ndog vs snake:%n");
System.out.printf("Wu-Palmer = %.3f%n", td.wuPalmer("dog", "snake"));
System.out.printf("Lin = %.3f%n", td.lin("dog", "snake"));
// Subtree
System.out.println("Mammals: " + tax.subtree("mammal"));
// Leaves
System.out.println("Leaves: " + tax.leaves().stream()
.map(c -> c.keywords().iterator().next())
.toList());
}
}
Taxonomy| Method | Description |
|---|---|
Taxonomy(String... keywords) | Create taxonomy with an optional named root |
getRoot() | Returns the root Concept |
getConcept(String keyword) | Look up a concept by keyword; null if not found |
contains(String keyword) | true if the keyword is registered |
getConcepts() | All registered keywords |
size() | Number of keywords |
nodeCount() | Total tree nodes (including anonymous) |
depth(String keyword) | Depth of a keyword (-1 if not found) |
height() | Maximum depth of any node |
lowestCommonAncestor(String, String) | LCA by keyword |
lowestCommonAncestor(Concept, Concept) | LCA by node reference |
isAncestor(String ancestor, String descendant) | Ancestor test by keyword |
isDescendant(String descendant, String ancestor) | Descendant test by keyword |
shortestPath(String, String) | Ordered node list from v to w |
shortestPath(Concept, Concept) | Ordered node list from v to w |
subtree(String keyword) | All keywords in sub-tree rooted at keyword |
bfs() | All nodes in breadth-first order |
leaves() | All leaf nodes |
level(int depth) | All nodes at a given depth |
forEach(Consumer<Concept>) | DFS pre-order traversal |
toString() | Pretty-print the tree with box-drawing characters |
Taxonomy.of(String text) | Parse from indented text (static factory) |
Concept| Method | Description |
|---|---|
keywords() | The synonym set |
addKeywords(String... kws) | Add synonyms |
removeKeyword(String kw) | Remove a synonym |
children() | Child concept list |
addChild(String kw) | Create and attach a named child |
addChild(Concept c) | Attach an existing concept as a child |
removeChild(Concept c) | Detach a child (and prune its sub-tree from the index) |
siblings() | All sibling concepts |
isLeaf() | true if no children |
depth() | Depth from root |
height() | Height of sub-tree rooted here |
subtreeSize() | Keyword count in sub-tree |
isAncestorOf(Concept c) | true if this is a proper ancestor of c |
isDescendantOf(Concept c) | true if this is a proper descendant of c |
getPathFromRoot() | Ordered list from root to this node |
getPathToRoot() | Ordered list from this node to root |
TaxonomicDistance| Method | Description |
|---|---|
d(String x, String y) | Edge-counting distance |
d(Concept x, Concept y) | Edge-counting distance (by node) |
normalizedDistance(String x, String y) | Normalized distance in [0, 1] |
wuPalmer(String x, String y) | Wu-Palmer similarity in (0, 1] |
leacockChodorow(String x, String y) | Leacock-Chodorow similarity in [0, 1] |
lin(String x, String y) | Lin similarity in [0, 1] |
SMILE — © 2010-2026 Haifeng Li. GNU GPL licensed.