Back to Smile

SMILE — Hash Functions User Guide

base/HASH.md

6.1.010.7 KB
Original Source

SMILE — Hash Functions User Guide

The smile.hash package provides a collection of high-performance, non-cryptographic hash functions and data structures built on top of them. It currently contains five classes:

ClassPurpose
MurmurHash2Fast 32-bit and 64-bit non-cryptographic hash for raw byte data
MurmurHash3Improved 32-bit and 128-bit non-cryptographic hash for raw byte data
PerfectHashCompile-time-style perfect hash mapping a fixed string set to [0, n)
PerfectMapImmutable String → T map backed by PerfectHash
SimHashLocality-sensitive fingerprinting for near-duplicate detection

1. MurmurHash2

MurmurHash2 is an extremely fast, non-cryptographic hash designed by Austin Appleby. The name comes from the two basic operations in its inner loop: multiply (MU) and rotate (R). It is well-suited to hash-table construction, checksumming, and any task that needs a fast, well-distributed digest but does not require cryptographic strength.

SMILE's implementation is adapted from Apache Cassandra and is provided as a static interface (MurmurHash2) with two overloads:

MurmurHash2.hash32(ByteBuffer data, int offset, int length, int seed)  → int
MurmurHash2.hash64(ByteBuffer data, int offset, int length, long seed) → long

1.1 32-bit hash

java
import smile.hash.MurmurHash2;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

byte[] data = "hello world".getBytes(StandardCharsets.UTF_8);
ByteBuffer buf = ByteBuffer.wrap(data);

int h32 = MurmurHash2.hash32(buf, 0, data.length, 42 /*seed*/);
System.out.printf("MurmurHash2-32 = 0x%08X%n", h32);

1.2 64-bit hash

java
long h64 = MurmurHash2.hash64(buf, 0, data.length, 0L /*seed*/);
System.out.printf("MurmurHash2-64 = 0x%016X%n", h64);

1.3 Properties

PropertyValue
Output width32 or 64 bits
EndiannessLittle-endian
Seed supportYes — different seeds produce independent hash families
CryptographicNo
Suitable forHash tables, bloom filters, checksums, sharding

2. MurmurHash3

MurmurHash3 supersedes MurmurHash2 with better avalanche behavior and wider output options. SMILE provides two variants, both adapted from Apache Cassandra:

MurmurHash3.hash32(String text, int seed)                                → int
MurmurHash3.hash32(byte[] data, int offset, int length, int seed)        → int
MurmurHash3.hash128(ByteBuffer data, int offset, int length,
                    long seed, long[] result)                             → void

The 128-bit variant implements the x64 flavour: two independent 64-bit lanes that are mixed together in the finalization step.

Note: The x86 and x64 128-bit variants produce different outputs for the same input. SMILE implements only the x64 variant.

2.1 32-bit hash from a String

java
import smile.hash.MurmurHash3;

int h = MurmurHash3.hash32("the quick brown fox", 0);

2.2 32-bit hash from a byte array

java
byte[] data = "the quick brown fox".getBytes(StandardCharsets.UTF_8);
int h = MurmurHash3.hash32(data, 0, data.length, 0);

2.3 128-bit hash (x64)

The result is written into a caller-supplied long[2] array: result[0] = lower 64 bits, result[1] = upper 64 bits.

java
byte[] data = "the quick brown fox".getBytes(StandardCharsets.UTF_8);
ByteBuffer buf = ByteBuffer.wrap(data);
long[] result = new long[2];
MurmurHash3.hash128(buf, 0, data.length, 0L, result);
System.out.printf("h1=0x%016X  h2=0x%016X%n", result[0], result[1]);

2.4 Choosing between MurmurHash2 and MurmurHash3

CriterionMurmurHash2MurmurHash3
Output widths32, 6432, 128
Speed (64-bit JVM)Slightly faster 64-bit pathSlightly faster 32-bit path
Avalanche qualityGoodExcellent
RecommendationLegacy code / Cassandra compatibilityNew code

3. PerfectHash

PerfectHash builds a minimal perfect hash over a fixed set of strings known at construction time. Every key maps to a unique integer in [0, n) in O(length) time — there are no collisions and no chaining.

3.1 How it works

The hash function is:

h(w) = (len(w) + Σ kvals[w[i] − min]) mod tsize

where kvals is an integer array indexed by (character − min_char) and tsize is the table size (≥ n). Construction fills kvals with random integers and checks whether all n keys map to distinct slots. Failed attempts draw a new random kvals; with a load factor of ≈ 2/3 this succeeds in O(1) expected trials. The final lookup table maps each hash slot to the index of the keyword stored there, or −1 for empty slots; correctness is verified by comparing the stored keyword against the query string.

3.2 Basic usage

java
import smile.hash.PerfectHash;

PerfectHash hash = new PerfectHash("abstract", "boolean", "break",
                                   "class", "continue", "default");

int idx = hash.get("break");     // returns 2 (index in the array above)
int miss = hash.get("integer");  // returns -1 — not in the set

3.3 Selecting character positions

For performance with long strings you can tell the hash to use only specific character positions. This is analogous to the % keyword option in gperf.

java
// Only look at characters at positions 0 and 1.
PerfectHash hash = new PerfectHash(new int[]{0, 1}, keywords);

Choosing positions that maximise distinctiveness (e.g., the first and last characters of a keyword) typically reduces construction time.

3.4 Constraints

  • The keyword array must be non-empty.
  • Duplicate keywords throw IllegalArgumentException.
  • The set is frozen at construction time — there is no way to add or remove keywords after construction.
  • PerfectHash is Serializable and thread-safe for concurrent reads.

4. PerfectMap

PerfectMap<T> is an immutable, perfect-hash-backed String → T map. Lookup is O(key length) with zero allocation on the hot path. It is ideal for small, fixed dictionaries that are read very frequently (e.g., HTTP header tables, language keyword→token mappings, configuration key→handler tables).

4.1 Building with the fluent Builder

java
import smile.hash.PerfectMap;

PerfectMap<Integer> tokenTypes = new PerfectMap.Builder<Integer>()
    .add("abstract",   1)
    .add("boolean",    2)
    .add("break",      3)
    .add("class",      4)
    .add("continue",   5)
    .add("default",    6)
    .build();

int type = tokenTypes.get("break");   // 3
Object miss = tokenTypes.get("goto"); // null

4.2 Building from an existing Map

java
Map<String, String> source = Map.of(
    "Content-Type",     "application/json",
    "Accept",           "application/json",
    "Authorization",    "Bearer …"
);

PerfectMap<String> headers = new PerfectMap.Builder<>(source).build();
String ct = headers.get("Content-Type"); // "application/json"

4.3 Null semantics

get(key) returns null for any key that was not in the original key set. It never throws for well-formed (non-null) query strings.


5. SimHash

SimHash is a locality-sensitive fingerprinting algorithm. Unlike ordinary hash functions, two similar inputs produce fingerprints with a small Hamming distance, while dissimilar inputs produce fingerprints with a large Hamming distance. This makes SimHash useful for near-duplicate detection in large document collections (it is used by the Google crawler for exactly this purpose).

Each fingerprint is a 64-bit long. The Hamming distance between two fingerprints is Long.bitCount(h1 ^ h2). Documents are typically considered near-duplicates when their Hamming distance is ≤ 3.

5.1 Text similarity (tokenised documents)

Use SimHash.text() to get a SimHash<String[]> that treats each element of the array as a token. Internally, every token is hashed with MurmurHash2-64 and its hash bits vote (+1 or −1) on 64 counters; the sign of each counter becomes one bit of the fingerprint.

java
import smile.hash.SimHash;

SimHash<String[]> sh = SimHash.text();

String[] doc1 = {"the", "quick", "brown", "fox", "jumps"};
String[] doc2 = {"the", "quick", "brown", "fox", "leaps"}; // one word different

long h1 = sh.hash(doc1);
long h2 = sh.hash(doc2);

int hamming = Long.bitCount(h1 ^ h2);
System.out.println("Hamming distance = " + hamming); // small (≤ 5 typically)

5.2 Weighted feature vectors

Use SimHash.of(byte[][] features) when each document is represented as a sparse vector of pre-defined feature weights. Each feature is identified by its byte representation; the weight array passed to hash(int[] weights) must be the same length as the features array.

java
byte[][] features = {
    "title".getBytes(),
    "body".getBytes(),
    "tags".getBytes()
};
SimHash<int[]> sh = SimHash.of(features);

// A document with title weight 5, body weight 3, tags weight 1
long fp = sh.hash(new int[]{5, 3, 1});

5.3 Near-duplicate detection pipeline

A typical pipeline computes fingerprints for all documents and then uses a hash table keyed on the fingerprint to find exact duplicates, or a Hamming-distance index (e.g., sorted permutations) to find near-duplicates:

java
SimHash<String[]> sh = SimHash.text();
Map<Long, String> seen = new HashMap<>();

for (var entry : corpus.entrySet()) {
    String id  = entry.getKey();
    String[] tokens = entry.getValue();
    long fp = sh.hash(tokens);

    // Exact fingerprint match → likely duplicate
    String prev = seen.putIfAbsent(fp, id);
    if (prev != null) {
        System.out.println(id + " is a near-duplicate of " + prev);
    }
}

For large corpora, partition the 64 bits into k bands of b bits each (standard LSH technique) and bucket documents by each band independently to find pairs within Hamming distance ≤ b·k efficiently.

5.4 Properties

PropertyValue
Output width64 bits
Similarity metricHamming distance
Internal hashMurmurHash2-64
Near-duplicate thresholdHamming ≤ 3 (configurable by application)
Thread safetyStateless — the SimHash instance is thread-safe

6. Quick-reference table

Use caseRecommended class / method
Hash a byte buffer into a 32-bit intMurmurHash2.hash32
Hash a byte buffer into a 64-bit longMurmurHash2.hash64
Hash a string into a 32-bit intMurmurHash3.hash32(String, seed)
Hash a byte array into a 128-bit valueMurmurHash3.hash128
O(1) lookup in a fixed string setPerfectHash
Immutable String → T dictionaryPerfectMap
Near-duplicate document detectionSimHash.text()
Weighted feature-vector fingerprintingSimHash.of(byte[][])

SMILE — © 2010-2026 Haifeng Li. GNU GPL licensed.