base/HASH.md
The smile.hash package provides a collection of high-performance, non-cryptographic
hash functions and data structures built on top of them. It currently contains five
classes:
| Class | Purpose |
|---|---|
MurmurHash2 | Fast 32-bit and 64-bit non-cryptographic hash for raw byte data |
MurmurHash3 | Improved 32-bit and 128-bit non-cryptographic hash for raw byte data |
PerfectHash | Compile-time-style perfect hash mapping a fixed string set to [0, n) |
PerfectMap | Immutable String → T map backed by PerfectHash |
SimHash | Locality-sensitive fingerprinting for near-duplicate detection |
MurmurHash2 is an extremely fast, non-cryptographic hash designed by Austin Appleby. The name comes from the two basic operations in its inner loop: multiply (MU) and rotate (R). It is well-suited to hash-table construction, checksumming, and any task that needs a fast, well-distributed digest but does not require cryptographic strength.
SMILE's implementation is adapted from Apache Cassandra and is provided as a static
interface (MurmurHash2) with two overloads:
MurmurHash2.hash32(ByteBuffer data, int offset, int length, int seed) → int
MurmurHash2.hash64(ByteBuffer data, int offset, int length, long seed) → long
import smile.hash.MurmurHash2;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
byte[] data = "hello world".getBytes(StandardCharsets.UTF_8);
ByteBuffer buf = ByteBuffer.wrap(data);
int h32 = MurmurHash2.hash32(buf, 0, data.length, 42 /*seed*/);
System.out.printf("MurmurHash2-32 = 0x%08X%n", h32);
long h64 = MurmurHash2.hash64(buf, 0, data.length, 0L /*seed*/);
System.out.printf("MurmurHash2-64 = 0x%016X%n", h64);
| Property | Value |
|---|---|
| Output width | 32 or 64 bits |
| Endianness | Little-endian |
| Seed support | Yes — different seeds produce independent hash families |
| Cryptographic | No |
| Suitable for | Hash tables, bloom filters, checksums, sharding |
MurmurHash3 supersedes MurmurHash2 with better avalanche behavior and wider output options. SMILE provides two variants, both adapted from Apache Cassandra:
MurmurHash3.hash32(String text, int seed) → int
MurmurHash3.hash32(byte[] data, int offset, int length, int seed) → int
MurmurHash3.hash128(ByteBuffer data, int offset, int length,
long seed, long[] result) → void
The 128-bit variant implements the x64 flavour: two independent 64-bit lanes that are mixed together in the finalization step.
Note: The x86 and x64 128-bit variants produce different outputs for the same input. SMILE implements only the x64 variant.
import smile.hash.MurmurHash3;
int h = MurmurHash3.hash32("the quick brown fox", 0);
byte[] data = "the quick brown fox".getBytes(StandardCharsets.UTF_8);
int h = MurmurHash3.hash32(data, 0, data.length, 0);
The result is written into a caller-supplied long[2] array: result[0] = lower
64 bits, result[1] = upper 64 bits.
byte[] data = "the quick brown fox".getBytes(StandardCharsets.UTF_8);
ByteBuffer buf = ByteBuffer.wrap(data);
long[] result = new long[2];
MurmurHash3.hash128(buf, 0, data.length, 0L, result);
System.out.printf("h1=0x%016X h2=0x%016X%n", result[0], result[1]);
| Criterion | MurmurHash2 | MurmurHash3 |
|---|---|---|
| Output widths | 32, 64 | 32, 128 |
| Speed (64-bit JVM) | Slightly faster 64-bit path | Slightly faster 32-bit path |
| Avalanche quality | Good | Excellent |
| Recommendation | Legacy code / Cassandra compatibility | New code |
PerfectHash builds a minimal perfect hash over a fixed set of strings known
at construction time. Every key maps to a unique integer in [0, n) in O(length)
time — there are no collisions and no chaining.
The hash function is:
h(w) = (len(w) + Σ kvals[w[i] − min]) mod tsize
where kvals is an integer array indexed by (character − min_char) and tsize
is the table size (≥ n). Construction fills kvals with random integers and
checks whether all n keys map to distinct slots. Failed attempts draw a new
random kvals; with a load factor of ≈ 2/3 this succeeds in O(1) expected trials.
The final lookup table maps each hash slot to the index of the keyword stored there,
or −1 for empty slots; correctness is verified by comparing the stored keyword
against the query string.
import smile.hash.PerfectHash;
PerfectHash hash = new PerfectHash("abstract", "boolean", "break",
"class", "continue", "default");
int idx = hash.get("break"); // returns 2 (index in the array above)
int miss = hash.get("integer"); // returns -1 — not in the set
For performance with long strings you can tell the hash to use only specific
character positions. This is analogous to the % keyword option in gperf.
// Only look at characters at positions 0 and 1.
PerfectHash hash = new PerfectHash(new int[]{0, 1}, keywords);
Choosing positions that maximise distinctiveness (e.g., the first and last characters of a keyword) typically reduces construction time.
IllegalArgumentException.PerfectHash is Serializable and thread-safe for concurrent reads.PerfectMap<T> is an immutable, perfect-hash-backed String → T map. Lookup
is O(key length) with zero allocation on the hot path. It is ideal for small,
fixed dictionaries that are read very frequently (e.g., HTTP header tables,
language keyword→token mappings, configuration key→handler tables).
import smile.hash.PerfectMap;
PerfectMap<Integer> tokenTypes = new PerfectMap.Builder<Integer>()
.add("abstract", 1)
.add("boolean", 2)
.add("break", 3)
.add("class", 4)
.add("continue", 5)
.add("default", 6)
.build();
int type = tokenTypes.get("break"); // 3
Object miss = tokenTypes.get("goto"); // null
Map<String, String> source = Map.of(
"Content-Type", "application/json",
"Accept", "application/json",
"Authorization", "Bearer …"
);
PerfectMap<String> headers = new PerfectMap.Builder<>(source).build();
String ct = headers.get("Content-Type"); // "application/json"
get(key) returns null for any key that was not in the original key set.
It never throws for well-formed (non-null) query strings.
SimHash is a locality-sensitive fingerprinting algorithm. Unlike ordinary
hash functions, two similar inputs produce fingerprints with a small Hamming
distance, while dissimilar inputs produce fingerprints with a large Hamming
distance. This makes SimHash useful for near-duplicate detection in large
document collections (it is used by the Google crawler for exactly this purpose).
Each fingerprint is a 64-bit long. The Hamming distance between two fingerprints
is Long.bitCount(h1 ^ h2). Documents are typically considered near-duplicates
when their Hamming distance is ≤ 3.
Use SimHash.text() to get a SimHash<String[]> that treats each element of the
array as a token. Internally, every token is hashed with MurmurHash2-64 and its
hash bits vote (+1 or −1) on 64 counters; the sign of each counter becomes one bit
of the fingerprint.
import smile.hash.SimHash;
SimHash<String[]> sh = SimHash.text();
String[] doc1 = {"the", "quick", "brown", "fox", "jumps"};
String[] doc2 = {"the", "quick", "brown", "fox", "leaps"}; // one word different
long h1 = sh.hash(doc1);
long h2 = sh.hash(doc2);
int hamming = Long.bitCount(h1 ^ h2);
System.out.println("Hamming distance = " + hamming); // small (≤ 5 typically)
Use SimHash.of(byte[][] features) when each document is represented as a sparse
vector of pre-defined feature weights. Each feature is identified by its byte
representation; the weight array passed to hash(int[] weights) must be the same
length as the features array.
byte[][] features = {
"title".getBytes(),
"body".getBytes(),
"tags".getBytes()
};
SimHash<int[]> sh = SimHash.of(features);
// A document with title weight 5, body weight 3, tags weight 1
long fp = sh.hash(new int[]{5, 3, 1});
A typical pipeline computes fingerprints for all documents and then uses a hash table keyed on the fingerprint to find exact duplicates, or a Hamming-distance index (e.g., sorted permutations) to find near-duplicates:
SimHash<String[]> sh = SimHash.text();
Map<Long, String> seen = new HashMap<>();
for (var entry : corpus.entrySet()) {
String id = entry.getKey();
String[] tokens = entry.getValue();
long fp = sh.hash(tokens);
// Exact fingerprint match → likely duplicate
String prev = seen.putIfAbsent(fp, id);
if (prev != null) {
System.out.println(id + " is a near-duplicate of " + prev);
}
}
For large corpora, partition the 64 bits into k bands of b bits each (standard LSH technique) and bucket documents by each band independently to find pairs within Hamming distance ≤ b·k efficiently.
| Property | Value |
|---|---|
| Output width | 64 bits |
| Similarity metric | Hamming distance |
| Internal hash | MurmurHash2-64 |
| Near-duplicate threshold | Hamming ≤ 3 (configurable by application) |
| Thread safety | Stateless — the SimHash instance is thread-safe |
| Use case | Recommended class / method |
|---|---|
| Hash a byte buffer into a 32-bit int | MurmurHash2.hash32 |
| Hash a byte buffer into a 64-bit long | MurmurHash2.hash64 |
| Hash a string into a 32-bit int | MurmurHash3.hash32(String, seed) |
| Hash a byte array into a 128-bit value | MurmurHash3.hash128 |
| O(1) lookup in a fixed string set | PerfectHash |
Immutable String → T dictionary | PerfectMap |
| Near-duplicate document detection | SimHash.text() |
| Weighted feature-vector fingerprinting | SimHash.of(byte[][]) |
SMILE — © 2010-2026 Haifeng Li. GNU GPL licensed.