Back to Beads

Hash ID Collision Mathematics

docs/COLLISION_MATH.md

1.0.34.8 KB
Original Source

Hash ID Collision Mathematics

This document explains the collision probability calculations for beads' adaptive hash-based IDs and the thresholds used for automatic length scaling.

Birthday Paradox Formula

The collision probability for hash IDs is calculated using the birthday paradox:

P(collision) ≈ 1 - e^(-n²/2N)

Where:

  • n = number of issues in database
  • N = total possible IDs = 36^length (lowercase alphanumeric: [a-z0-9])

Collision Probability Table

DB Size4-char5-char6-char7-char8-char
500.07%0.00%0.00%0.00%0.00%
1000.30%0.01%0.00%0.00%0.00%
2001.18%0.03%0.00%0.00%0.00%
5007.17%0.21%0.01%0.00%0.00%
1,00025.75%0.82%0.02%0.00%0.00%
2,00069.60%3.25%0.09%0.00%0.00%
5,00099.94%18.68%0.57%0.02%0.00%
10,000100%56.26%2.27%0.06%0.00%

Key Insights

  • 4-char IDs are safe up to ~500 issues (7% collision risk)
  • 5-char IDs are safe up to ~1,500 issues (2% collision risk)
  • 6-char IDs are safe up to ~10,000 issues (2% collision risk)
  • 7-char IDs support 100,000+ issues with negligible collision risk
  • 8-char IDs support millions of issues

Expected Number of Collisions

This shows the average number of actual hash collisions you'll encounter:

DB Size4-char5-char6-char7-char8-char
1000.000.000.000.000.00
5000.070.000.000.000.00
1,0000.300.010.000.000.00
2,0001.190.030.000.000.00
5,0007.440.210.010.000.00
10,00029.770.830.020.000.00

Example: With 5,000 issues using 4-char IDs, you'll likely see ~7 hash collisions (automatically retried with +1 nonce).

Adaptive Scaling Strategy

Beads automatically increases ID length when the collision probability exceeds 25% (configurable via max_collision_prob).

Default Thresholds (25% max collision)

Database SizeID LengthCollision Probability at Max
0-5004 chars7.17% at 500 issues
501-1,5005 chars1.84% at 1,500 issues
1,501-5,0005 chars18.68% at 5,000 issues
5,001-15,0006 chars5.04% at 15,000 issues
15,001+continues scaling as needed

Why 25%?

The 25% threshold balances:

  • Readability: Keep IDs short for small databases
  • Safety: Avoid frequent collision retries
  • Scalability: Grow gracefully as database expands

Even at 25% collision probability, the expected number of actual collisions is low (< 1 collision per 1,000 issues created).

Alternative Thresholds

You can customize the threshold with bd config set max_collision_prob <value>:

Conservative (10% threshold)

DB SizeID Length
0-2004 chars
201-1,0005 chars
1,001-5,0006 chars
5,001+continues scaling

Aggressive (50% threshold)

DB SizeID Length
0-5004 chars
501-2,0005 chars
2,001-10,0006 chars
10,001+continues scaling

Collision Resolution

When a hash collision occurs (same ID generated twice), beads automatically:

  1. Tries base length with different nonce (10 attempts)
  2. Tries base+1 length with different nonce (10 attempts)
  3. Tries base+2 length with different nonce (10 attempts)

Total: 30 attempts before failing (astronomically unlikely).

Example with 4-char base:

  • bd-a3f2 (nonce 0) - collision!
  • bd-a3f2 (nonce 1) - collision again!
  • bd-b7d4 (nonce 2) - success! ✓

Mathematical Properties

ID Space Size

LengthPossible IDsNotation
3 chars46,65636³
4 chars1,679,61636⁴ ≈ 1.7M
5 chars60,466,17636⁵ ≈ 60M
6 chars2,176,782,33636⁶ ≈ 2.2B
7 chars78,364,164,09636⁷ ≈ 78B
8 chars2,821,109,907,45636⁸ ≈ 2.8T

Why Alphanumeric?

Using [a-z0-9] (36 characters) instead of hex (16 characters):

  • 4-char alphanumeric6-char hex in capacity
  • More readable: bd-a3f2 vs bd-a3f2e1
  • Easier to type and communicate

Verification

Run the collision calculator yourself:

bash
go run scripts/collision-calculator.go

This generates the tables above and shows adaptive scaling strategy for any threshold.