en/docs/chapter_hashing/hash_collision.md
The previous section mentioned that, in most cases, the input space of a hash function is much larger than the output space, so theoretically, hash collisions are inevitable. For example, if the input space is all integers and the output space is the array capacity size, then multiple integers will inevitably be mapped to the same bucket index.
Hash collisions can lead to incorrect query results, severely impacting the usability of the hash table. To address this issue, whenever a hash collision occurs, we can perform hash table expansion until the collision disappears. This approach is simple, straightforward, and effective, but it is very inefficient because hash table expansion involves a large amount of data migration and hash value recalculation. To improve efficiency, we can adopt the following strategies:
The main methods for improving the structure of hash tables include "separate chaining" and "open addressing".
In the original hash table, each bucket can store only one key-value pair. <u>Separate chaining</u> converts a single element into a linked list, treating key-value pairs as linked list nodes and storing all colliding key-value pairs in the same linked list. The figure below shows an example of a separate chaining hash table.
The operations of a hash table implemented with separate chaining have changed as follows:
key, obtain the bucket index through the hash function, then access the head node of the linked list, then traverse the linked list and compare key to find the target key-value pair.Separate chaining has the following limitations:
The code below provides a simple implementation of a separate chaining hash table, with two things to note:
[file]{hash_map_chaining}-[class]{hash_map_chaining}-[func]{}
It's worth noting that when the linked list is very long, the query efficiency $O(n)$ is poor. In this case, the list can be converted to an "AVL tree" or "Red-Black tree" to optimize the time complexity of the query operation to $O(\log n)$.
<u>Open addressing</u> does not introduce additional data structures but instead handles hash collisions through "multiple probes". The probing methods mainly include linear probing, quadratic probing, and double hashing.
Let's use linear probing as an example to introduce the mechanism of open addressing hash tables.
Linear probing uses a fixed-step linear search for probing, and its operation method differs from ordinary hash tables.
value; if an empty bucket is encountered, it means the target element is not in the hash table, so return None.The figure below shows the distribution of key-value pairs in an open addressing (linear probing) hash table. According to this hash function, keys with the same last two digits will be mapped to the same bucket. Through linear probing, they are stored sequentially in that bucket and the buckets below it.
However, linear probing is prone to create "clustering". Specifically, the longer the continuously occupied positions in the array, the greater the probability of hash collisions occurring in these continuous positions, further promoting clustering growth at that position, forming a vicious cycle, and ultimately leading to degraded efficiency of insertion, deletion, query, and update operations.
It's important to note that we cannot directly delete elements in an open addressing hash table. Deleting an element creates an empty bucket None in the array. When searching for elements, if linear probing encounters this empty bucket, it will return, making the elements below this empty bucket inaccessible. The program may incorrectly assume these elements do not exist, as shown in the figure below.
To solve this problem, we can adopt the <u>lazy deletion</u> mechanism: instead of directly removing elements from the hash table, use a constant TOMBSTONE to mark the bucket. In this mechanism, both None and TOMBSTONE represent empty buckets and can hold key-value pairs. However, when linear probing encounters TOMBSTONE, it should continue traversing since there may still be key-value pairs below it.
However, lazy deletion may accelerate the performance degradation of the hash table. Every deletion operation produces a deletion mark, and as TOMBSTONE increases, the search time will also increase because linear probing may need to skip multiple TOMBSTONE to find the target element.
To address this, consider recording the index of the first encountered TOMBSTONE during linear probing and swapping the searched target element with that TOMBSTONE. The benefit of doing this is that each time an element is queried or added, the element will be moved to a bucket closer to its ideal position (the starting point of probing), thereby optimizing query efficiency.
The code below implements an open addressing (linear probing) hash table with lazy deletion. To make better use of the hash table space, we treat the hash table as a "circular array". When going beyond the end of the array, we return to the beginning and continue traversing.
[file]{hash_map_open_addressing}-[class]{hash_map_open_addressing}-[func]{}
Quadratic probing is similar to linear probing and is one of the common strategies for open addressing. When a collision occurs, quadratic probing does not simply skip a fixed number of steps but skips a number of steps equal to the "square of the number of probes", i.e., $1, 4, 9, \dots$ steps.
Quadratic probing has the following advantages:
However, quadratic probing is not perfect:
As the name suggests, the double hashing method uses multiple hash functions $f_1(x)$, $f_2(x)$, $f_3(x)$, $\dots$ for probing.
None.Compared to linear probing, the double hashing method is less prone to clustering, but multiple hash functions introduce additional computational overhead.
!!! tip
Please note that open addressing (linear probing, quadratic probing, and double hashing) hash tables all have the problem of "cannot directly delete elements".
Different programming languages adopt different hash table implementation strategies. Here are a few examples:
dict dictionary uses pseudo-random numbers for probing.HashMap reaches 64 and the length of a linked list reaches 8, the linked list is converted to a red-black tree to improve search performance.