# Basic Principles of Hash Tables

First, I need to clarify a common misconception for beginners.

Are a hash table and what we often refer to as a Map (key-value mapping) the same thing? No, they are not.

This becomes very clear when explained in Java. `Map` is a Java interface that merely declares a number of methods without providing their specific implementations:

```
interface Map<K, V> {
    V get(K key);
    void put(K key, V value);
    V remove(K key);
    // ...
}
```

The `Map` interface itself only defines a series of operations for key-value mapping. Data structures like `HashMap` implement these operations according to their own characteristics. Other data structures, such as `TreeMap` and `LinkedHashMap`, also implement this interface.

In other words, you can say that the `get`, `put`, and `remove` methods of `HashMap` have a complexity of `O(1)`, but you cannot say that the methods of the `Map` interface are `O(1)`. If you use another implementation, such as `TreeMap`, which is based on a tree structure, the complexity of these methods becomes `O(logN)`.

Why am I emphasizing this? Mainly for readers who use non-Java languages.

Other programming languages might not have interface definitions as explicit as Java's, so it is easy to confuse hash tables with key-value maps in general. Hearing about key-value operations might lead one to assume that their complexity is always `O(1)`. This is incorrect; the actual complexity depends on how the underlying data structure implements the key-value operations.

In this section, I will guide you through implementing a hash table, explain why hash tables can achieve `O(1)` complexity for insert, delete, search, and update operations, and discuss two methods for resolving hash collisions.

## Basic Principles of Hash Tables


A hash table can be understood as an enhanced array.

An array allows you to find the corresponding element in `O(1)` time using an index (a non-negative integer).

A hash table is similar: it allows you to find the value corresponding to a key in `O(1)` time. The key can be of various types, such as numbers or strings.

How does it work? It's very simple. The underlying implementation of a hash table is an array (let's call it `table`). It first converts the key into an array index using a hash function (let's call it `hash`); then the insert, delete, search, and update operations are almost the same as on an array:

```
// Pseudocode logic for a hash table
class MyHashMap {
    private Object[] table;

    // Insert/Update, complexity O(1)
    public void put(K key, V value) {
        int index = hash(key);
        table[index] = value;
    }

    // Retrieve, complexity O(1)
    public V get(K key) {
        int index = hash(key);
        return table[index];
    }

    // Delete, complexity O(1)
    public void remove(K key) {
        int index = hash(key);
        table[index] = null;
    }

    // Hash function, converting key into a valid index in the table
    // The time complexity must be O(1) to ensure that the above methods all have complexity O(1)
    private int hash(K key) {
        // ...
    }
}
```

```
// Pseudocode logic for hash table
class MyHashMap {
private:
    vector<void*> table;

public:
    // Insert/Update, complexity O(1)
    void put(auto key, auto value) {
        int index = hash(key);
        table[index] = value;
    }

    // Search, complexity O(1)
    auto get(auto key) {
        int index = hash(key);
        return table[index];
    }

    // Delete, complexity O(1)
    void remove(auto key) {
        int index = hash(key);
        table[index] = nullptr;
    }

private:
    // Hash function, convert key to valid index in the table
    // Time complexity must be O(1) to ensure the above methods have O(1) complexity
    int hash(auto key) {
        // ...
    }
};
```

```
# Hash table pseudocode logic
class MyHashMap:
    def __init__(self):
        self.table = [None] * 1000

    # add/update, complexity O(1)
    def put(self, key, value):
        index = self.hash(key)
        self.table[index] = value

    # get, complexity O(1)
    def get(self, key):
        index = self.hash(key)
        return self.table[index]

    # remove, complexity O(1)
    def remove(self, key):
        index = self.hash(key)
        self.table[index] = None

    # hash function, convert key into a valid index in the table
    # time complexity must be O(1) to ensure the above methods have O(1) complexity
    def hash(self, key):
        # ...
        return hash(key) % len(self.table)
```

```
// Pseudo-code logic for hash table
type MyHashMap struct {
    table []interface{}
}

func Constructor() MyHashMap {
    return MyHashMap{
        table: make([]interface{}, 10000),
    }
}

// Add/Update, complexity O(1)
func (h *MyHashMap) Put(key, value interface{}) {
    index := h.hash(key)
    h.table[index] = value
}

// Get, complexity O(1)
func (h *MyHashMap) Get(key interface{}) interface{} {
    index := h.hash(key)
    return h.table[index]
}

// Remove, complexity O(1)
func (h *MyHashMap) Remove(key interface{}) {
    index := h.hash(key)
    h.table[index] = nil
}

// Hash function, convert key to a valid index in the table
// Time complexity must be O(1) to ensure the above methods are O(1)
func (h *MyHashMap) hash(key interface{}) int {
    // ...
}
```

```
// Hash table pseudocode logic
class MyHashMap {
    constructor() {
        this.table = new Array(1000).fill(null);
    }

    // Add/Update, complexity O(1)
    put(key, value) {
        const index = this.hash(key);
        this.table[index] = value;
    }

    // Get, complexity O(1)
    get(key) {
        const index = this.hash(key);
        return this.table[index];
    }

    // Remove, complexity O(1)
    remove(key) {
        const index = this.hash(key);
        this.table[index] = null;
    }

    // Hash function, convert key to a valid index in the table
    // The time complexity must be O(1) to ensure that the above methods' complexity is O(1)
    hash(key) {
        // ...
    }
}
```

There are many details to handle in the actual implementation, such as the design of the hash function and the handling of hash collisions. But if you understand the core principles mentioned above, you're already halfway there. The rest is just writing the code, which shouldn't be too hard, right?

Next, let's specifically introduce some key concepts and potential issues in the add, delete, search, and update processes mentioned above.

## Key Concepts and Principles

### `key` Is Unique, `value` Can Be Duplicated

In a hash table, no two `key`s can be the same, but `value`s can be duplicated.

Understanding this principle is straightforward if you think about arrays:

**In an array, each index is unique. You can't have two index 0s in one array. As for what elements are stored in the array, that's up to you and no one cares.**

So, in a hash table, `key`s are unique, but `value`s can be anything and can be duplicated.

### Hash Function

The purpose of a hash function is to convert an input of any length (key) into a fixed-length output (index).

You may have noticed that the methods for adding, deleting, searching, and updating all use the hash function to calculate the index. If the hash function you design has a complexity of `O(N)`, then the performance of these operations degrades to `O(N)`. **So, the performance of this function is crucial.**

**Another important point is that the hash function must produce the same output for the same input key every time, to ensure the correctness of the hash table.** It can't happen that you calculate `hash("123") = 5` now and `hash("123") = 6` later; otherwise, the hash table would be useless.

So, how does the hash function convert non-integer `key`s into integer indices? And how does it ensure that these indices are valid?

**How to Convert a `key` to an Integer**

This problem can have many answers, as different hash function designs will have different methods. Here, I'll share a simple approach using Java. Other programming languages follow a similar idea.

Every Java object has an `int hashCode()` method. If you don't override this method in a custom class, its default return value can be considered the object's memory address, which is a globally unique integer.

So, by calling the `hashCode()` method on the `key`, you effectively convert the `key` into a globally unique integer.

Of course, this method has some issues, which we will discuss below, but at least we've found a way to convert any object to an integer.

**How to Ensure Index Validity**

The `hashCode` method returns an `int`, and one immediate issue is that this value can be negative, while array indices must be non-negative integers.

You might think of writing code to convert this value to a non-negative number like this:

```
int h = key.hashCode();
if (h < 0) h = -h;
```

But this approach has a problem. The minimum value an `int` can represent is `-2^31`, and the maximum value is `2^31 - 1`. So if `h = -2^31`, then `-h = 2^31`, which exceeds the maximum value an `int` can hold. This is called integer overflow; Java does not raise an error here, the value silently wraps around, so negating `-2^31` yields `-2^31` again, which is still negative.

Why is the minimum `int` value `-2^31` and the maximum `2^31 - 1`? This relates to two's complement encoding. In simple terms, an `int` is 32 binary bits, where the highest (leftmost) bit is the sign bit: 0 for non-negative numbers, 1 for negative numbers.

The problem now is straightforward: I want to ensure `h` is non-negative but can't simply negate it. A simple, direct way, based on two's complement encoding, is to clear the highest (sign) bit, which guarantees `h` is non-negative:

```
int h = key.hashCode();
// Bitwise operation, remove the highest bit (sign bit)
// Additionally, bitwise operations are generally faster than arithmetic operations
// Therefore, you will see that the standard library source code prefers bitwise operations where possible
h = h & 0x7fffffff;
// The binary representation of 0x7fffffff is 0111 1111 ... 1111
// That is, all bits are 1 except the highest bit (sign bit) which is 0
// By performing a bitwise AND with 0x7fffffff, the highest bit (sign bit) is cleared, ensuring h is non-negative
```

I won't go into the principles of two's complement encoding in detail here. If you're interested, you can look it up and study it on your own.

Alright, the issue of `hashCode` potentially being negative is resolved. But there's another problem: this `hashCode` is usually very large, and we need to map it to a valid index in the `table` array.

This shouldn't be difficult for you. Previously, in Circular Array Principles and Implementation, we used the `%` modulus operation to keep the index within the valid range of the array. Here we can use `%` the same way. The complete `hash` function implementation is as follows:

```
int hash(K key) {
    int h = key.hashCode();
    // ensure non-negative number
    h = h & 0x7fffffff;
    // map to a valid index in the table array
    return h % table.length;
}
```

Of course, directly using `%` also has its issues, because the modulus operation is relatively expensive. Standard libraries that prioritize runtime efficiency usually avoid `%` in favor of bitwise operations to improve performance.

However, the main purpose of this chapter is to help you understand how to implement a simple hash table, so we will ignore these optimization details. If you're interested, you can check the source code of Java's `HashMap` to see how it implements the `hash` function.

### Hash Collision

Above, we discussed the implementation of the `hash` function. Naturally, you might wonder: what happens if two different `key`s produce the same index through the hash function? This situation is known as a "hash collision."

**Can Hash Collisions Be Avoided?**

Hash collisions cannot be avoided; they can only be handled properly at the algorithm level.

Hash collisions are inevitable because the `hash` function essentially maps an infinite input space to a finite index space, so different `key`s are bound to map to the same index.

It's similar to how a three-dimensional object casts a two-dimensional shadow. This lossy compression necessarily results in information loss, and lossy information cannot correspond one-to-one with the original information.

How do we resolve hash collisions? There are two common solutions: **chaining** and **linear probing** (the simplest form of **open addressing**).

The names might sound fancy, but they essentially boil down to two approaches: vertical extension and horizontal extension.

In chaining, the underlying array of the hash table doesn't store `value`s directly; each slot stores a linked list. When multiple different `key`s map to the same index, their `key -> value` pairs are stored in that linked list, effectively resolving the hash collision.

In linear probing, if a `key` finds that its calculated `index` is already occupied by another `key`, it checks the position at `index + 1`. If that spot is also occupied, it continues to check subsequent positions until it finds an empty slot.

For example, if the keys are inserted in the order `k2, k4, k5, k3, k1` and some of them hash to the same index, linear probing pushes each colliding key into the next free slot, so the final layout of the table depends on the insertion order. (The original figure illustrating this layout is omitted here.)

Here, I'll explain the principles first. In the following chapters, I will walk you through the implementation of these two methods to resolve hash collisions.

### Resizing and Load Factor

You've probably heard the term "load factor" before. Now that you understand the issue of hash collisions, you can grasp the significance of the load factor.

Both the chaining method and linear probing can resolve hash collisions, but they can lead to performance degradation.

For example, with the chaining method, you calculate `index = hash(key)`, but at that index you find a linked list, which you then have to traverse to find the `value` you're looking for. The time complexity of this process is `O(K)`, where `K` is the length of the linked list.

Similarly, with linear probing, you calculate `index = hash(key)`, but the element at that index might not be the `key` you're looking for, and because of how linear probing resolves collisions, you can't conclude that the `key` doesn't exist. You have to keep probing forward until you find an empty slot or the `key` itself. The time complexity of this process is also `O(K)`, where `K` is the number of probes.

Therefore, if hash collisions happen frequently, `K` grows, leading to significant performance degradation of the hash table. This is what we need to avoid.

Why do hash collisions occur frequently? There are two main reasons:

- Poorly designed hash functions: the `key` hash values are distributed unevenly, so many `key`s map to the same index.
- The hash table is too full of `key-value` pairs: even with a perfect hash function, hash collisions cannot be avoided in this scenario.

The first issue is typically not a concern because developers of programming language standard libraries have designed efficient hash functions for you to use.

The second issue is within our control, which brings us to the concept of the "load factor."

**Load Factor**

The load factor measures how full a hash table is. Generally, the larger the load factor, the more `key-value` pairs the hash table holds, the higher the probability of hash collisions, and the poorer the performance of hash table operations.

**The formula to calculate the load factor is simple: size / table.length**, where `size` is the number of `key-value` pairs in the hash table, and `table.length` is the capacity of the underlying array.

You'll notice that in hash tables implemented with chaining, the load factor can exceed 1 and grow without bound, because the linked lists can grow indefinitely. In hash tables implemented with linear probing, the load factor cannot exceed 1.

In Java's `HashMap`, you can set a custom load factor when creating a hash table. If not set, the default is `0.75`. This value is based on experience, and it's generally fine to use the default.

**When the load factor of a hash table exceeds its threshold**, the hash table will resize. Similar to the previously discussed implementation of dynamic arrays, the capacity of the underlying `table` array is increased, and the data is moved to the new, larger array. `size` remains unchanged, `table.length` increases, and the load factor decreases.

### Why You Should Not Rely on the Traversal Order of a Hash Table

You might have heard the general programming advice that the traversal order of keys in a hash table is unordered, and you shouldn't rely on the traversal order when writing programs. Why is that?

The traversal of a hash table essentially involves traversing the underlying `table` array:

```
// pseudo-code logic for iterating all keys
// underlying table array of the hash table
KVNode[] table = new KVNode[1000];
// get all keys in the hash table
// we cannot rely on the order of this keys list
List<KeyType> keys = new ArrayList<>();
for (int i = 0; i < table.length; i++) {
    KVNode node = table[i];
    if (node != null) {
        keys.add(node.key);
    }
}
```

If you understand the previous content, you should be able to grasp this problem.

First, because the `hash` function maps your `key`s, the distribution of `key`s in the underlying `table` array is effectively random, unlike arrays or linked lists, which have a well-defined element order.

Second, what happens when a hash table reaches its load-factor threshold? It resizes, right? This means `table.length` changes, and elements are moved.

So, during this rehashing process, the `hash` function recalculates the index for each `key` and places it in the new `table` array.

**The hash function calculates values based on table.length. This means that after a hash table automatically resizes, the hash value for the same key might change, and the index where the key-value pair is stored in the table might also change, so the traversal order becomes different from before.**

The phenomenon you observe is that the first key in one traversal is `key1`, but after adding or removing a few elements, `key1` might end up at the end.

So, there's no need to memorize these details. Once you understand the principle, you can figure it out with a bit of reasoning.

### Why It's Not Recommended to Add/Delete Hash Table `keys` in a For Loop

Note that I say it’s not recommended, not that it’s forbidden. Different programming languages have different implementations of hash tables in their standard libraries. Some languages have optimized for such scenarios, so whether it's feasible or not, you need to check the documentation.

Here, we'll analyze from the principles of hash tables. Adding or deleting hash table `key`s in a for loop can easily cause problems, for the same reason mentioned above: resizing changes the hash indices.

Traversing the `key`s of a hash table essentially means traversing the underlying `table` array. If you add or delete elements while traversing, and those operations trigger a resize, the entire `table` array changes. So what should the traversal do next? And should the elements added or deleted mid-traversal be visited?

Resizing causing `key` order to change is behavior unique to hash tables. But even setting this factor aside, it's generally not recommended to add or delete elements in any data structure during traversal, as it can easily lead to unexpected behavior.

If you must do this, make sure to check the relevant documentation to understand the behavior clearly.

### `keys` Must Be Immutable

**Only immutable types can be used as hash table keys, and this is very important**.

An immutable type means that once an object is created, its value cannot be changed. For example, in Java, `String` and `Integer` are immutable types: once these objects are created, you can only read their values but cannot modify them.

In contrast, objects like `ArrayList` and `LinkedList` in Java can have elements added or removed after they are created, so they are mutable types.

Therefore, you can use `String` objects as hash table `key`s, but you cannot use `ArrayList` objects as hash table `key`s:

```
// Immutable types can be used as keys
Map<String, AnyOtherType> map1 = new HashMap<>();
Map<Integer, AnyOtherType> map2 = new HashMap<>();
// Mutable types should not be used as keys
// Note, this won't cause a syntax error, but the code is very likely to have bugs
Map<ArrayList<Integer>, AnyOtherType> map3 = new HashMap<>();
```

Why is it not recommended to use mutable types as `key`s? Take `ArrayList` as an example. The logic of its `hashCode` method is roughly as follows:

```
public int hashCode() {
    int h = 1;
    // must scan every element, so computing hashCode is O(N)
    for (int i = 0; i < elementData.length; i++) {
        h = 31 * h + elementData[i].hashCode();
    }
    return h;
}
```

**The first issue is efficiency.** Every time you calculate the `hashCode`, you have to traverse the entire array, making the computation `O(N)`. This causes the complexity of the hash table's operations to degrade to `O(N)`.

A more serious problem is that the `hashCode` of an `ArrayList` is computed from its elements. If you add or remove elements from the `ArrayList`, or if the `hashCode` of any element changes, then the `hashCode` of the `ArrayList` changes too.

For example, suppose you use an `ArrayList` variable `arr` as a key in a hash table to store a corresponding value. If any element of `arr` is modified elsewhere in the program, the `hashCode` of `arr` changes, and when you later query the hash table with this same `arr` variable, you won't find any value.

**In other words, the key-value pair you stored in the hash table is lost unexpectedly, which is a very serious bug and can also lead to potential memory leaks**.

```
public class Test {
    public static void main(String[] args) {
        // Incorrect example
        // Using mutable types as keys in HashMap
        Map<ArrayList<Integer>, Integer> map = new HashMap<>();
        ArrayList<Integer> arr = new ArrayList<>();
        arr.add(1);
        arr.add(2);
        map.put(arr, 999);
        System.out.println(map.containsKey(arr)); // true
        System.out.println(map.get(arr)); // 999

        arr.add(3);
        // Serious bug occurs, key-value pair is lost
        System.out.println(map.containsKey(arr)); // false
        System.out.println(map.get(arr)); // null
        // At this point, the key-value pair for arr still exists in the underlying table
        // But because the hashCode of arr has changed, it cannot be found
        // This can also cause a memory leak, because the map still references arr's
        // entry, which can no longer be reached or garbage collected
    }
}
```

The above is a simple error example. You might say, "Just remove the element `3`, and the `arr -> 999` key-value pair will reappear," or "Just traverse the underlying `table` array of the hash table directly, and you'll see this key-value pair."

Come on 🙏🏻, are you writing code or a ghost story? One moment it appears, the next it disappears. Is your hash table haunted?

Just kidding. In reality, mutable types are themselves a source of uncertainty: in a mess of code, how do you know where `arr` is being modified?

The correct approach is to use immutable types as hash table `key`s, for example the `String` type. Once a `String` object is created in Java, its value cannot be changed, so you won't encounter the above problem.

The `hashCode` method of the `String` type also needs to traverse all its characters, but because of its immutability, the value can be cached once calculated, avoiding recomputation. So the average time complexity remains `O(1)`.

I used Java as an example here, but other languages are similar. You need to check the relevant documentation to understand how the standard library's hash table calculates object hash values to avoid similar issues.
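For instance, Python enforces this at the language level: mutable built-ins like `list` don't define `__hash__`, so they simply can't be used as `dict` keys, while immutable `tuple`s can:

```python
d = {}
try:
    d[[1, 2]] = 999          # list is mutable, so it is unhashable
except TypeError as e:
    print("list key rejected:", e)

d[(1, 2)] = 999              # tuple is immutable, so it works as a key
print(d[(1, 2)])             # 999
```

Note that a tuple is only hashable if all of its elements are; a tuple containing a list, like `([1], 2)`, would still be rejected.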

## Summary

The above explanation should have connected all the underlying principles of hash tables. Let's summarize the content with a few interview questions:

**1. Why do we often say that the efficiency of insertion, deletion, search, and update in a hash table is O(1)?**

Because the underlying structure of a hash table is an array, the main cost comes from the hash function computing the index and from handling hash collisions. As long as the complexity of the hash function is `O(1)` and hash collisions are reasonably resolved, the complexity of insertion, deletion, search, and update is `O(1)`.

**2. Why does the traversal order of a hash table change?**

Because a hash table will expand when it reaches its load factor. This expansion process changes the capacity of the underlying array, which in turn changes the indices calculated by the hash function, thus altering the traversal order.

**3. Is the efficiency of insertion, deletion, search, and update in a hash table always O(1)?**

Not necessarily. As analyzed earlier, these operations are `O(1)` only if the hash function's complexity is `O(1)` and hash collisions are reasonably resolved. Collisions can be solved with the standard methods; the key issue is the complexity of the hash function. If an inappropriate key type is used, such as an `ArrayList` as a key, the hash table's complexity can degrade to `O(N)`.

**4. Why must we use immutable types as the keys in a hash table?**

Because the main operations of a hash table rely on the index calculated by the hash function. If the hash value of the key changes, it can lead to unexpected loss of key-value pairs and serious bugs.

To write efficient code, you need to have a good understanding of the source code in the standard library of your programming language.

Next, I will guide you step-by-step to implement a simple hash table using both chaining and linear probing methods to deepen your understanding of hash tables.