# Basic Principles of Hash Tables

First, I need to clarify a common misconception for beginners.

Are a hash table and what we often refer to as a Map (key-value mapping) the same thing? No, they are not.

This becomes very clear when explained in Java. `Map` is a Java interface that merely declares a number of methods without providing their specific implementations:

```
interface Map<K, V> {
    V get(K key);
    void put(K key, V value);
    V remove(K key);
    // ...
}
```

The `Map` interface itself only defines a series of operations for key-value mapping. Data structures like `HashMap` implement these operations according to their own characteristics. Other data structures, such as `TreeMap` and `LinkedHashMap`, also implement this interface.

In other words, you can say that the `get`, `put`, and `remove` methods of `HashMap` have a complexity of `O(1)`, but you cannot say that the methods of the `Map` interface are `O(1)`. If you use another implementation, such as `TreeMap`, which is based on a tree structure, the complexity of these methods becomes `O(logN)`.

Why am I emphasizing this? Mainly for readers who use non-Java languages.

Other programming languages might not have interface definitions as explicit as Java's, so it is easy to confuse hash tables with key-value maps in general. Hearing about key-value operations might lead one to assume that their complexity is always `O(1)`. This is incorrect; the actual complexity depends on how the underlying data structure implements the key-value operations.

In this section, I will guide you through implementing a hash table, explain why hash tables can achieve `O(1)` complexity for insert, delete, search, and update operations, and discuss two methods for resolving hash collisions.

## Basic Principles of Hash Tables


A hash table can be understood as an enhanced array.

An array allows you to find the corresponding element in `O(1)` time using an index (a non-negative integer).

A hash table is similar: it allows you to find the value corresponding to a key in `O(1)` time. The key can be of various types, such as numbers or strings.

How does it work? It's very simple. The underlying implementation of a hash table is an array (let's call it `table`). It first converts the key into an array index using a hash function (let's call it `hash`); then the insert, delete, search, and update operations are almost the same as on an array:

```
// Pseudocode logic for a hash table
class MyHashMap {
    private Object[] table;

    // Insert/Update, complexity O(1)
    public void put(K key, V value) {
        int index = hash(key);
        table[index] = value;
    }

    // Retrieve, complexity O(1)
    public V get(K key) {
        int index = hash(key);
        return table[index];
    }

    // Delete, complexity O(1)
    public void remove(K key) {
        int index = hash(key);
        table[index] = null;
    }

    // Hash function, converting key into a valid index in the table
    // The time complexity must be O(1) to ensure that the above methods all have complexity O(1)
    private int hash(K key) {
        // ...
    }
}
```

```
// Pseudocode logic for hash table
class MyHashMap {
private:
    vector<void*> table;

public:
    // Insert/Update, complexity O(1)
    void put(auto key, auto value) {
        int index = hash(key);
        table[index] = value;
    }

    // Search, complexity O(1)
    auto get(auto key) {
        int index = hash(key);
        return table[index];
    }

    // Delete, complexity O(1)
    void remove(auto key) {
        int index = hash(key);
        table[index] = nullptr;
    }

private:
    // Hash function, convert key to valid index in the table
    // Time complexity must be O(1) to ensure the above methods have O(1) complexity
    int hash(auto key) {
        // ...
    }
};
```

```
# Hash table pseudocode logic
class MyHashMap:
    def __init__(self):
        self.table = [None] * 1000

    # add/update, complexity O(1)
    def put(self, key, value):
        index = self.hash(key)
        self.table[index] = value

    # get, complexity O(1)
    def get(self, key):
        index = self.hash(key)
        return self.table[index]

    # remove, complexity O(1)
    def remove(self, key):
        index = self.hash(key)
        self.table[index] = None

    # hash function, convert key into a valid index in the table
    # time complexity must be O(1) to ensure the above methods have O(1) complexity
    def hash(self, key):
        # ...
        return hash(key) % len(self.table)
```

```
// Pseudo-code logic for hash table
type MyHashMap struct {
    table []interface{}
}

func Constructor() MyHashMap {
    return MyHashMap{
        table: make([]interface{}, 10000),
    }
}

// Add/Update, complexity O(1)
func (h *MyHashMap) Put(key, value interface{}) {
    index := h.hash(key)
    h.table[index] = value
}

// Get, complexity O(1)
func (h *MyHashMap) Get(key interface{}) interface{} {
    index := h.hash(key)
    return h.table[index]
}

// Remove, complexity O(1)
func (h *MyHashMap) Remove(key interface{}) {
    index := h.hash(key)
    h.table[index] = nil
}

// Hash function, convert key to a valid index in the table
// Time complexity must be O(1) to ensure the above methods are O(1)
func (h *MyHashMap) hash(key interface{}) int {
    // ...
}
```

```
// Hash table pseudocode logic
class MyHashMap {
    constructor() {
        this.table = new Array(1000).fill(null);
    }

    // Add/Update, complexity O(1)
    put(key, value) {
        const index = this.hash(key);
        this.table[index] = value;
    }

    // Get, complexity O(1)
    get(key) {
        const index = this.hash(key);
        return this.table[index];
    }

    // Remove, complexity O(1)
    remove(key) {
        const index = this.hash(key);
        this.table[index] = null;
    }

    // Hash function, convert key to a valid index in the table
    // The time complexity must be O(1) to ensure that the above methods' complexity is O(1)
    hash(key) {
        // ...
    }
}
```

There are many details to handle in the actual implementation, such as the design of the hash function and the handling of hash collisions. But if you understand the core principles mentioned above, you're already halfway there. The rest is just writing the code, which shouldn't be too hard, right?

Next, let's specifically introduce some key concepts and potential issues in the add, delete, search, and update processes mentioned above.

## Key Concepts and Principles

### `key` Is Unique, `value` Can Be Duplicated

In a hash table, no two `key`s can be the same, but `value`s can be duplicated.

Understanding this principle is straightforward if you think about arrays:

**In an array, each index is unique. You can't have two index 0s in one array. As for what elements are stored in the array, that's up to you and no one cares.**

So, in a hash table, `key`s are unique, but `value`s can be anything and can be duplicated.

### Hash Function

The purpose of a hash function is to convert an input of any length (key) into a fixed-length output (index).

You may have noticed that the methods for adding, deleting, searching, and updating all use the hash function to calculate the index. If the hash function you design has a complexity of `O(N)`, then the performance of these operations degrades to `O(N)`. **So, the performance of this function is crucial.**

**Another important point is that the hash function must produce the same output for the same input key every time, to ensure the correctness of the hash table.** It can't happen that you calculate `hash("123") = 5` now and `hash("123") = 6` later; otherwise, the hash table would be useless.

So, how does the hash function convert non-integer `key`s into integer indices? And how does it ensure that these indices are valid?

**How to Convert a `key` to an Integer**

This problem can have many answers, as different hash function designs will have different methods. Here, I'll share a simple approach using Java. Other programming languages follow a similar idea.

Every Java object has an `int hashCode()` method. If you don't override this method in a custom class, its default return value can be considered the object's memory address, which is a globally unique integer.

So, by calling the `hashCode()` method on the `key`, you effectively convert the `key` into a globally unique integer.

Of course, this method has some issues, which we will discuss below, but at least we've found a way to convert any object to an integer.

**How to Ensure Index Validity**

The `hashCode` method returns an `int`, and one immediate issue is that this value can be negative, while array indices must be non-negative integers.

You might think of writing code to convert this value to a non-negative number like this:

```
int h = key.hashCode();
if (h < 0) h = -h;
```

But this approach has a problem. The minimum value an `int` can represent is `-2^31`, and the maximum value is `2^31 - 1`. So if `h = -2^31`, then `-h = 2^31`, which exceeds the maximum value an `int` can hold. This is called integer overflow; Java does not raise an error here, the value silently wraps around, so negating `-2^31` yields `-2^31` again, which is still negative.

Why is the minimum `int` value `-2^31` and the maximum `2^31 - 1`? This relates to two's complement encoding. In simple terms, an `int` is 32 binary bits, where the highest (leftmost) bit is the sign bit: 0 for non-negative numbers, 1 for negative numbers.

The problem now is straightforward: I want to ensure `h` is non-negative but can't simply negate it. A simple, direct way, based on two's complement encoding, is to clear the highest (sign) bit, which guarantees `h` is non-negative:

```
int h = key.hashCode();
// Bitwise operation, remove the highest bit (sign bit)
// Additionally, bitwise operations are generally faster than arithmetic operations
// Therefore, you will see that the standard library source code prefers bitwise operations where possible
h = h & 0x7fffffff;
// The binary representation of 0x7fffffff is 0111 1111 ... 1111
// That is, all bits are 1 except the highest bit (sign bit) which is 0
// By performing a bitwise AND with 0x7fffffff, the highest bit (sign bit) is cleared, ensuring h is non-negative
```

I won't go into the principles of two's complement encoding in detail here. If you're interested, you can look it up and study it on your own.

Alright, the issue of `hashCode` potentially being negative is resolved. But there's another problem: this `hashCode` is usually very large, and we need to map it to a valid index in the `table` array.

This shouldn't be difficult for you. Previously, in Circular Array Principles and Implementation, we used the `%` modulus operation to keep the index within the valid range of the array. Here we can use `%` the same way. The complete `hash` function implementation is as follows:

```
int hash(K key) {
    int h = key.hashCode();
    // ensure non-negative number
    h = h & 0x7fffffff;
    // map to a valid index in the table array
    return h % table.length;
}
```

Of course, directly using `%` also has its issues, because the modulus operation is relatively expensive. Standard libraries that prioritize runtime efficiency usually avoid `%` in favor of bitwise operations to improve performance.

However, the main purpose of this chapter is to help you understand how to implement a simple hash table, so we will ignore these optimization details. If you're interested, you can check the source code of Java's `HashMap` to see how it implements the `hash` function.

### Hash Collision

Above, we discussed the implementation of the `hash` function. Naturally, you might wonder: what happens if two different `key`s produce the same index through the hash function? This situation is known as a "hash collision."

**Can Hash Collisions Be Avoided?**

Hash collisions cannot be avoided; they can only be handled properly at the algorithm level.

Hash collisions are inevitable because the `hash` function essentially maps an infinite input space to a finite index space, so different `key`s are bound to map to the same index.

It's similar to how a three-dimensional object casts a two-dimensional shadow. This lossy compression necessarily results in information loss, and lossy information cannot correspond one-to-one with the original information.

How do we resolve hash collisions? There are two common solutions: **chaining** and **linear probing** (the simplest form of **open addressing**).

The names might sound fancy, but they essentially boil down to two approaches: vertical extension and horizontal extension.

In chaining, the underlying array of the hash table doesn't store `value`s directly; each slot stores a linked list. When multiple different `key`s map to the same index, their `key -> value` pairs are stored in that linked list, effectively resolving the hash collision.

In linear probing, if a `key` finds that its calculated `index` is already occupied by another `key`, it checks the position at `index + 1`. If that spot is also occupied, it continues to check subsequent positions until it finds an empty slot.

For example, if the keys are inserted in the order `k2, k4, k5, k3, k1` and some of them hash to the same index, linear probing pushes each colliding key into the next free slot, so the final layout of the table depends on the insertion order. (The original figure illustrating this layout is omitted here.)

Here, I'll explain the principles first. In the following chapters, I will walk you through the implementation of these two methods to resolve hash collisions.

### Resizing and Load Factor

You've probably heard the term "load factor" before. Now that you understand the issue of hash collisions, you can grasp the significance of the load factor.

Both the chaining method and linear probing can resolve hash collisions, but they can lead to performance degradation.

For example, with the chaining method, you calculate `index = hash(key)`, but at that index you find a linked list, which you then have to traverse to find the `value` you're looking for. The time complexity of this process is `O(K)`, where `K` is the length of the linked list.

Similarly, with linear probing, you calculate `index = hash(key)`, but the element at that index might not be the `key` you're looking for, and because of how linear probing resolves collisions, you can't conclude that the `key` doesn't exist. You have to keep probing forward until you find an empty slot or the `key` itself. The time complexity of this process is also `O(K)`, where `K` is the number of probes.

Therefore, if hash collisions happen frequently, `K` grows, leading to significant performance degradation of the hash table. This is what we need to avoid.

Why do hash collisions occur frequently? There are two main reasons:

- Poorly designed hash functions: the `key` hash values are distributed unevenly, so many `key`s map to the same index.
- The hash table is too full of `key-value` pairs: even with a perfect hash function, hash collisions cannot be avoided in this scenario.

The first issue is typically not a concern because developers of programming language standard libraries have designed efficient hash functions for you to use.

The second issue is within our control, which brings us to the concept of the "load factor."

**Load Factor**

The load factor measures how full a hash table is. Generally, the larger the load factor, the more `key-value` pairs the hash table holds, the higher the probability of hash collisions, and the poorer the performance of hash table operations.

**The formula to calculate the load factor is simple: size / table.length**, where `size` is the number of `key-value` pairs in the hash table, and `table.length` is the capacity of the underlying array.

You'll notice that in hash tables implemented with chaining, the load factor can exceed 1 and grow without bound, because the linked lists can grow indefinitely. In hash tables implemented with linear probing, the load factor cannot exceed 1.

In Java's `HashMap`, you can set a custom load factor when creating a hash table. If not set, the default is `0.75`. This value is based on experience, and it's generally fine to use the default.

**When the load factor of a hash table exceeds its threshold**, the hash table will resize. Similar to the previously discussed implementation of dynamic arrays, the capacity of the underlying `table` array is increased, and the data is moved to the new, larger array. `size` remains unchanged, `table.length` increases, and the load factor decreases.

### Why You Should Not Rely on the Traversal Order of a Hash Table

You might have heard the general programming advice that the traversal order of keys in a hash table is unordered, and you shouldn't rely on the traversal order when writing programs. Why is that?

The traversal of a hash table essentially involves traversing the underlying `table` array:

```
// pseudo-code logic for iterating all keys
// underlying table array of the hash table
KVNode[] table = new KVNode[1000];
// get all keys in the hash table
// we cannot rely on the order of this keys list
List<KeyType> keys = new ArrayList<>();
for (int i = 0; i < table.length; i++) {
    KVNode node = table[i];
    if (node != null) {
        keys.add(node.key);
    }
}
```

If you understand the previous content, you should be able to grasp this problem.

First, because the `hash` function maps your `key`s, the distribution of `key`s in the underlying `table` array is effectively random, unlike arrays or linked lists, which have a well-defined element order.

Second, what happens when a hash table reaches its load-factor threshold? It resizes, right? This means `table.length` changes, and elements are moved.

So, during this rehashing process, the `hash` function recalculates the index for each `key` and places it in the new `table` array.

**The hash function calculates values based on table.length. This means that after a hash table automatically resizes, the hash value for the same key might change, and the index where the key-value pair is stored in the table might also change, so the traversal order becomes different from before.**

The phenomenon you observe is that the first key in one traversal is `key1`, but after adding or removing a few elements, `key1` might end up at the end.

So, there's no need to memorize these details. Once you understand the principle, you can figure it out with a bit of reasoning.

### Why It's Not Recommended to Add/Delete Hash Table `keys` in a For Loop

Note that I say it’s not recommended, not that it’s forbidden. Different programming languages have different implementations of hash tables in their standard libraries. Some languages have optimized for such scenarios, so whether it's feasible or not, you need to check the documentation.

Here, we'll analyze from the principles of hash tables. Adding or deleting hash table `key`s in a for loop can easily cause problems, for the same reason mentioned above: resizing changes the hash indices.

Traversing the `key`s of a hash table essentially means traversing the underlying `table` array. If you add or delete elements while traversing, and those operations trigger a resize, the entire `table` array changes. So what should the traversal do next? And should the elements added or deleted mid-traversal be visited?

Resizing causing `key` order to change is behavior unique to hash tables. But even setting this factor aside, it's generally not recommended to add or delete elements in any data structure during traversal, as it can easily lead to unexpected behavior.

If you must do this, make sure to check the relevant documentation to understand the behavior clearly.

### `keys` Must Be Immutable

**Only immutable types can be used as hash table keys, and this is very important**.

An immutable type means that once an object is created, its value cannot be changed. For example, in Java, `String` and `Integer` are immutable types: once these objects are created, you can only read their values but cannot modify them.

In contrast, objects like `ArrayList` and `LinkedList` in Java can have elements added or removed after they are created, so they are mutable types.

Therefore, you can use `String` objects as hash table `key`s, but you cannot use `ArrayList` objects as hash table `key`s:

```
// Immutable types can be used as keys
Map<String, AnyOtherType> map1 = new HashMap<>();
Map<Integer, AnyOtherType> map2 = new HashMap<>();
// Mutable types should not be used as keys
// Note, this won't cause a syntax error, but the code is very likely to have bugs
Map<ArrayList<Integer>, AnyOtherType> map3 = new HashMap<>();
```

Why is it not recommended to use mutable types as `key`s? Take `ArrayList` as an example. The logic of its `hashCode` method is roughly as follows:

```
public int hashCode() {
    int h = 1;
    // must scan every element, so computing hashCode is O(N)
    for (int i = 0; i < elementData.length; i++) {
        h = 31 * h + elementData[i].hashCode();
    }
    return h;
}
```

**The first issue is efficiency.** Every time you calculate the `hashCode`, you have to traverse the entire array, making the computation `O(N)`. This causes the complexity of the hash table's operations to degrade to `O(N)`.

A more serious problem is that the `hashCode` of an `ArrayList` is computed from its elements. If you add or remove elements from the `ArrayList`, or if the `hashCode` of any element changes, then the `hashCode` of the `ArrayList` changes too.

For example, suppose you use an `ArrayList` variable `arr` as a key in a hash table to store a corresponding value. If any element of `arr` is modified elsewhere in the program, the `hashCode` of `arr` changes, and when you later query the hash table with this same `arr` variable, you won't find any value.

**In other words, the key-value pair you stored in the hash table is lost unexpectedly, which is a very serious bug and can also lead to potential memory leaks**.

```
public class Test {
    public static void main(String[] args) {
        // Incorrect example
        // Using mutable types as keys in HashMap
        Map<ArrayList<Integer>, Integer> map = new HashMap<>();
        ArrayList<Integer> arr = new ArrayList<>();
        arr.add(1);
        arr.add(2);
        map.put(arr, 999);
        System.out.println(map.containsKey(arr)); // true
        System.out.println(map.get(arr)); // 999

        arr.add(3);
        // Serious bug occurs, key-value pair is lost
        System.out.println(map.containsKey(arr)); // false
        System.out.println(map.get(arr)); // null
        // At this point, the key-value pair for arr still exists in the underlying table
        // But because the hashCode of arr has changed, it cannot be found
        // This can also cause a memory leak, because the map still references arr's
        // entry, which can no longer be reached or garbage collected
    }
}
```

The above is a simple error example. You might say, "Just remove the element `3`, and the `arr -> 999` key-value pair will reappear," or "Just traverse the underlying `table` array of the hash table directly, and you'll see this key-value pair."

Come on 🙏🏻, are you writing code or a ghost story? One moment it appears, the next it disappears. Is your hash table haunted?

Just kidding. In reality, mutable types are themselves a source of uncertainty: in a mess of code, how do you know where `arr` is being modified?

The correct approach is to use immutable types as hash table `key`s, for example the `String` type. Once a `String` object is created in Java, its value cannot be changed, so you won't encounter the above problem.

The `hashCode` method of the `String` type also needs to traverse all its characters, but because of its immutability, the value can be cached once calculated, avoiding recomputation. So the average time complexity remains `O(1)`.

I used Java as an example here, but other languages are similar. You need to check the relevant documentation to understand how the standard library's hash table calculates object hash values to avoid similar issues.
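For instance, Python enforces this at the language level: mutable built-ins like `list` don't define `__hash__`, so they simply can't be used as `dict` keys, while immutable `tuple`s can:

```python
d = {}
try:
    d[[1, 2]] = 999          # list is mutable, so it is unhashable
except TypeError as e:
    print("list key rejected:", e)

d[(1, 2)] = 999              # tuple is immutable, so it works as a key
print(d[(1, 2)])             # 999
```

Note that a tuple is only hashable if all of its elements are; a tuple containing a list, like `([1], 2)`, would still be rejected.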

## Summary

The above explanation should have connected all the underlying principles of hash tables. Let's summarize the content with a few interview questions:

**1. Why do we often say that the efficiency of insertion, deletion, search, and update in a hash table is O(1)?**

Because the underlying structure of a hash table is an array, the main cost comes from the hash function computing the index and from handling hash collisions. As long as the complexity of the hash function is `O(1)` and hash collisions are reasonably resolved, the complexity of insertion, deletion, search, and update is `O(1)`.

**2. Why does the traversal order of a hash table change?**

Because a hash table will expand when it reaches its load factor. This expansion process changes the capacity of the underlying array, which in turn changes the indices calculated by the hash function, thus altering the traversal order.

**3. Is the efficiency of insertion, deletion, search, and update in a hash table always O(1)?**

Not necessarily. As analyzed earlier, these operations are `O(1)` only if the hash function's complexity is `O(1)` and hash collisions are reasonably resolved. Collisions can be solved with the standard methods; the key issue is the complexity of the hash function. If an inappropriate key type is used, such as an `ArrayList` as a key, the hash table's complexity can degrade to `O(N)`.

**4. Why must we use immutable types as the keys in a hash table?**

Because the main operations of a hash table rely on the index calculated by the hash function. If the hash value of the key changes, it can lead to unexpected loss of key-value pairs and serious bugs.

To write efficient code, you need to have a good understanding of the source code in the standard library of your programming language.

Next, I will guide you step-by-step to implement a simple hash table using both chaining and linear probing methods to deepen your understanding of hash tables.