Moved to Wordpress

This blog has moved to Wordpress. GOTO the new blog now.

There will be no new updates here. Most of the posts have also been expanded.

2010/09/15

What In The Hell Is A Hash Table

A hash table is a combination of two parts; an array, to store the data, and a hash function that computes the index of the array from a key. The best analogy to a hash table is a dictionary. When you use a dictionary you lookup a word (the key) and find the definition (the value).

Hash tables are an improvement over arrays and linked lists for large data sets. With a hash table you don’t have to search through every element in the collection (O(n) access time). Instead you pay a one time cost (the hash function) to retrieve data (O(1) access time).

The Hash Function


The purpose of the hash function is to compute an index from a key. That index is then used to store and retrieve data from the array. Many higher-level languages include some type of hash table and hash function.



As you can see from the diagram above the hash(“string”) function transforms a string into an integer which is then used as the index into the array. This index is used to store or retrieve data in the array.


The amount of time it takes to set or retrieve a value from a hash table is the time it takes to run the hash function plus the time it takes to access an element of the array. For small data sets this could take longer than simply iterating over each element of the array. However, since it is a fixed cost, for larger data sets using a hash table could be much faster.

Assuming we have a function ‘set’ that takes a key to be hashed as its first argument and a value to be stored as its second argument, we can create the following hash table.



As you can see from the above diagram we store both the key and the value in the table. The reason for this is collisions, which will be explained below.

Collisions


A collision occurs when two different keys hash to the same value. Collisions will occur in even the best designed hash tables, however there are a few reasons why they can occur more frequently. The first reason is a poor hash function that does not distribute the indices evenly throughout the table, this will be explained in more detail below. The second reason is that the array is too small. If you have a 10 element array and you are trying to store 100 key/values in the hash table, there will obviously be a lot of collisions.


In the diagram above “Puff” hashed to the same value as “Peter” causing a collision. Both “Peter” and “Puff” are now stored in the same index of the array. A certain amount of collisions are unavoidable in a hash table, so we’ll need a way to deal with them.

The simplest way to deal with collisions is to make each element of the array a linked list. Then, when a collision occurs, we simply search through the linked list to find the data we want. This way the hash function gives us a ‘starting point’ to begin our search.

With a good hash function and an appropriately sized array collisions are a rare occurrence. However it is possible, though very unlikely, that all the key/values will be stored in one index and we will end up doing a search on a linked list.

What is a good hash function


A good hash function has two properties, computing unique index values and uniform distribution. Obviously a hash function that computes the same index value for a large number of keys will cause a lot of collisions. As we saw above, collisions can break our hash table down into a search of a linked list which is O(n) compared to the hash table’s O(1). A good hash function must also produce a uniform distribution of index values. If it does not it increases the chances of collisions and wastes space.

No comments:

Post a Comment