Have an Android phone?

Try our new game!

Hashtable

Hash table (or hash set) is a data structure that uses a hash function to obtain index of an array element (bucket) where the corresponding value must be stored or searched for. Hash function need not assign the unique index to every possible key (while such behavior is desired) as the bucket can hold more than one value. However it must distribute values over existing buckets as evenly as possible. As the hash function must return an index of the existing bucket, it usually takes at least two parameters, the object to hash and the total number of buckets in the table.

Hash set that stores integers using the hash function v mod n, where v is the value to store and n is the total number of buckets. Use combo box to select value for adding, removing or searching in the set. The control at the bottom allows to change the number of buckets without loosing the content.

Once the bucket is found, all possible values there are individually checked if they match the search criteria (in case of hash set they are simply checked for equality against the key). Hash table maps some key into associated value. Differently, a hash set simply holds information which values are stored in the table, passing them directly to the hash function and also comparing for equality.

Hash tables and hash sets are built-in or readily available in many recent programming languages (Java, C++). Only very specific corner cases may require to write your own implementation.

Hash table is normally understood as unsorted data structure but it may be kind of sorted if some parts of implementation order data by the hash key. In any case, the iteration over "classic" hash table does not return elements in any particular order: this order is different from the sequence in that the elements have been added, it may also be different from the "natural" (for instance, alphabetic) sequence in that the elements may be arranged. In some cases this arbitrary ordering is confusing enough to implement extensions that enable the iterator to return elements in the same order as they have been added [1].

Performance

Hash table is a very efficient data structure. As usually a bucket holds only small part of the stored keys, the time, required to find the key, is near independent on the number of the stored keys (O(1) in the big O notation). The number of keys per bucket should not increase significantly as the table grows up; instead the table must be re-arranged to increase the number of buckets. Depending on requirements, hash function may take significant computing time but this time is still not dependent on the number of values, stored in the table.

Bucket implementation

Many bucket implementations store values in array or linked list and use simple scan to find them. Arrays take more advantage of the CPU memory cache as they keep values closer together in the memory. They also use less memory per value and are efficient with small data structures. However linked lists need not be resized then they overflow. To increase cache locality, the bucket may contain both array (or just field for the first value in the bucket) and list for the remaining values.

Lists and arrays works well when there are relatively few values per bucket (the time required to scan array is proportional to its size, O(n)). When is important to keep good performance for corner cases when hash function assigns a lot of values to the same bucket, the bucket can sort values for the faster search or switch into some more advanced structure, for instance a 2-3-4 tree or red-black tree.

Load factor

The number of stored elements divided by the number of buckets[2] (this would give different value if there are many elements per bucket) is called a load factor of the table. The table is the most efficient when this value is somewhere about 0.7 (so there are many empty buckets). The load factor is a measure how full the hash table is allowed to get before it must be re-arranged, increasing the number of buckets[3]. For maximal performance the table may also be re-arranged is the load factor becomes too low[4].

Hash code

Hash function is closely related to the possible hash code that every key must provide. Hash code can be arbitrary integer. The non negative hash code can be easily transformed into the index of the bucket by applying mod operation ( index = hash_code mod number_of_buckets ). Hash codes, provided by different objects, need not be always different, while this is highly desirable. One of the simple ways to get a hash code is to take the address, where the object is allocated in memory, as it is done in default Java implementation. This approach, however, makes all objects different, including objects with the same internal content (like strings with identical values). Objects that are compared by value (numbers, strings, etc) provide custom hash code method that returns the same value for the objects that are equal to each other.

Closed and open addressing

Hash table, described above, always places new object into the bucket, determined by the value of the the hash function - and no any other. This approach is called closed addressing. It requires a slot to have the capability to grow.

Instead of growing the bucked or rearranging the whole table, after exceeding the bounds of the required bucket some algorithms start placing new values into adjacent buckets. The hash function does not longer strictly determines the bucket that should hold the value but it still helps to determine a group of such buckets, still speeding the search. When searching for a value, several buckets must be scanned. This approach is called open addressing. The sequence of the alternative buckets, where the non-fitting value can be stored, is called a probe sequence. The probe sequence depends on the proposed algorithm. It can be as simple as trying the slot that is directly or by some step adjacent to the occupied slot (linear probing), adding the successive outputs of a quadratic polynomial (quadratic probing) or use of the several alternative hash functions. If it is still not possible to find a free slot, some algorithms start moving the previously stored values to the alternative locations. Please see [5] for overview of these advanced features.

References

  1. 1 LinkedHashMap Java class
  2. 2 http://www.brpreiss.com/books/opus6/docs/Opus6HashTableLoadFactorProperty.html
  3. 3 http://download.oracle.com/javase/1.4.2/docs/api/java/util/HashMap.html
  4. 4 http://srfi.schemers.org/srfi-90/srfi-90.html
  5. 5 http://en.wikipedia.org/wiki/Hash_table