Thursday, January 11, 2007

Hash tables

Hash tables, a major application for hash functions, enable fast lookup of a data record given its key. For example, keys in an English dictionary would be English words, and their associated records would contain definitions. In this case, the hash function must map alphabetic strings to indexes for the hash table's internal array.

The generally impossible/impractical ideal for a hash table's hash function is to map each key to a unique index (see perfect hashing), because this guarantees access to each data record in the first probe (lookup) in the table.

Hash functions that are truly random with uniform output (including most cryptographic hash functions) are good in that, on average, only one or two probes will be needed (depending on the load factor). Perhaps as important is that excessive collision rates with random hash functions are highly improbable—if not computationally infeasible for an adversary. However, a small, predictable number of collisions is virtually inevitable (see birthday paradox).

In many cases, a heuristic hash function can yield many fewer collisions than a random hash function. Heuristic functions take advantage of regularities in likely sets of keys. For example, one could design a heuristic hash function such that file names such as FILE0000.CHK, FILE0001.CHK, FILE0002.CHK, etc. map to successive indices of the table, meaning that such sequences will not collide. Beating a random hash function on "good" sets of keys usually means performing much worse on "bad" sets of keys, which can arise naturally—not just through attacks. Bad performance of a hash table's hash function means that lookup can degrade to a costly linear search.

Aside from minimizing collisions, the hash function for a hash table should also be fast relative to the cost of retrieving a record in the table, as the goal of minimizing collisions is minimizing the time needed to retrieve a desired record. Consequently, the optimal balance of performance characteristics depends on the application.

One of the most respected hash functions for use in typical hash tables is Bob Jenkins' LOOKUP3 hash function, published in an article in Dr. Dobb's Journal. The hash function performs well as long as there is no adversary, for it is trivially reversible and useless as a cryptographic hash function

Database indexing is an application for hash functions.

No comments: