A hash function is any well-defined procedure or mathematical function which converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.

Hash functions are mostly used to speed up table lookup or data comparison tasks — such as finding items in a database, detecting duplicated or similar records in a large file, finding similar stretches in DNA sequences, and so on.

A hash function may map two or more keys to the same hash value. In many applications, it is desirable to minimize the occurrence of such collisions, which means that the hash function must map the keys to the hash values as evenly as possible. Depending on the application, other properties may be required as well. Although the idea was conceived in the 1950s, the design of good hash functions is still a topic of active research.

Hash functions are related to (and often confused with) checksums, check digits, fingerprints, randomization functions, error correcting codes, and cryptographic hash functions. Although these concepts overlap to some extent, each has its own uses and requirements and is designed and optimised differently. The HashKeeper database maintained by the National Drug Intelligence Center, for instance, is more aptly described as a catalog of file fingerprints than of hash values.

Applications

Hash tables

Hash functions are primarily used in hash tables, to quickly locate a data record (for example, a dictionary definition) given its search key (the headword). Specifically, the hash function is used to map the search key to an index; the index gives the place where the corresponding record should be stored. Hash tables, in turn, are used to implement associative arrays and dynamic sets.

In general, a hashing function may map several different keys to the same index. Therefore, each slot of a hash table is associated with (implicitly or explicitly) a set of records, rather than a single record. For this reason, each slot of a hash table is often called a bucket, and hash values are also called bucket indices.

Thus, the hash function only hints at the record's location — it tells where one should start looking for it. Still, in a half-full table, a good hash function will typically narrow the search down to only one or two entries.
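The bucket idea can be illustrated with a minimal sketch. The class name BucketTable, its methods, and the use of Python's built-in hash() are assumptions made for illustration only; real hash tables also resize themselves and use more careful collision handling.

```python
class BucketTable:
    """Toy hash table in which each slot (bucket) holds a list of records."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # The hash function maps the search key to a bucket index.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for pos, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[pos] = (key, value)  # overwrite an existing record
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        # The index only says where to start looking; the bucket is then
        # scanned for the record whose key actually matches.
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default


table = BucketTable()
table.put("hash", "a small datum derived from a key")
table.put("bucket", "a slot holding colliding records")
print(table.get("hash"))
```

Several keys may share a bucket, so the final key comparison inside get() is what guarantees a correct answer.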

Caches

Hash functions are also used to build caches for large data sets stored in slow media. A cache is generally simpler than a hashed search table, since any collision can be resolved by discarding or writing back the older of the two colliding items.
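As a rough sketch of this collision policy, the following direct-mapped cache keeps one entry per slot and simply overwrites the older entry when two keys collide. The class name DirectMappedCache and the load callback (standing in for the slow medium) are hypothetical.

```python
class DirectMappedCache:
    """Toy cache: one entry per slot, collisions evict the older entry."""

    def __init__(self, num_slots, load):
        self.slots = [None] * num_slots
        self.load = load    # stand-in for a slow lookup on the backing medium
        self.misses = 0

    def get(self, key):
        i = hash(key) % len(self.slots)
        entry = self.slots[i]
        if entry is not None and entry[0] == key:
            return entry[1]              # cache hit
        self.misses += 1
        value = self.load(key)
        self.slots[i] = (key, value)     # discard whatever was in the slot
        return value


cache = DirectMappedCache(4, load=lambda n: n * n)
cache.get(3)   # miss: loaded and stored in slot 3
cache.get(3)   # hit
cache.get(7)   # miss: 7 also maps to slot 3 and evicts the entry for 3
```

No bucket lists are needed because losing an entry costs only a later reload.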

Bloom filters

Hash functions are an essential ingredient of the Bloom filter, a compact data structure that provides an enclosing approximation to a set of keys.
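A minimal sketch of a Bloom filter follows, assuming a fixed bit array and several hash functions derived from SHA-256 with a seed prefix. The sizes and the hashing scheme are illustrative choices, not a recommended configuration. The filter may report keys that were never added (false positives), but never misses a key that was added, which is the sense in which it encloses the set.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter over string keys."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # True for every added key; may also be True for other keys.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


members = BloomFilter()
members.add("alice")
members.add("bob")
```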

Finding duplicate records

To find duplicated records in a large unsorted file, one may use a hash function to map each file record to an index into a table T, and collect in each bucket T[i] a list of the numbers of all records with the same hash value i. Once the table is complete, any two duplicate records will end up in the same bucket. The duplicates can then be found by scanning every bucket T[i] which contains two or more members, fetching those records, and comparing them. With a table of appropriate size, this method is likely to be much faster than any alternative approach (such as sorting the file and comparing all consecutive pairs).
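The procedure can be sketched as follows, assuming the records fit in memory as a Python list and using the built-in hash() as the hash function. The function name find_duplicates and the default table size are illustrative.

```python
from collections import defaultdict


def find_duplicates(records, table_size=1024):
    """Return sorted pairs (a, b) of record numbers with equal records."""
    table = defaultdict(list)   # bucket index -> list of record numbers
    for number, record in enumerate(records):
        table[hash(record) % table_size].append(number)

    duplicates = []
    for numbers in table.values():
        if len(numbers) < 2:
            continue            # a lone record cannot have a duplicate
        # Records in one bucket may merely collide, so compare them.
        for pos, a in enumerate(numbers):
            for b in numbers[pos + 1:]:
                if records[a] == records[b]:
                    duplicates.append((a, b))
    return sorted(duplicates)


print(find_duplicates(["ann", "bob", "ann", "cy", "bob"]))
# prints [(0, 2), (1, 4)]
```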

Finding similar records

Hash functions can also be used to locate table records whose key is similar, but not identical, to a given key; or pairs of records in a large file which have similar keys. For that purpose, one needs a hash function that maps similar keys to hash values that differ by at most m, where m is a small integer (say, 1 or 2). If one builds a table T of all record numbers, using such a hash function, then similar records will end up in the same bucket, or in nearby buckets. Then one need only check the records in each bucket T[i] against those in buckets T[i+k], where k ranges between -m and m.
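A small sketch under simplifying assumptions: keys are numbers, "similar" means differing by at most a given tolerance, and the hash function is the integer part of key / tolerance, so similar keys get hash values that differ by at most m = 1. The function name and the sample data are hypothetical.

```python
from collections import defaultdict


def find_similar_pairs(keys, tolerance):
    """Return sorted pairs of record numbers whose keys differ by <= tolerance."""
    def h(key):
        # Keys within `tolerance` of each other land in the same or adjacent bucket.
        return int(key // tolerance)

    table = defaultdict(list)
    for number, key in enumerate(keys):
        table[h(key)].append(number)

    pairs = set()
    for i, numbers in table.items():
        for k in (-1, 0, 1):                     # k ranges from -m to m, m = 1
            for a in numbers:
                for b in table.get(i + k, ()):
                    if a < b and abs(keys[a] - keys[b]) <= tolerance:
                        pairs.add((a, b))
    return sorted(pairs)


print(find_similar_pairs([10.0, 10.4, 25.0, 11.1, 24.7], tolerance=1.0))
# prints [(0, 1), (1, 3), (2, 4)]
```

Only records in the same or neighbouring buckets are ever compared, which is what saves work relative to comparing all pairs.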

This class includes the so-called acoustic fingerprint algorithms, which are used to locate similar-sounding entries in large collections of audio files (as in the MusicBrainz song labeling service). For this application, the hash function must be as insensitive as possible to data capture or transmission errors, and to "trivial" changes such as timing and volume changes, compression, etc.

Finding similar substrings

The same techniques can be used to find equal or similar stretches in a large collection of strings, such as a document repository or a genomic database. In this case, the input strings are broken into many small pieces, and a hash function is used to detect potentially equal pieces, as above.

The Rabin-Karp algorithm is a relatively fast string searching algorithm that works in O(n) time on average. It is based on the use of hashing to compare strings.
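A compact sketch of the Rabin-Karp idea is shown below, using a polynomial rolling hash. The base and modulus are arbitrary illustrative choices, and returning an empty list for an empty pattern is a convention of this sketch.

```python
def rabin_karp(text, pattern, base=256, modulus=1_000_000_007):
    """Return the start positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    high = pow(base, m - 1, modulus)   # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % modulus
        t_hash = (t_hash * base + ord(text[i])) % modulus

    matches = []
    for start in range(n - m + 1):
        # Equal hashes only suggest a match; confirm by comparing the strings.
        if p_hash == t_hash and text[start:start + m] == pattern:
            matches.append(start)
        if start + m < n:
            # Roll the window: drop text[start], append text[start + m].
            t_hash = ((t_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % modulus
    return matches


print(rabin_karp("abracadabra", "abra"))
# prints [0, 7]
```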

Geometric hashing

This principle is widely used in computer graphics, computational geometry and many other disciplines, to solve many proximity problems in the plane or in three-dimensional space, such as finding closest pairs in a set of points, similar shapes in a list of shapes, similar images in an image database, and so on. In these applications, the set of all inputs is some sort of metric space, and the hashing function can be interpreted as a partition of that space into a grid of cells. The table is often an array with two or more indices (called a grid file, grid index, bucket grid, and similar names), and the hash function returns an index tuple. This special case of hashing is known as geometric hashing or the grid method. Geometric hashing is also used in telecommunications (usually under the name vector quantization) to encode and compress multi-dimensional signals.
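The grid method can be sketched for points in the plane as follows. The hash function returns an index tuple (the cell containing the point), and only the 3 × 3 block of cells around each cell needs to be examined. The function name close_pairs and the sample points are assumptions for illustration; math.dist requires Python 3.8 or later.

```python
from collections import defaultdict
from math import dist, floor


def close_pairs(points, radius):
    """Return sorted pairs of point numbers at distance <= radius."""
    def cell(p):
        # Hash a point to the index tuple of its grid cell.
        return (floor(p[0] / radius), floor(p[1] / radius))

    grid = defaultdict(list)
    for number, p in enumerate(points):
        grid[cell(p)].append(number)

    pairs = set()
    for (cx, cy), numbers in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for a in numbers:
                    for b in grid.get((cx + dx, cy + dy), ()):
                        if a < b and dist(points[a], points[b]) <= radius:
                            pairs.add((a, b))
    return sorted(pairs)


points = [(0.0, 0.0), (0.5, 0.5), (3.0, 3.0), (0.9, -0.2), (3.2, 2.9)]
print(close_pairs(points, radius=1.0))
# prints [(0, 1), (0, 3), (1, 3), (2, 4)]
```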

Properties

Good hash functions, in the original sense of the term, are usually required to satisfy certain properties listed below. Note that different requirements apply to the other related concepts (cryptographic hash functions, checksums, etc.).

Low cost

The cost of computing a hash function must be small enough to make a hashing-based solution advantageous with regard to alternative approaches. For instance, binary search can locate an item in a sorted table of n items with log₂ n key comparisons. Therefore, a hash table solution will be more efficient than binary search only if computing the hash function for one key costs less than performing log₂ n key comparisons. However, this example does not take sorting the data set into account. Even very fast sorting algorithms such as merge sort take an average of n log n time to sort a set of data, and so the efficiency of a binary search solution is reduced as the frequency with which items are added to the data set increases. One advantage of hash tables is that they do not require sorting, which keeps the cost of the hash function constant regardless of the rate at which items are added to the data set.
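To make the comparison budget concrete, the following sketch counts the key comparisons made by one binary search over a sorted table of a million integers. The function name is hypothetical, and the count covers only the search loop, not the cost of sorting.

```python
from math import floor, log2


def binary_search_comparisons(sorted_items, target):
    """Return (position, number of key comparisons) for a lower-bound search."""
    lo, hi = 0, len(sorted_items)
    comparisons = 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo, comparisons


n = 1_000_000
position, comparisons = binary_search_comparisons(range(n), 999_999)
print(comparisons, "comparisons; log2(n) is about", round(log2(n), 1))
```

For a table of this size, a hash function pays off only if evaluating it is cheaper than roughly twenty key comparisons.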

Determinism

A hash procedure must be deterministic — meaning that for a given input value it must always generate the same hash value. In other words, it must be a function of the hashed data, in the mathematical sense of the term. This requirement excludes hash functions that depend on external variable parameters, such as pseudo-random number generators that depend on the time of day. It also excludes functions that depend on the memory address of the object being hashed, if that address may change during processing (as may happen in systems that use certain methods of garbage collection), although sometimes rehashing of the item can be done.
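As an illustration, the sketch below contrasts a deterministic hash (32-bit FNV-1a, chosen here only as a simple example) with a procedure that mixes in the time of day and therefore violates the requirement. Both function names are illustrative. Note also that Python's built-in hash() for strings is, by default, salted per process, so it is deterministic within one run but not across runs.

```python
import time


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: depends only on the bytes being hashed."""
    h = 0x811C9DC5                           # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF    # multiply by the FNV prime, keep 32 bits
    return h


def time_dependent_hash(data: bytes) -> int:
    """NOT a valid hash function: the result depends on the clock."""
    return fnv1a_32(data) ^ (time.time_ns() & 0xFFFFFFFF)


# The deterministic function returns the same value on every call.
print(fnv1a_32(b"headword") == fnv1a_32(b"headword"))
# prints True
```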

Uniformity

A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability. The reason for this last requirement is that the cost of hashing-based methods goes up sharply as the number of collisions — pairs of inputs that are mapped to the same hash value — increases. Basically, if some hash values are more likely to occur than others, a larger fraction of the lookup operations will have to search through a larger set of colliding table entries.
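The effect of a non-uniform hash can be seen by counting how many keys land in each bucket. In this sketch, the key set, the two hash functions (first character versus CRC-32 from the standard zlib module), and the bucket count are all illustrative assumptions.

```python
import zlib
from collections import Counter


def bucket_loads(keys, hash_function, num_buckets):
    """Count how many keys fall into each bucket."""
    return Counter(hash_function(key) % num_buckets for key in keys)


def first_letter(key):
    return ord(key[0])                 # poor: ignores most of the key


def crc(key):
    return zlib.crc32(key.encode())    # uses every byte of the key


names = [f"member{n:03d}" for n in range(100)]
poor = bucket_loads(names, first_letter, 16)
better = bucket_loads(names, crc, 16)

print("largest bucket, first letter:", max(poor.values()))
print("largest bucket, CRC-32:", max(better.values()))
```

With the first-letter hash every name collides in a single bucket, so each lookup degenerates into a scan of all 100 entries.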

Note that this criterion only requires the value to be uniformly distributed, not random in any sense. A good randomizing function is usually good for hashing, but the converse need not be true.

Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the global set of all possible entries.

In other words, if a typical set of m recor
