Hash codes are a great way to handle large amounts of data where finding a particular entry is often required but not a trivial operation. This is often the case with strings, though this strategy applies for many other types of data as well. Suppose you have a large number of strings, and you want to make a list where each string appears only once. If the order of the strings cannot be predicted, and the number of repeats is not extremely high, a good strategy would be to order the list and then scan for repeats. Ordering a random list can be done in ( n log n ) comparisons which can be quite substantial for large n. A string comparison is a relatively slow operation involving a loop of memory fetches. The trick here is to associate each string with a unique number such that each distinct string has its own number ( that is two instances of the same string will have the same number ), or that at least only a few strings get the same hash number. It should be a fast operation to map from strings to numbers, and it should be completely trivial to compare the numbers. This way when you order the numbers, the strings will be ordered also ( not in alphabetical order necessarily, but in some deterministic order ). If you could do this you could order the strings using ( n log n ) trivial integer comparisons.

There are many ways to assign hash numbers to strings. Most of them involve looping over the different characters, and adding, shifting and applying bitwise operators to them. I have not made any careful analysis of the options, but there are many workable solutions and the differences are small. Below is a macro to associate with each input 'string' a number 'hash' that is guaranteed to be represented in fewer than 'num_bits' bits. The 'mask' when bitwise and'd to 'hash' should turn off all the high bits above 'num_bits'. In this code if the number of bits is lower than the size of a 'char' ( 8 bits ), then all the characters are summed in the lower 8 bits of 'hash'. Otherwise the odd bytes are in the low byte, the even bytes are in the high byte.

/* hashes strings with hash schemes up to 16 bits */ #define HASH( string, val, mask, num_bits ) \ val = 0; \ int hash_len = (int)(strlen( string )); \ if ( num_bits <= (int)(sizeof( char ))) \ { \ for ( int hash_i = 0; hash_i < hash_len; \ hash_i++ ) \ { \ val += (int)string[hash_i]; \ } \ } else { \ hash_len--; \ for ( int hash_i = 0; hash_i < hash_len; \ hash_i++ ) \ { \ val += (int)string[hash_i]; \ val += ((int)string[hash_i+1] << 8); \ } \ } \ val = ( val & mask ); class HASHER { friend ostream& operator<<( ostream& os, HASHER& x ); public: HASHER( int num_bits = 8 ); ~HASHER() { delete []_hash; int num = _strings.Num(); for ( int i = 0; i < num; i++ ) delete _strings[i]; } int Index( char* string ); int Register( char* string ); char* String( int idx ){ return _strings[idx]; } int IdxOrAdd( char* string ) { int idx = Index( string ); if ( idx == -1 ) idx = Register( string ); return idx; } protected: int _num_bits; int _mask; int _num_hash_codes; DBLARRAY< int >* _hash; DBLARRAY< char* > _strings; };