Hash table basics
The idea behind hashing is to transform the table key into a number that can be used to directly access an entry. That way, a potentially long search loop turns into some arithmetic and a couple of memory references. Below is a diagram of a simple hashing structure.
A hash table is simply a list of pointers. It must have a fixed length and be contiguous in memory. The addresses in the hash table may point to a table entry or be null, depending on the hashing algorithm.
The table itself contains the information you need. It may be in contiguous storage or spread all over creation, depending on your needs. In addition, each table entry needs a pointer for what's called a synonym chain.
Synonyms are table entries whose keys transform into the same index of the hash table. The hash table points to the first table entry for that hash value, then other entries for that index will be on the synonym chain. If there are no synonyms the chain pointer will be null. I know this goes against the premise that hashing eliminates loops but, unless the hashing algorithm is very badly suited to the table's keys, a search down the synonym chain will always be cheaper.
Hashing algorithms transform the table key into an index whose value is between 0 and one minus the number of entries, inclusive. It then multiplies the index by 4 (or 8 in the 64 bit addressing world) thus creating an offset into the hash table.
As mentioned above, this is highly data dependent and it may be worthwhile to study the table's keys for shortcuts. For instance, if the table key is four bytes long but only the last two changes, a good hashing algorithm might consist of lopping off the first two bytes and using the remainder as the index.
If the key is more variable there are a couple of ways to randomize or boil it down to an index. An example is the Assembler PACK instruction, normally used to convert EBCDIC numbers into packed decimal. PACK works on character strings, too, and can reduce one into a usable index. Say you have a key of ABCDEFGH which is C1C2C3C4C5C6C7C8 in hex. The PACK instruction picks out the right nibble out of each byte to create the five byte binary number 012345678C. This is a good first step, although the resulting number may still need some pummeling to be a good index.
My favorite method is to divide the table key by the number of entries in the hash table and use the remainder as the index. Assume you have a four byte key to hash into a 32 entry hash table. Using ABCD as an example, my algorithm would divide X'C1C2C3C4' by 32 (hex 20) giving a remainder of 4 for use as an index.
Hash table implementation
After deciding on the hashing algorithm and table size you must write the routine to build the structure. This is fairly simple, remembering that the idea is to do all the work ahead of time as to save CPU in executing the search routines. Here is what the logic must do:
- load the object table,
There are a couple efficiency tweaks to the basic idea:
- You can optimize an entry's position on the synonym chain by putting the entries in collating sequence. This allows the search routine to give up when it finds a key higher than the one it's looking for.
- You may also want to add code to your construction routine to collect statistics if you're concerned about synonym chain lengths or the number of empty hashing table slots.
- Lastly, you might try increasing the hash table size if the synonym chains get too long. There are diminishing returns as you will most likely always have synonyms. If the chains are still too long you may want to consider rewriting the hash algorithm.
ABOUT THE AUTHOR: For 24 years, Robert Crawford has worked off and on as a CICS systems programmer. He is experienced in debugging and tuning applications and has written in COBOL, Assembler and C++ using VSAM, DLI and DB2.
This was first published in January 2008