Hash table how-tos for mainframe programmers

The idea behind hashing is to transform the table key into a number that can be used to directly access an entry. Mainframe guru Robert Crawford explains.

I like to put referential data into external tables. It makes programs easier to maintain, and it's much safer to change a table than logic. The only catch is that system-level code, particularly in exits, must be very efficient. You wouldn't want to do a linear search of a fifty-entry table for every file I/O, as that would bog things down. Ultimately the answer is to use a hash table.

Hash table basics
The idea behind hashing is to transform the table key into a number that can be used to directly access an entry. That way, a potentially long search loop turns into some arithmetic and a couple of memory references. Below is a diagram of a simple hashing structure.

[Diagram: a simple hashing structure]

A hash table is simply a list of pointers. It must have a fixed length and be contiguous in memory. The addresses in the hash table may point to a table entry or be null, depending on the hashing algorithm.

The table itself contains the information you need. It may be in contiguous storage or spread all over creation, depending on your needs. In addition, each table entry needs a pointer for what's called a synonym chain.

Synonyms are table entries whose keys transform into the same hash table index. The hash table points to the first table entry for that hash value; any other entries for the same index go on the synonym chain. If there are no synonyms, the chain pointer will be null. I know this goes against the premise that hashing eliminates loops but, unless the hashing algorithm is very badly suited to the table's keys, a search down a synonym chain will always be cheaper than a linear search of the whole table.
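In C, the two pieces described above might be declared like this. This is a sketch only; the names and field sizes are my own, assuming a four-byte key and a 32-slot hash table:

```c
#include <stddef.h>   /* NULL */

#define HASH_SLOTS 32          /* fixed number of pointer slots (assumed) */

/* One entry in the object table. Entries can live anywhere in storage;
 * the chain pointer links synonyms together. */
struct entry {
    char          key[4];      /* the table key */
    char          data[16];    /* the referential data itself */
    struct entry *chain;       /* next entry with the same hash, or NULL */
};

/* The hash table proper: a fixed-length, contiguous list of pointers.
 * A slot stays NULL until some entry's key hashes to it. */
static struct entry *hash_table[HASH_SLOTS];
```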

Hashing algorithms
A hashing algorithm transforms the table key into an index whose value is between 0 and the number of entries minus one, inclusive. The index is then multiplied by 4 (or 8 in the 64-bit addressing world), creating an offset into the hash table.

As mentioned above, this is highly data dependent, and it may be worthwhile to study the table's keys for shortcuts. For instance, if the table key is four bytes long but only the last two bytes change, a good hashing algorithm might consist of lopping off the first two bytes and using the remainder as the index.

If the key is more variable, there are a couple of ways to randomize it or boil it down to an index. One example is the Assembler PACK instruction, normally used to convert EBCDIC numbers into packed decimal. PACK works on character strings, too, and can reduce one to a usable index. Say you have a key of ABCDEFGH, which is C1C2C3C4C5C6C7C8 in hex. The PACK instruction picks the right nibble out of each byte (swapping the two nibbles of the last byte) to create the five-byte value 012345678C. This is a good first step, although the resulting number may still need some pummeling to be a good index.
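For illustration, the effect of PACK on an eight-byte field can be mimicked in C. This is a sketch of the nibble shuffling only; the real instruction takes variable-length operands:

```c
#include <stdint.h>

/* Emulate PACK on an 8-byte field: take the right nibble of each of the
 * first seven bytes, then swap the nibbles of the last byte, producing
 * a five-byte result (padded on the left with a zero nibble). */
static void pack8(const uint8_t in[8], uint8_t out[5])
{
    out[0] = in[0] & 0x0F;                              /* 0 + digit 1 */
    out[1] = (uint8_t)((in[1] & 0x0F) << 4 | (in[2] & 0x0F));
    out[2] = (uint8_t)((in[3] & 0x0F) << 4 | (in[4] & 0x0F));
    out[3] = (uint8_t)((in[5] & 0x0F) << 4 | (in[6] & 0x0F));
    out[4] = (uint8_t)((in[7] & 0x0F) << 4 | (in[7] >> 4)); /* swapped */
}
```

Feeding it C1C2C3C4C5C6C7C8 yields the 01 23 45 67 8C from the example above.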

My favorite method is to divide the table key by the number of entries in the hash table and use the remainder as the index. Assume you have a four-byte key to hash into a 32-entry hash table. Using ABCD as an example, my algorithm would divide X'C1C2C3C4' by 32 (hex 20), giving a remainder of 4 for use as an index.
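The division method is easy to sketch in C. Here the four key bytes are treated as one big-endian number and the remainder is taken modulo the slot count (the function name is my own):

```c
#include <stdint.h>

/* Treat the 4-byte key as a big-endian number and take the remainder
 * modulo the number of hash table slots. */
static unsigned hash_key(const unsigned char key[4], unsigned slots)
{
    uint32_t n = (uint32_t)key[0] << 24 | (uint32_t)key[1] << 16
               | (uint32_t)key[2] << 8  | (uint32_t)key[3];
    return n % slots;           /* index in the range 0 .. slots-1 */
}
```

With EBCDIC "ABCD" (X'C1C2C3C4') and 32 slots, the remainder is 4, matching the worked example.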

Hash table implementation
After deciding on the hashing algorithm and table size, you must write the routine that builds the structure. This is fairly simple, remembering that the idea is to do all the work ahead of time so as to save CPU when executing the search routines. Here is what the logic must do:

  • Load the object table.
  • Go through the table and hash each entry's key.
  • Store the entry's address in the hash table if its slot is empty. If the slot is not empty, insert the entry into the synonym chain.

Searching then becomes very quick. The search routine runs the search key through the hashing algorithm, which returns an offset into the hash table. Next, it picks up the pointer to the table entry and checks the key. If the keys don't match, it goes down the synonym chain, if one exists. Depending on the efficiency of the hash, a search may be accomplished with four or five instructions instead of a repetitive loop.
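Putting the build and search steps together, a minimal C sketch might look like this. The names are my own, and the hash is the division-remainder method described earlier (here over ASCII bytes rather than EBCDIC):

```c
#include <stddef.h>
#include <string.h>

#define SLOTS 32

struct entry {
    char          key[5];       /* 4-byte key plus a terminator */
    struct entry *chain;        /* synonym chain */
};

static struct entry *slots[SLOTS];

/* Division-remainder hash: key bytes as a number, modulo the slot count. */
static unsigned hash(const char *key)
{
    unsigned n = 0;
    for (int i = 0; i < 4; i++)
        n = n << 8 | (unsigned char)key[i];
    return n % SLOTS;
}

/* Build step: hash the entry's key; fill the slot if it's empty,
 * otherwise push the entry onto the front of the synonym chain. */
static void insert(struct entry *e)
{
    unsigned h = hash(e->key);
    e->chain = slots[h];        /* NULL when the slot was empty */
    slots[h] = e;
}

/* Search: one hash, one pointer fetch, then walk synonyms if needed. */
static struct entry *lookup(const char *key)
{
    for (struct entry *e = slots[hash(key)]; e != NULL; e = e->chain)
        if (memcmp(e->key, key, 4) == 0)
            return e;
    return NULL;                /* not in the table */
}
```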

There are a couple of efficiency tweaks to the basic idea:

  • You can optimize an entry's position on the synonym chain by putting the entries in collating sequence. This allows the search routine to give up when it finds a key higher than the one it's looking for.
  • You may also want to add code to your construction routine to collect statistics if you're concerned about synonym chain lengths or the number of empty hash table slots.
  • Lastly, you might try increasing the hash table size if the synonym chains get too long. There are diminishing returns, though, as you will most likely always have some synonyms. If the chains are still too long, you may want to consider rewriting the hashing algorithm.
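The first tweak above, keeping each synonym chain in collating sequence, lets the search give up early. A sketch in C, assuming the entry layout used earlier:

```c
#include <stddef.h>
#include <string.h>

struct entry {
    char          key[5];       /* 4-byte key plus a terminator */
    struct entry *chain;        /* synonym chain, kept in key order */
};

/* Walk a synonym chain kept in collating sequence; give up as soon as
 * a chain key is higher than the one being searched for. */
static struct entry *lookup_sorted(struct entry *head, const char *key)
{
    for (struct entry *e = head; e != NULL; e = e->chain) {
        int cmp = memcmp(e->key, key, 4);
        if (cmp == 0)
            return e;           /* found it */
        if (cmp > 0)
            break;              /* past it: no point going on */
    }
    return NULL;
}
```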
Although hashing algorithms are processor-efficient, they waste storage: any number of slots in the hash table may be empty, in addition to the extra space needed for the synonym chain pointers. Still, after 60 years the processor remains the computer's most precious resource, and system-level code must be efficient enough to get out of the way quickly so the business can run.

ABOUT THE AUTHOR: For 24 years, Robert Crawford has worked off and on as a CICS systems programmer. He is experienced in debugging and tuning applications and has written in COBOL, Assembler and C++ using VSAM, DLI and DB2.
