News Stay informed about the latest enterprise technology news and product updates.

Searching tables for reference data

Robert Crawford explains some methods he's used to search a table for reference data.

This column originally appeared on TechTarget's Expert Answer Center as a post in Robert Crawford's blog. Robert served as the on-demand expert on the Expert Answer Center for two weeks in January to February 2006, during which he was available to quickly answer questions on CICS application performance and design, as well as to write daily blog entries. Keep an eye on the Expert Answer Center for topics that could help your IT shop.

There seem to be a few things that all programmers must do. One of those is search a table for some reference data. There are lots of ways to go through tables, some of which are more efficient than others. In this blog entry I'll talk about some of the methods I've used.

First, of course, there's the sequential search from the top to the bottom. This works well enough for small tables or every-once-in-a-while searches. About the only improvement you can make to this scheme is to sort the table entries by the search data element or key. That way, your search loop ends either when you find your row or a table entry whose key is lower than the search data.

Next comes the binary search. For this to work the table entries must be laid in order in contiguous storage. It's kind of like trying to guess a number between 0 and 100 by asking, "Is it 50?" and then halving the range of numbers at each subsequent try. With this method you use an index to address the row you're interested in. As you go through the loop you must also track the upper and lower bounds of your search. At initialization, the upper bound is the last row in the table while the lower bound is the first. You add the two together and divide by two to get the index of the middle entry. If that's your entry, you're done. If your data is lower, set the higher bound to the current index. If the data is higher, you set the lower bound to the current entry. At the top of the loop you add the bounds together and try again. You know the row isn't in the table when the upper and lower bounds are the same.

The last option is probably the fastest, although it tends to use a lot of storage. It's called hashing and it involves transforming the table's key into an offset of a hash table. The address in a hash table points to entry. The number of entries in the hash table depends on all the possible outcomes from the hashing routine. However, since hashing algorithms aren't perfect, you may have to deal with a "synonym chain," which links together all the table keys that hash to the same offset. As an example, I wrote a program that hashed a four-byte transaction ID by dividing it with the number of table entries. Since I used the remainder from the division as my offset into the hash table, my hash table had as many entries as the main table. Another method uses the PACK instruction to compress the data by eliminating some nibbles (byte halves). With the offset from the hashing routing, you pick up the address of the table entry from the hash table. If the entry matches, you're done. If it didn't, you have to run the synonym chain, if it exists. If you don't find the entry you were looking for at the end of the chain, you know the row wasn't in the table.

Once again, some of these algorithms are easier to implement in some languages than others. The programmer also has to be comfortable with using data as numbers and doing math with pointers. However, there are performance gains especially when performing repetitive searches or looking through long tables.

Dig Deeper on IBM system z and mainframe systems

Start the conversation

Send me notifications when other members comment.

Please create a username to comment.