Whenever I teach Learning Tree Course 468, System and Network Security Introduction people ask what “hashing” is. I generally describe it from a computer science perspective.
Let’s say you want to build a tool to search for words in the dictionary. There are lots of ways to process a list of words to search for a particular word including a linear search (very slow) and a binary search (faster). One way it’s done is to create a table of the words and where they are in the list: the a’s begin at word 1, the b’s at word 1200 and so on. One could even be more efficient and say the aa’s begin at word 1, the ab’s at word 5 and such. The table of where the words begin is a hash table. The first instance has 26 entries (in English). The entries in the table are called ‘buckets’.
The way one uses the table is to look at the first letter of the word being sought, go that far into the hash table and begin searching the dictionary there. It greatly speeds up the search. Using two letters would even be faster. One could get more complex, though, and make a very large hash table – a function (which we call a ‘hash function’) would be used to process the word being sought and a number then generated. We call that number a ‘hash value’. The number would say where in the table to look. If the number could be large (and thus the table large), each word in the dictionary might have its own entry in the table. In order to keep things reasonable though, the table can be structured so each entry is a list of possible words that correspond to the hash value.
If multiple words can correspond to a given hash value, we call that a collision. So if both ‘bicycle’ and ‘brontosaurus’ have the value 1000, a collision occurs. This is not is unusual in this type of hash table.
When we use hashes for passwords and integrity-checking documents, though, we don’t want collisions. To avoid them we use well-designed algorithms (e.g. SHA-512) which produce long hashes (512 bits means 1.3×10154 or 1 followed by 154 zeroes possible hash values). We also want the hashes to be unpredictable so someone cannot intentionally create a collision.
Hash values can be used to verify the integrity of a document. The process is simple: when the document is created the hash value is computed; to verify that the document hasn’t been altered the hash value is re computed. If the two values are the same, the documents must be the same. This can also be used for images, audio or video segments.
While the math behind hash functions isn’t simple (or particularly pretty) the concepts are. If you have questions about hashing, let us know in the comments below.