What is Hashing?
A fundamental concept in cyber security, but often misunderstood
Updated
10 min read
Contents
Hashing is what is fundamentally behind a lot of computer security, as well as being a useful tool for a range of other applications. If it didn't exist the internet/computer security fundamentally wouldn't be what it is today, and most likely wouldn't exist.
So, lets start off with a login process for an application, this could be any sort of application from your laptop to Mashoom's login page, the neat thing about this technology is that it is based on maths, not algorithms/ a library of code. It is however worth pointing out that although this is the fundamentals of how logins work this shouldn't be used as a tutorial, there is more to it!
Logging in to something
As an application you need to check that the user's password is the same as a password they entered when they made the account; it is still the same person. This is because this 1 person should know the password, there is no way of getting this information without them knowing, so we have a solid point to start from.
Easy you say, just store their password so you have something to check against! At this point many people stop learning and start coding, and there have been many many incredibly harmful hacks as a result on many very high profile systems.
The issue is what happens if your list of passwords gets stolen. Now lets put this in context, a stolen plaintext password with it's username means that you/the user has lost control of this account in its entirety and their is no way of getting this control back; its very serious. How would that data be stolen? Well, if you are a desktop application you would have to store the password permanently in a file, which could simply be read by another program in a lot of cases. Or you have a web database, sounds safe, but what if you have an employer who is working on the table and copies it to develop the website? Or simply steals it because they are having a bad day? Surely what would be nice is to say this can never happen, rather than try to make sure it doesn't.
Now you're up to speed with the problem, welcome to the solution; hashing. Hashing takes a string of data (a password for instance), and transforms it into another string of data. The clever bit is that input will always produce the same output; this output will be unique to the input. The output also can't be reverse engineered to the input, think of this as "you can't unscramble an omlette". Now, "how" is something to leave to the professionals, all you need to know is that a good hashing algorithms are based on a mathematical process that has been reviewed 1000 times by many experts. A handful rise to the top and are subsequently are relied upon by everyone doing security in the tech world, Wordpress to Google.
So, back to our use case. The user puts in their password, for this case lets say its when they first make the account, you hash their password and store it. The original password is forgotten. So we now have a hash (a hash being the output of a hashing algorithm) for that user. Now remember what a hash is, this can't be used to get the original password. If this data is stolen, the password still isn't revealed. Here is an example of a hash:
$2y$10$LKWX6bfyuJ5C7VhG1JSxFu2Sn1hPikaSQu5m4GGmxlcp67kVsDG3.
This is genuinely one of my passwords (albeit, not for anything important), no-one can take this and find out my original password. At this point I think it would be a good time to say that there are other ways of using a hash to find a password or compromise security, read the first section of this blog post.
Now we want to authorise the user again, they enter a password, you hash this password again, using the same method as before. Now you simply compare the hash you stored and the hash you have just created, if they are identical then the password given is the same as the password used to create the account, you can authorise them!
Other uses
So, as you can see, hashing does a very good job when securing passwords, but there are some more tricks that can be done with them, still based off exactly the same fundamental hashing process. When data is transmitted it can be altered easily, by simply electronic interference through to someone who has hacked a wifi router. You want to know that what you sent remains integral so is delivered unaltered to its recipient.
Lets utilise the other property of a hash, each output should be unique to its input. We simply hash the original bit of data and send this hash along with the data, then the recipient does their own hash of the received data. As before, check one hash vs the other and see if they match.
The same idea is used to identify content, without having to store it and more importantly compare all of it against a given input. Say I have a 1000 page document and want to know if it already exists in a file storage. As long as that file storage has a database of all the hashes of all its files, these hashes can be search to find an identical match to your document's hash. I only need to send you the hash, not the whole document, and only the hashes need to be searched, not all the stored documents in their entirety.
Hash Strength
Hashes are awesome and pretty simple to grasp, but as you can imagine, there is more to know. Firstly, lets talk about a hashes strength. Remember the rule that any given input should have a unique output? Well, an algorithms ability to do this relates to the strength (I've never seen a concrete 'strength' measures, this one is by means of an explanation).
The obvious flaw in an algorithm that produces the same output for a few inputs, someone can guess one of the other inputs and still authorise themselves. Now in reality this is unlikely, the other input would likely be a unintelligible garbled string, not one a human is likely to find by chance.
These are called collisions, they are a big problem because we are talking about computers vs computers, rather than the use scenario above. Essentially we can leave a computer coming up with all sorts of hashes, to be stored in a hash table. The chance that the hash you want to compromise is in the table is vastly better than starting with one hash and then trying to find a second input for it, it's wonderfully called a birthday attack. This attack can be used to compromise a given hash; this attack is common to break digital signatures which use hashing.
This also explains "it would take [X high number of years] to crack a hash". To find a collision, the hashing algorithm must be run many many times with many inputs, then check all the results to see if two match. You can do some basic maths to work out the probability of a match given the amount of time to run a single hashing function. This is why the modern, secure hashing algorithms take a "significant" time to run, this is inherent in their strength. Mathematically, conflicts are so rare, and the time is takes to make one hash relatively long.
Which leads to the headlines "MD5 and SHA-1 hashes have been broken", yes, they have, but not quite like that. Computers getting faster is the reason hashes are "broken"; collisions become easier to find so therefore the hashing algorithm becomes useless when used in certain contexts. Interestingly, collisions are not the problem when it comes to storing passwords safely. Modern hashing functions just take long enough to make brute forcing the original password difficult, salting is also important in this use case, spoken about in the next section.
The fact of the matter is the new algorithms will also be broken, its a matter of time. Time on its own won't be able to break SHA-2 for instance, but time to develop better computers as well as to do the calculations will do it. There is plenty of will, so there will be a way. As an example, quantum computing, the maths and theory suggests that a quantum computer should be able to crack current hashing algorithms.
Salt and Pepper
Eh? That was my first thought when I saw these words anyway. So, remember what we have just said, a hash can potentially be broken by creating loads of them then hopefully find a match to the one you want to break. Salt and peppers are another way to combat this problem, in addition to using a 'strong' hashing algorithm.
Adding pepper to a hash is simply appending something to the input before it gets hashed. Of course this same something has to be appended when you want to check an input. Pretty simple, but, sort of small fry as improved security goes, least not as good as a salt.
A salt is best described as a modification to a hashing algorithm, changing the way the algorithm is performed. Normally, a new one is created for each hash and stored alongside it, so that as a pepper it can be re-used when another input is being checked. It provides the same benefit as the pepper, but it is that this is much more integral to the algorithm.
Both make the output of the hashing algorithm even more random, it would mean to break it you need a hashing table for every possible salt. That is going to take a while/modern this hasn't been anywhere near achieved with modern algorithms.
Salts are a must, they are done standard in the PHP password functions mentioned earlier and make your hashes utter useless if an attacker gets hold of them. Jury is slightly more out on the pepper, its an improvement, but not as good as the salt. Bear in mind that if you are in the game of finding a hash input given a hash, you probably don't care if a pepper has been added to the input.
A word of note to anyone who is hooked on "how" these algorithms work, by all means, I don't understand it and would admire anyone who could even shed a light on the maths behind it. But, don't put your own algorithm into anything that you need to be secure, I do know these algorithms take a lot of time and a sprinkling of genius to come up with, then are reviewed by many many other geniuses... the work has been done.