Hi, I'm Curtis Lassam.
(That's classam on twitter)
I'm going to do my best to show you why this fundamental technique belongs in your toolkit.
You can tell from my drawing of a toolkit: I am not a handy man.
So, we're going to talk about hash functions, hash tables, bloom filters, choosing hash functions, and hashes in security.
Let's get started - what IS a hash function?
Here's an example. Let's write a function that takes a string
splits it into characters
converts every character into an integer
adds the integers together
and mods the result by 100.
This is a hash function.
Not every hash function works THIS way, but this one - like many other functions - is a hash.
It has a number of useful properties.
Regardless of what sequence of characters you put in, you're guaranteed that the output will be between 0 and 99.
The output for a string will always be the same.
Many different inputs can produce the same output.
And given only the output, you can't work backwards to the input that produced it.
I know, I know, invoking the wikipedia definition of something is the presentation equivalent to opening a speech with "Webster's Dictionary Defines", but it's a good definition.
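The toy hash from the last few slides fits in a few lines of Python - a sketch, with the name cheese_hash my own label for it:

```python
def cheese_hash(s):
    """Split a string into characters, convert each to an integer,
    add them together, and mod the result by 100."""
    return sum(ord(c) for c in s) % 100

# Whatever goes in, the output always lands between 0 and 99,
# and the same string always produces the same number.
cheese_hash("cheese")
```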
So, that's a hash function - now let's talk about Hash Tables.
Hash tables are a data structure concept used to store keys and values.
Keys and values are important - you deal with them whenever you work with things like Python dicts.
And if you're dealing with keys and values, chances are, behind the scenes, something clever with hash tables is happening.
It could be something else, like a tree or a trie, but those aren't what we're talking about right now, so I'm going to pretend that they don't exist.
So let's start with a big block of memory - let's say an array with 100 elements.
We want to store the value "#215d29" against the key "color".
So we run a hash function on the key, "color"
this produces a number
which we mod by the length of our table.
this gives us an index.
and then we can just store our value at the location pointed to by the index.
That's a hash table. Easy peasy.
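Here's a minimal sketch of that store step in Python - using the built-in hash() as a stand-in for our hash function, and ignoring collisions entirely for the moment:

```python
# a big block of memory: an array with 100 elements
table = [None] * 100

def store(table, key, value):
    # hash the key, mod by the table length to get an index,
    # then drop the value at that index
    index = hash(key) % len(table)
    table[index] = value

def lookup(table, key):
    return table[hash(key) % len(table)]

store(table, "color", "#215d29")
```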
But it is not quite so easy, nor quite so peasy.
As the array fills up with values,
it gets more and more likely that we'll have a hash value point to a spot in the array that's already full.
This is called a "collision".
What do we do when there's already a value in the spot where we want to store a value?
We can keep walking forward in the table until we find an available spot.
This is called 'linear probing'.
Alternatively, we could root a linked list at every space in the array. That way, our hash table can take as many values as we can throw at it.
Or a tree. We could also root a tree at every space in the table.
This - rooting a separate data structure at every spot in the table - is called a 'chained hash table'.
Here, there's a bunch of different values stored where we're looking for the key, "cheese".
How do we know which is the right one?
We need to store the key with the value, so that we can make sure we're retrieving the right thing.
This is going to be the case any time we have to deal with collision - if we can retrieve multiple values from our hash table, we need to be able to tell which is the correct one.
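A sketch of a chained hash table in Python, with each bucket storing key-value pairs so retrieval can check the key (the class and method names here are my own):

```python
class ChainedHashTable:
    """A hash table that roots a list at every slot to absorb collisions."""

    def __init__(self, size=100):
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # store the key WITH the value

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:                  # the key tells us which entry is ours
                return v
        return None
```

With only two buckets, collisions are guaranteed, and the stored keys are what keep "cheese" and "color" from getting mixed up.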
Even with some strategy for collision detection in place, it's possible for the table to get so full that it performs very sluggishly.
A crowded chained-hash is little better than a linked list.
or - in the case of hashing strategies that just shuffle addresses around - it's possible for the table to become completely full.
When this happens, it's time to rebuild the hash.
This is the time-consuming process of addressing an even bigger whack of memory,
Taking all of the keys out of the first array, re-hashing them, and putting them in the second array.
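The rebuild step might look something like this sketch, assuming the chained buckets-of-pairs layout from before:

```python
def rebuild(old_buckets):
    # address an even bigger whack of memory...
    new_buckets = [[] for _ in range(len(old_buckets) * 2)]
    # ...then take every key out of the old array, re-hash it,
    # and put it in the new array
    for bucket in old_buckets:
        for key, value in bucket:
            new_buckets[hash(key) % len(new_buckets)].append((key, value))
    return new_buckets
```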
There's one language I can think of, whose default HashTable implementation can perform this computationally intensive rebuild step unexpectedly any time an insert pushes the table above a pre-defined load limit.
But for the sake of politeness I'm not going to mention which language.
Of course, linear probing and chained hashing are not the only hash table management strategies - There are many, many ways to manage a hash table.
Like robin-hood hashing
which steals data from your richer tables
and inserts it to your poorer tables.
Or hopscotch hashing,
where you implement the entire array in chalk, on the pavement.
You probably shouldn't fact check me on those last two things.
Insert? Constant time - on average, anyway.
Delete? Constant time.
Lookup? Constant time.
Search? Uuuuh. Don't search a hash table.
Okay, so, that first few minutes of the presentation was there to get you up to speed on the basics. Now let's get to some of the meat!
A bloom filter is a great way to keep balloons out of your face.
A bloom filter is a data structure that's fast
and space efficient
used to test for membership in a set.
That's important - it tests for set membership, it doesn't store any data.
It can tell you if something is in a set, but you can't retrieve an item from the set.
Like, if we have three objects - banana, apple, and bowling ball.
And a bloom filter, representing the set of 'fruit'.
We can use the set to determine that banana and apple are probably 'fruit', and 'bowling ball' is not.
But we can't give you a list of all of the fruit that were used to populate the set.
We don't know them.
So, a lot of the time, Bloom Filters are used to answer questions like
Is "chumpys" a real word?
No. No it is not.
Is evil.ru a malicious website?
I don't know.
Is main.css in the cache?
Yes. Yes it is.
Is this a banned image?
Okay, let's look at this image problem a little bit.
Let's imagine we're running a forum and our itinerant users keep posting banned images.
How do we know, looking at image data, if an image has been banned?
let's say we run a hash function on the target image,
yes, you can hash images - you can hash most anything
mod the result by the length of an array
move to that index location in the array
and save a link to the image there.
Then, when we're checking a new image, we can hash it, and then check to see if it's in our array.
of course, this is just a bog-standard hash table, and I'm supposed to be talking about Bloom Filters
this works fine, but it takes up a lot of space
and I promised you space efficiency
Let's imagine there are 5000 images we want to keep out of our forums and they take up 100 kilobytes each.
That means about a 500 megabyte table of duplicate images. While it's not a tonne of space, it's enough that you probably won't want to keep the whole thing in RAM.
remember, though, that we don't need storage from our data structure
we're only interested in whether or not this image exists
let's imagine that we store zeroes in every space in our table
and we store ones where our hash functions land.
that way, we can check if an image is banned
by hashing it, modding it
and then checking if that spot in the array has a 1 in it.
if there's no 1 there, we can't possibly have seen that image before.
there's only one slight problem with this technique.
what happens when we have a different image that accidentally collides with an image that we've set earlier?
this creates a false positive, and unfairly takes pictures of Nicolas Cage out of circulation
How do we stop collisions like this from occurring?
Well, we can use a hash function that guarantees an astronomically low probability of collision.
MD5, for example, is a hash function designed so that collisions are astronomically unlikely.
But with that astronomically low probability of collision comes, axiomatically, an astronomically large number of potential outputs. With one bit for every output, you're suddenly saddled with four times ten to the 25 terabytes, which is just a little bit unrealistic to implement as a bit array.
it does hit on one of the two strategies we can use to reduce collisions, though
the obvious one: more space.
now let's look at the less obvious one:
more hash functions.
instead of hashing our image just once, let's hash it twice, with two different hash functions
this gives us two different locations in the table, and we can store a one at both of them
a different image might share the result of one of the hash functions
but it's not as likely to share the result of all of them
and if any of the hash functions point to a location with a zero in it, we know that this object can never have been entered into the Bloom Filter
So, this is a bloom filter.
A bit field, and multiple hash functions to set the bits in that field.
We put items in by hashing them multiple times and setting the bits at all of those locations
And we check if items are in the set by hashing items multiple times and checking if the bits are set at all of those locations.
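Putting the pieces together, a toy Bloom Filter might look like this - note it fakes its multiple hash functions by prefixing a seed and running one hash (SHA-1 from the standard library, purely for convenience in this sketch; a real implementation would reach for something non-cryptographic like Murmur3, for reasons we'll get to):

```python
import hashlib

class BloomFilter:
    """A bit field, and multiple hash functions to set bits in that field."""

    def __init__(self, size=71888, num_hashes=10):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _indexes(self, item):
        # simulate k different hash functions by seeding one hash
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        # hash the item multiple times, set the bit at every location
        for i in self._indexes(item):
            self.bits[i] = 1

    def might_contain(self, item):
        # if ANY location holds a zero, the item was never added
        return all(self.bits[i] for i in self._indexes(item))
```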
There are some downsides to this.
Because each item we put into the Bloom Filter sets multiple bits, and items can share bits, we can never delete anything from the Bloom Filter
if we do
we run the risk of accidentally deleting something else from the filter.
on top of that, we've reduced the chance of collisions, but it's impractical to reduce that chance to effectively zero, so there's always some chance of a false positive.
however, this probability is at least a number that we have control over
if we know the desired probability of a collision
- in the case of our image filter, let's say 0.1%.
and the number of things we want to put in the filter
as we mentioned before, about 5000 images
we can use handwavy math to determine how much space we need
and how many hash functions we need
to get an optimal solution
you don't need to look too closely - it's quite easy to find these formulas online
so, doing the math with our numbers, we need an array of 71,888 bits - 8.8 kilobytes
that's small enough that we could keep it in RAM - heck we could send that bloom filter to the user and run it client-side. It's certainly easier to work with than 500 megabytes of image.
and 10 different hash functions -
that is to say, the optimal packing of this data requires that each item we place in the table set 10 different bits in the bloom filter
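The handwavy math, for the curious, is usually given as m = -n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions; a quick sketch to check our numbers:

```python
import math

def bloom_parameters(n, p):
    """Given n items and a desired false-positive probability p,
    return (bits needed, hash functions needed)."""
    m = -n * math.log(p) / (math.log(2) ** 2)   # optimal bit-array size
    k = (m / n) * math.log(2)                   # optimal number of hashes
    return math.ceil(m), round(k)

# 5000 images at a 0.1% false-positive rate
bloom_parameters(5000, 0.001)
```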
A lot of the time, Bloom Filters are paired up with other data structures that handle the storage of the items.
For example, they are useful when paired with data structures that exhibit worst-case performance when searching for items that are not in the data-structure.
In a linked list, or unsorted large array, for example, you get the worst possible case performance when you're searching for an item that just isn't there.
The bloom filter can check, before you hit the data structure, if the data is in there.
they are also very useful when the retrieval step for data takes a long time - for example, when a network call needs to be made to a far away database, a local bloom filter is a wonderfully fast way to know if you are wasting everybody's time with a request for something that doesn't exist.
or, data with a very low hit rate - if you're dealing with the sort of data where you're getting 10 misses for every hit, you can catch all of those misses with a bloom filter.
Let's talk about how to pick which hash functions to use.
I mean, I showed you my awesome cheesehash function earlier, but it's terrible, and on top of that it only really works on strings.
Now, the different hash functions out there are countless - each one a unique snowflake.
only three of these are made-up
So which ones do we pick for our data structures?
well, for data-structures, the two properties we're looking for in a hash are that the hash should be fast, and well-distributed
When we say 'fast', what we mean is 'non-cryptographic'.
Cryptographic hashes are awesome. They have a bunch of properties that make them totally bad-ass for security functionality.
The most important one being that they're "collision resistant" - it's astronomically unlikely that you will be able to find two different items that hash to the same value.
But these cryptographic features also make them a lot more processor-hungry.
So we should avoid hashes like SHA or MD5 when we're working on data structure stuff, because we don't need awesome security features in our Bloom Filter. It's just a waste of CPU cycles.
And well-distributed, which means that no matter how similar your data is going in to the hash function,
the output appears all over the spectrum.
Hash functions that exhibit this quality are known as 'avalanching' hashes
because small changes in the input
lead to large changes in the output.
Of course, this is also a desirable property in cryptographic hashes, but it's one we're willing to blow our precious processor time on: data structures that depend on hash functions really do need well-distributed hash functions.
A common hash used for this purpose is the non-cryptographic, well-avalanching, public domain Murmur3 - implementations of which exist for most modern languages
- and which has appeared in numerous open-source products, including Hadoop, Cassandra, and nginx.
It also takes a seed value, so you can create dozens of different hash functions out of Murmur3
just by changing the seed value.
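Murmur3 itself lives in third-party packages (mmh3, in Python's case), but the seeding idea is easy to demonstrate with the standard library's keyable blake2b - which is cryptographic, so it's only a stand-in here:

```python
from hashlib import blake2b

def seeded_hash(data, seed):
    # the key parameter plays the role of Murmur3's seed: a different
    # seed effectively gives you a completely different hash function
    h = blake2b(data, key=seed.to_bytes(8, "big"), digest_size=8)
    return int.from_bytes(h.digest(), "big")
```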
There are some reasons, though, that you might want cryptographic hashes in your data structures -
a clever attacker might take advantage of your data structure and only insert data that matches a certain small set of hashes, forcing collision after collision until your whole application falls over!
it's been demonstrated that Python is vulnerable to this sort of attack, although the BDFL called this "some security researchers drumming up business"
In PEP 456, Christian Heimes lays out the reasoning behind choosing a cryptographic hash function as the default hash function for Python strings and bytes, comparing the cryptographic SipHash with the non-cryptographic MurmurHash as well as the previous CPython hash implementation.
This PEP was accepted, and as of Python 3.4, the default hash function for strings and bytes is SipHash.
One other thing - we hashed an image to put in our data structure, earlier.
What sort of hash function works on an image?
Well, most of them, really, but a perceptual hash, or pHash, is designed specifically to cluster very similar images together in the hash output.
For example, if it's the same image, but sized larger, or skewed to the left, it should end up with a hash location very close to or identical to the hash location of the original image.
Of course, by nature, this hash function won't be distributed in the way that we would need for an optimal general-purpose data structure
It's the opposite of an 'avalanching' hash - small changes in the input lead to almost no changes in the output.
But we can abuse that property so that false positives are unfairly clustered on images that look very similar to our banned images. Which would actually probably be a good thing.
Okay, so, that concludes the data-structures portion of the presentation.
Now, let's talk about how hash functions contribute to the security of your application.
Okay, you're running a web application, and the worst-case scenario happens.
Some hacker makes off with the user table from your database.
I don't know, for the sake of argument let's say SQL Injection.
At this point your users are all shit out of luck, right?
I mean, some brigand has made off with all of their passwords.
But no, because our developers have cleverly obscured the passwords before saving them.
Hopefully you're already aware of this.
So, in order to hide passwords this way, when the user creates a password, we don't save the password.
Instead, we save the result of a hash function.
Later, when the user tries to log in, they provide a password. We hash the password, and compare it with our hashed password for that user.
If the two match, the user has provided the correct password.
Of course, if two different passwords ever collide - if two strings hash to the same output value - then it would be possible for someone to log in to our site with the wrong password. That's bad.
Do you remember earlier when I said that cryptographic hashes have a feature called collision resistance, which means that two inputs are astronomically unlikely to collide? Here is where that's super important.
Now, our passwords are protected.
Yup. Totally protected. Nothing could possibly go wrong.
Okay, let's talk about Dictionary Attacks, the Way To Break Hashed Passwords
So, we've stolen a whole database full of usernames and hashed passwords, and we want to get at the raw passwords so that we can try them out on banking sites and such.
What we can do, is create a list of everything that we think of as a possible password. An enormous, comprehensive list.
Then, we use the same hash that the programmer used to hash the original passwords, and we run that on every single item in our gigantic password list.
Now we have a giant collection of hash-to-password pairs, which we can compare against the original data.
Any time we find a match with one of the hashes in our set, we know what the password must have been.
This checking of every hash in the set against every hash in the database is n-squared if done naively, but it's inherently very parallelizable, so with a bit of optimization this can run VERY fast.
This is called a precomputed dictionary attack.
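A sketch of the precomputed part, with MD5 standing in as the hash the original programmer used:

```python
import hashlib

def dictionary_attack(stolen_hashes, candidate_passwords):
    """Hash every candidate password, then match against the stolen hashes."""
    # the giant collection of hash-to-password pairs
    precomputed = {
        hashlib.md5(p.encode()).hexdigest(): p
        for p in candidate_passwords
    }
    # any stolen hash that appears in our table gives away its password
    return {h: precomputed[h] for h in stolen_hashes if h in precomputed}
```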
It's important to note that this isn't a Rainbow Table. A rainbow table is a different thing, involving hash chains, that accomplishes the same task but using much less storage space.
The MD5 hash function is so common that Juuso Salonen (I hope I pronounced his name right) released a utility called Bozocrack that works by taking a hashed password, searching for it on Google, and then MD5-hashing everything that comes back in the google results until it finds a match.
It's not as comprehensive as a MD5 Dictionary, but it still manages to be depressingly effective.
Part of the reason these attacks are so effective is that, because every user's password has been hashed with the same hash function, it's possible for us to test passwords against the entire table. While it is unlikely that we will ever crack all of the passwords, we're able to suss out the users who have simple passwords very quickly.
Just testing the whole database against the 1000 most common passwords shouldn't take more than an hour, and will provide us with a wealth of potentially useful data.
So what we want to do is reduce the effectiveness of this kind of attack, by using a different hash function for every single person in the database.
That doesn't mean we need to enlist thousands of different hash implementations - what it actually means is that we want to use one hash function that we seed with a different value for each user in the table, before we hash the password for that user.
It has to be a value that we have access to - because we need to be able to recreate this custom hash function every time we check the user's password.
It's quite common to use the username for this value.
That doesn't mean you should use the username.
Cryptographers recommend that you create a large block of randomized data and save it against the user to use as a seed.
This hash-seeding value is called a salt.
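A sketch of salted hashing, with plain SHA-256 standing in for the hash purely for illustration (as we'll get to, you'd really want something much slower):

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    # a large block of randomized data, saved against the user
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.sha256(salt + password.encode()).hexdigest()
    return salt, digest

def check_password(password, salt, stored_digest):
    # recreate the custom hash function using the stored salt
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, stored_digest)
```

Because every user gets a different salt, two users with the same password end up with completely different digests, and a precomputed dictionary is useless against the table as a whole.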
Okay, so that covers the very basics.
And when I say basics I mean very most basics. This isn't exactly state of the art. Unix's 'crypt' function first used salted hashes to protect against dictionary attacks back in the 1970s, a decade before I was born.
Okay, don't use MD5. Let's talk about that.
MD5 is well-known, and well-understood, and it's been in use for a good long time.
Unfortunately, in that good long time, cracks have begun to show in this venerable hashing algorithm.
The biggest reason to avoid MD5 is simply that it is too fast.
It's possible to MD5 hash millions of values per second.
This makes brute force attacks against MD5-hashed passwords very easy.
This MD5-hashing python script I wrote runs, on a cheap little virtual machine, one million MD5 hashes in 1.5 seconds.
Now, BCrypt is designed to be slow. Using the default settings on the same virtual machine, that run would take me 3.4 days.
This, combined with a salt for each individual user, means that brute forcing passwords out of a database could take days or months per user, rather than an hour or two for the whole thing.
BCrypt also comes with a work variable that you can crank up to make things even slower as hardware gets faster. When I turn it up to 15, my million-hash script goes from 3.4 days to run, to 26 days to run.
At this point, though, just logging a single user in to my site could take a couple of seconds. I'm not sure if they're willing to wait that long.
So, for password security, you should use something that's specifically designed for password security - like Bcrypt, or PBKDF2
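PBKDF2 even ships in Python's standard library, so a sketch costs nothing - its iteration count plays the same role as BCrypt's work variable:

```python
import hashlib
import os

def hash_password_pbkdf2(password, salt=None, iterations=200_000):
    # crank up iterations as hardware gets faster
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, iterations, digest
```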
But even for general purposes, if you're looking for a secure hash algorithm, MD5 is kind of old and broken.
SHA means "Secure Hash Algorithm", and SHA-512 - part of the SHA-2 family - has become a new standard for hashing, one that's cryptographically much more secure.