Unpredictable hashes for humans
Jordi Boggiano • May 10, 2010
Disclaimer: As the title suggests, this was written for the average developer out there, not for security experts. The article contains some shortcuts that might not please the more enlightened because they're not 100% correct, but more accurate descriptions are just too long, boring, or hard to understand for most people, which would defeat the point of this article.
It is not uncommon for web developers to have to generate random ids or hashes. For instance, large-scale projects or frameworks may want to implement their own PHP session handlers, either completely abstracted in their API or overloading PHP's internal API using session_set_save_handler(). If you do so, unless you want to entrust this to PHP's core, one thing you will have to take care of is generating unique session ids to send as a cookie to your users, allowing the session to persist. Other common use cases for such unique hashes are CSRF tokens to insert in forms or URLs, and authentication tokens for email validation and the like.
Common misconceptions and human errors
The most common error programmers make when implementing this comes from the fact that we humans are very bad at judging randomness. Especially once you add a hashing algorithm on top, it very quickly becomes impossible to tell whether the hashes you create are indeed unique and random enough. To illustrate, here is the output of two calls to md5(rand(1,5)):
[code]
eccbc87e4b5ce2fe28308fd9f2a7baf3
a87ff679a2f3e71d9181a67b7542122c
[/code]
If you don't know better and just look at those values, it all seems pretty random, but using this for your session id would basically mean you get collisions as soon as you have 5 concurrent users, and most likely even sooner, since two consecutive calls might return the same value.
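To make the problem obvious, here is a minimal sketch that counts how many distinct hashes md5(rand(1,5)) can ever produce (the iteration count is arbitrary):
[code php]
// collect every distinct hash the expression produces
$seen = array();
for ($i = 0; $i < 1000; $i++) {
    $seen[md5(rand(1, 5))] = true;
}

// only 5 values ever come out, so collisions are
// guaranteed almost immediately
echo count($seen); // 5
[/code]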
Most developers have some clue and wouldn't make such an obvious mistake, but here comes another common one: "let's use time!". Indeed, time is linear, and if you use microtime(true) you will get pretty unique ids with very low chances of collision, unless you run a high load site on multiple servers. However, this is very insecure: since anyone knows the approximate time on your server, an attacker could generate the hashes for every microsecond in that window and potentially hijack a session. I won't go into further detail about how those attacks work, since this article is more about securing your site than about hijacking sessions.
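To illustrate why, here is a sketch of the brute-force an attacker could run against microtime-based ids; the one-second uncertainty window and the exact string format are assumptions for the example:
[code php]
// suppose the site derives its session ids from
// md5(microtime(true)) and the attacker knows the
// server time to within one second (an assumption)
$guessedSecond = time();

$candidates = array();
// enumerate the microsecond range of that second; the
// exact string format would have to match how the target
// converts the float to a string, this is only a sketch
for ($usec = 0; $usec < 1000000; $usec += 10) {
    $candidates[] = md5(sprintf('%d.%06d', $guessedSecond, $usec));
}

// 100,000 candidate ids, trivial to test against the site
echo count($candidates);
[/code]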
Another common approach is to use rand(), mt_rand(), or uniqid(). The first two are somewhat random but cryptographically weak, and some attacks render them quite dangerous to rely upon, while the latter is just based on the time, so it offers no benefit over using microtime().
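You can verify the time dependency of uniqid() yourself:
[code php]
$id = uniqid();
echo $id; // e.g. 4be7d6cd5e2f4

// the first 8 hex characters are just the unix timestamp,
// the last 5 the current microseconds
echo hexdec(substr($id, 0, 8)); // same value as time()
echo time();
[/code]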
Randomness explained
The main thing to understand is how random values are generated. Computers are deterministic machines: they are made to be reliable and to consistently produce the same output for a given input, so asking them to generate anything out of thin air is extremely difficult. To circumvent this, operating systems have ways to gather really random values that are virtually impossible to guess. They use entropy (i.e. noise or chaos) generated by the machine's environment, aggregating various inputs like hard-drive I/O, fan speed, etc. Hardware devices that replace this process with more accurate methods are also available, such as the Entropy Key that you can plug into any Linux computer.
Two variants are available on most UNIX machines, exposed as files you can read randomness from. /dev/random is a cryptographically safe, locking random number generator, which means that what you read from it is erased from the pool, so the pool might run empty and lock your application until it is filled again. This is quite dangerous on high traffic servers: if the pool gets empty, new users waiting for a session id will have to wait for seconds, or the request will simply time out, not quite the first experience you'd want them to have. The other alternative is /dev/urandom, the unlocked variant. This one recycles data and will instantly serve you a chunk of it. It is therefore not cryptographically safe, since it could potentially give you similar output twice, but it is the best you can get if you can't rely on /dev/random being fast enough, and it will beat PHP's mt_rand() any day.
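If you're curious, Linux even lets you watch the pool level; a quick sketch (the proc path is Linux-specific):
[code php]
// bits of entropy currently available to /dev/random
// (Linux only)
echo file_get_contents('/proc/sys/kernel/random/entropy_avail');

// /dev/urandom on the other hand never blocks
$handle = fopen('/dev/urandom', 'rb');
echo bin2hex(fread($handle, 16));
fclose($handle);
[/code]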
The Windows counterpart is sadly not exposed as a file anyone can read from, but it is available through the OS API. From PHP, you can call it via the COM extension. This isn't cryptographically safe either, and it sometimes fails to return anything depending on the exact Windows setup you're on, so I wouldn't recommend it.
Also, as of PHP 5.3.0, openssl_random_pseudo_bytes() provides you with a good algorithm from libssl that returns pseudo-random bytes. If you use it, beware to check the value of the $strong parameter though: if it's not true, it means the algorithm couldn't use the optimal method, in which case you'd better generate the bytes yourself.
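A minimal sketch of that $strong check:
[code php]
// ask libssl for 64 pseudo-random bytes
$entropy = openssl_random_pseudo_bytes(64, $strong);

if ($strong) {
    // generated with a cryptographically strong
    // algorithm, safe to use
    echo bin2hex($entropy);
} else {
    // a weak fallback was used, better get the
    // entropy some other way
}
[/code]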
Note: It has been brought to my attention that the aforementioned method first polls /dev/random on unix systems and times out after a few milliseconds, so if you need speed (i.e. you're generating an id on every request), you may not want to use it.
Another option, if you're running PHP 5.3.0 or above, is mcrypt_create_iv(). Before 5.3.0 it was really unsafe on Windows, but it seems there is no really safe way to get entropy on Windows before 5.3.0 anyway, so you'd do well to upgrade. Nowadays, used with the MCRYPT_DEV_RANDOM flag, it gives you decent entropy.
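A minimal call looks like this; note that MCRYPT_DEV_URANDOM is the non-blocking variant, which the summary code below uses for speed:
[code php]
// 64 bytes read from /dev/random, may block until
// the entropy pool refills
$entropy = mcrypt_create_iv(64, MCRYPT_DEV_RANDOM);

// the non-blocking variant, reading /dev/urandom
$entropy = mcrypt_create_iv(64, MCRYPT_DEV_URANDOM);
[/code]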
Hash algorithms
Getting random data is one thing, but you will most likely want to normalize it, which has two benefits: hiding what you used to create the unique id, and making it safe to store and pass around as a cookie. Leaking raw random data to the user might hint an attacker as to which method you used to generate it, so it's better not to. Meanwhile, trying to set a cookie with data that contains weird characters might leave you with corrupted headers or a PHP warning appearing on some pages, making things hard to debug.
For this purpose, PHP offers the hash extension, which is enabled by default and available since PHP 5.1.2, so it should run on most hosts by now. By executing hash_algos(), or via your phpinfo page, you can see which algorithms your version supports. There are a lot of algorithms to choose from, and they vary a lot in quality. Some are not designed for cryptographic purposes and are therefore very prone to collision attacks, while some were once great but have been cracked since then. The other criterion to pick one by is the speed at which it runs: slow algorithms mean an attacker will have to use more processing power to brute-force his way in, so you don't want an algorithm that performs too well. There is a benchmark available in the user comments of the hash() function page which shows, at the time of this writing, that Whirlpool is one of the slowest, and it would be my recommendation since it offers good cryptographic quality on top of that. However, keep an eye out for security advisories. Cryptographic algorithms are permanently under attack and it is not guaranteed that Whirlpool will still be a good choice 5 years from now.
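If you'd rather measure for yourself than trust the comments, here is a rough benchmark sketch (the input and iteration count are arbitrary):
[code php]
// time a few algorithms hashing the same input
$data = str_repeat('x', 64);
foreach (array('md5', 'sha256', 'whirlpool') as $algo) {
    $start = microtime(true);
    for ($i = 0; $i < 100000; $i++) {
        hash($algo, $data);
    }
    // slower is better here, it makes brute-forcing costlier
    printf("%s: %.3fs\n", $algo, microtime(true) - $start);
}
[/code]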
And while it is not really the topic of this article, please note that even though MD5 is very popular, it is not a good idea for password hashing since, due to its popularity, many people have built rainbow tables for it.
The code
To summarize, here is a basic implementation that should provide optimal results on any machine, hopefully using all the available techniques. Obviously this means it does a bit too much, and you may not want all of it in every case, but I hope it's a good reference to build upon.
[code php]
/**
 * Tries a bunch of methods to get entropy in order
 * of preference and returns as soon as it has something
 *
 * @return string
 */
function generateEntropy()
{
    // use mcrypt with urandom if we're on 5.3+
    if (version_compare(PHP_VERSION, '5.3.0', '>=')) {
        return mcrypt_create_iv(64, MCRYPT_DEV_URANDOM);
    }

    // otherwise try ssl (beware - it may slow down
    // your app by a few milliseconds)
    if (function_exists('openssl_random_pseudo_bytes')) {
        $entropy = openssl_random_pseudo_bytes(64, $strong);
        // use ssl only if it was using the strong algo
        if ($strong) {
            return $entropy;
        }
    }

    // try to read from the unix RNG
    if (is_readable('/dev/urandom') && ($handle = fopen('/dev/urandom', 'rb'))) {
        $entropy = fread($handle, 64);
        fclose($handle);
        return $entropy;
    }

    // Warning!
    // from here on, the entropy is considered weak,
    // so you may want to consider just throwing
    // an exception to realize that your code is running
    // in an insecure way

    // try to read from the windows RNG
    if (class_exists('COM')) {
        try {
            $com = new COM('CAPICOM.Utilities.1');
            $entropy = base64_decode($com->GetRandom(64, 0));
            return $entropy;
        } catch (Exception $ex) {
        }
    }

    // last solution.. barely better than nothing
    return uniqid(mt_rand(), true);
}

/**
 * Grabs entropy and hashes it to normalize the output
 *
 * @param string $algo hash algorithm to use, defaults to whirlpool
 * @return string
 */
function getUniqueHash($algo = 'whirlpool')
{
    $entropy = generateEntropy();

    return hash($algo, $entropy);
}

echo getUniqueHash();
// 0532448b807325ae01661d46d7a534c852b6fc00ac14964b69bd0bb6f37f6c7c69ab4435c36dc57190c2a1cb4bd209a9f6a09e52c1ca25ef9ca20f04f34e643c

echo generateEntropy();
// wçGœè’‰Ù
[/code]
The code tries to get decent entropy via the best ways available and returns as soon as it has something. Finally it hashes it all with -by default- the Whirlpool algorithm and returns that. Whirlpool returns pretty long hashes however, since they're 128 characters, and that might be a bit much if you plan to use one as a URL token. You may therefore want to switch to another algorithm depending on your needs, like sha256 (64 characters) or md5 (32 characters); they're strong enough for such uses, since you don't really need collision protection for random hashes, but md5 should really not be used for hashing passwords, which is a completely different topic.
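For instance, if you need a shorter token to embed in a URL, passing another algorithm name is enough:
[code php]
echo getUniqueHash('sha256'); // 64 characters
echo getUniqueHash('md5');    // 32 characters
[/code]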