On Sun, 2014-01-12 at 16:37 -0500, Mark Lee wrote:
On Sun, 2014-01-12 at 16:29 -0500, Mark Lee wrote:
On Sun, 2014-01-12 at 11:29 -0700, Taylor Hornby wrote:
On 01/12/2014 10:11 AM, Mark Lee wrote:
Perhaps I'm not strong enough in mathematics but I'd like to know how possible md5 collisions can be weaponized. From what I see, the idea would be to modify a binary such that it contains malicious code (without changing the md5sum). Since most security packages contain a number of compilation tests and md5 hashes vary significantly with slight modifications, I'd like to know how these collisions can be used to hijack a system. If one has to build a binary that doesn't even encompass the functionality of the binary it's trying to mimic, wouldn't that severely decrease the effectiveness of a hash collision?
The problem with MD5 is that it's possible to find two different messages A and B such that MD5(A) = MD5(B). The birthday attack [1] makes it possible to find collisions for 128-bit hashes (like MD5) in about 2^64 operations, but there are MD5-specific attacks that make it even easier for MD5. So... it takes a bit of time to find a collision, but not too much time. It can be done in about a day with PS3's [2].
For my explanation, I'll ignore the MD5-specific attacks and just use the birthday attack since it's easier to understand.
The attacker starts out with two files:
1. The official copy of TrueCrypt.
2. A malicious copy of TrueCrypt.
The malicious change would be something simple. For example, they could find where the master encryption key is copied into a buffer:
... MOV [master_key + ECX], EAX // EAX holds master key bits ...
And change it so that it's always zero:
... MOV [master_key + ECX], 0 // Now the master key is set to zero ...
Only one instruction is different, but now TrueCrypt encrypts everything with a null key.
From these two copies copies of TrueCrypt, they generate 2^64 variants of each. They can do this by re-ordering instructions, changing which registers are used for which variables, re-ordering ELF segments, or just appending random strings to the file (you can append stuff to an ELF binary and it will still run fine).
Now the attacker has:
1. 2^64 files that are different, but behave exactly like the non-malicious official copy of TrueCrypt.
2. 2^64 files that are different, but behave exactly like the malicious copy of TrueCrypt (that uses zero keys or whatever).
If MD5 was a good hash function, each of the files would hash to a random 128-bit value. You should now be able to see that at least one file from the "non-malicious" set of files probably has the same MD5 hash as one of the files from the "malicious" set of files.
There are 2^64 * 2^64 = 2^128 ways of pairing a file from the non-malicious pile with one from the malicious pile. The chance of any one pair colliding is 1 in 2^128, so there's a good chance that at least one of the pairs collide (it's not exactly 100% since the pairs are not mutually independent).
To find the collision quickly, the attacker can just sort one of the sets of files by hash value, then for each file in the other set, use a binary search to see if any file has the same hash. Ignoring log(n) factors, that would take on the order of 2^64 operations.
So... Now the attacker has two files with the same MD5 hash. One that behaves like the official copy of TrueCrypt, and another that uses null keys.
So, they give the *good* copy to the package maintainer, who laboriously checks that it's not backdoored (as if that ever happens). The package maintainer thinks everything is fine, since it behaves exactly like TrueCrypt and there's nothing fishy about the file (they probably won't notice some instructions being reordered). The file might even be the same as the one from the TrueCrypt website, if the attacker is collaborating with the TrueCrypt developers.
The package maintainer thinks everything is good, so they hard-code the MD5 hash of their good copy into the PKGBUILD.
Now, when most users install TrueCrypt, they will download the *good* copy, check the MD5 hash, it will match, and everything will be good. However, when the adversary sees that his target (e.g. a terrorist, or activist/journalist living in a non-free country) is downloading TrueCrypt, they will silently replace the file with with the *bad* copy. When the MD5 gets checked, it will match, and the user will be encrypting their files with null keys.
In total, it took the attacker on the order of 2^66 operations to find the colliding files. You'd need a good budget (like the NSA's, or a large company's) to do that, but if you take advantage of the MD5-specific attacks too, you can do it for much cheaper.
Of course, this doesn't just apply to TrueCrypt, it could be applied to any other software that's checked only with an MD5 hash. To backdoor a web server, for example, you could change a constant used in array index bounds checking to introduce a buffer overflow vulnerability.
Salutations!
Excellent explanation but I still don't see the guarantee that md5 hash can be cracked. I don't see any application of the pidgeon hole principle which would show two files (one malicious with the null key and one normal) would have to share the same md5 hash. Can you elaborate any further?
Regarding, switching out the md5 hashes, I agree with the switch to sha256 for PKGBUILDS.
Regards, Mark
Salutations!
I'd just like the point out that with the birthday attack, we'd have 100% confidence when we have 2^128+1 different hashes (using the pidgeon hole theorem). Anyone care to calculate a # pair vs. probability of collision for md5 hash?
Regards, Mark
Salutations! In addition, this birthday attack assumes coercion from source code developers. It seems that the method you demonstrated of reducing the very high number needed to be 100% sure (from 2^128+1 to 129) one has a collision is dependent on source code developers being complacent (they are letting the code be refactored) to coercion. Regards, Mark -- Mark Lee <mark@markelee.com>