MD5 To Be Considered Harmful Someday
{LANG_NAVORIGIN} Encryption
By: Dan Kaminsky, 03/04/2005
Caveats
It should come as no surprise that the primary applied target for the Stripwire tool is the highly popular "Tripwire" file system auditing tool. Although Tripwire can configured to use more trustworthy algorithms, under common configurations it works by collecting MD5 hashes of every system file contained within the file system and alarming if any of those hashes change. The base security presumption is that as long as the file system doesn't change, neither will the behavior of the system running on top of it.
Stripwire makes it trivial for an attacker to swap out the harmless Ice for the arbitrarily dangerous Fire with Tripwire none the wiser. So does this mean Tripwire is fundamentally broken? The short answer, no, absolutely not. The longer answer is where things get interesting.
We begin by looking closer at Tripwire's base presumption -- yes, security engineers use Tripwire to detect unauthorized changes in the file system, but altering the file system is not the only way operations can be affected. The file system doesn't fully define an operating environment any more than laws fully define a legal system. Any number of external sources can alter behavior. Faults amidst its files are but one path, and not necessarily the best one. An entire branch of exploit research focuses on memory-only attacks that use the network as their injection vector and alter only the in-RAM kernel or library structures to support remote control of the OS. The disk is never touched; all evidence bleeds away the moment the plug is pulled by a naive forensic analyst.
And of course, systems do not need to be networked to exhibit deviant behavior with a constant software load. Anything from CPU speed to motherboard temperature sensors to the particular date emitted by the RTC (Real Time Clock) can be used to select between completely different sets of instructions. Systems can be even be configured to alter their behavior randomly. What matters is what the system is programmed to do, and that's the second problem: Tripwire doesn't tell you that you can trust something; only auditors can do that. It only says if you could trust it before, you still can now. For Stripwire to pose an actual threat to a deployed environment not only would Ice need to be added to the trusted list of MD5-monitored files but so too would the Stripwire execution harness itself. That is an unlikely circumstance.
So most uses of MD5, even by Tripwire, remain secure -- under the present threat regime at least. There still remains a critical blind spot in anything that uses MD5; to pick one example this is a fantastic channel for a group of malicious developers to submit innocuous and undecryptable content to their auditors for approval, and then once that's acquired to swap in a self-decrypting and unaudited payload. Audits against the shipped code would show the same MD5 hashes, and all would appear well.
Not that malice from the developers is a required component of such an attack. Maynor [6] describes a fascinating failure mode whereby the multitude of compilation, assembly, and packaging tools used to bring code from raw text to deployed code are themselves attacked. The logical progression of Thompson's classic essay on Trusting Trust [7], in which a C Compiler was infected and would subsequently infect anything else compiled with it, including other C Compilers, Maynor's approach has some interesting implications when combined with Stripwire. Conceivably, "Ice" could be injected into each build assembled by the developers, thus allowing internal testing to proceed uninhibited. But, upon shipment "Fire" would be swapped in by a malicious third party. Even if system administrators had a process by which they validated the MD5 sum of the code to be installed with the developers' concept of that sum (say, through an automated package manager), they would still find themselves installing the corrupted code.
Ultimately, MD5 cannot be depended upon to protect against a bait and switch, and neither can anything that depends on it.
Digital Signatures and DRM
Digital Rights Management, or DRM, has become a catch-all term for a extensive reimagining of issues not simply technical, but legal, political, and economic as well. The latter three have effectively driven the concept of a mutually trusted "third party attester" into technology that has traditionally operated on a "dumb automaton" model of command/execution. Third party attestation allows a third party to control the precise manner in which a system should operate, independent of mere technological capacity. Cryptographic primitives are chained together in DRM systems to link grantable resources to the externally provided objects that provide the granting.
DRM systems with MD5 as part of their chain could conceivably face problems even with they never hash data directly. All three major digital signature algorithms -- RSA, DSA/ElGamel, and Elliptical Curve -- are almost universally used in a mode where they do not sign data directly, but rather sign a hashed representation of the data. (Asymmetric algorithms are quite slow; this maneuver makes it realistic to sign arbitrarily large files.) Often the hash algorithm of choice is MD5. Identical input yields identical output -- if two files have the same hash, they'll both verify against the same signature. So, a key constraint of the digital signature primitive, that no other data could survive signature verification save for the data that was originally signed, cannot be met.
There appears to be only limited vulnerability to this in open deployment. Microsoft's Authenticode technology, used within its browser to limit executable content within web page to signed documents, does indeed use (or at least allow) MD5 hashes to be signed. It would be trivial to sign something innocuous and then actually release something malicious. But the security model of Authenticode has always been one of legal accountability -- having someone to sue -- and not of technical restriction. Indeed the amount of abjectly destructive "spyware" tunneled to user machines through Internet Explorer is astonishing.
It is worth noting that probably the widest-deployed hardware that employs digital signatures for third-party attestation, the Microsoft X-Box, uses SHA-1 as its hashing algorithm and not MD5 [8]. So it is not vulnerable. But it's also worth noting that had Microsoft selected MD5 instead of SHA-1 their use of a 2048 bit key for their RSA signature would have been completely irrelevant.
Multicollisions Unleashed
Interestingly enough, none of what's been discussed already actually requires the full attack discovered by Joux and Wang. Thus far everything has been based only on the ability to append arbitrary data to Wang's test vectors. But failures inside cryptographic primitives, even very small ones, tend to lead to slowly discovered catastrophic failures. MD5 does not seem to be an exception to this rule.
We can do much more with the actual multicollision attack. The test vectors collide only when stirred into the default initial state for the MD5 algorithm; the attack itself works against any arbitrary state. The upshot here is that we can not only append arbitrary data but prepend it as well. Presently Fire and Ice required a dedicated external execution harness which could arouse suspicion. With prepending available, a correctly formatted binary executable could be synthesized that would self-analyze and branch appropriately depending on which vector was contained within.
In addition, being limited to the MD5 initial state means only hashes calculated on a per-file basis can be made to collide; a full disc or partition sum will come across the doppelganger set at a vastly different initial state and fail to collide. With the full attack we could specify our colliding blocks against the MD5 state that would be found during a full disk or partition hashing operation. Of course, then the colliding set we generated wouldn't collide on a per-file basis. Thus far we can only adapt to a single MD5 state at a time.
HMAC
Most observers have written that the HMAC, or Hashed Message Authentication Code, construction is entirely immune to the multicollision attack. They are mostly correct, just not entirely correct. HMAC is a method of taking an arbitrary hashing algorithm like MD5 and introducing a secret to it such that only someone with that secret can either synthesize or verify the correct hash for some arbitrary input. In simple terms, HMAC is a mechanism for altering the initial state according to some password. More precisely, HMAC does the following:
Inner = MD5(Key XOR 0x36 + Data)
Outer = MD5(Key XOR 0x5c + Inner)
HMAC-MD5 = Outer
There are three things going on here:
- A key is prepended to the data being hashed.
- Additional noise is added to the user provided key.
- Two rounds of hashing are used instead of one, with the noise varying between the two rounds.
There's a fair amount of defensive cryptosystem design inside of HMAC -- it's an avowed goal of the algorithm to still function even if small faults are found in the underlying hash. But for all its defensive operation, the Data portion is only invoked once, prepended with a key-derived block. This new block creates a new 128 bit state for Data to be stirred into, and this state diverges substantially from our generic MD5 initial state.
But the multicollision attack works against any state, not just the generic one. That means it's straightforward, given the key, to adapt to the new system state and cause a collision in the inner hash. Once the inner hash has collided, the outer must as well, as it's getting the same input from both datasets. (If the outer hash also concatenated the Data portion, the attack would fail entirely, because it's thus far impossible to adapt to the two separate MD5 states by the 0x36 v. 0x5c padding XORs.)
This is the first known method of creating two datasets that collide under HMAC-MD5. Once again, though, the caveats are deep.
>From one perspective, saying HMAC is insecure when the key is known is a little like saying AES is insecure when the key is known -- the whole point is that the key is unknown to the attacker. It's quite arguable that the MAC primitive, like anything else with a key, is allowed to collapse if the key leaks. The basic idea of open crypto design, after all, is to migrate all secrecy and security out of the algorithm and into the key. Is it really fair to complain when, given this design, risks show up with the key being lost?
Probably not. For all the analysis in this paper, the multicollision attack against MD5 remains relatively weak, with special circumstances required for an attack to succeed. For Tripwire to be compromised, the initial trust database needed to be infected. For digital signatures to be affected, something only apparently innocuous needed to be signed first. In both cases, the use of MD5 opened up a small threat vector; would HMAC have changed this? If an attacker has access to a system such that files may be altered, he probably also has access to whatever HMAC key Tripwire had been reconfigured to use (assuming the entire contents of the file system aren't being streamed over the network to an uncompromised host). So HMAC doesn't change the threat scenario for Tripwire. And for a digital signature bait and switch, the attacker has to have the HMAC key to create a signable payload for the third party attester.
Ultimately, as most uses of MD5 are immune to the multicollision vulnerability, so too are most uses of HMAC-MD5. But when MD5 does experience an applied threat, HMAC-MD5 provides limited if any protection.
E-Mail Link
Your IP address will be sent with this e-mail