Hash functions mechanics

How does it work ?

From Davies-Meyer's function to the Merkle tree, trough the one-way compression function.

Davies-Meyer’s function

The Davies–Meyer hash function is a construction for a hash function based on a block cipher, where the length in bits of the hash result is equal to the block length of the block cipher. A hash function is a cryptographic algorithm that takes input strings of arbitrary (or very large) length and maps these to a short, fixed length output strings. The Davies–Meyer hash function is an unkeyed cryptographic hash function which may have the following properties: preimage resistance, second preimage resistance and collision resistance; these properties may or may not be achieved depending on the properties of the underlying block cipher.

In the following, the block length and key length of the block cipher will be denoted with n and k respectively. The encryption with the block cipher E using the key K will be denoted with EK(.).

The Davies–Meyer scheme is an iterated hash function with a compression function that maps k + n bits to n bits:

  • Hi_=EXi(Hi_−1)⊕_Xi.

One-way compression function

In cryptography, a one-way compression function is a function that transforms two fixed-length inputs into a fixed-length output. The transformation is “one-way”, meaning that it is difficult given a particular output to compute inputs which compress to that output. One-way compression functions are not related to conventional data compression algorithms, which instead can be inverted exactly (lossless compression) or approximately (lossy compression) to the original data.

One-way compression functions are for instance used in the Merkle–Damgård construction inside cryptographic hash functions.

One-way compression functions are often built from block ciphers. Some methods to turn any normal block cipher into a one-way compression function are Davies–Meyer, Matyas–Meyer–Oseas, Miyaguchi–Preneel (single-block-length compression functions) and MDC-2/Meyer–Schilling, MDC-4, Hirose (double-block-length compression functions)

One-way compression in Merkle construction

A common use of one-way compression functions is in the Merkle–Damgård construction inside cryptographic hash functions. Most widely used hash functions, including MD5, SHA-1 (which is deprecated]) and SHA-2 use this construction.

A hash function must be able to process an arbitrary-length message into a fixed-length output. This can be achieved by breaking the input up into a series of equal-sized blocks and operating on them in sequence using a one-way compression function. The compression function can either be specially designed for hashing or be built from a block cipher.

The last block processed should also be length padded, this is crucial to the security of this construction. This construction is called the Merkle–Damgård construction. Most widely used hash functions, including SHA-1 and MD5, take this form.

Merkle Tree

A Merkle tree is a hash-based data structure that is a generalization of the hash list. It is a tree structure in which each leaf node is a hash of a block of data, and each non-leaf node is a hash of its children. Typically, Merkle trees have a branching factor of 2, meaning that each node has up to 2 children.

Merkle trees are used in distributed systems for efficient data verification. They are efficient because they use hashes instead of full files. Hashes are ways of encoding files that are much smaller than the actual file itself. Currently, their main uses are in peer-to-peer networks such as Tor, Bitcoin, and Git.

Merkle trees are typically implemented as binary trees, as shown in the following image.

In this image, we see an input of data broken up into blocks labeled from L1 to L4. Each of these blocks are hashed using some hash function. Then each pair of nodes are recursively hashed until we reach the root node, which is a hash of all nodes below it

In various distributed and peer-to-peer systems, data verification is very important. This is because the same data exists in multiple locations. So, if a piece of data is changed in one location, it’s important that data is changed everywhere. Data verification is used to make sure data is the same everywhere.

However, it is time-consuming and computationally expensive to check the entirety of each file whenever a system wants to verify data. That’s why Merkle trees are used. Basically, we want to limit the amount of data being sent over a network (like the Internet) as much as possible. So, instead of sending an entire file over the network, we just send a hash of the file to see if it matches. The following protocol will be used :

  • Computer A sends a hash of the file to computer B.

  • Computer B compare that hash to the root of the Merkle tree.

  • If there is no difference, we are done! Otherwise, go to step 4.

  • If there is a difference in a single hash, computer B will request the roots of the two subtrees of that hash.

  • Computer A creates the necessary hashes and sends them back to computer B.

  • Repeat steps 4 and 5 until you have found the data blocks(s) that are inconsistent. It is possible to find more than one data block that is wrong because there might be more than one error in the data.

Because the computers are only sending hashes over the network (not the entire file), this process can go very quickly. Plus, if an inconsistent piece of data is found, it is much easier to insert a small chunk of fixed data than to completely rewrite the entire file to fix the issue.

The reason that Merkle trees are useful in distributed systems is that it is inefficient to check the entirety of file to check for issues. The reason that Merkle trees are useful in peer-to-peer systems is that they help you verify information, even if some of it come from an untrusted source (which is a concern in peer-to-peer systems).

The way Merkle trees are helpful in a peer-to-peer system has to do with trust. Before you download a file from a peer-to-peer source like Tor, the root hash is obtained from a trusted source. After that, you can obtain lower nodes of the Merkle tree from untrusted peers. All these nodes exist in the same tree-like structure described above, and they all are partial representations of the same data. The nodes from untrusted sources are compared to the trusted hash. If they match the trusted source (meaning they fit into the same Merkle tree), they are accepted, and the process continues. If they are not verified, they are discarded and searched for again from a different source