
Chapter 8 Hash functions in digital forensics

In this chapter we describe the role of hash functions in digital forensics. Essentially, hash functions are used for two main purposes: first, cryptographic hash functions ensure the authenticity and integrity of digital traces. Second, hash functions identify known objects (e.g., illicit files).

Before we give details on their applications in IT forensics, we introduce the foundations of hash functions in Section 8.1. Then Section 8.2 describes the use case authenticity and integrity of digital traces. Finally, in Section 8.3 we explain the use case data reduction by identification of known digital objects.

8.1 Cryptographic hash functions and approximate matching

In this section we first introduce the general idea of a hash function and then turn to two different concepts: first Section 8.1.1 discusses cryptographic hash functions, which originally come from cryptography to be used in the context of the security goals authenticity, integrity, and non-repudiation. Cryptographic hash functions are useful to uniquely identify an input bit string by its hash value. The second concept, on the other hand, is a rather new idea. It deals with the identification of similar input bit strings and is called approximate matching. We turn to approximate matching in Section 8.1.3.

A general hash function is simply a function that takes an arbitrarily large bit string as input and outputs a bit string of fixed size. If $n \in \mathbb{N}$ denotes the bit length of the output and if we denote, as usual, by $\{0,1\}^*$ the set of all bit strings, then a hash function $h$ is a mapping

$$h : \{0,1\}^* \to \{0,1\}^n. \tag{8.1}$$

Typically the computation of a hash value is efficient, that is, fast in practice. These two properties are characteristic of a hash function and are thus used for its definition (see e.g. [16]).

Definition 8.1: Hash function

Let $n \in \mathbb{N}$ be given. A hash function is a function $h$ that satisfies the following two properties:

1. Compression: $h : \{0,1\}^* \to \{0,1\}^n$.

2. Ease of computation: For all input bit strings $bS \in \{0,1\}^*$ the computation of $h(bS)$ is 'fast' in practice.

The output $h(bS)$ of the function is referred to as a hash value, fingerprint, signature or digest.

Example 8.1

We look at two simple hash functions.

1. We set $n = 1$. For $bS \in \{0,1\}^*$ we simply define $h(bS)$ by the least significant bit of $bS$. We set $h(\emptyset) := 0$ for the empty bit string $\emptyset$. For instance we have $h(10101) = h(11) = h(1) = 1$ and $h(1000) = h(10) = h(0) = 0$. Clearly this function satisfies both requirements from Definition 8.1.

2. We set $n = 2$. For $bS \in \{0,1\}^*$ we simply define $h(bS)$ by $bS \bmod 4$, where $bS$ is interpreted as a non-negative binary integer. Again, we set $h(\emptyset) := 0$ for the empty bit string $\emptyset$. For instance we have $h(10110) = h(10) = 10_2 = 2$ and $h(1000) = h(0) = 0$. Again this function satisfies both requirements from Definition 8.1.
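The two toy hash functions of Example 8.1 can be sketched in a few lines of Python. The function names h_lsb and h_mod4 are chosen here for illustration only; bit strings are passed as Python strings of the characters 0 and 1, and the empty string is assumed to map to 0 in both cases.

```python
def h_lsb(bits: str) -> str:
    """Toy hash with n = 1: the least significant bit of the input."""
    return bits[-1] if bits else "0"        # empty bit string maps to 0


def h_mod4(bits: str) -> str:
    """Toy hash with n = 2: the input, read as a binary integer, mod 4."""
    value = int(bits, 2) if bits else 0     # empty bit string maps to 0
    return format(value % 4, "02b")         # always two output bits


print(h_lsb("10101"), h_lsb("1000"))        # 1 0
print(h_mod4("10110"), h_mod4("1000"))      # 10 00
```

Both functions compress inputs of any length to a fixed output length and are fast, so they meet Definition 8.1, although they offer no security at all.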

8.1.1 Cryptographic hash functions

Hash functions are well-established in computer science for different purposes. Sample security applications of hash functions comprise storage of passwords (e.g., on Linux systems), electronic signatures (both MACs and asymmetric signatures), and whitelists/blacklists in digital forensics. Depending on the application, we have to impose further requirements.

For instance, in cryptography a hash value serves as a unique identifier for its input, e.g., in the context of a digital signature, where the hash value uniquely represents the input data. Clearly, in theory each hash value possesses infinitely many preimages, that is, input bit strings which map to the given hash value. However, in practice it is not possible to compute such a preimage, because the run time of the most efficient algorithm to find a preimage is too long. This property is called preimage resistance. Besides preimage resistance, a cryptographic hash function satisfies two additional security requirements, which we list in Definition 8.2.

Definition 8.2: Cryptographic hash function

Let $h : \{0,1\}^* \to \{0,1\}^n$ be a hash function. $h$ is called a cryptographic hash function if it additionally satisfies the following security requirements:

1. Preimage resistance: Let a hash value $H \in \{0,1\}^n$ be given. Then it is infeasible in practice to find an input (i.e., a bit string $bS$) with $H = h(bS)$.

2. Second preimage resistance: Let a bit string $bS_1 \in \{0,1\}^*$ be given. Then it is infeasible in practice to find a second bit string $bS_2$ with $bS_1 \neq bS_2$ and $h(bS_1) = h(bS_2)$.

3. Collision resistance: It is infeasible in practice to find any two bit strings $bS_1, bS_2 \in \{0,1\}^*$ with $bS_1 \neq bS_2$ and $h(bS_1) = h(bS_2)$.

Clearly, neither hash function from Example 8.1 is a cryptographic hash function. For instance, we consider the first function $h$ from Example 8.1. It is not preimage resistant, because given $b \in \{0,1\}$ we simply take $b$ as preimage and have $h(b) = b$, that is, finding preimages is trivial. The same obviously holds for second preimage resistance and collision resistance, respectively.

As we will see in this chapter the IT forensic community adopted the use of cryptographic hash functions for two main purposes: ensuring authenticity and integrity of a digital trace and automatic file identification. In both cases, preimage resistance is crucial, because the hash value of the input serves as a unique identifier for its preimage. If such an identifier is given and if we are able to find a preimage, which is different to the actual input, both IT forensic use cases are corrupted.

If $h$ is a hash function, then a necessary condition for $h$ to be a cryptographic hash function is that the bit length $n$ of its digest is sufficiently large. For preimage resistance and second preimage resistance we have to impose $n \geq 100$; for collision resistance $h$ has to satisfy $n \geq 200$. The bound doubles because, by the birthday paradox, a collision can be expected after about $2^{n/2}$ hash evaluations, whereas a generic preimage search needs about $2^n$ evaluations. We therefore recommend applying only hash functions with $n \geq 200$. Sample cryptographic hash functions which are used in digital forensics are MD5 ($n = 128$), SHA-1 ($n = 160$) or hash functions from the SHA-2 family (e.g., SHA-256 ($n = 256$), [21]); note that MD5 and SHA-1 do not meet this recommendation. For further details we refer to Table 8.1.

Name   MD5   SHA-1   SHA-256   SHA-512   RIPEMD-160
n      128   160     256       512       160

Table 8.1: Sample cryptographic hash functions.
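The digest lengths in Table 8.1 can be checked with Python's standard hashlib module. The following is a small sketch; the helper name digest_bits is chosen for illustration, and RIPEMD-160 is left out because its availability depends on the underlying OpenSSL build.

```python
import hashlib


def digest_bits(name: str) -> int:
    """Bit length n of the digest of the named hash function."""
    return hashlib.new(name).digest_size * 8   # digest_size is given in bytes


for name in ("md5", "sha1", "sha256", "sha512"):
    n = digest_bits(name)
    # n // 2 reflects the generic birthday bound for finding collisions
    print(f"{name}: n = {n}, generic collision effort about 2^{n // 2}")
```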

One important implication of the security properties of a cryptographic hash function is the avalanche effect. If we change the input bit string, then every bit of the output is expected to change its value with probability 50%, i.e., we do not have any control over the output if the input changes. According to the avalanche effect, if only one single bit in the original input bit string $bS$ is changed to get a tampered one $bS'$, the two outputs $h(bS)$ and $h(bS')$ look 'very' different. We demonstrate the avalanche effect on the basis of similar ASCII strings in Example 8.2.

Example 8.2: Avalanche effect

We demonstrate the avalanche effect by applying SHA-256 to a simple ASCII string: in the first string, Wolfgang claims to give Angela 1 million EUR, while the amount changes slightly to 1 billion EUR in the second string. However, the respective SHA-256 hash values look very different.

$ echo 'Dear Angela, I give you 1 million EUR. Wolfgang' | sha256sum
cb10cfd3b6d47af94cd48c096c606ec8d2d836e80c7f87701ff450267efb4787  -
$ echo 'Dear Angela, I give you 1 billion EUR. Wolfgang' | sha256sum
8dc377ef008781d03278982928dc7235aff7ac06e39a523eb7fda9ad547f6c4e  -

The Linux command echo prints the given string (including a subsequent newline character) to standard output. The Linux implementation of SHA-256, sha256sum, takes this string as input. The number of output characters of sha256sum is 256/4 = 64, because each group of 4 bits of the hash value is printed as one hexadecimal digit.

The avalanche effect is desirable in the context of unique identifiers or integrity of a trace, because it is easy to distinguish different input bit strings by comparing their respective hash values. However, the avalanche effect prevents the detection of similar objects. It is important to keep this property in mind for the two use cases of cryptographic hash functions in IT forensics.
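To quantify the avalanche effect of Example 8.2, one can count the bit positions in which two SHA-256 digests differ. The following sketch uses Python's hashlib; the helper name differing_bits is chosen for illustration, and the trailing newline mimics the output of echo.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Number of bit positions in which the SHA-256 digests of a and b differ."""
    da = int.from_bytes(hashlib.sha256(a).digest(), "big")
    db = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(da ^ db).count("1")          # population count of the XOR


m1 = b"Dear Angela, I give you 1 million EUR. Wolfgang\n"
m2 = b"Dear Angela, I give you 1 billion EUR. Wolfgang\n"
print(differing_bits(m1, m2), "of 256 bits differ")
```

If every output bit flips with probability 50%, roughly half of the 256 bits are expected to differ.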

8.1.2 Bloom filter

This section introduces Bloom filters, which are an important concept for approximate matching. Bloom filters are commonly used to represent elements of a finite set $S$. A Bloom filter is an array of $m$ bits, initially all set to zero. In order to 'insert' an element $s \in S$ into the filter, $k$ independent hash functions $h_0, \dots, h_{k-1}$ are needed, where each hash function outputs a value between $0$ and $m-1$. Next, $s$ is hashed by all $k$ hash functions. The bits of the Bloom filter at the positions $h_0(s), h_1(s), \dots, h_{k-1}(s)$ are set to one.

To answer the question whether $s'$ is in $S$, we compute $h_0(s'), h_1(s'), \dots, h_{k-1}(s')$ and analyse if the bits at the corresponding positions in the Bloom filter are set to one. If this holds, $s'$ is assumed to be in $S$; however, we may be wrong, as the bits may have been set to one by different elements from $S$. Hence, Bloom filters suffer from a non-trivial false positive rate. Otherwise, if at least one bit is set to zero, we know that $s' \notin S$. It is obvious that the false negative rate is equal to zero.

In case of uniformly distributed data, the probability that a certain bit is set to 1 by a single hash function during the insertion of an element is $1/m$, i.e., the probability that the bit is still 0 is $1 - 1/m$. After inserting $n$ elements into the Bloom filter, the probability of a given bit position to be one is $1 - (1 - 1/m)^{k \cdot n}$. In order to have a false positive, all $k$ array positions need to be set to one. Hence, the probability $p$ for a false positive is

$$p = \left(1 - \left(1 - \frac{1}{m}\right)^{k \cdot n}\right)^k \approx \left(1 - e^{-kn/m}\right)^k. \tag{8.2}$$
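The following Python sketch illustrates the insert and lookup operations of a Bloom filter together with the false positive probability of Equation (8.2). It is a minimal illustration under stated assumptions: the class and function names are chosen here, and the k hash functions are derived by prefixing the input of SHA-256 with the index i, which is one convenient construction and not the only possible one.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m                  # array of m bits, all zero

    def _positions(self, item: bytes):
        # Assumption: h_i(item) = SHA-256(i || item) mod m, for i = 0..k-1 (k < 256)
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest, "big") % self.m

    def insert(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: bytes) -> bool:
        # True may be a false positive; False is always correct
        return all(self.bits[pos] for pos in self._positions(item))


def false_positive_probability(m: int, k: int, n: int) -> float:
    """Exact form of Equation (8.2) for n inserted elements."""
    return (1 - (1 - 1 / m) ** (k * n)) ** k


bf = BloomFilter(m=1024, k=4)
for word in (b"alpha", b"beta", b"gamma"):
    bf.insert(word)
print(b"alpha" in bf)                        # True, no false negatives
print(false_positive_probability(1024, 4, 100))
print((1 - math.exp(-4 * 100 / 1024)) ** 4)  # approximation from (8.2)
```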

8.1.3 Approximate matching: the concept

Often it is useful in computer science to identify similar digital objects. Prominent use cases are spam detection, malware analysis, network-based anomaly detection, biometrics, or digital forensics.

We first remark that although similarity has a natural meaning for us, a formal definition is still missing. The corresponding NIST special publication draft 800-168 [22] only describes approximate matching in terms of use cases, terminology, and requirements. We therefore skip a definition, too.

The basic aim of approximate matching is to extend the yes/no outcome of a cryptographic hash function to a continuous one in the scope of automatic detection of a digital object. As explained in Section 8.1.1, a cryptographic hash function yields a binary decision 'identical/differing' for a comparison of two input bit strings: 'identical' is encoded for instance as the matching integer 1, 'differing' as the non-matching integer 0. The output of an approximate matching comparison, on the other hand, is a matching score in the interval [0,1], where 1 means a high level of similarity and 0 a low level.

The NIST draft 800-168 [22] mentions two use case classes of similarity with two challenges each. First, approximate matching aims at finding resemblance of two objects. The two challenges within this class are object similarity detection (e.g., different versions of a document) and cross correlation, i.e. finding digital artefacts which share a common object (e.g., two files sharing an identical picture). Second, approximate matching should detect containment. [22] lists the two corresponding challenges: fragment detection (e.g., identifying a cluster of a deleted blacklisted file or an IP packet transferring a fragment of a classified document) and embedded object detection, i.e. finding an indexed trace within a digital artefact (e.g., a picture within an email).

The concept of approximate matching comprises two core functions: a similarity digest generation function and a similarity comparison function. In the terminology of [22] the first one is called the feature extraction function and the latter one is denoted as similarity function. We prefer our notation because it more obviously describes the goal of the respective function.

Given an input object, the similarity digest generation function identifies characteristic patterns within the given object. As usual these patterns are called features. The specification of an approximate matching algorithm therefore describes how to extract features from the given input. The set of all features is the output of the similarity digest generation function and is called the similarity digest.

The similarity comparison function takes as input two similarity digests and outputs a match score in [0,1]. The closer the match score is to 1, the more similar the corresponding two inputs of the similarity digest generation function are considered to be.

As usual with noisy input, the user of approximate matching has to define a threshold to decide about similarity. As a consequence, approximate matching suffers from the well-known error rates: the false match rate (FMR) describes the proportion of dissimilar objects falsely declared to match the compared object. On the other hand, the false non-match rate (FNMR) describes the proportion of similar objects falsely declared not to match the compared object.
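The dependency of both error rates on the chosen threshold can be sketched as follows. The function name error_rates and the sample scores are hypothetical and serve only as an illustration; a pair is declared a match if its score reaches the threshold.

```python
def error_rates(scores_similar, scores_dissimilar, threshold):
    """Return (FMR, FNMR) for a given decision threshold."""
    # Dissimilar pairs whose score reaches the threshold are false matches
    false_matches = sum(s >= threshold for s in scores_dissimilar)
    # Similar pairs whose score stays below the threshold are false non-matches
    false_non_matches = sum(s < threshold for s in scores_similar)
    return (false_matches / len(scores_dissimilar),
            false_non_matches / len(scores_similar))


similar = [0.9, 0.8, 0.4, 0.7]       # hypothetical scores of truly similar pairs
dissimilar = [0.1, 0.0, 0.5, 0.2]    # hypothetical scores of truly dissimilar pairs
print(error_rates(similar, dissimilar, threshold=0.45))   # (0.25, 0.25)
```

Raising the threshold lowers the FMR at the cost of a higher FNMR, and vice versa.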

Similarity may be considered on different layers of abstraction. The NIST draft 800-168 [22] distinguishes three layers:

1. First, bytewise approximate matching takes a bit string as input for the similarity digest generation function without any high-layer interpretation of the string, that is, the features are extracted directly from the input bit string. Bytewise approximate matching is therefore a general approach and may be applied to any bit string. However, it assumes that similar artefacts which are of interest for the digital forensic investigator are represented by similar bit strings; otherwise it fails within this use case. Bytewise approximate matching is often referred to as fuzzy hashing or similarity hashing.

2. Second, semantic approximate matching takes the interpretation of the application data into account and simulates the human similarity perception procedure. For instance, semantic approximate matching in the scope of pictures extracts the features from the visual perception of the picture rather than from its low-layer representation. Semantic approximate matching is often referred to as perceptual hashing or robust hashing.

3. Third, syntactic approximate matching is based on standardised internal structures of an artefact. For instance, within network packets a syntactic approximate matching algorithm may work on fields like source/destination MAC/IP addresses, ports, protocols.

As bytewise and semantic approximate matching are useful for data reduction, we give more insights into these approaches in the subsequent sections. Breitinger et al. [5] provide an in-depth overview and we summarise and extend their key aspects in what follows.

8.1.4 Bytewise approximate matching

According to Breitinger et al. [5] there are seven bytewise approximate matching algorithms published by the digital forensic community. In this section we review the three main approaches of feature extraction which seem to be the most promising ones.

The first feature extraction approach is used by the well-known bytewise approximate matching algorithms ssdeep (due to Kornblum [14]) and mrsh-v2 (due to Breitinger and Baier [3]). The similarity digest generation function subdivides the input byte stream (denoted as $m$) into chunks $m_1, m_2, \dots$ as depicted in Figure 8.1. The basic idea is that two digital artefacts are similar if they share a sufficient number of chunks.

Figure 8.1: Feature extraction of ssdeep and mrsh-v2

The end of a chunk $m_i$ (and thus the beginning of the subsequent chunk $m_{i+1}$) is called a trigger point. Such a trigger point is found if the final $r$ bytes before the trigger point meet a certain condition (typically $r = 7$ and these $r$ bytes determine an integer value, which has to match a predefined value for triggering). Each chunk represents a feature of the input and the feature set is the sequence of chunks, i.e. the input byte stream is fully covered by the feature set.
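The chunking idea can be sketched in Python as follows. This is a simplified illustration and not the algorithm of ssdeep or mrsh-v2: the function name split_into_chunks is chosen here, and a plain sum of the window bytes stands in for the rolling hash that real tools use to turn the final r bytes into an integer value.

```python
def split_into_chunks(data: bytes, block_size: int, r: int = 7) -> list:
    """Split data at trigger points determined by the final r bytes."""
    chunks, start = [], 0
    for end in range(r, len(data) + 1):
        window = data[end - r:end]           # the final r bytes before position end
        # Assumption: byte sum modulo block_size replaces the real rolling hash
        if sum(window) % block_size == block_size - 1:
            chunks.append(data[start:end])   # trigger point found: close the chunk
            start = end
    if start < len(data):
        chunks.append(data[start:])          # remaining bytes form the last chunk
    return chunks


data = bytes(range(256)) * 4
chunks = split_into_chunks(data, block_size=16)
print(len(chunks), "chunks cover", sum(len(c) for c in chunks), "bytes")
```

Because trigger points depend only on local content, inserting bytes near the start of the input shifts later chunk boundaries along with the content instead of changing every chunk.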

To represent a feature, it is hashed by a hash function $h$ (e.g., $h$ is FNV for ssdeep, $h$ is MD5 for mrsh-v2) and its hash value is either represented by a Base64 character (ssdeep) or a Bloom filter (mrsh-v2). In case of ssdeep the similarity digest is a sequence of Base64 characters, in case of mrsh-v2 it is a sequence of Bloom filters. In Example 8.3 we compute the ssdeep similarity digest of the photo given in Figure 8.2.

Figure 8.2: Sample input hacker-siedlung.jpg of ssdeep

Example 8.3: Similarity digest computation of ssdeep

We compute the ssdeep similarity digest of the photo given in Figure 8.2.

$ ls -l hacker-siedlung.jpg
-rw--- 1 baier baier 78831 2015-05-15 10:16 hacker-siedlung.jpg
$ ssdeep -l hacker-siedlung.jpg
ssdeep,2.13--blocksize:hash:hash,filename
1536:ZfICsORJt2PazD7Z2xqHmqL36uuXtrHTXkkknIKB+W2pDHviF4eYySb:\
ZfICNRf2CD7YwGqL36FXVTXQnIWgDvi2,"hacker-siedlung.jpg"

We first look at the file size, which is 78831 bytes. Then we invoke ssdeep; its flag -l suppresses the whole path listing in the output of ssdeep. The output lists the block size and the two parts of the similarity digest, which are separated by colons, followed by a comma and the file name.

The block size determines when a trigger point is found. It aims at splitting the input byte stream into approximately 64 chunks. It is always of the form $3 \cdot 2^k$, where $k$ is the smallest value with $3 \cdot 2^k \cdot 64 \geq$ file size. In our example we have $\frac{78831}{3 \cdot 64} = 410.6$ and $2^8 = 256 < 410.6 \leq 512 = 2^9$, thus $k = 9$ and the block size is $1536 = 3 \cdot 2^9$.

After the first colon, we get the first part of the ssdeep similarity digest corresponding to the block size 1536. It consists of 55 Base64 characters, where the character Z represents the hash value $h(m_1)$, f the hash value $h(m_2)$, and b the hash value $h(m_{55})$ of the final chunk. After the second colon we see the second part of the ssdeep similarity digest corresponding to the block size $2 \cdot 1536 = 3072$. We expect approximately half of the chunks.
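The block size rule stated in Example 8.3 can be reproduced with a short Python sketch. The function name block_size is chosen for illustration, and the code follows the formula of the text rather than the ssdeep source code, which may apply further adjustments.

```python
def block_size(file_size: int, target_chunks: int = 64) -> int:
    """Smallest 3 * 2**k with 3 * 2**k * target_chunks >= file_size."""
    k = 0
    while 3 * 2**k * target_chunks < file_size:
        k += 1
    return 3 * 2**k


print(block_size(78831))   # 1536, as in Example 8.3
```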

The second feature selection strategy is to extract statistically improbable features. This strategy is implemented by sdhash of Roussev [24]. The basic idea is that uncommon patterns serve as the baseline for similarity. A statistically improbable feature within sdhash is a sequence of 64 bytes with a high Shannon entropy, that is, a sufficiently large number of different bytes. The feature set of sdhash is the sequence of the statistically improbable features, which are represented by Bloom filters. There is a parallelised version available for use in large-scale investigations [25].
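To give an idea of the entropy criterion, the following Python sketch computes the Shannon entropy of a 64-byte window. It only illustrates the entropy measure; the actual feature selection of sdhash involves further steps that are not reproduced here, and the function name shannon_entropy is chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window: bytes) -> float:
    """Shannon entropy of the byte distribution in window, in bits per byte."""
    counts = Counter(window)
    total = len(window)
    # Sum of p * log2(1/p) over all byte values that occur in the window
    return sum(c / total * math.log2(total / c) for c in counts.values())


print(shannon_entropy(bytes(64)))          # 0.0, a single repeated byte value
print(shannon_entropy(bytes(range(64))))   # 6.0, 64 different byte values
```

A window of 64 bytes can reach at most 6 bits per byte, which happens exactly when all 64 bytes differ.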

The third feature selection strategy is based on a majority vote of bit appearance with a subsequent run length encoding. This approach is used by mvhash-B due to Breitinger et al. [4]. The majority vote step replaces each byte of the input byte string by either a 0x00 byte or a 0xFF byte. The mapping depends on the neighbourhood of the respective byte: if the 0 bits predominate in its neighbourhood, the byte is mapped to 0x00, otherwise it is mapped to 0xFF. Then run length encoding is used, where each sequence of identical bytes is replaced by its length. The basic idea of similarity is that predominating regions of a certain bit are characteristic for digital objects. The integers of the run length encoding are then inserted into Bloom filters. The similarity digest of mvhash-B is therefore a sequence of Bloom filters.

8.1.5 Semantic approximate matching

As semantic approximate matching extracts perceptual features, it is bound to a certain area of application, for instance images, audio streams or videos. Again Breitinger et al. [5] present an overview of semantic approximate matching algorithms in the context of pictures. This branch dates back to the early 1990s, when content-based image retrieval was an emerging research topic.

There are different feature classes which are used for image approximate matching. Breitinger et al. [5] mention histograms, low-frequency coefficients (e.g., from the discrete cosine transform), block bitmaps or projection-based approaches. To get an idea of image approximate matching, we briefly explain a block bitmap approach used by the robust hashing algorithm rhash due to Steinebach [29].

The similarity digest generation process of rhash is depicted in Figure 8.3. The bit length of the rhash value is fixed in advance. As usual we denote it by $n$. In a first step, the input image is converted to greyscale and normalised (e.g., to a preset size, with respect to orientation). Then the normalised and greyscaled picture is subdivided into $n$ disjoint blocks, which cover the image. For instance, if $n$ is a square, then rhash subdivides the image into $\sqrt{n}$ equally sized rows and columns, respectively. The sample in Figure 8.3 makes use of $n = 256 = 16^2$, that is, the input picture comprises 16 rows and columns, respectively. Next, for each block $i$ with $0 \leq i \leq n-1$, rhash computes the mean of its pixel values. We denote the mean of the $i$-th block by $M_i$ and the median of the sequence $(M_i)_{0 \leq i \leq n-1}$ by $M_d$. Finally, the block $i$ contributes to the rhash similarity digest by the bit $h_i$, where $h_i = 0$ if and only if $M_i < M_d$. A sample rhash similarity digest is given on the right in Figure 8.3.

Figure 8.3: Similarity digest generation of rhash [29]
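The block bitmap idea can be sketched in Python as follows. This is a simplified illustration of the block mean and median comparison described above, not the rhash implementation of [29]: the function name block_mean_hash is chosen here, and the input is assumed to be an already normalised, square greyscale image given as a list of pixel rows.

```python
from statistics import median


def block_mean_hash(pixels, blocks_per_side: int) -> list:
    """Return n = blocks_per_side**2 bits for a square greyscale image."""
    size = len(pixels) // blocks_per_side            # edge length of one block
    means = []
    for by in range(blocks_per_side):
        for bx in range(blocks_per_side):
            block = [pixels[y][x]
                     for y in range(by * size, (by + 1) * size)
                     for x in range(bx * size, (bx + 1) * size)]
            means.append(sum(block) / len(block))    # block mean M_i
    md = median(means)                               # median M_d of all block means
    return [0 if m < md else 1 for m in means]       # h_i = 0 iff M_i < M_d


# Hypothetical 4x4 image with four blocks of constant intensity
image = [[10, 10, 200, 200],
         [10, 10, 200, 200],
         [50, 50, 250, 250],
         [50, 50, 250, 250]]
print(block_mean_hash(image, blocks_per_side=2))     # [0, 1, 0, 1]
```

Since each bit only records whether a block is darker than the median block, moderate changes in brightness or compression tend to leave most bits unchanged.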

8.2 Authenticity and integrity of digital traces

In this section we look at the first use case of hash functions in digital forensics: ensuring authenticity and integrity of digital traces during the IT forensic process (e.g., during data acquisition). Remember that authenticity means that the origin of a digital trace is validated, while integrity describes the property that a digital trace did not change.

The use case 'authenticity and integrity of digital traces' is relevant for both dead and live analysis. We will focus on dead analysis in what follows (i.e., the digital forensic expert makes use of his own software), but we keep in mind that traces which are acquired from a live system (e.g., main memory) must be protected by hash values, too.

From Section 8.1 we know that cryptographic hash functions ensure integrity and authenticity by design due to their preimage resistance and second preimage resistance (see Definition 8.2). For this reason the use case 'authenticity and integrity of digital traces' assumes the usage of cryptographic hash functions.

An important issue is that we have to protect the hash values against tampering. There are two alternatives to achieve this goal: first, the classical analogue approach is to write down the hash values by hand in the narrative minutes (e.g., in the investigation notebook). Then the hash values are protected by the assumption that it is impossible to forge the handwriting of the investigator. Second, the digital approach is to compute a digital signature over the hash values. This requires a private cryptographic key, which is related to the investigator. In this case the hash values are protected by the assumption that it is impossible to forge a digital signature.

We now discuss the use case 'authenticity and integrity of digital traces' by looking at the classical data acquisition process of a dead system. To sum up, the paradigm is to first generate a master copy from the original device (because the original device must be touched as little as possible). Then the master copy is bitwise copied to get the working copy. Even if we only perform read-only commands on the working copy, we must later prove that the working copy did not change during the investigation (and hence any trace is directly extractable from the original device). The steps are as follows:

1. Compute hash value $h_1$ over the whole original volume.

2. Write hash value $h_1$ down in the physical logbook.

3. Make a 1-to-1 copy of the volume using dd. This is the master copy of the original device.

4. Compute hash value $h_2$ over the master copy.

5. Write hash value $h_2$ down in the physical logbook.

6. Compare $h_1$ and $h_2$: if both hash values match, the master copy is identical to the original device. Otherwise, we have to go back to step 3.

7. Generate a 1-to-1 copy of the master copy using dd. This is the working copy.

8. Compute hash value $h_3$ over the working copy.

9. Write hash value $h_3$ down in the physical logbook.

10. Compare $h_2$ and $h_3$: if both hash values match, the working copy is identical to the master copy and thus to the original device, too. Otherwise, we have to go back to step 7.

11. Perform the investigation read-only on the working copy and extract digital traces.

12. To finish the investigation and to prove integrity of the working copy, compute the hash value $h_4$ of the working copy after the investigation and check if $h_1 = h_4$ holds. If yes, any digital trace is directly related to the original device; otherwise the investigator has to identify the step where he changed the working copy.
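The hash, copy and compare pattern of the steps above can also be sketched in Python. This is a minimal illustration under stated assumptions: regular files stand in for devices, shutil.copyfile replaces dd, and the function names sha256_of and verified_copy as well as the file names are hypothetical.

```python
import hashlib
import shutil


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 hex digest of a file, read in chunks to handle large images."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verified_copy(source: str, target: str) -> str:
    """Copy source to target and return the common hash; raise on a mismatch."""
    h_source = sha256_of(source)
    shutil.copyfile(source, target)
    h_target = sha256_of(target)
    if h_source != h_target:
        raise RuntimeError("copy differs from source, repeat the copy step")
    return h_source


# Hypothetical usage with image files instead of devices:
# h1 = verified_copy("original.dd", "mastercopy.dd")
# h2 = verified_copy("mastercopy.dd", "workingcopy.dd")
# ... read-only investigation ...
# assert sha256_of("workingcopy.dd") == h1
```

Writing the hash values to the logbook or signing them remains a separate step that this sketch does not cover.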

We show how to apply this process on the basis of the well-known cryptographic library openssl in Example 8.4.

Example 8.4: Acquire first partition of an HDD

In Linux, storage media are typically identified by a device (that is, a 'file' in the directory /dev) starting with the two letters sd (historically for SCSI device) and a subsequent character to distinguish different devices. For instance, the first HDD is referred to as /dev/sda, an attached USB stick is then mapped to /dev/sdb, an external SSD is identified as /dev/sdc, and so on.

In our example we assume that our HDD is the device /dev/sda. Then its first partition is identified by a digit following the device name (e.g., /dev/sda1); an extended partition may be the device /dev/sda5.

We apply the general acquisition process and compute the SHA-256 hash value of this partition. In this example, we make use of the openssl tool, because openssl is the most common implementation of cryptographic algorithms like hash functions, encryption or digital signatures. After invoking openssl we have to tell the tool which class of cryptographic algorithms we want to use. Cryptographic hash functions are identified by the digest command dgst. The remaining arguments are the chosen hash function (the flag -sha256) and the input bit string of the hash function (in our example the first partition of the HDD, /dev/sda1).

# openssl dgst -sha256 /dev/sda1
SHA256(/dev/sda1)= b9c028c604b5a1dfaf8acf0098e7f26de32fd47\
38c581d9b6cbc84c98b28f39b
# dd if=/dev/sda1 of=mastercopy-sda1.dd
# openssl dgst -sha256 mastercopy-sda1.dd
SHA256(mastercopy-sda1.dd)= b9c028c604b5a1dfaf8acf0098e7f26de32fd47\
38c581d9b6cbc84c98b28f39b

As both hash values match, we generate the working copy and check the respective hash values.

# dd if=mastercopy-sda1.dd of=workingcopy-sda1.dd
$ openssl dgst -sha256 workingcopy-sda1.dd
SHA256(workingcopy-sda1.dd)= b9c028c604b5a1dfaf8acf0098e7f26de32fd47\
38c581d9b6cbc84c98b28f39b

Again both hash values match, that is, the working copy is bitwise identical to the first partition of our HDD. We next investigate the working copy read-only. In the last step we check that the working copy did not change during the processing, which we prove by applying SHA-256 to the image of the working copy after our investigation.

$ openssl dgst -sha256 workingcopy-sda1.dd
SHA256(workingcopy-sda1.dd)= b9c028c604b5a1dfaf8acf0098e7f26de32fd47\
38c581d9b6cbc84c98b28f39b

The hash value of the working copy after the investigation matches the respective hash value of /dev/sda1 and thus any digital trace from the working copy is extractable from the partition, too.

If for some reason the final hash value does not match, the investigator has to carefully analyse his narrative minutes to find a step where he modified the working copy. An example of destroyed integrity is given in what follows:

$ openssl dgst -sha256 workingcopy-sda1.dd
SHA256(workingcopy-sda1.dd)= df69b585b1a1af40b1c71d4fe9792fd1e843f8a\
2fe0c5c3a39aa205e652aabe4

8.3 Identification of known digital objects

An important issue in contemporary investigations of computer crime is handling the huge amount of data. The reason is that as of today information is stored and distributed in a digital rather than an analogue way. Low costs of storage devices and cheap unlimited access to the Internet support our ubiquitous use of digital devices. As a consequence, a digital forensic investigation typically confronts the IT forensic experts with terabytes of data stored on different sorts of physical or virtual devices: a classical personal computer, a laptop, a tablet PC, a smartphone, a mail provider, a cloud service provider, to name only a few.

The terabytes of data can be seen as a big haystack, where the actual evidence of some megabytes has to be found, that is, the investigator's task is to find the needle in the haystack. In this section we present concepts which automatically preprocess the terabytes of input data to support the investigator in proving or refuting a hypothesis. If we use the metaphor of 'finding the needle in the haystack', two concepts are obvious:

1. First, decreasing the haystack means to scale down the actual data which has to be inspected by the digital forensic expert. This concept is known as whitelisting or filtering out. Any object from the suspect's drive which is indexed by the whitelist is not considered for further inspection. We discuss whitelisting in Section 8.3.1.

2. Second, increasing the needle means to find hints to suspicious data structures which actually support a certain hypothesis. These hints have to be confirmed manually by the investigator. This concept is known as blacklisting or filtering in. We discuss blacklisting in Section 8.3.2.

For both concepts, we need databases of irrelevant data (i.e. a whitelist) or incriminated files (i.e. a blacklist), respectively. The most common whitelist is the Reference Data Set (RDS) from the US-NIST National Software Reference Library (NSRL) [23]. The blacklist is case dependent (e.g., pictures of child abuse, classified documents).

The most common basic technology for indexing files is hash functions. The procedure is quite simple: for each object of the seized device (e.g., a file), calculate the corresponding digest and compare the respective fingerprint against a whitelist or blacklist, respectively. As of today cryptographic hash functions (e.g., SHA-1, SHA-256 [21]) are used. Cryptographic hash functions are very efficient and effective in detecting bitwise identical duplicates, but they fail in revealing similar objects. However, investigators are typically interested in automatic identification of similar objects, for instance to detect the correlation between a blacklisted picture of child abuse and its thumbnail, which was discovered on a seized device.
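The lookup procedure can be sketched in Python as follows. The function name classify and the sample contents are hypothetical; in practice the hash sets would be loaded from a database such as the RDS.

```python
import hashlib


def classify(content: bytes, whitelist: set, blacklist: set) -> str:
    """Look up the SHA-256 digest of content in both hash sets."""
    digest = hashlib.sha256(content).hexdigest()
    if digest in blacklist:
        return "blacklisted"     # hint that must be confirmed manually
    if digest in whitelist:
        return "whitelisted"     # filtered out, no further inspection
    return "unknown"             # remains in the haystack


# Hypothetical hash sets built from sample contents
known_good = {hashlib.sha256(b"operating system file").hexdigest()}
known_bad = {hashlib.sha256(b"incriminated picture").hexdigest()}

print(classify(b"operating system file", known_good, known_bad))   # whitelisted
print(classify(b"incriminated picture", known_good, known_bad))    # blacklisted
print(classify(b"incriminated picture!", known_good, known_bad))   # unknown
```

The last call shows the limitation named above: a single changed byte yields a different digest, so the modified object is no longer recognised.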

8.3.1 Whitelisting

A whitelist is an index of known-to-be-good objects, that is, of non-suspicious patterns. The concept of whitelisting is quite simple: any object from the suspect's drive (typically an object is simply a file) which is indexed by the whitelist is not considered for further inspection. Therefore whitelisting is referred to as filtering out, too. In order to handle a whitelist with respect to memory, a compressed representation of each whitelisted object is used. Additionally, as whitelisted objects are not considered for further investigation, the false match rate (FMR) must be 0. Otherwise it would be possible for an attacker to filter out relevant digital traces. Therefore whitelists are based on cryptographic hash functions.

The most common whitelist is the Reference Data Set (RDS) from the US-NIST National Software Reference Library (NSRL) [23]. The RDS indexes files. Its website states²: "The RDS is a collection of digital signatures of known, traceable software applications. There are application hash values in the hash set which may be considered malicious, i.e. steganography tools and hacking scripts. There are no hash values of illicit data, i.e. child abuse images."



Example 8.5

We enumerate sample entries of the NSRL Reference Data Set.

$ less NSRLFile.txt
"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"
"000000206738748EDD92C4E3D2E823896700F849","392126E756571EBF112CB1C1CDEDF926","EBD105A0",\
    "I05002T2.PFB",98865,3095,"WIN",""
"0000004DA6391F7F5D2F7FCCF36CEBDA60C6EA02","0E53C14A3E48D94FF596A2824307B492","AA6A7B16",\
    "00br2026.gif",2226,228,"WIN",""
"000000A9E47BD385A0A3685AA12C2DB6FD727A20","176308F27DD52890F013A3FD80F92E51","D749B562",\
    "femvo523.wav",42748,4887,"MacOSX",""
"00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF",\
    "J0180794.JPG",32768,18266,"358",""
"00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF",\
    "J0180794.JPG",32768,2322,"WIN",""
"00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF",\
    "J0180794.JPG",32768,2575,"WIN",""
"00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF",\
    "J0180794.JPG",32768,2583,"WIN",""
"00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF",\
    "J0180794.JPG",32768,3271,"WIN",""
"00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF",\
    "J0180794.JPG",32768,3282,"UNK",""

We see that the image J0180794.JPG has a file size of 32768 bytes. It is listed six times, because the product code or the operating system code differs.

The RDS is updated four times a year. As of May 2015, the current release is RDS 2.48, which contains about 21 million unique files. Its size is about 6 GiB. As listed in Example 8.5, each entry of the RDS contains the SHA-1, MD5 and CRC32 checksums together with the file name and file size of the indexed file. The entries are ordered with respect to the numerical value of the SHA-1 hashes. Hence it is easy to decide whether an input file is indexed by the RDS, for example by a binary search over the sorted SHA-1 values.
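Because the entries are sorted by SHA-1 value, membership can be decided with a binary search. The following sketch assumes the SHA-1 column has already been extracted into a Python list of uppercase hex strings; the function name `is_indexed` is hypothetical, and the sample list contains only the four distinct SHA-1 values of Example 8.5. For fixed-length uppercase hex strings, lexicographic order coincides with numerical order.

```python
from bisect import bisect_left


def is_indexed(sorted_sha1: list, digest: str) -> bool:
    """Binary search for digest in an ascending list of SHA-1 hex strings."""
    digest = digest.upper()
    pos = bisect_left(sorted_sha1, digest)
    return pos < len(sorted_sha1) and sorted_sha1[pos] == digest


# The four distinct SHA-1 values from Example 8.5, in ascending order.
rds_sha1 = [
    "000000206738748EDD92C4E3D2E823896700F849",
    "0000004DA6391F7F5D2F7FCCF36CEBDA60C6EA02",
    "000000A9E47BD385A0A3685AA12C2DB6FD727A20",
    "00000142988AFA836117B1B572FAE4713F200567",
]

# SHA-1 of J0180794.JPG as listed in Example 8.5 (lower case on purpose).
print(is_indexed(rds_sha1, "00000142988afa836117b1b572fae4713f200567"))
# A digest that does not occur in the sample list.
print(is_indexed(rds_sha1, "F" * 40))
```

A binary search needs only about log2(N) comparisons, which is roughly 25 for the 21 million entries mentioned above.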

Although filtering out using the RDS is widespread, only few results are available about its effectiveness. Back in 2008, Douglas White from NIST claimed in a presentation at the American Academy of Forensic Sciences (AAFS) that file-based data reduction leaves an average of 30% of disk space for human investigation³. However, the RDS only indexes application hash values; it does not take any personal files into account.

Therefore Baier and Dichtelmüller [2] performed a study on data reduction for different user profiles. The baseline of their research is the data reduction in terms of the number of files rather than disk space (because an investigator has to look at a file rather than at a certain amount of memory). The methodology of Baier and Dichtelmüller [2] is to model different user behaviours and their corresponding file generation characteristics. Their data reduction rates for different profiles are given in Table 8.2.

$M_G$ denotes the number of generated files in the file system of the respective user profile and $M_{RDS}$ the number of those files that are also indexed by the RDS. The data reduction rate is the ratio of the number of indexed files to all files, that is $R = \frac{M_{RDS}}{M_G}$. To be effective, $R$ should be as close as possible to 1. For instance, the first row in Table 8.2 shows the result for a Windows XP operating system installation only, that is, there are no user files. Even so, only 52.45% of the files in the file system are indexed by the RDS. Obviously, the reduction rate decreases if we add user files. For example, if we model a user who mainly uses the computer for playing games (i.e. the profile gamer), the data reduction rate is below 15%. In this case the investigator has to inspect the remaining 85% of the files manually.

Profile                  Nr. of files: M_G   Indexed by RDS: M_RDS   Data reduction rate: R
XP, OS only                         10,467                   5,490                   52.45%
XP, standard software               22,801                   9,689                   42.49%
XP gamer                           126,684                  18,213                   14.38%
W7, OS only                         56,233                  18,703                   33.26%
W7 standard software                77,601                  23,414                   30.17%
W7 universal                       322,128                  42,296                   13.13%
Ubuntu 11.04                       172,789                  26,664                   15.43%

Table 8.2: Data reduction rates for different user profiles using RDS [2]
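The rates in Table 8.2 follow directly from the two file counts. As a small check, the sketch below recomputes R for every profile; the dictionary layout and the function name `reduction_rate` are illustrative choices, while the numbers are those of Table 8.2.

```python
# (M_G, M_RDS) per profile, taken from Table 8.2.
profiles = {
    "XP, OS only": (10467, 5490),
    "XP, standard software": (22801, 9689),
    "XP gamer": (126684, 18213),
    "W7, OS only": (56233, 18703),
    "W7 standard software": (77601, 23414),
    "W7 universal": (322128, 42296),
    "Ubuntu 11.04": (172789, 26664),
}


def reduction_rate(m_g: int, m_rds: int) -> float:
    """Data reduction rate R = M_RDS / M_G."""
    return m_rds / m_g


rates = {name: reduction_rate(*counts) for name, counts in profiles.items()}

for name, r in rates.items():
    # 1 - R is the share of files left for manual inspection.
    print(f"{name:<24} R = {r:6.2%}   remaining = {1 - r:6.2%}")
```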

These results are informally confirmed by practitioners, who are surprised by the 'high' data reduction rates of Baier and Dichtelmüller [2] and mention an expected data reduction rate of 5% for their cases. To sum up, the haystack does not shrink significantly when using the RDS. As preprocessing with the whitelist takes considerable effort, our overall assessment is that whitelisting is not effective for automatically preprocessing bulk data.

8.3.2 Blacklisting

In contrast to a whitelist, a blacklist indexes known-to-be-bad objects, that is, suspicious patterns. If an object from the suspect's drive matches an element of the blacklist, the investigator gets a hint to a digital trace, which must then be inspected manually. Thus blacklisting is also called filtering in. Again, to keep the memory footprint manageable, a blacklist makes use of a compressed representation of each of its elements.

In this section we assess different aspects of cryptographic hash functions and approximate matching in the scope of blacklisting. The aspects and our assessment are summarised in Table 8.3. To illustrate our rating, we assign categories starting with + for the best rating, followed in descending order by ⊕, �, and � down to the worst rating −.

Property                              Cryptographic hash function   Bytewise approximate matching   Semantic approximate matching
Run-time efficiency                   very fast + (1)               fast to medium ⊕ (1.5 to 6)     slow − (20 to 500)
Compression                           short + (≈ 256 bits)          1% to 3% of input length −      short ⊕ (256 to 600 bits)
Object similarity detection           No −                          Yes +                           Yes +
Cross correlation                     No −                          Yes +                           No −
Fragment detection                    No −                          Yes +                           No −
Embedded object detection             No −                          Yes +                           No −
Domain specific (e.g., only images)   No +                          No +                            Yes −
Encoding dependency                   Yes −                         Yes −                           No +
FMR / FNMR                            0% +                          Dependent �                     Dependent �
Indexing                              Yes +                         Inefficient �                   Inefficient �

Table 8.3: Assessment of hash functions with respect to blacklisting

We first turn to the aspect efficiency, that is, run-time and memory efficiency. We assign a relative speed of 1 to cryptographic hash functions. Bytewise approximate matching is then slower by a factor of about 1.5 to 6: for example, mrsh-v2 has a speed comparable to SHA-1, while sdhash is much slower. However, bytewise approximate matching is typically much faster than semantic approximate matching, because the latter requires more complex computational steps. With respect to compression, both cryptographic hash functions and semantic approximate matching perform well: the hash value is of fixed, small size. On the other hand, bytewise approximate matching outputs similarity digests of variable length, which is proportional to the input size (with the exception of ssdeep). For instance, a 1 TiB input requires 10 GiB to 30 GiB for its bytewise approximate matching blacklist. This constitutes a key drawback of bytewise approximate matching.

We next assess the aspect resemblance (see Section 8.1.3). Both bytewise and semantic approximate matching are able to decide about object similarity, which is not the case for cryptographic hash functions. With regard to cross correlation (i.e. finding digital artefacts which share a common object), only bytewise approximate matching is able to conduct it successfully. The same holds for the aspect containment, i.e. fragment detection and embedded object detection: only bytewise approximate matching copes with containment.
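The compression figures from the efficiency discussion can be checked with simple arithmetic. The sketch below recomputes the blacklist size for a 1 TiB input at digest ratios of 1% and 3%, and contrasts it with fixed-size 256-bit digests. The file count of one million is a purely hypothetical assumption for illustration, as are the variable and function names.

```python
GIB = 2**30  # bytes per GiB
TIB = 2**40  # bytes per TiB


def bytewise_blacklist_size(input_bytes: int, ratio: float) -> float:
    """Digest size in bytes when the digest is a fixed fraction of the input."""
    return input_bytes * ratio


low_gib = bytewise_blacklist_size(TIB, 0.01) / GIB   # 1% of 1 TiB
high_gib = bytewise_blacklist_size(TIB, 0.03) / GIB  # 3% of 1 TiB

# Assumption: the 1 TiB input consists of one million files, each
# represented by a single 256-bit cryptographic hash value.
n_files = 1_000_000
crypto_bytes = n_files * 256 // 8

print(f"bytewise approximate matching: {low_gib:.2f} to {high_gib:.2f} GiB")
print(f"cryptographic hashes: {crypto_bytes / 2**20:.1f} MiB")
```

Under this assumption the cryptographic blacklist is smaller by roughly three orders of magnitude, which illustrates why the proportional digest length is rated a key drawback.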

The next aspect is dependency with respect to application area and representation, respectively. Both cryptographic hash functions and bytewise approximate matching consider the bytestream of an object, hence they are not bound to a specific domain of applications (e.g., image similarity, audio similarity). However, as semantic approximate matching extracts features to simulate human perception, it is bound to a certain domain of applications. If we examine encoding dependency, the situation is reversed: the byte-level algorithms depend on the actual encoding (e.g., an image encoded as jpg is considered to be different from the same image encoded as png by both cryptographic hash functions and bytewise approximate matching). On the other hand, as semantic approximate matching considers the perceptual level, it does not depend on the encoding.
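Encoding dependency of byte-level hashing can be illustrated without any image library. The following sketch uses two text encodings as a stand-in for two image formats, which is an analogy chosen for brevity and not the jpg/png case from the text: the decoded content is identical, yet the bytestreams and therefore the SHA-256 digests differ.

```python
import hashlib

text = "the same content"

# Two byte-level representations of the same content.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")

digest_utf8 = hashlib.sha256(as_utf8).hexdigest()
digest_utf16 = hashlib.sha256(as_utf16).hexdigest()

# Identical after decoding, but different at the byte level.
same_content = as_utf8.decode("utf-8") == as_utf16.decode("utf-16-le")
same_digest = digest_utf8 == digest_utf16

print("same content:", same_content)
print("same digest: ", same_digest)
```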

With respect to error rates, both the false non-match rate (FNMR) and the false match rate (FMR) are of interest. For convenience the FMR should be small, otherwise the investigator wastes time manually checking erroneous hints. On the other hand, the FNMR must be as close as possible to 0. Otherwise the blacklist fails to point to potential evidence, and the trace must be found in a different way. Cryptographic hash functions do not suffer from error rates due to their security requirements from the cryptographic domain (e.g., preimage resistance, collision resistance). However, as approximate matching processes noisy input, it suffers from both a non-trivial FMR and FNMR. It is therefore the operator's responsibility to prioritise the error rates.
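The trade-off between the two error rates can be made concrete with a small experiment. The sketch below assumes the common definitions (FMR as the share of unrelated pairs reported as a match, FNMR as the share of related pairs that are missed) and a similarity score between 0 and 100 with a decision threshold. The scores, the ground truth, and the function name `error_rates` are hypothetical and serve only as an illustration.

```python
def error_rates(comparisons, threshold):
    """Return (FMR, FNMR) for a list of (score, related) pairs.

    A pair is reported as a match if its score is >= threshold.
    """
    related = [score for score, is_related in comparisons if is_related]
    unrelated = [score for score, is_related in comparisons if not is_related]
    false_matches = sum(1 for score in unrelated if score >= threshold)
    false_non_matches = sum(1 for score in related if score < threshold)
    return false_matches / len(unrelated), false_non_matches / len(related)


# Hypothetical similarity scores with ground truth (True = related pair).
comparisons = [
    (95, True), (60, True), (25, True), (10, True),
    (40, False), (15, False), (5, False), (0, False),
]

for threshold in (1, 20, 50):
    fmr, fnmr = error_rates(comparisons, threshold)
    print(f"threshold {threshold:>2}: FMR = {fmr:.2f}, FNMR = {fnmr:.2f}")
```

Lowering the threshold drives the FNMR towards 0 at the cost of a higher FMR, which is the prioritisation left to the operator.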

Our final aspect concerns indexing, that is, a way of ordering digests so that they can be looked up quickly. As explained in Section 8.3.1, the RDS sorts cryptographic hash values with respect to their numerical value. Hence indexing is easily possible for blacklists based on cryptographic hash functions. With respect to approximate matching, first approaches towards indexing are available. As they suffer from run-time or memory inefficiency, we rate approximate matching rather negatively with respect to indexing.

8.4 Summary

In this chapter we described the two main use cases of hash functions in digital forensics. The use cases are 'authenticity and integrity of digital traces' (ensured by applying cryptographic hash functions) and 'identification of known objects' (e.g., illicit files). In the latter case we showed how whitelisting and blacklisting work and how each of these concepts aims to perform data reduction. Our conclusion is that whitelisting is not effective and that blacklisting may be performed by cryptographic hash functions or approximate matching.