
3.6 Sub-file Signatures in Practice

3.6.5 Digital Forensics Processes

Sub-file techniques offer significant speed improvements when processing a device for contraband. In a digital investigation, this reduction in processing time would be best utilised in the triage stage, where fast results are critical (see discussion in Section 2.2). Sub-file approaches operate at the logical file level, rather than the physical disk level, and therefore require the file system to be mounted. This means that random disk block sampling approaches, working at the physical level, are not directly compatible with sub-file methods. However, the file system level data reduction used by Grier and Richard [43] would be a good counterpart for this approach, for two reasons:

i) Data subsetting in Grier and Richard is achieved by applying heuristics after parsing file system metadata. This results in a set of logical files, which can then be immediately processed using the appropriate sub-file approach.

ii) Since the file system metadata is parsed in order to subset files on the disk, LBA addresses for files can be extracted for little additional cost while the heuristics are applied. This would allow files to be accessed without excessive file system overheads, and can achieve performance akin to EXT4 on NTFS partitions, regardless of operating system.20

While the sub-file approaches described in this chapter could process all relevant files on a disk, this would mean that many innocuous files would also be processed, such as application icons and other images built into the operating system or mundane applications. By applying sub-file signatures only to the subset of data most likely to be relevant, the two approaches can work in tandem to generate rapid results. The combination of techniques also has the benefit of being applicable to a live system, write-blocked disk, or forensic image.
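The metadata-driven subsetting step described above can be illustrated with a minimal Python sketch. It is not the method of Grier and Richard [43]: the heuristics (file extension, minimum size, excluded directory names) and every constant below are hypothetical placeholders, and the sketch queries a mounted file system through the operating system rather than parsing raw file system metadata, so it does not extract LBA addresses.

```python
import os

# All three heuristics below are hypothetical examples, not taken from the text.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}        # assumed file types of interest
EXCLUDED_DIR_NAMES = {"Windows", "Program Files"}   # assumed "innocuous" locations
MIN_SIZE_BYTES = 16 * 1024                          # assumed threshold to skip icons


def select_candidates(root):
    """Yield paths that pass metadata-only heuristics; file contents are never read."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIR_NAMES]
        for name in filenames:
            if os.path.splitext(name)[1].lower() not in IMAGE_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            if os.stat(path).st_size < MIN_SIZE_BYTES:
                continue
            yield path
```

The resulting paths form the set of logical files that would then be handed to the appropriate sub-file signature routine.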

enough. In this case it would be recommended to carry out confirmation hashes, or other verification steps, after all relevant files have been processed. This allows for the best compromise between fast results and robustness, allowing the analyst to quickly get an idea of the state of contraband on the device. Alternatively, experiments could be conducted to optimise the time to process both sub-file hashes and full file hashes on files which show positive hits. This could exploit caching technology on disk, as the entire file does not need to be re-read, or some of the file may already be in the read buffer. The sub-file portion may also be stored in memory along with a map of the LBA addresses, such that redundant reads are not performed for the full file hashing stage.
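The two-stage process suggested above, sub-file matching across all files followed by confirmation hashing of the hits, might be sketched as follows. This is an illustration under stated assumptions rather than the implementation evaluated in this chapter: SHA-256 is an arbitrary choice of hash, `signature_fn` is a hypothetical stand-in for whichever sub-file signature routine is in use, and the caching and LBA-map optimisations mentioned above are not modelled.

```python
import hashlib


def full_file_hash(path, chunk_size=1 << 20):
    """Hash an entire file in chunks (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def triage_then_confirm(paths, signature_fn, known_signatures, known_full_hashes):
    """Stage 1: collect sub-file hits over all files.
    Stage 2: confirm only those hits with a full file hash."""
    hits = [p for p in paths if signature_fn(p) in known_signatures]
    confirmed = [p for p in hits if full_file_hash(p) in known_full_hashes]
    return hits, confirmed
```

Because the second stage only touches files flagged in the first, the analyst receives the list of hits early and the confirmed list once verification completes.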

3.7 Conclusions

This chapter explored the possibility of creating robust and accurate sub-file signatures for the forensic detection of contraband media. Two general approaches were taken, one implemented for each of the two most popular image formats on the Web.

The first approach, applied to PNGs, combines a number of low entropy header features in the file to create a signature. These signatures were shown to be 99.8% accurate on a worst-case homogeneous dataset, with false positives being generated by solid background colours. Even on a dataset of relatively small files, significant speed improvements were found over traditional full file hashing, particularly on SSDs and the EXT4 file system. The second approach, applied to the JPEG format, uses coarse content representations, again found in the file header, to generate signatures which were essentially unique at the million image scale. An analysis of false positives showed that they were actually modified versions of the same file, and would therefore still be relevant to an investigation. This approach reads roughly the same amount of data as the PNG approach, with similar performance characteristics.

Both techniques were evaluated against the contraband signature criteria from Section 3.2 throughout the chapter, and have proven to be fit for the task. A particular property of these approaches is that they are bound by small-block disk performance, and are therefore better suited to NAND flash storage media. This work effectively lays the foundation for future forensics techniques which take advantage of the properties of modern non-mechanical media, which may be the key to dealing with forensic backlogs in the future.

Generic Sub-file Signatures

4.1 Introduction

The sub-file techniques in Chapter 3 are file type specific, and require an in-depth knowledge of all relevant file types to generate discriminative signatures. In contrast, the work in this chapter sets out to create sub-file signatures which are file type independent, and which should work equally well on all files without knowledge of their structure. Small pieces of the file are hashed in lieu of the entire file, with the approach being somewhat similar to block-based hashing techniques in the literature [29, 44–46]. The work presented here differs in that the goal is to reduce the amount of data to read from disk, with hashes being performed at the logical file level, instead of the physical disk level.

Rather than focus on a stream-based approach, which allows for non-essential file information to be skipped, or treated appropriately, the approach taken in this chapter is block-based, and simply samples the file from a particular offset. In practice, this can be achieved by sampling arbitrary blocks in the file, but for the purposes of this work only the beginning and end of the file were considered. Despite the approach being file type agnostic, datasets in this chapter contain both JPEG and PNG files, such that the discriminatory power and generalisability of the technique can be evaluated, as well as providing a variety of file sizes to work with.
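The block-based sampling described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation evaluated in this chapter: the 4096-byte block size, the use of SHA-256, and the function name `subfile_signature` are all assumptions made for the example.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed block size in bytes; not specified in the text above


def subfile_signature(path, block_size=BLOCK_SIZE, include_tail=True):
    """Hash the first block, and optionally the last block, of a file
    instead of its full contents. File structure is never interpreted."""
    digest = hashlib.sha256()  # hash algorithm is an assumption
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        digest.update(handle.read(block_size))
        # Sample the tail only if the file extends beyond the first block,
        # and never re-read bytes already covered by the first block.
        if include_tail and size > block_size:
            handle.seek(max(size - block_size, block_size))
            digest.update(handle.read(block_size))
    return digest.hexdigest()
```

Regardless of file size, at most two blocks are read per file, which is where the reduction in data read from disk comes from. The trade-off is that two files differing only between the sampled blocks produce the same signature.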

Block-based sub-file generation alleviates some of the overheads and difficulties of the file-specific approach, as it should generalise to all file types, including those yet to be adopted. Ideally, as few blocks as necessary should be read from the disk to effect the same kind of data reduction achieved in Chapter 3, and attain the associated speed benefits. The work in this chapter is also evaluated in terms of the criteria identified in Section 3.2.

Dataset             Type   Files     Mean size           Median size
                                     KiB      MiB        KiB      MiB
Flickr 1 Million    JPEG   1000000   124      0.12       117      0.11
Govdocs PNG         PNG    108885    1426     1.39       344      0.34
Flickr Subset       JPEG   25000     118      0.12       112      0.11
Flickr Subset PNG   PNG    25000     295      0.29       295      0.29
Govdocs Sub. PNG    PNG    25000     535      0.52       152      0.15

Table 4.1 Details of the datasets used in this chapter.

The full Flickr 1 Million and Govdocs PNG datasets are used for the local disk experiments in Sections 4.3 and 4.4, while 25,000 image subsets were used for the network storage experiments in Section 4.5.

4.2 Description of Datasets

The local disk experiments in Sections 4.3 and 4.4 make use of the full Flickr 1 Million and Govdocs PNG datasets discussed in the previous chapter. Section 4.5 explores the performance of generic sub-file approaches in networked environments, which have much lower throughput than local storage devices. As such, Section 4.5 makes use of 25,000 image subsets of the Flickr 1 Million and Govdocs PNG datasets in order to make experiment run time manageable, and to save on cloud storage costs. Details of the datasets are provided in Table 4.1. No modifications in the binary or pixel domain were made to the images, with the exception of converting Flickr images to PNG for one of the subsets. Subsets were chosen to create three distinct datasets with increasing file sizes to demonstrate file size scaling performance of sub-file approaches in networked environments.

The first subset, Flickr Subset, is composed of the first 25,000 Flickr 1 Million images in numerical file order (0.jpg, 1.jpg, 2.jpg, ..., 24999.jpg), and is 2.81 GiB in total. No modification was made to these files. The second subset, Flickr Subset PNG, is the same 25,000 images converted to PNG to increase their file size, totalling 7.04 GiB. Once again, the Python Pillow library [116] was used for conversion to the PNG format. The final collection is the first 25,000 images of the Govdocs PNG dataset, as listed by Python’s os.listdir function, and is the largest subset at 12.7 GiB. Further details are provided in Appendix A.2. The same file size considerations from Section 3.3.1 apply, as all datasets possess a median file size under 350 KiB, with the mean file size of all subsets under 550 KiB.
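Selecting the first files "in numerical file order", as described above for the Flickr subset, requires sorting by the integer value of the file name rather than lexicographically (otherwise 10.jpg would precede 2.jpg). A minimal sketch, assuming file names of the form `<integer>.jpg`; the function name `first_n_numeric` is hypothetical:

```python
import os


def first_n_numeric(directory, n, extension=".jpg"):
    """Return paths of the first n files named <integer><extension>,
    ordered by the integer value of the file name."""
    numbered = []
    for name in os.listdir(directory):
        stem, ext = os.path.splitext(name)
        if ext.lower() == extension and stem.isdigit():
            numbered.append((int(stem), name))
    numbered.sort()
    return [os.path.join(directory, name) for _, name in numbered[:n]]
```

Note that `os.listdir` itself returns entries in arbitrary order, so a subset defined by its listing order, as for the Govdocs subset, depends on the file system on which the listing was made.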