Robert F. Erbacher, IEEE Member
Department of Computer Science, UMC 4205 Utah State University
Logan, UT 84322
John Mulholland
Department of Computer Science, UMC 4205 Utah State University
Logan, UT 84322
Abstract
This research examines the application of statistical analysis techniques to the identification of data types embedded within a file, to assist analysts in locating data potentially relevant to criminal activity. The results show that statistical analysis can effectively aid identification of the types of data embedded in a file and the approximate location of these data types. This analysis identifies component data types irrespective of the type of file being analyzed. When applied, this technique will allow analysts to locate relevant data on a hard drive more effectively and efficiently, especially on today’s particularly large hard drives.
1. Introduction
Computer forensic analysts must analyze ever larger hard drives. The analysis typically requires examining a drive to identify information or data related to criminal activity, namely, how the hard drive was used and what data on the drive was accessed, by whom, and when. This process has been studied extensively from both a technical [1] and an incident response point of view [2][7].
While the data needed for this analysis is available on the hard drive, the amount of information on a typical hard drive makes locating it extremely time consuming. Hard drives of 100+ GB are becoming common, and home systems, or home NAS setups for storing large music and video databases, reach 500+ GB, even 1 TB, affordably. Thus, we must not only expect to deal with such large drives but must also anticipate the continuing, rapid growth of drive capacity and prepare for corresponding increases in the volume of data needing analysis.
This large volume of data makes the general application of computer forensics infeasible. The cost of computer forensic analysis relative to its value often results in many hard drives going completely unanalyzed, with full analysis reserved for critical scenarios, e.g., terrorism-related cases.
This difficulty is compounded by the increasing sophistication of criminals. Rather than simply locating files containing evidence of criminal activity hidden within the morass of files, analysts must locate information hidden within otherwise innocuous files. While many techniques can be used to hide information on a hard drive, we focus on locating and identifying relevant information embedded in other, innocuous-appearing files.
Identification and Localization of Data Types within Large-Scale File Systems
To this end, we applied a range of statistical analysis techniques to a variety of file types and showed the ability to identify types of data embedded within a file. This allows the analyst to identify what types of data are embedded in the file and the approximate locations of such data without needing to open and examine the contents of each file individually. This will result in enormous improvements in efficiency and effectiveness for the analyst. The following provides clarification of these concepts:
•File type – The overall type of a file. This is often indicated by the application used to create or access the file.
•Data type – Indicative of the type of data embedded in a file. For example, while the file type may be a Microsoft Word .doc file, the file may contain text, images, spreadsheets, or tables, etc. Thus a single file type will often incorporate multiple data types.
When applied, the statistical analysis will identify the presence of different data types without the forensic analyst needing to view the contents of the file itself. Rather the file is opened to gather statistics automatically. For instance, if there is believed to be child pornography on a computer system then the statistical analysis can identify the locations of hidden or embedded images in documents without needing to manually open each and every document, greatly improving analysis efficiency.
2. Previous work
There has been a lot of work in the identification of file types. This derives from the fact that one of the simplest mechanisms for keeping out prying eyes is to misidentify the file. Such misidentifications make simply opening a file by double clicking on it fail. Such simple evasions are more often used to keep other users of the system from perusing sensitive documents. However, with the growing sophistication of criminals and the serious consequences for many such cyber crimes, especially possessing or distributing child pornography, the criminal will seek more sophisticated mechanisms for hiding any evidence. This can either include the child pornography itself or the records of transactions associated with the sale of child pornography. There are many techniques that can be applied for the hiding of data [12]; we are concerned with the embedding and appending of data in this work.
In essence what we are attempting to identify is a form of covert channel or steganography [11]. By applying analysis, i.e. steganalysis [4], we can identify not only the presence of the embedded data but also the relative locations of many of the data types; essentially identifying where within a file the data is located. This will allow rapid viewing of the data either by attempting to extract it automatically or identifying how far through the file the analyst must browse in order to locate the identified data.
This prior work on file type identification does have ramifications on the identification of data types as we are essentially working towards the same ends just from different levels, i.e. different points of view. For instance, our use of window sizes could be considered to be a derivative of the work on text identification through n-grams [3][9]. The application of the statistical analysis for data type identification is itself a direct extension of the work on file type identification by Karresand et al. [5][6] and McDaniel et al. [10].
3. Scenario descriptions
While pursuing this research we had several scenarios of potentially hidden information in mind for which we geared our solutions. The fundamental premise is that a sophisticated criminal would not leave sensitive data unhidden. Even simple techniques of hiding data can make location of the data unfeasible, due to cost and time constraints, for all but the most important cases. Examples of such scenarios include:
•Appending data to an executable. This is particularly effective on Linux-based systems. Virtually any file type can be appended in this way. Such modified executables will continue to operate correctly. Similarly, many file types will ignore data appended to them.
  - A spreadsheet containing drug delivery schedules could easily be appended to a file.
  - Child pornography can be appended to files; applied to system files, large numbers of images can be hidden.
•Embedding data into a file. This is quite common for innocuous purposes but essentially revolves around incorporating data into a file.
  - Formatted text documents, such as Microsoft Word, can easily have images embedded into them. As many systems will have hundreds or thousands of such documents, it becomes extremely time consuming to parse through these files to distinguish innocuous from criminally relevant files.
  - Formatted text documents can similarly have spreadsheets or tables embedded. Such spreadsheets could identify criminally relevant transactions such as drug deliveries or sale of child pornography.
4. Approach
During the course of this research, we explored a wide range of statistical equations in order to identify which statistics were most effective at differentiating the data type components of a file. Clearly, not all of the statistics presented would be relevant when designing and implementing a differentiation engine. However, the completeness provided here will spare other researchers from retreading the same ground, as it makes clear which equations provide value and which do not.
4.1. Explored Statistical Algorithms
Thirteen statistics were chosen to determine the different characteristics of the file types. These statistics are:
•Average
•Delta Moving Average
•Standard Deviation
•Delta Standard Deviation
•Delta2 Standard Deviation
•Deviation from the Standard Deviation (std2)
•Distribution of Delta Standard Deviations
•Moving Average
•Kurtosis
•Distribution of Averages
•Distribution of Delta Averages
•Distribution of Standard Deviations
•Distribution of Deviations from the Standard Deviation
Of these thirteen statistics for the file and data types analyzed for this portion of the research we found that the average, kurtosis, distribution of averages, standard deviation, and distribution of standard deviations were sufficient to effectively differentiate the different types of data and the additional statistics added nothing beneficial to the analysis. Incorporation of additional data types may necessitate incorporation of additional statistics for full differentiation. Other statistics outside of those tested may make analysis more effective. This understanding should aid other researchers in focusing their efforts. A complete description of the analyzed statistics is presented in Appendix I.
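As a minimal sketch of the per-window statistics found sufficient above, the following assumes the conventional population definitions of mean, standard deviation, and excess kurtosis. The definitions actually used are those of Appendix I and may differ; the function name `window_stats` and the handling of constant windows are illustrative assumptions.

```python
import math


def window_stats(window: bytes) -> dict:
    """Average, standard deviation and excess kurtosis of the byte values in one window."""
    n = len(window)
    if n == 0:
        raise ValueError("window must not be empty")
    mean = sum(window) / n
    var = sum((b - mean) ** 2 for b in window) / n  # population variance
    std = math.sqrt(var)
    if var == 0:
        # Assumption: a constant window has no peakedness, so report zero.
        kurt = 0.0
    else:
        m4 = sum((b - mean) ** 4 for b in window) / n  # fourth central moment
        kurt = m4 / var ** 2 - 3.0  # excess kurtosis (0 for a normal distribution)
    return {"average": mean, "std": std, "kurtosis": kurt}
```

The two distribution statistics are computed over all windows of a file rather than within a single window, so they are not part of this per-window function.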
4.2. Sliding windows
As many of the above statistics are dependent on the sliding window size, we explored a range of window sizes to identify any impact the window size may have on the statistical analysis. More specifically, we examined window sizes covering the powers of 2 from 64 bytes to 16K bytes. We found that below 256 bytes the graphs became too cluttered and the features of the graph were obfuscated. Above 1K the characteristic features of the graphs were too greatly smoothed. This is partly a consequence of the file sizes we were using, but given the range of different files we experimented with, we determined that window sizes above 1K did not provide any added value. Values between 256 and 1K do not provide substantial differentiation. Thus, window sizes of 256 bytes and 1K are presented here. These values provide the most distinctive graphs.
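The smoothing effect of larger windows can be illustrated with a short sketch. It assumes non-overlapping windows (Section 5 describes dividing files into windows of equal size) and drops any trailing partial window; both choices, along with the names `window_averages` and `spread`, are assumptions made for illustration. The sizes are the five listed in the experimental design.

```python
# Window sizes listed in the experimental design (Section 5).
WINDOW_SIZES = [64, 256, 1024, 4096, 16384]


def window_averages(data: bytes, size: int) -> list[float]:
    """Average byte value of each full, non-overlapping window of `size` bytes."""
    return [sum(data[i:i + size]) / size
            for i in range(0, len(data) - size + 1, size)]


def spread(values: list[float]) -> float:
    """Range (max - min) of a series: a crude proxy for how busy its graph looks."""
    return max(values) - min(values) if values else 0.0


if __name__ == "__main__":
    # Synthetic data: a repeating 0..255 ramp, 16384 bytes long.
    data = bytes(range(256)) * 64
    for size in WINDOW_SIZES:
        avgs = window_averages(data, size)
        print(size, len(avgs), spread(avgs))
```

On this synthetic ramp, the 64-byte windows produce widely varying averages, while windows of 256 bytes or more average the ramp away entirely, which mirrors the trade-off between clutter and over-smoothing described above.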
4.3. Final statistical analysis
Clearly the full set of experimental data cannot be provided here as it amounted to hundreds of graphs. For the data sets included here, GnuPlot was used to generate the graphs. The graphs have a vertical axis ranging from 0 to 255 when byte values are being represented and 0 to 0.5 when probabilities are being represented. The horizontal axis ranges from 0 to 1 in the majority of cases and indicates relative position within a file. Thus, all files are essentially mapped onto a similar length, with shorter files being interpolated to add in the missing data points. This allows us to correlate similarities between data of the same type and identify consistent characteristic features of files. When dealing with probabilities the horizontal axis ranges from 0 to 255.
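The mapping of files of different lengths onto a common 0 to 1 axis can be sketched as follows. The text states only that shorter files are interpolated; linear interpolation and the function name `resample` are assumptions made for illustration.

```python
def resample(values: list[float], points: int) -> list[float]:
    """Linearly interpolate `values` onto `points` evenly spaced positions in [0, 1]."""
    if not values or points < 1:
        raise ValueError("need at least one value and one output point")
    if len(values) == 1 or points == 1:
        return [values[0]] * points
    last = len(values) - 1
    out = []
    for j in range(points):
        pos = j / (points - 1) * last   # fractional index into `values`
        lo = min(int(pos), last - 1)    # left neighbour, clamped so lo + 1 is valid
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out
```

For example, `resample([0.0, 10.0], 5)` stretches a two-window file onto five plot positions, so that it can be overlaid on longer files of the same type.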
5. Experimental design
To better determine the characteristics of file types, seven file types were chosen that are common on most hard drives, namely:
•Microsoft Word (.doc)
•Executable file (.exe)
•JPEG Image (.jpg)
•Adobe Portable Document Format (.pdf)
•Microsoft PowerPoint (.ppt)
•Microsoft Excel (.xls)
•Compressed Data (.zip)
Five files of each type were either acquired or created for the first part of the analysis. A program was written to divide these files into byte windows of the same size. For each window the program creates a statistic group that contains all of the statistics except for the distributions. The program calculates these at the file level using the corresponding statistic from the statistic groups.
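The file-level distribution step described above can be sketched as a histogram over the per-window values. Binning by rounding each value to the nearest integer in 0..255 is an assumption, as is the function name `distribution`; the text does not specify the binning used.

```python
from collections import Counter


def distribution(values: list[float]) -> list[float]:
    """Probability of each integer bin 0..255 among the per-window statistics of one file."""
    if not values:
        raise ValueError("no window statistics supplied")
    # Assumption: round each per-window value to the nearest integer and clamp to 0..255.
    counts = Counter(min(255, max(0, round(v))) for v in values)
    return [counts.get(v, 0) / len(values) for v in range(256)]
```

Passing the list of window averages yields a distribution of averages; passing the list of window standard deviations yields a distribution of standard deviations.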
Initial experiments were performed using five different window sizes: 64, 256, 1024, 4096, and 16384. As mentioned, our analysis indicated that 256 and 1024 were the most useful when analyzing the graphs visually. When analyzing the graphs algorithmically, alternative window sizes may be appropriate.
Finally, given the 5 window sizes, 13 statistics, and 7 file types, we initially created 455 graphs, each with 5 histograms. This was greatly reduced as we identified the most effective window sizes and statistics. In all cases, we continued to examine the 7 different file types. After initial analysis to identify characteristics of the different file and data types we created 9 additional files with anomalies embedded to show differentiation in actual scenarios.
6. Analysis of file type characteristics
6.1. Image (jpg) data
Jpg files are the most orderly of all the data types we examined. As can be seen from Figures 1 and 2, showing the average byte values and byte distribution probability, jpg files have average byte values in a tight band from 120 to 142. This consistency is maintained across the entire file except for a small region at the beginning of the file which has a much more chaotic range of potential average values. In essence, the use of a sliding window averages together the byte components, which given jpg’s format naturally results in a very consistent average value. The deviation at the head of the file is the result of header information describing the file.
Figure 1: Average byte values for jpg files. The file header shows a wide diversity. The rest of the file shows the narrowest band of any of the file types.
Figure 2: Distribution of byte averages of jpg data. The tight distribution for all example jpg data streams is confirmed by the narrow distribution graph.
Similarly, this consistency in the file results in a significant lack of peakedness. Consequently, the Kurtosis, Figure 3, maps uniformly to zero, except at the start of the file where header data is located. Given the unique appearance of each of the three graphs any of the three statistical techniques could be used to differentiate jpg imagery data.
Figure 3: Kurtosis of jpg data. The distribution from Figure 2 does not show the deviation of values at the beginning of the data stream. The kurtosis shows this and can be used for identification purposes.
Figure 4: Average byte values of zip data. Initial variability can be seen at the beginning of the data stream with occasional selective peaks. The end of zip data always contains a dip as data content deviates.
6.2. Archival (zip) data
In many ways, zip files are similar to jpg files: they show the characteristic deviations at the beginning of the file, where the file header information is located, though not as extreme, and they contain a narrow band of average data values. However, the band isn’t nearly as tight as that of the jpg files, ranging instead from 109 to 142. This difference in band ranges is one factor usable in differentiating between these two data types. The similarity results from the fact that both data formats use forms of data compression.
This can be shown in the average statistic, Figure 4, and the average distribution, Figure 5. The zip data statistics do not differ greatly when the files contain different file types with the possible exception of the images_hicontrast.zip which contained png files. This file contained a peak around 65% of the way through the file.
All of the data streams contain a significant dip at the end of the file. This is likely the table of contents for the file and the deviation in information contents explains the significant change in average values. Since all files are mapped to the same scale in terms of file lengths, using percentage through the file, very large zip files will appear to have a shorter dip. The number of files in the zip file, and consequently, the size of the table of contents also impact the length of this portion of the graph. Zip data also has a very low kurtosis throughout, as shown in Figure 6. This lack of variability in the Kurtosis can be used as a distinctive characteristic.
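The band ranges reported for jpg data (120 to 142) and zip data (109 to 142) suggest a simple differentiation rule, sketched below. The trim fraction used to ignore the header and trailing regions, the 95% threshold, and the function names are illustrative assumptions, not values from this study.

```python
JPG_BAND = (120.0, 142.0)  # band of window averages reported for jpg data
ZIP_BAND = (109.0, 142.0)  # wider band reported for zip data


def fraction_in_band(averages, band, trim=0.05):
    """Fraction of window averages inside `band`, ignoring `trim` of the windows at each end."""
    k = int(len(averages) * trim)  # assumption: header and trailer each occupy about 5%
    body = averages[k:len(averages) - k]
    if not body:
        raise ValueError("no windows left after trimming")
    low, high = band
    return sum(low <= a <= high for a in body) / len(body)


def guess_compressed_kind(averages, threshold=0.95):
    """Label a series of window averages as jpg-like, zip-like, or other."""
    if fraction_in_band(averages, JPG_BAND) >= threshold:
        return "jpg-like"
    if fraction_in_band(averages, ZIP_BAND) >= threshold:
        return "zip-like"
    return "other"
```

The tighter jpg band is tested first because every jpg-like series also falls inside the wider zip band.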
Figure 5: Distribution of byte averages for zip data. Similar to jpg data but the value range is larger and there are deviations, dependent on the files archived.
Figure 6: Kurtosis values of zip data. The consistently low values are distinguishable. The values clearly are not fixed at zero but are extremely small.
6.3. Adobe acrobat (pdf) data
Pdf data is distinctive in the chaotic nature of the average values throughout the file, Figures 7a and 7b. Ultimately, the range of values taken on by pdf data is quite large. It is the lack of any type of consistency in values that is unique with respect to pdf data. As with zip data there is a dip at the end of the file consistently across data streams.
Figure 7b shows the same data as Figure 7a with the graph using a window size of 1024 as opposed to 256. The reduced obfuscation may allow the data characteristics to be more easily analyzed and comprehended. Additionally, extrapolating in either direction will give the reader an idea of how smaller window sizes become even more obfuscated and larger window sizes begin losing structure for the data characteristics.
The high variability in values for the data is exhibited in the kurtosis of the data, Figure 9. A key distinctive feature is the lack of a smooth curve in the average distribution, Figure 8. This shows the value of the distribution and kurtosis in providing for differentiation and identification.
Figure 7a: Average byte values for pdf data. The chaotic range is clear. Dips are visible at the beginning and end of the data streams.
Figure 7b: Average byte values for pdf data but with a 1024 byte window size.
Figure 8: Distribution of average values for pdf data. In many respects the curve is similar to the previous data types. The lack of a smooth curve is distinctive.
Figure 9: Kurtosis of pdf data. The variability exhibited in the previous graphs is also shown here as there is more variability and thus more peakedness.
6.4. Executable data
Executable data streams essentially need to be considered as a class of two different types of data streams. For instance, installation programs (setup.exe type programs) are actually executable wrappers around file archives, i.e. zip data. The other class of files is pure executables. This example actually shows the effectiveness of this technique at identifying and differentiating different data types within a file. This is exemplified in Figure 10. The probability distributions clearly show patterns matching that of zip data with the associated peaks around an average byte value of 130. The non-installation programs show a completely different pattern.
In addition to the average value distributions from Figure 10, the kurtosis from Figure 11 also shows unique differences between the two classes of executables. Specifically, installation data streams will have more peaks at the beginning of the data stream and then normalize with the specification of the archived data. The normal executables will tend to have more peaks at the end of the data stream.
Figure 10: Distribution of average values of executable data. Key characteristics are the high probability of average values of ~130 for installation data streams and the wider distribution of values among lower average byte values for normal executable data.
Figure 11: The kurtosis of executable data streams shows high peakedness around the beginning of the data stream for installation programs and high peakedness around the end of the data stream for non-installation programs.
6.5. Spreadsheet (xls) data
These spreadsheet data streams are particularly unique in their characteristics. As can be seen from Figure 12, all xls spreadsheet data streams create a stair-step pattern when examining their average byte values over the specified byte windows. While we did not examine the characteristics and contents of xls files sufficiently to identify the cause of this feature we did identify it in every such spreadsheet file. Smaller spreadsheet files do not exhibit the feature as strongly, due to the lack of data, but it is clearly still present.
Other interesting features of such spreadsheet data are the high spikes in average values at both the beginning and end of the data stream. Additionally, a single spike can be identified in the middle of the jpeg_stats.xls file. This turns out to be a graph embedded into the middle of the spreadsheet data. This clearly not only identifies the existence of the artifact but its approximate position in the file, aiding rapid analysis of the file. We examine this feature in more depth in the next section on embedded file types.
6.6. Microsoft word (doc) data
The statistics from doc data depend largely on what type of data the data stream contains. If it contains text, then the distribution of averages is centered around 90. If the data contains images, then the averages are much higher, around 125. If it contains a lot of both, then peaks appear in both locations.
Figure 13 shows this set of data streams and clearly shows data streams which have images embedded within them. Any of the data streams which have average values that hit the 125-130 range have images within them, with exceptions at the very beginning and end of the data streams. The most interesting data stream is “With Images.doc” which includes a few small images. This data stream is essentially identical to “Formatted Text.doc” which acted as the original source data, except it contains no images, and “Plain Text.doc”, which had all formatting information as well as the images removed. The remaining two files show degenerate cases in which only images are incorporated into the data stream.
Figure 12: Average values for xls data. These data streams are particularly unique with the stair-step pattern. Also of note are the peaks at the beginning and end of the data streams. Finally, an embedded graph is visible at the 38% mark for one of the data streams.
Figure 13: Average values for doc data. These word processing data streams are interesting in the ability to identify the presence of different types of data within the file.
From analysis of these three data streams of interest we can identify deviations in the data; keeping in mind the graph represents percentage within the data and not absolute position. The relative location in the data stream where the images are defined can be identified from this graph. However, as will be seen with PowerPoint data, the location where such data is defined may not be relevant as the images could be located elsewhere. This relates to the idea of anchor points as a specification for the location of images, with the image definition located quite separate from its anchor point; often at the beginning of the file. In many cases, the imagery data takes so much of the file to specify compared to the rest of the file that the textual data becomes insignificant.
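Locating where image-like data is defined within a data stream can be sketched as a search for runs of consecutive windows whose averages fall in the 125 to 130 range noted above. The minimum run length and the function name `find_runs` are illustrative assumptions; positions are returned as fractions of the data stream, matching the horizontal axis of the graphs.

```python
def find_runs(averages, low, high, min_len=4):
    """Return (start, end) fractions for each run of at least `min_len`
    consecutive windows whose average lies in [low, high]."""
    runs = []
    n = len(averages)
    start = None
    # A NaN sentinel compares False against any bound, so it closes a run at the end.
    for i, a in enumerate(list(averages) + [float("nan")]):
        inside = low <= a <= high
        if inside and start is None:
            start = i
        elif not inside and start is not None:
            if i - start >= min_len:
                runs.append((start / n, i / n))
            start = None
    return runs
```

For a stream of 20 windows in which windows 10 through 14 average 127, `find_runs(averages, 125, 130)` reports a single run from 0.5 to 0.75 of the way through the stream. As noted above, this is where the image data is defined, which may differ from where the image is anchored in the document.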
6.7. Microsoft powerpoint (ppt) data
Of the files analyzed to date, Microsoft PowerPoint data was the most difficult to analyze. These files appear particularly chaotic in nature. In order to analyze such data we can begin relying on the results garnered so far. PowerPoint data intrinsically takes on the data types of its underlying formats. For instance, examining the graph from Figure 14, all of the data streams clearly maintain a similar pattern towards the end of the data stream. This is the page description and text itself. The deviation at the beginnings of the data streams is indicative of special or unique features of each of the data streams.
The unique characteristics of each data stream can be aided by analysis of the average distribution from Figure 15. The distribution of ch1.ppt is clearly showing evidence of containing imagery data; which can be validated through examination of the contents of the file. The imagery data seems to all appear at the beginning of the file as we suspect the PowerPoint format specifies all imagery data at the beginning of the data stream and then simply refers to the correct imagery data within the page description portion of the data stream.
Figure 14: Average values for ppt data. It appears that all imagery and diagram data is specified at the beginning of the data stream and then simply referred to within the page description of the latter portion of the data stream. It is this latter portion of the data that exhibits a high range of values as the pages are described, including text and formatting.
Figure 15: Distribution of averages of ppt data. This distribution can be used to confirm the types of data present within a data stream. Clearly the presence or lack of data types can be identified through the distribution. For instance the presence of imagery data is indicated by the presence of values within the 130 range.
From the average distribution graph we can see that the graph for “Review of Modeling Cell Division in Cell Automata.ppt” appears anomalous in that there is a large peak at value ~48. The graph for this data stream is obscured in Figure 14. The anomaly results from the fact that this document contains no formal page layout, figures, diagrams, or extensive formatting. Essentially the background of the slides is plain white and the body of the slides is plain text. This results in a very high proportion of the document’s data revolving around textual data and its associated binary values. Similar feature distinctions can be made for diagrams, etc.
In terms of differentiation, PowerPoint data can be differentiated by the large range of average values exhibited towards the end of the file, after definition of any imagery, diagram data, etc.
7. Recognition of embedded data types
The analysis in the previous section focused on typical sample data and the ability of the statistical techniques to differentiate the data types incorporated within these data streams. This section examines scenarios more typical of forensic analysis in which an individual is purposefully attempting to hide information. Clearly, we don’t examine every possibility, but these are scenarios that allow an individual to rapidly hide information that would take an analyst large amounts of time to find; the described techniques allow this information to be found much more efficiently and effectively.
7.1. Obfuscation within doc files
The first set of experiments was based on identification of relevant data embedded into Microsoft Word files. Consider for instance the number of such Microsoft Word files that may be present on a typical hard drive, especially for an individual with a professional job. Embedding criminally relevant data into such a file would make it extremely time consuming to locate; i.e. each document would need to be opened and browsed through. Such embedded data could potentially include child pornography, spreadsheets of the criminal activities such as drug deliveries or bets made and by whom. Browsing of such documents would need to be done slowly in order to ensure all data/images are loaded.
Figure 16: Averages of doc files highlighting the presence of a spreadsheet. More specifically the spreadsheet takes on the form of a large table within the Microsoft Word document.
Figure 17: Distribution of averages of doc files. The peaks aid identification of data types contained within a file. The table is indicated by the peak above 150.
An example of Microsoft Word (doc) files is shown in Figure 16. The goal here is to identify a spreadsheet of drug deliveries, which takes on the form of a table once embedded. This is identified by the large block of high variability in the third file, CyberSecurity-2SprdSht.doc. The table loses the stair-step pattern when converted to an embedded table but remains uniquely identifiable. From this unique characteristic, identification and location can be performed rapidly for retrieval. As for the other two files, it can be seen clearly that the first file contains large amounts of imagery data that had been removed from the other two files.
For additional differentiation, the average distribution, Figure 17, can be used to rapidly identify what types of data are contained within each of the files. The first file contains the typical characteristics of imagery data while the third file has a deviating spike above the 150 level indicative of the table.
7.2. Obfuscation within xls files
This second example looks at raw spreadsheet files; Figure 18 shows the average window values. In this scenario, the spreadsheet could be the target itself or it could be used to hide alternative data sources, such as child pornography. As with the other Windows formats, imagery data is specified at the beginning of the file and referenced later, within the body of the file. This is exemplified here, where the image is actually anchored at the end of the file but its data appears at the beginning of the file.
Figure 18: Averages of xls files. Imagery data is easily identified. Gaps in steps identify charts. Here deviations from expected patterns identify data of interest.
Figure 19: Standard deviation of xls files. The steep valleys confirm the presence of the charts and aid location identification.
The stair-step pattern of normal data is easily identified. More challenging is identification of the charts within the spreadsheet file. These are indicated by the larger gaps within the stair-step pattern. In terms of differentiation, the standard deviation, Figure 19, aids identification of the charts due to the steep valleys within the stair-step pattern.
In addition to identification of the charts, secrets.xls shows normal low values as a typical characteristic of the end of xls files. As this xls file is short this characteristic is exacerbated.
7.3. Obfuscation within exe files
In this scenario, we consider the possibility of appending data to executables as a method of hiding the data. Most executables will run fine with the appended data as execution will never reach this block. Thus any form of data can be appended to an executable including: raw text, spreadsheets, imagery, etc. Such data would be difficult to find with normal techniques as the file would be identified as an executable and would run fine. With such a wide variety of different implementations and compilations of each executable it can be impossible to identify if the executable matches a known (valid) distribution.
Figure 20 shows the average values for a single Linux executable, bmp2tiff, with different data streams appended. A fifth file, zip.exe, is a Windows Cygwin executable. This last file was incorporated for comparison purposes and shows the need for additional analysis of Windows executables as its pattern deviated from the other executables.
For the executable with data appended, the typical data patterns can be identified. For instance, with the xls data appended, the stair-step pattern can be identified, though not clearly due to the small size of the spreadsheet. The textual data is identified by a very narrow range of averages around the 95 value mark. This is similarly identifiable in the textual spreadsheet, though the spreadsheet’s small size makes only a portion of the graph identifiable.
Of greater interest in the identification of data types within or appended to an executable is the distribution of values within the standard deviation, Figure 21. Each file can be seen to have a different peak within this graph representing the type of data incorporated into the file. This can be used to rapidly identify the types of data comprising a file. Similarly, data appended to other file formats will usually have such formats open correctly with the appended data being ignored, thus leaving it hidden. This technique allows for the detection of such data.
Figure 20: Averages of exe files. The graph characteristics can be used to identify the types of data appended to the executable.
Figure 21: Distribution of standard deviations for exe files. Each file contains a different peak indicative of the data type within. The peak at the 100 mark identifies shared data.
8. Data extraction
Once alternative data types (sources) have been identified as being embedded within a specific file, extraction of that data is, in most cases, a straightforward matter of opening the file with the correct application and accessing the appropriate location within the file. Sufficient information can be identified from the analysis process specified above to locate the alternative data relatively quickly, thus speeding the analysis process.
The more difficult scenario is where data is appended to a file. In this case, the file cannot simply be opened. Instead, the information provided by the data distribution, namely the standard deviation distribution, will identify the type of data appended. Additionally, the average values will identify approximately where the change in data types occurs. With these two pieces of information, the appropriate location within the file can be algorithmically searched for known magic numbers associated with the given data types or for identifiable boundaries for data types without magic numbers. Given this position, the appended data can be extracted/cut from the obfuscating file and displayed with an appropriate application.
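The search step described above can be sketched as follows. This is a minimal illustration, not the implementation used in this research: the function names, the search radius, and the small signature table are assumptions introduced here, and the approximate offset is assumed to come from the change observed in the window averages.

```python
# Hypothetical signature table: a few widely documented magic numbers.
MAGIC_NUMBERS = {
    "zip": b"PK\x03\x04",
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
    "pdf": b"%PDF-",
    "ole2 (xls/doc)": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",
}


def find_magic_near(data, approx_offset, radius=4096, signatures=MAGIC_NUMBERS):
    """Return sorted (offset, type_name) pairs for signatures found
    within `radius` bytes of `approx_offset` (an assumed estimate
    taken from where the window averages change)."""
    start = max(0, approx_offset - radius)
    end = min(len(data), approx_offset + radius)
    hits = []
    for name, magic in signatures.items():
        pos = data.find(magic, start, end)
        while pos != -1:
            hits.append((pos, name))
            pos = data.find(magic, pos + 1, end)
    return sorted(hits)


def extract_appended(data, offset):
    """Cut everything from `offset` to the end of the host file."""
    return data[offset:]
```

A caller would pass the raw bytes of the suspect file and the estimated boundary, then write the result of `extract_appended` to a new file for viewing. Data types without magic numbers, such as raw text, would need a different boundary test that is not shown here.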
9. Performance
As mentioned in the Introduction, the goal of this technique is to improve the efficiency and effectiveness of forensic analysts. Clearly, the technique is effective at aiding the rapid identification of data types within a file, regardless of the file type. Adding this technique to the analysis process does not substantially increase analysis time. For instance, on a 3 GHz system with 1 GB of memory, analyzing 35 files (111 MB) with the 256-byte window size and all 14 statistics required approximately 4 minutes from start to finish (raw data to final graph creation). Clearly, analyzing terabytes of data will require some time; however, not all of the statistics would be needed in a production system, and there is plenty of room for optimization, as performance was not a focus of this project. Faster systems, especially in terms of disk access (a low-end ATA drive in this case), would greatly improve performance.
10. Conclusion
We have shown that multiple statistics can easily identify individual data components intrinsic to a file; the differentiation between file types and data types is a very important concept in the forensic analysis of computer data. The distributions/probabilities can be used to rapidly identify the possible existence of data types within a file and the individual statistics mapped to file position can then be used to validate the data type and identify approximate positioning of the identified data types within the file for rapid analysis. We have identified which statistics allow for the identification of each data type and what characteristics allow for said identification and differentiation from other data types.
Clearly, it should be quite feasible to develop algorithmic techniques that identify not only the overall file type but also the data types integral to the file. This will greatly reduce the manpower needed to examine large hard drives for this particular data-hiding technique.
11. Future work
There would be benefit from testing and experimenting with a wider range of file and data types with additional statistical algorithms. The next major component in the research will be the development of actual algorithms applying what has been identified as differentiating characteristics and determination of the effectiveness of the algorithmic approach. This would be a critical step before deployment by law enforcement for actual use.
Additionally, we must examine other mechanisms for data hiding and how such techniques can be identified. While many of these more advanced techniques will not be used for some time, we need to investigate techniques for their identification. For instance, it has been validated that data can be hidden in the BIOSes of PCI expansion cards [8].
12. References
[1] Brian Carrier, File System Forensic Analysis, Addison-Wesley, Upper Saddle River NJ, 2005.
[2] Eoghan Casey, Handbook of Computer Crime Investigation, Academic Press, 2002.
[3] M. Damashek, “Gauging similarity with n-grams: Language independent categorization of text,” Science, vol. 267, Feb. 1995, pp. 843–848.
[4] Neil F. Johnson and Sushil Jajodia, “Steganalysis: The Investigation of Hidden Information,” IEEE Information Technology Conference, Syracuse, New York, 1998, pp. 113-116.
[5] M. Karresand and N. Shahmehri, “Oscar – file type identification of binary data in disk clusters and ram pages,” in Proceedings of IFIP International Information Security Conference: Security and Privacy in Dynamic Environments (SEC2006), LNCS, 2006, pp. 413-424.
[6] M. Karresand, N. Shahmehri, “File Type Identification of Data Fragments by Their Binary Structure,” In Proceedings of the IEEE Information Assurance Workshop, West Point, NY, June, 2006, pp. 140-147.
[7] Warren G. Kruse II and Jay G. Heiser, Computer Forensics: Incident Response Essentials, Addison-Wesley, 2002.
[8] Robert Lemos, PCI cards the next haven for rootkits?, 2006, http://www.securityfocus.com/brief/360.
[9] W.-J. Li, K. Wang, S. Stolfo, and B. Herzog, “Fileprints: Identifying file types by n-gram analysis,” in Proceedings from the sixth IEEE Systems, Man and Cybernetics Information Assurance Workshop, June 2005, pp. 64–71.
[10] M. McDaniel and M. Heydari, “Content based file type detection algorithms,” in Proceedings of the IEEE 36th Annual Hawaii International Conference on System Sciences (HICSS’03), Washington, DC, 2003, pp. 332.1.
[11] G.J. Simmons, “The Prisoner's Problem and the Subliminal Channel,” In Proceedings of CRYPTO '83, 1984, pp. 51-67.
13. Appendix
This appendix describes the thirteen explored statistics in more detail. The equations for each of the statistics are specified in the table below. Descriptions of the equations are provided in the following sections.
(1) $\tilde{X}_j = \frac{1}{N} \sum_{i=1}^{N} X_i$
(2) $\Delta X_j = \mathrm{abs}\left( \tilde{X}_j - \tilde{X}_{j-1} \right)$
(3) $\Delta\Delta X_j = \mathrm{abs}\left( \Delta X_j - \Delta X_{j-1} \right)$
(4) $S_j = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( X_i - \tilde{X}_j \right)^2 }$
(5) $\Delta S_j = \mathrm{abs}\left( S_j - S_{j-1} \right)$
(6) $\Delta\Delta S_j = \mathrm{abs}\left( \Delta S_j - \Delta S_{j-1} \right)$
(7) $\tilde{S} = \frac{1}{J} \sum_{i=1}^{J} S_i, \quad SS_j = \mathrm{abs}\left( S_j - \tilde{S} \right)$
(8) $K_j = \frac{ N \cdot \sum_{i=1}^{N} \left( X_i - \tilde{X}_j \right)^4 }{ \left( \sum_{i=1}^{N} \left( X_i - \tilde{X}_j \right)^2 \right)^2 }$
(9) $D_{\tilde{X}B} = \Pr\left( (B+1) > \tilde{X}_j \ge B \right)$
(10) $D_{\Delta X B} = \Pr\left( (B+1) > \Delta X_j \ge B \right)$
(11) $D_{SB} = \Pr\left( (B+1) > S_j \ge B \right)$
(12) $D_{\Delta S B} = \Pr\left( (B+1) > \Delta S_j \ge B \right)$
(13) $D_{SSB} = \Pr\left( (B+1) > SS_j \ge B \right)$
13.1. Direct Statistics
1. Average: The average byte value for each window provides an indication of what range of values are occurring. More specifically, the graph of averages will show how the range of values in each window changes across the file.
2. Moving Average: The moving average is computed by taking the absolute value of the difference between the average of the current window and the average of the previous window. This can be construed as the delta of the average. The moving average identifies how the average is changing from one window to the next and whether the average is changing dramatically across the file. In essence this can be thought of as the derivative of the averages and identifies the rate at which the averages are changing; i.e., velocity from physics.
3. Delta Moving Average: The delta moving average is computed by taking the absolute value of the difference between the delta average of the current window and the delta average of the previous window. This can be construed as the second derivative of the byte averages and essentially identifies how quickly the moving average is changing; i.e., acceleration from physics.
4. Standard Deviation: The standard deviation is computed by summing the squares of the differences between each value in the window and the average value of that window. This sum is then divided by N, the number of elements in the window, and then the square root is taken. This essentially identifies how chaotic the element values within a window are and how tightly knit the elements are to the mean; i.e., are there many outliers in the window or are the values mostly consistent?
5. Delta Standard Deviation: The delta standard deviation is computed by taking the absolute value of the difference between the standard deviation of the current window and the standard deviation of the previous window. As with average byte values this can be construed as the first derivative, i.e. velocity, and identifies how quickly the standard deviation changes from one window to the next.
6. Delta2 Standard Deviation: The delta2 standard deviation is computed by taking the absolute value of the difference between the delta standard deviation of the current window and the delta standard deviation of the previous window. As with average byte values this can be construed as the second derivative, i.e. acceleration, and identifies how quickly the delta standard deviation changes from one window to the next.
7. Deviation from the Standard Deviation (std2): The deviation from the standard deviation is computed by first finding the average standard deviation for the entire file and then taking the absolute value of the difference between that average and the standard deviation for the current window. The goal of the std2 is to identify the consistency of the standard deviation across the entire file. The original standard deviation we examined only looked at the standard deviation within the individual byte windows. This statistic identifies the consistency of the standard deviation across the file as a whole.
8. Kurtosis: The kurtosis is used to show peakedness in a dataset. It is computed by multiplying the number of elements in the window, N, by the sum of the fourth powers of the differences between the values and the average. This product is then divided by the square of the sum of the squared differences between the values and the average. The goal with examining the effectiveness of the kurtosis is to identify flatness or consistency of the data directly. This is essentially another measure of consistency of the data.
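The direct statistics above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code used in this research: the windows are assumed to be non-overlapping, a trailing partial window is dropped, and the kurtosis of a constant window (zero denominator) is arbitrarily reported as 0.0. All function names are hypothetical.

```python
import math


def window_stats(data, window=256):
    """Return (average, standard deviation, kurtosis) per window.

    Assumes non-overlapping windows; a trailing partial window is dropped.
    """
    stats = []
    for start in range(0, len(data) - window + 1, window):
        block = data[start:start + window]
        n = len(block)
        mean = sum(block) / n
        sq = sum((x - mean) ** 2 for x in block)       # sum of squared differences
        fourth = sum((x - mean) ** 4 for x in block)   # sum of fourth powers
        std = math.sqrt(sq / n)
        # Assumption: report 0.0 when all bytes in the window are equal.
        kurt = n * fourth / sq ** 2 if sq else 0.0
        stats.append((mean, std, kurt))
    return stats


def deltas(values):
    """Absolute difference between each value and the previous one.

    Applied once this gives the delta statistics; applied twice, delta2.
    """
    return [abs(b - a) for a, b in zip(values, values[1:])]


def deviation_from_std(stds):
    """std2: distance of each window's standard deviation from the file-wide average."""
    mean_std = sum(stds) / len(stds)
    return [abs(s - mean_std) for s in stds]
```

For example, `deltas([m for m, _, _ in window_stats(data)])` yields the moving average described in item 2, and calling `deltas` on that result yields the delta moving average of item 3.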
13.2. Statistical Distributions/Probability of Occurrence
The goal with mapping the distribution of the statistics, i.e. measuring the probability of a statistical value occurring, is to provide a summary of the type of data in a file, providing an overview of the components of a file. It can be easier to identify components in the distribution than in the interpolated file view for the statistics discussed in the last section.
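A distribution of this kind can be sketched as a 256-bin histogram over the per-window values of one statistic, normalized by the number of windows. The sketch below is illustrative only: the function names and the peak threshold are assumptions introduced here, and values outside the range 0 to 255 are simply ignored.

```python
import math


def distribution(values, bins=256):
    """Probability that a per-window value v satisfies B <= v < B + 1,
    for each B in 0..bins-1. Out-of-range values are ignored (assumption)."""
    counts = [0] * bins
    for v in values:
        b = math.floor(v)
        if 0 <= b < bins:
            counts[b] += 1
    total = len(values)
    return [c / total for c in counts] if total else [0.0] * bins


def peaks(dist, threshold=0.1):
    """Bins whose probability meets a hypothetical threshold."""
    return [b for b, p in enumerate(dist) if p >= threshold]
```

Feeding the per-window averages of a file into `distribution` and then `peaks` gives a compact summary of which value ranges dominate, which is the kind of overview the distribution graphs are meant to provide.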
9. Distribution of Averages: The distribution statistics are taken at a file level for all statistics of the type gathered from each window. There are 256 values, 0-255, for each distribution, one for each possible byte value. The distribution of averages is the probability that an average chosen from all of the averages for the file is the value B.
10. Distribution of Delta Averages: Formulation identical to 9.
11. Distribution of Standard Deviations: Formulation identical to 9.
12. Distribution of Delta Standard Deviations: Formulation identical to 9.