

4.4 SL2013 System Design

4.4.2 Extraction of PDF Document Structure

Extraction of the structural features defined in Section 4.4.1 must meet the following requirements:

R1: All paths must be extracted with their exact counts.

R2: The extraction algorithm must be deterministic, i.e., for two PDF files with the same logical structure it must produce the same set of paths.

R3: Among multiple paths to a given object, the one chosen should be the semantically most meaningful with respect to the PDF Reference.

As the first step in the extraction process, the file is parsed using the PDF parser Poppler, version 0.14.3. Its key advantages are the robust treatment of various encodings used in PDF and the reliable extraction of objects from compressed streams. In principle, other robust PDF parsers would be suitable for extraction of structural paths as well. Our choice of Poppler was motivated by its open-source nature. The parser maintains an internal representation of the document and provides access to all PDF objects.

Conceptually, path extraction amounts to a recursive enumeration of leaves in the document structure, starting from the root node, i.e., Catalog. The extracted paths are inserted into a suitable data structure, e.g., a hash table or a map, and their counts are accumulated. However, several refinements must be introduced to this general algorithm to ensure that it terminates and that the above requirements are met.
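The unrefined recursive enumeration can be sketched as follows. This is only an illustration under simplifying assumptions: the document is modeled as nested Python dictionaries and lists rather than Poppler objects, array elements are assumed not to contribute a path component, and the function name `count_paths` and the toy `catalog` are hypothetical. It handles neither indirect references nor cycles.

```python
from collections import Counter

def count_paths(node, prefix="", counts=None):
    """Recursively enumerate leaves and accumulate a count per path."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        # Each dictionary key extends the path by one component.
        for key, child in node.items():
            count_paths(child, prefix + "/" + key, counts)
    elif isinstance(node, list):
        # Assumption: array elements share the path of the array itself.
        for child in node:
            count_paths(child, prefix, counts)
    else:
        # A leaf (number, string, name, ...): count its path.
        counts[prefix] += 1
    return counts

# Toy document structure, not taken from a real PDF file.
catalog = {
    "Type": "Catalog",
    "Pages": {
        "Type": "Pages",
        "Count": 2,
        "Kids": [
            {"Type": "Page", "Contents": "stream-a"},
            {"Type": "Page", "Contents": "stream-b"},
        ],
    },
}
paths = count_paths(catalog)
```

Here `paths["/Pages/Kids/Type"]` equals 2 because both page dictionaries contribute the same path.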

The requirement R1 is naturally satisfied by the recursive nature of our feature extraction. Since our recursion terminates only if a leaf node is encountered, the algorithm is guaranteed to never underestimate the count of a particular path. However, an overestimation of the path count may occur if there is a cycle in the structural graph leading to infinite recursion. To prevent it, the requirement R3 must be enforced.

The enforcement of requirements R2 and R3 is tightly coupled and ultimately relies on the intelligent treatment of indirect references. Obviously, one cannot always de-reference them, as this may result in an infinite recursion. Nor can one avoid de-referencing them altogether, as the algorithm would then hardly ever move beyond the root node. Hence, a consistent strategy for selective de-referencing must be implemented.

In our extraction algorithm, we approach these issues by maintaining a breadth-first search (BFS) order in the enumeration of leaf objects. This strategy assumes that the shortest path to a given leaf is semantically the most meaningful. For example, this observation intuitively holds for various cases when circular relations arise from explicit upward references by means of the Parent entry in a dictionary, as demonstrated by our example in Fig. 4.3. We find the path /Pages to preserve more semantics than /Pages/Kids/Parent, which refers to the same object. In Section 4.6.2 we present a refinement of this rough heuristic for selecting semantically more informative paths, introduced for Hidost.

Two further technical details are essential for the implementation of BFS traversal. It is important to keep track of all objects visited during the traversal and backtrack whenever an object was seen before in order to break graph cycles. It is also necessary to sort all entries in a dictionary in some fixed order before descending to the node's children. Since no specific ordering of dictionary fields is required by the PDF Reference, such ordering must be artificially enforced in order to satisfy the requirement R2.
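The following minimal sketch combines the three refinements described above: BFS order, a set of visited objects, and sorted dictionary keys. It is not the actual implementation. The `Ref` class, the `objects` table keyed by object number, and the function `extract_paths` are hypothetical stand-ins for what a parser such as Poppler would provide, and alphabetical ordering is just one possible fixed order.

```python
from collections import Counter, deque

class Ref:
    """Hypothetical stand-in for an indirect reference to object `num`."""
    def __init__(self, num):
        self.num = num

def extract_paths(objects, root):
    """Count structural paths in BFS order, de-referencing each object once."""
    counts = Counter()
    visited = {root}
    queue = deque([("", objects[root])])
    while queue:
        path, node = queue.popleft()
        if isinstance(node, Ref):
            if node.num in visited:
                continue  # object seen before: backtrack to break the cycle
            visited.add(node.num)
            node = objects[node.num]
        if isinstance(node, dict):
            # A fixed key order makes the traversal deterministic (R2).
            for key in sorted(node):
                queue.append((path + "/" + key, node[key]))
        elif isinstance(node, list):
            # Assumption: array elements share the path of the array.
            for item in node:
                queue.append((path, item))
        else:
            counts[path] += 1  # leaf reached
    return counts

# Toy object table with an upward Parent reference (a cycle).
objects = {
    1: {"Type": "Catalog", "Pages": Ref(2)},
    2: {"Type": "Pages", "Count": 1, "Kids": [Ref(3)]},
    3: {"Type": "Page", "Parent": Ref(2)},
}
bfs_paths = extract_paths(objects, root=1)
```

Because object 2 is first reached via /Pages, the later reference through /Pages/Kids/Parent is skipped, so no path containing Parent is produced and the traversal terminates.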

4.4.3 Learning and Classification

Once the counts or other embeddings over the set of structural paths are extracted, various learning algorithms can be applied to create a model from the given training data and use this model to classify unknown examples. For an overview of suitable algorithms, the reader may refer to any standard textbook on machine learning, e.g., [9, 36], or use any entry-level machine learning toolbox, such as SHOGUN³ or WEKA⁴. It is beyond the scope of this manuscript to provide comprehensive experimental evidence as to which machine learning method is most suitable for detection of malicious PDF files using structural paths. We have chosen two specific algorithms, decision trees and Support Vector Machines, for subjective reasons presented in the following section along with a high-level description of the respective method.
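Before a learning algorithm can be applied, the per-file path counts have to be turned into vectors of a common dimension. A minimal sketch of one way to do this is shown below; the function names, the example paths, and the binary variant are illustrative assumptions rather than the embedding actually used.

```python
def build_vocabulary(samples):
    """Collect every path seen in the training samples, in a fixed order."""
    return sorted({path for counts in samples for path in counts})

def embed(counts, vocabulary, binary=False):
    """Map a path-count dictionary to a vector indexed by the vocabulary."""
    vector = [counts.get(path, 0) for path in vocabulary]
    if binary:
        # Assumed alternative embedding: presence or absence of a path.
        return [int(v > 0) for v in vector]
    return vector

# Hypothetical path counts for two training files.
training = [
    {"/Type": 1, "/Pages/Kids/Type": 3},
    {"/Type": 1, "/OpenAction/JS": 1},
]
vocabulary = build_vocabulary(training)
count_vector = embed(training[0], vocabulary)
binary_vector = embed(training[0], vocabulary, binary=True)
```

Paths that occur in an unknown file but not in the vocabulary are simply ignored by `embed`, which keeps the vector dimension fixed.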

Decision Trees

The decision tree is a popular classification technique in which predictions are made in a sequence of single-attribute tests. Each test either assigns a certain class to an example or invokes further tests. Decision trees have arisen from the field of operational decision making and are especially attractive for security applications, as they provide a clear justification for specific decisions – a feature appreciated by security administrators. An example of a decision tree classifying whether one should take an umbrella when leaving home is shown in Fig. 4.10.

³ SHOGUN – http://www.shogun-toolbox.org/.
⁴ WEKA – http://www.cs.waikato.ac.nz/ml/weka/.

[Figure: decision tree with the tests "Is it raining now?" and "Will it rain later today?", yes/no branches, and the leaves "Take an umbrella" (twice) and "Leave umbrella".]

Figure 4.10: An example decision tree.
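A decision tree of this kind can be represented as nested single-attribute tests. The sketch below is one plausible reading of the umbrella example; the order of the two tests, the attribute names, and the tuple encoding are assumptions, since only the node labels are given.

```python
# Inner node: (attribute, {"yes": subtree, "no": subtree}); leaf: class label.
umbrella_tree = (
    "raining_now",
    {
        "yes": "Take an umbrella",
        "no": (
            "rain_later",
            {"yes": "Take an umbrella", "no": "Leave umbrella"},
        ),
    },
)

def classify(tree, example):
    """Follow single-attribute tests until a leaf (class label) is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches["yes" if example[attribute] else "no"]
    return tree

decision = classify(umbrella_tree, {"raining_now": False, "rain_later": True})
```

The sequence of tests passed on the way to a leaf is exactly the justification for the decision, which is what makes such models easy to inspect.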

The goal of automatic decision tree inference is to build a decision tree from labeled training data. Several classical algorithms exist for decision tree inference, e.g., CART [11], RIPPER [19], C4.5 [73]. We have chosen a modern decision tree inference implementation, C5.0⁵, version 2.07, which provides a number of useful features, e.g., automatic cross-validation and class weighting. It can also transform decision trees into rule sets, which facilitate the visual inspection of large decision trees.
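To illustrate what inference involves, the following sketch greedily grows a tree by choosing, at each node, the single-attribute threshold test that minimizes the weighted Gini impurity. This is a simplified CART-style procedure, not C5.0; the feature layout and labels in the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature, threshold) that lowers impurity the most, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels):
    """Grow a tree recursively; leaves are majority class labels (strings)."""
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t, build_tree(*zip(*left)), build_tree(*zip(*right)))

def predict(tree, row):
    """Apply the tests from the root down to a leaf."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical features: [count of /OpenAction/JS, count of /Pages/Kids/Contents]
rows = [[1, 1], [1, 2], [0, 5], [0, 9]]
labels = ["malicious", "malicious", "benign", "benign"]
tree = build_tree(rows, labels)
```

On this toy data a single test on the first feature separates the classes, so the resulting tree has one inner node and two leaves.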

Support Vector Machines

The Support Vector Machine (SVM) [21] is another popular machine learning algorithm. Its main geometric idea, illustrated in Fig. 4.11, is to fit a hyperplane to data so that the margin M between examples of two classes is maximized. In the case of a linear decision function, it is represented by the hyperplane's weight vector w and the threshold ρ, which are directly used to assign labels y to unknown examples x:

$$y(x) = w^\top x - \rho$$
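The linear decision function can be evaluated with a few lines of code. In the sketch below, the weight vector, threshold, and inputs are made-up values, and taking the sign of the score as the class label is the common convention, assumed here rather than stated above.

```python
def linear_decision(w, x, rho):
    """Compute y(x) = w^T x - rho for plain Python sequences."""
    return sum(wi * xi for wi, xi in zip(w, x)) - rho

def to_label(score):
    """Assumed convention: non-negative scores map to +1, others to -1."""
    return 1 if score >= 0 else -1

w = [2.0, -1.0]   # hypothetical weight vector
rho = 0.5         # hypothetical threshold

score_a = linear_decision(w, [1.0, 1.0], rho)
score_b = linear_decision(w, [0.0, 1.0], rho)
label_a = to_label(score_a)
label_b = to_label(score_b)
```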

Nonlinear decision functions are achieved by applying a nonlinear transformation to the input data, which maps it into a feature space with special properties, the so-called Reproducing Kernel Hilbert Space (RKHS). The elegance of the SVM lies in the fact that such transformations can be done implicitly, by choosing an appropriate nonlinear kernel function k(x1, x2) which compares two examples x1 and x2. The solution α to the dual SVM learning problem, equivalent to the primal solution w, can be used for a nonlinear decision function expressed as a comparison of an unknown example x with selected examples xi in the training data, the so-called "support vectors" (circles with black outlines in Fig. 4.11):

$$y(x) = \sum_{x_i \in SV} \alpha_i y_i \, k(x, x_i) - \rho$$
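The kernel form of the decision function can be sketched as follows. The RBF kernel, its parameter `gamma`, and all numeric values are assumptions chosen for illustration; no particular kernel is prescribed above. With a linear kernel the sum reduces to the linear decision function with w equal to the sum of αi yi xi over the support vectors, which the example uses as a sanity check.

```python
import math

def linear_kernel(x1, x2):
    """Plain inner product; recovers the linear decision function."""
    return sum(a * b for a, b in zip(x1, x2))

def rbf_kernel(x1, x2, gamma=0.5):
    """Gaussian (RBF) kernel, an assumed example of a nonlinear kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x1, x2)))

def kernel_decision(x, support_vectors, alphas, labels, rho, kernel):
    """Compute y(x) = sum_i alpha_i * y_i * k(x, x_i) - rho."""
    return sum(
        a * y * kernel(x, sv)
        for a, y, sv in zip(alphas, labels, support_vectors)
    ) - rho

# Hypothetical dual solution with two support vectors.
support_vectors = [[1.0, 0.0], [0.0, 1.0]]
alphas = [1.0, 1.0]
sv_labels = [1, -1]
rho = 0.0

# Linear kernel: equals w^T x - rho with w = [1, -1].
linear_score = kernel_decision([2.0, 0.5], support_vectors, alphas, sv_labels, rho, linear_kernel)

# Nonlinear kernel: same dual solution, different similarity measure.
rbf_score = kernel_decision([1.0, 0.0], support_vectors, alphas, sv_labels, rho, rbf_kernel)
```

Only the support vectors enter the sum, so the cost of classifying one example grows with the number of support vectors rather than with the size of the training set.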


Figure 4.11: Linear and nonlinear SVM. The decision boundary (vector w) between the two classes, depicted in two colors, is shown with a solid line and the margins M with dashed lines; points with a single outline are support vectors; the point with a double outline is a test point that falls within the margin.

Efficient implementations of SVM learning are available in various machine learning packages. In our experiments, we used a well-known stand-alone SVM implementation, LibSVM⁶, version 3.12.