Non-Parametric Statistical Techniques for Computational Forensic Engineering


Jennifer Lee Wong

Abstract

Computational forensic engineering is the process of identifying the tool or algorithm that was used to produce a particular output or solution by examining the structural properties of that output. We introduce a new Relative Generic Forensic Engineering (RGFE) technique that has several advantages over previously proposed approaches. From the quantitative point of view, the new RGFE technique not only identifies the tool used more accurately but also attaches a level of confidence to the identification. It also achieves a finer degree of classification, because it can identify an output as having been produced by an unknown tool. We introduce a generic formulation which enables rapid application of the RGFE approach to a variety of problems that can be formulated as 0-1 integer linear programs. Additionally, we present forensic engineering scenarios which enable a natural classification of the forensic engineering task with respect to the types and amount of information available to conduct the classification.

From the technical point of view, the key innovations of the RGFE technique include the development of a simulated annealing-based CART classification and clustering technique and a generic property formulation technique which provides a systematic way to develop properties for a given problem and facilitates their reuse. In addition to solution properties, we introduce instance properties which enable an enhanced classification of problem instances, leading to a higher accuracy of algorithm identification. Finally, the single most important innovation, property calibration, interprets the value of a given property for a given algorithm relative to the values for other algorithms. We demonstrate the RGFE technique on two canonical optimization problems, boolean satisfiability (SAT) and graph coloring (GC), and use statistical techniques to establish the effectiveness of the approach.

1 Introduction

1.1 Motivation

The emergence and rapid growth of the Internet, and in particular Peer-to-Peer networking, has enabled more convenient, economical, and faster methods of intellectual property distribution, which indirectly also enable copyright infringement. Software piracy resulted in a loss of over $59 billion globally between 1995 and 2000, and continues to induce an average loss of $12 billion each year in the United States alone [49]. Intellectual Property Protection (IPP) techniques, such as watermarking and fingerprinting, have been proposed to help prevent piracy and protect software. These techniques have shown significant potential. However, they introduce additional overhead on each application of the tool and IP, and they cannot be applied to tools which already exist.

Computational forensic techniques remove these limitations and enable identification of the tool or algorithm which was used to produce a particular output or solution. Therefore, forensic engineering techniques provide an inexpensive approach that can be applied to any existing or future design or software tool. By examining the structural properties of a particular output, or solution, generated by a tool, the RGFE technique aims to identify the tool used to produce the output with a high degree of confidence.

It is important to note that forensic engineering has a number of other applications that in many situations have even higher economic impact. For instance, computational forensic engineering can be used for optimization algorithm development, as a basis for developing or improving other IPP techniques, for the development of more powerful benchmarking tools, for enabling security, and for facilitating runtime prediction. More specifically, computational forensic engineering can be used to perform optimization algorithm tuning, instance partitioning for optimization, algorithm development, and analysis of algorithm scaling. For example, forensic engineering can analyze the performance of the algorithm on various test cases and pinpoint the types of instances on which the algorithm does not perform well. The technique can also be used to partition instances into components, each of which can be processed using the algorithm that performs best with respect to the structure of that part. Furthermore, one can tune parameters of heuristics with respect to the properties of the targeted instance.

Additionally, computational forensic engineering can assist in the development of IPP techniques such as watermarking, obfuscation, and reverse engineering. Watermarking techniques often add a signature in the form of additional constraints into the instance structure. Forensic techniques can be used to determine the proper type of additional constraints to embed in the instance in order to ensure the uniqueness of the watermark without reducing the quality of the obtained solution. Reverse engineering of algorithms can be facilitated by forensic analysis in several ways. For example, the forensic technique can be used to determine which specific instances of the problem to analyze in order to identify the key optimization mechanisms of the algorithm.

Benchmark sets can be built to accurately identify the benefits and limitations of an algorithm with respect to a particular type of instance. Forensic analysis can also be used to generate instances with specific structures on which an algorithm performs poorly or exceptionally well, or to build a compact yet diverse benchmark set which fairly tests all competing algorithms/tools.

Security applications include the generation of secure mobile code and runtime checks of code. For example, one can use computational forensic techniques to check whether or not a delivered piece of code was indeed generated using a particular compiler by looking at the register assignment (graph coloring) and scheduling. Lastly, forensic engineering can be used for the runtime prediction of a particular algorithm. Proper resource allocation, such as memory, can be identified using forensic techniques by examining the memory use and runtime of an algorithm on instances with similar properties and of similar size.

The main motivation for our work is provided by the analysis of the limitations of the initially proposed computational forensic engineering technique [47]. The initial technique performed well on a number of specific instances and on a specific small set of algorithms for two optimization problems: graph coloring and boolean satisfiability. This technique showed remarkable effectiveness, however, only under rather limiting conditions. The RGFE technique eliminates these conditions and provides several advantages that are not only quantitative but, more importantly, qualitative and conceptual.

RGFE can be applied to an arbitrary problem which can be formulated as a 0-1 linear programming problem. In this generic formulation, properties of the problem are extracted and used to analyze the structure of both the instances of the problem and the outputs or solutions of a representative set of tools. Using the information gathered, the RGFE technique builds and verifies a Classification and Regression Tree (CART) model to represent the classification of the observed tools. Once built, the CART model can be used to identify the tool used to generate a particular instance output. The RGFE approach consists of three phases: Property Collection, Modeling, and Validation. The key enabling factors in the property collection phase are the ability to extract properties of a given problem systematically and to calibrate these properties to reflect the differences between solutions generated by the tools.

1.2 Motivational Example

In order to demonstrate the benefits of the RGFE technique, we use an example based on the boolean satisfiability problem. For the sake of brevity, we focus our attention on only one of the novelties of the RGFE approach: the use of properties of the solution to facilitate the classification process. Specifically, we establish the need for instance properties using two basic boolean satisfiability algorithms and two instances. The key observation is that one has to consider the properties of a specific instance of the problem in order to properly interpret and identify a particular solution.

The boolean satisfiability problem (SAT) is an NP-complete combinatorial optimization problem. A SAT problem consists of a set of variables and a set of clauses, each containing a subset of the complemented and uncomplemented forms of the variables. If each variable can be assigned a truth value in such a way that every clause is satisfied (contains at least one literal that evaluates to True), the instance is satisfiable. A solution to the problem is a satisfying truth assignment for the variables. Examples of SAT instances are shown in Figure 2. A formal definition and additional information on the boolean satisfiability problem are presented in Chapter 3.3.

We first introduce two simple algorithms for solving the SAT problem. Each of the algorithms uses a different optimization mechanism to solve the problem. The first algorithm focuses on the number of occurrences of a particular literal in the SAT instance. This “Literal Approach” counts the number of occurrences in the instance for both the uncomplemented and complemented form of each variable. The approach assigns a truth value to each of the variables according to the count. For example, if a variable v_i appears complemented ten times and uncomplemented twenty times, the algorithm assigns v_i to True.

    Literal Approach {
        Index all Variables;
        Assign all Variables in Clauses of size 1;
        while (instance not satisfied) {
            Select Literal(s) with highest appearance count;
            if (multiple literals selected) {
                Select literal with lowest complemented appearance;
                if (multiple literals still selected)
                    Select literal with lowest index;
            }
            Assign Variable according to selected literal form;
            Simplify Instance;
            Assign all Variables in Clauses of size 1;
        }
    }

    Difficult Clauses {
        Index all Variables;
        Assign all Variables in Clauses of size 1;
        while (instance not satisfied) {
            Select all clauses of smallest size;
            Select literal which occurs most in selected clauses
                (in case of tie, select one with lower index);
            Assign Variable according to selected literal form;
            Simplify Instance;
            Assign all Variables in Clauses of size 1;
            CurrentSize++;
        }
    }

Figure 1: Motivational example: Pseudo-code for boolean satisfiability algorithms (Literal Approach, Difficult Clauses).

Instance 1:
(x₁+x₂)(x₁+x₃)(x₁+x₄)(x₁+x₅)(x₁+x₄+x₅)(x₁+x₃+x₅)(x₁+x₃+x₅)(x₁+x₂+x₃)(x₁+x₂+x₄+x₅)

Instance 2:
(x₆+x₂)(x₆+x₅)(x₁+x₃)(x₁+x₂)(x₁+x₆+x₇)(x₁+x₂+x₆)(x₁+x₂+x₇+x₉)(x₁+x₂+x₉+x₁₀)(x₁+x₄+x₇)(x₆+x₈+x₉)(x₂+x₃+x₄+x₆)(x₃+x₅+x₆+x₇+x₉)(x₂+x₃+x₆+x₈)(x₆+x₄+x₁₀)(x₄+x₅+x₆+x₇+x₈+x₉)

Figure 2: Motivational example: Two boolean satisfiability instances.

The second algorithm, which we call “Difficult Clauses”, attempts to satisfy all short clauses first. Clauses which contain few literals (variables in either complemented or uncomplemented form) are considered more difficult to satisfy because they offer fewer assignment options. The algorithm selects the literal which occurs most frequently in the short clauses, and assigns the variable according to the literal’s appearance. Pseudo-code for each of the algorithms is presented in Figure 1.
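The two heuristics in Figure 1 can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation. It assumes a DIMACS-style encoding (a positive integer for an uncomplemented literal, a negative integer for a complemented one), prefers the uncomplemented form on a full tie, and returns `None` when a greedy choice empties a clause, since Figure 1 does not say what happens in that case. The `CurrentSize` counter of Figure 1 is omitted because the smallest clause size is recomputed on every pass. The function names and the toy instance are hypothetical.

```python
from collections import Counter

def simplify(clauses, lit):
    """Drop clauses satisfied by `lit`; delete the opposite literal elsewhere."""
    return [c - {-lit} for c in clauses if lit not in c]

def assign_unit_clauses(clauses, assignment):
    """Repeatedly assign the variable of any clause of size 1."""
    while True:
        unit = next((c for c in clauses if len(c) == 1), None)
        if unit is None:
            return clauses
        (lit,) = unit
        assignment[abs(lit)] = lit > 0
        clauses = simplify(clauses, lit)

def choose_literal_approach(clauses):
    """Highest count, then rarest complement, then lowest index."""
    counts = Counter(l for c in clauses for l in c)
    return min(counts, key=lambda l: (-counts[l], counts[-l], abs(l), l < 0))

def choose_difficult_clauses(clauses):
    """Most frequent literal among the smallest clauses, then lowest index."""
    smallest = min(len(c) for c in clauses)
    counts = Counter(l for c in clauses if len(c) == smallest for l in c)
    return min(counts, key=lambda l: (-counts[l], abs(l), l < 0))

def greedy_sat(clauses, choose):
    """Shared loop of Figure 1; variables never assigned are non-important."""
    clauses = [frozenset(c) for c in clauses]
    assignment = {}
    clauses = assign_unit_clauses(clauses, assignment)
    while clauses:
        if frozenset() in clauses:
            return None  # assumption: give up when a clause becomes empty
        lit = choose(clauses)
        assignment[abs(lit)] = lit > 0
        clauses = assign_unit_clauses(simplify(clauses, lit), assignment)
    return assignment

# Hypothetical toy instance over x1..x5 (not one of the instances in Figure 2).
toy = [[-1, 2], [1, 3, 4], [1, -3, 5], [1, 4, -5]]
literal = greedy_sat(toy, choose_literal_approach)
difficult = greedy_sat(toy, choose_difficult_clauses)
print(literal)    # variables assigned by the Literal Approach
print(difficult)  # variables assigned by Difficult Clauses
```

On this toy instance the Literal Approach assigns only x1 and x2, leaving three variables free, while Difficult Clauses assigns x1, x3, and x4, leaving two free. This is the kind of difference the solution property discussed next is meant to capture.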

Consider the two instances shown in Figure 2. Each of these instances is solved using the Literal Approach and Difficult Clauses algorithms, resulting in the SAT solutions shown in Figure 3. One way to compare the solutions produced by the two algorithms is to compare the number of non-important variables. We denote the non-important variables, that is, variables which can be assigned either true or false without affecting the satisfiability of the instance, with a '-'. The number of non-important variables is a solution property that can be used to distinguish algorithms. However, for these instances, this solution property alone does not distinguish between the two algorithms. On Instance 1, the Difficult Clauses algorithm has more non-important variables than the Literal Approach. However, on Instance 2, the Literal Approach has a higher number of non-important variables.

This situation is by no means unique for these two algorithms and these two instances. Algorithms often perform differently on instances of different structure. The Difficult Clauses approach considers the most constrained portion of the problem, short clauses, and tries to address this part of the problem which it considers to be the most difficult. As a result, this approach performs better on instances that have the most constraining components in the “short” clauses.


By identifying the difference in the structures of the two instances, a proper classification of the algorithms can be made. Instance properties identify the structures of instances. In this example, the ratio of “short” clauses in the instance can aid in the classification. We define this property in the following way.

\[ \frac{1}{N_c} \sum_{i=1}^{N_c} \frac{1}{2^{l_i}} \tag{1} \]

where N_c is the number of clauses in the instance, and l_i is the number of literals in clause i. Instances with a larger number of “short” clauses will have a higher value for this property, implying that these instances have their most constraining components in “short” clauses. Specifically, for Instance 1, this property value is 0.2205, and for Instance 2, it is 0.1281.
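As a quick check of the short-clause ratio, the sketch below evaluates it as the average of 2^(-l_i) over all clauses, using the clause sizes read off Figure 2 for Instance 2. The function name and the list of sizes are illustrative assumptions; only the 0.1281 value comes from the text.

```python
def short_clause_ratio(clause_sizes):
    """Average of 2**(-l_i) over all clauses, where l_i is the clause length."""
    return sum(2.0 ** -size for size in clause_sizes) / len(clause_sizes)

# Number of literals in each of the 15 clauses of Instance 2, in printed order.
instance2_sizes = [2, 2, 2, 2, 3, 3, 4, 4, 3, 3, 4, 5, 4, 3, 6]

print(round(short_clause_ratio(instance2_sizes), 4))  # 0.1281
```

Shorter clauses dominate the sum because each additional literal halves a clause's contribution, so the value rises as the share of short clauses grows.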

By considering both the solution property (non-important variables) and the instance property (ratio of “short” clauses), it is possible to classify these two algorithms on this simple example. When the number of non-important variables is high on instances with a low ratio of “short” clauses, the algorithm most likely used is the Literal Approach. In the case of a high number of non-important variables and a high ratio of “short” clauses, the classification would be the Difficult Clauses approach. This simple example demonstrates the importance of instance properties. Note that for many complex algorithms and instances, a sufficient number of uncorrelated instance and solution properties is needed to perform this type of classification. Our prime goal is to identify these properties and find the most effective ways to combine them.
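The two-property rule described above can be written down as a tiny decision function. This is only a sketch of the reasoning in this example. The thresholds are hypothetical values chosen to separate the two instances of Figure 3, and dividing the count of non-important variables by the number of variables is an added assumption so that instances of different size are comparable.

```python
def classify(free_fraction, short_clause_ratio,
             free_threshold=0.3, ratio_threshold=0.17):
    """Toy rule: combine one solution property with one instance property.

    free_fraction      -- non-important variables / all variables (assumed scaling)
    short_clause_ratio -- instance property from Eq. (1)
    Both thresholds are hypothetical and fitted to this example only.
    """
    if free_fraction < free_threshold:
        return "undetermined"  # the text gives no rule for this case
    if short_clause_ratio >= ratio_threshold:
        return "Difficult Clauses"
    return "Literal Approach"

# Values taken from Figure 3 and the property values quoted in the text.
print(classify(2 / 5, 0.2205))   # Instance 1, solution with 2 free variables
print(classify(5 / 10, 0.1281))  # Instance 2, solution with 5 free variables
```

A real classifier would learn such splits from many instances and properties rather than from two hand-picked cut points.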

    Instance 1    x1  x2  x3  x4  x5  DC
      Literal:     0   1   0   1   1   0
      Difficult:   1   -   1   -   1   2

    Instance 2    x1  x2  x3  x4  x5  x6  x7  x8  x9  x10  DC
      Literal:     0   1   0   -   1   0   -   -   -   -    5
      Difficult:   1   1   0   0   0   1   1   1   1   0    0

Figure 3: Motivational example: Solutions to the SAT instances using the Literal Approach and Difficult Clauses algorithms. The DC column gives the number of non-important ('-') variables in each solution.

1.3 Key Technical Features

In this section, we briefly outline the key technical novelties of the RGFE technique.

• Relative Generic Forensic Engineering. We introduce a generic flow for the RGFE technique that allows it to be applied to a variety of optimization problems with minimal retargeting. The new approach compares the properties of an instance in relative terms. Relative comparison is done by using benchmark testing instances to calibrate each of the algorithms/tools to each other.

• Generic Property Formulation. A systematic way to develop instance and solution properties for different problems allows the generic RGFE technique to be applied to a variety of optimization problems. The generic property formulation is applied to a problem which has been formulated in terms of an objective function and constraints. Special emphasis is placed on the widely used 0-1 ILP formulation.

• Instance Properties. Problem instances have varying complexity which is often dependent upon particular structural aspects of the instance. Additionally, different algorithms perform differently depending on the complexity or structure of the problem instance they are presented with. Therefore, it is necessary to provide ways to measure and identify the structure of an instance with respect to other instances. We introduce instance properties which provide a measure for comparing instances and therefore facilitate more accurate analysis and classification of the algorithms.

• Calibration. Calibration is performed on both instance and solution properties in order to place the data into the proper perspective. For instance properties, calibration provides a way to scale and classify (order) the instances, while the solution properties for each algorithm are calibrated per instance so that the algorithms can be differentiated.

• Forensic Scenarios. The amount of available information is a key aspect of forensic engineering. A technique can only be as accurate as the information available for analysis allows. We present five forensic scenarios which classify the amount of information that is available for analysis.

• One-out-of-any Algorithm Classification. The number of algorithms available for solving a particular problem can be unlimited, and many may not be known when applying the RGFE technique. As a result, the technique must be able to classify an output not only as the product of one of the previously analyzed algorithms but also as the product of an unknown algorithm.

• CART model and Simulated Annealing. We have developed a new CART model for classification and clustering. The key novelty is that the new CART model not only partitions the solution space so that classification can be conducted but also maximizes the volume of the space that indicates solutions created by none of the observed algorithms. The CART model is created using a simulated annealing algorithm.

1.4 Thesis Organization

In the next chapter, we present work related to the forensic engineering technique. In order to make the work self-contained, Chapter 3 presents preliminary background on the generic problem formulation used for the RGFE technique, the boolean satisfiability problem, and the graph coloring problem. The details of the construction and use of the Relative Generic Forensic Engineering technique are presented in Chapter 4. Chapter 5 presents the new enabling features of the RGFE technique in detail. Chapter 6 presents the generic formulation for properties and introduces instance and solution properties in their generic forms. Before the experimental conclusions are presented in Chapter 9, the details of the modeling and validation phases of the RGFE technique are discussed.

2 Related Work

The related work can be traced along a number of directions. We summarize research in the areas which are most directly related: intellectual property protection, forensic analysis, statistical methods, and algorithms for the two selected NP-complete problems to which we apply the RGFE technique: boolean satisfiability and graph coloring.

Due to the rapidly increasing reuse of intellectual property (IP) such as IC cores and software libraries, intellectual property protection (IPP) has become a mandatory step in the modern design process. Recently, a variety of IPP techniques, such as watermarking, fingerprinting [5, 14, 71], metering [50], obfuscation [13], and reverse engineering [11], have attracted a great deal of attention in this area.

The most widely studied technique, watermarking, can be applied to two different types of artifacts: static and functional. Static artifacts [3, 82] are ones which have only syntactic components that are not altered during their use and include images [81, 83], video [84], audio [48], and textual objects [8, 6]. The common denominator for all of these techniques is that they use postprocessing methods to embed a particular message into the artifact. Techniques have also been proposed for the watermarking of computer generated graphical objects, both modeled [69] and animated [26, 44]. Watermarking techniques have also been proposed for functional artifacts at many different levels of abstraction. Some of the targeted levels include system and behavioral synthesis, physical design, and logic synthesis [38, 46, 51]. These techniques mainly leverage the fact that there are often numerous solutions of similar quality for a given optimization problem, so that the solution generated by the technique can be given certain characteristics that correspond to the designer’s signature. More complex watermarking protocols, such as multiple watermarks [51], fragile watermarks [27], publicly detectable watermarks [70], and software watermarking [55, 66], have also been proposed.

Forensic analysis has been widely used in areas from anthropology to visual art. Most commonly, forensic analysis is used for DNA identification [72]. The closest work related to forensic engineering in the computer science and engineering community is copy detection [12]. While the focus of forensic engineering is to identify the tool used to produce a particular result, copy detection focuses on determining if an illegal copy of the tool exists. Copy detection schemes, which are mainly used for textual data, fall into two categories: signature-based and registration-based. Signature-based schemes [5, 8, 7] add extra information (a signature) into the object in order to identify its origin. However, these signatures are often easily removed and do not assist in detecting partial copies. Registration-based schemes [12, 57, 67, 88] focus on detecting the duplication of registered objects. These approaches have demonstrated strong performance but also have drawbacks. Large modifications can render these techniques useless, and the registration database is often large and cannot detect partial copies.

Other techniques which have a similar flavor are the detection of authentic Java byte codes [1] and code obfuscation. Code obfuscation techniques [13] attempt to transform programs into illegible equivalent programs which are difficult to reverse engineer. This technique is most often applied to Java byte codes [56] and mobile code [37, 74].

Principal component analysis is a standard statistical procedure that reduces a number of possibly correlated variables to a smaller set of uncorrelated variables [42]. We use non-parametric statistical techniques for classification because they can be applied to data with arbitrary distributions and without any assumptions on the densities of the data [80]. The Classification and Regression Trees (CART) model is a tree-building non-parametric technique widely used for the generation of decision rules for classification. An excellent reference for the CART model is [9].

Probabilistic optimization techniques were first introduced by Metropolis et al. [61] for numerical calculation of integrals. The simulated annealing optimization technique originates from statistical mechanics and is often used to generate approximate solutions to very large combinatorial problems [45].

Bootstrapping is a classification validation technique that assesses the statistical accuracy of a model. In the case of non-parametric techniques, bootstrapping is used to provide standard errors and confidence intervals. The standard references for bootstrapping include [25, 18, 35].

Since the mid-1940s, integer and linear programming have been widely studied topics. Linear programming (LP) and the simplex algorithm were introduced by Dantzig [23]. While the simplex technique often works quite well in practice, its worst case runtime is exponential. Consequently, both pseudo-polynomial and polynomial algorithms have been developed for LP. Although integer linear programming (ILP) is an NP-complete problem, ILP solvers work well on many specific instances of moderate size. There are a number of excellent references on linear programming [64, 17, 23, 24, 28] and ILP [39, 73, 79, 85]. In the CAD domain, ILP is applied to a number of optimization problems, such as behavioral synthesis [30] and system synthesis [68].

Boolean satisfiability (SAT) was the first problem that was determined to be NP-complete [29]. The problem has a variety of applications in many areas such as artificial intelligence, VLSI design and CAD, operations research, and combinatorial optimization [60]. Probably the most well-known applications of SAT in CAD are Automatic Test Pattern Generation (ATPG) [52, 58] and Deterministic Test Pattern Generation [34]. Other applications include logic verification, timing analysis, delay fault testing [16], FPGA routing [65], covering problems [20], and combinational equivalence checking [32].

Many different techniques have been developed for solving the boolean satisfiability problem. Techniques such as backtrack search [89], local search [78], algebraic manipulation [31], continuous formulation, and recursive learning [59] are among the most popular. Additionally, several public domain software packages are available, such as GRASP [59], GSAT [76], Sato [89], and others [2, 22, 77].

Graph coloring is a popular NP-complete optimization problem with many applications in CAD as well as many other application domains. The applications include operations scheduling [21], register assignment, multi-layer planar routing [19], and wireless spectrum estimation [43].

A number of different algorithms have been developed for solving the graph coloring problem. For example, there are several iterative improvement approaches [4, 62], such as tabu search [36] and simulated annealing [15, 63, 41]. A number of other techniques have also been developed, including heuristic [10, 33, 54], constructive [86], and exact implicit enumeration algorithms [10, 21].

Additionally, a number of hardware platforms have been developed for efficient graph coloring, including reconfigurable FPGAs [53] and ones based on coupled oscillators [87]. Excellent surveys on graph coloring are [40, 41].

3 Preliminaries

In order to make the presentation self-contained, in this chapter we introduce background information in a number of areas. We begin by identifying the differences between the new RGFE technique and the Computational Forensic Engineering (CFE) technique [47]. To that end, we first provide a brief summary of CFE. Then, we present a brief summary of the generic problem format, which consists of an objective function and constraints. Finally, we present a brief review of each of the demonstration problems: boolean satisfiability and graph coloring.


3.1 Computational Forensic Engineering Technique

The Computational Forensic Engineering technique identifies an algorithm/tool, which has been used to generate a particular previously unclassified output, from a known set of algorithms/tools. This technique is composed of four phases: feature and statistics collection, feature extraction, algorithm clustering, and validation.

In the feature and statistics collection phase, solution properties of the problem are identified, quantified, analyzed for relevance, and selected. Furthermore, the problem instances are preprocessed by perturbing them, which removes any dependencies the algorithms have on the instance format. In the next phase, each of the perturbed instances is processed by each of the algorithms, and the solution properties are extracted from every resulting output. The algorithm clustering phase then clusters the solution properties in n-dimensional space, where n is the number of properties. The n-dimensional space is then partitioned into subspaces for each algorithm. The final step validates the accuracy of the partitioned space.
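The perturbation step can be illustrated for SAT instances as follows. The text does not specify how instances are perturbed, so this sketch assumes the simplest format-level changes: renaming variables and shuffling the order of clauses and of literals within each clause. The function name, the DIMACS-style integer encoding, and the sample instance are all hypothetical.

```python
import random

def perturb_instance(clauses, num_vars, seed=0):
    """Return an equivalent instance with renamed variables and shuffled order.

    Literals are nonzero integers: +v for variable v, -v for its complement.
    Also returns the old-to-new variable map so solutions can be mapped back.
    """
    rng = random.Random(seed)
    new_names = list(range(1, num_vars + 1))
    rng.shuffle(new_names)
    rename = dict(zip(range(1, num_vars + 1), new_names))

    perturbed = []
    for clause in clauses:
        lits = [rename[abs(l)] if l > 0 else -rename[abs(l)] for l in clause]
        rng.shuffle(lits)      # literal order inside a clause
        perturbed.append(lits)
    rng.shuffle(perturbed)     # clause order inside the instance
    return perturbed, rename

original = [[1, 2], [-1, 3], [-2, -3], [1, -3, 4]]
perturbed, rename = perturb_instance(original, num_vars=4, seed=1)
print(perturbed)
print(rename)
```

Because the logical structure is unchanged, any difference in an algorithm's output between the original and perturbed instances reflects a dependence on the input format rather than on the problem itself.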

This approach performed well on both the graph coloring and boolean satisfiability problems. However, that was the case only under a number of limiting assumptions. The computational forensic engineering technique performed algorithm classification on one-out-of-k known algorithms, and was tested on a variety of different instances. However, many of these instances had similar instance structures. This forensic engineering technique is problem specific and is not easily generalizable to other problems. Lastly, the CFE technique analyzed the algorithms as black boxes. In this forensic scenario, which we discuss in more detail in Chapter 5.1, the technique has access to each of the considered algorithms and unlimited analysis can be performed. However, in many cases, this type of information or access may not be available.

The RGFE technique eliminates several major limitations of the CFE technique. The technique performs one-out-of-any classification instead of one-out-of-k. In this case, the output of an algorithm that was never previously analyzed can be classified as unknown. The key enabler for the effectiveness of the Relative Generic Forensic Engineering technique is the calibration of problem instances. By identifying, analyzing, and classifying instances by their properties, the RGFE classification is expanded to another dimension, enabling more statistically sound classification. Additionally, we present a generic problem formulation and a generic property formulation that enable the application of this technique to numerous optimization problems. Lastly, we identify five scenarios which define the complexity of analysis for the RGFE technique based on the amount of information available.

3.2 Generic Problem Formulation

Numerous important optimization problems can be readily specified in the form defined by an objective function and constraints. Typical problems include set cover, the travelling salesman problem, vertex cover, template matching, k-way partitioning, graph coloring, maximum independent set, maximum satisfiability, satisfiability and scheduling [39, 73, 79, 85].

The form defined by an objective function and constraints is the canonical form which enables the generic formulation of instance and solution properties. Both the original instance as well as the corresponding solution can be analyzed starting from the following form:

\[ \max\; Y = c\,x \tag{2} \]

\[ A\,x \le B \tag{3} \]

We have selected this form because many popular optimization problems can be formulated in this way, the format is mathematically tractable, and it is simple to analyze in terms of the needs of the forensic engineering technique. We restrict the objective function, Eq. 2, and the constraints, in the form presented in Eq. 3, to be linear. Most of the optimization problems are formulated with x = {x₁, . . . , x_n} as 0-1 variables. This problem formulation is best known from its use in integer linear programming [39, 73, 79, 85].
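To make the canonical form concrete, the sketch below evaluates the objective of Eq. 2 and the constraints of Eq. 3 for 0-1 vectors and finds the optimum by exhaustive enumeration. The example instance is an assumption chosen for illustration: a maximum independent set problem on a three-vertex path, with one constraint per edge. Exhaustive search is used only because the instance is tiny; it is not how ILPs are solved in practice.

```python
from itertools import product

def objective(c, x):
    """Value of Y = c x for a 0-1 vector x."""
    return sum(ci * xi for ci, xi in zip(c, x))

def is_feasible(A, b, x):
    """True if every row of A x <= b holds."""
    return all(sum(a * xi for a, xi in zip(row, x)) <= bi
               for row, bi in zip(A, b))

def brute_force_max(c, A, b):
    """Best feasible 0-1 vector by enumeration (exponential; tiny cases only)."""
    feasible = [x for x in product((0, 1), repeat=len(c))
                if is_feasible(A, b, x)]
    return max(feasible, key=lambda x: objective(c, x)) if feasible else None

# Hypothetical instance: path v1 - v2 - v3, pick as many non-adjacent vertices
# as possible. Each edge (u, v) gives the constraint x_u + x_v <= 1.
c = [1, 1, 1]
A = [[1, 1, 0],
     [0, 1, 1]]
b = [1, 1]

best = brute_force_max(c, A, b)
print(best, objective(c, best))  # (1, 0, 1) 2
```

The same three pieces of data (c, A, and the bound vector) are what the generic property formulation operates on, regardless of which optimization problem produced them.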

3.3 Boolean Satisfiability

The boolean satisfiability problem consists of a set of variables and clauses which are built over the complemented and uncomplemented forms of the variables. The goal is to find a truth assignment for each variable in such a way that at least one literal in every clause evaluates to true. The problem can be formally stated in the following way.

Problem: Boolean Satisfiability

Instance: A set U of variables and a collection C of clauses over U.

Question: Is there a satisfying truth assignment for C?

The boolean satisfiability problem can be specified using the canonical form in many different ways; we present two. The first approach is simple and straightforward, allowing direct mapping of the clauses to constraints, while the second approach is somewhat more complex.

The first approach maps each variable of the SAT problem directly. We therefore define the variable x_i in the following way.

\[ x_i = \begin{cases} 1, & \text{if } X_i \text{ is assigned to true,} \\ 0, & \text{if } X_i \text{ is assigned to false.} \end{cases} \]

We define the objective function as presented in (2), where c is a negative identity vector. Note that with this choice we minimize the number of variables that evaluate to true. Another alternative is to have an empty objective function term. Or, if the problem to be solved is MAXSAT, where we want to satisfy the problem with as many true variables as possible, we would assign c to be a positive identity vector. To generate proper constraints, we translate each clause of the satisfiability problem into a single constraint. If a variable appears uncomplemented in the clause, it appears with a positive sign in the constraint. However, if the variable is complemented in the clause, we represent it with a negative sign in the constraint. The bound term of each constraint is calculated using the following formula.

b= 1 - (number of variables which are complemented)

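For example, if we have the clauses shown on the left, we would create the constraints shown on the right.

$$\begin{aligned}
(X_2, X_3, X_4) &\longrightarrow X_2 + X_3 + X_4 \ge 1 \\
(X_1, X_3, \overline{X}_4) &\longrightarrow X_1 + X_3 - X_4 \ge 0 \\
(\overline{X}_1, \overline{X}_2, \overline{X}_4) &\longrightarrow -X_1 - X_2 - X_4 \ge -2
\end{aligned}$$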
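The clause-to-constraint translation of the first formulation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: clauses are encoded as lists of signed integers (a DIMACS-like convention that the text does not prescribe), and the function names `clause_to_constraint` and `satisfies` are hypothetical.

```python
from typing import List, Tuple

# Assumption of this sketch: a clause is a list of signed integers,
# +i meaning X_i appears uncomplemented and -i meaning X_i is complemented.
Clause = List[int]


def clause_to_constraint(clause: Clause, n_vars: int) -> Tuple[List[int], int]:
    """Return (coeffs, bound) for the linear constraint coeffs . x >= bound."""
    coeffs = [0] * n_vars
    complemented = 0
    for lit in clause:
        if lit > 0:
            coeffs[lit - 1] = 1       # uncomplemented: positive term
        else:
            coeffs[-lit - 1] = -1     # complemented: negative term
            complemented += 1
    # b = 1 - (number of complemented variables in the clause)
    return coeffs, 1 - complemented


def satisfies(coeffs: List[int], bound: int, assignment: List[int]) -> bool:
    """Check one constraint against a 0-1 assignment."""
    return sum(c * x for c, x in zip(coeffs, assignment)) >= bound


# The three example clauses over X1..X4.
clauses = [[2, 3, 4], [1, 3, -4], [-1, -2, -4]]
constraints = [clause_to_constraint(c, 4) for c in clauses]
```

A 0-1 assignment satisfies the SAT instance exactly when it satisfies every generated constraint, which is what makes the mapping usable as a generic form.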

For the second approach, we begin by mapping the variables. For each variable in the SAT problem we create two new variables, one to represent the complemented form and one for the uncomplemented form. Therefore, when we have $n$ variables, we create $2n$ variables, where $x_1, \dots, x_n$ represent the uncomplemented versions and $x_{n+1}, \dots, x_{2n}$ represent the complemented versions. Formally, we define $x_i$ as:

$$x_i = \begin{cases} 1, & \text{if variable } x_i \text{ is selected to evaluate to true} \\ 0, & \text{otherwise.} \end{cases}$$

The remainder of the problem is specified using two sets, $S$ and $C$. We define each element in the finite set $S$ as a single clause in the SAT problem. Therefore, $S_j$ contains all $x_i$ which appear in clause $j$. Set $C$ contains $2n$ elements, one for each $x_i$. Each subset $C_i$ contains as elements all $S_j$ in which the variable $x_i$ appears.

Now, the objective function is defined as in (2), where c is a negative identity vector. For the constraints, the formula presented in (3) is used, and we define $b$ as a positive identity vector and $A_{ji}$ as follows.

$$A_{ji} = \begin{cases} 1, & \text{if element } S_j \text{ is in subset } C_i \\ 0, & \text{otherwise.} \end{cases}$$

Finally, we include constraints that enforce that no variable can be simultaneously assigned to both the complemented and uncomplemented value.

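$$\forall i \in \{1, \dots, n\}: \; x_i + x_{i+n} \le 1 \qquad (4)$$

Given the following example defined on the set of variables $\{X_1, \dots, X_4\}$, we define $x_i$, $i \in \{1, \dots, 8\}$, where $x_1, \dots, x_4$ represent the uncomplemented forms and $x_5, \dots, x_8$ represent the complemented forms of $X_1, \dots, X_4$. The matrix $A_{ji}$ for these constraints is also shown.

$$\begin{aligned}
(X_2, X_3, X_4) &\longrightarrow S_1 = \{x_2, x_3, x_4\} \\
(X_1, X_3, \overline{X}_4) &\longrightarrow S_2 = \{x_1, x_3, x_8\} \\
(\overline{X}_1, \overline{X}_2, \overline{X}_4) &\longrightarrow S_3 = \{x_5, x_6, x_8\}
\end{aligned}$$

$$C_1 = \{S_2\},\; C_2 = \{S_1\},\; C_3 = \{S_1, S_2\},\; C_4 = \{S_1\},\; C_5 = \{S_3\},\; C_6 = \{S_3\},\; C_7 = \emptyset,\; C_8 = \{S_2, S_3\}$$

$$A = \begin{pmatrix}
0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 1
\end{pmatrix}$$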
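The second formulation can be sketched in the same spirit. The code below builds the matrix $A$ row by row and checks a candidate 0-1 vector. It assumes that the constraints of form (3) read $Ax \ge b$ with $b$ a vector of ones, and it reuses a signed-integer clause encoding; both the encoding and the function names are assumptions made for illustration.

```python
from typing import List


def build_cover_matrix(clauses: List[List[int]], n_vars: int) -> List[List[int]]:
    """Row j marks the variables x_i that belong to S_j (clause j).

    Columns 0..n-1 are the uncomplemented forms, columns n..2n-1 the
    complemented forms, matching x_1..x_n and x_{n+1}..x_{2n}.
    """
    rows = []
    for clause in clauses:
        row = [0] * (2 * n_vars)
        for lit in clause:
            idx = lit - 1 if lit > 0 else n_vars + (-lit) - 1
            row[idx] = 1
        rows.append(row)
    return rows


def is_feasible(A: List[List[int]], x: List[int], n_vars: int) -> bool:
    """Check A x >= 1 (assumed form of (3)) and constraint (4)."""
    covered = all(sum(a * xi for a, xi in zip(row, x)) >= 1 for row in A)
    consistent = all(x[i] + x[i + n_vars] <= 1 for i in range(n_vars))
    return covered and consistent


# Same three clauses as in the example; -4 stands for the complement of X4.
A = build_cover_matrix([[2, 3, 4], [1, 3, -4], [-1, -2, -4]], 4)
```

Selecting $x_3$ and $x_8$ (that is, $X_3$ true and $X_4$ false) covers all three clauses without violating constraint (4).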

In the remainder of this manuscript, we will only use the ﬁrst formulation of the SAT problem.

3.4

Graph Coloring

The graph coloring problem asks for an assignment of a color to each node in the graph such that no two nodes with an edge between them are colored with the same color and to use the minimum number of colors. Formally, the problem is deﬁned in the following way [29].

Problem: Graph K-Colorability

Instance: Graph G = (V,E), positive integer K ≤ |V|.

Question: Is G K-colorable, i.e., does there exist a function $f: V \to \{1, 2, 3, \dots, K\}$ such that $f(u) \ne f(v)$ whenever $\{u, v\} \in E$?

Figure 4: Graph coloring: (a) graph instance with 5 vertices and 6 edges. (b) a graph coloring solution for the graph instance.

An example of a graph coloring instance and a colored solution is presented in Figure 4. A graph instance can be transformed into the generic formulation in the following way. First, we define variables $x_i$ and $x_{ij}$.

$$x_i = \begin{cases} 1, & \text{if color } i \text{ is used} \\ 0, & \text{otherwise.} \end{cases}
\qquad
x_{ij} = \begin{cases} 1, & \text{if node } j \text{ was colored with color } i \\ 0, & \text{otherwise.} \end{cases}$$

The goal is to minimize the number of colors used to color the graph. Therefore, the objective function can be written in the form of Eq. (2), where c is a negative identity vector, and x is the vector of color variables $x_i$, whose sum is the number of colors used to color the graph instance.

There are three types of constraints for the graph coloring problem. The first constraint is on the number of colors used. If any node $j$ is colored with color $i$, then the color has been used, and the value of $x_i$ should be 1. Therefore we have the following constraint.

$$\forall i,\; \forall j \in V: \; x_i \ge x_{ij} \qquad (5)$$

The second constraint is induced by the connectivity of the graph. No two nodes with an edge between them can be colored using the same color. We define a matrix $E$ which contains the elements $e_{mn}$.

$$e_{mn} = \begin{cases} 1, & \text{when nodes } m \text{ and } n \text{ are connected} \\ 0, & \text{otherwise.} \end{cases}$$

Now, the second constraint can be stated as

$$\forall i,\; \forall m, n \in V \text{ with } e_{mn} = 1: \; x_{im} + x_{in} \le 1 \qquad (6)$$

The last type of constraint states that all nodes in the graph must be colored.

$$\forall j \in V: \; \sum_{i} x_{ij} \ge 1 \qquad (7)$$

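The graph presented in Figure 4(a) can be specified in the following generic form. We assume that only three colors are needed to color the instance. Therefore $i \in \{1, 2, 3\}$. The instance contains five vertices that we denote using the following notation: $V = \{A, B, C, D, E\}$.

OF: MAX(Y) = (-c)x

Constraints:

Eq. (5):

$$\begin{aligned}
& x_1 \ge x_{1A},\; x_1 \ge x_{1B},\; \dots,\; x_1 \ge x_{1E} \\
& x_2 \ge x_{2A},\; x_2 \ge x_{2B},\; \dots,\; x_2 \ge x_{2E} \\
& x_3 \ge x_{3A},\; x_3 \ge x_{3B},\; \dots,\; x_3 \ge x_{3E}
\end{aligned}$$

Eq. (6):

$$\begin{aligned}
& x_{1A} + x_{1D} \le 1,\; x_{2A} + x_{2D} \le 1,\; x_{3A} + x_{3D} \le 1 \\
& x_{1A} + x_{1E} \le 1,\; x_{2A} + x_{2E} \le 1,\; x_{3A} + x_{3E} \le 1 \\
& x_{1B} + x_{1C} \le 1,\; x_{2B} + x_{2C} \le 1,\; x_{3B} + x_{3C} \le 1 \\
& x_{1B} + x_{1E} \le 1,\; x_{2B} + x_{2E} \le 1,\; x_{3B} + x_{3E} \le 1 \\
& x_{1C} + x_{1D} \le 1,\; x_{2C} + x_{2D} \le 1,\; x_{3C} + x_{3D} \le 1
\end{aligned}$$

Eq. (7):

$$\begin{aligned}
& x_{1A} + x_{2A} + x_{3A} \ge 1 \\
& x_{1B} + x_{2B} + x_{3B} \ge 1 \\
& x_{1C} + x_{2C} + x_{3C} \ge 1 \\
& x_{1D} + x_{2D} + x_{3D} \ge 1 \\
& x_{1E} + x_{2E} + x_{3E} \ge 1
\end{aligned}$$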
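The three constraint families can also be generated mechanically from a node list and an edge list. The sketch below is a minimal illustration, not the implementation used in this work: variable names such as `x1A` are plain strings, the edge list contains only the edges that appear in the example constraints above, and all function names are hypothetical.

```python
from itertools import product


def gc_constraints(nodes, edges, n_colors):
    """Return the three constraint families as lists of variable names."""
    colors = range(1, n_colors + 1)
    # Eq. (5): x_i >= x_ij for every color i and node j.
    used = [(f"x{i}", f"x{i}{j}") for i, j in product(colors, nodes)]
    # Eq. (6): x_im + x_in <= 1 for every edge (m, n) and color i.
    conflict = [(f"x{i}{m}", f"x{i}{n}") for (m, n) in edges for i in colors]
    # Eq. (7): sum_i x_ij >= 1 for every node j.
    cover = [[f"x{i}{j}" for i in colors] for j in nodes]
    return used, conflict, cover


def assignment_from_coloring(coloring, n_colors):
    """Build the 0-1 variables x_i and x_ij from a node -> color map."""
    x = {}
    for i in range(1, n_colors + 1):
        for j, col in coloring.items():
            x[f"x{i}{j}"] = 1 if col == i else 0
        x[f"x{i}"] = 1 if i in coloring.values() else 0
    return x


def feasible(x, used, conflict, cover):
    """Check a 0-1 assignment against all three constraint families."""
    return (all(x[ci] >= x[cij] for ci, cij in used)
            and all(x[a] + x[b] <= 1 for a, b in conflict)
            and all(sum(x[v] for v in row) >= 1 for row in cover))


nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "D"), ("A", "E"), ("B", "C"), ("B", "E"), ("C", "D")]
used, conflict, cover = gc_constraints(nodes, edges, 3)
good = assignment_from_coloring({"A": 1, "B": 1, "C": 3, "D": 2, "E": 2}, 3)
bad = assignment_from_coloring({"A": 1, "B": 1, "C": 3, "D": 1, "E": 2}, 3)
```

Here `good` is a proper coloring of the listed edges, while `bad` gives the adjacent nodes A and D the same color and therefore violates one of the Eq. (6) constraints.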

4

Relative Generic Forensic Engineering

In this chapter, we introduce the RGFE technique. The technique operates on the input in the generic formulation, ILP. Our implementation is restricted to instances that are formulated as 0-1 ILP. The technique consists of two stages: analysis and evaluation. In the analysis stage, the goal is to classify, with high confidence, the behavior of algorithms for a specific problem specified using the 0-1 ILP format. The classification is achieved using a CART model that is built using instance and solution properties from a set of known algorithms. The CART model is then used to evaluate the behavior of unclassified algorithm output. The flow of the analysis stage is presented graphically in Figure 5 and using pseudo code format in Figure 6. The steps in the evaluation stage are shown in Figure 7.

4.1

Analysis Stage

The analysis stage of the RGFE technique, shown in Figure 5, is composed of three phases: property collection, model building, and validation phases. The property collection phase defines, extracts, and calibrates both instance and solution properties for the given optimization problem. In the modeling phase, the relevant calibrated properties are used to develop a CART classification scheme. The last phase tests and validates the quality of the CART model to quantify and ensure high confidence in the developed classification.

Property Collection Phase

In this phase, the main steps are property selection and property calibration. We begin by selecting generic instance properties that will assist in classifying the characteristics of the given problem. The solution properties are selected to characterize the decisions or the optimization mechanisms used by an algorithm on a particular type of instance. We discuss a general approach for identifying and deﬁning properties in more detail in Chapter 6.1.

Once a set of properties has been selected for the targeted optimization problem, we proceed to extract each of the properties from our representative sets of instances and algorithms. The instance properties are extracted directly from each of the instances and the solution properties are extracted from the solution/output of each algorithm run on each of the instances. In this step, it is important to develop fast techniques and software for extracting the properties. While some properties can be deterministic, others may have statistical components, and therefore require large computational overhead for each instance or solution. Note that all property extraction methods are implemented to analyze the instances and solutions in the generic form. Therefore, all implemented property extraction methods can be reused for any new optimization problem. We discuss this process in greater detail in Chapter 6.1.

Figure 5: Overall flow of the RGFE technique: Analysis Stage. [Flowchart. Inputs: Representative Set of Instances, Representative Set of Algorithms, Additional Instances. Property Collection Phase: Property Selection, Property Extraction, Calibrate Properties, Identify Relevant Properties, Principal Component Analysis. Modeling Phase: Plot Properties in n-D Space, Build CART Model. Validation Phase: Bootstrapping, Learn and Test, Confidence Interval. Decision nodes: Quality? and # Tries?, each with YES/NO branches.]

For each property, it is necessary to calibrate the raw property values in order to interpret their meanings properly. We calibrate each instance and solution property for all algorithms using two conceptually different methods: scaling and ranking. We discuss the details of the calibration process in Chapter 5.3. In the final steps, relevant properties are identified and principal component analysis is performed. Relevant properties are properties which aid in distinguishing the algorithms from each other. If a property yields the same value for all (or a great percentage of) instances or all algorithms, then it is not useful in the classification process, and is excluded from further consideration. Lastly, principal component analysis is performed in order to eliminate the set of properties which provide similar or identical information as other sets of properties. All of the steps in the property collection phase are encapsulated in Figure 6, lines 1, 2, and 9-14.
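The relevance test described above can be sketched as a simple filter. The code is an illustration only: the threshold `max_share`, the data layout (one list of calibrated values per property), and the function name are assumptions, and the subsequent principal component analysis step is not shown.

```python
from collections import Counter


def relevant_properties(table, max_share=0.9):
    """Keep properties whose most frequent value covers less than
    `max_share` of the observations (hypothetical threshold).

    `table` maps a property name to its list of calibrated values,
    one value per (instance, algorithm) observation.
    """
    keep = []
    for name, values in table.items():
        most_common_count = Counter(values).most_common(1)[0][1]
        if most_common_count / len(values) < max_share:
            keep.append(name)
    return keep


table = {
    "p1": [0.1, 0.5, 0.9, 0.3],   # varies: useful for classification
    "p2": [1, 1, 1, 1],           # identical everywhere: not useful
    "p3": [0, 0, 0, 1],           # one value covers 75% of observations
}
default_keep = relevant_properties(table)
strict_keep = relevant_properties(table, max_share=0.75)
```

With the default threshold only the constant property `p2` is dropped; tightening the threshold to 0.75 also removes `p3`.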

Modeling Phase

In the modeling phase, the calibrated properties are used to model the behavior of each of the algorithms. This is accomplished by representing each solution from each of the available algorithms as a point in n-dimensional space. Each dimension represents either a solution or an instance property. The number of dimensions, n, is the total number of properties. Once the space is populated with the extracted data, we apply a generalized CART approach. The generalized CART approach, presented in Chapter 7, partitions the space into subspaces not only for each algorithm but also reserves empty regions for unobserved algorithms. Simulated annealing is used to develop a low complexity generalized CART model.

Validation Phase

In order to verify the quality of the CART model, we refine and test the model built in the previous phase using two techniques: bootstrapping and learn-and-test. The validation phase of the technique is an iterative process. We begin by performing bootstrapping of the model. Bootstrapping is based on statistical sampling with replacement, presented in line 17 of the pseudo-code.
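To make sampling with replacement concrete, the sketch below estimates a classification error by repeatedly resampling the observations and testing on the points left out of each resample. It is a generic illustration, not the validation code of this work: a nearest-centroid classifier stands in for the CART model, the data are synthetic, and the names `bootstrap_error`, `fit_centroids`, and `predict_centroid` are hypothetical.

```python
import random


def fit_centroids(points, labels):
    """Stand-in model (assumption): one centroid per algorithm label."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for d, v in enumerate(p):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_centroid(model, p):
    """Return the label whose centroid is closest to point p."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], p)))


def bootstrap_error(points, labels, fit, predict, rounds=50, seed=0):
    """Average misclassification rate on the points left out of each resample."""
    rng = random.Random(seed)
    n = len(points)
    errors, tested = 0, 0
    for _ in range(rounds):
        sample = [rng.randrange(n) for _ in range(n)]   # sampling with replacement
        chosen = set(sample)
        held_out = [i for i in range(n) if i not in chosen]
        model = fit([points[i] for i in sample], [labels[i] for i in sample])
        for i in held_out:
            tested += 1
            errors += predict(model, points[i]) != labels[i]
    return errors / tested if tested else 0.0


# Synthetic, well-separated calibrated property values for two solvers.
points = [(0.10 + 0.01 * k, 0.1) for k in range(10)] + \
         [(0.90 - 0.01 * k, 0.9) for k in range(10)]
labels = ["GRASP"] * 10 + ["Walksat"] * 10
err = bootstrap_error(points, labels, fit_centroids, predict_centroid)
```

The returned rate plays the role of the quality measure that is compared against the user-specified threshold.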

We evaluate the overall quality of the CART model after bootstrapping and if the quality is not above a specified threshold, it is refined by introducing new instances into the representative set and beginning the process over at the property extraction step of the property collection phase. The quality is measured as the percentage of new instances incorrectly classified. Furthermore, as an alternative to inclusion of new instances, in the property calibration step, we attempt to conduct alternative calibration schemes for some of the properties, in order to refine the model. In the case that the quality of the model is not improved to at least the threshold level after m iterations of adding instances, where m is a user defined number of attempts, new properties are introduced and the analysis stage is restarted from the beginning.

Input: Representative Set of Instances, I_i; Representative Set of Algorithms, A_j.
Algorithm:
 1. P^S_k = Define Solution Properties;
 2. P^I_l = Define Instance Properties;
 3. while (L_T < threshold && < user specified number of tries) {
 4.     if (exceed user specified number of tries)
 5.         Add instances to I_i and restart;
 6.     while (E < threshold && < user specified number of tries) {
 7.         if (exceed user specified number of tries)
 8.             Add instances to I_i and restart;
 9.         Sol_ij = Run each instance, I_i, on every algorithm, A_j;
10.         V_{I,A,P} = Extract Properties P from I and Sol;
11.         Calibrate Properties (V_{I,A,P});
12.         R^S_p = Identify Relevant Solution Properties (P^S_k);
13.         R^I_q = Identify Relevant Instance Properties (P^I_l);
14.         Principal Component Analysis (V_{I,A,R});
15.         N = Build n-Dimensional Space (R, I, V_{I,A,R});
16.         M = Use Simulated Annealing to develop CART model (N);
17.         E = Evaluate CART using Bootstrapping (M);
        }
18.     L_T = Evaluate CART using Learn and Test (M);
    }
19. C = Build Confidence Interval (M);

Figure 6: Pseudo-code for the RGFE technique.

When the model passes the bootstrapping validation step, we perform learn-and-test validation. This statistical validation method introduces k new instances which have been run on a set of observed and unobserved algorithms. The model is tested to evaluate if it correctly classifies the output of each new test instance. The quality is measured by the number of correct classifications. If the quality is below a set threshold, the same iterative process used in the bootstrapping step is applied. Once the CART model passes the learn-and-test step, the confidence interval of the model is determined. The confidence of the model is the accuracy of classification, which is specified for each of the considered algorithms separately.

4.2

Evaluation Stage

The goal of RGFE is to correctly classify the output of an unknown algorithm. In this stage, we identify the process for evaluating unknown outputs. The evaluation process, shown in Figure 7, begins with the extraction of both instance properties and solution properties from the unknown instance and algorithm output. The properties that are extracted are the set of properties that we used to build the final CART model in the analysis stage. Next, the properties are calibrated according to the selected calibration scheme for each property. The calibrated properties of the unknown instance and solution are then evaluated by the CART model. The algorithm into which the CART model classifies the output is taken to be the algorithm that produced the solution, with the confidence level that was found for that algorithm at the end of the analysis stage.

Note that the analysis stage of the approach needs to be performed only once for a set of observed algorithms. Once the analysis stage is done, the evaluation stage can be applied repeatedly. Only when new algorithms are observed must the analysis process be repeated. In order to correctly classify the newly observed algorithm(s), the properties must be recalibrated to take into account the new algorithm(s). In some cases, it may be necessary to define new properties and to process additional instances on each of the observed algorithms in order to achieve a high confidence level.


5

Enabling Concepts

There are four major enabling concepts for the RGFE technique: forensic scenarios, instance properties, solution properties, and calibration. The ability of the RGFE technique to correctly link unknown output to the algorithm which produced it is highly dependent on the amount of information available to the technique. We introduce forensic scenarios that classify the amount and type of the available information. More importantly, we introduce the notion of instance and solution properties as required by the RGFE technique. Finally, the concept of calibration is introduced and developed calibration techniques are presented.

5.1

Forensic Scenarios

The ultimate goal of computational forensic engineering is to demonstrate that a speciﬁc tool was used to perform a task. Most commonly, forensic engineering is applied in situations where the owner of a tool wants to prove that the tool has been used by an unauthorized user.

Forensic engineering is the science and engineering of using information from both the original instance and the generated output of the instance to classify the tool that was used to generate a particular output. Furthermore, a complete scheme for forensic engineering should be able to identify that a particular output (solution) is produced by a tool never encountered before. Correct classification is strongly dependent on the amount of information available to the forensic engineering technique. The common assumption across all of the scenarios is that, for each observed instance, both the instance and the tool output are available. We, therefore, define five different scenarios which classify the amount of available information, and present them in order of increasing information. The most difficult scenario for a forensic engineering technique is the first scenario, where none of the tools are available.

1. Limited instance information. Instances and output from several tools are available. However, access to the tools themselves is not provided. Note that no specific instances can be used to uncover particular features or properties of the tools. However, it is assumed that a significant number of instances are available to make any statistically significant conclusions.

2. Information on a pay-per-use basis. The different tools are available on a pay-per-use basis. In this case, selected instances can be run on the tool for a fee. It is important in this case that the number of instances used in the representative set and the number of validation iterations of the forensic engineering technique are kept to a minimum. On the other hand, it is important that the submitted representative set of instances be diverse enough such that the technique can still accurately classify the tools.

3. Blackbox information. Blackboxes of the tools are available to do essentially unlimited instance testing. This scenario provides the forensic engineering technique with the opportunity to analyze any number of instances with a variety of different instance properties, but provides no information on the algorithm constructs used in the tools.

4. Whitebox information. Tools are available as whiteboxes, meaning the algorithmic details are available; however, modification to the tools is not possible. In this scenario, we assume that unlimited instance testing can be done using the tool, or the tool can be reproduced from the whitebox information enabling essentially unlimited testing.

5. Unlimited information access. In this scenario, complete access to the algorithms/tools is available to the forensic engineering technique. Modifications of the algorithms/tools are possible and unlimited instance testing can be done.

[Figure 7, evaluation stage flowchart. Recoverable labels: Classify Properties of Unknown; Unknown Instance and Solution; Property Extraction; Property Calibration; Relative Forensic Model; Classification of Unknown with k% Confidence.]

Of course, in all but one of these scenarios, the tools are available for testing instances. Note that some algorithms contain randomized components or can be obfuscated. As a result, the output of the algorithms will vary with each run, which greatly reduces the possibility of solely comparing the outputs to each other.

Finally, note that there are three underlying assumptions that are required to make any forensic technique feasible. The first assumption is that each of the algorithms is stable. This implies that the algorithms produce predictable outputs that are created with the intention to optimize the quality of the solution. Obviously, if there are two algorithms that intentionally produce completely random solutions, one cannot distinguish their outputs. The second assumption is that the algorithms are sufficiently different among themselves. Algorithms that are just minor modifications of each other often produce identical or similar solutions on the majority of instances. The last assumption is that each of the problem instances has a large solution space with many solutions of similar high quality. If a problem only contains a single solution, obviously, no conclusive forensic analysis can be done. Either the algorithm/tool will find the single solution or not. Note that active watermarking research clearly indicates that for many problems this last assumption essentially always holds [14, 71].

5.2

Properties

In this section, we present instance properties. Afterwards, we identify the relevance of solution properties and discuss the role of instance and solution properties in the RGFE technique.

A property can be defined as a quality or trait belonging to an individual or object, or as an attribute that is common to all members of a class. Properties aid in the classification of objects. In the case of RGFE the properties are the distinguishing factors between different instances of a problem and the solutions or outputs generated by different algorithms. Instance properties provide insights into the structure of a problem and solutions. Furthermore, they facilitate instance classification. One important observation is that algorithms often perform differently on instances with different properties. For example, local search algorithms for the SAT problem often perform very poorly on random instances. However, they perform very well on structured instances.

While instance properties assist in the classification of instances, solution properties facilitate identification of the behaviors of algorithms. Solutions are illustrations of how an algorithm handled the instance and provide insights into its optimization mechanisms. Properties of solutions help to identify the algorithm's mechanism(s), and to classify the behavior of the algorithms with respect to the instance.

Properties are the basis for the RGFE technique. The development of properties and their calibration and relevance is the focus of the property collection phase. Additional properties are identified in the case that extensive validation of the models, in the validation phase, does not lead to a model of sufficient quality. It is important to note that calibrated properties are the defining components of the n-dimensional space, and therefore the classification model.

5.3

Calibration

Calibration is one of the key steps of the property collection phase. It is vital for properly understanding property values for a particular algorithm/instance pair. In this section, we introduce the concept of calibration for the RGFE technique, discuss the beneﬁts of calibration, and deﬁne two possible approaches for calibrating properties: rank-ordered and scale-based.

Calibration is the mapping of raw data values to values which contain the maximum amount of information to facilitate a particular task, which in this case is algorithm classiﬁcation. The goal of calibration for the RGFE technique is to provide a relative and relevant perspective on particular features of a solution generated by a particular algorithm on a particular instance or on instances suitable to produce solutions of a particular type. Calibration aids the proper interpretation of data.

The best way to introduce calibration and establish its importance is to use a speciﬁc example. Consider two diﬀerent SAT instances, par8-2-c and par8-1-c, solved using the GRASP and Walksat SAT algorithms. These algorithms are discussed in more detail in Chapter 9.1. We evaluate the instances and solutions with the solution property of non-important variables and present the results in Table 1. Non-important variables are variables that may switch their assignment in such a way that the correctness of the obtained solution is preserved. In a sense, the number of non-important variables indicates the robustness of the obtained SAT solution. We discuss the property in greater detail in Chapter 6.3.

Without calibration, by considering only these two instances, we would associate a range of 0.39 to 0.59 to the GRASP solutions and of 0.53 to 0.71 to the Walksat solutions. These two ranges overlap and therefore classification is difficult. The reason is obvious; the two instances have different structure. There are intrinsically many more non-important variables in the instance par8-2-c than in par8-1-c. Calibration can compensate for this difference in instances. For example, we see that in both cases, GRASP has a property value approximately 20% lower than that of Walksat for this property. Calibration of the values with respect to the other algorithms enables proper capturing of the relationships between the algorithms, which are not visible from the raw values.

            GRASP     Walksat
par8-2-c    0.5294    0.7059
par8-1-c    0.3906    0.5939

Table 1: Property values for the solution property non-important variables on two SAT instances, par8-2-c and par8-1-c, solved by two SAT solvers, GRASP and Walksat.
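The effect of looking at values relative to another algorithm, rather than in isolation, can be checked with a short calculation on the Table 1 data. The sketch assumes the table is read with instances as rows and solvers as columns, and it uses a plain per-instance difference as the relative measure; this is an illustration, not one of the calibration schemes defined later in this section.

```python
# Raw values from Table 1 (assumed layout: rows are instances,
# columns are solvers).
table1 = {
    "par8-2-c": {"GRASP": 0.5294, "Walksat": 0.7059},
    "par8-1-c": {"GRASP": 0.3906, "Walksat": 0.5939},
}


def relative_gap(values, algo, reference):
    """Property value of `algo` relative to `reference` on one instance."""
    return values[algo] - values[reference]


gaps = {
    inst: round(relative_gap(vals, "GRASP", "Walksat"), 4)
    for inst, vals in table1.items()
}
```

Under this reading the gap is about -0.18 on par8-2-c and about -0.20 on par8-1-c, so the relationship between the two solvers is nearly the same on both instances even though the raw values differ substantially.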

Two technical difficulties arise when calibrating data. The first difficulty is determining the goal of the calibration scheme, or what the operational definition is for the quality of a proposed calibration scheme. The second difficulty is how to derive a high quality calibration scheme. We present two calibration schemes and discuss how each of them addresses these difficulties.

The first calibration approach is a rank-ordered scheme. For each property value on a particular instance, we rank each of the algorithms. Using these rankings, a collaborative ranking for the property is built by examining the rankings of each of the algorithms on all instances. Additional consideration must be given to how to resolve any ties in ranking, and how to combine rankings for individual instances. One can use either average ranking, median ranking, or some other function of ranking on the individual instances. In our experiments we used modal ranking, where the ranking of each algorithm is defined as the rank that was detected on the largest number of instances. Note that in this situation, two or more algorithms can have the same ranking.

Consider the following example that illustrates the key ideas and trade-offs. For property P on instance X, consider algorithms a, b, c, and d which have property values 203, 31, 108, and 130, respectively. A forward rank order scheme would rank the algorithms (b, c, d, a) on instance X. In addition to instance X, consider instances Y and Z with algorithm rankings of (b, c, d, a) and (b, d, c, a), respectively. By examining the rankings of each of the algorithms, we find algorithm b always has the first ranking and algorithm a always has the last. In the case of algorithms c and d, we find algorithm c more often appears second and d third. Note that the classifications of algorithms c and d are not exact. While for property P we are able to distinguish algorithms a and b, additional properties, and therefore dimensions of analysis, need to be considered.
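The example can be reproduced with a short sketch of forward rank ordering followed by modal ranking. The tie-breaking rule used when two ranks are equally frequent (take the lowest rank) is an assumption introduced for the sketch, as are the function names; ties between equal property values are not handled.

```python
from collections import Counter


def rank_order(values):
    """Forward rank order: rank 1 goes to the smallest property value."""
    ordered = sorted(values, key=values.get)
    return {algo: pos + 1 for pos, algo in enumerate(ordered)}


def modal_ranking(per_instance_ranks):
    """Rank observed on the largest number of instances, per algorithm."""
    modal = {}
    for algo in per_instance_ranks[0]:
        counts = Counter(ranks[algo] for ranks in per_instance_ranks)
        best = max(counts.values())
        # Assumed tie-break: the lowest of the equally frequent ranks.
        modal[algo] = min(r for r, c in counts.items() if c == best)
    return modal


rank_x = rank_order({"a": 203, "b": 31, "c": 108, "d": 130})
rank_y = {"b": 1, "c": 2, "d": 3, "a": 4}
rank_z = {"b": 1, "d": 2, "c": 3, "a": 4}
modal = modal_ranking([rank_x, rank_y, rank_z])
```

For the three instances above, the modal ranks are 1 for b, 2 for c, 3 for d, and 4 for a, matching the discussion in the text.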

Rank-order calibration schemes are simple to implement and are robust against data outliers. However, rank order schemes eliminate the information about the relationship between numerical values for a given property of the algorithms. Additionally, the property after rank order calibration does not provide a mechanism for stating an unpopulated region. Unpopulated regions are necessary for the RGFE technique to classify output from an algorithm which has not been observed or studied in the model. Rank-based property calibration can only classify unobserved areas when multiple properties are considered together. When analyzing multiple properties together, rank order calibrated properties may create empty regions, where no data is present.

The second type of calibration mechanism is a scale-based scheme. In these types of techniques, calibration is done by mapping the data values from the initial scale onto a new scale. Possible types of data-mapping are normalization against the highest or lowest value, against the median, or against the average value. We use a scheme where the smallest value on all instances is mapped to value 0, the largest value to value 1, and all other values are mapped according to the formula:

$$x_{new} = \frac{x_{init} - x_s}{x_l - x_s},$$

where $x_{init}$ is the initial value for the property before calibration, $x_s$ is the smallest value, and $x_l$ is the largest value before calibration.
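The scale-based mapping is a standard min-max normalization and can be sketched directly from the formula. The handling of the degenerate case where all values are equal (mapping everything to 0) is an assumption of this sketch, and the function name is hypothetical.

```python
def scale_calibrate(values):
    """Map values onto [0, 1] using x_new = (x_init - x_s) / (x_l - x_s)."""
    x_s, x_l = min(values), max(values)
    if x_l == x_s:
        # Assumed convention: a constant property carries no information.
        return [0.0 for _ in values]
    return [(v - x_s) / (x_l - x_s) for v in values]


# Property values of algorithms a, b, c, d from the ranking example.
calibrated = scale_calibrate([203, 31, 108, 130])
```

Unlike the rank-ordered scheme, the calibrated values retain the relative distances between algorithms: c and d end up close together near the middle of the scale, while a and b occupy the two ends.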

The advantage of a scale-based scheme is, in principle, higher resolution and more expressive power than a rank order scheme. However, these types of approaches can be very sensitive to data outliers, that is, a few exceptionally large or small values. For a scale-based scheme, each of the property values may be plotted on a segment after data-interpretation on the absolute values has been applied. Regions of the segment which are populated by a particular algorithm are defined as classification regions for these algorithms. Regions of the segment which are not populated by any algorithm are specified as unclassified.


technique. A number of metrics can be defined to quantitatively compare calibration schemes. We mainly used two alternatives: percentage of incorrect classification and unclassified regions. In the case of both the rank-based and scale-based schemes, the final calibration scheme may misclassify a number of values. The number of incorrectly classified property values is an indicator of the quality of the scheme in the sense that it identifies the accuracy of the calibration.

Unclassified regions are regions, after the scale-based calibration of a property, that cover property values not associated with any of the observed algorithms. These regions are the key to identifying when a given output was not generated by one of the previously observed algorithms. One way to evaluate the effectiveness of the calibration scheme is to consider the sum of the sizes of the unclassified regions. Larger regions indicate better classification of the algorithms.
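The sum of the sizes of the unclassified regions can be computed with a simple sweep over the calibrated segment. The sketch assumes that each algorithm's classification region is a single closed interval on the [0, 1] scale, which is a simplification introduced here; the interval values and the function name are hypothetical.

```python
def unclassified_regions(regions):
    """Return the gaps on [0, 1] not covered by any algorithm's interval,
    together with their total size.

    `regions` maps an algorithm name to a (low, high) interval of
    calibrated property values (assumed representation).
    """
    gaps, cursor = [], 0.0
    for low, high in sorted(regions.values()):
        if low > cursor:
            gaps.append((cursor, low))
        cursor = max(cursor, high)
    if cursor < 1.0:
        gaps.append((cursor, 1.0))
    return gaps, sum(high - low for low, high in gaps)


# Hypothetical classification regions for three observed algorithms.
gaps, total = unclassified_regions({
    "a": (0.0, 0.2),
    "b": (0.5, 0.7),
    "c": (0.6, 1.0),
})
```

In this hypothetical case the only unclassified region is the interval from 0.2 to 0.5; a calibrated value falling there would suggest an algorithm that has not been observed.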

Calibration and the results of calibration are used in three phases of the RGFE technique. First, calibration is performed in the property collection phase. In this phase, calibration techniques are applied to each of the instance and solution properties in such a way that the relevance of the property can be determined. Additionally, the calibration scheme of a particular property can be modiﬁed or changed in order to aid in increasing the quality of the proposed model. Lastly, in the evaluation phase, the calibrated value of each property of the unknown instance and solution is used to ﬁt the unknown into the model.

6

Properties

In this section, we discuss the purpose of the generic problem formulation and how the properties for diﬀerent optimization problems are extracted in a systematic generic way. In the second part of this section, we present generic forms for both instance properties and solution properties and demonstrate their meaning on the SAT and GC problems.

6.1

Generic Property Formulation and Extraction

Our generic problem formulation is the standard formulation for 0-1 integer linear programming. It consists of a linear objective function (OF) and linear constraints. All variables are integers and can be assigned to a value of either 0 or 1. We have adopted this format due to the fact that it naturally represents many optimization problems well. For many problems this formulation is readily available [39, 73, 79, 85]. There are two key benefits for developing properties in generic form. The first is a conceptual benefit, while the second is software reuse. The conceptual benefit is that treating properties in a generic form greatly facilitates the process of identifying new properties for newly targeted optimization problems. This is so because key insights are developed that are based on a combination of constraints that make a particular optimization problem difficult. The software reuse benefit lies in the fact that many properties that are developed for one optimization problem can be easily reused for forensic analysis of other optimization problems. Therefore, software for the extraction of these properties need only be written once.

Note that although many problems can be specified using the generic format, specific optimization problems often have specific features. For example, depending on the problem, the generic formulation may or may not have an objective function. Specifically, the SAT problem contains an empty objective function; as long as all the constraints are satisfied, the value of the objective function is irrelevant. However, the GC problem formulation relies on the objective function to enforce minimization of the number of colors used to color the graph. Other properties to consider include the types of variables that appear in the constraints (positive only, negative only, or both), the weights of the variables in the constraints (whether they are all the same or not), whether the objective function contains all of the variables in the problem or only a subset, and so forth. The key is to identify the intrinsic essential properties of the problem and develop a quantitative way to measure them.
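As one illustration of a property measured directly on the generic form, the sketch below counts how many constraints contain only positive terms, only negative terms, or both. The representation of the constraint matrix as a list of coefficient rows and the function name are assumptions made for the example; this is not one of the properties defined later in this chapter.

```python
def constraint_sign_profile(A):
    """Count constraints by the signs of their non-zero coefficients."""
    profile = {"positive_only": 0, "negative_only": 0, "mixed": 0}
    for row in A:
        has_pos = any(a > 0 for a in row)
        has_neg = any(a < 0 for a in row)
        if has_pos and has_neg:
            profile["mixed"] += 1
        elif has_pos:
            profile["positive_only"] += 1
        elif has_neg:
            profile["negative_only"] += 1
        # Rows with no non-zero coefficient are ignored.
    return profile


# Coefficient rows of the three SAT constraints from the first
# formulation in Chapter 3.3 (columns X1..X4).
sat_rows = [[0, 1, 1, 1], [1, 0, 1, -1], [-1, -1, 0, -1]]
profile = constraint_sign_profile(sat_rows)
```

Because the function only inspects the coefficient matrix, the same code applies unchanged to a graph coloring instance or any other problem expressed in the generic form, which is the software reuse benefit discussed above.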

An important observation is that some problems can be represented in the generic form in different ways. For example, in Chapter 3.3, we introduced two different representations of the SAT problem in the generic format. These two formulations are equivalent to each other and to the original problem. However, depending on how the problem is formulated in the generic form, certain properties may be easier and cleaner to identify and extract. This by no means implies that these properties do not exist in the other formulation; they are just more difficult to identify.

Note that the solution properties can also be extracted in the generic form by mapping the solution output of the algorithms to the generic solution form, then computing the property values. We illustrate the steps in Figure 8.


[Figure 8 flowchart; box labels: Property Selection; Representative Set of Instances; Representative Set of Algorithms; Property Extraction; Convert to Generic Form; Run Algorithms on Instances; Instance Solutions from Algorithm; Convert to Generic Solution Form; Solution Property Extraction Impl.; Instance Property Extraction Impl.; Calibrate Properties; Solution Property Values; Instance Property Values.]

Figure 8: Procedure for generic property extraction.

The representative set of instances for the problem is used both in its standard representation and in generic form. On the right side of the Property Extraction phase, the instances are converted into generic form, and the instance property extraction methods are applied to that form. The instance property values are collected and passed on to the calibration step. On the left side of Figure 8, each algorithm in the representative set is run on each instance in the representative set. The solution that each algorithm produces for each instance is converted to the generic solution form and given to the solution property extraction method, along with the original instance in its generic form. The solution property values for the given algorithm and instance are collected and passed on to the calibration phase.
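The data flow just described can be summarized in a short sketch. It is only an outline under stated assumptions: the function extract_property_values and all of its parameters are hypothetical, and the converters, algorithms, and property extractors are supplied by the caller as plain callables. Calibration is not shown.

```python
def extract_property_values(instances, algorithms,
                            to_generic, to_generic_solution,
                            instance_props, solution_props):
    """Collect raw property values before calibration (hypothetical helper).

    instances           : dict name -> instance in its standard representation
    algorithms          : dict name -> callable(instance) -> solution
    to_generic          : callable(instance) -> generic-form instance
    to_generic_solution : callable(instance, solution) -> generic-form solution
    instance_props      : dict name -> callable(generic_instance) -> value
    solution_props      : dict name -> callable(generic_instance,
                                                generic_solution) -> value
    """
    instance_values = {}
    solution_values = {}
    for inst_name, inst in instances.items():
        # Right side of the figure: instance -> generic form -> properties.
        generic = to_generic(inst)
        instance_values[inst_name] = {
            name: prop(generic) for name, prop in instance_props.items()
        }
        # Left side: run every algorithm, convert its solution, extract.
        for alg_name, run in algorithms.items():
            solution = run(inst)
            generic_solution = to_generic_solution(inst, solution)
            solution_values[(alg_name, inst_name)] = {
                name: prop(generic, generic_solution)
                for name, prop in solution_props.items()
            }
    return instance_values, solution_values
```

Both returned dictionaries would then be handed to the calibration step; keying the solution values by (algorithm, instance) keeps each value tied to the instance it was measured on.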

6.2 Instance Properties

We have developed a number of generic instance properties. We present them here in generic form and interpret each one for the SAT and GC problems.

[I1] Constraint Difficulty. Each constraint in the problem formulation contains a coefficient for each variable appearing in the constraint and a value (the b-value) on its right-hand side. The goal of constraint difficulty is to provide a measure of how much effort and attention the algorithm places on a given constraint. For example, in the SAT formulation, each constraint represents a single clause, and therefore all variables have unit weight. The b-value of the constraint depends on the number of positive and negative literals in the constraint. In this case, therefore, the generic property summarizes information about the size of the clauses in the instance. The aggregate information about constraints can be expressed using statistical measures such as average and variance, which are the measures used in our system. Note also that we used a weighted average of clause size in the motivational example.
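A minimal sketch of this kind of aggregation follows. The text does not give an exact formula, so the choice to summarize the number of variables per constraint and the b-values by their mean and population variance is an assumption, as are the function name constraint_difficulty and the representation of a constraint as a (coefficients, b-value) pair.

```python
from statistics import mean, pvariance


def constraint_difficulty(constraints):
    """Summarize constraint sizes and b-values by mean and variance.

    constraints : list of (coeffs, b) pairs, where coeffs maps each variable
                  in the constraint to its coefficient and b is the b-value.
    """
    sizes = [len(coeffs) for coeffs, _ in constraints]
    b_values = [b for _, b in constraints]
    return {
        "size_mean": mean(sizes),
        "size_var": pvariance(sizes),
        "b_mean": mean(b_values),
        "b_var": pvariance(b_values),
    }


# Three illustrative rows with unit weights, as in a SAT instance.
constraints = [
    ({1: 1, 2: -1, 3: 1}, 0),
    ({1: -1, 2: 1}, 0),
    ({2: 1, 3: 1, 4: 1, 5: 1}, 1),
]
summary = constraint_difficulty(constraints)
```

For a SAT instance the size statistics reduce to the mean and variance of clause length.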

[I2] Ratio of Signs of Variables. The key observation is that some variables tend to appear in all constraints in a single form, while other variables appear in multiple forms and have more balanced appearance counts. For this property, we assume, without loss of generality, that all coefficients b are positive. In the problem formulation, the positive, negative, and x-weighted occurrences of a variable can be examined with respect to the total number of occurrences of the variable in the instance. In the GC problem formulation there are only positive variables. However, we can consider the number of times a variable appears compared to the average number of appearances of all variables. This property measures the relative degree of each node in the graph. In the SAT problem, we can use this property to identify the tendency of a variable in the instance to be assigned true or false. Again, various statistical measures can be used to aggregate this information. We use average and variance.

$$P(\text{constraint satisfied}) = 1 - \prod_{v_i} P(v_i \text{ assigned opposite to the constraint's benefit})$$
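The displayed expression for P(constraint satisfied) can be evaluated as in the sketch below. The sketch covers only the unit-weight case of a SAT clause and assumes that variables are assigned independently, each taking the value 1 with a given probability (0.5 unless stated otherwise); the text does not specify how the per-variable probabilities are obtained, so that baseline and the name p_constraint_satisfied are assumptions.

```python
from math import prod


def p_constraint_satisfied(coeffs, p_one=None):
    """Return 1 - product of P(variable assigned against the constraint).

    coeffs : dict variable -> +1 (constraint benefits from x = 1)
                           or -1 (constraint benefits from x = 0)
    p_one  : optional dict variable -> P(x = 1); 0.5 is used when missing
    """
    p_one = p_one or {}
    p_against = []
    for var, coeff in coeffs.items():
        p1 = p_one.get(var, 0.5)
        # A positive occurrence is hurt by x = 0, a negative one by x = 1.
        p_against.append(1.0 - p1 if coeff > 0 else p1)
    return 1.0 - prod(p_against)


# Three-literal clause under uniform random assignment: 1 - 0.5**3
p_uniform = p_constraint_satisfied({1: 1, 2: -1, 3: 1})
```

Shorter clauses give lower values under the uniform baseline, which matches the reading of this quantity as a measure of clause difficulty.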

[I3] Ratio of Variables vs. Constraints. This property can be applied to all or a subset of the variables in all or a subset of the constraints in the instance. It provides insight into the difficulty of these constraints. A low number of variables in a large number of constraints can imply that the constraints are difficult to satisfy, because numerous constraints depend heavily on the same variables. This property has a direct interpretation for the SAT problem. For the GC problem, when considering a subset of the constraints that represent edges in the graph, the property provides a measure of the subgraph's density, or of the density of the entire graph when all edge constraints are considered.
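The ratio can be computed directly from a set of constraints, as in the following sketch. The name variable_constraint_ratio is hypothetical, and the graph example simplifies the GC formulation by using one variable per node in each edge constraint, which is an assumption made only for illustration.

```python
def variable_constraint_ratio(constraints):
    """Number of distinct variables divided by the number of constraints.

    constraints : non-empty list of dicts, each mapping variable -> coefficient
    """
    if not constraints:
        raise ValueError("need at least one constraint")
    variables = set()
    for coeffs in constraints:
        variables.update(coeffs)  # adds the dict keys, i.e. the variables
    return len(variables) / len(constraints)


# Edge constraints of a triangle and of a 4-node path; variables are node ids.
triangle = [{1: 1, 2: 1}, {2: 1, 3: 1}, {1: 1, 3: 1}]
path = [{1: 1, 2: 1}, {2: 1, 3: 1}, {3: 1, 4: 1}]

ratio_triangle = variable_constraint_ratio(triangle)
ratio_path = variable_constraint_ratio(path)
```

The triangle, which is the denser graph, yields the lower ratio, in line with the observation that few variables shared among many constraints indicate tightly coupled constraints.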

[I4] Bias of a Variable. We measure the bias of a variable toward being assigned zero or one, based on the number of constraints that would benefit from each assignment. Note that this property has no interpretation for the GC formulation, because every variable occurs only positively.
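One way to quantify such a bias is sketched below. The text states only that the bias is based on the counts of constraints favoring each assignment, so the normalized difference used here, (positive - negative) / (positive + negative), is an assumption, as is the name variable_bias.

```python
from collections import Counter


def variable_bias(constraints):
    """Per-variable bias in [-1, 1]; +1 means every occurrence favors x = 1.

    constraints : list of dicts, each mapping variable -> nonzero coefficient;
                  a positive coefficient is taken to favor x = 1 and a
                  negative coefficient to favor x = 0.
    """
    pos, neg = Counter(), Counter()
    for coeffs in constraints:
        for var, coeff in coeffs.items():
            if coeff > 0:
                pos[var] += 1
            else:
                neg[var] += 1
    return {
        var: (pos[var] - neg[var]) / (pos[var] + neg[var])
        for var in pos.keys() | neg.keys()
    }


constraints = [{1: 1, 2: -1, 3: 1}, {1: -1, 2: -1}, {1: 1, 3: 1}]
bias = variable_bias(constraints)
```

When every coefficient is positive, as in the GC formulation, every variable receives a bias of +1, which is consistent with the property carrying no information for that problem.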

[I5] Probability of Satisfying Constraints. This property considers the difficulty of satisfying each constraint based on its variables, the weights of the variables, and its b-value. We define the probability that a constraint is satisfied as given by the displayed expression for P(constraint satisfied). For the SAT problem, this is a measure of the difficulty of a clause. For our formulation of the GC problem, this property has no relevance, since most of the constraints contain the same number of v