Email Classification For Risk Assessment
5.3 Generating OBDDs For Email Classifier Model
5.3.1 Ordered Binary Decision Diagrams
Figure 5.3: Decision Tree Representation of Malicious Email
Binary decision diagrams (BDDs) [Bryant(1986)] have been recognized as abstract representations of Boolean functions. A BDD represents a Boolean function as a rooted, directed acyclic graph. Figure 5.3 shows a representation of the function f(Fr, To, H, EF) defined by the truth table in Table 5.9; this is the special case where the graph is actually a tree. Terminal nodes, which have out-degree zero, are labelled 0 or 1, and variable nodes v have out-degree two. The two outgoing edges of a variable node are given by the functions low(v), corresponding to the case where the variable is assigned 0, and high(v), corresponding to the case where the variable is assigned 1; they are drawn as dotted and solid lines, respectively. A variable var(v) is associated with each variable node.
The key idea of OBDDs [Bryant(1992)] is that by restricting the representation, Boolean manipulation becomes much simpler computationally. A BDD is an OBDD if, on all paths through the graph, the variables respect a given linear order x1 < x2 < ... < xn, such as Fr < To < H < EF. An OBDD is reduced if (i) no two distinct nodes u and v have the same variable name and the same low- and high-successors, i.e., var(u) = var(v), low(u) = low(v), high(u) = high(v) implies u = v, and (ii) no variable node u has identical low- and high-successors, i.e., low(u) ≠ high(u) [Bryant(1992)].
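The two definitions above can be checked mechanically on a small graph. The following is a minimal sketch, not the implementation used in this work: the `Node` class and the helper names `is_ordered` and `is_reduced` are hypothetical, terminals are represented by the integers 0 and 1, and the example graph encodes ¬(Fr ∧ To) under the order Fr < To < H < EF.

```python
from dataclasses import dataclass
from typing import Union

ORDER = ("Fr", "To", "H", "EF")  # linear variable order from the text


@dataclass(eq=False)  # eq=False: nodes are compared by identity, not by content
class Node:
    var: str
    low: Union["Node", int]   # successor when var is assigned 0
    high: Union["Node", int]  # successor when var is assigned 1


def key(n):
    """Terminals are identified by value, variable nodes by object identity."""
    return n if isinstance(n, int) else id(n)


def is_ordered(n, order, last=-1):
    """True if variables appear in strictly increasing order on every path."""
    if isinstance(n, int):
        return True
    rank = order.index(n.var)
    return (rank > last
            and is_ordered(n.low, order, rank)
            and is_ordered(n.high, order, rank))


def is_reduced(root):
    """True if no node is duplicated and no node has low == high."""
    seen, visited, stack = set(), set(), [root]
    while stack:
        n = stack.pop()
        if isinstance(n, int) or id(n) in visited:
            continue
        visited.add(id(n))
        triple = (n.var, key(n.low), key(n.high))
        if key(n.low) == key(n.high) or triple in seen:
            return False
        seen.add(triple)
        stack += [n.low, n.high]
    return True


# Example graph for not(Fr and To): Fr = 0 leads to 1, otherwise test To.
example = Node("Fr", low=1, high=Node("To", low=1, high=0))
print(is_ordered(example, ORDER), is_reduced(example))
```

Comparing nodes by identity is a deliberate choice here: the reducedness condition is about distinct graph nodes that carry the same (var, low, high) triple, which a content-based equality would hide.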
5.3.2 OBDD Representation From The Naive Bayesian Classifier
Reduced OBDDs [Bryant(1992)] provide compact representations of Boolean expressions. Their usefulness rests on the crucial fact that for any function f : B^n → B, there is exactly one reduced OBDD representing it, for a given ordering. This means, in particular, that there is exactly one reduced OBDD for the constant true and the constant false function on B^n: the terminal nodes 1 and 0. Hence, it is possible to test in constant time whether a reduced OBDD is constantly true or false. Furthermore, OBDDs are well suited to reasoning about the properties of a naive Bayesian classifier: any such classifier can be represented by an OBDD that is tractable in size even given an intractable number of instances. The size of the graph representing a function can, however, depend heavily on the ordering of the input variables.
Figure 5.4: The removal process to build a reduced OBDD.
Table 5.9 gives the truth table of malicious email using a threshold of 0.9. Figure 5.3 represents the classifier induced by the Bayesian network using Table 5.9. To build an OBDD from this decision tree, the transformation rules of [Bryant(1992)] were applied: remove duplicate terminals, remove duplicate nonterminals, and then remove redundant tests (see Figure 5.4).
The transformation rules are defined in [Bryant(1992)] as follows:
• Remove Duplicate Terminals: Eliminate all but one terminal vertex with a given label and redirect all arcs into the eliminated vertices to the remaining one.
• Remove Duplicate Nonterminals: If nonterminal vertices u and v have var(u) = var(v), low(u) = low(v), high(u) = high(v), then eliminate one of the two vertices and redirect all incoming arcs to the other vertex.
• Remove Redundant Tests: If nonterminal vertex v has low(v) = high(v), then eliminate v and redirect all incoming arcs to low(v).
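The effect of the three rules can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the procedure used to produce the figures: instead of rewriting an explicit tree, it expands the function variable by variable and applies the rules while nodes are created. The names `build_reduced_obdd`, `mk`, and `evaluate` are hypothetical, and the example function `v` is the email virus formula given in Section 5.3.3. The expansion enumerates all 2^n assignments, which is only reasonable for a handful of variables.

```python
from itertools import product

ORDER = ("Fr", "To", "H", "EF")  # variable order used in the text


def v(Fr, To, H, EF):
    """Email virus function as written in Section 5.3.3."""
    return ((Fr and not To and not H and EF)
            or (not Fr and To and not H and EF)
            or (not Fr and not To and H and EF)
            or (not Fr and not To and not H))


def build_reduced_obdd(func, order):
    """Return (root, nodes); nodes maps a node id to (var, low, high).

    The ids 0 and 1 are the two shared terminals (no duplicate terminals).
    """
    unique = {}  # (var, low, high) -> id, so no duplicate nonterminals
    nodes = {}

    def mk(var, low, high):
        if low == high:            # redundant test: skip the node entirely
            return low
        triple = (var, low, high)
        if triple not in unique:   # reuse an existing identical node
            unique[triple] = len(unique) + 2
            nodes[unique[triple]] = triple
        return unique[triple]

    def expand(i, assignment):
        if i == len(order):
            return int(bool(func(**assignment)))
        low = expand(i + 1, {**assignment, order[i]: False})
        high = expand(i + 1, {**assignment, order[i]: True})
        return mk(order[i], low, high)

    return expand(0, {}), nodes


def evaluate(root, nodes, assignment):
    """Follow low/high edges from the root until a terminal is reached."""
    node = root
    while node > 1:
        var, low, high = nodes[node]
        node = high if assignment[var] else low
    return node


root, nodes = build_reduced_obdd(v, ORDER)
for node_id, (var, low, high) in sorted(nodes.items()):
    print(node_id, var, "low ->", low, "high ->", high)
```

Keying the node table on the triple (var, low, high) is what makes the result canonical for a given order: two subgraphs computing the same function always collapse to the same id.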
After this reduction of the decision tree, a reduced OBDD was produced, as shown in Figure 5.5. This OBDD represents the naive Bayesian classifier induced by the network in Figure 5.2 with probability threshold 0.9, with respect to the variable order (Fr, To, H, EF).
Figure 5.5: A reduced OBDD of Email Viruses
5.3.3 Email Classifier With OBDDs
A truth assignment to a Boolean function B is the same as fixing a set of variables in the domain of B, i.e., if X is a Boolean variable in the domain of B, then X can be assigned either 0 or 1 (denoted [X → 0] and [X → 1], respectively). Let X → Y1, Y2 denote the if-then-else operator. Then X → Y1, Y2 is true if either X and Y1 are true or X is false and Y2 is true; the variable X is said to be the test expression. More formally, we have:
X → Y1, Y2 = (X ∧ Y1) ∨ (¬X ∧ Y2)
All operators can easily be expressed using only the if-then-else operator and the constants 0 and 1. Hence the operator gives rise to a new kind of normal form.
• Definition An If-then-else Normal Form (INF) is a Boolean function built entirely from the if-then-else operator and the constants 0 and 1 such that all tests are performed only on variables.
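As a small illustration of the if-then-else operator, the sketch below writes negation, conjunction, disjunction, and NAND using only `ite`, the constants 0 and 1, and tests on variables. The function names are hypothetical and chosen for this example only.

```python
from itertools import product


def ite(x, y1, y2):
    """If-then-else: x -> y1, y2 = (x and y1) or (not x and y2)."""
    return int((x and y1) or ((not x) and y2))


# Each operator below tests only variables, as the INF definition requires.
def not_(x):
    return ite(x, 0, 1)


def and_(x, y):
    return ite(x, ite(y, 1, 0), 0)


def or_(x, y):
    return ite(x, 1, ite(y, 1, 0))


def nand(x, y):
    return ite(x, ite(y, 0, 1), 1)


for x, y in product((0, 1), repeat=2):
    print(x, y, and_(x, y), or_(x, y), nand(x, y))
```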
Let the abnormal mail classifier be denoted by t, the UBE part by u, and the email virus part by v. Let t[0/u + v] denote the Boolean expression obtained by replacing u + v with 0 in t, and t[1/u + v] the expression obtained by replacing u + v with 1. It is then not hard to see that the following equivalence holds:
t = u + v → t[1/u + v], t[0/u + v].
This is known as the Shannon expansion of t with respect to u + v. From the Shannon expansion it follows that any Boolean function can be expressed in If-then-else Normal Form (INF) by iteratively applying the above substitution scheme to t. The ordering of the variables, corresponding to the order in which the Shannon expansion is performed, is encoded in the BDD [Bryant(1986)].
The abnormal mail classifier t is therefore true if either u or v is true, which means that the classifier can declare a mail abnormal by considering either the UBE part or the email virus part of the detected raw packets. The truth table of the abnormal mail classifier is in Table 5.10.
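The substitution scheme can be checked on a small case. The sketch below assumes that t is simply the disjunction of the two sub-classifier outputs, treats u and v as Boolean inputs, and applies the expansion in its single-variable form (here on u). The helpers `cofactor` and `shannon` are hypothetical names introduced for illustration.

```python
from itertools import product


def t(u, v):
    """Abnormal mail classifier, assumed here to be u OR v."""
    return u or v


def cofactor(func, name, value):
    """Return func with argument `name` fixed to `value`, i.e. t[value/name]."""
    return lambda **kw: func(**{**kw, name: value})


def shannon(func, name):
    """Rebuild func as: name -> func[1/name], func[0/name]."""
    high = cofactor(func, name, True)
    low = cofactor(func, name, False)
    return lambda **kw: high(**kw) if kw[name] else low(**kw)


expanded = shannon(t, "u")
table = [(a, b, int(expanded(u=a, v=b)))
         for a, b in product((False, True), repeat=2)]
for row in table:
    print(row)
```

The rows printed by the loop correspond to the UBE, Virus, and f columns of the abnormal mail truth table.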
Table 5.10: The truth table of UBE (left) and that of abnormal mail (right)
Fr   To   f    |   UBE   Virus   f
F    F    1    |   F     F       0
F    T    1    |   F     T       1
T    F    1    |   T     F       1
T    T    0    |   T     T       1
An OBDD for UBEs can be built in the same way as the OBDD for email viruses. However, in the previous survey and examination of UBEs in Table 5.4, about 85% of email was blocked, and for 85.3% of emails the reported reason was non-existent or invalid. Apart from these statistics, non-existent or invalid senders/recipients are also protocol anomalies. Therefore, two factors have been chosen: a sender, denoted by Fr, and a recipient, denoted by To. A simple Boolean function can be created using the NAND Boolean operator as follows:
u = ¬(Fr ∧ To)
The truth table of this Boolean function is in Table 5.10. The OBDD representations of the UBE classification and the abnormal mail classification are in Figure 5.6.
Figure 5.6: OBDD representation of UBEs and abnormal mail classification
The UBE classifier u is true if either the sender field Fr or the recipient field To is false, as in Figure 5.6. In other words, the classifier reports a mail as a UBE when either the sender field or the recipient field is malformed or wrong.
For the email virus part v, a Boolean function can be built according to Figure 5.5 in the following way:
v = (Fr ∧ ¬To ∧ ¬H ∧ EF) ∨ (¬Fr ∧ To ∧ ¬H ∧ EF) ∨ (¬Fr ∧ ¬To ∧ H ∧ EF) ∨ (¬Fr ∧ ¬To ∧ ¬H)
The email virus classifier v is true for every assignment of the four factors Fr, To, H, EF whose path ends in terminal 1 in Figure 5.5. The classifier therefore estimates that a mail contains an email virus in any of the following cases:
• the sender field is correct, but the recipient field and the header field are wrong and the mail has an attachment;
• the recipient field is correct, but the sender field and the header field are wrong and the mail has an attachment;
• the sender field and the recipient field are wrong, the header field is correct, and the mail has an attachment;
• the sender field, the recipient field, and the header field are all wrong, whether or not the mail has an attachment.
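Taken together, the three Boolean functions of this section can be evaluated directly. The sketch below is illustrative only: it assumes that Fr, To, and H are true when the corresponding field is well formed, that EF is true when an attachment is present, and that the abnormal mail classifier is the disjunction of the other two. The function names and the sample mails are hypothetical.

```python
def ube(Fr, To):
    """u = not (Fr and To): sender or recipient field is wrong."""
    return not (Fr and To)


def virus(Fr, To, H, EF):
    """v as given in the text, one disjunct per case."""
    return ((Fr and not To and not H and EF)
            or (not Fr and To and not H and EF)
            or (not Fr and not To and H and EF)
            or (not Fr and not To and not H))


def abnormal(Fr, To, H, EF):
    """t = u + v: the mail is abnormal if either part fires."""
    return ube(Fr, To) or virus(Fr, To, H, EF)


# Hypothetical sample mails: (Fr, To, H, EF)
samples = {
    "all fields valid, attachment": (True, True, True, True),
    "bad recipient and header, attachment": (True, False, False, True),
    "bad sender, recipient and header": (False, False, False, False),
}
for label, fields in samples.items():
    print(label, ube(*fields[:2]), virus(*fields), abnormal(*fields))
```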
The results on email classification presented in this chapter will be used in the intelligent firewall implementation presented in Section 7.1.