4.2 COM Methodology
4.2.1 Algorithm
We present the causal outlier mining in Bayesian network (COMBN) algorithm in Algorithm 1.
Algorithm 1 COMBN
Input: BN, parameters minconf, maxconf, |X|, τ, and a test set
Output: DSAPs, anomalies
1. Compute minsupp and maxsupp for every parent node in BN using Equations 4.5 and 4.6
2. For all causal subspaces in BN, repeat:
   2.1. Apply R1 and R2 using Equations 4.8 and 4.9 to discover DSAPs
   2.2. Compute sensitivity of discovered DSAPs in BN
3. If (|DSAPs| > 2 × |X|) then
   3.1. Sort DSAPs
   3.2. Output top (τ × |DSAPs|) low-scored DSAPs
   else
   Output all DSAPs extracted
4. Output test cases with DSAPs within as anomalies
We explain the COMBN algorithm with the help of the Bayesian network presented in Figure 4.3. This BN can be considered a model of a domain in which the objective is to identify outliers in a given test set based on the knowledge captured by the model. As input we are given the Bayesian network, the parameters minconf, maxconf, and τ, and a test set. Let the parameters minconf, maxconf, and τ be set to 10%, 80%, and 50%, respectively. The algorithm starts by computing minsupp and maxsupp for all parent nodes in the model using Equations 4.5 and 4.6, as indicated by step 1 of COMBN. Thereafter, rules R1 and R2 are applied over the two causal subspaces present in this BN to discover DSAPs. For every DSAP extracted using the rules, its sensitivity score is computed. The total number of DSAPs extracted from this BN is three, as discussed earlier. Since |DSAPs| is less than |X| for this BN, the condition specified in step 3 of the algorithm is not satisfied, and hence all three extracted DSAPs are given as output. Finally, test cases that contain any of the three DSAPs are identified as outliers.
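Steps 3 and 4 of COMBN can be illustrated with a minimal Python sketch. The data encoding is an assumption made only for illustration: a DSAP is modeled as a set of (variable, state) pairs, a test case as a mapping from variable to state, and the names `select_dsaps` and `flag_anomalies` are hypothetical. Rounding τ × |DSAPs| down to an integer is also an assumption, since the algorithm does not specify it.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumed encoding: a DSAP is a set of (variable, state) pairs.
Pattern = FrozenSet[Tuple[str, str]]


def select_dsaps(scored: List[Tuple[Pattern, float]],
                 num_vars: int, tau: float) -> List[Pattern]:
    """Step 3: keep the lowest-scored fraction tau only if there are
    more than 2 * |X| patterns; otherwise keep every pattern."""
    if len(scored) > 2 * num_vars:
        ranked = sorted(scored, key=lambda item: item[1])  # lowest score first
        keep = int(tau * len(ranked))  # rounding down is an assumption
        return [pattern for pattern, _ in ranked[:keep]]
    return [pattern for pattern, _ in scored]


def flag_anomalies(test_cases: List[Dict[str, str]],
                   dsaps: List[Pattern]) -> List[Dict[str, str]]:
    """Step 4: a test case is anomalous if it contains any DSAP."""
    flagged = []
    for case in test_cases:
        observed = frozenset(case.items())
        if any(pattern <= observed for pattern in dsaps):
            flagged.append(case)
    return flagged
```

With three scored patterns and `num_vars` large enough that the step 3 condition fails, `select_dsaps` returns all three, which mirrors the worked example in the text.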
The computational complexity of COMBN is governed by two key components of the BN: (1) the qualitative component, which specifies the number of nodes and directed links present in the model, and (2) the quantitative component, which indicates the total number of unconditional and conditional probability entries in the BN. The major computation is in Steps 1 and 2 of COMBN. In Step 1, minsupp and maxsupp are maintained for every parent node. However, to find the states with the minimum and maximum probability of occurrence for each parent node, we do not perform inference in the Bayesian network, which is known to be an NP-hard problem [48]. These parameters are like prior probabilities, i.e., P(X = xi), which are either provided by a domain expert or learnt from a given data set using EM algorithms. Bayesian network development software such as Netica [24] maintains this information for every node in the Bayesian network. Assuming this information is given, we only need to sort P(X = xi). Then minsupp and maxsupp are set using Equations 4.5 and 4.6 for every parent node in the BN.
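The sorting part of Step 1 can be sketched as follows. This is a minimal illustration, assuming the prior P(X = xi) of a parent node is available as a plain mapping from state to probability; the function name `extreme_states` and the example values are hypothetical. The sketch stops at locating the least and most probable states, because Equations 4.5 and 4.6, which turn these into minsupp and maxsupp, are not reproduced here.

```python
from typing import Dict, Tuple


def extreme_states(prior: Dict[str, float]) -> Tuple[Tuple[str, float], Tuple[str, float]]:
    """Sort the prior P(X = x_i) of one parent node and return the
    (state, probability) pairs with the smallest and largest probability."""
    ranked = sorted(prior.items(), key=lambda kv: kv[1])
    return ranked[0], ranked[-1]


# Hypothetical prior for a three-state parent node.
prior_x = {"low": 0.2, "medium": 0.5, "high": 0.3}
least, most = extreme_states(prior_x)
```

No inference is run on the network: the cost per parent node is a sort over |Val(X)| entries.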
In Step 2.1 of the algorithm, we use rules R1 and R2 in every causal subspace of a given Bayesian network to mine anomalous patterns. Intuitively, these rules amount to finding the conditional probability of some state of a child node given observations on its parent nodes, i.e., P(C = ci | Pa(C)). For queries of the form P(C = ci | Pa(C)), again, we do not need any complex inference in the Bayesian network: the answer is already stored in the BN in the conditional probability table associated with every child node. By contrast, a query such as P(C = ci | W), where W belongs to the set of descendant nodes of C in the BN, may require operations such as marginalization over irrelevant variables. Such queries can become intractable if there are a large number of nodes in the Bayesian network, whereas the query P(C = ci | Pa(C)) is always tractable. However, finding the conditional probability of interest in a large conditional probability table can be time consuming. To avoid this, we designed a pruning strategy specifically for rule R1. In rule R1, we look for the entry in the conditional probability table whose confidence is greater than or equal to the maxconf threshold. For example, consider a variable X with |Val(X)| = 3. On setting maxconf to 70%, at most one conditional probability of X, for a given parent configuration, can exceed 70%, because the probabilities over the states of X sum to one. This holds for all child nodes in the BN. As soon as we find that entry, we stop scanning the conditional probability table associated with the child node, since there can be only one entry greater than the maxconf threshold. For rule R2, we need to scan the probabilities of all possible states of the child node, for the given observations, for entries less than the minconf threshold, since more than one value may satisfy the minconf threshold. This can be pictured as scanning a matrix of size 1 × |Val(C)|, where |Val(C)| is the number of states of child node C.
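The two scanning strategies can be sketched in Python as below. This is an illustration under stated assumptions, not the full rules: it covers only the scan over one row of a conditional probability table (one parent configuration), represented here as a mapping from child state to probability. The function names `scan_r1` and `scan_r2` and the example row are hypothetical, and any further conditions that Equations 4.8 and 4.9 impose are omitted.

```python
from typing import Dict, List, Optional


def scan_r1(row: Dict[str, float], maxconf: float) -> Optional[str]:
    """Return the child state whose probability is >= maxconf, or None.
    The scan stops at the first hit; this early exit is safe when
    maxconf > 0.5, because the row sums to one."""
    for state, prob in row.items():
        if prob >= maxconf:
            return state  # prune: no other entry can also reach maxconf
    return None


def scan_r2(row: Dict[str, float], minconf: float) -> List[str]:
    """Return every child state whose probability is below minconf.
    All |Val(C)| entries must be visited, since several may qualify."""
    return [state for state, prob in row.items() if prob < minconf]


# Hypothetical CPT row P(C | Pa(C) = some configuration) with three states.
row = {"c1": 0.04, "c2": 0.90, "c3": 0.06}
high_state = scan_r1(row, maxconf=0.7)
low_states = scan_r2(row, minconf=0.1)
```

The R1 scan visits at most |Val(C)| entries and usually fewer, while the R2 scan always visits exactly |Val(C)| entries.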
In Step 2.2, the interestingness of a DSAP is computed using sensitivity analysis in the BN. Sensitivity analysis is, again, an NP-hard problem in the worst case. However, in our work we apply this measure only to variables that are causally related, i.e., within each causal subspace, rather than in settings where the known observations are located far from the node whose sensitivity is analysed. Thus, the NP-hard worst case does not arise in our setting.