6.3 Mining Outliers in Markov Blanket Subspaces
6.3.1 Learning Bayesian Networks in Subspaces
We present three greedy strategies to learn a local Bayesian network in a Markov blanket subspace for outlier detection. First, we employ the state-of-the-art MMMB method to discover the Markov blanket of each attribute (details in [5]). The MMMB algorithm first identifies the parents and children of a target attribute and then discovers the spouses of the target attribute. Second, with the discovered Markov blankets, we learn a local Bayesian network for each Markov blanket subspace through two phases: structure learning and parameter learning.
Structure learning. We present a greedy method to learn a local Bayesian network structure in a Markov blanket subspace with the following strategies.
Strategy 1: if A ∈ MB(X), X ∈ PC(A), and PC(A) − {X} is not empty, then add an edge in G: X → A, or A → B, where B ∈ PC(A) − {X}.
The edge directions among X, the spouses of X, and the children shared by X and its spouses can be determined in the Markov blanket discovery phase, because these attributes form the well-known V-structure [93]: A → C ← B. In this V-structure, A and B are initially independent, but become dependent when conditioned on C.
Figure 6.3: Examples of two Bayesian networks in Markov blanket subspaces (node labels: Anxiety, Peer Pressure, Yellow Fingers, Smoking, Lung Cancer, Genetics, Allergy, Coughing, Fatigue)
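The V-structure property described above can be checked numerically. The following is a minimal sketch, not part of the original method: the probability tables `p_a`, `p_b`, and `p_c1` are hypothetical values chosen only for illustration, and the helper names `joint` and `prob` are assumptions.

```python
from itertools import product

# Hypothetical CPTs for the V-structure A -> C <- B (illustrative values only).
p_a = {0: 0.7, 1: 0.3}
p_b = {0: 0.6, 1: 0.4}
# P(C = 1 | A, B): C becomes more likely when either parent is active.
p_c1 = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.95}

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a) * P(b) * P(c | a, b)."""
    p_c = p_c1[(a, b)] if c == 1 else 1.0 - p_c1[(a, b)]
    return p_a[a] * p_b[b] * p_c

def prob(**fixed):
    """Sum the joint over all assignments that agree with the fixed values."""
    total = 0.0
    for a, b, c in product((0, 1), repeat=3):
        assignment = {"a": a, "b": b, "c": c}
        if all(assignment[name] == value for name, value in fixed.items()):
            total += joint(a, b, c)
    return total

# Marginally, A and B are independent: P(A=1, B=1) - P(A=1) P(B=1) is zero.
marginal_gap = prob(a=1, b=1) - prob(a=1) * prob(b=1)

# Conditioned on C = 1, the same difference is no longer zero.
p_c_is_1 = prob(c=1)
conditional_gap = (prob(a=1, b=1, c=1) / p_c_is_1
                   - (prob(a=1, c=1) / p_c_is_1) * (prob(b=1, c=1) / p_c_is_1))
```

With these tables, `marginal_gap` vanishes up to rounding, while `conditional_gap` is clearly non-zero, which is the asymmetry that lets the discovery phase orient the edges into the common child.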
Strategy 2: if PC(A) − {X} is empty, our method only considers adding an edge A → X or X → A, where A ∈ MB(X).
Strategy 3: Assume DAG(X) is the directed acyclic graph learned from the Markov blanket subspace of X, and DAG(A) is the one learned from the Markov blanket subspace of A. If A ∈ PC(X), X ∈ PC(A), and the edge A → X exists in DAG(X), then the edge A → X should also be in DAG(A).
Strategy 3 keeps the edge directions between attributes consistent across the local networks. This is the key step that allows us to use Bayesian network inference to mine outliers in Markov blanket subspaces. If the directions between attributes are consistent in every local Bayesian network, then the joint probability for each attribute is also consistent. For example, in Figure 6.3, we obtain two local Bayesian networks, one for “Lung-cancer” and one for “Smoking”. In the two Bayesian networks, the direction of the edge between “Lung-cancer” and “Smoking” should be the same.
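The consistency rule of Strategy 3 can be sketched as a small lookup over the local DAGs that have already been learned. This is an illustrative sketch under stated assumptions, not the implementation used in this work: the function name `inherit_orientations`, the dictionary-of-edge-sets representation, and the edge directions in the example are all hypothetical.

```python
def inherit_orientations(target, pc, learned_dags):
    """Return the directed edges that DAG(target) must contain so that it
    agrees with local DAGs learned earlier (Strategy 3).

    pc[v]           -- set of parents and children of attribute v
    learned_dags[v] -- set of directed edges (parent, child) in DAG(v)
    """
    forced = set()
    for x in pc[target]:
        # The rule applies only when the PC relation is symmetric and
        # DAG(x) has already been learned.
        if target not in pc.get(x, set()) or x not in learned_dags:
            continue
        # Copy whichever orientation DAG(x) chose for the shared edge.
        for edge in ((target, x), (x, target)):
            if edge in learned_dags[x]:
                forced.add(edge)
    return forced

# Hypothetical example using attribute names from Figure 6.3; the edge
# directions below are assumed for illustration only.
pc = {
    "Smoking": {"Lung Cancer", "Anxiety"},
    "Lung Cancer": {"Smoking", "Genetics"},
}
learned = {"Smoking": {("Anxiety", "Smoking"), ("Smoking", "Lung Cancer")}}
forced_edges = inherit_orientations("Lung Cancer", pc, learned)
```

Here `forced_edges` contains only the edge shared by the two subspaces, oriented as in the previously learned DAG, so the greedy search for the second network starts from that fixed orientation.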
Based on the score and our greedy search strategies, the structure learning problem can be formally expressed as follows: given a complete training data set of instances O, find a DAG $G^*$ such that
$$G^* = \arg\max_{G \in G_n} g(G:O)$$
where $g(G:O)$ is the scoring function measuring the degree of fitness of any candidate $G$
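to the data set, and $G_n$ is the family of all the DAGs defined on O. To find a Bayesian network, we use the scoring function $g_{BD}$ proposed in [32]:
$$g_{BD}(G:O) = \log p(G) + \sum_{i=1}^{n} \sum_{j=1}^{q_i} \left[ \log\frac{\Gamma(\eta_{ij})}{\Gamma(N_{ij} + \eta_{ij})} + \sum_{k=1}^{r_i} \log\frac{\Gamma(N_{ijk} + \eta_{ijk})}{\Gamma(\eta_{ijk})} \right]$$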
where $\log p(G)$ is the log prior probability of the structure $G$, the values $\eta_{ijk}$ are the hyperparameters for
the Dirichlet prior distributions of the parameters given the network structure, $q_i$ is the number of states of the Cartesian product of $D_i$'s parents, $r_i$ is the number of states of $D_i$, $\eta_{ij} = \sum_{k=1}^{r_i} \eta_{ijk}$, and $\Gamma(\cdot)$ is the Gamma function, $\Gamma(c) = \int_0^\infty e^{-u} u^{c-1}\,du$.
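The score above can be evaluated in log space with the log-Gamma function, which avoids overflow for large counts. The following is a minimal sketch assuming the counts $N_{ijk}$ and hyperparameters $\eta_{ijk}$ are already available as nested lists; the function name `bd_score` and its arguments `counts`, `eta`, and `log_prior` are hypothetical.

```python
from math import lgamma, log

def bd_score(counts, eta, log_prior=0.0):
    """Evaluate g_BD(G:O).

    counts[i][j][k] -- N_ijk, occurrences of state k of attribute i
                       under parent configuration j
    eta[i][j][k]    -- Dirichlet hyperparameter eta_ijk
    log_prior       -- log p(G), supplied by the caller
    """
    score = log_prior
    for n_i, eta_i in zip(counts, eta):              # attributes i
        for n_ij, eta_ij in zip(n_i, eta_i):         # parent configurations j
            # log( Gamma(eta_ij) / Gamma(N_ij + eta_ij) )
            score += lgamma(sum(eta_ij)) - lgamma(sum(n_ij) + sum(eta_ij))
            # sum over k of log( Gamma(N_ijk + eta_ijk) / Gamma(eta_ijk) )
            for n_ijk, eta_ijk in zip(n_ij, eta_ij):
                score += lgamma(n_ijk + eta_ijk) - lgamma(eta_ijk)
    return score

# One binary attribute without parents, observed once in each state,
# with all hyperparameters set to 1.
example_score = bd_score([[[1, 1]]], [[[1.0, 1.0]]])
expected_score = -log(6)  # Gamma(2)/Gamma(4) * (Gamma(2)/Gamma(1))**2 = 1/6
```

In a greedy search, this function would be called once per candidate structure, with the counts recomputed only for the attributes whose parent sets changed.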
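The likelihood is a function of the parameters that is proportional to the probability of the observed data. The log-likelihood $\log p(O|G)$ is defined as follows, where $D_i^{(m)}$ is the $m$th instance of attribute $D_i$ and $D_{\pi_i}^{(m)}$ denotes the values of its parents in that instance:
$$\log p(O|G) = \sum_{m=1}^{n} \log p(D^{(m)}|\theta) = \sum_{m=1}^{n} \sum_{i=1}^{d} \log p\left(D_i^{(m)} \mid D_{\pi_i}^{(m)}, \theta_i\right)$$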
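The double sum over instances and attributes translates directly into two nested loops. The sketch below assumes a dictionary-based representation of the data and the conditional probability tables; the names `log_likelihood`, `parents`, and `cpt`, as well as the two-attribute network and its probabilities, are hypothetical.

```python
from math import log

def log_likelihood(data, parents, cpt):
    """Compute log p(O|G) for a discrete Bayesian network.

    data       -- list of instances, each a dict {attribute: value}
    parents[v] -- tuple of the parent attributes of v
    cpt[v]     -- dict mapping (parent_values, value) to
                  P(v = value | parents = parent_values)
    """
    total = 0.0
    for row in data:                          # sum over instances m
        for var, pa in parents.items():       # sum over attributes i
            pa_values = tuple(row[p] for p in pa)
            total += log(cpt[var][(pa_values, row[var])])
    return total

# Hypothetical network S -> L with illustrative probabilities.
parents = {"S": (), "L": ("S",)}
cpt = {
    "S": {((), 0): 0.5, ((), 1): 0.5},
    "L": {((0,), 0): 0.9, ((0,), 1): 0.1, ((1,), 0): 0.2, ((1,), 1): 0.8},
}
data = [{"S": 1, "L": 1}, {"S": 0, "L": 0}]
ll = log_likelihood(data, parents, cpt)   # log(0.5 * 0.8 * 0.5 * 0.9)
```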
Parameter learning. We estimate the parameters of the Bayesian networks using maximum likelihood estimation. Given a data set O and the structure of a Bayesian network G, maximum likelihood estimation aims to choose parameters $\theta^*$ that satisfy
$$L(\theta^* : O|G) = \max_{\theta \in \Theta} L(\theta : O|G)$$
Here $\Theta$ is the hypothesis space, that is, the set of all parameter values with $\theta \in [0,1]$. With the Markov property of Bayesian networks, $L(\theta : O|G)$ can be decomposed as follows.
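$$L(\theta : O|G) = \prod_{i} L_i(\theta_{D_i|Pa_{D_i}} : O|G)$$
where the local likelihood function for $D_i$ is:
$$L_i(\theta_{D_i|Pa_{D_i}} : O|G) = \prod_{j} P\left(D_i^{j} \mid Pa_{D_i}^{j} : \theta_{D_i|Pa_{D_i}}\right)$$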
With the structure of the Bayesian network G and the data set O, maximizing $L(\theta : O|G)$ reduces to estimating $\theta_{ijk} = P(D_i = k \mid Pa(D_i) = j)$. That is, the maximum likelihood estimates are simply the observed frequency estimates $\hat{\theta}_{ijk} = n_{ijk}/n_{ij}$, where $n_{ijk}$ is the number of occurrences in the training set of the $k$th state of $D_i$ with the $j$th state of its parents, and $n_{ij}$ is the sum of $n_{ijk}$ over all $k$.
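The frequency estimates can be obtained with two counters, one for $n_{ijk}$ and one for $n_{ij}$. This is a minimal sketch; the function name `mle_cpt` and the toy data set are assumptions made for illustration.

```python
from collections import Counter

def mle_cpt(data, var, parents):
    """Return {(j, k): n_ijk / n_ij}, where j is a parent configuration
    (a tuple of parent values) and k is a state of `var`."""
    n_ijk = Counter((tuple(row[p] for p in parents), row[var]) for row in data)
    n_ij = Counter(tuple(row[p] for p in parents) for row in data)
    return {(j, k): count / n_ij[j] for (j, k), count in n_ijk.items()}

# Hypothetical data for an attribute L with a single parent S.
data = [
    {"S": 1, "L": 1},
    {"S": 1, "L": 1},
    {"S": 1, "L": 0},
    {"S": 0, "L": 0},
]
theta = mle_cpt(data, "L", ("S",))
```

Note that combinations never observed in the data, such as L = 1 with S = 0 here, receive no entry at all, which corresponds to an estimate of zero.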
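To deal with the situation $n_{ijk} = 0$, we use the Dirichlet prior. Then, $\hat{\theta}_{ijk}$ can be written as follows, normalized over the $|D_i|$ states of $D_i$.
$$\hat{\theta}_{ijk} = \frac{n_{ijk} + \alpha_{D_i,Pa_{D_i}}}{n_{ij} + |D_i| \cdot \alpha_{D_i,Pa_{D_i}}}$$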
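where $\alpha_{D_i,Pa_{D_i}} = \alpha \cdot P(D_i, Pa_{D_i})$, $P(D_i, Pa_{D_i}) = \frac{1}{|D_i| \cdot |Pa_{D_i}|}$, and $\alpha = \sum_{D_i,Pa_{D_i}} \alpha_{D_i,Pa_{D_i}}$. $|D_i|$ is the number of values that $D_i$ takes, and $|Pa_{D_i}|$ is the number of (joint) values of the parents of $D_i$.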
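The effect of the prior on zero counts can be illustrated with a short sketch. The function name `smoothed_estimate` and its arguments are hypothetical. The normalizing denominator, $n_{ij}$ plus the pseudo-counts of all $|D_i|$ states, is an assumption made here so that the estimates for a fixed parent configuration sum to one.

```python
def smoothed_estimate(n_ijk, n_ij, alpha, n_states, n_parent_configs):
    """Dirichlet-smoothed estimate of theta_ijk.

    n_states         -- |D_i|, number of values the attribute takes
    n_parent_configs -- |Pa_Di|, number of joint values of its parents
    """
    # alpha_{Di,PaDi} = alpha * P(Di, PaDi), with the uniform prior
    # P(Di, PaDi) = 1 / (|Di| * |PaDi|).
    pseudo_count = alpha / (n_states * n_parent_configs)
    # Assumption: normalize by n_ij plus the pseudo-counts of all |Di| states.
    return (n_ijk + pseudo_count) / (n_ij + n_states * pseudo_count)

# A binary attribute with one binary parent, alpha = 4 (pseudo-count 1 per cell).
# For one parent configuration, a single instance was seen, in state 1.
p_unseen = smoothed_estimate(0, 1, alpha=4, n_states=2, n_parent_configs=2)
p_seen = smoothed_estimate(1, 1, alpha=4, n_states=2, n_parent_configs=2)
```

The unobserved state now receives a small positive probability instead of zero, so an instance containing it yields a low but finite likelihood rather than an undefined logarithm.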