Statistical data mining
5.1 Uncertainty measures and inference
5.1.1 Probability
An event is any proposition that can be either true or false, and is formally a subset of the space $\Omega$, which is called the space of all elementary events. Elementary events are events that cannot be further decomposed, and together they cover all possible occurrences. Let $\mathcal{A}$ be a class of subsets of $\Omega$, called the event space. A probability function $P$ is a function defined on $\mathcal{A}$ that satisfies the following axioms:
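• $P(A) \geq 0$, $\forall A \in \mathcal{A}$
• $P(\Omega) = 1$
• If $A_1, A_2, \ldots$ is a sequence of events of $\mathcal{A}$ that are pairwise mutually exclusive (i.e. $A_i \cap A_j = \emptyset$ for $i \neq j$, $i, j = 1, 2, \ldots$) and if $A_1 \cup A_2 \cup \cdots = \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$, then $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$.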
A probability function is also known as a probability measure, or simply as a probability. The three axioms can be interpreted in the following way. The first axiom says that probability is a non-negative function. The second axiom says that the probability of the event $\Omega$ is 1; $\Omega$ is an event that is always true, as it coincides with all possible occurrences. Since any event is a subset of $\Omega$, it follows that the probability of any event is a real number in $[0, 1]$. The third axiom says that the probability that any one of a collection of mutually exclusive events (possibly infinitely many) occurs is the sum of the probabilities of occurrence of each of them. This is the formal, axiomatic definition of probability due to Kolmogorov (1933).
There are several interpretations of this probability. These interpretations help from an operational viewpoint when we come to construct a probability measure. In the classical interpretation, if an experiment gives rise to a finite number $n$ of equally likely possible results, then $P(A) = n_A / n$, where $n_A$ indicates the number of results in $A$ (favourable results). In the more general frequentist interpretation, the probability of an event coincides with the relative frequency of the same event in a large (possibly infinite) sequence of repeated trials under the same experimental conditions. The frequentist interpretation allows us to take most of the concepts developed for frequencies (such as those in Chapter 3) and extend them to the realm of probabilities.
In the even more general (although somewhat controversial) subjective interpretation, the probability is a degree of belief that an individual attaches to the occurrence of a certain event. This degree of belief is totally subjective but not arbitrary, since probabilities must obey coherence rules, which correspond to the above axioms and all the rules derivable from those axioms. The advantage of the subjective approach is that it is always applicable, especially when an event cannot be repeated (a typical situation for observational data and data mining, and unlike experimental data).
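The link between the classical and the frequentist interpretations can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the fair six-sided die, the event "the result is even", the function name `relative_frequency` and the fixed seed are all hypothetical choices, not taken from the text.

```python
import random

def relative_frequency(n_trials, seed=0):
    """Relative frequency of the event A = 'even result' in n_trials rolls of a fair die."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    hits = sum(1 for _ in range(n_trials) if rng.randint(1, 6) % 2 == 0)
    return hits / n_trials

# Classical interpretation: n_A / n, with n_A = 3 favourable results out of n = 6.
classical = 3 / 6

# Frequentist interpretation: the relative frequency should settle near the
# classical value as the number of repeated trials grows.
for n_trials in (100, 10_000, 100_000):
    print(n_trials, relative_frequency(n_trials), classical)
```

For small numbers of trials the relative frequency may differ visibly from 0.5; the agreement is expected to improve only as the sequence of trials becomes long.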
We can use the three axioms to deduce the basic rules of probability. Here are the complement rule and the union rule:
• Complement rule: if $A$ is any event in $\mathcal{A}$, and $\bar{A}$ is its complement (negation), then $P(\bar{A}) = 1 - P(A)$.
• Union rule: for any pair of events $A, B \in \mathcal{A}$, $P(A \cup B) = P(A) + P(B) - P(A \cap B)$, where the union event $A \cup B$ is true when either $A$ or $B$ is true; the intersection event $A \cap B$ is true when both $A$ and $B$ are true.
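Both rules can be checked exactly on a small finite space. The following sketch assumes a fair six-sided die with the classical probability $n_A/n$; the events `A` and `B` and the helper `prob` are hypothetical names chosen for illustration.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # elementary events of one roll of a fair die

def prob(event):
    """Classical probability n_A / n of an event, i.e. a subset of OMEGA."""
    return Fraction(len(event & OMEGA), len(OMEGA))

A = frozenset({2, 4, 6})  # "the result is even"
B = frozenset({4, 5, 6})  # "the result is greater than 3"

# Complement rule: P(not A) = 1 - P(A)
complement_A = OMEGA - A
print(prob(complement_A), 1 - prob(A))  # 1/2 1/2

# Union rule: P(A or B) = P(A) + P(B) - P(A and B)
print(prob(A | B), prob(A) + prob(B) - prob(A & B))  # 2/3 2/3
```

Exact fractions are used instead of floating-point numbers so that the two sides of each rule can be compared for equality.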
Probability has so far been defined in the absence of information. Similarly to the concept of relative frequency, we can define the probability of an event $A$ occurring, conditional on the information that the event $B$ is true. Let $A$ and $B$ be two events in $\mathcal{A}$. The conditional probability of the event $A$, given that $B$ is true, is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad \text{with } P(B) > 0.$$
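As a minimal sketch of this definition, the code below computes a conditional probability on a fair six-sided die. The die, the events and the function names `prob` and `cond_prob` are assumptions made for illustration only.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # elementary events of one roll of a fair die

def prob(event):
    """Classical probability n_A / n of an event, i.e. a subset of OMEGA."""
    return Fraction(len(event & OMEGA), len(OMEGA))

def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B), defined only when P(B) > 0."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return prob(a & b) / p_b

A = frozenset({2, 4, 6})  # "the result is even"
B = frozenset({4, 5, 6})  # "the result is greater than 3"

print(prob(A))          # 1/2
print(cond_prob(A, B))  # 2/3
```

Knowing that the result is greater than 3 raises the probability of an even result from 1/2 to 2/3, because two of the three results in `B` are even.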
The previous definition extends to any conditioning set of events. Conditional probabilities allow us to introduce further important rules:
• Intersection rule: Let $A$ and $B$ be two events in $\mathcal{A}$. Then $P(A \cap B) = P(A \mid B) P(B) = P(B \mid A) P(A)$.
• Independence of events: If $A$ is independent of $B$, the following relations hold:
$$P(A \cap B) = P(A) P(B), \qquad P(A \mid B) = P(A), \qquad P(B \mid A) = P(B).$$
In other words, if two events are independent, knowing that one of them occurs does not alter the probability that the other one occurs.
• Total probability rule: Consider $n$ events $H_i$, $i = 1, \ldots, n$, pairwise mutually exclusive and exhaustive of $\Omega$ (equivalently, they form a partition of $\Omega$), with $P(H_i) > 0$. Then the probability of an event $B$ in $\mathcal{A}$ is given by
$$P(B) = \sum_{i=1}^{n} P(B \mid H_i) P(H_i).$$
• Bayes’ rule: Consider $n$ events $H_i$, $i = 1, \ldots, n$, pairwise mutually exclusive and exhaustive of $\Omega$ (equivalently, they form a partition of $\Omega$), with $P(H_i) > 0$. Then, for an event $B$ in $\mathcal{A}$ such that $P(B) > 0$, the conditional probability of $H_i$ given $B$ is
$$P(H_i \mid B) = \frac{P(B \mid H_i) P(H_i)}{\sum_{j=1}^{n} P(B \mid H_j) P(H_j)}.$$
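The total probability rule and Bayes' rule stated in the list above can be sketched in a few lines of code. This is an illustration under stated assumptions: the function names and the numerical values of $P(H_i)$ and $P(B \mid H_i)$ are hypothetical and do not come from the text.

```python
from fractions import Fraction

def total_probability(priors, likelihoods):
    """P(B) = sum_i P(B | H_i) P(H_i) for a partition H_1, ..., H_n."""
    if sum(priors) != 1 or any(p <= 0 for p in priors):
        raise ValueError("the P(H_i) must be positive and sum to 1")
    return sum(l * p for l, p in zip(likelihoods, priors))

def bayes(priors, likelihoods):
    """P(H_i | B) for every i, using P(B) as the denominator."""
    p_b = total_probability(priors, likelihoods)
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return [l * p / p_b for l, p in zip(likelihoods, priors)]

# Hypothetical values: P(H_1), P(H_2), P(H_3) and P(B | H_1), P(B | H_2), P(B | H_3).
priors = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
likelihoods = [Fraction(1, 100), Fraction(2, 100), Fraction(5, 100)]

print(total_probability(priors, likelihoods))            # 21/1000
print([str(p) for p in bayes(priors, likelihoods)])      # ['5/21', '2/7', '10/21']
```

The three resulting conditional probabilities sum to 1, since dividing by $P(B)$ rescales the products $P(B \mid H_i) P(H_i)$ into a probability distribution over the $H_i$.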
The total probability rule plays a very important role in the combination of different probability statements; we will see an important application in Section 5.7. Bayes’ theorem is a very important rule, also known as the ‘inversion rule’ as it calculates the conditional probability of an event by using the reversed conditional probabilities. Note also that the denominator of Bayes’ rule is the result of the total probability rule; it acts as a normalising constant of the probabilities in the numerator. This theorem lies at the heart of the inferential methodology known as Bayesian statistics.
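Returning to the independence relations stated earlier in this section, the product condition $P(A \cap B) = P(A) P(B)$ can also be checked exactly on a small finite space. The sketch below again assumes a fair six-sided die; the events and the helper names `prob` and `independent` are hypothetical.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # elementary events of one roll of a fair die

def prob(event):
    """Classical probability n_A / n of an event, i.e. a subset of OMEGA."""
    return Fraction(len(event & OMEGA), len(OMEGA))

def independent(a, b):
    """True when P(A and B) = P(A) P(B)."""
    return prob(a & b) == prob(a) * prob(b)

A = frozenset({2, 4, 6})     # "the result is even"
B = frozenset({1, 2, 3, 4})  # "the result is at most 4"
C = frozenset({4, 5, 6})     # "the result is greater than 3"

print(independent(A, B))  # True:  P(A and B) = 1/3 = (1/2)(2/3)
print(independent(A, C))  # False: P(A and C) = 1/3, but P(A) P(C) = 1/4
```

In the first case, knowing that the result is at most 4 leaves the probability of an even result at 1/2; in the second case, knowing that the result is greater than 3 changes it.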