1.2. BAYESIAN INFERENCE

Bayesian inference. In this subsection and the next we develop an alternative definition that does.

When doing Bayesian inference, there is some entity which has features, the states of which we wish to determine, but which we cannot determine for certain. So we settle for determining how likely it is that a particular feature is in a particular state. The entity might be a single system or a set of systems.

An example of a single system is the introduction of an economically beneficial chemical which might be carcinogenic. We would want to determine the relative risk of the chemical versus its benefits. An example of a set of entities is a set of patients with similar diseases and symptoms. In this case, we would want to diagnose diseases based on symptoms.

In these applications, a random variable represents some feature of the entity being modeled, and we are uncertain as to the values of this feature for the particular entity. So we develop probabilistic relationships among the variables.

When there is a set of entities, we assume the entities in the set all have the same probabilistic relationships concerning the variables used in the model. When this is not the case, our Bayesian analysis is not applicable. In the case of the chemical introduction, features may include the amount of human exposure and the carcinogenic potential. If these are our features of interest, we identify the random variables HumanExposure and CarcinogenicPotential. (For simplicity, our illustrations include only a few variables. An actual application ordinarily includes many more than this.) In the case of a set of patients, features of interest might include whether or not a disease such as lung cancer is present, whether or not manifestations of diseases such as a chest X-ray are present, and whether or not causes of diseases such as smoking are present. Given these features, we would identify the random variables ChestXray, LungCancer, and SmokingHistory. After identifying the random variables, we distinguish a set of mutually exclusive and exhaustive values for each of them. The possible values of a random variable are the different states that the feature can take.

For example, the state of LungCancer could be present or absent, the state of ChestXray could be positive or negative, and the state of SmokingHistory could be yes or no. For simplicity, we have only distinguished two possible values for each of these random variables. However, in general they could have any number of possible values or they could even be continuous. For example, we might distinguish 5 different levels of smoking history (one pack or more for at least 10 years, two packs or more for at least 10 years, three packs or more for at least 10 years, etc.).

The specification of the random variables and their values not only must be precise enough to satisfy the requirements of the particular situation being modeled, but it also must be sufficiently precise to pass the clarity test, which was developed by Howard in 1988. That test is as follows: Imagine a clairvoyant who knows precisely the current state of the world (or future state if the model concerns events in the future). Would the clairvoyant be able to determine unequivocally the value of the random variable? For example, in the case of the chemical introduction, if we give HumanExposure the values low and high, the clarity test is not passed because we do not know what constitutes high or low. However, if we define high as when the average (over all individuals) of the individual daily average skin contact exceeds 6 grams of material, the clarity test is passed because the clairvoyant can answer precisely whether the contact exceeds that. In the case of a medical application, if we give SmokingHistory only the values yes and no, the clarity test is not passed because we do not know whether yes means smoking cigarettes, cigars, or something else, and we have not specified how long smoking must have occurred for the value to be yes. On the other hand, if we say yes means the patient has smoked one or more packs of cigarettes every day during the past 10 years, the clarity test is passed.
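One informal way to read the clarity test is that a value passes when it can be written as an unambiguous rule over facts a clairvoyant would know. The sketch below encodes the two precise definitions given above as such rules. The record type, its field name, and the function names are hypothetical, and treating low as "not high" is our assumption, since the text defines only high.

```python
from dataclasses import dataclass


@dataclass
class SmokingRecord:
    # Hypothetical field: the smallest number of cigarette packs the patient
    # smoked on any single day of the past 10 years (0.0 if any day was skipped).
    min_daily_packs_past_10_years: float


def smoking_history(record: SmokingRecord) -> str:
    """'yes' means one or more packs of cigarettes every day for the past 10 years."""
    return "yes" if record.min_daily_packs_past_10_years >= 1.0 else "no"


def human_exposure(avg_daily_skin_contact_grams: float) -> str:
    """'high' means the population average of daily skin contact exceeds 6 grams.

    Assumption: every other case is labeled 'low'.
    """
    return "high" if avg_daily_skin_contact_grams > 6.0 else "low"


heavy_smoker = smoking_history(SmokingRecord(min_daily_packs_past_10_years=1.5))
occasional_smoker = smoking_history(SmokingRecord(min_daily_packs_past_10_years=0.0))
```

A vague value such as an undefined "yes" has no such rule, which is exactly why it fails the test.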

After distinguishing the possible values of the random variables (i.e. their spaces), we judge the probabilities of the random variables having their values. However, in general we do not always determine prior probabilities; nor do we determine values in a joint probability distribution of the random variables. Rather we ascertain probabilities, concerning relationships among random variables, that are accessible to us. For example, we might determine the prior probability P(LungCancer = present), and the conditional probabilities P(ChestXray = positive | LungCancer = present), P(ChestXray = positive | LungCancer = absent), P(LungCancer = present | SmokingHistory = yes), and finally P(LungCancer = present | SmokingHistory = no). We would obtain these probabilities either from a physician or from data or from both. Thinking in terms of relative frequencies, P(LungCancer = present | SmokingHistory = yes) can be estimated by observing individuals with a smoking history, and determining what fraction of these have lung cancer. A physician is used to judging such a probability by observing patients with a smoking history. On the other hand, one does not readily judge values in a joint probability distribution such as P(LungCancer = present, ChestXray = positive, SmokingHistory = yes). If this is not apparent, just think of the situation in which there are 100 or more random variables (which there are in some applications) in the joint probability distribution. We can obtain data and think in terms of probabilistic relationships among a few random variables at a time; we do not identify the joint probabilities of several events.
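The two points of this paragraph can be made concrete with a small sketch. It estimates P(LungCancer = present | SmokingHistory = yes) as a relative frequency, and it counts how many entries a full joint distribution would need. The patient records are made-up toy data, and the function names are ours, not part of the text.

```python
import math

# Hypothetical toy records of the form (smoking_history, lung_cancer).
records = [
    ("yes", "present"),
    ("yes", "absent"),
    ("yes", "absent"),
    ("yes", "absent"),
    ("no", "present"),
    ("no", "absent"),
    ("no", "absent"),
    ("no", "absent"),
    ("no", "absent"),
    ("no", "absent"),
]


def estimate_conditional(records, smoking_value, cancer_value="present"):
    """Fraction of records with the given smoking history that have cancer_value."""
    matching = [cancer for smoking, cancer in records if smoking == smoking_value]
    if not matching:
        raise ValueError("no records with that smoking history")
    return matching.count(cancer_value) / len(matching)


def joint_table_size(space_sizes):
    """Number of value combinations in a joint distribution over the given spaces."""
    return math.prod(space_sizes)


p_cancer_given_smoker = estimate_conditional(records, "yes")    # 1 of 4 smokers
p_cancer_given_nonsmoker = estimate_conditional(records, "no")  # 1 of 6 non-smokers
entries_for_3_binary_variables = joint_table_size([2, 2, 2])
entries_for_100_binary_variables = joint_table_size([2] * 100)
```

Each conditional probability needs only the records for one smoking history, whereas a joint distribution over 100 two-valued variables has 2^100 entries, far too many to judge directly.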

As to the nature of these probabilities, consider first the introduction of the toxic chemical. The probabilities of the values of CarcinogenicPotential will be based on data involving this chemical and similar ones. However, this is certainly not a repeatable experiment like a coin toss, and therefore the probabilities are not relative frequencies. They are subjective probabilities based on a careful analysis of the situation. As to the medical application involving a set of entities, we often obtain the probabilities from estimates of relative frequencies involving entities in the set. For example, we might obtain P(ChestXray = positive | LungCancer = present) by observing 1000 patients with lung cancer and determining what fraction have positive chest X-rays. However, as will be illustrated in Section 1.2.3, when we do Bayesian inference using these probabilities, we are computing the probability of a specific individual being in some state, which means it is a subjective probability. Recall from Section 1.1.1 that a relative frequency is not a property of any one of the trials (patients), but rather it is a property of the entire sequence of trials. You may feel that we are splitting hairs. Namely, you may argue the following: “This subjective probability regarding a specific patient is obtained from a relative frequency and therefore has the same value as it. We are simply calling it a subjective probability rather than a relative frequency.” But even this is not the case. Even if the probabilities used to do Bayesian inference are obtained from frequency data, they are only estimates of the actual relative frequencies. So they are subjective probabilities obtained from estimates of relative frequencies; they are not relative frequencies. When we manipulate them using Bayes’ theorem, the resultant probability is therefore also only a subjective probability.

Once we judge the probabilities for a given application, we can often obtain values in a joint probability distribution of the random variables. Theorem 1.5 in Section 1.3.3 provides a way to do this when there are many variables. Presently, we illustrate the case of two variables. Suppose we only identify the random variables LungCancer and ChestXray, and we judge the prior probability P(LungCancer = present), and the conditional probabilities P(ChestXray = positive | LungCancer = present) and P(ChestXray = positive | LungCancer = absent). Probabilities of values in a joint probability distribution can be obtained from these probabilities using the rule for conditional probability as follows:

P(present, positive) = P(positive | present) P(present)
P(present, negative) = P(negative | present) P(present)
P(absent, positive) = P(positive | absent) P(absent)
P(absent, negative) = P(negative | absent) P(absent).
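The four products above can be computed mechanically once the three assessed probabilities are given. The following is a minimal sketch in which the numerical values are hypothetical and chosen only for illustration. It relies on the values of each variable being mutually exclusive and exhaustive, so that P(absent) = 1 - P(present) and P(negative | ·) = 1 - P(positive | ·).

```python
# Hypothetical assessed probabilities (illustrative values, not from the text).
p_present = 0.001            # P(LungCancer = present)
p_pos_given_present = 0.6    # P(ChestXray = positive | LungCancer = present)
p_pos_given_absent = 0.02    # P(ChestXray = positive | LungCancer = absent)


def joint_from_prior_and_conditionals(p_present, p_pos_given_present, p_pos_given_absent):
    """Apply the rule for conditional probability to get all four joint values."""
    p_absent = 1.0 - p_present
    return {
        ("present", "positive"): p_pos_given_present * p_present,
        ("present", "negative"): (1.0 - p_pos_given_present) * p_present,
        ("absent", "positive"): p_pos_given_absent * p_absent,
        ("absent", "negative"): (1.0 - p_pos_given_absent) * p_absent,
    }


joint = joint_from_prior_and_conditionals(
    p_present, p_pos_given_present, p_pos_given_absent
)
total = sum(joint.values())  # should equal 1 up to floating-point rounding
```

Whatever values are assessed, the four results sum to 1, as the values of a joint probability distribution must.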

Note that we used our abbreviated notation. We see then that at the outset we identify random variables and their probabilistic relationships, and values in a joint probability distribution can then often be obtained from the probabilities relating the random variables. So what is the sample space? We can think of the sample space as simply being the Cartesian product of the sets of all possible values of the random variables. For example, consider again the case where we only identify the random variables LungCancer and ChestXray, and ascertain probability values in a joint distribution as illustrated above. We can define the following sample space:

Ω = {(present, positive), (present, negative), (absent, positive), (absent, negative)}.

We can consider each random variable a function on this space that maps each tuple into the value of the random variable in the tuple. For example, LungCancer would map (present, positive) and (present, negative) each into present. We then assign each elementary event the probability of its corresponding event in the joint distribution. For example, we assign

P̂({(present, positive)}) = P(LungCancer = present, ChestXray = positive).

It is not hard to show that this does yield a probability function on Ω and that the initially assessed prior probabilities and conditional probabilities are the probabilities they notationally represent in this probability space. (This is a special case of Theorem 1.5.)
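The construction can be checked numerically for the two-variable case. The sketch below builds Ω as a Cartesian product, treats each random variable as a function on Ω, and recovers the prior and one conditional probability from the probabilities assigned to the elementary events. The joint values are hypothetical (they correspond to a prior of 0.001 and conditionals of 0.6 and 0.02), and all names are ours.

```python
from itertools import product

lung_cancer_space = ("present", "absent")
chest_xray_space = ("positive", "negative")

# Sample space: Cartesian product of the spaces of the two random variables.
omega = list(product(lung_cancer_space, chest_xray_space))


# Each random variable is a function mapping a tuple to its own component.
def lung_cancer(outcome):
    return outcome[0]


def chest_xray(outcome):
    return outcome[1]


# Hypothetical joint values assigned to the elementary events.
outcome_prob = {
    ("present", "positive"): 0.0006,
    ("present", "negative"): 0.0004,
    ("absent", "positive"): 0.01998,
    ("absent", "negative"): 0.97902,
}


def prob(event):
    """Probability of an event, i.e. a collection of outcomes in omega."""
    return sum(outcome_prob[o] for o in event)


present = [o for o in omega if lung_cancer(o) == "present"]
present_and_positive = [o for o in present if chest_xray(o) == "positive"]

recovered_prior = prob(present)                                     # about 0.001
recovered_conditional = prob(present_and_positive) / prob(present)  # about 0.6
```

The recovered values agree with the probabilities that generated the joint values, which is what it means for the assessed probabilities to be "the probabilities they notationally represent" in this space.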

Since random variables are actually identified first and only implicitly become functions on an implicit sample space, it seems we could develop the concept of a joint probability distribution without the explicit notion of a sample space. Indeed, we do this next. Following this development, we give a theorem showing that any such joint probability distribution is a joint probability distribution of the random variables with the variables considered as functions on an implicit sample space. Definition 1.1 (of a probability function) and Definition 1.5 (of a random variable) can therefore be considered the fundamental definitions for probability theory because they pertain both to applications where sample spaces are directly identified and ones where random variables are directly identified.

1.2.2 A Definition of Random Variables and Joint Probability Distributions for Bayesian Inference

For the purpose of modeling the types of problems discussed in the previous subsection, we can define a random variable X as a symbol representing any one of a set of values, called the space of X. For simplicity, we will assume the space of X is countable, but the theory extends naturally to the case where it is not. For example, we could identify the random variable LungCancer as having the space {present, absent}. We use the notation X = x as a primitive which is used in probability expressions. That is, X = x is not defined in terms of anything else. For example, in an application, LungCancer = present means the entity being modeled has lung cancer, but mathematically it is simply a primitive which is used in probability expressions. Given this definition and primitive, we have the following direct definition of a joint probability distribution:

Definition 1.8 Let a set of n random variables $V = \{X_1, X_2, \ldots, X_n\}$ be specified such that each $X_i$ has a countable space. A function that assigns a real number $P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$ to every combination of values of the $x_i$'s, such that the value of $x_i$ is chosen from the space of $X_i$, is called a joint probability distribution of the random variables in $V$ if it satisfies the following conditions:

1. For every combination of values of the $x_i$'s,

$$0 \le P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) \le 1.$$

2. We have

$$\sum_{x_1, x_2, \ldots, x_n} P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = 1.$$
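For finite spaces, the two conditions of the definition can be checked directly. The sketch below is one possible check, not part of the text: the function name, the dictionary representation, and the numerical tolerance are our assumptions, and the restriction to finite spaces is made only so that every combination can be enumerated.

```python
import math
from itertools import product


def is_joint_distribution(spaces, p, tol=1e-9):
    """Check conditions 1 and 2 for a candidate assignment p.

    spaces: dict mapping each variable name to a tuple of its values (finite here).
    p: dict mapping each combination of values, in the order of `spaces`, to a number.
    """
    combos = list(product(*spaces.values()))
    # The function must assign a number to every combination, and to nothing else.
    if set(p) != set(combos):
        return False
    # Condition 1: every assigned value lies in [0, 1].
    if any(not (0.0 <= p[c] <= 1.0) for c in combos):
        return False
    # Condition 2: the values sum to 1, up to floating-point tolerance.
    return math.isclose(sum(p[c] for c in combos), 1.0, abs_tol=tol)


spaces = {"LungCancer": ("present", "absent"), "ChestXray": ("positive", "negative")}

# Hypothetical assignments used only to exercise the check.
valid = {
    ("present", "positive"): 0.0006,
    ("present", "negative"): 0.0004,
    ("absent", "positive"): 0.01998,
    ("absent", "negative"): 0.97902,
}
bad_sum = {combo: 0.3 for combo in valid}                     # sums to 1.2
bad_range = dict(zip(valid, (-0.1, 0.5, 0.3, 0.3)))           # sums to 1, one value negative
```

Note that nothing in the check refers to a sample space: the assignment is judged purely as a function on combinations of values, which is the point of the definition.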


In document Learning Bayesian Networks (Neapolitan, Richard) (Page 32-36)