Example: Naive Bayes

We now know enough probability theory that we can derive some simple machine learning tools.

The Naive Bayes model is a simple probabilistic model that is often used to recognize patterns. The model consists of one random variable $C$ representing a category, and a set of random variables $F = \{F^{(1)}, \dots, F^{(n)}\}$ representing features of objects in each category. In this example, we'll use Naive Bayes to diagnose patients as having the flu or not. $C$ can thus have two values: $c_0$ representing the category of patients who do not have the flu, and $c_1$ representing the category of patients who do have the flu. Suppose $F^{(1)}$ is the random variable representing whether the patient has a sore throat, with $f^{(1)}_0$ representing no sore throat, and $f^{(1)}_1$ representing a sore throat. Suppose $F^{(2)} \in \mathbb{R}$ is the patient's temperature in degrees Celsius.

When using the Naive Bayes model, we assume that all of the features are independent of each other given the category:

$$P(C, F^{(1)}, \dots, F^{(n)}) = P(C) \prod_i P(F^{(i)} \mid C).$$

These assumptions are very strong and unlikely to be true in practice, hence the name “naive.” Surprisingly, Naive Bayes often produces good predictions in practice, and is a good baseline model to start with when tackling a new problem.

Beyond these conditional independence assumptions, the Naive Bayes framework does not specify anything about the probability distribution. The specific choice of distributions is left up to the designer. In our flu example, let's make $P(C)$ a Bernoulli distribution, with $P(C = c_1) = \phi^{(C)}$. We can also make $P(F^{(1)} \mid C)$ a Bernoulli distribution, with

$$P(F^{(1)} = f^{(1)}_1 \mid C = c) = \phi^{(F)}_c.$$

In other words, the Bernoulli parameter changes depending on the value of $C$. Finally, we need to choose the distribution over $F^{(2)}$. Since $F^{(2)}$ is real-valued, a normal distribution is a good choice. Because $F^{(2)}$ is a temperature, there are hard limits to the values it can take on: it cannot go below 0 K, for example. Fortunately, these values are so far from the values measured in human patients that we can safely ignore these hard limits. Values outside the hard limits will receive extremely low probability under the normal distribution so long as the mean and variance are set correctly. As with $F^{(1)}$, we need to use different parameters for different values of $C$, to represent that patients with the flu have different temperatures than patients without it:

$$F^{(2)} \sim \mathcal{N}(F^{(2)} \mid \mu_c, \sigma_c^2).$$

Now we are ready to determine how likely a patient is to have the flu. To do this, we want to compute $P(C \mid F)$, but we know $P(C)$ and $P(F \mid C)$. This suggests that we should use Bayes' rule to determine the desired distribution. The word "Bayes" in the name "Naive Bayes" comes from this frequent use of Bayes' rule in conjunction with the model. We begin by applying Bayes' rule:

$$P(C \mid F) = \frac{P(C)\, P(F \mid C)}{P(F)}. \tag{3.1}$$

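We do not know $P(F)$. Fortunately, it is easy to compute:

$$\begin{aligned}
P(F) &= \sum_{c \in C} P(C = c, F) && \text{(by the sum rule)} \\
     &= \sum_{c \in C} P(C = c)\, P(F \mid C = c) && \text{(by the product rule).}
\end{aligned}$$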

Substituting this result back into equation 3.1, we obtain

$$P(C \mid F) = \frac{P(C)\, P(F \mid C)}{\sum_{c \in C} P(C = c)\, P(F \mid C = c)} = \frac{P(C) \prod_i P(F^{(i)} \mid C)}{\sum_{c \in C} P(C = c) \prod_i P(F^{(i)} \mid C = c)}$$

by the Naive Bayes assumptions. This is as far as we can simplify the expression for a general Naive Bayes model.
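The general expression can be sketched in a few lines of Python for the case where every feature is discrete. This is only an illustrative sketch: the function name `naive_bayes_posterior`, the table layout, and all of the probability values are assumptions made up for the example, not part of the model definition above.

```python
def naive_bayes_posterior(prior, likelihoods, observed):
    """Return P(C = c | F = observed) for every category c.

    prior:       dict mapping category -> P(C = c)
    likelihoods: dict mapping category -> list of dicts, one per feature,
                 each mapping a feature value -> P(F_i = value | C = c)
    observed:    list of observed feature values, one per feature
    """
    joint = {}
    for c, p_c in prior.items():
        p = p_c
        for table, value in zip(likelihoods[c], observed):
            p *= table[value]          # conditional independence given C
        joint[c] = p                   # P(C = c) * prod_i P(F_i | C = c)
    evidence = sum(joint.values())     # P(F), obtained by summing over c
    return {c: p / evidence for c, p in joint.items()}


# Hypothetical numbers: two categories, two binary features.
prior = {"c0": 0.7, "c1": 0.3}
likelihoods = {
    "c0": [{0: 0.9, 1: 0.1}, {0: 0.8, 1: 0.2}],
    "c1": [{0: 0.4, 1: 0.6}, {0: 0.3, 1: 0.7}],
}
posterior = naive_bayes_posterior(prior, likelihoods, observed=[1, 1])
```

With these made-up tables, the unnormalized values are 0.7 × 0.1 × 0.2 = 0.014 for `c0` and 0.3 × 0.6 × 0.7 = 0.126 for `c1`, so the posterior for `c1` is about 0.126 / 0.14 = 0.9.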

We can simplify the expression further by substituting in the definitions of the particular probability distributions we have defined for our flu diagnosis example:

$$P(C = c \mid F^{(1)} = f_1, F^{(2)} = f_2) = \frac{g(c)}{\sum_{c' \in C} g(c')}$$

where

$$g(c) = P(C = c)\, P(F^{(1)} = f_1 \mid C = c)\, p(F^{(2)} = f_2 \mid C = c).$$
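As a rough illustration, the ratio of g values can be computed directly. The parameter values below (`PHI_C`, `PHI_F`, `MU`, `SIGMA`) are hypothetical numbers chosen only to make the sketch runnable; they are not estimates from any data. Categories are indexed by 0 (no flu) and 1 (flu), and `f1` is 1 for a sore throat and 0 otherwise.

```python
import math

# Hypothetical parameter values, for illustration only.
PHI_C = 0.1                 # P(C = c1): prior probability of flu
PHI_F = {0: 0.2, 1: 0.7}    # P(sore throat | C = c), indexed by category
MU = {0: 37.0, 1: 38.5}     # mean temperature in degrees Celsius per category
SIGMA = {0: 0.4, 1: 0.6}    # temperature standard deviation per category


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def g(c, f1, f2):
    """Unnormalized posterior: P(C=c) P(F1=f1 | C=c) p(F2=f2 | C=c)."""
    p_c = PHI_C if c == 1 else 1.0 - PHI_C
    p_f1 = PHI_F[c] if f1 == 1 else 1.0 - PHI_F[c]
    return p_c * p_f1 * normal_pdf(f2, MU[c], SIGMA[c])


def posterior(c, f1, f2):
    """P(C = c | F1 = f1, F2 = f2), normalizing g over both categories."""
    return g(c, f1, f2) / (g(0, f1, f2) + g(1, f1, f2))


p_flu_feverish = posterior(1, f1=1, f2=39.0)   # sore throat, high temperature
p_flu_healthy = posterior(1, f1=0, f2=36.8)    # no sore throat, normal temperature
```

Under these assumed parameters, the first patient is assigned a much higher flu probability than the second, even though the prior probability of flu is low.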

Since $C$ only has two possible values in our example, we can simplify this to:

$$\begin{aligned}
P(C = c_1 \mid F^{(1)} = f_1, F^{(2)} = f_2) &= \frac{g(1)}{g(0) + g(1)} \\
&= \frac{1}{1 + g(0)/g(1)} \\
&= \frac{1}{1 + \exp(\log g(0) - \log g(1))} \\
&= \sigma(\log g(1) - \log g(0)).
\end{aligned} \tag{3.2}$$

To go further, let's simplify $\log g(i)$:

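$$\begin{aligned}
\log g(i) &= \log \left[ \left(\phi^{(C)}\right)^{i} \left(1 - \phi^{(C)}\right)^{1-i} \left(\phi^{(F)}_i\right)^{f_1} \left(1 - \phi^{(F)}_i\right)^{1-f_1} \sqrt{\frac{1}{2\pi\sigma_i^2}} \exp\left(-\frac{1}{2\sigma_i^2}(f_2 - \mu_i)^2\right) \right] \\
&= i \log \phi^{(C)} + (1 - i) \log\left(1 - \phi^{(C)}\right) + f_1 \log \phi^{(F)}_i + (1 - f_1) \log\left(1 - \phi^{(F)}_i\right) \\
&\quad - \frac{1}{2} \log 2\pi\sigma_i^2 - \frac{1}{2\sigma_i^2}(f_2 - \mu_i)^2.
\end{aligned}$$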

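Substituting this back into equation 3.2, we obtain

$$\begin{aligned}
P(C = c_1 \mid F^{(1)} = f_1, F^{(2)} = f_2) = \sigma\Big( &\log \phi^{(C)} - \log\left(1 - \phi^{(C)}\right) \\
&+ f_1 \log \phi^{(F)}_1 + (1 - f_1) \log\left(1 - \phi^{(F)}_1\right) \\
&- f_1 \log \phi^{(F)}_0 - (1 - f_1) \log\left(1 - \phi^{(F)}_0\right) \\
&- \frac{1}{2} \log 2\pi\sigma_1^2 + \frac{1}{2} \log 2\pi\sigma_0^2 \\
&- \frac{1}{2\sigma_1^2}(f_2 - \mu_1)^2 + \frac{1}{2\sigma_0^2}(f_2 - \mu_0)^2 \Big).
\end{aligned}$$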
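The sigmoid form can be checked numerically against direct normalization of g. The sketch below works in log space, which is one common way to evaluate such expressions; all parameter values are hypothetical, and the helper names (`log_g`, `sigmoid`, `p_flu`, `p_flu_direct`) are introduced only for this illustration.

```python
import math

# Hypothetical parameter values, for illustration only.
PHI_C = 0.1                 # P(C = c1)
PHI_F = {0: 0.2, 1: 0.7}    # P(sore throat | C = c_i)
MU = {0: 37.0, 1: 38.5}     # mean temperature per category (degrees Celsius)
SIGMA = {0: 0.4, 1: 0.6}    # temperature standard deviation per category


def log_g(i, f1, f2):
    """log g(i), written term by term as in the expansion of log g(i)."""
    var = SIGMA[i] ** 2
    return (i * math.log(PHI_C) + (1 - i) * math.log(1.0 - PHI_C)
            + f1 * math.log(PHI_F[i]) + (1 - f1) * math.log(1.0 - PHI_F[i])
            - 0.5 * math.log(2.0 * math.pi * var)
            - (f2 - MU[i]) ** 2 / (2.0 * var))


def sigmoid(a):
    """Logistic sigmoid, split by sign so that exp never receives a large positive argument."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def p_flu(f1, f2):
    """P(C = c1 | f1, f2) via sigma(log g(1) - log g(0))."""
    return sigmoid(log_g(1, f1, f2) - log_g(0, f1, f2))


def p_flu_direct(f1, f2):
    """The same quantity via g(1) / (g(0) + g(1)), for comparison."""
    g0 = math.exp(log_g(0, f1, f2))
    g1 = math.exp(log_g(1, f1, f2))
    return g1 / (g0 + g1)
```

For moderate inputs such as `p_flu(1, 38.0)` and `p_flu_direct(1, 38.0)`, the two functions agree to within floating-point tolerance.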

From this formula, we can read off various intuitive properties of the Naive Bayes classifier's behavior on this example problem. The probability of the patient having the flu grows like a sigmoidal curve. We move farther to the left as $f_2$, the patient's

Chapter 4 Numerical computation

TODO– redo intro and sections following spinoﬀ of math chapter

When implementing machine learning algorithms, we must keep in mind the fact that they are implemented on digital computers.

4.1 Overflow and underflow

The fundamental difficulty in performing continuous math on a digital computer is that we need to represent infinitely many real numbers with a finite number of bit patterns. This means that for almost all real numbers, we incur some approximation error when we represent the number in the computer. In many cases, this is just rounding error. Rounding error is problematic, especially when it compounds across many operations, and can cause algorithms that work in theory to fail in practice if they are not designed to minimize the accumulation of rounding error.
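A small Python sketch of how rounding error accumulates, assuming standard IEEE 754 double-precision floats (which is what Python's `float` uses on common platforms). The value 0.1 has no exact binary representation, so adding it repeatedly drifts away from the analytical answer.

```python
import math

# Add 0.1 ten times; analytically the total is exactly 1.0.
total = 0.0
for _ in range(10):
    total += 0.1

exactly_one = (total == 1.0)                  # False: rounding error has accumulated
approximately_one = math.isclose(total, 1.0)  # True: the error is tiny but nonzero

# math.fsum tracks the lost low-order bits and returns a correctly rounded sum.
compensated_total = math.fsum([0.1] * 10)
```

The design point is that the summation algorithm, not just the number format, determines how much error accumulates: the naive loop and `math.fsum` operate on the same inputs but give different results.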

One form of rounding error that is particularly devastating is underflow. Underflow occurs when numbers near zero are rounded to zero. If the result is later used as the denominator of a fraction, the computer encounters a divide-by-zero error rather than performing a valid floating point operation.

Another highly damaging form of numerical error is overflow. Overflow occurs when numbers with large magnitude are approximated as ∞ or −∞. No more valid arithmetic can be done at this point.
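Both failure modes are easy to reproduce. The sketch below uses plain Python floats and assumes IEEE 754 double precision. Note that Python itself raises `ZeroDivisionError` for float division by zero, whereas other environments may instead return infinity or NaN.

```python
import math

# Underflow: the true value of exp(-1000) is far below the smallest
# representable positive double, so the result is rounded to 0.0.
tiny = math.exp(-1000.0)
underflowed = (tiny == 0.0)

# Using the underflowed value as a denominator fails.
try:
    ratio = 1.0 / tiny
    division_failed = False
except ZeroDivisionError:
    division_failed = True

# Overflow: the product exceeds the largest representable double
# and is approximated as infinity.
big = 1e308 * 10.0
overflowed = math.isinf(big)

# Arithmetic on the overflowed value is no longer meaningful:
# inf - inf is "not a number".
result_is_nan = math.isnan(big - big)
```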

For an example of the need to design software implementations to deal with overflow and underflow, consider the softmax function:

$$\text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}.$$

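Consider what happens when all of the $x_i$ are equal to some constant $c$. Analytically, we can see that all of the outputs should be equal to $\frac{1}{n}$. Numerically, this may not occur when $c$ has large magnitude. If $c$ is very negative, then $\exp(c)$ will underflow. This means the denominator of the softmax will become 0, so the final result is undefined. When $c$ is very large and positive, $\exp(c)$ will overflow, again resulting in the expression as a whole being undefined. Both of these difficulties can be resolved by instead evaluating $\text{softmax}(z)$ where $z = x - \max_i x_i$. Subtracting the same scalar from every element leaves the analytical value of the softmax unchanged. After the shift, the largest argument to $\exp$ is 0, which rules out overflow, and at least one term in the denominator equals 1, which rules out a zero denominator.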
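A minimal sketch of the shifted evaluation in plain Python, assuming the shift is the largest element of x (a scalar subtracted from every entry, which leaves the analytical softmax unchanged). The function names `softmax_naive` and `softmax_stable` are hypothetical and exist only for this comparison.

```python
import math


def softmax_naive(x):
    """Direct evaluation of exp(x_i) / sum_j exp(x_j)."""
    exps = [math.exp(v) for v in x]   # raises OverflowError for large v
    total = sum(exps)                 # may be 0.0 if every term underflows
    return [e / total for e in exps]


def softmax_stable(x):
    """Evaluate softmax(z) with z = x - max(x)."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]   # every argument is <= 0
    total = sum(exps)                     # at least 1.0: the max element gives exp(0)
    return [e / total for e in exps]


very_negative = softmax_stable([-1000.0, -1000.0, -1000.0])
very_positive = softmax_stable([1000.0, 1000.0, 1000.0])
```

For both constant inputs, the shifted version returns values equal to 1/3 up to rounding, while `softmax_naive` fails on the same inputs: Python raises `ZeroDivisionError` in the very negative case and `OverflowError` in the very positive case.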

For the most part, we do not explicitly detail all of the numerical considerations involved in implementing the various algorithms described in this book. Implementors should keep numerical issues in mind when developing implementations. Many numerical issues can be avoided by using Theano (Bergstra et al., 2010; Bastien et al., 2012), a software package that automatically detects and stabilizes many common numerically unstable expressions that arise in the context of deep learning.
