We now know enough probability theory that we can derive some simple machine learning tools.
The Naive Bayes model is a simple probabilistic model that is often used to recognize patterns. The model consists of one random variable $C$ representing a category, and a set of random variables $F = \{F^{(1)}, \ldots, F^{(n)}\}$ representing features of objects in each category. In this example, we’ll use Naive Bayes to diagnose patients as having the flu or not. $C$ can thus have two values: $c_0$ representing the category of patients who do not have the flu, and $c_1$ representing the category of patients who do have the flu. Suppose $F^{(1)}$ is the random variable representing whether the patient has a sore throat, with $f^{(1)}_0$ representing no sore throat, and $f^{(1)}_1$ representing a sore throat. Suppose $F^{(2)} \in \mathbb{R}$ is the patient’s temperature in degrees Celsius.
When using the Naive Bayes model, we assume that all of the features are independent from each other given the category:
$$P(C, F^{(1)}, \ldots, F^{(n)}) = P(C) \prod_i P(F^{(i)} \mid C).$$
These assumptions are very strong and unlikely to be true in practice, hence the name “naive.” Surprisingly, Naive Bayes often produces good predictions in practice, and is a good baseline model to start with when tackling a new problem.
Beyond these conditional independence assumptions, the Naive Bayes framework does not specify anything about the probability distribution. The specific choice of distributions is left up to the designer. In our flu example, let’s make $P(C)$ a Bernoulli distribution, with $P(C = c_1) = \phi^{(C)}$. We can also make $P(F^{(1)} \mid C)$ a Bernoulli distribution, with
$$P(F^{(1)} = f^{(1)}_1 \mid C = c) = \phi^{(F)}_c.$$
In other words, the Bernoulli parameter changes depending on the value of $C$. Finally, we need to choose the distribution over $F^{(2)}$. Since $F^{(2)}$ is real-valued, a normal distribution is a good choice. Because $F^{(2)}$ is a temperature, there are hard limits to the values it can take on; it cannot go below 0 K, for example. Fortunately, these values are so far from the values measured in human patients that we can safely ignore these hard limits. Values outside the hard limits will receive extremely low probability under the normal distribution so long as the mean and variance are set correctly. As with $F^{(1)}$, we need to use different parameters for different values of $C$, to represent that patients with the flu have different temperatures than patients without it:
$$F^{(2)} \sim \mathcal{N}(F^{(2)} \mid \mu_c, \sigma_c^2).$$
Now we are ready to determine how likely a patient is to have the flu. To do this, we want to compute $P(C \mid F)$, but we know $P(C)$ and $P(F \mid C)$. This suggests that we should use Bayes’ rule to determine the desired distribution. The word “Bayes” in the name “Naive Bayes” comes from this frequent use of Bayes’ rule in conjunction with the model. We begin by applying Bayes’ rule:
$$P(C \mid F) = \frac{P(C)\,P(F \mid C)}{P(F)}. \tag{3.1}$$
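We do not know $P(F)$. Fortunately, it is easy to compute:
$$\begin{aligned}
P(F) &= \sum_{c \in C} P(C = c, F) && \text{(by the sum rule)} \\
&= \sum_{c \in C} P(C = c)\,P(F \mid C = c) && \text{(by the product rule)}.
\end{aligned}$$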
Substituting this result back into equation 3.1, we obtain
$$P(C \mid F) = \frac{P(C)\,P(F \mid C)}{\sum_{c \in C} P(C = c)\,P(F \mid C = c)} = \frac{P(C) \prod_i P(F^{(i)} \mid C)}{\sum_{c \in C} P(C = c) \prod_i P(F^{(i)} \mid C = c)}$$
by the Naive Bayes assumptions. This is as far as we can simplify the expression for a general Naive Bayes model.
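As a minimal sketch of the general expression, the Python below normalizes the product of a prior and per-feature likelihoods over all categories. The function name `naive_bayes_posterior`, the category labels, the second binary feature ("fever"), and every probability value are hypothetical, chosen only for illustration.

```python
import math

def naive_bayes_posterior(prior, likelihoods):
    """Return P(C = c | F) for every category c.

    prior:       dict mapping category -> P(C = c)
    likelihoods: dict mapping category -> list of P(F^(i) = f_i | C = c),
                 one entry per observed feature value
    """
    # Numerator of Bayes' rule: P(C = c) * prod_i P(F^(i) | C = c)
    joint = {c: prior[c] * math.prod(likelihoods[c]) for c in prior}
    # Denominator P(F), obtained by summing the joint over all categories
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

# Hypothetical numbers, not taken from any data set.
prior = {"no_flu": 0.9, "flu": 0.1}
likelihoods = {
    "no_flu": [0.2, 0.05],  # P(sore throat | no flu), P(fever | no flu)
    "flu": [0.7, 0.60],     # P(sore throat | flu),    P(fever | flu)
}
posterior = naive_bayes_posterior(prior, likelihoods)
print(posterior)
```

Because the denominator is the sum of the numerators, the returned values always sum to 1 regardless of the parameters chosen.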
We can simplify the expression further by substituting in the definitions of the particular probability distributions we have defined for our flu diagnosis example:
$$P(C = c \mid F^{(1)} = f_1, F^{(2)} = f_2) = \frac{g(c)}{\sum_{c' \in C} g(c')}$$
where
$$g(c) = P(C = c)\,P(F^{(1)} = f_1 \mid C = c)\,p(F^{(2)} = f_2 \mid C = c).$$
Since C only has two possible values in our example, we can simplify this to:
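$$P(C = c_1 \mid F^{(1)} = f_1, F^{(2)} = f_2) = \frac{g(1)}{g(0) + g(1)} = \frac{1}{1 + \frac{g(0)}{g(1)}} = \frac{1}{1 + \exp(\log g(0) - \log g(1))} = \sigma(\log g(1) - \log g(0)), \tag{3.2}$$
where $g(i)$ abbreviates $g(c_i)$. To go further, let’s simplify $\log g(i)$:
$$\begin{aligned}
\log g(i) &= \log\left( (\phi^{(C)})^{i} (1 - \phi^{(C)})^{1-i} (\phi^{(F)}_i)^{f_1} (1 - \phi^{(F)}_i)^{1-f_1} \sqrt{\frac{1}{2\pi\sigma_i^2}} \exp\left(-\frac{1}{2\sigma_i^2}(f_2 - \mu_i)^2\right) \right) \\
&= i \log \phi^{(C)} + (1 - i)\log(1 - \phi^{(C)}) + f_1 \log \phi^{(F)}_i + (1 - f_1)\log(1 - \phi^{(F)}_i) + \frac{1}{2}\log\frac{1}{2\pi\sigma_i^2} - \frac{1}{2\sigma_i^2}(f_2 - \mu_i)^2.
\end{aligned}$$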
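Substituting this back into equation 3.2, we obtain
$$\begin{aligned}
P(C = c_1 \mid F^{(1)} = f_1, F^{(2)} = f_2) = \sigma\Big( &\log \phi^{(C)} - \log(1 - \phi^{(C)}) \\
&+ f_1 \log \phi^{(F)}_1 + (1 - f_1)\log(1 - \phi^{(F)}_1) \\
&- f_1 \log \phi^{(F)}_0 - (1 - f_1)\log(1 - \phi^{(F)}_0) \\
&- \frac{1}{2}\log 2\pi\sigma_1^2 + \frac{1}{2}\log 2\pi\sigma_0^2 \\
&- \frac{1}{2\sigma_1^2}(f_2 - \mu_1)^2 + \frac{1}{2\sigma_0^2}(f_2 - \mu_0)^2 \Big).
\end{aligned}$$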
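The following sketch checks this closed form numerically: it computes the flu posterior once by normalizing $g(0)$ and $g(1)$ directly, and once as the sigmoid of $\log g(1) - \log g(0)$, in which every term that comes from $\log g(0)$ enters with a negative sign. All parameter values (`phi_C`, `phi_F`, `mu`, `var`) are hypothetical assumptions chosen for illustration, not estimates from data.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def bernoulli(p, x):
    # P(X = x) for a Bernoulli variable with P(X = 1) = p
    return p if x == 1 else 1.0 - p

def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical parameters, indexed by category (0 = no flu, 1 = flu).
phi_C = 0.1           # P(C = c_1)
phi_F = (0.2, 0.7)    # P(sore throat | C = c)
mu = (36.8, 38.5)     # mean temperature in degrees Celsius
var = (0.16, 0.49)    # temperature variance

def g(c, f1, f2):
    return bernoulli(phi_C, c) * bernoulli(phi_F[c], f1) * normal_pdf(f2, mu[c], var[c])

def posterior_direct(f1, f2):
    return g(1, f1, f2) / (g(0, f1, f2) + g(1, f1, f2))

def posterior_sigmoid(f1, f2):
    # Argument of the sigmoid: log g(1) - log g(0), written out term by term.
    a = (math.log(phi_C) - math.log(1.0 - phi_C)
         + f1 * math.log(phi_F[1]) + (1 - f1) * math.log(1.0 - phi_F[1])
         - f1 * math.log(phi_F[0]) - (1 - f1) * math.log(1.0 - phi_F[0])
         - 0.5 * math.log(2.0 * math.pi * var[1])
         + 0.5 * math.log(2.0 * math.pi * var[0])
         - (f2 - mu[1]) ** 2 / (2.0 * var[1])
         + (f2 - mu[0]) ** 2 / (2.0 * var[0]))
    return sigmoid(a)

for f1 in (0, 1):
    for f2 in (36.5, 37.5, 39.0):
        print(f1, f2, posterior_direct(f1, f2), posterior_sigmoid(f1, f2))
```

Under these assumed parameters the two computations agree up to floating point error for each input, which is a quick way to catch a sign mistake in the expanded log-odds.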
From this formula, we can read off various intuitive properties of the Naive Bayes classifier’s behavior on this example problem. The probability of the patient having the flu grows like a sigmoidal curve. We move farther to the left as $f_2$, the patient’s
Chapter 4
Numerical computation
When implementing machine learning algorithms, we must keep in mind the fact that they are implemented on digital computers.
4.1 Overflow and underflow
The fundamental difficulty in performing continuous math on a digital computer is that we need to represent infinitely many real numbers with a finite number of bit patterns. This means that for almost all real numbers, we incur some approximation error when we represent the number in the computer. In many cases, this is just rounding error. Rounding error is problematic, especially when it compounds across many operations, and can cause algorithms that work in theory to fail in practice if they are not designed to minimize the accumulation of rounding error.
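A small illustration of accumulating rounding error, assuming IEEE 754 double precision as used by Python floats: the decimal value 0.1 has no exact binary representation, so adding it repeatedly drifts away from the exact answer. The explicit loop is deliberate, since some built-in summation routines compensate for this error; `math.fsum` is one standard-library function that tracks the lost low-order bits.

```python
import math

# Add 0.1 ten times with ordinary floating point addition.
total = 0.0
for _ in range(10):
    total += 0.1

print(total)         # 0.9999999999999999
print(total == 1.0)  # False: rounding error accumulated across the additions

# math.fsum keeps track of the rounding error of each partial sum.
accurate_total = math.fsum([0.1] * 10)
print(accurate_total == 1.0)  # True
```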
One form of rounding error that is particularly devastating is underflow. Underflow occurs when numbers near zero are rounded to zero. If the result is later used as the denominator of a fraction, the computer encounters a divide-by-zero error rather than performing a valid floating point operation.
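To make this concrete, here is a minimal sketch using Python floats (double precision). The value $\exp(-1000)$ is positive analytically but far smaller than the smallest representable double, so it is rounded to zero, and a later division by it fails. The variable names are illustrative only.

```python
import math

tiny = math.exp(-1000.0)  # analytically positive, but rounded to zero
print(tiny == 0.0)        # True

try:
    ratio = 1.0 / tiny
    division_failed = False
except ZeroDivisionError:
    # Python raises an exception here; other environments may
    # return inf or nan instead.
    division_failed = True

print(division_failed)    # True
```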
Another highly damaging form of numerical error is overflow. Overflow occurs when numbers with large magnitude are approximated as ∞ or −∞. No more valid arithmetic can be done at this point.
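The sketch below shows overflow with Python floats (double precision, where the largest finite value is roughly $1.8 \times 10^{308}$). Once a result has become infinite, further arithmetic on it can produce not-a-number values. Note that some library functions, such as Python's `math.exp`, report overflow by raising `OverflowError` rather than returning infinity.

```python
import math

big = 1e308
overflowed = big * 10.0               # exceeds the largest finite double
print(math.isinf(overflowed))         # True

difference = overflowed - overflowed  # inf - inf has no meaningful value
print(math.isnan(difference))         # True

try:
    math.exp(1000.0)
    exp_raised = False
except OverflowError:
    exp_raised = True
print(exp_raised)                     # True
```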
For an example of the need to design software implementations to deal with overflow and underflow, consider the softmax function:
$$\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}.$$
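Consider what happens when all of the $x_i$ are equal to some constant $c$. Analytically, we can see that all of the outputs should be equal to $\frac{1}{n}$. Numerically, this may not occur when $c$ has large magnitude. If $c$ is very negative, then $\exp(c)$ will underflow. This means the denominator of the softmax will become 0, so the final result is undefined. When $c$ is very large and positive, $\exp(c)$ will overflow, again resulting in the expression as a whole being undefined. Both of these difficulties can be resolved by instead evaluating $\mathrm{softmax}(z)$ where $z = x - \max_i x_i$. Subtracting the same scalar from every entry leaves the softmax unchanged analytically. After the shift, the largest argument to $\exp$ is 0, which rules out overflow, and at least one term in the denominator equals 1, which rules out a zero denominator.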
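The sketch below compares a direct evaluation of the softmax with a shifted one. The shift used here is the largest entry, $\max_i x_i$, which is an assumption of this example; unlike a shift by $\|x\|_\infty$, it also protects inputs whose entries are all very negative. The function names `softmax_naive` and `softmax_stable` are hypothetical.

```python
import math

def softmax_naive(x):
    # Direct translation of the definition; fails for large |x_i|.
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(x):
    # Shift so that the largest argument to exp is exactly 0.
    shift = max(x)
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)  # at least 1.0, since exp(0) = 1 for the largest entry
    return [e / total for e in exps]

print(softmax_stable([-1000.0] * 4))  # [0.25, 0.25, 0.25, 0.25]
print(softmax_stable([1000.0] * 4))   # [0.25, 0.25, 0.25, 0.25]

try:
    softmax_naive([-1000.0] * 4)      # every exp underflows to 0
    naive_underflow_failed = False
except ZeroDivisionError:
    naive_underflow_failed = True

try:
    softmax_naive([1000.0] * 4)       # math.exp reports the overflow
    naive_overflow_failed = False
except OverflowError:
    naive_overflow_failed = True
```

On inputs of moderate size, both versions return the same values up to rounding error, so the shifted form can replace the direct one without changing results.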
For the most part, we do not explicitly detail all of the numerical considerations involved in implementing the various algorithms described in this book. Implementors should keep numerical issues in mind when developing implementations. Many numerical issues can be avoided by using Theano (Bergstra et al., 2010; Bastien et al., 2012), a software package that automatically detects and stabilizes many common numerically unstable expressions that arise in the context of deep learning.