Lecture 17: Parameter Learning with Missing Values


Outline:
- The problem
- Gradient-based methods
- Introduction to expectation maximization
- EM for learning parameters of Bayes nets

Incomplete data

- Some variables may not be assigned values in some instances. E.g., not all patients undergo all medical tests.
- Some variables may not be observed in any of the data items. E.g., viewers' preferences for a given show may depend on their metabolic cycle (what times they are awake).
- Problem: the fact that a value is missing may be indicative of what the value actually is. E.g., a patient did not undergo an X-ray because she had no bone problems, so the X-ray would likely have come out negative.

Missing at random assumption

- The probability that the value of a variable X is missing is independent of its actual value, given the observed data.
- If this is not true for a variable X, we can introduce an additional Boolean variable, Observed_X. This is always observable!

Hidden (latent) variables

- These are variables that we never observe. E.g., people get sick and we suspect it is a new virus, but no test for it exists yet.
- Why should we consider latent variables?
  - We might get a better theory of the domain.
  - Introducing hidden variables can provide a more compact model (we will come back to this later).
  - We might care about the value of the hidden variable itself (e.g., mixture models, clustering).

Why missing values make life hard

Consider again the simple network X -> Y (both binary), and samples (x^(m), y^(m)), m = 1, ..., M. The parameters that we need are theta_X, theta_{Y=1|X=1}, theta_{Y=1|X=0}. If all samples are complete, the log-likelihood is a nice function, for which it is easy to compute the maximum:

    L(theta : D) = theta_X^{N_{X=1}} (1 - theta_X)^{N_{X=0}}
                   * theta_{Y=1|X=1}^{N_{X=1,Y=1}} (1 - theta_{Y=1|X=1})^{N_{X=1,Y=0}}
                   * theta_{Y=1|X=0}^{N_{X=0,Y=1}} (1 - theta_{Y=1|X=0})^{N_{X=0,Y=0}}

log L(theta : D) is nice because L is a product, so we'll get a sum by taking the log.

Why missing values make life hard (2)

Suppose now that x^(m) is missing, and y^(m) = 1. What can we do? We can consider both possible settings, x^(m) = 0 and x^(m) = 1. For each setting we get a different likelihood. The overall likelihood should combine the individual likelihoods, based on how probable each setting is:

    P(D | theta) = P(x^(m)=0, y^(m)=1 | theta) P(D' | theta) + P(x^(m)=1, y^(m)=1 | theta) P(D' | theta)
                 = [ (1 - theta_X) theta_{Y=1|X=0} + theta_X theta_{Y=1|X=1} ] P(D' | theta)

where D' denotes the remaining (complete) instances. Now when we take the log, we have an ugly sum inside of it! If we have values missing in two instances, x^(m) and x^(l), we have to consider all possible combinations of values for those instances.
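To see this concretely, here is a minimal Python sketch (not from the lecture slides; the data and parameter values are made up) that evaluates the log-likelihood of the toy network X -> Y. Instances with a missing X are marginalized over both completions, which is exactly where the sum inside the log appears:

    import math

    # Hypothetical parameters of the network X -> Y
    theta = {"X": 0.6, "Y|X=1": 0.8, "Y|X=0": 0.3}

    def p_xy(x, y, th):
        """Joint probability P(X=x, Y=y) under the network X -> Y."""
        px = th["X"] if x == 1 else 1 - th["X"]
        py1 = th["Y|X=1"] if x == 1 else th["Y|X=0"]
        return px * (py1 if y == 1 else 1 - py1)

    def log_likelihood(data, th):
        """log P(D | theta); instances with a missing X are marginalized over X."""
        total = 0.0
        for x, y in data:
            if x is None:
                # the sum over the two completions ends up *inside* the log
                total += math.log(p_xy(0, y, th) + p_xy(1, y, th))
            else:
                total += math.log(p_xy(x, y, th))
        return total

    data = [(1, 1), (0, 0), (1, 0), (None, 1)]   # last instance has X missing
    print(log_likelihood(data, theta))

With complete data, every instance contributes the log of a single product of parameters, so the objective decomposes into independent terms; the marginalized instance breaks that decomposition.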

Effect of missing values

Complete data:
- Parameters of the model can be estimated locally and independently.
- The log-likelihood has a unique maximum.
- Under certain assumptions, there is a nice, closed-form solution for the parameters.
- We can use Bayesian priors.

Missing values:
- Parameters cannot be estimated independently.
- There are many local maxima! Maximizing the likelihood becomes a non-linear optimization problem.
- Closed-form solutions cannot be obtained.
- Bayesian priors become much too expensive.

Two solutions

1. Gradient ascent: follow the gradient of the likelihood with respect to the parameters.
2. Expectation maximization: use the current parameter setting to construct a local approximation of the likelihood which is “nice” and can be optimized easily.

Gradient Ascent

- What you would expect: move the parameters in the direction of the gradient of the log-likelihood (a small sketch follows after the EM overview below).
- Note: it is easy to compute the gradient at any parameter setting.
- Pros:
  - Flexible: allows different parameterizations at the nodes.
  - Closely related to methods for training neural nets.
- Cons:
  - The gradient needs to be projected onto the space of legal parameters (by normalizing, in our case).
  - Sensitive to parameters (e.g., learning rates).
  - Slow! This can be mitigated by using fancier methods (e.g., optimizing the learning rate).

Expectation Maximization (EM)

- A general-purpose method for learning from incomplete data, used not only for Bayes nets but whenever an underlying distribution is assumed.
- Main idea:
  - If we had sufficient statistics for the data (e.g., exact counts of the possible variable values), we could easily maximize the likelihood.
  - So in the case of missing values, we will "fantasize" what the data should look like based on the current parameter setting.
  - This means we compute expected sufficient statistics.
  - Then we improve the parameter setting based on these statistics.
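To illustrate the gradient-ascent option described above, here is a rough sketch (not the lecture's code) that climbs the observed-data log-likelihood of the toy X -> Y network using finite-difference gradients, and simply clips each parameter back into (0, 1) as a crude stand-in for the projection step; the data, learning rate, and iteration count are all hypothetical:

    import math

    data = [(1, 1), (0, 0), (1, 0), (None, 1)]   # hypothetical data; None marks a missing X
    names = ["X", "Y|X=1", "Y|X=0"]

    def p_xy(x, y, th):
        px = th["X"] if x == 1 else 1 - th["X"]
        py1 = th["Y|X=1"] if x == 1 else th["Y|X=0"]
        return px * (py1 if y == 1 else 1 - py1)

    def log_likelihood(th):
        return sum(math.log(p_xy(x, y, th)) if x is not None
                   else math.log(p_xy(0, y, th) + p_xy(1, y, th))
                   for x, y in data)

    theta = {n: 0.5 for n in names}              # initial guess
    lr, eps = 0.1, 1e-6
    for step in range(200):
        grad = {}
        for n in names:                          # finite-difference gradient
            bumped = dict(theta)
            bumped[n] += eps
            grad[n] = (log_likelihood(bumped) - log_likelihood(theta)) / eps
        for n in names:                          # ascend, then clip back into (0, 1)
            theta[n] = min(0.999, max(0.001, theta[n] + lr * grad[n]))
    print(theta, log_likelihood(theta))

A real implementation would use the analytic gradient and a smarter step-size rule, but the overall structure (gradient step followed by projection onto legal parameters) is the same.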

Outline of EM

1. Start with some initial parameter setting.
2. Repeat as long as desired:
   (a) Expectation (E) step: complete the data by assigning "values" to the missing items.
   (b) Maximization (M) step: compute the best parameter setting based on the completed data.

Note that once the data is completed, computing the log-likelihood and the new parameters is easy!

Two versions of the algorithm

- Hard EM: choose the completion that is most likely and include it in the data set.
- Soft EM: put a weight on each possible completion, equal to its probability, and use the weights as counts. This approach has the same flavor as likelihood weighting. These numbers are then used just like real counts, to provide a maximum likelihood estimate for the parameters.

Hard EM in our example

- Let N_{X=1}(D) be the number of times X = 1 in the instances of D. We'll use similar notation for all the other counts.
- Make a guess for the parameters of the network, theta.
- Compute P(X=1 | y^(m), theta) and P(X=0 | y^(m), theta). Note that this step requires exact inference! Hence, it's not cheap...
- E-step: complete the data using the most likely value for x^(m). E.g., if P(X=1 | y^(m), theta) > P(X=0 | y^(m), theta), then set x^(m) <- 1. Let D' be this new data set.

Hard EM in our example (continued)

- M-step: compute a new parameter vector theta' which maximizes the likelihood given the completed data: theta' = argmax_theta P(D' | theta).
- In our case, this means updating the counts to include the filled-in instance, e.g. N_{X=1} <- N_{X=1}(D_old) + 1 if x^(m) was set to 1, and similarly for the other counts.
- Then we fit the parameters as usual:

    theta_X = N_{X=1} / N
    theta_{Y=1|X=1} = N_{X=1,Y=1} / N_{X=1}
    theta_{Y=1|X=0} = N_{X=0,Y=1} / N_{X=0}
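A minimal hard-EM sketch for this example (an illustration added here, not the lecture's code; the data set and initial parameters are hypothetical, with None marking the missing value of X):

    data = [(1, 1), (0, 0), (1, 0), (None, 1)]        # hypothetical data; None marks a missing X
    theta = {"X": 0.5, "Y|X=1": 0.7, "Y|X=0": 0.4}    # initial guess

    def p_xy(x, y, th):
        px = th["X"] if x == 1 else 1 - th["X"]
        py1 = th["Y|X=1"] if x == 1 else th["Y|X=0"]
        return px * (py1 if y == 1 else 1 - py1)

    for it in range(10):
        # E-step: fill in each missing X with its most likely value
        # (exact inference is trivial in this two-node network)
        completed = []
        for x, y in data:
            if x is None:
                x = 1 if p_xy(1, y, theta) >= p_xy(0, y, theta) else 0
            completed.append((x, y))
        # M-step: maximum-likelihood estimates from the completed counts
        n = len(completed)
        n_x1 = sum(1 for x, _ in completed if x == 1)
        n_x1y1 = sum(1 for x, y in completed if x == 1 and y == 1)
        n_x0y1 = sum(1 for x, y in completed if x == 0 and y == 1)
        theta = {"X": n_x1 / n,
                 "Y|X=1": n_x1y1 / max(n_x1, 1),
                 "Y|X=0": n_x0y1 / max(n - n_x1, 1)}
    print(theta)

Each E-step commits to a single value for the missing X, so the M-step is just counting.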

Soft EM in our example

- Make a guess for the parameters of the network, theta.
- E-step: complete the data using both possible values of x^(m). Compute p_1 = P(X=1 | y^(m), theta) and p_0 = P(X=0 | y^(m), theta).
- Intuitively, it's as if we have two data sets now:
  - D_{x=1}, in which the missing value is set to 1, and
  - D_{x=0}, in which it is set to 0.
- For a parameter vector theta, we can compute the likelihood with respect to each data set, L(theta | D_{x=1}) and L(theta | D_{x=0}).
- Our probability that indeed x^(m) = 1 is p_1 (and similarly p_0 for x^(m) = 0).
- The expected likelihood for a parameter vector theta is:

    L(theta) = p_1 L(theta | D_{x=1}) + p_0 L(theta | D_{x=0})

Soft EM in our example (continued)

- M-step: compute a new parameter vector theta' which maximizes the expected likelihood given the completed data.
- In our case, this means the counts become expected counts, e.g.

    N_{X=1} <- N_{X=1}(D_old) + p_1
    N_{X=0} <- N_{X=0}(D_old) + p_0

  (why?)
- Then we fit the parameters as usual!
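The corresponding soft-EM sketch for the same hypothetical data: the missing X now contributes fractional counts p_1 and p_0 instead of a hard assignment:

    data = [(1, 1), (0, 0), (1, 0), (None, 1)]        # hypothetical data; None marks a missing X
    theta = {"X": 0.5, "Y|X=1": 0.7, "Y|X=0": 0.4}    # initial guess

    def p_xy(x, y, th):
        px = th["X"] if x == 1 else 1 - th["X"]
        py1 = th["Y|X=1"] if x == 1 else th["Y|X=0"]
        return px * (py1 if y == 1 else 1 - py1)

    for it in range(20):
        # E-step: expected counts, weighting each completion by P(X | y, theta)
        n_x1 = n_x1y1 = n_x0y1 = 0.0
        for x, y in data:
            if x is None:
                p1 = p_xy(1, y, theta) / (p_xy(1, y, theta) + p_xy(0, y, theta))
                weights = {1: p1, 0: 1 - p1}
            else:
                weights = {x: 1.0}
            for xv, w in weights.items():
                n_x1 += w * (xv == 1)
                n_x1y1 += w * (xv == 1) * (y == 1)
                n_x0y1 += w * (xv == 0) * (y == 1)
        # M-step: fit the parameters from the expected counts, just as with real counts
        n = len(data)
        theta = {"X": n_x1 / n,
                 "Y|X=1": n_x1y1 / n_x1,
                 "Y|X=0": n_x0y1 / (n - n_x1)}
    print(theta)

Note that the only change from the hard version is that the counts become expected counts.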

Comparison of hard EM and soft EM

- Soft EM does not commit to a particular value of the missing item. Instead, it considers all possible values, with some probability.
- This is a pleasing property, given the uncertainty in the value.
- The complexity of the two versions is the same:
  - Hard EM requires computing most probable values.
  - Soft EM requires computing conditional probabilities for completing the missing values.
  - Both are instances of exact inference! And can be performed, e.g., using a junction tree algorithm.
- Soft EM is almost always the method of choice (and often when people say "EM", they mean the soft version).

Example: Alarm network

- Five binary variables: E and B are the parents of A, A is the parent of C, and E is the parent of R (the classic earthquake/burglary/alarm/call/radio-report network).
- Suppose we have current parameter values theta_E, theta_B, theta_{A|E,B} (one for each combination of E and B), theta_{C|A}, and theta_{R|E}, and the example ⟨E=1, B=?, A=?, C=0, R=0⟩, in which B and A are missing.
- Then it is easy to compute the probabilities of the 4 possible assignments to the missing variables, e.g.:

    P(⟨E=1, B=1, A=1, C=0, R=0⟩ | theta) ∝ theta_E · theta_B · theta_{A|E=1,B=1} · (1 - theta_{C|A=1}) · (1 - theta_{R|E=1})

  where we normalize appropriately (so the four probabilities sum to 1).
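A small sketch of this completion step (assuming the structure E, B -> A, A -> C, E -> R implied by the parameters above; the numeric values below are placeholders, not the slide's numbers):

    from itertools import product

    # Placeholder parameter values; assumed structure: E, B -> A; A -> C; E -> R
    theta = {"E": 0.05, "B": 0.01,
             ("A", 1, 1): 0.95, ("A", 1, 0): 0.9, ("A", 0, 1): 0.3, ("A", 0, 0): 0.001,
             ("C", 1): 0.9, ("C", 0): 0.01,      # P(C=1 | A)
             ("R", 1): 0.4, ("R", 0): 0.0001}    # P(R=1 | E)

    def bern(p, v):
        return p if v == 1 else 1 - p

    def joint(e, b, a, c, r):
        """P(E=e, B=b, A=a, C=c, R=r) under the assumed structure."""
        return (bern(theta["E"], e) * bern(theta["B"], b) *
                bern(theta[("A", e, b)], a) * bern(theta[("C", a)], c) *
                bern(theta[("R", e)], r))

    # Observed instance: E=1, C=0, R=0, with B and A missing
    e, c, r = 1, 0, 0
    unnorm = {(b, a): joint(e, b, a, c, r) for b, a in product([0, 1], repeat=2)}
    z = sum(unnorm.values())
    posterior = {ba: p / z for ba, p in unnorm.items()}   # the 4 completions sum to 1
    print(posterior)

Each unnormalized term is just the product of the local conditional probabilities for one full assignment, and dividing by their sum gives the posterior over the four completions.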

Computing expected sufficient statistics

- Let x^(m) denote the m-th data item; in our example, x^(1) = ⟨E=1, B=?, A=?, C=0, R=0⟩.
- Example: consider the hypothetical (expected) count for A = 1:

    E[N_{A=1}] = sum_m sum_{x~^(m)} P(x~^(m) | x^(m), theta) · I(A=1 | x~^(m))

  where x~^(m) ranges over all possible completions of the m-th instance, and I(A=1 | x~^(m)) is an indicator variable equal to 1 if A = 1 in x~^(m), and 0 otherwise.

Computing expected sufficient statistics (cont.)

We have:

    E[N_{A=1}] = sum_m sum_{x~} P(x~ | x^(m), theta) I(A=1 | x~)

Let's look at the term for the first (incomplete) instance:

    sum_{x~} P(x~ | x^(1), theta) I(A=1 | x~)
      = P(⟨E=1, B=0, A=1, C=0, R=0⟩ | E=1, C=0, R=0, theta) + P(⟨E=1, B=1, A=1, C=0, R=0⟩ | E=1, C=0, R=0, theta)
      = P(A=1 | E=1, C=0, R=0, theta)
      = P(A=1 | x^(1), theta)

Computing expected counts (continued)

- Based on our previous analysis, the expected count will be:

    E[N_{A=1}] = sum_m P(A=1 | x^(m), theta)

- In general, an expected sufficient statistic E[N(z)], given parameters theta, will be:

    E[N(z)] = sum_m P(z | x^(m), theta)

- Note that we can compute sufficient statistics locally at each node, because (as seen in the example) they involve only the node and its parents.

Soft EM algorithm for computing Bayes net parameters

Choose an initial parameter setting theta, then repeat until satisfied:

1. Expectation (E) step:
   (a) For each data instance x^(m) and each random variable X_i, compute P(x_i, u_i | x^(m), theta) for each assignment of values (x_i, u_i) to X_i's family (X_i and its parents U_i).
   (b) Compute the expected sufficient statistics:

       E[N(x_i, u_i)] = sum_m P(x_i, u_i | x^(m), theta)

2. Maximization (M) step: compute the maximum likelihood estimates for the parameters:

       theta_{x_i | u_i} = E[N(x_i, u_i)] / E[N(u_i)]
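Putting the pieces together, here is a rough soft-EM sketch for the same alarm-style network (an illustration added here, not the lecture's code; the structure, data encoding, and initial parameters are all assumptions). Expected family counts are accumulated from the posterior over completions of each instance, then normalized in the M-step, mirroring the algorithm above:

    from itertools import product
    from collections import defaultdict

    VARS = ["E", "B", "A", "C", "R"]
    PARENTS = {"E": [], "B": [], "A": ["E", "B"], "C": ["A"], "R": ["E"]}   # assumed structure

    def bern(p, v):
        return p if v == 1 else 1 - p

    def joint(assign, theta):
        """P(full assignment) as a product of the local conditionals P(X=1 | parents)."""
        prob = 1.0
        for v in VARS:
            key = (v,) + tuple(assign[p] for p in PARENTS[v])
            prob *= bern(theta[key], assign[v])
        return prob

    def em_step(data, theta):
        # E-step: expected family counts E[N(x_i = 1, u_i)] and E[N(u_i)]
        num = defaultdict(float)   # expected count of (node = 1, parent values)
        den = defaultdict(float)   # expected count of (parent values)
        for inst in data:
            missing = [v for v in VARS if inst[v] is None]
            completions = []
            for vals in product([0, 1], repeat=len(missing)):
                full = dict(inst, **dict(zip(missing, vals)))
                completions.append((full, joint(full, theta)))
            z = sum(w for _, w in completions)
            for full, w in completions:
                w /= z                              # posterior P(completion | instance, theta)
                for v in VARS:
                    key = (v,) + tuple(full[p] for p in PARENTS[v])
                    den[key] += w
                    num[key] += w * (full[v] == 1)
        # M-step: maximum (expected) likelihood estimates; keep the old value if a
        # parent configuration was never seen
        return {key: (num[key] / den[key] if den[key] > 0 else theta[key]) for key in theta}

    # Hypothetical initial parameters P(node = 1 | parent values), keyed by (node, parent values...)
    theta = {("E",): 0.1, ("B",): 0.1,
             ("A", 1, 1): 0.5, ("A", 1, 0): 0.5, ("A", 0, 1): 0.5, ("A", 0, 0): 0.5,
             ("C", 1): 0.5, ("C", 0): 0.5, ("R", 1): 0.5, ("R", 0): 0.5}
    data = [{"E": 1, "B": None, "A": None, "C": 0, "R": 0},   # the instance from the example
            {"E": 0, "B": 0, "A": 0, "C": 0, "R": 0},
            {"E": 0, "B": 1, "A": 1, "C": 1, "R": 0}]
    for _ in range(10):
        theta = em_step(data, theta)
    print(theta)

For larger networks one would compute the family posteriors P(x_i, u_i | x^(m), theta) with a junction tree or variable elimination instead of enumerating all completions.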
