• No results found

machine learning

N/A
N/A
Protected

Academic year: 2021

Share "machine learning"

Copied!
45
0
0

Loading.... (view fulltext now)

Full text

(1)

CSE  575:  Sta*s*cal  Machine  Learning  

Jingrui  He   CIDSE,  ASU  

(2)
(3)

3  

1-­‐Nearest  Neighbor  

Four  things  make  a  memory  based  learner:  

1.  A  distance  metric            

 Euclidian  (and  many  more)  

2.  How  many  nearby  neighbors  to  look  at?          

                               One  

1.  A  weigh:ng  func:on  (op:onal)        

 Unused  

2.  How  to  fit  with  the  local  points?        

 Just  predict  the  same  output  as  the  nearest  neighbor.  

(4)

4  

Consistency  of  1-­‐NN  

•  Consider  an  es*mator  fn  trained  on  n  examples  

–  e.g.,  1-­‐NN,  regression,  ...  

•  Es*mator  is  consistent  if  true  error  goes  to  zero  as  amount  of  

data  increases  

–  e.g.,  for  no  noise  data,  consistent  if:  

•  Regression  is  not  consistent!  

–  Representa*on  bias  

•  1-­‐NN  is  consistent  (under  some  mild  fineprint)    

(5)

5  

(6)

6  

k-­‐Nearest  Neighbor  

Four  things  make  a  memory  based  learner:  

1.  A  distance  metric            

 Euclidian  (and  many  more)  

2.  How  many  nearby  neighbors  to  look  at?          

   k  

1.  A  weigh:ng  func:on  (op:onal)        

 Unused  

2.  How  to  fit  with  the  local  points?            

(7)

7  

k-­‐Nearest  Neighbor  (here  k=9)  

K-­‐nearest  neighbor  for  funcFon  fiHng  smooth  away  noise,  but  there  are  clear   deficiencies.  

(8)

8  

Curse  of  dimensionality  for   instance-­‐based  learning  

•  Must  store  and  retrieve  all  data!  

–  Most  real  work  done  during  tes*ng  

–  For  every  test  sample,  must  search  through  all  dataset  –  very  slow!   –  There  are  fast  methods  for  dealing  with  large  datasets,  e.g.,  tree-­‐based  

methods,  hashing  methods,    

(9)
(10)

w.x  =  ∑j  w(j)  x(j)   10  

Linear  classifiers  –  Which  line  is  beber?  

Data:  

(11)

11  

Pick  the  one  with  the  largest  margin!  

w.x  =  ∑j  w(j)  x(j)  

w.x

 +  b   =  0  

(12)

12  

Maximize  the  margin  

w.x

 +  b   =  0  

(13)

13  

But  there  are  a  many  planes…  

w.x

 +  b   =  0  

(14)

14   w.x

 +  b   =  0  

(15)

15   w.x  +  b   =  +1   w.x  +  b   =  -­‐1   w.x  +  b   =  0   margin  2γ x-­‐   x+  

Normalized  margin  –  Canonical   hyperplanes  

(16)

16  

Normalized  margin  –  Canonical   hyperplanes   w.x  +  b   =  +1   w.x  +  b   =  -­‐1   w.x  +  b   =  0   margin  2γ x-­‐   x+  

(17)

17  

Margin  maximiza*on  using   canonical  hyperplanes   w.x  +  b   =  +1   w.x  +  b   =  -­‐1   w.x  +  b   =  0   margin  2γ

(18)

18  

Support  vector  machines  (SVMs)  

w.x  +  b   =  +1   w.x  +  b   =  -­‐1   w.x  +  b   =  0   margin  2γ

•  Solve  efficiently  by  quadra*c   programming  (QP)  

–  Well-­‐studied  solu*on  algorithms  

(19)

19  

What  if  the  data  is  not  linearly   separable?  

Use  features  of  features     of  features  of  features….  

(20)

20  

What  if  the  data  is  s*ll  not  linearly   separable?  

•  Minimize  w.w  and  number  of  training   mistakes  

–  Tradeoff  two  criteria?  

•  Tradeoff  #(mistakes)  and  w.w  

–  0/1  loss  

–  Slack  penalty  C  

–  Not  QP  anymore  

–  Also  doesn’t  dis*nguish  near  misses  

(21)

21  

Slack  variables  –  Hinge  loss  

•  If  margin  ≥  1,  don’t  care  

(22)

22  

Side  note:  What’s  the  difference  

between  SVMs  and  logis*c  regression?  

SVM:   LogisFc  regression:  

(23)

23  

(24)

24  

Lagrange  mul*pliers  –  Dual  variables  

Moving  the  constraint  to  objecFve  funcFon   Lagrangian:  

(25)

25  

Lagrange  mul*pliers  –  Dual  variables  

(26)

26  

Dual  SVM  deriva*on  (1)  –     the  linearly  separable  case  

(27)

27  

Dual  SVM  deriva*on  (2)  –     the  linearly  separable  case  

(28)

28  

Dual  SVM  interpreta*on  

w.x

 +  b   =  0  

(29)

29  

Dual  SVM  formula*on  –     the  linearly  separable  case  

(30)

30  

Dual  SVM  deriva*on  –     the  non-­‐separable  case  

(31)

31  

Dual  SVM  formula*on  –     the  non-­‐separable  case  

(32)

32  

Why  did  we  learn  about  the  dual  SVM?  

•  There  are  some  quadra*c  programming  

algorithms  that  can  solve  the  dual  faster  than   the  primal  

•  But,  more  importantly,  the  “kernel  trick”!!!  

(33)

33  

Reminder  from  last  *me:  What  if  the   data  is  not  linearly  separable?  

Use  features  of  features     of  features  of  features….  

(34)

34  

Higher  order  polynomials  

number  of  input  dimensions  

nu m be r  o f  m on om ial  te rm s   d=2   d=4   d=3   m  –  input  features  

d  –  degree  of  polynomial  

grows  fast!   d  =  6,  m  =  100  

(35)

35  

Dual  formula*on  only  depends  on   dot-­‐products,  not  on  w!  

(36)

36  

(37)

37  

Finally:  the  “kernel  trick”!  

•  Never  represent  features  explicitly  

–  Compute  dot  products  in  closed  form  

•  Constant-­‐*me  high-­‐dimensional  dot-­‐ products  for  many  classes  of  features   •  Very  interes*ng  theory  –  Reproducing  

(38)

38  

Polynomial  kernels  

•  All  monomials  of  degree  d  in  O(d)  opera*ons:  

•  How  about  all  monomials  of  degree  up  to  d?  

– Solu*on  0:    

(39)

39  

Common  kernels  

•  Polynomials  of  degree  d  

•  Polynomials  of  degree  up  to  d  

•  Gaussian  kernels  

(40)

40  

Overfivng?  

•  Huge  feature  space  with  kernels,  what  about  

overfivng???  

– Maximizing  margin  leads  to  sparse  set  of  support   vectors    

– Some  interes*ng  theory  says  that  SVMs  search  for   simple  hypothesis  with  large  margin  

(41)

41  

What  about  at  classifica*on  *me  

•  For  a  new  input  x,  if  we  need  to  represent   Φ(x),  we  are  in  trouble!  

•  Recall  classifier:  sign(w.Φ(x)+b)  

(42)

42  

SVMs  with  kernels  

•  Choose  a  set  of  features  and  kernel  func*on  

•  Solve  dual  problem  to  obtain  support  vectors  

αi  

•  At  classifica*on  *me,  compute:  

(43)

43  

What’s  the  difference  between   SVMs  and  Logis*c  Regression?  

SVMs Logistic

Regression

Loss function Hinge loss Log-loss

High dimensional features with

kernels

(44)

44  

Kernels  in  logis*c  regression  

•  Define  weights  in  terms  of  support  vectors:  

(45)

45  

What’s  the  difference  between  SVMs   and  Logis*c  Regression?  (Revisited)  

SVMs Logistic

Regression

Loss function Hinge loss Log-loss

High dimensional features with

kernels

Yes! Yes!

Solution sparse Often yes! Almost always no!

Semantics of output

References

Related documents

Therefore the positive charge creates electric field away from the positive charge.. This is because the force my

 Complete Onboarding Decisions Form (ODF) and pre-work: To ensure a thorough understanding of your EHR vendor’s IHE standards-based capabilities, Truven and

He has been involved in training of Telkom staff in areas of personal financing, entrepreneurship and business development, communication skills and customer relationship

We can envisage a mechanism for which …rms growing at a lower than mean rate (hence lying on the left side) are more responsive to macroeconomic ‡uctuations and in expansion

The details you have provided will be used by Millennium Insurance Company Limited to process your request in accordance with the Data Protection Acts and other applicable laws..

By public transport from the Keleti and Déli railway station take metro line M2 towards Deák Ferenc tér station, transfer to metro line M3 in the direction of Újpest-Központ until

Exercise 75 is a guitar solo using the minor pentatonic scale with CD track 65 as the backing track. It uses a call-and-response structure and shows how just a few riffs can be used