(1)

Online Learning for

Big Data Analytics

Irwin King and Haiqin Yang

Dept. of Computer Science and Engineering

The Chinese University of Hong Kong

(2)

Outline

Introduction

Big data: definition and history

Online learning and its applications

Online Learning Algorithms

Perceptron

Online linear classifiers

Online non-sparse learning

Online sparse learning

Online nonlinear classifiers

Online learning with kernels

Kernelized online imbalanced learning

(3)

Outline

Introduction

Big data: definition and history

Online learning and its applications

Online Learning Algorithms

Perceptron

Online linear classifiers

Online non-sparse learning

Online sparse learning

Online nonlinear classifiers

Online learning with kernels

Kernelized online imbalanced learning

(4)

What is Big Data?

Big Data refers to datasets that grow so large and complex that it is difficult to capture, store, manage, share, analyze, and visualize them within current computational architectures.

(5)

The First Big Data Challenge

1880 US census

Data: 50 million people

Features (5): age, gender (sex), occupation, education level, no. of insane people in household

Processing time: > 7 years

(6)

The First Big Data Solution

Hollerith Tabulating System

Punched cards – 80 variables

Used for the 1890 census

6 weeks instead of 7+ years

(7)

Big Science

The International Geophysical Year

An international scientific project

Lasted from Jul. 1, 1957 to Dec. 31, 1958

A synoptic collection of observational data on a global scale

Implications: big budgets, big staffs, big machines, big …

(8)

Summary of Big Science

Laid the foundation for ambitious projects

International Biological Program

Long Term Ecological Research Network

Ended in 1974

Many participants viewed it as a failure

Nevertheless, it was a success

Transformed the way of processing data

Realized the original incentives

(9)

Lessons from Big Science

Spawned new big data projects

Weather prediction

Physics research (supercollider data analytics)

Astronomy images (planet detection)

Medical research (drug interaction)

Businesses latched onto its techniques, methodologies, and objectives

(10)

Big Science vs. Big Business

Common

Need technologies to work with data

Use algorithms to mine data

Big science

Source: Experiments and research conducted in

controlled environments

Goals: To answer questions, or prove theories

Big business

Source: Transactions by nature, with little control

Goals: Discover new opportunities, measure efficiency,

(11)

Big Data is Everywhere

Lots of data is being collected and warehoused

Science experiments

Web data, e-commerce

Purchases at department/grocery stores

Bank/credit card transactions

(12)

Big Data from Science

CERN - Large Hadron Collider

~10 PB/year in 2008; ~1000 PB in ~10 years; a collaboration of 2500 physicists

Large Synoptic Survey Telescope

~5-10 PB/year at start in 2012; ~100 PB by 2025

Pan-STARRS (Haleakala, Hawaii)

now 800 TB/year; soon: 4 PB/year

(13)
(14)
(15)

Characteristics and Challenges

(16)

Big Data Analytics

Definition: A process of inspecting, cleaning, transforming, and modeling big data with the goal of discovering useful information, suggesting conclusions, and supporting decision making

Connection to data mining

Analytics include both data analysis (mining) and communication (to guide decision making)

Analytics is not so much concerned with individual analyses or analysis steps, but with the entire methodology

(17)

Outline

Introduction

Big data: definition and history

Online learning and its applications

Online Learning Algorithms

Perceptron

Online linear classifiers

Online non-sparse learning

Online sparse learning

Online nonlinear classifiers

Online learning with kernels

Kernelized online imbalanced learning

(18)

Challenges and Aims

Challenges: Capturing, storing, searching, sharing, analyzing, and visualizing

Big data is not just about size

Find insights from complex, noisy, heterogeneous, longitudinal, and voluminous data

Aim to answer questions that were previously unanswered

This tutorial focuses on online learning techniques

(19)

Learning Techniques Overview

Learning paradigms

Supervised learning

Semi-supervised learning

Transductive learning

Unsupervised learning

Universum learning

Transfer learning

(20)

What is Online Learning?

Batch/Offline Learning

Observe a batch of training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$

Learn a model from them

Make accurate predictions on new samples

Online Learning

Observe a sequence of data $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_t, y_t)$

Learn a model incrementally as new instances appear

Make the sequence of online predictions accurately

(Protocol per round: the user supplies $\mathbf{x}_t$, the learner makes a prediction, observes the true response $y_t$, and updates the model.)

Online learning is the process of answering a sequence of questions given (maybe partial) knowledge of the correct answers to previous questions and possibly additional available information. [Shalev-Shwartz, 2012]

(21)

Online Prediction Algorithms

An initial prediction rule $f_0(\cdot)$

For $t = 1, 2, \ldots$

We observe $\mathbf{x}_t$ and make a prediction $f_{t-1}(\mathbf{x}_t)$

We observe the true outcome $y_t$ and then compute a loss $l(f_{t-1}(\mathbf{x}_t), y_t)$

The online algorithm updates the prediction rule using the new example
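The protocol above can be sketched in a few lines of Python. The snippet is illustrative only: the 0/1 loss and the perceptron-style update are my own choices for concreteness, not prescribed by the tutorial.

```python
import numpy as np

def online_protocol(stream, d, eta=0.1):
    """Run the protocol: predict, observe the true label, suffer loss, update."""
    w = np.zeros(d)                      # initial prediction rule f_0
    total_loss = 0.0
    for x_t, y_t in stream:              # t = 1, 2, ...
        y_hat = np.sign(w @ x_t) or 1.0  # prediction f_{t-1}(x_t)
        loss = float(y_hat != y_t)       # 0/1 loss l(f_{t-1}(x_t), y_t)
        total_loss += loss
        if loss:                         # update the rule with the new example
            w += eta * y_t * x_t
    return w, total_loss

# Example: a toy stream of linearly separable points
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X @ np.array([1., -2., 0.5, 0., 3.]))
w, cum_loss = online_protocol(zip(X, y), d=5)
print("cumulative mistakes:", cum_loss)
```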

(22)

Online Prediction Algorithms

The total error of the algorithm is $\sum_{t=1}^{T} l(f_{t-1}(\mathbf{x}_t), y_t)$

Goal: make this error as small as possible

Predicting the unknown future one step at a time is similar to minimizing generalization error

(23)

Regret Analysis

$f^*(\cdot) := \arg\min_{f \in \mathcal{H}} \sum_{t=1}^{T} l(f(\mathbf{x}_t), y_t)$: the optimal prediction function from a class $\mathcal{H}$, e.g., the class of linear classifiers, with the minimum error after seeing all instances

Regret for the online learning algorithm:

$\mathrm{Regret} = \frac{1}{T} \sum_{t=1}^{T} \big[ l(f_{t-1}(\mathbf{x}_t), y_t) - l(f^*(\mathbf{x}_t), y_t) \big]$

We want the regret to be as small as possible!

(24)

Why Low Regret?

Regret for the online learning algorithm:

$\mathrm{Regret} = \frac{1}{T} \sum_{t=1}^{T} \big[ l(f_{t-1}(\mathbf{x}_t), y_t) - l(f^*(\mathbf{x}_t), y_t) \big]$

Advantages

Do not lose much from not knowing future events

Perform almost as well as someone who observes the entire sequence and picks the best prediction strategy in hindsight

Compete with a changing environment

(25)

Advantages of Online Learning

Meets many applications where data arrive sequentially while predictions are required on-the-fly

Avoids re-training when adding new data

Applicable in adversarial and competitive environments (e.g., spam filtering, stock market)

Strong adaptability to changing environments

High efficiency and excellent scalability

Simple to understand and easy to implement

Easy to parallelize

(26)

Where to Apply Online Learning?

(Diagram: online learning applied to social media, Internet security, and finance.)

(27)

Online Learning for Social Media

(28)

Online Learning for Social Media

(Diagram: online learning applied to social media, Internet security, and finance.)

(29)

Online Learning for Internet Security

Electronic business sectors

Spam email filtering

Fraudulent credit card transaction detection

Network intrusion detection systems

(30)

Online Learning for Social Media

(Diagram: online learning applied to social media, Internet security, and finance.)

(31)

Online Learning for Financial Decision

Financial decision

Online portfolio selection

Sequential investment

(32)

Outline

Introduction

Big data: definition and history

Online learning and its applications

Online Learning Algorithms

Perceptron

Online linear classifiers

Online non-sparse learning

Online sparse learning

Online nonlinear classifiers

Online learning with kernels

Kernelized online imbalanced learning

(33)

Outline

Introduction

Big data: definition and history

Online learning and its applications

Online Learning Algorithms

Perceptron

Online linear classifiers

Online non-sparse learning

Online sparse learning

Online nonlinear classifiers

Online learning with kernels

Kernelized online imbalanced learning

(34)

Perceptron Algorithm

One of the oldest machine learning algorithms

Online algorithm for learning a linear threshold classifier with small errors [F. Rosenblatt, 1958]

Activation function:

$o = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}$, where $\sum_{i=0}^{n} w_i x_i = \mathbf{w}^T \mathbf{x}$

(Diagram: inputs $1, x_1, \ldots, x_n$ with weights $w_0, w_1, \ldots, w_n$.)
(35)

Perceptron Algorithm

Goal: find a linear classifier with small error

[F. Rosenblatt, 1958]

If there is no error, keep the same weights; otherwise, update them.
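A minimal sketch of this update rule in Python is shown below; the multi-epoch wrapper and the bias handling are my additions for running on a fixed dataset, while a purely online run would make a single pass.

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Rosenblatt's perceptron: keep the weights when correct, update on mistakes.
    X: (N, d) inputs, y: labels in {-1, +1}. A bias term is added as x_0 = 1."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the constant input x_0 = 1
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_t, y_t in zip(Xb, y):
            if y_t * (w @ x_t) <= 0:                 # misclassified (or on the boundary)
                w += y_t * x_t                       # w <- w + y_t * x_t
                mistakes += 1
        if mistakes == 0:
            break
    return w
```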

(36)

Perceptron: Geometric View

Initial weight vector $\mathbf{w}_0$.

(37)

Perceptron: Geometric View

$(\mathbf{x}_1, y_1(=1))$ is misclassified by $\mathbf{w}_0$.

(38)

Perceptron: Geometric View

Update: $\mathbf{w}_1 = \mathbf{w}_0 + \mathbf{x}_1 y_1$

(39)

Perceptron: Geometric View

$t = 2$: $(\mathbf{x}_2, y_2(=-1))$ is misclassified by $\mathbf{w}_1$.

(40)

Perceptron: Geometric View

Update: $\mathbf{w}_2 = \mathbf{w}_1 + \mathbf{x}_2 y_2$

Continue…

(41)

Perceptron Mistake Bound

Consider data that are well separated by some $\mathbf{w}$, i.e., $\mathbf{w}^T \mathbf{x}_i y_i > 0$ for all $i$.

Define the margin $\gamma = \min_i \frac{|\mathbf{w}^T \mathbf{x}_i|}{\|\mathbf{w}\|_2}$ and $R^2 = \sup_i \|\mathbf{x}_i\|_2^2$.

The larger the margin, the more confident the separation; the larger the norm of $\mathbf{x}$, the larger the mistake bound.

The number of mistakes the perceptron makes is at most $R^2 / \gamma^2$.

(42)

Proof of Perceptron Mistake Bound

Proof sketch [Novikoff, 1963]: Let $\mathbf{v}_k$ be the hypothesis before the $k$-th mistake, and assume the $k$-th mistake occurs on the input example $(\mathbf{x}_i, y_i)$.

First, each mistake increases the alignment with the separator: $\mathbf{v}_{k+1} \cdot \mathbf{w}$ grows by at least $\gamma \|\mathbf{w}\|_2$ per mistake. Second, each mistake increases the squared norm by at most $R^2$: $\|\mathbf{v}_{k+1}\|_2^2 \le \|\mathbf{v}_k\|_2^2 + R^2$. Combining the two bounds yields at most $R^2 / \gamma^2$ mistakes, where $\gamma = \min_i \frac{|\mathbf{w}^T \mathbf{x}_i|}{\|\mathbf{w}\|_2}$ and $R^2 = \sup_i \|\mathbf{x}_i\|_2^2$.
(43)

Outline

Introduction

Big data: definition and history

Online learning and its applications

Online Learning Algorithms

Perceptron

Online linear classifiers

Online non-sparse learning

Online sparse learning

Online nonlinear classifiers

Online learning with kernels

Kernelized online imbalanced learning

(44)

Overview

Online Learning

Linear Classifiers

Nonlinear Classifiers

(45)

Online/Stochastic Gradient Descent

Online gradient descent

Stochastic gradient descent

(Both follow the same protocol: the user supplies $\mathbf{x}_t$, the learner predicts $f_{t-1}(\mathbf{x}_t)$, observes $y_t$, and updates $f_t$.)

(46)

Outline

Introduction

Big data: definition and history

Online learning and its applications

Online Learning Algorithms

Perceptron

Online linear classifiers

Online non-sparse learning

Online sparse learning

Online nonlinear classifiers

Online learning with kernels

Kernelized online imbalanced learning

(47)

Online Linear Classifiers

Models

Classification: $f(\mathbf{x}) = \mathrm{sgn}(\mathbf{w}^T \mathbf{x})$

Categories

Non-sparse models: the weight vector $\mathbf{w}$ is non-sparse

Sparse models: the weight vector contains zero elements

(48)

Online Non-Sparse Learning

First-order learning methods

Online gradient descent (Zinkevich, 2003)

Passive aggressive learning (Crammer et al., 2006)

Others (including but not limited to)

ALMA: Approximate Large Margin Algorithm (Gentile, 2001)

ROMMA: Relaxed Online Maximum Margin Algorithm (Li & Long, 2002)

MIRA: Margin Infused Relaxed Algorithm (Crammer & Singer, 2003)

(49)

Online Gradient Descent (OGD)

Online convex optimization [Zinkevich, 2003]

Consider a convex objective function $f : S \to \mathbb{R}$, where $S \subseteq \mathbb{R}^n$ is a bounded convex set

Update by Online Gradient Descent (OGD) or Stochastic Gradient Descent (SGD):

$\mathbf{w}_{t+1} = \Pi_S(\mathbf{w}_t - \eta \nabla f(\mathbf{w}_t))$

i.e., a gradient descent step followed by a projection onto $S$, where $\eta$ is a learning rate

Provides a framework to prove regret bounds for online convex optimization

(50)

Online Gradient Descent (OGD)

For $t = 1, 2, \ldots$

An unlabeled sample $\mathbf{x}_t$ arrives

Make a prediction based on the existing weights: $\hat{y}_t = \mathrm{sgn}(\mathbf{w}_t^T \mathbf{x}_t)$

Observe the true class label $y_t \in \{-1, +1\}$

Update the weights by $\mathbf{w}_{t+1} = \Pi_S(\mathbf{w}_t - \eta \nabla f(\mathbf{w}_t))$

$O(\sqrt{T})$ regret bound [Zinkevich, 2003]
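A short sketch of the OGD loop, assuming a hinge loss and an L2-ball feasible set (both illustrative choices, not prescribed by the slide):

```python
import numpy as np

def ogd_hinge(stream, d, eta=0.1, radius=10.0):
    """Online Gradient Descent sketch in the spirit of [Zinkevich, 2003]:
    w_{t+1} = Pi_S(w_t - eta * gradient), with S an L2 ball of the given radius."""
    w = np.zeros(d)
    for x_t, y_t in stream:
        # the prediction would be sgn(w @ x_t); the update uses the hinge-loss subgradient
        grad = -y_t * x_t if y_t * (w @ x_t) < 1.0 else np.zeros(d)
        w = w - eta * grad                     # gradient step
        norm = np.linalg.norm(w)
        if norm > radius:                      # projection Pi_S onto the ball
            w *= radius / norm
    return w
```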

(51)

Passive-Aggressive Online Learning

Each example defines a set of consistent hypotheses. The new weight vector is set to be the projection of the current vector onto this set.

Variants: Classification, Regression, Uniclass

(52)

Passive-Aggressive

(53)

Passive-Aggressive Online Learning

PA (Binary classification)

PA-I (C-SVM)

PA-II (Relaxed C-SVM)

Closed-form solution

[Crammer et al., 2006]

(54)

Passive-Aggressive Online Learning

Algorithm

Objective

Closed-form solution
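For concreteness, the PA-I closed-form step can be sketched as below; plain PA drops the cap C, and PA-II replaces the denominator by $\|\mathbf{x}_t\|^2 + 1/(2C)$.

```python
import numpy as np

def pa1_update(w, x_t, y_t, C=1.0):
    """One Passive-Aggressive (PA-I) step [Crammer et al., 2006]:
    tau = min(C, loss / ||x||^2),  w <- w + tau * y * x."""
    loss = max(0.0, 1.0 - y_t * (w @ x_t))        # hinge loss on the current example
    if loss == 0.0:
        return w                                   # passive: already correct with margin
    tau = min(C, loss / (x_t @ x_t))               # aggressive, but capped by C
    return w + tau * y_t * x_t
```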

(55)

Online Non-Sparse Learning

First-order methods

Learn a linear weight vector (first order) of the model

Pros and cons

Simple and easy to implement

Efficient and scalable for high-dimensional data

Relatively slow convergence rate

(56)

Online Sparse Learning

Motivations

Space constraint: RAM overflow

Test-time constraint

How to induce sparse solutions?

(57)

Online Sparse Learning

Objective function

Problem in online learning

Standard stochastic gradient descent methods do not yield sparse solutions

Some representative work

Truncated gradient (Langford et al., 2009)

FOBOS: Forward Looking Subgradients (Duchi & Singer, 2009)

Dual averaging (Xiao, 2009)

etc.

(58)

Truncated Gradient

Objective function

Stochastic gradient descent (SGD)

Simple coefficient rounding when the coefficient is small

Truncated gradient: impose sparsity by rounding the solution of stochastic gradient descent [Langford et al., 2009]

(59)

Truncated Gradient

Simple coefficient rounding vs. Less aggressive truncation

(60)

Truncated Gradient

Loss function: logistic, hinge, or least squares

The amount of shrinkage is measured by a gravity parameter; when the gravity is zero, the update rule is identical to standard SGD [Langford et al., 2009]

The truncation can be performed every K online steps
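A sketch of the truncated-gradient update under the squared loss; the gravity value, the infinite rounding threshold, and scaling the truncation amount by eta*K are my assumptions for illustration.

```python
import numpy as np

def truncate(w, gravity, theta=np.inf):
    """Truncation operator in the style of Langford et al. (2009): shrink small
    coefficients toward zero by `gravity`, leaving coefficients above `theta` untouched."""
    small = np.abs(w) <= theta
    shrunk = np.sign(w) * np.maximum(0.0, np.abs(w) - gravity)
    return np.where(small, shrunk, w)

def truncated_gradient(stream, d, eta=0.1, gravity=0.01, K=10):
    """SGD on the squared loss with truncation applied every K online steps."""
    w = np.zeros(d)
    for t, (x_t, y_t) in enumerate(stream, start=1):
        grad = (w @ x_t - y_t) * x_t          # gradient of 0.5*(w^T x - y)^2
        w -= eta * grad
        if t % K == 0:                        # truncate only every K steps
            w = truncate(w, eta * K * gravity)
    return w
```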

(61)

Truncated Gradient

Regret bound: with a suitable choice of the learning rate, truncated gradient achieves a sublinear regret bound [Langford et al., 2009]

(62)

FOBOS

FOrward-Backward Splitting

(63)

FOBOS

Objective function

Repeat

I. Unconstrained (stochastic) subgradient/gradient step on the loss

II. Incorporate the regularization

Similar to

- Forward-backward splitting (Lions and Mercier, 1979)

- Composite gradient methods (Wright et al., 2009; Nesterov, 2007)

(64)

FOBOS: Step I

Objective function

Unconstrained (stochastic) subgradient/gradient step on the loss

(65)

FOBOS: Step II

Objective function

Incorporate the regularization [Duchi & Singer, 2009]: stay close to the interim vector while inducing sparsity
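A minimal FOBOS iteration for the L1 regularizer, assuming a hinge loss; in this case Step II reduces to soft-thresholding.

```python
import numpy as np

def soft_threshold(v, tau):
    """Closed-form solution of Step II for the L1 regularizer:
    argmin_w 0.5*||w - v||^2 + tau*||w||_1."""
    return np.sign(v) * np.maximum(0.0, np.abs(v) - tau)

def fobos_step(w, x_t, y_t, eta=0.1, lam=0.01):
    """One FOBOS iteration (sketch): a subgradient step on the hinge loss,
    then the proximal step that induces sparsity [Duchi & Singer, 2009].
    The hinge loss and the parameter values are illustrative."""
    grad = -y_t * x_t if y_t * (w @ x_t) < 1.0 else np.zeros_like(w)
    v = w - eta * grad                        # Step I: unconstrained (sub)gradient step
    return soft_threshold(v, eta * lam)       # Step II: stay close to v, induce sparsity
```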

(66)

Forward Looking Property

The optimum satisfies an optimality condition that combines the current subgradient of the loss with a forward subgradient of the regularization (current loss, forward regularization); hence the "forward looking" property.

(67)

Batch Convergence and Online Regret

Setting the step size appropriately yields batch convergence

Online (average) regret bounds of order $O(1/\sqrt{T})$

(68)

High Dimensional Efficiency

Input space is sparse but huge

Need to perform lazy updates to the weights

Proposition: the following are equivalent

(69)

Dual Averaging Method

Objective function

Problem: Truncated gradient does not produce a truly sparse weight vector due to the small learning rate

Fix: Dual averaging keeps two state representations [Xiao, 2010]

- the weight parameter $\mathbf{w}_t$

- the average (sub)gradient vector $\bar{\mathbf{g}}_t = \frac{1}{t} \sum_{i=1}^{t} \nabla f_i(\mathbf{w}_i)$

(70)

Dual Averaging Method

Algorithm [Xiao, 2010]

Three main steps

Closed-form solution for $\mathbf{w}_{t+1}$

Advantage: sparsity of the weight vector

Disadvantage: keeps a non-sparse (sub)gradient vector
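A sketch of L1-regularized dual averaging; the hinge loss and the sqrt(t) proximal coefficient are assumptions consistent with, but not copied from, [Xiao, 2010].

```python
import numpy as np

def rda_l1(stream, d, lam=0.01, gamma=1.0):
    """L1-regularized dual averaging (sketch).  Keeps the average (sub)gradient
    g_bar and solves a simple closed-form problem for w_{t+1} each round."""
    w = np.zeros(d)
    g_bar = np.zeros(d)
    for t, (x_t, y_t) in enumerate(stream, start=1):
        g = -y_t * x_t if y_t * (w @ x_t) < 1.0 else np.zeros(d)  # hinge subgradient
        g_bar = (t - 1) / t * g_bar + g / t          # running average of subgradients
        # closed-form minimizer of <g_bar, w> + lam*||w||_1 + (gamma/(2*sqrt(t)))*||w||^2
        shrunk = np.sign(g_bar) * np.maximum(0.0, np.abs(g_bar) - lam)
        w = -(np.sqrt(t) / gamma) * shrunk           # entries with |g_bar_i| <= lam stay 0
    return w
```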

(71)

Convergence and Regret

Average regret

Theoretical bounds: $O(\sqrt{T})$ regret bound

(72)

Comparison

FOBOS

Uses the current subgradient $\mathbf{g}_t$

Local Bregman divergence:

$\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \langle \mathbf{g}_t, \mathbf{w} \rangle + \Psi(\mathbf{w}) + \frac{1}{2\eta_t} \|\mathbf{w} - \mathbf{w}_t\|_2^2$

Coefficient $1/\eta_t = \Theta(\sqrt{t})$

Equivalent to the TG method when $\Psi(\mathbf{w}) = \lambda \|\mathbf{w}\|_1$

Dual Averaging

Uses the average subgradient $\bar{\mathbf{g}}_t$

Global proximal function:

$\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \langle \bar{\mathbf{g}}_t, \mathbf{w} \rangle + \Psi(\mathbf{w}) + \frac{\beta_t}{t} h(\mathbf{w})$

Coefficient $\beta_t / t = \Theta(1/\sqrt{t})$

(73)
(74)

Empirical Comparison

(75)

Variants of Online Sparse Learning Models

Other existing work

Online learning for Group Lasso (Yang et al., 2010)

Online learning for multi-task feature selection (Yang et al., 2013)

(76)

Efficient Online Learning for Multi-Task Feature Selection

[Yang et al., ACM TKDD 2013]

(77)

Online Learning for Multi-Task Feature Selection

Observations

Limited labeled data in each task

Related tasks may help each other

Features among tasks are redundant or irrelevant

Data come in sequence

aMTFS

Model: integrating multiple tasks with specific regularizations

Solution: online learning for multi-task feature selection

Challenge: how to update the model adaptively with the sequentially arriving data

(78)

Multi-task Feature Selection

Data: i.i.d. observations $\mathcal{D} = \bigcup_{q=1}^{Q} \mathcal{D}_q$, where $\mathcal{D}_q = \{ \mathbf{z}_{qi} = (\mathbf{x}_{qi}, y_{qi}) \}_{i=1}^{N_q}$ is sampled from $P_q$, $q = 1, \ldots, Q$

Models: $W = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_Q] = (W_{\cdot 1}, \ldots, W_{\cdot Q}) = (W_{1\cdot}^{\top}, \ldots, W_{d\cdot}^{\top})^{\top}$ (columns index the $Q$ tasks, rows index the $d$ features)

Objective: $\min_{W} \sum_{q=1}^{Q} \frac{1}{N_q} \sum_{i=1}^{N_q} \ell_q(W_{\cdot q}, \mathbf{z}_{qi}) + \Omega(W)$

(79)

Different regularizations yield different models:

• iMTFS (entry-wise $\ell_1$): $\Omega(W) = \sum_{q=1}^{Q} \|W_{\cdot q}\|_1 = \sum_{j=1}^{d} \|W_{j\cdot}\|_1$

• aMTFS (row-wise $\ell_{1,2}$): $\Omega(W) = \sum_{j=1}^{d} \|W_{j\cdot}\|_2$

• MTFTS (combined): $\Omega_{\lambda, r}(W) = \sum_{j=1}^{d} \big( r_j \|W_{j\cdot}\|_1 + \|W_{j\cdot}\|_2 \big)$

(Figure: weight patterns $\hat{\mathbf{w}}$ learned by iMTFS, aMTFS, and MTFS.)

(80)

Online Learning Framework

Objective: $\min_{W} \sum_{q=1}^{Q} \frac{1}{N_q} \sum_{i=1}^{N_q} \ell_q(W_{\cdot q}, \mathbf{z}_{qi}) + \Omega(W)$

Three main steps

Compute the subgradient

Update the average subgradient

Update the weights (closed-form solution, e.g., for aMTFS)

Efficiency: $O(dQ)$ per iteration

Regret bound: $\bar{R}_T \le O(1/\sqrt{T})$

(81)

Empirical Study: Conjoint Analysis

Description

Objective: Predict ratings by estimating respondents' partworth vectors

Data: Ratings of 180 students for 20 different personal computers, Q = 180

Features (d = 14): telephone hot line (TE), amount of memory (RAM), screen size (SC), CPU speed (CPU), hard disk (HD), CD-ROM/multimedia (CD), cache (CA), color (CO), availability (AV), warranty (WA), software (SW), guarantee (GU), and price (PR)

Setup

Evaluation: Root mean square errors (RMSEs)

Loss: Square loss

(82)

Evaluation Results

Accuracy

Learning partworth vectors across respondents simultaneously can help to improve the performance

Online learning algorithms attain nearly the same accuracies as batch-trained algorithms

Memory cost

Batch: … Online: …

Computational cost

Batch: … Online: …

(83)

Learned Features

Observations

Features learned from the online algorithms are consistent with those learned from the batch-trained algorithms

Ratings are strongly negatively related to the price and positively related to the RAM, the CPU speed, the CD-ROM, etc.

(Figure: learned feature weights for TE, RAM, SC, CPU, HD, CD, CA, CO, AV, WA, SW, GU, and PR.)

(84)

Online Sparse Learning

Summary

Objective

Induce sparsity in the learned weights in online learning algorithms

Pros and cons

Simple and easy to implement

Efficient and scalable for high-dimensional data

Relatively slow convergence rate

(85)

Outline

Introduction

Big data: definition and history

Online learning and its applications

Online Learning Algorithms

Perceptron

Online linear classifiers

Online non-sparse learning

Online sparse learning

Online nonlinear classifiers

Online learning with kernels

Kernelized online imbalanced learning

(86)

Online Nonlinear Classifiers

Models: $f(\mathbf{x}) = \langle f(\cdot), k(\mathbf{x}, \cdot) \rangle_{\mathcal{H}} = \sum_i \alpha_i k(\mathbf{x}, \mathbf{x}_i)$

Categories

Online learning with kernels

Kernelized online imbalanced learning

(87)

What is Kernel?

Similarity defined in the original space: $\mathbf{x}_i^T \mathbf{x}_j$

Similarity defined in the kernel space: $k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$
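A tiny illustration of the two notions of similarity; the Gaussian (RBF) kernel is one example of a kernel whose feature map is implicit.

```python
import numpy as np

def linear_kernel(xi, xj):
    """Similarity in the original space: x_i^T x_j."""
    return xi @ xj

def rbf_kernel(xi, xj, sigma=1.0):
    """Similarity in a kernel-induced space: k(x_i, x_j) = phi(x_i)^T phi(x_j),
    where phi is implicit (and infinite-dimensional for the Gaussian kernel)."""
    diff = xi - xj
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

xi, xj = np.array([1.0, 0.0]), np.array([0.8, 0.2])
print(linear_kernel(xi, xj), rbf_kernel(xi, xj))
```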

(88)

Outline

Introduction

Big data: definition and history

Online learning and its applications

Online Learning Algorithms

Perceptron

Online linear classifiers

Online non-sparse learning

Online sparse learning

Online nonlinear classifiers

Online learning with kernels

Kernelized online imbalanced learning

(89)

Online Learning with Kernels

Objective: $f_{t+1} = \arg\min_{f \in \mathcal{H}_k} (\text{regularized instantaneous risk})$

Representative methods

Naive Online Regularization Minimization Algorithm (NORMA) [Kivinen et al., 2004]

Forgetron [Dekel et al., 2006]

Randomized Budget Perceptron (RBP) [Cavallanti et al., 2007]

Projectron [Orabona et al., 2008]

etc.

(90)

NORMA

Problem: The kernel expansion scales with the number of instances [Kivinen et al., 2004]

Solution: Drop smaller terms

Theory

Truncation error

Mistake bound

Applications

Classification

Novelty detection

Regression

(91)

NORMA

Objective: $f_{t+1} = \arg\min_{f \in \mathcal{H}_k} (\text{regularized instantaneous risk})$

Update rule: standard (stochastic) gradient descent in the RKHS [Kivinen et al., 2004]
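A NORMA-style sketch with a Gaussian kernel and hinge loss: coefficients decay by (1 - eta*lambda) each round, a new term is added on a margin violation, and old terms are dropped to bound the expansion. The parameter values and the oldest-term truncation rule are illustrative assumptions.

```python
import numpy as np

def norma_sketch(stream, sigma=1.0, eta=0.2, lam=0.01, budget=100):
    """Kernelized online learning in the spirit of NORMA [Kivinen et al., 2004]:
    f(x) = sum_i alpha_i k(x_i, x)."""
    def k(a, b):
        d = a - b
        return np.exp(-(d @ d) / (2.0 * sigma ** 2))

    sv_x, alpha = [], []                           # support vectors and coefficients
    for x_t, y_t in stream:
        f_xt = sum(a * k(x, x_t) for a, x in zip(alpha, sv_x))
        alpha = [(1.0 - eta * lam) * a for a in alpha]   # regularization decay
        if y_t * f_xt < 1.0:                       # hinge loss is positive
            sv_x.append(x_t)
            alpha.append(eta * y_t)                # add a new expansion term
        if len(sv_x) > budget:                     # truncation: drop the oldest term
            sv_x.pop(0)
            alpha.pop(0)
    return sv_x, alpha
```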

(92)

Forgetron

Problem: The kernel expansion scales with the number of instances [Dekel et al., 2006]

Solution: Fixed budget

Theory: mistake bound of the form $M \le 2U^2 + 4 \sum_{t=1}^{T} \hat{\ell}_t$ for any competitor $g$ with $\|g\| \le U$

Strategies

Remove the oldest support vector

Gentler scaling

Smarter removal

(93)

Randomized Budget Perceptron (RBP)

Problem: The kernel expansion scales with the number of instances [Cavallanti et al., 2007]

Theory: mistake bound

Solution

Shifting Perceptron

Randomly discard a support vector when the budget is exceeded

(94)

Projectron/Projectron++

Problem: The kernel expansion scales with the number of instances

Theory: mistake bound

Solution: Projectron / Projectron++

(95)

Summary

Learn a nonlinear model by kernel expansion: $f(\mathbf{x}) = \langle f(\cdot), k(\mathbf{x}, \cdot) \rangle_{\mathcal{H}} = \sum_i \alpha_i k(\mathbf{x}, \mathbf{x}_i)$

Target: maximizing accuracy

Scalability issue: the kernel expansion scales with the number of instances

Advantages

Different strategies to tackle the scalability issue

Mistake bounds are provided

(96)

Outline

Introduction

Big data: definition and history

Online learning and its applications

Online Learning Algorithms

Perceptron

Online linear classifiers

Online non-sparse learning

Online sparse learning

Online nonlinear classifiers

Online learning with kernels

Kernelized online imbalanced learning

(97)

Kernelized Online Imbalanced Learning with Fixed Budgets

[Hu et al., AAAI 2015; Hu et al., IEEE TNNLS, submitted]

Signal detection task:

(98)

Learning from Imbalanced Data

Problems

Accuracy: dominated by the majority class

Misclassification costs for the two classes are different

Unknown misclassification costs

(99)

Learning from Imbalanced Data

Evaluation metrics

Accuracy = (TP + TN) / (TP + FP + TN + FN)

Precision = TP / (TP + FP)

Recall (Sensitivity) = TP / (TP + FN)

$F_1 = \frac{(1 + \beta^2) \cdot \text{Recall} \cdot \text{Precision}}{\beta^2 \cdot \text{Recall} + \text{Precision}}$ with $\beta = 1$, i.e., $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

AUC (area under the Receiver Operating Characteristic (ROC) curve): the probability that a classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one
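The two metrics most relevant here can be computed directly from their definitions; a small sketch:

```python
import numpy as np

def auc_score(scores, labels):
    """AUC as the probability that a random positive is ranked above a random
    negative (ties counted as one half)."""
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    pairs = pos[:, None] - neg[None, :]
    return (np.sum(pairs > 0) + 0.5 * np.sum(pairs == 0)) / (len(pos) * len(neg))

def f1_score(tp, fp, fn):
    """F1 = 2 * Precision * Recall / (Precision + Recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

scores = np.array([0.9, 0.8, 0.3, 0.2, 0.1])
labels = np.array([1, -1, 1, -1, -1])
print(auc_score(scores, labels), f1_score(tp=2, fp=1, fn=0))
```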

(100)

Kernelized Online Imbalanced Learning (KOIL)

Problem and motivation

Imbalanced data

Metric: accuracy vs. AUC

Scenario: nonlinearity and streaming data

Challenges

How to attain a smooth update?

How to control the functional complexity?

How to determine the kernels?

(101)

Setup for KOIL

Goal: seek a nonlinear decision function for imbalanced data, with two fixed-budget buffers to store the weights

(102)

KOIL: AUC Maximization

Given $f$ and the data, the AUC is calculated by counting correctly ordered positive-negative pairs

A convex surrogate (the pairwise hinge loss) replaces the indicator

Objective: minimize the localized instantaneous regularized risk of the AUC

(Figure: hinge loss as a function of $y f(\mathbf{x})$.)

(103)

KOIL: Intuition

Two buffers keep track of the global information

Update based on the k-nearest opposite support vectors

(104)
(105)

KOIL: Update Buffer

Case I: Buffer is not full: insert directly

Case II: Buffer is full: replace an entry by Reservoir Sampling (RS) or First-In-First-Out (FIFO)
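A sketch of the two buffer-update cases; the reservoir-sampling acceptance rule and the return convention are my assumptions.

```python
import random

def update_buffer_rs(buffer, item, budget, n_seen, rng=random):
    """KOIL-style buffer update (sketch).  Case I: insert directly while the buffer
    is not full.  Case II: once full, reservoir sampling keeps each of the n_seen
    same-class items with equal probability budget/n_seen; a FIFO variant would
    instead replace the oldest entry."""
    if len(buffer) < budget:                 # Case I: buffer not full
        buffer.append(item)
        return None                          # nothing was evicted
    j = rng.randrange(n_seen)                # Case II: reservoir sampling
    if j < budget:
        evicted = buffer[j]
        buffer[j] = item
        return evicted                       # the replaced support vector
    return item                              # the new item itself is discarded
```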

(106)

KOIL: Update Buffer (2)

Observation: performance decay

Why: Information loss

How to compensate?

(107)

KOIL: Update Buffer (3)

(108)

KOIL: Update Kernel

Minimize the localized instantaneous regularized risk of the AUC

Update: stochastic gradient descent

(109)

KOIL: Update Buffer

1. Remove a support vector (SV) via FIFO or Reservoir Sampling (RS)

2. Compensate the loss by sharing the removed weight with its most similar SV

3. Set the weights:

$f_{t+1}^{++}(\mathbf{x}) = \hat{f}_{t+1}(\mathbf{x}) + \alpha_c \cdot k(\mathbf{x}_c, \mathbf{x}) = \underbrace{f_{t+1}(\mathbf{x}) - \alpha_r k(\mathbf{x}_r, \mathbf{x})}_{\text{Removal}} + \underbrace{\alpha_c \cdot k(\mathbf{x}_c, \mathbf{x})}_{\text{Compensation}}$
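A sketch of the compensation idea: before discarding a support vector, its weight is shared with the most similar remaining support vector. Transferring the full weight is an assumption; the exact compensation rule in the paper may differ.

```python
import numpy as np

def remove_with_compensation(sv_x, alpha, idx_removed, kernel):
    """Remove support vector r but transfer its weight to the most similar
    remaining support vector c, so that less information is lost."""
    x_r, a_r = sv_x[idx_removed], alpha[idx_removed]
    rest = [i for i in range(len(sv_x)) if i != idx_removed]
    if not rest:                             # nothing left to compensate with
        del sv_x[idx_removed], alpha[idx_removed]
        return sv_x, alpha
    c = max(rest, key=lambda i: kernel(sv_x[i], x_r))  # most similar remaining SV
    alpha[c] += a_r                          # share the removed weight (assumption:
                                             # the full weight is transferred)
    del sv_x[idx_removed], alpha[idx_removed]
    return sv_x, alpha
```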

(110)

Theoretical Results

Norm of $f$

Pairwise hinge loss

Regret

(111)

Smooth Pairwise Hinge Loss

Smooth pairwise hinge loss:

$\ell_{sh}(f, \mathbf{z}, \mathbf{z}') = \frac{|y - y'|}{2} \Big( \big( 1 - \tfrac{1}{2}(y - y')(f(\mathbf{x}) - f(\mathbf{x}')) \big)_{+} \Big)^{2}$

Minimize the smooth localized instantaneous regularized risk of the AUC:

$\tilde{L}_t(f) := \frac{1}{2} \|f\|_{\mathcal{H}}^{2} + C \sum_{\mathbf{z}_i \in V_t} \ell_{sh}(f, \mathbf{z}_t, \mathbf{z}_i)$

Stochastic gradient descent: $\tilde{f}_{t+1} := \tilde{f}_t - \eta \, \partial_f \tilde{L}_t(f) \big|_{f = \tilde{f}_t}$

Update the weights:

$\alpha_{i,t} = \begin{cases} 2\eta C y_t \sum_{i \in V_t} \ell_h(\tilde{f}_t, \mathbf{z}_t, \mathbf{z}_i), & i = t, \\ (1 - \eta\lambda)\,\alpha_{i,t-1} - 2\eta C y_t \, \ell_h(\tilde{f}_t, \mathbf{z}_t, \mathbf{z}_i), & \forall i \in V_t \end{cases}$

(112)

Theoretical Results

Remarks

$\tilde{R}_T \le O(1)$ if $L = 0$, and $\tilde{R}_T \le O(\sqrt{T})$ otherwise

(113)

Empirical Evaluation

KOIL performs better than online linear AUC algorithms

KOIL significantly outperforms all competing methods

(114)

KOIL: Compensation

The performance of RS/FIFO decreases when the budget is full

RS++/FIFO++ approximate KOIL with an infinite budget

(115)

KOIL: Effect of Buffer Size

Stable when the buffer size is relatively large

KOIL cannot capture sufficient information when the buffer size is too small

(116)

KOIL: Effect of Localization k

When k is small, KOIL cannot learn well

When k is large, KOIL is easily affected by noisy data

(117)

KOIL: How to Select Kernels?

Solution: cross-validation

Problem: spends much time on poor kernels

Good solution: multiple kernel learning (MKL)

Properties: summing kernels corresponds to concatenating feature spaces, e.g.,

$K = \sum_{q=1}^{Q} \beta_q K_q, \quad \beta_q \ge 0$

$k_1(\mathbf{z}_1, \mathbf{z}_2) = \langle \phi_1(\mathbf{z}_1), \phi_1(\mathbf{z}_2) \rangle, \quad k_2(\mathbf{z}_1, \mathbf{z}_2) = \langle \phi_2(\mathbf{z}_1), \phi_2(\mathbf{z}_2) \rangle$

$k_1(\mathbf{z}_1, \mathbf{z}_2) + k_2(\mathbf{z}_1, \mathbf{z}_2) = \left\langle \begin{pmatrix} \phi_1(\mathbf{z}_1) \\ \phi_2(\mathbf{z}_1) \end{pmatrix}, \begin{pmatrix} \phi_1(\mathbf{z}_2) \\ \phi_2(\mathbf{z}_2) \end{pmatrix} \right\rangle$
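A quick numerical check of this property with toy explicit feature maps (phi1 and phi2 are made up for illustration):

```python
import numpy as np

# Summing kernels corresponds to concatenating feature spaces.
phi1 = lambda z: np.array([z[0], z[1], z[0] * z[1]])
phi2 = lambda z: np.array([z[0] ** 2, z[1] ** 2])

def k1(z1, z2): return phi1(z1) @ phi1(z2)
def k2(z1, z2): return phi2(z1) @ phi2(z2)

z1, z2 = np.array([1.0, 2.0]), np.array([0.5, -1.0])
concat = lambda z: np.concatenate([phi1(z), phi2(z)])
print(k1(z1, z2) + k2(z1, z2), concat(z1) @ concat(z2))   # the two numbers agree
```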

(118)

KOIL with MKL: Setup

Given a set of kernel functions $\mathcal{K} = \{ k_l : \mathcal{X} \times \mathcal{X} \to \mathbb{R},\ l = 1, \ldots, m \}$

Learn a linear combination of these kernel classifiers:

$F_t(\mathbf{x}) = \sum_{l=1}^{m} q_t^l \cdot \mathrm{sign}(f_{l,t}(\mathbf{x})), \qquad \mathbf{q}_t = \mathbf{w}_t / |\mathbf{w}_t|$

The $l$-th kernel classifier at the $t$-th trial is

$f_{l,t}(\mathbf{x}) = \sum_{i \in I_t^{+}} \alpha_{l,i,t}^{+} \, k_l(\mathbf{x}_i, \mathbf{x}) + \sum_{j \in I_t^{-}} \alpha_{l,j,t}^{-} \, k_l(\mathbf{x}_j, \mathbf{x})$

(119)

KOIL with MKL: Update

Let the loss function $\check{L}_t(f)$ be the localized instantaneous risk with the regularization term

Update of the classifier weights at the $t$-th trial

Choose each classifier with probability $w_t^l / \max_j w_t^j$

Penalize the weight of each classifier based on its loss: $w_{t+1}^i = w_t^i \exp(-\check{L}_t(f_i, y_t))$

Update rule for the SV weights

Non-smooth pairwise hinge loss:

$\alpha_{i,t} = \begin{cases} \eta C y_t |V_t|, & i = t, \\ \alpha_{i,t-1} - \eta C y_t, & \forall i \in V_t, \\ \alpha_{i,t-1}, & \forall i \in I_t^{y_t} \setminus V_t \end{cases}$

Smooth pairwise hinge loss:

$\alpha_{i,t} = \begin{cases} 2\eta C y_t \sum_{i \in V_t} \ell_h(\tilde{f}_t, \mathbf{z}_t, \mathbf{z}_i), & i = t, \\ \alpha_{i,t-1} - 2\eta C y_t \, \ell_h(\tilde{f}_t, \mathbf{z}_t, \mathbf{z}_i), & \forall i \in V_t, \\ \alpha_{i,t-1}, & \forall i \in I_t^{y_t} \setminus V_t \end{cases}$

(120)

KOIL with MKL: Theory

(121)

KOIL with MKL: Results

KOIL with MKL attains better or comparable performance on most datasets

(122)

Summary of Online Nonlinear Classifiers

Models: $f(\mathbf{x}) = \langle f(\cdot), k(\mathbf{x}, \cdot) \rangle_{\mathcal{H}} = \sum_i \alpha_i k(\mathbf{x}, \mathbf{x}_i)$

Pros and cons

More powerful

Theoretical guarantees

Complex

Slow in training

(123)

Outline

Introduction

Big data: definition and history

Online learning and its applications

Online Learning Algorithms

Perceptron

Online linear classifiers

Online non-sparse learning

Online sparse learning

Online nonlinear classifiers

Online learning with kernels

Kernelized online imbalanced learning

(124)

Discussions and Open Issues

How to learn from Big Data to tackle the 4V's characteristics?

Volume: size grows from terabyte to exabyte to zettabyte

Variety: structured, semi-structured, unstructured

Velocity: batch data, real-time data, streaming data

Veracity: data inconsistency, ambiguities, deception

(125)

Discussions and Open Issues

Data issues

High dimensionality

Sparsity

Structure

Noisy and incomplete data

Concept drift

Domain adaptation

Background knowledge incorporation

Platform issues

Parallel computing

Distributed computing

User interaction

Interactive OL vs. passive OL

Crowdsourcing

(126)

Discussions and Open Issues

Applications

Social network and social media

Speech recognition and identification (e.g., Siri)

Financial engineering

Medical and healthcare informatics

Science and research: human genome decoding, particle discoveries, astronomy

(127)

Conclusion

Introduction to Big Data and its challenges and opportunities

Introduction to online learning and its possible applications

Survey of classical and state-of-the-art online learning techniques for

Linear models

Non-sparse learning models

Sparse learning models

Nonlinear models

Learning with kernels

(128)

One-slide Takeaway

Online learning is a promising tool for big data analytics

Many challenges exist

Real-world scenarios: concept drift, sparse data, high-dimensional data, uncertain/imprecise data, etc.

More advanced online learning algorithms: faster convergence rates, lower memory cost, etc.

Parallel implementations or running on distributed platforms

(129)

If you have any questions, please contact Dr. Haiqin Yang

via hqyang@ieee.org

(130)

References

• G. Cavallanti, N. Cesa-Bianchi, and C. Gentile. Tracking the best hyperplane with a simple budget perceptron. Machine Learning, 69(2-3):143–167, 2007.
• N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
• K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006.
• K. Crammer, A. Kulesza, and M. Dredze. Adaptive regularization of weight vectors. In NIPS, pages 414–422, 2009.
• K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991, 2003.
• O. Dekel, S. Shalev-Shwartz, and Y. Singer. The Forgetron: A kernel-based perceptron on a budget. SIAM Journal on Computing, 37(5):1342–1372, 2008.
• Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296, 1999.
• W. Gao, R. Jin, S. Zhu, and Z.-H. Zhou. One-pass AUC optimization. In ICML, pages 906–914, 2013.
• J. C. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.
• C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, 2001.

(131)

References

• J. Hu, H. Yang, I. King, M. R. Lyu, and A. M.-C. So. Kernelized online imbalanced learning with fixed budgets. In AAAI, Austin, Texas, USA, Jan. 25-30, 2015.
• R. Jin, S. C. H. Hoi, and T. Yang. Online multiple kernel learning: Algorithms and mistake bounds. In ALT, pages 390–404, 2010.
• J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004.
• J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801, 2009.
• Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1-3):361–387, 2002.
• P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal., 16(6):964–979, 1979.
• Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper 2007/76, Catholic University of Louvain, Center for Operations Research and Econometrics, 2007.
• F. Orabona and K. Crammer. New adaptive algorithms for online classification. In NIPS, pages 1840–1848, 2010.
• F. Orabona, J. Keshet, and B. Caputo. Bounded kernel-based online learning. Journal of Machine Learning Research, 10:2643–2666, 2009.

(132)

References

• S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
• Y. Wang, R. Khardon, D. Pechyony, and R. Jones. Generalization bounds for online learning algorithms with pairwise loss functions. In COLT, pages 13.1–13.22, 2012.
• L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
• H. Yang, I. King, and M. R. Lyu. Sparse Learning Under Regularization Framework. LAP Lambert Academic Publishing, April 2011.
• H. Yang, M. R. Lyu, and I. King. Efficient online learning for multi-task feature selection. ACM Transactions on Knowledge Discovery from Data, 2013.
• H. Yang, Z. Xu, I. King, and M. R. Lyu. Online learning for group lasso. In ICML, pages 1191–1198, Haifa, Israel, 2010.
• H. Yang, Z. Xu, J. Ye, I. King, and M. R. Lyu. Efficient sparse generalized multiple kernel learning. IEEE Transactions on Neural Networks, 22(3):433–446, March 2011.
• T. Yang, M. Mahdavi, R. Jin, J. Yi, and S. C. H. Hoi. Online kernel selection: Algorithms and evaluations. In AAAI, 2012.
• P. Zhao, S. C. H. Hoi, R. Jin, and T. Yang. Online AUC maximization. In ICML, pages 233–240, 2011.

Toolbox: http://www.cse.cuhk.edu.hk/~hqyang
