Online Learning for
Big Data Analytics
Irwin King and Haiqin Yang
Dept. of Computer Science and Engineering
The Chinese University of Hong Kong
Outline
Introduction
✤ Big data: definition and history
✤ Online learning and its applications
Online Learning Algorithms
✤ Perceptron
✤ Online linear classifiers
  • Online non-sparse learning
  • Online sparse learning
✤ Online nonlinear classifiers
  • Online learning with kernels
  • Kernelized online imbalanced learning
What is Big Data?
Big Data refers to datasets that grow so large and complex that it is difficult to capture, store, manage, share, analyze, and visualize them within current computational architectures.
The First Big Data Challenge
1880 US census
✤ Data: 50 million people
✤ Features (5): age, gender (sex), occupation, education level, no. of insane people in household
✤ Processing time: > 7 years
The First Big Data Solution
Hollerith Tabulating System
✤ Punched cards – 80 variables
✤ Used for the 1890 census
✤ 6 weeks instead of 7+ years
Big Science
The International Geophysical Year
✤ An international scientific project
✤ Lasted from Jul. 1, 1957 to Dec. 31, 1958
✤ A synoptic collection of observational data on a global scale
Implications: big budgets, big staffs, big machines, big …
Summary of Big Science
Laid foundation for ambitious projects
✤ International Biological Program
✤ Long Term Ecological Research Network
Ended in 1974; many participants viewed it as a failure
Nevertheless, it was a success
✤ Transformed the way of processing data
✤ Realized the original incentives
Lessons from Big Science
Spawned new big data projects
✤ Weather prediction
✤ Physics research (supercollider data analytics)
✤ Astronomy images (planet detection)
✤ Medical research (drug interaction)
✤ …
Businesses latched onto its techniques, methodologies, and objectives
Big Science vs. Big Business
Common
✤ Need technologies to work with data
✤ Use algorithms to mine data
Big science
✤ Source: experiments and research conducted in controlled environments
✤ Goals: to answer questions or prove theories
Big business
✤ Source: transactions that occur naturally, with little control
✤ Goals: discover new opportunities, measure efficiency, …
Big Data is Everywhere
Lots of data is being collected and warehoused
✤ Science experiments
✤ Web data, e-commerce
✤ Purchases at department/grocery stores
✤ Bank/credit card transactions
Big Data from Science
CERN - Large Hadron Collider
✤ ~10 PB/year in 2008
✤ ~1000 PB in ~10 years
✤ 2500-physicist collaboration
Large Synoptic Survey Telescope
✤ ~5-10 PB/year at start in 2012
✤ ~100 PB by 2025
Pan-STARRS (Haleakala, Hawaii)
✤ now: 800 TB/year
✤ soon: 4 PB/year
Characteristics and Challenges
Big Data Analytics
Definition: a process of inspecting, cleaning, transforming, and modeling big data with the goal of discovering useful information, suggesting conclusions, and supporting decision making
Connection to data mining
✤ Analytics includes both data analysis (mining) and communication (guiding decision making)
✤ Analytics is not so much concerned with individual analyses or analysis steps as with the entire methodology
Challenges and Aims
Challenges: capturing, storing, searching, sharing, analyzing, and visualizing
Big data is not just about size
✤ Find insights from complex, noisy, heterogeneous, longitudinal, and voluminous data
✤ Aim to answer questions that were previously unanswered
This tutorial focuses on online learning techniques
Learning Techniques Overview
Learning paradigms
✤ Supervised learning
✤ Semi-supervised learning
✤ Transductive learning
✤ Unsupervised learning
✤ Universum learning
✤ Transfer learning
What is Online Learning?
Batch/Offline Learning
✤ Observe a batch of training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$
✤ Learn a model from them
✤ Make accurate predictions on new samples
Online Learning
✤ Observe a sequence of data $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_t, y_t)$
✤ Learn a model incrementally as new instances appear
✤ Make the sequence of online predictions accurately
(Diagram: the learner makes a prediction, receives the true response from the user, and updates its model, repeatedly.)
Online learning is the process of answering a sequence of questions given (maybe partial) knowledge of the correct answers to previous questions and possibly additional available information. [Shalev-Shwartz, 2012]
Online Prediction Algorithms
An initial prediction rule $f_0(\cdot)$
For $t$ = 1, 2, …
✤ We observe $\mathbf{x}_t$ and make a prediction $f_{t-1}(\mathbf{x}_t)$
✤ We observe the true outcome $y_t$ and then compute a loss $\ell(f_{t-1}(\mathbf{x}_t), y_t)$
✤ The online algorithm updates the prediction rule using the new example $(\mathbf{x}_t, y_t)$
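This protocol translates directly into code. Below is a minimal Python sketch; the `model` object (with `predict`/`update` methods) and the `loss` function are hypothetical placeholders for whatever learner and loss are plugged in.

```python
# A minimal sketch of the online prediction protocol. The `model`
# object and `loss` function are hypothetical placeholders.
def online_protocol(model, stream, loss):
    total_loss = 0.0
    for x_t, y_t in stream:             # examples arrive one at a time
        y_hat = model.predict(x_t)      # predict with f_{t-1}
        total_loss += loss(y_hat, y_t)  # observe y_t, suffer a loss
        model.update(x_t, y_t)          # update f_{t-1} -> f_t
    return total_loss
```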
Online Prediction Algorithms
The total error of the algorithm is
$$\sum_{t=1}^{T} \ell(f_{t-1}(\mathbf{x}_t), y_t)$$
Goal: this error is to be as small as possible
Predict the unknown future one step at a time: similar to generalization error
Regret Analysis
$f^*(\cdot)$: the optimal prediction function from a class $\mathcal{H}$, e.g., the class of linear classifiers, with the minimum error after seeing all instances:
$$f^*(\cdot) := \arg\min_{f \in \mathcal{H}} \sum_{t=1}^{T} \ell(f(\mathbf{x}_t), y_t)$$
Regret for the online learning algorithm:
$$\text{Regret} = \frac{1}{T} \sum_{t=1}^{T} \left[ \ell(f_{t-1}(\mathbf{x}_t), y_t) - \ell(f^*(\mathbf{x}_t), y_t) \right]$$
We want the regret to be as small as possible!
Why Low Regret?
Regret for the online learning algorithm:
$$\text{Regret} = \frac{1}{T} \sum_{t=1}^{T} \left[ \ell(f_{t-1}(\mathbf{x}_t), y_t) - \ell(f^*(\mathbf{x}_t), y_t) \right]$$
Advantages
✤ Do not lose much from not knowing future events
✤ Perform almost as well as someone who observes the entire sequence and picks the best prediction strategy in hindsight
✤ Compete with a changing environment
Advantages of Online Learning
Meets many applications where data arrive sequentially while predictions are required on-the-fly
✤ Avoids re-training when adding new data
Applicable in adversarial and competitive environments (e.g., spam filtering, stock market)
Strong adaptability to changing environments
High efficiency and excellent scalability
Simple to understand and easy to implement
Easy to parallelize
Where to Apply Online Learning?
(Diagram: online learning at the center, connected to social media, Internet security, and finance.)
Online Learning for Social Media
(Slides of social-media applications.)
Online Learning for Internet Security
Electronic business sectors
✤ Spam filtering
✤ Fraudulent credit-card transaction detection
✤ Network intrusion detection systems
Online Learning for Financial Decisions
Financial decisions
✤ Online portfolio selection
✤ Sequential investment
Perceptron Algorithm
One of the oldest machine learning algorithms: an online algorithm for learning a linear threshold classifier with small error [F. Rosenblatt, 1958]
Activation function:
$$o = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}, \qquad \sum_{i=0}^{n} w_i x_i = \mathbf{w}^T \mathbf{x}$$
(Diagram: inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$ and bias $w_0$ feeding the threshold unit.)
Goal: find a linear classifier with small error [F. Rosenblatt, 1958]
If there is no error, keep the same weights; otherwise, update them.
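A minimal NumPy sketch of this update rule (the learning rate eta and the zero initialization are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def perceptron(X, y, eta=1.0, epochs=1):
    """Online perceptron [Rosenblatt, 1958]: update only on mistakes.
    X: (n_samples, n_features) array; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            if y_t * (w @ x_t) <= 0:   # mistake (or on the boundary)
                w += eta * y_t * x_t   # w <- w + eta * y_t * x_t
    return w
```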
Perceptron: Geometric View
Start from an initial weight vector $\mathbf{w}_0$.
At $t = 1$: $(\mathbf{x}_1, y_1 (= 1))$ is misclassified, so update $\mathbf{w}_1 = \mathbf{w}_0 + \eta \mathbf{x}_1 y_1$.
At $t = 2$: $(\mathbf{x}_2, y_2 (= -1))$ is misclassified, so update $\mathbf{w}_2 = \mathbf{w}_1 + \eta \mathbf{x}_2 y_2$.
Continue…
(Diagrams: the decision boundary rotates toward correctly classifying each mistaken example.)
Perceptron Mistake Bound
Consider data well separated by some $\mathbf{w}_*$, i.e., $\mathbf{w}_*^T \mathbf{x}_i y_i > 0$ for all $i$.
Define the margin
$$\gamma = \frac{\min_i |\mathbf{w}_*^T \mathbf{x}_i|}{\|\mathbf{w}_*\|_2 \, \sup_i \|\mathbf{x}_i\|_2}$$
The larger the margin, the more confident the separation; the larger the norm of $\mathbf{x}$, the larger the mistake bound.
The number of mistakes the perceptron makes is at most $1/\gamma^2$.
Proof of Perceptron Mistake Bound
Proof: Let $\mathbf{v}_k$ be the hypothesis before the $k$-th mistake. Assume that the $k$-th mistake occurs on the input example $(\mathbf{x}_i, y_i)$. [Novikoff, 1963]
First, the update $\mathbf{v}_{k+1} = \mathbf{v}_k + y_i \mathbf{x}_i$ increases the alignment with $\mathbf{w}_*$: $\mathbf{w}_*^T \mathbf{v}_{k+1} \geq \mathbf{w}_*^T \mathbf{v}_k + \min_i |\mathbf{w}_*^T \mathbf{x}_i|$.
Second, the norm grows slowly: since the example was misclassified, $y_i \mathbf{v}_k^T \mathbf{x}_i \leq 0$, so $\|\mathbf{v}_{k+1}\|_2^2 \leq \|\mathbf{v}_k\|_2^2 + \sup_i \|\mathbf{x}_i\|_2^2$.
Combining the two bounds after $k$ mistakes yields $k \leq 1/\gamma^2$, with
$$\gamma = \frac{\min_i |\mathbf{w}_*^T \mathbf{x}_i|}{\|\mathbf{w}_*\|_2 \, \sup_i \|\mathbf{x}_i\|_2}$$
Overview
Online learning algorithms divide into linear classifiers and nonlinear classifiers.
Online/Stochastic Gradient Descent
✤ Online gradient descent
✤ Stochastic gradient descent
(Diagrams: both follow the same loop of receiving $\mathbf{x}_t$, predicting $f_{t-1}(\mathbf{x}_t)$, observing $y_t$, and updating $f_t$.)
Online Linear Classifiers
Model: $f(\mathbf{x}) = \text{sgn}(\mathbf{w}^T \mathbf{x})$ for classification
Categories
✤ Non-sparse models: the weight vector $\mathbf{w}$ is non-sparse
✤ Sparse models: the weight vector contains zero elements
Online Non-Sparse Learning
First-order learning methods
✤ Online gradient descent (Zinkevich, 2003)
✤ Passive-aggressive learning (Crammer et al., 2006)
✤ Others (including but not limited to)
  • ALMA: Approximate Large Margin Algorithm (Gentile, 2001)
  • ROMMA: Relaxed Online Maximum Margin Algorithm (Li & Long, 2002)
  • MIRA: Margin Infused Relaxed Algorithm (Crammer & Singer, 2003)
Online Gradient Descent (OGD)
Online convex optimization [Zinkevich, 2003]
Consider a convex objective function $f: S \to \mathbb{R}$, where $S \subseteq \mathbb{R}^n$ is a bounded convex set
✤ Update by Online Gradient Descent (OGD) or Stochastic Gradient Descent (SGD):
$$\mathbf{w}_{t+1} = \Pi_S(\mathbf{w}_t - \eta \nabla f(\mathbf{w}_t))$$
where $\eta$ is a learning rate, the inner step is gradient descent, and $\Pi_S$ is the projection onto $S$
Provides a framework to prove regret bounds for online convex optimization
Online Gradient Descent (OGD)
For $t$ = 1, 2, … [Zinkevich, 2003]
✤ An unlabeled sample $\mathbf{x}_t$ arrives
✤ Make a prediction based on existing weights: $\hat{y}_t = \text{sgn}(\mathbf{w}_t^T \mathbf{x}_t)$
✤ Observe the true class label $y_t \in \{-1, +1\}$
✤ Update the weights by $\mathbf{w}_{t+1} = \Pi_S(\mathbf{w}_t - \eta \nabla f(\mathbf{w}_t))$
Achieves an $O(\sqrt{T})$ regret bound
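A minimal sketch of OGD for binary classification with the hinge loss, taking $S$ to be an L2 ball; the loss, the radius, and the constant learning rate are illustrative assumptions:

```python
import numpy as np

def ogd_hinge(stream, dim, eta=0.1, radius=10.0):
    """OGD sketch with hinge loss; S is taken to be an L2 ball of the
    given radius, one illustrative choice of bounded convex set."""
    w, mistakes = np.zeros(dim), 0
    for x_t, y_t in stream:                    # y_t in {-1, +1}
        mistakes += np.sign(w @ x_t) != y_t    # predict, then observe y_t
        if y_t * (w @ x_t) < 1.0:              # hinge-loss subgradient
            w = w + eta * y_t * x_t            # gradient step
        norm = np.linalg.norm(w)
        if norm > radius:                      # projection Pi_S
            w *= radius / norm
    return w, mistakes
```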
Passive-Aggressive Online Learning
Each example defines a set of consistent hypotheses (those suffering zero loss on it)
The new weight vector is set to be the projection of the current one onto this set: passive when the loss is already zero, aggressive otherwise
Variants: classification, regression, and uniclass
Passive-Aggressive Online Learning [Crammer et al., 2006]
✤ PA (binary classification)
✤ PA-I (C-SVM)
✤ PA-II (relaxed C-SVM)
Each round solves a simple constrained objective with a closed-form solution
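The closed-form solutions are one-liners. A sketch of the PA-I update for binary classification, following the step-size rule $\tau_t = \min(C, \ell_t / \|\mathbf{x}_t\|^2)$ of Crammer et al. (2006):

```python
import numpy as np

def pa1_update(w, x_t, y_t, C=1.0):
    """One PA-I step (Crammer et al., 2006): if the hinge loss is zero,
    stay passive; otherwise take the smallest step, capped at C, toward
    the set of hypotheses consistent with (x_t, y_t)."""
    loss = max(0.0, 1.0 - y_t * (w @ x_t))    # hinge loss
    if loss > 0.0:
        tau = min(C, loss / (x_t @ x_t))      # closed-form step size
        w = w + tau * y_t * x_t
    return w
```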
Online Non-Sparse Learning
First-order methods
✤ Learn a linear weight vector (first order) of the model
Pros and cons
✤ Simple and easy to implement
✤ Efficient and scalable for high-dimensional data
✤ But: relatively slow convergence rate
Online Sparse Learning
Motivations
✤ Space constraint: RAM overflow
✤ Test-time constraint
✤ How to induce sparse solutions?
Online Sparse Learning
Objective function (an ℓ1-regularized loss, in a standard form):
$$\min_{\mathbf{w}} \sum_{t} \ell(\mathbf{w}; \mathbf{z}_t) + \lambda \|\mathbf{w}\|_1$$
Problem in online learning
✤ Standard stochastic gradient descent does not yield sparse solutions
Some representative work
✤ Truncated gradient (Langford et al., 2009)
✤ FOBOS: Forward-Looking Subgradients (Duchi & Singer, 2009)
✤ Dual averaging (Xiao, 2010)
✤ etc.
Truncated Gradient [Langford et al., 2009]
Objective function: the ℓ1-regularized loss above
Stochastic gradient descent (SGD): $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla \ell(\mathbf{w}_t; \mathbf{z}_t)$
Simple coefficient rounding when the coefficient is small
Truncated gradient: impose sparsity by rounding the solution of stochastic gradient descent toward zero
Truncated Gradient
Simple coefficient rounding vs. less aggressive truncation
Loss functions
✤ Logistic
✤ Hinge
✤ Least squares
When the gravity parameter is zero, the update rule is identical to standard SGD
Truncated Gradient [Langford et al., 2009]
The amount of shrinkage is measured by a gravity parameter
The truncation can be performed every $K$ online steps
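A sketch of truncated gradient with squared loss, truncating every K steps; the parameter names (gravity g, threshold theta) follow Langford et al. (2009), while the loss and the schedule below are illustrative choices:

```python
import numpy as np

def truncate(w, gravity, theta):
    """Shrink coefficients with |w_j| <= theta toward zero by `gravity`;
    leave larger coefficients untouched (less aggressive than rounding)."""
    small = np.abs(w) <= theta
    w[small] = np.sign(w[small]) * np.maximum(np.abs(w[small]) - gravity, 0.0)
    return w

def truncated_gradient(stream, dim, eta=0.01, g=1e-4, theta=0.1, K=10):
    """Truncated-gradient sketch with squared loss: plain SGD steps,
    plus a truncation with accumulated gravity K*eta*g every K steps."""
    w = np.zeros(dim)
    for t, (x_t, y_t) in enumerate(stream, 1):
        w -= eta * (w @ x_t - y_t) * x_t   # SGD step, squared loss
        if t % K == 0:
            w = truncate(w, K * eta * g, theta)
    return w
```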
Truncated Gradient
Regret bound [Langford et al., 2009]: with an appropriately chosen learning rate (decaying on the order of $1/\sqrt{T}$), the algorithm attains an $O(\sqrt{T})$ regret
FOBOS
FOrward-Backward Splitting
Objective function: a loss plus a regularizer $r(\mathbf{w})$
Repeat
I. Unconstrained (stochastic) subgradient/gradient step on the loss
II. Incorporate the regularization
Similar to
- Forward-backward splitting (Lions and Mercier, 1979)
- Composite gradient methods (Wright et al., 2009; Nesterov, 2007)
FOBOS: Step I
Unconstrained (stochastic) subgradient/gradient step on the loss:
$$\mathbf{w}_{t+\frac{1}{2}} = \mathbf{w}_t - \eta_t \mathbf{g}_t, \qquad \mathbf{g}_t \in \partial \ell_t(\mathbf{w}_t)$$
FOBOS: Step II [Duchi & Singer, 2009]
Incorporate the regularization:
$$\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \left\{ \frac{1}{2} \|\mathbf{w} - \mathbf{w}_{t+\frac{1}{2}}\|^2 + \eta_{t+\frac{1}{2}} r(\mathbf{w}) \right\}$$
The first term stays close to the interim vector; the second induces sparsity
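For the ℓ1 regularizer $r(\mathbf{w}) = \lambda \|\mathbf{w}\|_1$, Step II has a closed form, the soft-thresholding operator. A minimal sketch of one FOBOS iteration under that assumption:

```python
import numpy as np

def fobos_l1_step(w, grad, eta, lam):
    """One FOBOS iteration with r(w) = lam * ||w||_1.
    Step I: unconstrained (sub)gradient step on the loss.
    Step II: the proximal step reduces to soft-thresholding, keeping w
    close to the interim vector while zeroing small coordinates."""
    w_half = w - eta * grad                                   # Step I
    return np.sign(w_half) * np.maximum(np.abs(w_half) - eta * lam, 0.0)  # Step II
```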
Forward Looking Property
The optimum of Step II satisfies $\mathbf{0} \in \mathbf{w}_{t+1} - \mathbf{w}_{t+\frac{1}{2}} + \eta_{t+\frac{1}{2}} \partial r(\mathbf{w}_{t+1})$, so
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \mathbf{g}_t - \eta_{t+\frac{1}{2}} \mathbf{g}_{t+1}^r$$
where $\mathbf{g}_t$ is the current subgradient of the loss and $\mathbf{g}_{t+1}^r \in \partial r(\mathbf{w}_{t+1})$ is a forward subgradient of the regularization: the update combines the current loss with the forward regularization
Batch Convergence and Online Regret
Set the step size to decay (e.g., $\eta_t \propto 1/\sqrt{t}$) to obtain batch convergence
Online (average) regret bounds of order $O(1/\sqrt{T})$
High-Dimensional Efficiency
Input space is sparse but huge
Need to perform lazy updates to $\mathbf{w}$
Proposition: the lazy (on-demand, coordinate-wise) updates are equivalent to the full sequential updates
Dual Averaging Method [Xiao, 2010]
Objective function: the ℓ1-regularized loss
Problem: truncated gradient does not produce truly sparse weights due to the small learning rate
Fix: dual averaging keeps two state representations
- the weight parameter $\mathbf{w}_t$
- the average (sub)gradient vector $\bar{\mathbf{g}}_t = \frac{1}{t} \sum_{i=1}^{t} \nabla f_i(\mathbf{w}_i)$
Dual Averaging Method: Algorithm [Xiao, 2010]
Three main steps per round, with a closed-form solution for $\mathbf{w}_{t+1}$
Advantage: sparsity of the weight vector
Disadvantage: keeps a non-sparse (sub)gradient average
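A sketch of the ℓ1 variant of this closed-form update, following the ℓ1-RDA rule in Xiao (2010) with proximal function $h(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2$ and $\beta_t = \gamma\sqrt{t}$; `grad_fn` is a hypothetical placeholder for the subgradient of the loss:

```python
import numpy as np

def l1_rda(stream, dim, grad_fn, lam=0.01, gamma=1.0):
    """l1-RDA sketch (Xiao, 2010): maintain the (dense) running average
    subgradient g_bar and recover a sparse weight in closed form."""
    g_bar, w = np.zeros(dim), np.zeros(dim)
    for t, z_t in enumerate(stream, 1):
        g = grad_fn(w, z_t)                  # subgradient of loss at w_t
        g_bar = ((t - 1) * g_bar + g) / t    # update average subgradient
        # Closed-form w_{t+1}: entries with |g_bar| <= lam become zero.
        w = np.where(np.abs(g_bar) <= lam, 0.0,
                     -(np.sqrt(t) / gamma) * (g_bar - lam * np.sign(g_bar)))
    return w
```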
Convergence and Regret
Average regret and theoretical bounds: the method attains an $O(\sqrt{T})$ regret bound
Comparison
FOBOS
✤ Subgradient $\mathbf{g}_t$
✤ Local Bregman divergence
✤ Coefficient $1/\alpha_t = \Theta(\sqrt{t})$
$$\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \left\{ \langle \mathbf{g}_t, \mathbf{w} \rangle + \Psi(\mathbf{w}) + \frac{1}{2\alpha_t} \|\mathbf{w} - \mathbf{w}_t\|_2^2 \right\}$$
Equivalent to the TG method when $\Psi(\mathbf{w}) = \|\mathbf{w}\|_1$
Dual Averaging
✤ Average subgradient $\bar{\mathbf{g}}_t$
✤ Global proximal function $h(\mathbf{w})$
✤ Coefficient $\beta_t / t = \Theta(1/\sqrt{t})$
$$\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \left\{ \langle \bar{\mathbf{g}}_t, \mathbf{w} \rangle + \Psi(\mathbf{w}) + \frac{\beta_t}{t} h(\mathbf{w}) \right\}$$
Empirical Comparison
Variants of Online Sparse Learning Models
Other existing work
✤ Online learning for group lasso (Yang et al., 2010)
✤ Online learning for multi-task feature selection (Yang et al., 2013)
Efficient Online Learning for Multi-Task Feature Selection [Yang et al., ACM TKDD 2013]
Online Learning for Multi-Task Feature Selection
Observations
✤ Limited labeled data in each task
✤ Related tasks may help each other
✤ Features among tasks are redundant or irrelevant
✤ Data come in sequence
Model: integrate multiple tasks with task-coupling regularizations (e.g., aMTFS)
Solution: online learning for multi-task feature selection
Challenge: how to update the model adaptively as data arrive sequentially
Multi-Task Feature Selection
Data: i.i.d. observations $D = \bigcup_{q=1}^{Q} D_q$, where $D_q = \{\mathbf{z}_i^q = (\mathbf{x}_i^q, y_i^q)\}_{i=1}^{N_q}$ is sampled from $P_q$, $q = 1, \dots, Q$
Models: the weight matrix
$$W = [\mathbf{w}^1, \mathbf{w}^2, \dots, \mathbf{w}^Q] = (W_{\bullet 1}, \dots, W_{\bullet Q}) = (W_{1\bullet}^\top, \dots, W_{d\bullet}^\top)^\top$$
Objective:
$$\min_{W} \sum_{q=1}^{Q} \frac{1}{N_q} \sum_{i=1}^{N_q} \ell^q(W_{\bullet q}, \mathbf{z}_i^q) + \Omega(W)$$
Different regularizations yield different models
• iMTFS (tasks treated independently): $\Omega(W) = \sum_{q=1}^{Q} \|W_{\bullet q}\|_1 = \sum_{j=1}^{d} \|W_{j\bullet}\|_1$
• aMTFS (all tasks coupled per feature): $\Omega(W) = \sum_{j=1}^{d} \|W_{j\bullet}\|_2$
• MTFTS: $\Omega_{\lambda, r}(W) = \sum_{j=1}^{d} \left( \lambda r_j \|W_{j\bullet}\|_1 + \|W_{j\bullet}\|_2 \right)$
Online Learning Framework
Objective:
$$\min_{W} \sum_{q=1}^{Q} \frac{1}{N_q} \sum_{i=1}^{N_q} \ell^q(W_{\bullet q}, \mathbf{z}_i^q) + \Omega(W)$$
Three main steps
✤ Compute the subgradient
✤ Update the average subgradient
✤ Update the weights
Closed-form solution (e.g., for aMTFS)
Efficiency: $O(d \times Q)$ per round
Regret bound: $\bar{R}_T \sim O(1/\sqrt{T})$
Empirical Study: Conjoint Analysis
Description
✤ Objective: predict ratings by estimating respondents' partworths vectors
✤ Data: ratings of 180 students for 20 different personal computers, $Q$ = 180
✤ Features (d = 14): telephone hot line (TE), amount of memory (RAM), screen size (SC), CPU speed (CPU), hard disk (HD), CD-ROM/multimedia (CD), cache (CA), color (CO), availability (AV), warranty (WA), software (SW), guarantee (GU), and price (PR)
Setup
✤ Evaluation: root mean square errors (RMSEs)
✤ Loss: square loss
Evaluation Results
Accuracy
✤ Learning partworths vectors across respondents simultaneously helps to improve the performance
✤ Online learning algorithms attain nearly the same accuracies as batch-trained algorithms
Memory cost: substantially lower for the online algorithms than for the batch ones
Computational cost: substantially lower for the online algorithms than for the batch ones
Learned Features
Observations
• Features learned by the online algorithms are consistent with those learned by the batch-trained algorithms
• Ratings are strongly negative on the price and positive on the RAM, the CPU speed, CD-ROM, etc.
(Plots: learned partworth values for the features TE, RAM, SC, CPU, HD, CD, CA, CO, AV, WA, SW, GU, PR.)
Online Sparse Learning Summary
Objective
✤ Induce sparsity in the learned weights of online learning algorithms
Pros and cons
✤ Simple and easy to implement
✤ Efficient and scalable for high-dimensional data
✤ But: relatively slow convergence rate
Online Nonlinear Classifiers
Model: the kernel expansion
$$f(\mathbf{x}) = \langle f(\cdot), k(\mathbf{x}, \cdot) \rangle_{\mathcal{H}} = \sum_{i} \alpha_i k(\mathbf{x}, \mathbf{x}_i)$$
Categories
✤ Online learning with kernels
✤ Kernelized online imbalanced learning
What is a Kernel?
Similarity defined in the original space: $\mathbf{x}_i^T \mathbf{x}_j$
Similarity defined in the kernel space: $k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$, where $\phi$ maps inputs into a (possibly implicit) feature space
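For concreteness, a small sketch contrasting the two notions of similarity; the Gaussian (RBF) kernel is one common choice of $k$ whose feature map $\phi$ is implicit:

```python
import numpy as np

def linear_similarity(xi, xj):
    return xi @ xj   # similarity in the original space

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian kernel: an inner product in an implicit feature space,
    computed without ever forming phi(x) explicitly."""
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))
```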
Online Learning with Kernels
Objective: minimize an instantaneous regularized risk at each round,
$$f_{t+1} = \arg\min_{f \in \mathcal{H}_k} \left\{ \ell(f(\mathbf{x}_t), y_t) + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2 \right\}$$
Representative methods
✤ Naive Online Regularization Minimization Algorithm (NORMA) [Kivinen et al., 2004]
✤ Forgetron [Dekel et al., 2006]
✤ Randomized Budget Perceptron (RBP) [Cavallanti et al., 2007]
✤ Projectron [Orabona et al., 2008]
✤ etc.
NORMA [Kivinen et al., 2004]
Problem: the kernel expansion scales with the number of instances
Solution: drop the smaller terms
Theory
✤ Truncation error
✤ Mistake bound
Applications
✤ Classification
✤ Novelty detection
✤ Regression
NORMA [Kivinen et al., 2004]
Objective: the instantaneous regularized risk,
$$f_{t+1} = \arg\min_{f \in \mathcal{H}_k} \left\{ \ell(f(\mathbf{x}_t), y_t) + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2 \right\}$$
Update rule: standard (stochastic) gradient descent, which shrinks the old coefficients by $(1 - \eta\lambda)$ and adds a new expansion term proportional to $k(\mathbf{x}_t, \cdot)$ when the loss is positive
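A minimal sketch of NORMA for binary classification with the hinge loss, under the update just described; the truncation of small terms that NORMA uses to bound the expansion is omitted here:

```python
import numpy as np

def norma_hinge(stream, kernel, eta=0.1, lam=0.01):
    """NORMA sketch with hinge loss: f_t(x) = sum_i alpha_i k(x_i, x).
    Each round shrinks all coefficients by (1 - eta*lam); on a positive
    loss it appends a new expansion term with coefficient eta*y_t."""
    sv, alpha = [], []
    f = lambda x: sum(a * kernel(xi, x) for a, xi in zip(alpha, sv))
    for x_t, y_t in stream:                  # y_t in {-1, +1}
        margin = y_t * f(x_t)
        alpha = [(1.0 - eta * lam) * a for a in alpha]   # shrink all terms
        if margin < 1.0:                     # hinge loss is positive
            sv.append(x_t)
            alpha.append(eta * y_t)          # add a new expansion term
    return sv, alpha
```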
Forgetron [Dekel et al., 2006]
Problem: the kernel expansion scales with the number of instances
Solution: a fixed budget of support vectors
Strategies
✤ Remove the oldest support vector
✤ Gentler scaling
✤ Smarter removal
Theory: a mistake bound of the form $M \leq 2U^2 + 4 \sum_{t=1}^{T} \hat{\ell}_t$ against any competitor $g$ with $\|g\| \leq U$
Randomized Budget Perceptron (RBP) [Cavallanti et al., 2007]
Problem: the kernel expansion scales with the number of instances
Solution
✤ Shifting perceptron
✤ Randomly discard a support vector when the budget is full
Theory
✤ Mistake bound
Projectron/Projectron++ [Orabona et al., 2008]
Problem: the kernel expansion scales with the number of instances
Solution: Projectron/Projectron++, which project the new example onto the span of the existing support vectors
Theory: mistake bound
Summary
Learn a nonlinear model by the kernel expansion
$$f(\mathbf{x}) = \langle f(\cdot), k(\mathbf{x}, \cdot) \rangle_{\mathcal{H}} = \sum_{i} \alpha_i k(\mathbf{x}, \mathbf{x}_i)$$
Target: maximizing accuracy
Scalability issue: the kernel expansion scales with the number of instances
Advantages
✤ Different strategies to tackle the scalability issue
✤ Mistake bounds are provided
Kernelized Online Imbalanced Learning with Fixed Budgets
[Hu et al., AAAI 2015; Hu et al., IEEE TNNLS, submitted]
Signal detection task: learning from imbalanced data
Learning from Imbalanced Data
Problems
✤ Accuracy: dominated by the majority class
✤ Misclassification costs for the two classes are different
✤ Misclassification costs are unknown
Learning from Imbalanced Data
Evaluation metrics
✤ Accuracy = (TP+TN)/(TP+FP+TN+FN)
✤ Precision = TP/(TP+FP)
✤ Recall (Sensitivity) = TP/(TP+FN)
✤ $F_\beta = (1+\beta^2) \cdot \text{Recall} \cdot \text{Precision} / (\beta^2 \cdot \text{Recall} + \text{Precision})$; with $\beta = 1$, $F_1 = 2 \cdot \text{Recall} \cdot \text{Precision} / (\text{Recall} + \text{Precision})$
✤ AUC (area under the Receiver Operating Characteristic (ROC) curve): the probability that a classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one
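The probabilistic definition of AUC translates directly into a pairwise count; a small NumPy sketch, with labels assumed in {-1, +1}:

```python
import numpy as np

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs that the scores
    rank correctly; ties count half. Labels are assumed in {-1, +1}."""
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    diff = pos[:, None] - neg[None, :]      # all pairwise score gaps
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
```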
Kernelized Online Imbalanced Learning (KOIL)
Problem and motivation
✤ Imbalanced data
✤ Metric: accuracy vs. AUC
✤ Scenario: nonlinearity and streaming data
Challenges
✤ How to attain smooth updates?
✤ How to control functional complexity?
✤ How to determine kernels?
Setup for KOIL
Goal: seek a nonlinear decision function for imbalanced data, with two fixed-budget buffers to store the weights of the support vectors
KOIL: AUC Maximization
Given $f$ and the data, the AUC is calculated by counting correctly ranked positive/negative pairs; a convex surrogate (the pairwise hinge loss) replaces the non-convex indicator
Objective: minimize the localized instantaneous regularized risk of AUC
(Plot: hinge loss as a function of $yf(\mathbf{x})$.)
KOIL: Intuition
Two buffers, one per class, keep track of the global information
Update is based on the $k$-nearest opposite support vectors
KOIL: Update Buffer
Case I: buffer is not full
• Insert the new support vector directly
Case II: buffer is full
• Reservoir Sampling (RS)
• First-In-First-Out (FIFO)
Observation: performance decays once replacement begins. Why: information loss. How to compensate?
KOIL: Update Kernel
Minimize the localized instantaneous regularized risk of AUC
Update: stochastic gradient descent
KOIL: Update Buffer with Compensation
1. Remove an SV via FIFO or Reservoir Sampling (RS)
2. Compensate for the loss by transferring the removed weight to its most similar SV
3. Set the new weight:
$$f_{t+1}^{++}(\mathbf{x}) = \hat{f}_{t+1}(\mathbf{x}) + \alpha_c \cdot k(\mathbf{x}_c, \mathbf{x}) = f_{t+1}(\mathbf{x}) \underbrace{- \alpha_r k(\mathbf{x}_r, \mathbf{x})}_{\text{Removal}} + \underbrace{\alpha_c \cdot k(\mathbf{x}_c, \mathbf{x})}_{\text{Compensation}}$$
Theoretical Results
Bounds are established on the norm of $f$, the pairwise hinge loss, and the regret
Smooth Pairwise Hinge Loss
Smooth pairwise hinge loss:
$$\ell_{sh}(f, \mathbf{z}, \mathbf{z}') = \frac{|y - y'|}{2} \left( 1 - \frac{1}{2} (y - y') (f(\mathbf{x}) - f(\mathbf{x}')) \right)_{+}^{2}$$
Minimize the smooth localized instantaneous regularized risk of AUC:
$$\tilde{L}_t(f) := \frac{1}{2} \|f\|_{\mathcal{H}}^2 + C \sum_{\mathbf{z}_i} \ell_{sh}(f, \mathbf{z}_t, \mathbf{z}_i)$$
Stochastic gradient descent:
$$\tilde{f}_{t+1} := \tilde{f}_t - \eta\, \partial_f \tilde{L}_t(f) \big|_{f = \tilde{f}_t}$$
Update the weights:
$$\alpha_{i,t} = \begin{cases} 2 \eta C y_t \sum_{i \in V_t} \ell_h(\tilde{f}_t, \mathbf{z}_t, \mathbf{z}_i), & i = t, \\ (1 - \eta)\, \alpha_{i,t-1} - 2 \eta C y_t\, \ell_h(\tilde{f}_t, \mathbf{z}_t, \mathbf{z}_i), & \forall i \in V_t \end{cases}$$
Theoretical Results
Remarks: the regret satisfies
$$\tilde{R}_T \sim \begin{cases} O(1) & \text{if } L^* = 0 \\ O(\sqrt{T}) & \text{otherwise} \end{cases}$$
Empirical Evaluation
KOIL performs better than online linear AUC algorithms
KOIL significantly outperforms all competing methods
KOIL: Compensation
The performance of RS/FIFO decreases when the budget is full
RS++/FIFO++ (with compensation) approximate KOIL with an infinite budget
KOIL: Effect of Buffer Size
Performance is stable when the buffer size is relatively large
KOIL cannot capture sufficient information when the buffer size is too small
KOIL: Effect of Localization k
When $k$ is small, KOIL cannot learn well
When $k$ is large, KOIL is easily affected by noisy data
KOIL: How to Select Kernels?
Solution: cross validation. Problem: much time is spent on poor kernels
Good solution: multiple kernel learning (MKL)
Properties: summing kernels corresponds to concatenating feature spaces
✤ e.g., $K = \sum_{q=1}^{Q} \theta_q K_q$, $\theta_q \geq 0$
Given $k_1(\mathbf{z}_1, \mathbf{z}_2) = \langle \phi_1(\mathbf{z}_1), \phi_1(\mathbf{z}_2) \rangle$ and $k_2(\mathbf{z}_1, \mathbf{z}_2) = \langle \phi_2(\mathbf{z}_1), \phi_2(\mathbf{z}_2) \rangle$,
$$k_1(\mathbf{z}_1, \mathbf{z}_2) + k_2(\mathbf{z}_1, \mathbf{z}_2) = \left\langle \begin{pmatrix} \phi_1(\mathbf{z}_1) \\ \phi_2(\mathbf{z}_1) \end{pmatrix}, \begin{pmatrix} \phi_1(\mathbf{z}_2) \\ \phi_2(\mathbf{z}_2) \end{pmatrix} \right\rangle$$
KOIL with MKL: Setup
Given a set of kernel functions $\mathcal{K} = \{k_l : \mathcal{X} \times \mathcal{X} \to \mathbb{R},\ l = 1, \dots, m\}$
Learn a linear combination of the kernel classifiers:
$$F_t(\mathbf{x}) = \sum_{l=1}^{m} q_t^l \cdot \text{sign}(f_{l,t}(\mathbf{x})), \qquad \mathbf{q}_t = \mathbf{w}_t / \|\mathbf{w}_t\|_1$$
The $l$-th kernel classifier at the $t$-th trial is
$$f_{l,t}(\mathbf{x}) = \sum_{i \in I_t^+} \alpha_{l,i,t}^{+} k_l(\mathbf{x}_i, \mathbf{x}) + \sum_{j \in I_t^-} \alpha_{l,j,t}^{-} k_l(\mathbf{x}_j, \mathbf{x})$$
KOIL with MKL: Update
Let the loss function be the localized instantaneous risk with the regularization term, $\check{L}_t(f)$
Update for the classifier weights at the $t$-th trial
✤ Choose each classifier with probability proportional to $w_t^l / [\max_j w_t^j]$
✤ Penalize the weight of each classifier based on its loss: $w_{t+1}^i = w_t^i \exp(-\eta \check{L}_t(f_i, y_t))$
Update rules for the SV weights
✤ Non-smooth pairwise hinge loss:
$$\alpha_{i,t} = \begin{cases} \eta C y_t |V_t|, & i = t, \\ \alpha_{i,t-1} - \eta C y_t, & \forall i \in V_t, \\ \alpha_{i,t-1}, & \text{otherwise} \end{cases}$$
✤ Smooth pairwise hinge loss:
$$\alpha_{i,t} = \begin{cases} 2 \eta C y_t \sum_{i \in V_t} \ell_h(\tilde{f}_t, \mathbf{z}_t, \mathbf{z}_i), & i = t, \\ \alpha_{i,t-1} - 2 \eta C y_t\, \ell_h(\tilde{f}_t, \mathbf{z}_t, \mathbf{z}_i), & \forall i \in V_t, \\ \alpha_{i,t-1}, & \text{otherwise} \end{cases}$$
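The multiplicative penalty on the classifier weights is a Hedge-style update; a minimal sketch, assuming the per-classifier losses for the current trial are already computed:

```python
import numpy as np

def update_classifier_weights(w, losses, eta=0.1):
    """Hedge-style multiplicative update on the kernel-classifier
    weights, then L1 normalization to get combination weights q."""
    w = w * np.exp(-eta * np.asarray(losses))  # w_i <- w_i * exp(-eta * L_i)
    q = w / w.sum()                            # q_t = w_t / ||w_t||_1
    return w, q
```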
KOIL with MKL: Theory
KOIL with MKL: Results
KOIL with MKL attains better or comparable performance on most datasets
Summary of Online Nonlinear Classifiers
Models: the kernel expansion
$$f(\mathbf{x}) = \langle f(\cdot), k(\mathbf{x}, \cdot) \rangle_{\mathcal{H}} = \sum_{i} \alpha_i k(\mathbf{x}, \mathbf{x}_i)$$
Pros and cons
✤ More powerful
✤ Theoretical guarantees
✤ But: complex and slow in training
Discussions and Open Issues
How to learn from Big Data to tackle the 4V characteristics?
✤ Volume: size grows from terabyte to exabyte to zettabyte
✤ Variety: structured, semi-structured, unstructured
✤ Velocity: batch data, real-time data, streaming data
✤ Veracity: data inconsistency, ambiguities, deception
Discussions and Open Issues
Data issues
✤ High dimensionality
✤ Sparsity
✤ Structure
✤ Noisy and incomplete data
✤ Concept drift
✤ Domain adaptation
✤ Background knowledge incorporation
Platform issues
✤ Parallel computing
✤ Distributed computing
User interaction
✤ Interactive OL vs. passive OL
✤ Crowdsourcing
Discussions and Open Issues
Applications
✤ Social networks and social media
✤ Speech recognition and identification (e.g., Siri)
✤ Financial engineering
✤ Medical and healthcare informatics
✤ Science and research: human genome decoding, particle discoveries, astronomy
Conclusion
Introduced Big Data and its challenges and opportunities
Introduced online learning and its possible applications
Surveyed classical and state-of-the-art online learning techniques for
✤ Linear models
  • Non-sparse learning models
  • Sparse learning models
✤ Nonlinear models
  • Learning with kernels
One-Slide Takeaway
Online learning is a promising tool for big data analytics
Many challenges exist
✤ Real-world scenarios: concept drift, sparse data, high-dimensional data, uncertain/imprecise data, etc.
✤ More advanced online learning algorithms: faster convergence rates, lower memory cost, etc.
✤ Parallel implementation or running on distributed platforms
If you have any questions, please contact Dr. Haiqin Yang via hqyang@ieee.org
References
• G. Cavallanti, N. Cesa-Bianchi, and C. Gentile. Tracking the best hyperplane with a simple budget
perceptron. Machine Learning 69 (2-3):143–167, 2007.
• N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press,
New York, NY, USA, 2006.
• K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive
algorithms. Journal of Machine Learning Research, 7:551–585, 2006.
• K. Crammer, A. Kulesza, and M. Dredze. Adaptive regularization of weight vectors. In NIPS, pages
414–422, 2009.
• K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of
Machine Learning Research, 3:951–991, 2003.
• O. Dekel, S. Shalev-Shwartz, and Y. Singer. The Forgetron: A kernel-based perceptron on a budget.
SIAM Journal on Computing, 37(5):1342–1372, 2008.
• Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine
Learning, 37(3):277–296, 1999.
• W. Gao, R. Jin, S. Zhu, and Z.-H. Zhou. One-pass AUC optimization. In ICML, pages 906–914,
2013.
• J. C. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.
• C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning
Research, 2:213–242, 2001.
References
•
J. Hu, H. Yang, I. King, M. R. Lyu, and A. M.-C. So. Kernelized online imbalanced learning with
fixed budgets. In AAAI, Austin Texas, USA, Jan. 25-30 2015.
•
R. Jin, S. C. H. Hoi, and T. Yang. Online multiple kernel learning: Algorithms and mistake
bounds. In ALT, pages 390–404, 2010.
•
J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions
on Signal Processing, 52(8):2165–2176, 2004.
•
J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of
Machine Learning Research, 10:777–801, 2009.
•
Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1-3):
361–387, 2002.
•
P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM J.
Numer. Anal., 16(6):964–979, 1979.
•
Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion
Paper 2007/76, Catholic University of Louvain, Center for Operations Research and
Econometrics, 2007.
•
F. Orabona and K. Crammer. New adaptive algorithms for online classification. In NIPS, pages
1840–1848, 2010.
•
F. Orabona, J. Keshet, and B. Caputo. Bounded kernel-based online learning. Journal of
Machine Learning Research, 10:2643–2666, 2009.
References
•
S. Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and
Trends in Machine Learning 4(2): 107-194. 2012.
•
Y. Wang, R. Khardon, D. Pechyony, and R. Jones. Generalization bounds for online learning algorithms with pairwise loss functions. In COLT, pages 13.1–13.22, 2012.
•
L. Xiao. Dual averaging methods for regularized stochastic learning and online
optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
•
H. Yang, I. King, and M. R. Lyu. Sparse Learning Under Regularization Framework. LAP
Lambert Academic Publishing, April 2011.
•
H. Yang, M. R. Lyu, and I. King. Efficient online learning for multi-task feature selection.
ACM Transactions on Knowledge Discovery from Data, 2013.
•
H. Yang, Z. Xu, I. King, and M. R. Lyu. Online learning for group lasso. In ICML, pages
1191–1198, Haifa, Israel, 2010.
•
H. Yang, Z. Xu, J. Ye, I. King, and M. R. Lyu. Efficient sparse generalized multiple kernel
learning. IEEE Transactions on Neural Networks, 22(3):433–446, March 2011.
•
T. Yang, M. Mahdavi, R. Jin, J. Yi, and S. C. H. Hoi. Online kernel selection: Algorithms
and evaluations. In AAAI, 2012.
•
P. Zhao, S. C. H. Hoi, R. Jin, and T. Yang. Online AUC maximization. In ICML, pages 233–240, 2011.