Online Learning for
Big Data Analytics
Irwin King and Haiqin Yang
Dept. of Computer Science and Engineering
The Chinese University of Hong Kong
Outline
Introduction
✤ Big data: definition and history
✤ Online learning and its applications
Online Learning Algorithms
✤ Perceptron
✤ Online linear classifiers
  • Online non-sparse learning
  • Online sparse learning
✤ Online nonlinear classifiers
  • Online learning with kernels
  • Kernelized online imbalanced learning
What is Big Data?
Big Data refers to datasets that grow so large and complex that it is difficult to capture, store, manage, share, analyze, and visualize them within current computational architectures.
The First Big Data Challenge
1880 US census
✤ Data: 50 million people
✤ Features (5): age, gender (sex), occupation, education level, no. of insane people in household
✤ Processing time: > 7 years
The First Big Data Solution
Hollerith Tabulating System
✤ Punched cards – 80 variables
✤ Used for the 1890 census
✤ 6 weeks instead of 7+ years
Big Science
The International Geophysical Year
✤ An international scientific project
✤ Lasted from Jul. 1, 1957 to Dec. 31, 1958
✤ A synoptic collection of observational data on a global scale
Implications: big budgets, big staffs, big machines, big …
Summary of Big Science
Laid foundation for ambitious projects
✤ International Biological Program
✤ Long Term Ecological Research Network
Ended in 1974; many participants viewed it as a failure
Nevertheless, it was a success
✤ Transformed the way of processing data
✤ Realized the original incentives
Lessons from Big Science
Spawned new big data projects
✤ Weather prediction
✤ Physics research (supercollider data analytics)
✤ Astronomy images (planet detection)
✤ Medical research (drug interaction)
✤ …
Businesses latched onto its techniques, methodologies, and objectives
Big Science vs. Big Business
Common
✤ Need technologies to work with data
✤ Use algorithms to mine data
Big science
✤ Source: experiments and research conducted in controlled environments
✤ Goals: to answer questions or prove theories
Big business
✤ Source: transactions that occur naturally, with little control
✤ Goals: discover new opportunities, measure efficiency, …
Big Data is Everywhere
Lots of data is being collected and warehoused
✤ Science experiments
✤ Web data, e-commerce
✤ Purchases at department/grocery stores
✤ Bank/credit card transactions
Big Data from Science
CERN - Large Hadron Collider
✤ ~10 PB/year in 2008
✤ ~1000 PB in ~10 years
✤ 2500-physicist collaboration
Large Synoptic Survey Telescope
✤ ~5-10 PB/year at start in 2012
✤ ~100 PB by 2025
Pan-STARRS (Haleakala, Hawaii)
✤ now: 800 TB/year
✤ soon: 4 PB/year
Characteristics and Challenges
Big Data Analytics
Definition: a process of inspecting, cleaning, transforming, and modeling big data with the goal of discovering useful information, suggesting conclusions, and supporting decision making
Connection to data mining
✤ Analytics includes both data analysis (mining) and communication (guiding decision making)
✤ Analytics is not so much concerned with individual analyses or analysis steps as with the entire methodology
Challenges and Aims
Challenges: capturing, storing, searching, sharing, analyzing, and visualizing
Big data is not just about size
✤ Find insights from complex, noisy, heterogeneous, longitudinal, and voluminous data
✤ Aim to answer questions that were previously unanswered
This tutorial focuses on online learning techniques
Learning Techniques Overview
Learning paradigms
✤ Supervised learning
✤ Semi-supervised learning
✤ Transductive learning
✤ Unsupervised learning
✤ Universum learning
✤ Transfer learning
What is Online Learning?
Batch/Offline Learning
✤ Observe a batch of training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$
✤ Learn a model from them
✤ Make accurate predictions on new samples
Online Learning
✤ Observe a sequence of data $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_t, y_t)$
✤ Learn a model incrementally as new instances appear
✤ Make the sequence of online predictions accurately
(Diagram: the learner makes a prediction, receives the true response from the user, and updates its model, repeatedly.)
Online learning is the process of answering a sequence of questions given (maybe partial) knowledge of the correct answers to previous questions and possibly additional available information. [Shalev-Shwartz, 2012]
Online Prediction Algorithms
An initial prediction rule $f_0(\cdot)$
For $t$ = 1, 2, …
✤ We observe $\mathbf{x}_t$ and make a prediction $f_{t-1}(\mathbf{x}_t)$
✤ We observe the true outcome $y_t$ and then compute a loss $\ell(f_{t-1}(\mathbf{x}_t), y_t)$
✤ The online algorithm updates the prediction rule using the new example $(\mathbf{x}_t, y_t)$
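This protocol translates directly into code. Below is a minimal Python sketch; the `model` object (with `predict`/`update` methods) and the `loss` function are hypothetical placeholders for whatever learner and loss are plugged in.

```python
# A minimal sketch of the online prediction protocol. The `model`
# object and `loss` function are hypothetical placeholders.
def online_protocol(model, stream, loss):
    total_loss = 0.0
    for x_t, y_t in stream:             # examples arrive one at a time
        y_hat = model.predict(x_t)      # predict with f_{t-1}
        total_loss += loss(y_hat, y_t)  # observe y_t, suffer a loss
        model.update(x_t, y_t)          # update f_{t-1} -> f_t
    return total_loss
```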
Online Prediction Algorithms
The total error of the algorithm is
$$\sum_{t=1}^{T} \ell(f_{t-1}(\mathbf{x}_t), y_t)$$
Goal: this error is to be as small as possible
Predict the unknown future one step at a time: similar to generalization error
Regret Analysis
$f^*(\cdot)$: the optimal prediction function from a class $\mathcal{H}$, e.g., the class of linear classifiers, with the minimum error after seeing all instances:
$$f^*(\cdot) := \arg\min_{f \in \mathcal{H}} \sum_{t=1}^{T} \ell(f(\mathbf{x}_t), y_t)$$
Regret for the online learning algorithm:
$$\text{Regret} = \frac{1}{T} \sum_{t=1}^{T} \left[ \ell(f_{t-1}(\mathbf{x}_t), y_t) - \ell(f^*(\mathbf{x}_t), y_t) \right]$$
We want the regret to be as small as possible!
Why Low Regret?
Regret for the online learning algorithm:
$$\text{Regret} = \frac{1}{T} \sum_{t=1}^{T} \left[ \ell(f_{t-1}(\mathbf{x}_t), y_t) - \ell(f^*(\mathbf{x}_t), y_t) \right]$$
Advantages
✤ Do not lose much from not knowing future events
✤ Perform almost as well as someone who observes the entire sequence and picks the best prediction strategy in hindsight
✤ Compete with a changing environment
Advantages of Online Learning
Meets many applications where data arrive sequentially while predictions are required on-the-fly
✤ Avoids re-training when adding new data
Applicable in adversarial and competitive environments (e.g., spam filtering, stock market)
Strong adaptability to changing environments
High efficiency and excellent scalability
Simple to understand and easy to implement
Easy to parallelize
Where to Apply Online Learning?
(Diagram: online learning at the center, connected to social media, Internet security, and finance.)
Online Learning for Social Media
(Slides of social-media applications.)
Online Learning for Internet Security
Electronic business sectors
✤ Spam filtering
✤ Fraudulent credit-card transaction detection
✤ Network intrusion detection systems
Online Learning for Financial Decisions
Financial decisions
✤ Online portfolio selection
✤ Sequential investment
Perceptron Algorithm
One of the oldest machine learning algorithms: an online algorithm for learning a linear threshold classifier with small error [F. Rosenblatt, 1958]
Activation function:
$$o = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}, \qquad \sum_{i=0}^{n} w_i x_i = \mathbf{w}^T \mathbf{x}$$
(Diagram: inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$ and bias $w_0$ feeding the threshold unit.)
Goal: find a linear classifier with small error [F. Rosenblatt, 1958]
If there is no error, keep the same weights; otherwise, update them.
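A minimal NumPy sketch of this update rule (the learning rate eta and the zero initialization are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def perceptron(X, y, eta=1.0, epochs=1):
    """Online perceptron [Rosenblatt, 1958]: update only on mistakes.
    X: (n_samples, n_features) array; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_t, y_t in zip(X, y):
            if y_t * (w @ x_t) <= 0:   # mistake (or on the boundary)
                w += eta * y_t * x_t   # w <- w + eta * y_t * x_t
    return w
```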
Perceptron: Geometric View
Start from an initial weight vector $\mathbf{w}_0$.
At $t = 1$: $(\mathbf{x}_1, y_1 (= 1))$ is misclassified, so update $\mathbf{w}_1 = \mathbf{w}_0 + \eta \mathbf{x}_1 y_1$.
At $t = 2$: $(\mathbf{x}_2, y_2 (= -1))$ is misclassified, so update $\mathbf{w}_2 = \mathbf{w}_1 + \eta \mathbf{x}_2 y_2$.
Continue…
(Diagrams: the decision boundary rotates toward correctly classifying each mistaken example.)
Perceptron Mistake Bound
Consider data well separated by some $\mathbf{w}_*$, i.e., $\mathbf{w}_*^T \mathbf{x}_i y_i > 0$ for all $i$.
Define the margin
$$\gamma = \frac{\min_i |\mathbf{w}_*^T \mathbf{x}_i|}{\|\mathbf{w}_*\|_2 \, \sup_i \|\mathbf{x}_i\|_2}$$
The larger the margin, the more confident the separation; the larger the norm of $\mathbf{x}$, the larger the mistake bound.
The number of mistakes the perceptron makes is at most $1/\gamma^2$.
Proof of Perceptron Mistake Bound
Proof: Let $\mathbf{v}_k$ be the hypothesis before the $k$-th mistake. Assume that the $k$-th mistake occurs on the input example $(\mathbf{x}_i, y_i)$. [Novikoff, 1963]
First, the update $\mathbf{v}_{k+1} = \mathbf{v}_k + y_i \mathbf{x}_i$ increases the alignment with $\mathbf{w}_*$: $\mathbf{w}_*^T \mathbf{v}_{k+1} \geq \mathbf{w}_*^T \mathbf{v}_k + \min_i |\mathbf{w}_*^T \mathbf{x}_i|$.
Second, the norm grows slowly: since the example was misclassified, $y_i \mathbf{v}_k^T \mathbf{x}_i \leq 0$, so $\|\mathbf{v}_{k+1}\|_2^2 \leq \|\mathbf{v}_k\|_2^2 + \sup_i \|\mathbf{x}_i\|_2^2$.
Combining the two bounds after $k$ mistakes yields $k \leq 1/\gamma^2$, with
$$\gamma = \frac{\min_i |\mathbf{w}_*^T \mathbf{x}_i|}{\|\mathbf{w}_*\|_2 \, \sup_i \|\mathbf{x}_i\|_2}$$
Overview
Online learning algorithms divide into linear classifiers and nonlinear classifiers.
Online/Stochastic Gradient Descent
✤ Online gradient descent
✤ Stochastic gradient descent
(Diagrams: both follow the same loop of receiving $\mathbf{x}_t$, predicting $f_{t-1}(\mathbf{x}_t)$, observing $y_t$, and updating $f_t$.)
Online Linear Classifiers
Model: $f(\mathbf{x}) = \text{sgn}(\mathbf{w}^T \mathbf{x})$ for classification
Categories
✤ Non-sparse models: the weight vector $\mathbf{w}$ is non-sparse
✤ Sparse models: the weight vector contains zero elements
Online Non-Sparse Learning
First-order learning methods
✤ Online gradient descent (Zinkevich, 2003)
✤ Passive-aggressive learning (Crammer et al., 2006)
✤ Others (including but not limited to)
  • ALMA: Approximate Large Margin Algorithm (Gentile, 2001)
  • ROMMA: Relaxed Online Maximum Margin Algorithm (Li & Long, 2002)
  • MIRA: Margin Infused Relaxed Algorithm (Crammer & Singer, 2003)
Online Gradient Descent (OGD)
Online convex optimization [Zinkevich, 2003]
Consider a convex objective function $f: S \to \mathbb{R}$, where $S \subseteq \mathbb{R}^n$ is a bounded convex set
✤ Update by Online Gradient Descent (OGD) or Stochastic Gradient Descent (SGD):
$$\mathbf{w}_{t+1} = \Pi_S(\mathbf{w}_t - \eta \nabla f(\mathbf{w}_t))$$
where $\eta$ is a learning rate, the inner step is gradient descent, and $\Pi_S$ is the projection onto $S$
Provides a framework to prove regret bounds for online convex optimization
Online Gradient Descent (OGD)
For $t$ = 1, 2, … [Zinkevich, 2003]
✤ An unlabeled sample $\mathbf{x}_t$ arrives
✤ Make a prediction based on existing weights: $\hat{y}_t = \text{sgn}(\mathbf{w}_t^T \mathbf{x}_t)$
✤ Observe the true class label $y_t \in \{-1, +1\}$
✤ Update the weights by $\mathbf{w}_{t+1} = \Pi_S(\mathbf{w}_t - \eta \nabla f(\mathbf{w}_t))$
Achieves an $O(\sqrt{T})$ regret bound
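A minimal sketch of OGD for binary classification with the hinge loss, taking $S$ to be an L2 ball; the loss, the radius, and the constant learning rate are illustrative assumptions:

```python
import numpy as np

def ogd_hinge(stream, dim, eta=0.1, radius=10.0):
    """OGD sketch with hinge loss; S is taken to be an L2 ball of the
    given radius, one illustrative choice of bounded convex set."""
    w, mistakes = np.zeros(dim), 0
    for x_t, y_t in stream:                    # y_t in {-1, +1}
        mistakes += np.sign(w @ x_t) != y_t    # predict, then observe y_t
        if y_t * (w @ x_t) < 1.0:              # hinge-loss subgradient
            w = w + eta * y_t * x_t            # gradient step
        norm = np.linalg.norm(w)
        if norm > radius:                      # projection Pi_S
            w *= radius / norm
    return w, mistakes
```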
Passive-Aggressive Online Learning
Each example defines a set of consistent hypotheses (those suffering zero loss on it)
The new weight vector is set to be the projection of the current one onto this set: passive when the loss is already zero, aggressive otherwise
Variants: classification, regression, and uniclass
Passive-Aggressive Online Learning [Crammer et al., 2006]
✤ PA (binary classification)
✤ PA-I (C-SVM)
✤ PA-II (relaxed C-SVM)
Each round solves a simple constrained objective with a closed-form solution
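The closed-form solutions are one-liners. A sketch of the PA-I update for binary classification, following the step-size rule $\tau_t = \min(C, \ell_t / \|\mathbf{x}_t\|^2)$ of Crammer et al. (2006):

```python
import numpy as np

def pa1_update(w, x_t, y_t, C=1.0):
    """One PA-I step (Crammer et al., 2006): if the hinge loss is zero,
    stay passive; otherwise take the smallest step, capped at C, toward
    the set of hypotheses consistent with (x_t, y_t)."""
    loss = max(0.0, 1.0 - y_t * (w @ x_t))    # hinge loss
    if loss > 0.0:
        tau = min(C, loss / (x_t @ x_t))      # closed-form step size
        w = w + tau * y_t * x_t
    return w
```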
Online Non-Sparse Learning
First-order methods
✤ Learn a linear weight vector (first order) of the model
Pros and cons
✤ Simple and easy to implement
✤ Efficient and scalable for high-dimensional data
✤ But: relatively slow convergence rate
Online Sparse Learning
Motivations
✤ Space constraint: RAM overflow
✤ Test-time constraint
✤ How to induce sparse solutions?
Online Sparse Learning
Objective function (an ℓ1-regularized loss, in a standard form):
$$\min_{\mathbf{w}} \sum_{t} \ell(\mathbf{w}; \mathbf{z}_t) + \lambda \|\mathbf{w}\|_1$$
Problem in online learning
✤ Standard stochastic gradient descent does not yield sparse solutions
Some representative work
✤ Truncated gradient (Langford et al., 2009)
✤ FOBOS: Forward-Looking Subgradients (Duchi & Singer, 2009)
✤ Dual averaging (Xiao, 2010)
✤ etc.
Truncated Gradient [Langford et al., 2009]
Objective function: the ℓ1-regularized loss above
Stochastic gradient descent (SGD): $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla \ell(\mathbf{w}_t; \mathbf{z}_t)$
Simple coefficient rounding when the coefficient is small
Truncated gradient: impose sparsity by rounding the solution of stochastic gradient descent toward zero
Truncated Gradient
Simple coefficient rounding vs. less aggressive truncation
Loss functions
✤ Logistic
✤ Hinge
✤ Least squares
When the gravity parameter is zero, the update rule is identical to standard SGD
Truncated Gradient [Langford et al., 2009]
The amount of shrinkage is measured by a gravity parameter
The truncation can be performed every $K$ online steps
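A sketch of truncated gradient with squared loss, truncating every K steps; the parameter names (gravity g, threshold theta) follow Langford et al. (2009), while the loss and the schedule below are illustrative choices:

```python
import numpy as np

def truncate(w, gravity, theta):
    """Shrink coefficients with |w_j| <= theta toward zero by `gravity`;
    leave larger coefficients untouched (less aggressive than rounding)."""
    small = np.abs(w) <= theta
    w[small] = np.sign(w[small]) * np.maximum(np.abs(w[small]) - gravity, 0.0)
    return w

def truncated_gradient(stream, dim, eta=0.01, g=1e-4, theta=0.1, K=10):
    """Truncated-gradient sketch with squared loss: plain SGD steps,
    plus a truncation with accumulated gravity K*eta*g every K steps."""
    w = np.zeros(dim)
    for t, (x_t, y_t) in enumerate(stream, 1):
        w -= eta * (w @ x_t - y_t) * x_t   # SGD step, squared loss
        if t % K == 0:
            w = truncate(w, K * eta * g, theta)
    return w
```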
Truncated Gradient
Regret bound [Langford et al., 2009]: with an appropriately chosen learning rate (decaying on the order of $1/\sqrt{T}$), the algorithm attains an $O(\sqrt{T})$ regret
FOBOS
FOrward-Backward Splitting
Objective function: a loss plus a regularizer $r(\mathbf{w})$
Repeat
I. Unconstrained (stochastic) subgradient/gradient step on the loss
II. Incorporate the regularization
Similar to
- Forward-backward splitting (Lions and Mercier, 1979)
- Composite gradient methods (Wright et al., 2009; Nesterov, 2007)
FOBOS: Step I
Unconstrained (stochastic) subgradient/gradient step on the loss:
$$\mathbf{w}_{t+\frac{1}{2}} = \mathbf{w}_t - \eta_t \mathbf{g}_t, \qquad \mathbf{g}_t \in \partial \ell_t(\mathbf{w}_t)$$
FOBOS: Step II [Duchi & Singer, 2009]
Incorporate the regularization:
$$\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \left\{ \frac{1}{2} \|\mathbf{w} - \mathbf{w}_{t+\frac{1}{2}}\|^2 + \eta_{t+\frac{1}{2}} r(\mathbf{w}) \right\}$$
The first term stays close to the interim vector; the second induces sparsity
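For the ℓ1 regularizer $r(\mathbf{w}) = \lambda \|\mathbf{w}\|_1$, Step II has a closed form, the soft-thresholding operator. A minimal sketch of one FOBOS iteration under that assumption:

```python
import numpy as np

def fobos_l1_step(w, grad, eta, lam):
    """One FOBOS iteration with r(w) = lam * ||w||_1.
    Step I: unconstrained (sub)gradient step on the loss.
    Step II: the proximal step reduces to soft-thresholding, keeping w
    close to the interim vector while zeroing small coordinates."""
    w_half = w - eta * grad                                   # Step I
    return np.sign(w_half) * np.maximum(np.abs(w_half) - eta * lam, 0.0)  # Step II
```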
Forward Looking Property
The optimum of Step II satisfies $\mathbf{0} \in \mathbf{w}_{t+1} - \mathbf{w}_{t+\frac{1}{2}} + \eta_{t+\frac{1}{2}} \partial r(\mathbf{w}_{t+1})$, so
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \mathbf{g}_t - \eta_{t+\frac{1}{2}} \mathbf{g}_{t+1}^r$$
where $\mathbf{g}_t$ is the current subgradient of the loss and $\mathbf{g}_{t+1}^r \in \partial r(\mathbf{w}_{t+1})$ is a forward subgradient of the regularization: the update combines the current loss with the forward regularization
Batch Convergence and Online Regret
Set the step size to decay (e.g., $\eta_t \propto 1/\sqrt{t}$) to obtain batch convergence
Online (average) regret bounds of order $O(1/\sqrt{T})$
High-Dimensional Efficiency
Input space is sparse but huge
Need to perform lazy updates to $\mathbf{w}$
Proposition: the lazy (on-demand, coordinate-wise) updates are equivalent to the full sequential updates
Dual Averaging Method [Xiao, 2010]
Objective function: the ℓ1-regularized loss
Problem: truncated gradient does not produce truly sparse weights due to the small learning rate
Fix: dual averaging keeps two state representations
- the weight parameter $\mathbf{w}_t$
- the average (sub)gradient vector $\bar{\mathbf{g}}_t = \frac{1}{t} \sum_{i=1}^{t} \nabla f_i(\mathbf{w}_i)$
Dual Averaging Method: Algorithm [Xiao, 2010]
Three main steps per round, with a closed-form solution for $\mathbf{w}_{t+1}$
Advantage: sparsity of the weight vector
Disadvantage: keeps a non-sparse (sub)gradient average
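A sketch of the ℓ1 variant of this closed-form update, following the ℓ1-RDA rule in Xiao (2010) with proximal function $h(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2$ and $\beta_t = \gamma\sqrt{t}$; `grad_fn` is a hypothetical placeholder for the subgradient of the loss:

```python
import numpy as np

def l1_rda(stream, dim, grad_fn, lam=0.01, gamma=1.0):
    """l1-RDA sketch (Xiao, 2010): maintain the (dense) running average
    subgradient g_bar and recover a sparse weight in closed form."""
    g_bar, w = np.zeros(dim), np.zeros(dim)
    for t, z_t in enumerate(stream, 1):
        g = grad_fn(w, z_t)                  # subgradient of loss at w_t
        g_bar = ((t - 1) * g_bar + g) / t    # update average subgradient
        # Closed-form w_{t+1}: entries with |g_bar| <= lam become zero.
        w = np.where(np.abs(g_bar) <= lam, 0.0,
                     -(np.sqrt(t) / gamma) * (g_bar - lam * np.sign(g_bar)))
    return w
```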
Convergence and Regret
Average regret and theoretical bounds: the method attains an $O(\sqrt{T})$ regret bound
Comparison
FOBOS
✤ Subgradient $\mathbf{g}_t$
✤ Local Bregman divergence
✤ Coefficient $1/\alpha_t = \Theta(\sqrt{t})$
$$\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \left\{ \langle \mathbf{g}_t, \mathbf{w} \rangle + \Psi(\mathbf{w}) + \frac{1}{2\alpha_t} \|\mathbf{w} - \mathbf{w}_t\|_2^2 \right\}$$
Equivalent to the TG method when $\Psi(\mathbf{w}) = \|\mathbf{w}\|_1$
Dual Averaging
✤ Average subgradient $\bar{\mathbf{g}}_t$
✤ Global proximal function $h(\mathbf{w})$
✤ Coefficient $\beta_t / t = \Theta(1/\sqrt{t})$
$$\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \left\{ \langle \bar{\mathbf{g}}_t, \mathbf{w} \rangle + \Psi(\mathbf{w}) + \frac{\beta_t}{t} h(\mathbf{w}) \right\}$$
Empirical Comparison
Variants of Online Sparse Learning Models
Other existing work
✤ Online learning for group lasso (Yang et al., 2010)
✤ Online learning for multi-task feature selection (Yang et al., 2013)
Efficient Online Learning for Multi-Task Feature Selection [Yang et al., ACM TKDD 2013]
Online Learning for Multi-Task Feature Selection
Observations
✤ Limited labeled data in each task
✤ Related tasks may help each other
✤ Features among tasks are redundant or irrelevant
✤ Data come in sequence
Model: integrate multiple tasks with task-coupling regularizations (e.g., aMTFS)
Solution: online learning for multi-task feature selection
Challenge: how to update the model adaptively as data arrive sequentially
Multi-Task Feature Selection
Data: i.i.d. observations $D = \bigcup_{q=1}^{Q} D_q$, where $D_q = \{\mathbf{z}_i^q = (\mathbf{x}_i^q, y_i^q)\}_{i=1}^{N_q}$ is sampled from $P_q$, $q = 1, \dots, Q$
Models: the weight matrix
$$W = [\mathbf{w}^1, \mathbf{w}^2, \dots, \mathbf{w}^Q] = (W_{\bullet 1}, \dots, W_{\bullet Q}) = (W_{1\bullet}^\top, \dots, W_{d\bullet}^\top)^\top$$
Objective:
$$\min_{W} \sum_{q=1}^{Q} \frac{1}{N_q} \sum_{i=1}^{N_q} \ell^q(W_{\bullet q}, \mathbf{z}_i^q) + \Omega(W)$$
Different regularizations yield different models
• iMTFS (tasks treated independently): $\Omega(W) = \sum_{q=1}^{Q} \|W_{\bullet q}\|_1 = \sum_{j=1}^{d} \|W_{j\bullet}\|_1$
• aMTFS (all tasks coupled per feature): $\Omega(W) = \sum_{j=1}^{d} \|W_{j\bullet}\|_2$
• MTFTS: $\Omega_{\lambda, r}(W) = \sum_{j=1}^{d} \left( \lambda r_j \|W_{j\bullet}\|_1 + \|W_{j\bullet}\|_2 \right)$
Online Learning Framework
Objective:
$$\min_{W} \sum_{q=1}^{Q} \frac{1}{N_q} \sum_{i=1}^{N_q} \ell^q(W_{\bullet q}, \mathbf{z}_i^q) + \Omega(W)$$
Three main steps
✤ Compute the subgradient
✤ Update the average subgradient
✤ Update the weights
Closed-form solution (e.g., for aMTFS)
Efficiency: $O(d \times Q)$ per round
Regret bound: $\bar{R}_T \sim O(1/\sqrt{T})$
Empirical Study: Conjoint Analysis
Description
✤ Objective: predict ratings by estimating respondents' partworths vectors
✤ Data: ratings of 180 students for 20 different personal computers, $Q$ = 180
✤ Features (d = 14): telephone hot line (TE), amount of memory (RAM), screen size (SC), CPU speed (CPU), hard disk (HD), CD-ROM/multimedia (CD), cache (CA), color (CO), availability (AV), warranty (WA), software (SW), guarantee (GU), and price (PR)
Setup
✤ Evaluation: root mean square errors (RMSEs)
✤ Loss: square loss
Evaluation Results
Accuracy
✤ Learning partworths vectors across respondents simultaneously helps to improve the performance
✤ Online learning algorithms attain nearly the same accuracies as batch-trained algorithms
Memory cost: substantially lower for the online algorithms than for the batch ones
Computational cost: substantially lower for the online algorithms than for the batch ones
Learned Features
Observations
• Features learned by the online algorithms are consistent with those learned by the batch-trained algorithms
• Ratings are strongly negative on the price and positive on the RAM, the CPU speed, CD-ROM, etc.
(Plots: learned partworth values for the features TE, RAM, SC, CPU, HD, CD, CA, CO, AV, WA, SW, GU, PR.)
Online Sparse Learning Summary
Objective
✤ Induce sparsity in the learned weights of online learning algorithms
Pros and cons
✤ Simple and easy to implement
✤ Efficient and scalable for high-dimensional data
✤ But: relatively slow convergence rate
Online Nonlinear Classifiers
Model: the kernel expansion
$$f(\mathbf{x}) = \langle f(\cdot), k(\mathbf{x}, \cdot) \rangle_{\mathcal{H}} = \sum_{i} \alpha_i k(\mathbf{x}, \mathbf{x}_i)$$
Categories
✤ Online learning with kernels
✤ Kernelized online imbalanced learning
What is a Kernel?
Similarity defined in the original space: $\mathbf{x}_i^T \mathbf{x}_j$
Similarity defined in the kernel space: $k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$, where $\phi$ maps inputs into a (possibly implicit) feature space
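For concreteness, a small sketch contrasting the two notions of similarity; the Gaussian (RBF) kernel is one common choice of $k$ whose feature map $\phi$ is implicit:

```python
import numpy as np

def linear_similarity(xi, xj):
    return xi @ xj   # similarity in the original space

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian kernel: an inner product in an implicit feature space,
    computed without ever forming phi(x) explicitly."""
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))
```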
Online Learning with Kernels
Objective: minimize an instantaneous regularized risk at each round,
$$f_{t+1} = \arg\min_{f \in \mathcal{H}_k} \left\{ \ell(f(\mathbf{x}_t), y_t) + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2 \right\}$$
Representative methods
✤ Naive Online Regularization Minimization Algorithm (NORMA) [Kivinen et al., 2004]
✤ Forgetron [Dekel et al., 2006]
✤ Randomized Budget Perceptron (RBP) [Cavallanti et al., 2007]
✤ Projectron [Orabona et al., 2008]
✤ etc.
NORMA [Kivinen et al., 2004]
Problem: the kernel expansion scales with the number of instances
Solution: drop the smaller terms
Theory
✤ Truncation error
✤ Mistake bound
Applications
✤ Classification
✤ Novelty detection
✤ Regression
NORMA [Kivinen et al., 2004]
Objective: the instantaneous regularized risk,
$$f_{t+1} = \arg\min_{f \in \mathcal{H}_k} \left\{ \ell(f(\mathbf{x}_t), y_t) + \frac{\lambda}{2} \|f\|_{\mathcal{H}}^2 \right\}$$
Update rule: standard (stochastic) gradient descent, which shrinks the old coefficients by $(1 - \eta\lambda)$ and adds a new expansion term proportional to $k(\mathbf{x}_t, \cdot)$ when the loss is positive
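A minimal sketch of NORMA for binary classification with the hinge loss, under the update just described; the truncation of small terms that NORMA uses to bound the expansion is omitted here:

```python
import numpy as np

def norma_hinge(stream, kernel, eta=0.1, lam=0.01):
    """NORMA sketch with hinge loss: f_t(x) = sum_i alpha_i k(x_i, x).
    Each round shrinks all coefficients by (1 - eta*lam); on a positive
    loss it appends a new expansion term with coefficient eta*y_t."""
    sv, alpha = [], []
    f = lambda x: sum(a * kernel(xi, x) for a, xi in zip(alpha, sv))
    for x_t, y_t in stream:                  # y_t in {-1, +1}
        margin = y_t * f(x_t)
        alpha = [(1.0 - eta * lam) * a for a in alpha]   # shrink all terms
        if margin < 1.0:                     # hinge loss is positive
            sv.append(x_t)
            alpha.append(eta * y_t)          # add a new expansion term
    return sv, alpha
```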
Forgetron [Dekel et al., 2006]
Problem: the kernel expansion scales with the number of instances
Solution: a fixed budget of support vectors
Strategies
✤ Remove the oldest support vector
✤ Gentler scaling
✤ Smarter removal
Theory: a mistake bound of the form $M \leq 2U^2 + 4 \sum_{t=1}^{T} \hat{\ell}_t$ against any competitor $g$ with $\|g\| \leq U$
Randomized Budget Perceptron (RBP) [Cavallanti et al., 2007]
Problem: the kernel expansion scales with the number of instances
Solution
✤ Shifting perceptron
✤ Randomly discard a support vector when the budget is full
Theory
✤ Mistake bound
Projectron/Projectron++ [Orabona et al., 2008]
Problem: the kernel expansion scales with the number of instances
Solution: Projectron/Projectron++, which project the new example onto the span of the existing support vectors
Theory: mistake bound
Summary
Learn a nonlinear model by the kernel expansion
$$f(\mathbf{x}) = \langle f(\cdot), k(\mathbf{x}, \cdot) \rangle_{\mathcal{H}} = \sum_{i} \alpha_i k(\mathbf{x}, \mathbf{x}_i)$$
Target: maximizing accuracy
Scalability issue: the kernel expansion scales with the number of instances
Advantages
✤ Different strategies to tackle the scalability issue
✤ Mistake bounds are provided
Kernelized Online Imbalanced Learning with Fixed Budgets
[Hu et al., AAAI 2015; Hu et al., IEEE TNNLS, submitted]
Signal detection task: learning from imbalanced data
Learning from Imbalanced Data
Problems
✤ Accuracy: dominated by the majority class
✤ Misclassification costs for the two classes are different
✤ Misclassification costs are unknown
Learning from Imbalanced Data
Evaluation metrics
✤ Accuracy = (TP+TN)/(TP+FP+TN+FN)
✤ Precision = TP/(TP+FP)
✤ Recall (Sensitivity) = TP/(TP+FN)
✤ $F_\beta = (1+\beta^2) \cdot \text{Recall} \cdot \text{Precision} / (\beta^2 \cdot \text{Recall} + \text{Precision})$; with $\beta = 1$, $F_1 = 2 \cdot \text{Recall} \cdot \text{Precision} / (\text{Recall} + \text{Precision})$
✤ AUC (area under the Receiver Operating Characteristic (ROC) curve): the probability that a classifier ranks a randomly chosen positive instance higher than a randomly chosen negative one
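The probabilistic definition of AUC translates directly into a pairwise count; a small NumPy sketch, with labels assumed in {-1, +1}:

```python
import numpy as np

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs that the scores
    rank correctly; ties count half. Labels are assumed in {-1, +1}."""
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    diff = pos[:, None] - neg[None, :]      # all pairwise score gaps
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
```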
Kernelized Online Imbalanced Learning (KOIL)
Problem and motivation
✤ Imbalanced data
✤ Metric: accuracy vs. AUC
✤ Scenario: nonlinearity and streaming data
Challenges
✤ How to attain smooth updates?
✤ How to control functional complexity?
✤ How to determine kernels?
Setup for KOIL
Goal: seek a nonlinear decision function for imbalanced data, with two fixed-budget buffers to store the weights of the support vectors
KOIL: AUC Maximization
Given $f$ and the data, the AUC is calculated by counting correctly ranked positive/negative pairs; a convex surrogate (the pairwise hinge loss) replaces the non-convex indicator
Objective: minimize the localized instantaneous regularized risk of AUC
(Plot: hinge loss as a function of $yf(\mathbf{x})$.)
KOIL: Intuition
Two buffers, one per class, keep track of the global information
Update is based on the $k$-nearest opposite support vectors
KOIL: Update Buffer
Case I: buffer is not full
• Insert the new support vector directly
Case II: buffer is full
• Reservoir Sampling (RS)
• First-In-First-Out (FIFO)
Observation: performance decays once replacement begins. Why: information loss. How to compensate?
KOIL: Update Kernel
Minimize the localized instantaneous regularized risk of AUC
Update: stochastic gradient descent
KOIL: Update Buffer with Compensation
1. Remove an SV via FIFO or Reservoir Sampling (RS)
2. Compensate for the loss by transferring the removed weight to its most similar SV
3. Set the new weight:
$$f_{t+1}^{++}(\mathbf{x}) = \hat{f}_{t+1}(\mathbf{x}) + \alpha_c \cdot k(\mathbf{x}_c, \mathbf{x}) = f_{t+1}(\mathbf{x}) \underbrace{- \alpha_r k(\mathbf{x}_r, \mathbf{x})}_{\text{Removal}} + \underbrace{\alpha_c \cdot k(\mathbf{x}_c, \mathbf{x})}_{\text{Compensation}}$$
Theoretical Results
Bounds are established on the norm of $f$, the pairwise hinge loss, and the regret
Smooth Pairwise Hinge Loss
Smooth pairwise hinge loss:
$$\ell_{sh}(f, \mathbf{z}, \mathbf{z}') = \frac{|y - y'|}{2} \left( 1 - \frac{1}{2} (y - y') (f(\mathbf{x}) - f(\mathbf{x}')) \right)_{+}^{2}$$
Minimize the smooth localized instantaneous regularized risk of AUC:
$$\tilde{L}_t(f) := \frac{1}{2} \|f\|_{\mathcal{H}}^2 + C \sum_{\mathbf{z}_i} \ell_{sh}(f, \mathbf{z}_t, \mathbf{z}_i)$$
Stochastic gradient descent:
$$\tilde{f}_{t+1} := \tilde{f}_t - \eta\, \partial_f \tilde{L}_t(f) \big|_{f = \tilde{f}_t}$$
Update the weights:
$$\alpha_{i,t} = \begin{cases} 2 \eta C y_t \sum_{i \in V_t} \ell_h(\tilde{f}_t, \mathbf{z}_t, \mathbf{z}_i), & i = t, \\ (1 - \eta)\, \alpha_{i,t-1} - 2 \eta C y_t\, \ell_h(\tilde{f}_t, \mathbf{z}_t, \mathbf{z}_i), & \forall i \in V_t \end{cases}$$
Theoretical Results
Remarks: the regret satisfies
$$\tilde{R}_T \sim \begin{cases} O(1) & \text{if } L^* = 0 \\ O(\sqrt{T}) & \text{otherwise} \end{cases}$$
Empirical Evaluation
KOIL performs better than online linear AUC algorithms
KOIL significantly outperforms all competing methods
KOIL: Compensation
The performance of RS/FIFO decreases when the budget is full
RS++/FIFO++ (with compensation) approximate KOIL with an infinite budget
KOIL: Effect of Buffer Size
Performance is stable when the buffer size is relatively large
KOIL cannot capture sufficient information when the buffer size is too small
KOIL: Effect of Localization k
When $k$ is small, KOIL cannot learn well
When $k$ is large, KOIL is easily affected by noisy data
KOIL: How to Select Kernels?
Solution: cross validation. Problem: much time is spent on poor kernels
Good solution: multiple kernel learning (MKL)
Properties: summing kernels corresponds to concatenating feature spaces
✤ e.g., $K = \sum_{q=1}^{Q} \theta_q K_q$, $\theta_q \geq 0$
Given $k_1(\mathbf{z}_1, \mathbf{z}_2) = \langle \phi_1(\mathbf{z}_1), \phi_1(\mathbf{z}_2) \rangle$ and $k_2(\mathbf{z}_1, \mathbf{z}_2) = \langle \phi_2(\mathbf{z}_1), \phi_2(\mathbf{z}_2) \rangle$,
$$k_1(\mathbf{z}_1, \mathbf{z}_2) + k_2(\mathbf{z}_1, \mathbf{z}_2) = \left\langle \begin{pmatrix} \phi_1(\mathbf{z}_1) \\ \phi_2(\mathbf{z}_1) \end{pmatrix}, \begin{pmatrix} \phi_1(\mathbf{z}_2) \\ \phi_2(\mathbf{z}_2) \end{pmatrix} \right\rangle$$
KOIL with MKL: Setup
Given a set of kernel functions $\mathcal{K} = \{k_l : \mathcal{X} \times \mathcal{X} \to \mathbb{R},\ l = 1, \dots, m\}$
Learn a linear combination of the kernel classifiers:
$$F_t(\mathbf{x}) = \sum_{l=1}^{m} q_t^l \cdot \text{sign}(f_{l,t}(\mathbf{x})), \qquad \mathbf{q}_t = \mathbf{w}_t / \|\mathbf{w}_t\|_1$$
The $l$-th kernel classifier at the $t$-th trial is
$$f_{l,t}(\mathbf{x}) = \sum_{i \in I_t^+} \alpha_{l,i,t}^{+} k_l(\mathbf{x}_i, \mathbf{x}) + \sum_{j \in I_t^-} \alpha_{l,j,t}^{-} k_l(\mathbf{x}_j, \mathbf{x})$$
KOIL with MKL: Update
Let the loss function be the localized instantaneous risk with the regularization term, $\check{L}_t(f)$
Update for the classifier weights at the $t$-th trial
✤ Choose each classifier with probability proportional to $w_t^l / [\max_j w_t^j]$
✤ Penalize the weight of each classifier based on its loss: $w_{t+1}^i = w_t^i \exp(-\eta \check{L}_t(f_i, y_t))$
Update rules for the SV weights
✤ Non-smooth pairwise hinge loss:
$$\alpha_{i,t} = \begin{cases} \eta C y_t |V_t|, & i = t, \\ \alpha_{i,t-1} - \eta C y_t, & \forall i \in V_t, \\ \alpha_{i,t-1}, & \text{otherwise} \end{cases}$$
✤ Smooth pairwise hinge loss:
$$\alpha_{i,t} = \begin{cases} 2 \eta C y_t \sum_{i \in V_t} \ell_h(\tilde{f}_t, \mathbf{z}_t, \mathbf{z}_i), & i = t, \\ \alpha_{i,t-1} - 2 \eta C y_t\, \ell_h(\tilde{f}_t, \mathbf{z}_t, \mathbf{z}_i), & \forall i \in V_t, \\ \alpha_{i,t-1}, & \text{otherwise} \end{cases}$$
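The multiplicative penalty on the classifier weights is a Hedge-style update; a minimal sketch, assuming the per-classifier losses for the current trial are already computed:

```python
import numpy as np

def update_classifier_weights(w, losses, eta=0.1):
    """Hedge-style multiplicative update on the kernel-classifier
    weights, then L1 normalization to get combination weights q."""
    w = w * np.exp(-eta * np.asarray(losses))  # w_i <- w_i * exp(-eta * L_i)
    q = w / w.sum()                            # q_t = w_t / ||w_t||_1
    return w, q
```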
KOIL with MKL: Theory
KOIL with MKL: Results
KOIL with MKL attains better or comparable performance on most datasets
Summary of Online Nonlinear Classifiers
Models: the kernel expansion
$$f(\mathbf{x}) = \langle f(\cdot), k(\mathbf{x}, \cdot) \rangle_{\mathcal{H}} = \sum_{i} \alpha_i k(\mathbf{x}, \mathbf{x}_i)$$
Pros and cons
✤ More powerful
✤ Theoretical guarantees
✤ But: complex and slow in training
Discussions and Open Issues
How to learn from Big Data to tackle the 4V characteristics?
✤ Volume: size grows from terabyte to exabyte to zettabyte
✤ Variety: structured, semi-structured, unstructured
✤ Velocity: batch data, real-time data, streaming data
✤ Veracity: data inconsistency, ambiguities, deception
Discussions and Open Issues
Data issues
✤ High dimensionality
✤ Sparsity
✤ Structure
✤ Noisy and incomplete data
✤ Concept drift
✤ Domain adaptation
✤ Background knowledge incorporation
Platform issues
✤ Parallel computing
✤ Distributed computing
User interaction
✤ Interactive OL vs. passive OL
✤ Crowdsourcing
Discussions and Open Issues
Applications
✤ Social networks and social media
✤ Speech recognition and identification (e.g., Siri)
✤ Financial engineering
✤ Medical and healthcare informatics
✤ Science and research: human genome decoding, particle discoveries, astronomy
Conclusion
Introduced Big Data and its challenges and opportunities
Introduced online learning and its possible applications
Surveyed classical and state-of-the-art online learning techniques for
✤ Linear models
  • Non-sparse learning models
  • Sparse learning models
✤ Nonlinear models
  • Learning with kernels
One-Slide Takeaway
Online learning is a promising tool for big data analytics
Many challenges exist
✤ Real-world scenarios: concept drift, sparse data, high-dimensional data, uncertain/imprecise data, etc.
✤ More advanced online learning algorithms: faster convergence rates, lower memory cost, etc.
✤ Parallel implementation or running on distributed platforms
If you have any questions, please contact Dr. Haiqin Yang via hqyang@ieee.org
References
• G. Cavallanti, N. Cesa-Bianchi, and C. Gentile. Tracking the best hyperplane with a simple budget
perceptron. Machine Learning 69 (2-3):143–167, 2007.
• N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press,
New York, NY, USA, 2006.
• K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive
algorithms. Journal of Machine Learning Research, 7:551–585, 2006.
• K. Crammer, A. Kulesza, and M. Dredze. Adaptive regularization of weight vectors. In NIPS, pages
414–422, 2009.
• K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of
Machine Learning Research, 3:951–991, 2003.
• O. Dekel, S. Shalev-Shwartz, and Y. Singer. The Forgetron: A kernel-based perceptron on a budget.
SIAM Journal on Computing, 37(5):1342–1372, 2008.
• Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine
Learning, 37(3):277–296, 1999.
• W. Gao, R. Jin, S. Zhu, and Z.-H. Zhou. One-pass AUC optimization. In ICML, pages 906–914,
2013.
• J. C. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.
• C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning
Research, 2:213–242, 2001.
References
•
J. Hu, H. Yang, I. King, M. R. Lyu, and A. M.-C. So. Kernelized online imbalanced learning with
fixed budgets. In AAAI, Austin Texas, USA, Jan. 25-30 2015.
•
R. Jin, S. C. H. Hoi, and T. Yang. Online multiple kernel learning: Algorithms and mistake
bounds. In ALT, pages 390–404, 2010.
•
J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions
on Signal Processing, 52(8):2165–2176, 2004.
•
J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of
Machine Learning Research, 10:777–801, 2009.
•
Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1-3):
361–387, 2002.
•
P. L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM J.
Numer. Anal., 16(6):964–979, 1979.
•
Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion
Paper 2007/76, Catholic University of Louvain, Center for Operations Research and
Econometrics, 2007.
•
F. Orabona and K. Crammer. New adaptive algorithms for online classification. In NIPS, pages
1840–1848, 2010.
•
F. Orabona, J. Keshet, and B. Caputo. Bounded kernel-based online learning. Journal of
Machine Learning Research, 10:2643–2666, 2009.
References
•
S. Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and
Trends in Machine Learning 4(2): 107-194. 2012.
•
Y. Wang, R. Khardon, D. Pechyony, and R. Jones. Generalization bounds for online learning algorithms with pairwise loss functions. In COLT, pages 13.1–13.22, 2012.
•
L. Xiao. Dual averaging methods for regularized stochastic learning and online
optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
•
H. Yang, I. King, and M. R. Lyu. Sparse Learning Under Regularization Framework. LAP
Lambert Academic Publishing, April 2011.
•
H. Yang, M. R. Lyu, and I. King. Efficient online learning for multi-task feature selection.
ACM Transactions on Knowledge Discovery from Data, 2013.
•
H. Yang, Z. Xu, I. King, and M. R. Lyu. Online learning for group lasso. In ICML, pages
1191–1198, Haifa, Israel, 2010.
•
H. Yang, Z. Xu, J. Ye, I. King, and M. R. Lyu. Efficient sparse generalized multiple kernel
learning. IEEE Transactions on Neural Networks, 22(3):433–446, March 2011.
•
T. Yang, M. Mahdavi, R. Jin, J. Yi, and S. C. H. Hoi. Online kernel selection: Algorithms
and evaluations. In AAAI, 2012.
•
P. Zhao, S. C. H. Hoi, R. Jin, and T. Yang. Online AUC maximization. In ICML, pages 233–240, 2011.