Kernel methods with Imbalanced Data and Applications
to Weather Prediction
Theodore B. Trafalis
Laboratory of Optimization and Intelligent Systems
School of Industrial and Systems Engineering
University of Oklahoma
San Francisco Airport Marriott Waterfront Congress Center
Part I
Kernel Methods for Imbalanced Data and
Application to Tornado Prediction
1 Imbalanced Data
What is an Imbalanced Data Problem?
Impact of Imbalanced Data on Learning Machines
State of the Art Techniques for Imbalanced Learning
2 Kernel Methods
General Description of Kernel Methods
Support Vector Machines Applied to Imbalanced Data
3 Application to Tornado Prediction
Description of the Experiments
Results for the Tornado Data Set
Imbalanced Data Problems and their Importance
• The problem of learning from an imbalanced data set occurs when the
number of samples in one class is significantly greater than that of the other class.
• Imbalanced data is very important in data mining and data classification.
Examples of imbalanced data sets include:
• Fraudulent credit card transactions.
• Telecommunication equipment failures.
• Oil spills from satellite images.
• Tornado, earthquake and landslide occurrences.
Example of the Classification of Imbalanced Data
Source: Tang et al. SVMs Modeling for Highly Imbalanced Classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 39(1):281-288, 2009.
Between Class and Within Class Imbalances
Source: He and Garcia. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263-1284, 2009.
Impact of Imbalanced Data Problems in Classification
• Classifiers tend to provide an imbalanced degree of accuracy with the
majority class having close to 100 percent accuracy, and the minority class having an accuracy close to 0-10 percent.
• In the tornado data set, for example, a 10 percent accuracy for the
minority class means that only about 72 of the 721 tornadic observations would be detected; all the others would be classified as non-tornadoes.
Illustration of Impact of Imbalanced Classification
On the left, the accuracy of the minority class is zero percent. On the right, the accuracy for the minority class is 80 percent.
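To make the effect concrete, the short sketch below computes overall and per-class accuracy for a hypothetical classifier that always answers "no tornado". The class counts are those of the training/validation set described later in this talk; the function name `accuracies` and the trivial classifier are illustrative assumptions, not part of the original experiments.

```python
# Class counts of the training/validation set reported in this talk.
N_TORNADO = 361
N_NON_TORNADO = 5048


def accuracies(hits, misses, false_alarms, correct_nulls):
    """Return (overall, minority-class, majority-class) accuracy."""
    total = hits + misses + false_alarms + correct_nulls
    overall = (hits + correct_nulls) / total
    minority = hits / (hits + misses)
    majority = correct_nulls / (false_alarms + correct_nulls)
    return overall, minority, majority


# Hypothetical classifier that never predicts a tornado:
# every tornado is a miss, every non-tornado is a correct null.
overall, minority, majority = accuracies(0, N_TORNADO, 0, N_NON_TORNADO)
print(f"overall={overall:.3f} minority={minority:.3f} majority={majority:.3f}")
```

The overall accuracy is about 93 percent even though no tornado is detected, which is why overall accuracy alone is a misleading metric for imbalanced data.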
Current Approaches for Imbalanced Learning I
• Algorithm Level
  • Threshold method.
  • Learn only the minority class.
  • Cost-sensitive approaches.
• Data Level
  • Random under-sampling and over-sampling.
  • Informed under-sampling (EasyEnsemble, BalanceCascade).
  • Synthetic sampling with data generation (SMOTE).
  • Adaptive synthetic sampling (Borderline-SMOTE, ADASYN).
  • Sampling with data cleaning (OSS method, CNN+Tomek Links integration method, Neighborhood Cleaning rule, SMOTE+ENN, SMOTE+Tomek).
  • Cluster-based sampling (CBO).
  • Integration of sampling and boosting (SMOTEBoost).
• Kernel-based approaches
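The data-level techniques listed above can be illustrated with a minimal sketch. It shows random under-sampling, random over-sampling, and a simplified SMOTE-style interpolation between a minority sample and one of its nearest minority neighbours. The function names, the toy points and the choice `k=3` are assumptions made for illustration; this is not the reference SMOTE implementation.

```python
import random


def random_undersample(majority, n_keep, rng):
    """Keep n_keep majority samples, drawn without replacement."""
    return rng.sample(majority, n_keep)


def random_oversample(minority, n_target, rng):
    """Replicate minority samples (drawn with replacement) up to n_target."""
    extra = [rng.choice(minority) for _ in range(n_target - len(minority))]
    return list(minority) + extra


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points, each on the segment joining a minority
    sample to one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


rng = random.Random(0)
# Toy data (hypothetical): 50 majority points, 5 minority points.
majority = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(50)]
minority = [(3.0, 3.0), (3.5, 2.5), (2.8, 3.4), (3.2, 3.1), (3.6, 3.3)]

undersampled = random_undersample(majority, len(minority), rng)
oversampled = random_oversample(minority, len(majority), rng)
synthetic = smote_like(minority, n_new=45, k=3, rng=rng)
print(len(undersampled), len(oversampled), len(minority) + len(synthetic))
```

Under-sampling discards majority information, over-sampling repeats minority points exactly, and the synthetic variant fills the region between existing minority points.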
Current Approaches for Imbalanced Learning II
• Kernel Logistic Regression.
• Evaluation metrics. Metrics used to evaluate accuracies.
• Receiver Operating Characteristic (ROC), Precision-Recall (PR) and Cost Curves.
• Singular assessment metrics based on the confusion or multi-class cost matrix (F-measure, G-mean, etc).
Source: He and Garcia. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263-1284, 2009.
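As an illustration of the singular assessment metrics mentioned above, the following sketch computes the F-measure and the G-mean from the four entries of a binary confusion matrix. The counts are hypothetical and the function names are assumptions; the formulas are the usual textbook definitions.

```python
import math


def f_measure(hits, misses, false_alarms, beta=1.0):
    """F-measure: weighted harmonic mean of precision and recall."""
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + misses)
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


def g_mean(hits, misses, false_alarms, correct_nulls):
    """G-mean: geometric mean of the accuracies on the two classes."""
    sensitivity = hits / (hits + misses)
    specificity = correct_nulls / (correct_nulls + false_alarms)
    return math.sqrt(sensitivity * specificity)


# Hypothetical confusion matrix for a rare positive class.
f1 = f_measure(hits=80, misses=20, false_alarms=20)
gm = g_mean(hits=80, misses=20, false_alarms=20, correct_nulls=880)
print(f"F1={f1:.3f} G-mean={gm:.3f}")
```

Both metrics drop to zero when the minority class is never detected, unlike overall accuracy.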
Problems with Imbalanced Data Prediction
• Improper evaluation metrics.
• Lack of data: absolute rarity. Number of observations is small in absolute sense.
• Relative lack of data: relative rarity. Relative to other events.
• Data fragmentation. Absolute lack of data within a single partition.
Historical Perspective
• Efficient algorithms for detecting linear relations were used in the 1950s and 1960s (perceptron algorithm).
• Handling nonlinear relationships was seen as a major research goal at that time, but developing nonlinear algorithms with the same efficiency and stability proved elusive.
• In the mid-1980s, the field of pattern analysis underwent a nonlinear revolution with backpropagation neural networks (NNs) and decision trees. These were based on heuristics and lacked a firm theoretical foundation (local minima problems, nonconvexity).
• In the mid-1990s, kernel-based methods were developed that retain the guarantees and understanding developed for linear algorithms.
Overview I
• Kernel Methods are a new class of machine learning algorithms which
can operate on very general types of data and can detect very general types of relations (e.g., Potential function method; Aizerman et al., 1964, Vapnik, 1982, 1995).
• Correlation, factor, cluster and discriminant analysis are some of the
types of machine learning analysis tasks that can be performed on data as diverse as sequences, text, images, graphs and vectors using kernels.
• Kernel methods provide also a natural way to merge and integrate
Overview II
• Kernel methods offer a modular framework.
• In a first step, a dataset is processed into a kernel matrix. Data can be
of various types and also of heterogeneous types.
• In a second step, a variety of kernel algorithms can be used to analyze
Modular Framework
Source: J. Shawe-Taylor and N. Cristianini, Kernel methods for pattern analysis, 2004.
Basic Idea of Kernel Methods
• Kernel Methods work by:
• Embedding data in a vector space called feature space using a kernel function.
• Looking for linear relations in such a space.
• Much of the geometry of the data in the embedding space (relative positions) is contained in all pair-wise inner products (information bottleneck).
• We can work in feature space by specifying an inner product function $k$ between points in it.
• In many cases, inner product in the embedding space (feature space) is
Properties of Kernels I
Definition (Mercer kernel)
Let $E$ be any set. A function $k : E \times E \to \mathbb{R}$ that is continuous, symmetric and finitely positive semi-definite is called here a Mercer kernel.
Definition (Finitely positive semi-definiteness)
A function $k : E \times E \to \mathbb{R}$, where $E$ is any set, is a finitely positive semi-definite kernel if
$$\sum_{i=1}^{m} \sum_{j=1}^{m} k(x_i, x_j)\, \lambda_i \lambda_j \ge 0,$$
for any $m \in \mathbb{N}$, $\lambda_i \in \mathbb{R}$, $x_i \in E$ and $i \in \{1, \dots, m\}$. It can be seen as the
Properties of Kernels II
Definition (RKHS)
A Reproducing Kernel Hilbert Space (RKHS) $F$ is a Hilbert space of complex-valued functions on a set $E$ for which there exists a function $k : E \times E \to \mathbb{C}$ (the reproducing kernel) such that $k(\cdot, x) \in F$ for any $x \in E$ and such that $\langle f, k(\cdot, x) \rangle = f(x)$ for all $f \in F$ (reproducing property).
• If $k$ is a symmetric positive definite kernel then, by the Moore-Aronszajn theorem, there is a unique RKHS with $k$ as the reproducing kernel.
• A symmetric positive definite kernel $k$ can be expressed as a dot product $k : (x, y) \mapsto \langle \phi(x), \phi(y) \rangle$, where $\phi$ is a map from $\mathbb{R}^n$ to a RKHS $H$
Properties of Kernels III
Properties
• For any $x_1, \dots, x_\ell$ the $\ell \times \ell$ matrix $K$ with entries $K_{ij} = k(x_i, x_j)$ is symmetric and positive semi-definite.
• The matrix $K$ is called the kernel matrix.
• A kernel $k$ can be expressed as $k : (x, y) \mapsto \langle \phi(x), \phi(y) \rangle$, where $\phi$ is a map from $\mathbb{R}^n$ to a Hilbert space $H$ (kernel trick).
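The properties above can be checked numerically on a small example. The sketch below uses the homogeneous polynomial kernel of degree 2 on R^2, for which an explicit feature map is known, and verifies that the kernel value equals the inner product of the mapped points, that the kernel matrix is symmetric, and that one quadratic form is non-negative. The kernel choice, the random points and all names are assumptions for illustration.

```python
import math
import random


def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2 on R^2: k(x, y) = <x, y>^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def phi(x):
    """Explicit feature map into R^3 such that <phi(x), phi(y)> = k(x, y)."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(1)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(6)]

# Kernel matrix K_ij = k(x_i, x_j), computed without ever forming phi.
K = [[poly2_kernel(p, q) for q in points] for p in points]

# Largest gap between the kernel and the explicit inner product.
max_gap = max(
    abs(K[i][j] - dot(phi(points[i]), phi(points[j])))
    for i in range(6) for j in range(6)
)

# One instance of the quadratic form sum_ij K_ij lambda_i lambda_j.
lam = [rng.uniform(-1, 1) for _ in points]
quad_form = sum(K[i][j] * lam[i] * lam[j] for i in range(6) for j in range(6))
print(max_gap, quad_form)
```

A single non-negative quadratic form does not prove positive semi-definiteness; here it follows from the explicit feature map, since the form equals the squared norm of the weighted sum of the mapped points.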
Properties of Kernels IV
The image of $\mathbb{R}^d$ by $\phi$ is a manifold $S$ in $H$. Kernels can be interpreted
as measures of distance and measures of angles on $S$. Simple geometric
Support Vector Machines
Definition (SVM)
Support Vector Machines are a family of learning algorithms that use kernel methods to solve supervised learning problems.
• Common supervised learning tasks concern problems of classification and regression.
• SVMs work by solving Quadratic Programming problems that aim to minimize the generalization error.
• If we are given a set $S$ of $\ell$ points $x_i \in \mathbb{R}^n$ where each $x_i$ belongs to either of two classes defined by $y_i \in \{-1, +1\}$, then the objective is to find a hyperplane that divides $S$ leaving all the points of the same class on the
Optimal Separating Hyperplane
Source: Microsoft Research. Vision, Graphics, and Visualization Group, http://research.microsoft.com/en-us/groups/vgv/.
Dual problem in nonlinear case
• The optimal hyperplane is obtained by solving the following Quadratic Programming (QP) problem:
$$\min_{\alpha \in \mathbb{R}^{\ell}} \left\{ \alpha^{T} K \alpha / 2 - \langle 1, \alpha \rangle \;:\; \langle \alpha, y \rangle = 0 \text{ and } 0 \le \alpha \le C \right\}.$$
• This QP problem is the dual formulation of a QP problem that maximizes the margin of separation between the sets of points in the feature space.
• Given a solution $\alpha^*$, the optimal hyperplane is expressed as $\{x \in \mathbb{R}^n : \sum_{i=1}^{\ell} \alpha_i^* y_i k(x_i, x) + b = 0\}$ where $b$ is computed using the complementary slackness conditions of the primal formulation.
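A toy instance may help to read the dual problem and the resulting hyperplane. The sketch below assumes the usual convention that the labels are absorbed into the matrix of the quadratic term (entries y_i y_j k(x_i, x_j)), which the slide does not state explicitly. It uses two one-dimensional points, for which the dual solution alpha = (0.5, 0.5) can be derived by hand, and recovers b from a support vector lying on the margin. All names and data are illustrative.

```python
def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, y, X, kernel):
    """alpha^T Q alpha / 2 - <1, alpha> with Q_ij = y_i y_j k(x_i, x_j)."""
    n = len(alpha)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * kernel(X[i], X[j])
        for i in range(n) for j in range(n)
    )
    return 0.5 * quad - sum(alpha)


def decision_function(alpha, y, X, b, kernel, x):
    """sum_i alpha_i y_i k(x_i, x) + b; its sign gives the predicted class."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b


# Two training points on the real line, one per class.
X = [(-1.0,), (1.0,)]
y = [-1, 1]

# Hand-derived dual solution: <alpha, y> = 0 forces alpha_1 = alpha_2 = a,
# and minimizing 2a^2 - 2a gives a = 0.5.
alpha = [0.5, 0.5]

# Complementary slackness: a support vector with 0 < alpha_i < C satisfies
# y_i * f(x_i) = 1, hence b = y_i - sum_j alpha_j y_j k(x_j, x_i).
b = y[0] - decision_function(alpha, y, X, 0.0, linear_kernel, X[0])

best = dual_objective(alpha, y, X, linear_kernel)
other = dual_objective([0.4, 0.4], y, X, linear_kernel)  # another feasible point
score = decision_function(alpha, y, X, b, linear_kernel, (2.0,))
print(b, best, other, score)
```

The separating hyperplane is x = 0 here, and both training points lie exactly on the margin.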
Binary Classification of Imbalanced Data with SVMs
• Binary classification of imbalanced data needs a rewriting of the primal SVM problem, namely:
$$\min_{w, b, \xi} \; \frac{1}{2} \langle w, w \rangle_H + C_{1} \sum_{y_i = 1} \xi_i + C_{-1} \sum_{y_i = -1} \xi_i$$
subject to
$$y_i \left( \langle w, \phi(x_i) \rangle_H + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \forall i \in [1, \ell].$$
• $C_{1}$ is the trade-off coefficient for the minority class and $C_{-1}$ is the trade-off coefficient for the majority class. For imbalanced data, we wish to have $C_{-1} < C_{1}$, i.e. the penalty for outliers in the minority class is greater than the one for the majority class.
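The effect of the two trade-off coefficients can be seen by evaluating the primal objective for a fixed hyperplane. The sketch below assumes a linear feature map (phi(x) = x), takes +1 as the minority class, and uses the smallest slack that satisfies each constraint. The data, the hyperplane and the coefficient values are hypothetical.

```python
def weighted_primal_objective(w, b, X, y, c_pos, c_neg):
    """0.5 <w, w> + C_1 * (slacks of class +1) + C_-1 * (slacks of class -1),
    with a linear feature map and the smallest feasible slack per point."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    penalty = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        slack = max(0.0, 1.0 - margin)  # xi_i >= 0 and margin >= 1 - xi_i
        penalty += (c_pos if yi == 1 else c_neg) * slack
    return regularizer + penalty


# Toy data: two minority (+1) points, three majority (-1) points.
X = [(2.0,), (0.5,), (-1.0,), (-2.0,), (-3.0,)]
y = [1, 1, -1, -1, -1]
w, b = (1.0,), 0.0

# Only the minority point at 0.5 violates the margin (slack 0.5).
balanced = weighted_primal_objective(w, b, X, y, c_pos=1.0, c_neg=1.0)
weighted = weighted_primal_objective(w, b, X, y, c_pos=10.0, c_neg=1.0)
print(balanced, weighted)
```

With the larger minority coefficient the same margin violation costs ten times more, so an optimizer is pushed to move the hyperplane away from the minority class.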
Illustration of SVM Training with Imbalanced Data
Source: Tang et al. SVMs Modeling for Highly Imbalanced Classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 39(1):281-288, 2009.
One-Class SVM for Anomaly Detection
• Anomaly detection is equivalent to building an enclosure around a cloud
of points which are coding non-anomalous objects in order to separate them from outliers which represent anomalies.
• This problem is known as the soft minimal hypersphere problem and, for points mapped in the feature space $H$, it is expressed as
$$\min_{c, r, \xi} \; r^2 + C \sum_{i=1}^{\ell} \xi_i \quad \text{subject to} \quad \|\phi(x_i) - c\|_H^2 \le r^2 + \xi_i, \;\; \xi_i \ge 0, \;\; \forall i \in [1, \ell].$$
Drawbacks of SVMs
• The soft-margin maximization paradigm minimizes the total error, which
in turn introduces a bias toward the majority class.
• Offline calculations. Unsuitable for processing data streams.
• Inadequate for large problems (except when using heuristics such as
Tornado Experiments I
• The data were randomly divided into two sets: training/validation and
independent testing.
• In the complete training/validation set, there are 361 cases of tornadic
observations and 5048 cases of non-tornadic observations from 59 storm days.
• In the independent testing set, there are 360 tornadic observations and
5047 non-tornadic observations from 52 storm days.
Tornado Experiments II
• Cross-validation was applied on the training/validation set with different combinations of kernel functions (linear, polynomial and Gaussian radial basis function) and parameter values.
• Each classifier is tested on observations drawn randomly with replacement from the independent testing set, using bootstrap resampling (Efron and Tibshirani, 1993) with 1000 replications to establish confidence intervals.
• The support vector solution selected is the one whose classifier has the highest mean Critical Success Index (Hit/(Hit + Miss + False Alarms)) on the validation set.
• The best classifier uses the Gaussian radial basis function kernel with
radius of 0.0001. We apply these optimal parameters to predict the outcomes of the testing set.
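The evaluation procedure described above can be sketched as follows: the Critical Success Index is computed on observations resampled with replacement, and the replications are summarized by their mean and a confidence interval. The slides do not say which kind of bootstrap interval was used, so the percentile interval below is an assumption, as are the function names and the toy confusion counts.

```python
import random


def csi(pairs):
    """Critical Success Index: Hit / (Hit + Miss + False Alarm).
    pairs is a list of (truth, prediction) with 1 = tornado, 0 = no tornado."""
    hits = sum(1 for t, p in pairs if t == 1 and p == 1)
    misses = sum(1 for t, p in pairs if t == 1 and p == 0)
    false_alarms = sum(1 for t, p in pairs if t == 0 and p == 1)
    denom = hits + misses + false_alarms
    return hits / denom if denom else 0.0


def bootstrap_csi(pairs, n_rep=1000, level=0.95, seed=0):
    """Mean CSI and percentile interval over n_rep bootstrap replications."""
    rng = random.Random(seed)
    n = len(pairs)
    scores = sorted(
        csi([pairs[rng.randrange(n)] for _ in range(n)]) for _ in range(n_rep)
    )
    lower = scores[int((1 - level) / 2 * n_rep)]
    upper = scores[int((1 + level) / 2 * n_rep) - 1]
    return sum(scores) / n_rep, lower, upper


# Hypothetical test outcomes: 40 hits, 20 misses, 10 false alarms, 430 correct nulls.
pairs = [(1, 1)] * 40 + [(1, 0)] * 20 + [(0, 1)] * 10 + [(0, 0)] * 430
point_estimate = csi(pairs)
mean_csi, lower, upper = bootstrap_csi(pairs)
print(f"CSI={point_estimate:.3f} mean={mean_csi:.3f} interval=[{lower:.3f}, {upper:.3f}]")
```

Correct nulls do not enter the CSI, which makes it suitable when the non-event class dominates.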
Results for the Tornado Data Set
Results computed from the binary confusion matrix with a 95% confidence interval.
Measure   Validation   Test
POD       57% ± 13%    57% ± 13%
FAR       18% ± 10%    31% ± 14%
CSI       50% ± 10%    45% ± 12%
Bias      69% ± 21%    83% ± 20%
HSS       62% ± 9%     60% ± 11%
POD: probability of detection (hit/(hit + miss)); FAR: false alarm ratio (false alarm/(hit + false alarm)); CSI: critical success index (hit/(hit + false alarm + miss)); Bias ((hit + false alarm)/(hit + miss)); HSS: Heidke skill score.
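The measures in the table can all be derived from the four entries of the binary confusion matrix. The sketch below uses the standard forecast-verification definitions, including POD = hit/(hit + miss) and the usual closed form of the Heidke skill score. The counts are hypothetical and are not the confusion matrix behind the reported results.

```python
def forecast_scores(hits, misses, false_alarms, correct_nulls):
    """Standard verification measures for a binary (yes/no) forecast."""
    a, b, c, d = hits, false_alarms, misses, correct_nulls
    return {
        "POD": a / (a + c),          # probability of detection
        "FAR": b / (a + b),          # false alarm ratio
        "CSI": a / (a + b + c),      # critical success index
        "Bias": (a + b) / (a + c),   # forecast events / observed events
        # Heidke skill score: accuracy relative to random chance.
        "HSS": 2.0 * (a * d - b * c)
               / ((a + c) * (c + d) + (a + b) * (b + d)),
    }


# Hypothetical counts for illustration only.
scores = forecast_scores(hits=60, misses=40, false_alarms=20, correct_nulls=880)
perfect = forecast_scores(hits=10, misses=0, false_alarms=0, correct_nulls=90)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

A bias below 1 means that the classifier forecasts the event less often than it is observed, as is the case for both columns of the table.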
Source: I. Adrianto, T. B. Trafalis, and V. Lakshmanan. Support vector machines for
Part II
4 Filtering with Kernel Methods
Key Notions
Approach Outline
Assimilation with Kernel Methods
5 Applications to Meteorology
Experimental Setup Lorenz 96 Model
Objectives of Dynamic Forecasting Using Kernel Methods
Dynamical systems
• Physical systems are mathematically represented by states in some
abstract space.
• Transitions between states are modeled using transition functions over
the state space in order to simulate the system dynamics.
Objectives
• To provide an alternative to Kalman filtering to predict the future states
of nonlinear dynamical systems.
• To use machine learning techniques and kernel methods in order to
Kalman Filtering
Definition (Kalman Filter)
Given a sequence of perturbed measurements, a Kalman Filter is a process that estimates the states of a dynamical system.
• We will consider only differentiable real-time nonlinear dynamical systems (to which correspond nonlinear Kalman filters).
• The state transition and observation models of the nonlinear dynamical system are
$$\partial_t x(t) = f(x, u, t) + w(t) \quad \text{and} \quad z(t) = h(x, t) + v(t),$$
where $x$ is the state of the system, $z$ is the observation, $f$ is the state transition function, $h$ is the observation model, $u$ is the control and $(w, v)$
Example: radar tracking
From “Pattern Recognition and Machine Learning” by C. M. Bishop. Blue points: true positions; Green points: noisy observations; Red crosses:
Kalman Filter and Kernel Methods: Comparison of
Assumptions
• Unlike the linear Kalman filter, the nonlinear variants do not necessarily give an optimal state estimator. The filter may also diverge if the initial estimate is wrong or if the model is incorrect.
• For Kalman filters, the process must be Markovian, and the perturbations must be independent and follow a Gaussian distribution. Implementations must face problems related to matrix storage, matrix inversion and/or matrix factorization.
• Kernel methods need no statistical assumptions on the process noise and they work with both Markovian and non-Markovian processes. Storage and computational requirements are modest.
Assimilation and Forecasting with Kernel Methods
1. Assimilation
The assimilation step attempts to recover the unperturbed system states from the current and past observations using kernel-based regression techniques. Kernel methods remove noise from state trajectories and update them from the previous forecasts.
2. Forecasting
• The last assimilated state can be used as an initial estimate for one iteration of a nonlinear Kalman filter.
• A polynomial predictive analysis on the last recorded state trajectories using a Lagrange interpolation with Chebyshev nodes can provide reliable extrapolations.
• The generalization property of the SVM regression function can be used to estimate the next future state.
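The polynomial predictive analysis mentioned above can be sketched as a Lagrange interpolation through trajectory values taken at Chebyshev nodes, evaluated slightly beyond the last recorded time. The trajectory used here is a hypothetical cubic, chosen because a four-node interpolant reproduces it exactly; in practice the values at the nodes would come from the assimilated trajectory. Function names, the window [0, 1] and the forecast time are assumptions.

```python
import math


def chebyshev_nodes(a, b, n):
    """n Chebyshev nodes (roots of the degree-n Chebyshev polynomial) on [a, b]."""
    return [
        0.5 * (a + b) + 0.5 * (b - a) * math.cos((2 * k + 1) * math.pi / (2 * n))
        for k in range(n)
    ]


def lagrange_eval(ts, xs, t):
    """Evaluate at t the Lagrange polynomial through the points (ts[i], xs[i])."""
    total = 0.0
    for i, (ti, xi) in enumerate(zip(ts, xs)):
        weight = 1.0
        for j, tj in enumerate(ts):
            if j != i:
                weight *= (t - tj) / (ti - tj)
        total += xi * weight
    return total


def trajectory(t):
    """Hypothetical smooth state trajectory (a cubic, for an exact check)."""
    return t ** 3 - 2.0 * t + 1.0


nodes = chebyshev_nodes(0.0, 1.0, 4)      # recorded window [0, 1]
values = [trajectory(t) for t in nodes]   # state values at the nodes
forecast = lagrange_eval(nodes, values, 1.1)  # extrapolate just past the window
print(forecast, trajectory(1.1))
```

Polynomial extrapolation degrades quickly away from the interpolation window, so a sketch like this is only meaningful for short forecast horizons.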
Advantages and Shortcomings of Kernel Methods
Advantages
• Low memory requirements (some kernels require only $O(n)$ elements to be stored in memory).
• Acceptable computational complexity (of the order of $O(n^2)$; data thinning can reduce the size of the input data set).
• Massive parallelization (can be applied separately to each trajectory).
• No statistical assumptions, no state transition model necessary.
• Can be combined with a Kalman filter if necessary.
Shortcomings
Interpolating States Trajectories without Model
Main Idea Illustration
Find a non-trivial function $f$ such that for every given sample $(t_i, x_i) \in \mathbb{R}^2$ we have $f(t_i) = x_i$ (or $f(t_i)$ is in an interval centered at $x_i$ and of half-width $\varepsilon \to 0$).
• The interpolation function is expressed using an affine combination of kernel-based functions $k(t, \cdot)$.
• The positive semi-definite matrix $K$ with entries $K_{ij} = k(t_i, t_j)$ is called the kernel matrix.
• Kalman filtering works differently: no state trajectory interpolation takes place with KFs. Furthermore, KFs require a model.
Fitting Functions
• The non-trivial function $f$ such that $f(t_i) = x_i$ is chosen to belong to the function class:
$$F = \left\{ t \in \mathbb{R} \mapsto \sum_{i=1}^{\ell} \alpha_i k(t_i, t) + b \in \mathbb{R} \;:\; \alpha^{T} K \alpha \le B^2 \right\}.$$
• The Rademacher complexity of $F$ measures the capability of the functions of $F$ to fit random data with respect to a probability distribution generating this data.
• The empirical Rademacher complexity of $F$, denoted by $\hat{R}(F)$, is such that
$$\hat{R}(F) \le 2B \sqrt{\operatorname{tr}(K)} / \ell + 2|b| / \sqrt{\ell}.$$
Minimizing the Generalization and the Empirical Errors
We use $\hat{R}(F)$ to control the upper bound of the generalization error of the interpolation function. Small empirical errors and a small $\hat{R}(F)$ both decrease this bound, therefore:
• We need to minimize the absolute value of $b$. Also we have $\alpha^{T} K \alpha \le \|K\| \, \|\alpha\|^2$. Thus minimizing $\alpha^{T} \alpha + b^2$ contributes to a smaller $\hat{R}(F)$;
• The empirical error defined by $\left( \sum_{i=1}^{\ell} |\xi_i| \right) / \ell$, where the $\xi_i$'s are the differences between the outputs $f(t_i)$ and the targets $x_i$, should be minimized.
Aim
Optimization Problem
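Introducing tolerances $\rho_i > 0$, the empirical errors $\xi_i$ are equal to $f(t_i) - x_i$. These are the only constraints associated with the previous objective function. Hence the previous calculations lead to:
Optimization Problem for Data Assimilation (Gilbert et al., 2010)
$$\min_{(\alpha, \xi, b) \in \mathbb{R}^{2\ell+1}} \left\{ \alpha^{T} \alpha + b^2 + C\, \xi^{T} \xi \;:\; K\alpha + b\, 1_\ell - x = \xi \right\}.$$
The solution of this optimization problem is:
Analytical Solution
$$\left( K^2 + I_\ell / C + 1_\ell 1_\ell^{T} \right) d = x, \qquad \alpha = K d, \qquad b = 1_\ell^{T} d.$$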
The solutionsα and b describe the regression function (that belongs to the
Computational details
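To compute the solution $d$ of the linear system $\left( K^2 + I_\ell / C + 1_\ell 1_\ell^{T} \right) d = x$, we define $\sigma = 1/\sqrt{C}$, the matrix $A = K + \sigma I_\ell$ ($A$ is symmetric and positive definite) and the following sequences:
• $A \tilde{u}_0 = x$, $A u_0 = \tilde{u}_0$, $A \tilde{u}_n = u_n$, $A u_{n+1} = 2\sigma (u_n - \sigma \tilde{u}_n)$.
• $A \tilde{v}_0 = 1_\ell$, $A v_0 = \tilde{v}_0$, $A \tilde{v}_n = v_n$, $A v_{n+1} = 2\sigma (v_n - \sigma \tilde{v}_n)$.
We set $u = \sum_{n \ge 0} u_n$ and $v = \sum_{n \ge 0} v_n$. Both series are rapidly convergent and they are truncated at a step $m > 0$ in practical problems. Once $m$ is determined we then have
$$d = u - \frac{\langle 1_\ell, u \rangle}{1 + \langle 1_\ell, v \rangle}\, v.$$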
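The computation above can be sketched in plain Python and checked against a direct solve of the linear system. The sketch reads the recurrences as a series for the solution of (K^2 + I/C) u = x in which every step only requires solves with A, followed by a rank-one correction for the all-ones term. The Gaussian kernel, the sample times, the noise-free observations, the value of C and the truncation step are assumptions made for illustration.

```python
import math


def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - s) / aug[r][r]
    return z


def series_solve(A, sigma, rhs, m):
    """Sum terms 0..m of the series for (K^2 + sigma^2 I)^{-1} rhs, A = K + sigma I."""
    term = solve(A, solve(A, rhs))          # first term: A^{-2} rhs
    total = term[:]
    for _ in range(m):
        tilde = solve(A, term)              # auxiliary vector A^{-1} u_n
        term = solve(A, [2 * sigma * (t - sigma * w) for t, w in zip(term, tilde)])
        total = [s + t for s, t in zip(total, term)]
    return total


def kernel(s, t):
    """Gaussian kernel (an assumed choice for this sketch)."""
    return math.exp(-(s - t) ** 2)


ts = [0.1 * i for i in range(8)]            # hypothetical sample times
xs = [math.sin(t) for t in ts]              # hypothetical observations
C = 100.0
sigma = 1.0 / math.sqrt(C)
n = len(ts)
K = [[kernel(a, b) for b in ts] for a in ts]
A = [[K[i][j] + (sigma if i == j else 0.0) for j in range(n)] for i in range(n)]

u = series_solve(A, sigma, xs, m=60)
v = series_solve(A, sigma, [1.0] * n, m=60)
scale = sum(u) / (1.0 + sum(v))
d = [ui - scale * vi for ui, vi in zip(u, v)]

# Direct solve of (K^2 + I/C + 1 1^T) d = x for comparison.
M = [[sum(K[i][k] * K[k][j] for k in range(n)) + (1.0 / C if i == j else 0.0) + 1.0
      for j in range(n)] for i in range(n)]
d_direct = solve(M, xs)

alpha = [sum(K[i][j] * d[j] for j in range(n)) for i in range(n)]
b = sum(d)


def f(t):
    """Fitted function: sum_i alpha_i k(t_i, t) + b."""
    return sum(a * kernel(ti, t) for a, ti in zip(alpha, ts)) + b


print(max(abs(p - q) for p, q in zip(d, d_direct)))
```

Each series term shrinks by at least a factor of two in this setting, which is consistent with the rapid convergence stated above; the residuals f(t_i) - x_i equal -d_i/C, as the constraint of the optimization problem requires.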
Approach Summary
• We have chosen a class of functions in order to interpolate state trajectories without a model.
• We defined a fitness measure for this class of functions and linked it to the parameters describing a function in that class.
• We defined an optimization problem that aims to minimize empirical errors and maximize the fitness of the function interpolating state trajectories.
• An analytical solution of the optimization problem was derived and its
Experimental Setup
Machine and Software
• All codes were implemented in MATLAB 7.9 on a 2002 DELL Precision Workstation 530 with two 2.4 GHz Intel Xeon processors and 2 GiB of RAM.
• EnKF forecasts were generated with the EnKF Matlab toolbox version
0.23 by Pavel Sakov (available at Evensen’s webpage at enkf.nersc.no).
Experimental Models
• The kernel approach was tested on the Lorenz 96 model and the
Quasi-Geostrophic 1.5-layer reduced gravity model.
• Forecasts were obtained using a combination of polynomial predictive
Lorenz 96 Model: Description
• Introduced by Lorenz and Emanuel (1998).
• The state transition model represents the values of atmospheric quantities at discrete locations spaced equally on a latitude circle (1D problem). The state transition model at a location $i$ on the latitude circle is:
$$\partial_t x_i = \underbrace{(x_{i+1} - x_{i-2})\, x_{i-1}}_{\text{advection}} \; \underbrace{-\, x_i}_{\text{dissipation}} \; + \underbrace{F}_{\text{external forcing}}.$$
• The states represent an unspecified scalar meteorological quantity, e.g. “vorticity or temperature” (Lorenz and Emanuel). This model was introduced in order to select which locations on a latitude circle are the most effective in improving weather assimilation and forecasts.
• Observations were generated with an error variance of 1. The external
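The state transition model above is simple to simulate. The sketch below integrates it with a classical fourth-order Runge-Kutta step and generates noisy observations with an error variance of 1, as stated in the slides. The number of locations (40), the forcing value (F = 8), the time step and the initial perturbation are common choices for this model but are assumptions here, since the slides do not preserve them.

```python
import random


def lorenz96_tendency(x, forcing):
    """Right-hand side (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F on a cyclic domain."""
    n = len(x)
    # Negative indices wrap around in Python, which closes the latitude circle.
    return [(x[(i + 1) % n] - x[i - 2]) * x[i - 1] - x[i] + forcing for i in range(n)]


def rk4_step(x, dt, forcing):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return [b + h * s for b, s in zip(base, slope)]

    k1 = lorenz96_tendency(x, forcing)
    k2 = lorenz96_tendency(shifted(x, k1, dt / 2), forcing)
    k3 = lorenz96_tendency(shifted(x, k2, dt / 2), forcing)
    k4 = lorenz96_tendency(shifted(x, k3, dt), forcing)
    return [xi + dt / 6 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


N_SITES, FORCING, DT = 40, 8.0, 0.05   # assumed settings
rng = random.Random(0)

# Start from the steady state x_i = F with a small perturbation at one site.
state = [FORCING] * N_SITES
state[0] += 0.01

truth, observations = [], []
for _ in range(100):
    state = rk4_step(state, DT, FORCING)
    truth.append(state)
    # Observation = true state + Gaussian noise with variance 1.
    observations.append([xi + rng.gauss(0.0, 1.0) for xi in state])

print(len(truth), len(observations[0]))
```

The uniform state x_i = F is a fixed point of the model, so the small perturbation is what triggers the chaotic behaviour for this forcing value.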
Lorenz 96 Model: Assimilation
The following figures illustrate how kernel methods remove the observational noise and interpolate state trajectories during assimilation.
Quasi-Geostrophic Model: Description
• It is an atmospheric dynamical model involving an approximation of actual winds. It is used in the analysis of large scale extratropical weather systems. System states are scalar quantities representing the air flow.
• Horizontal winds are replaced by their geostrophic values in the horizontal acceleration terms of the momentum equations, and horizontal advection in the thermodynamic equation is approximated by geostrophic advection. Furthermore, vertical advection of momentum is neglected.
• It is a 2D problem: the atmosphere has a single level in the vertical and is represented by a square $33 \times 33$ grid where each state is located on a node of that grid. Observations were generated with an error variance of 1.
Quasi-Geostrophic Model: Assimilation Example
Quasi-Geostrophic Model: Forecast Example
Kernel and EnKF forecasts at time step 51. The EnKF forecast is 8 units away from the true state while the kernel forecast is 0.5 units away.
Conclusions I
What was Achieved?
• A viable kernel-based approach for data assimilation and forecasting has been introduced for nonlinear dynamical systems. Its performance on meteorological models was predictable, ranging from equivalent to that of EnKF with 50 ensemble members to exceeding that of EnKF with 100 ensemble members, with lower error as an inverse function of the amount of chaos present in the dynamical system.
• Encouraging results in removing observational noise and interpolating state trajectories were obtained. They represent an improvement with respect to standard EnKF based on fewer than 20 ensemble members.
Conclusions II
Future Work
• We are currently applying these techniques to financial and petroleum engineering problems with the same type of multi-dimensional time series.
• We are developing approaches that identify independent factors influencing the shape of multi-dimensional time series in a nonlinear fashion. The same tools will be used for the prediction of rare events and their magnitude.
For Further Reading
R. C. Gilbert, M. B. Richman, L. M. Leslie, and T. B. Trafalis. Kernel Methods for Data Driven Numerical Modeling. Monthly Weather Review, submitted, 2010.
E. N. Lorenz and K. A. Emanuel. Optimal sites for supplementary weather observations: simulation with a small model. Journal of the Atmospheric Sciences, 55(3):399-414, 1998.
P. Sarma and W. H. Chen. Generalization of the Ensemble Kalman Filter Using Kernels for Nongaussian Random Fields. In SPE Reservoir Simulation Symposium Proceedings, Society of Petroleum Engineers, 2009.
F. Steinke and B. Schölkopf. Kernels, regularization and differential equations. Pattern Recognition, 41(11):3271-3286, 2008.
J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, Cambridge, UK, 2004.