Kernel methods with Imbalanced Data and Applications to Weather Prediction

(1)

Theodore B. Trafalis

Laboratory of Optimization and Intelligent Systems

School of Industrial and Systems Engineering

University of Oklahoma

[email protected]

San Francisco Airport Marriott Waterfront Congress Center

(2)

Part I

Kernel Methods for Imbalanced Data and Application to Tornado Prediction

(3)

1 Imbalanced Data

What is an Imbalanced Data Problem?

Impact of Imbalanced Data on Learning Machines

State of the Art Techniques for Imbalanced Learning

2 Kernel Methods

General Description of Kernel Methods

Support Vector Machines Applied to Imbalanced Data

3 Application to Tornado Prediction

Description of the Experiments

Results for the Tornado Data Set

(4)

Imbalanced Data Problems and their Importance

• The problem of learning from an imbalanced data set occurs when the number of samples in one class is significantly greater than that of the other class.

• Handling imbalanced data is an important problem in data mining and data classification.

Examples of imbalanced data sets include:

• Fraudulent credit card transactions.

• Telecommunication equipment failures.

• Oil spills from satellite images.

• Tornado, earthquake and landslide occurrences.

(5)

Example of the Classification of Imbalanced Data

Source: Tang et al. SVMs Modeling for Highly Imbalanced Classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 39(1):281-288, 2009.

(6)

Between Class and Within Class Imbalances

Source: He and Garcia. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263-1284, 2009.

(8)

Impact of Imbalanced Data Problems in Classification

• Classifiers tend to provide an imbalanced degree of accuracy, with the majority class having close to 100 percent accuracy and the minority class having an accuracy close to 0-10 percent.

• In the tornado data set, for example, a 10 percent accuracy for the minority class means that only about 72 of the 721 tornadic observations would be detected, while the others would be classified as non-tornadoes.
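To make this effect concrete, the following minimal sketch computes the overall and per-class accuracy of a trivial classifier that always predicts the majority class. The class counts (361 tornadic and 5048 non-tornadic observations) are those of the training/validation set described in the experiments; the helper name `per_class_accuracy` and the label encoding (1 for tornado, 0 for non-tornado) are assumptions made for illustration.

```python
def per_class_accuracy(y_true, y_pred, positive=1):
    """Return (overall accuracy, minority-class accuracy, majority-class accuracy)."""
    pos = [(t, p) for t, p in zip(y_true, y_pred) if t == positive]
    neg = [(t, p) for t, p in zip(y_true, y_pred) if t != positive]
    acc_pos = sum(t == p for t, p in pos) / len(pos)
    acc_neg = sum(t == p for t, p in neg) / len(neg)
    overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return overall, acc_pos, acc_neg


# 361 tornadic (label 1) and 5048 non-tornadic (label 0) observations.
y_true = [1] * 361 + [0] * 5048
# A trivial classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

overall, acc_tornado, acc_non_tornado = per_class_accuracy(y_true, y_pred)
print(f"overall accuracy:     {overall:.3f}")
print(f"tornado accuracy:     {acc_tornado:.3f}")
print(f"non-tornado accuracy: {acc_non_tornado:.3f}")
```

The overall accuracy exceeds 93 percent although no tornado is detected, which is why overall accuracy alone is a misleading measure on such data.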

(9)

Illustration of Impact of Imbalanced Classification

On the left, the accuracy of the minority class is zero percent. On the right, the accuracy for the minority class is 80 percent.

(11)

Current Approaches for Imbalanced Learning I

• Algorithm Level

  • Threshold method.

  • Learn only the minority class.

  • Cost-sensitive approaches.

• Data Level

  • Random under-sampling and over-sampling.

  • Uninformed under-sampling (EasyEnsemble, BalanceCascade).

  • Synthetic sampling with data generation (SMOTE).

  • Adaptive synthetic sampling (Borderline-SMOTE, ADASYN).

  • Sampling with data cleaning (OSS method, CNN+Tomek Links integration method, Neighborhood Cleaning rule, SMOTE+ENN, SMOTE+Tomek).

  • Cluster-based sampling (CBO).

  • Integration of sampling and boosting (SMOTEBoost).

• Kernel-based approaches

(12)

Current Approaches for Imbalanced Learning II

• Kernel Logistic Regression.

• Evaluation metrics: metrics used to evaluate accuracies.

  • Receiver Operating Characteristic (ROC), Precision-Recall (PR) and Cost Curves.

  • Singular assessment metrics based on the confusion or multi-class cost matrix (F-measure, G-mean, etc.).

Source: He and Garcia. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263-1284, 2009.
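As an illustration of the singular assessment metrics listed above, the sketch below computes the F-measure and the G-mean from the four cells of a binary confusion matrix, using their usual definitions. The confusion matrix values are hypothetical and are not taken from the experiments; the function names are chosen for this example only.

```python
import math


def f_measure(tp, fp, fn, beta=1.0):
    """F-measure: weighted harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def g_mean(tp, fp, fn, tn):
    """G-mean: geometric mean of the accuracies on the positive and negative classes."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


# Hypothetical confusion matrix for an imbalanced problem (100 positives, 900 negatives).
tp, fp, fn, tn = 60, 20, 40, 880
print("F-measure:", f_measure(tp, fp, fn))
print("G-mean:   ", g_mean(tp, fp, fn, tn))
```

Both metrics stay low when the minority class is poorly detected, whatever the accuracy on the majority class. The sketch assumes at least one true positive; otherwise the ratios are undefined.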

(13)

Problems with Imbalanced Data Prediction

• Improper evaluation metrics.

• Lack of data (absolute rarity): the number of observations is small in an absolute sense.

• Relative lack of data (relative rarity): observations are rare relative to other events.

• Data fragmentation: absolute lack of data within a single partition.

(15)

Historical Perspective

• Efficient algorithms for detecting linear relations were used in the 1950s and 1960s (perceptron algorithm).

• Handling nonlinear relationships was seen as a major research goal at that time, but developing nonlinear algorithms with the same efficiency and stability proved elusive.

• In the mid 80s, the field of pattern analysis underwent a nonlinear revolution with backpropagation neural networks (NNs) and decision trees (based on heuristics and lacking a firm theoretical foundation; local minima problems, nonconvexity).

• In the mid 90s, kernel-based methods were developed, retaining the guarantees and understanding that had been developed for linear algorithms.

(16)

Overview I

• Kernel methods are a new class of machine learning algorithms which can operate on very general types of data and can detect very general types of relations (e.g., Potential function method; Aizerman et al., 1964, Vapnik, 1982, 1995).

• Correlation, factor, cluster and discriminant analysis are some of the types of machine learning analysis tasks that can be performed on data as diverse as sequences, text, images, graphs and vectors using kernels.

• Kernel methods also provide a natural way to merge and integrate

(17)

Overview II

• Kernel methods offer a modular framework.

• In a first step, a dataset is processed into a kernel matrix. Data can be of various types and also of heterogeneous types.

• In a second step, a variety of kernel algorithms can be used to analyze

(18)

Modular Framework

Source: J. Shawe-Taylor and N. Cristianini, Kernel methods for pattern analysis, 2004.

(19)

Basic Idea of Kernel Methods

• Kernel methods work by:

  • Embedding data in a vector space called feature space using a kernel function.

  • Looking for linear relations in such a space.

• Much of the geometry of the data in the embedding space (relative positions) is contained in all pair-wise inner products (information bottleneck).

• We can work in feature space by specifying an inner product function $k$ between points in it.

• In many cases, inner product in the embedding space (feature space) is

(20)

Properties of Kernels I

Definition (Mercer kernel)

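Let $E$ be any set. A function $k : E \times E \to \mathbb{R}$ that is continuous, symmetric and finitely positive semi-definite is called here a Mercer kernel.

Definition (Finitely positive semi-definiteness)

A function $k : E \times E \to \mathbb{R}$, where $E$ is any set, is a finitely positive semi-definite kernel if

\[ \sum_{i=1}^{m} \sum_{j=1}^{m} k(x_i, x_j)\, \lambda_i \lambda_j \geq 0, \]

for any $m \in \mathbb{N}$, $\lambda_i \in \mathbb{R}$, $x_i \in E$ and $i \in \llbracket 1, m \rrbracket$. It can be seen as the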

(21)

Properties of Kernels II

Definition (RKHS)

A Reproducing Kernel Hilbert Space (RKHS) $\mathcal{F}$ is a Hilbert space of complex-valued functions on a set $E$ for which there exists a function $k : E \times E \to \mathbb{C}$ (the reproducing kernel) such that $k(\cdot, x) \in \mathcal{F}$ for any $x \in E$ and such that $\langle f, k(\cdot, x) \rangle = f(x)$ for all $f \in \mathcal{F}$ (reproducing property).

• If $k$ is a symmetric positive definite kernel then, by the Moore-Aronszajn theorem, there is a unique RKHS with $k$ as the reproducing kernel.

• A symmetric positive definite kernel $k$ can be expressed as a dot product $k : (x, y) \mapsto \langle \phi(x), \phi(y) \rangle$, where $\phi$ is a map from $\mathbb{R}^n$ to a RKHS $\mathcal{H}$

(22)

Properties of Kernels III

Properties

• For any $x_1, \ldots, x_\ell$ the $\ell \times \ell$ matrix $K$ with entries $K_{ij} = k(x_i, x_j)$ is symmetric and positive semi-definite.

• The matrix $K$ is called the kernel matrix.

• A kernel $k$ can be expressed as $k : (x, y) \mapsto \langle \phi(x), \phi(y) \rangle$, where $\phi$ is a map from $\mathbb{R}^n$ to a Hilbert space $\mathcal{H}$ (kernel trick).
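The properties above can be checked numerically on a small example. The sketch below uses the homogeneous polynomial kernel of degree 2 on two-dimensional inputs, for which an explicit feature map is known; the kernel choice, the sample points and the coefficients `lam` are assumptions made for illustration.

```python
import math


def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = <x, y>^2."""
    return sum(a * b for a, b in zip(x, y)) ** 2


def phi(x):
    """Explicit feature map for 2-D inputs such that <phi(x), phi(y)> = <x, y>^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


points = [(1.0, 2.0), (-0.5, 3.0), (2.0, -1.0)]

# Kernel matrix K_ij = k(x_i, x_j), computed without ever forming phi.
K = [[poly2_kernel(xi, xj) for xj in points] for xi in points]

# The same matrix computed through the explicit embedding.
K_explicit = [[dot(phi(xi), phi(xj)) for xj in points] for xi in points]

# Quadratic form sum_ij K_ij * lam_i * lam_j for one choice of coefficients.
lam = [1.0, -2.0, 0.5]
quad = sum(K[i][j] * lam[i] * lam[j] for i in range(3) for j in range(3))

print("kernel matrix:", K)
print("quadratic form:", quad)
```

Both matrices agree up to rounding, and the quadratic form equals the squared norm of the corresponding combination of embedded points, so it cannot be negative.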

(23)

Properties of Kernels IV

The image of $\mathbb{R}^d$ by $\phi$ is a manifold $S$ in $\mathcal{H}$. Kernels can be interpreted as measures of distance and measures of angles on $S$. Simple geometric

(25)

Support Vector Machines

Definition (SVM)

Support Vector Machines are a family of learning algorithms that use kernel methods to solve supervised learning problems.

• Common supervised learning tasks concern problems of classification and regression.

• SVMs work by solving Quadratic Programming problems that aim to minimize the generalization error.

• If we are given a set $S$ of $\ell$ points $x_i \in \mathbb{R}^n$ where each $x_i$ belongs to either of two classes defined by $y_i \in \{-1, +1\}$, then the objective is to find a hyperplane that divides $S$ leaving all the points of the same class on the

(26)

Optimal Separating Hyperplane

Source: Microsoft Research. Vision, Graphics, and Visualization Group, http://research.microsoft.com/en-us/groups/vgv/.

(27)

Dual problem in nonlinear case

• The optimal hyperplane is obtained by solving the following Quadratic Programming (QP) problem:

\[ \min_{\alpha \in \mathbb{R}^\ell} \left\{ \alpha^t K \alpha / 2 - \langle 1, \alpha \rangle \;:\; \langle \alpha, y \rangle = 0 \text{ and } 0 \leq \alpha \leq C \right\}. \]

• This QP problem is the dual formulation of a QP problem that maximizes the margin of separation between the sets of points in the feature space.

• Given a solution $\alpha^*$, the optimal hyperplane is expressed as $\{ x \in \mathbb{R}^n : \sum_{i=1}^{\ell} \alpha_i^* y_i k(x_i, x) + b = 0 \}$ where $b$ is computed using the complementary slackness conditions of the primal formulation.
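Once the dual variables are known, classifying a new point only requires kernel evaluations against the training points. The sketch below evaluates the left-hand side of the hyperplane equation on a toy one-dimensional problem with one point per class, for which the dual solution (both multipliers equal to 0.5, zero offset) can be derived by hand; the linear kernel, the toy data and the function names are assumptions made for illustration.

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


def decision_value(x, support, labels, alphas, b, kernel=linear_kernel):
    """Evaluate sum_i alpha_i * y_i * k(x_i, x) + b."""
    return sum(a * y * kernel(xi, x) for a, y, xi in zip(alphas, labels, support)) + b


# Toy problem: x = +1 belongs to class +1 and x = -1 belongs to class -1.
support = [(1.0,), (-1.0,)]
labels = [+1, -1]
alphas = [0.5, 0.5]   # hand-derived dual solution; it satisfies <alpha, y> = 0
b = 0.0

for x in [(2.0,), (0.3,), (-0.7,)]:
    value = decision_value(x, support, labels, alphas, b)
    print(x, value, +1 if value >= 0 else -1)
```

The sign of the decision value gives the predicted class, and the two training points lie exactly on the margin (decision values +1 and -1).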

(28)

Binary Classification of Imbalanced Data with SVMs

• Binary classification of imbalanced data needs a rewriting of the primal SVM problem, namely:

\[ \min_{w, b, \xi} \; \frac{1}{2} \langle w, w \rangle_{\mathcal{H}} + C_1 \sum_{y_i = 1} \xi_i + C_{-1} \sum_{y_i = -1} \xi_i \]

subject to

\[ y_i \left( \langle w, \phi(x_i) \rangle_{\mathcal{H}} + b \right) \geq 1 - \xi_i, \quad \forall i \in [1, \ell], \qquad \xi_i \geq 0, \quad \forall i \in [1, \ell]. \]

• $C_1$ is the trade-off coefficient for the minority class and $C_{-1}$ is the trade-off coefficient for the majority class. For imbalanced data, we wish to have $C_{-1} < C_1$, i.e. the penalty for outliers in the minority class is greater than the one for the majority class.
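The effect of the two trade-off coefficients can be seen by evaluating the primal objective for a fixed hyperplane. The sketch below does this for a linear feature map (an assumption; any kernel-induced map would do) on made-up one-dimensional data, with the minority class labeled +1. The function name and the numerical values are hypothetical.

```python
def weighted_svm_objective(w, b, X, y, c_pos, c_neg):
    """Primal objective with separate slack penalties for the +1 and -1 classes."""
    def margin(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    # For fixed (w, b), the smallest feasible slack is max(0, 1 - y_i * (<w, x_i> + b)).
    slacks = [max(0.0, 1.0 - yi * margin(xi)) for xi, yi in zip(X, y)]
    penalty_pos = sum(s for s, yi in zip(slacks, y) if yi == +1)
    penalty_neg = sum(s for s, yi in zip(slacks, y) if yi == -1)
    regularizer = 0.5 * sum(wi * wi for wi in w)
    return regularizer + c_pos * penalty_pos + c_neg * penalty_neg


# Made-up data: two minority points (+1) and three majority points (-1).
X = [(2.0,), (0.5,), (-1.0,), (-2.0,), (-0.2,)]
y = [+1, +1, -1, -1, -1]
w, b = (1.0,), 0.0

print(weighted_svm_objective(w, b, X, y, c_pos=1.0, c_neg=1.0))    # equal penalties
print(weighted_svm_objective(w, b, X, y, c_pos=10.0, c_neg=1.0))   # minority errors cost more
```

With the larger minority coefficient, the margin violation of the minority point at 0.5 dominates the objective, so an optimizer would move the hyperplane to reduce it even at the expense of majority points.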

(29)

Illustration of SVM Training with Imbalanced Data

Source: Tang et al. SVMs Modeling for Highly Imbalanced Classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 39(1):281-288, 2009.

(30)

One-Class SVM for Anomaly Detection

• Anomaly detection is equivalent to building an enclosure around a cloud of points that encode non-anomalous objects in order to separate them from outliers, which represent anomalies.

• This problem is known as the soft minimal hypersphere problem and, for points mapped in the feature space $\mathcal{H}$, it is expressed as

\[ \min_{c, r, \xi} \; r^2 + C \sum_{i=1}^{\ell} \xi_i \]

subject to

\[ \| \phi(x_i) - c \|_{\mathcal{H}}^2 \leq r^2 + \xi_i, \quad \forall i \in [1, \ell], \qquad \xi_i \geq 0, \quad \forall i \in [1, \ell]. \]
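The constraints of the soft minimal hypersphere problem can be evaluated with kernel evaluations only when the center is written as a linear combination of embedded points. The sketch below assumes such a representation, with hypothetical coefficients `beta`, a Gaussian kernel and made-up points; it illustrates the distance computation and does not solve the optimization problem.

```python
import math


def rbf(x, z, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def sq_dist_to_center(x, points, beta, kernel=rbf):
    """||phi(x) - c||^2 for c = sum_j beta_j * phi(x_j), using kernel evaluations only."""
    cross = sum(bj * kernel(xj, x) for bj, xj in zip(beta, points))
    cc = sum(bi * bj * kernel(xi, xj)
             for bi, xi in zip(beta, points)
             for bj, xj in zip(beta, points))
    return kernel(x, x) - 2.0 * cross + cc


def slacks(points, beta, r2, kernel=rbf):
    """Smallest slacks with ||phi(x_i) - c||^2 <= r^2 + xi_i and xi_i >= 0."""
    return [max(0.0, sq_dist_to_center(x, points, beta, kernel) - r2) for x in points]


# Three clustered points and one distant point (made-up data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
beta = [1 / 3, 1 / 3, 1 / 3, 0.0]   # hypothetical center: mean of the first three images
r2 = 0.05                            # hypothetical squared radius

print(slacks(points, beta, r2))
```

The clustered points fall inside the hypersphere (zero slack), while the distant point needs a large slack and would be flagged as an anomaly.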

(31)

(32)

Drawbacks of SVMs

• The soft-margin maximization paradigm minimizes the total error, which in return introduces a bias toward the majority class.

• Offline calculations. Unsuitable for processing data streams.

• Inadequate for large problems (except when using heuristics such as

(34)

Tornado Experiments I

• The data were randomly divided into two sets: training/validation and independent testing.

• In the complete training/validation set, there are 361 cases of tornadic observations and 5048 cases of non-tornadic observations from 59 storm days.

• In the independent testing set, there are 360 tornadic observations and 5047 non-tornadic observations from 52 storm days.

(35)

Tornado Experiments II

• Cross validation was applied on the training/validation set with different combinations of kernel functions (linear, polynomial and Gaussian radial basis function) and parameter values.

• Each classifier is tested on test observations drawn randomly with replacement using bootstrap resampling (Efron and Tibshirani, 1993), with 1000 replications on the independent testing set, to establish confidence intervals.

• The support vector solution chosen as best is the one whose classifier has the highest mean Critical Success Index (Hit/(Hit + Miss + False Alarms)) on the validation set.

• The best classifier uses the Gaussian radial basis function kernel with radius of 0.0001. We apply these optimal parameters to predict the outcomes of the testing set.
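The bootstrap evaluation described above can be sketched as follows. The labels and predictions are synthetic, and the percentile method used to read the interval off the sorted scores is an assumption (the slides do not state which interval construction was used); the function names are hypothetical.

```python
import random


def csi(y_true, y_pred):
    """Critical Success Index: hit / (hit + miss + false alarm)."""
    hit = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    miss = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    false_alarm = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    denom = hit + miss + false_alarm
    return hit / denom if denom else float("nan")


def bootstrap_ci(y_true, y_pred, score, n_rep=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval of a score over resampled test cases."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_rep):
        idx = [rng.randrange(n) for _ in range(n)]   # draw n cases with replacement
        scores.append(score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int((1 - level) / 2 * n_rep)]
    hi = scores[int((1 + level) / 2 * n_rep) - 1]
    return lo, hi


# Synthetic test set: 40 positives (25 detected) and 360 negatives (10 false alarms).
y_true = [1] * 40 + [0] * 360
y_pred = [1] * 25 + [0] * 15 + [1] * 10 + [0] * 350

point = csi(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred, csi)
print(f"CSI = {point:.2f}, 95% interval = [{low:.2f}, {high:.2f}]")
```

Because the minority class is small, the interval is wide, which is consistent with the large uncertainties usually reported for rare-event scores.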

(36)

(38)

Results for the Tornado Data Set

Results computed from the binary confusion matrix with a 95% confidence interval.

Measure   Validation   Test
POD       57% ± 13%    57% ± 13%
FAR       18% ± 10%    31% ± 14%
CSI       50% ± 10%    45% ± 12%
Bias      69% ± 21%    83% ± 20%
HSS       62% ± 9%     60% ± 11%

POD: probability of detection (hit/(hit + miss)); FAR: false alarm ratio (false alarm/(hit + false alarm)); CSI: critical success index (hit/(hit + false alarm + miss)); Bias: (hit + false alarm)/(hit + miss); HSS: Heidke skill score.

Source: I. Adrianto, T. B. Trafalis, and V. Lakshmanan. Support vector machines for
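For reference, the five measures of the table can be computed from the cells of a binary confusion matrix as sketched below. The sketch uses the standard definition POD = hit/(hit + miss) and the usual chance-corrected form of the Heidke skill score; the confusion matrix values are hypothetical and do not reproduce the tornado results.

```python
def forecast_scores(hit, miss, false_alarm, correct_null):
    """Forecast verification scores from a 2x2 confusion matrix."""
    n = hit + miss + false_alarm + correct_null
    pod = hit / (hit + miss)
    far = false_alarm / (hit + false_alarm)
    csi = hit / (hit + miss + false_alarm)
    bias = (hit + false_alarm) / (hit + miss)
    # Number of correct forecasts expected by chance given the marginal totals.
    expected = ((hit + miss) * (hit + false_alarm)
                + (correct_null + miss) * (correct_null + false_alarm)) / n
    hss = (hit + correct_null - expected) / (n - expected)
    return {"POD": pod, "FAR": far, "CSI": csi, "Bias": bias, "HSS": hss}


# Hypothetical counts for illustration.
scores = forecast_scores(hit=50, miss=50, false_alarm=25, correct_null=875)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

A bias below 1 indicates that the event is forecast less often than it occurs, which matches the under-forecasting visible in the table.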

(39)

Part II

(40)

4 Filtering with Kernel Methods

Key Notions

Approach Outline

Assimilation with Kernel Methods

5 Applications to Meteorology

Experimental Setup

Lorenz 96 Model

(41)

Objectives of Dynamic Forecasting Using Kernel Methods

Dynamical systems

• Physical systems are mathematically represented by states in some abstract space.

• Transitions between states are modeled using transition functions over the state space in order to simulate the system dynamics.

Objectives

• To provide an alternative to Kalman filtering to predict the future states of nonlinear dynamical systems.

• To use machine learning techniques and kernel methods in order to

(42)

Kalman Filtering

Definition (Kalman Filter)

Given a sequence of perturbed measurements, a Kalman Filter is a process that estimates the states of a dynamical system.

• We will consider only differentiable real-time nonlinear dynamical systems (to which correspond nonlinear Kalman filters).

• The state transition and observation models of the nonlinear dynamical system are

\[ \partial_t x(t) = f(x, u, t) + w(t) \quad \text{and} \quad z(t) = h(x, t) + v(t), \]

where $x$ is the state of the system, $z$ is the observation, $f$ is the state transition function, $h$ is the observation model, $u$ is the control and $(w, v)$

(43)

Example: radar tracking

From “Pattern Recognition and Machine Learning” by C. M. Bishop. Blue points: true positions; Green points: noisy observations; Red crosses:

(44)

Kalman Filter and Kernel Methods: Comparison of Assumptions

• Unlike the linear Kalman Filter, the nonlinear variants do not necessarily give an optimal state estimator. The filter may also diverge if the initial estimate is wrong or if the model is incorrect.

• For Kalman filters, the process must be Markovian, the perturbations must be independent and they must follow a Gaussian distribution. Implementations must face problems related to matrix storage, matrix inversion and/or matrix factorization.

• Kernel methods need no statistical assumptions on the process noise and they work with both Markovian and non-Markovian processes. Storage and computational requirements are modest.

(46)

Assimilation and Forecasting with Kernel Methods

1. Assimilation

The assimilation step attempts to recover the unperturbed system states from the current and past observations using kernel-based regression techniques. Kernel methods remove noise from the state trajectories and update them from the previous forecasts.

2. Forecasting

• The last assimilated state can be used as an initial estimate for one iteration of a nonlinear Kalman filter.

• A polynomial predictive analysis on the last recorded state trajectories using a Lagrange interpolation with Chebyshev nodes can provide reliable extrapolations.

• The generalization property of the SVM regression function can be used to estimate the next future state.

(47)

Advantages and Shortcomings of Kernel Methods

Advantages

• Low memory requirements (some kernels require only $O(n)$ elements to be stored in memory).

• Acceptable computational complexity (of the order of $O(n^2)$; data thinning can reduce the size of the input data set).

• Massive parallelization (can be applied separately to each trajectory).

• No statistical assumptions, no state transition model necessary.

• Can be combined with a Kalman filter if necessary.

Shortcomings

(49)

Interpolating States Trajectories without Model

Main Idea Illustration

Find a non-trivial function $f$ such that for every given sample $(t_i, x_i) \in \mathbb{R}^2$ we have $f(t_i) = x_i$ (or $f(t_i)$ is in an interval centered at $x_i$ and of half-width $\varepsilon \to 0$).

• The interpolation function is expressed using an affine combination of kernel-based functions $k(t, \cdot)$.

• The positive semi-definite matrix $K$ with entries $K_{ij} = k(t_i, t_j)$ is called the kernel matrix.

• Kalman filtering works differently. Contrary to this approach, no state trajectory interpolation takes place with KFs. Furthermore, KFs absolutely need a model.

(50)

Fitting Functions

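• The non-trivial function $f$ such that $f(t_i) = x_i$ is chosen to belong to the function class:

\[ \mathcal{F} = \left\{ t \in \mathbb{R} \mapsto \sum_{i=1}^{\ell} \alpha_i k(t_i, t) + b \in \mathbb{R} \;:\; \alpha^t K \alpha \leq B^2 \right\}. \]

• The Rademacher complexity of $\mathcal{F}$ measures the capability of the functions of $\mathcal{F}$ to fit random data with respect to a probability distribution generating this data.

• The empirical Rademacher complexity of $\mathcal{F}$, denoted by $\hat{R}(\mathcal{F})$, is such that

\[ \hat{R}(\mathcal{F}) \leq 2B \sqrt{\operatorname{tr}(K)} / \ell + 2|b| / \sqrt{\ldots} \]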

(51)

Minimizing the Generalization and the Empirical Errors

We use $\hat{R}(\mathcal{F})$ to control the upper bound of the generalization error of the interpolation function. Small empirical errors and a small $\hat{R}(\mathcal{F})$ both decrease this bound, therefore:

• We need to minimize the absolute value of $b$. Also we have $\alpha^t K \alpha \leq \|K\| \, \|\alpha\|^2$. Thus minimizing $\alpha^t \alpha + b^2$ contributes to a smaller $\hat{R}(\mathcal{F})$;

• The empirical error defined by $\sum_{i=1}^{\ell} |\xi_i| / \ell$, where the $\xi_i$'s are the differences between the function outputs and the targets, should be minimized.

Aim

(52)

Optimization Problem

Introducing tolerances $\rho_i > 0$, the empirical errors $\xi_i$ are equal to $f(t_i) - x_i$. These are the only constraints associated with the previous objective function. Hence the previous calculations lead to:

Optimization Problem for Data Assimilation (Gilbert et al., 2010)

\[ \min_{(\alpha, \xi, b) \in \mathbb{R}^{2\ell+1}} \left\{ \alpha^t \alpha + b^2 + C \, \xi^t \xi \;:\; K\alpha + b \, 1_\ell - x = \xi \right\}. \]

The solution of this optimization problem is:

Analytical Solution

\[ \left( K^2 + I_\ell / C + 1_\ell 1_\ell^t \right) d = x, \qquad \alpha = K d, \qquad b = 1_\ell^t d. \]

The solutions $\alpha$ and $b$ describe the regression function (that belongs to the
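The analytical solution can be checked on a small example by solving the linear system directly. The sketch below is a minimal pure-Python illustration, not the authors' implementation: the Gaussian kernel, the sample times, the observations and the value of C are assumptions, and the dense Gaussian elimination is only suitable for tiny problems.

```python
import math


def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    A = [row[:] + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (A[r][n] - sum(A[r][c] * z[c] for c in range(r + 1, n))) / A[r][r]
    return z


def fit(ts, xs, C, kernel):
    """Return (alpha, b) from (K^2 + I/C + 1 1^t) d = x, alpha = K d, b = 1^t d."""
    n = len(ts)
    K = [[kernel(s, t) for t in ts] for s in ts]
    M = [[sum(K[i][m] * K[m][j] for m in range(n)) + 1.0 + (1.0 / C if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    d = solve(M, xs)
    alpha = [sum(K[i][j] * d[j] for j in range(n)) for i in range(n)]
    return alpha, sum(d)


def predict(t, ts, alpha, b, kernel):
    """Regression function f(t) = sum_i alpha_i k(t_i, t) + b."""
    return sum(a * kernel(ti, t) for a, ti in zip(alpha, ts)) + b


def gauss(s, t):
    return math.exp(-(s - t) ** 2)      # assumed kernel


ts = [0.0, 0.5, 1.0, 1.5, 2.0]
xs = [0.1, 0.45, 0.9, 1.05, 0.85]       # made-up noisy observations
C = 100.0
alpha, b = fit(ts, xs, C, gauss)
print([round(predict(t, ts, alpha, b, gauss), 3) for t in ts])
```

Larger values of C penalize the empirical errors more strongly and pull the fitted values closer to the observations, while smaller values favor a smoother regression function.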

(53)

Computational details

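To compute the solution $d$ of the linear system $\left( K^2 + I_\ell / C + 1_\ell 1_\ell^t \right) d = x$, we define $\sigma = 1/\sqrt{C}$, the matrix $A = K + \sigma I_\ell$ ($A$ is symmetric and positive definite) and the following sequences:

• $A \tilde{u}_0 = x$, $A u_0 = \tilde{u}_0$, $A \tilde{u}_n = u_n$, $A u_{n+1} = 2\sigma (u_n - \sigma \tilde{u}_n)$.

• $A \tilde{v}_0 = 1_\ell$, $A v_0 = \tilde{v}_0$, $A \tilde{v}_n = v_n$, $A v_{n+1} = 2\sigma (v_n - \sigma \tilde{v}_n)$.

We set $u = \sum_{n \geq 0} u_n$ and $v = \sum_{n \geq 0} v_n$. Both series are rapidly convergent and they are truncated at a step $m > 0$ in practical problems. Once $m$ is determined we then have

\[ d = u - \frac{\langle 1_\ell, u \rangle}{1 + \langle 1_\ell, v \rangle} \, v. \]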
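A short derivation suggests why the two series converge and what they compute. It is offered as one reading of the recurrences above, not as the authors' own argument; the matrix $T$ is a symbol introduced here for convenience, and $K$ is assumed symmetric positive semi-definite with $C > 0$.

```latex
% With \sigma = 1/\sqrt{C} and A = K + \sigma I_\ell, so that A - \sigma I_\ell = K:
A^2 - 2\sigma (A - \sigma I_\ell)
  = K^2 + 2\sigma K + \sigma^2 I_\ell - 2\sigma K
  = K^2 + I_\ell / C .

% Writing T = 2\sigma A^{-2} K, the recurrences read u_0 = A^{-2} x and u_{n+1} = T u_n, hence
u = \sum_{n \geq 0} T^n A^{-2} x = (I_\ell - T)^{-1} A^{-2} x = (K^2 + I_\ell / C)^{-1} x ,
\qquad
v = (K^2 + I_\ell / C)^{-1} 1_\ell .

% For an eigenvalue \lambda \geq 0 of K, the matching eigenvalue of T satisfies
0 \leq \frac{2 \sigma \lambda}{(\lambda + \sigma)^2} \leq \frac{1}{2} ,
% because (\lambda + \sigma)^2 \geq 4 \lambda \sigma; both series therefore converge
% at least geometrically with ratio 1/2.

% The rank-one term 1_\ell 1_\ell^t is then handled by the Sherman-Morrison formula:
(K^2 + I_\ell / C + 1_\ell 1_\ell^t)^{-1} x
  = u - \frac{\langle 1_\ell, u \rangle}{1 + \langle 1_\ell, v \rangle} \, v .
```

Under this reading, each term of the series only requires solves with the well-conditioned matrix $A$, and the worst-case error is at least halved with every additional term.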

(54)

Approach Summary

• We have chosen a class of functions in order to interpolate state trajectories without a model.

• We defined a fitness measure for this class of functions and linked it to the parameters describing a function in that class.

• We defined an optimization problem that aims to minimize empirical errors and maximize the fitness of the function interpolating state trajectories.

• An analytical solution of the optimization problem was derived and its

(56)

Experimental Setup

Machine and Software

• All code was implemented in MATLAB 7.9 on a 2002 Dell Precision Workstation 530 with two 2.4 GHz Intel Xeon processors and 2 GiB of RAM.

• EnKF forecasts were generated with the EnKF Matlab toolbox version 0.23 by Pavel Sakov (available at Evensen’s webpage at enkf.nersc.no).

Experimental Models

• The kernel approach was tested on the Lorenz 96 model and the Quasi-Geostrophic 1.5-layer reduced gravity model.

• Forecasts were obtained using a combination of polynomial predictive

(58)

Lorenz 96 Model: Description

• Introduced by Lorenz and Emanuel (1998).

• The state transition model represents the values of atmospheric quantities at discrete locations spaced equally on a latitude circle (1D problem). The state transition model at a location $i$ on the latitude circle is:

\[ \partial_t x_i = \underbrace{(x_{i+1} - x_{i-2}) \, x_{i-1}}_{\text{advection}} \; \underbrace{- \, x_i}_{\text{dissipation}} \; + \underbrace{F}_{\text{external forcing}} . \]

• The states represent an unspecified scalar meteorological quantity, e.g. “vorticity or temperature” (Lorenz and Emanuel). This model was introduced in order to select which locations on a latitude circle are the most effective in improving weather assimilation and forecasts.

• Observations were generated with an error variance of 1. The external
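The state transition model above is straightforward to simulate. The sketch below implements the tendency with cyclic indices and integrates it with a fourth-order Runge-Kutta step. The forcing value F = 8, the 40 locations, the time step and the integrator are assumptions (common choices for this model), since the slides do not specify the configuration that was used.

```python
def lorenz96_tendency(x, forcing):
    """dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F, with cyclic indices."""
    n = len(x)
    # Negative indices wrap around in Python, which gives the cyclic boundary for i-1 and i-2.
    return [(x[(i + 1) % n] - x[i - 2]) * x[i - 1] - x[i] + forcing for i in range(n)]


def rk4_step(x, dt, forcing):
    """One fourth-order Runge-Kutta step (assumed integrator)."""
    def shifted(base, k, scale):
        return [b + scale * ki for b, ki in zip(base, k)]

    k1 = lorenz96_tendency(x, forcing)
    k2 = lorenz96_tendency(shifted(x, k1, dt / 2), forcing)
    k3 = lorenz96_tendency(shifted(x, k2, dt / 2), forcing)
    k4 = lorenz96_tendency(shifted(x, k3, dt), forcing)
    return [xi + dt / 6 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


F = 8.0                    # assumed forcing value
x = [F] * 40               # assumed number of locations on the latitude circle
x[19] += 0.01              # small perturbation of the steady state x_i = F
for _ in range(100):
    x = rk4_step(x, dt=0.05, forcing=F)

print(min(x), max(x))
```

The uniform state with every component equal to F is a steady state of the model, so the run starts from a slightly perturbed version of it and lets the perturbation grow.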

(59)

Lorenz 96 Model: Assimilation

The following figures illustrate how kernel methods remove the observational noise and interpolate state trajectories during assimilation.

(60)

(61)

(62)

Quasi-Geostrophic Model: Description

• It is an atmospheric dynamical model involving an approximation of actual winds. It is used in the analysis of large scale extratropical weather systems. System states are scalar quantities representing the air flow.

• Horizontal winds are replaced by their geostrophic values in the horizontal acceleration terms of the momentum equations, and horizontal advection in the thermodynamic equation is approximated by geostrophic advection. Furthermore, vertical advection of momentum is neglected.

• It is a 2D problem: the atmosphere has a single level in the vertical and was represented by a square 33×33 grid where each state is located on a node of that grid. Observations were generated with an error variance of 1.

(63)

Quasi-Geostrophic Model: Assimilation Example

(64)

Quasi-Geostrophic Model: Forecast Example

Kernel and EnKF forecasts at time step 51. The EnKF forecast is 8 units away from the true state while the kernel forecast is 0.5 units away.

(65)

(66)

Conclusions I

What was Achieved?

• A viable kernel-based approach for data assimilation and forecasting has been introduced for nonlinear dynamical systems. Its performance on meteorological models was predictable, ranging from equivalent to that of EnKF with 50 ensemble members to exceeding that of EnKF with 100 ensemble members, with lower error as an inverse function of the amount of chaos present in the dynamical system.

• Encouraging results in removing observational noise and interpolating state trajectories were obtained. They represent an improvement with respect to standard EnKF based on fewer than 20 ensemble members.

(67)

Conclusions II

Future Work

• We are currently applying these techniques to financial and petroleum engineering problems with the same type of multi-dimensional time series.

• We are developing approaches that identify independent factors influencing the shape of multi-dimensional time series in a nonlinear fashion. The same tools will be used for the prediction of rare events and their magnitude.

(68)

(69)

For Further Reading

R. C. Gilbert, M. B. Richman, L. M. Leslie, and T. B. Trafalis. Kernel Methods for Data Driven Numerical Modeling. Monthly Weather Review, submitted, 2010.

E. N. Lorenz and K. A. Emanuel. Optimal sites for supplementary weather observations: simulation with a small model. Journal of the Atmospheric Sciences, 55(3):399-414, 1998.

P. Sarma and W. H. Chen. Generalization of the Ensemble Kalman Filter Using Kernels for Nongaussian Random Fields. In SPE Reservoir Simulation Symposium Proceedings, 2009. Society of Petroleum Engineers.

F. Steinke and B. Schölkopf. Kernels, regularization and differential equations. Pattern Recognition, 41(11):3271-3286, 2008.

J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, Cambridge, UK, 2004.