Fortgeschrittene Computerintensive Methoden (Advanced Computer-Intensive Methods)

(1)

Unit 3: mlr - Machine Learning in R

Bernd Bischl

Matthias Schmid, Manuel Eugster, Bettina Grün, Friedrich Leisch

Institut für Statistik, LMU München, summer semester (SoSe) 2014

(2)

Introduction + Motivation

No unifying interface for machine learning in R

Experiments require lengthy, tedious and error-prone code

Machine learning is (also) an experimental science:

We need powerful and flexible tools!

mlr has existed for 2-3 years and has grown quite large

It is still heavily in development, but official releases are stable

It was used for nearly all of my papers

We cannot cover everything today: short intro + overview

(3)

Package + Documentation

Main project page on GitHub

URL: https://github.com/berndbischl/mlr

Contains further links, tutorial, issue tracker.

Official versions are released to CRAN.

How to install

# stable release from CRAN
install.packages("mlr")

# development version from GitHub (install_github() is provided by the devtools package)
devtools::install_github("berndbischl/mlr")

Documentation

Tutorial on project page (still growing)

(4)

Features I

Clear S3 / object oriented interface

Easy extension mechanism through S3 inheritance

Abstract description of learners and data by properties

Description of data and task

Many convenience methods, generic building blocks for own experiments

Resampling like bootstrapping, cross-validation and subsampling

Easy tuning of hyperparameters

Variable selection

(5)

Features II

Growing tutorial / examples

Extensive unit-testing (testthat)

Extensive argument checks to help user with errors

Parallelization through parallelMap (local, socket, MPI and BatchJobs modes)

Parallelize experiments without touching code

Job granularity can be changed, e.g., so jobs don’t complete too early

(6)

Remarks on S3

Not much needed (for usage)

Makes extension process easy

Extension is explained in tutorial

If you simply use the package, you don’t really need to care!
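
For readers who do want to see what "extension through S3 inheritance" means, here is a minimal base-R sketch of S3 dispatch. It is not mlr code: the generic `describe`, the classes `learner` and `classif_learner`, and the constructor `make_learner` are hypothetical names chosen only for illustration.

```r
# Hypothetical generic: not part of mlr, used only to illustrate S3 dispatch.
describe <- function(x, ...) UseMethod("describe")

# Method for any object carrying the parent class "learner".
describe.learner <- function(x, ...) {
  paste("learner:", x$id)
}

# More specific method for the subclass "classif_learner".
# NextMethod() calls the parent ("learner") method; we extend its result.
describe.classif_learner <- function(x, ...) {
  paste(NextMethod(), "(classification)")
}

# Constructor: the class vector lists the subclass first, then the parent.
make_learner <- function(id, type = NULL) {
  structure(list(id = id), class = c(type, "learner"))
}

base_lrn <- make_learner("my.learner")
clf_lrn  <- make_learner("my.lda", type = "classif_learner")

print(describe(base_lrn))
print(describe(clf_lrn))
```

Adding a new kind of object then only requires a new class name and the methods that differ from the parent; everything else is inherited.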

Constructors / Accessors

# constructors

task = makeClassifTask(data = iris, target = "Species")
lrn = makeLearner("classif.lda")

# accessors

task$task.desc$target
getTaskData(task)

(7)

Overview of Implemented Learners

Classification

LDA, QDA, RDA, MDA

Logistic / multinomial regression

k-Nearest-Neighbours

Naive Bayes

Decision trees (many variants)

Random forests

Boosting (different variants)

SVMs (different variants)

Neural Networks

. . .

Regression

Linear model

Penalized LM: lasso and ridge

Boosting

k-Nearest-Neighbours

Decision trees, random forests

SVMs (different variants)

Neural / RBF networks

. . .

(8)

Examples 1 + 2

ex1.R: Training and prediction
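
The file ex1.R itself is not reproduced in these slides. As a rough idea of what training and prediction with mlr can look like, here is a minimal sketch; it assumes mlr (and MASS, for `classif.lda`) is installed, and the every-third-row split is an arbitrary choice for illustration, not taken from the lecture.

```r
library(mlr)  # assumes the mlr package is installed from CRAN

# Task and learner, as on the "Constructors / Accessors" slide.
task <- makeClassifTask(data = iris, target = "Species")
lrn  <- makeLearner("classif.lda")

# Hold out every third row for testing (arbitrary split, illustration only).
n         <- nrow(iris)
test.set  <- seq(3, n, by = 3)
train.set <- setdiff(seq_len(n), test.set)

# Fit on the training rows, then predict the held-out rows.
mod  <- train(lrn, task, subset = train.set)
pred <- predict(mod, task = task, subset = test.set)

# Mean misclassification error on the held-out rows.
perf <- performance(pred, measures = mmce)
print(perf)
```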

(9)

Resampling

Hold-Out

Cross-validation

Bootstrap

Subsampling

and quite a few extensions
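
To make the idea behind these strategies concrete, here is a small k-fold cross-validation written in base R. It is a sketch of the general technique, not of mlr's resampling code; the helpers `make_folds` and `cv_mse`, the linear model and the built-in `mtcars` data are assumptions made for the example.

```r
set.seed(1)  # reproducible fold assignment

# Hypothetical helper: assign each of n rows to one of k folds at random.
make_folds <- function(n, k) sample(rep(seq_len(k), length.out = n))

# k-fold cross-validation of a linear model; returns the test MSE per fold.
cv_mse <- function(formula, data, k = 5) {
  folds    <- make_folds(nrow(data), k)
  response <- all.vars(formula)[1]
  sapply(seq_len(k), function(i) {
    train <- data[folds != i, , drop = FALSE]  # k - 1 folds for fitting
    test  <- data[folds == i, , drop = FALSE]  # 1 fold for evaluation
    fit   <- lm(formula, data = train)
    mean((test[[response]] - predict(fit, newdata = test))^2)
  })
}

fold_errors <- cv_mse(mpg ~ wt + hp, mtcars, k = 5)
print(fold_errors)        # one test MSE per fold
print(mean(fold_errors))  # aggregated cross-validation estimate
```

Hold-out corresponds to a single train/test split, subsampling repeats such a split several times, and the bootstrap draws the training rows with replacement.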

(10)

Example 3

(11)

Remarks on model selection I

Basic machine learning

Fit parameters of model to predict new data

Generalisation error commonly estimated by resampling, e.g. 10-fold cross-validation

2nd, 3rd, . . . level of inference

Comparing inducers or hyperparameters is harder

Feature selection either in 2nd level or adds a 3rd one . . .

Statistical comparisons on the 2nd stage are non-trivial

Still active research

(12)

Tuning / Grid Search

Used to find “best” hyperparameters for a method in a data-dependent way

Most basic method

Exhaustively try all combinations of a finite grid

Inefficient

Searches large, irrelevant areas

Combinatorial explosion

Reasonable for continuous parameters?

Still often the default method
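
A grid search can be sketched in a few lines of base R. The example below is an illustration of the technique only, not mlr's tuning code: the model (ridge-penalised polynomial regression on the built-in `cars` data), the two hyperparameters `degree` and `lambda`, and the helper `holdout_mse` are all assumptions chosen to keep the example self-contained.

```r
set.seed(1)

# Finite grid: every combination of the two hyperparameters is tried.
grid <- expand.grid(degree = 1:4, lambda = c(0, 0.1, 1, 10))

# Fixed hold-out split of the built-in cars data.
n         <- nrow(cars)
train_idx <- sample(n, size = round(0.7 * n))
train     <- cars[train_idx, ]
test      <- cars[-train_idx, ]

# Hold-out MSE of ridge-penalised polynomial regression for one grid point.
holdout_mse <- function(degree, lambda) {
  basis   <- poly(train$speed, degree)
  X_train <- cbind(1, basis)
  X_test  <- cbind(1, predict(basis, test$speed))
  # Penalise the polynomial coefficients, but not the intercept.
  penalty <- diag(c(0, rep(lambda, degree)), nrow = degree + 1)
  beta    <- solve(crossprod(X_train) + penalty,
                   crossprod(X_train, train$dist))
  mean((test$dist - X_test %*% beta)^2)
}

grid$mse <- mapply(holdout_mse, grid$degree, grid$lambda)
best     <- grid[which.min(grid$mse), ]

print(nrow(grid))  # 16 model fits for a 4 x 4 grid
print(best)
```

With p hyperparameters and m candidate values each, the grid has m^p rows, which is the combinatorial explosion mentioned above.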

(13)

Example 4

(14)

Remarks on Model Selection II

Salzberg (1997):

“On comparing classifiers: Pitfalls to avoid and a recommended approach”

Many articles do not contain sound statistical methodology

Compare against enough reasonable algorithms

Do not just report mean performance values

Do not cheat through repeated tuning

Use double cross-validation for tuning and evaluation

Apply the correct test (assumptions?)

Adapt for multiple testing

Think about independence and distributions

Do not rely solely on UCI: we might overfit on it
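
Double (nested) cross-validation can be sketched in base R as follows. This is an illustration of the idea under simple assumptions, not mlr's implementation: the model is a polynomial regression on the built-in `cars` data, the tuned hyperparameter is the polynomial degree, and `make_folds`, `poly_mse` and `tune_degree` are hypothetical helpers.

```r
set.seed(1)

# Hypothetical helper: assign each of n rows to one of k folds at random.
make_folds <- function(n, k) sample(rep(seq_len(k), length.out = n))

# Test MSE of a degree-d polynomial fit of dist on speed.
poly_mse <- function(train, test, d) {
  fit <- lm(dist ~ poly(speed, d), data = train)
  mean((test$dist - predict(fit, newdata = test))^2)
}

# Inner loop: choose the degree by cross-validation on the given data only.
tune_degree <- function(data, degrees, k) {
  folds  <- make_folds(nrow(data), k)
  cv_err <- sapply(degrees, function(d) {
    mean(sapply(seq_len(k), function(i) {
      poly_mse(data[folds != i, ], data[folds == i, ], d)
    }))
  })
  degrees[which.min(cv_err)]
}

# Outer loop: the outer test fold is never seen during tuning, so its
# error is an honest estimate for the whole "tune, then fit" procedure.
degrees     <- 1:3
outer_k     <- 5
outer_folds <- make_folds(nrow(cars), outer_k)

outer_results <- t(sapply(seq_len(outer_k), function(i) {
  train <- cars[outer_folds != i, ]
  test  <- cars[outer_folds == i, ]
  d     <- tune_degree(train, degrees, k = 3)
  c(degree = d, mse = poly_mse(train, test, d))
}))

print(outer_results)                 # chosen degree and test MSE per outer fold
print(mean(outer_results[, "mse"]))  # performance estimate of the tuned method
```

Reporting the best inner cross-validation error instead of the outer error would be the "cheating through repeated tuning" warned about above, because that value was itself used to pick the hyperparameter.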

(15)
(16)

Example 5

(17)

Outlook / Not Shown Today I

Available

Smarter tuning algorithms

CMA-ES, iterated F-racing, model-based optimization . . .

Variable selection

Filters: Simply imported available R methods

Wrappers: Forward, backward, stochastic search, GAs

Bagging for arbitrary base learners

Generic imputation for missing values

Wrapping / tuning of preprocessing

Regular cost-sensitive learning (class-specific costs)

. . .

(18)

Outlook / Not Shown Today II

Current work, most of it nearly done

Survival analysis:

Tasks, learners and measures

Cost-sensitive learning (example-dependent costs)

Over / Undersampling for unbalanced class sizes

Multi-criteria optimization
