(1)

Operations Research and Statistics Techniques: A Key to Quantitative Data Mining

Jorge Luis Romeu

IIT Research Institute, Rome NY

FCSM Conference, November 2001

(2)

Outline

• Introduction and Motivation

– Why Data Mining/Knowledge Discovery now?

• Phases in a DM/KDD study

– What are their intrinsic components?

– How new (in what way) are they?

• Computer and Other Considerations

(3)

Introduction

• Data, collection methods and computers

• The advent of Internet and its implications

• Data base explosion: start of data-profiles

• Traditional v. New data analysis paradigm

• Characterizing Data Mining (DM) and Knowledge Discovery in Databases (KDD)

• Quantitative (v. Qualitative) Data Mining

(4)

Motivation

• DM/KDD is a fast-growing activity

– In dire need of good people (analysts)

• A Special Data Mining Characteristic:

– research hypotheses and relationships between data variables are both obtained as a result

• Statistics and operations research areas

– well-suited for data mining activities

• Paper objective: to provide a targeted review

– Alert Stats/OR practitioners and explain it to other players.

(5)

Phases in a DM/KDD study

• I) Determination of Objectives

• II) Preparation of the Data

• III) Mining the Data (***)

• IV) Analysis of Results

• V) Assimilation of the Knowledge Extracted from the Data

(6)

Determination of Objectives

• Having a clear problem statement includes:

– review/validation of the basic information

– re-statement of goals and objectives

– technical context, to avoid ambiguity

– gather and review background literature

• about data, problem, component definitions

– prepare comprehensive/detailed project plan

– obtain a formal agreement from our “client”.

(7)

Preparation of the Data

• Most time-consuming phase (60%)

• Is divided into three subtasks:

• 1) Selection/collection of the Data

– define, understand, identify, measure variables

– data base (storage/retrieval) design issues

• 2) Data pre-processing task

– ensuring the quality of the collected data

• 3) Data transformation subtask
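The pre-processing and transformation subtasks can be sketched in a few lines. This is only an illustration under stated assumptions: the records, the use of `None` for a missing value, and the helper names `drop_incomplete` and `standardize` are hypothetical and do not come from the slides.

```python
from statistics import mean, pstdev

def drop_incomplete(rows):
    """Pre-processing: keep only records with no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row)]

def standardize(rows):
    """Transformation: rescale each column to mean 0 and unit standard deviation."""
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    # Guard against constant columns, whose standard deviation is zero.
    scales = [pstdev(col) or 1.0 for col in columns]
    return [
        [(v - c) / s for v, c, s in zip(row, centers, scales)]
        for row in rows
    ]

# Hypothetical (hits, downloads) records; the second one has a missing value.
raw = [[120, 3.0], [150, None], [180, 5.0], [240, 7.0]]
clean = drop_incomplete(raw)
scaled = standardize(clean)
```

Dropping incomplete records is the simplest quality rule; imputing the missing values is an alternative when data are scarce.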

(8)

Illustrative Example

• Internet Data Collection Project

– Objective: forecast (specialized) Web usage

• Problem Components: Internet and user

• Indicators characterizing and relating them:

– Hits, page requests, page views, downloads;

– Dial-ups, unique-visitors, permanent connections

– Internet subscribers: who, why, when, how, etc.

– Web site (internal) movements (pages visited)

– Traffic capacity, speed, rate, bandwidth

(9)

Mining the Data

• Traditional stats data analysis core

– standard objectives: study, classify or predict

• Data mining techniques fall into five classes, according to their objectives (Bradley et al.):

– Predictive modeling,

– Clustering or segmentation

– Dependency modeling,

– Data summarization

(10)

Regrouping the Approaches

• Overlap among Bradley’s existing classes

– not a partition of the existing methodology

• Define three (methodological) categories:

– mathematically based procedures

– statistically based procedures

– and “mixed” algorithms

• A method can be used for multiple objectives

– and we want to emphasize methods over uses.

(11)

Mathematically Based Algorithms

Mathematical programming

– linear, non-linear, integer programming

Network analyses/affinity analysis

– data flow is represented within a network

Memory-based reasoning

– nearest neighbors classified by their distances

– form subsets of “similar” elements (neighbors)
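Memory-based reasoning can be illustrated with a minimal k-nearest-neighbors sketch. The assumptions here are ours, not the slides': Euclidean distance as the metric, a majority vote among k = 3 neighbors, and a made-up training set with "light" and "heavy" user labels.

```python
from collections import Counter
from math import dist

def knn_classify(training_set, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort labeled points by Euclidean distance to the query and keep the closest k.
    neighbors = sorted(training_set, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training set: (feature vector, label) pairs.
training_set = [
    ((1.0, 1.0), "light"), ((1.2, 0.8), "light"), ((0.9, 1.1), "light"),
    ((4.0, 4.2), "heavy"), ((4.1, 3.9), "heavy"), ((3.8, 4.0), "heavy"),
]
label_a = knn_classify(training_set, (1.1, 1.0))
label_b = knn_classify(training_set, (3.9, 4.1))
```

Both the metric and the training set are choices made by the analyst, and a different choice of either can change the resulting labels.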

(12)

Some technical problems

Mathematical programming

– defining objective function and constraints

Network analyses/affinity analysis

– complex relations are difficult to represent

Memory-based reasoning

– definition/combination of metrics is difficult

– pre-established subsets of “similar” neighbors

– the selection of the “training set”

(13)

Mixed Algorithms

Neural networks

– mimics the way the brain is configured/works

Genetic algorithms

– mimics biological genetics in human evolution

Decision trees

– mirror image of neural nets (top-down)

Clustering methods

(14)

Technical Problems

Neural networks

– selection of training data; subjective decisions

Genetic algorithms

– fitness function, defined by the model builder

Decision trees

– Definition of decision probabilities, training set

Clustering methods

– definition of the metric and of the number and components of each subgroup
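The clustering choices listed above (the metric and the number of subgroups) show up as explicit inputs in a minimal k-means sketch. This assumes Euclidean distance and, for reproducibility, takes the first k points as initial centers; both are simplifications, and the data points are invented.

```python
from math import dist

def k_means(points, k, iterations=20):
    """Lloyd's algorithm with the first k points as initial centers (a simplification)."""
    centers = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # Centers stopped moving, so the assignment is stable.
        centers = new_centers
    return centers, clusters

# Hypothetical two-dimensional data with two visible groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2), (4.8, 5.2)]
centers, clusters = k_means(points, k=2)
```

Running the same data with a different `k` or a different distance function produces different subgroups, which is the subjectivity this slide points to.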

(15)

Statistically Based Algorithms

Regression

– variables that better explain the relationship

Discrimination

– variables that better explain group differences

Time series

– black-box type of approach, no regressors

Factor analysis
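For the regression entry, a minimal sketch of comparing two candidate explanatory variables by how well each explains the response. It assumes simple (one-regressor) least squares and uses R² as the criterion; the data and the names `weeks`, `noise`, and `page_views` are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def r_squared(xs, ys):
    """Share of the variance in ys explained by a fitted line on xs."""
    intercept, slope = fit_line(xs, ys)
    y_bar = mean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: weekly page views and two candidate regressors.
page_views = [12.0, 14.1, 15.9, 18.2, 19.8]
weeks = [1, 2, 3, 4, 5]
noise = [3, 1, 4, 1, 5]

intercept, slope = fit_line(weeks, page_views)
scores = {"weeks": r_squared(weeks, page_views), "noise": r_squared(noise, page_views)}
best = max(scores, key=scores.get)
```

With many candidate variables, this one-at-a-time comparison gives way to formal selection or reduction methods.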

(16)

Technical Problems

Regression

– data base, variable selection/reduction methods

Discrimination

– partition into a pre-specified number of groups

Time series

– type (no. of parameters) of the ARIMA model

Factor analysis
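The time-series entry can be illustrated with the simplest member of the ARIMA family, an AR(1) model, fitted with no regressors other than the series' own past. This is a sketch under assumptions: the least-squares fit stands in for a full ARIMA estimation, and the series is synthetic and noise-free.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = const + phi * x[t-1]."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var
    const = mean_curr - phi * mean_prev
    return const, phi

def forecast(const, phi, last_value, steps):
    """Iterate the fitted recursion forward from the last observed value."""
    predictions = []
    for _ in range(steps):
        last_value = const + phi * last_value
        predictions.append(last_value)
    return predictions

# Synthetic, noise-free series generated by x[t] = 2 + 0.5 * x[t-1], starting at 10.
series = [10.0]
for _ in range(9):
    series.append(2.0 + 0.5 * series[-1])

const, phi = fit_ar1(series)
next_two = forecast(const, phi, series[-1], steps=2)
```

Choosing how many autoregressive, differencing, and moving-average terms to include is the model-type problem named on this slide; the sketch fixes that choice at a single autoregressive term.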

(17)

Analysis of Results

• DM/KDD as an enhanced form of EDA (?)

• DM/KDD problem results are iterative:

– are (*) inputs, used for a new iteration, or

– are stage-final, but only until new data is available!

• DM/KDD is a team effort (!!!)

– one of its strongest assets and benefits

– members work together and jointly interpret

– some may not be satisfied or would like to

(18)

Assimilation of the Knowledge Extracted from the Data

• Main DM/KDD objective: problem solving

• By analyzing system data, we can obtain information to help resolve the problems

• Need to convert such information into adequate courses of action (solution)

• Define what changes would be necessary

• Assessment triggers next iteration phase

(19)

Computer/Other Considerations

• The size of the problem (database)

• The nature of the analysis (algorithms)

• Makes DM/KDD a computer-based process

– requires combination of both theory and practice

• Needs software/high-performance computers

• More than mere statistical and mathematical

• Includes other important functions/algorithms

(20)

Summary of our Work

• Alerted statisticians and O.R. professionals

– about DM/KDD and how they can partake in it

• Overviewed main stats and O.R. techniques

– and the DM/KDD activities and paradigm

• Examples of uses, problems and limitations

– in several methods and paradigm phases

• Discussed work statisticians and O.R. do

– for other team members from different fields

(21)

Bibliography

Balasubramanian et al. Data Mining: Critical Review and Technology Assessment Report. IATAC/DTIC. 2000.

Special Issue on Data Mining. INFORMS Journal on Computing. Vol. 11, Number 3. 1999.

Data Mining and Knowledge Discovery; On-line journal: http://www.digimine.com/usama/datamine/

Anderson, T. W. An Introduction to Multivariate Statistical Analysis. Wiley, NY. 1984.

Taha, H. A. Operations Research: An Introduction. Prentice Hall, New Jersey. 1997.
