INTRODUCTION TO DATA MINING: DATA and DATA EXPLORATION


Academic year: 2020


(2)

WHAT IS DATA?

• Data is a collection of data objects and their attributes
• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– An attribute is also known as a variable, field, characteristic, or feature
• A collection of attributes describes an object
– An object is also known as a record, point, case, sample, entity, or instance

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Columns are attributes; rows are objects.)

(3)

TYPES OF ATTRIBUTES

• There are different types of attributes
– Nominal. Examples: ID numbers, eye color, zip codes
– Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
– Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit
– Ratio. Examples: temperature in Kelvin, length, time, counts
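One way to make the four types concrete is to ask which operations give a meaningful answer on each. The sketch below is illustrative only: the sample values, the grade ordering, and the variable names are assumptions, not part of the slides. Nominal values support only equality tests and counting, ordinal values add ordering, interval values add meaningful differences, and ratio values add meaningful ratios.

```python
from statistics import mode

# Hypothetical sample values, one list per attribute type.
eye_color = ["brown", "blue", "brown", "green"]   # nominal
grades = ["C", "A", "B", "B", "A"]                # ordinal (assumed order: C < B < A)
temps_c = [10.0, 20.0]                            # interval (Celsius)
temps_k = [t + 273.15 for t in temps_c]           # ratio (Kelvin)

# Nominal: counting is meaningful, ordering is not.
most_common_color = mode(eye_color)

# Ordinal: ordering is meaningful, so a median (middle element) makes sense.
rank = {"C": 0, "B": 1, "A": 2}
median_grade = sorted(grades, key=rank.get)[len(grades) // 2]

# Interval: differences are meaningful.
diff_c = temps_c[1] - temps_c[0]

# The Celsius ratio is 2.0, but 20 degrees C is not "twice as hot" as 10 degrees C.
ratio_c = temps_c[1] / temps_c[0]

# Ratio: on the Kelvin scale the ratio is physically meaningful (about 1.035).
ratio_k = temps_k[1] / temps_k[0]
```

The gap between `ratio_c` and `ratio_k` is why Celsius temperature is listed as interval and Kelvin temperature as ratio.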

(4)

DISCRETE AND CONTINUOUS ATTRIBUTES

• Discrete attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables
– Note: binary attributes are a special case of discrete attributes
• Continuous attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– In practice, real values can only be measured and represented using a finite number of digits
– Typically represented as floating-point variables

(5)

TYPES OF DATA SETS

• Record
– Data matrix
– Document data
– Transaction data
• Graph
– World Wide Web
– Molecular structures
• Ordered
– Spatial data
– Temporal data
– Sequential data
– Genetic sequence data

(6)

IMPORTANT CHARACTERISTICS OF STRUCTURED DATA

• Dimensionality
– Curse of dimensionality
• Sparsity
– Only presence counts
• Resolution
– Patterns depend on the scale

(7)

RECORD DATA

• Data that consists of a collection of records, each of which consists of a fixed set of attributes

(8)

TRANSACTION DATA

• A special type of record data, where each record (transaction) involves a set of items
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

(Each row is a transaction; each product within a row is an item.)
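A transaction table like this one maps naturally onto a dictionary of sets. The sketch below is a minimal illustration, assuming Python as the working language (the slides themselves use Weka); the names `transactions`, `item_counts`, and `diaper_and_milk` are made up for the example.

```python
from collections import Counter

# Each transaction ID maps to the set of items bought in that shopping trip.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Count in how many transactions each item appears.
item_counts = Counter(item for items in transactions.values() for item in items)

# Find the transactions that contain both Diaper and Milk.
diaper_and_milk = sorted(
    tid for tid, items in transactions.items() if {"Diaper", "Milk"} <= items
)
```

Using sets reflects the definition on the slide: a transaction records which items are present, not their order or quantity.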

(9)

GRAPH DATA

• Examples: generic graph and HTML links

[Figure: generic graph]

<a href="papers/papers.html#bbbb"> Data Mining </a>
<li> <a href="papers/papers.html#aaaa"> Graph Partitioning </a>
<li> <a href="papers/papers.html#aaaa"> Parallel Solution of Sparse Linear System of Equations </a>
<li> <a href="papers/papers.html#ffff"> N-Body Computation and Dense Linear System Solvers

(10)

CHEMICAL DATA

• Benzene molecule: C6H6

(11)

ORDERED DATA

• Sequences of transactions

[Figure: a sequence of transactions; each element of the sequence is a set of items/events]

(12)

ORDERED DATA

• Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC

CGCAGGGCCCGCCCCGCGCCGTC

GAGAAGGGCCCGCCTGGCGGGCG

GGGGGAGGCGGGGCCGCCCGAGC

CCAACCGAGTCCGACCAGGTGCC

CCCTCTGCTCGGCCTAGACCTGA

GCTCATTAGGCGGCAGCGGACAG

GCCAAGTAGAACACGCGAAGCGC

TGGGCTGCCTGCTGCGACCAGGG

(13)

ORDERED DATA

• Spatio-temporal data

[Figure: average monthly temperature of land and ocean]

(14)

DATA QUALITY

• What kinds of data quality problems are there?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
– Noise and outliers
– Missing values
– Duplicate data

(15)

NOISE

• Noise refers to modification of original values
– Examples: distortion of a person's voice when talking on a poor phone connection, and "snow" on a television screen

[Figure: two sine waves (left); the same two sine waves with noise added (right)]

(16)

OUTLIERS

• Outliers are data objects with characteristics that are considerably different from those of most other data objects in the data set

(17)

MISSING VALUES

• Reasons for missing values
– Information is not collected (e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
• Handling missing values
– Eliminate data objects
– Estimate missing values
– Ignore the missing value during analysis
– Replace with all possible values (weighted by their probabilities)
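The first two handling strategies can be sketched in a few lines. This is an illustration under stated assumptions: the records are hypothetical, missing values are encoded as `None`, and the mean is used as the estimate, which is only one of many possible estimators.

```python
from statistics import mean

# Hypothetical records; None marks a missing age.
records = [
    {"name": "Leslie", "age": 50},
    {"name": "Morgan", "age": None},
    {"name": "Robin", "age": 30},
]

# Strategy 1: eliminate data objects that have a missing value.
complete = [r for r in records if r["age"] is not None]

# Strategy 2: estimate the missing value, here with the mean of the known ages.
mean_age = mean(r["age"] for r in complete)
imputed = [dict(r, age=mean_age) if r["age"] is None else r for r in records]
```

Elimination shrinks the data set, while estimation keeps every object but adds values that were never observed; which trade-off is acceptable depends on the analysis.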

(18)

DUPLICATE DATA

• A data set may include data objects that are duplicates, or almost duplicates, of one another
– Major issue when merging data from heterogeneous sources
• Examples:
– Same person with multiple email addresses
• Data cleaning
– Process of dealing with duplicate data issues
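As a rough illustration of the email example, the sketch below merges records whose names match after normalizing case and whitespace. The records and the matching rule are assumptions made for the example; real data cleaning usually needs a more careful notion of "same person" than a normalized name.

```python
# Hypothetical records merged from two sources.
records = [
    {"name": "Leslie Smith", "email": "leslie@example.com"},
    {"name": "leslie  smith", "email": "l.smith@example.org"},
    {"name": "Morgan Lee", "email": "morgan@example.com"},
]


def normalize(name):
    """Lowercase the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


# Group records by normalized name, collecting every email seen for that person.
merged = {}
for record in records:
    key = normalize(record["name"])
    entry = merged.setdefault(key, {"name": record["name"], "emails": []})
    entry["emails"].append(record["email"])
```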

(19)

DATA PREPROCESSING

• Aggregation
• Sampling
• Dimensionality reduction
• Feature subset selection
• Feature creation
• Discretization and binarization
• Attribute transformation

(20)

AGGREGATION

• Combining two or more attributes (or objects) into a single attribute (or object)
• Purpose
– Data reduction: reduce the number of attributes or objects
– Change of scale: cities aggregated into regions, states, countries, etc.
– More "stable" data: aggregated data tends to have less variability
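The change-of-scale case can be sketched as a group-by sum. The city names, regions, and sales figures below are invented for illustration only.

```python
from collections import defaultdict

# (city, region, sales) rows; all values are made up for this example.
city_sales = [
    ("Pisa", "Tuscany", 120),
    ("Florence", "Tuscany", 300),
    ("Siena", "Tuscany", 80),
    ("Milan", "Lombardy", 500),
    ("Bergamo", "Lombardy", 150),
]

# Aggregate city-level objects into region-level objects by summing sales.
region_sales = defaultdict(int)
for city, region, sales in city_sales:
    region_sales[region] += sales
```

Five city-level objects become two region-level objects, which is the data reduction the slide refers to.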

(21)

SAMPLING

• Sampling is the main technique employed for data selection
– It is often used for both the preliminary investigation of the data and the final data analysis
• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming
• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming

(22)

SAMPLING …

• The key principle for effective sampling is the following:
– Using a sample will work almost as well as using the entire data set, if the sample is representative
– A sample is representative if it has approximately the same property (of interest) as the original set of data
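A minimal sketch of this principle, assuming the property of interest is the mean: draw a simple random sample without replacement and compare its mean with that of the full data set. The synthetic population, its parameters, and the seed are assumptions chosen for the example.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic "full" data set of 8000 values (assumed mean 100, spread 15).
population = [rng.gauss(100.0, 15.0) for _ in range(8000)]

# Simple random sample of 500 objects, drawn without replacement.
sample = rng.sample(population, 500)

# If the sample is representative, the two means should be close.
gap = abs(mean(sample) - mean(population))
```

A small `gap` suggests the sample is representative for the mean; other properties (for example, the presence of rare classes) may need a larger or stratified sample.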

(23)

SAMPLE SIZE

[Figure: the same data set sampled at 8000 points, 2000 points, and 500 points]

(24)

CURSE OF DIMENSIONALITY

• When dimensionality increases, data becomes increasingly sparse in the space that it occupies
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
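A common way to see this effect is to generate random points and compare the largest and smallest pairwise distances as the number of dimensions grows. The sketch below assumes points drawn uniformly from the unit hypercube; the function name, point count, and seed are choices made for the example.

```python
import math
import random


def distance_contrast(dim, n_points=100, seed=0):
    """Return (max_dist - min_dist) / min_dist over all pairs of random points."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


# The relative contrast between near and far points shrinks as dim grows.
contrasts = {dim: distance_contrast(dim) for dim in (2, 10, 50)}
```

When the contrast is small, the nearest and farthest neighbors of a point are almost equally far away, which is why distance-based clustering and outlier detection lose meaning in high dimensions.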

(25)

DIMENSIONALITY REDUCTION

• Purpose:
– Avoid the curse of dimensionality
– Reduce the amount of time and memory required by data mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise

(26)

FEATURE SUBSET SELECTION

• Reduce the dimensionality of the data by removing:
• Redundant features
– Duplicate much or all of the information contained in one or more other attributes
– Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
– Contain no information that is useful for the data mining task at hand
– Example: a student's ID is often irrelevant to the task of predicting the student's GPA

(27)

FEATURE SUBSET SELECTION

• Techniques:
– Brute-force approach: try all possible feature subsets as input to the data mining algorithm
– Embedded approaches: feature selection occurs naturally as part of the data mining algorithm
– Filter approaches: features are selected before the data mining algorithm is run
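To see why the brute-force approach is rarely practical, it helps to enumerate the candidate subsets. The sketch below uses the four Iris attribute names as an example feature list and only generates the subsets; evaluating each one with a data mining algorithm is left out.

```python
from itertools import combinations

features = ["sepallength", "sepalwidth", "petallength", "petalwidth"]


def all_subsets(items):
    """Yield every non-empty subset of items, smallest subsets first."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


subsets = list(all_subsets(features))
```

With n features there are 2^n - 1 non-empty subsets: 15 for the four features here, but more than a million for 20 features, so the number of runs grows exponentially.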

(28)

DATA MINING: EXPLORING DATA

(29)

WHAT IS DATA EXPLORATION?

A preliminary exploration of the data to better understand its characteristics.

• Key motivations of data exploration include
– Helping to select the right tool for preprocessing or analysis
– Making use of humans' abilities to recognize patterns: people can recognize patterns not captured by data analysis tools
• Related to the area of Exploratory Data Analysis (EDA)

(30)

VISUALIZATION

Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

• Visualization of data is one of the most powerful and appealing techniques for data exploration
– Humans have a well-developed ability to analyze large amounts of information that is presented visually
– Can detect general patterns and trends
– Can detect outliers and unusual patterns

(31)

IRIS SAMPLE DATA SET

Many of the exploratory data techniques are illustrated with the Iris Plant data set.

• Can be obtained from the UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
• From the statistician Ronald A. Fisher
• Three flower types (classes): Setosa, Virginica, Versicolour
• Four (non-class) attributes: sepal width and length, petal width and length

Image credit: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.

(32)

VISUALIZATION TECHNIQUES: HISTOGRAMS

• Histogram
– Usually shows the distribution of values of a single variable
– Divide the values into bins and show a bar plot of the number of objects in each bin
– The height of each bar indicates the number of objects
– The shape of the histogram depends on the number of bins
• Example: petal width (10 and 20 bins, respectively)
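The binning step can be sketched as follows, assuming equal-width bins between the minimum and maximum value. The function name is made up, and the eight petal widths are only the first rows of the Iris data, so the counts illustrate the mechanics rather than the real distribution.

```python
def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    counts = [0] * n_bins
    if hi == lo:
        # All values are identical: put everything in the first bin.
        counts[0] = len(values)
        return counts
    width = (hi - lo) / n_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Petal widths (cm) of eight Iris-setosa rows.
petal_widths = [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2]

counts_3 = histogram(petal_widths, 3)
counts_5 = histogram(petal_widths, 5)
```

Comparing `counts_3` with `counts_5` shows the point made on the slide: the same values produce a different shape when the number of bins changes.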

(33)

SCATTER PLOT ARRAY OF IRIS ATTRIBUTES

(34)

PRACTICE!

• We use the WEKA software. It is open source (GNU license) and freely downloadable from http://www.cs.waikato.ac.nz/ml/weka/
• Weka 3.6 is the latest stable version of Weka.
• Weka is a collection of algorithms for data mining tasks and contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
• Tutorials can be found at the Weka web site

(35)

MACHINE LEARNING WITH WEKA

• Comprehensive set of tools:
– Pre-processing and data analysis
– Mining algorithms (for classification, clustering, etc.)
– Evaluation metrics
• Three modes of operation:
– GUI
– Command line (not discussed)
– Java API (not discussed)

(36)

ARFF FILES

• ARFF files have two distinct sections:
– Header information: the header of the ARFF file contains the name of the relation, a list of the attributes (the columns in the data), and their types
– Data information

(37)

ATTRIBUTE-RELATION FILE FORMAT (ARFF)

• Weka reads ARFF files:

@relation adult
@attribute age numeric
@attribute name string
@attribute education {College, Masters, Doctorate}
@attribute class {>50K,<=50K}
@data
50,Leslie,Masters,>50K
?,Morgan,College,<=50K

(The lines before @data are the header; the lines after @data are comma-separated values (CSV).)

• Supported attributes: numeric, nominal, string, date
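To make the header/data split concrete, here is a minimal reader for the simple ARFF subset shown on the slide. It is a sketch, not Weka's own parser: the function name `parse_arff` is made up, all values are kept as strings, and quoted values, sparse rows, and date formats are not handled. It treats "?" as a missing value and "%" as the start of a comment line, as ARFF does.

```python
def parse_arff(text):
    """Parse a simple ARFF document into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append([None if v == "?" else v for v in values])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, type_spec = line.split(None, 2)
            attributes.append((name, type_spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


example = """@relation adult
@attribute age numeric
@attribute name string
@attribute education {College, Masters, Doctorate}
@attribute class {>50K,<=50K}
@data
50,Leslie,Masters,>50K
?,Morgan,College,<=50K
"""

relation, attributes, rows = parse_arff(example)
```

Keywords are matched case-insensitively, so the same function also reads files that use the uppercase `@RELATION`, `@ATTRIBUTE`, and `@DATA` forms.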

(38)

ARFF FILE: EXAMPLE

@RELATION iris

@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA

5.1,3.5,1.4,0.2,Iris-setosa

4.9,3.0,1.4,0.2,Iris-setosa

4.7,3.2,1.3,0.2,Iris-setosa

4.6,3.1,1.5,0.2,Iris-setosa

5.0,3.6,1.4,0.2,Iris-setosa

5.4,3.9,1.7,0.4,Iris-setosa

4.6,3.4,1.4,0.3,Iris-setosa

5.0,3.4,1.5,0.2,Iris-setosa

(39)

EXERCISE 1

• Download and install Weka 3.6
• Open Weka, select the Explorer button, and load the Iris dataset
• How many records are in this dataset?
• How many attributes? And classes?
• What are the types of the attributes?
• Which attribute has the greatest standard deviation? How do you interpret this value?
• Visualize the plot of all the attributes.

(40)

EXERCISE 2

• Preprocessing on the Iris dataset:
– Discretize the numeric attributes of the Iris dataset into 5 bins using the unsupervised Discretize filter.
– Examine the resulting discretization. How many elements are in each interval? Is the discretization well balanced?
– Now "undo" and try supervised discretization: how many bins do you have now? Are the intervals well balanced?
– Which attributes are most relevant with respect to the class?

(41)

EXERCISE 3

• Load the dataset old_faithful
• Use the visualization tab to perform a visual inspection
• Visualize the scatter plot of F1 vs F2. Are these attributes related or not?
• Visually identify the data points that are outliers (extremely high or low values).
• From the preprocessing tab, normalize the attributes to the [0,1] interval using the Normalize filter
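Rescaling an attribute to the [0,1] interval is usually done with min-max normalization, x' = (x - min) / (max - min). The sketch below shows that formula in Python; it is assumed, not confirmed by the slides, that this is the computation the exercise has in mind, and the input values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute values.
normalized = min_max_normalize([2.0, 4.0, 6.0])
```

Because the minimum and maximum define the scale, a single extreme outlier compresses all other values into a narrow band, which is worth keeping in mind after the outlier inspection in the previous step.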

(42)

EXERCISE 4

• Load the breast cancer dataset
• Remove the ID using the Remove filter
• How are the variables correlated?
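One standard way to quantify the relationship between two numeric variables is the Pearson correlation coefficient. The sketch below is a plain-Python version with made-up inputs; it assumes both lists have the same length and that neither variable is constant.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical variables: one perfectly increasing pair, one perfectly decreasing.
r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson([1, 2, 3], [3, 2, 1])
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 indicate little linear relationship, which complements the visual impression from the scatter plots.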

(43)

EXERCISE 5

• Prepare data for Weka: edit the car dataset so that it can be read by Weka.
– Remember that you need the CSV format, so the attribute names must appear in the first row. You can use Excel or a text editor.
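The same preparation can also be scripted. The sketch below writes a CSV with the attribute names in the first row using Python's standard `csv` module; the attribute names and rows are hypothetical placeholders, since the slides do not show the contents of the car dataset.

```python
import csv
import io

# Hypothetical attribute names and rows; replace them with the real car data.
header = ["make", "doors", "price"]
rows = [
    ["alfa", 2, 13495],
    ["audi", 4, 13950],
]

# StringIO stands in for a real file opened with open("cars.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)   # attribute names go in the first row
writer.writerows(rows)    # one line per data object
csv_text = buffer.getvalue()
```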
