
Institut für Softwarewissenschaften – Universität Wien P.Brezany

Data Mining - Introduction

Peter Brezany

Institut für Scientific Computing Universität Wien

Tel. 4277 39425 Office hours: Tue, 13.00-14.00


Outline

Business Intelligence and its components

Knowledge discovery in databases

Data mining techniques
- association and sequence rules
- classification
- prediction
- clustering
- neural networks

Data warehousing

Data webhousing


Literature

Mark and Mary Whitehorn: Business Intelligence: The IBM Solution. Springer-Verlag, 2000.

R. Kimball: The Data Warehouse Toolkit. John Wiley, 1996.

J. Han, M. Kamber: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.

M. Ester, J. Sander: Knowledge Discovery in Databases. Springer-Verlag, 2000.

I.H. Witten, E. Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, 2000.

Business Intelligence

Definition:

Business Intelligence is an umbrella term, broadly covering the processes involved in extracting valuable business information and knowledge from the mass of data that exists within a typical


Business Intelligence Tools

Data warehouses

OLAP (On-Line Analytical Processing) tools

Data mining tools

Text mining tools

Web mining tools

Data joiners (integrators)

Business Intelligence portals, etc.

the focus of our lectures


Business Intelligence Tools (cont.)

Data warehouse - a repository of multiple heterogeneous data sources, organized under a unified schema at a single site in order to facilitate management decision making.

OLAP – analysis techniques with functionalities such as summarization, consolidation, and aggregation, as well as the ability to view information from different angles.

Data mining – extracting or “mining” knowledge from large data sets.

Text mining – “mining” large textual (document) databases.

Web mining – discovering knowledge from hypertext data.

Data joiner – working with data from disparate, heterogeneous data sources.

Business Intelligence portal – a Web site designed to be the first point of entry for visitors to information about a company. With the help of the portal's personalization functions, users can choose the information sources they need to perform a specific task.


DATA MINING

Introduction

This lecture covers what has come to be known as data mining and knowledge discovery in large databases, data warehouses, and other massive information repositories.

Data mining emerged during the late 1980s, made great strides during the late 1990s, and is expected to continue to flourish in the near future.

We introduce interesting data mining techniques and systems, and discuss applications and research


What Motivated Data Mining? Why Is It Important?

Huge amounts of data are widely available, and there is an imminent need to turn such data into useful information and knowledge.

Applications range from business management, production control, and market analysis to engineering design and science exploration.


Motivation

Business, medicine, scientific experiments, simulations, Earth observations

(Figure: “Data and data exploration” cloud)


CERN's Challenge

Starting point

New detector LHC
» Large Hadron Collider, 14 TeV
» Goals: search for the Higgs boson and the graviton (and others)
Start 2006

Challenges

Data are accessed worldwide
» CERN and regional centers (Europe, Asia, America)
» 2000 users
Huge data volumes
Data semantics
Performance and throughput

(Figure labels: CMS, ATLAS, LHCb)


Multi-Tier Model


The Evolution of Database Technology

Data Collection and Database Creation (1960s and earlier)
- Primitive file processing

Database Management Systems (1970s-early 1980s)
- Hierarchical, network and relational DB systems
- Query languages (SQL, etc.), query optimization
- Transaction management, concurrency control, recovery
- Data modeling tools

Advanced Database Systems (mid-1980s-present)
- Object-oriented, object-relational, spatial, multimedia, ...

Web-based Database Systems (1990s-present)
- XML-based DB systems
- Web mining

Data Warehousing and Data Mining (late 1980s-present)
- Data warehouse and OLAP technology


Database Querying and Data Mining

Query languages like SQL are standardized and powerful, but they are too difficult for unskilled users.

OLAP tools allow flexible multidimensional queries. Their methods are query-centric.

(Figure labels: query languages like SQL, OLAP tools, data mining tools, data warehouse)


So, What Is Data Mining?

Data mining – searching for knowledge (interesting patterns) in your data.


Data Mining As a Step in the Process of Knowledge Discovery

• Many people treat data mining as a synonym for the term Knowledge Discovery in Databases, or KDD.
• Alternative view: data mining as a step in KDD:

– 1. Data cleaning (to remove noise and inconsistent data)

– 2. Data integration (where multiple data sources may be combined)

– 3. Data selection (where data relevant to the analysis task are retrieved from the database)

– 4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)

– 5. Data mining (an essential process where intelligent methods are applied in order to extract patterns)

– 6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)

– 7. Knowledge presentation to the user
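The seven steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the records, the function names, and the trivial "mining" step (counting repeat purchases) are invented for the example and are not part of the lecture material.

```python
from collections import Counter

# Hypothetical raw records from two sources: (customer, item, amount).
source_a = [("c1", "tv", 300.0), ("c2", "cd", 15.0), ("c1", "cd", None)]
source_b = [("c2", "cd", 12.0), ("c3", "pc", 900.0), ("c2", "cd", 12.0)]

def clean(records):
    """Step 1: drop records with a missing amount."""
    return [r for r in records if r[2] is not None]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [record for source in sources for record in source]

def select(records, items):
    """Step 3: keep only records relevant to the analysis task."""
    return [r for r in records if r[1] in items]

def transform(records):
    """Step 4: aggregate to purchase counts per (customer, item)."""
    return Counter((customer, item) for customer, item, _ in records)

def mine(counts, min_count):
    """Step 5: a stand-in mining step that finds repeat purchases."""
    return {key: n for key, n in counts.items() if n >= min_count}

def run_kdd(sources, items, min_count):
    records = integrate(*(clean(s) for s in sources))   # steps 1-2
    counts = transform(select(records, items))          # steps 3-4
    patterns = mine(counts, min_count)                  # step 5
    # Steps 6-7: rank by count (a simple interestingness measure)
    # and present the result as readable text.
    ranked = sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{c} bought {i} {n} times" for (c, i), n in ranked]

report = run_kdd([source_a, source_b], {"cd"}, min_count=2)
```

In a real project each step is far more involved; the point here is only the order of the stages and the data handed from one to the next.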


Data Mining in Knowledge Discovery

Architecture of a Data Mining System

(Figure labels: graphical user interface, pattern evaluation, data mining engine, database or data warehouse server, knowledge base)


Architecture of a Data Mining System (2)

Database, data warehouse, or other information repository:

One or a set of databases, data warehouses, spreadsheets, etc.

Database or data warehouse server: responsible for fetching the relevant data, based on the user’s data mining request.

Knowledge base: domain knowledge that is used to guide the search, or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attribute values into different levels of abstraction.

Data mining engine: essential to the data mining system; ideally consists of a set of functional modules for tasks such as characterization, association, classification, cluster analysis, and evolution and deviation analysis.


Architecture of a Data Mining System (3)

Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It may use interestingness thresholds to filter out discovered patterns.

Graphical user interface: This module communicates between users and the data mining system, allowing the user to

• specify a data mining query or task

• provide information to help focus the search

• perform exploratory data mining based on the intermediate data mining results

• browse database and data warehouse schemas or data structures

• evaluate mined patterns


Stages of a Data Exploration Project

Stage                               Time to complete      Importance to success
                                    (percent of total)    (percent of total)
1. Exploring the problem                   10                    15
2. Exploring the solution                   9                    14
3. Implementation specification             1                    51
   Subtotal (stages 1-3)                   20                    80
4. Knowledge discovery
   a. Data preparation                     60                    15
   b. Data surveying                       15                     3
   c. Data modeling                         5                     2
   Subtotal (stage 4)                      80                    20

Based on: Data Preparation for Data Mining, by Dorian Pyle, Morgan Kaufmann

Relational Database

• A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.

• A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple represents an object identified by a unique key.

• Relational data can be accessed by database queries written in a relational query language, such as SQL.
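As a minimal sketch of these ideas, the following uses Python's built-in sqlite3 module to create one table, store a few tuples under a unique key, and query them with SQL. The table contents and the name and city columns are made up for illustration.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# One table (relation) with three attributes; cust_ID is the unique key.
conn.execute(
    "CREATE TABLE customer (cust_ID TEXT PRIMARY KEY, name TEXT, city TEXT)"
)

# Each inserted row is one tuple (record).
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [("C1", "Smith", "Vienna"), ("C2", "Jones", "Vienna"), ("C3", "Lee", "Graz")],
)

# A relational query: names of all customers in one city.
rows = conn.execute(
    "SELECT name FROM customer WHERE city = ? ORDER BY name", ("Vienna",)
).fetchall()

def duplicate_key_rejected(connection):
    """Return True if inserting an existing key violates the key constraint."""
    try:
        connection.execute(
            "INSERT INTO customer VALUES (?, ?, ?)", ("C1", "Other", "Graz")
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

The query answers exactly what was asked (which customers live in Vienna); it does not discover patterns, which is the gap data mining tools address.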


Relational Databases – Example

The AllElectronics company is described by the following relation tables: customer, item, employee, and branch. Fragments of these tables are shown on the next slide; the attribute that represents the key or composite key component is underlined.

• The relation customer consists of a set of attributes, including a unique customer identity number (cust_ID), and so on.

• Tables can also be used to represent the relationships between or among multiple relational tables. E.g., these include purchases (customer purchases items, creating a sales transaction that is handled by an employee), items_sold (lists the items sold in the given transaction), and works_at (employee works at a branch of AllElectronics).


Data Warehouses

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.

Data warehouses are constructed via a process of data cleaning, data transformation, data integration, data loading and periodic data refreshing.

The figure on the next slide shows the basic architecture of a data warehouse for AllElectronics.

In order to facilitate decision making, the data in a data warehouse are organized around major subjects, such as customer, item, supplier, and activity. The data are stored from a historical perspective and are typically summarized.

Architecture of a Data Warehouse

(Figure labels: data source in Ch., data source in NY; clean, transform, integrate, load; data warehouse; query and analysis tools)


Modeling a Data Warehouse

A data warehouse is usually modeled by a multidimensional database structure, where each dimension corresponds to an attribute in the schema, and each cell stores the value of some aggregate measure, such as count or sales_amount.

The actual physical structure of a data warehouse may be a relational data store or a multidimensional data cube. It provides a multidimensional view of data and allows the precomputation and fast accessing of summarized data.

Example:

A data cube for summarized sales data of AllElectronics is presented in the next slide.


Modeling a Data Warehouse (2)

Data warehouse vs. data mart: A data warehouse collects information about subjects that span an entire organization, and thus its scope is enterprise-wide. A data mart, by contrast, is department-wide in scope.

Data warehouse systems are well suited for On-Line Analytical Processing, or OLAP.

OLAP operations allow the presentation of data at different levels of abstraction.

Examples of OLAP operations include drill-down and roll-up, which allow the user to view the data at different degrees of summarization, as illustrated on the previous slide.
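Roll-up and drill-down can be sketched on a tiny data cube held in a dictionary. The dimensions, cities, and sales figures below are hypothetical, and roll-up is shown here only in its simplest form (summing away dimensions); climbing a concept hierarchy, such as city to country, is not covered.

```python
from collections import defaultdict

DIMENSIONS = ("city", "quarter", "item_type")

# Hypothetical base cells: (city, quarter, item_type) -> sales_amount.
cells = {
    ("Vienna", "Q1", "computer"): 100,
    ("Vienna", "Q2", "computer"): 150,
    ("Vienna", "Q1", "phone"): 40,
    ("Graz", "Q1", "computer"): 60,
    ("Graz", "Q2", "phone"): 30,
}

def roll_up(cube, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    summary = defaultdict(int)
    for key, amount in cube.items():
        summary[tuple(key[p] for p in positions)] += amount
    return dict(summary)

# Roll-up: a coarse view with one total per city.
by_city = roll_up(cells, ("city",))

# Drill-down from that view: add the quarter dimension back in.
by_city_quarter = roll_up(cells, ("city", "quarter"))
```

A warehouse would typically precompute such summaries so that switching between the two views is fast.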

Transactional Databases

A transactional database consists of a file where each record represents a transaction.

A transaction includes a unique transaction identity number (trans_id), and a list of the items making up the transaction (such as items purchased in a store).

The transactional database may have additional tables associated with it, which contain other information regarding the sale, such


Transactional Databases (2)

trans_id    list of item_IDs
T100        I1, I3, I8, I16
...         ...

The transactional database is usually either stored in a flat file in a format similar to that of the above table, or unfolded into a standard relation in a format similar to that of the items_sold table in slide no. 18.

A regular data retrieval system is not able to answer queries like “Which items sold well together?”
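A rough sketch of how such a question could be approached: count how often each pair of items appears in the same transaction. Only T100 comes from the table above; the other transactions are invented for the example.

```python
from collections import Counter
from itertools import combinations

# T100 is taken from the slide; T200-T400 are hypothetical.
transactions = {
    "T100": ["I1", "I3", "I8", "I16"],
    "T200": ["I1", "I3"],
    "T300": ["I3", "I8"],
    "T400": ["I1", "I3", "I8"],
}

def pair_counts(transactions):
    """Count, for every pair of items, the transactions containing both."""
    counts = Counter()
    for items in transactions.values():
        # sorted(set(...)) gives each pair one canonical order and
        # ignores an item listed twice in the same transaction.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)
```

Counting all pairs this way grows quickly with the number of items per transaction; it is meant as an illustration of the question, not as an efficient frequent-itemset algorithm.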


Advanced Database Systems and Database Applications

Relational DB systems have been widely used in business applications.

The new database applications include handling

• spatial data (e.g., maps)

• engineering design data (e.g., the design of buildings or integrated circuits)

• hypertext and multimedia data (text, image, video, audio data)

• time-related data (e.g., stock exchange data)

• World Wide Web (a huge, widely distributed information repository made available by the Internet)


Data Mining Tasks

Data Mining Functionalities - What Kinds of Patterns Can Be Mined?

• Data mining functionalities are used to specify the kind of patterns that can be found in data mining tasks.

• Data mining tasks can be classified into 2 categories:

Descriptive - they characterize the general properties of the data in the database.

Predictive - they perform inference on the current data in order to make predictions.


Association Analysis

Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently in a given set of data.

• The association rule X => Y is interpreted as “database tuples that satisfy the conditions in X are also likely to satisfy the conditions in Y.”

Example A data mining system may find the following rule in the AllElectronics data:

age(X, “20..29”) and income(X, “20K..29K”) => buys(X, “CD player”) [support = 2%, confidence = 60%]

• X is a variable representing a customer. The rule indicates that of the customers under study, 2% are 20 to 29 years of age with an income of 20K to 29K and have purchased a CD player. There is a 60% probability that a customer in this age and income group will purchase a CD player.


Association Analysis (Cont.)

• We would like to determine which items are frequently purchased together within the same transactions. E.g.,

contains(T, “computer”) => contains(T, “software”) [support = 1%, confidence = 50%]

• Explanation: if a transaction, T, contains “computer”, there is a 50% chance that it contains “software” as well, and 1% of all of the transactions contain both.

• This rule involves a single attribute or predicate (i.e., contains), so it is a single-dimensional association rule. It can be written simply as “computer => software [1%, 50%]”
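Support and confidence can be computed directly from a list of transactions. In this sketch, support is the fraction of transactions containing every item of the rule, and confidence is that fraction divided by the support of the left-hand side alone. The four baskets are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base

# Hypothetical baskets, not data from the lecture.
baskets = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software"},
]

rule_support = support(baskets, {"computer", "software"})
rule_confidence = confidence(baskets, {"computer"}, {"software"})
```

Here one of four baskets contains both items (support 25%), and one of the two baskets with a computer also contains software (confidence 50%).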


Classification and Prediction

Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).

• “How is the derived model presented?”

Classification (IF-THEN) rules

Mathematical formulae

Decision tree - it is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and the tree leaves represent classes or class distributions.

Neural networks - a collection of neuron-like processing units with weighted connections between the units.
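To make the decision tree representation concrete, here is a small sketch in which an internal node is a pair (attribute, branches) and a leaf is a class label. The tree itself, with its weather attributes, is an illustrative assumption and was written by hand, not learned from data.

```python
# Internal node: (attribute_name, {test_outcome: subtree}).
# Leaf: a class label given as a string.
tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {True: "no", False: "yes"}),
})

def classify(node, instance):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[instance[attribute]]
    return node

label = classify(tree, {"outlook": "sunny", "humidity": "normal", "windy": True})
```

Each path from the root to a leaf corresponds to one IF-THEN rule, which is why trees and rule sets are often interchangeable presentations of the same model.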

Classification and Prediction (Cont.)

Prediction - in many applications, users may wish to predict some missing or unavailable data values rather than class labels. The predicted values are usually numerical data.


A Sample Data Set

Fictional data set that describes the weather conditions for playing some unspecified game.

outlook    temperat.  humidity  windy  play
sunny      hot        high      false  no
sunny      hot        high      true   no
overcast   hot        high      false  yes
rainy      mild       high      false  yes
rainy      cool       normal    false  yes
rainy      cool       normal    true   no
overcast   cool       normal    true   yes
sunny      mild       high      false  no
sunny      cool       normal    false  yes
rainy      mild       normal    false  yes
sunny      mild       normal    true   yes
overcast   mild       high      true   yes
overcast   hot        normal    false  yes
rainy      mild       high      true   no


Learning Rules

Example of a set of rules learned from the sample data set:

if outlook = sunny and humidity = high then play = no

if outlook = rainy and windy = true then play = no

if outlook = overcast then play = yes

if humidity = normal then play = yes

if none of the above then play = yes

These are classification rules that assign an output class (play or not) to each instance (single example in a data set).
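The rules can be checked against the 14 instances of the sample data set with a few lines of code. The sketch assumes the rules are tried in the order given, so the first matching rule decides the class; the fourth attribute is called windy, as in the rules.

```python
# The 14 instances: (outlook, temperature, humidity, windy, play).
DATA = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

def predict(outlook, temperature, humidity, windy):
    """Apply the five rules in order; the first rule that fires wins."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # default rule: none of the above

correct = sum(predict(*row[:4]) == row[4] for row in DATA)
```

Applied in this order, the rules classify all 14 instances correctly. Taken out of order they would not: the sixth instance (rainy, cool, normal, windy) has normal humidity, so the fourth rule alone would wrongly predict play = yes.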


Learning Decision Trees

Classification learning

• Training set: a set of examples, where each example is a feature vector (i.e., a set of (attribute, value) pairs) with its associated class. The model is built on this set.

• Test set: a set of examples disjoint from the training set, used for testing the accuracy of a model.
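The split into training and test sets can be sketched as follows. The examples, the fixed seed, and the deliberately trivial model (always predict the most frequent training class) are assumptions made for illustration; any real classifier would take the place of the baseline.

```python
import random
from collections import Counter

def train_test_split(examples, n_test, seed=0):
    """Shuffle a copy of the examples and hold out `n_test` of them for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

def majority_class(train):
    """A baseline model: the most frequent class label in the training set."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def accuracy(predicted_label, test):
    """Fraction of test examples whose true label equals the prediction."""
    return sum(label == predicted_label for _, label in test) / len(test)

# Hypothetical examples: (feature vector as a dict, class label).
examples = [({"id": i, "humidity": "high" if i % 3 else "normal"},
             "no" if i % 3 == 0 else "yes") for i in range(12)]

train, test = train_test_split(examples, n_test=4)
model = majority_class(train)      # built on the training set only
score = accuracy(model, test)      # measured on the disjoint test set
```

Because the test examples never influence the model, the measured accuracy is an estimate of how the model behaves on unseen data.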


Slide author: J. Han


Cluster Analysis

Clustering analyzes data objects without consulting a known class label.

• Clustering can be used to generate such labels.

• The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.

• Each cluster can be viewed as a class of objects, from which rules can be derived.

Example Cluster analysis can be performed on AllElectronics customer data in order to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing. (The figure on the next slide shows a 2-D plot of customers with respect to customer locations in a city.)
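One widely used clustering method is k-means; the slide does not prescribe an algorithm, so it is swapped in here purely as an illustration. The sketch groups hypothetical 2-D customer locations by repeatedly assigning each point to its nearest center and moving each center to the mean of its points.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def assign(points, centers):
    """Put every point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda c: dist2(p, centers[c]))
        clusters[nearest].append(p)
    return clusters

def mean(cluster):
    """Centroid of a non-empty list of 2-D points."""
    return (sum(x for x, _ in cluster) / len(cluster),
            sum(y for _, y in cluster) / len(cluster))

def kmeans(points, k, iterations=20, seed=0):
    """Basic k-means: k random points as initial centers, fixed iteration count."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        clusters = assign(points, centers)
        # An empty cluster keeps its previous center.
        centers = [mean(cl) if cl else centers[i] for i, cl in enumerate(clusters)]
    return centers, assign(points, centers)

# Hypothetical customer locations forming two well-separated groups.
locations = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 10), (10, 11), (11, 10), (11, 11)]
centers, clusters = kmeans(locations, k=2)
```

Points within a cluster end up close to each other (high intraclass similarity) and far from the other cluster's center (low interclass similarity), which is the principle stated above.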


Outlier Analysis

• A database may contain data objects that do not comply with the general behaviour or model of the data. These data objects are outliers.

• Most data mining methods discard outliers as noise or exceptions.

• In some applications such as fraud detection, the rare events can be more interesting than the more regularly occurring ones.

Example Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of extremely large amounts for a given account number in comparison to regular charges incurred by the same account.
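A simple way to flag such charges, sketched under assumptions: compare each amount with the account's median charge and flag it when the distance exceeds a multiple of the median absolute deviation (MAD). The amounts and the threshold of 5 are arbitrary illustrative choices, not values from the lecture.

```python
from statistics import median

def flag_outliers(amounts, threshold=5.0):
    """Return the amounts lying more than `threshold` MADs from the median."""
    center = median(amounts)
    mad = median(abs(a - center) for a in amounts)
    if mad == 0:
        return []  # no spread at all: this rule cannot rank deviations
    return [a for a in amounts if abs(a - center) / mad > threshold]

# Hypothetical charges on one account; the last one is unusually large.
charges = [20, 35, 25, 30, 40, 28, 2500]
suspicious = flag_outliers(charges)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme charge shifts them very little, so the outlier cannot hide itself by inflating the spread.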


On-Line Demo on Clustering

http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html
