
Business Intelligence and Data Mining


Academic year: 2021


Dr. Hui Xiong

Business Intelligence and Data Mining


Rutgers University

Learning Objectives

• Understand the need for business intelligence systems.
• Know the characteristics of reporting systems.
• Know the purpose and role of data warehouses and data marts.
• Understand fundamental data-mining techniques.
• Know the purpose, features, and functions of knowledge management systems.

The Need for Business Intelligence Systems

• According to a study done at the University of California at Berkeley, a total of 403 petabytes of new data were created.
• 403 petabytes is roughly the amount of all printed material ever written.
– The printed collection of the Library of Congress is 0.01 petabytes.
– 400 petabytes equals 40,000 copies of the print collection of the Library of Congress.

The Need for Business Intelligence Systems (Continued)

• The generation of all these data has much to do with Moore’s Law.
• The capacity of storage devices increases as their costs decrease.
• Today, storage capacity is nearly unlimited.
• We are drowning in data and starving for information.

Figure 9-1 How Big Is an Exabyte?

Source: Used with permission of Peter Lyman and Hal R. Varian, University of California at Berkeley.

Figure 9-2 Hard Disk Storage Capacity

Business Intelligence Tools

• Tools that search business data in an attempt to find patterns are called business intelligence (BI) tools.
• Reporting tools are programs that read data from a variety of sources, process that data, produce formatted reports, and deliver those reports to the users who need them.

Business Intelligence Tools (Continued)

• In reporting tools, the processing of data is simple:
– Data are sorted and grouped.
– Simple totals and averages are calculated.
• Reporting tools are used primarily for assessment.
– They are used to address questions like:
  • What has happened in the past?
  • What is the current situation?
  • How does the current situation compare to the past?

Business Intelligence Tools (Continued)

• Data-mining tools process data using statistical techniques, many of which are sophisticated and mathematically complex.
• Data mining involves searching for patterns and relationships among data.
• In most cases, data-mining tools are used to make predictions.
• For example, we can use one form of analysis to compute the probability that a customer will default on a loan.
• Another way to distinguish reporting tools from data-mining tools:
– Reporting tools use simple operations like sorting, grouping, and summing.
– Data-mining tools use sophisticated techniques.

Business Intelligence Systems

• An information system is a collection of hardware, software, data, procedures, and people.
• The purpose of a business intelligence (BI) system is to provide the right information, to the right user, at the right time.
• BI systems help users accomplish their goals and objectives by producing insights that lead to actions.

Business Intelligence Systems (Continued)

• A reporting tool can generate a report that shows a customer has canceled an important order.
• A reporting system, however, alerts that customer’s salesperson with this unwanted news, and does so in time for the salesperson to try to alter the customer’s decision.
• A data-mining tool can create an equation that computes the probability that a customer will default on a loan.
• A data-mining system uses that equation to enable banking personnel to assess new loan applications.

Reporting Systems

• The purpose of a reporting system is to create meaningful information from disparate data sources and to deliver that information to the proper user on a timely basis.
• Reporting systems generate information from data as a result of four operations:
– Filtering data
– Sorting data
– Grouping data
– Making simple calculations on the data
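The operations listed above, followed by the kind of simple total and average that reporting tools calculate, can be sketched in a few lines of Python. This is only an illustration: the trade records, field names, and numbers are made up and do not come from the figures in these slides.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical trade records; symbols and numbers are invented for illustration.
trades = [
    {"symbol": "AAA", "side": "buy", "shares": 100, "price": 10.0},
    {"symbol": "BBB", "side": "sell", "shares": 50, "price": 20.0},
    {"symbol": "AAA", "side": "buy", "shares": 200, "price": 11.0},
    {"symbol": "BBB", "side": "buy", "shares": 10, "price": 21.0},
]

# Filter: keep only the buy orders.
buys = [t for t in trades if t["side"] == "buy"]

# Sort: order by symbol (itertools.groupby requires sorted input).
buys.sort(key=itemgetter("symbol"))

# Group by symbol, then compute a simple total and a weighted average per group.
report = {}
for symbol, rows in groupby(buys, key=itemgetter("symbol")):
    rows = list(rows)
    total_shares = sum(r["shares"] for r in rows)
    total_value = sum(r["shares"] * r["price"] for r in rows)
    report[symbol] = {"shares": total_shares, "avg_price": total_value / total_shares}

for symbol, line in report.items():
    print(f"{symbol}: {line['shares']} shares, average price {line['avg_price']:.2f}")
```

Nothing here is statistically sophisticated, which is the point the slides make about reporting tools: the value comes from selecting and arranging the data, not from modeling it.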

Figure 9-3 Trade Data for NDX.X (NASDAQ 100)

Figure 9-4 Report Based on Trade Data in Figure 9-3

Components of Reporting Systems

• A reporting system maintains a database of reporting metadata.
• The metadata describes the reports, users, groups, roles, events, and other entities involved in the reporting activity.
• The reporting system uses the metadata to prepare and deliver reports to the proper users on a timely basis.

Figure 9‐5 Components of a Reporting System

Figure 9-6 Summary of Report Characteristics

Report Type

• In terms of report type, reports can be static or dynamic.
• Static reports are prepared once from the underlying data, and they do not change.
– Example: a report of past year’s sales
• Dynamic reports: the reporting system reads the most current data and generates the report using that fresh data.
– Examples are: a report on sales today and a

Report Type (Continued)

• Query reports are prepared in response to data entered by users.
• Online analytical processing (OLAP) reports allow the user to dynamically change the report grouping structures.

Report Media

• Reports are delivered via many different report media or channels.
• Some reports are printed on paper, and others are created in a format like PDF whereby they can be printed or viewed electronically.
• Other reports are delivered to computer screens.
• Companies sometimes place reports on internal corporate Web sites for employees to access.

Report Media (Continued)

• Another report medium is a digital dashboard, which is an electronic display customized for a particular user.
– Vendors like Yahoo! and MSN provide common examples.
– Users of these services can define the content they want: say, a local weather forecast, a list of stock prices, or a list of news sources.
– The vendor constructs the display customized for each user.

Report Media (Continued)

• Other dashboards are particular to an organization.
– The organization might have a dashboard that shows up-to-the-minute production and sales activities.
• Alerts are another form of report.
– Users can declare that they wish to receive notifications of events, say, via email or on their cell phones.
• Reports can be published via a Web service.
– The Web service produces the report in response to requests from the service-consuming application.

Figure 9-7 Digital Dashboard Example

Report Mode

• The report mode can be either push report or pull report.
• Organizations send a push report to users according to a preset schedule.
– Users receive the report without any activity on their part.
• Users must request a pull report.
– To obtain a pull report, a user goes to a Web portal or digital dashboard and clicks a link or button to cause the reporting system to

Functions of Reporting Systems

• Three functions of reporting systems are:
– Authoring
– Management
– Delivery
• Report authoring involves connecting to data sources, creating the reporting structure, and formatting the report.

Report Management

• The purpose of report management is to define who receives what reports, when, and by what means.
• Most report-management systems allow the report administrator to define user accounts and user groups and to assign particular users to particular groups.
• Reports that have been created using the report-authoring system are assigned to groups and users.

Report Management (Continued)

• Assigning reports to groups saves the administrator work.
– When a report is created, changed, or removed, the administrator need only change the report assignments to the group.
– All of the users in the group will inherit the changes.
• Metadata also indicates what channel is to be used and whether the report is to be pushed or pulled.
– If the report is to be pushed, the administrator declares whether the report is to be generated on a regular schedule or as an alert.
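A small sketch may make the role of report-management metadata more concrete. Everything in it is hypothetical: the group names, report names, and dictionary fields are invented for illustration and do not describe any particular reporting product.

```python
# Hypothetical report-management metadata; all names and fields are illustrative.
groups = {"sales_managers": {"alice", "bob"}, "executives": {"carol"}}

# Each assignment records which group gets which report, by what channel and mode.
assignments = [
    {"report": "daily_sales", "group": "sales_managers", "channel": "email", "mode": "push"},
    {"report": "daily_sales", "group": "executives", "channel": "dashboard", "mode": "pull"},
]

def recipients(report_name):
    """Return {user: (channel, mode)} for one report, resolved through group membership."""
    result = {}
    for a in assignments:
        if a["report"] == report_name:
            for user in sorted(groups[a["group"]]):
                result[user] = (a["channel"], a["mode"])
    return result

# Assigning a new report to a group is a single metadata change;
# every member of the group inherits it.
assignments.append(
    {"report": "weekly_pipeline", "group": "sales_managers", "channel": "email", "mode": "push"}
)

daily = recipients("daily_sales")
weekly = recipients("weekly_pipeline")
print(sorted(weekly))
```

The administrator touches one assignment record rather than one record per user, which is the labor saving the slide describes.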

Report Delivery

• The report-delivery function of a reporting system pushes reports or allows them to be pulled according to report-management metadata.
• Reports can be delivered via an email server, Web site, XML Web services, or by other program-specific means.
• The report-delivery system uses the operating system and other program security components to ensure that only authorized users receive authorized reports.

Report Delivery (Continued)

• The report-delivery system also ensures that push reports are produced at appropriate times.
• For query reports, the report-delivery system serves as an intermediary between the user and the report generator.
– It receives user query data, such as item numbers in an inventory query, passes the query data to the report generator, receives the resulting report, and delivers the report to the user.

Online Analytical Processing

• Online analytical processing (OLAP) provides the ability to sum, count, average, and perform other simple arithmetic operations on groups of data.
• The remarkable characteristic of OLAP reports is that they are dynamic.
• The viewer of the report can change the report’s

Online Analytical Processing (Continued)

• An OLAP report has measures and dimensions.
• A measure is the data item of interest.
– It is the item that is to be summed or averaged or otherwise processed in the OLAP report.
• A dimension is a characteristic of a measure.
– Purchase date, customer type, customer location, and sales region are all examples of dimensions.

Online Analytical Processing (Continued)

• With an OLAP report, it is possible to drill down into 

the data.

– This term means to further divide the data into more detail.

• Special-purpose products called OLAP servers have been developed to perform OLAP analysis.
• An OLAP server reads data from an operational database, performs preliminary calculations, and stores the results of those operations in an OLAP database.
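To illustrate measures, dimensions, and drilling down, here is a minimal sketch in plain Python. The sales rows and the `summarize` helper are hypothetical; a real OLAP server would precompute and store such totals rather than recalculating them on each request.

```python
from collections import defaultdict

# Hypothetical sales facts: "amount" is the measure, the other fields are dimensions.
sales = [
    {"region": "East", "customer_type": "Retail", "amount": 100.0},
    {"region": "East", "customer_type": "Wholesale", "amount": 250.0},
    {"region": "West", "customer_type": "Retail", "amount": 80.0},
    {"region": "West", "customer_type": "Retail", "amount": 120.0},
]

def summarize(rows, dimensions, measure="amount"):
    """Sum the measure for every combination of values of the chosen dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

# One dimension: total sales per region.
by_region = summarize(sales, ["region"])

# Drilling down: adding a dimension divides each regional total into more detail.
by_region_and_type = summarize(sales, ["region", "customer_type"])

print(by_region)
print(by_region_and_type)
```

Changing the list of dimensions passed to `summarize` corresponds to the viewer changing the report's grouping structure.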

Figure 9-13 OLAP Family and Store Location by Store Type

Figure 9-14 Role of OLAP Server and OLAP Database

Data Warehouses and Data Marts

• Basic reports and simple OLAP analyses can be made directly from operational data.
• For the most part, such reports display the current state of the business; if there are a few missing values or small inconsistencies in the data, no one is too concerned.
• Operational data are unsuited to more sophisticated analyses, particularly data-mining analyses, which require high-quality input for accurate and useful results.

Data Warehouses and Data Marts (Continued)

• Many organizations choose to extract operational data into facilities called data warehouses and data marts, both of which prepare, store, and manage data specifically for data mining and other analyses.
• Programs read operational data and extract, clean, and prepare that data for BI processing.
• The prepared data are stored in a data-warehouse database using a data-warehouse DBMS, which can be

Data Warehouses and Data Marts (Continued)

• Data warehouses include data that are purchased from outside sources.
• Metadata concerning the data, its source, its format, its assumptions and constraints, and other facts about the data is kept in a data-warehouse metadata database.
• The data-warehouse DBMS extracts and provides data to business intelligence tools such as data-mining programs.

Figure 9‐15 Components of a Data Warehouse

Figure 9‐16 Consumer Data Available for Purchase from Data Vendors

Problems with Operational Data (Continued)

• Inconsistent data are particularly common for data that have been gathered over time.
– When an area code changes, for example, the phone number for a given customer before the change will not match the customer’s number after the change.
• Some data inconsistencies arise from the nature of the business activity.
• Nonintegrated data can cause problems when data come from different management information systems.

Figure 9‐17 Problems of Using Transaction Data for Analysis and Data Mining

Data Warehouses Versus Data Marts

• The data warehouse takes data from the data manufacturers (operational systems and purchased data), cleans and processes the data, and locates the data on the shelves, so to speak, of the data warehouse.
• A data mart is a data collection, smaller than the data warehouse, that addresses a particular component or

Data Warehouse Versus Data Marts (Continued)

• The data warehouse is like the distributor in the supply chain, and the data mart is like the retail store in the supply chain.
• Users in the data mart obtain data that pertain to a particular business function from the data warehouse.
• It is expensive to create, staff, and operate data warehouses and data marts.

Figure 9-18 Data Mart Examples

Data Mining and Business Intelligence

Knowledge Discovery in Data

Dr. Hui Xiong

Rutgers University

Why Mine Data? Commercial Viewpoint

• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
– Provide better, customized services for an edge (e.g., in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint

• Data collected and stored at enormous speeds (GB/hour)
– Remote sensors on a satellite
– Telescopes scanning the skies
– Microarrays generating gene expression data
– Scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in hypothesis formation

Mining Large Data Sets - Motivation

• There is often information “hidden” in the data that is not readily evident
• Human analysts may take weeks to discover useful information
• Much of the data is never analyzed at all

[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999; vertical axis from 0 to 4,000,000.]

From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”

Scale of Data

Organization        Scale of Data
Walmart             ~20 million transactions/day
Google              ~8.2 billion Web pages
Yahoo               ~10 GB Web data/hr
NASA satellites     ~1.2 TB/day
NCBI GenBank        ~22 million genetic sequences
France Telecom      29.2 TB
UK Land Registry    18.3 TB
AT&T Corp           26.2 TB

“The great strength of computers is that they can reliably manipulate vast amounts of data very quickly. Their great weakness is that they don’t have a clue as to what any

Why Do We Need Data Mining?

• Leverage organization’s data assets
– Only a small portion (typically 5%-10%) of the collected data is ever analyzed
– Data that may never be analyzed continues to be collected, at great expense, out of fear that something which may prove important in the future is missing.
– Growth rates of data preclude the traditional “manually intensive” approach

Why Do We Need Data Mining? (Continued)

• As databases grow, supporting the decision support process with traditional query languages becomes infeasible
– Many queries of interest are difficult to state in a query language (query formulation problem)
  • “find all cases of fraud”
  • “find all individuals likely to buy a Ford Expedition”
  • “find all documents that are similar to this customer’s problem”

What is Data Mining?

• Many Definitions
– Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
– Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

What is (not) Data Mining?

• What is not Data Mining?
– Look up phone number in phone directory
– Check the dictionary for the meaning of a word
• What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to their context (e.g. Amazon

Data Mining: Confluence of Multiple Disciplines

20x20 ~ 2^400 ≈ 10^120 patterns

Data Mining Applications

• Market analysis
• Risk analysis and management
• Fraud detection and detection of unusual patterns (outliers)
• Text mining (news group, email, documents) and Web mining
• Stream data mining
• DNA and bio-data analysis

Fraud Detection & Mining Unusual Patterns

• Approaches: clustering and model construction for frauds, outlier analysis
• Applications: health care, retail, credit card service, …
– Auto insurance: ring of collisions
– Money laundering: suspicious monetary transactions
– Medical insurance
  • Professional patients, ring of doctors, and ring of references
  • Unnecessary or correlated screening tests
– Telecommunications: phone-call fraud
  • Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
– Retail industry
  • Analysts estimate that 38% of retail shrink is due to dishonest employees
– Anti-terrorism

Data Mining and Business Intelligence

Data Mining Tasks

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes

• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

[Figure: inter-cluster distances are maximized; intra-cluster distances are minimized.]
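The idea of small intra-cluster distances and large inter-cluster distances can be checked numerically. The sketch below uses a bare-bones K-means loop on six made-up 2-D points; the data, the choice of initial centers, and the two summary measures are assumptions made for illustration, not part of the slides.

```python
import math

def mean(points):
    """Centroid (coordinate-wise average) of a non-empty list of 2-D points."""
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))

def kmeans(points, centers, iterations=10):
    """Plain K-means: assign each point to its nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters, centers

# Made-up data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centers = kmeans(points, centers=[points[0], points[3]])

# Average distance from each point to its own center (intra-cluster)
# versus the distance between the two centers (inter-cluster).
intra = sum(math.dist(p, centers[i]) for i, c in enumerate(clusters) for p in c) / len(points)
inter = math.dist(centers[0], centers[1])
print(f"intra={intra:.2f} inter={inter:.2f}")
```

For a good clustering of this data the first number is much smaller than the second. K-means only works toward the first goal directly; here the separation between centers follows from the data.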

Applications of Cluster Analysis

• Understanding
– Group related documents for browsing
– Group genes and proteins that have similar functionality
– Group stocks with similar price fluctuations
• Summarization
– Reduce the size of large data sets

Discovered Clusters (Industry Group):

1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

[Figure: Use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres.]

Clustering: Application 1

• Market Segmentation:
Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach:
• Collect different attributes of customers based on their geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.

Clustering: Application 2

• Document Clustering:
Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
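A term-frequency similarity measure of the kind described in the approach could look like the following sketch. Cosine similarity is one common choice and is an assumption here, since the slide does not name a specific measure; the three example documents are made up.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase word occurs in a document."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (0.0 means no shared terms)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up documents: two about markets, one about sports.
docs = {
    "d1": "stock market prices fall as market fears grow",
    "d2": "market prices rise on stock rally",
    "d3": "team wins the final game of the season",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_12 = cosine_similarity(tf["d1"], tf["d2"])
sim_13 = cosine_similarity(tf["d1"], tf["d3"])
print(f"d1 vs d2: {sim_12:.3f}, d1 vs d3: {sim_13:.3f}")
```

A clustering algorithm would then group d1 with d2, since they are far more similar to each other than either is to d3. In practice common words are usually removed or down-weighted first, which this sketch omits.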

What is not Cluster Analysis?

• Simple segmentation
– Dividing students into different registration groups alphabetically, by last name
• Results of a query
– Groupings are a result of an external specification
– Clustering is a grouping of objects based on the data
• Supervised classification
– Have class label information
• Association Analysis
– Local vs. global connections

Notion of a Cluster can be Ambiguous

[Figure panels: How many clusters? Two Clusters; Four Clusters; Six Clusters]

Types of Clusterings

• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of clusters
• Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree

Partitional Clustering

[Figure panels: Original Points; A Partitional Clustering]

Hierarchical Clustering

[Figure panels, over points p1 to p4: Traditional Hierarchical Clustering; Traditional Dendrogram; Non-traditional Hierarchical Clustering; Non-traditional Dendrogram]

Other Distinctions Between Sets of Clusters

• Exclusive versus non-exclusive
– In non-exclusive clusterings, points may belong to multiple clusters.
– Can represent multiple classes or ‘border’ points
• Fuzzy versus non-fuzzy
– In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
– Weights must sum to 1
– Probabilistic clustering has similar characteristics
• Partial versus complete
– In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
– Clusters of widely different sizes, shapes, and densities

Types of Clusters

• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or Conceptual
• Described by an Objective Function

Types of Clusters: Well-Separated

• Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

Types of Clusters: Center-Based

• Center-based
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster
– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster

[Figure: 4 center-based clusters]
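The difference between the two kinds of center is easy to see in code. In this sketch the medoid is taken to be the cluster member with the smallest total distance to the other members, which is one common reading of "most representative" and is an assumption here; the four points are made up.

```python
import math

def centroid(points):
    """Coordinate-wise average of the points; it need not be one of the points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def medoid(points):
    """The actual data point with the smallest total distance to all points in the cluster."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# Made-up cluster with one point far from the rest.
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
c = centroid(cluster)
m = medoid(cluster)
print("centroid:", c, "medoid:", m)
```

The centroid is pulled toward the distant point and is not itself a member of the cluster, while the medoid is always one of the original points.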

Types of Clusters: Contiguity-Based

• Contiguous Cluster (Nearest neighbor or Transitive)
– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

[Figure: 8 contiguous clusters]

Types of Clusters: Density-Based

• Density-based
– A cluster is a dense region of points, which is separated by low-density regions from other regions of high density.
– Used when the clusters are irregular or intertwined, and when noise and outliers are present.

[Figure: 6 density-based clusters]

Types of Clusters: Conceptual Clusters

• Shared Property or Conceptual Clusters
– Finds clusters that share some common property or represent a particular concept.

[Figure: 2 Overlapping Circles]

Characteristics of the Input Data Are Important

• Type of proximity or density measure
– This is a derived measure, but central to clustering
• Sparseness
– Dictates type of similarity
– Adds to efficiency
• Attribute type
– Dictates type of similarity
• Type of Data
– Dictates type of similarity
– Other characteristics, e.g., autocorrelation
• Dimensionality
• Noise and Outliers
• Type of Distribution

Data Mining Tasks

[Figure: sample data table with columns Tid, Refund, Marital Status, Taxable Income, Cheat, repeated from the first Data Mining Tasks slide.]

Association Rule Discovery: Definition

• Given a set of records, each of which contains some number of items from a given collection
– Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
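How strongly the five transactions back each discovered rule can be measured with support and confidence. These two measures are the standard ones for association rules but are not defined on the slide, so treat their use here as an assumption; the transactions themselves are the ones in the table.

```python
# The five transactions from the slide's table.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction also containing the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs | rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

For {Milk} --> {Coke}, three of the five transactions contain both items and four contain Milk, so the rule holds in three of the four cases where it applies.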

Association Analysis: Applications

• Market-basket analysis
– Rules are used for sales promotion, shelf management, and inventory management
• Telecommunication alarm diagnosis
– Rules are used to find combinations of alarms that occur together frequently in the same time period
• Medical Informatics
– Rules are used to find combinations of patient symptoms and complaints associated with certain diseases

Application Deployment Challenge

Data Mining Tasks

[Figure: sample data table with columns Tid, Refund, Marital Status, Taxable Income, Cheat, repeated from the first Data Mining Tasks slide.]

Predictive Modeling: Classification

• Find a model for class attribute as a function of the values of other attributes

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes

[Figure: Model for predicting credit worthiness. A decision tree that splits on Employed (No / Yes), Education (Graduate / {High school, Undergrad}), and Number of years (> 3 yr / < 3 yr and > 7 yrs / < 7 yrs), with Yes and No leaves.]

Classification Example

Training Set:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes

Test Set:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?

[Figure: Training Set, Learn Classifier, Model; the model is then applied to the Test Set.]

Examples of Classification Task

• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Identifying intruders in the cyberspace

Classification: Application 1

• Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
• Use credit card transactions and the information on the account holder as attributes.
– When does a customer buy, what does he buy, how often he pays on time, etc.
• Label past transactions as fraud or fair transactions. This forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 2

• Churn prediction for telephone customers
Goal: To predict whether a customer is likely to be lost to a competitor.
Approach:
• Use detailed records of transactions with each of the past and present customers to find attributes.
– How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 3

• Sky Survey Cataloging
Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
Approach:
• Segment the image.
• Measure image attributes (features): 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

Classifying Galaxies

Class:
• Stages of Formation (Early, Intermediate, Late)
Attributes:
• Image features
• Characteristics of light waves received, etc.
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB

Classification Techniques

• Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Neural Networks
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines
• Ensemble Classifiers

Example of a Decision Tree

Training Data:

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Model: Decision Tree (splitting attributes: Home Owner, MarSt, Income)

Home Owner
– Yes: NO
– No: MarSt
  • Married: NO
  • Single, Divorced: Income
    < 80K: NO
    > 80K: YES

Another Example of Decision Tree

Training data: the same ten records (ID, Home Owner, Marital Status, Annual Income, Defaulted Borrower) as in the previous example.

MarSt
– Married: NO
– Single, Divorced: Home Owner
  • Yes: NO
  • No: Income
    < 80K: NO
    > 80K: YES

There could be more than one tree that fits the same data!

Decision Tree Classification Task

Training Set (Learn Model):

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set (Apply Model):

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

[Figure: the training set is used to learn the model (a decision tree), which is then applied to the test set.]

Apply Model to Test Data

Test Data

Home Owner  Marital Status  Annual Income  Defaulted Borrower
No          Married         80K            ?

Start from the root of tree.

Home Owner
  Yes: NO
  No: MarSt
    Married: NO
    Single, Divorced: Income
      < 80K: NO
      > 80K: YES


Assign Defaulted to “No”
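The walk from the root to a leaf can be sketched in code. This is an illustrative encoding rather than anything from the original slides: the tree is stored as nested dictionaries (a hypothetical layout), each internal node carries a small function that maps an attribute value to a branch name, and `classify` records the path it follows. The handling of an income of exactly 80K is an assumption; it does not matter for this test record, because the Income node is never reached.

```python
def income_branch(income_in_k):
    # Assumption: exactly 80K goes to the upper branch; the slides leave this open.
    return "< 80K" if income_in_k < 80 else ">= 80K"


def marital_branch(status):
    return "Married" if status == "Married" else "Single, Divorced"


INCOME_NODE = {
    "attribute": "Annual Income",
    "branch": income_branch,
    "children": {"< 80K": "No", ">= 80K": "Yes"},
}
MARST_NODE = {
    "attribute": "Marital Status",
    "branch": marital_branch,
    "children": {"Married": "No", "Single, Divorced": INCOME_NODE},
}
ROOT = {
    "attribute": "Home Owner",
    "branch": lambda value: value,
    "children": {"Yes": "No", "No": MARST_NODE},
}


def classify(record, node=ROOT):
    """Follow branches from the root until a leaf (a plain string) is reached."""
    path = []
    while isinstance(node, dict):
        branch = node["branch"](record[node["attribute"]])
        path.append((node["attribute"], branch))
        node = node["children"][branch]
    return node, path


test_record = {"Home Owner": "No", "Marital Status": "Married", "Annual Income": 80}
label, path = classify(test_record)
print(label)  # No
print(path)   # [('Home Owner', 'No'), ('Marital Status', 'Married')]
```

The record leaves the tree at the Married branch, so Defaulted is assigned “No” after only two tests.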


Decision Tree Induction

• Many Algorithms:

– Hunt’s Algorithm (one of the earliest)

– CART

– ID3, C4.5

– SLIQ, SPRINT

Data Mining Tasks

[Figure: overview of data mining tasks around a sample data table with columns Tid, Refund, Marital Status, Taxable Income, Cheat]


Deviation/Anomaly Detection

• Detect significant deviations from normal behavior

• Applications:

– Credit Card Fraud Detection

– Network Intrusion Detection

Anomaly Detection

• Challenges

– How many outliers are there in the data?

– Method is unsupervised

• Validation can be quite challenging (just like for clustering)

– Finding needle in a haystack

• Working assumption

– There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data

Anomaly Detection Schemes

• General Steps

– Build a profile of the “normal” behavior

• Profile can be patterns or summary statistics for the overall population

– Use the “normal” profile to detect anomalies

• Anomalies are observations whose characteristics differ significantly from the normal profile

• Types of anomaly detection schemes

– Graphical & Statistical‐based

– Distance‐based

– Model‐based

Graphical Approaches

• Boxplot (1‐D), Scatter plot (2‐D), Spin plot (3‐D)

• Limitations

– Time consuming

– Subjective

Statistical Approaches

• Assume a parametric model describing the distribution of the data (e.g., normal distribution)

• Apply a statistical test that depends on

– Data distribution

– Parameter of distribution (e.g., mean, variance)

– Number of expected outliers (confidence limit)
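One simple instance of such a test, shown here only as a sketch, assumes a normal distribution for a single attribute and flags values that lie more than a chosen number of standard deviations from the mean. The function name, the sample values, and the threshold of 3 are illustrative assumptions, not taken from the slides. Note that the mean and standard deviation are estimated from the same data that contain the outliers.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Return values whose z-score exceeds the threshold in absolute value.

    Assumes the attribute is roughly normally distributed; the threshold
    plays the role of the confidence limit.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []          # all values identical, nothing can deviate
    return [x for x in values if abs(x - mu) / sigma > threshold]


# Hypothetical measurements: twenty values near 50 and one value of 120.
values = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
          50, 52, 48, 50, 51, 49, 50, 52, 48, 50, 120]
print(zscore_outliers(values))  # [120]
```

With very small samples a fixed threshold can be unreachable, because a single extreme value also inflates the estimated standard deviation; this is one reason the choice of test depends on the number of expected outliers.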

Limitations of Statistical Approaches

• Most of the tests are for a single attribute

• In many cases, data distribution may not be known

• For high dimensional data, it may be


Distance‐based Approaches

• Data is represented as a vector of features

• Three major approaches

– Nearest‐neighbor based

– Density based

– Clustering based

Nearest‐Neighbor Based Approach

• Approach:

– Compute the distance between every pair of data points

– There are various ways to define outliers:

• Data points for which there are fewer than p neighboring points within a distance D

• The top n data points whose distance to the kth nearest neighbor is greatest

• The top n data points whose average distance to the k nearest neighbors is greatest
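The three outlier definitions can be written down almost literally. The sketch below is illustrative only: it uses Euclidean distance via `math.dist` (Python 3.8 or later), brute-force pairwise distances, hypothetical function names, and a made-up set of six 2-D points. "Within a distance D" is read as "at most D", which is an assumption.

```python
import math


def sorted_distances(points, i):
    """Distances from points[i] to every other point, in ascending order."""
    return sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)


def fewer_than_p_within_d(points, p, d):
    """Definition 1: points with fewer than p neighbors within distance d."""
    return [
        pt
        for i, pt in enumerate(points)
        if sum(1 for dist in sorted_distances(points, i) if dist <= d) < p
    ]


def top_n_by_kth_distance(points, n, k):
    """Definition 2: the n points whose kth-nearest-neighbor distance is greatest."""
    ranked = sorted(
        range(len(points)),
        key=lambda i: sorted_distances(points, i)[k - 1],
        reverse=True,
    )
    return [points[i] for i in ranked[:n]]


def top_n_by_average_distance(points, n, k):
    """Definition 3: the n points whose mean distance to the k nearest neighbors is greatest."""
    ranked = sorted(
        range(len(points)),
        key=lambda i: sum(sorted_distances(points, i)[:k]) / k,
        reverse=True,
    )
    return [points[i] for i in ranked[:n]]


# Hypothetical data: five points close together and one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
print(fewer_than_p_within_d(points, p=2, d=2.0))   # [(8, 8)]
print(top_n_by_kth_distance(points, n=1, k=2))     # [(8, 8)]
print(top_n_by_average_distance(points, n=1, k=2)) # [(8, 8)]
```

Computing every pairwise distance costs time quadratic in the number of points, so this direct form is practical only for small data sets.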

Density‐based: LOF approach

• For each point, compute the density of its local neighborhood

• Compute local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors

• Outliers are points with largest LOF value

[Figure: scatter plot with points p1 and p2 marked]

In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.
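A simplified version of this idea can be sketched as follows. It is not the full LOF algorithm, which is defined with reachability distances; here the local density of a point is assumed to be the inverse of its average distance to its k nearest neighbors, and the score is the average ratio of each neighbor's density to the point's own density, so larger scores mean more outlying. The point layout (a dense cluster, a sparse cluster, p1 far from both, p2 just outside the dense cluster) and all names are hypothetical.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k nearest neighbors of points[i], excluding itself."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def local_density(points, i, k):
    """Simplified density: inverse of the mean distance to the k nearest neighbors.

    Assumes there are no duplicate points, so the mean distance is positive.
    """
    neighbors = k_nearest(points, i, k)
    mean_distance = sum(math.dist(points[i], points[j]) for j in neighbors) / k
    return 1.0 / mean_distance


def simplified_lof(points, k):
    """Average ratio of neighbor density to own density for every point."""
    density = [local_density(points, i, k) for i in range(len(points))]
    scores = []
    for i in range(len(points)):
        neighbors = k_nearest(points, i, k)
        scores.append(sum(density[j] / density[i] for j in neighbors) / k)
    return scores


# Hypothetical layout: a dense 3x3 grid, a sparse 3x3 grid, and two extra points.
dense = [(0.1 * x, 0.1 * y) for x in range(3) for y in range(3)]
sparse = [(10 + 2 * x, 10 + 2 * y) for x in range(3) for y in range(3)]
p1 = (30.0, 30.0)  # far from both clusters
p2 = (0.6, 0.6)    # close to the dense cluster, but outside it
points = dense + sparse + [p1, p2]

scores = simplified_lof(points, k=3)
ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
top_two = {points[i] for i in ranked[:2]}
print(top_two == {p1, p2})  # True
```

In this toy layout p2 is closer to its neighbors than the points of the sparse cluster are to each other, so a ranking by kth-nearest-neighbor distance alone would not single it out, while the density ratio does.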

Clustering‐Based

• Basic idea:

– Cluster the data into groups of different density

– Choose points in small cluster as candidate outliers

– Compute the distance between candidate points and non‐candidate clusters.

‐ If candidate points are far from all other non‐candidate points, they are outliers

KDD Process

• Develop an understanding of the application domain

– Relevant prior knowledge, problem objectives, success criteria, current solution, inventory resources, constraints, terminology, cost and benefits

• Create target data set

– Collect initial data, describe, focus on a subset of variables, verify data quality

• Data cleaning and preprocessing

– Remove noise, outliers, missing fields, time sequence information, known trends, integrate data

• Data Reduction and projection

– Feature subset selection, feature construction, discretizations, aggregations

KDD Process

• Selection of data mining task

– Classification, segmentation, deviation detection, link analysis

• Select data mining approach

• Data mining to extract patterns or models

• Interpretation and evaluation of patterns/models


Knowledge Discovery

Challenges of Data Mining

• Scalability

• Dimensionality

• Complex and Heterogeneous Data

• Data Quality

• Data Ownership and Distribution

• Privacy Preservation

• Streaming Data

• Data from Multi‐Sources 

Similarities Between Data Miners and Doctors 

Data Characteristics

Data Mining Techniques Medical Devices

Commercial and Research Tools

WEKA: http://www.cs.waikato.ac.nz/ml/weka/

SAS: http://www.sas.com/

Clementine: http://www.spss.com/spssbi/clementine/

Intelligent Miner: http://www-3.ibm.com/software/data/iminer/

Insightful Miner: http://www.insightful.com/products/product.asp?PID=26

Textbooks

Knowledge Management

• Knowledge management systems concern the sharing of knowledge that is already known to exist, either in libraries of documents, in the heads of employees, or in other known sources.

Knowledge management (KM) is the process of creating value from intellectual capital and sharing that knowledge with employees, managers, suppliers,


Knowledge Management (Continued)

• Knowledge management is a process that is 

supported by the five components of an 

information system.

– Its emphasis is on people, their knowledge, and 

effective means for sharing that knowledge with 

others.

• The benefits of KM concern the application of 

knowledge to enable employees and others to 

leverage organizational knowledge to work 

smarter.

• KM preserves organizational memory by 

capturing and storing the lessons learned and 

best practices of key employees.

Content Management Systems

• Content management systems are information 

systems that track organizational documents, Web 

pages, graphics, and related materials.

• Such systems differ from operational document systems in that they do not directly support business operations.

• KM content management systems are concerned with 

the creation, management, and delivery of documents 

that exist for the purpose of imparting knowledge.

Content Management Systems (Continued)

• Typical users of content management systems are 

companies that sell complicated products and want to 

share their knowledge of those products with 

employees and customers.

• The basic functions of content management systems are the same as for report management systems: author, manage, and deliver.

• The only requirement that content managers place on 

document authoring is that the document has been 

created in a standardized format.

Content Management Problems

• Documents may refer to one another or multiple 

documents may refer to the same product or 

procedure.

– When one of them changes, others must change as 

well.

– Some content management systems keep semantic linkages among documents so that content dependencies can be known and used to maintain document consistency.

• Document contents are perishable.

– Documents become obsolete and need to be altered, removed, 

or replaced.

• Multinational companies have to ensure document 

language translations.

Figure 9‐23 Document Management at  

Microsoft.com (as of December 2003)

Source: microsoft.com/backstage/inside.htm (accessed February 2004). © 2003 Microsoft Corporation. All rights reserved.

Figure 9‐24 Reporting Services: United States


Figure 9‐25 Reporting Services: China

Source: Used with permission of Tom Rizzo of Microsoft Corporation.

Content Delivery

• Almost all users of content management systems pull 

the contents.

• Users cannot pull content if they do not know it 

exists.

– The content must be arranged and indexed, and a facility for searching the content devised.

• Documents that reside behind a corporate firewall, 

however, are not publicly accessible and will not be 

reachable by Google or other search engines.

– Organizations must index their own proprietary documents 

and provide their own search capability for them.

KM Systems to Facilitate the Sharing of Human 

Knowledge

• Nothing is more frustrating for a manager to 

contemplate than the situation in which one employee 

struggles with a problem that another employee knows 

how to solve easily.

• KM systems are concerned not only with the sharing of content, but also with the sharing of knowledge among humans.

– How can one person share her knowledge with another?

– How can one person learn of another person’s great idea?

KM Systems to Facilitate the Sharing of Human Knowledge (Continued)

• Three forms of technology are used for knowledge‐sharing among humans:

– Portals, discussion groups, and email

– Collaboration systems

– Expert systems

Portals

– Employees can share ideas by posting 

knowledge on a Web portal whereby 

managers and employees can pull the 

knowledge from the portal.

Figure 9‐26 Technology Support of Sharing 

Human Knowledge

KM Systems to Facilitate the Sharing of Human 

Knowledge (Continued)

Discussion Groups

Discussion groups allow employees or customers to post questions and queries seeking solutions to problems they have.

– Oracle, IBM, PeopleSoft, and other vendors support product discussion groups where users can post questions and where employees, vendors, and other users can answer them.

– Later, the organization can edit and summarize the 

questions from such discussion groups into 


KM Systems to Facilitate the Sharing of Human 

Knowledge (Continued)

Discussion groups (continued)

– Basic email can also be used for knowledge‐sharing, 

especially if email lists have been constructed with 

KM in mind.

– Two human factors inhibit knowledge‐sharing.

• Employees can be reluctant to exhibit their 

ignorance.

• Competition exists between employees.

– A KM application may be ill‐suited to a competitive 

group.

• The company may be able to restructure rewards 

and incentives to foster sharing of ideas among 

employees.

KM Systems to Facilitate the Sharing of Human 

Knowledge (Continued)

Collaboration Systems

Collaboration systems are information systems that enable people to work together more effectively.

– The Internet can be used as a broadcast medium for speeches, 

panel discussion, and other types of meetings.

– Web broadcasts, because they are digital, can be readily saved and replayed at the viewer’s convenience.

– Web broadcasts can also be made interactive by combining 

them with discussion group bulletin boards that are live during 

the broadcast.

Video conferencing is another popular form of IT‐supported meeting.

• Video‐conferencing equipment is expensive and normally is 

located in selected sites in the organization.

Figure 9‐27 Net Meeting Graphic

KM Systems to Facilitate the Sharing of Human 

Knowledge (Continued)

Expert Systems

Expert systems are created by interviewing experts in a given business domain and codifying the rules stated by those experts.

– Many expert systems were created in the late 1980s and 1990s, and some of them have been successful.

– Expert systems suffer from three major 

disadvantages.

• They are difficult and expensive to develop.

• They are difficult to maintain.

• They were unable to live up to the high 
