Business Intelligence and Data Mining

Dr. Hui Xiong

Rutgers University

Learning Objectives

• Understand the need for business intelligence systems.

• Know the characteristics of reporting systems.

• Know the purpose and role of data warehouses and

data marts.

• Understand fundamental data‐mining techniques.

• Know the purpose, features, and functions of

knowledge management systems.

The Need for Business Intelligence Systems

• According to a study done at the University of

California at Berkeley, a total of 403 petabytes of new

data were created.

• 403 petabytes is roughly the amount of all printed

material ever written.

– The printed collection of the Library of Congress is

0.01 petabytes.

– 400 petabytes equals 40,000 copies of the print

collection of the Library of Congress.

The Need for Business Intelligence Systems (Continued)

• The generation of all these data has much to

do with Moore’s Law.

• The capacity of storage devices increases as

their costs decrease.

• Today, storage capacity is nearly unlimited.

• We are drowning in data and starving for

information.

Figure 9‐1 How big is an Exabyte?

Source: Used with permission of Peter Lyman and Hal R. Varian, University of California at Berkeley.

Figure 9‐2 Hard‐Disk Storage Capacity

Business Intelligence Tools

• Tools for searching business data in an attempt to find patterns are called business intelligence (BI) tools.

• Reporting tools are programs that read data

from a variety of sources, process that data,

produce formatted reports, and deliver those

reports to the users who need them.

Business Intelligence Tools

• The processing of data is simple:

– Data are sorted and grouped.

– Simple totals and averages are calculated.

• Reporting tools are used primarily for assessment

– They are used to address questions like:

•What has happened in the past?

•What is the current situation?

•How does the current situation compare to

the past?

Business Intelligence Tools (Continued)

• Data‐mining tools process data using statistical techniques, many of which are sophisticated and mathematically complex.

• Data mining involves searching for patterns and relationships among data.

• In most cases, data‐mining tools are used to make predictions.

• For example, we can use one form of analysis to compute

the probability that a customer will default on a loan.

• Another way to distinguish reporting tools from data‐mining tools is by the operations they use:

– Reporting tools use simple operations like sorting, grouping,

and summing.

– Data‐mining tools use sophisticated techniques.

Business Intelligence Systems

• An information system is a collection of hardware, software, data, procedures, and people.

• The purpose of a business intelligence (BI) system is to provide the right information, to the right user, at the right time.

• BI systems help users accomplish their goals

and objectives by producing insights that lead

to actions.

Business Intelligence Systems (Continued)

• A reporting tool can generate a report that shows a customer has canceled an important order.

• A reporting system, however, alerts that customer’s salesperson with this unwanted news, and does so in time for the salesperson to try to alter the customer’s decision.

• A data‐mining tool can create an equation that computes the probability that a customer will default on a loan.

• A data‐mining system uses that equation to enable banking personnel to assess new loan applications.

Reporting Systems

• The purpose of a reporting system is to create meaningful information from disparate data sources and to deliver that information to the proper user on a timely basis.

• Reporting systems generate information from

data as a result of four operations:

– Filtering data

– Sorting data

– Grouping data
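The operations listed above can be sketched in plain Python. This is only an illustration: the trade records, field names, and symbols are hypothetical and are not taken from Figure 9‐3, and the per-group total stands in for the "simple totals" mentioned on the earlier reporting-tools slide.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical trade records; the fields and values are invented for illustration.
trades = [
    {"symbol": "AAA", "side": "buy", "quantity": 100, "price": 10.0},
    {"symbol": "BBB", "side": "sell", "quantity": 50, "price": 20.0},
    {"symbol": "AAA", "side": "buy", "quantity": 200, "price": 11.0},
    {"symbol": "BBB", "side": "buy", "quantity": 10, "price": 19.0},
]

# Filtering: keep only the buy trades.
buys = [t for t in trades if t["side"] == "buy"]

# Sorting: order by symbol so that rows with the same symbol are adjacent.
buys.sort(key=itemgetter("symbol"))

# Grouping, plus a simple count and total per group.
report = {}
for symbol, group in groupby(buys, key=itemgetter("symbol")):
    rows = list(group)
    report[symbol] = {
        "trades": len(rows),
        "total_quantity": sum(r["quantity"] for r in rows),
    }

for symbol, summary in report.items():
    print(symbol, summary)
```

Note that `groupby` only merges adjacent rows, which is why the sort comes before the grouping step.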

Figure 9‐3 Trade Data for NDX.X (NASDAQ 100)

Figure 9‐4 Report Based on Trade Data in Figure 9‐3

Components of Reporting Systems

• A reporting system maintains a database of

reporting metadata.

• The metadata describes the reports, users,

groups, roles, events, and other entities

involved in the reporting activity.

• The reporting system uses the metadata to

prepare and deliver reports to the proper users

on a timely basis.

Figure 9‐5 Components of a Reporting System

Figure 9‐6 Summary of Report Characteristics

Report Type

• In terms of report type, reports can be static or dynamic.

• Static reports are prepared once from the underlying data, and they do not change.

– Example: a report of past year’s sales

• Dynamic reports: the reporting system reads the most current data and generates the report using that fresh data.

– Examples are: a report on sales today and a

Report Type (Continued)

• Query reports are prepared in response to data entered by users.

• Online analytical processing (OLAP) reports allow the user to dynamically change the report grouping structures.

Report Media

• Reports are delivered via many different report media or channels.

• Some reports are printed on paper, and others are created in a format like PDF whereby they can be printed or viewed electronically.

• Other reports are delivered to computer screens.

• Companies sometimes place reports on internal

corporate Web sites for employees to access.

Report Media (Continued)

• Another report medium is a digital dashboard,

which is an electronic display customized for a

particular user.

– Vendors like Yahoo! and MSN provide common examples.

– Users of these services can define the content they want: say, a local weather forecast, a list of stock prices, or a list of news sources.

– The vendor constructs the display customized for

each user.

Report Media (Continued)

• Other dashboards are particular to an organization.

– The organization might have a dashboard that shows up‐to‐the‐

minute production and sales activities.

• Alerts are another form of report.

– Users can declare that they wish to receive notifications of events, say, via email or on their cell phones.

• Reports can be published via a Web service.

– The Web service produces the report in response to requests

from the service‐consuming application.

Figure 9‐7 Digital Dashboard Example

Report Mode

• The report mode can be either push report or pull report.

• Organizations send a push report to users according to a preset schedule.

– Users receive the report without any activity on their part.

• Users must request a pull report.

– To obtain a pull report, a user goes to a Web

portal or digital dashboard and clicks a link

or button to cause the reporting system to

Functions of Reporting Systems

• Three functions of reporting systems are:

– Authoring

– Management

– Delivery

• Report authoring involves connecting to data

sources, creating the reporting structure, and

formatting the report.

Report Management

• The purpose of report management is to define who

receives what reports, when, and by what means.

• Most report‐management systems allow the report

administrator to define user accounts and user groups

and to assign particular users to particular groups.

• Reports that have been created using the report‐

authoring system are assigned groups and users.

Report Management (Continued)

• Assigning reports to groups saves the

administrator work.

– When a report is created, changed, or removed, the

administrator need only change the report

assignments to the group.

– All of the users in the group will inherit the changes.

• Metadata also indicates what channel is to be used and

whether the report is to be pushed or pulled.

– If the report is to be pushed, the administrator

declares whether the report is to be generated on a

regular schedule or as an alert.
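The group-based assignment described above can be sketched as a small metadata structure. All group, user, and report names below are hypothetical, and the dictionary layout is an assumption made for illustration rather than the schema of any particular reporting product.

```python
# Hypothetical report-management metadata; every name here is invented.
groups = {
    "sales_managers": {"alice", "bob"},
    "executives": {"carol"},
}

# Each report is assigned to groups and carries a channel and a mode.
reports = {
    "daily_sales": {
        "groups": {"sales_managers", "executives"},
        "channel": "email",
        "mode": "push",
    },
    "inventory_query": {
        "groups": {"sales_managers"},
        "channel": "web",
        "mode": "pull",
    },
}


def recipients(report_name):
    """Return the users who inherit a report through their group membership."""
    users = set()
    for group in reports[report_name]["groups"]:
        users |= groups[group]
    return users


before = sorted(recipients("inventory_query"))

# One change to the group assignment reaches every member of that group.
reports["inventory_query"]["groups"].add("executives")
after = sorted(recipients("inventory_query"))

print(before, "->", after)
```

The administrator edits a single assignment, and the set of recipients is derived from group membership each time it is needed.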

Report Delivery

• The report‐delivery function of a reporting system

pushes reports or allows them to be pulled according

to report‐management metadata.

• Reports can be delivered via an email server, Web site,

XML Web services, or by other program‐specific

means.

• The report‐delivery system uses the operating system

and other program security components to ensure that

only authorized users receive authorized reports.

Report Delivery (Continued)

• The report‐delivery system also ensures that

push reports are produced at appropriate

times.

• For query reports, the report‐delivery system

serves as an intermediary between the user and

the report generator.

– It receives user query data, such as item

numbers in an inventory query, passes the

query data to the report generator, receives

the resulting report, and delivers the report

to the user.

Online Analytical Processing

• Online analytical processing (OLAP) provides the ability to sum, count, average, and perform other simple arithmetic operations on groups of data.

• The remarkable characteristic of OLAP reports is that they are dynamic.

• The viewer of the report can change the report’s

Online Analytical Processing

• An OLAP report has measures and dimensions.

• A measure is the data item of interest.

– It is the item that is to be summed or averaged or otherwise processed in the OLAP report.

• A dimension is a characteristic of a measure.

– Purchase date, customer type, customer location, and sales region are all examples of dimensions.

Online Analytical Processing (Continued)

• With an OLAP report, it is possible to drill down into

the data.

– This term means to further divide the data into more detail.

• Special‐purpose products called OLAP servers have been developed to perform OLAP analysis.

• An OLAP server reads data from an operational

database, performs preliminary calculations, and stores

the results of those operations in an OLAP database.
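Measures, dimensions, and drill-down can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the rows, the dimension names (`store_type`, `region`), and the `summarize` helper are hypothetical, and a real OLAP server would precompute and store such totals rather than recompute them on each request.

```python
from collections import defaultdict

# Hypothetical sales rows: "sales" is the measure, the other fields are dimensions.
rows = [
    {"store_type": "Supermarket", "region": "West", "sales": 100.0},
    {"store_type": "Supermarket", "region": "East", "sales": 150.0},
    {"store_type": "Deluxe", "region": "West", "sales": 80.0},
    {"store_type": "Supermarket", "region": "West", "sales": 20.0},
]


def summarize(rows, dimensions, measure="sales"):
    """Sum the measure for each combination of the chosen dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# Report grouped by one dimension.
by_type = summarize(rows, ["store_type"])

# Drill down: add a dimension to divide the same totals into more detail.
by_type_region = summarize(rows, ["store_type", "region"])

print(by_type)
print(by_type_region)
```

Changing the list of dimensions is the code-level counterpart of a viewer changing the report's grouping structure.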

Figure 9‐13 OLAP Family and Store Location by Store Type

Figure 9‐14 Role of OLAP Server and OLAP Database

Data Warehouses and Data Marts

• Basic reports and simple OLAP analyses can be made

directly from operational data.

• For the most part, such reports display the current

state of the business; and if there are a few missing

values or small inconsistencies with the data, no one is

too concerned.

• Operational data are unsuited to more sophisticated

analyses, particularly, data‐mining analyses that

require high‐quality input for accurate and useful

results.

Data Warehouses and Data Marts (Continued)

• Many organizations choose to extract operational data into facilities called data warehouses and data marts, both of which prepare, store, and manage data specifically for data mining and other analyses.

• Programs read operational data and extract, clean, and prepare that data for BI processing.

• The prepared data are stored in a data‐warehouse

database using data‐warehouse DBMS, which can be

Data Warehouses and Data Marts

• Data warehouses include data that are purchased from

outside sources.

• Metadata concerning the data, its source, its format, its assumptions and constraints, and other facts about the data is kept in a data‐warehouse metadata database.

• The data‐warehouse DBMS extracts and provides data to

business intelligence tools such as data‐mining

programs.

Figure 9‐15 Components of a Data Warehouse

Figure 9‐16 Consumer Data Available for Purchase from Data Vendors

Problems with Operational Data (Continued)

• Inconsistent data are particularly common for data that

have been gathered over time.

– When an area code changes, for example, the phone number

for a given customer before the change will not match the

customer’s number after the change.

• Some data inconsistencies occur from the nature of the

business activity.

• Nonintegrated data can cause problems when data

comes from different management information

systems.

Figure 9‐17 Problems of Using Transaction Data for Analysis and Data Mining

Data Warehouses Versus Data Marts

• The data warehouse takes data from the data manufacturers (operational systems and purchased data), cleans and processes the data, and locates the data on the shelves, so to speak, of the data warehouse.

• A data mart is a data collection, smaller than the data warehouse, that addresses a particular component or

Data Warehouse Versus Data Marts (Continued)

• The data warehouse is like the distributor in the supply

chain and the data mart is like the retail store in the

supply chain.

• Users in the data mart obtain data that pertain to a

particular business function from the data warehouse.

• It is expensive to create, staff, and operate data

warehouses and data marts.

Figure 9‐18 Data Mart Examples

Data Mining and Business Intelligence

Knowledge Discovery in Data

Dr. Hui Xiong

Rutgers University

Why Mine Data? Commercial Viewpoint

• Lots of data is being collected and warehoused

– Web data, e‐commerce

– Purchases at department/grocery stores

– Bank/Credit Card transactions

• Computers have become cheaper and more powerful

• Competitive pressure is strong

– Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint

• Data collected and stored at enormous speeds (GB/hour)

– Remote sensors on a satellite

– Telescopes scanning the skies

– Microarrays generating gene expression data

– Scientific simulations generating terabytes of data

• Traditional techniques infeasible for raw data

• Data mining may help scientists

– in classifying and segmenting data

– in hypothesis formation

Mining Large Data Sets ‐ Motivation

• There is often information “hidden” in the data that is

not readily evident

• Human analysts may take weeks to discover useful

information

• Much of the data is never analyzed at all

Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999.

From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”

Scale of Data

Organization        Scale of Data
Walmart             ~20 million transactions/day
Google              ~8.2 billion Web pages
Yahoo               ~10 GB Web data/hr
NASA satellites     ~1.2 TB/day
NCBI GenBank        ~22 million genetic sequences
France Telecom      29.2 TB
UK Land Registry    18.3 TB
AT&T Corp           26.2 TB

“The great strength of computers is that they can reliably manipulate vast amounts of data very quickly. Their great weakness is that they don’t have a clue as to what any

Why Do We Need Data Mining?

• Leverage organization’s data assets

– Only a small portion (typically 5%‐10%) of the collected data is ever analyzed

– Data that may never be analyzed continues to be collected, at great expense, out of fear that something which may prove important in the future is missing.

– Growth rates of data preclude the traditional “manually intensive” approach

Why Do We Need Data Mining?

• As databases grow, supporting the decision support process with traditional query languages becomes infeasible

– Many queries of interest are difficult to state in a

query language (Query formulation problem)

– “find all cases of fraud”

– “find all individuals likely to buy a FORD

expedition”

– “find all documents that are similar to this customer’s problem”

What is Data Mining?

• Many Definitions

– Non‐trivial extraction of implicit, previously unknown and

potentially useful information from data

– Exploration & analysis, by automatic or semi‐automatic

means, of large quantities of data in order to discover

meaningful patterns

What is (not) Data Mining?

• What is not Data Mining?

– Look up phone number in phone directory

– Check the dictionary for the meaning of a word

• What is Data Mining?

– Certain names are more prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly… in Boston area)

– Group together similar documents returned by search engine according to their context (e.g. Amazon

Data Mining: Confluence of Multiple Disciplines

20x20 ~ 2^400 ≈10^120 patterns

Data Mining Applications

• Market analysis

• Risk analysis and management

• Fraud detection and detection of unusual

patterns (outliers)

• Text mining (news group, email, documents)

and Web mining

• Stream data mining

• DNA and bio‐data analysis

Fraud Detection & Mining Unusual Patterns

• Approaches: Clustering & model construction for frauds, outlier analysis

• Applications: Health care, retail, credit card service, …

– Auto insurance: ring of collisions

– Money laundering: suspicious monetary transactions

– Medical insurance

• Professional patients, ring of doctors, and ring of references

• Unnecessary or correlated screening tests

– Telecommunications: phone‐call fraud

• Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm

– Retail industry

• Analysts estimate that 38% of retail shrink is due to dishonest employees

– Anti‐terrorism

Data Mining and Business Intelligence

Data Mining Tasks …

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes

Clustering

• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

Figure: Intra-cluster distances are minimized; inter-cluster distances are maximized.

Applications of Cluster Analysis

• Understanding

– Group related documents for browsing

– Group genes and proteins that have similar functionality

– Group stocks with similar price fluctuations

Discovered Clusters (by Industry Group)

1. Technology1-DOWN: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN

2. Technology2-DOWN: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN

3. Financial-DOWN: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN

4. Oil-UP: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

• Summarization

– Reduce the size of large data sets

Figure: Use of K‐means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres.

Clustering: Application 1

• Market Segmentation:

– Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.

– Approach:

• Collect different attributes of customers based on their geographical and lifestyle related information.

• Find clusters of similar customers.

• Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.

Clustering: Application 2

• Document Clustering:

– Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.

– Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
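One possible similarity measure based on term frequencies is cosine similarity. The slide does not name a specific measure, so cosine similarity is an assumption here, and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase word occurs in a document."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: two about markets, one about sports.
docs = {
    "d1": "stock market prices fall",
    "d2": "stock prices rise on market news",
    "d3": "team wins football match",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_12 = cosine_similarity(tf["d1"], tf["d2"])
sim_13 = cosine_similarity(tf["d1"], tf["d3"])
print(round(sim_12, 3), sim_13)
```

Documents that share many terms score close to 1 and documents with no terms in common score 0, so a clustering method can use these scores to group `d1` with `d2` rather than with `d3`.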

What is not Cluster Analysis?

• Simple segmentation

– Dividing students into different registration groups

alphabetically, by last name

• Results of a query

– Groupings are a result of an external specification

– Clustering is a grouping of objects based on the data

• Supervised classification

– Have class label information

• Association Analysis

– Local vs. global connections

Notion of a Cluster can be Ambiguous

Figure panels: How many clusters? Two Clusters; Four Clusters; Six Clusters

Types of Clusterings

• A clustering is a set of clusters

• Important distinction between hierarchical and partitional sets of clusters

• Partitional Clustering

– A division of data objects into non‐overlapping subsets (clusters) such that each data object is in exactly one subset

• Hierarchical clustering

– A set of nested clusters organized as a hierarchical

tree

Partitional Clustering

Figure panels: Original Points; A Partitional Clustering

Hierarchical Clustering

Figure panels (points p1 to p4): Traditional Hierarchical Clustering; Traditional Dendrogram; Non-traditional Hierarchical Clustering; Non-traditional Dendrogram

Other Distinctions Between Sets of Clusters

• Exclusive versus non‐exclusive

– In non‐exclusive clusterings, points may belong to multiple

clusters.

– Can represent multiple classes or ‘border’ points

• Fuzzy versus non‐fuzzy

– In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1

– Weights must sum to 1

– Probabilistic clustering has similar characteristics

• Partial versus complete

– In some cases, we only want to cluster some of the data

• Heterogeneous versus homogeneous

– Clusters of widely different sizes, shapes, and densities

Types of Clusters

• Well‐separated clusters

• Center‐based clusters

• Contiguous clusters

• Density‐based clusters

• Property or Conceptual

• Described by an Objective Function

Types of Clusters: Well‐Separated

• Well‐Separated Clusters:

– A cluster is a set of points such that any point in a

cluster is closer (or more similar) to every other point

in the cluster than to any point not in the cluster.

Types of Clusters: Center‐Based

• Center‐based

– A cluster is a set of objects such that an object in a

cluster is closer (more similar) to the “center” of a

cluster, than to the center of any other cluster

– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster

4 center-based clusters
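The center-based definition above is the idea behind K‐means, which is mentioned earlier in these slides. The following is a minimal sketch, not a production implementation: the points and the starting centroids are hypothetical, and fixed starting centroids are an assumption made to keep the run deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Average of all the points in a cluster, coordinate by coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the nearest center.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: keep the old centroid if a cluster ends up empty.
        updated = [
            centroid(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

At convergence every point is closer to the centroid of its own cluster than to any other centroid, which is the center-based property stated in the definition.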

Types of Clusters: Contiguity‐Based

• Contiguous Cluster (Nearest neighbor or

Transitive)

– A cluster is a set of points such that a point in a

cluster is closer (or more similar) to one or more other

points in the cluster than to any point not in the

cluster.

8 contiguous clusters

Types of Clusters: Density‐Based

• Density‐based

– A cluster is a dense region of points, which is

separated by low‐density regions, from other

regions of high density.

– Used when the clusters are irregular or intertwined, and when noise and outliers are present.

6 density-based clusters

Types of Clusters: Conceptual Clusters

• Shared Property or Conceptual Clusters

– Finds clusters that share some common property

or represent a particular concept.

2 Overlapping Circles

Characteristics of the Input Data Are Important

• Type of proximity or density measure

– This is a derived measure, but central to clustering

• Sparseness

– Dictates type of similarity

– Adds to efficiency

• Attribute type

– Dictates type of similarity

• Type of Data

– Dictates type of similarity

– Other characteristics, e.g., autocorrelation

• Dimensionality

• Noise and Outliers

• Type of Distribution

Data Mining Tasks …

Association Rule Discovery: Definition

• Given a set of records, each of which contains some number of items from a given collection

– Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:

{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
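The two rules can be checked against the five transactions in the table. The slide does not say how rules are scored; support and confidence, used below, are the usual measures and are introduced here as an assumption.

```python
# The five transactions from the table above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions with the antecedent, the fraction that also hold the consequent."""
    return support(antecedent | consequent) / support(antecedent)


rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(
        sorted(lhs), "-->", sorted(rhs),
        "support", round(support(lhs | rhs), 2),
        "confidence", round(confidence(lhs, rhs), 2),
    )
```

Milk appears in four of the five transactions and Coke accompanies it in three of them, so {Milk} --> {Coke} has support 3/5 and confidence 3/4. {Diaper, Milk} appears three times, twice with Beer, giving support 2/5 and confidence 2/3.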

Association Analysis: Applications

• Market‐basket analysis

– Rules are used for sales promotion, shelf management, and

inventory management

• Telecommunication alarm diagnosis

– Rules are used to find combination of alarms that occur

together frequently in the same time period

• Medical Informatics

– Rules are used to find combination of patient symptoms

and complaints associated with certain diseases

Application

Deployment

Challenge

Data Mining Tasks …

Predictive Modeling: Classification

• Find a model for class attribute as a function of the values of other attributes

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Model for predicting credit worthiness

Employed
  No: No
  Yes: Education
    Graduate: Number of years
      > 3 yr: Yes
      < 3 yr: No
    {High school, Undergrad}: Number of years
      > 7 yrs: Yes
      < 7 yrs: No

Classification Example

Training Set

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Test Set

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

Figure labels: Training Set; Learn Classifier; Model; Test Set

Examples of Classification Task

• Predicting tumor cells as benign or malignant

• Classifying credit card transactions as legitimate or fraudulent

• Classifying secondary structures of protein as alpha‐helix, beta‐sheet, or random coil

• Categorizing news stories as finance, weather, entertainment, sports, etc

• Identifying intruders in the cyberspace

Classification: Application 1

• Fraud Detection

– Goal: Predict fraudulent cases in credit card transactions.

– Approach:

• Use credit card transactions and the information on the account holder as attributes.

– When does a customer buy, what does he buy, how often he pays on time, etc.

• Label past transactions as fraud or fair transactions. This forms the class attribute.

• Learn a model for the class of the transactions.

• Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 2

• Churn prediction for telephone customers

– Goal: To predict whether a customer is likely to be lost to a competitor.

– Approach:

• Use detailed record of transactions with each of the past and present customers, to find attributes.

– How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.

• Label the customers as loyal or disloyal.

• Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 3

• Sky Survey Cataloging

– Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).

– 3000 images with 23,040 x 23,040 pixels per image.

– Approach:

• Segment the image.

• Measure image attributes (features): 40 of them per object.

• Model the class based on these features.

• Success Story: Could find 16 new high red‐shift quasars, some of the farthest objects that are difficult to find!

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

Classifying Galaxies

Class:

• Stages of Formation (Early, Intermediate, Late)

Attributes:

• Image features,

• Characteristics of light waves received, etc.

Data Size:

• 72 million stars, 20 million galaxies

• Object Catalog: 9 GB

• Image Database: 150 GB

Classification Techniques

• Base Classifiers

– Decision Tree based Methods

– Rule‐based Methods

– Nearest‐neighbor

– Neural Networks

– Naïve Bayes and Bayesian Belief Networks

– Support Vector Machines

• Ensemble Classifiers

Example of a Decision Tree

Training Data

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Model: Decision Tree (splitting attributes: Home Owner, MarSt, Income)

Home Owner
  Yes: NO
  No: MarSt
    Married: NO
    Single, Divorced: Income
      < 80K: NO
      > 80K: YES

Another Example of Decision Tree

Training data: the same ten records (ID 1 to 10) as in the previous example.

MarSt
  Married: NO
  Single, Divorced: Home Owner
    Yes: NO
    No: Income
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!

Decision Tree Classification Task

Training Set (Learn Model)

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set (Apply Model)

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Model: Decision Tree

Apply Model to Test Data

Test Data

Home Owner  Marital Status  Annual Income  Defaulted Borrower
No          Married         80K            ?

Start from the root of tree.

Home Owner
  Yes: NO
  No: MarSt
    Married: NO
    Single, Divorced: Income
      < 80K: NO
      > 80K: YES

Assign Defaulted to “No”
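The traversal can be written as a small function that follows the example tree from the root. The function name `classify` and the tuple layout are hypothetical. The slide labels the income branches "< 80K" and "> 80K" and does not say where exactly 80K goes; sending it to the "Yes" branch is an assumption, and it does not affect the test record because that record is resolved at the marital-status node.

```python
def classify(home_owner, marital_status, income_k):
    """Walk the example tree from the root and return the predicted 'Defaulted' label."""
    if home_owner == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on annual income (in thousands).
    # Assumption: an income of exactly 80K follows the "> 80K" branch.
    return "No" if income_k < 80 else "Yes"


# The ten training records from the decision tree example slide.
training = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Records the tree gets wrong on the training data (expected to be none).
errors = [row for row in training if classify(*row[:3]) != row[3]]

# The test record: not a home owner, married, income 80K.
prediction = classify("No", "Married", 80)
print(errors, prediction)
```

The tree reproduces every training label, and the test record reaches the "Married" leaf, so it is assigned "No".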

Decision Tree Induction

• Many Algorithms:

– Hunt’s Algorithm (one of the earliest)

– CART

– ID3, C4.5

– SLIQ, SPRINT

Data Mining Tasks …

Deviation/Anomaly Detection

• Detect significant deviations from normal behavior

• Applications:

– Credit Card Fraud Detection

– Network Intrusion Detection

Anomaly Detection

• Challenges

– How many outliers are there in the data?

– Method is unsupervised

• Validation can be quite challenging (just like for clustering)

– Finding needle in a haystackFinding needle in a haystack

• Working assumption

– There are considerably more “normal” observations

than “abnormal” observations (outliers/anomalies)

in the data

Anomaly Detection Schemes

• General Steps

– Build a profile of the “normal” behavior

• Profile can be patterns or summary statistics for the overall population

– Use the “normal” profile to detect anomalies

• Anomalies are observations whose characteristics differ significantly from the normal profile

• Types of anomaly detection schemes

– Graphical & Statistical‐based

– Distance‐based

– Model‐based

Graphical Approaches

• Boxplot (1‐D), Scatter plot (2‐D), Spin plot (3‐D)

• Limitations

– Time consuming

– Subjective

Statistical Approaches

• Assume a parametric model describing the distribution of the data (e.g., normal distribution)

• Apply a statistical test that depends on

– Data distribution

– Parameter of distribution (e.g., mean, variance)

– Number of expected outliers (confidence limit)
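As a rough illustration of the statistical approach, the sketch below assumes a normal model and flags values whose z-score (distance from the mean in standard deviations) exceeds a cutoff. The function name and the cutoff of 2.0 are assumptions chosen for the example, not values from the slides; the sample data are the ten Attrib3 incomes (in thousands) from the classification example in this deck.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean.

    Assumes the data are roughly normally distributed; the threshold is a
    hypothetical cutoff standing in for the chosen confidence limit.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [x for x in values if abs(x - mu) / sigma > threshold]


incomes_k = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
print(zscore_outliers(incomes_k))  # prints [220]
```

With only ten observations a z-score cannot exceed about 2.85, so a stricter cutoff such as 3.0 would never flag anything here; the cutoff has to be chosen with the sample size in mind.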

Limitations of Statistical Approaches

• Most of the tests are for a single attribute

• In many cases, data distribution may not be known

• For high dimensional data, it may be


Distance‐based Approaches

• Data is represented as a vector of features

• Three major approaches

– Nearest‐neighbor based

– Density based

– Clustering based

Nearest‐Neighbor Based Approach

• Approach:

– Compute the distance between every pair of data points

– There are various ways to define outliers:

• Data points for which there are fewer than p neighboring points within a distance D

• The top n data points whose distance to the kth nearest neighbor is greatest

• The top n data points whose average distance to the k nearest neighbors is greatest
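The second definition above (the top n points by distance to the kth nearest neighbor) can be sketched as follows. This is a brute-force illustration with hypothetical function names; it computes all pairwise Euclidean distances, so it only suits small data sets.

```python
import math


def kth_nn_distances(points, k):
    """For each point, return the distance to its kth nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_n_outliers(points, k, n):
    """Return the n points whose kth-nearest-neighbor distance is greatest."""
    scores = kth_nn_distances(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return [points[i] for i in ranked[:n]]


data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(top_n_outliers(data, k=2, n=1))  # prints [(10, 10)]
```

Swapping `dists[k - 1]` for the mean of `dists[:k]` would give the third definition (average distance to the k nearest neighbors).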

Density‐based: LOF approach

• For each point, compute the density of its local neighborhood

• Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors

• Outliers are the points with the largest LOF values

In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.
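A simplified density-ratio score in the spirit of LOF can be sketched as below. It is not the full LOF definition, which uses reachability distances: here local density is assumed to be the inverse of the average distance to the k nearest neighbors, and the score of a point is the average of its neighbors' densities divided by its own. The function names are hypothetical, and the sketch assumes no duplicate points (which would give a zero average distance).

```python
import math


def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of points[i], excluding itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def local_density(points, i, k):
    """Assumed density: inverse of the mean distance to the k nearest neighbors."""
    avg = sum(math.dist(points[i], points[j]) for j in knn_indices(points, i, k)) / k
    return 1.0 / avg


def simplified_lof(points, k):
    """Score near 1: density similar to the neighbors. Much larger than 1: likely outlier."""
    dens = [local_density(points, i, k) for i in range(len(points))]
    scores = []
    for i in range(len(points)):
        nbrs = knn_indices(points, i, k)
        scores.append(sum(dens[j] / dens[i] for j in nbrs) / k)
    return scores


data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = simplified_lof(data, k=2)
print(max(range(len(data)), key=lambda i: scores[i]))  # prints 4, the index of (5, 5)
```

Because the score compares a point only with its own neighborhood, a point just outside a tight cluster can score as high as a point far from a loose one, which is the behavior the p1/p2 note describes.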

Clustering‐Based

• Basic idea:

– Cluster the data into groups of different density

– Choose points in small cluster as candidate outliers

– Compute the distance between candidate points and non‐candidate clusters.

• If candidate points are far from all other non‐candidate points, they are outliers
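The steps above can be sketched as follows, assuming the clustering has already been done and each point carries a cluster label. The function name, the `min_size` cutoff for a "small" cluster, the `min_dist` cutoff for "far", and the use of cluster centroids to measure distance are all assumptions made for the example.

```python
import math
from collections import defaultdict


def cluster_based_outliers(points, labels, min_size, min_dist):
    """Flag points in small clusters that lie far from every large cluster."""
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)

    # Candidate outliers: members of clusters smaller than min_size
    small = {lab for lab, members in clusters.items() if len(members) < min_size}

    # Centroids of the remaining (non-candidate) clusters
    centroids = [
        tuple(sum(coord) / len(members) for coord in zip(*members))
        for lab, members in clusters.items()
        if lab not in small
    ]

    outliers = []
    for lab in small:
        for p in clusters[lab]:
            # A candidate is an outlier only if it is far from all large clusters
            if all(math.dist(p, c) > min_dist for c in centroids):
                outliers.append(p)
    return outliers


data = [(0, 0), (0, 1), (1, 0), (1, 1), (9, 9)]
labels = [0, 0, 0, 0, 1]
print(cluster_based_outliers(data, labels, min_size=2, min_dist=3.0))  # prints [(9, 9)]
```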

KDD Process

• Develop an understanding of the application domain

– Relevant prior knowledge, problem objectives, success criteria, current solution, inventory resources, constraints, terminology, cost and benefits

• Create target data set

– Collect initial data, describe, focus on a subset of variables, verify data quality

• Data cleaning and preprocessing

– Remove noise, outliers, missing fields, time sequence information, known trends, integrate data

• Data reduction and projection

– Feature subset selection, feature construction, discretizations, aggregations

KDD Process

• Selection of data mining task

– Classification, segmentation, deviation detection, link analysis

• Select data mining approach

• Data mining to extract patterns or models

• Interpretation and evaluation of patterns/models


Knowledge Discovery

Challenges of Data Mining

• Scalability

• Dimensionality

• Complex and Heterogeneous Data

• Data Quality

• Data Ownership and Distribution

• Privacy Preservation

• Streaming Data

• Data from Multi‐Sources

Similarities Between Data Miners and Doctors

Data Characteristics

Data Mining Techniques Medical Devices

Commercial and Research Tools

WEKA: http://www.cs.waikato.ac.nz/ml/weka/

SAS: http://www.sas.com/

Clementine: http://www.spss.com/spssbi/clementine/

Intelligent Miner: http://www-3.ibm.com/software/data/iminer/

Insightful Miner: http://www.insightful.com/products/product.asp?PID=26

Textbooks

Knowledge Management

• Knowledge management systems concern the sharing of knowledge that is already known to exist, either in libraries of documents, in the heads of employees, or in other known sources.

• Knowledge management (KM) is the process of creating value from intellectual capital and sharing that knowledge with employees, managers, suppliers,


Knowledge Management (Continued)

• Knowledge management is a process that is

supported by the five components of an

information system.

– Its emphasis is on people, their knowledge, and

effective means for sharing that knowledge with

others.

• The benefits of KM concern the application of

knowledge to enable employees and others to

leverage organizational knowledge to work

smarter.

• KM preserves organizational memory by

capturing and storing the lessons learned and

best practices of key employees.

Content Management Systems

• Content management systems are information

systems that track organizational documents, Web

pages, graphics, and related materials.

• Such systems differ from operational document

systems in that they do not directly support business

operations.

• KM content management systems are concerned with

the creation, management, and delivery of documents

that exist for the purpose of imparting knowledge.

Content Management Systems (Continued)

• Typical users of content management systems are

companies that sell complicated products and want to

share their knowledge of those products with

employees and customers.

• The basic functions of content management systems are

the same as for report management systems: author,

manage, and deliver.

• The only requirement that content managers place on

document authoring is that the document has been

created in a standardized format.

Content Management Problems

• Documents may refer to one another or multiple

documents may refer to the same product or

procedure.

– When one of them changes, others must change as

well.

– Some content management systems keep semantic

linkages among documents so that content

dependencies can be known and used to maintain

document consistency.

• Document contents are perishable.

– Documents become obsolete and need to be altered, removed,

or replaced.

• Multinational companies have to ensure document

language translations.

Figure 9‐23 Document Management at

Microsoft.com (as of December 2003)

Figure 9‐24 Reporting Services: United States


Figure 9‐25 Reporting Services: China

Source: Used with permission of Tom Rizzo of Microsoft Corporation.

Content Delivery

• Almost all users of content management systems pull

the contents.

• Users cannot pull content if they do not know it

exists.

– The content must be arranged and indexed, and a facility for searching the content devised.

• Documents that reside behind a corporate firewall,

however, are not publicly accessible and will not be

reachable by Google or other search engines.

– Organizations must index their own proprietary documents

and provide their own search capability for them.
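One common way to index documents and provide a search capability is an inverted index, which maps each word to the documents that contain it. The sketch below is a minimal illustration only: the function names, document IDs, and sample texts are hypothetical, and a real system would add stemming, ranking, and access control.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical internal documents behind the firewall
docs = {
    "kb-1": "Printer setup guide",
    "kb-2": "Printer troubleshooting",
    "hr-7": "Expense policy",
}
index = build_index(docs)
print(sorted(search(index, "printer")))        # prints ['kb-1', 'kb-2']
print(sorted(search(index, "printer guide")))  # prints ['kb-1']
```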

KM Systems to Facilitate the Sharing of Human

Knowledge

• Nothing is more frustrating for a manager to

contemplate than the situation in which one employee

struggles with a problem that another employee knows

how to solve easily.

• KM systems are concerned not only with the sharing of content, but also with the sharing of knowledge among humans.

– How can one person share her knowledge with another?

– How can one person learn of another person’s great idea?

KM Systems to Facilitate the Sharing of Human Knowledge (Continued)

• Three forms of technology are used for

knowledge‐sharing among humans:

– Portals, discussion groups, and email

– Collaboration systems

– Expert systems

Portals

– Employees can share ideas by posting

knowledge on a Web portal whereby

managers and employees can pull the

knowledge from the portal.

Figure 9‐26 Technology Support of Sharing

Human Knowledge

KM Systems to Facilitate the Sharing of Human Knowledge (Continued)

Discussion Groups

– Discussion groups allow employees or customers to

post questions and queries seeking solutions to

problems they have.

– Oracle, IBM, PeopleSoft, and other vendors support

product discussion groups where users can post

questions and where employees, vendors, and other

users can answer them.

– Later, the organization can edit and summarize the

questions from such discussion groups into


Discussion groups (continued)

– Basic email can also be used for knowledge‐sharing,

especially if email lists have been constructed with

KM in mind.

– Two human factors inhibit knowledge‐sharing.

• Employees can be reluctant to exhibit their

ignorance.

• Competition exists between employees.

– A KM application may be ill‐suited to a competitive

group.

• The company may be able to restructure rewards

and incentives to foster sharing of ideas among

employees.

Collaboration Systems

– Collaboration systems are information systems that enable people to work together more effectively.

– The Internet can be used as a broadcast medium for speeches, panel discussions, and other types of meetings.

– Web broadcasts, because they are digital, can be readily saved and replayed at the viewer’s convenience.

– Web broadcasts can also be made interactive by combining them with discussion group bulletin boards that are live during the broadcast.

– Video conferencing is another popular form of IT‐supported meetings.

• Video‐conferencing equipment is expensive and normally is

located in selected sites in the organization.

Figure 9‐27 Net Meeting Graphic

Expert Systems

– Expert systems are created by interviewing experts in a given business domain and codifying the rules stated by those experts.

– Many expert systems were created in the late 1980s and 1990s, and some of them have been successful.

– Expert systems suffer from three major

disadvantages.

• They are difficult and expensive to develop.

• They are difficult to maintain.

• They were unable to live up to the high