Dr. Hui Xiong
Business Intelligence and Data Mining
Rutgers University
Learning Objectives
• Understand the need for business intelligence systems.
• Know the characteristics of reporting systems.
• Know the purpose and role of data warehouses and
data marts.
• Understand fundamental data‐mining techniques.
• Know the purpose, features, and functions of
knowledge management systems.
The Need for Business Intelligence Systems
• According to a study done at the University of
California at Berkeley, a total of 403 petabytes of new
data were created.
• 403 petabytes is roughly the amount of all printed
material ever written.
– The printed collection of the Library of Congress is
0.01 petabytes.
– 400 petabytes equals 40,000 copies of the print
collection of the Library of Congress.
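The comparison above is simple division. A minimal sketch, using only the two figures quoted on the slide (the variable names are illustrative):

```python
# Figures quoted on the slide, in petabytes.
LIBRARY_OF_CONGRESS_PB = 0.01
NEW_DATA_PB = 400  # the slide rounds 403 down to 400

# Number of Library of Congress print collections that fit in the new data.
copies = round(NEW_DATA_PB / LIBRARY_OF_CONGRESS_PB)
print(f"{copies:,} copies of the print collection")
```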
The Need for Business Intelligence Systems (Continued)
• The generation of all these data has much to
do with Moore’s Law.
• The capacity of storage devices increases as
their costs decrease.
• Today, storage capacity is nearly unlimited.
• We are drowning in data and starving for
information.
Figure 9‐1 How big is an Exabyte?
Source: Used with permission of Peter Lyman and Hal R. Varian, University of California at Berkeley.
Figure 9‐2 Hard‐Disk Storage Capacity
Business Intelligence Tools
• Tools for searching business data in an attempt
to find patterns are called business intelligence
(BI) tools.
• Reporting tools are programs that read data
from a variety of sources, process that data,
produce formatted reports, and deliver those
reports to the users who need them.
Business Intelligence Tools
• The processing of data is simple:
– Data are sorted and grouped.
– Simple totals and averages are calculated.
• Reporting tools are used primarily for assessment
– They are used to address questions like:
• What has happened in the past?
• What is the current situation?
• How does the current situation compare to
the past?
Business Intelligence Tools (Continued)
• Data‐mining tools process data using statistical
techniques, many of which are sophisticated and
mathematically complex.
• Data mining involves searching for patterns and
relationships among data.
• In most cases, data‐mining tools are used to make
predictions.
• For example, we can use one form of analysis to compute
the probability that a customer will default on a loan.
• Another way to distinguish reporting tools from
data‐mining tools:
– Reporting tools use simple operations like sorting, grouping,
and summing.
– Data‐mining tools use sophisticated techniques.
Business Intelligence Systems
• An information system is a collection of
hardware, software, data, procedures, and
people.
• The purpose of a business intelligence (BI)
system is to provide the right information, to
the right user, at the right time.
• BI systems help users accomplish their goals
and objectives by producing insights that lead
to actions.
Business Intelligence Systems (Continued)
• A reporting tool can generate a report that shows a
customer has canceled an important order.
• A reporting system, however, alerts that customer’s
salesperson with this unwanted news, and does so in
time for the salesperson to try to alter the customer’s
decision.
• A data‐mining tool can create an equation that
computes the probability that a customer will default
on a loan.
• A data‐mining system uses that equation to enable
banking personnel to assess new loan applications.
Reporting Systems
• The purpose of a reporting system is to create
meaningful information from disparate data
sources and to deliver that information to the
proper user on a timely basis.
• Reporting systems generate information from
data as a result of four operations:
– Filtering data
– Sorting data
– Grouping data
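The operations listed here, together with the simple totals and averages mentioned on the earlier Business Intelligence Tools slide, can be sketched in a few lines. This is only an illustration: the trade records, field names, and the `build_report` helper are invented for the example and are not taken from the figures.

```python
from collections import defaultdict

# Hypothetical trade records; field names and values are invented for illustration.
trades = [
    {"symbol": "AAA", "side": "buy", "quantity": 100, "price": 10.0},
    {"symbol": "BBB", "side": "sell", "quantity": 50, "price": 20.0},
    {"symbol": "AAA", "side": "buy", "quantity": 200, "price": 11.0},
    {"symbol": "BBB", "side": "buy", "quantity": 10, "price": 19.0},
]

def build_report(rows):
    # Filtering: keep only buy trades.
    buys = [r for r in rows if r["side"] == "buy"]
    # Sorting: order by symbol, then price.
    buys.sort(key=lambda r: (r["symbol"], r["price"]))
    # Grouping: collect rows per symbol.
    groups = defaultdict(list)
    for r in buys:
        groups[r["symbol"]].append(r)
    # Simple calculations: total quantity and average price per group.
    report = {}
    for symbol, members in groups.items():
        total_qty = sum(m["quantity"] for m in members)
        avg_price = sum(m["price"] for m in members) / len(members)
        report[symbol] = {"total_quantity": total_qty, "average_price": avg_price}
    return report

report = build_report(trades)
for symbol, summary in report.items():
    print(symbol, summary)
```

Nothing here is sophisticated, which is the point: a reporting tool reshapes and summarizes data rather than modeling it.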
Figure 9‐3 Trade Data for NDX.X (NASDAQ 100)
Figure 9‐4 Report Based on Trade Data in Figure 9‐3
Components of Reporting Systems
• A reporting system maintains a database of
reporting metadata.
• The metadata describes the reports, users,
groups, roles, events, and other entities
involved in the reporting activity.
• The reporting system uses the metadata to
prepare and deliver reports to the proper users
on a timely basis.
Figure 9‐5 Components of a Reporting System
Figure 9‐6 Summary of Report Characteristics
Report Type
• In terms of report type, reports can be static or
dynamic.
• Static reports are prepared once from the
underlying data, and they do not change.
– Example: a report of past year’s sales
• Dynamic reports: the reporting system reads
the most current data and generates the report
using that fresh data.
– Examples are: a report on sales today and a
Report Type (Continued)
• Query reports are prepared in response to data
entered by users.
• Online analytical processing (OLAP) reports allow
the user to dynamically change the report
grouping structures.
Report Media
• Reports are delivered via many different report
media or channels.
• Some reports are printed on paper, and others
are created in a format like PDF whereby they
can be printed or viewed electronically.
• Other reports are delivered to computer screens.
• Companies sometimes place reports on internal
corporate Web sites for employees to access.
Report Media (Continued)
• Another report medium is a digital dashboard,
which is an electronic display customized for a
particular user.
– Vendors like Yahoo! and MSN provide common
examples.
– Users of these services can define the content they want:
say, a local weather forecast, a list of stock prices, or a
list of news sources.
– The vendor constructs the display customized for
each user.
Report Media (Continued)
• Other dashboards are particular to an organization.
– The organization might have a dashboard that shows up‐to‐the‐
minute production and sales activities.
• Alerts are another form of report.
– Users can declare that they wish to receive notifications of
events, say, via email or on their cell phones.
• Reports can be published via a Web service.
– The Web service produces the report in response to requests
from the service‐consuming application.
Figure 9‐7 Digital Dashboard Example
Report Mode
• The report mode can be either push report or
pull report.
• Organizations send a push report to users
according to a preset schedule.
– Users receive the report without any activity
on their part.
• Users must request a pull report.
– To obtain a pull report, a user goes to a Web
portal or digital dashboard and clicks a link
or button to cause the reporting system to
Functions of Reporting Systems
• Three functions of reporting systems are:
– Authoring
– Management
– Delivery
• Report authoring involves connecting to data
sources, creating the reporting structure, and
formatting the report.
Report Management
• The purpose of report management is to define who
receives what reports, when, and by what means.
• Most report‐management systems allow the report
administrator to define user accounts and user groups
and to assign particular users to particular groups.
• Reports that have been created using the report‐
authoring system are assigned groups and users.
Report Management (Continued)
• Assigning reports to groups saves the
administrator work.
– When a report is created, changed, or removed, the
administrator need only change the report
assignments to the group.
– All of the users in the group will inherit the changes.
• Metadata also indicates what channel is to be used and
whether the report is to be pushed or pulled.
– If the report is to be pushed, the administrator
declares whether the report is to be generated on a
regular schedule or as an alert.
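One way to picture this metadata is as a small data structure. The sketch below is an assumption about how such records might look, not the schema of any real reporting product; the class names, fields, and user names are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ReportAssignment:
    report: str
    channel: str       # e.g. "email" or "web portal"
    mode: str          # "push" or "pull"
    trigger: str = ""  # for push reports: "schedule" or "alert"

@dataclass
class Group:
    name: str
    members: set = field(default_factory=set)
    assignments: list = field(default_factory=list)

def reports_for(user, groups):
    """Return the reports a user inherits through group membership."""
    return sorted(
        a.report for g in groups if user in g.members for a in g.assignments
    )

sales = Group("sales", members={"ana", "raj"})
sales.assignments.append(ReportAssignment("daily_sales", "email", "push", "schedule"))
# Assigning one more report to the group reaches every member at once.
sales.assignments.append(ReportAssignment("cancelled_orders", "email", "push", "alert"))
print(reports_for("ana", [sales]))
```

Because users receive reports only through their groups, the administrator edits one assignment list instead of one list per user.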
Report Delivery
• The report‐delivery function of a reporting system
pushes reports or allows them to be pulled according
to report‐management metadata.
• Reports can be delivered via an email server, Web site,
XML Web services, or by other program‐specific
means.
• The report‐delivery system uses the operating system
and other program security components to ensure that
only authorized users receive authorized reports.
Report Delivery (Continued)
• The report‐delivery system also ensures that
push reports are produced at appropriate
times.
• For query reports, the report‐delivery system
serves as an intermediary between the user and
the report generator.
– It receives user query data, such as item
numbers in an inventory query, passes the
query data to the report generator, receives
the resulting report, and delivers the report
to the user.
Online Analytical Processing
• Online analytical processing (OLAP) provides the
ability to sum, count, average, and perform other
simple arithmetic operations on groups of data.
• The remarkable characteristic of OLAP reports is that
they are dynamic.
• The viewer of the report can change the report’s
Online Analytical Processing
• An OLAP report has measures and dimensions.
• A measure is the data item of interest.
– It is the item that is to be summed or averaged or
otherwise processed in the OLAP report.
• A dimension is a characteristic of a measure.
– Purchase date, customer type, customer location,
and sales region are all examples of dimensions.
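The relationship between a measure and its dimensions can be sketched as a grouped sum. The sales records and the `summarize` helper below are hypothetical; a real OLAP server would precompute such totals rather than scan rows on demand.

```python
from collections import defaultdict

# Hypothetical sales facts: the measure is "amount"; the other keys are dimensions.
sales = [
    {"region": "East", "customer_type": "Retail", "amount": 100},
    {"region": "East", "customer_type": "Wholesale", "amount": 250},
    {"region": "West", "customer_type": "Retail", "amount": 80},
    {"region": "West", "customer_type": "Retail", "amount": 120},
]

def summarize(rows, dimensions, measure="amount"):
    """Sum the measure for each combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

by_region = summarize(sales, ["region"])
finer = summarize(sales, ["region", "customer_type"])
print(by_region)
print(finer)
```

Changing the list of dimensions changes the grouping structure of the report without touching the underlying data, which is what makes an OLAP report dynamic.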
Online Analytical Processing (Continued)
• With an OLAP report, it is possible to drill down into
the data.
– This term means to further divide the data into more detail.
• Special‐purpose products called OLAP servers have
been developed to perform OLAP analysis.
• An OLAP server reads data from an operational
database, performs preliminary calculations, and stores
the results of those operations in an OLAP database.
Figure 9‐13 OLAP Family and Store Location by Store Type
Figure 9‐14 Role of OLAP Server and OLAP Database
Data Warehouses and Data Marts
• Basic reports and simple OLAP analyses can be made
directly from operational data.
• For the most part, such reports display the current
state of the business; and if there are a few missing
values or small inconsistencies with the data, no one is
too concerned.
• Operational data are unsuited to more sophisticated
analyses, particularly, data‐mining analyses that
require high‐quality input for accurate and useful
results.
Data Warehouses and Data Marts (Continued)
• Many organizations choose to extract operational data
into facilities called data warehouses and data marts,
both of which are facilities that prepare, store, and
manage data specifically for data mining and other
analyses.
• Programs read operational data and extract, clean, and
prepare that data for BI processing.
• The prepared data are stored in a data‐warehouse
database using data‐warehouse DBMS, which can be
Data Warehouses and Data Marts
• Data warehouses include data that are purchased from
outside sources.
• Metadata concerning the data, its source, its format, its
assumptions and constraints, and other facts about the
data is kept in a data‐warehouse metadata database.
• The data‐warehouse DBMS extracts and provides data to
business intelligence tools such as data‐mining
programs.
Figure 9‐15 Components of a Data Warehouse
Figure 9‐16 Consumer Data Available for Purchase from Data Vendors
Problems with Operational Data (Continued)
• Inconsistent data are particularly common for data that
have been gathered over time.
– When an area code changes, for example, the phone number
for a given customer before the change will not match the
customer’s number after the change.
• Some data inconsistencies occur from the nature of the
business activity.
• Nonintegrated data can cause problems when data
comes from different management information
systems.
Figure 9‐17 Problems of Using Transaction Data for Analysis and Data Mining
Data Warehouses Versus Data Marts
• The data warehouse takes data from the data
manufacturers (operational systems and purchased
data), cleans and processes the data, and locates the
data on the shelves, so to speak, of the data warehouse.
• A data mart is a data collection, smaller than the data
warehouse, that addresses a particular component or
Data Warehouse Versus Data Marts (Continued)
• The data warehouse is like the distributor in the supply
chain and the data mart is like the retail store in the
supply chain.
• Users in the data mart obtain data that pertain to a
particular business function from the data warehouse.
• It is expensive to create, staff, and operate data
warehouses and data marts.
Figure 9‐18 Data Mart Examples
Data Mining and Business Intelligence
Knowledge Discovery in Data
Dr. Hui Xiong
Rutgers University
Why Mine Data? Commercial Viewpoint
• Lots of data is being collected
and warehoused
– Web data, e‐commerce
– purchases at department/grocery stores
– Bank/Credit Card transactions
• Computers have become cheaper and more
powerful
• Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in
Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
• Data collected and stored at
enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data
• Traditional techniques infeasible for
raw data
• Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation
Mining Large Data Sets ‐ Motivation
• There is often information “hidden” in the data that is
not readily evident
• Human analysts may take weeks to discover useful
information
• Much of the data is never analyzed at all
Figure: The Data Gap. Total new disk (TB) since 1995 and number of analysts, plotted for 1995 to 1999 (vertical axis from 0 to 4,000,000).
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
Scale of Data
Organization        Scale of Data
Walmart             ~20 million transactions/day
Google              ~8.2 billion Web pages
Yahoo               ~10 GB Web data/hr
NASA satellites     ~1.2 TB/day
NCBI GenBank        ~22 million genetic sequences
France Telecom      29.2 TB
UK Land Registry    18.3 TB
AT&T Corp           26.2 TB
“The great strength of computers is that
they can reliably manipulate vast amounts
of data very quickly. Their great weakness is
that they don’t have a clue as to what any
Why Do We Need Data Mining?
• Leverage organization’s data assets
– Only a small portion (typically 5%‐10%) of the
collected data is ever analyzed
– Data that may never be analyzed continues to be
collected, at a great expense, out of fear that
something which may prove important in the
future is missing.
– Growth rates of data preclude traditional
“manually intensive” approach
Why Do We Need Data Mining?
• As databases grow, the ability to support the decision
support process using traditional query languages
becomes infeasible
– Many queries of interest are difficult to state in a
query language (Query formulation problem)
– “find all cases of fraud”
– “find all individuals likely to buy a FORD
expedition”
– “find all documents that are similar to this
customer’s problem”
What is Data Mining?
• Many Definitions
– Non‐trivial extraction of implicit, previously unknown and
potentially useful information from data
– Exploration & analysis, by automatic or semi‐automatic
means, of large quantities of data in order to discover
meaningful patterns
What is (not) Data Mining?
• What is not Data Mining?
– Look up phone number in phone directory
– Check the dictionary for the meaning of a word
• What is Data Mining?
– Certain names are more prevalent in certain US
locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by
search engine according to their context (e.g. Amazon
Data Mining: Confluence of Multiple Disciplines
20x20 ~ 2^400 ≈ 10^120 patterns
Data Mining Applications
• Market analysis
• Risk analysis and management
• Fraud detection and detection of unusual
patterns (outliers)
• Text mining (news group, email, documents)
and Web mining
• Stream data mining
• DNA and bio‐data analysis
Fraud Detection & Mining Unusual Patterns
• Approaches: Clustering & model construction for frauds,
outlier analysis
• Applications: Health care, retail, credit card service, …
– Auto insurance: ring of collisions
– Money laundering: suspicious monetary transactions
– Medical insurance
• Professional patients, ring of doctors, and ring of references
• Unnecessary or correlated screening tests
– Telecommunications: phone‐call fraud
• Phone call model: destination of the call, duration, time of day or
week. Analyze patterns that deviate from an expected norm
– Retail industry
• Analysts estimate that 38% of retail shrink is due to dishonest
employees
– Anti‐terrorism
Data Mining Tasks …
Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
Clustering
• Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups
Figure: intra-cluster distances are minimized; inter-cluster distances are maximized
Applications of Cluster Analysis
• Understanding
– Group related documents for browsing
– Group genes and proteins that have similar functionality
– Group stocks with similar price fluctuations
• Summarization
– Reduce the size of large data sets
Discovered Clusters (Industry Group):
1 (Technology1-DOWN): Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Use of K‐means to partition Sea
Surface Temperature (SST) and Net
Primary Production (NPP) into
clusters that reflect the Northern and
Southern Hemispheres.
Clustering: Application 1
• Market Segmentation:
– Goal: Subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market
target to be reached with a distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their
geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those from
different clusters.
Clustering: Application 2
• Document Clustering:
– Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
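A similarity measure built from term frequencies can be sketched as follows. Cosine similarity is one common choice and is an assumption here, since the slide does not name a specific measure; the three sample documents are invented.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count lowercase whitespace-separated terms in a document."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents: two about markets, one about sports.
docs = {
    "d1": "stock market prices fall",
    "d2": "market prices rise as stock rally",
    "d3": "team wins football match",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}
print(cosine_similarity(tf["d1"], tf["d2"]))
print(cosine_similarity(tf["d1"], tf["d3"]))
```

Documents that share many terms score close to 1 and documents with no terms in common score 0, so a clustering algorithm can use these scores to group d1 with d2 and keep d3 apart.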
What is not Cluster Analysis?
• Simple segmentation
– Dividing students into different registration groups
alphabetically, by last name
• Results of a query
– Groupings are a result of an external specification
– Clustering is a grouping of objects based on the data
• Supervised classification
– Have class label information
• Association Analysis
– Local vs. global connections
Notion of a Cluster can be Ambiguous
How many clusters? Six Clusters
Four Clusters Two Clusters
Types of Clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical
and partitional sets of clusters
• Partitional Clustering
– A division of data objects into non‐overlapping
subsets (clusters) such that each data object is in
exactly one subset
• Hierarchical clustering
– A set of nested clusters organized as a hierarchical
tree
Partitional Clustering
Original Points A Partitional Clustering
Hierarchical Clustering
Figure panels (points p1 to p4): Traditional Hierarchical Clustering; Traditional Dendrogram; Non-traditional Hierarchical Clustering; Non-traditional Dendrogram
Other Distinctions Between Sets of Clusters
• Exclusive versus non‐exclusive
– In non‐exclusive clusterings, points may belong to multiple
clusters.
– Can represent multiple classes or ‘border’ points
• Fuzzy versus non‐fuzzy
– In fuzzy clustering, a point belongs to every cluster with some
weight between 0 and 1
– Weights must sum to 1
– Probabilistic clustering has similar characteristics
• Partial versus complete
– In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
– Clusters of widely different sizes, shapes, and densities
Types of Clusters
• Well‐separated clusters
• Center‐based clusters
• Contiguous clusters
• Density‐based clusters
• Property or Conceptual
• Described by an Objective Function
Types of Clusters: Well‐Separated
• Well‐Separated Clusters:
– A cluster is a set of points such that any point in a
cluster is closer (or more similar) to every other point
in the cluster than to any point not in the cluster.
Types of Clusters: Center‐Based
• Center‐based
– A cluster is a set of objects such that an object in a
cluster is closer (more similar) to the “center” of a
cluster, than to the center of any other cluster
– The center of a cluster is often a centroid, the
average of all the points in the cluster, or a medoid,
the most “representative” point of a cluster
4 center-based clusters
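The centroid idea behind center-based clusters is what K‐means iterates on. Below is a minimal sketch of basic K‐means, assuming squared Euclidean distance and hand-picked starting centers; the sample points are invented, and practical implementations choose initial centers and stopping rules more carefully.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Average of the points, coordinate by coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, centers, iterations=10):
    """Basic K-means: assign to nearest center, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [centroid(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

Each point ends up in the cluster whose center is closest, which is exactly the center-based definition given above.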
Types of Clusters: Contiguity‐Based
• Contiguous Cluster (Nearest neighbor or
Transitive)
– A cluster is a set of points such that a point in a
cluster is closer (or more similar) to one or more other
points in the cluster than to any point not in the
cluster.
8 contiguous clusters
Types of Clusters: Density‐Based
• Density‐based
– A cluster is a dense region of points, which is
separated by low‐density regions, from other
regions of high density.
– Used when the clusters are irregular or intertwined,
and when noise and outliers are present.
6 density-based clusters
Types of Clusters: Conceptual Clusters
• Shared Property or Conceptual Clusters
– Finds clusters that share some common property
or represent a particular concept.
2 Overlapping Circles
Characteristics of the Input Data Are Important
• Type of proximity or density measure
– This is a derived measure, but central to clustering
• Sparseness
– Dictates type of similarity
– Adds to efficiency
• Attribute type
– Dictates type of similarity
• Type of Data
– Dictates type of similarity
– Other characteristics, e.g., autocorrelation
• Dimensionality
• Noise and Outliers
• Type of Distribution
Data Mining Tasks …
Figure: sample data table (Tid, Refund, Marital Status, Taxable Income, Cheat), repeated from the earlier Data Mining Tasks slide
Association Rule Discovery: Definition
• Given a set of records each of which contains
some number of items from a given
collection
– Produce dependency rules which will predict occurrence of
an item based on occurrences of other items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
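How strongly the five records back each discovered rule can be checked directly. The sketch below uses support and confidence, the usual rule-quality measures; they are an assumption here, since this slide does not define them.

```python
# The five transactions from the slide.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent) / support(antecedent)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support", support(lhs | rhs), "confidence", confidence(lhs, rhs))
```

Milk appears in four transactions and three of those also contain Coke, so {Milk} --> {Coke} holds in three of five records overall and in three of the four records that contain Milk.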
Association Analysis: Applications
• Market‐basket analysis
– Rules are used for sales promotion, shelf management, and
inventory management
• Telecommunication alarm diagnosis
– Rules are used to find combination of alarms that occur
together frequently in the same time period
• Medical Informatics
– Rules are used to find combination of patient symptoms
and complaints associated with certain diseases
Application
Deployment
Challenge
Data Mining Tasks …
Figure: sample data table (Tid, Refund, Marital Status, Taxable Income, Cheat), repeated from the earlier Data Mining Tasks slide
Predictive Modeling: Classification
• Find a model for class attribute as a function of
the values of other attributes
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …
Model for predicting credit worthiness
Figure: decision tree with splits on Employed (No / Yes), Education (Graduate / { High school, Undergrad }), and Number of years (> 3 yr / < 3 yr; > 7 yrs / < 7 yrs), with leaves Yes / No
Classification Example
Training Set:
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …
Test Set:
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …
Figure: Training Set, Learn Classifier, Model; the Model is applied to the Test Set
Examples of Classification Task
• Predicting tumor cells as benign or
malignant
• Classifying credit card transactions
as legitimate or fraudulent
• Classifying secondary structures of
protein as alpha‐helix, beta‐sheet, or
random coil
• Categorizing news stories as finance,
weather, entertainment, sports, etc
• Identifying intruders in the cyberspace
Classification: Application 1
• Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
• Use credit card transactions and the information on its
account‐holder as attributes.
– When does a customer buy, what does he buy, how
often he pays on time, etc
• Label past transactions as fraud or fair transactions. This
forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.
Classification: Application 2
• Churn prediction for telephone customers
– Goal: To predict whether a customer is likely to be
lost to a competitor.
– Approach:
• Use detailed record of transactions with each of the past
and present customers, to find attributes.
– How often the customer calls, where he calls, what
time‐of‐the day he calls most, his financial status,
marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application 3
• Sky Survey Cataloging
– Goal: To predict class (star or galaxy) of sky objects,
especially visually faint ones, based on the telescopic survey
images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
• Segment the image.
• Measure image attributes (features), 40 of them per
object.
• Model the class based on these features.
• Success Story: Could find 16 new high red‐shift quasars,
some of the farthest objects that are difficult to find!
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies
Class:
• Stages of Formation (Early, Intermediate, Late)
Attributes:
• Image features,
• Characteristics of light waves received, etc.
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Classification Techniques
• Base Classifiers
– Decision Tree based Methods
– Rule‐based Methods
– Nearest‐neighbor
– Neural Networks
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines
• Ensemble Classifiers
Example of a Decision Tree
Training Data:
ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes
Model: Decision Tree (splitting attributes: Home Owner, MarSt, Income)
Home Owner
  Yes: NO
  No: MarSt
    Married: NO
    Single, Divorced: Income
      < 80K: NO
      > 80K: YES
Another Example of Decision Tree
Training data: the same ten records (ID, Home Owner, Marital Status, Annual Income, Defaulted Borrower) as in the previous example.
MarSt
  Married: NO
  Single, Divorced: Home Owner
    Yes: NO
    No: Income
      < 80K: NO
      > 80K: YES
There could be more than one tree that fits the same data!
Decision Tree Classification Task
Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Figure: Learn Model from the training set to obtain a Decision Tree; Apply Model to the test set
Apply Model to Test Data
Test Data:
Home Owner  Marital Status  Annual Income  Defaulted Borrower
No          Married         80K            ?
Start from the root of tree.
Figure (repeated over several slides): the decision tree with nodes Home Owner, MarSt, and Income is followed one branch at a time for the test record
Assign Defaulted to “No”
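The walk from root to leaf can be written as nested tests. The sketch below encodes the tree as it reads from the flattened figure (Home Owner, then MarSt, then Income), which is an assumption about the original drawing; the field names are hypothetical.

```python
def classify(record):
    """Walk the tree from the root and return the predicted Defaulted Borrower label."""
    if record["home_owner"] == "Yes":
        return "No"
    if record["marital_status"] == "Married":
        return "No"
    # Single or Divorced: split on annual income (in thousands).
    # The slide labels the branches "< 80K" and "> 80K"; sending exactly 80K
    # to the "No" branch is an assumption.
    return "Yes" if record["annual_income_k"] > 80 else "No"

test_record = {"home_owner": "No", "marital_status": "Married", "annual_income_k": 80}
print(classify(test_record))  # prints "No", matching the slide
```

The test record is not a home owner and is married, so the walk stops at the MarSt node and never consults income.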
Decision Tree Induction
• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
Data Mining Tasks …
Figure: sample data table (Tid, Refund, Marital Status, Taxable Income, Cheat), repeated from the earlier Data Mining Tasks slide
Deviation/Anomaly Detection
• Detect significant deviations from normal behavior
• Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection
Anomaly Detection
• Challenges
– How many outliers are there in the data?
– Method is unsupervised
• Validation can be quite challenging (just like for clustering)
– Finding a needle in a haystack
• Working assumption
– There are considerably more “normal” observations
than “abnormal” observations (outliers/anomalies)
in the data
Anomaly Detection Schemes
• General Steps
– Build a profile of the “normal” behavior
• Profile can be patterns or summary statistics for
the overall population
– Use the “normal” profile to detect anomalies
• Anomalies are observations whose characteristics
differ significantly from the normal profile
• Types of anomaly detection schemes
– Graphical & Statistical‐based
– Distance‐based
– Model‐based
Graphical Approaches
• Boxplot (1‐D), Scatter plot (2‐D), Spin plot (3‐D)
• Limitations
– Time consuming
– Subjective
Statistical Approaches
• Assume a parametric model describing the
distribution of the data (e.g., normal
distribution)
• Apply a statistical test that depends on
– Data distribution
– Parameter of distribution (e.g., mean, variance)
– Number of expected outliers (confidence limit)
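As one possible illustration of a statistical test under a normal-distribution assumption, the sketch below flags values whose z-score exceeds a threshold. The function name, the threshold, and the sample amounts are made up for the example; the slides do not prescribe a particular test.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [x for x in values if abs(x - mu) / sigma > threshold]


# Hypothetical transaction amounts with one unusually large value.
amounts = [52, 48, 50, 51, 49, 47, 53, 50, 49, 51, 250]
print(zscore_outliers(amounts, threshold=2.5))  # prints [250]
```

The threshold plays the role of the confidence limit: a lower value flags more observations. Because the mean and standard deviation are themselves estimated from data that includes the outlier, a very small sample can mask extreme values.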
Limitations of Statistical Approaches
• Most of the tests are for a single attribute
• In many cases, data distribution may not be known
• For high dimensional data, it may be
Distance‐based Approaches
• Data is represented as a vector of features
• Three major approaches
– Nearest‐neighbor based
– Density based
– Clustering based
Nearest‐Neighbor Based Approach
• Approach:
– Compute the distance between every pair
of data points
– There are various ways to define outliers:
• Data points for which there are fewer than p
neighboring points within a distance D
• The top n data points whose distance to the kth
nearest neighbor is greatest
• The top n data points whose average distance
to the k nearest neighbors is greatest
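The three outlier definitions above can be sketched directly from the pairwise distance matrix. This is a minimal illustration: the function names and sample points are made up, Euclidean distance is assumed, and "within a distance D" is read as "at distance D or less".

```python
import math


def pairwise_distances(points):
    """Distance between every pair of data points (Euclidean distance assumed)."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def fewer_than_p_neighbors(points, p, D):
    """Indices of points with fewer than p neighboring points within distance D."""
    dist = pairwise_distances(points)
    return [i for i, row in enumerate(dist)
            if sum(1 for j, d in enumerate(row) if j != i and d <= D) < p]


def top_n_by_kth_neighbor(points, n, k):
    """Indices of the n points whose distance to the kth nearest neighbor is greatest."""
    dist = pairwise_distances(points)
    scores = []
    for i, row in enumerate(dist):
        others = sorted(d for j, d in enumerate(row) if j != i)
        scores.append((others[k - 1], i))
    return [i for _, i in sorted(scores, reverse=True)[:n]]


def top_n_by_avg_knn(points, n, k):
    """Indices of the n points whose average distance to the k nearest neighbors is greatest."""
    dist = pairwise_distances(points)
    scores = []
    for i, row in enumerate(dist):
        others = sorted(d for j, d in enumerate(row) if j != i)
        scores.append((sum(others[:k]) / k, i))
    return [i for _, i in sorted(scores, reverse=True)[:n]]


# Four points forming a unit square plus one distant point at index 4.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
print(fewer_than_p_neighbors(points, p=2, D=1.5))
print(top_n_by_kth_neighbor(points, n=1, k=2))
print(top_n_by_avg_knn(points, n=1, k=2))
```

All three definitions single out the distant point here. Computing every pairwise distance costs time quadratic in the number of points, which is why large data sets usually need an index or an approximation.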
Density‐based: LOF approach
• For each point, compute the density of its local
neighborhood
• Compute the local outlier factor (LOF) of a sample p as the
average of the ratios of the density of sample p and the
density of its nearest neighbors
• Outliers are points with the largest LOF values
In the NN approach, p2 is not considered an outlier, while the LOF approach finds both
p1 and p2 as outliers
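The idea can be sketched with a simplified score. The sketch assumes that local density is the inverse of the mean distance to the k nearest neighbors, and takes the score as the average ratio of each neighbor's density to the point's own density, so that larger values indicate outliers. The published LOF definition uses reachability distances instead; this version only illustrates the slide's description. All names and the sample data are made up.

```python
import math


def knn(points, i, k):
    """(distance, index) pairs for the k nearest neighbors of points[i]."""
    return sorted((math.dist(points[i], q), j)
                  for j, q in enumerate(points) if j != i)[:k]


def local_density(points, i, k):
    # Assumption: density = 1 / (mean distance to the k nearest neighbors).
    # Duplicate points would give a zero mean distance; the sketch does not handle that.
    return 1.0 / (sum(d for d, _ in knn(points, i, k)) / k)


def simplified_lof(points, k):
    """Average ratio of neighbor density to own density; larger means more outlying."""
    density = [local_density(points, i, k) for i in range(len(points))]
    scores = []
    for i in range(len(points)):
        neighbors = [j for _, j in knn(points, i, k)]
        scores.append(sum(density[j] / density[i] for j in neighbors) / k)
    return scores


# A dense cluster, a sparse cluster, a far-away point p1, and a point p2
# that sits just outside the dense cluster.
dense = [(0.1 * i, 0.1 * j) for i in range(3) for j in range(3)]
sparse = [(10 + 2 * i, 10 + 2 * j) for i in range(3) for j in range(3)]
p1, p2 = (5.0, 25.0), (0.1, 0.8)
points = dense + sparse + [p1, p2]

scores = simplified_lof(points, k=3)
top2 = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:2]
print([points[i] for i in top2])
```

In this data p2's distance to its third-nearest neighbor is smaller than that of any point in the sparse cluster, so a ranking by kth-nearest-neighbor distance would not flag it. The density ratio does flag it, because its neighbors in the dense cluster are packed much more tightly than p2 is.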
Clustering‐Based
• Basic idea:
– Cluster the data into groups of different density
– Choose points in small clusters as candidate outliers
– Compute the distance between candidate points
and non‐candidate clusters.
• If candidate points are far from all other
non‐candidate points, they are outliers
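A minimal sketch of these steps follows. It assumes a very simple clustering rule (points linked by steps of at most `eps` share a cluster) as a stand-in for whatever clustering algorithm is actually used; the parameter names `eps`, `min_size`, and `far` and the sample points are made up for the illustration.

```python
import math


def threshold_clusters(points, eps):
    """Label connected components: points linked by steps of length <= eps share a cluster."""
    labels = [None] * len(points)
    cluster_id = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue
        labels[start] = cluster_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] is None and math.dist(points[i], points[j]) <= eps:
                    labels[j] = cluster_id
                    stack.append(j)
        cluster_id += 1
    return labels


def cluster_based_outliers(points, eps, min_size, far):
    """Indices of points in small clusters that are far from every non-candidate point."""
    labels = threshold_clusters(points, eps)
    sizes = {}
    for lab in labels:
        sizes[lab] = sizes.get(lab, 0) + 1
    # Points in clusters smaller than min_size are the candidate outliers.
    candidates = [i for i, lab in enumerate(labels) if sizes[lab] < min_size]
    regular = [i for i, lab in enumerate(labels) if sizes[lab] >= min_size]
    outliers = []
    for i in candidates:
        nearest = min((math.dist(points[i], points[j]) for j in regular),
                      default=math.inf)
        if nearest > far:
            outliers.append(i)
    return outliers


points = [(0, 0), (0, 1), (1, 0), (1, 1),          # cluster A
          (10, 10), (10, 11), (11, 10), (11, 11),  # cluster B
          (5, 20),                                 # isolated and far away
          (2.5, 2.5)]                              # isolated but close to cluster A
print(cluster_based_outliers(points, eps=1.5, min_size=3, far=3.0))  # prints [8]
```

The point at (2.5, 2.5) forms its own small cluster and is therefore a candidate, but it is not reported because it lies within `far` of cluster A.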
KDD Process
• Develop an understanding of the application domain
– Relevant prior knowledge, problem objectives, success criteria,
current solution, inventory resources, constraints, terminology,
cost and benefits
• Create target data set
– Collect initial data, describe, focus on a subset of variables,
verify data quality
• Data cleaning and preprocessing
– Remove noise, outliers, missing fields, time sequence
information, known trends, integrate data
• Data reduction and projection
– Feature subset selection, feature construction, discretizations,
aggregations
KDD Process
• Selection of data mining task
– Classification, segmentation, deviation detection,
link analysis
• Select data mining approach
• Data mining to extract patterns or models
• Interpretation and evaluation of
patterns/models
Knowledge
Discovery
Challenges of Data Mining
• Scalability
• Dimensionality
• Complex and Heterogeneous Data
• Data Quality
• Data Ownership and Distribution
• Privacy Preservation
• Streaming Data
• Data from Multi‐Sources
Similarities Between Data Miners and Doctors
Data Characteristics
Data Mining Techniques Medical Devices
Commercial and Research Tools
WEKA: http://www.cs.waikato.ac.nz/ml/weka/
SAS: http://www.sas.com/
Clementine: http://www.spss.com/spssbi/clementine/
Intelligent Miner: http://www-3.ibm.com/software/data/iminer/
Insightful Miner: http://www.insightful.com/products/product.asp?PID=26
Textbooks
Knowledge Management
• Knowledge management systems concern the sharing
of knowledge that is already known to exist, either in
libraries of documents, in the heads of employees, or in
other known sources.
• Knowledge management (KM) is the process of
creating value from intellectual capital and sharing
that knowledge with employees, managers, suppliers,
Knowledge Management (Continued)
• Knowledge management is a process that is
supported by the five components of an
information system.
– Its emphasis is on people, their knowledge, and
effective means for sharing that knowledge with
others.
• The benefits of KM concern the application of
knowledge to enable employees and others to
leverage organizational knowledge to work
smarter.
• KM preserves organizational memory by
capturing and storing the lessons learned and
best practices of key employees.
Content Management Systems
• Content management systems are information
systems that track organizational documents, Web
pages, graphics, and related materials.
• Such systems differ from operational document
systems in that they do not directly support business
operations.
• KM content management systems are concerned with
the creation, management, and delivery of documents
that exist for the purpose of imparting knowledge.
Content Management Systems (Continued)
• Typical users of content management systems are
companies that sell complicated products and want to
share their knowledge of those products with
employees and customers.
• The basic functions of content management systems are
the same as for report management systems: author,
manage, and deliver.
• The only requirement that content managers place on
document authoring is that the document has been
created in a standardized format.
Content Management Problems
• Documents may refer to one another or multiple
documents may refer to the same product or
procedure.
– When one of them changes, others must change as
well.
– Some content management systems keep semantic
linkages among documents so that content
dependencies can be known and used to maintain
document consistency.
• Document contents are perishable.
– Documents become obsolete and need to be altered, removed,
or replaced.
• Multinational companies have to ensure document
language translations.
Figure 9‐23 Document Management at
Microsoft.com (as of December 2003)
Source: microsoft.com/backstage/inside.htm (accessed February 2004). © 2003 Microsoft Corporation. All rights reserved.
Figure 9‐24 Reporting Services: United States
Figure 9‐25 Reporting Services: China
Source: Used with permission of Tom Rizzo of Microsoft Corporation.
Content Delivery
• Almost all users of content management systems pull
the contents.
• Users cannot pull content if they do not know it
exists.
– The content must be arranged and indexed, and a facility for
searching the content devised.
• Documents that reside behind a corporate firewall,
however, are not publicly accessible and will not be
reachable by Google or other search engines.
– Organizations must index their own proprietary documents
and provide their own search capability for them.
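One common way to index documents and search them is an inverted index that maps each term to the documents containing it. The sketch below is a minimal illustration, not a description of any particular product; the function names, document IDs, and document texts are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical internal documents that an outside search engine could not reach.
documents = {
    "kb-001": "How to reset a router password",
    "kb-002": "Router firmware upgrade procedure",
    "kb-003": "Expense report policy",
}
index = build_index(documents)
print(search(index, "router password"))  # prints {'kb-001'}
```

Once the index exists, users who do not know a document exists can still pull it by searching for the terms they expect it to contain.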
KM Systems to Facilitate the Sharing of Human
Knowledge
• Nothing is more frustrating for a manager to
contemplate than the situation in which one employee
struggles with a problem that another employee knows
how to solve easily.
• KM systems are concerned not only with the sharing of
content, but also with the sharing of knowledge among
humans.
– How can one person share her knowledge with another?
– How can one person learn of another person’s great idea?
KM Systems to Facilitate the Sharing of Human Knowledge (Continued)
• Three forms of technology are used for
knowledge‐sharing among humans:
– Portals, discussion groups, and email
– Collaboration systems
– Expert systems
Portals
– Employees can share ideas by posting
knowledge on a Web portal whereby
managers and employees can pull the
knowledge from the portal.
Figure 9‐26 Technology Support of Sharing
Human Knowledge
KM Systems to Facilitate the Sharing of Human
Knowledge (Continued)
Discussion Groups
– Discussion groups allow employees or customers to
post questions and queries seeking solutions to
problems they have.
– Oracle, IBM, PeopleSoft, and other vendors support
product discussion groups where users can post
questions and where employees, vendors, and other
users can answer them.
– Later, the organization can edit and summarize the
questions from such discussion groups into
KM Systems to Facilitate the Sharing of Human
Knowledge (Continued)
Discussion groups (continued)
– Basic email can also be used for knowledge‐sharing,
especially if email lists have been constructed with
KM in mind.
– Two human factors inhibit knowledge‐sharing.
• Employees can be reluctant to exhibit their
ignorance.
• Competition exists between employees.
– A KM application may be ill‐suited to a competitive
group.
• The company may be able to restructure rewards
and incentives to foster sharing of ideas among
employees.
KM Systems to Facilitate the Sharing of Human
Knowledge (Continued)
Collaboration Systems
– Collaboration systems are information systems that enable
people to work together more effectively.
– The Internet can be used as a broadcast medium for speeches,
panel discussion, and other types of meetings.
– Web broadcasts, because they are digital, can be readily saved
and replayed at the viewer’s convenience.
– Web broadcasts can also be made interactive by combining
them with discussion group bulletin boards that are live during
the broadcast.
– Video conferencing is another popular form of IT‐supported
meetings.
• Video‐conferencing equipment is expensive and normally is
located in selected sites in the organization.
Figure 9‐27 Net Meeting Graphic
KM Systems to Facilitate the Sharing of Human Knowledge (Continued)
Expert Systems
– Expert systems are created by interviewing experts
in a given business domain and codifying the rules
stated by those experts.
– Many expert systems were created in the late 1980s
and 1990s, and some of them have been successful.
– Expert systems suffer from three major
disadvantages.
• They are difficult and expensive to develop.
• They are difficult to maintain.
• They were unable to live up to the high