COMP33111
Data Integration and
Analysis
Goran Nenadic
School of Computer Science
Lots of data is being collected
and stored
purchases at department/ grocery stores
bank/credit card transactions
Web data, Internet traffic
Blogs, web sites
e-commerce
sensor data
CCTV …
Data is everywhere
Wal-mart supermarket
~1 million customers every hour
~200 million transactions per week
2.5 petabytes of data
Search engines
Google: 34,000 searches per second,
2 million per minute;
121 million per hour; 3 billion per day; 88 billion per month
Social networks
Facebook: more than 40 billion photos, 700 status updates per second
Big data is everywhere
Mobile phone subscriptions
4.6 billion users
Health-care
a single electrocardiogram generates 1,000 readings per sec
Genomics
human genome with 3 billion base pairs
Science
one biomedical article every 2 minutes
~17 million biomedical articles available
And still growing …
Big data is everywhere
Examples include web logs, RFID, sensor
networks, social networks, social data, Internet
text and documents, Internet search indexing,
call detail records, astronomy, atmospheric
science, genomics, biogeochemical, biological,
and other complex and often interdisciplinary
scientific research, military surveillance, medical
records, photography archives, video archives,
and large-scale e-commerce.
http://en.wikipedia.org/wiki/Big_data
Scientific databases more and more important
biomedicine/bioinformatics/genetics
in the range of Pbytes per year (gene expressions)
astronomy
already in hundreds of Tbytes/year
environmental science
already in hundreds of Tbytes/year; predictions: 15 Pbytes
medicine and health care; electronic patient records
~Pbytes (mostly images)
social sciences/humanities
Big data is everywhere
Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone.
Four dimensions: Volume, Velocity, Variety, and Veracity
Volume
Enterprises are awash with ever-growing data of all types, easily amassing terabytes—even petabytes— of information.
Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
Convert 350 billion annual meter readings to better predict power consumption
http://www-01.ibm.com/software/data/bigdata/
Big data is everywhere
Velocity
Scrutinize 5 million trade events created each day to identify potential fraud
Analyze 500 million daily call detail records in real-time to predict customer churn faster
Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.
http://www-01.ibm.com/software/data/bigdata/
Big data is everywhere
Variety
Big data is any type of data - structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.
Monitor 100’s of live video feeds from surveillance cameras to target points of interest
Exploit the 80% data growth in images, video and documents to improve customer satisfaction
Big data is everywhere
Veracity
1 in 3 business leaders don’t trust the information they use to make decisions. How can you act upon information if you don’t trust it?
Establishing trust in big data presents a huge challenge as the variety and number of sources grows.
http://www-01.ibm.com/software/data/bigdata/
What to do with this data?
Cablecom - Swiss telecom operator
analysed the number of calls to customer services and spotted that it peaks around the 9th month into contract – customers make a decision to leave soon after that
-> offer special deals 7-8 months into the contract to keep customers
Royal Shakespeare Company
profiling best customers -> targeted marketing -> 70% increased visits/sales
Google Flu Trends
Analysed search terms when there was flu around
Traffic control
Why analyse data?
Business: competitive pressure is strong:
provide business intelligence and/or customised services
Science: data analysis may help scientists in
classifying and segmenting data
hypothesis formation
Healthcare: study findings both at
individual and population levels
provide better understanding of causal links
Variety of data types
Traditional data types
numbers, characters, strings, dates
structured data with ‘clear’ meaning
Multimedia data
text, graphics (drawings, illustrations), images, animations, audio, video, composed (mixed) multimedia
un- and semi-structured data, no ‘clear’ or correct meaning assigned
Example: insurance company
Data types in an insurance company
accident reports (text)
images of accidents (image)
reconstruction of accident (video, animation)
audio recording of the parties involved (audio)
medical reports (text)
supporting medical materials (images)
This module
Previous database course focused on:
Database technologies: infrastructure for managing and
querying data.
Database design: techniques for working out what to store
and how.
Database programming: developing applications over
databases.
This course unit focuses principally on making the most of data within an organisation:
Data integration: getting the data into a form that supports
and facilitates aggregation, exploration and mining.
Data analysis: techniques for making sense of data and
learning new lessons.
Module contents
Data integration
Data warehousing
modelling, design, architectures, ETL process
Managing and storing multi-media data
methods for capturing multi-media data
Data analysis
On-line Analytical Processing (OLAP)
exploration of data through OLAP operations
Data mining
Organisation
Lectures and guest lecture [10]
introducing main concepts
Tutorials [weekly, from week 2]
understanding and practical work (groups)
tools for data exploration (WEKA, Palo OLAP and Dundas OLAP)
Lab tests [2]
some topics will have practical individual labs (e.g. data analysis using software products)
Reading list
T. M. Connolly, C. E. Begg: Database Systems: A Practical Approach
to Design, Implementation, and Management, ISBN: 0130412120, Pearson Education Limited
Elmasri, R., Navathe, S.: Fundamentals of Database Systems,
5th Edition, Benjamin/Cummings
O. Maimon, L. Rokach (Ed): Data Mining and Knowledge Discovery
Handbook, Springer Verlag
(http://www.springerlink.com/content/978-0-387-09822-7)
R. Nisbet, J. Elder, G. Miner: Handbook of Statistical Analysis and Data Mining Applications, Elsevier, ISBN: 978-0-12-374765-5 (e-book)
Many online materials, including case studies, WEKA tutorials, etc.
All materials (lectures, tutorials, labs) are on the Web and Blackboard
Assessment
Exam: 85%
2 hours, calculators allowed
3 out of 5 questions
Lab tests: 15%
two assessed labs
Pre-requisites
Good knowledge of SQL
Plan – lectures
Week 1: Introduction to data integration & analysis
Week 2: Data warehousing (DW)
Week 3: Introduction to OLAP
Week 4: Association rule mining
Week 5: Analysis of textual data (text analytics)
Week 7: Data classification
Week 8: Data clustering
Week 9: Multi-media data management
Week 11: Enterprise Resource Planning (ERP)
Week 12: Guest lectures: Smart Analytics (IBM); Forensic analytics (PwC)
Plan – tutorials and lab tests
Week 2: Tutorial 1: basic data exploration
Week 3: Tutorial 2: data profiling, DW
Week 4: Tutorial 3: OLAP
Week 5: Lab test 1 (data profiling, DW, OLAP)
Week 7: WEKA lab exercises (3, 4)
Week 8: WEKA lab exercises (3, 4)
Week 9: Tutorial 4: Data mining
Week 10: Lab test 2 (Data mining)
Week 11: Tutorial 5: Multimedia data management
Summary
This course unit aims to introduce:
data warehousing: architectures and methods for
integrating and organising data in a way that supports further analyses
data analytics: techniques for exploring and making
sense from the data
Course web resources
Blackboard – COMP33111
Making sense of data
- Overview of main concepts
- Overview of main topics
- Introduction to data mining
From data to decisions
Data
Mary Jones; April 1, 1999; dog; Florida
Information
A. Berger, M. Jones, T. Martin, J. Smith
50,000; 46,800; 29,200; 75,500
Shoes, Scarves, Jewelry, Groceries
MoU, Qty, Income, Education
Knowledge
A. Berger is most likely to buy new product
T. Martin is profitable customer but is likely to switch carriers
Decisions and actions
Offer A. Berger promotion for a new product
Launch a new campaign
Data/information management
Store useful data and information
day-to-day operational data (e.g. transactions)
external data (e.g. market data)
human resources data
Provide effective information storage and access to support data analysis and integration
distributed/federated databases:
leave the data where it is
data warehouses
Types of data (values)
Qualitative (descriptive)
categorical (nominal)
ordinal
Quantitative (numerical)
interval
ratio
See Lab exercise 1 – Data Refresher Slides
The kinds of data we have
Traditional “transactional” information, i.e.
operational data that documents everyday life in
an enterprise/organisation
retail (e.g. sales in supermarket stores)
financial services (e.g. ATM withdrawals)
transport (e.g. flight bookings)
telecommunications (e.g. mobile billing, Internet)
healthcare (e.g. drug prescriptions)
Recording and processing this type of data is
known as “online transaction processing”
(OLTP)
Online transaction processing
OLTP: processing and recording transactions
that create new data and/or update existing
information in operational DBs:
insertions, updates, deletions
Typically a small number of rows are affected in
each transaction.
Traditional DBMS optimised to perform well in
OLTP, but not in comprehensive exploration,
aggregation and decision making.
Why analyse data?
There is often information “hidden” in the data that is not readily evident
Human analysts may take weeks to discover useful information
Much of the data is never analysed at all
[Figure: the "data gap", 1995 to 1999: total new disk (TB, axis from 0 to 4,000,000) compared with the number of analysts]
Need for data analysis
Modern business/science environment
markets evolve faster than ever
competition is more intense than ever
quantity of information is increasing
In order to succeed, an organisation must
have a comprehensive view of all of its aspects -> data integration
make informed and reliable decisions -> data analysis
take timely actions and accurate predictions
“Business intelligence”
Business intelligence
an ongoing process of monitoring the competitive environment in order to identify opportunities to act on, and/or threats to business to be avoided.
It is the analysis of available business
data (internal and external)
It is NOT about spying, sleuthing, espionage
it is estimated that 80% of business intelligence of …
Business intelligence examples
Customer data and patterns
What are the characteristics of our customers?
What are their buying patterns?
Who are the customers likely to move away?
Who are the most loyal customers?
Business intelligence examples
Sales analysis and identification of trends
Which products sell the most at specific time periods?
What are the products that are selling best as combinations?
What are the products sold during the highest profitability transactions?
How many visas were issued country-by-country for the three most busy months in the last 12 months?
Business intelligence examples
Business targets and promotion effectiveness
Who are the customers most likely to respond to an advertising campaign by post?
On which day of the week should a new advertising campaign be launched?
How are promotional campaigns linked with other leading brands over time?
How has a certain campaign affected sales in a region?
How were visa applications affected by implementing a new on-line access system?
etc.
Data warehouse (DW)
DW: an integrated database designed to support
data analysis, business intelligence, and better
and faster decision making
DWs integrate and aggregate data from various
operational and external DBs maintained by
different units
DW needs to provide
more complex aggregation and analysis of data
mining “new” data (e.g. spending trends)
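The integration step can be illustrated with a minimal sketch. It is not taken from the lecture: the two branch extracts, their field names and the helper `transform` are hypothetical, and the "warehouse" is only an in-memory list. It shows records with different schemas being mapped onto one common form, cleaned, loaded and then summarised.

```python
# Hypothetical extracts from two operational sources with different schemas.
london_sales = [
    {"product": "Shoes", "amount": "50.00"},
    {"product": " scarves", "amount": "20.00"},
]
manchester_sales = [
    {"item": "SHOES", "total": 30.0},
]


def transform(record, branch):
    """Map a source record onto one common schema and clean its values."""
    product = record.get("product", record.get("item"))
    amount = record.get("amount", record.get("total"))
    return {
        "branch": branch,
        "product": product.strip().lower(),  # unify spelling of product names
        "amount": float(amount),             # unify the numeric type
    }


# Load: the "warehouse" is just a list in this sketch.
warehouse = [transform(r, "London") for r in london_sales]
warehouse += [transform(r, "Manchester") for r in manchester_sales]

# Summarise: total sales per product across all branches.
sales_by_product = {}
for row in warehouse:
    product = row["product"]
    sales_by_product[product] = sales_by_product.get(product, 0.0) + row["amount"]

print(sales_by_product)
```

Without the cleaning step, "Shoes" and "SHOES" would be counted as two different products.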
General DW role
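[Figure: operational DBs and external DBs feed the data warehouse through Extract, Transform, Load (ETL); the data warehouse supports OLAP, DSS and data mining applications]
Data warehouse
[Figure: detailed transactional data (London branch, Sale branch, Manchester branch) and external data (census data, GIS data) are integrated, cleaned and summarised in the data warehouse; data analysis then uses direct query, reporting tools, mining tools and OLAP]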
Data warehouse applications
Online analytical processing (OLAP)
complex analysis of data from DW
e.g. trend analysis, time series, etc.
Decision support systems (DSS)
high level data processing for management
executive information systems (EIS)
Data mining (DM)
support for “knowledge discovery”
search for unanticipated knowledge
OLAP
Term coined in mid-1990s
Main goal: support ad-hoc but complex
querying performed by business analysts
OLAP = interactive process of creating,
managing, analysing and reporting on data
Extends spreadsheet-like analysis to work with
huge amounts of data in a data warehouse
OLAP
Data exploration & aggregation in various ways
Typical applications include assessing the
effectiveness of a marketing campaign, product
sales forecasting (predictive analysis) and
capacity planning
Also, spot trends, pinpoint problems, perform
“what-if” modelling
Allows a sophisticated user to analyse data
using complex, multi-dimensional views
Typical OLAP queries
Write a multi-table join to compare sales for
each product line year-to-date (YTD) this year
vs. last year.
Repeat the above process to find the top 5
product contributors to margin.
Repeat the above process to find the sales of a
product line to new vs. existing customers.
Repeat the above process to find the customers
that have had negative sales growth.
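The first query above might look as follows. This is a sketch under stated assumptions, not the course's own solution: the tables `product` and `sale`, their columns, the sample rows and the fixed reporting date are all hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_line TEXT);
CREATE TABLE sale (product_id INTEGER, sale_date TEXT, amount REAL);
INSERT INTO product VALUES (1, 'Shoes'), (2, 'Scarves');
INSERT INTO sale VALUES
  (1, '2023-02-10', 100.0),
  (1, '2023-11-05', 999.0),  -- after the cut-off, so not part of last year's YTD
  (1, '2024-01-15', 150.0),
  (2, '2023-03-01', 80.0),
  (2, '2024-02-20', 60.0);
""")

# Multi-table join: sales per product line, year-to-date this year vs. last year.
# ISO dates stored as text compare correctly as strings.
query = """
SELECT p.product_line,
       SUM(CASE WHEN s.sale_date BETWEEN :this_start AND :this_end
                THEN s.amount ELSE 0 END) AS ytd_this_year,
       SUM(CASE WHEN s.sale_date BETWEEN :last_start AND :last_end
                THEN s.amount ELSE 0 END) AS ytd_last_year
FROM sale AS s
JOIN product AS p ON p.product_id = s.product_id
GROUP BY p.product_line
ORDER BY p.product_line
"""
params = {
    "this_start": "2024-01-01", "this_end": "2024-06-30",  # assumed "today"
    "last_start": "2023-01-01", "last_end": "2023-06-30",
}
rows = conn.execute(query, params).fetchall()
for line, this_year, last_year in rows:
    print(line, this_year, last_year)
```

Each follow-up question (top 5 contributors, new vs. existing customers, negative growth) needs a similar hand-written query, which is the repetition that OLAP tools aim to remove.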
What is data mining?
Many definitions
non-trivial extraction of implicit, previously unknown and potentially useful information/knowledge from data
exploration & analysis, by automatic or semi-automatic means, of
large quantities of data in order to discover new knowledge and meaningful patterns
What is (not) data mining?
What is data mining:
– Discover that certain names are more prevalent in certain locations
– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)
– Identify patterns in buying a PC and related equipment
– Identify patterns in accessing a web site
What is not data mining:
– Look up phone number in a phone directory
– Query a Web search engine for information about “Amazon”
– Find the average value of market basket during a given period
– Rank web-pages based
Data mining – definition
Process of extracting valid, previously unknown,
comprehensible and actionable information from
extremely large databases, in order to make crucial
decisions or predictions
also known as “knowledge discovery in databases” (KDD)
Process of identifying valid, non-trivial, novel,
potentially useful patterns in data
use such patterns to predict or classify specific events, or act accordingly
Data mining process
1. Identification of possible data mining applications and problems
2. Analysis of data and identification of possible solutions
3. Selection and implementation of data mining techniques
4. Monitoring the effectiveness of the proposed solutions
5. Using results for decision making, prediction, profiling, etc.
[Figure: pyramid with increasing potential to support business decisions from bottom to top. Layers, bottom to top: data sources (operational data); data warehouses; data exploration (OLAP, statistical analysis, reporting); data mining (knowledge discovery); data presentation (visualization techniques); making decisions. Roles, bottom to top: DBA, data analyst, business analyst, end user.]
Data mining aims/approaches
Descriptive data mining
acquire specific/general properties of the data
find patterns in data
use them for new business models
Predictive data mining
learn attributes from data that can be used to predict behaviour/activities in future (based on past and current data)
use known data to train/learn the model
Predictive data mining: examples
Who is likely to buy if we offer a discount?
Learn profiles of customers that buy new
collections, whatever the cost.
Would a new store improve sales?
Loan payment prediction
Predict toxicity of a new drug
Predict flooding or natural disasters
Prediction outcomes
True Positives (TP) = an entity has been predicted to
have a certain property, and it does have it (in reality)
e.g. a predicted toxic drug is toxic
False Positives (FP) = an entity has been predicted to
have a certain property, and it does not have it (in reality)
e.g. a predicted toxic drug is not toxic
True Negatives (TN) = an entity has not been predicted
to have a certain property, and it does not have it
e.g. a customer has not been predicted to buy a new CD, and (s)he does not buy it
False Negatives (FN) = an entity has not been predicted
to have a certain property, but it does have it
e.g. a customer has not been predicted to buy a new CD, but (s)he buys it
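The four outcomes can be counted mechanically once predictions are compared with reality. A minimal sketch follows; the list of (predicted, actual) pairs is made-up illustration data and the helper `outcome` is a hypothetical name.

```python
from collections import Counter


def outcome(predicted: bool, actual: bool) -> str:
    """Name the prediction outcome for one entity."""
    if predicted and actual:
        return "TP"  # predicted to have the property, and has it
    if predicted and not actual:
        return "FP"  # predicted to have the property, but does not
    if not predicted and not actual:
        return "TN"  # not predicted, and does not have it
    return "FN"      # not predicted, but does have it


# Hypothetical pairs: (predicted to buy the new CD, actually bought it).
pairs = [
    (True, True), (True, False), (False, False),
    (False, True), (True, True), (False, False),
]
counts = Counter(outcome(p, a) for p, a in pairs)
print(counts["TP"], counts["FP"], counts["TN"], counts["FN"])
```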
Prediction outcomes
Prediction of false positives
e.g. if a data mining system (e.g. prediction of terror suspects) has only 2% errors, then analysing 100 million passengers would generate 2 million false positives!
huge impact on security, efficiency, privacy
false positives with manual analysis too
Still, data mining has the potential to reduce the
high rate of false positives
combining multiple models can reduce false positives
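The arithmetic behind the passenger example, as a short sketch. It assumes, as the example implicitly does, that practically all passengers are innocent, so every error is a false alarm. The figure of 100 correctly flagged suspects is a purely hypothetical addition, used only to show how small the share of real hits becomes.

```python
passengers = 100_000_000
error_percent = 2  # "only 2% errors"

# Integer arithmetic avoids floating-point rounding in the count.
false_positives = passengers * error_percent // 100
print(false_positives)

# Hypothetical: suppose 100 real suspects are all flagged correctly.
true_positives = 100
share_of_real_hits = true_positives / (true_positives + false_positives)
print(f"{share_of_real_hits:.6f}")
```

Even with a low error rate, nearly every flagged passenger is a false alarm when the property being predicted is rare.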
Descriptive data mining
Main types
identification
categorisation
optimisation
Data mining – identification
Identify existence of a new activity
e.g. a new buying pattern
new pattern in using on-line services
identifying the best products for different customers
identify factors that attract new customers
Data mining – categorisation
Learn attributes that can partition the data (e.g.
customers) into “meaningful” groups
find “model” customers who share the same characteristics: interest, income level, spending habits, etc.
e.g. shoppers: regular, ‘posh’, ‘rush’, discount seeking, etc.
online users: mail-only, occasional surfers, addicts…
Profiling customers, citizens, students, genes
Data mining – optimisation
Optimise usage of resources
e.g. internal computer networking
workload at customer services
maximise sales under given constraints
suggests adjustments on the pricing and variety of goods in different stores/regions etc.
optimise distribution of stores/base stations in a region/country
monitor market directions
When is data mining useful?
Can we extract/mine useful patterns from data?
Patterns are interesting if they are
easily understood by humans
valid on new data (with some degree of certainty)
potentially useful, novel
validate some hypothesis that a user seeks to confirm
can be used to improve understanding of business or scientific process
Overview of data
mining techniques
Data mining techniques
Some data mining techniques
Relevance analysis
Time series analysis
Sequential patterns
Association Rules
Classification
Regression
Clustering
Anomaly detection
Relevance analysis
Identification of relevant attributes and the
degree of their relevance
Example
payment-income ratio is a relevant factor for loan approval, while education level and debt ratio are not
Identify these relevant attributes automatically
from data
Time series analysis
Examine value/behaviour of an attribute over
time at evenly spaced time points (daily, weekly,
yearly etc)
e.g. daily stock price for each company
sales in summer/winter
Observe behaviour of several groups of values
Time series can be visualised by plotting the
values over the time points
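A simple moving average is one common way to examine such a series. It is not named in the lecture and is shown here only as an illustration; the prices are made up and `moving_average` is a hypothetical helper.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive, evenly spaced observations."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical daily stock prices for one company.
daily_price = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0, 15.0]
smoothed = moving_average(daily_price, 3)
print(smoothed)
```

The smoothed values damp day-to-day noise, which makes an underlying trend easier to see when plotted.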
Sequential patterns
Based on a time sequence of actions
Relationship is based on time
e.g. a customer buys a PC, the following month he/she typically buys a printer; in 12 months he/she purchases a printer cartridge
use such information to offer a deal or to define a new strategy
Sequential patterns
Analyse a company web log data to determine
how users access the company web site
If, for example, 70% of users of page A follow
one of the following patterns of behaviour
(in terms of links):
(A, B, C)
or
(A, D, B, C)
or
(A, E, B, C)
then add a link directly from page A to page C
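The rule above can be checked against a web log with a few lines of code. This is a sketch only: the sessions are invented, each session is assumed to be already reduced to an ordered list of page names, and `contains_path` is a hypothetical helper.

```python
# The three navigation patterns from the slide.
PATTERNS = [("A", "B", "C"), ("A", "D", "B", "C"), ("A", "E", "B", "C")]


def contains_path(session, pattern):
    """True if the pages of `pattern` occur consecutively within `session`."""
    n = len(pattern)
    return any(
        tuple(session[i:i + n]) == pattern
        for i in range(len(session) - n + 1)
    )


# Hypothetical sessions extracted from a company web log.
sessions = [
    ["A", "B", "C"], ["A", "D", "B", "C"], ["A", "E", "B", "C", "F"],
    ["A", "B", "C"], ["A", "F"], ["H", "A", "B", "C"], ["A", "G"],
    ["B", "C"], ["A", "E", "B", "C"], ["A", "D", "B", "C"],
]

visited_a = [s for s in sessions if "A" in s]
matching = [s for s in visited_a if any(contains_path(s, p) for p in PATTERNS)]
share = len(matching) / len(visited_a)

# Apply the 70% threshold from the slide.
add_link_a_to_c = share >= 0.70
print(f"{share:.2f}", add_link_a_to_c)
```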
Association rules
Establish an “association” link
common example: market-basket analysis
market basket: items bought during one visit
e.g. if a customer buys milk and tea, they also buy cookies
if a female shopper buys a handbag, she is very likely to buy matching shoes
See the lecture on “Association rule mining”
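How strong is a rule such as "milk and tea -> cookies"? Two standard measures are support and confidence; they are not defined on this slide, so treat this as a preview under assumptions. The baskets are made-up data.

```python
# Hypothetical market baskets: the items bought during one visit.
baskets = [
    {"milk", "tea", "cookies"},
    {"milk", "tea", "cookies", "bread"},
    {"milk", "tea"},
    {"milk", "bread"},
    {"tea", "cookies"},
]


def support(itemset):
    """Fraction of baskets that contain every item of `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


rule_support = support({"milk", "tea", "cookies"})
rule_confidence = confidence({"milk", "tea"}, {"cookies"})
print(rule_support, round(rule_confidence, 3))
```

Here two of the three baskets with milk and tea also contain cookies, so the rule holds in about two thirds of the relevant visits.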
Classification
Mapping data into predefined classes
classes are known (business determined)
Learn how to classify a new (unseen) case
e.g. application for a loan: 5 ranges of credit-card worthiness
visa application: binary classification (2 classes)
Entity (data-set) can be classified into several
classes corresponding to various “dimensions”
See the lecture on “Classification”
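A minimal sketch of learning to classify an unseen case from labelled examples. The technique used here, 1-nearest-neighbour, is chosen only for brevity and is not necessarily the one covered in the Classification lecture; the loan data, the two features and the class names are hypothetical.

```python
import math

# Hypothetical training set: (income in thousands, existing debt in thousands) -> known class.
training = [
    ((50.0, 5.0), "approve"),
    ((60.0, 10.0), "approve"),
    ((20.0, 15.0), "reject"),
    ((25.0, 20.0), "reject"),
]


def classify(case):
    """Assign the class of the closest training example (Euclidean distance)."""
    nearest = min(training, key=lambda example: math.dist(example[0], case))
    return nearest[1]


print(classify((55.0, 8.0)))
print(classify((22.0, 18.0)))
```

The classes are fixed in advance by the business; only the mapping from attributes to classes is learnt from the data.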
Regression
Classification is about categorical values
Regression – for numeric prediction
e.g. predict house prices
e.g. predict one’s retirement savings based on its current value and several past values
e.g. predict sales amounts of a new product based on advertising expenditure
e.g. predict wind velocities as a function of temperature, humidity, air pressure, etc.
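A sketch of numeric prediction with a straight line fitted by ordinary least squares, using the advertising example. The figures are invented and lie exactly on a line to keep the arithmetic easy to follow; real data would not. `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical past products: advertising expenditure and resulting sales.
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(advertising, sales)

# Predict sales of a new product with an advertising budget of 5.0.
predicted_sales = slope * 5.0 + intercept
print(slope, intercept, predicted_sales)
```

Unlike classification, the output is a number on a continuous scale rather than one of a fixed set of classes.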
Clustering
Partition data into groups that might or might not
be disjoint
soft and hard clustering
Groups (i.e. clusters) are not known in advance
Discover new trends or outliers
e.g. fraud detection
Clustering
Most similar data are grouped into clusters
similarity measure between the data is needed to identify related entities
example: when are two transactions “similar”?
aggregation and generalisation can be used
e.g. type of product instead of product_id
milk includes all types and quantities
depends on the applications
difference between semi-skimmed and full milk may be essential for a healthy-lifestyle company
See the lecture on “Clustering”
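The effect of generalisation on similarity can be shown with a small sketch. The Jaccard coefficient is used as the similarity measure purely as an assumption, since the slide does not fix one; the product identifiers and the `PRODUCT_TYPE` mapping are hypothetical.

```python
# Hypothetical mapping from product_id to the more general product type.
PRODUCT_TYPE = {
    "semi_skimmed_milk_1l": "milk",
    "full_milk_2l": "milk",
    "white_bread": "bread",
    "brown_bread": "bread",
}


def jaccard(a, b):
    """Shared items divided by all distinct items in the two transactions."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


t1 = ["semi_skimmed_milk_1l", "white_bread"]
t2 = ["full_milk_2l", "brown_bread"]

similarity_by_id = jaccard(t1, t2)
similarity_by_type = jaccard(
    [PRODUCT_TYPE[p] for p in t1],
    [PRODUCT_TYPE[p] for p in t2],
)
print(similarity_by_id, similarity_by_type)
```

By product_id the two transactions share nothing; by product type they are identical. Which view is right depends on the application.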
Anomaly detection
Detect significant deviations from “normal” behaviour
Examples
credit card fraud detection
network intrusion detection
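One very simple way to flag deviations from "normal" behaviour is to measure how many standard deviations a value lies from the mean. This rule and its threshold are assumptions made for illustration, not a method given in the lecture; the card payments are invented.

```python
import statistics

# Hypothetical card payments for one customer; one is clearly unusual.
amounts = [20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 23.0, 500.0]

mean = statistics.mean(amounts)
spread = statistics.pstdev(amounts)  # population standard deviation

THRESHOLD = 2.0  # assumed cut-off, in standard deviations
anomalies = [a for a in amounts if abs(a - mean) / spread > THRESHOLD]
print(anomalies)
```

A large outlier inflates both the mean and the standard deviation, so on small samples this rule is crude; it is meant only to make the idea of "significant deviation" concrete.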
Data mining
Important note: data mining techniques we
discuss build models automatically from data
resulting classification or association rules are NOT designed by a business/data analyst
rather, they have been ‘learnt’ from data automatically
So,
classification = automatic classification
clustering = automatic clustering
[Figure: data mining at the intersection of database systems, statistics, machine learning, algorithms, visualization and other disciplines]
Machine learning and DM
ML = automatic acquisition of a model from data
capturing/generalising behaviour of data from examples
Supervised learning
predefined what to learn (e.g. classes)
use a training set as a source to learn from
Unsupervised
no training sets, no targets: learn from data (e.g. clustering)
Data mining uses many ML techniques
but association rule mining, for example, is not ML
handling large data sets (scalability)
Data mining and privacy
A lot of controversy about data mining ethics
companies routinely collect information about customers and use it for marketing, etc.
Can information collected for one purpose be
used for mining data for another purpose?
in Europe, generally no, without explicit consent
in the US, generally yes
Powerful profiling techniques
www.kdnuggets.com/gpspubs/ieee-expert-9504-priv.html
Data mining and privacy
Can discrimination be based on features like
sex, age, national origin?
in some areas (e.g. mortgages, employment) certain features cannot be used for decision making
In other areas, these features are needed to
assess the risk factors
e.g. people of African descent are more susceptible to sickle cell anemia
Data mining on new data types
Currently, most data mining is on flat tables
mining traditional data types (categorical data)
Richer but unstructured data sources available
text, links, web, images, multimedia
Mining on a meta-level
e.g. knowledge bases
Data integration & analysis
Competitive advantage
increased productivity of decision making
allow decision-makers to access data that reveal unknown, unavailable and untapped information (e.g. customer profiles, trends and demands)
Potential high returns on investment
average return on business intelligence investments (hardware, data, tools, human resources) is over 400% over a period of 2-3 years
(International Data Corporation data)
market growth: 1995: $2 billion; 2004: $10-15 billion
Summary
Making sense from large datasets
DW: an integrated database for supporting
higher-level analysis of data
OLAP: ad-hoc exploration
Data mining: systematic exploration
Mining multi-media data
Benefits
supports competitive business/scientific intelligence
Reading for this lecture
Chapters 34 in [Connolly & Begg]
Chapter 27 (27.1) in [Elmasri & Navathe]
The Data Deluge Report, The Economist, Feb 2010
On-line materials: Blackboard and
http://www.cs.manchester.ac.uk/ugt/COMP33111
Additional reading (on the web)
Two IBM case studies:
crime prediction
For next week:
Complete Tutorial 1: Basic data exploration
types of data
median, variance, etc.
correlation
distance measures
bring the solutions to the lab session
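As a refresher on the measures listed above, a short sketch using only the Python standard library. The numbers are arbitrary, `pearson` is a hypothetical helper, and the choice of population variance and of Euclidean and Manhattan distances is an assumption; the tutorial may ask for other variants.

```python
import math
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

median_x = statistics.median(x)
variance_x = statistics.pvariance(x)  # population variance (divide by n)


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


correlation = pearson(x, y)

# Two common distance measures between points p and q.
p, q = (0.0, 0.0), (3.0, 4.0)
euclidean = math.dist(p, q)
manhattan = sum(abs(u - v) for u, v in zip(p, q))

print(median_x, variance_x, correlation, euclidean, manhattan)
```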