Data is everywhere. Big data is everywhere COMP33111, 2012/ Goran Nenadic School of Computer Science


COMP33111

Data Integration and

Analysis

Goran Nenadic

School of Computer Science




Lots of data is being collected

and stored

purchases at department/ grocery stores

bank/credit card transactions

Web data, Internet traffic

Blogs, web sites

e-commerce sensor data CCTV …

Data is everywhere

 Wal-mart supermarket

~1 million customers every hour

~200 million transactions per week

2.5 petabytes of data

 Search engines

Google: 34,000 searches per second (about 2 million per minute, 121 million per hour, 3 billion per day, 88 billion per month)

 Social networks

Facebook: more than 40 billion photos, 700 status updates per second

Big data is everywhere


 Mobile phone subscriptions

4.6 billion users

 Health-care

a single electrocardiogram generates 1,000 readings per sec

 Genomics

human genome with 3 billion base pairs

 Science

one biomedical article every 2 minutes

~17 million biomedical articles available

 And still growing …


Big data is everywhere



Examples include web logs, RFID, sensor

networks, social networks, social data, Internet

text and documents, Internet search indexing,

call detail records, astronomy, atmospheric

science, genomics, biogeochemical, biological,

and other complex and often interdisciplinary

scientific research, military surveillance, medical

records, photography archives, video archives,

and large-scale e-commerce.

http://en.wikipedia.org/wiki/Big_data


 Scientific databases more and more important

biomedicine/bioinformatics/genetics

in the range of Pbytes per year (gene expressions)

astronomy

already in hundreds of Tbytes/year

environmental science

already in hundreds of Tbytes/year; predictions: 15 Pbytes

medicine and health care; electronic patient records

~Pbytes (mostly images)

social sciences/humanities


Big data is everywhere

 Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone.

 Four dimensions: Volume, Velocity, Variety, and Veracity

 Volume

Enterprises are awash with ever-growing data of all types, easily amassing terabytes—even petabytes— of information.

Turn 12 terabytes of Tweets created each day into improved product sentiment analysis

Convert 350 billion annual meter readings to better predict power consumption

http://www-01.ibm.com/software/data/bigdata/

Big data is everywhere

 Velocity

Scrutinize 5 million trade events created each day to identify potential fraud

Analyze 500 million daily call detail records in real-time to predict customer churn faster

Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.

Big data is everywhere



Variety

Big data is any type of data - structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.

Monitor 100’s of live video feeds from surveillance cameras to target points of interest

Exploit the 80% data growth in images, video and documents to improve customer satisfaction


Big data is everywhere



Veracity

1 in 3 business leaders don’t trust the information they use to make decisions. How can you act upon information if you don’t trust it?

Establishing trust in big data presents a huge challenge as the variety and number of sources grows.


What to do with this data?

 Cablecom - Swiss telecom operator

analysed the number of calls to customer services and spotted that it peaks around the 9th month into contract – customers make a decision to leave soon after that

-> offer special deals 7-8 months into the contract to keep customers

 Royal Shakespeare Company

profiling best customers -> targeted marketing -> 70% increased visits/sales


Google Flu Trends

 Analysed search terms when there was flu around


Traffic control


Why analyse data?

 Business: competitive pressure is strong:

provide business intelligence and/or customised services

 Science: data analysis may help

scientists in

classifying and segmenting data

hypothesis formation

 Healthcare: study findings both at

individual and population levels

provide better understanding of causal links

Variety of data types



Traditional data types

numbers, characters, strings, dates

structured data with ‘clear’ meaning



Multimedia data

text, graphics (drawings, illustrations), images, animations, audio, video, composed (mixed) multimedia

un- and semi-structured data, no ‘clear’ or correct meaning assigned


Example: insurance company



Data types in an insurance company

accident reports (text)

images of accidents (image)

reconstruction of accident (video, animation)

audio recording of the parties involved (audio)

medical reports (text)

supporting medical materials (images)

This module

 Previous database course focused on:

Database technologies: infrastructure for managing and

querying data.

Database design: techniques for working out what to store

and how.

Database programming: developing applications over

databases.

 This course unit focuses principally on making the most of data within an organisation:

Data integration: getting the data into a form that supports

and facilitates aggregation, exploration and mining.

Data analysis: techniques for making sense of data and

learning new lessons.

Module contents

Data integration

 Data warehousing

modelling, design, architectures, ETL process

 Managing and storing multi-media data

methods for capturing multi-media data

Data analysis

 On-line Analytical Processing (OLAP)

exploration of data through OLAP operations

 Data mining


Organisation

 Lectures and guest lecture [10]

introducing main concepts

 Tutorials [weekly, from week 2]

understanding and practical work (groups)

tools for data exploration (WEKA, Palo OLAP and Dundas OLAP)

 Lab tests [2]

some topics will have practical individual labs (e.g. data analysis using software products)


Reading list

 TM. Connolly, CE. Begg: Database systems: a Practical Approach

to Design, Implementation, and Management ISBN: 0130412120, Pearson Education Limited

 Elmasri, R., Navathe, S.: Fundamentals of Database Systems,

5th Edition, Benjamin/Cummings

 O. Maimon, L. Rokach (Ed): Data Mining and Knowledge Discovery

Handbook, Springer Verlag

(http://www.springerlink.com/content/978-0-387-09822-7)

 R. Nisbet, J. Elder, G. Miner: Handbook of Statistical Analysis and Data Mining Applications, Elsevier, ISBN: 978-0-12-374765-5 (e-book)

 Many online materials, including case-studies, WEKA tutorials etc.

all materials are on the Web and Blackboard (lectures, tutorials, labs)

Assessment

 Exam: 85%

2 hours, calculators allowed

3 out of 5 questions

 Lab tests: 15%

two assessed labs

 Pre-requisites

Good knowledge of SQL


Plan – lectures

 Week 1: Introduction to data integration & analysis

 Week 2: Data warehousing (DW)

 Week 3: Introduction to OLAP

 Week 4: Association rule mining

 Week 5: Analysis of textual data (text analytics)

 Week 7: Data classification

 Week 8: Data clustering

 Week 9: Multi-media data management

 Week 11: Enterprise Resource Planning (ERP)

 Week 12: Guest lectures: Smart Analytics (IBM); Forensic analytics (PwC)


Plan – tutorials and lab tests

 Week 2: Tutorial 1: basic data exploration

 Week 3: Tutorial 2: data profiling, DW

 Week 4: Tutorial 3: OLAP

 Week 5: Lab test 1 (data profiling, DW, OLAP)

 Week 7: WEKA lab exercises (3, 4)

 Week 8: WEKA lab exercises (3, 4)

 Week 9: Tutorial 4: Data mining

 Week 10: Lab test 2 (Data mining)

 Week 11: Tutorial 5: Multimedia data management

Summary



This course unit aims to introduce:

 data warehousing: architectures and methods for

integrating and organising data in a way that supports further analyses

 data analytics: techniques for exploring and making

sense from the data



Course web resources

Blackboard – COMP33111


Making sense of data

- Overview of main concepts

- Overview of main topics

- Introduction to data mining


From data to decisions

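[Figure: from data to decisions, in four levels]

Data: isolated values, e.g. Mary Jones; April 1, 1999; dog; Florida

Information: customer records (A. Berger, M. Jones, T. Martin, J. Smith) with attributes such as income, education, MoU and quantity, figures such as 50,000, 46,800, 29,200 and 75,500, and products bought (shoes, scarves, jewelry, groceries)

Knowledge: A. Berger is most likely to buy a new product; T. Martin is a profitable customer but is likely to switch carriers

Decisions and actions: offer A. Berger a promotion for a new product; launch a new campaign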

Data/information management

 Store useful data and information

 day-to-day operational data (e.g. transactions)

 external data (e.g. market data)

 human resources data

 Provide effective information storage and access to support data analysis and integration

distributed/federated databases:

leave the data where it is

data warehouses


Types of data (values)



Qualitative (descriptive)

categorical (nominal) ordinal 

Quantitative (numerical)

interval ratio



See Lab exercise 1 – Data Refresher Slides


The kinds of data we have



Traditional “transactional” information, i.e.

operational data that documents everyday life in

an enterprise/organisation

retail (e.g. sales in supermarket stores)

financial services (e.g. ATM withdrawals)

transport (e.g. flight bookings)

telecommunications (e.g. mobile billing, Internet)

healthcare (e.g. drug prescriptions)



Recording and processing this type of data is

known as “online transaction processing”

(OLTP)

Online transaction processing



OLTP: processing and recording transactions

that create new data and/or update existing

information in operational DBs:



insertions, updates, deletions



Typically a small number of rows are affected in

each transaction.



Traditional DBMS optimised to perform well in

OLTP, but not in comprehensive exploration,

aggregation and decision making.
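The contrast between an OLTP transaction and an analytical query can be sketched with Python's built-in sqlite3 module. This is only an illustration: the account table, its columns and the values are hypothetical and do not come from the lecture.

```python
import sqlite3

# Hypothetical operational table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO account (id, owner, balance) VALUES (?, ?, ?)",
    [(1, "A. Berger", 500.0), (2, "M. Jones", 300.0), (3, "T. Martin", 900.0)],
)
conn.commit()

# OLTP-style transaction: an ATM withdrawal updates a single row.
with conn:  # commits on success, rolls back on error
    cur = conn.execute("UPDATE account SET balance = balance - 50 WHERE id = ?", (2,))
rows_touched = cur.rowcount

# Analysis-style query: an aggregate that has to read every row.
total_balance = conn.execute("SELECT SUM(balance) FROM account").fetchone()[0]

print(rows_touched, total_balance)
```

The update touches one row, while the aggregate scans the whole table; on realistic data volumes that difference is what motivates a separate store for analysis.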


Why analyse data?

There is often information “hidden” in the data that is not readily evident

Human analysts may take weeks to discover useful information

Much of the data is never analysed at all

[Figure: total new disk (TB, axis from 0 to 4,000,000) and number of analysts, 1995 to 1999; the widening difference between the two is labelled the "data gap"]


Need for data analysis



Modern business/science environment

 markets evolve faster than ever

 competition is more intense than ever

 quantity of information is increasing



In order to succeed, an organisation must

 have a comprehensive view of all of its aspects -> data integration

 make informed and reliable decisions -> data analysis

 take timely actions and make accurate predictions

“Business intelligence”



Business intelligence

an ongoing process of monitoring the competitive environment in order to identify opportunities to act on, and/or threats to business to be avoided.



It is the analysis of available business data (internal and external)



It is NOT about spying, sleuthing, espionage

it is estimated that 80% of business intelligence of


Business intelligence examples



Customer data and patterns

 What are the characteristics of our customers?

 What are their buying patterns?

 Which customers are likely to move away?

 Who are the most loyal customers?


Business intelligence examples



Sales analysis and identification of trends

Which products sell the most in specific time periods?

Which products sell best in combination?

Which products are sold in the most profitable transactions?

How many visas were issued country-by-country for the three busiest months in the last 12 months?


Business intelligence examples



Business targets and promotion effectiveness

Who are the customers most likely to respond to

an advertising campaign by post?

On which day of the week should a new advertising campaign be launched?

How are promotional campaigns linked with other leading brands over time?

How has a certain campaign affected the sales in a region?

How were visa applications affected by implementing a new on-line access system?



etc.


Data warehouse (DW)



DW: an integrated database designed to support

data analysis, business intelligence, and better

and faster decision making



DWs integrate and aggregate data from various

operational and external DBs maintained by

different units



DW needs to provide

more complex aggregation and analysis of data

mining “new” data (e.g. spending trends)


General DW role

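[Figure: operational DBs and external DBs feed the data warehouse through ETL (Extract, Transform, Load); OLAP, DSS and data mining applications run on top of the data warehouse]

[Figure: detailed transactional data from the London, Sale and Manchester branches, together with external census data and GIS data, is integrated, cleaned and summarised into the data warehouse; data analysis then uses direct query, reporting tools, mining tools and OLAP]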
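The Extract, Transform, Load flow can be sketched in a few lines of Python with the built-in sqlite3 module. Everything specific here is an assumption made for illustration: the branch rows, the date formats, the cleaning rules and the sales table are hypothetical, not the lecture's own design.

```python
import sqlite3

# Extract: hypothetical rows as they might arrive from two branch systems.
london_sales = [("2012-01-05", "Milk", "2.50"), ("2012-01-05", "tea ", "3.00")]
manchester_sales = [("05/01/2012", "MILK", "2.40"), ("06/01/2012", "Cookies", None)]

def transform(rows, branch, day_first):
    """Clean one branch's rows into a common (branch, date, product, amount) form."""
    cleaned = []
    for date, product, amount in rows:
        if amount is None:          # clean: drop rows with a missing amount
            continue
        if day_first:               # integrate: convert DD/MM/YYYY to YYYY-MM-DD
            day, month, year = date.split("/")
            date = f"{year}-{month}-{day}"
        cleaned.append((branch, date, product.strip().lower(), float(amount)))
    return cleaned

# Load: insert the cleaned rows into one warehouse table.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (branch TEXT, sale_date TEXT, product TEXT, amount REAL)")
dw.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    transform(london_sales, "London", day_first=False)
    + transform(manchester_sales, "Manchester", day_first=True),
)

# Summarise: total sales per product across branches.
summary = dw.execute(
    "SELECT product, ROUND(SUM(amount), 2) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(summary)
```

Only after the product names and dates are brought to a common form can the milk sales of both branches be added together, which is the point of the transform step.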


Data warehouse applications



Online analytical processing (OLAP)

complex analysis of data from DW

e.g. trend analysis, time series, etc.



Decision support systems (DSS)

high level data processing for management

executive information systems (EIS)



Data mining (DM)

support for “knowledge discovery”

search for unanticipated knowledge


OLAP



Term coined in the early 1990s



Main goal: support ad-hoc but complex

querying performed by business analysts



OLAP = interactive process of creating,

managing, analysing and reporting on data



Extends spreadsheet-like analysis to work with

huge amounts of data in a data warehouse

OLAP



Data exploration & aggregation in various ways



Typical applications include assessing the

effectiveness of a marketing campaign, product

sales forecasting (predictive analysis) and

capacity planning



Also, spot trends, pinpoint problems, perform

“what-if” modelling



Allows a sophisticated user to analyse data

using complex, multi-dimensional views


Typical OLAP queries



Write a multi-table join to compare sales for

each product line year-to-date (YTD) this year

vs. last year.



Repeat the above process to find the top 5

product contributors to margin.



Repeat the above process to find the sales of a

product line to new vs. existing customers.



Repeat the above process to find the customers

that have had negative sales growth.
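The first query above (year-to-date sales per product line, this year vs. last year) could look roughly as follows in SQL, run here through Python's sqlite3 module. The product and sale tables, their columns, the data and the chosen "today" date are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_line TEXT);
CREATE TABLE sale (product_id INTEGER, sale_date TEXT, amount REAL);
INSERT INTO product VALUES (1, 'Shoes'), (2, 'Scarves');
INSERT INTO sale VALUES
  (1, '2011-02-10', 100.0), (1, '2011-11-20', 999.0),
  (1, '2012-03-01', 150.0), (2, '2011-01-15', 80.0),
  (2, '2012-02-02', 60.0);
""")

# Assumed reporting date; YTD means 1 January up to this day in each year.
params = {
    "start_this": "2012-01-01", "end_this": "2012-06-30",
    "start_last": "2011-01-01", "end_last": "2011-06-30",
}

ytd_query = """
SELECT p.product_line,
       SUM(CASE WHEN s.sale_date BETWEEN :start_this AND :end_this
                THEN s.amount ELSE 0 END) AS ytd_this_year,
       SUM(CASE WHEN s.sale_date BETWEEN :start_last AND :end_last
                THEN s.amount ELSE 0 END) AS ytd_last_year
FROM sale AS s
JOIN product AS p ON p.product_id = s.product_id
GROUP BY p.product_line
ORDER BY p.product_line
"""
rows = conn.execute(ytd_query, params).fetchall()
for line, this_year, last_year in rows:
    print(line, this_year, last_year)
```

Each of the follow-up questions needs another hand-written query of this kind, which is the repetition that OLAP tools are meant to remove.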


What is data mining?



Many definitions

non-trivial extraction of implicit, previously unknown and potentially useful information/knowledge from data

exploration & analysis, by automatic or semi-automatic means, of

large quantities of data in order to discover new knowledge and meaningful patterns

What is (not) data mining?

What is data mining:

– Discover that certain names are more prevalent in certain locations

– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)

– Identify patterns in buying a PC and related equipment

– Identify patterns in accessing a web site

What is not data mining:

– Look up phone number

in a phone directory

– Query a Web search

engine for information about “Amazon”

– Find the average value of market basket during a given period

– Rank web-pages based


Data mining – definition

 Process of extracting valid, previously unknown,

comprehensible and actionable information from

extremely large databases, in order to make crucial

decisions or predictions

also known as “knowledge discovery in databases” (KDD)

 Process of identifying valid, non-trivial, novel,

potentially useful patterns in data

use such patterns to predict or classify specific events, or act accordingly


Data mining process

1. Identification of possible data mining applications and problems

2. Analysis of data and identification of possible solutions

3. Selection and implementation of data mining techniques

4. Monitoring the effectiveness of the proposed solutions

5. Using results for decision making, prediction, profiling, etc.

[Figure: pyramid of layers with increasing potential to support business decisions, from bottom to top: data sources (operational data); data warehouses; data exploration (OLAP, statistical analysis, reporting); data mining (knowledge discovery); data presentation (visualization techniques); making decisions. Roles shown alongside, from top to bottom: end user, business analyst, data analyst, DBA]


Data mining aims/approaches



Descriptive data mining

acquire specific/general properties of the data

find patterns in data

use them for new business models



Predictive data mining

learn attributes from data that can be used to predict behaviour/activities in future (based on past and current data)

use known data to train/learn the model


Predictive data mining: examples



Who is likely to buy if we offer a discount?



Learn profiles of customers that buy new

collections, whatever the cost.



Would a new store improve sales?



Loan payment prediction



Predict toxicity of a new drug



Predict flooding or natural disasters

Prediction outcomes

 True Positives (TP) = an entity has been predicted to

have a certain property, and it does have it (in reality)

e.g. a predicted toxic drug is toxic

 False Positives (FP) = an entity has been predicted to

have a certain property, and it does not have it (in reality)

e.g. a predicted toxic drug is not toxic

 True Negatives (TN) = an entity has not been predicted

to have a certain property, and it does not have it

e.g. a customer has not been predicted to buy a new CD, and (s)he does not buy it

 False Negatives (FN) = an entity has not been predicted

to have a certain property, but it does have it

e.g. a customer has not been predicted to buy a new CD, but (s)he buys it
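The four outcomes can be counted by comparing each prediction with what actually happened. The sketch below uses six made-up customers; the lists are hypothetical and only illustrate the definitions.

```python
# Hypothetical predictions and true outcomes for six customers:
# True means "buys the new CD", False means "does not buy".
predicted = [True, True, False, False, True, False]
actual = [True, False, False, True, True, False]

counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
for p, a in zip(predicted, actual):
    if p and a:
        counts["TP"] += 1      # predicted to buy, and buys
    elif p and not a:
        counts["FP"] += 1      # predicted to buy, but does not
    elif not p and not a:
        counts["TN"] += 1      # not predicted to buy, and does not
    else:
        counts["FN"] += 1      # not predicted to buy, but buys
print(counts)
```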


Prediction outcomes



Prediction of false positives

e.g. if a data mining system (e.g. prediction of terror suspects) has only a 2% false-positive rate, then analysing 100 million passengers would generate about 2 million false positives!

huge impact on security, efficiency, privacy

false positives with manual analysis too



Still, data mining has the potential to reduce the

high rate of false positives

combining multiple models can reduce false positives
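The arithmetic behind the passenger example can be checked directly. The 2% and 100 million figures come from the example above; the number of real suspects is a hypothetical value added here only to show how small the share of true hits becomes.

```python
passengers = 100_000_000
false_positive_percent = 2     # from the example: 2% of screened people wrongly flagged
actual_suspects = 1_000        # hypothetical figure, not from the lecture

innocent = passengers - actual_suspects
false_positives = innocent * false_positive_percent // 100

# Even if every real suspect were flagged, most flagged people are innocent.
flagged = false_positives + actual_suspects
share_innocent = false_positives / flagged
print(false_positives, round(share_innocent, 4))
```

Under these assumptions roughly two million innocent passengers are flagged, and well over 99% of all flagged passengers are false positives.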


Descriptive data mining

Main types



identification



categorisation



optimisation

Data mining – identification



Identify existence of a new activity

e.g. a new buying pattern

new pattern in using on-line services

identifying the best products for different customers

identify factors that attract new customers


Data mining – categorisation



Learn attributes that can partition the data (e.g.

customers) into “meaningful” groups

find “model” customers who share the same characteristics: interest, income level, spending habits, etc.

e.g. shoppers: regular, ‘posh’, ‘rush’, discount seeking, etc.

online users: mail-only, occasional surfers, addicts…



Profiling customers, citizens, students, genes


Data mining – optimisation



Optimise usage of resources

e.g. internal computer networking

workload at customer services

maximise sales under given constraints

suggests adjustments on the pricing and variety of goods in different stores/regions etc.

optimise distribution of stores/base stations in a region/country

monitor market directions

When is data mining useful?



Can we extract/mine useful patterns from data?



Patterns are interesting if they are

easily understood by humans

valid on new data (with some degree of certainty)

potentially useful, novel

validate some hypothesis that a user seeks to confirm

can be used to improve understanding of business or scientific process


Overview of data

mining techniques


Data mining techniques

Some data mining techniques



Relevance analysis



Time series analysis



Sequential patterns



Association Rules



Classification



Regression



Clustering



Anomaly detection

Relevance analysis



Identification of relevant attributes and the

degree of their relevance

Example

payment-income ratio is a relevant factor for loan approval, while education level and debt ratio are not



Identify these relevant attributes automatically

from data


Time series analysis



Examine value/behaviour of an attribute over

time at evenly spaced time points (daily, weekly,

yearly etc)

e.g. daily stock price for each company

sales in summer/winter



Observe behaviour of several groups of values



Time series can be visualised by plotting the

values over the time points


Sequential patterns



Based on a time sequence of actions



Relationship is based on time

e.g. a customer buys a PC, the following month he/she typically buys a printer; in 12 months he/she purchases a printer cartridge

use such information to offer a deal or to define a new strategy

Sequential patterns



Analyse a company web log data to determine

how users access the company web site



If, for example, 70% of users of page A follow

one of the following patterns of behaviour

(in terms of links):

(A, B, C)

or

(A, D, B, C)

or

(A, E, B, C)

then add a link directly from page A to page C
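The web log check described above can be sketched as follows. The sessions are invented, and two simplifications are assumed: a "user of page A" is taken to be a session that starts at A, and a session matches a pattern only if it follows it exactly.

```python
# Hypothetical click paths, one list of pages per user session.
sessions = [
    ["A", "B", "C"],
    ["A", "D", "B", "C"],
    ["A", "E", "B", "C"],
    ["A", "B", "C"],
    ["A", "F"],
    ["B", "C"],
    ["A", "D", "B", "C"],
    ["A", "G", "H"],
    ["A", "E", "B", "C"],
    ["A", "B", "C"],
    ["A", "B", "C"],
]
patterns = [("A", "B", "C"), ("A", "D", "B", "C"), ("A", "E", "B", "C")]
threshold = 0.70  # the 70% figure used in the example

# Keep sessions that start at page A, then count those following a listed pattern.
from_a = [s for s in sessions if s and s[0] == "A"]
matching = [s for s in from_a if tuple(s) in patterns]
share = len(matching) / len(from_a)
add_direct_link = share >= threshold
print(f"{share:.0%} of sessions starting at A follow a listed pattern")
```

Here 8 of the 10 sessions that start at A end up at C through B, so the rule would suggest adding the direct link from A to C.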


Association rules



Establish an “association” link

common example: market-basket analysis

market basket: items bought during one visit

e.g. if a customer buys milk and tea, they also buy cookies

if a female shopper buys a handbag, she is very likely to buy matching shoes



See the lecture on “Association rule mining”
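As a small preview, the strength of a rule such as "milk and tea, then cookies" is commonly measured by its support and confidence. These two measures are standard in market-basket analysis but are not defined in this lecture, and the baskets below are hypothetical.

```python
# Hypothetical market baskets: the set of items bought during one visit.
baskets = [
    {"milk", "tea", "cookies"},
    {"milk", "tea", "cookies", "bread"},
    {"milk", "tea"},
    {"milk", "bread"},
    {"tea", "cookies"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

antecedent = {"milk", "tea"}
consequent = {"cookies"}

# Support: how often the whole rule occurs; confidence: how often the
# consequent appears among baskets that contain the antecedent.
rule_support = support(antecedent | consequent)
rule_confidence = rule_support / support(antecedent)
print(rule_support, rule_confidence)
```

In this toy data the rule holds in 2 of 5 baskets overall, and in 2 of the 3 baskets that contain both milk and tea.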


Classification



Mapping data into predefined classes

classes are known (business determined)



Learn how to classify a new (unseen) case

e.g. application for a loan: 5 ranges of credit-card

worthiness

visa application: binary classification (2 classes)



Entity (data-set) can be classified into several

classes corresponding to various “dimensions”



See the lecture on “Classification”
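One very simple way to classify a new case from known examples is to copy the class of the most similar training example (1-nearest neighbour). It is used here only as an illustration and is not necessarily the technique covered in the lecture; the loan data and the two features are hypothetical.

```python
# Hypothetical training set:
# (annual income in thousands, existing debt in thousands) -> known class.
training = [
    ((60, 5), "approve"),
    ((75, 10), "approve"),
    ((20, 15), "reject"),
    ((30, 25), "reject"),
]

def classify(case):
    """Return the class of the closest training example (1-nearest neighbour)."""
    def squared_distance(example):
        features, _label = example
        return sum((a - b) ** 2 for a, b in zip(features, case))
    _features, label = min(training, key=squared_distance)
    return label

# Two new (unseen) loan applications.
print(classify((65, 8)), classify((25, 20)))
```

The classes ("approve", "reject") are fixed in advance, which is what distinguishes classification from clustering.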

Regression



Classification is about categorical values



Regression – for numeric prediction

e.g. predict house prices

e.g. predict one’s retirement savings based on its current value and several past values

e.g. predict sales amounts of a new product based on advertising expenditure

e.g. predict wind velocities as a function of temperature, humidity, air pressure, etc.
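The advertising example can be sketched with a straight-line least-squares fit, one common form of regression. The spend and sales figures are made up, and a single-variable linear model is an assumption chosen to keep the example short.

```python
# Hypothetical past campaigns: advertising spend and resulting sales (both in thousands).
spend = [10.0, 20.0, 30.0, 40.0]
sales = [25.0, 45.0, 65.0, 85.0]

n = len(spend)
mean_x = sum(spend) / n
mean_y = sum(sales) / n

# Ordinary least squares for a straight line: sales = intercept + slope * spend.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(spend, sales)) / sum(
    (x - mean_x) ** 2 for x in spend
)
intercept = mean_y - slope * mean_x

predicted_sales = intercept + slope * 50.0  # numeric prediction for a new spend level
print(slope, intercept, predicted_sales)
```

Unlike classification, the output is a number (predicted sales) rather than a class label.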


Clustering



Partition data into groups that might or might not be disjoint

soft and hard clustering



Groups (i.e. clusters) are not known in advance



Discover new trends or outliers

e.g. fraud detection


Clustering

 Most similar data are grouped into clusters

similarity measure between the data is needed to identify related entities

example: when are two transactions “similar”?

aggregation and generalisation can be used

e.g. type of product instead of product_id

milk includes all types and quantities

depends on the applications

difference between semi-skimmed and full milk may be essential for a healthy-lifestyle company

 See the lecture on “Clustering”
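The effect of generalisation on similarity can be shown with a small sketch. Jaccard similarity is assumed here as the measure (the lecture does not name one), and the product codes and their mapping to product types are hypothetical.

```python
# Hypothetical mapping from product_id to a more general product type.
product_type = {
    "P101": "milk",     # semi-skimmed milk, 1 litre
    "P102": "milk",     # full milk, 2 litres
    "P205": "tea",
    "P310": "cookies",
}

def jaccard(a, b):
    """Share of distinct items that two transactions have in common (0 to 1)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

t1 = ["P101", "P205"]
t2 = ["P102", "P205"]

# Compare the two transactions by raw product_id, then by product type.
by_id = jaccard(t1, t2)
by_type = jaccard([product_type[p] for p in t1], [product_type[p] for p in t2])
print(by_id, by_type)
```

By product_id the two transactions share only one of three items; by product type they are identical. Which view is right depends on the application, as noted above.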


Anomaly detection



Detect significant deviations from “normal” behaviour



Examples

credit card fraud detection

network intrusion detection
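A minimal sketch of deviation from "normal" behaviour flags values that lie far from the mean, measured in standard deviations. The transaction amounts and the cut-off of 2 standard deviations are assumptions for illustration; real fraud detection uses much richer models.

```python
from statistics import mean, stdev

# Hypothetical card transaction amounts for one customer.
amounts = [20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 23.0, 500.0]
threshold = 2.0  # assumed cut-off, in standard deviations

mu = mean(amounts)
sigma = stdev(amounts)

# Flag amounts whose distance from the mean exceeds the cut-off.
anomalies = [x for x in amounts if abs(x - mu) / sigma > threshold]
print(anomalies)
```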


Data mining



Important note: data mining techniques we

discuss build models automatically from data

resulting classification or association rules are NOT designed by a business/data analyst

rather, they have been ‘learnt’ from data automatically



So,

classification = automatic classification

clustering = automatic clustering


Data Mining

[Figure: data mining at the centre, drawing on database systems, statistics, machine learning, algorithms, visualization and other disciplines]

Machine learning and DM

 ML = automatic acquisition of a model from data

capturing/generalising behaviour of data from examples

 Supervised learning

predefined what to learn (e.g. classes)

use a training set as a source to learn from

 Unsupervised

 no training sets, no targets: learn from data (e.g. clustering)

 Data mining uses many ML techniques

but association rule mining, for example, is not ML

handling large data sets (scalability)


Data mining and privacy



A lot of controversy about data mining ethics

companies routinely collect information about

customers and use it for marketing, etc.



Can information collected for one purpose be used for mining data for another purpose?

in Europe, generally no, without explicit consent

in the US, generally yes



Powerful profiling techniques

www.kdnuggets.com/gpspubs/ieee-expert-9504-priv.html


Data mining and privacy



Can discrimination be based on features like

sex, age, national origin?

in some areas (e.g. mortgages, employment) certain features cannot be used for decision making



In other areas, these features are needed to

assess the risk factors

e.g. people of African descent are more susceptible to sickle cell anemia


Data mining on new data types



Currently, most data mining is on flat tables

mining traditional data types (categorical data)



Richer but unstructured data sources available

text, links, web, images, multimedia



Mining on a meta-level

e.g. knowledge bases


Data integration & analysis



Competitive advantage

increased productivity of decision making

allow decision-makers to access data that reveal unknown, unavailable and untapped information (e.g. customer profiles, trends and demands)



Potential high returns on investment

average return on business intelligence investments (hardware, data, tools, human resources) is over 400% over a period of 2-3 years

(International Data Corporation data)

market growth: 1995: $2 billion; 2004: $10-15 billion


Summary



Making sense from large datasets



DW: an integrated database for supporting

higher-level analysis of data



OLAP: ad-hoc exploration



Data mining: systematic exploration



Mining multi-media data



Benefits

supports competitive business/scientific intelligence

Reading for this lecture

 Chapters 34 in [Connolly & Begg]

 Chapter 27 (27.1) in [Elmasri & Navathe]

 The Data Deluge Report, The Economist, Feb 2010

 On-line materials: Blackboard and

http://www.cs.manchester.ac.uk/ugt/COMP33111

Additional reading (on the web)

 Two IBM case studies:

 crime prediction


For next week:



Complete Tutorial 1: Basic data exploration

types of data

median, variance, etc.

correlation

distance measures

bring the solutions to the lab session

