Data Quality Through Curation at Big Data Scale Andy Palmer, CEO & Co-Founder, Tamr, Inc. MIT CDOIQ Symposium, July 23 & 24, 2014

(1)

Data Quality Through Curation at Big Data Scale

Andy Palmer, CEO & Co-Founder, Tamr, Inc.

(2)

Top 3 Issues for Data Quality in Enterprise

1. Always start with questions and context

2. Always start with questions and context

3. Always start with questions and context

Data: Internal, External, Structured, Unstructured

➢ Agile/Rapid

➢ Dynamic

➢ Reproducible

➢ Scalable

“1,000 Questions”: Descriptive, Predictive, Prescriptive

(3)

Three “Flavors” of Analytical Context

Descriptive

Retailers use data variety to create a more complete view of their customers’ buying decisions to improve online offers & increase sales

Predictive

Banks use multiple sources of background data to improve lending decisions

Prescriptive

Healthcare providers use automated data analysis to predict patient issues and intervene with remediation plans

(4)

Big Data Opportunity in the Enterprise

• On average, large enterprises have 2,000+ data sources

• Of those, on average, <10% of data sources are getting into “master data,” data warehouses, data marts, and “ETL”

• Many organizations purchase the same data from the same vendors more than once - in some cases even 3-4 times!

• Need to optimize analytic return on data investment - both purchased and produced

• An individual analytic may require only 2-3 sources or fewer, but these need to be the correct sources - not just the “familiar” sources

(5)

The Data Source Problem

• The centralization of systems (SAP, Oracle, etc.) has failed to achieve a unified view of enterprise data

• Silos are a reality of every business

• Internally- vs. Externally-generated data

• Centralized vs. Distributed organizational structures and pendulum of change over decades

• Use of systems to reinforce organizational models

• Natural to end up with radical heterogeneity

• What is a source?

• Database? Instance? Table? Spreadsheet? JSON object?

• Each source contains attributes, records and content

(6)

Evolution of Enterprise Data Quality and Curation

Chart axes: Internal vs. External; Structured vs. Unstructured

• Extreme top-down and deterministic approach to improving context and analytical use

• Use of more unstructured and semi-structured sources in enterprise

• Application of probabilistic and collaborative approaches to curate and leverage structured and unstructured enterprise data sources

• Info service providers embrace practices of modern public internet search companies

• Extreme bottom-up probabilistic approach to improving context and use of content


(8)

Bottom-up Probabilistic and Collaborative Curation: Only Viable Way to Tame Enterprise Data Variety

Back to the Future

1990s web:

• Probabilistic search and website connection - unstructured public data

2010s enterprise:

• Bottom-up probabilistic data source connection and curation of structured and semi-structured data

(9)

Data Quality and Curation in the Enterprise

OLD

1. Master data management, ETL, data warehouses, and marts

2. Small teams (10s) of data architects and curators inside of IT

3. Highly engineered “Deterministic” ETL into “warehouses”

4. Pre-defined, static reporting

5. “Need to know” information access controls

6. Bimodal - either highly scalable and reproducible or ad-hoc

NEW

1. Probabilistic, authoritative Data Curation

2. Hundreds/thousands of stakeholders from across enterprises engaged in collaborative data curation

3. Bottom-up probabilistic mapping and matching complemented with input from many data experts

4. Agile analytics with radical contextual diversity and rich visualizations

5. Privacy through control of information use

(10)

New Approach Required to Complement Traditional MDM, ETL, Data Profiling and Data Quality

Chart axes: Enterprise Sources & Variety (Few to Many); Enterprise Analysts/Users (Few to Many)

Chart regions: Individual Curation Tools; Legacy Data Integrators; Next Generation Probabilistic and Collaborative Data Curation

(11)

Big Data Example:

• Email DB

• CRM

• Retail Sales DB

• Mobile App DB

• Online Sales DB

• Google Analytics

• Credit Card DB

• Twitter/Facebook Feeds

(12)

Contextual Example:

How well am I doing retaining my best customers?

• The High Spender segment’s online sales are down due to a drop in conversion rate. Conversion rates increase when customers are presented with a coupon, over-indexing in retail stores.

• Customer engagement in my app and social feeds is increasing; key influencers contribute only 1% of sales, but their followers contribute 20%.

• Customers acquired through Facebook channels become the best customers ($50 ALTD, 50% spend using gift cards).

(13)

Making Context Enterprise-wide is a Challenge

• No single definition of data - each system has its own reference

• No way to tap into humans to collaborate on context

• Custom bridges across data sources don’t scale (more complexity)

• Constantly changing applications with data variety

• Data variance across sources ([email protected] ≠ [email protected])

• No metadata catalog
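The point that custom bridges across data sources don’t scale can be illustrated with a simple count. The sketch below is not from the talk: it assumes that "custom bridges" means one hand-built connection per pair of sources, and compares that with mapping each source once to a shared catalog. The function names are hypothetical.

```python
def point_to_point_bridges(n_sources: int) -> int:
    """Bridges needed if every pair of sources is connected directly (assumption)."""
    return n_sources * (n_sources - 1) // 2


def catalog_mappings(n_sources: int) -> int:
    """Mappings needed if each source is mapped once to a shared catalog (assumption)."""
    return n_sources


for n in (10, 100, 2000):
    print(n, point_to_point_bridges(n), catalog_mappings(n))
# 10 45 10
# 100 4950 100
# 2000 1999000 2000
```

Under these assumptions, an enterprise with 2,000 sources would need close to two million pairwise bridges, versus 2,000 mappings to a shared catalog.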

(14)

Building Rich Context Requires a New Approach

• Identify relationships across your sources using a machine learning “bottom up” approach

• Continuous active learning combining machine/human insight

• Cost effective as you unify more sources - marginal cost of a new source is at worst linear.

• Deploy context-rich sources for the different LOBs across the enterprise

• Enterprise metadata catalog - all your attributes, all your sources

• Services (e.g. APIs) can also be directly deployed in data warehouses/lakes and operational workflows
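The bottom-up, machine-plus-human idea in the bullets above can be sketched in a few lines. This is a minimal illustration, not the method described in the talk or any product's implementation: a simple value-overlap (Jaccard) score stands in for a machine learning model, the thresholds are arbitrary, and the source names, attribute names, and sample values are all hypothetical. High-scoring attribute pairs are accepted automatically, uncertain pairs are queued for a human expert, and the rest are dropped.

```python
from itertools import product


def normalize(value):
    """Lowercase and trim so trivially different spellings compare equal."""
    return str(value).strip().lower()


def jaccard(a, b):
    """Overlap of two value collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def score_attribute_pairs(source_a, source_b):
    """Score every attribute pair across two sources by value overlap."""
    scores = []
    for (name_a, vals_a), (name_b, vals_b) in product(source_a.items(), source_b.items()):
        score = jaccard(map(normalize, vals_a), map(normalize, vals_b))
        scores.append((name_a, name_b, score))
    return sorted(scores, key=lambda item: item[2], reverse=True)


def triage(scores, accept=0.8, reject=0.2):
    """Auto-accept confident matches; route uncertain ones to human review."""
    accepted, review = [], []
    for name_a, name_b, score in scores:
        if score >= accept:
            accepted.append((name_a, name_b, score))
        elif score > reject:
            review.append((name_a, name_b, score))
    return accepted, review


# Hypothetical sample data from two sources.
crm = {
    "email": ["Ann@Example.com", "bob@example.com", "cy@example.com"],
    "city": ["Boston", "Austin", "Denver"],
}
sales = {
    "contact_email": ["ann@example.com ", "bob@example.com", "cy@example.com"],
    "ship_city": ["Boston", "Austin", "Chicago", "Miami"],
}

accepted, review = triage(score_attribute_pairs(crm, sales))
print("accepted:", accepted)
print("needs expert review:", review)
```

In a fuller system, the experts' answers on the review queue would be fed back to retrain the scoring model, which is the "continuous active learning" loop the slide refers to; that feedback step is omitted here.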

(15)

As Sources are Unified, Data Quality Improves Continuously

(16)

Top 3 Issues for Data Quality in Enterprise

1. Always start with questions and context

2. Always start with questions and context

3. Always start with questions and context

Data: Internal, External, Structured, Unstructured

➢ Agile/Rapid

➢ Dynamic

➢ Reproducible

➢ Scalable

“1,000 Questions”: Descriptive, Predictive, Prescriptive