Data Quality Through Curation at Big Data Scale
Andy Palmer, CEO & Co-Founder, Tamr, Inc.
Top 3 Issues for Data Quality in Enterprise
1. Always start with questions and context
2. Always start with questions and context
3. Always start with questions and context
[Figure: Data (Internal/External, Structured/Unstructured); “1,000 Questions” (Descriptive, Predictive, Prescriptive); Agile/Rapid, Dynamic, Reproducible, Scalable]
Three “Flavors” of Analytical Context
Descriptive
Retailers use data variety to create a more complete view of their customers’ buying decisions to improve online offers and increase sales
Predictive
Banks use multiple sources of background data to improve lending decisions
Prescriptive
Healthcare providers use automated data analysis to predict patient issues and intervene with remediation plans
Big Data Opportunity in the Enterprise
• On average, large enterprises have 2,000+ data sources
• Of those, on average, <10% of data sources are getting into “master data,” data warehouses, data marts, and “ETL”
• Many organizations purchase the same data from the same vendors more than once - sometimes even 3-4 times!
• Need to optimize analytic return on data investment - both purchased and produced
• An individual analytic may require only 2-3 sources, but they need to be the correct sources - not just the “familiar” sources
The Data Source Problem
• The centralization of systems (SAP, Oracle, etc.) has failed to achieve a unified view of enterprise data
• Silos are a reality of every business
• Internally- vs. Externally-generated data
• Centralized vs. Distributed organizational structures and pendulum of change over decades
• Use of systems to reinforce organizational models
• Natural to end up with radical heterogeneity
• What is a source?
• Database? Instance? Table? Spreadsheet? JSON object?
• Each source contains attributes, records and content
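One way to make the "what is a source?" question concrete is to treat anything that exposes attributes, records and content as a source, whatever its physical form. The sketch below is illustrative only: the `Source` class, its `kind` field and the `profile` method are hypothetical names chosen for this example, not something described in the talk.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Source:
    """Hypothetical model of a source: a table, spreadsheet, JSON object, etc."""
    name: str
    kind: str  # e.g. "table", "spreadsheet", "json"
    records: list[dict[str, Any]] = field(default_factory=list)

    @property
    def attributes(self) -> list[str]:
        # Attributes are the union of keys seen across records, in first-seen order.
        seen: dict[str, None] = {}
        for record in self.records:
            for key in record:
                seen.setdefault(key, None)
        return list(seen)

    def profile(self) -> dict[str, float]:
        # Fill rate per attribute: share of records with a non-empty value.
        total = len(self.records) or 1
        return {
            attr: sum(1 for r in self.records if r.get(attr) not in (None, "")) / total
            for attr in self.attributes
        }


crm = Source("crm_export", "spreadsheet", [
    {"email": "jsmith@tamr.com", "name": "J. Smith"},
    {"email": "", "name": "A. Jones", "phone": "555-0100"},
])
print(crm.attributes)
print(crm.profile())
```

Treating every source the same way, regardless of where it lives, is what makes it possible to profile and compare thousands of them with one set of tools.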
Evolution of Enterprise Data Quality and Curation
[Figure: grid of Internal/External by Structured/Unstructured data]
• Extreme top-down and deterministic approach to improving context and analytical use
• Use of more unstructured and semi-structured sources in enterprise
• Application of probabilistic and collaborative approaches to curate and leverage structured and unstructured enterprise data sources
• Info service providers embrace practices of modern public internet search companies
• Extreme bottom-up probabilistic approach to improving context and use of content
Bottom-up Probabilistic and Collaborative Curation: Only Viable Way to Tame Enterprise Data Variety
Back to the Future
1990’s web:
• Probabilistic search and website connection - unstructured public data
2010’s enterprise:
• Bottom-up probabilistic data source connection and curation of structured and semi-structured data
Data Quality and Curation in the Enterprise
OLD vs. NEW
1. OLD: Master data management, ETL, data warehouses, and marts
   NEW: Probabilistic, authoritative Data Curation
2. OLD: A small number (tens) of data architects and curators inside IT
   NEW: Hundreds/thousands of stakeholders from across enterprises engaged in collaborative data curation
3. OLD: Highly engineered “Deterministic” ETL into “warehouses”
   NEW: Bottom-up probabilistic mapping and matching complemented with input from many data experts
4. OLD: Pre-defined, static reporting
   NEW: Agile analytics with radical contextual diversity and rich visualizations
5. OLD: “Need to know” information access controls
   NEW: Privacy through control of information use
6. OLD: Bimodal - either highly scalable and reproducible or ad-hoc
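As a rough illustration of "bottom-up probabilistic mapping and matching complemented with input from many data experts", the sketch below scores candidate attribute mappings between two sources and routes only the uncertain ones to people. Everything here is an assumption made for the example: the scoring formula (name similarity plus value overlap), the thresholds, and the function names are hypothetical and are not presented as Tamr's method.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    # Character-level similarity of two attribute names, in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(a_values, b_values) -> float:
    # Jaccard overlap of the distinct values held by two attributes.
    a, b = set(a_values), set(b_values)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def best_mappings(source_a: dict, source_b: dict, name_weight: float = 0.5) -> dict:
    """For each attribute in source_a, return (best attribute in source_b, score)."""
    best = {}
    for attr_a, vals_a in source_a.items():
        scored = [
            (attr_b,
             name_weight * name_similarity(attr_a, attr_b)
             + (1 - name_weight) * value_overlap(vals_a, vals_b))
            for attr_b, vals_b in source_b.items()
        ]
        best[attr_a] = max(scored, key=lambda pair: pair[1])
    return best


def triage(score: float, accept: float = 0.6, review: float = 0.2) -> str:
    # High scores map automatically; the uncertain middle goes to data experts.
    if score >= accept:
        return "auto-map"
    return "ask an expert" if score >= review else "no match"


# Toy sources: attribute name -> sample of values (made-up data).
sales = {"cust_email": ["a@x.com", "b@x.com"], "zip": ["02138", "10001"]}
crm = {"email": ["a@x.com", "b@x.com", "c@x.com"], "postal_code": ["02138", "94105"]}

for attr, (target, score) in best_mappings(sales, crm).items():
    print(f"{attr} -> {target}: {score:.2f} ({triage(score)})")
```

The contrast with the OLD column is that no mapping rule is written by hand: the machine proposes, and expert attention is spent only where the score is ambiguous.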
New Approach Required to Complement Traditional MDM, ETL, Data Profiling and Data Quality
[Figure: Enterprise Sources & Variety (Few to Many) vs. Enterprise Analysts/Users (Few to Many), positioning Individual Curation Tools, Legacy Data Integrators, and Next Generation Probabilistic and Collaborative Data Curation]
Big Data Contextual Example: How well am I doing retaining my best customers?
Sources: Email DB, CRM, Retail Sales DB, Mobile App DB, Online Sales DB, Google Analytics, Credit Card DB, Twitter/Facebook Feeds
• The High Spender segment’s online sales are down due to a drop in conversion rate. Conversion rates increase when customers are presented with a coupon, over-indexing in retail stores.
• Customer engagement in my app and social feeds is increasing; key influencers contribute only 1% of sales, but their followers contribute 20%.
• Customers acquired through Facebook channels become the best customers ($50 ALTD, 50% spend using gift cards).
Making Context Enterprise-wide is a Challenge
• No single definition of data - each system has its own reference
• No way to tap into humans to collaborate on context
• Custom bridges across data sources don’t scale (more complexity)
• Constantly changing applications with data variety
• Data variance across sources (jsmith@apple.com ≠ jsmith@tamr.com)
• No metadata catalog
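The jsmith@apple.com ≠ jsmith@tamr.com case shows why exact equality on a single field fails across sources. A minimal sketch of the alternative, assuming a simple weighted combination of evidence from several fields, is shown below. The weights, field names and sample records are hypothetical, chosen only to illustrate the idea.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Character-level similarity in [0, 1], case-insensitive.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted evidence that two records describe the same person."""
    local_a = rec_a["email"].split("@")[0]
    local_b = rec_b["email"].split("@")[0]
    evidence = [
        # (weight, score) pairs; the weights are illustrative assumptions.
        (0.2, 1.0 if rec_a["email"].lower() == rec_b["email"].lower() else 0.0),
        (0.3, similarity(local_a, local_b)),
        (0.5, similarity(rec_a["name"], rec_b["name"])),
    ]
    return sum(weight * score for weight, score in evidence)


a = {"name": "John Smith", "email": "jsmith@apple.com"}
b = {"name": "John Smith", "email": "jsmith@tamr.com"}
c = {"name": "Mary Jones", "email": "mjones@tamr.com"}

print(a["email"] == b["email"])      # exact comparison says "different"
print(round(match_score(a, b), 2))   # combined evidence is high
print(round(match_score(a, c), 2))   # unrelated person scores low
```

The output is a probability-like score rather than a yes/no answer, so borderline pairs can be surfaced for human review instead of being silently split or merged.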
Building Rich Context Requires a New Approach
• Identify relationships across your sources using a machine learning “bottom up” approach
• Continuous active learning combining machine/human insight
• Cost effective as you unify more sources - the marginal cost of adding a new source is at worst linear.
• Deploy context-rich sources for the different LOBs across the enterprise
• Enterprise metadata catalog - all your attributes, all your sources
• Services (e.g. APIs) can also be directly deployed in data warehouses/lakes and operational workflows
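The "continuous active learning combining machine/human insight" bullet can be sketched as a loop in which the machine asks a person only about the candidate matches it is least sure of, then refits. The example below is a deliberately tiny stand-in: the "model" is a single match threshold over pair scores, and `active_learning`, `ask_expert` and the sample data are hypothetical names and values, not an actual product API.

```python
def active_learning(scores, ask_expert, rounds=3, threshold=0.5):
    """Uncertainty sampling over pair scores with a learned match threshold.

    scores: dict mapping a candidate pair id to a machine similarity score.
    ask_expert: callable returning True (match) or False for a pair id;
                it stands in for a human curator.
    """
    labels = {}
    for _ in range(rounds):
        unlabeled = [p for p in scores if p not in labels]
        if not unlabeled:
            break
        # Ask about the pair the machine is least sure of.
        pair = min(unlabeled, key=lambda p: abs(scores[p] - threshold))
        labels[pair] = ask_expert(pair)
        # Refit: place the threshold between labeled matches and non-matches.
        matches = [scores[p] for p, is_match in labels.items() if is_match]
        non_matches = [scores[p] for p, is_match in labels.items() if not is_match]
        if matches and non_matches:
            threshold = (min(matches) + max(non_matches)) / 2
        elif matches:
            threshold = min(threshold, min(matches))
        else:
            threshold = max(threshold, max(non_matches))
    return threshold, labels


# Made-up machine scores and the answers a curator would give.
scores = {"p1": 0.95, "p2": 0.62, "p3": 0.55, "p4": 0.41, "p5": 0.10}
truth = {"p1": True, "p2": True, "p3": False, "p4": False, "p5": False}

threshold, labels = active_learning(scores, ask_expert=truth.get)
predicted = sorted(p for p, s in scores.items() if s >= threshold)
print(threshold, sorted(labels), predicted)
```

In this toy run three expert answers are enough to separate all five pairs correctly, which is the economic argument behind the approach: human effort goes to the ambiguous cases, not to every record.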
As Sources are Unified, Data Quality Improves Continuously