The Lab and The Factory
Architecting for Big Data Management
April Reeve
“A good speech should be like a
woman's skirt: long enough to cover
the subject and short enough to create
interest.”
April Reeve
• Twenty-five years doing data-oriented stuff
• Data Management disciplines – Data Integration, Data Governance, Data Modeling, Data Quality, Business Intelligence, Master Data Management, Data Conversion, Data Warehousing, Enterprise Content Management, Big Data Management
• Currently implementing Data Governance programs and developing Big Data
Strategies for Life Sciences and Financial Services organizations
• Certifications –
– Certified Data Management Professional (DAMA)
– Certified Data Governance and Stewardship Professional (DGSP)
– Certified Business Intelligence Professional (CBIP)
– Certified in Enterprise Governance of IT (ISACA)
– Certified Information Systems Auditor (ISACA)
• Master's degree in Financial Management (financial risk management, derivatives
Agenda
• Big Data
• The Data Scientist environment for predictive analytics – the Lab
• Operationalizing predictions – the Factory
• How does it fit with legacy data management
Analytics Maturity
• Volume: data volumes approaching multiple petabytes
• Velocity: data being generated and ingested for analysis in real-time
• Variety: tabular, documents, e-mail, metering, network, video, image, audio
• Complexity: different standards, domain rules, and storage formats per data type
More than just about data volume, smart big data
strategies also consider the velocity, variety, and
complexity of information
[Figure labels: Volume, Velocity, Variety, Complexity; example data types: Transactional Data, Documents, Smart Grid, Images; outcomes: New insights on customers, products, and operations; Contextual and location-aware delivery to any device. Source: Gartner, March 2011]
Big Data Goal:
More, Faster, Better Data for Purpose
Area        Revolution
Latency     “No time to read. In-memory is the new DB”
Enrichment  “Tagging is the new Transformation”
Query       “Federated Query is the new ETL”
Purpose     “Purposeful View is the new Master”
Analytics   “Predictive is the new Reactive”
Predictive Analytics
• The Data Scientist chooses Internal and External data (lots of it!) and throws it into an Analytical Sandbox
• The Data Scientist identifies patterns in the data and develops predictive models of behavior involving combining historical information concerning a
What is Data Science?
• Data Science refers to the scientific method:
• The scientist (Data Scientist) develops a hypothesis (model of behavior)
• Using a large amount of historical data and statistical
Leveraging Big Data for Action
• The organization develops software which populates models using historical customer information and installs it into the operational reporting environment
• Real time processing combines customer information
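As a rough illustration of the hand-off this slide describes, the sketch below separates a "lab" step, which fits a model from historical customer records, from a "factory" step, which scores one incoming customer event. Everything in it is an assumption made for illustration: the field names (`segment`, `churned`), the functions `fit_churn_rates` and `score_event`, and the choice of a simple per-segment frequency model are hypothetical and do not come from the deck.

```python
from collections import defaultdict


def fit_churn_rates(history):
    """Lab step (hypothetical): estimate a churn rate per customer segment."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for record in history:
        totals[record["segment"]] += 1
        if record["churned"]:
            churned[record["segment"]] += 1
    # The "model" is just a mapping from segment to observed churn frequency.
    return {segment: churned[segment] / totals[segment] for segment in totals}


def score_event(model, event, default=0.0):
    """Factory step (hypothetical): score one incoming event with the model."""
    return model.get(event["segment"], default)


# Illustrative historical data; the values are made up for this example.
history = [
    {"segment": "retail", "churned": True},
    {"segment": "retail", "churned": False},
    {"segment": "retail", "churned": False},
    {"segment": "retail", "churned": False},
    {"segment": "business", "churned": False},
    {"segment": "business", "churned": True},
]

model = fit_churn_rates(history)
retail_score = score_event(model, {"segment": "retail"})
```

In this sketch the fitted mapping is the only thing that crosses from the lab to the factory, so the scoring path stays small enough to harden and monitor separately from the exploratory code.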
Leveraging Big Data for Action –
Big Data Analytics Architecture
• In “Big Data” management we need:
• A “Lab” or “Sandbox” environment that is very dynamic and can be used by the Data Scientists to throw in or throw away massive amounts of structured and unstructured data against which to do analysis, find patterns and insights, and develop models
New Data Hubs –
The Analytical Sandbox & NoSQL Data Stores
[Diagram labels: Hadoop Data Store; Analytic Sandbox; Exploratory Analytic Environment; Structured BI Reporting Environment; Data Preparation and Enrichment; ALL data fed into Hadoop Data Store]
Data Latency Spectrum
Use Case                                              Time Interval
Ultra low latency messaging                           < 100 microseconds
Extreme transaction processing                        < 1 millisecond
Streaming data analysis; no intermediate persistence  < 100 milliseconds
Real time event characterization                      < 1 second
Complex event processing; near real-time dashboards   < 30 seconds
Operational dashboard                                 < 5 minutes
Intraday analysis                                     < 2 hours
Daily rollup                                          ≤ 24 hours
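The spectrum above can be read as a lookup from an observed latency to the tightest use-case class it still satisfies. A minimal Python sketch, assuming only the thresholds listed on this slide; the names `LATENCY_SPECTRUM` and `classify_latency` are hypothetical and not from the deck:

```python
# (upper bound in seconds, use case), copied from the Data Latency Spectrum slide.
LATENCY_SPECTRUM = [
    (100e-6, "Ultra low latency messaging"),
    (1e-3, "Extreme transaction processing"),
    (100e-3, "Streaming data analysis; no intermediate persistence"),
    (1.0, "Real time event characterization"),
    (30.0, "Complex event processing; near real-time dashboards"),
    (5 * 60.0, "Operational dashboard"),
    (2 * 3600.0, "Intraday analysis"),
    (24 * 3600.0, "Daily rollup"),
]


def classify_latency(seconds):
    """Return the first use case whose time interval admits this latency."""
    if seconds < 0:
        raise ValueError("latency cannot be negative")
    last = len(LATENCY_SPECTRUM) - 1
    for index, (bound, use_case) in enumerate(LATENCY_SPECTRUM):
        # Every bound is strict ("<") except the last one, which is "≤ 24 hours".
        if seconds < bound or (index == last and seconds == bound):
            return use_case
    # Assumption: anything slower than a day falls outside the slide's table.
    return "Outside the spectrum (slower than daily rollup)"


half_second = classify_latency(0.5)
```

For example, a pipeline that delivers results in about ten minutes lands in the "Intraday analysis" row, since it misses the five-minute bound for an operational dashboard.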
Considerations in Organizing People
The Lab
• “In their search for new insights, data scientists write enormous quantities of code. But it is not designed to meet commercial standards for scalability, security, and stability. You create and support commercial-grade code in the factory.”
The Factory
• “The [Factory] requires many more people with a wider variety of skill sets, a more rigid environment, and different sorts of metrics…. To be clear, creativity and experimentation are important in the factory, but you must not expect more than incremental thinking and production-oriented solutions.”
From Article
Contact Information
• April Reeve
– EMC Consulting – Enterprise Information Management Practice
• [email protected]
• +1 (201) 396-1831
• @Datagrrl on Twitter
• Blog - http://infocus.emc.com/april_reeve/
• Book – “Managing Data in Motion – Data Integration