Terri L. Lomax, Vice Chancellor
Research, Innovation + Economic Development
Data Science Initiative
Joint Research Committees Meeting, October 27, 2014
Data Science is a Big Deal
Managing, processing and exploiting data to improve decision making will continue to grow in importance
Key Insights: after McKinsey (May 2011)
1. Data have swept into every industry and business function and are now an important factor of production.
2. Big Data creates value but ONLY if appropriate and advanced Data
Science is used to deal with Big Data.
3. Use of Big Data and Data Science will become a key basis of
competition, growth and pro-active risk management for individual firms.
4. Big Data underpins new waves of productivity growth and consumer
surplus.
5. Big Data will matter across sectors, but some sectors are poised for
greater gains.
6. There is already a shortage of talent necessary for organizations to take
advantage of Big Data.
From McKinsey (May 2011):
Demand for deep analytical positions could exceed the supply by 140,000 to 190,000 positions in
2018.
The need for additional managers and analysts in the US who can ask the right questions and
consume the results of data analysis is estimated at 1.5 million.
In the short-term, retraining existing talent will be required to meet demand.
Big data can generate significant financial value across sectors
Source: Data Science Research Center, Amsterdam (http://dsrc.nl/what-is-data-science/)
Data Science is multidisciplinary
Security
Privacy
Provenance
Understand and decide: Perception, Cognition, Business Analytics, Visual Analytics, Decision Theory
Store and Process: Large Scale Databases, Software Engineering, System/Network Engineering, Distributed Processing
Analyze & Model: Reasoning, Knowledge Representation, Multimedia Retrieval, Modeling & Simulation, Machine Learning, Information Retrieval
Research Triangle - Data Science Powerhouse
Data4Decisions’ unique concept is guided by the event’s Advisory Council, a powerful compendium of the region’s leading research universities, private companies and thought-leading Research Triangle-based associations:
NC State Data Science-Related Centers + Institutes
New: NSF I/UCRC on End-to-End Enablement of Data
Laboratory for Analytic Sciences (LAS)
NSA’s goal is to build an advanced data innovation hub in the Research Triangle with LAS as the anchor tenant
CHANCELLOR’S FACULTY EXCELLENCE CLUSTERS
• Institute for Advanced Analytics – PSM, primarily SAS tools
• COE/PCOM executive education based on
open-source tools
• CSC graduate track in Data Science
• CSC/Stat/Math undergraduate concentration in Data Science
• PCOM executive education based on IBM tools
• UNC GA Research Opportunities Initiative – Data Science
Institutionalize NC State’s Data Science Initiative (together with UNC Charlotte and RENCI)
Data Science Infrastructure at NC State
• Portions of VCL-HPC facilities (large memory
machines) – run Linux-based, user-provided analytics
• CSC MRC VCL-BigData testbed (x86 and IBM Power7
and Power8 computers with lots of memory, tightly coupled storage, and advanced accelerators) – run IBM Analytics
• NCBP-VCL cluster – lots of memory and disk space –
runs SAS analytics
• IAA VCL facilities – primarily runs SAS analytics
• OSCAR lab – extensive Big Data and Data Science
computational and data storage facilities, including a BlueGene/P supercomputer; also the LAS “low lab”.
NC State Data Science Initiative
Goals
• Raise visibility & increase reputation
• Coordinate data science activities, including education
• Increase research funding
• Build industry partnerships
• Establish interdisciplinary undergraduate curriculum
• Provide services & infrastructure to faculty
Organizational Structure
• Director / Assistant
• Coordinating Council
• Steering Committee
Data Science Initiative – Coordinating Council
(formative stage)
COE
Mladen Vouk – CSC – Director
Dan Stancil – ECE
Jerry Bernholc – CHIPS
Michael Young – DGRC
James Lester – CEI
Paul Turinsky – CASL
Jacob Jones – AIF
Dennis Kekas – ITng
Yousry Azmy – CNEC
Rada Chirkova – new I/UCRC (STEED Lab, CHMPR)
COS
Montse Fuentes – Stat
Marie Davidian – CQSB
Alyson Wilson – Cluster, LAS
John Blondin – Phys
Loek Helminck – Math
Tom Banks – CRSC
Fred Wright – Bioinformatics
CED
Glenn Kleiman – WIFIEI
CNR
Ross Meentemeyer – GSA
PCOM
Mike Kowolenko – CIMS
CHASS
Carolyn Miller – Dig. Humanities
Provost
Michael Rappa – IAA
ORIED
Summary
• Managing and extracting information from complex data sets continues to grow in importance in almost all sectors of the US economy
• The Research Triangle has significant programs in data science that will be leveraged for future growth
• The importance of data science is recognized by all levels and types of industry, government and academia
• Working together at NC State, we can:
– capitalize on multidisciplinary opportunities,
– build significant programs, and
– educate the skilled workforce to maintain our
Terri L. Lomax
research.ncsu.edu
Data to
Gap Analysis for Data Science Cluster Proposal
Areas: Physical models; Social models; Symbolic models; Numerical solvers; HPC, HPD, OS; Software architectures; Storage/Index/Access; Privacy; Security; Statistical methods; Discrete mathematics; AI/Knowledge Mgmt; Database; Data integration; Natural Language
Coverage: Weak, Little, Strong
Current Gap Analysis for Data Science Cluster (Oct 2014)
Areas: Physical models; Social models; Symbolic models; Numerical solvers; HPC, HPD, OS; Software architectures; Storage/Index/Access; Privacy; Security; Statistical methods; Discrete mathematics; AI/Knowledge Mgmt; Database; Data integration; Natural Language
Coverage: Weak, Little, Strong
• Ensemble and Comparative Visualization of Scientific Datasets (Sandia,
Christopher Healey)
• Computer-aided Human Centric Cyber Situation Awareness (Penn State, Peng
Ning; Michael Young)
• Runtime System for I/O Staging in Support of In-Situ Processing of Extreme
Scale Data (DOE, Nagiza Samatova)
• Scalable and Power Efficient Data Analytics for Hybrid Exascale Systems (DOE,
Nagiza Samatova)
• Damsel: A Data Model Storage Library for Exascale Science (DOE, Nagiza
Samatova)
• Scalable Data Management, Analysis, and Visualization (SDAV) Institute (DOE,
Nagiza Samatova; Anatoli Melechko)
• Scientific Data Management Center (DOE, Vouk)
• Collaborative Research: Understanding Climate Change: A Data Driven Approach
(NSF, Nagiza Samatova; Frederick Semazzi)
• Policy-Based Governance for the OOI Cyberinfrastructure (NSF, Munindar Singh)
• Interdisciplinary Cyber-Enabled Crime Reconstruction through Innovative
Methodology and Engagement (IC-CRIME); (NSF, David Hinks; Michael Young, ASU, IU-B)
Capturing Value from Data
• Create transparency
• Enable experimentation to discover needs, expose
variability, and improve performance
• Segment populations to customize actions
• Replace and/or support human decision making with
automated algorithms
• Innovate new business models, products, and services
Source: Big Data: The next frontier for innovation, competition, and productivity
Topic areas: Analytics, Acquisition, Computation Sciences, Cyber Infrastructure, Cyber Security, Data Management, Education, Gaming, Informatics, Mobility/Wireless, Networking, Policy/Governance, Processing & Preservation, Modeling & Simulation, Visualization, Virtualization, Fault Tolerance/Recovery, Natural Language Processing, Pattern Recognition, Transportation
Applications: Biological Sciences, Business, Climate, Engineering Aps., Energy, Health, Social, Humanities, Physics, Security, Policy, Etc.
Data Types: Structured, Unstructured, Image, Signal, Streams
Units shown: IAA, CSC, CIMS, MEAS, ICSE, ORSC, VCL, Math, Physics, ITng, ECE, SOSI, CEI, Stat, DGRC, BRC, CHiPS, COE, NC B-Prepared, RENCI, COD, CQSB, NCICS, SDM
Other labels: Big Data Research and Development Initiative; Consider: v-Centennial
Graphic Representation of Data Science
• Two one-day Workshops held at NC State
– Hosted by VCs Terri Lomax and Marc Hoit
– Led by Tina Bennefield, HR Senior Consultant &
Performance Leadership Program Manager
– Organized by Bonnie Aldridge
• Day 1
– Individual faculty presentations on current research
– Table discussions on trends, barriers and needs
• Day 2
– Developing a shared vision
– Developing recommendations
McDonald*, Wolfram, Bird*, Devine, Whetten*, Baron*, Bolotnov*, Chakrabortty*, Chirkova, Chow*, Dai, Edwards*, Ferguson*, Franzon*, Healey*, Krim*, Misra*, Muth, Overton, Rotenberg, Vouk, Westmoreland*, Xie*, Breen, Kennedy-Stoskopf*, Kouri*, Kowolenko*, Krishnamurthy*, Blondin, Brown*, Daniels*, Ghosh, Ipsen, Mitasova*, Reading*, Sullivant*, Xie*, Yuter, Zhou, Pasquinelli
Colleges: CALS, CVM, PCOM, PAMS, COE, COT, CHASS, CNR
• Research trends
– Analysis of unstructured data sets
– Enhanced visualization methods
– Data interoperability and fusion techniques
– Model-driven vs data-driven approaches
• Barriers
– Infrastructure (bandwidth, storage, power, etc.)
– Human capital
– Privacy, proprietary, and standards issues
– Departmental cultures
• Needs & Vision
– Understand industry funding
– Collaborative data tools
– Communication between producers and consumers
– Overarching coordinating structure