How do we train Data Scientists
and Data Engineers?
Eric Rozier
Asst Prof of EECS at the University of Cincinnati
Faculty Mentor DSSG at the University of Chicago
Training the Next Generation of
Data Scientists
• Focus on two main programs:
– Summer: 3-month intensive program
• DSSG
– Academic year: curriculum development to support in-class, hands-on experiences
http://dssg.uchicago.edu @datascifellows
Eric & Wendy Schmidt
Data Science for Social Good
Data Science for Social Good @datascifellows
What is DSSG?
• 40-50 Fellows in teams of 3-4
• Experienced Mentors
• 12 weeks in Chicago
• Impactful problems with non-profit & government partners
Data Science for
Social Good Fellowship
Goals of the Fellowship
• Train data scientists who care about and understand how to solve social problems
• Expose governments & non-profits to data science and train them to use data to make better decisions
• Seed a community of people and organizations working together to make social impact
• Create open source data science tools targeted at the needs of high-impact social problems
48 Fellows
8 Mentors
14 Projects
12 Weeks
36 Fellows
6 Mentors
12 Projects
12 Weeks
2013
2014
By the Numbers
Ideal Fellows
Making an Impact with Data
• Computer Science & Programming
• Statistics & Machine Learning
• Econometrics & Social Science Methods
• Databases
• Experimental Design
• Communication
• Problem Formulation
• ~1000 applicants
• 40 countries
• ~250 universities
• 84 fellows
• Computer Scientists, Statisticians, Economists, Public Policy (and other computational and quantitative fields)
2013-2014 Fellows
CMU, U. of Chicago, Northwestern, Harvard, MIT, Stanford, ITAM, Cornell, Yale, Villanova, Ohio State, USC, U Penn, Notre Dame, U of Minnesota, U of Michigan, Cambridge, McGill, UC Berkeley, U of Colorado, Swarthmore, Oberlin, UIUC, Emory, Duke, Fordham, Johns Hopkins, IIT, SAIC, NYU, Penn State, Simon Fraser, UC Santa Barbara
• Partners: Non-Profits, Government Agencies, Corporations with a Social Mission
• Geographies: Local, State, National, and International
• Types of Problems: Impact Evaluation, Targeting, Risk Modeling,
• Types of Data: Structured data, geospatial data, time series, text data, network data
• Health: Predicting lead poisoning (home inspection data)
• Energy: Reducing energy use via disaggregation (smart-meter data)
• Education: Predicting high school dropout (education records)
• Economic Development: Targeting and assessing urban revitalization (administrative data)
• Corruption: Detecting collusion (contract data)
• Federal Budgeting: Identifying earmarks (Congressional bills)
• Improving high school graduation rates by identifying at-risk students early
• Increasing government transparency by identifying earmarks
• Developing new strategies to reduce maternal mortality
• Preventing Lead Poisoning by proactive home inspections and health check-ups
• Buildings: 197,157; Time: 76 years; Money: $98 million
• Buildings: 42,695; Time: 16.4 years; Money: $21.3 million
• Buildings: 378; Time: 2 months; Money: $189,000
Prediction Saves Time & Money
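The cost per building implied by these figures can be checked with a few lines of arithmetic. The sketch below is illustrative only. The scenario labels "A", "B", and "C" are placeholders, because the slide does not name the three cases, and the helper `cost_per_building` is a hypothetical name. The numbers are copied from the slide as given.

```python
# Figures copied from the slide. The labels "A", "B", "C" are hypothetical
# placeholders: the slide does not name the three scenarios.
scenarios = {
    "A": {"buildings": 197_157, "cost_usd": 98_000_000},
    "B": {"buildings": 42_695, "cost_usd": 21_300_000},
    "C": {"buildings": 378, "cost_usd": 189_000},
}


def cost_per_building(scenario):
    """Return the implied cost in dollars of inspecting one building."""
    return scenario["cost_usd"] / scenario["buildings"]


for label, scenario in scenarios.items():
    print(f"Scenario {label}: ${cost_per_building(scenario):,.2f} per building")
```

Under these figures the implied cost per building is nearly the same in all three cases (roughly $500). This suggests the savings come from inspecting far fewer buildings rather than from cheaper inspections.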
The Eric & Wendy Schmidt Data Science for Social Good Summer Fellowship 2014
At Risk Children
Lead Levels During Childhood
• Even without detailed child-level features, there are strong, sanity-checked, prediction-capable patterns
Target: Prediction From Birth
Who we’re looking for?
• Expertise in one or more of the following areas:
– Computer Science
– Statistics
– Public Policy
– Social Science
– Other Quantitative or Analytical Areas
• Some coding experience
• Passion for making a social impact
• Problem solving (critical thinking) experience
• Enjoy working on a team
• Deep expertise in computer science, machine learning, statistics, or social sciences
• Experience working on real problems in industry
• Experience leading teams and managing projects
Mentors
• Ben Yuhas: Principal, Yuhas Consulting Group
• Eric Rozier: Assistant Professor of Electrical and Computer Engineering
• Kate Cagney: Sociology & Health Studies; Director, Population Research Center, UChicago
• Joe Walsh: Lead Forecaster for GE Healthcare & Policy
• Organizations that…
1. Have an interesting social-impact problem to solve
2. Have data that can help solve it
3. Have a desire to put our work into action
• Especially interested in longer-term collaborations beyond the 12 weeks of the fellowship
Project Partners
Governments / Government Orgs, Foundations, Research Institutions, Non-Profits
Get Involved!
• Application deadlines:
– Fellows: Feb 1, 2015
– Mentors: Feb 1, 2015
– Partners: Jan 10, 2015
• Applications & more info: http://dssg.uchicago.edu
• Or email: [email protected]
Data Science in
the Curriculum with
The Data Deluge
• Big Data education suffers from similar challenges.
How do we help students drink from the fire hose?
Big Data and the Curriculum
• Big Data is putting pressure on the curriculum
– Not just CS/ECE: Business, Finance, Social Science, Economics, Biology, Medicine, Public Policy
• NIH has held several meetings on Big Data education.
– Wants to integrate Big Data/Data Science into the regular curriculum.
NIH Conclusions
• Teach from case studies
– Proper training should include hands-on experience with real data.
– Use and study of cutting-edge:
• Tools
• Techniques
NIH Conclusions
• Train Data Scientists to work as team members.
– The team is one of the most important parts of real data science applications.
New Ways of Thinking
• Get students used to the pace of change,
New Ways of Learning
Active Learning
• After 2 weeks we tend to remember:
– Passive learning
• 10% of what we read
• 20% of what we hear
• 30% of what we see
• 50% of what we hear and see
– Active learning
• 70% of what we say
Bloom’s Taxonomy
Evaluation
Synthesis
Analysis
Application
Comprehension
Knowledge
Three Pronged Approach
• Reading, presenting, and discussing the current state of the art.
• Hands-on study with real data.
• Original research in
Creating a Classroom Around a
Digital Observatory
Transitioning a DSSG like
Environment to the Year
• Identify a smaller number of partners to work with larger groups on a longer time scale.
• Understand that our expectations need to be tempered
– Summer: exclusive, competitive program with international recruitment
– Year: students are drawn from the (admittedly excellent) student body at large; motivation may be lower.
Frontiers of Data Science Class
• Several published papers resulting from the class.
• Mixed undergrad and graduate, interdisciplinary environment.
• Awarded Frontiers of Engineering Education by the NAE
Growth of the Course
• First year
– 8 students
– Electrical Engineers, Computer Engineers, Computer Scientists, Environmental Scientists, Economists
• Second year
– 14 students
Developing Scalable Infrastructures
• Understand the financial limitations of the classroom
• Develop resources that can be leveraged for both research and curriculum; a practical curriculum based on real experience will have similar needs anyway!
The Need for Practice in the
Academy
• We need to train ourselves in Data Science to teach it.
– Many faculty haven’t had real industrial experience with Data Science.
– The field and its practice are changing fast.
– Encourage the development of Data Science workshops, boot camps, and summer programs for faculty as well as students.