Detroit ASA – January 2015
Introducing open source
statistical and data science tools
to business analytics students
and professionals
Mark Isken
Assoc. Prof. of MIS
School of Business Administration
Oakland University
Abstract
Tools such as Excel, SQL databases, SPSS and SAS have long been
staples of the quantitative side of business education and professional
practice. Recently, this community has seen a surge in popularity of open
source tools of a more computational nature. In response, in the spring of
2014, I developed and delivered a course entitled "Practical Computing for
Business Analytics" within the School of Business at Oakland University.
This course relied entirely on open source software for course development,
delivery and student work. Specifically we used the Linux OS along with R
and Python. R Markdown documents and IPython notebooks were the
primary teaching tools. I will describe my teaching methods, discuss how
the course went, and share plans for the continued dissemination of this
material in the business analytics community.
Mark Isken
● BSE, MSE, Ph.D. in Industrial and Operations Engineering from University of Michigan
● Operations analyst for William Beaumont Hospital and Henry Ford Health System and some small consulting companies (~10 years)
● Joined OU Fall 1999 as full-time faculty member of Dept. of Decision and Information Sciences
● I’m a techie – love working with computers and mathematical models to help solve business problems
● Teach/taught business analytics, statistics, computer simulation, intro MIS courses and healthcare operations mgt
● http://www.sba.oakland.edu/faculty/isken/
● I remember when INFORMS was ORSA and TIMS, and the controversy that ensued when the merger was proposed.
Business analytics
Data science
Business intelligence
Statistics
Machine learning
Management science
Operations research
Data mining
Analytics
Knowledge discovery in databases
Big data
Decision science
Data visualization
Data warehousing
OLAP
Healthcare Operations Analysis
● Internal business analysis / decision support consultant
● Simulation modeling
– Critical care tower, emergency departments
– Pneumatic tube systems, outpatient clinics, pharmacy robots
● Staffing and scheduling models
– people, cases, tests, etc.
– Queueing, simulation, optimization
● Database and analytical tool development using Access, Excel, VBA and other software
● Various statistical and operations analysis studies
● Short term census forecasting
Postpartum Staffing Needs
[Figure: number of nurses (y-axis, 0 to 45) by hour of week (x-axis, Sun 12 am through Sat 06 pm in 6-hour steps); slide from "Introduction to BAM"]
My view of business analytics
Programming & databases
Math & stats
Domain knowledge
Art and Craft of modeling
Communication, visualization, story-telling
Simulation
Optimization
Management science modeling
Modeling
Modeling/engineering
Spreadsheet
VBA
Basic foundation
Spreadsheet Based Modeling & Decision Support (taught since 2001)
OLAP/DW
Statistics
EDA/data viz
Data analysis
Data Mining
Automation
User Forms
Environment customization, error prevention & handling
Application Development
Database
Getting started with Free and Open
Source Software (FOSS)
● PhD days using FORTRAN based network flow algorithms for scheduling problems in addition to commercial tools like IBM's OSL and CPLEX
● Clearly saw
– FOSS allowed me to learn by code exploration
– It allowed me to create decision support apps with sophisticated code built in that didn't force end user organizations (hospitals with no extra $$$) to buy expensive commercial software
– that I could extend this software to solve my specific problem better
● Wrote dissertation in LaTeX
● My research experience along with several years as a practicing industrial engineer with two large healthcare systems and a few years of university teaching launched my real plunge into the world of FOSS
FOSS for analytics in practice
● A smallish healthcare analytics firm run by a good friend of mine from grad school days
● Very Microsoft-centric place and client base (SQL Server, .Net apps, Excel, Access, PPT)
● I've been introducing and helping people get up to speed with things like R and Python to overcome common limitations of their current Excel-centric analytical workflow practices
– ad hoc and non-reproducible data cleaning, transforming
– lots of pointing and clicking for repetitive tasks
– sketchy documentation of analytical workflow
– ease of doing things with R (via apply family or plyr and ggplot2) and Python (via pandas and matplotlib) such as
  ● percentile calculations within “pivot style” or “group by” analysis
  ● small multiples
  ● both of the above are hideous to do in Excel and even the first is tough in specialized tools like Tableau. I've got “beginner level” tutorials on these things on hselab.org both in R and Python
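A minimal sketch of the "percentiles by group" idea in Python with pandas. The data frame, the column names (unit, los_hours) and the values are made up for illustration and are not taken from the hselab.org tutorials.

```python
import pandas as pd

# Hypothetical data: length of stay (hours) for patients in two units.
stays = pd.DataFrame({
    "unit": ["A"] * 5 + ["B"] * 5,
    "los_hours": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})

# Group by unit, then compute the 50th and 90th percentiles within each group.
# unstack() turns the percentile level into columns: one row per unit.
pctiles = stays.groupby("unit")["los_hours"].quantile([0.5, 0.9]).unstack()
print(pctiles)
```

The whole "pivot with percentiles" step is one chained expression, and rerunning the script on new data repeats the analysis exactly.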
No fun to do in Excel
“Small multiples”
Percentiles by
group
hselab.org
● This is my primary outlet for sharing tutorials, teaching materials, FOSS and other analytics related things
– Tutorials and guides
– Blog posts
– Links to my FOSS projects
– Working on Shiny apps
– Open courses
Science, engineering and research are all evolving in response to calls for reproducibility, open access to data and results, changes to publishing models, and the possibilities offered by FOSS, along with internet infrastructure that facilitates organic evolution of social and technical ecosystems
Got me thinking we really needed a course on this
stuff within the School of Business...
MIS 480/680
- Practical Computing for
Business Analytics
Hey MBAs! Microsoft isn't the
only game in town.
If you really want to do analytics in the business
world, you better learn to do some programming!
Structure – 14 3hr sessions
202EH Computer Teaching Lab
First half of the semester
Session 1: Intro to analytics
Session 2: Intro to R and R Studio
Session 3: Exploratory data analysis with R
Session 4: Group by analysis and more stats - R
Session 5: Linear models in R
Session 6: Data mining in R (kNN, cluster, Rattle)
Second half of the semester
Sessions 9/10: Intro to Python
Session 11: Data analysis and plotting in Python
Session 12: Data acquisition, prep and more analysis
Session 13: Time series, datetime analysis in Python
Sessions 7/8: Text files, regex, Linux tools (e.g.
Our computing “appliance” -
pcba
Computer running Windows, Mac OS, or Linux
● Programs
– MS Office
– Notepad
– Browser
– VirtualBox
● Documents
– Spreadsheets, Word documents, text files, pdf
– Virtual machines
VM running Lubuntu Linux
● Programs
– R, R Studio, R packages
– Python (Anaconda)
– Geany
– OpenOffice
– Browser
– File Manager
– Shell
● Documents
– R scripts, Python programs
– OpenOffice documents
– Text files, pdf
Why R and Python?
● Both R and Python are widely used in the data science and business analytics worlds
● A quote from Enterprise Data Analysis and Visualization: An Interview Study on the growing need for technically adept analysts:
When discussing recruitment, one Chief Scientist said “analysts that can’t program are disenfranchised here”
● Both support a combination of interactive use via tools like R Studio and IPython along with programmatic use via text scripting
● Huge communities and ecosystems supporting R and Python for analytics work
● Both facilitate reproducible analysis
● Some things that are hideously difficult to do in tools like Excel or a database are simple in R and/or Python
– Group By or Pivoting type analysis for operations such as percentiles
– Small multiples and other complex graphing/charting/plotting
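As a rough illustration of small multiples in Python, the sketch below draws one small panel per group on shared axes with matplotlib. The clinic names and weekly counts are invented, and the figure is rendered off-screen to an in-memory PNG so the script runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # off-screen backend, no display needed
import matplotlib.pyplot as plt

# Hypothetical data: weekly visit counts for four made-up clinics.
weeks = [1, 2, 3, 4, 5]
visits = {
    "Clinic A": [12, 15, 11, 18, 20],
    "Clinic B": [30, 28, 35, 33, 31],
    "Clinic C": [5, 9, 7, 6, 10],
    "Clinic D": [22, 19, 25, 27, 24],
}

# One panel per clinic; shared axes make the panels directly comparable.
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(6, 4))
for ax, (name, counts) in zip(axes.flat, visits.items()):
    ax.plot(weeks, counts, marker="o")
    ax.set_title(name)
fig.tight_layout()

# Write the figure to an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

Adding a fifth group only means adding a dictionary entry and changing the grid shape; the plotting loop stays the same.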
Flow of a typical class
● Guided exploration of topics via interactive use of R Markdown documents or IPython Notebooks
● In class assignment where I act as “roving” consultant
● Open lab time for homework and project work
– Collaborate with classmates
R Markdown
documents
● Mixture of markdown (simple plain text formatting) and executable R “code chunks”
● Facilitates authoring informative and reproducible analysis documents
● Can generate output in numerous forms including PDF, HTML, MS Word
● Can publish resultant HTML directly to RPubs
● Used in PCBA for interactive session delivery, exploration and note taking, homework submissions, and project deliverables
● IPython notebooks facilitate interactive Python computing in a browser based environment
– Mixture of markdown and Python
– Inline plotting
IPython Notebooks
Gallery of interesting notebooks
nbviewer
Fernando Perez
Notebooks are just JSON text files
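To make the "just JSON" point concrete, here is a small sketch that builds a two-cell notebook as a plain Python dictionary and round-trips it through the standard json module. The field layout follows the version 4 notebook format as I understand it; the cell contents are made up.

```python
import json

# A tiny notebook: one markdown cell and one code cell (nbformat 4 layout assumed).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Monty Hall\n", "A short note."],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print(1 + 1)"],
        },
    ],
}

# Serialize to the text that would sit in an .ipynb file, then read it back.
text = json.dumps(notebook, indent=1)
loaded = json.loads(text)
cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```

Because the file is plain text, notebooks can be version controlled, searched with ordinary command line tools, and generated or inspected by scripts like this one.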
Homework assignments
HW0 - Intro to PCBA
- guided exploration of pcba virtual appliance
- overview reading from DDS and exploration of links on course website
HW1 - Intro to R
- use R Studio, create an RMarkdown document
- data importing and exporting
- view and modify dataframes (change data types, add cols)
- answer questions about some R lists, vectors, arrays, matrices
- generate html from Rmd file
HW2 - EDA with R
- EDA: summary stats, group by, plots
- data reshaping
HW3 - Predictive modeling with R
- regression models to predict MLB winning percentage
- try out a few predictive modeling techniques for the Kaggle Titanic Challenge
- feature engineering
HW4 - Simulating the Monty Hall 3-Door Problem with Python
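For a sense of what the Monty Hall simulation involves, here is one possible sketch using only the standard library. It is not the actual assignment or its solution; the function names, trial count and seed are arbitrary choices.

```python
import random


def play(switch, rng):
    """Play one game; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=10000, seed=1):
    """Estimate the win probability for a strategy by repeated play."""
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(trials))
    return wins / trials


stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print("stay:", stay_rate, "switch:", switch_rate)
```

The estimates should land near the theoretical values of 1/3 for staying and 2/3 for switching, which is the point students tend to find surprising.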
Final Projects
Options
1. Analyze dataset of interest
2. Research into techniques and/or tools
3. Compete in active Kaggle competition
A few of the resulting projects
- Neural nets for the Kaggle bike share competition
- financial portfolio analysis with Python and Tkinter (for GUI)
- exploration of R packages for financial analysis
- blackjack simulator in Python for exploring different playing strategies
- a tutorial for creating a basic R Shiny app
- maps of website use based on Apache logs using Python, pandas, matplotlib
- using EDA, kNN, decision trees to explore factors affecting vehicle fuel
Student mix
● MBA 6
● BS-MIS 3
● MS-STA 1
● MSITM 16
● Post-Bac 1
● BS-FIN 1

● BS-MIS 2
● BS-POM 1
● MACC 7
● MBA 12
● MSITM 13
R/Python
Analytics
Spreadsheet based
Business Analytics
Fall 2014
Summer I 2014
Our analytical profiles
● I'll give everyone an index card and on it you'll profile yourself (on a relative scale) with respect to the following dimensions
– Computer programming
– Math
– Statistics
– Data visualization
– Machine learning / data mining
– Modeling
– Domain expertise
Why learn to use Linux for analytics?
● Linux widely used in the data science and analytics world
● Linux shell FAR superior to Windows command line application
– Powerful shell scripting language
– Tab completion
– Command line is often way more efficient than GUI
● Linux is free (as in freedom and beer) and open source
● Sets you apart from other business analysts who only
The Geek Factor
● Using and creating FOSS earns you geek cred
● It's fun to use tools like R and Python and Linux
– Do an import this and then an import antigravity in an IPython notebook or any Python shell
– Do you think MS inspires this kind of thing? :)
● FOSS facilitates users becoming more tech savvy
– Real geeks use Linux; but seriously,
– command line use
– GitHub
– installing software
– getting your hands dirty
– leveraging the Unix philosophy of small focused tools that you can put together to do amazing things
● $ wc -l *.pdb | sort -g | head -1
– here's the overview presentation I use as part of the hands-on session to introduce B-school students to Linux
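The wc | sort | head pipeline above counts the lines in each .pdb file, sorts the counts numerically, and keeps the smallest, so it reports the shortest file. For comparison, a rough Python equivalent is sketched below; the function name shortest_file is made up for this example.

```python
import glob


def shortest_file(pattern):
    """Return (line_count, path) for the file with the fewest lines, or None."""
    counts = []
    for path in glob.glob(pattern):
        with open(path, "rb") as handle:
            # Like wc -l, count newline characters rather than parsed lines.
            counts.append((handle.read().count(b"\n"), path))
    # min() compares tuples by count first, then by path to break ties.
    return min(counts) if counts else None


print(shortest_file("*.pdb"))
```

The shell version is a single line built from three small tools, which is the Unix philosophy in action; the Python version is longer but easier to extend, for example to skip empty files.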
About “Practical Computing for
Business Analytics”
● introduced B-school students to the non-Microsoft world that exists
– Linux shell scripting, Linux OS, and world of FOSS for Linux
– R and Python
● created a Lubuntu based computing appliance
– distributed as .ova exported from VirtualBox
– free!
– I could get it totally configured by installing and setting up the software as I wanted
– students didn't waste time trying to get myriad of tools working on their systems
– minimized hassle on our OU IT staff
● entire course was created and delivered with FOSS
● "I wouldn't use Windows at all if not for Excel"
● An open version of the course website is available from my hselab site in the courses section
● In Summer I 2015, the course will be offered again (and every Summer I)