Introducing open source statistical and data science tools to business analytics students and professionals

(1)

Detroit ASA – January 2015

Introducing open source

statistical and data science tools

to business analytics students

and professionals

Mark Isken

Assoc. Prof. of MIS

School of Business Administration

Oakland University

(2)

Abstract

Tools such as Excel, SQL databases, SPSS and SAS have long been

staples of the quantitative side of business education and professional

practice. Recently, this community has seen a surge in popularity of open

source tools of a more computational nature. In response, in the spring of

2014, I developed and delivered a course entitled "Practical Computing for

Business Analytics" within the School of Business at Oakland University.

This course relied entirely on open source software for course development,

delivery and student work. Specifically we used the Linux OS along with R

and Python. R Markdown documents and IPython notebooks were the

primary teaching tools. I will describe my teaching methods, discuss how

the course went, and share plans for the continued dissemination of this

material in the business analytics community.

(3)

Mark Isken

BSE, MSE, Ph.D. in

Industrial and Operations Engineering

from University of Michigan

Operations analyst for

William Beaumont Hospital

and

Henry Ford Health System

and some small consulting

companies (~10 years)

Joined OU Fall 1999 as full-time faculty member of Dept. of

Decision and Information Sciences

I’m a techie – love working with computers and mathematical

models to help solve business problems

Teach/taught business analytics, statistics, computer

simulation, intro MIS courses and healthcare operations mgt

http://www.sba.oakland.edu/faculty/isken/

• I remember when INFORMS was ORSA and TIMS, and the

controversy that ensued when the merger was proposed.

(4)

Business analytics

Data science

Business intelligence

Statistics

Machine learning

Management science

Operations

research

Data mining

Analytics

Knowledge discovery

in databases

Big data

Decision science

Data visualization

Data warehousing

OLAP

X

X

X

(5)

Healthcare Operations Analysis

Internal business analysis / decision support consultant

Simulation modeling

– Critical care tower, emergency departments

– Pneumatic tube systems, outpatient clinics, pharmacy robots

Staffing and scheduling models

– people, cases, tests, etc.

– Queueing, simulation, optimization

Database and analytical tool development using Access,

Excel, VBA and other software

Various statistical and operations analysis studies

Short term census forecasting

[Figure: Postpartum Staffing Needs. Y-axis: Nurses (0 to 45). X-axis: time of week in 6-hour steps, Sun 12 am through Sat 06 pm.]

(6)

My view of business analytics

Programming & databases

Math & stats

Domain knowledge

Art and Craft of modeling

Communication, visualization, story-telling

(7)

Simulation

Optimization

Management science modeling

Modeling

Modeling/engineering

Spreadsheet

VBA

Basic foundation

Spreadsheet Based Modeling & Decision

Support

(taught since 2001)

OLAP/DW

Statistics

EDA/data viz

Data analysis

Data Mining

Automation

User Forms

Environment customization, error prevention & handling

Application Development

Database

(8)

Getting started with Free and Open

Source Software (FOSS)

PhD days using FORTRAN based network flow algorithms for

scheduling problems in addition to commercial tools like IBM's OSL

and CPLEX

Clearly saw that FOSS

– allowed me to learn by code exploration

– allowed me to create decision support apps with sophisticated code built in that didn't force end user organizations (hospitals with no extra $$$) to buy expensive commercial software

– could be extended to solve my specific problem better

Wrote dissertation in LaTeX

My research experience along with several years as a practicing

industrial engineer with two large healthcare systems and a few

years of university teaching launched my real plunge into the world

of FOSS

(9)

FOSS for analytics in practice

A smallish healthcare analytics firm run by a good friend of mine

from grad school days

Very Microsoft-centric place and client base (SQL Server, .Net apps,

Excel, Access, PPT)

I've been introducing and helping people get up to speed with things

like R and Python to overcome common limitations of their current

Excel centric analytical workflow practices

– ad hoc and non-reproducible data cleaning, transforming

– lots of pointing and clicking for repetitive tasks

– sketchy documentation of analytical workflow

– ease of doing things with R (via apply family or plyr and ggplot2) and Python (via pandas and matplotlib) such as

● percentile calculations within “pivot style” or “group by” analysis
● small multiples
● both of the above are hideous to do in Excel, and even the first is tough in specialized tools like Tableau

I've got “beginner level” tutorials on these things on hselab.org, both in R and Python
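To make the "percentiles by group" task concrete, here is a minimal sketch in plain Python. It is not taken from the hselab.org tutorials, which use R and pandas; the sample records, the unit names, and the helper names `percentile` and `percentiles_by_group` are all made up for illustration. In pandas the same idea is usually written as a `groupby` followed by `quantile`.

```python
from collections import defaultdict
from math import floor


def percentile(values, p):
    """Return the p-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    rank = (len(ordered) - 1) * p / 100
    lower = floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def percentiles_by_group(records, key, value, p):
    """Group dict records by `key`, then take the p-th percentile of `value`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {name: percentile(vals, p) for name, vals in groups.items()}


# Hypothetical data: length of stay (hours) by hospital unit.
stays = [
    {"unit": "ICU", "hours": 10},
    {"unit": "ICU", "hours": 20},
    {"unit": "ICU", "hours": 30},
    {"unit": "ICU", "hours": 40},
    {"unit": "ICU", "hours": 50},
    {"unit": "ED", "hours": 2},
    {"unit": "ED", "hours": 4},
    {"unit": "ED", "hours": 6},
]

p50 = percentiles_by_group(stays, "unit", "hours", 50)
p95 = percentiles_by_group(stays, "unit", "hours", 95)
print(p50)
print(p95)
```

The whole analysis is a short script that can be rerun on new data, which is exactly what the point-and-click Excel workflow lacks.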

(10)

No fun to do in Excel

“Small multiples”

Percentiles by

group

(11)

hselab.org

This is my primary outlet for sharing tutorials,

teaching materials, FOSS and other analytics

related things

Tutorials

and

guides

Blog posts

Links to my

FOSS projects

Working on

Shiny apps

Open

courses

Science, engineering, research are all evolving in response to

calls for reproducibility, open access to data and results, changes to publishing models and the possibilities offered by

FOSS along with internet infrastructure that facilitates organic evolution of social and

technical ecosystems

Got me thinking we really needed a course on this

stuff within the School of Business...

(12)

MIS 480/680

- Practical Computing for

Business Analytics

Hey MBAs! Microsoft isn't the

only game in town.

If you really want to do analytics in the business

world, you better learn to do some programming!

(13)

Structure – 14 3hr sessions

202EH Computer Teaching Lab

First half of the semester

Session 1: Intro to analytics
2: Intro to R and R Studio
3: Exploratory data analysis with R
4: Group by analysis and more stats - R
5: Linear models in R
6: Data mining in R (kNN, cluster, Rattle)
7/8: Text files, regex, Linux tools (e.g.

Second half of the semester

9/10: Intro to Python
11: Data analysis and plotting in Python
12: Data acquisition, prep and more analysis
13: Time series, datetime analysis in Python

(14)
(15)

Our computing “appliance” -

pcba

Computer running Windows, Mac OS, or Linux

Programs

MS Office

Notepad

Browser

VirtualBox

Documents

Spreadsheets, Word

documents, text files, pdf

Virtual machines

VM running Lubuntu Linux

● Programs
– R, R Studio, R packages
– Python (Anaconda)
– Geany
– OpenOffice
– Browser
– File Manager
– Shell

● Documents
– R scripts, Python programs
– OpenOffice documents
– Text files, pdf

(16)
(17)

Why R and Python?

● Both R and Python are widely used in the data science and business analytics worlds

● A quote from Enterprise Data Analysis and Visualization: An Interview Study on the growing need for technically adept analysts:

When discussing recruitment, one Chief Scientist said “analysts that can’t program are disenfranchised here”

● Both support a combination of interactive use via tools like R Studio and IPython along with programmatic use via text scripting

● Huge communities and ecosystems supporting R and Python for analytics work

● Both facilitate reproducible analysis

● Some things that are hideously difficult to do in tools like Excel or a database are simple in R and/or Python

– Group By or Pivoting type analysis for operations such as percentiles
– Small multiples and other complex graphing/charting/plotting

(18)

Flow of a typical class

Guided exploration of topics via interactive use

of R Markdown documents or IPython

Notebooks

In class assignment where I act as “roving”

consultant

Open lab time for homework and project work

Collaborate with classmates

(19)

R Markdown

documents

● Mixture of markdown (simple plain text formatting) and executable R “code

chunks”

● Facilitates authoring informative and reproducible analysis documents

● Can generate output in numerous forms including PDF, HTML, MS Word

● Can publish resultant HTML directly to RPubs

● Used in PCBA as an interactive session delivery, exploration and note

taking method, homework submissions, and project deliverables

● IPython notebooks facilitate interactive Python computing in a browser

based environment

– Mixture of markdown and Python
– Inline plotting

(20)

IPython Notebooks

Gallery of interesting notebooks

nbviewer

Fernando Perez

Notebooks are

just json text

files
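Because a notebook is just JSON text, it can be inspected with nothing more than Python's standard `json` module. The sketch below is illustrative only: the tiny notebook is written by hand in the nbformat 4 layout (an assumption; notebooks saved by older IPython versions are laid out somewhat differently), and `code_cells` is a made-up helper name.

```python
import json

# A hand-written, minimal notebook in the nbformat 4 layout (an assumption;
# older IPython versions save a somewhat different structure).
notebook_text = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Monty Hall notes"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["import random\n", "print(random.randint(1, 3))"],
        },
    ],
})


def code_cells(text):
    """Return the source of every code cell in a notebook given as JSON text."""
    notebook = json.loads(text)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell.get("source", "")
            # `source` may be stored as a list of lines or as a single string.
            sources.append("".join(source) if isinstance(source, list) else source)
    return sources


cells = code_cells(notebook_text)
print(cells[0])
```

The same plain-text property is what lets notebooks be diffed, version controlled, and rendered by tools such as nbviewer.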

(21)

Homework assignments

HW0 - Intro to PCBA

- guided exploration of pcba virtual appliance

- overview reading from DDS and exploration of links on course website

HW1 - Intro to R

- use R Studio, create an RMarkdown document
- data importing and exporting
- view and modify dataframes (change data types, add cols)
- answer questions about some R lists, vectors, arrays, matrices
- generate html from Rmd file

HW2 - EDA with R

- EDA: summary stats, group by, plots
- data reshaping

HW3 - Predictive modeling with R

- regression models to predict MLB winning percentage
- try out a few predictive modeling techniques for the Kaggle Titanic Challenge
- feature engineering

HW4 - Simulating the Monty Hall 3-Door Problem with Python
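For readers who have not seen the Monty Hall problem simulated, the following is one possible minimal sketch in Python. It is not the course's assignment or solution; the function names, the number of trials, and the seed are arbitrary choices made for illustration.

```python
import random


def play_round(switch, rng):
    """Play one round; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    first_pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != first_pick and d != car])
    if switch:
        # Switch to the one door that is neither picked nor opened.
        final_pick = next(d for d in doors if d != first_pick and d != opened)
    else:
        final_pick = first_pick
    return final_pick == car


def win_rate(switch, trials=20_000, seed=2014):
    """Estimate the probability of winning for a stay or switch strategy."""
    rng = random.Random(seed)
    wins = sum(play_round(switch, rng) for _ in range(trials))
    return wins / trials


stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print(f"stay: {stay_rate:.3f}  switch: {switch_rate:.3f}")
```

With enough trials the estimates settle near 1/3 for staying and 2/3 for switching, which matches the known analytical answer.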

(22)

Final Projects

Options

1. Analyze dataset of interest

2. Research into techniques and/or tools

3. Compete in active Kaggle competition

A few of the resulting projects

- Neural nets for the

Kaggle bike share competition

- financial portfolio analysis with Python and tKinter (for GUI)

- exploration of R packages for financial analysis

- blackjack simulator in Python for exploring different playing strategies

- a tutorial for creating a basic R Shiny app

- maps of website use based on Apache logs using Python, pandas, matplotlib

- using EDA, kNN, decision trees to explore factors affecting vehicle fuel

(23)

Student mix

R/Python Analytics (Summer I 2014)

MBA 6
BS-MIS 3
MS-STA 1
MSITM 16
Post-Bac 1

Spreadsheet based Business Analytics (Fall 2014)

BS-FIN 1
BS-MIS 2
BS-POM 1
MACC 7
MBA 12
MSITM 13

(24)

Our analytical profiles

I'll give everyone an index card and on it you'll profile yourself

(on a relative scale) with respect to the following dimensions

Computer programming

Math

Statistics

Data visualization

Machine learning / data mining

Modeling

Domain expertise

(25)

Why learn to use Linux for analytics?

Linux widely used in the data science and analytics world

Linux shell FAR superior to Windows command line

application

Powerful shell scripting language

Tab completion

Command line is often way more efficient than GUI

Linux is free (as in freedom and beer) and open source

Sets you apart from other business analysts who only

(26)

The Geek Factor

● Using and creating FOSS earns you geek cred

● It's fun to use tools like R and Python and Linux

– Do an import this and then an import antigravity in an IPython notebook or any Python shell

– Do you think MS inspires this kind of thing? :)

● FOSS facilitates users becoming more tech savvy

– Real geeks use Linux; but seriously,

– command line use

– github

– installing software

– getting your hands dirty

– leveraging the Unix philosophy of small focused tools that you can put together to do amazing things

● $ wc -l *.pdb | sort -g | head -1

– here's the overview presentation I use as part of the hands-on session to introduce B-school students to Linux
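The shell pipeline above counts the lines in each .pdb file, sorts the counts numerically, and keeps the smallest. For comparison, here is a rough Python sketch of the same task. The helper name `shortest_file` and the throwaway demo files are made up for illustration, and the script is noticeably longer than the one-line pipeline, which is rather the point of the Unix philosophy.

```python
import tempfile
from pathlib import Path


def shortest_file(folder, pattern="*.pdb"):
    """Return (line_count, name) for the matching file with the fewest lines."""
    counts = []
    for path in Path(folder).glob(pattern):
        with path.open() as handle:
            counts.append((sum(1 for _ in handle), path.name))
    if not counts:
        raise FileNotFoundError(f"no files match {pattern!r} in {folder}")
    return min(counts)


# Demo on throwaway files; the names and contents are made up.
with tempfile.TemporaryDirectory() as folder:
    for name, lines in [("a.pdb", 5), ("b.pdb", 2), ("c.pdb", 9), ("notes.txt", 1)]:
        Path(folder, name).write_text("ATOM\n" * lines)
    result = shortest_file(folder)

print(result)
```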

(27)

About “Practical Computing for

Business Analytics”

● introduced B-school students to non-Microsoft world that exists

– Linux shell scripting, Linux OS, and world of FOSS for Linux
– R and Python

● created a Lubuntu based computing appliance

– distributed as .ova exported from VirtualBox
– free!
– I could get it totally configured by installing and setting up the software as I wanted
– students didn't waste time trying to get a myriad of tools working on their systems
– minimized hassle for our OU IT staff

● entire course was created and delivered with FOSS

● "I wouldn't use Windows at all if not for Excel"

● An open version of the course website is available from my hselab site in

the courses section

● The course will be offered again in Summer I 2015 (and every Summer I)
