CSE 6040
Computing for Data Analytics:
Methods and Tools
Lecture 1 – Course Overview
Da Kuang, Polo Chau, Georgia Tech
Course Staff
Instructor
◦ Da Kuang
◦ Postdoctoral Researcher, CSE
◦ Office: Klaus 1305 (facing the kitchen door)
◦ Office hour: Thu 4-5pm, Klaus 1315
Instructor
◦ Duen Horng (Polo) Chau
◦ Assistant Professor, CSE
◦ Office hour: Thu 4-5pm, Klaus 1315
TA
◦ Lianxiao (Shawn) Qiu
◦ MS CS Student
MS Analytics Curriculum
Computing
◦ Computing for Data Analysis: Methods and Tools
◦ Data and Visual Analytics
◦ Computational Data Analysis
◦ High Performance Computing
Statistics/Optimization
◦ Introduction to Analytical Methods
◦ Regression Analysis
◦ Deterministic Optimization
◦ Probabilistic Models
◦ Data Mining and Statistical Learning
◦ Simulation
◦ Time Series Analysis
Business
◦ Introduction to Business for Analytics
◦ Risk Analytics
◦ Project Management
◦ Pricing Analytics and Revenue Management
◦ Business Process Analysis and Design
◦ Customer Relationship Management
Data Analytics Problems
Regression: Predicting a numerical variable
Y-axis: # New homes sold in the US (shaded areas indicate US recessions)
[Hal Varian, Predicting the present with search engine data, 2013]
Predictors: search frequencies on Google
Target variable: # new homes sold in the US
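A minimal sketch of regression with a single predictor, using the closed-form least-squares fit. The numbers are made up for illustration and are not from the Varian study; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up values: a search-frequency index and a numerical target
search_index = [1.0, 2.0, 3.0, 4.0]
homes_sold = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(search_index, homes_sold)
predicted = slope * 5.0 + intercept  # prediction for a new predictor value
print(predicted)
```

With several predictors the same idea is usually solved with matrix operations.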
Data Analytics Problems
Classification: Predicting a categorical variable (or its probability)
Examples: statistical machine translation, query classification, news classification
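A rough sketch of predicting the probability of a category with logistic regression, fit by plain gradient descent. The data, learning rate, and step count are assumptions chosen for illustration; `fit_logistic` is a hypothetical helper, not a library function.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, labels, rate=0.1, steps=2000):
    """Fit p(y = 1 | x) = sigmoid(w * x + b) by gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = float(len(xs))
    for _ in range(steps):
        grad_w, grad_b = 0.0, 0.0
        for x, y in zip(xs, labels):
            err = sigmoid(w * x + b) - y  # prediction error for this example
            grad_w += err * x
            grad_b += err
        w -= rate * grad_w / n
        b -= rate * grad_b / n
    return w, b

# Made-up one-dimensional data: small x belongs to class 0, large x to class 1
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
labels = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(xs, labels)
p_low = sigmoid(w * 0.5 + b)   # estimated probability of class 1 at x = 0.5
p_high = sigmoid(w * 4.0 + b)  # estimated probability of class 1 at x = 4.0
print(p_low, p_high)
```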
Data Analytics Problems
Clustering: Finding patterns without human labeling
Both topic modeling and recommender systems can be viewed as clustering problems.
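A small sketch of clustering without labels, using k-means (Lloyd's algorithm). To keep the result reproducible, the first k points serve as the initial centers, which is a simplification; the points are made up.

```python
def kmeans(points, k, iterations=10):
    """Alternate between assigning points to the nearest center and re-averaging."""
    centers = [list(p) for p in points[:k]]  # simplistic initialization
    assign = []
    for _ in range(iterations):
        # Assignment step: index of the closest center for each point
        assign = []
        for p in points:
            dists = [sum((a - c) ** 2 for a, c in zip(p, center))
                     for center in centers]
            assign.append(dists.index(min(dists)))
        # Update step: move each center to the mean of its members
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = [sum(col) / float(len(members))
                              for col in zip(*members)]
    return centers, assign

# Made-up 2-D points forming two visible groups
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0),
          (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centers, assign = kmeans(points, 2)
print(assign)
```

No labels are given to the algorithm; the two groups come out of the distances alone.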
Data Analytics Pipeline
Data collection
Data storage/retrieval
Data analysis
Data visualization
Tools: sqlite, pandas, numpy, scikit-learn, igraph, bokeh, Scrapy, Selenium, BeautifulSoup
(On the original slide, names in red are Python packages.)
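As a rough illustration of the storage/retrieval stage, the sketch below uses sqlite through Python's built-in `sqlite3` module. The table name, columns, and rows are made up for the example.

```python
import sqlite3

# In-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (business TEXT, stars INTEGER)")

# Storage: insert a few made-up records
rows = [("cafe", 4), ("cafe", 5), ("diner", 3)]
conn.executemany("INSERT INTO reviews VALUES (?, ?)", rows)
conn.commit()

# Retrieval: average stars per business
query = ("SELECT business, AVG(stars) FROM reviews "
         "GROUP BY business ORDER BY business")
result = conn.execute(query).fetchall()
conn.close()

print(result)
```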
What you will learn in this course
Python programming (and a little bit of Java and Matlab) – 4 lectures
◦ One of Google’s 3 main languages
Python packages
◦ Data collection – 2 lectures
◦ Data storage and retrieval – 1 lecture
◦ Data analysis
◦ Data visualization – 2 lectures
Basic linear algebra (math tools, matrices, etc.) – 2 lectures
Basic numerical computing (how to do math programmatically) – 4 lectures
Several fundamental machine learning algorithms (focusing on intuitive ideas and software development for them)
◦ Linear regression – 2 lectures
◦ Logistic regression – 1 lecture
◦ K-means – 2 lectures
◦ Singular value decomposition – 4 lectures
Logistics
Course website (with tentative schedule; slides and assignments will be posted here):
http://www.cc.gatech.edu/~dkuang3/cse6040/
Discussion, Q&A, find teammates on Piazza (please sign up):
https://piazza.com/gatech/fall2014/cse6040/home
Homework/Project submissions on T-square (only for submission; use Piazza for discussion):
Logistics
3 homework assignments (30%)
Mid-term (20%)
Project (40%) – more details coming soon!
Class and Piazza participation (10%)
No late homework allowed.
Start now to find project teammates
What you will do in this course
Attend the lectures
ACTIVELY participate in class discussion
◦ Based on both in-class and Piazza activities
◦ Chat with / Help out your classmates on Piazza (but DO NOT share your answers)
◦ 10% of your grade
Read tutorials/references for programming languages
Read documentation for software packages
Solve simple math problems
What you will do in this course (cont’d)
Coding, of course!
◦ Homework #1: Collect real data online (10%)
◦ Homework #2: Visualize the data you collected (10%)
◦ Homework #3: Implement a machine learning algorithm (10%)
◦ Play with different machine learning frameworks/packages
Project (40%): Work on the Yelp Dataset
◦ Data for five cities (US, Canada, UK); four of them just released this month
◦ Includes businesses, attributes, check-ins, tips, users, user connections, reviews
◦ Work in teams of 2~3 students
◦ Get inspired: https://github.com/Yelp/dataset-examples (Again, in Python!) (DO NOT copy these examples for your project)
◦ Write your own team proposal
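A hedged sketch of a first look at review data, assuming the files hold one JSON object per line and that review records carry `business_id` and `stars` fields. Both are assumptions to check against the downloaded files; the sample lines below are made up and are not taken from the dataset.

```python
import json

# Made-up stand-ins for lines read from a review file
sample_lines = [
    '{"business_id": "b1", "stars": 4}',
    '{"business_id": "b1", "stars": 2}',
    '{"business_id": "b2", "stars": 5}',
]

def average_stars(lines):
    """Average star rating per business from JSON-encoded review lines."""
    totals = {}
    for line in lines:
        record = json.loads(line)
        key = record["business_id"]
        total, count = totals.get(key, (0, 0))
        totals[key] = (total + record["stars"], count + 1)
    return dict((k, t / float(c)) for k, (t, c) in totals.items())

averages = average_stars(sample_lines)
print(averages)
```

With a real file, the list would be replaced by iterating over the open file object line by line.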
Course Expectations
You will never say “I don’t have data”.
You will be exposed to the entire lifecycle of data analytics (in a simplified way).
You will be able to code in Python, a common scripting language for data analytics that is used by many companies (e.g., one of Google’s 3 main languages), and you will gain experience with many useful packages.
You will know some of the most fundamental machine learning algorithms. If you already know them, you will gain a deeper understanding of them from the computational aspect.
Why Python?
One of Google’s 3 main languages
Simpler code: Focus on concepts rather than machine details
More readable
Many useful packages
◦ Data manipulation
◦ Machine learning
◦ Image processing
◦ Natural language processing
◦ Spatial analysis
◦ Web application
◦ ...
Reasonably fast
Python Setup
A text editor + A terminal (command-line window)
◦ This is the convention for (Python) developers in companies
Text editor suggestions:
◦ Windows: Notepad++ (open source, with auto-indent and auto-fill)
◦ Linux: Vim, Emacs, Sublime
◦ Mac: Sublime, TextWrangler
We use Python 2.7, NOT the latest version, 3.x
Go Jackets!
Everyone – Sign up on Piazza:
https://piazza.com/gatech/fall2014/cse6040/home
Windows users – Install Python on your own machine: https://www.python.org/downloads/
◦ Make sure it’s Python 2.7.8, NOT Python 3.x
◦ Make sure “python” can be called on command-line (may need to set up environment variables)
◦ Make sure the “Python27” directory is located in a root directory, NOT in “Program Files”
Everyone – Set up your development environment
◦ See https://developers.google.com/edu/python/set-up
Everyone – Download your own Yelp dataset: (423M tarball) http://www.yelp.com/dataset_challenge
◦ We cannot share it, per the dataset’s terms and conditions
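One way to confirm the Python 2.7 requirement above is to ask the interpreter itself. This is a small sketch; `is_course_python` is a hypothetical helper name, and the script is written so that it runs under either Python 2 or Python 3.

```python
import sys

def is_course_python(version_info):
    """Return True for Python 2.7.x, the version used in this course."""
    return tuple(version_info[:2]) == (2, 7)

major_minor = tuple(sys.version_info[:2])
if is_course_python(sys.version_info):
    print("OK: running Python 2.7")
else:
    print("Warning: running Python %d.%d, not 2.7" % major_minor)
```

Saving this as a file and running it with `python` on the command line also confirms that the `python` command is reachable.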