(1)

CSE 6040

Computing for Data Analytics:

Methods and Tools

Lecture 1 – Course Overview

Da Kuang, Polo Chau, Georgia Tech

(2)

Course Staff

Instructor

◦ Da Kuang

◦ Postdoctoral Researcher, CSE

◦ Office: Klaus 1305 (facing the kitchen door)

◦ Office hour: Thu 4-5pm, Klaus 1315

Instructor

◦ Duen Horng (Polo) Chau

◦ Assistant Professor, CSE

◦ Office hour: Thu 4-5pm, Klaus 1315

TA

◦ Lianxiao (Shawn) Qiu

◦ MS CS Student

(3)

MS Analytics Curriculum

Computing

◦ Computing for Data Analysis: Methods and Tools

◦ Data and Visual Analytics

◦ Computational Data Analysis

◦ High Performance Computing

Statistics/Optimization

◦ Introduction to Analytical Methods

◦ Regression Analysis

◦ Deterministic Optimization

◦ Probabilistic Models

◦ Data Mining and Statistical Learning

◦ Simulation

◦ Time Series Analysis

Business

◦ Introduction to Business for Analytics

◦ Risk Analytics

◦ Project Management

◦ Pricing Analytics and Revenue Management

◦ Business Process Analysis and Design

◦ Customer Relationship Management

(4)

Data Analytics Problems

Regression: Predicting a numerical variable

Y-axis: # New homes sold in the US (shaded areas indicate US recessions)

(5)

Data Analytics Problems

Regression: Predicting a numerical variable

Search frequencies on Google used as predictors

[Hal Varian, Predicting the present with search engine data, 2013]

Target variable: # new homes sold in the US
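A minimal sketch of the regression idea, assuming a single predictor and an ordinary least-squares fit. The numbers are made up for illustration and are not the search or housing data from the slide; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up numbers for illustration only (not real search or housing data).
search_index = [1.0, 2.0, 3.0, 4.0]   # predictor
homes_sold = [3.0, 5.0, 7.0, 9.0]     # target; lies exactly on y = 2x + 1

slope, intercept = fit_line(search_index, homes_sold)
predicted = slope * 5.0 + intercept   # prediction for a new predictor value
print(predicted)
```

With several predictors the same idea becomes a matrix least-squares problem, which is where the linear algebra lectures come in.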

(6)

Data Analytics Problems

Classification: Predicting a categorical variable (or its probability)

Statistical machine translation, query classification, news classification
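A small sketch of "predicting a categorical variable (or its probability)", assuming a two-class problem and a logistic model. The weights below are hypothetical; in practice they would be learned from labeled data.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(features, weights, bias, threshold=0.5):
    """Return (label, probability) for one example under a linear score."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    prob = sigmoid(score)
    return (1 if prob >= threshold else 0), prob


# Hypothetical weights for illustration; not learned from any real data.
weights = [2.0, -1.0]
bias = 0.0

label_a, prob_a = classify([1.0, 2.0], weights, bias)   # score = 0
label_b, prob_b = classify([0.0, 3.0], weights, bias)   # score = -3
print(label_a, prob_a)
print(label_b, prob_b)
```

The probability is the model output; the label comes from comparing it with a threshold, which is a separate design choice.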

(7)

Data Analytics Problems

Clustering: Finding patterns without human labeling

Both topic modeling and recommender systems can be viewed as clustering problems.
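To make "finding patterns without human labeling" concrete, here is a minimal sketch of k-means (Lloyd's algorithm) on one-dimensional data. The points and starting centers are made up, and `kmeans_1d` is a hypothetical helper, not code from the course.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D data; returns (centers, assignments)."""
    centers = list(centers)
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        assignments = [
            min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its points.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = sum(members) / float(len(members))
    return centers, assignments


# Made-up data with two obvious groups; no labels are supplied.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, assignments = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)
print(assignments)
```

No labels are given to the algorithm; the grouping emerges from the distances alone.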

(8)

Data Analytics Pipeline

Data collection

Data storage/retrieval

Data analysis

(9)

Data Analytics Pipeline

Data collection

Data storage/retrieval

Data analysis

Data visualization

sqlite, pandas, numpy, scikit-learn, igraph, bokeh, Scrapy, Selenium, BeautifulSoup

Names in red are Python packages.
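The four stages can be sketched end to end with only the Python standard library. This is an illustration under assumptions: the records are made up, the `reviews` table is hypothetical, Python's built-in `sqlite3` module stands in for the storage stage, and a text bar chart stands in for a plotting package.

```python
import sqlite3

# Stage 1, data collection: hard-coded made-up records stand in for scraped data.
records = [("cafe", 4.5), ("cafe", 3.5), ("bar", 4.0)]

# Stage 2, data storage/retrieval: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (category TEXT, stars REAL)")
conn.executemany("INSERT INTO reviews VALUES (?, ?)", records)

# Stage 3, data analysis: average rating per category, computed in SQL.
rows = conn.execute(
    "SELECT category, AVG(stars) FROM reviews "
    "GROUP BY category ORDER BY category"
).fetchall()
conn.close()

# Stage 4, data visualization: a plain-text bar chart.
for category, avg_stars in rows:
    print("%-5s %s %.2f" % (category, "#" * int(round(avg_stars)), avg_stars))
```

In a real project each stage would use a dedicated package, but the flow of data between stages stays the same.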

(11)

What you will learn in this course

Python programming (and a little bit of Java and Matlab) – 4 lectures

◦ One of Google’s 3 main languages

Python packages

◦ Data collection – 2 lectures

◦ Data storage and retrieval – 1 lecture

◦ Data analysis

◦ Data visualization – 2 lectures

Basic linear algebra (math tools, matrices, etc.) – 2 lectures

Basic numerical computing (how to do math programmatically) – 4 lectures

Several fundamental machine learning algorithms (focusing on intuitive ideas and software development for them)

◦ Linear regression – 2 lectures

◦ Logistic regression – 1 lecture

◦ K-means – 2 lectures

◦ Singular value decomposition – 4 lectures

(12)

Logistics

Course website (with tentative schedule; slides and assignments will be posted here):

http://www.cc.gatech.edu/~dkuang3/cse6040/

Discussion, Q&A, find teammates on Piazza (please sign up):

https://piazza.com/gatech/fall2014/cse6040/home

Homework/Project submissions on T-square (only for submission; use Piazza for discussion):

(13)

Logistics

3 homework assignments (30%)

Mid-term (20%)

Project (40%) – more details coming soon!

Class and Piazza participation (10%)

No late homework allowed.

Start now to find project teammates

(14)

What you will do in this course

Attend the lectures

ACTIVELY participate in class discussion

◦ Based on both in-class and Piazza activities

◦ Chat with / Help out your classmates on Piazza (but DO NOT share your answers)

◦ 10% of your grade

Read tutorials/references for programming languages

Read documentation for software packages

Solve simple math problems

(15)

What you will do in this course

(cont’d)

Coding, of course!

◦ Homework #1: Collect real data online (10%)

◦ Homework #2: Visualize the data you collected (10%)

◦ Homework #3: Implement a machine learning algorithm (10%)

◦ Play with different machine learning frameworks/packages

Project (40%): Work on the Yelp Dataset

◦ Data for five cities (US, Canada, UK); four of them just released this month

◦ Includes businesses, attributes, check-ins, tips, users, user connections, reviews

◦ Work in teams of 2~3 students

◦ Get inspired: https://github.com/Yelp/dataset-examples (Again, in Python!) (DO NOT copy these examples for your project)

◦ Write your own team proposal
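A first look at such a dataset might resemble the sketch below. It assumes the files store one JSON object per line, and the field names (`name`, `city`, `stars`) are assumptions for illustration; check the dataset's own documentation before relying on them. The sample records are made up.

```python
import io
import json


def load_json_lines(stream):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]


# Tiny made-up sample standing in for a real file; field names are assumptions.
sample = io.StringIO(
    u'{"name": "A", "city": "City X", "stars": 4.0}\n'
    u'{"name": "B", "city": "City X", "stars": 3.0}\n'
    u'{"name": "C", "city": "City Y", "stars": 5.0}\n'
)

businesses = load_json_lines(sample)

# Group star ratings by city, then average them.
stars_by_city = {}
for b in businesses:
    stars_by_city.setdefault(b["city"], []).append(b["stars"])
avg_by_city = dict((c, sum(s) / len(s)) for c, s in stars_by_city.items())
print(avg_by_city)
```

For a real file, replace the in-memory `sample` with an opened file object; reading line by line keeps memory use low on a large download.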

(16)

Course Expectation

You will never say “I don’t have data”.

You will be exposed to the entire lifecycle of data analytics (in a simplified way).

You will be able to code in Python, a common scripting language for data analytics that is used by many companies (e.g., it is one of Google’s 3 main languages), and you will have experience with many useful packages.

You will know some of the most fundamental machine learning algorithms. If you already know them, you will gain a deeper understanding of them from the computational aspect.

(17)

Why Python?

One of Google’s 3 main languages

Simpler code: Focus on concepts rather than machine details

More readable

Many useful packages

◦ Data manipulation

◦ Machine learning

◦ Image processing

◦ Natural language processing

◦ Spatial analysis

◦ Web application

◦ ...

Reasonably fast

(18)

Python Setup

A text editor + A terminal (command-line window)

◦ This is the convention for (Python) developers in companies

Text editor suggestions:

◦ Windows: Notepad++ (open source, with auto-indent and auto-fill)

◦ Linux: Vim, Emacs, Sublime

◦ Mac: Sublime, TextWrangler

We use Python 2.7, NOT the latest version, 3.x

(19)

Go Jackets!

Everyone – Sign up on Piazza:

https://piazza.com/gatech/fall2014/cse6040/home

Windows users – Install Python on your own machine: https://www.python.org/downloads/

◦ Make sure it’s Python 2.7.8, NOT Python 3.x

◦ Make sure “python” can be called on command-line (may need to set up environment variables)

◦ Make sure the “Python27” directory is located in a root directory, NOT in “Program Files”
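Once "python" runs from the command line, a short script can confirm which version is being picked up. This is a sketch, not part of the official setup steps; `is_course_python` is a hypothetical helper name.

```python
import sys


def is_course_python(version_info):
    """True when the version is 2.7.x, the version the course uses."""
    return tuple(version_info[:2]) == (2, 7)


found = "%d.%d.%d" % tuple(sys.version_info[:3])
if is_course_python(sys.version_info):
    print("OK: running Python " + found)
else:
    print("Warning: expected Python 2.7.x, found " + found)
```

Save it as a file and run it with `python` followed by the file name; if the warning appears, the PATH environment variable probably points at a different installation.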

Everyone – Set up your development environment

◦ See https://developers.google.com/edu/python/set-up

Everyone – Download your own Yelp dataset: (423M tarball) http://www.yelp.com/dataset_challenge

◦ We cannot share it with you under the dataset’s terms and conditions
