Cleveland State University
CIS 695 Big Data Processing and Data Analytics (3-0-3) – 2016
Section 51 – Class Nbr. 5493. Tues, Thur – TBA
Prerequisites:
CIS 505 and CIS 530. CIS 612, CIS 660 Preferred.
Instructor:
Dr. Sunnie S. Chung
Office Location:
FH211
Phone:
216 687 4732
Email:
sschung.cis@gmail.com
s.chung@csuohio.edu
Webpage:
http://grail.csuohio.edu/~sschung
Office Time:
Tues, Thurs 2:00 – 3:30 PM and 5:45 – 6:45 PM (or by appointment)
Class Location:
TBA
Section 51: Tue & Thu, TBA
Key Concepts: Big Data Processing and Parallel Computing, Google’s MapReduce and Apache Hadoop, Hadoop/MapReduce Based Big Data Processing Systems: Google’s Bigtable, Facebook HBase, Hive, Apache Pig Latin, Key Value Store Systems: MongoDB, Cassandra, Cloud Computing Systems: Google Cloud, Amazon Elastic Cloud, Parallel Data Warehouse Based Big Data Processing Systems, Data Analytics using OLAP Cube, Text Data Analytics, Data Analytics for Big Data, Building Big Data Processing Infrastructures for Data Analytics, Data Analytics in Twitter and LinkedIn, and Practicum in Data Analytics: Case Studies from Progressive, Explorys by IBM, and Data Think
List of Required Materials:
Microsoft SQL Server 2012
Microsoft Visual Studio 2012 or later
Microsoft SQL Server Business Intelligence Data Analytic Tool 2012 – OLAP Server and SQL Server Data Tool
These are available through the Microsoft Academic Alliance program:
http://e5.onthehub.com/WebStore/ProductsByMajorVersionList.aspx?ws=31b9929b-c09b-e011-969d-0030487d8897
Adventure Works 2012 Data Warehouse Database for SQL Server 2012 – directions will be given in class
Applied Analytics Using SAS Enterprise Miner
Apache Pig Latin: http://pig.apache.org/docs/r0.11.1/basic.html
WEKA: http://www.cs.waikato.ac.nz/~ml/weka/
R and MapR
Detailed instructions on installing and setting up the Big Data processing systems will be given in class.
Text Book:
1. Lecture Notes – Will be given in class
2. List of Selected Industry R&D Papers on Data Analytics and Big Data Processing will be given in class
Supplementary Text Book:
1. “Data Mining with Microsoft SQL Server BI Data Tool” (2008 or 2012), Jamie MacLennan, Bogdan Crivat. Publisher: Wiley, 1st Edition. ISBN-10: 0470277742 | ISBN-13: 978-0470277744
Official Calendar
Please consult the page http://www.csuohio.edu/enrollmentservices/registrar/calendar/index.html
Final exam: Thur, Dec 11, 4:00-6:00 PM.
Grading: The course grade is based on a student's overall performance throughout the semester. The final grade is distributed among the following components:
1. Exams (Midterm & Final): 45% (15% Midterm, 30% Final)
2. Computer Labs: 25% (about 3-4 assignments)
3. One project on Big Data processing (2-3 person group project): 20%
4. Research Topic Presentation: 10% (tentative)
A 94%+
A- 90% - 93%
B+ 87% - 89%
B 80% - 86%
B- 70% - 79%
C < 70%
D, F < 60%
A: Outstanding (student's performance is genuinely excellent)
B: Very Good (student's performance is clearly commendable but not necessarily outstanding)
C: Good (student's performance meets every course requirement and is acceptable; not distinguished)
D: Below Average (student's performance fails to meet course objectives and standards)
F: Failure (student's performance is unacceptable)
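As a worked example of the weighting in the Grading section, the sketch below combines the five components into one course percentage. The weights come from the list above; the function name, the dictionary keys, and the sample scores are hypothetical and used only for illustration. It does not assign letter grades.

```python
# Component weights as listed in the Grading section (they sum to 100%).
WEIGHTS = {
    "midterm": 0.15,
    "final": 0.30,
    "labs": 0.25,
    "project": 0.20,
    "presentation": 0.10,
}


def course_score(scores):
    """Return the weighted course percentage from per-component percentages (0-100)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student scores, for illustration only.
example = {"midterm": 80, "final": 90, "labs": 100, "project": 85, "presentation": 70}
total = course_score(example)
print(f"{total:.1f}")
```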
Examination Policy: Students are allowed to bring to the tests a summary page (standard letter size) with their own notes. During the exams: (1) the use of books, cell phones, calculators, or any electronic devices is prohibited, and (2) students must not share any materials.
Make-Up Exam Policy: No makeup exams will be given unless the instructor is notified and agrees in advance. Requests will be considered only in cases of exceptional, demonstrated need.
Homework Policy: The students are expected to attend all classes. The students are responsible for collecting the notes, handouts and any other course material distributed during the class period. All assignments must be individually and independently completed and must represent the effort of the student turning in the assignment. Should two or more students turn in substantially the same solution or output, in the judgment of the instructor, the solution will be considered group effort. All involved in group effort homework will receive a zero grade for that assignment. A student turning in a group effort assignment more than once will automatically receive an “F” grade for the course.
Late Assignment: All lab assignments are due at the beginning of class on the date specified. Laboratory Assignments handed in after the class has begun will be accepted with a 25% grade penalty for up to a week and then not accepted at all. All laboratory assignments must be completed. Failure to do so will lower your course grade one additional letter grade.
Student Conduct: Students are expected to do their own work. Academic misconduct, student misconduct, cheating and plagiarism will not be tolerated. Violations will be subject to disciplinary action as specified in the CSU Student Conduct Code. A copy can be obtained on the web page at http://www.csuohio.edu/studentlife/StudentCodeOfConduct.pdf or by contacting Valerie Hinton Hannah, Judicial Affairs Officer in the Department of Student Life (MC 106, email: v.hintonhannah@csuohio.edu). For more information, consult the CSU Judicial Affairs web page at
http://www.csuohio.edu/studentlife/jaffairs/faq.html
Course Schedule: The tentative schedule of topics and their order of coverage is given below. The schedule and the topics covered may vary depending on the students’ progress.
1 - 2 Big Data Processing and Parallel Computing
Google’s MapReduce and Apache Hadoop
Google MapReduce Tutorial
Apache Hadoop File System
• MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean (Google) and Sanjay Ghemawat (Google), in the proceedings of OSDI 2004
http://grail.csuohio.edu/~sschung/CIS695/mapreduce-osdi04.pdf
• Apache Hadoop white papers by Apache and Yahoo
https://hadoop.apache.org/
Lab 1: MapReduce Programming on Hadoop
Lecture Notes, Listed Papers
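To give a feel for the programming model covered in this unit, here is a minimal in-memory sketch of the map, shuffle, and reduce steps applied to word counting. It is plain Python rather than the Hadoop API, and the function names and sample documents are assumptions made for illustration; the actual lab setup will be given in class.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input documents, for illustration only.
docs = ["big data processing", "big data analytics"]
counts = word_count(docs)
print(counts)
```

On a real cluster the map and reduce calls run in parallel on different machines, and the shuffle moves data across the network; this sketch only shows the data flow.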
3 – 5 Hadoop/Map Reduce Based Big Data Processing Systems:
Google’s Bigtable
Data Models for Unstructured Data Processing
• Bigtable: A Distributed Storage System for Structured Data, by Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Google, Inc. in the Proceedings of OSDI 2006
http://grail.csuohio.edu/~sschung/CIS695/googlebigtable-osdi06.pdf
Facebook HBase, Hive
• Data Warehousing and Analytics Infrastructure at Facebook, in SIGMOD 2010 by Ashish Thusoo (Facebook), et al.
http://grail.csuohio.edu/~sschung/CIS695/facebook_sigmodwarehouse2010_CIS695.pdf
http://grail.csuohio.edu/~sschung/CIS695/hive.pdf
• Petabyte Scale Databases and Storage Systems Deployed at Facebook, in SIGMOD 2013 by Dhruba Borthakur
http://grail.csuohio.edu/~sschung/IST734/Facebook_Sig2013_Paper_IST734.pdf
http://grail.csuohio.edu/~sschung/IST734/Facebook_Sigmod2013_Presentation_IST734.pdf
https://hbase.apache.org/
Apache Pig Latin
• Pig Latin: A Not-So-Foreign Language for Data Processing in SIGMOD 2008
http://pig.apache.org/docs/r0.11.1/basic.html
http://grail.csuohio.edu/~sschung/cis612/WebDataProcessingonDWYahooSig08.pdf
Key Value Stores
MongoDB, Cassandra
https://www.mongodb.com/
http://cassandra.apache.org/
Cloud
Google Cloud
Amazon Elastic Cloud
http://aws.amazon.com/
http://aws.amazon.com/what-is-cloud-computing/
https://cloud.google.com/storage/docs/resources-support#samples
http://en.m.wikipedia.org/wiki/Google_Cloud_Platform
http://www.wired.com/2014/03/urs-google-story/
Lecture Notes, Listed Papers
Lab 2: Processing Streaming Twitter Logging Data to Build a Data Warehouse or MongoDB on Hadoop
6 – 7 Parallel Data Warehouse Based Big Data Processing Systems:
Data Warehouse and OLAP
- Decision Support Technology
- On Line Analytical Processing
- Star Schema
- OLAP Aggregation Operators: Data Cube, Roll Up, Drill Down
- Building BI Data Analytics using DW and OLAP
- MDX, DMX
• An Overview of Data Warehousing and OLAP Technology by Surajit Chaudhuri (Microsoft) and Umeshwar Dayal (HP Labs), in the proceedings of IEEE 1995
• Data Cube: A Relational Aggregation Operator Generalizing Group By, Cross Tab, and SubTotals by Jim Gray (Microsoft), et al, in the proceedings of IEEE 1996
MS PDW Optimization
• Query Optimization in Microsoft SQL Server Parallel Data Warehouse in SIGMOD 2012 by Srinath Shankar, Microsoft, David DeWitt, Microsoft, César Galindo-Legaria, Microsoft, et al.
http://grail.csuohio.edu/~sschung/IST734/MicrosoftSqlOptimizationSig2012-shankar.pdf
Parallel Data Warehouse with OLAP Query Processing: Microsoft
Extended PDW with MapReduce and Hadoop: Oracle, Teradata
Columnar Databases: SAP HANA Databases
Extended PDW with Columnar Data Processing: Teradata, Microsoft
PDW on Cloud
Azure Cloud System by Microsoft, published in IEEE 2011
http://grail.csuohio.edu/~sschung/IST734/azure2_Microsoft_IEEE2011.pdf
http://grail.csuohio.edu/~sschung/IST734/AzureSQLMicrosoft_Sigmod2010-campbell
Big Data Processing System with PDW on Cloud
http://grail.csuohio.edu/~sschung/IST734/CloudVista_huiqixu_vldb2012.pdf
Listed papers. Lecture Notes
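The Data Cube operator listed in this unit generalizes Group By by computing the aggregate for every subset of the chosen dimensions. The sketch below illustrates that idea in plain Python; the function name, the "ALL" placeholder, and the sample sales rows are assumptions for illustration and do not reflect SQL Server or MDX syntax.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # placeholder for a dimension that has been aggregated away


def data_cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions` (2^n group-bys in total)."""
    cube = defaultdict(float)
    for size in range(len(dimensions) + 1):
        for kept in combinations(dimensions, size):
            for row in rows:
                # Keep the value of each retained dimension, replace the rest with ALL.
                key = tuple(row[d] if d in kept else ALL for d in dimensions)
                cube[key] += row[measure]
    return dict(cube)


# Hypothetical fact rows, for illustration only.
sales = [
    {"region": "East", "year": 2015, "amount": 10.0},
    {"region": "East", "year": 2016, "amount": 20.0},
    {"region": "West", "year": 2016, "amount": 5.0},
]
cube = data_cube(sales, ["region", "year"], "amount")
for key in sorted(cube, key=str):
    print(key, cube[key])
```

In these terms, a roll up moves from a cell such as ("East", 2016) to ("East", "ALL"), and a drill down moves in the opposite direction.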
8 - 9 Big Data Processing Techniques for Data Analytic Tasks
MapReduce Join Algorithms
Pig Latin Job Execution Steps and Operators: Filter By, Group, Cogroup
Unstructured Data Processing: Data Transformation from Unstructured/Semi-Structured Data to Object Exchange Model (JSON, XML) or Relation/CSV
Introduction to Information Retrieval and Web Logging Data Processing
Text Data Analytics, Data Analytics for Big Data
Optimization Techniques for Big Data Processing
Lab 3: Unstructured Logging Data Processing to JSON Format for Object Exchange Model (OEM)
Lecture Notes. Selected Papers
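The transformation from unstructured log text to JSON covered in this unit can be sketched as follows. The input is assumed to follow the Apache Common Log Format; the actual log format used in the lab may differ, and the pattern, function name, and sample line are hypothetical.

```python
import json
import re

# Assumed input: Apache Common Log Format (host ident user [time] "request" status size).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def log_line_to_json(line):
    """Parse one log line into a JSON object string, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert numeric fields so that they are stored as JSON numbers.
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return json.dumps(record)


# Hypothetical log line, for illustration only.
sample = '127.0.0.1 - - [10/Oct/2016:13:55:36 -0400] "GET /index.html HTTP/1.1" 200 2326'
print(log_line_to_json(sample))
```

Lines that do not match the pattern return None, so a caller can count or set aside malformed records instead of failing on them.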
10 - 11 Data Analytics in Twitter and LinkedIn:
Twitter:
Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query Suggestion Architecture
Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin (Twitter, Inc.)
http://grail.csuohio.edu/~sschung/IST734/FastDataTwitter.pdf
LinkedIn:
The “Big Data” Ecosystem at LinkedIn
Roshan Sumbaly, Jay Kreps, and Sam Shah (LinkedIn)
http://grail.csuohio.edu/~sschung/IST734/BigDataEcoSystemLinkedInSigmod2014.pdf
Avatara: OLAP for Webscale Analytics Products
Lili Wu, et al. (LinkedIn)
http://grail.csuohio.edu/~sschung/cis612/LinkedIn_liliwu_vldb2012.pdf
Lab 4: Data Analytics with Streaming Twitter Messages
Lecture Notes. Listed Papers
12 – 13 Practicum in Data Analytics: Case Study
• Progressive
• Explorys by IBM
• Data Think
Lecture Notes. Selected Papers
14 Data Analytics and Security
Text mining – Intrusion detection in Network and System
Database Security
Security in Cloud
Lecture Notes
15 - 16 Presentation of Significant Research Papers on Data Analytics and Big Data Processing: List of Selected Papers will be given in class.
Selected Papers
NOTE: The instructor reserves the right to retain, for pedagogical reasons, either the original or a copy of your work submitted either individually or as a group project for this class. Students' names will be deleted from any retained items.