Cleveland State University

CIS 695 Big Data Processing and Data Analytics (3-0-3) – 2016

Section 51 – Class Nbr. 5493. Tues, Thur – TBA

Prerequisites:

CIS 505 and CIS 530. CIS 612, CIS 660 Preferred.

Instructor:

Dr. Sunnie S. Chung

Office Location:

FH211

Phone:

216 687 4732

Email:

sschung.cis@gmail.com

s.chung@csuohio.edu

Webpage:

http://grail.csuohio.edu/~sschung

Office Time:

Tues, Thurs 2:00 – 3:30 PM and 5:45 – 6:45 PM (or by appointment)

Class Location:

TBA

Section 51: Tue & Thu, TBA

Key Concepts: Big Data Processing and Parallel Computing; Google’s MapReduce and Apache Hadoop; Hadoop/MapReduce-Based Big Data Processing Systems: Google’s Big Table, Facebook HBase, Hive, Apache Pig Latin; Key Value Store Systems: MongoDB, Cassandra; Cloud Computing Systems: Google Cloud, Amazon Elastic Cloud; Parallel Data Warehouse Based Big Data Processing Systems; Data Analytics using OLAP Cube; Text Data Analytics; Data Analytics for Big Data; Building Big Data Processing Infrastructures for Data Analytics; Data Analytics in Twitter and LinkedIn; and Practicum in Data Analytics: Case Studies from Progressive, Explorys by IBM, and Data Think.

List of Required Materials:

• Microsoft SQL Server 2012
• Microsoft Visual Studio 2012 or higher
• Microsoft SQL Server Business Intelligence Data Analytic Tool 2012 – OLAP Server and SQL Server Data Tool

The Microsoft tools above are available through the Microsoft Academic Alliance program:

http://e5.onthehub.com/WebStore/ProductsByMajorVersionList.aspx?ws=31b9929b-c09b-e011-969d-0030487d8897

• Adventure Works 2012 Data Warehouse Database for SQL Server 2012 – directions will be given in class
• Applied Analytics Using SAS Enterprise Miner
• Apache Pig Latin: http://pig.apache.org/docs/r0.11.1/basic.html
• WEKA: http://www.cs.waikato.ac.nz/~ml/weka/
• R and MapR

Detailed instructions on installing and setting up the Big Data Processing Systems will be given in class.

Textbook:

1. Lecture Notes – Will be given in class

2. List of Selected Industry R&D Papers on Data Analytics and Big Data Processing will be given in class

Supplementary Textbook:

1. “Data Mining with Microsoft SQL Server BI Data Tool” (2008 or 2012), Jamie MacLennan, Bogdan Crivat. Publisher: Wiley, 1st Edition. ISBN-10: 0470277742 | ISBN-13: 978-0470277744

Official Calendar

Please consult the page http://www.csuohio.edu/enrollmentservices/registrar/calendar/index.html

Final exam: Thur, Dec 11 4:00-6:00 PM.

Grading: The course grade is based on a student's overall performance throughout the entire semester. The final grade is distributed among the following components:

1. Exams (Midterm & Final): 45% (15% Midterm, 30% Final)
2. Computer Labs: 25% (about 3-4 assignments)
3. One project on big data processing (2-3 person group project): 20%
4. Research Topic Presentation: 10% (tentative)

A      94% +        Outstanding (student's performance is genuinely excellent)
A-     90% - 93%
B+     87% - 89%
B      80% - 86%    Very Good (student's performance is clearly commendable but not necessarily outstanding)
B-     70% - 79%
C      < 70%        Good (student's performance meets every course requirement and is acceptable; not distinguished)
D, F   < 60%        D: Below Average (student's performance fails to meet course objectives and standards)
                    F: Failure (student's performance is unacceptable)
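As a rough illustration of how the component weights listed under Grading combine into an overall percentage, the following Python sketch computes a weighted score. The function name, dictionary keys, and sample scores are hypothetical, and the 10% presentation weight is marked tentative in the syllabus, so the instructor's actual calculation may differ.

```python
# Component weights in percent, taken from the Grading section.
# The presentation weight is listed as tentative in the syllabus.
WEIGHTS = {
    "midterm": 15,
    "final": 30,
    "labs": 25,
    "project": 20,
    "presentation": 10,
}


def weighted_score(scores):
    """Return the overall percentage for per-component scores on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    # Weights are whole percents, so divide by 100 once at the end.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores for one student (illustration only).
example = {"midterm": 80, "final": 90, "labs": 100, "project": 85, "presentation": 95}
print(weighted_score(example))
```

The sketch stops at the overall percentage and does not map it to a letter grade, since the scale above leaves some boundaries to the instructor's judgment.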

Examination Policy: Students are allowed to bring to the tests a summary page (standard letter size) with their own notes. During the exams: (1) the use of books, cell phones, calculators, or any electronic devices is prohibited, and (2) students must not share any materials.

Make-Up Exam Policy: No makeup exams will be given unless notified and agreed to in advance. Requests will be considered only in case of exceptional demonstrated need.

Homework Policy: The students are expected to attend all classes. The students are responsible for collecting the notes, handouts and any other course material distributed during the class period. All assignments must be individually and independently completed and must represent the effort of the student turning in the assignment. Should two or more students turn in substantially the same solution or output, in the judgment of the instructor, the solution will be considered group effort. All involved in group effort homework will receive a zero grade for that assignment. A student turning in a group effort assignment more than once will automatically receive an “F” grade for the course.

Late Assignment: All lab assignments are due at the beginning of class on the date specified. Laboratory Assignments handed in after the class has begun will be accepted with a 25% grade penalty for up to a week and then not accepted at all. All laboratory assignments must be completed. Failure to do so will lower your course grade one additional letter grade.

Student Conduct: Students are expected to do their own work. Academic misconduct, student misconduct, cheating and plagiarism will not be tolerated. Violations will be subject to disciplinary action as specified in the CSU Student Conduct Code. A copy can be obtained on the web at http://www.csuohio.edu/studentlife/StudentCodeOfConduct.pdf or by contacting Valerie Hinton Hannah, Judicial Affairs Officer in the Department of Student Life (MC 106, email: v.hintonhannah@csuohio.edu). For more information, consult the CSU Judicial Affairs web page at http://www.csuohio.edu/studentlife/jaffairs/faq.html

Course Schedule: The tentative schedule of topics and their order of coverage is given below. The schedule and the topics covered may vary depending on the students’ progress.

Weeks 1 - 2: Big Data Processing and Parallel Computing

Google’s MapReduce and Apache Hadoop

Google MapReduce Tutorial

Apache Hadoop File System

• MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean (Google) and Sanjay Ghemawat (Google) in the proceedings of OSDI 2004

http://grail.csuohio.edu/~sschung/CIS695/mapreduce-osdi04.pdf

• Apache Hadoop in White Papers by Apache, Yahoo

https://hadoop.apache.org/

Lab 1: MapReduce Programming on Hadoop

Lecture Notes, Listed Papers
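For orientation before Lab 1, the map, shuffle, and reduce phases of the MapReduce model can be sketched in plain Python with a word count. This is a single-process illustration only, not Hadoop code and not the lab solution; all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "Big cluster"]))
```

On a real cluster the map and reduce functions run in parallel on different machines and the shuffle is performed by the framework; here everything runs in one process so the data flow is easy to follow.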

Weeks 3 – 5: Hadoop/MapReduce-Based Big Data Processing Systems:

Google’s Big Table

Data Models for Unstructured Data Processing

• Bigtable: A Distributed Storage System for Structured Data, by Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Google, Inc. in the Proceedings of OSDI 2006

http://grail.csuohio.edu/~sschung/CIS695/googlebigtable-osdi06.pdf

Facebook HBase, Hive

• Data Warehousing and Analytics Infrastructure at Facebook, in SIGMOD 2010 by Ashish Thusoo (Facebook), et al,

http://grail.csuohio.edu/~sschung/CIS695/facebook_sigmodwarehouse2010_CIS695.pdf

http://grail.csuohio.edu/~sschung/CIS695/hive.pdf

• Petabyte Scale Databases and Storage Systems Deployed at Facebook, in SIGMOD 2013 by Dhruba Borthakur

http://grail.csuohio.edu/~sschung/IST734/Facebook_Sig2013_Paper_IST734.pdf

http://grail.csuohio.edu/~sschung/IST734/Facebook_Sigmod2013_Presentation_IST734.pdf

https://hbase.apache.org/

Apache Pig Latin

• Pig Latin: A Not-So-Foreign Language for Data Processing in SIGMOD 2008

http://pig.apache.org/docs/r0.11.1/basic.html

http://grail.csuohio.edu/~sschung/cis612/WebDataProcessingonDWYahooSig08.pdf

Key Value Stores

MongoDB, Cassandra

https://www.mongodb.com/
http://cassandra.apache.org/

Cloud

Google Cloud

Amazon Elastic Cloud

http://aws.amazon.com/
http://aws.amazon.com/what-is-cloud-computing/
https://cloud.google.com/storage/docs/resources-support#samples
http://en.m.wikipedia.org/wiki/Google_Cloud_Platform
http://www.wired.com/2014/03/urs-google-story/

Lecture Notes, Listed Papers

Lab 2: Processing Streaming Twitter Logging Data to Build a Data Warehouse or MongoDB on Hadoop

Weeks 6 – 7: Parallel Data Warehouse Based Big Data Processing Systems:

Data Warehouse and OLAP

- Decision Support Technology
- On Line Analytical Processing
- Star Schema
- OLAP Aggregation Operators: Data Cube, Roll Up, Drill Down
- Building BI Data Analytics using DW and OLAP
- MDX, DMX

• An Overview of Data Warehousing and OLAP Technology by Surajit Chaudhuri (Microsoft) and Umeshwar Dayal (HP Labs), in the proceedings of IEEE 1995

• Data Cube: A Relational Aggregation Operator Generalizing Group By, Cross Tab, and SubTotals by Jim Gray (Microsoft), et al, in the proceedings of IEEE 1996

MS PDW Optimization

• Query Optimization in Microsoft SQL Server Parallel Data Warehouse in SIGMOD 2012 by Srinath Shankar, Microsoft, David DeWitt, Microsoft, César Galindo-Legaria, Microsoft, et al.

http://grail.csuohio.edu/~sschung/IST734/MicrosoftSqlOptimizationSig2012-shankar.pdf

Parallel Data Warehouse with OLAP Query Processing: Microsoft

Extended PDW with MapReduce and Hadoop: Oracle, Teradata

Columnar Databases: SAP HANA Databases

Extended PDW with Columnar Data Processing: Teradata, Microsoft

PDW on Cloud

Azure Cloud System by Microsoft, published in IEEE 2011

http://grail.csuohio.edu/~sschung/IST734/azure2_Microsoft_IEEE2011.pdf
http://grail.csuohio.edu/~sschung/IST734/AzureSQLMicrosoft_Sigmod2010-campbell

Big Data Processing System with PDW on Cloud

http://grail.csuohio.edu/~sschung/IST734/CloudVista_huiqixu_vldb2012.pdf

Listed papers. Lecture Notes
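The OLAP aggregation operators named for Weeks 6 – 7 (Data Cube, Roll Up, Drill Down) can be illustrated with a small in-memory sketch. It assumes a toy fact table with hypothetical `year`, `region`, and `sales` columns; the function names are also hypothetical, and a real OLAP server would express these operations in SQL or MDX rather than Python.

```python
from collections import defaultdict
from itertools import combinations


def roll_up(facts, dimensions, measure):
    """Sum `measure` over all rows, grouped by the given dimensions.

    Passing fewer dimensions rolls the data up to a coarser level;
    passing more dimensions drills down to a finer level.
    """
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


def data_cube(facts, dimensions, measure):
    """Compute the aggregate for every subset of the dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for subset in combinations(dimensions, size):
            cube[subset] = roll_up(facts, subset, measure)
    return cube


# Hypothetical fact table (illustration only).
FACTS = [
    {"year": 2015, "region": "East", "sales": 10.0},
    {"year": 2015, "region": "West", "sales": 5.0},
    {"year": 2016, "region": "East", "sales": 7.0},
]

cube = data_cube(FACTS, ("year", "region"), "sales")
print(cube[("year",)])  # totals per year (rolled up over region)
print(cube[()])         # grand total (rolled up over every dimension)
```

With two dimensions the cube holds four groupings: by year and region, by year, by region, and the grand total. This mirrors the idea of generalizing GROUP BY across all dimension subsets.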

Weeks 8 - 9: Big Data Processing Techniques for Data Analytic Tasks

MapReduce Join Algorithms

Pig Latin Job Execution Steps and Operators: Filter By, Group, Cogroup

Unstructured Data Processing: Data Transformation from Unstructured/Semi-Structured Data to Object Exchange Model (JSON, XML) or Relation/CSV

Introduction to Information Retrieval and Web Logging Data Processing

Text Data Analytics, Data Analytics for Big Data

Optimization Techniques for Big Data Processing

Lab 3: Unstructured Logging Data Processing to JSON Format for Object Exchange Model (OEM)

Lecture Notes. Selected Papers
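To give a feel for the kind of transformation Lab 3 refers to, the sketch below turns free-form log lines into JSON records. The log layout (a timestamp, a level, then `key=value` tokens), the regular expression, and the function names are all assumptions made for illustration; the actual lab data and requirements will be given in class.

```python
import json
import re

# Assumed layout: "<date> <time> <LEVEL> key=value key=value ..."
LOG_PATTERN = re.compile(r"^(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) (?P<fields>.*)$")


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = {"timestamp": match.group("timestamp"), "level": match.group("level")}
    for token in match.group("fields").split():
        key, sep, value = token.partition("=")
        if sep:  # keep only well-formed key=value tokens
            record[key] = value
    return record


def to_json_lines(lines):
    """Convert matching log lines to JSON strings, skipping unparseable lines."""
    records = (parse_line(line) for line in lines)
    return [json.dumps(r, sort_keys=True) for r in records if r is not None]


# Hypothetical input (illustration only).
SAMPLE = [
    "2016-01-12 10:15:00 INFO user=alice action=login",
    "not a log line",
]

for json_line in to_json_lines(SAMPLE):
    print(json_line)
```

Lines that do not fit the assumed layout are dropped rather than guessed at, which keeps the output well-formed at the cost of losing some input.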

Weeks 10 - 11: Data Analytics in Twitter and LinkedIn:

Twitter:

• Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query Suggestion Architecture, by Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin (Twitter, Inc.)

http://grail.csuohio.edu/~sschung/IST734/FastDataTwitter.pdf

LinkedIn:

• The “Big Data” Ecosystem at LinkedIn, by Roshan Sumbaly, Jay Kreps, and Sam Shah (LinkedIn)

http://grail.csuohio.edu/~sschung/IST734/BigDataEcoSystemLinkedInSigmod2014.pdf

• Avatara: OLAP for Webscale Analytics Products, by Lili Wu, et al. (LinkedIn)

http://grail.csuohio.edu/~sschung/cis612/LinkedIn_liliwu_vldb2012.pdf

Lab 4: Data Analytics with Streaming Twitter Messages

Lecture Notes, Listed Papers

Weeks 12 – 13: Practicum in Data Analytics: Case Study

• Progressive
• Explorys by IBM
• Data Think

Lecture Notes, Selected Papers

Week 14: Data Analytics and Security

Text Mining – Intrusion Detection in Network and System

Database Security

Security in Cloud

Lecture Notes

Weeks 15 - 16: Presentation of Significant Research Papers on Data Analytics and Big Data Processing: List of Selected Papers will be given in class.

Selected Papers

NOTE: The instructor reserves the right to retain, for pedagogical reasons, either the original or a copy of your work submitted either individually or as a group project for this class. Students' names will be deleted from any retained items.
