Data Management Course Syllabus


Data Management: This course is designed to give students a broad understanding of modern storage systems, data management techniques, and how these systems are used to store, access, and analyze Big Data. Topics include data modeling; storage system design for disk arrays, network attached storage, clusters, and data centers; relational databases and the use of MADlib techniques for data analytics; NoSQL databases and their advantages; cloud data storage and the use of clouds for big data; data warehouses and data mining; and the MapReduce paradigm for data analytics and the Hadoop file system. Homework assignments will give students practical experience with important topics covered in the course, including the use of cloud storage, relational databases, NoSQL databases, and Hadoop/MapReduce.

Week Topic Readings Homework Exams

1 Introduction, data modeling

Gray, J. "Evolution of data management". Computer, 29(10):38-46, 1996.

J. Gray, D. T. Liu, M. Nieto-Santisteban, A. Szalay, D. J. DeWitt, and G. Heber, "Scientific data management in the coming decade," ACM SIGMOD Record, vol. 34, pp. 34-41, 2005.

Jim Gray on eScience: a transformed scientific method, The Fourth Paradigm: Data-Intensive Scientific Discovery, Edited by Tony Hey, Stewart Tansley, and Kristin Tolle.

Ch. 2 of A First Course in Database Systems, by Jeff Ullman and Jennifer Widom http://infolab.stanford.edu/~ullman/fcdb.html

2 Disk arrays, Network Attached Storage, Clusters

D. A. Patterson, G. Gibson, and R. H. Katz, "A case for redundant arrays of inexpensive disks (RAID)," in ACM SIGMOD international conference on Management of data (SIGMOD '88): ACM, 1988, pp. 109-116.

Y. Saito, S. Frølund, A. Veitch, A. Merchant, and S. Spence, "FAB: building distributed enterprise disk arrays from commodity components," ACM SIGOPS Operating Systems Review, vol. 38, pp. 48-58, 2004.

G. A. Gibson and R. Van Meter, "Network attached storage architecture," Communications of the ACM, vol. 43, pp. 37-45, 2000.

Dillow, Z. Zhang, and B. W. Settlemyer, "Workload characterization of a leadership class storage cluster," in Petascale Data Storage Workshop (PDSW), 2010 5th, 2010, pp. 1-5.

Homework 1 assigned

3 Data Centers

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second edition, July 2013, 154 pages, Luiz André Barroso, Jimmy Clidaras, Urs Hölzle, Google, Inc.

http://www.morganclaypool.com/doi/abs/10.2200/S00516ED2V01Y201306CAC024

Homework 1 due, Homework 2 assigned

4 Cloud Storage Systems

M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, and I. Stoica, "A view of cloud computing," Communications of the ACM, vol. 53, pp. 50-58, 2010.

C. Wang, K. Ren, W. Lou, and J. Li, "Toward publicly auditable secure cloud data storage services," Network, IEEE, vol. 24, pp. 19-24, 2010.

R. Grossman and Y. Gu, "Data mining using high performance data clouds: experimental studies using sector and sphere," in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 2008, pp. 920-927.

5 Commercial Cloud Storage Systems

Amazon S3, Cloud Computing Storage for Files, Images, Videos

aws.amazon.com/s3/

Amazon SimpleDB - Amazon Web Services

aws.amazon.com/simpledb/

Amazon Glacier - Amazon Web Services

aws.amazon.com/glacier/

Homework 2 due, Homework 3 assigned

6 File Systems for Massive Storage

B. Welch, M. Unangst, Z. Abbasi, G. A. Gibson, B. Mueller, J. Small, J. Zelenka, and B. Zhou, "Scalable Performance of the Panasas Parallel File System," in FAST, 2008, pp. 1-17.

Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google file system." ACM SIGOPS Operating Systems Review. Vol. 37. No. 5. ACM, 2003.

Homework 3 due

7 Midterm 1 and Relational databases and analytics

M.-S. Chen, J. Han, and P. S. Yu, "Data mining: an overview from a database perspective," Knowledge and Data Engineering, IEEE Transactions on, vol. 8, pp. 866-883, 1996.

Optional background reading: A First Course in Database Systems, by Jeff Ullman and Jennifer Widom

Midterm 1

8 Relational databases and analytics (cont.)

Cohen, Jeffrey, et al. "MAD skills: new analysis practices for big data." Proceedings of the VLDB Endowment 2.2 (2009): 1481-1492.

Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.

Homework 4 assigned

9 NoSQL Databases

R. Cattell, "Scalable SQL and NoSQL data stores," ACM SIGMOD Record, vol. 39, pp. 12-27, 2011.

F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A distributed storage system for structured data," ACM Transactions on Computer Systems (TOCS), vol. 26, p. 4, 2008.

Homework 4 due, Homework 5 assigned

10 NoSQL Databases (cont.)

A. Lakshman and P. Malik, "Cassandra: a decentralized structured storage system," ACM SIGOPS Operating Systems Review, vol. 44, pp. 35-40, 2010.

G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's highly available key-value store," in SOSP, 2007, pp. 205-220.

11 Distributed Databases

M. Stonebraker, P. M. Aoki, W. Litwin, A. Pfeffer, A. Sah, J. Sidell, C. Staelin, and A. Yu, "Mariposa: a wide-area distributed database system," The VLDB Journal, vol. 5, pp. 48-63, 1996.

J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, and P. Hochschild, "Spanner: Google’s globally-distributed database," in Proceedings of OSDI, 2012.

Homework 5 due, Homework 6 assigned

12 Map Reduce, Hadoop File System

J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, pp. 107-113, 2008.

K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, 2010, pp. 1-10.

13 Map Reduce, Hadoop File System (cont.)

D. Wegener, M. Mock, D. Adranale, and S. Wrobel, "Toolkit-based high-performance Data Mining of large Data on MapReduce Clusters," in Data Mining Workshops, 2009. ICDMW'09. IEEE International Conference on, 2009, pp. 296-301.

Homework 6 due

14 Midterm 2, Data Warehouses

Surajit Chaudhuri and Umeshwar Dayal. An Overview of Data Warehousing and OLAP Technology. SIGMOD Record, 26(1), 1997, 65-74.

Midterm 2

15 Data Warehouses (cont.)

J. C. Prather, D. F. Lobach, L. K. Goodwin, J. W. Hales, M. L. Hage, and W. E. Hammond, "Medical data mining: knowledge discovery in a clinical data warehouse," in Proceedings of the AMIA Annual Fall Symposium, 1997, p. 101.

A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy, "Hive-a petabyte scale data warehouse using Hadoop," in Data Engineering (ICDE), 2010 IEEE 26th International Conference on, 2010, pp. 996-1005.