BS3519 - Exploratory Data Analysis - UNDERGRADUATE PROGRAMME HANDBOOK

School: Cardiff Business School
Department Code: CARBS0

Module Code: BS3519
Number of Credits: 10

Level: L6

Module Leader: Dr Peter Morgan
Semester: Autumn Semester
Academic Year: 2013/4

Outline Description of Module

Exploratory Data Analysis (EDA) is a framework for analyzing data to seek hypotheses worthy of testing. It complements the tools of conventional statistics for testing hypotheses.

In the era of ‘Big Data’, the existence of vast quantities of data of all types, and our capacity to create even more, pose a severe challenge to managers, accountants and statisticians alike. Hence there is a need for data reduction or data mining methods that can cope with high-dimensional data, large numbers of cases and/or a variety of data types. Data now occurs in many forms – text, images (static and moving), sound, numbers, etc. – and conventional methods may not cope well with the more unusual forms. This module is designed to give students experience in both new and established techniques for EDA.

On completion of the module a student should be able to

A Knowledge and Understanding:

• Understand the scope and challenges of the modern data environment and newer methods of data exploration, acquisition and creation.

• Develop an understanding of a range of analysis tools for data which may have qualitative or quantitative variables or a mixture of both.

B Intellectual Skills:

• Discuss the need for new techniques to cope with data sets which are extensive in terms of number of cases and/or variables.

• Use these analysis tools to develop and test hypotheses about the data and the data generating process

• Critically evaluate the suitability of a given technique to a particular analytical situation

C Discipline Specific (including practical) Skills:

• Use a variety of software tools and appreciate the criteria for their use in a varied selection of data analysis situations

D Transferable Skills:

• Acquire the ability to use a variety of standard software packages

• Practise analytical problem solving skills through data analysis

• Develop reporting skills through maintaining an electronic journal

How the module will be delivered

The teaching will be based around a series of guided workshop and tutorial activities interspersed with lectures on core topics. The pattern will be set around 11 lectures and 5 fortnightly workshops interleaved with 5 software tutorial sessions. These sessions will be based around, for example, case studies using software such as Excel together with Open Source software packages such as R, GGobi, Gauguin and Mondrian. Each student will keep a data analysis journal completed in the IT lab sessions, parts of which will be submitted in electronic form as part of their module assessment.

An invited lecturer will also provide some input on industrial practice in this area.

Indicative study hours - 100

How the module will be assessed

There will be an examination which will include examples of software output of the type used by students in the tutorial sessions.

The production of a dataset analysis journal by individual students will concentrate on the material and datasets that they cover in the computer laboratory sessions. This will evidence their progress and include analysis of data unique to the individual student.

Assessment Breakdown

Type                            %    Title                       Duration (hrs)   Week
Examination - Autumn Semester   50   Exploratory Data Analysis   2                N/A
Written Assessment              50   Data Analysis Journal       N/A              12

Syllabus content

Theoretical Content

1. Introduction – types of data and the information content of data – levels of measurement

The concept of Knowledge Discovery – Data Mining and Exploratory Data Analysis (EDA)

2. The concept of perspective – viewpoint, foreground and background, and data and dimensional reduction

The histogram and allied plots as prototypical data analysis tools for univariate data

3. Extensions to 2 and 3 dimensions

Simple visualization tools – scatter plots, matrix scatter plots, brushing, and 3-D scatter plot

Coping with nominal, ordinal and mixed data

4. High dimensional data

Plotting methods for ‘high-D’ data

Outliers, clusters, nonlinearity and distance measures

Perspective and data reduction revisited - Projection methods

5. Cluster analysis

6. Analysis of textual and nominal data. Mosaic plots

7. Regression and time series methods – examination of residuals for model refinement

8. Acquiring and verifying large data sets and the missing data problem

9. Commonly-used multivariate methods - including Multidimensional Scaling (MDS) - from a software user’s perspective and including a discussion of the scope and pitfalls of such methods

10. Validating models

Practical and Software Content

1. Input and creation of data through such media as the WWW; data reformatting; table and data handling aspects of spreadsheet, word processor and database programs

2. Univariate and Bivariate analyses using simple charting techniques on a standard dataset and testing the results

3. Examining data through data visualization methods implemented in GGobi – exploring a variety of plotting methods including matrix scatter plots, parallel coordinate plots and the use of brushing to link plots and highlight clusters and outliers

4. Using cluster analysis techniques on a standard dataset with statistical validation (using external variables, etc.)

5. Analysis of textual and qualitative data using such software as Mondrian and R

6. Use of a variety of multivariate software tools such as Minitab and Excel, and Open Source software applications and environments such as R, GGobi, Gauguin and Mondrian

7. Exposition of some standard data sets.

Indicative Reading and Resource List

John W. Tukey (1977) ‘Exploratory Data Analysis’, Addison-Wesley

Brian S. Everitt, Sabine Landau, and Morven Leese (2001) ‘Cluster Analysis’, Hodder Arnold

Usama M. Fayyad, Georges G. Grinstein and Andreas Wierse (2001) ‘Information Visualization in Data Mining and Knowledge Discovery’, Morgan Kaufmann

William G. Jacoby (1998) ‘Statistical Graphics for Visualizing Multivariate Data’, Sage Publications

Unwin, A., Theus, M. and Hofmann, H. (2006) ‘Graphics of Large Datasets: Visualizing a Million’, Springer-Verlag, NY (http://stats.math.uni-augsburg.de/GOLD/)

http://www.ggobi.org/ A powerful Open Source software tool for Data Visualization
