BE_VII>DMDW.M04 Department of Computer Sc. & Engg.
B.I.T., Mesra Sub.: CS 8031 Data Mining & Data Warehousing Tutorial Sheet-I Class BE(CS)-VIII Sem. Module - 1
1. What is data mining? In your answer, address the following a) Is it another type?
b) Is it a simple transformation of technology developed from database, statistics, and machine learning ?
c) Explain how the evolution of database technology led to data mining. d) Describe the steps involved in data mining when viewed as a process of knowledge discovery.
2. Present as example where data mining is crucial to the success of a business. What data mining functions does this business need? Can them be performed alternatively by data query processing or simple statistical analysis?
3. How is a data warehouse different from a database? How are they similar?
4. Briefly describe the following advanced database systems and
applications: object-oriented database, spatial databases, text databases, multimedia databases, the World Wide Web.
5. Define each of the following data mining functionalities: characterization, discrimination, association, classification, prediction, clustering, and evolution analysis. Give examples of each data mining functionality, using a real-life database that you are familiar with.
6. What is the difference between discrimination and classification? Between characterization and clustering? Between classification and prediction? For each of these pairs of tasks, how are they similar? 7. Based on your observation, describe another possible kind of knowledge that needs to be discovered by data mining methods but has not been listed in this chapter. Does it require a mining methodology that is quite different from those outlined in this chapter?
8. Describe three challenges to data mining regarding data mining methodology and user interaction issues.
9. Describe two challenges to data mining regarding performance issues. Module - 2
10. State why, for the integration of multiple heterogeneous information sources, many companies in industry prefer the update-driven approach (which constructs and uses data warehouses), rather than the driven (which applies wrappers and integrators). Describe situations where the query-driven approach is preferable over the update-driven approach.
explain your point(s).
a) Snowflake schema, fact constellation, starnet query model b) Data cleaning, data transformation, refresh
c) Discovery-driven cube, multi feature cube, virtual warehouse 12. Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
a) Enumerate three classes of schemas that are popularly used for modeling data warehouses.
b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a).
c) Starting with the base cuboid [day, doctor, patient], what specific OLAP operating should be performed in order to list the total fee collected by each doctor in 2000?
d) To obtain the same list, write an SQL query assuming the data is stored in a relational database with the schema fee (day, month,
year, doctor, hospital, patient, count, charge).
13. Suppose that a data warehouse for Big-University consists of the following four dimensions : student, course, semester, and instructor, and two measures count and avg_grade. When at the lowest conceptual level (e.g. for a given student, course, semester, and instructor combination), the avg_grade measure stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given conbination.
a) Draw a snowflake schema diagram for the data warehouse. b) Starting with the base cuboid [student, course, semester,
instructor], what specific OLAP operations (e.g. roll-up from semester to year) should one perform in order to list the average grade of CS courses for each Big-University student.
c) If each dimension has five levels (including all), such as student < major < status < university < all, how many cuboids will this cube contain (including the base and apex cuboids)?
14. Regarding the computation of measures in a data cube:
a) Enumerate three categories of measures, based on the kind of aggregate functions used i computing a data cube.
b) For a data cube with the three dimensions time, location, and product, which category does the function variance belong to? Describe how to compute it if the cube is partitioned into many chunks.
c) Suppose the function is "top 10 sales". Discuss how to efficiently compute this measure in a data cube.
15. In data warehouse technology, a multiple dimensional view can be implemented by a relational database technique (ROLAP), or by a multidimensional database technique (MOLAP), or by a hybrid database technique (HOLAP).
a) Briefly describe each implementation technique.
b) For each technique, explain how each of the following functions may be implemented:
i) The generation of a data warehouse (including aggregation)
ii) Roll-up iii) Drill-down iv) Incremental updating which implementation techniques do you prefer, and why?
16. Suppose that a data warehouse contains 20 dimension, each with about five levels of granularity.
a) Users are mainly interested in four particular dimensions, each having three frequently accessed levels for rolling for rolling up and drilling down. How would you design a data cube structure to support this preference efficiently?
b) At times, a user may want to drill through the cube, down to the raw data for one or two particular dimensions. How would you support
this feature?
17. Consider the following multi feature cube query: Grouping by all subsets of [item, region, month], find the minimum shelf life in 2000 for each group, and the fraction of the total sales due to tuples whose price is less than $100, and whose shelf life is within 25% of the minimum shelf life, and within 50% of the minimum shelf life.
a) Draw the multi feature cube graph for the query. b) Express the query in extended SQL.
c) Is this a distributive multi feature cube? Why or why not?
18. What are the differences between the three main types of data warehouse usage: information processing, analytical processing, and data mining? Discuss the motivation behind OLAP mining (OLAM).
Module – 3
19. Data quality can be assessed om terms of accuracy, completeness, and consistency. Propose two other dimensions of data quality.
20. In real-world data, tuples with missing values for some attributes are a common occurrence. Describe various methods for handling this problem.
21. Suppose that the data for analysis include the attribute age. The age values for the data tuples are (in increasing order): 13, 15, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 53, 70.
a) Use smoothing by bin means to smooth the above data ,using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data.
b) How might you determine outliers in the data? c) What other methods are there for data smoothing? 22. Discuss issues to consider during data integration.
23. Propose an algorithm, in pseudocode or in your favorite programming language, for the following :
a) The automatic generation of a concept hierarchy for datagorical data based on the number of distinct values of attributes in the
given schema.
b) The automatic generation of a concept hierarchy for numeric data based on the equiwidth partitioning rule
c) The automatic generation of a concept hierarchy for numeric data based on the equidepth partitioning rule.
24. List and describe the 5 primitives for specifying a data mining task. 25. Describe why concept hierarchies are useful in data mining?
Module - 4
26. The 4 major types of concept hirerachies are : schema hierarchies, grouping hierarchies, operation-derived hierachies and rule-based hirearchies.
a) Briefly define each type of hierarchy.
b) For each hierarchy type, provide an example that was not presented in this chapter.
27. Suppose that the University course DB for Big-University includes the following attributes describing students :
name, address, status (e.g. undergraduate or graduate), major, and GPA(cumulative grade point average).
a) Propose a concept hierarchy for the attributes address, status, major, and GPA.
b) For each concept hierarchy that you have proposed above, what type of concept hierarchy is it?
c) Define each hierarchy using DMQL syntax.
d) Write a DMQL query to find the characteristics of students who have an excellent GPA.
e) Write a DMQL query to compare students majoring in science with students majoring in arts.
f) Write a DMQL query to find associations involving course instructors, student grades, and some other attribute of your choice. Use a metarule to specify the format of associations you would like to find. Specify minimum thresholds for the confidence and support of the association rules reported
g) Write a DMQL query to predict student grades in "Computing Science 101" based on student GPA and course instructor.