CS 8031 Data Mining and Data Warehousing Tutorial


BE_VII>DMDW.M04 Department of Computer Sc. & Engg.

B.I.T., Mesra Sub.: CS 8031 Data Mining & Data Warehousing Tutorial Sheet-I Class BE(CS)-VIII Sem.

Module - 1

1. What is data mining? In your answer, address the following:

a) Is it another hype?

b) Is it a simple transformation of technology developed from databases, statistics, and machine learning?

c) Explain how the evolution of database technology led to data mining.

d) Describe the steps involved in data mining when viewed as a process of knowledge discovery.

2. Present an example where data mining is crucial to the success of a business. What data mining functions does this business need? Can they be performed alternatively by data query processing or simple statistical analysis?

3. How is a data warehouse different from a database? How are they similar?

4. Briefly describe the following advanced database systems and applications: object-oriented databases, spatial databases, text databases, multimedia databases, the World Wide Web.

5. Define each of the following data mining functionalities: characterization, discrimination, association, classification, prediction, clustering, and evolution analysis. Give examples of each data mining functionality, using a real-life database that you are familiar with.

6. What is the difference between discrimination and classification? Between characterization and clustering? Between classification and prediction? For each of these pairs of tasks, how are they similar?

7. Based on your observation, describe another possible kind of knowledge that needs to be discovered by data mining methods but has not been listed in this chapter. Does it require a mining methodology that is quite different from those outlined in this chapter?

8. Describe three challenges to data mining regarding data mining methodology and user interaction issues.

9. Describe two challenges to data mining regarding performance issues.

Module - 2

10. State why, for the integration of multiple heterogeneous information sources, many companies in industry prefer the update-driven approach (which constructs and uses data warehouses), rather than the query-driven approach (which applies wrappers and integrators). Describe situations where the query-driven approach is preferable over the update-driven approach.


11. Briefly compare the following concepts. You may use an example to explain your point(s).

a) Snowflake schema, fact constellation, starnet query model

b) Data cleaning, data transformation, refresh

c) Discovery-driven cube, multifeature cube, virtual warehouse

12. Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.

a) Enumerate three classes of schemas that are popularly used for modeling data warehouses.

b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a).

c) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2000?

d) To obtain the same list, write an SQL query assuming the data is stored in a relational database with the schema fee (day, month, year, doctor, hospital, patient, count, charge).

13. Suppose that a data warehouse for Big-University consists of the following four dimensions: student, course, semester, and instructor, and two measures count and avg_grade. When at the lowest conceptual level (e.g. for a given student, course, semester, and instructor combination), the avg_grade measure stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given combination.

a) Draw a snowflake schema diagram for the data warehouse.

b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g. roll-up from semester to year) should one perform in order to list the average grade of CS courses for each Big-University student?

c) If each dimension has five levels (including all), such as student < major < status < university < all, how many cuboids will this cube contain (including the base and apex cuboids)?
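
As a quick check on part (c): when every dimension carries a concept hierarchy, the number of cuboids is the product, over all dimensions, of the number of levels available for that dimension (counting the virtual top level all). The sketch below assumes that each listed level count already includes all; the helper name count_cuboids is hypothetical.

```python
from math import prod

def count_cuboids(levels_per_dimension):
    """Number of cuboids, given each dimension's level count (including 'all')."""
    return prod(levels_per_dimension)

# Four dimensions (student, course, semester, instructor), five levels each.
total = count_cuboids([5, 5, 5, 5])
print(total)  # 625
```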

14. Regarding the computation of measures in a data cube:

a) Enumerate three categories of measures, based on the kind of aggregate functions used in computing a data cube.

b) For a data cube with the three dimensions time, location, and product, which category does the function variance belong to? Describe how to compute it if the cube is partitioned into many chunks.

c) Suppose the function is "top 10 sales". Discuss how to efficiently compute this measure in a data cube.

15. In data warehouse technology, a multidimensional view can be implemented by a relational database technique (ROLAP), or by a multidimensional database technique (MOLAP), or by a hybrid database technique (HOLAP).

a) Briefly describe each implementation technique.

b) For each technique, explain how each of the following functions may be implemented:

i) The generation of a data warehouse (including aggregation)

ii) Roll-up

iii) Drill-down

iv) Incremental updating

Which implementation technique do you prefer, and why?


16. Suppose that a data warehouse contains 20 dimensions, each with about five levels of granularity.

a) Users are mainly interested in four particular dimensions, each having three frequently accessed levels for rolling up and drilling down. How would you design a data cube structure to support this preference efficiently?

b) At times, a user may want to drill through the cube, down to the raw data for one or two particular dimensions. How would you support this feature?

17. Consider the following multi feature cube query: Grouping by all subsets of [item, region, month], find the minimum shelf life in 2000 for each group, and the fraction of the total sales due to tuples whose price is less than $100, and whose shelf life is within 25% of the minimum shelf life, and within 50% of the minimum shelf life.

a) Draw the multifeature cube graph for the query.

b) Express the query in extended SQL.

c) Is this a distributive multi feature cube? Why or why not?

18. What are the differences between the three main types of data warehouse usage: information processing, analytical processing, and data mining? Discuss the motivation behind OLAP mining (OLAM).

Module – 3

19. Data quality can be assessed in terms of accuracy, completeness, and consistency. Propose two other dimensions of data quality.

20. In real-world data, tuples with missing values for some attributes are a common occurrence. Describe various methods for handling this problem.

21. Suppose that the data for analysis include the attribute age. The age values for the data tuples are (in increasing order): 13, 15, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 53, 70.

a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data.

b) How might you determine outliers in the data?

c) What other methods are there for data smoothing?

22. Discuss issues to consider during data integration.
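
For the bin-means smoothing asked for in question 21(a), the following is a minimal Python sketch, not a model answer. It assumes equal-depth (equal-frequency) bins taken over the sorted values, and it lets the last bin be shorter when the number of values is not a multiple of the depth, which is the case for the 26 values printed above. The function name smooth_by_bin_means is hypothetical.

```python
def smooth_by_bin_means(values, depth):
    """Equal-depth binning: replace every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), depth):
        bin_values = ordered[start:start + depth]  # the last bin may be shorter
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

# Age values exactly as printed in question 21.
ages = [13, 15, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33,
        35, 35, 35, 35, 36, 40, 45, 46, 53, 70]
smoothed_ages = smooth_by_bin_means(ages, depth=3)
print([round(v, 2) for v in smoothed_ages])
```

Each value is replaced by the mean of its three-value bin, so local fluctuations are flattened while the overall trend of the sorted data is kept.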

23. Propose an algorithm, in pseudocode or in your favorite programming language, for the following :

a) The automatic generation of a concept hierarchy for categorical data based on the number of distinct values of attributes in the given schema.

b) The automatic generation of a concept hierarchy for numeric data based on the equiwidth partitioning rule.

c) The automatic generation of a concept hierarchy for numeric data based on the equidepth partitioning rule.

24. List and describe the 5 primitives for specifying a data mining task.

25. Describe why concept hierarchies are useful in data mining.

Module - 4

26. The 4 major types of concept hierarchies are: schema hierarchies, set-grouping hierarchies, operation-derived hierarchies, and rule-based hierarchies.

a) Briefly define each type of hierarchy.

b) For each hierarchy type, provide an example that was not presented in this chapter.

27. Suppose that the University course DB for Big-University includes the following attributes describing students: name, address, status (e.g. undergraduate or graduate), major, and GPA (cumulative grade point average).

a) Propose a concept hierarchy for the attributes address, status, major, and GPA.

b) For each concept hierarchy that you have proposed above, what type of concept hierarchy is it?

c) Define each hierarchy using DMQL syntax.

d) Write a DMQL query to find the characteristics of students who have an excellent GPA.

e) Write a DMQL query to compare students majoring in science with students majoring in arts.

f) Write a DMQL query to find associations involving course instructors, student grades, and some other attribute of your choice. Use a metarule to specify the format of associations you would like to find. Specify minimum thresholds for the confidence and support of the association rules reported.

g) Write a DMQL query to predict student grades in "Computing Science 101" based on student GPA and course instructor.


TUTORIAL SHEET – II

MODULE - 4

28. Discuss the importance of establishing a standardized data mining query language. What are some of the potential benefits and challenges involved in such a task? List a few of the recent proposals in this area.

29. Describe the differences between the following architectures for the integration of a data mining system with a database or data warehouse system: no coupling, loose coupling, semitight coupling, and tight coupling. State which architecture you think is most popular, and why.

MODULE-1

30. (a) What is a relational database?

(b) What is a transactional database?

(c) What is online analytical processing?

MODULE-2

31. A popular data warehouse implementation is to construct a multidimensional database, known as a data cube. Unfortunately, this may often generate a huge, yet very sparse multidimensional matrix.

(a) Present an example illustrating such a huge and sparse data cube.

(b) Design an implementation method that can elegantly overcome this sparse matrix problem. Note that you need to explain your data structures in detail and discuss the space needed, as well as how to retrieve data from your structures.

(c) Modify your design in (b) to handle incremental data updates. Give the reasoning behind your new design.

MODULE-3

32. Use a flowchart to summarize the following procedures for attribute subset selection:

(a) Stepwise forward selection.

(b) Stepwise backward elimination.

(c) A combination of forward selection and backward elimination.

Module-5

(33) For class characterization, what are the major differences between a data cube-based implementation and a relational implementation such as attribute-oriented induction? Discuss which method is most efficient and under what conditions this is so.

(34) Suppose that the following table is derived by attribute-oriented induction.

Class         birth_place   count
Programmer    Canada        180
DBA           Canada        20
DBA           Others        80

(a) Transform the table into a crosstab showing the associated t-weights and d-weights.

(b) Map the class Programmer into a quantitative descriptive rule, for example,

∀X, Programmer(X) ⇔ (birth_place(X) = "Canada" ∧ ...) [t: x%, d: y%] ∨ ... ∨ (...) [t: w%, d: z%]

(35) Suppose that the data for analysis include the attribute age. The age values for the data tuples are

13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

(a) What is the mean of the data? What is the median?

(b) What is the mode of the data? Comment on the data's modality.

(c) What is the midrange of the data?

(d) Can you find the first quartile (Q1) and third quartile (Q3) of the data?

(e) Give the five-number summary of the data.

(f) Show a boxplot of the data.

(g) How is a quantile-quantile plot different from a quantile plot?
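
The summary statistics in parts (a) to (e) can be cross-checked with a short Python sketch using the standard statistics module on the values as printed in question 35. Quartile conventions differ between textbooks and tools; the "inclusive" interpolation method used here is one assumption among several, so Q1 and Q3 may differ slightly from a hand calculation.

```python
import statistics

# Age values exactly as printed in question 35.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 30, 33, 33,
        35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean = statistics.mean(ages)
median = statistics.median(ages)
modes = statistics.multimode(ages)          # every value with the highest count
midrange = (min(ages) + max(ages)) / 2
# Quartiles by linear interpolation; other conventions give other values.
q1, _, q3 = statistics.quantiles(ages, n=4, method="inclusive")
five_number_summary = (min(ages), q1, median, q3, max(ages))

print(mean, median, modes, midrange, five_number_summary)
```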

Module –6

(36) The Apriori algorithm makes use of prior knowledge of subset support properties.

(a) Prove that all nonempty subsets of a frequent itemset must also be frequent.

(b) Prove that the support of any nonempty subset S' of itemset S must be at least as great as the support of S.

(c) Given frequent itemset L and subset S of L, prove that the confidence of the rule "S' ⇒ (L − S')" cannot be more than the confidence of "S ⇒ (L − S)", where S' is a subset of S.

(d) A partitioning variation of Apriori subdivides the transactions of a database D into N nonoverlapping partitions. Prove that any itemset that is frequent in D must be frequent in at least one partition of D.
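
The subset-support property in parts (a) and (b) can be checked empirically, which is not a proof but helps build intuition. The Python sketch below uses the four transactions and the 60% minimum support given in question 37. It enumerates all itemsets by brute force rather than running Apriori itself, and the names support and frequent are hypothetical.

```python
from itertools import combinations

# The four transactions from question 37.
transactions = [
    {"K", "A", "D", "B"},
    {"D", "A", "C", "E", "B"},
    {"C", "A", "B", "E"},
    {"B", "A", "D"},
]
MIN_SUP = 0.6

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
# Brute-force enumeration of every nonempty itemset (fine for six items).
frequent = [
    frozenset(c)
    for size in range(1, len(items) + 1)
    for c in combinations(items, size)
    if support(frozenset(c)) >= MIN_SUP
]

# Every nonempty proper subset of a frequent itemset has at least its support.
for itemset in frequent:
    for size in range(1, len(itemset)):
        for subset in combinations(itemset, size):
            assert support(frozenset(subset)) >= support(itemset)

print(sorted("".join(sorted(s)) for s in frequent))
```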

(37) A database has four transactions. Let min_sup = 60% and min_conf = 80%.

TID    date       items_bought
T100   10/15/99   {K, A, D, B}
T200   10/15/99   {D, A, C, E, B}
T300   10/15/99   {C, A, B, E}
T400   10/15/99   {B, A, D}

(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of the two mining processes.

(b) List all of the strong association rules matching the following metarule, where X is a variable representing customers, and item denotes variables representing items (e.g. "A", "B", etc.):


(38) Suppose that frequent itemsets are saved for a large transaction database, DB. Discuss how to efficiently mine the (global) association rules under the same minimum support threshold, if a set of new transactions, denoted as ∆DB, is (incrementally) added in.

(39) Propose and outline a level-shared mining approach to mining multilevel association rules in which each item is encoded by its level position, and an initial scan of the database collects the count for each item at each concept level, identifying frequent and subfrequent items. Comment on the processing cost of mining multilevel associations with this method in comparison to mining single-level associations.

(40) When mining cross-level association rules, suppose it is found that the itemset "{IBM desktop computer, printer}" does not satisfy minimum support. Can this information be used to prune the mining of a "descendant" itemset such as "{IBM desktop computer, b/w printer}"? Give a general rule explaining how this information may be used for pruning the search space.

(41) Prove that each entry in the following table correctly characterizes its corresponding rule constraint for frequent itemset mining.

Rule constraint        Antimonotone   Monotone      Succinct
(a) v ∈ S              no             yes
(b) S ⊆ V              yes            no            yes
(c) min(S) ≤ v         no             yes
(d) range(S) ≤ v       yes            no
(e) variance(S) ≤ v    convertible    convertible   no

Module-7

(42) Briefly outline the major steps of decision tree classification.

(43) Why is tree pruning useful in decision tree induction? What is a drawback of using a separate set of samples to evaluate pruning?

(44) The following table shows the midterm and final exam grades obtained for students in a database course.

X (Midterm exam)   Y (Final exam)
72                 84
50                 63
81                 77
74                 78
94                 90
86                 75
59                 49
83                 79
65                 77
33                 52
88                 74
81                 90

(a) Plot the data. Do X and Y seem to have a linear relationship?

(b) Use the method of least squares to find an equation for the prediction of a student's final exam grade based on the student's midterm grade in the course.

(c) Predict the final exam grade of a student who received an 86 on the midterm exam.
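
For parts (b) and (c), a minimal Python sketch of simple linear regression by least squares is given below, fitting Y = intercept + slope × X to the twelve grade pairs in the table. The function name least_squares is hypothetical, and the sketch uses the textbook closed-form estimates rather than any library routine.

```python
# Grade pairs from the table in question 44.
midterm = [72, 50, 81, 74, 94, 86, 59, 83, 65, 33, 88, 81]
final = [84, 63, 77, 78, 90, 75, 49, 79, 77, 52, 74, 90]

def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations divided by sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = least_squares(midterm, final)
predicted = intercept + slope * 86  # part (c): midterm grade of 86
print(round(slope, 3), round(intercept, 3), round(predicted, 2))
```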


(45) What is boosting? State why it may improve the accuracy of decision tree induction.

(46) Show that accuracy is a function of sensitivity and specificity, that is,

accuracy = sensitivity × pos / (pos + neg) + specificity × neg / (pos + neg).

Module-8

(47) Briefly outline how to compute the dissimilarity between objects described by the following types of variables:

(a) Asymmetric binary variables.

(b) Nominal variables.

(c) Ratio-scaled variables.

(d) Numerical variables.

(48) Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):

(a) Compute the Euclidean distance between the two objects.

(b) Compute the Manhattan distance between the two objects.

(c) Compute the Minkowski distance between the two objects, using q = 3.
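
All three distances in question 48 are instances of the Minkowski distance with different orders (q = 1 for Manhattan, q = 2 for Euclidean), so one helper covers them. This is a small Python sketch; the function name minkowski and its parameter names are hypothetical.

```python
def minkowski(p, q, order):
    """Minkowski distance of the given order between two equal-length tuples."""
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1 / order)

x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

manhattan = minkowski(x, y, 1)    # order 1
euclidean = minkowski(x, y, 2)    # order 2
minkowski_3 = minkowski(x, y, 3)  # order 3, as asked in part (c)
print(manhattan, round(euclidean, 4), round(minkowski_3, 4))
```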

(49) Suppose that the data mining task is to cluster the following eight points (with (x, y) representing location) into 3 clusters.

A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).

The distance function is Euclidean distance. Suppose we initially assign A1, B1, and C1 as the centers of each cluster, respectively. Use the k-means algorithm to show only

(a) The three cluster centers after the first round of execution, and

(b) The final three clusters.
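
The k-means iterations for question 49 can be traced with the following Python sketch, which alternates between assigning each point to its nearest center and recomputing each center as the mean of its members until the assignment stops changing. The helper names assign and update are hypothetical, and the sketch assumes no cluster ever becomes empty, which holds for this data.

```python
from math import dist

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
          "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}

def assign(centers):
    """Map each point name to the index of its nearest center (Euclidean)."""
    return {name: min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            for name, p in points.items()}

def update(assignment, k):
    """Recompute each center as the coordinate-wise mean of its members."""
    centers = []
    for i in range(k):
        members = [points[n] for n, c in assignment.items() if c == i]
        centers.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
    return centers

# Initial centers as stated in the question: A1, B1, C1.
centers = [points["A1"], points["B1"], points["C1"]]
assignment = assign(centers)
first_round_centers = update(assignment, 3)

centers = first_round_centers
while True:
    new_assignment = assign(centers)
    if new_assignment == assignment:  # converged: no point changed cluster
        break
    assignment = new_assignment
    centers = update(assignment, 3)

final_clusters = [sorted(n for n, c in assignment.items() if c == i)
                  for i in range(3)]
print(first_round_centers)
print(final_clusters)
```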

(50) Data cubes and multidimensional databases contain categorical, ordinal, and numerical data in hierarchical or aggregate forms. Based on what you have learned about the clustering methods, design a clustering method that finds clusters in large data cubes effectively and efficiently.

Module - 9

(51) Suppose that a restaurant chain would like to mine customers' consumption behavior related to major sport events, such as "Every time there is a Canucks hockey game on TV, the sales of Kentucky Fried Chicken will go up 20% one hour before the match."

(a) Describe a method to find such patterns efficiently.

(b) Most time-related association mining algorithms use Apriori-like algorithms to mine such patterns. An alternative database projection-based frequent pattern (FP) growth method is efficient at mining frequent itemsets. Can you extend the FP-growth method to find such time-related patterns efficiently?

(52) Suppose that a power station stores data about power consumption levels by time and by region, and power usage information per customer in each region. Discuss how to solve the following problems in such a time-series database.

(a) Find similar power consumption curve fragments for a given region on Fridays.

(b) Every time a power consumption curve rises sharply, what may happen within 20 minutes?

(c) How can we find the most influential features that distinguish a stable power consumption
