Academic year: 2021

1. What are the uses of statistics in data mining?
Statistics is used to
• Estimate the complexity of a data mining problem.
• Suggest which data mining techniques are most likely to be successful.
• Identify the data fields that contain the most “surface information”.

2. What is the main goal of statistics?

The basic goal of statistics is to extend knowledge about a subset of a collection to the entire collection.

3. What are the factors to be considered while selecting the sample in statistics?
The sample should be
• Large enough to be representative of the population.
• Small enough to be manageable.
• Accessible to the sampler.
• Free of bias.

4. Name some advanced database systems?
• Object-oriented databases.
• Object-relational databases.

5. Name some specific application-oriented databases?
• Spatial databases.
• Time-series databases.
• Text databases.
• Multimedia databases.

6. Define Relational databases?

Relational databases are a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (rows or records). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.

7. Define Transactional databases?

Transactional databases consist of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans_ID), and a list of the items making up the transaction.


8. Define Spatial Databases?

Spatial databases contain spatial-related information. Such databases include geographic (map) databases, VLSI chip design databases, and medical and satellite image databases. Spatial data may be represented in raster format, consisting of n-dimensional bit maps or pixel maps.

9. What is Temporal Database?

A temporal database stores time-related data. It usually stores relational data that includes time-related attributes. These attributes may involve several timestamps, each having different semantics.

10. What is a Time-Series database?

A Time-Series database stores sequences of values that change with time, such as data collected regarding the stock exchange.

11. What is Legacy database?

A legacy database is a group of heterogeneous databases that combines different kinds of data systems, such as relational or object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia databases, or file systems.

12. What are the steps in the data mining process?
• Data Cleaning
• Data Integration
• Data Selection
• Data Transformation
• Data Mining
• Pattern Evaluation
• Knowledge Representation

13. Define data cleaning?

Data Cleaning means removing the inconsistent data or noise and collecting necessary information.

14. Define data mining?

Data mining is the process of extracting or mining knowledge from huge amounts of data.

15. Define pattern evaluation?

Pattern evaluation is used to identify the truly interesting patterns representing knowledge, based on interestingness measures.


16. Define Knowledge representation?

Knowledge representation techniques are used to present the mined knowledge to the user.

17. Define class/ concept description?

Data can be associated with classes or concepts. It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions.

18. What is Data Characterization?

Data characterization is a summarization of the general characteristics or features of a target class of data. The data corresponding to the user-specified class are typically collected by a database query.

19. What is data discrimination?

Data discrimination is a comparison of the general features of target-class data objects with the general features of objects from one or a set of contrasting classes.

20. What is Association analysis?

Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data.

21. Define association rules?

Association rules are of the form X ⇒ Y, that is, “A1 ∧ … ∧ Am ⇒ B1 ∧ … ∧ Bn”, where Ai (for i ∈ {1, …, m}) and Bj (for j ∈ {1, …, n}) are attribute–value pairs. The association rule X ⇒ Y is interpreted as “database tuples that satisfy the conditions in X are also likely to satisfy the conditions in Y”.

22. List out the major components of a typical data mining system?

The major components of a typical data mining system architecture are:
• Database, data warehouse, World Wide Web, or other information repository
• Database or data warehouse server
• Knowledge base
• Data mining engine
• Pattern evaluation module
• User interface


23. How does a data warehouse differ from a database? How are they similar?
Difference:
A database system, or DBMS, consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.
A data warehouse is a repository of information collected from multiple sources, stored under a unified schema.
Similarity:
Queries can be applied to both a database and a data warehouse. A data warehouse is modeled by a multidimensional database structure.

24. What is concept description of hierarchies?

Concept description generates descriptions for the characterization and comparison of the data. It is sometimes called class description when the concept to be described refers to a class of objects.

25. What is constraint based association mining?

Specification of constraints or expectations to confine the search space of the data mining process is called constraint-based association mining. The constraints can be:
• Knowledge-based constraints
• Data constraints
• Dimension/level constraints
• Interestingness constraints
• Rule constraints

26. What is linear regression?

Linear regression involves finding the best line to fit two attributes, so that one attribute can be used to predict the other.

Example:

A random variable y (the response variable) can be modeled as a linear function of another random variable x (the predictor variable), with the equation
y = wx + b.
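As an illustration, the least-squares fit of y = wx + b can be sketched in plain Python; the helper name fit_line and the sample points are our own, not from any particular library:

```python
# Minimal sketch of ordinary least squares for y = w*x + b.

def fit_line(xs, ys):
    """Return (w, b) minimizing the squared error of y = w*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope w = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b

# Points lying exactly on y = 2x + 1, so the fit recovers w = 2, b = 1.
w, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```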

27. What are the two data structures in cluster analysis?
The two data structures in cluster analysis are:
• Data matrix (object-by-variable structure)
• Dissimilarity matrix (object-by-object structure)
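A minimal sketch of both structures for a made-up three-object data set (variable names and data values are illustrative):

```python
import math

# Data matrix: n objects x p variables (object-by-variable structure).
data_matrix = [
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
]

def euclidean(a, b):
    """Euclidean distance between two rows of the data matrix."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Dissimilarity matrix: n x n (object-by-object structure);
# entry [i][j] is the distance between object i and object j.
dissimilarity = [[euclidean(a, b) for b in data_matrix] for a in data_matrix]
```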


28. How are concept hierarchies useful in OLAP?

In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. This organization provides users with the flexibility to view data from different perspectives. OLAP provides a user-friendly environment for interactive data analysis.

29. What do you mean by virtual warehouse?

A virtual warehouse is a set of views over operational databases. For effective query processing only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.

30. List out five data mining tools.
• IBM’s Intelligent Miner
• DataMind Corporation’s DataMind
• Pilot’s Discovery Server
• Tools from Business Objects and SAS Institute
• End-user tools

31. What is KDD?
Knowledge discovery in databases (KDD) is a process consisting of an iterative sequence of the following steps:
• Data cleaning
• Data integration
• Data selection
• Data transformation
• Data mining
• Pattern evaluation
• Knowledge presentation

32. List out the classification of data mining systems.
• Classification according to the kinds of databases mined.
• Classification according to the kinds of knowledge mined.
• Classification according to the techniques utilized.


33. What is concept description?

Concept description is a form of data generalization. A concept typically refers to a collection of data such as frequent-buyers, graduate-students etc. Concept description generates descriptions for the characterization and comparison of the data.

34. What is association rule mining?

It consists of first finding frequent itemsets (sets of items, such as A and B, that satisfy a minimum support threshold, or percentage of the task-relevant tuples), from which strong association rules of the form A ⇒ B are generated. These rules also satisfy a minimum confidence threshold. Association rules can be further analyzed to uncover correlation rules.

35. What is tree pruning?

Tree pruning is used to remove the anomalies in the training data due to noise or outliers. It addresses the problem of overfitting the data. There are two approaches to tree pruning:
Pre-pruning -- the tree is pruned by halting its construction early.
Post-pruning -- removes subtrees from a fully grown tree.

36. What is cluster analysis?

The process of grouping a set of physical or abstract object into classes of similar objects is called clustering.

A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.

37. What is concept hierarchy?

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. Concept hierarchies may be implicit within the database schema. A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy.
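A concept hierarchy can be sketched as explicit mappings from low-level to higher-level concepts; the location values below are made up for illustration:

```python
# A toy "location" concept hierarchy: city -> state -> country.
city_to_state = {"Chennai": "Tamil Nadu", "Mumbai": "Maharashtra"}
state_to_country = {"Tamil Nadu": "India", "Maharashtra": "India"}

def roll_up(city):
    """Map a low-level concept (city) to its higher levels (state, country)."""
    state = city_to_state[city]
    return state, state_to_country[state]
```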

38. What are aggregation and metadata?
Aggregation is where summary or aggregation operations are applied to the data. For example, daily sales data may be aggregated to compute monthly and annual totals. This is used in constructing a data cube for analysis of data at multiple granularities.

Metadata are data about data that define warehouse objects. Metadata are created for the data names and definitions of the given warehouse.


39. What are the star schema and snowflake schema?
The star schema is the most common modeling paradigm, in which the data warehouse contains
• A large central table (the fact table) containing the bulk of the data with no redundancy.
• A set of smaller attendant tables, one for each dimension.

The snowflake schema is a variant of the star schema model in which some dimension tables are normalized, thereby further splitting the data into additional tables.

40. Write short notes on spatial clustering?

Spatial data clustering identifies clusters or densely populated regions, according to some distance measurements in a large, multidimensional data set.

41. State the types of linear model and their use.
Generalized linear models represent the theoretical foundation on which linear regression can be applied to the modeling of categorical response variables. The types of generalized linear model are:
• Logistic regression
• Poisson regression

42. What are the goals of time series analysis?
• Finding patterns in the data.
• Predicting future values.

43. What is smoothing?

Smoothing is an approach that is used to remove nonsystematic behaviors found in a time series. It can be used to detect trends in time series.
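Smoothing by a simple moving average can be sketched as follows; the window size and the sample series are chosen for illustration:

```python
# Sketch of smoothing a time series with a moving average.

def moving_average(series, window):
    """Smooth a series by averaging each run of `window` consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# The value 30 is a nonsystematic spike; averaging damps it out.
noisy = [10, 12, 9, 11, 30, 10, 12]
smooth = moving_average(noisy, 3)
```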

44. What is Lag?

The time difference between related items is referred to as Lag.

45. Write the preprocessing steps that may be applied to the data for classification and prediction?

• Data cleaning
• Relevance analysis
• Data transformation

46. Define Data Classification?
It is a two-step process. In the first step, a model is built describing a predetermined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes. In the second step, the model is used for classification.

47. What are Bayesian Classifiers?

Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class.

48. What is a decision tree?

It is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class or class distribution. A decision tree is a predictive model: each branch of the tree is a classification question, and the leaves of the tree are partitions of the data set with their classifications.
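As a sketch of how the root test of a decision tree might be chosen, the snippet below picks the attribute with the highest information gain on a made-up data set; the attribute names and the info_gain helper are our own, not from a specific algorithm's reference implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy reduction from splitting the rows on attribute `attr`."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Toy training data: "outlook" separates the classes perfectly.
rows = [{"outlook": "sunny", "windy": True},
        {"outlook": "sunny", "windy": False},
        {"outlook": "rain",  "windy": True},
        {"outlook": "rain",  "windy": False}]
labels = ["no", "no", "yes", "yes"]

best = max(["outlook", "windy"], key=lambda a: info_gain(rows, labels, a))
```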

49. Where are decision trees mainly used?
• Exploration of data sets and business problems.
• Data preprocessing for other predictive analysis.
• Statisticians use decision trees for exploratory analysis.

50. How will you solve a classification problem using a decision tree?
• Decision tree induction: construct a decision tree using the training data.
• For each tuple ti in the database D, apply the decision tree to determine its class.

51. How are association rules mined from large databases?
Association rule mining is a two-step process:
• Find all frequent itemsets.
• Generate strong association rules from the frequent itemsets.

52. What is the classification of association rules based on various criteria?

1. Based on the types of values handled in the rule:
a. Boolean association rules
b. Quantitative association rules
2. Based on the dimensions of data involved in the rule:
a. Single-dimensional association rules
b. Multidimensional association rules
3. Based on the levels of abstraction involved in the rule:
a. Single-level association rules
b. Multilevel association rules
4. Based on various extensions to association mining:
a. Maxpatterns
b. Frequent closed itemsets

53. What is the Apriori algorithm?

The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules. It uses prior knowledge of frequent itemset properties and employs an iterative approach known as level-wise search, where k-itemsets are used to explore (k+1)-itemsets.
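The level-wise search can be sketched in a few lines; the transactions and the frequent_itemsets helper below are illustrative, not a full Apriori implementation (the candidate-pruning step is omitted for brevity):

```python
# Level-wise sketch of the Apriori idea: frequent k-itemsets are joined
# to form (k+1)-candidates, which are then counted against the data.

def frequent_itemsets(transactions, min_support):
    """Return all itemsets whose absolute support is >= min_support."""
    items = {i for t in transactions for i in t}
    frequent = []
    k = 1
    current = [frozenset([i]) for i in sorted(items)]
    while current:
        # count step: keep candidates meeting the support threshold
        counted = [c for c in current
                   if sum(c <= t for t in transactions) >= min_support]
        frequent.extend(counted)
        # join step: merge frequent k-itemsets into (k+1)-candidates
        k += 1
        current = list({a | b for a in counted for b in counted
                        if len(a | b) == k})
    return frequent

# Made-up market-basket transactions.
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
result = frequent_itemsets(transactions, min_support=2)
```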

54. Define a Data mart?

A data mart is a pragmatic collection of related facts, but it does not have to be exhaustive or exclusive. A data mart is both a kind of subject area and an application; it is a collection of numeric facts.

55. What is data warehouse performance issue?

The performance of a data warehouse is largely a function of the quantity and type of data stored within the database and the query/data-loading workload placed upon the system.

56. What is Data Inconsistency Cleaning?

This can be summarized as the process of cleaning up the small inconsistencies that introduce themselves into the data. Examples include duplicate keys and unreferenced foreign keys.

57. Merits of a data warehouse.
* Ability to make effective decisions from the database
* Better analysis of data and decision support
* Discovering trends and correlations that benefit the business
* Handling huge amounts of data

58. What are the characteristics of a data warehouse?
* Separate
* Available
* Integrated
* Subject-oriented
* Not dynamic
* Iterative development
* Aggregation performance

59. List some of the data warehouse tools.
* OLAP (Online Analytic Processing)
* ROLAP (Relational OLAP)
* End-user data access tools
* Ad hoc query tools
* Data transformation services
* Replication

60. Explain OLAP.

OLAP is the general activity of querying and presenting text and number data from data warehouses, as well as a specifically dimensional style of querying and presenting exemplified by a number of "OLAP vendors". The OLAP vendors' technology is non-relational and is almost always based on an explicit multidimensional cube of data. OLAP databases are also known as multidimensional databases.

61. Explain ROLAP.

ROLAP is a set of user interfaces and applications that give a relational database, a dimensional flavor. ROLAP stands for Relational Online Analytic Processing.

62. Explain End User Data Access Tool?

An end-user data access tool is a client of the data warehouse. In a relational data warehouse, such a client maintains a session with the presentation server, sending a stream of separate SQL requests to the server. Eventually the end-user data access tool is done with the SQL session and turns around to present a screen of data, a report, a graph, or some other higher form of analysis to the user. An end-user data access tool can be as simple as an ad hoc query tool or as complex as a sophisticated data mining or modeling application.

63. Explain Ad Hoc Query Tool?

It is a specific kind of end user data access tool that invites the user to form their own queries by directly manipulating relational tables and their joins. Ad Hoc Query Tools, as powerful as they are, can only be effectively used and understood by about 10% of all the potential end users of a data warehouse.

64. Name some of the data mining applications.

* Data mining for biomedical and DNA data analysis
* Data mining for financial data analysis
* Data mining for the retail industry
* Data mining for the telecommunication industry

65. What are the contributions of Data Mining to DNA Analysis?

* Semantic integration of heterogeneous, distributed genome databases
* Similarity search and comparison among DNA sequences
* Association analysis: identification of co-occurring gene sequences
* Path analysis: linking genes to different stages of disease development
* Visualization tools and genetic data analysis

66. Name some examples of Data Mining in Retail Industry.

* Design and construction of data warehouses based on the benefits of data mining
* Multidimensional analysis of sales, customers, products, time, and region
* Analysis of the effectiveness of sales campaigns
* Customer retention: analysis of customer loyalty
* Purchase recommendation and cross-referencing of items

67. What is the difference between "supervised" and "unsupervised" learning scheme?

In data mining, when the class label of each training sample is provided during classification, the training is called "supervised learning"; that is, the learning of the model is supervised in that it is told to which class each training sample belongs (e.g., classification).

In unsupervised learning, the class label of each training sample is not known, and the number or set of classes to be learned may not be known in advance (e.g., clustering).

68. Discuss the importance of the similarity metric in clustering. Why is it difficult to handle categorical data for clustering?

The process of grouping a set of physical or abstract objects into classes of similar objects is called "clustering". The similarity metric is important because it is also used for outlier detection. Main-memory-based clustering algorithms typically operate on either of the following two data structures:
a) Data matrix
b) Dissimilarity matrix


69. Mention at least three advantages of Bayesian networks for data analysis and explain each one.
a) A Bayesian network is a graphical representation of uncertain knowledge that is easy to construct and interpret.
b) The representation has formal probabilistic semantics, making it suitable for statistical manipulation.
c) The representation can be used for encoding uncertain expert knowledge in expert systems.

70. Why do we need to prune a decision tree? Why should we use a separate pruning data set instead of pruning the tree with the training database?

When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods are needed to address this problem of overfitting the data. A separate pruning data set is used because pruning against the training data itself would simply confirm the overfitted branches rather than remove them.

71. Explain the various OLAP operations.
a) Roll-up: performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
b) Drill-down: the reverse of roll-up; it navigates from less detailed data to more detailed data.
c) Slice: performs a selection on one dimension of the given cube, resulting in a subcube.

72. Discuss the concepts of frequent itemset, support & confidence?

A set of items is referred to as an itemset. An itemset that contains k items is called a k-itemset. An itemset that satisfies minimum support is referred to as a frequent itemset.

Support is the ratio of the number of transactions that include all items in the antecedent and consequent parts of the rule to the total number of transactions.

Confidence is the ratio of the number of transactions that include all items in the consequent as well as antecedent to the number of transactions that include all items in antecedent.
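These two measures can be computed directly from a transaction list; the transactions and function names below are made up for illustration:

```python
# Sketch: support and confidence for the rule {bread} => {milk}.

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]

s = support({"bread", "milk"}, transactions)       # 2 of 4 transactions
c = confidence({"bread"}, {"milk"}, transactions)  # 2 of the 3 with bread
```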

73. Why is data quality so important in a data warehouse environment?

Data quality is important in a data warehouse environment to facilitate decision- making. In order to support decision-making, the stored data should provide information from a historical perspective and in a summarized manner.


74. How can data visualization help in decision-making?

Data visualization helps the analyst gain intuition about the data being observed. Visualization applications frequently assist the analyst in selecting display formats, viewer perspectives, and data representation schemas that foster deep intuitive understanding, thus facilitating decision-making.

75. What do you mean by high performance data mining?

Data mining refers to extracting or mining knowledge. It involves an integration of techniques from multiple disciplines such as database technology, statistics, machine learning, and neural networks. When it also involves techniques from high-performance computing, it is referred to as high-performance data mining.

76. What are the merits of a data warehouse?
The merits of a data warehouse are the following:
• Ability to make effective decisions from the database.
• Discovering trends and correlations that benefit the business.
• Better analysis of data and decision support.
• Better understanding of the business and the ability to handle huge amounts of data.
• The possibility of serving the customer better.
• Better understanding of business risks.
• Improvement of the business process.
• Being able to make tailor-made products and services.

77. What are the merits of a spatial data warehouse?

The merits of a spatial data warehouse are the following:
• Making dynamic geographic queries on data.
• Aggregating data to geographic areas.
• Analyzing data and its spatial reorganization.
• Visualization and presentation of data.

78. Describe the two common approaches of Tree Pruning?

In the pre-pruning approach, a tree is pruned by halting its construction early. The second approach, post-pruning, removes branches from a fully grown tree: a tree node is pruned by removing its branches.


79. What is clustering?

Clustering is the process of grouping the data into classes or clusters so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters.
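As a toy illustration of grouping objects into clusters, below is a minimal one-dimensional k-means sketch; the data, starting centers, and helper name are our own, not from any particular library:

```python
# Toy 1-D k-means: alternate assigning points to their nearest center
# and recomputing each center as the mean of its cluster.

def kmeans_1d(points, centers, rounds=10):
    for _ in range(rounds):
        clusters = {c: [] for c in centers}
        # assignment step: each point joins its nearest center
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its cluster
        centers = sorted(sum(ps) / len(ps)
                         for ps in clusters.values() if ps)
    return centers

# Two obvious groups: {1, 2, 3} and {10, 11, 12}.
centers = kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
```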

80. What are the requirements of clustering?
• Scalability
• Ability to deal with different types of attributes
• Ability to deal with noisy data
• Minimal requirements for domain knowledge to determine input parameters
• Constraint-based clustering
• Interpretability and usability

81. State the categories of clustering methods.
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based methods

82. Differentiate between lazy learner and eager learner?

Nearest neighbor classifiers are lazy learners in that they store all of the training samples and do not build a classifier until a new (unlabeled) sample needs to be classified.

Eager learning methods, such as decision tree induction and backpropagation, construct a generalized model before receiving a new sample to classify.

83. What is network pruning?

The first step toward extracting rules from neural networks is network pruning. This consists of removing weighted links that do not result in a decrease in the classification accuracy of the given network.

84. List the various criteria for the classification of data mining systems.
• Kinds of databases mined
• Kinds of knowledge mined
• Kinds of techniques utilized
• Applications adapted


85. Name some data mining techniques.
• Statistics
• Machine learning
• Decision trees
• Hidden Markov models
• Artificial neural networks
• Genetic algorithms
• Meta-learning

86. Explain the DBMiner tool in data mining.
• System architecture
• Input and output
• Data mining tasks supported by the system
• Support for task and method selection
• Support of the KDD process
• Main applications
• Current status

87. Define Iceberg query?
An iceberg query computes an aggregate function over an attribute or set of attributes in order to find aggregate values above some specified threshold. Given a relation R with attributes a1, a2, …, an and b, and an aggregate function agg_f, an iceberg query is of the form:

select R.a1, R.a2, …, R.an, agg_f(R.b)
from relation R
group by R.a1, R.a2, …, R.an
having agg_f(R.b) >= threshold

88. Define DBMiner?

DBMiner is an online analytical mining system, developed for interactive mining of multiple-level knowledge in large relational databases and data warehouses.

89. List out the DBMiner tasks?

• OLAP analyzer
• Association
• Classification
• Clustering
• Prediction


90. Explain how data mining is used in healthcare analysis.
• Healthcare data mining and its aims
• Healthcare data mining techniques
• Segmenting patients into groups
• Identifying patients with recurring health problems
• Relations between diseases and symptoms
• Curbing treatment costs
• Predicting medical diagnoses
• Medical research
• Hospital administration
• Applications of data mining in healthcare

91. Explain data mining applications for financial data analysis.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.

92. Explain data mining applications for the telecommunication industry.
• Multidimensional analysis of telecommunication data.
• Fraudulent pattern analysis and the identification of unusual patterns.
• Multidimensional association and sequential pattern analysis.
• Use of visualization tools in telecommunication data analysis.

93. Define Spatial Data Warehouse?

A spatial data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of both spatial and non-spatial data, in support of spatial data mining and spatial-data-related decision-making processes.

94. What are the different types of dimensions in a spatial data cube?
• Non-spatial dimensions.
• Spatial-to-non-spatial dimensions.
• Spatial-to-spatial dimensions.

95. Define Spatial Association rule?

A spatial association rule is of the form A ⇒ B [s%, c%], where A and B are sets of spatial or non-spatial predicates, s% is the support of the rule, and c% is the confidence of the rule.


96. Define Horizontal Parallelism?

Horizontal parallelism means that the database is partitioned across multiple disks, and parallel processing occurs within a specific task that is performed concurrently on different processors against different sets of data.

97. Define Vertical Parallelism?

Vertical parallelism occurs among different tasks: all component query operations are executed in parallel in a pipelined fashion.

98. What is the need for OLAP?

• To analyze data stored in databases.
• To analyze different dimensions in multidimensional databases.

99. Explain the various types of variables used in clustering?

• Interval-scaled variables
• Binary variables
  o Symmetric binary variables
  o Asymmetric binary variables
• Nominal variables
• Ordinal variables
• Ratio-scaled variables

100. Explain the hierarchical method of clustering.
• Agglomerative and divisive hierarchical clustering
• BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
• CURE (Clustering Using REpresentatives)
