Chapter 3 Data Collection and Preprocessing
3.3 Data Types and Preprocessing and Validation
Once the dataset is cleaned and structured, the next step is to assess whether the variable values are directly usable or need to be transformed before they can be analyzed. The dataset contains continuous attributes, categorical attributes, scaled data, and text descriptions. Some numeric attributes can be used as they are or after some form of normalization. Nominal, binary, scaled, and text data, however, each require their own preprocessing methods. This section explains the steps taken to transform certain attributes.
Continuous variables: Attributes with continuous values constitute a significant portion of a dataset. These attributes can be transformed using techniques such as z-score normalization and log transformation. For the current research question, most of the continuous variables are student scores, numbers of students, quantities of materials, and so forth. Z-score normalization is suitable for many of these, and it also serves as an easy method for detecting outliers. Continuous variables can also be summarized with measures of central tendency and used in many calculations of similarity and dissimilarity.
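As a minimal sketch of z-score normalization and its use for outlier screening, the following Python snippet standardizes a list of values and flags those whose absolute z-score exceeds a chosen threshold. The scores, the function names, the threshold of 2.5, and the use of the population standard deviation are illustrative assumptions, not values or choices taken from the study.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return the z-score of each value: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumed choice)
    if sigma == 0:
        # All values are identical, so no value deviates from the mean.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


def flag_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Hypothetical student scores, made up for illustration.
scores = [72, 75, 78, 80, 81, 83, 85, 88, 90, 12]
print([round(z, 2) for z in z_scores(scores)])
print(flag_outliers(scores, threshold=2.5))
```

The threshold is a judgment call: with small samples the largest attainable z-score is limited, so a cutoff lower than the conventional 3 may be needed.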
Categorical variables: Categorical data specify the category or group to which an observation belongs. Such data can be derived from continuous values as well as from qualitative observations. A categorical attribute can be dichotomous, taking only two values such as “yes” or “no”; it can also be polytomous, or nominal, with multiple classes, such as grades A, B, C, or D or states in the United States. For most machine learning and data mining algorithms, these variables have to be converted into numeric codes.
A numeric coding scheme has been followed to convert categorical variables to numeric labels. For example, the categorical attribute MetAttendTarg has three levels (“yes,” “no,” and “NA”). To enable effective clustering, these values have been converted to 1, 2, and 3, respectively. The nature of the data type is not altered by this coding. Similarly, scaled attributes such as grades have been coded using similar numeric codes.
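The coding of MetAttendTarg described above can be sketched in Python as a simple lookup. The mapping (yes to 1, no to 2, NA to 3) comes from the text; the function name, the sample values, and the decision to reject unknown labels are assumptions made for illustration.

```python
# Mapping stated in the text: "yes" -> 1, "no" -> 2, "NA" -> 3.
MET_ATTEND_TARG_CODES = {"yes": 1, "no": 2, "NA": 3}


def encode(values, codes):
    """Replace each category label with its numeric code.

    Unknown labels raise a ValueError so that typos in the raw data
    are caught rather than silently coded.
    """
    encoded = []
    for value in values:
        if value not in codes:
            raise ValueError(f"unexpected category: {value!r}")
        encoded.append(codes[value])
    return encoded


# Hypothetical raw values, made up for illustration.
raw = ["yes", "no", "NA", "yes"]
print(encode(raw, MET_ATTEND_TARG_CODES))
```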
Text document transformations: Analyzing text with computational methods requires transforming it into numeric form. Since most similarity and dissimilarity measures require numeric input, text data must be converted into a structure in which words are represented by numbers. This converted structure is called a term document matrix, described next.
Term document matrix: A term is a word. Counting how many times a word appears in a document or a corpus is an important part of text mining; this count is called term frequency. The terms that appear most frequently are not necessarily the most important ones. Because very common words such as “student” or “class” contribute little to the analysis of the comments given by the teacher, such extremely frequent terms are eliminated from term weighting. The importance of a word depends on its context in the document or corpus. Encoding text and assigning each term a degree of importance is called term weighting. The most widely adopted method is inverse document frequency, and the overall scheme is called term frequency/inverse document frequency (TF-IDF). The formulation is:
$$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)$$
where $w_{i,j}$ is the weight of term $i$ in document $j$, $tf_{i,j}$ is the number of occurrences of term $i$ in document $j$, $N$ is the total number of documents, and $df_i$ is the number of documents containing term $i$.
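The weighting above, term frequency multiplied by the logarithm of N over the document frequency, can be sketched as a small Python function. The text does not state the base of the logarithm, so the natural logarithm is assumed here; the function name and the example values are also illustrative.

```python
import math


def tfidf_weight(tf, n_docs, df):
    """Weight of a term in a document: tf * log(N / df).

    tf     -- number of occurrences of the term in the document
    n_docs -- total number of documents (N)
    df     -- number of documents containing the term
    The natural logarithm is an assumption; the source does not give the base.
    """
    if df <= 0 or df > n_docs:
        raise ValueError("df must be between 1 and n_docs")
    return tf * math.log(n_docs / df)


# Illustrative values only: a corpus of N = 3 documents.
print(tfidf_weight(tf=1, n_docs=3, df=3))  # term found in every document
print(tfidf_weight(tf=1, n_docs=3, df=1))  # term found in a single document
```

A term that occurs in every document has N/df = 1, so its weight is zero regardless of how often it appears, while a term confined to fewer documents receives a larger weight.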
The following small example details how the transformation is implemented. Given the following three documents a, b, c:
a. Student is excellent in math and always completes homework
b. Student is very good in math and helpful to fellow class mates
c. Student is very motivated in sciences. Has to improve in English
To transform these documents, the first step is cleaning: removing stop words and stemming the remaining words. The frequency of each word in the documents is then tabulated. Figure 4 shows the term frequencies for the three documents.
      Terms
Docs  alway  and  class  complet  english  excel  fellow  good  hav
   3      1    2      1        1        1      1       1     1    1
      Terms
Docs  help  homework  improv  mate  math  motiv  science  student  veri
   3     1         1       1     1     2      1        1        3     2
Figure 4: Term document matrix showing the frequency of each word across the three documents.
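The tabulation step for the three example documents can be sketched in Python as follows. This is a simplified illustration, not the procedure used to produce Figure 4: it performs no stemming, so terms appear in full (for example "always" rather than "alway"), and the stop-word list of "is", "in", and "to" is an assumption chosen because those words are absent from the figure. The names DOCS, STOP_WORDS, tokenize, and term_document_matrix are hypothetical.

```python
import re
from collections import Counter

# The three example documents from the text.
DOCS = {
    "a": "Student is excellent in math and always completes homework",
    "b": "Student is very good in math and helpful to fellow class mates",
    "c": "Student is very motivated in sciences. Has to improve in English",
}

# Assumed stop-word list; the source does not say which list was used.
STOP_WORDS = {"is", "in", "to"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def term_document_matrix(documents):
    """Return (sorted terms, {document name: list of term counts})."""
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    terms = sorted(set().union(*counts.values()))
    matrix = {name: [c[t] for t in terms] for name, c in counts.items()}
    return terms, matrix


terms, matrix = term_document_matrix(DOCS)
for name, row in matrix.items():
    print(name, dict(zip(terms, row)))
```

Each row holds the term frequencies for one document; summing a column across the three rows gives the corpus-wide count of that term.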
Multicollinearities: When an attribute can be predicted from another attribute, there is a high degree of correlation between them. Datasets often represent the same information in multiple forms or measure it through more than one attribute. When this happens, the data exhibit multicollinearity. This problem prevents regression-based models from accurately identifying the real relationships between the variables. Thus, after the variables' importance is assessed using statistical methods, some variables are removed. For numeric attributes, principal component analysis is applied, and for categorical attributes, a chi-square correlational analysis is performed.
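For the categorical case, the chi-square statistic for a pair of attributes can be computed from their contingency table. The following plain-Python sketch computes only the test statistic, not a p-value, and uses made-up counts; the function name and both tables are assumptions for illustration and do not reproduce the study's analysis.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows).

    Each cell contributes (observed - expected)^2 / expected, where
    expected = row total * column total / grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if n == 0 or 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Hypothetical counts: rows are levels of one attribute, columns of another.
independent = [[10, 20], [20, 40]]  # column proportions identical in each row
associated = [[10, 0], [0, 10]]     # one attribute fully determines the other
print(chi_square_statistic(independent))
print(chi_square_statistic(associated))
```

A statistic that is large relative to the chi-square distribution with (rows - 1) x (columns - 1) degrees of freedom suggests that the two attributes are associated, in which case one of them may carry redundant information.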