Performance Evaluation of Educational Data Mining Methods for E-Learning System




Abstract— Educational organizations maintain historical data for future analysis in order to predict and improve the benefits, profits, and development of the organization. Decisions about student performance can be made after mining historical student information for useful insight. This paper proposes the development and implementation of a Student Performance Prediction System using three data mining methods: Neural Network, Decision Tree, and K-Nearest Neighbor. A comparison of their results on three educational datasets is also provided. The experimental results indicate that, under holdout validation, Neural Network outperforms K-Nearest Neighbor and Decision Tree on all datasets, obtaining accuracies of 97 percent, 92 percent, and 83 percent for the Iraq, Math, and Por datasets, respectively.

Index Terms— Educational Data Mining (EDM), Prediction, Accuracy.

I. INTRODUCTION

In general, data mining represents the "core" step of the Knowledge Discovery in Databases (KDD) process, which deals with the extraction of interesting patterns by selecting the following: a specific data mining method or task (e.g., summarization, classification, clustering, regression), appropriate algorithms to perform the task at hand, and a suitable representation of the output. Definitions of data mining include “automatic yet non-trivial extraction of implicit, previously unknown, and probably valuable data”, “Computer process to extract useful knowledge from large amounts of data automatically”, and “Automated analysis and exploration of large amounts of data to detect meaningful patterns”. These definitions are roughly equivalent. They all agree on the main aspects of data mining: (i) a huge quantity of data that (ii) should be analyzed so as to (iii) extract what is called “knowledge”, “useful information”, or “patterns”, i.e., (iv) something that can be processed and profitably exploited by human beings [1].

Educational data mining is a data mining subdomain that was developed to extract and analyze new information from educational data sources [2]. EDM handles academic data

suggestions to academic planners in institutions of higher education to improve their decision-making process, improve student academic performance, reduce failure rates, better understand student behavior, assist instructors, improve teaching and develop predictive models to predict student performance [3][4].

Data mining has two main objectives: prediction and description. Prediction is often referred to as supervised data mining, while descriptive data mining includes unsupervised data mining and visualization. Most data mining techniques are based on inductive learning, where a model is constructed explicitly or implicitly by generalizing from a sufficient number of training examples. The underlying assumption of the inductive approach is that the trained model applies to future cases [5].

Many data mining methods are used for various purposes and goals. A taxonomy helps in understanding the variety of methods, their interrelationships, and their grouping. It is useful to distinguish between two main types of data mining: verification-oriented (the system verifies the user's hypothesis) and discovery-oriented (the system autonomously finds new rules and patterns). Figure 1 presents this taxonomy [6]. Discovery methods identify patterns in the data automatically. The discovery branch consists of the following methods [7]:

• Descriptive Methods: These methods are oriented towards data interpretation, focusing on understanding (e.g., through visualization) how the underlying data relates to its parts. They summarize a collection of data concisely but comprehensively and reveal interesting properties without any predefined goal. These methods do not predict a target value but focus on the intrinsic structure, relations, and interconnectedness of the data. Clustering (i.e., decomposing or reordering a collection of data into groups) is the most distinctive descriptive modeling task. Typically, points within a group should be similar to each other and as different as possible from points in other groups [7][8].

Saja Taha Ahmed¹, Prof. Dr. Rafah Al-Hamdani¹, Dr. Muayad Sadik Croock²

¹ The Informatics Institute for Postgraduate Studies, Iraqi Commission for Computers & Informatics (IIPS-ICCI), Baghdad, Iraq

² Computer Engineering Department, University of Technology


form the discovered knowledge in a way that is understandable and easy to use. Some predictive-oriented methods can also assist in providing data understanding [7].

Figure 1. Taxonomy of Data Mining Methods [5]

Work in EDM prediction can be classified into two main categories: classification and regression. Classification uses prior knowledge to build a learning model and then uses that model to assign a binary or categorical label to new data; a decision tree is one example. Regression, in contrast, predicts continuous variables; neural networks are one example. These models have been used widely in the area of EDM to predict which students should be classified as at-risk [9].

II. PROBLEM STATEMENT

As online courses at educational institutions continue to grow tremendously, the increase in enrollment has been overshadowed by low retention rates. The failed retention rate of e-learning is 10% to 20% higher than that of traditional classroom environments [10]. In addition, the dropout rate in online learning is substantially high, between 40% and 80% [11].

The challenge facing instructors is to monitor student progress closely and to provide support and resources as enrollment increases. Some current online learning systems offer no viable predictive tool that gives instructors a comprehensive view of the student’s academic performance early in a course. In traditional learning, the instructor has to wait until mid-term exams are completed to identify at-risk students, by which time intervention strategies are ineffective [12].

Although many researchers have investigated applying EDM techniques to student performance prediction, the development of web-based applications that use machine learning methods online to categorize students at an early stage is still fragmented and lacks structure in the research field, since the process requires data extraction, transformation, and other costly operations. In addition, high-dimensional, irrelevant, and redundant data may adversely influence knowledge discovery during the training stage and degrade the predictive precision of machine learning. Addressing these factors is a key ingredient missing from a number of studies using EDM techniques.

III. RESEARCH AIM AND OBJECTIVE

The purpose of the proposed system is to develop an automated, web-based Student Performance Prediction System (SPPS) that can be easily incorporated into a standard educational database management system. The SPPS model allows instructors to identify students’ performance levels early in the semester. The SPPS is designed to automatically analyze the historical educational database and select an optimal feature subset at regular intervals. Moreover, the prediction mechanism is embedded in the system, and the web interface is easy for teaching staff to use even if they have no basic understanding of prediction methods.

IV. THE PROPOSED SYSTEM

The proposed system performs student performance prediction by analyzing the historical training instances of previously registered students, since in each semester the performance risk of newly enrolled students needs to be predicted. Therefore, performance prediction is not only necessary; it must also be achieved with maximum efficiency. In the following subsections, the whole system is described in terms of its structure, users, and module tasks.

A. The SPPS Definition

The main design consideration of the proposed system focuses on providing advice to newly registered students in a particular course, as well as keeping track of student status until the end of the course. The proposed system is as user-friendly as possible, taking into account the responsibilities of the SPPS users.

B. The Proposed System Structure

The structure of the proposed SPPS is shown in Figure 2. It consists of three main layers (interface layer, DB layer, and execution layer). Each layer is explained as follows:

• Interface layer: This can also be called the view layer. It hosts the framework's Graphical User Interfaces (GUIs). It is the layer presented to the users; it acts as the system entry point and provides the end-users with the required commands and functionality. Based on the log-in interface, the SPPS GUI is split into two categories: the staff graphical user interface and the student interface. Users can log in and perform multiple operations with distinct levels of authentication.


Figure 2. The SPPS Structure

• Execution layer: The SPPS execution layer consists of different units for data preprocessing, prediction, evaluation, and decision recommendation. It is faster and has a lower error rate than a human expert.

C. The Proposed System Users

The SPPS deals with three types of users; each of them plays a specific role in the proposed system and has its own responsibilities:

• Student: New students are required to fill out the registration form, which covers personal, social, and economic factors that are saved in the student DB. The proposed prediction models use this information to estimate student performance. A registered student can download lectures as PDF files, take examinations, and view the results of his or her performance prediction through the student Graphical User Interface (GUI).

• Teachers: Teaching staff can use the staff GUI to upload lectures and exam results into SPPS and to monitor their students' performance level via the proposed prediction mechanism.

• System Administrator: Server management is the administrator's main task in the system. Although SPPS is as automated as possible, a system administrator is necessary to manage staff and students, upload notifications, generate performance reports, and control the system workflow when human interaction is needed.

D. The SPPS Operation Modes

The SPPS is designed to support three main modes of operation; they are:

• Registration: In this phase, everyone who wants to subscribe to SPPS must first register in the system. This task includes filling out the registration form, giving each student an identification code, and storing the student’s information in the student database.

• Lifecycle changes: Every student’s information has a lifecycle, which begins after the student registers in SPPS. All subsequent operations are considered part of the lifecycle until the account is terminated

system works such as adding grades of semesters for the student.

• Maintenance: Server must be kept running with the highest performance and efficiency. Maintenance includes installing or upgrading the software components, when it is needed, and making a periodic central database backup.

V. THE PROPOSED SPPS WORKFLOW

The first and common stage is the collection of the data required for the study. The methodology is applied to real data containing educational and social information about learners. The SPPS execution layer pre-processes the data collected from the student DB, generalizes it, and makes it suitable for calculations.

The dataset is split into training and test sets using one of the validation techniques. The training data is fed into a learning algorithm to train a model, and the labels of the test data (i.e., unseen data) are predicted using the trained model. For evaluation, the number of wrong predictions on the test data is counted to estimate the model’s accuracy.

Before the training data is used for learning, the dataset is analyzed by the proposed feature selection mechanism to identify the optimal feature subset that is used during testing. These features reduce dimensionality and increase predictive performance.

The SPPS performs student performance prediction using three data mining models: Neural Network (NN), Decision Tree (DT), and K-Nearest Neighbor (KNN). Finally, decisions are made based on the extracted knowledge (i.e., rules). The workflow of SPPS is shown in Figure 3.

Figure 3. The Proposed System Workflow
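The split, train, predict, and evaluate steps of the workflow can be illustrated with a minimal Python sketch. It is not the authors' implementation (the paper reports a C# system); the function names, the toy two-feature data, the 30% test ratio default, and the choice of k = 3 for KNN are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def holdout_split(rows, labels, test_ratio=0.3, seed=0):
    """Shuffle the sample indices and split them into training and test parts."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = int(round(len(rows) * test_ratio))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    train_x = [rows[i] for i in train_idx]
    train_y = [labels[i] for i in train_idx]
    test_x = [rows[i] for i in test_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, train_y, test_x, test_y


def knn_predict(train_x, train_y, query, k=3):
    """Label a query by majority vote among its k nearest training rows."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], query))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def accuracy(true_labels, predicted):
    """Count wrong predictions and convert the count to an accuracy ratio."""
    wrong = sum(t != p for t, p in zip(true_labels, predicted))
    return 1 - wrong / len(true_labels)


# Hypothetical, already normalized toy data: two well-separated groups.
rows = ([(-0.8 + 0.01 * i, -0.8) for i in range(10)]
        + [(0.8 - 0.01 * i, 0.8) for i in range(10)])
labels = ["Fail"] * 10 + ["Pass"] * 10

train_x, train_y, test_x, test_y = holdout_split(rows, labels)
predicted = [knn_predict(train_x, train_y, q) for q in test_x]
print(f"holdout accuracy: {accuracy(test_y, predicted):.2f}")
```

The same `accuracy` function could score any of the three models, since it only compares predicted labels with the labels of the unseen test data.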


submitting) questionnaire in three Iraqi secondary schools for both applicable and biology branches of the final stage during the second semester of 2018; it can be downloaded from [14] with a full description. The second and third datasets (Student Alcohol Consumption Dataset) are obtained from UCI Portugal [15], which incorporates two datasets: student-mat.csv (Math) and student-por.csv (Por). Dataset preprocessing can be summarized in the following steps:

• Dataset Encoding: The datasets contain attributes of various data types, for instance binary, interval, numeric, and categorical (nominal, ordinal). KNN requires the data to be numerical. Many feature encoding methods exist for transforming categorical data into numeric data, such as label encoding (integer encoding), one-hot encoding, binarization, and hashing. In this research, the datasets are encoded using Label Encoder, which is the most common method for transforming categorical features into numerical labels. The numerical labels always lie between 0 and (#attribute_value-1) [13].

• Dataset Normalization: In machine learning algorithms where distance plays a vital role, such as KNN, the datasets must be normalized to obtain a better predictor (i.e., to avoid misclassification) and to train the algorithm efficiently. Normalization is the process of scaling attribute values into a specific range (such as 0 to 1) so that all attributes have approximately similar magnitudes. This research normalizes the attribute values using Min-Max normalization to the range [-1, 1].

• Feature Selection: The results of the feature selection proposed in [13] show that the highest accuracy is achieved by social factors in combination with marks. This research selects the top eight features based on the Pearson correlation ranking criterion. The feature subset of the Iraqi dataset includes the following questions: “Q37 Worry Effect”, “Q20 Family Economic Level”, “Q25 Reason of study”, “Q27 Failure Year”, “Q8 Father Alive”, “Q17 Secondary Job”, “Q33 Study Hour”, “Q23 Specialization”. The UCI.student-por.csv feature subset includes “Q10 reason”, “Q8 Mjob”, “Q21 internet”, “Q3 address”, “Q7 Fedu”, “Q6 Medu”, “Q13 study time”, “Q20 higher”, and the UCI.student-mat.csv feature subset includes “Q17 paid”, “Q8 Mjob”, “Q1 sex”, “Q3 address”, “Q10 reason”, “Q7 Fedu”, “Q20 higher”, “Q6 Medu”.
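The three preprocessing steps above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the toy columns are hypothetical, the sorted category order in `label_encode` is an assumption, and ranking by the absolute value of the Pearson coefficient is also an assumption, since the text only states that a Pearson correlation ranking is used.

```python
import math


def label_encode(values):
    """Map each distinct category to an integer in 0 .. (n_categories - 1)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def min_max_scale(column, low=-1.0, high=1.0):
    """Linearly rescale a numeric column into the range [low, high]."""
    c_min, c_max = min(column), max(column)
    if c_max == c_min:  # constant column: nothing to spread out
        return [low for _ in column]
    return [low + (high - low) * (v - c_min) / (c_max - c_min) for v in column]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def top_k_features(columns, target, k=8):
    """Rank features by |r| with the target (assumed criterion), keep k."""
    ranked = sorted(columns,
                    key=lambda name: abs(pearson(columns[name], target)),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy columns, not taken from the paper's datasets.
internet, internet_map = label_encode(["yes", "no", "yes", "yes", "no"])
study_time = min_max_scale([1, 2, 4, 3, 2])
passed = [0, 0, 1, 1, 0]  # 1 = Pass, 0 = Fail

best = top_k_features({"internet": internet, "study_time": study_time},
                      passed, k=1)
print(internet_map, study_time, best)
```

With `k=8`, `top_k_features` would return a subset of the same size as the ones listed above.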

VII. THE EXPERIMENTAL RESULTS

The evaluation is performed on the basis of accuracy, which is the most intuitive performance measure. It is simply the ratio of correctly predicted observations to total observations.
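Written as a formula, with the symbols $N_{\text{correct}}$ and $N_{\text{total}}$ introduced here only for illustration, the measure described above is:

```latex
\[
\text{Accuracy} \;=\; \frac{N_{\text{correct}}}{N_{\text{total}}} \times 100\%
\]
```

As a hypothetical example, a model that labels 27 out of 30 test students correctly has an accuracy of $27/30 \times 100\% = 90\%$.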

Model validation helps to locate the best features of the model while also guarding it against overfitting. The proposed model is assessed using two of the most popular evaluation methods: 10-fold cross-validation and holdout. 10-fold cross-validation is used to validate the proposed DDT: the whole dataset is divided into 10 subsets of approximately equal size, and in each iteration 9 subsets act as training data and the remaining subset is used as testing data. In the holdout method, the datasets in all proposed models are separated into two sets: training data, which is 70% of the entire dataset, and testing data, which is the remaining 30%.
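The 10-fold partition described above can be sketched as follows. This is a minimal illustration under assumptions: the function name is hypothetical, the sample count of 25 is arbitrary, and shuffling before partitioning is omitted for brevity.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = (list(range(0, start))
                     + list(range(start + size, n_samples)))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=25, k=10))
tested = sorted(i for _, test_idx in folds for i in test_idx)
for iteration, (train_idx, test_idx) in enumerate(folds, start=1):
    print(iteration, len(train_idx), len(test_idx))
```

In each iteration a model would be trained on `train_idx` and scored on `test_idx`; averaging the ten scores gives the cross-validated accuracy.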

The main objective of this research is to forecast student achievement, as stated earlier. In addition, the decision tree needs the data to be categorical. The grade features must have discrete values to obtain better results. A discretization mechanism is used to convert the grade values from numerical to nominal ones. For this purpose, specific classes are formulated for each dataset, which can be either “Pass” or “Fail”. In the UCI datasets, the target attribute is the final grade G3; there are three grades, G1, G2, and G3, each ranging from 0 to 20. If the student has an average equal to or higher than 10, the student is classified under the “Pass” label; otherwise the student is classified as “Fail”. In the Iraqi dataset, the target attribute is the second-semester average Avg2, and grade scores are within the range 0-100. If the student has an average equal to or higher than 50, the student is labeled “Pass”; otherwise the student is classified as “Fail”. The proposed SPPS depends on the average of all subjects in the first and second semesters to predict student performance.
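The discretization rule above reduces to a single threshold comparison per dataset. The sketch below uses the pass marks stated in the text (10 for the UCI datasets, 50 for the Iraqi dataset); the function and constant names are hypothetical.

```python
UCI_PASS_MARK = 10    # UCI grades range from 0 to 20
IRAQ_PASS_MARK = 50   # Iraqi grades range from 0 to 100


def discretize_grade(average, pass_mark):
    """Convert a numeric grade into the nominal class 'Pass' or 'Fail'."""
    return "Pass" if average >= pass_mark else "Fail"


uci_labels = [discretize_grade(g, UCI_PASS_MARK) for g in (0, 9, 10, 20)]
iraq_labels = [discretize_grade(g, IRAQ_PASS_MARK) for g in (49, 50, 100)]
print(uci_labels, iraq_labels)
```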


Table 1 shows the results of NN, KNN, and ID3 based on holdout validation. The accuracy of NN exceeds that of KNN and ID3 on all three datasets.

Table 1. Accuracy (%) of NN, KNN, and ID3 Based on Holdout Validation

Datasets  NN    KNN   ID3
Iraq      97.2  83.3  83
Math      92    78.1  62
Por       83.1  69.2  67

Because ID3 may overfit, and because holdout validation leaves part of the dataset unused for training and can produce a high error rate, 10-fold cross-validation is used so that the proposed model generalizes well without overfitting. It ensures that all observations are used for both training and testing, and each observation is used for testing exactly once. Table 2 shows the accuracy of the 10 iterations and the average accuracy of 10-fold cross-validation for ID3. Table 3 shows the best ID3 accuracy along with its iteration numbers, taken from Table 2.
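The way Table 3 and the average column follow from the per-iteration values of Table 2 can be checked with a short Python sketch. The accuracies are copied from Table 2; the function name and the rounding of the average to a whole number are assumptions made to match the reported averages.

```python
from statistics import mean

# Per-iteration ID3 accuracies (%) as reported in Table 2.
id3_fold_accuracy = {
    "Iraq": [59.3, 78, 59.3, 77, 86, 91.6, 91.6, 90, 66.6, 91.6],
    "Math": [53.8, 64, 74, 51, 58.9, 61.5, 64, 66.66, 69, 58.9],
    "Por": [87, 82.8, 64, 84, 81, 87.5, 73.4, 71.8, 57.8, 60.9],
}


def summarize(folds):
    """Return (rounded average, best accuracy, 1-based best iterations)."""
    best = max(folds)
    best_iterations = [i + 1 for i, acc in enumerate(folds) if acc == best]
    return round(mean(folds)), best, best_iterations


summary = {name: summarize(folds) for name, folds in id3_fold_accuracy.items()}
for name, (avg, best, iterations) in summary.items():
    print(name, avg, best, iterations)
```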

Table 2. 10-Fold Cross-Validation of ID3 (accuracy, %)

Dataset  1     2     3     4     5     6     7     8      9     10    10-Fold AVG
Iraq     59.3  78    59.3  77    86    91.6  91.6  90     66.6  91.6  79
Math     53.8  64    74    51    58.9  61.5  64    66.66  69    58.9  62
Por      87    82.8  64    84    81    87.5  73.4  71.8   57.8  60.9  75

Table 3. ID3 Best Accuracy Iteration

Datasets  ACC   #Iteration
Iraq      91.6  6, 7, 10
Math      74    3
Por       87.5  6

From Tables 1 and 3, it is clear that NN surpasses KNN and ID3 on all educational datasets based on holdout validation, obtaining accuracies of 97%, 92%, and 83.1% for the Iraq, Math, and Por datasets, respectively. However, ID3 outperforms NN and KNN on the Por dataset with an accuracy of 87.5% in iteration 6 of 10-fold cross-validation.

VIII. DESIGNED GUI OF PROPOSED SPPS


The experiments and the web application of the proposed SPPS are developed using Visual Studio C# and ASP development server 2015 and executed on an Intel Core i7 machine with a 1.8 GHz CPU, 12 GB of RAM, and a 64-bit operating system.


The proposed SPPS contains the following interfaces: home page, registration form, login page, preprocessing page, feature selection, and prediction pages. Figure 4 shows the home page of proposed SPPS.

Figure 4. SPPS Home Page

When a student subscribes to SPPS for the first time, he or she has to fill in a registration form, and all of the information is saved in the student DB. The fields of this form contain the same features as the Iraqi dataset, in addition to user name, email, and password. Figure 5 displays the SPPS registration form. Once accepted in SPPS, the student can log in with his or her email and password through the login page to be authorized to use the SPPS courses.

Figure 5. Registration Form of SPPS

The student performance prediction is performed online when the student clicks the create student button. The internal prediction operations using data mining methods are hidden from the students but visible to the teaching staff and the system administrator. The proposed SPPS can be presented through its main processing stages, which can be summarized as:

A. Preprocessing Stage

The SPPS administrator can upload and view educational data and perform all data preparation steps via the preprocessing page. Figure 6 shows information about the students in the Iraqi performance prediction dataset.

Figure 6. Preprocessing Page

Since the first step of preprocessing is dataset encoding, a lookup table is generated that contains all attributes with their categories. Figure 7 shows the lookup table of the Math dataset. The features clearly contain binary, nominal, and numerical values.

Figure 7. Math Lookup Table


Figure 8. Por Dataset Encoding

Dataset normalization can be viewed in Figure 9. All values lie in the interval from -1 to 1, which makes the algorithm learn faster. The experimental tests and implementation show that dataset normalization raises an important issue: normalization is part of dataset preparation, and if sufficient care is not taken, the dataset may lose its internal structure, which leads to lower accuracy.

Figure 9. Dataset Normalization

The correlation matrix among features [13] is illustrated in Figure 10. This matrix indicates that there is no strong relation among features (i.e., no redundant features), except the relation among G1, G2, and G3. For example, in the Math dataset the Pearson correlation between G1 and G3 is 0.801468 and between G2 and G3 is 0.904868, which demonstrates that G1, G2, and G3 are strongly correlated. On the other hand, R between absence and G3 is 0.034247, which indicates poorly correlated features.
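For reference, the standard definition of the Pearson correlation coefficient between two features $x$ and $y$ over $n$ students is given below; the symbols are introduced here for illustration and are not taken from the paper.

```latex
\[
R_{xy} \;=\;
\frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
     {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
      \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]
```

Here $\bar{x}$ and $\bar{y}$ are the feature means. $R_{xy}$ lies in $[-1, 1]$: values near $\pm 1$, such as the 0.904868 reported between G2 and G3, indicate a strong linear relation, and values near 0, such as the 0.034247 reported between absence and G3, indicate a weak one.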

Figure 10. Correlation Matrix

B. Prediction Stage

The outcomes for the Pearson-based Math dataset are shown in Figure 11, which indicates the NN training error and the accuracies attained at all iterations. The training error starts at a high rate and stabilizes after epoch 250.

Figure 11. Math Dataset NN Iterations


Figure 12. KNN and DT Prediction

IX. CONCLUSION

This paper has designed and implemented an e-learning framework for effectively and accurately predicting student academic achievement. The proposed system structure, workflow, and users are presented. The paper also describes the procedures performed in the prediction phase to select the best predictive method.

The accuracy of the utilized data mining methods is evaluated, showing that NN performs better than KNN and ID3.

REFERENCES

[1] Francesco Gullo, “From Patterns in Data to Knowledge Discovery: What Data Mining Can Do”, Elsevier Inc., 3rd International Conference Frontiers in Diagnostic Technologies, ICFDT3, Physics Procedia, Vol.62, P. 18 – 22, 2015.

[2] John J., Kavya J., Paarth K., Shubha P., " Educational Data Mining Techniques and their Applications", IEEE, International Conference on Green Computing and Internet of Things (ICGCIoT), DOI: 10.1109/ICGCIoT.2015.7380675, 2015.

[3] C. Romero and S. Ventura "Educational Data Mining: A Review of the State of the Art", IEEE Transactions on Systems Man, and Cybernetics-Part C: Applications and Reviews, Vol. 40, no. 6, 2010.

[4] Umamaheswari. K, and S. Niraimathi "A Study on Student Data Analysis Using Data Mining Techniques", International Journal of Advanced Research in Computer Science and Software Engineering, Vol.3, Issue 8, ISSN: 2277 128X, 2013.

[5] Arturas Kaklauskas, “Biometric and Intelligent Decision Making Support”, Springer, ISSN 1868-4394, DOI:10.1007978-3-319-2, 2015.

[6] Lior Rokach, Oded Z. Maimon, “Data Mining with Decision Trees: Theory and Applications”, 2nd Edition, World Scientific Publishing Co. Pte. Ltd., 2015.

[7] Saraswat D., “Knowledge Discovery with Hybrid Data Mining Approach”, (Doctoral dissertation), Dayalbagh Educational Institute, 2017.

[8] Natthakan Iam-On, Tossapon Boongoen, “Generating

[9] Abdulmohsen Algarni, “Data Mining in Education”, International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 7, No. 6, DOI:10.14569/IJACSA.2016.070659, 2016.

[10] Herbert M., “Staying the course: A study in online student satisfaction and retention”, Online Journal of Distance Learning Administration, 9(4). Retrieved from http://www.westga.edu/~distance/ojdla/winter94/herbert94.htm, 2006.

[11] Heyman, E., “Overcoming student retention issues in higher education online programs: A Delphi study”, (Doctoral dissertation). Retrieved from ProQuest Dissertations and Theses database. (ProQuest document ID: 748309429). Retrieved from http://search.proquest.com/docview/748309429?accountid =13360, 2010.

[12] Huang, S., & Fang, N. (2013). Predicting student academic performance in an engineering dynamics course: A comparison of four types of predictive models. Computers and Education, Vol.61, P.133-145, 2013.

[13] Saja Taha Ahmed, Rafah Shihab al-Hamdani, Muayad Sadik Croock, “EDM Preprocessing and Hybrid Feature Selection for Improving Classification Accuracy”, Journal of Theoretical and Applied Information Technology, Vol.96. No 1 ISSN: 1992-8645, 2019.

[14] Saja Taha, “Iraqi Student Performance Prediction”, Mendeley Data, http://dx.doi.org/10.17632/smgx6s5pwr.1, DOI: 10.17632/smgx6s5pwr.1, 2018.