Prediction of Students Learning Performance Using Machine Learning Algorithms
1Mrs.S.Menaka, 2Dr.G.Kesavaraj
1Ph.D., Research Scholar, 2Assistant Professor,
1,2Department of Computer Science,
Vivekanandha College of Arts and Sciences for Women (Autonomous), Tiruchengode, Tamil Nadu, INDIA-637205
Abstract
Predicting students' performance is difficult because of the large volume of data involved. Nowadays students face many problems in their academic studies. The main objective of any educational institute is to provide quality education and to improve the overall performance of the organization by looking at individual performances. Compared with the existing databases, larger new databases are needed for the analysis. Existing papers use data mining techniques such as Naïve Bayes, J48 and Neural Networks with the WEKA tool. In this article we use classification methods to analyze student performance. We introduce the data mining techniques C5.0, Naïve Bayes and Random Forest. The R programming language, an open-source tool, is used to predict the academic performance of slow learners with good accuracy.
Keywords: C5.0, Data Mining Algorithm, Educational Data Mining, Naïve Bayes, R Programming.
1. INTRODUCTION
Educational Data Mining (EDM) is a research field that applies educational learning principles to predict students' educational performance. There are many EDM applications; the data mining fields they draw on are machine learning and statistics, applied to information generated in educational settings [9]. EDM has contributed to the theories of learning investigated by researchers in educational psychology and the learning environment. Educational data mining refers to the techniques and tools for automatically extracting information from large repositories of data generated by the researcher [7]. These EDM techniques are used to study learners and the effect of various learning environments. Four goals are commonly stated for Educational Data Mining techniques, including:
Predicting students' future learning behavior – This is used in student modeling to create student models and find the learners' characteristics.
Discovering or improving domain models – Analyzing the existing models and creating new models and algorithms.
Studying the effect of student support – Using algorithmic tools to study the students' learning systems.
2. LITERATURE REVIEW
Among classification techniques, Neural Network and Decision Tree are the two methods most used by researchers for predicting students' performance [3]. The meta-analysis on predicting students' performance has motivated us to carry out further research to be applied in our environment. It will help the educational system to monitor students' performance in a systematic way.
A study by Yadav et al. predicts students' performance at the end of the semester by applying three decision tree algorithms: ID3, CART and C4.5. In their study they achieved 52.08%, 56.25% and 45.83% accuracy, respectively [8].
Random forests, decision trees, support vector machines, naive Bayes, bagged trees and boosted trees have been used to predict, in the V semester, performance levels at the end of the degree [2]. A dataset of 2459 students from a European Engineering School of a public research University is used to validate the proposed methodology. The empirical results demonstrate the ability of the proposed model to predict the students' performance level with accuracy above 95%, at an early stage of the students' academic path.
A novel classification model based on relational association rule discovery has been proposed for predicting the successful completion of an academic course, based on the grades received by students during the academic semester [1]. Experiments conducted on three real datasets collected from Babeş-Bolyai University, Romania, have shown a good performance of the classifier [5]. The experimental results highlighted that the classifier is better than, or comparable to, the supervised classifiers already applied in the EDM (Educational Data Mining) literature for students' performance prediction.
3. METHODOLOGY
A. Machine Learning Algorithms
Classification is a supervised learning method in which data is divided into different categories or classes. The objective of classification is to forecast the target class for given data from the database. Various classification techniques are used, such as C5.0, the Naïve Bayes classifier and the Random Forest approach; these are important techniques for classification. The accuracy of the prediction depends on the classification technique selected. In real-life situations classification is basically probabilistic: it is uncertain which class a data item belongs to.
B. Tool Used
R is a programming language and software environment for statistical analysis, graphics representation and reporting; RStudio is an integrated development environment for R. R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team. R is freely available as open source under the GNU General Public License, and pre-compiled binary versions are provided for different operating systems such as Windows, Linux and Mac. The language was named R based on the first letter of the first names of its two authors (Robert Gentleman and Ross Ihaka) and partly as a play on the name of the Bell Labs language S.
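C. C5.0 Classifier
The C5.0 algorithm is a successor of the C4.5 algorithm, also developed by Quinlan (1994), and produces a binary tree or a multi-branch tree. It uses information gain (entropy) as its splitting criterion. The C5.0 pruning technique adopts the binomial confidence limit method.
C5.0 uses the concept of entropy for measuring purity. The entropy of a sample of data indicates how mixed the class values are; the minimum value of 0 indicates that the sample is completely homogeneous, while 1 indicates the maximum amount of disorder for a two-class sample. For a sample $S$ with $c$ classes, entropy is defined as $\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$, where $p_i$ is the proportion of cases in class $i$.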
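As an illustration of the entropy criterion, the following minimal Python sketch computes the entropy of a class distribution and the information gain of one split. It assumes the class counts that can be read off the C5.0 tree reported in the results section (13 Average, 18 Fast, 9 Slow; split on Maths.at.HSC). The function names are illustrative only, and the sketch shows plain information gain as described in the text; it is not the internal procedure of the C5.0 implementation, which may apply further refinements.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    # Classes with zero cases contribute nothing (0 * log 0 is taken as 0).
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent, branches):
    """Entropy of the parent minus the size-weighted entropy of its branches."""
    total = sum(parent)
    weighted = sum(sum(b) / total * entropy(b) for b in branches)
    return entropy(parent) - weighted


# Class counts in the order (Average, Fast, Slow), read off the reported tree.
parent = [13, 18, 9]      # all 40 training cases
maths_yes = [6, 0, 9]     # branch Maths.at.HSC = Yes
maths_no = [7, 18, 0]     # branch Maths.at.HSC = No

gain = information_gain(parent, [maths_yes, maths_no])
print(f"entropy of the full sample: {entropy(parent):.4f}")
print(f"gain from splitting on Maths.at.HSC: {gain:.4f}")
```

With three classes the entropy can exceed 1; its upper bound is log2(3), reached only when the three classes are equally frequent.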
D. Naïve Bayes Classifier
Bayesian classification is based on Bayes' theorem. The posterior probability of the class that a data item belongs to is approximated using the prior probability drawn from the training dataset. The classification model calculates the likelihood of the data belonging to each class. The class with the highest probability becomes the class label for the data.
Bayes' theorem definition: Given two random variables X and Y, each of them taking a specific value corresponds to a random event. The conditional probability P(Y|X) represents the probability that the event for Y happens given that the event for X has already occurred.
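Under the naïve assumption that the attributes are conditionally independent given the class, the decision rule described in this section can be written compactly. The symbols $x_1, \dots, x_n$ (the attribute values of one student) and $\hat{y}$ (the predicted class) are notation introduced here for illustration and do not appear in the original text.

```latex
\hat{y} \;=\; \arg\max_{y}\; P(Y = y) \prod_{i=1}^{n} P(X_i = x_i \mid Y = y)
```

The denominator $P(X)$ of Bayes' theorem is the same for every class, so it can be omitted when only the most probable class is needed.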
E. Random Forest
Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks that operates by constructing a large number of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
The first algorithm for random decision forests was developed by Tin Kam Ho using the random subspace method which, in Ho's formulation, is a technique to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg.
An extension of the algorithm was developed by Leo Breiman and Adele Cutler, who registered "Random Forests" as a trademark (as of 2019 owned by Minitab Inc.). The extension combines Breiman's "bagging" idea and the random selection of features, introduced first by Ho and later independently by Amit and Geman, in order to construct a collection of decision trees with controlled variance.
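The three ingredients named in this section (bootstrap sampling, random feature selection, and combining the trees by mode or mean) can be sketched in a few lines of Python. This is a simplified illustration, not the randomForest implementation used in the experiments: tree training itself is omitted, and all function names and the example values are assumptions made for the sketch.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bagging' step)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(features, k, rng):
    """Pick k distinct candidate features for one split."""
    return rng.sample(features, k)


def majority_vote(tree_predictions):
    """Classification output: the mode of the per-tree class labels."""
    return Counter(tree_predictions).most_common(1)[0][0]


def mean_prediction(tree_predictions):
    """Regression output: the mean of the per-tree numeric predictions."""
    return sum(tree_predictions) / len(tree_predictions)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
features = ["Gender", "Area", "Attendance", "Comm.Skill", "Maths.at.HSC"]

print(bootstrap_sample([1, 2, 3, 4, 5], rng))
print(random_feature_subset(features, 3, rng))
print(majority_vote(["Fast", "Slow", "Fast", "Average", "Fast"]))
print(mean_prediction([0.2, 0.4, 0.9]))
```

Because each tree sees a different bootstrap sample and a different feature subset at each split, the individual trees disagree on some cases, and the vote averages out part of their individual overfitting.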
F. Proposed Model
Figure 1 shows the flowchart of the proposed work.
G. Training Dataset
A training dataset of 500 BCA III year students was collected from various colleges. The student attributes include Gender, Area, SSC_Medium, SSC_Percentage, HSC_faculty, Math_At_HSC, Graduation_Marks, Entrance_Rank, ParentsIncome, Attendance, Communication_Skill and Learning_Behavior (class label). In this dataset 25 attributes were collected, of which only 12 were used in the research to validate the proposed methodology. Compared with existing datasets, more attributes were collected from students in various departments. The learning behavior of students is predicted from the given training dataset using the C5.0, Naïve Bayes and Random Forest algorithms.

[Figure 1: Flowchart of proposed work. Boxes in reading order: Student Performance Data; Data Set; Pre-Processing; R Tool; Data Mining Classification Algorithms (C5.0, Naïve Bayes, Random Forest); Pattern Evaluation; Final Result (Accuracy) of Students' Slow Learners Academic Performance]

Bayes' theorem for the random variables X and Y:

$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}, \qquad P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}$$
H. Data Pre-processing
Data was pre-processed by performing the following operations:
Converting all fields into different categories.
Combining features to reduce empty values.
Replacing null and missing values using some algorithms.
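The text does not name the algorithms used for these steps, so the following Python sketch shows only one plausible way to carry out two of them: binning a numeric percentage into the category labels that appear in Table 1, and filling missing values with the most frequent category. The cut-off values, the function names and the example marks are all hypothetical.

```python
from collections import Counter


def to_category(percentage):
    """Map a numeric percentage to a label. The cut-offs are hypothetical."""
    if percentage is None:
        return None  # keep missing values missing at this stage
    if percentage >= 75:
        return "Excellent"
    if percentage >= 60:
        return "Good"
    if percentage >= 45:
        return "Average"
    return "Poor"


def impute_with_mode(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


raw_hsc_percentages = [82, 58, None, 40, 61, 66]  # made-up example values
categories = [to_category(p) for p in raw_hsc_percentages]
cleaned = impute_with_mode(categories)
print(cleaned)
```

Mode imputation is chosen here only because it keeps every field categorical; other replacement strategies would fit the description in the text equally well.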
Table 1: Training Dataset for Academic Students Performance

sr.no  Gender  Area   SSLC_medium  SSLC_Per   HSC_per    MathsAtHSC  GraduationMark  EntranceRank  ParentsIncome  Attendance  CommSkill  LearningBehavior
1      M       Rural  English      Excellent  Poor       Yes         Excellent       Good          High           Poor        Good       Slow
2      M       Urban  English      Good       Good       Yes         Poor            Poor          Medium         Average     Poor       Fast
3      M       Urban  English      Good       Poor       No          Good            Good          Low            Good        Good       Average
4      F       Urban  Tamil        Poor       Good       Yes         Good            Average       Low            Good        Good       Slow
5      M       Rural  Tamil        Poor       Excellent  No          Poor            Poor          High           Average     Poor       Fast
6      M       Rural  Tamil        Excellent  Poor       No          Excellent       Good          Medium         Poor        Excellent  Average
7      F       Urban  Tamil        Average    Excellent  Yes         Poor            Good          Medium         Average     Poor       Slow
8      F       Rural  Tamil        Poor       Poor       No          Good            Average       Low            Average     Excellent  Fast
9      M       Rural  Tamil        Excellent  Poor       No          Good            Good          Low            Good        Good       Fast
10     F       Urban  English      Poor       Good       Yes         Poor            Good          High           Average     Excellent  Average
11     M       Rural  Tamil        Poor       Excellent  No          Poor            Poor          High           Average     Poor       Fast
12     M       Rural  Tamil        Excellent  Poor       No          Excellent       Good          Medium         Poor        Excellent  Average
13     F       Urban  Tamil        Average    Excellent  Yes         Poor            Good          Medium         Average     Poor       Slow
14     F       Rural  Tamil        Poor       Poor       No          Good            Average       Low            Average     Excellent  Fast
15     M       Rural  Tamil        Excellent  Poor       No          Good            Good          Low            Good        Good       Fast
5. RESULT AND DISCUSSION
A. C5.0 Classifier Algorithm

Classification Tree
Number of samples: 40
Number of predictors: 12
Tree size: 5

Class specified by attribute `outcome'
Read 40 cases (13 attributes) from undefined.data

Decision tree:

Maths.at.HSC = Yes:
:...Comm.Skill = Excellent: Average (6)
:   Comm.Skill in {Good,Poor}: Slow (9)
Maths.at.HSC = No:
:...Graduation.Mark = Excellent: Average (5)
    Graduation.Mark in {Good,Poor}:
    :...Area = Rural: Fast (18)
        Area = Urban: Average (2)

Evaluation on training data (40 cases):

    Decision Tree
    ---
    (a)   (b)   (c)    <- classified as
    ---
     13                (a): class Average
           18          (b): class Fast
                  9    (c): class Slow

Confusion Matrix and Statistics

Prediction  Average  Fast  Slow
Average          16     0     0
Fast              0    21     2
Slow              0     0    11

Overall Statistics
Accuracy: 0.96
95% CI: (0.8629, 0.9951)
No Information Rate: 0.42
P-Value [Acc > NIR]: 3.498e-16
Kappa: 0.9382
Mcnemar's Test P-Value: NA

Statistics by Class:
                      Class: Average  Class: Fast  Class: Slow
Sensitivity           1.00            1.0000       0.8462
Specificity           1.00            0.9310       1.0000
Pos Pred Value        1.00            0.9130       1.0000
Neg Pred Value        1.00            1.0000       0.9487
Prevalence            0.32            0.4200       0.2600
Detection Rate        0.32            0.4200       0.2200
Detection Prevalence  0.32            0.4600       0.2200
Balanced Accuracy     1.00            0.9655       0.9231

C5.0 Algorithm Result Using RStudio
Figure 2: C5.0 Classifier Result
The above output shows that the C5.0 decision tree algorithm, run in RStudio, achieves an accuracy of 96% in identifying slow learners from the given dataset.
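The summary figures reported for C5.0 can be checked by hand from its confusion matrix. The Python sketch below assumes the usual layout of such output (rows are predicted classes, columns are actual classes) and recomputes the accuracy, Cohen's kappa and the per-class sensitivity; the function names are illustrative.

```python
LABELS = ["Average", "Fast", "Slow"]

# Rows are predicted classes, columns are actual classes,
# copied from the C5.0 confusion matrix reported above.
MATRIX = [
    [16, 0, 0],
    [0, 21, 2],
    [0, 0, 11],
]


def accuracy(m):
    """Share of cases on the diagonal."""
    total = sum(sum(row) for row in m)
    return sum(m[i][i] for i in range(len(m))) / total


def cohen_kappa(m):
    """Agreement corrected for the agreement expected by chance."""
    n = sum(sum(row) for row in m)
    row_totals = [sum(row) for row in m]
    col_totals = [sum(m[i][j] for i in range(len(m))) for j in range(len(m))]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (accuracy(m) - expected) / (1 - expected)


def sensitivity(m, k):
    """Correctly predicted cases of class k divided by all actual cases of k."""
    actual_k = sum(m[i][k] for i in range(len(m)))
    return m[k][k] / actual_k


print(f"accuracy: {accuracy(MATRIX):.4f}")
print(f"kappa: {cohen_kappa(MATRIX):.4f}")
for k, label in enumerate(LABELS):
    print(f"sensitivity ({label}): {sensitivity(MATRIX, k):.4f}")
```

The two Slow learners predicted as Fast are the only errors, which is why the Slow class has the lowest sensitivity.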
Visualization Tree Using C5.0 Algorithm
Figure 3: C5.0 Classifier Result with Decision Tree Algorithm

B. Naïve Bayes Classifier Algorithm

Naïve Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
Average    Fast       Slow
0.3243243  0.4594595  0.2162162

Conditional probabilities:

sr.no
Y        [,1]      [,2]
Average  28.50000  13.83342
Fast     27.76471  15.82533
Slow     23.00000  15.68439

Gender
Y        F          M
Average  0.5000000  0.5000000
Fast     0.4705882  0.5294118
Slow     0.7500000  0.2500000

Area
Y        Rural       Urban
Average  0.33333333  0.66666667
Fast     0.94117647  0.05882353
Slow     0.25000000  0.75000000

SSLC_medium
Y        English     Tamil
Average  0.66666667  0.33333333
Fast     0.05882353  0.94117647
Slow     0.25000000  0.75000000

SSLC_Per
Y        Average     Excellent   Good        Poor
Average  0.00000000  0.33333333  0.16666667  0.50000000
Fast     0.00000000  0.29411765  0.05882353  0.64705882
Slow     0.62500000  0.25000000  0.00000000  0.12500000

HSC_per
Y        Excellent   Good        Poor
Average  0.00000000  0.50000000  0.50000000
Fast     0.17647059  0.05882353  0.76470588
Slow     0.62500000  0.12500000  0.25000000

Maths.at.HSC
Y        No          Yes
Average  0.50000000  0.50000000
Fast     0.94117647  0.05882353
Slow     0.00000000  1.00000000

Graduation.Mark
Y        Excellent  Good       Poor
Average  0.3333333  0.1666667  0.5000000
Fast     0.0000000  0.7647059  0.2352941
Slow     0.2500000  0.1250000  0.6250000

Entrance.Rank
Y        Average    Good       Poor
Average  0.0000000  1.0000000  0.0000000
Fast     0.4705882  0.2941176  0.2352941
Slow     0.1250000  0.8750000  0.0000000

Parents.Income
Y        High        Low         Medium
Average  0.50000000  0.16666667  0.33333333
Fast     0.17647059  0.76470588  0.05882353
Slow     0.25000000  0.12500000  0.62500000

Attendance
Y        Average    Good       Poor
Average  0.5000000  0.1666667  0.3333333
Fast     0.7058824  0.2941176  0.0000000
Slow     0.6250000  0.1250000  0.2500000

Comm.Skill
Y        Excellent  Good       Poor
Average  0.8333333  0.1666667  0.0000000
Fast     0.4705882  0.2941176  0.2352941
Slow     0.0000000  0.3750000  0.6250000

Confusion Matrix and Statistics

Prediction  Average  Fast  Slow
Average          12     0     2
Fast              0    16     1
Slow              0     1     5

Overall Statistics
Accuracy: 0.8919
95% CI: (0.7458, 0.9697)
No Information Rate: 0.4595
P-Value [Acc > NIR]: 4.464e-08
Kappa: 0.8287
Mcnemar's Test P-Value: NA

Statistics by Class:
                      Class: Average  Class: Fast  Class: Slow
Sensitivity           1.0000          0.9412       0.6250
Specificity           0.9200          0.9500       0.9655
Pos Pred Value        0.8571          0.9412       0.8333
Neg Pred Value        1.0000          0.9500       0.9032
Prevalence            0.3243          0.4595       0.2162
Detection Rate        0.3243          0.4324       0.1351
Detection Prevalence  0.3784          0.4595       0.1622
Balanced Accuracy     0.9600          0.9456       0.7953

Naïve Bayes Algorithm Result Using RStudio
Figure 4: Naïve Bayes Classifier Result
The above output shows that the Naïve Bayes algorithm, run in RStudio, achieves an accuracy of 89% in identifying slow learners from the given dataset.
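To show how the reported probability tables lead to a class label, the Python sketch below scores one hypothetical student using the a-priori probabilities and two of the conditional tables from the output above (Maths.at.HSC and Comm.Skill). Restricting the evidence to two attributes is a simplification made for this illustration, and the function name is an assumption; the fitted model uses all predictors.

```python
# A-priori class probabilities, copied from the output above.
PRIORS = {"Average": 0.3243243, "Fast": 0.4594595, "Slow": 0.2162162}

# P(attribute value | class), copied from two of the conditional tables.
CONDITIONALS = {
    "Maths.at.HSC": {
        "Average": {"No": 0.50000000, "Yes": 0.50000000},
        "Fast": {"No": 0.94117647, "Yes": 0.05882353},
        "Slow": {"No": 0.00000000, "Yes": 1.00000000},
    },
    "Comm.Skill": {
        "Average": {"Excellent": 0.8333333, "Good": 0.1666667, "Poor": 0.0000000},
        "Fast": {"Excellent": 0.4705882, "Good": 0.2941176, "Poor": 0.2352941},
        "Slow": {"Excellent": 0.0000000, "Good": 0.3750000, "Poor": 0.6250000},
    },
}


def posterior(evidence):
    """Normalised product of the prior and the per-attribute likelihoods."""
    scores = {}
    for cls, prior in PRIORS.items():
        score = prior
        for attribute, value in evidence.items():
            score *= CONDITIONALS[attribute][cls][value]
        scores[cls] = score
    total = sum(scores.values())
    if total == 0:
        raise ValueError("every class has zero probability for this evidence")
    return {cls: s / total for cls, s in scores.items()}


# Hypothetical student: took Maths at HSC, poor communication skill.
result = posterior({"Maths.at.HSC": "Yes", "Comm.Skill": "Poor"})
predicted = max(result, key=result.get)
print(result)
print("predicted class:", predicted)
```

Several table entries are exactly zero, so a single attribute value can rule a class out completely; Laplace smoothing (the laplace argument visible in the call) is the usual way to soften this.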
C. Random Forest Algorithm

Call:
randomForest(formula = LB ~ ., data = TrainSet, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 5.71%

Confusion matrix:
         Average  Fast  Slow  class.error
Average       12     0     0  0.00000000
Fast           1    14     0  0.06666667
Slow           1     0     7  0.12500000

predTrain  Average  Fast  Slow
Average         12     0     0
Fast             0    15     0
Slow             0     0     8

PredTrain
     16       13       27        3       22       46       36       50       23
Average     Slow     Fast  Average  Average     Fast     Fast     Fast     Slow
      7       25       35       11       15       28       24       41       12
   Slow     Fast  Average     Fast     Fast  Average     Fast  Average  Average
     33       45       17       21       43       29       38        5       19
   Fast     Slow     Slow     Fast     Fast     Slow     Slow     Fast  Average
     48       49        6       10       18       34       30       32
Average     Slow  Average  Average     Fast     Fast     Fast  Average
Levels: Average Fast Slow

Confusion Matrix and Statistics

Prediction  Average  Fast  Slow
Average          12     0     0
Fast              0    15     0
Slow              0     0     8

Overall Statistics
Accuracy: 1
95% CI: (0.9, 1)
No Information Rate: 0.4286
P-Value [Acc > NIR]: 1.321e-13
Kappa: 1
Mcnemar's Test P-Value: NA

Statistics by Class:
                      Class: Average  Class: Fast  Class: Slow
Sensitivity           1.0000          1.0000       1.0000
Specificity           1.0000          1.0000       1.0000
Pos Pred Value        1.0000          1.0000       1.0000
Neg Pred Value        1.0000          1.0000       1.0000
Prevalence            0.3429          0.4286       0.2286
Detection Rate        0.3429          0.4286       0.2286
Detection Prevalence  0.3429          0.4286       0.2286
Balanced Accuracy     1.0000          1.0000       1.0000

Random Forest Algorithm Using RStudio
Figure 5: Random Forest Classifier Result
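The above output shows that the Random Forest algorithm, run in RStudio, achieves an accuracy of 100% in identifying slow learners from the given dataset. This figure is measured on the training set (predTrain); the out-of-bag estimate of the error rate reported in the same output is 5.71%.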
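The random forest output contains two different confusion matrices, and the short Python sketch below recomputes the error figures of the first one. It assumes the layout used by the randomForest print-out, where rows are actual classes and columns are out-of-bag predictions; the function names are illustrative.

```python
LABELS = ["Average", "Fast", "Slow"]

# Out-of-bag confusion matrix from the randomForest output above:
# rows are actual classes, columns are predicted classes.
OOB_MATRIX = [
    [12, 0, 0],
    [1, 14, 0],
    [1, 0, 7],
]


def class_errors(m):
    """Per-class error: share of each actual class that was misclassified."""
    return [1 - m[i][i] / sum(m[i]) for i in range(len(m))]


def overall_error(m):
    """Share of all cases that fall off the diagonal."""
    total = sum(sum(row) for row in m)
    correct = sum(m[i][i] for i in range(len(m)))
    return 1 - correct / total


errors = class_errors(OOB_MATRIX)
oob = overall_error(OOB_MATRIX)
for label, err in zip(LABELS, errors):
    print(f"class.error ({label}): {err:.8f}")
print(f"OOB error rate: {oob * 100:.2f}%")
```

Two of the 35 training cases are misclassified out of bag, whereas the second matrix (predTrain) scores the forest on cases it was trained on, which is why that matrix shows no errors.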
D. Result Comparison Table for C5.0, Naïve Bayes and Random Forest Algorithms
Table 2: Comparison of three classifier algorithms for academic students' performance

Classifier      Accuracy (%)  Kappa   Sensitivity (Slow class)
C5.0            96            0.9382  0.8462
Naïve Bayes     89            0.8287  0.6250
Random Forest   100           1       1.0000

6. CONCLUSION
C5.0, Naïve Bayes and Random Forest are implemented using R programming to identify slow learners, average learners and fast learners. This application is useful in the education system for categorizing students according to their learning behavior. The proposed application is user friendly and applicable to any higher education sector. The data mining techniques C5.0, Naïve Bayes and Random Forest, implemented in the R programming language, identify slow learners accurately. Finally, we conclude from the above results that the Random Forest algorithm in R achieved the highest accuracy, 100%, compared with the C5.0 and Naïve Bayes algorithms. In future work, we plan to collect more datasets with additional attributes and to use different algorithms to obtain better results.
REFERENCES
1. K. Prasada Rao et al., "Predicting Learning Behavior of Students using Classification Techniques", International Journal of Computer Applications (0975–8887), Volume 139, No. 7, April 2016, pp. 15-19.
2. R. Kaviyarasi and T. Balasubramanian, "Exploring the High Potential Factors that Affects Students' Academic Performance", I.J. Education and Management Engineering, November 2018, 6, in MECS (http://www.mecs-press.net), DOI: 10.5815, pp. 15-23.
3. Sumitha and E. S. Vinothkumar, "Prediction of Students Outcome Using Data Mining Techniques", International Journal of Scientific Engineering and Applied Science (IJSEAS), Volume 2, Issue 6, June 2016, ISSN: 2395-3470, pp. 132-139.
4. Vrushali Mhetre and Prof. Mayura Nagar, "Classification based data mining algorithms to predict slow, average and fast learners in educational system using Weka", Proceedings of the IEEE 2017 International Conference on Computing Methodologies and Communication (ICCMC), 978-1-5090-4890-8/17/$31.00 ©2017 IEEE, pp. 475-479.
5. Swati and Rajinder Kaur, "Multifactor Naïve Bayes Classification for The Slow Learner Prediction Over Multiclass Student Dataset", International Journal on Computational Science & Applications (IJCSA), Vol. 6, No. 4, August 2016, pp. 11-23.
6. Shiwani Rana, Roopali Garg, "Slow Learner Prediction using Multi-Variate Naïve Bayes Classification Algorithm", Department of Information Technology, UIET, Panjab University, Chandigarh, India, 02 December 2016, pp. 11-23.
7. Harwati, Ardita Permata Alfiani, Febriana Ayu Wulandari, "Mapping Student's Performance Based on Data Mining Approach", ScienceDirect, Agriculture and Agricultural Science Procedia 3 (2015), pp. 173–177.
8. Manolis Chalaris, Stefanos Gritzalis, Manolis Maragoudakis, Cleo Sgouropoulou and Anastasios Tsolakidis, "Improving Quality of Educational Processes Providing New Knowledge using Data Mining Techniques", ScienceDirect, Procedia - Social and Behavioral Sciences 147 (2014), pp. 390–397.
9. Tan and Vipin Kumar, "Introduction to Data Mining", Pearson, 2013.
10. http://datamining.businessintelligence.uoc.edu/home/j48-decision-tree.
11. Ch. Ravi Kishore, K. Prasada Rao et al., "Performance Evaluation of Entropy and Gini using Threaded and Non Threaded ID3 on Anaemia Dataset", 2015 IEEE, DOI 10.1109/CSNT.2015.112, pp. 1080–1084.
12. Ajay Kumar Pal, Saurabh Pal, "Analysis and Mining of Educational Data for Predicting the Performance of Students", International Journal of Electronics Communication and Computer Engineering, 2013, Volume 4, Issue 5.
13. Anne-Sophie Hoffait, Michael Schyns, "Early detection of university students with potential difficulties", Decision Support Systems 101 (2017), pp. 1-11.
14. Abimbola R. et al., "Predicting Student Academic Performance in Computer Science Courses: A Comparison of Neural Network Models", International Journal of Modern Education and Computer Science (IJMECS), Vol. 10, No. 6, 2018, pp. 1-9.
15. Reena Thakur, A. R. Mahajan, "Preprocessing and Classification of Data Analysis in Institutional System using Weka", International Journal of Computer Applications (0975–8887), Volume 112, No. 6, February 2015, pp. 9-11.
16. https://en.wikipedia.org/wiki/R_(programming_language)