Prediction of Students Learning Performance Using Machine Learning Algorithms


¹Mrs. S. Menaka, ²Dr. G. Kesavaraj

¹Ph.D. Research Scholar, ²Assistant Professor,

¹,²Department of Computer Science,

Vivekanandha College of Arts and Sciences for Women (Autonomous), Tiruchengode, Tamil Nadu, INDIA-637205

Abstract

Predicting students' performance is difficult because of the large volume of data involved. Nowadays students face many problems in their academic studies. The main objective of any educational institute is to provide quality education and to improve the overall performance of the organization by looking at individual performances. Existing databases are not sufficient, and larger new datasets are needed for the analysis. Existing papers use data mining techniques such as Naïve Bayes, J48 and Neural Networks with the WEKA tool. In this article we use classification methods to analyze student performance. Here we introduce the data mining techniques C5.0, Naïve Bayes and Random Forest. The R programming language, an open source tool, is used to predict the academic performance of slow learners with good accuracy.

Keywords: C5.0, Data Mining Algorithm, Educational Data Mining, Naïve Bayes, R Programming.

1. INTRODUCTION

Educational Data Mining (EDM) is a research field that applies educational learning principles to predict students' educational performance. Many EDM applications apply data mining techniques such as machine learning and statistics to information generated in educational settings^[9]. EDM has contributed to the theories of learning investigated by researchers in educational psychology and the learning environment. Educational data mining refers to the techniques and tools for automatically extracting meaning from the large repositories of data generated by researchers^[7]. These EDM techniques are used to study learners and the effects of various learning environments. The goals of Educational Data Mining techniques include:

• Predicting students' future learning behavior: student modeling is used to create student models and to identify learner characteristics.

• Discovering or improving domain models: analyzing existing models and creating new models and algorithms.

• Studying the effects of student support: using algorithmic tools to examine students' learning systems.

2. LITERATURE REVIEW

Among classification techniques, Neural Networks and Decision Trees are the two methods most widely used by researchers for predicting students' performance^[3]. The meta-analysis on predicting students' performance has motivated us to carry out further research applicable to our environment. It will help the educational system to monitor students' performance in a systematic way.

A study by Yadav et al. predicts students' performance at the end of the semester by applying three decision tree algorithms: ID3, CART and C4.5. In their study they achieved 52.08%, 56.25% and 45.83% accuracy^[8].

Another study predicts performance levels at the end of the degree, in the V Semester, using random forests, decision trees, support vector machines, naive Bayes, bagged trees and boosted trees^[2]. A dataset of 2459 students from a European engineering school of a public research university is used to validate the proposed methodology. The empirical results demonstrate the ability of the proposed model to predict students' performance level with accuracy above 95% at an early stage of the students' academic path.

A novel classification model based on relational association rule discovery has been proposed for predicting the successful completion of an academic course, based on the grades received by students during the academic semester^[1]. Experiments conducted on three real datasets collected from Babeș-Bolyai University in Romania have shown good performance of the classifier^[5]. The experimental results highlighted that the classifier is better than, or comparable to, the supervised classifiers already applied in the EDM (Educational Data Mining) literature for students' performance prediction.

3. METHODOLOGY

A. Machine Learning Algorithms

Classification is a supervised learning method in which data are divided into different categories or classes. The objective of classification is to predict the target class for each case in a given dataset. Various classification techniques are in use, such as C5.0, the Naïve Bayes classifier and the Random Forest approach; these are important techniques for classification. The accuracy of the prediction depends on the classification technique selected. In real-life situations classification is essentially probabilistic: it is uncertain to which class a data item belongs.

B. Tool Used

R is a programming language and software environment for statistical analysis, graphical representation and reporting; RStudio is an integrated development environment for R. R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team. R is freely available as open source software under the GNU General Public License, and pre-compiled binary versions are provided for operating systems such as Windows, Linux and Mac. The language was named R based on the first letter of the first names of its two authors (Robert Gentleman and Ross Ihaka), and partly as a play on the name of the Bell Labs language S.

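C. C5.0 Classifier

The C5.0 algorithm is a successor of the C4.5 algorithm, also developed by Quinlan (1994); it produces a binary tree or a multi-branch tree. It uses information gain (entropy) as its splitting criterion, and its pruning technique adopts the binomial confidence limit method.

C5.0 uses the concept of entropy for measuring purity. The entropy of a sample of data indicates how mixed the class values are; the minimum value of 0 indicates that the sample is completely homogeneous, while 1 indicates the maximum amount of disorder (for two classes). The entropy of a sample S with c classes is defined as Entropy(S) = -Σ p_i log2(p_i), summed over i = 1, ..., c, where p_i is the proportion of cases in S that belong to class i.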
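The entropy and information gain described above can be sketched in a few lines. This is an illustrative Python sketch (the paper itself works in R), not Quinlan's C5.0 implementation; the function names and the toy sample are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(labels, groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted

# Hypothetical toy sample: 4 Slow and 4 Fast learners.
parent = ["Slow"] * 4 + ["Fast"] * 4
# A made-up attribute that separates the two classes perfectly.
split = [["Slow"] * 4, ["Fast"] * 4]

print(entropy(parent))                  # 1.0 for an even two-class mix
print(information_gain(parent, split))  # 1.0, since both children are pure
```

A split with a higher information gain leaves purer child nodes, which is why it is preferred when the tree is grown.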

D. Naïve Bayes Classifier

Bayesian classification is based on Bayes' theorem. The posterior probability of the class to which a data item belongs is approximated using prior probabilities drawn from the training dataset. The classification model calculates the likelihood of the data item belonging to each class. The class with the highest posterior probability becomes the class label for the data item.

Bayes' Theorem Definition: Given two random variables X and Y, each taking a specific value corresponds to a random event. The conditional probability P(Y/X) represents the probability of the event for Y happening when the event for X has already occurred.
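Under the usual naïve assumption that the attributes are conditionally independent given the class (an assumption of the method that the text above does not state explicitly), the classification rule can be written as follows. Here x_1, ..., x_n denote the attribute values of one student and y ranges over the class labels; these symbols are introduced only for this illustration.

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i / y)$$

The denominator P(x_1, ..., x_n) of Bayes' theorem is the same for every class, so it can be dropped when only the most probable class is needed.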


E. Random Forest

Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks that operates by constructing a large number of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
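The aggregation step described above (mode for classification, mean for regression) can be sketched as follows. This Python sketch covers only how the individual trees' outputs are combined, not how the trees are grown; the function names and the sample votes are hypothetical.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_votes):
    """Return the mode of the trees' class votes (first seen wins a tie)."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_outputs):
    """Return the mean of the trees' numeric predictions."""
    return mean(tree_outputs)

# Hypothetical votes from five trees for one student.
votes = ["Fast", "Slow", "Fast", "Average", "Fast"]
print(forest_classify(votes))            # Fast
print(forest_regress([2.0, 3.0, 4.0]))   # 3.0
```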

The first algorithm for random decision forests was developed by Tin Kam Ho using the random subspace method, which in Ho's formulation is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg.

An extension of the algorithm was developed by Leo Breiman and Adele Cutler, who registered "Random Forests" as a trademark (as of 2019 owned by Minitab Inc.). The extension combines Breiman's "bagging" idea and the random selection of features, introduced first by Ho and later independently by Amit and Geman, in order to construct a collection of decision trees with controlled variance.

F. Proposed Model

Figure 1: Flowchart of Proposed Work

[Flowchart boxes: Student Performance Data; Data Set; Pre-Processing; R Tool; Data Mining Classification Algorithms (C5.0, Naïve Bayes, Random Forest); Pattern Evaluation; Final Result (Accuracy) of Students Slow Learners Academic Performance]

G. Training Dataset

A dataset of 500 students from various colleges was collected as the training dataset of BCA III-year students. Student attributes include Gender, Area, SSC_Medium, SSC_Percentage, HSC_faculty, Math_At_HSC, Graduation_Marks, Entrance_Rank, ParentsIncome, Attendance, Communication_Skill and Learning_Behavior (class label). In this dataset 25 attributes were collected, of which only 12 were used in the research to validate the proposed methodology. Compared with existing datasets, more attributes were collected from students in various departments. The learning behavior of students is predicted from the given training dataset using the C5.0, Naïve Bayes and Random Forest algorithms.

Bayes' theorem (Section D):

$$P(Y/X) = \frac{P(X/Y)\,P(Y)}{P(X)}, \qquad P(X/Y) = \frac{P(Y/X)\,P(X)}{P(Y)}$$

H. Data Pre-processing

Data was pre-processed by performing the following operations:

• Converting all fields into different categories.

• Combining features to reduce empty values.

• Replacing null and missing values using some algorithms.
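Two of the operations above, converting fields into categories and replacing missing values, can be sketched as follows. This is an illustrative Python sketch, not the authors' R procedure: the percentage cut-offs, the use of the most frequent value for imputation and all names are assumptions, since the paper does not specify them.

```python
from collections import Counter

def to_category(percent):
    """Map a numeric percentage to a category (cut-offs are assumed)."""
    if percent >= 80:
        return "Excellent"
    if percent >= 60:
        return "Good"
    if percent >= 40:
        return "Average"
    return "Poor"

def fill_missing_with_mode(values):
    """Replace None entries with the most frequent non-missing value."""
    known = [v for v in values if v is not None]
    mode = Counter(known).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

# Hypothetical raw percentages with one missing entry.
raw_percentages = [85, 62, None, 35, 71]
categories = [None if p is None else to_category(p) for p in raw_percentages]
print(fill_missing_with_mode(categories))
# ['Excellent', 'Good', 'Good', 'Poor', 'Good']
```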

Table 1: Training Dataset for Academic Students Performance

| sr.no | Gender | Area | SSLC_medium | SSLC_Per | HSC_per | MathsatHSC | GraduationMark | EntranceRank | ParentsIncome | Attendance | CommSkill | LearningBehavior |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M | Rural | English | Excellent | Poor | Yes | Excellent | Good | High | Poor | Good | Slow |
| 2 | M | Urban | English | Good | Good | Yes | Poor | Poor | Medium | Average | Poor | Fast |
| 3 | M | Urban | English | Good | Poor | No | Good | Good | Low | Good | Good | Average |
| 4 | F | Urban | Tamil | Poor | Good | Yes | Good | Average | Low | Good | Good | Slow |
| 5 | M | Rural | Tamil | Poor | Excellent | No | Poor | Poor | High | Average | Poor | Fast |
| 6 | M | Rural | Tamil | Excellent | Poor | No | Excellent | Good | Medium | Poor | Excellent | Average |
| 7 | F | Urban | Tamil | Average | Excellent | Yes | Poor | Good | Medium | Average | Poor | Slow |
| 8 | F | Rural | Tamil | Poor | Poor | No | Good | Average | Low | Average | Excellent | Fast |
| 9 | M | Rural | Tamil | Excellent | Poor | … | … | … | … | Good | Good | Fast |
| 10 | F | Urban | English | Poor | Good | Yes | Poor | Good | High | Average | Excellent | Average |
| 11 | M | Rural | Tamil | Poor | Excellent | No | Poor | Poor | High | Average | Poor | Fast |
| 12 | M | Rural | Tamil | Excellent | Poor | No | Excellent | Good | Medium | Poor | Excellent | Average |
| 13 | F | Urban | Tamil | Average | Excellent | Yes | Poor | Good | Medium | Average | Poor | Slow |
| 14 | F | Rural | Tamil | Poor | Poor | No | Good | Average | Low | Average | Excellent | Fast |
| 15 | M | Rural | Tamil | Excellent | Poor | … | … | … | … | Good | Good | Fast |

5. RESULT AND DISCUSSION

A. C5.0 Classifier Algorithm

Classification Tree
Number of samples: 40
Number of predictors: 12
Tree size: 5

Class specified by attribute `outcome'

Read 40 cases (13 attributes) from undefined.data

Decision tree:

Maths.at.HSC = Yes:
:...Comm.Skill = Excellent: Average (6)
:   Comm.Skill in {Good,Poor}: Slow (9)
Maths.at.HSC = No:
:...Graduation.Mark = Excellent: Average (5)
    Graduation.Mark in {Good,Poor}:
    :...Area = Rural: Fast (18)
        Area = Urban: Average (2)

Evaluation on training data (40 cases):

    Decision Tree
   (a)   (b)   (c)    <-classified as
    13                (a): class Average
          18          (b): class Fast
                 9    (c): class Slow

Confusion Matrix and Statistics

Prediction  Average  Fast  Slow
  Average        16     0     0
  Fast            0    21     2
  Slow            0     0    11

Overall Statistics
  Accuracy: 0.96
  95% CI: (0.8629, 0.9951)
  No Information Rate: 0.42
  P-Value [Acc > NIR]: 3.498e-16
  Kappa: 0.9382
  Mcnemar's Test P-Value: NA

Statistics by Class:
                      Class: Average  Class: Fast  Class: Slow
Sensitivity                     1.00       1.0000       0.8462
Specificity                     1.00       0.9310       1.0000
Pos Pred Value                  1.00       0.9130       1.0000
Neg Pred Value                  1.00       1.0000       0.9487
Prevalence                      0.32       0.4200       0.2600
Detection Rate                  0.32       0.4200       0.2200
Detection Prevalence            0.32       0.4600       0.2200
Balanced Accuracy               1.00       0.9655       0.9231
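The decision tree printed above can be read as a set of nested rules. The Python sketch below transcribes those rules by hand for illustration; the function name and argument names are hypothetical, and the sample inputs use the attribute values of rows 1, 3 and 5 of Table 1.

```python
def classify(maths_at_hsc, comm_skill, graduation_mark, area):
    """Hand transcription of the C5.0 tree reported in the R output."""
    if maths_at_hsc == "Yes":
        # Comm.Skill = Excellent -> Average; Good or Poor -> Slow
        return "Average" if comm_skill == "Excellent" else "Slow"
    # Maths.at.HSC = No
    if graduation_mark == "Excellent":
        return "Average"
    # Graduation.Mark in {Good, Poor}: decided by Area
    return "Fast" if area == "Rural" else "Average"

print(classify("Yes", "Good", "Excellent", "Rural"))  # Slow    (row 1)
print(classify("No", "Good", "Good", "Urban"))        # Average (row 3)
print(classify("No", "Poor", "Poor", "Rural"))        # Fast    (row 5)
```

Only four of the twelve predictors appear in the tree, so the remaining attributes do not influence its predictions.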

C5.0 Algorithm Result Using RStudio

Figure 2: C5.0 Classifier Result

The above figure shows that the C5.0 decision tree algorithm in RStudio gives an accuracy of 96% in identifying slow learners from the given dataset.
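The accuracy, kappa and sensitivity figures reported for C5.0 can be recomputed from its confusion matrix. The Python sketch below assumes that rows are predicted classes and columns are actual classes, which is consistent with the reported per-class statistics; the function names are illustrative.

```python
def accuracy(m):
    """Share of cases on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in m)
    return sum(m[i][i] for i in range(len(m))) / total

def kappa(m):
    """Cohen's kappa: agreement beyond what the marginals give by chance."""
    total = sum(sum(row) for row in m)
    row_sums = [sum(row) for row in m]
    col_sums = [sum(row[j] for row in m) for j in range(len(m))]
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    observed = accuracy(m)
    return (observed - expected) / (1 - expected)

def sensitivity(m, k):
    """True positives of class k divided by all actual cases of class k."""
    return m[k][k] / sum(row[k] for row in m)

# C5.0 confusion matrix from the R output (order: Average, Fast, Slow).
c50 = [[16, 0, 0],
       [0, 21, 2],
       [0, 0, 11]]

print(round(accuracy(c50), 4))        # 0.96
print(round(kappa(c50), 4))           # 0.9382
print(round(sensitivity(c50, 2), 4))  # 0.8462 (Slow class)
```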

Visualization Tree Using C5.0 Algorithm

Figure 3: C5.0 Classifier Result with Decision Tree Algorithm

B. Naïve Bayes Classifier Algorithm

Naïve Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
  Average      Fast      Slow
0.3243243 0.4594595 0.2162162

Conditional probabilities:

sr.no
Y             [,1]     [,2]
  Average 28.50000 13.83342
  Fast    27.76471 15.82533
  Slow    23.00000 15.68439

Gender
Y                 F         M
  Average 0.5000000 0.5000000
  Fast    0.4705882 0.5294118
  Slow    0.7500000 0.2500000

Area
Y              Rural      Urban
  Average 0.33333333 0.66666667
  Fast    0.94117647 0.05882353
  Slow    0.25000000 0.75000000

SSLC_medium
Y            English      Tamil
  Average 0.66666667 0.33333333
  Fast    0.05882353 0.94117647
  Slow    0.25000000 0.75000000

SSLC_Per
Y            Average  Excellent       Good       Poor
  Average 0.00000000 0.33333333 0.16666667 0.50000000
  Fast    0.00000000 0.29411765 0.05882353 0.64705882
  Slow    0.62500000 0.25000000 0.00000000 0.12500000

HSC_per
Y          Excellent       Good       Poor
  Average 0.00000000 0.50000000 0.50000000
  Fast    0.17647059 0.05882353 0.76470588
  Slow    0.62500000 0.12500000 0.25000000

Maths.at.HSC
Y                 No        Yes
  Average 0.50000000 0.50000000
  Fast    0.94117647 0.05882353
  Slow    0.00000000 1.00000000

Graduation.Mark
Y         Excellent      Good      Poor
  Average 0.3333333 0.1666667 0.5000000
  Fast    0.0000000 0.7647059 0.2352941
  Slow    0.2500000 0.1250000 0.6250000

Entrance.Rank
Y           Average      Good      Poor
  Average 0.0000000 1.0000000 0.0000000
  Fast    0.4705882 0.2941176 0.2352941
  Slow    0.1250000 0.8750000 0.0000000

Parents.Income
Y               High        Low     Medium
  Average 0.50000000 0.16666667 0.33333333
  Fast    0.17647059 0.76470588 0.05882353
  Slow    0.25000000 0.12500000 0.62500000

Attendance
Y           Average      Good      Poor
  Average 0.5000000 0.1666667 0.3333333
  Fast    0.7058824 0.2941176 0.0000000
  Slow    0.6250000 0.1250000 0.2500000

Comm.Skill
Y         Excellent      Good      Poor
  Average 0.8333333 0.1666667 0.0000000
  Fast    0.4705882 0.2941176 0.2352941
  Slow    0.0000000 0.3750000 0.6250000

Confusion Matrix and Statistics

Prediction  Average  Fast  Slow
  Average        12     0     2
  Fast            0    16     1
  Slow            0     1     5

Overall Statistics
  Accuracy: 0.8919
  95% CI: (0.7458, 0.9697)
  No Information Rate: 0.4595
  P-Value [Acc > NIR]: 4.464e-08
  Kappa: 0.8287

Statistics by Class:
                      Class: Average  Class: Fast  Class: Slow
Sensitivity                   1.0000       0.9412       0.6250
Specificity                   0.9200       0.9500       0.9655
Pos Pred Value                0.8571       0.9412       0.8333
Neg Pred Value                1.0000       0.9500       0.9032
Prevalence                    0.3243       0.4595       0.2162
Detection Rate                0.3243       0.4324       0.1351
Detection Prevalence          0.3784       0.4595       0.1622
Balanced Accuracy             0.9600       0.9456       0.7953

Naïve Bayes Algorithm Result Using RStudio

Figure 4: Naïve Bayes Classifier Result

The above figure shows that the Naïve Bayes algorithm in RStudio gives an accuracy of 89% in identifying slow learners from the given dataset.
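To show how the reported prior and conditional probabilities combine into a prediction, the Python sketch below scores one hypothetical student. For brevity it uses only three of the twelve predictors (Maths.at.HSC, Comm.Skill and Area), so its output illustrates the calculation and is not the fitted model's prediction; the student record and function name are assumptions.

```python
# A-priori and conditional probabilities copied from the R output above.
priors = {"Average": 0.3243243, "Fast": 0.4594595, "Slow": 0.2162162}
cond = {
    "Maths.at.HSC": {
        "Average": {"No": 0.5, "Yes": 0.5},
        "Fast": {"No": 0.94117647, "Yes": 0.05882353},
        "Slow": {"No": 0.0, "Yes": 1.0},
    },
    "Comm.Skill": {
        "Average": {"Excellent": 0.8333333, "Good": 0.1666667, "Poor": 0.0},
        "Fast": {"Excellent": 0.4705882, "Good": 0.2941176, "Poor": 0.2352941},
        "Slow": {"Excellent": 0.0, "Good": 0.375, "Poor": 0.625},
    },
    "Area": {
        "Average": {"Rural": 0.33333333, "Urban": 0.66666667},
        "Fast": {"Rural": 0.94117647, "Urban": 0.05882353},
        "Slow": {"Rural": 0.25, "Urban": 0.75},
    },
}

def posterior(student):
    """Normalised P(class) * product of P(attribute value / class)."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for attribute, value in student.items():
            score *= cond[attribute][label][value]
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

# Hypothetical student, not a row taken from the dataset.
student = {"Maths.at.HSC": "Yes", "Comm.Skill": "Poor", "Area": "Urban"}
post = posterior(student)
predicted = max(post, key=post.get)
print(predicted)  # Slow
```

The zero entries in the tables (for example Comm.Skill = Poor for the Average class) force the whole product to zero for that class, which is the situation Laplace smoothing is meant to soften.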

C. Random Forest Algorithm

Call:
 randomForest(formula = LB ~ ., data = TrainSet, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of error rate: 5.71%
Confusion matrix:
        Average Fast Slow class.error
Average      12    0    0  0.00000000
Fast          1   14    0  0.06666667
Slow          1    0    7  0.12500000

predTrain Average Fast Slow
  Average      12    0    0
  Fast          0   15    0
  Slow          0    0    8

predTrain
     16      13      27       3      22      46      36      50      23
Average    Slow    Fast Average Average    Fast    Fast    Fast    Slow
      7      25      35      11      15      28      24      41      12
   Slow    Fast Average    Fast    Fast Average    Fast Average Average
     33      45      17      21      43      29      38       5      19
   Fast    Slow    Slow    Fast    Fast    Slow    Slow    Fast Average
     48      49       6      10      18      34      30      32
Average    Slow Average Average    Fast    Fast    Fast Average
Levels: Average Fast Slow

Confusion Matrix and Statistics

Prediction  Average  Fast  Slow
  Average        12     0     0
  Fast            0    15     0
  Slow            0     0     8

Overall Statistics
  Accuracy: 1
  95% CI: (0.9, 1)
  No Information Rate: 0.4286
  P-Value [Acc > NIR]: 1.321e-13
  Kappa: 1

Statistics by Class:
                      Class: Average  Class: Fast  Class: Slow
Sensitivity                   1.0000       1.0000       1.0000
Specificity                   1.0000       1.0000       1.0000
Pos Pred Value                1.0000       1.0000       1.0000
Neg Pred Value                1.0000       1.0000       1.0000
Prevalence                    0.3429       0.4286       0.2286
Detection Rate                0.3429       0.4286       0.2286
Detection Prevalence          0.3429       0.4286       0.2286
Balanced Accuracy             1.0000       1.0000       1.0000

Random Forest Algorithm Result Using RStudio

Figure 5: Random Forest Classifier Result

The above figure shows that the Random Forest algorithm in RStudio gives an accuracy of 100% on the training set in identifying slow learners from the given dataset.
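The randomForest output reports two different measures: the out-of-bag (OOB) error estimate and the accuracy of predictions on the training set. The Python sketch below recomputes the OOB figures from the reported OOB confusion matrix, assuming (as the class.error column implies) that rows are actual classes and columns are predicted classes; the function names are illustrative.

```python
# OOB confusion matrix from the R output (order: Average, Fast, Slow).
oob = [[12, 0, 0],
       [1, 14, 0],
       [1, 0, 7]]

def class_errors(matrix):
    """Per-class error: share of each actual class that was misclassified."""
    return [1 - row[i] / sum(row) for i, row in enumerate(matrix)]

def overall_error(matrix):
    """Overall error: share of all cases that lie off the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 1 - correct / total

print(round(overall_error(oob) * 100, 2))         # 5.71
print([round(e, 8) for e in class_errors(oob)])   # [0.0, 0.06666667, 0.125]
```

Read this way, the OOB estimate corresponds to about 94.29% accuracy on cases that each tree did not see during training, while the 100% figure is measured on the same 35 cases used to fit the forest.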

D. Result Comparison Table for C5.0, Naïve Bayes and Random Forest Algorithms

Table 2: Comparison of three classifier algorithms for academic students' performance

| Classifier | Accuracy (%) | Kappa | Sensitivity (Slow class) |
|---|---|---|---|
| C5.0 | 96 | 0.9382 | 0.8462 |
| Naïve Bayes | 89 | 0.8287 | 0.6250 |
| Random Forest | 100 | 1 | 1.0000 |

6. CONCLUSION

C5.0, Naïve Bayes and Random Forest were implemented in the R programming language to identify Slow Learners, Average Learners and Fast Learners. This application is useful in the education system for categorizing students according to their learning behavior. The proposed application is user friendly and applicable to any higher education sector. The data mining techniques C5.0, Naïve Bayes and Random Forest, implemented in R, identify slow learners accurately. Finally, we conclude from the above results that the Random Forest algorithm in R achieved an accuracy of 100%, compared with 96% for C5.0 and 89% for Naïve Bayes. In future work, we plan to collect more data with additional attributes and to apply different algorithms to obtain the best results.

REFERENCES

1. K. Prasada Rao et al., "Predicting Learning Behavior of Students using Classification Techniques", International Journal of Computer Applications (0975–8887), Volume 139, No. 7, April 2016, pp. 15-19.

2. R. Kaviyarasi and T. Balasubramanian, "Exploring the High Potential Factors that Affects Students' Academic Performance", I.J. Education and Management Engineering, November 2018, 6, in MECS (http://www.mecs-press.net), DOI: 10.5815, pp. 15-23.

3. Sumitha and E. S. Vinothkumar, "Prediction of Students Outcome Using Data Mining Techniques", International Journal of Scientific Engineering and Applied Science (IJSEAS), Volume 2, Issue 6, June 2016, ISSN: 2395-3470, pp. 132-139.

4. Vrushali Mhetre and Prof. Mayura Nagar, "Classification based data mining algorithms to predict slow, average and fast learners in educational system using Weka", Proceedings of the IEEE 2017 International Conference on Computing Methodologies and Communication (ICCMC), 978-1-5090-4890-8/17/$31.00 ©2017 IEEE, pp. 475-479.

5. Swati and Rajinder Kaur, "Multifactor Naïve Bayes Classification for The Slow Learner Prediction Over Multiclass Student Dataset", International Journal on Computational Science & Applications (IJCSA), Vol. 6, No. 4, August 2016, pp. 11-23.

6. Shiwani Rana, Roopali Garg, "Slow Learner Prediction using Multi-Variate Naïve Bayes Classification Algorithm", Department of Information Technology, UIET, Panjab University, Chandigarh, India, 02 December 2016, pp. 11-23.

7. Harwati, Ardita Permata Alfiani, Febriana Ayu Wulandari, "Mapping Student's Performance Based on Data Mining Approach", ScienceDirect, Agriculture and Agricultural Science Procedia 3 (2015), pp. 173–177.

8. Manolis Chalaris, Stefanos Gritzalis, Manolis Maragoudakis, Cleo Sgouropoulou and Anastasios Tsolakidis, "Improving Quality of Educational Processes Providing New Knowledge using Data Mining Techniques", ScienceDirect, Procedia - Social and Behavioral Sciences 147 (2014), pp. 390–397.

9. Tan and Vipin Kumar, "Introduction to Data Mining", Pearson, 2013.

10. http://datamining.businessintelligence.uoc.edu/home/j48-decision-tree.

11. Ch. Ravi Kishore, K. Prasada Rao et al., "Performance Evaluation of Entropy and Gini using Threaded and Non Threaded ID3 on Anaemia Dataset", 2015 IEEE, DOI 10.1109/CSNT.2015.112, pp. 1080–1084.

12. Ajay Kumar Pal, Saurabh Pal, "Analysis and Mining of Educational Data for Predicting the Performance of Students", International Journal of Electronics Communication and Computer Engineering, 2013, Volume 4, Issue 5.

13. Anne-Sophie Hoffait, Michael Schyns, "Early detection of university students with potential difficulties", Decision Support Systems 101 (2017), pp. 1-11.

14. Abimbola R. et al., "Predicting Student Academic Performance in Computer Science Courses: A Comparison of Neural Network Models", International Journal of Modern Education and Computer Science (IJMECS), Vol. 10, No. 6, 2018, pp. 1-9.

15. Reena Thakur, A. R. Mahajan, "Preprocessing and Classification of Data Analysis in Institutional System using Weka", International Journal of Computer Applications (0975–8887), Volume 112, No. 6, February 2015, pp. 9-11.

16. https://en.wikipedia.org/wiki/R_(programming_language)