Predicting Student
Persistence Using Data
Mining and Statistical
Analysis Methods
Koji Fujiwara
Office of Institutional Research and Effectiveness
Bemidji State University & Northwest Technical College December 18, 2012
NTC Enrollment
Northwest
Technical
College
enrollment
grew
very
quickly
between
2004
and
2009,
but
now
it
has
been
falling
off.
Fall 2004 Headcount = 794
Fall 2009 Headcount = 1,604
Fall Semester Headcount and FTE
FTE = Undergraduate Semester Hours / 15.
1,217 1,373 1,604 1,420 1,384 1,168 779 744 859 819 749 650 0 200 400 600 800 1,000 1,200 1,400 1,600 1,800
Fall 07 Fall 08 Fall 09 Fall 10 Fall 11 Fall 12
Headcount (30th day) FTE (Final)
NTC Enrollment Report
Headcount and FTE for Term by Campus - Northwest Technical College
Spring 2013 Date: 12/17/2012 Days to beginning of term: 27.5693
Unduplicated Duplicated
Campus Headcount Headcount FTE Credits FYE
NTC 576 576 366.13 5492 183.07
NTC Distance 696 895 309.07 4636 154.53
Total 1272 1471 675.19 10128 337.60
Retention & Grad Rates – IPEDS definition
78% 74% 81% 71% 76% 40% 37% 48% 35% 39% 32% 29% 0% 10% 20% 30% 40% 50% 60% 70% 80% 90%Fall 2007 (166) Fall 2008 (114) Fall 2009 (170) Fall 2010 (141) Fall 2011 (104)
Cohort (Size)
Fall to Spring Fall to Fall 150% Grad Rates
Retention/Success Study at NTC (& other colleges)
A
study
was
conducted
last
year
to
identify
the
factors
that
affected
student
persistence
(success,
to
be
defined
later).
Factors: Credit Completion Rate & GPA
The
first
semester
credit
completion
rate
was
analyzed
using
a
2
(success
status:
successful
vs.
unsuccessful)
×
2
(college
ready
status:
ready
vs.
underprepared)
×
2
(late
registration
status:
regular
vs.
late)
What did we learn? (AIRUM 2011 / AIR 2012)
0 20 40 60 80 100
Early Reg. Late Reg. Early Reg. Late Reg. Underprepared College Ready
Unsuccessful Successful
Credit Comp. Rate (%)
We need to focus on
these students as well.
More Proactive Retention/Success Efforts
How
can
an
IR
office
contribute
to
this
project?
We
looked
at
eight
data
mining
and
two
statistical
models
to
find
the
best
method
to
develop
a
predictive
model
for
Proposed Retention/Success Plan
[Summer]
Retention/Success Study
& Develop a Model
[Fall]
Run the model
[Spring]
Interventions
Target Variable
Target
Variable:
Second
Fall
Success
Define “Second Fall Success = Success” if
At the beginning of second Fall
Students came back to NTC (retained) or
Students transferred out to another institution (transferred) or
Students completed the program (graduated)
Models
Decision Tree
Naive Bayes Classifier
K‐Nearest Neighbor Classifier
Neural Network
Support Vector Machine
Bagging
Boosting
Random Forest
Probit Regression
DATA
Combined
Fall
2008
– 2010
first
‐
year
,
full
‐
time
,
degree
seeking
students
data
There was no difference in the success status and cohort year (Chi‐
square test, p = 0.72).
In
this
study,
we
included
first
‐
time
regular
students
and
Predictor Variables
• Age
• Gender (Female/Male)
• College Ready Status (Underprepared/Ready)
• if a student needed to take at least one college readiness course,
then the student was categorized as “Underprepared”.
• Pell Status (No/Yes)
• Late Registration Status (before or on/after August 1st)
• Underrepresented Status (No/Yes)
Predictor Variables
• # Attempted Credit Hours
• Enrolled Semester GPA
• Enrolled Semester Credit Completion Rate
How to Estimate a Model Performance?
Simple
Cross
‐
Validation
Original Data
N = 622
Testing (for validation)
N2 = 311
Training (for modeling)
How to Compare the Results?
We
propose
to
use
these
three
indicators:
Overall Classification Rate
Sensitivity
Prob. that the model classified as “Successful” when given to a
group of students who were actually “Successful”
Specificity
Prob. that the model classified as “Unsuccessful” when given to a
Results
Method Classification Sensitivity Specificity
Logistic Regression 81.7% 89.8% 67.8%
Probit Regression 81.7% 90.8% 66.1%
Decision Tree 81.4% 89.3% 67.8%
Naive Bayes Classifier 82.3% 90.3% 68.7%
K‐Nearest Neighbor (5) 82.3% 90.3% 68.7%
Neural Network 81.4% 86.7% 72.2%
Support Vector Machine 82.6% 92.9% 65.2%
Bagging 82.3% 89.8% 69.6%
Boosting 78.9% 87.8% 63.5%
Conclusion and Future Research
We
are
going
to
use
a
Neural
Network model
and
a
bagging model.
December
• These models will be applied to the actual data. • The list will be sent to the Student Services.
January
• The Student Services will start contacting to the students. • Interventions (what types?)
Next Summer
and Fall
• Summer – Conduct Retention Study • Fall – Evaluate the models
Conclusion and Future Research (contd.)
Current
Model
Need Students First Semester Academic Performance Data
Interventions: Spring Semester
Ideal
Model
Can run before the Fall semester begins
Need other predictors: i.e.) FA data?