Esther Wilkinson
Division of Analytics and Integrated Planning
University of Central Florida
Maximize ROI’s
Creating a Dual Purpose
Retention Model
using
Who We Are: University of Central Florida
• Large public university established in 1963, with over
72,200 students in Fall 2020, of which:
• 61,700 are undergraduate
• 49% First-Time-in-College (FTIC), 49%
Transfers, 2% Other undergraduates
• 93% of undergraduates are from Florida
• 10,000 are graduates
• just under 500 medical students
• 2019-2020 FTIC Summer-Fall Full-Time Cohort has
7,150 students
• UCF acceptance rate for FTIC is approximately 45%
with a 37% enrollment rate
WHY Create a Retention Model?
• Resources and funding are tied to state metrics, which include a measure of first-year retention of
First-Time-in-College (FTIC) students
• University wide commitment to student success
• Program accountability is increasing, and programs are being evaluated for continuance or reallocation of resources
• Want to focus retention intervention efforts on the most important factors
• Learn about persistence factors that will support on-time completion
ONGOING ANALYSES
• Anecdotal evidence and descriptive statistics are frequently used to identify areas of weakness in student
success
➢ Limitations: the types of data available, and the difficulty of accounting for outside influences
(current events, the state of the economy) that are beyond our control
A Dual-Purpose Retention Model
1. Gain insight into the factors that most strongly affect
first-year retention. Use these insights to affirm and guide
programs where they can make the biggest difference in
retention or persistence.
2. Use the model to predict individual probability scores to guide
retention intervention strategy and outreach on future cohorts.
Retention Intervention Team
• Created in 2016 to raise 1st-year retention rates above 90%
• The 2016-2017 FTIC cohort was the first group to receive
interventions
• Strategies were quickly expanded from the original spring/summer
efforts to run all year
• This summer we utilized the probability scores calculated
from the model to finalize outreach efforts
First Year Retention of FTIC
Cohort       N       % Retained
2011-2012    6,145   87.6%
2012-2013    5,905   86.9%
2013-2014    5,811   87.5%
2014-2015    6,201   89.1%
2015-2016    6,290   88.8%
2016-2017    6,138   89.6%
2017-2018    6,684   90.4%
2018-2019    7,055   91.5%
2019-2020    7,150   92.1%
Retention rates increased 3.3 percentage points in the last 4 years, from 2015-16 to 2019-20,
compared to 1.2 percentage points in the 4 years from 2011-12 to 2015-16
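The two four-year changes quoted above are differences between retention rates, so they are percentage points rather than relative percent changes. A minimal sketch of that arithmetic in R, using only the rates listed for each cohort (the variable names are illustrative, not from the original analysis):

```r
# Percent retained by cohort, copied from the figures above.
retained <- c("2011-12" = 87.6, "2012-13" = 86.9, "2013-14" = 87.5,
              "2014-15" = 89.1, "2015-16" = 88.8, "2016-17" = 89.6,
              "2017-18" = 90.4, "2018-19" = 91.5, "2019-20" = 92.1)

# Change in percentage points over each four-year window.
early_change  <- retained[["2015-16"]] - retained[["2011-12"]]
recent_change <- retained[["2019-20"]] - retained[["2015-16"]]

round(c(early = early_change, recent = recent_change), 1)
```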
Methodology
Data Preparation
• Warehouse tables
• Spreadsheets
• Calculated Variables
Data Exploration
• Distributions
• Correlations
• T-Test
Feature Selection
• Lasso
• Extreme Gradient Boosting
• Logistic Regression
Modeling
• Logistic Regression
Data
• Students from three FTIC Summer Fall Full-Time cohorts, 2016-17 through 2018-19. These
cohorts were each a part of the retention team initiative (about ½ the cohort each year).
• Total of 19,886 students
• 151 variables, a mix of categorical and numeric
• Academic variables by semester: summer, fall, spring, summer (1 full academic year)
• Data collected from external departments (mentoring programs, volunteering, intramurals, other activities)
• Demographics
• Incoming data (test scores, high school and GPA, county)
• Student income, family income, unmet need, Bright Futures (Florida scholarship program)
• Calculated variables and flag variables
Two Subsets
Fall
• Fall academic and engagement
activity, demographics and financial
characteristics
• Flag for students who also later
attended summer sessions
• Included all students N=19,886
Fall/Spring
• Fall academic and engagement
activity, demographics and financial
characteristics
• Flag for students who also later
attended summer sessions
PLUS
• Spring academic and engagement
data, and flag if students were eligible
to attend summer sessions
Other Attributes of the Data
• Finance data missing in 2,137 records (11%)
• Zip code, used to calculate distance from high school to UCF, missing in 209 records
• Imbalanced data: 90.5% in one class (retained) and 9.5% in the other class (not retained)
• Methods considered for working with the imbalance:
• Undersampling (loses valuable information)
• Oversampling (can add unwanted noise); the Synthetic Minority Oversampling Technique (SMOTE) can help with this problem
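The two simpler resampling options listed above can be sketched in base R. This is an illustration on simulated data with roughly the same 90.5% / 9.5% class split, not the procedure used in the study; the data frame and column names are hypothetical. SMOTE itself is not shown, since it needs an add-on package.

```r
set.seed(1)

# Hypothetical stand-in for the cohort data: 1 = retained, 0 = not retained.
n <- 2000
students <- data.frame(
  id = seq_len(n),
  retained = rbinom(n, size = 1, prob = 0.905)
)

minority <- students[students$retained == 0, ]
majority <- students[students$retained == 1, ]

# Undersampling: keep every minority row, draw an equal number of majority rows.
under <- rbind(minority,
               majority[sample(nrow(majority), nrow(minority)), ])

# Oversampling: keep every majority row, resample minority rows with replacement.
over <- rbind(majority,
              minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])

table(under$retained)
table(over$retained)
```

Undersampling discards most of the retained students, while oversampling repeats the same non-retained students many times, which is the trade-off the bullets describe.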
Correlations
Feature Selection
Lasso
• Similar to forward or backward regression
• Penalizes coefficients, shrinking them toward zero (some to exactly zero)
• Only variables with the greatest impact surface
• Cannot deal with missing data
Logistic Regression
• Forwards and backwards
• Must remove records with missing data
• If many variables are correlated, choose which variable to keep
• Good interpretability
XGBoost
• High predictive accuracy
• Handles numeric or categorical
• Handles missing data
• Handles correlated variables
• Selects important features
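As an illustration of the "forwards and backwards" logistic regression option, the sketch below runs stepwise selection with base R's `step()` on simulated data. The variable names and effect sizes are invented for the example, and AIC-based selection is an assumption; the slides do not say which criterion was used.

```r
set.seed(1)

# Hypothetical data: `gpa` and `w_grades` drive retention, `noise` does not.
n <- 1000
d <- data.frame(
  gpa      = runif(n, 0, 4),
  w_grades = rpois(n, 0.3),
  noise    = rnorm(n)
)
d$retained <- rbinom(n, 1, plogis(-1 + 1.2 * d$gpa - 0.8 * d$w_grades))

# Logistic regression needs complete cases, so rows with missing data are dropped.
d <- na.omit(d)

null_model <- glm(retained ~ 1, data = d, family = binomial)
full_model <- glm(retained ~ gpa + w_grades + noise, data = d, family = binomial)

# Stepwise selection in both directions, scored by AIC.
selected <- step(null_model,
                 scope = list(lower = formula(null_model),
                              upper = formula(full_model)),
                 direction = "both", trace = 0)

names(coef(selected))
```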
Key Findings
Summer Enrollment Prior to Second Fall
Students who attend summer session before
the start of second fall were 3.8 times more
likely to be retained
• 60% of students retained were enrolled in summer
Any Probation Status End of 1st Fall
Students who had any probation status at the
end of fall were 42% less likely to be retained
• 4% of students who were retained had a probation status
compared to 35% of those not retained
Not Retained to Fall (n=1,884): 4% Academic Disqualified, 31% 1st Fall Probation, 65% No 1st Fall Probation
Number of W Grades in the Fall
• Students who had one W grade in the fall term
were 46% less likely to be retained compared
to students with zero W grades
• 93% of students retained had zero W grades
in the fall compared to 77% of those not
retained
Fall UCF GPA
• Retention was positively affected when Fall UCF GPA was above 2.60
• A larger proportion of the population not retained
(51%) had fall GPAs below 2.60, compared to 12%
of the population that was retained
[Chart: percent of retained or not retained students by number of W grades in fall (0 to 6). Not retained: 77%, 16%, 3%, 1%, 2%, 1%, 0%. Retained: 93% with zero W grades, 6% with one.]
[Chart: percent of retained or not retained students by Fall UCF GPA.]
Participation in the LINK Program
• Program participants were 27% more likely to be
retained than those who did not participate
Other important features
• Students were 43% less likely to be retained if
they were from out of state (odds ratio 0.570)
• Earning W grades in the fall term had a greater
negative impact on retention than earning F
grades
• Earning a term GPA in the spring above 2.40
had a significant positive impact on retention
(only 53% of those not retained met this mark)
Number of students    Not Retained    Retained
Non-Participant       944             7,156
Participant           940             10,846
LINK is an education and involvement-based program to help students new to UCF
get involved on campus. For each program attended LINK Loot will be awarded just
by using the app to sign in at an event. Students then use their accumulated Loot
at various events throughout the year to bid on items such as TVs, theme park
tickets, UCF gear and more!
Application to New Cohort 2019-2020
• Calculated individual probability scores to test model on new cohort
• Provided predicted probability scores to Retention Team to assist in selection of students
for targeted intervention/outreach
Accuracy of the Fall model tested on the 2019-2020 cohort:
✓ The fall model's predictions, compared to actual fall registration for the 2019-2020 cohort, were 93.6%
accurate
✓ It predicted an overall retention rate of 0.932 (95% CI: 0.929, 0.935)
✓ Actual retention rate this year is 0.924 (before drop-for-non-pay); afterwards 0.921
                Predicted Not Registered    Predicted Registered    Total N
Not Retained    140                         34                      174
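The scoring step described above (fit on earlier cohorts, then calculate individual probability scores for a new cohort) can be sketched with base R's `glm()` and `predict()`. Everything here is simulated: the column names, coefficients, and the 0.5 classification cutoff are assumptions made for illustration, not values from the UCF model.

```r
set.seed(1)

# Hypothetical cohort generator; column names are illustrative only.
make_cohort <- function(n) {
  d <- data.frame(
    fall_gpa     = round(runif(n, 0, 4), 2),
    fall_w       = rpois(n, 0.2),
    out_of_state = rbinom(n, 1, 0.07)
  )
  log_odds <- -0.5 + 1.1 * d$fall_gpa - 0.6 * d$fall_w - 0.5 * d$out_of_state
  d$retained <- rbinom(n, 1, plogis(log_odds))
  d
}

train_cohort <- make_cohort(3000)  # stands in for the earlier cohorts
new_cohort   <- make_cohort(1000)  # stands in for the new cohort

fit <- glm(retained ~ fall_gpa + fall_w + out_of_state,
           data = train_cohort, family = binomial)

# Individual probability of being retained for each student in the new cohort.
new_cohort$p_retained <- predict(fit, newdata = new_cohort, type = "response")

# Students with the lowest scores are the first candidates for outreach.
outreach <- head(new_cohort[order(new_cohort$p_retained), ], 10)

# Accuracy at an assumed 0.5 cutoff, and the predicted overall retention rate.
predicted      <- as.integer(new_cohort$p_retained >= 0.5)
accuracy       <- mean(predicted == new_cohort$retained)
predicted_rate <- mean(new_cohort$p_retained)
```

Sorting by the probability score rather than applying only a cutoff lets an intervention team size the outreach list to whatever capacity it has.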
Application of Key Findings for Outreach and Programs
1. Encourage enrollment in summer session prior to second fall start.
2. Probation status in the fall is a major red flag. Evaluate programs to work with students
throughout the first academic year who fall in this category.
3. Reach out to students with fall UCF GPA less than 2.60.
4. Engage out-of-state students.
5. Watch students who have W grades in the fall, or any DFW’s in spring term.
6. Reach out to students who may be tracking towards a spring term GPA less than 2.40
7. Create activities to engage first year students that connect them to other students. Use
Limitations and Future Study
• Creating a model with spring semester data provides more information that can lead to a slightly better model than
fall; however with attrition from fall to spring, the model can only be tested against actual outcomes of those who
attended spring semester and not the whole cohort.
• Some students transfer out to another institution though they have a high probability of being retained. A separate
analysis could be conducted on a subset of these students.
• A lack of qualitative data. As seen in modeling with variables that have sparse information, incorporating qualitative
data would only improve the model if there were a large percentage of the population providing answers to the
same questions.
• The model performed well on predicting those that were retained, but not as well on those who were not. A
secondary analysis could be conducted on the subset of students who were not retained.
• Many institutions are making efforts to raise graduation rates. A second and third year retention model may give
insights to a graduation model.
R Code Correlations Matrix
# Load packages: ggcorrplot for the plot and cor_pmat(), magrittr for the pipe
library(ggcorrplot)
library(magrittr)

# Subset numeric variables from data and drop rows with missing values
num_cols <- unlist(lapply(mydata, is.numeric))
data <- mydata[, num_cols]
data <- na.omit(data)

# Matrix of p-values for the pairwise correlations
p.mat <- cor_pmat(data)

data %>%
  cor() %>%
  ggcorrplot(
    ggtheme = ggplot2::theme_minimal,
    colors = c("#6D9EC1", "white", "#E46726"),
    show.diag = TRUE,
    p.mat = p.mat,     # matrix of p-values
    insig = "blank",   # leave a blank space for correlations not significant at 0.05
    lab = TRUE, lab_size = 2.5,
    title = "Correlation Matrix of Mydata",
    legend.title = "Coefficient"
  )
R Code for XGBoost (Extreme Gradient Boosting)
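library(xgboost)
library(Matrix)
library(caret)  # provides dummyVars() for one-hot encoding

# Split data into train and test sets (50/50)
set.seed(1)
train_index <- sample(1:nrow(data), 0.5 * nrow(data))
test_index <- setdiff(1:nrow(data), train_index)
Xtrain <- data[train_index, ]
Ytrain <- data[train_index, "target"]
Xtest <- data[test_index, ]
Ytest <- data[test_index, "target"]

# Set missing categorical values to "Missing" for one-hot encoding.
# Numeric NAs are left alone; xgboost treats NA as missing.
fill_missing <- function(df) {
  is_cat <- sapply(df, function(x) is.character(x) || is.factor(x))
  df[is_cat] <- lapply(df[is_cat], function(x) {
    x <- as.character(x)
    x[is.na(x)] <- "Missing"
    factor(x)
  })
  df
}
Xtrain <- fill_missing(Xtrain)
Xtest <- fill_missing(Xtest)

# One-hot encode categorical variables (predictors only, not the target)
predictors <- setdiff(names(Xtrain), "target")
dvar <- dummyVars("~ .", data = Xtrain[, predictors])
new <- predict(dvar, newdata = Xtrain[, predictors])

# Make a sparse matrix of predictors and a logical label vector
sparse_matrix <- Matrix(new, sparse = TRUE)
output_vector <- Ytrain == "1"

# Train xgboost model.
# The original call listed several candidate values per parameter
# (max_depth 2/4/6; cs 1/3, 2/3, 1; ss 0.25, 0.5, 0.75, 1). A single fit takes
# one value each, so one candidate is used here. Reading cs and ss as
# colsample_bytree and subsample is an assumption.
xgmodel <- xgboost(data = sparse_matrix, label = as.numeric(output_vector),
                   max_depth = 4, gamma = 2, eta = 0.1,
                   colsample_bytree = 2/3, subsample = 0.75,
                   nthread = 2, nrounds = 1000,
                   objective = "binary:logistic", verbose = 1)

# Rank features by importance
importance <- xgb.importance(feature_names = colnames(sparse_matrix), model = xgmodel)
head(importance)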
Final Modeling:
Important Features back into Logistic Regression
Effect                   Estimate   Std Error   P-Value   Odds Ratio   95% CI
FRST_FALL_UCF_GPA         0.665     0.076       <.001     1.944        (1.677 , 2.254)
Enrolled_Summ2            1.991     0.104       <.001     7.320        (5.970 , 8.974)
FALL_W_GRADES            -0.622     0.071       <.001     0.537        (0.467 , 0.617)
Y1_Challenge_COMBO        0.107     0.018       <.001     1.113        (1.074 , 1.153)
FALL_F_GRADES            -0.334     0.068       <.001     0.716        (0.626 , 0.818)
MatricDays_prior_Fall     0.003     0.001       <.001     1.003        (1.001 , 1.004)
Out_of_State             -0.562     0.130       <.001     0.570        (0.442 , 0.736)
FRST_FALL_ANY_PROB       -0.545     0.163       <.001     0.580        (0.421 , 0.797)
LINK_Participation        0.236     0.082       <.01      1.266        (1.078 , 1.487)
FRST_FALL_RWC             0.005     0.002       <.05      1.005        (1.001 , 1.011)
FALL_ONLINE_CRDS         -0.041     0.018       <.05      0.960        (0.926 , 0.994)

Effect                   Estimate   Std Error   P-Value   Odds Ratio   95% CI
SP_SUM_DFW               -0.391     0.067       <.001     0.677        (0.593 , 0.771)
Enrolled_Summ2            1.344     0.110       <.001     3.833        (3.091 , 4.757)
Y1_Challenge_COMBO        0.084     0.020       <.001     1.088        (1.046 , 1.131)
FRST_SP_CUR_GPA           0.310     0.084       <.001     1.364        (1.156 , 1.607)
MatricDays_prior_Fall     0.004     0.001       <.001     1.004        (1.002 , 1.006)
Distance_to_UCF          -0.001     0.0001      <.001     0.999        (0.999 , 0.999)
FRST_FALL_TOT_HRS_ERN     0.084     0.020       <.001     1.088        (1.046 , 1.131)
FRST_FALL_TOT_CRS_LD     -0.125     0.043       <.01      0.883        (0.811 , 0.960)
SP_ONLINE_CRDS           -0.054     0.018       <.01      0.947        (0.915 , 0.981)
FALL_ONLINE_CRDS         -0.050     0.023       <.05      0.951        (0.909 , 0.995)
FRST_SP_RWC               0.006     0.003       <.05      1.006        (1.000 , 1.102)
MajorChanges             -0.243     0.109       <.05      0.784        (0.633 , 0.971)
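The odds ratios reported above are the exponentiated logistic regression estimates, and statements such as "42% less likely" correspond to the percent change in odds, 100 × (odds ratio − 1). The sketch below recomputes both for a few rows of the first table (the one that begins with FRST_FALL_UCF_GPA). The Wald interval, estimate ± 1.96 × standard error, is an assumption about how the intervals were produced, so the recomputed bounds may differ from the reported ones in the third decimal.

```r
# Estimates and standard errors copied from the first coefficient table above.
fall_model <- data.frame(
  effect   = c("FRST_FALL_UCF_GPA", "Enrolled_Summ2", "FALL_W_GRADES",
               "Out_of_State", "FRST_FALL_ANY_PROB", "LINK_Participation"),
  estimate = c(0.665, 1.991, -0.622, -0.562, -0.545, 0.236),
  std_err  = c(0.076, 0.104, 0.071, 0.130, 0.163, 0.082)
)

z <- qnorm(0.975)  # about 1.96 for a 95% interval

fall_model$odds_ratio <- exp(fall_model$estimate)
fall_model$ci_lower   <- exp(fall_model$estimate - z * fall_model$std_err)
fall_model$ci_upper   <- exp(fall_model$estimate + z * fall_model$std_err)

# Percent change in the odds of retention per one-unit increase in the effect.
fall_model$pct_change_odds <- 100 * (fall_model$odds_ratio - 1)

print(cbind(fall_model["effect"], round(fall_model[, -(1:3)], 3)))
```

These are changes in odds, not in probability; with a retention rate above 90%, the two can differ noticeably.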