
Maximize ROI’s: Creating a Dual Purpose Retention Model using Machine Learning Methods in R


Academic year: 2021


(1)

Esther Wilkinson

Division of Analytics and Integrated Planning

University of Central Florida

Maximize ROI’s: Creating a Dual Purpose Retention Model using Machine Learning Methods in R

(2)

Who We Are: University of Central Florida

• Large public university established in 1963, with over 72,200 students in Fall 2020, of which:

• 61,700 are undergraduates

• 49% First-Time-in-College (FTIC), 49% transfers, 2% other undergraduates

• 93% of undergraduates are from Florida

• 10,000 are graduate students

• just under 500 are medical students

• The 2019-2020 FTIC Summer-Fall Full-Time cohort has 7,150 students

• UCF acceptance rate for FTIC is approximately 45%, with a 37% enrollment rate

(3)

WHY Create a Retention Model?

• Resources and funding are tied to state metrics, which include a measure of first-year retention of First-Time-in-College (FTIC) students.

• University-wide commitment to student success

• Program accountability is increasing, and programs are being evaluated for continuance or reallocation of resources

• Want to focus retention intervention efforts on the most important factors

• Learn about persistence factors that will support on-time completion

ONGOING ANALYSES

• An anecdotal approach and descriptive statistics are used frequently to determine areas of weakness in student success

➢ Limitations: the types of data available; understanding outside influences, current events, and the state of the economy, which are outside our control

(4)

A Dual-Purpose Retention Model

1. Gain insight into the factors that most strongly affect first-year retention. Use these insights to affirm and guide programs where they can make the biggest difference in terms of retention or persistence.

2. Use the model to predict individual probability scores to guide retention intervention strategy and outreach on future cohorts.

(5)

Retention Intervention Team

• Created in 2016 to raise first-year retention rates above 90%

• The 2016-2017 FTIC cohort was the first group to receive interventions

• Strategies were quickly expanded from the original spring/summer efforts to the full year

• This summer we used the probability scores calculated from the model to finalize outreach efforts

(6)

First Year Retention of FTIC

Cohort | N | % Retained
2011-2012 | 6,145 | 87.6%
2012-2013 | 5,905 | 86.9%
2013-2014 | 5,811 | 87.5%
2014-2015 | 6,201 | 89.1%
2015-2016 | 6,290 | 88.8%
2016-2017 | 6,138 | 89.6%
2017-2018 | 6,684 | 90.4%
2018-2019 | 7,055 | 91.5%
2019-2020 | 7,150 | 92.1%

Retention rates increased 3.3 percentage points in the last 4 years, from 2015-16 to 2019-20, compared to 1.2 percentage points in the 4 years from 2011-12 to 2015-16.
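As a quick check on the comparison above, the two changes can be recomputed from the retention rates for the 2011-12, 2015-16, and 2019-20 cohorts. This is only an arithmetic sketch; the variable names are illustrative.

```r
# Retention rates (%) for three cohorts, copied from the slide
retained <- c("2011-12" = 87.6, "2015-16" = 88.8, "2019-20" = 92.1)

# A difference between two rates is a change in percentage points,
# not a relative percent change
early_gain <- retained[["2015-16"]] - retained[["2011-12"]]
late_gain  <- retained[["2019-20"]] - retained[["2015-16"]]

round(c(early = early_gain, late = late_gain), 1)
```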

(7)

Methodology

Data Preparation

• Warehouse tables

• Spreadsheets

• Calculated Variables

Data Exploration

• Distributions

• Correlations

• T-Test

Feature Selection

• Lasso

• Extreme Gradient Boosting

• Logistic Regression

Modeling

• Logistic Regression

(8)

Data

• Students from three FTIC Summer-Fall Full-Time cohorts, 2016-17 through 2018-19. Each of these cohorts was part of the retention team initiative (about ½ the cohort each year).

• Total of 19,886 students

• 151 variables, a mix of categorical and numeric

• Academic variables by semester: summer, fall, spring, summer (one full academic year)

• Data collected from external departments (mentoring programs, volunteering, intramurals, other activities)

• Demographics

• Incoming data (test scores, high school and GPA, county)

• Student income, family income, unmet need, Bright Futures (Florida scholarship program)

• Calculated variables and flag variables

(9)

Two Subsets

Fall

• Fall academic and engagement activity, demographics and financial characteristics

• Flag for students who also later attended summer sessions

• Included all students N=19,886

Fall/Spring

• Fall academic and engagement activity, demographics and financial characteristics

• Flag for students who also later attended summer sessions

PLUS

• Spring academic and engagement data, and flag if students were eligible to attend summer sessions

(10)

Other Attributes of the Data

• Finance data are missing in 2,137 records (11%)

• Zip code, used to calculate distance from high school to UCF, is missing in 209 records

• Imbalanced data: 90.5% in one class (retained) and 9.5% in the other (not retained)

• Methods considered for handling the imbalance:

• Undersampling (loses valuable information)

• Oversampling (can add unwanted noise)

• Synthetic Minority Oversampling Technique (SMOTE), which can help with this problem
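The two simplest resampling options named above can be sketched in base R. This is a minimal illustration on made-up toy data (the data frame `toy` and its columns are hypothetical, not the cohort data). SMOTE itself is not shown because it needs an add-on package.

```r
set.seed(42)

# Hypothetical toy data: 1 = retained (about 90%), 0 = not retained (about 10%)
toy <- data.frame(
  gpa      = round(runif(1000, 1.5, 4.0), 2),
  retained = rbinom(1000, 1, 0.905)
)

majority <- toy[toy$retained == 1, ]
minority <- toy[toy$retained == 0, ]

# Random undersampling: keep every minority row, sample the majority down
under <- rbind(minority,
               majority[sample(nrow(majority), nrow(minority)), ])

# Random oversampling: keep every majority row, resample the minority with replacement
over <- rbind(majority,
              minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])

table(under$retained)
table(over$retained)
```

Undersampling discards most of the retained students, while oversampling repeats the same not-retained records many times, which is the trade-off the bullets describe.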

(11)

Correlations

(12)

Feature Selection

Lasso

• Similar to forward or backward regression

• Penalizes coefficients, shrinking some of them to zero

• Only variables with the greatest impact surface

• Cannot deal with missing data

Logistic Regression

• Forwards and backwards

• Must remove records with missing data

• If there are many correlated variables, choose which variable to keep

• Good interpretability

XGBoost

• High predictive accuracy

• Handles numeric or categorical variables

• Handles missing data

• Handles correlated variables

• Selects important features
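The forwards-and-backwards logistic regression option can be sketched with base R's `step()`, which searches by AIC. The toy data and variable names below are hypothetical, and the deck does not state which selection criterion was actually used, so treat this as one possible setup rather than the presenter's procedure.

```r
set.seed(1)
n <- 2000

# Hypothetical toy cohort; variable names are illustrative only
toy <- data.frame(
  fall_gpa  = runif(n, 1, 4),
  w_grades  = rpois(n, 0.2),
  noise_var = rnorm(n)   # unrelated to the outcome by construction
)
toy$retained <- rbinom(n, 1, plogis(-1 + 1.2 * toy$fall_gpa - 0.6 * toy$w_grades))

# Start from the intercept-only model
null_model <- glm(retained ~ 1, data = toy, family = binomial)

# Stepwise search in both directions, scored by AIC
selected <- step(
  null_model,
  scope = list(lower = ~ 1, upper = ~ fall_gpa + w_grades + noise_var),
  direction = "both",
  trace = 0
)

summary(selected)$coefficients
```

Records with missing values would have to be dropped before fitting, which is the limitation noted for this method above.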

(13)
(14)
(15)
(16)

Key Findings

Summer Enrollment Prior to Second Fall

Students who attended a summer session before the start of the second fall were 3.8 times more likely to be retained

• 60% of students retained were enrolled in summer

Any Probation Status End of 1st Fall

Students who had any probation status at the end of fall were 42% less likely to be retained

• 4% of students who were retained had a probation status compared to 35% of those not retained

Not Retained to Fall (n=1,884): Academic Disqualified 4%, 1st Fall Probation 31%, No 1st Fall Probation 65%

(17)

Number of W Grades in the Fall

• Students who had one W grade in the fall term were 46% less likely to be retained compared to students with zero W grades

• 93% of students retained had zero W grades in the fall compared to 77% of those not retained

Fall UCF GPA

• There was a positive impact on retention when fall UCF GPA was above 2.60

• A larger proportion of the population not retained (51%) had fall GPAs below 2.60, compared to 12% of the population that was retained

Figure: Percent of retained or not retained students by number of W grades in fall (0 to 6). Not retained: 77%, 16%, 3%, 1%, 2%, 1%, 0%. Retained: 93% with zero W grades, 6% with one.

Figure: Percent of retained or not retained students by fall UCF GPA.

(18)

Participation in the LINK Program

• Program participants were 27% more likely to be retained than those who did not participate

Other important features

• Students were about 43% less likely to be retained if they were from out of state (odds ratio 0.570 in the final logistic regression)

• Earning W grades in the fall term had a greater negative impact on retention than earning F grades

• Earning a term GPA in the spring above 2.40 had a significant positive impact on retention (only 53% of those not retained met this mark)

Figure: Number of students by LINK participation and retention outcome

LINK Participation | Not Retained | Retained
Non-Participant | 944 | 7,156
Participant | 940 | 10,846

LINK is an education and involvement-based program to help students new to UCF get involved on campus. For each program attended, LINK Loot is awarded just for using the app to sign in at the event. Students then use their accumulated Loot at various events throughout the year to bid on items such as TVs, theme park tickets, UCF gear and more!
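For context on the LINK finding, the counts shown in the participation chart give an unadjusted comparison. The sketch below assumes the chart values map to the groups as labelled in the matrix, which is an inference from the totals. It computes the retention rate in each group and the crude odds ratio. The crude ratio will not match the 27% figure, which comes from a model that adjusts for other variables.

```r
# Counts read from the LINK participation chart (mapping of bars to groups is assumed)
link <- matrix(
  c(944,  7156,
    940, 10846),
  nrow = 2, byrow = TRUE,
  dimnames = list(LINK = c("Non-Participant", "Participant"),
                  Outcome = c("Not Retained", "Retained"))
)

# Retention rate within each group
retention_rate <- link[, "Retained"] / rowSums(link)

# Unadjusted odds ratio: odds of retention for participants vs non-participants
odds <- link[, "Retained"] / link[, "Not Retained"]
crude_or <- odds[["Participant"]] / odds[["Non-Participant"]]

round(retention_rate, 3)
round(crude_or, 2)
```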

(19)

Application to New Cohort 2019-2020

• Calculated individual probability scores to test model on new cohort

• Provided predicted probability scores to the Retention Team to assist in selection of students for targeted intervention/outreach

Accuracy of the Fall model tested on the 2019-2020 cohort:

✓ Compared with actual fall registration for the 2019-2020 cohort, the fall model's predictions were 93.6% accurate

✓ The predicted overall retention rate was 0.932 (95% CI: 0.929, 0.935)

✓ The actual retention rate this year is 0.924 (before drop-for-non-pay); afterwards 0.921

Predicted | Not Registered | Registered | Total N
Not Retained | 140 | 34 | 174
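The scoring step described on this slide can be sketched as follows: fit a logistic regression on earlier cohorts, score a new cohort, classify at a cutoff, and compare with actual outcomes. All data here are simulated, the single predictor `fall_gpa` is a stand-in, and the 0.5 cutoff is illustrative rather than the threshold used in the deck.

```r
set.seed(7)

# Hypothetical cohorts with one illustrative predictor
make_cohort <- function(n) {
  gpa <- runif(n, 1, 4)
  data.frame(fall_gpa = gpa,
             retained = rbinom(n, 1, plogis(-1.5 + 1.4 * gpa)))
}
train_cohort <- make_cohort(3000)
new_cohort   <- make_cohort(1000)

fit <- glm(retained ~ fall_gpa, data = train_cohort, family = binomial)

# Individual probability scores for the new cohort
new_cohort$score <- predict(fit, newdata = new_cohort, type = "response")

# Classify at a cutoff and compare with actual outcomes
new_cohort$predicted <- as.integer(new_cohort$score >= 0.5)
confusion <- table(Actual = new_cohort$retained, Predicted = new_cohort$predicted)
accuracy  <- mean(new_cohort$predicted == new_cohort$retained)

confusion
accuracy
mean(new_cohort$score)   # predicted overall retention rate
```

With a class split near 90/10, overall accuracy can look high even when few non-retained students are identified, so the confusion table is worth reading alongside the single accuracy figure.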

(20)

Application of Key Findings for Outreach and Programs

1. Encourage enrollment in summer session prior to second fall start.

2. Probation status in the fall is a major red flag. Evaluate programs to work with students throughout the first academic year who fall in this category.

3. Reach out to students with fall UCF GPA less than 2.60.

4. Engage out-of-state students.

5. Watch students who have W grades in the fall, or any DFW’s in spring term.

6. Reach out to students who may be tracking towards a spring term GPA less than 2.40

7. Create activities to engage first year students that connect them to other students. Use

(21)

Limitations and Future Study

• Creating a model with spring semester data provides more information, which can lead to a slightly better model than fall alone; however, with attrition from fall to spring, the model can only be tested against actual outcomes of those who attended spring semester and not the whole cohort.

• Some students transfer out to another institution even though they have a high probability of being retained. A separate analysis could be conducted on a subset of these students.

• A lack of qualitative data. As seen in modeling with variables that have sparse information, incorporating qualitative data would only improve the model if a large percentage of the population provided answers to the same questions.

• The model performed well in predicting those who were retained, but not as well for those who were not. A secondary analysis could be conducted on the subset of students who were not retained.

• Many institutions are making efforts to raise graduation rates. A second- and third-year retention model may give insights for a graduation model.

(22)

R Code Correlations Matrix

# Load packages: ggcorrplot for the plot and p-values, dplyr for the pipe
library(ggcorrplot)
library(dplyr)

# Subset numeric variables from the data and drop rows with missing values
num_cols <- unlist(lapply(mydata, is.numeric))
data <- mydata[, num_cols]
data <- na.omit(data)

# Matrix of correlation p-values
p.mat <- cor_pmat(data)

data %>%
  cor() %>%
  ggcorrplot(
    ggtheme = ggplot2::theme_minimal,
    colors = c("#6D9EC1", "white", "#E46726"),
    show.diag = TRUE,
    p.mat = p.mat,      # matrix of p-values
    insig = "blank",    # leave a blank space for correlations not significant at 0.05
    lab = TRUE, lab_size = 2.5,
    title = "Correlation Matrix of Mydata",
    legend.title = "Coefficient"
  )

(23)

R Code for XGBoost (Extreme Gradient Boosting)

library(xgboost)
library(Matrix)

# Recode missing values in categorical columns as an explicit "Missing" level
# (numeric NAs are left as NA; XGBoost handles them natively)
is_cat <- sapply(data, function(x) is.character(x) || is.factor(x))
data[is_cat] <- lapply(data[is_cat], function(x) {
  x <- as.character(x)
  x[is.na(x)] <- "Missing"
  factor(x)
})

# Split data into train and test sets
set.seed(1)
train_index <- sample(1:nrow(data), 0.5 * nrow(data))
test_index <- setdiff(1:nrow(data), train_index)
train <- data[train_index, ]
test <- data[test_index, ]

# One-hot encode categorical variables into a sparse matrix (drop the intercept column);
# na.pass keeps rows that still contain numeric NAs
options(na.action = "na.pass")
sparse_matrix <- sparse.model.matrix(target ~ ., data = train)[, -1]
output_vector <- as.numeric(train$target == "1")

# Train xgboost model. xgboost takes a single value per parameter; max_depth 2, 4, 6,
# colsample_bytree 1/3, 2/3, 1 and subsample 0.25, 0.5, 0.75, 1 are candidate values to
# tune separately (cs and ss are assumed to mean colsample_bytree and subsample)
dtrain <- xgb.DMatrix(data = sparse_matrix, label = output_vector)
params <- list(objective = "binary:logistic", max_depth = 4, eta = 0.1, gamma = 2,
               colsample_bytree = 2/3, subsample = 0.75, nthread = 2)
xgmodel <- xgb.train(params = params, data = dtrain, nrounds = 1000, verbose = 1)

# Rank features by importance
importance <- xgb.importance(feature_names = colnames(sparse_matrix), model = xgmodel)
head(importance)

(24)

Thank you!

Questions?

Contact

Esther Wilkinson

Institutional Research Analyst II

[email protected]

(25)

Final Modeling:

Important Features back into Logistic Regression

Effect | Estimate | Std Error | P-Value | Odds Ratio | 95% CI
FRST_FALL_UCF_GPA | 0.665 | 0.076 | <.001 | 1.944 | (1.677, 2.254)
Enrolled_Summ2 | 1.991 | 0.104 | <.001 | 7.320 | (5.970, 8.974)
FALL_W_GRADES | -0.622 | 0.071 | <.001 | 0.537 | (0.467, 0.617)
Y1_Challenge_COMBO | 0.107 | 0.018 | <.001 | 1.113 | (1.074, 1.153)
FALL_F_GRADES | -0.334 | 0.068 | <.001 | 0.716 | (0.626, 0.818)
MatricDays_prior_Fall | 0.003 | 0.001 | <.001 | 1.003 | (1.001, 1.004)
Out_of_State | -0.562 | 0.130 | <.001 | 0.570 | (0.442, 0.736)
FRST_FALL_ANY_PROB | -0.545 | 0.163 | <.001 | 0.580 | (0.421, 0.797)
LINK_Participation | 0.236 | 0.082 | <.01 | 1.266 | (1.078, 1.487)
FRST_FALL_RWC | 0.005 | 0.002 | <.05 | 1.005 | (1.001, 1.011)
FALL_ONLINE_CRDS | -0.041 | 0.018 | <.05 | 0.960 | (0.926, 0.994)

Effect | Estimate | Std Error | P-Value | Odds Ratio | 95% CI
SP_SUM_DFW | -0.391 | 0.067 | <.001 | 0.677 | (0.593, 0.771)
Enrolled_Summ2 | 1.344 | 0.110 | <.001 | 3.833 | (3.091, 4.757)
Y1_Challenge_COMBO | 0.084 | 0.020 | <.001 | 1.088 | (1.046, 1.131)
FRST_SP_CUR_GPA | 0.310 | 0.084 | <.001 | 1.364 | (1.156, 1.607)
MatricDays_prior_Fall | 0.004 | 0.001 | <.001 | 1.004 | (1.002, 1.006)
Distance_to_UCF | -0.001 | 0.0001 | <.001 | 0.999 | (0.999, 0.999)
FRST_FALL_TOT_HRS_ERN | 0.084 | 0.020 | <.001 | 1.088 | (1.046, 1.131)
FRST_FALL_TOT_CRS_LD | -0.125 | 0.043 | <.01 | 0.883 | (0.811, 0.960)
SP_ONLINE_CRDS | -0.054 | 0.018 | <.01 | 0.947 | (0.915, 0.981)
FALL_ONLINE_CRDS | -0.050 | 0.023 | <.05 | 0.951 | (0.909, 0.995)
FRST_SP_RWC | 0.006 | 0.003 | <.05 | 1.006 | (1.000, 1.102)
MajorChanges | -0.243 | 0.109 | <.05 | 0.784 | (0.633, 0.971)
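The odds ratios in these tables are the exponentiated logistic regression estimates. The sketch below reproduces a few rows from the first table and converts them to a percent change in the odds of retention. It assumes the reported intervals are Wald intervals (estimate ± 1.96 standard errors), which the slide does not state.

```r
# Estimates and standard errors copied from four rows of the first table
coefs <- data.frame(
  effect   = c("Enrolled_Summ2", "FALL_W_GRADES", "Out_of_State", "LINK_Participation"),
  estimate = c(1.991, -0.622, -0.562, 0.236),
  std_err  = c(0.104, 0.071, 0.130, 0.082)
)

# Odds ratio = exp(estimate); assumed Wald 95% CI = exp(estimate +/- z * SE)
z <- qnorm(0.975)
coefs$odds_ratio <- exp(coefs$estimate)
coefs$ci_low     <- exp(coefs$estimate - z * coefs$std_err)
coefs$ci_high    <- exp(coefs$estimate + z * coefs$std_err)

# Percent change in the odds of retention per one-unit increase
coefs$pct_change_odds <- 100 * (coefs$odds_ratio - 1)

print(cbind(coefs["effect"], round(coefs[-1], 3)))
```

Read this way, an odds ratio of 0.537 for FALL_W_GRADES corresponds to about 46% lower odds of retention per W grade, and 1.266 for LINK_Participation to about 27% higher odds.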

ROC: 0.844

Optimal Probability Score Threshold: 0.509

ROC: 0.804

Optimal Probability Score Threshold: 0.568
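The two summary figures above, an area under the ROC curve and a probability cutoff, can be computed in base R as sketched below. The scores are simulated, and Youden's J (sensitivity + specificity - 1) is used as the cutoff criterion only as an assumption, since the deck does not say how its optimal threshold was chosen.

```r
set.seed(3)

# Hypothetical scores: retained students (1) tend to score higher than non-retained (0)
actual <- rbinom(2000, 1, 0.9)
score  <- plogis(rnorm(2000, mean = ifelse(actual == 1, 2.5, 0.8), sd = 1))

# Area under the ROC curve via the rank-sum (Mann-Whitney) identity
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
auc <- (sum(rank(score)[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Sensitivity and specificity at every candidate cutoff, then maximize Youden's J
cutoffs <- sort(unique(score))
sens <- sapply(cutoffs, function(t) mean(score[actual == 1] >= t))
spec <- sapply(cutoffs, function(t) mean(score[actual == 0] <  t))
best_cutoff <- cutoffs[which.max(sens + spec - 1)]

auc
best_cutoff
```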

(26)

Appendix A: GEP Challenge Courses

1. UGRD courses only

2. At least 20% DFW rate historically

3. At least 50 students have taken the class historically

4. Course is credit-bearing (SCH > 0)

5. GEP courses based on 2017-18 catalog year
