Developing Data-Driven Predictive Models of Student Success
Kresge Data Mining Project
Phase Two Report
University of Maryland University College
November 27, 2013
Table of Contents
Executive Summary
Introduction
Research Goals
Section 1: General Grant Overview
Section 2: Key Findings and Conclusions from Phase 1
Objectives and Milestones
Section 3: Relevant Literature
Section 4: Data Sources
Section 5: Overview of Research Design and Target Variables
Section 6: Key Findings from Data Mining
Research Goal 1: Profile students based on community college course taking behaviors
Figure 1. Change in GPA for students retained or not retained at UMUC
Table 1. Community college grade distributions for students successful or not at UMUC
Table 2. Community college grade distributions for students retained or not at UMUC
Figure 2. Success quadrants
Figure 3. Likelihood of community college course selections for Stars
Figure 4. Likelihood of community college course selections for Strivers and Slackers
Figure 5. Likelihood of community college course selections for Splitters
Figure 6. Binned number of community college credits by community college GPA for each success profile
Figure 7. Binned number of community college credits by UMUC GPA for each success profile
Figure 8. Binned number of community college credits by delta GPA for each success profile
Table 3. Community college credits and GPA and UMUC GPA by success profile
Section 7: Key Findings from Predictive Analyses
Research Goal 2: Identify demographic profiles of MC and PGCC students transferring to UMUC
Table 4. Description of demographic and community college course taking background data clusters for Montgomery College
Table 5. Description of demographic and community college course taking background data clusters for Prince George’s Community College
Table 6. UMUC first term GPA
Research Goal 4: Identify demographic and community college background factors predicting course success at UMUC
Table 7. Summary of predictors for logistic regressions predicting overall GPA and success in specific courses
Research Goal 4a: Examine demographic, community college background factors, and course efficiency as predicting course success at UMUC
Table 8. Courses taken at community college by institution
Table 9. Success at UMUC by coursework taken or not taken
Table 10. Course efficiency rates differentiated by types of courses taken or not
Table 11. Results of multivariate logistic regression analysis of success at UMUC
Research Goal 4b: Examine demographic, community college background factors, and change in GPA as predicting retention at UMUC
Table 12. Results of multivariate logistic regression analysis of retention at UMUC
Research Goal 5: Investigate predictors of behaviors in WebTycho and success at UMUC
Table 13. Description of WebTycho activity clusters for Montgomery College
Table 14. Description of WebTycho activity clusters for Prince George’s Community College
Table 15. Summary of top ten predictors of success at UMUC and WebTycho cluster membership
Section 8: Summary of Results
Section 9: Research and Intervention Planning in Phase 3
References
Appendices
Executive Summary
This report documents analyses and findings completed in Phase 2 of the Kresge Data Mining Grant: Developing Data-Driven Predictive Models of Student Success. This grant was awarded to University of Maryland University College (UMUC) in collaboration with two community college partners: Montgomery College (MC) and Prince George’s Community College (PGCC). The purpose of the grant was:
1. To build an integrated database tracking students across institutions, from community college to UMUC.
2. To use predictive statistical models and data mining techniques to track and model students’ progress across institutions.
3. To identify factors predictive of students’ success at UMUC that may inform the development of interventions aimed at improving outcomes for undergraduate students transferring from community colleges to UMUC or other four-year institutions.
In Phase 1 of the grant, UMUC, in collaboration with partner institutions, designed and developed a database, the Kresge Data Mart (KDM), with records of more than 250,000 students. This database includes information on student demographics, academic performance at UMUC and the community college, and student behaviors in courses hosted in WebTycho, UMUC’s proprietary online learning management system.
Key results from Phase 1 included a literature review of publications on students’ performance in online courses, successful course completion, re-enrollment, and retention. Further, literature on data mining techniques in higher education was examined. The literature review showed that factors such as the number of schools students attended, the number of credits students transferred, and the students’ community college GPA were associated with successful course completion and retention. Regression analyses determined that students’ online classroom activities prior to the start of a class and during the early weeks of the course were predictive of successful course completion.
In Phase 1, three goals for the project were identified:
1. Validate the predictive models and data mining techniques explored in Phase 1 on an expanded dataset.
2. Build profiles of successful students and their online learning behaviors.
3. Develop interventions to improve the success of students transferring from community colleges to UMUC.
The above three goals were accomplished in Phase 2, which involved examination of students’ demographic profiles, course work from the community colleges, and performance at UMUC. A variety of methodologies were used to identify predictors of students’ success and retention. These include:
1. Cluster analyses to determine profiles of students based on demographic factors and community college course-taking backgrounds.
2. Logistic regression to examine demographic factors and variables associated with students’ community college course-taking histories to predict success at UMUC.
3. Cluster analyses to determine profiles of students’ online behaviors in courses at UMUC.
4. Data mining techniques to identify profiles in the student population based on GPA and re-enrollment. Community college grade distributions and course taking preferences for these different groups of students were examined.
In addition to predicting outcomes associated with success, analyses in Phase 2 determined a variety of trends characterizing the student population and developed student profiles based on demographics, prior academic work, and online classroom behavior.
The primary outcome measures of interest in Phase 2 include students’ success at UMUC, defined as earning a first term GPA of 2.0 or above, and students’ retention at UMUC within 12 months following their first academic term. Key findings are presented below.
1. Across studies, age and marital status were associated with success at UMUC. Older, married students are more likely to succeed, perhaps indicative of students’ maturity or a stronger commitment to their educational goals.
2. Four success profiles of students at UMUC were identified based on students’ GPA and re-enrollment. Profiles differed in terms of community college course taking preferences and course load, and in the change in GPA when transferring to UMUC. Again, these results suggest that the degree of student preparedness, particularly in specific target areas (e.g., accounting, economics), is predictive of success at UMUC.
3. Course efficiency, the ratio of credits earned to credits attempted, in the community college was determined to be a predictor of success at UMUC. The higher the course efficiency, the more likely a student will succeed.
4. A new factor, delta GPA, was introduced in these analyses, corresponding to the difference between students’ GPA at the community college and at UMUC. While most students experienced a decreased GPA when transferring to UMUC, the magnitude of this decrease was predictive of students’ continued enrollment at UMUC beyond the first term (i.e., retention).
5. Similarly, students who took math or honors courses in community college were more likely to succeed at UMUC, suggesting that rigor of community college courses may prepare students to succeed at a university.
6. Students’ behaviors in the online classroom indicated high variability in the extent to which they engage in course content and course-‐related activities. A substantial percentage of students accessed course content and course materials to a limited extent, thus impacting successful course completion.
Based on findings in Phase 2, interventions aimed at promoting success of transfer students at UMUC are presented. These interventions differ in the audience targeted and whether they provide social support (e.g., peer mentor) or academic support (e.g., checklist) to promote student success. Further, long-term initiatives to promote student success that have been developed collaboratively with partner institutions are introduced.
Introduction
The purpose of this report is to document work done by UMUC, MC, and PGCC on the Kresge Data Mining Grant: Developing Data-Driven Predictive Models of Student Success. This report has three primary purposes:
1. To review prior work completed on the Kresge Data Mining Grant in Phase 1.
2. To document work completed in Phase 2 of the grant, expanding on findings from Phase 1.
3. To introduce research-driven future directions and interventions aimed at promoting transfer students’ success at UMUC; the evaluation of these interventions will be undertaken in Phase 3 of the Kresge grant.
The research in this report has been conducted by the UMUC Institutional Research Office. Research from Phase 2 has been documented in detail. This report presents the research in nine sections:
Section 1: General grant overview
Section 2: Key findings and conclusions from Phase 1
Section 3: Relevant literature
Section 4: Data sources
Section 5: Overview of research design and target variables
Section 6: Key findings from data mining
Section 7: Key findings from predictive analyses
Section 8: Summary of results
Section 9: Research and intervention planning in Phase 3
In Phase 2, five key research goals were accomplished. Specifically, researchers were able to:
1. Profile students at UMUC based on community college course taking behaviors.
2. Identify demographic profiles of MC and PGCC students transferring to UMUC.
3. Determine MC and PGCC transfer students’ performance at UMUC.
4. Identify demographic and community college background factors predicting course success at UMUC.
a. Examine demographic, community college background factors, and course efficiency as predicting course success at UMUC.
b. Examine demographic, community college background factors, and change in GPA as predicting retention at UMUC.
5. Investigate predictors of behaviors in WebTycho and success at UMUC.
Section 1: General Grant Overview
Grant Partnership
UMUC is a four-year public university that offers online degree programs to a diverse population of working adults. With support from this grant, UMUC established partnerships with two Maryland community colleges that also serve large and diverse student populations. Montgomery College (MC), established in 1946, enrolls over 60,000 students annually. Prince George’s Community College (PGCC) enrolls more than 40,000 students from approximately 128 different countries. Both institutions serve the metro-D.C. area, but differ in that PGCC serves more low-income students. Both institutions have endorsed the goals of this project and are committed to working with UMUC to find ways to promote student success throughout their academic careers.
Financial Support
The Kresge Foundation awarded UMUC a $1.2 million grant to build an integrated database, explore data mining techniques, build predictive models of student success, implement and evaluate intervention strategies that are designed to improve student success, and disseminate the results of this research to national constituents.
In Phase 1 of the research study, approximately 41% of total grant funds were expended on purchasing hardware to house the data-mining database, collecting data from partner institutions, and providing dedicated salaries for a data mining specialist and a graduate assistant. Additional staff resources were provided in kind by UMUC. In Phase 2, UMUC expended funds for additional data collection, data mining consulting, and conference presentations. (See Appendix A for the financial statement.) In Phase 3, expenses are expected to total $400,000. These funds are intended to be spent on collecting additional data from the community colleges, additional data mining research, and implementing interventions, with a graduate student to coordinate the interventions. In addition, funds will support a national convening to present and discuss research findings on educational data mining, predictive modeling, and learner analytics.
Section 2: Key Findings and Conclusions from Phase 1
In Phase 1, a Memorandum of Understanding (MOU) was negotiated and signed between UMUC and partner institutions in order to clarify the data security and parameters for use of this data in the research project. The MOU allows UMUC researchers to conduct research using individual student data while protecting student information and confidentiality.
UMUC, in collaboration with partner community colleges MC and PGCC, designed, developed, and implemented a database of over 250,000 student records. The Kresge Data Mart (KDM) contains information on student demographics, academic performance at the community colleges and at UMUC, and student behaviors in the online classroom at UMUC.
Key outcomes of Phase 1 included a literature review on students’ success in online courses. Further, literature about the use of data mining techniques in higher education was identified and reviewed; this literature is described in Section 3. Data mining determined that factors associated with successful outcomes included students’ prior academic work, namely the number of schools students attended, the number of credits students transferred, and students’ GPA in community college. These predictors were associated with both successful GPA and retention at UMUC. Additional findings from Phase 1 included that certain online course behaviors, such as opening and reading conference notes in the first four weeks of a course, were associated with course success, as was students’ engagement in the online classroom prior to the start of a class.
The analyses in Phase 1 were focused on examining a large variety of factors to determine their value in predicting student success. These findings were used to develop initial predictive models of successful performance at UMUC. These predictive models were refined and validated in Phase 2.
At the conclusion of Phase 1, three goals for the completion of the grant were identified:
1. Validate the predictive models and data mining techniques explored in Phase 1 on an expanded dataset.
2. Build profiles of successful students and their online learning behaviors.
3. Develop interventions to improve the success of students transferring from community colleges to UMUC.
Objectives and Milestones
Specific objectives and milestones are presented below for each stage of the research project. Objectives from Phase 1 and Phase 2 of the project are abridged with planned Phase 3 work further expanded. These objectives and milestones have been modified throughout the course of the project, but are consistent with grant requirements.
Objectives | Milestones | Status

Phase 1: April 2011 – October 2012
Develop a Project Action Plan | Develop a project action and collaboration plan with the partnering agencies. | Complete
Data Collection and Preparation | Prepare a data “universe” (integrated database system) on CC transfer students in the UMUC population (KDM). | Complete
Data Collection and Preparation | Understand variables; define student characteristics and retention data; develop data dictionary. | Complete
Data Analysis | Conduct initial predictive analyses and employ data mining techniques to identify factors contributing to students’ success. | Complete
Project Evaluation | Conduct ongoing project evaluation. Take action on identified areas for improvement. | Complete

Phase 2: November 2012 – October 2013
Develop and Validate Analytic Models of Student Success | Analyze data and identify factors that predict success/failure. | Complete
Develop and Validate Analytic Models of Student Success | Validate predictive analyses and models developed through data mining techniques to predict students’ success and retention at UMUC. | Complete
Develop and Validate Analytic Models of Student Success | Build student profiles based on analyses. | Complete
Disseminate Key Findings | Discuss results with Kresge Workgroup and share with advisory board. | Complete
Disseminate Key Findings | Discuss results with Project Partners and obtain feedback. | Complete
Disseminate Key Findings | Present key findings at national conferences on higher education. | Ongoing
Develop Interventions | Work with stakeholders at UMUC and CC partners to develop a list of potential interventions. | Complete
Project Evaluation | Conduct ongoing project evaluation. Take action on identified areas for improvement. | Ongoing
Research Plan 3 | Design and develop KDM2 to update and | In progress
Research Plan 3 | Plan Phase 3 analyses on expanded integrated data. | In progress

Phase 3: November 2013 – October 2014
Develop Interventions | Review relevant literature on interventions that promote student success in online learning. | In progress
Develop Interventions | Develop an implementation plan and timeline for piloting of interventions. | In progress
Implement Pilot Interventions | Implement and evaluate pilot interventions. | Not yet started
Disseminate Results on Interventions | Develop and disseminate report on the pilot interventions. | Not yet started
Phase 3 Analyses | Develop and execute Phase 3 research plan. | In progress
Report Findings | Present key findings from Phase 3 analyses at national conferences; publish research in journals. | Not yet started
Report Findings | Prepare written report of both Phase 3 analyses and full scope of Kresge grant work. | Not yet started
Dissemination of Results and Resources | Develop website and repository for educational data mining and student success. | Not yet started
Dissemination of Results and Resources | Host a national convening on data mining and learner analytics. | Not yet started
Project Evaluation | Conduct final project evaluation. | Not yet started
Section 3: Relevant Literature
The literature review discussed below covers three areas: factors contributing to students’ success in online courses, the use of data mining techniques in educational research, and factors impacting the success and retention of non-traditional students. The review was undertaken to inform the development of interventions aimed at promoting students’ success.
Online student success literature
Current literature on student success focuses on student outcomes such as course success, course withdrawal, re-enrollment, and retention. For example, student variables such as student characteristics, previous course work, grades, and time spent in course discussions and activities may be useful in predicting course success (Aragon & Johnson, 2008; Morris & Finnegan, 2009; Morris, Finnegan & Lee 2009; Park & Choi, 2009). Course-level variables acquired from student login data from the learning management system may have predictive value in measuring course withdrawal (Willging & Johnson, 2008; Nistor & Neubauer, 2010). Student, course, program, and institution level variables such as student characteristics, number of transfer credits, final grade in any given course, experience in online environments, and course load may be useful in predicting re-enrollment and retention (Aragon & Johnson, 2008; Morris & Finnegan, 2009; Boston, Diaz, Gibson, Ice, Richardson & Swan, 2011).
Although these studies showcase a variety of findings related to student success, the majority of studies of retention in online learning environments use traditional statistical or qualitative methods. Park and Choi (2009) point out that expansion of methods such as data mining may have utility when student, course, program, and institutional level variables are well defined and institutionally meaningful. Literature related to educational data mining focuses on exploratory research.
Educational data mining literature
Data mining is a method of discovering new and potentially useful information from large amounts of data (Baker & Yacef, 2009; Luan, 2001). Educational data mining is a subset of the field of data mining that draws on a wide variety of literatures such as statistics, psychometrics, and computational modeling to examine relationships that may predict student outcomes (Romero & Ventura, 2007; Baker & Yacef, 2009). In educational data mining, data mining algorithms are used to create and improve models of student behavior in order to better understand student learning (Luan, 2002).
Data mining methods are most helpful for finding patterns already present in data, not necessarily in testing hypotheses (Luan, 2001). Baker and Yacef (2009) suggest that research in higher education should use a variety of algorithms, such as classification, clustering, or association algorithms, in determining relationships between variables. Although many definitions of these techniques exist in data mining literature, Han and Kamber (2001) offer the following definitions. Classification is the process of finding a set of models or functions that describe and distinguish data classes or concepts to predict a class of objects whose class label is unknown. Clustering analyzes data objects that are related to similar outcomes without consulting a class label. Association is the discovery of rules showing attribute value conditions that occur frequently together in a given set of data (Han & Kamber, 2001).
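As a rough illustration of the association definition above, the sketch below scores one candidate rule with the support and confidence measures commonly used in association rule mining. The attribute names and the four toy records are hypothetical and are not drawn from the Kresge Data Mart; the report does not state which association algorithm, if any, was applied.

```python
# Toy records: each record is the set of attribute values observed for one student.
# These values are invented for illustration only.
records = [
    {"took_math", "succeeded"},
    {"took_math", "succeeded"},
    {"took_math"},
    {"succeeded"},
]


def support(rows, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(items <= row for row in rows) / len(rows)


def confidence(rows, antecedent, consequent):
    """Of the records containing the antecedent, the share also containing the consequent."""
    base = support(rows, antecedent)
    if base == 0:
        return None  # the rule cannot be evaluated when the antecedent never occurs
    return support(rows, set(antecedent) | set(consequent)) / base


rule_support = support(records, {"took_math", "succeeded"})
rule_confidence = confidence(records, {"took_math"}, {"succeeded"})
print(rule_support)               # 0.5
print(round(rule_confidence, 2))  # 0.67
```

A rule is usually reported only when both measures exceed analyst-chosen thresholds, which is what "occur frequently together" means in practice.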
Recent research suggests that these data mining algorithms can be used to examine variables related to student success. Yu, DiGangi, Jannasch-Pennell, Lo, and Kaprolet (2010) used a classification algorithm to explore potential predictors related to student retention in a traditional undergraduate institution. In this study, the authors used a decision tree to explore demographic, academic performance, and enrollment variables as they related to student retention. This study revealed a predictable relationship between earned hours and retention, but also found that at this institution, retention was closely related to state of residence (in-state/out-of-state) and living location (on campus/off campus). The authors speculate that this finding points to the potential utility of online courses in improving retention for out-of-state or off-campus students.
Despite these recent developments in exploring variables related to student success in traditional higher education settings, research using data mining techniques to uncover relationships among variables in online courses is limited in scope. This study is designed to fill this gap in the extant literature by utilizing data on online students who attended multiple institutions.
Retention in Non-Traditional Student Populations
Historically, research on student retention largely focused on the experiences of traditional students, until a seminal book by Tinto (1993) expanded on extant models of retention to consider which factors may impact the retention of non-traditional students. Across the literature, non-traditional students are considered to be those above age 26 or taking classes through non-traditional pathways, including distance and online learning. For both traditional and non-traditional students, retention was thought to be a consequence of students’ academic and social integration (Tinto, 1993). Other research has echoed the central role of social factors in predicting retention for non-traditional students, online, and distance learners (Boston, Diaz, Gibson, Ice, Richardson, & Swan, 2009). At the same time, the processes and policies that foster social integration in online environments are different from the factors that foster social connections in more traditional settings. For students enrolled in online courses, feelings of social integration may stem from learners and instructors conveying a sense of themselves through the use of para-language (i.e., emoticons), self-disclosure, humor or other verbal expressions of personal emotions and/or values (Boston et al., 2009). These behaviors are believed to result in open communication, trust, and group cohesion and are identified as necessary for successful collaboration (Boston et al., 2009).
Using social network analysis, Dawson (2010) found that visualizing classroom interaction patterns could provide insights into the nature of interactions for high- versus low-achieving students completing an online course. Dawson (2010) determined that high-performing students primarily interacted with other high-performing students, and likewise, low-performing students were more likely to have interactions with other low-performing students. More importantly, in examining instructor-student interactions, instructors networked with high-performing students (81.7%) at significantly higher rates than they did with low-performing students (34.61%).
Social connections in online learning may result in cognitive and learning gains as well. Rovai (2002) found a correlation between levels of engagement in the classroom community and increased levels of content learning and understanding; this was especially true for females.
Theories of student retention have considered both the contribution of student motivation and the challenges that external barriers may present to students’ continued enrollment in college. Kember (1989) presents students’ decisions to re-enroll as the result of a cost-benefit analysis, wherein students compare the price of attendance and the time commitment associated with college attendance to the anticipated benefits of receiving a degree.
Examinations of student retention have focused on two complementary processes, persistence and attrition (e.g., Rovai, 2002), with positive academic variables associated with persistence and negative academic variables associated with attrition (Bean & Metzner, 1985). In predicting persistence, external factors, such as family and organizational support of the students’ academic efforts, played a major role in determining intent to persist, and course satisfaction and perceived relevance to students’ daily lives were significant sources of motivation to persist in college course work (Park & Choi, 2009).
Predictive models of student retention have considered students’ background factors, such as previous GPA and academic performance (Bean & Metzner, 1985). Further, students’ use of web-based technologies positively impacted students’ engagement and retention for online learners (Chen, Lambert, & Guidry, 2010).
Whereas the aforementioned studies focused on individual student factors predicting retention, Moore and Fetzner (2009) addressed the institutional characteristics that fostered commitment in non-traditional students. These factors included having a leadership culture that fosters commitment to student success and institutional policies and practices that incorporate student support services and technological support. For online learners, access to services and to support that meets their needs was found to be crucial (Moore & Fetzner, 2009). Further, student satisfaction, defined as students being happy with their progress and with the support received for learning, and perceiving that the knowledge they were learning was valuable, was predictive of retention. Faculty satisfaction, stemming from involvement in curricular design and training in the use of online technologies supporting learning, was found to be key to engagement and a contributor to retention (Moore & Fetzner, 2009).
The findings from the published literature offer insights into (a) factors that may be modeled as predictive of students’ success, (b) techniques that may be used to investigate and model student success, and (c) areas, specific to the needs of non-traditional learners, that may be targeted for intervention.
Section 4: Data Sources
One of the key achievements of Phase 1 of the Kresge research grant was the development of the KDM, an integrated multi-institutional database that chronicles the prior academic work of transfer students. Data for the KDM came from four data systems:
1. Banner – Montgomery College’s student information system.
2. Datatel – Prince George’s Community College’s student information system.
3. PeopleSoft – UMUC’s student information system.
4. WebTycho – UMUC’s proprietary learning management system that records students’ activities in an online classroom.
Demographic, academic, and enrollment data were collected on each student from each institution. In addition, transfer data and online classroom behavior data were included from UMUC.
Demographic data included students’ gender, age, marital status, and race/ethnicity. Enrollment data included course registration, program of study or major, and student status. Academic data included information about students’ academic history prior to transferring to UMUC, such as course grades, repeated courses, and remedial coursework. Transfer data included the number of courses transferred, transfer GPA, and prior degrees earned. There were two sources for this data: community college data provided through the Kresge project, and UMUC transcript data. The latter may be incomplete because UMUC records contain information only on courses students chose to transfer to UMUC and equivalent to a UMUC course. Classroom behavior data was specific to each course and each WebTycho session. Each session recorded a login time, access to various modules within the classroom, and posting of or responding to conference notes. Each action that students made in the classroom was recorded and totaled for each session, defining student activity.
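The per-session totaling of classroom actions described above can be sketched as follows. This is only an illustration under stated assumptions: the row layout, the identifiers, and the action names are hypothetical, since the actual WebTycho log schema is not given in this report, and the sketch assumes that every logged action (including the login itself) counts once toward the session total.

```python
from collections import Counter

# Hypothetical event log rows: (student_id, session_id, action).
# Action names are illustrative, not the real WebTycho event types.
events = [
    ("s1", "sess-1", "login"),
    ("s1", "sess-1", "open_module"),
    ("s1", "sess-1", "post_conference_note"),
    ("s1", "sess-2", "login"),
    ("s2", "sess-3", "login"),
    ("s2", "sess-3", "read_conference_note"),
]


def activity_per_session(rows):
    """Count the recorded actions for each (student, session) pair."""
    totals = Counter()
    for student_id, session_id, _action in rows:
        totals[(student_id, session_id)] += 1
    return totals


totals = activity_per_session(events)
print(totals[("s1", "sess-1")])  # 3
```

Summing these session totals by student and by week would give the kind of early-course activity measure that Phase 1 found to be associated with course success.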
The KDM served as the primary resource for all the analyses and findings for this research grant. Section 5 describes the research and methods for Phase 2.
Section 5: Overview of Research Design and Target Variables
In Phase 2, research was developed to comprehensively answer and expand on questions introduced during Phase 1 of the project. The findings from these knowledge sheets are summarized in subsequent sections. Section 6 of this report presents findings from data mining analyses focused on exploratory analyses identifying potential predictors of students’ success and retention at UMUC. The following questions were considered.
1. Which profiles of students at UMUC can be identified?
2. To what extent does community college course taking differentiate each success profile at UMUC?
Section 7 of this report presents findings from predictive analyses, including cluster analyses and logistic regression, modeling factors in students’ demographic and community college course taking backgrounds that predict success at UMUC and validating specific predictors of students’ success identified in Section 6. The following questions were considered.
3. What are the demographic profiles of community college students transferring from MC and PGCC to UMUC?
4. Which factors from students’ demographic profiles and course-taking backgrounds in CC predict success at UMUC overall, and in specific courses?
5. What kinds of online learning behaviors do students transferring to UMUC engage in?
These questions encompassed examinations of students’ performance in community college overall (Research Questions 1 and 2) as well as in specific courses (Research Questions 2 and 4). The questions examined not only UMUC GPA but also reenrollment (Research Question 1) as a desired outcome variable, and considered not only performance but also process and learning behaviors at UMUC (Research Question 4). In addition, a number of possible predictors of success not previously considered were included, such as students’ course efficiency in community college (the ratio of credits completed to credits attempted) and change in GPA (the difference between students’ community college and UMUC GPA).
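The two derived predictors described above can be sketched as follows. This is a minimal illustration, not the project's actual transformation code: the function names and the sample record are hypothetical, and the sign convention for change in GPA (UMUC GPA minus community college GPA) is taken from the description in the retention results in Section 6.

```python
def course_efficiency(credits_completed, credits_attempted):
    """Ratio of credits completed to credits attempted.

    Returns None when no credits were attempted (an assumption; the
    report does not say how this case was handled).
    """
    if credits_attempted <= 0:
        return None
    return credits_completed / credits_attempted


def change_in_gpa(cc_gpa, umuc_gpa):
    """UMUC first-term GPA minus community college GPA.

    A negative value means the student's GPA dropped after transfer.
    """
    return umuc_gpa - cc_gpa


# Hypothetical student record, for illustration only.
student = {"credits_completed": 45, "credits_attempted": 60,
           "cc_gpa": 3.2, "umuc_gpa": 2.7}

efficiency = course_efficiency(student["credits_completed"],
                               student["credits_attempted"])
delta = change_in_gpa(student["cc_gpa"], student["umuc_gpa"])
print(f"course efficiency = {efficiency:.2f}, change in GPA = {delta:+.2f}")
```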
Student Population. The population of interest for the Phase 2 analyses was defined as first term undergraduate students transferring to UMUC from MC or PGCC between Spring 2005 and Spring 2012. Subsets of this population were drawn for subsequent analyses.
Variables. In this report a number of outcomes are associated with student success:
Course Success – earning a final grade of A, B, or C in any course.
Unsuccessful Course Completion – earning a grade of D, F, FN, or W in a course.
Student Success – students’ first term GPA of 2.0 or above.
Re-enrollment – enrollment in the immediate next semester after initial enrollment.
Retention – re-enrollment at UMUC within 12 months after initial enrollment.
The first term GPA cut-off point of 2.0 is based on current UMUC policies that define academic probation. On a 4-point scale, 2.0 corresponds to a C average.
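The outcome definitions above can be expressed as simple rules. The sketch below is illustrative only: the function names are hypothetical, grades outside the two listed sets are left unclassified because the report does not address them, and the 12-month retention window is approximated at calendar-month granularity, which is an assumption.

```python
from datetime import date

SUCCESSFUL_GRADES = {"A", "B", "C"}
UNSUCCESSFUL_GRADES = {"D", "F", "FN", "W"}


def course_outcome(grade):
    """Classify a final course grade using the report's definitions."""
    if grade in SUCCESSFUL_GRADES:
        return "successful"
    if grade in UNSUCCESSFUL_GRADES:
        return "unsuccessful"
    return None  # grade not covered by either definition


def student_success(first_term_gpa):
    """Student success: first term GPA of 2.0 or above."""
    return first_term_gpa >= 2.0


def retained(initial_enrollment, next_enrollment):
    """Retention: re-enrollment within 12 months after initial enrollment.

    Assumption: the window is measured in whole calendar months.
    """
    if next_enrollment is None:
        return False
    months = ((next_enrollment.year - initial_enrollment.year) * 12
              + (next_enrollment.month - initial_enrollment.month))
    return 0 < months <= 12


print(course_outcome("B"), student_success(1.9),
      retained(date(2010, 9, 1), date(2011, 6, 1)))
```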
Section 6: Key Findings from Data Mining
The findings presented in this section are a result of data mining efforts aimed at identifying factors contributing to students’ success and retention at UMUC. Data mining is an exploratory technique that identifies factors emerging from big data and allows iterative predictive models to be run, using a variety of algorithms and boosting techniques to improve prediction accuracy. In the data mining phase of the analyses, a large number and variety of models were run with the aim of predicting retention and student success at UMUC. The key models and factors identified through data mining are presented. A summary of models to be discussed in the results can be found in Appendix B, along with information about model fit.
Research Goal 1. Profile students at UMUC based on community college course taking behaviors
In these analyses, two joint indicators of students’ success at UMUC were used: achievement at UMUC of a first-semester GPA of 2.0 or above and retention at UMUC. These indicators of success were used to create outcome profiles, and then a predictive model was built on the students’ prior academic work and demographic variables.
Sample. The initial data set consisted of 14,218 students with a total of 187,697 course enrollments from Montgomery College, and 11,046 students and a total of 156,373 course enrollments from Prince George’s Community College.
The top 50 courses from each community college were determined and organized by course subject area. These top 50 courses from each community college were drawn from a total of 1,404 PGCC courses and 2,737 MC courses. As a result, the final dataset included 12,637 students and 108,237 enrollments. The number of students and a listing of all of the courses included in each data set are included in Appendix C.
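A course filter of this kind might be implemented as shown below. This is a sketch under stated assumptions: the report does not say how the "top" courses were ranked, so ranking by enrollment count is assumed here, and the course codes and student identifiers are invented for illustration.

```python
from collections import Counter


def top_courses(enrollments, n=50):
    """Return the n most frequently taken courses.

    enrollments: iterable of (student_id, course_code) pairs.
    Assumption: "top" means highest enrollment count.
    """
    counts = Counter(course for _student, course in enrollments)
    return [course for course, _count in counts.most_common(n)]


def restrict_to_courses(enrollments, courses):
    """Keep only the enrollments in the selected courses."""
    keep = set(courses)
    return [(student, course) for student, course in enrollments
            if course in keep]


# Hypothetical enrollment records for one community college.
enrollments = [("s1", "EN101"), ("s2", "EN101"), ("s1", "MA110"),
               ("s3", "BI101"), ("s2", "MA110"), ("s3", "EN101")]

selected = top_courses(enrollments, n=2)
filtered = restrict_to_courses(enrollments, selected)
print(selected, len(filtered))
```

Running the same selection separately for each college and then combining the filtered enrollments would mirror the per-college approach described above.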
Methods. Data exploration was performed using IBM Modeler, SPSS, SAS JMP 10 Pro, and Excel. Data were transformed and new variables were created as needed. Transformations were performed in Modeler, JMP, and Excel.
A variety of black box algorithms (neural nets, boosted trees, and Random Forests) were used to develop profiles of students’ success. Random Forests is a relatively recent algorithm that provides strong data modeling, although its findings may not be readily interpretable. It builds a large number of small trees and averages the results. JMP’s Bootstrap Forest, used on a dataset of variables derived solely from the community college data, provided a way of differentiating the likelihood of retention among those students with low UMUC GPAs.
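The idea of building many small trees on resampled data and averaging their predictions can be illustrated with a deliberately simplified sketch. It is not JMP's Bootstrap Forest and not a full Random Forests implementation: one-split "stumps" stand in for small trees, each stump uses a single randomly chosen feature, and the data and feature names (community college GPA and course efficiency) are hypothetical.

```python
import random


def fit_stump(rows, labels, feature):
    """Fit a one-split tree on one feature, minimizing squared error."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        if not right:
            continue  # the largest value does not split the sample
        p_left = sum(left) / len(left)
        p_right = sum(right) / len(right)
        error = (sum((y - p_left) ** 2 for y in left)
                 + sum((y - p_right) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, p_left, p_right)
    if best is None:  # feature is constant in this sample
        p = sum(labels) / len(labels)
        return (feature, float("inf"), p, p)
    return (feature, best[1], best[2], best[3])


def predict_stump(stump, row):
    feature, threshold, p_left, p_right = stump
    return p_left if row[feature] <= threshold else p_right


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Fit each stump on a bootstrap sample with a random feature."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in sample],
                                [labels[i] for i in sample], feature))
    return forest


def predict_forest(forest, row):
    """Average the stump predictions to estimate P(success)."""
    return sum(predict_stump(stump, row) for stump in forest) / len(forest)


# Hypothetical rows: (community college GPA, course efficiency); 1 = success.
rows = [(3.6, 0.95), (3.2, 0.90), (3.0, 0.85), (2.8, 0.80),
        (2.0, 0.60), (1.8, 0.55), (1.5, 0.50), (1.2, 0.40)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

forest = fit_forest(rows, labels)
high = predict_forest(forest, (3.8, 0.97))
low = predict_forest(forest, (1.0, 0.35))
print(f"P(success | strong CC record) = {high:.2f}, weak record = {low:.2f}")
```

Because each prediction is an average over many trees, no single tree explains the result, which is one reason such models are described as hard to interpret.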
To evaluate effectiveness, these models were developed on a subset of the data and then applied to a different subset (a holdout dataset) that had not been used in the model building. The misclassification rate (the proportion of wrong predictions) was used to evaluate the effectiveness of the models. A number of other measurements of effectiveness were assessed, including lift, sensitivity, specificity, false positive rate and false negative rate. However, the models which performed well on the original dataset did not yield equally good results on the holdout dataset, indicating that the models were overfitting the data (i.e., they would not generalize well to other data).
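The holdout procedure can be sketched as follows. The split fraction, the simple threshold "model", and the records are all assumptions made for illustration; the report does not state the proportions used or how the subsets were drawn.

```python
import random


def split_holdout(records, holdout_fraction=0.3, seed=0):
    """Randomly split records into a training set and a holdout set."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def misclassification_rate(model, records):
    """Proportion of records for which the model's prediction is wrong."""
    wrong = sum(1 for features, label in records if model(features) != label)
    return wrong / len(records)


def threshold_model(cc_gpa):
    """Hypothetical stand-in model: predict success when CC GPA >= 2.0."""
    return cc_gpa >= 2.0


# Hypothetical (community college GPA, successful at UMUC) records.
records = [(3.5, True), (3.0, True), (2.5, False), (2.2, True), (1.8, False),
           (1.5, False), (3.8, True), (1.9, True), (2.7, True), (1.0, False)]

train, holdout = split_holdout(records)
train_rate = misclassification_rate(threshold_model, train)
holdout_rate = misclassification_rate(threshold_model, holdout)
print(f"training error = {train_rate:.2f}, holdout error = {holdout_rate:.2f}")
```

A holdout error well above the training error is the pattern that signals overfitting; a fixed rule like the one here cannot overfit, so it only demonstrates the mechanics of the comparison.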
Four indices of model fit were used to compare and evaluate model quality. Performance indicators were calculated based on model fit for the validation data subset.
• Overall accuracy is the percentage of students correctly identified as “successful” or “not successful.”
• Accuracy improvement (lift) compares the accuracy of the model to the accuracy of predicting the majority case (“successful”) for everyone. Negative lift means the accuracy is worse than simply predicting the majority case for everyone.
• False positive rate is the percentage of “not successful” students identified as “successful.”
• False negative rate is the percentage of “successful” students identified as “not successful.”
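The four indices can be computed from actual and predicted outcomes as in the sketch below. Lift is computed here as a difference in accuracy relative to predicting "successful" for everyone, which is consistent with the description of negative lift but is an assumption about the exact formula; the example outcomes are invented.

```python
def fit_indices(actual, predicted):
    """Compute accuracy, lift, false positive rate, false negative rate.

    actual, predicted: equal-length lists of booleans (True = successful).
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    n = len(pairs)

    accuracy = (tp + tn) / n
    # Baseline: predict the majority case ("successful") for everyone.
    baseline = (tp + fn) / n
    return {
        "accuracy": accuracy,
        "lift": accuracy - baseline,  # assumed to be a difference
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else None,
    }


# Hypothetical validation outcomes: six successful, four not successful.
actual = [True, True, True, True, True, True, False, False, False, False]
predicted = [True, True, True, True, True, False, True, False, False, False]

indices = fit_indices(actual, predicted)
for name, value in indices.items():
    print(f"{name}: {value:.3f}")
```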
Results - Student Retention
Change in GPA. The first set of models used students’ retention as an outcome. The strongest predictor of student retention was change in GPA, computed by subtracting students’ community college GPA from their GPA in their first semester at UMUC. Values ranged from -4.0 to +4.0 and were binned in intervals of 0.25. Among students who were retained within a year, only 40% experienced a drop in their GPA in their first semester at UMUC. By contrast, among students who were not retained within a year, 70% experienced a decrease in their GPA.
The main finding is that, regardless of whether their UMUC GPA was above or below 2.0, students whose first-semester GPA at UMUC was lower than what it was at community college were less likely to demonstrate persistence at the university.
Model 1 summary information is presented in Appendix B. The distribution of delta GPA is presented in Figure 1 on the following page.
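The binning of change in GPA and the comparison between retained and non-retained students can be sketched as follows. The bin boundaries (each bin labeled by its lower edge) are an assumption, since the report gives only the bin width, and the student values are invented and do not reproduce the reported percentages.

```python
import math
from collections import Counter

BIN_WIDTH = 0.25


def bin_delta(delta, width=BIN_WIDTH):
    """Return the lower edge of the bin that contains delta."""
    return math.floor(delta / width) * width


def share_with_drop(students):
    """Share of students with a GPA drop, by retention status.

    students: iterable of (delta_gpa, retained) pairs.
    """
    totals, drops = Counter(), Counter()
    for delta, retained in students:
        totals[retained] += 1
        if delta < 0:
            drops[retained] += 1
    return {status: drops[status] / totals[status] for status in totals}


# Hypothetical (change in GPA, retained within a year) pairs.
students = [(0.5, True), (-0.25, True), (0.1, True), (0.0, True),
            (-1.0, False), (-0.3, False), (0.2, False)]

bin_counts = Counter((bin_delta(delta), retained)
                     for delta, retained in students)
shares = share_with_drop(students)
print(sorted(bin_counts.items()))
print(shares)
```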
Figure 1. Change in GPA for Students Retained or Not Retained at UMUC
[Chart: "Change in GPA and Retention." X-axis: change in GPA from CC to UMUC (binned). Y-axis: % of students retained or not, 0% to 100%.]
Results - Student Success
Demographic Factors. A variety of models, presented in Appendix B, were used to determine predictors of success. First, a model was developed to predict students’ success using demographic factors. Separate models predicting success at UMUC based on student demographics were then run for MC and PGCC students. Model information is summarized in Models 2, 3, and 4 in Appendix B.
Community College GPA. Community college GPA was binned as successful if greater than or equal to 2.0, or unsuccessful if less than 2.0. CC GPA was found to be a significant predictor of students’ success at UMUC. Further, students’ success at UMUC was predicted by the percentage of A, B, and C grades that students received at community college. See Appendix B, Models 5, 6, and 7 for summary information. Distributions of community college grades for students classified as successful or not successful at UMUC are displayed in Table 1.
The main finding is that students who earned a UMUC first term GPA of 2.0 or above were more likely to have earned As at community college than students earning a UMUC first term GPA below 2.0.
Conversely, students who earned a UMUC first term GPA below 2.0 were more likely to have earned Fs or Ws at community college than students earning a UMUC first term GPA of 2.0 or above. See Appendix B, Model 10 for summary information. The importance of students’ community college performance in predicting UMUC success was upheld through both data mining and predictive (Section 7) approaches.
Table 1. Community College Grade Distributions for Students Successful or Not at UMUC (N = 15,890)
CC grades (mean %)                  A grades  B grades  C grades  D grades  F grades  W grades
UMUC GPA ≥ 2.0 (10,871 students)    30%       27%       17%       6%        10%       11%
UMUC GPA < 2.0 (5,019 students)     16%       20%       17%       7%        22%       19%
Note: Grade distributions were computed based on the total number of course enrollments.
Similarly, distributions of community college grades for students classified as retained or not at UMUC are displayed in Table 2 on the following page.
No substantial differences in community college grade distributions were found between students who were retained at UMUC within a year and those who were not.
Table 2. Community College Grade Distributions for Students Retained or Not at UMUC (N = 15,890)
CC grades (mean %)    A grades  B grades  C grades  D grades  F grades  W grades
Retention YES         26%       26%       17%       6%        12%       12%
Retention NO          22%       23%       16%       6%        17%       16%
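Grade distributions of the kind shown in Tables 1 and 2 might be computed as in the sketch below. Following the table note, percentages are taken over pooled course enrollments within each group; the "mean %" column header could instead imply averaging per-student percentages, so this pooling is an assumption. The group labels and enrollment records are invented for illustration.

```python
from collections import Counter

GRADES = ["A", "B", "C", "D", "F", "W"]


def grade_distribution(grades):
    """Percentage of enrollments receiving each grade in GRADES."""
    counts = Counter(grade for grade in grades if grade in GRADES)
    total = sum(counts.values())
    if total == 0:
        return {grade: 0.0 for grade in GRADES}
    return {grade: 100 * counts[grade] / total for grade in GRADES}


def distributions_by_group(enrollments):
    """Compute a grade distribution for each outcome group.

    enrollments: iterable of (group_label, grade) pairs, one per
    community college course enrollment.
    """
    by_group = {}
    for group, grade in enrollments:
        by_group.setdefault(group, []).append(grade)
    return {group: grade_distribution(grades)
            for group, grades in by_group.items()}


# Hypothetical community college enrollments tagged by UMUC outcome.
enrollments = [("GPA >= 2.0", "A"), ("GPA >= 2.0", "A"),
               ("GPA >= 2.0", "B"), ("GPA >= 2.0", "W"),
               ("GPA < 2.0", "F"), ("GPA < 2.0", "C")]

dist = distributions_by_group(enrollments)
for group, percentages in dist.items():
    print(group, {g: f"{pct:.0f}%" for g, pct in percentages.items()})
```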
In addition to independently considering these two outcomes of student success – UMUC GPA and retention at UMUC – researchers also examined the two outcomes jointly. Thus, profiles of student success at UMUC were determined that classified students based on successful GPA and retention. All combinations of the two attributes were examined. Four quadrants were formed with students evidencing a high or low GPA, and being retained or not. These four Success Quadrants were named Stars, Strivers, Slackers, and Splitters. Each quadrant is described in Figure 2 below.
Figure 2. Success Quadrants