Big Data in Education: Analyzing Student Clusters to Influence Success and Retention

December 8, 2014 Srivatsan Ramanujam

Data science is helping educational institutions predict student success.

Back in 2009, online learning was showing signs of growth, and a 2014 article shows booming growth and investment interest ahead. In this environment, directors of educational institutions with physical campuses face both commoditization and competition. They need ways to offer more value to students and help them achieve more success.

With these pressures, one educational institution set out to understand student success better and build new relationships with students based on data science. The ultimate goals were to improve retention and graduation rates and to develop a more proactive relationship with students, helping them be more successful during their studies and after graduation. By pursuing programs with an impact on these metrics, the educational institution would be in a better competitive position and improve the value of its education. For entities and companies across many industries, this is the journey to becoming a truly data-driven business, where information is used to predict and affect business outcomes before they happen.

With the overarching goals in place, the stakeholders saw how a specific data science project could be initiated to capture a 360-degree perspective of student behaviors. By unifying data from multiple systems and applying data science, the educational institution would have a more comprehensive understanding of behavior, and the predictive models would be more accurate. The IT organization was already managing a series of applications, data stores, and data warehouses across various divisions, and data files were moved between these groups on a manual, as-needed basis. To meet the future goals, they needed one location where vast amounts and multiple sets of unstructured and structured data could be analyzed and where predictive models could be built and grow over time. They wanted to lay a foundation for the future and set the bar high: to become a model educational institution for applying data science. These goals led to a consulting project with Pivotal Data Labs and a deployment of Pivotal Big Data Suite, which at that time included Apache Hadoop® and Pivotal Greenplum Database.

Approach and Solution

The system was intended to lay the foundation for a longer-term vision of a data lake, a place where additional sets of structured and unstructured data could be added, enhancing analytical insight and predictive capability. To start, this project sourced data from four systems:

  1. Online Applications for Education: First, unstructured web log data was taken from online applications and included with class assignments and work submissions.
  2. Forums: Systems also existed with discussion boards—this unstructured data set included student questions, answers, and views.
  3. Help Desk: Third, students opened IT tickets that included general unstructured conversation about the problems as well as structured data like specific timestamps, topics, and categories.
  4. Student Demographic and Operational Information: Lastly, one of the systems brought broad, structured student profile information—demographic, courses, age, background, test scores, GPA, previous education, applications, admissions, and enrollments.

Together, several hundred distinct features, or data elements, were pulled together for analysis across several years of student data, totaling records for hundreds of thousands of students.
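To make the idea of pulling features together from several systems concrete, here is a minimal Python sketch. It is not the project's actual pipeline: the field names (`student_id`, `gpa`), the three count features, and the toy records are all hypothetical, chosen only to show how per-source records can be rolled up into one feature row per student.

```python
from collections import defaultdict

def build_feature_table(web_logs, forum_posts, help_tickets, profiles):
    """Roll up records from four (hypothetical) sources into one feature dict per student."""
    features = defaultdict(lambda: {"logins": 0, "forum_posts": 0, "tickets": 0})
    for row in web_logs:          # one row per web log event
        features[row["student_id"]]["logins"] += 1
    for row in forum_posts:       # one row per discussion board post
        features[row["student_id"]]["forum_posts"] += 1
    for row in help_tickets:      # one row per help desk ticket
        features[row["student_id"]]["tickets"] += 1
    for row in profiles:          # one row per student profile
        features[row["student_id"]]["gpa"] = row["gpa"]
    return dict(features)

# Toy, made-up records purely for illustration.
web_logs = [{"student_id": "s1"}, {"student_id": "s1"}, {"student_id": "s2"}]
forum_posts = [{"student_id": "s2"}]
help_tickets = [{"student_id": "s1"}]
profiles = [{"student_id": "s1", "gpa": 3.4}, {"student_id": "s2", "gpa": 2.9}]

table = build_feature_table(web_logs, forum_posts, help_tickets, profiles)
for student_id, row in sorted(table.items()):
    print(student_id, row)
```

In a real deployment this kind of aggregation would more likely be expressed as SQL joins inside the database; the sketch only shows the shape of the resulting per-student feature table.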

Importantly, the project began with some of its most significant challenges because the source data contained personally identifiable information (PII). This meant compliance with laws and regulations such as HIPAA and FERPA. For each data source, PII was either removed or masked through a non-reversible masking operation, and all operations were logged. The masked data was then ingested into the data lake, allowing the data scientists to work with completely anonymized data.
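The article does not say which masking operation was used. As one possible illustration, the sketch below removes some fields, replaces others with a keyed hash (HMAC-SHA256), and logs each operation. The choice of HMAC, the field names, and the example key are assumptions made for this example, not details from the project.

```python
import hashlib
import hmac
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pii_masking")

def mask_value(value: str, secret_key: bytes) -> str:
    """Replace a value with a keyed hash. The digest cannot be inverted,
    but the key must stay secret so that values cannot be guessed and re-hashed."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def mask_record(record: dict, pii_fields: set, drop_fields: set, secret_key: bytes) -> dict:
    """Return a copy of the record with PII removed or masked, logging each operation."""
    masked = {}
    for field, value in record.items():
        if field in drop_fields:
            log.info("removed field %s", field)   # log the operation, never the value
            continue
        if field in pii_fields:
            masked[field] = mask_value(str(value), secret_key)
            log.info("masked field %s", field)
        else:
            masked[field] = value
    return masked

# Hypothetical record and key, for illustration only.
key = b"example-key-not-for-production"
record = {"student_id": "12345", "name": "Jane Doe", "gpa": 3.6}
masked = mask_record(record, pii_fields={"student_id"}, drop_fields={"name"}, secret_key=key)
print(masked)
```

Because the same input always maps to the same digest under a given key, masked identifiers can still be used to join records across sources, which is one reason a keyed hash is a common choice for this kind of step.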

All data landed on a Pivotal Data Computing Appliance (DCA) running Pivotal Greenplum and Pivotal HD. Modeling was done with MADlib, the open source, parallel, in-database library of machine learning algorithms, and also with PL/Python and PL/R. The team used Tableau and Python libraries like IPython, matplotlib, and pandas for data visualization. As the project unfolded, the educational institution’s data science team was trained on all of these tools, including HAWQ, Pivotal’s SQL interface to Hadoop®.

The Models and Results

From a pure data science algorithms and models perspective, the Pivotal Data Science team developed a number of different models. First was a student segmentation model. MADlib’s k-means module was used to uncover student segments exhibiting similar properties, as described by the several hundred engineered features. Among the many clusters that were uncovered, three exhibited very distinct properties of the student population. The team presented a case for why they mattered as groups, basically answering the question, “What data forms these clusters?” Among the features that have high leverage in determining the clusters, some are those we’re born into (e.g., demographics), while many are those that can be nurtured (e.g., discussion board activity, participation, course work plan, internships). Even demographic features can be influenced by encouraging the combination of students from different demographics into student groups for class assignments or projects.
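The project ran k-means inside the database with MADlib. To show what the algorithm itself does, here is a small pure-Python sketch of standard k-means (Lloyd's algorithm) on two hypothetical features. The feature meanings, the toy values, and the choice of k = 2 are assumptions for illustration and do not reflect the project's data or MADlib's implementation.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # start from k randomly chosen points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])   # keep the centroid of an empty cluster
        if new_centroids == centroids:               # stop once nothing moves
            break
        centroids = new_centroids
    return centroids, assignments

# Hypothetical features: (forum posts per week, share of assignments submitted).
students = [(1.0, 0.2), (1.5, 0.3), (0.5, 0.1), (9.0, 0.9), (8.5, 0.8), (10.0, 0.95)]
centroids, segments = kmeans(students, k=2)
print(segments)
```

With several hundred real features on different scales, the features would normally be standardized first so that no single one dominates the distance calculation.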

Once these clusters were understood, there were two types of predictions to follow—one was for predicting student retention and the other for predicting success in their grades with timely graduation.

  • MADlib’s linear regression and elastic net regression models were used to build success prediction models, uncovering what good student profiles looked like based on the data. Interesting correlations were found between the features and success, many of which can be nurtured or influenced in at-risk students to enable them to succeed.
  • As well, MADlib’s logistic regression models were used to predict student retention, uncovering the data characteristics for students who leave or complete their degree. Here, we could look at the data to glean early indicators for students who eventually dropped out. This meant an early warning system could identify students when they matched this behavioral profile.
  • Overall, other interesting correlations were also uncovered linking student account activity, discussion board activity, and internships to success and retention.
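The retention model and early warning idea described above can be sketched in a few lines of Python. This is not MADlib's in-database logistic regression: it is a plain gradient descent fit on made-up data, and the two features, the labels, and the 0.5 warning threshold are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression with batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(sum(w * x for w, x in zip(weights, xi)) + bias) - yi
            for j, x in enumerate(xi):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

def dropout_risk(x, weights, bias):
    """Predicted probability that a student with features x drops out."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical features scaled to 0..1: (account activity, discussion board activity).
# Label 1 = dropped out, 0 = retained. All values are invented.
X = [(0.1, 0.0), (0.2, 0.1), (0.15, 0.05), (0.8, 0.7), (0.9, 0.9), (0.7, 0.6)]
y = [1, 1, 1, 0, 0, 0]
weights, bias = train_logistic(X, y)

WARNING_THRESHOLD = 0.5   # assumed cut-off for raising an early warning
for student in [(0.1, 0.05), (0.85, 0.8)]:
    risk = dropout_risk(student, weights, bias)
    print(student, round(risk, 3), "flag" if risk >= WARNING_THRESHOLD else "ok")
```

The sign and size of each fitted weight indicate how a feature relates to dropout risk in the toy data, which mirrors how the team looked for early indicators that could be acted on.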

With all of this insight on behavior and predictors for retention and success, the joint team realized that these groups could influence each other. This supported the development of programs that encourage “at risk” students to contact, and be influenced by, students in “low risk” groups. For example, students with internships and graduate degree plans could help other students whose plans were unclear. Altogether, the project built a foundation for the educational institution to pursue the next level of data science: studying correlations in the data to potentially determine cause and effect. Educators are ultimately in the best position to understand and influence student retention and success. Big Data and Data Science provide them with deep insights that are hard to glean otherwise, especially at this scale.

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
