Predicting School Closure

In recent years, countless news stories and other anecdotes have focused on the negative impact of school closures on low-income and minority students, parents, and communities. While the aim of school closure may be to improve educational outcomes for students at underperforming schools, the negative externalities often outweigh the proposed benefits, as nicely illustrated in the Schott Foundation's infographic below (original source here):


Clearly, it would be beneficial for policymakers to identify at-risk schools before they are forced to close, so that intervention strategies can be put in place earlier. For my third Metis project, I sought to answer the question:

Can we identify schools at risk for closure by performance and other characteristics?

Data

To answer my research question, I analyzed regular (i.e. non-vocational, non-special-education) public primary and secondary institutions that were operational during the 2009-10 school year (approximately 90,000 schools). I analyzed 2009-10 school characteristics and performance data reported in the National Center for Education Statistics' Common Core of Data and EDFacts. I then looked at which of these schools had shuttered their doors within the next five years (i.e. 2010-11 through 2014-15), again using Common Core of Data extracts from this period.
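In rough terms, the labeling step might look like the sketch below. The file names, the SCHOOL_TYPE and STATUS columns, and the use of the NCES school ID (NCESSCH) as the join key are illustrative assumptions; the actual CCD extracts use survey-specific layouts that change from year to year.

```python
import pandas as pd

# Hypothetical file and column names -- real CCD extracts use
# survey-specific layouts that differ by year.
ccd_2009 = pd.read_csv("ccd_2009_10.csv", dtype={"NCESSCH": str})
ccd_2014 = pd.read_csv("ccd_2014_15.csv", dtype={"NCESSCH": str})

# Keep regular, operational public schools from the 2009-10 universe.
baseline = ccd_2009[(ccd_2009["SCHOOL_TYPE"] == "Regular") &
                    (ccd_2009["STATUS"] == "Operational")].copy()

# Label a school as closed if its NCES ID is no longer operational by 2014-15.
still_open = set(ccd_2014.loc[ccd_2014["STATUS"] == "Operational", "NCESSCH"])
baseline["closed"] = (~baseline["NCESSCH"].isin(still_open)).astype(int)

print(baseline["closed"].mean())  # roughly 0.06 for the sample described above
```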


Descriptive Statistics

The map below shows the proportion of closed schools by county. Interestingly, it looks like rural areas are often more affected by closure than urban environments; with the exception of the District of Columbia and the greater Detroit metropolitan area, the areas hit hardest by closure are rural (e.g. Northern Maine, Arizona, Western Nebraska). This was an unexpected observation, given that school closure anecdotes most often come from large metropolitan areas.

That said, the box-plots below support much of the anecdotal evidence surrounding school closures. Closed schools have, on average, a higher proportion of minority and low-income students (as measured by the proportion of students receiving free or reduced-price lunch), and perform worse on math and English Language Arts (ELA) statewide assessments. However, these differences are not as drastic as I expected - the 25th-75th percentile ranges of these characteristics show substantial overlap between open and closed schools.
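Comparisons like these are easy to reproduce with a couple of box-plots. The sketch below continues the labeling sketch above; the column names (pct_frl, math_prof) are assumed, not the actual field names in the data.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Assumed columns: pct_frl = share of students receiving free or reduced-price
# lunch, math_prof = math proficiency rate on the state assessment.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(x="closed", y="pct_frl", data=baseline, ax=axes[0])
sns.boxplot(x="closed", y="math_prof", data=baseline, ax=axes[1])
axes[0].set_title("Free/reduced-price lunch")
axes[1].set_title("Math proficiency")
plt.tight_layout()
plt.show()
```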

Modeling Choices and Performance Evaluation

Can we use the characteristics described above to accurately predict school closure? For my analysis, I trained and tested three common decision-tree-based ensemble methods (a rough scikit-learn setup is sketched after the list):

  1. Bootstrapped Decision Trees
  2. Random Forest
  3. Extremely Randomized Trees (i.e. Random Forests that use a random, rather than the optimal, feature split at each node)
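In scikit-learn, setting up these three classifiers might look roughly like the following. The feature names, hyperparameters, and train/test split are illustrative assumptions rather than the exact specification used in the project.

```python
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed feature names; the real feature set came from the CCD/EDFacts extracts.
features = ["pct_frl", "pct_minority", "math_prof", "ela_prof"]
X_train, X_test, y_train, y_test = train_test_split(
    baseline[features], baseline["closed"],
    test_size=0.25, stratify=baseline["closed"], random_state=42)

# Illustrative hyperparameters only.
models = {
    "Bagged trees": BaggingClassifier(DecisionTreeClassifier(),
                                      n_estimators=200, random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Extra trees": ExtraTreesClassifier(n_estimators=200, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
```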

We need a way to determine which of these classifiers is most effective at predicting school closure. Given the imbalance between our two classes (only ~6% of schools in our 90,000-school sample had closed), overall accuracy (i.e. the proportion of test set cases classified correctly) is not a particularly useful metric; we could obtain 94% accuracy simply by predicting that schools never close, regardless of how underperforming they are!

Instead, I used precision and recall on the minority class (i.e. closed schools) to evaluate model performance. In this context, precision and recall can be interpreted as follows:

  • Precision: Proportion of schools that actually closed out of all predicted closures. A higher precision indicates our model is only picking up actual "at risk" schools, which would lead to more cost-efficient policymaking.

  • Recall: Proportion of closed schools correctly identified. A higher recall indicates our model will "pick up" more of the at-risk schools, and hence lead to more effective policymaking.
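Computing both metrics on the closed-school class is straightforward with scikit-learn. The snippet below continues the sketch above and assumes closed schools are labeled 1.

```python
from sklearn.metrics import precision_score, recall_score

y_pred = models["Random forest"].predict(X_test)

# pos_label=1 scores precision and recall on the closed (minority) class.
precision = precision_score(y_test, y_pred, pos_label=1)
recall = recall_score(y_test, y_pred, pos_label=1)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```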

Upsampling and Downsampling

My goal was to maximize recall without compromising precision too heavily. The curves below show the precision-recall tradeoffs for each classifier tested. Random Forest performed best, but still not particularly well - attaining only ~23% precision at ~60% recall. Could we do better?
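For reference, curves like these can be generated from each model's predicted probabilities; a minimal sketch, continuing from the assumed setup above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

for name, model in models.items():
    # The predicted probability of the positive (closed) class drives the tradeoff.
    proba = model.predict_proba(X_test)[:, 1]
    precision, recall, _ = precision_recall_curve(y_test, proba)
    plt.plot(recall, precision, label=name)

plt.xlabel("Recall (closed schools)")
plt.ylabel("Precision (closed schools)")
plt.legend()
plt.show()
```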

Next, I tried downsampling the majority class (i.e. schools remaining open) when training the model. This is a common technique when dealing with imbalanced classes: because the model seeks to maximize accuracy, it will often underpredict the less common outcome (in this case, school closure). Downsampling counteracts this behavior by fitting the model to all closed schools and an equally large, randomly selected sample of open schools. After downsampling, I got a slight bump in random forest performance, but nothing to write home about.
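A minimal downsampling sketch, reusing the assumed training split and feature names from above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Recombine the training features and labels from the earlier sketch.
train = X_train.assign(closed=y_train)

closed = train[train["closed"] == 1]
open_schools = train[train["closed"] == 0]

# Keep all closed schools plus an equally large random sample of open schools.
balanced = pd.concat([closed,
                      open_schools.sample(n=len(closed), random_state=42)])

rf_down = RandomForestClassifier(n_estimators=200, random_state=42)
rf_down.fit(balanced[features], balanced["closed"])
```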

Finally, I evaluated performance for various downsampling and upsampling ratios with random forest (the best-performing classifier). For downsampling, I fit the model with a balanced number of open and closed schools, as well as with 2x, 5x, and 10x as many open schools as closed schools. When upsampling, I sampled (with replacement) 1x, 2x, 5x, and 10x the overall number of closed schools from both the closed and open schools in the training set.
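One way to implement the upsampling half of this sweep (under one reading of the procedure above, with the same assumed names) is sketched below:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

def upsample(train, multiple, seed=42):
    """Draw `multiple` x the number of closed schools, with replacement,
    from each class -- one reading of the procedure described above."""
    closed = train[train["closed"] == 1]
    open_schools = train[train["closed"] == 0]
    n = multiple * len(closed)
    return pd.concat([
        resample(closed, replace=True, n_samples=n, random_state=seed),
        resample(open_schools, replace=True, n_samples=n, random_state=seed),
    ])

for multiple in (1, 2, 5, 10):
    resampled = upsample(train, multiple)
    rf = RandomForestClassifier(n_estimators=200, random_state=42)
    rf.fit(resampled[features], resampled["closed"])
    # ...then evaluate precision-recall on the untouched test set, as above.
```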

As shown below, the precision-recall curves remain stubbornly immobile regardless of the downsampling or upsampling ratio applied.

Insights

The less-than-ideal precision-recall tradeoffs, even after downsampling and upsampling the training set, indicate that the weak model performance was not simply a result of the class imbalance between open and closed schools. Instead, it seems that either the model features are not strongly enough correlated with closure to predict it at a desirable precision-recall combination, or tree-based ensemble methods are not powerful enough to capture such a nuanced issue.

For future work on this project, I would like to:

  • Collect data on other features that may be predictive of school closure, such as district funding and local demographics, to enhance the models' predictive power.
  • Test out more complicated boosting, non-linear, and deep learning algorithms.
  • Reduce the dimensionality of the feature set to see if that improves model performance.

Interested in learning more? Check out the GitHub repo.