Machine Learning Project (Spring 2015)

In the spring of 2015, as part of COMP 540 (Statistical Machine Learning), taught by Dr. Devika Subramanian, I teamed up with Rohan Mukherjee, a Computer Science PhD student, to work on a machine learning problem hosted on the Kaggle competition website.

Goal: Predict whether a patient will be readmitted within 30 days of being discharged from a hospital.


In this classification problem, we try to predict the probability that a patient discharged from a hospital will be readmitted within 30 days. The dataset comprised anonymized patient data from two hospitals and was divided into two sets. The first set, used for training, included N=14,878 visits from 11,960 unique patients (a patient can have one or more visits); the second set, used for testing the model, included N=6,359 visits from 5,126 unique patients.

To streamline data curation and preparation, we built a data pipeline using Pentaho, an open source ETL tool. Weka, a machine learning and data mining tool, was used to run most of the experiments. We interviewed two domain experts to gain insights for improving feature selection. After numerous attempts at creating features, the best prediction using Logistic Regression (AUC = 0.59606) came from computing, for every test available in a given visit, the difference between the last and first result. Our recommendation for doctors and administrators, albeit imperfect and to be used with caution, is therefore to consider trends in lab results during a hospital visit when making discharge decisions. Using an ensemble method (Bagging), the AUC increased to 0.60943.
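The delta feature described above can be illustrated with a minimal Python sketch. This is not the Pentaho/Weka pipeline used in the project; the record layout (visit ID, test name, timestamp, value), the function name `delta_lab_features`, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict


def delta_lab_features(records):
    """Compute last-minus-first lab result per test within each visit.

    `records` is an iterable of (visit_id, test_name, time, value) tuples.
    This layout is hypothetical; the original dataset schema is not shown.
    Returns {visit_id: {test_name: last_value - first_value}}.
    """
    # Collect all readings for each (visit, test) pair.
    series = defaultdict(list)
    for visit_id, test_name, time, value in records:
        series[(visit_id, test_name)].append((time, value))

    features = defaultdict(dict)
    for (visit_id, test_name), readings in series.items():
        readings.sort(key=lambda r: r[0])  # chronological order
        first_value = readings[0][1]
        last_value = readings[-1][1]
        # A test taken only once in a visit yields a delta of zero.
        features[visit_id][test_name] = last_value - first_value
    return dict(features)


# Made-up readings for two visits (illustrative values only).
records = [
    ("v1", "creatinine", 1, 1.5),
    ("v1", "creatinine", 3, 1.0),
    ("v1", "creatinine", 2, 1.25),
    ("v1", "sodium", 1, 140.0),
    ("v2", "sodium", 1, 138.0),
    ("v2", "sodium", 4, 141.0),
]
features = delta_lab_features(records)
```

Each visit then maps to one numeric feature per lab test, which could be fed to a classifier such as logistic regression. How to handle tests that are missing from a visit altogether is left open here.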

Our results suggest that trends in lab results are the strongest predictor, so we expect a time series-based methodology to be the most promising way to improve the AUC further. Our final scores and rankings were as follows. Final ranking: 18 on the public leaderboard and 21 on the private leaderboard. Best public score: 0.60943 (Bagging with logistic regression). Best private score: 0.58947 (Super RC Ensemble). Best public algorithm: Bagging with logistic regression, using 10-fold cross-validation and delta lab test results.
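Since the scores reported here are AUC values, a small sketch of how AUC can be computed may help in reading them. The function below uses the pairwise definition: the fraction of (readmitted, not readmitted) pairs in which the readmitted visit receives the higher predicted score, with ties counted as half. The name `roc_auc` and the sample labels and scores are invented for illustration and are not competition data.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: P(score of a positive > score of a negative), ties = 0.5.

    `labels` holds 1 (readmitted) or 0 (not readmitted); `scores` holds the
    predicted probabilities. Quadratic in time, which is fine for a sketch.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores (illustrative only).
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.9]
auc = roc_auc(labels, scores)

# A model that gives every visit the same score cannot rank them at all.
auc_constant = roc_auc(labels, [0.5] * len(labels))
```

Under this definition an uninformative model scores 0.5 and a perfect ranking scores 1.0, which puts values around 0.6 in context: better than chance, but far from a reliable ranking.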