Executive Summary

We further explore machine learning methods, fitting Random Forest and Balanced Random Forest models both with and without days_after, our variable for the number of days from hospital discharge until the patient's death. The Random Forest models clearly outperform the Balanced Random Forests, and the Random Forest that includes days_after performs best overall, with a training AUC of 0.8831 and a test AUC of 0.9049.


Technical Summary

In exploring machine learning methods to predict death after discharge, I categorized the death outcome as a factor with 6 levels: death during the hospital stay, within 30 days, within 60 days, within 90 days, within a year, and more than a year after the discharge date. This gives us a multinomial outcome, which traditional methods such as a basic binary logistic regression model will struggle to predict; we would have to explore both a multinomial logistic model and separate binary models for each categorized death interval.

In building the predictive model I chose patient demographics as the baseline, namely participant race, age at baseline, BMI, and socio-economic status at the first questionnaire, since these variables help control for confounding due to participant makeup. I also combined variables from questionnaires 1 through 3 with the OSHPD variables, such as whether the patient lives near an oil refinery, whether they have ever used oral contraceptives, number of pregnancies, menstruation type, and whether they had colon cancer, thyroid cancer, or diabetes when they entered the study. Questionnaire variables such as these give a better sense of which patients are dying after their discharge dates, and at what intervals, based on the conditions they brought into the study. Combining these factors with the hospital data, namely length of stay, the major diagnosis at admission to a medical facility, and the categorized ICD-9 diagnosis code, lets us incorporate the severity of a patient's admission into the model.
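As a rough illustration of the categorization step, the sketch below shows one way the 6-level death-interval factor could be derived from days_after with dplyr. The column names (died_during_stay, days_after) and the toy rows are assumptions made for illustration only, not the actual cohort data or code.

```r
library(dplyr)

# Toy rows standing in for the cohort; died_during_stay and days_after are
# assumed, illustrative column names.
cohort <- tibble(
  id               = 1:6,
  died_during_stay = c(1, 0, 0, 0, 0, 0),
  days_after       = c(0, 12, 45, 80, 200, 500)
)

# Collapse the continuous days_after variable into the 6 ordered intervals
cohort <- cohort %>%
  mutate(
    death_interval = case_when(
      died_during_stay == 1 ~ "Within stay",
      days_after <= 30      ~ "Within 30 days",
      days_after <= 60      ~ "Within 60 days",
      days_after <= 90      ~ "Within 90 days",
      days_after <= 365     ~ "Within a year",
      TRUE                  ~ "Over a year"
    ),
    death_interval = factor(
      death_interval,
      levels = c("Within stay", "Within 30 days", "Within 60 days",
                 "Within 90 days", "Within a year", "Over a year")
    )
  )
```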

Many of these variables already suggest that a patient who has had colon cancer, has diabetes, sits for most of the day, or has cigarette smoking exposure is likely to have a shortened lifespan after discharge and to be at higher risk of dying. Directly including days_after, the amount of time from discharge until death, as a predictor is a different matter: it coincides with the outcome itself, which is simply this variable categorized into 6 levels rather than left continuous. Further experimentation is needed to finalize the models' performance, but fitting each model with and without days_after gives us a baseline for choosing the model with the best predictive accuracy. The model performance table below shows that all training and test AUCs are above 0.70, the accepted threshold for a good predictive model; later we go into detail on what actually contributes to these death intervals and which questionnaire and hospitalization variables are significant in predicting the 6 death interval outcomes.

| Model                  | Training AUC | Test AUC |
|------------------------|--------------|----------|
| RF (with days_after)   | 0.8831       | 0.9049   |
| BRF (with days_after)  | 0.8746       | 0.8907   |
| RF w/o days_after      | 0.7387       | 0.7468   |
| BRF w/o days_after     | 0.7073       | 0.7075   |
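For reference, the sketch below shows one way this comparison could be set up in R: a standard random forest versus a "balanced" one built by stratified down-sampling to the rarest class, with multiclass test AUC from pROC. It uses the built-in iris data purely as a stand-in, since the actual cohort variables and modeling code are not reproduced here; in the real models the outcome is death_interval and the predictors are the questionnaire, OSHPD, and hospitalization variables described above.

```r
library(randomForest)
library(pROC)

set.seed(2023)
train_idx <- sample(nrow(iris), floor(0.7 * nrow(iris)))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Standard random forest
rf_fit <- randomForest(Species ~ ., data = train)

# "Balanced" random forest: down-sample every class to the rarest class size
n_min   <- min(table(train$Species))
brf_fit <- randomForest(
  Species ~ ., data = train,
  strata   = train$Species,
  sampsize = rep(n_min, nlevels(train$Species))
)

# Multiclass AUC on the held-out set via pROC
rf_auc  <- multiclass.roc(test$Species, predict(rf_fit,  test, type = "prob"))$auc
brf_auc <- multiclass.roc(test$Species, predict(brf_fit, test, type = "prob"))$auc
```

The only difference between the two fits is the stratified sampsize argument, which is what makes the second forest "balanced" by giving each outcome class equal representation in every bootstrap sample.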