Executive Summary

We explore machine learning methods such as Logistic Regression, Random Forests, and Balanced Random Forests with a multiclass outcome in order to predict how long after hospital discharge, in days, a patient will die. We find that multiclass Random Forest models perform considerably better than Balanced Random Forests and Logistic Regression models. Multiclass Random Forests had the highest overall AUC for predictive performance in each death interval after discharge.

Technical Summary

In exploring machine learning methods to predict a patient's death after discharge, I categorized the death outcome as a factor with 6 levels: within stay, within 30 days, within 60 days, within 90 days, within a year, and over a year after the discharge date. This gives us a sequential multinomial outcome that some traditional methods, such as basic binary logistic regression, struggle to predict; with those methods we would have to fit separate binary models for each categorized death interval, for example dichotomizing death within stay vs. otherwise.

In building the predictive model I chose patient demographics as the baseline, such as participant race, age at baseline, BMI, and socio-economic status at the first questionnaire, as these variables help control for confounding based on the participant makeup. I also combined variables from questionnaires 1 through 3 with the OSHPD variables, such as whether the patient lives near an oil refinery, whether they have ever used oral contraceptives, number of pregnancies, menopause status, whether they have ever had breast cancer, or whether they had diabetes at study entry, all of which capture prevalent exposure to carcinogens or risks of disease. Questionnaire variables such as these give us a better idea of which patients are dying after their discharge dates, and at what intervals, based on the preexisting conditions they entered the study with. Combining these factors with hospital-specific data, namely length of stay, major diagnosis at admission, ICD-9 procedure category code, and disposition type, allows the model to account for the severity of a patient's admission and stay and to better gauge which patients require closer attention to avoid untimely deaths after their discharge date.
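
As a rough illustration of how this 6-level outcome could be constructed (a sketch only: the data frame, file name, and the died_during_stay flag are hypothetical names, while days_after is the days-from-discharge-to-death variable discussed below):

    import pandas as pd

    df = pd.read_csv("cohort_with_oshpd.csv")   # hypothetical combined data set

    def death_interval(row):
        # died_during_stay is an assumed in-hospital death flag; days_after is
        # the number of days from the discharge date until death. This sketch
        # assumes a death date is recorded for every patient.
        if row["died_during_stay"] == 1:
            return "Within Stay"
        d = row["days_after"]
        if d <= 30:
            return "Within 30 Days"
        if d <= 60:
            return "Within 60 Days"
        if d <= 90:
            return "Within 90 Days"
        if d <= 365:
            return "Within 1 Year"
        return "More Than 1 Year"

    levels = ["Within Stay", "Within 30 Days", "Within 60 Days",
              "Within 90 Days", "Within 1 Year", "More Than 1 Year"]
    df["death_interval"] = pd.Categorical(df.apply(death_interval, axis=1),
                                          categories=levels, ordered=True)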

Many of these variables already indicate that a patient who has been exposed to colon cancer, has diabetes, sits for most of the day, or has cigarette smoking exposure is likely to have a shortened lifespan after their discharge date and to be at higher risk of dying. However, directly including the variable days_after, the amount of time from the discharge date until death, in the prediction model would be circular, since it coincides with our outcome, which is simply this variable categorized into 6 levels rather than kept continuous. Further experimentation would be needed to finalize our models' performances, but we have a baseline comparison of directly including and excluding days_after to obtain the best model for predictive accuracy. Our model performance table shows that all training and test AUC values are at the acceptable threshold for a good predictive model, being above 0.70, and we go into detail later on about what actually contributes to these death intervals and which of these questionnaire or hospitalization variables are significant in predicting the 6 death interval outcomes.

Performing some exploratory analysis of the age distribution within each death interval from discharge, we see that patients who died more than a year after their discharge date tended to have the highest median age, a good indication that these patients are living long and well without needing a hospital. We do see some outliers for the death outcome within 60 days, where patients are reported as the youngest and are tightly grouped, meaning relatively close in age. My prediction models could be used to reevaluate the necessary care and disposition treatment for individuals by inputting a patient's pre-existing conditions and the conditions during their hospital stay.

We also observe the most frequent major diagnoses for patients who died within their respective intervals, and gauge whether these diagnoses could have been faulty and led to the wrong treatment, or look further into how the hospital performs and responds to these common diagnoses. Each death interval shares the same most frequent major diagnoses: nervous system, respiratory, digestive, and circulatory. We further explore patient_disposition_cde, as this covariate has major importance for our predictive model and can indicate which patients are at potential risk of mortality given their recommended disposition; based on a patient's other covariates, they could be recommended other care to significantly lower their risk of dying after discharge. In figure 3 we can observe the most common disposition types for each death interval; even for mortality reported within the stay, the figure lists other disposition types recorded as most frequent, which could be explored further. The most frequent disposition types are skilled nursing at another facility (where transport time and distance could have played a role in the patient's mortality and could potentially be minimized to reduce the odds of death), skilled nursing within the admitting hospital, other care within the admitting hospital, and routine at-home care. In figure 4 we observe the most frequent causes of death within each outcome of interest. We do not include this variable in our model because we want the model to be applicable to newly admitted patients and to those during a hospital stay who have not died but whose odds of death we hope to minimize. Alzheimer's disease, atherosclerotic heart disease, and malignant neoplasm of bronchus or lung are the most common causes of death, and this information can be used to give more attention to patients presenting with or developing these conditions.
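
A minimal sketch of the kind of exploratory plot described above, the age distribution within each death interval, assuming the data frame from the earlier sketch and a hypothetical age_at_baseline column name:

    import matplotlib.pyplot as plt

    # Boxplots of baseline age by death interval (column names are assumptions).
    df.boxplot(column="age_at_baseline", by="death_interval", rot=45)
    plt.ylabel("Age at baseline (years)")
    plt.title("Age distribution by death interval after discharge")
    plt.suptitle("")   # remove pandas' automatic grouped-boxplot super-title
    plt.tight_layout()
    plt.show()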

Basic Logistic Regression

We use elementary logistic regression to test performance for each death outcome dichotomized against the others, as a yes/no response for a binary model. Testing each outcome gives us 6 logistic regression models, each of the form outcome of interest vs. other. For machine learning purposes we split the data, restricted to the selected covariates we believe contribute most for newly admitted patients and those during their stay, using 70% for training and 30% for testing. We use the 70% training set to fit the model on the selected covariates and allow it to "learn", and we hold out the 30% test set to stand in for newly arriving data. From the test set we can gauge the model's true performance, that is, how it might do on other data sets with the same covariates measured. We use the Area Under the Curve (AUC) as the diagnostic for our model, as it summarizes the trade-off between sensitivity and specificity across all thresholds. Sensitivity is the probability that our model correctly classifies a true positive case, that is, correctly predicts our outcome of interest, the death interval; specificity is the probability that it correctly classifies a non-positive case, a patient who would not die in that interval after discharge. We value AUC for overall predictive performance and sensitivity in particular, as we are trying to correctly identify patients who are at risk of dying after their discharge date.
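
A minimal sketch of this split-and-fit step for one of the binary models, assuming scikit-learn and the data frame built earlier (the short covariate list is illustrative; the report's full selection of questionnaire and OSHPD variables is assumed):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # One outcome of interest vs. other, e.g. death within the hospital stay.
    y = (df["death_interval"] == "Within Stay").astype(int)

    # A few of the covariates named in the report; the full list is assumed.
    selected_covariates = ["bmi_q1", "length_of_stay_day_cnt",
                           "major_diag_cat_cde", "patient_disposition_cde"]
    X = pd.get_dummies(df[selected_covariates], drop_first=True).astype(float)

    # 70-30 training/testing split, stratified on the outcome.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=42, stratify=y)

    fit = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    test_auc = roc_auc_score(y_test, fit.predict_proba(X_test)[:, 1])
    print(round(test_auc, 4))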

From the logistic regression models we can calculate odds ratios and identify significant covariates for each outcome of interest. Table 1 lists the significant effects with their descriptions and odds ratios. Holding other covariates constant, a patient recommended acute care at another hospital is 4.10 times as likely to die within their hospital stay. We observe a similar effect for patients who are regular users of high blood pressure medication but do not know how long they have been taking it: they are 13.39 times as likely to die within their hospital stay. This could be because such a patient already has prevalent risk of myocardial infarction, having taken high blood pressure medication for so many years that they have most likely forgotten the duration.
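
For reference, the odds ratios in these tables come from exponentiating the fitted logistic coefficients, \( \widehat{OR} = e^{\hat{\beta}} \); for example, a coefficient of \( \hat{\beta} \approx 1.41 \) corresponds to \( e^{1.41} \approx 4.10 \), the odds ratio for acute care at another hospital quoted above.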

Within Stay Logistic

Table 1: Within Stay Logistic
Variable Description OR P-value
meno_stattype8 Postmenopausal Other 9.04 0.0332
hbpmed_totyrsH Regular user but unknown duration 19.92 0.0123
aceinhb_dailyY 2.98 0.0242
patient_disposition_cde2 Acute care within admit hospital 57.52 0.0000
patient_disposition_cde5 Acute care at another hospital 6.56 0.0040
patient_disposition_cde13 Other 56.36 0.0000

From the Receiver Operating Characteristic (ROC) curve we see the entire trade-off between sensitivity and specificity, and from it we calculate the area under the curve (AUC) for overall test probabilistic performance. Our logistic regression of deaths within stay vs. other has an AUC of 0.6080, which is a moderately rated prediction model.

Further exploring our model, we find the optimal cutoff point, the threshold that maximizes the sum of sensitivity and specificity. We obtain an optimal cutoff of \(C = -3.4908\), which in tabular form gives a sensitivity of 0.3333 and a specificity of 0.9816; a short sketch of this threshold search follows the table below.

Within Stay Optimal Cutoff Points
Cutoff SENS SPEC SUM
-3.4908 0.3333 0.9816 1.3149
-3.4951 0.3333 0.9814 1.3147
-3.4980 0.3333 0.9813 1.3146
-3.5016 0.3333 0.9811 1.3144
-3.5092 0.3333 0.9810 1.3143
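
As a rough sketch of how such a cutoff table could be produced (this assumes the fitted model fit and the held-out X_test and y_test from the earlier logistic regression sketch, not the report's exact code), one can scan the ROC thresholds on the log-odds scale and rank them by sensitivity plus specificity:

    import numpy as np
    from sklearn.metrics import roc_curve

    # Scores on the log-odds (linear predictor) scale, matching the sign and
    # magnitude of the cutoffs reported in the tables.
    scores = fit.decision_function(X_test)
    fpr, tpr, thresholds = roc_curve(y_test, scores)

    sens = tpr              # sensitivity at each threshold
    spec = 1.0 - fpr        # specificity at each threshold
    best = np.argsort(-(sens + spec))[:5]   # five best cutoffs by SENS + SPEC
    for i in best:
        print(f"{thresholds[i]:.4f}  {sens[i]:.4f}  {spec[i]:.4f}  {sens[i] + spec[i]:.4f}")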

Moving sequentially to the next category of our death outcome, patients who died within 30 days of their discharge date, we find that, holding all other covariates constant, those given acute care within the admitting hospital are 11 times as likely to die. Similarly, patients admitted with a major diagnostic category of Hepatobiliary System & Pancreas Diseases & Disorders are 7.39 times as likely to die within 30 days after discharge.

Within 30 Days Logistic

Within 30 Days Logistic
Variable Description OR P-value
length_of_stay_day_cnt 1.01 0.0000
bmi_q1 0.98 0.0055
hbpmed_totyrsE 3-4 Years 0.63 0.0355
hbpmed_totyrsG Over 10 Years 1.29 0.0223
sleep_hrsH Daily ACE Inhibitor over 2 Months 0.32 0.0410
patient_disposition_cde2 Acute Care within the admit hospital 11.00 0.0000
patient_disposition_cde3 Other Care within the admit hospital 1.90 0.0030
patient_disposition_cde4 Skilled Nursing/Intermediate Care within the admitting hospital 2.79 0.0000
patient_disposition_cde5 Acute Care at another hospital 3.85 0.0000
patient_disposition_cde6 Other Care at another hospital 2.48 0.0000
patient_disposition_cde7 Skilled Nursing/Intermediate Care at another facility 2.83 0.0000
patient_disposition_cde10 Left Against Medical Advice 8.40 0.0023
patient_disposition_cde12 Home Health Service 2.62 0.0000
patient_disposition_cde13 Other 12.72 0.0000
major_diag_cat_cde2 Ear, Nose, Mouth, & Throat, Diseases & Disorders 6.92 0.0280
major_diag_cat_cde4 Respiratory System, Diseases & Disorders 3.22 0.0000
major_diag_cat_cde6 Digestive System, Diseases & Disorders 2.40 0.0020
major_diag_cat_cde7 Hepatobiliary System & Pancreas, Diseases & Disorders 7.39 0.0000
major_diag_cat_cde10 Endocrine, Nutritional, and Metabolic, Diseases & Disorders 3.25 0.0003
major_diag_cat_cde11 Kidney and Urinary Tract, Diseases & Disorders 3.20 0.0002
major_diag_cat_cde17 Myeloproliferative Diseases & Poorly Differentiated Neoplasms 2.59 0.0021
diag1108 Pneumonia 1.86 0.0330
diag1109 Biliary tract disease 2.14 0.0113
diag1149 Osteoarthritis 0.33 0.0131
diag12 Fracture of neck of femur (hip) 2.51 0.0304
diag1203 Secondary malignancies 0.12 0.0049
diag142 Respiratory intubation and mechanical ventilation 2.46 0.0001
proc1216 Blood transfusion 2.44 0.0049
proc1222 Other vascular catheterization, not heart 3.16 0.0001
proc154 Colorectal resection 2.38 0.0059
proc178 0.36 0.0421

The prediction performance for deaths occurring within 30 days of discharge has an AUC of 0.7325, which is moderate. At the optimal cutoff point we obtain a sensitivity of 0.7733 and a specificity of 0.5824.

Within 30 Optimal Cutoff Points
Cutoff SENS SPEC SUM
-2.8071 0.7733 0.5824 1.3557
-2.8071 0.7733 0.5822 1.3555
-2.8072 0.7733 0.5820 1.3553
-2.8075 0.7733 0.5819 1.3552
-2.8080 0.7733 0.5817 1.3550

In our logistic regression model for deaths within 60 days after discharge we observe that, controlling for all other variables, for every one-day increase in length of stay a patient is 1.01 times as likely to die. The test AUC of 0.6642 indicates moderate performance, with a cutoff value of \(C = -3.2717\) giving a sensitivity of 0.6556 and a specificity of 0.6013.

Within 60 Days Logistic

Within 60 Days Logistic
Variable OR P-value
length_of_stay_day_cnt 1.01 0.0041
preg_ever_q1Yes 0.73 0.0257
preg_total_q1 1.07 0.0321
meno_stattype8 0.37 0.0026
meno_stattype11 0.46 0.0391
hbpmed_totyrsB 1.65 0.0479
patient_disposition_cde2 3.52 0.0115
patient_disposition_cde3 2.04 0.0057
patient_disposition_cde4 1.72 0.0189
patient_disposition_cde5 1.79 0.0348
patient_disposition_cde6 2.40 0.0004
patient_disposition_cde7 2.45 0.0000
patient_disposition_cde8 2.17 0.0241
patient_disposition_cde12 1.56 0.0004
diag1108 2.27 0.0326
diag1226 2.63 0.0461
diag142 2.76 0.0006
proc1146 0.21 0.0040
proc1153 0.26 0.0088
proc178 0.21 0.0075

Within 60 Optimal Cutoff Points
Cutoff SENS SPEC SUM
-3.2717 0.6556 0.6013 1.2569
-3.2719 0.6556 0.6011 1.2567
-3.2724 0.6556 0.6010 1.2566
-3.2728 0.6556 0.6008 1.2564
-3.2734 0.6556 0.6007 1.2563

For our within 90 days logistic model we obtained an AUC of 0.6664, which is moderate predictive performance. We find that patients recommended a disposition of acute care within the admitting hospital are 3.01 times as likely to die within 90 days. We obtain an optimal cutoff point of \(C = -3.6753\) with a sensitivity of 0.8077 and a specificity of 0.4623.

Within 90 Days Logistic

Within 90 Days Logistic
Variable OR P-value
patient_disposition_cde2 3.01 0.0265
patient_disposition_cde3 1.80 0.0409
patient_disposition_cde4 1.96 0.0039
patient_disposition_cde5 2.04 0.0078
patient_disposition_cde7 1.85 0.0000
patient_disposition_cde12 1.46 0.0052
major_diag_cat_cde7 3.19 0.0087
major_diag_cat_cde17 3.24 0.0043
diag1108 2.74 0.0148
proc1152 0.09 0.0413
proc178 0.29 0.0384
proc184 0.19 0.0256

Within 90 Optimal Cutoff Points
Cutoff SENS SPEC SUM
-3.6753 0.8077 0.4623 1.2700
-3.6757 0.8077 0.4621 1.2698
-3.6761 0.8077 0.4620 1.2697
-3.6762 0.8077 0.4618 1.2695
-3.6764 0.8077 0.4616 1.2693

Looking at deaths within 1 year after a patient's hospitalization, we find that patients who have ever taken oral contraceptives are 1.06 times as likely to die, controlling for the other covariates. Predictive performance is moderate with an AUC of 0.6652, and the optimal cutoff point of \(C = -1.8670\) gives a sensitivity of 0.6837 and a specificity of 0.5626.

Within 1 Year Logistic

Within 1 Year Logistic
Variable OR P-value
oralcntr_ever_q1 1.06 0.0314
preg_ever_q1Yes 0.81 0.0087
meno_stattype5 0.63 0.0000
meno_stattype6 0.62 0.0000
meno_stattype8 0.59 0.0010
meno_stattype10 0.61 0.0000
meno_stattype11 0.60 0.0064
cholmed_dailyY 0.87 0.0380
patient_disposition_cde3 2.31 0.0000
patient_disposition_cde6 1.70 0.0004
patient_disposition_cde7 1.55 0.0000
patient_disposition_cde8 1.70 0.0082
patient_disposition_cde12 1.16 0.0371
major_diag_cat_cde1 1.65 0.0093
major_diag_cat_cde3 1.88 0.0386
major_diag_cat_cde4 1.96 0.0003
major_diag_cat_cde7 1.93 0.0034
major_diag_cat_cde11 1.61 0.0247
major_diag_cat_cde16 1.73 0.0125
major_diag_cat_cde17 3.13 0.0000
major_diag_cat_cde18 2.09 0.0128
diag1101 0.50 0.0218
diag1108 1.74 0.0071
diag1203 0.18 0.0000
diag1226 1.62 0.0443
diag1237 1.59 0.0155
diag142 1.47 0.0239
proc1222 1.70 0.0034
proc148 0.53 0.0288
proc184 0.42 0.0240

Within 1 Year Optimal Cutoff Points
Cutoff SENS SPEC SUM
-1.8670 0.6837 0.5626 1.2463
-1.8674 0.6837 0.5624 1.2461
-1.8680 0.6837 0.5622 1.2459
-1.8686 0.6837 0.5620 1.2457
-1.8689 0.6837 0.5619 1.2456

Our logistic model for the death interval of more than 1 year after the discharge date has the highest prediction performance of our logistic models, with an AUC of 0.7384. Interestingly, patients who have ever been pregnant have 1.18 times the odds of dying in this interval, controlling for all other covariates. The cutoff value of \(C = 0.9288\) gives the optimal sensitivity of 0.6674 and specificity of 0.6969.

More Than 1 Year Logistic

More Than 1 Year Logistic
Variable OR P-value
length_of_stay_day_cnt 0.98 0.0000
preg_ever_q1Yes 1.18 0.0114
meno_stattype5 1.38 0.0002
meno_stattype6 1.41 0.0004
meno_stattype8 1.53 0.0010
meno_stattype9 11.28 0.0010
meno_stattype10 1.60 0.0000
meno_stattype11 1.49 0.0083
bmi_q1 1.01 0.0157
cholmed_dailyY 1.20 0.0005
patient_disposition_cde2 0.14 0.0000
patient_disposition_cde3 0.39 0.0000
patient_disposition_cde4 0.57 0.0000
patient_disposition_cde5 0.36 0.0000
patient_disposition_cde6 0.40 0.0000
patient_disposition_cde7 0.42 0.0000
patient_disposition_cde8 0.56 0.0008
patient_disposition_cde12 0.58 0.0000
patient_disposition_cde13 0.13 0.0000
major_diag_cat_cde1 0.64 0.0035
major_diag_cat_cde3 0.55 0.0180
major_diag_cat_cde4 0.39 0.0000
major_diag_cat_cde6 0.65 0.0022
major_diag_cat_cde7 0.24 0.0000
major_diag_cat_cde10 0.60 0.0044
major_diag_cat_cde11 0.45 0.0000
major_diag_cat_cde16 0.53 0.0003
major_diag_cat_cde17 0.30 0.0000
major_diag_cat_cde18 0.53 0.0076
major_diag_cat_cde23 0.56 0.0202
diag1101 2.39 0.0005
diag1108 0.41 0.0000
diag1203 4.63 0.0000
diag1226 0.48 0.0002
diag1237 0.68 0.0189
diag1254 1.87 0.0220
diag142 0.33 0.0000
proc1146 1.81 0.0079
proc1153 1.98 0.0015
proc1158 2.05 0.0239
proc1222 0.44 0.0000
proc13 2.98 0.0107
proc148 1.87 0.0040
proc178 2.38 0.0001
proc184 3.08 0.0002

More Than 1 Year Optimal Cutoff Points
Cutoff SENS SPEC SUM
0.9288 0.6674 0.6969 1.3643
0.9323 0.6657 0.6985 1.3642
0.9292 0.6672 0.6969 1.3641
0.9351 0.6651 0.6990 1.3641
0.9331 0.6655 0.6985 1.3640

Random Forest

Using more flexible machine learning methods such as Random Forest, we are able to obtain greater predictive performance. Random forests fit many decision trees to classify our outcome of interest and then aggregate them to build the model. Using the same selected covariates and the same 70-30 training/testing split, we can classify not only binary outcomes but also multinomial ones. With a multiclass random forest we can model all of our outcomes of interest together and report their predictive performances separately, as in the sketch below.
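
A minimal sketch of this step, assuming scikit-learn (rather than whatever software produced the reported results) and the design matrix X and death_interval outcome built in the earlier sketches:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import label_binarize

    # Same covariates, but now the outcome is the full six-level death interval.
    y_mc = df["death_interval"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y_mc, test_size=0.30, random_state=42, stratify=y_mc)

    rf = RandomForestClassifier(n_estimators=500, random_state=42)
    rf.fit(X_train, y_train)

    # One-vs-rest AUC for each death interval, as reported in the table below.
    proba = rf.predict_proba(X_test)
    y_bin = label_binarize(y_test, classes=rf.classes_)
    for k, cls in enumerate(rf.classes_):
        print(cls, round(roc_auc_score(y_bin[:, k], proba[:, k]), 4))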

Multiclass Random Forest

Unique to random forests, we can plot variable importance, sorted in descending order by mean decrease in the Gini index; the higher a variable ranks, the more it contributes to the random forest model. Variables such as bmi_q1, length_of_stay_day_cnt, and major_diag_cat_cde are the top 3 most important, and with good reason: a higher BMI at study entry can expose a patient to many risks and prevalent diseases, and length of stay can be considered a health performance metric reflecting how serious the patient's condition and care are, with a longer stay possibly implying readmission. I believe the major diagnosis code contributes the most to the model because it points to possible procedures that can be taxing on the patient's body and may limit recovery time.
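
A small sketch of how those importances could be extracted from the fitted forest above (scikit-learn's impurity-based feature_importances_ plays the role of the mean decrease in Gini; rf and X come from the previous sketch):

    import pandas as pd

    # Gini/impurity-based importances, largest (most influential) first.
    importances = pd.Series(rf.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False).head(10))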

Observing the ROC curves for all death interval outcomes, we see that within stay has the highest AUC score at 0.8431, with within 30 days and more than 1 year relatively close behind at AUC scores of 0.7684 and 0.7745 respectively. From this we can see that the multiclass random forest is a better prediction model than traditional methods such as logistic regression.

Multiclass Random Forest

Multiclass Random Forest AUC Scores
Group AUC
More Than 1 year 0.7386
Within 1 Year 0.5970
Within 30 Days 0.7350
Within 60 Days 0.6293
Within 90 Days 0.5564
Within Stay 0.7369

Multiclass Balanced Random Forest

In an attempt to further improve our model we explore multiclass balanced random forest models. The difference between the balanced and unbalanced random forests is that the unbalanced forest does not weight its samples by the size of the smallest level of our outcome, whereas the balanced forest draws equal sample sizes for every class, set to the size of the smallest class, which is intended to improve performance on the rarer outcomes. When a much larger class is reduced to the sample size of a much smaller class, observations are randomly selected with replacement, giving equal weight to all outcomes; a sketch follows.
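
A minimal sketch of this balanced variant, assuming the imbalanced-learn package (imblearn) and the training split from the random forest sketch above:

    from imblearn.ensemble import BalancedRandomForestClassifier

    # Each tree's bootstrap sample is downsampled so the six death intervals
    # are equally represented (by default, at the size of the smallest class).
    brf = BalancedRandomForestClassifier(n_estimators=500, random_state=42)
    brf.fit(X_train, y_train)
    proba_brf = brf.predict_proba(X_test)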

Balancing our multiclass random forest changes the variable importance ranking, with major_diag_cat_cde becoming the most important variable, followed by bmi_q1 and diag1. I believe this makes sense for the major diagnosis code, as it can dictate the care a patient receives, whether they need intensive care or procedures, which in turn can alter their time within the hospital stay and, most likely, the death interval after their discharge date.

From our figure of multiclass balanced random forests we see that the death intervals follow a very different pattern compared to the unbalanced random forests. With ROC curves closer to the diagonal, the predictions are not much better than chance (50%), and we see lower performance from the balanced model than from the unbalanced one. Our outcomes of interest deaths within 90 days and within 1 year have the lowest-performing AUC, with scores of 0.5564 and 0.5970 respectively. The drop in performance for what should be a better-performing model could be due to sample size: our smallest class had only 39 observations, so downsampling to it could easily lose a lot of useful and important information, as we saw from the variable importance plot.

Multiclass Balanced Random Forest

Multiclass BRF AUC Scores
Group AUC
More Than 1 year 0.7408
Within 1 Year 0.5852
Within 30 Days 0.7314
Within 60 Days 0.6295
Within 90 Days 0.5873
Within Stay 0.7198

Further Considerations

The optimal prediction model is the multiclass random forest for predicting death of patients after discharge, as it has better overall performance than traditional methods such as logistic regression and than the multiclass balanced random forest. In continuing the study I would add more variables from the questionnaires, particularly ones that follow patients over the course of the study, such as health measures like BMI. We have bmi_q1, recorded only at the first questionnaire, but the study runs long enough that BMI could change for many patients. Having more variables covering more of each patient's health or demographics would strengthen the conclusions we can draw from the model. In further analyzing prediction performance for our outcome of interest I would like to explore ordinal multinomial logistic regression. Since the death interval proceeds in sequential order, within stay, within 30 days, within 60 days, and so on, our outcome of interest is ordinal, and an ordinal multinomial logistic regression accounts for this ordering rather than treating the classes as unordered, as a naive multiclass logistic regression does. Taking the ordering of the outcome into account, an ordinal multinomial logistic regression would be the natural model to compare directly against our multiclass random forest in terms of prediction performance; a rough sketch of how one could be fit is given below.
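
One possible starting point, assuming the statsmodels package (whose OrderedModel class fits proportional-odds ordinal logistic regressions) and the X_train/y_train objects from the multiclass random forest sketch, where y_train is the ordered categorical death interval:

    from statsmodels.miscmodels.ordinal_model import OrderedModel

    # Proportional-odds (ordinal logistic) model; the outcome must be an
    # ordered categorical running from "Within Stay" up to "More Than 1 Year".
    ord_model = OrderedModel(y_train, X_train, distr="logit")
    ord_fit = ord_model.fit(method="bfgs", disp=False)
    print(ord_fit.summary())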