We further explore machine learning methods such as Logistic Regression, Random Forests, and Balanced Random Forests with a multiclass outcome in order to predict how long after discharge from the hospital a patient will die. We find that multiclass Random Forest models perform considerably better than Balanced Random Forests and Logistic Regression models: multiclass Random Forests had the highest overall AUC for predictive performance in each death interval after discharge.
In exploring machine learning methods to predict death for a patient after discharge, I categorized the death outcome as a factor with 6 levels: within stay, within 30 days, within 60 days, within 90 days, within a year, and over a year after the discharge date. This gives us a sequential multinomial outcome that some traditional methods, such as basic binary logistic regression, would struggle to predict. We therefore also explore separate binary models for each level of the death intervals, dichotomizing, for example, death within stay vs. otherwise. In building the predictive model I chose patient demographics as the baseline, such as participant race, age at baseline, BMI, and socio-economic status at the first questionnaire, as these variables help control for confounding based on the participant makeup. I also combined variables from questionnaires 1 through 3 with OSHPD variables such as whether the patient lives near an oil refinery, whether they have ever used oral contraceptives, number of pregnancies, menopausal status type, and whether they had ever had breast cancer or diabetes at entry into the study, which would indicate prevalent exposure to carcinogens or disease risks. Questionnaire variables such as these give us a better idea of which patients are dying after their discharge dates, and in what intervals, based on the preexisting conditions they entered the study with. Combining these factors with hospital data on length of stay, major diagnosis at admission, ICD-9 procedure category code, and disposition type, we can feed the severity of a patient's admission and stay into the model to better gauge which patients would require closer attention to avoid untimely deaths after their discharge date.
Many of these variables already suggest that a patient who was pre-exposed to colon cancer, has diabetes, sits for most of the day, or has cigarette smoking exposure is likely to have a shortened lifespan after the discharge date and is at higher risk of dying. However, directly including the variable days_after, the amount of time between the discharge date and death, in the prediction model would be problematic because it coincides with our outcome, which is already a 6-level categorization of this same variable rather than a continuous one. Further experimentation would be needed to finalize our models' performance, but the direct inclusion and exclusion of days_after gives us a baseline for obtaining the best model for predictive accuracy. Our model performance table shows that our best models reach training and test AUCs above the 0.70 threshold for a good predictive model, and later we go into detail on what actually contributes to these death intervals and which questionnaire and hospitalization variables are significant in predicting the 6 death interval outcomes.
Performing some exploratory analysis of the age distribution within each death interval from discharge, we see that patients who died more than a year after their discharge date tended to have the highest median age, a good indication that these patients are living long and well without needing a hospital. But we do see some outliers for the death outcome within 60 days, with patients reported as the youngest and with tight grouping, meaning relatively close in age. My prediction models could be used to reevaluate the necessary care and disposition treatment for individuals by inputting the patient's pre-existing conditions and the conditions during their hospital stay. We observe the most frequent major diagnoses for patients who died within their respective intervals and can gauge whether these diagnoses could have been faulty and led to the wrong treatment, or look further into how the hospital performs and responds to these common diagnoses. Each death interval shares the same frequent major diagnoses: nervous system, respiratory, digestive, and circulatory.
We further explore patient_disposition_cde, as this covariate has major importance to our predictive model and can indicate which patients are at potential risk of mortality with their recommended disposition; based on a patient's other covariates, other care could be recommended to significantly lower the risk of dying after discharge. In Figure 3 we can observe the most common disposition types for each death interval; even for mortality reported within stay, the figure lists other disposition types recorded as most frequent that could be further explored. The most frequent disposition types are skilled nursing at another facility, where transport time and distance could have played a role in the patient's mortality and could potentially be minimized to reduce the odds of death, skilled nursing within the admitting hospital, other care within the admitting hospital, and routine at-home care. In Figure 4 we observe the most frequent causes of death within each outcome of interest; we do not factor this variable into our model because we want the model to be applicable to incoming patients and those during a hospital stay who have not died, for whom we hope to minimize the odds. Alzheimer's disease, atherosclerotic heart disease, and malignant neoplasm of bronchus or lung are the most common causes of death, and this information can be used to give more attention to patients presenting with or developing these conditions.
We use elementary logistic regression to test performance for each death outcome dichotomized against the others, as a yes/no response for our binary model. Testing each outcome gives us 6 logistic regression models: our outcome of interest vs. other. For machine learning purposes we split the data on selected covariates we believe contribute the most for newly admitted patients and those during their stay, using 70% for training data and 30% for testing data. The 70-30 split lets the model "learn" from the training data, so that we can then evaluate it on the reserved 30% test set as a stand-in for newly arriving data. From the test set we gather the model's true performance, indicating how it might do on other data sets with the same covariates measured. We use the Area Under the Curve (AUC) as a diagnostic for our model, as it summarizes the trade-off between sensitivity and specificity. Sensitivity is the probability that our model correctly classifies a true positive case, meaning it correctly predicts our outcome of interest, the death intervals; specificity is the probability of correctly classifying a non-positive case, i.e. that a patient would not die in our intervals after discharge. We value the AUC for overall predictive performance, and sensitivity in particular, as we are trying to correctly classify patients who are at risk of dying after their discharge date.
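The split-and-evaluate workflow described above can be sketched as follows. This is a minimal illustration with synthetic data and scikit-learn, not the study's actual covariates: the feature matrix stands in for variables like bmi_q1 and length_of_stay_day_cnt, and the binary label stands in for one "death interval vs. other" outcome.

```python
# Sketch of the 70/30 split and binary logistic model with test AUC,
# using synthetic data in place of the study covariates.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 4))                          # stand-ins for selected covariates
y = (X[:, 0] + rng.normal(size=n) > 0).astype(int)   # 1 = died within the interval

# 70% training / 30% testing, as in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```

The test AUC from the held-out 30% is the honest estimate of how the model would perform on new data with the same covariates measured.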
In observing the logistic regression models we are able to calculate the odds ratios and significant covariates for each outcome of interest. Table 1 lists the significant effects with their descriptions and odds ratios. Holding other covariates constant, a patient recommended acute care at another hospital is 6.56 times as likely to die within their hospital stay. We observe a similar effect when a patient is a regular user of high blood pressure medications but does not know how long they have been taking them: such patients are 19.92 times as likely to die within their hospital stay. This could be because the patient is already predisposed to myocardial infarction, having taken high blood pressure medication for so many years that they have most likely forgotten the duration.
Variable | Description | OR | P-value |
---|---|---|---|
meno_stattype8 | Postmenopausal Other | 9.04 | 0.0332 |
hbpmed_totyrsH | Regular user but unknown duration | 19.92 | 0.0123 |
aceinhb_dailyY | Daily ACE inhibitor use (Yes) | 2.98 | 0.0242 |
patient_disposition_cde2 | Acute care within admit hospital | 57.52 | 0.0000 |
patient_disposition_cde5 | Acute care at another hospital | 6.56 | 0.0040 |
patient_disposition_cde13 | Other | 56.36 | 0.0000 |
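The odds ratios in these tables come from exponentiating the fitted logistic coefficients: OR = exp(beta) for a one-unit (or one-level) change in the covariate, holding the others constant. A minimal sketch with synthetic data (the two columns are illustrative stand-ins, not the study's covariates; scikit-learn does not report p-values, so only the ORs are shown here):

```python
# Odds ratios as exp(coefficient) from a fitted logistic regression,
# using synthetic data; column 0 has a real effect, column 1 does not.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = (0.8 * X[:, 0] + rng.normal(size=500) > 0).astype(int)

fit = LogisticRegression(max_iter=1000).fit(X, y)
odds_ratios = np.exp(fit.coef_[0])   # OR > 1: higher odds of death per unit increase
```

An OR near 1 (like the second column here) indicates little effect on the odds, matching how ORs such as 1.01 for length of stay are read in the tables below.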
From the Receiver Operating Characteristic (ROC) curve we are given the entire distribution of sensitivity and specificity, and from this we are able to calculate the area under the curve (AUC) for overall probabilistic test performance. Our logistic regression of deaths within stay vs. other has an AUC score of 0.6080, a moderately rated prediction model.
Further exploring the model, we find the cutoff point on the linear predictor that maximizes the sum of sensitivity and specificity. We obtain an optimal cutoff of \(C = -3.4908\), which in tabular form gives a sensitivity of 0.3333 and a specificity of 0.9816.
Cutoff | SENS | SPEC | SUM |
---|---|---|---|
-3.4908 | 0.3333 | 0.9816 | 1.3149 |
-3.4951 | 0.3333 | 0.9814 | 1.3147 |
-3.4980 | 0.3333 | 0.9813 | 1.3146 |
-3.5016 | 0.3333 | 0.9811 | 1.3144 |
-3.5092 | 0.3333 | 0.9810 | 1.3143 |
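The cutoff search shown in these tables can be sketched by scanning the ROC thresholds and keeping the one that maximizes sensitivity + specificity (Youden's J). This illustration uses synthetic scores in place of the model's linear predictor (which is why the document's cutoffs are negative log-odds values):

```python
# Find the cutoff maximizing sensitivity + specificity along the ROC curve,
# using a synthetic noisy score in place of the fitted linear predictor.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=500)
scores = y_true + rng.normal(scale=1.0, size=500)   # noisy predictor

fpr, tpr, thresholds = roc_curve(y_true, scores)
j = tpr + (1 - fpr)                                 # SENS + SPEC at each cutoff
best = np.argmax(j)
best_cutoff = thresholds[best]
best_sens, best_spec = tpr[best], 1 - fpr[best]     # the SUM column is j[best]
```

The SUM column in the tables is exactly this sensitivity + specificity criterion evaluated at each candidate cutoff.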
Sequentially moving to the next category of our death outcome, those who died within 30 days of their discharge date, we find that, holding all other covariates constant, patients given acute care within their admitting hospital are 11.00 times as likely to die. We observe a similar effect when a patient admitted to the hospital is diagnosed in the major category of Hepatobiliary System & Pancreas Diseases & Disorders: they are 7.39 times as likely to die within 30 days after discharge.
Variable | Description | OR | P-value |
---|---|---|---|
length_of_stay_day_cnt | Length of stay (days) | 1.01 | 0.0000 |
bmi_q1 | BMI at questionnaire 1 | 0.98 | 0.0055 |
hbpmed_totyrsE | 3-4 Years | 0.63 | 0.0355 |
hbpmed_totyrsG | Over 10 Years | 1.29 | 0.0223 |
sleep_hrsH | | 0.32 | 0.0410 |
patient_disposition_cde2 | Acute Care within the admit hospital | 11.00 | 0.0000 |
patient_disposition_cde3 | Other Care within the admit hospital | 1.90 | 0.0030 |
patient_disposition_cde4 | Skilled Nursing/Intermediate Care within the admitting hospital | 2.79 | 0.0000 |
patient_disposition_cde5 | Acute Care at another hospital | 3.85 | 0.0000 |
patient_disposition_cde6 | Other Care at another hospital | 2.48 | 0.0000 |
patient_disposition_cde7 | Skilled Nursing/Intermediate Care at another facility | 2.83 | 0.0000 |
patient_disposition_cde10 | Left Against Medical Advice | 8.40 | 0.0023 |
patient_disposition_cde12 | Home Health Service | 2.62 | 0.0000 |
patient_disposition_cde13 | Other | 12.72 | 0.0000 |
major_diag_cat_cde2 | Ear, Nose, Mouth, & Throat, Diseases & Disorders | 6.92 | 0.0280 |
major_diag_cat_cde4 | Respiratory System, Diseases & Disorders | 3.22 | 0.0000 |
major_diag_cat_cde6 | Digestive System, Diseases & Disorders | 2.40 | 0.0020 |
major_diag_cat_cde7 | Hepatobiliary System & Pancreas, Diseases & Disorders | 7.39 | 0.0000 |
major_diag_cat_cde10 | Endocrine, Nutritional, and Metabolic, Diseases & Disorders | 3.25 | 0.0003 |
major_diag_cat_cde11 | Kidney and Urinary Tract, Diseases & Disorders | 3.20 | 0.0002 |
major_diag_cat_cde17 | Myeloproliferative Diseases & Poorly Differentiated Neoplasms | 2.59 | 0.0021 |
diag1108 | Pneumonia | 1.86 | 0.0330 |
diag1109 | Biliary tract disease | 2.14 | 0.0113 |
diag1149 | Osteoarthritis | 0.33 | 0.0131 |
diag12 | Fracture of neck of femur (hip) | 2.51 | 0.0304 |
diag1203 | Secondary malignancies | 0.12 | 0.0049 |
diag142 | Respiratory intubation and mechanical ventilation | 2.46 | 0.0001 |
proc1216 | Blood transfusion | 2.44 | 0.0049 |
proc1222 | Other vascular catheterization, not heart | 3.16 | 0.0001 |
proc154 | Colorectal resection | 2.38 | 0.0059 |
proc178 | | 0.36 | 0.0421 |
The prediction performance for deaths occurring within 30 days of discharge has an AUC score of 0.7325, which is moderate. At the optimal cutoff point we obtain a sensitivity of 0.7733 and a specificity of 0.5824.
Cutoff | SENS | SPEC | SUM |
---|---|---|---|
-2.8071 | 0.7733 | 0.5824 | 1.3557 |
-2.8071 | 0.7733 | 0.5822 | 1.3555 |
-2.8072 | 0.7733 | 0.5820 | 1.3553 |
-2.8075 | 0.7733 | 0.5819 | 1.3552 |
-2.8080 | 0.7733 | 0.5817 | 1.3550 |
In our logistic regression model for deaths within 60 days after discharge, we observe that, controlling for all other variables, for every one-day increase in length of stay patients are 1.01 times as likely to die. The test AUC score of 0.6642 is moderate, with a cutoff value of \(C = -3.2717\) giving a sensitivity of 0.6556 and a specificity of 0.6013.
Variable | OR | P-value |
---|---|---|
length_of_stay_day_cnt | 1.01 | 0.0041 |
preg_ever_q1Yes | 0.73 | 0.0257 |
preg_total_q1 | 1.07 | 0.0321 |
meno_stattype8 | 0.37 | 0.0026 |
meno_stattype11 | 0.46 | 0.0391 |
hbpmed_totyrsB | 1.65 | 0.0479 |
patient_disposition_cde2 | 3.52 | 0.0115 |
patient_disposition_cde3 | 2.04 | 0.0057 |
patient_disposition_cde4 | 1.72 | 0.0189 |
patient_disposition_cde5 | 1.79 | 0.0348 |
patient_disposition_cde6 | 2.40 | 0.0004 |
patient_disposition_cde7 | 2.45 | 0.0000 |
patient_disposition_cde8 | 2.17 | 0.0241 |
patient_disposition_cde12 | 1.56 | 0.0004 |
diag1108 | 2.27 | 0.0326 |
diag1226 | 2.63 | 0.0461 |
diag142 | 2.76 | 0.0006 |
proc1146 | 0.21 | 0.0040 |
proc1153 | 0.26 | 0.0088 |
proc178 | 0.21 | 0.0075 |
Cutoff | SENS | SPEC | SUM |
---|---|---|---|
-3.2717 | 0.6556 | 0.6013 | 1.2569 |
-3.2719 | 0.6556 | 0.6011 | 1.2567 |
-3.2724 | 0.6556 | 0.6010 | 1.2566 |
-3.2728 | 0.6556 | 0.6008 | 1.2564 |
-3.2734 | 0.6556 | 0.6007 | 1.2563 |
Within our 90-day logistic model we obtained an AUC score of 0.6664. This gives us moderate predictive performance, from which we conclude that patients recommended a disposition of acute care within the admitting hospital are 3.01 times as likely to die within 90 days. We obtain an optimal cutoff point of \(C = -3.6753\) with a sensitivity of 0.8077 and a specificity of 0.4623.
Variable | OR | P-value |
---|---|---|
patient_disposition_cde2 | 3.01 | 0.0265 |
patient_disposition_cde3 | 1.80 | 0.0409 |
patient_disposition_cde4 | 1.96 | 0.0039 |
patient_disposition_cde5 | 2.04 | 0.0078 |
patient_disposition_cde7 | 1.85 | 0.0000 |
patient_disposition_cde12 | 1.46 | 0.0052 |
major_diag_cat_cde7 | 3.19 | 0.0087 |
major_diag_cat_cde17 | 3.24 | 0.0043 |
diag1108 | 2.74 | 0.0148 |
proc1152 | 0.09 | 0.0413 |
proc178 | 0.29 | 0.0384 |
proc184 | 0.19 | 0.0256 |
Cutoff | SENS | SPEC | SUM |
---|---|---|---|
-3.6753 | 0.8077 | 0.4623 | 1.2700 |
-3.6757 | 0.8077 | 0.4621 | 1.2698 |
-3.6761 | 0.8077 | 0.4620 | 1.2697 |
-3.6762 | 0.8077 | 0.4618 | 1.2695 |
-3.6764 | 0.8077 | 0.4616 | 1.2693 |
In observing deaths within 1 year after a patient's hospitalization, we find that patients who ever took oral contraceptives are 1.06 times as likely to die, controlling for the other covariates. Predictive performance is moderate with an AUC score of 0.6652, giving a cutoff point of \(C = -1.8670\), a sensitivity of 0.6837, and a specificity of 0.5626.
Variable | OR | P-value |
---|---|---|
oralcntr_ever_q1 | 1.06 | 0.0314 |
preg_ever_q1Yes | 0.81 | 0.0087 |
meno_stattype5 | 0.63 | 0.0000 |
meno_stattype6 | 0.62 | 0.0000 |
meno_stattype8 | 0.59 | 0.0010 |
meno_stattype10 | 0.61 | 0.0000 |
meno_stattype11 | 0.60 | 0.0064 |
cholmed_dailyY | 0.87 | 0.0380 |
patient_disposition_cde3 | 2.31 | 0.0000 |
patient_disposition_cde6 | 1.70 | 0.0004 |
patient_disposition_cde7 | 1.55 | 0.0000 |
patient_disposition_cde8 | 1.70 | 0.0082 |
patient_disposition_cde12 | 1.16 | 0.0371 |
major_diag_cat_cde1 | 1.65 | 0.0093 |
major_diag_cat_cde3 | 1.88 | 0.0386 |
major_diag_cat_cde4 | 1.96 | 0.0003 |
major_diag_cat_cde7 | 1.93 | 0.0034 |
major_diag_cat_cde11 | 1.61 | 0.0247 |
major_diag_cat_cde16 | 1.73 | 0.0125 |
major_diag_cat_cde17 | 3.13 | 0.0000 |
major_diag_cat_cde18 | 2.09 | 0.0128 |
diag1101 | 0.50 | 0.0218 |
diag1108 | 1.74 | 0.0071 |
diag1203 | 0.18 | 0.0000 |
diag1226 | 1.62 | 0.0443 |
diag1237 | 1.59 | 0.0155 |
diag142 | 1.47 | 0.0239 |
proc1222 | 1.70 | 0.0034 |
proc148 | 0.53 | 0.0288 |
proc184 | 0.42 | 0.0240 |
Cutoff | SENS | SPEC | SUM |
---|---|---|---|
-1.8670 | 0.6837 | 0.5626 | 1.2463 |
-1.8674 | 0.6837 | 0.5624 | 1.2461 |
-1.8680 | 0.6837 | 0.5622 | 1.2459 |
-1.8686 | 0.6837 | 0.5620 | 1.2457 |
-1.8689 | 0.6837 | 0.5619 | 1.2456 |
Our logistic model for the death interval of more than 1 year after the discharge date has the highest prediction performance among the logistic models, with an AUC score of 0.7384. Interestingly, patients who were ever pregnant are 1.18 times as likely to die in this interval, controlling for all other covariates. The cutoff value of \(C = 0.9288\) gives the optimal sensitivity of 0.6674 and specificity of 0.6969.
Variable | OR | P-value |
---|---|---|
length_of_stay_day_cnt | 0.98 | 0.0000 |
preg_ever_q1Yes | 1.18 | 0.0114 |
meno_stattype5 | 1.38 | 0.0002 |
meno_stattype6 | 1.41 | 0.0004 |
meno_stattype8 | 1.53 | 0.0010 |
meno_stattype9 | 11.28 | 0.0010 |
meno_stattype10 | 1.60 | 0.0000 |
meno_stattype11 | 1.49 | 0.0083 |
bmi_q1 | 1.01 | 0.0157 |
cholmed_dailyY | 1.20 | 0.0005 |
patient_disposition_cde2 | 0.14 | 0.0000 |
patient_disposition_cde3 | 0.39 | 0.0000 |
patient_disposition_cde4 | 0.57 | 0.0000 |
patient_disposition_cde5 | 0.36 | 0.0000 |
patient_disposition_cde6 | 0.40 | 0.0000 |
patient_disposition_cde7 | 0.42 | 0.0000 |
patient_disposition_cde8 | 0.56 | 0.0008 |
patient_disposition_cde12 | 0.58 | 0.0000 |
patient_disposition_cde13 | 0.13 | 0.0000 |
major_diag_cat_cde1 | 0.64 | 0.0035 |
major_diag_cat_cde3 | 0.55 | 0.0180 |
major_diag_cat_cde4 | 0.39 | 0.0000 |
major_diag_cat_cde6 | 0.65 | 0.0022 |
major_diag_cat_cde7 | 0.24 | 0.0000 |
major_diag_cat_cde10 | 0.60 | 0.0044 |
major_diag_cat_cde11 | 0.45 | 0.0000 |
major_diag_cat_cde16 | 0.53 | 0.0003 |
major_diag_cat_cde17 | 0.30 | 0.0000 |
major_diag_cat_cde18 | 0.53 | 0.0076 |
major_diag_cat_cde23 | 0.56 | 0.0202 |
diag1101 | 2.39 | 0.0005 |
diag1108 | 0.41 | 0.0000 |
diag1203 | 4.63 | 0.0000 |
diag1226 | 0.48 | 0.0002 |
diag1237 | 0.68 | 0.0189 |
diag1254 | 1.87 | 0.0220 |
diag142 | 0.33 | 0.0000 |
proc1146 | 1.81 | 0.0079 |
proc1153 | 1.98 | 0.0015 |
proc1158 | 2.05 | 0.0239 |
proc1222 | 0.44 | 0.0000 |
proc13 | 2.98 | 0.0107 |
proc148 | 1.87 | 0.0040 |
proc178 | 2.38 | 0.0001 |
proc184 | 3.08 | 0.0002 |
Cutoff | SENS | SPEC | SUM |
---|---|---|---|
0.9288 | 0.6674 | 0.6969 | 1.3643 |
0.9323 | 0.6657 | 0.6985 | 1.3642 |
0.9292 | 0.6672 | 0.6969 | 1.3641 |
0.9351 | 0.6651 | 0.6990 | 1.3641 |
0.9331 | 0.6655 | 0.6985 | 1.3640 |
Using more effective machine learning methods such as Random Forests, we are able to obtain greater predictive performance. A Random Forest classifies our outcome of interest with many decision trees and merges their votes to build the model. Using the same selected covariates and the same 70-30 training and testing split, we can classify not only binary outcomes but multinomial ones. Using a multinomial Random Forest, we can classify all of our outcomes of interest together and map their predictive performances respectively.
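A minimal sketch of fitting a multiclass random forest to a 6-level outcome, with synthetic data standing in for the study covariates and death intervals:

```python
# Multiclass random forest on a synthetic 6-level ordered outcome,
# with the same 70/30 split used in the text.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 1200
X = rng.normal(size=(n, 5))
# 6 synthetic death intervals: 0 = within stay, ..., 5 = more than 1 year
y = np.clip((X[:, 0] + rng.normal(size=n) + 3).astype(int), 0, 5)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)
proba = rf.predict_proba(X_te)   # one column of class probabilities per interval
```

The per-class probability columns are what allow a one-vs-rest AUC to be computed for each death interval separately, as in the tables below.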
Unique to Random Forests, we are able to plot variable importance, sorted in descending order by mean decrease in Gini index: the higher a variable ranks, the more it contributes to the random forest model. Variables such as bmi_q1, length_of_stay_day_cnt, and major_diag_cat_cde are the top 3 most important, with good reason: an elevated BMI at entry into the study could expose a patient to many risks and prevalent diseases, while length of stay can be considered a health performance metric for how serious the patient's condition and care are, with an increase possibly indicating readmission. I believe the major diagnosis code contributes the most, as it points to a possible procedure that can be taxing on the patient's body and may limit recovery time.
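The Gini-based importance ranking can be extracted directly from the fitted forest. This sketch uses synthetic data, and the three feature names are illustrative labels borrowed from the text, not the actual study data (so the resulting ranking here is driven only by the synthetic signal):

```python
# Variable importance as mean decrease in Gini impurity, sorted descending,
# on synthetic data; the first feature carries the strongest signal.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(800, 3))
y = (X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=800) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
names = np.array(["bmi_q1", "length_of_stay_day_cnt", "major_diag_cat_cde"])
order = np.argsort(rf.feature_importances_)[::-1]   # most important first
ranked = list(zip(names[order], rf.feature_importances_[order]))
```

scikit-learn normalizes these importances to sum to 1, so each value can be read as a share of the total impurity reduction across the forest.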
In observing the ROC for all death interval outcomes, we see that within stay has the highest AUC score at 0.8431, with within 30 days and more than 1 year relatively close behind at 0.7684 and 0.7745 respectively. We can see from our multiclass Random Forest that this is a better prediction model than traditional methods such as logistic regression.
Group | AUC |
---|---|
More Than 1 year | 0.7386 |
Within 1 Year | 0.5970 |
Within 30 Days | 0.7350 |
Within 60 Days | 0.6293 |
Within 90 Days | 0.5564 |
Within Stay | 0.7369 |
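The per-interval AUCs in these Group/AUC tables correspond to scoring each class one-vs-rest: the forest's predicted probability for a given interval is evaluated against a binary "this interval vs. other" indicator. A sketch with a synthetic 3-class outcome (the approach generalizes directly to the 6 death intervals):

```python
# One-vs-rest AUC per class from a multiclass random forest,
# on synthetic 3-class data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(900, 4))
y = np.clip((X[:, 0] + rng.normal(size=900) + 1.5).astype(int), 0, 2)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
proba = rf.predict_proba(X_te)

# AUC for each class against the rest, as in the Group/AUC tables
aucs = {c: roc_auc_score((y_te == c).astype(int), proba[:, i])
        for i, c in enumerate(rf.classes_)}
```

Note that the "middle" classes of an ordered outcome (here class 1; in the study, the intermediate intervals like within 60 and 90 days) tend to be hardest to separate one-vs-rest, which is consistent with their lower AUCs in the tables.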
In an attempt to further improve our model, we explore multiclass Balanced Random Forest models. The difference is that an unbalanced random forest does not weight the classes by the sample size of the smallest level of the outcome, while a balanced random forest samples every class down to the size of the smallest class, which is intended to increase performance. When a much larger class is sampled down to the size of a much smaller class, observations are randomly selected with replacement, giving equal weight to all outcomes.
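The balancing step can be sketched as follows with synthetic data. This version balances the data once before fitting; a true balanced random forest (e.g. the sampsize/strata options in R's randomForest, or imbalanced-learn's BalancedRandomForestClassifier in Python) draws a balanced sample for each tree's bootstrap, but the sampling idea is the same:

```python
# Balance classes by sampling each down/up (with replacement) to the
# smallest class size, then fit a random forest; synthetic imbalanced data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(600, 3))
y = (rng.random(600) < 0.1).astype(int)   # imbalanced: ~10% deaths
y[:5] = 1                                 # ensure the minority class is present

n_min = np.bincount(y).min()              # smallest class size
idx = np.concatenate([rng.choice(np.where(y == c)[0], size=n_min, replace=True)
                      for c in np.unique(y)])

rf_bal = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[idx], y[idx])
```

After balancing, every class contributes exactly n_min observations, which is also why a very small minority class (such as the 39 observations mentioned below) can cost the model a lot of information from the larger classes.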
By balancing our multiclass random forest, we see a difference in variable importance, with major_diag_cat_cde being the most important variable, followed by bmi_q1 and diag1. I believe this is plausible for the major diagnosis code, as it can dictate the care a patient receives, whether they need intensive care or procedures that can alter their time within the hospital stay, and most likely the interval of death after their discharge date.
From our figure of multiclass balanced random forests, we see that the death intervals take a very different route compared to the unbalanced random forests. With the ROC closer to the diagonal line, the prediction is little better than chance at 50%, and we see worse performance from balancing compared to the unbalanced model. Our outcomes of deaths within 90 days and within 1 year have the lowest-performing AUCs, with scores of 0.5873 and 0.5852 respectively. This weaker performance could be due to the sample sizing, as our smallest sample size was 39; we could easily lose a lot of useful and important information, as we saw from our variable importance plot.
Group | AUC |
---|---|
More Than 1 year | 0.7408 |
Within 1 Year | 0.5852 |
Within 30 Days | 0.7314 |
Within 60 Days | 0.6295 |
Within 90 Days | 0.5873 |
Within Stay | 0.7198 |
The optimal prediction model for death of patients after discharge would be the multiclass Random Forest, as this model has better overall performance than traditional methods such as logistic regression and the multiclass Balanced Random Forest. In continuing the study, I would add more variables from the questionnaires, especially ones that follow patients over the course of the study, such as health outcomes like BMI. We have bmi_q1, recorded only at the first questionnaire, but the study runs long enough that BMI could change for many patients. Having more variables covering more of the patients' health or demographics would add more conclusions to our model. In further analyzing prediction performance for our outcome of interest, I would like to explore ordinal multinomial logistic regression. Since the death interval runs in sequential order from within stay, to 30 days after, to 60 days after and so on, our outcome of interest is ordinal, and an ordinal multiclass logistic regression considers this order effect rather than being naive like a standard multiclass logistic regression. Taking the ordered outcome into account, an ordinal multiclass logistic regression would be the right model to compare directly against our multiclass random forest for prediction performance.
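One simple way to respect the ordering without specialized ordinal-regression software is the cumulative-binary approach (Frank & Hall): fit K-1 binary logistic models for P(y > k) and recover per-class probabilities by differencing. This is a sketch on a synthetic 4-level ordered outcome, not the study data, and it is one of several possible ordinal formulations (a proportional-odds model would be another):

```python
# Ordinal classification via K-1 cumulative binary logistic models,
# P(y > k), with class probabilities recovered by differencing;
# synthetic 4-level ordered outcome.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 1000
X = rng.normal(size=(n, 3))
y = np.clip((X[:, 0] + rng.normal(size=n) + 2).astype(int), 0, 3)  # ordered 0..3

# One binary model per threshold: P(y > 0), P(y > 1), P(y > 2)
cum_models = [LogisticRegression(max_iter=1000).fit(X, (y > k).astype(int))
              for k in range(3)]

def class_probs(Xnew):
    gt = np.column_stack([m.predict_proba(Xnew)[:, 1] for m in cum_models])
    # Pad with P(y > -1) = 1 and P(y > 3) = 0, then difference adjacent columns
    gt = np.column_stack([np.ones(len(Xnew)), gt, np.zeros(len(Xnew))])
    return np.clip(gt[:, :-1] - gt[:, 1:], 0.0, 1.0)  # P(y = k)

probs = class_probs(X[:5])
```

Because the K-1 models are fitted separately, the cumulative probabilities are not guaranteed to be monotone, hence the clipping; a proportional-odds model enforces monotonicity by construction, which is one reason to prefer it for a final comparison against the multiclass random forest.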
that was recorded only for the first questionnaire but the study goes on for so long that bmi could change for a lot of patients. Having more variables that would cover more of the patient’s health or demographics would add more conclusions to our model. In further analyzing prediction performance for our outcome of interest I would like to explore ordinal multinomial logistic regression. Since death interval is sequential order from within stay, 30 days after, 60 days after and so on our outcome of interest is ordinal and an ordinal multiclass logistic regression consider the order effect of our outcome and not to be naive such as a multiclass logistic regression. I feel that taking into account for the ordered outcome of interest an ordinal multiclass logistic regression would be the one to compare directly with our multiclass random forest and then to compare prediction performance.