Using broad health-related survey questions to predict the presence of coronary heart disease

(1) Los Altos High School, (2) Bioinformatics and Systems Biology Program, University of California, San Diego

https://doi.org/10.59720/24-006

Coronary heart disease (CHD) is the leading cause of death in the United States and was responsible for the deaths of almost 700,000 people in 2021. CHD is influenced by a variety of factors, including genetics and behavioral patterns. It is a dangerous disease characterized by a clogging of the arteries, which can cause myocardial infarction if left unchecked. CHD can develop without showing any symptoms, making its prediction all the more important. However, current methods can predict CHD accurately only with expensive clinical equipment and tests, and past machine learning projects aimed at predicting and preventing CHD have typically depended on these inconvenient clinical procedures. This study tests the hypothesis that CHD can be predicted by applying machine learning to demographic, clinical, and behavioral data provided by survey responses. Trained on over 300,000 samples from the CDC’s 2022 Behavioral Risk Factor Surveillance System, binary classification models predicting CHD and myocardial infarction history achieved Matthews correlation coefficients (MCCs) ranging from 0.299 to 0.313 and accuracies ranging from 0.716 to 0.726 during 5-fold cross-validation. Demographic-specific models were also trained and achieved MCCs of up to 0.504. Lastly, interpretation of these models using coefficient weights recovered associations between CHD and behavioral, clinical, and demographic variables that were consistent with previous studies. This study demonstrates a proof of concept for predicting the presence of CHD solely from responses to broad health-related survey questions.
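
For readers unfamiliar with the Matthews correlation coefficient, the following is a minimal sketch of how it is computed from a binary confusion matrix and why it can be reported alongside accuracy. The function names and the toy labels are hypothetical and are not taken from the study's data or code; the toy set simply assumes a rare positive class.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical toy labels (not study data): 10 positives among 100 samples.
y_true = [1] * 10 + [0] * 90
always_negative = [0] * 100  # a trivial classifier that never predicts the positive class

counts = confusion_counts(y_true, always_negative)
print("accuracy:", accuracy(*counts))
print("MCC:", mcc(*counts))
```

In this toy case the trivial classifier scores high on accuracy while its MCC is zero, which illustrates why MCC is informative when the positive class is rare.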
