Applying machine learning to breast cancer diagnosis: A high school student’s exploration using R

(1) Fulton Science Academy, (2) Qure.ai

https://doi.org/10.59720/24-372

Early diagnosis of breast cancer is critical for improved prognosis. However, current diagnostic methods, such as mammograms, are expensive and not widely available in resource-constrained regions. This study aims to identify alternative diagnostic methods that are more accessible. We hypothesize that features obtained using Fine Needle Aspiration Biopsy (FNAB) can serve as predictive variables in machine learning (ML) algorithms for accurate breast cancer detection. Using the Wisconsin Breast Cancer Dataset (WBCD), we conducted statistical analyses and explored different ML models for classifying tumors as malignant or benign. Initial univariate analysis revealed that certain features were highly correlated with tumor malignancy. We created a second dataset by removing the correlated variables and evaluated various ML models on both datasets, measuring classification performance by sensitivity, specificity, and accuracy. Among the models tested, logistic regression and random forest classifiers performed best. While the random forest classifier with the full-variable dataset and logistic regression with the principal component analysis (PCA)-reduced dataset achieved the highest accuracy, the overall difference in performance between these two models across the datasets was minimal. These results demonstrate that models using a reduced set of variables can predict breast cancer with nearly the same accuracy as models using a broader set of variables. The random forest classifier proved highly effective in all scenarios, highlighting the potential for reducing diagnostic complexity without sacrificing accuracy. This finding is promising because it suggests that reliable predictive results can still be achieved with fewer resources, potentially improving early detection in resource-constrained regions.
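
For readers unfamiliar with the three performance measures named above, the following is a minimal sketch of how sensitivity, specificity, and accuracy can be computed from true and predicted labels. It is written in Python for illustration only; the study itself used R, and this is not the authors' code. The function name `classification_metrics`, the label coding ("M" for malignant, "B" for benign), and the toy labels are all assumptions made for this example and are not results from the study.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true, y_pred, positive="M"):
    """Compute sensitivity, specificity, and accuracy for binary labels.

    `positive` is the label treated as the positive class
    (assumed here to be "M" for malignant).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    tn = sum(1 for t, p in pairs if t != positive and p != positive)  # true negatives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    return {
        # Share of malignant tumors correctly flagged as malignant
        "sensitivity": safe_ratio(tp, tp + fn),
        # Share of benign tumors correctly flagged as benign
        "specificity": safe_ratio(tn, tn + fp),
        # Share of all tumors classified correctly
        "accuracy": safe_ratio(tp + tn, len(pairs)),
    }


# Made-up toy labels for illustration; NOT data from the study.
y_true = ["M", "M", "M", "B", "B", "B", "B", "B"]
y_pred = ["M", "M", "B", "B", "B", "B", "M", "B"]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a diagnostic setting, sensitivity is often the measure of most concern, since a false negative means a malignant tumor goes undetected; reporting all three measures together guards against a model that looks accurate only because one class dominates the data.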
