Predictive Modeling of Esophageal Cancer Risk using the Sobar-72 Dataset

Authors: Kate Pogu Kumar
DIN: IMJH-SVU-JAN-2025-20
Abstract

Esophageal cancer is a deadly disease that is often diagnosed at a late stage and carries a poor prognosis. Early identification of high-risk individuals is critical for prevention and intervention. This study uses the Sobar-72 dataset, which contains clinical and lifestyle features, to develop machine learning models that predict the risk of esophageal cancer. By applying data preprocessing, exploratory data analysis, and classification algorithms, including logistic regression and random forest, we identify the most influential factors and evaluate model performance. Results show that the random forest model achieved the highest accuracy (91.6%) and identified age, alcohol use, and tobacco use as significant predictors. This work emphasizes the potential of predictive analytics in clinical risk stratification.

Keywords
Esophageal Cancer Risk Prediction; Sobar-72 Dataset Analysis; Lifestyle and Clinical Risk Factors; Random Forest Classification; Clinical Risk Stratification Analytics
Introduction

Esophageal cancer is the sixth most common cause of cancer-related deaths worldwide, with high mortality rates due to late diagnosis and limited early symptoms. Early detection can substantially improve survival rates. However, the complexity and cost of diagnostic tests necessitate simple, non-invasive risk assessment tools. 

Machine learning has emerged as a promising approach to predicting disease risk from patient data. This study aims to develop predictive models that classify individuals as high or low risk for esophageal cancer using the Sobar-72 dataset, which includes demographic, behavioral, and physiological features.
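To make the modeling task concrete, the following is a minimal sketch of how a logistic regression and a random forest classifier might be compared on a binary risk label with scikit-learn. It is not the study's actual pipeline: the data are synthetic (generated with `make_classification`, not drawn from Sobar-72), and the feature names, sample size, split ratio, and hyperparameters are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names; the real Sobar-72 columns may differ.
FEATURES = ["age", "alcohol_use", "tobacco_use", "bmi", "diet_score"]

# Synthetic stand-in data (assumption): this is NOT the Sobar-72 dataset.
X, y = make_classification(
    n_samples=400,
    n_features=len(FEATURES),
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

# Hold out 25% of the samples, preserving the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # Logistic regression benefits from standardized inputs.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    # Tree ensembles do not need feature scaling.
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Fit each model and record its accuracy on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Impurity-based importances from the fitted forest (they sum to 1).
importances = dict(zip(FEATURES, models["random_forest"].feature_importances_))

for name, acc in scores.items():
    print(f"{name}: accuracy = {acc:.3f}")
for feature, weight in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{feature}: importance = {weight:.3f}")
```

On real clinical data, accuracy alone can hide poor detection of the high-risk class, so sensitivity and specificity would usually be reported alongside it.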

Conclusion

This study demonstrates that esophageal cancer risk can be predicted by applying machine learning models to clinical and lifestyle features. The random forest classifier outperformed logistic regression, which suggests that non-linear relationships and feature interactions are important for this task. These models could help healthcare professionals screen for high-risk individuals, supporting earlier diagnosis and better patient outcomes.
