PREDICTING AND MAPPING THE GEOGRAPHIC DISTRIBUTION OF GLAUCOMA IN THE UNITED STATES: THE ROLE OF SOCIAL DETERMINANTS USING THE ALL OF US DATASET By Ayobami Abolore Alimi May, 2025 Director of Thesis: Nic Herndon, PhD Major Department: Computer Science ABSTRACT Vision impairment and eye diseases are significant public health concerns in the United States and globally. Glaucoma, a chronic and progressive disease, is one of the leading causes of irreversible blindness worldwide. In the U.S., more than three million individuals are estimated to be affected, with projections indicating a rise as the population ages. While clinical and genetic factors influencing glaucoma onset and progression have been extensively studied, growing evidence suggests that environmental exposures, socioeconomic status, and lifestyle factors also play a crucial role. With disparities in healthcare access and outcomes based on socioeconomic factors, it is crucial to explore how these factors, alongside genetic predispositions, affect glaucoma onset and progression. Addressing these gaps could lead to more targeted interventions, improving outcomes for vulnerable populations. This study aims to bridge this gap by leveraging machine learning techniques to build predictive models for glaucoma risk. By utilizing demographic information and Social Determinant of Health (SDOH) from the All of Us dataset, this research develops a comprehensive framework for glaucoma prediction. These models allow for an improved understanding of how SDOH influences glaucoma risk, helping to inform early detection strategies. The optimized Decision Tree model, tuned with GridSearchCV, was the best-performing model for this prediction task, achieving an accuracy of 67.87%. For class 0 (Non-Glaucoma), it yielded a precision of 0.71, recall of 0.52, and an F1 score of 0.60. For class 1 (Glaucoma), the model achieved a precision of 0.66, recall of 0.81, and an F1 score of 0.73. Feature importance analysis identified age as the most significant predictor, followed by race and the affordability of seeing an eye doctor. In contrast, factors such as affordability of specialist care and copay affordability had minimal impact. The findings from this study have broader implications for enhancing glaucoma risk assessments and healthcare interventions. Additionally, the methodological approach can be applied to other complex diseases, contributing to a more equitable and informed public health approach. By emphasizing social determinants, this research takes a promising step toward reducing the burden of glaucoma and advancing the goals of precision medicine. PREDICTING AND MAPPING THE GEOGRAPHIC DISTRIBUTION OF GLAUCOMA IN THE UNITED STATES: THE ROLE OF SOCIAL DETERMINANTS USING THE ALL OF US DATASET A Thesis Presented to The Faculty of the Department of Computer Science East Carolina University In Partial Fulfillment of the Requirements for the Degree Master of Science in Computer Sience By Ayobami Abolore Alimi May, 2025 Director of Thesis: Nic Herndon, PhD Thesis Committee Members: Ray Hales Hylock, PhD David Marvin Hart, PhD ©Ayobami Abolore Alimi, 2025 DEDICATION This work is dedicated to Almighty Allah and my beautiful wife Maryam Alimi. ACKNOWLEDGEMENTS I would like to express my deepest gratitude to my parents, whose unwavering support, love, and encouragement have been the foundation of my academic journey. Their sacrifices and belief in my potential have been instrumental in shaping my path. To my wonderful wife, your patience, understanding, and constant motivation have been my greatest source of strength throughout this process. Your support has made this journey smoother, and I am forever grateful for your love and encouragement. I extend my heartfelt appreciation to my advisor, Dr. Herndon, for his invaluable guid- ance, insightful feedback, and continuous encouragement. His expertise and mentorship have played a crucial role in shaping this research and improving my analytical skills. I am also immensely grateful to my thesis committee members, Dr. Hylock and Dr. Hart, for their time, effort, and constructive criticism. Their perspectives and expertise have significantly enhanced the quality of this work. Additionally, I would like to thank my supervisor at work, Dr. Cooke-Bailey (Brody School of Medicine), for her support and understanding as I balanced my professional respon- sibilities with my academic pursuits. Her encouragement and flexibility have been greatly appreciated. Finally, I extend my gratitude to my family, colleagues, friends, and everyone who has supported me in one way or another during this journey. This achievement would not have been possible without your encouragement and support. Table of Contents LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix CHAPTER 1: INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Research Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Significance of the Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 CHAPTER 2: RELATED WORK . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.1 Overview of Glaucoma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2 Genetic and Demographic Risk Factors Influencing Glaucoma . . . . . . . . 5 2.3 Social Determinants of Health and Glaucoma Prediction . . . . . . . . . . . 5 2.3.1 Healthcare Access and Insurance Coverage . . . . . . . . . . . . . . . 5 2.3.2 Socioeconomic Status (SES) and Income . . . . . . . . . . . . . . . . 6 2.3.3 Occupational and Environmental Stress . . . . . . . . . . . . . . . . . 6 2.3.4 Urban vs. Rural Disparities . . . . . . . . . . . . . . . . . . . . . . . 6 2.4 Predictive Modeling of Glaucoma Using Machine Learning . . . . . . . . . . 6 2.4.1 Supervised Learning for Glaucoma Prediction . . . . . . . . . . . . . 7 2.4.2 Geographic Information Systems (GIS) and Spatial Analysis . . . . . 7 2.5 Current Research on Geographic Distribution of Glaucoma . . . . . . . . . . 7 2.6 The Role of the All of Us Dataset in Glaucoma Prediction . . . . . . . . . . 8 2.7 Summary and Research Gaps . . . . . . . . . . . . . . . . . . . . . . . . . . 9 CHAPTER 3: METHODOLOGY . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.1 Study Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.2 Descriptive Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.2.1 Geo-Spatial Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.3 Predictive Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.3.1 Processing the Cases Group . . . . . . . . . . . . . . . . . . . . . . . 15 3.3.2 Processing the Control Group . . . . . . . . . . . . . . . . . . . . . . 16 3.3.3 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.3.4 Final Dataset Construction . . . . . . . . . . . . . . . . . . . . . . . 19 3.4 Machine Learning Model for Glaucoma Prediction . . . . . . . . . . . . . . . 19 3.4.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.4.2 Decision Tree Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.4.3 Gradient Boosting Machine (GBM) . . . . . . . . . . . . . . . . . . . 21 3.4.4 K-Nearest Neighbors (KNN) Classifier . . . . . . . . . . . . . . . . . 22 3.4.5 Model Training and Validation . . . . . . . . . . . . . . . . . . . . . 22 3.4.6 Model Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . 22 3.5 Ethical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 CHAPTER 4: RESULTS AND DISCUSSION . . . . . . . . . . . . . . . . . . 24 4.1 Descriptive Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.1.1 Glaucoma Cases Distribution . . . . . . . . . . . . . . . . . . . . . . 24 4.1.2 Glaucoma Prevalence . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.2 Predictive Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.2.1 Logistic Regression Model . . . . . . . . . . . . . . . . . . . . . . . . 27 4.2.2 Decision Tree Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.2.3 Ensemble Gradient Boosting Classifier . . . . . . . . . . . . . . . . . 29 4.2.4 Optimized Decision Tree with GridSearchCV . . . . . . . . . . . . . . 30 4.2.5 Optimized Gradient Boosting with GridSearchCV . . . . . . . . . . . 31 4.2.6 K-Nearest Neighbors (KNN) . . . . . . . . . . . . . . . . . . . . . . . 32 4.3 Model Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 CHAPTER 5: CONCLUSION AND FUTURE WORK . . . . . . . . . . . . 35 5.1 Descriptive Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.2 Predictive Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.3 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5.4 Final Remark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 LIST OF TABLES 4.1 High prevalence categories of counties in different states . . . . . . . . . . . . 26 4.2 Model Evaluation and Comparison . . . . . . . . . . . . . . . . . . . . . . . 34 LIST OF FIGURES 3.1 All of Us data distribution by race . . . . . . . . . . . . . . . . . . . . . . . 12 3.2 All of Us data distribution by gender . . . . . . . . . . . . . . . . . . . . . . 13 3.3 All of Us data distribution by age . . . . . . . . . . . . . . . . . . . . . . . . 14 3.4 All of Us dataset glaucoma distribution by race . . . . . . . . . . . . . . . . 16 3.5 All of Us dataset Percentage of people with glaucoma in each race . . . . . . 17 3.6 Study workflow and cohort definition for evaluating predictive models for par- ticipant with glaucoma in the All of Us Research Program . . . . . . . . . . 20 4.1 Map showing number of reported Glaucoma cases in the United States . . . 25 4.2 Map showing Glaucoma Prevalence per 1,000,000 Population . . . . . . . . . 25 4.3 County Counts of Glaucoma Prevalence per 1,000,000 Population . . . . . . 26 4.4 AUC-ROC Curve for Logistic Regression . . . . . . . . . . . . . . . . . . . . 28 4.5 AUC-ROC Curve for Decision Tree Model . . . . . . . . . . . . . . . . . . . 29 4.6 AUC-ROC Curve for Gradient Boosting Classifier . . . . . . . . . . . . . . . 30 4.7 AUC-ROC Curve for Optimized Decision Tree . . . . . . . . . . . . . . . . . 31 4.8 AUC-ROC Curve for Optimized Gradient Boosting Classifier . . . . . . . . . 32 4.9 AUC-ROC Curve for k-Nearest Neighbors (KNN) . . . . . . . . . . . . . . . 33 Chapter 1 Introduction Vision impairment and eye diseases are critical public health concerns in the United States and around the world. Glaucoma, a chronic and progressive disease, is one of the main causes of irreversible blindness worldwide. More than three million people in the United States are estimated to be affected by glaucoma, with projections that this number will increase as the population ages [8]. Primary open-angle glaucoma (POAG), the most common type in the United States, often goes undiagnosed until significant vision loss occurs, underscoring the importance of early detection and prevention. POAG disproportionately affects certain racial and ethnic groups, including African Americans and Hispanics, who tend to experience earlier onset and more severe disease progression. Although clinical and genetic factors in the onset and progression of glaucoma have been extensively studied [1, 10, 23, 25], growing evidence suggests that environmental factors, such as exposure to pollutants, socioeconomic factors, and lifestyle variables, also play an influential role [2, 5, 13, 19, 22]. However, the exact nature of the interaction between these environmental determinants and the prevalence of glaucoma remains unclear. Recent advances in data collection and analytics, particularly through large datasets such as the All of Us (AoU) Research Program [24], present an unprecedented opportunity to analyze these complex relationships. The AoU dataset includes comprehensive demographic, health, genetic, and environmental data from a diverse cohort across the United States, providing a solid foundation for investigating the distribution of glaucoma and other vision diseases and understanding associated environmental factors. This thesis will leverage the AoU dataset to investigate how environmental factors cor- relate with the geographic distribution of glaucoma in the United States. By using data science methods to explore this intersection, this study aims to contribute new insights into the role of social and ecological determinants in eye health disparities. 1.1 Research Objectives The primary objective of this research is to analyze the geographic distribution of glaucoma across the United States and to assess the relationship between this disease and various social factors. This study will focus on several key questions: 1. What is the geographic distribution of glaucoma across the United States? 2. Which environmental / social factors correlate most significantly with the prevalence of glaucoma 3. Are there identifiable patterns of disparity in glaucoma based on socioeconomic or environmental factors? 4. Can glaucoma be predicted with a machine learning model using demographic infor- mation and social economic factors? By addressing these questions, the research will explore whether individuals in certain regions with specific environmental profiles are at greater risk for glaucoma and other vi- sion diseases. Additionally, this study will analyze if disparities observed align with the distribution of these environmental factors, potentially informing targeted interventions for vulnerable communities. 1.2 Significance of the Study The potential implications of this research are both far-reaching and impactful. Glaucoma has been extensively researched from a genetic and clinical standpoint [1, 10], yet the en- 2 vironmental factors influencing its development are less understood. Social determinants of health (SDOH) – such as income level, access to healthcare, and community environment – are increasingly recognized as crucial to health outcomes. For example, individuals in areas with high pollution levels or limited healthcare access may be at a greater risk for develop- ing vision problems, potentially due to oxidative stress or chronic health conditions linked to environmental exposures. Understanding these correlations can aid in identifying vulnerable populations and tai- loring preventative healthcare policies that mitigate these risks. Public health interventions informed by this research could address underlying causes, rather than symptoms, by focus- ing on environmental improvements, education, and increased screening efforts in high-risk areas. By mapping the intersections between environmental variables and vision health, this study hopes to provide insights that can enhance healthcare equity, improve early detection, and ultimately reduce the societal burden of glaucoma and vision impairment. Furthermore, the data science methodologies applied in this study could serve as a frame- work for similar analyses of other diseases, demonstrating the potential of large-scale, diverse datasets like AoU to explore complex health-environment interactions. 3 Chapter 2 Related Work 2.1 Overview of Glaucoma Glaucoma is a leading cause of irreversible blindness worldwide, characterized by progressive damage to the optic nerve, often associated with elevated intraocular pressure (IOP) [24]. Glaucoma was the cause for blindness in 3.61 million people or 8.4% of the 43.3 million blind people globally in 2020, and glaucoma was the cause for moderate to severe vision impairment (MSVI) in 4.14 million people or 1.4% of the 295 million people visually impaired in 2020 [4]. The most prevalent form, primary open-angle glaucoma (POAG), accounts for the majority of cases in the United States and is particularly concerning due to its asymptomatic onset until advanced stages [20]. The global burden of glaucoma is increasing, with estimates suggesting that by 2040, more than 111 million people will be affected, primarily among populations aged 60 and above [20]. In the U.S., significant racial and ethnic disparities exist, with African Americans being three-four times more likely to develop POAG than White Americans, and Hispanic populations also showing higher prevalence rates [15]. These disparities indicate a strong need to analyze geographic distribution trends and the influence of demographic and SDOH on glaucoma development and progression. 2.2 Genetic and Demographic Risk Factors Influencing Glaucoma While genetic predisposition plays an essential role in glaucoma susceptibility, demographic factors such as age, sex, and race/ethnicity are equally significant in determining individ- ual risk [10]. Genome-wide association studies (GWAS) have identified multiple glaucoma- associated loci, including MYOC, CAV1/CAV2, TMCO1, MYOF and others which influence disease onset and progression [1]. However, genetic risk alone does not fully explain glaucoma prevalence, necessitating an exploration of demographic predictors such as age, gender, and racial disparities. • Age: Older adults (60+) face a higher risk of glaucoma due to age-related structural changes in the eye and a decline in neuroprotective mechanisms [11]. • Sex: Some studies suggest that hormonal differences may contribute to glaucoma pro- gression, with postmenopausal women at higher risk due to declining estrogen levels [21]. • Race/Ethnicity: As noted earlier, African Americans and Hispanics have an increased risk of POAG, potentially due to both genetic and environmental factors, warranting further investigation into their predictive role [25]. 2.3 Social Determinants of Health and Glaucoma Prediction The World Health Organization (WHO) defines SDOH as the conditions in which people are born, grow, live, work, and age, which shape health outcomes [14]. Several SDOH factors, listed below, have been linked to glaucoma prevalence and severity, making them key predictors in glaucoma risk modeling. 2.3.1 Healthcare Access and Insurance Coverage Limited access to ophthalmologic care results in delayed glaucoma diagnosis and poorer outcomes. A study [17] found that uninsured individuals and those without Medicaid were more likely to present with advanced-stage glaucoma, emphasizing the role of healthcare accessibility in disease progression. 5 2.3.2 Socioeconomic Status (SES) and Income Lower-income populations are at higher risk of developing glaucoma due to limited healthcare access, poor living conditions, and increased exposure to environmental risk factors [6]. Income disparities influence an individual’s ability to afford routine eye exams, contributing to higher rates of undiagnosed or late-stage glaucoma cases. In addition, factors like housing quality, community safety, and occupational risks can exacerbate health disparities. People in lower-income areas may face greater exposure to environmental hazards, such as air and water pollution, which are linked to glaucoma risk. By examining these social factors, this study aims to investigate the influence of SDOH on glaucoma distribution across the U.S., with a particular focus on high-risk and under-served populations. 2.3.3 Occupational and Environmental Stress Chronic stress has been linked to increased IOP and optic nerve vulnerability. Work-related stress and long exposure to blue-collar jobs (e.g., exposure to industrial pollutants, strenuous labor) have been suggested as risk factors for glaucoma progression [26, 2]. 2.3.4 Urban vs. Rural Disparities Rural populations face higher rates of blindness due to glaucoma than urban populations, largely due to a lack of specialized eye care services and longer travel times to healthcare facilities [9]. Predictive models incorporating ZIP code-based geographic information can help identify high-risk rural communities needing targeted interventions. 2.4 Predictive Modeling of Glaucoma Using Machine Learning Given the increasing availability of large-scale health datasets, machine learning (ML) ap- proaches have become an essential tool for predicting glaucoma risk using demographic and SDOH data. 6 2.4.1 Supervised Learning for Glaucoma Prediction Supervised learning models such as logistic regression, random forests, and deep learning have been effectively applied to predict glaucoma diagnosis using a combination of demographic, genetic, and SDOH variables [3, 12]. Recent studies have shown that machine learning algo- rithms, including gradient boosting models, trained on electronic health records (EHRs) and socioeconomic data can significantly enhance glaucoma risk prediction accuracy [12]. These findings highlight the potential of leveraging routinely collected clinical, lifestyle, and demo- graphic data within EHRs to develop scalable and data-driven models for early glaucoma detection. 2.4.2 Geographic Information Systems (GIS) and Spatial Analysis GIS-based approaches have been employed to map glaucoma distribution and identify envi- ronmental risk factors. Integrating machine learning with GIS enables predictive modeling of high-risk geographic zones, improving targeted screening efforts [5]. A notable advantage of All of Us is its emphasis on inclusivity, incorporating data from populations historically underrepresented in medical research, including racial minorities and rural communities. This approach not only improves the generalizability of findings but also provides insights into how vision diseases affect different populations. Additionally, with its extensive geographic data, All of Us allows for a spatial analysis of glaucoma prevalence, enabling researchers to assess environmental factors such as air and water quality in relation to health outcomes. 2.5 Current Research on Geographic Distribution of Glaucoma Geographic distribution studies highlight notable patterns in glaucoma prevalence, often correlating with environmental and socioeconomic factors. For instance, rural areas may face higher rates of glaucoma-related blindness due to limited access to specialized eye care, while urban areas with high pollution levels may show an elevated prevalence of glaucoma 7 and other eye conditions [13]. One significant finding in recent literature is the disparity in vision health outcomes across racial and ethnic lines. African American and Hispanic populations show earlier onset and faster progression of POAG than their White counterparts, which can partially be attributed to environmental and social disparities [7]. This geographic variability, combined with known genetic risk factors, underscores the need for an integrative, data-driven approach to studying glaucoma, one that considers genetics, environment, and social factors within a spatial framework [16]. 2.6 The Role of the All of Us Dataset in Glaucoma Prediction The All of Us Research Program represents one of the largest, most diverse health databases in the world. By including health information, genetic data, and environmental exposure details from over a million participants, All of Us offers a unique opportunity to explore health patterns across a diverse cohort. This dataset allows for the analysis of glaucoma and other vision diseases alongside individual and community-level data, such as environmental exposures and SDOH. The dataset includes: 1. Electronic Health Records (EHRs) – Containing diagnostic codes, prescriptions, and medical histories. 2. Genetic Data – Allowing exploration of polygenic risk scores for glaucoma. 3. Demographic and SDOH Data – Essential for predictive modeling. 4. Geo-spatial Information – Useful for studying environmental exposures and healthcare access disparities. By leveraging the All of Us dataset, this study aims to develop predictive models that in- tegrate demographics, genetic predispositions, environmental factors, and SDOH to improve glaucoma risk assessment and early detection strategies. 8 2.7 Summary and Research Gaps While existing studies provide valuable insights into the genetic, environmental, and social factors associated with glaucoma, gaps remain in understanding how these factors interact at the population level. Most studies have been constrained by either a lack of diverse datasets or limited environmental data. The use of the All of Us dataset allows this study to fill these gaps, to some extent (since the ZIP codes are truncated), by providing a comprehensive view of glaucoma distribution across the U.S. in relation to environmental and social determinants. To date, limited research has focused on the cumulative effects of demographic informa- tion and social determinants on glaucoma within a geographic context. This research aims to bridge that gap by conducting a spatial analysis of glaucoma in the U.S., linking disease prevalence with geographic, environmental, and social variables. This approach can reveal new insights into the causes and risk factors of glaucoma, potentially leading to targeted interventions that improve vision health outcomes across diverse populations. 9 Chapter 3 Methodology 3.1 Study Design This study employs a retrospective observational design using data from the All of Us (AoU) Research Program, a large-scale initiative aimed at advancing precision medicine by collect- ing diverse health data from over one million participants. It integrates clinical, environ- mental, and socioeconomic variables to assess their impact on glaucoma prevalence and risk stratification. The research also visualizes the geographical distribution of glaucoma across the United States and develops machine learning models to evaluate risk. All analyses were conducted on the AoU Researcher Workbench, a secure, cloud-based platform that provides approved researchers with access to the program’s extensive datasets. The AoU dataset comprises three primary data types: surveys, physical measurements (PMs), and electronic health records (EHRs). Detailed information about the surveys is available through the Survey Explorer, a tool within the Research Hub designed to assist researchers in navigating the data. The surveys employ branching logic, and all questions are optional, allowing participants to skip any they prefer not to answer. Physical measure- ments recorded at enrollment include systolic and diastolic blood pressure, height, weight, heart rate, waist and hip circumference, wheelchair use, and current pregnancy status. For participants who consented, EHR data were linked to provide additional clinical context. All three data types – surveys, PMs, and EHRs – are mapped to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) version 5.2, a stan- dardized framework maintained by the Observational Health Data Sciences and Informatics (OHDSI) collaborative. This standardization ensures interoperability and facilitates large- scale analyses across diverse datasets. To safeguard participant privacy, the AoU program applies a series of data transforma- tions. These include: • Data suppression: Removing codes with a high risk of identification, such as military status. • Generalization: Aggregating categories for sensitive variables, including age, sex at birth, gender identity, sexual orientation, and race. • Date shifting: Applying a random offset of less than one year to dates, consistently across each participant’s record. Detailed documentation on privacy measures and the creation of the Curated Data Repos- itory is available in the AoU Registered and Controlled Tier Curated Data Repository Data Dictionary. The Researcher Workbench provides a suite of tools designed to streamline data analysis: • Cohort Builder: Enables researchers to select groups of participants based on specific criteria. • Dataset Builder: Facilitates the creation of customized datasets for analysis. • Workspaces: Offers Jupyter Notebooks for advanced data analysis, supporting both R and Python 3 programming languages. These notebooks allow researchers to work with saved datasets or query the data directly, providing flexibility for complex analyses. At the time of this thesis report, the AoU Research Program included a total of 413,457 adult participants. To identify individuals with glaucoma, cohort selection was performed using the Systematized Nomenclature of Medicine (SNOMED) code 23986001, which cor- responds to the standard concept for glaucoma. It is important to note that there are additional SNOMED codes related to glaucoma, including non-standard codes and those representing specific subtypes of glaucoma. This initial filtering step allowed us to refine the dataset to include only participants with relevant glaucoma-related clinical records. This gave 19,130 individuals with glaucoma related illness. 11 Figure 3.1 below illustrates the distribution of data across different racial groups in the AoU dataset. The majority of participants are White, accounting for 55.4% of the dataset, followed by African American participants at 19.0%. Other racial groups are represented at lower percentages. Figure 3.1: All of Us data distribution by race Gender representation is a crucial aspect of the dataset. Figure 3.2 illustrates the dis- tribution of participants across different gender identities. Female participants make up the majority at 59.8%, followed by male participants at 37.3%. The remaining 2.9% either skipped the gender question or selected an alternative response. Age distribution is another crucial aspect of the dataset. Figure 3.3 illustrates the age distribution in the AoU dataset, showing that all participants are adults, with ages ranging from 20 to 120 years. The highest concentration of participants falls between 58 and 70 years. 12 Figure 3.2: All of Us data distribution by gender 3.2 Descriptive Analysis 3.2.1 Geo-Spatial Analysis To incorporate socioeconomic data into the analysis, we extracted zip code-level information for participants in the glaucoma cohort. This dataset includes basic demographic details along with three-digit zip codes formatted as ‘000**.’ The use of three-digit zip codes is a de-identification measure implemented by the AoU Research Program to ensure participant privacy by preventing identification at a more granular geographic level. To approximate the geographic distribution of individuals in the cohort, we utilized publicly available U.S. zip code data [18], which was downloaded online and uploaded into the AoU Researcher Workbench notebook environment. This dataset contains all U.S. five- digit zip codes, along with corresponding latitude, longitude, city, state, and additional geographic attributes. Since only three-digit zip codes were available for the glaucoma cohort, we estimated approximate participant locations by calculating centroid coordinates for each three-digit zip region. The centroid was determined by averaging the latitude and longitude values of all five-digit zip codes that fall under the same three-digit zip designation. For example, if a three-digit zip code corresponds to seven different five-digit zip codes, the centroid was 13 Figure 3.3: All of Us data distribution by age computed by taking the mean latitude and longitude of all seven locations. This calculated centroid was then matched to the closest corresponding five-digit zip code, serving as the approximate geographic location of individuals in the cohort. The resulting approximate zip code locations were used to generate a geo-spatial visual- ization in Power BI to illustrate the distribution of glaucoma cases across the United States. However, AoU has strict privacy policies prohibiting the publication of aggregated data with fewer than 20 individuals per group. Therefore, any zip code regions with fewer than 20 participants were excluded from the final visualization to comply with AoU data privacy regulations. The number of glaucoma recorded per county was plotted and glaucoma prevalence per 1,000,000 population was also visualized. The plot is presented in Figure 4.1 and 4.2 in Chapter 4 14 3.3 Predictive Analysis 3.3.1 Processing the Cases Group To analyze the relationship between demographic, socioeconomic, and lifestyle factors with glaucoma prevalence, multiple survey datasets were extracted and processed from the AoU Research Program. The primary focus was on basic lifestyle factors, including: • Annual income • Health insurance • Educational attainment • Employment status • Housing Additionally, a survey on healthcare access was included, which assessed: • Healthcare affordability • Prescription affordability • Affordability of specialist • Affordability of eye doctor • Affordability of co-pays A stress survey was also incorporated to evaluate the self-reported stress levels of each participant. Alongside these variables, demographic data, including age, gender, and race were extracted for individual. All 19,130 glaucoma cohort participants responded to basic lifestyle surveys with the exception of the health insurance survey with 125 missing values from the cohort. 9,587 participants responded to the health access survey while 6,495 individuals had completed the stress survey. Imputation was used to handle the 125 missing data for health insurance by putting the values to zero, an outer join was performed between the survey datasets and the demographic data using person id as the unique identifier. This integration resulted 15 in a final dataset of 19,130 with 60% missing values in the stress column and 50% missing values in the health care access surveys columns. Due to the large number of missing values in the variables of interest, we removed the rows with the missing data, leaving us with 5,762 rows of data for model training. Figure 3.4 illustrates the distribution of glaucoma cases across racial groups in the AoU dataset. White participants account for 51.3% of glaucoma cases, while African American participants make up 26%. Figure 3.5 illustrates percentage of people living with glaucoma in each race. The African American has a higher percentage of people with glaucoma despite their lower total participants compared to the white population, a further assertion that glaucoma is more prevalence in African population than other populations. Figure 3.4: All of Us dataset glaucoma distribution by race 3.3.2 Processing the Control Group A similar method was applied to construct the control cohort (individuals without glau- coma). Demographic, lifestyle, healthcare access, and stress survey responses were extracted for 20,000 individuals who met the exclusion criteria for glaucoma. Following the same data integration approach, the final control dataset was reduced to 5,129 participants af- 16 Figure 3.5: All of Us dataset Percentage of people with glaucoma in each race ter merging with survey responses using an outer join and removing the rows with missing values. To ensure compatibility with machine learning models, all categorical variables were converted into numeric format using mapping method. 3.3.3 Feature Engineering A total of 14 predictor variables were selected and processed as follows: 1. Age – Continuous numerical variable 2. Gender – Coded as Male and Female (EHR data) 3. Race – Grouped into White, African American, and Others • Only the binary category of white and black was used. This categorization was used because the majority of participants in the dataset were White or African American, while other racial groups had significantly fewer participants. • Simplifying race to binary form helps reduce dimensionality and sparsity, partic- ularly when minority categories have very low representation. • Hispanics are not in race category but ethnicity and this was not included in the variable for the model. 17 4. Health Insurance – Binary classification: • Has insurance (1) • No insurance (0) 5. Education Level – Categorized into three groups: • Advanced/College degree (2) • High school graduate (1) • No formal education or incomplete high school (0) 6. Employment Status – Binary classification: • Employed (1) • Unemployed (0) • The survey originally had eight employment categories: retired, out of work (more than a year), out of work (less than a year), homemaker, unable to work, stu- dent, employed, and other. These responses were consolidated into a binary em- ployed/unemployed variable. 7. Income Level – Grouped into three categories: • Income ≥ $75,000 (2) • Income: $35,001–$75,000 (1) • Income ≤ $35,000 (0) 8. Medication Affordability – Binary classification: Yes (1) / No (0) • Originally derived from survey questions regarding prescription affordability. 9. Access to health care provider – Binary classification: Has access (1) / No access (0) • This variable was constructed from survey responses about barriers to healthcare access through affordability of the healthcare provider. 10. Ability to Afford Co-Pay – Binary classification: Yes (1) / No (0) 11. Ability to Afford Specialist – Binary classification: Yes (1) / No (0) 12. Spoken to Eye Doctor – Binary classification: Yes (1) / No (0) 18 13. Stress Level – Binary classification: Stressed (1) / Not stressed (0) • Derived from survey responses assessing perceived stress levels among partici- pants. 3.3.4 Final Dataset Construction The processed case (glaucoma) and control (non-glaucoma) datasets were merged into a single dataset, with glaucoma status as the target variable. The outcome was encoded as a binary classification: • Individual has glaucoma (1) • Individual does not have glaucoma (0) This structured dataset was then used for machine learning model development to predict glaucoma risk based on demographic, socioeconomic, and lifestyle variables. Figure 3.6 shows the workflow for this study. After data extraction and cleaning, we have 10,891 participants with 5,762 cases and 5,129 controls. 3.4 Machine Learning Model for Glaucoma Prediction To develop predictive models for glaucoma risk, several supervised machine learning algo- rithms were implemented. Prior to model development, multicollinearity among the predictor variables was assessed using a correlation matrix. The analysis revealed no significant mul- ticollinearity, indicating that the variables were sufficiently independent for inclusion in the models. Additionally, hyperparameter tuning was conducted for the Decision Tree and Gra- dient Boosting models using GridSearchCV with 5-fold cross-validation, ensuring optimal model performance and generalizability. 3.4.1 Logistic Regression A logistic regression model was trained as a baseline classifier due to its interpretability and efficiency in predicting binary outcomes (glaucoma vs. non-glaucoma). The model was 19 Figure 3.6: Study workflow and cohort definition for evaluating predictive models for par- ticipant with glaucoma in the All of Us Research Program optimized using: • Maximum Iterations: Set to 1000 for better convergence. • The rest of the parameters were left at their default values, which are suitable for most binary classification tasks. • Feature Importance Analysis: Used model coefficients to determine the most influential predictors. 3.4.2 Decision Tree Classifier A Decision Tree Classifier was implemented to capture non-linear relationships between SDOH and glaucoma risk. Key steps included: 20 • Gini Impurity for Splitting: This was used to determine the best feature splits. • Other parameters were left at their default values. These include the splitter, which is set to ’best’ (the strategy that chooses the best split at each node), maximum depth set to None (allowing the tree to expand until all leaves are pure), minimum samples split set to 2, and minimum samples leaf set to 1. • Feature Importance Analysis: We evaluated how each variable contributed to model predictions. • Hyperparameter Optimization: Hyperparameter tuning was conducted using Grid- SearchCV with 5-fold cross-validation to identify the best combination of parameters for model generalization. The hyperparameters explored included maximum depth {3, 5, 10, None} to control tree depth and prevent overfitting, minimum samples split {2, 5, 10} to define the minimum number of samples required to split a node and minimum samples leaf {1, 2, 4} to specify the minimum number of samples at a leaf node. Other parameters, such as criterion=’gini’, splitter=’best’, and maximum features=None, were left at their default setting. 3.4.3 Gradient Boosting Machine (GBM) The Gradient Boosting Machine (GBM) was employed to enhance predictive performance by sequentially improving weak learners through gradient-based optimization. Key steps included: • Boosting Framework: Combined multiple weak decision trees to form a strong classifier, minimizing prediction errors iteratively. The model was configured with 100 estimators. • Learning Rate Tuning: Adjusted the step size of model updates to balance convergence speed and overfitting. A learning rate of 0.1 was used for this model. • All other parameters, including subsample (1.0), minimum samples split (2), minimum samples leaf (1), and maximum features (None), were left at their default settings. This configuration represents a typical baseline for gradient boosting models prior to hyperparameter tuning. • Hyperparameter Optimization: The Gradient Boosting Classifier was optimized using GridSearchCV with 5-fold cross-validation, focusing on three hyperparameter: numn- ber of estimators (100 or 200), learning rate (0.01, 0.1, 0.2), and maximum depth (3, 5, 7). The final model was selected based on the highest AUC-ROC score. All other hyperparameters were retained at their default settings, including subsample of 1.0, minimum samples split of 2, and the criterion was ’friedman mse’. This approach balanced model complexity with performance, helping to avoid overfitting while max- imizing predictive accuracy. 21 • Feature Importance Analysis: Assessed the contribution of each variable to the model’s predictive power, identifying key SDOH factors influencing glaucoma risk. 3.4.4 K-Nearest Neighbors (KNN) Classifier A K-Nearest Neighbors (KNN) model was used to analyze the proximity of glaucoma pa- tients in feature space. The K-Nearest Neighbors (KNN) model was configured with num- ber of neighbors neighbors of 5 to assess glaucoma prediction based on similarity in the feature space. The model used the Euclidean distance (metric=’minkowski’ with p=2) to compute similarities, and neighbors were uniformly weighted. Other parameters such as algorithm=’auto’, leaf size=30, and number of jobs=None were left at their default values. This setup allowed the model to classify test instances based on the majority class of the 5 closest neighbors in the transformed feature space. 3.4.5 Model Training and Validation • Train-Test Split: The dataset was divided into 70% training and 30% testing using stratified sampling to ensure balanced representation of glaucoma cases vs. controls. • Cross-Validation: Employed 5-fold cross-validation to optimize hyperparameters and prevent overfitting. • Feature Selection: Identified the most influential predictors based on feature impor- tance scores derived from trained models, ensuring the selection of meaningful variables for prediction. 3.4.6 Model Evaluation Metrics To assess model performance, we used the following evaluation metrics: • Accuracy: Measures overall model correctness. • Precision and Recall: Evaluates the model’s ability to identify true glaucoma cases. • F1 Score: Balances precision and recall for better clinical applicability. • AUC-ROC Curve: Assesses the discriminatory power of the model. 22 3.5 Ethical Considerations Data Privacy and Security: All analyses were conducted within the secure All of Us Re- searcher Workbench to ensure compliance with HIPAA and NIH data-use policies. The study also evaluated model fairness across different racial, socioeconomic, and geographic groups to minimize bias and promote equitable predictions. 23 Chapter 4 Results and Discussion 4.1 Descriptive Analysis 4.1.1 Glaucoma Cases Distribution Figure 4.1 Presents a map illustrating the distribution of glaucoma cases across different counties. Larger bubbles indicate areas with a higher number of reported glaucoma cases. The highest concentration appears to be in the vicinity of Cook County, Illinois (1,390 cases), followed by New York County, Manhattan (1,328 cases), and Suffolk County, Massachusetts (1,106 cases). Additionally, Allegheny County, Pennsylvania has 1,075 cases, while other counties report fewer than 1,000 cases. The locations are based on approximate ZIP code- level data and reveal that glaucoma cases are notably concentrated in coastal regions of the Midwest and Northeast. The high number of glaucoma in these areas is not only due to high population density, the prevalence per population is equally higher compared to other areas as shown in Figure 4.2. 4.1.2 Glaucoma Prevalence The map of glaucoma prevalence in 1,000,000 population is illustrated in Figure 4.1. To better visualize this prevalence, Figure 4.3 illustrates the distribution of glaucoma prevalence per 1,000,000 population across various counties in the United States. The majority of counties (119) fall within the lowest prevalence category of 0–10 cases per 1,000,000 people. This is followed by 49 counties with prevalence between 10–20 cases, 36 counties in the 20–30 Figure 4.1: Map showing number of reported Glaucoma cases in the United States range, and 32 counties in the 30–40 range. Notably, 57 counties exhibit prevalence between 100–200 cases per 1,000,000 population, marking a peak in mid-range prevalence. Figure 4.2: Map showing Glaucoma Prevalence per 1,000,000 Population As prevalence increases, the number of counties in each category generally declines, with only a small number of counties experiencing higher rates. Specifically, only one county, Marathon in Wisconsin, records a prevalence exceeding 3,000 cases per 1,000,000 population. In the higher prevalence categories, two counties in Wisconsin and two in Kansas fall within the 300–400 prevalence range. Additionally, Arizona, California, Illinois, New York, Pennsylvania, South Dakota, and Texas each have one county in this category. Notably, 25 Figure 4.3: County Counts of Glaucoma Prevalence per 1,000,000 Population Wisconsin has at least one county in most of the higher prevalence category from 300 and above. For the 700–1000 prevalence category, Wisconsin has two counties, while Florida, Kansas, New York, and Pennsylvania each have one county. Suffolk county in Massachusetts is the only county in 1000-1500 prevalence. In the 1500–2000 category, Florida and Wisconsin each have one county. This is illustrated in Table 4.1 below. Table 4.1: High prevalence categories of counties in different states 26 4.2 Predictive Analysis In this section, we present the results of the predictive models used to classify the presence of glaucoma based on demographic and SDOH features. We evaluate and compare the performance of Logistic Regression, Decision Tree, Ensemble Gradient Boosting, optimized Gradient Boosting with GridSearchCV, K-Nearest Neighbors (KNN) models in terms of accuracy, precision, recall, and feature importance. 4.2.1 Logistic Regression Model The logistic regression model achieved an overall accuracy of 67.2%, indicating a moderate predictive ability with AUC score of 0.73 and ROC curve above the diagonal, indicating the model is better than random guessing. The confusion matrix reveals that the model correctly classified 852 negative cases (no glaucoma) and 1344 positive cases (glaucoma), while misclassifying 661 negative cases and 411 positive cases. The classification report further details the performance across both classes: • Class 0 (No Glaucoma): Precision = 0.67, Recall = 0.56, F1-score = 0.61 • Class 1 (Glaucoma): Precision = 0.67, Recall = 0.77, F1-score = 0.71 • Overall weighted performance: Precision = 0.67, Recall = 0.67, F1-score = 0.67 The feature importance analysis reveals that affording an eye doctor (0.5629) and having insurance (0.5493) were the most influential predictors of glaucoma classification. Other no- table predictors included affording prescriptions (0.1999), gender (0.1132), and stress levels (0.0615). Interestingly, race (-0.3214), income (-0.1316), and education (-0.1021) had nega- tive coefficients, suggesting an inverse relationship with the predicted outcome. This model shows moderate predictive performance with an AUC of 0.73, meaning it is fairly good at distinguishing between glaucoma and non-glaucoma cases. It performs better at detecting glaucoma cases but has lower recall for non-glaucoma cases, meaning some negative cases might be misclassified. 27 Figure 4.4: AUC-ROC Curve for Logistic Regression 4.2.2 Decision Tree Model The decision tree model yielded an accuracy of 57.2%, performing notably worse than logistic regression. The confusion matrix showed that the model correctly classified 865 negative cases and 1004 positive cases, while misclassifying 648 negative cases and 751 positive cases. The classification report for this model is as follows: • Class 0 (No Glaucoma): Precision = 0.54, Recall = 0.57, F1-score = 0.55 • Class 1 (Glaucoma): Precision = 0.61, Recall = 0.57, F1-score = 0.59 • Overall weighted performance: Precision = 0.57, Recall = 0.57, F1-score = 0.57 The feature importance rankings in the decision tree model differed significantly from logistic regression. Here, age (0.4357) was the most important predictor, followed by income (0.1010), education (0.0641), and gender (0.0585). Interestingly, affording an eye doctor (0.0203) and having insurance (0.0159) were less influential in this model, which contrasts with their strong predictive power in logistic regression. The model struggles to differentiate between glaucoma and non-glaucoma cases, as seen in the near-random ROC curve in Figure 4.5. 28 Figure 4.5: AUC-ROC Curve for Decision Tree Model 4.2.3 Ensemble Gradient Boosting Classifier The Ensemble Gradient Boosting model improved upon previous models with an accuracy of 67.29%. The confusion matrix showed that it correctly classified 842 non-glaucoma cases and 1357 glaucoma cases, while misclassifying 671 non-glaucoma and 398 glaucoma cases. The classification report was as follows: • Class 0 (Non-Glaucoma): Precision = 0.68, Recall = 0.56, F1-score = 0.61 • Class 1 (Glaucoma): Precision = 0.67, Recall = 0.77, F1-score = 0.72 Feature importance analysis highlighted age (0.7767) as the most significant predictor, followed by race (0.0878) and affordability of an eye doctor (0.0617). Factors such as af- fordability of specialist care (0.0006) and copay affordability (0.0011) had minimal impact. This model significantly outperforms Decision Tree, achieving higher accuracy (67%) and a stronger AUC-ROC (0.73) with moderate discriminatory power and it does a better job at identifying glaucoma cases (77% recall). 29 Figure 4.6: AUC-ROC Curve for Gradient Boosting Classifier 4.2.4 Optimized Decision Tree with GridSearchCV Using hyperparameter tuning, the optimized Decision Tree model improved the accuracy to 67.87%. The best parameters were: • max depth: 5 • min sample leaf: 1 • min sample split: 2 The confusion matrix showed that 790 non-glaucoma and 1428 glaucoma cases were correctly classified, while 723 non-glaucoma and 327 glaucoma cases were misclassified. The classification report indicated: • Class 0 (Non-Glaucoma): Precision = 0.71, Recall = 0.52, F1-score = 0.60 • Class 1 (Glaucoma): Precision = 0.66, Recall = 0.81, F1-score = 0.73 The recall for class 1 (glaucoma cases) is the highest so far (81%), making this model the best at detecting glaucoma among the models tested. AUC-ROC of 0.71 shows good discrimination ability, though slightly lower than the previous 0.73. This optimized model performs well, especially in identifying glaucoma cases (high recall). It provides a solid balance between accuracy and recall, making it a strong model for glaucoma prediction. 30 Figure 4.7: AUC-ROC Curve for Optimized Decision Tree 4.2.5 Optimized Gradient Boosting with GridSearchCV The optimized Gradient Boosting model demonstrated similar predictive performance to the original Gradient Boosting model in identifying glaucoma cases, achieving an accuracy of 67.29% and an AUC-ROC score of 0.73, indicating a moderate discrimination ability between cases and non-cases. The model was fine-tuned with a learning rate of 0.1, max depth of 3, and 100 estimators, optimizing the trade-off between bias and variance. The confusion matrix revealed that the model correctly identified 1,357 glaucoma cases (true positives) and 842 non-cases (true negatives). However, 398 glaucoma cases were misclassified as non-cases (false negatives), and 671 non-cases were incorrectly classified as glaucoma (false positives). Despite the hyperparameter tuning, the optimized Gradient Boosting model did not show a significant improvement over the original model in terms of accuracy, recall, or AUC-ROC. While the model maintains strong recall for glaucoma cases, its overall predictive ability remains comparable to the untuned version. 31 Figure 4.8: AUC-ROC Curve for Optimized Gradient Boosting Classifier 4.2.6 K-Nearest Neighbors (KNN) The K-Nearest Neighbors (KNN) model was evaluated for its predictive performance in iden- tifying glaucoma cases. The model achieved an accuracy of 61.57% and an AUC-ROC score of 0.65, indicating moderate discrimination ability between glaucoma and non-glaucoma cases. The confusion matrix revealed that the model correctly classified 1,189 glaucoma cases (true positives) and 823 non-cases (true negatives). However, 566 glaucoma cases were misclassified as non-cases (false negatives), and 690 non-cases were incorrectly classified as glaucoma (false positives). The classification report was: • Class 0 (Non-Glaucoma): Precision = 0.59, Recall = 0.54, F1-score = 0.57 • Class 1 (Glaucoma): Precision = 0.63, Recall = 0.68, F1-score = 0.65 The AUC-ROC curve Figure 4.9 shows that the model’s ability to differentiate between glaucoma and non-glaucoma cases is moderate, with an AUC of 0.65. This indicates that while the model performs better than random guessing (AUC = 0.50), its discriminatory power remains limited. This model underperformed compared to Gradient Boosting and Logistic Regression, likely due to its sensitivity to feature scaling and data distribution. 32 Figure 4.9: AUC-ROC Curve for k-Nearest Neighbors (KNN) 4.3 Model Comparison Table 4.2 presents the evaluation metrics for all the machine learning models employed in this study. Each model’s accuracy, along with class-specific precision, recall, F1 scores, and the AUC-ROC scores, provides a comprehensive comparison at a glance. Among the models, Logistic Regression and the Ensemble Gradient Boosting classifier performed similarly, each achieving approximately 67% accuracy and an AUC of 0.73. In contrast, the K-Nearest Neighbors (KNN) model had lower performance, with an accuracy of 61.57% and AUC of 0.65, indicating its limited effectiveness for this classification task. The basic Decision Tree model recorded the lowest performance (accuracy: 57.2%, AUC: 0.58), which emphasizes the importance of hyperparameter optimization. Most models demonstrated better performance on Class 1 (glaucoma cases), with higher recall and F1 scores, suggesting they were more sensitive to detecting positive cases. However, precision for Class 0 (non-glaucoma cases) was generally lower, particularly in the optimized models, indicating a higher rate of false positives. The Optimized Decision Tree, tuned using GridSearchCV, achieved the highest overall accuracy (67.87%) with optimal hyperparameters: maximum depth of 5, minimum samples 33 per leaf of 1, and minimum samples required to split of 2. This model’s improved performance can be attributed to its ability to capture more complex patterns while avoiding overfitting. Notably, the optimized decision tree achieved a recall of 81% for glaucoma cases, making it particularly effective for identifying individuals at risk, a critical requirement in medical screening. Additionally, the F1 score of 0.73 for Class 1 (glaucoma) reflects a strong balance between precision and recall. Although its AUC score of 0.71 was slightly lower than that of the Gradient Boosting model, the overall performance highlights its value as a practical and interpretable screening tool for glaucoma detection. Table 4.2: Model Evaluation and Comparison 34 Chapter 5 Conclusion and Future Work 5.1 Descriptive Analysis The descriptive analysis revealed a significant concentration of counties with lower glau- coma prevalence, with only a few counties exhibiting markedly higher rates. This uneven distribution suggests potential geographic disparities in disease prevalence, likely influenced by factors such as healthcare access, socioeconomic status, demographic composition, and environmental conditions. These findings underscore the need for targeted public health interventions and improved access to eye care services in high-risk areas. Furthermore, the observed distribution raises important questions about systemic barriers to diagnosis and treatment. Future studies should examine the interaction between social determinants of health (SDOH), regional healthcare infrastructure, and glaucoma preva- lence to better inform policy decisions. Additionally, this approach could be extended to investigate the spatial distribution of other vision-related diseases, such as cataracts, ocular hypertension, and presbyopia, to develop comprehensive eye health strategies. 5.2 Predictive Analysis The predictive modeling phase provided critical insights into glaucoma risk factors and model performance. Among all tested models, the Optimized Decision Tree demonstrated the highest accuracy (67.87%) and the most balanced trade-off between precision and recall. Its superior performance is likely due to its ability to capture complex patterns in the data while mitigating overfitting. Gradient Boosting and Logistic Regression also exhibited strong predictive performance, achieving approximately 67.2% accuracy, reinforcing their reliability in classification tasks. Conversely, models such as K-Nearest Neighbors (KNN) and the baseline Decision Tree displayed lower predictive power, likely due to higher sensitivity to noise and suboptimal handling of feature interactions. A key takeaway from the feature importance analysis was the consistent identification of age, race, and affordability of eye care as the most influential predictors of glaucoma. This underscores the profound impact of socioeconomic factors on disease risk, reinforcing the need for integrated screening strategies that incorporate both clinical and social risk factors. 5.3 Future Research Directions While the current study provides valuable insights into glaucoma prediction, several areas warrant further investigation to enhance model robustness and generalizability: 1. Expanding the Dataset: Future research should leverage larger, more diverse datasets to improve model generalization and account for population heterogeneity. 2. Incorporating Genetic and Environmental Factors: Given the known heritability of glaucoma, integrating genetic markers and environmental exposures (e.g., air pollu- tion, water contaminants, and PFAS exposure) could significantly enhance predictive accuracy. 3. Developing Clinical Decision Support Tools: Translating predictive models into practi- cal decision-support tools for ophthalmologists and public health officials could enhance early detection and intervention strategies. 4. Refining Model Interpretability: Advanced machine learning techniques, such as SHAP (Shapley Additive Explanations) and LIME (Local Interpretable Model-agnostic Ex- 36 planations), could provide deeper insights into individual risk factors, making models more useful for personalized risk assessment. 5.4 Final Remark This study highlights the potential of machine learning-based predictive modeling in un- derstanding glaucoma risk and guiding targeted screening efforts. By integrating social determinants, refining model interpretability, and expanding datasets, future research can contribute to more equitable and effective glaucoma prevention and treatment strategies. 37 BIBLIOGRAPHY [1] Aboobakar, I. F., and Wiggs, J. L. The genetics of glaucoma: Disease associa- tions, personalised risk assessment and therapeutic opportunities-a review. Clinical and Experimental Ophthalmology 50, 2 (March 2022), 143–162. [2] Almarzouki, N. Impact of environmental factors on glaucoma progression: A sys- tematic review. Clinical Ophthalmology 18 (2024), 2705–2720. [3] Baxter, S. L., Saseendrakumar, B. R., Paul, P., Kim, J., Bonomi, L., Kuo, T.-T., Loperena, R., Ratsimbazafy, F., Boerwinkle, E., Cicek, M., et al. Predictive analytics for glaucoma using data from the all of us research program. Amer- ican journal of ophthalmology 227 (2021), 74–86. [4] Blindness, G. ., Collaborators, V. I., and of the Global Burden of Dis- ease Study, V. L. E. G. Causes of blindness and vision impairment in 2020 and trends over 30 years, and prevalence of avoidable blindness in relation to vision 2020: the right to sight: an analysis for the global burden of disease study. The Lancet Global Health 9, 2 (February 2021), e144–e160. Epub 2020 Dec 1. Erratum in: Lancet Glob Health. 2021 Apr;9(4):e408. doi: 10.1016/S2214-109X(21)00050-4. [5] Chen, K. W., Jiang, A., Kapoor, C., Fine, J. R., Brandt, J. D., and Chen, J. Geographic information system mapping of social risk factors and patient outcomes of pediatric glaucoma. Ophthalmology Glaucoma 6, 3 (May-June 2023), 300–307. Epub 2022 Nov 23. [6] Dada, T., Verma, S., Gagrani, M., Bhartiya, S., Chauhan, N., Satpute, K., and Sharma, N. Ocular and systemic factors associated with glaucoma. Journal of Current Glaucoma Practice 16, 3 (September-December 2022), 179–191. [7] Davuluru, S. S., Jess, A. T., Kim, J. S. B., Yoo, K., Nguyen, V., and Xu, B. Y. Identifying, understanding, and addressing disparities in glaucoma care in the united states. Translational Vision Science & Technology 12, 10 (October 2023), 18. [8] Ehrlich, J. R., Burke-Conte, Z., Wittenborn, J. S., Saaddine, J., Omura, J. D., Friedman, D. S., Flaxman, A. D., and Rein, D. B. Prevalence of glaucoma among us adults in 2022. JAMA Ophthalmology 142, 11 (Nov 2024), 1046–1053. [9] Elam, A. R., Tseng, V. L., Rodriguez, T. M., Mike, E. V., Warren, A. K., Coleman, A. L., and American Academy of Ophthalmology Taskforce on Disparities in Eye Care. Disparities in vision health and eye care. Ophthalmology 129, 10 (2022), e89–e113. [10] Han, X., Gharahkhani, P., Hamel, A. R., Ong, J.-S., Renteŕıa, M. E., Mehta, P., Dong, X., Pasutto, F., Hammond, C., Young, T. L., Hysi, P., Lotery, A. J., Jorgenson, E., Choquet, H., Hauser, M., Bailey, J. N. C., Nakazawa, T., Akiyama, M., Shiga, Y., Fuller, Z. L., Wang, X., Hewitt, A. W., Craig, J. E., Pasquale, L. R., Mackey, D. A., Wiggs, J. L., Khawaja, A. C., Segrè, A. V., 23andMe Research Team, Consortium, I. G. G., and MacGregor, S. Large-scale multitrait genome-wide association analyses identify hun- dreds of glaucoma risk loci. Nature Genetics 55, 7 (July 2023), 1116–1125. Epub 2023 Jun 29. [11] Jonas, J. B., Aung, T., Bourne, R. R., Bron, A. M., Ritch, R., and Panda- Jonas, S. Glaucoma. The Lancet 390, 10108 (2017), 2183–2193. [12] Karimi, A., Stanik, A., Kozitza, C., and Chen, A. Integrating deep learning with electronic health records for early glaucoma detection: A multi-dimensional machine learning approach. Bioengineering 11, 6 (2024), 577. [13] Mamidipaka, A., Shi, A., Lee, R., et al. Socioeconomic and environmental factors associated with glaucoma in an african ancestry population: findings from the primary open-angle african american glaucoma genetics (poaagg) study. Eye (2024). [14] Organization, W. H. Social determinants of health and their impact. WHO Report (2023). [15] Quigley, H. A., and Broman, A. T. The number of people with glaucoma world- wide in 2010 and 2020. British Journal of Ophthalmology 90, 3 (March 2006), 262–267. [16] Rahman, M., et al. Integrating social determinants in glaucoma risk prediction. Public Health Ophthalmology Journal (2023). [17] Sekimitsu, S., Elze, T., and Zebardast, N. Impact of the affordable care act on glaucoma severity at first presentation. Ophthalmic Epidemiology 30, 3 (June 2023), 326–329. Epub 2022 Jun 20. [18] Simplemaps. Us zip codes database, 2023. Accessed: [12-20-2024]. [19] Sun, Z., Stuart, K. V., Luben, R. N., Auld, A. L., Strouthidis, N. G., Khaw, P. T., Jayaram, H., Khawaja, A. P., Foster, P. J., on behalf of the UK Biobank Eye, and Consortium, V. Association of ambient air pollution expo- sure with incident glaucoma: 12-year evidence from the uk biobank cohort. Investigative Ophthalmology & Visual Science 65, 12 (2024), 22. [20] Tham, Y.-C., Li, X., Wong, T. Y., Quigley, H. A., Aung, T., and Cheng, C.-Y. Global prevalence of glaucoma and projections of glaucoma burden through 2040: a systematic review and meta-analysis. Ophthalmology 121, 11 (2014), 2081–2090. 39 [21] Vajaranant, T. S., Wu, S., Torres, M., and Varma, R. The changing face of primary open-angle glaucoma in the united states: demographic and geographic changes from 2011 to 2050. American Journal of Ophthalmology 154, 2 (August 2012), 303–314.e3. Epub 2012 Apr 27. [22] Wang, L., et al. Geospatial disparities in glaucoma prevalence: A gis-based approach. Journal of Clinical Ophthalmology (2023). [23] Wang, Z., Wiggs, J. L., Aung, T., Khawaja, A. P., and Khor, C. C. The genetic basis for adult onset glaucoma: Recent advances and future directions. Progress in Retinal and Eye Research 90 (Sep 2022), 101066. Epub 2022 May 17. [24] Weinreb, R. N., Aung, T., and Medeiros, F. A. The pathophysiology and treatment of glaucoma: A review. JAMA 311, 18 (05 2014), 1901–1911. [25] Wiggs, J. L., and Pasquale, L. R. Genetics of glaucoma. Human Molecular Genetics 26, R1 (August 2017), R21–R27. [26] Yoo, K., Lee, C., Baxter, S. L., and Xu, B. Y. Relationship between glaucoma and chronic stress quantified by allostatic load score in the all of us research program. American Journal of Ophthalmology 269 (2025), 419–428. 40 LIST OF TABLES LIST OF FIGURES Introduction Research Objectives Significance of the Study Related Work Overview of Glaucoma Genetic and Demographic Risk Factors Influencing Glaucoma Social Determinants of Health and Glaucoma Prediction Healthcare Access and Insurance Coverage Socioeconomic Status (SES) and Income Occupational and Environmental Stress Urban vs. Rural Disparities Predictive Modeling of Glaucoma Using Machine Learning Supervised Learning for Glaucoma Prediction Geographic Information Systems (GIS) and Spatial Analysis Current Research on Geographic Distribution of Glaucoma The Role of the All of Us Dataset in Glaucoma Prediction Summary and Research Gaps Methodology Study Design Descriptive Analysis Geo-Spatial Analysis Predictive Analysis Processing the Cases Group Processing the Control Group Feature Engineering Final Dataset Construction Machine Learning Model for Glaucoma Prediction Logistic Regression Decision Tree Classifier Gradient Boosting Machine (GBM) K-Nearest Neighbors (KNN) Classifier Model Training and Validation Model Evaluation Metrics Ethical Considerations Results and Discussion Descriptive Analysis Glaucoma Cases Distribution Glaucoma Prevalence Predictive Analysis Logistic Regression Model Decision Tree Model Ensemble Gradient Boosting Classifier Optimized Decision Tree with GridSearchCV Optimized Gradient Boosting with GridSearchCV K-Nearest Neighbors (KNN) Model Comparison Conclusion and Future Work Descriptive Analysis Predictive Analysis Future Research Directions Final Remark BIBLIOGRAPHY