Executive Summary
Diabetes and prediabetes have significant negative impacts on health. The CDC details a wide variety of diabetes-related complications, including increased risk of heart disease, foot problems, nerve damage, problems with vision, hearing, and mental health, and a host of other potential complications. These negative health and quality-of-life impacts highlight the critical need to invest in interventions aimed at preventing and delaying complications. This need, combined with a lack of access to such interventions, calls for increased resources. Studies over the years have highlighted lifestyle-based interventions as key to improving the efficacy of diabetes self-management for patients.1 2
Vita Valens investigated the utility and efficacy of density-based clustering algorithms as a mechanism for identifying target geographic regions in which to organize programming focused on a specific chronic condition, namely diabetes, in a densely populated and diverse region like New York City. This paper uses a density-based clustering algorithm to identify the best potential site to host diabetes interventions, based on patient data from a subset of the diabetic population in the New York metropolitan area.
Background
Geographic cluster analysis has proven an effective tool for identifying high-need populations with high incidence rates of diabetes mellitus, which makes it valuable for designing effective interventions. Geographic and corresponding sociodemographic disparities in health outcomes can be attributed to communities lacking adequate resources to provide preventative healthcare interventions like diabetes education, and can also signal that community members are underutilizing existing resources.
Multiple studies have employed cluster analyses to identify the geographic incidence of diabetes. A study out of the University of Wisconsin-Madison demonstrated the value of cluster analysis methods for analyzing county-level health data, addressing the issues that arise from relying on county-level population health rankings alone to map geographic diabetes incidence.3 This research highlights the importance of methodologies like density-based clustering, which account for geographic variation within a region, and the imperfections of county-level ranking (or ranking by any other broad geographic region) as a means of identifying high-need localities.
Studies in this area also demonstrate the importance of geography in diabetes outcomes. One such study from the University of Calgary identified a tangible relationship between metrics like median family income and diabetes-related acute complications by identifying spatial clusters of acute complications and correlating them with community-level sociodemographic factors.4 Scholarship of this kind, which correlates geographic diabetes incidence with other factors, is outside the scope of this paper, which seeks only to chart diabetes incidence within a geographic region in order to identify an ideal physical location to host programming. It does, however, constitute the next step of this research: correlating our geographic data against additional factors can help us tailor the content of our interventions as well, for example by focusing on food-based education and resource provision in neighborhoods located in food deserts, or by encouraging bicycle commuting in neighborhoods that rely heavily on cars.
This study is limited to using the geographic incidence of diabetes to identify an optimal location for hosting diabetes-related programming, but the utility of density-based clustering algorithms is much broader and can yield substantial insight into the landscape of health infrastructure within a given region. There are various types of clustering algorithms and methodologies for identifying contiguous regions that share a common level of diabetes incidence. While k-means and hierarchical clustering algorithms can be helpful, density-based clustering algorithms are the most useful for our problem: identifying the optimal area in which to base our diabetes intervention so that it serves the greatest number of in-need patients.
Density-based clustering algorithms outperform k-means clustering algorithms in this context because they can find clusters of arbitrary shape, whereas k-means tends to identify roughly spherical clusters. Density-based clustering also does not require the number of expected clusters in a dataset to be specified in advance.5 This is ideal for geographic data, where pre-existing divisions may have been arbitrarily and/or historically decided and may not represent the true contours of a region’s level of access to various resources. Compared with hierarchical clustering algorithms, density-based clustering performs better not only because it handles irregularly shaped clusters better, but also because it is far less sensitive to noise, which is rife in real-world data.6
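The two properties described above (no preset cluster count, and explicit handling of noise) can be illustrated with a minimal, self-contained sketch of the DBSCAN procedure, one common density-based algorithm. This is not the implementation used in the study; the toy points and the `eps` and `min_samples` values are assumptions chosen only for illustration, and a real analysis would more likely rely on an established library.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster_id += 1
    return labels

# Toy data (hypothetical): two dense groups and one isolated point.
points = [
    (0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (0.1, 0.1),  # dense group A
    (5.0, 5.0), (5.0, 5.1), (5.1, 5.0), (5.1, 5.1),  # dense group B
    (10.0, 0.0),                                     # isolated point
]
labels = dbscan(points, eps=0.2, min_samples=3)
n_clusters = len(set(labels) - {NOISE})
print(labels, n_clusters)
```

The number of clusters is never passed in: it falls out of the density parameters, and the isolated point is labeled as noise rather than being forced into the nearest cluster.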
Methodology
A preliminary needs analysis was conducted as a part of the development of a new diabetes-focused educational program. Geographic and demographic data were examined to identify communities most affected by diabetes. Specifically, a geographic cluster analysis was conducted to identify areas with the highest concentrations of individuals living with diabetes.
Using our dataset, diabetes prevalence across the target population was analyzed by ZIP code. Python was used to perform data and spatial analysis to gain deeper insight into the target population and to inform the development of the educational event. The data analysis process was as follows:
1. Clean and standardize the data
2. Analyze the population distribution
3. Convert member addresses to coordinates
4. Develop visual representation of the data
5. Employ density-based clustering algorithms
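As a rough illustration of the first two steps, the sketch below standardizes ZIP codes and counts members per ZIP code. The record layout, the field names, and the `clean_zip` helper are hypothetical and are not taken from the study's code. Step 3 (geocoding) depends on an external service and is omitted here.

```python
from collections import Counter

def clean_zip(raw):
    """Return a 5-digit ZIP code string, or None if the value is unusable."""
    if raw is None:
        return None
    digits = str(raw).strip().split("-")[0]  # drop any ZIP+4 suffix
    if not digits.isdigit() or len(digits) > 5:
        return None
    return digits.zfill(5)  # restore leading zeros lost in numeric storage

# Hypothetical member records; the real dataset is not reproduced here.
records = [
    {"member_id": 1, "zip": "11212"},
    {"member_id": 2, "zip": " 11212-1234 "},
    {"member_id": 3, "zip": 11207},
    {"member_id": 4, "zip": "n/a"},
    {"member_id": 5, "zip": "10456"},
]

# Step 1: clean and standardize, discarding unusable values.
zips = [z for z in (clean_zip(r["zip"]) for r in records) if z is not None]

# Step 2: population distribution, as a count of members per ZIP code.
members_per_zip = Counter(zips)
print(members_per_zip.most_common())
```

Counting by ZIP code gives a coarse first view of where members are concentrated before any coordinates are available for clustering.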
Findings and Discussion
The results of the cluster analysis show members with diabetes spread across both boroughs, with some areas containing significantly higher numbers than others. The clusters ranged in size from 21 to 643 members with diabetes. The two largest clusters, clusters 17 (513 individuals) and 19 (643 individuals), are both located in Brooklyn. Given that the aim of this study is to identify an optimal location based on the geographic clustering of diabetic patients within our dataset, Vita Valens identified a site in Brooklyn, located between clusters 17 and 19, at which to hold the diabetes intervention. This data-driven approach to identifying locations for programming helps maximize reach and accessibility for target populations.
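One possible way to formalize "located between" two clusters is a member-weighted midpoint of their centroids, sketched below. This is an assumption made for illustration and not necessarily how the site was chosen. The cluster sizes come from the findings above, while the centroid coordinates are hypothetical placeholders.

```python
def weighted_midpoint(centroids, weights):
    """Member-weighted average of (lat, lon) centroids.

    Averaging raw latitude/longitude is a reasonable approximation only
    over short distances, such as within a single borough.
    """
    total = sum(weights)
    lat = sum(w * c[0] for c, w in zip(centroids, weights)) / total
    lon = sum(w * c[1] for c, w in zip(centroids, weights)) / total
    return lat, lon

# Sizes are from the findings; centroid coordinates are hypothetical.
cluster_17 = {"size": 513, "centroid": (40.650, -73.950)}
cluster_19 = {"size": 643, "centroid": (40.670, -73.910)}

site = weighted_midpoint(
    [cluster_17["centroid"], cluster_19["centroid"]],
    [cluster_17["size"], cluster_19["size"]],
)
print(site)
```

Weighting by cluster size pulls the candidate point toward the larger cluster. In practice the final choice would also depend on factors such as available venues and transit access.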
Given our dataset of diabetic patient addresses (with no additional factors or variables being measured), density-based spatial clustering is the best-suited tool for identifying high-need geographic regions within New York City. Density-based clustering also allows for more amorphous and irregular cluster shapes than k-means or hierarchical clustering, which is especially important when dealing with data that is irregularly distributed along both geographic and sociodemographic lines.
Conclusion
We hope this paper can help other institutions interested in organizing condition-specific education interventions and programming by providing a documented methodology that optimizes geographic reach across a target population. Density-based clustering algorithms provide a substantive means of identifying geographic regions with a high occurrence of diabetes within a given demographic data set.
Citations
1 William H. Polonsky, Jay Earles, Susan Smith, Donna J. Pease, Mary Macmillan, Reed Christensen, Thomas Taylor, Judy Dickert, Richard A. Jackson; Integrating Medical Management With Diabetes Self-Management Training: A randomized control trial of the Diabetes Outpatient Intensive Treatment program. Diabetes Care 1 November 2003; 26 (11): 3048–3053. https://doi.org/10.2337/diacare.26.11.3048
2 The Diabetes Prevention Program (DPP) Research Group; The Diabetes Prevention Program (DPP): Description of lifestyle intervention. Diabetes Care 1 December 2002; 25 (12): 2165–2171. https://doi.org/10.2337/diacare.25.12.2165
3 Pollock, E. A., Gangnon, R. E., Gennuso, K. P., & Givens, M. L. (2024). Cluster Analysis Methods to Support Population Health Improvement Among US Counties. Journal of Public Health Management and Practice: JPHMP, 30(6), E319–E328. https://doi.org/10.1097/PHH.0000000000002034
4 Butalia, S., Patel, A. B., Johnson, J. A., Ghali, W. A., & Rabi, D. M. (2017). Geographic Clustering of Acute Complications and Sociodemographic Factors in Adults with Type 1 Diabetes. Canadian Journal of Diabetes, 41(2), 132–137. https://doi.org/10.1016/j.jcjd.2016.08.224
5 Taylor Karl, “DBSCAN vs. K-Means: A Guide in Python,” Newhorizons.com, https://www.newhorizons.com/resources/blog/dbscan-vs-kmeans-a-guide-in-python#:~:text=DBSCAN%20is%20a%20density%2Dbased,to%20initialization%20than%20K%2DMeans
6 https://medium.com/@amit25173/dbscan-vs-hierarchical-clustering-3bc4d9635bc8