U.S. Tuberculosis Incidence
How can Python help us understand numerical data? Let’s explore a case study in the context of a public health data report!
Before we dive into the case study, let’s describe some key data science terms: tables and rates.
In this course, we will describe many examples of data structures, which are ways to store and organize data, often for computational processing. One of the most common examples is a table:
A rectangular data structure composed of rows and columns. Columns are labeled.
We will expand on this definition of table over the next few lectures.
A rate is a measure of one quantity per unit of some other quantity. A few examples: 6 miles per hour, 4.8 parts per million. Rates are often expressed as a percentage, a fraction with a numerator and a denominator, or a decimal number.
Data scientists often use and define rates to compare different situations or events. For example, it is easy to compare 60 mph to 40 mph, but harder to compare 15 mph to 10 km/s. Occasionally, it’s unclear why a rate is necessary until we dig into the data. Let’s inspect a particular case below.
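To make the mixed-unit comparison concrete, here is a minimal sketch that converts 10 km/s to miles per hour so it can be set beside 15 mph. The variable names are illustrative, and the only outside fact used is that one international mile is exactly 1.609344 km.

```python
# Convert 10 km/s to mph so that both speeds share a unit.
KM_PER_MILE = 1.609344      # exact, by definition of the international mile
SECONDS_PER_HOUR = 3600

speed_km_per_s = 10
speed_mph = speed_km_per_s * SECONDS_PER_HOUR / KM_PER_MILE

print(round(speed_mph, 2))  # roughly 22,369 mph
print(speed_mph / 15)       # how many times faster than 15 mph
```

Once both quantities are in the same unit, the comparison is as easy as 60 mph versus 40 mph.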
The U.S. Centers for Disease Control and Prevention (CDC) regularly examines disease data nationwide and publishes reports for the public. These reports help inform public health policy and the resulting responses to disease epidemics at the national and global levels. One such disease is tuberculosis (TB), a highly contagious respiratory infection.
Consider the reported U.S. TB cases in 2021 (CDC Morbidity and Mortality Weekly Report (MMWR) 03/25/2022, source). The report summary states:
Reported TB incidence (cases per 100,000 persons) increased 9.4%, from 2.2 during 2020 to 2.4 during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
While the report discusses possible interpretations as to why this occurred, let’s focus on this particular numeric summary by answering the following questions:
There is a lot of information on the CDC website—more than we can cover in this example. Our main goal is to understand the numeric data presented in the summary statement quoted above. There are two sources of information relevant to us:
Before continuing, try reading Table 1. Hint: Read the footnotes!
In epidemiology, incidence is a rate that measures the number of cases of a disease in a population, within a given time period, as shown in Equation 1:
\[ \text{TB incidence} = \frac{\text{\# TB cases}}{\text{population}} \tag{1}\]
This rate can be interpreted as the number of cases (i.e., the number of reported individuals diagnosed with TB) per person in the population.
Why not use TB cases? Why scale by population size? Based on the CDC summary, the intention of this report was to highlight a drop in TB cases in 2020, compared to adjacent years. The report accomplished this by reporting and comparing the TB incidence in 2019, 2020, and 2021.
Simply reporting the total number of TB cases has several pitfalls, including the inability to compare the prevalence of TB in different scenarios.
While the precise explanation requires a strong understanding of probability (see Data 140), intuitively, incidence is a proxy for the prevalence, or rate of occurrence, of TB in a population. As an example, Hawaii reported far fewer TB cases than California but had a much higher TB incidence across all three years. After all, Hawaii has a much smaller population than California (1.5 million vs. 39 million), so each reported case of TB matters more.
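Table 1 of the report, excerpted later in this section, shows the same pattern for another pair of jurisdictions: in 2020 Alaska reported far fewer cases than California yet had a higher incidence. A small sketch using those Table 1 values (the dictionary names are illustrative):

```python
# 2020 values copied from Table 1 of the CDC report.
table_cases_2020 = {"Alaska": 58, "California": 1706}
table_incidence_2020 = {"Alaska": 7.92, "California": 4.32}  # cases per 100,000 persons

# California reported many times more cases than Alaska...
case_ratio = table_cases_2020["California"] / table_cases_2020["Alaska"]
# ...but Alaska had the higher incidence.
incidence_ratio = table_incidence_2020["Alaska"] / table_incidence_2020["California"]

print(f"California reported {case_ratio:.1f}x as many cases as Alaska")
print(f"Alaska's incidence was {incidence_ratio:.2f}x California's")
```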
Next, let’s consider how the CDC defines TB incidence in this report. From the Table 1 footnote:
Cases per 100,000 persons using midyear population estimates from the U.S. Census Bureau.
Incorrect interpretation. First, consider the following (incorrect) ratio in Equation 2:
\[ \frac{\text{\# TB cases}}{100,000 \text{ persons}} \tag{2}\]
This rate does not account for different states having different population sizes. In other words, the population denominator in Equation 1 has disappeared.
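A quick numeric check makes the problem with Equation 2 visible. This sketch uses the 2020 case counts and population estimates for Alabama and California from the tables later in this section; the variable names are illustrative. Dividing by a fixed 100,000 only rescales the raw counts, so the gap between the two states is unchanged, whereas dividing by each state's own population shrinks it considerably.

```python
# 2020 case counts (Table 1) and midyear population estimates (U.S. Census).
al_cases, al_pop = 72, 5_024_803
ca_cases, ca_pop = 1_706, 39_499_738

# Equation 2: divide by a fixed 100,000, ignoring population.
al_eq2 = al_cases / 100_000
ca_eq2 = ca_cases / 100_000

# Equation 1: divide by each state's own population.
al_eq1 = al_cases / al_pop
ca_eq1 = ca_cases / ca_pop

print(ca_eq2 / al_eq2)  # same ratio as the raw counts, ca_cases / al_cases
print(ca_eq1 / al_eq1)  # much smaller once population is accounted for
```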
Correct interpretation. Incidence is a measure of the rate of occurrence of a disease across a population. The Table 1 footnote rescales Equation 1 so that the population is measured in groups of 100,000 persons rather than in individual persons.
Since Equation 1 defines TB incidence as the number of TB cases per person, we scale up by 100,000 to get TB cases per group, where a group is defined as 100,000 persons:
\[ \frac{\text{cases}}{\text{1 person}} \times \frac{100,000 \text{ persons}}{\text{group}} \]
The CDC definition of TB incidence is therefore represented by Equation 3:
\[ \frac{\text{\# TB cases}}{\# \text{people in population}} \times 100,000 \tag{3}\]
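Equation 3 translates directly into a small function. The sketch below is one way to write it; the function name `incidence_per_100k` is an illustrative choice, and the inputs are the national case counts and population estimates listed in the tables later in this section. It also recomputes the 2020-to-2021 percent change from the unrounded incidences, for comparison with the 9.4% in the report summary.

```python
def incidence_per_100k(cases, population):
    """Equation 3: cases divided by population, scaled to 100,000 persons."""
    return cases / population * 100_000

# National totals: year -> (reported TB cases, midyear population estimate).
us_totals = {
    2019: (8_900, 328_239_523),
    2020: (7_173, 331_501_080),
    2021: (7_860, 331_893_745),
}

us_incidence = {
    year: incidence_per_100k(cases, pop)
    for year, (cases, pop) in us_totals.items()
}
for year, value in us_incidence.items():
    print(year, round(value, 2))

# Percent change from 2020 to 2021, using the unrounded incidences.
pct_change = (us_incidence[2021] - us_incidence[2020]) / us_incidence[2020] * 100
print(round(pct_change, 1))
```

Rounded to two decimals, the three values match the Total row of Table 1 (2.71, 2.16, 2.37), and the percent change rounds to 9.4. The already-rounded incidences 2.2 and 2.4 would give about 9.1% instead, so the percent change is better computed before rounding.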
To compute TB incidence, we need to source two pieces of data: the number of TB cases from Table 1, and the midyear population estimates from the U.S. Census. For your convenience, we’ve included a few data points on each U.S. jurisdiction in the two tables below.
U.S. jurisdiction | No. of TB cases, 2019 | No. of TB cases, 2020 | No. of TB cases, 2021 | TB incidence, 2019 | TB incidence, 2020 | TB incidence, 2021 |
---|---|---|---|---|---|---|
Total | 8,900 | 7,173 | 7,860 | 2.71 | 2.16 | 2.37 |
Alabama | 87 | 72 | 92 | 1.77 | 1.43 | 1.83 |
Alaska | 58 | 58 | 58 | 7.91 | 7.92 | 7.92 |
Arizona | 183 | 136 | 129 | 2.51 | 1.89 | 1.77 |
Arkansas | 64 | 59 | 69 | 2.12 | 1.96 | 2.28 |
California | 2,111 | 1,706 | 1,750 | 5.35 | 4.32 | 4.46 |
… | … | … | … | … | … | … |
Here are the population estimates for the corresponding years, sourced from the U.S. Census Bureau:
U.S. jurisdiction | 2019 pop | 2020 pop | 2021 pop |
---|---|---|---|
Total | 328,239,523 | 331,501,080 | 331,893,745 |
Alabama | 4,903,185 | 5,024,803 | 5,039,877 |
California | 39,512,223 | 39,499,738 | 39,237,836 |
Let’s use Equation 3 to compute TB incidence in the U.S. in 2020:
7173 / 331501080 * 100000
2.1637938555132306
Rounding to the nearest tenth, we recover the rate quoted in the report, 2.2!
The previous cell is hard to interpret without the preceding exposition. Python names and comments make the computation easier to follow:
# Compute incidence as cases per 100,000 persons
pop_2020 = 331501080  # U.S. midyear population estimate, 2020
tb_2020 = 7173  # reported U.S. TB cases, 2020
incidence_2020 = tb_2020 / pop_2020 * 100000
incidence_2020
2.1637938555132306
If we need to verify every incidence in the 2020 column, we might be tempted to edit the cell above and manually enter the values for each U.S. jurisdiction. This approach is both tedious and error-prone (what if we type a number incorrectly?). In a few lectures, we will see a much easier way using data structures: arrays and tables.
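As a preview, and only as a sketch, the same check can already be automated with a plain Python dictionary and a loop rather than the arrays and tables introduced later. The names below are illustrative, and only the jurisdictions whose 2020 population appears in the Census excerpt above are included.

```python
# 2020 data: jurisdiction -> (TB cases, midyear population, incidence reported in Table 1).
data_2020 = {
    "Total": (7_173, 331_501_080, 2.16),
    "Alabama": (72, 5_024_803, 1.43),
    "California": (1_706, 39_499_738, 4.32),
}

matches = {}
for jurisdiction, (cases, population, reported) in data_2020.items():
    computed = cases / population * 100_000
    # Table 1 reports incidence to two decimal places.
    matches[jurisdiction] = round(computed, 2) == reported
    print(jurisdiction, round(computed, 2), reported, matches[jurisdiction])
```

Adding a jurisdiction now means adding one dictionary entry instead of editing the arithmetic by hand.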
Filardo TD, Feng P, Pratt RH, Price SF, Self JL. Tuberculosis — United States, 2021. MMWR Morb Mortal Wkly Rep 2022;71:441–446. DOI: http://dx.doi.org/10.15585/mmwr.mm7112a1
U.S. Census, 2024. State Population Totals and Components of Change: 2010-2019. https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html
U.S. Census, 2025. State Population Totals and Components of Change: 2020-2024. https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html