Case Study: Rates

U.S. Tuberculosis Incidence

How can Python help us understand numerical data? Let’s explore a case study in the context of a public health data report!

Definitions

Before we dive into the case study, let’s describe some key data science terms: tables and rates.

Tables

In this course, we will describe many examples of data structures, which are ways to store and organize data, often for computational processing. One of the most common examples is a table:

A rectangular data structure composed of rows and columns. Columns are labeled.

We will expand on this definition of table over the next few lectures.

Rates

A rate is a measure of one quantity per unit of some other quantity. A few examples: 6 miles per hour, 4.8 parts per million. Rates are often expressed as a percentage, a fraction (with a numerator and a denominator), or a decimal number.

Data scientists often use and define rates to compare different situations or events. For example, it is easy to compare 60 mph to 40 mph, but harder to compare 15 mph to 10 km/s. Occasionally, it’s unclear why a rate is necessary until we dig into the data. Let’s inspect a particular case below.
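
One way to make the harder comparison concrete is to convert both rates to the same units first. The sketch below assumes the standard conversion of 1 mile = 1.609344 km; the variable names are illustrative and not part of the course materials.

```python
# Convert 10 km/s to miles per hour so it can be compared with 15 mph.
KM_PER_MILE = 1.609344       # definition of the international mile
SECONDS_PER_HOUR = 3600

speed_km_per_s = 10
speed_mph = speed_km_per_s * SECONDS_PER_HOUR / KM_PER_MILE

print(round(speed_mph))      # about 22,369 mph
print(speed_mph > 15)        # True: 10 km/s is far faster than 15 mph
```

Once both quantities share a unit, the comparison is as easy as 60 mph versus 40 mph.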

Rates: Incidence

The U.S. Centers for Disease Control and Prevention (CDC) regularly examines disease data nationwide and publishes reports for the public. These reports help inform public health policy and the resulting responses to disease epidemics at the national and global levels. One such disease is tuberculosis (TB), a highly contagious respiratory infection.

Consider the reported U.S. TB cases in 2021 (CDC Morbidity and Mortality Weekly Report (MMWR), 03/25/2022; see References). The report summary states:

Reported TB incidence (cases per 100,000 persons) increased 9.4%, from 2.2 during 2020 to 2.4 during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.

While the report discusses possible interpretations as to why this occurred, let’s focus on this particular numeric summary by answering the following questions:

  1. Define: What is incidence? Incidence is a rate. Why use this rate for comparison, and not the total number of cases?
  2. Verify: Consider Table 1. How can we use Python to verify that the TB incidence column can be computed from the TB cases column? How can we use Python to verify the reported percent change in incidence?

Look at the tabular data

There is a lot of information on the CDC website—more than we can cover in this example. Our main goal is to understand the numeric data presented in the summary statement quoted above. There are two sources of information relevant to us:

  1. The quote above, taken from the webpage summary.
  2. Table 1, located at the bottom of the webpage.

Before continuing, try reading Table 1. Hint: Read the footnotes!

Define: Incidence

In epidemiology, incidence is a rate that measures the number of cases of a disease in a population, within a given time period, as shown in Equation 1:

\[ \text{TB incidence} = \frac{\text{\# TB cases}}{\text{population}} \tag{1}\]

This rate can be interpreted as the number of cases (i.e., the number of individuals reported as diagnosed with TB) per person in the population.

Why not use TB cases? Why scale by population size? Based on the CDC summary, the intention of this report was to highlight a drop in TB cases in 2020, compared to adjacent years. The report accomplished this by reporting and comparing the TB incidence in 2019, 2020, and 2021.

Simply reporting the total number of TB cases has several pitfalls, including the inability to compare how common TB is across populations of different sizes.

While the precise explanation requires a strong understanding of probability (see Data 140), intuitively, incidence is a proxy for how frequently TB occurs in a population. As an example, Hawaii reported far fewer TB cases than California but had a much higher TB incidence across all three years. Hawaii has a much smaller population than California (1.5 million vs. 39 million), so each reported case of TB contributes more to its incidence.
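
The same pattern appears in the Table 1 excerpt shown under Verifying Incidence. As a small sketch (the variable names are ours), compare Alaska and California in 2020 using the reported case counts and incidences:

```python
# 2020 values copied from the Table 1 excerpt
# (incidence is cases per 100,000 persons).
reported_cases_2020 = {"Alaska": 58, "California": 1706}
reported_incidence_2020 = {"Alaska": 7.92, "California": 4.32}

# California reports many more cases than Alaska...
print(reported_cases_2020["California"] > reported_cases_2020["Alaska"])

# ...but Alaska's incidence is higher once population size is accounted for.
print(reported_incidence_2020["Alaska"] > reported_incidence_2020["California"])
```

Both comparisons print True: ranking jurisdictions by case count and by incidence can give opposite orderings.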


Next, let’s consider how the CDC defines TB incidence in this report. From the Table 1 footnote:

Cases per 100,000 persons using midyear population estimates from the U.S. Census Bureau.

Incorrect interpretation. First, consider the following (incorrect) ratio in Equation 2:

\[ \frac{\text{\# TB cases}}{100,000 \text{ persons}} \tag{2}\]

This rate does not account for different states having different population sizes. In other words, the population denominator in Equation 1 has disappeared.

Correct interpretation. Incidence is a measure of the rate of occurrence of a disease across a population. The Table 1 footnote rescales Equation 1 so that the rate is expressed per 100,000 persons rather than per single person.

If TB incidence in Equation 1 is the number of TB cases per person, then multiplying by 100,000 gives the number of TB cases per group, where a group is defined as 100,000 persons:

\[ \frac{\text{cases}}{\text{1 person}} \times \frac{100,000 \text{ persons}}{\text{group}} \]

The CDC definition of TB incidence is therefore represented by Equation 3:

\[ \frac{\text{\# TB cases}}{\# \text{people in population}} \times 100,000 \tag{3}\]
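
Equation 3 translates directly into a short Python function. This is a minimal sketch: the function name `incidence_per_100k` is our own, and the inputs are the 2020 values for Alabama and California from the tables in the Verifying Incidence section.

```python
def incidence_per_100k(cases, population):
    """Equation 3: cases divided by population, scaled to 100,000 persons."""
    return cases / population * 100_000

# 2020 TB cases (Table 1 excerpt) and 2020 population estimates (U.S. Census).
alabama_2020 = incidence_per_100k(72, 5_024_803)
california_2020 = incidence_per_100k(1_706, 39_499_738)

print(round(alabama_2020, 2), round(california_2020, 2))   # 1.43 4.32
```

Both results match the 2020 TB incidence column of the Table 1 excerpt.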

Verifying Incidence

To compute TB incidence, we need to source two pieces of data: the number of TB cases from Table 1, and the midyear population estimates from the U.S. Census. For your convenience, we’ve included data for a few U.S. jurisdictions in the two tables below.

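Table 1 excerpt: number of TB cases and TB incidence (cases per 100,000 persons).

| U.S. jurisdiction | No. of TB cases, 2019 | No. of TB cases, 2020 | No. of TB cases, 2021 | TB incidence, 2019 | TB incidence, 2020 | TB incidence, 2021 |
|---|---|---|---|---|---|---|
| Total | 8,900 | 7,173 | 7,860 | 2.71 | 2.16 | 2.37 |
| Alabama | 87 | 72 | 92 | 1.77 | 1.43 | 1.83 |
| Alaska | 58 | 58 | 58 | 7.91 | 7.92 | 7.92 |
| Arizona | 183 | 136 | 129 | 2.51 | 1.89 | 1.77 |
| Arkansas | 64 | 59 | 69 | 2.12 | 1.96 | 2.28 |
| California | 2,111 | 1,706 | 1,750 | 5.35 | 4.32 | 4.46 |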

Here are the population estimates for the corresponding years, sourced from the U.S. Census Bureau:

| U.S. jurisdiction | 2019 pop | 2020 pop | 2021 pop |
|---|---|---|---|
| Total | 328,239,523 | 331,501,080 | 331,893,745 |
| Alabama | 4,903,185 | 5,024,803 | 5,039,877 |
| California | 39,512,223 | 39,499,738 | 39,237,836 |

Let’s use Equation 3 to compute TB incidence in the U.S. in 2020:

7173 / 331501080 * 100000
2.1637938555132306

Rounding to the nearest tenth, we get the originally quoted rate, 2.2!

The previous cell is hard to follow without the preceding exposition. Python names and comments make the computation easier to understand:

# compute incidence as cases per 100k
pop_2020 = 331501080
tb_2020 = 7173
incidence_2020 = tb_2020 / pop_2020 * 100000
incidence_2020
2.1637938555132306

If we need to verify all incidences in the 2020 column, we might be tempted to edit the cell above, manually inputting the values for each U.S. jurisdiction. This approach is both tedious and error-prone (what if we input incorrect numbers?); in a few lectures, we will show you a much easier way using data structures: arrays and tables.
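
The second part of the Verify question, the reported 9.4% change in incidence, can be checked the same way. The sketch below assumes the usual percent-change formula, (new - old) / old × 100, which we presume is what the report used. It takes its inputs from the "Total" rows of the two tables above, and the 2021 variable names are ours.

```python
# Percent change in U.S. TB incidence from 2020 to 2021.
tb_2020, pop_2020 = 7173, 331501080
tb_2021, pop_2021 = 7860, 331893745

incidence_2020 = tb_2020 / pop_2020 * 100000
incidence_2021 = tb_2021 / pop_2021 * 100000

# Percent change relative to the earlier year, using unrounded incidences.
pct_change = (incidence_2021 - incidence_2020) / incidence_2020 * 100
print(round(pct_change, 1))            # 9.4

# Starting from the rounded rates 2.2 and 2.4 gives a different answer.
pct_change_rounded = (2.4 - 2.2) / 2.2 * 100
print(round(pct_change_rounded, 1))    # 9.1
```

The unrounded incidences reproduce the reported 9.4%, while the rounded rates give about 9.1%. This suggests the change was computed before rounding, and it is a reminder to round only at the final step.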

References

Filardo TD, Feng P, Pratt RH, Price SF, Self JL. Tuberculosis — United States, 2021. MMWR Morb Mortal Wkly Rep 2022;71:441–446. DOI: http://dx.doi.org/10.15585/mmwr.mm7112a1

U.S. Census, 2024. State Population Totals and Components of Change: 2010-2019. https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html

U.S. Census, 2025. State Population Totals and Components of Change: 2020-2024. https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html