Causality vs. EDA
John Snow and the Broad Street Pump
Read all of Chapter 2 of Inferential Thinking, which describes experimental setup and design in detail. It covers a story that is well known to data scientists: John Snow and the Broad Street Pump.
Before continuing, make sure that you are familiar with the following terminology:
- observational study
- causality
- association
- comparison
- treatment group
- control group
- randomized controlled trial (RCT)
Randomized Controlled Trials vs. Observational Studies
In his study of the Broad Street Pump, John Snow established a causal relationship (between what? Read Inferential Thinking Chapter 2 to find out) because he noted that the two groups he observed did not differ systematically in any way other than a single variable.
Today, randomized controlled trials are an excellent way to compare two groups of otherwise similar individuals. However, for most of this class we will not be able to conduct randomized controlled trials, because almost all of the datasets we analyze come from observational studies rather than experiments. Moreover, these datasets were largely collected by other researchers, and we may not know the full picture of how they collected the data. As a result, in this class we seek to understand associations between variables, and we will almost never seek to establish causal relationships between them.
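To illustrate what "randomized" means in a randomized controlled trial, here is a minimal sketch of random assignment. The function name `randomly_assign`, the participant IDs, and the seed are all hypothetical and chosen only for illustration. The point is that chance alone decides who ends up in each group, so the groups should not differ systematically before the treatment is applied.

```python
import random

def randomly_assign(participants, seed=None):
    """Split participants into treatment and control groups by chance alone."""
    rng = random.Random(seed)       # seeded only so the sketch is reproducible
    shuffled = list(participants)   # copy so the input list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical participant IDs, not real data.
participants = [f"person_{i}" for i in range(10)]
treatment, control = randomly_assign(participants, seed=0)

print("treatment:", treatment)
print("control:  ", control)
```

With pre-existing observational data we cannot perform this step, which is why the rest of these notes focus on associations.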
To learn more about causality, we encourage you to take a course on inferential thinking, such as Data 8 or Stat 20, or one of the many other Statistics courses.
Confounding
From Inferential Thinking, Ch 2.3 Establishing Causality:
In an observational study, if the treatment and control groups differ in ways other than the treatment, it is difficult to make conclusions about causality.
An underlying difference between the two groups (other than the treatment) is called a confounding factor, because it might confound you (that is, mess you up) when you try to reach a conclusion.
When confounding is present, two variables can be consistently associated with each other even though one does not cause the other; the association comes from the confounding factor instead.
To determine whether a confounding variable can account for the association between two variables, we can try to disaggregate by different values of the confounding variable.
In principle, this disaggregation process could be repeated for a potentially unlimited number of confounding variables. In practice, researchers generally don’t do this. Instead, we usually rely on assumptions drawn from social science theory or findings from prior studies to narrow our search for potential confounding variables that may influence the association between two variables.
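To make disaggregation concrete, here is a small sketch using made-up counts (they are not from any real study). The variables `smoker`, `coffee`, and the helper `disease_rate_by` are hypothetical names chosen for illustration. Overall, coffee drinkers in this toy data have a higher disease rate, but once we split the data by smoking status the rates for coffee drinkers and non-drinkers are identical within each group.

```python
from collections import defaultdict

# Made-up counts: (smoker, drinks_coffee, number_of_people, number_with_disease)
counts = [
    (True,  True,  80, 16),
    (True,  False, 20,  4),
    (False, True,  20,  1),
    (False, False, 80,  4),
]

def disease_rate_by(counts, key):
    """Return the disease rate for each group defined by key(smoker, coffee)."""
    people = defaultdict(int)
    sick = defaultdict(int)
    for smoker, coffee, n_people, n_sick in counts:
        group = key(smoker, coffee)
        people[group] += n_people
        sick[group] += n_sick
    return {group: sick[group] / people[group] for group in people}

# Aggregated: compare coffee drinkers with non-drinkers, ignoring smoking.
overall = disease_rate_by(counts, key=lambda smoker, coffee: coffee)

# Disaggregated: make the same comparison separately for each smoking status.
by_smoking = disease_rate_by(counts, key=lambda smoker, coffee: (smoker, coffee))

print("coffee vs. no coffee, overall:", overall[True], overall[False])
for smoker in (True, False):
    print("smoker =", smoker, "->",
          by_smoking[(smoker, True)], by_smoking[(smoker, False)])
```

In this toy example smoking accounts for the whole association. With real data the picture is rarely this clean, and other confounding variables may remain.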
Exploratory Data Analysis
If we’re not going to be studying causal relationships in this class, what will we be looking at? In this course we will look deeply at a core component of Data Science: Exploratory Data Analysis, or EDA.
Exploratory Data Analysis (EDA) is like detective work. In the words of the famous American statistician and mathematician John Tukey (we will discuss Tukey numbers soon):
Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those that we believe to be there.
More formally, Exploratory Data Analysis (EDA) is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data.
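As a rough sketch of what a first look at a new dataset might involve, the example below reads a tiny made-up CSV using only the Python standard library (in practice you would likely use a table library instead). The column names and values are invented for illustration. It lists the variables, counts missing entries, and tabulates one categorical column.

```python
import csv
import io
from collections import Counter

# Tiny made-up dataset standing in for a file we have never seen before.
raw = """name,age,city
Ana,34,Oakland
Ben,,Berkeley
Cy,29,oakland
Di,41,Berkeley
"""
rows = list(csv.DictReader(io.StringIO(raw)))

# Which variables are present, and how many records are there?
columns = list(rows[0].keys())
n_rows = len(rows)

# How many values are missing (empty) in each column?
missing = {col: sum(1 for row in rows if row[col] == "") for col in columns}

# What values does a categorical variable take?
city_counts = Counter(row["city"] for row in rows)

# What range does a numeric variable cover, skipping missing entries?
ages = [int(row["age"]) for row in rows if row["age"] != ""]
age_range = (min(ages), max(ages))

print(columns, n_rows)
print(missing)
print(city_counts)
print(age_range)
```

Even this quick pass surfaces possible issues with the data: one age is missing, and "Oakland" appears with two different capitalizations.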
Data Wrangling
A process very closely related to EDA is data wrangling, often called data cleaning. Data wrangling is the process of transforming raw data to facilitate subsequent analysis and can address issues like unclear structure or formatting, missing or corrupted values, unit conversions, and so on.
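The sketch below shows a few of these wrangling steps on made-up records: trimming and standardizing text, converting mixed units to a single unit, and marking values that cannot be parsed as missing. The field names and the helper `clean_height` are hypothetical, and the rules here are assumptions made for this example rather than a general recipe.

```python
# Made-up raw records with inconsistent formatting, mixed units, and a missing value.
raw_records = [
    {"city": " Oakland ", "height": "170 cm"},
    {"city": "berkeley",  "height": "5.5 ft"},
    {"city": "Berkeley",  "height": "N/A"},
]

CM_PER_FOOT = 30.48

def clean_height(text):
    """Convert '170 cm' or '5.5 ft' to centimeters; return None if unparseable."""
    parts = text.strip().split()
    if len(parts) != 2:
        return None
    value, unit = parts
    try:
        number = float(value)
    except ValueError:
        return None
    if unit == "cm":
        return number
    if unit == "ft":
        return number * CM_PER_FOOT
    return None  # unknown unit: treat as missing rather than guess

cleaned = [
    {
        "city": record["city"].strip().title(),       # fix whitespace and casing
        "height_cm": clean_height(record["height"]),  # one unit for every row
    }
    for record in raw_records
]

for record in cleaned:
    print(record)
```

Treating unparseable values as missing, instead of dropping the rows or guessing, keeps the decision visible for later analysis.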
EDA and data cleaning are often thought of as an “infinite loop,” with each process driving the other.
Fortunately, in this class we will try our best to work with “clean” datasets. These datasets will often have already been preprocessed, allowing us to explore and ask questions much more easily than if we were stuck with messier data.
External Reading
- (mentioned in notes) Computational and Inferential Thinking, Ch 5.1
- “Chapter 15: From Concepts to Models.” Elizabeth Heger Boyle, Deborah Carr, Benjamin Cornwell, Shelley Correll, Robert Crosnoe, Jeremy Freese, and Mary C. Waters. 2017. The Art and Science of Social Research. New York: W. W. Norton & Company.