Filtering and Boolean Predicates

Selecting rows based on a condition

The `where` Table Method

The where Table method filters rows based on the values of a column matching a specific condition. Up until now, we’ve considered only exact-match cases:

tbl.where("Column Name", some_value)

However, more often we will need to have more flexible filters, such as matching on a range of values. We can do so via Boolean predicates. Boolean predicates are a means of specifying conditions that must be True or False. In this class, we will use Boolean predicates with respect to the second predicate argument of the where Table method; you will therefore see such predicates referred to as Table.where predicates in the Python Reference.

The word “Boolean” stems from the boolean data type, which we will see soon to be a data type that has exactly two values: True and False. We will revisit this data type soon.

Read Inferential Thinking

Read Ch 6.2.1 - 6.2.4, which gives examples of Boolean predicates.

Ch 6.2.3 invokes method chaining, where the Table returned from one where call is used for the second where call. We discuss this a bit mroe below.

SAT Scores by State

Here are some other examples, as seen in lecture. Today we will be studying exam scores on a standardized high school exam, the SAT. We will be working with a dataset showing aggregated (average) SAT exam scores by state (source 1, source 2). It also shows the state’s participation rate, defined as the percentage of high school graduates who took the SAT exam.

Note: This data is from 2014, so the total combined score is out of 2400 (over three sections each out of 800) instead of 1600. We will add this column to our table before continuing.

sat = Table.read_table('data/sat2014-lecture.csv')
sat = sat.with_columns(
    'Combined',
    sat.column('Critical Reading') + \
        sat.column('Math') + \
        sat.column('Writing')
)
sat

State	Participation Rate	Critical Reading	Math	Writing	Combined
Alabama	6.7	547	538	532	1617
Alaska	54.2	507	503	475	1485
Arizona	36.4	522	525	500	1547
Arkansas	4.2	573	571	554	1698
California	60.3	498	510	496	1504
Colorado	14.3	582	586	567	1735
Connecticut	88.4	507	510	508	1525
Delaware	100	456	459	444	1359
District of Columbia	100	440	438	431	1309
Florida	72.2	491	485	472	1448

... (41 rows omitted)

A cursory observation

Let’s take a look at the states with the highest average combined score:

sat.sort("Combined", descending=True).take(np.arange(5))

State	Participation Rate	Critical Reading	Math	Writing	Combined
North Dakota	2.3	612	620	584	1816
Illinois	4.6	599	616	587	1802
Iowa	3.1	605	611	578	1794
South Dakota	2.9	604	609	579	1792
Minnesota	5.9	598	610	578	1786

…and let’s look at the states with the highest participation rate:

sat.sort("Participation Rate", descending=True).take(np.arange(5))

State	Participation Rate	Critical Reading	Math	Writing	Combined
Delaware	100	456	459	444	1359
District of Columbia	100	440	438	431	1309
Idaho	100	458	456	450	1364
Maine	95.6	467	471	449	1387
Connecticut	88.4	507	510	508	1525

What’s going on? Let’s dive in.

Defining conditions for the `where` method

We’ve already seen how we can use tbl.where() to find rows that exactly match what we’re looking for. For example:

sat.where('State', 'California')

But tbl.where is also capable of so much more! The second argument in .where can accept a predicate, which tells Python what condition to match rows on. See the Data 6 Python Reference.

sat.where("Combined", are.above(1800))

State	Participation Rate	Critical Reading	Math	Writing	Combined
Illinois	4.6	599	616	587	1802
North Dakota	2.3	612	620	584	1816

Note that are.equal_to(z) is the same as just passing in z itself as the second argument.

sat.where("State", are.containing("Dakota"))

State	Participation Rate	Critical Reading	Math	Writing	Combined
North Dakota	2.3	612	620	584	1816
South Dakota	2.9	604	609	579	1792

sat.where("Math", are.between(580, 600))

State	Participation Rate	Critical Reading	Math	Writing	Combined
Colorado	14.3	582	586	567	1735
Kansas	5.3	591	596	566	1753
Kentucky	4.6	589	585	572	1746
Missouri	4.2	595	597	579	1771
Nebraska	3.7	589	587	569	1745
Wyoming	3.3	590	599	573	1762

Method Chaining: Multiple Conditions

We can match rows to multiple conditions/predicates by chaining where method calls together. For example, we can look for states where the participation rate is above 20% and the average combined SAT score is above 1500.

sat.where("Participation Rate", are.above(20)).where("Combined", are.above(1500)) # Filter the `sat` table to find states where participation is above 20% and combined score is above 1500

State	Participation Rate	Critical Reading	Math	Writing	Combined
Arizona	36.4	522	525	500	1547
California	60.3	498	510	496	1504
Connecticut	88.4	507	510	508	1525
Massachusetts	84.1	516	531	509	1556
New Hampshire	70.3	524	530	512	1566
New Jersey	79.3	501	523	502	1526
Oregon	47.9	523	522	499	1544
Vermont	63.1	522	525	507	1554
Virginia	73.1	518	515	497	1530
Washington	63.1	510	518	491	1519

Equivalently:

# better formatting (note parentheses)
(
    sat.where("Participation Rate", are.above(20))
        .where("Combined", are.above(1500))
) # both of these have to be true!

State	Participation Rate	Critical Reading	Math	Writing	Combined
Arizona	36.4	522	525	500	1547
California	60.3	498	510	496	1504
Connecticut	88.4	507	510	508	1525
Massachusetts	84.1	516	531	509	1556
New Hampshire	70.3	524	530	512	1566
New Jersey	79.3	501	523	502	1526
Oregon	47.9	523	522	499	1544
Vermont	63.1	522	525	507	1554
Virginia	73.1	518	515	497	1530
Washington	63.1	510	518	491	1519

Another way to think about method chaining is to explicitly assign each subexpression that evaluates to a table:

sat1 = sat
sat2 = sat1.where("Participation Rate", are.above(20))
sat3 = sat2.where("Combined", are.above(1500))

Exploring our observation further

Let’s return to the curious observation we started with. These are tasks that we’ll let you explore in a notebook.

Task 1 Filter the sat table to find states where participation is below 10% and combined score is between 1200 and 1400.

... # your code here

Task 2: Filter the sat table to include only the states listed in the deep_south array.

deep_south = np.array(['Alabama', 'Georgia',
                       'Louisiana', 'Mississippi',
                       'South Carolina'])
... # your code here

Task 3: Find the states in the deep south with participation lower than 10% and combined score greater than or equal to 1600.

... # your code here

Finally, consider the scatter plot of all states’ participation rates and combined SAT scores.

import plotly.express as px

px.scatter(data_frame = sat.to_df(), 
           x = 'Combined', 
           y = 'Participation Rate', 
           hover_data = {'State': True},
           title = 'SAT (2014) Participation Rate by state')

Lower participation rate seems to be associated with higher SAT scores. But as we have seen several times in this course, correlation does not imply causation. Instead, consider that not everyone takes the SAT; participation rate for this exam varies by state and geographic location. The SAT is scheduled on a non-school day, usually a weekend, and students have to pay extra to take the exam.

If such an exam is effectively optional, then students that actually take the SAT are likely the ones more ready and able to perform well on such an exam—the top students. In other words, states with lower participation rates will exhibit a biased population of students who actually take the exam. You can read more in this blog post on ACT / SAT scores.

External Reading

(mentioned in notes) Computational and Inferential Thinking, Ch 6.2
(mentioned in notes) 2016. “Average SAT & ACT Scores by State (Participation Adjusted).” Prep Scholar. link