The where Table method filters rows based on the values of a column matching a specific condition. Up until now, we’ve considered only exact-match cases:
tbl.where("Column Name", some_value)
However, more often we will need to have more flexible filters, such as matching on a range of values. We can do so via Boolean predicates. Boolean predicates are a means of specifying conditions that must be True or False. In this class, we will use Boolean predicates with respect to the second predicate argument of the where Table method; you will therefore see such predicates referred to as Table.where predicates in the Python Reference.
The word “Boolean” stems from the boolean data type, which we will see soon to be a data type that has exactly two values: True and False. We will revisit this data type soon.
Read Inferential Thinking
Read Ch 6.2.1 - 6.2.4, which gives examples of Boolean predicates.
Ch 6.2.3 invokes method chaining, where the Table returned from one where call is used for the second where call. We discuss this a bit mroe below.
SAT Scores by State
Here are some other examples, as seen in lecture. Today we will be studying exam scores on a standardized high school exam, the SAT. We will be working with a dataset showing aggregated (average) SAT exam scores by state (source 1, source 2). It also shows the state’s participation rate, defined as the percentage of high school graduates who took the SAT exam.
Note: This data is from 2014, so the total combined score is out of 2400 (over three sections each out of 800) instead of 1600. We will add this column to our table before continuing.
We’ve already seen how we can use tbl.where() to find rows that exactly match what we’re looking for. For example:
sat.where('State', 'California')
But tbl.where is also capable of so much more! The second argument in .where can accept a predicate, which tells Python what condition to match rows on. See the Data 6 Python Reference.
sat.where("Combined", are.above(1800))
State
Participation Rate
Critical Reading
Math
Writing
Combined
Illinois
4.6
599
616
587
1802
North Dakota
2.3
612
620
584
1816
Note that are.equal_to(z) is the same as just passing in z itself as the second argument.
sat.where("State", are.containing("Dakota"))
State
Participation Rate
Critical Reading
Math
Writing
Combined
North Dakota
2.3
612
620
584
1816
South Dakota
2.9
604
609
579
1792
sat.where("Math", are.between(580, 600))
State
Participation Rate
Critical Reading
Math
Writing
Combined
Colorado
14.3
582
586
567
1735
Kansas
5.3
591
596
566
1753
Kentucky
4.6
589
585
572
1746
Missouri
4.2
595
597
579
1771
Nebraska
3.7
589
587
569
1745
Wyoming
3.3
590
599
573
1762
Method Chaining: Multiple Conditions
We can match rows to multiple conditions/predicates by chaining where method calls together. For example, we can look for states where the participation rate is above 20% and the average combined SAT score is above 1500.
sat.where("Participation Rate", are.above(20)).where("Combined", are.above(1500)) # Filter the `sat` table to find states where participation is above 20% and combined score is above 1500
State
Participation Rate
Critical Reading
Math
Writing
Combined
Arizona
36.4
522
525
500
1547
California
60.3
498
510
496
1504
Connecticut
88.4
507
510
508
1525
Massachusetts
84.1
516
531
509
1556
New Hampshire
70.3
524
530
512
1566
New Jersey
79.3
501
523
502
1526
Oregon
47.9
523
522
499
1544
Vermont
63.1
522
525
507
1554
Virginia
73.1
518
515
497
1530
Washington
63.1
510
518
491
1519
Equivalently:
# better formatting (note parentheses)( sat.where("Participation Rate", are.above(20)) .where("Combined", are.above(1500))) # both of these have to be true!
State
Participation Rate
Critical Reading
Math
Writing
Combined
Arizona
36.4
522
525
500
1547
California
60.3
498
510
496
1504
Connecticut
88.4
507
510
508
1525
Massachusetts
84.1
516
531
509
1556
New Hampshire
70.3
524
530
512
1566
New Jersey
79.3
501
523
502
1526
Oregon
47.9
523
522
499
1544
Vermont
63.1
522
525
507
1554
Virginia
73.1
518
515
497
1530
Washington
63.1
510
518
491
1519
Another way to think about method chaining is to explicitly assign each subexpression that evaluates to a table:
Let’s return to the curious observation we started with. These are tasks that we’ll let you explore in a notebook.
Task 1 Filter the sat table to find states where participation is below 10% and combined score is between 1200 and 1400.
... # your code here
Task 2: Filter the sat table to include only the states listed in the deep_south array.
deep_south = np.array(['Alabama', 'Georgia','Louisiana', 'Mississippi','South Carolina'])... # your code here
Task 3: Find the states in the deep south with participation lower than 10% and combined score greater than or equal to 1600.
... # your code here
Finally, consider the scatter plot of all states’ participation rates and combined SAT scores.
import plotly.express as pxpx.scatter(data_frame = sat.to_df(), x ='Combined', y ='Participation Rate', hover_data = {'State': True}, title ='SAT (2014) Participation Rate by state')
Lower participation rate seems to be associated with higher SAT scores. But as we have seen several times in this course, correlation does not imply causation. Instead, consider that not everyone takes the SAT; participation rate for this exam varies by state and geographic location. The SAT is scheduled on a non-school day, usually a weekend, and students have to pay extra to take the exam.
If such an exam is effectively optional, then students that actually take the SAT are likely the ones more ready and able to perform well on such an exam—the top students. In other words, states with lower participation rates will exhibit a biased population of students who actually take the exam. You can read more in this blog post on ACT / SAT scores.
External Reading
(mentioned in notes) Computational and Inferential Thinking, Ch 6.2
(mentioned in notes) 2016. “Average SAT & ACT Scores by State (Participation Adjusted).” Prep Scholar. link