Filtering and Boolean Predicates

Selecting rows based on a condition

The where Table Method

The where Table method filters rows based on the values of a column matching a specific condition. Up until now, we’ve considered only exact-match cases:

tbl.where("Column Name", some_value)

However, more often we will need to have more flexible filters, such as matching on a range of values. We can do so via Boolean predicates. Boolean predicates are a means of specifying conditions that must be True or False. In this class, we will use Boolean predicates with respect to the second predicate argument of the where Table method; you will therefore see such predicates referred to as Table.where predicates in the Python Reference.

The word “Boolean” stems from the boolean data type, which we will see soon to be a data type that has exactly two values: True and False. We will revisit this data type soon.

Read Inferential Thinking

Read Ch 6.2.1 - 6.2.4, which gives examples of Boolean predicates.

Ch 6.2.3 invokes method chaining, where the Table returned from one where call is used for the second where call. We discuss this a bit mroe below.

SAT Scores by State

Here are some other examples, as seen in lecture. Today we will be studying exam scores on a standardized high school exam, the SAT. We will be working with a dataset showing aggregated (average) SAT exam scores by state (source 1, source 2). It also shows the state’s participation rate, defined as the percentage of high school graduates who took the SAT exam.

Note: This data is from 2014, so the total combined score is out of 2400 (over three sections each out of 800) instead of 1600. We will add this column to our table before continuing.

sat = Table.read_table('data/sat2014-lecture.csv')
sat = sat.with_columns(
    'Combined',
    sat.column('Critical Reading') + \
        sat.column('Math') + \
        sat.column('Writing')
)
sat
State Participation Rate Critical Reading Math Writing Combined
Alabama 6.7 547 538 532 1617
Alaska 54.2 507 503 475 1485
Arizona 36.4 522 525 500 1547
Arkansas 4.2 573 571 554 1698
California 60.3 498 510 496 1504
Colorado 14.3 582 586 567 1735
Connecticut 88.4 507 510 508 1525
Delaware 100 456 459 444 1359
District of Columbia 100 440 438 431 1309
Florida 72.2 491 485 472 1448

... (41 rows omitted)

A cursory observation

Let’s take a look at the states with the highest average combined score:

sat.sort("Combined", descending=True).take(np.arange(5))
State Participation Rate Critical Reading Math Writing Combined
North Dakota 2.3 612 620 584 1816
Illinois 4.6 599 616 587 1802
Iowa 3.1 605 611 578 1794
South Dakota 2.9 604 609 579 1792
Minnesota 5.9 598 610 578 1786

…and let’s look at the states with the highest participation rate:

sat.sort("Participation Rate", descending=True).take(np.arange(5))
State Participation Rate Critical Reading Math Writing Combined
Delaware 100 456 459 444 1359
District of Columbia 100 440 438 431 1309
Idaho 100 458 456 450 1364
Maine 95.6 467 471 449 1387
Connecticut 88.4 507 510 508 1525

What’s going on? Let’s dive in.

Defining conditions for the where method

We’ve already seen how we can use tbl.where() to find rows that exactly match what we’re looking for. For example:

sat.where('State', 'California')

But tbl.where is also capable of so much more! The second argument in .where can accept a predicate, which tells Python what condition to match rows on. See the Data 6 Python Reference.

sat.where("Combined", are.above(1800))
State Participation Rate Critical Reading Math Writing Combined
Illinois 4.6 599 616 587 1802
North Dakota 2.3 612 620 584 1816

Note that are.equal_to(z) is the same as just passing in z itself as the second argument.

sat.where("State", are.containing("Dakota"))
State Participation Rate Critical Reading Math Writing Combined
North Dakota 2.3 612 620 584 1816
South Dakota 2.9 604 609 579 1792
sat.where("Math", are.between(580, 600))
State Participation Rate Critical Reading Math Writing Combined
Colorado 14.3 582 586 567 1735
Kansas 5.3 591 596 566 1753
Kentucky 4.6 589 585 572 1746
Missouri 4.2 595 597 579 1771
Nebraska 3.7 589 587 569 1745
Wyoming 3.3 590 599 573 1762

Method Chaining: Multiple Conditions

We can match rows to multiple conditions/predicates by chaining where method calls together. For example, we can look for states where the participation rate is above 20% and the average combined SAT score is above 1500.

sat.where("Participation Rate", are.above(20)).where("Combined", are.above(1500)) # Filter the `sat` table to find states where participation is above 20% and combined score is above 1500
State Participation Rate Critical Reading Math Writing Combined
Arizona 36.4 522 525 500 1547
California 60.3 498 510 496 1504
Connecticut 88.4 507 510 508 1525
Massachusetts 84.1 516 531 509 1556
New Hampshire 70.3 524 530 512 1566
New Jersey 79.3 501 523 502 1526
Oregon 47.9 523 522 499 1544
Vermont 63.1 522 525 507 1554
Virginia 73.1 518 515 497 1530
Washington 63.1 510 518 491 1519

Equivalently:

# better formatting (note parentheses)
(
    sat.where("Participation Rate", are.above(20))
        .where("Combined", are.above(1500))
) # both of these have to be true!
State Participation Rate Critical Reading Math Writing Combined
Arizona 36.4 522 525 500 1547
California 60.3 498 510 496 1504
Connecticut 88.4 507 510 508 1525
Massachusetts 84.1 516 531 509 1556
New Hampshire 70.3 524 530 512 1566
New Jersey 79.3 501 523 502 1526
Oregon 47.9 523 522 499 1544
Vermont 63.1 522 525 507 1554
Virginia 73.1 518 515 497 1530
Washington 63.1 510 518 491 1519

Another way to think about method chaining is to explicitly assign each subexpression that evaluates to a table:

sat1 = sat
sat2 = sat1.where("Participation Rate", are.above(20))
sat3 = sat2.where("Combined", are.above(1500))

Exploring our observation further

Let’s return to the curious observation we started with. These are tasks that we’ll let you explore in a notebook.

Task 1 Filter the sat table to find states where participation is below 10% and combined score is between 1200 and 1400.

... # your code here

Task 2: Filter the sat table to include only the states listed in the deep_south array.

deep_south = np.array(['Alabama', 'Georgia',
                       'Louisiana', 'Mississippi',
                       'South Carolina'])
... # your code here

Task 3: Find the states in the deep south with participation lower than 10% and combined score greater than or equal to 1600.

... # your code here

Finally, consider the scatter plot of all states’ participation rates and combined SAT scores.

import plotly.express as px

px.scatter(data_frame = sat.to_df(), 
           x = 'Combined', 
           y = 'Participation Rate', 
           hover_data = {'State': True},
           title = 'SAT (2014) Participation Rate by state')

Lower participation rate seems to be associated with higher SAT scores. But as we have seen several times in this course, correlation does not imply causation. Instead, consider that not everyone takes the SAT; participation rate for this exam varies by state and geographic location. The SAT is scheduled on a non-school day, usually a weekend, and students have to pay extra to take the exam.

If such an exam is effectively optional, then students that actually take the SAT are likely the ones more ready and able to perform well on such an exam—the top students. In other words, states with lower participation rates will exhibit a biased population of students who actually take the exam. You can read more in this blog post on ACT / SAT scores.

External Reading

  • (mentioned in notes) Computational and Inferential Thinking, Ch 6.2
  • (mentioned in notes) 2016. “Average SAT & ACT Scores by State (Participation Adjusted).” Prep Scholar. link