Table Fundamentals

Read Inferential Thinking

Read Ch 6 intro, which describes the Table object type from the datascience library.

This is a very dense chapter, as it has lots of Python syntax. We will review it in detail. Before moving on, we encourage you to focus on the following questions:

  1. What is the syntax of the with_columns method? How does this method make a table?
  2. What are some table methods?
  3. What Python expression gets you the number of rows? a specific column? (bonus) How does the syntax differ between these two cases?

If you are new to programming, it is easy to get lost in the syntax of Python! For today, we will therefore focus on a few high-level goals:

In other words, you will not be expected to memorize all aspects of Tables (it’s as if we have a theme in this class…)! But you will have natural impulses for understanding and working with tabular data, many of which directly map to specific Table methods and attributes. To write Python code, then, you will need to learn to work with documentation to do this translation, and—when the precise methods don’t exist—construct new algorithms to achieve what you want. The more familiar you are with the documentation—and consequently, what is possible with Tables—the more you can focus on algorithmic thinking.

Documentation, documentation, documentation

This entire note revolves around getting used to using the documentation on the Data 6 Python Reference page. Keep that page open as you read this one!

Definitions

To begin, recall the terminology we have been using to describe tables so far:

  • Table: Retangular data structure.
    • Columns (vertical) correspond to variables (AKA features, or attributes) that measure and operationalize social concepts. We discuss variables and social concepts more in another note.
    • Rows (horizontal) correspond to records (AKA entries), which are specific values of variables for a given individual, group, etc.—whatever your unit of analysis is.

We work with the datascience package’s Table.

from datascience import *
import numpy as np

Today’s Dataset: Schools

This lecture we will return to the dataset of public universities in California (Wikipedia).

To work with this table, we first load it in from a file called data/cal_unis.csv.

schools = Table.read_table('data/cal_unis.csv')

Let’s break down what we meant by “load it in”: * The right-hand-side of the assignment calls a function that returns a datascience Table object from the provided data file. * The left-hand-side of the assignment assigns this object to the Python name schools. * After this assignment statement, then we can use schools as the particular Table object that has the tabular data we want.

Running the below cell evaluates schools and displays that tabular data to us:

schools
Name Institution City County Enrollment Founded
California State Polytechnic University, Humboldt CSU Arcata Humboldt 6025 1913
California State University, Bakersfield CSU Bakersfield Kern 9613 1965
University of California, Berkeley UC Berkeley Alameda 45307 1869
California State University Channel Islands CSU Camarillo Ventura 6128 2002
California State University, Dominguez Hills CSU Carson Los Angeles 16426 1960
California State University, Chico CSU Chico Butte 14183 1887
University of California, Davis UC Davis Yolo 39679 1905
California State University, Fresno CSU Fresno Fresno 23999 1911
California State University, Fullerton CSU Fullerton Orange 40386 1957
California State University, East Bay CSU Hayward Alameda 13673 1957

... (22 rows omitted)

Let’s get to exploring and understanding our tabular data!

Find Table Dimensions

First off: How many rows and columns are in our schools table?

print(schools.num_rows)
print(schools.num_columns)
32
6

What variables are being recorded?

print(schools.labels)
('Name', 'Institution', 'City', 'County', 'Enrollment', 'Founded')

We didn’t conjure these expressions out of nowhere. Instead, we:

  1. Looked at the Data 6 Python Reference page
  2. Scrolled down to the “Tables and Table Methods” section
  3. Skimmed the reference until we found documentation that seemed close to what we want
  4. Translated the name tbl into schools.
  5. Tried it out

Bonus: You’ll notice the “dot” syntax for accessing these values, also known as table attributes. See the Bonus last section for a more detailed explanation of this programming terminology.

Column-First Paradigm

Because columns represent variables, we will take a column-first approach to tables.

To extract, delete, or make columns from a table named tbl:

  • tbl.select(...) returns a new table with a subset of columns.
  • tbl.drop(...) and returns a new table without a subset of columns
  • tbl.with_columns(...) returns a new table with additional new column(s).
schools.select('Name', 'Enrollment')
Name Enrollment
California State Polytechnic University, Humboldt 6025
California State University, Bakersfield 9613
University of California, Berkeley 45307
California State University Channel Islands 6128
California State University, Dominguez Hills 16426
California State University, Chico 14183
University of California, Davis 39679
California State University, Fresno 23999
California State University, Fullerton 40386
California State University, East Bay 13673

... (22 rows omitted)

schools.drop('Founded', 'County')
Name Institution City Enrollment
California State Polytechnic University, Humboldt CSU Arcata 6025
California State University, Bakersfield CSU Bakersfield 9613
University of California, Berkeley UC Berkeley 45307
California State University Channel Islands CSU Camarillo 6128
California State University, Dominguez Hills CSU Carson 16426
California State University, Chico CSU Chico 14183
University of California, Davis UC Davis 39679
California State University, Fresno CSU Fresno 23999
California State University, Fullerton CSU Fullerton 40386
California State University, East Bay CSU Hayward 13673

... (22 rows omitted)

Note: The above two cells were run in sequence. Each table method returns a new table, so our original table schools is unchanged.

Modifying Tables

Table methods return copies

All table methods return copies of information from the original table! This paradigm is quite useful for data analysis. From Inferential Thinking:

[Table methods] create new smaller tables that share the same data. The fact that the original table is preserved is useful! You can generate multiple different tables that only consider certain columns without worrying that one analysis will affect the other.

If we would like to modify the original table, then we must re-assign schools to the new return value:

schools = schools.with_columns(
    "Years since founding",
    2025 - schools.column("Founded")
)
schools
Name Institution City County Enrollment Founded Years since founding
California State Polytechnic University, Humboldt CSU Arcata Humboldt 6025 1913 112
California State University, Bakersfield CSU Bakersfield Kern 9613 1965 60
University of California, Berkeley UC Berkeley Alameda 45307 1869 156
California State University Channel Islands CSU Camarillo Ventura 6128 2002 23
California State University, Dominguez Hills CSU Carson Los Angeles 16426 1960 65
California State University, Chico CSU Chico Butte 14183 1887 138
University of California, Davis UC Davis Yolo 39679 1905 120
California State University, Fresno CSU Fresno Fresno 23999 1911 114
California State University, Fullerton CSU Fullerton Orange 40386 1957 68
California State University, East Bay CSU Hayward Alameda 13673 1957 68

... (22 rows omitted)

Filtering Rows

Often we would like to perform row filtering, where we only extract row entries that match a specific feature.

By exact match: tbl.where(column, value). Create a new table of only the rows where column column matches the value value.

schools.where("Institution", "UC")
Name Institution City County Enrollment Founded Years since founding
University of California, Berkeley UC Berkeley Alameda 45307 1869 156
University of California, Davis UC Davis Yolo 39679 1905 120
University of California, Irvine UC Irvine Orange 35937 1965 60
University of California, Los Angeles UC Los Angeles Los Angeles 46430 1882 143
University of California, Merced UC Merced Merced 9110 2005 20
University of California, Riverside UC Riverside Riverside 26809 1954 71
University of California, San Diego UC San Diego San Diego 42006 1960 65
University of California, Santa Barbara UC Santa Barbara Santa Barbara 26420 1891 134
University of California, Santa Cruz UC Santa Cruz Santa Cruz 19478 1965 60

By index: tbl.take(row_indices). Create a new table of only the rows with an index specified in row_indices.

schools.take(1, 3, 5)
Name Institution City County Enrollment Founded Years since founding
California State University, Bakersfield CSU Bakersfield Kern 9613 1965 60
California State University Channel Islands CSU Camarillo Ventura 6128 2002 23
California State University, Chico CSU Chico Butte 14183 1887 138

Are rows zero-indexed or one-indexed? Why?

Creating New Tables

So far we’ve used a table with pre-existing data. We can also make new Tables with a special function: Table().

Table()

Check it out!!!

states = Table().with_columns('State', np.array(['California', 'New York', 'Florida', 'Texas', 'Pennsylvania']),'Code', np.array(['CA', 'NY', 'FL', 'TX', 'PA']), 'Population (millions)', np.array([39.3, 19.3, 21.7, 29.3, 12.8]))
states
State Code Population (millions)
California CA 39.3
New York NY 19.3
Florida FL 21.7
Texas TX 29.3
Pennsylvania PA 12.8

…unfortunately, due to poor coding style, the above code is quite unreadable. Let’s use whitespace (new lines and indents) to delineate what we are doing:

states = Table().with_columns(
  'State', make_array('California', 'New York', 'Florida', 'Texas', 'Pennsylvania'),
  'Code', make_array('CA', 'NY', 'FL', 'TX', 'PA'),
  'Population (millions)', make_array(39.3, 19.3, 21.7, 29.3, 12.8)
)
states
State Code Population (millions)
California CA 39.3
New York NY 19.3
Florida FL 21.7
Texas TX 29.3
Pennsylvania PA 12.8

Method Chaining

Method chaining in Python is when the object returned from one method becomes the object to use in the next method. In the above case:

  1. Calling the Table returns an empty table.
  2. The method with_columns is called on the empty table object, returning a table with the columns State, Code, and Population.
  3. This return value is then assigned to states.
Whitespace formatting

Note the closed parenthesis on the final line. This helps “match” parentheses together.

Bonus: Methods vs. Attributes

Return to the third focus question from the Inferential Thinking reading.

What Python expression gets you the number of rows? a specific column? (bonus) How does the syntax differ between these two cases?

The textbook wording “overloads” two different Python syntax constructs. From the Python glossary:

Attribute

Attribute: A value associated with an object which is usually referenced by name using dotted expressions. For example, if an object o has an attribute a it would be referenced as o.a.

In other words, attributes are named values tied to specific objects. To get these values, we must refer to them by their name using “dot” syntax.

schools.num_rows
32
Method

Method A function which is defined inside a class body.

The Python glossary definition above is a more opaque definition than we would like. But you can think of the methods we study in this class as functions that take into account the attributes and values of specific objects. In order to call them on a specific operation, we also use “dot” syntax. Because they are functions, we must use call expression syntax with parenthesis, and provide arguments as needed.

schools.column("Enrollment")
array([ 6025,  9613, 45307,  6128, 16426, 14183, 39679, 23999, 40386,
       13673, 35937, 38973, 46430, 26460,  9110, 37579, 27503, 26809,
        6637, 31818, 19803, 42006, 37402, 25282, 35751, 22000, 15109,
       26420, 19478,  7045, 10154,  1017])

Which to syntax use? To determine which “dot” syntax to use, read the documentation. The entire Table class has been defined for us already, and the designers of the datascience package have predetermined which table features make sense to have as attributes, and which features make sense to return as values from a method call. But as a rule of thumb, if you need to specify any additional detail to get what you want, you will likely need to call a function: use table methods and pass in arguments. See the column method call above.

Another key indicator is what happens when you use incorrect syntax:

schools.num_rows()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[16], line 1
----> 1 schools.num_rows()

TypeError: 'int' object is not callable

The above error message means that Python assumed you wanted to call a function num_rows (specifically, a method of the Table schools); however, when run, num_rows was actually an int data type attribute, and ints are not callable. Similarly, below:

schools.column
<bound method Table.column of Name                                              | Institution | City        | County      | Enrollment | Founded | Years since founding
California State Polytechnic University, Humboldt | CSU         | Arcata      | Humboldt    | 6025       | 1913    | 112
California State University, Bakersfield          | CSU         | Bakersfield | Kern        | 9613       | 1965    | 60
University of California, Berkeley                | UC          | Berkeley    | Alameda     | 45307      | 1869    | 156
California State University Channel Islands       | CSU         | Camarillo   | Ventura     | 6128       | 2002    | 23
California State University, Dominguez Hills      | CSU         | Carson      | Los Angeles | 16426      | 1960    | 65
California State University, Chico                | CSU         | Chico       | Butte       | 14183      | 1887    | 138
University of California, Davis                   | UC          | Davis       | Yolo        | 39679      | 1905    | 120
California State University, Fresno               | CSU         | Fresno      | Fresno      | 23999      | 1911    | 114
California State University, Fullerton            | CSU         | Fullerton   | Orange      | 40386      | 1957    | 68
California State University, East Bay             | CSU         | Hayward     | Alameda     | 13673      | 1957    | 68
... (22 rows omitted)>

The above statement did not error, but it did not output a column, either. Rather, it output a method of a table, but it didn’t call this method.