Table Fundamentals

Read Inferential Thinking

Read Ch 6 intro, which describes the Table object type from the datascience library.

This is a very dense chapter, as it has lots of Python syntax. We will review it in detail. Before moving on, we encourage you to focus on the following questions:

  1. What is the syntax of the with_columns method? How does this method make a table?
  2. What are some table methods?
  3. What Python expression gets you the number of rows? a specific column? (bonus) How does the syntax differ between these two cases?

If you are new to programming, it is easy to get lost in the syntax of Python! For today, we will therefore focus on a few high-level goals:

In other words, you will not be expected to memorize all aspects of Tables (it’s as if we have a theme in this class…)! But you will have natural impulses for understanding and working with tabular data, many of which directly map to specific Table methods and attributes. To write Python code, then, you will need to learn to work with documentation to do this translation, and—when the precise methods don’t exist—construct new algorithms to achieve what you want. The more familiar you are with the documentation—and consequently, what is possible with Tables—the more you can focus on algorithmic thinking.

Documentation, documentation, documentation

This entire note revolves around getting used to using the documentation on the Data 6 Python Reference page. Keep that page open as you read this one!

Definitions

To begin, recall the terminology we have been using to describe tables so far:

  • Table: Retangular data structure.
    • Columns (vertical) correspond to variables (AKA features, or attributes) that measure and operationalize social concepts.
    • Rows (horizontal) correspond to records (AKA entries), which are specific values of variables for a given individual, group, etc.—whatever your unit of analysis is.

We work with the datascience package’s Table.

from datascience import *
import numpy as np

Today’s Dataset: Schools

This lecture we will return to the dataset of public four-year colleges and universities in California (Wikipedia).

To work with this table, we first load it in from a file called data/cal_unis.csv.

schools = Table.read_table('data/cal_unis.csv')

Let’s break down what we meant by “load it in”: * The right-hand-side of the assignment calls a function that returns a datascience Table object from the provided data file. * The left-hand-side of the assignment assigns this object to the Python name schools. * After this assignment statement, then we can use schools as the particular Table object that has the tabular data we want.

Running the below cell evaluates schools and displays that tabular data to us:

schools
Name Institution City County Enrollment Founded
University of California, Berkeley UC Berkeley Alameda 42519 1869
University of California, Davis UC Davis Yolo 39152 1905
University of California, Irvine UC Irvine Orange 35220 1965
University of California, Los Angeles UC Los Angeles Los Angeles 45428 1882
University of California, Merced UC Merced Merced 8544 2005
University of California, Riverside UC Riverside Riverside 23278 1954
University of California, San Diego UC San Diego San Diego 38798 1960
University of California, Santa Barbara UC Santa Barbara Santa Barbara 24346 1891
University of California, Santa Cruz UC Santa Cruz Santa Cruz 19700 1965
California State University Maritime Academy CSU Vallejo Solano 1017 1929

... (22 rows omitted)

Let’s get to exploring and understanding our tabular data!

Exercise 1: Find Table Dimensions

First off: How many rows and columns are in our schools table?

print(schools.num_rows)
print(schools.num_columns)
32
6

What variables are being recorded?

print(schools.labels)
('Name', 'Institution', 'City', 'County', 'Enrollment', 'Founded')

We didn’t conjure these expressions out of nowhere. Instead, we:

  1. Looked at the Data 6 Python Reference page
  2. Scrolled down to the “Tables and Table Methods” section
  3. Skimmed the reference until we found documentation that seemed close to what we want
  4. Translated the name tbl into schools.
  5. Tried it out

Bonus: You’ll notice the “dot” syntax for accessing these values, also known as table attributes. See the Bonus last section for a more detailed explanation of this programming terminology.

Column-First Paradigm

Because columns represent variables, we will take a column-first approach to tables.

To extract, delete, or make columns from a table named tbl:

  • tbl.select(...) returns a new table with a subset of columns.
  • tbl.drop(...) and returns a new table without a subset of columns
  • tbl.with_columns(...) returns a new table with additional new column(s).
schools.select('Name', 'Enrollment')
Name Enrollment
University of California, Berkeley 42519
University of California, Davis 39152
University of California, Irvine 35220
University of California, Los Angeles 45428
University of California, Merced 8544
University of California, Riverside 23278
University of California, San Diego 38798
University of California, Santa Barbara 24346
University of California, Santa Cruz 19700
California State University Maritime Academy 1017

... (22 rows omitted)

schools.drop('Founded', 'County')
Name Institution City Enrollment
University of California, Berkeley UC Berkeley 42519
University of California, Davis UC Davis 39152
University of California, Irvine UC Irvine 35220
University of California, Los Angeles UC Los Angeles 45428
University of California, Merced UC Merced 8544
University of California, Riverside UC Riverside 23278
University of California, San Diego UC San Diego 38798
University of California, Santa Barbara UC Santa Barbara 24346
University of California, Santa Cruz UC Santa Cruz 19700
California State University Maritime Academy CSU Vallejo 1017

... (22 rows omitted)

Note: The above two cells were run in sequence. Each table method returns a new table, so our original table schools is unchanged.

Modifying Tables

Table methods return copies

All table methods return copies of information from the original table! This paradigm is quite useful for data analysis. From Inferential Thinking:

[Table methods] create new smaller tables that share the same data. The fact that the original table is preserved is useful! You can generate multiple different tables that only consider certain columns without worrying that one analysis will affect the other.

If we would like to modify the original table, then we must re-assign schools to the new return value:

schools = schools.with_columns(
    "Years since founding",
    2025 - schools.column("Founded")
)
schools
Name Institution City County Enrollment Founded Years since founding
University of California, Berkeley UC Berkeley Alameda 42519 1869 156
University of California, Davis UC Davis Yolo 39152 1905 120
University of California, Irvine UC Irvine Orange 35220 1965 60
University of California, Los Angeles UC Los Angeles Los Angeles 45428 1882 143
University of California, Merced UC Merced Merced 8544 2005 20
University of California, Riverside UC Riverside Riverside 23278 1954 71
University of California, San Diego UC San Diego San Diego 38798 1960 65
University of California, Santa Barbara UC Santa Barbara Santa Barbara 24346 1891 134
University of California, Santa Cruz UC Santa Cruz Santa Cruz 19700 1965 60
California State University Maritime Academy CSU Vallejo Solano 1017 1929 96

... (22 rows omitted)

Filtering Rows

Often we would like to perform row filtering, where we only extract row entries that match a specific feature.

By exact match: tbl.where(column, value). Create a new table of only the rows where column column matches the value value.

schools.where("Institution", "UC")
Name Institution City County Enrollment Founded Years since founding
University of California, Berkeley UC Berkeley Alameda 42519 1869 156
University of California, Davis UC Davis Yolo 39152 1905 120
University of California, Irvine UC Irvine Orange 35220 1965 60
University of California, Los Angeles UC Los Angeles Los Angeles 45428 1882 143
University of California, Merced UC Merced Merced 8544 2005 20
University of California, Riverside UC Riverside Riverside 23278 1954 71
University of California, San Diego UC San Diego San Diego 38798 1960 65
University of California, Santa Barbara UC Santa Barbara Santa Barbara 24346 1891 134
University of California, Santa Cruz UC Santa Cruz Santa Cruz 19700 1965 60

By index: tbl.take(row_indices). Create a new table of only the rows with an index specified in row_indices.

schools.take(1, 3, 5)
Name Institution City County Enrollment Founded Years since founding
University of California, Davis UC Davis Yolo 39152 1905 120
University of California, Los Angeles UC Los Angeles Los Angeles 45428 1882 143
University of California, Riverside UC Riverside Riverside 23278 1954 71

Are rows zero-indexed or one-indexed? Why?

Creating New Tables

So far we’ve used a table with pre-existing data. We can also make new Tables with a special function: Table().

Table()

Check it out!!!

states = Table().with_columns('State', np.array(['California', 'New York', 'Florida', 'Texas', 'Pennsylvania']),'Code', np.array(['CA', 'NY', 'FL', 'TX', 'PA']), 'Population (millions)', np.array([39.3, 19.3, 21.7, 29.3, 12.8]))
states
State Code Population (millions)
California CA 39.3
New York NY 19.3
Florida FL 21.7
Texas TX 29.3
Pennsylvania PA 12.8

…unfortunately, due to poor coding style, the above code is quite unreadable. Let’s use whitespace (new lines and indents) to delineate what we are doing:

states = Table().with_columns(
  'State', np.array(['California', 'New York', 'Florida', 'Texas', 'Pennsylvania']),
  'Code', np.array(['CA', 'NY', 'FL', 'TX', 'PA']),
  'Population (millions)', np.array([39.3, 19.3, 21.7, 29.3, 12.8])
)
states
State Code Population (millions)
California CA 39.3
New York NY 19.3
Florida FL 21.7
Texas TX 29.3
Pennsylvania PA 12.8

Method Chaining

Method chaining in Python is when the object returned from one method becomes the object to use in the next method. In this case:

  1. Calling the Table returns an empty table.
  2. The method with_columns is called on the empty table object, returning a table with the columns State, Code, and Population.
  3. This return value is then assigned to states.
Whitespace formatting

Note the closed parenthesis on the final line. This helps “match” parentheses together.

Bonus: Methods vs. Attributes

Return to the third focus question from the Inferential Thinking reading.

What Python expression gets you the number of rows? a specific column? (bonus) How does the syntax differ between these two cases?

The textbook wording “overloads” two different Python syntax constructs. From the Python glossary:

Attribute

Attribute: A value associated with an object which is usually referenced by name using dotted expressions. For example, if an object o has an attribute a it would be referenced as o.a.

In other words, attributes are named values tied to specific objects. To get these values, we must refer to them by their name using “dot” syntax.

schools.num_rows
32
Method

Method A function which is defined inside a class body.

The Python glossary definition above is a more opaque definition than we would like. But you can think of the methods we study in this class as functions that take into account the attributes and values of specific objects. In order to call them on a specific operation, we also use “dot” syntax. Because they are functions, we must use call expression syntax with parenthesis, and provide arguments as needed.

schools.column("Enrollment")
array([42519, 39152, 35220, 45428,  8544, 23278, 38798, 24346, 19700,
        1017, 21812, 26443, 10493,  7095, 17488, 15741, 14525, 24995,
       39774, 36846, 27685,  7079, 38716, 31131, 19973, 14511, 10214,
        7774, 34881, 29586, 32828,  9201])

Which to syntax use? To determine which “dot” syntax to use, read the documentation. The entire Table class has been defined for us already, and the designers of the datascience package have predetermined which table features make sense to have as attributes, and which features make sense to return as values from a method call. But as a rule of thumb, if you need to specify any additional detail to get what you want, you will likely need to call a function: use table methods and pass in arguments. See the column method call above.

Another key indicator is what happens when you use incorrect syntax:

schools.num_rows()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[16], line 1
----> 1 schools.num_rows()

TypeError: 'int' object is not callable

The above error message means that Python assumed you wanted to call a function num_rows (specifically, a method of the Table schools); however, when run, num_rows was actually an int data type attribute, and ints are not callable. Similarly, below:

schools.column
<bound method Table.column of Name                                         | Institution | City          | County        | Enrollment | Founded | Years since founding
University of California, Berkeley           | UC          | Berkeley      | Alameda       | 42519      | 1869    | 156
University of California, Davis              | UC          | Davis         | Yolo          | 39152      | 1905    | 120
University of California, Irvine             | UC          | Irvine        | Orange        | 35220      | 1965    | 60
University of California, Los Angeles        | UC          | Los Angeles   | Los Angeles   | 45428      | 1882    | 143
University of California, Merced             | UC          | Merced        | Merced        | 8544       | 2005    | 20
University of California, Riverside          | UC          | Riverside     | Riverside     | 23278      | 1954    | 71
University of California, San Diego          | UC          | San Diego     | San Diego     | 38798      | 1960    | 65
University of California, Santa Barbara      | UC          | Santa Barbara | Santa Barbara | 24346      | 1891    | 134
University of California, Santa Cruz         | UC          | Santa Cruz    | Santa Cruz    | 19700      | 1965    | 60
California State University Maritime Academy | CSU         | Vallejo       | Solano        | 1017       | 1929    | 96
... (22 rows omitted)>

The above statement did not error, but it did not output a column, either. Rather, it output a method of a table, but it didn’t call this method.