Table Fundamentals

Read Inferential Thinking

Read Ch 6 intro, which describes the Table object type from the datascience library.

This is a very dense chapter, as it has lots of Python syntax. We will review it in detail. Before moving on, we encourage you to focus on the following questions:

What is the syntax of the with_columns method? How does this method make a table?
What are some table methods?
What Python expression gets you the number of rows? a specific column? (bonus) How does the syntax differ between these two cases?

If you are new to programming, it is easy to get lost in the syntax of Python! For today, we will therefore focus on a few high-level goals:

How to translate your existing critical thinking skills and propensity for exploratory data analysis into algorithmic thinking with datascience Tables
How to read Python documentation to find what you want
How to debug code systematically

In other words, you will not be expected to memorize all aspects of Tables (it’s as if we have a theme in this class…)! But you will have natural impulses for understanding and working with tabular data, many of which directly map to specific Table methods and attributes. To write Python code, then, you will need to learn to work with documentation to do this translation, and—when the precise methods don’t exist—construct new algorithms to achieve what you want. The more familiar you are with the documentation—and consequently, what is possible with Tables—the more you can focus on algorithmic thinking.

Documentation, documentation, documentation

This entire note revolves around getting used to using the documentation on the Data 6 Python Reference page. Keep that page open as you read this one!

Definitions

To begin, recall the terminology we have been using to describe tables so far:

Table: Retangular data structure.
- Columns (vertical) correspond to variables (AKA features, or attributes) that measure and operationalize social concepts. We discuss variables and social concepts more in another note.
- Rows (horizontal) correspond to records (AKA entries), which are specific values of variables for a given individual, group, etc.—whatever your unit of analysis is.

We work with the datascience package’s Table.

from datascience import *
import numpy as np

Today’s Dataset: Schools

This lecture we will return to the dataset of public universities in California (Wikipedia).

To work with this table, we first load it in from a file called data/cal_unis.csv.

schools = Table.read_table('data/cal_unis.csv')

Let’s break down what we meant by “load it in”: * The right-hand-side of the assignment calls a function that returns a datascience Table object from the provided data file. * The left-hand-side of the assignment assigns this object to the Python name schools. * After this assignment statement, then we can use schools as the particular Table object that has the tabular data we want.

Running the below cell evaluates schools and displays that tabular data to us:

schools

Name	Institution	City	County	Enrollment	Founded
California State Polytechnic University, Humboldt	CSU	Arcata	Humboldt	6025	1913
California State University, Bakersfield	CSU	Bakersfield	Kern	9613	1965
University of California, Berkeley	UC	Berkeley	Alameda	45307	1869
California State University Channel Islands	CSU	Camarillo	Ventura	6128	2002
California State University, Dominguez Hills	CSU	Carson	Los Angeles	16426	1960
California State University, Chico	CSU	Chico	Butte	14183	1887
University of California, Davis	UC	Davis	Yolo	39679	1905
California State University, Fresno	CSU	Fresno	Fresno	23999	1911
California State University, Fullerton	CSU	Fullerton	Orange	40386	1957
California State University, East Bay	CSU	Hayward	Alameda	13673	1957

... (22 rows omitted)

Let’s get to exploring and understanding our tabular data!

Find Table Dimensions

First off: How many rows and columns are in our schools table?

print(schools.num_rows)
print(schools.num_columns)

32
6

What variables are being recorded?

print(schools.labels)

('Name', 'Institution', 'City', 'County', 'Enrollment', 'Founded')

We didn’t conjure these expressions out of nowhere. Instead, we:

Looked at the Data 6 Python Reference page
Scrolled down to the “Tables and Table Methods” section
Skimmed the reference until we found documentation that seemed close to what we want
Translated the name tbl into schools.
Tried it out

Bonus: You’ll notice the “dot” syntax for accessing these values, also known as table attributes. See the Bonus last section for a more detailed explanation of this programming terminology.

Column-First Paradigm

Because columns represent variables, we will take a column-first approach to tables.

To extract, delete, or make columns from a table named tbl:

tbl.select(...) returns a new table with a subset of columns.
tbl.drop(...) and returns a new table without a subset of columns
tbl.with_columns(...) returns a new table with additional new column(s).

schools.select('Name', 'Enrollment')

Name	Enrollment
California State Polytechnic University, Humboldt	6025
California State University, Bakersfield	9613
University of California, Berkeley	45307
California State University Channel Islands	6128
California State University, Dominguez Hills	16426
California State University, Chico	14183
University of California, Davis	39679
California State University, Fresno	23999
California State University, Fullerton	40386
California State University, East Bay	13673

... (22 rows omitted)

schools.drop('Founded', 'County')

Name	Institution	City	Enrollment
California State Polytechnic University, Humboldt	CSU	Arcata	6025
California State University, Bakersfield	CSU	Bakersfield	9613
University of California, Berkeley	UC	Berkeley	45307
California State University Channel Islands	CSU	Camarillo	6128
California State University, Dominguez Hills	CSU	Carson	16426
California State University, Chico	CSU	Chico	14183
University of California, Davis	UC	Davis	39679
California State University, Fresno	CSU	Fresno	23999
California State University, Fullerton	CSU	Fullerton	40386
California State University, East Bay	CSU	Hayward	13673

... (22 rows omitted)

Note: The above two cells were run in sequence. Each table method returns a new table, so our original table schools is unchanged.

Modifying Tables

Table methods return copies

All table methods return copies of information from the original table! This paradigm is quite useful for data analysis. From Inferential Thinking:

[Table methods] create new smaller tables that share the same data. The fact that the original table is preserved is useful! You can generate multiple different tables that only consider certain columns without worrying that one analysis will affect the other.

If we would like to modify the original table, then we must re-assign schools to the new return value:

schools = schools.with_columns(
    "Years since founding",
    2025 - schools.column("Founded")
)
schools

Name	Institution	City	County	Enrollment	Founded	Years since founding
California State Polytechnic University, Humboldt	CSU	Arcata	Humboldt	6025	1913	112
California State University, Bakersfield	CSU	Bakersfield	Kern	9613	1965	60
University of California, Berkeley	UC	Berkeley	Alameda	45307	1869	156
California State University Channel Islands	CSU	Camarillo	Ventura	6128	2002	23
California State University, Dominguez Hills	CSU	Carson	Los Angeles	16426	1960	65
California State University, Chico	CSU	Chico	Butte	14183	1887	138
University of California, Davis	UC	Davis	Yolo	39679	1905	120
California State University, Fresno	CSU	Fresno	Fresno	23999	1911	114
California State University, Fullerton	CSU	Fullerton	Orange	40386	1957	68
California State University, East Bay	CSU	Hayward	Alameda	13673	1957	68

... (22 rows omitted)

Filtering Rows

Often we would like to perform row filtering, where we only extract row entries that match a specific feature.

By exact match: tbl.where(column, value). Create a new table of only the rows where column column matches the value value.

schools.where("Institution", "UC")

Name	Institution	City	County	Enrollment	Founded	Years since founding
University of California, Berkeley	UC	Berkeley	Alameda	45307	1869	156
University of California, Davis	UC	Davis	Yolo	39679	1905	120
University of California, Irvine	UC	Irvine	Orange	35937	1965	60
University of California, Los Angeles	UC	Los Angeles	Los Angeles	46430	1882	143
University of California, Merced	UC	Merced	Merced	9110	2005	20
University of California, Riverside	UC	Riverside	Riverside	26809	1954	71
University of California, San Diego	UC	San Diego	San Diego	42006	1960	65
University of California, Santa Barbara	UC	Santa Barbara	Santa Barbara	26420	1891	134
University of California, Santa Cruz	UC	Santa Cruz	Santa Cruz	19478	1965	60

By index: tbl.take(row_indices). Create a new table of only the rows with an index specified in row_indices.

schools.take(1, 3, 5)

Name	Institution	City	County	Enrollment	Founded	Years since founding
California State University, Bakersfield	CSU	Bakersfield	Kern	9613	1965	60
California State University Channel Islands	CSU	Camarillo	Ventura	6128	2002	23
California State University, Chico	CSU	Chico	Butte	14183	1887	138

Are rows zero-indexed or one-indexed? Why?

Creating New Tables

So far we’ve used a table with pre-existing data. We can also make new Tables with a special function: Table().

Table()

Check it out!!!

states = Table().with_columns('State', np.array(['California', 'New York', 'Florida', 'Texas', 'Pennsylvania']),'Code', np.array(['CA', 'NY', 'FL', 'TX', 'PA']), 'Population (millions)', np.array([39.3, 19.3, 21.7, 29.3, 12.8]))
states

State	Code	Population (millions)
California	CA	39.3
New York	NY	19.3
Florida	FL	21.7
Texas	TX	29.3
Pennsylvania	PA	12.8

…unfortunately, due to poor coding style, the above code is quite unreadable. Let’s use whitespace (new lines and indents) to delineate what we are doing:

states = Table().with_columns(
  'State', make_array('California', 'New York', 'Florida', 'Texas', 'Pennsylvania'),
  'Code', make_array('CA', 'NY', 'FL', 'TX', 'PA'),
  'Population (millions)', make_array(39.3, 19.3, 21.7, 29.3, 12.8)
)
states

State	Code	Population (millions)
California	CA	39.3
New York	NY	19.3
Florida	FL	21.7
Texas	TX	29.3
Pennsylvania	PA	12.8

Method Chaining

Method chaining in Python is when the object returned from one method becomes the object to use in the next method. In the above case:

Calling the Table returns an empty table.
The method with_columns is called on the empty table object, returning a table with the columns State, Code, and Population.
This return value is then assigned to states.

Whitespace formatting

Note the closed parenthesis on the final line. This helps “match” parentheses together.

Bonus: Methods vs. Attributes

Return to the third focus question from the Inferential Thinking reading.

What Python expression gets you the number of rows? a specific column? (bonus) How does the syntax differ between these two cases?

The textbook wording “overloads” two different Python syntax constructs. From the Python glossary:

Attribute

Attribute: A value associated with an object which is usually referenced by name using dotted expressions. For example, if an object o has an attribute a it would be referenced as o.a.

In other words, attributes are named values tied to specific objects. To get these values, we must refer to them by their name using “dot” syntax.

schools.num_rows

Method

Method A function which is defined inside a class body.

The Python glossary definition above is a more opaque definition than we would like. But you can think of the methods we study in this class as functions that take into account the attributes and values of specific objects. In order to call them on a specific operation, we also use “dot” syntax. Because they are functions, we must use call expression syntax with parenthesis, and provide arguments as needed.

schools.column("Enrollment")

array([ 6025,  9613, 45307,  6128, 16426, 14183, 39679, 23999, 40386,
       13673, 35937, 38973, 46430, 26460,  9110, 37579, 27503, 26809,
        6637, 31818, 19803, 42006, 37402, 25282, 35751, 22000, 15109,
       26420, 19478,  7045, 10154,  1017])

Which to syntax use? To determine which “dot” syntax to use, read the documentation. The entire Table class has been defined for us already, and the designers of the datascience package have predetermined which table features make sense to have as attributes, and which features make sense to return as values from a method call. But as a rule of thumb, if you need to specify any additional detail to get what you want, you will likely need to call a function: use table methods and pass in arguments. See the column method call above.

Another key indicator is what happens when you use incorrect syntax:

schools.num_rows()

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[16], line 1
----> 1 schools.num_rows()

TypeError: 'int' object is not callable

The above error message means that Python assumed you wanted to call a function num_rows (specifically, a method of the Table schools); however, when run, num_rows was actually an int data type attribute, and ints are not callable. Similarly, below:

schools.column

<bound method Table.column of Name                                              | Institution | City        | County      | Enrollment | Founded | Years since founding
California State Polytechnic University, Humboldt | CSU         | Arcata      | Humboldt    | 6025       | 1913    | 112
California State University, Bakersfield          | CSU         | Bakersfield | Kern        | 9613       | 1965    | 60
University of California, Berkeley                | UC          | Berkeley    | Alameda     | 45307      | 1869    | 156
California State University Channel Islands       | CSU         | Camarillo   | Ventura     | 6128       | 2002    | 23
California State University, Dominguez Hills      | CSU         | Carson      | Los Angeles | 16426      | 1960    | 65
California State University, Chico                | CSU         | Chico       | Butte       | 14183      | 1887    | 138
University of California, Davis                   | UC          | Davis       | Yolo        | 39679      | 1905    | 120
California State University, Fresno               | CSU         | Fresno      | Fresno      | 23999      | 1911    | 114
California State University, Fullerton            | CSU         | Fullerton   | Orange      | 40386      | 1957    | 68
California State University, East Bay             | CSU         | Hayward     | Alameda     | 13673      | 1957    | 68
... (22 rows omitted)>

The above statement did not error, but it did not output a column, either. Rather, it output a method of a table, but it didn’t call this method.