from datascience import *
import numpy as np
Table Fundamentals
Read Ch 6 intro, which describes the Table
object type from the datascience
library.
This is a very dense chapter, as it has lots of Python syntax. We will review it in detail. Before moving on, we encourage you to focus on the following questions:
- What is the syntax of the
with_columns
method? How does this method make a table? - What are some table methods?
- What Python expression gets you the number of rows? a specific column? (bonus) How does the syntax differ between these two cases?
If you are new to programming, it is easy to get lost in the syntax of Python! For today, we will therefore focus on a few high-level goals:
- How to translate your existing critical thinking skills and propensity for exploratory data analysis into algorithmic thinking with
datascience
Tables - How to read Python documentation to find what you want
- How to debug code systematically
In other words, you will not be expected to memorize all aspects of Tables (it’s as if we have a theme in this class…)! But you will have natural impulses for understanding and working with tabular data, many of which directly map to specific Table methods and attributes. To write Python code, then, you will need to learn to work with documentation to do this translation, and—when the precise methods don’t exist—construct new algorithms to achieve what you want. The more familiar you are with the documentation—and consequently, what is possible with Table
s—the more you can focus on algorithmic thinking.
This entire note revolves around getting used to using the documentation on the Data 6 Python Reference page. Keep that page open as you read this one!
Definitions
To begin, recall the terminology we have been using to describe tables so far:
- Table: Retangular data structure.
- Columns (vertical) correspond to variables (AKA features, or attributes) that measure and operationalize social concepts.
- Rows (horizontal) correspond to records (AKA entries), which are specific values of variables for a given individual, group, etc.—whatever your unit of analysis is.
We work with the datascience
package’s Table
.
Today’s Dataset: Schools
This lecture we will return to the dataset of public four-year colleges and universities in California (Wikipedia).
To work with this table, we first load it in from a file called data/cal_unis.csv
.
= Table.read_table('data/cal_unis.csv') schools
Let’s break down what we meant by “load it in”: * The right-hand-side of the assignment calls a function that returns a datascience
Table
object from the provided data file. * The left-hand-side of the assignment assigns this object to the Python name schools
. * After this assignment statement, then we can use schools
as the particular Table object that has the tabular data we want.
Running the below cell evaluates schools
and displays that tabular data to us:
schools
Name | Institution | City | County | Enrollment | Founded |
---|---|---|---|---|---|
University of California, Berkeley | UC | Berkeley | Alameda | 42519 | 1869 |
University of California, Davis | UC | Davis | Yolo | 39152 | 1905 |
University of California, Irvine | UC | Irvine | Orange | 35220 | 1965 |
University of California, Los Angeles | UC | Los Angeles | Los Angeles | 45428 | 1882 |
University of California, Merced | UC | Merced | Merced | 8544 | 2005 |
University of California, Riverside | UC | Riverside | Riverside | 23278 | 1954 |
University of California, San Diego | UC | San Diego | San Diego | 38798 | 1960 |
University of California, Santa Barbara | UC | Santa Barbara | Santa Barbara | 24346 | 1891 |
University of California, Santa Cruz | UC | Santa Cruz | Santa Cruz | 19700 | 1965 |
California State University Maritime Academy | CSU | Vallejo | Solano | 1017 | 1929 |
... (22 rows omitted)
Let’s get to exploring and understanding our tabular data!
Exercise 1: Find Table Dimensions
First off: How many rows and columns are in our schools
table?
print(schools.num_rows)
print(schools.num_columns)
32
6
What variables are being recorded?
print(schools.labels)
('Name', 'Institution', 'City', 'County', 'Enrollment', 'Founded')
We didn’t conjure these expressions out of nowhere. Instead, we:
- Looked at the Data 6 Python Reference page
- Scrolled down to the “Tables and Table Methods” section
- Skimmed the reference until we found documentation that seemed close to what we want
- Translated the name
tbl
intoschools
. - Tried it out
Bonus: You’ll notice the “dot” syntax for accessing these values, also known as table attributes. See the Bonus last section for a more detailed explanation of this programming terminology.
Column-First Paradigm
Because columns represent variables, we will take a column-first approach to tables.
To extract, delete, or make columns from a table named tbl
:
tbl.select(...)
returns a new table with a subset of columns.tbl.drop(...)
and returns a new table without a subset of columnstbl.with_columns(...)
returns a new table with additional new column(s).
'Name', 'Enrollment') schools.select(
Name | Enrollment |
---|---|
University of California, Berkeley | 42519 |
University of California, Davis | 39152 |
University of California, Irvine | 35220 |
University of California, Los Angeles | 45428 |
University of California, Merced | 8544 |
University of California, Riverside | 23278 |
University of California, San Diego | 38798 |
University of California, Santa Barbara | 24346 |
University of California, Santa Cruz | 19700 |
California State University Maritime Academy | 1017 |
... (22 rows omitted)
'Founded', 'County') schools.drop(
Name | Institution | City | Enrollment |
---|---|---|---|
University of California, Berkeley | UC | Berkeley | 42519 |
University of California, Davis | UC | Davis | 39152 |
University of California, Irvine | UC | Irvine | 35220 |
University of California, Los Angeles | UC | Los Angeles | 45428 |
University of California, Merced | UC | Merced | 8544 |
University of California, Riverside | UC | Riverside | 23278 |
University of California, San Diego | UC | San Diego | 38798 |
University of California, Santa Barbara | UC | Santa Barbara | 24346 |
University of California, Santa Cruz | UC | Santa Cruz | 19700 |
California State University Maritime Academy | CSU | Vallejo | 1017 |
... (22 rows omitted)
Note: The above two cells were run in sequence. Each table method returns a new table, so our original table schools
is unchanged.
Modifying Tables
All table methods return copies of information from the original table! This paradigm is quite useful for data analysis. From Inferential Thinking:
[Table methods] create new smaller tables that share the same data. The fact that the original table is preserved is useful! You can generate multiple different tables that only consider certain columns without worrying that one analysis will affect the other.
If we would like to modify the original table, then we must re-assign schools
to the new return value:
= schools.with_columns(
schools "Years since founding",
2025 - schools.column("Founded")
) schools
Name | Institution | City | County | Enrollment | Founded | Years since founding |
---|---|---|---|---|---|---|
University of California, Berkeley | UC | Berkeley | Alameda | 42519 | 1869 | 156 |
University of California, Davis | UC | Davis | Yolo | 39152 | 1905 | 120 |
University of California, Irvine | UC | Irvine | Orange | 35220 | 1965 | 60 |
University of California, Los Angeles | UC | Los Angeles | Los Angeles | 45428 | 1882 | 143 |
University of California, Merced | UC | Merced | Merced | 8544 | 2005 | 20 |
University of California, Riverside | UC | Riverside | Riverside | 23278 | 1954 | 71 |
University of California, San Diego | UC | San Diego | San Diego | 38798 | 1960 | 65 |
University of California, Santa Barbara | UC | Santa Barbara | Santa Barbara | 24346 | 1891 | 134 |
University of California, Santa Cruz | UC | Santa Cruz | Santa Cruz | 19700 | 1965 | 60 |
California State University Maritime Academy | CSU | Vallejo | Solano | 1017 | 1929 | 96 |
... (22 rows omitted)
Filtering Rows
Often we would like to perform row filtering, where we only extract row entries that match a specific feature.
By exact match: tbl.where(column, value)
. Create a new table of only the rows where column column
matches the value value
.
"Institution", "UC") schools.where(
Name | Institution | City | County | Enrollment | Founded | Years since founding |
---|---|---|---|---|---|---|
University of California, Berkeley | UC | Berkeley | Alameda | 42519 | 1869 | 156 |
University of California, Davis | UC | Davis | Yolo | 39152 | 1905 | 120 |
University of California, Irvine | UC | Irvine | Orange | 35220 | 1965 | 60 |
University of California, Los Angeles | UC | Los Angeles | Los Angeles | 45428 | 1882 | 143 |
University of California, Merced | UC | Merced | Merced | 8544 | 2005 | 20 |
University of California, Riverside | UC | Riverside | Riverside | 23278 | 1954 | 71 |
University of California, San Diego | UC | San Diego | San Diego | 38798 | 1960 | 65 |
University of California, Santa Barbara | UC | Santa Barbara | Santa Barbara | 24346 | 1891 | 134 |
University of California, Santa Cruz | UC | Santa Cruz | Santa Cruz | 19700 | 1965 | 60 |
By index: tbl.take(row_indices)
. Create a new table of only the rows with an index specified in row_indices
.
1, 3, 5) schools.take(
Name | Institution | City | County | Enrollment | Founded | Years since founding |
---|---|---|---|---|---|---|
University of California, Davis | UC | Davis | Yolo | 39152 | 1905 | 120 |
University of California, Los Angeles | UC | Los Angeles | Los Angeles | 45428 | 1882 | 143 |
University of California, Riverside | UC | Riverside | Riverside | 23278 | 1954 | 71 |
Are rows zero-indexed or one-indexed? Why?
Creating New Tables
So far we’ve used a table with pre-existing data. We can also make new Tables with a special function: Table()
.
Table()
Check it out!!!
= Table().with_columns('State', np.array(['California', 'New York', 'Florida', 'Texas', 'Pennsylvania']),'Code', np.array(['CA', 'NY', 'FL', 'TX', 'PA']), 'Population (millions)', np.array([39.3, 19.3, 21.7, 29.3, 12.8]))
states states
State | Code | Population (millions) |
---|---|---|
California | CA | 39.3 |
New York | NY | 19.3 |
Florida | FL | 21.7 |
Texas | TX | 29.3 |
Pennsylvania | PA | 12.8 |
…unfortunately, due to poor coding style, the above code is quite unreadable. Let’s use whitespace (new lines and indents) to delineate what we are doing:
= Table().with_columns(
states 'State', np.array(['California', 'New York', 'Florida', 'Texas', 'Pennsylvania']),
'Code', np.array(['CA', 'NY', 'FL', 'TX', 'PA']),
'Population (millions)', np.array([39.3, 19.3, 21.7, 29.3, 12.8])
) states
State | Code | Population (millions) |
---|---|---|
California | CA | 39.3 |
New York | NY | 19.3 |
Florida | FL | 21.7 |
Texas | TX | 29.3 |
Pennsylvania | PA | 12.8 |
Method Chaining
Method chaining in Python is when the object returned from one method becomes the object to use in the next method. In this case:
- Calling the
Table
returns an empty table. - The method
with_columns
is called on the empty table object, returning a table with the columns State, Code, and Population. - This return value is then assigned to
states
.
Note the closed parenthesis on the final line. This helps “match” parentheses together.
Bonus: Methods vs. Attributes
Return to the third focus question from the Inferential Thinking reading.
What Python expression gets you the number of rows? a specific column? (bonus) How does the syntax differ between these two cases?
The textbook wording “overloads” two different Python syntax constructs. From the Python glossary:
Attribute: A value associated with an object which is usually referenced by name using dotted expressions. For example, if an object o has an attribute a it would be referenced as o.a.
In other words, attributes are named values tied to specific objects. To get these values, we must refer to them by their name using “dot” syntax.
schools.num_rows
32
Method A function which is defined inside a class body.
The Python glossary definition above is a more opaque definition than we would like. But you can think of the methods we study in this class as functions that take into account the attributes and values of specific objects. In order to call them on a specific operation, we also use “dot” syntax. Because they are functions, we must use call expression syntax with parenthesis, and provide arguments as needed.
"Enrollment") schools.column(
array([42519, 39152, 35220, 45428, 8544, 23278, 38798, 24346, 19700,
1017, 21812, 26443, 10493, 7095, 17488, 15741, 14525, 24995,
39774, 36846, 27685, 7079, 38716, 31131, 19973, 14511, 10214,
7774, 34881, 29586, 32828, 9201])
Which to syntax use? To determine which “dot” syntax to use, read the documentation. The entire Table
class has been defined for us already, and the designers of the datascience
package have predetermined which table features make sense to have as attributes, and which features make sense to return as values from a method call. But as a rule of thumb, if you need to specify any additional detail to get what you want, you will likely need to call a function: use table methods and pass in arguments. See the column
method call above.
Another key indicator is what happens when you use incorrect syntax:
schools.num_rows()
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[16], line 1 ----> 1 schools.num_rows() TypeError: 'int' object is not callable
The above error message means that Python assumed you wanted to call a function num_rows
(specifically, a method of the Table schools
); however, when run, num_rows
was actually an int
data type attribute, and int
s are not callable. Similarly, below:
schools.column
<bound method Table.column of Name | Institution | City | County | Enrollment | Founded | Years since founding
University of California, Berkeley | UC | Berkeley | Alameda | 42519 | 1869 | 156
University of California, Davis | UC | Davis | Yolo | 39152 | 1905 | 120
University of California, Irvine | UC | Irvine | Orange | 35220 | 1965 | 60
University of California, Los Angeles | UC | Los Angeles | Los Angeles | 45428 | 1882 | 143
University of California, Merced | UC | Merced | Merced | 8544 | 2005 | 20
University of California, Riverside | UC | Riverside | Riverside | 23278 | 1954 | 71
University of California, San Diego | UC | San Diego | San Diego | 38798 | 1960 | 65
University of California, Santa Barbara | UC | Santa Barbara | Santa Barbara | 24346 | 1891 | 134
University of California, Santa Cruz | UC | Santa Cruz | Santa Cruz | 19700 | 1965 | 60
California State University Maritime Academy | CSU | Vallejo | Solano | 1017 | 1929 | 96
... (22 rows omitted)>
The above statement did not error, but it did not output a column, either. Rather, it output a method
of a table, but it didn’t call this method.