Our first data structure
This note has the following goals:
We will first learn arrays through analyzing data; then, we will drill into array internals and the nitty-gritty of Python modules and methods. Finally, we will summarize array operations.
Let us discuss our first data structure in this course: arrays. An array is a sequential collection of values of a given data type.
Each value in an array is called an array element.
We have previously discussed the idea of tables: a rectangular data structure with rows and columns. We will see today that arrays are a concise way to manipulate table columns, because arrays facilitate common data processing that we may want to perform on columns.
Read Ch 5.1, which describes in detail how arrays can be used in arithmetic expressions to compute over their contents.
Before continuing, make sure that:
The following table is drawn from the American Community Survey (ACS) of 2020. It shows, by state, the education levels of adults aged 25 or older.
State | Estimated total state population | Estimated high school graduate or higher (%) | Estimated bachelor’s degree or higher (%) |
---|---|---|---|
Alabama | 3,344,006 | 86.9 | 26.2 |
California | 26,665,143 | 83.9 | 34.7 |
Florida | 15,255,326 | 88.5 | 30.5 |
New York | 13,649,157 | 87.2 | 37.5 |
Texas | 18,449,851 | 84.4 | 30.7 |
Each of these table columns can be represented by an array.
Below, we create a new array for the column “Estimated high school graduate or higher (%)” and assign the returned array to a single name, hs_or_higher. This simple assignment statement is abstraction at work! Also note that the import statement gives us access to the array functions in the datascience module, including make_array, which returns a new array containing the provided argument values.
from datascience import *
hs_or_higher = make_array(86.9, 83.9, 88.5, 87.2, 84.4)
hs_or_higher
array([ 86.9, 83.9, 88.5, 87.2, 84.4])
Let’s make a few more arrays.
The array data type (as shown below) is a bit esoteric for now; we will discuss what NumPy (np) is very soon.
bs_or_higher = make_array(26.2, 34.7, 30.5, 37.5, 30.7)
type(bs_or_higher)
numpy.ndarray
When creating the state names array of strings below, what do you observe about the data type, dtype? Hint: Count the number of characters in each string.
states = make_array("Alabama", "California", "Florida", "New York", "Texas")
states
array(['Alabama', 'California', 'Florida', 'New York', 'Texas'],
dtype='<U10')
When creating the state population array below, why might we decide to make an integer array, as opposed to a string array?
state_pop = make_array(3344006, 26665143, 15255326, 13649157, 18449851)
state_pop
array([ 3344006, 26665143, 15255326, 13649157, 18449851])
The order of the elements in an array is fixed (i.e., elements are arranged in the order specified when the array is built), and values can be repeated.
Array with 4 numbers (note the repeated 5):
make_array(5, -1, 0.3, 5)
array([ 5. , -1. , 0.3, 5. ])
Values in an array must all be of the same data type, and the make_array function will cast appropriately. Below, all values can be represented by strings:
make_array(4, -4.5, "not a number")
array(['4', '-4.5', 'not a number'],
dtype='<U32')
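The casting rule can be checked directly. The following is a minimal sketch under stated assumptions: it uses NumPy's `np.array` on a list as a stand-in for `make_array` (the `type` output earlier shows that `make_array` returns a `numpy.ndarray`), and the names `numbers` and `mixed` are made up for illustration.

```python
import numpy as np

# Stand-in for make_array(5, -1, 0.3, 5): integers and one float together
numbers = np.array([5, -1, 0.3, 5])
# A single float is enough to make every element a float
print(numbers.dtype.kind)   # 'f' stands for floating point

# Stand-in for make_array(4, -4.5, "not a number")
mixed = np.array([4, -4.5, "not a number"])
# Numbers mixed with text are all stored as strings
print(mixed.dtype.kind)     # 'U' stands for Unicode string
print(repr(mixed.item(0)))  # the string '4', not the integer 4
```

One consequence worth noticing: once the numbers have been cast to strings, arithmetic on that array no longer behaves like arithmetic on numbers.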
Incidentally, we can clean up our code stylistically by breaking the line after each argument:
make_array("hello",
           "world",
           "!")
array(['hello', 'world', '!'],
dtype='<U5')
Arrays allow us to write code that performs the same operation on many pieces of data at once. We can therefore easily use arithmetic operations on elements of numeric arrays where it “makes sense” (see sidenote).
To compute the estimated percentage, by state, of adults aged 25 or older who have not graduated from high school, we can create a new array by performing arithmetic with an array and a numeric value:
100 - hs_or_higher
array([ 13.1, 16.1, 11.5, 12.8, 15.6])
To compute the estimated number, by state, of adults aged 25 or older with a bachelor’s degree, we can create a new array by performing arithmetic with two arrays:
bs_or_higher / 100 * state_pop
array([ 876129.572, 9252804.621, 4652874.43 , 5118433.875, 5664104.257])
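To see what "the same operation on many pieces of data at once" means, the sketch below compares each array expression with an explicit loop over the elements. It is an illustration under assumptions, not the course's own code: `np.array` stands in for `make_array`, and the names ending in `_loop` are hypothetical.

```python
import numpy as np

hs_or_higher = np.array([86.9, 83.9, 88.5, 87.2, 84.4])
bs_or_higher = np.array([26.2, 34.7, 30.5, 37.5, 30.7])
state_pop = np.array([3344006, 26665143, 15255326, 13649157, 18449851])

# Array and a single number: the number is combined with every element
no_hs = 100 - hs_or_higher
no_hs_loop = [100 - pct for pct in hs_or_higher]

# Two arrays of equal length: elements are paired up by position
bs_count = bs_or_higher / 100 * state_pop
bs_count_loop = [pct / 100 * pop for pct, pop in zip(bs_or_higher, state_pop)]

# Both comparisons should report that the results agree
print(np.allclose(no_hs, no_hs_loop))
print(np.allclose(bs_count, bs_count_loop))
```

The array expressions are shorter than the loops and keep the result as an array, so it can feed straight into further arithmetic.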
Sidenote: What do I mean by “makes sense”? Linear Algebra is a broad mathematical field that forms the foundations of much of data science and tabular data analysis. This element-wise array functionality is derived from the mathematical definitions of vectors and scalars. Take a linear algebra class if you want to learn more!
When people stand in a line, each person has a position. Similarly, each element (i.e., value) of an array has a position – called its index. Python, like many programming languages, is zero-indexed. This means that in an array, the first element has index 0, not 1.
In the int_arr array below, the first element (3) has index 0; the last element (2) has index 4.
int_arr = make_array(3, -4, 0, 5, 2)
int_arr
array([ 3, -4, 0, 5, 2])
item()
An array method is just like a function, but it must operate on an array using “dot” syntax. So the call looks like:
name_of_array.method(arguments)
We will discuss many more methods once we introduce tables, but for now let’s learn our first method to index arrays.
We can access an element in an array by using its index and the item() method:
int_arr.item(0)
3
int_arr.item(3)
5
Because of zero-indexing, the largest valid index is 4 for the five-element int_arr array:
int_arr.item(5)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[13], line 1
----> 1 int_arr.item(5)

IndexError: index 5 is out of bounds for axis 0 with size 5
In Python, we can also “count backwards” using negative indexes. -1 corresponds to the last element in an array; -2 corresponds to the second-to-last element in an array; and so on.
# functionally equivalent to int_arr.item(4)
int_arr.item(-1)
2
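The indexing rules in this section can be gathered into one short sketch. It assumes `np.array` as a stand-in for `make_array`, and the variable names (`n`, `first`, `last`, and so on) are hypothetical. It also shows one way to observe the out-of-bounds error without stopping the program.

```python
import numpy as np

int_arr = np.array([3, -4, 0, 5, 2])
n = len(int_arr)  # 5 elements, so valid indexes run from 0 to 4

first = int_arr.item(0)           # index 0 is the first element
last = int_arr.item(n - 1)        # index n - 1 is the last element
last_negative = int_arr.item(-1)  # same element as index n - 1
second_last = int_arr.item(-2)    # same element as index n - 2

# Index n is one past the end, so item() raises an IndexError
try:
    int_arr.item(n)
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(first, last, last_negative, second_last, out_of_bounds)
```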
U.S. Census Bureau, “EDUCATIONAL ATTAINMENT,” American Community Survey 5-Year Estimates Subject Tables, Table S1501, 2020, https://data.census.gov/table/ACSST5Y2020.S1501?q=2020+education&t=Age+and+Sex:Educational+Attainment&g=010XX00US$0400000, accessed on August 24, 2025.