Arrays

Our first data structure

This note has the following goals:

We will first learn arrays through analyzing data; then, we will drill into array internals and the nitty-gritty of Python modules and methods. Finally, we will summarize array operations.

Definitions

Let us discuss our first data structure in this course: arrays. An array is a sequential collection of values of a given data type:

  • sequential: arranged like a line/queue
  • collection: multiple values organized together.

In arrays, each value is called an array element.

We have previously discussed the idea of tables: a rectangular data structure with rows and columns. We will see today that arrays are a concise way to manipulate table columns, because arrays facilitate common data processing that we may want to perform on columns.

Read Inferential Thinking

Read Ch 5.1, which describes in detail how arrays can be used in arithmetic expressions to compute over their contents.

Before continuing, make sure that:

  • You understand the figure that shows how to convert an array of Celsius temperatures to an array of Fahrenheit temperatures.
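
As a quick self-check on that conversion, here is a minimal sketch. It assumes NumPy's np.array as a stand-in for the datascience make_array function, and the temperature values are made up for illustration (they are not taken from the textbook figure).

```python
import numpy as np

# Hypothetical Celsius readings, chosen only for illustration.
celsius = np.array([0.0, 100.0, -40.0])

# One expression converts every element at once: F = C * 9/5 + 32.
fahrenheit = celsius * 9 / 5 + 32

print(fahrenheit)
```

Each element of the result lines up with the element at the same position in the Celsius array.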

Today’s dataset

The following table is drawn from the 2020 American Community Survey (ACS). It shows the education levels of adults aged 25 years or older, by state.

State        Estimated total state population   Estimated high school graduate or higher (%)   Estimated bachelor’s degree or higher (%)
Alabama      3,344,006                          86.9                                           26.2
California   26,665,143                         83.9                                           34.7
Florida      15,255,326                         88.5                                           30.5
New York     13,649,157                         87.2                                           37.5
Texas        18,449,851                         84.4                                           30.7

Creating arrays

Each of these table columns can be represented by an array.

Below, we create a new array for the column “Estimated high school graduate or higher (%)” and assign the returned array to a single name, hs_or_higher. This simple assignment statement is abstraction at work! Also note that the import statement gives us access to the array functions in the datascience module, including make_array, which returns a new array containing the provided argument values.

from datascience import *

hs_or_higher = make_array(86.9, 83.9, 88.5, 87.2, 84.4)
hs_or_higher
array([ 86.9,  83.9,  88.5,  87.2,  84.4])

Let’s make a few more arrays.

The array data type (as shown below) is a bit esoteric for now; we will discuss what NumPy (np) is very soon.

bs_or_higher = make_array(26.2, 34.7, 30.5, 37.5, 30.7)
type(bs_or_higher)
numpy.ndarray

When creating the state names array of strings below, what do you observe about the datatype, dtype? Hint: Count the number of characters in each string.

states = make_array("Alabama", "California", "Florida", "New York", "Texas")
states
array(['Alabama', 'California', 'Florida', 'New York', 'Texas'],
      dtype='<U10')
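
One way to explore the hint is to count characters in code. The sketch below assumes NumPy's np.array as a stand-in for make_array; the names name_lengths and longest are introduced here only for illustration.

```python
import numpy as np

# Assumption: np.array stands in for make_array here.
states = np.array(["Alabama", "California", "Florida", "New York", "Texas"])

# Number of characters in each state name.
name_lengths = [len(name) for name in states]
longest = max(name_lengths)

print(name_lengths)
print(longest)
print(states.dtype.kind)  # 'U' marks a Unicode string array
```

Compare the longest name length with the number that appears in the dtype.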

When creating the state population array below, why might we decide to make an integer array, as opposed to a string array?

state_pop = make_array(3344006, 26665143, 15255326, 13649157, 18449851)
state_pop
array([ 3344006, 26665143, 15255326, 13649157, 18449851])

The order of elements in an array is fixed (i.e., elements are arranged in the order specified when building the array), and values can be repeated.

Array with 4 floats (the integer arguments are cast to floats because 0.3 is a float):

make_array(5, -1, 0.3, 5)
array([ 5. , -1. ,  0.3,  5. ])

Values in an array must all be of the same data type, and the make_array function will cast them to a common type. Below, the only type that can represent all three values is a string:

make_array(4, -4.5, "not a number")
array(['4', '-4.5', 'not a number'],
      dtype='<U32')
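
To see the casting rule side by side, here is a small sketch that inspects the kind of data each array ends up holding. It assumes NumPy's np.array as a stand-in for make_array; the variable names are hypothetical.

```python
import numpy as np

# Assumption: np.array stands in for make_array here.
all_ints = np.array([5, -1, 0, 5])
ints_and_float = np.array([5, -1, 0.3, 5])
mixed = np.array([4, -4.5, "not a number"])

# dtype.kind is a one-letter code: 'i' integer, 'f' float, 'U' Unicode string.
print(all_ints.dtype.kind)
print(ints_and_float.dtype.kind)
print(mixed.dtype.kind)
```

Note that once numbers are cast to strings, arithmetic on them no longer behaves numerically, which is one reason to keep numeric columns as numeric arrays.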

Incidentally, we can clean up our code stylistically by making line breaks after each argument:

make_array("hello",
           "world",
           "!")
array(['hello', 'world', '!'],
      dtype='<U5')

Element-wise arithmetic

Arrays allow us to write code that performs the same operation on many pieces of data at once. We can therefore easily use arithmetic operations on elements of numeric arrays where it “makes sense” (see sidenote).

To compute the estimated percentage, by state, of adults aged 25 years or older who have not graduated high school, we can create a new array by performing arithmetic with an array and a numeric value:

100 - hs_or_higher
array([ 13.1,  16.1,  11.5,  12.8,  15.6])

To compute the estimated number, by state, of adults aged 25 years or older with bachelor’s degrees, we can create a new array by performing arithmetic with two arrays:

bs_or_higher / 100 * state_pop
array([  876129.572,  9252804.621,  4652874.43 ,  5118433.875,  5664104.257])
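
To make "element-wise" concrete, the sketch below compares the array expressions above with explicit loops that handle one element at a time. It assumes NumPy's np.array as a stand-in for make_array, and the names vectorized, looped, and not_hs_looped are introduced only for this illustration.

```python
import numpy as np

# Assumption: np.array stands in for make_array here.
hs_or_higher = np.array([86.9, 83.9, 88.5, 87.2, 84.4])
bs_or_higher = np.array([26.2, 34.7, 30.5, 37.5, 30.7])
state_pop = np.array([3344006, 26665143, 15255326, 13649157, 18449851])

# Array with a single number: the number is applied to every element.
not_hs = 100 - hs_or_higher
not_hs_looped = [100 - pct for pct in hs_or_higher]

# Array with array: elements are paired up by position.
vectorized = bs_or_higher / 100 * state_pop
looped = [pct / 100 * pop for pct, pop in zip(bs_or_higher, state_pop)]

print(np.allclose(not_hs, not_hs_looped))
print(np.allclose(vectorized, looped))
```

The loops compute the same values, but the array expressions say the same thing in a single line.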

Sidenote: What do I mean by “makes sense”? Linear algebra is a broad mathematical field that forms the foundation of much of data science and tabular data analysis. This element-wise array functionality is derived from the mathematical definitions of vectors and scalars. Take a linear algebra class if you want to learn more!

Indexing

When people stand in a line, each person has a position. Similarly, each element (i.e., value) of an array has a position – called its index. Python, like many programming languages, is zero-indexed. This means that in an array, the first element has index 0, not 1.

In the int_arr array below, the first element (3) has index 0; the last element (2) has index 4.

int_arr = make_array(3, -4, 0, 5, 2)
int_arr
array([ 3, -4,  0,  5,  2])
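
To see every index next to its element, here is a small sketch using Python's built-in enumerate. It assumes NumPy's np.array as a stand-in for make_array; the name pairs is hypothetical.

```python
import numpy as np

# Assumption: np.array stands in for make_array here.
int_arr = np.array([3, -4, 0, 5, 2])

# enumerate pairs each element with its zero-based index.
pairs = [(index, int(value)) for index, value in enumerate(int_arr)]

for index, value in pairs:
    print("index", index, "holds", value)
```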

The Array Method item()

An array method is just like a function, but it must operate on an array using “dot” syntax. So the call looks like:

name_of_array.method(arguments)

We will discuss many more methods once we introduce tables, but for now let’s learn our first method to index arrays.

We can access an element in an array by using its index and the item() method:

int_arr.item(0)
3
int_arr.item(3)
5

Because of zero-indexing, the largest valid index is 4 for the five-element int_arr array:

int_arr.item(5)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[13], line 1
----> 1 int_arr.item(5)

IndexError: index 5 is out of bounds for axis 0 with size 5
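
A minimal sketch of how one might avoid this error: compute the largest valid index from the array's length, or catch the IndexError. It assumes NumPy's np.array as a stand-in for make_array, and safe_item is a hypothetical helper, not part of any library.

```python
import numpy as np

# Assumption: np.array stands in for make_array here.
int_arr = np.array([3, -4, 0, 5, 2])

# With zero-indexing, the largest valid index is always length - 1.
largest_valid_index = len(int_arr) - 1

def safe_item(arr, index, default=None):
    """Hypothetical helper: return arr.item(index), or default if out of bounds."""
    try:
        return arr.item(index)
    except IndexError:
        return default

print(largest_valid_index)
print(safe_item(int_arr, largest_valid_index))
print(safe_item(int_arr, 5))
```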

In Python, we can also “count backwards” using negative indexes. -1 corresponds to the last element in an array; -2 corresponds to the second-to-last element in an array; and so on.

# functionally equivalent to int_arr.item(4)
int_arr.item(-1)
2
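
The relationship between negative and non-negative indexes can be sketched as follows: index -k refers to the same element as index (length - k). This assumes NumPy's np.array as a stand-in for make_array; the names from_the_back and from_the_front are hypothetical.

```python
import numpy as np

# Assumption: np.array stands in for make_array here.
int_arr = np.array([3, -4, 0, 5, 2])
n = len(int_arr)

# Walk backwards with negative indexes: -1, -2, ..., -n.
from_the_back = [int_arr.item(-k) for k in range(1, n + 1)]

# The same elements, reached with the equivalent non-negative indexes.
from_the_front = [int_arr.item(n - k) for k in range(1, n + 1)]

print(from_the_back)
print(from_the_front)
```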

External Reading

  • (mentioned in notes) Computational and Inferential Thinking, Ch 5.1
  • (optional) Tomas Beuzen. Python Programming for Data Science Ch 1.2.

References

U.S. Census Bureau, “EDUCATIONAL ATTAINMENT,” American Community Survey 5-Year Estimates Subject Tables, Table S1501, 2020, https://data.census.gov/table/ACSST5Y2020.S1501?q=2020+education&t=Age+and+Sex:Educational+Attainment&g=010XX00US$0400000, accessed on August 24, 2025.