Iteration Practice

The examples below practice writing loops over tabular data.

Titanic: For loop

Consider the titanic dataset, which lists information about each passenger aboard the 1912 ocean liner. Read more about the Titanic and its tragic end on Wikipedia.

titanic.show(5)

survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	3	male	22	1	7.25	S	Third	man	True	nan	Southampton	no	False
1	1	female	38	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
1	3	female	26	0	7.925	S	Third	woman	False	nan	Southampton	yes	True
1	1	female	35	1	53.1	S	First	woman	False	C	Southampton	yes	False
0	3	male	35	0	8.05	S	Third	man	True	nan	Southampton	no	True

... (886 rows omitted)

What is the average fare of the first-class passengers?

We can compute this with Table methods:

titanic.where("class", "First").column("fare").mean()

84.154687499999994

We can also compute this using a for loop and conditional statements. It is more verbose, but you will find it productive to compare and contrast the approaches.

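import numpy as np

# Pull the two columns we need out of the titanic table as arrays.
ticket_class = titanic.column("class")
ticket_fare = titanic.column("fare")

total = 0  # running sum of first-class fares
count = 0  # number of first-class passengers seen so far
for i in np.arange(len(ticket_class)):
    if ticket_class.item(i) == "First":
        total += ticket_fare.item(i)
        count += 1
total / count  # average first-class fare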

84.15468749999992

A for loop is preferable to a while loop here because we must examine every passenger to determine whether they bought a first-class ticket; there is no point at which we could stop early.
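As a hand-checkable sketch of the same loop, the block below uses plain Python lists holding only the five rows displayed above, not the full titanic table. The function names average_fare_loop and average_fare_filter are hypothetical, chosen for illustration; the second mirrors the filter-then-average shape of the Table expression.

```python
# Toy data: the "class" and "fare" values from the five rows displayed above.
ticket_class = ["Third", "First", "Third", "First", "Third"]
ticket_fare = [7.25, 71.2833, 7.925, 53.1, 8.05]


def average_fare_loop(classes, fares, wanted):
    """Average fare for one ticket class, using an explicit for loop."""
    total = 0
    count = 0
    for i in range(len(classes)):
        if classes[i] == wanted:
            total += fares[i]
            count += 1
    return total / count  # raises ZeroDivisionError if no row matches


def average_fare_filter(classes, fares, wanted):
    """Same result: keep the matching fares first, then average them."""
    matching = [fare for cls, fare in zip(classes, fares) if cls == wanted]
    return sum(matching) / len(matching)


loop_result = average_fare_loop(ticket_class, ticket_fare, "First")
filter_result = average_fare_filter(ticket_class, ticket_fare, "First")
print(loop_result, filter_result)
```

Both results are computed from only two first-class fares, so they will not match the full-table average of about 84.15.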

UC Berkeley enrollment: While loop

The University of California keeps historical data on the number of students enrolled at each of the UC campuses. The enrollment table below lists UC Berkeley undergraduate and graduate enrollments by year:

enrollment.show(5)

year	undergrad	graduate
1869	40	0
1870	90	3
1871	151	0
1872	185	0
1873	189	2

... (151 rows omitted)

By what year had UC Berkeley enrolled a cumulative total of 1,000,000 undergraduates?

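total_undergrads = 0  # running total of undergraduate enrollments
year_index = 0        # index of the next row to add
undergrads = enrollment.column("undergrad")

# Keep adding one year at a time until the total passes 1,000,000.
while total_undergrads <= 1000000:
    total_undergrads += undergrads.item(year_index)
    year_index += 1

# year_index now points one row past the last year added,
# so the total covers every year before the year printed here.
print("Total undergrads enrolled before", enrollment.column("year").item(year_index),
      "was", total_undergrads)

Total undergrads enrolled before 1982 was 1002972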

A while loop is preferable here because the rows are sorted by year, so we can stop as soon as the cumulative total passes 1,000,000 and never look at the remaining rows.
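The early-stopping idea can be sketched on a small scale. The block below uses only the five enrollment rows displayed above and a much smaller, made-up target of 250; the function name first_year_reaching is hypothetical. Unlike the cell above, this sketch also guards against running off the end of the data, and it reports the year whose enrollment brought the total up to the target.

```python
# Toy data: the first five rows of the enrollment table shown above.
years = [1869, 1870, 1871, 1872, 1873]
undergrads = [40, 90, 151, 185, 189]


def first_year_reaching(years, counts, target):
    """Return (year, running_total) for the first year in which the
    cumulative count reaches target, or None if it never does.
    Assumes target > 0 and that the rows are sorted by year."""
    total = 0
    i = 0
    # Stop when the target is reached or when the rows run out.
    while total < target and i < len(counts):
        total += counts[i]
        i += 1
    if total < target:
        return None
    # i was incremented after the last addition, so step back one row.
    return years[i - 1], total


print(first_year_reaching(years, undergrads, 250))
print(first_year_reaching(years, undergrads, 1000))
```

The second call returns None because the five toy rows sum to only 655, which is the case the extra bounds check in the loop condition is there to handle.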