Iteration Practice

These examples give practice writing for loops and while loops over table data.

Titanic: For loop

Consider the titanic dataset, which lists information about each passenger aboard the 1912 ocean liner. Read more about the Titanic and its tragic end on Wikipedia.

titanic.show(5)
survived  pclass  sex     age  sibsp  parch  fare     embarked  class  who    adult_male  deck  embark_town  alive  alone
0         3       male    22   1      0      7.25     S         Third  man    True        nan   Southampton  no     False
1         1       female  38   1      0      71.2833  C         First  woman  False       C     Cherbourg    yes    False
1         3       female  26   0      0      7.925    S         Third  woman  False       nan   Southampton  yes    True
1         1       female  35   1      0      53.1     S         First  woman  False       C     Southampton  yes    False
0         3       male    35   0      0      8.05     S         Third  man    True        nan   Southampton  no     True

... (886 rows omitted)

What is the average fare of the first-class passengers?

We can compute this with Table methods:

titanic.where("class", "First").column("fare").mean()
84.154687499999994

We can also compute this using a for loop and conditional statements. The loop is more verbose, but comparing the two approaches is instructive.

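ticket_class = titanic.column("class")
ticket_fare = titanic.column("fare")

total = 0  # running sum of first-class fares
count = 0  # number of first-class passengers seen so far
for i in np.arange(len(ticket_class)):
    # Only rows whose class is "First" contribute to the average.
    if ticket_class.item(i) == "First":
        total += ticket_fare.item(i)
        count += 1
total / count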
84.15468749999992
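
The two results above agree only to about 13 decimal places. A likely reason is that floating-point addition is not associative, so summing the same fares in a different order (as a library routine may do internally) can change the last few digits. The snippet below is a minimal illustration using made-up numbers, not the Titanic fares:

```python
import math

# Summing the same three numbers in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right)   # 0.6000000000000001
print(right_to_left)   # 0.6

# The results are not bit-for-bit equal, but they are equal within tolerance.
same_bits = left_to_right == right_to_left
close_enough = math.isclose(left_to_right, right_to_left)
print(same_bits, close_enough)   # False True
```

When comparing computed averages like these, checking that they are close (rather than exactly equal) is usually the safer test.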

A for loop is the natural choice here because we must examine every passenger to determine whether they held a first-class ticket.
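
The same accumulate-and-count pattern can be sketched without any table library. The lists and the `average_fare` helper below are hypothetical stand-ins chosen for illustration; they are not part of the titanic table or of any library:

```python
# Hypothetical miniature data standing in for two columns of a passenger table.
ticket_class = ["Third", "First", "Third", "First", "Second"]
ticket_fare = [7.25, 71.25, 8.0, 53.25, 13.0]

def average_fare(classes, fares, wanted):
    """Average the fares of rows whose class equals `wanted`.

    Returns None when no row matches, to avoid dividing by zero.
    """
    total = 0
    count = 0
    # Every row must be visited, so a for loop over all rows fits naturally.
    for cls, fare in zip(classes, fares):
        if cls == wanted:
            total += fare
            count += 1
    if count == 0:
        return None
    return total / count

first_class_average = average_fare(ticket_class, ticket_fare, "First")
print(first_class_average)   # 62.25
```

Iterating with `zip` pairs each class with its fare directly, which avoids indexing mistakes; the index-based loop in the text does the same work one position at a time.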

UC Berkeley enrollment: While loop

The University of California keeps historical data on the number of students enrolled at each UC campus. The enrollment table below lists UC Berkeley undergraduate and graduate enrollments by year:

enrollment.show(5)
year undergrad graduate
1869 40 0
1870 90 3
1871 151 0
1872 185 0
1873 189 2

... (151 rows omitted)

In what year did UC Berkeley's cumulative undergraduate enrollment first reach 1,000,000?

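total_undergrads = 0
year_index = 0
undergrads = enrollment.column("undergrad")

# Keep adding one year of enrollment until the running total reaches 1,000,000.
while total_undergrads < 1000000:
    total_undergrads += undergrads.item(year_index)
    year_index += 1

# year_index now points one row past the last year that was added, so the total
# covers every year before the one printed here; the milestone itself was
# reached in the preceding row (year_index - 1).
print("Total undergrads enrolled before", enrollment.column("year").item(year_index),
      "was", total_undergrads)
Total undergrads enrolled before 1982 was 1002972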

A while loop is preferable here because the rows are sorted by year and we do not know in advance how many of them we need: we can stop as soon as the running total reaches 1,000,000 instead of visiting every row.
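
A point that is easy to get wrong in this pattern is which index holds the answer once the loop stops. The sketch below uses hypothetical miniature data and a hypothetical helper, `year_total_reached`, to show one way of handling it; it also guards against running off the end of the data, which the loop above does not need only because the threshold is known to be reached.

```python
# Hypothetical miniature enrollment data, sorted by year.
years = [2001, 2002, 2003, 2004, 2005]
enrolled = [40, 90, 150, 185, 190]

def year_total_reached(years, counts, threshold):
    """Return (year, running_total) for the first year in which the
    cumulative count reaches `threshold`, or None if it never does.

    Assumes `threshold` is positive and the rows are sorted by year.
    """
    total = 0
    i = 0
    # Stop as soon as the threshold is reached, or when the data runs out.
    while i < len(counts) and total < threshold:
        total += counts[i]
        i += 1
    if total < threshold:
        return None
    # i was incremented after the last addition, so the row that crossed
    # the threshold is at i - 1.
    return years[i - 1], total

result = year_total_reached(years, enrolled, 250)
print(result)   # (2003, 280)
```

Here the running totals are 40, 130, and 280, so the threshold of 250 is reached in the third row, and the loop never looks at the remaining rows.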