Consider the titanic dataset, which lists information about each passenger aboard the 1912 ocean liner. Read more about the Titanic and its tragic end on Wikipedia.
titanic.show(5)
survived | pclass | sex    | age | sibsp | parch | fare    | embarked | class | who   | adult_male | deck | embark_town | alive | alone
0        | 3      | male   | 22  | 1     | 0     | 7.25    | S        | Third | man   | True       | nan  | Southampton | no    | False
1        | 1      | female | 38  | 1     | 0     | 71.2833 | C        | First | woman | False      | C    | Cherbourg   | yes   | False
1        | 3      | female | 26  | 0     | 0     | 7.925   | S        | Third | woman | False      | nan  | Southampton | yes   | True
1        | 1      | female | 35  | 1     | 0     | 53.1    | S        | First | woman | False      | C    | Southampton | yes   | False
0        | 3      | male   | 35  | 0     | 0     | 8.05    | S        | Third | man   | True       | nan  | Southampton | no    | True
... (886 rows omitted)
What is the average fare of the first-class passengers?
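One way to answer this without an explicit loop is to select the matching fares with a boolean mask and average them. The following is a minimal sketch, not necessarily the approach the text has in mind: the arrays `sample_class` and `sample_fare` are hypothetical stand-ins for `titanic.column("class")` and `titanic.column("fare")`, and they hold only the five rows displayed above, so the result differs from the full-table answer.

```python
import numpy as np

# Stand-ins for titanic.column("class") and titanic.column("fare"),
# limited to the five rows displayed above (not the full table).
sample_class = np.array(["Third", "First", "Third", "First", "Third"])
sample_fare = np.array([7.25, 71.2833, 7.925, 53.1, 8.05])

# Boolean mask: True where the passenger held a first-class ticket.
is_first = sample_class == "First"

# Average fare over the selected passengers only.
first_class_mean = sample_fare[is_first].mean()
print(first_class_mean)
```

Applied to the full columns, the same mask-and-average pattern should examine every passenger, just as a loop does, but without managing the index by hand.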
We can also compute this using a for loop and conditional statements. It is more verbose, but you will find it productive to compare and contrast the approaches.
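ticket_class = titanic.column("class")
ticket_fare = titanic.column("fare")
total = 0   # running sum of first-class fares
count = 0   # number of first-class passengers seen so far
for i in np.arange(len(ticket_class)):
    # Only first-class passengers contribute to the average.
    if ticket_class.item(i) == "First":
        total += ticket_fare.item(i)
        count += 1
total / count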
84.15468749999992
A for loop is the natural choice here because the number of iterations is known in advance: we must examine every passenger to determine whether they bought a first-class ticket.
UC Berkeley enrollment: While loop
The University of California keeps historical data on the number of students enrolled at each UC campus. The enrollment table below lists UC Berkeley undergraduate and graduate enrollments by year:
enrollment.show(5)
year | undergrad | graduate
1869 | 40        | 0
1870 | 90        | 3
1871 | 151       | 0
1872 | 185       | 0
1873 | 189       | 2
... (151 rows omitted)
In what year did UC Berkeley's cumulative undergraduate enrollment reach 1,000,000?
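A while loop suits this question because we do not know in advance how many years must be added before the running total reaches the target. The sketch below is illustrative only: the function name `year_reaching` is hypothetical, the arrays hold just the five rows displayed above as stand-ins for `enrollment.column("year")` and `enrollment.column("undergrad")`, and the target of 400 is an assumed small value chosen so that the sample data can reach it.

```python
import numpy as np

# Stand-ins for enrollment.column("year") and enrollment.column("undergrad"),
# limited to the five rows displayed above (not the full table).
years = np.array([1869, 1870, 1871, 1872, 1873])
undergrad = np.array([40, 90, 151, 185, 189])


def year_reaching(years, counts, target):
    """Return the first year in which the running total of counts is at
    least target, or None if the data runs out (or no year was added)."""
    total = 0
    i = 0
    # Keep adding years only while the target is unmet and data remains.
    while total < target and i < len(counts):
        total += counts.item(i)
        i += 1
    if total < target or i == 0:
        return None
    # i was advanced past the year that pushed the total over the target.
    return years.item(i - 1)


# Hypothetical small target; the question in the text uses 1,000,000.
print(year_reaching(years, undergrad, 400))
```

With the full enrollment columns and a target of 1,000,000, the same loop should stop at the first year whose cumulative total meets the target. The `i < len(counts)` check guards against running past the end of the data if the target is never reached.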