We've seen a few built-in Python functions so far.
int('-14') # Evaluates to -14
-14
abs(-14) # Evaluates to 14
14
max(-14, 15) # Evaluates to 15
15
print('zoology') # Prints zoology, evaluates to None
zoology
So far, though, we don't have a good way to keep our code from getting repetitive. For example, suppose we want to determine whether different students are ready to graduate:
units_1 = 104
year_1 = 'sophomore'
ready_to_graduate_1 = (year_1 == 'senior') and (units_1 >= 120)
ready_to_graduate_1
False
units_2 = 121
year_2 = 'senior'
ready_to_graduate_2 = (year_2 == 'senior') and (units_2 >= 120)
ready_to_graduate_2
True
units_3 = 125
year_3 = 'junior'
ready_to_graduate_3 = (year_3 == 'senior') and (units_3 >= 120)
ready_to_graduate_3
False
Here's a better solution:
def ready_to_graduate(year, units):
    return (year == 'senior') and (units >= 120)
ready_to_graduate(year_1, units_1)
False
ready_to_graduate(year_2, units_2)
True
ready_to_graduate(year_3, units_3)
False
By using a function, we only had to write out the logic once, and could easily call it any number of times.
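As a small sketch of that reuse, the same function can be called once per student over a collection of records. The `students` list below is made up for illustration (it mirrors the three students above), and the list comprehension is a technique this lecture has not introduced, so treat it as a preview.

```python
def ready_to_graduate(year, units):
    return (year == 'senior') and (units >= 120)

# Hypothetical (year, units) pairs mirroring the three students above.
students = [('sophomore', 104), ('senior', 121), ('junior', 125)]

# Call the function once per student; the logic itself is written only once.
results = [ready_to_graduate(year, units) for year, units in students]
print(results)  # [False, True, False]
```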
Other function examples:
# This function has one parameter, x.
# When we call the function, the value we pass in
# as an argument will replace x in the computation.
def triple(x):
    return x*3
triple(15)
45
triple(-1.0)
-3.0
# Functions can have zero parameters!
def always_true():
    return True
# The body of a function can be
# longer than one line.
def pythagorean(a, b):
    c_squared = a**2 + b**2
    return c_squared**0.5
always_true()
True
# Good
def square(x):
    return x**2
# Bad
def square(x):
return x**2
  File "/tmp/ipykernel_142/4284077287.py", line 3
    return x**2
    ^
IndentationError: expected an indented block
def mystery(t):
    return t + '0'
alpha = mystery('19')
beta = mystery(19)
charlie = mystery('1' + '9')
(alpha, beta, charlie)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_142/809340847.py in <module>
      1 alpha = mystery('19')
----> 2 beta = mystery(19)
      3 charlie = mystery('1' + '9')
      4 (alpha, beta, charlie)

/tmp/ipykernel_142/954243894.py in mystery(t)
      1 def mystery(t):
----> 2     return t + '0'

TypeError: unsupported operand type(s) for +: 'int' and 'str'
def eat(zebra):
    return 'ate ' + zebra
eat('lionel')
'ate lionel'
zebra
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/tmp/ipykernel_142/1378667599.py in <module>
----> 1 zebra

NameError: name 'zebra' is not defined
N = 15
def half(N):
    return N/2
half(0)
0.0
half(12)
6.0
half(N)
7.5
N = 15
def addN(x):
    return x + N
addN(0)
15
addN(3)
18
triple(15)
45
triple(1/0)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
/tmp/ipykernel_142/2073793153.py in <module>
----> 1 triple(1/0)

ZeroDivisionError: division by zero
triple(3, 4)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_142/3999475957.py in <module>
----> 1 triple(3, 4)

TypeError: triple() takes 1 positional argument but 2 were given
print('my', 'name', 'is', 300)
my name is 300
def add_and_print(a, b):
    total = a + b
    print(total)
total = add_and_print(3, 4)
7
total
print(total)
None
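The cells above show that `add_and_print` displays its result but hands back `None`, because it has no `return` statement. As a sketch of the contrast, the hypothetical `add_and_return` below (a name introduced here only for illustration) returns the sum so that it can be stored and reused.

```python
def add_and_print(a, b):
    total = a + b
    print(total)          # displays the value; the function itself returns None

def add_and_return(a, b):  # hypothetical counterpart, not from the lecture
    total = a + b
    return total          # hands the value back to the caller

printed = add_and_print(3, 4)    # prints 7, but printed is None
returned = add_and_return(3, 4)  # prints nothing, but returned is 7
print(printed, returned)
```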
Nothing after the return keyword is run.
def odd(n):
    return n % 2 == 1
    print('this will never be printed!')
odd(15)
True
odd(2)
False
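As a minimal sketch of the same rule, the hypothetical function below (not part of the lecture) has two `return` statements in a row. Only the first one ever runs, so the second value is never produced.

```python
def first_return_wins():  # hypothetical name, for illustration only
    return 'first'
    return 'second'       # unreachable: the function has already returned

value = first_return_wins()
print(value)  # first
```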
total = 3
def square_and_cube(a, b):
    return a**2 + total**b
total = square_and_cube(1, 2)
total
10
total = square_and_cube(1, 2)
total
101
'ian'.upper()
'IAN'
s = 'JuNiOR12'
s.upper()
'JUNIOR12'
s.lower()
'junior12'
s.replace('i', 'iii')
'JuNiiiOR12'
Let's load in the same Wikipedia countries data from this week's earlier lectures. But this time, we will write some of the data cleaning functions ourselves.
from datascience import *
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np
data = Table.read_table('data/countries.csv')
data = data.take(np.arange(0, data.num_rows - 1))
data = data.relabeled('Country(or dependent territory)', 'Country') \
           .relabeled('% of world', '%') \
           .relabeled('Source(official or UN)', 'Source')
data = data.with_columns(
    'Country', data.apply(lambda s: s[:s.index('[')].lower() if '[' in s else s.lower(), 'Country'))
def first_letter(s):
    return s[0]
def last_letter(s):
    return s[-1]
data
Rank | Country | Population | % | Date | Source |
---|---|---|---|---|---|
1 | china | 1,405,936,040 | 17.9% | 27 Dec 2020 | National population clock[3] |
2 | india | 1,371,366,679 | 17.5% | 27 Dec 2020 | National population clock[4] |
3 | united states | 330,888,778 | 4.22% | 27 Dec 2020 | National population clock[5] |
4 | indonesia | 269,603,400 | 3.44% | 1 Jul 2020 | National annual projection[6] |
5 | pakistan | 220,892,331 | 2.82% | 1 Jul 2020 | UN Projection[2] |
6 | brazil | 212,523,810 | 2.71% | 27 Dec 2020 | National population clock[7] |
7 | nigeria | 206,139,587 | 2.63% | 1 Jul 2020 | UN Projection[2] |
8 | bangladesh | 169,885,314 | 2.17% | 27 Dec 2020 | National population clock[8] |
9 | russia | 146,748,590 | 1.87% | 1 Jan 2020 | National annual estimate[9] |
10 | mexico | 127,792,286 | 1.63% | 1 Jul 2020 | National annual projection[10] |
... (231 rows omitted)
Let's look at the 'Population' column.
# ignore
china_pop = data.column('Population').take(0)
china_pop
'1,405,936,040'
We want these numbers to be integers, so that we can do arithmetic with them or plot them. However, right now they are not.
Let's write a function that takes in a string in that format and returns the corresponding integer. But first, proof that the int function doesn't work here (it doesn't like the commas):
int(china_pop)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_142/2345823718.py in <module>
----> 1 int(china_pop)

ValueError: invalid literal for int() with base 10: '1,405,936,040'
china_pop
'1,405,936,040'
def clean_population_string(pop):
    no_comma = pop.replace(',', '')
    return int(no_comma)
china_pop_clean = clean_population_string(china_pop)
china_pop_clean
1405936040
Cool!
Using techniques we haven't yet learned, we can apply this function to every element of the 'Population' column, so that when we visualize it, things work.
# ignore
data = data.with_columns('Population', data.apply(clean_population_string, 'Population'))
data
Rank | Country | Population | % | Date | Source |
---|---|---|---|---|---|
1 | china | 1405936040 | 17.9% | 27 Dec 2020 | National population clock[3] |
2 | india | 1371366679 | 17.5% | 27 Dec 2020 | National population clock[4] |
3 | united states | 330888778 | 4.22% | 27 Dec 2020 | National population clock[5] |
4 | indonesia | 269603400 | 3.44% | 1 Jul 2020 | National annual projection[6] |
5 | pakistan | 220892331 | 2.82% | 1 Jul 2020 | UN Projection[2] |
6 | brazil | 212523810 | 2.71% | 27 Dec 2020 | National population clock[7] |
7 | nigeria | 206139587 | 2.63% | 1 Jul 2020 | UN Projection[2] |
8 | bangladesh | 169885314 | 2.17% | 27 Dec 2020 | National population clock[8] |
9 | russia | 146748590 | 1.87% | 1 Jan 2020 | National annual estimate[9] |
10 | mexico | 127792286 | 1.63% | 1 Jul 2020 | National annual projection[10] |
... (231 rows omitted)
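For intuition about what applying a function to a whole column means, here is a rough plain-Python analogue that does not use the `datascience` library: it calls the function on each string in an ordinary list. The three strings are copied from the first rows of the original table, and the list `population_strings` is only an illustrative stand-in for the column.

```python
def clean_population_string(pop):
    no_comma = pop.replace(',', '')
    return int(no_comma)

# Stand-in for the first three entries of the 'Population' column.
population_strings = ['1,405,936,040', '1,371,366,679', '330,888,778']

# Call the cleaning function once per element.
cleaned = [clean_population_string(p) for p in population_strings]
print(cleaned)  # [1405936040, 1371366679, 330888778]
```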
The '%' column is also a little fishy.
china_pct = data.column('%').take(0)
china_pct
'17.9%'
Percentages should be floats, but here they're strings.
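Just as `int` rejected the commas earlier, `float` does not accept the trailing percent sign. The sketch below checks this directly; it uses `try`/`except`, which this lecture has not covered, only so that the example runs to completion instead of stopping at the error.

```python
china_pct = '17.9%'  # value copied from the cell above

try:
    float(china_pct)          # raises ValueError because of the '%' character
except ValueError as err:
    conversion_failed = True
    print('float() failed:', err)
else:
    conversion_failed = False

print(conversion_failed)
```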
Let's suppose we want the proportion of the total global population that lives in a given country as a column in our table. Proportions are decimals (or fractions) between 0 and 1. We can do this in two ways. One is to write a function, similar to clean_population_string, that correctly extracts the proportion we need from the '%' column. The other is to divide each country's population by the sum of the 'Population' column.
Let's do... both!
def clean_pct_string(pct):
    no_symbol = pct.replace('%', '')
    prop = float(no_symbol) / 100
    return prop
clean_pct_string(china_pct)
0.179
Nice! The other way requires adding together all of the values in the 'Population' column. We haven't covered how to do that just yet, so ignore the code for it and assume it does what it should.
total_population = data.column('Population').sum()
total_population
7710658195
Assume this is the total population of the world. How would you calculate the proportion of people living in one country?
def compute_proportion(population):
    return population / total_population
china_pop_clean
1405936040
compute_proportion(china_pop_clean)
0.18233670906482194
Pretty close to clean_pct_string(china_pct). The difference is likely due to some countries not being included in one column or the other.
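To put a number on "pretty close", the sketch below subtracts one proportion from the other. The values are copied from the outputs above rather than recomputed from the table, and the variable names are introduced here for illustration.

```python
from_pct_column = 0.179                     # clean_pct_string(china_pct)
from_population = 1405936040 / 7710658195   # compute_proportion(china_pop_clean)

# Gap between the two estimates of China's share of world population.
difference = from_population - from_pct_column
print(round(difference, 4))  # 0.0033
```

The two approaches differ by roughly a third of a percentage point.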
Hopefully this gives you a glimpse of the power of functions!