Lecture 5 – Functions and Control

Data 94, Spring 2021

Motivation

We've already seen a few of Python's built-in functions.

In [1]:
int('-14')              # Evaluates to -14
Out[1]:
-14
In [2]:
abs(-14)                # Evaluates to 14
Out[2]:
14
In [3]:
max(-14, 15)            # Evaluates to 15
Out[3]:
15
In [4]:
print('zoology')        # Prints zoology, evaluates to None
zoology

We don't currently have a good way to prevent our code from getting repetitive. For example, suppose we want to determine whether each of several students is ready to graduate:

In [5]:
units_1 = 104
year_1 = 'sophomore'
ready_to_graduate_1 = (year_1 == 'senior') and (units_1 >= 120)
ready_to_graduate_1
Out[5]:
False
In [6]:
units_2 = 121
year_2 = 'senior'
ready_to_graduate_2 = (year_2 == 'senior') and (units_2 >= 120)
ready_to_graduate_2
Out[6]:
True
In [7]:
units_3 = 125
year_3 = 'junior'
ready_to_graduate_3 = (year_3 == 'senior') and (units_3 >= 120)
ready_to_graduate_3
Out[7]:
False

Functions

Here's a better solution:

In [8]:
def ready_to_graduate(year, units):
    return (year == 'senior') and (units >= 120)
In [9]:
ready_to_graduate(year_1, units_1)
Out[9]:
False
In [10]:
ready_to_graduate(year_2, units_2)
Out[10]:
True
In [11]:
ready_to_graduate(year_3, units_3)
Out[11]:
False

By using a function, we only had to write out the logic once, and could easily call it any number of times.
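
As a small sketch of that reuse, the function can also be called with literal values directly, without first storing them in variables. The student values below are made up for illustration.

```python
def ready_to_graduate(year, units):
    return (year == 'senior') and (units >= 120)

# Hypothetical students, passed in as literal arguments
print(ready_to_graduate('senior', 120))     # True: exactly 120 units is enough
print(ready_to_graduate('senior', 119))     # False: one unit short
print(ready_to_graduate('freshman', 130))   # False: enough units, but not a senior
```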

Other function examples:

In [12]:
# This function has one parameter, x.
# When we call the function, the value we pass in
# as an argument will replace x in the computation.

def triple(x):
    return x*3
In [13]:
triple(15)
Out[13]:
45
In [14]:
triple(-1.0)
Out[14]:
-3.0
In [15]:
# Functions can have zero parameters!
def always_true():
    return True

# The body of a function can be
# longer than one line.
def pythagorean(a, b):
    c_squared = a**2 + b**2
    return c_squared**0.5
In [16]:
always_true()
Out[16]:
True
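
The pythagorean function is defined above but never called. A quick check with a 3-4-5 right triangle (side lengths chosen only for illustration) shows the multi-line body at work:

```python
def pythagorean(a, b):
    c_squared = a**2 + b**2     # 9 + 16 = 25
    return c_squared**0.5       # square root of 25

print(pythagorean(3, 4))   # 5.0
```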
In [17]:
# Good
def square(x):
    return x**2
In [18]:
# Bad
def square(x):
return x**2
  File "<ipython-input-18-a3cc806c6b2f>", line 3
    return x**2
    ^
IndentationError: expected an indented block

Parameters and return values

Scoping

In [19]:
def eat(zebra):
    return 'ate ' + zebra
In [20]:
eat('lionel')
Out[20]:
'ate lionel'
In [21]:
zebra
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-21-65496db47d6a> in <module>
----> 1 zebra

NameError: name 'zebra' is not defined
In [22]:
N = 15
def half(N):
    return N/2
In [23]:
half(0)
Out[23]:
0.0
In [24]:
half(12)
Out[24]:
6.0
In [25]:
half(N)
Out[25]:
7.5
In [26]:
N = 15
def addN(x):
    return x + N
In [27]:
addN(0)
Out[27]:
15
In [28]:
addN(3)
Out[28]:
18
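
The two examples above behave differently, and it may help to see them side by side. In half, the parameter N hides the global N inside the function body. In addN, there is no local N, so Python looks up the global one each time the function is called. The reassignment N = 100 at the end is an addition for illustration, not part of the lecture code.

```python
N = 15

def half(N):
    # This N is the parameter; it hides the global N inside the function.
    return N / 2

def addN(x):
    # No local N exists here, so Python looks up the global N.
    return x + N

print(half(12))   # 6.0: uses the argument 12, not the global 15
print(addN(3))    # 18: uses the global N, which is 15

N = 100           # hypothetical reassignment
print(addN(3))    # 103: the global is looked up when addN is called
```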
In [29]:
triple(15)
Out[29]:
45
In [30]:
triple(1/0)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-30-025cd7fce23b> in <module>
----> 1 triple(1/0)

ZeroDivisionError: division by zero
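
The error above comes from evaluating the argument 1/0, not from triple itself: Python evaluates each argument expression first, then passes the resulting value to the function. A short sketch with arguments that do evaluate successfully:

```python
def triple(x):
    return x * 3

print(triple(2 + 3))       # 15: 2 + 3 is evaluated to 5 first
print(triple(triple(2)))   # 18: the inner call returns 6, which is then tripled
```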
In [31]:
triple(3, 4)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-31-186ca6651497> in <module>
----> 1 triple(3, 4)

TypeError: triple() takes 1 positional argument but 2 were given
In [32]:
print('my', 'name', 'is', 300)
my name is 300

Returning

In [33]:
def add_and_print(a, b):
    total = a + b
    print(total)
In [34]:
total = add_and_print(3, 4)
7
In [35]:
total
In [36]:
print(total)
None
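
A function without a return statement evaluates to None, even if it prints something. The sketch below contrasts add_and_print with a returning version; add_and_return is a hypothetical name introduced only for this comparison.

```python
def add_and_print(a, b):
    total = a + b
    print(total)      # displays the sum, but the function still returns None

def add_and_return(a, b):
    total = a + b
    return total      # hands the sum back to the caller

printed = add_and_print(3, 4)     # prints 7
returned = add_and_return(3, 4)   # prints nothing

print(printed)    # None
print(returned)   # 7
```

Only the returned value can be stored and used in later computation, such as returned + 1.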

Once a return statement runs, the function stops; no later line in its body is executed.

In [37]:
def odd(n):
    return n % 2 == 1
    print('this will never be printed!')
In [38]:
odd(15)
Out[38]:
True
In [39]:
total = 8
In [40]:
odd(2)
Out[40]:
False
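
To see that the position of return matters, compare two variants of odd. The names odd_quiet and odd_loud are hypothetical and exist only for this sketch.

```python
def odd_quiet(n):
    return n % 2 == 1
    print('checked', n)   # unreachable: the function has already returned

def odd_loud(n):
    print('checked', n)   # runs, because it comes before the return
    return n % 2 == 1

print(odd_quiet(15))   # True
print(odd_loud(15))    # prints "checked 15", then True
```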

String methods

In [41]:
'isaac'.upper()
Out[41]:
'ISAAC'
In [42]:
s = 'JuNiOR12'
s.upper()
Out[42]:
'JUNIOR12'
In [43]:
s.lower()
Out[43]:
'junior12'
In [44]:
s.replace('i', 'iii')
Out[44]:
'JuNiiiOR12'
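
Note that string methods return a new string and leave the original untouched. A brief sketch, which also shows that method calls can be chained (the variable names upper_s and cleaned are illustrative):

```python
s = 'JuNiOR12'

upper_s = s.upper()
print(upper_s)   # JUNIOR12
print(s)         # JuNiOR12: the original string is unchanged

# Each method returns a string, so another method can be called on the result.
cleaned = s.lower().replace('12', '')
print(cleaned)   # junior
```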

Demo

Let's load in the same Wikipedia countries data from this week's earlier lectures. But this time, we will write some of the data cleaning functions ourselves.

In [45]:
from datascience import *
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np

data = Table.read_table('data/countries.csv')
data = data.take(np.arange(0, data.num_rows - 1))
data = data.relabeled('Country(or dependent territory)', 'Country') \
           .relabeled('% of world', '%') \
           .relabeled('Source(official or UN)', 'Source')
data = data.with_columns(
    'Country', data.apply(lambda s: s[:s.index('[')].lower() if '[' in s else s.lower(), 'Country'))

def first_letter(s):
    return s[0]

def last_letter(s):
    return s[-1]
In [46]:
data
Out[46]:
Rank Country Population % Date Source
1 china 1,405,936,040 17.9% 27 Dec 2020 National population clock[3]
2 india 1,371,366,679 17.5% 27 Dec 2020 National population clock[4]
3 united states 330,888,778 4.22% 27 Dec 2020 National population clock[5]
4 indonesia 269,603,400 3.44% 1 Jul 2020 National annual projection[6]
5 pakistan 220,892,331 2.82% 1 Jul 2020 UN Projection[2]
6 brazil 212,523,810 2.71% 27 Dec 2020 National population clock[7]
7 nigeria 206,139,587 2.63% 1 Jul 2020 UN Projection[2]
8 bangladesh 169,885,314 2.17% 27 Dec 2020 National population clock[8]
9 russia 146,748,590 1.87% 1 Jan 2020 National annual estimate[9]
10 mexico 127,792,286 1.63% 1 Jul 2020 National annual projection[10]

... (231 rows omitted)

Let's look at the 'Population' column.

In [47]:
# ignore
china_pop = data.column('Population').take(0)
In [48]:
china_pop
Out[48]:
'1,405,936,040'

We want these numbers to be integers, so that we can do arithmetic with them or plot them. Right now, however, they are strings.

Let's write a function that takes in a string with that format, and returns the corresponding integer. But first, proof that the int function doesn't work here (it doesn't like the commas):

In [49]:
int(china_pop)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-49-5588f41b2e1b> in <module>
----> 1 int(china_pop)

ValueError: invalid literal for int() with base 10: '1,405,936,040'
In [50]:
china_pop
Out[50]:
'1,405,936,040'
In [51]:
def clean_population_string(pop):
    no_comma = pop.replace(',', '')
    return int(no_comma)
In [52]:
china_pop_clean = clean_population_string(china_pop)
china_pop_clean
Out[52]:
1405936040

Cool!
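
The same function works on any string in this format. As a sketch, here it is applied to the population strings shown in the table for India and Mexico; once converted, the values support comparison and arithmetic. The variable names india_pop and mexico_pop are illustrative.

```python
def clean_population_string(pop):
    no_comma = pop.replace(',', '')
    return int(no_comma)

india_pop = clean_population_string('1,371,366,679')
mexico_pop = clean_population_string('127,792,286')

print(india_pop > mexico_pop)   # True
print(india_pop + mexico_pop)   # 1499158965
```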

Using techniques we haven't yet learned, we can apply this function to every element of the 'Population' column, so that when we visualize it, things work.

In [53]:
# ignore
data = data.with_columns('Population', data.apply(clean_population_string, 'Population'))
In [54]:
data
Out[54]:
Rank Country Population % Date Source
1 china 1405936040 17.9% 27 Dec 2020 National population clock[3]
2 india 1371366679 17.5% 27 Dec 2020 National population clock[4]
3 united states 330888778 4.22% 27 Dec 2020 National population clock[5]
4 indonesia 269603400 3.44% 1 Jul 2020 National annual projection[6]
5 pakistan 220892331 2.82% 1 Jul 2020 UN Projection[2]
6 brazil 212523810 2.71% 27 Dec 2020 National population clock[7]
7 nigeria 206139587 2.63% 1 Jul 2020 UN Projection[2]
8 bangladesh 169885314 2.17% 27 Dec 2020 National population clock[8]
9 russia 146748590 1.87% 1 Jan 2020 National annual estimate[9]
10 mexico 127792286 1.63% 1 Jul 2020 National annual projection[10]

... (231 rows omitted)

The '%' column is also a little fishy.

In [55]:
china_pct = data.column('%').take(0)
china_pct
Out[55]:
'17.9%'

Percentages should be floats, but here they're strings.

Let's suppose we want to have the proportion of the total global population that lives in a given country as a column in our table. Proportions are decimals/fractions between 0 and 1. We can do this two ways:

  • write a function, similar to clean_population_string, that correctly extracts the proportion we need
  • calculate this by hand using all of the values in 'Population'

Let's do... both!

In [56]:
def clean_pct_string(pct):
    no_symbol = pct.replace('%', '')
    prop = float(no_symbol) / 100
    return prop
In [57]:
clean_pct_string(china_pct)
Out[57]:
0.179
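
As a quick sanity check that the result is a proportion between 0 and 1, the function can be tried on two easy inputs. The strings '50%' and '100%' are hypothetical and do not come from the table.

```python
def clean_pct_string(pct):
    no_symbol = pct.replace('%', '')   # '50%' becomes '50'
    prop = float(no_symbol) / 100      # 50.0 / 100
    return prop

print(clean_pct_string('50%'))    # 0.5
print(clean_pct_string('100%'))   # 1.0
```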

Nice! The other way requires adding together all of the values in the 'Population' column. We haven't covered how to do that just yet, so ignore the code for it and assume it does what it should.

In [58]:
total_population = data.column('Population').sum()
total_population
Out[58]:
7710658195

Assume this is the total population of the world. How would you calculate the proportion of people living in one country?

In [59]:
def compute_proportion(population):
    return population / total_population
In [60]:
china_pop_clean
Out[60]:
1405936040
In [61]:
compute_proportion(china_pop_clean)
Out[61]:
0.18233670906482194

This is pretty close to the 0.179 returned by clean_pct_string(china_pct). The difference is likely due to some countries not being included in one column or the other.
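
One rough way to size up the gap is to subtract the two proportions, and to ask what world total the 17.9% figure would imply. This is only a sketch using the numbers shown above; since 17.9% is itself rounded, the implied total is approximate. The variable names are illustrative.

```python
china_pop_clean = 1405936040
total_population = 7710658195      # sum of the 'Population' column, from above

prop_from_pct = 0.179              # from clean_pct_string('17.9%')
prop_from_sum = china_pop_clean / total_population

difference = prop_from_sum - prop_from_pct
print(round(difference, 4))        # 0.0033

# World total implied by the '%' column, in billions (approximate)
implied_total = china_pop_clean / prop_from_pct
print(round(implied_total / 1e9, 2))   # 7.85
```

The implied total of roughly 7.85 billion is larger than the 7.71 billion obtained by summing the column, which is consistent with the two figures being based on different totals.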

Hopefully this gives you a glimpse of the power of functions!