Lecture 22 – Case Study

Data 94, Spring 2021

In [1]:
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px

Table.interactive_plots()

Note: We're not going to be able to work through this entire notebook in lecture; you should definitely review whatever we don't get a chance to finish.

World University Rankings 2020

Our dataset comes from Times Higher Education (THE)'s World University Rankings 2020. These are slightly outdated as there is a 2021 ranking now, but the data is still relevant.

In [2]:
world = Table.read_table('data/World_University_Rank_2020.csv')
In [3]:
world
Out[3]:
Rank_Char Score_Rank University Country Number_students Numb_students_per_Staff International_Students Percentage_Female Percentage_Male Teaching Research Citations Industry_Income International_Outlook Score_Result Overall_Ranking
1 1 University of Oxford United Kingdom 20,664 11.2 41% 46% 54% 90.5 99.6 98.4 65.5 96.4 95.4 95.40
2 2 California Institute of Technology United States 2,240 6.4 30% 34% 66% 92.1 97.2 97.9 88 82.5 94.5 94.50
3 3 University of Cambridge United Kingdom 18,978 10.9 37% 47% 53% 91.4 98.7 95.8 59.3 95 94.4 94.40
4 4 Stanford University United States 16,135 7.3 23% 43% 57% 92.8 96.4 99.9 66.2 79.5 94.3 94.30
5 5 Massachusetts Institute of Technology United States 11,247 8.6 34% 39% 61% 90.5 92.4 99.5 86.9 89 93.6 93.60
6 6 Princeton University United States 7,983 8.1 25% 45% 55% 90.3 96.3 98.8 58.6 81.1 93.2 93.20
7 7 Harvard University United States 20,823 9.2 24% 49% 51% 89.2 98.6 99.1 47.3 76.3 93 93.00
8 8 Yale University United States 12,402 5.4 20% 50% 50% 92 94.8 97.3 52.4 68.7 91.7 91.70
9 9 University of Chicago United States 13,833 5.7 28% 46% 54% 89.1 91.4 96.7 52.7 76 90.2 90.20
10 10 Imperial College London United Kingdom 16,760 11.7 56% 38% 62% 84.5 87.6 97 69.9 97.1 89.8 89.80

... (1386 rows omitted)

It's always good to check how many schools we're dealing with:

In [4]:
world.num_rows
Out[4]:
1396

Some columns ('Number_students', 'International_Students', 'Percentage_Female', 'Percentage_Male') have commas and percentage symbols, meaning they can't be stored as integers. Let's clean them.

In [5]:
# Notice how we use apply here!
def remove_symbol(s):
    return int(s.replace('%', '').replace(',', ''))
In [6]:
# Remember, the result of calling apply is an array
world.apply(remove_symbol, 'Number_students')
Out[6]:
array([20664,  2240, 18978, ..., 15236, 17101,  9285])
In [7]:
world = world.with_columns(
    'Number_students', world.apply(remove_symbol, 'Number_students'),
    'International_Students', world.apply(remove_symbol, 'International_Students'),
    'Percentage_Female', world.apply(remove_symbol, 'Percentage_Female'),
    'Percentage_Male', world.apply(remove_symbol, 'Percentage_Male')
)

Now we can sort by any numeric column we want.

In [8]:
world.sort('Percentage_Female')
Out[8]:
Rank_Char Score_Rank University Country Number_students Numb_students_per_Staff International_Students Percentage_Female Percentage_Male Teaching Research Citations Industry_Income International_Outlook Score_Result Overall_Ranking
16 15 Columbia University United States 26586 5.8 37 0 0 85.6 82.6 98.2 44.8 79.3 87 87.00
36 33 The University of Tokyo Japan 25913 10.6 12 0 0 85.9 89.6 60.7 77.4 38.2 75.7 75.70
47 40 The Hong Kong University of Science and Technology Hong Kong 10125 22.3 31 0 0 57.4 66.1 89.8 71.9 97.7 73.1 73.10
51 44 University of Wisconsin-Madison United States 39154 10 13 0 0 68.8 70.3 85.3 46.3 47.4 72 72.00
52 45 Washington University in St Louis United States 13401 7.5 20 0 0 64.2 57.5 98.8 40.7 57.1 71.5 71.50
53 46 Brown University United States 9391 10.8 20 0 0 64 56.1 94.9 36.5 61.4 70 70.00
57 49 Chinese University of Hong Kong Hong Kong 18340 18.6 32 0 0 55.8 62.8 84.5 55.2 97.8 69.6 69.60
64 54 Seoul National University South Korea 26182 12.4 12 0 0 72.3 71.6 66.5 86.6 35.8 68 68.00
74 62 Humboldt University of Berlin Germany 33463 56.1 17 0 0 61.8 66.8 67.9 40.2 67.8 65 65.00
80 67 University of Science and Technology of China China 16245 7.7 4 0 0 64.6 59.5 74.7 79.6 31.3 64 64.00

... (1386 rows omitted)

It seems like the above schools didn't report their sex breakdown, since 0% is the listed percentage of female and male students.

Let's start asking questions.

Which universities have the most international students?

We have an 'International_Students' column, but that tells us the percentage of international students at each school. Let's update that label to be more clear – let's change the label 'International_Students' to be '% International'.

In [9]:
world = world.relabeled('International_Students', '% International')
In [10]:
world
Out[10]:
Rank_Char Score_Rank University Country Number_students Numb_students_per_Staff % International Percentage_Female Percentage_Male Teaching Research Citations Industry_Income International_Outlook Score_Result Overall_Ranking
1 1 University of Oxford United Kingdom 20664 11.2 41 46 54 90.5 99.6 98.4 65.5 96.4 95.4 95.40
2 2 California Institute of Technology United States 2240 6.4 30 34 66 92.1 97.2 97.9 88 82.5 94.5 94.50
3 3 University of Cambridge United Kingdom 18978 10.9 37 47 53 91.4 98.7 95.8 59.3 95 94.4 94.40
4 4 Stanford University United States 16135 7.3 23 43 57 92.8 96.4 99.9 66.2 79.5 94.3 94.30
5 5 Massachusetts Institute of Technology United States 11247 8.6 34 39 61 90.5 92.4 99.5 86.9 89 93.6 93.60
6 6 Princeton University United States 7983 8.1 25 45 55 90.3 96.3 98.8 58.6 81.1 93.2 93.20
7 7 Harvard University United States 20823 9.2 24 49 51 89.2 98.6 99.1 47.3 76.3 93 93.00
8 8 Yale University United States 12402 5.4 20 50 50 92 94.8 97.3 52.4 68.7 91.7 91.70
9 9 University of Chicago United States 13833 5.7 28 46 54 89.1 91.4 96.7 52.7 76 90.2 90.20
10 10 Imperial College London United Kingdom 16760 11.7 56 38 62 84.5 87.6 97 69.9 97.1 89.8 89.80

... (1386 rows omitted)

Then, to compute the number of international students at each school, we take the number of students at each school, multiply by the percentage of international students at each school, and divide by 100.

In [11]:
world.column('Number_students') * world.column('% International') / 100
Out[11]:
array([8472.24,  672.  , 7021.86, ...,  457.08,    0.  ,  185.7 ])

We should probably round the result, since we can't have fractional humans.

In [12]:
num_international = np.round(world.column('Number_students') * world.column('% International') / 100, 0)
num_international
Out[12]:
array([8472.,  672., 7022., ...,  457.,    0.,  186.])

We can add this as a column to our table:

In [13]:
world = world.with_columns(
    '# International', num_international
)

And we can sort by this column, while also selecting a subset of all columns just to focus on what's relevant:

In [14]:
world.select('University', 'Country', 'Number_students', '% International', '# International') \
     .sort('# International', descending = True)
Out[14]:
University Country Number_students % International # International
University of Melbourne Australia 47385 46 21797
Monash University Australia 52989 39 20666
Al-Azhar University Egypt 342151 6 20529
UNSW Sydney Australia 44336 41 18178
University of Sydney Australia 45111 39 17593
UCL United Kingdom 32665 52 16986
University of British Columbia Canada 52108 32 16675
Eastern Mediterranean University Northern Cyprus 18865 83 15658
University of Toronto Canada 73370 21 15408
University of Manchester United Kingdom 37038 40 14815

... (1386 rows omitted)

This tells us that the University of Melbourne has the most international students, with 21,797. That's larger than many universities!

There are no US universities in the top 10 here. How can we find the universities in the US with the most international students?

Quick Check 1

Fill in the blanks so that the resulting table contains the 15 universities in the US with the most international students, sorted by number of international students in decreasing order.

In [ ]:
# __(a)__ means blank a

# world.select('University', 'Country', 'Number_students', '% International', '# International') \
#      .where(__(a)__, __(b)__) \
#      .sort('# International', __(c)__) \
#      .take(__(d)__)

If you do a quick Google search for "US universities with the most international students", you'll see NYU is usually #1. Cool!

How do the rankings actually work?

Times Higher Education's website tells us the methodology they use to rank universities:

This means they come up with a 'Teaching', 'Research', 'Citations', 'International_Outlook', and 'Industry_Income' score from 0 to 100 for each school, then compute a weighted average according to the above percentages to compute a school's 'Score_Result', which is how the schools are ranked.

Let's confirm this ourselves. First, let's get a subset of the columns since they won't all be relevant here.

In [15]:
scores_only = world.select('Score_Rank', 'University', 'Teaching', 'Research', 'Citations', 'International_Outlook', 'Industry_Income', 'Score_Result')
In [16]:
scores_only
Out[16]:
Score_Rank University Teaching Research Citations International_Outlook Industry_Income Score_Result
1 University of Oxford 90.5 99.6 98.4 96.4 65.5 95.4
2 California Institute of Technology 92.1 97.2 97.9 82.5 88 94.5
3 University of Cambridge 91.4 98.7 95.8 95 59.3 94.4
4 Stanford University 92.8 96.4 99.9 79.5 66.2 94.3
5 Massachusetts Institute of Technology 90.5 92.4 99.5 89 86.9 93.6
6 Princeton University 90.3 96.3 98.8 81.1 58.6 93.2
7 Harvard University 89.2 98.6 99.1 76.3 47.3 93
8 Yale University 92 94.8 97.3 68.7 52.4 91.7
9 University of Chicago 89.1 91.4 96.7 76 52.7 90.2
10 Imperial College London 84.5 87.6 97 97.1 69.9 89.8

... (1386 rows omitted)

The graphic tells us that the weights for each column are:

  • 'Teaching': 0.3
  • 'Research': 0.3
  • 'Citations': 0.3
  • 'International_Outlook': 0.075
  • 'Industry_Income': 0.025

(Remember, to convert from percentage to proportion we divide by 100.)

Let's try and apply this to the school at the very top of the table, University of Oxford.

In [17]:
0.3 * 90.5 + \
0.3 * 99.6 + \
0.3 * 98.4 + \
0.075 * 96.4 + \
0.025 * 65.5
Out[17]:
95.4175

The result, 95.4175, matches what we see in the 'Score_Result' column for University of Oxford.

We can apply the above formula to all rows in our table as well.

In [18]:
score_result_manual_calculation = \
0.3 * scores_only.column('Teaching') + \
0.3 * scores_only.column('Research') + \
0.3 * scores_only.column('Citations') + \
0.075 * scores_only.column('International_Outlook') + \
0.025 * scores_only.column('Industry_Income')
In [19]:
score_result_manual_calculation
Out[19]:
array([95.4175, 94.5475, 94.3775, ..., 11.055 , 10.9625, 10.6875])

To confirm that the results we got match the 'Score_Result' column in scores_only, we can add the above array to our table:

In [20]:
scores_only.with_columns(
    'Score Result Manual', score_result_manual_calculation
)
Out[20]:
Score_Rank University Teaching Research Citations International_Outlook Industry_Income Score_Result Score Result Manual
1 University of Oxford 90.5 99.6 98.4 96.4 65.5 95.4 95.4175
2 California Institute of Technology 92.1 97.2 97.9 82.5 88 94.5 94.5475
3 University of Cambridge 91.4 98.7 95.8 95 59.3 94.4 94.3775
4 Stanford University 92.8 96.4 99.9 79.5 66.2 94.3 94.3475
5 Massachusetts Institute of Technology 90.5 92.4 99.5 89 86.9 93.6 93.5675
6 Princeton University 90.3 96.3 98.8 81.1 58.6 93.2 93.1675
7 Harvard University 89.2 98.6 99.1 76.3 47.3 93 92.975
8 Yale University 92 94.8 97.3 68.7 52.4 91.7 91.6925
9 University of Chicago 89.1 91.4 96.7 76 52.7 90.2 90.1775
10 Imperial College London 84.5 87.6 97 97.1 69.9 89.8 89.76

... (1386 rows omitted)

This shows we've successfully reverse-engineered how the rankings work!

What if we want to change the methodology?

Now that we know how to compute 'Score_Result's using THE's percentages, we can also pick our own percentages if we want to prioritize different components in our ranking.

For instance, we may feel like THE's methodology places too much emphasis on research – together, 'Research' and 'Citations' make up 60% of the overall score.

We could choose to use the following breakdown, which we'll call "Breakdown 1":

  • 'Teaching': 60%
  • 'International_Outlook': 30%
  • 'Industry_Income': 10%
In [21]:
breakdown_1 = 0.6 * scores_only.column('Teaching') \
            + 0.3 * scores_only.column('International_Outlook') \
            + 0.1 * scores_only.column('Industry_Income')

breakdown_1
Out[21]:
array([89.77, 88.81, 89.27, ..., 18.9 , 17.09, 18.39])

This gives us new overall scores for each school; we can add this column to our table and sort by it.

In [22]:
scores_only = scores_only.with_columns(
    'Breakdown 1', breakdown_1
)
In [23]:
scores_only.sort('Breakdown 1', descending = True)
Out[23]:
Score_Rank University Teaching Research Citations International_Outlook Industry_Income Score_Result Breakdown 1
1 University of Oxford 90.5 99.6 98.4 96.4 65.5 95.4 89.77
5 Massachusetts Institute of Technology 90.5 92.4 99.5 89 86.9 93.6 89.69
3 University of Cambridge 91.4 98.7 95.8 95 59.3 94.4 89.27
2 California Institute of Technology 92.1 97.2 97.9 82.5 88 94.5 88.81
10 Imperial College London 84.5 87.6 97 97.1 69.9 89.8 86.82
4 Stanford University 92.8 96.4 99.9 79.5 66.2 94.3 86.15
6 Princeton University 90.3 96.3 98.8 81.1 58.6 93.2 84.37
13 ETH Zurich 81.8 92.8 90.3 98.2 56.8 88.3 84.22
9 University of Chicago 89.1 91.4 96.7 76 52.7 90.2 81.53
7 Harvard University 89.2 98.6 99.1 76.3 47.3 93 81.14

... (1386 rows omitted)

In [24]:
scores_only.sort('Breakdown 1', descending = True).take(23)
Out[24]:
Score_Rank University Teaching Research Citations International_Outlook Industry_Income Score_Result Breakdown 1
13 University of California, Berkeley 83 90.6 99.2 70.4 46.1 88.3 75.53

Note that when we choose this methodology, UC Berkeley is ranked much lower (24th instead of 13th). This is likely due to:

    1. UC Berkeley's exceptionally high 'Research' and 'Citations' scores not being included in the ranking
    1. UC Berkeley's large class sizes giving it a comparatively low 'Teaching' score
    1. UC Berkeley's low score for 'Industry_Income'. This component factors in the amount that the university receives in funding from industrial partners – given that it's a public school it's unsurprising that this amount is low, but also many "wealthy" universities have a relatively low score here too, so it's not clear how much this should matter (see here for more).

Maybe we want to place some emphasis on research, but not as much as was placed in the initial ranking. We could then make "Breakdown 2":

  • 'Teaching': 50%
  • 'Research': 15%
  • 'Citations': 15%
  • 'International_Outlook': 15%
  • 'Industry_Income': 5%

Quick Check 2

Assign breakdown_2 to an array of overall scores for schools calculated according to our Breakdown 2 above, and add it as a column to scores_only with the label 'Breakdown 2'. _Hint: Start by copying our code for breakdown_1, which was:_

breakdown_1 = 0.6 * scores_only.column('Teaching') \
            + 0.3 * scores_only.column('International_Outlook') \
            + 0.1 * scores_only.column('Industry_Income')
In [ ]:
# Answer QC here before running the next cell
In [ ]:
scores_only.sort('Breakdown 2', descending = True)

"Breakdown 2" is much closer to THE's actual breakdown and the ordering here reflects that.

What do you care about in a university? Try your own breakdown!

We should note though that we haven't really thought about how THE comes up with the scores for each of the five categories (or the fact that university rankings have inherent flaws).

Which countries have the most universities in the ranking?

Back to the full world table:

In [25]:
world
Out[25]:
Rank_Char Score_Rank University Country Number_students Numb_students_per_Staff % International Percentage_Female Percentage_Male Teaching Research Citations Industry_Income International_Outlook Score_Result Overall_Ranking # International
1 1 University of Oxford United Kingdom 20664 11.2 41 46 54 90.5 99.6 98.4 65.5 96.4 95.4 95.40 8472
2 2 California Institute of Technology United States 2240 6.4 30 34 66 92.1 97.2 97.9 88 82.5 94.5 94.50 672
3 3 University of Cambridge United Kingdom 18978 10.9 37 47 53 91.4 98.7 95.8 59.3 95 94.4 94.40 7022
4 4 Stanford University United States 16135 7.3 23 43 57 92.8 96.4 99.9 66.2 79.5 94.3 94.30 3711
5 5 Massachusetts Institute of Technology United States 11247 8.6 34 39 61 90.5 92.4 99.5 86.9 89 93.6 93.60 3824
6 6 Princeton University United States 7983 8.1 25 45 55 90.3 96.3 98.8 58.6 81.1 93.2 93.20 1996
7 7 Harvard University United States 20823 9.2 24 49 51 89.2 98.6 99.1 47.3 76.3 93 93.00 4998
8 8 Yale University United States 12402 5.4 20 50 50 92 94.8 97.3 52.4 68.7 91.7 91.70 2480
9 9 University of Chicago United States 13833 5.7 28 46 54 89.1 91.4 96.7 52.7 76 90.2 90.20 3873
10 10 Imperial College London United Kingdom 16760 11.7 56 38 62 84.5 87.6 97 69.9 97.1 89.8 89.80 9386

... (1386 rows omitted)

To determine the number of universities per country, we can group by 'Country':

In [26]:
world.group('Country')
Out[26]:
Country count
Algeria 8
Argentina 4
Australia 35
Austria 11
Bangladesh 1
Belarus 1
Belgium 8
Brazil 46
Brunei Darussalam 1
Bulgaria 1

... (82 rows omitted)

It's a good idea to sort too:

In [27]:
world.group('Country').sort('count', descending = True)
Out[27]:
Country count
United States 172
Japan 110
United Kingdom 100
China 81
India 56
Germany 48
Brazil 46
Italy 45
Spain 45
Iran 40

... (82 rows omitted)

How do we get the number of universities in each country with at least 25 universities on the list?

In [28]:
world.group('Country').where('count', are.above_or_equal_to(25))
Out[28]:
Country count
Australia 35
Brazil 46
Canada 30
China 81
France 38
Germany 48
India 56
Iran 40
Italy 45
Japan 110

... (7 rows omitted)

Run the cell below to see a bar graph of the number of universities in each country above.

In [29]:
world.group('Country').where('count', are.above_or_equal_to(25)).sort('count').barh('Country')