from datascience import *
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
Table.interactive_plots()
Note: We're not going to be able to work through this entire notebook in lecture; you should definitely review whatever we don't get a chance to finish.
Our dataset comes from Times Higher Education (THE)'s World University Rankings 2020. These are slightly outdated as there is a 2021 ranking now, but the data is still relevant.
world = Table.read_table('data/World_University_Rank_2020.csv')
world
It's always good to check how many schools we're dealing with:
world.num_rows
Some columns ('Number_students'
, 'International_Students'
, 'Percentage_Female'
, 'Percentage_Male'
) have commas and percentage symbols, meaning they can't be stored as integers. Let's clean them.
# Notice how we use apply here!
def remove_symbol(s):
return int(s.replace('%', '').replace(',', ''))
# Remember, the result of calling apply is an array
world.apply(remove_symbol, 'Number_students')
world = world.with_columns(
'Number_students', world.apply(remove_symbol, 'Number_students'),
'International_Students', world.apply(remove_symbol, 'International_Students'),
'Percentage_Female', world.apply(remove_symbol, 'Percentage_Female'),
'Percentage_Male', world.apply(remove_symbol, 'Percentage_Male')
)
Now we can sort by any numeric column we want.
world.sort('Percentage_Female')
It seems like the above schools didn't report their sex breakdown, since 0% is the listed percentage of female and male students.
Let's start asking questions.
We have an 'International_Students'
column, but that tells us the percentage of international students at each school. Let's update that label to be more clear – let's change the label 'International_Students'
to be '% International'
.
world = world.relabeled('International_Students', '% International')
world
Then, to compute the number of international students at each school, we take the number of students at each school, multiply by the percentage of international students at each school, and divide by 100.
world.column('Number_students') * world.column('% International') / 100
We should probably round the result, since we can't have fractional humans.
num_international = np.round(world.column('Number_students') * world.column('% International') / 100, 0)
num_international
We can add this as a column to our table:
world = world.with_columns(
'# International', num_international
)
And we can sort by this column, while also selecting a subset of all columns just to focus on what's relevant:
world.select('University', 'Country', 'Number_students', '% International', '# International') \
.sort('# International', descending = True)
This tells us that the University of Melbourne has the most international students, with 21,797. That's larger than many universities!
There are no US universities in the top 10 here. How can we find the universities in the US with the most international students?
Fill in the blanks so that the resulting table contains the 15 universities in the US with the most international students, sorted by number of international students in decreasing order.
# __(a)__ means blank a
# world.select('University', 'Country', 'Number_students', '% International', '# International') \
# .where(__(a)__, __(b)__) \
# .sort('# International', __(c)__) \
# .take(__(d)__)
If you do a quick Google search for "US universities with the most international students", you'll see NYU is usually #1. Cool!
Times Higher Education's website tells us the methodology they use to rank universities:
This means they come up with a 'Teaching'
, 'Research'
, 'Citations'
, 'International_Outlook'
, and 'Industry_Income'
score from 0 to 100 for each school, then compute a weighted average according to the above percentages to compute a school's 'Score_Result'
, which is how the schools are ranked.
Let's confirm this ourselves. First, let's get a subset of the columns since they won't all be relevant here.
scores_only = world.select('Score_Rank', 'University', 'Teaching', 'Research', 'Citations', 'International_Outlook', 'Industry_Income', 'Score_Result')
scores_only
The graphic tells us that the weights for each column are:
'Teaching'
: 0.3'Research'
: 0.3'Citations'
: 0.3'International_Outlook'
: 0.075'Industry_Income'
: 0.025(Remember, to convert from percentage to proportion we divide by 100.)
Let's try and apply this to the school at the very top of the table, University of Oxford.
0.3 * 90.5 + \
0.3 * 99.6 + \
0.3 * 98.4 + \
0.075 * 96.4 + \
0.025 * 65.5
The result, 95.4175, matches what we see in the 'Score_Result'
column for University of Oxford.
We can apply the above formula to all rows in our table as well.
score_result_manual_calculation = \
0.3 * scores_only.column('Teaching') + \
0.3 * scores_only.column('Research') + \
0.3 * scores_only.column('Citations') + \
0.075 * scores_only.column('International_Outlook') + \
0.025 * scores_only.column('Industry_Income')
score_result_manual_calculation
To confirm that the results we got match the 'Score_Result'
column in scores_only
, we can add the above array to our table:
scores_only.with_columns(
'Score Result Manual', score_result_manual_calculation
)
This shows we've successfully reverse-engineered how the rankings work!
Now that we know how to compute 'Score_Result'
s using THE's percentages, we can also pick our own percentages if we want to prioritize different components in our ranking.
For instance, we may feel like THE's methodology places too much emphasis on research – together, 'Research'
and 'Citations'
make up 60% of the overall score.
We could choose to use the following breakdown, which we'll call "Breakdown 1":
'Teaching'
: 60%'International_Outlook'
: 30%'Industry_Income'
: 10%breakdown_1 = 0.6 * scores_only.column('Teaching') \
+ 0.3 * scores_only.column('International_Outlook') \
+ 0.1 * scores_only.column('Industry_Income')
breakdown_1
This gives us new overall scores for each school; we can add this column to our table and sort by it.
scores_only = scores_only.with_columns(
'Breakdown 1', breakdown_1
)
scores_only.sort('Breakdown 1', descending = True)
scores_only.sort('Breakdown 1', descending = True).take(23)
Note that when we choose this methodology, UC Berkeley is ranked much lower (24th instead of 13th). This is likely due to:
'Research'
and 'Citations'
scores not being included in the ranking'Teaching'
score'Industry_Income'
. This component factors in the amount that the university receives in funding from industrial partners – given that it's a public school it's unsurprising that this amount is low, but also many "wealthy" universities have a relatively low score here too, so it's not clear how much this should matter (see here for more).Maybe we want to place some emphasis on research, but not as much as was placed in the initial ranking. We could then make "Breakdown 2":
'Teaching'
: 50%'Research'
: 15%'Citations'
: 15%'International_Outlook'
: 15%'Industry_Income'
: 5%Assign breakdown_2
to an array of overall scores for schools calculated according to our Breakdown 2 above, and add it as a column to scores_only
with the label 'Breakdown 2'
. _Hint: Start by copying our code for breakdown_1
, which was:_
breakdown_1 = 0.6 * scores_only.column('Teaching') \
+ 0.3 * scores_only.column('International_Outlook') \
+ 0.1 * scores_only.column('Industry_Income')
# Answer QC here before running the next cell
scores_only.sort('Breakdown 2', descending = True)
"Breakdown 2" is much closer to THE's actual breakdown and the ordering here reflects that.
What do you care about in a university? Try your own breakdown!
We should note though that we haven't really thought about how THE comes up with the scores for each of the five categories (or the fact that university rankings have inherent flaws).
Back to the full world
table:
world
To determine the number of universities per country, we can group by 'Country'
:
world.group('Country')
It's a good idea to sort too:
world.group('Country').sort('count', descending = True)
How do we get the number of universities in each country with at least 25 universities on the list?
world.group('Country').where('count', are.above_or_equal_to(25))
Run the cell below to see a bar graph of the number of universities in each country above.
world.group('Country').where('count', are.above_or_equal_to(25)).sort('count').barh('Country')