This document walks you through code example 1.1 (World Bank) in the Jupyter Notebook M2_NB1_SourcesOfData.
The outcome of the first example is a world map onto which you plot the number of students from each country and the overall population size of that country. You should already be familiar with how Pandas DataFrames work, as well as with plotting items on a map from previous notebooks, so the main purpose of this example is to show you how to retrieve data from a remote source - in this case the World Bank - using pandas_datareader.
import pandas as pd
from pandas_datareader import data, wb
import numpy as np
import matplotlib
import folium
To start, we import the required libraries. Because we will be working with data sets, we need to import Pandas and NumPy as before. Remember, Pandas is a library for data analysis and NumPy is a library for manipulating arrays of data. Matplotlib is used for generating graphs and charts from your data, and Folium is a library for building maps.
The new library that you will not have seen before is pandas_datareader - this is a utility library for Pandas that lets you import data from a number of websites such as Yahoo Finance, the OECD (Organisation for Economic Co-operation and Development) and the World Bank. You can read more about pandas-datareader here: http://pandas-datareader.readthedocs.io/en/latest/. The benefit of pandas-datareader is that it retrieves the data remotely, so whenever you load it you will always get the most recently available data.
The line
from pandas_datareader import data, wb
simply imports the parts of the pandas_datareader library that you need in order to retrieve data from the World Bank - we’ll use that to get the latest population figures for the various countries.
df1 = pd.read_csv('../data/grouped_geocoded.csv',index_col=[0])
The first thing to do is to load the existing data that we have - the data for all of the students, which has already been geocoded and grouped by country, city and industry.
We can take a look at the structure of the DataFrame and some sample data by calling the head() function on the DataFrame:
df1.head()
In grouping by country, city and industry, each line is a unique combination of the three. So if there were four students in the ICT industry that lived in Carlingford, Australia, then the student_count column for record 3 would have been 4.
df2 = df1.groupby('country').agg({'student_count': [np.sum], 'lat': [np.mean], 'long': [np.mean]}).reset_index()
Because we want to show the student count at a country level (remember, above we said we wanted to map students per country), we need to consolidate the count of all the students for a single country into a single row - a bunch of rows for each of the cities and industries for students in Australia is more detail than we need, so we group the data. The line above creates a new DataFrame from df1 by grouping the data on the country field. When you run groupby() on a DataFrame it doesn't return a new DataFrame automatically; it builds an intermediate grouped object that can then have some other calculation applied to it. This is where agg() comes in: it aggregates the data returned by groupby() and returns a DataFrame. What we pass into it is a Python dictionary object that maps column names to the function (or list of functions) to run on the data in that column. The dictionary we're passing in to agg() is
{
'student_count': [np.sum],
'lat': [np.mean],
'long': [np.mean]
}
Remember, earlier we imported numpy as np, so what we have here is three columns from df1 that we want the agg() function to run NumPy operations on. For the student_count column we want the sum (total), and for the lat and long columns (the latitude and longitude, or position, of the student) we want the mean (what some people call the average). By doing this we will get the average center point for all the students in that country.
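If you would like to see what agg() does on something smaller, here is a minimal sketch using a made-up DataFrame (the countries and numbers are invented purely for illustration):
import numpy as np
import pandas as pd
# Two made-up cities in Australia and one in New Zealand
toy = pd.DataFrame({'country': ['Australia', 'Australia', 'New Zealand'],
                    'student_count': [4, 6, 3],
                    'lat': [-33.8, -37.8, -36.8],
                    'long': [151.2, 144.9, 174.8]})
# Sum the student counts and average the positions per country
print(toy.groupby('country').agg({'student_count': [np.sum],
                                  'lat': [np.mean],
                                  'long': [np.mean]}))
#             student_count   lat    long
#                       sum  mean    mean
# country
# Australia              10 -35.8  148.05
# New Zealand             3 -36.8  174.80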
So if you had to write the line
df2 = df1.groupby('country').agg({'student_count': [np.sum], 'lat': [np.mean], 'long': [np.mean]}).reset_index()
in English, what it's saying is: create a new DataFrame called df2 that contains a single line for each country, and for each country give me the total number of students and the average latitude and longitude of all the students in that country.
The reset_index() takes the index (here, the country we grouped on) and moves it back into an ordinary column, adding a fresh numeric index in its place. Take a look at the two DataFrames below:
The DataFrame on the left has not yet had the reset_index() function run on it, while the one on the right has. As you can see, the index country has shifted to become a column, and a new index (with the rows numbered from 0 up) has been added.
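As a minimal sketch of the effect (again with made-up data), note how the group key moves from the index into an ordinary column:
import pandas as pd
toy = pd.DataFrame({'country': ['Australia', 'Australia', 'New Zealand'],
                    'student_count': [4, 6, 3]})
grouped = toy.groupby('country')['student_count'].sum()
print(grouped.index.tolist())   # ['Australia', 'New Zealand'] - country is the index
flat = grouped.reset_index()
print(flat.columns.tolist())    # ['country', 'student_count'] - country is now a column
print(flat.index.tolist())      # [0, 1] - a fresh numeric index has been added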
Note: The next line has been changed from the original example code to make the later merge less complicated. It does have a knock-on effect on some of the mapping code, but we will point that out later.
df2.columns = df2.columns.droplevel(level=1)
Even though we have shifted the index into a column for the country, you'll notice that there are two levels to the column names - the first level is the name of the original field (lat, student_count and long) and the second level is the aggregate function that we ran on the column (mean, sum and mean). We don't need the multiple levels of column names, so the line above drops a level from the column names. If you call the function without any parameters it defaults to removing the first column name row, but since that is the one that has meaningful information in it, we specify which level to drop.
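Here is a minimal sketch of that flattening step, again with made-up data, showing the column names before and after droplevel(level=1):
import numpy as np
import pandas as pd
toy = pd.DataFrame({'country': ['Australia', 'Australia', 'New Zealand'],
                    'student_count': [4, 6, 3]})
grouped = toy.groupby('country').agg({'student_count': [np.sum]}).reset_index()
print(grouped.columns.tolist())  # [('country', ''), ('student_count', 'sum')]
# Drop the second level (the aggregate function names)
grouped.columns = grouped.columns.droplevel(level=1)
print(grouped.columns.tolist())  # ['country', 'student_count']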
At this point we have a DataFrame (df2) that contains a single entry for each country represented by the students on the course, with a column indicating how many students there are for that country and the average position. Let's take a look at it:
df2.head()
df4 = wb.download(indicator='SP.POP.TOTL',country=['all'],start='2008',end='2016')
Now what we need is the country population data. We imported the wb module from the pandas_datareader library, and this is what we will use to retrieve data from the World Bank.
The World Bank contains thousands of data series, and you specify which data you want to retrieve by using their Development Indicators. The term SP.POP.TOTL that is used in the line of code above was found by browsing the World Bank site's Indicators reference: http://data.worldbank.org/indicator - the population is on this page: http://data.worldbank.org/indicator/SP.POP.TOTL, and the eagle-eyed amongst you will have spotted the SP.POP.TOTL in the web address. If you visit that page you can see that they have data from 1960 through to 2015. Because the World Bank may not have population data for every country for the latest year, we are going to ask for a range of years, and can then be reasonably sure that we will have data for all the countries. We specify a start and end date for the data that we want, and pass in a list of countries that we want data for. By specifying a single list item of 'all' we will get data for all the countries.
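If you would rather not browse the website, pandas-datareader also provides a search function over the World Bank indicator catalogue. A minimal sketch (this needs an internet connection, and the exact matches returned depend on the World Bank's current catalogue):
from pandas_datareader import wb
# Search indicator names for the phrase 'total population'
matches = wb.search('total population')
print(matches[['id', 'name']].head())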
So the line
df4 = wb.download(indicator='SP.POP.TOTL',country=['all'],start='2008',end='2016')
is downloading World Bank population data (indicator='SP.POP.TOTL') for all countries (country=['all']) from 2008 (start='2008') to 2016 (end='2016').
If you take a look at df4 you can see what's been downloaded:
df4.head()
Remember, when you work with DataFrames you will typically want a single-level index - as you can see, this one has a multi-level index: the first index is the country, and then for each country it has indexed the years of available data, showing the population total for each year. To shift the multi-level index into columns, use reset_index():
df5 = df4.reset_index()
Let's take a look at the new DataFrame df5 to see the difference:
df5.head()
Now, df5 has effectively un-grouped the indexes and flattened the data into simple rows and columns that should look familiar to anyone who has worked with spreadsheets.
As you can see we have a single entry for each country for each of the years that we requested. What we actually want is just the largest population number for each of the countries.
The way that we're going to do that is to create a new Series (you can think of a Series as a DataFrame with only 1 column) that indicates whether each row holds the max or not. We will then use that to filter the DataFrame, creating a new DataFrame of just the maximum values.
For a beginner this might be quite complex to work through, so let's look at a simple example first.
Imagine we have a DataFrame that has data about animals. Let's create one manually with 5 animals in it that simply shows their type and name:
animals = pd.DataFrame(
[
{
'type': 'mammal',
'name': 'dog'
},
{
'type': 'mammal',
'name': 'cat'
},
{
'type': 'bird',
'name': 'parrot'
},
{
'type': 'reptile',
'name': 'snake'
},
{
'type': 'bird',
'name': 'owl'
}
]
)
Let's take a look at the DataFrame we just created:
animals.head()
Now we create a list, in the same order as the items in the DataFrame, containing a True or a False depending on whether the animal can fly or not:
can_fly = [False, False, True, False, True]
The 3rd and the 5th elements are True, which corresponds with the 3rd and the 5th elements in the animals DataFrame.
Then, we create a new DataFrame by using the list of True/False entries as a filter (a boolean mask) on the DataFrame:
flying_animals = animals[can_fly]
Now we should have a new DataFrame called flying_animals that contains only those animals whose corresponding entry in can_fly was True. The image below might help you visualise what we have just done:
Now take a look at what the flying_animals DataFrame looks like:
flying_animals.head()
So back to the example - we want a new DataFrame containing only the rows of df5 where the population total is the maximum for that country. So the first thing we have to do is create our index of True/False values. To do that we need to know what the maximum population is for each of the countries in df5. The code to do this is doing a number of things, so let's run it and then we'll break it down.
idx = df5.groupby(['country'])['SP.POP.TOTL'].transform(max) == df5['SP.POP.TOTL']
First it is grouping the data in df5 by the country column. Then, it's looking at the column in df5 called SP.POP.TOTL and applying a transformation to the value in that column, transforming it to the max for its group. Remember, df5 looks like this:
df5.head()
So for each of the rows 0 through to 4 (if we just work with that data), it will change the value in SP.POP.TOTL to the maximum value of SP.POP.TOTL for all the rows that have the same country value. You should be able to tell just by looking at it that the max value is 392022276.
The next thing it is doing is testing the equality of the new value (the maximum population value for the country) against the untransformed SP.POP.TOTL value for that record in df5. Remember, the equality operator == tests whether two values are the same, and returns True if they are, and False if they are not.
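If transform() is new to you, here is a minimal sketch with made-up numbers showing how it broadcasts each group's maximum back onto every row of that group, and how comparing the result with the original column produces the True/False mask:
import pandas as pd
toy = pd.DataFrame({'country': ['A', 'A', 'B', 'B'],
                    'SP.POP.TOTL': [10, 30, 5, 7]})
# Replace every value with the maximum for its country
group_max = toy.groupby('country')['SP.POP.TOTL'].transform(max)
print(group_max.tolist())                          # [30, 30, 7, 7]
# Compare against the original values to flag the max rows
print((group_max == toy['SP.POP.TOTL']).tolist())  # [False, True, False, True]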
If we take a look at idx you should see a list of True and False values, which you can compare against df5:
idx.head()
As you can see, row 0 has the largest (max) of the SP.POP.TOTL values for the country Arab World, so it is the only row in idx that is set to True (for the sample we are looking at). You can now apply idx to df5 to get a new DataFrame of only the rows that had the max SP.POP.TOTL:
df6 = df5[idx]
Let's take a look at df6 to confirm:
df6.head()
Now we can join the two DataFrames - df2, which contains our student counts per country with the long/lat data, and df6, which has the population counts for the countries.
As you can see, the index numbers do not match for the countries across the two DataFrames, so we need to specify how to join the two, and we do that by specifying the column name that is common to both DataFrames. We'll use the Pandas merge function:
df7 = pd.merge(df2, df6, on='country', how='left')
The first two unnamed parameters are the DataFrames to merge. The first one specified is referred to as the left (df2) and the second one as the right (df6). The on parameter specifies the column that is common, and the how parameter tells the merge function what to do when there are mismatches between the data in the two DataFrames. By specifying left we are saying that it should use all the records in the left DataFrame (df2) and add the data from the right (df6). That means that we will drop any records from df6 that do not have matching keys (country) in df2. This is fine, since we only care about those countries for which we have students, which is the list of countries in df2.
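To make the how='left' behaviour concrete, here is a minimal sketch with made-up rows. 'New Zealand' appears only in the left DataFrame, so it is kept with a missing population, while 'United Kingdom' appears only in the right DataFrame and is dropped:
import pandas as pd
left = pd.DataFrame({'country': ['Australia', 'New Zealand'],
                     'student_count': [10, 3]})
right = pd.DataFrame({'country': ['Australia', 'United Kingdom'],
                      'SP.POP.TOTL': [25000000, 65000000]})
print(pd.merge(left, right, on='country', how='left'))
#        country  student_count  SP.POP.TOTL
# 0    Australia             10   25000000.0
# 1  New Zealand              3          NaN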
Let's confirm that by taking a look at df7:
df7.head()
In the original example there was a line of code here:
df8 = df7.rename(index=str, columns={('lat', 'mean'): "lat_mean", ('long', 'mean'): "long_mean", ('SP.POP.TOTL'): "PopulationTotal_Latest_WB", ('student_count', 'sum'): "student_count"})
This renamed the merged column names from being multi-level to single level; however, since we dropped the multi-level column names earlier, before the merge, we no longer need this line of code.
Now that we have the data in a usable format, we can begin mapping it.
First, set up some variables for the center of the map and its zoom factor:
mapc = [0, 30]
zoom = 2
We then create a map object, passing in the parameters and using the Stamen Toner tileset:
map_osm = folium.Map(location=mapc,
tiles='Stamen Toner',
zoom_start=zoom)
Now we simply work through each line of the df7 DataFrame, adding a circular marker positioned at the average position for that country (lat and long) with a radius based on the total population, and then adding a second marker whose radius is based on the student count.
for j in range(len(df7)):
    # Only draw the population circle if we have World Bank data for this country
    if pd.notnull(df7['SP.POP.TOTL'][j]):
        folium.CircleMarker([df7['lat'][j], df7['long'][j]],
                            radius=df7['SP.POP.TOTL'][j]/500,
                            popup='Population',
                            color='#3186cc',
                            fill_color='#3186cc',
                            ).add_to(map_osm)
    # Draw the student-count circle for every country
    folium.CircleMarker([df7['lat'][j], df7['long'][j]],
                        radius=df7['student_count'][j]*10000,
                        popup='Students',
                        color='red',
                        fill_color='red',
                        ).add_to(map_osm)
Now display the map.
map_osm
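If the map does not render inline (for example, when running the code outside of Jupyter), you can save it to an HTML file and open that in a browser instead; the filename here is just an example:
map_osm.save('students_map.html')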