Code Walkthrough

Module 2, Notebook 1: Sources of Data

Example 1.1 World-bank

This document walks you through the code example 1.1 World-bank in the Jupyter Notebook M2_NB1_SourcesOfData.

The outcome of the first example is a world map onto which you plot, for each country, the number of students and the overall population size. You should already be familiar with how Pandas DataFrames work, as well as with plotting items on a map, from previous notebooks, so the main purpose of this example is to show you how to

  1. Retrieve additional external data, and
  2. Connect the new data to your existing data.

In [2]:
import pandas as pd
from pandas_datareader import data, wb
import numpy as np
import matplotlib
import folium

To start, we import the required libraries. Because we will be working with data sets, we need to import Pandas and NumPy as before. Remember that Pandas is a library for doing data analysis and NumPy is a library for manipulating arrays of data. Matplotlib is used for generating graphs and charts from your data, and Folium is a library for building maps.

The new library that you will not have seen before is pandas_datareader - this is a utility library for Pandas that lets you import data from a number of websites, such as Yahoo Finance, the OECD (Organisation for Economic Co-operation and Development) and the World Bank. You can read more about pandas-datareader here: http://pandas-datareader.readthedocs.io/en/latest/. The benefit of pandas-datareader is that it retrieves the data remotely, so every time you run the code you get the most recently available data.

The line

from pandas_datareader import data, wb

simply imports the parts of the pandas_datareader library that you need in order to retrieve data from the World Bank - we’ll use that to get the latest population figures for the various countries.

In [4]:
df1 = pd.read_csv('../data/grouped_geocoded.csv', index_col=[0])

The first thing to do is to load the existing data that we have - this is the data for all of the students, already geocoded and grouped by the country, city and industry that the students are in.

We can take a look at the structure of the DataFrame and some sample data by calling the head() function on the DataFrame:

In [5]:
df1.head()
Out[5]:
country city industry student_count geocode_input lat long
0 Australia Adelaide Other or Unknown 1 Adelaide, Australia -34.928499 138.600746
1 Australia Bronte Other or Unknown 1 Bronte, Australia -33.905320 151.260690
2 Australia Buddina ICT 1 Buddina, Australia -26.690000 153.134000
3 Australia Carlingford ICT 1 Carlingford, Australia -33.777000 151.048000
4 Australia Glen Waverley ICT 1 Glen Waverley, Australia -37.885652 145.165179

In grouping by country, city and industry, each line is a unique combination of the three. So if there had been four students in the ICT industry living in Carlingford, Australia, then the student_count column for record 3 would have been 4.

In [6]:
df2 = df1.groupby('country').agg({'student_count': [np.sum], 'lat': [np.mean], 'long': [np.mean]}).reset_index()

Because we want to show the student count at a country level (remember above we said we wanted to map students per country), we need to consolidate the count of all the students for a single country into a single row - a separate row for each city and industry for students in Australia is more detail than we need, so we group the data. The line above creates a new DataFrame from df1 by grouping the data on the country field.

When you run groupby() on a DataFrame it doesn't return a new DataFrame automatically; it builds an intermediate grouped object that some other calculation can then be applied to. This is where agg() comes in. It aggregates the data returned by groupby() and returns a DataFrame. What we pass into it is a Python dictionary that maps each column name to a list of functions to run on the data in that column. The dictionary we're passing in to agg() is

{
'student_count': [np.sum], 
'lat': [np.mean], 
'long': [np.mean]
}

Remember earlier we imported numpy as np, so what we have here are three columns from df1 that we want the agg() function to run NumPy operations on. For the student_count column we want the sum (total), and for the lat and long columns (the latitude and longitude, or position, of the student) we want the mean (what some people call the average). By doing this we get the average center point of all the students in each country.

So if you had to write the line

df2 = df1.groupby('country').agg({'student_count': [np.sum], 'lat': [np.mean], 'long': [np.mean]}).reset_index()

in English, it would say: create a new DataFrame called df2 that contains a single line for each country, and for each country give me the total number of students and the average latitude and longitude of all the students in that country.
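If the combination of groupby() and agg() is new to you, here is a minimal, self-contained sketch you can run on its own (the DataFrame and its values are made up purely for illustration):

import numpy as np
import pandas as pd

# Made-up data in the same shape as df1 (illustrative only)
toy = pd.DataFrame({'country': ['Australia', 'Australia', 'Austria'],
                    'student_count': [1, 3, 2],
                    'lat': [-34.9, -33.9, 47.8],
                    'long': [138.6, 151.3, 16.0]})

# Sum the counts and average the positions for each country
toy2 = toy.groupby('country').agg({'student_count': [np.sum],
                                   'lat': [np.mean],
                                   'long': [np.mean]}).reset_index()
print(toy2)
# Australia collapses into a single row with student_count 4 and the
# averaged lat/long; Austria keeps its single row unchanged.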

The reset_index() call takes a multi-level index and moves the index labels back into ordinary columns. Take a look at the two DataFrames below:

df2 before and after reindex

The DataFrame on the left has not yet had the reset_index() function run on it, while the one on the right has. As you can see, the index country has become an ordinary column and a new index (with the rows numbered from 0 up) has been added.

Note: The next line has changed from the original example code to make the later merge less complicated. It does have a knock-on effect on some of the mapping code, but we will point that out later.

In [7]:
df2.columns = df2.columns.droplevel(level=1)

Even though we have shifted the index into a column for the country, you'll notice that there are two levels to the column names - the first level is the name of the original field (lat, student_count and long) and the second level is the aggregate function that we ran on the column (mean, sum and mean). We don't need the multi-level column names, so the line above drops a level from them. If you call droplevel() without any parameters it defaults to removing the first level, but since that is the one with the meaningful information in it, we specify level=1 to drop the second level instead.
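Carrying on with the toy DataFrame from the sketch above, you can inspect the two column levels and the effect of droplevel() directly:

print(toy2.columns)
# A two-level MultiIndex, something like:
# ('country', ''), ('student_count', 'sum'), ('lat', 'mean'), ('long', 'mean')
toy2.columns = toy2.columns.droplevel(level=1)
print(list(toy2.columns))  # ['country', 'student_count', 'lat', 'long']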

At this point we have a DataFrame (df2) that contains a single entry for each country represented by the students on the course, with a column indicating how many students there are for that country and the average position. Let's take a look at it:

In [8]:
df2.head()
Out[8]:
country lat student_count long
0 Australia -34.256125 26 147.042406
1 Austria 47.796835 3 16.027805
2 Belgium 50.807547 6 4.378567
3 Brazil -21.556434 24 -44.808458
4 Canada 46.172621 61 -88.355898
In [9]:
df4 = wb.download(indicator='SP.POP.TOTL',country=['all'],start='2008',end='2016')

Now what we need is the country population data. We imported the wb module from the pandas_datareader library and this is what we will use to retrieve data from the World Bank.

The World Bank publishes thousands of data series, and you specify which one you want to retrieve by using their Development Indicators. The term SP.POP.TOTL used in the line of code above was found by browsing the World Bank site’s Indicators reference: http://data.worldbank.org/indicator - the population is on this page http://data.worldbank.org/indicator/SP.POP.TOTL and the eagle-eyed amongst you will have spotted the SP.POP.TOTL in the web address. If you visit that page you can see that they have data from 1960 through to 2015.

Because the World Bank may not have population data for every country for the latest year, we ask for a range of years so that we can be reasonably sure of having data for all the countries. We specify a start and end date for the data that we want, and pass in a list of countries that we want data for. By specifying a single list item of 'all' we get data for every country.
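As an aside, you do not have to browse the website to find an indicator code - pandas_datareader can search the World Bank catalogue for you. A small sketch (the search string here is just an example):

from pandas_datareader import wb

# Search the indicator catalogue by name; returns a DataFrame of matches
matches = wb.search('population, total')
print(matches[['id', 'name']].head())
# SP.POP.TOTL should appear among the results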

So the line

df4 = wb.download(indicator='SP.POP.TOTL',country=['all'],start='2008',end='2016')

is downloading World Bank population data (indicator='SP.POP.TOTL') for all countries (country=['all']) from 2008 (start='2008') to 2016 (end='2016').

If you take a look at df4 you can see what’s downloaded:

In [10]:
df4.head()
Out[10]:
SP.POP.TOTL
country year
Arab World 2015 392022276.0
2014 384222592.0
2013 376504253.0
2012 368802611.0
2011 361031820.0

Remember that when you work with DataFrames you typically want a single-level index. As you can see, this one has a multi-level index: the first level is the country, and within each country the years of available data are indexed, showing the population total for each year. To shift the multi-level index into columns, use reset_index():

In [11]:
df5 = df4.reset_index()

Let's take a look at the new DataFrame df5 to see the difference:

In [12]:
df5.head()
Out[12]:
country year SP.POP.TOTL
0 Arab World 2015 392022276.0
1 Arab World 2014 384222592.0
2 Arab World 2013 376504253.0
3 Arab World 2012 368802611.0
4 Arab World 2011 361031820.0

Now, df5 has effectively un-grouped the indexes and flattened the data into simple rows and columns that should look familiar to anyone who has worked with spreadsheets.

As you can see we have a single entry for each country for each of the years that we requested. What we actually want is just the largest population number for each of the countries.

The way that we’re going to do that is to create a new Series (you can think of a Series as a DataFrame with only one column) that indicates whether each row holds the maximum for its country or not. We will then use that Series to filter the DataFrame, creating a new DataFrame of just the maximum values.


For a beginner this might be quite complex to work through, so let's look at a simple example first.

Imagine we have a DataFrame that has data about animals. Let's create one manually with 5 animals in it that simply shows their type and name:

In [13]:
animals = pd.DataFrame(
    [
        {
            'type': 'mammal', 
            'name': 'dog'
        }, 
        {
            'type': 'mammal', 
            'name': 'cat'
        }, 
        {
            'type': 'bird', 
            'name': 'parrot'
        }, 
        {
            'type': 'reptile', 
            'name': 'snake'
        }, 
        {
            'type': 'bird', 
            'name': 'owl'
        }
    ]
)

Let's take a look at the DataFrame we just created:

In [14]:
animals.head()
Out[14]:
name type
0 dog mammal
1 cat mammal
2 parrot bird
3 snake reptile
4 owl bird

Now we create a list in the same order as the items in the DataFrame that contains a True or a False depending on whether the animal can fly or not:

In [15]:
can_fly = [False, False, True, False, True]

The 3rd and the 5th elements are True, which corresponds with the 3rd and the 5th elements in the animals DataFrame.

Then, we create a new DataFrame by passing the list of True/False entries into the DataFrame's selection brackets - a so-called boolean mask.

In [16]:
flying_animals = animals[can_fly]

Now we should have a new DataFrame called flying_animals that contains only those animals whose entry in can_fly was True. The image below might help you visualise what we have just done:

flying_animals combination process

Now take a look at what the flying_animals DataFrame looks like:

In [17]:
flying_animals.head()
Out[17]:
name type
2 parrot bird
4 owl bird

So back to the example - we want a new DataFrame containing only the rows of df5 where the population total is the maximum for that country. The first thing we have to do is create our index of True/False values, and to do that we need to know what the maximum population is for each of the countries in df5. The code to do this is doing a number of things, so let's run it and then break it down.

In [18]:
idx = df5.groupby(['country'])['SP.POP.TOTL'].transform(max) == df5['SP.POP.TOTL']

First it groups the data in df5 by the country column. Then it looks at the column in df5 called SP.POP.TOTL and applies a transformation to the values in that column, transforming each one to the max for its group. Remember df5 looks like this:

In [19]:
df5.head()
Out[19]:
country year SP.POP.TOTL
0 Arab World 2015 392022276.0
1 Arab World 2014 384222592.0
2 Arab World 2013 376504253.0
3 Arab World 2012 368802611.0
4 Arab World 2011 361031820.0

So for each of the rows 0 through 4 (if we just work with that data), it will replace the value in SP.POP.TOTL with the maximum value of SP.POP.TOTL across all the rows that have the same country value. You should be able to tell just by looking at it that the max value is 392022276.

The next thing it does is test the equality of the new value (the maximum population value for the country) against the original, untransformed SP.POP.TOTL value for that row of df5. Remember the equality operator == tests whether two values are the same, and returns True if they are, and False if they are not.
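If transform() is unfamiliar, this small made-up sketch shows the key difference from an ordinary aggregation - transform() returns one value per original row rather than one per group, which is exactly what lets us compare it against the original column:

import pandas as pd

toy_pop = pd.DataFrame({'country': ['A', 'A', 'A', 'B'],
                        'SP.POP.TOTL': [10.0, 30.0, 20.0, 5.0]})

# Each row gets its own group's maximum: [30, 30, 30, 5]
maxes = toy_pop.groupby('country')['SP.POP.TOTL'].transform(max)

# Comparing row by row gives a True/False Series we can filter with
mask = maxes == toy_pop['SP.POP.TOTL']
print(mask.tolist())  # [False, True, False, True]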

If we take a look at idx you should have a list of True and False values, which you can compare against df5:

In [20]:
idx.head()
Out[20]:
0     True
1    False
2    False
3    False
4    False
dtype: bool

As you can see, row 0 has the largest (max) of the SP.POP.TOTL values for the country Arab World, so it is the only row in idx that is set to True (for the sample we are looking at). You can now apply idx to df5 to get a new DataFrame of only the rows that had the max SP.POP.TOTL:

In [21]:
df6 = df5[idx]

Let's take a look at df6 to confirm:

In [22]:
df6.head()
Out[22]:
country year SP.POP.TOTL
0 Arab World 2015 3.920223e+08
8 Caribbean small states 2015 7.048966e+06
23 Central Europe and the Baltics 2008 1.051267e+08
24 Early-demographic dividend 2015 3.122703e+09
32 East Asia & Pacific 2015 2.279186e+09

Now we can join the two DataFrames - df2 that contains our student counts per country with the long/lat data, and df6 that has the population counts for the countries:

df2 and df6 side by side

As you can see, the index numbers for the countries do not match across the two DataFrames, so we need to tell Pandas how to join them, and we do that by naming the column that is common to both DataFrames. We’ll use the Pandas merge function:

In [23]:
df7 = pd.merge(df2, df6, on='country', how='left')

The first two unnamed parameters are the DataFrames to merge. The first one specified is referred to as the left (df2) and the second one as the right (df6). The on parameter specifies the column that is common to both, and the how parameter tells the merge function what to do when there are mismatches between the data in the two DataFrames. By specifying how='left' we are saying that it should keep all the records in the left DataFrame (df2) and add the matching data from the right (df6). That means we will drop any records from df6 that do not have a matching key (country) in df2. This is fine, since we only care about those countries for which we have students - the list of countries in df2.
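To see how how='left' treats mismatched keys, here is a tiny sketch with made-up rows (the population figures are taken from the df7 output below; 'Narnia' is obviously fictional):

left = pd.DataFrame({'country': ['Australia', 'Narnia'],
                     'student_count': [26, 2]})
right = pd.DataFrame({'country': ['Australia', 'Austria'],
                      'SP.POP.TOTL': [23781169.0, 8611088.0]})

merged = pd.merge(left, right, on='country', how='left')
print(merged)
# 'Narnia' is kept from the left with NaN for SP.POP.TOTL;
# 'Austria' exists only in the right DataFrame, so it is dropped.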

Let's confirm that by taking a look at df7:

In [24]:
df7.head()
Out[24]:
country lat student_count long year SP.POP.TOTL
0 Australia -34.256125 26 147.042406 2015 23781169.0
1 Austria 47.796835 3 16.027805 2015 8611088.0
2 Belgium 50.807547 6 4.378567 2015 11285721.0
3 Brazil -21.556434 24 -44.808458 2015 207847528.0
4 Canada 46.172621 61 -88.355898 2015 35851774.0

In the original example there was a line of code here:

df8 = df7.rename(index=str, columns={('lat', 'mean'): "lat_mean", ('long', 'mean'): "long_mean", ('SP.POP.TOTL'): "PopulationTotal_Latest_WB",('student_count', 'sum'): "student_count"})

This renamed the merged columns from multi-level to single-level names. However, since we dropped the multi-level column names earlier, before the merge, we no longer need this line of code.
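If you do still want a friendlier column name at this point, a plain rename() with an ordinary string key is all that is needed - a brief sketch (the new name simply mirrors the one used in the original example):

df8 = df7.rename(columns={'SP.POP.TOTL': 'PopulationTotal_Latest_WB'})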

Now that we have the data in a usable format, we can begin mapping it.

First, set up some variables for the center of the map and its zoom factor:

In [25]:
mapc = [0, 30]
zoom = 2

We then create a map object, passing in the parameters and using the Stamen Toner tileset.

In [26]:
map_osm = folium.Map(location=mapc,
                     tiles='Stamen Toner',
                     zoom_start=zoom)

Now we just work through each row of the df7 DataFrame, adding a circle marker positioned at the average position for that country (lat and long) with a radius based on the total population, and then adding a second marker whose radius is based on the student count.

In [27]:
for j in range(len(df7)):
    # Only draw the population circle if the World Bank returned
    # a population figure for this country
    if pd.notnull(df7['SP.POP.TOTL'][j]):
        folium.CircleMarker([df7['lat'][j], df7['long'][j]],
                            radius=df7['SP.POP.TOTL'][j]/500,
                            popup='Population',
                            color='#3186cc',
                            fill_color='#3186cc',
                            ).add_to(map_osm)
    # The student circle is always drawn; the count is scaled up
    # so that it is visible at this zoom level
    folium.CircleMarker([df7['lat'][j], df7['long'][j]],
                        radius=df7['student_count'][j]*10000,
                        popup='Students',
                        color='red',
                        fill_color='red',
                        ).add_to(map_osm)

Now display the map.

In [28]:
map_osm
Out[28]: