BW #81: School (solution)

This week, we celebrated (mourned?) the start of the school year in much of the northern hemisphere, looking at trends in education for various countries and regions in the world. There's no end to the questions that we can ask about education, ranging from the financial investment to the number of students who start school to the number of graduates to the quality of teaching. The good news is that governments and international organizations collect a large amount of data about schools and education, allowing us to look into what techniques and systems do (and don't) work.

I found an interesting dataset from the World Bank (https://worldbank.org), which helps low-income countries to invest in infrastructure and education. However, its data reflects the entire world, not just the countries where it invests, allowing us to identify trends and make comparisons.

Data and six questions

And indeed, we start this week's data journey at the World Bank's educational data page:

https://data.worldbank.org/topic/education

You can explore their educational data in your browser, but we're here to work with it directly in Pandas. To retrieve the data, you can click on the "CSV" link on that page, or just download it directly from

https://api.worldbank.org/v2/en/topic/4?downloadformat=csv

The download contains three CSV files, not one. The first (and main) one contains the data itself, while the two other CSV files contain metadata about the indicators and metadata about the countries. You'll need all three to answer this week's questions.
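
If you'd rather grab and unpack the archive in code, here's a minimal sketch using requests and the standard library's zipfile module. Note that the exact names of the CSV files inside the zip may differ from the ones shown later in this post:

from io import BytesIO
from zipfile import ZipFile

import requests

url = 'https://api.worldbank.org/v2/en/topic/4?downloadformat=csv'
response = requests.get(url)
response.raise_for_status()

with ZipFile(BytesIO(response.content)) as z:
    print(z.namelist())    # show the three CSV files in the archive
    z.extractall()         # extract them into the current directory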

I gave you six questions and tasks for this week. Here are my solutions; a link to the Jupyter notebook that I used is at the bottom of this post:

Import the main file into a data frame. Set its index to be the "Country Code" column, drop the "Unnamed: 68" column, and keep only those rows where "Indicator Code" starts with "SE.".

Before we do anything else, let's load up Pandas:

import pandas as pd

In theory, we can just use read_csv to read a CSV file into a data frame:

main_filename = 'API_4_DS2_en_csv_v2_3434744.csv'

df = (pd
      .read_csv(main_filename)
     )

However, if we try to do it this way, we'll get an error message. That's because before the data starts, the file contains some comments. The CSV parser assumes that those comments are data, infers (incorrectly) how many columns the file contains, and then gives us an error message when it encounters the actual data.

The solution is to tell read_csv that it should skip several lines, and that the headers for the data actually start on the third line of the file – which we describe as "line 2" when we start counting with 0:

main_filename = 'API_4_DS2_en_csv_v2_3434744.csv'

df = (pd
      .read_csv(main_filename, header=2)
     )
     

This is fine, except that we now have the entire data set, when we're really interested in education-specific indicators. Those indicators' codes all start with the string "SE.", so I asked you to keep only the rows with such indicators.

We can do that by using the str accessor to run a string method (str.startswith) on each value in the "Indicator Code" column, getting back a boolean value (True or False) for each one. Feeding that boolean series into a combination of loc and lambda returns only the matching rows:

main_filename = 'API_4_DS2_en_csv_v2_3434744.csv'

df = (pd
      .read_csv(main_filename, header=2)
      .loc[lambda df_: df_['Indicator Code'].str.startswith('SE.')]
     )
    

People often ask why I use df_ as the variable in my lambda expression, rather than the more standard df. I always say that there are two reasons: First, so that we don't get confused between the global variable df and the local variable (i.e., parameter) df_, which only exists in the context of the running function.

The second reason is that when we chain our methods in this way, each method runs on the result of the previous one. We're purposely not assigning each intermediate step to a variable; the whole point is that we'll run a number of methods, each modifying the data frame in some way, until the final method runs and we assign or view the final result. So even if we wanted to refer to the global data frame from within the lambda, there isn't any variable to refer to.

That's even truer in this case, where df isn't defined until the final method runs. We're running .loc on the result of read_csv, before any variable has been defined or is available.

The result is a data frame with 39,634 rows and 69 columns. Why so many? Each row represents a single measure for a single country or region. Aside from the columns indicating which country/region and indicator we're talking about, we have columns for each year during which the World Bank collected data, starting in 1960. There isn't data for every country/region for each year; we'll remove some of that missing data in just a bit.
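
For instance, here's a quick way to see how sparse those year columns are, counting the non-null values in each one. This is just a sketch, assuming df is the data frame we built above:

# The year columns are the ones whose names consist only of digits;
# counting non-null values per column shows how sparse each year is
year_columns = [c for c in df.columns if c.isdigit()]
print(df[year_columns].notna().sum())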

Import the two metadata files into data frames, and join them together with the main data frame. Remove any columns that start with the word "Unnamed". Then make the country code into the index.

The data itself is interesting, but the two metadata files allow us to dig even deeper. We'll thus join the three CSV files together into a single, large data frame. (Note that we'll only really use one of the files to answer our questions; part of the reason for me posing this question was to give you more practice with joins.)

Pandas has two methods that we can use to join data frames together:

  • join, which uses the index to combine them, much as we would in SQL, and
  • merge, which lets us join data frames using any columns we want, not just the index.

The difference between the two isn't that great, given that we can use any column in a data frame as its index.
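
To see that equivalence, here's a tiny sketch using two hypothetical data frames that share a "Country Code" column:

# Two hypothetical data frames sharing a "Country Code" column
left_df = pd.DataFrame({'Country Code': ['USA', 'FRA'], 'x': [1, 2]})
right_df = pd.DataFrame({'Country Code': ['USA', 'FRA'], 'y': [3, 4]})

# merge matches on the named column...
merged = left_df.merge(right_df, on='Country Code')

# ...while join matches on the index, which set_index gives us
joined = left_df.set_index('Country Code').join(right_df.set_index('Country Code'))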

First, I decided to combine df with the country metadata. I loaded the country metadata into a data frame:

country_filename = 'Metadata_Country_API_4_DS2_en_csv_v2_3434744.csv'

(
    pd
    .read_csv(country_filename)
)

But I don't really need this data frame on its own; the only reason I'm reading it into Pandas at all is to combine it with df. Both df and the country metadata have a Country Code column, so I can invoke merge on that, assigning the resulting data frame to df:

country_filename = 'Metadata_Country_API_4_DS2_en_csv_v2_3434744.csv'

df = (
    pd
    .read_csv(country_filename)
    .merge(df, on='Country Code')
)

df now contains all of the columns from both the full data set and the country-related metadata. However, we see that the new df has 149 fewer rows than we had in the original df. This is because the default behavior is to perform an "inner join," only producing a result row where both input data frames match up. If a country code exists in one data frame, but not the other, then there is no match, and no resulting row.

If we had used an "outer join," then we would have the same number of rows as in the input, with NaN values wherever there wasn't a match.
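
Here's a sketch of what that outer join would look like, using original_df as a stand-in name for the pre-merge data frame from the first question. Passing indicator=True adds a _merge column telling us where each row came from:

# A sketch: how='outer' keeps unmatched rows from both sides, and
# indicator=True adds a "_merge" column with the values 'both',
# 'left_only', or 'right_only' for each row
outer = (
    pd
    .read_csv(country_filename)
    .merge(original_df, on='Country Code', how='outer', indicator=True)
)
print(outer['_merge'].value_counts())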

By the way, what country appeared in one but not the other? I used Python's set class to find out:

set(
    pd
    .read_csv(country_filename)
    ['Country Code']
) ^ set(df['Country Code'])

It returned a single item, INX, which appears in the main data file. The country name associated with INX is... "Not classified." I'm just fine leaving it out of my final data frame, and sticking with the inner join.

We're not quite done, though: we also need to merge in the indicator metadata. This time, we can't simply specify the name of the column that's common to both df and the indicator CSV file, because they use slightly different names: df uses Indicator Code, and the indicator metadata file uses INDICATOR_CODE. Fortunately, we can tell merge the names of the columns we want to use with the left_on and right_on keyword arguments:

indicator_filename = 'Metadata_Indicator_API_4_DS2_en_csv_v2_3434744.csv'

df = (
    pd
    .read_csv(indicator_filename)
    .merge(df, left_on='INDICATOR_CODE', right_on='Indicator Code')
)

If it's not obvious which one is "left" and which one is "right" here, remember that we're invoking the merge method on the indicator metadata's data frame, which makes it the "left" data frame. The right data frame is df, the data frame we created before.

The final data frame still has 39,485 rows, which means that matches were found for all of the indicator codes.
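
If we wanted Pandas to enforce that expectation for us, merge's validate keyword can do it. Here's a sketch of how the merge above might look with it:

# A sketch: validate='one_to_many' makes merge raise MergeError if any
# INDICATOR_CODE appears more than once in the metadata (left) frame
df = (
    pd
    .read_csv(indicator_filename)
    .merge(df,
           left_on='INDICATOR_CODE', right_on='Indicator Code',
           validate='one_to_many')
)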

Could we do all of this in a single, chained method call? Yes!

df = (pd
      .read_csv(main_filename, header=2)
      .loc[lambda df_: df_['Indicator Code'].str.startswith('SE.')]
      .merge(pd.read_csv(country_filename), 
             on='Country Code')
      .merge(pd.read_csv(indicator_filename), 
             left_on='Indicator Code', right_on='INDICATOR_CODE')
     )

The biggest change here is that I'm calling merge on the original df (rather than on the country data frame), and then on the result of that merge (rather than on the indicator data frame). This means that the notions of "left" and "right" are swapped, so I have to switch the values in left_on and right_on. But other than that, this works just fine, and we get the same result.

Finally, I asked you to remove all of the columns that start with "Unnamed". We can do that by invoking df.drop, passing a list of the columns we want to remove (and specifying axis='columns').

Then I used set_index to make the country code the index of the data frame:

df = (pd
      .read_csv(main_filename, header=2)
      .loc[lambda df_: df_['Indicator Code'].str.startswith('SE.')]
      .merge(pd.read_csv(country_filename), 
             on='Country Code')
      .merge(pd.read_csv(indicator_filename), 
             left_on='Indicator Code', right_on='INDICATOR_CODE')
      .drop(['Unnamed: 68', 'Unnamed: 5', 'Unnamed: 4'],
            axis='columns')
      .set_index('Country Code')
     )
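
By the way, if we didn't want to name each "Unnamed" column explicitly in the call to drop, we could instead keep only the columns whose names don't start with that prefix. Here's a sketch of that variation:

# A sketch: keep only columns whose names don't start with "Unnamed",
# rather than listing each one by hand
df = df.loc[:, ~df.columns.str.startswith('Unnamed')]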