BW #75: Refugees (solution)

This week, we looked at refugees – how many there are, where they come from, and where they go. This was inspired by a conversation I had with a taxi driver on my way to the Prague airport, who expressed mostly positive feelings about what the Czech Republic had done for Ukrainian refugees.

Data and seven questions

This week's data comes from three files, all produced by the World Bank:

  • Refugee population by country or territory of origin
  • Refugee population by country or territory of asylum (i.e., the refugees' destination)
  • Population, total

Here are this week's seven challenges and questions. As always, a link to the Jupyter notebook I used to solve these problems is at the end of this post.

Create a single data frame whose index is made up of country names. The columns will be a multi-index with top-level names "origin", "destination", and "population". The lower level of the multi-index should contain the years; you can remove other columns.

Let's start by loading Pandas:

import pandas as pd

With that in place, we'll first need to load each of the three CSV files into a data frame. We can start with the refugee origin file, using read_csv:

refugee_origin_filename = 'API_SM.POP.REFG.OR_DS2_en_csv_v2_1058274.csv'

refugee_origin_df = (
    pd
    .read_csv(refugee_origin_filename)
)

However, there are several problems with loading the CSV file in this way. First of all, the file has several comment lines at the top, which we want to ignore; the column headers are on line 2 (counting from zero, as Pandas does). We can thus pass header=2, which tells read_csv to treat that line as the header row and skip everything above it. We'll also ask for the Country Name column to be used as the index:

refugee_origin_df = (
    pd
    .read_csv(refugee_origin_filename, 
              header=2,
              index_col='Country Name')
)    

This is good, but we only want the columns containing the years. We can tell Pandas to keep only those columns whose names consist of four digits with the filter method, passing a regular expression via the regex keyword argument and specifying that we're looking at the column names via axis='columns':

refugee_origin_df = (
    pd
    .read_csv(refugee_origin_filename,
              header=2,
              index_col='Country Name')
    .filter(regex=r'^\d+$', axis='columns')
)

The above regular expression can be read as:

  • We want to anchor our match to the start of the string with ^
  • We're looking for one or more decimal digits with \d+
  • We want to anchor our match to the end of the string with $

In other words, we're looking for column names that contain only digits. We could probably have tightened it up by saying \d{4}, for four digits, but this is good enough for our purposes.
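If we did want to tighten it up, the stricter pattern would look like this; on these files it keeps exactly the same columns, since the year names are the only all-digit column names:

refugee_origin_df = (
    pd
    .read_csv(refugee_origin_filename,
              header=2,
              index_col='Country Name')
    .filter(regex=r'^\d{4}$', axis='columns')   # match exactly four digits
)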

But wait: If all of the columns are four-digit strings representing years, we can change them to be integers. If we want, we could run astype(int) on refugee_origin_df.columns, getting back integers, and then assign the result back to refugee_origin_df.columns:

refugee_origin_df.columns = refugee_origin_df.columns.astype(int)

However, it feels weird to use an assignment statement when we've managed to do everything else via method chaining. How can we assign to our data frame's columns from within a method chain?

It's possible with the pipe method, which lets us run a function on our data frame in the middle of a method chain. And the function we'll run? A lambda that takes the data frame we've built up so far in the chain and calls set_axis, which returns a new data frame with the column labels we specify:

refugee_origin_df = (
    pd
    .read_csv(refugee_origin_filename,
              header=2,
              index_col='Country Name')
    .filter(regex=r'^\d+$', axis='columns')
    .pipe(lambda df_: df_.set_axis(df_.columns.astype(int), axis='columns'))
)

The result? refugee_origin_df now contains one row per country (plus some other areas, which we'll deal with in a bit), and one column per year.
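If you want to verify this, a quick look at the shape and the first few column names should do it (as we'll see below, my copies of the files had 266 rows and 64 year columns, covering 1960-2023):

# One row per country/area, one integer column per year
print(refugee_origin_df.shape)        # something like (266, 64)
print(refugee_origin_df.columns[:5])  # Index([1960, 1961, 1962, 1963, 1964], dtype='int64')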

We can perform the same operations on the refugee-destination and population CSV files:

refugee_destination_filename = 'API_SM.POP.REFG_DS2_en_csv_v2_1061114.csv'
population_filename = 'API_SP.POP.TOTL_DS2_en_csv_v2_1114030.csv'

refugee_destination_df = (
    pd
    .read_csv(refugee_destination_filename,
              header=2,
              index_col='Country Name')
    .filter(regex=r'^\d+$', axis='columns')
    .pipe(lambda df_: df_.set_axis(df_.columns.astype(int), axis='columns'))
)

population_df = (
    pd
    .read_csv(population_filename, 
              header=2,
              index_col='Country Name')
    .filter(regex=r'^\d+$', axis='columns')
    .pipe(lambda df_: df_.set_axis(df_.columns.astype(int), axis='columns'))
)
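By the way, since we run the identical pipeline three times, we could factor it into a small helper function. (The name load_wb_csv is just something I made up for this sketch; it isn't part of the original notebook.)

def load_wb_csv(filename):
    # Read a World Bank CSV: headers are on line 2, use the country name
    # as the index, keep only the all-digit (year) columns, and convert
    # those column names to integers
    return (
        pd
        .read_csv(filename,
                  header=2,
                  index_col='Country Name')
        .filter(regex=r'^\d+$', axis='columns')
        .pipe(lambda df_: df_.set_axis(df_.columns.astype(int), axis='columns'))
    )

refugee_origin_df = load_wb_csv(refugee_origin_filename)
refugee_destination_df = load_wb_csv(refugee_destination_filename)
population_df = load_wb_csv(population_filename)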

We now have three data frames, all of which have the same index and columns. But we want to join them together into one big data frame. How can we do that?

Normally, we would use pd.concat to combine multiple data frames into a single, new one. But I asked you to do something a bit different: combine them using a multi-index for the columns, so that we can keep track of which original data frame each column came from.

Fortunately, we can pass pd.concat an additional keys keyword argument, indicating the multi-index name we want to associate with each of the original data frames:

df = pd.concat([population_df, refugee_destination_df, refugee_origin_df],
               axis='columns',
               keys=['population', 'destination', 'origin'])

The result is a data frame with 266 rows (one per country or country group) and 192 columns (64 years from each of the three original data frames).
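We can peek at the columns to confirm that each label is now a (name, year) tuple, with the outer level coming from the keys argument:

print(df.columns[:3])
# Should print something like:
# MultiIndex([('population', 1960),
#             ('population', 1961),
#             ('population', 1962)],
#            )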

Now, the rows still contain more than just countries – but we'll deal with weeding out the non-country names in the next answer.

Which 10 countries accepted the most refugees in 2000 and 2023?

In 2000, where did refugees go? We can get this information by querying our data frame, retrieving only the column under destination and 2000. We can do this by passing a tuple inside of square brackets:

(
    df
    [('destination', 2000)]
)

In general, when you're trying to drill down through a multi-index, you can use a tuple, with one element per level of the multi-index. However, we still need to find the destinations with the greatest numbers in 2000. This requires nlargest which, invoked without an argument, returns the five largest values (nlargest(10) would give us the full top 10):

(
    df
    [('destination', 2000)]
    .nlargest()
)

The good news is that this works. The bad news is that the results we get seem a bit weird:

Country Name
World                         15935134.0
Low & middle income           13376149.0
IDA & IBRD total              12001044.0
Middle income                 11490188.0
Early-demographic dividend     8438167.0
Name: (destination, 2000), dtype: float64

As you can see, our data frame contains rows not just for individual countries, but for groups of countries, too. Which means that if we want to find countries, we'll need to remove any of those groups.

How? It took me a while, to be honest. I ended up writing a regular expression describing the names (and partial names) of the country groups I wanted to ignore; it looks for any of a bunch of words and phrases, and I put it in the variable ignore_pattern. I then got the data frame's index and turned it into a series with to_series, which returned a series whose index and values were identical. Finally, I used loc to keep only those elements that did not match my pattern, using str.contains.

ignore_pattern = r'(?:countries|situations|IDA|IBRD|demographic|OECD|World|Middle East|Sub-Saharan|ern and|Euro|income|Asia|America\b)'

countries = (df
             .index
             .to_series()
             .loc[lambda s_: ~s_.str.contains(ignore_pattern,
                                              regex=True)]
            )

I can now use loc and the countries series to keep only those rows from df that represent actual countries. I do that by adding a loc step to the query:

(
    df
    .loc[lambda df_: df_.index.isin(countries)]
    [('destination', 2000)]
    .nlargest()
)

And now I get much more reasonable results:

Country Name
Pakistan              2001460.0
Iran, Islamic Rep.    1868000.0
Jordan                1610630.0
West Bank and Gaza    1428891.0
Germany                906000.0
Name: (destination, 2000), dtype: float64

So back in 2000, the greatest number of refugees were going to Pakistan, Iran, and Jordan. (Not what I expected or remembered!)

How about in 2023? Here's the (almost identical) query:

(
    df
    .loc[lambda df_: df_.index.isin(countries)]
    [('destination', 2023)]
    .nlargest()
)

The results:

Country Name
Iran, Islamic Rep.    3764517.0
Turkiye               3251127.0
Jordan                3063591.0
Germany               2593007.0
West Bank and Gaza    2482144.0
Name: (destination, 2023), dtype: float64

The numbers, as you can see, are far higher than they were 23 years earlier.
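One final aside: instead of indexing with a tuple, we can pass the same tuple to the xs method, which pulls a cross-section out of the columns. It should give the same result as the query above; I find it a bit more readable:

(
    df
    .loc[lambda df_: df_.index.isin(countries)]
    .xs(('destination', 2023), axis='columns')   # same as df[('destination', 2023)]
    .nlargest()
)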