BW #20: World inflation (solution)
For the first time in decades, people are talking about inflation. This week, we compare different inflation measures across time and various countries.
This week, we’re looking at the World Bank’s database of inflation information, comparing inflation levels in different countries and as measured in different ways. As I wrote yesterday, inflation is something that we have to measure, and there are different perspectives on how and what we measure. Should we include food prices? And if so, which food? Which brands? In which locations? It quickly gets very complicated.
I should add that Planet Money’s “Indicator” podcast followed a government economist who was checking and comparing prices. Have a listen here: https://www.npr.org/2021/12/29/1068853249/how-do-you-measure-inflation-indicator-favorite
From what I understand, the World Bank’s inflation database collects data from various countries, on the assumption that we can trust those countries’ measurements. (Whether we can is another question entirely!) The Excel document contains a number of different inflation measurements, among them:
- Headline consumer price index
- Food price index
- Energy price index
- Official core consumer price index, the main measure of inflation in most countries
- Producer price index
Each of these is measured on a monthly, quarterly, and annual basis. We will look at only a few of these, and only on an annual basis. (There are some other sheets as well in this document, but we’ll ignore those.)
Data and questions
The World Bank recently released their latest report on inflation rates, as described here:
https://www.worldbank.org/en/research/brief/inflation-database
We’ll be working with this database, which is distributed as an Excel file. You can download it from here:
/content/files/en/doc/1ad246272dbbc437c74323719506aa0c-0350012021/original/inflation-data.xlsx
The Excel file contains a number of sheets. The first is a general introduction to the document, and the second outlines what the rest of the sheets contain. We’ll be loading a number of these into Pandas as part of our analysis.
There were nine questions this week. Here they are, along with my detailed solutions and explanations:
From the Excel file, read the following three tabs: hcpi_a, ccpi_a, and ppi_a.
As usual, the first thing that I have to do is load up Pandas:
import pandas as pd
With that in place, I want to load the Excel file into a data frame. I can normally do that with the “read_excel” function:
filename = 'inflation-data.xlsx'
df = pd.read_excel(filename)
Except that with our current file, this won’t work. The reason is that the file contains multiple sheets — and when you invoke read_excel on a multi-sheet Excel file, you don’t get a data frame back, but rather a dictionary. The dict’s keys are the sheet names, and the dict’s values are data frames.
There’s nothing wrong with this per se, but we are only interested in a handful of sheets from the file. We can specify which ones we want to load by passing the “sheet_name” keyword argument to read_excel:
all_dfs = pd.read_excel(filename,
sheet_name=['hcpi_a', 'ccpi_a', 'ppi_a'])
Note that we can specify sheet names either with strings (as I did here), using the same names as we see at the bottom of the spreadsheet, or with integers, numbering the sheets starting with 0. In either case, the values that we pass to “sheet_name” will be the keys of the dict — so they might be strings, and might be integers.
How can we then take all of the data frames that we received, and turn them into a single data frame? The most straightforward way is with pd.concat, a top-level function that takes an iterable of data frames and returns one new data frame, the result of stacking all of the input data frames vertically:
df = pd.concat(all_dfs.values())
This will only work if all of the data frames have the same columns, and if they have distinct values that we can use to identify them. Both are true in our case, giving us one long data frame.
We’ll now go through the process of turning our long data frame, the result of concatenating the three sheets that we imported, into a more useful form.
Remove any rows in which either "Country" or "Series Name" is NaN.
We’ll want to modify our data frame such that the “Country” and “Series Name” columns are used as a multi-index. Unfortunately, some rows have NaN (“not a number”) values, which refer to missing data. While there’s no way around NaN values in Pandas, or in general, we do want to remove them from any columns we’ll use as an index.
The “dropna” method returns a new data frame, one without any rows containing NaN values. That’s right — if we call dropna on a data frame, then any row containing even a single NaN is removed.
That’s great, in that it guarantees you’ll get good data without any missing values. But if you try it on our data frame, you’ll find that it removes all of the rows! That’s because many of them contain NaN values in other columns.
One solution is to pass the “thresh” keyword argument to dropna. That allows us to indicate the minimum number of non-NaN values we require in order to keep a row. But that’s not very specific; we only care about NaN in Country and Series Name, and want to make sure that there are zero NaNs there.
Fortunately, dropna has a “subset” keyword argument, telling Pandas that it should only check specific columns for NaN values. We can thus write:
df = df.dropna(subset=['Country', 'Series Name'])
Notice that, as is almost always the case with Pandas, we get back a new data frame, rather than modifying the original. We thus assign the result back to df, ensuring that these two columns have known values before using them in a multi-index.