BW #95: Tariffs (solution)
This week, we looked at tariffs – that is, import taxes – imposed on countries around the world. The reason for this topic? Donald Trump's presidential campaign, followed by his post-election win and preparation for his second term, in which he has spoken nonstop about tariffs. Indeed, he recently said that he would impose 25 percent tariffs on goods coming from Canada and Mexico, and even higher tariffs on China and other countries.
Reporters and economists have been talking about these tariff plans for a while, but it seems that people are now starting to take the threats seriously. Just yesterday, NPR's "Fresh Air" program interviewed former Wall Street Journal reporter David Wessel about the proposed tariffs (https://www.npr.org/2024/12/04/g-s1-36759/what-trumps-2nd-term-might-mean-for-the-countrys-economy).
Given that everyone is talking about tariffs, and that Trump is speaking regularly about tariffs of 25 percent or higher, I thought that it would be appropriate to examine tariffs around the world, and how such numbers would fit into current trade practices. The tariff data portal at the World Trade Organization (http://tariffdata.wto.org/default.aspx) was either excruciatingly slow or broken. That's a shame, because they would seem to have the latest and most authoritative tariff information available.
However, the World Bank has a database of overall tariff rates at https://data.worldbank.org/indicator/TM.TAX.MRCH.WM.FN.ZS?view=map . I decided that we could use the data that they provide, which isn't quite as up-to-date as the WTO, nor is it as extensive and detailed. But for our purposes, I think it'll be good enough.
Data and six questions
I have six tasks and questions for you this week. As mentioned, we'll look at the World Bank's tariff data. After going to https://data.worldbank.org/indicator/TM.TAX.MRCH.WM.FN.ZS?view=map, choose the "Excel" link from the "download" area on the right of the page, which should lead to https://api.worldbank.org/v2/en/indicator/TM.TAX.MRCH.WM.FN.ZS?downloadformat=excel .
My Jupyter notebook is embedded below, at the end of this post.
Load the first sheet ("Data") into a Pandas data frame. Use the three-letter country code as the index, and remove the Indicator Name and Indicator Code columns.
Before doing anything else, we'll load Pandas:
import pandas as pd
Next, we'll load the data. While a CSV file contains a single data set, an Excel file can contain more than one data set, each on a separate sheet. Moreover, Excel is a binary format, meaning that when Pandas loads the data into memory, it doesn't need to guess how to turn text into a dtype; it can treat the numbers just as Excel stored them.
We'll start with read_excel, the function that returns a data frame from an Excel file. In theory, we can ignore the sheet_name keyword argument, which allows us to specify a sheet name (using either a string or an integer). If we don't specify sheet_name, then we get the first sheet, which is what we want for now. But I like to be explicit where possible, so we'll start with this:
filename = 'API_TM.TAX.MRCH.WM.FN.ZS_DS2_en_excel_v2_16.xls'
data_df = (
pd
.read_excel(filename,
sheet_name='Data')
)
This isn't a bad start, but it's not going to work. That's because the first few lines of the file contain commentary and non-tabular text, which is fine for us as humans, but confuses Pandas quite a bit. We can avoid this by telling it to start at the fourth row in the file, which Excel would call row 4, but which Python and Pandas (which use 0-based indexing) call row 3. The header keyword argument can fix this. While we're at it, we can specify index_col, naming Country Code as the Excel column which should be used as the data frame's index:
filename = 'API_TM.TAX.MRCH.WM.FN.ZS_DS2_en_excel_v2_16.xls'
data_df = (
pd
.read_excel(filename, header=3,
index_col='Country Code',
sheet_name='Data')
)
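The same header logic applies to read_csv, which makes it easy to see in a self-contained sketch. The file contents below are invented to mimic the layout of the World Bank file (a few commentary lines, then the real header row); they aren't the actual data:

```python
import io
import pandas as pd

# Synthetic text standing in for the file's layout: three lines of
# commentary, then the real header row on the fourth line.
fake_file = io.StringIO(
    "Some commentary\n"
    "More commentary\n"
    "Even more commentary\n"
    "Country Name,Country Code,1990\n"
    "Canada,CAN,8.0\n"
    "Mexico,MEX,12.5\n"
)

# header=3 says: the header is on the line with 0-based index 3;
# everything above it is skipped entirely.
df = pd.read_csv(fake_file, header=3, index_col='Country Code')

print(df.shape)        # (2, 2) -- Country Code became the index
print(df.loc['CAN', '1990'])
```

Without header=3, the commentary line would become the (single) column name, and everything below it would be misparsed as data.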
Finally, I asked you to remove the unnecessary Indicator Name and Indicator Code columns. We can do that by invoking drop, passing the columns keyword argument, and giving it a list of two string values:
filename = 'API_TM.TAX.MRCH.WM.FN.ZS_DS2_en_excel_v2_16.xls'
data_df = (
pd
.read_excel(filename, header=3,
index_col='Country Code',
sheet_name='Data')
.drop(columns=['Indicator Name', 'Indicator Code'])
)
The result is a data frame, data_df, with a shape of 266 rows and 65 columns. The rows represent countries, or groups of countries, while the columns are (largely) a weighted average tariff for each country in a particular year. In this way, we can compare tariffs for a given country over many years, or across many countries in a single year. (Or, as we'll see, we can do both!)
Load the second sheet (Metadata - Countries) into a second data frame. Again, use the three-letter country code as the index.
Here, we'll do roughly the same thing as before. In fact, it's even easier, because the column names are on the first row. I again indicate that Country Code should be our index column. And I specify sheet_name using the sheet's number, rather than its name:
metadata_df = (
pd
.read_excel(filename,
index_col='Country Code',
sheet_name=1)
)
Why pass an integer to sheet_name? I usually prefer to use a string, but sometimes the string has extra whitespace or other odd characters. I decided that it would just be easier to retrieve it with a number, and be done with it.
In theory, we could have loaded both of these data frames at the same time, by passing a list of integers (i.e., sheet numbers) to sheet_name, or even passing None as a value to sheet_name, which would have returned a dict of data frames.
However, the two sheets we loaded aren't identical in structure, which means that we couldn't do this. If we don't specify header, then we get an error from the first sheet. And if we do, then we lose data from the second sheet. Things also get hairy with the index_col keyword argument: specifying the column by name is great, but if we specify header, then we cut off the column names in the metadata data frame.
In short, while I really like the fact that we can load multiple sheets into data frames in a single function call, that didn't seem appropriate here.