BW #57: International arms trade (solution)
Which countries export the most military equipment? How has that changed in the last few years? Also: Who is aiding Ukraine's military the most?
This week, we looked at arms-trade data released earlier this week by the Stockholm International Peace Research Institute (SIPRI, https://www.sipri.org). This was reported in a number of outlets, including Politico, whose article (https://www.politico.eu/article/france-overtake-russia-world-weapons-exporter/) noted that as of 2023, France is the world’s second-largest arms exporter, replacing Russia, whose arms exports went down substantially.
SIPRI records every sale of military equipment — tracking the seller, the buyer, when the order was made, how much it was for, and when it was delivered. The sellers are listed as countries, and the buyers are either countries or non-state actors (e.g., Hamas).
SIPRI’s latest data is current through the end of 2023; to be honest, I’m impressed that they’re able to gather this sort of information at all, given that governments are typically less than open about what they’re buying, and from whom.
I thought that it would be interesting to find out who is selling, who is buying, and what changes we have seen over the last number of years.
Data and six questions
This week's data comes from SIPRI's arms transfer database:
https://armstransfers.sipri.org/ArmsTransfer/
I asked you to download a CSV file with information about all arms “transfers,” as they refer to them:
https://armstransfers.sipri.org/ArmsTransfer/TransferRegister
On this page, ask for information from 2000 through 2023. The sorting doesn't matter, but we do want deliveries broken down by year (so click on that checkbox). Then click on the red "Download as CSV" button. This will take a little while to run, but within a minute or so, you'll get a file called "trade-register.csv" downloaded to your computer. That'll be our data set for this week's exercises.
A data dictionary describing the methodology used to compiled the database is at
https://www.sipri.org/databases/armstransfers
As I noted yesterday, the database records arms sales using TIV, "trend-indicator value," a SIPRI-specific amount. You can compare TIV across years, countries, and products, but you can't directly convert it to dollars or any other currency.
This week, I have six tasks and questions for you. (And yes, I indicated yesterday that I had seven. Whoops!)
Here are my detailed solutions; as always, the Jupyter notebook I used to solve these problems is at the bottom of this edition.
Create a data frame from the "trade-register.csv" file, keeping the column headers.
We can start, as always, by loading Pandas:
import pandas as pd
Assuming that the file is called “trade-register.csv”, we should be able to load it into a data frame with “read_csv”, a common Pandas method:
filename = 'trade-register.csv'
df = pd.read_csv(filename)
However, this doesn’t work, giving us an error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 12, saw 17
Here, Pandas is complaining that it expected a certain number of fields, but got a different number. That’s because Pandas looks at the first line of a CSV file, and uses that to figure out how many fields there should be. If any subsequent line has the wrong number, we get the above ParserError exception.
The problem is that the SIPRI file starts with several explanatory and copyright lines:
Transfers of major conventional arms from All countries to All countries. Deals with deliveries or orders made for year range '2000' to '2023'
A '?' in a column indicates uncertain data. The ?Number delivered? and the ?Year(s) of deliveries? refer only to deliveries in the selected year(s).
An empty field for ?Number ordered? indicates that data is not yet available.
SIPRI trend-indicator values (TIVs) are in millions.
An empty field for ?SIPRI TIV for total order? indicates that data (on the number ordered and/or the TIV per unit) is not available.
A '0' for ?SIPRI TIV of delivered weapons? indicates that the volume of deliveries is between 0 and 0.5 million SIPRI TIV; and an empty field indicates that no deliveries have been identified.
Figures may not add up to stated totals due to the conventions of rounding.
For the method used for the SIPRI TIV and explanations of the conventions; abbreviations and acronyms see <https://www.sipri.org/databases/armstransfers/sources-and-methods>.
Source: SIPRI Arms Transfers Database (c) SIPRI.
Data generated: 13 Mar 2024 4:39:22 PM
This is all great, but when Pandas reads the file, it assumes that there will be only one field per line, because there aren’t any commas on the first line.
We can fix this by telling it to ignore the first few lines, and that the headers describing the columns are on a later line in the file. In this case, the headers start on line 12, which Pandas would refer to as line 11 (because of 0-based indexing). However, I found that I actually needed to pass header=10 in order for the file to be read correctly:
df = pd.read_csv(filename, header=10)
That worked just fine, and gave me a data frame with 20,480 rows and 17 columns. We could have gotten a smaller data frame if we had asked for each of the orders to be listed in a single row, rather than broken up by delivery. However, this gave us results that were different than were in the SIPRI summary. I took that to mean that their analysis used the broken-down delivery dates, and then recorded when the arms were delivered, rather than when they were ordered.
In this case, “read_csv” worked just fine. But when I was researching the topic for Bamboo Weekly, I found that several of SIPRI’s files weren’t readable by “read_csv”. I got UnicodeDecodeError exceptions when trying to read the files — an indication that the files were written using something other than Unicode’s UTF-8 encoding, which is what Pandas (and Python) expect by default.
To read such a file, we can pass the “encoding” keyword argument to “read_csv”:
df = pd.read_csv(filename, header=10, encoding='Latin-1')
Again, that wasn’t necessary with this file, but you will undoubtedly encounter this problem in the future.
But wait a second: How did I know that the problematic file was encoded in Latin-1? Is there any automated way to figure it out? The answer is “yes,” to at least some degree: The “file” command on MacOS and Linux, when run in the terminal against a file, will make a pretty good guess regarding its encoding. For example:
❯ file ~/Downloads/trade-register.csv
/Users/reuven/Downloads/trade-register.csv: ASCII text
“ASCII text” is the most generic encoding available, but it’s a subset of both UTF-8 and Latin-1. So I could actually use either encoding, including the default, and read it fine. But one of the other SIPRI files looked like this:
❯ file trade-register.csv
trade-register.csv: ISO-8859 text, with very long lines (312)
Here, you can see that it’s guessing at an encoding of ISO-8859. That covers many different languages with sub-encodings such as ISO-8859-1 (Western Europe) and ISO-8859-8 (English + Hebrew). It won’t always be able to tell you which language or sub-encoding the file used, but it often can.
I also, in preparing this article, discovered the “chardet” utility, which uses heuristics to determine the character set and encoding: https://pypi.org/project/chardet/
Which five countries exported the greatest total TIV in 2023? (Use the "TIV delivery values" column.)
Another way to phrase this question is: Which five countries are the greatest arms exporters, as of 2023? I already mentioned the Politico article in which they pointed to the fact that France is now #2, pushing Russia out of that spot. What are the other three, though?
We’ll start by using “loc” with a “lambda” expression to select only those rows from the file in which “delivery year” was 2023.
(
df
.loc[lambda df_: df_['Delivery year'] == 2023]
)
Remember that this works because “lambda” returns a function object. That function is then applied to the data frame (which we assign to the local parameter “df_”). The function’s result is a boolean series whose value is “True” wherever “Delivery year” is equal to 2023.
When you put a boolean series inside of “loc”, you get back only those elements of the original data frame that match up with a “True” value. This is known as a “mask index.”
Now that we’re looking only at values from 2023, we can use a “groupby” to calculate the total amount delivered by each supplier:
(
df
.loc[lambda df_: df_['Delivery year'] == 2023]
.groupby('Supplier')['TIV delivery values'].sum()
)
This returns a “groupby” object, on which we can run one or more aggregation methods. Remember that in this “groupby”, we will get a result for each unique value in the “Supplier” column. And we’ll calculate the total “TIV delivery values” for each supplier.
We get back a series whose index contains supplier (country) names, and whose values are the total TIV for each supplier in 2023. We can get the five largest values (along with their indexes) with the “nlargest” method:
(
df
.loc[lambda df_: df_['Delivery year'] == 2023]
.groupby('Supplier')['TIV delivery values'].sum()
.nlargest()
)
The result:
Supplier
United States 11287.01
Germany 3287.07
China 2431.71
France 2012.17
Italy 1436.81
Name: TIV delivery values, dtype: float64
This doesn’t quite match up with the story from Politico, which said that France is now #2 and that Russia is #3. I’m guessing that I looked at the wrong columns, or calculated it incorrectly. But we see that the US is, by far, the largest supplier of arms, followed distantly by Germany, China, France, and Italy. Russia, by this measure, comes it at #6.