BW #58: NATO (solution)
Which NATO countries budget the most (and least) for their military? When did countries join NATO? And do the most veteran countries typically spend the most?
This week, we looked at NATO, the North Atlantic Treaty Organization (https://nato.int). NATO has been in the news lately, both because Sweden joined and because Donald Trump (again) threatened not to defend fellow NATO members if he’s re-elected.
Given the degree to which NATO has been in the news, I thought it would be appropriate to look at its membership and spending, and see what data we can analyze about it.
Data and eight questions
This week's data largely comes from the World Population Review (https://worldpopulationreview.com). The data itself is a bit annoying to download. First, go to the "NATO spending" page:
https://worldpopulationreview.com/country-rankings/nato-spending-by-country
We're interested in downloading the CSV version of the table of data, displayed right above the FAQs toward the lower part of the page. Click on the word "CSV" at the top of the table. You can then enter your e-mail address and get the data e-mailed to you… or you can go to the following link, where I made the file available to you:
https://drive.google.com/file/d/1x9iWLTfCpxqrDWrc_CwCvL529dCOg0pT/view?usp=sharing
We’ll also be looking at a second data set, this one taken from the Wikipedia page about NATO’s member states:
https://en.wikipedia.org/wiki/Member_states_of_NATO
Here are the eight tasks and questions I asked you to answer. As always, a link to the Jupyter notebook I used to solve these problems is at the bottom of this message:
Turn the CSV file from the World Population Review into a data frame. Use the country names as the data frame's index. Remove the "Total" row.
Before doing anything else, I loaded up Pandas:
import pandas as pd
Next, I downloaded the CSV file, and used “read_csv” to get a Pandas data frame from it:
filename = 'nato-spending-by-country-2024.csv'
df = pd.read_csv(filename)
However, I asked you to do two additional things with this file, beyond just reading it. First, I asked you to use the “country” column as the data frame’s index. We can do that by passing the “index_col” keyword argument to “read_csv”:
df = pd.read_csv(filename,
                 index_col='country')
I also asked you to remove the final row, the one marked “Total”. Given that “Total” appears in the “country” column, we can use “drop” to remove that row by specifying the index:
df = pd.read_csv(filename,
                 index_col='country').drop('Total')
We now have a data frame with 30 rows (one for each NATO country in the data set) and 7 columns.
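We can confirm that by checking the data frame’s “shape” attribute, which returns a (rows, columns) tuple:

df.shape    # (30, 7)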
Which five countries, in 2023, spent the most on NATO (i.e., their militaries) as a percentage of GDP? Which five countries spent the most in absolute dollars? And which five spent the most per capita?
These three questions measured different things, but our way to solve them with Pandas was the same: Take one of the columns, and find the five largest numbers.
But wait a second — what is the best way to do that?
For years, my favorite technique was to take the column, sort it in descending order with “sort_values”, and then grab the first five rows with “head”:
(
    df['natoSpendingByCountry_percGdp2023']
    .sort_values(ascending=False)
    .head()
)
But there are a few other ways to do the same thing. I decided to compare techniques and time them with Jupyter’s “%%timeit” magic command. The double % means that it runs an entire cell multiple times, and then reports the mean time and its standard deviation.
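For example, timing the “sort_values”-and-“head” query from above means putting the magic command at the top of the cell:

%%timeit
(
    df['natoSpendingByCountry_percGdp2023']
    .sort_values(ascending=False)
    .head()
)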
On average, the above code took 87.8 µs. What if, instead of using “head”, I were to use “iloc”? Let’s see:
%%timeit
(
    df['natoSpendingByCountry_percGdp2023']
    .sort_values(ascending=False)
    .iloc[:5]
)
The answer? Almost identical, with 87.7 µs per run.
I’ve recently started to use “nlargest” more and more. Is that any different?
%%timeit
(
    df['natoSpendingByCountry_percGdp2023']
    .nlargest(5)
)
Wow, it’s different — and not in a good way! Running “nlargest” took an average of 392 µs per run, roughly 4.5x as long as the other techniques!
Of course, the speed might depend on the dtype or number of values. So there’s more to learn here. But I think that my infatuation with “nlargest” is coming to an end, at least for now.
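Here’s a quick sketch of how we might explore that, timing both techniques on one million random floats with the standard library’s “timeit” module. (The series size and repetition count here are arbitrary choices, just for illustration.)

import timeit

import numpy as np
import pandas as pd

# One million random float64 values, far more than our 30-row data frame
s = pd.Series(np.random.rand(1_000_000))

for label, stmt in [('sort_values + head',
                     's.sort_values(ascending=False).head()'),
                    ('nlargest', 's.nlargest(5)')]:
    # Run each statement 100 times, and report the mean per-run time
    total = timeit.timeit(stmt, globals=globals(), number=100)
    print(f'{label}: {total / 100 * 1000:.2f} ms per run')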
Of course, all of the above queries gave me the same answer:
country
Poland           3.90
United States    3.49
Greece           3.01
Estonia          2.73
Lithuania        2.54
Name: natoSpendingByCountry_percGdp2023, dtype: float64
I’m not sure what surprises me more: that Poland and Greece are spending so much on defense relative to their GDP, or that the much-larger US is doing so. Actually, I have to assume that Poland — and likely Estonia and Lithuania, as well — is spending on defense partly because of Russia’s invasion of Ukraine.
What about the other comparisons? Let’s check in raw numbers, millions of US dollars:
(
    df['natoSpendingByCountry_defense2023_inMillionsUSD']
    .sort_values(ascending=False)
    .head()
)
Here’s the result:
country
United States     860000
Germany            68080
United Kingdom     65763
France             56649
Italy              31585
Name: natoSpendingByCountry_defense2023_inMillionsUSD, dtype: int64
In terms of raw numbers, the US spends far, far more than any other NATO ally on defense. I do find it interesting that Germany, whose constitution restricts its armed forces to defensive purposes, has the second-highest defense spending in NATO.
Finally, I asked for the top-spending countries per capita:
(
    df['perCapita2023']
    .sort_values(ascending=False)
    .head()
)
The results:
country
United States     2515.985136
Norway            1598.338337
Finland           1319.846930
Denmark           1140.630958
United Kingdom     967.651671
Name: perCapita2023, dtype: float64
It’s pretty amazing that the US spends so much on defense per capita, given that it’s also a very large country. And I’m not sure why I’m surprised to find that Norway, Finland, and Denmark spend so much on defense, given how close they are to Russia, but I am.
The US is thus at or near the top of the pack on all of these measures. That’s unusual; no other country appears at or near the top in more than two of them, and even that is rare.
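If you want to verify this, here’s a rough sketch that counts how many of the three top-five lists each country appears in, using the column names from above:

from collections import Counter

# The three columns we compared in this section
cols = ['natoSpendingByCountry_percGdp2023',
        'natoSpendingByCountry_defense2023_inMillionsUSD',
        'perCapita2023']

# Count how many of the three top-five lists each country appears in
counter = Counter(country
                  for col in cols
                  for country in df[col].nlargest(5).index)

print(counter.most_common())

Based on the outputs above, the United States should be the only country to appear three times, with the United Kingdom appearing twice.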