BW #35: Terrorism (solution)
The big news this week was the massive terrorist attack that Hamas committed in Israel. This week's data set thus has to do with terrorism — where it occurs, and whether that has changed over time.
The fallout from the weekend’s massive terrorist attack on Israel continues to dominate world news, and also (for obvious reasons) my day-to-day life here in Israel.
This week, in the wake of that attack, I decided that it would be appropriate to look at the Global Terrorism Index (https://www.visionofhumanity.org/maps/global-terrorism-index/), published by Vision of Humanity (https://www.visionofhumanity.org) using data from the Institute for Economics and Peace (https://www.economicsandpeace.org/), both in Sydney, Australia.
We here in Israel talk, think, and worry about terrorism quite a bit, and I was curious to see what the trends looked like over the years, whether terrorist attacks had become more or less deadly over time, and whether certain parts of the world are more dangerous than others.
Data and eight questions
The "Global Terrorism Index" has an annual report, but we here at Bamboo Weekly aren’t going to look at a report that someone else wrote, right? No, we’re going to download the raw data and play with it for a while. The GTI data, along with data from other, related reports, is available here:
https://www.visionofhumanity.org/public-release-data/
The data itself is downloadable as an Excel file:
/content/files/wp-content/uploads/2023/06/gti-2023-overall-scores-2011-2022.xlsx
This week, I asked you eight questions. Let’s go through them, little by little; a link to the Jupyter notebook I used to solve the problems follows the final answer:
Load the data into a data frame. We're interested in the "Overall Scores" sheet from the Excel file, with the "Country" column as the index, and the columns whose names end in the word "Rank".
First and foremost, we’ll need to load Pandas:
import pandas as pd
With that out of the way, we want to load the Excel file. The standard way to do this in Pandas is with the “read_excel” function, which returns a data frame based on an Excel file. We can thus say:
filename = 'GTI-2023-Overall-scores-2011-2022.xlsx'
df = (
    pd.read_excel(filename)
)
Except that this actually won’t work. That’s because our Excel file contains several sheets. If you don’t specify which sheet you want, then you get the first one — which, in our case, is labeled “overview,” and indeed contains a basic overview of the data.
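If you aren’t sure what sheets an Excel file contains, you can ask Pandas to list them for you. Here’s a quick sketch, using pd.ExcelFile, which parses the workbook’s structure without loading any sheet’s data:

import pandas as pd

filename = 'GTI-2023-Overall-scores-2011-2022.xlsx'

# sheet_names is a list of strings, one per sheet in the workbook
excel_file = pd.ExcelFile(filename)
print(excel_file.sheet_names)   # includes 'overview' and 'Overall Scores'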
We want the sheet named “Overall Scores,” which we can read by specifying it in our call to read_excel:
df = (
    pd.read_excel(filename,
                  sheet_name='Overall Scores')
)
Now we have the right data, and a data frame — but things aren’t quite right here, either. That’s because read_excel assumes that the headers are in the top row, and that the data in each column is of a single type. That’s not the case here; the first few rows contain information about the Institute for Economics and Peace, and the headers appear only on line 7. We can tell read_excel to ignore those first few rows:
filename = 'GTI-2023-Overall-scores-2011-2022.xlsx'
df = (
    pd.read_excel(filename,
                  sheet_name='Overall Scores',
                  header=6)
)
Notice that the headers are on line 7 of the spreadsheet. But whereas Excel starts counting with 1, Python and Pandas start with 0, which means that we need to specify header=6 if we want to use row 7 as it appears in the spreadsheet.
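If you find that off-by-one counting confusing, a roughly equivalent alternative (just a sketch) is the “skiprows” keyword argument, which tells read_excel to discard the first six rows entirely, such that what was row 7 becomes the header:

df = (
    pd.read_excel(filename,
                  sheet_name='Overall Scores',
                  skiprows=6)   # skip rows 1-6 (Excel numbering); the next row becomes the header
)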
I asked you to set the “Country” column to be the index. We can do this by specifying the “index_col” keyword argument:
filename = 'GTI-2023-Overall-scores-2011-2022.xlsx'
df = (
    pd.read_excel(filename,
                  sheet_name='Overall Scores',
                  header=6,
                  index_col='Country')
)
With the above in place, we now have a data frame. But it contains more columns than we’ll want or need. How can we keep only those columns with a particular year and the word “Rank” in their names?
One way would be to explicitly specify the names we want in our call to read_excel, with the “usecols” keyword argument. Another way would be to use double square brackets on the data frame we get back, indicating which columns we want.
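For example, here’s a sketch of the usecols approach, assuming (as we’ll see below) that the rank columns have names along the lines of “2021 Rank” and “2022 Rank”; notice that we have to include “Country” so that it can still serve as our index:

# usecols takes an explicit list of column names; this abbreviated list is
# hypothetical, and we would need to name every rank column that we want
df = pd.read_excel(filename,
                   sheet_name='Overall Scores',
                   header=6,
                   index_col='Country',
                   usecols=['Country', '2021 Rank', '2022 Rank'])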
But in this case, we want a bunch of columns that have similar names. We could even say that we just want all of the columns that end with the word “Rank”.
Enter the “filter” method, which lets us choose rows or columns from a data frame whose values fit a particular pattern. In this case, I’m going to use a regular expression, passing it to the “regex” keyword argument for “filter”:
filename = 'GTI-2023-Overall-scores-2011-2022.xlsx'
df = (
    pd.read_excel(filename,
                  sheet_name='Overall Scores',
                  header=6,
                  index_col='Country')
    .filter(regex='.*Rank$')
)
The regular expression I pass here is “.*Rank$”. That means:
- Any character (.), zero or more times (*)
- The literal characters “Rank”
- The end of the string ($), meaning that “Rank” must come at the very end
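If you want to convince yourself that the pattern behaves as advertised, you can try it with Python’s standard re module, whose re.search function (much like filter’s regex matching of labels) looks for the pattern anywhere in a string. A quick sketch:

import re

print(re.search(r'.*Rank$', '2022 Rank'))    # a match object -- the pattern matched
print(re.search(r'.*Rank$', '2022 Score'))   # None -- no match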
I get back a data frame with 12 columns, the GTI ranks from 2011 through 2022.
(By the way, if you think that regular expressions are difficult or impossible to learn, they aren’t! I have a free e-mail course on the subject (Regexp Crash Course), complete with exercises, at https://RegexpCrashCourse.com/.)
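By the way, if you would rather sidestep regular expressions entirely in this case, filter also accepts a “like” keyword argument, which performs plain substring matching. Here’s a sketch, which assumes that only the rank columns contain the string “Rank”:

df = (
    pd.read_excel(filename,
                  sheet_name='Overall Scores',
                  header=6,
                  index_col='Country')
    .filter(like='Rank')   # keep columns whose names contain 'Rank'
)

Because “like” checks for containment anywhere in the name, rather than only at the end, the regex version is a bit more precise.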
We now have the data frame that we want and need, with a row for each country and a column for each year.
In 2022, what countries were ranked in the top 10% of those having terrorist problems? Display the countries and their ranks, sorted by rank, with #1 at the top.
Now that we have our data, let’s find out which countries are in the top 10% of those with terrorism problems. This is a slightly different problem from the ones I typically pose; usually I ask for the top 5 or top 10 results, for which we can use a combination of sort_values and head. Here, I want to see the top 10% of results. How can we get only that percentage?
We can use the “quantile” method on the “2022 Rank” column. That’ll return the value which marks that quantile. If we run “quantile(0.5)”, then we’ll get the median value, because it’s precisely at the 50% mark. If we run “quantile(0.25)”, then we’ll get the value at the 25% mark. Here, we want the 10% of countries with the greatest terrorism problems. You might thus think that we should run “quantile(0.9)”, to get those at the 90% mark. But the rankings are actually inverted, with 1 being the most terror-struck country. We thus want the rows whose 2022 rank is less than the 10% quantile.
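To get a feel for how quantile behaves, here’s a tiny sketch using a toy series of the integers 1 through 10; the values in the comments reflect Pandas’ default linear interpolation:

import pandas as pd

s = pd.Series(range(1, 11))   # the integers 1 through 10

print(s.quantile(0.5))    # 5.5, the median
print(s.quantile(0.25))   # 3.25, the 25% mark
print(s.quantile(0.1))    # 1.9, the 10% mark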
We start by creating a boolean series, thanks to a comparison:
df['2022 Rank'] < df['2022 Rank'].quantile(0.1)
We can then pass that boolean series as a mask index to “loc”. This will return only those rows of df whose ranks are in the lowest 10% of ranks:
(
    df
    .loc[df['2022 Rank'] < df['2022 Rank'].quantile(0.1)]
)
But actually, we only care about the “2022 Rank” column. “loc” can take two arguments, the first being a row selector and the second being a column selector. We can thus select only one column:
(
    df
    .loc[df['2022 Rank'] < df['2022 Rank'].quantile(0.1),
         '2022 Rank']
)
This returns a single series. We can sort its values in ascending order, from the lowest (worst) rank to the highest:
(
    df
    .loc[df['2022 Rank'] < df['2022 Rank'].quantile(0.1),
         '2022 Rank']
    .sort_values()
)
The result that I get is as follows:
Country
Afghanistan                          1
Burkina Faso                         2
Somalia                              3
Mali                                 4
Syria                                5
Pakistan                             6
Iraq                                 7
Nigeria                              8
Myanmar                              9
Niger                               10
Cameroon                            11
Mozambique                          12
India                               13
Democratic Republic of the Congo    14
Colombia                            15
Egypt                               16
Chile                               17
Name: 2022 Rank, dtype: int64
Some of those (e.g., India and Chile) surprised me quite a bit. Others, such as Afghanistan and Syria, didn’t surprise me in the slightest. But using “quantile” allows us to grab a particular slice not based on a hard number, but rather based on a percentage of the values.
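By the way, another way to get much the same result, if you prefer to think in terms of counts rather than quantiles, would be to calculate 10 percent of the number of rows and then ask for that many of the smallest ranks with “nsmallest”. A sketch, not what I did above:

# calculate how many countries make up 10% of the data frame, then take
# that many rows with the smallest (i.e., worst) values in '2022 Rank';
# nsmallest returns them already sorted in ascending order
n = round(len(df) * 0.1)
df['2022 Rank'].nsmallest(n)

Depending on how the rounding works out, this can differ from the quantile-based version by a country or two.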