BW #93: Anti-politics (solution)
[Like many others, I've been getting active on BlueSky, posting threads on Python/Pandas topics. Follow me there, at https://bsky.app/@lernerpython.com! I've even created a starter pack with lots of Python authors, lecturers, and teachers, at https://go.bsky.app/M5WjB4s . See you there!]
This week, we looked at some of the data that The Economist described and analyzed in their recent article, "The anti-politics eating the West." The data cover a large number of elections in democratic countries over the past several decades. (If you have an Economist subscription, then you can read the article at https://www.economist.com/interactive/essay/2024/10/31/when-politics-is-about-hating-the-other-side-democracy-suffers .) The article described this as "anti-politics" and "negative partisanship," and warned that such feelings are leading people to distrust government in general, and democratically elected governments in particular.
The data set is huge, and there's no way that I (or anyone reading this newsletter) could possibly wrap our heads around all of it in a short period of time. The researchers and data journalists who put this together are really quite impressive.
But we can learn something from what they've written and put together, and along the way, we can learn some useful and interesting analysis techniques.
Data and six questions
This week's data is all collected for us in the GitHub repo for the article:
https://github.com/TheEconomist/anti-politics
I succeeded in installing the R programs and data that they requested, but it wasn't easy, fast, or smooth. So I concentrated instead on two of the data files in the repo:
garzia-and-ferreira-da-silva/eco_cses.dta, a Stata data file, and military-spending/the_economist_military_spending.csv, a CSV file prepared by the Economist.
Both of these files, as well as the rest of the data they used, are in the source-data directory of the GitHub repository.
Here are my solutions and explanations. A link to download the Jupyter notebook I used to solve these problems is at the end of the article.
Read the Stata data from Garzia and Ferreira da Silva (eco_cses.dta) into a data frame. The COUNTRY column should be turned into the index.
Before we do anything else, let's load Pandas:
import pandas as pd
Next, I asked you to load the data into a data frame. That's usually a pretty straightforward task, but I asked you to do it from a .dta file with Stata data.
In case you aren't familiar with it, Stata is a commercial language and tool for statistical analysis. I've never used it myself, but I know that it's popular, as are a number of other tools, such as SPSS and SAS. Most of the time, people who want to share data with others export it into other formats – sometimes CSV or Excel, and sometimes even into Parquet and Feather, formats associated with Apache Arrow. (Indeed, I recently noticed that New York's taxi data, which I've been using for years in my courses, is now distributed in Parquet format.)
In this case, though, neither the Economist nor the researchers they worked with exported the data from Stata. Fortunately, Pandas has read_stata, a function that works much like read_csv and read_excel, with many of the same keyword arguments.
So really, the only thing we need to do is call read_stata:
filename = 'garzia-and-ferreira-da-silva/eco_cses.dta'
df = pd.read_stata(filename, index_col='COUNTRY')
As you can see, I also passed the index_col keyword argument, which you might have seen in read_csv and read_excel, taking the COUNTRY column from the data we read and turning it into the index for the newly created data frame.
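Since most readers won't have the eco_cses.dta file at hand, here's a self-contained round trip on toy data (the column names echo the real file, but the values are invented), showing that read_stata accepts index_col just as read_csv and read_excel do:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for eco_cses.dta: same column names, made-up values
toy = pd.DataFrame({
    'COUNTRY': ['Canada', 'Japan'],
    'YEAR': [2015.0, 2015.0],
    'INPARTY': [7.1, 6.8],
})

# Write a Stata .dta file, then read it back
path = os.path.join(tempfile.mkdtemp(), 'toy.dta')
toy.to_stata(path, write_index=False)

# index_col works here just as it does in read_csv and read_excel
df = pd.read_stata(path, index_col='COUNTRY')
print(df.index.tolist())    # ['Canada', 'Japan']
print(df.columns.tolist())  # ['YEAR', 'INPARTY']
```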
The result is a data frame with 185 rows and 7 columns, all with a dtype of float32. If you're surprised that they aren't float64, remember that the Stata data file is binary, which means that we can know the type and size of each column. With CSV, the files contain only text, which Pandas then examines to decide whether each column should be treated as integers (int64), floats (float64), or strings (object).
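You can watch that inference happen with a tiny CSV built in memory (toy data, nothing from the article's repo):

```python
import io

import pandas as pd

csv_text = """a,b,c
1,2.5,hello
3,4.5,world
"""

# Pandas examines the text: whole numbers become int64,
# numbers with a decimal point become float64, everything else object
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)
```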
Create a data frame in which years form the rows and there's a multi-index on the columns, with the top level being the INPARTY and OUTPARTY measures, and the inner level being the country names.
The data frame we created in the previous question has a YEAR column, and its index came from the original COUNTRY column. Both of those are categorical columns, meaning that there's a (relatively) limited number of distinct values in them. Also, there is a single value of INPARTY for each year-country combination.
We can retrieve the INPARTY value for each year-country combination with a pivot table. I often like to say that a pivot table is basically a two-dimensional groupby call, with two categorical columns and a numeric one:
- The columns will contain country names
- The index will contain years
- The values will be from INPARTY
- In theory, we'll run mean on all of the values for each country-year intersection. But since there is only one such value for each intersection, no aggregation method is needed.
We can do it this way:
(
df
.pivot_table(index='YEAR',
columns='COUNTRY',
values='INPARTY')
)
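To make the "two-dimensional groupby" claim concrete, here's a toy data frame (invented values) where pivot_table and an explicit groupby-plus-unstack produce the same result:

```python
import pandas as pd

toy = pd.DataFrame({
    'COUNTRY': ['Canada', 'Canada', 'Japan', 'Japan'],
    'YEAR': [2015, 2019, 2015, 2019],
    'INPARTY': [7.0, 7.5, 6.0, 6.5],
})

# A pivot table: one categorical column becomes the index,
# the other becomes the columns, and mean is the default aggfunc
pivoted = toy.pivot_table(index='YEAR',
                          columns='COUNTRY',
                          values='INPARTY')

# The same result, spelled as a groupby on both categorical
# columns, followed by unstack to move COUNTRY into the columns
grouped = toy.groupby(['YEAR', 'COUNTRY'])['INPARTY'].mean().unstack()

print(pivoted.equals(grouped))  # True
```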
The good news: This works, giving us a data frame with 25 rows and 49 columns. But it's also rather unwieldy, with a very large number of NaN values.
Besides, two of my requests are missing: We only wanted to include G7 countries in our output. And we wanted to see not only values from INPARTY, but also OUTPARTY.
Let's address the G7 countries first: We can define a list of strings containing G7 country names, and then pass that list to .loc to keep only those rows whose country is in the list:
G7_COUNTRIES = ['Canada',
'France', 'Germany',
'Great Britain', 'Italy', 'Japan',
'United States of America'
]
(
df
.loc[G7_COUNTRIES]
.pivot_table(index='YEAR',
columns='COUNTRY',
values='INPARTY')
)
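Because the country names are in the index, passing the list to .loc selects the matching rows directly. A lambda with isin is a more defensive spelling of the same idea, since it builds a boolean mask and therefore won't raise KeyError if one of the labels is missing from the index. A sketch on toy data (invented values):

```python
import pandas as pd

toy = pd.DataFrame(
    {'YEAR': [2015, 2015, 2015], 'INPARTY': [7.0, 6.0, 5.0]},
    index=pd.Index(['Canada', 'Japan', 'Brazil'], name='COUNTRY'),
)

KEEP = ['Canada', 'Japan']

# Label-based selection: raises KeyError if any label is absent
by_labels = toy.loc[KEEP]

# Mask-based selection via lambda + isin: silently skips missing labels
by_mask = toy.loc[lambda df_: df_.index.isin(KEEP)]

print(by_labels.equals(by_mask))  # True
```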
This produces a much more reasonably sized data frame, but it still only contains the INPARTY values. However, a good rule of thumb in Pandas is that wherever you can provide a string value, you can provide a list of string values – and that's true here as well. We can set the values keyword argument to the list ['INPARTY', 'OUTPARTY']:
G7_COUNTRIES = ['Canada',
'France', 'Germany',
'Great Britain', 'Italy', 'Japan',
'United States of America'
]
(
df
.loc[G7_COUNTRIES]
.pivot_table(index='YEAR',
columns='COUNTRY',
values=['INPARTY', 'OUTPARTY'])
)
The result is what I asked for – a data frame in which the index contains years, and the columns are a two-level multi-index. The outer level is INPARTY and OUTPARTY, and the inner level is (twice) the list of countries, for a total of 19 rows and 14 columns.
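Once the columns form a two-level multi-index, there are several ways to pull values back out. A toy sketch (level names and values are invented, not from the real data) showing selection by outer label, by tuple, and with xs on the inner level:

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [['INPARTY', 'OUTPARTY'], ['Canada', 'Japan']],
    names=['measure', 'COUNTRY'],
)
toy = pd.DataFrame(
    [[7.0, 6.0, 3.0, 2.0],
     [7.5, 6.5, 2.5, 1.5]],
    index=pd.Index([2015, 2019], name='YEAR'),
    columns=columns,
)

# Outer label -> a sub-frame with all countries for that measure
inparty = toy['INPARTY']

# Tuple -> one specific (measure, country) column as a Series
canada_in = toy[('INPARTY', 'Canada')]

# xs on the inner level -> both measures for one country
canada = toy.xs('Canada', axis='columns', level='COUNTRY')

print(inparty.columns.tolist())  # ['Canada', 'Japan']
print(canada.columns.tolist())   # ['INPARTY', 'OUTPARTY']
```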