BW #74: UK elections (solution)

BW #74: UK elections (solution)

[I'm at Euro Python in Prague this week. If you're attending, then please find me and say "hi"!]

This week, we looked at data from the recent Parliamentary elections in the United Kingdom, one in which Sir Keir Starmer's Labour party swept into victory, ousting the Conservatives after 14 years. One of the many advantages of democracies over other governing systems is that they're more transparent – and in the modern age, this means that they make election data available to the public in digital form. This week, we'll thus look at the official UK election results, seeing what insights we can get.

Data and six questions

This week's data comes from the House of Commons library, part of the UK's Parliament. The full research briefing, including charts and graphs, can be read at

https://commonslibrary.parliament.uk/research-briefings/cbp-10009/

They provide two Excel spreadsheets, one describing the winners of the recent elections, split up per electoral district (known as a "constituency"). A similar, companion document, lists the members who were defeated. You can download them from here:

https://researchbriefings.files.parliament.uk/documents/CBP-10009/Winning-members.xlsx
https://researchbriefings.files.parliament.uk/documents/CBP-10009/Defeated-MPs.xlsx

Below are solutions to the six challenges that I posed yesterday. A link to my Jupyter notebook is, as always, at the end of this post:

Load the two files into a single data frame, with one row for each constituency, and the index being the ons_id columns from each. Columns from the defeated file should have the _defeated suffix attached to their names.

Before doing anything else, I loaded Pandas:

import pandas as pd

Next, I defined two variables with the filenames I want to load:

uk_defeated_filename = 'Defeated-MPs.xlsx'
uk_winning_filename = 'Winning-members.xlsx'

Next, I used read_excel to load the first Excel file into a data frame. I passed index_col='ons_id' for that column (as identified in the first row of the Excel file, much like a CSV file) to be our data frame's index:

uk_winning_df = pd.read_excel(uk_winning_filename,
                             index_col='ons_id')

Next, I loaded the second Excel file into a data frame. However, I kept only a few of the columns, using the usecols keyword argument; the others weren't going to be used in any of our queries, and it annoyed me to have so many unused columns around. I also always forget that the column I mention in index_col must be listed in usecols. Here's the command I ended up using:

uk_defeated_df = (pd
                  .read_excel(uk_defeated_filename, 
                              index_col='ons_id',
                              usecols=['ons_id','party_name',
                                     'firstname', 'middlenames',
                                     'surname', 'gender'])
                 )

I then created a single data frame from the two of these. Because ons_id is the index for both data frames, I can use join to combine them horizontally.

However, column names in Pandas cannot repeat; this is different from the index, where they can. Thus, when we join our two data frames together, we have to resolve the duplicated names by passing rsuffix, which indicates what suffix should be added to the right-hand data frame's column names. (There's also lsuffix, and you can use one or both of them.)

df = uk_winning_df.join(uk_defeated_df, rsuffix='_defeated')

The resulting data frame has 652 rows and 20 columns.

Which party won the greatest number of seats in each region? Which party lost the greatest number of seats in each region?

We know that the Labour party won the overall election. This means that a majority (an overwhelming majority, in fact) of members of Parliament are from Labour. I wanted to know, though, which party won the greatest number of seats — and yes, my original question said "votes," but we don't have that data — in each region. The data divides the UK into regions, and I was curious to know which regions moved from one party to another.

The regions are categorical information, and the party names are, as well. So if your instinct was to use groupby, then that's great – but here, we're grouping by two different categorical columns, which lends itself nicely to a pivot table. I'll invoke pivot_table, telling Pandas to use party names for the rows (index) and region names for the columns. We'll use the count aggregation method, and then it doesn't really matter what column we use for counting, so long as it doesn't have any missing values – so I chose country_name. The query thus looks like this:


(
    df
    .pivot_table(index='party_name',
                 columns='region_name',
                 values='country_name',
                 aggfunc='count')
    .idxmax()
)

Here's a screenshot of what I got:

But wait a second: I didn't want to know all of the results for all of the regions. I wanted to know which party won in each region. Fortunately, I can use the idxmax method on the data frame, which will tell me, for each column (i.e. region), the index of the row with the highest value:


(
    df
    .pivot_table(index='party_name',
                 columns='region_name',
                 values='country_name',
                 aggfunc='count')
    .idxmax()
)

The result of this query is:

region_name
East Midlands                     Labour
East of England             Conservative
London                            Labour
North East                        Labour
North West                        Labour
Northern Ireland               Sinn Fein
Scotland                          Labour
South East                        Labour
South West                        Labour
Wales                             Labour
West Midlands                     Labour
Yorkshire and The Humber          Labour
dtype: object

Even without any sort of serious data analysis, we can see that in a a clear majority of regions, the Labour party really did win.

We can perform a similar calculation by using the party_name_defeated category for each constituency:

(
    df
    .pivot_table(index='party_name_defeated',
                 columns='region_name',
                 values='country_name',
                 aggfunc='count')
    .idxmax()
)

The results, as you might expect:

region_name
East Midlands                            Conservative
East of England                          Conservative
London                                   Conservative
North East                               Conservative
North West                               Conservative
Northern Ireland            Democratic Unionist Party
Scotland                      Scottish National Party
South East                               Conservative
South West                               Conservative
Wales                                    Conservative
West Midlands                            Conservative
Yorkshire and The Humber                 Conservative
dtype: object