BW #53: Airport animals

Get better at: PDF extraction, string operations, cleaning, multi-index, and window functions.

I recently read a story in the Economist ("How to transport a rhino," https://www.economist.com/britain/2024/01/25/how-to-transport-a-rhino) about the various animals that are transported into and out of London's Heathrow Airport every year. There is, it turns out, a special place (HARC, the Heathrow Animal Reception Centre), run by the City of London, which employs 55 people and handles the import and transit of a wide variety of animals — from small insects to lions and horses.

The article was amusing, and led me to wonder where they had gotten their data. After all, the article said that more than 30 million butterfly pupae were transported through Heathrow in 2023. That number had to come from somewhere, right?

Friends, I'm delighted to say that I have managed to track down that data! Perhaps it also exists elsewhere, but I found it in a letter submitted by HARC to the Chair of the Environment, Food, and Rural Affairs Committee of the UK's House of Commons. The letter was submitted on November 1st of last year, so it isn't completely up to date. But unless your animal's passport is out of date (and yes, I've learned that there is such a thing as a "pet passport"), the data should still be interesting and fun to review.

Data and seven questions

The data that we'll be examining isn't available in either CSV or Excel format. Rather, it's buried inside of a letter in PDF format. The letter can be downloaded from here:

https://committees.parliament.uk/writtenevidence/126507/default/

And no, there is no filename or extension on that URL. Going to that link should force the download of the data, at least from a normal browser. Using `wget` doesn't seem to work, however.
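If you would rather fetch the file from code than from a browser, here is a minimal sketch using only Python's standard library. It rests on an assumption that I haven't confirmed: that the server turns away clients that don't identify themselves as a browser, which is one common reason for `wget` to fail. The function names, the `User-Agent` string, and the output filename `harc_letter.pdf` are all my own inventions for illustration.

```python
import urllib.request
from pathlib import Path

PDF_URL = "https://committees.parliament.uk/writtenevidence/126507/default/"

# Assumption: a browser-like User-Agent is what the server wants to see.
# This string is illustrative; any current browser's value should do.
BROWSER_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:122.0) Gecko/20100101 Firefox/122.0"


def build_request(url: str = PDF_URL) -> urllib.request.Request:
    """Build a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_USER_AGENT})


def download_pdf(url: str = PDF_URL, filename: str = "harc_letter.pdf") -> Path:
    """Download the letter and save it under a local name of our choosing."""
    target = Path(filename)  # the URL has no filename, so we supply one
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        target.write_bytes(response.read())
    return target
```

Calling `download_pdf()` should leave a PDF in the current directory if the assumption holds. If it doesn't, downloading the file by hand in a browser works just as well.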

I didn’t see a data dictionary for this information, but I think that it’s mostly self-explanatory. I did look up some of the animal-related terms, and will happily bore you with the details, if you like.

This week, I have seven tasks and questions for you to answer based on the data.

The learning goals for this week include working with PDF files, indexes and multi-indexes, and cleaning data.
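If you haven't pulled tables out of a PDF before, here is one possible starting point. It is a sketch, not the solution: it assumes that the third-party `pdfplumber` package is installed and that it can detect the tables in this particular letter, neither of which is guaranteed. Other tools may do a better job. The helper names (`clean_cell`, `table_to_records`, `extract_tables`) are mine, and the code also assumes that the first row of each table is its header.

```python
from typing import Optional

Row = list[Optional[str]]


def clean_cell(cell: Optional[str]) -> str:
    """Collapse newlines and repeated spaces; turn empty cells into ''."""
    return " ".join(cell.split()) if cell else ""


def table_to_records(table: list[Row]) -> list[dict[str, str]]:
    """Treat the first row as the header and return one dict per data row."""
    if not table:
        return []
    header, *body = [[clean_cell(cell) for cell in row] for row in table]
    return [dict(zip(header, row)) for row in body]


def extract_tables(pdf_path: str) -> list[list[Row]]:
    """Collect every table that pdfplumber finds, page by page."""
    import pdfplumber  # third-party; imported here so the helpers above work without it

    tables: list[list[Row]] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables
```

Each list of records can be handed to `pd.DataFrame` for the cleaning and indexing work. Expect the raw cells to need attention first, since numbers will arrive as strings and may contain commas.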

I’ll be back tomorrow with detailed solutions to all of the questions, along with the Jupyter notebook I created in solving them.