BW #71: Holidays (solution)

BW #71: Holidays (solution)

I don't normally think of June as a month full of holidays, but then I realized just how many were taking place this month — many of them foreign (quite literally) to me, but obviously important to the people celebrating them.

Yesterday, I mentioned Dragon Boat, Shavuot, Eid al-Adha, and Juneteenth (the most recent official American holiday). But how could I forget that tomorrow (June 21st) is the (northern hemisphere's) Summer Solstice? And of course, there's Midsummer, celebrated in a number of countries. I'm sure that I'm missing no small number of other religious and national festivals also taking place this month.

Data and six questions

In honor of all of these (and other) holidays, I thought that it would be interesting to analyze a database of holidays. I couldn't find an existing database of holidays, and thus decided that we would build our own data frame using the holidays package on PyPI, which lets you retrieve the holidays for any country in any year range. This week, we'll not only get some practice querying a data frame; we'll also get practice creating one based on other data.

Below are my solutions to the six tasks and questions. As usual, a link to the Jupyter notebook that I used is at the bottom of this message.

Create a data frame with four columns (country name, alpha2, date, and holiday name) for all countries, from the years 2010 through 2024. Use the pycountry module (from PyPI) to go through all of the countries in the world, and the holidays module (also from PyPI) to grab all of the holidays from there. The dates should be in a datetime column.

As usual, the first thing that we need to do is grab Pandas – not just the overall library, but also the Series and DataFrame types, which I often like to have around as aliases:

import pandas as pd
from pandas import Series, DataFrame

I already said that you'll want to use the holidays package in order to retrieve holidays. The package allows us to retrieve holidays from a specific country during a specified range of years. For example, to get US holidays in 2024, we pass 'US' (the two-letter code for the United States) and the keyword argument years=2024:

holidays.country_holidays('US', years=2024)

That returns a dictionary in which the keys are datetime objects, and the values are strings naming the holiday:

{datetime.date(2024, 1, 1): "New Year's Day", datetime.date(2024, 5, 27): 'Memorial Day', datetime.date(2024, 6, 19): 'Juneteenth National Independence Day', datetime.date(2024, 7, 4): 'Independence Day', datetime.date(2024, 9, 2): 'Labor Day', datetime.date(2024, 11, 11): 'Veterans Day', datetime.date(2024, 11, 28): 'Thanksgiving', datetime.date(2024, 12, 25): 'Christmas Day', datetime.date(2024, 1, 15): 'Martin Luther King Jr. Day', datetime.date(2024, 2, 19): "Washington's Birthday", datetime.date(2024, 10, 14): 'Columbus Day'}

If I want all of the US holidays from 2010 through 2024, then instead of passing the integers 2024 to the years keyword argument, I can pass a range object. (Don't forget that range, like so many things in Python, goes "up to and not including" the second argument.) In other words, I could say

holidays.country_holidays('US', years=range(2010, 2025))

That's nice, but it raises at least two questions: First, how can I get the holidays for all countries? And second, how can I turn that information into a Pandas data frame?

For the first part, we'll get some help from the pycountry project on PyPI, which lists all country names, along with their two- and three-character abbreviations. We can use this information to loop through all of the countries, grabbing their holidays.

But then what? Even with a dict for each country, how do we turn that into a data frame? There are a few ways to do this, but I decided to turn each holiday into a list containing four elements: The country name, the two-letter abbreviation, the date of the holiday, and the name of the holiday. I can then use list.append to add that new, one-holiday list to a larger list containing all of the holidays.

Once I've created my large list-of-lists, I can then pass that to DataFrame and create a Pandas data frame, with one row for each holiday:

import pycountry
import holidays

all_holidays = []

for one_country in pycountry.countries:
    try:
        for (holiday_date, 
             holiday_name) in holidays.country_holidays(
                                        one_country.alpha_2,
                                        years=range(2010, 2025)
                                                ).items():
            all_holidays.append([one_country.name,
                                 one_country.alpha_2,
                                 holiday_date,
                                 holiday_name])
    except NotImplementedError as e:
        pass
        # print(f'\t{one_country.name} has no holidays in the database')

In the above code, we iterate over each element of pycountry.countries. From the resulting object, we then retrieve one_country.alpha_2, with the two-letter country abbreviation. We then add a new four-element list to the existing all_holidays list.

Notice that I wrap all of this in a try block, because we might get a NotImplementedError exception, indicating that the holidays package doesn't have any holidays for that country. I originally used a print call to indicate that we didn't have information about the country, but I quickly got annoyed and bored from such warnings, and replaced it with a call to pass, which is how we tell Python that yes, we need something in the indented block, but no, we don't really want to do anything.

Once this loop has finished running, I have a list of lists. I can then pass that to DataFrame to create a new data frame. Pandas cannot know what the column names should be, so I'll pass the columns keyword argument, along with a list of strings:

df = (DataFrame(all_holidays, 
                columns='country alpha_2 date holiday'.split())
     )

This is great, except for one thing: All of the columns now have a dtype of object, meaning that they're treated as strings. To treat the dates and times as actual datetime objects, we'll call pd.to_datetime on the date column.

Rather than calling pd.to_datetime on the column and re-assigning it outside of our invocation of DataFrame, we can just use assign to set a column, along with a lambda that is invoked on the date column. Assigning to an existing column replaces it with the new one. We can thus finalize creating the data frame with this code:

df = (DataFrame(all_holidays, 
                columns='country alpha_2 date holiday'.split())
    .assign(date=lambda df_: pd.to_datetime(df_['date']))
     )

The resulting data frame has four columns (of course), and 30,958 rows.

Which countries have holidays in June 2024? Which of this month's holidays, if any, are celebrated in more than one country? Do we see any issues that might result in a mis-count?

My interest in holidays started by noticing that this month has a large number of them. Which countries have at least one holiday in June 2024?

Let's start by finding all of the rows in our data frame with a holiday in June, 2024. One easy way to do this is by setting the date column to be our data frame's index with set_index. With that in place, we can use loc to retrieve only those rows that match our year and month by leaving out the date. That'll provide us with a wildcard for the date:

(
    df
    .set_index('date')
    .loc['2024-06']
)

Now that we've removed rows from other months and years, let's count the number of times each country appears. We can do this by retrieving only the country column and then running drop_duplicates on the result:

(
    df
    .set_index('date')
    .loc['2024-06']
    ['country']
    .drop_duplicates()
)

We find that 88 different countries have at least one holiday this month.

Then I asked a different question: What holidays are celebrated in more than one country? I'll again selected only rows with holidays in June 2024, then used groupby along with the count method:

(
    df
    .set_index('date')
    .loc['2024-06']
    .groupby('holiday')['country'].count()
)

We now have a series whose index contains the names of countries with holidays in June, 2024. I kept only those in which the county was greater than 1, using lambda and loc to perform the filtering:

(
    df
    .set_index('date')
    .loc['2024-06']
    .groupby('holiday')['country'].count()
    .loc[lambda s_: s_ > 1]
)

Finally, I sorted the results from highest to lowest:

(
    df
    .set_index('date')
    .loc['2024-06']
    .groupby('holiday')['country'].count()
    .loc[lambda s_: s_ > 1]
    .sort_values(ascending=False)    
)

Here are the results that I got:

holiday
Eid al-Adha (estimated)                 30
Eid al-Adha Holiday (estimated)         18
Arafat Day (estimated)                   7
Juneteenth National Independence Day     7
Eid al-Adha                              5
Sunday                                   5
Midsummer Day                            4
Independence Day                         4
Midsummer Eve                            3
Eid-ul-Adha (estimated)                  3
Eid al-Adha (observed, estimated)        3
Dragon Boat Festival                     2
Eid al-Adha (estimated) (observed)       2
Father's Day                             2
King's Birthday                          2
National Day                             2
Pentecost                                2
Saint Peter and Saint Paul               2
Whit Monday                              2
Name: country, dtype: int64

I asked you whether we see anything that might have led to a miscount. And to my eyes, one holiday sticks out, namely the Muslim festival of Eid al-Adha, which often has the terms "estimated" and "observed" in parentheses after the name. These are all the same holiday, but Pandas sees them as different, and thus counts them separately.

In the next exercise, we'll make a stab at fixing this problem.