BW #106: Flu season (solution)
Get better at: String manipulation, working with dates, plotting, grouping, and pivot tables

We'll be holding our monthly Pandas office hours on Sunday, February 23rd! If you're a paid subscriber, you can join me then and ask any Pandas-related questions you have. Details will be sent on Friday. I hope that you'll join me there!
This week, we looked at data from the World Health Organization (WHO, https://who.int), whose FluID data set tracks flu and flu-related infections (including RSV) around the world. The topic was inspired by the fact that it's winter in the northern hemisphere, aka flu season, and a lot of people I know have been sick themselves or dealing with sick relatives.
Add to that the outbreak of avian flu in the US (discussed at length in The Daily, from the New York Times, at https://www.nytimes.com/2025/02/13/podcasts/the-daily/bird-flu-eggs.html?unlocked_article_code=1.yE4.cGJh.rHxFPrPOzeNg&smid=url-share), the fact that the Trump administration has both withdrawn from the WHO and reduced monitoring of infections at US borders (https://www.nytimes.com/2025/02/19/health/trump-border-health-inspections.html?unlocked_article_code=1.yU4.imei.Dss9yIyVlTb7&smid=url-share), and the fact that we're now just five years past the outbreak of COVID-19, and this topic seems more relevant than ever.
Data and six questions
This week's data comes from the FluID program at the WHO.
You can download the data from the link marked "Download the fluID dataset (CSV)":
https://xmart-api-public.who.int/FLUMART/VIW_FID?$format=csv
The data dictionary is available, also in CSV format, from a link on the same page:
https://xmart-api-public.who.int/FLUMART/VIW_FLU_METADATA?$format=csv
The learning goals for this week include:
- Dates and times (https://www.bambooweekly.com/tag/datetime/)
- String manipulations (https://www.bambooweekly.com/tag/strings/)
- Plotting (https://www.bambooweekly.com/tag/plotting/)
- Grouping (https://www.bambooweekly.com/tag/grouping/)
- Pivot tables (https://www.bambooweekly.com/tag/pivot-table/)
If you're a paid subscriber, then you'll be able to download the data directly from a link at the bottom of this post, download my Jupyter notebook, and click on a single link that loads the notebook and the data into Google Colab, so you can get to work experimenting right away.
Here are my six questions:
Create the data frame. Make sure that the ISO_WEEKSTARTDATE column is a datetime.
For starters, let's load up Pandas:
import pandas as pd
Next, we can use read_csv to load the file into a data frame:
filename = 'data/VIW_FID.zip'
df = pd.read_csv(filename)
Notice that I compressed the file using zip. read_csv is perfectly able to read a zipfile, opening it and reading the CSV file it contains without any trouble. (Note that the zipfile can only contain a single CSV file for this to work.)
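For reference, here's a minimal sketch (not from my notebook, and with paths that may differ from yours) of how such an archive, with a single CSV inside, can be created using Python's standard zipfile module:
import zipfile

# Create data/VIW_FID.zip containing just the one CSV file
with zipfile.ZipFile('data/VIW_FID.zip', 'w',
                     compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write('data/VIW_FID.csv', arcname='VIW_FID.csv')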
The good news is that this will basically work. But we'll also get a warning:
/var/folders/rr/0mnyyv811fs5vyp22gf4fxk00000gn/T/ipykernel_30079/2801586168.py:5: DtypeWarning: Columns (44,45) have mixed types. Specify dtype option on import or set low_memory=False.
This warning comes from the fact that Pandas sees that the file is large (165 MB), and thus reads it in chunks. Remember, though, that CSV files are textual, and that Pandas thus needs to figure out what dtype to assign to each column. It looks at the data, and basically assigns a dtype of int64 (if a column contains only integers), float64 (if it contains integers and decimal points), or object (i.e., strings) if nothing else will fit.
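Here's a tiny illustration of that inference, a sketch using made-up data rather than the WHO file:
from io import StringIO
import pandas as pd

# Three columns: only integers, integers plus a decimal point, and text
pd.read_csv(StringIO('a,b,c\n1,1.5,x\n2,2.5,y\n')).dtypes

# a      int64
# b    float64
# c     object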
But what happens if it's reading in chunks and its heuristics don't match across chunks for a column? That is, what if the first chunk makes it seem like a column could be int64, but the second chunk makes it seem like object, because it contains text? Then we have a mismatch – precisely the situation that the warning tells us about.
As the warning indicates, we can solve this in at least two ways:
- Pass the dtype keyword argument, explicitly setting a dtype for each column, and thus avoiding the need for Pandas to figure this out, or
- Pass low_memory=False, such that Pandas will read the entire file into memory at once, and make its determination based on the entirety of each column, rather than on separate chunks.
So we could actually load things up in the following way, and avoid that warning:
filename = 'data/VIW_FID.csv'
df = pd.read_csv(filename, low_memory=False)
And indeed, I found that this works just fine.
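If we prefer the dtype route, we first need to know which columns the warning is complaining about, since it names them only by position. Here's a sketch; the column name I use below is just an example, so substitute whatever df.columns[[44, 45]] reveals in your copy of the data:
# After a first load, look up the names of the columns flagged by position
print(df.columns[[44, 45]])

# Then force those columns to a string dtype on the next load
# (COMMENTS is shown purely as an example column name)
df = pd.read_csv(filename, dtype={'COMMENTS': 'string'})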
However, I also asked you to make sure that the ISO_WEEKSTARTDATE column had a datetime dtype. We can do that by passing the parse_dates keyword argument, handing it a list of columns that should be treated as dates and times. (There is another column in this data set that we could use for such purposes, but since we aren't using it, I ignored it.) So I could say:
filename = 'data/VIW_FID.csv'
df = pd.read_csv(filename, low_memory=False,
                 parse_dates=['ISO_WEEKSTARTDATE'])
There is another option, though, one that would normally solve both of the issues we've seen so far: if we use PyArrow to load the CSV file, it'll both handle the memory issues on its own and identify datetime columns pretty reliably, assigning that dtype as needed. I'm a big fan of using PyArrow's CSV loader, although it's sometimes stricter than the native Pandas loader, and thus chokes on data that the native loader accepts.
We can use PyArrow to load the CSV with the following code:
df = pd.read_csv(filename, engine='pyarrow')
And indeed, in my tests, I found that using the PyArrow loader was far faster, taking 290 ms instead of 4.3 seconds. That's a big gain!
But I also found that in this case, for reasons that aren't clear to me, PyArrow didn't notice the datetime column, and kept it as a string (object) dtype. That means we still need to specify it:
df = pd.read_csv(filename, engine='pyarrow',
                 parse_dates=['ISO_WEEKSTARTDATE'])
df.dtypes
The result is a data frame with 1,123,358 rows and 52 columns. That's quite a large data frame, and the fact is that we'll only be using a handful of columns. I didn't ask you to do this, but I can make things more manageable by passing the usecols keyword argument to read_csv, making things smaller:
df = pd.read_csv(filename, engine='pyarrow',
                 usecols=['COUNTRY_AREA_TERRITORY',
                          'ISO_WEEKSTARTDATE',
                          'AH1N12009', 'AH3', 'AH5',
                          'AH7', 'AOTHERTYPES', 'RSV',
                          'INPATIENTS', 'MORTALITY_ALL', 'COMMENTS'],
                 parse_dates=['ISO_WEEKSTARTDATE'])
Specifying these columns reduced the memory consumption from about 1 GB to about 166 MB.
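If you want to check those numbers yourself, one quick way (a sketch) is to ask Pandas for the data frame's memory usage, passing deep=True so that the contents of string (object) columns are counted too:
# Total memory used by the data frame, in MB
df.memory_usage(deep=True).sum() / 1_000_000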
In each quarter of our data set, starting in 2020, which country had the greatest number of deaths from flu and flu-related viruses? Ignore weeks in which the maximum is 0.
In order to answer this question, we'll first need to filter the rows, such that we only keep those from 2020 and beyond. This could be done with a regular column, but I find it easier to do when the data frame's index contains datetime values. I'll thus start by invoking set_index on the data frame.
I can then, in theory, use loc to retrieve all dates from 2020 and onward with a slice. If we pass a string-based slice to loc when the index contains datetime values, Pandas knows what to do, ignoring the smaller time measurements. So we can say .loc['2020':], right?
No, not yet: a Pandas index can contain repeated values. In such a case, Pandas doesn't know where to start or end when we ask it for a slice, and raises an exception. The solution is to invoke sort_index, returning a new data frame identical to the input frame, but with rows sorted by the index. With that in place, we can then invoke loc with the slice, filtering out the rows. And when we're done with that? We invoke reset_index to turn the index back into a normal column:
(
    df
    .set_index('ISO_WEEKSTARTDATE')
    .sort_index()
    .loc['2020':]
    .reset_index()
)
At this point, we have the same data frame as we started with (df), but with many fewer rows.
But wait a second: all of this setting, sorting, filtering, and resetting of the index must consume some time and memory. According to %%timeit in my Jupyter notebook, it took an average of about 92 ms. That's not very long, but maybe there's a better way to do it?
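For reference, that timing came from a Jupyter cell that looked roughly like this (a sketch):
%%timeit
(
    df
    .set_index('ISO_WEEKSTARTDATE')
    .sort_index()
    .loc['2020':]
    .reset_index()
)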
For example, I could use loc and lambda to do the filtering inside of the data frame, without modifying the index, retrieving dt.year from each datetime value:
(
    df
    .loc[lambda df_: df_['ISO_WEEKSTARTDATE'].dt.year >= 2020]
)
This second form of the query took about 62 ms, or roughly one-third less time. So we'll stick with that.
We want to find out, for each quarter, the total number of deaths in each country. That sounds like something we could accomplish with a pivot table:
- Columns are the countries (COUNTRY_AREA_TERRITORY)
- Rows (index) are the quarters for which we have data
- Values are MORTALITY_ALL
- We'll use the sum aggregation function
This all sounds great, except for what I wrote about the rows. How are we supposed to get quarterly results, when we have them on a weekly basis? If our index contained datetime values, then we could use resample, but that isn't in the cards here.
The answer is to use pd.Grouper, an object that lets us perform sophisticated grouping operations in Pandas. We can tell pd.Grouper what column to use, and then specify the same sort of frequency code as we do with resample. For example, to aggregate results on a quarterly basis, we say '1QE', meaning "at the end of every 1 quarter."
The call to pivot_table thus looks like this:
(
    df
    .loc[lambda df_: df_['ISO_WEEKSTARTDATE'].dt.year >= 2020]
    .pivot_table(index=pd.Grouper(key='ISO_WEEKSTARTDATE', freq='1QE'),
                 columns='COUNTRY_AREA_TERRITORY',
                 values='MORTALITY_ALL',
                 aggfunc='sum')
)
The result has 21 rows (one for each quarter since the first of 2020) and 164 columns (one for each country).
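As an aside, and not something I asked for, the same quarterly table could also be built with groupby and unstack. Here's a sketch that should produce an equivalent result:
(
    df
    .loc[lambda df_: df_['ISO_WEEKSTARTDATE'].dt.year >= 2020]
    .groupby([pd.Grouper(key='ISO_WEEKSTARTDATE', freq='1QE'),
              'COUNTRY_AREA_TERRITORY'])['MORTALITY_ALL']
    .sum()
    .unstack('COUNTRY_AREA_TERRITORY')
)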
Now that we have this data frame, we can start to answer our questions: For each quarter, which country had the greatest number of fatalities due to flu and similar viruses?
We can invoke max on our pivot table to get the maximum value in each column (i.e., for each country). And we can invoke idxmax on our pivot table to get the index of the maximum value (i.e., the quarter in which it occurred). If we want to get the country name with the maximum value for each quarter, then we can pass axis='columns', which will do what we need.
But we're still a bit stuck, because we want to run two aggregation methods, and can only really run one.
This is where agg comes in, allowing us to pass a list of strings, each naming an aggregation function we want to run:
(
    df
    .loc[lambda df_: df_['ISO_WEEKSTARTDATE'].dt.year >= 2020]
    .pivot_table(index=pd.Grouper(key='ISO_WEEKSTARTDATE', freq='1QE'),
                 columns='COUNTRY_AREA_TERRITORY',
                 values='MORTALITY_ALL',
                 aggfunc='sum')
    .agg(['max', 'idxmax'], axis='columns')
)
This gives us exactly what we want, except for the final value, which shows Afghanistan with 0 deaths. How can that be the maximum? Because it would seem that all countries reporting in the first quarter of 2025 have (so far) reported zero deaths, which means that the maximum could be any of them. That's why I asked you to remove any rows in which the maximum is 0. We can do that with a combination of loc and lambda:
(
    df
    .loc[lambda df_: df_['ISO_WEEKSTARTDATE'].dt.year >= 2020]
    .pivot_table(index=pd.Grouper(key='ISO_WEEKSTARTDATE', freq='1QE'),
                 columns='COUNTRY_AREA_TERRITORY',
                 values='MORTALITY_ALL',
                 aggfunc='sum')
    .agg(['max', 'idxmax'], axis='columns')
    .loc[lambda df_: df_['max'] > 0]
)
The final result:
max idxmax
ISO_WEEKSTARTDATE
2020-03-31 783968.0 United States of America
2020-06-30 841209.0 United States of America
2020-09-30 780022.0 United States of America
2020-12-31 941385.0 United States of America
2021-03-31 923401.0 United States of America
2021-06-30 747531.0 United States of America
2021-09-30 803331.0 United States of America
2021-12-31 897629.0 United States of America
2022-03-31 930896.0 United States of America
2022-06-30 745687.0 United States of America
2022-09-30 680880.0 United States of America
2022-12-31 823872.0 United States of America
2023-03-31 809972.0 United States of America
2023-06-30 736137.0 United States of America
2023-09-30 704903.0 United States of America
2023-12-31 780194.0 United States of America
2024-03-31 807582.0 United States of America
2024-06-30 738413.0 United States of America
2024-09-30 777357.0 United States of America
2024-12-31 674790.0 United States of America
Hmm, it looks like the US consistently has more deaths from flu, RSV, and related viruses than any other country.
But of course, the US is also a large country. And it generally has accurate reporting of data. And it supplies data to the WHO. (These last two statements are a bit in doubt right now, but they have been true for several decades.) So the combination puts the US at the top of the reported data.