BW #103: CDC data

Get better at: Multi-indexes, working with dates and times, pivot tables, plotting, and window functions

On January 20th, Donald Trump returned to the White House for a second presidential term. Among the blizzard of press releases, executive orders, threats, and budget freezes, his administration has also told health agencies to stop all communications. This means no updates to doctors or hospitals about the current state of health in the United States, leaving the US medical community less knowledgeable and more vulnerable. The rule also blocks scientists and researchers from attending and presenting at conferences, slowing the dissemination and progress of science in the United States. The Washington Post reported on this last week (https://www.washingtonpost.com/health/2025/01/21/trump-hhs-cdc-fda-communication-pause/), and things haven't gotten much clearer since then.

The effects are wider than you might have thought: Seth Michael Larson (https://sethmlarson.dev/), the security developer-in-residence at the Python Software Foundation, wrote earlier this week that the status of his NSF grant proposal for funding to further improve Python's security is on hold, pending a freeze at the National Science Foundation (https://www.npr.org/sections/shots-health-news/2025/01/27/nx-s1-5276342/nsf-freezes-grant-review-trump-executive-orders-dei-science).

This week, I thought it would be appropriate to use data from one of the affected government agencies, to see the depth and breadth of what US government agencies provide. I opened the portal for publicly available data at the Centers for Disease Control and Prevention (CDC, at https://data.cdc.gov/), and clicked on the first link, leading to the National Center for Health Statistics (https://data.cdc.gov/browse?category=NCHS&sortBy=last_modified). That showed me the data sets that had been published most recently; I chose the most recent one, whose title is "Provisional Percent of Deaths for COVID-19, Influenza, and RSV by Select Characteristics." The data set was structured perfectly for a number of techniques I wanted to explore, so that'll be our data set for this week.

New features

I'm always looking for new ways to make Bamboo Weekly a helpful and effective learning tool. Starting this week, I'm happy to announce a few additions:

  1. In addition to listing the learning goals for each issue, I'm marking them with tags. I'm slowly but surely going through the back issues, tagging the data-analysis topics that I cover in each issue. Alongside each learning goal, I'll have a link to the page listing all of the back issues of BW whose problems use that tag. This means that you'll be able to find, without too much trouble, other problems that exercise the same topics.
  2. The world is always changing, and data sets in back issues can be hard to find. Or they're reformatted. Or they just disappear. As of this week, I'll be uploading the data set to a Gist (i.e., code snippet) on GitHub that you can easily download. This link will be visible to paid subscribers, after the questions.
  3. I know that downloading the data file and setting up a Jupyter notebook can be daunting. For that reason, I'm also starting something new this week, as of tomorrow's issue: Paid subscribers will still be able to download my Jupyter notebook and use it on their own system. But I'll also provide a clickable link that loads the Jupyter notebook into Google Colab, allowing you to start experimenting with my notebook inside of your browser, without installing anything new.

Some of these additions were the result of reader suggestions. If you're a subscriber, and can think of a way for Bamboo Weekly to serve you better, please let me know! I'll always do what I can to make it a more useful resource.

Data and six questions

This week's data set is a CSV file describing the number of deaths from each of three diseases (COVID-19, influenza, and RSV) over the last few years, with week-by-week numbers broken down by ethnic group, age, and sex. You can download the file from the NCHS portal at:

https://data.cdc.gov/NCHS/Provisional-Percent-of-Deaths-for-COVID-19-Influen/53g5-jf7x/about_data

Click on the "export" button at the top of the screen, and ask to receive the download in CSV format.

I have six tasks and questions for you to work through with this data set.

Our learning goals this week include multi-indexes, working with dates and times, pivot tables, plotting, and window functions.

I'll be back tomorrow with my complete solutions.

Meanwhile, here are my six tasks:

  • Load the CDC's data from a CSV file into a data frame. Keep only the columns 'demographic_type', 'weekending_date', 'pathogen', 'demographic_values', 'deaths', and 'total_deaths'. Create a three-part multi-index from 'demographic_type', 'weekending_date', and 'pathogen'. Remove rows in which the value of 'pathogen' is "Combined". The 'weekending_date' column should be treated as a datetime value.
  • Create a bar plot showing, for each quarter, the number of people who died in the US from influenza. Each bar should be subdivided by age group. Does any part of this data seem unusual? What might explain it?
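
If you want a starting point for the first task, here is a minimal sketch of one way to approach it. It is not my full solution. To keep it self-contained, it reads a tiny, made-up sample from a string rather than the real download; the rows, the 'data_as_of' column, the date format, and the demographic labels are all assumptions for illustration, and the real file may differ. With the actual CSV, you would pass its filename to read_csv instead.

```python
import io
import pandas as pd

# Made-up rows standing in for the CDC export (hypothetical values;
# the real file has many more rows and additional columns).
sample_csv = io.StringIO("""\
data_as_of,demographic_type,weekending_date,pathogen,demographic_values,deaths,total_deaths
2025-01-30,Age Group,2024-01-06,Influenza,0-17 years,3,100
2025-01-30,Age Group,2024-01-06,COVID-19,0-17 years,2,100
2025-01-30,Age Group,2024-01-06,Combined,0-17 years,5,100
2025-01-30,Sex,2024-01-13,Influenza,Female,4,200
""")

# The six columns named in the task.
wanted = ['demographic_type', 'weekending_date', 'pathogen',
          'demographic_values', 'deaths', 'total_deaths']

df = (
    pd.read_csv(sample_csv,
                usecols=wanted,                   # drop every other column
                parse_dates=['weekending_date'])  # treat as datetime
    .loc[lambda d: d['pathogen'] != 'Combined']   # remove "Combined" rows
    .set_index(['demographic_type', 'weekending_date', 'pathogen'])
    .sort_index()                                 # sorted index makes later lookups easier
)

print(df)
```

Filtering out "Combined" before calling set_index keeps the comparison simple, since 'pathogen' is still a regular column at that point. You could just as well set the index first and then filter on the index level.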