BW #73: Avocado hand (solution)

Avocados seem so innocent. (And, might I add, delicious.) But the massive growth in avocados' popularity and consumption over the last two decades has come at a price – a large number of people who have injured themselves when cutting them open.

This injury, known as "avocado hand," was recently mentioned in a Washington Post story (https://www.washingtonpost.com/wellness/2024/06/26/avocado-hand-injuries-knife/). Researchers, they reported, found that between 1998 and 2017, there were more than 50,000 avocado-related injuries. Most of these affected people's fingers and hands.

Where did the researchers (whose work can be read at https://pubmed.ncbi.nlm.nih.gov/31303536/) get their injury data? From the US Consumer Product Safety Commission (CPSC, at https://cpsc.gov), and specifically the National Electronic Injury Surveillance System (NEISS, https://www.cpsc.gov/Research--Statistics/NEISS-Injury-Data). They provide an annual report on injuries in the United States, giving us a chance to learn more about avocado-related injuries.

Data and seven questions

This week's data came from the database at NEISS:

https://www.cpsc.gov/cgibin/NEISSQuery/home.aspx

NEISS offers data in a variety of formats. We downloaded the annual archived reports from 2020-2023. I started by working with the Excel files, but found the tab-delimited files to be far easier to work with. And so, aside from one part of one question, I'll assume that you'll use the tab-delimited files rather than the Excel ones.

I gave you seven tasks and questions this week. As always, a link to my Jupyter notebook is at the end of the post.

Load the data from 2020 - 2023 into a single data frame. Make sure that the Treatment_Date column is a datetime.
Remove any rows in which the date is invalid or NA.

Let's start by loading Pandas:

import pandas as pd

I asked you to create a data frame from four years of data. If you go to the NEISS site and download one of the tab-delimited files, you'll find that each file is at a URL that looks like this:

https://www.cpsc.gov/cgibin/NEISSQuery/Data/Archived%20Data/{one_year}/neiss{one_year}.tsv

Notice that I've got one_year in there as a variable. And that's not an accident; I can use a for loop to iterate over a range of years, downloading each file, one at a time. Better yet, I can use a list comprehension to run read_csv on the URL to each file, outputting a list of data frames — one for each of the years we want:

all_dfs = [pd.read_csv(f'https://www.cpsc.gov/cgibin/NEISSQuery/Data/Archived%20Data/{one_year}/neiss{one_year}.tsv', 
                       sep='\t')
           for one_year in range(2020, 2024)]

In other words: We'll iterate over each year from 2020 through 2023. (Remember that range, like most parts of Python, goes up to and not including its second argument.) For each of those years, we retrieve the file at the specified URL, passing it to read_csv. We specify that the columns are separated with tabs by passing sep='\t'.

The result is a list of data frames, which we assign to all_dfs.

However, if you use the above code, you'll likely get a warning. That's because we haven't specified what dtype should be used for each column. Pandas therefore reads the file in chunks, making the best guess it can for each one. But the guess that it makes for some chunks might not match the guess it makes for others. That happens with several of these files, resulting in unpleasant warnings.

We can avoid these warnings by specifying the dtypes explicitly with the dtype keyword argument. Or, if your computer has enough memory, you can pass the low_memory=False keyword argument. In that case, Pandas loads the entire file into memory before guessing, and thus bases its guess on all of the data.
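
For what it's worth, here's a minimal sketch of the dtype approach for a single year's file. The specific dtypes are assumptions for illustration; in practice, you would map each column that triggers a warning to whatever type you want it to have:

# Sketch only: force a couple of columns to particular dtypes, so that
# Pandas doesn't have to guess. The choice of 'string' here is an
# assumption for illustration; pick whatever types suit your analysis.
import pandas as pd

one_year = 2020
url = (f'https://www.cpsc.gov/cgibin/NEISSQuery/Data/'
       f'Archived%20Data/{one_year}/neiss{one_year}.tsv')

one_df = pd.read_csv(url,
                     sep='\t',
                     dtype={'CPSC_Case_Number': 'string',
                            'Treatment_Date': 'string'})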

In addition, we want the Treatment_Date column to be treated as a datetime series. It would be nice to pass parse_dates=['Treatment_Date'] as a keyword argument to read_csv, but there are some illegal dates in the input file. So we'll leave that keyword argument out, and convert the column manually afterward:

all_dfs = [pd.read_csv(f'https://www.cpsc.gov/cgibin/NEISSQuery/Data/Archived%20Data/{one_year}/neiss{one_year}.tsv', 
                       sep='\t',
                       low_memory=False)
           for one_year in range(2020, 2024)]

With all_dfs defined, we can invoke pd.concat on that list, returning a single data frame:

df = pd.concat(all_dfs)
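
One thing worth knowing about pd.concat (it isn't needed for the rest of the solution, just a sketch): it keeps each data frame's original index, so row labels will repeat across the four years. If that bothers you, you can ask for a fresh integer index:

# Assumption: we don't care about the per-year row labels, so we ask concat
# to build a new, non-repeating integer index for the combined data frame.
df = pd.concat(all_dfs, ignore_index=True)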

Now we can turn the Treatment_Date column into a datetime dtype:

df['Treatment_Date'] = pd.to_datetime(df['Treatment_Date'], errors='coerce')

The errors='coerce' keyword argument means that if an input string cannot be parsed as a datetime, the resulting value is set to NaT, the datetime equivalent of NaN.

The resulting data frame has 1,311,422 rows and 26 columns.
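
The task also asked us to remove any rows in which the date is invalid or NA. Since errors='coerce' turned every unparseable date into NaT, one way to do that (shown here as a sketch) is to drop the rows whose Treatment_Date is missing:

# Drop rows whose Treatment_Date is NaT, whether it was missing in the
# original file or couldn't be parsed as a date.
df = df.dropna(subset=['Treatment_Date'])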

In which month do we see the most accidents? The fewest?

In which month of the year does NEISS see the most accidents? To find out, we can use groupby, counting how often a given month shows up in our Treatment_Date column.

To extract the month from Treatment_Date, we'll use the dt accessor. This is our way of getting date-related components, such as the month, out of a datetime column, and we can use its result as the grouping key in groupby. We can thus group by the month and invoke count on any column we want; I chose CPSC_Case_Number, but the choice isn't really important, so long as the column on which we're counting has few or no NaN values. Putting that together, I can say:

(
    df
    .groupby(df['Treatment_Date'].dt.month)['CPSC_Case_Number'].count()
)

To get the months with the largest and smallest number of accidents, we can use sort_values on the series we got back from groupby, and then use iloc to retrieve the first and last values:

(
    df
    .groupby(df['Treatment_Date'].dt.month)['CPSC_Case_Number'].count()
    .sort_values(ascending=False)
    .iloc[[0, -1]]
)

We get the following result:

Treatment_Date
5.0     119410
12.0     92301
Name: CPSC_Case_Number, dtype: int64

Alternatively, we can use idxmin and idxmax to get the indexes of the lowest and highest values:

(
    df
    .groupby(df['Treatment_Date'].dt.month)['CPSC_Case_Number'].count()
    .agg(['idxmin', 'idxmax'])
)

We get the months back, but not the values associated with them:

idxmin    12.0
idxmax     5.0
Name: CPSC_Case_Number, dtype: float64
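
If we want the months and their counts together, one option (a sketch, not something the question required) is to hand agg a longer list of aggregation names, asking for both the index and the value at each extreme:

(
    df
    .groupby(df['Treatment_Date'].dt.month)['CPSC_Case_Number'].count()
    .agg(['idxmax', 'max', 'idxmin', 'min'])
)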

Regardless, we see that the greatest number of reported accidents happens in May, whereas the smallest number happens in December.