BW #97: Drones (solution)

Get better at: Cleaning data, plotting, interpolation, pivot tables, and multi-indexes

Reuven M. Lerner

19 Dec 2024 — 13 min read

A decidedly non-military drone

[Administrative note: Paid subscribers can join Pandas office hours this coming Sunday. Come and ask any questions you have about Pandas and data analysis! I'll send the Zoom info tomorrow. I look forward to seeing you there.]

This week, we're looking at data about the sale and purchase of military drones. I got interested in the story after learning at New Jersey residents have been spooked by drones they've recently seen in the sky. I'm not really sure if the number of drones has increased or if they're just noticing more of them, but these issues have been reported on in a wide variety of places, from the New York Times (https://www.nytimes.com/2024/12/14/nyregion/drone-sightings-nj.html?unlocked_article_code=1.iU4.8_jE.fIR4uwfOh6l3&smid=url-share), to the Daily Show (https://youtu.be/I4dbALnlezA?si=Jgp-3YBVn5LXFIln&t=244).

There doesn't seem to be any serious evidence that the drones in New Jersey are problematic or dangerous. But countries around the world have been buying drones for military operations over the last number of years. Some drones are used for observation and intelligence gathering, but others can fire guns and missiles, or even explode on impact with a target.

This week, we're looking at the "Drone Proliferation Dataset" (https://www.cnas.org/publications/reports/drone-proliferation-dataset) produced by CNAS (the Center for a New American Security, at https://www.cnas.org). Their data tracks who has been selling drones, who has been buying them, and what those drones' capabilities are. But of course, military sales and purchases aren't typically announced publicly, which means that the numbers they provide almost certainly understate reality.

Data and seven questions

As I wrote above, we're going to look at the Drone Proliferation Dataset this week. You can get it by going to https://www.cnas.org/publications/reports/drone-proliferation-dataset and clicking on the "download" button. Or you can just click on this link:

https://s3.us-east-1.amazonaws.com/files.cnas.org/documents/CNAS-Drone-Proliferation-Dataset-Sept-2024.zip

This zipfile contains both the data (in an Excel file) and a data dictionary that is quite readable and helpful.

I asked seven questions this week, all based on the data set. Below are my solutions and explanations; a link to download the Jupyter notebook I used is at the bottom of this message.

Import the Excel data into a Pandas data frame. Clean up any issues you see with the column names.

Before doing anything else, let's load Pandas:

import pandas as pd

Next, we'll load the Excel file with read_excel:

filename = 'Proliferation_Master Sheet_Final_Copyedit (1).xlsx'

df = (pd
      .read_excel(filename)
     )

The good news is that this works. However, there are at least three problems with the column names:

Some of the columns have leading or trailing whitespace
At least one column has multiple spaces between some words
One of the column names ("Platform Model") is misspelled ("Paltform Model")

How can we fix these?

The columns on a data frame (df.columns) aren't just a list of strings. They're a Pandas "Index" object, and share many properties and methods with a Pandas series. Just as we have access to many string methods via the str accessor for a series, we can similarly use it for df.columns. We could use str.strip as seen here:

df.columns = df.columns.str.strip()

This will remove the leading and trailing whitespace, but it won't help us with the repeated interior spaces. For those, we'll need to use str.replace, using the regex=True keyword argument that'll allow us to specify any whitespace, of any length, with \s+:

df.columns = df.columns.str.strip().str.replace(r'\s+', ' ', regex=True)

Notice that I'm able to chain the two string methods, because str.strip returns a string, on which we can run str.replace.

This works, but it involves assignment, which means that I cannot use method chaining. Is there a way that I can actually set the columns without an explicit assignment to df.columns?

Yes, absolutely – we can use the set_axis method, specifying axis='columns'. We can say:

filename = 'Proliferation_Master Sheet_Final_Copyedit (1).xlsx'

df = (pd
      .read_excel(filename)
      .set_axis(df.columns.str.strip().str.replace(r'\s+', ' ', regex=True), axis='columns')
     )

Actually... the above code doesn't work. Why not? Because in our call to set_axis, we pass an argument that uses df. But df hasn't been defined yet; it'll only be defined after our method chain is done.

We're thus a bit stuck: We want to invoke set_axis on our data frame, using its existing df.columns as a basis for the new column names. But the data frame hasn't been named yet, making it difficult to retrieve its columns.

The solution here is to use pipe, to which we'll pass a lambda expression. The lambda expression (basically, the body of a one-line function) will get a data frame as an argument – the very data frame that we're in the middle of creating – and will assign it to the df_ parameter. We can then invoke set_axis on df_, retrieving the current column names from df_.columns.

Here's how it looks:

filename = 'Proliferation_Master Sheet_Final_Copyedit (1).xlsx'

df = (pd
      .read_excel(filename)
      .pipe(lambda df_: df_.set_axis(df_.columns.str.strip().str.replace(r'\s+', ' ', regex=True),
            axis='columns'))
     )

Sure enough, this works; our column names no longer have extra spaces.

But they do have a misspelling, one that I would really like to fix. I can do that with df.rename , passing it a dict whose keys are the original text and whose values are the text to which we want to convert things:

filename = 'Proliferation_Master Sheet_Final_Copyedit (1).xlsx'

df = (pd
      .read_excel(filename)
      .pipe(lambda df_: df_.set_axis(df_.columns.str.strip().str.replace(r'\s+', ' ', regex=True),
            axis='columns'))
      .rename(columns={'Paltform Model': 'Platform Model'})
     )

The result is a data frame with 720 rows and 15 columns, giving us an overall picture of drone sales for the last number of years.

How long does it take from when a drone is ordered until a drone is first delivered? Is there a substantial difference between the mean and median? What does that mean? (You can ignore any rows in which we have `NaN` for either of those columns.)

Let's start by paring down our data frame to just have two columns:

(
    df
    [['Year of Order', 'Year of First Delivery']]
)

I mentioned that we can remove any row containing NaN values, so we can run dropna. That returns a new data frame, based on the original one, in which all NaN values (and the rows in which they appeared) are gone:

(
    df
    [['Year of Order', 'Year of First Delivery']]
    .dropna()
)

I now want to know how much time has passed from the first column to the second. I can use the diff method; the result will be a data frame with the same index and columns, but in which the data reflects the differences.

But wait – diff works on rows, meaning that the subtraction will work from one row to the next. We want to subtract across, taking the values in the right column from the left column.

To do that, we can still invoke diff, passing axis='columns' to the method. That returns a new data frame in which the first column contains NaN values (because it's not compared with anything) and the second column contains the differences between the first and second columns.

We can then retrieve the second column by naming it (using []), and then invoking describe to get the full set of descriptive statistics on this series:

(
    df
    [['Year of Order', 'Year of First Delivery']]
    .dropna()
    .diff(axis='columns')
    ['Year of First Delivery']
    .describe()
)

The result:

count    229.000000
mean       1.502183
std        1.387995
min        0.000000
25%        1.000000
50%        1.000000
75%        2.000000
max        8.000000
Name: Year of First Delivery, dtype: float64

We can see that from the 229 rows that remained after removing NaN values, there was a mean delay of 1.5 years from order to delivery. The minimum was 0 (i.e., the same year), and the maximum was 8 years (!).

The mean is 1.5 and the median is 1.0. When the mean is higher, it's usually because there are a number of outlier values pulling the mean up. (As the old statistics joke says, Bill Gates walks into a bar and on average, everyone there is a billionaire.) From describe , we can see that the number at the 75% point was 2.0. Plus, the standard deviation is 1.38; most values should be within 2 standard deviations of the mean, which here would be 4.26. The max value of 8 is thus a big outlier, and is partly responsible for raising the mean more than the median.

Also, who waits eight years to take delivery of a military drone?

BW #97: Drones (solution)

Reuven M. Lerner

Data and seven questions

Import the Excel data into a Pandas data frame. Clean up any issues you see with the column names.

How long does it take from when a drone is ordered until a drone is first delivered? Is there a substantial difference between the mean and median? What does that mean? (You can ignore any rows in which we have `NaN` for either of those columns.)

Read more

Pandas office hours — Sunday, February 23rd

BW #106: Flu season (solution)

BW #106: Flu season

BW #105: Federal employees (solution)

Data and seven questions

Import the Excel data into a Pandas data frame. Clean up any issues you see with the column names.

How long does it take from when a drone is ordered until a drone is first delivered? Is there a substantial difference between the mean and median? What does that mean? (You can ignore any rows in which we have NaN for either of those columns.)

Read more

Pandas office hours — Sunday, February 23rd

BW #106: Flu season (solution)

BW #106: Flu season

BW #105: Federal employees (solution)

How long does it take from when a drone is ordered until a drone is first delivered? Is there a substantial difference between the mean and median? What does that mean? (You can ignore any rows in which we have `NaN` for either of those columns.)