BW #105: Federal employees (solution)

Get better at: String manipulation, working with multiple files, joins, grouping, and plotting


US federal government employees have been in the news quite a bit. Elon Musk and Donald Trump are making a number of claims, including that federal workers are unneeded, corrupt, unqualified, politically motivated, can be replaced by AI, and/or that their departments are unnecessary. Whether that's true, and whether it's even legal to have large-scale layoffs of federal employees, remains to be seen. But there is no doubt that they're trying to dramatically reduce the federal workforce.

Based on numerous reports I've read (e.g., https://www.washingtonpost.com/nation/2025/02/12/musk-trump-federal-layoffs/ in the Washington Post and https://www.nytimes.com/2025/02/13/us/politics/trump-fires-government-employees.html in the New York Times), many of the layoffs will be in areas which have to do with oversight of government ethics and laws, and long-term research in which there is no commercial interest.

This week, we're looking at data about the federal workforce, using information from the Office of Personnel Management (OPM), which is in charge of federal employees. We'll look at where they live, what sorts of work they do, and how much money they earn.

Of course, the real purpose of Bamboo Weekly is to help you improve your skills with data analysis. This week's data-analysis learning goals include string manipulation, working with multiple files, joins, grouping, and plotting.

If you're a paid subscriber, then you'll be able to download the data directly from a link at the bottom of this post, download my Jupyter notebook, and (my favorite new feature) click on a single link that loads the notebook and the data into Google Colab, so you can get to work experimenting right away.

Data and six questions

The data comes from OPM's "Fedscope" project, which publishes data on a regular basis to help the public understand the number, location, salaries, and jobs of federal employees. You can get the data from the OPM site:

https://www.opm.gov/data/datasets/

The most recent data is from March 2024, and was posted in October of last year. Click on the icon under "downloads" to get a zipfile, or click here:

https://www.opm.gov/data/datasets/Files/716/3679c569-8492-46db-934a-eb3882647abb.zip

This zipfile includes a number of data files, all with the TXT suffix. There is also a PDF data dictionary that goes through each of the fields in the individual files.

Create a data frame describing all federal employees from the main FACTDATA_MAR2024.TXT file. For each of the columns 'LOC', 'AGELVL', 'EDLVL', 'LOSLVL', 'OCC', 'SALLVL', and 'STEMOCC', load the corresponding data file (i.e., DTname.txt) into a data frame, and combine it with the main data frame. Do the same for DTagy.txt, which should be combined using the AGYSUB column in the main file.

Let's start by loading Pandas:

import pandas as pd

Next, I'll want to create a data frame from the FACTDATA_MAR2024.TXT file, which I compressed into a zipfile. That file, along with all of the others, was located (on my computer) under the data/opm-data directory. I knew that I would want to read from a bunch of files in the same directory, and thus decided to use os.path.join to get a full pathname from a combination of directory and filename. This might seem like overkill for a single file, but given that we'll be loading a bunch of them, this felt like the right way to go.

I thus loaded the os module:

import os

And then I created the data frame:

dirname = 'data/opm-data'
filename = 'FACTDATA_MAR2024.zip'
df = pd.read_csv(os.path.join(dirname, filename),
                 engine='pyarrow')

Notice four things about my call to read_csv:

  • First, I decided to use PyArrow for loading the CSV into a data frame, because it's so much faster than the native Pandas CSV loader. Plus, it automatically identifies datetime columns, although that's admittedly not an issue we deal with this week. In order for this to work, you'll need to pip install pyarrow in your Python configuration, but that should work pretty easily.
  • Second, os.path.join returns a string, the result of combining the directory name and filename. One of the many nice things about os.path.join is that it works cross-platform, meaning that it'll use / or \ as appropriate for Unix and Windows, respectively. It's also forgiving about a trailing / at the end of the directory name, ensuring that you won't end up with too many slashes in the resulting string (see the short example after this list).
  • Third, notice that I handed pd.read_csv a zipfile containing a single CSV file. If read_csv is given a file whose name ends in .zip (or any other common compression extension), it will automatically decompress and read it.
  • Finally, notice that the data files we're dealing with all have a .txt extension. The extension itself doesn't matter; if we hand such a file to read_csv, it'll be opened and parsed as a CSV file.
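
Here's a short sketch (on a Unix-style system) of what I mean about os.path.join and trailing slashes; the specific paths here are just for illustration:

import os

os.path.join('data/opm-data', 'DTloc.txt')     # 'data/opm-data/DTloc.txt'
os.path.join('data/opm-data/', 'DTloc.txt')    # also 'data/opm-data/DTloc.txt'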

This was fine for loading our first file and creating our initial data frame. But this data frame contains a lot of numeric codes, codes that we can think of as foreign keys into other tables that the Fedscope project uses. For example, the LOC column in our data frame corresponds to the LOC column in the DTloc.txt file. That file has four columns: LOCTYP (location type), LOCTYPET (location type text), LOC (location ID, same as our data frame), and LOCT (location text).
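
For example, assuming the same dirname as before, we can peek at that lookup file's columns:

loc_df = pd.read_csv(os.path.join(dirname, 'DTloc.txt'))
loc_df.columns     # LOCTYP, LOCTYPET, LOC, and LOCT, per the description above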

I asked you to combine our data frame with eight others. Seven of them, including LOC, have filenames that can be inferred from the column names. The eighth, DTagy.txt, is almost (but not quite) right on that front.

How can we merge data from these eight files into our data frame?

There are two methods we can use to combine data frames:

  • join combines them much as is done in a relational (SQL) database. We use the index of both data frames as the basis for combining them.
  • merge does the same thing, but can work on any two columns, not just the indexes.
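
Here's a tiny sketch of the difference, using toy data frames (not the OPM data):

left = pd.DataFrame({'LOC': ['01', '02'],
                     'EMPLOYEE': ['Alice', 'Bob']})
right = pd.DataFrame({'LOC': ['01', '02'],
                      'LOCT': ['LOCATION ONE', 'LOCATION TWO']})

# merge can use any shared column as the key
left.merge(right, on='LOC')

# join works via the index, so we first make LOC the index on both sides
left.set_index('LOC').join(right.set_index('LOC'))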

In this case, I decided to use merge. Moreover, since I have seven columns whose names can be used to choose the appropriate filename, I decided to perform repeated merges in a for loop. Just to ensure that things work correctly and that I know what's happening, I added a call to print in the middle of this loop:


join_columns = ['LOC', 'AGELVL', 'EDLVL', 'LOSLVL', 'OCC', 'SALLVL', 'STEMOCC']

for one_column in join_columns:
    print(one_column)
    small_filename = os.path.join(dirname, f'DT{one_column.lower()}.txt')
    small_df = pd.read_csv(small_filename)
    df = df.merge(small_df, on=one_column)

What does this code do?

  • First, I defined a list of columns with which I wanted to join/merge.
  • I iterated through each column name, and printed it.
  • I used os.path.join with dirname (the same directory name as before), 'DT', and the lowercase version of the column name to get the filename.
  • I created a small data frame by reading that file.
  • I then invoked df.merge on our main data frame, passing small_df as the right-side data frame, and indicating that we wanted to merge them using the current column.
  • Since df.merge returns a new data frame, I then assigned it back to df. In other words, the main data frame grows by several columns after each merge.
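
One thing to keep in mind is that merge performs an inner join by default, so any row whose code doesn't appear in the lookup file will silently disappear. If you want to guard against that, here's a sketch of the same loop with a row-count check added; the assertion is my own addition, not part of the solution above:

for one_column in join_columns:
    small_filename = os.path.join(dirname, f'DT{one_column.lower()}.txt')
    small_df = pd.read_csv(small_filename)

    rows_before = len(df)
    df = df.merge(small_df, on=one_column)   # inner join by default
    assert len(df) == rows_before, f'Row count changed merging on {one_column}'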

The only thing left to do after this for loop was to merge in the agency-related data, whose filename follows a slightly different naming convention from the others:

agy_df = pd.read_csv(os.path.join(dirname, 'DTagy.txt'))
df = df.merge(agy_df, on='AGYSUB')

After all this, the data frame (df) contained 2,278,730 rows and 44 columns.
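
You can verify this with a quick check of the data frame's shape:

df.shape     # (2278730, 44)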

Find the 20 top-level agencies with the greatest number of employees. Print their names ("agency translation") and the number of people who work there, with commas before every three digits.

Given the discussion of how many federal employees Trump and Musk want to fire, I was curious to know just how many there are, and where many of them work.

To find that, I decided to grab the AGYT ("agency translation," meaning the agency name in textual form) column from df. I then applied one of my favorite methods, value_counts.

The result of value_counts is always a series whose index contains the distinct values from the original series (df['AGYT'], in this case) and whose values are integers indicating how often each of those values appeared. The result is also sorted in descending order of frequency. That means I can then apply head(20) to get the 20 agencies with the greatest number of employees:

(
    df
    ['AGYT']
    .value_counts()
    .head(20)
)

This gave a perfectly reasonable answer:

AGYT
VA-DEPARTMENT OF VETERANS AFFAIRS                   486522
HS-DEPARTMENT OF HOMELAND SECURITY                  222539
AR-DEPARTMENT OF THE ARMY                           221037
NV-DEPARTMENT OF THE NAVY                           216537
AF-DEPARTMENT OF THE AIR FORCE                      168505
DD-DEPARTMENT OF DEFENSE                            156803
DJ-DEPARTMENT OF JUSTICE                            116614
TR-DEPARTMENT OF THE TREASURY                       108869
AG-DEPARTMENT OF AGRICULTURE                         92072
HE-DEPARTMENT OF HEALTH AND HUMAN SERVICES           91058
IN-DEPARTMENT OF THE INTERIOR                        62890
SZ-SOCIAL SECURITY ADMINISTRATION                    59227
TD-DEPARTMENT OF TRANSPORTATION                      55806
CM-DEPARTMENT OF COMMERCE                            47650
NN-NATIONAL AERONAUTICS AND SPACE ADMINISTRATION     18073
DN-DEPARTMENT OF ENERGY                              16846
EP-ENVIRONMENTAL PROTECTION AGENCY                   16450
DL-DEPARTMENT OF LABOR                               14730
ST-DEPARTMENT OF STATE                               14316
GS-GENERAL SERVICES ADMINISTRATION                   12939
Name: count, dtype: int64

I asked, however, for us to display a comma before each group of three digits. One way to do that is to invoke apply, which lets us run any function we want on each of the values in the series. In this case, I used lambda to create an anonymous function. That function returns a string containing the integer from the series, formatted inside an f-string with the , format code, which adds the commas:

(
    df
    ['AGYT']
    .value_counts()
    .head(20)
    .apply(lambda x: f'{x:,}')
)

We get the same data back, but in a slightly easier-to-read form (notice that the dtype is now object, since the values are strings):

AGYT
VA-DEPARTMENT OF VETERANS AFFAIRS                   486,522
HS-DEPARTMENT OF HOMELAND SECURITY                  222,539
AR-DEPARTMENT OF THE ARMY                           221,037
NV-DEPARTMENT OF THE NAVY                           216,537
AF-DEPARTMENT OF THE AIR FORCE                      168,505
DD-DEPARTMENT OF DEFENSE                            156,803
DJ-DEPARTMENT OF JUSTICE                            116,614
TR-DEPARTMENT OF THE TREASURY                       108,869
AG-DEPARTMENT OF AGRICULTURE                         92,072
HE-DEPARTMENT OF HEALTH AND HUMAN SERVICES           91,058
IN-DEPARTMENT OF THE INTERIOR                        62,890
SZ-SOCIAL SECURITY ADMINISTRATION                    59,227
TD-DEPARTMENT OF TRANSPORTATION                      55,806
CM-DEPARTMENT OF COMMERCE                            47,650
NN-NATIONAL AERONAUTICS AND SPACE ADMINISTRATION     18,073
DN-DEPARTMENT OF ENERGY                              16,846
EP-ENVIRONMENTAL PROTECTION AGENCY                   16,450
DL-DEPARTMENT OF LABOR                               14,730
ST-DEPARTMENT OF STATE                               14,316
GS-GENERAL SERVICES ADMINISTRATION                   12,939
Name: count, dtype: object
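
By the way, an equivalent way to get the same formatting (not what I did above, but a common idiom) is to pass a format string's format method to map:

(
    df
    ['AGYT']
    .value_counts()
    .head(20)
    .map('{:,}'.format)
)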

I didn't expect the Department of Veterans Affairs to have the most employees, and by such a large margin: more than twice as many as the next-largest department. But it does make me wonder whether Musk and Trump will want to reduce headcount in the defense-related departments that dominate this list, which tend to be favorites among political conservatives. If they don't, I wonder whether they'll be able to make comparable cuts in other departments.