BW #105: Federal employees (solution)
Get better at: String manipulation, working with multiple files, joins, grouping, and plotting

US federal government employees have been in the news quite a bit. Elon Musk and Donald Trump are making a number of claims, including that federal workers are unneeded, corrupt, unqualified, politically motivated, can be replaced by AI, and/or that their departments are unnecessary. Whether that's true, and whether it's even legal to have large-scale layoffs of federal employees, remains to be seen. But there is no doubt that they're trying to dramatically reduce the federal workforce.
Based on numerous reports I've read (e.g., https://www.washingtonpost.com/nation/2025/02/12/musk-trump-federal-layoffs/ in the Washington Post and https://www.nytimes.com/2025/02/13/us/politics/trump-fires-government-employees.html in the New York Times), many of the layoffs will be in areas which have to do with oversight of government ethics and laws, and long-term research in which there is no commercial interest.
This week, we're looking at data about the federal workforce, using information from the Office of Personnel Management (OPM), which is in charge of federal employees. We'll look at where they live, what sorts of work they do, and how much money they earn.
Of course, the real purpose of Bamboo Weekly is to help you improve your skills with data analysis. And this week's data-analysis learning goals include:
- Working with multiple files (https://www.bambooweekly.com/tag/multiple-files/)
- Grouping (https://www.bambooweekly.com/tag/grouping/)
- Joins (https://www.bambooweekly.com/tag/joins/)
- Plotting (https://www.bambooweekly.com/tag/plotting/)
- String manipulation (https://www.bambooweekly.com/tag/strings/)
If you're a paid subscriber, then you'll be able to download the data directly from a link at the bottom of this post, download my Jupyter notebook, and (my favorite new feature) click on a single link that loads the notebook and the data into Google Colab, so you can get to work experimenting right away.
Data and six questions
The data comes from OPM's "Fedscope" project, published on a regular basis to help the public understand the number, location, salaries, and jobs of federal employees. You can get the data from the OPM site:
https://www.opm.gov/data/datasets/
The most recent data is from March 2024, and was posted in October of last year. Click on the icon under "downloads" to get a zipfile, or click here:
https://www.opm.gov/data/datasets/Files/716/3679c569-8492-46db-934a-eb3882647abb.zip
This zipfile includes a number of data files, all with the TXT suffix. There is also a PDF data dictionary that goes through each of the fields in the individual files.
Create a data frame describing all federal employees from the main FACTDATA_MAR2024.TXT file. For each of the columns 'LOC', 'AGELVL', 'EDLVL', 'LOSLVL', 'OCC', 'SALLVL', and 'STEMOCC', load the corresponding data file (i.e., DTname.txt) into a data frame, and combine it with the main data frame. Do the same for DTagy.txt, which should be combined using the AGYSUB column in the main file.
Let's start by loading Pandas:
import pandas as pd
Next, I'll want to create a data frame from the FACTDATA_MAR2024.TXT file, which I compressed into a zipfile. That file, along with all of the others, was located (on my computer) under the data/opm-data directory. I knew that I would want to read from a bunch of files in the same directory, and thus decided to use os.path.join to get a full pathname from a combination of directory and filename. This might seem like overkill for a single file, but given that we'll be loading a bunch of them, this felt like the right way to go.
I thus loaded the os module:
import os
And then I created the data frame.
dirname = 'data/opm-data'
filename = 'FACTDATA_MAR2024.zip'
df = pd.read_csv(os.path.join(dirname, filename),
                 engine='pyarrow')
Notice four things about my call to read_csv:
- First, I decided to use PyArrow for loading the CSV into a data frame because it's so much faster than using the native Pandas CSV loader. Plus, it automatically identifies datetime columns, although that's admittedly not an issue we deal with this week. In order for this to work, you'll need to pip install pyarrow in your Python configuration, but that should work pretty easily.
- Second, os.path.join returns a string, the result of combining the directory name and filename. One of the many nice things about os.path.join is that it works cross-platform, meaning that it'll handle the use of / or \ for Unix and Windows, respectively. It's also forgiving about / characters at the end of the directory name and/or the beginning of the filename, ensuring that you won't have too few or too many slashes in the resulting string. (There's a short sketch of this behavior just after this list.)
- Third, notice that I handed pd.read_csv a zipfile containing a single CSV file. read_csv, if given a file that ends in a .zip extension (or any other common compression extension), will automatically expand and read it.
- Finally, notice that the data files we're dealing with all have a .txt extension. This means nothing; if we hand such a file to read_csv, then it will be opened and treated as a CSV file. The extension doesn't affect how Pandas looks at or interprets it.
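To make that slash handling concrete, here's a tiny sketch (the directory and filename are the same ones used elsewhere in this post; nothing else is assumed):
import os

# os.path.join puts exactly one separator between its pieces,
# whether or not the directory name already ends with a slash.
# (On Windows, the inserted separator would be a backslash.)
print(os.path.join('data/opm-data', 'DTloc.txt'))   # data/opm-data/DTloc.txt
print(os.path.join('data/opm-data/', 'DTloc.txt'))  # data/opm-data/DTloc.txt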
This was fine for our first file, and creating our initial data frame. But this data frame contains a lot of numeric codes, codes that we can think of as foreign keys into other tables that the Fedscope project uses. For example, the LOC column in our data frame is the same as the LOC column in the DTloc.txt file. That file has four columns: LOCTYP (location type), LOCTYPET (location type text), LOC (location ID, same as our data frame), and LOCT (location text).
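If you want to confirm that structure for yourself before doing any merging, a quick peek at the file will do it (the loc_df name here is just for illustration; it isn't part of the solution):
loc_df = pd.read_csv(os.path.join(dirname, 'DTloc.txt'))

# per the data dictionary, we expect LOCTYP, LOCTYPET, LOC, and LOCT
print(loc_df.columns)
print(loc_df.head())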
I asked you to combine our data frame with eight others. Seven of them, including LOC, have filenames that can be inferred from the column names. The eighth, DTagy.txt, is almost (but not quite) right on that front.
How can we merge data from these eight files into our data frame?
There are two methods we can use to combine data frames:
- join combines them much as is done in a relational (SQL) database. We use the index of both data frames as the basis for combining them.
- merge does the same thing, but can work on any two columns, not just the indexes.
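Here's a minimal sketch of the difference, using two tiny, made-up data frames (the values have nothing to do with the OPM data):
emp = pd.DataFrame({'LOC': ['11', '24'], 'name': ['A', 'B']})
loc = pd.DataFrame({'LOC': ['11', '24'], 'LOCT': ['location one', 'location two']})

# merge works on columns; here, both frames share a LOC column
print(emp.merge(loc, on='LOC'))

# join works on the indexes, so we first move LOC into the index on both sides
print(emp.set_index('LOC').join(loc.set_index('LOC')))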
In this case, I decided to use merge. Moreover, since I have seven columns whose names can be used to choose the appropriate filename, I decided to perform repeated merges in a for loop. Just to ensure that things work correctly and that I know what's happening, I added a call to print in the middle of this loop:
join_columns = ['LOC', 'AGELVL', 'EDLVL', 'LOSLVL', 'OCC', 'SALLVL', 'STEMOCC']
for one_column in join_columns:
    print(one_column)
    small_filename = os.path.join(dirname, f'DT{one_column.lower()}.txt')
    small_df = pd.read_csv(small_filename)
    df = df.merge(small_df, on=one_column)
What does this code do?
- First, I defined a list of columns with which I wanted to join/merge.
- I iterated through each column name, and printed it.
- I used os.path.join with dirname (the same directory name as before) and a filename built from 'DT', the lowercase version of the column name, and the .txt suffix.
- I created a small data frame by reading that file.
- I then invoked df.merge on our main data frame, passing small_df as the right-side data frame, and indicating that we wanted to merge them using the current column.
- Since df.merge returns a new data frame, I then assigned it back to df. In other words, the main data frame grows by several columns after each merge.
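One thing to watch out for when doing repeated merges: the default inner join silently drops any row whose code has no match in the lookup file. A more defensive version of the loop (a sketch of what you could do, not what I ran above) asks merge to verify that each lookup table's key is unique, and checks the row count as it goes:
for one_column in join_columns:
    small_filename = os.path.join(dirname, f'DT{one_column.lower()}.txt')
    small_df = pd.read_csv(small_filename)

    rows_before = len(df)

    # validate='m:1' raises an error if the lookup table's key isn't unique
    df = df.merge(small_df, on=one_column, validate='m:1')

    # the default inner join drops rows whose code has no match in the lookup file
    assert len(df) == rows_before, f'lost rows while merging {one_column}'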
The only thing left to do after this for loop was to merge the agency-related data, which had a slightly different naming convention than the others:
agy_df = pd.read_csv(os.path.join(dirname, 'DTagy.txt'))
df = df.merge(agy_df, on='AGYSUB')
After all this, the data frame (df) contained 2,278,730 rows and 44 columns.
Find the 20 top-level agencies with the greatest number of employees. Print their names ("agency translation") and the number of people who work there, with commas before every three digits.
Given the discussion of how many federal employees Trump and Musk want to fire, I was curious to know just how many there are, and where many of them work.
To find that, I decided to grab the AGYT ("agency translation," meaning the agency name in textual form) column from df. I then applied one of my favorite methods, value_counts.
The result of value_counts is always a series whose index contains the distinct values from the original series (df['AGYT'], in this case) and whose values are integers indicating how often each of those values appeared. The result is also sorted in descending order. That means I can then apply head(20) to get the 20 agencies with the greatest number of employees:
(
    df
    ['AGYT']
    .value_counts()
    .head(20)
)
This gave a perfectly reasonable answer:
AGYT
VA-DEPARTMENT OF VETERANS AFFAIRS 486522
HS-DEPARTMENT OF HOMELAND SECURITY 222539
AR-DEPARTMENT OF THE ARMY 221037
NV-DEPARTMENT OF THE NAVY 216537
AF-DEPARTMENT OF THE AIR FORCE 168505
DD-DEPARTMENT OF DEFENSE 156803
DJ-DEPARTMENT OF JUSTICE 116614
TR-DEPARTMENT OF THE TREASURY 108869
AG-DEPARTMENT OF AGRICULTURE 92072
HE-DEPARTMENT OF HEALTH AND HUMAN SERVICES 91058
IN-DEPARTMENT OF THE INTERIOR 62890
SZ-SOCIAL SECURITY ADMINISTRATION 59227
TD-DEPARTMENT OF TRANSPORTATION 55806
CM-DEPARTMENT OF COMMERCE 47650
NN-NATIONAL AERONAUTICS AND SPACE ADMINISTRATION 18073
DN-DEPARTMENT OF ENERGY 16846
EP-ENVIRONMENTAL PROTECTION AGENCY 16450
DL-DEPARTMENT OF LABOR 14730
ST-DEPARTMENT OF STATE 14316
GS-GENERAL SERVICES ADMINISTRATION 12939
Name: count, dtype: int64
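By the way, value_counts is really a shortcut for a group-count-sort pattern. An essentially equivalent (if wordier) version using groupby, in case you prefer to see the steps spelled out:
(
    df
    .groupby('AGYT')
    .size()
    .sort_values(ascending=False)
    .head(20)
)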
I asked, however, for us to display a comma before each group of three digits. One way to do that is to invoke apply, which lets us run any function we want on each of the values in the series. In this case, I used lambda for an anonymous function. And the function, when run, returned a string containing the integer from the column – but inside of an f-string, where I used the , format code to add the commas:
(
    df
    ['AGYT']
    .value_counts()
    .head(20)
    .apply(lambda x: f'{x:,}')
)
We get the same data back, but in a slightly easier-to-understand form:
AGYT
VA-DEPARTMENT OF VETERANS AFFAIRS 486,522
HS-DEPARTMENT OF HOMELAND SECURITY 222,539
AR-DEPARTMENT OF THE ARMY 221,037
NV-DEPARTMENT OF THE NAVY 216,537
AF-DEPARTMENT OF THE AIR FORCE 168,505
DD-DEPARTMENT OF DEFENSE 156,803
DJ-DEPARTMENT OF JUSTICE 116,614
TR-DEPARTMENT OF THE TREASURY 108,869
AG-DEPARTMENT OF AGRICULTURE 92,072
HE-DEPARTMENT OF HEALTH AND HUMAN SERVICES 91,058
IN-DEPARTMENT OF THE INTERIOR 62,890
SZ-SOCIAL SECURITY ADMINISTRATION 59,227
TD-DEPARTMENT OF TRANSPORTATION 55,806
CM-DEPARTMENT OF COMMERCE 47,650
NN-NATIONAL AERONAUTICS AND SPACE ADMINISTRATION 18,073
DN-DEPARTMENT OF ENERGY 16,846
EP-ENVIRONMENTAL PROTECTION AGENCY 16,450
DL-DEPARTMENT OF LABOR 14,730
ST-DEPARTMENT OF STATE 14,316
GS-GENERAL SERVICES ADMINISTRATION 12,939
Name: count, dtype: object
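If the lambda feels like overkill, you can get the same result by handing map (or apply) an existing function, namely the bound format method of the format string '{:,}':
(
    df
    ['AGYT']
    .value_counts()
    .head(20)
    .map('{:,}'.format)   # '{:,}'.format(486522) returns '486,522'
)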
I didn't expect the Department of Veterans Affairs to have the most employees, and by such a large margin – more than twice as many as the next-largest department. But it does make me wonder whether Musk and Trump will want to reduce headcount in the defense-related departments near the top of this list, which tend to be favorites among political conservatives. If they don't, I wonder whether they'll be able to make reductions on the scale they're describing from the other departments alone.