BW #61: Solar eclipse (solution)
Much of North American enjoyed a total eclipse of the sun on Monday. This week, we'll look at NASA data about the eclipse, calculating who got to enjoy totality, and for how long.
Administrative notes:
- Don’t forget the webinar with Philip Guo on Sunday, about the Python Tutor and Pandas Tutor. It’s free for anyone and everyone to join.
- Paid subscribers are also invited to office hours on Sunday; I’ll send info and link about it on Friday.
This week, we looked at data from the solar eclipse that was visible in North America on Monday. The pictures that I’ve seen from the eclipse are truly amazing, I can understand why someone would travel to see it — particularly the moments of totality, when the moon fully covers the sun.
There were obviously lots of jokes, online and off, about the eclipse. Perhaps my favorite online joke was when NASA’s moon account blocked NASA’s sun account.
Data and seven questions
In addition to providing excellent online jokes, NASA also collects and makes available large amounts of astronomical data. I was particularly happy to find their page of eclipse-related data:
https://svs.gsfc.nasa.gov/5073
This page included a number of different files, including shapefiles that can be used to create maps. I decided, however, to concentrate on the JSON file containing information about 32,000 US locations, and when they can expect to see the eclipse:
https://svs.gsfc.nasa.gov/vis/a000000/a005000/a005073/cities-eclipse-2024.json
As I wrote yesterday, NASA’s page has a short explanation that serves as a data dictionary. However, as we’ll see in question 2, their explanations weren’t quite accurate. That said, even without complete documentation, I believe that I figured out how to interpret their data.
Below are the seven questions and tasks that I gave you for this week. As usual, a link to the Jupyter notebook that I used to solve these problems is at the bottom of this week’s edition.
Load the cities data into a data frame. Replace the ECLIPSE column, which contains a list of times, with six datetime values. The date for all should be April 8th, 2024.
Before doing anything else, let’s start by loading Pandas:
import pandas as pd
Next, we can use “read_json” to load a JSON file into a data frame:
filename = 'cities-eclipse-2024.json'
df = pd.read_json(filename)
The good news is that this worked just fine; we got a data frame from reading the JSON file, and it accurately reflects the JSON file. However, the “ECLIPSE” column in the data frame contains a list of strings representing times. I asked you to turn that list-containing column into separate columns.
To do this, I first used the “to_list” method on the “ECLIPSE” column, getting back a list of lists:
df['ECLIPSE'].to_list()
What can I do with that list of lists? I can create a data frame:
pd.DataFrame(df['ECLIPSE'].to_list())
However, I want this new data frame, containing six columns, to share an index with df. That’s probably already the case, given that we’re using the default, zero-based index, but just in case, we can specify it to our call to DataFrame:
pd.DataFrame(df['ECLIPSE'].to_list(),
index=df.index)
I can now combine the original data frame with these new columns using “pd.concat”:
df = (pd
.concat([df,
pd.DataFrame(df['ECLIPSE'].to_list(),
index=df.index)],
axis='columns')
)
Normally, “pd.concat” combines data frames top-to-bottom. We override that here, asking for them to be combined left-to-right, by specifying that they should be combined along the columns.
Our data frame “df” now contains the data we read from JSON in its original form, as well as the six columns we got from breaking up the “ECLIPSE” column. We don’t really need the original “ECLIPSE” column any more, so we can remove it with the “drop” method, again specifying that we’re talking about a column:
df = (pd
.concat([df,
pd.DataFrame(df['ECLIPSE'].to_list(),
index=df.index)],
axis='columns')
.drop('ECLIPSE', axis='columns')
)
With that in place, we now have the data we want — except that the six columns we just added are all in string form, where we really want them to be “datetime” values. We could convert them to “datetime” applying “pd.to_datetime” to each of these columns, and that’ll mostly work — except that the columns all contain times, without any dates. From my experiments, doing so would force “to_datetime” to use the date on which we invoke the function.
To avoid this, we can take advantage of broadcasting in Pandas, adding the value “2024-Apr-08” to each time string before we then invoke “pd.to_datetime” on the column:
pd.to_datetime('2024-Apr-08 ' + df[column_number]
To which of the six columns do I apply “pd.to_datetime”? Each of them in turn, using a “for” loop. And yes, it’s normally considered a bad idea to use a “for” loop in Pandas — but in this case, we’re not doing an action that could be vectorized, as far as I can tell. We really do need to go through each of the columns, add the date string, and convert to datetime values:
for column_number in range(6):
df[column_number] = pd.to_datetime('2024-Apr-08 ' + df[column_number])
When the above actions are done, our data frame has the following dtypes:
STATE object
NAME object
LAT float64
LON float64
0 datetime64[ns]
1 datetime64[ns]
2 datetime64[ns]
3 datetime64[ns]
4 datetime64[ns]
5 datetime64[ns]
dtype: object
This is exactly what we want! I toyed with the idea of changing the column names to something other than numbers 0-5, but decided to keep things the way they are. As we’ll see in the next two questions, I feel good about having done that, given the way that we need to interpret the data.
According to the NASA documentation, the ECLIPSE column contains an array of five values, and says that they represent the start of the eclipse, 50% coverage, 100% coverage, 50% coverage, and the end of the eclipse. (Note that "100% coverage" doesn't mean totality; rather, it means the maximum coverage for a given location.) However, this column sometimes contains five values, and sometimes contains six. Without peeking at the next questions, what is your best guess/interpretation regarding these values? What do you think about the structure of this data?
The data frame contains 32,174 rows, each for a different location in the United States. The locations are marked by state and name (i.e., city or two), as well as longitude and latitude.
But things get more complicated with the data that we broke out of the “ECLIPSE” column in the previous question.
On its eclipse page, NASA describes the JSON data that as having five times (all in UTC), indicating five stages of the eclipse:
- When it starts (0.01% coverage)
- Halfway (50% coverage)
- 100% coverage — which doesn’t necessarily mean totality, but is the maximum coverage for that location
- Halfway (50% coverage)
- When the eclipse ends (0.01% coverage)
This is great, except that there are actually six columns. Some of the time, that sixth column contains a time. But some of the time, it contains nothing at all. In a usual Pandas data frame, no information would be turned into a NaN or NA value. When we have dates and times, though, we get a NaT (“not a time”) value, which we can think of in the same way.
So, what is it, NASA? Five or six columns? And what’s the mystery sixth column?
I looked all over for clues, and couldn’t find any. I asked myself if there was any pattern to those rows in which we did have a sixth column, versus those that only had five. After quite a lot of examining, playing, and thinking, I finally understood that there were six columns wherever there was a total eclipse (i.e., the moon completely covered the sun), and there were five columns wherever it was less than total.
But that still didn’t exactly explain what to do with the sixth column.
Finally, after some more thought and comparing notes on eclipse-related sites, I understood:
- In places without a total eclipse, NASA’s official description of five columns was correct.
- In places with a total eclipse, though, the columns marked 0.01%, 50%, start of totality, end of totality, 50%, and 0.01%.
In other words, wherever there was a total eclipse, we’ll have six columns. And columns 2 and 3 will tell us when the totality started and ended.
This means that column 3 might indicate the end of totality. And it might indicate the time at which we reach 50% as the eclipse ends.
In other words, we can’t even label these columns, because they mean different things in different rows. Which … isn’t really a great way to structure the data, in my opinion.
To be honest, it’s possible that my interpretation here is wrong — although based on everything else I’ve seen and researched, and particularly the plots that I did in questions 6 and 7, I’m convinced that I’m right.
A misleading data dictionary, and poorly structured data, can combine to create quite a bit of frustration.