BW #3: Earthquakes (solution)
The earthquake in Turkey and Syria was truly horrible. Let's explore some data about earthquakes — including how often they occur, how strong they are, and where.
This week’s topic: Earthquakes
Yesterday, I asked you to download information from the USGS, and to answer some questions about earthquakes that have taken place since January 1st, 2000:
- Read the downloaded CSV file (which will be called `query.csv`, but which I renamed to `earthquake-data.csv` on my computer) into Pandas.
- How many seismic events take place each year? In which of the last 20 years did we have the greatest number of such events?
- What are common magnitudes? Looking only at the integer portion of the magnitudes, how often does each value occur?
- How many seismic events took place in Turkey on February 6th? What was their mean magnitude?
- Are earthquakes common in Turkey? From the “place” column, extract the text following the final comma, and find the 30 places in the world with the greatest number of earthquakes.
- Are serious earthquakes common in Turkey? Rerun the previous query, but only include events with a magnitude of 5 or greater.
Discussion
First, I downloaded a CSV file from the USGS.
On my system, the file was named `query.csv`. I renamed it to `earthquake-data.csv`. The file is a normal CSV file, with column headers. What do those columns mean? For that, we need to use the “data dictionary” provided with the data set, which explains the headers, values, and meanings.
In our case, the data dictionary is on the USGS site: https://earthquake.usgs.gov/data/comcat/
To figure out what each column in our downloaded CSV means, look at the column name, and then find the corresponding name in the data dictionary. For example, the “mag” column contains the magnitude, and is fully documented there, including a description of the data type, the range, and the meaning of the column.
I could have read the CSV into Pandas (imported, as usual, as pd) with the simplest possible call to read_csv:

import pandas as pd

df = pd.read_csv(filename)
But because we’re going to be using the “time” column to retrieve time-and-date-related things, I asked Pandas to parse it as a datetime column:
df = pd.read_csv(filename, parse_dates=['time'])
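Putting that together, here is a minimal, runnable sketch, assuming the renamed file sits in the current directory:

import pandas as pd

filename = 'earthquake-data.csv'
df = pd.read_csv(filename, parse_dates=['time'])

# Confirm that "time" was parsed into datetime64 values, not strings
print(df['time'].dtype)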
If I had a truly large amount of data, then I would probably fiddle a bit with the dtypes, as well as reduce the number of columns I read into memory. But I found that the entire data set, with more than 8,000 events, was less than 6 MB in size, and thus decided not to spend time on it.
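If the data set were much bigger, that trimming might look something like this sketch; the choice of columns and the float32 dtype here are my illustration, not anything from the original:

import pandas as pd

# Load only the columns we plan to use, and shrink "mag" to a 32-bit float;
# unnecessary for a 6 MB file, but useful on much larger downloads
df = pd.read_csv('earthquake-data.csv',
                 usecols=['time', 'mag', 'place'],
                 dtype={'mag': 'float32'},
                 parse_dates=['time'])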
With the data frame read into memory, I was ready to answer the first questions:
How many seismic events take place each year? In which of the last 20 years did we have the greatest number of such events?
In other words: We have data from many different years, and want to know how many times each year appears in the data set.
Thanks to our use of “parse_dates” when we read the CSV file, the “time” column has a dtype of datetime64. We can use the “dt” accessor on such a column to retrieve different parts of each timestamp, including the year:
df['time'].dt.year
This gives us a series of integers, the years in which the measurements were taken. How can we find out how many times each year appears in the data set? With my favorite method, value_counts. Not only will value_counts tell us how many times each year appears in the data set, it’ll also sort the results for us, from the most common to the least common:
df['time'].dt.year.value_counts()
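That answers the first half of the question. For the second half, which of the last 20 years had the greatest number of events, one possible approach (my sketch, continuing with the df we loaded above) is to filter the years before counting:

# Keep only the most recent 20 years, then count events per year;
# value_counts sorts from most to least common, so the first row wins
years = df['time'].dt.year
last_20 = years[years >= years.max() - 19]
print(last_20.value_counts().head(1))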
Next, I wanted to find out how frequently each magnitude of an earthquake occurs.
Side note: Researching this week’s question, I was reminded that the measurement scale is logarithmic, meaning that a magnitude-6 earthquake is 10 times larger than a magnitude-5 earthquake. I didn’t realize, however, that the damage done isn’t just 10 times greater; it can be more than 30 times greater. Yikes. I also learned that we no longer use the Richter scale, even though people commonly refer to it that way, and that the scale describes the logarithm of the amplitude of the earthquake’s waves, as measured on a seismograph.
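A quick back-of-the-envelope check of that “more than 30 times” figure: the energy released grows by a factor of roughly 10 ** 1.5 for each whole step in magnitude.

# Energy scales roughly as 10 ** (1.5 * magnitude), so each whole
# magnitude step releases about 31.6 times as much energy
print(10 ** 1.5)    # 31.62...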
It might make sense to use value_counts on the “mag” column, which describes the magnitude of each earthquake. But then we’ll get a separate result for each individual value. Given that these are floating-point values, we’ll get a ton of results, making it hard to understand them.
I thus asked you to look only at the integer portion of the magnitude. My favorite way to do that is simply to call “astype” on the column, converting it to an integer (in this case, an 8-bit integer):

import numpy as np

df['mag'].astype(np.int8)
With the magnitudes now in our hands as integers, we can count how often each magnitude appears with our friend value_counts:
df['mag'].astype(np.int8).value_counts()
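Since value_counts sorts by frequency, a small variation (mine, not part of the original solution) sorts the result by magnitude instead, which can make the distribution easier to scan:

# Sort by the index (the integer magnitude) rather than by frequency,
# showing the distribution from weakest to strongest
print(df['mag'].astype(np.int8).value_counts().sort_index())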
We can see that magnitude-7 earthquakes, like the one that took place in Turkey and Syria, are pretty rare. (Remember that this data set only includes readings with a magnitude of 2.5 and greater, so many smaller earthquakes have been ignored.)
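To put a number on “pretty rare,” one possible follow-up query (again mine, not from the original) counts the events at magnitude 7 and above:

# Count rows whose magnitude is 7 or greater; the comparison produces
# a boolean series, and sum() treats each True as 1
print((df['mag'] >= 7).sum())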