Case Study: Earthquake Analysis - Novel methods for mining and learning from data streams

As a proof of concept, we conduct two case studies in which our streaming version of survival analysis is used for spatio-temporal data analysis. While the temporal aspect is naturally captured by the hazard rate model, the spatial aspect is incorporated by using spatial information as covariates of the data streams. That is, the vector of covariates (5.1) describes the spatial location of a data source.

In the first study, our method is applied to the analysis of earthquake data. The data is collected from the USGS³ (United States Geological Survey), specifically from the catalog of the NEIC⁴ (National Earthquake Information Center). The mission of these organizations is to quickly discover the most recent destructive earthquakes in terms of location and magnitude and then broadcast this information to international agencies and scientists.

5.4.1 Data Generation

The earthquake data was collected for the period between January 1, 2000 and March 2, 2012. Because entries in the USGS/NEIC catalog can be added or modified at any time, we use only the data present in the catalog at the time of export, April 12, 2012. Table 5.1 presents an example of the earthquake data: the first five earthquakes that occurred on January 1, 2012, listed with their occurrence times and attributes.

The online catalog of USGS/NEIC retains only significant earthquakes with a magnitude greater than 2.5, although a very small number of micro-earthquakes (with a magnitude less than 1) can also be found. A few entries even have a missing magnitude. In total, we collected data on 319,884 earthquakes across the globe in the given time period.

Year  Month  Day  UTC Time      Latitude  Longitude  Mag.  Depth  Catalog
                  (hhmmss.mm)
2012  01     01   003008.77       12.008    143.487   5.1     35  PDE-W
2012  01     01   003725.28       63.337   -147.516   3.0     65  PDE-W
2012  01     01   004342.77       12.014    143.536   4.4     35  PDE-W
2012  01     01   005008.04      -11.366    166.218   5.3     67  PDE-W
2012  01     01   012207.66       -6.747    130.007   4.2    145  PDE-W

Table 5.1: A sample of the earthquake data, containing the first five earthquakes that occurred on the first day of 2012.
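To make the record layout of Table 5.1 concrete, the following minimal sketch parses one whitespace-separated row into a timestamped event. The function name `parse_catalog_row` and the dictionary keys are illustrative assumptions, not part of the USGS/NEIC export format, and the sketch assumes that all nine fields are present (rows with a missing magnitude would need separate handling).

```python
from datetime import datetime, timezone


def parse_catalog_row(row):
    """Parse one row laid out as in Table 5.1 (hypothetical helper)."""
    year, month, day, hms, lat, lon, mag, depth, catalog = row.split()

    # The time field is hhmmss.mm: hours, minutes, seconds, then hundredths.
    sec_whole, _, sec_frac = hms[4:].partition(".")
    microseconds = int(sec_frac.ljust(6, "0")[:6]) if sec_frac else 0

    when = datetime(
        int(year), int(month), int(day),
        int(hms[0:2]), int(hms[2:4]), int(sec_whole), microseconds,
        tzinfo=timezone.utc,
    )
    return {
        "time": when,
        "latitude": float(lat),
        "longitude": float(lon),
        "magnitude": float(mag),
        "depth": float(depth),
        "catalog": catalog,
    }


event = parse_catalog_row("2012 01 01 003008.77 12.008 143.487 5.1 35 PDE-W")
print(event["time"].isoformat(), event["magnitude"])
```

Sorting the parsed events by their `time` field gives the chronological stream that the method consumes.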

Every earthquake is identified by its geographic coordinates, the exact time of occurrence (up to the second), the magnitude and the depth. Figure 5.4(a) depicts a plot of the collected earthquakes, each of which is represented as a point at the place of its geographic location.

Recall that in the setting introduced in Section 5.3, we assume that event sequences are observed for a fixed set of instances. In order to define these instances, we discretize the globe, both in terms of latitude and longitude, and associate one instance with each intersection point. More specifically, with $\phi \in \{-90, -89, \ldots, 90\}$ for latitude and $\lambda \in \{-180, -179, \ldots, 180\}$ for longitude, the total number of instances becomes $181 \times 361 = 65{,}341$. The regions produced are obviously not equal in area because meridians (lines of constant longitude), unlike parallels of latitude, are not parallel but converge toward the poles; therefore, regions near the equator are larger than those closer to the poles.
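The discretization above can be sketched in a few lines. The helper `nearest_instance` is a hypothetical illustration: it snaps an epicenter to a grid point by rounding each coordinate to the nearest integer degree, which is only one possible reading of "nearest" and is not necessarily how the assignment was done in the study.

```python
from itertools import product

# One instance per integer (latitude, longitude) intersection point.
LATITUDES = range(-90, 91)     # 181 values
LONGITUDES = range(-180, 181)  # 361 values
instances = list(product(LATITUDES, LONGITUDES))


def nearest_instance(lat, lon):
    """Snap a coordinate pair to a grid point by rounding (assumption)."""
    grid_lat = min(max(round(lat), -90), 90)
    grid_lon = min(max(round(lon), -180), 180)
    return grid_lat, grid_lon


print(len(instances))                      # 65341
print(nearest_instance(30.986, 103.364))   # (31, 103)
```

Rounding in degrees ignores the convergence of the meridians, so close to the poles it may pick a grid point that is not the closest one on the Earth's surface.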

Furthermore, recall that each instance is described in terms of features (covariates) $x_i$, which, according to (5.18), have a proportional effect on the hazard rate. In order to account for possibly nonlinear dependencies between spatial coordinates and the risk of an earthquake, we define these features in terms of a fuzzy partition, i.e., a partition defined in terms of fuzzy sets [184]. In contrast to a standard partition defined in terms of intervals, this allows for a smooth transition between spatial regions. More specifically, we discretize both latitude and longitude by means of triangular fuzzy sets as shown in Figure 5.4(b). A two-dimensional (fuzzy) discretization of the globe is defined in terms of the Cartesian product of these two one-dimensional discretizations, using the minimum operator for fuzzy set intersection. The covariates of an instance $x$ associated with coordinates $(\phi, \lambda)$ are then simply given by the membership degrees in all these two-dimensional fuzzy sets, i.e., the covariates are of the form

\[ x_{i,j} = \min\bigl( A_i(\phi), B_j(\lambda) \bigr), \]

where $A_i$ is one of the 10 fuzzy sets for latitude and $B_j$ one of the 12 fuzzy sets for longitude; thus, each instance is of the form

\[ x = (x_{1,1}, x_{1,2}, \ldots, x_{1,12}, \ldots, x_{10,12}) \in [0, 1]^{120}. \]
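The construction of the 120 covariates can be illustrated with the following sketch. It rests on assumptions that are not taken from the text: the triangular fuzzy sets are placed with evenly spaced peaks, each reaching zero at the peaks of its neighbours, with no wrap-around in longitude and no haversine correction. The actual layout is the one shown in Figure 5.4(b), so the membership degrees produced here will generally differ from those reported for the study. The function names are hypothetical.

```python
import numpy as np


def triangular_memberships(value, low, high, n_sets):
    """Membership degrees in n_sets evenly spaced triangular fuzzy sets.

    Assumed layout: peaks at np.linspace(low, high, n_sets); each
    triangle falls to zero at the peaks of its two neighbours.
    """
    centers = np.linspace(low, high, n_sets)
    width = centers[1] - centers[0]
    return np.clip(1.0 - np.abs(value - centers) / width, 0.0, 1.0)


def covariates(lat, lon, n_lat=10, n_lon=12):
    """Feature vector x in [0, 1]^120 for a grid point (lat, lon)."""
    a = triangular_memberships(lat, -90.0, 90.0, n_lat)    # A_i(phi)
    b = triangular_memberships(lon, -180.0, 180.0, n_lon)  # B_j(lambda)
    # Minimum operator for the intersection: x[i, j] = min(A_i, B_j).
    x = np.minimum.outer(a, b)
    # Row-major flattening gives the order x_{1,1}, x_{1,2}, ..., x_{10,12}.
    return x.ravel()


x = covariates(31, 103)
print(x.shape, np.count_nonzero(x))
```

Under this assumed layout a point has nonzero membership in at most two fuzzy sets per coordinate, so the feature vector is sparse, with at most four nonzero entries.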

The Mercator projection, used to project both coordinates in Figure 5.4(b), is meant to preserve angles and the shapes of small objects. As a result, distances between objects are distorted and lines meeting at the poles become parallel. For this reason, we attempt to keep the fuzzy partition as coherent as possible, i.e., fuzzy sets defined on the longitudes should have the same width, independent of their latitude. This can be realized by applying the haversine formula to preserve distances on the Earth's surface (approximated as a sphere with a radius of 6371 km), as opposed to applying the Euclidean distance between the geographic coordinates. The vertical fuzzy set at the longitude $\lambda = 0$ is shown in Figure 5.4(c), projected with the Mercator projection.

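The distance $d$ between two points with coordinates $(\phi_1, \lambda_1)$ and $(\phi_2, \lambda_2)$ on a globe with radius $r$ can be derived from the haversine formula and is given by

\[ d = 2r \arcsin\left( \sqrt{ \sin^2\left( \frac{\phi_2 - \phi_1}{2} \right) + \cos(\phi_1) \cos(\phi_2) \sin^2\left( \frac{\lambda_2 - \lambda_1}{2} \right) } \right). \]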
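As a minimal sketch, the haversine distance can be computed as follows, using the spherical approximation with a radius of 6371 km mentioned above. The function name and the use of degrees as input units are choices made for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    half_dphi = (phi2 - phi1) / 2.0
    half_dlambda = math.radians(lon2 - lon1) / 2.0
    a = (math.sin(half_dphi) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(half_dlambda) ** 2)
    # Guard against rounding slightly above 1 before taking the arcsine.
    return 2.0 * radius * math.asin(math.sqrt(min(1.0, a)))


# One degree of longitude spans far less ground at 60 degrees north
# than at the equator, which is why widths are measured on the sphere.
print(haversine(0, 0, 0, 1))
print(haversine(60, 0, 60, 1))
```

The second distance is roughly half the first, which illustrates why a fuzzy set of fixed width in degrees of longitude would cover less and less ground toward the poles.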

Figure 5.4: The collected data set of earthquakes, plotted by their geographic coordinates. The data contains earthquakes from January 1, 2000 until midnight on March 27, 2012. (a) Earthquakes only; (b) with fuzzy partitions on the two coordinates; (c) the center longitude fuzzy set after correction with the haversine formula. The two red lines represent the Mercator projection of the center latitude fuzzy set.

5.4.2 Results

Given the data produced in this way and after sorting all earthquakes by their time of occurrence, we are able to apply our method as outlined in Section 5.3. We set the length of the time window to three months and the shift parameter $\Delta$ to one week. The results we obtain in terms of time-dependent estimates of the parameters $\beta_{i,j}$, each of which is associated with a covariate $x_{i,j}$ and hence with a spatial (fuzzy) region $A_i \times B_j$, appear to be quite plausible. Several interesting observations can be made for data from the last decade. We focus on three of the most significant earthquakes that occurred in 2008 and 2011:

• The May 2008 Great Sichuan earthquake⁵ occurred on Monday, May 12, 2008 at 06:28:01 UTC, with a magnitude of 7.9 ($M_w$) and an epicenter at 30.986° N, 103.364° E. This event can be assigned to the nearest instance, whose sparse feature vector has the following nonzero entries: $x^\top = [x_{7,10} = 0.63, x_{7,11} = 0.51, x_{8,10} = 0.05, x_{8,11} = 0.05]$.

• The February 2011 Christchurch earthquake⁶ occurred on Monday, February 21, 2011 at 23:51:42 UTC, with a magnitude of 6.1 ($M_w$) and an epicenter at 43.583° S, 172.680° E. The instance nearest to the epicenter has a sparse feature vector with the following nonzero entries: $x^\top = [x_{3,1} = 0.65, x_{3,2} = 0.07, x_{3,12} = 0.46, x_{4,1} = 0.35, x_{4,2} = 0.07, x_{4,12} = 0.35]$.

• The March 2011 earthquake⁷ off the Pacific coast of Tōhoku occurred on Friday, March 11, 2011 at 05:46:24 UTC, with a magnitude of 9.0 ($M_w$) and an epicenter at 38.297° N, 142.372° E. The instance nearest to the epicenter has a sparse feature vector with the following nonzero entries: $x^\top = [x_{7,1} = 0.002, x_{7,11} = 0.42, x_{7,12} = 0.6, x_{8,1} = 0.002, x_{8,11} = 0.4, x_{8,12} = 0.4]$.
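The time-dependent estimates discussed here are obtained from a window of three months that is shifted by one week at a time. The following sketch only enumerates the window boundaries. It assumes that "three months" can be approximated by 13 weeks and that only windows lying completely inside the observation period are used; neither detail is specified in this section, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def sliding_windows(start, end,
                    length=timedelta(weeks=13),  # approx. three months
                    shift=timedelta(weeks=1)):   # shift parameter Delta
    """Yield (window_start, window_end) pairs that fit inside [start, end]."""
    window_start = start
    while window_start + length <= end:
        yield window_start, window_start + length
        window_start += shift


windows = list(sliding_windows(datetime(2000, 1, 1), datetime(2000, 12, 31)))
print(len(windows))   # 40
print(windows[0])     # first window: January 1 to April 1, 2000
```

For each window, the events whose occurrence time falls between the two boundaries would be passed to the estimation procedure of Section 5.3, yielding one set of coefficient estimates per week.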

As can be seen in Figure 5.5, the occurrence of the three earthquakes is accompanied by a significant increase in the coefficients of the fuzzy sets covering these areas. The higher the membership of an instance in a given two-dimensional fuzzy set, the more relevant the coefficient associated with that fuzzy set is to the overall hazard. For this reason, we present only the relevant coefficients in Figure 5.5. Notably, the coefficients as given by (5.18) are logarithmically inversely proportional, indicating that an increase in one coefficient is compensated by a decrease in other coefficients (without changing the estimated hazard). Although the different coefficients can be used as prognostic factors, the actual change of the hazard is better observed in the estimated hazard (5.18), which is shown in Figure 5.6 as hazard curves for the three studied areas. The figure reveals that the hazard rate increases significantly even before the occurrence of these earthquakes.

⁵ http://earthquake.usgs.gov/earthquakes/eqinthenews/2008/us2008ryan/, accessed on October 9, 2015

⁶ http://earthquake.usgs.gov/earthquakes/eqinthenews/2008/us2008ryan/, accessed on
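The compensation between coefficients can be illustrated numerically. The sketch below assumes that (5.18) has the usual proportional-hazards form, in which the covariates enter the hazard through a factor $\exp(x^\top \beta)$; that equation is not reproduced in this section, so this is an assumption. The membership degrees are those reported for the instance nearest to the Sichuan epicenter, whereas all coefficient values are hypothetical and chosen only for illustration.

```python
import math

# Nonzero covariates of the instance nearest to the Sichuan epicenter.
x = {(7, 10): 0.63, (7, 11): 0.51, (8, 10): 0.05, (8, 11): 0.05}

# Hypothetical coefficients beta_{i,j}; not estimates from the study.
beta = {(7, 10): 0.8, (7, 11): 0.3, (8, 10): -0.2, (8, 11): 0.1}


def relative_hazard(x, beta):
    """exp(x^T beta), the covariate-dependent factor (assumed form)."""
    return math.exp(sum(x[k] * beta[k] for k in x))


before = relative_hazard(x, beta)

# Raise one coefficient and lower another so that x^T beta is unchanged.
delta = 0.5
shifted = dict(beta)
shifted[(7, 10)] += delta
shifted[(7, 11)] -= delta * x[(7, 10)] / x[(7, 11)]

after = relative_hazard(x, shifted)
print(before, after)  # equal up to floating-point rounding
```

Because the coefficients are shared by all instances, such an exact trade-off for one instance will in general change the hazard of others; the sketch only shows why individual coefficient curves are harder to interpret than the estimated hazard itself.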
