Data Pre-processing for Maritime Data Warehouse

Typically, data pre-processing in a data warehouse is done using an ETL (Extract, Transform and Load) tool. ETL is the process responsible for pulling data out of the source systems, transforming it, and placing it into the data warehouse. ETL involves the following tasks.

Extract: The extraction task pulls data from the different source systems so that it can be converted into one consolidated data warehouse format.

Transform: Data transformation is applied to the data extracted from the different sources. It includes applying business rules, cleaning, filtering, and merging data from multiple sources.

Load: The loading task follows once the transformation task is finished. Its main objective is to load the transformed data into suitable storage (typically an RDBMS).
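The three tasks can be illustrated with a minimal sketch. It is not the SSIS package used in this work: it is written in Python, uses an in-memory SQLite database as a stand-in for the RDBMS, and the sample rows, column names and the filtering rule (drop rows with a missing SOG) are assumptions chosen only for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw AIS text export; column names and values are assumptions.
RAW_TEXT = """MMSI,SOG,COG
366123456,12.4,181
366123457,,90
366123458,8.0,45
"""

def extract(text):
    """Extract: read rows from a delimited text source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing SOG and convert field types."""
    cleaned = []
    for row in rows:
        if row["SOG"] == "":
            continue  # illustrative filtering rule, not taken from the source
        cleaned.append((int(row["MMSI"]), float(row["SOG"]), float(row["COG"])))
    return cleaned

def load(records, connection):
    """Load: bulk-insert the transformed records into a relational table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS ais (mmsi INTEGER, sog REAL, cog REAL)"
    )
    connection.executemany("INSERT INTO ais VALUES (?, ?, ?)", records)
    connection.commit()

connection = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(transform(extract(RAW_TEXT)), connection)
row_count = connection.execute("SELECT COUNT(*) FROM ais").fetchone()[0]
```

Keeping each task in its own function mirrors the ETL separation: the source format or the target store can be swapped without touching the transformation rules.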

An ETL tool can extract data from text files, geodatabases, XML files, RDBMSs, etc. Though different ETL tools are available, for our current implementation we have used SQL Server Integration Services (SSIS) as the ETL tool for loading the bulk data (AIS and weather data) into the data warehouse. Fig. 5.4 shows the work flow of storing AIS data in the Maritime Data Warehouse.

[Work flow diagram: AIS data (gdb format) → ArcGIS → AIS data (text format) → SSIS bulk load → data warehouse → cleaning script]

Figure 5.4: Data Pre-processing Work Flow for Storing AIS Data in Maritime Data Warehouse

First, we convert the raw AIS data set for each month in each UTM Zone from gdb format to text format using ArcGIS. Then we load the text-format AIS data into the data warehouse (SQL Server 2014) using SSIS. Finally, we run our cleaning script on the loaded AIS data. The cleaning script, written in C#, applies the following rules.

• MMSI should be a 9-digit number.

• SOG should lie between 0 and 102 knots.

• COG should lie between 0 and 359 degrees.

• Out-of-range values of SOG and COG are interpolated.

• VoyageID should be unique within each month of voyage data in each UTM Zone.

• MMSI should be unique within each month of vessel static data in each UTM Zone.
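The rules above can be sketched as follows. This is an illustration in Python, not the C# script used in this work. The function names are hypothetical, and the interpolation scheme (linear between the nearest in-range neighbours by index, copying the nearest in-range value at the ends) is an assumption, since the rules do not specify one. Applied to COG, this simple scheme also ignores the wrap-around at 0/359 degrees.

```python
import re
from collections import Counter

SOG_RANGE = (0.0, 102.0)  # knots, from the rules above
COG_RANGE = (0.0, 359.0)  # degrees, from the rules above

def is_valid_mmsi(mmsi):
    """Return True when the MMSI consists of exactly nine digits."""
    return re.fullmatch(r"\d{9}", str(mmsi)) is not None

def interpolate_out_of_range(values, low, high):
    """Return a copy of values with out-of-range entries replaced.

    Assumed scheme: linear interpolation (by index) between the nearest
    in-range neighbours; at either end, copy the nearest in-range value.
    """
    valid = [i for i, v in enumerate(values) if low <= v <= high]
    if not valid:
        return list(values)  # nothing in range to interpolate from
    result = list(values)
    for i, v in enumerate(values):
        if low <= v <= high:
            continue
        before = [j for j in valid if j < i]
        after = [j for j in valid if j > i]
        if before and after:
            a, b = before[-1], after[0]
            weight = (i - a) / (b - a)
            result[i] = values[a] + weight * (values[b] - values[a])
        elif before:
            result[i] = values[before[-1]]
        else:
            result[i] = values[after[0]]
    return result

def duplicated(ids):
    """Return the identifiers that occur more than once (uniqueness rules)."""
    return sorted(key for key, count in Counter(ids).items() if count > 1)

sog = [10.0, 250.0, 14.0, -1.0]  # made-up SOG readings
cleaned_sog = interpolate_out_of_range(sog, *SOG_RANGE)
```

With the made-up readings above, the value 250.0 is replaced by the midpoint of its neighbours and the trailing -1.0 by the last in-range value. The `duplicated` helper would be run per month and per UTM Zone on VoyageID (voyage data) and on MMSI (vessel static data).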

Chapter 6

Normal Pattern Extractor

Input: A set of historical vessel tracks T = {T₁, ..., T_n} between an origin and a destination, where each T_i denotes a vessel track.

Output: The normal movement patterns for T, to be stored in the Maritime Data Warehouse.

Normal movement patterns are extracted by clustering the tracks in T. However, clustering the tracks at full length is computationally difficult (see Section 1.2). Full-length clustering is also undesirable for detecting anomalous movement. Typically, an instance of anomalous movement is confined to a small portion of a track, while the larger portion of the voyage is similar to other normal tracks. In this case, full-length clustering may treat an anomalous track as a normal one. A better approach is to partition each vessel track into shorter segments, each a subsequence of consecutive Vessel Position Reports, and then extract normal patterns and detect anomalous movement within these shorter segments.
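A small numeric illustration of this effect, using made-up numbers: suppose a track's deviation from a normal route is zero at every time instant except one. Here the mean deviation stands in for whatever distance a clustering method would use, and the threshold is hypothetical. Neither is taken from this work.

```python
# Made-up deviation of one track from a normal route at nine time instants.
deviation = [0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0]
THRESHOLD = 1.0  # hypothetical cut-off for "far from normal"
SEGMENT_LENGTH = 3  # reports per segment (assumption for this illustration)

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Full-length comparison: the single deviation is averaged away.
full_length_score = mean(deviation)
full_length_flagged = full_length_score > THRESHOLD

# Segment-wise comparison: the same deviation dominates its own segment.
segments = [
    deviation[start:start + SEGMENT_LENGTH]
    for start in range(0, len(deviation), SEGMENT_LENGTH)
]
segment_scores = [mean(segment) for segment in segments]
segment_flagged = [score > THRESHOLD for score in segment_scores]
```

Over the full length, the score stays below the threshold and the track looks normal, whereas the middle segment exceeds it on its own.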

We describe the partitioning of T into segments in Section 6.1, followed by the proposed clustering algorithm TSC (Track Segment Clustering) in Section 6.2. TSC is used to extract normal patterns within each segment of T.

6.1 Partitioning Vessel Tracks

Though different criteria and techniques could be adopted for partitioning tracks (e.g., see [15]), from a practical point of view the Coast Guard would be interested in observing vessel movement patterns in different time phases from the beginning of the voyage. For this reason, we partition the tracks in T, which have the same voyage duration, into s segments by time, denoted S₁(T), ..., S_s(T), where S_i(T) contains the ith segment of each track in T and s is the number of segments. Though the voyage duration of each track must be the same, it is not necessary for the tracks to have identical start or finish times.

[Plot: tracks T₁, T₂, T₃ from Origin to Destination over time instants 1 to 9; each point carries Longitude, Latitude, SOG, COG]

Figure 6.1: Set of Vessel Tracks T

[Plot: the same tracks over time instants 1 to 9, divided into four segments]

Figure 6.2: 4 Segments of Vessel Tracks T

Let us consider the set of three vessel tracks T = {T₁, T₂, T₃} depicted in Fig. 6.1, where the x-axis represents the time instant indexed by i, and t_i represents the Vessel Position Report at time instant i. Suppose that we partition the tracks into segments of PartitionWindow units of time, where PartitionWindow can be specified by the end user.

For example, each track is partitioned into four segments by the vertical lines (in red) shown in Fig. 6.2. The first segment of each track corresponds to the data at the first three time instants 1, 2, 3; the second segment corresponds to the data at the next three time instants 3, 4, 5 (with instant 3 shared by both segments); and so on. That is, the set of ith track segments is S_i(T) = {S_i(T₁), S_i(T₂), S_i(T₃)}, where 1 ≤ i ≤ 4.
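The partitioning in this example can be sketched in Python as below. The sketch rests on two assumptions: each track is a list with one Vessel Position Report per time instant, and a PartitionWindow of 2 time units spans three instants, so that consecutive segments share their boundary report as in Fig. 6.2. The function names are hypothetical.

```python
def partition_track(track, partition_window):
    """Split one track into segments spanning partition_window time units.

    Consecutive segments share their boundary report, so a window of 2
    over instants 1..9 yields (1,2,3), (3,4,5), (5,6,7), (7,8,9).
    """
    if partition_window < 1:
        raise ValueError("partition_window must be at least 1")
    segments = []
    start = 0
    while start < len(track) - 1:
        segments.append(track[start:start + partition_window + 1])
        start += partition_window
    return segments

def partition_tracks(tracks, partition_window):
    """Return [S_1(T), ..., S_s(T)]; entry i holds segment i of every track."""
    per_track = [partition_track(track, partition_window) for track in tracks]
    # zip(*) regroups by segment index; tracks are assumed equally long.
    return [list(group) for group in zip(*per_track)]

# Three made-up tracks with reports labelled (track name, time instant).
tracks = [[(name, i) for i in range(1, 10)] for name in ("T1", "T2", "T3")]
segment_sets = partition_tracks(tracks, partition_window=2)
```

Here `segment_sets` has four entries, one per segment index, and each entry contains the corresponding segment of T₁, T₂ and T₃.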

Though it is not straightforward to find the optimal PartitionWindow, the following criteria could be of interest to the user. If the chosen PartitionWindow is too big (the worst case being the total voyage duration of the entire track), we may miss unusual vessel movement, since anomalous events at sea generally do not last for long periods of time. On the other hand, if the chosen PartitionWindow is too small (the worst case being a single time series point), clustering would require enormous computation, since we would have a very large number of segments. Aside from the computational overhead, if the PartitionWindow is too small, there would not be enough AIS data points in each track segment for anomalous movement detection.
