Case Study - Spatial Data Processing in MapReduce

2.6 Spatial Data Processing in MapReduce

3.1.7 Case Study

We present a set of case studies using real GPS trajectory datasets. We demonstrate how our application can be used to integrate data from different sources and formats into a single format. For each case, we provide an overview of the input raw data, as well as the Input and Output TDDF scripts, and the parsed data. The sources of the data files, as well as some attribute values, will be omitted for privacy reasons. For the sake of simplicity, and to demonstrate how our system can be used to integrate datasets into a common format, we output all datasets using the SPATIAL TEMPORAL output format.

CASE 1: This dataset contains one GPS trajectory record per file. The first six lines of each file contains some descriptions about the source dataset, and can be ignored. The remainder lines contains the list of trajectory coordinates, one coordinate per line. Coordinates contain both spatial-temporal and semantic attributes. An overview of the dataset records and its corresponding TDDF are given below.

Input Trajectory Data 1:

40.008304,116.319876,0,492,39745.0902662037,2008-10-24 40.008413,116.319962,0,491,39745.0903240741,2008-10-24 . . .

40.009209,116.321162,0,84,39745.1160416667,2008-10-24

Input TDDF Script 1:

# One record per file RECORDS DELIM EOF

# Coordinates in Long/Lat COORD SYSTEM GEOGRAPHIC

IGNORE LINES [1-6]

# Auto generate ID with prefix ’db1 t’

AUTO ID db1 t

# The list of coordinates in each trajectory record. # Field 1: Latitude in decimal degrees.

# Field 2: Longitude in decimal degrees. # Field 3: All set to 0 for this dataset. # Field 4: Altitude in feet.

# Field 5: Number of days since 12/30/1899, with fractional part. # Field 6: Date as a string.

# Field 7: Time as a string.

COORDINATES ARRAY( LAT DECIMAL ,

LON DECIMAL ,

zeroVal INTEGER ,

alt INTEGER ,

timeFrac DECIMAL ,

TIME DATETIME("yyyy-MM-dd") LN) EOF

CASE 2: This dataset contains several GPS trajectory record per file, every record is delimited by the character #. The first line of each record contains a set of semantic attributes of the trajectory, followed by the list of trajectory coordinates, one coordinate per line. Coordinates contain both spatial-temporal and semantic attributes. An overview of the dataset records and its corresponding TDDF are given below.

Input Trajectory Data 2:

#,1,3/2/2009 9:23:12 AM,3/2/2009 10:02:17 AM,10.4217737338017 km 3/2/2009 9:23:12 AM,39.929961,116.355872,23570 3/2/2009 9:23:42 AM,39.926785,116.356007,23526 . . . 3/2/2009 10:02:17 AM,39.950725,116.295991,27942 #,2,3/2/2009 10:04:14 AM,3/2/2009 10:56:23 AM,13.1721183785493 km 3/2/2009 10:04:14 AM,39.969738,116.288209,32482 3/2/2009 10:04:44 AM,39.973138,116.288661,13208 . . . 3/2/2009 10:56:23 AM,39.99992,116.352966,37268 Input TDDF Script 2: RECORDS DELIM #

3.1. TRAJECTORY DATA INTEGRATION AND

REPRESENTATION 55

COORD SYSTEM GEOGRAPHIC

# Creates new IDs with prefix ’db2 t’

AUTO ID db2 t

# Ignore the first empty attribute, and the integer ID IGNORE ATTR ,

IGNORE ATTR ,

timeIni STRING ,

timeEnd STRING ,

length STRING LN

COORDINATES ARRAY( TIME DATETIME("M/d/yyyy HH:mm:ss a") ,

LAT DECIMAL ,

LON DECIMAL ,

alt INTEGER LN) #

CASE 3: This dataset contains several GPS trajectory records per file, one record per file line. The dataset contains trajectories from cars, with the list of trajectory coordinates, and a set of semantic attributes. This dataset had been used for map-matching, hence the coordinate points also contain semantic attributes regarding map-matching. A record in this dataset, corresponding to a single line in the input file, and its corresponding TDDF are given below.

Input Trajectory Data:

10000018 1427933750,10000018,1,27|27|27|19,3639865:0:57:114.33708: 30.50130:1427933750|3639862:6:59:114.33715:30.50128:1427933759|3624382: 46:33:114.33728:30.50168:1427933759|3624382:98:12:114.33742:30.50213: 1427933772|3624382:131:17:114.33752:30.50242:1427933778|3630066:0:35: 114.33752:30.50242:1427933772|3630066:0:17:114.33752:30.50242:1427933778 Input TDDF Script: RECORDS DELIM LN

COORD SYSTEM GEOGRAPHIC

ID STRING ,

# The moving object which generated this trajectory

sourceId INTEGER ,

# Car type: personal car=1, taxis=2, others=0

carType INTEGER ,

citySequence ARRAY(cityId INTEGER |) ,

# the distance between each mapped point, the distance of mapping, # longitude, latitude, time

COORDINATES ARRAY(linkID INTEGER : oDistance INTEGER : mDistance INTEGER :

LON DECIMAL :

LAT DECIMAL :

TIME INTEGER |) LN

For all three cases, the output TDDF is the following, since in all cases the input datasets have been parsed to the same output format.

Output TDDF Script:

OUTPUT FORMAT SPATIAL TEMPORAL COORD SYSTEM GEOGRAPHIC

DECIMAL PREC 5

ID STRING

COORDINATES ARRAY( LON DECIMAL LAT DECIMAL TIME INTEGER)

The output formated data, in CSV and BSON documents, for the three datasets is the following. Notice that now all datasets are in the same format SPATIAL TEMPORAL.

Output Trajectory Data (.csv):

db1 t 1;11631987,4000830,1224814199000,8,10,5000 db2 t 1;11635587,3992996,1235985792000,13,-318,30000 db2 t 2;11628820,3996973,1235988254000,46,340,30000 1018 1450,11433708,3050130,1427933750,7,-2,9

Output Trajectory Data (.bson):

{ id : "db1 t 1", coordinates : [11631987,4000830,1224814199000, 8,10,5000]} { id : "db2 t 1", coordinates : [11635587,3992996,1235985792000, 13,-317,30000]} { id : "db2 t 2", coordinates : [11628820,3996973,1235988254000, 46,340,30000]} { id : "1018 1450", coordinates : [11433708,3050130,1427933750, 7,-2,9]}

3.1. TRAJECTORY DATA INTEGRATION AND

REPRESENTATION 57

Notice that the output format for attributes is slightly different than that in the input format. Since the output data is always in a CSV-like style, where all attributes are semicolon-separated, and array items as comma-separated by default, the delimiters in the output TDDF are therefore omitted. The output TDDF can be used by any trajectory data reader implementation as the common format of all parsed datasets.

CASE 4 (Unsupported formats): Despite our efforts to provide a universal parser, we understand that some data formats may still not fit perfectly in our parser. However, some painless data prepro- cessing can be done in the raw data to make it fits our model, and thus be integrated into a common format. For instance, our research group had access to a dataset collected by a private bus company, which collected the GPS locations of all their buses after certain time interval, and stored all GPS coordinates collected at the same time together in a text file. Consequently, the GPS coordinates for a given bus trip were spread across multiple files. Since the GPS records also contained the buses IDs and trip IDs, we simply had to perform a quick sort-and-aggregate algorithm to group coordinates of a same bus and trip into the same file sorted by time-stamp. After that, the trajectory records could be easily parsed by our application.

In document A distributed in-memory database system for large-scale spatial-temporal trajectory data (Page 71-75)