CHAPTER 3. SHRP2 NATURALISTIC DRIVING DATA
3.3 Data Integration
The research team was provided with an individual comma-separated values (CSV) file for each of the requested traces. The first step was to combine these files into datasets suitable for examining the research questions. To visualize the traces in the ArcMap environment and to extract geometric information from the RID, each timestamp in the time-series data needed valid longitude and latitude information. This information was supposed to be provided at one-second intervals; however, it was missing for some or, in rare cases, all of the timestamps of a single trip. Consequently, only those instances with valid longitude and latitude information were retained in the dataset. This process resulted in the loss of part or all of a number of trips, so subsequent analyses of those trips needed to be conducted with caution. Once the traces with valid geographic information were identified, they were visualized in the ArcMap environment. Figure 11 shows that the obtained traces were scattered across states and were not necessarily within the boundaries of the six study areas (highlighted in aqua). Some additional traces were therefore lost, as the RID only covers the aforementioned six states. Subsequently, separate datasets were created for each state for conflation purposes, as the RID is state-based.
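The combination and filtering step can be illustrated with a minimal sketch. It is not the script used in this study; the column names `latitude` and `longitude`, the `trip_id` key, and both function names are assumptions made for illustration, and only the Python standard library is used.

```python
import csv
import io


def has_valid_position(row, lat_field="latitude", lon_field="longitude"):
    """Return True when both coordinates parse as in-range numbers."""
    try:
        lat = float(row[lat_field])
        lon = float(row[lon_field])
    except (KeyError, TypeError, ValueError):
        # Missing column, short row, or blank / non-numeric value.
        return False
    # NaN fails both comparisons, so it is rejected here as well.
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def combine_traces(sources):
    """Merge per-trip CSV streams, keeping only georeferenced rows.

    `sources` maps a (hypothetical) trip identifier to an open text stream.
    Returns the retained rows and the number of rows dropped per trip.
    """
    kept, dropped = [], {}
    for trip_id, stream in sources.items():
        dropped[trip_id] = 0
        for row in csv.DictReader(stream):
            if has_valid_position(row):
                kept.append({"trip_id": trip_id, **row})
            else:
                dropped[trip_id] += 1
    return kept, dropped


# Two made-up traces: trip A loses one timestamp, trip B loses all of them.
trip_a = io.StringIO(
    "timestamp,latitude,longitude\n0,42.03,-93.62\n1000,,\n2000,42.04,-93.63\n"
)
trip_b = io.StringIO("timestamp,latitude,longitude\n0,,\n1000,,\n")
rows, dropped = combine_traces({"A": trip_a, "B": trip_b})
```

Keeping a per-trip count of dropped rows makes it straightforward to flag the trips that were partially or entirely lost and therefore call for caution in later analyses.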
The RID uses a linear referencing system as its method of spatial referencing, in which the location of a feature is described in terms of a measurement along a linear element from a predetermined starting point. However, the obtained traces only included GPS outputs containing longitude and latitude. As a result, the raw data first had to be converted to the linear referencing system. A Python script was developed to perform this task. Once the conversion was done, each point was assigned a route identifier and a measure along that route, which could then be used to extract other geometric features from the RID.
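The idea behind this conversion can be sketched as follows. This is not the script developed for this study; it is a simplified illustration that assumes the GPS points and route centerlines have already been projected to a planar coordinate system, and the function names and sample routes are hypothetical. Each point is projected onto every candidate route, and the nearest route supplies the route identifier and the measure.

```python
import math


def locate_along_route(point, vertices, start_measure=0.0):
    """Project `point` onto a polyline and return (measure, offset).

    `vertices` are planar (x, y) pairs ordered in the direction of
    increasing measure. Returns (None, inf) for a degenerate polyline.
    """
    px, py = point
    best_measure, best_offset = None, math.inf
    travelled = start_measure
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:]):
        dx, dy = x2 - x1, y2 - y1
        seg_len = math.hypot(dx, dy)
        if seg_len == 0.0:
            continue  # skip duplicate vertices
        # Fraction along the segment of the closest point, clamped to its ends.
        t = ((px - x1) * dx + (py - y1) * dy) / seg_len**2
        t = max(0.0, min(1.0, t))
        offset = math.hypot(px - (x1 + t * dx), py - (y1 + t * dy))
        if offset < best_offset:
            best_measure, best_offset = travelled + t * seg_len, offset
        travelled += seg_len
    return best_measure, best_offset


def assign_route(point, routes):
    """Return (route_id, measure, offset) for the route nearest to `point`."""
    best = None
    for route_id, vertices in routes.items():
        measure, offset = locate_along_route(point, vertices)
        if measure is not None and (best is None or offset < best[2]):
            best = (route_id, measure, offset)
    return best


# Hypothetical routes in planar units; the point lies on the second leg of R1.
routes = {"R1": [(0, 0), (10, 0), (10, 10)], "R2": [(0, 5), (5, 5)]}
route_id, measure, offset = assign_route((10, 4), routes)
```

Choosing the nearest route is the simplest possible rule; it is also the reason points near adjacent or crossing roadways can be assigned to the wrong route, so the returned offset is worth keeping as a quality check.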
Figure 11. Map of the obtained traces
Once the time-series data were converted to the appropriate referencing system, geometric features were conflated (i.e., linked) to each data point using the ArcMap tool “Overlay Route Events”. A dynamic segmentation process was used, in which the relevant attributes were queried from each shapefile based on the route identifier and the mile point. The dynamic segmentation process is briefly described in the following steps:
1. The attribute table of the shapefile of interest was queried for the RouteIDs present in the time-series data and exported as a dBase file in ArcMap. This step substantially reduced the amount of underlying data to be read and analyzed, and noticeably shortened the processing time.
2. To conflate the time-series data to the shapefile of interest, the “Overlay Route Events” tool from the linear referencing tools menu in ArcToolbox was used. The time-series dataset was selected as the “Input Event Table”. Since each row in the time-series data corresponded to one point along the trip trace, the “Event Type” was set to “POINT”. Subsequently, “FrMeasure” was selected as the “Measure Field”. Because this table contains point events, the “To-Measure Field” was disabled.
3. The dBase file exported in step 1 was selected as the “Overlay Event Table”. Unlike the input table, which was of point type, all the tables to be overlaid were in line format. Consequently, the “Event Type” was set to “LINE” for all these tables. In this case, both the “From-Measure Field” and the “To-Measure Field” needed to be specified; these corresponded to the start and end points of the layer being overlaid. Ultimately, the output was exported and saved as a comma-separated values (CSV) file in the desired location. These steps are shown in Figure 12.
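For readers who prefer scripting to the tool dialog, the steps above could be expressed with the `arcpy` Linear Referencing toolbox roughly as follows. This is a hedged sketch rather than the workflow actually used, which was carried out through the ArcMap interface. The field names `FromMeasure` and `ToMeasure`, the table paths passed by the caller, and the wrapper function are hypothetical, and the overlay type is left as a parameter because the text does not state which option was chosen. Running the overlay requires an ArcGIS installation, so `arcpy` is imported only inside the wrapper.

```python
def point_event_properties(route_field, measure_field):
    """Build the 'route POINT measure' string expected for point events."""
    return f"{route_field} POINT {measure_field}"


def line_event_properties(route_field, from_field, to_field):
    """Build the 'route LINE from to' string expected for line events."""
    return f"{route_field} LINE {from_field} {to_field}"


def overlay_trace_on_layer(trace_table, layer_table, out_table,
                           route_field="RouteID",
                           trace_measure="FrMeasure",
                           layer_from="FromMeasure",   # hypothetical name
                           layer_to="ToMeasure",       # hypothetical name
                           overlay_type="INTERSECT"):  # or "UNION"
    """Overlay point events (trip trace) on line events (RID attributes)."""
    import arcpy  # imported lazily: only available with ArcGIS installed

    arcpy.lr.OverlayRouteEvents(
        trace_table,                                         # input event table
        point_event_properties(route_field, trace_measure),  # step 2
        layer_table,                                         # overlay event table
        line_event_properties(route_field, layer_from, layer_to),  # step 3
        overlay_type,
        out_table,
        point_event_properties(route_field, trace_measure),  # output stays point
    )
```

The two small helper functions mirror the choices made in the dialog: a point table needs only one measure field, whereas a line table needs both a from-measure and a to-measure.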
This dynamic segmentation process was used to extract desired features from various RID shapefiles.
Table 4 lists the shapefiles and the features extracted from the RID as part of this dissertation. The information for each point along the event traces was extracted from the record with an identical Route ID and a From- and To-Measure that defined a segment containing the queried point. Fields were left blank if no record matched these conditions.
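The matching rule described above can be illustrated without ArcMap. The sketch below is a simplified stand-in for the overlay, not the tool's implementation; the field names `FromMeasure` and `ToMeasure`, the function name, and the sample records are assumptions for illustration.

```python
from collections import defaultdict


def overlay_points_on_lines(points, segments, fields):
    """Attach segment attributes to each point event.

    A segment matches when it has the same RouteID as the point and its
    [FromMeasure, ToMeasure] range contains the point's FrMeasure.
    Unmatched points keep blank values for the requested fields.
    """
    by_route = defaultdict(list)
    for seg in segments:
        by_route[seg["RouteID"]].append(seg)

    result = []
    for pt in points:
        # A point exactly on a shared boundary takes the first matching segment.
        match = next(
            (s for s in by_route.get(pt["RouteID"], [])
             if s["FromMeasure"] <= pt["FrMeasure"] <= s["ToMeasure"]),
            None,
        )
        row = dict(pt)
        for name in fields:
            row[name] = match[name] if match else ""
        result.append(row)
    return result


# Made-up median records and trace points.
medians = [
    {"RouteID": "R1", "FromMeasure": 0.0, "ToMeasure": 1.5, "MedianType": "Raised"},
    {"RouteID": "R1", "FromMeasure": 1.5, "ToMeasure": 3.0, "MedianType": "None"},
]
trace = [
    {"RouteID": "R1", "FrMeasure": 0.7},
    {"RouteID": "R1", "FrMeasure": 2.0},
    {"RouteID": "R2", "FrMeasure": 0.2},  # no median record on this route
]
conflated = overlay_points_on_lines(trace, medians, ["MedianType"])
```

Grouping the segments by route first serves the same purpose as step 1 of the procedure: each point is compared only against records on its own route.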
Figure 12. A snapshot of the conflation process
Table 4. RID Shapefiles and the Associated Extracted Information
Shapefile    | Extracted information                                             | Polyline | Point
Alignment    | Tangent vs. curve, curve radius, curve direction, superelevation  | x        |
Location     | Grade, cross slope                                                | x        |
Lane         | Number of lanes by type, lane width                               | x        |
Median       | Median type                                                       | x        |
Shoulder     | Shoulder type, shoulder width                                     | x        |
Barrier      | Barrier type                                                      | x        |
Rumble Strip | Location (edge line vs. shoulder vs. centerline)                  | x        |
Sign         | MUTCD code, message                                               |          | x
In contrast to the other shapefiles in the RID, the speed limit and advisory speed data (i.e., all sign-related information) were in point format. Since the time-series data were also in point format, the procedure detailed above could not be used to extract this type of data from the RID: the conflation process requires at least one of the two tables to be of line type. Therefore, to extract the speed limit data, polyline shapefiles were developed from the sign inventory. To derive the speed limit at each point, the “signs” shapefile from the RID was queried to identify the signs that convey the statutory speed limit. According to the MUTCD, the code R2-1 corresponds to regulatory speed limit signs, and this code was used to query the shapefile. The output of this query included location information (RouteID and mile point) as well as the associated sign message (i.e., the posted speed limit). Speed limits were assumed to be constant between two consecutive signs, meaning that the begin mile point of each sign was the end mile point of the previous sign. Consequently, using this line-based dBase table, speed limit information was extracted following the conflation process outlined previously.
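The construction of line events from point signs can be sketched as below. This is an illustration under stated assumptions, not the procedure's actual code: the field names `MilePoint` and `SpeedLimit`, the function name, and the sample signs are hypothetical; the signs are assumed to have already been filtered to MUTCD code R2-1; and signs are assumed to apply in the direction of increasing mile point. The text does not say where the last segment on a route ends, so its end is left open unless a route end is supplied.

```python
from itertools import groupby
from operator import itemgetter


def signs_to_segments(signs, route_end=None):
    """Turn point speed limit signs into line events.

    Each sign's limit is assumed to hold until the next sign on the same
    route. The last sign on a route ends at `route_end[route_id]` when
    that is given, otherwise its ToMeasure is None (open-ended).
    """
    route_end = route_end or {}
    ordered = sorted(signs, key=itemgetter("RouteID", "MilePoint"))
    segments = []
    for route_id, group in groupby(ordered, key=itemgetter("RouteID")):
        group = list(group)
        for current, following in zip(group, group[1:] + [None]):
            if following is not None:
                end = following["MilePoint"]
            else:
                end = route_end.get(route_id)
            segments.append({
                "RouteID": route_id,
                "FromMeasure": current["MilePoint"],
                "ToMeasure": end,
                "SpeedLimit": current["SpeedLimit"],
            })
    return segments


# Made-up R2-1 signs, deliberately unsorted.
signs = [
    {"RouteID": "R1", "MilePoint": 5.0, "SpeedLimit": 55},
    {"RouteID": "R1", "MilePoint": 0.0, "SpeedLimit": 65},
    {"RouteID": "R1", "MilePoint": 12.5, "SpeedLimit": 45},
    {"RouteID": "R2", "MilePoint": 1.0, "SpeedLimit": 35},
]
speed_segments = signs_to_segments(signs, route_end={"R1": 20.0})
```

The resulting records have the same route, from-measure, and to-measure structure as the other line-type tables, so they can be overlaid on the trace points in the same way.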
While the outlined approach performed relatively well in conflating RID features to the obtained trip traces, some issues needed closer investigation and are detailed here:
• Wrong Conflation: Adjacency to other roadways may result in conflation issues. During data collection by the mobile van, the collected data were assigned to the closest roadway; thus, in some cases, multiple sets of information may have been conflated to a single road segment.
• Lack of Directional Data on Undivided Roadways: In the RID, divided roadways (e.g., freeways in the context of this study) were assigned two different RouteIDs, one for each direction of travel. However, this was not the case for undivided roadways, for which a single RouteID covered both directions. Consequently, attributes corresponding to the opposing direction were likely to be conflated. This required further investigation of the resulting tables to match the coded attributes for the same side of the roadway centerline.
Figure 13 displays a flow chart of the logic used to eliminate the irrelevant features extracted in the conflation process.
Figure 13. Flow chart of the logic used to resolve the conflation issues
Once these issues were resolved, comprehensive datasets including the time-series data, geometric features from the RID, and InSight supplementary data were created. Further details on how the raw data were queried and requested, as well as the dataset structures, are discussed separately in the following chapters.