Data Quality Tool User Guide
2 NAVIGATING THE DATA QUALITY TOOL
Chapter 3 Data Quality Tool User Guide
Hybrid Traffic Data Collection Roadmap: Objectives and Methods 28
2 NAVIGATING THE DATA QUALITY TOOL
Figure 3-1: Data Quality Tool default page
Figure 3-1 shows the default page of the Data Quality Tool. The left panel contains all input fields for the tool. The right panel displays a graph and a map based on the values of the input fields. The output is generated by selecting the inputs in the left panel, then clicking the ‘Get Data’ button located below the input fields. Output is also generated by selecting any of the options in field 6, ‘Data & Metrics’.
The following sections describe each of the input fields and options. To view the same data shown in this guide (if available) on your version of the Data Quality Tool, simply select identical input options in the left panel, then click the ‘Get Data’ button.
Mouseover the different time bins in the graphs to view the values of the selected metric over time.
Generate data by clicking the ‘Get data’ button.
Network 179
Chapter 3 Data Quality Tool User Guide
Hybrid Traffic Data Collection Roadmap: Objectives and Methods 29
SELECTING A DATA FEED 2.1
Field 1 contains a dropdown list of all available data feeds. Data feeds usually represent companies or entities that are in the business of collecting probe data, or datapoints.
Users can select only one feed at a time for analysis.
SELECTING A NETWORK (STUDY SITE) 2.2
Field 2 lists all available networks. In this user guide, we will be illustrating most examples with data from network 179, an approximately 12-mile section of the I-880 freeway northbound starting from the city of Fremont, California and ending in the city of Hayward, California.
The map in the DQ tool (shown below) highlights the selected network, 179. Clicking on different areas of the highlighted network on the map displays the number of datapoints collected by the selected data feed for the selected section of the network within the selected date range. The map in the tool below shows that there were 3208 datapoints found in section 9 of network 179 from Mar 2 2012 at 12am and
Chapter 3 Data Quality Tool User Guide
Hybrid Traffic Data Collection Roadmap: Objectives and Methods 30
before Mar 10 2012 at 12am. In the map below, section 9 roughly represents the ninth mile of network 179.
Note that maps highlighting the selected network should appear in the right panel of all screens in the Data Quality Tool whenever output is generated for valid input. The functionality of the highlighted highway remains exactly the same on every screen where it appears. The highlighted network is clickable at different segments of the route. Each segment of the network represents a 1600-meter stretch of road (roughly one mile) . The number that appears after clicking the segment represents the number of datapoints in that 1600-meter segment of the network within the date range determined by the values in the start and end date fields.
Chapter 3 Data Quality Tool User Guide
Hybrid Traffic Data Collection Roadmap: Objectives and Methods 31
SELECTING A DATE RANGE 2.3
Figure 3-2: Date and Time fields
Figure 3-2 defines the date and time range of the data displayed for analysis. The data being analyzed will include all datapoints whose associated dates are greater than or equal to the start date and whose associated dates are less than the end date (i.e., excluding datapoints at the end date). Users can manually type in dates in the start date and end date fields. If users manually type in dates, the format must be exactly as follows: 1) the three-character alphabetic month (e.g., Mar) followed by one space;
2) one- or two-digit day followed by one space; then 3) the four-digit year. For example, Mar 2 2012 or Mar 02 2012 would both be valid strings to enter in the date fields.
Clicking in either the start or end date fields displays a calendar on which users can click and select dates. Users can select the time of day just below the date fields. Time of day on the hour (e.g., 12:00AM), 15 minutes, 30 minutes, and 45 minutes after the hours (e.g., 2:15pm, 2:30pm; 2:45pm), can be selected.
Note that networks commonly only contain data for a short range of days. Therefore, it can be
cumbersome to search for valid date ranges using the tool. For example, as of February 2013, only data from Mar 2, 2012 to Mar 20, 2012 existed for network 179. To find valid date ranges for your network of interest, request someone with database access to find distinct days for which data exists for your network of interest.
Chapter 3 Data Quality Tool User Guide
Hybrid Traffic Data Collection Roadmap: Objectives and Methods 32
AVERAGING DATA OVER THE DAY OF WEEK 2.4
Figure 3-3: The Day of Week option averages values over the days of the week.
Checking the Day of Week checkbox in Field 4 generates a graph that displays the average value of a metric over selected days of the week. The report in Figure 3-3 shows an average of 194 datapoints in the 1-hour time bin from 7:00am to 8:00am (excluding datapoints at 8:00am) on all Tuesdays,
Wednesdays, and Thursdays between Mar 2 2012 and Mar 10 2012. Here, we know that the time bin has a size of 1 hour because the next field we will describe, ‘Aggregation period’, contains a value of 1 hour. In addition, the description below the graph indicates ‘Data aggregated hourly’.
The x-axis will always represent a single 24-hour period. This is expected since the Day of Week feature gives the average measure of a metric over the selected days at a particular time bin. This will be the case for all metrics when the Day of Week feature is selected.
Chapter 3 Data Quality Tool User Guide
Hybrid Traffic Data Collection Roadmap: Objectives and Methods 33
SELECTING AN AGGREGATION PERIOD 2.5
Figure 3-4: Aggregation Period
The Aggregation Period field determines the size of the time bins on the x-axis. The possible values for this field are: 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, and 1 day. Therefore, if an
aggregation period of 15 minutes is chosen, the graph will display data for all 15-minute time bins as shown in Figure 3-4, where we observe 18 datapoints in the time bin from 6:15am to 6:30am on March 2 2012 (excluding datapoints at 6:30am). Together with the date and time range, the Aggregation Period defines the time bins on the x-axis.
Chapter 3 Data Quality Tool User Guide
Hybrid Traffic Data Collection Roadmap: Objectives and Methods 34
SELECTING A DATA TYPE 2.6
Figure 3-5: Data Type
The ‘DATA TYPE’ option filters ‘Map-matched Data’ or ‘Raw Data’. (Map-matched data is GPS probe data that has been processed through the Path Inference Filter described in Chapter 4. The filter maps the probe data points to the road network and plots their trajectories.)
Map-matched data in the DQ Tool refers to datapoints that are associated with sections of networks.
We identify these sections of the network as “space bins” that can be visualized on the y-axis of the Speed metric in Figure 3-6 below.
By default, the DQ Tool sets the size of these space bins to 1600 meters (approximately 1 mile).
On the y-axis are 12 space bins. The first space bin (labeled ‘1600 m’) represents the space bin from the beginning of network 179 to the 1600th meter in the network (approximately the 1-mile mark) and the top-most space bin (labeled ‘19200 m’) represents the final bin in the network (approximately the last mile of the network).
Note, the start and end may vary depending on the direction of the network. The representation below in Figure 3-6 is convenient because 179 is a northbound network.
If the network highlighted in the map below ran southbound, the first space bin located at the bottom of the graph of Figure 3-6 would represent the northernmost part of the highlighted network and the space bin at the top of the graph would represent the southernmost point of the highlighted network in the map. If the network were eastbound, the first space bin would correspond to the westernmost part
Chapter 3 Data Quality Tool User Guide
Hybrid Traffic Data Collection Roadmap: Objectives and Methods 35
of the highlighted network in the map and the space bin at the top would correspond to the easternmost part; and vice versa if the highlighted network were westbound.
Figure 3-6: Map-Matched graph and corresponding highlighted network on the map
The size of the space bins can be configured to a minimum of 100 meters. Space bins can also be configured to be larger than 1600 meters, but the rule of thumb for sizing space bins for visual purposes is to keep the total number of space bins at a minimum of 7; otherwise, the space bins may not display clearly.
Raw data, on the other hand, are not associated with specific segments of the network; we only know that raw datapoints were located within the bounding polygon region associated with the network of interest.
There are two metrics that are true map-matched metrics, meaning their y-axes represent space bins:
1) Speed; and 2) Space-time Coverage. These metrics can only be displayed using map-matched data, not using raw data.
DQ Tool’s two map-matched metrics: Speed and Space-time Coverage
There are three metrics that are not true map-matched metrics (meaning their y-axes do not represent space bins), but can still be displayed in the DQ Tool using map-matched data. These metrics, which will be described in more detail in the next section, are: 1) Time Coverage; 2) Sampling Rate; and 3) Unique
Top/bottom space bins of network 179 in graph corresponds to start and end of the highlighted in the map.
Chapter 3 Data Quality Tool User Guide
Hybrid Traffic Data Collection Roadmap: Objectives and Methods 36
Devices. The purpose of displaying these metrics using map-matched data is to be able to compare map-matched datapoints with raw datapoints for these metrics.
In addition, there are two more metrics which can only be displayed using raw data: 1) Provider Transmission Delay and 2) Total Transmission Delay.
The following table lists each metric of the DQ Tool and indicates what type of datapoints can be used to display each metric.
Metric Can Be Displayed with
Map-Matched Data?
Can Be Displayed with Raw Data?
True Map-matched Metric?
(i.e., y-axis represents space bins)
Time Coverage X X
Sampling Rate X X
Unique Devices X X
Speed X X
Space-time Coverage X X
Provider Transmission Delay X
Total Transmission Delay X
These metrics are described in more detail in the following section.
Chapter 3 Data Quality Tool User Guide
Hybrid Traffic Data Collection Roadmap: Objectives and Methods 37
3 METRICS
To view any of the figures displayed in this document on your version of the Data Quality Tool, simply select identical input options in the left panel (if available), then click the Get Data button.
Note: Although shown in the screenshots, the Penetration rate metric is not currently operational in the DQ Tool. It is a placeholder for future development.
TIME COVERAGE 3.1
Figure 3-7: Time Coverage Metric
The Time Coverage metric represents the total number of datapoints found in the selected network within a specific time bin. The graph in Figure 3-7 shows that the 30-minute time bin from 9:30am to 10:00am on March 2, 2012 contains 44 datapoints in network 179.
As was discussed in the previous section, 2.6, although the Time Coverage metric can be displayed using both map-matched data as in Figure 3-6, this metric is not truly map-matched since its measure on the y-axis counts all the datapoints in the entire network, not within specified time bins within the network as is the case for the Speed and Time Coverage metrics.
The graph in Figure 3-7 has a configurable field for a rolling average in the graphs. The scale of this field is 1 aggregation Period. For example, a value of 1 in Figure 3-7 represents a rolling average of 30 minutes; a value of 2 represents a rolling average of 60 minutes; a value of 3 would represent 90 minutes; etc.
44 datapoints in the 9:30am to 10:00am time bin.
Configurable rolling average.
Chapter 3 Data Quality Tool User Guide
Hybrid Traffic Data Collection Roadmap: Objectives and Methods 38
Being able to display the Time Coverage metric using both map-matched or raw data allows users to compare the two types of data. Using the same date range and aggregation period for network 179 below, we see that the general pattern of this metric is the same for both map-matched and raw data even though the actual number of datapoints (values on the y-axis) may differ:
Comparing Map-matched versus Raw Data for Time Coverage Metric
Comparisons such as these can be made for the other 2 metrics that can be displayed with both map-matched and raw data: Sampling Rate and Unique Devices.
Also, all metrics in the DQ Tool (except for Speed and Space-time Coverage) have a configurable field for rolling averages in their graphs. The scale of this field is 1 aggregation Period. For example, here a value of 1 represents a rolling average of two hours.
Map-matched vs.
Raw
Chapter 3 Data Quality Tool User Guide
Hybrid Traffic Data Collection Roadmap: Objectives and Methods 39
SAMPLING RATE 3.2
Figure 3-8: Sampling Rate Metric
Selecting the Sampling Rate metric in the DQ Tool displays two metrics on a graph: Sampling Rate and Unique Devices.
Note: Although labeled as “rate,” the Sampling Rate metric presents data as seconds rather than hertz. It is thus actually displaying the sampling period and should be understood as such.
The Sampling Rate metric refers to the average number of seconds between successive detections of a device. In order to calculate a Sampling Rate for a device, there must be at least 2 detections for a device within a single time bin. The more times a device is detected within a time bin, the lower the value of the Sampling Rate.
Figure 3-8 shows 101 unique devices, each of which were detected at least twice in the 6:00am to 7:00am time bin on March 2, 2012. Figure 3-8 indicates that for the 101 Unique Devices that were detected at least twice within this 1-hour time bin, the average number of seconds between successive detections of each device was 98.03 seconds.
The Sampling Rate metric can be viewed using both map-matched and raw data.
Average Sampling Rate value in green.
Configurable rolling average.
Chapter 3 Data Quality Tool User Guide
Hybrid Traffic Data Collection Roadmap: Objectives and Methods 40
UNIQUE DEVICES 3.3
Figure 3-9: Unique Devices Metric
The Unique Devices metric can be displayed by itself. As before, this count represents all devices in the given time bin that were detected at least twice. Figure 3-9 indicates that there are 101 Unique Devices which were detected at least twice in the 6:00am to 7:00am time bin on March 2, 2012. This is the same count shown for Unique Devices shown in the same time bin in the previous example for the Sampling Rate metric.
The Unique Devices metric can be viewed using map-matched and raw data.
Unique Devices count should match with Unique Devices count in Sampling Rate for the same time bin.
Configurable rolling average.
Chapter 3 Data Quality Tool User Guide
Hybrid Traffic Data Collection Roadmap: Objectives and Methods 41
PROVIDER TRANSMISSION DELAY 3.4
Figure 3-10: Provider Transmission Delay Metric
Figure 3-11: Provider Transmission Delay Metric (using log scale option)
Provider Transmission Delay represents the average amount of time that elapsed (in seconds) between the time devices were detected and the time the detection was inserted into the Provider’s database.
The green lines in Figure 3-10 and Figure 3-11 represent the average amount of time elapsed for devices
Average number of seconds that elapsed for devices that were detected between 10pm and 11pm on March 2, 2012.
Average number of seconds that elapsed for devices that were detected between 10pm and 11pm on March 2, 2012.
Additional option to display the log scale for a better visual of transmission delay graph.
Configurable rolling average.
Chapter 3 Data Quality Tool User Guide
Hybrid Traffic Data Collection Roadmap: Objectives and Methods 42
detected between 10pm and 11pm on Mar 2, 2012 before the detection was inserted into the PATH database. Figure 3-11 is the same representation of Figure 3-10, but using the “Use log scale” option in Field 7 ‘Addition option’ in order to display the graph more clearly.
The Provider Transmission Delay metric can only be viewed using raw data.
TOTAL TRANSMISSION DELAY 3.5
Figure 3-12: Total Transmission Delay Metric
Average number of seconds that elapsed for devices that were detected between 10pm and 11pm on March 2, 2012.
Chapter 3 Data Quality Tool User Guide
Hybrid Traffic Data Collection Roadmap: Objectives and Methods 43
Figure 3-13: Transmission Delay (Using log scale option)
Total Transmission Delay represents the average amount of time that elapsed (in seconds) between the time devices were detected and the time the detection was inserted into the PATH database. The green lines in Figure 3-12 and Figure 3-13 represent the average amount of time elapsed for devices detected between 10pm and 11pm on Mar 2, 2012 before the detection was inserted into the PATH database. Figure 3-13 is the same representation of Figure 3-12, but using the “Use log scale” option in Field 7 ‘Addition option’ in order to display the graph more clearly.
The Total Transmission Delay metric can only be viewed using raw data.
Average number of seconds that elapsed for devices that were detected between 10pm and 11pm on March 2, 2012.
Additional option to display the log scale for a better visual of transmission delay graph.
Chapter 3 Data Quality Tool User Guide
Hybrid Traffic Data Collection Roadmap: Objectives and Methods 44
SPEED 3.6
Figure 3-14: Speed Metric
The Speed metric takes the average speed of all datapoints within a given space-time bin. Clicking on the top of the highlighted network of the map in Figure 3-14 shows that 240 devices were detected in the “12th mile” of network 179 from 12AM Mar 2 2012 to 12AM on Mar 3 2012. The graph represents the “12 mile” (12th space bin that approximates 1 mile) in the 12AM to 12:15AM time bin in the top-left block. Placing your mouse over this top-left block shows an average of 53 miles per hour for the devices that were detected in this space-time bin.
Note that it is recommended that at least seven space bins appear for the Speed metric; otherwise, the distinction between the space bins may be difficult to visualize. Also, users should limit the length of the x-axis so that fewer than 1440 time bins appear on the x-axis. Otherwise, rendering the graph can take more than several minutes. For example, generating graphs for this metric with an Aggregation Period of 1 minute and a date range of 24 hours can take approximately 2 minutes to render.
The Speed metric can only be viewed using map-matched data.
Average speed of 53mph among all devices detected in the “12th mile” space bin and 12:00am to 12:15am time bin of network 179 on Mar 2, 2012.
240 datapoints in the “12th mile” of network 179 from March 2, 2012 at 12:00AM to Mar 3 2012 at 12:00AM.
Chapter 3 Data Quality Tool User Guide
Hybrid Traffic Data Collection Roadmap: Objectives and Methods 45
SPACE-TIME COVERAGE 3.7
Figure 3-15: Space-time Coverage Metric
The Space-time Coverage metric counts the number of datapoints in each space-time bin. Clicking on the top of the highlighted portion of the map in Figure 3-15 shows that 240 datapoints were detected in
The Space-time Coverage metric counts the number of datapoints in each space-time bin. Clicking on the top of the highlighted portion of the map in Figure 3-15 shows that 240 datapoints were detected in