Assessing Data Quality

Chapter 2 Assessing Data Quality

Hybrid Traffic Data Collection Roadmap: Objectives and Methods 19

5 SYSTEM IMPLEMENTATION

We now illustrate the importance of the preceding concepts for practitioners. This section presents an implementation of the standards discussed above within the Mobile Millennium [4] system, which is representative of modern traffic information systems.

5.1 MOBILE MILLENNIUM SYSTEM

The Mobile Millennium system [4] is a large-scale traffic estimation system which integrates a variety of data types to provide traffic estimates in the San Francisco Bay Area. Each data type collected by the system goes through a sequence of steps whose aim is to assess the data feed quality, filter out low-quality data points, and fuse the different data types. The output of this process is fed to estimation modules, including traffic models based on partial differential equations coupled with ensemble Kalman filtering, and machine learning algorithms. The outline of the system is given in Figure 2-2. Because of the complexity of the overall system, strict data type specifications and data feed quality metrics are enforced. These schemas and metrics are described in the following sections.

Figure 2-2: Mobile Millennium system: Data feeds entering the system go through a well-defined sequence of computational steps necessary to guarantee the quality of the final outputs.

5.2 DATA FEED REQUIREMENTS

In this section, a basis for the requirements of the most common data types used by the transportation engineering community is proposed. Similar standards [21] for transit applications have greatly contributed to the improvement of commute planning. At the more local level of the Mobile Millennium system, the standards introduced by this study provided a solid common basis for a fast and robust development of traffic estimation activities involving a variety of data feeds and multiple data types. For each data type, the specifications are organized as follows:

Data sources: Most common data sources for this data type.

Applications: Applications for which this data type is the most relevant.

Output schema: Format of the data feed. Two types of fields are present in the output schema: required fields, which have to meet the completeness requirement, and optional fields.

Validation schema: Format of the validation data feed. In particular this feed should be independent from the regular data feed described by the output schema. The validation data feed should be available during a defined probationary period and should be available at pre-defined periods for standard assessment of data quality during the data feed lifetime.

Processed data requirements: Requirements which have to be satisfied by processed data. These requirements depend on the processing type.

Individual data source requirements: Specifications on the individual data sources used by the data feed. This includes, for instance, the sensor characteristics and the sampling scheme.

Data source network requirements: Specifications on the data source network used by the feed. This includes the coverage information and the redundancy coming from the spatiotemporal spread of the data sources.

Processing requirement: Extensive description of the nature and properties of the processing algorithm, with appropriate references.

In the following section these specifications are instantiated for the application of point speed estimation using the data type point location.

5.3 INSTANTIATION OF POINT LOCATION DATA FEED: TAXI DATA

In this section two data feeds available in the Mobile Millennium system are considered:

Cabspotting feed [22]: Public data feed consisting of point locations recorded by taxis in the city of San Francisco, California.

Info24 feed [23]: Data feed consisting of point locations recorded by taxis in the city of Stockholm, Sweden.

We propose to instantiate some of the quality metrics defined in this section for data-level validation on a specific subset of these two data feeds. We consider an 8.6-mile stretch of highway between downtown San Francisco and San Francisco International Airport for the Cabspotting feed (Figure 2-3), and a 14-kilometer stretch of highway (equivalent length) between downtown Stockholm and Stockholm-Arlanda airport (Figure 2-4) for the Info24 feed. In both cases, we limit our analysis to July 21, 2010.

Figure 2-3: Cabspotting feed: 6 × 10^5 points (order of magnitude) received by the Mobile Millennium system on July 21, 2010

Figure 2-4: Info24 feed: 2 × 10^5 points (order of magnitude) received by the Mobile Millennium system on July 21, 2010

The schema for these two data feeds is defined below.

Data sources: GPS using trilateration

Application: Real-time point speed estimation

Output schema:

Required fields:

<time>: measurement time in GMT

<id>: unique identifier for sensor from which the measurement originates

<position>: location measurement in the reference coordinate system used by the Global Positioning System

Optional fields:

<error>: uncertainty on the point location (distance)

<heading>: direction of travel (angle)

The sensor ID allows computing travel time and point speed in the case of high frequency sampling with minimal processing. One may note that without the sensor ID, only the density of probe vehicles can be computed, which has some interest for applications such as queue length estimation, but requires a large volume of data points.
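The schema above and the role of the sensor ID can be sketched in a few lines of Python. This is a minimal illustration, not the Mobile Millennium implementation: the class name `PointLocation`, its field names, the representation of <position> as latitude/longitude in degrees, and the helper functions are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt
from typing import Optional

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius used by the haversine formula


@dataclass(frozen=True)
class PointLocation:
    """One record of the point location feed (hypothetical field names)."""
    # Required fields
    time: datetime      # <time>: measurement time in GMT
    sensor_id: str      # <id>: unique identifier of the originating sensor
    lat: float          # <position>: latitude in degrees (assumed encoding)
    lon: float          # <position>: longitude in degrees (assumed encoding)
    # Optional fields
    error_m: Optional[float] = None      # <error>: location uncertainty (meters)
    heading_deg: Optional[float] = None  # <heading>: direction of travel (degrees)


def haversine_m(a: PointLocation, b: PointLocation) -> float:
    """Great-circle distance in meters between two records."""
    phi1, phi2 = radians(a.lat), radians(b.lat)
    dphi = phi2 - phi1
    dlmb = radians(b.lon - a.lon)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


def point_speed_mps(a: PointLocation, b: PointLocation) -> float:
    """Average speed (m/s) between two consecutive reports of the same sensor."""
    if a.sensor_id != b.sensor_id:
        raise ValueError("speed requires two reports from the same sensor")
    dt = (b.time - a.time).total_seconds()
    if dt <= 0:
        raise ValueError("reports must be in strictly increasing time order")
    return haversine_m(a, b) / dt
```

The straight-line distance is only a reasonable proxy for distance travelled when the sampling period is short; with sparse reports the path between the two points would have to be inferred on the road network.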

Validation schema:

Since this data type is a direct output of a data source with minor processing inherent to the measurement process, no additional validation data feed is required, and thus both schemas match.

Processed data requirement:

Since this data type can be considered raw, no requirement is expressed in this section.

Individual data source requirements:

Device error characteristic (meters): Device error characteristics are crucial parameters for data assimilation schemes in which the observation error is taken into consideration [24] [25] [26]. In an assimilation framework, the observations are accounted for with a weight corresponding to this error. When this metric is unavailable, the distance between the point location from the data feed and the map-matched point on the road network can be used as an indicator of the point location accuracy. Figure 2-5 shows the cumulative distribution of projection error for the Info24 data feed for the area described in Figure 2-4. At the 90th percentile the individual data points have a projection error lower than 8 meters.

Figure 2-5: Cumulative distribution of projection error (meters)
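As an illustration of the projection error used as an accuracy indicator, the sketch below computes the distance from a reported point to a single road segment and the share of points below a threshold. It is a simplified stand-in for map matching: it assumes the candidate segment is already known, treats the segment as a straight line, and uses a local flat-Earth approximation that is only adequate over short distances. The function names are hypothetical.

```python
from math import cos, hypot, radians

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def to_local_xy(lat, lon, lat0, lon0):
    """Approximate east/north offsets in meters from (lat0, lon0)."""
    x = radians(lon - lon0) * cos(radians(lat0)) * EARTH_RADIUS_M
    y = radians(lat - lat0) * EARTH_RADIUS_M
    return x, y


def projection_error_m(point, seg_start, seg_end):
    """Distance in meters from `point` to the segment; inputs are (lat, lon)."""
    lat0, lon0 = seg_start
    px, py = to_local_xy(*point, lat0, lon0)
    ex, ey = to_local_xy(*seg_end, lat0, lon0)
    seg_len_sq = ex * ex + ey * ey
    if seg_len_sq == 0.0:
        return hypot(px, py)  # degenerate segment: distance to its single point
    # Clamp the projection parameter so the matched point stays on the segment.
    t = max(0.0, min(1.0, (px * ex + py * ey) / seg_len_sq))
    return hypot(px - t * ex, py - t * ey)


def fraction_below(errors, threshold_m):
    """Empirical cumulative distribution evaluated at `threshold_m`."""
    if not errors:
        raise ValueError("no projection errors supplied")
    return sum(e < threshold_m for e in errors) / len(errors)
```

Evaluating `fraction_below` over a range of thresholds yields a cumulative distribution of the kind plotted in Figure 2-5.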

Sampling period (seconds): The sampling period impacts the usability of the data feed. It is advantageous to have as many data points as possible. The product of the sampling frequency (the inverse of the sampling period) and the penetration rate expresses the number of data points received on average. The sampling period is also important in itself because it gives an indication of the connectivity of the reported trajectory. With a long sampling period, complex methods have to be used to infer the trajectory of the probe vehicles between two measurements. On the other hand, a short sampling period is not desirable in a privacy-sensitive context.
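As a rough worked example of how data volume depends on these two quantities, the expected report rate from a road section grows with the number of equipped vehicles present and shrinks as the sampling period gets longer. The function below is a back-of-the-envelope sketch under the assumption that every equipped vehicle reports exactly once per sampling period; the name `expected_report_rate` and the numbers in the usage note are illustrative only.

```python
def expected_report_rate(vehicles_present: float,
                         penetration_rate: float,
                         sampling_period_s: float) -> float:
    """Expected number of reports per second from a road section.

    vehicles_present:  average number of vehicles on the section
    penetration_rate:  fraction of vehicles that are equipped (0 to 1)
    sampling_period_s: time between two reports of one vehicle (seconds)
    """
    if not 0.0 <= penetration_rate <= 1.0:
        raise ValueError("penetration_rate must lie between 0 and 1")
    if sampling_period_s <= 0:
        raise ValueError("sampling_period_s must be positive")
    return vehicles_present * penetration_rate / sampling_period_s
```

For instance, with a hypothetical 200 vehicles on the section, a 5% penetration rate, and a 60-second sampling period, the section yields about one report every six seconds; halving the sampling period doubles that rate.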

Data transmission delay (seconds): Duration from the time at which the data source records a measurement to the time at which the corresponding data point is available for the application considered. Figure 2-6 presents the distribution of delay in the Cabspotting data feed for the period of interest. At the 90th percentile the reports have a delay lower than 320 seconds.

Figure 2-6: Cumulative distribution of the delay (seconds)
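The delay statistic can be computed directly from the recording and reception timestamps. The sketch below assumes each report is available as a pair of `datetime` values (recorded, received), which is a hypothetical representation, and uses the nearest-rank definition of a percentile; other percentile conventions would give slightly different values on small samples.

```python
from datetime import datetime, timedelta
from math import ceil


def delays_s(reports):
    """Transmission delay in seconds for each (recorded_at, received_at) pair."""
    return [(received - recorded).total_seconds() for recorded, received in reports]


def percentile_nearest_rank(values, q):
    """Smallest value such that at least q percent of the sample is <= it."""
    if not values:
        raise ValueError("empty sample")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(values)
    rank = ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Synthetic reports with delays of 10, 20, ..., 100 seconds.
    base = datetime(2010, 7, 21, 12, 0, 0)
    reports = [(base, base + timedelta(seconds=s)) for s in range(10, 101, 10)]
    print(percentile_nearest_rank(delays_s(reports), 90))
```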

Sensor network requirements:

Space-time coverage: The data feed should be available on the roads of major interest for the traffic monitoring authority, at times when traffic is the most uncertain, i.e., during the day and especially during peak hours. The number of measurements collected on each discretization segment of the network is a parameter which directly impacts the accuracy of the estimate.

Figure 2-7 presents the distribution of the number of points per segment in the Info24 data feed for the period of interest. At the 35th percentile a network segment receives more than 200 points per day.

Figure 2-7: Cumulative distribution of number of measurements per network segment for one day

Homogeneity: This requirement expresses that most segments should have data points from at least a given number of sensors. This avoids well-known failure situations arising in the case of loop detectors, where a sensor reporting wrong measurements is difficult to detect.

Penetration rate: Proportion of the total flow consisting of equipped vehicles. The definition of the reference state for penetration rate should be documented, as it is not a straightforward quantity to estimate.
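The coverage and homogeneity requirements above can both be checked from map-matched observations. The following sketch assumes each observation has already been assigned to a network segment and is represented as a (segment_id, sensor_id) pair; the function names and the `min_sensors` threshold are hypothetical, since the text does not fix a specific value.

```python
from collections import defaultdict


def segment_statistics(observations):
    """Map each segment to (number of points, number of distinct sensors)."""
    counts = defaultdict(int)
    sensors = defaultdict(set)
    for segment_id, sensor_id in observations:
        counts[segment_id] += 1
        sensors[segment_id].add(sensor_id)
    return {seg: (counts[seg], len(sensors[seg])) for seg in counts}


def homogeneous_fraction(stats, min_sensors):
    """Share of observed segments covered by at least `min_sensors` sensors."""
    if not stats:
        raise ValueError("no segment statistics supplied")
    ok = sum(1 for _, n_sensors in stats.values() if n_sensors >= min_sensors)
    return ok / len(stats)
```

Segments that receive no observation at all do not appear in the result, so a full coverage assessment would also compare the keys against the list of segments of interest.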

Processing requirement:

This data type is considered to be raw, and thus no requirement is expressed in this section.

6 CONCLUSION

This chapter presented a methodology for data quality assessment of traffic data. Recent trends of the data quality market were discussed to support the need for a new perspective on data quality assessment. In particular, it was argued that the increase in volume of traffic data leads to an increasing complexity of available traffic data feeds, and thus new standards should be defined to enable traffic engineers to properly explore the new opportunities offered by this wealth of novel traffic information.

The instantiation within the Mobile Millennium system of some of the data quality metrics introduced in this study was used to illustrate the potential of the proposed framework for data quality assessment.

These metrics capture the complexity of typical probe measurements and allow traffic practitioners to rate available data feeds according to their application of interest.

Chapter 3 Data Quality Tool User Guide

Chapter 3