FOT-Net Data Stakeholder Meeting on
Open Data and Data Re-use in Horizon 2020
Data Sharing in the US
Ram Kandarpa, Booz Allen Hamilton
(under contract to USDOT)
March 10, 2015 Brussels, Belgium
Topics
Challenge 1: A common platform for data sharing
Challenge 2: Protection of Privacy
Challenge 3: Big Data and Cloud Analytics
Challenge 4: Engagement with RDE Users
Challenge 1: A common platform for data sharing
An accessible common platform for systematically sharing Field Operational Test
(FOT) data is essential for continuous and efficient use of FOT data in the
research and development of the many applications that such data enable.
[Figure: Data Sharing Platform. Labels: Application, Data Capture, Information, Raw Data]
A platform is needed for collecting and sharing connected vehicle
and infrastructure data for research and applications development
[Figure: Real-time Data Capture and Management. Data Environments receive inputs such as Transit Data, Truck Data, and Mobile Devices, and feed Connected Vehicle Applications such as Reduce Speed 35 MPH, Weather Application, Transit Signal Priority, and Fleet Management/Dynamic Route Guidance]
The Research Data Exchange (RDE) is the US DOT’s primary
repository of publicly available FOT CV-related data
Purpose
– To provide a variety of data-related services that support the development, testing, and demonstration of
multi-modal transportation mobility, weather, and environmental applications.
Objectives
– Enables systematic data capture from connected vehicles, mobile devices, and infrastructure
– Provides high quality and well-documented data sets
– Integrates data from multiple sources into data environments
US DOT Program Owner
– The Data Capture and Management (DCM) program
within the Intelligent Transportation Systems Joint Program Office (ITS-JPO)
The RDE currently hosts a mix of data from probes, connected
vehicles, infrastructure and contextual sources from nearly a
dozen FOTs and demos
Probe Message Data. Actual and simulated vehicle trajectories and probe snapshot messages in SAE
J2735 format from tests conducted at the Connected Vehicle Test Bed in Novi, MI in 2008, 2009, and 2010.
Vehicle and Roadside Device Data. Integrated multimodal data from vehicles and roadside sensors from
four sites (Seattle, Portland, Pasadena, and San Diego). Data includes light and transit vehicles, incidents, weather, freeway and arterial travel times, and traffic signal data.
Connected Maintenance Vehicles. Real-time streaming and archived onboard (GPS/AVL) data from
wirelessly-connected snowplows and maintenance trucks operated by Minnesota DOT.
Basic Safety Messages (BSM) - Orlando. BSM data collected every 0.1 second from transit vehicles at the
2011 World Congress Demonstration in Orlando, FL.
BSM Data - Leesburg. BSM data collected every 0.1 second from a device in a vehicle in the vicinity of
Leesburg, VA.
The US DOT plans to expand the offerings of data available on the
RDE over the next several years as “CV Pilot” deployments begin
Near Term Additions:
– Additional Data from Safety Pilot Deployment
– Data from 2014 ITS World Congress
    • Queue Length Data and CV Data
    • Weather Data Demonstration
Future Additions:
– Dynamic Mobility Applications (DMA) Prototypes
– CV Pilot Deployments
– Operational Data Environments
Links to additional connected vehicle related data
The US DOT is developing guidance and requirements for
systematic provision of data from FOTs to the RDE
Guidance is being developed in a user-friendly question-and-answer (Q&A) format
Guidance will be made available online in the near-term
Major topics covered in the guidance include:
– Requirements for providing data to the RDE
Within the CV FOT guidance, requirements are identified in support
of making data available for research uses
Data Requirements: These specify the minimum requirements that data must meet
before the US DOT makes them available on the RDE.
These requirements are described for each stage of a typical FOT:
*Note: this is not a comprehensive list; it is for illustrative purposes.
Stage of a Typical CV FOT          Requirements to Consider*
Conceptualization                  • Assignment of appropriate FOT POC
Partnership Formation              • Data ownership, permits, share-ability under the Open Data License
Design and Development             • Consideration of PII within the data
                                   • Usage of non-proprietary data formats
Implementation and Operation       • Proper metadata documentation
                                   • Adherence to data quality levels
Evaluation                         • Logically structured data files
Also within the CV FOT guidance, potential data-related issues are
identified to assist the FOT conductor
Data Issues related to CV FOTs: The guidance will describe data-related issues that an
FOT conductor or agency may encounter throughout the various stages of the FOT.
These issues are described for each stage of a typical FOT:
*Note: this is not a comprehensive list; it is for illustrative purposes.
Stage of a Typical CV FOT          Potential Issues to Consider*
Conceptualization                  • Test goals and objectives
Partnership Formation              • Data collection approach and plan
Design and Development             • Performance measures
Implementation and Operation       • Acquisition of data
Challenge 2: Protection of Privacy
Protecting the privacy of the individuals partaking in the FOTs is paramount if data
from the FOT are to be shared openly or outside of the designated parties.
De-identifying the data and/or seeking written permissions from the test participants are
two possible methods for protecting privacy.
The Detroit City Data Environment features connected vehicle data
and video data from a queue length estimation experiment
This demonstration was conducted in downtown Detroit, which is part of the Southeast Michigan Test Bed
Nine (9) vehicles traversed a predefined path that included 12 instrumented intersections
Collected data included:
– Vehicle kinematic data (lat, long, speed, acceleration, etc.)
– Intersection data (signal phase and timing, geo-spatial elements)
– Traveler information message (roadway advisories)
– Queue length (collected by field observer)
– Sample video recording of the demonstration (primarily to support verification of queue length estimates)
For the Detroit City Data Environment, steps were taken to ensure
privacy of individuals within video data
Privacy Issues:
Vehicle data contained no PII due to the constraints of the demonstration and queue length estimation experiment
However, the video recordings contained data that presented privacy concerns
– The audio associated with each video contained conversations between the field observers
– While video resolution did not allow recognition of license plates, it did allow recognition of some pedestrians as they walked along sidewalks and at crosswalks
– To remedy these issues, the videos were processed to remove all audio.
– A filter was added to further degrade the video resolution, making pedestrian features less distinguishable while still allowing users to recognize queue length
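The resolution-degrading step can be illustrated with a minimal sketch. The slides do not say which filter was used, so the block-averaging ("pixelation") approach below is an assumption, as are the function name `pixelate`, the `block` parameter, and the representation of a frame as a list of rows of grayscale values. Audio removal is not shown.

```python
def pixelate(frame, block):
    """Replace each block x block tile of a grayscale frame with its mean value.

    `frame` is a list of equal-length rows of integer pixel values.
    This is an illustrative filter, not the one used for the RDE videos.
    """
    height, width = len(frame), len(frame[0])
    out = [row[:] for row in frame]
    for top in range(0, height, block):
        for left in range(0, width, block):
            rows = range(top, min(top + block, height))
            cols = range(left, min(left + block, width))
            total = sum(frame[r][c] for r in rows for c in cols)
            mean = total // (len(rows) * len(cols))
            # Every pixel in the tile takes the tile's average value.
            for r in rows:
                for c in cols:
                    out[r][c] = mean
    return out


# A 2 x 2 frame collapsed into a single averaged tile.
small = pixelate([[0, 10], [20, 30]], block=2)
```

Larger `block` values remove more detail. Coarse structure such as a line of queued vehicles spans many tiles and stays visible, while small features such as faces are averaged away.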
The Safety Pilot Model Deployment (SPMD) is a naturalistic driving
study primarily to evaluate the efficacy of V2V technologies
SPMD is an exploration of the real-world effectiveness of connected vehicle safety applications in multi-modal driving
conditions
This study included approximately 3000 drivers, conducting their day-to-day activities in instrumented vehicles
The hyper-frequent and hyper-local data collected by these vehicles provide tremendous research value but at the same time pose a privacy threat
Privacy concerns arise from the ability to use vehicle position data to identify home, work, child care facilities, etc.
These data may be considered PII and then be used to uncover additional PII
Before data can be distributed to the public, PII-related data has to
be removed while maintaining usefulness of the data
The RDE will host two samples of the SPMD data, a 1-day sample and a 60-day sample, and two different sanitization strategies are needed to rid the data of PII
For the 1-day sample, the sanitization algorithm centered on identifying drivers’ origins and destinations, and truncating the trajectories accordingly
Once origins and destinations had been identified, a series of measures was applied to best mask those locations
The algorithms were also applied to dependent/related data elements to further eliminate the possibility of uncovering PII
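The idea of truncating trajectories around origins and destinations can be sketched as follows. This is not the US DOT's sanitization algorithm, whose details are not given here. It is a minimal illustration that assumes the origin and destination are the first and last points of a trip and that masking means dropping every point within a fixed radius of either. The names `haversine_m`, `truncate_trip`, and `radius_m`, and the sample coordinates, are hypothetical.

```python
import math


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(h))


def truncate_trip(points, radius_m):
    """Drop every point within radius_m of the trip's first or last point.

    Assumption: the first and last points stand in for origin and destination.
    """
    if not points:
        return []
    origin, destination = points[0], points[-1]
    return [p for p in points
            if haversine_m(p, origin) > radius_m
            and haversine_m(p, destination) > radius_m]


# Six points spaced roughly 111 m apart along a meridian (made-up coordinates).
trip = [(42.000 + 0.001 * i, -83.0) for i in range(6)]
kept = truncate_trip(trip, radius_m=150)
```

With a 150 m radius, the two points nearest each end are removed and only the middle two survive. A fixed radius is the simplest possible choice. It also shows the trade-off described above: a larger radius hides the endpoints better but discards more usable data.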
For the 60-day sample, a more involved algorithm was developed, building on what was previously developed
The updated algorithm focused on more nuanced driver behaviors, beyond what is typical when classifying origins and
destinations
This focus was primarily employed to further mask PII which may be obtained through observing driving patterns over time
After applying this algorithm, the output is again a series of truncated trajectories while still maintaining, as best as possible, the usefulness of the data
Video data showing a test participant will soon be made available
on the RDE as part of a road weather warning system demo
A planned data environment for the RDE contains field-simulated road weather data collected during a demonstration at the 2014 ITS World Congress in Detroit
Participants were driven in a specially instrumented demo van which did a short loop around the Belle Isle test track while collecting data from multiple onboard sensors during simulated road weather events
A video camera inside the van collected video
footage of the onboard warnings (generated by the simulated weather events) and the host who narrated the events for the participants
The host has granted written consent to the US DOT to allow the video (which includes her likeness) to be made available
The US DOT has alternative means available for retaining and
sharing data that has not been stripped of personally identifiable
information
For Connected Vehicle FOT data, the US DOT is using its Saxton Transportation Operations Laboratory at the Turner-Fairbank
Highway Research Center (TFHRC) as a secure repository, with
access granted to researchers and interested parties on a limited basis
For data from the Naturalistic Driving Study (collected under the Strategic Highway Research Program 2), the US DOT has
established the ‘Safety Data Enclave’ at the TFHRC to provide comprehensive data sets on a limited basis to researchers and other interested parties
Some data may lose much of its value if steps are taken to eliminate all personally identifiable information.
There may be sufficient grounds to retain unaltered FOT data for later usage in a controlled (i.e., non-public) setting
Challenge 3: Big Data and Cloud Analytics
As connected vehicle data are generated in ever greater quantities during FOTs, it is
clear new methods beyond archiving and downloading will be needed to effectively
share data. Efforts are now underway to migrate the data on the RDE to a
cloud-based storage environment with analytical tools co-located.
The current RDE architecture cannot sustain
very large datasets
The paradigm of downloading data files for subsequent local analysis is not sustainable for significant-sized data files
– Two months of the SPMD data exceed two terabytes (2 TB). The time needed to download this much data makes the existing approach unworkable.
– Even if a user were to succeed in downloading a large file, analyzing the data in a meaningful way would require significant time and processing resources
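A quick calculation shows why downloading 2 TB is impractical. The link speeds below are illustrative assumptions, not figures from the RDE. The sketch also assumes decimal terabytes (10^12 bytes) and a connection that sustains its nominal rate with no protocol overhead, so real transfers would take longer.

```python
def download_hours(size_bytes, link_mbps):
    """Ideal transfer time in hours for size_bytes over a link of link_mbps megabits/s."""
    bits = size_bytes * 8
    seconds = bits / (link_mbps * 1_000_000)
    return seconds / 3600


TWO_TB = 2 * 10**12  # 2 TB in decimal bytes (assumption)

for mbps in (10, 100, 1000):  # hypothetical sustained link speeds
    print(f"{mbps:>4} Mbit/s: {download_hours(TWO_TB, mbps):6.1f} hours")
# Prints 444.4, 44.4, and 4.4 hours respectively.
```

Even at a sustained 100 Mbit/s the transfer takes nearly two days, before any analysis has started.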
The obvious solution is to provide a cloud-based RDE architecture, and avoid file downloads by supporting analytical tools that are co-located with the data
– Users could perform their analyses in the cloud environment, or
– Users could filter the large datasets into smaller, more manageable files that could be downloaded for further analysis
Open source analytical tools, such as ‘R’, can easily be made available in cloud computing environments, and allow for complex data queries and processing operations
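The "filter first, then download" option can be sketched as a streaming filter that never holds the full dataset in memory. The slide mentions R; Python is used here only for illustration. The column names (`vehicle_id`, `speed_mph`, `hour`), the sample rows, and the helper `filter_rows` are all hypothetical and do not reflect the actual RDE data schema.

```python
import csv
import io


def filter_rows(reader, writer, keep):
    """Stream rows from reader to writer, keeping those for which keep(row) is true.

    Rows are processed one at a time, so memory use does not grow with file size.
    """
    kept = 0
    for row in reader:
        if keep(row):
            writer.writerow(row)
            kept += 1
    return kept


# In-memory stand-in for a large cloud-hosted CSV file (made-up data).
source = io.StringIO(
    "vehicle_id,speed_mph,hour\n"
    "A,31.0,8\n"
    "B,12.5,8\n"
    "A,40.2,17\n"
)
reader = csv.DictReader(source)
subset = io.StringIO()
writer = csv.DictWriter(subset, fieldnames=reader.fieldnames)
writer.writeheader()

# Keep only the records from the 8 o'clock hour.
count = filter_rows(reader, writer, lambda row: row["hour"] == "8")
```

Run next to the data, a filter like this would leave the user with a small subset file to download instead of the full multi-terabyte archive.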
A cloud-based RDE solution will address size and processing
constraints
Cloud-based resources are “elastic”
– Users can adjust the storage and processing resources they need in real time
A user can expand storage and CPU power to query/integrate large data sets
The user can then “turn off” the resources when they are not needed
– US DOT is investigating various economic models of how to store the data and how users would access the data
For example, one approach would require a user to establish a cloud-provider “account” in order to conduct their desired analyses
Cloud-based resources would be secure
– Any cloud RDE solution would require Federal Risk and Authorization Management Program (FedRAMP) approval
The RDE team is in the process of developing detailed access requirements and migration plans for a cloud-based solution, targeted for the Version 3 Release later this year
Challenge 4: Engagement with RDE Users
It is important to engage with the RDE users to educate them on the contents and
capabilities of the RDE, and to seek their feedback to continuously improve the RDE to
meet evolving needs of the research and development community.
The US DOT has hosted RDE and DCM program workshops to
personally engage with targeted users
The US DOT’s Data Capture and Management (DCM) Program held two all-day
workshops on March 26-27, 2014 to engage key Program stakeholders in a
dialogue regarding the following topics:
– RDE’s features/capabilities
– RDE content (e.g., data quality and availability of metadata)
– Data policy
– Data management practices
– Stakeholder engagement
– Long-term visioning for the DCM Program and the RDE
The workshops consisted of a series of short presentations from the DCM
Program’s federal and contractor support staff, supplemented by a variety of
break-out group brainstorming discussions and other facilitated exercises
Notes and recommendations from the workshops were consolidated into a report
and used to inform future DCM Program priorities and activities
Measuring usage of the RDE is done through a variety of means
The following four-tiered approach has been applied to measure and evaluate
ongoing RDE usage
The findings from each of these measurement approaches are shared with DCM
Program leadership in the form of an executive dashboard and quarterly reports
Stakeholder Interviews: Formal, direct outreach and discussion with key RDE stakeholders, leveraging standardized interview protocols
Surveys: Written data gathering efforts delivered via the RDE website and Survey Monkey to select individual stakeholders
Website Assessment: Statistical review of the RDE website analytics, focusing on quantitative data such as number of user logins and downloads
Data Analysis: Review of RDE data to identify gaps, inconsistencies, and trends in stakeholder data needs
Quantitative analyses are performed based on usage of the RDE,
serving as an indication for engagement of the user community
The following represent some of the quantitative RDE analytics that are captured
and reported
Other planned or ongoing means of user outreach
Ongoing feedback is currently being
solicited via the RDE Feedback
Form, available to all registered
RDE account holders through the
RDE website under the “About” tab
The next RDE User Satisfaction
survey will be deployed in the winter
2015 timeframe
Additional comments, questions, suggestions, and concerns can be submitted to
Sharing and re-using FOT data involves closely related and highly
collaborative efforts
[Figure: Data Sharing and Re-Use. Labels: Guidance and Requirements Collection and Management Uploading and Measurement and Evaluation]
To enhance and expand data re-use, the DCM Program welcomes
your ideas and suggestions
The DCM Program has historically been charged with providing research-ready CV data for purposes of research and application development.
As CV technology moves from the R&D stage into a deployment and operations stage over the next 5 years, the DCM program will evolve into the broader ‘Connected Data Systems’ (CDS) program
– Focus will be on ways of obtaining and providing data to real-time connected vehicle applications that support traffic management and operations.
Your input is welcome regarding how the CDS program can best make data available for both research and operational purposes.