RTI International is a trade name of Research Triangle Institute. www.rti.org
PREDICT: A Data Repository for Cyber Security Research
Charlotte Scheper
RTI International
Manish Karir
What is PREDICT?
A Protected REpository for Defense of Infrastructure against Cyber Threats (PREDICT)
A research data repository sponsored by DHS/S&T
A trusted framework for sharing data for cyber security research
A program for advancing tools, methods, and policies for collecting and sharing security-related Internet data
In operation since 2008, PREDICT has shared
Objectives
Rationale
– Researchers had insufficient access to data and were unable to adequately test their research prototypes
– Government technology decision-makers and
researchers need data to evaluate competing “products”
– Legal and ethical policies for Internet research are
unclear
– The scientific method, which depends on repeatable tests and evaluations, needed support
Project Impetus:
– National Strategy to Secure Cyberspace (February 2003) and 2009 Cyberspace Policy Review
– “Expanding Public Access to the Results of Federally Funded Research”; see http://m.whitehouse.gov/blog/2013/02/22/expanding-public-access-results-federally-funded-research
PREDICT
Cyber Security Datasets
Collect Archive Share
PREDICT is the only freely available, legally collected repository of large-scale datasets containing real
Key Issues Addressed
Providing secure, centralized access to multiple sources of security-related data
Assuring confidentiality
– Privacy of individuals
– Proprietary information
– Security of the networks from which the data are collected
Establishing a legal structure to reduce legal risks
Assuring data integrity
Protecting access to data
PREDICT Data Sharing Model
Sensitivity assessments for inclusion and access
Legally binding terms and conditions for data use
Protocols for repository operations
Data requests subject to expert review and approval
Centralized view of and portal into the repository and management of repository processes through a Data Coordinating Center
Privacy Outreach Activities
Briefed privacy advocates and obtained input
– ACLU, Electronic Frontier Foundation (EFF), Center for Democracy and Technology (CDT), EPIC (invited)
Prepared Privacy Impact Assessment (PIA)
– Worked with DHS Privacy Office
Briefed government officials, privacy advocates, participants
– DHS S&T General Counsel
– DHS General Counsel
PREDICT Repository Framework
Distributed Repository
– Multiple data providers collect and prepare data for sharing
– Multiple data hosts provide computing infrastructure to store datasets and provide access
– Central coordinating center provides portal and manages repository processes
Current Repository Holdings
Current Data Categories
– Address Space Allocation Data
– Border Gateway Protocol (BGP) Routing Data
– Blackhole Address Space Data
– Domain Name System (DNS) Data
– Intrusion Detection System (IDS) and Firewall Data
– Infrastructure Data
– Internet Topology Data
– Internet Protocol (IP) Packet Headers
– Performance and Quality Measurements
407 Datasets
– Collection periods vary from hours to days to months
– Sizes vary from bytes to terabytes
Research groups that have used PREDICT
– 97 academic institutions
– 88 commercial entities
– 37 government organizations
– 3 foreign
Current Data Host/Providers
UCSD/CAIDA
– Topology Measurements, Network Telescope (45.4 TB)
USC-ISI
– NetFlow, Internet Topology Data, Address Allocation (42 TB)
Colorado State University
– NetFlow, Spam logs, IP Reputation lists (90 TB)
University of Michigan/Merit Networks
– NetFlow, BGP Routing, Dark Address Space Monitoring, BGP Beacon Routing (188.7 TB)
Georgia Tech
– Botnet Sinkhole Connection (0.01 TB)
University of Wisconsin
– Global Intrusion Detection Database (2.5 TB)
Packet Clearing House
– BGP Routing, VoIP Measurement, Synthetic (8.0 TB)
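The per-host sizes listed above can be totaled to get a rough sense of the repository's scale. The following is a minimal sketch, not part of PREDICT itself: the dictionary and function names are illustrative, and the figures are copied from the host list as given.

```python
# Sizes in TB, copied from the host/provider list above.
# HOLDINGS_TB, total_tb, and largest_host are illustrative names only.
HOLDINGS_TB = {
    "UCSD/CAIDA": 45.4,
    "USC-ISI": 42.0,
    "Colorado State University": 90.0,
    "University of Michigan/Merit Networks": 188.7,
    "Georgia Tech": 0.01,
    "University of Wisconsin": 2.5,
    "Packet Clearing House": 8.0,
}


def total_tb(holdings):
    """Sum the per-host sizes, rounded to two decimals to hide float noise."""
    return round(sum(holdings.values()), 2)


def largest_host(holdings):
    """Return (host, share_of_total) for the host holding the most data."""
    host = max(holdings, key=holdings.get)
    return host, holdings[host] / sum(holdings.values())


if __name__ == "__main__":
    print(f"Total: {total_tb(HOLDINGS_TB)} TB")  # Total: 376.61 TB
    host, share = largest_host(HOLDINGS_TB)
    print(f"Largest: {host} ({share:.1%})")
```

On these figures the seven hosts hold about 376.6 TB in total, with University of Michigan/Merit Networks accounting for roughly half.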
Data collection, storage, and access
– Advance the state of the art in data collection techniques, packet formats, new data types, data cataloging/annotation, and cross-dataset analysis
– Develop systems for storage and processing of large volumes of data
– Continue technical and policy work on disclosure control for Internet traffic data
– Add Data Access Methods such as VMs/Virtual Enclaves
Advancement of tools and techniques to analyze Internet datasets to extract and represent useful information
– Center for Configuration Analytics and Automation (UNCC) project
– RTI International IR&D project
Investigate and highlight legal and ethical issues in Internet data collection and analysis
– Menlo Report on Ethical Principles Guiding Information and Communication Technology Research
Improvements on the Way
Add classes of data
– Unrestricted
– Quasi-restricted
– Restricted
Streamline processes
– Cleaner account request process
– Click-through agreements for unrestricted and quasi-restricted classes
Expand international cooperation framework
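The planned data classes and click-through agreements can be pictured as a simple mapping from class to approval path. This is a hypothetical sketch only: the names `DataClass` and `approval_path` are invented for illustration, and it assumes restricted data would stay on the expert review and approval path described in the data sharing model, which the slides do not state.

```python
from enum import Enum


class DataClass(Enum):
    """The three planned data classes (illustrative type, not a PREDICT API)."""
    UNRESTRICTED = "unrestricted"
    QUASI_RESTRICTED = "quasi-restricted"
    RESTRICTED = "restricted"


# Classes the slides say would use click-through agreements.
CLICK_THROUGH = {DataClass.UNRESTRICTED, DataClass.QUASI_RESTRICTED}


def approval_path(data_class):
    """Return the approval path a request for this class would follow."""
    if data_class in CLICK_THROUGH:
        return "click-through agreement"
    # Assumption: restricted data remains subject to expert review.
    return "expert review and approval"


if __name__ == "__main__":
    for data_class in DataClass:
        print(f"{data_class.value}: {approval_path(data_class)}")
```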
Summary
PREDICT is addressing an acknowledged need by providing large-scale, real-world security-related datasets for cyber security research
PREDICT is addressing the significant policy and legal issues in collecting and sharing security-related data
PREDICT is helping to achieve DHS’s goal of improving the quality of defensive cyber security technologies
PREDICT Information
https://www.predict.org
DHS Privacy Impact Assessment:
http://www.dhs.gov/xlibrary/assets/privacy/privacy_pia_st_predict.pdf
Contact Information
Charlotte Scheper
Director, PREDICT Coordinating Center
RTI International
USA
cscheper@rti.org
919-485-5587
Manish Karir
Program Manager, Cyber Security Division
DHS S&T
USA
Manish.Karir@hq.dhs.gov