Big Data and Analytics at the IRS:
Perspectives and Initiatives
Government Big Data Symposium
March 5-6, 2013
Jeff Butler
Director, Research Databases
IRS, Research, Analysis, and Statistics
jeff.butler@irs.gov
Background
• The Internal Revenue Service (IRS) has a large service and
enforcement footprint. The table below is from FY 2011.
Tax Return Processing    234 million tax returns filed
                         1.8 billion third-party information returns
Account Management       $2.4 trillion in gross receipts
                         122 million refunds totaling $415 billion
Customer Service         319 million visits to IRS website
                         83 million toll-free telephone calls
Enforcement              223 million letters or notices sent to taxpayers
Types of Research and Analysis
Taxpayer Behavior
• Failure to file or pay
• Abusive tax shelters
• Identity theft
• Return preparer compliance
• Misreporting income or deductions
• Refund fraud
• Off-shore transactions
• Financial crimes
Analytic Initiatives
• Identify patterns of filing and payment non-compliance
• Predict and prevent ID theft and refund fraud
• Estimate U.S. tax gap
• Measure taxpayer burden
• Optimize case inventories and treatment strategies
• Simulate effects of tax changes
Analytic Data Environment in IRS
• IRS enterprise IT manages hundreds of transactional systems
and applications
• Research organization integrates legacy and third-party data
into the Compliance Data Warehouse (CDW)
Compliance Data Warehouse (CDW) – Selected Metrics
Total data size                            ~ 1.3 PB
Number of database tables                  ~ 3,100
Number of unique columns                   ~ 52,500
Number of searchable metadata attributes   > 1 million
Number of users                            ~ 1,020
Average daily queries                      ~ 6,500
IRS Analytic Data Environment
[Diagram: logical layers of the analytic environment]
• Analytic Sandboxes (Examples): Case Optimization, Predictive Modeling, Text Analytics, Simulation
• Compliance Data Warehouse (CDW)
– Core Analytic Database: Statistical & Mathematical Analysis; Ad-Hoc Query and Reporting; Data Extracts, Matching
– Enterprise Data Integration Layer
• Infrastructure and Services: Storage Mgmt, System Admin, Security/Audit, Monitoring, Software Config, Accounts, Metadata, Data Profiling, Web Services, Training & Support
IRS Analytic Data Environment
[Diagram: physical components of the Compliance Data Warehouse (CDW)]
• Core Database Servers (Sybase IQ, Oracle, SQL Server)
• Shared Storage (>2PB): DB, Backup, Staging, User
• Application/Web Servers (SAS, R, Hyperion)
• IRS Network: Users & Projects, Systems & Applications, Analytic Sandboxes
Scale (Volume)
[Chart: Data Size (Terabytes), 2005-2013]
[Chart: Average Daily Queries, 2005-2012; series: Third-Party Tools, Web-Based]
• Not all infrastructure/service costs are constant in scale
– Massively large environments can have asymmetric challenges
[Diagram: infrastructure and service areas: Systems & Storage Management; ETL & Database Administration; Metadata & Web Services; Security Audit and Monitoring; Tools, Training, & Support; Analytic Sandboxes]
Challenges with Scale
• I/O bottlenecks when data are off-loaded for analytics
– Single biggest problem for users in massively large environments
– Strategy: Maximize in-database analytics where possible
• Finding the optimal mix of ETL tools and techniques
– This is still where data warehousing costs are highest
– Strategy: Stay nimble and avoid one-size-fits-all solutions
• Choosing the right database technology
– Is it performance or scale that’s really needed?
– CDW is the largest database in the IRS and still uses a columnar DB
– Strategy: Maximize performance for users at smallest O&M cost
• Storage management
– Different approach needed in user-based analytic environment
– Strategy: Partition file systems based on […]
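The in-database strategy above can be illustrated with a minimal sketch. It assumes a hypothetical `returns` table and uses an in-memory SQLite database purely as a stand-in; it is not the CDW schema or its database engine. The first function pulls every row to the client before aggregating (the off-load pattern), while the second lets the database aggregate so that only summary rows cross the I/O boundary.

```python
import sqlite3

# Hypothetical table and column names; SQLite stands in for the analytic
# database purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE returns (tax_year INTEGER, filing_status TEXT, refund REAL)"
)
conn.executemany(
    "INSERT INTO returns VALUES (?, ?, ?)",
    [
        (2011, "single", 1200.0),
        (2011, "single", 800.0),
        (2011, "joint", 3000.0),
        (2010, "joint", 2500.0),
    ],
)


def summarize_offloaded(conn):
    """Off-load pattern: move every row to the client, then aggregate."""
    totals = {}
    query = "SELECT tax_year, filing_status, refund FROM returns"
    for tax_year, status, refund in conn.execute(query):
        count, total = totals.get((tax_year, status), (0, 0.0))
        totals[(tax_year, status)] = (count + 1, total + refund)
    return {key: (n, total / n) for key, (n, total) in totals.items()}


def summarize_in_database(conn):
    """In-database pattern: the engine aggregates; only summaries move."""
    rows = conn.execute(
        "SELECT tax_year, filing_status, COUNT(*), AVG(refund) "
        "FROM returns GROUP BY tax_year, filing_status"
    )
    return {(year, status): (n, avg) for year, status, n, avg in rows}


offloaded = summarize_offloaded(conn)
in_db = summarize_in_database(conn)
```

Both functions return the same summary; the difference is how many rows leave the database, which is what matters when tables are very large.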
Timeliness (Velocity)
[Chart: Data Arrival Rate (Annual, Quarterly, Monthly, Weekly, Daily), 2003-2013]
[Chart: Ingest-Release Latency, 2005-2012]
• Data arrival rates are different from data delivery rates
– Minimizing this difference is inherently an ETL problem
[Diagram: Data Extract/Feed → Validation/Pre-processing → Integration/Post-processing → Analysis/Modeling → Interpretation/Action]
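The gap between data arrival and data delivery can be measured per feed. The sketch below is only an illustration: the feed names and dates are made up, and "latency" is assumed to mean the number of days between a feed arriving and its release to users.

```python
from datetime import date

# Hypothetical feeds and dates, for illustration only.
feeds = [
    {"name": "feed_a", "arrived": date(2012, 3, 1), "released": date(2012, 3, 15)},
    {"name": "feed_b", "arrived": date(2012, 3, 5), "released": date(2012, 3, 8)},
    {"name": "feed_c", "arrived": date(2012, 4, 2), "released": date(2012, 5, 2)},
]


def latency_days(feed):
    """Days between a feed arriving and its release to users."""
    return (feed["released"] - feed["arrived"]).days


latencies = {feed["name"]: latency_days(feed) for feed in feeds}
mean_latency = sum(latencies.values()) / len(latencies)
slowest = max(latencies, key=latencies.get)
```

Tracking this number per feed shows which ETL stage is worth tuning first.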
Challenges with Velocity
• The larger the data size, the longer the processing time
– Let $P_{ij}$ and $S_{ij}$ = processing time and size of data set $i$ with frequency $j$; $i, j = 1, 2, \ldots, n$
– The problem is $\arg\min \sum \theta_{ij} (P \mid S)_{ij} + \varepsilon_{ij}$
– Processing time varies with scale (and complexity)
– Disturbances $\varepsilon_{ij}$ are unavoidable (e.g., server maintenance)
• Data may require validation, standardization, and cleaning
– No two data sets are the same
• Structured vs. unstructured data
– What is the impact of frequent schema changes on data delivery times for structured data?
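The slide does not define θ or say how processing time depends on size. One hedged reading, and only an assumption, is a linear model in which processing time is an intercept plus θ times data size, with the disturbance ε as the residual. The sketch below fits that model by ordinary least squares on made-up numbers.

```python
def fit_line(sizes, times):
    """Ordinary least squares for: time = intercept + theta * size."""
    n = len(sizes)
    mean_s = sum(sizes) / n
    mean_p = sum(times) / n
    sxx = sum((s - mean_s) ** 2 for s in sizes)
    sxy = sum((s - mean_s) * (p - mean_p) for s, p in zip(sizes, times))
    theta = sxy / sxx
    intercept = mean_p - theta * mean_s
    return intercept, theta


# Made-up observations: data set size (TB) and processing time (hours).
sizes_tb = [1.0, 2.0, 4.0, 8.0]
hours = [3.5, 4.5, 9.5, 16.5]

intercept, theta = fit_line(sizes_tb, hours)

# Residuals play the role of the disturbances (e.g., server maintenance).
residuals = [p - (intercept + theta * s) for s, p in zip(sizes_tb, hours)]
```

A fitted slope like this gives a rough planning number for how much longer a feed will take as it grows; large residuals flag runs affected by something other than size.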
Heterogeneity (Variety)
Sources of IRS Data: Taxpayers, Employers, Preparers, Banks, Brokers, Non-Profits, Interagency, Fed/State, Treaty Partners, Intermediaries
Types of IRS Data: Forms, Schedules, Worksheets, Attachments, Images, Correspondence, Transactions, Phone Calls, Notices, Transcripts
Source Systems and Data Formats: Mainframe, Unix/Linux, Windows, Databases, VSAM, Flat Files, Applications; DB tables, Fixed format, Hierarchical, Delimited, Packed decimal, XML, Plain text
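Fixed-format records and packed decimal fields are among the formats listed above. As a rough illustration of what integrating such data involves, the sketch below decodes an IBM-style packed decimal (COMP-3) field from a fixed-width record. The record layout (a 9-character EBCDIC identifier followed by a 4-byte amount with two decimal places) is hypothetical, not an actual IRS layout.

```python
from decimal import Decimal

POSITIVE_SIGNS = {0xA, 0xC, 0xE, 0xF}
NEGATIVE_SIGNS = {0xB, 0xD}


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed decimal field: two BCD digits per byte, sign in the last nibble."""
    if not raw:
        raise ValueError("empty packed decimal field")
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()
    if any(digit > 9 for digit in nibbles):
        raise ValueError("invalid digit nibble")
    if sign not in POSITIVE_SIGNS | NEGATIVE_SIGNS:
        raise ValueError("invalid sign nibble")
    value = int("".join(str(digit) for digit in nibbles))
    if sign in NEGATIVE_SIGNS:
        value = -value
    return Decimal(value).scaleb(-scale)


def parse_record(record: bytes):
    """Hypothetical layout: bytes 0-8 EBCDIC id, bytes 9-12 packed amount (2 decimals)."""
    account_id = record[:9].decode("cp037")
    amount = unpack_comp3(record[9:13], scale=2)
    return account_id, amount


record = "ACCT00001".encode("cp037") + bytes.fromhex("0012345C")
account_id, amount = parse_record(record)
```

Here the four amount bytes `00 12 34 5C` carry the digits 0012345 and a positive sign, so the parsed amount is 123.45.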
• Overwhelming majority of IRS data are still structured
– Most transaction systems are still file-based
• Challenge: skills needed to parse and analyze text
– Information extraction and entity resolution techniques (NLP)
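Entity resolution covers many techniques. The sketch below shows only one simple piece of it, approximate name matching, using Python's standard `difflib`. The suffix list, the 0.85 threshold, and the example names are all assumptions chosen for illustration, not values from the presentation.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of business suffixes to ignore when comparing names.
SUFFIXES = {"inc", "incorporated", "corp", "corporation", "llc", "co", "company"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and drop common business suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(token for token in tokens if token not in SUFFIXES)


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def same_entity(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two names as the same entity when similarity meets the threshold."""
    return similarity(a, b) >= threshold


match_variant = same_entity("Acme Widgets, Inc.", "ACME WIDGETS INCORPORATED")
match_typo = same_entity("Acme Widgets, Inc.", "Acme Widgits Inc")
match_other = same_entity("Acme Widgets, Inc.", "Apex Holdings LLC")
```

In practice a threshold like this would be tuned against labeled pairs, and name matching would be combined with other attributes such as addresses or identifiers.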
Metadata and Information Quality
[Chart: Searchable Metadata, 2005-2012; series: Columns, Columns with Metadata]
Framework and Strategy:
– Simple reference model is used to guide consistency of searchable artifacts
– Combination of system, contextual, and application attributes
– Controlled vocabulary for key descriptive elements
– Strategy favors basic discoverability rather than systematized collections
• Data for analytics must be searchable, understandable, and semantically consistent
– Metadata is the nucleus of any data quality strategy
Metadata and Information Quality
Stages of Metadata Collection
[Diagram: metadata collected along the data flow]
– Data flow: Source Systems (Database, Flat File, VSAM) → Extract, Transform, Load, Validate → Staging → DW → Roll-Ups → Query, Analysis, Reporting
– Metadata collected: Source Metadata, ETL/T Metadata, Data Model Metadata, Report Metadata
– Central Metadata Repository
Metadata and Information Quality
System Metadata
Physical properties, data movement, ETL/T, and workflow artifacts
Contextual Metadata
Attributes, references, and other searchable content
Application Metadata
Context dependent logic, conditional rules, and dynamic processing
Source System Characteristics: System properties; File or table names; Data element names and definitions; Data types; Transformation rules
Target System Properties: Table names; Column names; Data types; Indexes; Partitions or table spaces
Data Attributes: Authoritative system; Data element name and definition; Cross-references; Availability; Data type; Legacy source reference
Profiling: Frequencies; Statistical distributions; Trend analysis; Geographic maps
Publishing Standards: Web-based; Standard format; Hierarchical and free-form search
Web-Based Logic: Reports and roll-ups; Lookup tables; Join paths; User reviews; URLs and other links
External communication: Cross references; User reviews; Links to context-dependent data
Reviews: User ID; Table/column reference; Feedback
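A central metadata repository with system, contextual, and application attributes can be sketched as a small data structure. Everything below is hypothetical: the field names, the controlled vocabulary of data types, and the sample entries are illustrative assumptions, not the CDW design. The `coverage` method mirrors the idea of comparing columns against columns that carry metadata.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary for one descriptive element.
DATA_TYPES = {"integer", "decimal", "text", "date"}


@dataclass
class ColumnMetadata:
    table: str               # system metadata
    column: str              # system metadata
    data_type: str           # system metadata, restricted to DATA_TYPES
    definition: str = ""     # contextual metadata
    source_system: str = ""  # contextual metadata
    lookup_table: str = ""   # application metadata

    def __post_init__(self):
        if self.data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {self.data_type!r}")


class MetadataRepository:
    def __init__(self):
        self._entries = []

    def register(self, entry: ColumnMetadata) -> None:
        self._entries.append(entry)

    def search(self, term: str):
        """Free-form, case-insensitive search over names and definitions."""
        needle = term.lower()
        return [
            e for e in self._entries
            if needle in f"{e.table} {e.column} {e.definition} {e.source_system}".lower()
        ]

    def coverage(self) -> float:
        """Share of registered columns that have a definition."""
        if not self._entries:
            return 0.0
        documented = sum(1 for e in self._entries if e.definition)
        return documented / len(self._entries)


repo = MetadataRepository()
repo.register(ColumnMetadata("returns", "refund_amt", "decimal",
                             "Refund issued to the filer", "legacy_feed_a"))
repo.register(ColumnMetadata("returns", "tax_year", "integer",
                             "Tax year of the return"))
repo.register(ColumnMetadata("returns", "prep_id", "text"))

hits = repo.search("refund")
```

Rejecting values outside the controlled vocabulary at registration time is one simple way to keep searchable attributes semantically consistent.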
Workforce Skills
Techniques used by IRS analysts
• Regression-based methods (GLM, logistic, quantile, non-linear, proportional hazards)
• Social network analysis, graph theory
• Machine learning (neural networks, SVMs, genetic algorithms)
• Multivariate statistical methods (discriminant analysis, clustering, density estimation, factor analysis)
• Simulation (Monte Carlo, MCMC, agent-based modeling)
• Decision trees (CART, CHAID, C5, hybrids)
• Bayes rules and other classifiers
Workforce Skills
• Analysts:
– Use of advanced SQL techniques to avoid off-loading data for analytics (in-database computing)
– Understanding and leveraging Open Source tools
• IT Staff:
– Literacy in non-traditional computing architectures
– Support for Open Source tools and analytic databases
– Ability to quickly build and deploy analytic sandboxes
• This is different from typical BI/report/dashboard environments
– Emphasis on algorithms, not just information distribution
• Key is multi-disciplinary skills
Data Privacy and Security
• IRS analytics are done behind the firewall but data still moves
– Data off-loaded to laptops, servers, sandboxes
– External access (Treasury, Congress, universities)
• Permissions management in shared disk environment
– Gets more complex with more users and data
• Security trade-offs and challenges
– Impact of system- and application-level policy changes
– How much continuous monitoring and auditing?
– FISMA and the documentation dilemma