(1)

Big Data and Analytics at the IRS:

Perspectives and Initiatives

Government Big Data Symposium

March 5-6, 2013

Jeff Butler

Director, Research Databases

IRS, Research, Analysis, and Statistics

jeff.butler@irs.gov

(2)

Background

• The Internal Revenue Service (IRS) has a large service and enforcement footprint. The table below is from FY 2011.

Tax Return Processing    234 million tax returns filed
                         1.8 billion third-party information returns
Account Management       $2.4 trillion in gross receipts
                         122 million refunds totaling $415 billion
Customer Service         319 million visits to IRS website
                         83 million toll-free telephone calls
Enforcement              223 million letters or notices sent to taxpayers

(3)

Types of Research and Analysis

Taxpayer Behavior

• Failure to file or pay
• Abusive tax shelters
• Identity theft
• Return preparer compliance
• Misreporting income or deductions
• Refund fraud
• Off-shore transactions
• Financial crimes

Analytic Initiatives

• Identify patterns of filing and payment non-compliance
• Predict and prevent ID theft and refund fraud
• Estimate U.S. tax gap
• Measure taxpayer burden
• Optimize case inventories and treatment strategies
• Simulate effects of tax changes

(4)

Analytic Data Environment in IRS

• IRS enterprise IT manages hundreds of transactional systems and applications

• Research organization integrates legacy and third-party data into the Compliance Data Warehouse (CDW)

Compliance Data Warehouse (CDW) – Selected Metrics

Total data size                            ~ 1.3 PB
Number of database tables                  ~ 3,100
Number of unique columns                   ~ 52,500
Number of searchable metadata attributes   > 1 million
Number of users                            ~ 1,020
Average daily queries                      ~ 6,500

(5)

IRS Analytic Data Environment

Compliance Data Warehouse (CDW)

[Architecture diagram; recoverable labels listed below]

Analytic Sandboxes (Examples): Case Optimization, Predictive Modeling, Text Analytics, Simulation

Core Analytic Database: Statistical & Mathematical Analysis; Ad-Hoc Query and Reporting; Data Extracts, Matching

Data Integration Layer (fed by Enterprise Data)

Infrastructure and Services: Storage Mgmt, System Admin, Security/Audit, Monitoring, Software Config, Accounts, Metadata, Data Profiling, Web Services, Training & Support

(6)

IRS Analytic Data Environment

Compliance Data Warehouse (CDW)

[Infrastructure diagram; recoverable labels listed below]

Core Database Servers (Sybase IQ, Oracle, SQL Server)

Shared Storage (>2PB) (DB, Backup, Staging, User)

Application/Web Servers (SAS, R, Hyperion)

IRS Network: Users & Projects, Systems & Applications, Analytic Sandboxes

(7)

Scale (Volume)

[Figure: two charts. Data Size (Terabytes), 2005 to 2013. Average Daily Queries, 2005 to 2012, with series Third-Party Tools and Web-Based.]

• Not all infrastructure/service costs are constant in scale

– Massively large environments can have asymmetric challenges

– Infrastructure and service components: Systems & Storage Management; ETL & Database Administration; Metadata & Web Services; Security Audit and Monitoring; Tools, Training, & Support; Analytic Sandboxes

(8)

Challenges with Scale

• I/O bottlenecks when data are off-loaded for analytics

– Single biggest problem for users in massively large environments

– Strategy: Maximize in-database analytics where possible

• Finding the optimal mix of ETL tools and techniques

– This is still where data warehousing costs are highest

– Strategy: Stay nimble and avoid one-size-fits-all solutions

• Choosing the right database technology

– Is it performance or scale that’s really needed?

– CDW is the largest database in the IRS and still uses a columnar DB

– Strategy: Maximize performance for users at the smallest O&M cost

• Storage management

– A different approach is needed in a user-based analytic environment

– Strategy: Partition file systems based on …
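The in-database strategy named on this slide can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for an analytic database; the `returns` table, its columns, and the values are hypothetical and chosen only for illustration. Both functions compute the same average, but the first pulls every row out of the database while the second lets the database aggregate and returns one row per year.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a small hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE returns (tax_year INTEGER, refund REAL)")
    rows = [(2010, 100.0), (2010, 300.0), (2011, 30.0), (2011, 60.0), (2011, 90.0)]
    conn.executemany("INSERT INTO returns VALUES (?, ?)", rows)
    return conn

def avg_refund_offloaded(conn):
    """Off-load every row to the client, then aggregate in Python."""
    totals = {}
    for year, refund in conn.execute("SELECT tax_year, refund FROM returns"):
        running_sum, count = totals.get(year, (0.0, 0))
        totals[year] = (running_sum + refund, count + 1)
    return {year: s / n for year, (s, n) in totals.items()}

def avg_refund_in_database(conn):
    """Aggregate inside the database; only one row per year is returned."""
    query = "SELECT tax_year, AVG(refund) FROM returns GROUP BY tax_year"
    return dict(conn.execute(query).fetchall())

conn = build_demo_db()
print(avg_refund_offloaded(conn))
print(avg_refund_in_database(conn))
```

On five rows the difference is invisible; the point of the sketch is that the second query moves a result set whose size depends on the number of groups, not on the number of rows scanned.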

(9)

Timeliness (Velocity)

[Figure: two charts. Data Arrival Rate, 2003 to 2013, on a scale running from Annual through Quarterly, Monthly, and Weekly to Daily. Ingest-Release Latency, 2005 to 2012.]

• Data arrival rates are different from data delivery rates

– Minimizing this difference is inherently an ETL problem

– Stages: Data Extract/Feed → Validation/Pre-processing → Integration/Post-processing → Analysis/Modeling → Interpretation/Action

(10)

Challenges with Velocity

• The larger the data size, the longer the processing time

– Let $P_{ij}$ and $S_{ij}$ be the processing time and size of data set $i$ with frequency $j$; $i, j = 1, 2, \ldots, n$

– The problem is $\arg\min \sum \theta_{ij} (P \mid S)_{ij} + \varepsilon_{ij}$

– Processing time varies with scale (and complexity)

– Disturbances $\varepsilon_{ij}$ are unavoidable (e.g., server maintenance)

• Data may require validation, standardization, and cleaning

– No two data sets are the same

• Structured vs. unstructured data

– What is impact of frequent schema changes on data delivery times for structured data?
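One way to get a feel for how processing time varies with scale is to fit a simple line, time = a + b × size, to observed load times. The sketch below is only an illustration under that assumed linear form (the slide does not specify a functional form). It uses ordinary least squares in plain Python, and the sizes and times are made-up numbers, not IRS measurements.

```python
def fit_line(sizes, times):
    """Ordinary least squares fit of times = a + b * sizes; returns (a, b)."""
    n = len(sizes)
    mean_s = sum(sizes) / n
    mean_p = sum(times) / n
    cov = sum((s - mean_s) * (p - mean_p) for s, p in zip(sizes, times))
    var = sum((s - mean_s) ** 2 for s in sizes)
    b = cov / var              # marginal processing time per unit of size
    a = mean_p - b * mean_s    # fixed overhead independent of size
    return a, b

# Hypothetical observations: data set size (GB) and processing time (hours),
# constructed here as 0.5 + 0.1 * size so the fit is easy to check.
sizes = [10, 20, 40, 80]
times = [1.5, 2.5, 4.5, 8.5]

a, b = fit_line(sizes, times)
print(f"fixed overhead ~ {a:.2f} h, marginal cost ~ {b:.3f} h per GB")
```

With real load logs, the residuals of such a fit would play the role of the disturbances mentioned above (for example, a run delayed by server maintenance).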

(11)

Heterogeneity (Variety)

Sources of IRS Data: Taxpayers, Employers, Preparers, Banks, Brokers, Non-Profits, Interagency, Fed/State, Treaty Partners, Intermediaries

Types of IRS Data: Forms, Schedules, Worksheets, Attachments, Images, Correspondence, Transactions, Phone Calls, Notices, Transcripts

Source Systems and Data Formats

Systems: Mainframe, Unix/Linux, Windows, Databases, VSAM, Flat Files, Applications

Formats: DB tables, Fixed format, Hierarchical, Delimited, Packed decimal, XML, Plain text
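Packed decimal, one of the formats on this slide, is a common mainframe encoding that general-purpose analytic tools do not read directly. Below is a minimal decoding sketch, assuming the usual layout: two binary-coded decimal digits per byte, with the final low nibble holding the sign (0xD for negative, 0xC or 0xF for positive or unsigned). The function name and the `scale` parameter are illustrative and not taken from any IRS system.

```python
from decimal import Decimal

def unpack_packed_decimal(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal (BCD) field into a Decimal.

    raw   -- the field bytes; the last low nibble is the sign
    scale -- number of implied digits after the decimal point
    """
    if not raw:
        raise ValueError("empty field")
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)      # high nibble
        nibbles.append(byte & 0x0F)    # low nibble
    sign_nibble = nibbles.pop()        # final nibble carries the sign
    if sign_nibble not in (0x0C, 0x0D, 0x0F):
        raise ValueError(f"unexpected sign nibble: {sign_nibble:#x}")
    if any(d > 9 for d in nibbles):
        raise ValueError("non-decimal digit in field")
    value = int("".join(str(d) for d in nibbles))
    if sign_nibble == 0x0D:
        value = -value
    # Shift the implied decimal point without going through floats.
    return Decimal(value).scaleb(-scale)

print(unpack_packed_decimal(bytes([0x12, 0x34, 0x5C]), scale=2))
print(unpack_packed_decimal(bytes([0x01, 0x9D])))
```

Returning a `Decimal` rather than a `float` keeps currency amounts exact, which matters when the extracted values are later matched or summed.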

• Overwhelming majority of IRS data are still structured

– Most transaction systems are still file-based

• Challenge: skills needed to parse and analyze text

– Information extraction and entity resolution techniques (NLP)

(12)

Metadata and Information Quality

[Figure: Searchable Metadata, 2005 to 2012, with series Columns and Columns with Metadata.]

Framework and Strategy

• Simple reference model is used to guide consistency of searchable artifacts

• Combination of system, contextual, and application attributes

• Controlled vocabulary for key descriptive elements

• Strategy favors basic discoverability rather than systematized collections

• Data for analytics must be searchable, understandable, and semantically consistent

– Metadata is the nucleus of any data quality strategy
(13)

Metadata and Information Quality

Stages of Metadata Collection

[Flow diagram; recoverable labels listed below]

Data flow: Source Systems (Database, Flat File, VSAM); Extract, Validate, Transform, Load; Staging; DW; Roll-Ups; Query, Analysis, Reporting

Metadata collected along the flow: Source Metadata, ETL/T Metadata, Data Model Metadata, Report Metadata

All stages feed the Central Metadata Repository

(14)

Metadata and Information Quality

System Metadata

Physical properties, data movement, ETL/T, and workflow artifacts

• Source System Characteristics: System properties; File or table names; Data element names and definitions; Data types

• Transformation rules

• Target System Properties: Table names; Column names; Data types; Indexes; Partitions or table spaces

Contextual Metadata

Attributes, references, and other searchable content

• Data Attributes: Authoritative system; Data element name and definition; Cross-references; Availability; Data type; Join paths; Legacy source reference

• Profiling: Frequencies; Statistical distributions; Trend analysis; Geographic maps

• Publishing Standards: Standard format; Web-based; Hierarchical and free-form search

Application Metadata

Context dependent logic, conditional rules, and dynamic processing

• Web-Based Logic: Reports and roll-ups; Lookup tables; URLs and other links; External communication; Links to context-dependent data

• User Reviews: User ID; Table/column reference; Feedback
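To make the three metadata categories concrete, here is a small sketch of how one column's metadata might be recorded and searched. The `ColumnMetadata` class, its field names, and the sample entries are hypothetical and are not drawn from the CDW repository. The search is a naive case-insensitive keyword match standing in for free-form search.

```python
from dataclasses import dataclass, field

@dataclass
class ColumnMetadata:
    table: str
    column: str
    system: dict = field(default_factory=dict)       # physical properties, ETL/T
    contextual: dict = field(default_factory=dict)   # definitions, references
    application: dict = field(default_factory=dict)  # rules, lookups, links

    def text(self) -> str:
        """Flatten all attributes into one lowercase searchable string."""
        parts = [self.table, self.column]
        for group in (self.system, self.contextual, self.application):
            parts.extend(str(v) for v in group.values())
        return " ".join(parts).lower()

def search(repository, keyword):
    """Return (table, column) pairs whose metadata mentions the keyword."""
    needle = keyword.lower()
    return [(m.table, m.column) for m in repository if needle in m.text()]

# Hypothetical entries for illustration only.
repository = [
    ColumnMetadata("returns", "filing_status",
                   system={"data_type": "CHAR(1)"},
                   contextual={"definition": "Filing status claimed on the return"},
                   application={"lookup_table": "filing_status_codes"}),
    ColumnMetadata("refunds", "refund_amt",
                   system={"data_type": "DECIMAL(15,2)"},
                   contextual={"definition": "Refund amount issued to the taxpayer"}),
]

print(search(repository, "refund"))
```

Keeping the three groups as separate dictionaries mirrors the slide's split, so each group can be populated at a different stage (system metadata during load, contextual metadata by data stewards, application metadata by the web layer).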

(15)

Workforce Skills

Techniques used by IRS analysts

• Regression-based methods (GLM, logistic, quantile, non-linear, proportional hazards)

• Social network analysis, graph theory

• Machine learning (neural networks, SVMs, genetic algorithms)

• Multivariate statistical methods (discriminant analysis, clustering, density estimation, factor analysis)

• Simulation (Monte Carlo, MCMC, agent-based modeling)

• Decision trees (CART, CHAID, C5, hybrids)

• Bayes rules and other classifiers

(16)

Workforce Skills

• Analysts:

– Use of advanced SQL techniques to avoid off-loading data for analytics (in-database computing)

– Understanding and leveraging Open Source tools

• IT Staff:

– Literacy in non-traditional computing architectures

– Support for Open Source tools and analytic databases

– Ability to quickly build and deploy analytic sandboxes

• This is different from typical BI/report/dashboard environments

– Emphasis on algorithms, not just information distribution

– Key is multi-disciplinary skills

(17)

Data Privacy and Security

• IRS analytics are done behind the firewall but data still moves

– Data off-loaded to laptops, servers, sandboxes

– External access (Treasury, Congress, universities)

• Permissions management in shared disk environment

– Gets more complex with more users and data

• Security trade-offs and challenges

– Impact of system- and application-level policy changes

– How much continuous monitoring and auditing?

– FISMA and the documentation dilemma
