ADVANCED ANALYTICS
AND FRAUD DETECTION
THE RIGHT TECHNOLOGY
FOR NOW AND THE FUTURE
Big Data
What tax agencies are or will be seeing!
• Big Data
• Large and increased data volumes
• New and emerging data types/sources
• New multi-structured data types with unknown relationships that require processing of data regardless of size to discover insights
• Examples: web logs, sensor networks, social networks, text
• Increased reporting requirements such as merchant cards (Form 1099-K) and cost basis reporting on securities sales (Form 1099-B)
• Key Points
• Analyze all the data, not just random samples
• The need for fast processing to detect and prevent fraud
More’s Law …
(as in more data)
Big Data Challenges are More Than Data Size
“CIOs face significant challenges in addressing the issues surrounding big data… New technologies and applications are emerging and should be investigated to understand their potential value.”
Source: CEO Advisory: ‘Big Data’ Equals Big Opportunity, Gartner, 31 March 2011.
The Four Axes of Big Data
Data in a Tax Agency
Structured and Unstructured Data
• Big Box Retailers/Corporations
• Seller/Retailer Data
• Correspondence & Emails
• Web Logs
• Case Notes
• Customs Data
• Work Papers
i.e. Audit Leads, Nexus Payments
Leveraging data for Taxpayer Education,
Compliance and Service Enhancement
• Humans are social by nature; social media is just an enabler
• Untapped social network data EVERYWHERE!
- Existing consumer/taxpayer transaction data & interaction data
- You are not constrained to Twitter and Facebook feeds to obtain taxpayer (TP) behavior and/or data
What if… you could determine, by applying text analytics, that a taxpayer who claimed no income in 2011 bought three motorcycles in 2011?
What if… you could be ‘notified’ that a taxpayer claimed on a blog, on Facebook, etc. that he cheated your tax department?
Statistical Modeling
• The most powerful method is to use statistical models to assess fraud risk
• To build a predictive model, you first need to identify a set of known historical cases
• Clustering can also be used to find cases with similar characteristics. This
won’t predict fraud, but can identify unusual groupings of cases
[Figure: cluster plot with axes Transactions and Login Time (both ranging from -1.5 to 1.5), showing clusters C1, C2 and C3]
• Cluster analysis can help find cases that have similar profiles
• Decision trees can help identify drivers of fraud and high-risk cases
• Response modeling can provide rankings on overall fraud risk
Various modeling
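As an illustration of response modeling, the sketch below fits a small logistic regression on known historical cases and then ranks new cases by predicted fraud risk. It is a minimal example under stated assumptions, not the method of any product named in this deck: the two features (a refund-to-income ratio and a scaled count of address changes), the training rows and the return IDs R1 to R3 are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression weights with plain batch gradient descent."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this known case.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def score(w, b, x):
    """Predicted probability that case x is fraudulent."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical known historical cases: [refund_ratio, address_changes_scaled]
X = [[0.10, 0.0], [0.20, 0.1], [0.15, 0.2],
     [0.90, 0.8], [0.80, 0.9], [0.95, 0.7]]
y = [0, 0, 0, 1, 1, 1]  # 1 = confirmed fraud, 0 = legitimate

w, b = fit_logistic(X, y)

# Hypothetical new returns to be ranked by risk, highest first.
new_returns = {"R1": [0.85, 0.9], "R2": [0.10, 0.1], "R3": [0.50, 0.5]}
ranked = sorted(new_returns, key=lambda k: score(w, b, new_returns[k]), reverse=True)
print(ranked)
```

The ranking, rather than a hard yes/no label, is what lets an agency work the highest-risk cases first with a fixed audit capacity.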
One Analytic Data Solution
[Diagram: data sources feeding two platforms]
Data sources: Text, Social media, Machine data, Web logs, SCM, ERP, Trans, 3rd Party, CRM
Big Data Insight: Aster Data Analytic Platform; SQL-MapReduce Analytics on multi-structured data (Pattern Analysis, Path Analysis, Graph Analysis)
Strategic & Operational Intelligence: Teradata Integrated Data Warehouse; SQL Analytics on structured data (Ad Hoc/OLAP, Predictive Analytics, Spatial/Temporal, Active Execution)
In-Database Analytic Processing
Enabling Better, Faster Insight
Advanced Visualization
Text Analytics
Reporting and OLAP
Advanced Analytics
Parallel Performance
Who is Teradata ?
• Global Leader in Enterprise Data Warehousing
• Headquartered in Ohio
• 9,200+ associates
• Analytic Solutions and Consulting Services
• The leader in Gartner’s Leaders Quadrant since 1999
• U.S. publicly-traded software company
• S&P 500 Member, Listed NYSE: “TDC”
• Founded in 1979, public launch in 2007
• Global presence and world-class customer list
• More than 1,300 customers, more than 2,500 installations
• 28 Federal and State partners
Extended Appliance Family
Launched 2008
Simple
Powerful
Affordable!
Teradata Tax Team
Deep tax domain Compliance
Customer service
Teradata is THE Leader and has been since 1999!
GARTNER MAGIC QUADRANT
DATA WAREHOUSE DBMS, 2012
Source: Magic Quadrant for Data Warehouse Database Management Systems, Mark Beyer, Donald Feinberg, Merv Adrian, Roxanne Edjlali, 2/6/12.
Teradata Workload-Specific Platform Family
Platform (model): scalability; workloads
• Data Mart Appliance (560): up to 12TB; test/development or smaller data marts
• Extreme Data Appliance (1650): up to 186PB; analytical archive, deep dive analytic
• Data Warehouse Appliance (2690): up to 315TB; strategic intelligence, decision support system, fast scan
• Extreme Performance Appliance (4600): up to 18TB; operational intelligence, lower volume, high performance
• Active Enterprise Data Warehouse (66XX): up to 92PB; strategic & operational intelligence, real time update, active workloads
• Aster MapReduce Appliance: up to 5PB; discovery platform for big data analytics with embedded SQL-MapReduce for new data types & sources
15 8/14/2012 Teradata Confidential
Scalability Across Multiple Dimensions
Dimensions: Data Volume (Raw, User Data), Schema Sophistication, Query Freedom, Query Complexity, Data Freshness, Query Data Volume, Query Concurrency, Workload Management
Competition scales one dimension at the expense of others: limited by technology!
Teradata can scale simultaneously across multiple dimensions: driven by business!
Automatic Built-In Functionality
• Fast Query Performance: “Parallel Everything” design and smart Teradata optimizer enable fast query execution across platforms
• Quick Time to Value: simple setup steps with automatic “hands off” distribution of data, along with integrated load utilities, result in rapid installations
• Simple to Manage: DBAs never have to set parameters, manage table space, or reorganize data
• Responsive to Business Change: fully parallel MPP “shared nothing” architecture scales linearly across data, users, and applications, providing consistent and predictable performance and growth
Easy “Set & Go” Optimization Options
• Powerful, Embedded Analytics: in-database data mining, virtual OLAP/cubes, pre-built and custom application objects (User Defined Functions) drive efficient and differentiated business insight
• Advanced Workload Management: workload management options by user, application, time of day and CPU exceptions
• Intelligent Scan Elimination: “Set and Go” options reduce full file scanning (Primary, Secondary, Multi-level Partitioned Primary, Aggregate Join Index, Sync Scan)
Analytical Ecosystem
The Ecosystem Is The Warehouse
[Diagram: Teradata platforms (560, 1650, 2650, 66XX) alongside Teradata Aster (Aster Data SQL-MapReduce)]
Unified Big Data Architecture for the Enterprise
Data sources: Audio/Video, Images, Text, Web & Social, Machine Logs, CRM, SCM, ERP
Users: Engineers, Data Scientists, Quants, Business Analysts
Tools: Java, C/C++, Pig, Python, R, SAS, SQL, Excel, BI, Visualization, etc.
Platforms: Capture, Store, Refine; Discovery Platform; Integrated Data Warehouse
Aster SQL-MapReduce:
What Is It and Why It Is Important to In-Database Analytics?
• Patented framework for advanced analytics that are hard to define in SQL
- Couples SQL (relational) with MapReduce (SQL-MapReduce)
- Invoked from SQL and automatically parallelized
- Includes a library of pre-packaged analytic modules
• Architecture for diverse, embedded analytics processing
- Supports custom analytics written in a variety of languages, e.g. Java
• Combines SQL & visual tools
[Diagram: Aster Data nCluster, with applications (App) running on SQL and SQL-MapReduce]
Ease of Development and Reuse
Analytic Foundation : 50+ out-of-the-box modules
Modules
Business-ready SQL-MapReduce Functions
Path Analysis
Discover patterns in rows of sequential data
• nPath: complex sequential analysis for time series analysis and behavioral pattern analysis
• Sessionization: identifies sessions from time series data in a single pass over the data
• Attribution: operator to help ad networks and websites to distribute “credit”
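To make the sessionization idea concrete, here is a minimal Python sketch. It is not the Aster function: the 30-minute inactivity timeout, the `(user_id, timestamp)` input shape and the function name are assumptions chosen for illustration. After sorting, each user's events are scanned once and a new session starts whenever the gap since that user's previous event exceeds the timeout.

```python
def sessionize(events, timeout=1800):
    """Assign a per-user session number to each (user_id, timestamp_seconds) event.

    A new session starts when the gap since the user's previous event
    exceeds `timeout` seconds (30 minutes here, an assumed default).
    """
    last_seen = {}    # user_id -> timestamp of that user's previous event
    session_no = {}   # user_id -> current session number
    out = []
    # One ordered pass over the data, grouped by user and sorted by time.
    for user, ts in sorted(events, key=lambda e: (e[0], e[1])):
        if user not in last_seen or ts - last_seen[user] > timeout:
            session_no[user] = session_no.get(user, -1) + 1
        last_seen[user] = ts
        out.append((user, ts, session_no[user]))
    return out

# Hypothetical web-log events: (user_id, seconds since midnight)
events = [("u1", 0), ("u1", 600), ("u2", 100), ("u1", 4000)]
sessions = sessionize(events)
print(sessions)
```

Here the last `u1` event arrives 3,400 seconds after the previous one, so it opens a second session for that user.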
Statistical Analysis
High-performance processing of common statistical calculations
• Histogram: function to provide capability of generating
• Decision Trees: native implementation of parallel random forests
• Approximate percentiles and distinct counts: calculate percentiles and counts within specific variance
• Correlation: calculation that characterizes the strength of the relation between different data fields
• Regression: performs linear or logistic regression between an output variable and a set of input variables
• Averages: calculate moving, weighted, exponential or volume-weighted averages over a window of data
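Two of the averages named above can be sketched in a few lines of Python. This is an illustrative example only, not the packaged module: the function names, the window size and the smoothing factor `alpha` are assumptions.

```python
def moving_average(values, window):
    """Simple moving average over a sliding window of `window` values."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        if i >= window - 1:
            out.append(running / window)
    return out

def exponential_average(values, alpha):
    """Exponentially weighted average: recent values count more (0 < alpha <= 1)."""
    out = []
    ema = None
    for v in values:
        ema = v if ema is None else alpha * v + (1 - alpha) * ema
        out.append(ema)
    return out

# Hypothetical daily payment amounts
payments = [1, 2, 3, 4, 5]
print(moving_average(payments, 3))
print(exponential_average([10, 20], 0.5))
```

Comparing a taxpayer's latest filing against its own moving average is one simple way to surface sudden changes worth a closer look.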
Relational Analysis
Discover important relationships among data
• Graph analysis: finds shortest path from a distinct node to all other nodes in a graph
• Tokenization: splits strings into individual words to assist text
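The graph analysis described above (shortest path from one node to all others) can be illustrated with Dijkstra's algorithm. This is a generic sketch, not the Aster implementation, which may use a different method; the adjacency-list format and the example entities are hypothetical.

```python
import heapq

def shortest_paths(graph, source):
    """Dijkstra's algorithm: shortest distance from `source` to every reachable node.

    `graph` maps each node to a list of (neighbor, non-negative weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

# Hypothetical network: nodes are entities, weights are link "distances"
graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}
distances = shortest_paths(graph, "A")
print(distances)
```

In a tax setting, short paths between a flagged entity and other taxpayers can suggest related parties that merit review.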
Modules
SQL-MapReduce Analytic Functions
Text Analysis
Derive patterns in textual data
• Text Processing: counts occurrences of words, identifies roots, & tracks relative positions of words & multi-word phrases
• Text Partition: analyzes text data over multiple rows
• Levenshtein Distance: computes the distance between two words
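Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is the standard dynamic-programming formulation in Python, shown for illustration rather than as the packaged function; the name-matching example is hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))
print(levenshtein("Jon Smith", "John Smith"))
```

A small distance between two names or addresses on different filings is a hint that they may refer to the same taxpayer despite a typo.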
Cluster Analysis
Discover natural groupings of data points
• k-Means: clusters data into a specified number of groupings
• Canopy: partitions data into overlapping subsets within which k-means is performed
• Minhash: buckets high-dimensional items for cluster analysis
• Basket analysis: creates configurable groupings of related items from transaction records in a single pass
• Collaborative Filter: predicts the interests of a user by collecting interest information from many users
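A minimal k-means sketch in plain Python shows how data points are grouped into a specified number of clusters. It is illustrative only: the initialization (first k points, chosen so the run is deterministic), the fixed iteration count and the sample points (standardized transaction and login-time values) are assumptions, not details of the packaged module.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Cluster points into k groups; returns (assignment, centroids)."""
    # Assumption: seed centroids with the first k points for a deterministic run.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(range(k),
                                key=lambda c: squared_distance(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

# Hypothetical standardized (transactions, login_time) pairs
points = [(-1.0, -1.1), (1.0, 0.9), (-1.2, -0.9),
          (0.9, 1.1), (-0.9, -1.0), (1.1, 1.0)]
assignment, centroids = kmeans(points, k=2)
print(assignment)
```

Clustering like this does not label any case as fraud; it only groups similar profiles so that unusual groupings can be examined.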
Data Transformation
Transform data for more advanced analysis
• Unpack: extracts nested data for further analysis
• Pack: compresses multi-column data into a single column
• Antiselect: returns all columns except for a specified column
• Multicase: case statement that supports row match for multiple cases
Aster SQL-MapReduce and Hadoop MapReduce
Aster SQL-MapReduce
• Customized MapReduce
• Deployed via SQL-MR and BI and visualization tools
• Easy-to-manage database
• 50+ packaged SQL-MapReduce analytics
• SQL: “language of business”
• Integrated Development Environment (IDE)
Hadoop MapReduce
• Customized MapReduce
• Deployed via application code and people
• File system
• Batch processing
• Requires lots of coding
SELECT *
FROM nPath (
    ON (…)
    PARTITION BY sba_id
    ORDER BY datestamp
    MODE (NONOVERLAPPING)
    PATTERN ('(OTHER_EVENT|FEE_EVENT)+')
    SYMBOLS (
        event LIKE '%REVERSE FEE%' AS FEE_EVENT,
        event NOT LIKE '%REVERSE FEE%' AS OTHER_EVENT)
    RESULT (…)
);
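For readers without access to an Aster cluster, the logic of the query above can be approximated in Python. This is a rough analog under stated assumptions, not how nPath is implemented: rows are assumed to be `(sba_id, datestamp, event)` tuples, each event is mapped to a one-letter symbol (F for FEE_EVENT, O for OTHER_EVENT), and the PATTERN clause becomes an ordinary regular expression whose non-overlapping matches stand in for MODE (NONOVERLAPPING). The elided ON and RESULT clauses are not reproduced.

```python
import re
from collections import defaultdict

def classify(event):
    """Mirror of the SYMBOLS clause: F = FEE_EVENT, O = OTHER_EVENT."""
    return "F" if "REVERSE FEE" in event else "O"

def npath_like(rows, pattern=r"(?:O|F)+"):
    """Partition rows by sba_id, order by datestamp, and match the symbol pattern."""
    partitions = defaultdict(list)
    for sba_id, datestamp, event in rows:          # PARTITION BY sba_id
        partitions[sba_id].append((datestamp, event))
    matches = []
    for sba_id, events in partitions.items():
        events.sort(key=lambda pair: pair[0])      # ORDER BY datestamp
        symbols = "".join(classify(e) for _, e in events)
        # re.finditer yields non-overlapping matches, scanning left to right.
        for m in re.finditer(pattern, symbols):
            matched = [e for _, e in events[m.start():m.end()]]
            matches.append((sba_id, m.group(0), matched))
    return matches

# Hypothetical account events
rows = [
    ("A1", "2012-01-03", "REVERSE FEE"),
    ("A1", "2012-01-01", "DEPOSIT"),
    ("B2", "2012-01-02", "LOGIN"),
    ("A1", "2012-01-02", "LATE FEE CHARGED"),
]
result = npath_like(rows)
print(result)
```

Because the two symbols are complementary, this particular pattern matches each account's whole event sequence; a narrower pattern such as `O+F` would isolate only the runs that end in a fee reversal.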
Teradata Workload-Specific Platforms
Teradata Aster Software Only
• Purpose: complex, high-speed analytics for emerging big data
• Scalability: flexible
• Sub segment: massively parallel software solution with embedded SQL-MapReduce analytics for new data types and sources
Teradata Aster Cloud Edition
• Purpose: Teradata Aster nCluster for Amazon Web Services, AppNexus, Dell’s Data Cloud and Terremark
• Scalability: elastic
• Sub segment: on-demand extreme scaling with no downtime, always-on data cloud availability for high-performance next-generation analytics for big data
Aster MapReduce Appliance
• Purpose: integrated discovery platform
• Scalability: up to 5PB
• Sub segment: embedded SQL-MapReduce analytics on Teradata hardware