Taking Data Analytics
to the Next Level
Implementing and Supporting
Big Data Initiatives
What Is Big Data and How Is It
Applicable to Anti-Fraud Efforts?
Definition
Gartner: Big data is high-volume, -velocity,
and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
© 2013 Association of Certified Fraud Examiners, Inc.
Why Big Data?
Fact Gathering on an Investigation or Proactive Compliance Program
Interviews
Document analysis (unstructured data)
  Email & user documents
  Social media
  Corporate document repositories
  News feeds & research
Financial & operational analysis (structured data)
  Sales records
  Payment or expense details
  Selected general ledger accounts
  Financial reports and analysis
Interviews pull from document analysis and financial and operational analysis.
IBM Projection: Massive Explosion of Data
The Dawn of Big Data: The uncertainty of new information is growing alongside its complexity
MapReduce
MapReduce is built on the proven concept of divide and conquer: it’s much faster to break a massive task into smaller chunks and process them in parallel.
In 2004, Google decided to harness the power of parallel, distributed computing to
digest the enormous amounts of data produced in its daily operations. The result was a group of technologies and an architectural design known as MapReduce.
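The divide-and-conquer idea described above can be sketched in a few lines of Python. This is a minimal illustration, not Google's or Hadoop's implementation: the function names (`map_chunk`, `merge_counts`, `word_count`), the chunk size, and the sample records are all hypothetical, and threads stand in for the cluster nodes that a real deployment would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count the words in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(lines, chunk_size=2):
    # Divide: split the input into fixed-size chunks.
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    # Conquer: process the chunks concurrently (threads stand in for cluster nodes).
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Combine: fold the partial results into a single total.
    return reduce(merge_counts, partials, Counter())


# Hypothetical payment descriptions, for illustration only.
records = [
    "consulting fee paid to vendor",
    "vendor refund issued",
    "consulting fee adjusted",
    "gift to vendor",
]
totals = word_count(records)
print(totals["vendor"], totals["consulting"])
```

The result does not depend on how the input is chunked, which is what allows the map step to be spread across many machines.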
Hadoop
The Hadoop implementation of MapReduce was
created by Doug Cutting and is written in Java.
After it was created, Hadoop was turned over to the Apache Software Foundation.
It is now maintained as an open-source, top-level project with a global community of contributors.
Early deployments include some of the best-known, most technologically advanced
organizations, such as Yahoo, Facebook, and LinkedIn.
Pig and Hive
To build applications on Hadoop, one normally uses a programming interface such as Java, Pig, or Hive.
Pig: A specialized higher-level MapReduce language
Hive: A specialized SQL-based MapReduce language
Many other programming interfaces exist.
IBM Survey: Big Data Sources
Still images/video: 34%
Audio: 38%
Geospatial: 40%
RFID scans or POS data: 41%
Free-form text: 41%
Sensors: 42%
External Feeds: 42%
Social Media: 43%
Emails: 57%
Events: 59%
Log Data: 73%
Transactions: 88%
IBM Survey: Big Data Analytics Activities
Voice analytics: 25%
Video analytics: 26%
Streaming analytics: 35%
Geospatial analytics: 43%
Natural language text: 52%
Simulation: 56%
Optimization: 65%
Predictive modeling: 67%
Data visualization: 71%
Data mining: 77%
Query and reporting: 91%
Fraud Detection Requires a
Comprehensive Approach
Platform (Analytics): Analyze, Forecast, Plan, Collaborate, Simulate, Survey, Govern, Discover, Model, Predict, Mine, Report, Score, Visualize, Decide
For fraud detection, any direction can, and should, be taken when applying analytics to our platform.
Recall the Forensic Analytics Maturity
Model
Axes: Detection Rate (Low to High); False Positive Rate (High to Low)
Structured Data
  “Traditional” Rules-Based Queries & Analytics: Matching, Grouping, Ordering, Joining, Filtering
  Statistical-Based Analysis: Anomaly Detection, Clustering, Risk Ranking
Unstructured Data
  Traditional Keyword Searching: Keyword Search
  Data Visualization and Text Mining: Data Visualization, Drill-down, Text Mining
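As an illustration of the "traditional" rules-based quadrant (matching, grouping, filtering), the sketch below flags possible duplicate payments. The rule itself (same vendor and same amount), the field names, and the sample payments are assumptions made for the example; they do not come from the presentation.

```python
from collections import defaultdict


def find_duplicate_payments(payments):
    """Group payments by (vendor, amount) and keep groups with more than one entry."""
    groups = defaultdict(list)
    for payment in payments:
        key = (payment["vendor"], payment["amount_cents"])
        groups[key].append(payment["id"])
    # Filter: only groups with two or more payments are potential duplicates.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical payment records; amounts are in cents to avoid float comparisons.
payments = [
    {"id": "P1", "vendor": "Vendor X", "amount_cents": 120000},
    {"id": "P2", "vendor": "Vendor Y", "amount_cents": 45000},
    {"id": "P3", "vendor": "Vendor X", "amount_cents": 120000},
]
duplicates = find_duplicate_payments(payments)
print(duplicates)
```

A rule like this is easy to explain and audit, but it flags every legitimate repeat payment as well. That is consistent with the maturity model's point that rules-based tests tend to carry a high false positive rate.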
Big Data and Anti-Fraud
Email and Instant Message
3rd Party Data Feeds
Social Media
ERP Systems
Transactional Data
Analysis Platform
Structured and unstructured data… is organized and “risk scored”
A More Human Way to Look at Data
Data Points Are Represented as Objects, With Logical Relationships
View supporting documents as dynamic objects
Graphical representation of relationships between seemingly discrete entities
Epicenters of activity become immediately discernable
Search-Around Functionality
Rapidly Build Networks of Interest and Tie In Multiple Data Sources
Easily find entities, documents, events, etc. that are directly related to your selection
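One way to picture search-around is as a neighbor lookup in a graph whose edges come from several data sources. The sketch below is a simplified illustration under that assumption; the function names (`build_graph`, `search_around`), the entities, and the source labels are hypothetical and do not describe any particular analysis platform.

```python
from collections import defaultdict


def build_graph(links):
    """Build an undirected adjacency map from (entity, entity, source) links."""
    graph = defaultdict(set)
    for left, right, source in links:
        graph[left].add((right, source))
        graph[right].add((left, source))
    return graph


def search_around(graph, entity):
    """Return entities directly linked to `entity`, with the source of each link."""
    return sorted(graph.get(entity, set()))


# Hypothetical links drawn from three different data sources.
links = [
    ("Employee A", "Vendor X", "payments"),
    ("Employee A", "Invoice 1001", "documents"),
    ("Vendor X", "Invoice 1001", "payments"),
    ("Employee B", "Vendor X", "email"),
]
graph = build_graph(links)
related = search_around(graph, "Vendor X")
print(related)
```

Repeating the lookup on each returned entity expands the network one hop at a time.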
Geocoding and Heat Maps
Identify Global Epicenters of Activity, As Well As Anomalies
Hotspots of activity are easily identified
Employee-Risk Ranking
Scored by Custodian and Time Period Based on Multiple Criteria
1. Keywords
Percentage of EY-ACFE Fraud Triangle keywords around pressure, opportunity, and
rationalization in email and IM communications. Scaling: 3
2. T&E Analysis
Ranking of T&E out-of-compliance hits and overall email scoring. Scaling: 3

Custodian    C1 C2 C3 C4 C5 C6 C7   Scaling: C1 C2 C3 C4 C5 C6 C7   Score
A, Week 1     1  3  3  4  6  2  3             3  3  4  2  2  3  5      45
A, Week 2     2  2  4  5  3  4  2                                      37
4. User Activity
Percentage of instances within that week where the custodian sends or receives ESI
involving those outside of their peer group, as identified through hierarchies. Scaling: 2
5. 3rd Party Risk
Instances where the employee is linked to high-risk 3rd parties (e.g., customers, vendors,
state-owned entities) as determined by hits on OFAC, sanctions, PEP, or adverse media lists, whether in email, T&E, or sales activity.
Scaling: 2
6. Alias Clustering
Percentage of instances within that week where the custodian sends or receives ESI
involving at least one (1) of their identified communicative aliases. Scaling: 3
7. Emotive Tone
Percentage of instances where the employee sends or receives ESI with negative
emotions (angry, frustrated, secretive, etc.) identified through linguistic analyses. Scaling: 5
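To make criterion 1 (Keywords) above concrete, the sketch below computes a keyword percentage per custodian and week. It rests on several assumptions: "percentage" is read as the share of messages containing at least one keyword, the keyword lists are invented placeholders rather than the EY-ACFE library, and the function names and sample messages are hypothetical. The slides do not state how criterion values are combined into the final score, so that step is not shown.

```python
import re
from collections import defaultdict

# Hypothetical keyword lists; the actual EY-ACFE library is not reproduced here.
FRAUD_TRIANGLE_KEYWORDS = {
    "pressure": {"deadline", "quota"},
    "opportunity": {"override", "write off"},
    "rationalization": {"deserve", "nobody will notice"},
}


def contains_keyword(text):
    """True if the message contains any keyword as a whole word or phrase."""
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(keyword) + r"\b", lowered)
        for keywords in FRAUD_TRIANGLE_KEYWORDS.values()
        for keyword in keywords
    )


def keyword_percentage(messages):
    """Percentage of messages with a keyword hit, per (custodian, week)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for custodian, week, text in messages:
        key = (custodian, week)
        totals[key] += 1
        if contains_keyword(text):
            hits[key] += 1
    return {key: 100.0 * hits[key] / totals[key] for key in totals}


# Hypothetical messages as (custodian, week, text).
messages = [
    ("A", 1, "We will miss the quota unless we book this now."),
    ("A", 1, "Lunch at noon?"),
    ("A", 2, "Just write off the balance, nobody will notice."),
    ("B", 1, "Agenda attached."),
]
scores = keyword_percentage(messages)
print(scores)
```

Scoring by custodian and time period, as the slide describes, makes it possible to compare one person's weeks against each other as well as against peers.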
Employee-Risk Scoring
Risk Scoring Model—Peer Stratification Dashboard Review
Peer Stratification
Dots represent clusters of high-risk communications that can be reviewed by clicking.
Course Recap
Contacts
Vincent Walden, CFE, CPA
Ernst & Young LLP
Partner, Assurance Services
Fraud Investigation & Dispute Services
New York, NY
(212) 773-3643