©2011 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice
Big Data Challenges to E-Discovery
Paul Krneta, BMMsoft, Inc. Perry J. Narancic, Esq., LexAnalytica, PC
Agenda
I. Computing Trends
II. Definition of “Big Data”
III. Legal and Regulatory aspects of Big Data
IV. Technology approaches to Big Data in E-Discovery
V. In the Trenches with Big Data:
A. EDRM Model and Big Data
B. Case Study: Yahoo v. OnlineNIC, No. 08-5698-JF (ND Cal. filed Dec. 19, 2008); Verizon v. OnlineNIC, No. 08-2832 (ND Cal. filed June 6, 2008); Microsoft v. OnlineNIC, No. 08-4648 (ND Cal. filed Oct. 7, 2008).
I. Computing Trends: the progress
1. CPU, disk, RAM, networks: 2x faster/bigger in 18 months
• 10x faster/bigger in 5 years, 100x in 10 years; 10,000x in 20 years
2. Price of HW and networks is dropping as fast as speed is growing
• Same $ buys you 10x faster/bigger system in 5 years, 100x in 10 years
3. Can your application SW take advantage of HW progress?
• Most SW is not designed with HW progress in mind
• 10‐year old SW was designed for 100x slower HW
• 20‐year old SW was designed for 10,000x slower HW
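The multipliers on this slide follow from compounding one doubling every 18 months. A minimal sketch of that arithmetic, assuming steady doubling (the function name `growth_factor` is illustrative and not from the slides):

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Return the speed/capacity multiplier after `years` of steady doubling."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


# Print the multiplier for the three horizons quoted on the slide.
for years in (5, 10, 20):
    print(f"{years:>2} years: ~{growth_factor(years):,.0f}x")
```

The computed values land close to the slide's rounded 10x, 100x and 10,000x figures.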
I. Computing Trends, cont’d
1. Structured Data:
• 80% of all corporate records (sales, phone records, payment data, stock
trades) are structured and stored in relational databases, but due to their
small size (typ. < 1 KB) they represent only 20% of enterprise data volume
• SQL is the main analytic tool for structured data
2. Unstructured data
• emails, documents, multimedia represent over 80% of enterprise data
volume (due to their large size of typ. 100 KB) – but less than 20% in terms
of number of records
• Text search is the main tool to search unstructured data
• Archiving of unstructured data is typ. separate from Text search
3. Structured data = huge number of small records
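The effect of record size on volume can be checked with a short calculation. This is a sketch only: it takes the slide's "typical" sizes (1 KB and 100 KB) and the 80% record share at face value, and the function name `volume_shares` is illustrative. The slide's 80/20 volume split is a round figure and is not derived from these inputs.

```python
def volume_shares(record_share_structured: float,
                  structured_kb: float,
                  unstructured_kb: float) -> tuple:
    """Return (structured, unstructured) shares of total data volume."""
    structured = record_share_structured * structured_kb
    unstructured = (1.0 - record_share_structured) * unstructured_kb
    total = structured + unstructured
    return structured / total, unstructured / total


# Assumed inputs: 80% of records are structured at 1 KB each,
# the remaining 20% are unstructured at 100 KB each.
s, u = volume_shares(0.80, 1.0, 100.0)
print(f"structured: {s:.1%} of volume, unstructured: {u:.1%} of volume")
```

With these inputs the unstructured share comes out well above 80%, which points the same way as the slide: a minority of large records dominates the volume.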
II. Big Data = size, speed and diversity
• Big Data happens when Moore’s law meets data growth
• Data volume doubles every 18 months
• Big Data Challenges:
• Loading and indexing speed of Big Data has to be very high
• Search speed of Big Data must be very high
• Critical: SQL analysis and Text search must be unified
• Storage cost for Big Data can be astronomical – if not done right
• Examples of Big Data: database records, emails, documents, SMS, video, audio, sensor data, social media and more
• To discover new information and relationships arising from the cross-correlation of the entire data set, rather than isolated silos
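The idea of unifying SQL analysis with text search can be sketched with Python's built-in `sqlite3` module. Everything here is an assumption made for illustration: the table names, the sample rows, and the use of `LIKE` as a stand-in for a real full-text index. It is not a description of any product named in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (id INTEGER PRIMARY KEY, account TEXT, amount REAL);
    CREATE TABLE emails   (id INTEGER PRIMARY KEY, sender TEXT, body TEXT);
""")
# Hypothetical structured records.
conn.executemany(
    "INSERT INTO payments (account, amount) VALUES (?, ?)",
    [("ACC-1001", 250000.0), ("ACC-1002", 120.0), ("ACC-1003", 98000.0)],
)
# Hypothetical unstructured records.
conn.executemany(
    "INSERT INTO emails (sender, body) VALUES (?, ?)",
    [("alice@example.com", "Please keep the ACC-1001 transfer off the books."),
     ("bob@example.com", "Lunch on Friday?"),
     ("carol@example.com", "Invoice for ACC-1002 attached.")],
)


def large_payments_mentioned_in_email(min_amount: float, keyword: str):
    """Combine a structured filter (amount) with a text filter (keyword in body)."""
    query = """
        SELECT p.account, p.amount, e.sender
        FROM payments AS p
        JOIN emails AS e ON e.body LIKE '%' || p.account || '%'
        WHERE p.amount >= ? AND e.body LIKE '%' || ? || '%'
        ORDER BY p.account
    """
    return conn.execute(query, (min_amount, keyword)).fetchall()


hits = large_payments_mentioned_in_email(50000, "off the books")
print(hits)
```

One query answers a question that neither silo can answer alone: which large payments are also discussed in email using a given phrase.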
II. Big Data, a marketing hype?
• Of course, there is Big Data hype – just like there was Web hype, CRM hype etc.
• But Big Data is nothing new – other than a name
• Big Data is here to stay and burden us
– See e.g. Steve Lohr, “How Big Data Became so Big”, NYT, Aug. 11, 2012.
http://www.nytimes.com/2012/08/12/business/how‐big‐data‐became‐so‐big‐unboxed.html
– Bryant, Katz, Lazowska, “Big Data Computing: Creating revolutionary
breakthroughs in commerce, science and society”, Computing
II. Big Data Applications
• Long and diverse list of Big Data uses
• Civil and Criminal Litigation – E-Discovery
• Regulation (e.g. SEC, HIPAA)
• Internal Policing and Dispute Avoidance (internal surveillance to catch problems, like harassment, fraud, etc.)
• National Security (cross-correlating addresses, names, phone numbers)
• Audit
III. Technology Approaches to Big Data
Multiple approaches to Big Data:
• “Federation”: applications search multiple silos of data
– Relies on original data repository (“data silo”) to search data
– Slow, expensive, unreliable
• In-place Indexing (similar to Google)
– Data sources are scanned and indexed to enable search
– Original data is not managed (=can be modified/deleted, no “HOLD”)
• EDMT:
– EDMT stores emails, Documents, Multimedia and DB Transactions for data
compliance, retention and analysis
– EDMT runs Text+SQL cross‐analysis of all data to allow eDiscovery, Audit,
Fraud Detection, CRM, GRC, BI etc. to run up to 1,000x faster than before
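The difference between the first two approaches can be sketched in a few lines. This is a toy illustration under stated assumptions: the `silos` dictionary, the document ids, and the function names are all hypothetical, and a real deployment would query remote repositories, not in-memory dictionaries.

```python
from collections import defaultdict
import re

# Hypothetical silos: repository name -> {document id: text}
silos = {
    "mail":  {"m1": "Quarterly audit report attached", "m2": "Lunch on Friday"},
    "files": {"f1": "Audit checklist for the fraud review", "f2": "Holiday schedule"},
}


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def federated_search(term):
    """Federation: every silo scans its own documents at query time."""
    term = term.lower()
    return sorted((name, doc_id)
                  for name, docs in silos.items()
                  for doc_id, text in docs.items()
                  if term in tokenize(text))


def build_index():
    """In-place indexing: scan each source once, record where each word occurs."""
    index = defaultdict(set)
    for name, docs in silos.items():
        for doc_id, text in docs.items():
            for word in tokenize(text):
                index[word].add((name, doc_id))
    return index


index = build_index()


def indexed_search(term):
    """Answer a query from the prebuilt index without touching the sources."""
    return sorted(index.get(term.lower(), set()))


before_federated = federated_search("audit")
before_indexed = indexed_search("audit")

# Simulate an unmanaged source: the email is deleted after it was indexed.
del silos["mail"]["m1"]
after_federated = federated_search("audit")
after_indexed = indexed_search("audit")  # still lists ("mail", "m1")
```

Before the deletion both searches return the same two documents. Afterwards the index still points at a document that no longer exists, which illustrates the slide's point that in-place indexing does not manage the original data and offers no hold.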
III. 1 PB of data in EDMT: 2007 vs. 2012
• 1 PB (1,030 TB) of data
• 6 Trillion records loaded + indexed
• Loading + indexing speed:
– 285 B records per day
– 35 TB/day
• Fresh data loaded in < 2 sec
• Search time: < 0.5 sec
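The daily loading figures are easier to compare with hardware specifications when expressed per second. A minimal unit-conversion sketch, assuming decimal units (1 TB = 1,000,000 MB) and taking the slide's two daily rates at face value; the variable names are illustrative:

```python
SECONDS_PER_DAY = 24 * 60 * 60

records_per_day = 285e9   # "285 B records per day" from the slide
tb_per_day = 35.0         # "35 TB/day" from the slide

# Convert the daily rates to per-second rates.
records_per_second = records_per_day / SECONDS_PER_DAY
mb_per_second = tb_per_day * 1_000_000 / SECONDS_PER_DAY  # assumes decimal TB

print(f"~{records_per_second / 1e6:.1f} million records/s")
print(f"~{mb_per_second:.0f} MB/s")
```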
III. EDMT – smaller/bigger than 1 PB
EDMT® Solution Models and Specifications
                                                   # emails & files (100KB each)
    Model  Description    # cores  Disk Size (TB)  store + index  index only  DB rows (150-byte)
7   16K    [ 96 racks ]   15,360   41,472          180 B          1,800 B     640 Trillion
6   4K     [ 24 racks ]   3,840    10,368          48 B           480 B       160 Trillion
5   1K     [ 6 racks ]    960      2,592           12 B           120 B       42 Trillion
4   4XL    [ Full rack ]  160      432             2 B            20 B        7 Trillion
3   PB     [ 1/2 rack ]   80       288             1.6 B          16 B        6 Trillion
2   XL     [ 1/3 rack ]   40       144             600 M          6 B         2 Trillion
1   L      [ 1/4 rack ]   24       72              300 M          3 B         1 Trillion
    M      [ 2 RU ]       12       36              150 M          1.5 B       500 B
    S      [ 2 RU ]       6        36              150 M          1.5 B       500 B
    XS     [ 2 RU ]       4        36              150 M          1.5 B       500 B
Model categories (row assignment not recoverable): Online, Mid, Entry
IV. In the Trenches with Big Data
• Big Data Challenges in E-Discovery
– Volume
– Heterogeneity
– Distributed
IV. In the Trenches with Big Data, cont’d
• The EDRM Model and Big Data
– Information Management: The problem of silo-searching. The need for speed, completeness, comprehensiveness.
– Identification: the problems with the traditional “custodian” and “keyword” approaches.
• National Day Laborer Org. v. US, 2012 WL 2878130 (SDNY)
– Preservation and Collection: problems with custodian-based preservation.
IV. In the Trenches with Big Data, cont’d
• Case Study: Cross-Analysis of Structured and Unstructured Data in the OnlineNIC Trilogy
• Facts: Yahoo, Microsoft and Verizon sued OnlineNIC, one of the largest domain name registrars in the world, for cybersquatting.
• Alleged that defendant was using aliases to register infringing domains.
• Defense used BMMsoft technology to collect and cross-correlate registration database and emails to show that the registrants were not affiliated with defendant
• Yahoo and Verizon cases settled, Microsoft case was
Conclusion
1. Big Data is real.
2. Big Data poses unique issues for E-Discovery
3. Technology solutions need to be big, and unify structured and unstructured data into a
EDMT extracts benefits from Big Data
EDMT advantages:
• Scalability & Performance
• Enterprise Ready
• Fast deployment, Easy maintenance
Use cases:
• FLAC (Fraud, Legal, Audit, Compliance)
• BI & DW: Better Customer Perspective
• IT: Operational Efficiency