The Truth About Predictive Coding:
Getting Beyond The Hype
David R. Cohen Reed Smith LLP
Records & E-Discovery Practice Group Leader
David leads a group of more than 100 lawyers in his role as Practice Group Leader of Reed Smith’s Records & E-Discovery group. He serves as e-discovery counsel for multiple companies and also counsels clients on records management and litigation readiness issues. David has been named a “Pennsylvania Superlawyer” in litigation and is Chambers-ranked nationally and internationally in the area of e-discovery. He is a frequent author and trains judges, mediators and lawyers in e-discovery issues. He has also been a court-appointed E-Discovery Special Master in multiple cases.
Bryon Z. Bratcher Reed Smith LLP
Director of Litigation Technology Services
Bryon directs Reed Smith’s global team of 25 Litigation Technology Analysts, drawing on more than a dozen years of experience in technology services for Am Law 100 firms. He assisted with the selection and implementation of the firm’s technology-assisted review tools, which he now manages, and in 2014 was named a winner of The Recorder’s Law Firm Innovator award for co-developing Reed
Mark E. Harrington Guidance Software
Senior Vice President, General Counsel & Corporate Secretary
© 2014 kCura. All rights reserved.
Agenda
• What is Predictive Coding?
• Why Predictive Coding?
• How Accurate is Human v. Predictive Coding?
• Barriers to Use of Predictive Coding
• Case Studies
• Current “Hot” Issues in Predictive Coding
• Takeaways
What is Predictive Coding?
• a.k.a. “TAR,” a.k.a. “CAR,” a.k.a. “RAR”
• Machine learning algorithms and statistical probability tools used to duplicate human decision making
• Software determines relevance after training by human reviewer
• Computer identifies properties to predict future coding
• Process continues until accuracy levels reach stability
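The "continues until accuracy levels reach stability" step can be read as a stopping rule applied across training rounds. A minimal sketch, assuming that stability means the overturn rate (the share of sampled machine calls a human reviewer reverses) changes by less than a chosen tolerance over consecutive rounds. The function name, the tolerance, and the sample rates are hypothetical and are not taken from any particular review tool:

```python
def rounds_until_stable(overturn_rates, tolerance=0.01, window=2):
    """Return the 1-based round at which training could stop, or None.

    overturn_rates: per-round share of sampled documents whose machine
    coding was overturned by a human reviewer (hypothetical inputs).
    Stable means the last `window` round-to-round changes are all
    smaller than `tolerance`.
    """
    for end in range(window + 1, len(overturn_rates) + 1):
        recent = overturn_rates[end - window - 1:end]
        changes = [abs(b - a) for a, b in zip(recent, recent[1:])]
        if all(change < tolerance for change in changes):
            return end
    return None


# Illustrative rates only: large early swings, then a plateau.
rates = [0.30, 0.18, 0.11, 0.105, 0.102]
print(rounds_until_stable(rates))
```

A real protocol would fix the tolerance and the sampling method in advance, ideally in the agreed ESI protocol, rather than choosing them after seeing the results.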
Technology-Assisted Review Reference Model
Courtesy of: EDRM.net
Workflow Overview
[Flowchart: Total Number of Documents, Seed Set for Human Review, Results from Categorization, QC of 1st Round (Statistical Sample), 2nd Round of Categorization, QC of 2nd Round (Statistical Sample), with a loop back when Validation Criteria Not Met. Reports: Training Round Overturn Report, QC Round Overturn Report. Figures shown: 2,000,000 Documents; 10,000 Uncategorized; 2,000; 3,068; 3,068; Responsive 596,400 / Non Responsive 1,391,600; Responsive 635,178 / Non Responsive 1,349,754.]
The Numbers Behind the Statistics
[Chart: Sample Size (scale 0 to 3,000) plotted against Document Count at Confidence: 95%, with series for margins of error of +/- 2.0, +/- 2.5, and +/- 5.0, plus a logarithmic trend line for +/- 2.0.]
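The sample sizes behind a chart like this usually come from the standard formula for estimating a proportion. A minimal sketch, assuming the textbook normal approximation with the most conservative proportion (p = 0.5) and an optional finite population correction. The function name and defaults are illustrative, and the slide does not say which formula the presenters used:

```python
import math


def sample_size(margin, z=1.96, p=0.5, population=None):
    """Sample size needed to estimate a proportion.

    margin: margin of error as a fraction (0.02 means +/- 2.0%)
    z: z-score for the confidence level (1.96 is roughly 95%)
    p: assumed proportion; 0.5 gives the largest (safest) sample
    population: if given, apply the finite population correction
    """
    n = (z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    # Round first so floating-point noise cannot push a whole number up by one.
    return math.ceil(round(n, 6))


for margin in (0.02, 0.025, 0.05):
    print(margin, sample_size(margin), sample_size(margin, population=2_000_000))
```

Under these assumptions the required sample barely moves as the document count grows, which is why a sample of a few thousand documents can support conclusions about a population of millions.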
Why Predictive Coding?
• Cost savings
• Time savings
• Reduced risk of errors (?)
• Greater objectivity in classifications
• Sometimes volume of documents and/or value of case makes human review impractical
Universe of Available Documents
[Diagram sequence: the Universe of Available Documents contains a set of Relevant Documents. Technology Assisted Review picks out a set of Documents Selected that overlaps the relevant set only in part. Irrelevant Documents Mistakenly Selected (Poor Precision); Relevant Documents Mistakenly Missed (Poor Recall).]
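Precision and recall, the two measures named in the diagram, can be computed directly from the set of documents selected and the set of documents that are actually relevant. A minimal sketch with made-up document IDs; the function name and data are illustrative only:

```python
def precision_recall(selected, relevant):
    """Return (precision, recall) for a set of selected documents.

    precision: share of selected documents that are relevant
    recall: share of relevant documents that were selected
    """
    selected, relevant = set(selected), set(relevant)
    true_positives = selected & relevant
    precision = len(true_positives) / len(selected) if selected else 0.0
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs: 5 relevant documents, 4 selected, 3 in common.
relevant_docs = {1, 2, 3, 4, 5}
selected_docs = {3, 4, 5, 6}
precision, recall = precision_recall(selected_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case one irrelevant document was selected (hurting precision) and two relevant documents were missed (hurting recall). In practice the full relevant set is unknown, so both measures are estimated from samples.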
Myth #1
Computer Review Will Never Be As Accurate as Human Review
Da Silva Moore v. Publicis Groupe & MSL Group
287 F.R.D. 182 (S.D.N.Y. 2012)
Magistrate Judge Andrew J. Peck:
“…while some lawyers still consider manual
review to be the ‘gold standard,’ that is a myth, as
statistics clearly show that computerized searches
are at least as accurate, if not more so, than
Da Silva Moore v. Publicis Groupe & MSL Group
287 F.R.D. 182 (S.D.N.Y. 2012)
• Predictive Coding Was Appropriate Because:
• Parties Agreed
• Over 3 Million Documents
• Cost Effectiveness & Proportionality
• Transparent Process Proposed
• Spawned Huge Battle Over Protocol & Ultimate Motion to Recuse
Da Silva Moore v. Publicis Groupe & MSL Group
287 F.R.D. 182 (S.D.N.Y. 2012)
District Judge Approved Judge Peck’s Proposal:
• The “ESI protocol contains standards for measuring the reliability of the process and the protocol builds in levels of participation by Plaintiffs. It provides that the search methods will be carefully crafted and tested for quality assurance, with Plaintiffs participating in their
“While this Court recognizes
that computer-assisted review is not
perfect, the Federal Rules of Civil
Procedure do not require perfection.”
Magistrate Judge Andrew Peck
How Accurate is Human Coding?
• Average recall: Computer 77%, Humans 60%
• “The myth that exhaustive manual review is the most effective… approach to document review is strongly refuted. Technology-assisted review can (and does) yield more accurate results than exhaustive manual review, with much lower effort.”
• “Technology-assisted reviews require…human review of only 1.9% of the documents, a fifty-fold savings over exhaustive manual review.”
Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, Maura R. Grossman & Gordon V. Cormack, XVII Richmond Journal of Law and Technology 11 (2011)
How Accurate is Human Coding?
Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review, Herbert L. Roitblat et al., 61 Journal of American Society for Information Science and Technology 70 (2010)
• Performance of two computer systems was at least as accurate (measured against the original review) as that of human re-review
• Level of agreement among human reviewers: 70-75%
How Accurate is Human Coding?
Faster, better, cheaper legal document review, pipe dream or reality? Thomas I. Barnett and Svetlana Godjevac, Autonomy, Inc. (2011)
• Responsiveness rates of review groups ranged from 23% to 54%
• Unanimity of agreement less than half of the time
• 28,209 documents reviewed by 7 different reviewer groups (5 document review vendors and 2 law firms)
Look, the computer did as well as the humans!
“Using search terms is so last decade.”
- Judge Shira Scheindlin
Myth #2
Barriers to Use of Predictive Coding
• Not viable for cases with fewer than 10,000-20,000 documents requiring review
• Limited potential cost savings (e.g. not reliable for privilege)
• Risk of not getting opposing counsel agreement
• Time and expertise required to train computer
• Multiple case problem
• Unsympathetic judges/discovery masters
• Danger of losing keyword filtering
Kleen Products LLC v. Packaging Corp. of Am.,
2012 WL 4498465 (N.D. Ill. Sept. 28, 2012)
• Plaintiffs requested court approval of predictive coding, defendant opposed
• Massive briefing and several days of hearings
• Plaintiffs ultimately withdrew request as to current production requests
• Parties agreed to meet and confer regarding the search methodology for future production requests
Kleen Products LLC v. Packaging Corp. of Am.,
2012 WL 4498465 (N.D. Ill. Sept. 28, 2012)
STIPULATION & ORDER RELATING TO ESI SEARCH
“As to any … ESI beyond the First Request…, plaintiffs will not argue …that defendants should be required to use… “predictive coding” methodology...
“With respect to any requests for production… beyond the First Request Corpus, the parties will meet and confer regarding the
appropriate search methodology to be used for such newly collected documents. If the parties fail to agree on a search methodology,
Myth #3
Rio Tinto PLC v. Vale S.A.
14 Civ. 3042 (RMB) (AJP) (S.D.N.Y. March 2, 2015)
Magistrate Judge Andrew Peck, revisiting his landmark decision in Da Silva Moore three years later:
“the case law has developed to the point that it is now black letter law that where the producing party wants to utilize TAR for document review, courts will permit it”
Rio Tinto PLC v. Vale S.A.
14 Civ. 3042 (RMB) (AJP) (S.D.N.Y. March 2, 2015)
Observes that “one TAR issue that remains open is how transparent and cooperative the parties need to be with respect to the seed or training set(s).”
In the absence of transparency, statistical estimation of recall and general quality control sampling can still be used to verify appropriate training of the software and secure satisfactory review outcomes
“Black Letter Law”?
A case law search for “predictive w/2 coding” returns 35 cases:
• 12 positive references, in commentary or tone
• 18 neutral references
• Often judicial approval of proposed ESI protocols
• 4 that utilized the term in a non-ESI context
Still gaining acceptance and momentum
Global Aerospace Inc. v. Landow Aviation, L.P.,
2012 WL 1431215 (Va. Cir. Ct. April 23, 2012)
• Defendants requested permission to use predictive coding
• Plaintiffs opposed the request
• Order issued approving the use of predictive coding
• Work now concluded
Global Aerospace Inc. v. Landow Aviation, L.P.,
2012 WL 1431215 (Va. Cir. Ct. April 23, 2012)
• 1.3 million docs after deduplication, 5,000 seeded
• Predictive coding identified 173,000 relevant docs
• 400 doc sample showed 80% precision
• Sample of 1.1 million “irrelevant” documents showed 2.9% relevant
• 31,000 missed relevant (over 80% recall)
• Time: 7 months/Cost: $200,000
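The "over 80% recall" figure can be approximated from the other numbers on this slide. A rough sketch, assuming recall is estimated as relevant documents found divided by relevant documents found plus relevant documents missed, with "found" taken as the 173,000 selected documents times the 80% sampled precision and "missed" taken as the 1.1 million discarded documents times the 2.9% sampled rate. The function name is hypothetical, the inputs are the slide's rounded figures, and the parties' actual calculation may have differed:

```python
def estimate_recall(selected, precision, discarded, elusion_rate):
    """Estimate recall from sampled precision and a discard-pile sample.

    selected: documents the tool marked relevant
    precision: sampled share of selected documents that are truly relevant
    discarded: documents the tool marked irrelevant
    elusion_rate: sampled share of discarded documents that are relevant
    """
    found = selected * precision        # relevant documents in the selected set
    missed = discarded * elusion_rate   # relevant documents left behind
    return found, missed, found / (found + missed)


found, missed, recall = estimate_recall(173_000, 0.80, 1_100_000, 0.029)
print(f"found={found:,.0f} missed={missed:,.0f} recall={recall:.1%}")
```

With these rounded inputs the missed count comes out near 31,900 rather than the slide's 31,000, presumably because the slide rounds the underlying counts. Either figure puts estimated recall slightly above 80%.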
In re: Biomet M2a Magnum Hip Implant
Products Liability Litigation
Cause No. 3:12-MD-2391 (N.D. Ind., South Bend Div., April 18, 2013)
• Defendant Biomet used combination of electronic search functions to identify relevant documents
• Beginning universe was 19.5 million documents
• Used keyword culling and deduplication
• Reduced to 2.5 million
• Then employed predictive coding on those 2.5 million
In re: Biomet M2a Magnum Hip Implant
Products Liability Litigation
Cause No. 3:12-MD-2391 (N.D. Ind., South Bend Div., April 18, 2013)
Plaintiffs objected to this procedure and requested that Biomet start over:
• Wanted Defendants to use predictive coding on all 19.5 million documents, with Plaintiffs and Defendants jointly training the software
Biomet Resolution
• Court held that Biomet’s methodology satisfied its obligations under F.R.C.P. 26(b)(2)(C)
• Likely benefits of going back to the 19.5 million document set would not outweigh burden and expense
• Assumed Biomet will remain open to “additional reasonably targeted search terms…”
• If Plaintiffs wish to restart predictive coding process, Plaintiffs must bear the expense
Progressive Casualty Insurance Co. v. Delaney
2014 WL 2112927 (D.Nev. May 20, 2014)
Court approved a Joint ESI Protocol under which:
• Parties mutually agreed to search terms for universe of collected documents
• Progressive had option to produce all non-privileged documents:
• Captured by the agreed search terms; or
• Captured by the agreed search terms responsive to the
Progressive Casualty Insurance Co. v. Delaney
2014 WL 2112927 (D.Nev. May 20, 2014)
• Progressive advised it would produce all docs Sept. – Oct. 2013
• Progressive produced nothing in six months
• Collected 1.8 million ESI docs, culled to 565,000 using search terms
• Began to review manually
• After review began, determined manual review was too time intensive and expensive
• Without informing Defendants or Court, used predictive coding to review only the 565,000
Progressive Casualty Insurance Co. v. Delaney
2014 WL 2112927 (D.Nev. May 20, 2014)
• “Many…have argued persuasively that the traditional ways lawyers have culled the …documents for production—manual human review, or keyword searches—are ineffective tools to cull responsive ESI in discovery.
• Predictive coding has emerged as a far more accurate means of producing responsive ESI in discovery. Studies show it is far more accurate than human review or keyword searches which have their own limitations.”
Progressive Casualty Insurance Co. v. Delaney
2014 WL 2112927 (D.Nev. May 20, 2014)
“Progressive is unwilling to engage in the type of cooperation and transparency that …is needed for a predictive coding protocol to be accepted by the court or opposing counsel as a reasonable method to search for and produce responsive ESI. Progressive is also unwilling to apply the predictive coding method it selected to the universe of ESI collected. The method described does not comply with all of Equivio's recommended best practices.”
Progressive Casualty Insurance Co. v. Delaney
• “Had the parties…agreed at the onset of this case to a predictive coding based ESI protocol, the court would not hesitate to approve a transparent mutually agreed upon ESI protocol.”
• Ordered Progressive to produce the 565,000 “hit” documents culled from the use of the search terms, subject to privilege filters, the clawback provisions of FRCP 26(b)(5)(B), and FRE 502(d) and the existing ESI protocol.
Case Study #1: Product Liability Case
• 3.5 million documents in Relativity
• Approximately 2 million had been reviewed
• Approximately an equal number of responsive vs. non-responsive documents
• Approximately 40 reviewers on case
• Limited potential cost savings
• Difficult plaintiff’s counsel
• MDL + numerous state cases
• Unsympathetic judges/discovery masters
• Danger of losing keyword filtering
How Could Predictive Coding Be Used?
• Accelerate the human review and improve our QC
• We could use predictive coding to accelerate the review, and check the human review
• It was impractical to use predictive coding as a substitute for human review in this case
Case Study #1: Cost Analysis
                 Docs/Hour   Cost/Hour   Total Records   Total Cost
Current          50          $39.50      2,000,000       $1,580,000
Cost Tier 1      44          $39.50      500,000         $448,863
Cost Tier 2      57          $39.50      1,200,000       $831,578
Cost Tier 3      80          $39.50      300,000         $148,125
TOTAL                                                    $1,428,566
Review Savings                                           $151,434
Analytics Cost                                           $60,000
Total Savings                                            $91,434
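The figures in this cost analysis follow from one formula: records divided by documents per hour, times the hourly rate. A minimal sketch that reproduces the arithmetic from the slide's own inputs; the function and variable names are illustrative:

```python
HOURLY_RATE = 39.50       # reviewer cost per hour, from the slide
ANALYTICS_COST = 60_000   # analytics cost, from the slide


def review_cost(records, docs_per_hour, rate=HOURLY_RATE):
    """Cost of reviewing `records` documents at a given review speed."""
    return records / docs_per_hour * rate


current = review_cost(2_000_000, 50)

# (records, docs per hour) for Cost Tier 1, 2, and 3
tiers = [(500_000, 44), (1_200_000, 57), (300_000, 80)]
tiered = sum(review_cost(records, speed) for records, speed in tiers)

review_savings = current - tiered
net_savings = review_savings - ANALYTICS_COST

print(f"current  ${current:,.0f}")
print(f"tiered   ${tiered:,.0f}")
print(f"savings  ${review_savings:,.0f}")
print(f"net      ${net_savings:,.0f}")
```

The computed totals differ from the slide by a dollar or two because the slide appears to drop the cents from each tier before summing.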
Case Study #2
• Client spinning off a division to become separate company
• Wants former employees to still access old e-mail
• Wishes to remove privileged documents from set to avoid waiver
• Perfection not required – not an adversarial situation but needs defensible process
Case Study #2
• Total volume: Approximately 200,000 documents
• Document-by-document review and privilege determinations could cost up to $2 per document
Case Study #2: Our Recommendations
• We recommended:
• search term filtering
• followed by sampling and
• predictive coding to identify and remove privileged documents
• Set budget of $30,000
Case Study #2: Our Process
• Following initial filtering, two experienced reviewers sampled “hits” and “misses” and adjusted filter terms to fine-tune filtering
• Reviewers then “trained” software on selected samples of the remaining “hits”
• Analytics accurately identified remaining documents most likely to be privileged
• Those results were then used for two additional iterations of filter “fine-tuning”
Case Study #2: Results
• We were left with a document population containing a negligible number of privileged documents to make available to ex-employees
• Filtering was not perfect, but even human filtering is never perfect
• Client saved over 90% of the review costs, amounting to several hundred thousand dollars
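The "over 90%" savings claim is consistent with the Case Study #2 figures given earlier. A back-of-the-envelope sketch, assuming the final spend equaled the $30,000 budget (the slides state the budget, not the actual spend) and taking the upper-bound manual cost of $2 per document across roughly 200,000 documents; the variable names are illustrative:

```python
DOCUMENTS = 200_000        # approximate total volume, from the slide
MANUAL_COST_PER_DOC = 2.0  # upper-bound manual review cost, from the slide
BUDGET = 30_000            # budget from the slide; assumed here to equal actual spend

manual_cost = DOCUMENTS * MANUAL_COST_PER_DOC
savings = manual_cost - BUDGET
savings_share = savings / manual_cost

print(f"manual ${manual_cost:,.0f}, saved ${savings:,.0f} ({savings_share:.1%})")
```

Under those assumptions the saving is about $370,000, which matches "several hundred thousand dollars".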
Current “Hot” Issues in Predictive Coding
• Do parties have to give advance notice and/or obtain consent from adversaries or the court?
• Should courts allow predictive coding where opposing parties don’t consent?
• Is it okay to run keywords before starting the predictive coding?
• Should parties share their “seed sets” with opposing counsel, including irrelevant docs?
• What workflows are allowable or best?
Takeaways
• Predictive coding is gaining acceptance by courts and will be used increasingly, with or without opposing party notice and/or consent
• Practical considerations continue to rule out primary reliance on predictive coding for many reviews
• Even when not replacing human review, predictive coding can still be useful for many purposes
• Non-adversary review situations
• Accelerating human review
• Improving quality control
• Finding key documents sooner
Questions?
David R. Cohen, Practice Group Leader, 412-288-1098
Bryon Z. Bratcher, Director, 415-659-5948
Mark E. Harrington, Senior Vice President, 626-229-9191 x4660