READY FOR THE MATRIX? MAN VERSUS MACHINE


1 | Page www.cobralegalsolutions.com


by Laura Ewing-Pearle, CEDS, Assistant Director, Client Services, Cobra Legal Solutions

In a 2014 order, Judge Denise Cote gave predictive coding vendors a Valentine’s Day present by writing: “predictive coding had a better track record in the production of responsive documents than human review” [i].

She was quoting the 2011 article by Maura R. Grossman & Gordon V. Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, but she had signaled her beliefs much earlier in the case, during a telephone conference in August 2012: “I think there's every reason to believe that, if it's [predictive coding] done correctly, it may be more reliable -- not just as reliable but more reliable than manual review, and certainly more cost effective” [ii]. At the end of last year, Judge John Copenhaver even opened the door to computer review for privilege [iii].

Is predictive coding making human review obsolete? Should we endorse the view of Agent Smith from The Matrix: “Never send a human to do a machine's job”?

With the confusing mingling of TAR, CAR, and predictive coding, perhaps a few definitions are in order. The idea of Technology Assisted Review is not new; technology can “assist” by searching for key words, clustering documents based on similar concepts, grouping documents based on a percentage of near duplication, and more. Predictive Coding (and usually “Computer Assisted Review”) takes the concept one step further: the computers actually code documents, based on an algorithm, semantic indexing, or some other form of iterative learning [a]. Predictive coding is not, however, one set process: even experts disagree on the determination of seed sets (random, judgmental, mix), the layering of search terms, and the best/most accurate analytic and coding methodology. This article will not attempt to delve into the details of the processes; rather we will discuss the concept raised by Judge Cote, Maura Grossman, and others: Are machines always better at reviewing and coding documents? Or are humans superior to computer decision-making in many current circumstances?

[a] Even iterative learning can be further delineated into “continuous active learning” and “simple passive learning”.

The general consensus is that predictive coding saves review time, and therefore money, by eliminating the need to review non-responsive documents – and in many cases this is true. The process starts when a subject matter expert, or SME, codes a seed set of documents as either responsive or non-responsive. The SME is usually described as a member of outside counsel who has already interviewed multiple custodians and is intimately familiar with the issues of the case [b].

Because most predictive coding technology at this point is basically a binary decision tree [c], the SME is not coding for issues or privilege at the same time; rather the SME is coding only for inclusion and exclusion based on responsiveness. Most tools also allow the SME to highlight sections of the relevant documents that will help the computer define “responsiveness”. Generally, if you are aiming for a confidence level of 95%, with a confidence interval of 2.5 (that is, a margin of error of plus or minus 2.5 percentage points), your sample set will be around 1500 documents, whether your total population of documents is 10,000,000, or 1,000,000 or 100,000 [iv].
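The sample sizes quoted above are consistent with the standard formula for estimating a proportion. The sketch below is a minimal illustration, not necessarily the method used by the calculator cited in the text: it assumes the “confidence interval” means a margin of error in percentage points, a worst-case proportion of 0.5, and a finite population correction. The function name and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin_of_error=0.025, proportion=0.5):
    """Estimate how many documents to sample (illustrative helper, not from any review tool)."""
    # Two-sided z-score for the requested confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    # Finite population correction: shrinks the sample slightly for smaller collections.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


for docs in (10_000_000, 1_000_000, 100_000):
    print(f"{docs:>10,} documents -> sample of {sample_size(docs)}")
```

Under these assumptions the result stays close to 1,500 for all three population sizes. The size of the collection barely matters; the margin of error is the variable that drives the sample size.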

Change one variable slightly – say, narrow the confidence interval to 2 – and your seed set for one million documents jumps to 2400. With a judgmental set, a SME can “plant” documents known to be responsive into the seed set to ensure the correct documents are found. No matter how the sample size is determined, and no matter which type of seed set is chosen, the seed set documents need to be reviewed – and possibly produced [d]. Once the seed set is coded, the computers apply their logic, and the population of documents is now divided into three sets: documents that the computers have coded responsive, documents coded non-responsive, and documents which the computer could not code based on available information – the “unknowns”. To check the quality of the computer’s work, and to help the computer learn so that it can code the unknown documents, the SME now reviews a new sample set. This iterative process can take as few as three generations or as many as forty-five. Obviously, if you are using outside counsel at $350/hour as the one SME to code ten iterations of samples, you may not be saving as much money, but think of the savings if a SME only reviews 6,000 documents and the computer eliminates the review of 400,000 non-responsive documents. If Grossman and Cormack are correct and the computer is more accurate as well, bonus. According to their report, predictive coding has a 67 – 86% accuracy rate versus 25 – 80% for human review.
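The three-way split described above can be pictured as thresholding a per-document score. The sketch below is only an assumption about how such a split might work: the 0-to-1 score scale, the two cutoff values, and the function name are hypothetical, and real predictive coding tools may divide the population quite differently.

```python
def triage(scores, responsive_cutoff=0.8, non_responsive_cutoff=0.2):
    """Split documents into three sets by score (hypothetical cutoffs on a 0-to-1 scale)."""
    buckets = {"responsive": [], "non_responsive": [], "unknown": []}
    for doc_id, score in scores.items():
        if score >= responsive_cutoff:
            buckets["responsive"].append(doc_id)
        elif score <= non_responsive_cutoff:
            buckets["non_responsive"].append(doc_id)
        else:
            # Not enough signal either way: left for the next SME sample.
            buckets["unknown"].append(doc_id)
    return buckets


# Made-up scores for illustration only.
example_scores = {"DOC-001": 0.93, "DOC-002": 0.07, "DOC-003": 0.55}
result = triage(example_scores)
print(result)
```

In this picture, each new round of SME coding would ideally push more documents out of the “unknown” set and into one of the other two.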

But is that always true? Ignore for the moment whether the low rates for human review were based on accurate studies (and Ralph Losey has an excellent article about this topic). Have the analysts been examining any of the advantages of human review? A few points to consider:

[b] Note: A few predictive coding bloggers are starting to assert that a team of reviewers can code the seed set as accurately as one SME.

[c] A notable exception is XERA’s Predictive Review, which allows for simultaneous issue tagging.

[d] Several judges recommend or enforce producing all non-privileged documents from the seed set, even non-responsive documents, in order to determine if the entire process will be tainted. A recent order by Judge Brown calling for disclosure and transparency, prompted by miscoding of non-responsive seed documents, can be found in Bridgestone Americas v IBM, 3:13-cv-01196 (M.D. TN), Order Filed February 5, 2015.


1. Predictive coding algorithms need text to analyze for content and, to a more limited extent, context. Ergo, documents with limited text are either intentionally omitted from decision sets or fall into “unknown”. This includes a plethora of electronic documents used in the course of business: CAD drawings, Excel and financial spreadsheets, and Visio diagrams are just a few.

2. Related to the above are image-based documents (jpg, png, bmp, gif, etc.) as well as documents containing images and limited text (PowerPoints, Word documents that use “Smart Art”, and more). Even emails can fall into this category given the ubiquitous use of photos and Google Images. Let’s say you have an employment case in which an employee’s antagonism towards her boss is a key issue. The SME codes the few documents with “My boss is evil” as responsive, and adds a few created documents to a judgmental set with words like “anger”. How will the computer handle the email below? Whether using semantic indexing or algorithms, machines cannot read these images or read sarcasm, missing the malicious intent:


3. Depending on your platform, metadata is not always included in the computer analysis of a document. How important is this? We’ve all worked cases in which certain emails from “Sally Fields” to “Tom Hanks” are considered responsive, even if they only say “How’s the weather?” If a predictive coding tool does not or cannot search/analyze metadata, these messages either wind up in the large “unknown” bucket or get tossed into Non-Responsive.

4. We h8 #SocialMedia; it’s a pain in the YKW. Social media, text messages, and instant messaging are the new sources of relevant data, and all are replete with misspelled words and odd acronyms so people can share posts that would otherwise be NSFW. BTW, if u dk this, ask yr kids. Issues in this arena are compounded by the fact that punctuation is rarely indexed, making # or #(%* impossible to read.

How does predictive coding work with the following?

5. Time and money savings are not immediate. In the bundled Federal Housing cases against the banks, the FHFA argued that they had concerns about meeting the deadlines because of the “testing and retesting” needed, and added, “again, the court in Da Silva Moore recognized that predictive coding may require extensions of the discovery period because it's impossible to predict when the program will be sufficiently trained” [v]. For cases over one million documents, the time taken to train a tool can pay off down the road. Judge Peck more recently noted that “fear of spending more in motion practice than the savings from using TAR” [vi] can be a discouraging factor in using this technology. Indeed, the Legal Intelligencer posited in January 2015 that “expense and time” could actually be barriers to predictive coding, stating: “Where no search terms are applied prior to predictive coding, the volume of responsive documents identified by the predictive coding engine could approach or exceed the volume from a keyword narrowed universe.” [vii] Smaller cases can benefit from concept-clustering and bulk-coding documents non-responsive based on concepts, domain names, or other facets – achieving the same results without the time and expense of training a tool.

6. Receiving reimbursement for technology costs under §1920 is much more difficult than receiving reimbursement for attorneys’ fees. (See Cobra’s white paper at http://www.cobralegalsolutions.com/pdf/Section_1920_Blues.pdf.)
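Points 3 and 4 in the list above can be made concrete with a small sketch. It is an illustration under stated assumptions, not a description of any review platform: the toy indexer keeps only word characters, the email records are invented, and the function names and the field names “from”, “to”, and “body” are hypothetical.

```python
import re


def index_terms(text):
    """Toy indexer: keeps runs of letters, digits and underscores, drops everything else."""
    return [term.lower() for term in re.findall(r"\w+", text)]


def flag_by_participants(emails, watched_pairs):
    """Return emails whose sender/recipient pair is watched, whatever the body says."""
    return [
        email for email in emails
        if (email["from"].lower(), email["to"].lower()) in watched_pairs
    ]


# Point 4: the hashtag marker and the symbol-only string vanish from the index.
hashtag_terms = index_terms("We h8 #SocialMedia")
symbol_terms = index_terms("#(%*")
print(hashtag_terms, symbol_terms)

# Point 3: a metadata rule catches a message that text analysis alone would miss.
emails = [
    {"from": "Sally Fields", "to": "Tom Hanks", "body": "How's the weather?"},
    {"from": "Sally Fields", "to": "Someone Else", "body": "How's the weather?"},
]
hits = flag_by_participants(emails, {("sally fields", "tom hanks")})
print(len(hits))
```

The two emails have identical bodies, so only the sender and recipient fields distinguish the one a reviewer would treat as responsive. Likewise, once punctuation is dropped, “#SocialMedia” is indexed as an ordinary word and a string of symbols leaves nothing to search.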

In short, while predictive coding seems to be the future, many documents still need human review in 2015. As of now, Agent Smith’s assertion that computers are “the cure” seems premature.

Laura Ewing-Pearle, CEDS

An eDiscovery professional for almost ten years, Laura Ewing-Pearle currently works as Assistant Director – Client Services for Cobra Legal Solutions LLC. A Certified E-Discovery Specialist, Laura provides insight and clarity to clients on complex technical issues. Laura is a veteran of all three sides of the eDiscovery triangle: law firm, corporate client, and vendor. She worked for Nixon Peabody, a Global 100 Firm, and Thelen Reid Brown Raysman & Steiner, where she led eDiscovery efforts for a $200 million insurance case. Upon moving to Texas, Laura managed eDiscovery for Dell Inc.'s litigation team, which involved more than 2 TBs of data in the span of 2.5 years. At Scarab Consulting, she was promoted to Director of Project Management before leaving to start her own consulting business. Laura was the Director of the Austin Chapter of Women in eDiscovery for two years and has presented CLEs on Technology & Ethics in both Texas and Georgia, as well as seminars on “eDiscovery 101” and the role of the eDiscovery paralegal. She studied at Trinity University and graduated magna cum laude from San Francisco State University's ABA paralegal studies program.

[i] Federal Housing Finance Agency v HSBC North America Holdings Inc., et al, 2014 WL 584300, February 14, 2014

[ii] Federal Housing Finance Agency v JPMorgan Chase & Co, Inc., et al, 11-CV-06188-DLC, Conference Filed August 6, 2012

[iii] Good v. American Water Works Co., Inc., 2014 WL 5486827 (S.D.W.Va.) October 29, 2014

[iv] http://www.nss.gov.au/nss/home.nsf/pages/Sample+size+calculator

[v] Federal Housing Finance Agency v. JPMorgan Chase & Co., Inc., et al., 1:11-cv-06188-DLC, S.D. N.Y., Telephone conference of July 24, 2012 (filed 08/06/2012).

[vi] Rio Tinto PLC v Vale S.A., 2015 WL 872294 (S.D.N.Y.) March 2, 2015

[vii] David R. Cohen and Marcin M. Krieger, “Seven Barriers to the Use of Predictive Coding”, The Legal Intelligencer, January 27, 2015, http://www.thelegalintelligencer.com/