© 2010 eDiscovery Institute 2010, all rights reserved.
eDiscovery Institute Survey on Predictive Coding
Released October 1, 2010
The eDiscovery Institute is a 501(c)(3) nonprofit research organization dedicated to identifying and promoting cost-effective methods of processing discovery. More information on the work of the Institute is available at
Foreword: Why A Survey on Predictive Coding?
The largest cost element in the ever-escalating cost of electronic discovery is typically the cost of having teams of lawyers review and select records for production or privilege. Those costs can be especially staggering if the lawyers are reviewing every record that is produced. Predictive coding is a process in which review decisions from examining sample records are propagated or extended by the use of various technologies to records which have not been individually examined.
The producing party may use the suggested evaluations to avoid examining all records, or it can lower costs by triaging the documents, assigning the lower ranking documents to the lowest cost personnel, letting the more expensive resources focus on the records that are most likely to be relevant. Either way, predictive coding can significantly reduce the largest single element of cost in e-discovery.
The survey was undertaken to collect information on technologies or processes that were being used to accomplish predictive coding and to quantify the savings that they were achieving. There is growing recognition that the old brute-force linear review process in which each record is examined is not economically feasible. For example:
Principle 6 of the Sedona Principles (Second Edition), June 2007: “Responding parties are best situated to evaluate the procedures, methodologies, and technologies appropriate for preserving and producing their own electronically stored information.”
Principle 11 of the Sedona Principles (Second Edition), June 2007: “A responding party may satisfy its good faith obligation to preserve and produce relevant electronically stored information by using electronic tools and processes, such as data sampling, searching, or the use of selection criteria, to identify data reasonably likely to contain relevant information.”
Practice Point 1 from The Sedona Conference Best Practices Commentary on the use of Search and Information Retrieval Methods in E-Discovery: “In many settings involving electronically stored information, reliance solely on a manual search process for the purpose of finding responsive documents may be infeasible or unwarranted. In such cases, the use of automated search methods should be viewed as reasonable, valuable, and even necessary.” (Emphasis added.)
Not only is predictive coding less expensive, there is also a growing belief that it is actually superior to linear review in several ways:
• Consistency. Human review is not necessarily the gold standard it is sometimes assumed to be. In a study by the eDiscovery Institute1 and earlier studies by the Text Retrieval Conference (“TREC”),2 two reviewers or teams of reviewers have examined the same records. In these studies, the second review has identified as responsive just 48.8 to 62.0% of the records identified as responsive by the first review. In other words, linear human review is itself quite fallible.
1 “Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review,” Herbert L. Roitblat, Anne Kershaw and Patrick Oot, Journal of the American Society for Information Science and Technology, 61(1):70-80, 2010. Two teams of reviewers examined 5,000 documents that had earlier been examined as part of a response to a US Department of Justice request. Team A identified 48.8% of the records identified as responsive by the original reviewers from the sample and Team B identified 53.9%. Two computer-assisted review systems were also used to review the entire original population. System C identified 45.8% of the documents originally identified as responsive, System D identified 52.7%.
See also, “Automated Document Review Proves Its Reliability,” Anne Kershaw, Digital Discovery & e-Evidence Newsletter, Pike & Fischer, November 2005. It describes a study comparing the performance of a human review team to that of an automated document assessment system in evaluating a sample of 43% of a collection of 48,000 documents. “Relevant” documents were deemed to be those identified by both the system and the humans plus those identified by either the system or humans where subsequent arbitration decided that they were relevant. The system identified more than 95% of relevant records whereas the people identified 51%.
2 “Overview of the TREC 2009 Legal Track,” by Bruce Hedin, Stephen Tomlinson, Jason R. Baron, and Douglas W. Oard, downloaded
• Transparency. When humans review records there is seldom any documentation on why particular records were deemed responsive or not. By contrast, most predictive coding methodologies build an audit trail of what decisions were made and what rules were applied.
• Retroactive Evaluation. Linear review is so expensive that it is rarely feasible to re-examine records that had been reviewed earlier even though the review team may have gained substantial new insight into the issues of the case in the meantime. Not so with some automated review technologies and processes.
• Time. Predictive reviews can greatly speed the time required to produce records, thereby shortening the time required to resolve disputes.
• Confidentiality. Individually reviewing each record requires large review teams; this necessarily exposes confidential information to more risks of unwanted disclosure than would predictive reviews that can process the same volumes with far fewer reviewers.
We hope that the results will inform discussions on what types of pre-production review are legally defensible.
The eDiscovery Institute
Anne Kershaw
President & Co-founder
Joe Howie
Director, Metrics Development and Communications
done in 2008 in which a subset of records that had previously been reviewed in 2006 and 2007 were reviewed again in 2008. According to the Overview, just 62% of documents previously judged relevant were judged relevant again in 2008. The sample consisted of 104 documents that had been previously judged relevant and 120 documents that had previously been deemed not relevant. See discussion at section 4 (Correction to 2008 Assessor Consistency Study) in the Legal09 Overview.
See also, “interassessor consistency data on TREC 06 Legal track ad hoc topics,” by Dave Lewis, downloaded on Sept. 29, 2010 from
topics. That consisted of 25 documents per topic that had been deemed relevant by an assessor – or all of them if there weren’t 25 relevant documents, and enough other nonrelevant documents to bring the total to 50. Due to a glitch, one topic only had 49 records. This sample set was then reviewed by another assessor. The first assessor identified 877 of 1999 documents as relevant. The second assessor identified 470 of those 877 documents, or 58.0%, as being relevant.
Contents
Foreword: Why A Survey on Predictive Coding?
I. Special Thanks
II. Background
III. Overview of Results
IV. Respondents
V. Terminology
VI. Offering & Overall Process
VII. Identifying Like Records
VIII. Email Threading
IX. Paper-Based Records
X. Savings
XI. Pricing/Cost
XII. Incremental Cost of Predictive Coding
XIII. Sample Sizes
XIV. Set Up Efforts
XV. Transparency
XVI. Privilege
XVII. Repeatable Results
XVIII. Elevator Pitch
XIX. Acceptance/Adoption
XX. Type Matters, Size Threshold
XXI. Obstacles to Broader Adoption
XXII. Languages
XXIII. Review Platforms
XXIV. Judicial Review
XXV. Should Have Asked
I. Special Thanks
We believe that the best way to identify and adopt cost-effective ways to process electronic discovery is to have an informed debate on the various options, and we want to thank the companies that provided the responses shown in this report. Many of them provided great insight into how they accomplish predictive coding and what the benefits of this approach are. Kudos to the following companies for stepping up and sharing this valuable information:
• Capital Legal Solutions
• Catalyst Repository Systems, Inc.
• Equivio
• FTI Technology
• Gallivan Gallivan & O’Melia
• Hot Neuron
• InterLegis
• Kroll Ontrack
• Recommind
• Valora Technologies, Inc.
• Xerox Litigation Services
II. Background
The predictive coding survey is the third in a series of surveys by the eDiscovery Institute on technologies or processes that can be used to speed the processing of electronic data while improving the quality of the review. The first showed that proper consolidation of duplicate electronic files could, on average, reduce the volume of records to be reviewed by 38%.3 The second survey showed that grouping emails in threads or conversations could reduce the effort required to review e-mail by an additional 36% on average.4
This survey dealt with “predictive coding,” which we defined as “a combination of technologies and processes in which decisions pertaining to the responsiveness of records gathered or preserved for potential production purposes … are made by having reviewers examine a subset of the collection and having the decisions on those documents propagated to the rest of the collection without reviewers examining each record.”
In May, 2010, invitations to participate in the survey were sent to a number of companies known to be active in the electronic discovery market. Additionally, postings inviting participation were made on a number of forums including the EDDUpdate.com blog, the Lit Support list serve on Yahoo, the eDiscovery group on LinkedIn.com and on LegalORamp.com. Responses were all received by July 1, 2010.
3Report on Kershaw-Howie Survey of E-Discovery Providers Pertaining to Deduping Strategies, available at
This study showed that consolidating duplicates across custodians reduced the volume to be reviewed by 38% on average, with many individual respondents reporting project-level reductions in excess of 70%. Across-custodian deduping reduced the volume of electronic discovery almost twice as much as consolidating only within the records of individual custodians, yet was performed in only half the cases – raising serious ethical considerations which were explored in “Ethics and Ediscovery Review,” published in the ACC Docket, Jan/Feb. 2010. Reprints are available at
4Report on Kershaw-Howie Survey of E-Discovery Providers Pertaining to Threading, available at
III. Overview of Results
This is a summary of the results. Complete responses to the questions are provided later in this report. Some highlights:
• Savings. Respondents reported average savings of 45%, with an average maximum observed savings of 71% and an average minimum observed savings of 23%. Individual respondents reported savings as high as 80 to 95% and even 100%, and minimum savings as low as 0%, on individual projects.
• Obstacles to Implementation. The respondents felt that the largest obstacle to more widespread use of predictive coding was uncertainty over judicial acceptance of that approach. The next closest obstacle was lack of awareness of options on the part of in-house counsel, followed closely by insensitivity to the cost of inefficiencies by law firms.
• General Approach. The respondents varied in their approach to predictive coding. Most respondents used some form of queries combined with document clustering.
• Non-Binary Process. In describing their responses, several of the respondents noted that predictive coding is non-binary in nature, i.e. documents are ranked according to how closely they match previously examined records. In other words there is a continuum and the review team has to select what the cutoff point is.
• Terminology. Almost all of the respondents thought there was a better generic term than “predictive coding.” Suggestions included:
    • Automated Document Classification
    • Automatic Categorization
    • Predictive Categorization
    • Predictive Ranking
    • Prognostic Document Profiling
    • Propagated Coding
    • Relevance Assessment
    • Replicated Coding
    • Suggested coding
• Pricing Models. The respondents offered a variety of pricing models, including per GB pre-culling, per GB post-culling, hourly fees, and flat per case fees.
• Sampling. Most respondents used some form of statistical sampling.
• Transparency. Most of the respondents provide an audit trail of what decisions were made.
• Replicability. Most of the respondents indicated that a second analysis using the audit trail from the first analysis would produce the same results.
• Adoption Rate. There were not enough responses in this area to provide metrics on the rate of adoption.
• Maturity of Offerings. Predictive coding as an offering is far more recent than deduping or email threading. Many of the respondents have added predictive coding in the last two years.
• Email Threading. Most respondents were able to process emails either individually or grouped in threads.
• Paper records. All the respondents included scanned and OCR’d paper records in with electronic records for predictive coding purposes.
• Languages. All the respondents could handle English, French, German and Spanish but there were a few who could not handle Chinese, Japanese, Korean or Arabic.
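The "Non-Binary Process" point above can be illustrated with a minimal sketch: each document carries a score, the documents are ranked, and the review team picks the cutoff. The document IDs, scores, and the function name split_by_cutoff are hypothetical and are not taken from any respondent's product.

```python
from typing import Dict, List, Tuple


def split_by_cutoff(scores: Dict[str, float], cutoff: float) -> Tuple[List[str], List[str]]:
    """Rank documents by score (highest first) and split them at the cutoff."""
    ranked = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    to_review = [d for d in ranked if scores[d] >= cutoff]
    set_aside = [d for d in ranked if scores[d] < cutoff]
    return to_review, set_aside


# Hypothetical scores in [0, 1] expressing how closely each document
# matches previously examined records.
scores = {"doc-a": 0.92, "doc-b": 0.15, "doc-c": 0.58, "doc-d": 0.40}

to_review, set_aside = split_by_cutoff(scores, cutoff=0.5)
print(to_review)   # ['doc-a', 'doc-c']
print(set_aside)   # ['doc-d', 'doc-b']
```

Lowering the cutoff moves documents from the set-aside group into the review group, which is the continuum the respondents describe.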
IV. Respondents
The following companies provided responses to the survey.
Company | Contact Person | Involvement w/ Predictive Coding
Capital Legal Solutions | Gregory Brooks, VP Information Technology | Developed own predictive coding software; provide software & hosted review
Catalyst Repository Systems, Inc. | John Tredennick, CEO, 303-824-0840 | Developed own predictive coding software; developed methodology within
Equivio | Warwick Sharp, VP Marketing and Business Development, 800-851-1965, 206-373-6521 | Developed own predictive coding software
FTI Technology | Kate Holmes, Director, Corporate Communications | Developed own predictive coding software; we provide both software and hosted review.
Gallivan Gallivan & O’Melia | Daniel Gallivan, 206.654.1441 | Integrated others' predictive coding
Hot Neuron, LLC | Bill Dimm, CEO, 610-581-7702 | Developed own predictive coding software
InterLegis | Kevin Carr, President, 214-468-8800 x205 | Developed own predictive coding software
Kroll Ontrack | Jamie Ritter, Document Review Manager, 952-906-4857 | Developed own predictive coding software
Recommind | Chris Hutcheson, Marketing Director | Developed own predictive coding software; provide hosting and software
Valora Technologies, Inc. | Sandra Serkes, President & CEO, 781.229.2265 | Combo software provider & services provider.
Xerox Litigation Services | Karen Miller, Director of Marketing, 212.337.5449 | Developed own predictive coding software
V. Terminology
The survey asked, “If you think there is a better generic term than ‘predictive coding,’ what would it be?” and “Why?” These were the responses:
Company | Better Term | Why
Capital Legal Solutions | Prognostic Document Profiling | The prognostic and iterative content categorization can play a broader part than simply review "call" score evaluation; for example in the document management system's context.
Catalyst Repository Systems | Predictive Ranking | More descriptive of the process and result. All systems deal with a rank or likelihood of responsive or not responsive. It is up to the trial team to determine the acceptable risk.
Equivio | Relevance Assessment | The term "coding" suggests that the output is binary (responsive or not). However, one of the important use scenarios is prioritized review, which can only be facilitated by graduated relevance scores. In addition, graduated relevance scores are important in allowing the user to select which documents to review (above a certain cut-off score), based on the mix of risk (recall) and cost (precision) appropriate for the given case and business scenario.
FTI Technology | Suggested coding | We take the approach that this review technology does not completely eliminate human review. "Suggested coding" correctly indicates that human review decisions are preserved and help guide the computer through concept-clustering of documents and the integration of reference documents into the review. Review decisions become more consistent and faster, without relinquishing control over the substantive decisions for each document.
Gallivan Gallivan & O’Melia | Predictive Categorization | "Coding" implies a decision, the machine is suggesting. The coding happens when a person confirms (or refutes) the suggested category.
Hot Neuron, LLC | Automatic Categorization | I don't know if it is "better," but it better aligns with terminology outside of the legal field.
InterLegis | |
Kroll Ontrack | No |
Recommind | |
Valora Technologies, Inc. | “Propagated Coding” or “Replicated Coding” | May we suggest "Propagated Coding," rather than Predictive Coding, as "predictive" tends to mean ahead of the current time (like a forecast), whereas "propagated" would indicate taking existing results and carrying them forth across the remainder of the population (at any time, including the present).
Xerox Litigation Services | Automated Document Classification | We believe that a generic term for a new offering in this market should be as transparent and descriptive as possible. Automated Document Classification is our preferred name for this particular technology, because we believe that it more clearly conveys the intended output for the technology – namely, a definitively classified set of documents. In our view, the term Predictive Coding is opaque and imprecise. It does not differentiate Automated Document Classification from less robust similarity-detection technologies, like clustering, near de-duplication, and e-mail threading. These other techniques could be used to make predictions regarding relevance for certain groups of documents within a corpus. Unlike Automated Document Classification, though, they would not comprehensively classify a document population such that clear definitive lines could confidently be drawn segregating relevant documents from non-relevant documents. In sum, the term Predictive Coding seems to us to suggest a technology whose end results are imprecise, immeasurable, and unreliable. This is not, in our view, an appropriate designation for the emerging body of Automated Document Classification systems.
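Equivio's response above ties the cut-off score to a mix of risk (recall) and cost (precision). A minimal sketch of that trade-off follows, assuming a small validation sample in which each document has a relevance score and a human judgment. The sample data and the function name precision_recall are hypothetical.

```python
from typing import List, Tuple


def precision_recall(scored: List[Tuple[float, bool]], cutoff: float) -> Tuple[float, float]:
    """scored holds (relevance score, judged relevant?) pairs for a validation sample."""
    selected = [relevant for score, relevant in scored if score >= cutoff]
    true_positives = sum(selected)
    total_relevant = sum(relevant for _, relevant in scored)
    # Convention assumed here: an empty selection counts as precision 1.0.
    precision = true_positives / len(selected) if selected else 1.0
    recall = true_positives / total_relevant if total_relevant else 1.0
    return precision, recall


# Hypothetical validation sample: four relevant and four non-relevant documents.
sample = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
          (0.4, False), (0.3, True), (0.2, False), (0.1, False)]

for cutoff in (0.75, 0.5, 0.25):
    p, r = precision_recall(sample, cutoff)
    print(f"cutoff={cutoff:.2f} precision={p:.2f} recall={r:.2f}")
```

In this sample a high cutoff gives perfect precision but finds only half of the relevant documents, while a low cutoff finds all of them at the price of reviewing more non-relevant ones.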
VI. Offering & Overall Process
The survey asked:
Name of PC Offering. What do you call your predictive coding offering?
Time Offered. What year did you first provide predictive coding software or services?
Overall Process. Please describe the overall process involved in your offering: (Example: After records have been collected and placed in a repository, the duplicate records are consolidated. Reviewers perform full text searches and otherwise browse the records of custodians with the most known involvement in the issues. The reviewers identify records known to be responsive and then our system identifies other records that are “most like” those records based on…. We sample non-selected records based on… and examine samples of about XX records to determine if there are still sets of relevant records that had not already been selected for production. We repeat iterations until…)
The responses were as follows:
Predictive Coding Offering & Overall Process

Company – Offering: Capital Legal Solutions – Dynamic Content Profiling
Offered Since: 2010, 2nd Quarter
Underlying Technology: Capital Legal Solutions own Intellectual Property / developed internally.
Overall Process: Dynamic Content Profiling will work on any corpus of documents across any language set that is imported into our eZReview repository pre or post culling for de-duplication, date filtering or key word searching. Dynamic profiler works on any folder or navigational view in the system. As such, client can execute across searches, tags, production data, random sample sets or customized queries. In any event, CLS review architects can work with client to create a powerful strategy whereby they can preview deliberate batches based on any folder technique mentioned prior or through our automated randomizer engine. In our random sampling module, user can make a decision as to the size of sample set and the pass or fail threshold levels. Sampled or deliberate batches then receive review decisions by expert or top level reviewers. Our profiling engine will then scan across the corpus of documents in entire database and find similar documents based on “content and concepts” using CLS’s customized algorithms. All such documents are pulled and foldered for mass categorization. A random sampling can then be performed against that data set for quality assurance purposes. This process can be repeated until all documents in the corpus are reviewed.

Company – Offering: Catalyst Repository Systems – Predictive Ranking
Offered Since: 2008
Underlying Technology: Catalyst Repository Systems (Catalyst CR)
Overall Process: Catalyst offers “Predictive Ranking” and statistical analysis based upon initial coding decisions made by counsel during initial document review/sampling. These coding decisions are coupled with weighted key concepts and search terms, and then are applied against the non-reviewed documents, leading to an assigned predictive weighting for responsiveness.
The ranks are typically used two ways:
1. Documents with a very low rank, tested and shown to be extremely unlikely to be responsive, are not reviewed and not produced.
2. The remaining documents are typically prioritized and reviewed in priority order, beginning with those most likely to be responsive. This allows for a prioritized review, making the review more efficient and supporting rolling productions.
The steps we follow are as follows:
• Begin with a list of search terms that counsel believes are likely to find responsive documents, and run those searches.
• Draw a random sample of both the hits and non-hits, tagging for responsiveness, and looking for additional words and phrases that are found in responsive documents and “false hit” terms that are often found in non-responsive documents.
• Adjust the search terms based on what was learned during the sampling.
• If there were phrases found that are common false hits, run a Catalyst-unique “True Hit Finder/False Hit Remover” process to tag the true hits and not the false hits.
• Assign the search terms scores representing likelihood of responsiveness, and run the searches in Power Search, based on subject matter expertise and sampling results.
• Assign each document a responsiveness rank based on a combination of the search terms that hit and the scores of each search term.
• Sample additional documents to verify the scoring.
• Determine cut-off, and remove the docs that are ranked as non-responsive to a subcollection where they can be sampled and archived. Review the docs ranked as likely responsive beginning with the highest ranked documents.
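The scoring steps above can be sketched in a few lines. The response does not say how term scores are combined, so this sketch assumes a plain sum of the scores of the terms that hit, with a negative score standing in for a “false hit” term. The terms, weights, documents, and function names are all hypothetical and do not describe Catalyst's actual implementation.

```python
import re
from typing import Dict, List, Tuple


def score_document(text: str, term_scores: Dict[str, float]) -> float:
    """Sum the scores of the search terms that hit this document (assumed rule)."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return sum((score for term, score in term_scores.items() if term in words), 0.0)


def rank_documents(docs: Dict[str, str], term_scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (document id, score) pairs, highest score first."""
    scored = [(doc_id, score_document(text, term_scores)) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical terms and weights; the negative weight models a "false hit" term.
term_scores = {"merger": 3.0, "valuation": 2.0, "newsletter": -2.5}
docs = {
    "d1": "Draft valuation for the merger is attached.",
    "d2": "Weekly newsletter: merger rumors in the industry.",
    "d3": "Lunch on Friday?",
}

ranked = rank_documents(docs, term_scores)
print(ranked)  # [('d1', 5.0), ('d2', 0.5), ('d3', 0.0)]
```

A cut-off applied to this ranking would then separate the documents reviewed in priority order from those set aside for sampling.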
The benefits can be magnified when combined with Catalyst’s additional features to accelerate the review, such as Equivio Email Thread/Near Dupe analysis, sophisticated handling of multiple languages, clustering, and managed review workflow.

Company – Offering: Equivio – Equivio>Relevance
Offered Since: 2009
Underlying Technology: We are not at liberty to disclose this information.
Overall Process: Equivio>Relevance enables organization of a document collection by relevance. Based on initial input from an attorney knowledgeable about the case, Equivio>Relevance uses statistical and self-learning techniques to calculate graduated relevance scores for each document in the data collection.
As an expert-guided system, Equivio>Relevance works as follows: An expert reviews a sample of documents, ranking them as relevant or not. Based on the results, Equivio learns how to score documents for relevance. In an iterative, self-correcting process, Equivio feeds additional samples to the expert. These statistically generated samples allow Equivio>Relevance to progressively improve the accuracy of its relevance scoring.
Once the sampling process has optimized, Equivio scores the entire collection, calculating a graduated relevance score for each document.
The product includes a statistical model which monitors the software training process, ensuring validation and optimization of the sampling and training effort.

Company – Offering: FTI Technology – Acuity is the name of our all-in-one legal review offering that utilizes "predictive coding" (our preference is "suggested coding").
Offered Since: Acuity launched in January 2010.
Underlying Technology: The underlying software is Attenex Patterns. The Acuity all-in-one offering utilizes an enhanced version of well-known software that includes the suggested coding features.
Overall Process: The Acuity process is to review a subset of the documents, which we call the reference set, and have the review team code them as appropriate. This serves two functions - these will suggest coding on uncoded documents, and will continually guide and instruct reviewers.
From there, the reference set is uploaded to an enhanced Attenex Document Mapper tool where the coded documents are clustered with other documents of similar content and themes. Based upon the coding of the reference set, the software provides suggestions to the reviewers on how to code the similar documents. Coding decisions are implemented by the reviewers rather than automatically by the software and the process can be accurately described as machine-assisted document review.

Company – Offering: Gallivan Gallivan & O’Melia – depends on client -- we are not consistent: have used clustering, auto tagging, grouping
Offered Since: rudiments in 2003 (Attenex style); fully (if client requested) since 2008
Overall Process: Collect and process records; extract content and place it in a repository; store references in a database. Consolidate duplicates. Extract text or OCR, compare text content to create a similarity vector, store results.
Reviewers perform full text searches and otherwise browse the records of custodians, filtering based on metadata as needed. Similar documents are grouped together.
The reviewers identify groups known to be responsive and then we associate other records that are “most like” those records based on the similarity vector.
Reviewer decisions define the actual mark of the documents vs. the mark suggested by our system.
As new waves of data arrive, they are placed in groups based on similarity vectors generated for that data.
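The response does not say how its similarity vector is built. As one common possibility, and purely as an assumption, the sketch below represents each document as a term-frequency vector and finds the records “most like” a responsive seed using cosine similarity. The texts and function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def term_vector(text: str) -> Counter:
    """Term-frequency vector of lowercase word tokens (assumed representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def most_like(seed: str, candidates: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank candidate documents by similarity to a document known to be responsive."""
    seed_vec = term_vector(seed)
    sims = [(doc_id, cosine(seed_vec, term_vector(text))) for doc_id, text in candidates.items()]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)


responsive_seed = "pricing agreement with distributor"
candidates = {
    "c1": "revised pricing agreement for the distributor",
    "c2": "holiday party schedule",
}
for doc_id, sim in most_like(responsive_seed, candidates):
    print(doc_id, round(sim, 2))
```

Here c1 shares three of the seed's four terms and ranks first, while c2 shares none and scores zero. A reviewer would still confirm or reject the suggested grouping, as the response describes.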

Company – Offering: Hot Neuron, LLC – Clustify (PC is a subset of its functionality)
Offered Since: 2008
Underlying Technology: It is Hot Neuron's own proprietary technology.
Overall Process: Clustify only does the automatic categorization step of the process, so the details of other steps (de-dupe, searches, etc.) are really up to the user. The user supplies two sets of documents to Clustify: documents that have already been categorized (perhaps as responsive/not-responsive, or whatever categories the user wants to use), and documents that haven't been categorized. Clustify compares the uncategorized documents to the ones that have been categorized, and categorizes them automatically if they are sufficiently similar to any of the categorized documents.
The "sufficiently similar" criterion is specified by the user. It could be a minimum conceptual similarity percentage, or a near-dupe percentage. Any uncategorized documents that aren't sufficiently similar to any categorized document for automatic categorization are clustered, labeled with descriptive keywords, and presented to the user for manual categorization.
Clustify tells the user how similar an auto-categorized document is to the most similar manually categorized document, so the user can identify the documents most at risk of incorrect categorization (i.e., those with lowest similarity).
The process can be iterated in an effort to cover more of the uncategorized documents, but it is only wise to do so if there is a manual review of the documents most at-risk for errors. Without such review, it is better to increase coverage by simply setting the similarity requirement lower.
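The categorize-if-sufficiently-similar idea described above can be sketched as follows. This is not Clustify's algorithm, which is proprietary. It assumes Jaccard overlap of word sets as a stand-in similarity measure, and the documents, threshold, and function names are hypothetical.

```python
import re
from typing import Dict, Optional, Set, Tuple


def words(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of distinct words the two documents have in common (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def auto_categorize(text: str, labeled: Dict[str, Tuple[str, str]],
                    min_similarity: float) -> Tuple[Optional[str], float]:
    """Return (category or None, similarity to the closest categorized document)."""
    doc_words = words(text)
    best_category, best_sim = None, 0.0
    for labeled_text, category in labeled.values():
        sim = jaccard(doc_words, words(labeled_text))
        if sim > best_sim:
            best_category, best_sim = category, sim
    if best_sim >= min_similarity:
        return best_category, best_sim
    return None, best_sim  # left for manual categorization


# Hypothetical manually categorized documents: id -> (text, category).
labeled = {
    "l1": ("quarterly sales forecast for widgets", "responsive"),
    "l2": ("office parking permit renewal", "not-responsive"),
}

print(auto_categorize("updated quarterly sales forecast for widgets", labeled, 0.7))
print(auto_categorize("parking lot closed friday", labeled, 0.7))
```

The first document is categorized as responsive with a similarity of 5/6. The second falls below the threshold and returns None, so it would go to manual review. Returning the similarity alongside the category lets a user sort auto-categorized documents by risk, lowest similarity first.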

Company – Offering: InterLegis – Discovery360 Predictive Coding
Offered Since: 2009
Underlying Technology: InterLegis' proprietary technology
Overall Process: Predictive coding is a technology feature within Discovery360 Reviewer. There are two ways to leverage PC within Discovery360.
1. User-Defined: Case administrators define various attributes that define certain issue codes. They can use any attribute in the database, including: keywords, concepts, file types, email domains, specific names and more. This process enables users to first "teach" the system, then ask it to find all documents that match their criteria.
2. Automatic: With this feature activated, the system will essentially "watch and learn" what commonalities are found between documents as they are issue coded. And as reviewers work, the system will find and recommend likely candidates for each issue code. Users can then either approve the entire recommended list, edit the criteria, or quickly QC the list to confirm selections.
Additionally, case administrators have the ability to ask the system to either code matching documents immediately, or place likely candidates in a holding folder for confirmation. In all cases, documents coded via the PC engine are always designated as such in the database for logging and defensibility purposes.

Company – Offering: Kroll Ontrack – Intelligent Prioritization
Offered Since: 2010
Underlying Technology: Intelligent Prioritization is a proprietary Kroll Ontrack technology.
Overall Process: After documents have been processed and uploaded into Ontrack Inview, the project administrator builds an initial workflow. An early workflow stage is designated for Intelligent Prioritization. Initially, a statistically relevant sample of the uploaded documents is provided to reviewers for standard linear review.
The system then assesses the reviewed documents and defines the characteristics of potentially Responsive documents. The system then prioritizes other likely Responsive documents for review. As the review continues, the system’s knowledge of Responsive characteristics improves. When new documents are loaded into Ontrack Inview another statistically significant sample is identified from this new data and that sample of data is prioritized for Responsive review.
In addition to the document prioritization identified above, Kroll Ontrack provides additional project analysis that helps determine when a high percentage of potentially Responsive documents have been identified within the data. By analyzing the Responsiveness patterns in the data and comparing them to the entire population of documents, Ontrack Inview can provide statistical details that can be utilized to indicate the ‘completeness’ of a review.

Company – Offering: Recommind – Axcelerate eDiscovery
Offered Since: 2006
Overall Process: All software, processes and workflow are the proprietary intellectual property of Recommind and cannot, therefore, be disclosed.
10
Predictive Coding Offering & Overall Process Company –
Offering
Offered
Since Overall Process
Underlying Technology Valora Technologies, Inc. – We have numerous offerings here: AutoCoding, AutoIssues, AutoPriv, AutoResponsive, AutoND (NearDupe), AutoETG (EmailThreadGro up) and a roll-up of the above: AutoReview Our first predictive/ propagated capability was AutoCoding, first offered in 2002.
Valora loads the entire collected population into our system, including any review data already available from previously (typically manual) efforts by reviewers.
We build a custom computer-representation of the Document Review Ruleset for each matter. We extract/understand these Rules from three possible places:
1) From a Coding or Review Manual, typically written by the client to train human reviewers
2) From existing/previously coded data from earlier review efforts. In this case, Valora creates a translation from prior actions taken to the underlying rules that guided those decisions (even if not explicitly stated).
3) From direct conversations with the client, particularly when no existing data or review efforts exist (e.g., starting fresh).
Once established, Valora propagates the Document Review Ruleset uniformly across any already-coded documents. The results are reviewed and corrected for precision and recall (accuracy). Once the results meet the desired threshold, the Ruleset is propagated across the entire population.
Valora Technologies, Inc.
Xerox Litigation Services – CategoriX
2009 CategoriX automatically classifies documents by learning from samples that have been reviewed by knowledgeable case attorneys. CategoriX utilizes attorney-supplied document assessments, together with its own statistical analyses, to create a model that will accurately and consistently generalize the attorneys’ assessments across the entire review population. The statistical analysis underlying CategoriX technology is called Probabilistic Latent Semantic Analysis (PLSA). CategoriX leverages PLSA to identify correlations between words and attorney-supplied relevance assessments. This knowledge then informs CategoriX classifications for novel documents going forward.
CategoriX performance depends on the quality of the assessments provided by the attorneys in the training samples. For this reason, several iterations of training and intensive quality control are undertaken during the model-building process to ensure the accuracy and consistency of the training input. Precision and recall are monitored throughout the incremental model-building process to ensure that progress is being made toward our client’s performance goals. Once CategoriX models are consistently performing at the desired levels, CategoriX is applied to the entire review population. Finally, one last round of attorney-driven QC sample review is undertaken to validate the quality of the final result set.
The iterative CategoriX approach has several distinct stages and entails a strong consultative partnership between CategoriX technical experts at XLS and the client’s attorneys. Nevertheless, a CategoriX-based review can typically be completed in a very short timeframe, as many of the analyses are aided by computers working 24 x 365.
Xerox’s two research centers, Xerox Research Centre Europe (XRCE) and Xerox Palo Alto Research Center (PARC).
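Several of the responses above mention monitoring precision and recall against a reviewed sample before a model or ruleset is applied to the whole population. The following is a minimal sketch of that check, not any vendor's implementation: the function names, the sample data and the threshold values are all hypothetical.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two equal-length lists of booleans.

    predicted: the system's responsive/non-responsive calls on a QC sample.
    actual:    the attorney's assessments of the same documents.
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def meets_goals(predicted, actual, min_precision, min_recall):
    """True once both measures reach the (hypothetical) agreed thresholds."""
    precision, recall = precision_recall(predicted, actual)
    return precision >= min_precision and recall >= min_recall


# Hypothetical QC sample of eight documents.
attorney = [True, True, True, False, False, True, False, False]
model = [True, True, False, False, True, True, False, False]

precision, recall = precision_recall(model, attorney)
print(precision, recall)
print(meets_goals(model, attorney, min_precision=0.8, min_recall=0.8))
```

In an iterative workflow this check would be repeated after each training round, and the model applied to the full population only once the check passes.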
VII. Identifying Like Records
The survey asked:
What general approach is used to identify like records? (Select all that apply)
Custom queries
Statistically-based clustering, with no terms inferred, e.g., basing a search or clustering on a document that contains “Ford” and “Toyota” would not find or associate documents that only contained the words “Chevy” and “Honda”
Statistically-based clustering with co-occurring words inferred, e.g., basing a search or clustering on a document that contains “Ford” and “Toyota” could find or associate documents that only contained the words “Chevy” and “Honda”
Taxonomies
Other (please specify):
These were the responses:
Company Queries Clustering (no inf.) Clustering (w/ inf.) Taxonomies Other
Capital Legal Solutions Yes Yes Like records are identifiable in various ways. In addition to the above two we identify also based on document content.
Catalyst Repository Systems Yes Yes
Equivio Supervised learning
FTI Technology Linguistic statistical analysis assesses similarity in documents.
Gallivan Gallivan & O’Melia Yes Yes
Hot Neuron, LLC Yes
InterLegis Yes Machine Learning based on common threads between documents.
Kroll Ontrack Classification based technology that assesses document text to determine related documents.
Recommind Yes Yes Yes
Valora Technologies Yes
Xerox Litigation Services CategoriX uses Probabilistic Latent Semantic Analysis to identify correlations between words and attorney-supplied category assessments. From these building blocks, CategoriX assembles models capable of assigning relevance probabilities to novel documents that have not been manually reviewed. CategoriX’s probability assignments do not depend on the presence of any specific words or phrases in a document. Instead, each document’s score is dictated by the probabilities of the specific combination of words comprising it.
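The difference between the two clustering options in the question (with and without inferred co-occurring words) can be illustrated with a toy example built on the survey's own “Ford”/“Toyota” wording. This is a simplified sketch only: the corpus, the function names and the choice of Jaccard overlap are assumptions made for illustration, and none of the respondents' actual methods (for example PLSA) is reproduced here.

```python
from collections import defaultdict


def jaccard(a, b):
    """Share of distinct words the two documents have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def cooccurrence_map(corpus):
    """Map each word to the set of words it appears alongside in the corpus."""
    related = defaultdict(set)
    for doc in corpus:
        words = set(doc)
        for word in words:
            related[word] |= words - {word}
    return related


def expand(doc, related):
    """Add to a document's words every word that co-occurs with them."""
    expanded = set(doc)
    for word in doc:
        expanded |= related.get(word, set())
    return expanded


# Hypothetical background corpus in which car brands appear together.
corpus = [
    ["ford", "chevy", "dealership"],
    ["toyota", "honda", "dealership"],
    ["invoice", "payment"],
]
seed = ["ford", "toyota"]
candidate = ["chevy", "honda"]

related = cooccurrence_map(corpus)
no_inference = jaccard(seed, candidate)
with_inference = jaccard(expand(seed, related), candidate)
print(no_inference, with_inference)
```

Without inference the two documents share no words, so they are never associated; once the seed is expanded with co-occurring words, the overlap becomes non-zero.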
VIII. Email Threading
Section 3 of the survey asked:
Email Threads. Please explain how email threads are handled in conjunction with your offering.
Emails are analyzed individually so that different emails from the same thread can be placed in different groups or clusters
Email threads are identified prior to grouping or clustering so that all emails in a thread or branch of a discussion are placed in the same group or cluster
Other: please explain:
The responses were as follows:
Company eMails Indiv. All EM in Thread Other
Capital Legal Solutions With our system, there is no boxed-in solution for E-mail Thread review. We can and will work with the case team to establish a workflow that will be most efficient per their strategy. If review based on searching is required, for instance, then we can search, get those results, pull in the entire conversation and take that into account. Or if review based on similar or associated documents is the desired first pass, we can do it that way and then account for E-mails in those threads to be automatically categorized. So we allow flexibility here, as different clients work in different ways, but we can find the efficient way per their work methods.
Catalyst Repository Systems Predictive Ranking is flexible: Searching is done by document, but analysis and ranking can be done by: (a) individual documents, (b) families of email and related attachments, (c) email threads (optional with Equivio email thread processing).
Equivio Both options are supported. This is a user-defined parameter.
FTI Technology We can do both depending upon client preference.
Gallivan Gallivan & O’Melia Yes
Hot Neuron, LLC Clustify allows you to do it either way.
InterLegis Yes
Kroll Ontrack Emails are handled in the Intelligent Prioritization technical solution without additional document type handling. In addition to Intelligent Prioritization, Kroll Ontrack provides email threading technology that analyzes emails and presents them to reviewers grouped by conversation, and identifies the earliest and latest emails in each thread.
Recommind Yes
Valora Technologies We offer both choices as an option to our customers.
Xerox Litigation Services CategoriX typically operates on individual emails. However, the XLS review platform incorporates email threading technology that could be used to ensure that all members of an email thread would be assigned to the same category, should the client prefer this organization.
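The two options in the question (categorizing each email on its own versus keeping a whole thread together) can be contrasted in a short sketch. Everything here is hypothetical: the message identifiers, the scores, the 0.5 threshold, and in particular the policy that a thread is treated as responsive when any one of its emails qualifies, which is only one of several rules a review team might choose.

```python
def thread_roots(parents):
    """Map each message id to the id of the first message in its thread.

    parents: dict of message id -> id of the message it replies to,
             or None for a message that starts a thread.
    """
    def root(message_id):
        seen = set()
        # Follow reply links upward; `seen` guards against malformed cycles.
        while parents.get(message_id) is not None and message_id not in seen:
            seen.add(message_id)
            message_id = parents[message_id]
        return message_id

    return {message_id: root(message_id) for message_id in parents}


def categorize_individually(scores, threshold=0.5):
    """Each email stands or falls on its own score."""
    return {mid: score >= threshold for mid, score in scores.items()}


def categorize_by_thread(parents, scores, threshold=0.5):
    """Every email in a thread receives the thread's best score (assumed policy)."""
    roots = thread_roots(parents)
    best = {}
    for mid, score in scores.items():
        best[roots[mid]] = max(best.get(roots[mid], 0.0), score)
    return {mid: best.get(roots[mid], 0.0) >= threshold for mid in parents}


# Hypothetical data: a -> b -> c form one thread, d is a separate email.
parents = {"a": None, "b": "a", "c": "b", "d": None}
scores = {"a": 0.2, "b": 0.9, "c": 0.1, "d": 0.3}

print(categorize_individually(scores))
print(categorize_by_thread(parents, scores))
```

Handled individually, only email b is selected; handled by thread, a, b and c travel together and d remains unselected.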
IX. Paper-Based Records
Section 3 of the survey asked:
Paper-based Records. How are paper-based records treated for predictive coding purposes?
Paper records are scanned and OCR’d and the OCR’d text is included with the ESI for predictive coding
Paper records are scanned and OCR’d and treated as a separate population from ESI for predictive coding
Paper records are not treated with predictive coding
Other (please explain)
The responses were:
Company Paper w/ESI Separate Paper Paper not treated for Predictive Coding Other
Capital Legal Solutions Yes
Catalyst Repository Systems Yes
Equivio Yes
FTI Technology Yes
Gallivan Gallivan & O’Melia Yes
Hot Neuron, LLC It can be any of the above. It's entirely up to the user to decide whether to put OCR'ed text in the same document set as the ESI, or whether to separate them.
InterLegis Yes
Kroll Ontrack Yes
Recommind Yes
Valora Technologies, Inc. Yes Any ESI documents without text are processed like paper (OCR, etc.).
Xerox Litigation Services
X. Savings
The survey asked:
Cost Savings
As compared to a linear review of the same content after duplicate consolidation, after culling based on domain name analysis of emails (e.g. excluding emails from CNNSports.com) and after email threading, what percentage of time do you estimate is saved by predictive coding when used to select responsive records?
On average:___ % Most observed: ___ % Least observed:___%
The responses were:
Company Average % Savings Most % Savings Observed Least % Savings Observed
Capital Legal Solutions 40 70 25
Catalyst Repository Systems 40 60 25
Equivio (note 1) 65 80 50
FTI Technology 50-60 80 25
Gallivan Gallivan & O’Melia 3 10 0
Hot Neuron, LLC
InterLegis 40 80 10
Kroll Ontrack
Recommind 40 95 20
Valora Technologies, Inc. ** 80 100 25
Xerox Litigation Services 55 77 30
Total 363 572 185
Average of Responses (divide by 9) 45.4% 71.5% 23.1%
Green shading with a gold star indicates that the respondent provided names and contact information for a client who substantiated the information provided regarding savings. Two stars indicate two references. Providing references was optional for the respondents.
** Equivio Note 1: These percentage savings refer to cases in which the software was successfully trained and used. The software includes a statistical model which monitors the "success" of training. Occasionally, due to poorly-defined issues, inconsistent tagging by the expert, or exceptionally low richness (less than 1%), the statistical model detects and notifies the user that training is ineffective, and in these cases, the results are not used.
** Valora Note: Valora builds a computer-representation of the Document Review Ruleset for each matter as part of Valora’s services. In some cases clients have completely forgone a linear review and used the results of the Ruleset instead.
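The Total and Average rows of the savings table can be checked with a few lines of arithmetic. The sketch below sums only the eight rows that report single numbers, leaving out FTI Technology's row with its 50-60 range, and divides by 8; this reproduces the printed figures exactly, whereas dividing the same totals by 9 would give roughly 40.3%, 63.6% and 20.6%. That FTI's row is the one omitted from the printed totals is an inference from the arithmetic, not something the report states.

```python
# (average, most, least) percentage savings as printed in the table above.
rows = {
    "Capital Legal Solutions": (40, 70, 25),
    "Catalyst Repository Systems": (40, 60, 25),
    "Equivio": (65, 80, 50),
    "Gallivan Gallivan & O'Melia": (3, 10, 0),
    "InterLegis": (40, 80, 10),
    "Recommind": (40, 95, 20),
    "Valora Technologies, Inc.": (80, 100, 25),
    "Xerox Litigation Services": (55, 77, 30),
}

# Column sums, then simple means over the eight rows included here.
totals = tuple(sum(column) for column in zip(*rows.values()))
averages = tuple(round(total / len(rows), 1) for total in totals)

print(totals)
print(averages)
```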
XI. Pricing/Cost
The survey asked:
How do you calculate the prices you charge for PC? (select all that apply)
Per GB, pre-culling
Per GB, post culling
Per GB, post culling and deduping
Per File, pre-culling
Per File, post culling
Per File, post culling and deduping
Hourly consulting fees
Flat Fee per case
Other (Please specify below)
The responses were as follows:
Company Per GB, pre-cull Per GB, post cull Per GB post cull & dedupe Per File Pre Cull Per File, post-cull Per File, post cull & Dedupe Hourly Fees Flat Fees Per Case Other Other Text
Capital LS Yes Yes Yes
Catalyst RS Yes
Equivio Yes Yes Most customers prefer the per-file pricing model.
FTI Tech. Yes Yes
Gallivan Gallivan & O’Melia Yes Yes
Hot Neuron Yes Yes Yes Yes We also offer perpetual site licenses with no per-GB fee. Note that our per-GB fees are based on the amount of text, not raw data, which we believe is more fair and economically sensible. Whether the user culls/de-dupes first is up to him/her.
InterLegis Yes Per GB fee after culling, and includes all software, technologies, and services such as project management and productions.
Kroll Ontrack Yes Free introductory offer.
Recommind Yes Yes Yes Yes Enterprise license; SaaS (i.e. per month/quarter/year charge for all volume)
Valora Tech. Yes Yes Yes Yes Yes Per page or per paper document.
Xerox LS Yes Similar to our processing and review platform pricing, our models are very flexible. Depending on client needs and the complexity and size of the matter, our pricing models can vary from matter to matter.
XII. Incremental Cost of Predictive Coding
The survey asked:
What is the incremental cost of providing predictive coding technology above the basic costs of ingesting and deduping electronic records? (express as a percentage over basic ingesting, deduping and threading)
These were the responses:
Company
Capital Legal Solutions 20%
Catalyst Repository Systems Hourly consulting at $250-$350 per hour
Equivio Equivio is a software vendor. Processing and hosting services, as referred to in the question, are provided by our e-discovery partners. As such, we are not in a position to respond to this question.
FTI Technology Acuity is an all-in-one offering from processing through to production, including legal review. The predictive coding feature is included in the fees so there is no additional cost - in fact it offers cost savings.
Gallivan Gallivan & O’Melia Less than 1/10 of 1%. Since we do not charge for processing time, the only "cost" is the extra time required to process the documents. Not all clients want the delay given the perceived small % gain in time.
Hot Neuron, LLC
InterLegis Included in full-suite of services.
Kroll Ontrack This information is proprietary.
Recommind Question is unclear
Valora Technologies, Inc. When Valora performs the ingesting, deduping, etc., there is no incremental cost to perform document tagging of any sort. This includes AutoCoding, AutoReview, etc. When Valora does not perform the preliminary steps, the cost of AutoReview usually runs between 25-50% of typical ESI processing/scanning costs. A better cost comparison is the cost of Predictive/Propagated Coding against the cost of linear review.
Xerox Litigation Services Because our pricing models are based off of client needs and the complexity and size of the matter, incremental costs can vary from matter to matter.
XIII. Sample Sizes
The survey asked:
Sampling Non-selected Records. If you use sampling of non-selected records as a way of validating your approach, what size samples do you use and how is that sample size determined?
The responses were:
Company Sampling
Capital Legal Solutions Using statistical random sampling techniques. Inspection batch sizes can be determined A) by desired % of records; B) by a set number of items; or C) to achieve a degree of accuracy % based on pool size and accuracy level formula
Catalyst Repository Systems Generally a statistically valid sample with 95% confidence level is used.
Equivio Sample size required depends on several variables, including collection richness and size, and the required level of statistical confidence.
FTI Technology Sample size is different for each case, depending on what we're looking for (non-responsive versus privileged, as an example). We use accepted statistical methodology (acceptance sampling, statistical sampling) which includes expected responsive rate, confidence level and acceptable error rate.
Gallivan Gallivan & O’Melia n/a
Hot Neuron, LLC
InterLegis
Kroll Ontrack Intelligent Prioritization does not utilize sampling of non-selected records as an automated way of validating the technical approach. The system is designed to allow clients to utilize the approach of sampling non-selected documents as a companion validation of the solution if they choose to do so.
Recommind 10,000
Valora Technologies, Inc. Valora samples records using random selection from across the entire population. Sample size determination is a function of the size of the population and the accuracy desired.
Xerox Litigation Services XLS relies on statistical methods developed by our in-house statistician to calculate sound precision and recall estimates for CategoriX results. Our techniques focus on establishing extremely accurate estimates of the rates of relevance (or yields) for the client’s categories in the review population as a whole. We ensure that our yield estimates are reliable by selecting random samples for review that are large enough to produce yield estimates with very narrow margins of error according to standard sample size tables. Once stable yield estimates have been established, they provide a reference point from which recall estimates can be calculated following a) the final assessment of categories to documents by CategoriX and b) the establishment of a precision estimate based on direct sampling from the set of documents classified as relevant by CategoriX. Direct sampling in the non-selected records is undertaken only in circumstances where that represents the most efficient option for establishing recall for the final result set. In those cases, the sample size for non-selected records would be dictated by the desired width of the margins of error for the resulting recall estimate.
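Several responses tie sample size to a confidence level, a margin of error and the size of the population. A minimal sketch of the standard textbook calculation for estimating a proportion from a simple random sample follows. It is offered as an illustration under stated assumptions, not as any respondent's actual method: the function names are hypothetical, the normal approximation is assumed, and p = 0.5 is used as the conservative worst case.

```python
import math
from statistics import NormalDist


def z_score(confidence):
    """Two-sided critical value of the standard normal, e.g. about 1.96 for 0.95."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


def sample_size(confidence, margin, p=0.5, population=None):
    """Documents to sample so the estimated proportion is within +/- margin.

    If `population` is given, the finite population correction is applied.
    """
    z = z_score(confidence)
    n = (z ** 2) * p * (1.0 - p) / (margin ** 2)
    if population is not None:
        n = n / (1.0 + (n - 1.0) / population)
    return math.ceil(n)


def margin_of_error(n, confidence=0.95, p=0.5):
    """Worst-case margin of error for a simple random sample of n documents."""
    return z_score(confidence) * math.sqrt(p * (1.0 - p) / n)


print(sample_size(0.95, 0.05))
print(sample_size(0.95, 0.05, population=100_000))
print(round(margin_of_error(10_000), 4))
```

Under these assumptions, a 95% confidence level with a 5% margin calls for a few hundred documents almost regardless of population size, while a sample of 10,000 narrows the worst-case margin to about one percentage point.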
XIV. Set Up Efforts
The survey asked:
Set-up Effort
What level of effort, in terms of time and level of people involved, is required to set up or start a PC review using your offering?
To what extent can efforts expended to start up one review in your system be re-used in other reviews?
To what extent can efforts expended to start up one review in your system be re-used as part of an enterprise-wide information management or retrieval system?
The responses were:
Company Set Up Effort Re-Use in Other Reviews Re-Use in Enterprise System
Capital Legal Solutions
Dynamic Content Profiling engine is an offering built into our application. However, the time to setup is dependent upon the data set received as we have to run several processes before we can activate the various features. History shows that we have already been able to work with clients per their time line.
Work flows for executing our Content Intelligence process are identifiable and reusable. However, work flows depends on the case team and their needs. We can streamline the path to take depending on strategy that team decides to take. Our review consultants are pretty methodological when it comes to devising the most desirable , defensible cost effective review workflow.
Not currently planned to deploy as such but could envision the use of one document corpus' prognostic scores against other matters or cross matters document profiling.
Catalyst Repository Systems
No more than at the start of any typical review. Creation of searches, scoring and initial sampling should be done by associates or higher level attorneys familiar with the case.
Most setup for one can be applied to another case as to review forms, views, folders, subcollections, etc.
A default site is created and replicated for an unlimited number of matters.
Equivio Installation and set-up of the software takes about 1-2 hours. For each case, the software needs to be trained by an "expert" (an attorney familiar with the case) in order to estimate the relevance of documents in the specific case. This training process typically takes 1.5-2 days.
The training of Equivio>Relevance is specific per case/issue.
As above, the training of Equivio>Relevance is specific per case/issue.
FTI Technology
Nothing - it's currently part of the Acuity all-in-one service.
If there is overlap in data or issues, the efforts and work product can be reused.
Because predictive coding comes with Acuity, clients can realize great efficiencies as FTI becomes familiar with IT, custodians, retention practices and privilege issues. Similar cases can be reviewed much more efficiently but even dissimilar cases will benefit from the integrated offering.
Gallivan Gallivan & O’Melia
none -- it's part of the base tool. It's just not enabled
100% reusable on the next matter. similarity vectors do not change between matters.
have never tried.
Hot Neuron, LLC
Can be extremely minimal. For example, if documents are in a Concordance database, with just a few mouse clicks the user can tell Clustify to compare the untagged documents to the tagged ones, automatically apply tags, and export results back to Concordance (into a separate tag folder by default). A user could learn the procedure in a matter of minutes. While Clustify usually works well with default settings, the user can exhibit more control if he/she spends more time learning the tool.
Not really applicable -- there is very little to set up (no rules to code, just select a similarity function and minimum similarity cutoff).
Again, not really applicable.
InterLegis Roughly 2 hours, either up front before review starts or during review.
Rules can easily be transferred to a new matter.
Rules can easily be transferred to a new matter or the enterprise.
Kroll Ontrack No additional setup is required of the client.
The transfer of Intelligent Prioritization information is handled as a custom request.
Not applicable.
Recommind Relatively brief user (1 hour) and admin (4 hours) training
Completely
This can be done and can deliver tremendous value, but most clients do not look to do so as they are not advanced enough to do so.
Valora Technologies, Inc.
We use our own internal staff to setup projects. Typical setup time is 2 business days or less.
Assuming the matters (and/or document content) are similar, startup efforts can (and are) re-used.
The rules encapsulated by our system can be made available to other enterprise-wide systems.
Xerox Litigation Services
CategoriX is not a software offering, so there is no installation required. There is, however, an initial phase of corpus analysis and project scoping that takes place prior to CategoriX model development. This early phase of the CategoriX review typically involves 2-3 XLS team members and 2-3 client-side participants. During this time the XLS CategoriX team: 1) undertakes a thorough corpus analysis, 2) establishes the CategoriX review population, 3) develops reports for the client regarding the outcomes of corpus analysis, and 4) engages the client’s team to finalize project requirements by learning about the team’s goals, priorities, workflow, and timelines. The XLS team will likely invest about 20-40 hours collectively in setting up the project over the course of ~2-3 days. The client team’s participation in this phase of the project will likely require about ~6-8 hours collectively over the same time period.
Any future review involving a document population that has previously been subjected to a CategoriX-related corpus analysis would be able to skip this step in the project set-up process.
CategoriX automated document classification is currently being offered as an XLS service, incorporating tight collaboration between the XLS team and the client team. This ongoing dialogue helps us ensure the highest quality outcomes possible. It allows us to leverage fine-grained human insights to tune and refine CategoriX models. In the current implementation, CategoriX is not a “behind-the-firewall” push-button software offering.
XV. Transparency
The survey asked:
Transparency
What type of audit trail do you maintain to document the process you used in a particular case?
In what proportion of cases involving predictive coding is the opposing party told about the use of the predictive coding approach?
Where the opposing party is told about the use of predictive coding are they provided the opportunity to have input on how it is conducted?
The responses were:
Company Audit Trail Opposing Party Told Opposing Party Input
Capital Legal Solutions
Distinct Audit trail of prognostic decision markings versus human evaluations at the record/document level.
NA NA
Catalyst Repository Systems
Catalyst’s system logs virtually everything that is done, including:
• Power Search Logs of all searches run, and results of each search are saved into SQL results tables and folders.
• All Bulk updates and copy jobs.
• All changes made to any record in document history log.
In addition, careful notes and histories are maintained by consultants of all SQL scripts, procedures, and results, documented in Word documents, Emails, Excel spreadsheets, and our project tracking system.
15% 30%
Equivio The software tracks and documents all actions, including the entire training process. The audit trail of the training process records all the relevance tags applied by the expert.
(Note 2) (Note 2)
FTI Technology
Full audit trail that is completely transparent to client and counsel.
NA NA
Gallivan Gallivan & O’Melia
Every document and all vectors(groups) are stored in db + each action by a reviewer is logged as who, the decision made, and the date/time
0
Hot Neuron, LLC
All parameter settings are stored in the .cys file, and all documents (and their ordering) used for the calculation are specified in the .cyi input files, so the calculation can be reproduced if the documents and their original tags/categories are preserved.
InterLegis All activity is logged, and all PC documents are designated as such in the database.
Kroll Ontrack Because the technology is prioritizing documents for manual review the audit tracking is based on reviewer activities. The power of Kroll Ontrack's Intelligent Prioritization is that Non-Responsive documents are shuffled to the bottom of the review process allowing review teams to focus on the potentially Responsive documents quickly.
Recommind Built-in reporting and workflow management systems Unknown No
Valora Technologies, Inc.
Valora has an established protocol for managing projects, which includes process documentation, regular status reporting and client communication. We also provide document batch tracking and reporting, as well as project-specific decision logs, including client authorizations and feedback.
Unknown. sometimes
Xerox Litigation Services
Every aspect of the CategoriX process is thoroughly documented in a variety of XLS-internal daily tracking logs, templates, and spreadsheets. Additionally, there are a number of internal and client-facing reports generated throughout the project to convey the outcomes of key analyses, to document and confirm decision points at various project milestones, and to provide a detailed summary of the entire project following validation of the final result set.
n/a n/a
XVI. Privilege
The survey asked:
Privilege
In what proportion of federal cases involving the use of predictive coding do the parties enter into a claw-back agreement to recover inadvertently produced privileged records without loss of privilege?
In what proportion of federal cases involving the use of predictive coding do the parties obtain a court order enforcing the claw back agreement?
The responses were:
Company Claw Back Agreement Claw Back Court Order
Capital Legal Solutions NA NA
Catalyst Repository Systems 95 95
Equivio (Note 2) (Note 2)
FTI Technology NA NA
Gallivan Gallivan & O’Melia
Hot Neuron, LLC
InterLegis
Kroll Ontrack
Recommind Unknown 100
Valora Technologies, Inc. Unknown Unknown
Xerox Litigation Services n/a n/a
XVII. Repeatable Results
The survey asked:
Repeatable Results. Assuming someone followed the steps outlined in your audit trail or project documentation on the same set of incoming documents, to what extent would the second processing identify the same set of records identified as responsive in the initial processing? If there would be a difference, please attempt to quantify and explain.
The responses were:
Company Repeatable Results
Capital Legal Solutions We test all of our product features and algorithms for repeatability. As such all tests have returned with 100% results when evaluating the same data set. Until and unless we change algorithms or functions in code then the results will remain consistent.
Catalyst Repository Systems Results would be identical, assuming the same subjective scoring is applied to search terms the second time.
Equivio Tests show that given the same training input, the software generates the same relevance scores for the documents in the collection.
FTI Technology The only difference would be if the reference set was somehow marked differently the second time. Otherwise, there would be no changes.
Gallivan Gallivan & O’Melia Vectors are based on document content, so they would be identical. Changes would come from different needs of the specific matter, group A might be "non-relevant" in matter 1 and "highly responsive" in matter 2. Full audit log would allow decision history to be completely constructed. (See transparency above).
Hot Neuron, LLC Results should be exactly the same. The one possible exception is that if the documents are in native format (rather than text), and the calculation is done on two different computers that have different IFilters installed (used to translate native to text) there could be differences if the IFilters give different results.
InterLegis
Kroll Ontrack The order of documents loaded, sampled and reviewed would have a small impact on the specific number of documents needed to be reviewed before the system could recommend that the majority of Responsive documents had been identified and categorized. Since the technology relies upon Reviewer decisions, strictly repeatable results are not necessary to maintain the quality and power of the solution.
Recommind 100% of the time
Valora Technologies, Inc. 100%, completely identical results.
Xerox Litigation Services Following the CategoriX audit trail/project documentation while using the same set of documents and attorney assessments would generate exactly the same results in every round of processing.
XVIII. Elevator Pitch
The survey asked:
Elevator Pitch
Why should a party that has to review a collection of ESI use or consider using your offering over that of a competitor of yours?
The responses were:
Company Why Use Particular Company
Capital Legal Solutions
eZReview is a very mature review platform with several content analytic features based capabilities beyond the prognostic document profiling. In the eZReview system you never feel stuck or short of workflow options as there are many ways to skin the same cat and eZReview will cater to all. The platform is designed for those who want to take advantage of latest, greatest and smartest technologies to assist in automated review while lending itself to effective traditional linear review with user definable business rules and processes.
Catalyst Repository Systems
With Catalyst’s Predictive Ranking, you can save time and money and improve the review two ways:
1. You can avoid having to review large sets of non-responsive documents.
2. Prioritize documents that may be responsive so that the documents that are most likely to be responsive can be reviewed first or assigned to higher level attorneys.
Predictive Ranking can be used with any of the 79 languages Catalyst supports.
The process is a combination of teamwork between Catalyst’s consultants and the law firm, based on sampling, searching in the FAST search engine with the Catalyst Power Search utility, and analysis performed by the consultants in the SQL database.
It can be combined with additional Catalyst features, such as Equivio email thread/near dupe analysis, clustering and rule-based review workflow, to further accelerate the review. One client for whom we used this process mid-way in the review said, “Boy, I wish we knew about this earlier. We could have avoided reviewing 140,000 documents!”
Equivio Equivio>Relevance allows the user to measure, monitor and manage the e-discovery process. Relevance uses a statistical model which monitors the training and sampling process. This ensures that the training process is optimized, while enabling measurement and verification of the recall and precision achieved by the software. In addition, a decision support environment allows the user to manage the cost and risk of litigation review. This tool presents the retrieval rates for each cut-off point -- for instance, review of 20% of the documents will retrieve 89% of the relevant documents in the collection.
FTI Technology As data volumes grow, predictive coding—the propagation of coding decisions to uncoded documents—is increasingly viewed as a probable and necessary option for cost-effective review of large data sets. The efficiency and defensibility of predictive coding has yet to be proven, however. The suggested coding features of Acuity help bridge a gap between the current human review process and tomorrow’s predictive coding. These features preserve human review decisions by guiding decisions through concept-clustering of documents and the integration of reference documents into the review. Review decisions become more consistent and faster, without relinquishing control over the substantive decisions for each document.
Gallivan Gallivan & O’Melia
Attorneys and litigation support professionals can manage e-discovery fees and data with turn-key solutions that include processing, data mining, native file review, and production, all in one integrated software solution.
No per gigabyte fees;
No monthly per user hosting fees;
No per page costs to produce documents;
Early case assessment analysis;
Minimize and consolidate processing, hosting, review, and production costs;
Manage documents in repositories for use and re-use
Hot Neuron, LLC Clustify scales well to handle large document sets on modest hardware. It gives the user flexibility (e.g., both conceptual similarity and near-dupe) while remaining easy to use. Most importantly, it works well. We strongly encourage anyone considering such software to test it on their own data, since the quality of the algorithm really matters.
InterLegis InterLegis' Discovery360 includes all technologies available in the industry today in an end-to-end solution at a low all-inclusive price.
Kroll Ontrack The Intelligent Prioritization feature a) Analyzes reviewer categorization decisions; b) Identifies and elevates the documents which are most likely relevant to the case; c) Enables reviewers to review the most relevant documents first; d) Learns from early review decisions which guide the prioritization of future documents
Recommind
Speed
Consistency
Higher quality
Significant cost savings (40-90%)
Valora Technologies, Inc.
Why not ask each vendor to prove their capabilities, then choose the best results at the lowest cost? Automated processes lend themselves well to comparisons.
Consider also those vendors who have been automatically capturing document data the longest and who create the underlying software capabilities themselves.
Xerox Litigation Services
CategoriX is an automated document classification tool that combines proprietary, state-of-the-art technology, human expertise, and processes to result in a defensible, accurate, cost-effective document review. Collaboration is key throughout as our team works closely with our client’s legal team to ensure the quality of both the review results and the review process. This strong partnership provides our clients with a clear understanding of the CategoriX technology and allows them to focus on the legal aspects of their case. The CategoriX team oversees implementation of the technology and leverages the XLS review platform to offer a seamless automated document classification service to address the rising cost of document review.
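Several of the responses above describe ranking documents and then reading off retrieval rates at a review cut-off (for instance, the Equivio example in which reviewing 20% of the documents retrieves 89% of the relevant ones). The arithmetic behind such a figure can be sketched in Python as follows. This is a minimal illustration, not any vendor's implementation: the function name `recall_precision_at_cutoff`, the scores, and the relevance labels are all hypothetical and made up for the example.

```python
def recall_precision_at_cutoff(scores, relevant, fraction):
    """Return (recall, precision) when only the top `fraction` of documents,
    ranked by descending score, is reviewed.

    scores   -- hypothetical relevance scores, one per document
    relevant -- True/False labels, one per document (e.g. from a validated sample)
    fraction -- share of the collection to review, in (0, 1]
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if len(scores) != len(relevant):
        raise ValueError("scores and relevant must have the same length")

    # Rank documents from most to least likely relevant.
    ranked = sorted(zip(scores, relevant), key=lambda pair: pair[0], reverse=True)

    # Number of documents above the cut-off (at least one document).
    n_reviewed = max(1, round(fraction * len(ranked)))

    found = sum(1 for _, is_relevant in ranked[:n_reviewed] if is_relevant)
    total_relevant = sum(1 for is_relevant in relevant if is_relevant)

    recall = found / total_relevant if total_relevant else 0.0
    precision = found / n_reviewed
    return recall, precision


# Hypothetical toy collection: 10 documents, 4 of them relevant.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
relevant = [True, True, False, True, False, False, True, False, False, False]

for fraction in (0.2, 0.5, 1.0):
    recall, precision = recall_precision_at_cutoff(scores, relevant, fraction)
    print(f"review top {fraction:.0%}: recall={recall:.2f}, precision={precision:.2f}")
```

Recall here is the share of all relevant documents that fall above the cut-off, and precision is the share of reviewed documents that are relevant. In practice the relevance labels would come from a reviewed sample rather than from the full collection, so the resulting rates would be estimates.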
XIX. Acceptance/Adoption
The survey asked:
Acceptance/Adoption
Indicate the number of cases in which the responding company has provided its PC Offering over the last five years:
Of the total number of cases for which your firm/company handled that involved reviewing e-discovery for production purposes, what percentage involved predictive coding?
The responses were:
Company
Number of Cases: 2010 thru May 1; 2009; 2008; 2007; 2006; 2005
Percentage of Cases: 2010 thru May 1; 2009; 2008; 2007; 2006
Capital Legal Solutions 4 0 0 0 0 0 1 0 0 0 0
Catalyst Repository Systems 4 4 1 Note Note Note
Equivio 30-40 10-20 0 0 0 0 (Note 2)
FTI Technology
Gallivan Gallivan & O’Melia
Hot Neuron, LLC
InterLegis
Kroll Ontrack
Recommind 90 90 80 75 10 0
Valora Technologi