The Benefits of Sampling in E-Discovery: How Smart Sampling Can Help Attorneys Reduce Document Review Costs. A white paper.


Table of Contents

The Need for Data Sampling

When Should You Sample?

To make reasonably sure that responsive documents are identified, reviewed and produced

To safeguard against inadvertent production of privileged documents

Judicial Requirements for Sampling

What Does Sampling Entail?

Judgmental sampling

Random sampling

Statistically valid sampling

Sampling for Early Case Assessment

Quality Control Using Sampling

Sampling Techniques

Clustering

Auto-classifiers

Predictive scoring

The Future of Sampling in E-Discovery

The Need for Data Sampling

In the past few years, the volume of electronic content being created has increased dramatically. Email, word-processing files, spreadsheets and more are churned out and distributed at a rate few of us could have imagined when we started our professional careers. Back then, big cases involved a few thousand documents; most files could fit in your briefcase. Today, similar cases might easily require review of tens or hundreds of thousands of documents. Some involve collections running into the millions.

It used to be accepted practice that attorneys representing clients in a lawsuit would read every document at least once and often more than once. With discovery populations mushrooming, that is no longer feasible. Clients can no longer afford the cost of reviewing every document, even if counsel could find the time. There has to be a better way to find what is relevant and push aside what is not.

Faced with review costs collectively running into the billions, counsel and their clients require new and better ways to find relevant documents and lessen the scope of necessary review.

In recent years, promising techniques have evolved to help lighten the review burden. Using these strategies can dramatically reduce the volume of electronic records that require attorney review, sometimes by as much as 98 percent. They include:

• Find and omit duplicate and near-duplicate documents from the review. (Near-duplicate documents are identical except for minor differences, such as the same letter addressed to two different people.)

• Develop an agreed-upon list of key search terms and/or date ranges for identifying potentially relevant documents for further treatment. With counsel and court in agreement over the key terms, the parties simply ignore the mountain of documents that don’t meet the search criteria.
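As a rough illustration of the first technique, the sketch below flags exact duplicates by hashing normalized text and near-duplicates by comparing overlapping word sequences ("shingles") with Jaccard similarity. This is one common approach, not necessarily the one any particular e-discovery tool uses. The function names, the sample texts and the 0.6 threshold are all assumptions made for the example.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash of lowercased, whitespace-normalized text; equal hashes mean exact duplicates."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text: str, size: int = 3) -> set:
    """Set of overlapping word sequences of the given length."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str) -> float:
    """Share of shingles the two texts have in common, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


# Hypothetical documents: the same letter addressed to two people, plus an unrelated memo.
letter_to_ann = "Dear Ann, please find attached the signed supply agreement for your records."
letter_to_bob = "Dear Bob, please find attached the signed supply agreement for your records."
memo = "The quarterly budget meeting has been moved to Thursday afternoon."

NEAR_DUP_THRESHOLD = 0.6  # assumed cut-off; a real project would tune this

print("exact duplicate:", fingerprint(letter_to_ann) == fingerprint(letter_to_bob))
print("near duplicate:", jaccard(letter_to_ann, letter_to_bob) >= NEAR_DUP_THRESHOLD)
print("memo near duplicate:", jaccard(letter_to_ann, memo) >= NEAR_DUP_THRESHOLD)
```

The two letters differ only in the addressee, so they are not exact duplicates but share most of their shingles. The memo shares none.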

What troubles the courts, and often counsel as well, is this: How can the parties be assured that the agreed-upon search terms did not overlook other documents that are relevant to the case? One answer is data sampling, a process whereby the producing party reviews a sample set of documents and extrapolates the results to the entire population.

When Should You Sample?

When you consider the stages of e-discovery, as depicted by the Electronic Discovery Reference Model (EDRM), sampling can be useful at several points along the way. During processing, for example, sampling can be a check on your procedures. For early case assessment, sampling can help identify key themes. During review, sampling can be used to check for inconsistencies in coding calls. Before production, sampling can be used for quality control.

Even so, the most common—and perhaps most critical—use of sampling is during review and analysis. With data populations exploding, sampling is an essential method of check and balance. It serves two key purposes:

1. To make reasonably sure that responsive documents are identified, reviewed and produced.

Courts do not typically demand that litigants find and produce all responsive documents. In today’s electronic world, that is almost an impossibility—or at least hugely impractical. Rather, they require that litigants make reasonable efforts to find and produce responsive documents.

In a typical search and review, you might run a set of responsive search terms across the document population, segregate the responsive documents from the non-responsive documents, and produce only the responsive documents. Sampling provides a check on the results. You review a statistically valid random sample of the non-responsive documents to see if the search terms missed any responsive documents. If they did, you modify your search, run it again and then sample it again, until you are reasonably satisfied with the results.

To better illustrate this point, consider this: A commonly used standard for statistically valid sampling is a 95% confidence level with a 2% margin of error (a confidence interval of plus or minus two percentage points). Under that standard, if you had one million documents to review, you would need to review only about 2,400 randomly selected documents (roughly 0.24% of the document population) to project the results onto the entire one million documents. You can then focus on reviewing the documents that are most likely to be relevant.
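The arithmetic behind a figure like this can be sketched with the standard sample-size formula for estimating a proportion, adjusted for a finite population. The function name and defaults below are assumptions for illustration. The sketch also assumes the most conservative case, in which half the documents are responsive (proportion = 0.5), because that yields the largest required sample.

```python
import math
from statistics import NormalDist


def sample_size(population: int, confidence: float = 0.95,
                margin: float = 0.02, proportion: float = 0.5) -> int:
    """Documents to review for a given confidence level and margin of error."""
    # z-score for a two-sided interval (about 1.96 at 95% confidence)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # sample size for an effectively infinite population
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    # finite population correction shrinks the sample slightly
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


for total in (10_000, 100_000, 1_000_000):
    print(total, sample_size(total))
```

For one million documents this yields a sample of roughly 2,400, close to the figure quoted above. Note how little the required sample grows as the population grows: the sample size is driven mainly by the confidence level and margin of error, not by the size of the collection.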

Through this iterative process of search and sampling, you add an extra layer of quality control, one that provides assurance to a court that you took reasonable efforts and that could help avoid unwanted sanctions.
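One pass of the check described above might look like the following sketch: draw a random sample from the documents the search marked non-responsive, have them reviewed, and project the observed rate of missed documents back onto the whole set. The `is_responsive` callback stands in for attorney review and is purely hypothetical, as are the function name and the test data. The interval uses the simple normal approximation, which becomes unreliable when the observed rate is very close to zero.

```python
import math
import random
from statistics import NormalDist


def estimate_missed(null_set, is_responsive, sample_size, confidence=0.95, seed=None):
    """Sample the non-responsive set and project how many responsive documents it hides."""
    rng = random.Random(seed)  # a recorded seed makes the draw repeatable
    sample = rng.sample(null_set, sample_size)
    hits = sum(1 for doc in sample if is_responsive(doc))
    rate = hits / sample_size
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(rate * (1 - rate) / sample_size)
    return {
        "sample_rate": rate,
        "rate_interval": (max(0.0, rate - margin), min(1.0, rate + margin)),
        "projected_missed": round(rate * len(null_set)),
    }


# Hypothetical null set in which 2% of the documents are in fact responsive.
null_set = list(range(100_000))
result = estimate_missed(null_set, lambda doc: doc % 50 == 0, 2396, seed=1)
print(result)
```

If the projected number of missed documents is too high to defend as reasonable, the search terms are revised and the sample is drawn again.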

2. To safeguard against inadvertent production of privileged documents.

Inadvertent production of privileged documents can carry serious consequences and may cripple your case. Even so, a manual, linear privilege review of every document can be enormously time consuming, if not virtually impossible.

Sampling can help ensure that privileged documents do not slip through the cracks. If, through sampling, you find any that did slip through, you can take appropriate measures to correct the mistake, such as by revising search terms and re-running the search or by creating review rules to flag such documents for a second level review.

Judicial Requirements for Sampling

Recent court opinions suggest that sampling is not only useful but may be required. Several decisions in the past few years have penalized lawyers for not sampling documents before they were produced (waiver of privilege) and for not sampling the documents that were not produced (omission of responsive data).


Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251 (D. Md. 2008) (Grimm).

Even more recently, another court found waiver of privilege in a “smoking gun” attorney-client communication because counsel failed to sample. Mt. Hawley Ins. Co. v. Felman Prod., Inc., 2010 WL 1990555 (S.D. W. Va. May 18, 2010).

Courts understand that there can be mistakes and that the explosion of data makes it impossible to look at everything. When counsel seek forgiveness for an inadvertent production, courts are increasingly likely to ask them whether they used sampling technologies. While not perfect, sampling is a reliable way to check your review. It provides a higher level of comfort before the production that non-produced documents do not include something that should have been produced and that the produced documents do not contain privileged material.

What Does Sampling Entail?

So what, exactly, is sampling all about? The concept comes from the world of statistics and is broadly applied in any number of common circumstances.

Sampling is an iterative process that continues until you reach a point where you can be confident about your results. Data sampling, properly done, allows counsel to review a small, representative portion of the total document universe and extrapolate the findings to the larger population.

Generally, sampling involves one of three methods:

1. Judgmental sampling

For this method, the sampler exercises judgment in selecting the elements to be sampled. That means that not every item of data has an equal chance of being selected. Typically, judgmental samples are used when staff or time resources are limited or there is no need to generalize about the entire population. For example, a judgmental sample may be sufficient to show a control weakness or to prompt management to take corrective action.

2. Random sampling

In this method, any piece of data in the population has an equal chance of being selected. This ensures that no bias is used in the sample selection. However, a random sample is not necessarily a “statistical sample”: unless the sample size is planned for a stated confidence level and margin of error, the results cannot be reliably projected to the population. A random, non-statistical sampling method would typically be used as a way of emphasizing that the results were not biased or exaggerated by selecting, for example, known cases of noncompliance.

3. Statistically valid sampling

This method combines random sampling with additional statistical criteria such as confidence level, confidence intervals, expected error rate and precision. This method is used to make a statement about the population from which the sample was selected. In this case, outcome measures can be projected to the population.
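The difference between the first two methods comes down to how documents are picked. The sketch below contrasts a judgmental pick (everything from one custodian the reviewer considers important) with a simple random draw. The document records, the custodian labels and the function names are invented for the example. A statistically valid sample would use the same random draw with a sample size planned in advance for a stated confidence level and margin of error.

```python
import random

# Hypothetical collection: 1,000 documents, one in ten from the CFO.
documents = [
    {"id": i, "custodian": "cfo" if i % 10 == 0 else "staff"}
    for i in range(1, 1001)
]


def judgmental_sample(docs, custodian):
    """Reviewer-chosen subset: only documents from a custodian thought to matter."""
    return [d for d in docs if d["custodian"] == custodian]


def random_sample(docs, size, seed=None):
    """Simple random sample: every document has the same chance of selection."""
    return random.Random(seed).sample(docs, size)


picked = judgmental_sample(documents, "cfo")
drawn = random_sample(documents, 50, seed=7)

print(len(picked), "documents chosen by judgment, all from one custodian")
print(len(drawn), "documents drawn at random from",
      len({d["custodian"] for d in drawn}), "custodian group(s)")
```

Recording the seed, as here, lets the same random sample be reproduced later if the selection has to be explained.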

Sampling for Early Case Assessment

An important reason for conducting early case assessment (ECA) is to direct the review process to be more efficient and effective. For the litigation team, sampling provides a “bird’s eye view” of what the data contains, helps in prioritizing tasks, assists in identifying search terms with high responsive rates, and aids in isolating relevant and junk data. For lawyers, ECA is often a delicate balance between precision and recall. Judgmental sampling is generally sufficient to establish a solid baseline to begin with. Sampling helps lead you to “fish where the fish are.”
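Precision is the share of retrieved documents that are actually relevant; recall is the share of all relevant documents that were retrieved. A minimal sketch of both measures follows, using made-up document IDs; the function name is an assumption for the example.

```python
def precision_recall(retrieved: set, relevant: set):
    """Return (precision, recall) for a search result against known relevant documents."""
    true_hits = len(retrieved & relevant)
    precision = true_hits / len(retrieved) if retrieved else 0.0
    recall = true_hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical IDs: the search returned five documents, two of the four relevant ones among them.
retrieved = {1, 2, 3, 4, 5}
relevant = {4, 5, 6, 7}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Broad search terms tend to raise recall at the expense of precision, and narrow terms do the reverse, which is the balance the text refers to. In practice the set of relevant documents is not known in advance, so both figures are estimated from reviewed samples.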

ECA sampling helps gauge the strength of your case, determine whether you have gathered the most relevant data, and assess the most effective way to cull the information. The cost of ECA can be offset later in the process by predetermining the best search terms and methods. ECA can save further costs if the information gleaned from one or two custodians lets you know that it is not worth pursuing 20 others.

Regardless of the result, ECA puts lawyers in a better position to negotiate terms, determine strategy or even assess whether a case should be litigated or settled. If ECA reveals weaknesses, litigants can cut their losses early in the game. If the case proceeds, the search criteria can be applied to the full universe of data.

Quality Control Using Sampling

of sampling, one can use one or all of the types of sampling methodologies described previously.

The main goal of quality control sampling is to minimize risk. It ensures that the correct documents are produced to the opposing counsel and enhances overall confidence in the e-discovery process. This extra step can increase costs, depending on the quantity of data used and the number of times sampling is done. However, any added cost is outweighed by the value of avoiding problems in court.

Sampling Techniques

While e-discovery law increasingly mandates sampling, it does not mandate a specific technique. E-discovery vendors offer a variety of approaches. Among the leading vendors, the distinction is not whether they sample, but the degree to which they sample and the effectiveness of their techniques. Healthy debate can occur over the number and placement of checkpoints, the technology used and the percentage of data tested.

Among common sampling techniques are:

• Clustering is a technique that groups documents that are similar in some way based on certain underlying concepts. This is useful when trying to identify low-hanging fruit within a large group of mostly irrelevant documents or to identify similar responsive or privileged documents within a huge set of unreviewed documents. By sampling a few documents from each cluster, you can make coding decisions and apply them to the entire cluster, thereby saving on attorney review time and costs.

• Auto-classifiers learn the relevancy of documents. As the lawyer starts to review a small sample of documents, auto-classifiers learn the pattern of the lawyer’s coding calls. Once the system has learned the pattern, it applies it to the complete set of data. Auto-classifiers can help make predictions about coding based on actions taken before. In most cases, the reviewers should need to review less than 10% of the total document population before the system learns what it needs to know.

• Predictive scoring is a statistical analysis based upon coding decisions made by counsel during the initial document review and coupled with weighted key concepts and search terms. The higher the weighting of a document, the more likely it is to be relevant. Documents that hit on the higher-weighted search terms are given priority for review, thereby making the review more efficient. This is a simple, easily implemented and cost-effective method for prioritizing review.
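The weighted-term part of predictive scoring can be sketched in a few lines: each document's score is the sum of the weights of the search terms it contains, and the review queue is sorted by that score. The term weights, the documents and the function names below are hypothetical. In the approach described above, the weights would also be informed by counsel's coding decisions during initial review, which this sketch does not model.

```python
import re

# Hypothetical term weights; higher means more likely to signal relevance.
TERM_WEIGHTS = {"kickback": 5.0, "rebate": 3.0, "invoice": 1.0}


def score(text: str, weights: dict) -> float:
    """Sum the weights of every weighted term occurrence in the text."""
    words = re.findall(r"\w+", text.lower())
    return sum(weights.get(word, 0.0) for word in words)


def prioritize(docs: list, weights: dict) -> list:
    """Order documents for review, highest score first."""
    return sorted(docs, key=lambda d: score(d["text"], weights), reverse=True)


docs = [
    {"id": "A", "text": "Invoice attached for March."},
    {"id": "B", "text": "The rebate looks like a kickback to me."},
    {"id": "C", "text": "Lunch on Friday?"},
]

for doc in prioritize(docs, TERM_WEIGHTS):
    print(doc["id"], score(doc["text"], TERM_WEIGHTS))
```

Documents with a score of zero are not discarded here. They simply fall to the bottom of the queue, where they can later be checked by sampling.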


The key to an effective analysis lies in finding a statistically valid sample from which to work. Often we find that collections focus on single custodians, sometimes the ones thought to be most relevant. Later collections broaden the net and return a wider sampling of data both by type and content.

If you are going to develop a sample set, be careful that the initial documents being considered are not too one-dimensional to be helpful. Going back to a paper metaphor, imagine what your impression of a case would be if all you sampled initially were boxes of invoices. You might have a lot of documents in your sample but you would not have representative documents from which to extrapolate your case analysis.
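One way to guard against a one-dimensional sample is to stratify: split the collection into groups (by custodian or document type, for example) and draw from each group in proportion to its size. This is offered as an illustration of the concern, not as a method the paper prescribes. The field name `custodian`, the function name and the collection below are assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(docs, key, total, seed=None):
    """Draw roughly `total` documents, spread across groups in proportion to group size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        strata[doc[key]].append(doc)
    sample = []
    for group in strata.values():
        # Proportional share, but at least one document from every group.
        # Rounding means the final count can differ slightly from `total`.
        share = max(1, round(total * len(group) / len(docs)))
        sample.extend(rng.sample(group, min(share, len(group))))
    return sample


# Hypothetical collection dominated by one custodian.
collection = (
    [{"id": f"a{i}", "custodian": "accounting"} for i in range(800)]
    + [{"id": f"s{i}", "custodian": "sales"} for i in range(150)]
    + [{"id": f"e{i}", "custodian": "executive"} for i in range(50)]
)

sample = stratified_sample(collection, "custodian", 100, seed=3)
print(len(sample), sorted({d["custodian"] for d in sample}))
```

Every group is represented in the result, so even a small custodian contributes documents to the initial impression of the case.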

The Future of Sampling in E-Discovery

The use of sampling in e-discovery is in its early stages, with new techniques appearing almost by the day. With document-by-document review becoming physically and financially infeasible, sampling will of necessity be a key tool for lawyers and e-discovery professionals. There is a myth, perpetuated by fear of the unknown, that skilled lawyers can do a review without missing relevant documents or accidentally producing privileged documents.

Computers are not less accurate than people. As a matter of fact, research reported in eDiscovery Institute’s Survey on Predictive Coding points to a high degree of fallibility in human review. As e-discovery evolves and the volume of information continues to grow, a necessary outcome will be greater reliance on technology to analyze data.

The courts’ scrutiny of e-discovery methods requires counsel to ensure that their processes are reasonable and sound. Sampling is a persuasive way for counsel to demonstrate effective procedures in information management. Its benefits far outweigh its costs.

About the Authors

Tom Turner

Tom Turner is president and co-founder of Document Solutions, Inc. (DSi), an e-discovery, digital forensics and litigation support services company that provides a wide range of traditional and technology-driven services.

John Tredennick

John Tredennick was a trial lawyer and litigation partner for twenty years before founding Catalyst Repository Systems, Inc., which provides secure, hosted document repositories. For more than a decade, his company has helped counsel search, review and sample large volumes of discovery documents.

The authors wish to thank the senior consultants at Catalyst Search & Analytics Consulting for their assistance, especially Nirupama Bhatt, James Eidelman and Ron B. Tienzo.