Practical Predictive Coding in an Unfamiliar Linguistic Landscape








UBIC North America, Inc.


A Practical Approach to Predictive Coding and the Idiosyncrasies of Asian Language Electronic Discovery

Experience and understanding are paramount in electronic discovery, but when legal teams need to analyze records written in complex Asian languages, primarily Chinese, Japanese, and Korean (CJK), truly tested technology is the only solution¹. In fact, the idiomatic distinctions are inherently complicated for most of the world’s population². Characters have multiple meanings, cultural distinctions impact interpretation, and context can dramatically influence the content.

As a result, it is often essential that native speakers design, implement, and troubleshoot the technology necessary to properly review material in its original form. With the increasing use, however, of automated tools that focus on conceptual connections and uniform categorization, the task is at an unprecedented level of difficulty.

Multiple meanings, cultural distinctions, and idiomatic anomalies are just a few of the issues that dramatically influence the complexity associated with e-discovery in Chinese, Japanese, and Korean, among other Asian languages. Success requires native expertise and great technology to debunk the myth of Unicode compliance, adapt to an unusual array of e-mail platforms, and overcome the Far East's fascination with encryption. Learn how to navigate the maze of Asian-language e-discovery concerns in this comprehensive white paper.

¹ See, e.g., Tracey Bateman Farrell, J.D., et al. Documents in Foreign Language; Translation, 44A N.Y. JUR. 2D DISCLOSURE § 295.

² See, e.g., Kathie Carpenter, Carol J. Compton, Elizabeth Riddle, and Julian Wheatley, A Guide to the Study of Southeast Asian Languages, 9 J. SOUTHEAST ASIAN LANGUAGE TEACHING 1 (2000).


One Size Software Does Not Fit All Cases

While some claim that any software tool proven during discovery with data sets exclusively containing Roman characters is easily adapted to CJK matters, that conclusion is generally unreasonable. Coding distinctions, speed requirements, accuracy, and defensibility are all critical variables in this discussion. In addition, encryption levels vary widely throughout the world, and tend to be strongest in the Asia-Pacific region. With the adoption of multiple operating systems, email management platforms, and other technological infrastructure, software used in Asian language discovery must be more versatile and functional than peer products in the Western Hemisphere. And, as the technology-assisted review (TAR) revolution takes hold, these tools must be available in a predictive coding environment. Although machine learning has tremendous advantages, it is the hybrid team of talent and technology that ultimately yields success for the most complex and advanced review projects.

Debunking the Myth of Unicode Compliance

Regardless of language, litigants should focus on efficient collection, processing, and review of the data at issue³. While many vendors claim to offer tools that effectively address Asian-language discovery, Unicode compliance does not traditionally encompass the cumulative systems in use throughout China, Japan, and Korea⁴. Japan, for example, uses four different code sets (e.g., Shift-JIS, EUC-JP, and JIS, among others)⁵. China, Korea, and Taiwan each have their own distinct code sets as well.

In order to function properly, the review software must have the depth to recognize the unique characteristics of each. If the system misunderstands any aspect of that encoding scheme, it will display unusable garbled characters. This deficiency typically prevents most basic e-discovery engines from operating correctly, and by limiting search results to Unicode without regard to the other encoding systems, a party might place itself at an unnecessary disadvantage.
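The garbling described above is easy to reproduce. The following is a minimal sketch in Python, not a description of any vendor's product: it encodes a short Japanese string in Shift-JIS, shows that decoding it under the wrong code set does not reproduce the text, and lists which candidate encodings accept the bytes at all. The candidate list and the helper name `decodable_as` are illustrative assumptions.

```python
# Candidate encodings, spelled as Python's standard codecs name them.
# The list is an illustrative assumption, not an exhaustive inventory.
CANDIDATES = ["utf-8", "shift_jis", "euc_jp", "iso2022_jp", "gb18030", "big5", "euc_kr"]


def decodable_as(raw, candidates=CANDIDATES):
    """Return every candidate encoding that decodes `raw` without error."""
    matches = []
    for name in candidates:
        try:
            raw.decode(name)  # strict mode raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


original = "契約書"  # "contract" in Japanese
raw = original.encode("shift_jis")

# The correct code set round-trips the text.
print(raw.decode("shift_jis") == original)

# The wrong code set produces garbled characters instead of the original.
garbled = raw.decode("euc_kr", errors="replace")
print(garbled == original)

# Several encodings may accept the same bytes, so a clean decode alone
# does not prove that the right code set was chosen.
print(decodable_as(raw))
```

Because more than one code set can decode the same bytes without error, a review tool generally needs additional evidence, such as declared metadata or statistical checks, before it settles on an encoding.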

In the hyper-competitive era of e-discovery efficiency, in-house counsel are unwilling to accept anything less than absolute preparation from their outside lawyers and technology vendors. Courts have become increasingly sophisticated and, while they are unlikely to speak the Asian language at issue, they are familiar with the challenges associated with its technical evaluation. It is, therefore, essential that legal teams not only understand the limitations of Unicode compliance, but can also articulate when and why alternatives are more appropriate for any given scenario.

³ John T. Yip, Addressing the Costs and Comity Concerns of International E-Discovery, 87 WASH. L. REV. 595, 618 (2012).

⁴ Ramana Venkata, Michael A. Geibelson, Overcoming E-Discovery Challenges with New Technologies, L.A. LAW., June 2007, at 46, 48.

⁵ IBM, Code Sets for National Language Support, NATIONAL LANGUAGE SUPPORT GUIDE AND REFERENCE (last visited May 31, 2013).


Nimble Review Software is Necessary to Adapt to E-Mail Systems

That requirement is also true for e-mail. As the Microsoft Office suite is fairly standard worldwide, software needs to apply, analyze, and extract text correctly for Office- and Outlook-related documents. More importantly, however, since China, Japan, Korea, and Taiwan each use a popular domestic e-mail client (e.g., Eudora, Becky!, Thunderbird, and Notes, among others), the technology must recognize the file structure distinctions and adjust them accordingly.

In fact, one of the hallmarks of a dynamic tool is its ability to highlight the location of encoding within a particular file, the misidentification of which is common in more generic search technology. The lack of standardization in the Asian region creates far more complexity than the linguistic anomalies. Each e-mail platform has its own architecture, source code, and storage protocols. Also, the communication practices, archiving techniques, and access vary according to culture as much as they depend on technical specifications. As a result, applying general knowledge of western-region systems to the nuances of eastern-region practices and processes can often confuse an already intricate endeavor. To build clarity, legal teams must leverage the most adaptable software available and recognize at the outset that their conventional wisdom may be inapplicable to an Asian language discovery project.

Embrace Encryption in the Far East

In addition to e-mail platform idiosyncrasies, data security practices in Asia are distinctive. Although safeguarding information is a paramount concern across the globe, encryption is more widely used in China, Japan, and Korea than throughout North America and Europe. That sensitivity to data protection requires advanced technology to combat potential threats, including on-site support, custom solutions, and follow-up consulting.

Unless a vendor is familiar with the particular encryption system at issue, there will be unavoidable challenges. The design, sophistication, and purpose are all key factors in navigating through a protected network. Different organizations also maintain varying degrees of security depending on the type of data they possess. For instance, a government contractor or a consumer-facing healthcare conglomerate may have a very different view of its archive than an industrial manufacturer or a consulting firm. Globalized organizations further complicate these initiatives given the regulatory environment in certain jurisdictions.

In addition to familiarity, the vendor’s team must be capable of diplomatically communicating with the company’s internal IT leaders, who tend to be quite powerful and reluctant to share infrastructure information. Unlike in the U.S., for example, those responsible for data in Asian corporations often operate in silos, with little incentive to integrate their practices. They control one of the most important assets of any business in the region, which is the data portfolio, and they are very protective of their work. By leveraging technology that can seamlessly access a corporate network upon approval, legal teams can eliminate many of the obstacles they are likely to face in any given matter. And, by flawlessly identifying different file types and encoding schemes, or navigating distinct encryption protections and minimizing potential errors caused by interaction with unusual platforms, counsel can focus on analyzing responsive material, rather than trying to access it.

To supplement this endeavor, legal team leaders must collaborate with a partner capable of providing a local presence in the appropriate areas since courts are routinely scrutinizing cross-border data transfers in e-discovery matters⁶. In weighing a variety of factors, including the need for critical information, its availability through other means, the potential burden on foreign courts, and the prospective expense, courts are sensitive to the increasing internationalization of discovery⁷.

Deciphering Layers and Managing Layer Combination Unpredictability

Regardless of the unique operating system or distinct application, including e-mail, word processing, database, and other programs, the particular character set available can be unpredictable and may lack standardization since each application is designed to input, edit, and display text in a unique fashion. Also, the information related to a particular character set may be hidden in the header, which generally maintains technical details.
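As an illustration of character-set information living in a header, the sketch below uses Python's standard `email` package to read the declared charset of a small, hand-built message and decode the body with it. The message itself, including the address and the ISO-2022-JP declaration, is a made-up example, and real mail may declare a charset incorrectly or omit it entirely.

```python
from email import message_from_bytes

body = "見積書を添付します。"  # "The estimate is attached."

# A hypothetical message: the body is ISO-2022-JP, and only the
# Content-Type header says so.
raw = (
    b"From: sender@example.com\r\n"
    b'Content-Type: text/plain; charset="iso-2022-jp"\r\n'
    b"Content-Transfer-Encoding: 7bit\r\n"
    b"\r\n"
) + body.encode("iso2022_jp")

msg = message_from_bytes(raw)

# The declared character set is read from the header, not from the body.
charset = msg.get_content_charset()

# Decode the raw payload bytes using the declared character set.
text = msg.get_payload(decode=True).decode(charset)

print(charset)
print(text == body)
```

If the header were missing or wrong, the same bytes would have to be identified by other means before any search or review could proceed.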

This is especially complex when managing files with multiple languages. Ultimately, it is the confusion associated with deciphering the layers for encoding and language settings that requires specialized software to ensure the seamlessness of the


⁶ See, e.g., Yip, 87 WASH. L. REV. at 618.

⁷ See, e.g., Heraeus Kulzer, GmbH v. Biomet, Inc., 633 F.3d 591 (7th Cir. 2011).


Adaptability and Functionality Set the Foundation for Respected Results

Among other characteristics, Chinese and Japanese sentences do not require breaks between words, which makes it difficult for standard indexing technology to identify individual words. In addition, there are a variety of character codes. For example, standard Japanese uses four distinct scripts (kanji, hiragana, katakana, and the Roman alphabet), as well as various methods for rendering each set of characters. Korean, on the other hand, does use spaces, but its spacing conventions differ from common break structures in western languages.

As a result, proper Japanese review requires 24 pattern matches, while Chinese and Korean require 12. All are exponentially more complex in their matching than most other languages commonly involved in electronic discovery.
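One common way to index text that has no word breaks is to split it into overlapping character pairs rather than words. The sketch below is a generic illustration of that idea in Python, with hypothetical function names and sample sentences; it is not the pattern-matching scheme referred to above. It also shows a familiar pitfall: a search for 京都 (Kyoto) matches inside 東京都 (Tokyo Metropolis).

```python
from collections import defaultdict


def char_ngrams(text, n=2):
    """Overlapping character n-grams; no word boundaries are needed."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs, n=2):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


def search(index, docs, query, n=2):
    """Ids of documents that contain `query` as a substring."""
    grams = char_ngrams(query, n)
    if not grams:  # query shorter than n: fall back to a plain scan
        return {d for d, t in docs.items() if query in t}
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing every n-gram does not guarantee adjacency, so confirm.
    return {d for d in candidates if query in docs[d]}


docs = {
    "a": "東京都で契約書を作成した",  # "Drafted the contract in Tokyo"
    "b": "京都の会議は延期された",    # "The Kyoto meeting was postponed"
}
index = build_index(docs)

kyoto_hits = search(index, docs, "京都")       # matches both documents
contract_hits = search(index, docs, "契約書")  # matches only document "a"
print(sorted(kyoto_hits), sorted(contract_hits))
```

Character n-grams favor recall over precision, which is why such indexes are often paired with dictionary-based or morphological segmentation and with reviewer judgment.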

CJK Search Requires Savvy Software

As a result of this peculiarity in CJK matters, it is essential to partner with providers that can support key initiatives and offer technology with advanced features⁸.

First, keyword search functionality must include a multiple-language capability, as well as Boolean, group, proximity, and regular expression options. This versatility distinguishes basic software from advanced tools that are organically designed to work in a multi-lingual environment. In addition, it is essential for reviewers to have a variety of techniques to implement.

Second, metadata search must offer customization, where users can save results in groups and color-code hit files. In fact, there is an expectation that any tool will permit users to tailor their efforts to the unique nature of the matter at issue. If it does not, it is unlikely to serve as a suitable option for any proactive review team.

Third, given the linguistic complexities, the search engine should be built on an open-source structure, such as Lucene, to permit users to evaluate syntax more simply than traditional index search tools. That evaluation ability will result in a more comprehensive review and deeper analysis. Either could make the difference between finding and missing critical information, regardless of its language, style, or format.

The Database Structure Could Define the Process

To maximize the potential to capture as many as possible of the linguistic variables inherent in Asian-language e-discovery, the underlying object-relational database should also be free and open-source. PostgreSQL, for example, offers one of the most robust options as it is available for multiple platforms, including Linux, Microsoft Windows, and Mac OS X. Given the diversity in computing preferences throughout the region, this is essential.

⁸ See Jacob Tingen, Technologies-That-Must-Not-Be-Named: Understanding and Implementing Advanced Search Technologies in E-Discovery, 19 RICH. J.L. & TECH. 2, 76 (2012) (advocating a need for third-party discovery vendors when dealing with a large database of electronic information).


That database must support multiple unique file types to enable proper character-code identification and metadata extraction. In fact, elite Asian-language e-discovery tools, such as that offered by UBIC, use specialized indexing to convert the variety of character codes to Unicode (UTF-8) to streamline the search process. This conversion capability is essential as legal teams rarely know the state of their data sets until they begin assessing them. At that point, they will need maximum flexibility to keep pace with the action.

Finally, in the current era of advanced analytics, it is essential that legal teams apply the most sophisticated technology to their significantly complex transactions. For instance, those tools that use “n-gram models” to predict subsequent items in a data sequence are harnessing the maximum power in communication theory and statistical natural language processing for simplicity and scalability. That simple and scalable approach will often characterize the success of this type of endeavor.
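To make the idea of predicting the next item in a sequence concrete, the following is a toy bigram model in Python. It simply counts which character follows which in a handful of training strings; the function names and the three-word corpus are assumptions chosen for illustration, and production systems use far larger corpora and smoothing.

```python
from collections import Counter, defaultdict


def train_bigram_model(sequences):
    """Count, for each item, which items follow it in the training data."""
    following = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            following[current][nxt] += 1
    return following


def predict_next(model, item):
    """Most frequently observed successor of `item`, or None if unseen."""
    if item not in model:
        return None
    return model[item].most_common(1)[0][0]


# 契約書 = "contract document", 契約者 = "contracting party"
corpus = ["契約書", "契約者", "契約書"]
model = train_bigram_model(corpus)

after_kei = predict_next(model, "契")   # only 約 ever follows 契
after_yaku = predict_next(model, "約")  # 書 follows twice, 者 once
print(after_kei, after_yaku)
```

The appeal of the approach is the one the text names: the model is nothing more than counts, so it is simple to build and scales with the volume of data.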

The TAR Revolution

Asian organizations recognize the value of TAR. In fact, various branches of the Japanese government and defense agencies receive training in computer forensics, which is an increasing trend worldwide⁹. The U.S. Department of Justice is also using predictive coding in its litigation matters, with one official noting:

“When it works well, predictive coding reduces the document review and production burden on parties while still providing the division with the documents it needs to fairly and fully analyze transactions and conduct.”¹⁰

Courts also continue to approve its use¹¹.

When the Northern District of Indiana approved the use of predictive coding, it decided the matter on proportionality grounds, concluding that the cost of further searching would outweigh the potential benefit of discovering additional documents¹².

There are different types of TAR technologies, e.g., clustering and near duplication, but the tool offering the most promise is predictive coding¹³.

⁹ See Nicholas Barry, Note, Man Versus Machine Review: The Showdown Between Hordes of Discovery Lawyers and A Computer-Utilizing Predictive-Coding Technology, 15 VAND. J. ENT. & TECH. L. 343, 365-371 (2013).

¹⁰ Renata B. Hesse, Statement at the Global Competition Review 2nd Annual Antitrust Law Leaders Forum, Miami, Florida. IP, Antitrust and Looking Back on the Last Four Years. February 8, 2013.

¹¹ See In re: Biomet M2a Magnum Hip Implant Products Liability Litigation, No. 3:12-MD-2391 (N.D. Ind. Apr. 18, 2013).


¹³ See Adam M. Acosta, Predictive Coding: The Beginning of A New E-Discovery Era, RES GESTAE, October 2012.


Maximizing TAR and Predictive Coding

Statistical relevance is an essential component of any TAR application and, therefore, any tool that addresses Asian-language e-discovery should incorporate a comprehensive mathematical component. By leveraging a sophisticated analysis based on the mutual information of random variables, technology can gauge the frequency with which any particular term appears in a given document, accounting for common words that generally appear often, and measure relevance. In order to validate a randomly selected sample set that contains as few as 1%-5% of the total number of documents, the software must be capable of processing continuous mathematical calculations in lieu of human reviewers to highlight critical records earlier in the process. It also assigns a score to designate relevance and allows reviewers to rank items for evaluation based on their time constraints and budgetary restrictions. This process can achieve recall rates in excess of 90 percent and can process up to 900,000 documents in a matter of hours.
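A small worked example may help show why mutual information discounts common words. The Python sketch below computes the textbook mutual information, in bits, between two yes/no variables: whether a term appears in a document and whether a reviewer marked that document relevant. The four sample documents, their labels, and the function name are invented for illustration and do not represent any particular product's formula.

```python
import math


def mutual_information(docs, labels, term):
    """Mutual information (bits) between 'term appears' and 'is relevant'."""
    n = len(docs)
    counts = {(t, r): 0 for t in (0, 1) for r in (0, 1)}
    for text, relevant in zip(docs, labels):
        counts[(int(term in text), int(relevant))] += 1

    mi = 0.0
    for (t, r), c in counts.items():
        if c == 0:
            continue  # an empty cell contributes nothing to the sum
        p_tr = c / n                                     # joint probability
        p_t = (counts[(t, 0)] + counts[(t, 1)]) / n      # marginal for term
        p_r = (counts[(0, r)] + counts[(1, r)]) / n      # marginal for label
        mi += p_tr * math.log2(p_tr / (p_t * p_r))
    return mi


docs = [
    "契約の解除について",  # "Regarding termination of the contract"
    "契約の更新について",  # "Regarding renewal of the contract"
    "会議の日程について",  # "Regarding the meeting schedule"
    "昼食の予定について",  # "Regarding lunch plans"
]
labels = [True, True, False, False]  # hypothetical reviewer decisions

mi_contract = mutual_information(docs, labels, "契約")    # informative term
mi_common = mutual_information(docs, labels, "について")  # in every document
print(mi_contract, mi_common)
```

In this sample, 契約 ("contract") separates the relevant documents perfectly and scores one full bit, while について ("regarding") appears everywhere and scores zero, which is the sense in which frequent but uninformative words are accounted for.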

The Art of Machine Learning

Modern predictive coding software achieves these milestones by generally “learning” from the sample training data and rating the remaining documents based on its initial conclusions¹⁴. Advanced technology will extract the most significant keywords and weigh each one using morphological analysis and statistical algorithms.
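The learn-then-rate loop can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not any commercial algorithm: it uses character bigrams in place of morphological analysis, and smoothed log-odds weights in place of a production classifier. All names, sample documents, and labels are hypothetical.

```python
import math
from collections import Counter


def bigrams(text):
    """Set of character bigrams, used here in place of real word analysis."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def train(sample):
    """sample: list of (text, is_relevant). Returns per-bigram weights."""
    rel, non = Counter(), Counter()
    n_rel = n_non = 0
    for text, is_relevant in sample:
        if is_relevant:
            rel.update(bigrams(text))
            n_rel += 1
        else:
            non.update(bigrams(text))
            n_non += 1
    weights = {}
    for gram in set(rel) | set(non):
        # Add-one smoothing keeps the ratio finite for one-sided features.
        p_rel = (rel[gram] + 1) / (n_rel + 2)
        p_non = (non[gram] + 1) / (n_non + 2)
        weights[gram] = math.log(p_rel / p_non)
    return weights


def score(weights, text):
    """Sum of learned weights; features never seen in training add zero."""
    return sum(weights.get(g, 0.0) for g in bigrams(text))


def rank(weights, docs):
    """Order documents from most to least likely relevant."""
    return sorted(docs, key=lambda d: score(weights, d), reverse=True)


# Reviewer-coded training sample (hypothetical).
sample = [
    ("契約の解除を通知する", True),     # "Give notice of contract termination"
    ("契約違反の損害賠償", True),       # "Damages for breach of contract"
    ("昼食の予定を確認する", False),    # "Confirm the lunch plans"
    ("会議室の予約を変更する", False),  # "Change the meeting room booking"
]
unreviewed = ["忘年会の予定", "契約の更新"]  # "Year-end party plans", "Contract renewal"

weights = train(sample)
ranked = rank(weights, unreviewed)
print(ranked)
```

The reviewers supply the judgment through the coded sample, and the software extends that judgment to the documents no one has read yet, ranking them so the likeliest hits surface first.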

It is, however, the convergence of technology and legal talent that ultimately yields the greatest results for review teams. Sophisticated tools enable savvy practitioners to apply their insights to a series of algorithms that allow the machine to adapt those initial conclusions to provide similarly accurate results.

In matters that involve Asian language discovery and other linguistic anomalies, it is critical for both the software and the legal team to recognize the distinctions. They must be adaptable in tandem with one another to maximize the potential of the pairing between humans and machines. The art of this combination is at the core of the TAR revolution.

As professionals grow more comfortable with training their digital counterparts to mimic their analysis, it will accelerate a process that is often too time-intensive and cost-prohibitive for most litigants. As the tolerance for expensive and inefficient options drops, there is likely to be an exponential increase in the use of these tools across the globe.

¹⁴ See Michael Yager, E-Discovery As Quantum Law: Clash of Cultures-What the Future Portends, 19 RICH. J.L. & TECH. 10, 21 (2013).



Categorizing documents based on their relevance and responsiveness is central to processing records at an exponentially higher rate and at a lower cost than linear methods. This division of material into similar patterns and content is a key hallmark of modern discovery; however, it is the nature of that separation that has the greatest impact on the results.

In contrast to automated clustering, in which the software groups similar documents on its own, human reviewers generally establish the categories into which to divide material, with the software matching their determinations. The judgment is that of the legal expert with substantive experience, but the computer implements and emulates that understanding. The system serves as a supplement to help lawyers manage an exploding volume of information.

Distinguishing Between Different Predictive Coding Technologies¹⁵

While there are a variety of predictive coding tools on the market that can leverage categorization and clustering capabilities to dramatically improve the nature of legal review, particularly in large-volume, high-stakes litigation, they may be inapplicable for Asian language discovery. In those matters, it is essential that counsel apply tools that specifically address the inherent linguistic challenges associated with idiomatic complexity. In addition, they must also recognize the coding, layering, and security characteristics that are unique to reviews of this type.

As a result of these diverse elements, it is often critical to leverage a hybrid technology, rather than a traditional predictive coding tool alone. While standard TAR platforms may apply automated categorization and ranking to initial human determinations, more advanced tools with a combined workflow, adding concept clustering among other techniques, can often sharpen the outcome. Although this might be unnecessary in a smaller case focused on English-language documents, it is essential for matters with a large volume of Chinese, Japanese, and/or Korean records combined in random order.

¹⁵ Maura R. Grossman and Gordon V. Cormack, The Grossman-Cormack Glossary of Technology-Assisted Review, 2013 FED. CTS. L. REV. 7 (January 2013).


The Future of Asian Language E-Discovery

As the recovery continues and companies seek out business opportunities in Asia, among other fast-growing regions of the world, litigation will become increasingly global. Document productions will cross borders and legal teams will inevitably transform into diverse teams of talent. Those with linguistic capabilities will offer an advantage to clients facing uncertain regulatory climates at home and abroad.

In addition to recognizing the substance of the legal issues, the most coveted teams will supplement their ability with technology that offers an array of applications, from categorizing and ranking relevant information, to automatically recognizing idiomatic coding variations and navigating inherent layering challenges.

By combining talent and technology in anticipation of a more complex legal environment, organizations can ensure the promise of continued success.




