C. Data Sources
2. Natural Language Processing
One of the primary hurdles to collecting data on contract terms for machine learning is that these data are stored in an unstructured format within natural language contract documents, such as English-language Microsoft Word files and PDFs.108 Although contract documents contain significant amounts of data,109 these data are not readily usable for machine learning analysis because they lack structure and labeling.110 To systematically analyze contract terms, companies have traditionally had to manually extract, structure, and label data from natural language documents, which is an extremely time- and labor-intensive process.111 For example, some large law firms will have junior associates
107 Telephone Interview with Sirion Labs Representative (Apr. 27, 2018), supra note 96; Telephone Interview with Sirion Labs Representative (Apr. 3, 2018), supra note 96.
108 See Roach, supra note 19, at 46; Harry Surden, Computable Contracts, 46 U.C. DAVIS L. REV. 629, 642–44 (2012) (distinguishing “natural languages” such as English from “formal languages” such as computer programming languages).
109 See Roach, supra note 19, at 50–51 (describing contracts as a mineable source of data).
110 See Surden, supra note 108, at 642–44; Roach, supra note 19, at 46.
111 Interview with Oracle Representative, supra note 87; Telephone Interview with Airbnb Representative, supra note 87; Telephone Interview with Contract Assistant Representative (Mar. 8, 2018); Telephone Interview with Private Technology Company Representative, supra note 90; Telephone Interview with Public Technology Company Representative (Apr. 4, 2018); Telephone Interview with Public Technology Company Representative, supra note 87.
review contracts after signing for the purpose of entering contract data into an internal database.112
Natural language processing (NLP) is a category of machine learning research focused on enabling computers to understand natural language communication.113 Most NLP techniques are statistical in nature.114 Drawing on a training set of existing natural language documents, NLP models can be trained to understand natural language text based on statistical relationships between components of the text such as individual words, groups of words, word sequencing, and physical layout features like paragraph breaks and page positioning.115 An NLP model is often adjusted and retrained until it is sufficiently accurate at understanding natural language text.116 The model can then be used to process new natural language documents outside of the training set. In the legal context, NLP has primarily been applied to litigation discovery to help human document reviewers sort through massive amounts of discovery documents.117
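The statistical training described above can be illustrated with a minimal sketch. It assumes a multinomial naive Bayes classifier over single words and adjacent word pairs, which is only one of many possible statistical techniques and is not drawn from any vendor discussed in this Article. The clause texts, the labels, and the names `features` and `ClauseClassifier` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def features(text):
    """Return individual words plus adjacent word pairs (word sequencing)."""
    words = text.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams

class ClauseClassifier:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.feature_counts = defaultdict(Counter)  # label -> feature -> count
        self.label_counts = Counter()               # label -> number of training clauses
        self.vocabulary = set()

    def train(self, labeled_clauses):
        for text, label in labeled_clauses:
            self.label_counts[label] += 1
            for feat in features(text):
                self.feature_counts[label][feat] += 1
                self.vocabulary.add(feat)

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n / total)  # prior: how common the label is
            for feat in features(text):
                score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training set of labeled clauses.
TRAINING_SET = [
    ("each party shall keep the confidential information secret", "confidentiality"),
    ("recipient shall not disclose confidential information to third parties", "confidentiality"),
    ("this agreement is governed by the laws of the state of new york", "governing_law"),
    ("the laws of the state of delaware govern this agreement", "governing_law"),
]

model = ClauseClassifier()
model.train(TRAINING_SET)

# A clause outside the training set is classified from word statistics alone.
prediction = model.predict("the receiving party shall not disclose the confidential information")
print(prediction)
```

The sketch omits layout features such as paragraph breaks and page positioning, which commercial systems reportedly also use. In practice, the adjust-and-retrain cycle would repeat with more labeled clauses until accuracy is acceptable.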
Numerous legal technology companies have begun to use NLP to extract structured contract term data from natural
112 See Elisabeth de Fontenay, Law Firm Selection and the Value of Transactional Lawyering, 41 J. CORP. L. 393, 397 (2015). For example, an associate might note in the database whether a venture financing contract contains an anti-dilution provision, and if so, what type.
113 See Surden, supra note 108, at 643. For an overview of NLP, see generally CHRISTOPHER D. MANNING & HINRICH SCHÜTZE, FOUNDATIONS OF STATISTICAL NATURAL LANGUAGE PROCESSING 3–5 (1999); RUSSELL & NORVIG, supra note 72, at 860–67; Robert Dale, Classical Approaches to Natural Language Processing, in HANDBOOK OF NATURAL LANGUAGE PROCESSING 1–7 (Nitin Indurkhya & Frederick J. Damerau eds., 2d ed. 2010); Prakash M. Nadkarni, Lucila Ohno-Machado & Wendy W. Chapman, Natural Language Processing: An Introduction, 18 J. AM. MED. INFORM. ASS’N 544 (2011).
114 See Surden, supra note 108, at 644.
115 Telephone Interview with Kira Systems Representative (Mar. 5, 2018); Telephone Interview with Kira Systems Representative, supra note 60.
116 Telephone Interview with LawGeex Representative (Mar. 8, 2018); Telephone Interview with LegalSifter Representative (Mar. 14, 2018).
117 See Surden, supra note 108, at 644.
language contracts.118 Using NLP to generate structured contract term data is far more efficient, cost-effective, and scalable than the manual alternative. Many legal NLP companies also create application programming interfaces (“APIs”) that allow their products to integrate with contract management systems.119 This enables a company to track and use the contract data obtained via NLP within its contract management system. While non-legal NLP companies often use off-the-shelf NLP software,120 legal NLP companies must typically create their own models because legalese is highly technical and departs from ordinary language.121 For example, LawGeex developed its own NLP model, called Legalese Language Processing (“LLP”), specifically for understanding contractual legalese.122 LawGeex’s proprietary LLP model was trained for over three years on over 400,000 contracts to understand the unique phrasing, sentence structure, and terminology of contractual legalese.123
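To make the notion of structured contract term data concrete, the following minimal sketch shows one way an extracted term might be represented and serialized for hand-off to a contract management system. The `ExtractedTerm` record, its field names, and the JSON layout are hypothetical assumptions for illustration. They do not describe the API of any company discussed in this Article.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ExtractedTerm:
    """One labeled data point pulled from a natural language contract (hypothetical schema)."""
    contract_id: str          # identifier assigned by the contract management system
    term_type: str            # e.g. "anti_dilution"
    present: bool             # whether the contract contains the term at all
    value: Optional[str]      # the variant of the term, if present
    source_text: str          # the contract language the label was derived from

def to_payload(term: ExtractedTerm) -> str:
    """Serialize the record as JSON, a common interchange format for APIs."""
    return json.dumps(asdict(term), sort_keys=True)

# Hypothetical example: a venture financing contract with an anti-dilution provision.
term = ExtractedTerm(
    contract_id="financing-001",
    term_type="anti_dilution",
    present=True,
    value="weighted_average",
    source_text="The Conversion Price shall be adjusted on a weighted average basis.",
)
payload = to_payload(term)
print(payload)
```

Once contract terms take this form, they can be stored, queried, and aggregated like any other database record, which is what makes them usable for machine learning analysis.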
The main differentiating factor among legal NLP products is whether the NLP model is pretrained. Pretrained (also known as “out-of-the-box”) models are typically trained on large data sets (thousands, tens of thousands, or even hundreds of thousands) of relatively simple contracts such as
118 See Our Services, CONTRACTSTANDARDS, https://www.contractstandards.com/Services [https://perma.cc/48YC-3R3V]; EBREVIA, https://ebrevia.com [https://perma.cc/6K6K-H8T8]; How Kira Works, KIRA SYSTEMS, https://www.kirasystems.com/how-it-works [https://perma.cc/N9N9-CXUU]; LAWGEEX, https://www.lawgeex.com [https://perma.cc/ZM36-GG3U]; LEGAL ROBOT, https://www.legalrobot.com [https://perma.cc/K3TL-WJTK]; LEGALSIFTER, https://www.legalsifter.com [https://perma.cc/Q2W9-HM7Q].
119 Telephone Interview with Kira Systems Representative, supra note 115; Telephone Interview with Contract Standards Representative, supra note 58; Telephone Interview with Beagle Representative (Mar. 9, 2018); Telephone Interview with eBrevia Representative (Apr. 6, 2018).
120 Telephone Interview with Legal Robot Representative (Mar. 14, 2018).
121 Telephone Interview with LawGeex Representative (Apr. 11, 2018).
122 Id.
123 Id.; see also Telephone Interview with LawGeex Representative (Mar. 8, 2018).
sales and nondisclosure agreements.124 For example, Contract Standards trained its pretrained model on publicly available contracts obtained through the Securities and Exchange Commission’s EDGAR database.125 The advantage of pretrained models is that users can apply them immediately without having to train the models themselves.126 The downside, however, is that pretrained models cannot be used to understand types of contracts and terms that are not contained within the supplied training set.127 As a result, pretrained models are ill-suited to more niche and complex types of contracts. User-trained models, on the other hand, can be applied to any type of contract, but the user must supply the contracts that make up the training set.128 The number of contracts needed for a user to train a model with sufficient accuracy depends on the complexity and variability of the contract: the more complex and variable the terms in the contract, the larger the required training set.129 For example, Kira Systems offers a user-trained model that can be applied to any type of contract.130 To use the model, the
124 Telephone Interview with Contract Standards Representative, supra note 58; Telephone Interview with eBrevia Representative, supra note 119; Telephone Interview with Kira Systems Representative, supra note 115; Telephone Interview with Kira Systems Representative, supra note 60; Telephone Interview with LawGeex Representative, supra note 123; Telephone Interview with LawGeex Representative, supra note 121; Telephone Interview with LegalSifter Representative, supra note 116.
125 Telephone Interview with Contract Standards Representative, supra note 58; see Filings & Forms, U.S. SEC. & EXCHANGE COMMISSION, https://www.sec.gov/edgar.shtml [https://perma.cc/GK9T-6539].
126 Telephone Interview with LawGeex Representative, supra note 123; Telephone Interview with LawGeex Representative, supra note 121.
127 Telephone Interview with LawGeex Representative, supra note 123; Telephone Interview with LawGeex Representative, supra note 121.
128 Telephone Interview with Beagle Representative, supra note 119; Telephone Interview with eBrevia Representative, supra note 119; Telephone Interview with Kira Systems Representative, supra note 115; Telephone Interview with Kira Systems Representative, supra note 60; Telephone Interview with LegalSifter Representative, supra note 116.
129 Telephone Interview with eBrevia Representative, supra note 119.
130 Telephone Interview with Kira Systems Representative, supra note 115; Telephone Interview with Kira Systems Representative, supra note 60.
user must provide at least fifty contracts in which the terms of interest have been pre-labeled by the user.131 The user then clicks a button labeled “Train,” which trains the model on the contracts provided.132 After the model has finished training, the system displays the model’s accuracy.133
One legal NLP company, LegalSifter, has developed a hybrid NLP product that resembles both a pretrained and a user-trained model.134 LegalSifter works with users to develop user-trained NLP models specifically for a user’s niche contracts and terms.135 LegalSifter then makes these models available to other users with similar niche contracts.136 The models are retrained every week to take into account feedback and new data from all users.137 Through this process, LegalSifter can effectively crowdsource the training of new models for any type of contract.138
Legal NLP products, including pretrained, user-trained, and hybrid models, will increase the availability and quality of data on contract terms.
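The user-trained workflow described above (supply at least fifty labeled contracts, train, then review the reported accuracy) can be sketched as follows. This is an illustration under stated assumptions, not a description of how Kira Systems or any other vendor implements training. The word-vote model is a deliberately simple stand-in, measuring accuracy on a held-out 20% of the contracts is an assumed method, and the sample contracts are synthetic.

```python
import random
from collections import Counter, defaultdict

MIN_TRAINING_CONTRACTS = 50  # minimum reported in the text for the user-trained model

def train_word_votes(examples):
    """Count how often each word appears under each label."""
    votes = defaultdict(Counter)
    for text, label in examples:
        for word in set(text.lower().split()):
            votes[word][label] += 1
    return votes

def predict(votes, text, default_label):
    """Pick the label whose training texts shared the most words with `text`."""
    tally = Counter()
    for word in set(text.lower().split()):
        tally.update(votes.get(word, {}))
    return tally.most_common(1)[0][0] if tally else default_label

def train_and_report(labeled_contracts, holdout_fraction=0.2, seed=0):
    """Train on user-labeled contracts and return (model, held-out accuracy)."""
    if len(labeled_contracts) < MIN_TRAINING_CONTRACTS:
        raise ValueError(
            f"need at least {MIN_TRAINING_CONTRACTS} labeled contracts, "
            f"got {len(labeled_contracts)}"
        )
    shuffled = list(labeled_contracts)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    holdout, training = shuffled[:n_holdout], shuffled[n_holdout:]
    votes = train_word_votes(training)
    default_label = Counter(label for _, label in training).most_common(1)[0][0]
    correct = sum(predict(votes, text, default_label) == label for text, label in holdout)
    return votes, correct / len(holdout)

# Synthetic, hypothetical labeled contracts: 30 with the term of interest, 30 without.
contracts = (
    [(f"series a financing {i} includes weighted average anti-dilution protection",
      "anti_dilution") for i in range(30)]
    + [(f"series a financing {i} includes standard information rights only",
        "no_anti_dilution") for i in range(30)]
)

model, accuracy = train_and_report(contracts)
print(f"held-out accuracy: {accuracy:.0%}")
```

The synthetic contracts here are far more uniform than real agreements, so the reported accuracy says nothing about real-world performance. As the text notes, more complex and variable terms would require a larger training set to reach comparable accuracy.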