Knowledge Discovery from Business Contracts.


ABSTRACT

GAO, XIBIN. Knowledge Discovery from Business Contracts. (Under the direction of Dr. Munindar P. Singh.)

A contract is a legally binding agreement between participating parties specifying service requirements and expectations, and stakeholder rights and duties. Additionally, a contract provides a framework for resolution in case of breach of its terms. In current practice, contracts are produced as text documents, usually drafted by contract lawyers, so insights such as service capabilities, business risks, and interaction relations remain hidden in unstructured text. We address the challenge of discovering knowledge from thousands of contracts. To this end, we develop a comprehensive approach implemented as a system, Contract Miner, that is capable of extracting essential information from a large contract repository.

First, service exceptions such as late product delivery, payment default, and bankruptcy reveal critical aspects of business service operations. Though rarely studied before in connection with services, exception extraction can help uncover the potential risks an organization is exposed to. Contract Miner takes advantage of a handful of linguistic patterns to harvest service exceptions at the phrase level.

Second, business events form the backbone of business relationships and correspond to essential business processes such as purchase and payment. Business events, e.g., product delivery, bill payment, and bank interest accrual, are inherently temporally constrained. With a hybrid of linguistic patterns, grammar parsing, and classification, Contract Miner extracts business events and their corresponding temporal constraints. It applies topic modeling to organize the event lexicon into thematic groups.

Third, normative relationships are among the most important aspects of contractual relations, capturing commitments, authorizations, and prohibitions. Norms are studied intensively in multiagent systems. They yield guidance for implementing software agents as well as a basis for judging whether the parties are complying with the contract. Building on the methods for extracting service exceptions, business events, and temporal constraints, Contract Miner uses supervised methods to extract normative relationships.


Knowledge Discovery from Business Contracts

by Xibin Gao

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

Computer Science

Raleigh, North Carolina

2012

APPROVED BY:

Dr. Dennis R. Bahler Dr. Robert B. Handfield


DEDICATION


BIOGRAPHY


ACKNOWLEDGEMENTS

First, I would like to thank my advisor, Dr. Singh, for his support, guidance, and encouragement through all the years of my graduate study. His academic rigor, intellectual acumen, and scholarly insight led me on the journey of my Ph.D. pursuit. In the course of research full of trials and failures, his patience and trust lit my path and served as an enduring support system. He will continue to inspire me with his intellect, personality, and wisdom in my future career, and the wealth he bestowed will be a lifelong treasure of mine.

My committee members Dr. Handfield, Dr. Healey, and Dr. Bahler have been a tremendous help and offered insightful ideas for my dissertation. Their advice and discussion helped shape the direction and content of my research. With tremendous expertise in their respective areas, they broadened my research horizon, enabled me to think along multiple dimensions, and immersed me in the most exciting research frontiers.

I met many wonderful people during my internships. I am indebted to Ed Curry and Sean O’Riain for hosting me at Digital Enterprise Research Institute, National University of Ireland Galway in the summer of 2010. I thank Pankaj Mehra for making possible my semester-long visit to Whodini Inc. in Los Altos, California. I would also like to thank my mentor Kang Li and manager Yi Li for their help during my internship at Microsoft Corp. in the summer of 2011.

I am honored to have the chance to work with great fellow lab mates at North Carolina State University: Moin Ayazifar, Adel ElMessiry, Scott Gerard, Chung-Wei Hang, Anup Kalia, Dhanwant Singh Kang, Prashant Kediyal, Pradeep Murukannaiah, Pankaj Telang, Guangchao Yuan and Zhe Zhang in alphabetical order of the last name. The discussions we had in both research matters and everything else will be one of the sweet memories I will have.

Friends I met in The Triangle, the Seattle area, the San Francisco Bay area, and Ireland made my life of the past five years so much richer. I learned from them, among other things, that graduate life is about academic research but grander than mere research. They have had great influence on my personal growth, lifelong goals, and views on people and things.

Last but not least, I owe my utmost gratitude to my family in China for their unwavering support and encouragement while I pursued my Ph.D. in the United States. They are always there with open arms when I struggle with matters in academic and personal life. Any progress I made would have been impossible without them.


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1 Introduction
1.1 Background
1.2 Information Extraction from Contracts
1.3 Contributions
1.4 Organization

Chapter 2 Service Exceptions Extraction
2.1 Introduction
2.2 Problem: Mining Service Contracts
2.3 Approach
2.3.1 Step 1: Extract Sentences that Refer to Exceptions
2.3.2 Step 2: Construct Noun Phrases
2.3.3 Step 3: Identify Exception Noun Phrases
2.4 Evaluation
2.4.1 Statistics about the Corpus
2.4.2 Quality of Identifying Exceptions
2.4.3 Frequent Exception Phrases
2.4.4 Performance
2.4.5 Additional Validation: Cloud Services Contracts
2.5 Related Work
2.5.1 Service Contracts
2.5.2 Pattern-based Information Extraction
2.6 Discussion

Chapter 3 Business Events and Temporal Constraints Extraction
3.2 Problem and Approach Overview
3.2.1 Overview of Information Extraction
3.2.2 Our Approach Overview
3.3 Task 1: Business Event Extraction
3.3.1 Approach
3.3.2 Evaluation
3.4 Task 2: Event Term Clustering
3.4.1 Approach
3.4.2 Evaluation
3.5 Task 3: Temporal Constraints Extraction
3.5.1 Approach
3.6.1 Event Extraction
3.6.2 Temporal Information Extraction
3.6.3 Contract Analysis

Chapter 4 Normative Relationships Extraction
4.2 Motivation and Problem Statement
4.2.1 Contracts
4.2.2 Norm Types
4.2.3 Norms in Text
4.2.4 Technical Problem
4.3 Approach
4.3.1 Overall Approach
4.3.2 Step 1: Candidate Identification
4.3.3 Step 2: Sentence Parsing
4.3.4 Step 3: Feature Selection
4.3.5 Step 4: Classification
4.4 Evaluation
4.4.1 Evaluation Metrics
4.4.2 Experiment Results
4.5.1 Norms
4.5.2 Text Mining of Norms, Regulations, and Policies

Chapter 5 Conclusion
5.1 Characteristics of Contracts
5.2 Summary of Research and Results

References

Appendices
Appendix A Event Term Clusters
A.1 Licensing Contracts Event Term Clusters
A.2 Leasing Contracts Event Term Clusters
Appendix B Database Schemas
B.1 Event Database Schema


LIST OF TABLES

Table 2.1 Distribution of our patterns across contracts of different domains over our entire corpus.

Table 2.2 Statistics of sentence length (number of words) over our entire corpus.

Table 2.3 Precision, recall, and F-measure for selected manufacturing contracts.

Table 2.4 Sample exception phrases extracted from selected manufacturing contracts using pattern “in (the) event of.”

Table 2.5 Sample exception phrases extracted from selected manufacturing contracts using pattern “in (the) case of.”

Table 2.6 Frequent exception phrases extracted from all 206 manufacturing contracts in our corpus.

Table 2.7 Precision, recall, and F-measure for selected cloud services contracts.

Table 2.8 Sample extracted exception phrases from selected cloud services contracts.

Table 3.1 Signal words.

Table 3.2 Types of phrasal chunks for pruning [64].

Table 3.3 Features for event classification.

Table 3.4 Event extraction result. The weighted average is calculated according to the proportion of instances in each class. For example, weighted average precision is calculated as the proportion of the negative class × the precision of the negative class + the proportion of the positive class × the precision of the positive class.

Table 3.5 The χ² statistic for different features.

Table 3.6 Event repository summary.

Table 3.7 Sample business events.

Table 3.8 Event topics from 208 manufacturing contracts. The top vocabularies are automatically extracted with LDA. Note that the vocabularies are stemmed. The class labels are manually assigned by a human annotator.

Table 3.9 Topic evaluation.

Table 3.10 Cluster labels assigned in different domains indicating common and domain-specific labels.

Table 3.11 Temporal information.

Table 3.12 Training set sample instances.

Table 3.13 Result using KNN.

Table 3.14 Result using Naïve Bayes: U denotes unigram, SR denotes stop words removal, and ST denotes stemming.

Table 3.15 Results from the perceptron classifier.

Table 3.16 Result using logistic regression: U denotes unigram, SR denotes stop words removal, and ST denotes stemming.

Table 4.1 Classification features.

Table 4.2 Numbers of instances in norm types in the gold standard.

Table 4.3 Classification result. PC: practical commitment; DC: dialectical commitment; AU: authorization; PR: prohibition; SA: sanction; NN: not a norm; P: precision; R: recall; F: F-measure.

Table 4.4 Sample normative relationships extracted from contracts in the manufacturing domain.

Table 4.5 Norm repository summary.

Table 4.6 Top action verbs.

Table A.1 Event topics from licensing contracts.

Table A.2 Event topics from lease agreements.

Table B.1 Events table structure.


LIST OF FIGURES

Figure 2.1 Service exception extraction system architecture and data flow.

Figure 2.2 Distribution of lengths (numbers of words) of the matched sentences across all 2,647 contracts studied.

Figure 2.3 Distribution of exception patterns for contracts of different lengths (based on all 2,647 contracts studied). The contract length is measured in hundreds of sentences.

Figure 2.4 A screenshot of our system used as a browser addon.

Figure 3.1 Overall processing pipeline.

Figure 3.2 Grammar tree with pruning. Modifier “who shall solely be CLIENT’s agent” is labeled as an “SBAR”; “shall” is labeled as “MD” modal word; “select and pay the freight forwarder” is labeled as “VP” verb phrase; “CLIENT” is labeled as “NP” noun phrase. An event candidate is usually composed of a subject, a signal word, and a verb phrase.

Figure 3.3 Feature comparison. Denotations: CCS: Counter Clause Signal; MV: Modal Verb; CS: Clause Signal; SCN: Subject Contains NE. Features and groups of features yield different performance outcomes. A combination of all features yields the best predictiveness. SCN is more predictive than CCS, MV, and CS features.

Figure 3.4 Sample document topic distribution. The Y axis shows the topics. The X axis shows the proportion of each of ten topics in each of five documents. Topics T2, T5, T6, and T7 do not occur in the five documents.

Figure 3.5 Annotation shown using the GATE framework.

Figure 4.1 Processing pipeline.

Figure 4.2 Grammar tree. Modifier “who shall solely be CLIENT’s agent” is labeled as a sentential clause or sbar; “shall” is labeled as a modal word or md; “select and pay the freight forwarder” is labeled as a verb phrase or vp;


Chapter 1

Introduction

Modern businesses are becoming more and more complex. They are conducted at an ever-increasing scale and often involve multiple partnering entities. In particular, business sectors such as manufacturing, supply, purchase, outsourcing, licensing, and sales need to interact with entities both within and without their corporate environment. Such cross-organizational transactions are often, if not always, regulated by business contracts.

1.1 Background

A contract, sometimes described as the concurrence of wills or consensus ad idem (meeting of the minds),1 provides a legal agreement between two or more participants specifying what they may expect from each other, including what business actions they will take or not take. Businesses rely on contracts as an exchange of promises and as a protection against the breach of such promises with legal remedies. The successful fulfillment of contracts can ensure continuing enterprise operations, whereas contractual violation may cause damage to organizations or individuals with potential financial and legal ramifications.

The breach of a contract is a type of civil wrong and may be subject to legal proceedings. Allegations or settlements of business torts and organizational misbehavior related to interference with contractual relations often make headline news.

March 4, 2012 (Telegram UK) “Worcester County Sheriff’s Office (UK) has made a $161,000 payment to members of its correction officer union to settle a claim of a contract violation.”2

April 27, 2012 (Bloomberg News) “Unipec Asia Company Ltd., a subsidiary of China Petroleum & Chemical Corp., sued BNP Paribas SA (BNP) in Hong Kong to recover $46.6 million from the French bank. Unipec Asia’s suit alleges that BNP Paribas breached a stand-by letter of credit, also [. . . ] pursue damages in relation to the breach of contract [. . . ].”3

1 http://en.wikipedia.org/wiki/Contract

2

May 8, 2012 (Chicago Tribune) “UC Davis sues U.S. Bank for breach of contract after Occupy protests.”4

May 19, 2012 (Detroit News) “A Detroit Metro Airport employee and two others filed a lawsuit Friday that claims the contract of short-time airport CEO Turkia Mullin and a potential $700,000 severance were improperly crafted and approved.”5

The breach of a contract may involve failure to deliver products and services, violations of terms and conditions, misinterpretations of contractual promises, or fraudulent behaviors by the contracting parties. Litigation may cause irreversible damage to a business and lead to huge financial loss. On the one hand, the punitive aspect of contracts can discourage undesirable behaviors of the contractual parties. On the other hand, the enactment of contractual relations provides legal protection of the parties' rights, supports the successful completion of business transactions, and builds a platform for resolving disputes and conflicts when they arise.

“In a perfect world, there would be no need for a contract, and all deals would be sealed with a handshake. However, contracts are an important part of managing buyer-supplier relationships, as they explicitly define the roles and responsibilities of both parties, as well as how conflicts will be resolved if they occur (which they almost always do)” [73]. This statement from the book Purchasing and Supply Chain Management summarizes the roles of a contract in modern business transactions.

Contracts can take different forms: the length of a contract can be as short as a couple of pages or as long as thousands of pages; the domain of a contract can include any business endeavor. However, a contract almost always includes the participant names, definitions of terms used, the relevant business services, and remedies for breach. An example of a contract is the capital reimbursement agreement between Patheon Inc. and Santarus Inc.6

3 http://www.bloomberg.com/news/2012-04-28/sinopec-unit-sues-bnp-for-47-million-for-contract-breach.html

4 http://www.chicagotribune.com/business/sns-mct-uc-davis-sues-u.s.-bank-for-breach-of-contract-20120508,0,3385589.story

5 http://www.detroitnews.com/article/20120519/METRO/205190346/Mullin-airport-contract-challenged-in-court

6


1.2 Information Extraction from Contracts

With increased business activities and outsourcing, the number of contracts has exploded. InfoSys reports7 that 60%–80% of business transactions are governed by contracts and that an average Fortune 2000 company manages 20,000 to 40,000 active contracts at any given time. Valuable enterprise knowledge such as that regarding business risks, process dependencies, and customer relations is hidden in unstructured contract text.

Knowledge discovery involves finding patterns or novel information in large amounts of data. We can apply knowledge discovery to sift through a contract repository, discover the information of interest, and uncover hidden insights. Applying a hybrid of methods such as linguistic patterns, natural language processing, text classification, and topic modeling, we develop a tool called Contract Miner to extract and organize the information of interest from unstructured contract text with the objective of turning buried insights into actionable knowledge.

The information we aim to extract includes service exceptions, business events, temporal constraints, and normative relationships. Service exceptions are studied in the context of service engagements, viewing contracts as the artifact for binding business services. Exceptions often pose risks for businesses. The most frequent risks in contractual relations are temporal violations such as late payment, product delivery delay, and financial default. Serving as the backbone of business transactions, business events encompass a wide range of domains such as legal actions, communications, and product services. We further automatically categorize the terms occurring in business events into topic clusters, each described by a vocabulary and centered on a theme. Business events are usually regulated by normative relationships that describe which actions a party is committed to, authorized to, or prohibited from taking, and what it warrants as true. We consider the following types of normative relationships: dialectical commitment, practical commitment, power, authorization, prohibition, and sanction.

In sum, this dissertation addresses the extraction of the essential information elements— service exceptions, business events, temporal constraints, and normative relationships—from a large contract text repository. These elements form an organic chain in order of increasing complexity and elevating meanings. We approach

service exceptions as noun phrases using methods of natural language processing and linguistic patterns;

business events and temporal constraints as structures represented with event subject, event verb phrase, and prepositional phrase using a hybrid of natural language parsing, linguistic patterns, text classification, and topic models;

7

normative relationships as arising in the sentential context, using the same hybrid of methods together with heuristics.

Extracting the elements earlier in the chain forms the basis for, or motivates, extracting the elements later in the chain.

1.3 Contributions

The major contributions of this dissertation include (1) formulating the problem of extracting service exceptions, business events, temporal constraints, and normative relationships from contract text and (2) applying a hybrid of linguistic patterns, natural language parsing, topic modeling, and text classification to solve these information extraction tasks. We evaluate the approaches on real-life contract data and obtain promising results. Our work represents one of the first comprehensive text mining efforts on contracts.

1.4 Organization


Chapter 2

Service Exceptions Extraction

2.1 Introduction

In this chapter, we study service exceptions in the context of modeling and analyzing (business) service engagements. Service engagements inherently involve the interaction of autonomous parties and are naturally specified at a high level in terms of contracts. Contracts can help formalize business processes through which service engagements are realized. They describe the expectations that each participant may have of the others and offer the potential of legal recourse should those expectations not be met. Thus service engagements are almost always specified via a contract, although the contracts involved demonstrate a wide range of complexity. Because of the importance of service engagements and contracts to the world economy, they are increasingly being studied in computer science [80, 66]. Existing approaches are top-down in that they each propose a model for services and contracts and establish its technical properties. They use such properties to determine how to manage the life cycle of a contract. In contrast, we adopt a bottom-up approach wherein we examine existing real-life contracts to understand what knowledge and structure we can induce from them. In this sense, our approach is complementary to the above types of approaches. Analysis such as the one we perform can yield part of the knowledge needed by the more traditional approaches.

Modern enterprises manage a large number of active contracts for business operations. Such contracts are usually expressed in unstructured text, but contain rich knowledge about business processes, customer relations, legal risks, and financial complications. Mining contracts can yield actionable knowledge that can help decision makers better regulate business operations, adapt to ever-changing customer demands, maximize financial performance, and mitigate risks.


types of information which is described in the following chapters. We find some important differences between the contracts domain and the traditional domains of text analysis. In particular, contracts appear to both involve longer and more complex sentences and follow a more routinized structure than in normal language. The routinized structure facilitates analysis.

Our special focus in this chapter is on service exceptions or contingency conditions that are listed within a contract. Contract text is a type of formal legal text, in which the clauses specify the rights, obligations, permissions, and prohibitions of the participating parties [108]. The clauses usually also specify exceptions and are often written in a routinized way.

This chapter understands the field of service computing in the broad sense. In particular, we are concerned primarily with business services (indicated by value transfer and coproduction [108]). Business services contrast with technical services, such as web or grid services, for which a suitable modeling involves the exchange of information such as by a client invoking an operation and the service providing a response. We acknowledge that the term contract sometimes refers to software descriptions, roughly the functionality or type signature of a service, such as might be specified using the Web Services Description Language (WSDL). Other standards address describing the nonfunctional behaviors of a Web Service as well. However, our focus is on a contract as a legal binding between service provider and service consumer.

Contract Miner extracts domain-specific contracting-relevant knowledge from a large repository of service contracts. Such knowledge can help in building service vocabularies that support the development of a taxonomy of business terms. Further, the knowledge extracted facilitates modeling and analyzing service engagements in different domains. Contract Miner includes a lightweight online tool for automatically annotating important aspects of a service contract, so it can be readily used as an annotator for service contracts.

Organization

The rest of the chapter is organized as follows. Section 2.2 introduces the technical problem of mining service contracts. Section 2.3 introduces the service exception extraction system architecture. Section 2.4 evaluates the extraction performance. Section 2.5 explains previous work on contracts and text analysis. Section 2.6 summarizes our conclusions and discusses future work.

2.2 Problem: Mining Service Contracts


enterprise that we can potentially mine.

Our particular interest is in mining contracts to discover actionable knowledge regarding the service engagements of interest. Specifically, because exceptions, that is, contingency conditions, both pose a technical challenge to the development of robust business processes and can create a substantial business risk for an enterprise, we focus on the exceptions that can be identified from a service contract. The automatic discovery of exceptions by mining contracts can help a business better meet customer demand, conform to regulations, avoid unnecessary financial loss, and hedge against legal risks.

A contract may potentially list one or more exception conditions along with each of its clauses. A contract usually does not list the risks because risks are internal to each party. However, a contract would list the remedies, if any, offered in the case of an exception. Such remedies may represent risks to the remedying party and may indicate the magnitude of the risk perceived by the remedied party.

For example, an IT services contract may say that data access may be lost due to a network outage and may specify a refund of $100 in case of service outage. In this case, the exception is the data access loss due to network outage, the risk to the provider is the $100 it would have to pay, and the risk to the consumer is mitigated by the $100 it would receive. Each party would face additional risks not included within the contract.

The most insidious exceptions are those that a contract fails to anticipate. Our system can help a designer readily determine what exceptions are incorporated in a contract and what exceptions that occur in other contracts in the same domain have been omitted from a specific contract. Knowing the missing exceptions would be a reason for a participant to reject a contract or to negotiate to modify the terms of a contract before accepting it.

Definition 1 Exception: A potential circumstance that poses an adverse condition for a business or that does not conform to a rule or generalization.

To better appreciate the importance of exceptions, consider the following manufacturing service agreement between FASL LLC and Fujitsu Limited:1

In case of any defect in Serviced Products, Fujitsu shall, at Fujitsu’s option, (a) rework the applicable Serviced Products, or (b) issue a credit to FASL.

Fujitsu shall ship all Serviced Products in accordance with the delivery schedule contained in the applicable Purchase Order, and shall promptly notify and consult with FASL in case of any expected delays in shipping Serviced Products.


If FASL fails to make any payment on or before the required payment date, FASL shall be liable for interest on such payment at a rate equal to ten percent (10%) per annum or the maximum amount allowed by Applicable Law, whichever is less.

This Agreement shall be deemed to have been drafted by both Parties and, in the event of a dispute, no Party hereto shall be entitled to claim that any provision should be construed against any other Party by reason of the fact that it was drafted by one particular Party.

In the event that FASL intends to stop delivering Purchase Orders for Services with respect to any Products, it shall deliver to Fujitsu four (4) months’ prior written notice thereof, provided that (subject to the provisions of Section 5.2 below) no such notice shall be delivered prior to December 1, 2003.

Ignoring the underlined phrases for the time being, let us examine the bold highlighted text in each of the above clauses. Each such snippet describes an event that indicates an exception faced by the parties to the contract.

The technical problem we address is how to mine contract text to identify the exceptions it refers to. But mining contracts offers challenges not present in some more commonly studied text forms. Contract text tends to involve long sentences with complex nested structure including legal jargon and complicated noun phrases.

Accordingly, we propose a simple but effective unsupervised pattern-based approach for identifying noun phrases indicating exceptions from contract text. Despite their simplicity, pattern-based approaches, for example, [45], yield better performance for extracting specific semantic relations than the more general key phrase extraction methods, for example, [33, 111]. We evaluate our approach using the well-known Onecle repository of real-life contracts.2

The benefits of our approach include: (1) discovering service exception vocabularies for different contract domains; (2) highlighting the exceptions in a business contract; and (3) helping develop a taxonomy of exceptions that commonly arise in business operations in each domain.

2.3 Approach

Our approach consists of the following steps. We describe the three main steps in the remainder of this section.

2


Step 0: Preprocess contract text by stripping HTML tags and other noise, and segmenting the text into a collection of sentences. We use an off-the-shelf HTML-to-text converter [74] to strip off all the hypertext tags. Next we segment the clean text into a collection of sentences using a sentence delimiter [109].

Step 1: Extract sentences referring to exceptions by applying linguistic patterns.

Step 2: Construct noun phrases from the above sentences using an existing natural language parser.

Step 3: Identify noun phrases corresponding to exceptions.
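The preprocessing in Step 0 can be sketched as follows. This is a minimal illustration, not the system's implementation: it assumes Python's standard-library HTML parser and a naive punctuation rule as stand-ins for the HTML-to-text converter [74] and the sentence delimiter [109], and the function names strip_html and split_sentences are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML page (hypothetical stand-in for [74])."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind by removed markup.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def split_sentences(text):
    """Naive sentence delimiter (hypothetical stand-in for [109]).

    Splits after '.', '?', or '!' when followed by whitespace and an
    uppercase letter.
    """
    pieces = re.split(r"(?<=[.?!])\s+(?=[A-Z])", text)
    return [p.strip() for p in pieces if p.strip()]
```

A rule this simple mis-splits after abbreviations such as "Inc." when a capitalized word follows, which is one reason a dedicated sentence delimiter is preferable for contract text.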

Currently, our system takes as input contracts retrieved from the Onecle repository. The input format could be easily changed by suitably modifying the preprocessing step. Figure 2.1 shows a simplified system architecture for our system.

Figure 2.1: Service exception extraction system architecture and data flow.

2.3.1 Step 1: Extract Sentences that Refer to Exceptions

Revisiting the FASL-Fujitsu contract snippets shown in Section 2.2, we see that each snippet includes an underlined phrase. Each such phrase describes a syntactic pattern, that is, a pattern phrase, and is textually followed or preceded by a phrase that potentially identifies an exception. Exploiting the routinized nature of contract text, we introduce pattern phrases as a basis for extracting suitable sentences.


We have crafted a small set of pattern phrases in the style of Hearst [45]. Our patterns are geared toward extracting exceptions from English contracts. A typical pattern is of the form “in (the) case of NP”, wherein NP is a noun phrase. Here NP indicates the exception we are trying to extract. A clause specifying the corresponding remedy may follow the pattern, but we focus on exceptions in this chapter. Other patterns are formed in the same vein. Two additional ones that select noun phrases involve keywords such as in (the) event of.

Our additional patterns include if, in (the) event that, and in (the) case that, which select (sentential) clauses. Although noun phrases are simpler in structure than clauses, they often express exceptions independent of a specific contract context. The noun phrases can also form the foundation for a taxonomy of exceptions that arise in different business domains. Thus we focus exclusively on noun phrases in this chapter. Our technical problem is to identify noun phrases that describe exceptions.

Definition 3 Noun Phrase: A phrase whose head is a noun or a pronoun, optionally accompanied by a list of modifiers.

Definition 4 Exception Phrase: A noun phrase that describes an exception.

We extract the sentences in a contract that match the specified pattern phrases. In particular, in this step, we use the above pattern phrases merely as lexical filters. This is a fast and easy step that substantially reduces the number of sentences that we have to deal with in the subsequent, far more complex, steps.

Definition 5 Pattern Sentence: A sentence that contains a pattern phrase.

Extracting pattern sentences is straightforward. Given a collection of sentences, we check each sentence to determine if it contains any of the identified pattern phrases. If yes, we extract the sentence; otherwise, not. When this process finishes, we have obtained a new collection of sentences, each of which contains a candidate exception phrase.
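The lexical filter described above can be sketched in a few lines. This is a minimal illustration and not the Contract Miner implementation: the regular expressions, the function name extract_pattern_sentences, and the second and third sample sentences are assumptions made for the example.

```python
import re

# Noun-phrase-selecting pattern phrases named in this section; "(the)" is optional.
# Encoding them as regular expressions is an assumption of this sketch.
PATTERN_PHRASES = [
    r"\bin (?:the )?case of\b",
    r"\bin (?:the )?event of\b",
]
PATTERN_RE = re.compile("|".join(PATTERN_PHRASES), re.IGNORECASE)


def extract_pattern_sentences(sentences):
    """Keep only the sentences that contain at least one pattern phrase."""
    return [s for s in sentences if PATTERN_RE.search(s)]


# Illustrative input; only the first sentence is taken from the chapter.
sentences = [
    "In the event of any such delay or failure, the party affected shall "
    "promptly notify the other party in writing.",
    "This Agreement is governed by the laws of the State of New York.",
    "In case of default in payment, the supplier may suspend deliveries.",
]
pattern_sentences = extract_pattern_sentences(sentences)
```

Because the filter is purely lexical, it needs no parsing and only one pass over the sentences, which is why it is cheap enough to run before the heavier steps.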

2.3.2 Step 2: Construct Noun Phrases

We parse each sentence that matches our lexical patterns using Lingpipe [2], a natural language processing toolkit. Lingpipe provides high-performance part-of-speech (POS) tagging, which involves assigning syntactic tags such as Noun (nn), Adjective (jj), Adverb (rb), and so on, to each lexeme. For example, the following input sentence

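In the event of any such delay or failure, the party affected shall promptly notify the other party in writing and use all commercially reasonable efforts to overcome the event or circumstance causing the delay or failure as soon as practicable.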

yields

In/in the/at event/nn of/in any/dti such/jj delay/nn or/cc failure/nn ,/, the/at party/nn affected/vbn shall/md promptly/rb notify/vb the/at other/ap party/nn in/in writing/vbg and/cc use/vb all/abn commercially/rb reasonable/jj efforts/nns to/to overcome/vb the/at event/nn or/cc circumstance/nn causing/vbg the/at delay/nn or/cc failure/nn as/cs soon/rb as/ql practicable/jj ./.

We use the Lingpipe noun phrase chunker to aggregate relevant words to form noun phrases based on a grammar of English. A noun phrase can begin with a part of speech such as a determiner, pronoun, or adjective, and can have other modifiers such as present participles and past participles. For example, the phrase any expected delay is a noun phrase containing a determiner or quantifier, a past participle, and a noun. Lingpipe includes a set of rules for chunking noun phrases. We introduced some noun phrase rules to handle the longer meaningful phrases that arise in contracts. For example, our rules treat the verb causing as helping continue a noun phrase instead of terminating it. Thus, the text an accident causing a delay is parsed as one noun phrase even though a prefix of it, namely, an accident, is also a noun phrase.
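To make the chunking idea concrete, the following sketch groups POS-tagged tokens into noun phrases and lets the verb causing continue a phrase. It is not Lingpipe's grammar or API: the tag sets, the rule "determiners, then modifiers, then nouns," and all function names are simplifying assumptions chosen to cover the examples in the text.

```python
# Simplified tag classes (Brown-style tags as in the tagged example above).
# These sets are assumptions of this sketch, not Lingpipe's rule set.
DETERMINERS = {"at", "dti", "abn", "ap", "dt", "pp$"}
MODIFIERS = {"jj", "vbn", "vbg"}
NOUNS = {"nn", "nns", "np", "nps"}
CONTINUERS = {"causing"}  # verbs treated as continuing a noun phrase


def parse_tagged(text):
    """Split 'word/tag word/tag ...' into (word, tag) pairs."""
    return [tuple(tok.rsplit("/", 1)) for tok in text.split()]


def match_simple_np(tokens, i):
    """Match determiners, then modifiers, then one or more nouns from index i.

    Returns the index just past the phrase, or i if no phrase starts at i.
    """
    j = i
    while j < len(tokens) and tokens[j][1] in DETERMINERS:
        j += 1
    while j < len(tokens) and tokens[j][1] in MODIFIERS:
        j += 1
    k = j
    while k < len(tokens) and tokens[k][1] in NOUNS:
        k += 1
    return k if k > j else i


def chunk_noun_phrases(tokens):
    """Return noun phrases as strings, extending across 'causing'."""
    phrases, i = [], 0
    while i < len(tokens):
        end = match_simple_np(tokens, i)
        if end == i:
            i += 1
            continue
        # Extension rule: "causing" continues the phrase if a noun phrase follows.
        while end < len(tokens) and tokens[end][0].lower() in CONTINUERS:
            nxt = match_simple_np(tokens, end + 1)
            if nxt == end + 1:
                break
            end = nxt
        phrases.append(" ".join(word for word, _ in tokens[i:end]))
        i = end
    return phrases


long_np = chunk_noun_phrases(parse_tagged("an/at accident/nn causing/vbg a/at delay/nn"))
short_np = chunk_noun_phrases(parse_tagged("any/dti expected/vbn delay/nn"))
```

With the extension rule disabled, the first input would split into an accident and a delay; with it, the longer phrase is kept whole, which mirrors the behavior described above.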

2.3.3 Step 3: Identify Exception Noun Phrases

We identify a noun phrase as relevant based on whether it relates to any of the patterns we used to extract the sentences above. Specifically, for the above patterns, determining relevance involves checking whether a noun phrase immediately follows a pattern phrase. If so, we include it in the results; otherwise, we ignore it.
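The relevance check can be illustrated as a comparison of character offsets: a noun phrase qualifies if it starts exactly where a pattern phrase ends. The sketch below assumes the chunker reports noun phrases as (start, end) character spans; that representation, the helper span, and the list of chunks are hypothetical.

```python
import re

# Pattern phrase plus trailing whitespace, so a match ends where the next word begins.
PATTERN_RE = re.compile(r"\bin (?:the )?(?:case|event) of\s+", re.IGNORECASE)


def span(sentence, phrase):
    """Hypothetical helper: character span of the first occurrence of phrase."""
    start = sentence.index(phrase)
    return (start, start + len(phrase))


def exception_candidates(sentence, noun_phrase_spans):
    """Return the noun phrases that immediately follow a pattern phrase."""
    pattern_ends = {m.end() for m in PATTERN_RE.finditer(sentence)}
    return [sentence[s:e] for (s, e) in noun_phrase_spans if s in pattern_ends]


sentence = ("In the event of any such delay or failure, the party affected "
            "shall promptly notify the other party in writing.")
# Assumed chunker output for this sentence.
chunks = ["any such delay", "failure", "the party", "the other party"]
candidates = exception_candidates(sentence, [span(sentence, c) for c in chunks])
```

Here only any such delay is selected; failure is left for the conjunction rule that follows.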

An additional intuition that we capture is that conjunction words (and, but, or, either or, neither nor, and so on) usually indicate a “coordination” or semantic similarity of the phrases they connect. For example, in “The tax proposal was simplistic and well-received” we know that well-received is a positive word, and it is connected to simplistic by and, so we infer that simplistic in this context is also a positive word. The conjunction rule is widely used in predicting semantic orientation of adjectives [44] and building opinion lexicons [27].

Definition 6 Conjunction Rule: If an exception phrase is connected with a noun phrase by a conjunction, then the noun phrase is likely to be an exception phrase.

The following is an example of a conjunction occurring in a real contract.


Because our previous step identifies litigation as an exception phrase in the contract (it follows the pattern in the event of), we can apply the conjunction rule to infer that arbitration is also an exception phrase. In our approach, we use only the conjunction word or for expansion, because it occurs frequently in the sentences that match our patterns and has limited ambiguity.

We apply the conjunction rule in the obvious manner: if two noun phrases are conjoined and one is included as an exception phrase, then so is the other. Algorithm 1 details this method.

Algorithm 1 Applying the conjunction rule.

Require: Noun phrase set P = {p1, p2, ..., pk} of sentence s, conjunction word list W = {w1, w2, ..., wn}, and exception phrase list L = {l1, l2, ..., lm}

1: for all pi in P such that pi is not in L do
2:     for all wj in W, lr in L do
3:         if pi is connected to lr by wj then
4:             Add pi to L
5:         end if
6:     end for
7: end for
8: return L
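Algorithm 1 can be sketched in Python as follows. The sketch assumes noun phrases are given as (start, end) character spans and interprets "connected" as "separated only by the conjunction word"; both choices, the helper names, and the sample sentence (written for this example around the litigation and arbitration phrases mentioned above) are assumptions.

```python
def span(sentence, phrase):
    """Hypothetical helper: character span of the first occurrence of phrase."""
    start = sentence.index(phrase)
    return (start, start + len(phrase))


def connected(sentence, a, b, word):
    """True if spans a and b are separated only by the conjunction word."""
    first, second = sorted([a, b])
    return sentence[first[1]:second[0]].strip().lower() == word


def apply_conjunction_rule(sentence, noun_phrases, exceptions, conjunctions=("or",)):
    """Add every noun phrase that is conjoined with a known exception phrase."""
    exceptions = list(exceptions)  # copy so the caller's list is not mutated
    for p in noun_phrases:
        if p in exceptions:
            continue
        if any(connected(sentence, p, l, w) for w in conjunctions for l in exceptions):
            exceptions.append(p)
    return exceptions


# Illustrative sentence, not quoted from a contract in the corpus.
sentence = "In the event of litigation or arbitration, each party shall bear its own costs."
noun_phrases = [span(sentence, p) for p in
                ["litigation", "arbitration", "each party", "its own costs"]]
seed = [span(sentence, "litigation")]
expanded = [sentence[s:e] for s, e in apply_conjunction_rule(sentence, noun_phrases, seed)]
```

As in the pseudocode, the exception list grows during the single pass, so a phrase added early can license a later phrase that is conjoined with it.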

2.4 Evaluation

We now systematically evaluate our approach by highlighting important properties of contracts, the prevalence of exceptions in them, and the quality of our results.

2.4.1 Statistics about the Corpus

We consider a corpus of 2,647 contracts from Onecle for some evaluations. As Table 2.1 shows, our pattern sentences are prevalent in contracts across seven major domains of interest to services.


Table 2.1: Distribution of our patterns across contracts of different domains over our entire corpus.

Contract Type    Contracts    Matches    Average (matches per contract)
Licensing            1,364      3,838    2.8
Consulting             501        509    1.0
Outsourcing              9         21    2.3
Supply                 207        733    3.5
Manufacturing          206        577    2.8
Purchase               142        591    4.1
Stock Options          218      1,153    5.3
Overall              2,647      7,422    2.8

To evaluate the effectiveness of our approach in extracting exception phrases, we manually annotate the following five (arbitrarily selected) manufacturing contracts from the Onecle repository, namely, those between (1) Minnesota Mining and Manufacturing Company (3M) and Sepracor Inc.,1 (2) Novoste Corporation and BEBIG Isotopen,2 (3) DrugAbuse Sciences, Inc. and Eon Labs Manufacturing, Inc.,3 (4) FASL LLC and Fujitsu Limited,4 and (5) Lucent Technologies Inc. and CD Radio, Inc.5

In the above five manufacturing contract documents, our patterns yield a total of 25 matching sentences. Table 2.2 shows some statistics for the extracted sentences. As we can see, sentences that contain pattern phrases are usually quite long. Recall that sentences are delimited by periods, so an entire paragraph that uses several semicolons (common in real-life contracts) can sometimes appear as one long sentence.

Table 2.2: Statistics of sentence length (number of words) over our entire corpus.

Corpus                Size    Min    1st Quartile    Median    Mean    3rd Quartile    Max
Selected contracts       5     21            27.5        43    51.7            68.5    142
All manufacturing      206      9            31.0        50    67.0            79.0    474
Entire corpus        2,647      6            37.5        58    77.8            93.0    3,328

1 http://contracts.onecle.com/sepracor/3m.mfg.2001.12.20.shtml
2 http://contracts.onecle.com/novoste/bebig.mfg.2001.06.20.shtml
3 http://contracts.onecle.com/drugabuse/eon.mfg.2000.07.20.shtml
4 http://contracts.onecle.com/spansion/fujitsu-mfg-2003-06-30.shtml


[Figure 2.2 plot: histogram with x-axis Sentence Length (0 to 300) and y-axis Frequency (0 to 120).]

Figure 2.2: Distribution of lengths (numbers of words) of the matched sentences across all 2,647 contracts studied.

2.4.2 Quality of Identifying Exceptions

Unfortunately, no gold standard for exception extraction exists currently. Thus for our evaluation we need to annotate contracts manually. Accordingly, we manually annotated each service contract, marking the exception phrases as benchmark data. Manual annotation proved to be a challenging task for two reasons. First, because there is no existing standard, there is no ready reference for annotation. Second, the concept of exception itself is inherently ambiguous and comes in different expression forms.

As stated before, we restrict the scope of our system and annotation to exceptions expressed as noun phrases. We compare the exception phrases identified by our approach with manually extracted phrases to compute the true and false positives and negatives (abbreviated TP, FP, TN, and FN, below). Using these, we can calculate the precision, recall, and F-measure, the most widely used metrics of the quality of a retrieval method. These metrics are defined below.

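\[ \text{precision} = \frac{TP}{TP + FP} \]

\[ \text{recall} = \frac{TP}{TP + FN} \]

\[ \text{F-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \]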


[Figure 2.3 plot: x-axis Contract Length (0 to 30), y-axis Number of Matches (0 to 20).]

Figure 2.3: Distribution of exception patterns for contracts of different lengths (based on all 2,647 contracts studied). The contract length is measured in hundreds of sentences.

false positives and three false negatives; without the conjunction rule, it extracted 24 phrases with two false positives and seven false negatives. As Table 2.3 shows, applying the conjunction rule reduces precision slightly but increases both recall and the F-measure.

Table 2.3: Precision, recall, and F-measure for selected manufacturing contracts.

Pattern                  Conjunction Expansion?    Precision    Recall    F-Measure
All                      No Expansion              0.92         0.76      0.83
All                      With Expansion            0.90         0.90      0.90
in (the) event of ...    No Expansion              0.92         –         –
in (the) event of ...    With Expansion            0.86         –         –
in (the) case of ...     No Expansion              0.91         –         –
in (the) case of ...     With Expansion            0.92         –         –
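As a worked check of the metric definitions, the sketch below recomputes the "All, No Expansion" row from the counts reported in the text for the run without the conjunction rule (24 extracted phrases, two false positives, seven false negatives). The function name is an assumption of the example.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and F-measure from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts for the run without the conjunction rule, as reported in the text.
extracted, fp, fn = 24, 2, 7
tp = extracted - fp  # 22 correctly extracted phrases

p, r, f = precision_recall_f(tp, fp, fn)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.92 0.76 0.83
```

The counts imply 22 + 7 = 29 annotated exception phrases in the five contracts, which is the denominator of recall.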


results as well. As we can see, there is no substantial quality difference for the patterns we used. To convey a feel for the kinds of exceptions our approach identifies, Table 2.4 and Table 2.5 show the exception phrases

Table 2.4: Sample exception phrases extracted from selected manufacturing contracts using pattern “in (the) event of.”

In (the) event of . . . False Positive? Expanded?

conflict

inconsistency between any of the terms and conditions of this Agreement

Yes

an extraordinary increase in price due to such factors

its merger

an increase in material costs refunds

an inspection by the FDA

any other Regulatory Authority Yes Yes

any inconsistency between the terms and conditions of this Agreement and the terms and conditions of a Purchase Order

any loss

irreparable damage to Unfinished Products Yes

a dispute any conflict

a conflict between the applicable Business Terms and these terms and conditions a replacement

any such circumstance Yes

Now we analyze falsely extracted phrases. In the following sentence, an inspection by the FDA as well as any other Regulatory Authority are noun phrases that occur surrounding the pattern and a conjunction word. The conjunction or connects the FDA with any other Regulatory Authority, but the algorithm wrongly regards the conjunction as connecting an inspection by the FDA with any other Regulatory Authority. In other words, it falsely identifies any other Regulatory Authority as an exception phrase.


Table 2.5: Sample exception phrases extracted from selected manufacturing contracts using pattern “in (the) case of.”

In (the) case of . . . False Positive? Expanded?

default in payment

any filing with a Governmental Authority

other transfer of substantially its entire business in aerosol Yes 3M on the sale

Force Majeure

Product having a latent defect any defect in Serviced Products

any expected delays in shipping Serviced Products material breach of this agreement caused by BEBIG settlement

termination under Section 6.2 hereof

litigation decisions affecting CD Radio Yes

(ii) Yes

notify DAS within twenty (20) business days in writing of the details and results of any such investigation.

Other phrases such as any such circumstance are not exceptions themselves, but refer to other clauses in the contract. To identify such conditions, techniques such as coreference resolution are needed, which we defer to future enhancements of our approach.

When our approach fails to identify an exception phrase, it mostly does so because (1) the HTML-to-text converter misses some paragraphs because of excessive noise in the input, (2) complicated sentence structure hides some of the target phrases, and (3) some unexpected patterns arise in the input.

In the following sentences, “an accusation of infringement pertaining to Licensed Product” and “undue delay” are exception noun phrases. However, our approach misses these, as it does not include the corresponding patterns, which are shown underlined below. First, the pattern “in the event (that)” is often followed by an exception clause, not by an exception noun phrase, which is the kind of grammatical construct we seek here. Second, “without” generally introduces noise because of its ambiguity.

Each Party will notify the other Party promptly in the event a Party receives an accusation of infringement pertaining to Licensed Product.


perform such task or tasks.

As our approach scales well to large datasets, we can obtain good coverage of many aspects of the exceptions when the dataset is large enough.

2.4.3 Frequent Exception Phrases

We extract the commonly occurring exception phrases for a domain of interest to build a vocabulary of exceptions that arise in each domain. Such a vocabulary could be used to guide a contract reviewer in determining what a specific contract may be missing. It would also form the basis for building a taxonomy of service exceptions for that domain. Table 2.6 reports some of the top phrases from our manufacturing corpus.

Table 2.6: Frequent exception phrases extracted from all 206 manufacturing contracts in our corpus.

Head Noun        Frequency    Example Phrase
force majeure    51           force majeure
default          28           a default in the payment
merger           11           a merger
delivery         6            late delivery of consigned inventory
delay            5            an inexcusable delay of the delivery of such spare engine
cancellation     4            cancellation of a purchase order
defect           4            any defect in serviced products

2.4.4 Performance


2.4.5 Additional Validation: Cloud Services Contracts

To further demonstrate the efficacy of our approach on cloud service contracts, we arbitrarily selected five terms of use for cloud services1,2,3,4,5 from the Internet.

Just as we evaluated precision and recall in the manufacturing domain in the previous section, we study the capability of our approach to capture cloud-service-specific exceptions in this section. We rate each extracted exception as either related to cloud services or general business. Among all the extracted true positives in the five cloud service contracts, 4/18 = 22% are related to cloud service exceptions.

Table 2.7: Precision, recall, and F-measure for selected cloud services contracts.

Pattern                  Conjunction Expansion?    Precision    Recall    F-Measure
All                      No Expansion              0.83         0.60      0.70
All                      With Expansion            0.82         0.72      0.77
in (the) event of ...    No Expansion              1.00         –         –
in (the) event of ...    With Expansion            0.93         –         –
in (the) case of ...     No Expansion              0.50         –         –
in (the) case of ...     With Expansion            0.63         –         –

Our approach, when configured with the conjunction rule, extracted 22 phrases, of which four are false positives; thus it has a precision of 82% with a recall of 72%. These statistics are shown in Table 2.7. The false positives are mostly generic phrases that cannot be considered exceptions without qualification. In the sentence below, “a visitor” is extracted, but it is not an exception itself. Instead, the qualified version of this phrase (the entire text shown in bold) is an exception. This kind of error is caused by a limitation of our noun phrase chunker, which fails to recognize the given noun phrase structure.

In the case of a visitor who may infringe or repeatedly infringes the copyrights or other intellectual property rights of Operator or others, Operator may, in its discretion, terminate or deny access to and use of the

1 https://aws.amazon.com/serviceterms/
2 http://www.cloud.bg/en/sla
3 http://www.opsource.net/OpSource-Cloud-Terms
4 http://www.rackspacecloud.com/legal


Table 2.8: Sample extracted exception phrases from selected cloud services contracts.

In (the) case/event of . . . Cloud service specific?

a conflict between the terms of these Service Terms and the terms of your agreement with us governing your use of our Services

any inconsistency

conflict with the Agreement

a payment failure Yes

a merger acquisition

delay in processing of the order and the payment datas [sic] correctness

Yes

legal situations Host Color

conflict between the terms contained in the Service Order and the terms in this Agreement

a suspension by OpSource of Customer’s access to any Service pursuant to Section 13.3

Yes

any termination by OpSource of any Service any dispute between the parties concerning interpretations

a dispute between us regarding the interpretation of applicable law

a (system) failure Yes

Service.

A selection of the exceptions discovered by our approach is shown in Table 2.8. These noun phrases mostly express business exceptions, but our approach did find quality-of-service-related exceptions such as “(hardware) failure.”


2.5 Related Work

Our work in contract mining intersects with two research areas: service science and text mining. For one thing, a contract is a service binding artifact, and thus service interactions are regulated by the contract; for another, a contract is expressed in natural language, so text processing techniques apply naturally.

2.5.1 Service Contracts

Krishna and Karlapalem formulate the entire contract life cycle with special reference to service-oriented computing and illustrate the importance of moving from traditional to electronic contracts [80]. They propose a methodology for contracts that gives special importance to exceptions. Indeed, Krishna and Karlapalem list (1) mining contracts and (2) developing general templates for contracts as two of four grand challenges. Our approach shows how to (1) mine contracts for exceptions at the phrasal level and (2) address the design of contracts by building a list of common exceptions.

Meneguzzi et al.’s [66] effort is part of the European Union’s CONTRACT project framework, a comprehensive approach to model, reason about, and enact electronic contracts. Our approach complements the above work in two respects. One, our approach can help acquire the knowledge of a particular setting that the CONTRACT framework can codify and operationalize. Two, our approach brings up the typical business exceptions in a domain and in this manner provides a basis for verifying whether a specific contract is sufficiently robust and that its enactments will cover the discovered exceptions.

Arenas and Wilson [3] distinguish between the operational and business levels of a contract. At the operational level, a contract can be expressed as policies, licenses, and service level agreements. Currently popular approaches for service agreements—such as WS-Policy [6], Web Service Level Agreement (WSLA) [61], Web Service Offerings Language (WSOL) [99], and Open Digital Rights Language (ODRL) Service profile (ODRL-S) [35]—largely emphasize operational details. At the business level, a contract is drafted by contract lawyers and executed by the participating organizations. There is thus a huge gap between the business and the operational levels.


Exceptions have been studied in the context of representation, identification, and resolution. Molina-Jimenez et al. [72] introduce an architecture for exception resolution. Grosof and Poon [41] represent business contracts in RuleML and thus enable agents to automatically create, evaluate, negotiate, and execute contracts and to handle exceptions. Klein et al. [56] describe a methodology for identifying exceptions and finding suitable responses for these exceptions. Further, Klein et al. propose a taxonomy of exceptions.

However, existing approaches on contracts and exceptions do not interface well with the “legacy” of text-based contracts, which is how all serious business is still being conducted today. Research on the automatic extraction of exceptions from contracts has been rare, if not nonexistent—despite the exhortations of researchers such as Krishna and Karlapalem [80]. The present approach can feed the above approaches with concrete representations that they can formally reason with.

Khandekar et al. [54] proposed a system called MTDC (Methodology and Toolkit for Deploying Contracts) to map a business contract to deployable e-contracts based on the EREC data model [53]. The MTDC system takes advantage of knowledge of the domain (in which the contract applies) such as contract type, and a list of keywords specific to the domain of the contract, and can extract sentences representing exceptions. Each sentence in a contract is classified as a clause, an activity, or an exception based on rules and supplemented with assistance from a human designer.


In broad terms, because contracts are a type of legal document, work on knowledge extraction from regulatory text is indirectly related. Breaux et al. [17, 55] extract rights and obligations from regulatory text to aid regulatory compliance. Koliadis et al. [57] extract key phrases and generate possible interpretations from predefined templates to contextualize regulatory policies. However, these approaches are mostly reliant on complex hand-crafted rules or heuristics. As a result, they are not easy to migrate to new settings.

Some research applies text mining to analyze text artifacts in web services and software requirements for service matching, discovery, and key element identification. Yale-Loehr et al. [107] mine software requirement specifications (SRS) to discover shared services and make corresponding recommendations. They use a similarity-based approach to compare keywords in SRSs and take advantage of sets of synonyms (termed synsets) identified in WordNet [67]. Guo et al. [42] propose an approach for improving the quality of semantic web service matching. They generate ontologies from web service descriptions and map between web services with the guidance of the ontologies. Spanoudakis et al. [97] discuss and lay out the foundations of principles for inconsistency and overlaps between SRSs. They address the problem of overlap identification and take steps towards providing a formal semantics for overlap relations. Hussain et al. [48] analyze SRSs to classify sentences as functional (for example, input, output, events) or nonfunctional (for example, performance, reliability, security) requirements. Hussain et al. use a set of keywords and part-of-speech tags and employ a text classifier based on Quinlan’s C4.5 decision tree algorithm [78].

2.5.2 Pattern-based Information Extraction

We apply a pattern-based natural language processing approach for finding exceptions in contract text. Pattern-based information extraction has been an active discipline in the past two decades. Despite their simplicity, linguistic pattern-based approaches yield surprisingly good results. We survey some important work in this area.

Hearst [45] pioneered the pattern-based approach by using it for automatic acquisition of hyponyms from Grolier’s American Academic Encyclopedia. A hyponymy relation, such as that of apple to fruit, indicates the is-a relation. To extract such information, Hearst defines patterns of the type “NP0 such as NP1.” For example, if we see fruit such as apple, that indicates that apple is a hyponym of fruit.

Berland and Charniak [10] apply a similar pattern-based approach to find nouns that satisfy part-of relations in the LDC North American News Corpus (NANC). The part-of relation indicates the part and whole of entities, such as wheel to car. Berland and Charniak’s patterns are of the type “NP0 of NP1,” which indicate a part-of relationship, as in basement of building that


Girju and Moldovan [38] extract causal relations from text using an approach similar to the above on the TREC-9 data set, which is a collection of news articles. To extract causal relations from corpora, Girju and Moldovan use the most explicit intra-sentential pattern “NP0 V NP1,” where V is a simple causative verb.

Hearst evaluates her approach against WordNet and obtains a precision of 57.55%. Berland and Charniak’s approach yields 55% accuracy for the top 50 words, when evaluated against human annotated data. Girju and Moldovan achieve 65.6% accuracy against the average performance of two human annotators on 300 relation pairs. In this context, our results of nearly 90% precision indicate that contracts are a promising domain and perhaps that additional information can be mined from them.

Leidner and Schilder [58] use Hearst patterns [45] to mine business risk vocabularies and build a taxonomy. They identify potential risks in financial reports. Leidner and Schilder use the Web as their corpus for vocabulary discovery and validation. In contrast, our system uses a set of contracts as its corpus, and its vocabulary discovery process is not based on the Hearst patterns.

Indukuri and Krishna [49] use a machine learning-based approach to study contract documents. They employ a binary support vector machine (SVM) to decide if a sentence in a contract is a clause. Indukuri and Krishna further classify the clauses into two categories: payment related or otherwise, on somewhat ad hoc grounds. In their experiment, they use n-gram models (with n ranging from one to four) to convert from text into feature vectors. They report the best result when n equals four. In contrast, we identify exception clauses and develop a domain-specific vocabulary of exceptions. Payment is inherently domain-independent, so in that sense our problem is complementary to that of Indukuri and Krishna. On the basis of linguistic processing, our method uses patterns as a clue to discover service exceptions at the phrasal level. A basic pattern recognizer and a learning-based approach can also extract sentences or other context such as a text window. We compare our approach with these approaches.


2.6 Discussion

A contract is a legal agreement between real-world business entities whom we treat as providing services to one another. We focus on business and not technical services. Service exceptions, as the focus of our study in this chapter, reveal critical aspects of business service operations. As we live in an imperfect world, timely capture of business exceptions and proper handling of unexpected incidents give an organization a competitive advantage. Though rarely studied before in the service community, exception extraction at the phrasal level can potentially help build a rich knowledge base for ontologies.

The novelty of our work lies in formulating and solving the problem by bridging text-based service contracts with natural language processing analytics. Empirical studies show that our approach is not only viable but also effective. As opposed to rule-based or machine learning approaches that address related tasks such as [54], our approach requires minimal human intervention, has better portability across different contract domains, and enjoys high efficiency on large text repositories. Capturing service exceptions at a semantic level is challenging because of their potential ambiguity and wide range of references. Harvesting exceptions from a vast amount of contract text is a daunting task. Our approach avoids the semantic challenges and takes advantage of a handful of text patterns to harvest the semantic units of exceptions at the phrasal level.

We demonstrate an unsupervised pattern-based approach for automatically extracting exceptions from contract text that is not only flexible, but also effective. We apply manual annotations solely for the purposes of evaluation and not to train our system. Our approach is independent of the domain of the given contracts and requires minimal human effort. Figure 2.4 shows a screenshot of the online tool that incorporates our approach when used on a real contract from the IT services domain. This illustrates a simple but valuable use of our approach, wherein it highlights the relevant text in service contracts and thereby assists users in reviewing contracts.

Our approach can discover domain-specific exception vocabularies from contracts. For example, we may find phrases such as late delivery and defect in products, as indicated in Table 2.6, more commonly in manufacturing contracts than in loan agreements, where terms such as bankruptcy and insolvency would appear more frequently.

Based on the service exceptions that our approach extracts, phrase classification algorithms can further organize these vocabularies into categories. For example, some exceptions refer to financial conditions such as nonpayment, and some refer to natural disasters such as earthquakes. On top of that, a taxonomy of exceptions for a specific contract domain can potentially be generated automatically.


Figure 2.4: A screenshot of our system used as a browser addon.

text. In particular, we observe that many business risks involve temporal constraints such as late delivery of products and late payment. A failure in the timely delivery of a service can damage an organization’s reputation, disrupt enterprise activity, result in poor customer satisfaction, and ultimately cause loss to the bottom line. In addition, temporal relations, such as those indicated by between, before, and after, can provide critical information for regulating business activities. Mining temporal information from contracts can prepare a decision maker for possible violations and can help an enterprise hedge against potential business risks.


Chapter 3

Business Events and Temporal Constraints Extraction

3.1 Introduction

In this chapter, we focus on business events and their temporal constraints, viewing contracts as business-binding artifacts. Events in a business contract may signify crucial occurrences of enterprise activities and herald potential business risks. Financial payment, product delivery, dispute resolution, and important stages of functional business processes can all be regarded as business events. Temporal constraints that qualify such events are also critical. Product delivery, recurring payments, and interest accrual are almost always associated with temporal constraints. The violation of temporal constraints is an important factor in contractual breach. Temporal constraints are inherently important for each participating party in a contract, and failure to observe these constraints can hinder business progress, disrupt organization schedules, upset cooperating businesses, and result in legal complications. Discovering business events and their corresponding temporal constraints can reveal insights from one of the most important enterprise knowledge sources: contracts.


Contributions

This chapter, first, formulates the problem of business event and temporal constraint extraction from contract text. Second, it shows how to solve the event and temporal extraction problem with three subtasks using a combination of surface patterns, grammar parsing, and classification, and compares different classification methods. Third, it applies topic modeling to cluster event terms into thematic groups.

Organization

The rest of the chapter is organized as follows. Section 3.2 formalizes the business events and temporal constraints extraction problem and divides the problem into three subtasks. Sections 3.3, 3.4, and 3.5 describe the method and evaluation for each subtask. Section 3.6 surveys the relevant literature. Section 3.7 concludes with a discussion of remaining challenges and future work.

3.2 Problem and Approach Overview

Event extraction typically depends on the domain and context. Biological events usually refer to the interaction between genes, molecules, proteins, or organisms. The event structure is usually defined in the form of semantic slots (biological substance A; target verbs such as “activate” and “inhibit”; biological substance B) [106]. News events commonly refer to who did what to whom when with what methods where and possibly why [98]. The structure of a news event is often expressed in the form of (subject; action; object) along with a set of attributes such as time, location, and reason. Financial events are defined as the important quantitative information, such as revenue forecasts and profit estimates from company earnings reports [63].

Events in contracts are distinct from events in other domains. First, the connotations are different. A contract is usually drafted and enacted before the relevant business transactions occur; that is, a contract refers to future behaviors. In contrast, events in other domains usually describe natural phenomena or scientific facts. For example, biological events describe the nature of life substances; news events describe current or past occurrences worthy of attention. Second, the scopes are different. Events from domains such as news and biology typically focus on one narrow area, and thus a tailored method may work well for each such specific task, whereas business events encompass many different areas because of the diverse realms that contracts deal with, e.g., manufacturing, licensing, supply, and employment.

timeline organization [90, 59]. However, due to the nature of contracts, temporal constraints typically have financial and legal ramifications. As a result, temporal constraints that qualify business events in contracts are often explicit.

Below are some sample sentences from the Yahoo! Small Business Terms of Service.1

All installation or setup fees and non-recurring charges, along with the first month’s recurring charges, shall be due and payable within ten (10) days of initiation of Service.

If You cancel the Service before the end of the Initial or Renewal Term, Your Service and access to the Service will be discontinued immediately, and no refund will be provided for any payments You have made.

You agree that Yahoo! may delete customer credit card information from Yahoo! servers 14 days after You retrieve such information, and may delete all other Merchant Information from Yahoo! servers at the end of each calendar year.

The bold text fragments—“be due and payable,” “cancel the Service,” “delete customer credit card information,” and “delete all other Merchant Information”—express business events and are significant to the contracted service engagement. Such events are associated with the rights, obligations, permissions, and prohibitions of the contractual parties. The underlined text fragments—“within ten (10) days of initiation of Service,” “before the end of the Initial or Renewal Term,” “14 days after You retrieve such information,” and “at the end of each calendar year”—place temporal constraints on the corresponding business events. The events may expire or become invalid when their temporal constraints do not hold. For example, in the first example sentence, the charges shall be due and payable within ten days of the initiation of Service; paying after ten days of the initiation of Service may breach the contract and potentially incur financial liability.
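To make the notion of an explicit temporal constraint concrete, the following minimal sketch matches a handful of prepositional cues in the sample sentences above. The cue list, the pattern, and the function name find_temporal_constraints are illustrative assumptions; they are not the patterns used by Contract Miner.

```python
import re

# Hypothetical cue list: expressions that often open a temporal constraint.
# A candidate phrase runs from the cue up to the next comma, period, or semicolon.
TEMPORAL_CUE = re.compile(
    r"\b(within|before|at the end of|\d+\s+days?\s+after)\b[^,.;]*",
    re.IGNORECASE,
)


def find_temporal_constraints(sentence):
    """Return candidate temporal-constraint phrases found in a sentence."""
    return [m.group(0).strip() for m in TEMPORAL_CUE.finditer(sentence)]


sentence = (
    "All installation or setup fees and non-recurring charges, along with "
    "the first month's recurring charges, shall be due and payable within "
    "ten (10) days of initiation of Service."
)
print(find_temporal_constraints(sentence))
```

Such a surface pattern is easy to write but brittle: it misses constraints introduced by cues outside the list and cannot tell which event a phrase qualifies, which is why patterns alone are not sufficient for this task.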

In poorly formulated contracts, business events such as payment and service delivery that bear implicit time requirements may lack temporal constraints. Disputes could occur when contracting parties default or fail to deliver services in a timely manner. Contract Miner captures the essential elements of a contract and provides a basis for commitment-based contract analysis [94]. We now define business events and temporal constraints in the setting of contract text mining.

Definition 7 Business event: a subsentence-level text that captures essential business processes, often expressed with a subject and a corresponding verb phrase.

Definition 8 Temporal constraint: a phrase that restricts the validity of a given business event or events and is expressed as a prepositional phrase.

Formally, our task is: given a corpus of contract text C, extract the business events E along with their subject and any associated temporal constraint T. Through this process, pairs (E, T) are extracted, where T is optional.
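The output of this task can be pictured as a list of records, one per extracted event. The sketch below is only one possible representation under our own assumptions: the class name ExtractedEvent and its field names are hypothetical, and the last record is an invented example of an event without a temporal constraint.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractedEvent:
    """One (E, T) pair together with the subject of the event."""
    subject: str                                # party or entity the event is about
    event: str                                  # E: the business event phrase
    temporal_constraint: Optional[str] = None   # T: optional


pairs = [
    ExtractedEvent(
        "You", "cancel the Service",
        "before the end of the Initial or Renewal Term",
    ),
    ExtractedEvent(
        "Yahoo!", "delete customer credit card information",
        "14 days after You retrieve such information",
    ),
    # Hypothetical event with no temporal constraint (T is absent).
    ExtractedEvent("Merchant", "pay the setup fee"),
]

# Events lacking a temporal constraint may deserve review in a poorly formulated contract.
unconstrained = [p.event for p in pairs if p.temporal_constraint is None]
print(unconstrained)
```

Making the constraint field optional mirrors the task statement: an event is still extracted when no constraint qualifies it.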

3.2.1 Overview of Information Extraction

Information extraction (IE) [46, 86] from text is the process of analyzing text to discover information of interest. IE tasks involve extracting named entities, relations, and events. Entities such as person, organization, and location are the basis for relations and events. For example, extracting person and company names can help discover the “is CEO of” relation in an organization; extracting molecule, protein, and bond names lays the foundation for identifying biological interaction events in organisms. Relation extraction generally acts on top of named entity extraction, and events are usually a compilation of named entities along with their mutual relationships.

Many methods have been developed for IE, including those based on patterns, statistics, and machine learning. Pattern-based methods use handcrafted surface patterns to extract information of interest, often with strong results in closed domains [45]. Pattern-based methods are simple and effective but fall short in portability across domains. The patterns developed in one domain often do not apply in another, and the process of crafting patterns sometimes requires expert domain knowledge. Statistical methods often demonstrate strong robustness in open-domain IE tasks such as open web information extraction [100]. Among these, pointwise mutual information (PMI) is widely used for revealing similarity relations between concepts and extracting parallel concepts, e.g., using result counts obtained from a search engine. The disadvantage of this method is that it requires a large amount of data for the statistical model to be effective. Machine learning methods produce promising results on classification and labeling tasks [34]. Classification, a widely used technique, employs a set of features to predict the class label of a new instance based on human-annotated data. Different types of features can come into play: grammatical features such as the part of speech of a token; statistical features such as the term frequency (TF) of a token; and contextual features such as the neighbors of the target token. Several classifiers have been applied in various IE tasks, including some hybrid approaches [30].
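As a brief illustration of the statistical approach, PMI compares how often two terms co-occur with how often they would co-occur if they were independent: PMI(x, y) = log [ p(x, y) / (p(x) p(y)) ]. The following sketch computes it from raw counts such as search-engine result counts. The function name pmi, the base-2 logarithm, and the counts in the usage lines are our own illustrative assumptions.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (base 2) from raw counts.

    count_xy: number of documents containing both terms
    count_x, count_y: number of documents containing each term
    total: number of documents in the collection
    """
    if min(count_xy, count_x, count_y, total) <= 0:
        raise ValueError("all counts must be positive")
    # p(x, y) / (p(x) * p(y)) simplifies to count_xy * total / (count_x * count_y).
    return math.log2((count_xy * total) / (count_x * count_y))


# Invented counts: the terms co-occur far more often than chance predicts.
associated = pmi(count_xy=50, count_x=100, count_y=200, total=10_000)
# Invented counts: the terms co-occur exactly as often as independence predicts.
independent = pmi(count_xy=2, count_x=100, count_y=200, total=10_000)
print(associated > independent)
```

A score well above zero suggests that the two terms are associated, whereas a score near zero suggests independence. With small counts the estimate is unreliable, which reflects the data requirement noted above.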