Harmonizing and Combining Large Datasets – An Application to Firm-Level Patent and Accounting Data

(1)

NBER WORKING PAPER SERIES

HARMONIZING AND COMBINING LARGE DATASETS – AN APPLICATION

TO FIRM-LEVEL PATENT AND ACCOUNTING DATA

Grid Thoma

Salvatore Torrisi

Alfonso Gambardella

Dominique Guellec

Bronwyn H. Hall

Dietmar Harhoff

Working Paper 15851

http://www.nber.org/papers/w15851

NATIONAL BUREAU OF ECONOMIC RESEARCH

1050 Massachusetts Avenue

Cambridge, MA 02138

March 2010

We thank Jim Bessen, Hélène Dernis, Megan MacGarvie, Paola Giuri, Stine Grodal, Myriam Mariani,

Kazu Motohashi, Teruo Okazaki, James Rollinson, Philipp Sander, Georg von Graevenitz, Stefan

Wagner, Norihiko Yamano, Maria Pluvia Zuniga, and the participants at the PATSTAT Users’ Meeting

in Paris in March 2008, seminars at the Ludwig-Maximilians-Universität München and Università

L. Bocconi in Milan for very fruitful discussions. We also thank Armando Benincasa and Luisa Quarta

from Bureau Van Dijk for clarifications about the structure of the Amadeus database and its changes

over time. Thoma and Hall are grateful to the Kauffman Foundation for support of some of this work.

The views expressed herein are those of the authors and do not necessarily reflect the views of the

National Bureau of Economic Research.

NBER working papers are circulated for discussion and comment purposes. They have not been

peer-reviewed or been subject to the review by the NBER Board of Directors that accompanies official

NBER publications.

Hall, and Dietmar Harhoff. All rights reserved. Short sections of text, not to exceed two paragraphs,

may be quoted without explicit permission provided that full credit, including © notice, is given to

(2)

Harmonizing and Combining Large Datasets – An Application to Firm-Level Patent and Accounting

Data

Grid Thoma, Salvatore Torrisi, Alfonso Gambardella, Dominique Guellec, Bronwyn H. Hall,

and Dietmar Harhoff

NBER Working Paper No. 15851

March 2010

JEL No. C81,O34

ABSTRACT

This paper discusses methods for the harmonization and combination of large-scale patent and trademark

datasets with each other and other sources of data. Dictionary- and rule-based approaches to the consolidation

of applicant names in patent data are presented and shown to have both benefits and drawbacks in

isolation. We combine the two methods and develop a set of rules and dictionaries to consolidate European,

Patent Cooperation Treaty (PCT) and US patent data with firm accounting data. The resulting data

encompass about 131,000 patent applicant names from 46 countries, covering 58.8 percent of EPO

applications and 50.6 percent of PCT applications by business organizations during the time period

from 1979 to 2008. For US data, the resulting dataset includes around 54,000 assignee names and

51.3 percent of US granted patents during approximately the same time period.

Grid Thoma

School of Science and Technology

University of Camerino, Italy

Via Madonna delle Carceri, 9

62032 Camerino (MC)

and KITES-Bocconi University

[email protected]

Salvatore Torrisi

Department of Management

University of Bologna, Italy

Via Capo di Lucca 34 - 40126 Bologna

and KITES-Bocconi University

[email protected]

Alfonso Gambardella

Department of Management and KITeS

Bocconi University

Via Roentgen 1

20136 Milan

Italy

[email protected]

Dominique Guellec

OECD/OCDE

2, rue Andre Pascal

75016 Paris France

[email protected]

Bronwyn H. Hall

Dept. of Economics

549 Evans Hall

UC Berkeley

Berkeley, CA 94720-3880

and NBER

[email protected]

Dietmar Harhoff

Institute for Innovation Research

Munich School of Management

University of Munich

Kaulbachstrasse 45

D-80539 Munich, Germany

[email protected]

(3)

1

Introduction

While innovation is frequently touted to be one of the most important drivers of economic growth, it is also one that is not observed in particular detail in standard statistics. The knowledge economy still largely escapes the grasp of statisticians, and researchers in this field have therefore experienced serious data constraints for many decades. Recently, the availability of large datasets containing information on R&D expenditures, patents and trademarks has relaxed these constraints considerably and has spurred the growth of a new wave of research.

However, the availability of such large-scale datasets has led to a new embarrassment of riches. Many researchers have put effort into the matching and consolidation of names of applicants of patents or trademarks, or have combined such data with firm-level financial data. Predictably, there is currently much duplication of such efforts. While the matching task has proven quite feasible for focused industry and technology studies, there is still no reliable and proven standard approach for larger datasets. Moreover, replication of studies is made very difficult by the lack of acknowledged standard approaches. This paper seeks to fill this gap, building on the experiences and results that the authors have achieved in separate and joint efforts. We present a methodological discussion and make the results of our data consolidation efforts available to the innovation research community via a website.1

Innovation studies have made good use of a variety of data types in order to evade the aforementioned data constraint. A first avenue is the collection from secondary sources of information on different qualitative dimensions of innovation. Examples include prizes as a measure of successful innovation, newswire (e.g., ad-hoc) announcements as a paper trail of collaboration among firms or of acquisitions, licensing and R&D agreements (Moser, 2005; Giarratana and Torrisi, 2006; Greenhalgh and Rogers, 2007; Fosfuri and Giarratana, 2007; Powell et al., 2000; Arora et al., 2001). By choosing data from particular contexts, sometimes using quasi-experimental features, researchers have been able to sort out competing explanations of particular innovation phenomena and to generate results with a high degree of internal validity and reliability. Recent attempts to search for well-defined quasi-experiments in particular areas (Stern and Furman, 2009; Murray et al., 2008) also belong to this group. The second approach is based on the collection of information through surveys of innovating entities (organizations or individuals) and has produced many important datasets and results. With respect to U.S. data, two widely cited surveys are the Yale Survey (Levin et al. 1987) administrated in the early 1980s and the sequel conducted by scholars at the Carnegie Mellon University in the 1990s (Cohen et al. 2000). Both covered the sources and strategies of innovation at the firm level by eliciting information from R&D managers. These surveys were followed by the development of various innovation surveys in Europe, culminating in the Community Innovation Survey (Mairesse and Mohnen 2010). More recently, European scholars have conducted inventor surveys based on the individuals named in patent applications which provide very detailed information on the factors driving innovation at the level of the individual inventor and within invention processes (Gambardella et al., 2008; Giuri, Mariani et al., 2007). Their example has been followed by Japanese, Korean, US and

(4)

Australian researchers (Nagaoka and Tsukada, 2007; Nagaoka and Walsh, 2008; Um 2005; Heong et al. 2007).

Finally, the third approach is to rely on publicly available administrative and financial databases such as patent and accounting information, or in some cases on confidential firm level data that resides in National Statistical Offices, Central Banks or institutions which have been commissioned by government agencies to perform surveys and data collections on a regular basis.2 In our application, we focus on this third type of data, which are useful because it provides comprehensive coverage of industries, technologies and countries. In addition, this kind of data is usually collected on a regular basis allowing for replication and updating of studies.

The distinction made here may be overly stylized. Indeed, data of different types and from different sources are often combined. For example, the inventor surveys undertaken in various countries depend critically on inventor information as contained in official patent data. Moser’s 2005 study combines historical patent data, entries in exhibition catalogues and innovation award data. Moreover, it is frequently the case that innovation-related data is complemented with information from accounting sources or other forms of financial data. In most cases, a large number of observations in each dataset need to be merged in the absence of a common identifying code, which means that researchers have to rely only on the entity names to do the match. The problems that arise in this case and possible ways to solve them are the subject of this paper.

The matching and harmonization problem we address is usually present for any of the datasets described in the above, and it becomes more complex as the size of the dataset (the number of entities) increases. It is important to note that the problem may also exist even if just one type of data is being used. Variations in the spellings of names routinely occur within large-scale datasets such as patent and trademark data.3 The problem can usually be handled manually in smaller datasets, but requires some form of automated approach in larger datasets. The exact nature of the problems to be addressed is described in section 2.3.

Turning to the types of data of particular interest to us, we note that the most often encountered ones are i) data on R&D expenditures as collected by statistical agencies or revealed in accounting data and stock market reports; ii) data from regularly performed innovation surveys; iii) data on intellectual property rights, such as from patent and trademark offices; iv) data on firm inputs and value added, such as is collected by various national statistical agencies; and v) corresponding data on the financial status of firms, such as P&L data, balance sheet data and so forth. We discuss these here before turning to IP and financial data as our particular focus in this article. This list is not complete, but presumably covers the most frequently used data for which a name matching problem occurs.

2_{What distinguishes the second and third approach is that in the second case, data collection often represents a}

singular research effort yielding a cross-sectional database. The results are of considerable value, but typically not suited to provide representative data on innovation over time.

(5)

R&D data. Data on R&D expenditures have been collected for many decades by statistical agencies and other entities commissioned by government agencies.4 The foundation for these surveys is the Frascati Manual (OECD 2002) which contains definitions of R&D and related terms and detailed hints for the surveys. The results of R&D surveys are summarized on an annual basis in national and in OECD reports, but the firm-level data are usually not available publicly for use in micro-econometric studies. Under US and UK accounting rules, firms listed in the stock market have to report their R&D expenditures if they are “material” and under revisions to the International Accounting Standards Board, the practice of reporting such data is spreading. However, especially in the case of European firms, data on R&D expenditures are often missing because reporting these expenditures is not currently required by accounting and fiscal regulations. The definition of R&D used for accounting purposes can also differ from that in the Frascati Manual; for the US example, see Hall and Long (1999).

Innovation survey data. In many European and non-European countries these are now being undertaken on a regular (usually annual) basis. Innovation surveys seek to elicit information not only on R&D, but also on the broader innovation process as described in the OECD/Eurostat - Oslo Manual (2005).5 Although R&D is a good indicator of the commitment of a firm to inventive activities because it is chosen by the firm, it does not tell us much about their ‘success', technical and economic, which is rather reflected in the notion of innovation (which is an invention reaching the stage of implementation, either as a new process, a new product or a new marketing approach). Moreover, not all inputs into innovation processes are covered by the classical Frascati Manual definition.

In the European context, European National Statistical Offices have conducted a series of Community Innovation Surveys (CIS), collecting detailed data on innovation and other firm characteristics.6 The integration of CIS and other survey data with information from other databases, such as patents and accounting data is made difficult by the limitations to the use of CIS data imposed by confidentiality laws in all countries. Innovation survey data continue to generate important findings, but the difficulty of matching CIS databases to other databases have limited their use for the purpose of research in economics, management and public policy (Mairesse and Mohnen, 2010). 7

4_{In Germany, the R&D surveys are performed by Wissenschaftsstatistik GmbH, a branch of Stifterverband, an}

association of German industry which supports research and science. In this case, the performer of the survey is not an official statistical agency. In most other countries such as Austria, the UK, France, Italy and the Netherlands, the statistical agencies carry out the surveys. In the US, the survey is carried out by the Bureau of the Census under a contract from the National Science Foundation.

5_{Available at the following link http://www.oecd.org/dataoecd/35/61/2367580.pdf}

6_{See Arundel (2001) for details. A large number of countries outside of Europe and North American have}

followed this lead, with innovation surveys now having been conducted in Asia, Latin America, and other locations. The United States does not perform a government innovation survey, although similar private NSF-funded surveys have been done in the past (Levin et al. 1987; Cohen et al. 2000) and a major new initiative is now underway, directed by Wesley Cohen.

7_{There are a few exceptions to this rule. The German government has commissioned innovation surveys on a}

regular (annual) basis (of which the CIS survey is a part) and the resulting cross-sectional and panel data have been used in a large number of studies. This dataset has also been combined with patent data and other external information. See the survey by Janz et al. (2001) for a detailed description.

(6)

Patent data. Due to the wider availability of computer-ready datasets, an increasing number of studies use patent counts and patent-related indicators to measure the quantity and the ‘quality' of inventive output. Patents as a measure of inventive success have their own drawbacks, but they constitute a detailed measure of innovation (Griliches, 1981 and 1990; Pavitt, 1988). However, crude patent counts are an extremely noisy indicator of inventive output because they do not account for differences in the value of patented inventions. For this reason, many innovation scholars have introduced various patent-related indicators as a measure of the importance or “quality” of the inventive output (Harhoff et al. 2003, Hall et al. 2005).

Trademark data. Comprehensive studies on the economic role of trademarks are still rare, but recent studies confirm that trademarks can be economically important, explaining a significant share of the market capitalization of firms (Greenhalgh and Rogers, 2006; Sandner 2009). Trademarks have also been used as proxy for firm’s diversification activities (Mendoça et al, 2004), processes of market entry and survival (Fosfuri and Giarratana, 2007), appropriability strategies (Graham, and Somaya, 2004) and firm’s reputation and advertisement efforts (von Graevenitz, 2004).

Accounting and other economic data. In some cases, the isolated use of any of the data may be sufficient to undertake a study on innovation. But frequently, R&D, innovation survey, or IP data need to be combined with accounting or financial data for the firm in question, especially if we wish to measure outcomes such as productivity or profitability. Databases that contain such data (Amadeus, Compustat, Reuters and others) usually have their own system of entity identification that does not match up with the sources of IP and innovation data.

Therefore, studies that seek to employ data from different sources (for example, financial data jointly with patent and trademark data) have to merge multiple types of firm identification. Since inaccuracies in data merging and integration can lead to measurement errors, biased results or lack of sufficient statistical power, correct matching is an important but neglected issue. This is a particularly important issue in studies of patenting at the firm level because patent data never comes with firm-level identifiers that match to other sources of data, and researchers must rely only on the names of the firms to combine datasets.

Earlier solutions to this problem have relied on manual matching of firm names across datasets (e.g., the small samples of US patenting firms used by Griliches, 1981) and partially computer-based ad hoc methods combined with manual matching (e. g., Bound et al., 1984 and Hall et al., 2005). But methods involving even a small amount of manual matching have limits when confronted with the large datasets that are common today.

Researchers are therefore forced to apply one or several automated methods. The first group of approaches are dictionary-based methods, essentially based on large collections of names that serve as examples for a specific entity class, such as the DERWENT Patentee Index, and the USPTO and EPO standard patent-holder codes. More recently, automatic methods have been suggested for generating a dictionary (Magerman, Van Looy and Song, 2006). The second group are rule-based approaches that build up a set of rules for the comparison of similar names. A pioneering exercise was performed by Thoma and Torrisi (2007) using

(7)

accounting and financial dataset which contains 2,197 parent firms and their 151,979 subsidiaries. Doing a good match requires using a combination of the two approaches, as neither one is sufficient on its own.

The contribution of this paper is the development of a more comprehensive and automated methodology for company name standardization and the matching of two data sources using the resulting standardized names: IP-related databases and company business directories. Our methodology utilizes recent advances in automatic Named Entity Recognition (NER) systems. NER systems have been applied successfully in bioinformatics, while their deployment in social sciences is still at an early phase. Thus this study is among the first attempts to use these methods in empirical studies in economics and management. We do not rely on a single NER approach, but experiment with several techniques in parallel. We find that a combination of dictionary-based and rule-based approaches produces the most favorable results in our application.

The paper is organized as follows. In section 2, we start by describing the data sources currently available for large-scale studies on patenting and other IP information. Using examples from these data, we then describe various types of problems researchers are facing. In section 3, we then describe two fundamental approaches to solving the problem: the dictionary based approach, which relies on the collection of large datasets of names and their variants, and the rule-based method, which builds on the articulation of rules to establish a similarity link across different entity names. Additionally, we discuss how the value of existing dictionaries could be enhanced by using other methods to query their entries. We propose a novel method relying on priority links among the patents that enables the combination of distinct dictionaries of entity names originating from patent data of different offices.

In section 4, we describe the software prototypes for the different approaches analyzed in the section 3. In this section, we combine the different approaches in order to achieve reliable results for the problem at hand. Finally, in section 5, we conclude by documenting the harmonization and matching that results from the methodology suggested in this paper and assessing the type I and type II errors thus achieved. Section 6 concludes.

2

Frequently Used Data Sources and Matching Problems

In this section we briefly review the sources of patents and other IP data frequently used in economic studies. In particular, we will consider their content, time coverage, mode of access, complementary search and management tools and potential integration with other sources. We also provide a description of typical matching problems arising from the use of these data.

2.1 Online databases

US granted patents

Freely available from www.uspto.gov, this database includes information on all US patents (including utility, design, reissue, plant patents and others) from the first patent issued in 1790 to the most recent issue week.

Full searchable text is offered for patents issued from January 1976 to the present, including all bibliographic data, such as the inventor's name, the patent's title, and the patenter's name

(8)

(called the assignee at the USPTO), the abstract, the full description of the invention, and the claims. Patents issued prior to December 1975 are only searchable through the patent number, issue date, and current US patent classification.

US published applications

As in the case of granted patents, the application database is freely searchable at the USPTO and consists of the full text of US applications that have been published since its inception in March 2001. After that date patent applications could be kept secret if protection was requested in the US only; otherwise the application is published within 18 months from filing.

The full text of a published application includes all bibliographic data, such as the inventor's name, the published application's title, and the applicant, as well as the abstract, the full description of the invention, and the claims. All of the textual words in the publication are searchable.

IIP Japan database

This database contains information on all 9 million published Japanese patent applications between 1964 and 2004, along with 2.6 million patent registrations (grants). The information provided consists of the application and registration numbers, dates of application, exam request, grant, and expiration, the number of claims, IPC codes, applicant and rights holder information including geographic location, and citation links. The dataset is provided freely online and documented in Goto and Motohashi (2007).8

USPTO Trademarks

The USPTO website at http://www.uspto.gov/main/trademarks.htm provides complete electronic information about trademarks since the birth of the USPTO. The database contains more than 4 million pending, registered and dead trademarks and it provides complete free searchable access to the text and image database of trademarks.

ESPACE on-line

The ESPACE database contains freely searchable information on published patent applications from over 80 different countries and regions. It is based on the PCT minimum documentation, which is defined by WIPO as the minimum requirement for patent collections used to search for prior-art documents for the purpose of assessing novelty and inventiveness. As of March 2007, esp@cenet® held data on 60 million patents. A total of 30.5 million of these patents have a title, 19.5 million have an abstract in English, and 29.5 million have an ECLA class.9

8_See_{http://www.iip.or.jp}

(9)

Table 1:

ESP@cenet coverage: Starting year of availability for the main patent offices

Patent Office Facsimiles Full Text ECLA

DE 1877 1970 1877 EP 1978 1978 1978 FR 1900 1970 1902 GB 1859 1893 1859 US 1836 1970 1836 WO 1978 1978 1978

Source: Own analysis based on http://ep.espacenet.com/?locale=en_EP

CTM – on line

CTM–ONLINE provides free access to information on EU Community trade mark applications and Community Trademarks, updated on a daily basis regarding: the trademark number, name, type, owner, Nice10 classification codes, status, filing date, registration date, date of international registration, publication date, expiry date etc.

2.2 Off-line databases

While the on-line databases provide real time and constantly updated information, researchers are often more interested in off-line databases in spite of higher costs and difficulties of updating. Off-line databases allow easier generation and manipulation of innovation indicators for statistical analysis. Moreover, ex-post scalability and integrability with other sources of information is significantly higher. The most important current sources of such data that are easy and low cost are the NBER patent citation database and the EPO-OECD PATSTAT database. However, before describing those two sources we will mention at least two other earlier efforts to generate such data, which contain historical data and are still available.

The pioneering work using patent data in economic studies can be found in Jacob Schmookler’s (1966) major book entitled Invention and Economic Growth. Schmookler classified patents manually by the industry of their potential use, finding that the top three user industries of patents during the first half of the 20th_{century were the railroad,} petrochemical and building sectors.

The seminal work of Schmookler was followed by that of Griliches and co-workers at the NBER (Bound et al., 1984, which constitutes the first major effort to combine patent counts with economic and financial data, such as sales, capital stocks, research and development, income, at the firm level. The accounting information is drawn from Standard and Poor’s Compustat files, which contain data for all firms traded in the major US stock markets. The linking was done mostly manually and it involved about 2,700 US corporations and their subsidiaries as reported in the year 1976. The authors used this dataset to estimate patent production functions and valuation equations.

10_See_{http://www.wipo.int/classifications/nice/en/classifications.html}_{(last download Oct. 15th, 2009) for details}

(10)

The NBER patent database

The NBER patent dataset on USPTO data represents a path-breaking effort of providing additional bibliographic information that could be used to account for differences in the ‘value’ of patents (Hall, Jaffe and Trajtenberg 2001 and 2005). Hall and colleagues have made these data freely available via the NBER website allowing a surge of new wave of research in innovation studies. For a partial set of contributions, see Jaffe and Trajtenberg (2002) and Cockburn et al. (2004).

The database comprises detailed information on almost 3 million US patents granted between January 1963 and December 1999. Available data fields in the NBER database consist of four main building blocks. First, it includes application and publication numbers and dates and technological classifications. Second, detailed and harmonized information is supplied on inventor names, address, city, zip code, and state and country code. The database is accompanied by a file containing the link between the names of USPTO patent assignees and the names of US companies listed in the Compustat dataset. This match is a more complete and updated version of the one described in Bound et al. (1984).11 Recently these data have been updated to 2006 by Cockburn and co-workers at the NBER (Cockburn et al., 2009).12 The fourth building block consists of the citation links, in particular all US patent citations made to these patents between 1975 and 1999, constituting over 16 million citation links. Although useful for the analysis of US-based questions, the drawback to using the NBER citation data is that it does not currently include information on citations to and from other patent databases, so citation counts are likely to be downward biased, especially for foreign-owned patents.

PATSTAT

Creation of a worldwide statistical patent database was initiated by the OECD task force on patent statistics; in response, PATSTAT was developed by the EPO in 2005 on the basis of the DOCDB database. It includes bibliographic details on patents filed to 80 patent offices worldwide, covering more than 60 million documents. Hence filings in all major countries and the PCT filings at the World International Patent Office are covered.13 Available fields of PATSTAT are listed below and a fuller discussion of these indicators can be found in (OECD, 2009):14

• Application and publication details such as authority, number, kind and dates;

• Technical information and descriptions such as title, abstract, international and national classification;

11_{The match of patents to Compustat for the 1999 database is based on the 1989 universe of companies. For}

more details see Hall et al. (2001) and http://www.nber.org/patents/

12_{Beta versions of the new datasets are available at}_{https://sites.google.com/site/patentdataproject/Home}_. 13_{Having complete coverage of PCT-WIPO filings is important because their counts allow for interesting}

(11)

• Applicant and inventor name, address, and country code;

• Identification of claimed priority, designating international application, parent application and technically related application;

• Identification of cited publication including patent and non-patent prior art, category and origin of the citation.

PATSTAT is released by the EPO twice a year, early spring and early autumn. Each version presents a snapshot of the source databases at a single point in time.15 However, the content and design of PATSTAT is not intended to be static: a “change management procedure” has been put in place by the EPO to allow task force members to request changes in the data catalog (i.e. including additional information, variables, etc.) within a reasonable time before each release. Currently around 100 institutions have subscribed to PATSTAT and this database is expected to be widely used for innovation studies.

2.3 Typical Matching Problems

The matching and consolidation problems faced by researchers typically fall into three broad classes:

1) variations in spelling, some of which are simply typographical errors, within a given list or database; see the examples in Box 1.

2) variations in the way names appear in two or several different lists, in many cases caused by different naming conventions; see the examples in Box 2.

3) the problem of matching firm subsidiaries in one list to ultimate owners in another (Box 3). This problem is especially important when matching firm financial data (reported at a consolidated level) to patent data (which is often at the subsidiary level).

Box 1. Different spellings and misspellings SYNRES INTERNATIONAL B.V.

SYRNES INTERNATIONAL B.V.

BSH BOSCH UND SIEMENS AKTIENGESELLSCHAFT BSH BOSCH UND SIEMENS AKTINGESELLSCHAFT BSH BOSCH UND SIEMENS HANSGERAETE GMBH BSH BOSCH UND SIEMENS HAUS-GERAETE GMBH BSH BOSCH UND SIEMENS HAUSERATE GMBH

MINNESOTA MINING AND MANUFACTURING COPANY MINNESOTA MINING AND MANUFACTURING COPMANY

15_{Much of the data is extracted from the EPO’s master bibliographic database – DocDB, also known as the EPO}

Patent Information Resource. However, depending on the patent office, the coverage of data may be partial or delayed over time.

(12)

MINNESOTA MINING AND MANUFACTURING CORP INTERNATIONAL BUSINESS MACHINES, CORPORATION INTERNATIONAL BUSINESS MACHINES CORPORATION INTERANTIONAL BUSINESS MACHINES CORPORATION INTERANATIONAL BUSINESS MACHINES CORPORATION INTERANTIONAL BUSINESS MACHINES CORPORATION INTERNAIONAL BUSINESS MACHINES CORPORATION INTERNAITONAL BUSINESS MACHINES CORPORATION …..and so forth

RÜTGERSWERKE AKTIENGESELLSCHAFT RUTGERSWERKE AKTIENGESELLSCHAFT RUETGERSWERKE AKTIENGESELLSCHAFT R_TGERSWERKE AKTIENGESELLSCHAFT

REGIE NATIONALE DES USINES RENAULT Société Anonyme REGIE NATIONALE DES USINES RENAULT (Societe Anonyme) REGIE NATIONALE DES USINES RENAULT Societé Anonyme REGIE NATIONALE DES USINES RENAULT (SociTtT Anonyme dite)

Box 2. Variations in naming conventions ABITIBI PRICE CORPORATION

ABITIBI PRICE INC SMITH (JOHN) LTD JOHN SMITH LTD

INTERNATIONAL BUSINESS MACHINES – IBM

IBM CORP. (INTERNATIONAL BUSINESS MACHINES)

IBM CORPORATION (INTERNATIONAL BUSINESS MACHINES) MINNESOTA MINING & MFG CO

3M CORP

MINNESOTA & MINING MANUFACTURING FLUID COMPONENTS INTL.

FLUID COMPONENTS INTERNATIONAL SENETAS CORP. LTD. (USA)

(13)

Box 3. Assignment to aggregate entities (ownership issues)

Subsidiary Ultimate Owner

ADHESIVE TECHNOLOGIES INC MINNESOTA MINING & MFG CO

AVI INC MINNESOTA MINING & MFG CO

D L AULD CPY MINNESOTA MINING & MFG CO

DORRAN PHOTONICS INCORPORATED MINNESOTA MINING & MFG CO

EOTEC CORPORATION MINNESOTA MINING & MFG CO

NATIONAL ADVERTISING CPY MINNESOTA MINING & MFG CO RIKER LABORATORIES INC MINNESOTA MINING & MFG CO

TRIM LINE INC MINNESOTA MINING & MFG CO

3

Methods for name matching

In this section we describe recent developments in methodologies for integrating different IP databases and company directories, methods that have been inspired by some interesting insights from bioinformatics.

Over the past years, biological science has become increasingly concerned with the analysis of large amounts of information. Consequently, the way that information is stored, managed, visualized, and searched has increased in importance. Named entity recognition (NER) for biomedical applications, i.e. the task of identifying gene, protein, diseases, and other names in natural text, has become a crucial means to extract highly valuable and sometimes hidden is hard-to-find information. The NER approach has the potential for interesting applications to economics and management science, especially in the area of the information integration of company-level data.

In the following sections we will discuss the two different approaches to dataset merging: the dictionary-based approach, which relies on the collection of large datasets of names and name variants, and the rule-based approach which builds a set of rules for similarity links across different entity names. The latter uses some of the methods gleaned from bioinformatics.

3.1 The dictionary-based approach

Dictionaries essentially are large collections of names, serving as examples for a specific entity class. Matching dictionary entries exactly against text is a simple and very precise NER method, but typically yields a low level of match when applied to firm names. To compensate, one can either use approximate matching techniques, or try to ‘fuzzify' the dictionary by automatically generating typical spelling variants for every entry. The extended dictionary is then used for exact matches against the text.

Previous attempts have addressed this issue by implementing ad-hoc matching procedures to reduce the cost of data standardization and integration. For example, Thomson Scientific's Derwent World Patent Index (2002) is constructed by assigning a code to about 21,000 patenters. This index accounts for legal links between parent companies and subsidiaries thus achieving a legal entity standardization. This task requires substantial manual, labor-intensive work. Moreover, the matching is only available to subscribers of the Derwent service.

(14)

Drawing on the Derwent methodology, Rachel Griffith and colleagues at the Institute of Fiscal Studies (IFS) have standardized the names of a sample of UK patenters of Triadic patents and matched them with the standardized names of companies contained in Bureau van Dijk’s Amadeus database (Griffith et al. 2006). Only identical standardized names found in the two datasets are matched by the IFS using this procedure.

Another example of a dictionary of patenter names is the USPTO CONAME file compiled by the USPTO. This uses a semi-automatic standardization procedure which focuses on the first-named patenter reported in the patent document. For patents granted after July 1992 the patenter name is standardized and matched automatically with other standardized names in the same dataset. New patenters that are not matched automatically with standardized names in the dataset are matched manually. For instance, the entry of a new patenter whose standardized name does not match any previously standardized names can be matched by investigating the names and locations of the inventors. The CONAME file accounts for changes or variations in patenter names but does not account for legal links between patenter names. Moreover, similar names with a different legal form or the same legal entity from different countries are not matched.

The EPO has developed its own dictionary by assigning a standard code to each patenter filing a patent to the office. This index is created by taking into account not only the patenter name and country but also her postal address. According to some interviews we did with EPO representatives this dictionary tries to maximize precision vis-à-vis recall rate for each entry: for example two patenters with the same name but having different addresses will constitute separate entries in the EPO dictionary and they are linked to two different standard codes that identify them.

More recently, a group of researchers from the Katholike Universiteit Leuven (KUL) have developed an automatic methodology based on the detailed standardization of patenter names and perfect matching of names. This methodology, like the CONAME file and EPO standard codes, does not try to establish legal links among patenters. The main advantage of this procedure is high precision, i.e., a limited number of false matches. The KUL methodology has been used to standardize and match patenter names from EPO patent applications published between 1978 and 2004 and USPTO granted patents published between 1992 and 2003 (Magerman, Van Looy and Song, 2006).

According to the KUL methodology the creation of a dictionary for company names can be articulated in preprocessing and names standardization. Names standardization requires a series of tasks like punctuation standardization (e.g., from FERRARI ,& C. to FERRARI, & C.) and company name standardization (from FERRARI, & C. to FERRARI, AND COMPANY). The main standardization operations suggested by Magerman, Van Looy and Song (2006) can be summarized as follows: i) character cleaning; ii) punctuation cleaning; iii) legal form indication treatment; iv) spelling variation standardization; v) umlaut standardization; vi) common company name removal; vii) creation of an unified list of patenters.

For US patent assignee names, a major effort to update the existing NBER patent citations database and match to the Compustat files is now underway (Cockburn et al. 2009). The matching of patent assignee names with the names of firms on the Compustat files is part of

(15)

this project.16 A number of enhancements to the original (1999 and 2002) databases have been made. First, the semi-automatic standardization procedure of this file has been extended to all the assignee names in the case of multi-owner patents. Second, using external sources originating from business directories information was collected on the timing of name and ownership changes of the assignee. The data now provided contains information that allows tracking of the assignee changes of ownership over time. Third, there is progress on standardizing the firm names supplied by the USPTO and correcting cases where the USPTO had coded the type of entity (individual, firm, government) incorrectly. The list of entity types has been expanded to include universities, non-profit research institutions, and medical institutions including hospitals. However, much of the standardization work in the current (2009) version of the database is still incomplete, especially that involving non-US patent assignees.

3.2 The rule-based approach

Rule-based approaches build on the definition of rules to compare the similarity of names. Early systems used hand-crafted rules to describe the composition of named entities and their context. For instance, some core words and components of words might be used to extract candidates for more complex names. These core terms are expanded according to a set of syntactic rules. Similarly, starting from more complex names one could invert the process to identify some discriminating core words using the same rules.

In the following, we will focus our discussion on the potential usefulness of names similarity functions based on the so-called approximated string matching (ASM) algorithms (Thoma and Torrisi, 2007). However, it is worth remembering that the ASM method constitutes only a specific class of similarity rules. The applicability of other matching methods to company name matching should be analyzed in future research.

The first category of ASM similarity functions is based on the edit distance. For instance, the Levenshtein distance between two strings is defined as the minimum number of operations needed to transform a string into another one. The transformation of a string can be obtained by character inserting, substituting, swapping or deleting (Levenshtein, 1966). An extension of the Levenshtein edit distance was developed by Smith and Waterman (1981). The main difference is that character mismatches at the beginning and the end of strings are ignored in the calculation of distance. For instance, the two company names ‘Dr Michal White Plc' and ‘Michael White Plc, Dr' have a short distance using the Smith-Waterman distance.

The similarity between two strings x and y of length nx and ny can be computed as 1-d/N,

where 1 is the maximum similarity, d is the distance between x and y and N = max{nx , ny}.

To calculate the distance between two strings we need to assign a cost c to each operation required to transform the string x into string y (or vice versa). The cost is assumed to be 1 for substitution and deletion of a character and 0 for perfect matching characters. For instance, the edit distance between IBM and INTEL is the following:

16_{The original 1999 database is at}_{http://www.nber.org/patents}_{with 2002 updates at} http://www.econ.berkeley.edu/~bhhall/patents.html .

The latest (2006) version is at https://sites.google.com/site/patentdataproject/Home . The match documentation is at http://www.nber.org/~jbessen/matchdoc.pdf .

(16)

1 - [ ( , )c I I +c B N( , )+c M T( , )+c( , )φ E +C( , )] / 5 φ L = 1- 4 / 5 1/ 5.=

The second category of ASM similarity functions relies on token-based distance. Measures of token distance, like the J similarity index, are based on the division of strings into tokens or sequences of characters. Token-based distance functions account for differences due to the position of the same tokens between otherwise identical strings (e.g., Peter Ross and Ross Peter). In particular, the J token distance computes the fraction of common tokens, after breaking up the strings into words at the blank spaces. The J token distance is simply given by the number of common tokens in two names and the count of total number of tokens in those names, that is:

( , ) 1 X Y X Y X Y J X Y X Y X Y ∩ ∪ − ∩ = − = ∪ ∪ (1)

where X ∩Y measures the number of common tokens between strings X and Y while X∪Y measures the total number of distinct tokens.

To account for common tokens, we multiply each token by a weight that is inversely proportional to its frequency in the dataset. Formally, each token i has a weight wigiven by

1 log( ) 1 = + i i w n

where ni is the frequency of the token in the dataset. This weighting method is a simplified

version of the tf–idf weight (term frequency–inverse document frequency) of Salton and Buckley (1988).

To reduce the computational complexity of the J similarity index we approximate the second term of equation (1) as follows:

2 X Y

X Y

∩

+ (2)

where the denominator is the sum of all tokens, including those tokens that are contained in both strings. This may result in some double counting. On the other hand, it would be extremely costly from a computation viewpoint to find tokens common to two strings (company names). To maintain the same approximate scale we have multiplied the index by a factor of 2.

Thus, the weighted Jw_{distance is equal to the following expression:}

| | | ( , ) 1 2 k i j k k x X Y w i j i x X j y Y w J X Y w w ∈ ∩ ∈ ∈ = − +

∑

(3)

Where xi∈X and yi∈Y and wi and wj are the weights inversely correlated with the frequency of

(17)

3.3 Enhancing the value of existing dictionaries

In the previous sections we discussed the drawbacks of the dictionary approach due to the low match rate when perfect matching is implemented using the dictionary entries. One suggestion for overcoming the drawbacks of perfect matching is the use of approximate matching based on the string and token similarity functions described earlier. In this section we discuss an additional source of standardization of patent holder names that relies on the priority links across patent offices.

A priority link emerges when a patenter claims a priority date antecedent to the filing date of a given patent. Typically priority links refer to patent documents in other patent offices and the set of patents (or applications) filed in several countries which are related to each other by one or several common priority filings is generally known as a patent family.17 A patent family is sometimes also defined as all patents that protect the same basic invention. The rapid growth in the number of patent documents in recent years have been accompanied by the growth of priority links and resulting patent family size, and hence the total number of inventions has grown a than the total number of patent documents.

Thus, if a patenter has filed a document in two or more offices claiming a common priority date, it is possible to trace a link from an entry in the patenter names dictionary in one patent office to the corresponding entry in another patenter names dictionary in the other patent offices, on the assumption that the ultimate owner of the patent will be the same at both offices.18 Based on this assumption, in this section we describe an additional harmonization method for patenter names using priority links across USPTO and EPO patent databases. The objective of the analysis is to assess whether the priority links between US and EPO patents can improve the accuracy of harmonization of two existing dictionaries of patenter names: the USPTO CONAME file and EPO standard names. This methodology may allow us to propagate the matching done with one dictionary to the other, reducing the cost of implementation of such matching across the two dictionaries.

Figure 1 shows how we link these two dictionary files. In TASK 1 we start from the USPTO CONAME file made up of 237,666 distinct patenter names. The file has information on all patent-holders that have been granted at least one patent by the USPTO over the period 1963-2007. The USPTO CONAME file can be easily interfaced with the PATSTAT database through the patent publication number (TASK 2). Subsequently in TASK 3, using the PATSTAT database we can identify the priority links from and into the EPO patent database; in particular we rely on the INPADOC patent family definition. In TASK 4 we use the priorities compiled from PATSTAT by linking each EPO application to the US priority date

17_{Patents that refer to earlier patents in the same patent office as their priority are called continuation (at the}

USPTO) or divisional patents (at the EPO and the USPTO). Because patents are sometimes divided in different ways at different offices and members of a family at one office may claim different priorities elsewhere, there is more than one definition of a patent family. See Harhoff (2008) for a fuller discussion of this issue. We have used the INPADOC definition, which has been included in PATSTAT for the first time in the release of October 2008.

18_{It is important to note that this assumption will not always hold. A patent-holder with headquarter in one}

region may decide to sell patents held in another region. We have no firm data on the extent to which this is the case. However, our manual inspections of the data have shown that in the vast majority of cases, the patents are under the ownership of the same entity or at least the same multinational firm (with subsidiaries in the respective regions).

(18)

patent via the publication number. Finally, we deal with the identification of the proper link to the EPO patent-holder names in TASK 5, which takes account of the number of priority links and number of patenters per EPO patent. We used a string similarity algorithm and also manual checking to ensure the proper association across the USPTO assignee codes and EPO applicant codes.

Figure 1. Harmonization tasks based on priority links: data and sources

The final list of EPO applicant codes with a US standardized name includes around 158 thousands patenter names corresponding to about 70 thousands names on the USPTO CONAME file. This USPTO/EPO dictionary contains about 77.1% of the EPO patent applications filed by business organizations and 77.7% of the US granted patents filed by business organizations. The overall gain in the harmonization of the EPO applicant code is about 55.8%. Thus this approach significantly increased the quality of the EPO standard name codes file by using the USPTO CONAME file.

4

Implementation

4.1 Software prototype

In this section we summarize the structure of the software prototype for the creation of the dataset.

The first software module regards the cleaning phase, i.e., the development of a dictionary. We built on the previous contributions by Magerman et al. (2006) and Cockburn et al. (2009),

TASK 1: Retrieval of USPTO assignee names Source: USPTO standardized assignee names file 3,758,421 patents 238,344 names TASK 2: Retrieving the USPTO patent publication numbers Source: PATSTAT April 2009 Table tls211 11,132,167 patent documents TASK 3: Merging using priority links across EPO & USPTO patents Source: PATSTAT April 2009 Table tls219 4,492,601 priority links TASK 4: Retrieving the EPO patent publication numbers Source: PATSTAT April 2009 table tls201 2,309,079 patent application documents TASK 5: Merging to obtain EPO/PCT standard names Source: EPOLINE source files April 2008 3,480,628 patent documents 588,525 names

(19)

adding some additional cleaning operations. The final list of operations is summarized by the following sequence:19

1. Transformation into upper case to simplify matching. Addition of a blank space at the beginning and end of the string to facilitate word-based tests.

2. SGML and HTML codes substituted by the ASCII/ANSI equivalent, such as for example “&OACUTE;” replaced by “O” etc.

3. Proprietary character codes replaced by the ASCII/ANSI equivalent. 4. Each of the accented characters is replaced by its unaccented version.

5. Removal of frequent comma, double quotation mark irregularities and other period irregularities and non-alphanumeric characters.

6. The conjunction ”and” and its translations into other languages are standardized as “&”.

7. Umlaut harmonization by reducing variations such as “ue”, “ae”, and “oe” to respectively “u”, “a”, and “o”.

8. Removal of common company words like INC and AB in descending order of their length.

9. Replacement of spelling variations with their harmonized equivalent for some frequent words (such as INTL for INTERNATIONAL and its variants).

10.Removal of the round parentheses and cleaning their content; typically this content consists of geographical information or former company names.

11.Removal of multiple blank spaces, replacing with a single space.

12.Generation of a unique list of patenters by removing duplicates after cleaning.

13.Linkage and indexing to the USPTO20 and EPO standard name codes.21 The goal of this operation is to achieve consistency and interoperability of the harmonization and matching files generated in this paper and future updates of the USPTO and EPO patent files that rely on these codes.

The second software module deals with the matching phase. We relied on a rule-based approach based on the approximate string-matching algorithms discussed above and in Thoma and Torrisi (2007). In particular we adopted the Jaccard weighted distance operator as

19_{We implemented these operations in a Java software prototype. An equivalent SQL query including these}

operations is available under request.

20_{We extracted USPTO assignee names the so-called CONAME file. Source: USPTO, CD version March 2007.} 21_{The source of the EPO standard name codes for EPO and WIPO/PCT patent dataset is the weekly EPOLINE}

(20)

described by equation (3). Compared to the edit distance this operator has several improvements. Firstly, it is easier to deploy in software and it is characterized by a faster computation time. Secondly, it is more conservative than edit distance in establishing a similarity link across two different names because it allows for variation across tokens but not those within the tokens: we think that in the case of business entity names variation across tokens are more frequent with respect of those within tokens. Indeed we operated a deeper cleaning phase in the first software module – compared to by Magerman et al. (2006) and Cockburn et al. (2009) - in order to solve the business entity name variations within tokens. Thirdly, the operator described by equation (3) ensures an interesting trade-off of statistical relative weights for discriminating and non-discriminating tokens which makes it more suitable than edit distance for solving name variations as those described by Box 1 and 2. The third software module consists of some fine tuning operations included the refinement of predicted matching candidates, the resolution of abbreviations and of multiple matching occurrences of the same patenter, and formatting of the output files to be easily processed by electronic spreadsheets and statistical software packages.

4.2 Dictionary creation

We used our software prototype to create and integrate a large dataset of patenters originating from a high number of countries. The final results of the procedure for the creation of a patenter names dictionary for EPO and PCT/WIPO dataset are depicted in the accompanying tables:

• Table 2 reports the country distribution of the business applicants and applications in the EPO/PCT dataset.

• Table 3 reports the country distribution of the non-business organization (NBO) applicants and applications in the EPO/PCT dataset.

• Table 4 reports the country distribution of the individual applicants and applications in EPO/PCT dataset.22

• Figure 2 reports the distribution across the top 18 countries of the reduction in the number of applicants after name harmonization using the software prototype described in the previous section. The overall reduction of the size of the dictionary is about 28.8%.

[Tables 2-4 about here] [Figure 2 about here]

22_{Unlike the USPTO, the EPO does not provide a standardized identifier for this kind of applicant. Typically}

the names of individual applicants are supplied in the format “Surname”, “First name” “Middle name(s)” if any. Hence, the identification of the individual applicants has been based on a stepwise heuristic procedure. Starting from all the applicants that include a “,” in their name and have more than one token, we first excluded those applicants which had a typical common company word in their name such as INC, LTD and AB. Second, we removed those applicants with tokens related to non-business organizations. Third, we removed applicants with

(21)

4.3 Matching with business directories

In this section we report the results of the merge of patenter names with business directories. In particular, we retrieved business and ownership information for the companies from Amadeus by Bureau Van Dijk, which collects information from approximately 10 million European firms and their subsidiaries at the worldwide level.

We extracted all the historical information included in the Amadeus CD version files during the period 1998-2006. For each December release we retrieved information on:

• company demographics such as country, city, region, zip code, date of incorporation, industry 3 digits core code.

• unique identification number, which relies on national company identifiers such as VAT number, Chamber of Commerce numbers, etc.

• ownership structure such as subsidiaries and shareholders

• changes in the company names and some additional information.23

We start with the EPO/PCT applicant names for two reasons: First, thanks to the US/EP dictionary described in the previous section we can transfer the matches to a large share of the patenters at the USPTO. Second, exploiting the PCT links we can propagate this dictionary to a significant number of patenters, those holding a large majority of the patent documents in PATSTAT.

In matching EPO/PCT patenter names to the business directories we focused only on the business patenters, which constitute about 63.4% of the patenter names in the EPO and about 54.8% in the PCT system. Overall they encompass about 337 thousands original names that have been harmonized to about 240 thousands names according to the dictionary described in the previous section. About 43.3% of the patenters have just one application in the EPO and about 0.6% have more than 100.

We matched patenter names only if they also came from the same country, that is, the same nationality of the patenting entity in the EPO/PCT dataset and company in Amadeus. The results of the matching to the Amadeus business directories are depicted in the following figures:

• Table 5 and Figures 3a and 3b report the share of the business applicants in the EPO and PCT dataset that have been matched to Amadeus.

• Figures 4a and 4b report the share of the business applicants in the EPO and PCT dataset that have been matched to Amadeus, weighted by their number of patent applications.

We also matched the USPTO patenters to the Amadeus dataset. This task was in two steps: in the first we identified the matched EPO/PCT patenters (Table 5) that have been active also in the USPTO using equivalents link file discussed in section 3.3. In the second step, for the

(22)

residual USPTO patenter names not matched using this file we matched directly to Amadeus using the software prototype described in section 4.1.

• Figure 5 shows the share of the business assignees in the USPTO dataset that have been matched to Amadeus, unweighted and weighted by their number of patent applications.

[Table 5 about there] [Figures 3, 4, 5 about here]

5

Robustness checks and quality analysis

In this section we present some quality measures for the dataset developed in the previous section in the hope of quantifying potential false positives (Type I error) and false negatives (Type II error). A full-fledged analysis of this topic is beyond the scope of this paper and will be addressed by future research. Indeed, it requires not only a broader theoretical discussion about the entity matching process but also a deeper investigation of the generation of business directories such as Amadeus. Phenomena that potentially could influence the matching process include: relocation and reincorporation of the business activities, a firm’s ownership structure, mergers and acquisitions, and others. In this direction we limit the discussion to some aspects that could enable a more efficient use of the dataset developed in section 4.

5.1 False positives

For every matched pair we computed a quality of match score based on the similarity of the name and location of the information in our patent data and the data in the Amadeus business directory. The name similarity is measured as share of the total tokens over the total number of tokens in the two names, whereas the location is given either by the city or zip code correspondence. Table 2 shows the match score definitions.

Table 6

The quality of entity match score

Score Name Location

0 Manual check Manual check

1 Similarity >= 50% Same 2 30% <= Similarity <= 50% Same 3 Similarity >= 50% Unknown 4 Similarity >= 50% Different 5 30% <= Similarity <= 50% Unknown 6 10% <= Similarity <= 30% Same 7 30% <= Similarity <= 50% Different 8 10% <= Similarity <= 30% Unknown 9 10% <= Similarity <= 30% Different

For the EPO/PCT dataset the distribution of the match quality scores is reported in Figure 6. 90% of the applicants are characterized by a high matching score, that is a value less than or equal to 4. Similarly, Figure 7 shows the distribution of the match quality scores for the

(23)

[Figures 6 and 7 about here]

A second quality measure is given by the patenting lag, that is the number of years since the birth date of the patenting firm before it files its first patent application. We would expect this lag to be greater than or equal to zero. A large negative patenting lag could be a symptom of potential false positive matching. However, performing this check is hampered by the fact that the company birth date information is not usually reported in business directories such as Amadeus. Often business directories give the date of incorporation, which is when the company took the limited liability legal form. This date can be later than the founding date if the business venture started with self-employment or other legal company forms.

In spite of this limitation we created a patenting lag distribution relying on the incorporation date. This lag was computed as the difference between the filing year of the earliest patent and the incorporation year in Amadeus plus one year. Table 7 reports the distribution of the patenting lag for the different levels of the match score. Overall we have a negative patenting lag for about 7.4% of all the matched patenters and for 6.1% of the patenters that are active after year 2000, which is when Amadeus began its very wide pan-European coverage. Moreover, for patenters with a low match score there are visibly smaller proportion of negative patenting lags (values 1, 2 and 3). On the one hand, this finding shows the validity of the matching performed in section 4.3 and usefulness of the match score indicator. On the other hand, it invites further investigation of the matched names having a negative lag in order to check for the presence of potential false positives.24

Table 7

Negative patenting lag by different levels of the match score All Patenters Patenters active after 2000

Score obs % obs %

0 1,385 13.9% 1,188 10.9% 1 32,748 5.2% 29,748 4.6% 2 3,819 9.1% 3,470 8.5% 3 1,189 10.0% 1,081 6.8% 4 11907 19.9% 9,846 7.4% 5 164 25.0% 138 18.1% 6 390 13.1% 360 11.4% 7 2,098 15.2% 1,745 12.6% 8 4 0.0% 4 0.0% 9 0 0.0% 0 0.0% Overall 53,704 7.4% 47,580 6.1%

A third and final analysis of the Type I error problem was performed by analyzing manually a sample of a hundred firm names drawn randomly from among the European business applicant names at the EPO.25 The dataset developed in section 4 identified 76 matched

24_{A preliminary analysis revealed that negative patenting lag was often associated with the re-incorporation of a}

company after the decision to relocate its business activities or with a merger or acquisition.

25_{The harmonization and matching operations were performed only within the sample of business applicant}

(24)

applicants to Amadeus, whereas 24 names were not matched. Out of these 76, three were false matches, that is 3.9%. In particular, they were the following:

i) Patenter DRALORIC ELECTRONIC GMBH matched with company ELECTRONICS (DE715000111). Location zip or city is unknown (match score 3).

ii) Patenter COPRECI S COOP LTDA matched with company BATZ S COOP (ESF48037600). Location zip or city is matched (match score 1).

iii)Patenter VAN BUUREN VAN SWAAY matched with company VAN SWAAY BEHEER (NL09085054 ) Location zip or city does not match (match score 4). Example i) is characterized by a high Jaccard similarity although the company name consists of only one non-discriminating token. In addition, the correspondence location zip and city were unknown and hence the value of the match score was equal to 3. In this example, adding further quality criteria, the user of the dataset could reduce the ambiguity of the name similarity. In particular names with no discriminating tokens might need to be treated separately.

Example ii) is a false positive because the match is identified by one common company word (“COOP”) and one non-discriminating token (“S”). Moreover the correspondence of the location was verified so the match score was 1. This example is similar to the previous in that the names matched only on non-discriminating tokens.

Example iii) shows that the match is made up by one discriminating token (“SWAAY”) and one not discriminating (“VAN”). However, the non-discriminating token is repeated two times in the patenter name which increases artificially the similarity of the two names. This example suggests that it might be useful to remove duplicate tokens in the names before matching.

5.2 False negatives

In general, the assessment of the Type II error (failing to match an applicant that should be matched) is more difficult because it requires broader definition of matching criteria. In this section we describe two approaches.

The first approach deepens the analysis of the one hundred patenter names that were examined in the previous section. We reported that 24 patenter names out of 100 were not matched to any company in Amadeus. First, we found that among them were 2 individual applicant names and one non-business organization name, that is about 10% of the unmatched applicants. On the one hand, this finding shows some drawbacks in the identification method of the institutional type of patenter names used in Figure 3.26 On the other, it suggests that the real coverage of the matching depicted in Figure 4 could be even higher; in particular for the European patenters the actual matching to Amadeus could be as high as 80% of the EPO standard codes.

(25)

Second, we found only one case of a false negative, that is:

- Patenter name DISTEC should have been matched with DISTEC VERTRIEB VON ELEKTRONISCHEN BAUELEMENTEN (DE8330189835)

This example could be explained by the presence of the non-discriminating token “DISTEC”. This common word generates a low similarity score in the expression of eq. (3) because company name is made up of four other tokens, two of them discriminating and hence with high statistical score. This example suggests that it might be worth exploring other token based similarity functions.

Thirdly, there were 20 patenter names that could not be found in Amadeus business directory. The full-fledged identification of the origin of these applicants in terms of sector, age and size is beyond the scope of this paper. There are several explanations for the fact that these firms are not found in Amadeus (1998-2008). First, they may have been included in Amadeus after the date of our version. Second, some of these patenters are probably not limited liability firms. Third, their names may have changed since the patent was applied for but before they incorporated. Fourth, they have been incorporated in a different country from the one given on the patent. Fifth, they could have exited, merged and been acquired by another firm before the year 1998.

The second approach for the assessment of Type II errors analyzes the names of some large EU firms which have reported R&D activities. Typically, the R&D process is accompanied by patenting activities and a large share of patenting is done by large R&D performers. Thus focusing on firms with R&D reduces the cost of searching for false negatives. This analysis involved extracting some uniquely identifying keywords from the names of some large R&D performers and checking whether they were matched.

Table 8 reports a list of discriminating tokens that identify uniquely about a hundred large R&D firms from Europe. These tokens have been extracted from the company names of the top 2,197 European R&D doers.27 We selected these tokens randomly from the single token company names that were highly discriminating.

The fact that these tokens are discriminating does not necessarily mean that the simi