Development of an information retrieval tool for biomedical patents

(1)

Contents lists available at ScienceDirect

Computer

Methods

and

Programs

in

Biomedicine

journal homepage: www.elsevier.com/locate/cmpb

Development

of

an

information

retrieval

tool

for

biomedical

patents

Tiago

Alves

a ,b

_,

_Rúben

_Rodrigues

a

_,

_Hugo

_Costa

b

_,

_Miguel

_Rocha

a ,∗

a_Centre_{Biological Engineering, University of Minho, Braga 4710-057, Portugal} b_Silicolife_{Lda, Braga 4715-387,}_Portugal

a

r

t

i

c

l

e

i

n

f

o

Article history: Received21August2017 Revised7February2018 Accepted12March2018 Keywords:

Biomedicaltextmining Informationretrieval Informationextraction Patents

PDFtotextconversion

a

b

s

t

r

a

c

t

Backgroundandobjective:Thevolumeofbiomedicalliteraturehasbeenincreasinginthelastyears.Patent documentshave alsofollowedthistrend, beingimportantsources ofbiomedicalknowledge,technical detailsandcurateddata,whichareputtogetheralongthegrantingprocess.TheﬁeldofBiomedicaltext mining(BioTM)hasbeencreatingsolutionsfortheproblemsposedbytheunstructurednatureof nat-urallanguage,whichmakesthesearchofinformationachallengingtask.SeveralBioTMtechniquescan beappliedtopatents.Fromthose,InformationRetrieval(IR)includesprocesseswhererelevantdataare obtainedfromcollectionsofdocuments.Inthiswork,themaingoalwastobuildapatentpipeline ad-dressingIRtasksoverpatentrepositoriestomakethesedocumentsamenabletoBioTMtasks.

Methods:Thepipeline wasdevelopedwithin@Note2, anopen-sourcecomputational frameworkfor BioTM,addinganumberofmodulestothecorelibraries,includingpatentmetadataandfulltextretrieval, PDFto textconversion and opticalcharacter recognition.Also,user interfaces weredevelopedfor the mainoperationsmaterializedinanew@Note2plug-in.

Results:Theintegrationofthesetoolsin@Note2opensopportunitiestorunBioTMtoolsoverpatent texts,includingtasksfromInformationExtraction,suchasNamedEntityRecognitionorRelation Extrac-tion.We demonstratedthe pipeline’smainfunctions withacasestudy, usinganavailablebenchmark datasetfromBioCreativechallenges.Also,weshowtheuseoftheplug-inwithauserqueryrelatedto theproductionofvanillin.

Conclusions:Thisworkmakesavailablealltherelevantcontentfrompatentstothescientiﬁc commu-nity,decreasingdrasticallythetimerequiredforthistask,andprovidesgraphicalinterfacestoeasethe useofthesetools.

1. Introduction

Biomedical Text Mining (BioTM) is a sub-field of Text Mining thathasbeenappliedtobiologicalandbiomedicalliterature, seek-ing toautomate theprocessingofthe hugeamounts ofdatathat aregeneratedeverydayincludingpapers,reportsandpatents [1,2] , amongothers.Sincethesetextsarewritteninnaturallanguage,in an unstructuredform (without annotations aboutthe text struc-tureandavailableentities),searchingforstructuredinformationby handisquitetime-consuming [3] .Toautomatethisprocess,BioTM approachesbecomecrucial,allowing theextractionofmeaningful knowledgefromtexts,asawaytoformulatescientifichypotheses more easily [4,5] . Towards that aim, BioTM makes use of differ-ent knowledge fields, such as Statistics, Artificial Intelligence or Management sciences, combined with numerous text analytics

∗ _{Corresponding}_author.

E-mail address: [email protected](M. Rocha).

components as Information Retrieval (IR), Information Extraction (IE),NaturalLanguageProcessing(NLP),amongothers(Fig. 1 ) [6,7] . IR methods deal with information resources, gathering meta-data anddocuments froman extensive collection of sources. On theotherhand,IEaddressestheextractionofpertinentstructured information from human readable documents, using computer tools [8] .

NLP includes a set of computational and linguistic methods to be applied over texts, as spelling correction or document indexing. These systems can split texts in tokens or words (in a process called tokenization), perform word grammatical clas-siﬁcation (using part-of-speech (POS)), group words by families (with lemmatization processes), ﬁnd the syntactic structure of expressions,amongothers [9] .

As it is shown in Table 1 , several BioTM platforms has been developedbythescientiﬁccommunityusingdistinctprogramming languages, and applying different IR and IE methodologies (the tableonlyreviewsafewtoillustratethediversityoftheﬁeld).

(2)

Fig.1. Textmining,thetextanalyticalcomponentscontributingtothiseffortand theknowledgeﬁeldsfromcomplementaryareas.Adaptedfrom[7].

@Note2,1 developed by the University of Minho and the Sil-icoLife company is among these efforts, being an extended and reformulatedversion of the original @Note application, originally releasedin2009 [10] .@Note2 isa purely JavaBioTMframework, whichusesarelationaldatabaseandisbasedona plug-in archi-tecture,allowing the developmentofnewtools/methodologies in theBioTMﬁeld.Structurally,thissoftwarecanbethebasisforthe developmentofnewservices,algorithmsorgraphicalcomponents, following the clear separation betweencore BioTM libraries and theGUItools,theuserinterfacesectionof@Note2(Fig. 2 ).

The corelibrariesareorganizedinthreemainfunctional mod-ules:the PublicationManager Module(PMM),whichsearches for documents on onlinerepositories (IR Search process) and down-loads their respective full-text documents (IR Crawling process); theCorpora Module (CM), which performs corpora management, creates and applies IE processes over these corpora, such as NamedEntityRecognition(NER)andRelationExtraction(RE),and provides a manual curation system; and the Resources Module (RM), which allows creating and editing lexical resources to be usedinIEprocesses.

Theuserinterfacetoolsallowafriendlierandmoreproductive interaction withthe user, that is able to conﬁgure anduse each functionality [10] .Therefore,developerscancreatenewalgorithms andtools, which can be used by users withminimum effort. At

1_{http://anote-project.org/}_.

can be used in the development of other applications besides @Note,providinganAPIforthird-partydevelopers.

So,inrecentyears,alongsidewiththeinterestgiventothe in-formationprovidedinscientiﬁc papers,therehasbeena growing interest in patents. To explore their data is crucial to under-stand many biological ﬁelds, since they have detailed technical descriptions of inventions put together as part of the granting process.Furthermore,thatinformationisrarelydisclosedinother publisheddocuments [17–19] .

Patents areclassifiedintohierarchicalcategorieswhichcan be usefulwhentheinterestisonlyinaspecificknowledgefield(such asbiomedicine) [20] .Duetothehugenumberofpatentsgenerated every year, there are several databases able to store these data. Forinstance,theWorld IntellectualPropertyOrganization (WIPO) databasehad2.7millionpatentsregisteredonlyin2014 [3,21,22] . Sinceeverypatenthasageographicalregionwhereitisactive, there are both world-wide and localized patents. In both cases, therearespecific patentdatabases.ThePATENTSCOPEfromWIPO or esp@cenet fromEuropean Patent Office (EPO) are included in the former group, with patents with granted protection to all thecountriesintheworld. Thej-PlatPatfromJapanPatentOffice (JPO)orPatFTfromtheUnitedStatesPatentandTrademarkOffice (USPTO)are databaseswithpatentsfromJapanandUnitedStates, respectively,includedinthelattergroup [18] .

Although several patent databases are available to search and retrieve documents, the access to large amounts of data is restricted. Severalstudies were made, recently, trying to address thisproblem,usinginformationfromseveralpatentsections.

Heifets and Jurisica built SCRIPDB (a chemical structure database) used USPTO bulk download from Google servers to retrieve patents available since 2010 [23] . Papadatos et al. used patentdatafromUSPTO,EPO andWIPO,together withtitlesand abstracts from the JPO (processed in the XML format) to build SureChEMBL, a database with compound’s structures. For patent data processing, they used IFIClaims,a third-party globalpatent database aggregating patent information from over 40 national andinternationaldatasourcesintoasinglerepository [24] .

A related effort on patent IR is the TREC Chemical IR Track (TREC-CHEM),aninitiativethathasstartedin2009withthe sup-portofNIST(NationalInstituteforStandardsandTechnology) [25] . ThiseffortfocusesontheevaluationofIRmethodsandknowledge discovery of storedinformationon chemical patents andarticles. Althoughbeingan interesting approachto benchmarkIR systems thatseekdocumentsbasedonrelevance forgivenqueries,itdoes notaddresstheproblemsofpatentrecovery,mainlyrelatedtothe full text documents, tasks that will be handled by our proposed softwaretools.

A possible solution for full text retrieval is using Portable DocumentFormat (PDF)filesofeach publishedpatent, whichcan be accessed on patent databases.However, thesefiles havetheir data saved into scanned pages (usually image filesin BMP, TIFF,

(3)

Fig.2. @Note2mainstructure.The@Note2GUIhasmethodsthatimplementgraphicallythefunctionsfromthecoretextmininglibsrunningintheback-end.

PNGorGIF formats).So,when thereistheneed toextract infor-mation fromthis type offiles (suchaschemical structures), itis necessaryto converttheseimagesintostructured representations of standard chemical files format (such as Simplified Molecular InputLineEntrySystem(SMILES)strings) [26] .

To read scanned text, technologies such as Optical Character Recognition (OCR) can be used [27] , allowing the retrieval of machine-coded,readable,editableandsearchabledatafromthese files. OCR is a computer science technology that associates the image ofa character toits symbolicidentity.OCR usesan offline approachtocharacterrecognition,inwhichtherecognitionisdone after the completion of writing or printing and not done when charactersarebeingdrawn [28] .Since OCRsystems arebasedon patterns(numbers,lettersorpunctuationcharacters),thefirststep is to“teach” the machine aboutthesepatterns.Forthat purpose, there are trainingsets with charactersgrouped into classesused to analyze and compare with new examples, assigning them to thebest-scoredclass [29] .

This technology can be summarized in two main processes: the characterextraction andthecharacter recognitionprocess. In theﬁrststep,thepreviouslearnedpatternsareapplied todelimit wordsorindividualsymbolsandthen,inthesecondstep,to iden-tifyeachsymbol [30] .Thisisacomplexprocesswithseveralrobust algorithms andtechniques toanalyze distortions,style variations, translationandrotationmovements,backgroundnoise,etc [29] .

Given the scenario depicted in the previous paragraphs, we realize that a few software tools exist allowing users to search and recover patents from repositories and handle their content. However, to the best of theauthors’ knowledge, no existing free software tool currently is able to perform searches for patents overdifferentrepositories,integratingtheresultsofthosequeries, andobtainingthefulltextﬁlesfortheresultingpatents.Also,the connectionofpatentretrievalwithdownstreamBioTManalysisof thecontentoffulltextsisalmostnonexistent.

In this work, the objective was to develop and test a patent pipeline, added as a new plug-in to @Note2, able to retrieve patents’ metadata andfull texts, using OCR when needed to ex-tracttext fromthePDF ﬁles.Thiswillmakepatentdataamenable tobesearchedandusedasaninformationsourceforIEprocesses [email protected] the methods in our pipeline and demonstrate the usefulness of thedevelopedplug-in.

2. Methods

The patentpipelinecanbe organizedintofourdifferenttasks. It can search forpatent identiﬁers and retrieve patentmetadata

(astitleandauthors),downloadthepublishedpatentPDFfile,and, finally,applyPDF totext conversionmethodologiesto thosefiles. Eachtask wasstructured into a modulewith specific inputsand outputs.Thus,sourcestosearch andretrievepatentIDs,tosearch formetadataabouteachpatent,andtoreturnthepatentfile(s)in PDF format were configured ascomponents ofthe search sources module,metainformationsourcesmodule andretrievalsources mod-ule, respectively. The PDF to text conversion methodologies used wereorganizedinthePDFconversionmodule(Fig. 3 ).

Tomakethepatentpipelinefullyavailable,somespecificaccess keys,which canbe obtainedfromtheregistrationservices ofthe differentrepositoriesmust be used.Althoughnot all components requirethosekeys,eachcomponenthasrestrictionswhenmultiple requests are made. So, to get large amounts of information, all components must be configured. In this way, when the limits imposedbya specificcomponentarereached,theotherscanstill be used.This approachis followeduntil all data are retrievedor allcomponentsforeachmoduleareused.

To start the search process, input keyword(s) are required, which may be, for instance, biomedical entities as chemicals, genes or diseases. These keywords are then processed by the

searchsourcesmodule.Inthismodule,twopopularsearchengines (theCustom SearchAPI fromGoogle andtheBingSearch APIfrom Microsoft) were used. These APIs return patent identiﬁers from GooglePatents,adatabasewitharound87millionofpatentsfrom 17countries [31] .

Alongsidewiththosetwo searchengines,alocalpatent repos-itory, the Power User Gateway (PUG) from PubChem, and the

Open Patent Services (OPS) web services API from EPO were also used. The patent repository was built using the bulk data files from USPTO since January 2005, having a schedule for weekly updates, getting all the new available US patents. These patents were parsed and saved into specific data structures that can be easily accessed. Since all the patents are properly indexed, and the system was built using a RESTful architecture, all the US patentidentifiersrelatedwiththegivenkeywordscanberetrieved withoutrestrictions(Fig. 4 ).

When the input is a chemical name, a PubChem Compound Identiﬁer (CID), an InCHI key ora SMILE string, the PUGsystem from PubChem is autonomously used and, through the web in-terface, all the patents documented as being related with that chemical in PubChem can be retrieved. Forthis last module, an accesstokenisrequired.

The metainformation sources module receives a set of patent identiﬁers and input, and returns the invention title, authors, publication date, an external link to a patent database entry (if available) and the abstract to each patent. When available,

(4)

Fig.3. Summaryofthedesignedpatentpipeline,wherenumbersrepresenttheprocessﬂow.

Fig.4. ExampleoftheusedURItoaccessthelocalrepositorywithpatentfromUSPTOsince2005,searchingforpatentsusingspeciﬁckeywords.

the description and claims sections are also extracted. To avoid repetitions,thepatentfamilyisextractedpreviouslyandonlyone identiﬁeris usedto retrieve metadata,beingthe otherssaved as externalreferencesforthatpatent.

The metadata obtained as a result of this module is stored into a query, a data structure from @Note2 to save this type of information(Fig. 5 ). Three different services can be used in this module:thePATENTSCOPEwebserviceAPIfromWIPO,theOPSweb serviceAPIfromEPOandourlocalUSPTOdatabase,all capableof returningpatents’metadata,givenasetofidentiﬁers.

Theretrievalsourcesmoduletakesasinputasetofidentifiersor aquery data structure,and returns thepatent’s PDF files, saving the path for each into the query (Fig. 5 ). This module includes twoAPIsfromthemetainformation sourcesmoduleusing different configurations: the PATENTSCOPEweb service API from WIPO and theOPSwebserviceAPIfromEPO.

Then, lastly, the PDF conversion module takes all the ﬁles re-turnedfromthepreviousone,andextractstheirtext.Asshownin Fig. 5 ,thisallowsthecreationofacorpusdatastructure,allowing to run IE methods able to retrieve meaningful information from thesetexts,forinstance,performingNERorRE.

Inthismodule,alongsidewiththe version 2.0.1of theApache PDFBoxlibrary(implementedon@Note2), theTess4J,version3.2.1 (developedbyQuanNguyen)wasconﬁguredimplementing Tesser-act (an OCR algorithm from Google), and also a hybrid method combiningthesetwomethodologies.TheApachePDFBoxallowsto extract theUnicode text available on PDF documents.The hybrid methodallows a previous PDF treatment,improving their quality tobeprocessedbytheTess4Jsystem.

Since our patent repository has all the full-text content from each of US patentssince 2005, for those,the text can

(5)

automati-Fig.5. Creationandupdateprocessfor query and corpus datastructures.Thenumbersrepresentthemodulesofthepipelineandtheirflow.Theorange query datafield representstheupdateprocessoftheoriginal query ,whiletheorange corpus datafieldrepresentsthefieldthatturnsthe corpus intoadifferentdatastructure.

Fig.6. ExampleoftheusedURItoaccessthelocalrepositorywithpatentfromUSPTOsince2005,gettingmetainformationforspeciﬁcpatentIDs.

callybe extractedfromourrepository,savingprocessingtime and usingalltheavailablesources(Fig. 6 ).

On @Note2, the patent handling features were inserted in different core libraries. The patent ID search and metadata re-trieval were added as a new IR Search process called “Patent Search”,whilethe patentPDF ﬁle downloadwasaddedasa new IRCrawlingprocess andthenewPDFtotext conversionmethods were putintotheCorporaModuleasa pre-processingtocorpora creation(Fig. 7 ).

3. Results

The pipeline is materialized by an @Note2 plug-in allowing thesearchforpatentsfromGooglePatents,USPTOandesp@cenet repositories. This plug-in has two different steps and so, two differentpaneswerebuilt(Fig. 8 ).Thefirstpanehastwodifferent fields where the keywords andthe query name can be inserted. Usually,itiscommontoleavethequeryfilledwiththepredefined text ([KEYWORDS]:[ORGANISM]:[Date] ), since it is a regular ex-pression that produces a specific nameto that query takinginto accountthesearchdate.

Thesecondpaneistheconfigurationsone,whereallthe differ-entcomponentscan beconfigured torun ourpipeline.Whenwe trytosaveanyoftheaccesskeys,averificationismade, assuring thatallthecomponentsarecorrectlyconfigured.Here,sincethere isalsoasimilarpaneinthe@Note2Preferencessection,thetable can be pre-filled with all the previously configured components andsavedto the database. Inthose cases,the configurations can alsobeedited.

Two different case studies will be described, the ﬁrst more focused on allowing to validate thepatent IR pipeline andtools, whilethesecondwillbeaddressedtoshowtheusefulnessofthe @Note plugin andits interface, together with the potential ofits integrationwithother@Notetools.

Intheﬁrstcasestudy,thecompletetrainingsetforthe BioCre-ativeVCHEMDNERtaskwillbeused.Inthesecond,wewillusea queryrelative tothepatentsregisteredrelatedtovanillin produc-tion.Eachcasestudywillbedescribedinoneofthenextsections.

3.1. BioCreativeVCHEMDNERtask

The BioCreative CHEMDNER tasks were created to foster developments on the BioTM ﬁeld. They are based on the

(6)

imple-Fig.7. @Note2structurewithpatentpipelineimplementations.Theorangeboxesrepresentthenewcomponentsadded.

Fig.8. @Note2PatentSearchplug-in.Thepipelineusesinputkeywords,the query nameandconﬁgurationsprovidedbytheuserorby@Note2settingstosearchfor patentIDsandtodownloadpatentmetadata.

mentationofIEprocessestoseveralchemicalpublications,ﬁnding chemicalmentionsonpapersand, morerecently,onpatent docu-ments.Forthat,setswithpreviously annotateddataareprovided tobethe goldstandardcorpora totrainandtest systemscreated bytheparticipatingteams.

So, since these challenges contain a large known number of patents,we usedone ofthemtotest thepatentpipeline,namely considering the data from the training set of the BioCreative V CHEMDNER task. The dataset comprises 7000 patents, their identiﬁers,respectivetitlesandabstractsections.Thus,usingthat set, bypassing the search sources module, an exact number of documentsareexpected.

As shown in Fig. 9 , the pipeline was applied taking each of the 7000 identifiers in the dataset. Skipping the first module, theremaining will be executed, taking theseidentifiers asinput, to obtain both the metadata and the full texts ofthe respective patents. Then, the abstract section from the dataset can be also usedtovalidateourpipeline.

Tothat, each patentabstractcomingdirectlyfromthe BioCre-ative V CHEMDNER dataset, and the respective patent text obtainedfromthepatentpipelineasdescribedabove,were prop-erly tokenized and compared. In this comparison, we used the Smith-Waterman algorithm, a local Dynamic Programming algo-rithm, to evaluate the matches. Based on the number of tokens thatmatchexactlyonthetwotexts,wecancalculateperformance metricsasprecision,recallandF1score.Throughthosemetrics,it ispossibletoinfertheamountofconversionerrors,aswell asto verifytheamountofdocumentscorrectlydownloaded.

Through the metainformation sources module, metadata was extracted for 6979 patents, which represents a success rate of 99.7%. From those, we get 6440 abstracts, showing that patent

abstracts can also be obtained from patent databases, being a quick sourceof information toseveral BioTMapproaches, similar towhatisalreadydonewithotherscientiﬁcpublications.

Since our interest is on patent full texts, the patent pipeline follows to the next component. From the 7000 patents, 6967 patent PDFs were downloaded, representing a success rate of 99.5%. After retrieving the patents, the next step was to verify whichpatents arewritten inEnglish toexclude old patents(that were then reissued) or those written in a non-English language (being the original PDF ﬁlessaved on the database, without any translation). Through this process, 2456 patents were excluded fromouranalysis,remaining4511.

Then, the Dynamic Programming algorithm was applied to those patents. The evaluation metrics were calculated and the precision values, in means, show that 83.9% of the converted text relativeto theabstractiseffectivelythe sameforbothtexts. This is important because it is possible to infer that there is a low percentageof unrecognizedcharactersor evenbreaks in the recognizedwords.Thisconclusionwasalsosupportedbytherecall valuesthatwere,inmeans,69.8%andbyF1score,whichis76.6% forall the analyzeddocuments. Bothvalues showedalsoa small variancewithalowstandarddeviation(around10%)(see Fig. 10 ).

Ofcourse,that valuewouldbe changedifweacceptedall the PDF to text conversion files. In that case, we obtain, a mean of 89.8% of precision, 39.5% of recall and 54.8% of F1 score. This overallevaluationshowsthatoursystemisnot abletoverifythe content of databasefiles, searchingin the databases for English-onlypatents.So,when apatentfile isavailable ondatabase, itis downloaded evenifit iswritten inanotherlanguage, influencing ourresults.

Totesttheinformativecapacityofpatentfull-texts,asimpleIE task ofNamed Entity Recognition wasperformed using a dictio-nary of chemicals, the JoChem [32] . This dictionary is composed by278,578chemicaltermsandrespectivesynonyms.Theseterms refer to small molecules and drugs from severalsources. To im-prove the dictionary, rule-based term ﬁltering, manual check of highlyfrequentterms,anddisambiguationruleswereapplied.

So, applying the NER process to abstracts, we obtained 6493 annotations (which represents a mean of 2 chemical entities for each patent), while using full textswe get 1,401,142 annotations (which represents amean of414 entities for each patent).These valuesrepresentahugedifferencebetweentheinformation avail-able on the abstracts and full-texts,providing a raw measure of theinterestinobtainingfulltexts.

Toprocessalltheseﬁles,thewholepipelinetookaround3days [email protected].

(7)

Fig.9. InjectionoftheBioCreativeVCHEMDNERtasktrainingsetdataintopatentpipeline,replacingthe search sources module .

Fig.10. BoxplotsfortheevaluationmetricsofthePDFtotextconversionprocess. Themeanandstandarddeviationaregiveninbold.

3.2. Vanillinproductioncasestudy

Toverifyourpipelinerunningarealcasestudy,wedecidedto make a quicksearch onpatents relatedwith vanillin production. The processwasrunon thedeveloped@Note2plug-in,beingthe main steps illustrated in Fig. 11 . As we may ﬁnd on PubChem, vanillin (PubChem CID: 1183) is the primary component of the extract of the vanilla bean. It can be naturally or industrially produced,tobeusedasaﬂavoringagent.

So, asshown in Fig. 11 , we used the builtpatent pipelineon @Note2 to search for “vanillin”. From this search. we expected to obtain all the results related withthis compound (directly or indirectly), withoutanyrestriction. Asa resultofthat search, we found 7215 patents. All those patents were correctly ﬁlled with metadata (in English), and, from those, 7087 patent were ﬁlled alsowiththerespectiveabstracts.

Since the real number of patents related with vanillin is not known, a quick search on PubChemwas made, returning 18,863 patents. This large difference in the number of patents can be explained, since these data comprises all the different patent identiﬁers associated withvanillin. However, a single patent can be registered on different countries with distinct identiﬁers. In the developed patentpipeline, patentsfrom the samefamily are groupedintoasinglepatenttoavoidrepetitions(ascanbeseenon theresultsfrom Fig. 12 ),and,thus,ourresultsavoidredundancy.

Then, usingall the abstracts fromvanillin relatedpatents,the sameNERtaskthatwerunfortheBioCreativeVCHEMDNERcase studywasalsoapplied (againusingtheJoChemdictionary). From

that run,we annotated 38,954 entities,which representsa mean of6annotationsforeachpatent.

Ifwe intended to obtain all the data fromthese patents, the next steps from the patent pipeline could be used, and several BioTMapproacheswouldbeavailabletobeapplied.

4. Discussion

Recently, patents have been a target for BioTM techniques since they are a great source of information for many fields. Basedon@Note2,IRsearchandcrawlingprocessesweredesigned and implemented, allowing the search and retrieval of patent information from several databases, extending the domain of scientific/technicaltextshandledbythisframework.Theretrieved informationincludes the patent metadataand the respectivefull text documents, in PDF format. Also, new improvements were made to the @Note2 PDF to text conversion system, allowing a better conversion of patentfiles to get patentfull-text data in a machine-readableformat.

TestingthepatentIRprocessesproposedinthisworkwiththe training set from the BioCreative V CHEMDNER task shows that 99,7% were ﬁlled with patentmetadata. There are some patents thatarestoredindatabasesintheirnativelanguageand,forthese ones, thefull-text content isnot meaningful.Using the newPDF to text system on the English documents, we got around 81% of F1-score, which represents a good capacity of our system to transformthistypeofﬁles.

Usingthesetextsasastudycorpus,asimpleIEtaskof named-entity recognition for chemical compounds was made, using a dictionary-basedapproach.Weannotatedboththeabstractsection and the full text content. The results showed that the full text-content is, approximately, 200 times richer in annotations than theabstractsection.

Toadaptourpipeline intoareal casestudy, aquicksearch of vanillin production related patents wasalso made. We got 7215 documents, all ﬁlled withmetadata and 98,2% with the abstract sectionincluded.Thiscasewasalsousedtoshowcasethepotential oftheuserinterfaceprovidedbythedevelopedpluginfor@Note2. Themain innovation ofthis work wasthecreation ofnew IR processesappliedtopatents,surpassingcommonproblemsrelated to searching and retrieving those documents, allowing also the posterior implementationof several IE techniques to those texts. Indeed,mostoftheBioTMsystems(reviewedintheIntroduction) do not support processing of patents, and open-source software systemsforpatentretrievalaremostlynon-existent.

(8)

Fig.11. Patentpipelinerunningasearchforvanillin.In@Note2,thepipelineisopenedclickingonPatentPipelineSearchbutton(Paneintheupperleftcorner).

Since @Note2 is an open-source software, this framework opensdoors to the communityto take advantage of all sections of the published patents with biological relevance more easily, andwithout theneedto expend largeamounts oftime browsing severaldatabases.Regarding@Note2,theintegrationofthesetools allows developing an extensive set of text mining pipelines over patents,whichwereonlypossibleforscientiﬁcarticlessofar.

Inthedevelopmentofthiswork,wetookintoaccountsomeof themost importantfeatures for patentretrieval systems, namely to incorporate different sorts of patent search processes, to use knowledgeofthedomain,tousemultiplesourcesofdata,andto providevisualisationofthesearchresults [33] .

The work is still limited for the restricted access of many search engines and databases, which require authentication and impose limits on the amounts of information it is possible to collect. These issues are important limitations to be addressed in future work. Another aspect of possible improvement forthis workistheevaluationofthesystem.Indeed,themetricsofrecall and precision have been disputed recently, and some authors propose thenotionofrisk [34] ,an aspectto considertoimprove theevalutionoftheproposedsystem.

(9)

Fig.12. Patentpipelinerunningasearchforvanillin.Fromtheresults,asinglepatentcanbeselectedtoshowsomedataasthePublicationExternalLinkSourcesonthe MoreDetailstab.

Acknowledgments

This work is co-funded by the Programa Operacional Re-gional do Norte, under the “Portugal2020”, through the Euro-peanRegionalDevelopmentFund(ERDF ),withinprojectSISBI-Refa NORTE-01-0247-FEDER-003381 .

This study was also supported by the Portuguese Foundation for Science and Technology (FCT)underthescopeofthestrategic fundingof UID/BIO/04469/2013 unitandCOMPETE2020 (POCI-01-0145-FEDER-006684) andBioTecNorteoperation (NORTE-01-0145-FEDER-000004) funded by European Regional Development Fund underthescopeofNorte2020-ProgramaOperacionalRegionaldo Norte.

Supplementary material

Supplementary material associated with this article can be found,intheonlineversion,atdoi:10.1016/j.cmpb.2018.03.012 .

References

[1] A.Faro,D.Giordano,C.Spampinato,Combiningliteraturetextminingwith mi-croarraydata:advancesforsystembiologymodeling,Brief.Bioinform.13(1) (2012)61–82,doi:10.1093/bib/bbr018.http://bib.oxfordjournals.org/content/13/ 1/61.full.pdf.

[2] R.Klinger,C.Kolarik,J.Fluck,M.Hofmann-Apitius,C.M.Friedrich,Detection ofIUPACandIUPAC-likechemicalnames,Bioinformatics24(13)(2008)i268– i276, doi:10.1093/bioinformatics/btn181. http://bioinformatics.oxfordjournals. org/content/24/13/i268.full.pdf.

(10)

for molecular biology, Genome Biol. 6 (7) (2005) 224, doi:10.1186/ gb-2005-6-7-224.

[9] A.Clark, C.Fox, S.Lappin,The HandbookofComputationalLinguisticsand NaturalLanguageProcessing.,BlackwellHandbooksinLinguistics, Wiley-Black-well,2010.

[10] A.Lourenço,R.Carreira,S. Carneiro,P.Maia,D.Glez-Peña,F.Fdez-Riverola, E.C.Ferreira,I.Rocha,M.Rocha,@Note:aworkbenchforbiomedicaltext min-ing,J.Biomed.Inform.42(4)(2009)710–720,doi:10.1016/j.jbi.2009.04.002. [11]R.Hoffmann,A.Valencia,Agenenetworkfornavigatingtheliterature,Nat.

Genet.36(7)(2004)664,doi:10.1038/ng0704-664.

[12] D.Campos,S.Matos,J.L.Oliveira,Amodularframeworkforbiomedicalconcept recognition,BMCBioinform.14(2013)281,doi:10.1186/1471-2105-14-281. [13] B.Settles,ABNER:anopensourcetoolforautomaticallytagginggenes,

pro-teins and otherentitynames intext, Bioinformatics 21(14) (2005)3191– 3192, doi:10.1093/bioinformatics/bti475. http://bioinformatics.oxfordjournals. org/content/21/14/3191.full.pdf

[14] H. Cunningham, V.Tablan, A.Roberts, K. Bontcheva,Getting more out of biomedicaldocuments with gate’sfull lifecycleopen sourcetext analytics, PLoSComput.Biol.9(2)(2013)e1002854,doi:10.1371/journal.pcbi.1002854. [15] D.Ferrucci,A.Lally,UIMA:anarchitecturalapproachtounstructured

infor-mationprocessinginthecorporateresearchenvironment,Nat.Lang.Eng.10 (3–4)(2004)327–348.

[16] D. Salgado, M. Krallinger, M. Depaule, E. Drula, A.V. Tendulkar, F. Leit-ner, A. Valencia, C. Marcelle, MyMiner: a web application for computer-assistedbiocurationandtextannotation,Bioinformatics28(17)(2012)2285– 2287, doi:10.1093/bioinformatics/bts435. http://bioinformatics.oxfordjournals. org/content/28/17/2285.full.pdf.

Res.44(D1)(2016)D1220–D1228,doi:10.1093/nar/gkv1253.

[25]M.Lupu,F.Piroi,J.Huang,J.Zhu,J.Tait,OverviewoftheTREC2009chemical IRtrack,in:Proceedingsofthe18thTextRetrievalConference,2009. [26]J.Park,G.R.Rosania,K.A.Shedden,M.Nguyen,N.Lyu,K.Saitou,Automated

extractionofchemicalstructureinformationfromdigitalrasterimages,Chem. Cent.J.3(2009)4,doi:10.1186/1752-153X-3-4.

[27]A.M.Asif,S.A.Hannan,Y.Perwej,M.A.Vithalrao,Anoverviewandapplications ofopticalcharacterrecognition,Int.J.Adv.Res.Sci.Eng.3(7)(2014). [28]D.Álvarez,R.Fernández,L.Sánchez,Stroke-basedintelligentcharacter

recogni-tionusingadeterministicﬁniteautomaton,Log.J.IGPL23(3)(2015)463–471. [29]L.Eikvil.Opticalcharacterrecognition.1993.citeseer. ist. psu. edu/142042 . [30]R.Holley,Howgoodcanitget?AnalysingandimprovingOCRaccuracyinlarge

scalehistoricnewspaperdigitisationprograms,D-LibMag.15(2009). [31] Google,AboutGooglepatents,2017. https://support.google.com/faqs/answer/

6390996.

[32]K.M.Hettne, R.H.Stierum, M.J. Schuemie, P.J.Hendriksen, B.J. Schijvenaars, E.M.Mulligen,J.Kleinjans,J.A.Kors,Adictionarytoidentifysmallmolecules anddrugsinfreetext,Bioinformatics25(22)(2009)2983–2991,doi:10.1093/ bioinformatics/btp535. http://bioinformatics.oxfordjournals.org/content/25/22/ 5202983.full.pdf.

[33]B.Diallo,M.Lupu,Futurepatentsearch,in:CurrentChallengesinPatent Infor-mationRetrieval,Springer,2017,pp.433–455.

[34]A.Trippe,I.Ruthven,Evaluatingrealpatentretrievaleffectiveness,in:Current ChallengesinPatentInformationRetrieval,Springer,2017,pp.143–162.