• No results found

Chapter 4 Extraction of Chemical Reactions from the Patent Literature

4.6 Reaction mapping

4.6.2 Atom-atom mapping

4.7.2.3 Evaluated reaction quality

Table 4-7 indicates the precision/recall with which the entities involved in a chemical reaction were identified. False positives may be workup reagents, characterisation reagents or not chemicals at all. False negatives are those entities which are involved in the reaction but were not identified.

True Positives 474 False Positives 60 False Negatives 18 Precision 88.9% Recall 96.4% F1 score 92.5%

Table 4-7 Statistics for recognition of chemical entities (reagents and products)

Table 4-8 shows whether for correctly detected chemical entities with quantities specified in the text, whether these were associated with the chemical entity. A quantity in this context could be

1 10 100 1,000 10,000 100,000 0 200 400 600 800 1000 N u m b e r o f p ate n ts wi th gi ve n n u m b e r o f r e ac tion s

171

a yield, amount, weight, volume etc. If an entity has multiple quantities all must be correctly associated to be considered a success.

Agent type Successful cases/total cases (%)

Reagents 317/321 (98.8%)

Products 48/74 (64.9%)

Table 4-8 Statistics for association of quantities with reagents/products possessing quantities

Table 4-9 shows for each chemical entity of a given role in the manually annotated reactions (which is also found in the automatically extracted reactions) whether the extracted entity has that role.

Role Successful cases/total cases (%)

Product 99/100 (99.0%) Reactant 241/244 (98.8%) Solvent 85/99 (85.9%) Catalyst 10/24 (41.7%) Other Spectator 0/7 (0%) Overall 435/474 (91.8%)

Table 4-9 Statistics for association of roles with entities

4.8 Discussion

The number of patents containing a specified number of extracted reactions (Figure 4-13) shows a power law distribution with respect to the number of reactions extracted from each patent. The majority of patents have less than 10 reactions but a significant minority have greater than 100. Patents with greater than 100 reactions account for only 1.59% of the patents processed but held 41.26% of the extracted reactions.

Recall of chemical entities (Table 4-7) was high (96.4%), whilst precision was somewhat lower. This was primarily due to the classification of workup reagents, especially the first workup reagent, as reactants. This is due to the text often just saying that the reagent was added without any indication of the purpose. Heuristics involving the addition of a common solvent as the last step of a reaction could be investigated.

Association of quantities (Table 4-8) with reagents was near perfect and can be considered a solved problem. There appear to be only a finite number of ways in regular use for associating quantities with chemical entities and all of them are supported by ChemicalTagger. Association of quantities with the product was less successful. This was because this information is often present at the end of an experimental section and only implicitly assumed to apply to the product. Heuristics

172

could be investigated for improving association of unassigned quantities with the product of the reaction. This is expected to be especially applicable to unassigned yields as these will almost invariably be the yield of the reaction.

Assignment of roles (Table 4-9) was excellent for products (99.0%) and reactants (98.8%) but worse for spectator reagents. For this evaluation, the definition of a catalyst was the strict definition that the reagent is not consumed in the course of the reaction. As experimental descriptions

typically only describe the intended product and not the fate of the other reagents involved it is often impossible to tell from just the text whether or not a reagent is a catalyst. This also made the assignment of reagents as catalysts problematic for the manual annotations as the likely mechanism of the reaction had to be investigated in some cases. Similarly, a reagent was considered a reactant even if it did not contribute any heavy atoms to the product as long as it was believed to be

consumed by the reaction. Nonetheless, despite the difficulties in identifying catalysts the addition of more known catalysts and the application of heuristics based around the relative quantity of reagent used would yield improved results.

Overall only 22% of the extracted reactions were flawless by all the metrics evaluated i.e. perfect entity recognition, quantity assignment and role assignment. However it should be borne in mind that most failures were minor, for example: a yield not assigned to a product, a workup reagent assigned as a reactant, etc. It should also be noted that some mistakes in chemical entity identification are not visible in the graphic depiction. This happens when two copies of a reagent are inadvertently identified (as happens if both a reagent and an anaphora to the reagent are

independently resolved), as they will have been merged (using InChI to check for identity) prior to depiction.

The correct identification of product and major starting material is a somewhat more

qualitative but potentially more useful metric, especially for reaction searching, with the proviso that there are no false positive entities that could be mistaken for either of these entities. This was true of 95% of the evaluated reactions. There were several reasons for the failure with the other 5%. A recurring problem was the difficulty in determining the meaning of a reference to a procedure as it could mean:

 The compound produced at the end of that procedure

 The compound produced at the end of that procedure but only in the context of an analogy to the current reaction

173

 The procedure itself

One failure involved a reaction being erroneously split into two reactions due to

ChemicalTagger misassigning ‘starting’ as a verb rather than as an adjective (in the context of ‘to give (i) starting material’). This caused the yield phrase to end at the word ‘starting’ and hence the true products ended up as the reactants for a second reaction. Another failure involved a reaction in which some of the compounds were defined using anaphora to previously defined labelled

compounds. In this case the association between these previous compounds and their numeric identifiers had not been made; hence preventing resolution of these anaphoric references.

Related documents