Estimating linguistic granularity and variability

A linguistically rich annotation scheme will have a high degree of granularity and variability. By granularity, we mean the level of detail that the annotation scheme makes available. A fine-grained dependency scheme will have many possible combinations of PoS tags and relation types, whereas a more coarse-grained format will have fewer available combinations and thus provide less detail. Variability denotes the level of detail actually expressed in the format. A format might have a high degree of granularity and still exhibit a low degree of variability if it does not fully utilize its granularity. We assume that conversion from a linguistically rich format to a more coarse-grained one is the most likely to succeed, although such conversion will lead to loss of information. In order to investigate the granularity and variability of the annotation schemes, we perform a quantitative study of them.

We start by calculating the possible combinations of PoS tags and relation types for each format (granularity). The result is presented in Table 3.4. The first row shows the number of PoS tags used in our data. As our data sets are equally tokenized, this number is the same for all three formats. We see that 45 of the 48 PoS tags available in the PTB tag set are used. The second row shows the number of available dependency labels according to documentation on each format; CD (Surdeanu et al. 2008), SB (De Marneffe et al. 2008), DT (ERG Tags)². The third row shows the number of possible combinations of dependency labels and head PoS tags and the number of possible dependent PoS tag and label combinations. The last row shows the result of calculating possible combinations of dependent PoS tags, head PoS tags and dependency labels. We observe from these numbers that CD has the highest granularity among the three formats, due to its comparatively large set of labels. DT has a slightly lower granularity than SB.

                                                 CD       DT       SB
postag                                           45       45       45
label                                            69       52       56
hpos-label / dpos-label (postag x label)       3105     2340     2520
dpos-hpos-label (postag x label x postag)    139725   105300   113400

Table 3.4: Counts of possible combinations for each format
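The counts in Table 3.4 follow directly from multiplying the inventory sizes. The sketch below reproduces them; the function and variable names are illustrative and not taken from the original study.

```python
# Inventory sizes as reported in Table 3.4.
N_POS_TAGS = 45
N_LABELS = {"CD": 69, "DT": 52, "SB": 56}


def possible_combinations(n_pos, n_labels):
    """Return the number of available (PoS, label) pairs and
    (dependent PoS, head PoS, label) triples for one format."""
    pairs = n_pos * n_labels             # hpos-label or dpos-label
    triples = n_pos * n_labels * n_pos   # dpos-hpos-label
    return pairs, triples


granularity = {
    fmt: possible_combinations(N_POS_TAGS, n_labels)
    for fmt, n_labels in N_LABELS.items()
}

for fmt, (pairs, triples) in granularity.items():
    print(fmt, pairs, triples)
```

Since the PoS tag inventory is shared by all three formats, the ranking by granularity is determined entirely by the size of the label set.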

To investigate the variability of our formats, we count the labels and PoS tags, and combinations of these, that are actually used in our data. The results, both in absolute and relative frequencies, are presented in Table 3.5. The first row shows the total number of dependency labels used in each format. The hpos-label row shows the number of combinations of dependency labels and head PoS tags used, and the dpos-label row shows the number of dependent PoS tag and label combinations.

                        absolute              relative (%)
                     CD     DT     SB       CD     DT     SB
label                62     50     49     89.9   96.2   87.5
hpos-label          546    588    677     17.6   25.1   26.9
dpos-label          688    690    577     22.2   29.5   22.9
dpos-hpos-label    3503   3541   3479      2.5    3.4    3.1

Table 3.5: Counts of different combinations used in each format

We can see that only some of the possible combinations of PoS tags and labels are used. Although these proportions differ somewhat between the formats, the numbers of combinations of dependent PoS tags, labels and head PoS tags (dpos-hpos-label), shown in the last row, are surprisingly similar. This could indicate that the difference in variability between the schemes is negligible. It also indicates that CD, although it has a higher granularity (more available combinations of PoS tags and labels), does not actually exhibit a higher degree of variability by utilizing them.

² According to documentation, 48 ERG labels and 4 additional technical labels (root,
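As an illustration of how counts like those in Table 3.5 might be obtained, the following sketch counts the distinct combinations in a list of dependency arcs and relates a used count to the number of possible combinations. The arc representation (dependent PoS, head PoS, label), the function names and the toy arcs are assumptions made for this example; they are not the tooling or data of the study.

```python
def used_combinations(arcs):
    """Count distinct labels and tag/label combinations.

    Each arc is a (dependent PoS, head PoS, label) triple.
    """
    return {
        "label": len({label for _, _, label in arcs}),
        "hpos-label": len({(hpos, label) for _, hpos, label in arcs}),
        "dpos-label": len({(dpos, label) for dpos, _, label in arcs}),
        "dpos-hpos-label": len(set(arcs)),
    }


def relative_use(used, possible):
    """Share of the possible combinations that are used, in percent."""
    return round(100.0 * used / possible, 1)


# Hypothetical toy arcs, only to show the counting.
toy_arcs = [
    ("DT", "NN", "NMOD"),
    ("NN", "VBD", "SBJ"),
    ("NN", "VBD", "OBJ"),
    ("DT", "NN", "NMOD"),  # a repeated arc does not add a new combination
    ("JJ", "NN", "NMOD"),
]
toy_counts = used_combinations(toy_arcs)
print(toy_counts)

# Relative frequencies from the reported counts, e.g. labels used in CD:
print(relative_use(62, 69))
```

Applying `relative_use` to the absolute counts in Table 3.5 and the possible counts in Table 3.4 gives the relative frequencies listed in Table 3.5.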

We acknowledge that the distribution of the different PoS tag and label combinations might also be considered when estimating the variability of a format. In one format a large number of the combinations used may occur rarely, while the used combinations are more evenly distributed in another. Knowledge of such differences in distribution might provide more reliable information about the actual variability of the formats. However, we do not investigate this further in our study.
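One possible way to quantify how evenly the used combinations are distributed is normalised Shannon entropy. This measure is not used in the study; it is shown here only as a sketch of what such an investigation could look like, and the function name and toy data are assumptions.

```python
import math
from collections import Counter


def normalised_entropy(items):
    """Shannon entropy of the item frequencies, divided by the maximum
    entropy for that number of distinct items.

    Returns a value between 0 (one item dominates) and 1 (all items
    equally frequent). With fewer than two distinct items it returns 0.0.
    """
    counts = Counter(items)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))


# Hypothetical toy distributions over four combinations.
even = ["a", "b", "c", "d"]
skewed = ["a"] * 97 + ["b", "c", "d"]

print(normalised_entropy(even))
print(normalised_entropy(skewed))
```

Two formats that use the same number of combinations would then be distinguished by how close this value is to 1.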

As a complementary basic statistic, Table 3.6 shows, for each format, the average number of dependents per head. We see that SB designates fewer nodes as heads than the other formats, since each head in this format has more dependents attached to it. This has not reduced the number of combinations of labels and head PoS tags used in this scheme, but rather the opposite: as we can see from Table 3.5, SB has the highest number of combinations of label and head PoS tags among the three formats.

                                 CD      DT      SB
Avg. no. dependents per head   1.86    1.69    2.27
Avg. tree-depth                7.75    7.93    6.35
Max. tree-depth                  24      25      20

Table 3.6: Tree-depth of formats

Table 3.6 also shows the average and maximum tree-depth in the three formats, where tree-depth is the number of nodes in the longest path from the root to a terminal node in a sentence. We can see that SB has a lower tree-depth than the other two formats. This corresponds to the higher concentration of dependents per head in this annotation scheme. The similarity in tree-depth between CD and DT might indicate that conversion between these formats is easier than conversion including SB.
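The two statistics in Table 3.6 can be sketched for a single sentence as follows. The example assumes that a sentence is given as a list of head indices (1-based, with 0 marking the root) and that it forms a well-formed tree; whether the study counted the attachment to an artificial root node is not stated, so it is left out here. All names and the toy sentences are hypothetical.

```python
from collections import Counter


def tree_depth(heads):
    """Number of nodes on the longest path from the root to a terminal node.

    heads[i] is the 1-based index of the head of token i + 1, or 0 if
    that token is the root. The input is assumed to be acyclic.
    """
    def depth_of(token):
        depth = 1
        while heads[token - 1] != 0:
            token = heads[token - 1]
            depth += 1
        return depth

    return max(depth_of(token) for token in range(1, len(heads) + 1))


def avg_dependents_per_head(heads):
    """Average number of dependents over tokens with at least one dependent."""
    dependents = Counter(head for head in heads if head != 0)
    if not dependents:
        return 0.0
    return sum(dependents.values()) / len(dependents)


# Two hypothetical five-token sentences: a bushy tree and a chain.
bushy = [3, 3, 4, 0, 4]
chain = [2, 3, 4, 5, 0]

print(tree_depth(bushy), avg_dependents_per_head(bushy))
print(tree_depth(chain), avg_dependents_per_head(chain))
```

The toy sentences show the relation noted above: with the same number of tokens, more dependents per head goes together with a shallower tree.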

The variety of PoS tags of tokens that are selected as roots might also say something about the richness of an annotation scheme. Table 3.7 shows the 10 most common PoS tags for tokens used as roots and their percentage for each format. Table 3.7 confirms some of the facts we already know about the formats. One of these facts is that in the SB scheme, content words are preferred as heads (De Marneffe et al. 2008). This explains why non-finite verb forms (VB, VBG, VBN) occur as roots far more often in the SB format than in the other formats, while modal verbs (MD), in contrast to the other formats, are hardly ever used as roots in SB. It can also explain why, as we can see from the table, there is a higher variation among PoS tags frequently used as roots in SB than in the other formats.

Another phenomenon that Table 3.7 reveals is that DT treats coordination differently from SB and CD (Ivanova et al. 2012). In DT the coordinating conjunction (CC) often appears as root, whereas in the other formats it practically never does. We will discuss coordination structures more thoroughly in Section 3.4.

           CD      DT      SB
VBD      43.5    38.3    35.8
VBZ      28.3    24.2    15.7
VBP      14.3    11.6     7.4
MD        8.2     7.1     0.1
CC        0.0    13.3     0.0
VBN       0.5     0.5    13.3
VB        0.7     0.6     8.2
NN        1.1     0.8     5.6
JJ        0.1     0.1     5.0
VBG       0.1     0.0     4.0
Total    96.8    96.5    95.1

Table 3.7: Most common PoS tags of tokens used as roots in each format
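A distribution of root PoS tags like the one in Table 3.7 could be computed along the following lines. The sketch assumes that each sentence is a list of (PoS tag, head index) pairs with head index 0 marking the root; the function name and the toy sentences are hypothetical.

```python
from collections import Counter


def root_pos_distribution(sentences, top_n=10):
    """Return the top_n most common root PoS tags with their percentages.

    Each sentence is a list of (PoS tag, head index) pairs, where a head
    index of 0 marks the root token.
    """
    roots = Counter(
        pos for sentence in sentences for pos, head in sentence if head == 0
    )
    total = sum(roots.values())
    if total == 0:
        return []
    return [
        (pos, round(100.0 * count / total, 1))
        for pos, count in roots.most_common(top_n)
    ]


# Hypothetical toy treebank with four sentences.
toy_sentences = [
    [("NN", 2), ("VBD", 0)],
    [("PRP", 2), ("VBD", 0), ("NN", 2)],
    [("NN", 2), ("VBZ", 0)],
    [("NN", 2), ("CC", 0), ("NN", 2)],
]
distribution = root_pos_distribution(toy_sentences)
print(distribution)
```

Running the same function over the three versions of a treebank gives one column per format, which can then be compared as in Table 3.7.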
