Search Time vs Length of Examples - A sequence-length sensitive approach to learning biological

(which are not shown in Figure 6.4) are given for each Function: R-squared = 0.862 for Function 6.1 and R-squared = 0.925 for Function 6.2. Both these values indicate that there is a correlation between performance and total nodes constructed, which is consistent with our hypothesis.

Figure 6.4: Performance plotted against the total number of nodes constructed during the search. The R-squared values are given for each Function. X-axis is in logarithmic scale.

Finally, inspecting Table 6.6 and subsequently Figure 6.2, we can observe that the total nodes constructed using the benchmark and L-modified Functions seem to be correlated. To confirm this, we plotted the total nodes constructed for Function 6.1 against those for Function 6.2 for each setNodes parameter. Figure 6.5 shows this graph. The linear regression line gives an R-squared value of 0.993, so we can conclude that the total nodes constructed for the two Functions are correlated. This indicates that the setNodes parameter has equal weight on both Functions.

6.5 Search Time vs. Length of Examples

6.5.1 Motivation

While evaluating the results of the previous experiments in this chapter, we wondered whether there is a relationship between the time needed to learn from an example and the length of that example. Our intuition tells us that more time should be needed to learn from larger examples than from smaller ones.

Figure 6.5: Total nodes constructed for both Functions against each other for each setNodes parameter.

6.5.2 Methodology

To test the hypothesis introduced in the previous subsection, that more time is needed to learn from larger examples, we analysed the output produced by Aleph during our experiments on term domination (described later in this chapter in Section 6.8). During learning, Aleph aims to find the best clause that covers the current example and adds that clause to the grammar. Once it has determined which of the clauses it encountered is the best, it displays the clause itself and some information about it, including the time needed to learn that clause and the example that the clause was learned from. From that output, we can therefore extract, for each clause, the length of the example and the time needed to learn from it.

6.5.3 Results

We collected the search times and example lengths for each of the clauses that were added to the grammar learned in the experiment in Section 6.8 and plotted each search time against the length of the example the clause was learned from. Figure 6.6 shows this plot for the L-modified evaluation function, learning on the whole NPP-middle dataset, without cross validation. Note that this graph only includes those examples that the search decided to learn on. Whenever the search finds an acceptable rule and adds it to the grammar, it removes all the examples that are covered by this rule from the pool of examples to be learned on (Step 5, Remove Redundant, in the algorithm in Table 5.6, page 58). Therefore, we have no data on the search time for the removed examples.

Figure 6.6: This graph plots the length of the examples in the training data against the time needed (in seconds) to search for a rule that covers each example. It also contains a linear regression line found using the method of least squares.
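The example-removal step described in this subsection explains why timing data exist only for some examples. The following is a minimal sketch of a generic greedy covering loop, not Aleph's actual implementation. The names covering_loop, find_best_clause and covers, as well as the toy string examples and prefix "clauses", are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Optional, Tuple


def covering_loop(
    examples: List[str],
    find_best_clause: Callable[[str, List[str]], Optional[str]],
    covers: Callable[[str, str], bool],
) -> Tuple[List[str], List[str]]:
    """Greedy covering: learn one clause per seed example, then drop
    every example that the new clause covers.

    Returns the learned clauses and the seed examples they came from.
    Only the seeds would have a search time recorded.
    """
    pool = list(examples)
    grammar: List[str] = []
    seeds: List[str] = []
    while pool:
        seed = pool[0]
        clause = find_best_clause(seed, pool)
        if clause is None:
            # No acceptable clause for this seed: skip it.
            pool = pool[1:]
            continue
        grammar.append(clause)
        seeds.append(seed)
        # Remove the seed and every other example covered by the clause.
        pool = [e for e in pool[1:] if not covers(clause, e)]
    return grammar, seeds


# Toy stand-ins (assumptions): a "clause" is just the first character
# of the seed, and it covers any example starting with that character.
def toy_find_best_clause(seed: str, pool: List[str]) -> Optional[str]:
    return seed[0] if seed else None


def toy_covers(clause: str, example: str) -> bool:
    return example.startswith(clause)


toy_examples = ["abc", "ab", "bcd", "a", "bd", "c"]
toy_grammar, toy_seeds = covering_loop(
    toy_examples, toy_find_best_clause, toy_covers
)
```

In this toy run only three of the six examples ever act as seeds. The other three are removed as covered before the search reaches them, so no search time would be recorded for them.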

6.5.4 Evaluation and Discussion

We can see from Figure 6.6 that the very small examples need very little time to be learned from. As the size of the examples increases, the time needed to learn from them increases as well. This trend holds until the examples reach lengths of over 40 amino acids; from there onwards the search time is considerably lower. This observation was unexpected.

Figure 6.6 also contains the linear regression line derived from all the data points plotted, using the least squares method. The equation for that line is:

y = 1.667x + 56.50 (6.5)

with the Pearson product-moment correlation coefficient:

r = 0.559 and R-squared = r² = 0.312 (6.6)

Such a low value for R-squared would suggest that example length and search time are not correlated. However, taking into account the unusual behaviour of the last 6 points, representing the examples with over 40 amino acids, we decided to draw the same graph again, ignoring those 6 data points. This graph is presented in Figure 6.7. This graph also contains the linear regression line derived from all the data points plotted. The equation for that line is:

y = 5.433x + 4.093 (6.7)

with the Pearson product-moment correlation coefficient:

r = 0.943 and R-squared = r² = 0.889 (6.8)

This value for R-squared (0.889) does indeed suggest a correlation between example length and search time.
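The two fits compared in this subsection can be reproduced with a few lines of code. The sketch below computes a least-squares line and the Pearson coefficient, with an optional cutoff on example length. The function names and the data points are hypothetical. The points are invented for illustration and are not the measurements behind Figures 6.6 and 6.7.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (example length, search time in seconds)


def least_squares(points: List[Point]) -> Tuple[float, float, float]:
    """Return (slope, intercept, r) for the least-squares line y = slope*x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson product-moment coefficient
    return slope, intercept, r


def fit_with_cutoff(
    points: List[Point], max_length: Optional[float] = None
) -> Tuple[float, float, float]:
    """Fit only the points whose example length does not exceed max_length."""
    kept = [p for p in points if max_length is None or p[0] <= max_length]
    return least_squares(kept)


# Invented data: a rising trend up to length 40, then lower times.
hypothetical_points: List[Point] = [
    (5, 30), (10, 60), (20, 110), (30, 170), (40, 220), (60, 90), (80, 100),
]

slope_all, intercept_all, r_all = fit_with_cutoff(hypothetical_points)
slope_cut, intercept_cut, r_cut = fit_with_cutoff(hypothetical_points, max_length=40)
r_squared_all = r_all ** 2
r_squared_cut = r_cut ** 2
```

For a simple linear regression with an intercept, R-squared equals the square of the Pearson coefficient, which is how the two values in (6.6) and in (6.8) are related.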

Figure 6.7: This graph plots the length of the examples in the training data against the time needed (in seconds) to search for a rule that covers each example. It also contains a linear regression line found using the method of least squares. This graph omits all examples with more than 40 amino acids.

To find out whether this behaviour is unique to the experiments reported in this section, we looked at the data obtained from some of our previous experiments in this chapter:

• using Function 6.2, learning on the whole NPP-middle dataset, applying 5-fold cross validation,

• using Function 6.1, learning on the whole NPP-middle dataset, applying 5-fold cross validation,

• using Function 6.1, learning on the whole NPP-middle dataset, applying no cross validation.


In document A sequence-length sensitive approach to learning biological grammars using inductive logic programming. (Page 91-95)