computation.
• It is generally easier to learn from smaller examples. There is a correlation between the length of an example and the time and effort needed to learn from it.
• We have shown that, as a consequence of L-modification, some terms in a clause evaluation function might lose their influence on the result. This was not the case in the experiments conducted in this work, but it should be considered when applying L-modification to other clause evaluation functions.
• Finally, we have shown that our L-modified function copes quite well with noisy data. We gradually inserted more and more noise into the dataset, but there was no tipping point where the performance plummeted. Instead, the performance decreased gradually.
Summing up these results, we come to the final conclusion that a sequence-length-sensitive approach to learning biological grammars using inductive logic programming is a useful and promising concept.
8.3
Future Work
This section presents a few suggestions on how the work presented in this thesis might be extended. Some of these ideas could not be pursued because the author did not have access to the required resources or data, but most of them were left unaddressed simply because of the limited time available for the project.
Apply L-modification to other ILP clause evaluation functions
It would be interesting to apply L-modification to other ILP clause evaluation functions which use only the count of examples to score the quality of learned hypotheses. For domains in which examples in the training data are of variable length, we would expect similar results to those presented in this work.
Apply L-modification to positive and negative learning
The logical next step for this work is to apply L-modification to clause evaluation functions and datasets that are not restricted to positive-only learning but use both positive and negative examples. Learning biological grammars using clause evaluation functions tailored for positive and negative learning, and comparing their performance with that of their L-modified counterparts, should give interesting results.
One potential dataset which contains positive and negative examples is the set of GPCR protein sequences used in (Bryant et al. 2006).
Investigate real-world transcription errors
The author is aware that the transcription errors that were introduced into the dataset in Section 7.2 do not precisely reflect the transcription errors that would be most likely to occur in real biological data. Future work could investigate the sort of transcription errors that occur in reality and then test our L-modification on data containing such errors.
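As a starting point for such an investigation, one simple noise model can be sketched as follows. This is an illustrative assumption and not necessarily the procedure used in Section 7.2: each residue is replaced, with a fixed probability, by a different amino acid chosen uniformly at random. The function name `add_substitution_noise`, the parameter `error_rate` and the example sequence are hypothetical.

```python
import random

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def add_substitution_noise(sequence, error_rate, rng):
    """Return a copy of `sequence` in which each residue is replaced,
    with probability `error_rate`, by a different amino acid drawn
    uniformly from the remaining 19. (Assumed noise model.)"""
    noisy = []
    for residue in sequence:
        if rng.random() < error_rate:
            alternatives = [aa for aa in AMINO_ACIDS if aa != residue]
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(residue)
    return "".join(noisy)


if __name__ == "__main__":
    rng = random.Random(0)   # fixed seed so that runs are repeatable
    original = "MKTLLLAVLC"  # hypothetical sequence, for illustration only
    # Gradually increase the noise level, as in a noise-tolerance experiment.
    for rate in (0.0, 0.05, 0.10, 0.20):
        print(rate, add_substitution_noise(original, rate, rng))
```

A more realistic model would replace the uniform choice with substitution probabilities estimated from observed errors, and might also include insertions and deletions, which change the sequence length and therefore interact directly with a length-sensitive evaluation function.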
Investigate the effect of noise caused by classification errors
The effect of noise caused by classification errors (see Section 7.2) could be investigated as well. In this work we have only considered transcription errors. As we were dealing with positive-only learning, we already accommodated the presence of misclassified (or undiscovered) positive examples anyway. If this work were extended towards a positive and negative learning approach, using the respective clause evaluation functions and their L-modified counterparts, an investigation into classification errors might provide interesting results.
Other protein families
This work dealt only with neuropeptide precursor proteins. The author would like to see the results of applying the ideas presented in this work to datasets which consist of other protein families.
2006) might be a potential candidate. Alternatively, the UniProt database (UniProtKB n.d.) could contain protein sequences which could be compiled for future experiments. A Protein Secondary Structure sequential dataset can also be found in the UC Irvine Machine Learning Repository (UCI n.d.).
L-modification and learning grammars describing nucleic acid sequences
Furthermore, this work was concerned with learning grammars which can parse protein sequences, which are described using an alphabet of 20 amino acids. As mentioned in Chapter 2, grammars have also been shown to be useful for describing nucleic acids, that is, DNA and RNA sequences. These are described using an alphabet of only four nucleotides. The author would like to see L-modification applied to learning grammars describing nucleic acids.
Potential sources for datasets containing DNA sequences could be the UC Irvine Machine Learning Repository (UCI n.d.), where Promoter Gene Sequences and Splice-junction Gene Sequences can be found. Both of these datasets have been used in the past to train neural networks with the KBANN system (Towell, Shavlik & Noordewier 1990).
Evaluate the learned biological grammars
Finally, we would like to investigate the usefulness, in the biological domain, of grammars learned using L-modified clause evaluation functions. To this end, we would like to have experienced biologists evaluate the grammars that were learned.
Learning from very long examples
During the experiments in Section 6.5, we noticed some unintuitive behaviour when learning from examples that contained more than 40 amino acids (characters). As yet, we have no solid explanation for why these examples actually needed less learning time than some of their shorter counterparts. It might be that this is unique to Aleph, or it might very well be that other ILP tools show the same behaviour. This could be investigated as part of the future work. Also, as we have not actually searched the literature for an explanation of this, having a closer