Proceedings of the
Fourth Conference on
Computational Natural Language Learning
and of the
Second Learning Language in Logic Workshop
Held in cooperation with ICGI-2000
Order additional copies from:
Association for Computational Linguistics
75 Paterson Street
New Brunswick, NJ 08901 USA
+1-732-342-9100 phone
Preface
The joint Second Learning Language in Logic (LLL-2000) Workshop and Fourth Conference on Computational Natural Language Learning (CoNLL-2000) took place September 13-14, 2000, at the Instituto Superior Técnico in Lisbon, Portugal, and were co-organized with the 5th International Colloquium on Grammatical Inference (ICGI-2000). This volume contains the papers presented during this joint event. More information is available on-line from http://www.lri.fr/~cn/LLL-2000/ and http://lcg-www.uia.ac.be/conll2000/.
We would like to thank all the authors for submitting their papers and thus making these proceedings possible. We address special thanks to the members of the program committees for their great work which contributed to the high quality of these proceedings. We wish to extend our gratitude to the invited speakers for presenting us with their views on innovative results in Natural Language Processing and Machine Learning.
We are also grateful to the Local Chair Arlindo Oliveira, the members of the Organizing Committee, Ana Fred and Ana T. Freitas, and all other individuals who helped in the organization of this event.
Finally, we would like to thank the sponsors of LLL-2000 and CoNLL-2000 for their generous financial and moral support: the Network of Excellence in Inductive Logic Programming (ILPNet2), the Network of Excellence in Machine Learning (MLNet3), the Computational Linguistics in Flanders research community (CLIF), and SIGNLL (ACL's SIG on Natural Language Learning).
Claire Cardie, Walter Daelemans, Claire Nédellec, Erik Tjong Kim Sang
SPONSORS:
CLIF (Computational Linguistics in Flanders)
ILPNet2 (Network of Excellence in Inductive Logic Programming)
MLNet3 (Network of Excellence in Machine Learning)
SIGNLL (ACL's SIG for Natural Language Learning)
INVITED SPEAKERS:
Jörg-Uwe Kietz
Dan Roth
ORGANIZERS:
Claire Cardie (CoNLL)
Walter Daelemans (CoNLL)
Claire Nédellec (LLL)
Erik Tjong Kim Sang (CoNLL)
LOCAL ARRANGEMENTS CHAIR:
Arlindo Oliveira
CoNLL PROGRAM COMMITTEE:
Thorsten Brants (Universität des Saarlandes)
James Cussens (University of York)
Raymond Mooney (University of Texas at Austin)
John Nerbonne (University of Groningen)
Miles Osborne (University of Edinburgh)
David Powers (Flinders University)
Ronan Reilly (University College Dublin)
Antal van den Bosch (Tilburg University)
LLL PROGRAM COMMITTEE:
Pieter Adriaans (Syllogic and University of Amsterdam, the Netherlands)
Roberto Basili (University of Roma, Italy)
Gilles Bisson (INRIA, Grenoble, France)
Henrik Boström (University of Stockholm, Sweden)
Gosse Bouma (University of Groningen, the Netherlands)
James Cussens (University of York, United Kingdom)
Tomaz Erjavec (Institute Jozef Stefan, Slovenia)
Daniel Kayser (LIPN, Université Paris-Nord, France)
Suresh Manandhar (University of York, United Kingdom)
Guenter Neumann (DFKI, Saarbrücken, Germany)
Steve Pulman (University of Cambridge, United Kingdom)
Christer Samuelsson (Xerox Research Center Europe, Grenoble, France)
Stefan Wrobel (University of Magdeburg, Germany)
FURTHER INFORMATION:
CoNLL and SIGNLL
Walter Daelemans
CNTS Language Technology Group
University of Antwerp (UIA)
Universiteitsplein 1 (building A)
B-2610 Antwerpen, Belgium
e-mail: [email protected]
LLL
Claire Nédellec
Laboratoire de Recherche en informatique (LRI)
UMR 8623 CNRS
Bat 490, Université Paris-Sud
F-91405 Orsay cedex, France
e-mail: [email protected]
Table of Contents
CoNLL-2000 Invited Paper
Learning in Natural Language: Theory and Algorithmic Approaches
Dan Roth . . . 1
CoNLL-2000 Papers
Corpus-Based Grammar Specialization
Nicola Cancedda and Christer Samuelsson . . . 7
Pronunciation by Analogy in Normal and Impaired Readers
R.I. Damper and Y. Marchand . . . 13
The Role of Algorithm Bias vs Information Source in Learning Algorithms for Morphosyntactic Disambiguation
Guy De Pauw and Walter Daelemans . . . 19
Increasing our Ignorance of Language: Identifying Language Structure in an Unknown 'Signal'
John Elliott, Eric Atwell and Bill Whyte . . . 25
A Comparison between Supervised Learning Algorithms for Word Sense Disambiguation
Gerard Escudero, Lluis Màrquez and German Rigau . . . 31
Incorporating Position Information into a Maximum Entropy/Minimum Divergence Translation Model
George Foster . . . 37
Memory-Based Learning for Article Generation
Guido Minnen, Francis Bond and Ann Copestake . . . 43
Overfitting Avoidance for Stochastic Modeling of Attribute-Value Grammars
Tony Mullen and Miles Osborne . . . 49
Learning Distributed Linguistic Classes
Stephan Raaijmakers . . . 55
Modeling the Effect of Cross-Language Ambiguity on Human Syntax Acquisition
William Gregory Sakas . . . 61
Knowledge-Free Induction of Morphology Using Latent Semantic Analysis
Patrick Schone and Daniel Jurafsky . . . 67
Using Induced Rules as Complex Features in Memory-Based Language Learning
Antal van den Bosch . . . 73
CoNLL-2000 Short Papers
Using Perfect Sampling in Parameter Estimation of a Whole Sentence Maximum Entropy Language Model
F. Amaya and J.M. Benedi . . . 79
Experiments on Unsupervised Learning for Extracting Relevant Fragments from Spoken Dialog Corpus
Konstantin Biatov . . . 83
Generating Synthetic Speech Prosody with Lazy Learning in Tree Structures
Laurent Blin and Laurent Miclet . . . 87
Inducing Syntactic Categories by Context Distribution Clustering
Alexander Clark . . . 91
ALLiS: a Symbolic Learning System for Natural Language Learning
Hervé Déjean . . . 95
Combining Text and Heuristics for Cost-Sensitive Spam Filtering
Jose M. Gómez Hidalgo and Enrique Puertas Sanz . . . 99
Genetic Algorithms for Feature Relevance Assignment in Memory-Based Language Processing
Anne Kool, Walter Daelemans and Jakub Zavrel . . . 103
Shallow Parsing by Inferencing with Classifiers
Vasin Punyakanok and Dan Roth . . . 107
Minimal Commitment and Full Lexical Disambiguation: Balancing Rules and Hidden Markov Models
Patrick Ruch, Robert Baud, Pierrette Bouillon and Gilbert Robert . . . 111
Learning IE Rules for a Set of Related Concepts
J. Turmo and H. Rodríguez . . . 115
A Default First Order Family Weight Determination Procedure for WPDV Models
Hans van Halteren . . . 119
A Comparison of PCFG Models
Jose Luis Verdú-Mas, Jorge Calera-Rubio and Rafael C. Carrasco . . . 123
CoNLL-2000 Shared Task Papers
Introduction to the CoNLL-2000 Shared Task: Chunking
Erik F. Tjong Kim Sang and Sabine Buchholz . . . 127
Learning Syntactic Structures with XML
Hervé Déjean . . . 133
A Context Sensitive Maximum Likelihood Approach to Chunking
Christer Johansson . . . 136
Chunking with Maximum Entropy Models
Rob Koeling . . . 139
Use of Support Vector Learning for Chunk Identification
Taku Kudoh and Yuji Matsumoto . . . 142
Shallow Parsing as Part-of-Speech Tagging
Miles Osborne . . . 145
Improving Chunking by Means of Lexical-Contextual Information in Statistical Language Models
Ferran Pla, Antonio Molina and Natividad Prieto . . . 148
Text Chunking by System Combination
Erik F. Tjong Kim Sang . . . 151
Chunking with WPDV Models
Hans van Halteren . . . 154
Single-Classifier Memory-Based Phrase Chunking
Jorn Veenstra and Antal van den Bosch . . . 157
Phrase Parsing with Rule Sequence Processors: an Application to the Shared CoNLL Task
Marc Vilain and David Day . . . 160
Hybrid Text Chunking
GuoDong Zhou, Jian Su and TongGuan Tey . . . 163
LLL-2000 Invited Paper
Extracting a Domain-Specific Ontology from a Corporate Intranet
Jörg-Uwe Kietz, Raphael Volz and Alexander Maedche . . . 167
LLL-2000 Papers
Learning from a Substructural Perspective
Pieter Adriaans and Erik de Haas . . . 176
Incorporating Linguistics Constraints into Inductive Logic Programming
James Cussens and Stephen Pulman . . . 184
Learning from Parsed Sentences with INTHELEX
F. Esposito, S. Ferilli, N. Fanizzi and G. Semeraro . . . 194
Inductive Logic Programming for Corpus-Based Acquisition of Semantic Lexicons
Pascale Sébillot, Pierrette Bouillon and Cécile Fabre . . . 199
The Acquisition of Word Order by a Computational Learning System
Aline Villavicencio . . . 209
Recognition and Tagging of Compound Verb Groups in Czech
Eva Žáčková, Luboš Popelínský and Miloš Nepil . . . 219
Fourth Conference on
Computational Natural Language Learning
Preface
CoNLL-2000 is the fourth in a series of meetings organized by SIGNLL, the ACL's SIG on Natural Language Learning. Previous meetings were organized in Madrid, Sydney, and Bergen, co-located with different, but always computational linguistics-oriented, events. We are pleased that this time we could combine efforts with the grammar induction and inductive logic programming for language processing communities.
It is the explicit wish of the SIGNLL board to have the CoNLL meeting address all aspects of computational natural language learning, including issues that are not regularly discussed at computational linguistics meetings, such as computational models of human language acquisition, computational models of the origins and evolution of language, biologically-inspired learning methods, etc.
We are thrilled by the quality and quantity of the submissions, which allowed us to set up an intense but rewarding program with one invited talk, 12 long talks, and joint paper sessions with LLL-2000 and ICGI-2000. On top of that, we introduced two innovations: 12 bullet presentations (short talks accompanied by a poster presentation), and a shared task session in which 11 authors report on how their machine learning method performed on our shared task, the identification of syntactic constituents in text (chunking). In this part of the proceedings, you will find 37 papers providing a useful record of all presentations.
You can find out more about SIGNLL and its activities at http://www.aclweb.org/signll/.
Second Learning Language in Logic Workshop
Preface
LLL-2000 is the follow-up to the first LLL workshop, held in 1999 in Bled (Slovenia) and co-located with the International Conference on Machine Learning and the International Conference on Logic Programming. This year LLL was integrated with the Fourth Conference on Computational Natural Language Learning (CoNLL) and the Fifth International Colloquium on Grammatical Inference (ICGI), with which LLL shares strong common scientific interests in language learning. Registration for ICGI, CoNLL and LLL was joint, so that registrants could move freely between the three events.
As in the first edition, LLL has attracted multidisciplinary submissions from three research fields: Natural Language Processing (NLP), Machine Learning and Computational Logic. This demonstrates the growing interest in NLP methods based on ILP or non-classical logics, and in hybrid methods. Relational learning increasingly appears as complementary to data analysis in many NLP domains. Relational learning and logic-based learning prove here again their capacity to learn complex structured linguistic resources and knowledge, such as ontologies and grammars, from corpora and explicit background knowledge. The scientific program of LLL-2000 consisted of one invited talk by Jörg-Uwe Kietz on the acquisition of ontology and seven paper presentations. Six of them are reported here; the paper by Christophe Costa Florencio, accepted for presentation by both LLL and ICGI, has been published in the ICGI proceedings. The joint sessions with ICGI and CoNLL included one invited talk by Dan Roth as well as paper and poster presentations.
Author Index
Adriaans, Pieter . . . 176
Amaya, F. . . . 79
Atwell, Eric . . . 25
Baud, Robert . . . 111
Benedi, J.M. . . . 79
Biatov, Konstantin . . . 83
Blin, Laurent . . . 87
Bond, Francis . . . 43
Bouillon, Pierrette . . . 111, 199
Buchholz, Sabine . . . 127
Calera-Rubio, Jorge . . . 123
Cancedda, Nicola . . . 7
Carrasco, Rafael C. . . . 123
Clark, Alexander . . . 91
Copestake, Ann . . . 43
Cussens, James . . . 184
Daelemans, Walter . . . 19, 103
Damper, R.I. . . . 13
Day, David . . . 160
De Haas, Erik . . . 176
De Pauw, Guy . . . 19
Déjean, Hervé . . . 95, 133
Elliott, John . . . 25
Escudero, Gerard . . . 31
Esposito, F. . . . 194
Fabre, Cécile . . . 199
Fanizzi, N. . . . 194
Ferilli, S. . . . 194
Foster, George . . . 37
Gómez Hidalgo, Jose M. . . . 99
Johansson, Christer . . . 136
Jurafsky, Daniel . . . 67
Kietz, Jörg-Uwe . . . 167
Koeling, Rob . . . 139
Kool, Anne . . . 103
Kudoh, Taku . . . 142
Maedche, Alexander . . . 167
Marchand, Y. . . . 13
Màrquez, Lluis . . . 31
Matsumoto, Yuji . . . 142
Miclet, Laurent . . . 87
Minnen, Guido . . . 43
Molina, Antonio . . . 148
Mullen, Tony . . . 49
Nepil, Miloš . . . 219
Osborne, Miles . . . 49, 145
Pla, Ferran . . . 148
Popelínský, Luboš . . . 219
Prieto, Natividad . . . 148
Puertas Sanz, Enrique . . . 99
Pulman, Stephen . . . 184
Punyakanok, Vasin . . . 107
Raaijmakers, Stephan . . . 55
Rigau, German . . . 31
Robert, Gilbert . . . 111
Rodríguez, H. . . . 115
Roth, Dan . . . 1, 107
Ruch, Patrick . . . 111
Sakas, William Gregory . . . 61
Samuelsson, Christer . . . 7
Schone, Patrick . . . 67
Sébillot, Pascale . . . 199
Semeraro, G. . . . 194
Su, Jian . . . 163
Tey, TongGuan . . . 163
Tjong Kim Sang, Erik F. . . . 127, 151
Turmo, J. . . . 115
Van Halteren, Hans . . . 119, 154
Van den Bosch, Antal . . . 73, 157
Veenstra, Jorn . . . 157
Verdú-Mas, Jose Luis . . . 123
Vilain, Marc . . . 160
Villavicencio, Aline . . . 209
Volz, Raphael . . . 167
Whyte, Bill . . . 25
Žáčková, Eva . . . 219
Zavrel, Jakub . . . 103