Survey of Language Engineering needs: a Language Resources perspective
Jeffrey Allen, Khalid Choukri
European Language Resources Association (ELRA) & European Language resources - Distribution Agency (ELDA)
55-57, rue Brillat-Savarin 75013 Paris France {jeff, choukri}@elda.fr
Abstract
This paper describes the current state of an ongoing survey that aims to determine the needs of users with respect to available and potentially available Language Resources (LRs). Following market monitoring strategies outlined within the Language Resources – Packaging and Production project (LRsP&P, LE4-8335), the main objective of this survey is to provide concrete figures for developing a more reliable and workable business plan for the European Language Resources Association (ELRA) and its Distribution Agency (ELDA), and to determine investment plans for sponsoring the production of new resources.
1. Introduction
This paper presents results from the user needs questionnaires of a survey conducted in 1999 and 2000 by the European Language Resources Association (ELRA) and its Distribution Agency (ELDA) within the European Commission funded LE4-8335 project. It follows up initial survey work conducted by ELRA/ELDA in 1997 (Nilsson, 1997a; Nilsson, 1997b, 1998) within the LE1-1019 project. Taking a multi-tier approach to gathering information on user needs, these surveys are longitudinal in nature, have evolved and improved over time, and thus provide an excellent barometer for measuring the recent past, present and future needs of LR users. We describe herein the approach and procedures of the recent survey and provide the anonymized results obtained. These results allow ELRA/ELDA to streamline its approach to future market monitoring work and to the identification, collection and distribution of Language Resources (LRs), and to better plan new LR investments.
The questionnaire analyzed in this paper was sent mainly to respondents who are not ELRA members. It was sent directly to potential respondents during the first stage of the survey. The second stage included sending reminder requests to those who had not responded earlier. We provide statistics on various areas of Language Resource activities, including: speech systems; speech evaluation and assessment; text processing; text processing systems; authoring and translation environments; information processing systems; multi-media and multi-modal LRs; languages needed; LR domains/fields; and regional areas of respondents.
2. Survey Methodology
After receiving a low number of responses to the 1997 ELRA/ELDA LR User Needs Survey (Nilsson, 1998), ELDA staff revised the survey methodology and redesigned the questionnaire. Unlike the 1997 survey, which contained many open questions, the new questionnaire offers binary yes/no and check-the-box options that limit the time a participant needs to complete it. The new questionnaire was also sent in personalized messages to all of the respondents. It was designed to take only 10-15 minutes to complete.
The new questionnaire was sent out in 667 personalized messages during August 1999 to individuals in the general field of language engineering and human language technologies who are listed in one of the contact databases at the ELDA office.
Of the nearly 670 questionnaires sent out to these language engineering specialists, 17.5% were returned because of bad addresses. After discounting the invalid addresses, the 90 respondents who returned a completed questionnaire represented 16.4% of the valid addresses of potential respondents. The preliminary results obtained from these first 90 respondents are considered to be the first stage of this survey and have already been published (Allen, 1999c). We do not repeat the details in this paper but include them in the cumulated statistics. Given that the first stage of sending out the new version of the questionnaire had a very successful response rate, we extended the coverage of potential respondents for the second stage of this LR survey work, which was conducted from October 1999 through January 2000.
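The first-stage response rate can be checked with a few lines of arithmetic. The sketch below is only an illustration: it uses the three figures reported in the text (667 messages, 17.5% bad addresses, 90 completed questionnaires), and it assumes that the number of valid addresses is the rounded remainder, since the exact count of returned messages is not stated. The function names are hypothetical.

```python
# Figures reported in the text for the first stage of the survey.
QUESTIONNAIRES_SENT = 667
BOUNCE_RATE = 0.175   # share of messages returned as bad addresses
COMPLETED = 90        # completed questionnaires returned


def valid_addresses(sent: int, bounce_rate: float) -> int:
    """Estimate deliverable addresses (assumption: rounded to a whole
    number, because the exact bounce count is not given in the text)."""
    return round(sent * (1.0 - bounce_rate))


def response_rate(completed: int, valid: int) -> float:
    """Completed questionnaires as a percentage of valid addresses."""
    return 100.0 * completed / valid


valid = valid_addresses(QUESTIONNAIRES_SENT, BOUNCE_RATE)
rate = response_rate(COMPLETED, valid)
print(f"valid addresses: about {valid}, response rate: about {rate:.1f}%")
```

Under this rounding assumption the computation gives about 550 valid addresses, which is consistent with the 16.4% response rate stated above.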
As seen in Figure 6, a significant amount of research is currently being conducted by 25-50% of respondents in nearly all of these sub-areas, which are a rapidly expanding part of current NLP work.
6. Multi-media and Multi-modal LRs
One of the most recent demands for LRs falls in the area of multi-media and multi-modal data. The recent survey shows that 52% of all respondents are interested in Multi-media data and 35% are interested in Multi-modal data. The specific sub-areas of Multi-modal processing that have been identified and surveyed by this questionnaire include: face tracking, gesture recognition, facial analysis, eye-gaze tracking, face recognition, person identification, speech/lip reading, focus of attention, facial animation and multi-modal error recovery. For each of these types of Multi-modal LRs, 5-10% of all respondents state that they want it for research, as shown in Figure 7. Demand for product development is still low, but it is expected to grow quickly since this is a new area of research and development. ELRA/ELDA notice a marked increase in interest in Multi-modal LRs, since the Autumn 1997 survey indicated that only 1/18th (about 5.6%) of the surveyed participants were interested in them. Since researchers and developers in the Human Language Technology field are showing interest in this kind of data, it is important that ELRA/ELDA continue to monitor and survey this area closely and in further detail in order to respond more adequately to this increasingly important area for new LRs.
Figure 7. Multi-modal percentages
7. Languages needed
Another questionnaire section asks which languages are desired for LR data. These statistics help ELRA/ELDA understand language data needs, correlate them with what is currently offered, and see where gaps exist in what is offered today. LR users could tick more than one language box in the questionnaire. Figure 8 shows the languages that received 20 or more responses, and Figure 9 those that each received fewer than 20 responses. The percentages presented in the charts are therefore given both relative to the total number of individual language boxes selected (1,326) and relative to the total number of survey respondents (250).
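Because respondents could tick several languages, the same count of ticks yields two different percentages depending on the denominator. The following minimal sketch illustrates the two computations using the totals given in the text (1,326 selected boxes and 250 respondents). The count of 100 ticks is a hypothetical value chosen only for illustration and is not a survey result; the function names are likewise assumptions.

```python
TOTAL_SELECTIONS = 1326   # language boxes ticked (from the text)
TOTAL_RESPONDENTS = 250   # survey respondents (from the text)


def share_of_selections(ticks: int) -> float:
    """Percentage of all ticked language boxes ('% language answers')."""
    return 100.0 * ticks / TOTAL_SELECTIONS


def share_of_respondents(ticks: int) -> float:
    """Percentage of respondents who ticked this language ('% respondents')."""
    return 100.0 * ticks / TOTAL_RESPONDENTS


# Hypothetical count, NOT taken from the survey data.
hypothetical_ticks = 100
print(f"{share_of_selections(hypothetical_ticks):.1f}% of language answers")
print(f"{share_of_respondents(hypothetical_ticks):.1f}% of respondents")

# Average number of languages ticked per respondent.
average_ticks = TOTAL_SELECTIONS / TOTAL_RESPONDENTS
print(f"{average_ticks:.1f} languages per respondent on average")
```

The two totals imply that each respondent ticked a little over five languages on average, which is why the percentage of language answers is always much smaller than the percentage of respondents for the same language.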
It is clear that English, French, German, Italian, and Spanish, set apart on the left of Figure 8, are currently the most desired languages for LRs. The middle group of responses, shown to the right in alphabetical order in Figure 8, contains the Asian languages and some of the other European languages. The languages that received fewer than 20 responses are for the most part Eastern European languages.
A general conclusion to make from these language statistics is that ELRA/ELDA have been responding to the need for European languages. Some Eastern European LRs have been made available to meet such needs. There is however a need for more LRs for the main Asian languages (Chinese, Japanese and Korean) and for Arabic, as can be seen in Figure 8.
[Figure 7 chart: "Multimodal Processing %"; axis from 0% to 9%; categories shown include Gesture recognition, Face tracking, Facial analysis, Facial animation, Speech reading/Lip reading, Focus of attention, Eye-gaze tracking, Face recognition.]
Figure 8. Over 20 responses per language
Figure 9. Under 20 responses per language
8. LR domains/fields
The questionnaire also contained a section on the domains/fields for LR data. The results are summarized below.
43% of the respondents indicated that they are interested in LRs from all domains.
The specific fields desired by at least 20% of respondents are: Computer Science; Telecommunications; Technology; Business; Economics and Finance; Administrative; Data Processing.
The fields desired by 10-19% of respondents are: Automobile; Medicine; Education/Pedagogy; Law; Electrical Engineering; Electronics; Health; Tourism; Mechanical Engineering; Pharmaceutical.
The fields desired by under 10% of respondents are: Aeronautics; Heavy-machinery; Sports; Leisure; Chemistry; Geography; Biology; Agriculture; Navigation; Arts; Architecture/construction; Physics; Food Sciences; History; Psychology; Sociology; Geology.
[Figure 8 chart: "Language answers % (above 20)"; axis from 0% to 90%; two series (% language answers, % respondents); languages shown: English, French, German, Italian, Spanish, Arabic, Chinese, Czech, Danish, Dutch, Finnish, Greek, Japanese, Korean, Polish, Portuguese, Russian, Swedish, Turkish.]
[Figure 9 chart: "Language answers % (under 20)"; axis from 0% to 8%; languages shown: Albanian, Bosnian, Catalan, Croatian, Estonian, Hebrew, Hungarian, Latvian, Norwegian, Romanian, Slovak, Slovenian, Serbian, Ukrainian.]
11. Acknowledgements
We would like to thank the European Commission for funding the Language Resources – Packaging and Production project (LE4-8335) that has allowed ELDA to carry out this valuable survey work. Our gratitude is also extended to Emilie Marquois and Valérie Raymond for their assistance in analyzing some of the results from the questionnaires that were received.
12. References
Allen, J. 1999a. ELRA 1999 Call for Proposals – ELRA Commission Production of Language Resources. European Language Resources Association (ELRA) Newsletter, January-March 1999, 4.1:8-9.
Allen, J. 1999b. Results of ELRA 1999 Call for Proposals – ELRA Commission Production of Language Resources. European Language Resources Association (ELRA) Newsletter, April-June 1999, 4.2:6-7.
Allen, J. 1999c. Report on ELDA's Survey of Language Resource User needs. European Language Resources Association (ELRA) Newsletter, October-December 1999, 4.4:8-9.
Allen, J. 1999d. Language Resources Go Digital: Update on the European Language Resources Association. Language International magazine, 11.6:38-39. Amsterdam: John Benjamins.
Choukri, K., V. Mapelli, and J. Allen. 1999. New developments within the European Language Resources Association. Paper presented at Eurospeech99. Budapest, Hungary, 4-12 September 1999.
Nilsson, M. 1997a. The ELRA Marketing Survey. European Language Resources Association (ELRA) Newsletter, June 1997, 2.2:11.
Nilsson, M. 1997b. ELRA Market Segmentation Survey - update. European Language Resources Association (ELRA) Newsletter, October 1997, 2.3:10.