Information Extraction from ASRed Documents

The output of an ASR system contains neither case information nor punctuation. It has been shown that, in the absence of punctuation, the extraction of syntactic entities such as parts of speech and noun phrases is not accurate (Nasukawa et al., 2007), so IE from ASRed documents becomes harder. Miller et al. (2000) have shown how IE performance varies with ASR noise. It has also been shown that aggregate models can be built from ASR data (Roy & Subramaniam, 2006). In that work, topical models are constructed by exploiting inter-document redundancy to overcome the noise. Only a few natural language processing steps are used: phrases are aggregated over the noisy collection to get to the clean underlying text.
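The idea of aggregating phrases over a noisy collection can be illustrated with a minimal sketch. It is not the method of Roy & Subramaniam (2006); it only assumes that a phrase recurring across several transcripts is more likely to be correct than one that appears once. The function names, the whitespace tokenization, the `min_docs` threshold, and the sample transcripts are all hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token phrases in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_phrases(documents, n=2, min_docs=2):
    """Keep phrases that occur in at least `min_docs` documents.

    Assumption for illustration: recognition errors are largely
    idiosyncratic, so a misrecognized phrase rarely repeats across
    documents, while a correct phrase does.
    """
    doc_freq = Counter()
    for text in documents:
        tokens = text.lower().split()
        # Use a set so each phrase is counted once per document.
        doc_freq.update(set(ngrams(tokens, n)))
    return {" ".join(p): c for p, c in doc_freq.items() if c >= min_docs}


# Hypothetical noisy transcripts; "recent" and "pass word" stand in
# for recognition errors.
transcripts = [
    "i want to reset my password please",
    "how do i recent my password",
    "please reset my password for me",
    "i need to reset the pass word",
]

phrases = frequent_phrases(transcripts, n=2, min_docs=2)
```

Here the erroneous phrases "recent my" and "pass word" each occur in a single transcript and are dropped, while "my password" survives because it is shared by three transcripts. Counting document frequency rather than raw frequency keeps one long, repetitive transcript from dominating the result.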

FUTURE TRENDS

More and more data from sources such as chat, conversations, blogs, and discussion groups needs to be mined to capture opinions, trends, issues, and opportunities. These forms of communication encourage informal language, which can be considered noisy because of spelling errors, grammatical errors, and informal writing styles. Companies are interested in mining such data to observe customer preferences and improve customer satisfaction. Online agents need to be able to understand web posts in order to take actions and communicate with other agents. Customers are interested in collated product reviews drawn from the web posts of other users. The nature of noisy text warrants moving beyond traditional text analytics techniques. There is a need for natural language processing techniques that are robust to noise, and for techniques that tackle textual noise both implicitly and explicitly.

CONCLUSION

In this chapter we have looked at information extraction from noisy text. This topic is gaining importance as more and more noisy data is generated and useful information needs to be obtained from it. We have presented a survey of existing information extraction techniques and some of the future trends in noisy text analytics.

REFERENCES

E. Agirre, K. Gojenola, K. Sarasola & A. Voutilainen (1998). Towards a Single Proposal in Spelling Correction. Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics (22-28).

Aw, M. Zhang, J. Xiao & J. Su (2006). A Phrase-Based Statistical Model for SMS Text Normalization. In Proceedings of the Joint conference of the Association for Computational Linguistics and the International Committee on Computational Linguistics (ACL-COLING 2006), Sydney, Australia.

N. H. F. Beebe (2005). A Bibliography of Publications on Computer Based Spelling Error Detection and Correction. http://www.math.utah.edu/pub/tex/bib/spell.ps.gz

M. Choudhury, R. Saraf, V. Jain, S. Sarkar & A. Basu (2007). Investigation and Modeling of the Structure of Texting Language. In Proceedings of the IJCAI 2007 Workshop on Analytics for Noisy Unstructured Text Data (AND 2007), Hyderabad, India.

N. Chinchor (1998). Overview of MUC-7. http://www-nlpir.nist.gov/related_projects/muc/proceedings/muc_7_proceedings/overview.html

R. Golding (1995). A Bayesian Hybrid Method for Context-Sensitive Spelling Correction. Proceedings of the Third Workshop on Very Large Corpora (39-53).

R. Golding & D. Roth (1999). A Winnow-Based Approach to Context-Sensitive Spelling Correction. Journal of Machine Learning. Volume 34 (1-3) (107-130).

Analytics for Noisy Unstructured Text Data II

R. Golding & Y. Schabes (1996). Combining Trigram-Based and Feature-Based Methods for Context-Sensitive Spelling Correction. Proceedings of the Thirty-Fourth Annual Meeting of the Association for Computational Linguistics (71-78).

K. Kukich (1992). Techniques for Automatically Correcting Words in Text. ACM Computing Surveys. Volume 24 (4) (377-439).

Y. Lu & C. L. Tan (2004). Information Retrieval in Document Image Databases. IEEE Transactions on Knowledge and Data Engineering. Vol 16, No. 11. (1398-1410)

L. Mangu & E. Brill (1997). Automatic Rule Acquisition for Spelling Correction. Proc. 14th International Conference on Machine Learning. (187-194).

M. Michelson & C. A. Knoblock (2005). Semantic Annotation of Unstructured and Ungrammatical Text. In Proceedings of the International Joint Conference on Artificial Intelligence.

D. Miller, S. Boisen, R. Schwartz, R. Stone & R. Weischedel (2000). Named Entity Extraction from Noisy Input: Speech and OCR. Proceedings of the Sixth Conference on Applied Natural Language Processing.

T. Nartker, K. Taghva, R. Young, J. Borsack & A. Condit (2003). OCR Correction Based On Document Level Knowledge. In Proc. IS&T/SPIE 2003 Intl. Symp. on Electronic Imaging Science and Technology, volume 5010, Santa Clara, CA.

T. Nasukawa, D. Punjani, S. Roy, L. V. Subramaniam & H. Takeuchi (2007). Adding Sentence Boundaries to Conversational Speech Transcriptions Using Noisily Labeled Examples. In Proceedings of the IJCAI 2007 Workshop on Analytics for Noisy Unstructured Text Data (AND 2007), Hyderabad, India.

S. Roy & L. V. Subramaniam (2006). Automatic Generation of Domain Models for Call-Centers from Noisy Transcriptions. In Proceedings of the Joint conference of the Association for Computational Linguistics and the International Committee on Computational Linguistics (ACL-COLING 2006), Sydney, Australia.

E. T. K. Sang, S. Canisius, A. van den Bosch & T. Bogers (2005). Applying Spelling Error Correction Techniques for Improving Semantic Role Labelling. In Proceedings of CoNLL.

Sarma & D. Palmer (2004). Context-based Speech Recognition Error Detection and Correction. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004.

K. Taghva, T. Nartker & J. Borsack (2004). Information Access in the Presence of OCR Errors. ACM Hardcopy Document Processing Workshop, Washington, DC, USA. (1-8).

K. Taghva, T. Nartker, J. Borsack, S. Lumos, A. Condit & Young (2001). Evaluating Text Categorization in the Presence of OCR Errors. In Proceedings of IS&T SPIE 2001 International Symposium on Electronic Imaging Science and Technology, (68-74).

E. M. Zamora, J. J. Pollock & A. Zamora (1983). The Use of Trigram Analysis for Spelling Error Detection. Information Processing and Management 17. 305-316.

KEY TERMS

Automatic Speech Recognition: Machine recognition and conversion of spoken words into text.

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns and relationships, or of obtaining systems that perform useful tasks such as classification, prediction, estimation, or affinity grouping.

Information Extraction: Automatic extraction of structured knowledge from unstructured documents.

Knowledge Extraction: Making explicit the internal knowledge of a system or set of data in a way that is easily interpretable by the user.

Noisy Text: Text whose surface form differs in any way from the intended, correct, or original text.

Optical Character Recognition: Translation of images of handwritten or typewritten text (usually captured by a scanner) into machine-editable text.

Rule Induction: Process of learning, from cases or instances, if-then rule relationships that consist of
