Adding Value to CMC Corpora:
CLARINification and Part-of-Speech Annotation of the Dortmund Chat Corpus
NLP 4 CMC 2015: 2nd Workshop on Natural Language Processing for Computer-Mediated Communication / Social Media
Michael Beißwenger, Eric Ehrhardt, Andrea Horbach,
Harald Lüngen, Diana Steffen, Angelika Storrer
ChatCorpus2CLARIN: Project background
Duration: May 2015 – February 2016
Project team: Michael Beißwenger (U Dortmund), Angelika Storrer, Eric Ehrhardt (U Mannheim), Harald Lüngen (IDS), Axel Herold (BBAW) + other colleagues at IDS and BBAW
The task: Re-modeling of the Dortmund Chat Corpus and samples of other CMC resources in compliance with existing standards for the representation of corpora in the Digital Humanities. Integration into the CLARIN-D infrastructures at BBAW and IDS.
Curation project of the CLARIN-D F-AG 1 “German Philology”
Main goal:
§ Pave the way for the inclusion of linguistically annotated CMC resources into the CLARIN-D corpus infrastructures and create the prerequisites for investigating linguistic peculiarities of CMC with state-of-the-art corpus technology.
Dortmund Chat Corpus
http://www.chatkorpus.tu-dortmund.de
478 logfile documents with 140,240 user postings or 1M words of German chat discourse.
Resource for the analysis of linguistic variation in chats, including chats from different social/institutional contexts (social chats, advisory chats, learning and teaching, moderated chats in the media context).
Annotated in a home-grown XML format (‘ChatXML’): (1) basic structure of chat logfiles and postings, (2) selected “netspeak” phenomena, (3) selected metadata.
The corpus
Beißwenger (2013) in ZGL / LINSE: http://tinyurl.com/chatkorpus
ChatCorpus2CLARIN
1) Project background
2) Work packages in the project:
- CLARINification, legal issues + licensing
- TEI representation
- enrich the data with additional linguistic annotations (PoS, normalised spellings, ...)
3) Zoom-in: PoS annotation with an extended STTS (cooperation with U Saarbrücken)
4) Next Steps / Outlook:
- Manual post-processing with OrthoNormal
- Challenges of PoS tagging: sample comparison of STTS-SB with a gold standard annotation
Work packages: (1) “CLARINification”
Integration of the resource into the CLARIN-D infrastructures:
§ Hosting at the CLARIN-D centres BBAW and IDS.
§ Developing a CMDI representation of the metadata.
§ Ingestion into the repositories for long-term data archiving at both centres.
§ The resource and its metadata will then be:
- harvestable via OAI-PMH,
- accessible via the CLARIN Virtual Language Observatory,
- searchable via the CLARIN-D Federated Content Search,
- addressable via PIDs.
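As a rough illustration of what harvesting via OAI-PMH involves, the sketch below builds a ListRecords request URL. The verb and the parameter names (verb, metadataPrefix, set) come from the OAI-PMH protocol. The endpoint URL, the prefix value "cmdi" and the helper name are assumptions, since the slides do not name the repository endpoints.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the slides do not give the repositories' OAI-PMH URLs.
EXAMPLE_ENDPOINT = "https://repository.example.org/oai"


def build_list_records_url(endpoint, metadata_prefix="cmdi", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no network access)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional selective harvesting; the set name is up to the repository.
        params["set"] = set_spec
    return endpoint + "?" + urlencode(params)


url = build_list_records_url(EXAMPLE_ENDPOINT)
print(url)
```

A harvester would send this request and page through the responses; which metadata prefixes a centre actually offers can be checked with the ListMetadataFormats verb.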
Work packages: (1) “CLARINification”
Legal issues + licensing:
§ The conditions for licensing the corpus resource for scientific use will be defined on the basis of a legal expert opinion that is currently being sought (John Weitzmann, iRights law office, Berlin).
§ Depending on the legal opinion, different licensing models are possible (CLARIN-D end user license type PUB ‘publicly available’ or more restrictive license types).
Work packages: (2) TEI representation
The goal:
§ Remodeling of ChatXML in TEI-P5 + additional structural information and metadata
§ Conversion of the whole chat corpus into the TEI target format
Resources:
§ TEI schemas and models developed in the TEI special interest group “computer-mediated communication” (http://www.tei-c.org/Activities/SIG/CMC/)
Result:
§ [Oct 2015] Customized TEI schema (ODD) which adapts the models available in TEI-P5 for the structural + linguistic peculiarities of CMC and which has been tested not only for chat but also for samples of other genres (Wikipedia talk pages, blog comments, tweets, WhatsApp, Usenet news)
Work packages: (2) TEI representation
Documentation of schema and ODD on the SIG pages in the wiki of the TEI:
Work packages: (3) PoS annotation and tagset
Extended STTS tag-set („STTS 2.0“) with categories for CMC-specific items and for linguistic phenomena typical of spontaneous dialogic interaction. Downward-compatible with STTS (1999).
Overview of extensions and modifications to STTS (1999) in STTS 2.0
Compatible with
§ the extended STTS for spoken language which is used for PoS tagging the FOLK corpus at IDS,
§ the extended STTS for CMC which is used in the GSCL/Empirikom shared task on CMC.
Work packages: (3) PoS annotation and tagset
PoS tagset + annotation guidelines available on the website of the GSCL/Empirikom shared task on automatic linguistic annotation of CMC (EmpiriST2015):
https://sites.google.com/site/empirist2015/home/annotation-guidelines
Work packages: (3) PoS annotation and tagset
Workflow:
1. Automatic tokenisation, PoS annotation & lemmatisation of the chat corpus with tools + tagging models from the BMBF project “Schreibgebrauch” at U Saarbrücken (Horbach et al. 2014, Horbach et al. 2015)
PoS tag set: STTS 2.0 as described in Bartz et al. 2014 (‘STTS2.0-BETA’)
Representation of the tagging results as additions to the ChatXML format.
2. Manual post-processing of the tagging results and “upgrade” to STTS 2.0-ALPHA (Beißwenger et al. 2015) using OrthoNormal in FOLKER (preview version 1.2) with an import/export filter for PoS tagged ChatXML (defined by Thomas
Workflow, step 1: Automatic PoS Tagging
(cooperation with “Schreibgebrauch” project, U Saarbrücken)
§ To speed up the manual annotation process, we use a PoS tagger to pre-annotate the corpus.
§ Challenge:
- Standard PoS taggers perform poorly on CMC data
- Accuracy on Dortmund Chat Corpus: ~71% (vs. 97% accuracy on newspaper text)
BMBF project “Analyse und Instrumentarien zur Beobachtung des Schreibgebrauchs im Deutschen” http://www.schreibgebrauch.de
§ Examples (Dortmund Chat Corpus):
- tach/ADJD @/XY all/PIAT
- was/PWS wilst/VVFIN duda/NE mit/APPR sagen/VVFIN
- hamburg/ADV hat/VAFIN mehr/PIS brücken/VVINF als/KOKOM venedig/ADJD
§ Challenge:
- Chat data contains many words / spelling variants / spelling mistakes / ... which do not occur in the training data
- Taggers perform particularly badly on these out-of-vocabulary tokens
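The tagger output in the examples uses a token/TAG notation. A minimal sketch for reading such lines into (token, tag) pairs is given below; the function name is hypothetical and the slides do not state the correct tags, so the code only parses and does not judge the tagging.

```python
def parse_tagged(line):
    """Split a 'token/TAG token/TAG ...' string into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split at the LAST slash so that tokens containing '/' stay intact.
        token, sep, tag = item.rpartition("/")
        if not sep or not token or not tag:
            raise ValueError(f"not in token/TAG form: {item!r}")
        pairs.append((token, tag))
    return pairs


example = "hamburg/ADV hat/VAFIN mehr/PIS brücken/VVINF als/KOKOM venedig/ADJD"
pairs = parse_tagged(example)
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

The one-token-per-line output is a common input shape for taggers and annotation tools, which is why the sketch prints tab-separated pairs.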
Adapting PoS-Taggers to CMC Data
§ Basic idea:
- Annotate a small corpus of CMC data
- Add this to existing gold standard training data (TIGER)
- Retrain the tagger (TreeTagger)
§ Underlying intuition:
- Manual annotation provides information about CMC-specific words / spelling variants / ...
Adapting PoS taggers to CMC: Training Data
§ Standard training set: TIGER (newspaper, ~900k tokens)
§ Additional corpora:
- Dortmund Chat Corpus
- Chefkoch
- (Twitter)
§ Manual gold-standard annotation for 12k tokens each
- Training: ~4k
- Evaluation: ~8k
§ Tagset: “STTS 2.0-BETA”, downward compatible
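The data setup above can be sketched as follows: each manually annotated CMC sample is split into a training and an evaluation part, and the training parts are appended to the base corpus before retraining. This is a toy sketch with placeholder tokens scaled down by a factor of 1,000. The function names are hypothetical, and the contiguous split is an assumption, since the slides do not say how the ~4k/~8k portions were selected.

```python
def split_gold(tokens, n_train):
    """Split a gold-annotated token list into training and evaluation parts."""
    return tokens[:n_train], tokens[n_train:]


def build_training_set(base_corpus, cmc_train_parts):
    """Concatenate the base corpus with the CMC training portions."""
    combined = list(base_corpus)
    for part in cmc_train_parts:
        combined.extend(part)
    return combined


# Placeholder (token, tag) pairs standing in for the real corpora.
tiger = [(f"news{i}", "TAG") for i in range(900)]    # ~900k tokens in reality
chat_gold = [(f"chat{i}", "TAG") for i in range(12)]  # ~12k tokens in reality
forum_gold = [(f"forum{i}", "TAG") for i in range(12)]

chat_train, chat_eval = split_gold(chat_gold, 4)
forum_train, forum_eval = split_gold(forum_gold, 4)
training = build_training_set(tiger, [chat_train, forum_train])

print(len(training), len(chat_eval), len(forum_eval))
```

The evaluation parts stay out of the training set, so accuracy is always measured on CMC tokens the retrained tagger has not seen.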
Adapted Tagger – Accuracy
                      Complete test set       “Standard STTS” only
Tagger trained on     Chat   Forum  Both      Chat   Forum  Both
Tiger                 0.71   0.85   0.78      0.80   0.87   0.84
Tiger +auto           0.73   0.86   0.80      0.82   0.89   0.85
Tiger +gold           0.83   0.88   0.86      0.86   0.91   0.89
Tiger +gold +auto2    0.84   0.89   0.86      0.87   0.92   0.90
Adapted Tagger – Error Reduction
                      Complete test set       “Standard STTS” only
Tagger trained on     Chat   Forum  Both      Chat   Forum  Both
Tiger                 0.00   0.00   0.00      0.00   0.00   0.00
Tiger +auto           0.05   0.06   0.06      0.08   0.09   0.08
Tiger +gold           0.39   0.23   0.33      0.31   0.28   0.29
Tiger +gold +auto2    0.42   0.28   0.37      0.37   0.34   0.35
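The slides do not define how error reduction was computed. A common definition, assumed here, is the relative error reduction: the share of the baseline's errors that the adapted model removes. The sketch applies it to the rounded chat accuracies from the accuracy table. The results come close to the reported figures but do not match them exactly, presumably because the reported figures were computed from unrounded accuracies.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Share of the baseline's errors that the new model removes."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# Rounded chat accuracies on the complete test set, from the accuracy table.
chat_acc = {
    "Tiger": 0.71,
    "Tiger +auto": 0.73,
    "Tiger +gold": 0.83,
    "Tiger +gold +auto2": 0.84,
}

reductions = {
    name: relative_error_reduction(chat_acc["Tiger"], acc)
    for name, acc in chat_acc.items()
}
for name, value in reductions.items():
    print(f"{name:<20} {value:.2f}")
```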
Workflow, step 2:
Manual post-processing of PoS tagging results
A (first) sample evaluation of the ‘beta’ results
Comparison: results from automatic tagging with STTS2.0-BETA vs. manual expert annotation using STTS2.0-ALPHA
Sample: 1,000 tokens from a social chat (= the most ‘extreme’ type of chat in the corpus, with the most deviations from the written standard of edited text)
Standard PoS taggers:
Accuracy on Chat Corpus: ~71% (vs. 97% accuracy on newspaper text)
Tagging models from the “Schreibgebrauch” project:
Accuracy on Chat Corpus: 83.5%
Tagging results for ‘extreme’ sample: 76%
Result from qualitative evaluation: The tools from the “Schreibgebrauch” project can assign CMC-specific tags (emoticons, action words). Nevertheless, the “non-standardness” of written CMC still causes trouble in several respects.
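A comparison of this kind can be sketched as a token-by-token match between the automatic tags and the expert tags. The sketch assumes that both annotations share the same tokenisation and that the tags are directly comparable; because the automatic output uses STTS2.0-BETA and the expert annotation uses STTS2.0-ALPHA, a tag mapping step (not shown, not described on the slides) may be needed first. The data is a toy example, not taken from the corpus.

```python
from collections import Counter


def tagging_accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must be aligned token by token")
    if not gold:
        raise ValueError("empty sample")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def confusions(predicted, gold):
    """Count (gold, predicted) pairs for the mismatching tokens."""
    return Counter((g, p) for p, g in zip(predicted, gold) if p != g)


# Toy tag sequences for four tokens.
gold_tags = ["NN", "VVFIN", "EMO", "AKW"]
auto_tags = ["NN", "VVFIN", "EMO", "VVFIN"]

accuracy = tagging_accuracy(auto_tags, gold_tags)
errors = confusions(auto_tags, gold_tags)
print(accuracy, errors.most_common())
```

Counting the confused tag pairs is what makes a qualitative evaluation possible: it shows which categories the tagger mistakes for which.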
A (first) sample evaluation of the ‘beta’ results
Emoticons (EMO): 25 out of 35 occurrences tagged correctly (71%).
Interjections (ITJ): 36 out of 59 occurrences tagged correctly (61%).
Action words (AKW): 22 out of 37 occurrences tagged correctly (59%):
- Very good results for acronymic AKW (*g*, G, *lol*, ggg, *s*): 14 out of 15 tagged correctly (= 93%)
- Not so good results for simple verb AKW (guck, freu, lach, wart; hinstell, aufpluster, raufkletter, aufkleb): 8 out of 17 tagged correctly (= 47%)
- 3 complex AKW in the sample were not tagged as AKW.
Tokens with non-standard spellings:
a) NN and NE without capitalization (chatter nicknames excluded): correct: 48 out of 72 (= 67%)
→ particularly problematic: nominalisations without capitalization (das küssen/VVFIN, was verdauliches/ADJA, im zeugnis nur einsen/VVINF, leute zum anpacken/VVINF): correct: 1 out of 13 (= 8%)
b) Other non-standard spellings (colloquial spellings, typos, character iterations): correct: 42 out of 87 (= 48%)
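The percentages in this sample evaluation follow directly from the reported counts. The short sketch below recomputes them as a consistency check; the category labels are shorthand introduced here, while the counts are those reported in the evaluation.

```python
# (correct, total) per category, as reported in the sample evaluation.
counts = {
    "EMO": (25, 35),
    "ITJ": (36, 59),
    "AKW (all)": (22, 37),
    "AKW (acronymic)": (14, 15),
    "AKW (simple verb)": (8, 17),
    "NN/NE without capitalization": (48, 72),
    "nominalisations without capitalization": (1, 13),
    "other non-standard spellings": (42, 87),
}


def percent_correct(correct, total):
    """Accuracy as a whole-number percentage."""
    return round(100 * correct / total)


report = {name: percent_correct(c, t) for name, (c, t) in counts.items()}
for name, pct in report.items():
    print(f"{name:<40} {pct:>3}%")
```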
Next steps / outlook
• A manual post-processing of the tokenization + PoS annotations for parts of the corpus will be done during the project period. In addition, normalizations will be added for parts of the corpus.
• OrthoNormal v. 1.2 (adapted for PoS tagged ChatXML) can be downloaded from the FOLKER website at http://agd.ids-mannheim.de/folker.shtml
→ Further work on annotations will be possible (for everybody) even after the project is finished.
• The manually corrected corpus parts will be made available and may serve as a gold standard for adapting and optimizing tagging models for CMC/social media.
Outlook: The target resource
After its integration into the CLARIN-D infrastructure the resource will be characterized by the following added values:
• advanced accessibility and retrieval options;
• interoperability with other corpus resources that are represented in TEI and with annotation and analysis tools that support the TEI format;
• advanced querying options (PoS tags, normalized spellings);
• interoperability with other corpus resources that have been tagged with STTS;
• advanced options for corpus-based analyses of the peculiarities of CMC discourse as compared to the language of edited text and of spoken language, using the text and speech corpora which are already available in the corpus infrastructures of BBAW and IDS.
References
Bartz, Thomas; Beißwenger, Michael; Storrer, Angelika (2014): Optimierung des Stuttgart-Tübingen-Tagset für die linguistische Annotation von Korpora zur internetbasierten Kommunikation: Phänomene, Herausforderungen, Erweiterungsvorschläge. In: Journal for Language Technology and Computational Linguistics 28 (1), 157-198. http://www.jlcl.org/2013_Heft1/7Bartz.pdf
Beißwenger, Michael (2013): Das Dortmunder Chat-Korpus. In: Zeitschrift für germanistische Linguistik 41 (1), 161-164. Extended version: http://www.linse.uni-due.de/tl_files/PDFs/Publikationen-Rezensionen/Chatkorpus_Beisswenger_2013.pdf
Beißwenger, Michael; Ermakova, Maria; Geyken, Alexander; Lemnitzer, Lothar; Storrer, Angelika (2012): A TEI Schema for the Representation of Computer-mediated Communication. In: Journal of the Text Encoding Initiative (jTEI) 3. http://jtei.revues.org/476 (DOI: 10.4000/jtei.476).
Beißwenger, Michael; Bartz, Thomas; Storrer, Angelika; Westpfahl, Swantje (2015): Tagset und Richtlinie für das PoS-Tagging von Sprachdaten aus Genres internetbasierter Kommunikation. Guideline Document, Dortmund 2015. https://sites.google.com/site/empirist2015/home/annotation-guidelines
Horbach, Andrea; Steffen, Diana; Thater, Stefan; Pinkal, Manfred (2014): Improving the Performance of Standard Part-of-Speech Taggers for Computer-Mediated Communication. Proceedings of KONVENS 2014, 171-177.
Horbach, Andrea; Thater, Stefan; Steffen, Diana; Fischer, Peter M.; Witt, Andreas; Pinkal, Manfred (2015): Internet Corpora: A Challenge for Linguistic Processing. In: Datenbank-Spektrum 15 (1), 41-47.
Margaretha, Eliza; Lüngen, Harald (2014): Building Linguistic Corpora from Wikipedia Articles and Discussions. In: Journal of Language Technology and Computational Linguistics (JLCL) 29 (2), 59-82. http://www.jlcl.org/2014_Heft2/3MargarethaLuengen.pdf
TEI Consortium (2015): TEI P5: Guidelines for Electronic Text Encoding and Interchange. http://www.tei-c.org/Guidelines/P5/
Schiller, Anne; Teufel, Simone; Stöckert, Christine (1999): Guidelines für das Tagging deutscher Textcorpora mit STTS (Kleines und großes Tagset). University of Stuttgart: Institut für maschinelle Sprachverarbeitung.
Schmidt, Thomas (2012): EXMARaLDA and the FOLK tools – two toolsets for transcribing and annotating spoken language. In: Proceedings of the Eighth conference on International Language Resources and Evaluation (LREC’12), Istanbul, Turkey: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2012/pdf/529_Paper.pdf
Zinsmeister, Heike; Heid, Ulrich; Beck, Kathrin (Eds., 2014): Das STTS-Tagset für Wortartentagging - Stand und
Other corpora / data sets in the project focus
§ German WhatsApp Corpus (“What's up, Deutschland?”)
§ German Wikipedia corpus in DeReKo
§ German News Corpus in DeReKo
§ DWDS Blog Corpus
Mentions of chatter nicknames in the user-generated content of postings: 86 out of 104 occurrences tagged correctly (= 83%).
Examples of incorrect tags:
§ Danke Pharao/NN
§ Habe heute einen SMS von Tigaaaaelse/NN bekommen
§ re schdöbbs/AW [Nickname: stoeps]
§ Hallo Erdbeere/NN
§ heya marc30/ADJA
§ quaki/ADJD wüsste nich wo sie hinziehn sollte