Adding Value to CMC Corpora: CLARINification and Part-of-Speech Annotation of the Dortmund Chat Corpus

(1)


NLP 4 CMC 2015: 2nd Workshop on Natural Language Processing for Computer-Mediated Communication / Social Media

Michael Beißwenger, Eric Ehrhardt, Andrea Horbach, Harald Lüngen, Diana Steffen, Angelika Storrer

(2)

ChatCorpus2CLARIN: Project background

Duration: May 2015 – February 2016

Project team: Michael Beißwenger (U Dortmund), Angelika Storrer, Eric Ehrhardt (U Mannheim), Harald Lüngen (IDS), Axel Herold (BBAW) + other colleagues at IDS and BBAW

The task: Remodeling of the Dortmund Chat Corpus and samples of other CMC resources in compliance with existing standards for the representation of corpora in the Digital Humanities. Integration into the CLARIN-D infrastructures at BBAW and IDS.

Curation project of the CLARIN-D F-AG 1 “German Philology”

Main goal:

•  Pave the way for the inclusion of linguistically annotated CMC resources into the CLARIN-D corpus infrastructures and create the prerequisites for investigating linguistic peculiarities of CMC with state-of-the-art corpus technology.

(3)

Dortmund Chat Corpus

http://www.chatkorpus.tu-dortmund.de

478 logfile documents with 140,240 user postings or 1M words of German chat discourse.

Resource for the analysis of linguistic variation in chats, including chats from different social/institutional contexts (social chats, advisory chats, learning and teaching, moderated chats in the media context).

Annotated in a home-grown XML format (‘ChatXML’): (1) basic structure of chat logfiles and postings, (2) selected “netspeak” phenomena, (3) selected metadata.

The corpus

Beißwenger (2013) in ZGL / LINSE: http://tinyurl.com/chatkorpus

(4)

ChatCorpus2CLARIN

1) Project background

2) Work packages in the project:

- CLARINification, legal issues + licensing

- TEI representation

- enrich the data with additional linguistic annotations (PoS, normalised spellings, ...)

3) Zoom-in: PoS annotation with an extended STTS (cooperation with U Saarbrücken)

4) Next Steps / Outlook:

- Manual post-processing with OrthoNormal

- Challenges of PoS tagging: sample comparison of STTS-SB with a gold standard annotation

(5)

Work packages: (1) “CLARINification”

Integration of the resource into the CLARIN-D infrastructures:

•  Hosting at the CLARIN-D centres BBAW and IDS.

•  Developing a CMDI representation of the metadata.

•  Ingestion into the repositories for long-term data archiving at both centres.

•  The resource and its metadata will then be:

   - harvestable via OAI-PMH,

   - accessible via the CLARIN Virtual Language Observatory,

   - searchable via the CLARIN-D Federated Content Search,

   - addressable via PIDs,
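To illustrate what "harvestable via OAI-PMH" means in practice, here is a minimal sketch that builds a `ListRecords` request URL. The endpoint URL is a placeholder and the metadata prefix `cmdi` is an assumption; the slides do not name the actual repository endpoints or prefixes.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real repository URLs are not given in the slides.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="cmdi", set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    'verb' and 'metadataPrefix' are standard OAI-PMH request arguments;
    the default prefix 'cmdi' is an assumption about the repository setup.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # optional: restrict harvesting to one set
    return base_url + "?" + urlencode(params)


url = list_records_url(BASE_URL)
print(url)
```

A harvester would fetch this URL and follow the resumption tokens in the response; that part is omitted here.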

(6)

Work packages: (1) “CLARINification”

Legal issues + licensing:

•  The conditions for licensing the corpus resource for scientific use will be defined on the basis of a legal expert opinion that is currently being sought (John Weitzmann, iRights law office, Berlin).

•  Depending on the legal opinion, different licensing models are possible (CLARIN-D end user license type PUB ‘publicly available’ or more restrictive license types).

(7)

Work packages: (2) TEI representation

The goal:

•  Remodeling of ChatXML in TEI-P5 + additional structural information and metadata

•  Conversion of the whole chat corpus into the TEI target format

Resources:

•  TEI schemas and models developed in the TEI special interest group “computer-mediated communication” (http://www.tei-c.org/Activities/SIG/CMC/)

Result:

•  [Oct 2015] Customized TEI schema (ODD) which adapts the models available in TEI-P5 for the structural + linguistic peculiarities of CMC and which has been tested not only for chat but also for samples of other genres (Wikipedia talk pages, blog comments, tweets, WhatsApp, Usenet news)
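The conversion step can be pictured with a small sketch that maps one chat message to a TEI-style element. All element and attribute names below (`message`, `nickname`, `time`, `post`, `who`, `when`) are illustrative assumptions; they are not taken from ChatXML or from the project's ODD.

```python
import xml.etree.ElementTree as ET

# Hypothetical ChatXML-like input; the real ChatXML element names may differ.
CHATXML_SNIPPET = '<message nickname="anna" time="20:15:03">tach @ all</message>'


def message_to_post(message):
    """Map one chat message element to a TEI-style <post> element.

    The target names (<post>, @who, @when, <p>) are assumptions made
    for illustration, not the project's actual schema.
    """
    post = ET.Element("post")
    post.set("who", "#" + message.get("nickname"))  # pointer to a participant
    post.set("when", message.get("time"))           # time stamp of the posting
    paragraph = ET.SubElement(post, "p")
    paragraph.text = message.text                    # the posting's text
    return post


post = message_to_post(ET.fromstring(CHATXML_SNIPPET))
print(ET.tostring(post, encoding="unicode"))
```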

(8)

Work packages: (2) TEI representation

Documentation of schema and ODD on the SIG pages in the wiki of the TEI:

(9)

Work packages: (3) PoS annotation and tagset

Extended STTS tagset (“STTS 2.0”) with categories for CMC-specific items and for linguistic phenomena typical of spontaneous dialogic interaction. Downward-compatible with STTS (1999).

Overview of extensions and modifications to STTS (1999) in STTS 2.0

Compatible with

•  the extended STTS for spoken language which is used for PoS tagging the FOLK corpus at IDS,

•  the extended STTS for CMC which is used in the GSCL/Empirikom shared task on CMC.

(10)

Work packages: (3) PoS annotation and tagset

PoS tagset + annotation guidelines available on the website of the GSCL/Empirikom shared task on automatic linguistic annotation of CMC (EmpiriST2015):

https://sites.google.com/site/empirist2015/home/annotation-guidelines

(11)

Work packages: (3) PoS annotation and tagset

Workflow:

1.  Automatic tokenisation, PoS annotation & lemmatisation of the chat corpus with tools + tagging models from the BMBF project “Schreibgebrauch” at U Saarbrücken (Horbach et al. 2014, Horbach et al. 2015)

    PoS tag set: STTS 2.0 as described in Bartz et al. 2014 (‘STTS2.0-BETA’)

    Representation of the tagging results as additions to the ChatXML format.

2.  Manual post-processing of the tagging results and “upgrade” to STTS 2.0-ALPHA (Beißwenger et al. 2015) using OrthoNormal in FOLKER (preview version 1.2) with an import/export filter for PoS tagged ChatXML (defined by Thomas

(12)

Workflow, step 1: Automatic PoS Tagging

(cooperation with “Schreibgebrauch” project, U Saarbrücken)

•  To speed up the manual annotation process, we use a PoS tagger to pre-annotate the corpus.

•  Challenge:

   - Standard PoS taggers perform poorly on CMC data

   - Accuracy on Dortmund Chat Corpus: ~71%

   - (vs. 97% accuracy on newspaper text)

BMBF project “Analyse und Instrumentarien zur Beobachtung des Schreibgebrauchs im Deutschen” http://www.schreibgebrauch.de

(13)

Workflow, step 1: Automatic PoS Tagging

(cooperation with “Schreibgebrauch” project, U Saarbrücken)

•  Examples (Dortmund Chat Corpus)

   - tach/ADJD @/XY all/PIAT

   - was/PWS wilst/VVFIN duda/NE mit/APPR sagen/VVFIN

   - hamburg/ADV hat/VAFIN mehr/PIS brücken/VVINF als/KOKOM venedig/ADJD

•  Challenge:

   - Chat data contains many words / spelling variants / spelling mistakes / ... which do not occur in the training data

   - Taggers perform particularly badly on these out-of-vocabulary tokens
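The out-of-vocabulary problem can be made concrete with a small sketch that reads tagger output in the `token/TAG` notation used on this slide and lists the tokens missing from a vocabulary. The vocabulary here is a toy set invented for illustration; a real check would use the word forms of the training corpus.

```python
def parse_tagged(line):
    """Split 'token/TAG token/TAG ...' into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition keeps tokens that themselves contain a slash intact
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


def oov_tokens(pairs, vocabulary):
    """Return the tokens whose lowercased form is not in the vocabulary."""
    return [token for token, _ in pairs if token.lower() not in vocabulary]


example = "was/PWS wilst/VVFIN duda/NE mit/APPR sagen/VVFIN"
# Toy vocabulary (assumption): standard spellings a newspaper-trained tagger might know.
toy_vocab = {"was", "willst", "du", "da", "mit", "sagen"}

pairs = parse_tagged(example)
print(oov_tokens(pairs, toy_vocab))  # ['wilst', 'duda']
```

In this toy run the misspelling "wilst" and the merged form "duda" are flagged, which are exactly the kinds of tokens the slide describes as problematic.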

(14)

Adapting PoS-Taggers to CMC Data

•  Basic idea:

   - Annotate a small corpus of CMC data

   - Add this to existing gold standard training data (TIGER)

   - Retrain the tagger (TreeTagger)

•  Underlying intuition:

   - Manual annotation provides information about CMC-specific words / spelling variants / ...

(15)

Adapting PoS taggers to CMC: Training Data

•  Standard training set: TIGER (newspaper, ~900k tokens)

•  Additional corpora:

   - Dortmund Chat Corpus

   - Chefkoch

   - (Twitter)

•  Manual gold-standard annotation for 12k tokens each

   - Training: ~4k

   - Evaluation: ~8k

•  Tagset: “STTS 2.0-BETA”, downward compatible
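The data preparation described above can be sketched as follows: split the manually annotated postings into a training and an evaluation part by token budget, then append the training part to the existing newspaper training data. This is a minimal sketch under stated assumptions: the function names, the toy data and the tab-separated one-token-per-line output are illustrative, and the retraining call itself is not shown.

```python
def split_by_token_budget(postings, train_tokens):
    """Assign whole postings to the training part until the token budget is reached."""
    train, evaluation, used = [], [], 0
    for posting in postings:
        if used < train_tokens:
            train.append(posting)
            used += len(posting)
        else:
            evaluation.append(posting)
    return train, evaluation


def to_training_lines(postings):
    """Flatten postings into 'token<TAB>tag' lines, one token per line."""
    return [f"{token}\t{tag}" for posting in postings for token, tag in posting]


# Toy gold-standard postings (assumption); each posting is a list of (token, tag) pairs.
gold_postings = [
    [("tach", "ITJ"), ("@", "XY")],
    [("lol", "AKW"), (":-)", "EMO")],
    [("hallo", "ITJ"), ("leute", "NN")],
]
tiger_lines = ["Die\tART", "Zeitung\tNN"]  # stand-in for the TIGER training data

train, evaluation = split_by_token_budget(gold_postings, train_tokens=3)
combined = tiger_lines + to_training_lines(train)
print(len(train), len(evaluation), len(combined))
```

With the proportions from the slide, `train_tokens` would be roughly 4,000 per corpus, leaving about 8,000 tokens for evaluation.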

(16)

Adapted Tagger – Accuracy

                        Complete Test Set        “Standard STTS” only
Tagger trained on       Chat   Forum  Both       Chat   Forum  Both
Tiger                   0.71   0.85   0.78       0.80   0.87   0.84
Tiger +auto             0.73   0.86   0.80       0.82   0.89   0.85
Tiger +gold             0.83   0.88   0.86       0.86   0.91   0.89
Tiger +gold +auto2      0.84   0.89   0.86       0.87   0.92   0.90

(17)

Adapted Tagger – Error Reduction

                        Complete Test Set        “Standard STTS” only
Tagger trained on       Chat   Forum  Both       Chat   Forum  Both
Tiger                   0.00   0.00   0.00       0.00   0.00   0.00
Tiger +auto             0.05   0.06   0.06       0.08   0.09   0.08
Tiger +gold             0.39   0.23   0.33       0.31   0.28   0.29
Tiger +gold +auto2      0.42   0.28   0.37       0.37   0.34   0.35
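The slides do not state how error reduction was computed. A common definition, assumed here, is the relative reduction of the error rate against the Tiger-only baseline. The sketch below applies it to the rounded chat accuracies from the accuracy slide.

```python
def error_reduction(baseline_acc, adapted_acc):
    """Relative error reduction of an adapted tagger over a baseline.

    Assumed definition: (baseline error - adapted error) / baseline error.
    """
    baseline_err = 1.0 - baseline_acc
    adapted_err = 1.0 - adapted_acc
    return (baseline_err - adapted_err) / baseline_err


# Rounded accuracies for the chat data, complete test set.
accuracy_chat = {
    "Tiger": 0.71,
    "Tiger +auto": 0.73,
    "Tiger +gold": 0.83,
    "Tiger +gold +auto2": 0.84,
}

for name, acc in accuracy_chat.items():
    print(name, round(error_reduction(accuracy_chat["Tiger"], acc), 2))
```

Values computed from the rounded accuracies come out slightly higher than the figures reported on the slide, which were presumably derived from unrounded accuracies.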


(19)

Workflow, step 2:

Manual post-processing of PoS tagging results

(20)

A (first) sample evaluation of the ‘beta’ results

Comparison: results from automatic tagging with STTS2.0-BETA vs. manual expert annotation using STTS2.0-ALPHA

Sample: 1,000 tokens from a social chat (= the most ‘extreme’ type of chat in the corpus, with the most deviations from the written standard of edited text)

Standard PoS taggers:

Accuracy on Chat Corpus: ~71% (vs. 97% accuracy on newspaper text)

Tagging models from the “Schreibgebrauch” project:

Accuracy on Chat Corpus: 83.5%

Tagging results for ‘extreme’ sample: 76%

Result from qualitative evaluation: The tools from the “Schreibgebrauch” project can assign CMC-specific tags (emoticons, action words); nevertheless, the “non-standardness” of written CMC still causes trouble in several respects.
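A comparison of this kind can be sketched as a token-by-token check of automatic tags against manual tags, reporting overall accuracy and accuracy per gold tag. The tag sequences below are toy data, and the sketch assumes both annotations share the same tokenisation and that any differences between the BETA and ALPHA tagsets have already been mapped.

```python
from collections import Counter


def accuracy_by_gold_tag(gold_tags, predicted_tags):
    """Return overall accuracy and a dict of accuracy per gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must be aligned token by token")
    total, correct = Counter(), Counter()
    for gold, predicted in zip(gold_tags, predicted_tags):
        total[gold] += 1
        if gold == predicted:
            correct[gold] += 1
    overall = sum(correct.values()) / len(gold_tags) if gold_tags else 0.0
    per_tag = {tag: correct[tag] / total[tag] for tag in total}
    return overall, per_tag


# Toy data (assumption): one action word is mistagged as a finite verb.
gold = ["EMO", "AKW", "ITJ", "NN", "AKW"]
pred = ["EMO", "VVFIN", "ITJ", "NN", "AKW"]

overall, per_tag = accuracy_by_gold_tag(gold, pred)
print(overall, per_tag["AKW"])  # 0.8 0.5
```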

(21)

A (first) sample evaluation of the ‘beta’ results

Emoticons (EMO): 25 out of 35 occurrences tagged correctly (71%).

Interjections (ITJ): 36 out of 59 occurrences tagged correctly (61%).

Action words (AKW): 22 out of 37 occurrences tagged correctly (59%):

•  Very good results for acronymic AKW (*g*, G, *lol*, ggg, *s*): 14 out of 15 tagged correctly (= 93%)

•  Not so good results for simple verb-AKW (guck, freu, lach, wart; hinstell, aufpluster, raufkletter, aufkleb): 8 out of 17 tagged correctly (= 47%)

•  3 complex AKW in the sample haven’t been tagged as AKW:

(22)

Tokens with non-standard spellings:

a)  NN and NE without capitalization (chatter nicknames excluded): correct: 48 out of 72 (= 67%)

    →  particularly problematic: nominalisations without capitalization (das küssen/VVFIN, was verdauliches/ADJA, im zeugnis nur einsen/VVINF, leute zum anpacken/VVINF): correct: 1 out of 13 (= 8%)

b)  Other non-standard spellings (colloquial spellings, typos, character iterations): correct: 42 out of 87 (= 48%)

(23)

Next steps / outlook

•  A manual post-processing of the tokenization + PoS annotations for parts of the corpus will be done during the project period. In addition, normalizations will be added for parts of the corpus.

•  OrthoNormal v. 1.2 (adapted for PoS tagged ChatXML) can be downloaded from the FOLKER website at http://agd.ids-mannheim.de/folker.shtml

   →  Further work on annotations will be possible (for everybody) even after the project is finished.

•  The manually corrected corpus parts will be made available and may serve as a gold standard for adapting and optimizing tagging models for CMC/social media.

(24)

Outlook: The target resource

After its integration into the CLARIN-D infrastructure the resource will be characterized by the following added values:

•  advanced accessibility and retrieval options;

•  interoperability with other corpus resources that are represented in TEI and with annotation and analysis tools that support the TEI format;

•  advanced querying options (PoS tags, normalized spellings);

•  interoperability with other corpus resources that have been tagged with STTS;

•  advanced options for corpus-based analyses on the peculiarities of CMC discourse as compared to the language of edited text and of spoken language, using the text and speech corpora which are already available in the corpus infrastructures of BBAW and IDS.

(25)

References

Bartz, Thomas; Beißwenger, Michael; Storrer, Angelika (2014): Optimierung des Stuttgart-Tübingen-Tagset für die linguistische Annotation von Korpora zur internetbasierten Kommunikation: Phänomene, Herausforderungen, Erweiterungsvorschläge. In: Journal for Language Technology and Computational Linguistics 28 (1), 157-198. http://www.jlcl.org/2013_Heft1/7Bartz.pdf

Beißwenger, Michael (2013): Das Dortmunder Chat-Korpus. In: Zeitschrift für germanistische Linguistik 41 (1), 161-164. Extended version: http://www.linse.uni-due.de/tl_files/PDFs/Publikationen-Rezensionen/Chatkorpus_Beisswenger_2013.pdf

Beißwenger, Michael; Ermakova, Maria; Geyken, Alexander; Lemnitzer, Lothar; Storrer, Angelika (2012): A TEI Schema for the Representation of Computer-mediated Communication. In: Journal of the Text Encoding Initiative (jTEI) 3. http://jtei.revues.org/476 (DOI: 10.4000/jtei.476).

Beißwenger, Michael; Bartz, Thomas; Storrer, Angelika; Westpfahl, Swantje (2015): Tagset und Richtlinie für das PoS-Tagging von Sprachdaten aus Genres internetbasierter Kommunikation. Guideline Document, Dortmund 2015. https://sites.google.com/site/empirist2015/home/annotation-guidelines

Horbach, Andrea; Steffen, Diana; Thater, Stefan; Pinkal, Manfred (2014): Improving the Performance of Standard Part-of-Speech Taggers for Computer-Mediated Communication. Proceedings of KONVENS 2014, 171-177.

Horbach, Andrea; Thater, Stefan; Steffen, Diana; Fischer, Peter M.; Witt, Andreas; Pinkal, Manfred (2015): Internet Corpora: A Challenge for Linguistic Processing. In: Datenbank-Spektrum 15 (1), 41-47.

Margaretha, Eliza; Lüngen, Harald (2014): Building Linguistic Corpora from Wikipedia Articles and Discussions. In: Journal of Language Technology and Computational Linguistics (JLCL) 29 (2), 59-82. http://www.jlcl.org/2014_Heft2/3MargarethaLuengen.pdf

TEI Consortium (2015): TEI P5: Guidelines for Electronic Text Encoding and Interchange. http://www.tei-c.org/Guidelines/P5/

Schiller, Anne; Teufel, Simone; Stöckert, Christine (1999): Guidelines für das Tagging deutscher Textcorpora mit STTS (Kleines und großes Tagset). University of Stuttgart: Institut für maschinelle Sprachverarbeitung.

Schmidt, Thomas (2012): EXMARaLDA and the FOLK tools – two toolsets for transcribing and annotating spoken language. In: Proceedings of the Eighth conference on International Language Resources and Evaluation (LREC’12), Istanbul, Turkey: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2012/pdf/529_Paper.pdf.

Zinsmeister, Heike; Heid, Ulrich; Beck, Kathrin (Eds., 2014): Das STTS-Tagset für Wortartentagging - Stand und

(26)