
Trameur: A Framework for Annotated Text Corpora Exploration

Serge Fleury (Sorbonne Nouvelle – Paris 3)

Maria Zimina (Paris Diderot – Sorbonne Paris Cité)

(2)

Trameur in Brief

Trameur (Fleury, 2013a) is a tool for exploring Treebanks, that is, parsed text corpora annotated with syntactic or semantic sentence structure.

The software is distributed with a graphical user interface.

A reference package for Windows is available for free download from the official website: http://www.tal.univ-paris3.fr/trameur

COLING 2014 2

(3)

Search engine, context return, text mapping, graphs and statistical analysis of dependency relations within a single graphical user interface.

Simultaneous access to multiple corpus layers and their interaction statistics.

Possibility to implement incremental textual resources for Treebanks.

(4)

Incremental Textual Resources for Treebanks

23/08/2014, COLING 2014

[Figure not recoverable; surviving labels: numerical, additional, comments]

(5)
(6)

System architecture

Segmentation of text data: Text (units, delimiters) + parts

<part num=1> x1 x2, x3 x2. x4 x5 .... xk §
<part num=2> xk+1 x2, xk+2 x4. x2 xk+3 .... xn §
etc.

Delimiters: .,;!?+=() § etc.

THREAD: a list of annotated positions
x1 x2 , x3 …
Form: x3
Lemma: lem(x3)
Category: cat(x3)
Annotation4: ann4(x3)
Etc.

FRAME: lists of couples of position identifiers (ONE list = ONE partition)
part 1: pos(x1), pos(xk)
part 2: pos(xk+1), pos(xn)
etc.

[Diagram labels: engine, item]
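The Thread and Frame described on this slide can be pictured with a small data structure. The following is only a sketch under assumptions: the names `Item`, `build_thread` and `items_in` are hypothetical and are not taken from Trameur itself. It stores the Thread as a list of annotated positions and the Frame as named lists of (start, end) position couples.

```python
from dataclasses import dataclass, field

# Delimiter characters listed on the slide.
DELIMITERS = set(".,;!?+=()§")


@dataclass
class Item:
    """One position on the Thread (hypothetical layout)."""
    position: int                      # 1-based position identifier
    form: str                          # surface form or delimiter
    annotations: dict = field(default_factory=dict)  # lemma, category, ...


def build_thread(forms):
    """Number already segmented forms from 1 to n."""
    return [Item(i, f) for i, f in enumerate(forms, start=1)]


def items_in(thread, couple):
    """Return the Thread items covered by one (start, end) couple."""
    start, end = couple
    return thread[start - 1:end]


thread = build_thread(["x1", "x2", ",", "x3", "§", "x4", "x2", ".", "§"])
thread[3].annotations = {"lemma": "lem(x3)", "category": "cat(x3)"}

# ONE list of couples = ONE partition.
frame = {"part": [(1, 5), (6, 9)]}

second_part = [item.form for item in items_in(thread, frame["part"][1])]
print(second_part)
```

Keeping the Frame as position couples, rather than copying text into each part, means that several partitions can share the same Thread.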

(7)

Statistical tables, Types and Spans

[Figure labels: TYPES, SPANS]

Systems of item types and text spans allow automatic counts in statistical tables.

(8)

How it works…

Segmentation of units: Thread
+ Corpus parts: Frame
+ Statistical Model

PART i:
…X….X….
A…Z…E…
W…X…X…
W…Y..Y

Token   Prob. Ind.
X       23.43
Y       12.68
Z       5.57
W       5.66

PART j:
…Y….Y….
A…Z…E…
C…X…X
…C…Y..Y

Token   Prob. Ind.
Y       13.73
X       21.86
A       7.75
C       6.55

Statistical tables are processed using quantitative methods
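Before any statistical model is applied, a table of counts per part is needed. A minimal sketch of that step is given below, assuming the Thread is available as a plain list of forms and each part as a (start, end) couple of 1-based positions; the function name `count_table` and the toy data are illustrative only.

```python
from collections import Counter

DELIMITERS = frozenset(".,;!?+=()§")


def count_table(forms, parts):
    """Count item types per part.

    forms: list of forms, where index 0 holds position 1.
    parts: dict mapping a part name to a (start, end) couple, inclusive.
    Delimiters are skipped so that only item types are counted.
    """
    table = {}
    for name, (start, end) in parts.items():
        span = forms[start - 1:end]
        table[name] = Counter(f for f in span if f not in DELIMITERS)
    return table


forms = ["X", "X", "A", "Z", ".", "§", "Y", "Y", "C", "X", ".", "§"]
parts = {"PART i": (1, 6), "PART j": (7, 12)}

table = count_table(forms, parts)
for name, counts in table.items():
    print(name, dict(counts))
```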

(9)
(10)

User interface

[Screenshot of the main window; labelled areas: navigation bar, Frame edition zone, Thread edition from Frame, progress bar, echo zone, main window buttons]

(11)

Base file (Thread + Frame)

Creating a new base file:
Text only
Tagged texts
XML encoding (TEI etc.)

Importing an existing base file:
Predefined XML format: encoding Thread and Frame annotations

Example of a multi-annotated base file: Rhapsodie2Trameur

(12)

Example: Rhapsodie, a Prosodic-Syntactic Treebank for Spoken French

http://www.projet-rhapsodie.fr

57 short samples of spoken French (≈5 min each)
≈ 33 000 words, tabular layout
Microsyntactic annotations
Macrosyntactic analysis
Prosody

arborator.ilpga.fr (Gerdes et al., 2012)

(13)

Rhapsodie data within a base file

RELATION(TARGET):
Position 51: "appel" OBJ(47)
"lance" (position 47)

RELATION is a character string showing the name of the target relation
TARGET is a number indicating a specific position on the Thread

Corpus parts are identified on the Thread
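The RELATION(TARGET) notation can be read mechanically. The sketch below shows one way to do so with a regular expression; the pattern and the function name `parse_dependency` are assumptions made for illustration and do not describe how Trameur parses its base files.

```python
import re

# RELATION: a name without spaces or parentheses; TARGET: a Thread position.
ANNOTATION = re.compile(r"^(?P<relation>[^()\s]+)\((?P<target>\d+)\)$")


def parse_dependency(annotation):
    """Split 'OBJ(47)' into ('OBJ', 47)."""
    match = ANNOTATION.match(annotation.strip())
    if match is None:
        raise ValueError(f"not a RELATION(TARGET) annotation: {annotation!r}")
    return match.group("relation"), int(match.group("target"))


# The two positions shown on the slide.
forms = {47: "lance", 51: "appel"}

relation, target = parse_dependency("OBJ(47)")
print(f'Position 51: "{forms[51]}" {relation} -> "{forms[target]}" (position {target})')
```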

(14)

Thread and Frame display

(15)
(16)

Sections map

After each Illocutionary Unit (IU), a delimiting character (§) is used to build a map of sections.
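As a rough illustration of how a map of sections can be derived from the § delimiter, the sketch below splits a list of forms into sections and draws one character per section, marking those that contain at least one form from a given set. The function names and the toy French tokens are invented for the example; they are not Rhapsodie data.

```python
def split_sections(forms, delimiter="§"):
    """Cut a list of forms into sections, closing one at each delimiter."""
    sections, current = [], []
    for form in forms:
        if form == delimiter:
            sections.append(current)
            current = []
        else:
            current.append(form)
    if current:                     # trailing section without a delimiter
        sections.append(current)
    return sections


def render_map(sections, wanted):
    """One character per section: '#' if it contains a wanted form, else '.'."""
    return "".join("#" if wanted & set(section) else "." for section in sections)


forms = "je pense que oui § d'accord § il le sait § bon §".split()
sections = split_sections(forms)
section_map = render_map(sections, {"je", "il", "le"})
print(section_map)
```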

Sections map from Rhapsodie (extract):

(17)

categories

lemmatisation

(18)

Demo 1: Sections map / statistics

What POS patterns are underrepresented in Illocutionary Units with a clitic?

POS N-gram   Ind-Spécif   Fq-Totale   Fq-Partie
Adj V        -10.1        49          25
I I I I      -10.1        16          3
I I I        -19.2        57          21
Pro V        -22.0        133         71
D N V        -24.0        216         130
N            -28.6        6308        5232
N V          -39.4        384         234
I I          -48.0        273         140
I            -91.3        1978        1397

Columns: Ind-Spécif = specificity index, Fq-Totale = total frequency, Fq-Partie = frequency in the part.
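A negative specificity index signals underrepresentation in a part. The slides do not state which probabilistic model Trameur applies, so the sketch below is only an assumption: it uses the hypergeometric model that is common in textometry and reports the signed base-10 logarithm of the tail probability. The function name and the argument names (T, t, F, f) are hypothetical, and the toy figures are not meant to reproduce the table values.

```python
from math import comb, log10


def specificity(T, t, F, f):
    """Signed specificity index under a hypergeometric model (assumed).

    T: corpus size, t: part size, F: total frequency, f: frequency in part.
    Positive when f is at or above its expected value, negative when below.
    """
    if not (0 <= f <= min(F, t) and t - f <= T - F):
        raise ValueError("impossible combination of counts")

    def weight(k):
        # Number of ways to observe exactly k occurrences in the part.
        return comb(F, k) * comb(T - F, t - k)

    expected = F * t / T
    if f >= expected:
        ks = range(f, min(F, t) + 1)            # upper tail P(X >= f)
        sign = 1
    else:
        ks = range(max(0, t - (T - F)), f + 1)  # lower tail P(X <= f)
        sign = -1

    # Work with logarithms of exact integers to avoid float underflow.
    log_p = log10(sum(weight(k) for k in ks)) - log10(comb(T, t))
    return -sign * log_p


under = specificity(T=1000, t=100, F=50, f=0)
over = specificity(T=1000, t=100, F=50, f=20)
print(round(under, 2), round(over, 2))
```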

(19)
(20)

SUB -> penser <- OBJ
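A pattern such as SUB -> penser <- OBJ asks for occurrences of a lemma that govern both a SUB and an OBJ dependent. One possible way to search for it over RELATION(TARGET) annotations is sketched below. The record layout (position, lemma, relation, target) and the toy sentence are assumptions made for illustration, not Trameur's internal format.

```python
from collections import defaultdict


def find_governors(items, lemma, required):
    """Positions of `lemma` that govern every relation in `required`.

    items: list of dicts with keys position, lemma, relation, target;
    target is the governor's position, or None for an ungoverned item.
    """
    dependents = defaultdict(set)
    for item in items:
        if item["target"] is not None:
            dependents[item["target"]].add(item["relation"])
    return [
        item["position"]
        for item in items
        if item["lemma"] == lemma and required <= dependents[item["position"]]
    ]


# Toy data: two occurrences of "penser", only the first has an object.
items = [
    {"position": 1, "lemma": "je", "relation": "SUB", "target": 2},
    {"position": 2, "lemma": "penser", "relation": "ROOT", "target": None},
    {"position": 3, "lemma": "cela", "relation": "OBJ", "target": 2},
    {"position": 4, "lemma": "il", "relation": "SUB", "target": 5},
    {"position": 5, "lemma": "penser", "relation": "ROOT", "target": None},
]

hits = find_governors(items, "penser", {"SUB", "OBJ"})
print(hits)
```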

(21)
(22)

Searches in dependency graphs


(23)

Demo 2: Selected relations

Complex predicates which are not governed by another element (ROOTS)

(24)

RELATIONS


(25)
(26)

Demo 3: V -> AD -> Adv

Verbs with added adverbs (Freq = 1141)

(27)

COLLOCATIONS CONSTRAINED BY DEPENDENCY RELATIONS

(28)


(29)

Demo 4: X -> RELATION -> Y (where X is collocate of Y)

Collocates of the verb “aimer” with dependency relations
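Collocates constrained by a dependency relation can be approximated by counting, for a given governor lemma, each (relation, dependent lemma) pair that points to it. The sketch below works on that assumption. The record layout and the toy data are hypothetical, and raw counts stand in for whatever association measure the tool actually reports.

```python
from collections import Counter


def dependency_collocates(items, governor_lemma):
    """Count (relation, dependent lemma) pairs governed by `governor_lemma`.

    items: list of dicts with keys position, lemma, relation, target;
    target is the governor's position, or None for an ungoverned item.
    """
    by_position = {item["position"]: item for item in items}
    counts = Counter()
    for item in items:
        head = by_position.get(item["target"])
        if head is not None and head["lemma"] == governor_lemma:
            counts[(item["relation"], item["lemma"])] += 1
    return counts


# Toy data: two occurrences of "aimer", each with a subject and an adverb.
items = [
    {"position": 1, "lemma": "je", "relation": "SUB", "target": 2},
    {"position": 2, "lemma": "aimer", "relation": "ROOT", "target": None},
    {"position": 3, "lemma": "bien", "relation": "AD", "target": 2},
    {"position": 4, "lemma": "tu", "relation": "SUB", "target": 5},
    {"position": 5, "lemma": "aimer", "relation": "ROOT", "target": None},
    {"position": 6, "lemma": "bien", "relation": "AD", "target": 5},
]

collocates = dependency_collocates(items, "aimer")
for (relation, lemma), freq in collocates.most_common():
    print(f"{lemma} -> {relation} -> aimer: {freq}")
```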

(30)

Conclusions

In Progress…

New perspectives for Treebanks exploration

Successfully tested for monolingual text processing within several research projects in corpus linguistics and discourse analysis (Branca-Rosoff et al., 2012; Née et al., 2012)

Potential for processing parallel and comparable text data in distant languages (Zimina and Fleury, 2014)

(31)

References

Sonia Branca-Rosoff, Serge Fleury, Florence Lefeuvre and Mat Pires. 2012. Discours sur la ville. Corpus de Français Parlé Parisien des années 2000 (CFPP 2000). Sorbonne nouvelle – Paris 3. Online publication: http://cfpp2000.univ-paris3.fr/CFPP2000.pdf

Serge Fleury. 2013a. Le Trameur. Propositions de description et d’implémentation des objets textométriques. Sorbonne nouvelle – Paris 3. Online publication: http://www.tal.univ-paris3.fr/trameur/trameur-propositions-definitions-objets-textometriques.pdf

Serge Fleury. 2013b. Annotations Rhapsodie pour le Trameur. Sorbonne nouvelle – Paris 3. Online publication: http://www.tal.univ-paris3.fr/trameur/bases/rhapsodie2trameur.pdf (v1 base), http://www.tal.univ-paris3.fr/trameur/bases/rhapsodie2trameur-v3.pdf (v3 base)

Kim Gerdes, Sylvain Kahane, Anne Lacheret, Arthur Truong and Paola Pietrandrea. 2012. Intonosyntactic data structures: The Rhapsodie treebank of spoken French. Proceedings of the Linguistic Annotation Workshop, COLING 2012. Jeju, Republic of Korea, July 2012.

Emilie Née, Erin MacMurray and Serge Fleury. 2012. Textometric Explorations of Writing Processes: A Discursive and Genetic Approach to the Study of Drafts. Journées internationales d'Analyse statistique des Données Textuelles (JADT 2012). Liège (Belgium), June 2012.

Maria Zimina and Serge Fleury. 2014. Approche systémique de la résonance textuelle multilingue. Journées internationales d'Analyse statistique des Données Textuelles (JADT
