
BertAA: BERT fine-tuning for Authorship Attribution


Academic year: 2021


(1)

Maël Fabien (1,2), Esaú Villatoro-Tello (1,3), Petr Motlicek (1), and Shantipriya Parida (1)

(2)

Outline


1. Introduction to Authorship Attribution

2. Related works

3. BertAA: BERT fine-tuning for AA

4. Authorship Attribution corpora

5. Results

6. Future Works

7. Conclusion

(3)

Authorship Analysis

• Authorship Verification

• Authorship Attribution: attributing a text to the correct author among a closed set of potential writers (e.g. 5, 10, 25, 50, 75 or 100 authors)

• Author Profiling

(4)

Authorship Attribution

Applications:

• Plagiarism detection

• Historical literature

• Forensic investigations

(5)

Related works: Traditional methods

Ensemble models

Input Text → Features 1, Features 2, Features 3 → one Classifier per feature set → Majority vote → Attribution
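The majority-vote step of such an ensemble can be sketched in a few lines. This is a minimal illustration, not code from the cited works: the function name `majority_vote`, the author labels, and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most classifiers.

    Ties go to the label that appears first in `predictions`
    (an arbitrary choice made for this sketch).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical outputs of three classifiers, one per feature set.
votes = ["author_A", "author_B", "author_A"]
print(majority_vote(votes))
```

Each classifier is trained on its own feature set, so the vote only needs their predicted labels for a given text.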

(6)

Related works: Deep-learning methods

Recurrent Neural Networks

Sentence (Word 1, Word 2, ..., Word N) → one LSTM step per word → Average Pooling → Softmax Classification → Attribution
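The pooling and classification stages of this pipeline can be sketched without a deep-learning library. The LSTM itself is not implemented here: `hidden_states` stands in for its per-word outputs, and the weights, sizes, and function names are hypothetical values chosen for the example.

```python
import math


def average_pool(hidden_states):
    """Average a list of equally sized hidden vectors, one per word."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[i] for h in hidden_states) / n for i in range(dim)]


def linear(vector, weights, bias):
    """One score per author: `weights` holds one row per author."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical LSTM outputs for a 3-word sentence (hidden size 2).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
pooled = average_pool(hidden_states)
weights = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical, 2 candidate authors
probs = softmax(linear(pooled, weights, [0.0, 0.0]))
print(pooled, probs)
```

The predicted author is the index with the highest probability.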

(7)

Related works: Deep-learning methods

Convolutional Neural Networks

Example input (tokenized): wait for the video and do n’t rent it

Multi-channel representation → Convolutional layer → Max over-time pooling → Fully-connected layer
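A single-channel, single-filter version of the convolution and max over-time pooling steps can be sketched as follows. The embeddings, the kernel, and the ReLU activation are assumptions made for illustration; the cited CNN models use multiple channels and many filters.

```python
def conv1d(embeddings, kernel):
    """Slide a kernel spanning len(kernel) tokens over the sentence.

    embeddings: one vector per token.
    kernel: one weight vector per position in the window.
    Returns one ReLU activation per window.
    """
    width = len(kernel)
    feature_map = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        activation = sum(
            w * x
            for k_vec, tok_vec in zip(kernel, window)
            for w, x in zip(k_vec, tok_vec)
        )
        feature_map.append(max(0.0, activation))
    return feature_map


def max_over_time(feature_map):
    """Keep the strongest activation, wherever it occurs in the sentence."""
    return max(feature_map)


# Hypothetical 1-dimensional embeddings for a 4-token sentence.
embeddings = [[1.0], [-2.0], [3.0], [0.5]]
kernel = [[1.0], [1.0]]  # window of 2 tokens
fmap = conv1d(embeddings, kernel)
print(fmap, max_over_time(fmap))
```

Because the pooling keeps one value per filter, sentences of any length map to a fixed-size vector for the fully-connected layer.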

(8)

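Architecture

• BertAA: Training data → Pre-trained BERT → BERT fine-tuning → Dense Layer → Class probabilities

• Stylistic features: Training data → Stylistic features → Logistic Regression → Class probabilities

• Hybrid features: Training data → Hybrid features → Logistic Regression → Class probabilities

• BertAA + Style: Logistic Regression on the class probabilities of BertAA and of the stylistic model

• BertAA + Style + Hybrid: Logistic Regression on the class probabilities of all three models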
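One way to read the combination step is as stacking: the class probabilities of each model are concatenated per text and become the input of the final Logistic Regression. The sketch below only builds that input; the function name `stack_probabilities` and the probability values are hypothetical, and the final classifier is not shown.

```python
def stack_probabilities(*model_probs):
    """Concatenate per-model class probabilities for each text.

    Each argument is a list with one probability vector per text.
    """
    n_texts = len(model_probs[0])
    if any(len(p) != n_texts for p in model_probs):
        raise ValueError("every model must score the same texts")
    return [
        [value for probs in model_probs for value in probs[i]]
        for i in range(n_texts)
    ]


# Hypothetical probabilities for 2 texts and 3 candidate authors.
bert_probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
style_probs = [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]
stacked = stack_probabilities(bert_probs, style_probs)
print(stacked[0])
```

Adding the hybrid model is a matter of passing a third list of probabilities, which widens each row by one value per author.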

(9)

External features

Stylistic features:

• Length of text

• Number of words

• Average length of words

• Number of short words

• Proportion of digits and capital letters

• Individual letters and digits frequencies

• Hapax-legomena

• Frequency of 12 punctuation marks

Hybrid features:

• Frequency of the 100 most frequent character-level bi-grams

• Frequency of the 100 most frequent
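A subset of these features can be computed with the Python standard library. This is a sketch under stated assumptions, not the authors' extraction code: the slide does not name the 12 punctuation marks or define a "short" word, so `PUNCTUATION` and `SHORT_WORD_MAX_LEN` below are placeholders, and hapax legomena are counted as words occurring exactly once in the text.

```python
import string
from collections import Counter

PUNCTUATION = ".,;:!?'\"-()/"  # assumption: 12 marks chosen for the sketch
SHORT_WORD_MAX_LEN = 3         # assumption: "short" means at most 3 letters


def stylistic_features(text):
    """Compute a subset of the stylistic features for one text."""
    # Split on whitespace and drop surrounding punctuation from each word.
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]
    n_chars = len(text)
    word_counts = Counter(w.lower() for w in words)
    features = {
        "text_length": n_chars,
        "n_words": len(words),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "n_short_words": sum(len(w) <= SHORT_WORD_MAX_LEN for w in words),
        "prop_digits": sum(c.isdigit() for c in text) / n_chars if n_chars else 0.0,
        "prop_capitals": sum(c.isupper() for c in text) / n_chars if n_chars else 0.0,
        "n_hapax": sum(1 for c in word_counts.values() if c == 1),
    }
    for mark in PUNCTUATION:
        features["punct_" + mark] = text.count(mark)
    return features


def top_char_bigrams(text, k=100):
    """Return the k most frequent character-level bi-grams with counts."""
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return bigrams.most_common(k)


features = stylistic_features("The cat saw the dog. The dog ran!")
print(features["n_words"], features["n_hapax"])
```

In practice the bi-gram vocabulary would be fixed on the training corpus, and each text would then be described by the frequency of those bi-grams.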

(10)

Authorship Attribution Corpora

Dataset    Number of tokens    Number of texts
Enron      ± 200               ± 10’000
IMDb       ± 100               ± 3000
IMDb 62    340                 1000

(11)

Results: How does the performance compare to SOTA?

Accuracy on the Blog Authorship Corpus: +5.3% relative improvement compared to SOTA
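A relative improvement is measured against the baseline score rather than as a difference in percentage points. The helper below shows the arithmetic; its name and the example accuracies are hypothetical and are not taken from the slides.

```python
def relative_improvement(new_score, baseline_score):
    """Relative gain of new_score over baseline_score, in percent."""
    if baseline_score == 0:
        raise ValueError("baseline must be non-zero")
    return (new_score - baseline_score) / baseline_score * 100.0


# Hypothetical accuracies, chosen only to illustrate the formula.
print(relative_improvement(55.0, 50.0))
```

With these made-up numbers, a gain of 5 accuracy points over a baseline of 50 corresponds to a 10% relative improvement.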

(12)

Results: Are external features useful?

F1-Score improvement with external features:

• +2.70% with stylistic features

• +2.73% with hybrid and stylistic features

(13)

Results: Are external features useful?

Wider variety of errors, but fewer errors overall

(14)

Results: What happens with less training data?

• 1000 texts per author

• 341 tokens on average

• Longer and fewer texts

• Performance below CNN

(15)

Results: What happens with more authors?

93% of the accuracy at 5 authors

(16)

Results: How much fine-tuning is needed?

• Accuracy kept improving with more fine-tuning

• 5 epochs is a good trade-off between accuracy and fine-tuning time

(17)

Results: Take-away message

Is the number of texts per author >> 1000?

• No: Feature extraction + Classifier, or CNNs

• Yes: How long are the texts? (Long, 100 tokens, or Twitter style)

Depending on text length, the options are Feature extraction + Classifier or CNNs, a Syntax enriched CNN, or a choice by target metric:

• F1-Score: BertAA + Style (+Hybrid)

• Accuracy: BertAA

(18)

Future works

• Further pre-training of BERT on target domain

• Explore other pre-trained Language Models

• Add new types of features

• Authorship Verification via similarity metrics on the embeddings

• Authorship Attribution on Automatic Speech Recognition transcripts
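The verification idea listed above could, for instance, compare two text embeddings with cosine similarity. The slides do not say which metric or threshold would be used, so the functions, the threshold of 0.8, and the toy vectors below are all assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_author(emb_a, emb_b, threshold=0.8):
    """Hypothetical decision rule: similar embeddings, same author."""
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy 2-dimensional embeddings standing in for model outputs.
print(same_author([1.0, 2.0], [2.0, 4.0]))
print(same_author([1.0, 0.0], [0.0, 1.0]))
```

A real threshold would have to be tuned on held-out pairs of texts with known authorship.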

(19)

Conclusion

BertAA is a BERT fine-tuning approach for AA that:

• works well for a large number of texts

• can be extended with external features to improve F1-score

• sets a new SOTA on the Blog authorship dataset

• provides a first benchmark on the full IMDb corpus

(20)

Contact

Maël Fabien

Ph.D. student at Idiap Research Institute and EPFL

https://github.com/maelfabien

https://www.linkedin.com/in/mael-fabien/

https://maelfabien.github.io/

https://twitter.com/mael2ml

(21)

References


Traditional Methods:

• Lukas Muttenthaler, Gordon Lucas, and Janek Amann. « Authorship Attribution in Fan-Fictional Texts given variable length Character and Word N-Grams »

• Yunita Sari, Mark Stevenson, and Andreas Vlachos. 2018. « Topic or Style? Exploring the Most Useful Features for Authorship Attribution » in Proceedings of the 27th International Conference on Computational Linguistics, pages 343–353, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

• David Madigan, Alexander Genkin, David D. Lewis, and Dmitriy Fradkin. 2005. « Bayesian Multinomial Logistic Regression for Author Identification ». AIP Conference Proceedings, 803(1):509–516. Publisher: American Institute of Physics.

• Andrea Bacciu, Massimo La Morgia, Alessandro Mei, Eugenio Nerio Nemmi, and Julinda Stefa. 2020. « Cross-Domain Authorship Attribution Combining Instance-Based and Profile-Instance-Based Features ». page 14.

Deep Learning Methods:

• Chen Qian, Tianchang He, and Rao Zhang. « Deep Learning based Authorship Identification ». page 9

• Sebastian Ruder, Parsa Ghaffari, and John G. Breslin. 2016. « Character-level and Multi-channel Convolutional Neural Networks for Large-scale Authorship Attribution ». arXiv:1609.06686 [cs].

• Prasha Shrestha, Sebastian Sierra, Fabio Gonzalez, Manuel Montes, Paolo Rosso, and Thamar Solorio. 2017. « Convolutional Neural Networks for Authorship Attribution of Short Texts ». In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 669–674, Valencia, Spain. Association for Computational Linguistics.

