
Connecting the dots between


Academic year: 2021


(1)

Connecting the dots between news

Research Team: Carla Abreu, Jorge Teixeira, Prof. Eugénio Oliveira

Domain: News

(2)

Objective

"… larger and larger amounts of news content is published every day. With this much data, it is often easy to miss the big picture." (Shahaf and Guestrin, 2010)

Objective: Automatically aggregate similar news and build news chains

(Shahaf and Guestrin, 2010): Connecting the Dots Between News Articles

(3)

How to do this?

● Similarity

● Keywords Extraction

● News group

● News group / Keywords

● Arch

(4)

Similarity

Aim:

Clustering similar news

Challenges:

● What news data are important for the similarity process?

● How can we use that data?

● Which methods can we use in this process?

● How can we evaluate this process?

(5)

Similarity

Filter:

Normalization:

● remove punctuation marks;

● remove patterns;

● remove stop-words (snowball);

● word stemming (ptstemmer)

Press review: highlights from "O Jogo". Newspapers of the day

Mourinho says his Brazilians played very well. They tried to embarrass him with the 6-2 thrashing suffered by Portugal.

Press review: highlights from "Jornal de Notícias". Newspapers of the day

Government pressures school boards. Ministry considers evaluating executive councils through the public-sector system.
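The normalization steps listed on this slide can be sketched as a small pipeline. This is only an illustration: the stop-word set, the pattern list and the suffix-stripping `toy_stem` function are hypothetical stand-ins for the Snowball stop-word list and ptstemmer, which are not reproduced here. The input is a headline from this slide in its original Portuguese wording.

```python
import re
import string

# Hypothetical stand-in for the Snowball Portuguese stop-word list.
STOP_WORDS = {"de", "do", "da", "dos", "das", "o", "a", "os", "as", "que", "com", "por"}

# Hypothetical recurring patterns that carry no information about the story.
PATTERNS = [
    re.compile(r"revista\s+de\s+imprensa"),
    re.compile(r"jornais\s+do\s+dia"),
]

# Toy suffix list; a real stemmer such as ptstemmer applies far richer rules.
SUFFIXES = ("mente", "ções", "ção", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, remove punctuation, patterns and stop-words, then stem."""
    text = text.lower()
    # Replace ASCII punctuation and typographic quotes with spaces.
    table = str.maketrans({c: " " for c in string.punctuation + "“”"})
    text = text.translate(table)
    for pattern in PATTERNS:
        text = pattern.sub(" ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [toy_stem(t) for t in tokens]


tokens = normalize('Revista de imprensa: destaques de "O Jogo" Jornais do dia')
print(tokens)
```

With these stand-ins, only the tokens that distinguish one press review from another survive, which is the point of the pattern filter.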

(6)

Similarity

News comparison:

Similarity:

● Title - ST*;

● Teaser (S) - STe*;

● Content - SC*.

Temporal window

● T

* Values between 0 and 1

Title

Teaser

(7)

Similarity

First Approach

Similar Tree (manual threshold assignment; empirical values)

Second Approach

Classification methods (provided by scikit-learn; automatic approach)

● Decision Tree;

● Support Vector Classifier (SVC)

● SVC Linear

● Random Forest
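A minimal sketch of the second approach, assuming each news pair is represented by three similarity scores (title, teaser, content) and a binary label. The training data below is synthetic and purely illustrative, the hyperparameters are scikit-learn defaults rather than the values used in the study, and mapping "SVC Linear" to `SVC(kernel="linear")` is an assumption (it could equally be `LinearSVC`).

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = random.Random(0)


def make_pairs(n, low, high):
    """Synthetic [title, teaser, content] similarity vectors (hypothetical data)."""
    return [[rng.uniform(low, high) for _ in range(3)] for _ in range(n)]


# 1 = similar pair, 0 = non-similar pair.
X = make_pairs(20, 0.7, 1.0) + make_pairs(20, 0.0, 0.3)
y = [1] * 20 + [0] * 20

classifiers = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "SVC": SVC(),
    "SVC Linear": SVC(kernel="linear"),  # assumption, see lead-in
    "RandomForest": RandomForestClassifier(n_estimators=20, random_state=0),
}

predictions = {}
for name, clf in classifiers.items():
    clf.fit(X, y)
    # Classify one clearly similar and one clearly non-similar pair.
    predictions[name] = clf.predict([[0.9, 0.9, 0.9], [0.1, 0.1, 0.1]]).tolist()

print(predictions)
```

Unlike the manually tuned Similar Tree, each classifier here learns its own decision boundary from the annotated comparisons.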

(8)

Similarity

Features

1. Title Similarity

2. Teaser Similarity

3. Content Similarity

Variables:

● S = 0.2

● T = 1

● Algorithm - Levenshtein
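The slides name Levenshtein as the comparison algorithm and state that similarity values lie between 0 and 1, but they do not say how the edit distance is scaled. The sketch below assumes the common choice of dividing by the length of the longer string; that normalization is an assumption, not something the slides confirm.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map the edit distance to [0, 1]; 1 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(similarity("governo pressiona escolas", "governo pressiona as escolas"))
```

The same function could be applied to normalized titles, teasers and contents to obtain the three features.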

(9)

Similarity

Dataset

3 million Portuguese news articles published between 2008 and 2013

Training Set

● Select 100 news articles from each day (between 23 Dec 2012 and 22 Jan 2013)

○ Randomly annotate 371 comparisons

Test Set

1. TS1: Select 501 distinct news articles from 19 Nov 2012 - Randomly annotate 5101 comparisons

2. TS2: Select 210 distinct news articles from 19 Nov 2012 - Randomly annotate 1047 comparisons

(10)

Similarity

(11)

Similarity

Experimental Setup

Precision (P), Recall (R), Accuracy (A), F measure (F)

True Positives (TP): number of similar news correctly identified;

False Positives (FP): number of non similar news identified as similar;

True Negatives (TN): number of non similar news correctly identified;

False Negatives (FN): number of similar news identified as non similar.

$P = \frac{TP}{TP + FP}$

$R = \frac{TP}{TP + FN}$

$A = \frac{TP + TN}{TP + TN + FP + FN}$

$F = 2 \cdot \frac{P \cdot R}{P + R}$
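The four metrics follow directly from the confusion counts. The sketch below uses hypothetical counts (they are not taken from the study) and then recomputes the F measure from a precision and recall pair reported in the results.

```python
def precision(tp, fp):
    """Fraction of pairs labelled similar that really are similar."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of truly similar pairs that were found."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Fraction of all comparisons that were labelled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical confusion counts for illustration only.
tp, fp, tn, fn = 8, 2, 85, 5
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, accuracy(tp, tn, fp, fn), f_measure(p, r))

# Cross-check with the reported SVC precision and recall (0.993 and 0.963).
print(round(f_measure(0.993, 0.963), 3))
```

Because non-similar pairs dominate the comparisons, accuracy stays high even when recall drops, which is why precision, recall and F are reported alongside it.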

(12)

Similarity

Results and Analyses

Method          P      R      A      F
DecisionTree    0.958  0.932  0.985  0.945
SVC             0.993  0.963  0.994  0.978
SVC Linear      0.991  0.963  0.994  0.977
RandomForest    0.987  0.960  0.993  0.974
Gaussian        0.701  0.964  0.956  0.812
Similar Tree    0.999  0.839  0.974  0.912

RandomForest: Random Behaviour

Gaussian: Worst Performance

SVC results are better than Decision Tree results in all metrics

SVCs have similar results

SVC: Better combination of evaluation metrics

(13)
(14)

News Group

News 2014 (3 April to 20 June)

Number of news: 186 366

Cluster number: 23 047

Average amount of news per cluster: ~3.7

10-15 March 2014

Number of news: 16 747

(15)

Keywords extraction

Aim:

Extract relevant terms from text.

Challenges:

● Can any word be considered a keyword?

● Can a news article be described by a simple word, a compound word, or an entity?

● How can we extract useful keywords from the news?

(16)

Keywords extraction

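Approach

● Explicit Keywords

○ Simple (uni-grams)

○ Compound (n-grams)

● Implicit Keywords

● Entities

Examples: Government (Governo), Constitutional Court (Tribunal Constitucional), rebels (rebeldes), search (busca), competition (competição), bomb attack (atentado à bomba), Malaysia Airlines plane (avião da Malaysia Airlines), group stage (fase de grupos), President (Presidente)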

(17)

Keywords extraction

Explicit Keywords

POS Tagger (Pablo Gamallo) [n-grams]

Normalization:

Remove Patterns

Stemmer [uni-grams]

Term frequency - Inverse document frequency (TF-IDF):

o(W, DOC): number of occurrences of word W in document DOC;

npalavras(DOC): number of words in document DOC
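The TF-IDF formula itself did not survive on this slide; only its two symbols did. The sketch below assumes the standard formulation: term frequency o(W, DOC) / npalavras(DOC), multiplied by the logarithm of the corpus size over the number of documents containing the word. The exact variant used in the study may differ, and the tiny corpus is hypothetical.

```python
import math


def tf(word, doc):
    """o(W, DOC) / npalavras(DOC): relative frequency of the word in the document."""
    return doc.count(word) / len(doc)


def idf(word, corpus):
    """log(number of documents / number of documents containing the word)."""
    containing = sum(1 for doc in corpus if word in doc)
    if containing == 0:
        return 0.0
    return math.log(len(corpus) / containing)


def tf_idf(word, doc, corpus):
    return tf(word, doc) * idf(word, corpus)


# Hypothetical corpus of already normalized documents (lists of tokens).
corpus = [
    ["governo", "pressiona", "escolas"],
    ["governo", "avalia", "conselhos"],
    ["mourinho", "elogia", "jogadores"],
]

for word in corpus[0]:
    print(word, round(tf_idf(word, corpus[0], corpus), 4))
```

A word that appears in many documents, such as "governo" here, scores lower than a word specific to one document, so ranking by TF-IDF surfaces the more distinctive candidate keywords.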

(18)

Keywords extraction

Implicit Keywords

Normalization

Relation between words (Ventura and Silva, 2013)

Corr(A, B) is based on Pearson's correlation coefficient; ||D|| is the number of documents in corpus D; di is the i-th document in D; size(di) is its number of words and f(A, di) is the frequency of term A in di. Corr(A, B) ranges from -1 (strong negative correlation) to +1 (strong positive correlation); values near 0 indicate no correlation.

(Ventura and Silva, 2013): Automatic Extraction of Explicit and Implicit Keywords to Build Document Descriptors
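The correlation formula is missing from the slide. A sketch consistent with the symbols defined above, assuming Pearson's correlation is computed over the relative frequencies f(A, di) / size(di) across the ||D|| documents; the four-document corpus is hypothetical.

```python
import math


def relative_freqs(term, docs):
    """f(term, di) / size(di) for every document di in the corpus."""
    return [doc.count(term) / len(doc) for doc in docs]


def cov(pa, pb):
    """Sample covariance of two equally long frequency vectors."""
    n = len(pa)
    mean_a = sum(pa) / n
    mean_b = sum(pb) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(pa, pb)) / (n - 1)


def corr(term_a, term_b, docs):
    """Pearson correlation between the per-document frequencies of two terms."""
    pa = relative_freqs(term_a, docs)
    pb = relative_freqs(term_b, docs)
    denom = math.sqrt(cov(pa, pa)) * math.sqrt(cov(pb, pb))
    if denom == 0:
        return 0.0  # a term with constant frequency carries no signal
    return cov(pa, pb) / denom


# Hypothetical corpus of normalized documents.
docs = [
    ["tribunal", "constitucional", "governo", "lei"],
    ["tribunal", "constitucional", "decisão", "lei"],
    ["mourinho", "jogo", "equipa", "golo"],
    ["jogo", "equipa", "treinador", "golo"],
]

print(corr("tribunal", "constitucional", docs))
print(corr("tribunal", "jogo", docs))
```

Terms that rise and fall together across documents get a score near +1, which is what allows a term absent from a document to be proposed as an implicit keyword for it.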

(19)

Keywords extraction

Entities

Find Entities

The average age of those interviewed was 11 at the start of the study, and three quarters of them were boys

Young people who play video games are more prone to think and act aggressively, according to a study of more than 3,000 students in Singapore released today.

The study, published in the journal of the American Medical Association and based on three years of work with 3,034 young people, concluded, based on the students' answers, that there was a link between frequent video game use and high rates of aggressive behaviour and thoughts.

(20)

Keywords extraction

Dataset

4789 news articles from January to December 2012

Test set:

1. Select one day from each month of 2012

2. Select three hours of each day

3. Extract keywords

4. Select 10 news articles from each day

(21)

Keywords extraction

Experimental Setup

Evaluation

PalavrasChaveRepresentativas (representative keywords): number of words that represent the news article

PalavrasChaveAtribuídas (assigned keywords): number of words attributed to the news article

Results

Keyword type          Evaluation
Explicit - Simple     0.732
Explicit - Compound   0.762
Implicit              ~0
Entity                0.804

(22)

News Group / Keywords

(23)

Arch

Aim:

Connect groups of news

Challenges:

(24)

Arch

Approach (explicit simple keywords, entities and personalities)

Normalization

● lowercase

● explicit simple keywords - reduce words to their stem

Find Personalities

● From entities and explicit compound keywords using Verbetes.

Distance:

|ka|: number of words in news group a;

|kb|: number of words in news group b;
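The distance formula is not preserved on this slide, so the sketch below does not reproduce it. It substitutes the Jaccard distance between the two keyword sets as a plainly named stand-in, and applies only the lowercase step of the normalization (stemming is omitted). The keyword groups are hypothetical.

```python
def normalize_keywords(keywords):
    """Lowercase every keyword and drop duplicates."""
    return {k.lower() for k in keywords}


def jaccard_distance(ka, kb):
    """1 - |ka ∩ kb| / |ka ∪ kb|; a stand-in, not the slide's own distance."""
    ka, kb = normalize_keywords(ka), normalize_keywords(kb)
    if not ka and not kb:
        return 0.0
    return 1.0 - len(ka & kb) / len(ka | kb)


# Hypothetical keyword sets for two news groups.
group_a = ["Governo", "escolas", "ministério"]
group_b = ["governo", "escolas", "avaliação"]

print(jaccard_distance(group_a, group_b))
```

Whatever the exact formula, the idea is the same: two news groups are candidates for a connection when their keyword sets overlap enough, relative to their sizes |ka| and |kb|.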

(25)

Arch

Approach (explicit compound keywords)

Normalization

● lowercase

● remove stop-words

All words have the same weight

Distance:

(26)

Arch

Gold standard

1408 news articles (January 2012)

● 131 groups of news

Training set:

5671 comparisons between groups of news

● 277 connections

● 5394 non connections

Test set:

300 comparisons between groups of news

● 26 connections

(27)

Arch

Experiments

1. 6 experiments: metrics to calculate distance (D1 and D2)

2. 11 experiments: constraints on comparisons

- number of entities

- number of personalities

(28)

Arch

Experimental Setup

Precision (P), Recall (R)

True Positives (TP): number of connections correctly identified;

False Positives (FP): number of non connections identified as connections;

True Negatives (TN): number of non connections correctly identified;

False Negatives (FN): number of connections identified as non connections.

$P = \frac{TP}{TP + FP}$

$R = \frac{TP}{TP + FN}$

(29)

Arch

Results and Analyses

Experiments

1. Metrics:

a. Explicit simple keyword: D1

b. Personalities: D1

c. Entities: D2

2. Constraints:

a. Entities >= 3

b. Explicit simple keyword similarity >= 0.2

Best Result: Gaussian

Precision: 0.941

(30)
(31)

Thanks!

Connecting the dots between news

Carla Abreu (cfma@fe.up.pt)
