
We proposed a novel multi-model architecture based on computer vision and text mining techniques to boost the NER task. This architecture, dubbed HORUS, has shown potential to overcome existing challenges in recognizing entities in noisy data. Therefore, given the nature and novelty of our work, its outcomes are highly relevant for NER on social media.

3.5 Summary

certain objects in images and classify a given textual input for a given text. This architecture led to encouraging results, beating the state of the art on the Ritter dataset. Afterwards, we modified and extended the model to support neural network-based architectures. In a comprehensive study, we observed that features extracted from images are highly relevant, whereas textual features do not necessarily improve the task over regular NER features. However, concatenating the features helps to maximize the performance of the model. We compared several NER algorithms combined with the features extracted by HORUS, showing the benefit of the proposed method. Several gold-standard datasets were tested. HORUS exhibited competitive results, even outperforming the state of the art in some specific configurations. In traditional NER architectures (e.g., CRF), the proposed features notably improved overall model performance and, compared to the state of the art (SOA), achieved higher precision. SOA improved recall, but at the expense of precision. However, when benchmarking the models across different training-test sets, the images and news also proved to be beneficial.

The advantages are summarized as follows:

1. Challenging pre-processing tasks, such as text normalization [80], are bypassed.

2. The proposed approach is language-agnostic.

3. It does not rely on gazetteers, lookups, or normalization, and it does not implement any encoded rules.

4. As a result of our experiments, we released to the community a word-level feature database for NER based on images and text. This database contains approximately 3 million features for more than 72,000 distinct tokens and has been explored in 5,904 experiments with different configurations.

On the downside, we also shed light on existing problems that require further study, as follows:

1. The proposed architecture slows down the classification of named entities, performing considerably slower than existing solutions. This overhead tends to decrease with caching, but it might still be an issue at production scale.

2. Using the Web as a corpus also incurs financial costs.

3. Due to the noise on the Web, performing text classification on generic, unseen web content is a challenge.

4 Web Credibility

This chapter is dedicated to tackling another core challenge of this thesis, i.e., assigning credibility scores to information sources. The content of this chapter is based on the publications [31, 136].

With the growth of the internet, the amount of fake news online has been increasing every year. The consequences of this phenomenon are manifold, ranging from poor decision-making to episodes of bullying and violence. Therefore, fact-checking algorithms have become a valuable asset. To this end, an important step in detecting fake news is to have access to a credibility score for a given information source. However, most of the widely used Web indicators have either been shut down to the public (e.g., Google PageRank) or are not free to use (e.g., Alexa Rank). Other existing databases are short, manually curated lists of online sources, which do not scale. Finally, most of the research on the topic is either theoretical or explores confidential data in a restricted simulation environment.

The results of this chapter provide an answer to the following research question:

RQ2: How to calculate a credibility score for a given information source?

First, in Section 4.1, we present a motivating example illustrating the problem of assigning trustworthiness scores to a given information source. To address research question RQ2, we devise WebCred, the first 100% open-source web-based credibility model. WebCred extracts metadata from a website's source code and computes trustworthiness scores for that website.
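To make the idea of scoring a website from its source code more concrete, the following minimal Python sketch extracts a few metadata cues from raw HTML and combines them into a single number. It is an illustration only and not WebCred's actual feature set or model: the feature names (`has_title`, `has_description`, `has_author`, `external_link_ratio`), the helper functions, and the weights are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class MetadataExtractor(HTMLParser):
    """Collects a few illustrative (hypothetical) metadata cues from HTML."""

    def __init__(self, base_domain):
        super().__init__()
        self.base_domain = base_domain
        self.meta_names = set()
        self.internal_links = 0
        self.external_links = 0
        self.has_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.has_title = True
        elif tag == "meta":
            # Record the name of each <meta> tag, e.g. "description".
            name = attrs.get("name") or attrs.get("property")
            if name:
                self.meta_names.add(name.lower())
        elif tag == "a":
            # Links with a different network location count as external.
            href = attrs.get("href") or ""
            domain = urlparse(href).netloc
            if domain and domain != self.base_domain:
                self.external_links += 1
            elif href:
                self.internal_links += 1


def extract_features(html, base_domain):
    """Turn raw HTML into a small dictionary of numeric features."""
    parser = MetadataExtractor(base_domain)
    parser.feed(html)
    total = parser.internal_links + parser.external_links
    return {
        "has_title": float(parser.has_title),
        "has_description": float("description" in parser.meta_names),
        "has_author": float("author" in parser.meta_names),
        "external_link_ratio": parser.external_links / total if total else 0.0,
    }


def toy_score(features, weights):
    """Weighted sum of features; a stand-in for a trained model."""
    return sum(weights[name] * value for name, value in features.items())
```

In practice, the hand-set weights used with `toy_score` would be replaced by a model trained on labelled websites; the sketch only shows how source-code metadata can be mapped to numeric features and then to a score.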

Next, Section 4.2 describes our approach. Overall, WebCred detects credibility patterns derived from metadata extracted from the source code of websites. Afterwards, Section 4.3 presents a comprehensive evaluation of the WebCred approach and an analysis of the obtained results. The observed results suggest that WebCred generalizes well to unseen websites. Finally, Section 4.6 presents the closing remarks of this chapter. We summarize the contributions of this chapter as follows:

• A novel web credibility framework named WebCred, which implements the concepts behind this methodology and is 100% open-source.

• An empirical evaluation to assess the effectiveness of WebCred for the web credibility task. Experiments are executed over the best-known datasets for the task: Microsoft and 3C Corpus.

• An updated release of the Microsoft Credibility dataset.

4.1 How Credible is a Website

With the enormous daily growth of the Web, the number of fake-news sources has also been increasing considerably [137]. The social network era has provoked a communication revolution that boosted the spread of misinformation, hoaxes, lies and questionable claims. The proliferation of unregulated sources of information allows any person to become an opinion provider with no restrictions. For instance, websites spreading manipulative political content or hoaxes can be persuasive. As introduced in previous sections, different fact-checking tools and frameworks have been proposed to tackle this problem [138]. Yet an important underlying fact-checking step relies upon computing the credibility of sources of information, i.e., indicators that allow answering the question: “How reliable is a given provider of information?”. Due to the obvious importance of the Web and the negative impact that misinformation can cause, methods to demote the importance of websites also become a valuable asset. In this sense, the high number of new websites appearing every day [139] makes straightforward approaches, such as blacklists and whitelists, impractical. Moreover, such approaches are not designed to compute credibility scores for a given website but rather to assign it a binary label. Thus, they aim mostly at detecting “fake” (threatening) websites, e.g., phishing detection, which is out of the scope of this work. Open credibility models are therefore of great importance, especially due to the increase in fake news being propagated.

There is much research into credibility factors. However, it mostly falls into two groups: (1) theoretical research on psychological aspects of credibility and (2) experiments performed over private and confidential user information, mostly from web browser activities (strongly supported by private companies). Therefore, while (1) lacks practical results, (2) reports findings that are of limited appeal to the broad open-source community, given the non-open character of the conducted experiments and data privacy. Finally, recent research on credibility has also pointed out important drawbacks, as follows:

1. Manual (human) annotation of credibility indicators for a set of websites is costly [140].

2. Search engine results pages (SERPs) provide only a few information cues (URL, title and snippet), and the dominant heuristic happens to be the search engine (SE) rank itself [140].

3. Only around 42.67% of the websites are covered by the credibility evaluation knowledge base, where most domains have a low credibility confidence [91].

Therefore, automated credibility models play an important role in the community, although they have not yet been broadly explored in practice. In this paper, we focus on designing computational models