A Technical Report on Approaches for Information Extraction and Sentiment Mining


Project Title

Sentiment Analysis and Opinion Mining of the Arabic Web (Digital Content)

Selected ITAC Program

Advanced Research Project (ARP)

Academic and ICT Industry Partners

Organization Name | Contact Name | Role

American University in Cairo | Ahmed Rafea | Professor at CSE Department

LINK Development | Hanan Abdel Meguid | Chief Executive Officer

AUC Research Team

Name Contact Details Role

Prof. Ahmed Rafea Principal Investigator

Sara Morsy Researcher A

Nada Ayman Researcher A

Islam Elnabarawy Researcher A

May Shalaby Researcher A

Link Development Team

Amira Thabet Researcher A, Team Leader

Ashraf Hamed Researcher and Developer


Table of Contents

Introduction

Related Research Areas and Technical Approaches

Topic Detection and Extraction

Clustering Approaches

Labelling Approaches

Evaluation Approach

Concluding Remarks

Named Entity Recognition

Rule-Based Approach

Machine Learning

Sentiment Analysis and Opinion Mining

The Machine Learning Approach

The Semantic Orientation Approach

Hybrid Approaches

Detecting Influential Bloggers and Opinion Leaders

Statistical Approach

Social Network Analysis Approach

Sentiment Analysis Tools Available in the Market

Open Source Software Programs

Crawlers

Machine Learning Tools for Classification, and Clustering

NLP Tools for Tokenization, Part of Speech Tagging, Parsing, and NER

Methodologies for Building Data Sets

Available Language Resources and Data Sets

Building Data Sets

Conclusion


Introduction

The goal of the project, as described in the project document, is to develop a prototype that can "feel" the pulse of Arabic users with regard to a certain hot topic. This involves:

• Extracting the most popular Arabic entities from online Arabic content

• Extracting user comments related to the entities

• Using extracted popular entities to build semantically-structured concepts

• Building relations between different concepts

• Analyzing concepts to get a sense of the most dominant sentiment, using online user feedback, and thus identifying the general opinion about the topic.

In order to achieve this goal, the following milestones are proposed:

• Technical Report on Approaches for Information Extraction and Sentiment Mining

• A Research Plan including the Proposed Approaches and Results of Preliminary Experiments

• Initial SATA Prototype Requirements Specification and Conceptual Design

• Initial SATA Prototype Implementation

• Prototype Validation Technical Report with Benchmarking Results

• Full documentation of project and final prototype

This report is the first milestone of the project. It is divided into six sections, of which this introduction is the first. The second section is a literature review of technical approaches in the related research areas, namely topic detection and extraction, named entity recognition (NER), sentiment analysis and opinion mining, and the detection of influential bloggers and opinion leaders. The third section is a survey of similar tools available in the market. The fourth section identifies the open source programs that can be used in building the prototype. The fifth section describes the methodologies commonly used for building data sets to train machine learning classifiers for different purposes. The last section concludes this technical report with recommendations that will help us achieve the second milestone of the project and beyond.

Related Research Areas and Technical Approaches

Topic Detection and Extraction

Social networks have become one of the most important sources of news and of people's feedback and opinions about almost every daily topic. Analyzing data from social networks is key to knowing what people think, or intend to do, about certain topics. Topic detection and extraction can be considered the first step in this analysis. Given the massive amount of information on the web from social media such as Twitter, Facebook and blogs, an automatic tool is needed that can determine what people are talking about in certain locations and over a certain period of time. Much research has been done using different approaches, and many studies achieved good performance in topic detection, which is basically grouping (clustering) similar data together to indicate that they are about the same topic. Topic extraction can be considered the next step, whose goal is to label these groups by extracting a topic title for every group of data.

This research domain originated in the 1990s with a project called TDT (Topic Detection and Tracking). The basic idea dates from 1996, when the Defense Advanced Research Projects Agency (DARPA) realized it needed technology to determine the topical structure of news streams without human intervention (Allan et al, 1998). Topic detection is the problem of identifying stories in several continuous news streams that pertain to new or previously unidentified events. It involves detecting the occurrence of a new event such as a plane crash, a murder, a jury trial result, or a political scandal in a stream of news stories from multiple sources. Topic tracking is the process of monitoring a stream of news stories to find those that track (or discuss) the same event as one specified by a user.

Different clustering techniques have been used for this task, and the reported results, some of which are mentioned in this review, vary from one domain to another for some techniques. Clustering plays the key role in this task, since the more accurately the data is clustered, the higher the accuracy achieved in subsequent tasks. Clustering is an unsupervised technique that has no prior information about the data. For that reason, validation metrics must be used to check how accurate the results are. Different metrics are used, and some of them are discussed in this review. Choosing the right clustering technique and the right metric is a challenge in this task.

Clustering Approaches:

Data must be pre-processed before clustering, and different representations of the data have been discussed in the literature. Makkonen (2009) presents the general pre-processing steps:

1. Identify individual words and reduce the typographical variation (tokenization).

2. Remove non-informative words (stop-word removal).

3. Reduce morphological variation (stemming).

4. Compute the term weights (using TFIDF or other models).

5. Build the vector.
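The five pre-processing steps can be sketched in a few lines of Python. This is only a minimal illustration: the stop-word list, the crude suffix-stripping stemmer and the particular TF-IDF variant (relative term frequency times the log of the inverse document frequency) are assumptions chosen for brevity, and English text is used only to keep the example readable.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "and", "is", "are", "was"}


def tokenize(text):
    """Step 1: lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Step 3: crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Steps 1-3: tokenize, drop stop words, stem."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def tfidf_vectors(documents):
    """Steps 4-5: weight terms with TF-IDF and build one sparse vector per document."""
    token_lists = [preprocess(doc) for doc in documents]
    n_docs = len(token_lists)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    "The plane crashed in the mountains",
    "A jury trial ended",
    "The plane landed",
]
vectors = tfidf_vectors(docs)
```

Each vector is a dictionary from stemmed term to weight, so terms that occur in fewer documents (such as "crash" here) receive a higher weight than terms shared by several documents (such as "plane").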

Clustering can be divided generally into hierarchical clustering and partitional clustering (Rui Xu & Wunch, 2009). In the following sections we present the most commonly used techniques related to our research.

Before discussing the various techniques, we need to know what decides that certain data items fall in one cluster while others fall in different ones. This is the role of proximity measures, which are simply measures of similarity between data items; similar items are grouped together into one cluster. Various measures are used. One of the most common, used in most of the surveyed literature, is the cosine similarity. The book by Rui Xu & Wunch (2009) discusses the various clustering techniques in detail.
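For reference, the cosine similarity of two term-weight vectors is their dot product divided by the product of their lengths. The short Python sketch below assumes that documents are represented as sparse dictionaries from term to weight; returning 0 for an empty vector is a convention chosen here, not something prescribed by the literature.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


same_topic = cosine_similarity({"plane": 1.0, "crash": 1.0}, {"plane": 2.0, "crash": 2.0})
unrelated = cosine_similarity({"plane": 1.0}, {"jury": 1.0})
```

Because the measure depends only on the angle between the vectors, two documents with the same term proportions are maximally similar regardless of their length.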

Hierarchical Clustering

Hierarchical clustering either groups similar items bottom-up until a single cluster is reached, which is called agglomerative clustering, or works top-down by dividing the data into groups so as to maximize an objective function, which is called divisive clustering (Young & Sycara, 2004). Both methods result in a tree structure called a dendrogram. The root node represents the whole data set and each node represents a cluster. The dendrogram can be cut at any level to show the relation between clusters at that stage.

Fig. 1 Dendrogram showing both techniques of hierarchical clustering (Rui Xu & Wunch, 2009).

Agglomerative Hierarchical clustering

In this technique each point starts as a cluster of its own. A proximity matrix over the clusters is calculated to determine which pair is to be merged, and this process continues until one cluster is left. The pair of clusters to merge is the one with the minimal distance between them. This distance can be calculated using various methods, the most common being Single Link, Complete Link and Average Link. Fig. 2 shows the algorithm for this technique.
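The merging procedure can be sketched as follows. This is a naive illustration rather than an efficient implementation: it recomputes all cluster distances in every round, uses Euclidean distance between points as an assumed proximity measure, and the function and parameter names are hypothetical.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters (lists of point indices) under a linkage rule."""
    dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(dists)
    if linkage == "complete":
        return max(dists)
    if linkage == "average":
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerative(points, linkage="average", n_clusters=1):
    """Merge the closest pair of clusters until n_clusters remain.

    Returns the final clusters and the merge history, which corresponds
    to the levels of the dendrogram.
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    history = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, history


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, history = agglomerative(points, linkage="single", n_clusters=2)
```

Stopping at `n_clusters` greater than one is equivalent to cutting the dendrogram at the corresponding level.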

(Dai & Sun, 2010) used agglomerative clustering with time decay to identify events in news. The time decay feature helps in clustering stories about the same event. For example, two stories of a plane crash at a specific location may be the same event reported by different sources, or they may describe different events that happened at different times but happen to be similar. It also helps to detect new events, as an event is defined as a newly happened action. In their work they developed an approach to calculate the weights of different features. They calculated the similarity between stories as the cosine similarity multiplied by the time decay factor.
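The combination of cosine similarity with a time decay factor can be illustrated with a short Python sketch. The exponential form of the decay, the `decay_rate` parameter and the story layout (a dictionary with `vector` and `day` keys) are assumptions made for illustration; the formula actually used by Dai & Sun is not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def time_decayed_similarity(story_a, story_b, decay_rate=0.1):
    """Cosine similarity damped by the time gap (in days) between two stories.

    The exponential decay and decay_rate are assumptions, not the published formula.
    """
    gap_days = abs(story_a["day"] - story_b["day"])
    decay = math.exp(-decay_rate * gap_days)
    return cosine_similarity(story_a["vector"], story_b["vector"]) * decay


crash_report = {"vector": {"plane": 1.0, "crash": 1.0}, "day": 0}
same_day_report = {"vector": {"plane": 1.0, "crash": 1.0}, "day": 0}
later_report = {"vector": {"plane": 1.0, "crash": 1.0}, "day": 10}

same_event_score = time_decayed_similarity(crash_report, same_day_report)
later_event_score = time_decayed_similarity(crash_report, later_report)
```

Two textually identical stories published ten days apart score lower than the same pair published on the same day, which is the effect that separates similar but distinct events.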

Dai et al (2010) improved the agglomerative hierarchical clustering algorithm based on the average link method. The improvement is achieved by splitting the original algorithm into two steps. The first step calculates the similarity of each pair of topics and directly combines them if the similarity between them is higher than some threshold, after which the topic model is rebuilt. The second step performs the standard agglomerative hierarchical clustering algorithm. The threshold is determined empirically. They also gave more weight to feature terms occurring in the title of a news story, so that such terms count more when calculating similarity.

(Young-dong et al, 2009) used the hierarchical agglomerative clustering technique to establish a hierarchical topic tree, as the dendrogram represents the same hierarchy as the scheme of a general topic and its sub-topics.

Fig. 2 Flow chart showing the algorithm for agglomerative hierarchical clustering (Rui Xu & Wunch, 2009)

(Huang & Cardenas, 2009) used the hierarchical agglomerative method to group articles into clusters of the same events. Their work aimed at extracting hot events from news feeds.

Although clustering techniques are normally used to group related documents together, some works used clustering for topic extraction as well. (Okamoto & Kikuchi, 2009) used agglomerative clustering for topic extraction from blog entries within a neighborhood.

Divisive Hierarchical clustering

This technique works in the opposite way to the agglomerative one. The data set starts as one single cluster, which is then divided in successive operations until each node represents a cluster that can no longer be divided. Fig. 3 shows the algorithm for this technique using a well-known heuristic approach called DIANA (divisive analysis) (Rui Xu & Wunch, 2009).

Hierarchical clustering has its drawbacks: it lacks robustness and is sensitive to noise (Rui Xu & Wunch, 2009). Once an object is assigned to a cluster it is not considered again, which leaves no room for correcting errors made at the beginning (Young & Sycara, 2004). The computational complexity of hierarchical algorithms is at least O(n^2), which is not suitable for very large data sets.

Fig. 3 DIANA algorithm for divisive hierarchical clustering (Rui Xu & Wunch, 2009).

Partitional clustering

This technique assigns the data to K clusters. It is based on optimizing a certain criterion that defines the homogeneity of the objects in a cluster. The sum of squared error criterion is defined as (Rui Xu & Wunch, 2009):

$$J_s(\Gamma, M) = \sum_{i=1}^{K} \sum_{j=1}^{N} \gamma_{ij} \, \lVert x_j - m_i \rVert^2$$

where

$\Gamma = \{\gamma_{ij}\}$ is a partition matrix, with $\gamma_{ij} = 1$ if $x_j \in$ cluster $i$ and $\gamma_{ij} = 0$ otherwise, and $\sum_{i=1}^{K} \gamma_{ij} = 1 \;\; \forall j$;

$M = [m_1, \ldots, m_K]$ is the cluster prototype or centroid matrix;

$m_i = \frac{1}{N_i} \sum_{j=1}^{N} \gamma_{ij} x_j$ is the sample mean for the $i$th cluster with $N_i$ objects.

The partition that minimizes the sum of squared error criterion is considered optimal and is called the minimum variance partition.


k-means algorithm: It is one of the best known and most widely used clustering algorithms. It minimizes the sum of squared error criterion using an iterative optimization procedure.

The algorithm goes as follows:

1. Initialize a K-partition randomly or based on prior knowledge. Calculate the cluster prototype matrix.

2. Assign each object in the data set to the nearest cluster $C_l$, i.e. the cluster whose prototype is closest to the object.

3. Recalculate the cluster prototype matrix based on the current partition.

4. Repeat steps 2 and 3 until there is no change for each cluster.
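The four steps can be sketched in plain Python as follows. The sketch assumes squared Euclidean distance, random initialization by sampling K data points as prototypes, and a rule that keeps the old prototype when a cluster becomes empty; the function names and the `seed` and `max_iter` parameters are hypothetical.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Plain k-means; returns (assignments, centroids, sum of squared errors)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: random initial prototypes
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest prototype.
        new_assignments = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # step 4: no object changed cluster
        assignments = new_assignments
        # Step 3: recompute each prototype as the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old prototype if a cluster becomes empty
                centroids[i] = mean(members)
    sse = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))
    return assignments, centroids, sse


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignments, centroids, sse = k_means(points, k=2)
```

The returned `sse` is the value of the sum of squared error criterion for the final partition, so runs with different seeds or different values of K can be compared on it.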

The k-means algorithm was used by (Zhang et al, 2009) for topic detection. (Wartena & Brussee, 2008) used the induced bisecting k-means algorithm in their experiment on topic detection by clustering the keywords of documents. The bisecting k-means algorithm basically chooses the two elements with the largest distance between them as seeds for two clusters, and then assigns every other item to the cluster of the nearer seed. They also experimented with agglomerative hierarchical clustering; they did not go into much detail, but in their experiment the k-means algorithms performed better.

(Wang et al, 2010) discussed using incremental clustering for automatic topic detection. They proposed a new topic detection method called TPIC, which adds the aging nature of topics in order to pre-cluster stories. The Bayesian Information Criterion (BIC) is used to estimate the true number of topics. They compared their method to k-means and CMU, and reported that the proposed method achieved high performance.

(Hoogma, 2005) summarized methods and techniques used in TDT including various methods of clustering used.

Labelling Approaches

Knowing that a set of documents are related and belong to a certain cluster is not enough; we also want to know what they are talking about. In this section we discuss how clusters can be labelled, that is, how a word or a phrase that describes the topic title can be extracted.

Key-phrase extraction

This section discusses how keywords or key phrases are extracted from documents. Many models have been used for this process. Generally, key-phrase extraction can be categorized into simple statistics, linguistic and machine learning approaches (Jain & Pareek, 2009).

Witten et al (1999) developed KEA, a tool for key-phrase extraction. It identifies candidate key phrases using lexical methods, calculates feature values for each candidate, and uses a machine learning algorithm to predict which candidates are good key phrases. Fig. 4 shows the steps of the training and extraction phases.


Fig. 4 Training and Extraction steps for KEA (Witten et al, 1999)

Tomokiyo & Hurst (2003) used statistical language models in their work. Their approach is to use pointwise KL-divergence between multiple language models for scoring both phraseness and informativeness, which can be unified into a single score to rank the extracted phrases. Phraseness is about the degree to which a sequence of words can be considered a phrase; this can differ based on user criteria. Informativeness is about how well a phrase conveys what the document is about.
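As a rough illustration of this scoring idea, the Python sketch below computes a pointwise KL contribution, p * log(p / q), and applies it twice: once against the product of the word unigram probabilities (phraseness) and once against a background model (informativeness). The function names, the example probabilities and the use of a plain sum to unify the two scores are assumptions made for illustration, not details taken from the paper.

```python
import math


def pointwise_kl(p, q):
    """Contribution of a single event to KL(p || q): p * log(p / q)."""
    return p * math.log(p / q)


def phrase_score(p_fg_phrase, p_fg_words, p_bg_phrase):
    """Score a candidate phrase from three kinds of probabilities.

    p_fg_phrase: probability of the phrase under the foreground n-gram model
    p_fg_words:  unigram probabilities of its words in the foreground corpus
    p_bg_phrase: probability of the phrase under the background n-gram model
    """
    independent = math.prod(p_fg_words)  # probability if the words were unrelated
    phraseness = pointwise_kl(p_fg_phrase, independent)
    informativeness = pointwise_kl(p_fg_phrase, p_bg_phrase)
    return phraseness, informativeness, phraseness + informativeness


# Hypothetical probabilities for a two-word candidate phrase.
phraseness, informativeness, total = phrase_score(
    p_fg_phrase=0.01, p_fg_words=[0.02, 0.03], p_bg_phrase=0.001
)
```

A phrase scores high when it is much more likely than its words would be independently, and much more likely in the target documents than in general text.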

Jain & Pareek (2009) used part-of-speech tagging, formatting features and the position of words in their work. Their results matched the annotated data closely. They used the F1 score for evaluation.

Wang et al (2008) used semantic information for automatic key phrase extraction. Their method is divided into two stages. The first stage selects candidates: all phrases are extracted from the document, a word sense disambiguation method is used to get the senses of the phrases, and case folding, stemming and semantic relatedness between candidates are used for term conflation. The second is the filtering stage, in which four features are computed for each candidate: TFIDF, the first occurrence of the phrase, the length of the phrase, and a coherence score that measures the semantic relatedness between the phrase and the other candidates. They compared their results to KEA, reported higher performance, and showed that their method is not domain-specific.

Wang et al (2008) developed an automatic online news topic key phrase extraction system; they combined TDT algorithms with aging theory for topic detection and tracking.

(Lopez et al, 2010) worked on the automatic titling of electronic documents using noun phrase extraction. Their method is based on a morphosyntactic study of human-written titles in a corpus of various texts. It proceeds in four stages: corpus acquisition, determination of candidate sentences for titling, noun phrase extraction from the candidate sentences, and finally selection of a particular noun phrase to play the role of the text title. They call this the ChTITRES approach.


El-Beltagy & Rafea (2008) developed a system called KP-Miner, which extracts key phrases from English and Arabic texts. The system has the advantage of being configurable, as the rules and heuristics it adopts are related to the general nature of documents and key phrases.

Statistical approach

The statistical approach is the most widely used approach for labelling clusters, as it depends on the frequency of occurrence of keywords.

Huang & Cardenas (2009) worked on extracting hot events from news feeds. The clusters with more hot terms, or with highly weighted hot terms, are examined for hot terms. Hot terms are mostly topical terms, i.e. they express the topic title.

The study presented by Xie et al (2011) discussed the optimization design of subject indexing. Their work is based on word frequency statistics. They took the length, position and frequency of a word into consideration in its weighting coefficient, considering long words as more specialized and short words as more generic.

The TFPDF algorithm is used to recognize the terms that explain the main topics (Bun & Ishizuka, 2002). It is designed to assign heavy weights to these kinds of terms and thus reveal the main topics.

Cataldi et al (2010) tackled Twitter for extracting emerging topics. Twitter is a popular microblogging service that enables users to send and read short text messages. It was launched in July 2006, and in December 2009 its users were estimated at 75 million worldwide, with around 6.2 million new accounts per month (2-3 per second). 88% of the users in the United States are above 18, which provides a very diverse set of contributions.

A user can easily report what is happening in front of his eyes, without any concern about his readers or his writing style. First, the authors extract the contents (set of terms) of the tweets and model the term life cycle according to a novel aging theory intended to mine the emerging terms. A term is emerging if it occurs frequently in the specified time interval and was relatively rare in the past. To weight the importance of content according to its source, they analyze the social relationships in the network with the well-known PageRank algorithm in order to determine the authority of the users. Finally, a topic graph is constructed that connects the emerging terms with other semantically related keywords, allowing the detection of emerging topics under user-specified time constraints.

Machine learning approach

Classification, in contrast to clustering, is a supervised approach. It is based on training the system to classify input data into a set of predefined classes. Sarmento & Nunes (2009) developed a tool called “verbatim”, available online, that automatically extracts quotes and topics from news feeds. They used classification with support vector machines (SVM) and a Rocchio classifier. The challenges they faced were that topic titles are very generic and that news streams are not consistent in using the same name to describe the same event.

Hybrid approach

Some works used a hybrid approach, with statistical approaches for certain parts and machine learning for others. Zhang et al (2008) used the hybrid approach in the following way. Firstly, the documents of a special domain are automatically obtained by a document classification approach, which integrates rule-based and statistical methods to classify the documents in a large-scale collection. Then, the keywords of each document are extracted using a machine learning approach. The keywords are used to cluster the document subset, and the clustering result is the taxonomy of the subset. Lastly, the taxonomy is modified by manual adjustments into a hierarchical structure for user navigation.

Identifying controversial issues and their sub-topics in news articles was the work presented by Choi et al (2010). They used a natural language processing (NLP) model and the sentiment of words to identify controversial terms, and a statistical classifier to select a subtopic based on collocation information.

Evaluation Approach

To evaluate the work, manually annotated data is needed against which the results can be compared. Well-known evaluation metrics such as precision, recall and accuracy are used. Other criteria are also used for evaluating clustering techniques. They are divided into three categories: internal, external and relative criteria (Rui Xu & Wunch, 2009).

Internal criteria: It evaluates the clustering structure exclusively from the data set without any external information. It can use the proximity matrix of the data set to examine the validity of the cluster structure.

Examples:

• CPCC (Cophenetic correlation coefficient) is used to validate hierarchical clustering structures.

External criteria: It compares the obtained clustering structure to a pre-specified structure. This can be used to test the match between the cluster labels with the category labels based on prior information.

Examples:

• Let P be a predefined partition of a data set X with N data points that is independent of the clustering structure C. To evaluate C, we compare C to P. For a pair of data points x_i and x_j, there are four different cases based on how they are placed in C and P. Fig. 5 shows the different cases.

Case 1: both points belong to the same cluster and the same category.

Case 2: both points belong to the same cluster but different categories.

Case 3: the points belong to different clusters but the same category.

Case 4: the points belong to different clusters and different categories.

Fig. 5 Different cases of how a pair of data points is placed in P and C. The values of a, b, c and d are 2, 3, 7 and 9 respectively (Rui Xu & Wunch, 2009).

The numbers of pairs of points in the four cases are denoted a, b, c and d. The total number of pairs is M = a + b + c + d, which equals N(N-1)/2.

• Rand index: $R = \frac{a + d}{M}$

• Jaccard coefficient: $J = \frac{a}{a + b + c}$

• Fowlkes and Mallows index: $FM = \sqrt{\frac{a}{a + b} \cdot \frac{a}{a + c}}$
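The three external indices follow directly from the pair counts. The Python sketch below counts the four cases from two label lists and then evaluates the indices; the function names are hypothetical, and the final call uses the counts quoted in the caption of Fig. 5.

```python
import math
from itertools import combinations


def pair_counts(cluster_labels, category_labels):
    """Count the pairs of points falling into cases 1 to 4 (a, b, c, d)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_category = category_labels[i] == category_labels[j]
        if same_cluster and same_category:
            a += 1
        elif same_cluster:
            b += 1
        elif same_category:
            c += 1
        else:
            d += 1
    return a, b, c, d


def external_indices(a, b, c, d):
    """Rand, Jaccard and Fowlkes-Mallows indices from the four pair counts."""
    total_pairs = a + b + c + d
    rand = (a + d) / total_pairs
    jaccard = a / (a + b + c)
    fowlkes_mallows = math.sqrt((a / (a + b)) * (a / (a + c)))
    return rand, jaccard, fowlkes_mallows


# Pair counts quoted in the caption of Fig. 5.
rand, jaccard, fowlkes_mallows = external_indices(2, 3, 7, 9)
```

For these counts M = 21, so the Rand index is 11/21, the Jaccard coefficient is 2/12 and the Fowlkes and Mallows index is the square root of 4/45.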

Relative criteria: It compares the clustering structure with other clustering structures obtained by applying different clustering techniques, or the same technique with different parameters, to the data set. It can be used, for example, to compare a set of values of K for the K-means algorithm to find the best fit to the data.

Examples:

• Davies-Bouldin (DB) index: a function of the ratio of the sum of within-cluster scatter to between-cluster separation. It uses both the clusters and their sample means.

$$DB(K) = \frac{1}{K} \sum_{i=1}^{K} R_i$$

where K is the number of clusters and

$$R_i = \max_{j,\, j \neq i} \frac{e_i + e_j}{D_{ij}}$$

where $e_i$ is the average distance of all patterns in cluster i to their cluster center, $e_j$ is the average distance of all patterns in cluster j to their cluster center, and $D_{ij}$ is the distance between the centers of clusters i and j.

Small values of DB correspond to clusters that are compact and whose centers are far away from each other. Consequently, the number of clusters that minimizes DB is taken as the optimal number of clusters.
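A direct transcription of the DB index into Python is sketched below, assuming Euclidean distance and clusters given as lists of coordinate tuples; the function names and the two toy clusterings are hypothetical.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def davies_bouldin(clusters):
    """DB index for a list of at least two clusters of coordinate tuples."""
    centers = [centroid(c) for c in clusters]
    # e_i: average distance of the patterns in cluster i to their center.
    scatter = [
        sum(euclidean(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # R_i: worst ratio of combined scatter to center distance.
        total += max(
            (scatter[i] + scatter[j]) / euclidean(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k


# The same four points grouped in two different ways.
compact = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
stretched = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
db_compact = davies_bouldin(compact)
db_stretched = davies_bouldin(stretched)
```

The grouping with compact, well separated clusters yields the smaller DB value, which is why the index is minimized when choosing K.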

• Dunn index: it identifies clusters that are compact and well separated.

$$Du(K) = \min_{i=1,\ldots,K} \left\{ \min_{j=i+1,\ldots,K} \left( \frac{D(C_i, C_j)}{\max_{l=1,\ldots,K} \mathrm{diam}(C_l)} \right) \right\}$$

A large value of Du(K) suggests the presence of compact and well separated clusters, and the corresponding K provides an estimate of the number of clusters.

$D(C_i, C_j)$ is the distance between two clusters $C_i$ and $C_j$, defined as the minimum distance between a pair of points x and y belonging to $C_i$ and $C_j$ respectively:

$$D(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} D(x, y)$$

The diameter of the cluster $C_i$ is defined as the maximum distance between two of its members x and y:

$$\mathrm{diam}(C_i) = \max_{x, y \in C_i} D(x, y)$$
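The Dunn index can be computed in the same style. The sketch below assumes Euclidean distance for D(x, y) and uses the fact that the largest diameter does not depend on i and j, so the index reduces to the smallest between-cluster distance divided by the largest diameter. The function name and the toy data are hypothetical.

```python
import math
from itertools import combinations


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def dunn_index(clusters):
    """Dunn index for a list of at least two clusters of coordinate tuples.

    Raises ZeroDivisionError if every cluster contains a single point,
    because all diameters are then zero.
    """
    # diam(C_l): largest distance between two members of the same cluster.
    diameters = [
        max((euclidean(x, y) for x, y in combinations(c, 2)), default=0.0)
        for c in clusters
    ]
    # D(C_i, C_j): smallest distance between members of two different clusters.
    min_separation = min(
        min(euclidean(x, y) for x in ci for y in cj)
        for ci, cj in combinations(clusters, 2)
    )
    return min_separation / max(diameters)


# The same four points grouped in two different ways.
compact = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
stretched = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]
dunn_compact = dunn_index(compact)
dunn_stretched = dunn_index(stretched)
```

In contrast to the DB index, larger values are better here: the compact grouping scores higher than the stretched one.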

Concluding Remarks

The topic detection and extraction task can generally be divided into two stages. The first is topic detection, which clusters or groups related documents together to indicate that they are talking about the same topic, or detects whether a new document is related to a certain cluster. The second is topic extraction, which labels the clusters, that is, finds out what they are talking about by extracting a word or phrase that can describe the topic or is a candidate topic title. Many approaches were used for both stages. Different clustering techniques were used, the most common being hierarchical agglomerative clustering and the K-means algorithm. Some researchers improved both algorithms to achieve higher performance, depending on their domain of work or their proposed models. For topic extraction, or labelling the clusters, many approaches are used. Key phrase extraction was the start of this area of research, and it has itself used different approaches. Statistical approaches are based on the frequency of occurrence of words, and different approaches to feature weighting give different results. Machine learning is also used, by training a classifier to assign each document to one of a predefined set of labels. Hybrid approaches are also used in some works, and in many cases they achieved better performance.

All those approaches can be applied to blogs and news sites. Twitter is tackled in some research using the same approaches, but with some modifications due to the nature of Twitter.

Named Entity Recognition

A named entity is a word or phrase that can be classified as the name of a person, a location, or an organization. The task of named entity recognition is the classification of words or phrases in a document as either belonging to one of these categories, or being an “ordinary” word. Named entity recognition has many applications in various natural language processing systems, such as question answering, machine translation, and automatic text summarization, in addition to sentiment analysis. This task is an important part of any such system, thus requiring efficient techniques that produce highly accurate results (Thao et al, 2007).


Two different approaches were found in the surveyed literature for the task of named entity recognition. The first approach, the rule-based approach, depends on the syntactic and semantic structure of sentences. This approach was shown to produce superior results, but requires a linguistics expert and a large number of language rules to work efficiently. The other approach is statistical machine learning, which trains a classifier using a machine learning algorithm and a set of annotated data. This approach is easier to create, although it may produce worse results than the rule-based approach (Thao et al, 2007).

Rule-Based Approach

The rule-based approach relies on hand-crafted rules for identifying and classifying named entities. The literature covering the rule-based approach is somewhat sparse, which can be attributed to the difficulty of creating such rule-based systems, and the need for computational linguistics experts and a large set of rules (Thao et al, 2007). However, some examples of rule-based named entity recognition exist in the literature.

The first example was published by Chiticariu and Krishnamurthy (2010). In this paper, the authors discuss their design and implementation of a rule-based named entity recognition system based on a previous general-purpose algebraic information extraction system. The system was implemented through creating a new language called NERL, which is used to express the rules that the system uses in the process of named entity recognition and extraction. Their implementation achieved better results than state of the art machine learning implementations. However, they concluded that, even with the added facility of the NERL language, building and customizing NER rules was “a labor intensive process”.

Another example of a rule-based named entity recognition approach is the work done by K. Riaz (2010). In this paper, the author discusses the challenges of implementing named entity recognition for the Urdu language. Urdu is very similar to Arabic and Indian languages, all of which share some interesting challenges in the task of named entity recognition, such as the lack of capitalization, the inherent ambiguity of names, and the lack of large annotated corpora. The author highlights the factors that distinguish Urdu from Indian and other eastern languages, and proceeds to define the rule-based Urdu natural language processing algorithm. The results show that this approach performs better than the published machine learning results, especially with the added challenges of the Urdu language.

Machine Learning

The machine learning approach uses statistical machine learning techniques to train a classifier on annotated data sets, and use it to classify named entities. In contrast to the rule-based approach, a lot of recent work was done in machine learning-based named entity recognition. This may be attributed to the fact that it can be used for many different domains and languages, and only requires a corpus of annotated data for training the classifiers. Unlike the rule-based approach, it does not require experts in computational linguistics or a large set of human-made rules. Many different techniques for machine learning-based named entity recognition were found in recent literature, with results that are in close proximity of each other.


The first example is by Hui et al. (2009). In this paper, a model is developed for Chinese named entity recognition based on a Maximum Entropy Model classifier. The maximum entropy model is a statistical model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables. This model is used as the basis of the authors’ implementation framework for Chinese named entity recognition, which produces experimental results with a maximum overall accuracy of 94.21%.

Another example of a machine learning technique for named entity recognition is presented by Todorovic et al. (2008). In this paper, the authors use a Hidden Markov Model for the classification of named entities. The hidden Markov model is a probabilistic generative model of sequences, with a set of unobserved or hidden states. A second, modified implementation of the hidden Markov model is also discussed in this paper; it uses the context of surrounding words to determine the class of the current word, which led to more accurate results. Overall, this approach achieved a maximum accuracy of 94.88%.

The next example of named entity recognition was done on the Arabic language, reported in the paper by Abdul-Hamid and Darwish (2010). In this paper, a named entity recognition system was developed based on the Conditional Random Fields technique. A conditional random field (CRF) is a type of discriminative undirected probabilistic graphical model, similar in purpose to the hidden Markov model mentioned earlier. The authors used CRF sequence labeling trained on features consisting of character n-grams of the leading and trailing letters in words, and word n-grams. Among multiple different runs, the best average accuracy achieved by this approach on Arabic language data was 89%.
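To make the kind of features described above concrete, the Python sketch below builds a feature dictionary for one token from its leading and trailing character n-grams and its neighbouring words. The feature names, the maximum n-gram length and the boundary markers are assumptions made for illustration; they are not the feature set of Abdul-Hamid and Darwish. Dictionaries of this form are a common input format for CRF sequence labeling toolkits.

```python
def char_ngram_features(word, max_n=3):
    """Leading and trailing character n-grams of a word, for n = 1..max_n."""
    features = {}
    for n in range(1, min(max_n, len(word)) + 1):
        features[f"prefix_{n}"] = word[:n]
        features[f"suffix_{n}"] = word[-n:]
    return features


def token_features(tokens, index, max_n=3):
    """Feature dict for tokens[index]: character n-grams plus neighbouring words."""
    features = char_ngram_features(tokens[index], max_n)
    features["word"] = tokens[index]
    features["prev_word"] = tokens[index - 1] if index > 0 else "<s>"
    features["next_word"] = tokens[index + 1] if index + 1 < len(tokens) else "</s>"
    return features


# "Ahmed visited Cairo": the leading letters of the last word are the definite article.
tokens = ["زار", "أحمد", "القاهرة"]
sentence_features = [token_features(tokens, i) for i in range(len(tokens))]
```

Leading and trailing letters are informative in Arabic because prefixes such as the definite article and common name endings are attached to the word itself, which partly compensates for the lack of capitalization.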

One more example of a machine learning technique is the work done by Lin, Peng, and Liu (2006). In this paper, a technique called Support Vector Machines is used for creating the classifier. This technique trains a binary classifier that takes as input a set of features and determines whether these features belong to a certain class or not. The authors developed and trained a set of classifiers that can identify instances of each Chinese named entity category, and produced an average accuracy of 95%.

A large number of examples exist of other machine learning techniques for named entity recognition, mostly combining two or more techniques, or improving one of the former techniques. In Mansouri, Affendy, and Mamat (2008), a fuzzy support vector machine based technique was developed to improve the accuracy of the support vector machine algorithm. In Benajiba, Diab, and Rosso (2008), a combination of the SVM and CRF techniques is used, while in Benajiba, Diab, and Rosso (2009), the Maximum Entropy Model is added to this combination as well. Other examples that combine several classifiers were presented by Chiong and Wei (2006), who combined MEM with HMM; and Florian et al. (2003), who combined four classifiers. Other examples that modified some of the techniques were presented by Yang and Zhou (2010) with a semi-CRF approach; Zhang et al. (2009) with a hybrid HMM approach; Chan and Lam (2007) with a cascaded CRF technique; and Hu and Zhang (2008), who used a two-level CRF classifier. Finally, another example that used Latent Semantic Analysis was presented in Guo et al. (2009).


Other surveyed research attempted to improve the machine learning techniques of named entity recognition. Some examples attempted to evaluate and improve the sensitivity of named entity recognition classifiers to noisy input, such as the work done by Benajiba et al. (2010) and Florian et al. (2010). Other papers evaluated the task of fine-grained named entity recognition, in which the classes are more detailed than the standard coarse-grained person, location, and organization classes. Such work was presented in Ekbal et al. (2010); Fleischman, Hovy, and Rey (2002); and Whitelaw et al. (2008). Other examples evaluate the design challenges of named entity recognition, published by Ratinov and Roth (2009); and the performance across domains, by Guo, Zhang, and Su (2006).

Evaluation Approach

For the evaluation of named entity recognition algorithms, some existing datasets are commonly used for the English language. For example, the shared task of the Conference on Computational Natural Language Learning (CoNLL-2003) provides a benchmark that has been used for measuring and comparing the accuracy of named entity recognition systems. Another English dataset that can be used is the Message Understanding Conference (MUC-3) dataset, which was used in some of the surveyed literature. Other English and Arabic corpora exist, but mostly require manual annotation of the named entities. Additionally, relevant Arabic data can be collected from blogs and news websites using web data collection software. The measures used for algorithm evaluation are the ones commonly used in natural language processing systems, such as accuracy, recall, precision, and the F-measure. Accuracy measures the percentage of named entities that were classified correctly for each class, whether they belong to that class or not. Precision is the fraction of the named entities classified under a certain class that actually belong to that class and were therefore correctly classified. Recall, on the other hand, is the fraction of the named entities actually belonging to a given class that were classified into that class; in other words, how many of the named entities belonging to that class were recalled correctly. The F-measure (also known as the F-score) takes both precision and recall into consideration, and is calculated by the following formula:

$$F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
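As an illustration of how these measures can be computed per class, the following minimal Python sketch derives precision, recall and the F-measure from two parallel lists of gold and predicted entity labels. The function name `per_class_prf`, the label names and the sample data are assumptions made for this example and are not taken from any of the surveyed systems.

```python
from collections import Counter

def per_class_prf(gold, predicted):
    """Return {label: (precision, recall, f_measure)} for parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1      # entity assigned to its true class
        else:
            fp[p] += 1      # class p was predicted, but the entity belongs elsewhere
            fn[g] += 1      # an entity of class g was not recalled
    scores = {}
    for label in set(gold) | set(predicted):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        total = precision + recall
        f_measure = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f_measure)
    return scores

# Hypothetical gold and predicted labels for six named entities.
gold = ["PER", "PER", "LOC", "ORG", "LOC", "PER"]
pred = ["PER", "LOC", "LOC", "ORG", "PER", "PER"]
scores = per_class_prf(gold, pred)
for label in sorted(scores):
    print(label, scores[label])
```

In this toy data, two of the three entities predicted as PER are correct and two of the three true PER entities are found, so precision, recall and the F-measure for PER all equal 2/3.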

Concluding Remarks

As an important part of many different natural language processing applications, named entity recognition is a crucial task where the best possible performance is needed. In the surveyed literature, plenty of machine learning algorithms were found, and only very few rule-based attempts. This is mainly attributed to the difficulty of creating such rule-based systems, as opposed to applying well-known machine learning algorithms. The task of named entity recognition is faced with several challenges, such as the need for a large annotated corpus and portability across domains, although some attempts were made by Z. Zokavera (2006), Whitelaw et al. (2008), and J. Zhu (2009) to devise a way to sidestep the need for gazetteers or annotated corpora. One important area that is still open for future work is the support of different languages by the different named entity recognition classifiers, since the different properties of each language produce different results for different techniques.


Sentiment Analysis and Opinion Mining

Sentiment classification has recently attracted the interest of many researchers, due to the enormous amount of opinionated web content, including blogs, reviews and social network applications. Rather than searching only for facts, on which most search engines have focused, people are now also interested in searching for other people’s opinions on the web. For example, when buying a new product, a person is interested in knowing the opinions of those who have already bought it, and would therefore benefit from an automated tool that crawls the reviews of this product on the web and summarizes them. Thus, given a certain piece of text, the tasks of sentiment classification are: 1) classify it as holding subjective or objective information (known as subjectivity analysis); 2) if it is subjective, determine its polarity as positive or negative; and 3) determine its semantic intensity as strong or weak. Researchers may focus on one or more of these tasks, and they can also focus on one or more types of text, whether a phrase, a sentence or a document. For a more comprehensive review of sentiment classification, see (Pang & Lee, 2008; Liu, 2010).

Sentiment classification techniques can be mainly divided into two approaches: machine learning (ML) and semantic orientation (SO). Machine learning algorithms have succeeded in classic text categorization, and so they were applied to sentiment classification, with “positive” and “negative” as the target classes. These algorithms are discussed in detail in (Sebastiani, 2002; Witten & Frank, 2005). Supervised ML algorithms, such as Support Vector Machines (SVM), Naive Bayes (NB) and Maximum Entropy (ME), have been used extensively in sentiment analysis research (Liu, 2010; Pang & Lee, 2008). In supervised ML, a piece of text is converted into a feature vector. The classifier learns from a set of data labeled with its class (called training data) that a combination of specific features yields a specific class, and the accuracy of what it has learned is then tested on another set of data (called testing data). On the other hand, in the SO approach, a sentiment lexicon is built either manually, semi-automatically or automatically, with each word assigned a semantic orientation expressed as a number that indicates whether the word is positive or negative as well as its intensity. Then, this lexicon is used to extract all sentiment words from the document and sum up their polarities to determine whether the document holds an overall positive or negative sentiment, as well as its intensity.
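The lexicon-based summation described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the lexicon entries and their numeric values are hypothetical, tokenization is a plain whitespace split, and no negation or intensifier handling is attempted.

```python
# Hypothetical toy lexicon: the sign gives the polarity, the magnitude the intensity.
LEXICON = {"excellent": 3, "good": 1, "poor": -2, "terrible": -3}

def so_score(text, lexicon=LEXICON):
    """Sum the semantic orientation of every lexicon word found in the text."""
    tokens = text.lower().split()
    return sum(lexicon.get(token, 0) for token in tokens)

def so_label(text, lexicon=LEXICON):
    """Map the summed score to an overall document polarity."""
    score = so_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(so_label("excellent plot but poor acting"))  # 3 + (-2) = 1, so "positive"
```

The magnitude of the summed score can also be read as a rough indication of the intensity of the overall sentiment.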

The Machine Learning Approach

In the ML approach, each document is represented as a feature vector with representative features for the target class. Various feature sets have been tried specifically for sentiment classification, as discussed in (Pang & Lee, 2008). Some of the significant features that are related to our work are:

1. N-grams:

N-grams are common features to use in text classification. They are sequences of neighboring words that occur frequently in the corpus. They vary from uni-grams (one word only, such as director) and bi-grams (two neighboring words, such as two stars) to tri-grams (such as science fiction film). N-grams can be useful for capturing the sentiment of a document. For example, the director of a movie could have done outstanding work, so the occurrence of the word director in a review most probably indicates a positive sentiment. Many researchers have used n-grams, especially uni-grams, since they result in a high accuracy, and have added other features as improvements to their systems (Pang et al., 2002; Whitelaw et al., 2005; Ng et al., 2006; Abbasi & Chen, 2005; Abbasi et al., 2008; Dang et al., 2010). Some researchers use uni-grams and bi-grams (Dang et al., 2010), some use bi-grams only (Xia & Zong, 2010), while others use uni-, bi- and tri-grams (Abbasi et al., 2008; Dang et al., 2010). Several ML algorithms have been compared, and SVM outperformed the other algorithms (Pang et al., 2002). Hence, it has been used as a common algorithm for sentiment classification, except for a few papers that have used NB or ME or a combination of two or more of them (Xia & Zong, 2010; Xia et al., 2010).

2. Part-Of-Speech tag n-grams:

Part-Of-Speech (POS) tag n-grams have been shown to be good indicators of sentiment (Gamon, 2004; Fei et al., 2004; Abbasi et al., 2008). For instance, the authors in (Fei et al., 2004) experimented with the effect of combinations of positive and negative nouns, verbs, adverbs and adjectives, and have shown that the appearance of a positive adjective followed by a noun is more frequent in positive documents than in negative ones, and that the appearance of a negative adjective followed by a noun is more frequent in negative documents.

3. Stylistic features:

These include lexical and structural attributes as well as punctuation marks and function words, as explained in (Abbasi & Chen, 2005; Zheng et al., 2006; Abbasi et al., 2008). The lexical features include character- or word-based statistical measures of word variation. Examples of the character-based lexical features are: 1) Total number of characters; 2) Characters per sentence; 3) Characters per word; and 4) The usage frequency of individual letters. Some examples of the word-based lexical features are: 1) Total number of words; 2) Words per sentence; 3) Word length distribution; and 4) Vocabulary richness measures, such as the number of words that occur only once in the document (also called hapax legomena) and the number of words that occur twice in the document (also called hapax dislegomena). The structural features include text organization and layout, for example, signatures, number of paragraphs, average paragraph length, total number of sentences per paragraph and others.

These features were used along with other features in sentiment classification research (Abbasi & Chen, 2005; Abbasi et al., 2008; Dang et al., 2010; Abbasi et al., 2011) and they improved the system’s accuracy when added (Abbasi et al., 2008).


4. Negation and modification features:

These are two of the important categories of the contextual valence shifters. Early work in sentiment classification did not investigate the effect of negation or modifiers (Pang et al., 2002), as the authors only used Bag-Of-Words (BOW) and higher-order n-grams. As a result, two sentences like “I like this movie” and “I don’t like this movie” would be similar to each other since both contain the verb “like”, although the first one holds a positive sentiment while the second holds a negative sentiment (Pang & Lee, 2008). Modifiers also play a crucial role when classifying sentiment since they can increase or decrease the semantic intensity of polar terms or even shift them towards the positive or negative sentiment. Therefore, later research focused on how to detect and extract negation and modifiers and add them as features (Wilson et al., 2005; Sokolova & Lapalme, 2008; Jia et al., 2009; Sokolova & Lapalme, 2009; Carrillo de Albornoz et al., 2010). For instance, the authors in (Wilson et al., 2005), focusing on sentence-level sentiment classification, added a binary feature for each word token to indicate whether it was negated or not, in addition to other features indicating whether it was modified by an adjective or adverb intensifier or by one or more of the contextual valence shifters that they extracted: general, negative and positive polarity shifters. The general polarity shifter reverses the polarity of the word modified by it, e.g. little truth. The negative polarity shifter yields a negative sentiment for the overall expression, e.g. lack of wisdom. The positive polarity shifter changes the overall sentiment of the expression to positive; for example, abate the damage (Wilson et al., 2005). In addition, the authors in (Jia et al., 2009) used a dependency parser to extract typed dependencies and developed special rules to determine the scope of each negation term. Furthermore, the authors in (Carrillo de Albornoz et al., 2010) used typed dependencies together with negation and quantifier lists to extract negated, amplified and down-toned sentiment words and build a sentence-level polarity and intensity sentiment classifier.

5. Dependency relations:

Some researchers have explored the effect of dependency relations in sentiment classification, since they can hold more information than neighboring words (such as: higher-order n-grams). Dependency relations are typically extracted using a dependency parser. For example, the authors in (Wilson et al., 2005) used a dependency parser to extract modifiers and other dependency relations (e.g. words connected by a conjunction, like and). Also, the authors in (Xia & Zong, 2010) extracted bi-grams and all dependency relations as features besides uni-grams to improve the system’s performance. However, it was shown in (Ng et al., 2006) that adding dependency relations with focus on adjectives preceded by nouns and nouns that have a dependency relation with polar terms to n-grams did not improve the performance.
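To make the n-gram and negation features from the list above more concrete, the following Python sketch extracts uni-grams and bi-grams and marks the words that follow a negation term. It is a minimal illustration under stated assumptions: the negation list, the regular-expression tokenizer and the `NOT_` prefix (applied up to the next punctuation mark) are simple heuristics chosen for this example, and do not reproduce the dependency-based scope detection of the papers cited above.

```python
import re

# Assumed negation list and punctuation set, for illustration only.
NEGATIONS = {"not", "no", "never", "don't", "didn't", "isn't"}
PUNCTUATION = set(".,!?;")

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"[a-z']+|[.,!?;]", text.lower())

def mark_negation(tokens):
    """Prefix every word after a negation term with NOT_ until the next punctuation."""
    marked, negated = [], False
    for token in tokens:
        if token in PUNCTUATION:
            negated = False          # the negation scope ends at punctuation
            continue
        if token in NEGATIONS:
            negated = True
            marked.append(token)
            continue
        marked.append("NOT_" + token if negated else token)
    return marked

def ngrams(tokens, n):
    """Return all sequences of n neighboring tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

positive = mark_negation(tokenize("I like this movie."))
negative = mark_negation(tokenize("I don't like this movie."))
print(positive + ngrams(positive, 2))
print(negative + ngrams(negative, 2))
```

With this marking, the two example sentences no longer share the feature like: the second sentence contributes NOT_like instead, so a classifier can treat the two occurrences differently.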

Researchers have tried using different ML algorithms to compare their performance in the sentiment classification task. It was shown in (Pang et al., 2002) that SVM outperforms both NB and ME when using unigrams, bigrams as well as other features; and hence, most research has used SVM and it became the default algorithm in sentiment classification.

However, the authors in (Xia & Zong, 2010) showed that uni-grams perform better with SVM, whereas higher-order n-grams and dependency relations perform better with NB, due to the nature of the algorithms themselves. They explained the difference between the performance of the SVM and NB classifiers as follows: uni-grams tend to be more relevant and less independent of one another than higher-order n-grams. The discriminative model, SVM, can capture the complexity of the relevant features, whereas the generative model, NB, performs well on bi-grams and dependency relations since the feature independence assumption holds well for them (Xia & Zong, 2010).
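The feature independence assumption mentioned above is visible directly in how a Naive Bayes classifier scores a document: the log-probability of each feature given the class is simply added up. The sketch below is a minimal multinomial Naive Bayes with add-one smoothing, written only for illustration; the class name `NaiveBayes`, the training sentences and the labels are assumptions of this example, not the setup used in the cited experiments.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for features, label in zip(docs, labels):
            self.feature_counts[label].update(features)
        self.vocab = {f for counts in self.feature_counts.values() for f in counts}
        return self

    def predict(self, features):
        total_docs = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denominator = sum(counts.values()) + len(self.vocab)
            logp = math.log(n_docs / total_docs)          # class prior
            for f in features:
                if f in self.vocab:                       # unseen features are ignored
                    # Independence assumption: each feature contributes separately.
                    logp += math.log((counts[f] + 1) / denominator)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

# Hypothetical training data: uni-gram features with polarity labels.
train_docs = [
    "great acting great plot".split(),
    "wonderful film".split(),
    "boring plot".split(),
    "terrible acting boring film".split(),
]
train_labels = ["pos", "pos", "neg", "neg"]
model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("great film".split()))
```

The same class works unchanged with bi-grams or dependency relations as features, since it only requires each document to be a list of feature strings.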

Another problem in machine learning, besides choosing the right features, is feature selection. Since redundant or irrelevant features can make the feature vector unnecessarily large, feature selection is performed to reduce the dimensionality of the feature vector so that only the features that are representative of the target class remain. By selecting the most relevant features for the target concept, the classifier learns better which class label to give to each vector. In addition, feature selection reduces the running time required by the classifier on the training data, which is crucial for real-world applications (Dash & Liu, 1997). Several researchers have taken feature selection methods that were used for classic text classification and applied them to sentiment classification (Abbasi et al., 2008; Dang et al., 2010). Other researchers have developed new feature selection algorithms specifically for sentiment classification (Wang et al., 2009; Nicholls & Song, 2010; Abbasi et al., 2011). However, most research uses the Information Gain (IG) heuristic as the feature selection method due to its reported effectiveness (Abbasi et al., 2008; Dang et al., 2010; He, 2010).
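As a rough illustration of the Information Gain heuristic, the sketch below scores each feature by how much knowing its presence or absence reduces the entropy of the class labels, and keeps the highest-scoring features. It assumes binary presence/absence features and documents represented as sets of features; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(docs, labels, feature):
    """Entropy reduction obtained by splitting the documents on one binary feature."""
    with_feature = [lab for doc, lab in zip(docs, labels) if feature in doc]
    without_feature = [lab for doc, lab in zip(docs, labels) if feature not in doc]
    remainder = sum(
        len(part) / len(labels) * entropy(part)
        for part in (with_feature, without_feature)
        if part
    )
    return entropy(labels) - remainder

def select_top_features(docs, labels, k):
    """Keep the k features with the highest information gain (ties broken by name)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda f: (-information_gain(docs, labels, f), f))
    return ranked[:k]

# Hypothetical documents, each represented as a set of uni-gram features.
docs = [{"great", "plot"}, {"great", "film"}, {"boring", "plot"}, {"boring", "film"}]
labels = ["pos", "pos", "neg", "neg"]
selected = select_top_features(docs, labels, k=2)
print(selected)
```

In this toy data the words great and boring separate the classes perfectly (a gain of one bit each), while plot and film occur equally often in both classes and therefore have zero gain.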

The Semantic Orientation Approach

In the SO approach, a document’s polarity is calculated as the sum of the polarities of its polar terms and/or expressions. Early work in this direction calculated a phrase’s polarity as the difference between the Pointwise Mutual Information (PMI) between the phrase and the word “excellent”, as representative of the positive class, and the PMI between the phrase and the word “poor”, as representative of the negative class, with hit counts obtained from the AltaVista search engine using the NEAR operator (Turney, 2002). Some researchers built a sentiment lexicon manually, such as the General Inquirer (GI) and SO-CALculator (SO-CAL) (Brooke, 2009; Taboada et al., 2010). Others used semi-supervised and supervised techniques to build a lexicon either from a small seed list of positive and negative words or from an already-existing online dictionary, like the Subjectivity dictionary and SentiWordNet (Wilson et al., 2005; Esuli & Sebastiani, 2006; Baccianella et al., 2010). The Subjectivity dictionary contains a list of about 8,000 words labeled with their corresponding prior polarity (positive, negative, neutral or both), type (strong (or weak) subjective (or objective)) and part of speech. SentiWordNet contains a list of all words from WordNet (Miller, 1995), with each Part-Of-Speech having 3 types (objective, positive and negative) and each type having a score describing its intensity (from 0 to 1). The GI and SentiWordNet deal with individual words only as holding positive or negative sentiment. On the other hand, the Subjectivity dictionary is accompanied by two lists of intensifiers and contextual valence shifters to extract the contextual polarity instead of the prior polarity of the individual words, but this dictionary was built for sentence-level sentiment classification using a supervised approach (Wilson et al., 2005).
Also, SO-CAL tries to capture the contextual polarity of the SO-carrying words in the document by creating separate lists of intensifiers (divided into amplifiers and down-toners), negation expressions and irrealis markers (words whose appearance causes the SO of polar words in the sentence to be ignored, such as modals and conditional markers), besides its lists of SO-carrying nouns, verbs, adverbs and adjectives (Taboada et al., 2010).
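The PMI-based orientation from the early work described above can be written out as a short worked example. Since PMI is the base-2 logarithm of a joint probability divided by the product of the individual probabilities, the difference of the two PMI values reduces to a single logarithm of hit-count ratios. The sketch below assumes the four hit counts have already been obtained from a search engine; the counts used here are invented, and the small smoothing constant that guards against division by zero is an assumption of this sketch.

```python
import math

def so_pmi(near_excellent, near_poor, hits_excellent, hits_poor, smoothing=0.01):
    """Semantic orientation as PMI(phrase, "excellent") - PMI(phrase, "poor").

    near_excellent / near_poor: hits for the phrase NEAR each reference word.
    hits_excellent / hits_poor: hits for each reference word on its own.
    """
    return math.log2(
        ((near_excellent + smoothing) * hits_poor)
        / ((near_poor + smoothing) * hits_excellent)
    )

# Hypothetical hit counts for one phrase.
score = so_pmi(near_excellent=80, near_poor=20, hits_excellent=1000, hits_poor=1000)
print("positive" if score > 0 else "negative", round(score, 2))
```

A phrase that co-occurs more often with “excellent” than with “poor”, relative to how common the two reference words are, receives a positive orientation, and the reverse yields a negative one.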

Hybrid Approaches

There are some significant differences between the machine learning and semantic orientation approaches. First of all, the ML approach performs better than the SO approach on a single domain; the highest accuracy achieved on the Polarity dataset version 2.0 (Pang & Lee, 2004) using a classifier was 91.7% (Abbasi et al., 2008) versus an accuracy of 76.37% using SO-CAL (Taboada et al., 2010). Secondly, the classifier can learn domain-specific polarity, e.g. “long battery life” (+ve) vs “long time to focus” (-ve), unlike the lexicon, where only a prior polarity is known for each word. However, to build a high-accuracy classifier, we need a huge corpus labeled with its class (positive or negative), whereas a dictionary does not need one. Building such a corpus can be a labor-intensive task, since data must be collected from different genres (reviews, news articles, blogs, etc.) and domains (movies, products, politics, etc.) and labeled manually. In addition, the authors in (Brooke, 2009; Taboada et al., 2010) argued that an ML approach does not take linguistic context, such as negation and intensification, into account, since there can be three features “good”, “very good” and “not good” but the classifier does not know that they are related to each other.

Since both the machine learning and semantic orientation approaches have advantages and disadvantages, some researchers have attempted to combine them to benefit from the advantages of each approach (Andreevskaia & Bergler, 2008; Qiu et al., 2009; Read & Carroll, 2009; He, 2010; Dang et al., 2010). For example, the authors in (Dang et al., 2010) built a classifier with unigrams and bigrams with a threshold of 5 and stylistic features, and they developed a feature-calculation strategy to extract sentiment features (verbs, adjectives and adverbs) from SentiWordNet 3.0. Similarly, the authors in (Andreevskaia & Bergler, 2008) tried to develop a high-accuracy domain-independent system at the sentence level by building an ensemble model of two classifiers: one with unigrams, bigrams and trigrams as features and trained on in-domain data, and the other with sentiment words from both WordNet and the General Inquirer as features. The authors in (Qiu et al., 2009) then improved this system to be self-supervised so that it does not need labeled data. They developed a two-phase model: in the first phase, a sentiment lexicon and a negation list are used to label the data; in the second phase, the high-confidence labeled documents are used as training data for a classifier with sentiment words as the feature set. Another technique to solve the problem of manually labeling the data was shown in (He, 2010), where an initial classifier was trained with sentiment features from the Subjectivity lexicon. Then, from the high-confidence labeled data, the most indicative features for each class were extracted (using the Information Gain heuristic) as self-learned features to train another classifier, instead of training it with self-labeled data. However, none of these models took contextual valence shifters into account.

Evaluation Approach

Like any other text classification problem, sentiment classification research has mainly used three measures of classification effectiveness; namely, the accuracy, precision and recall, which are explained in detail in (Sebastiani, 2002). We will be using the same three measures in our work to compare our system’s performance:

1. Accuracy:

The accuracy of a classifier is defined as the percent of correctly classified objects. It is calculated as:

$$A = \frac{TP + TN}{TP + TN + FP + FN}$$

where:

TP denotes the number of positively-labeled test documents that were correctly classified as positive;

TN denotes the number of negatively-labeled test documents that were correctly classified as negative;

FP denotes the number of negatively-labeled test documents that were incorrectly classified as positive; and


FN denotes the number of positively-labeled test documents that were incorrectly classified as negative.

2. Precision:

The precision for a class is defined as the probability that if a random document is classified with this class, then this is the correct decision. Precision for the positive class, for instance, is calculated as follows:

$$P = \frac{TP}{TP + FP}$$

3. Recall:

The recall for a class is defined as the probability that if a random document should be classified with this class, then this is the decision taken. Recall for the positive class, for instance, is calculated as follows:

$$R = \frac{TP}{TP + FN}$$

Concluding Remarks

Document-level sentiment classification is an emerging research field in NLP and machine learning. Research has been mainly divided into two main approaches: machine learning and semantic orientation. In the machine learning approach, researchers have been trying different feature sets to improve the classifier’s performance. They have tried the same features used in classic text classification, such as n-grams, and these have yielded good results. Some have also tried extracting words with semantic orientation; for example, the authors in (Dang et al., 2010) have implemented a sentiment feature extraction method based on SentiWordNet 3.0. However, the method proposed in that paper does not adequately capture the true polarity of the sentiment words, as the authors in (Polanyi & Zaenen, 2006) have noted that there are different categories of so-called contextual valence shifters that can change or neutralize the polarity and/or intensity of sentiment words.

Detecting Influential Bloggers and Opinion Leaders

Influence has long been studied in many fields, and despite the large number of theories of influence in sociology, there is no tangible way to measure such a force, nor is there a concrete definition of what influence means. The Merriam-Webster dictionary defines influence as “the power or capacity of causing an effect in indirect or intangible ways”. The notion of influence plays a vital role in how businesses operate, how a society functions and how people vote (Cha et al., 2010). Online social networks are becoming increasingly interactive and networked, and are thus playing an ever more important role in shaping the behaviour of users on the web. Social media is dynamic and growing, including collections of blogs, wikis, forums, and photo and video sharing sites, in addition to popular social networking sites, such as Facebook and Twitter. This new self-awareness of the information society has led more and more users to connect online to social networks in order to exchange opinions, leading to the formation of communities around topics like politics, technology, arts, knitting, photography and public relations. They interact with each other and influence each other’s opinions.


Influential members in a social network can be responsible for starting a buzz or getting the community to notice a new trend, product, or even adopt an opinion. The assumption is that actions performed by a user can be seen by their network friends. Users seeing their friends’ actions are sometimes tempted to perform those actions. We are interested in the problem of studying the propagation of such influence, and on this basis, identifying which users are leaders when it comes to setting the trend for performing various actions (Goyal et al., 2008). For companies, organizations and governments, it is of great importance to learn about opinions in order to assess chances and risks. Studying influence patterns can help in better understanding why certain trends and innovations are adopted faster than others, and how advertisers, marketers and public figures can design more effective campaigns (Cha et al., 2010).

A manual analysis is only possible on a very limited scale. An automated computer supported analysis is necessary given the large number of virtual communities with huge amounts of postings (Bodendorf & Kaiser, 2009).

Statistical Approach

On Twitter, an extremely popular online micro-blogging service, each user submits periodic status updates, known as tweets, that consist of short messages limited in size to 140 characters. Users interact by following the updates of people who post interesting tweets. Users can pass along interesting pieces of information to their followers; this act is known as retweeting, and is useful for propagating interesting posts and links through the Twitter community. Also, users can respond to other people’s tweets or comment on them, which is called mentioning. Cha et al. (2010) assumed that these three activities may represent the different types of influence of a person. The number of followers a user has directly indicates the size of his or her audience, which represents the Indegree Influence. The number of retweets containing one’s name indicates the ability of that user to generate content with pass-along value, which represents the Retweet Influence. The number of mentions containing one’s name indicates the ability of that user to engage others in a conversation, representing the Mention Influence (Cha et al., 2010). Once every user is assigned a rank for each of the influence measures (Indegree, Retweet and Mention Influence), Cha et al. (2010) use Spearman’s rank correlation coefficient

$$\rho = 1 - \frac{6 \sum (x_i - y_i)^2}{N^3 - N}$$

a non-parametric measure of statistical dependence between two variables, to quantify how a user’s rank varies across different measures, and to measure the association between two rank sets. Here $x_i$ and $y_i$ are the ranks of user $i$ under the two influence measures being compared, and $N$ is the number of users. This coefficient assesses how well an arbitrary monotonic function could describe the relationship between two variables without making any assumptions about the particular nature of the relationship between them. The closer $\rho$ is to +1 or −1, the stronger the likely correlation. A perfect positive correlation is +1 and a perfect negative correlation is −1.
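The rank comparison can be illustrated with a short Python sketch that ranks users under two influence measures and applies the Spearman formula. It is a minimal example under stated assumptions: the follower and retweet counts are invented, the helper names are hypothetical, and ties between users are assumed not to occur (tied ranks would need averaging, which is omitted here).

```python
def ranks(values):
    """Rank users so that the largest value gets rank 1 (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(x_ranks, y_ranks):
    """Spearman's rank correlation: 1 - 6 * sum((x_i - y_i)^2) / (N^3 - N)."""
    n = len(x_ranks)
    squared_diffs = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1 - 6 * squared_diffs / (n ** 3 - n)

# Hypothetical counts for four users under two influence measures.
followers = [1000, 500, 200, 50]   # Indegree Influence
retweets = [10, 40, 30, 5]         # Retweet Influence

rho = spearman_rho(ranks(followers), ranks(retweets))
print(round(rho, 2))
```

For these invented counts the ranks are (1, 2, 3, 4) and (3, 1, 2, 4), the squared rank differences sum to 6, and the coefficient is 1 − 36/60 = 0.4, indicating only a moderate agreement between the two measures.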

Besides the number of followers, retweets and mentions, there are other factors that could be used to determine influence and content propagation in a network, applicable to different types of social media, such as blogs and discussion forums. Agarwal et al. (2008) set forth an initial set of intuitive properties, or factors, which can be approximated by collectable statistics.


• Recognition – An influential post is recognized by many. This can be equated to the case where an influential post is referenced in many other posts, i.e. the number of inlinks is large.

• Activity Generation - A post’s capacity of generating activity can be indirectly measured by how many comments it receives and the amount of discussion it initiates.

• Novelty – Novel ideas exert more influence.

Agarwal et al. (2008) suggested that a large number of outlinks may indicate that a post refers to many other posts, and that it is therefore less likely to be novel. Also, the number of outlinks was found to be negatively correlated with the number of comments, which means that more outlinks reduce people’s attention.

• Eloquence – An influential post is often eloquent, which is a difficult property to approximate using statistics. Although there are many measures that quantify the goodness of a post, such as fluency, rhetoric skills, vocabulary usage, and content analysis, Agarwal et al. (2008) used a simpler measure to examine its effect in determining influence. They used the length of the blog post as a heuristic measure of goodness in the context of blogging. Their reasoning is that given the informal nature of social media, there is no incentive for a lengthy piece that bores the readers. Hence, a long post often suggests a necessity for doing so. Agarwal et al. (2008) also found that the length of a post is positively correlated with the number of comments, which means that longer posts attract people’s attention.

There are certainly other potential properties possessed by an influential post which can be taken advantage of. It is also evident that each of the properties may not be sufficient on its own; they should be used jointly.

An intuitive way of defining an influential blogger is to check whether the blogger has any influential posts. Agarwal et al. (2008) presented a preliminary model that attempts to quantify an influential blogger and to distinguish between the “influential” and “activeness” properties of a blogger. They assume that each post $p_i$ has an influence score $I(p_i)$. For a blogger $b_k$ who has $N$ blog posts, $\{p_1, p_2, \ldots, p_N\}$, the influence scores can be ranked in descending order, and the influence index of the blogger, $iIndex(b_k)$, can be defined as $\max(I(p_i))$, where $1 \le i \le N$. Given a set $U$ of $M$ bloggers, $\{b_1, b_2, \ldots, b_M\}$, the problem of identifying influential bloggers is defined as determining an ordered subset $V$ of $K$ bloggers ($K$ is a user-specified parameter), $\{b_{i_1}, b_{i_2}, \ldots, b_{i_K}\}$, ordered according to their $iIndex$ such that $V \subseteq U$ and $K \le M$, i.e.

$iIndex(b_{i_1}) \ge iIndex(b_{i_2}) \ge \ldots \ge iIndex(b_{i_K})$

The set $V$ contains the $K$ most influential bloggers. For all the blog posts $\{p_1, p_2, \ldots, p_L\}$ by all $M$ bloggers, influential blog posts are those whose influence scores are greater than or equal to $iIndex(b_{i_K})$, i.e. $I(p_l) \ge iIndex(b_{i_K})$ for $1 \le l \le L$. Hence, we have the following corollary: those bloggers who published blog posts that satisfy $I(p_l) \ge iIndex(b_{i_K})$ for $1 \le l \le L$ will be called influential bloggers, because their $iIndex$ will be greater than or equal to $iIndex(b_{i_K})$.

The blog post with the maximum influence score can thus be used as the blogger's representative, and its influence score assigned as the blogger's $iIndex$: $iIndex(b_k) = \max(I(p_i))$, where $1 \le i \le N$.
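The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of Agarwal et al. (2008): it assumes the per-post influence scores are already known, and the function names, blogger names, and score values are all hypothetical.

```python
def i_index(post_scores):
    """Map each blogger to the maximum influence score among their posts."""
    best = {}
    for blogger, score in post_scores:
        if blogger not in best or score > best[blogger]:
            best[blogger] = score
    return best


def top_k_bloggers(post_scores, k):
    """Return the K bloggers with the highest iIndex, in descending order."""
    index = i_index(post_scores)
    ranked = sorted(index.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical (blogger, influence score) pairs, one pair per post.
posts = [("amal", 0.9), ("amal", 0.2), ("badr", 0.6), ("badr", 0.5), ("dina", 0.1)]

top = top_k_bloggers(posts, k=2)
print(top)  # [('amal', 0.9), ('badr', 0.6)]

# Influential posts are those scoring at least the iIndex of the K-th blogger.
threshold = top[-1][1]
influential_posts = [(b, s) for b, s in posts if s >= threshold]
print(influential_posts)  # [('amal', 0.9), ('badr', 0.6)]
```

Note that a blogger's second-best post plays no role here, which is the behaviour the corollary describes: one sufficiently influential post is enough to place a blogger in the set.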


Blog-post influence can be visualized as a directed graph in which each node represents a single post characterized by the four properties mentioned above: inlinks ($\iota$), which represent recognition; comments ($\gamma$), which represent activity generation; outlinks ($\theta$), which represent novelty; and post length ($\lambda$), which represents eloquence.

$\iota$ and $\theta$ represent the incoming and outgoing influence flows of a node. Hence, if $I$ denotes the influence of a node (or blog post $p$), then the $InfluenceFlow$ across that node is given by

$InfluenceFlow(p) = w_{in} \sum_{m=1}^{|\iota|} I(p_m) - w_{out} \sum_{n=1}^{|\theta|} I(p_n)$    (1)

where $w_{in}$ and $w_{out}$ are the weights that can be used to adjust the contribution of incoming and outgoing influence, respectively. $p_m$ denotes the blog posts that link to the blog post $p$, where $1 \le m \le |\iota|$; and $p_n$ denotes the blog posts that are referred to by the blog post $p$, where $1 \le n \le |\theta|$. $|\iota|$ and $|\theta|$ are the total numbers of inlinks and outlinks of post $p$.

$InfluenceFlow$ measures the difference between the total incoming influence of all inlinks and the total outgoing influence of all outlinks of the blog post $p$.

$InfluenceFlow$ accounts for the part of a blog post's influence that depends on inlinks and outlinks. The more inlinks a blog post acquires, the more recognized, and hence the more influential, it becomes; an excessive number of outlinks jeopardizes the novelty of a blog post, which reduces its influence.
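As a small numeric illustration of the influence flow described above, the following sketch computes the weighted difference between incoming and outgoing influence for a single post. It assumes the influence scores of the neighbouring posts are already available; the function name, the weights, and the score values are hypothetical.

```python
def influence_flow(inlink_scores, outlink_scores, w_in=1.0, w_out=1.0):
    """Weighted incoming influence minus weighted outgoing influence.

    inlink_scores:  influence scores of the posts that link to this post
    outlink_scores: influence scores of the posts that this post links to
    """
    return w_in * sum(inlink_scores) - w_out * sum(outlink_scores)


# Two posts (scores 0.5 and 0.25) link to the post; it links to one post (0.25).
flow = influence_flow([0.5, 0.25], [0.25], w_in=1.0, w_out=0.5)
print(flow)  # 0.625, i.e. 1.0 * 0.75 - 0.5 * 0.25
```

A post with no inlinks and several outlinks receives a negative flow, which matches the idea that heavy outlinking reduces novelty.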

The influence ($I$) of a blog post is also proportional to the number of comments ($\gamma_p$) posted on that blog post. The influence of a blog post $p$ can be defined as

$I(p) \propto w_{com} \gamma_p + InfluenceFlow(p)$    (2)

where $w_{com}$ denotes the weight that can be used to regulate the contribution of the number of comments ($\gamma_p$) towards the influence of the blog post $p$. Finally, for eloquence, define a weight function $w$ that rewards or penalizes the influence score of a blog post depending on the length ($\lambda$) of the post. The weight function could be replaced with appropriate content and literary analysis tools.

Combining equations (1) and (2) with the length weight function, the influence of a blog post $p$ can thus be defined as

$I(p) = w(\lambda) \times (w_{com} \gamma_p + InfluenceFlow(p))$    (3)

Equation (3) assigns an influence score $I$ to each blog post.

Note that the four weights can take more complex forms and can be tuned to regulate the contribution of each parameter to the influence score. Changing the weight values can lead to different rankings, which allows the model to be adjusted to different goals, such as identifying influencers with different characteristics (Agarwal et al., 2008).
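Because the influence of a post depends on the influence of the posts it links to and is linked from, the score has to be computed over the whole graph rather than post by post. The sketch below is one possible way to do this, under stated assumptions: it runs a fixed number of update sweeps and rescales the scores by their largest magnitude after each sweep. Neither step is described in the text above, and the function name, parameter names, and sample data are hypothetical. Every outlink target is assumed to be present in the input.

```python
def influence_scores(posts, w_in=1.0, w_out=1.0, w_com=1.0,
                     length_weight=lambda length: 1.0, sweeps=20):
    """Approximate I(p) = w(length) * (w_com * comments + InfluenceFlow(p)).

    posts maps a post id to a dict with "comments" (int), "length" (int)
    and "outlinks" (list of post ids that the post links to).
    """
    if not posts:
        return {}

    # Derive inlinks from outlinks so the graph is stored only once.
    inlinks = {pid: [] for pid in posts}
    for pid, post in posts.items():
        for target in post["outlinks"]:
            inlinks[target].append(pid)

    scores = {pid: 0.0 for pid in posts}
    for _ in range(sweeps):
        raw = {}
        for pid, post in posts.items():
            flow = (w_in * sum(scores[m] for m in inlinks[pid])
                    - w_out * sum(scores[n] for n in post["outlinks"]))
            raw[pid] = length_weight(post["length"]) * (w_com * post["comments"] + flow)
        # Assumption of this sketch: rescale by the largest magnitude
        # so that repeated sweeps stay bounded.
        scale = max(abs(value) for value in raw.values()) or 1.0
        scores = {pid: value / scale for pid, value in raw.items()}
    return scores


# Hypothetical three-post graph: "b" links to "a"; "c" links to "a" and "b".
posts = {
    "a": {"comments": 10, "length": 500, "outlinks": []},
    "b": {"comments": 2, "length": 500, "outlinks": ["a"]},
    "c": {"comments": 0, "length": 500, "outlinks": ["a", "b"]},
}

scores = influence_scores(posts)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['a', 'b', 'c']
```

Passing different values for `w_in`, `w_out`, `w_com`, or `length_weight` is how the tuning described above would be explored. The fixed-sweep scheme is not guaranteed to converge on every graph, so a production implementation would need a convergence check.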

However, Akritidis et al. (2009) argued that isolating a single post to decide whether a blogger is influential is an oversimplified approach. They consider the productivity of a blogger a significant issue that has been overlooked by the model of Agarwal et al. (2008). Although productivity and influence do not coincide, there is quite a strong relationship between them, which should therefore be taken into account. They also argue that the outcome of the model is not objective, since it depends heavily on user-defined weights, and, most importantly, that the model ignores what they consider to be one of the most important factors: the temporal dimension. An effective model should take into consideration the age of a post and the age of the incoming links to that post, in order to identify the bloggers who are influential now. Motivated by these observations, Akritidis et al. (2009) propose two easily computed blogger ranking methods that incorporate temporal aspects of blogging activity. The first metric, termed MEIBI (Metric for Evaluating and Identifying a Blogger’s Influence), takes into consideration the number of a blog post’s inlinks and comments, along with the publication date of the post. The second metric, MEIBIX (MEIBI eXtended), scores a blog post according to the number and age of its inlinks and its comments.

The blogosphere changes rapidly, so a blogger who is currently considered an influential is not guaranteed to remain influential in the future. An issue that is discussed in a post today and is of major importance now may be totally outdated after a couple of months. If 𝐶𝐶(𝑖𝑖) denotes the set of comments to post 𝑖𝑖 of blogger 𝑖𝑖, ∆𝑇𝑇𝑇𝑇𝑖𝑖(𝑖𝑖) the time interval in days between the current time and the date that the 𝑖𝑖𝑐𝑐ℎ blogger’s post 𝑖𝑖 was submitted, and 𝑅𝑅𝑖𝑖(�