4.3 Automatic Metadata Generation: From Isolated Items to Interlinked Items
4.3.3 Topic classification
There is a wide range of work on classification of social media, for purposes including automatic spam detection, sentiment analysis, and quality assessment.
In this section we focus on the work that involves topic classification, i.e., categorising posts in terms of predefined topics.
Learning from specific related objects
Genc et al. (2011) present an approach for improving the classification of tweets into predefined categories from Wikipedia. They first map tweets to their most similar Wikipedia article, based on occurrences of words from the tweet within Wikipedia articles and their titles. They then identify the predefined category that is closest to the category of the article, using the Wikipedia category structure. They do not make use of hyperlinks within posts.
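The two-step idea described above (map a tweet to its most similar article, then pick the nearest predefined category) can be illustrated with a minimal sketch. This is not Genc et al.'s implementation: the toy articles, the category graph, the word-overlap similarity and the shortest-path distance are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical stand-ins for Wikipedia: article title -> (text, category).
ARTICLES = {
    "Association football": ("football match goal league team player", "Ball games"),
    "Stock market": ("stock market shares trading investors price", "Financial markets"),
}

# Hypothetical undirected category graph (links between related categories).
CATEGORY_LINKS = {
    "Ball games": ["Sports"],
    "Sports": ["Ball games", "Recreation"],
    "Recreation": ["Sports"],
    "Financial markets": ["Finance"],
    "Finance": ["Financial markets", "Economy"],
    "Economy": ["Finance"],
}

# Hypothetical predefined target categories.
PREDEFINED = ["Sports", "Economy"]


def tokens(text):
    """Lowercased whitespace tokens as a set (an assumed, very crude tokeniser)."""
    return set(text.lower().split())


def closest_article(tweet):
    """Return the article whose text and title share the most words with the tweet."""
    words = tokens(tweet)

    def overlap(title):
        text, _category = ARTICLES[title]
        return len(words & (tokens(text) | tokens(title)))

    return max(ARTICLES, key=overlap)


def category_distance(start, goal):
    """Shortest path length between two categories, found by breadth-first search."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in CATEGORY_LINKS.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return float("inf")  # no path between the two categories


def classify(tweet):
    """Map the tweet to an article, then to the nearest predefined category."""
    _text, article_category = ARTICLES[closest_article(tweet)]
    return min(PREDEFINED, key=lambda c: category_distance(article_category, c))
```

For example, `classify("Great goal in the football match tonight")` matches the football article and then selects the predefined category nearest to "Ball games" in the toy graph.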
Yin et al. (2009) propose improving object classification within a website by bridging heterogeneous objects across websites so that category information can be propagated from one domain to another. They use tags as bridges to connect the unlabelled objects to labelled ones, based on the intuition that users are likely to use similar tags to describe objects with the same topic, even if the objects are of a different type. They improve the automatic classification of Amazon products by learning from the tags of resources contained within the Open Directory Project categories. This work exploits external metadata by improving classification of items in one domain using text and category information from another domain.
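The intuition of using tags as bridges can be sketched as a simple voting scheme: an unlabelled item inherits the category of the labelled resources with which it shares the most tags. This is only an illustration of the intuition, not the algorithm of Yin et al.; the labelled resources, the tag sets and the overlap-based voting are all assumed.

```python
from collections import Counter

# Hypothetical labelled resources from another domain: (tags, category).
LABELLED = [
    ({"camera", "lens", "photo"}, "Photography"),
    ({"novel", "fiction", "paperback"}, "Books"),
    ({"photo", "tripod"}, "Photography"),
]


def propagate_category(item_tags):
    """Vote for a category using tag overlap with labelled resources.

    Returns None when the item shares no tag with any labelled resource.
    """
    votes = Counter()
    for tags, category in LABELLED:
        votes[category] += len(item_tags & tags)
    category, score = votes.most_common(1)[0]
    return category if score > 0 else None
```

Here an item tagged `{"lens", "tripod", "zoom"}` would receive the category "Photography", because its tags overlap only with resources labelled that way.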
Enriching Social Media Items with Metadata
Learning from models of previously annotated items
One approach to classifying social media items based on a training set is to simply consider the content of a post, especially in media without rich metadata, such as Twitter. Garcia Esparza et al. (2010) investigated the categorisation of short messages based purely on their content. They carried out experiments on a dataset from Twitter and a dataset from a product review website. Sharifi (2010) performed a similar study on tweet content categorisation. The labelled dataset used in those experiments was obtained by hand-selecting Twitter accounts corresponding to certain predefined categories and retrieving the tweets generated by each account.
Rodrigues et al. (2008) classified questions on a question-answering site using the text of the initial question. Pal and Saha (2010) performed multi-label categorisation of blog entries in order to identify posts relevant to certain product groups. They then performed sentiment analysis in order to assess the tone of communication about those products in the blogosphere. All of these studies operated only on the plain text of the posts.
Other studies have improved classification of social media by making use of the metadata of the post. For example, Berendt and Hanser (2007) investigated automatic domain classification of blog posts with different combinations of body, tags and title. They identified tags and body nouns as the most useful features for classification. Sun et al. (2007) investigated the classification of entire blogs, as opposed to individual blog posts. Their experiments compared blog tags against blog title and descriptions as sources of text for a classifier. They concluded that blog tags were a better descriptor of blog topic than titles and descriptions but that classification was improved by including all three. These papers both studied the categorisation of social media posts using the items’ own content and metadata.
A relevant study was performed by Figueiredo et al. (2009), who assessed the quality of various textual features in Web 2.0 sites for classifying objects within those sites. For Last.fm artists, and videos from Yahoo! Video and YouTube, they found that tags were always the best single feature for classification, although a combination of features generally performed better. They found that a bag-of-words approach for combining feature vectors of different text sources tended to slightly outperform a concatenation approach. They did not investigate the use of any external data.
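The difference between the two combination strategies can be made concrete with a small sketch. It assumes one plausible reading of the terms: "bag-of-words" merges all text sources into a single shared term space, whereas "concatenation" keeps a separate term space per source, so the same word in the title and in the tags yields two distinct features. The example item and field names are hypothetical.

```python
from collections import Counter


def term_counts(text):
    """Term frequencies using an assumed lowercase whitespace tokeniser."""
    return Counter(text.lower().split())


def bag_of_words(fields):
    """Merge all fields into one term space: a word is one feature wherever it occurs."""
    merged = Counter()
    for text in fields.values():
        merged.update(term_counts(text))
    return merged


def concatenation(fields):
    """Keep one term space per field: each feature is a (field, word) pair."""
    combined = Counter()
    for name, text in fields.items():
        for word, count in term_counts(text).items():
            combined[(name, word)] = count
    return combined


# Hypothetical item with three text sources.
item = {
    "title": "rock concert live",
    "tags": "rock live music",
    "description": "live rock show",
}
bow_vector = bag_of_words(item)
concat_vector = concatenation(item)
```

Under this reading, the merged vector has fewer dimensions and pools evidence for a word across sources, while the concatenated vector preserves where each word occurred.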
Huang et al. (2010) conducted a study which attempted to identify extremist videos on YouTube. They tested various classifiers based on many lexical, syntactic and content-based features, from user-generated text including object titles, descriptions and comments. They included certain tags and categories as binary features in their classifier, but did not compare the improvements gained from different text sources.
A study of event-driven classification of images was carried out by Firan et al. (2010), who compared the timestamp and textual features from Flickr for classifying photos into events. The events were extracted from the YAGO ontology (Suchanek et al., 2007) and the Upcoming events website. A combination of all features provided the best results. The authors also noted that classification based on all text features (title, tags and description) performed only marginally better than classification based on tags alone.
There is also existing work on classifying Web documents based on structural parts of their content. Riboni (2002) tested different text sources for webpage representation – the body, title and meta tag content – and found that classification based on a combination of the meta tag and title gave the best results. Othman et al. (2010) performed a similar study but found that adding the document body to the meta tag and title resulted in the highest accuracy. Golub and Ardö (2005) compared page title, headings, meta tag and body text with the aim of determining how they affect automated classification. They found that title was the best single indicator of topic, but for best results, all of these elements should be included. These results show that for Web documents, the structure of their text content is relevant for topic categorisation.
Incorporating information from hyperlinks
Some model-based approaches to determining the topic of social media items also integrate information from hyperlinks. Antonelli and Sapino (2005) proposed a rule-based approach to classifying message board topics by type (e.g., announcement, question, answer). Their rules use metrics including the similarity of the post to the title and meta tag of hyperlinked webpages. Irani et al. (2010) studied the problem of identifying posters who aim to dishonestly gain visibility by misleadingly tagging posts with popular topics. They built models that correspond to topics in order to identify messages that are tagged with a topic but are in fact spam. They took advantage of hyperlinks by augmenting their models with text from webpages linked to within posts.
Similarly, some studies on Web document classification have compared the effects of incorporating various parts of neighbouring websites on classification accuracy. Many of these focus on citing or inlinking pages, those which contain hyperlinks to the target webpage. An early study by Attardi et al. (1999) proposed Web document classification using the titles, incoming anchortext, and text surrounding hyperlinks from citing pages, and reported that they achieved encouraging experimental results. Fürnkranz (1999) compared the results of webpage classification based on content to the results of classification based on parts of citing webpages. The study considered anchortext, the paragraph containing the hyperlink, and the headings that structurally precede the hyperlink. The highest-scoring classifier was that based solely on the anchortext and headings of citing pages. Ghani et al. (2001) investigated website classification using a method which exploits patterns in hyperlink structure. They noted that results were improved by the inclusion of metadata, titles and words from hyperlinked pages.
Glover et al. (2002) performed a study which compared document text, incoming anchortext, and the text surrounding incoming anchortext as sources for a text classifier. Their results showed that classification based on anchortext gave comparable results to classification based on the text of the target document, and a combination of text from both the original document and the citing document gave best results. Another study by Sun et al. (2002) compared document text, title, and incoming anchortext and found that the combination of title and anchortext resulted in the highest accuracy. Lim et al. (2005) investigated the task of genre classification on the Web (e.g., personal homepages, FAQs, image collections). They represented each document using combinations of features from the URL, anchortext, titles, meta tag and body. Qi and Davison (2008) performed experiments which compared different weightings of titles, anchortext, extended anchortext and full text, from the target page and its neighbours. The term neighbours includes citing pages, cited pages, co-citing pages and co-cited pages. They found that using fielded information as opposed to the full text of webpages improved classification accuracy.
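As a rough illustration of what a fielded representation with neighbour information might look like, the sketch below builds a weighted term vector from the fields of a target page and its neighbours. The field names, the weights and the neighbour discount are hypothetical values chosen for the example; they are not taken from any of the studies cited above.

```python
from collections import Counter

# Hypothetical field weights and neighbour discount (assumed, not from the literature).
FIELD_WEIGHTS = {"title": 3.0, "anchortext": 2.0, "body": 1.0}
NEIGHBOUR_WEIGHT = 0.5


def fielded_vector(page, scale=1.0):
    """Weighted term vector for one page; each field contributes with its own weight."""
    vector = Counter()
    for field, weight in FIELD_WEIGHTS.items():
        for word in page.get(field, "").lower().split():
            vector[word] += scale * weight
    return vector


def represent(target, neighbours):
    """Combine the target page's fields with discounted fields from its neighbours."""
    vector = fielded_vector(target)
    for neighbour in neighbours:
        # Counter.update adds the neighbour's weights to the existing ones.
        vector.update(fielded_vector(neighbour, scale=NEIGHBOUR_WEIGHT))
    return vector


# Hypothetical target page and one citing page.
target_page = {"title": "python tutorial", "body": "learn python"}
citing_pages = [{"anchortext": "python guide"}]
page_vector = represent(target_page, citing_pages)
```

The resulting vector could then be passed to any text classifier; varying the weights corresponds to the kind of comparison between field weightings described above.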
4.4 Conclusions
This chapter has completed the background section of the thesis by describing how structured information can be attached to social media items by way of manually assigned or automatically generated metadata. In the following core chapters, we will focus on the tasks of predicting tag, location and topic metadata for social media posts. In Chapter 8.1, we will show how the three approaches can be combined to augment a social media item with each of these three complementary metadata types.