However, when modeling the mediatext as a hybrid communicative form, journalism, marketing, PR and advertising perform similar conceptual tasks: all these types of communication essentially sell ideas in order to stimulate the target audience's interest in either intangible or material products. Such a pragmatic interpretation of these essentially different directions in the mass media can serve as a basis for constructing new models of formatted media texts. In fact, the format includes not only the form and semantics of the mediatext presentation but also the audience's expectations of bilateral communication, so the question of the effectiveness of this communication is extremely significant. The rules of representation, however, are dictated by the organic nature of mediatext formation. It is necessary to determine in advance which type of creativity will be dominant in the media project and which will be secondary, since the dominant semantic code shapes in the mediatext an individualized set of expressive artistic means and communicative relays that influence the expectations and cognitive perception of the media user. The mass media use many forms of branded mediatext on the Internet. One is the longread, which remains interesting to the consumer: verbal cognitive information usually dominates in it, but infographics, video and fairly overt advertising promotion are integrated as well. The longread manifests an interdisciplinary approach to the integration of genres and forms. There is also the format of online games, varied in subject and format and provided with comments and response forms for the media user. Finally, there are special projects, similar to longreads in form but different in content, whose task is to provide the consumer with cognitive information, typically of an encyclopedic kind.
However, all these different media projects have a single point of reference: the customer or sponsor of the project, ready to popularize one or another socially significant idea, which ultimately determines the choice of format. If a mass media outlet has no sponsor, it can create a media project independently in order to grow its audience and secure an inflow of financial resources. This amounts to the audience effectively placing a social order with a media company: the production of a media project that takes the audience's ideas and requirements into account.
Computer Mediated Communication
As communication takes place via technical means, such as smartphones or home computers, the term Computer Mediated Communication (CMC) is strongly related to social media. CMC describes communication between people using a technical device as the medium (Herring, 1996). The term CMC originated in the 1960s and was in use long before social media surfaced (Thurlow et al., 2004). As using social media requires a technical device, it is often also CMC, i.e. the two terms overlap substantially. In contrast, a service that allows one to chat with friends and to invite additional people is not a classical social media platform but certainly is CMC. For this thesis, with its focus on non-standard text with respect to PoS tagging, we will use the term "social media" as an umbrella term that covers all kinds of CMC and social media text.

Example of Tagging Twitter
Figure 3.1 shows an example of a PoS-tagged Twitter posting containing informal utterances, tagged with the Stanford tagger (Toutanova et al., 2003) using a model trained on formal text. We show the original PTB tags predicted by the Stanford tagger and their mapping to the corresponding coarse-grained word classes for improved readability. We see that the model assigns the noun tag to almost all non-standard word forms. While it is hard to argue which tag would be appropriate, the noun tag seems inappropriate. Only the two words that are part of the standard language dictionary, miss and them, have been predicted correctly. Thus, we find plenty of evidence that tagging social media can be a challenging task. In the remainder of this chapter, we explore these challenges from the theoretical and practical side to learn how well the currently available PoS taggers are equipped to deal with the challenges of informal utterances.
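The "everything becomes a noun" behaviour can be illustrated with a toy tagger: taggers trained on formal text typically fall back to the most frequent open-class tag (NN) for unknown words. The lexicon below is a small hypothetical fragment, not the Stanford tagger's actual model.

```python
# Toy illustration of why non-standard tokens get tagged as nouns:
# out-of-vocabulary words receive the NN fallback tag.
LEXICON = {"i": "PRP", "miss": "VBP", "them": "PRP", "so": "RB", "much": "JJ"}

def tag_with_nn_fallback(tokens):
    """Tag each token via lexicon lookup; fall back to NN for OOV forms."""
    return [(t, LEXICON.get(t.lower(), "NN")) for t in tokens]

tweet = "ikr i miss them sooo much lol".split()
print(tag_with_nn_fallback(tweet))
# the non-standard forms "ikr", "sooo" and "lol" all receive NN
```

Only the standard-dictionary words (here "miss" and "them") receive an informative tag, mirroring the behaviour observed in Figure 3.1.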
resents some information about the sequence at every step of the input. In theory, RNNs can learn long dependencies, but in practice they fail to do so and tend to be biased towards the most recent input in the sequence (Bengio et al., 1994). Long Short-Term Memory networks, or "LSTMs", are a special kind of RNN capable of learning long-term dependencies. With our data, where posts/comments are not very long, LSTMs can provide a better result, as keeping track of previous context is one of the specialities of LSTM networks. LSTM networks were first introduced by Hochreiter and Schmidhuber (1997) and have since been refined and popularized by many other authors. They work well on a large variety of problems, especially those involving sequences, and are now widely used. They do so using several gates that control the proportion of the input to give to the memory cell and the proportion of the previous state to forget. These networks have been used in the past for tasks similar to ours, such as hate speech detection (Badjatiya et al., 2017), bullying detection (Agrawal and Awekar, 2018), and abusive language detection (Chu et al., 2016) on social media text. Hence, we experiment on our data with an LSTM model and compare the results to see how well our CNN model performs compared to LSTMs.
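The gating mechanism described above can be sketched as a single LSTM step in NumPy. This is a minimal illustration of the cell equations with toy dimensions and random weights, not the trained models used in our experiments.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b stack the parameters of the input, forget,
    and output gates plus the candidate cell state (toy shapes below)."""
    z = W @ x + U @ h_prev + b              # shape (4*hidden,)
    i, f, o, g = np.split(z, 4)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c = f * c_prev + i * np.tanh(g)         # forget old content, write new
    h = o * np.tanh(c)                      # expose a gated view of the cell
    return h, c

rng = np.random.default_rng(0)
hidden, dim = 3, 2
W = rng.normal(size=(4 * hidden, dim))
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
h = c = np.zeros(hidden)
for x in rng.normal(size=(5, dim)):         # a short "post" of 5 token vectors
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (3,)
```

The forget gate `f` and input gate `i` implement exactly the two proportions mentioned in the text: how much of the previous state to keep and how much of the new input to write.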
Among all UTCNN variations, we find that user information is the most important, followed by topic and comment information. UTCNN without user information shows results similar to SVMs: it does well on Sup and Neu but detects no Uns. Its best f-scores on both Sup and Neu among all methods show that, with enough training data, content-based models can perform well; at the same time, the lack of user information leaves too few clues for minor-class posts, either to predict their stance directly or to link them to other users and posts for improved performance. The 17.5% improvement when adding user information suggests that user information is especially useful when the dataset is highly imbalanced. All models that consider user information predict the minority class successfully. UTCNN without topic information works well but achieves lower performance than the full UTCNN model. The 4.9% performance gain brought by LDA shows that, although a single-topic dataset might seem not to need it, adding latent topics still benefits performance: even when we are discussing the same topic, we use different arguments and supporting evidence. Lastly, we get a 4.8% improvement when adding comment information, achieving performance comparable to UTCNN without topic information, which shows that comments also benefit performance. For platforms where user IDs are pixelated or otherwise hidden, adding comments to a text model still improves performance. In its integration of user, content, and comment information, the full UTCNN produces the highest f-scores on all of the Sup, Neu, and Uns stances among models that predict the Uns class, and the highest macro-average f-score overall. This shows its ability to handle a biased dataset and supports our claim that UTCNN successfully bridges content with user, topic, and comment information for stance classification on social media text. Another merit of UTCNN is that it does not require balanced training data.
This is supported by its outperforming the other models even though no oversampling technique was applied in the UTCNN-related experiments reported in this paper. Thus we conclude that user information provides strong clues and remains rich even in the minority class.
The approaches that have been taken to sentence boundary detection can, on a general level, be categorized as either machine learning-based or rule-based. State-of-the-art machine learning approaches perform well, with an accuracy of around 99% on formal texts such as newswire and financial newspaper texts, whereas rule-based approaches on the same type of texts typically report around 93% accuracy. Our approach investigates the use of different punctuation marks and patterns for marking the end of a sentence in social media text and is based on comparing three machine learning algorithms (Conditional Random Fields, Naïve Bayes and Sequential Minimal Optimization) to a rule-based system, with the rule-based strategy again showing lower accuracy than the machine learning approaches, as described in the next sections. First, however, we will discuss some relevant previous work.
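As a point of reference for what a rule-based baseline looks like, the following sketch splits on the punctuation patterns common in social media text (runs of ., ! and ?, e.g. "!!!" or "..."). The rules are illustrative only; abbreviation handling and emoticon-final sentences, which a real system would need, are deliberately omitted.

```python
import re

# Sentence ends at a run of ., !, ? followed by whitespace; the run is
# kept with its sentence. A trailing fragment without end punctuation is
# treated as a final sentence.
END = re.compile(r'([.!?]+)\s+')

def split_sentences(text):
    parts = END.split(text)
    sents = [''.join(parts[i:i + 2]).strip() for i in range(0, len(parts) - 1, 2)]
    tail = parts[-1].strip()
    if tail:
        sents.append(tail)
    return sents

print(split_sentences("omg best day ever!!! going home now... see u tmrw"))
# → ['omg best day ever!!!', 'going home now...', 'see u tmrw']
```

Failure cases of such hand-written rules (e.g. sentence-internal ellipses, abbreviations) are precisely where the machine learning approaches gain their accuracy advantage.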
One of the first POS taggers for code-mixed text was developed by Solorio and Liu (2008). They constructed a POS tagger for English-Spanish text by using existing monolingual POS taggers for both languages. They combined the POS tag information using heuristic procedures and achieved a maximum accuracy of 93.4%. However, this work was not on social media text and hence posed considerably fewer difficulties. Gella et al. (2013) developed a system to identify word-level language, then chunk the individual languages and produce POS tags for every individual chunk. They used a CRF-based Hindi POS tagger for Hindi and the Twitter POS tagger for English, and achieved a maximum accuracy of 79%. Vyas et al. (2014) developed an English-Hindi POS tagger for code-mixed social media text.
There is a large quantity of user-generated content on the web, characterized by social media, creativity and individuality, which has created problems at two levels. Firstly, social media text is often unsuitable for various Natural Language Processing (NLP) tasks, such as Information Retrieval, Machine Translation, Opinion Mining, etc., due to the irregularities found in such content. Secondly, non-native speakers of English, older Internet users and non-members of the in-groups often find such texts difficult to comprehend. The widespread use of the Internet and the resulting noisy user-generated text found on different social media platforms, such as social networking sites, blogs, etc., cause a hindrance in
All the above work focused on normalizing words. In contrast, our work also performs other normalization operations, such as missing word recovery and punctuation correction, to further improve machine translation. Previously, Aw et al. (2006) adopted phrase-based MT to perform SMS normalization, which required a relatively large number of manually normalized SMS messages. In contrast, our approach performs beam search at the sentence level and does not require large training data. We evaluate the success of social media text normalization in the context of machine translation, so research on machine translation of social media text is relevant to our work. However, there is not much comparative evaluation of social media text translation other than the Haitian Creole to English SMS translation task in the 2011 Workshop on Statistical Machine Translation (WMT 2011) (Callison-Burch et al., 2011). However, the setup of the WMT 2011 task is different from ours, in that the task provided parallel training data of SMS texts and their translations. As such, text normalization is not necessary in that task. For example, the best reported system in that task (Costa-jussà and Banchs, 2011) did not perform SMS message normalization.
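Sentence-level beam search for normalization can be sketched as follows. The candidate lists, channel scores and bigram language model below are hypothetical toy values chosen for illustration; they do not reproduce the scoring models of the actual system.

```python
# Each token has scored normalization candidates (a "channel" score);
# beam search keeps the best partial sentences, adding a toy bigram
# language-model score so candidates are chosen in context.
CANDS = {                 # token -> [(candidate, channel_log_prob)]
    "u":     [("you", -0.1), ("u", -1.5)],
    "2moro": [("tomorrow", -0.2), ("2moro", -2.0)],
    "cya":   [("see", -0.5), ("cya", -1.0)],
}
BIGRAM = {("see", "you"): -0.2, ("you", "tomorrow"): -0.3}  # toy LM; -1.0 for unseen pairs

def normalize(tokens, beam=3):
    beams = [([], 0.0)]                       # (sequence, log score)
    for tok in tokens:
        nxt = []
        for seq, score in beams:
            for cand, ch in CANDS.get(tok, [(tok, 0.0)]):
                lm = BIGRAM.get((seq[-1], cand), -1.0) if seq else 0.0
                nxt.append((seq + [cand], score + ch + lm))
        beams = sorted(nxt, key=lambda b: b[1], reverse=True)[:beam]
    return beams[0][0]

print(" ".join(normalize("cya u 2moro".split())))
# → "see you tomorrow"
```

Because scoring happens over whole partial sentences rather than isolated words, the search can prefer a candidate that fits its neighbours, which is the motivation for sentence-level rather than word-level normalization.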
values and second-person pronouns to train a Support Vector Machine for classification purposes. A sentiment analysis technique was leveraged to detect cyberbullying in a Twitter dataset by Xu et al. (2012). They used Latent Dirichlet Allocation to identify the most frequent topics in tweets that exhibited signs of bullying. Semi-supervised approaches with bootstrapping have also been proposed in recent years (Schmidt and Wiegand, 2017). While there have been many significant research contributions in this area, most of them treated aggression detection as a binary classification task, i.e., each text was to be classified as either aggressive or non-aggressive. Malmasi and Zampieri (2018) is one of the early works to take a different approach. In their work, every social media text is classified into three categories: hate speech, aggressive text and neutral. They argued that as general profanity is an unavoidable part of social media posts due to their unregulated nature, our efforts
search, which focuses on quoted statements in social media text. de Marneffe et al. (2012) conduct an empirical evaluation of FactBank ratings from Mechanical Turk workers, finding a high degree of disagreement between raters. They also construct a statistical model to predict these ratings. We are unaware of prior work comparing the contribution of linguistic and extra-linguistic predictors (e.g., source and journalist features) for factuality ratings. This prior work also does not measure the impact of individual cues and cue classes on the assessment of factuality.
We also present word-level language identification experiments performed using this dataset. Different techniques are employed, including a simple unsupervised dictionary-based approach and supervised word-level language identification using sequence labelling with Conditional Random Fields-based models, SVM, and Random Forest. The targeted research problem also entails solving another problem, that of correcting English spelling errors in code-mixed social media text that contains English words as well as Romanized transliterations of words from another language, in this case Konkani.

1 Introduction
Various claims have been made about social media text being "noisy" (Java, 2007; Becker et al., 2009; Yin et al., 2012; Preotiuc-Pietro et al., 2012; Eisenstein, 2013, inter alia). However, there has been little effort to quantify the extent to which social media text is noisier than conventional, edited text types. Moreover, social media comes in many flavours (including microblogs, blogs, and user-generated comments) and research has tended to focus on a specific data source, such as Twitter or blogs. A natural question to ask is how different the textual content of the myriad social media types is from one another. This is an important first step towards building a general-purpose suite of social media text processing tools. Most research to date on social media text has used very shallow text processing (such as
which forms the negative data. n is usually large. However, due to the labor-intensive effort of manual labeling, the user can label only a certain number of training posts. The labeled negative training posts may then cover only a small subset S of the irrelevant topics T (S ⊆ T) as negative. Further, due to the highly dynamic nature of social media, it is probably impossible to label all possible negative topics. In testing, when posts of other negative topics in T − S show up, their classification can be unpredictable. For example, suppose the training data of an application has no negative examples about sports, but in testing, some sports posts show up. These unexpected sports posts may be classified arbitrarily, which results in low classification accuracy. In this paper, we aim to solve this problem.
Lu et al. developed a visual analytics framework which is directly related to EventRiver. This framework allows analysts to visually explore how online news, blogs, social media and other media documents are being framed. Framing constructs a point of view that can influence readers or be interpreted by them. The tool allows exploration of the overall framing and sentiment of the media documents over space and time. The tool also uses a time-series intervention model to detect whether an event affects the level of framing before and after a given date. Figure 2.8 shows the framework. The map section allows the analyst to explore specific frames and entities found in the media over a period of time. The word clouds contain the entities extracted from the documents. The time series seen in the lower section contains the frame and sentiment visualizations. The tree located to the left of the time-series visualization contains the hierarchical frame classes. The control panel located at the left of the figure corresponds to the distribution of frames and events [Lu16].
ing of code-switched text and report comparison of different language identification systems. The best system from the second iteration of these shared tasks uses a logistic regression model and reports a token-level F1-score of 97.3% for SPA-ENG. Our results are competitive with this score. Das and Gambäck (2014) use a dictionary-based method and an SVM model with various features for Hindi-English and Bengali-English. Their system achieves an F1-score of 79% for Hindi-English. Barman et al. (2014) create a new dataset and study code mixing between three languages (English, Hindi, and Bengali) using CRF and SVM models. In another work, Gella et al. (2014) build a language detection system for a synthetically created code-mixed dataset covering 28 languages. Similar to some of the work in the above-mentioned papers, we model the language detection task as a sequence labeling problem and explore combinations of several features using the CRF model, but we use a larger set of labels. We obtain significantly higher performance for the Hindi-English language pair than Das and Gambäck (2014).
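A dictionary-based baseline of the kind mentioned above can be sketched as a simple per-token lookup. The two word lists below are tiny illustrative stand-ins for Hindi-English code-mixed text, not the dictionaries used by any of the cited systems.

```python
# Word-level language identification by dictionary lookup: label each
# token "en", "hi", or "unk" when neither toy dictionary contains it.
EN = {"i", "am", "going", "to", "the", "market", "today"}
HI = {"main", "ghar", "ja", "raha", "hoon"}   # Romanized Hindi (toy examples)

def identify(tokens):
    labels = []
    for t in tokens:
        w = t.lower()
        if w in EN:
            labels.append("en")
        elif w in HI:
            labels.append("hi")
        else:
            labels.append("unk")
    return labels

print(identify("main market ja raha hoon".split()))
# → ['hi', 'en', 'hi', 'hi', 'hi']
```

The weaknesses of this baseline, such as ambiguous words present in both languages and the lack of any sequential context, are what sequence labeling models like CRFs address.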
2016), owing to the fact that they can scale up the analysis to large document collections and alleviate the labor-intensive process of manual document categorization (document coding). In the context of news media, two typical use cases are media agenda (McCombs & Shaw, 1972) and news framing (Entman, 1993) analyses. For instance, Kim et al. (2014) demonstrated the use of topic models for a comparative analysis of media agenda and public agenda issues. The analysis is carried out by first training a topic model on news articles and user comments and then comparing the salience of model topics corresponding to issues in news- and user-generated texts. Korenčić et al. (2015) addressed the problem of weak correspondence between model topics and news issues of interest and proposed a semi-supervised method for media agenda analysis consisting of news issue discovery and the measurement of issue salience. Jacobi et al. (2016) used topic models to first identify the topics corresponding to specific issues of interest and then analyzed how the framing of these issues has changed over time. Because topic models and other text mining tools are receiving increased interest in the social science community, Grimmer & Stewart (2013) pointed to a need for new methods for validating these tools before they can be adopted as standard. We see the measures of topic coherence proposed in this paper as an important step toward that goal.
In the data flow diagram of the system above, the system creates a connection to the Twitter streaming API. Tweets are then fetched into the system and processed by the CEP engine, which searches for a specific keyword in each tweet. Tweets containing the keyword are stored in a CSV file with their id, text and date.
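The pipeline above can be sketched as follows. A hard-coded list stands in for the Twitter streaming API and a plain substring match stands in for the CEP engine's keyword search; the tweet ids, texts and dates are made up for illustration.

```python
import csv
import io

def filter_and_store(tweets, keyword, out):
    """Write tweets containing the keyword to CSV with id, text, date."""
    writer = csv.writer(out)
    writer.writerow(["id", "text", "date"])
    for tid, text, day in tweets:
        if keyword.lower() in text.lower():     # the "CEP" keyword step
            writer.writerow([tid, text, day])

# Stand-in for the tweet stream fetched from the API.
stream = [
    (1, "Flood warning issued downtown", "2023-05-01"),
    (2, "Lovely weather today", "2023-05-01"),
    (3, "flood waters rising near the river", "2023-05-02"),
]
buf = io.StringIO()                 # in place of an actual .csv file
filter_and_store(stream, "flood", buf)
print(buf.getvalue())
```

In a real deployment `out` would be a file opened with `open("tweets.csv", "w", newline="")` and `stream` an iterator over the live API.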
In the data normalization step we mainly focused on cleaning the utterances to improve the performance of the model. We created a custom list of words to replace Internet slang tokens with proper words, and the same method is followed for replacing emojis in utterances with their corresponding meanings. In addition, shorthand text is replaced with its proper full form and multiple spaces with a single space. Some speakers have empty utterances; we replaced each empty utterance with the words "empty line". A sample data set is shown in Table 3.
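The cleaning steps described above can be sketched as follows. The slang and emoji maps are small hypothetical samples of the custom lists, not the lists actually used.

```python
import re

# Toy samples of the custom replacement lists described in the text.
SLANG = {"idk": "I do not know", "brb": "be right back", "u": "you"}
EMOJI = {"🙂": "smile", "😢": "sad"}

def normalize_utterance(text):
    """Replace empty utterances, map emojis and slang, collapse spaces."""
    if not text.strip():
        return "empty line"
    for emo, word in EMOJI.items():
        text = text.replace(emo, word)
    tokens = [SLANG.get(t.lower(), t) for t in text.split()]
    return re.sub(r"\s+", " ", " ".join(tokens)).strip()

print(normalize_utterance("idk   u 🙂"))   # → "I do not know you smile"
print(normalize_utterance("   "))          # → "empty line"
```

Splitting on whitespace and rejoining handles both the multi-space collapse and the token-level replacements in one pass.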
In this paper, we present a study on the identification of authors' national variety of English in texts from social media. In data from Facebook and Twitter, information about the author's social profile is annotated, and the national English variety (US, UK, AUS, CAN, NNS) that each author uses is attributed. We tested four feature types: formal linguistic features, POS features, lexicon-based features related to the different varieties, and data-based features from each English variety. We used various machine learning algorithms for the classification experiments, and we implemented a feature selection process. The classification accuracy achieved, when the 31 highest-ranked features were used, was up to 77.32%. The experimental results are evaluated, and the efficacy of the ranked features is discussed.

1 Introduction
In Section 2, we described our previous work on the identification of stance in texts, and the linguistic resource that was created and annotated for that purpose. BBC is a gold-standard corpus with highly informative and accurately annotated content, but it faces some limitations in terms of generalisation and applicability. We realised there are two major issues that need to be dealt with in our work on stance after BBC. Firstly, the stance-related characteristics from our previous studies need to be evaluated against a different set of data. The corpus-based, quantitative and computational stance features were detected and identified as significant markers of stanced sentences extracted from a specific text type (blog posts and comments) towards a specific thematic area (the 2016 UK referendum). It is an important task to confirm these findings using a different data set consisting of text chunks covering a wider thematic orientation. Secondly, it is very important to confirm (or not) the efficiency of the proposed stance framework in order to use it for the annotation of other text data too. The efficiency of our stance framework is two-pronged. It aims (i) to examine whether our framework covers to a sufficient extent the large spectrum of the different stances that people take when positioning towards a topic/event/idea, and (ii) to estimate the frequency of these stances in discourse, and the linguistic patterns used to express each stance. In our first study on this topic, we already observed the paucity of some of the proposed stances in the BBC (agreement/disagreement, certainty, tact/rudeness and volition), and, in the following studies, we continued with six stances (the six most frequent stances in the BBC). In this study, we followed the same principle, aiming to bring together all the stance-related characteristics in order to evaluate them in a data set from social media.