The first is the text features of the micro-blog, that is, whether the characteristics of the micro-blog itself lend it credibility, such as its certification status, the evidence it presents, its manner of description, and its emotional factors. The social public factor mainly includes public memory, public cognition, and the public’s knowledge reserve. In brief, it is the public’s cognition of and emotion toward people, events, and things, formed through daily life and social experience. These are an important basis for public judgment and cognition. If the micro-blog content accords with public perception and emotion, the public tends to believe the micro-blog. Regarding the status of the government as a source of information, two aspects matter: whether the government announces the incident in a timely manner, and whether the government responds to the incident in time. After the Internet exposes events that have a great impact on the government or on everyday life, the public tends to look forward to the government’s official news. If the government does not respond in time, the public will take this as tacit admission or as a sign that there is something to hide. Both situations amplify the development of events. The last is the situational factor, namely the speed at which the microblog spreads and the relevant evidence. We can understand it as a condition for public participation in the network group event. The speed of communication, and the evidence associated with it, will increase the credibility of the event. Under the influence of these four factors, the public may become more actively involved, and a network group event will form.
Micro-blog is a new kind of medium which is short and informal. Because no segmented corpus of micro-blogs is available to train a Chinese word segmentation model, existing Chinese word segmentation tools cannot perform as well as they do on ordinary news text. In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blogs. In our approach, we incorporate punctuation information from unlabeled micro-blog data by introducing the characters immediately before or after punctuation marks, for they indicate the beginning or end of words. A self-training framework that incorporates confident instances is also used, which proves to be helpful. Experiments on micro-blog data show that our approach improves performance, especially in OOV recall.
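The punctuation idea above can be illustrated with a small sketch. It is not the authors' implementation: the punctuation set `PUNCT`, the function name `punctuation_hints`, and the "B"/"E" hint labels are all assumptions introduced here. The sketch only shows how characters next to punctuation in unlabeled text can be harvested as likely word beginnings or endings.

```python
# Assumption: a small illustrative punctuation set; the source does not list one.
PUNCT = set(",.!?;:,。!?;:、")


def punctuation_hints(text):
    """Return (index, char, hint) triples for characters adjacent to punctuation.

    hint is "B" (likely begins a word) for a character right after a
    punctuation mark or at the start of the text, and "E" (likely ends a
    word) for a character right before a punctuation mark or at the end.
    """
    hints = []
    n = len(text)
    for i, ch in enumerate(text):
        if ch in PUNCT or ch.isspace():
            continue  # punctuation and whitespace carry no hint themselves
        if i == 0 or text[i - 1] in PUNCT:
            hints.append((i, ch, "B"))
        if i == n - 1 or text[i + 1] in PUNCT:
            hints.append((i, ch, "E"))
    return hints
```

Hints of this kind could then be added as extra, automatically labeled training instances for a character-based segmenter; how the original work weights or filters them is not specified in the passage.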
In this paper, we propose a Chinese word segmentation model for micro-blog text. Although Conditional Random Fields (CRFs) models have previously been applied to word segmentation, this is the first time they have been applied to segmentation in the Chinese micro-blog domain. Unlike common article genres, the micro-blog has gradually become a new genre of writing with the development of the Internet. However, the unavailability of micro-blog training data has been an obstacle to developing a good segmenter based on trainable models. Considering the linguistic characteristics of the text, we propose several methods to make CRFs models suitable for segmentation in the micro-blog domain. Several experiments were conducted with different settings, from which an optimal tagging method and feature templates were designed. The proposed model was implemented for the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing Bakeoff (Bakeoff-2012) and achieves a very high F-measure of 93.38% on the test set of 5,000 micro-blog sentences. One of our main contri-
1) Data Collection: Our first step is to collect data from a microblog service such as Twitter, and there are naturally a few possible ways to do so. Here, we rely on Twitter’s Streaming API, which provides free access to 1% of all tweets. The API returns each tweet in JSON format, with the content of the tweet, some metadata (e.g., creation time, whether it is a reply or a retweet), and information about the poster (e.g., username, followers, friends, total number of posted tweets).
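As a rough sketch of what handling one such JSON record might look like, the function below pulls out the fields mentioned above. The key names (`created_at`, `in_reply_to_status_id`, `retweeted_status`, `screen_name`, `followers_count`, `friends_count`, `statuses_count`) follow the classic Twitter v1.1 tweet object and are assumptions here, since the passage does not name them. The function `summarize_tweet` is hypothetical.

```python
import json


def summarize_tweet(raw_line):
    """Extract content, metadata, and poster information from one JSON tweet.

    Key names are assumed to follow the Twitter v1.1 tweet object; missing
    keys yield None (or False for the reply/retweet flags).
    """
    tweet = json.loads(raw_line)
    user = tweet.get("user", {})
    return {
        "text": tweet.get("text"),
        "created_at": tweet.get("created_at"),
        # A reply carries the id of the tweet it answers.
        "is_reply": tweet.get("in_reply_to_status_id") is not None,
        # A retweet embeds the original tweet under "retweeted_status".
        "is_retweet": "retweeted_status" in tweet,
        "username": user.get("screen_name"),
        "followers": user.get("followers_count"),
        "friends": user.get("friends_count"),
        "total_tweets": user.get("statuses_count"),
    }
```

Using `.get` throughout keeps the sketch tolerant of records that lack some fields, such as deletion notices that a stream may interleave with tweets.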
In recent years, with the popularity of Internet social applications, researchers have begun to detect public emotions through popular social networking websites. For instance, Asur et al. use Twitter.com (a well-known microblog website) to predict movie box-office performance based on public emotions extracted from Twitter; Johan Bollen et al. found that micro-blogs labeled as “calm” have strong predictive power for the Dow Jones Industrial Average, with a highest accuracy of over 80%.
Micro-blog posts are propagated through forwarding, commenting, and sharing. Forwarding, commenting on, or sharing a post usually means that the user agrees with the original user and holds the same sentiment on the topic. At the same time, micro-blogs published by the same person within a short timeframe should have a consistent sentiment about the same topic. Based on these four kinds of relation features, we can construct a graph from the input micro-blog collection of a given topic. As illustrated in Figure 3, each node in the graph represents a micro-blog post. The four kinds of edges indicate forwarding (solid line), commentaries (long dash line), sharing relations (dash line), and being published by the same person (round dotted line).
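The graph construction described above could be sketched as follows. The record layout (keys `id`, `author`, `time`, and optional `forward_of`, `comment_on`, `share_of`), the function name `build_relation_graph`, and the one-hour window for "a short timeframe" are all assumptions made for illustration. The passage does not give the actual window length.

```python
from collections import defaultdict
from itertools import combinations


def build_relation_graph(posts, same_author_window=3600):
    """Return a set of (post_id, post_id, kind) edges over the given posts.

    posts: list of dicts with hypothetical keys "id", "author", "time"
    (seconds), and optional "forward_of", "comment_on", "share_of" holding
    the id of the post being forwarded, commented on, or shared.
    same_author_window: assumed length of the "short timeframe" in seconds.
    """
    ids = {p["id"] for p in posts}
    edges = set()
    relation_keys = (("forward_of", "forward"),
                     ("comment_on", "comment"),
                     ("share_of", "share"))
    for p in posts:
        for key, kind in relation_keys:
            target = p.get(key)
            if target in ids:  # only link posts inside the collection
                edges.add((p["id"], target, kind))
    by_author = defaultdict(list)
    for p in posts:
        by_author[p["author"]].append(p)
    for author_posts in by_author.values():
        ordered = sorted(author_posts, key=lambda p: p["time"])
        for a, b in combinations(ordered, 2):
            if b["time"] - a["time"] <= same_author_window:
                edges.add((a["id"], b["id"], "same_author"))
    return edges
```

Each node is a post id and each edge carries one of the four relation kinds, so a downstream sentiment model can treat the kinds differently if needed.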
The development and continuous improvement of campus networks has brought new changes to campus and college life. More and more students are turning to micro-blogs to express their views and opinions, and use micro-blogging to share and exchange information and resources quickly and efficiently. In the microblog era, micro-blogging can not only influence college students and expand the breadth and depth of their ideological and political education, but also make that education more timely, interactive, and engaging, and thus more effective. In this paper we analyze the current state of the ideological and political education of college students, including its achievements and problems, examine the effects of micro-blogs on that education, and explore the ideological and political education of college students in the micro-blogging era.
We first conducted experiments on the development data to investigate the effectiveness of various features. Table 5 shows the results of seven settings in terms of precision, recall and F-score. Baseline represents the setting of the conventional CRF, where only character-based features were incorporated and no adaptation strategy was used. As we can see, incorporating rule-based adaptation into the baseline, shown as Baseline+RB, significantly improved the F-score from 88.73 to 91.53, a 24.8% reduction in error rate. This improvement shows that rule-based adaptation is a very simple and effective approach to adapting a conventional word segmenter to the micro-blog domain.
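The 24.8% figure follows from treating 100 minus the F-score as the error rate, which is an assumption about how the authors computed it but reproduces their number. A minimal check, using a hypothetical helper `error_rate_reduction`:

```python
def error_rate_reduction(f_before, f_after):
    """Relative error reduction in percent, taking error = 100 - F-score."""
    err_before = 100.0 - f_before
    err_after = 100.0 - f_after
    return (err_before - err_after) / err_before * 100.0


# Baseline F = 88.73 (error 11.27); Baseline+RB F = 91.53 (error 8.47).
reduction = error_rate_reduction(88.73, 91.53)
print(round(reduction, 1))
```

The error falls by 2.80 points out of 11.27, which is about 24.8% in relative terms.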
The selection of shallow learning features is based on statistical learning. In text classification, each selected feature item generally has a specific meaning, and every item should also discriminate well between classes. Many scholars [8-10] have studied feature construction for text classification. Based on a review of the related literature and an analysis of the micro-blog corpus, this paper uses the vector space model (VSM) to represent the text and chooses three types of shallow learning features: words (unigram), part of speech (POS), and sentiment dictionary (dict). The construction of these shallow learning features is shown in Table 1.
achieved great performance using the closed training set and test set. However, segmentation performance on web documents or on the open set is still low (Huang Changing et al., 2007). Specifically, because they are generated by many kinds of users in daily life, micro-blogs are noisy and full of OOV words (Gustavo et al, 2010). For example, for the sake of brevity and expressive labels, micro-blogs contain many emotion labels, URLs, abbreviations and special characters. In addition, due to the social nature of micro-blogs, there are many OOV words (including names of users, stars, locations and organizations), which makes the segmentation of micro-blogs a challenging task. In this paper, we propose a cascaded approach to micro-blog segmentation. First, we use regular expressions to recognize URLs, English words and numbers, and use special characters and punctuation marks to split the sentence into pieces. Second, a segmentation system partitions the resulting pieces into smaller units, which form the preliminary result. Finally, we leverage a large number of dictionaries of OOV words and idioms collected from the web to merge units, in order to correct words that were segmented incorrectly. Our system’s final F1 score on the test set is 92.73%.
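The first stage of the cascade described above might look roughly like the sketch below. The concrete patterns in `TOKEN_RE` and `SPLIT_RE`, and the function name `pre_segment`, are assumptions chosen for illustration. The passage does not give the authors' actual expressions.

```python
import re

# Assumed patterns: URLs first so that digits and letters inside a URL are
# not matched separately as numbers or English words.
TOKEN_RE = re.compile(
    r"(?P<url>https?://[^\s\u4e00-\u9fff]+)"
    r"|(?P<number>\d+(?:\.\d+)?)"
    r"|(?P<english>[A-Za-z]+)"
)
# Assumed splitter: whitespace plus common Chinese and ASCII punctuation.
SPLIT_RE = re.compile(r"[\s,。!?、;:,!?;:]+")


def pre_segment(text):
    """Split text into (piece, kind) pairs.

    kind is "url", "number", or "english" for regex-recognized tokens and
    "text" for the remaining pieces, which would be passed on to a regular
    word segmenter in the second stage.
    """
    pieces = []
    pos = 0
    for m in TOKEN_RE.finditer(text):
        between = text[pos:m.start()]
        pieces.extend((c, "text") for c in SPLIT_RE.split(between) if c)
        pieces.append((m.group(), m.lastgroup))
        pos = m.end()
    pieces.extend((c, "text") for c in SPLIT_RE.split(text[pos:]) if c)
    return pieces
```

Only the `"text"` pieces need further segmentation; the recognized tokens can be emitted as single words, which keeps URLs and numbers from being broken apart by a segmenter trained on news text.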
Statistical Data Mining for Sina Weibo, a Chinese Micro-blog: Sentiment Modelling and Randomness Reduction for Topic Modelling. Wenqian Cheng, London School of Economics and Political Science. A thesis s[.]
performance of supervised methods is limited. Supervised methods have poor portability: to be used in another domain, they usually need to be retrained on new training data from that domain. Unsupervised methods are generally based on statistical analysis of the data. These methods compute the sentiment distribution of micro-blogs with a probabilistic model and then analyze sentiment. In 2009, Lin et al. proposed an improved model based on the LDA (Latent Dirichlet Allocation) model, called the JST model. The JST model is a four-layer probabilistic model that obtains the sentiment distribution of each topic by modeling the correspondence between sentiment labels and topics. In 2013, Ding et al. proposed the HDP-LDA (Hierarchical Dirichlet Process-Latent Dirichlet Allocation) model. This model uses the HDP model’s ability to determine the number of topics automatically in order to mine the sentiment tendency of phrases. However, this method needs POS tagging to identify phrases, so the accuracy of phrase recognition affects the results of the analysis. This model also requires a large number of parameters to be set, which reduces its portability. Sentiment analysis methods based on topic models achieve more accurate results than traditional methods. However, extensive experiments and practice have shown that training and running topic models is not feasible on large-scale data. Models of this kind assume that the data follow an exponential distribution, which does not fit data in real environments, especially data on the Internet, which follow a long-tail distribution. Such methods place too much emphasis on inducing semantics from high-frequency data and neglect low-frequency data, so they are not suitable for describing micro-blog text.
And lastly, both newspapers archive their content electronically online, so it is easy to access. Sample messages were retrieved from seven Sina Weibo accounts from July 23 to August 1 of 2011. During the immediate aftermath of the collision, Sina constructed a featured special page as an open platform for sharing updated information. Five accounts were selected from that feature page because of their crucial status as information sources. Two accounts belong to individual users, one account is associated with a news organization, and another two accounts have government affiliations. Of the two individual users, one sent out the very first report of the accident through her Weibo account, and the other posted the first request for help from the public after the accident. Both users gained fame for their Weibo exchanges related to this train accident. Two additional accounts, one initiated by a journalist and the other founded by the prominent newspaper for which the journalist worked, were recommended by Sina via its special coverage of the July 23rd railway collision, based on this reporter’s manuscript page and the newspaper coverage itself, both of which were known for their in-depth news coverage. In total, 598 micro-blog exchanges and 201 news reports were collected as the sample for this study.
The concept of knowledge networks was first proposed by Swedish industry, and the relevant research began in the mid-1990s. Beckmann described the knowledge network, from an academic point of view, as the institutions and activities for the production and dissemination of scientific knowledge. Andreas holds that the knowledge network is a social network among participants. Participants at all levels produce and transfer knowledge through this network, thus creating value. These characteristics of the knowledge network make it fit well with the originality, communication and interaction of micro-blogs. Building an enterprise microblogging knowledge network can better uncover the inherent value of enterprise microblogging.
With the development of online social networks, social network applications provide a platform for people to post various news and information. Millions of people record and share their daily lives on micro-blogs. Although there are many kinds of information, people usually need to concentrate on specific information for a given task and find related texts in data generated over a long period. Simple keyword filtering usually produces noisy or incomplete data about the specific information. To detect the specific textual information, we usually divide the data into two types: texts that include the specific information and texts that do not. To obtain more accurate and complete related texts, a text classifier that can evolve is needed, because the context and linguistic features of specific textual information in micro-blogs tend to vary with time. Traditional supervised text classification methods with positive/negative training samples usually process training data in batch mode and cannot capture the time-variance factor in a micro-blog dataset well. How to mine long-term specific information from social network data such as micro-blogs therefore becomes a challenge.
This paper proposes a Hidden Markov Model (HMM) based tokenizer for Chinese micro-blog texts. Compared with normal Chinese texts, micro-blog texts contain more uncertainties. These uncertainties generally arise from bloggers’ irregular usage (such as network words, dialect words, wrongly written characters, and mixtures of foreign words and symbols). The lack of an annotated training corpus is a further restriction in solving this task. Because segmentation of micro-blogs is therefore much more difficult than that of general text, we present an HMM-based segmentation model integrated with pre-correction and post-correction modules. The evaluation results show that the proposed approach can achieve an F-measure of 90.98% on a test set of 5,000 sentences.
introduction of social media as a tool of communication has set the pace at which many activities are carried out. Social networks have attracted a great many users across the world. Twitter has become the second most widely used micro-blog after Facebook, serving as a real-time electronic information-sharing tool. However, numerous user-interface challenges have been identified in the Twitter microblog, ranging from the credibility of the information shared on Twitter to the improvement of icons. This paper applies heuristic evaluation techniques to the Twitter microblog. A web-based e-questionnaire is used to collect data from users at different levels (Novice, Beginner and Professional). Over 100 users filled in the e-questionnaire to report their user experience, which serves as the source of data for this study. According to the weighted score values, Twitter, as one example of a micro-blog, proved to be one of the best social networks; the results also indicate that respondents consider the information on the platform credible. The statistical t-test analysis indicates that the usability evaluation results for the Twitter platform are highly significant.
Important comments on micro-blogs not only reflect the views of users but can also influence the public’s opinion on a particular topic. This paper presents a feature-weighting-based k-Nearest Neighbors method for mining important comments on hot topics on Sina Weibo. With the feature weighting method, each selected feature is assigned a corresponding weight. Comprehensive experimental results demonstrate that the presented method predicts important comments better than the traditional k-Nearest Neighbors method. Furthermore, we show that the presented method significantly outperforms state-of-the-art classifiers.
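One common way to realize feature-weighted k-Nearest Neighbors is to scale each feature's contribution to the distance by its weight. The sketch below takes that route as an assumption. The passage does not say how the weights are derived or combined, and `weighted_knn_predict` is a hypothetical name.

```python
import math
from collections import Counter


def weighted_knn_predict(train_X, train_y, weights, query, k=3):
    """Predict a label for query by majority vote among the k nearest
    training points under a feature-weighted Euclidean distance.

    weights: one non-negative weight per feature; a weight of 0 makes the
    feature irrelevant to the distance.
    """
    def dist(a, b):
        return math.sqrt(sum(w * (x - y) ** 2
                             for w, x, y in zip(weights, a, b)))

    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

With all weights equal to 1 this reduces to ordinary k-Nearest Neighbors, which makes the two methods easy to compare on the same data.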
notated micro-blog data (only 500 sentences of micro-blog data are provided and used by us). Following the annotation adaptation method proposed by Jiang et al. (2009), we train a general-purpose joint word segmentation and part-of-speech tagging model using the People’s Daily corpus. Then, the decoding results of this model are used as features in the final word segmentation model for micro-blog data.
To address the shortcomings of the traditional PageRank algorithm, namely topic drift, even splitting of page weight, and the scarcity of links to new pages, factors such as user behavior, topic-relevance weight, and time are introduced into the PageRank algorithm. An improved PageRank algorithm based on users’ behavior and topic similarity is then proposed and applied to ranking the influence of micro-blog users. Simulation experiments on the collected data show that the improved algorithm yields a better ranking, which demonstrates the feasibility and superiority of the BSPR algorithm.
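The general idea of replacing PageRank's even weight splitting with weighted transitions can be sketched as below. This is a generic weighted PageRank, not the BSPR algorithm itself: how BSPR combines user behavior, topic similarity, and time into a weight is not given in the passage, so the edge weights here are simply assumed inputs, and `weighted_pagerank` is a hypothetical name.

```python
def weighted_pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank where rank flows in proportion to edge weight.

    edges: dict mapping (source, target) -> non-negative weight. The weight
    is an assumed stand-in for a combined behavior/topic/time score.
    Returns a dict node -> rank; ranks sum to 1.
    """
    nodes = sorted({n for edge in edges for n in edge})
    count = len(nodes)
    out_total = {n: 0.0 for n in nodes}
    for (src, _), w in edges.items():
        out_total[src] += w
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing weight is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_total[n] == 0)
        base = (1 - damping) / count + damping * dangling / count
        new_rank = {n: base for n in nodes}
        for (src, dst), w in edges.items():
            if out_total[src] > 0:
                new_rank[dst] += damping * rank[src] * w / out_total[src]
        rank = new_rank
    return rank
```

In a user-influence setting, an edge might run from a follower to the user they interact with, so that heavier interaction passes on a larger share of rank; that direction is also an assumption of this sketch.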