EXTRACTING EMERGING TOPICS BASED ON USER MENTION


N. Bavithra¹, P. Rajesh²

¹,² Computer Engineering Science, Kingston Engineering College (India)

ABSTRACT

Social networks are places where people exchange and share information about current events all over the world. This behavior suggests that processing the shared content can reveal the topics currently of interest to users. Applying a data clustering technique such as a term-frequency-based approach to this content can work reasonably well, but it carries a risk of false positives. We propose a probability model that captures both the normal mentioning behavior of a user and the frequency with which users occur in mentions. The model also works well when the content of the messages is non-textual. The experiments show that the proposed mention-anomaly-based approach can detect new topics at least as early as text-anomaly-based approaches, and in some cases much earlier, when the topic is poorly identified by the textual content of the posts.

Keywords: Change-Point Detection, Anomaly Scores, Mentions

I. INTRODUCTION

Engaging with social media has become an everyday activity for almost everyone on the internet. Social media spread content faster than any other medium. A large amount of content in many formats is scattered across these networks, and it can be used to build an automated news service. Since the information exchanged over social networks includes not only text but also URLs, images, and videos, it is a challenging subject for data mining. We are interested in the problem of detecting emerging topics from social streams, which can be used to create automated “breaking news” or to discover hidden market needs or underground political movements. Compared to other media (news, FM radio, etc.), social media are able to capture the earliest, unedited voice of ordinary people. The challenge is to detect the emergence of a topic as early as possible with a moderate number of false positives. We detect emerging topics from social network streams by monitoring the annotation-like mentioning behavior of users. Our basic assumption is that a new (emerging) topic is something people feel like discussing, commenting on, or forwarding to their friends. Conventional approaches for topic detection have mainly been concerned with the frequencies of (textual) words. A term-frequency-based approach can suffer from the ambiguity caused by synonyms or homonyms. It may also require complicated preprocessing (e.g., segmentation) depending on the target language. Moreover, it cannot be applied when the contents of the messages are mostly non-textual.

On the other hand, the “words” formed by mentions are unique, require little preprocessing to obtain (the information is often separated from the contents), and are available regardless of the nature of the contents.

Volume No.03, Issue No. 12, December 2014 ISSN (online): 2394-1537

We propose a probability model that captures the normal mentioning behavior of a user, which consists of both the number of mentions per post and the frequency of users occurring in the mentions. This model is used to measure the anomaly of future user behavior. Using the proposed probability model, we can quantitatively measure the novelty or possible impact of a post as reflected in the mentioning behavior of the user.

A term-frequency-based approach depends mainly on the frequencies of the (textual) words occurring in social posts. It removes verbs and adjectives and considers only the remaining parts of the post (mainly nouns). The frequency of each remaining word is then calculated and used to extract the topic. The limitation is that a term-frequency-based approach can suffer from the ambiguity caused by synonyms or homonyms. For example, in “good life depends on liver”, the word “liver” may mean the organ or a living person, so the term is ambiguous. The approach also cannot be applied when the contents of the messages are mostly non-textual.
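The term-frequency idea and its ambiguity problem can be illustrated with a minimal sketch. It assumes a small hand-written stop-word list (`STOP_WORDS`, hypothetical) in place of the verb and adjective filtering described in the text, and the function name `term_frequencies` is chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list standing in for part-of-speech filtering.
STOP_WORDS = {"a", "an", "the", "on", "of", "is", "in", "and", "to", "depends"}


def term_frequencies(posts):
    """Count how often each remaining word occurs across all posts."""
    counts = Counter()
    for post in posts:
        tokens = re.findall(r"[a-z]+", post.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


posts = [
    "Good life depends on liver",   # "liver" as a person who lives
    "The liver is an organ",        # "liver" as the organ
]
freq = term_frequencies(posts)
print(freq.most_common(3))
```

Both senses of “liver” are counted under the same key, so the frequency alone cannot tell the two topics apart.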

II. EMERGING TOPICS

2.1 Probability Distribution

We characterize a post in a social network by the number of mentions it contains and the set of users who are mentioned in the post. The joint distribution consists of two parts: the probability of the number of mentions and the probability of each mention given the number of mentions. We also include the document frequency in our probability model, which enhances the detection process. We thus have probability distributions for both user mentions and document frequency.

2.2 Probability Model

This section describes the probability model that we use to capture the normal mentioning behavior of a user and how the model is trained. We characterize a post in a social network stream by the number of mentions k it contains and the set V of names (IDs) of the mentionees (users who are mentioned in the post). The joint distribution then consists of two parts: the probability of the number of mentions k = |V| and the probability of each mention given the number of mentions.

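Step 1: Find the probability of the number of mentions, $p(k \mid \theta)$.

Step 2: $p(k \mid \theta) = (1 - \theta)^{k}\,\theta$

Step 3: Joint probability distribution of the number of mentions and the mentioned users, where $\pi_v$ is the probability of mentioning user v:

$P(k, V \mid \theta, \{\pi_v\}) = p(k \mid \theta) \prod_{v' \in V} \pi_{v'}$

Step 4: Predictive distribution obtained from the training set $T = \{(k_1, V_1), \ldots, (k_n, V_n)\}$:

$P(k, V \mid T) = p(k \mid T) \prod_{v' \in V} p(v' \mid T)$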
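The four steps can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the estimation procedure of the paper: θ is fitted by maximum likelihood (θ = n / (n + Σk) for the geometric form in Step 2), and an unseen mentionee receives a small share of the probability mass through a pseudo-count `gamma`. The class name `MentionModel`, the parameter `gamma`, and the user names are all hypothetical.

```python
from collections import Counter


class MentionModel:
    """Geometric model for the number of mentions times a categorical over mentionees."""

    def __init__(self, gamma=0.5):
        self.gamma = gamma              # assumed pseudo-count for unseen mentionees
        self.n_posts = 0
        self.n_mentions = 0
        self.mention_counts = Counter()

    def fit(self, training_set):
        """training_set: iterable of mentionee lists, one list per post."""
        for mentionees in training_set:
            self.n_posts += 1
            self.n_mentions += len(mentionees)
            self.mention_counts.update(mentionees)
        return self

    @property
    def theta(self):
        # Maximum-likelihood estimate for p(k) = (1 - theta)^k * theta.
        return self.n_posts / (self.n_posts + self.n_mentions)

    def p_k(self, k):
        """Probability of a post containing k mentions."""
        return (1 - self.theta) ** k * self.theta

    def p_v(self, v):
        """Probability of mentioning user v; unseen users get gamma / (m + gamma)."""
        total = self.n_mentions + self.gamma
        count = self.mention_counts.get(v, 0)
        return (count if count else self.gamma) / total

    def p_joint(self, mentionees):
        """P(k, V | T) = p(k | T) * product of p(v | T) over the mentionees."""
        p = self.p_k(len(mentionees))
        for v in mentionees:
            p *= self.p_v(v)
        return p


training = [["alice"], ["alice", "bob"], [], ["alice"]]
model = MentionModel().fit(training)
print(model.p_k(1), model.p_v("alice"), model.p_v("carol"))
```

A frequently mentioned user such as "alice" receives a high probability, while a user never seen in training ("carol") receives a small but non-zero one, which is what later makes unusual mentions score as anomalous.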

III. DERIVING LINK-ANOMALY SCORE

We compute the link-anomaly score for each post separately. The anomaly score measures how far the mentioning behavior in a post deviates from the user's normal mentioning behavior. The link-anomaly score is also used to determine whether the comments are related to the post. Accordingly, the link-anomaly score is computed by the following steps.

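Step 1: Compute the anomaly score of a new post $x = (t, u, k, V)$, where t is the time of the post, u is the user who posted it, k is the number of mentions, and V is the set of mentioned users.

Step 2: Find $s(x)$, where $T_u^{(t)}$ denotes the training set for user u at time t:

$s(x) = -\log\left( p(k \mid T_u^{(t)}) \prod_{v \in V} p(v \mid T_u^{(t)}) \right)$

$= -\log p(k \mid T_u^{(t)}) - \sum_{v \in V} \log p(v \mid T_u^{(t)})$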

Step 3: Compute the anomaly score using the training set, which contains both the number of mentions and the mentioned users of each post.

Step 4: Finally, aggregate the anomaly scores obtained for the posts.
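The scoring and aggregation steps can be sketched as below. The probabilities are passed in directly so that the example stays self-contained; the aggregation rule (the mean score of the posts falling in each time window of width `window`) is an assumption, since the text does not specify how scores are aggregated. The function names and the numeric values are illustrative only.

```python
import math


def link_anomaly_score(k_prob, mention_probs):
    """s(x) = -log p(k | T) - sum over mentionees of log p(v | T)."""
    return -math.log(k_prob) - sum(math.log(p) for p in mention_probs)


def aggregate_scores(timed_scores, window):
    """Mean anomaly score per time window (assumed aggregation rule).

    timed_scores: iterable of (time, score) pairs; window: width in time units.
    """
    buckets = {}
    for t, s in timed_scores:
        buckets.setdefault(int(t // window), []).append(s)
    return {j: sum(v) / len(v) for j, v in sorted(buckets.items())}


# A post with one mention of a familiar user versus a post with three
# mentions of rarely mentioned users (illustrative probabilities).
usual = link_anomaly_score(0.25, [0.6])
unusual = link_anomaly_score(0.03, [0.05, 0.05, 0.05])
print(usual, unusual)

aggregated = aggregate_scores([(0, 1.0), (30, 3.0), (70, 5.0)], window=60)
print(aggregated)
```

The unusual post receives the higher score because both its number of mentions and its mentionees are improbable under the user's history.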

IV. CHANGE-POINT DETECTION

Change-point detection detects a change in the statistical dependence structure of a time series by monitoring the compressibility of a new piece of data. It uses a sequential version of normalized maximum-likelihood (NML) coding called SDNML coding. A change point is detected through two layers of scoring processes: the first layer detects outliers and the second layer detects change points. Both outlier detection and change-point detection from a data stream have attracted increasing interest in data mining, since the former is related to fraud detection, rare event discovery, etc., while the latter is related to event/trend change detection, activity monitoring, etc. It is especially important to consider the situation where the data source is non-stationary, since the nature of the data source may change over time in real applications. Although most previous work has not related outlier detection and change-point detection explicitly, the two can be handled with the same scoring scheme. The score for any given data point is calculated to measure its deviation from the learned model, with a higher score indicating a higher possibility of being an outlier. Change points in a data stream are then detected by applying this scoring method to a time series of moving-averaged losses for prediction using the learned model. Anomaly detection is therefore executed in two layers:

Step 1: The first layer detects outliers.

Step 2: The second layer detects change points.

V. OUTLIER DETECTION

In the outlier detection phase, the system learns from the collection of anomaly scores and forms an SDNML density function. It then computes the intermediate change-point score by smoothing the log loss of the SDNML density function.

VI. CHANGE POINTS

Using the above function, we obtain a collection of smoothed change-point scores. The SDNML learning process continues on this collection of smoothed change-point scores. Finally, we compute the final change-point score by smoothing the log loss of the SDNML density function as follows:

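$\mathrm{Score}(y_j) = \frac{1}{\kappa} \sum_{j' = j - \kappa + 1}^{j} \left( -\log p_{\mathrm{SDNML}}(y_{j'} \mid y^{j'-1}) \right)$

where $\kappa$ is the size of the smoothing window.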
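The two-layer procedure (log loss, smoothing, log loss again, smoothing again) can be sketched as follows. SDNML coding itself is not implemented here: as a plainly named substitute, each point is scored with a sequentially updated Gaussian density, which is an assumption made only to keep the example short. The parameters `discount`, `min_var`, and `kappa` and their default values are hypothetical.

```python
import math


def sequential_log_loss(series, discount=0.1, min_var=1e-3):
    """Log loss -log p(x_j | x^{j-1}) under a sequentially updated Gaussian.

    The Gaussian is a stand-in for the SDNML density, not SDNML coding.
    """
    mean, var = series[0], 1.0
    losses = []
    for x in series:
        # Score the point before it is used to update the model.
        losses.append(0.5 * math.log(2 * math.pi * var) + (x - mean) ** 2 / (2 * var))
        # Exponentially discounted updates of mean and variance.
        mean = (1 - discount) * mean + discount * x
        var = max((1 - discount) * var + discount * (x - mean) ** 2, min_var)
    return losses


def smooth(values, kappa):
    """Moving average over the last kappa values (shorter at the start)."""
    out = []
    for j in range(len(values)):
        window = values[max(0, j - kappa + 1): j + 1]
        out.append(sum(window) / len(window))
    return out


def change_point_scores(series, kappa=3):
    """Layer 1: smoothed outlier scores. Layer 2: smoothed loss on those scores."""
    first_layer = smooth(sequential_log_loss(series), kappa)
    return smooth(sequential_log_loss(first_layer), kappa)


# Aggregated anomaly scores that jump from a low to a high level at index 30.
series = [1.0] * 30 + [8.0] * 30
scores = change_point_scores(series)
peak = max(range(len(scores)), key=scores.__getitem__)
print(peak)
```

In this synthetic series the largest final score falls within a few steps of the level shift, the delay being a consequence of the moving-average window.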

VII. DYNAMIC THRESHOLD OPTIMIZATION (DTO)

As a final step in our method, we need to convert the change-point scores into binary alarms by thresholding.

A binary alarm is a true/false indication of whether a topic is emerging. Since the distribution of change-point scores may change over time, we need to adjust the threshold dynamically in order to analyze a sequence over a long period of time. Based on the score generated for each topic, the binary alarm identifies the emerging topics.
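One way to adjust the threshold dynamically is sketched below. It is a simplified illustration, not necessarily the exact DTO procedure: a histogram of past scores is kept with exponential discounting, and the threshold is the upper edge of the first bin at which the cumulative mass reaches 1 - `rho`. The score range (`lo`, `hi`), the number of bins, `rho`, and `discount` are assumed parameters.

```python
def dynamic_threshold_alarms(scores, lo, hi, n_bins=20, rho=0.05, discount=0.05):
    """Turn change-point scores into binary alarms with an adaptive threshold."""
    width = (hi - lo) / n_bins
    hist = [1.0 / n_bins] * n_bins      # start from a uniform histogram
    alarms = []
    for score in scores:
        # Threshold: upper edge of the first bin where cumulative mass >= 1 - rho.
        cumulative, threshold = 0.0, hi
        for h, mass in enumerate(hist):
            cumulative += mass
            if cumulative >= 1.0 - rho:
                threshold = lo + (h + 1) * width
                break
        alarms.append(score >= threshold)
        # Discount old mass, then add the new score to its (clamped) bin.
        idx = min(max(int((score - lo) / width), 0), n_bins - 1)
        hist = [(1 - discount) * m for m in hist]
        hist[idx] += discount
    return alarms


# Fifty ordinary scores followed by one unusually high score.
alarms = dynamic_threshold_alarms([1.0] * 50 + [9.0], lo=0.0, hi=10.0)
print(alarms[-1], any(alarms[:50]))
```

Because the histogram forgets old scores gradually, the threshold drifts toward the recent score level, so a long quiet period lowers the score needed to raise an alarm.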

VIII. SYSTEM ARCHITECTURE

The system architecture diagram graphically models the applications of the system, the external entities they interface with, and the data stores that they use or provide information to. Posts are extracted from the social database; the text-frequency score and the link-anomaly score are computed and merged into a combined score. Change-point detection is then applied to find and extract the emerging topic.

IX. RESULTS AND DISCUSSIONS

9.1 Comparison with the Existing System

Fig. 9.1 shows the results of link-anomaly-based change detection. The link-anomaly-based method raised its alarm at 20:01, whereas the text-anomaly-based counterpart raised its alarm at 22:37, so the link-anomaly-based method is much earlier. The link-anomaly-based method finds only the first area, whereas the text-anomaly-based method and the keyword-frequency-based method find only the second area. This is probably because there was an initial stage in which people reacted individually using different words, followed by a later stage in which the keywords were more unified.


X. CONCLUSION

In this paper we are interested in detecting emerging topics from social network streams by monitoring the mentioning behavior of users. Our basic assumption is that a new (emerging) topic is something people feel like discussing, commenting on, or forwarding to their friends. Conventional approaches for topic detection have mainly been concerned with the frequencies of (textual) words. A term-frequency-based approach can suffer from the ambiguity caused by synonyms or homonyms. It may also require complicated preprocessing (e.g., segmentation) depending on the target language. We have put forward a probability model that captures both the number of mentions per post and the frequency of each mentionee, and this model is used to detect the emergence of topics in a social network stream. Text-frequency-based methods instead determine how many times each word is repeated and consider the most frequently repeated words. We combined the proposed mention model with the SDNML change-point detection algorithm to pinpoint the emergence of a topic. The link-anomaly-based approach detected the emergence of the topic even earlier than the keyword-based approach that uses hand-chosen keywords. Combining the text-anomaly-based and link-anomaly-based approaches is expected to be even more effective.

XI. FUTURE WORK

The analysis in the existing system is conducted offline, but the approach can also be applied online. We are planning to scale up the proposed approach to handle social streams in real time. For the online implementation, IIS (Internet Information Services) is used to connect one or more systems. When more systems are connected, users can easily share information by commenting on posts, which makes the emerging topic easier to recognize. Internet Information Services is also used to enhance security. It would also be interesting to combine the proposed link-anomaly model with text-based approaches, because the link-anomaly model does not immediately tell what the anomaly is. Combining the word-based approach with the link-anomaly model would benefit from both the performance of the mention model and the intuitiveness of the word-based approach. The idea of extracting emerging topics is to make the social network more informative to the user. The approach can also be applied where the topics concern information other than text, such as images. In the combined method, the link-anomaly-based part supplies the probabilities of the user and the mentionees, while the text-anomaly-based part recognizes the terms: a term-frequency algorithm discards verbs and adjectives and considers only the nouns in each post. Topic detection should therefore be more accurate when text frequency is also taken into account.
