Authorship Analysis

Software Forensics: Extending Authorship Analysis Techniques to Computer Programs

Many of the structural type metrics can be obtained, perhaps with modifications to definitions, from the software metrics literature. Software metric definitions, and also extraction tools, are available for such aspects of computer programs as complexity, comprehensibility, the degree of reuse made from other code, and various measures of size. The customary uses of these metrics are in managing the software development process, but many are transferable to authorship analysis. In any case, the fundamental concepts that have emerged within the field of software metrics are very useful as starting points for defining authorship metrics. In addition, the metrics extracted from source code can often be similar, or even identical, to stylistic tests used in computational linguistics, especially where sufficient quantities of comments are available.
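
As an illustration of the kind of metrics meant here, the sketch below computes a few size, comprehensibility and complexity proxies from raw source text. It is a minimal, assumption-laden example (the metric names, the comment convention and the branching keywords are choices made for the sketch), not the authors' actual metric suite.

```python
# Minimal sketch: simple structural metrics of the kind borrowed from the
# software-metrics literature, computed per source file. The specific metric
# definitions below are illustrative assumptions, not the paper's tooling.
import re

def structural_metrics(source: str) -> dict:
    lines = source.splitlines()
    code_lines = [l for l in lines if l.strip() and not l.lstrip().startswith("#")]
    comment_lines = [l for l in lines if l.lstrip().startswith("#")]
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    return {
        "loc": len(code_lines),                                    # size
        "comment_ratio": len(comment_lines) / max(len(lines), 1),  # comprehensibility proxy
        "mean_identifier_len": sum(map(len, identifiers)) / max(len(identifiers), 1),
        "mean_line_len": sum(map(len, code_lines)) / max(len(code_lines), 1),
        "branching": sum(source.count(k) for k in ("if ", "for ", "while ")),  # crude complexity proxy
    }

if __name__ == "__main__":
    sample = "def f(x):\n    # double the input\n    if x > 0:\n        return 2 * x\n    return 0\n"
    print(structural_metrics(sample))
```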

Authorship Analysis and Identification Techniques: A Review

A huge amount of data and text is available over the internet in the form of blogs, emails, digital contracts, books and many more sources, so correctly establishing the identity behind this data is difficult, and that anonymity can facilitate cybercrime. The problem of anonymity in online communication is addressed by applying authorship analysis techniques. In recent years, a lot of research has been carried out on the analysis and identification of the owner of data content, and there are many approaches to authorship analysis. The research described in [18] covers frequent patterns based on the writeprint, capturing stylistic variation, analysis based on different training sample sizes, presentable evidence, removing burden from the investigator, clustering by stylometric features for varying training samples, and authorship characterization.

How To Write An Article On Authorship Analysis

The experimental results for 10, 25 and 50 authors on the benchmark Enron dataset are given in Fig. 3. The results clearly depict that the model CEAI achieves the highest accuracy for the email authorship identification task on 10, 25 and 50 authors. We present the summary of all experiment types in Table 2. In each experiment, various feature sets are incorporated step by step, starting with the baseline stylometric features, which are traditionally used for an authorship analysis task. Afterwards, extended features, which slightly improved the accuracy of the task, are used. Info Gain based content features are then added, which keeps only the content features that have the capability of distinguishing each author. The original number of content features is too large, and the Info Gain based content feature selection maintains the balance between dimensionality reduction and accuracy of the task. The accuracy of the task increased greatly with the addition of these features. The first four experiments are performed using SVM and the last experiment has been conducted using CCM on the final set of features, which again improves the accuracy of the task. All five types of experiments have been performed on the authors' constructed real email dataset of seven authors. Analogous effects of all types of experiments have been observed on this dataset, though somewhat lower accuracy has been achieved, due to the fact that for this new dataset we have a smaller amount of emails for each author as compared to the Enron dataset.
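
As a rough illustration of the feature-selection-plus-SVM stage described above, the sketch below approximates Info Gain with scikit-learn's mutual information criterion (a closely related measure) and trains a linear SVM. The toy emails, the value of k, and the omission of the extended stylometric features and the CCM stage are all assumptions of the sketch, not the paper's pipeline.

```python
# Hedged sketch of content-feature selection followed by an SVM classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

emails = ["meeting moved to friday, please confirm",
          "attached the quarterly numbers for review",
          "can we sync on the draft later today",
          "final report attached, numbers look fine"]
authors = ["alice", "bob", "alice", "bob"]

pipeline = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),  # word unigram/bigram content features
    SelectKBest(mutual_info_classif, k=20),                # keep only the most discriminative features
    LinearSVC(),                                           # SVM, as in the first four experiments
)
pipeline.fit(emails, authors)
print(pipeline.predict(["please confirm the meeting draft"]))
```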

AUTHORSHIP ANALYSIS FOR REGIONAL LANGUAGES USING MACHINE LEARNING APPROACH

A framework for authorship identification in online messages was developed to deal with the identity-tracing problem. In this framework, four types of writing-style features (lexical, syntactic, structural and content-specific features) are extracted from English and Chinese online-newsgroup messages. A comparison was made between three classification techniques: decision tree, SVM and back-propagation neural networks. Experimental results showed that this framework is able to identify authors with a satisfactory accuracy of 70 to 95%, and that the SVM classifier outperformed the other two.
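
A minimal sketch of that comparison, using a toy corpus rather than the English and Chinese newsgroup data: the same bag-of-words features are fed to a decision tree, an SVM and a small back-propagation network, and cross-validated accuracy is reported. Feature choice and hyperparameters are illustrative assumptions.

```python
# Hedged sketch: comparing the three classifier families on toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

messages = ["price list attached as requested", "see you at the meetup tonight",
            "updated price quote for the order", "meetup moved to the cafe",
            "invoice and price breakdown inside", "who else is coming tonight"]
authors = ["a", "b", "a", "b", "a", "b"]

for name, clf in [("decision tree", DecisionTreeClassifier()),
                  ("SVM", SVC(kernel="linear")),
                  ("backprop NN", MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000))]:
    model = make_pipeline(CountVectorizer(), clf)   # simple lexical features for all three
    scores = cross_val_score(model, messages, authors, cv=3)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```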

Research on China's tourism: A 35-year review and authorship analysis

Third, the selected articles were classified into dominant thematic categories, following the approach proposed by Miles and Huberman (1994), which emphasized three flows of analytical activity: data reduction, data display and verification of the data. At the data reduction stage, a ‘word count’ technique was adopted. A content analysis of each journal title was completed, and then the full paper was accessed to classify it into a category, as established through the word count. On the basis of the reduction work, it was initially proposed that there were three broad theme categories: tourist markets (demand), tourism industry development and promotion (supply), and policy and tourism impacts (external environment) (Cook et al., 2010). Subcategories were defined as research topics. This methodological approach allowed tabulation and visualization of data at the early stage of the research, and initial tentative formulation of the prevalent themes and emerging trends in research topics. To refine the setting of topic subcategories, abstracts, first paragraphs and as much text from relevant sections as needed were read to place articles into the appropriate subcategories (Crawford and McCleary, 1992). This allowed further development of the classification of subcategories and, consequently, verification of the findings.
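
To make the ‘word count’ step concrete, the toy sketch below scores a journal-article title against keyword lists for the three broad categories; the keyword lists are invented for the sketch and are not the authors' actual coding scheme.

```python
# Hedged sketch of a keyword-based 'word count' pass over article titles.
from collections import Counter

CATEGORY_KEYWORDS = {
    "tourist markets (demand)": {"tourist", "demand", "visitor", "motivation"},
    "industry development and promotion (supply)": {"hotel", "marketing", "development", "promotion"},
    "policy and tourism impacts (external environment)": {"policy", "impact", "environment", "government"},
}

def classify_title(title: str) -> str:
    words = Counter(title.lower().split())
    scores = {cat: sum(words[w] for w in kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified (read full text)"

print(classify_title("Government policy and the impact of tourism in rural China"))
```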

Sample Size in Arabic Authorship Verification

Authorship Verification aims at identifying whether a document of questionable authorship is created by a specific author, given a number of documents known to have been written by that author. This type of authorship analysis uses feature engineering of feature sets extracted from large documents. Given the nonlinear morphology and flexible syntax of Arabic, feature extraction in large Arabic texts requires complex preprocessing. The requirement of large training and testing documents is also impractical for domains where large documents are available in print, given the scarcity of reliable Arabic OCR. This problem is approached by investigating the effectiveness of using an author profiling-based approach on a small set of shorter documents. The findings show that it is possible to outperform the state-of-the-art authorship verification method by using a small set of training documents. It is also found that an increase in the size of the training or testing corpus does not correlate with improving the accuracy of the authorship verification method.
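
A minimal sketch of a profile-based verification decision in the spirit of this approach (not the authors' implementation): the small set of known documents is merged into one author profile, and the questioned document is accepted if its character n-gram similarity to that profile clears a threshold. The n-gram range and threshold value are assumptions of the sketch.

```python
# Hedged sketch of profile-based authorship verification on short documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def verify(known_docs, questioned, threshold=0.3):
    profile = " ".join(known_docs)  # author profile built from a small set of short texts
    vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
    X = vec.fit_transform([profile, questioned])
    score = cosine_similarity(X[0], X[1])[0, 0]
    return bool(score >= threshold)

known = ["short known text by the author", "another brief sample of the same hand"]
print(verify(known, "a brief questioned text of the same hand"))
```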

Authorship Attribution in Health Forums

With the emergence of user-written Web content, authorship analysis is often done on online messages (Zheng et al., 2006; Narayanan et al., 2012). Large numbers of candidate authors, small volumes of training and test texts, and the short length of messages make online authorship analysis exceptionally challenging (Juola, 2006; Koppel, 2009; Luyckx and Daelemans, 2008; Madigan et al., 2005; Stamatatos, 2009). In Koppel et al. (2006), 10,000 blogs were used in the task of author attribution. The test data was built from 500-word snippets, one for each author.

Authorship Verification, Average Similarity Analysis

Authorship analysis is an important task for different text applications, for example in the field of digital forensic text analysis. Hence, we propose an authorship analysis method that compares the average similarity of a text of unknown authorship with all the texts of an author. Using this idea, a text that was not written by an author would not exceed the average similarity with the known texts, and a text of unknown authorship is considered to have been written by the author only if it exceeds the average similarity obtained between texts written by that author. The experiments were carried out using the data provided in the PAN 2014 competition for Spanish articles for the authorship verification task. We performed experiments using different similarity functions and 17 linguistic features, and we analyzed the results obtained with each function-feature pair against the baseline of the competition. Additionally, we introduce a text filtering phase that deletes all the sample texts of an author that are more similar to the samples of another author, with the idea of reducing confusion or non-representative texts, and finally we ran new experiments to compare the results with those obtained without filtering.
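
The decision rule itself is easy to state in code. The sketch below implements it directly: attribute the questioned text only if its average similarity to the known texts exceeds the average similarity measured between the known texts themselves. Cosine similarity over character trigrams stands in for the various similarity functions and 17 linguistic features studied in the paper, so the details are assumptions of the sketch.

```python
# Hedged sketch of the average-similarity verification rule.
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def average_similarity_verdict(known_texts, questioned):
    vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
    sims = cosine_similarity(vec.fit_transform(known_texts + [questioned]))
    n = len(known_texts)
    pairs = list(combinations(range(n), 2))
    known_avg = sum(sims[i, j] for i, j in pairs) / max(len(pairs), 1)  # average similarity among known texts
    quest_avg = sims[n, :n].mean()                                      # average similarity of questioned text to known texts
    return quest_avg > known_avg

known = ["primer texto conocido del autor", "segundo texto conocido del mismo autor"]
print(average_similarity_verdict(known, "texto de autoria desconocida del mismo autor"))
```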

Cross Language Authorship Attribution

Authorship Attribution (AA), the task of identifying the author of an anonymous text, has a long history and various methods dealing with this task have been proposed (Stamatatos, 2009). However, none of the proposed methods found in the literature considers the scenario where the same person writes documents in different languages. Nowadays, with the fast growth of the web, users tend to participate in various online communities irrespective of the language. Focusing on social media, a researcher from Spain may have a blog in Spanish, a Twitter account in English, and publish research papers in both languages. At the same time, various novelists write in more than one language. As an example, the Russian-American novelist Vladimir Nabokov wrote in both English and Russian, and the Irish writer Samuel Beckett wrote in English and French. Thus, we foresee a substantial need for reliable methods for cross-language authorship analysis.

Mining Authorship

Perhaps the most extensive and comprehensive application of authorship analysis is in literature and in published articles. Several studies attempting to resolve Shakespeare's works date back many years (see, for example, [4]). In one of these studies, attempts were made to show that Shakespeare was a hoax and that the real author was Edward de Vere, the Earl of Oxford [6]. Specific author features such as unusual diction, frequency of certain words, choice of rhymes, and habits of hyphenation have been used as tests for author attribution. An important kind of evidence that can be used to establish authorship is linguistic evidence, that is, distinctive language habits that are sufficiently unique to identify the author. It is thought that such linguistic evidence is generated dynamically and subconsciously when language is created, similar to the generation of utterances during speech composition and production [4]. Language patterns or sub-stylistic features are generated beyond an author's conscious control. An example of such features is short, all-purpose words (called "function words") such as "the", "if", "to", etc., whose frequency or relative frequency of usage is unaffected by the subject matter. Therefore, a combination of sub-stylistic features may be sufficient to uniquely authenticate an author. Bosch et al. used a small set of function words for the classification of two authors involved in the authorship of the Federalist Papers articles [2].
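
A small sketch of the function-word evidence discussed above: relative frequencies of a handful of all-purpose words, which are largely independent of subject matter. The word list is a tiny illustrative subset, not the feature set used by Bosch et al. for the Federalist Papers.

```python
# Hedged sketch: relative frequencies of function words as author features.
import re

FUNCTION_WORDS = ["the", "if", "to", "of", "and", "in", "by", "upon", "on", "that"]

def function_word_profile(text: str) -> dict:
    tokens = re.findall(r"[a-z']+", text.lower())
    total = max(len(tokens), 1)
    # frequency of each function word relative to the total token count
    return {w: tokens.count(w) / total for w in FUNCTION_WORDS}

print(function_word_profile("The powers delegated by the proposed Constitution to the federal government are few."))
```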

Bias Analysis and Mitigation in the Evaluation of Authorship Verification

Authorship verification is a young task in the field of authorship analysis. Proposed by Koppel and Schler (2004), and mostly solved on book-sized texts right away, it remains a challenging task on short texts. The numerous verification approaches developed over the years employ a wide array of features, methods, and corpora (Stamatatos, 2009), rendering a comparison between approaches difficult. A dedicated shared task series at PAN (Stamatatos et al., 2015, 2014; Juola and Stamatatos, 2013; Argamon and Juola, 2011) was a key enabler for comparability and reproducibility. The verifiers submitted by Bagnall (2015), Fréry et al. (2014), and Modaresi and Gross (2014) form the state of the art. While new verifiers are run against the shared task's data to assess their performance against these baselines (e.g., Halvani et al., 2017; Kocher and Savoy, 2017), PAN continues to develop new benchmarks on closely related tasks.

Linguistic Profiling for Authorship Recognition and Verification

Linguistic profiling has certainly shown its worth for authorship recognition and verification. At the best settings found so far, a profiling system using a combination of lexical and syntactic features is able to select the correct author for 97% of the texts in the test corpus. It is also able to perform the verification task in such a way that it rejects no texts that should be accepted, while accepting only 8.1% of the texts that should be rejected. Using additional knowledge about the test corpus can improve this to 100% and 2.4%. The next step in the investigation of linguistic profiling for this task should be a more exhaustive charting of the parameter space, and especially the search for an automatic parameter selection procedure. Another avenue of future research is the inclusion of even more types of features. Here, however, it would be useful to define an even harder verification task, as the current system already scores very high and further improvements might be hard to measure. With the current corpus, the task might be made harder by limiting the size of the test texts. Other corpora might also serve to provide more obstinate data, although it must be said that the current test corpus was already designed specifically for this purpose. Use of further corpora will also help with parameter space charting, as they will show the similarities and/or differences in behaviour between data sets. Finally, with the right types of corpora, the worth of the technique for actual application scenarios could be investigated.

Function Words for Chinese Authorship Attribution

This study has several limitations that need to be addressed in future work. First, the data set is small and not quite balanced. More authors and works will be added in the future. Second, the random seed for EM is set to the default value 100 in Weka. However, the EM clustering result may vary to some extent with different random seeds. A more rigorous design is needed for robust performance comparison. One design is to run each clustering experiment multiple times, each time with a different random seed. The clustering accuracy will be averaged over all runs. This new design will allow for performance comparison based on paired-sample t-test significance. Third, the Cultural Revolution time period is excluded from this study due to strong political influence on writers. One reviewer pointed out that this time period should be valuable for examining the relationship between authorship, genre, and time period. Relevant data will be collected in a future study.
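
The proposed multi-seed design is straightforward to prototype. In the sketch below, GaussianMixture stands in for Weka's EM, the adjusted Rand index stands in for clustering accuracy, and two arbitrary configurations are compared with a paired-sample t-test; the data, seeds and configurations are assumptions of the sketch, not the study's setup.

```python
# Hedged sketch: EM-style clustering over several seeds, averaged score,
# and a paired-sample t-test between two configurations.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

X, y = make_blobs(n_samples=200, centers=3, cluster_std=2.5, random_state=0)

def score_over_seeds(X, y, covariance_type, seeds=range(10)):
    scores = []
    for seed in seeds:
        labels = GaussianMixture(n_components=3, covariance_type=covariance_type,
                                 init_params="random", random_state=seed).fit_predict(X)
        scores.append(adjusted_rand_score(y, labels))   # clustering quality for this run
    return np.array(scores)

a = score_over_seeds(X, y, "full")
b = score_over_seeds(X, y, "diag")
print(f"full: {a.mean():.3f} +/- {a.std():.3f}   diag: {b.mean():.3f} +/- {b.std():.3f}")
print("paired-sample t-test:", ttest_rel(a, b))
```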

Authorship of Federalist Papers v2.pptx

words strongly suggested that Madison was the author of 11 of the 12 disputed Federalist Papers. The exception was paper #55, and there the

Detection of Fraudulent Emails by Authorship Extraction

We presented a novel method of identifying email authorship using RBF patterns of data. The training data has been collected by averaging the frequencies of words used by each person and fixing a target value for that person. A testing pattern has been created by modifying the existing contents of an email. A new word has been considered while testing: if the new word does not fit into the patterns used for training, that word is excluded from testing. As we do not know to which author the email belongs, all the training patterns are treated as test patterns after adding the frequencies of the new mail. Since 146 authors are considered, 146 outputs are obtained after testing.
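
A minimal sketch of the pattern construction described above (not the authors' RBF network): each author's training pattern is the average of the word frequencies of that author's emails, words unseen in training are ignored at test time, and one output score per author is produced, here via a Gaussian (RBF) similarity. The toy vocabulary, authors and gamma value are assumptions of the sketch.

```python
# Hedged sketch: averaged word-frequency patterns per author, RBF-style scoring.
from collections import Counter
import math

def freq_vector(text, vocab):
    counts = Counter(text.lower().split())
    total = max(sum(counts[w] for w in vocab), 1)
    return [counts[w] / total for w in vocab]

train = {"author_1": ["please send the report today", "send the files please"],
         "author_2": ["meeting tomorrow at noon", "lunch meeting moved to noon"]}

vocab = sorted({w for mails in train.values() for m in mails for w in m.lower().split()})

# training pattern per author = averaged frequencies over that author's emails
patterns = {a: [sum(col) / len(mails) for col in zip(*(freq_vector(m, vocab) for m in mails))]
            for a, mails in train.items()}

def rbf_outputs(test_mail, gamma=5.0):
    v = freq_vector(test_mail, vocab)  # words outside the training vocabulary are ignored
    return {a: math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(v, p)))
            for a, p in patterns.items()}

print(rbf_outputs("please send the report"))  # one output score per author
```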

Authorship Attribution with Latent Dirichlet Allocation

Court from 1913 to 1975: Dixon, McTiernan and Rich (available for download from www.csse.monash.edu.au/research/umnl/data). In this paper, we considered the Dixon/McTiernan and the Dixon/Rich binary classification cases, using judgements from non-overlapping periods (Dixon's 1929–1964 judgements, McTiernan's 1965–1975, and Rich's 1913–1928). We removed numbers from the texts to ensure that dates could not be used to discriminate between judges. We also removed quotes to ensure that the classifiers take into account only the actual author's language use. Employing this dataset in our experiments allows us to test our methods on formal texts with a minimal amount of noise. The IMDb62 dataset contains 62,000 movie reviews by 62 prolific users of the Internet Movie Database (IMDb, www.imdb.com, available upon request from the authors of (Seroussi et al., 2010)). Each user wrote 1,000 reviews. This dataset is noisier than the Judgement dataset, since it may contain spelling and grammatical errors, and the reviews are not as professionally edited as judgements. This dataset allows us to test our approach in a setting where all the texts have similar themes, and the number of authors is relatively small, but is already much larger than the number of authors considered in traditional authorship attribution settings.
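
A small sketch of the preprocessing described for the Judgement dataset, as an assumption-laden approximation rather than the authors' scripts: numbers are stripped so that dates cannot separate the judges, and quoted passages are removed so that only the actual author's language remains.

```python
# Hedged sketch of the number- and quote-removal preprocessing step.
import re

def preprocess_judgement(text: str) -> str:
    text = re.sub(r'"[^"]*"', " ", text)   # drop double-quoted passages (not the author's own words)
    text = re.sub(r"\d+", " ", text)       # drop numbers, including dates
    return re.sub(r"\s+", " ", text).strip()

sample = 'As held in 1954, "the principle is settled", and the appeal fails.'
print(preprocess_judgement(sample))  # -> 'As held in , , and the appeal fails.'
```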

Resolving authorship disputes by mediation and arbitration

People have used creative ways to spread the benefit of receiving key authorship credits. An increasingly common practice is to use author notes to designate equal contributions [32–34]. The record for the greatest number of "equally contributing" authors is unknown, but a cursory search of recent issues of journals quickly found a paper with seven authors (out of 44) listed as having made equal contributions, and none were first author [35]. One article with four authors designated that all contributed equally (creating the linguistic puzzle of whether they should be called "co-first" authors or "co-senior" authors), and listed all as corresponding authors [36], making all three designations effectively meaningless. Journals have generally not adopted policies or guidelines for equal contribution statements [33], nor are there general practices for handling such notes in research evaluations [34]. Contribution notes notwithstanding, the first author's name becomes the most associated with the paper because many journals' citations in the body of the text list only the first author.

Authorship productivity trends in Journal of Travel Medicine (2001-2013): collaboration scientometric analysis

Table 6.6 presents the results for the co-authorship index (CAI). It is observed that the value of CAI is highest for papers with more than six authors and lowest for papers with six authors, which indicates that collaborative research is increasing in the Journal of Travel Medicine. With regard to multi-authored publications (those with more than a single author), the co-authorship index has shown a fluctuating trend.
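
The excerpt does not restate how CAI is computed; the sketch below uses the commonly cited Garg and Padhi (1999) definition, CAI = ((Nij/Nio) / (Noj/Noo)) x 100, where Nij is the number of papers with authorship pattern j in period i, and the small table of counts is invented for illustration, not taken from the journal data.

```python
# Hedged sketch: co-authorship index per period and authorship pattern.
def cai(counts):
    total = sum(sum(row.values()) for row in counts.values())                                   # Noo
    col_totals = {j: sum(counts[i][j] for i in counts) for j in next(iter(counts.values()))}    # Noj
    result = {}
    for period, row in counts.items():
        row_total = sum(row.values())                                                           # Nio
        result[period] = {j: round((row[j] / row_total) / (col_totals[j] / total) * 100, 1)
                          for j in row}                                                         # Nij -> CAI
    return result

counts = {  # papers per period, by number of authors (illustrative numbers)
    "2001-2004": {"single": 30, "2-5 authors": 50, "6+ authors": 10},
    "2010-2013": {"single": 15, "2-5 authors": 60, "6+ authors": 35},
}
print(cai(counts))  # CAI > 100 marks above-average co-authorship of that type in that period
```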

Sisterhood Relationships and Self-authorship

Riessman (1993) discussed the human representation of experience in her book. She simplified the process of meaning making by breaking it into five parts. The first component is Attending to the Experience, in which a person first experiences a place or event with the basic five senses. Each person notices different things in the same location and experiences the environment differently. Second, a person Tells about the Experience. People begin to construct meaning as they use language to describe the location. The person explaining the location or event will also begin to construct their identity through the story. In the third step researchers will Transcribe the Experience through audio, video, and written transcripts of the dialogue. Researchers make choices about how to transcribe, which also change the interpretation. In Analyzing the Experience, a metastory is created by writing, editing, and summarizing the original experience or location. The researcher alters the experience through his or her own interpretations, which are politically and morally biased. The fifth and final representation is Reading Experience. Critics of narrative analysis question if there can be a true representation of anything because of the multiple levels of interpretation. Consumers of this research interpret data when they read it and process it through their lens.

Authorship Attribution Using Text Distortion

features should not be affected by shifts in topic or genre variations and they should only depend on the personal style of the authors. However, it is not yet clear how the topic/genre factor can be separated from the personal writing style. Function words (i.e., prepositions, articles, etc.) and lexical richness features are not immune to topic shifts (Mikros and Argiri, 2007). In addition, character n-grams, the most effective type of features in authorship attribution as demonstrated in multiple studies (Grieve, 2007; Stamatatos, 2007; Luyckx and Daelemans, 2008; Escalante et al., 2011) including cross-topic conditions (Stamatatos, 2013; Sapkota et al., 2015), unavoidably capture information related to the theme and genre of texts. Features of a higher level of analysis, including measures related to syntactic or semantic analysis of texts, are too noisy and less effective, and can be used as a complement to other more powerful low-level features (van Halteren, 2004; Argamon et al., 2007; Hedegaard and Simonsen, 2011).
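
For concreteness, the sketch below approximates the text-distortion idea this work builds on (in the spirit of the multiple-asterisk masking variant, with an assumed list of frequent words): every word outside the frequent-word list is replaced by asterisks of the same length, and character n-grams are then drawn from the distorted text so that they capture style rather than topic.

```python
# Hedged sketch: masking topic-bearing words before extracting character n-grams.
import re
from collections import Counter

FREQUENT_WORDS = {"the", "of", "and", "to", "in", "a", "is", "that", "for", "it"}

def distort(text: str) -> str:
    # keep frequent words and punctuation, mask everything else with asterisks of equal length
    return " ".join(w if (w.lower() in FREQUENT_WORDS or not w[0].isalnum()) else "*" * len(w)
                    for w in re.findall(r"\w+|[^\w\s]", text))

def char_ngrams(text: str, n: int = 3) -> Counter:
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

distorted = distort("The verdict of the jury was announced in the morning.")
print(distorted)                          # The ******* of the **** *** ********* in the ******* .
print(char_ngrams(distorted).most_common(5))
```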
