readability measures

New Readability Measures for Bangla and Hindi Texts

The quantitative analysis of English text readability started with L.A. Sherman in 1880 (Sherman, 1893). To date, more than 200 readability metrics have been proposed for English, and formulas now exist for Spanish, French, German, Dutch, Swedish, Russian, Hebrew, Chinese, Vietnamese and Korean (Rabin et al., 1988). The existing quantitative approaches to predicting the readability of a text can be broadly classified into three categories (Benjamin, 2012). Traditional methods rely on easy-to-compute syntactic features of a text, such as sentence length and paragraph length. Examples are the Flesch Reading Ease Score (Flesch, 1948), the FOG index (Gunning, 1968), the Fry graph (Fry, 1968) and SMOG (McLaughlin, 1969). Chronologically newer formulas such as the new Dale-Chall index (Chall, 1995), the Lexile framework (Stenner, 1996), ATOS-TASA (Learning, 2001) and Read-X (Miltsakaki and Troutt, 2007) also consider the readers’ background and text semantics. Cognitively motivated methods use high-level text parameters such as cohesion, together with cognitive aspects of the reader. The proposition and inference model (Kintsch and Van Dijk, 1978), prototype theory (Rosch, 1978), latent semantic analysis (Landauer et al., 1998) and semantic networks (Foltz et al., 1998) are examples of this category. This type of approach introduced text levelling or text revising methods (Kemper, 1983; Britton and Gülgöz, 1991). Two distinguished instances of this class are Coh-Metrix (Graesser et al., 2004) and the DeLite software (vor der Brück et al., 2008). The third class of approaches draws on machine learning methods and probabilistic analysis. These approaches are useful for determining online readability based on user queries (Liu et al., 2004) and for predicting the readability of web texts (Collins-Thompson and Callan, 2005; Collins-Thompson and Callan, 2004; Si and Callan, 2003). Sophisticated machine learning methods such as support vector machines have been used to identify grammatical patterns within a text and to classify texts on that basis (Heilman et al., 2008).

Combining Lexical and Grammatical Features to Improve Readability Measures for First and Second Language Texts

The statistical language modeling approach has several advantages over traditional readability formulas, which are usually based on linear regression with two or three variables. First, a language modeling approach generally gives much better accuracy for Web documents and short passages (Collins-Thompson and Callan, 2004). Second, language modeling provides a probability distribution across all grade models, not just a single prediction. Third, language modeling provides more data on the relative difficulty of each word in the document. This might allow an application, for example, to provide more accurate vocabulary assistance.
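As a rough illustration of the second point, the sketch below trains one smoothed unigram model per grade and normalizes the resulting likelihoods into a distribution over grades. The toy training texts, the add-one smoothing, the uniform prior over grades, and the function names `train_unigram` and `grade_distribution` are all assumptions made for this example; the snippet above does not specify the authors' actual models.

```python
import math
from collections import Counter


def train_unigram(texts):
    """Count word occurrences over all training texts of one grade."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return counts, sum(counts.values())


def grade_distribution(text, models, vocab_size, alpha=1.0):
    """Return P(grade | text) under a uniform prior over grades.

    models maps a grade label to (counts, total) from train_unigram.
    alpha is an add-alpha smoothing constant (an assumption of this sketch).
    """
    words = text.lower().split()
    log_scores = {}
    for grade, (counts, total) in models.items():
        log_scores[grade] = sum(
            math.log((counts[w] + alpha) / (total + alpha * vocab_size))
            for w in words
        )
    # Normalize in log space to avoid underflow on longer passages.
    top = max(log_scores.values())
    z = sum(math.exp(s - top) for s in log_scores.values())
    return {g: math.exp(s - top) / z for g, s in log_scores.items()}


# Hypothetical two-grade toy corpus.
corpus = {
    "grade 1": ["the cat sat on the mat", "the dog ran"],
    "grade 8": ["the committee evaluated the proposal",
                "economic policy affects inflation"],
}
models = {g: train_unigram(texts) for g, texts in corpus.items()}
vocab = {w for texts in corpus.values() for t in texts for w in t.lower().split()}
dist = grade_distribution("the cat ran", models, len(vocab))
```

Here `dist` holds one probability per grade rather than a single predicted level, which is the property the passage highlights.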

Automatic Construction of Large Readability Corpora

The classical readability measures have been criticized for applying a superficial analysis of textual characteristics, ignoring, for example, that a longer sentence may be clearer and more explanatory than a shorter equivalent (Williams, 2004). According to McNamara et al. (2002), these formulas fail to capture several elements of cohesion and textual difficulty; the same authors point out that such tools push editors to modify a text to raise its calculated readability while actually reducing its cohesion. Recent studies have tried to apply automatic approaches that better approximate the complexity of a text, for example using n-gram language models to identify reading ease. Petersen and Ostendorf (2009) trained Support Vector Machines on a corpus created from an educational newspaper, Weekly Reader, with different versions for four grade levels, supplemented with articles for adults from the Associated Press. They worked with lexical and syntactic features and also with traditional formulas. Investigating the contribution of syntactic features, they observed that these were not good enough on their own but contributed to overall performance. In a complementary study, Vajjala and Meurers (2014) applied 152 lexical and syntactic attributes to classify a corpus of subtitles from different BBC channels for children and adults, also using SVMs. The most predictive attribute was shown to be the age of acquisition. Similar approaches have been applied in several other languages, including Italian (Dell’Orletta et al., 2011), German (Hancke et al., 2012) and Basque (Gonzalez-Dios et al., 2014). In Portuguese, Scarton and Aluísio (2010) classified articles for children and adults from local newspapers, using an SVM trained on 48 psycholinguistic features and obtaining an F-measure of 0.944.

Estimating readability with the Strathclyde readability measure

Readability measures are heuristic-based metrics that commonly derive a readability value from ‘intrinsic’ textual characteristics of a document. The usual presumption (the heuristic) is that texts with short sentences and short words are more easily read and comprehended by the average reader. With this in mind, readability measures usually focus on such document features as average sentence length (ASL) and average word length (AWL). Although for many texts this assumption is appropriate, there are also many instances of texts with short but complex words and short but obscure sentences. For such examples, most fog indices would award ‘good’ readability scores, although most human readers may consider these texts ‘difficult’.
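The two features named above can be sketched in a few lines. The tokenization rules here (sentences split on `.`, `!` and `?`, words matched as runs of letters and apostrophes) and the function name `average_lengths` are simplifying assumptions for illustration, not the procedure of any particular readability tool.

```python
import re


def average_lengths(text):
    """Return (ASL, AWL): average words per sentence, average letters per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text contains no sentences or no words")
    asl = len(words) / len(sentences)
    awl = sum(len(w) for w in words) / len(words)
    return asl, awl


asl, awl = average_lengths("The cat sat. It ran.")
```

A text built from short but rare words would receive similarly low ASL and AWL values, which is exactly the weakness the passage describes.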

Improving Sentence level Subjectivity Classification through Readability Measurement

To the best of our knowledge, readability measures have not so far been used to assess the subjectivity of any lexical units, be they word forms, phrases, sentences or whole documents. However, there is the work of Hoang et al. (2008) on evaluating the quality of user-created documents, and recent work by O’Mahony and Smyth (2010) on grading the helpfulness of reviews, both incorporating readability measures. Close to our research is the study by Nishikawa et al. (2010) on sentiment summarisation, which utilises measures of both informativeness and readability. Very recent support for our hypothesis is provided by Lahiri et al. (2011), who measure a correlation between informality and readability.

Genre oriented Readability Assessment: a Case Study

Over the last ten years, the development of efficient natural language processing (NLP) systems has led to a resurgence of interest in readability assessment. Several studies have been carried out based on NLP-enabled feature extraction and state–of–the–art machine learning algorithms, with significant performance improvements over traditional readability measures. Due to the great potential of automatic readability assessment for educational purposes, many of these studies, mostly focused on English but also tackling less–resourced languages, have been carried out with the final aim of supporting teachers and/or learners in selecting material which is appropriate to a given reading level. In principle, educational material can belong to different textual genres, ranging e.g. from fiction to scientific writing or reportage. The question which naturally arises is whether and to what extent readability assessment is genre–independent and, if this is not the case, whether and how general purpose readability assessment tools could reliably be used for dealing with texts belonging to different genres. The most recent literature on readability reports that the degree of readability is connected to genre: consider, for instance, the work by Kate et al. (2010), who improve the accuracy of readability predictions by using genre–specific features, or by Štajner et al. (2012), who showed that linguistic features correlated with readability are also genre dependent. This suggests that textual genre and readability do not represent orthogonal dimensions of classification, but intertwined notions whose complex interplay needs to be further investigated in order to envisage solutions which could be successfully exploited in real educational applications. NLP–based approaches to readability assessment proposed in the literature can be subdivided into two groups, according to whether readability assessment is carried out as a classification task (see among others (Petersen

A Comparison of Features for Automatic Readability Assessment

Shallow features refer to those used by traditional readability metrics, such as Flesch-Kincaid Grade Level (Flesch, 1979), SMOG (McLaughlin, 1969), Gunning FOG (Gunning, 1952), etc. Although recent readability studies have strived to take advantage of NLP techniques, little has been revealed about the predictive power of shallow features. Shallow features, which are limited to superficial text properties, are computationally much less expensive than syntactic or discourse features. To enable a comparison against more advanced features, we implement 8 frequently used shallow features as listed in Table 5.
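To make the notion of a shallow feature concrete, here is a minimal sketch of the Flesch-Kincaid Grade Level, which needs only word, sentence and syllable counts. The coefficients are those of the standard published formula; the vowel-group syllable counter and the function names are crude assumptions for illustration, and this is not a reconstruction of the eight features in the paper's Table 5, which is not reproduced here.

```python
import re


def count_syllables(word):
    """Crude heuristic: one syllable per run of vowels (at least one)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid Grade Level from raw counts."""
    return (0.39 * (n_words / n_sentences)
            + 11.8 * (n_syllables / n_words)
            - 15.59)


def fk_grade_of_text(text):
    """Apply fk_grade to raw text using simple regex tokenization."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return fk_grade(len(words), len(sentences), syllables)


# 100 words, 5 sentences, 150 syllables.
grade = fk_grade(100, 5, 150)
```

Everything here is a count or a ratio over the raw string, which is why such features are so much cheaper to compute than parse-based ones.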

Feature Optimization for Predicting Readability of Arabic L1 and L2

Computational readability assessment presents a growing body of work leveraging NLP to extract complex textual features, and ML to build readability models from corpora, rather than relying on human expertise or intuition (Collins-Thompson, 2014). Approaches vary depending on the purpose of the readability prediction model, e.g., measuring readability for text simplification (Aluisio et al., 2010; Dell’Orletta et al., 2014a; Al Khalil et al., 2017), selecting more cognitively-predictive features for readers with disabilities (Feng et al., 2009) or for self-directed language learning (Beinborn et al., 2012). Features used in predicting readability range from surface features extracted from raw text (e.g. average word count per line), to more complex ones requiring heavier text processing such as syntactic parsing features (Heilman et al., 2007, 2008; Beinborn et al., 2012; Hancke et al., 2012). The use of language models is increasingly favored in the literature over simple frequency counts, ratios and averages commonly used to quantify features in traditional readability formulas (Collins-Thompson and Callan, 2005; Beinborn et al., 2012; François and Miltsakaki, 2012). We evaluate features extracted using both methods in this study.

READABILITY ANALYSIS OF INDIAN DIABETIC ASSOCIATION WEBSITE AND OTHER HEALTH RELATED WEBSITES ON INFORMATION RELATED TO DIABETIC DIET

Patients use the Internet to educate themselves about health related topics. Endocrinologists and diabetologists in India and worldwide are more concerned about educating their patients regarding diabetic diet. Hence diabetic patients in particular give a lot of preference to diet related articles in the media. The usefulness of health related education materials on the internet depends largely on their Readability, Comprehensibility and Understandability. According to International Diabetes Federation (IDF) (1) Diabetic Atlas, 382 million people had diabetes within the year

Readability Classification for German using Lexical, Syntactic, and Morphological Features

Readability research on English has ignored morphological features to a large extent. However, with recent interest in readability assessment for languages other than English, the use of features which are relevant for other languages is gaining some prominence. Research on Italian and French readability is taking advantage of the rich verbal morphology of these languages. Dell’Orletta et al. (2011) worked with a corpus of Italian newspaper text at two different reading levels. They used a mixture of traditional, morpho-syntactic, lexical and syntactic features for building a two class readability model for Italian. Among others, their feature set included verbal mood based features, which relied on the rich verbal morphology of Italian. François and Fairon (2012) built their French readability classification model using a text book corpus designed for adult learners of French. They also considered verb tense and mood based text difficulty features along with several other features. Readability assessment was also studied for Portuguese using various lexical, syntactic, discourse and language modeling features derived from English research (dos Santos Marujo, 2009; Aluisio et al., 2010). Lau (2006) utilized the nature of the Chinese script to form several sub-character and character level features in addition to the common word and sentence level features for Chinese readability classification.

Sorting Texts by Readability

within the sorted texts. The norm is thus considered as the location of a text among an ordered set of texts. Our approach linguistically enhances assessment of the readability of a text as the relative ease compared to other texts, not as the absolute difficulty of the text. The root of this idea has been presented in two articles of which we are aware. In Inui and Yamamoto (2001), the readability of sentences for deaf people is judged by a comparator generated by an SVM. In addition, Pitler and Nenkova (2008) presented a comparison of texts in terms of difficulty by using an SVM. Similarly to what we present in Section 3.1, those authors propose constructing a comparator by using an SVM to compare two sentences or texts with multiple features. However, neither further applied this approach to obtain readability assessment based on sorting. Our contribution in this study is therefore that we show how a machine learning method can be used as a comparator and applied to sort texts.
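The idea of sorting with a pairwise comparator can be sketched as follows. In the work described above the comparator is an SVM trained on multiple features; here a fixed linear scorer with hand-picked weights stands in for it, so `WEIGHTS`, `features` and `compare` are hypothetical names and values chosen only to keep the example self-contained.

```python
from functools import cmp_to_key

# Hypothetical weights standing in for a learned comparator (not from the paper).
WEIGHTS = (0.1, 1.0)


def features(text):
    """Two toy features: word count and average word length."""
    words = text.split()
    return (len(words), sum(len(w) for w in words) / len(words))


def compare(a, b):
    """Return -1 if a is judged easier than b, 1 if harder, 0 if tied."""
    fa, fb = features(a), features(b)
    score = sum(w * (x - y) for w, x, y in zip(WEIGHTS, fa, fb))
    return (score > 0) - (score < 0)


def sort_by_readability(texts):
    """Order texts from easiest to hardest using only pairwise judgments."""
    return sorted(texts, key=cmp_to_key(compare))


texts = [
    "Governmental regulations necessitate comprehensive documentation",
    "The cat sat",
    "We like to read books",
]
ranked = sort_by_readability(texts)
```

The position of a text in `ranked` then serves as its relative readability. A linear stand-in like this one is always transitive; a learned comparator need not be, which is one reason the choice of sorting procedure matters in practice.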

Making Readability Indices Readable

With this work, we aim at bridging the gap between the standard approach to Italian readability based on the Gulpease index (following the same criteria of the Flesch Index) and the more advanced approaches to readability currently available for English and based on psycholinguistic principles. In particular, we present a set of indices for Italian readability covering different linguistic aspects, from the lexical to the discourse level, which are inspired by Coh-Metrix (Graesser et al., 2004). We make this analysis available online, but we differentiate our service from that of Coh-Metrix in that we provide a graphical representation of the aspects affecting readability, comparing a document with the average indices of elementary, middle and high-school level texts. This makes readability analysis really intuitive, so that a user can straightforwardly understand how difficult a document is, and see if some aspects (e.g. lexicon, syntax, discourse) affect readability more than others.

Towards a new model of readability

During those years, readability studies started to focus on different issues such as: (1) the use of cloze procedures as an alternative method to test text properties (Harrison, 1986; Rush, 1985; Shanahan, Kamil & Tobin, 1982), (2) the readers' factors that can influence readability (e.g. reading ability) (Pettersson, 1993), (3) motivation, prior knowledge, and interest (Baldwin, Peleg-Brukner & McCintock, 1985; Tobias 1994), and (4) readability effects and written work (Duffy, 1985). Although readability research was at a high level in that era, the number of studies gradually dropped towards its end, from 1995 onward. This was because there had been several criticisms, especially of readability formulae, in terms of their developmental criteria and grade level scores (Bruce, Rubin & Starr, 1981; Chambers, 1983; Davison & Kantor, 1982; Duffy, 1985; Fuchs, Fuchs & Deno, 1983; Meade & Smith, 1991; McConnell, 1983; Maxwell 1978; Pichert & Elam, 1985; Perera, 1980; Redish & Selzer, 1985; Redish, 2000; Schrivers, 2000; Stokes, 1978; Sydes & Hartley, 1997).

Neural Network Prediction of Censorable Language

A balanced corpus is created. The uncensored posts of each dataset are randomly sampled to match with the number of their censored counterparts (see Table 1 and Table 3). All numeric values have been standardized before classification. We use the MultilayerPerceptron function of Weka for classification. A number of classification experiments using different combinations of features are carried out. Best performances are achieved using the combination of CRIE, sentiment, semantic, frequency, readability and follower features (i.e. all features but LIWC) (see Table 2).
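The two preprocessing steps described above, downsampling the majority class and standardizing numeric features, can be sketched in plain Python. This is not the authors' Weka pipeline: the function names, the fixed random seed, and the use of the population standard deviation are assumptions of this illustration.

```python
import random
import statistics


def balance(censored, uncensored, seed=0):
    """Randomly sample uncensored posts to match the number of censored ones.

    Assumes there are at least as many uncensored posts as censored ones;
    random.sample raises ValueError otherwise.
    """
    rng = random.Random(seed)
    return censored, rng.sample(uncensored, k=len(censored))


def standardize(values):
    """Rescale one numeric feature column to zero mean and unit variance."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    # A constant column carries no information; map it to zeros.
    return [(v - mean) / sd if sd else 0.0 for v in values]


# Hypothetical post identifiers and one hypothetical feature column.
censored = ["c1", "c2", "c3"]
uncensored = ["u%d" % i for i in range(10)]
kept_censored, kept_uncensored = balance(censored, uncensored)
scaled = standardize([1.0, 2.0, 3.0])
```

After balancing, a classifier that always predicts the majority class can no longer score well by default, so accuracy differences between feature combinations are easier to interpret.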

Ease of Readability for Dyslexic People

It is not difficult to propose other RSVP modes with different trajectories and other features, but the ones already described above are sufficiently representative to raise a variety of relevant questions and tentative hypotheses. The thesis provides an overview of the essentials of reading and readability. The adaptive RSVP format, which already existed as an idea, had to be developed and put into practice. After that, the adaptation had to be integrated into an application for RSVP on a computer device. Finally, the prototype had to be benchmarked against other presentation formats in a usability evaluation. The work on the assignment can thus be roughly divided into the following three tasks:

Readability Assessment of Translated Texts

Most of the traditional readability approaches investigate shallow text properties to determine the complexity of a text, based on assumptions which correlate surface features with the linguistic factors which influence readability. For example, the average number of characters or syllables per word, the average number of words per sentence and the percentage of words not occurring among the most frequent n words in a language are correlated with the lexical, syntactic and, respectively, the semantic complexity of the text. The Flesch-Kincaid measure (Kincaid et al., 1975) employs the average number of syllables per word and the average number of words per sentence to assess readability, while the Automated Readability Index (Smith and Senter, 1967) and the Coleman-Liau metric (Coleman and Liau, 1975) measure word length based on character count rather than syllable count; they are func-

Readability Assessment for Text Simplification

We describe a readability assessment approach to support the process of text simplification for poor literacy readers. Given an input text, the goal is to predict its readability level, which corresponds to the literacy level that is expected from the target reader: rudimentary, basic or advanced. We complement features traditionally used for readability assessment with a number of new features, and experiment with alternative ways to model this problem using machine learning methods, namely classification, regression and ranking. The best resulting model is embedded in an authoring tool for Text Simplification.

NLP–Based Readability Assessment of Health–Related Texts: a Case Study on Italian Informed Consent Forms

sentences are more grammatically complex than shorter ones and that longer words are less comprehensible than shorter ones, this result witnesses the efforts of the authors of informed consents towards the use of an unavoidably complex vocabulary used, however, in simpler syntactic constructions. Interestingly enough, this is confirmed by the values of lexical features. Among them, it is worth noting that, with respect to both 2Par and Rep, informed consents contain a considerably lower percentage of lemmas (types) belonging to the “Basic Italian Vocabulary” (De Mauro, 2000), marked as BIV in Table 1 and corresponding to a list of 7000 words highly familiar to native speakers of Italian. This is in line with the outcomes of the studies on the discriminative power of vocabulary clues in readability assessment (see, among others, Petersen and Ostendorf (2009)). Obviously, this also reveals the massive use of health–related words specific to this domain of knowledge and here still considered as out–of–vocabulary lemmas. In addition, 2IC texts show a higher Type–Token Ratio (TTR) value (which is computed for the first 100 tokens of each document), meaning that this text type is much richer lexically, with values which are closer to what is observed with respect to Rep, here considered as representative of the class of difficult–to–read texts.
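The Type–Token Ratio over a fixed window, as mentioned above, can be sketched as follows. Whitespace tokenization and case folding are assumptions of this example, and the function name `type_token_ratio` is hypothetical; the passage only states that the ratio is computed over the first 100 tokens of each document.

```python
def type_token_ratio(tokens, window=100):
    """Distinct tokens divided by total tokens within the first `window` tokens."""
    head = [t.lower() for t in tokens[:window]]
    if not head:
        raise ValueError("no tokens to measure")
    return len(set(head)) / len(head)


ttr_small = type_token_ratio("a b a c".split())
# Only the first 100 tokens count, so the 50 trailing "y" tokens are ignored.
ttr_windowed = type_token_ratio(["x"] * 100 + ["y"] * 50)
```

Fixing the window keeps the ratio comparable across documents of different lengths, since TTR otherwise tends to fall as a text grows.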

On Improving the Accuracy of Readability Classification using Insights from Second Language Acquisition

Independent of the research on readability, the complexity of the texts produced by language learners has been extensively investigated in Second Language Acquisition (SLA) research (Housen and Kuiken, 2009). Recent approaches have automated and compared a number of such complexity measures for learner language, specifically in English as Second Language learner narratives (Lu, 2010; Lu, 2011b). So far, there is hardly any work on using such insights in computational linguistics, though, with the notable exception of Chen and Zechner (2011) using SLA features to evaluate spontaneous non-native speech. Given that graded corpora are also intended to be used by incremental age groups, we started to investigate whether the insights from SLA research can fruitfully be applied to readability classification.
