automatic word boundary detection

Top PDF automatic word boundary detection:

A rule based approach for automatic clause boundary detection and classification in Hindi

We propose a rule-based system that first identifies the clause(s) in the input sentence and marks the ‘clause start position’ (CSP) and ‘clause end position’ (CEP) with brackets, and then classifies the identified clauses into one of the proposed types mentioned in section 2. Hindi usually follows SOV word order, so in most cases the ends of clauses can be found using verb information alone. The language also has explicit relative pronouns, subordinating conjuncts, coordinate conjunctions, etc., which serve as cues that help to identify clause boundaries and clause types. Thus our system uses lists of coordinate conjunctions, relative markers and adverbial clause markers (see Appendix A and Appendix B for the lists). These lists were created using (Kachru, 2006). Further, the rules for our system have been framed based on our in-depth analysis of a section of the Hindi treebank (Palmer et al., 2009). Apart from the lexical cues, we have also used POS tag and chunk information to frame these rules.

Automatic Detection of Intra Word Code Switching

McArthur (1998) identified four major types of code-switching, ranging from tag-switching (tags and set of phrases) to intra-word switching, where a change occurs within a word boundary. The occurrence of intra-word switching has only been rarely addressed in computational linguistics research. Habash et al. (2005) developed a morphological analyzer and generator for the Arabic language family. The tool allows combining morphemes from different dialects.

Evaluating Word Embeddings for Sentence Boundary Detection in Speech Transcripts

The concept of a sentence in written or spoken texts is important in several Natural Language Processing (NLP) tasks, such as morpho-syntactic analysis [Kepler and Finger 2010, Fonseca and Aluísio 2016], sentiment analysis [Brum et al. 2016], and speech processing [Mendonça et al. 2014], among others. However, punctuation marks that constitute a sentence boundary are ambiguous. The Disambiguation of Punctuation Marks (DPM) task analyzes punctuation marks in texts and indicates whether they correspond to a sentence boundary. The purpose of the DPM task is to answer the question: among the punctuation-mark tokens in a text, which of them correspond to sentence boundaries? The Sentence Boundary Detection (SBD) task is very similar to DPM: both attempt to break a text into sequential units that correspond to sentences. DPM is text-based, whereas SBD can be applied to either written text or audio transcriptions, and often to clauses, which do not necessarily end in final punctuation marks but are complete thoughts nonetheless. However, performing SBD on speech transcripts is more complicated due to the lack of information such as punctuation and capitalization; moreover, the text output is susceptible to recognition errors when Automatic Speech Recognition (ASR) systems are used for automatic transcription [Gotoh and Renals 2000]. SBD from speech transcriptions is a task that has gained more attention in the last decades due to the increasing popularity of ASR software, which automatically generates text from audio input. This task can also be applied to written texts, like online product reviews [Silla Jr and Kaestner 2004, Read et al. 2012, López and Pardo 2015], in order to improve their intelligibility and facilitate the subsequent use of NLP tools.
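The DPM question posed above (which punctuation marks end a sentence?) can be illustrated with a very small rule-based baseline. This is only a sketch under stated assumptions: the abbreviation list, the function name `boundary_punctuation`, and the "followed by whitespace and a capital letter" rule are hypothetical choices made for illustration, and they are not the embedding-based method the paper evaluates.

```python
import re

# Illustrative, non-exhaustive abbreviation list (an assumption, not from the paper).
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "e.g", "i.e"}


def boundary_punctuation(text):
    """Return character offsets of '.', '!' and '?' judged to end a sentence."""
    offsets = []
    for match in re.finditer(r"[.!?]", text):
        i = match.start()
        # The run of non-space characters immediately before the mark.
        before = re.search(r"(\S+)$", text[:i])
        prev_word = before.group(1).lower() if before else ""
        rest = text[i + 1:]
        # A period after a known abbreviation is not a boundary.
        if text[i] == "." and prev_word in ABBREVIATIONS:
            continue
        # A period between two digits is a decimal point.
        if text[i] == "." and rest[:1].isdigit() and prev_word[-1:].isdigit():
            continue
        # Boundary if nothing follows, or whitespace plus a capital/opening quote.
        if rest.strip() == "" or re.match(r"\s+[A-Z\"'(]", rest):
            offsets.append(i)
    return offsets


if __name__ == "__main__":
    sample = "Dr. Smith paid 3.50 dollars. He left."
    print(boundary_punctuation(sample))
```

A heuristic like this depends entirely on punctuation and capitalization, which is exactly the information the snippet says is missing from ASR output, so it is not applicable to speech transcripts without further features.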

Automatic Vehicle Detection and Identification using Visual Features

Shyang-Lih Chang et al. [7] proposed a license plate image technique consisting of two main modules: a license plate locating module and a license number identification module. Specifically, the license plate candidates extracted by the first module are examined in the identification module to reduce the error rate. In the first module, several features such as color are taken into consideration to determine the license plate region. Initially, they use color edge detection to compute an edge map E, which contains three types of edges (i.e., black-white, red-white and green-white edges), because there are just four colors (white, black, red and green) for the plate and characters in Taiwan. To detect the color edges, these three kinds of edges are taken into consideration. The RGB color differences (Δr, Δg, Δb) can be calculated to find the edges. Next, the program transforms RGB space into HSI space, which denote the (red, green, blue) and (hue, saturation, intensity) values of an image pixel, respectively. The transform formula is as below.
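The formula referred to in the snippet is not included in this excerpt. For orientation only, one widely used textbook RGB-to-HSI conversion can be sketched as follows; it is an assumption that the cited work uses this particular variant, and the function name `rgb_to_hsi` and the choice to report hue in degrees are illustrative.

```python
import math


def rgb_to_hsi(r, g, b):
    """Convert RGB components in [0, 1] to (hue in degrees, saturation, intensity).

    Uses a common textbook formulation; not confirmed to match the cited paper.
    """
    intensity = (r + g + b) / 3.0
    if intensity == 0.0:
        # Pure black: hue and saturation are undefined, report 0 for both.
        return 0.0, 0.0, 0.0
    saturation = 1.0 - min(r, g, b) / intensity
    numerator = 0.5 * ((r - g) + (r - b))
    # The radicand is non-negative in exact arithmetic; guard against rounding.
    radicand = max(0.0, (r - g) ** 2 + (r - b) * (g - b))
    denominator = math.sqrt(radicand)
    if denominator == 0.0:
        # Gray pixel (r == g == b): hue is undefined, report 0.
        return 0.0, saturation, intensity
    ratio = max(-1.0, min(1.0, numerator / denominator))
    hue = math.degrees(math.acos(ratio))
    if b > g:
        # Hues past 180 degrees lie on the blue side of the color circle.
        hue = 360.0 - hue
    return hue, saturation, intensity


if __name__ == "__main__":
    for name, rgb in [("red", (1, 0, 0)), ("green", (0, 1, 0)), ("blue", (0, 0, 1))]:
        print(name, rgb_to_hsi(*rgb))
```

In this formulation pure red, green and blue map to hues of roughly 0, 120 and 240 degrees, which is one reason a hue-based representation is convenient when a plate uses only a handful of distinct colors.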

Automatic New Word Acquisition: Spelling from Acoustics

Fil Alleva and Kai-Fu Lee, School of Computer Science, Carnegie Mellon University, Pittsburg[.]

Automatic Association of Web Directories with Word Senses

Overall, our results suggest that directory-based instances, in spite of being shorter and automatically extracted, are not substantially worse for supervised WSD than the hand-tagged material provided by the Senseval organization. The limitation of the approach is currently the low coverage of word senses and the amount of training samples. Two strategies may help in overcoming such limitations: first, propagating directories via synonymy (attaching directories to synsets rather than word senses) and semantic relationships (propagating directories via hyponymy relations); second, retrieving instances not only from the ODP page describing the directory contents, but from the Web pages listed in the directory.

Automatic Optic Disc Abnormality Detection in Fundus Images: A Deep Learning Approach

The cascade classifier was trained in 13 stages on a 2.9 GHz Intel Core i7 using Matlab. The total training time for all stages was 17.3 minutes. However, the average prediction time for a previously unseen image was only 0.034 seconds. From the conducted experiments, it is observed that the cascade classifier is sufficient and performs well on datasets consisting of good-quality normal images. However, the performance clearly degrades on other datasets that exhibit variable image conditions, which confirms the need for learning more discriminative features through the CNNs, as shown in Figure 4(a-d). Figure 5 shows examples of OD detections by the proposed approach. Table 1 presents the evaluation results for the OD detection approach. As shown in Figure 4(e-h), the model is able to learn

Automatic Rule Induction for Unknown Word Guessing

Using such training data, three types of guessing rules are induced: prefix morphological rules, suffix morphological rules, and ending-guessing rules... Andrei Mikheev Unknown-Word Gues[r]

Automatic Domain Assignment for Word Sense Alignment

Lexical knowledge, i.e. how words are used and express meaning, plays a key role in Natural Language Processing. Lexical knowledge is available in many different forms, ranging from unstructured terminologies (i.e. word lists), to full-fledged computational lexica and ontologies (e.g. WordNet (Fellbaum, 1998)). The process of creation of lexical resources is costly both in terms of money and time. To overcome these limits, semi-automatic approaches have been developed (e.g. MultiWordNet (Pianta et al., 2002)) with different levels of success. Furthermore, important information is scattered in different resources and difficult to use. Semantic interoperability between resources could represent a viable solution to allow reusability and develop more robust and powerful resources. Word sense alignment (WSA) qualifies as the preliminary requirement for achieving this goal (Matuschek and Gurevych, 2013).

Continual State Representation Learning for Reinforcement Learning using Generative Replay

We consider the problem of building a state representation model in a continual fashion. As the environment changes, the aim is to efficiently compress the sensory state’s information without losing past knowledge. The learned features are then fed to a Reinforcement Learning algorithm to learn a policy. We propose to use Variational Auto-Encoders for state representation, and Generative Replay, i.e. the use of generated samples, to maintain past knowledge. We also provide a general and statistically sound method for automatic environment change detection. Our method provides efficient state representation as well as forward transfer, and avoids catastrophic forgetting. The resulting model is capable of incrementally learning information without using past data and with a bounded system size.

Wavelet Energy-Based Support Vector Machine for Noisy Word Boundary Detection With Speech Recognition Application

Next, the ten Mandarin digit words in each sequence of transcriptions in the test database are to be recognized. The words in each sequence are detected by each of the two methods. When a run of successive frames detected as speech lasts longer than 0.1 second, we regard it as a word for recognition; otherwise these frames are discarded. So the number of words
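The duration rule described in this snippet (keep a run of speech frames only if it lasts longer than 0.1 second) can be sketched as a simple pass over per-frame speech/non-speech decisions. The function name `detect_words` and the 10 ms frame step are assumptions made for illustration; the excerpt does not state the frame length, and the per-frame decisions would come from the paper's classifier, which is not reproduced here.

```python
def detect_words(speech_flags, frame_seconds=0.01, min_word_seconds=0.1):
    """Group consecutive speech frames into word segments.

    speech_flags: iterable of booleans, one per frame (True = speech).
    frame_seconds: assumed frame step in seconds (hypothetical default).
    Returns a list of (start_frame, end_frame) pairs, end exclusive, for runs
    whose duration is strictly longer than min_word_seconds.
    """
    words = []
    run_start = None
    # A trailing False flushes a run that reaches the end of the input.
    for index, is_speech in enumerate(list(speech_flags) + [False]):
        if is_speech and run_start is None:
            run_start = index
        elif not is_speech and run_start is not None:
            if (index - run_start) * frame_seconds > min_word_seconds:
                words.append((run_start, index))
            # Shorter runs are discarded, as the snippet describes.
            run_start = None
    return words


if __name__ == "__main__":
    flags = [False] * 3 + [True] * 15 + [False] * 2 + [True] * 4 + [False]
    print(detect_words(flags))
```

Counting the returned segments gives the number of detected words, which is the quantity the truncated sentence above goes on to discuss.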

Preprocessing digital retinal images for vessel segmentation

Automatic detection of optic disc and blood vessels from retinal images using image processing techniques. Automatic optic disc detection and removal of false exudates for improving [r]

Distribution Automation / Smart Grid. Kevin Whitten

• To make fault detection and automatic isolation.
• To integrate automatic meter reading to avoid manipulation and loss of revenue by integrating DA with Automatic Billing and Collectio[r]

Pragma SUM: Key Word Use in Automatic Summarization

As access to the Internet broadens and with the advent of tools that allow people to create content, the amount of information to which we have access grows exponentially. Texts written about countless subjects and by countless authors are produced every day. It is impossible to absorb all the information available or to select the most adequate piece of information for a certain interest or public. Automatic text summarization, in addition to presenting a text in condensed form, can simplify it, thus generating an alternative for saving time and widening access to the contained information for many different types of readers. The automatic summarizers that currently exist in the literature do not fit personalization methods for each type of reader and, consequently, generate results that have limited precision. This article aims to use the automatic text summarizer PragmaSUM in educational texts with new summarization techniques using keywords. Perso

Automatic Learning of Word Transducers From Examples

At the other end of the spectrum, when N is large, the learned model will describe the examples in TS and them only.. is the empty string,.[r]

A Possibilistic Approach for Automatic Word Sense Disambiguation

We discuss in this paper the contribution of a new approach for WSD. We presuppose that combining knowledge extracted from corpora and traditional dictionaries will improve disambiguation rates. We also show that this approach may achieve satisfactory results even without using manually labeled corpora for training. We also propose to apply possibility theory as an efficient framework to solve the WSD problem, seen as a case of imprecision. Indeed, WSD approaches need training and matching models which compute the similarities (or the relevance) between senses and contexts. Existing models for WSD are based on poor, uncertain and imprecise data, whereas possibility theory is naturally suited to this kind of application because it makes it possible to express ignorance and to take account of imprecision and uncertainty at the same time. For example, the recent work of Ayed et al. (2012) [23][24], which proposed a possibilistic approach for the morphological disambiguation of Arabic texts, showed the contribution of possibilistic models compared to probabilistic ones. That is, we evaluate the relevance of a word sense given a polysemous sentence, proposing two types of relevance: plausible relevance and necessary relevance. This paper is structured as follows. First, we give an overview of the main existing WSD approaches in section 2. Section 3 briefly recalls possibility theory. Our approach is detailed in section 4. Subsequently, a set of experiments and comparison results are discussed in section 5. Finally, we summarize our findings in the conclusion and propose some directions for future research.

Automatic Detection of Point of View Differences in Wikipedia

For Arabic, we investigate a number of different options for linguistic preprocessing. Arabic is a clitic language and highly inflectional. Normalization and lemmatization of Arabic text are beneficial preprocessing steps in many NLP applications. Lemmatization has been used widely in classification problems due to its ability to generate one form that matches many other related forms (Al Ameed et al., 2005). Therefore, in addition to the non-lemmatized surface forms, we used two lemmatization types: stem and root. We use light stemming to extract the stem: only frequent suffixes/prefixes are removed. In contrast, a word is reduced to its corresponding root by removing all affixes, not just frequent affixes (Al Ameed et al., 2005). We use the Arabic Text Mining tool for computing stems and roots.

Semi automatic Annotation of Chinese Word Structure

The performance of the proposed semi-supervised approach suggests that the distribution of the data has good characteristics that tightly link to the underlying structures. In other words, the form class descriptions of word structures provide much information for inducing the structural regularities of Chinese words. To the best of our knowledge, this is the first work on automatic annotation of Chinese word structures based on semi-supervised learning. We are unable to find any existing work to directly compare with it. However, there are previous works on semi-supervised learning for other NLP tasks, such as document classification (Nigam et al., 2006). They used naïve Bayes for both the supervised learning and unsupervised learning, whereas our supervised and unsupervised models are ME and GMM, respectively. In

Hierarchical word clustering - automatic thesaurus generation
