In this paper, we presented an approach that applies pointwise prediction to sentence boundary detection in noisy PDF text in the financial domain for the FinSBD 2019 shared task. Our pointwise prediction model achieved averaged F1-scores of 0.88 and 0.84 for the beginning/ending of sentences in English and French, respectively. In the final results, this model obtained 0.84 in English and 0.86 in French. The proposed pointwise prediction model clearly outperformed the rule-based prediction model on every metric. In our model, we employed several parameter sets and ensembled models trained with these parameter sets. The results show that the ensemble models outperformed every model without ensembling. However, other parameter sets that are also accurate for this task are possible. Moreover, we fixed some parameters. As future work, these parameters should also be tuned.
diction of sentence boundaries as compared to a pipeline baseline where both tasks are performed independently of each other. For our analysis, we use the Wall Street Journal as the standard benchmark set and as a representative for copy-edited text. We also use the Switchboard corpus of transcribed dialogues as a representative for data where punctuation cannot give clues to a sentence boundary predictor (other types of data that may show this property to varying degrees are web content data, e.g. forum posts or chat protocols, or (especially historical) manuscripts). While the Switchboard corpus gives us a realistic scenario for a setting with unreliable punctuation, the syntactic complexity of telephone conversations is rather low compared to the Wall Street Journal. Therefore, as a controlled experiment for assessing how far syntactic competence alone can take us if we stop trusting punctuation and capitalization entirely, we perform joint sentence boundary detection/parsing on a lower-cased, no-punctuation version of the Wall Street Journal. In this setting, where the parser must rely on syntactic information alone to predict sentence boundaries, syntactic information makes a difference of 10 percentage points absolute for the sentence boundary detection task, and two points for labeled parsing accuracy.
Regardless of how the sentence is defined formally, sentence boundary detection (SBD) (cf. sentence boundary disambiguation, sentence segmentation, sentence breaking, sentence chunking) is a foundational, critically important upstream step in many NLP applications and (sub)tasks, such as part-of-speech tagging, named entity recognition, dependency parsing, and semantic role labelling, to name a few. Sentence boundary detection attempts to determine the spans (bounds, begin/from-end/to token indices) of sentences and sentence-like constructs below paragraphs, sections, or other suprasentential structures. Because incorrect sentence spans can propagate and generate noise (and undesirable complications) for downstream tasks, SBD plays a critical role in practical NLP applications.
The first step of many language tasks, such as POS tagging, discourse parsing, machine translation, etc., is sentence boundary detection (SBD), which detects the end of the sentence [Nagmani Wanjaria, 2016]. This makes the task of detecting the beginning and ending very important, as it helps in processing written language text. However, detecting the end of the sentence is a complicated task due to the ambiguity of punctuation and words in the sentence [Gregory Grefenstette, 1994]. For example, punctuation marks like "." and "!" do not always represent the end of a sentence and have several functions. The "." can be part of a number like 2.34 or an abbreviation of a phrase, and "!" can represent a word of surprise or shock. A number of research pieces on sentence boundaries mainly used machine learning methods, such as the hidden Markov model [Mikheev, 2002], maximum entropy [Jeffrey C. Reynar, 1997], conditional random fields [Katrin Tomanek, 1997], and neural networks [Tibor Kiss, 2006]. Recently, deep learning models have been applied to solve this issue and achieved good performance [Carlos-Emiliano Gonzalez-Gallardo, 2018a] [Carlos-Emiliano Gonzalez-Gallardo, 2018b]. Until now, research about SBD has been confined to formal texts, such as news and European parliament proceedings, which reach high accuracy using rule-based, machine learning, and deep learning methods due to the perfectly clean text data. There is no research about SBD in noisy text extracted from files in machine-readable formats. The FinNLP workshop at IJCAI-2019 is the first to propose the FinSBD-2019 shared task, which detects sentence boundaries in the noisy text of finance documents [A Ait Azzi, 2019].
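The ambiguity of "." described above can be made concrete with a small sketch. The abbreviation list, function names, and splitting rules below are our own illustrative assumptions, not a method from any of the cited works: a naive splitter breaks after every terminal punctuation mark, while a minimal rule-based variant skips periods inside decimal numbers and after known abbreviations.

```python
import re

# Naive approach: treat every ".", "!", "?" followed by whitespace as a
# sentence end. This wrongly splits after abbreviations such as "Mr.".
def naive_split(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Hypothetical, minimal abbreviation list for illustration only.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "etc.", "e.g.", "i.e."}

def rule_based_split(text):
    """Split on terminal punctuation, except inside numbers and after
    known abbreviations."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok.endswith((".", "!", "?")):
            is_number = re.fullmatch(r"\d+\.\d+[.!?]?", tok) is not None
            is_abbrev = tok.lower() in ABBREVIATIONS
            if not is_number and not is_abbrev:
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

On "Mr. Smith paid 2.34 dollars. He left!" the naive splitter produces three fragments (breaking after "Mr."), while the rule-based variant recovers the two intended sentences.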
A word is a more meaningful unit than a character in Chinese. However, the features from Word Level (W) are slightly inferior to those from Character Level (C) in our experiments. After analyzing the wrongly classified examples, we found that Chinese word segmentation errors propagate to the sentence boundary detection task. In addition, many clue words such as “了” (past tense indicator), “嗎” (interrogative particle), and “吧” (particle used after an imperative sentence) are single-character words, hence Character Level (C) features cover these words as well. Part-of-speech not only has the highest precision among all the single feature sets, but also improves precision when combined with the other features.
The concept of a sentence in written or spoken texts is important in several Natural Language Processing (NLP) tasks, such as morpho-syntactic analysis [Kepler and Finger 2010, Fonseca and Aluísio 2016], sentiment analysis [Brum et al. 2016], and speech processing [Mendonça et al. 2014], among others. However, punctuation marks that constitute a sentence boundary are ambiguous. The Disambiguation of Punctuation Marks (DPM) task analyzes punctuation marks in texts and indicates whether they correspond to a sentence boundary. The purpose of the DPM task is to answer the question: Among the punctuation mark tokens in a text, which of them correspond to sentence boundaries? The Sentence Boundary Detection (SBD) task is very similar to DPM: both attempt to break a text into sequential units that correspond to sentences, but DPM is text-based, while SBD can be applied to either written text or audio transcriptions and often targets clauses, which do not necessarily end in final punctuation marks but are complete thoughts nonetheless. However, performing SBD on speech texts is more complicated due to the lack of information such as punctuation and capitalization; moreover, the text output is susceptible to recognition errors when Automatic Speech Recognition (ASR) systems are used for automatic transcriptions [Gotoh and Renals 2000]. SBD from speech transcriptions is a task which has gained more attention in the last decades due to the increasing popularity of ASR software that automatically generates text from audio input. This task can also be applied to written texts, like online product reviews [Silla Jr and Kaestner 2004, Read et al. 2012, López and Pardo 2015], in order to improve their intelligibility and facilitate the posterior use of NLP tools.
In Sections 6.4.1 and 6.4.2, we evaluate Punkt against three different baseline algorithms. These baselines serve several purposes. First, they establish a lower bound for the task of sentence boundary detection. Any sentence boundary detection system should perform significantly better than these baseline algorithms. Second, although we compare Punkt to other systems proposed in the literature in Section 7, most previous work on sentence boundary detection considered at most three different languages so that no direct comparison is possible for many of the corpora and languages that we have used in our evaluation. A comparison with the performance of the three baselines can at least give an indication of how well our system did on these corpora. Third, there is still an assumption held in the field that simple algorithms such as the baselines presented here are sufficiently reliable to be used for sentence boundary detection. This opinion was, for example, held by a reviewer of Kiss and Strunk (2002a). As will become clear in the following sections, a baseline algorithm may perform pretty well on one corpus, but this performance typically does not carry over to other languages or corpora. The baselines thus also serve to illustrate the complexity of the sentence boundary detection problem. The absolute baseline (AbsBL) is the simplest approach to sentence boundary detection we can think of. It simply assumes that all token-final periods in a test corpus represent sentence boundaries. Consequently, all periods are tagged with <S>.
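The absolute baseline as described above fits in a few lines; the token-list representation and tag placement below are our own assumptions about the input format:

```python
# Absolute baseline (AbsBL) sketch: every token-final period is assumed
# to mark a sentence boundary and the token is tagged with <S>.
def absolute_baseline(tokens):
    """Tag each token ending in a period with <S>."""
    return [tok + " <S>" if tok.endswith(".") else tok for tok in tokens]
```

Note how the baseline necessarily over-generates: it tags abbreviations such as "Mr." as sentence ends, which is exactly the weakness the comparison with Punkt is meant to expose.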
Sentence boundary detection is a fundamental preprocessing step for the use of text in downstream tasks such as part-of-speech tagging and machine translation. While rule-based approaches are the earliest method applied, we focus the related work on more advanced approaches, namely neural networks. The use of neural networks (NN) for sentence boundary detection dates as far back as 1994 [Cutting et al., 1992]. Palmer and Hearst used a NN with two hidden units as an adaptable approach to overcome the restrictions of rule-based sentence boundary detection [Palmer and Hearst, 1994]. Their work utilised the part-of-speech (POS) tags surrounding sentence endings as an indicator. Since most POS taggers require available sentence boundaries, they inferred the POS based on the previous part-of-speech. When applied to a corpus of Wall Street Journal articles (WSJ), their work correctly disambiguated over 98.5% of sentence-ending punctuation marks. Riley uses a decision-tree-based approach to detect endings of sentences in the Brown corpus [Riley, 1989]. The maximum entropy approach by Reynar and Ratnaparkhi achieves an accuracy of 98.0% on the Brown corpus and 97.5% on the WSJ corpus [Reynar and Ratnaparkhi, 1997]. In an effort to segment sentences in the output of large-vocabulary speech recognizers, Stolcke and Shriberg use a statistical language model to retrieve the probabilities of sentence endings [Stolcke and Shriberg, 1996]. They also mention the beneficial impact POS use can have. In a later work, Stolcke et al. used decision trees to model a combination of prosodic cues aiming at the detection of events (i.e. sentence boundaries and disfluencies) [Stolcke et al., 1998]. Dealing with a similar problem, Gotoh and Renals utilise n-gram language models to predict sentence boundaries from broadcast transcripts which have been converted to text [Gotoh and Renals, 2000].
Stevenson and Gaizauskas approach sentence boundary detection in automated speech recognition transcripts using a memory-based learning approach [Stevenson and Gaizauskas, 2000]. Other works used Hidden Markov Models (HMM) [Shriberg et al., 2000] and Conditional Random Fields (CRF) [Liu et al., 2005; Liu et al., 2006]. Also in a machine translation setting,
This paper presents two different approaches towards Sentence Boundary Detection (SBD) that were submitted to the FinSBD-2019 shared task. The first is a supervised machine learning approach which tackled the SBD task as a combination of binary classifications based on TF-IDF representations of context windows. The second approach is unsupervised and rule-based, and applies manually created heuristics to automatically annotated input. Since the latter approach yielded better results on the Dev set, we submitted it to evaluation for English and reached F-scores of 0.80 and 0.86 for detecting beginnings and endings of sentences, respectively.
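The supervised setup described above can be sketched as follows. This is a hedged illustration only: the window size, the TF-IDF weighting, and the downstream classifier are our own assumptions, not the submission's actual configuration. Each token position becomes one binary instance, represented by TF-IDF weights over its surrounding context window.

```python
import math
from collections import Counter

def context_window(tokens, i, size=2):
    """Tokens in a symmetric window of `size` around position i."""
    return tokens[max(0, i - size): i + size + 1]

def tfidf_vectors(windows):
    """Simple TF-IDF weights for each context window (sparse dicts)."""
    n = len(windows)
    df = Counter()                      # document frequency over windows
    for w in windows:
        df.update(set(w))
    vectors = []
    for w in windows:
        tf = Counter(w)
        vectors.append({t: (tf[t] / len(w)) * math.log(n / df[t]) for t in tf})
    return vectors
```

Each resulting vector would then be paired with a binary label (boundary / no boundary at that position) and fed to an off-the-shelf classifier.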
As part of the First Workshop on FinTech and Natural Language Processing (FinNLP), we introduced the FinSBD shared task, which aims at sentence boundary detection in noisy text extracted from financial prospectuses, in two languages: English and French. Systems participating in this shared task were given a set of textual documents extracted from PDF files, which are to be automatically segmented to extract a set of well-delimited sentences (clean sentences). The data is in a JSON format (cf. Figure 1) containing: "text", which corresponds to the text to segment, and "begin_sentence" and "end_sentence", which correspond to all indexes of tokens marking the beginning and the end of well-formed sentences in the text. It is important to note that the provided text is already segmented at the word level. All participants were asked to keep this segmentation, since all token indexes are built based on it. The first token in the text thus has index 0.
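The format above can be consumed with a few lines of code. The field names come from the task description; the sample document itself is invented for illustration and is not taken from the actual FinSBD data:

```python
import json

# Toy document in the described format: "text" is pre-tokenized at the
# word level, and begin/end indexes are 0-based token positions.
sample = json.loads("""
{"text": "Subscriptions may only be received .",
 "begin_sentence": [0],
 "end_sentence": [5]}
""")

def extract_sentences(doc):
    """Rebuild sentences from paired begin/end token indexes."""
    tokens = doc["text"].split()        # whitespace split preserves the
                                        # provided word-level segmentation
    return [" ".join(tokens[b:e + 1])
            for b, e in zip(doc["begin_sentence"], doc["end_sentence"])]
```

Because the indexes are built on the provided tokenization, any re-tokenization by a participant system would invalidate the gold annotations.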
Sentence Boundary Detection (SBD) is a fundamental task for many Natural Language Processing (NLP) and analysis tasks, including POS tagging; syntactic, semantic, and discourse parsing; parallel text alignment; and machine translation (Gillick, 2009). Most research on SBD focuses on languages that already have a well-defined concept of what a sentence is, typically indicated by sentence-end markers like full stops, question marks, or other punctuation. However, as we study more contexts of language use (e.g. speech output, which lacks punctuation) and look at many more different languages, the assumption of clearly punctuated sentence boundaries becomes less valid. One such language is Thai.
This paper describes a project to detect dependencies between Japanese phrasal units called bunsetsus, and sentence boundaries in a spontaneous speech corpus. In monologues, the biggest problem with dependency structure analysis is that sentence boundaries are ambiguous. In this paper, we propose two methods for improving the accuracy of sentence boundary detection in spontaneous Japanese speech: one is based on statistical machine translation using dependency information and the other is based on text chunking using SVM. An F-measure of 84.9 was achieved for the accuracy of sentence boundary detection by using the proposed methods. The accuracy of dependency structure analysis was also improved from 75.2% to 77.2% by using automatically detected sentence boundaries. The accuracy of dependency structure analysis and that of sentence boundary detection were also improved by interactively using both automatically detected dependency structures and sentence boundaries.
1.1 Sentence Segmentation Using HMM Most prior work on sentence segmentation (Shriberg et al., 2000; Gotoh and Renals, 2000; Christensen et al., 2001; Kim and Woodland, 2001; NIST-RT03F, 2003) has used an HMM approach, in which the word/tag sequences are modeled by N-gram language models (LMs) (Stolcke and Shriberg, 1996). Additional features (mostly related to speech prosody) are modeled as observation likelihoods attached to the N-gram states of the HMM (Shriberg et al., 2000). Figure 1 shows the graphical model representation of the variables involved in the HMM for this task. Note that the words appear in both the states and the observations, such that the word stream constrains the possible hidden states to matching words; the ambiguity in the task stems entirely from the choice of events. This architecture differs from the one typically used for sequence tagging (e.g., part-of-speech tagging), in which the “hidden” states represent only the events or tags. Empirical investigations have shown that omitting words in the states significantly degrades system performance for sentence boundary detection (Liu, 2004). The observation probabilities in the HMM, implemented using a decision tree classifier, capture the probabilities of generating the prosodic features
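The hidden-event idea above can be illustrated with a deliberately tiny sketch. Everything here is assumed for illustration: the bigram table is hand-set rather than estimated, prosodic observation likelihoods are omitted, and with a bigram LM each inter-word decision decomposes independently, so no Viterbi search is needed; real systems use higher-order N-grams, prosodic features, and full decoding over combined word/event states.

```python
# Toy hidden-event bigram LM: decide, for each gap between observed
# words, whether inserting a boundary event <s> yields higher bigram
# probability than reading the two words as adjacent.
BIGRAM = {  # hand-set toy probabilities, not estimated from any corpus
    ("yesterday", "<s>"): 0.6, ("<s>", "the"): 0.4,
    ("yesterday", "the"): 0.05, ("fell", "<s>"): 0.1,
    ("fell", "sharply"): 0.7,
}

def p(a, b, floor=1e-3):
    """Bigram probability with a small floor for unseen pairs."""
    return BIGRAM.get((a, b), floor)

def detect_boundaries(words):
    """Return gap indexes i where <s> is inserted after words[i]."""
    boundaries = []
    for i in range(len(words) - 1):
        with_event = p(words[i], "<s>") * p("<s>", words[i + 1])
        without = p(words[i], words[i + 1])
        if with_event > without:
            boundaries.append(i)
    return boundaries
```

On the word stream "stocks fell sharply yesterday the market", the toy model inserts a boundary only after "yesterday", because the event reading outscores the direct bigram there.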
Sentence boundary detection is a problem that has received limited attention in the text-based computational linguistics community (Schmid, 2000; Palmer and Hearst, 1994; Reynar and Ratnaparkhi, 1997), but which has recently acquired renewed importance through an effort by the DARPA EARS program (DARPA Information Processing Technology Office, 2003) to improve automatic speech transcription technology. Since standard speech recognizers output an unstructured stream of words, improving transcription means not only that word accuracy must be improved, but also that commonly used structural features such as sentence boundaries need to be recognized. The task is thus fundamentally based on both acoustic and textual (via automatic word recognition) information. From a computational linguistics point of view, sentence units are crucial and assumed in most of the further processing steps that one would want to apply to such output: tagging and parsing, information extraction, and summarization, among others.
Despite the important role of sentence boundary detection in NLP, this area has not received enough attention so far. The existing approaches for this task are confined to formal texts, and to the best of our knowledge no studies have been conducted on noisy texts for this task. In the FinSBD shared task, the focus is to detect the beginning and ending boundaries for extracting well-segmented sentences from financial texts. These financial texts are PDF documents in which investment funds precisely describe their characteristics and investment modalities. The noisy unstructured text from these PDF files was parsed by the shared task organizers, and the task is to transform it into semi-structured text by tagging the sentence boundaries in two languages, English and French. For example, consider the English sentence “Subscriptions may only be received on the basis of this Prospectus.”. Here the
Sentence Boundary Detection (SBD) is an important fundamental task in any Natural Language Processing (NLP) application, because errors tend to propagate to higher-level tasks and because the obviousness of SBD errors can lead users to question the correctness and value of an entire product. While SBD is regarded as a solved problem in many domains, legal text presents unique challenges. The remainder of this paper describes those challenges and evaluates three approaches to the task, including a modification to a commonly used semi-supervised and rule-based library as well as two supervised sequence labeling approaches. We find that a fully supervised approach is superior to the semi-supervised rule library.
Sentence alignment is a task that consists in aligning the parallel sentences in a translated article pair, which are crucial for machine translation (MT) (Koehn et al., 2003). Previous studies first split the source and target articles into sentences using punctuation information, and then align the source and target sentences based on sentence length and/or bilingual lexicons (Ma, 2006). However, the monolingually determined sentence boundaries are not optimized for sentence alignment, because translation equivalents might cross the monolingual sentence boundaries. In this paper, we propose a method to perform sentence boundary detection and alignment simultaneously, which significantly improves the alignment accuracy.
Sentence Boundary Detection (SBD) is not widely counted among the grand challenges in NLP. Even though there were comparatively few studies on SBD in the past decades, the assessment of extant techniques for English is hindered by variation in the task definition, choice of evaluation metrics, and test data used. Furthermore, two development trends in NLP pose new challenges for SBD, viz. (a) a shift of emphasis from formal, edited text towards more spontaneous language samples, e.g. Web content; and (b) a gradual move from ‘bare’ ASCII to rich text, exploiting the much wider Unicode character range as well as mark-up of text structure. The impact of such textual variation on SBD is hardly explored, and off-the-shelf technologies may perform poorly on text that is not very newswire-like, i.e. different from the venerable Wall Street Journal (WSJ) collection of the Penn Treebank (PTB; Marcus et al., 1993). In this work, we seek to provide a comprehensive, up-to-date, and fully reproducible assessment of the state of the art in SBD. In much NLP research, the ‘sentence’ (in a suitable interpretation; see below) is a foundational unit, for example in aligning parallel texts; PoS tagging; syntactic, semantic, and discourse parsing; or machine translation. Assuming gold-standard sentence boundaries (and possibly tokenisation), as provided by standard data sets like the PTB, has been common practice for many isolated studies. However, strong effects of error propagation must be expected in standard NLP pipelines, for example of imperfect SBD into morpho-syntactic, semantic, or discourse analysis (Walker et al., 2001; Kiss and Strunk, 2002). For these reasons, we aim to determine (a) what levels of performance can be expected from extant SBD techniques; (b) to which degree SBD performance is sensitive to variation in text types; and (c) whether there are relevant differences in observed behavior across different SBD approaches.
Our own motivation in this work is twofold: First, working in the context of semi-automated parser adaptation to domain and genre variation, we would hope to encourage a shift of emphasis towards parsing as an end-to-end task, i.e. taking as its point of departure the running text of a document collection rather than idealized resources comprised of ‘pure’ text with manually annotated, gold-standard sentence and token boundaries. Second, in preparing new annotated language resources (encompassing a broader range of different text types), we wish to identify and adapt extant preprocessing tool(s) that are best suited to our specific needs, both to minimize the need for correction in manual annotation and to maximize quality in automatically produced annotations. Finally, we felt prompted into systematizing this work by a recent query (for SBD technology, a recurrent topic) to the CORPORA mailing list, where a
Sentence Boundary Detection is a basic requirement in Natural Language Processing and remains a challenge to language processing for specific purposes, especially with noisy source documents. In this paper, we deal with the processing of scanned financial prospectuses with a feature-oriented and knowledge-enriched approach. Feature engineering and knowledge enrichment are conducted with the participation of domain experts and for the detection of sentence boundaries in both English and French. Two versions of the detection system are implemented, with a Random Forest Classifier and a Neural Network. We engineer a fused feature set of punctuation, digital number, capitalization, acronym, letter and POS tag for model fitting. For knowledge enhancement, we implement a rule-based validation by extracting a keyword dictionary from the out-of-vocabulary sequences in FinSBD’s datasets. Bilingual training on both English and French training sets is conducted to ensure the multilingual robustness of the system and to extend the relatively small training data. Without using any extra data, our system achieves fair results on both tracks in the shared task. Our results (English: F1-Mean
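The fused feature set named above (punctuation, digital number, capitalization, acronym, letter) can be sketched as simple per-token indicators. The exact encodings, the POS feature, and the classifier wiring are our own assumptions, not the system's actual implementation:

```python
import re

def token_features(tok):
    """Binary surface features for one token, per the feature set above."""
    return {
        "is_punct": int(bool(re.fullmatch(r"\W+", tok))),           # punctuation
        "has_digit": int(any(c.isdigit() for c in tok)),            # digital number
        "is_capitalized": int(tok[:1].isupper()),                   # capitalization
        "is_acronym": int(bool(re.fullmatch(r"(?:[A-Z]\.?){2,}", tok))),  # acronym
        "is_letter": int(len(tok) == 1 and tok.isalpha()),          # single letter
    }
```

Rows of such features for a token and its neighbours, together with POS tags, would then be fed to, e.g., a Random Forest classifier to predict whether the token starts or ends a sentence.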
In spite of its important role for language processing, sentence boundary detection has so far not received enough attention. Previous research in the area has been confined to formal texts only, and either has not addressed the process of SBD directly (Brill, 1994; Collins, 1996), or has not addressed the performance-related issues of sentence boundary detection (Cutting et al., 1992). In particular, no SBD research to date has addressed the problem in informal texts such as Twitter and Facebook posts. The growth of social media is a global phenomenon where people are communicating both using single languages and using mixes of several languages. Social media texts are informal in nature, and posts on Twitter and Facebook tend to be full of misspelled words, show extensive use of home-made acronyms and abbreviations, and contain plenty of punctuation applied in creative and non-standard ways. The punctuation markers are also often ambiguous in these types of texts, in particular between actually being used as punctuation and being used for emphasis, creating great challenges for sentence boundary detection.