The results show that document embeddings achieve strong performance on parallel document mining. On a test set mined from the web, all models achieve strong retrieval performance, the best being 91.4% P@1 for en-fr and 81.8% for en-es from the hierarchical document models. On the United Nations (UN) document mining task (Ziemski et al., 2016), our best model achieves 96.7% P@1 for en-fr and 97.3% P@1 for en-es, a 3%+ absolute improvement over the prior state-of-the-art (Guo et al., 2018; Uszkoreit et al., 2010). We also evaluate on a noisier version of the UN task where we do not have the ground-truth sentence alignments from the original corpus; an off-the-shelf sentence splitter is used to split each document into sentences. The results show that the HiDE model is robust to the noisy sentence segmentations, while the sentence-embedding averaging approach is more sensitive. We further analyze the robustness of our models under sentence-level embeddings of differing quality, and show that the
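As a minimal sketch of how a P@1 figure of this kind can be computed, the following assumes each source document embedding is compared against all candidate target embeddings by cosine similarity, counting a hit when the nearest neighbor is the aligned document; the toy vectors are invented purely for illustration and do not come from any of the models above.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def precision_at_1(src_embs, tgt_embs):
    # src_embs[i] and tgt_embs[i] are embeddings of a true document pair.
    # For each source document, retrieve the nearest target document and
    # count a hit when it is the aligned one.
    hits = 0
    for i, s in enumerate(src_embs):
        nearest = max(range(len(tgt_embs)), key=lambda j: cosine(s, tgt_embs[j]))
        hits += (nearest == i)
    return hits / len(src_embs)

# Toy embeddings: pairs 0 and 1 are retrieved correctly, pair 2 is not.
src = [[1.0, 0.0], [0.0, 1.0], [0.1, 1.0]]
tgt = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
print(precision_at_1(src, tgt))
```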
Traditional hierarchical clustering methods assume that documents are represented only by “technical information”, i.e., keywords, phrases, expressions, and named entities that can be directly extracted from the texts. However, in many scenarios there is additional, valuable information about a subset of the documents which is usually disregarded during clustering. Unlike technical information, this additional information consists of domain-specific features that are not explicitly available in the textual documents. High-level features such as user-validated tags, metadata, annotations and comments from experts, dictionaries, and domain ontologies are common examples of this additional information. Due to the high cost (in computational and human resources) of managing this information, as well as its availability for only a (small) subset of documents, this additional information has been named “privileged information” in the literature.
An initial cluster is constructed for each global closed frequent itemset (a set of common words): all documents containing that itemset are included in the same cluster. Since a document usually contains more than one global closed frequent itemset, the same document may appear in more than one initial cluster; in other words, initial clusters may overlap. The purpose of initial clusters is to ensure the property that all documents in a cluster contain all the items of the global frequent itemset that defines the cluster. We use the global closed frequent itemset as the cluster label to identify the cluster. The cluster label has two other functions. First, we use the cluster labels to build a hierarchical structure, called the cluster tree, which is the final result of document clustering. Second, meaningful cluster labels make browsing easier for users. We remove the overlap between clusters in subsection B. Fig. 6 shows the result of the initial clustering.
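The construction of initial clusters described above can be sketched as follows; the documents and itemsets are invented for illustration, and mining the closed frequent itemsets themselves is assumed to have happened beforehand.

```python
def initial_clusters(docs, frequent_itemsets):
    # docs: {doc_id: set of words}; frequent_itemsets: list of frozensets.
    # A document joins the cluster of every itemset it fully contains, so
    # clusters may overlap; the itemset itself serves as the cluster label.
    clusters = {}
    for itemset in frequent_itemsets:
        members = {d for d, words in docs.items() if itemset <= words}
        if members:
            clusters[frozenset(itemset)] = members
    return clusters

docs = {
    "d1": {"data", "mining", "cluster"},
    "d2": {"data", "mining"},
    "d3": {"cluster", "tree"},
}
itemsets = [frozenset({"data", "mining"}), frozenset({"cluster"})]
print(initial_clusters(docs, itemsets))
```

Note that document d1 appears in both clusters, illustrating the overlap that the later step removes.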
Recently proposed models (Cheng and Lapata, 2016; Nallapati et al., 2017) employ hierarchical document encoders and even neural decoders, which are complex. Training such complex neural models with inaccurate binary labels is challenging. We observed in our initial experiments on one of our datasets that our extractive model (see Section 3.3 for details) overfits the training set quickly after the second epoch, which indicates the training set may not be fully utilized. Inspired by recent pre-training work in natural language processing (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018), our solution to this problem is to first pre-train the “complex” part (i.e., the hierarchical encoder) of the extractive model on unlabeled data and then learn to classify sentences with our model initialized from the pre-trained encoder. In this paper, we propose HIBERT, which stands for HIerarchical Bidirectional Encoder Representations from Transformers. We design an unsupervised method to pre-train HIBERT for document modeling. We apply the pre-trained HIBERT to the task of document summarization and achieve state-of-the-art performance on both the CNN/Dailymail and New York Times datasets.

2 Related Work
Abstract— Clustering is the division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. Similar documents are grouped together in a cluster if their cosine similarity exceeds a specified threshold. In this paper we mainly focus on document clustering and the measures used in hierarchical clustering. The hierarchical document clustering algorithm provides a natural way of distinguishing clusters and implementing the basic requirement of clustering: high within-cluster similarity and between-cluster dissimilarity.
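The threshold-based grouping mentioned above can be sketched as follows; this is a deliberately simplified flat grouping, not the full hierarchical algorithm, and the document vectors and threshold are invented for illustration.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length document vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def threshold_groups(vectors, threshold):
    # Greedy grouping: a document joins the first existing group whose
    # representative (first member) it is similar enough to; otherwise it
    # starts a new group.
    groups = []
    for i, v in enumerate(vectors):
        for g in groups:
            if cosine(v, vectors[g[0]]) >= threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return groups

vecs = [[1, 0], [0.95, 0.1], [0, 1]]
print(threshold_groups(vecs, 0.9))
```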
A word-level language model can only learn relationships between words within one sentence. For sentences in a document that discusses one or several specific topics, the words in the next sentence are chosen partially in accordance with the previous sentences. To model this kind of coherence across sentences, Le and Mikolov (2014) extend the word embedding learning network (Mikolov et al., 2013) to learn a paragraph embedding as a fixed-length vector representation of a paragraph or sentence. Li and Hovy (2014) propose a neural coherence model which employs distributed sentence representations and then predicts the probability that a sequence of sentences is coherent.
retrieve data from plain text cannot be applied here. In the proposed system, every document is represented as a point in a high-dimensional (HD) space: points that lie close to one another in the HD space can be classified into the same category. By comparing the documents in the dataset, we can conclude that the targeted documents are few in number.
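The idea that nearby points in the space share a category can be sketched as a nearest-neighbor assignment; the reference points, labels, and distance metric below are illustrative assumptions, not taken from the system described.

```python
import math

def euclidean(u, v):
    # Euclidean distance between two points in the HD space.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_category(point, labeled_points):
    # labeled_points: list of (vector, category). The document-point is
    # assigned the category of its closest labeled point in the space.
    vec, cat = min(labeled_points, key=lambda p: euclidean(point, p[0]))
    return cat

refs = [([0.0, 0.0], "sports"), ([5.0, 5.0], "finance")]
print(nearest_category([0.5, 0.2], refs))  # → sports
```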
gies to the summarization problem have been proposed (Lloret and Palomar, 2012; Tur and De Mori, 2011; Saggion and Poibeau, 2013). A distinction must be made among abstractive summarization, where the summary is composed of sentences that do not appear in the document but convey almost all of its meaning; extractive summarization, where the summary consists of a selection of the most salient sentences of the document; and mixed summarization, where summaries are generated by combining abstractive and extractive methods (See, Liu, and Manning, 2017). Due to the difficulty of developing good abstractive and mixed strategies, most approaches are extractive. These approaches are a good solution for some tasks, such as news summarization, because the journalistic writing style tends to concentrate the main information in a few sentences that usually appear at the beginning of the article. Regarding methodologies, due to the difficulty of obtaining training corpora of document-summary pairs to train supervised systems, most of the initial works were based on unsupervised techniques. This is the case for statistical word-feature extraction (Carbonell and Goldstein, 1998), the extraction of latent concepts by means of Latent Semantic Analysis (Deerwester et al., 1990), and graph-based approaches such as LexRank (Erkan and Radev, 2004), among others (Tur and De Mori, 2011; Lloret and Palomar, 2012). On the other hand, some systems based on supervised techniques were proposed once manually annotated training corpora had been built. This is the case for summarization based on Support Vector Machines (Begum, Fattah, and Ren, 2009) or Conditional Random Fields (Shen et al., 2007).
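A graph-based extractive approach in the spirit of LexRank can be sketched as follows; note that this simplification ranks sentences by degree centrality over a word-overlap similarity graph, whereas the original algorithm uses eigenvector centrality over cosine similarities, and the similarity function, threshold, and example sentences are all invented for illustration.

```python
def overlap_sim(a, b):
    # Word-overlap similarity between two tokenized sentences.
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(1, min(len(sa), len(sb)))

def extract_summary(sentences, k=1, threshold=0.3):
    # Connect sentences whose similarity exceeds a threshold and rank
    # them by degree centrality; the top-k sentences form the summary.
    tokens = [s.lower().split() for s in sentences]
    n = len(sentences)
    degree = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and overlap_sim(tokens[i], tokens[j]) >= threshold:
                degree[i] += 1
    ranked = sorted(range(n), key=lambda i: degree[i], reverse=True)
    return [sentences[i] for i in ranked[:k]]

doc = [
    "the market fell sharply on monday",
    "stocks in the market fell after the report",
    "the report blamed the market fall on rates",
    "unrelated sports news closed the broadcast",
]
print(extract_summary(doc, k=1))
```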
This document is a dissertation for a Doctor of Engineering degree in Artificial Intelligence that studies multi-level Deep Learning models with hierarchical loss functions for automatic plant identification. It also presents the key results obtained while addressing the general problem of fully automating the identification of plant species based exclusively on images. It describes the key findings of a four-year research path that started with a restricted scope, namely the identification of plants from Costa Rica using a morphometric approach that considers only images of fresh leaves (Chapter 4, Chapter 5). Species from other regions of the world were then included, still using hand-crafted feature extractors. A fundamental methodological turn was the subsequent use of Deep Learning techniques on images of any component of a plant (Chapter 6). We then studied the accuracy of a Deep Learning approach to identification based on datasets of images of fresh plants and compared it, for the first time, with datasets of herbarium sheet images (Chapter 7, Chapter 8). Additionally, because these are data-driven processes, it is critical to use statistically representative data; thus, potential biases in the creation and use of automatic plant identification datasets were found and characterized (Chapter 9). The feasibility of transfer learning between different regions of the world was also proven (Chapter 7). Even more importantly, it was demonstrated for the first time that herbarium sheets are a good resource for identifying plants mounted on herbarium sheets, which provides additional value and importance to herbaria around the globe (Chapter 7).
As the culmination of this research path, this document presents the results of developing novel multi-level classification architectures that use knowledge about higher taxonomic levels to carry out not only species identification but also family- and genus-level identifications. This last step responds to the research goals established for this dissertation but builds on groundwork that has already been published as a peer-reviewed paper (Carranza-Rojas et al. 2018). Finally, to improve the accuracy of species-level identifications, the last chapter of this document focuses on the creation of a hierarchical loss function based on known plant taxonomies, used to guide model optimization with prior knowledge of a given class hierarchy, such as genus or family (Chapter 11).
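The dissertation does not spell out the exact form of the hierarchical loss here, so the sketch below shows one plausible formulation only: genus- and family-level probabilities are obtained by summing species probabilities over the taxonomy, and the per-level negative log-likelihoods are combined with weights; the taxonomy, logits, and weights are all illustrative.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of raw scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def hierarchical_loss(logits, species_true, taxonomy, weights=(1.0, 0.5, 0.25)):
    # taxonomy maps species index -> (genus id, family id). Genus and
    # family probabilities are the sums of the probabilities of the
    # species they contain; the loss is a weighted sum of the negative
    # log-likelihood at each taxonomic level.
    probs = softmax(logits)
    genus_true, family_true = taxonomy[species_true]
    genus_p = sum(p for i, p in enumerate(probs) if taxonomy[i][0] == genus_true)
    family_p = sum(p for i, p in enumerate(probs) if taxonomy[i][1] == family_true)
    w_sp, w_ge, w_fa = weights
    return (-w_sp * math.log(probs[species_true])
            - w_ge * math.log(genus_p)
            - w_fa * math.log(family_p))

# Three species: 0 and 1 share a genus; all three share a family.
taxonomy = {0: ("g1", "f1"), 1: ("g1", "f1"), 2: ("g2", "f1")}
loss = hierarchical_loss([2.0, 1.0, 0.1], species_true=0, taxonomy=taxonomy)
print(round(loss, 4))
```

Because a wrong species prediction inside the correct genus is penalized less than one in a different genus, the prior taxonomic knowledge shapes the optimization, which is the intent described above.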
Term- or pattern-based approaches are used in information filtering to derive users' information needs from a collection of documents. The document collection and user interests are categorized under multiple topics. Latent Dirichlet Allocation (LDA) is applied to generate statistical models that represent multiple topics in a collection of documents. Topic models are widely utilized in the fields of machine learning and information retrieval. Selecting the most discriminative and representative patterns from the huge number of discovered patterns is a complex task. The Maximum matched Pattern-based Topic Model (MPBTM) is used to perform the information filtering process. Statistical and taxonomic features are used to organize the topic-model-based patterns. Document relevance is estimated using maximum matched patterns, i.e., patterns that are both discriminative and representative. The following drawbacks are identified in the existing system.
On the other hand, deep learning techniques have also been investigated for semantic indexing. Salakhutdinov and Hinton proposed a novel representation method for extending semantic indexing. This method uses a deep auto-encoder model, where the higher layer is encoded with binary codes and the lower layer is generated based on word-count vectors. They also introduced a constrained Poisson model to deal with documents of varying lengths. Mirowski improved the deep auto-encoder model by introducing a dynamic variable using gradient-based MAP inference. This dynamic variable is capable not only of calculating the encoder and decoder cross-entropy, but also of training classifiers together with document labels. Thus, this method can predict document categories, determine a better optimization objective function, improve document semantic indexing, and determine the number of steps required by the difference between the current and previous values of the dynamic variables. Also, the topic model complexity can be reduced to a cross-entropy loss function model. Wu introduced a deep architecture composed of restricted Boltzmann machines (RBMs), which exploits nonlinear embedding and is thus different from other DNNs, to compute the semantic representation of documents. In this low-dimensional semantic space, the nonlinear features embedded through the deep semantic embedding model can achieve a more compact representation. The model also uses discriminative fine-tuning and modifies the calculation of the rank scores of relevant and irrelevant documents. The indexing performance is thereby improved.
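The retrieval side of the binary-code idea above can be sketched as a Hamming-distance lookup; the binary codes would come from the trained auto-encoder's top layer, but here they, and the document names, are made up for illustration.

```python
def hamming(a, b):
    # Number of differing bits between two equal-length binary codes.
    return sum(x != y for x, y in zip(a, b))

def retrieve(query_code, doc_codes, k=2):
    # Semantic-hashing style lookup: documents whose binary codes are
    # closest to the query code in Hamming distance are returned first.
    ranked = sorted(doc_codes, key=lambda item: hamming(query_code, item[1]))
    return [doc_id for doc_id, _ in ranked[:k]]

codes = [("doc_a", [1, 0, 1, 1]),
         ("doc_b", [0, 1, 0, 0]),
         ("doc_c", [1, 0, 1, 0])]
print(retrieve([1, 0, 1, 1], codes))  # doc_a matches exactly, doc_c differs by one bit
```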
coverage scores. However, Figure 3 (b) and (c) show that in both the Seq2seq-baseline model and the Hierarchical-baseline model, most samples fall into the bottom-left area (low structural-compression and structural-coverage), and only about 13% and 7% of samples, respectively, fall into the top-right area. Figure 3 (d) shows that our system with structural regularization behaves similarly to human-made summaries (over 80% of samples fall into the top-right area). The results demonstrate that the structural-compression and structural-coverage properties are common in document summarization, but neither the seq2seq models nor the basic hierarchical encoder-decoder models are yet able to capture them properly.

5.3 Effects of Structural Regularization

The structural regularization based on our hierarchical encoder-decoder with hybrid attention model improves the quality of summaries in two respects: (1) the summary covers more salient information and contains very few repetitions, which can be seen both qualitatively (Table 1 and Figure 1) and quantitatively (Table 5 and Figure 4); (2) the model is able to shorten a long sentence to generate a more concise one, or to compress several different sentences into a more informative one by merging their information. Table 6 shows several examples of abstractive summaries produced by sentence compression in our model.
The Penn Discourse TreeBank (PDTB) (Prasad et al., 2008) and RST are the most commonly used frameworks for representing discourse structure. PDTB focuses on the relation between two sentences, and the annotated structure for a document is not necessarily a tree. In contrast, RST is forced to represent a document as a tree. Discourse parsers for both schemes are available (Hernault et al., 2010; Feng and Hirst, 2014; Wang and Lan, 2015). There are at least two methods for converting an RST-based tree structure into a dependency structure (Hirao et al., 2002; Li et al., 2014). Hayashi et al. (2016) compared these methods and noted that DEP-DT by Hirao et al. (2002) has an advantage when applied to summarization tasks. We use DEP-DT in this research since we focus on integrating the tree structure into a summarizer. We found only one model that jointly learns RST parsing and document summarization (Goyal and Eisenstein, 2016). They used the SampleRank algorithm (Wick et al., 2011), a stochastic structure prediction model, while our main focus is to take discourse structures into account in RNN-based summarizers.
techniques. We have tested two problem transformation methods, Binary Relevance and Label Powerset, combined with various classification algorithms, as well as an adaptation method, Multilabel k-Nearest Neighbor. All of them were evaluated using two text collections with different numbers of terms. The best results were obtained with the combination of Binary Relevance and Naive Bayes Multinomial. From the various experiments performed we can also conclude that this combination depends on neither the size of the collection nor the number of distinct terms used. We want to stress that these conclusions are only valid for the hierarchical multi-label document collection extracted from the ACM library.
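The winning combination can be sketched from first principles: Binary Relevance trains one independent binary classifier per label, here paired with a minimal multinomial Naive Bayes written out explicitly; the toy documents and labels are invented, and a real experiment would of course use a full-featured implementation.

```python
import math
from collections import Counter

class BinaryNB:
    # Minimal multinomial Naive Bayes for a single yes/no label,
    # with add-one smoothing on both word counts and the prior.
    def fit(self, docs, labels):
        self.counts = {True: Counter(), False: Counter()}
        priors = {True: 0, False: 0}
        for words, y in zip(docs, labels):
            priors[y] += 1
            self.counts[y].update(words)
        self.vocab = set(w for c in self.counts.values() for w in c)
        n = len(labels)
        self.log_prior = {y: math.log((priors[y] + 1) / (n + 2))
                          for y in (True, False)}
        return self

    def predict(self, words):
        scores = {}
        for y in (True, False):
            total = sum(self.counts[y].values()) + len(self.vocab)
            scores[y] = self.log_prior[y] + sum(
                math.log((self.counts[y][w] + 1) / total) for w in words)
        return scores[True] > scores[False]

def binary_relevance_fit(docs, label_sets, all_labels):
    # Binary Relevance: one independent binary classifier per label.
    return {lab: BinaryNB().fit(docs, [lab in ls for ls in label_sets])
            for lab in all_labels}

def binary_relevance_predict(models, words):
    return {lab for lab, m in models.items() if m.predict(words)}

train = [["loop", "array"], ["loop", "thread"], ["thread", "lock"]]
labels = [{"algorithms"}, {"algorithms", "concurrency"}, {"concurrency"}]
models = binary_relevance_fit(train, labels, {"algorithms", "concurrency"})
print(binary_relevance_predict(models, ["loop", "array"]))
```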
Document-level machine translation (MT) remains challenging due to the difficulty of efficiently using document context in translation. In this paper, we propose a hierarchical model to learn the global context for document-level neural machine translation (NMT). This is done through a sentence encoder that captures intra-sentence dependencies and a document encoder that models document-level inter-sentence consistency and coherence. With this hierarchical architecture, we feed back the extracted global document context to each word in a top-down fashion to distinguish different translations of a word according to its specific surrounding context. In addition, since large-scale in-domain document-level parallel corpora are usually unavailable, we use a two-step training strategy that takes advantage of a large-scale corpus of out-of-domain parallel sentence pairs and a small-scale corpus of in-domain parallel document pairs to achieve domain adaptability. Experimental results on several benchmark corpora show that our proposed model can significantly improve document-level translation performance over several strong NMT baselines.
Exploring the hierarchy of the typology of relations, two hierarchical classifiers have been developed according to the Top-Down and Big-Bang approaches. For a comparison between these approaches, see Freitas & Carvalho (2007). In the Top-Down approach, a classifier is used at each level of the typology. For example, at the first level, a classifier is used to choose between the “content” and “form” groups. Supposing that “content” is the selected group, another classifier is then used to choose among the “redundancy”, “complement” and “contradiction” subgroups, and so on for each branch of the typology. When the lowest level of the hierarchy is reached, the process ends with the choice of a CST relation. Table 3 shows the results for each classifier produced according to the Top-Down approach, using the J48 technique. One may see, for instance, that the first classifier (classifier A) decides whether a sentence pair contains a “content” or a “form” relation with f-measures of 95.3% and 48.8% for these classes, respectively. All average f-measures were higher than 45%. This demonstrates the potential of the approach to identify relations and corroborates that the CST typology makes sense. However, some relations still produced very low results, such as Modality, Translation and Summary.
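The Top-Down routing described above can be sketched as follows; the node names come from the typology in the text, but the feature names and the decision rules standing in for trained classifiers are purely illustrative.

```python
def top_down_classify(pair_features, classifiers, root="root"):
    # Top-Down routing: at each internal node of the typology a dedicated
    # classifier picks one child, until a leaf (a CST relation) is reached.
    # `classifiers` maps a node name to a function (features -> child name);
    # leaves have no entry, which stops the descent.
    node = root
    while node in classifiers:
        node = classifiers[node](pair_features)
    return node

# Toy typology: root -> {content, form}; content -> {redundancy, complement}.
classifiers = {
    "root": lambda f: "content" if f["word_overlap"] > 0.5 else "form",
    "content": lambda f: "redundancy" if f["same_length"] else "complement",
}
print(top_down_classify({"word_overlap": 0.8, "same_length": True}, classifiers))
```

One consequence of this design, visible in the sketch, is that an error at an upper level (e.g., choosing “form” wrongly) can never be recovered lower down, which is the usual trade-off against the Big-Bang approach.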
Hierarchical P2P Networks: In this type of network, nodes are classified into two categories: Peers and Super-Peers. The Super-Peers (powerful nodes) play a specific role and have greater processing power and larger bandwidth than the other Peers (normal nodes) of the network. The Super-Peer model introduces a hierarchy between a Super-Peer and the Peers connected to it. The Super-Peers work among themselves in P2P mode, while within a group a Super-Peer and its Peers work in classic client-server mode (Fig. 3). The hierarchical model has the advantage of combining both types of systems (centralized and decentralized). A Super-Peer acts as a centralized repository on behalf of a set of Peers. Routing in Super-Peer networks is more effective than in pure P2P networks because routing is limited to the Super-Peer overlay. This solution solves the problem of
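The two-mode lookup described above can be sketched as follows; the class, file names, and peer names are invented for illustration, and a real system would forward queries over the network rather than iterate over a local list.

```python
class SuperPeer:
    # Each Super-Peer holds a centralized index of the files shared by the
    # ordinary Peers attached to it; lookups are forwarded only among
    # Super-Peers, never flooded to ordinary Peers.
    def __init__(self, name):
        self.name = name
        self.index = {}          # file name -> owning peer

    def register(self, peer, files):
        for f in files:
            self.index[f] = peer

    def lookup(self, filename, super_peers):
        if filename in self.index:        # client-server step inside the group
            return self.name, self.index[filename]
        for sp in super_peers:            # P2P step among Super-Peers only
            if sp is not self and filename in sp.index:
                return sp.name, sp.index[filename]
        return None

sp1, sp2 = SuperPeer("sp1"), SuperPeer("sp2")
sp1.register("peer_a", ["song.mp3"])
sp2.register("peer_b", ["paper.pdf"])
print(sp1.lookup("paper.pdf", [sp1, sp2]))  # found via the other Super-Peer
```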
As shown in Table 1, although the method using Wikipedia-based relatedness outperforms the one using cooccurrence-based relatedness, the improvement is not prominent. Wikipedia-based relatedness is computed from global statistical information on Wikipedia; it is therefore more precise than cooccurrence-based relatedness, which is reflected in the keyphrase extraction performance. On the other hand, Wikipedia-based relatedness does not capture document-specific relatedness, which is represented by the cooccurrence-based relatedness. Combining these two types of relatedness measures would be interesting future work.