A common approach to statistical model selection – particularly in scientific domains in which it is of interest to draw inferences about an underlying phenomenon – is to develop powerful procedures that provide control on false discoveries. Such methods are widely used in inferential settings involving variable selection, graph estimation, and others in which a discovery is naturally regarded as a discrete concept. However, this view of a discovery is ill-suited to many model selection and structured estimation problems in which the underlying decision space is not discrete. We describe a geometric reformulation of the notion of a discovery, which enables the development of model selection methodology for a broader class of problems. We highlight the utility of this viewpoint in problems involving subspace selection and low-rank estimation, with a specific algorithm to control for false discoveries in these settings. Concepts from algebraic geometry (e.g., tangent spaces to determinantal varieties) play a central role in the proposed framework.

Chapter 4 - Latent-Variable Graphical Modeling: Beyond Gaussianity

The algorithm to fit a latent-variable graphical model to reservoir volumes in Chapter 2 is appropriate when the variables are Gaussian. In many scientific and engineering applications, the set of variables one wishes to model deviates strongly from Gaussianity. Existing techniques to fit a graphical model to data suffer from one or more of these deficiencies: a) they are unable to handle non-Gaussianity, b) they are based on non-convex or computationally intractable algorithms, and c) they cannot account for latent variables. We develop a framework, based on Generalized Linear Models, that addresses all these shortcomings and can be efficiently optimized to obtain provably accurate estimates.
A particularly novel aspect of our formulation is that it incorporates regularizers tailored to the type of latent variables: the nuclear norm for Gaussian latent variables, the max-2 norm for Bernoulli variables, and the completely positive norm for Exponential variables. For each case, we provide a semidefinite relaxation and demonstrate that the associated norm yields better sample complexity (than the nuclear norm) for similar computational cost. We further demonstrate the utility of our approach with data involving U.S. Senate voting records.
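As a concrete illustration of how a nuclear-norm regularizer promotes the low-rank structure associated with Gaussian latent variables, the norm's proximal operator soft-thresholds singular values. This is a sketch of the general mechanism, not the paper's algorithm; the function name and data are ours.

```python
import numpy as np

def nuclear_prox(M: np.ndarray, tau: float) -> np.ndarray:
    """Proximal operator of tau * ||.||_* : soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(0)
# A rank-2 matrix plus small noise; the prox step zeroes the small
# singular values contributed by the noise, recovering low rank.
low_rank = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 6))
noisy = low_rank + 0.01 * rng.normal(size=(6, 6))
denoised = nuclear_prox(noisy, tau=0.1)
```

Each surviving singular value is shrunk by exactly `tau`, which is why the nuclear norm acts as a convex surrogate for rank.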
We have discussed a number of algorithms for the problem of inferring topics using LDA. We have shown how to extend a batch algorithm into a series of online algorithms, each more flexible than the last. Our results demonstrate that these algorithms perform effectively in recovering the topics used in multiple datasets. Latent Dirichlet allocation has been extended in a number of directions, including incorporation of hierarchical representations (Blei et al., 2004), minimal syntax (Griffiths et al., 2004), the interests of authors (Rosen-Zvi et al., 2004), and correlations between topics (Blei and Lafferty, 2005). We anticipate that the algorithms we have outlined here will naturally generalize to many of these models. In particular, applications of the hierarchical Dirichlet process to text (Teh et al., 2006) can be viewed as an analogue to LDA in which the number of topics is allowed to vary. Our particle filtering framework requires little modification to handle this case, providing an online alternative to existing inference algorithms for this model.
Neither Gibbs sampling nor variational inference in their original formulations scales well to large corpora of millions of documents. Training time exceeds practical limits for more complex and nonparametric LDA extensions. In the past decade, significant research effort has been devoted to addressing this concern. Online (Hoffman et al., 2010), distributed (Newman et al., 2008), and both online and distributed (Broderick et al., 2013) algorithms have been developed. Online algorithms for the Hierarchical Dirichlet Process (HDP) (Teh et al., 2006), a nonparametric counterpart of LDA, have also been proposed (Wang et al., 2011; Bryant & Sudderth, 2012). Both classes of inference algorithms (i.e., sampling and variational inference), their virtues notwithstanding, are known to exhibit certain deficiencies, which also propagate through their modern extensions. The problem can be traced back to the need to approximate or sample from the posterior distributions of the latent variables representing the topic labels. Since these latent variables are not geometrically intrinsic (any permutation of the labels yields the same likelihood), the manipulation of these redundant quantities tends to slow down the computation and compromise learning accuracy.
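The label non-identifiability mentioned above can be demonstrated directly with a toy mixture-of-unigrams "topic model" (all numbers below are made up): permuting the topic labels, together with their mixing weights, leaves the likelihood of every document unchanged.

```python
import itertools
import math

topics = [
    [0.7, 0.2, 0.1],   # topic 0: distribution over a 3-word vocabulary
    [0.1, 0.3, 0.6],   # topic 1
]
weights = [0.6, 0.4]   # mixing proportions

def doc_likelihood(words, topics, weights):
    """p(doc) = sum_k pi_k * prod_w p(w | topic k)."""
    return sum(
        pi * math.prod(t[w] for w in words)
        for pi, t in zip(weights, topics)
    )

doc = [0, 2, 2, 1]
base = doc_likelihood(doc, topics, weights)
# Every relabeling of the topics yields exactly the same likelihood.
for perm in itertools.permutations(range(len(topics))):
    assert math.isclose(
        doc_likelihood(doc, [topics[i] for i in perm],
                       [weights[i] for i in perm]),
        base,
    )
```

This is the redundancy that posterior samplers and variational approximations must implicitly navigate, which is the source of the slowdown the passage describes.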
Broadly speaking, continuous latent-variable models are useful for problems where data lie close to a manifold of much lower dimensionality. By using continuous latent variables, we can express inherent unobserved structure (assuming it exists) in the data with significantly fewer latent variables, and therefore these latent-variable models play a key role in the statistical formulation of many dimensionality reduction techniques. Many widely used pattern recognition techniques can be understood in this framework: probabilistic principal component analysis (PCA) (Tipping & Bishop, 1999; Roweis, 1998), the Kalman filter, and others. In addition, as Tipping & Bishop (1999) have pointed out, many non-probabilistic methods can be well understood as restricted cases of a continuous latent-variable model: for example, independent component analysis and factor analysis (Spearman, 1904), which describe variability among observed, correlated variables. By contrast, discrete latent-variable models assume discreteness in the unobserved space. This discreteness naturally implies that random draws from this space have a finite probability of repetition, which is one reason why discrete latent models are widely used to express inherent groupings and similarities that underlie the data. They have played a key role in the probabilistic formulation of clustering techniques. However, computationally inferring (learning) such models from data is far more challenging than for continuous latent-variable models, and in its full generality clustering is a combinatorial (NP-hard) problem. This often restricts discrete latent models to applications in which computational resources and time for inference are plentiful.
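Probabilistic PCA, the first example above, is a case where the continuous latent structure admits a closed-form maximum-likelihood solution from the sample covariance (Tipping & Bishop, 1999). The sketch below follows that construction; the variable names and synthetic data are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data near a low-dimensional linear manifold: x = W z + noise, z ~ N(0, I).
d, q, n = 5, 2, 10_000
W_true = rng.normal(size=(d, q))
X = rng.normal(size=(n, q)) @ W_true.T + 0.1 * rng.normal(size=(n, d))

# Closed-form ML estimates from the eigendecomposition of the covariance:
# sigma^2 is the mean of the discarded eigenvalues, and W uses the top q.
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending
sigma2 = eigvals[q:].mean()                          # noise variance
W_ml = eigvecs[:, :q] @ np.diag(np.sqrt(eigvals[:q] - sigma2))

# The fitted model covariance W W^T + sigma^2 I should closely match S.
model_cov = W_ml @ W_ml.T + sigma2 * np.eye(d)
```

The point of the probabilistic formulation is exactly this: the "manifold" intuition becomes an explicit generative model whose parameters are estimable in closed form.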
A classical approach to learning latent-variable models is based on the maximum-likelihood principle and aims at finding the model that maximizes the likelihood of an observed dataset. However, maximizing the likelihood is a difficult task even for simple models: Arora et al. (2012), for example, proved that maximum-likelihood estimation for the single topic model is NP-hard, while Roch (2006) proved the same for latent trees. Consequently, several heuristics have been developed that aim to solve the maximum-likelihood problem approximately. Expectation Maximization (EM; Dempster et al., 1977), for example, is an iterative technique that, starting from a user-defined initialization (typically random), iteratively updates the parameters of a model, increasing the likelihood at each iteration. EM has only local guarantees of convergence, but it is easy to understand and to implement, which makes it the most widely used technique in this field. Furthermore, EM requires the complete calculation of the posterior distributions, which may not be feasible for models with a complex structure; in this case it is common to use approximate inference approaches based on variational inference or Monte Carlo methods (see Bishop 2006 for a detailed presentation of these techniques). Approximate inference techniques are widely used in tasks like topic modeling (Blei et al., 2003; Griffiths and Steyvers, 2004) and mixture-model learning (Marin et al., 2005). Thanks to their flexibility, they make it possible to deal with complex models, but, unlike methods of moments, they do not provide provably optimal models within small running times. Methods of moments and likelihood-based techniques pursue the same goal – retrieving the parameters of a model – via two different frameworks.
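The E/M iteration described above can be made concrete with a minimal example: a two-component one-dimensional Gaussian mixture with known unit variances. This is an illustrative sketch, not any specific paper's algorithm; note the likelihood never decreases across iterations, which is EM's defining guarantee.

```python
import math
import random

random.seed(0)
data = ([random.gauss(-2, 1) for _ in range(200)]
        + [random.gauss(3, 1) for _ in range(200)])

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def log_likelihood(data, pi, mu0, mu1):
    return sum(math.log(pi * normal_pdf(x, mu0)
                        + (1 - pi) * normal_pdf(x, mu1)) for x in data)

pi, mu0, mu1 = 0.5, -1.0, 1.0          # user-defined initialization
prev = log_likelihood(data, pi, mu0, mu1)
for _ in range(25):
    # E-step: posterior responsibility of component 0 for each point.
    r = [pi * normal_pdf(x, mu0)
         / (pi * normal_pdf(x, mu0) + (1 - pi) * normal_pdf(x, mu1))
         for x in data]
    # M-step: re-estimate parameters from the responsibilities.
    pi = sum(r) / len(data)
    mu0 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
    mu1 = sum((1 - ri) * x for ri, x in zip(r, data)) / (len(data) - sum(r))
    cur = log_likelihood(data, pi, mu0, mu1)
    assert cur >= prev - 1e-9           # likelihood is non-decreasing
    prev = cur
```

The "only local guarantees" caveat is visible here too: a poor initialization (e.g., both means started far on the same side) can converge to an inferior stationary point.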
Instead of placing a prior distribution explicitly on the number of mixture components when this quantity is unknown, another predominant approach is to place a Bayesian nonparametric prior on the mixing measure G, resulting in infinite mixture models. Bayesian nonparametric models such as Dirichlet process mixtures and their variants have remarkably extended the reach of mixture modeling into a vast array of applications, especially areas where the number of mixture components is very large and difficult to fathom, or where it is a quantity of only tangential interest. For instance, in topic modeling applications on web-based text corpora, one may be interested in the most "popular" topics, while the number of topics is less meaningful (Blei et al., 2003; Teh et al., 2006; Nguyen, 2015; Yurochkin et al., 2017). DP mixtures and variants can also serve as an asymptotically optimal device for estimating the population density, under standard conditions on the true density's smoothness; see, e.g., Ghosal and van der Vaart (2001, 2007); Shen et al. (2013); Scricciolo (2014).
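The standard device behind such a prior on the mixing measure G is the stick-breaking construction (Sethuraman's representation of the Dirichlet process): mixing weights are generated one at a time, without fixing the number of components in advance. The sketch below truncates the infinite sequence for illustration; names are ours.

```python
import random

random.seed(0)

def stick_breaking(alpha: float, truncation: int):
    """Draw (truncated) DP mixing weights: break Beta(1, alpha) fractions
    off the remaining length of a unit stick."""
    weights, remaining = [], 1.0
    for _ in range(truncation):
        v = random.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights

w = stick_breaking(alpha=1.0, truncation=100)
# Weights are positive and sum to just under one; smaller alpha
# concentrates mass on fewer components.
```

This is why "the number of components" need not be specified: the prior simply assigns geometrically decaying mass to an unbounded list of them.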
Latent-variable methods provide a flexible approach for complex modeling of correlation in longitudinal studies. We have discussed methods for using latent variables to aggregate multiple ultrasound measurements and model fetal growth and development during the first and second trimesters. These results are particularly important to researchers who use ultrasounds to date pregnancies while assuming that there is no measurable variability in fetal growth early in pregnancy. There is a general need to make latent-variable methods more familiar to biostatisticians by applying them to research areas in public health. Furthermore, with a solid understanding of the subject matter, the insights gained from an analysis using latent-variable methods can be effectively communicated to researchers in epidemiology and clinical disciplines. To make our methods more accessible, it is possible that, with some modification, the latent-variable mixture models described in papers two and three could be estimated using commercial software such as Mplus. Using available software would be particularly useful for journal articles intended for applied researchers in reproductive health.
It is worth mentioning that although the above theorem provides a generic representation of the solution to RegBayes, in practice we usually need to make additional assumptions in order to make either the primal or the dual problem tractable to solve. Since such assumptions could make the feasible space non-convex, additional caution is needed. For instance, mean-field assumptions lead to a non-convex feasible space (Wainwright and Jordan, 2008), and we can only apply convex analysis to the convex sub-problems within an EM-type procedure. More concrete examples will be provided later, alongside the development of various models. We should also note that the modeling flexibility of RegBayes comes with risks. For example, it might lead to inconsistent posteriors (Barron et al., 1999; Choi and Ramamoorthi, 2008). This paper focuses on presenting several practical instances of RegBayes, and we leave a systematic analysis of the Bayesian asymptotic properties (e.g., posterior consistency and convergence rates) for future work.
the i-th dimension of the latent variable z. Using the FB objective causes learning to give up trying to drive down the KL for dimensions of z that are already beneath the target rate. Pelsmaeker and Aziz (2019) conduct a comprehensive experimental evaluation of related methods and conclude that the FB objective achieves comparable or superior performance (in terms of both language modeling and reconstruction) in comparison with other top-performing methods, many of which are substantially more complex. We note that Pelsmaeker and Aziz (2019) experiment with a slightly different version of FB in which the threshold is applied to the entire KL term directly, rather than to each dimension's KL separately. We examine both versions here and refer to the single-threshold version as "FBP" and the multiple-threshold version (Eq. 1) as "FB". For both FB and FBP, we vary the target rate λ and report the setting with the best validation PPL and the setting with the best balance between PPL and reconstruction loss.

5. Aggressive training (He et al., 2019). He et al. (2019) observe that when posterior collapse occurs, the inference network often lags behind the generator during training. In contrast with the KL reweighting methods described above, He et al. (2019) propose an aggressive training schedule that iterates between multiple encoder update steps and one decoder update step to mitigate posterior collapse.
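The difference between the two thresholding variants above can be sketched as a value computation (function names are ours; in an actual VAE the clamp also stops gradients for terms already under the target):

```python
def fb_objective(kl_per_dim, target_rate):
    """FB: the threshold is applied to each dimension's KL separately (Eq. 1)."""
    return sum(max(k, target_rate) for k in kl_per_dim)

def fbp_objective(kl_per_dim, target_rate):
    """FBP: a single threshold is applied to the total KL."""
    return max(sum(kl_per_dim), target_rate)

kl = [0.1, 2.0, 0.05]            # toy per-dimension KL terms
fb = fb_objective(kl, 0.5)       # max-per-dim: 0.5 + 2.0 + 0.5 = 3.0
fbp = fbp_objective(kl, 0.5)     # total KL 2.15 already exceeds threshold
```

The toy values show why the variants can behave differently: under FBP a single active dimension can satisfy the aggregate threshold while other dimensions remain collapsed, whereas FB keeps a floor under every dimension individually.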
Several methodological considerations of the current study merit attention. First, we focused on distinguishing between mental strategies, defined as nothing written down on paper, and written strategies, in which something was written down on paper, ranging from intermediate answers to procedural algorithms. Note that this is a categorization of strategies similar to that in Siegler and Lemaire's (1997) original choice/no-choice study, in which they distinguished between using a calculator, using mental arithmetic, and using pencil and paper (experiment 3). Although this is arguably a rough classification, and other categorizations (for example, with respect to the number of solution steps) are conceivable, we chose this strategy split for two reasons. First, earlier studies into strategy use on complex division by Dutch sixth graders showed that both strategy types were used and that they had strong predictive power for the accuracy of solutions (Hickendorff et al., 2009b, 2010). Second, didactical practice in the Netherlands – with the disappearance of the traditional algorithm and many different informal strategies – creates obstacles to studying the characteristics of different strategies in a choice/no-choice design. That is, if students are forced to use a particular strategy in the no-choice conditions, they should have those strategies in their repertoire, and many students did not receive instruction in, for example, the traditional algorithm.
scribed above can compute T [((a ? b) ? c) ? ] to find out what features fire on abc and its suffixes. One simple feature template performs "vowel/consonant backoff"; e.g., it maps abc to the feature named VCC. Fig. 2 showed the result of applying several actual feature templates to the window shown in Fig. 1. The extended regular expression calculus provides a flexible and concise notation for writing down these FSTs. As a trivial example, the trigram "vowel/consonant backoff" transducer can be described as T = V V V, where V is a transducer that performs backoff on a single alignment character. Feature templates should make it easy to experiment with adding various kinds of linguistic knowledge. We have additional algorithms for compiling U θ from a set of arbitrary feature templates, 25 in-
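The "vowel/consonant backoff" mapping described above (abc ↦ VCC) is simple enough to sketch directly in plain code; the `VOWELS` set and function name are our own simplification of the single-character transducer V, not the paper's FST implementation.

```python
# Each character of the window is backed off to V (vowel) or C (consonant),
# so the trigram "abc" yields the feature name "VCC".
VOWELS = set("aeiou")

def vc_backoff(window: str) -> str:
    return "".join("V" if ch in VOWELS else "C" for ch in window)

assert vc_backoff("abc") == "VCC"
```

In the paper's formulation the same mapping is composed as a transducer T = V V V, which is what lets it run over all suffixes of a window at once.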
Q3 Does the overall sentiment of the language used to describe men and women differ? To answer these questions, we introduce a generative latent-variable model that jointly represents adjective (or verb) choice, with its sentiment, given the natural gender of a head (or dependent) noun. We use a form of posterior regularization to guide inference of the latent variables (Ganchev et al., 2010). We then use this model to study the syntactic n-gram corpus of Goldberg and Orwant (2013). To answer Q1, we conduct an analysis that reveals differences between descriptions of male and female nouns that align with common gender stereotypes captured by previous human judgements. When using our model to answer Q2, we find that adjectives used to describe women are more often related to their bodies (significant under a permutation test with p < 0.03) than adjectives used to describe men (see Fig. 1 for examples). This finding accords with previous research (Norberg, 2016). Finally, in answer to Q3, we find no significant difference in the overall sentiment of the language used to describe men and women.

2 What Makes this Study Different?
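The permutation test behind the Q2 significance claim follows a standard recipe, sketched generically below: shuffle the group labels many times and see how often a label-shuffled difference matches or exceeds the observed one. The counts here are synthetic placeholders, not the paper's data.

```python
import random

random.seed(0)

def permutation_pvalue(group_a, group_b, n_perm=10_000):
    """One-sided permutation p-value for a difference in proportions."""
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = group_a + group_b
    count = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if sum(a) / len(a) - sum(b) / len(b) >= observed:
            count += 1
    return count / n_perm

# 1 = adjective tagged as body-related, 0 = not; toy counts only.
women = [1] * 30 + [0] * 70
men = [1] * 15 + [0] * 85
p = permutation_pvalue(women, men)
```

Because the null distribution is built by shuffling the actual data, the test makes no parametric assumption about how the adjective counts are distributed.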
Item Response Theory (IRT) (Baker and Kim 2004; van der Linden and Hambleton 1997) considers a class of latent-variable models that link mainly dichotomous and polytomous manifest (i.e., response) variables to a single latent variable. The main applications of IRT can be found in educational testing, in which analysts are interested in measuring examinees' ability using a test that consists of several items (i.e., questions). Several models and estimation procedures have been proposed that deal with various aspects of educational testing.
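One common instance of such a model (not necessarily the one used in the works cited) is the two-parameter logistic (2PL) model, in which the probability of a correct response depends on the examinee's latent ability θ and the item's discrimination a and difficulty b:

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: P(correct) = 1 / (1 + exp(-a (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An examinee whose ability equals the item's difficulty answers correctly
# with probability 0.5; higher ability raises the probability.
assert abs(p_correct(0.0, a=1.2, b=0.0) - 0.5) < 1e-12
assert p_correct(1.0, a=1.2, b=0.0) > 0.5
```

The latent variable θ is the single unobserved quantity linking all of an examinee's dichotomous responses, which is exactly the structure the passage describes.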
The effects of instruction on performance were also investigated: both direct effects and effects occurring indirectly through strategy use. The indirect effects can occur because of the large accuracy differences between strategies: as in previous research (Hickendorff, 2013; Hickendorff et al., 2009, 2010; Van Putten, 2005), written strategies were found to be much more accurate than mental strategies. This was not only the case when potentially biasing strategy selection effects (Siegler & Lemaire, 1997) of student and problem characteristics were statistically corrected for (Chapters 3 and 5), but also when they were eliminated through experimental design (with the choice/no-choice design of Siegler & Lemaire, 1997; Chapter 4). Within written strategies, the digit-based and whole-number-based algorithms were found to be comparable in accuracy, while non-algorithmic approaches appeared less accurate than the algorithms (as discussed previously). This suggests that while attention to informal strategies may be very fruitful in earlier stages of the educational process (Treffers, 1987b), performance may benefit from a focus on standardized procedures at the end of the instructional trajectory. This may be especially relevant to students with lower mathematical ability, who appear to benefit less from freer forms of instruction with attention to multiple solution strategies than from more direct forms of instruction (Royal Netherlands Academy of Arts and Sciences, 2009).
Rheumatoid arthritis is a complex disease that appears to involve multiple genetic and environmental factors. Using the Genetic Analysis Workshop 15 simulated rheumatoid arthritis data and the structural equation modeling framework, we tested hypothesized "causal" rheumatoid arthritis model(s) by employing a novel latent gene construct approach that models individual genes as latent variables defined by multiple dense and non-dense single-nucleotide polymorphisms (SNPs). Our approach produced valid latent gene constructs, particularly with dense SNPs, which, when coupled with other factors involved in rheumatoid arthritis, were able to generate well-fitting models by certain goodness-of-fit indices. We observed that Genes F, C, and DR, sex, and smoking were significant predictors of rheumatoid arthritis but Genes A and E were not, which was generally, but not entirely, consistent with how the data were simulated. Our approach holds promise for unravelling complex diseases and improves upon current "one SNP (haplotype)-at-a-time" regression approaches by decreasing the number of statistical tests while minimizing problems with multicollinearity and haplotype-estimation algorithm error. Furthermore, when genes are modeled as latent constructs simultaneously with other key cofactors, the approach provides enhanced control of confounding, which should lead to less biased effect estimates among genes as well as between gene(s) and the complex disease. However, further study is needed to quantify bias, evaluate fit index disparity, and resolve multiplicative latent gene interactions. Moreover, because some a priori biological information is needed to form an initial substantive model, our approach may be most appropriate for candidate gene SNP panel applications.
Shallow parsing is one of many NLP tasks that can be reduced to a sequence labeling problem. In this paper we show that latent dynamics (i.e., hidden substructure of shallow phrases) constitute a problem in shallow parsing, and we show that modeling this intermediate structure is useful. By analyzing the automatically learned hidden states, we show how the latent conditional model explicitly learns latent dynamics. We propose the Best Label Path (BLP) inference algorithm, which is able to produce the most probable label sequence on latent conditional models. It outperforms two existing inference algorithms. With BLP inference, the LDCRF model significantly outperforms CRF models on word features, and achieves performance comparable to the most successful shallow parsers on the CoNLL data when part-of-speech features are also used.
Statistical models involving latent variables are widely used in many areas of application, such as biomedical science and social science. When likelihood-based parametric inferential methods are used to make statistical inference, certain distributional assumptions on the latent variables are often invoked. As latent variables are not observable, parametric assumptions on the latent variables cannot be verified directly using observed data. Even though semiparametric and nonparametric approaches have been developed to avoid making strong assumptions on the latent variables, parametric inferential approaches are still more appealing in many situations in terms of consistency and efficiency in estimation, and computational burden. The goals of our study are to gain insight into the sensitivity of statistical inference to model assumptions on latent variables, and to develop methods for diagnosing latent-model misspecification so as to reveal whether the parametric inference is robust under certain latent-model assumptions. We refer to such robustness as latent-model robustness.
subjects, involvement of drug use, involvement of genetic analysis, and the duration of PI training did not differ significantly between the group of applications that were approved and those that failed to receive approval. There were significantly more non-approvals for applications from PIs within our organization than for applications from PIs from outside our institution (p = 0.014). The administration time required for checking by the secretary of our IRB was longer in the group of applications that were not approved (p = 0.008) than in the group that received approval. The revision frequency in the group of applications that were not approved was significantly higher than in the group of applications that were approved (p < 0.001). The total review time was significantly longer in the group of applications that were not approved than in the group of applications that were approved (p = 0.002) (Table 1).
It is important to note that, unlike spectral algorithms, the EM algorithm has an interpretation that is valid even when the data to which it is applied are not generated from an L-PCFG in the family we are estimating from. It can be viewed as minimizing the Kullback-Leibler (KL) divergence, a measure of distributional divergence, between the empirical distribution and the family of possible L-PCFGs from which a model is selected. To date, the theoretical guarantees of spectral algorithms for L-PCFGs require the assumption that the data are generated from an L-PCFG distribution. Still, the EM algorithm and spectral algorithms yield similar results on a variety of benchmarks for multilingual parsing, even for data that are clearly not sampled from an L-PCFG (as one might argue is true for most natural language data).
We describe a two-part method for this problem. The method (1) finds clusters of measured variables that are d-separated by a single unrecorded common cause, if such a cause exists; and (2) finds features of the Markov equivalence class of causal models for the latent variables. Assuming only a multiple-indicator structure and principles standard in Bayes net search algorithms – principles assumed satisfied in many domains, especially in the social sciences – the two procedures converge, with probability 1 in the large-sample limit, to correct information. The completeness of the information obtained about latent structure depends on how thoroughly confounded the measured variables are; but when, for each unknown latent variable, there in fact exist at least a small number of measured variables that are influenced only by that latent variable, the method returns the complete Markov equivalence class of the latent structure. To complement the theoretical results, we show by simulation studies, for several latent structures and for a range of sample sizes, that the method identifies the unknown structure more accurately than does factor analysis or a published greedy search algorithm. We also illustrate and compare the procedures with applications to social science cases, where expert opinions about measurement are reasonably firm, but are less so about causal relations among the latent variables.