A common approach to statistical model selection – particularly in scientific domains in which it is of interest to draw inferences about an underlying phenomenon – is to develop powerful procedures that provide control on false discoveries. Such methods are widely used in inferential settings involving variable selection, graph estimation, and others in which a discovery is naturally regarded as a discrete concept. However, this view of a discovery is ill-suited to many model selection and structured estimation problems in which the underlying decision space is not discrete. We describe a geometric reformulation of the notion of a discovery, which enables the development of model selection methodology for a broader class of problems. We highlight the utility of this viewpoint in problems involving subspace selection and low-rank estimation, with a specific algorithm to control false discoveries in these settings. Concepts from algebraic geometry (e.g. tangent spaces to determinantal varieties) play a central role in the proposed framework.

Chapter 4 - Latent Variable Graphical Modeling: Beyond Gaussianity

The algorithm to fit a latent-variable graphical model to reservoir volumes in Chapter 2 is appropriate when the variables are Gaussian. In many scientific and engineering applications, the set of variables one wishes to model deviates strongly from Gaussianity. Existing techniques for fitting a graphical model to data suffer from one or more of the following deficiencies: (a) they are unable to handle non-Gaussianity, (b) they are based on non-convex or computationally intractable algorithms, and (c) they cannot account for latent variables. We develop a framework, based on Generalized Linear Models, that addresses all of these shortcomings and can be efficiently optimized to obtain provably accurate estimates.
A particularly novel aspect of our formulation is that it incorporates regularizers tailored to the type of latent variables: the nuclear norm for Gaussian latent variables, the max-2 norm for Bernoulli variables, and the completely positive norm for exponential variables. For each case, we provide a semidefinite relaxation and demonstrate that the associated norm yields better sample complexity (than the nuclear norm) at similar computational cost. We further demonstrate the utility of our approach on data involving U.S. Senate voting records.
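The nuclear-norm regularizer mentioned above is typically handled through its proximal operator, singular value thresholding. The sketch below (function name and toy data are ours, not from the thesis) shows this basic building block of nuclear-norm-regularized low-rank estimation:

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of the
    nuclear norm, prox_{tau * ||.||_*}(M) = U max(S - tau, 0) V^T.
    Soft-thresholding the singular values shrinks M toward lower rank."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_thr = np.maximum(s - tau, 0.0)
    return U @ np.diag(s_thr) @ Vt

# Toy check on a rank-2 matrix: thresholding can only remove
# singular values, so the rank never increases.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 5))
A_low = svt(A, tau=1.0)
print(np.linalg.matrix_rank(A_low) <= np.linalg.matrix_rank(A))
```

In a full estimator, this operator would be applied inside a proximal-gradient loop on the likelihood term; the details depend on the specific model.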


We have discussed a number of algorithms for the problem of inferring topics using LDA. We have shown how to extend a batch algorithm into a series of online algorithms, each more flexible than the last. Our results demonstrate that these algorithms perform effectively in recovering the topics used in multiple datasets. Latent Dirichlet allocation has been extended in a number of directions, including incorporation of hierarchical representations (Blei et al., 2004), minimal syntax (Griffiths et al., 2004), the interests of authors (Rosen-Zvi et al., 2004), and correlations between topics (Blei and Lafferty, 2005). We anticipate that the algorithms we have outlined here will naturally generalize to many of these models. In particular, applications of the hierarchical Dirichlet process to text (Teh et al., 2006) can be viewed as an analogue to LDA in which the number of topics is allowed to vary. Our particle filtering framework requires little modification to handle this case, providing an online alternative to existing inference algorithms for this model.

Neither Gibbs sampling nor variational inference in their original formulations scales well to corpora of millions of documents, and training time exceeds practical limits for more complex and nonparametric LDA extensions. In the past decade, significant research effort has been devoted to this concern: online (Hoffman et al., 2010), distributed (Newman et al., 2008), and combined online and distributed (Broderick et al., 2013) algorithms have been developed. Online algorithms for the Hierarchical Dirichlet Process (HDP) (Teh et al., 2006), a nonparametric counterpart of LDA, have also been proposed (Wang et al., 2011; Bryant & Sudderth, 2012). Both classes of inference algorithms (i.e., sampling and variational inference), their virtues notwithstanding, are known to exhibit certain deficiencies, which also propagate through their modern extensions. The problem can be traced back to the need to approximate or sample from the posterior distributions of the latent variables representing the topic labels. Since these latent variables are not geometrically intrinsic — any permutation of the labels yields the same likelihood — manipulating these redundant quantities tends to slow down computation and compromise learning accuracy.


Broadly speaking, continuous latent variable models are useful for problems where data lie close to a manifold of much lower dimensionality. By using continuous latent variables, we can express inherent unobserved structure (assuming it exists) in the data with significantly fewer latent variables, and these latent variable models therefore play a key role in the statistical formulation of many dimensionality reduction techniques. Many widely used pattern recognition techniques can be understood in this framework: probabilistic principal component analysis (PCA) (Tipping & Bishop, 1999; Roweis, 1998), the Kalman filter, and others. In addition, as Tipping & Bishop (1999) have pointed out, many non-probabilistic methods can be understood as restricted cases of a continuous latent variable model, for example independent component analysis and factor analysis (Spearman, 1904), which describe variability among observed, correlated variables. By contrast, discrete latent variable models assume discreteness in the unobserved space. This discreteness naturally implies that random draws from this space have a finite probability of repetition, which is one reason why discrete latent models are widely used to express inherent groupings and similarities underlying the data; they have played a key role in the probabilistic formulation of clustering techniques. However, inferring (learning) such models from data is computationally much more challenging than for continuous latent variable models, and in its full generality clustering is a combinatorial (NP-hard) problem. This often restricts discrete latent models to applications in which computational resources and time for inference are plentiful.
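As a concrete illustration of the continuous case, the maximum-likelihood fit of probabilistic PCA (Tipping & Bishop, 1999) is available in closed form from the eigendecomposition of the sample covariance. A minimal sketch (function name and toy setup are ours):

```python
import numpy as np

def ppca_ml(X, q):
    """Closed-form ML estimate of probabilistic PCA (Tipping & Bishop,
    1999). X is (n, d) data, q < d the latent dimension. Returns the
    loading matrix W and isotropic noise variance sigma2; the implied
    model covariance is W W^T + sigma2 * I."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / len(X)                 # sample covariance
    lam, U = np.linalg.eigh(S)             # ascending eigenvalues
    lam, U = lam[::-1], U[:, ::-1]         # reorder to descending
    sigma2 = lam[q:].mean()                # average discarded variance
    W = U[:, :q] @ np.diag(np.sqrt(lam[:q] - sigma2))
    return W, sigma2
```

By construction, the top-q eigenvalues of the fitted covariance `W @ W.T + sigma2 * np.eye(d)` match the top-q sample eigenvalues, while the remaining directions are explained as noise.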


Instead of placing a prior distribution explicitly on the number of mixture components when this quantity is unknown, another predominant approach is to place a Bayesian nonparametric prior on the mixing measure G, resulting in infinite mixture models. Bayesian nonparametric models such as Dirichlet process mixtures and their variants have remarkably extended the reach of mixture modeling into a vast array of applications, especially areas where the number of mixture components is very large and difficult to fathom, or is a quantity of only tangential interest. For instance, in topic modeling applications to web-based text corpora, one may be interested in the most "popular" topics, while the number of topics itself is less meaningful (Blei et al., 2003; Teh et al., 2006; Nguyen, 2015; Yurochkin et al., 2017). DP mixtures and their variants can also serve as an asymptotically optimal device for estimating the population density, under standard conditions on the true density's smoothness; see, e.g., Ghosal and van der Vaart (2001, 2007); Shen et al. (2013); Scricciolo (2014).
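A common concrete representation of the nonparametric prior on the mixing measure G is the stick-breaking construction of the Dirichlet process. The sketch below (truncation level and names are illustrative, not from the cited works) draws the mixture weights:

```python
import numpy as np

def stick_breaking(alpha, k, rng):
    """Truncated stick-breaking construction of Dirichlet process
    mixture weights: beta_j ~ Beta(1, alpha) and
    w_j = beta_j * prod_{i<j} (1 - beta_i)."""
    betas = rng.beta(1.0, alpha, size=k)
    # remaining[j] = length of stick left before the j-th break
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining

rng = np.random.default_rng(0)
w = stick_breaking(alpha=2.0, k=1000, rng=rng)
print(w.sum())  # approaches 1 as the truncation level k grows
```

Smaller concentration `alpha` puts more mass on the first few sticks, i.e., fewer effective components; this is how "infinitely many" components remain tractable in practice.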


It is worth mentioning that although the above theorem provides a generic representation of the solution to RegBayes, in practice we usually need to make additional assumptions in order to make either the primal or the dual problem tractable to solve. Since such assumptions can make the feasible space non-convex, additional caution is needed. For instance, mean-field assumptions lead to a non-convex feasible space (Wainwright and Jordan, 2008), and we can then only apply convex analysis to the convex sub-problems within an EM-type procedure. More concrete examples will be provided later, alongside the development of various models. We should also note that the modeling flexibility of RegBayes comes with risks; for example, it might lead to inconsistent posteriors (Barron et al., 1999; Choi and Ramamoorthi, 2008). This paper focuses on presenting several practical instances of RegBayes, and we leave a systematic analysis of the Bayesian asymptotic properties (e.g., posterior consistency and convergence rates) for future work.


the i-th dimension in the latent variable z. Using the FB objective causes learning to give up trying to drive down the KL for dimensions of z that are already beneath the target rate. Pelsmaeker and Aziz (2019) conduct a comprehensive experimental evaluation of related methods and conclude that the FB objective achieves comparable or superior performance (in terms of both language modeling and reconstruction) compared with other top-performing methods, many of which are substantially more complex. We note that Pelsmaeker and Aziz (2019) experiment with a slightly different version of FB in which the threshold is applied to the entire KL term directly, rather than to each dimension's KL separately. We examine both versions here and refer to the single-threshold version as "FBP" and the multiple-threshold version (Eq. 1) as "FB". For both FB and FBP, we vary the target rate λ and report the setting with the best validation PPL and the setting with the best balance between PPL and reconstruction loss.

Aggressive training (He et al., 2019). He et al. (2019) observe that when posterior collapse occurs, the inference network often lags behind the generator during training. In contrast with the KL reweighting methods described above, He et al. (2019) propose an aggressive training schedule that iterates between multiple encoder update steps and one decoder update step to mitigate posterior collapse.
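The two thresholding variants described above can be stated compactly. A toy numpy sketch (function names and KL values are ours, following the FB/FBP distinction in the text):

```python
import numpy as np

def fb_kl(kl_per_dim, target):
    """FB (per-dimension free bits, Eq. 1 style): threshold each latent
    dimension's KL separately, so dimensions already below the target
    rate contribute the target instead of being pushed further down."""
    return np.maximum(kl_per_dim, target).sum()

def fbp_kl(kl_per_dim, target):
    """FBP variant: a single threshold applied to the total KL."""
    return max(kl_per_dim.sum(), target)

kl = np.array([0.05, 0.40, 0.02])  # toy per-dimension KL values (nats)
print(fb_kl(kl, 0.1))   # per-dimension floors raise the penalized total
print(fbp_kl(kl, 0.1))  # total KL already exceeds the single threshold
```

Note the difference: FB still penalizes the large 0.40 dimension while flooring the collapsed ones, whereas FBP leaves the objective unchanged whenever the total KL is above the target.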


Several methodological considerations of the current study merit attention. First, we focused on distinguishing between mental strategies, in which nothing is written down on paper, and written strategies, in which something is written down on paper, ranging from intermediate answers to procedural algorithms. Note that this is a similar categorization of strategies to that in Siegler and Lemaire's (1997) original choice/no-choice study, which distinguished between using a calculator, using mental arithmetic, and using pencil and paper (Experiment 3). Although this is arguably a rough classification, and other categorizations (for example, with respect to the number of solution steps) are conceivable, we chose this strategy split for two reasons. First, earlier studies of strategy use on complex division by Dutch sixth graders showed that both strategy types were used and that they had large predictive power for the accuracy of solutions (Hickendorff et al., 2009b, 2010). Second, didactical practice in the Netherlands – with the disappearance of the traditional algorithm and many different informal strategies – creates obstacles to studying the characteristics of different strategies in a choice/no-choice design. That is, if students are forced to use a particular strategy in the no-choice conditions, they should have those strategies in their repertoire, and many students did not receive instruction in, for example, the traditional algorithm.


scribed above can compute T [((a ? b) ? c) ? ] to find out what features fire on abc and its suffixes. One simple feature template performs "vowel/consonant backoff"; e.g., it maps abc to the feature named VCC. Fig. 2 showed the result of applying several actual feature templates to the window shown in Fig. 1. The extended regular expression calculus provides a flexible and concise notation for writing down these FSTs. As a trivial example, the trigram "vowel/consonant backoff" transducer can be described as T = V V V, where V is a transducer that performs backoff on a single alignment character. Feature templates should make it easy to experiment with adding various kinds of linguistic knowledge. We have additional algorithms for compiling U θ from a set of arbitrary feature templates, 25 in-
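The vowel/consonant backoff template itself is simple enough to sketch directly. The following is a toy Python stand-in for the transducer, assuming the five letters aeiou count as vowels (this is our simplification, not the paper's FST implementation):

```python
def vc_backoff(window):
    """Toy 'vowel/consonant backoff' feature template: each character
    of the window is replaced by V (vowel) or C (consonant), so the
    trigram window 'abc' fires the feature named 'VCC'."""
    vowels = set("aeiou")
    return "".join("V" if ch in vowels else "C" for ch in window)

print(vc_backoff("abc"))  # VCC
```

In the paper's formulation, the same mapping is expressed as the transducer T = V V V so that it composes with the other FSTs in the calculus; the point of the template machinery is that such mappings stay this easy to write.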


Q3 Does the overall sentiment of the language used to describe men and women differ?

To answer these questions, we introduce a generative latent-variable model that jointly represents adjective (or verb) choice, with its sentiment, given the natural gender of a head (or dependent) noun. We use a form of posterior regularization to guide inference of the latent variables (Ganchev et al., 2010). We then use this model to study the syntactic n-gram corpus of Goldberg and Orwant (2013). To answer Q1, we conduct an analysis that reveals differences between descriptions of male and female nouns that align with common gender stereotypes captured by previous human judgements. When using our model to answer Q2, we find that adjectives used to describe women are more often related to their bodies (significant under a permutation test with p < 0.03) than adjectives used to describe men (see Fig. 1 for examples). This finding accords with previous research (Norberg, 2016). Finally, in answer to Q3, we find no significant difference in the overall sentiment of the language used to describe men and women.

2 What Makes this Study Different?
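The permutation test mentioned for Q2 can be sketched generically. This is the standard two-sample form on a difference of means, not the authors' exact statistic (function name and setup are ours):

```python
import numpy as np

def permutation_test(x, y, n_perm=10000, rng=None):
    """Two-sample permutation test: the p-value is the fraction of
    random label shuffles whose absolute difference of group means
    is at least as large as the observed one (+1 smoothing keeps
    the estimate strictly positive)."""
    if rng is None:
        rng = np.random.default_rng(0)
    pooled = np.concatenate([x, y])
    observed = abs(np.mean(x) - np.mean(y))
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = abs(perm[:len(x)].mean() - perm[len(x):].mean())
        count += diff >= observed
    return (count + 1) / (n_perm + 1)
```

Because the null distribution is built by shuffling group labels, the test makes no parametric assumptions about how the per-group scores are distributed, which suits derived quantities like per-adjective association scores.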


Item Response Theory (IRT) (Baker and Kim 2004; van der Linden and Hambleton 1997) considers a class of latent variable models that link mainly dichotomous and polytomous manifest (i.e., response) variables to a single latent variable. The main applications of IRT can be found in educational testing, in which analysts are interested in measuring examinees' ability using a test that consists of several items (i.e., questions). Several models and estimation procedures have been proposed that deal with various aspects of educational testing.
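A canonical instance of such a model is the two-parameter logistic (2PL) IRT model, in which the probability of a correct response depends on the examinee's latent ability and the item's difficulty and discrimination. A minimal sketch (not tied to any estimation procedure in the cited works):

```python
import math

def irt_2pl(theta, a, b):
    """Two-parameter logistic IRT model: probability that an examinee
    with latent ability theta answers an item correctly, where a is
    the item's discrimination and b its difficulty:
        P(correct) = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

print(irt_2pl(0.0, 1.0, 0.0))  # ability equal to difficulty gives 0.5
```

Setting a = 1 for all items recovers the Rasch (1PL) model; larger a makes the item separate examinees near its difficulty more sharply.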


The effects of instruction on performance were also investigated: both direct effects and effects occurring indirectly through strategy use. The indirect effects can occur because of the large accuracy differences between strategies: as in previous research (Hickendorff, 2013; Hickendorff et al., 2009, 2010; Van Putten, 2005), written strategies were found to be much more accurate than mental strategies. This was not only the case when potentially biasing strategy selection effects (Siegler & Lemaire, 1997) of student and problem characteristics were statistically corrected for (Chapters 3 and 5), but also when they were eliminated through experimental design (with the choice/no-choice design of Siegler & Lemaire, 1997; Chapter 4). Within written strategies, the digit-based and whole-number-based algorithms were found to be comparable in accuracy, while non-algorithmic approaches appeared less accurate than the algorithms (as discussed previously). This suggests that while attention to informal strategies may be very fruitful in earlier stages of the educational process (Treffers, 1987b), performance may benefit from a focus on standardized procedures at the end of the instructional trajectory. This may be especially relevant to students with lower mathematical ability, who appear to benefit less from freer forms of instruction with attention to multiple solution strategies than from more direct forms of instruction (Royal Netherlands Academy of Arts and Sciences, 2009).


Rheumatoid arthritis is a complex disease that appears to involve multiple genetic and environmental factors. Using the Genetic Analysis Workshop 15 simulated rheumatoid arthritis data and the structural equation modeling framework, we tested hypothesized "causal" rheumatoid arthritis models by employing a novel latent gene construct approach that models individual genes as latent variables defined by multiple dense and non-dense single-nucleotide polymorphisms (SNPs). Our approach produced valid latent gene constructs, particularly with dense SNPs, which, when coupled with other factors involved in rheumatoid arthritis, were able to generate good-fitting models by certain goodness-of-fit indices. We observed that Genes F, C, and DR, sex, and smoking were significant predictors of rheumatoid arthritis but Genes A and E were not, which was generally, but not entirely, consistent with how the data were simulated. Our approach holds promise for unravelling complex diseases and improves upon current "one SNP (haplotype) at a time" regression approaches by decreasing the number of statistical tests while minimizing problems with multicollinearity and haplotype estimation algorithm error. Furthermore, when genes are modeled as latent constructs simultaneously with other key cofactors, the approach provides enhanced control of confounding that should lead to less biased effect estimates among genes as well as between genes and the complex disease. However, further study is needed to quantify bias, evaluate fit index disparity, and resolve multiplicative latent gene interactions. Moreover, because some a priori biological information is needed to form an initial substantive model, our approach may be most appropriate for candidate gene SNP panel applications.

Shallow parsing is one of many NLP tasks that can be reduced to a sequence labeling problem. In this paper we show that latent dynamics (i.e., hidden substructure of shallow phrases) constitute a problem in shallow parsing, and that modeling this intermediate structure is useful. By analyzing the automatically learned hidden states, we show how the latent conditional model explicitly learns latent dynamics. We propose the Best Label Path (BLP) inference algorithm, which is able to produce the most probable label sequence on latent conditional models; it outperforms two existing inference algorithms. With BLP inference, the LDCRF model significantly outperforms CRF models on word features, and achieves performance comparable to that of the most successful shallow parsers on the CoNLL data when part-of-speech features are also used.

Statistical models involving latent variables are widely used in many areas of application, such as biomedical science and social science. When likelihood-based parametric inferential methods are used to make statistical inference, certain distributional assumptions on the latent variables are often invoked. As latent variables are not observable, parametric assumptions on the latent variables cannot be verified directly using observed data. Even though semiparametric and nonparametric approaches have been developed to avoid making strong assumptions on the latent variables, parametric inferential approaches are still more appealing in many situations in terms of consistency and efficiency in estimation, and computational burden. The goals of our study are to gain insight into the sensitivity of statistical inference to model assumptions on latent variables, and to develop methods for diagnosing latent-model misspecification so as to reveal whether the parametric inference is robust under certain latent-model assumptions. We refer to such robustness as latent-model robustness.


subjects, involvement of drug use, involvement of genetic analysis, and the duration of PI training did not differ significantly between the group of applications that were approved and those that failed to receive approval. There were significantly more non-approvals for applications from PIs within our organization than for applications from PIs from outside our institution (p = 0.014). The administration time required for checking by the secretary of our IRB was longer in the group of applications that were not approved (p = 0.008) than in the group that received approval. The revision frequency in the group of applications that were not approved was significantly higher than in the group of applications that were approved (p < 0.001). The total review time was significantly longer in the group of applications that were not approved than in the group of applications that were approved (p = 0.002) (Table 1).

It is important to note that unlike spectral algorithms, the EM algorithm has an interpretation that is valid even when the data it is applied to is not generated from an L-PCFG in the family we are estimating from. It can be viewed as minimizing the Kullback-Leibler (KL) divergence, a measure of distributional divergence, between the empirical distribution and the family of possible L-PCFGs from which a model is selected. To date, the theoretical guarantees of L-PCFGs with spectral algorithms require the assumption that the data is generated from an L-PCFG distribution. Still, the EM algorithm and spectral algorithms yield similar results on a variety of benchmarks for multilingual parsing, even for data that are clearly not sampled from an L-PCFG (as one might argue is true for most natural language data).


We describe a two-part method for this problem. The method (1) finds clusters of measured variables that are d-separated by a single unrecorded common cause, if such exists; and (2) finds features of the Markov equivalence class of causal models for the latent variables. Assuming only multiple-indicator structure and principles standard in Bayes net search algorithms (principles assumed to be satisfied in many domains, especially in the social sciences), the two procedures converge, with probability 1 in the large-sample limit, to correct information. The completeness of the information obtained about latent structure depends on how thoroughly confounded the measured variables are, but when, for each unknown latent variable, there in fact exist at least a small number of measured variables that are influenced only by that latent variable, the method returns the complete Markov equivalence class of the latent structure. To complement the theoretical results, we show by simulation studies, for several latent structures and for a range of sample sizes, that the method identifies the unknown structure more accurately than does factor analysis or a published greedy search algorithm. We also illustrate and compare the procedures with applications to social science cases, where expert opinions about measurement are reasonably firm, but are less so about causal relations among the latent variables.
