Chapter 1
INTRODUCTION
An overarching challenge in science and engineering is to develop concise and interpretable frameworks that characterize the relationships among a large collection of variables. As an example, in computational biology, a common scientific question involving a gene regulatory network is to determine how variation in one gene impacts the other genes in the network. In water resources, a complete understanding of the relationships among the different water entities in a network provides an important tool for enforcing effective and sustainable policies. Finally, in imaging spectroscopy, characterizing the relationship among the spectral profiles of patches in a scene is crucial for the accuracy of existing detection techniques (e.g., matched filters). A significant difficulty that arises in finding the statistical dependencies among a collection of variables is that we do not have sample observations of some of the relevant variables. These latent (hidden) variables complicate finding a concise representation, as they introduce confounding dependencies among the variables of interest. Consequently, significant efforts over many decades have been directed towards the problem of accounting for the effects of latent phenomena in statistical modeling via latent-variable techniques. Commonly employed latent-variable models include factor analysis, latent Dirichlet allocation, mixture distributions, and latent-variable graphical models.


considering a near-ignorance about the missingness process, and by updating beliefs accordingly. To use the rule profitably, it is important to develop efficient algorithms to compute with it.
In this chapter we have shown that it is not possible in general to create efficient algorithms for such a purpose (unless P=NP): in fact, using the conservative updating rule to do efficient classification with Bayesian networks is shown to be NP-hard. This parallels analogous results for more traditional ways of doing classification with Bayesian nets: in those cases, the computation is efficient only on polyforest-shaped Bayesian networks. Our second contribution shows that something similar happens with the conservative updating, too. Indeed, we provide a new algorithm for robust classification that is efficient on polyforest-shaped s-networks. This substantially extends a previously existing algorithm which, loosely speaking, is efficient only on disconnected s-networks. Yet it is important to stress that the computational difference between traditional classification with Bayesian nets and robust classification based on the conservative updating rule is remarkable: first, the former is based on the entire net, while the latter only on the net formed by the class variable and its Markov blanket; second, while the former requires the entire network to be a polyforest in order to obtain efficient computation, the latter requires only that the associated s-network is. This means the computation is efficient even in many cases where the class variable and its Markov blanket form a multiply connected net in the original Bayesian network. In other words, computing robust classifications with the conservative updating will typically be much faster than computing classifications with the traditional updating rule. Given that the latter classifications are necessarily included in the former, by definition of the conservative updating rule, it seems worth considering robust classifications not only as a stand-alone task, but also as a pre-processing step for traditional classification with Bayesian nets.
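
The polyforest condition above is easy to test: a directed graph is a polyforest exactly when its underlying undirected skeleton is acyclic. A minimal sketch (the function name and the union-find approach are ours, not from the chapter):

```python
# Hypothetical sketch: a DAG is a polyforest iff its underlying
# undirected skeleton has no cycle; union-find detects one in near-linear time.

def is_polyforest(edges):
    """edges: iterable of directed (u, v) pairs; returns True iff the
    underlying undirected graph contains no cycle."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:          # u and v already connected: undirected cycle
            return False
        parent[ru] = rv       # merge the two components
    return True
```

On the classic collider A → C ← B this returns True (a polytree), while adding the edge A → B closes an undirected triangle and it returns False.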


VB provides a neat, deterministic way of approximating the joint posterior distribution online. We have compared the performance of the VB filter to a stochastic approximation method, through a standard PF, and seen that it performs very well comparatively, with a marked decrease in computational requirements. Prior to this work, sequential Monte Carlo (SMC) methods had already been applied to the state estimation problem in the SSPP framework. In Ergün et al. (2007), the underlying state dynamics were modeled by a random walk process but the underlying parameters were assumed to be known. The authors introduced point process adaptive filters (Eden et al., 2004) for proposing new particles to increase computational efficiency. The method showed good performance both on a synthetic dataset and on a real dataset, where the problem of tracking the evolution of a hippocampal spatial receptive field was studied. The extension of these results to online parameter learning SMC approaches (see also Storvik, 2002) was thus a natural step. It should be noted that the highly linear substructure (through the underlying AR latent process) also allows Rao-Blackwellised PFs (Doucet et al., 2000a) to be applied. In this case the state forward filtering step may be approximated by that of Smith and Brown (2003) or Fahrmeir and Tutz (1994). However, preliminary results show that even in this case, SMC methods may still prove too time consuming for any interesting biomedical application where data needs to be handled in real time.
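
For concreteness, a bootstrap particle filter for a random-walk latent state with point-process (Bernoulli spike) observations, the simplest version of the setup described above, can be sketched as follows. The firing-probability form and all parameter values are illustrative assumptions, not taken from the cited works:

```python
import math, random

def bootstrap_pf(spikes, n_particles=500, q=0.1, dt=1.0, seed=0):
    """Minimal bootstrap particle filter for a random-walk latent state
    x_t with Bernoulli spike observations, firing probability
    p_t = 1 - exp(-exp(x_t) * dt). Returns the posterior-mean state path.
    All parameter values here are illustrative."""
    rng = random.Random(seed)
    xs = [0.0] * n_particles
    means = []
    for y in spikes:                                   # y in {0, 1}
        xs = [x + rng.gauss(0.0, q) for x in xs]       # propagate dynamics
        ws = []
        for x in xs:                                   # weight by likelihood
            p = 1.0 - math.exp(-math.exp(x) * dt)
            ws.append(p if y else 1.0 - p)
        total = sum(ws) or 1e-300
        ws = [w / total for w in ws]
        means.append(sum(w * x for w, x in zip(ws, xs)))
        xs = rng.choices(xs, weights=ws, k=n_particles)  # multinomial resampling
    return means
```

Feeding a run of spikes pushes the estimated log-rate state upward, while silence pushes it down, which is the qualitative behavior a point-process filter should exhibit.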


One difficulty in applying LCM and OLCM is selecting the number of classes K. In some applications, K is specified a priori based on prior knowledge. In other cases, however, prior knowledge of K is not available, and K must be determined in an exploratory fashion. Generally, there are two classes of exploratory approaches to determining K. The first class treats K as a modeling choice. Thus, it has to be specified before the model can be fitted. To select the best number of classes, models with different Ks are fitted. The selection of K is then based on finding the minimum number of classes that yields acceptable fit under the χ² or likelihood-ratio test. Alternatively, the choice of K can be based on information criteria such as the Akaike information criterion (AIC; Akaike, 1987) and the Bayesian information criterion (BIC; Schwarz, 1978). In this approach, a number of models of different dimensions have to be estimated. Under some circumstances, this can be computationally inefficient and time consuming (Pan & Huang, 2014). Moreover, inference conditioned on a specific K from the two-stage approach clearly ignores the uncertainty in the selection process (Yang et al., 2011).
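
The information-criterion route can be sketched end to end: fit the model for each candidate K and keep the K minimizing BIC. The toy below substitutes a one-dimensional Gaussian mixture fitted by EM for an LCM (an illustrative stand-in; the data are synthetic and all names are ours):

```python
import math, random

def em_gmm_1d(data, k, iters=100):
    """Minimal EM for a 1-D Gaussian mixture; returns the final
    log-likelihood. Quantile initialization keeps it deterministic."""
    n = len(data)
    srt = sorted(data)
    mu = [srt[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    mean = sum(data) / n
    var = [sum((x - mean) ** 2 for x in data) / n] * k
    w = [1.0 / k] * k
    loglik = 0.0
    for _ in range(iters):
        loglik, resp = 0.0, []
        for x in data:                       # E-step: responsibilities
            dens = [w[j] / math.sqrt(2 * math.pi * var[j])
                    * math.exp(-0.5 * (x - mu[j]) ** 2 / var[j])
                    for j in range(k)]
            s = sum(dens) or 1e-300
            loglik += math.log(s)
            resp.append([d / s for d in dens])
        for j in range(k):                   # M-step: update parameters
            nj = sum(r[j] for r in resp) or 1e-12
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var[j] = max(sum(r[j] * (x - mu[j]) ** 2
                             for r, x in zip(resp, data)) / nj, 1e-6)
            w[j] = nj / n
    return loglik

def bic(data, k):
    # 3k - 1 free parameters: k means, k variances, k - 1 weights
    return -2.0 * em_gmm_1d(data, k) + (3 * k - 1) * math.log(len(data))

random.seed(1)
data = ([random.gauss(-4.0, 1.0) for _ in range(200)]
        + [random.gauss(4.0, 1.0) for _ in range(200)])
best_k = min(range(1, 5), key=lambda k: bic(data, k))
```

On this two-cluster sample the BIC penalty outweighs the negligible likelihood gain of extra components, so the search settles on K = 2, which illustrates both the procedure and its cost: one full model fit per candidate K.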


We propose nonparametric model-based clustering algorithms, nearly as simple as K-means, which overcome most of its challenges and can infer the number of clusters from the data. Their potential is demonstrated for many different scenarios and applications, such as phenotyping Parkinson's disease and Parkinsonism-related conditions in an unsupervised way. With a few simple steps we derive a related approach for nonparametric analysis of longitudinal data which converges a few orders of magnitude faster than currently available sampling methods. The framework is extended to efficient inference in nonparametric sequential models, where example applications include behaviour extraction and DNA sequencing. We demonstrate that our methods can easily be extended to allow for flexible online learning in a realistic setup using severely limited computational resources. We develop a system capable of inferring online nonparametric hidden Markov models from streaming data using only embedded hardware. This allowed us to develop occupancy-estimation technology using only a simple motion sensor.
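
A well-known algorithm in this "nearly as simple as K-means" spirit is DP-means (Kulis & Jordan, 2012): Lloyd's iterations, except that a point farther than a threshold from every centre spawns a new cluster, so K is inferred from the data. The 1-D sketch below illustrates that idea only; it is our assumption, not necessarily the thesis's specific method:

```python
def dp_means(points, lam, iters=20):
    """DP-means-style clustering: like Lloyd's k-means, but a point whose
    squared distance to every centre exceeds lam opens a new cluster.
    The result can depend on point order, as in the original algorithm."""
    centers = [points[0]]
    assign = [0] * len(points)
    for _ in range(iters):
        for i, x in enumerate(points):
            d2 = [(x - c) ** 2 for c in centers]
            best = min(range(len(centers)), key=lambda c: d2[c])
            if d2[best] > lam:               # too far from everything: new cluster
                centers.append(x)
                assign[i] = len(centers) - 1
            else:
                assign[i] = best
        for j in range(len(centers)):        # recompute centres as means
            members = [points[i] for i in range(len(points)) if assign[i] == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers, assign
```

The penalty `lam` replaces the choice of K: a small `lam` yields many tight clusters, a large `lam` few broad ones.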


Most of the above-cited papers consider factorizing priors, or regularizers that are sums of regularizers on the individual regression coefficients. However, in many practical applications the model parameters (the regression coefficients) have an a priori (spatial) pattern, because they express effects that are (spatially) correlated (e.g. Penny et al., 2005). For example, in fMRI experiments it is reasonable to assume that the activation levels of neighboring brain areas are positively correlated. We would like to choose priors or regularizers that lead to posterior densities or point estimates that take this information into account. Lasso (Tibshirani, 1996) is known to perform poorly in models where there are strong correlations between the regression coefficients; that is, from a group of highly correlated coefficients it chooses a few and suppresses the rest. A way to remedy this drawback is to use the elastic net (Zou and Hastie, 2005). One can go even further and use the group lasso (Meier et al., 2008), which exhibits similar properties to the lasso, but with sparsity represented at a (predefined) group level. However, in many cases it is too restrictive to pre-define the groups. In this chapter, we will define a novel sparsity-inducing prior density that allows us to encode prior correlations between the parameters' magnitudes and yields posterior densities that allow us to assess the relevance of the regression coefficients. We apply it in the linear and logistic regression settings.
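
The lasso-versus-elastic-net behavior described above is easy to reproduce with a naive coordinate-descent solver. In the sketch below (our own minimal implementation, not from the chapter), two predictors are perfectly correlated: the lasso keeps one and zeroes the other, while the elastic net splits the weight between them:

```python
import math

def enet_cd(X, y, l1, l2, iters=500):
    """Naive coordinate descent for the elastic net on standardized data:
    minimize 0.5/n * ||y - Xb||^2 + l1*||b||_1 + 0.5*l2*||b||^2.
    Setting l2 = 0 gives the lasso."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # partial residual leaving coordinate j out
            r = [y[i] - sum(b[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n + l2
            # soft-thresholding update
            b[j] = math.copysign(max(abs(rho) - l1, 0.0), rho) / z
    return b
```

With duplicated columns (e.g. `X = [[1, 1], [-1, -1], [1, 1], [-1, -1]]`, `y = [1, -1, 1, -1]`), `enet_cd(X, y, 0.1, 0.0)` returns one nonzero coefficient, whereas `enet_cd(X, y, 0.1, 1.0)` returns two equal nonzero coefficients, matching the grouping behavior attributed to the elastic net.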


Neither Gibbs sampling nor variational inference in their original formulations scales well to large corpora of millions of documents. Training time exceeds practical limits for more complex and nonparametric LDA extensions. In the past decade significant research efforts have been made to address this concern. Online (Hoffman et al., 2010), distributed (Newman et al., 2008), and both online and distributed (Broderick et al., 2013) algorithms have been developed. Online algorithms for the Hierarchical Dirichlet Process (HDP) (Teh et al., 2006), a nonparametric counterpart of LDA, have also been proposed (Wang et al., 2011; Bryant & Sudderth, 2012). Both classes of inference algorithms (i.e., sampling and variational inference), their virtues notwithstanding, are known to exhibit certain deficiencies, which also propagate through their modern extensions. The problem can be traced back to the need to approximate or sample from the posterior distributions of the latent variables representing the topic labels. Since these latent variables are not geometrically intrinsic (any permutation of the labels yields the same likelihood), the manipulation of these redundant quantities tends to slow down computation and compromise learning accuracy.
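
The label-switching point is worth making concrete: in any mixture-type latent variable model, permuting the component labels leaves the likelihood unchanged. A toy check (all numbers are illustrative):

```python
import math
from itertools import permutations

def mixture_loglik(data, weights, means, var=1.0):
    """Log-likelihood of 1-D data under a Gaussian mixture with the
    given component weights and means (unit variance by default)."""
    ll = 0.0
    for x in data:
        dens = sum(w / math.sqrt(2 * math.pi * var)
                   * math.exp(-0.5 * (x - m) ** 2 / var)
                   for w, m in zip(weights, means))
        ll += math.log(dens)
    return ll

# toy data and parameters, purely illustrative
data = [-2.0, -1.5, 1.0, 2.5]
weights, means = [0.3, 0.5, 0.2], [-2.0, 0.0, 2.0]
base = mixture_loglik(data, weights, means)
for perm in permutations(range(3)):
    # relabeling the components never changes the likelihood
    assert abs(mixture_loglik(data,
                              [weights[i] for i in perm],
                              [means[i] for i in perm]) - base) < 1e-9
```

All 3! = 6 relabelings give the same log-likelihood, which is exactly the redundancy that sampling and variational schemes must manipulate.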


CHAPTER I
Introduction
With the advent of technology, a large amount of today's data is generated by complex mechanisms. Data may be available in various forms: for example, unlabelled data such as images, tweets, and articles, or time-series data such as daily weather reports and traffic scenarios, including but not limited to inter-vehicular interactions. Moreover, available data is often high-dimensional in nature, as vast amounts of information are generated at low cost. For example, large-scale biological datasets obtained through next-generation sequencing, proteomics, or brain imaging are often high-dimensional. Other kinds of data may be more personal in nature, such as mobile-app-based monitoring of driving, health, or other individual-specific activities. Any suitable statistical method therefore needs to account for the relevant complexities of the data involved in order to deliver efficient inference.


will concentrate on analyzing methods of moments when data does not follow a model of the mathematical form assumed by the algorithm, violating some of the assumptions that the user makes.
In Chapter 5 we will focus on the scenario where a method of moments is required to learn a latent variable model from data, but the number of latent states requested by the user is too small to accurately represent the training data. This is a very common scenario, in particular when data is high-dimensional and it is difficult to find a number of latent states that comprehensively describes the dataset, or when this number is too high to be estimated. For example, an important application of low-dimensional learning comes from exploratory data analysis, where a mixture model with two states is required to bisect a dataset into two well-separated classes. The desired behavior of a learning technique run in a low-dimensional setting is to return a small model that synthetically describes the data, providing the optimal low-dimensional approximation to the data we are observing. In Chapter 5 we will demonstrate that this is not the behavior of existing methods of moments, which are instead likely to return unexpected results when supplied with a misspecified number of latent states. As a consequence, we provide a novel decomposition algorithm for methods of moments that phrases the decomposition task as a non-convex optimization problem and generalizes the method presented in Chapter 2. We demonstrate that the proposed algorithm, when run in a low-dimensional setting, returns the optimal low-dimensional model approximating the one generating the data, according to an intuitive definition of optimality. Starting from these remarks, we apply this method to hierarchically learn latent variable models, starting with a simple, two-dimensional model, which is then refined by iterating the learning step on each of the retrieved dimensions. The hierarchical nature of this method allows for a fast and accurate solution of the optimization problem arising in the decomposition task, based on low-dimensional grid search.
An immediate application of this approach is hierarchical clustering, where a mixture with two classes is learned and used to bisect our dataset, and the procedure is then iterated on each of the two retrieved clusters. In this chapter we will also present an application of this approach to natural language processing, providing a specialization of our method to perform hierarchical topic modeling.
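
The bisecting procedure can be sketched as a short recursion. Here a simple one-dimensional 2-means split stands in for the two-state mixture learned by the method of moments (an illustrative simplification; names and data are ours):

```python
def bisect_cluster(points, depth, min_size=2):
    """Hierarchical clustering by recursive bisection: split a cluster in
    two, then recurse on each half. A 1-D 2-means split stands in for the
    two-state mixture model of the text (illustrative simplification)."""
    if depth == 0 or len(points) < 2 * min_size:
        return [points]
    c0, c1 = min(points), max(points)          # deterministic initialization
    left, right = [], []
    for _ in range(50):                        # Lloyd iterations
        left = [x for x in points if abs(x - c0) <= abs(x - c1)]
        right = [x for x in points if abs(x - c0) > abs(x - c1)]
        if not left or not right:
            return [points]                    # degenerate split: stop here
        c0, c1 = sum(left) / len(left), sum(right) / len(right)
    return (bisect_cluster(left, depth - 1, min_size)
            + bisect_cluster(right, depth - 1, min_size))
```

With four well-separated groups and `depth=2`, the first split separates the two outer pairs of groups and the recursion then separates each pair, yielding four leaf clusters.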


Abstract
The problem of estimating a full BRDF from partial observations has already been studied using either parametric or non-parametric approaches. The goal in each case is to best match a sparse set of input measurements. In this paper we address the problem of inferring higher-order reflectance information starting from the minimal input of a single BRDF slice. We begin from the prototypical case of a homogeneous sphere, lit by a head-on light source, which holds information about less than 0.001% of the whole BRDF domain. We propose a novel method to infer the higher-dimensional properties of the material's BRDF, based on the statistical distribution of known material characteristics observed in real-life samples. We evaluated our method on a large set of experiments generated from real-world BRDFs and newly measured materials. Although inferring higher-dimensional BRDFs from such modest training is not a trivial problem, our method performs better than state-of-the-art parametric, semi-parametric, and non-parametric approaches. Finally, we discuss interesting applications to material relighting and flash-based photography.

Shallow parsing is one of many NLP tasks that can be reduced to a sequence labeling problem. In this paper we show that latent dynamics (i.e., the hidden substructure of shallow phrases) constitute a problem in shallow parsing, and that modeling this intermediate structure is useful. By analyzing the automatically learned hidden states, we show how the latent conditional model explicitly learns latent dynamics. We propose the Best Label Path (BLP) inference algorithm, which is able to produce the most probable label sequence on latent conditional models. It outperforms two existing inference algorithms. With BLP inference, the LDCRF model significantly outperforms CRF models on word features, and achieves performance comparable to the most successful shallow parsers on the CoNLL data when part-of-speech features are further used.
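
BLP inference is specific to latent conditional models, but its backbone is standard Viterbi decoding over label sequences; in an LDCRF, each label's per-position score would first be obtained by folding together the scores of its hidden states. A generic Viterbi sketch (the scores and transition values are toy numbers, not the paper's model):

```python
def viterbi(scores, trans):
    """Standard Viterbi decoding: scores[t][y] is a per-position label
    score, trans[y][y2] a transition score. Returns the best label path.
    In BLP-style inference on latent conditional models, each scores[t][y]
    would first aggregate that label's hidden-state scores (simplified here)."""
    n, L = len(scores), len(scores[0])
    best = [scores[0][:]]           # best[t][y]: best score ending in y at t
    back = []                       # backpointers for path recovery
    for t in range(1, n):
        row, ptr = [], []
        for y in range(L):
            j = max(range(L), key=lambda k: best[-1][k] + trans[k][y])
            row.append(best[-1][j] + trans[j][y] + scores[t][y])
            ptr.append(j)
        best.append(row)
        back.append(ptr)
    y = max(range(L), key=lambda k: best[-1][k])
    path = [y]
    for ptr in reversed(back):      # walk the backpointers
        y = ptr[y]
        path.append(y)
    return path[::-1]
```

With weak transition scores the decoder follows the per-position evidence; as the self-transition scores grow, it increasingly prefers runs of the same label, which is the trade-off label-path inference resolves.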

STRUCTURAL EQUATION MODELS USING LATENT-CHANGE CONCEPTS
In any data analysis problem where multiple constructs have been measured at multiple occasions, we need to consider the importance of causal sequences and determinants of changes (Nesselroade & Baltes 1979). The goal of evaluating time-based sequences, especially when things are changing, is one of the main reasons for collecting longitudinal repeated-measure data in the first place. We have pointed out above the useful benefits of the classical models, but we have also seen that each is limited to specific forms of dynamic inference. Of course, the statistical evaluation of dynamic sequences is not an easy problem, and these problems have puzzled researchers for decades. We describe below how the prior SEMs lead directly to new SEMs that can provide a more flexible framework for causal-dynamic questions.


Table 1 contains all the experimental results when K ranges from 1 to 10. These experimental results validate that the optimal hybrid models achieve the best prediction results when K is 9. Table 1 also shows that the optimal hybrid model predicts more accurately than PCR and PLS when K is greater than 3. This suggests that the proposed approach may be particularly useful for complex prediction tasks that need more predictors. In addition, the MSEs for angular offset are much smaller than the MSEs for parallel offset, which implies that modeling the parallel offset is more difficult, at least for the given calibration data.


the model parameter is of interest, going beyond classical Bayesian theory, recent attempts toward learning a regularized posterior distribution of model parameters (and latent variables as well, if present) include "learning from measurements" (Liang et al., 2009), maximum entropy discrimination (MED) (Jaakkola et al., 1999; Zhu and Xing, 2009), and maximum entropy discrimination latent Dirichlet allocation (MedLDA) (Zhu et al., 2009). All these methods are parametric in that they give rise to distributions over a fixed and finite-dimensional parameter space. To the best of our knowledge, very few attempts have been made to impose posterior regularization in a nonparametric setting where model complexity depends on the data, as is the case for nonparametric Bayesian latent variable models. A general formalism for (parametric and nonparametric) Bayesian inference with posterior regularization does not yet appear to be available. In this paper, we present such a formalism, which we call regularized Bayesian inference, or RegBayes, built on convex duality theory over distribution function spaces; and we apply this formalism to learn regularized posteriors under the Indian buffet process (IBP), conjoining two powerful machine learning paradigms: nonparametric Bayesian inference and SVM-style max-margin constrained optimization.


(Bentley, 1986) for specifying entire sets of features.
A feature template T is a nondeterministic FST that maps the contents of the sliding window, such as abc, to one or more features, which are also described as strings. The n-gram machine described above can compute T [((a ? b) ? c) ? ] to find out which features fire on abc and its suffixes. One simple feature template performs "vowel/consonant backoff"; e.g., it maps abc to the feature named VCC. Fig. 2 showed the result of applying several actual feature templates to the window shown in Fig. 1. The extended regular expression calculus provides a flexible and concise notation for writing down these FSTs. As a trivial example, the trigram "vowel/consonant backoff" transducer can be described as T = V V V , where V is a transducer that performs backoff on a single alignment character. Feature templates should make it easy to experiment with adding various kinds of linguistic knowledge. We have additional algorithms for compiling U θ from a set of arbitrary feature templates, including templates whose features consider windows of variable or even unbounded width. The details are beyond the scope of this paper, but it is worth pointing out that they exploit the fact that feature templates are FSTs and not arbitrary code.
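
The vowel/consonant backoff template is simple enough to sketch directly as plain code rather than as a transducer (the function names are ours; a real template would be compiled into an FST and composed with the window machine):

```python
def vc_backoff(window):
    """Map each character of the window to 'V' (vowel) or 'C' (consonant);
    e.g. 'abc' -> 'VCC'."""
    return "".join("V" if ch in "aeiou" else "C" for ch in window)

def suffix_features(window):
    """Features firing on the window and all of its suffixes, mimicking
    the 'abc and its suffixes' behavior described in the text."""
    return [vc_backoff(window[i:]) for i in range(len(window))]
```

So `suffix_features("abc")` yields `VCC`, `CC`, and `C`: one backed-off feature per suffix of the window.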


Analysis of asymmetric data poses several unique challenges. In this thesis, we propose a series of parametric models under the Bayesian hierarchical framework to account for asymmetry (arising from non-Gaussianity, tail behavior, etc.) in both continuous and discrete response data. First, we model continuous asymmetric responses, assuming normal random errors, by using a dynamic linear model discretized from a differential equation which absorbs the asymmetry from the data-generation mechanism. We then extend the skew-normal/independent parametric family to accommodate spatial clustering and non-random missingness observed in asymmetric continuous responses, and demonstrate its utility in obtaining precise parameter estimates and prediction in the presence of skewness and thick tails. Finally, under a latent variable formulation, we use a generalized extreme value (GEV) link to model multivariate asymmetric spatially-correlated binary responses that also exhibit non-random missingness, and show how this proposal improves inference over other popular alternative link functions in terms of bias and prediction. We assess our proposed methods via simulation studies and two real data analyses in public health. Using simulated data, we investigate the performance of the proposed methods in accurately accommodating asymmetry together with other data features, such as spatial dependency and non-random missingness, leading to precise posterior parameter estimates. Regarding the data illustrations, we first validate the efficiency of using differential equations to handle skewed exposure-assessment responses derived from an occupational hygiene study. Furthermore, we also conduct efficient risk evaluation of various covariates on periodontal disease responses from a dataset on oral epidemiology.
The results of our investigation re-establish the significance of moving away from the normality assumption and instead considering pragmatic distributional assumptions on the random model terms, for efficient Bayesian parameter estimation under a unified framework with a variety of data complexities not previously considered in the two aforementioned areas of public health research.
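
As a point of reference for the GEV link mentioned above, one common parameterization (cf. Wang & Dey, 2010; we are assuming a similar form is meant here) takes P(Y=1 | η) = 1 − F(−η), with F the standard GEV(0, 1, ξ) distribution function, so that the shape parameter ξ controls the skewness of the response curve:

```python
import math

def gev_link(eta, xi):
    """P(Y=1 | eta) = 1 - F(-eta), with F the standard GEV(0, 1, xi) CDF:
    one common parameterization of the GEV link for binary responses
    (cf. Wang & Dey, 2010; an assumption, not taken from this thesis).
    As xi -> 0 it recovers the complementary log-log link,
    p = 1 - exp(-exp(eta))."""
    if abs(xi) < 1e-12:                       # Gumbel (cloglog) limit
        return 1.0 - math.exp(-math.exp(eta))
    t = 1.0 - xi * eta                        # 1 + xi * (-eta)
    if t <= 0.0:                              # -eta outside the GEV support
        return 1.0 if xi > 0 else 0.0
    return 1.0 - math.exp(-t ** (-1.0 / xi))
```

Unlike the symmetric logit or probit links, this response curve is asymmetric for ξ ≠ 0, which is what makes it attractive for skewed binary data.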


7 Conclusions and Future Directions
Latent variable methods provide a flexible approach for complex modeling of correlation in longitudinal studies. We have discussed methods for using latent variables to aggregate multiple ultrasound measurements and model fetal growth and development during the first and second trimesters. These results are particularly important to researchers who use ultrasounds to date pregnancies while assuming that there is no measurable variability in fetal growth early in pregnancy. There is a general need to make latent variable methods more familiar to biostatisticians by applying them to research areas in public health. Furthermore, with a solid understanding of the subject matter, the insights gained from an analysis using latent variable methods can be effectively communicated to researchers in epidemiology and clinical disciplines. To make our methods more accessible, it is possible that, with some modification, the latent variable mixture models described in papers two and three could be estimated using commercial software such as Mplus. Using available software would be particularly useful for journal articles intended for applied researchers in reproductive health.


Abstract
HUANG, XIANZHENG. Robustness in Latent Variable Models. (Under the direction of Dr. Marie Davidian and Dr. Leonard A. Stefanski.)
Statistical models involving latent variables are widely used in many areas of application, such as biomedical science and social science. When likelihood-based parametric inferential methods are used to draw statistical inference, certain distributional assumptions on the latent variables are often invoked. As latent variables are not observable, parametric assumptions on them cannot be verified directly from observed data. Even though semiparametric and nonparametric approaches have been developed to avoid making strong assumptions on the latent variables, parametric inferential approaches are still more appealing in many situations in terms of consistency, efficiency of estimation, and computational burden.


rely heavily on linguistic knowledge of English, and as such they do not generalize to treebanks in other languages.
With all of this previous work, nonterminal refinement is central to the underlying parsing formalism. However, these decorations are extracted from the treebank by means of transformations on trees. It was not until the work of Matsuzaki et al. (2005) and Prescher (2005) that the decoration became a "latent annotation." At that point, L-PCFGs were performing close to the state of the art in syntactic parsing. Dreyer and Eisner (2006) suggested a more complex training algorithm for L-PCFGs to improve their accuracy. Then, Petrov et al. (2006) further improved the parsing results of L-PCFGs to match the state of the art and also suggested a coarse-to-fine approach that made parsing much more efficient (the asymptotic computational complexity of parsing with L-PCFGs, in their vanilla form, grows cubically with the number of latent states). It was at this time that many other researchers started to make use of L-PCFGs for a variety of syntax parsers in different languages, some of which are described in the rest of the paper.
