Bayesian Active Learning by Disagreement

Chapter 2: Machine Learning Background

2.2. Active Sampling

2.2.3. Bayesian Active Learning by Disagreement

Bayesian active learning by disagreement, or BALD (Houlsby et al., 2011), is an information-theoretic approach in which we seek to reduce the number of possible hypotheses maximally fast (Cover and Thomas, 2012). We therefore seek the point that results in the maximal decrease in posterior entropy: $H[\theta \mid \mathcal{D}] - \mathbb{E}_{y \sim p(y \mid \mathbf{x}, \mathcal{D})}\, H[\theta \mid \mathcal{D}, \mathbf{x}, y]$, where $\theta$ is the set of model parameters, $\mathcal{D}$ is the data observed so far, and $H$ is Shannon's entropy (Shannon, 2001), an uncertainty measure. The first term is the current entropy and the second term is the expected entropy after having observed data point $\{\mathbf{x}, y\}$. As shown by Houlsby et al. (2011), this expression in possibly infinite-dimensional parameter space can be rewritten in low-dimensional $y$ space: $H[y \mid \mathbf{x}, \mathcal{D}] - \mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})}\, H[y \mid \mathbf{x}, \theta]$. This expression is maximized when the first term is high (the model is marginally very uncertain about $y$) but the second term is low (individual settings of $\theta$ are very confident). We can therefore interpret this expression as the degree to which parameters under the posterior disagree (Houlsby et al., 2011).
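As an illustration of the disagreement interpretation, the following is a minimal sketch that estimates the $y$-space expression by Monte Carlo for a binary outcome. It assumes we already have a handful of predictive probabilities $p(y = 1 \mid \mathbf{x}, \theta_i)$, one per posterior sample $\theta_i$; the function names and the example probabilities are hypothetical and do not come from Houlsby et al. (2011).

```python
import math


def binary_entropy(p):
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bald_score(probs):
    """Monte Carlo estimate of H[y | x, D] - E_theta H[y | x, theta].

    probs holds p(y = 1 | x, theta_i) for posterior samples theta_i
    (assumed to be supplied by some model; not computed here).
    """
    mean_p = sum(probs) / len(probs)
    marginal_entropy = binary_entropy(mean_p)  # first term
    expected_entropy = sum(binary_entropy(p) for p in probs) / len(probs)  # second term
    return marginal_entropy - expected_entropy


# Hypothetical posterior samples at three candidate points.
disagree = bald_score([0.01, 0.99, 0.02, 0.98])      # confident but conflicting
agree_unsure = bald_score([0.5, 0.5, 0.5, 0.5])      # every sample is uncertain
agree_sure = bald_score([0.95, 0.96, 0.94, 0.95])    # every sample is confident
```

Under these assumed inputs, only the first candidate scores highly: the samples are individually confident but contradict one another, which is the case the expression rewards. The other two candidates score near zero, because uncertainty shared by all samples and confidence shared by all samples both leave little disagreement to resolve.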

In the context of GP classification, the parameter set $\theta$ becomes the infinite-dimensional latent parameter $f$, i.e. $H[y \mid \mathbf{x}, \mathcal{D}] - \mathbb{E}_{f \sim p(f \mid \mathcal{D})}\, H[y \mid \mathbf{x}, f]$. By using several approximations, we can write the following expression for the acquisition function (Houlsby et al., 2011):

$$A(\mathbf{x}) = h\!\left(\Phi\!\left(\frac{\mu_f(\mathbf{x})}{\sqrt{\sigma_f^2(\mathbf{x}) + 1}}\right)\right) - \frac{\sqrt{\frac{\pi \ln 2}{2}}}{\sqrt{\sigma_f^2(\mathbf{x}) + \frac{\pi \ln 2}{2}}} \exp\!\left(-\frac{\mu_f^2(\mathbf{x})}{2\left(\sigma_f^2(\mathbf{x}) + \frac{\pi \ln 2}{2}\right)}\right), \qquad (2.12)$$

where $h(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$ is the binary entropy function, $\mu_f(\mathbf{x})$ and $\sigma_f^2(\mathbf{x})$ are respectively the posterior mean and posterior variance of $f$ corresponding to the input point $\mathbf{x}$, and $\Phi$ is the sigmoidal likelihood function for classification (2.6).
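To make the behavior of (2.12) concrete, here is a minimal sketch that evaluates the acquisition function from a posterior mean and variance. It assumes that $\Phi$ is the standard Gaussian cumulative distribution function (a probit likelihood), which is the setting in which Houlsby et al. (2011) derive this approximation; the function names and example values are hypothetical.

```python
import math

# The constant pi * ln(2) / 2 that appears three times in (2.12).
C_SQ = math.pi * math.log(2.0) / 2.0


def probit(z):
    """Standard Gaussian CDF (assumed form of the likelihood Phi)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binary_entropy(p):
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bald_gp(mu, var):
    """Approximate acquisition value A(x) given mu_f(x) and sigma_f^2(x)."""
    # First term: entropy of the marginal predictive probability.
    first = binary_entropy(probit(mu / math.sqrt(var + 1.0)))
    # Second term: approximate expected entropy under the posterior on f.
    second = (math.sqrt(C_SQ) / math.sqrt(var + C_SQ)
              * math.exp(-mu ** 2 / (2.0 * (var + C_SQ))))
    return first - second


# Hypothetical candidate points.
uncertain = bald_gp(0.0, 100.0)   # mean at the decision boundary, large variance
moderate = bald_gp(0.0, 1.0)      # same mean, smaller variance
confident = bald_gp(5.0, 0.01)    # mean far from the boundary, small variance
```

With these assumed values, the score grows with posterior variance when the mean sits at the decision boundary, and it is close to zero when the latent function is confidently far from the boundary. Because the second term is an approximation, the result can be very slightly negative in such confident cases.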

2.3. Concluding Remarks

The Gaussian process is a Bayesian inference framework that encodes relationships between variables rather than requiring a parametric form for the function to be estimated. Given an appropriate choice of mean and covariance functions, it can capture a diverse set of function behaviors, and can also incorporate prior constraints on function shapes given prior information. When trained on some (possibly binary) observed data, the GP posterior provides an entire probability distribution on test points, rather than point estimates.

Taken together, these qualities make the GP an attractive framework for performing inference on audiometric functions. Its nonparametric nature supports various audiogram shapes, which cannot be easily parameterized, and its estimation of entire probability distributions allows for painless integration with active sampling frameworks. Overall, the GP represents a flexible and efficient framework for performing audiometric inference.

Chapter 3: Automated Estimation of Human Threshold Audiograms Using Active Machine Learning

Note: The research presented in Chapter 3 has been published in Ear and Hearing (Song et al., 2015).

3.1. Introduction

As described in Chapter 1, current methods of determining a threshold audiogram exhibit many shortcomings. The clinical Hughson-Westlake staircase method (Carhart and Jerger, 1959; Katz et al., 2009), along with the numerous automated techniques that replicate the procedure (Mahomed et al., 2013), provides thresholds only at a small number (6-9) of standard audiogram frequencies. Moreover, the procedure for determining threshold at any particular frequency is both inefficient and predictable: it presents multiple identical stimuli and selects tones at sound levels where the listener’s response is already quite certain.

Primarily to address the first shortcoming above, techniques that sweep tone stimuli through multiple frequencies have been developed, including Békésy audiometry and Audioscan® (von Békésy, 1947; Meyer-Bisch, 1996; Ishak et al., 2011). While these techniques can provide relatively continuous threshold curves as a function of frequency, they have limitations. Békésy audiometry is comparatively quick for a sweep-based method, but results in a somewhat “jagged” estimate of the threshold curve that lacks specificity along the intensity dimension. Audioscan® offers a smoother estimate; however, the estimate is still quantized to discrete intensity levels, and the procedure is much more time-intensive to perform.

Bayesian methods are a particularly promising set of audiometric procedures (Özdamar et al., 1990; Stadler, 2009). Unlike standard procedural methods such as Hughson-Westlake (HW), these methods select at every iteration the optimal stimulus frequency and intensity to present, informed by both prior data and the set of all other responses collected so far. The use of all observed responses across multiple frequencies to inform the current estimate stands in stark contrast to the HW procedure, in which all samples for a particular frequency are discarded after the corresponding threshold has been determined. Studies have shown large efficiency gains using these Bayesian techniques, but like standard clinical techniques, these methods still constrain the choice of possible frequencies to the 6-9 standard locations.

This chapter describes the development of the first version of the machine learning audiogram (MLAG), which is designed for estimating the threshold audiogram. The algorithm utilizes Gaussian process classification (GPC) and can be conceptualized as an extension of the Bayesian techniques: at every iteration, prior constraints and the set of all observations are used to form an estimate, and an optimal query point is selected. The final threshold estimate is approximately continuous along both frequency and intensity dimensions, and active selection of samples allows for efficient threshold audiogram estimation.

In document Improving Pure-Tone Audiometry Using Probabilistic Machine Learning Classification (Page 59-62)