5 Related work - Approximating Likelihood Ratios with Calibrated Discriminative Classifiers

The closest work to the proposed method is due to Neal (2007), who similarly considers the problem of approximating the likelihood function when only a generative model is available.

That work sketches a scheme in which one uses a classifier with both x and θ as an input to serve as a dimensionality reduction map. The key distinction comes in the handling of θ. Neal argues that a classifier cannot be used on real data, since we do not know the correct value for θ, and goes on to outline an approach where one uses regression on a

per-event basis to estimate ˆθ(x) and perform the composition s(x; ˆθ(x)). As pointed out by the author, this can lead to a significant loss of information since a single observation x may carry little information about the true value of θ, though a full data set D may be informative. The work of Neal (2007) correctly identifies this as an approximation of the target likelihood even in the case of a ideal classifier. In contrast, the approach described here does not eliminate the dependence of the classifier on θ. Instead, we embed a parameterized classifier into the likelihood and postpone the evaluation of the classifier to the point of evaluation of the likelihood when θ is explicitly being tested. This avoids the loss of information that occurs from the regression step ˆθ(x) proposed by Neal (2007) and leads to Thm. 1, which is an exact result in the case of an ideal classifier. In both cases, the quality of the classifier is factorized from the calibration of its density, which allows for valid inference even if there is a loss of power due to a non ideal classifier.

Also close to our work, Scott and Nowak (2005) and Xin Tong (2013) consider the machine learning problem associated to Neyman-Pearson hypothesis testing. In a simi-lar setup, they consider the situation where one does not have access to the underlying distributions, but only has i.i.d. samples from each hypothesis. This work generalizes that goal from the Neyman-Pearson setting to generalized likelihood ratio tests and em-phasizes the connection with classification. Ihler et al. (2004) take on a different problem (tests of statistical independence) by using machine learning algorithms to find scalar maps from the high-dimensional feature space that achieve the desired statistical goal when the fundamental high-dimensional test is intractable.

More generally, likelihood ratio testing directly relates to the density ratio estimation problem, which consists in estimating the ratio of two densities from finite collections of observations D₀ and D₁. Density ratio estimation is connected to many machine learning fundamental problems, including transfer learning (Sugiyama and Kawanabe, 2012),

prob-abilistic classification and regression (Vapnik, 1998), outlier detection (Hido et al., 2011), and many others. For learning under covariate shift, Shimodaira (2000) and Sugiyama and M¨uller (2005) estimate the density ratio r(x; θ0, θ1) from straightforward approximations

p(x|θ₀) and ˆp(x|θ₁) separately obtained using kernel density estimation. Despite its theo-retical consistency, this approach is known to be ineffective in practice (Sugiyama et al., 2007; Bickel et al., 2009), since it relies on modeling numerator and denominator high-dimensional densities, which is a harder problem than modeling their ratio only. While the proposed method also proceeds in two similar steps, estimating p(s(x)) is much easier than estimating p(x), since s projects x into a one-dimensional space in which only the infor-mative content of r(x) is preserved. Finally, in contrast with the proposed method which decouples reduction from calibration, other approaches proposed within the literature (see Sugiyama et al. (2012); Gretton et al. (2009); Nguyen et al. (2010); Vapnik et al. (2013) and references therein) provide solutions for estimating r(x; θ₀, θ₁) directly from x, in one step. Under some assumptions, the convergence of the obtained estimates is also proven for some of these approaches.

6 Conclusions

In this work, we have outlined an approach to reformulate generalized likelihood ratio tests with a high-dimensional data set in terms of univariate densities of a classifier score. We have shown that a parameterized family of discriminative classifiers ˆs(x; θ₀, θ₁) trained and calibrated with a simulator can be used to approximate the likelihood ratio, even when it is not possible to directly evaluate the likelihood p(x|θ). The proposed method offers an alter-native to Approximate Bayesian Computation for parameter inference in the likelihood-free setting that can also be used in the frequentist formalism without specifying a prior over

the parameters. In contrast to approaches that learn the posterior conditional on D, our approach can be applied to any observed data D once trained. A strength of this approach is that it separates the quality of the approximation of the target likelihood from the qual-ity of the calibration. The former leverages the continuing advances in supervised learning approaches to classification. The calibration procedure for a particular parameter point is fairly straightforward since it involves estimating a univariate density using a generative model of the data. The difficulty of the calibration stage is performing this calibration continuously in θ. Different strategies to this calibration are anticipated depending on the dimensionality of θ, the complexity of the resulting likelihood function, or the practical issues associated to running the simulator.

Acknowledgments

KC and GL are both supported through NSF ACI-1450310, additionally KC is supported through PHY-1505463 and PHY-1205376. JP was partially supported by the Scientific and Technological Center of Valpara´ıso (CCTVal) under Fondecyt grant BASAL FB0821. KC would like to thank Daniel Whiteson for encouragement and Alex Ihler for challenging dis-cussions that led to a reformulation of the initial idea. KC would also like to thank Shimon Whiteson, Babak Shahbaba for advice in the presentation, Radford Neal for discussion of his earlier work, Yann LeCun, Philip Stark, and Pierre Baldi for their feedback on the project early in its conception, Bal´azs K´egl for feedback to the draft, and Yuri Shirman for reassuring cross checks of the Theorem. KC is grateful to UC-Irvine for their hospitality while this research was carried out and the Moore and Sloan foundations for their generous support of the data science environment at NYU.

References

Aaltonen, T. et al. (2010). Top Quark Mass Measurement in the Lepton + Jets Channel Using a Matrix Element Method and in situ Jet Energy Calibration. Phys.Rev.Lett., 105:252001.

Agostinelli, S. et al. (2003). GEANT4: A Simulation toolkit. Nucl.Instrum.Meth., A506:250–303.

Allanach, B., Lester, C., Parker, M. A., and Webber, B. (2000). Measuring sparticle masses in nonuniversal string inspired models at the LHC. JHEP, 0009:004.

Baak, M., Gadatsch, S., Harrington, R., and Verkerke, W. (2015). Interpolation between multi-dimensional histograms using a new non-linear moment morphing method. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 771:39–48.

Baldi, P., Cranmer, K., Faucett, T., Sadowski, P., and Whiteson, D. (2016). Parameterized Machine Learning for High-Energy Physics. arXiv preprint arXiv:1601.07913.

Beaumont, M. A., Zhang, W., and Balding, D. J. (2002). Approximate bayesian computa-tion in populacomputa-tion genetics. Genetics, 162(4):2025–2035.

Bickel, S., Br¨uckner, M., and Scheffer, T. (2009). Discriminative learning under covariate shift. The Journal of Machine Learning Research, 10:2137–2155.

Brochu, E., Cora, V. M., and De Freitas, N. (2010). A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.

Chen, Y., Di Marco, E., Lykken, J., Spiropulu, M., Vega-Morales, R., et al. (2015). 8D likelihood effective Higgs couplings extraction framework in h → 4`. JHEP, 1501:125.

Cowan, G., Cranmer, K., Gross, E., and Vitells, O. (2010). Asymptotic formulae for likelihood-based tests of new physics. Eur.Phys.J., C71:1554.

Cranmer, K., Lewis, G., Moneta, L., Shibata, A., and Verkerke, W. (2012). HistFactory:

A tool for creating statistical models for use with RooFit and RooStats. CERN-OPEN-2012-016.

Friedman, J., Hastie, T., Tibshirani, R., et al. (2000). Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics, 28(2):337–407.

GPyOpt (2015). GPy: A bayesian optimization framework in python. http://github.

com/SheffieldML/GPyOpt.

Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Sch¨olkopf, B.

(2009). Covariate shift by kernel mean matching. Dataset shift in machine learning, 3(4):5.

Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., and Kanamori, T. (2011). Statisti-cal outlier detection using direct density ratio estimation. Knowledge and information systems, 26(2):309–336.

H¨ormander, L. (1990). The Analysis of Linear Partial Differential Operators I. Springer Science, Business Media.

Ihler, A., Fisher, J., and Willsky, A. (2004). Nonparametric Hypothesis Tests for Statistical Dependency. IEEE Transactions on Signal Processing, 52(8):2234–2249.

Lin, Y. (2002). Support vector machines and the bayes rule in classification. Data Mining and Knowledge Discovery, 6(3):259–275.

Louppe, G., Cranmer, K., and Pavez, J. (2016). carl: a likelihood-free inference toolbox.

http://dx.doi.org/10.5281/zenodo.47798, https://github.com/diana-hep/carl.

Neal, R. M. (2007). Computing likelihood functions for high-energy physics experiments when distributions are defined by simulators with nuisance parameters. In Proceedings of PhyStat2007, CERN-2008-001, pages 111–118.

Nguyen, X., Wainwright, M. J., Jordan, M., et al. (2010). Estimating divergence func-tionals and the likelihood ratio by convex risk minimization. Information Theory, IEEE Transactions on, 56(11):5847–5861.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Read, A. (1999). Linear interpolation of histograms. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 425(1):357–360.

Scott, C. and Nowak, R. (2005). A neyman-pearson approach to statistical learning. IEEE Trans. Inform. Theory, 51:3806–3819.

Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244.

Sjostrand, T., Mrenna, S., and Skands, P. Z. (2006). PYTHIA 6.4 Physics and Manual.

JHEP, 0605:026.

Sugiyama, M. and Kawanabe, M. (2012). Machine learning in non-stationary environments:

Introduction to covariate shift adaptation. MIT Press.

Sugiyama, M., Krauledat, M., and M¨uller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. The Journal of Machine Learning Research, 8:985–

1005.

Sugiyama, M. and M¨uller, K.-R. (2005). Input-dependent estimation of generalization error under covariate shift. Statistics & Decisions, 23(4/2005):249–279.

Sugiyama, M., Suzuki, T., and Kanamori, T. (2012). Density ratio estimation in machine learning. Cambridge University Press.

The ATLAS Collaboration (2012). Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Phys.Lett., B716:1–

29.

The CMS Collaboration (2012). Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC. Phys.Lett., B716:30–61.

Vapnik, V. (1998). Statistical learning theory, volume 1. Wiley New York.

Vapnik, V., Braga, I., and Izmailov, R. (2013). Constructive setting of the density ratio estimation problem and its rigorous solution. arXiv preprint arXiv:1306.0407.

Volobouev, I. (2011). Matrix Element Method in HEP: Transfer Functions, Efficiencies, and Likelihood Normalization. arXiv preprint arXiv:1101.2259.

Xin Tong (2013). A Plug-in Approach to Neyman-Pearson Classification. Journal of Machine Learning Research, 14:3011–3040.

In document Approximating Likelihood Ratios with Calibrated Discriminative Classifiers (Page 26-34)