The Misspecified Model Case - Activized Learning: Transforming Passive to Active with Improved Label Complexity

Here we present a proof of Theorem 28, including a specification of the method $\mathcal{A}'_a$ from the theorem statement.

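Proof [Theorem 28] Consider a weakly universally consistent passive learning algorithm $\mathcal{A}_u$ (Devroye, Györfi, and Lugosi, 1996). Such a method must exist in our setting; for instance, Hoeffding's inequality and a union bound imply that it suffices to take

$$\mathcal{A}_u(\mathcal{L}) = \operatorname*{argmin}_{\mathbb{1}^{\pm}_{B_i}} \; \mathrm{er}_{\mathcal{L}}\big(\mathbb{1}^{\pm}_{B_i}\big) + \sqrt{\frac{\ln(4 i^2 |\mathcal{L}|)}{2 |\mathcal{L}|}},$$

where $\{B_1, B_2, \ldots\}$ is a countable algebra that generates $\mathcal{F}_X$.

Then $\mathcal{A}_u$ achieves a label complexity $\Lambda_u$ such that, for any distribution $P_{XY}$ on $\mathcal{X} \times \{-1,+1\}$ and all $\varepsilon \in (0,1)$, $\Lambda_u(\varepsilon + \nu^*(P_{XY}), P_{XY}) < \infty$. In particular, if $\nu^*(P_{XY}) < \nu(\mathbb{C}; P_{XY})$, then we have

$$\Lambda_u\big((\nu^*(P_{XY}) + \nu(\mathbb{C}; P_{XY}))/2, P_{XY}\big) < \infty.$$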

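Fix any $n \in \mathbb{N}$ and describe the execution of $\mathcal{A}'_a(n)$ as follows. In a preprocessing step, withhold the first $m_{un} = n - \lfloor n/2 \rfloor - \lfloor n/3 \rfloor \ge n/6$ examples $\{X_1, \ldots, X_{m_{un}}\}$ and request their labels $\{Y_1, \ldots, Y_{m_{un}}\}$. Run $\mathcal{A}_a(\lfloor n/2 \rfloor)$ on the remainder of the sequence $\{X_{m_{un}+1}, X_{m_{un}+2}, \ldots\}$ (i.e., shift any index references in the algorithm by $m_{un}$), and let $h_a$ denote the classifier it returns. Also request the labels $Y_{m_{un}+1}, \ldots, Y_{m_{un}+\lfloor n/3 \rfloor}$, and let

$$h_u = \mathcal{A}_u\big((X_{m_{un}+1}, Y_{m_{un}+1}), \ldots, (X_{m_{un}+\lfloor n/3 \rfloor}, Y_{m_{un}+\lfloor n/3 \rfloor})\big).$$

If $\mathrm{er}_{m_{un}}(h_a) - \mathrm{er}_{m_{un}}(h_u) > n^{-1/3}$, return $\hat{h} = h_u$; otherwise, return $\hat{h} = h_a$. This method achieves the stated result, for the following reasons.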
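The budget split and the final comparison step described above can be sketched in code. This is a minimal Python illustration under stated assumptions, not the paper's implementation: the names `budget_split`, `empirical_error`, and `select_classifier` are hypothetical, classifiers are assumed to be plain callables returning labels in {-1, +1}, and $h_a$ and $h_u$ are assumed to have been produced already by the active and passive learners.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical type: a classifier maps a point to a label in {-1, +1}.
Classifier = Callable[[float], int]


def budget_split(n: int) -> Tuple[int, int, int]:
    """Split a label budget n into (m_un, active, passive) parts.

    floor(n/2) labels go to the active learner, floor(n/3) to the passive
    learner, and the remaining m_un labels are held out for the comparison.
    """
    active = n // 2
    passive = n // 3
    m_un = n - active - passive
    return m_un, active, passive


def empirical_error(h: Classifier, xs: Sequence[float], ys: Sequence[int]) -> float:
    """Fraction of held-out examples that h misclassifies."""
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)


def select_classifier(
    h_a: Classifier,
    h_u: Classifier,
    xs: Sequence[float],
    ys: Sequence[int],
    n: int,
) -> Classifier:
    """Return h_u if h_a looks worse by more than n^(-1/3) on the held-out set,
    and h_a otherwise."""
    gap = empirical_error(h_a, xs, ys) - empirical_error(h_u, xs, ys)
    return h_u if gap > n ** (-1.0 / 3.0) else h_a
```

For instance, with $n = 27$ the threshold is $27^{-1/3} = 1/3$, so `select_classifier` returns `h_u` only when its held-out error is lower than that of `h_a` by more than one third.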

First, let us examine the final step of this algorithm. By Hoeffding's inequality, with probability at least $1 - 2 \cdot \exp\{-n^{1/3}/12\}$,

$$\big| (\mathrm{er}_{m_{un}}(h_a) - \mathrm{er}_{m_{un}}(h_u)) - (\mathrm{er}(h_a) - \mathrm{er}(h_u)) \big| \le n^{-1/3}.$$
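To see where the constant 12 in this probability comes from, here is a short worked calculation. It assumes the standard two-sided Hoeffding bound for i.i.d. variables taking values in an interval of length 2, applied conditionally on $h_a$ and $h_u$ (which do not depend on the $m_{un}$ held-out examples); the symbols $Z_i$ and $t$ are introduced here only for illustration.

```latex
\begin{align*}
Z_i &= \mathbb{1}[h_a(X_i) \neq Y_i] - \mathbb{1}[h_u(X_i) \neq Y_i] \in [-1, 1],
  \qquad i = 1, \ldots, m_{un}, \\
\mathbb{P}\Big( \Big| \frac{1}{m_{un}} \sum_{i=1}^{m_{un}} Z_i - \mathbb{E}[Z_1] \Big| > t \Big)
  &\le 2 \exp\Big\{ -\frac{2 m_{un} t^2}{2^2} \Big\}
   = 2 \exp\Big\{ -\frac{m_{un} t^2}{2} \Big\}, \\
t = n^{-1/3}, \quad m_{un} \ge n/6
  \quad &\Longrightarrow \quad
  2 \exp\Big\{ -\frac{m_{un} n^{-2/3}}{2} \Big\} \le 2 \exp\Big\{ -\frac{n^{1/3}}{12} \Big\}.
\end{align*}
```

Here the empirical mean of the $Z_i$ equals $\mathrm{er}_{m_{un}}(h_a) - \mathrm{er}_{m_{un}}(h_u)$ and $\mathbb{E}[Z_1]$ equals $\mathrm{er}(h_a) - \mathrm{er}(h_u)$, so the inequality above holds with the stated probability.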

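When this is the case, a triangle inequality implies $\mathrm{er}(\hat{h}) \le \min\{\mathrm{er}(h_a), \mathrm{er}(h_u) + 2n^{-1/3}\}$. If $P_{XY}$ satisfies the benign noise case, then for any

$$n \ge 2\Lambda_a(\varepsilon/2 + \nu(\mathbb{C}; P_{XY}), P_{XY}),$$

we have $\mathbb{E}[\mathrm{er}(h_a)] \le \nu(\mathbb{C}; P_{XY}) + \varepsilon/2$, so $\mathbb{E}[\mathrm{er}(\hat{h})] \le \nu(\mathbb{C}; P_{XY}) + \varepsilon/2 + 2 \cdot \exp\{-n^{1/3}/12\}$, which is at most $\nu(\mathbb{C}; P_{XY}) + \varepsilon$ if $n \ge 12^3 \ln^3(4/\varepsilon)$. So in this case, we can take $\lambda(\varepsilon) = 12^3 \ln^3(4/\varepsilon)$.

On the other hand, if $P_{XY}$ is not in the benign noise case (i.e., the misspecified model case), then for any $n \ge 3\Lambda_u((\nu^*(P_{XY}) + \nu(\mathbb{C}; P_{XY}))/2, P_{XY})$, $\mathbb{E}[\mathrm{er}(h_u)] \le (\nu^*(P_{XY}) + \nu(\mathbb{C}; P_{XY}))/2$, so that

$$\mathbb{E}[\mathrm{er}(\hat{h})] \le \mathbb{E}[\mathrm{er}(h_u)] + 2n^{-1/3} + 2 \cdot \exp\{-n^{1/3}/12\} \le (\nu^*(P_{XY}) + \nu(\mathbb{C}; P_{XY}))/2 + 2n^{-1/3} + 2 \cdot \exp\{-n^{1/3}/12\}.$$

Again, this is at most $\nu(\mathbb{C}; P_{XY}) + \varepsilon$ if $n \ge \max\{12^3 \ln^3(2/\varepsilon),\ 64 (\nu(\mathbb{C}; P_{XY}) - \nu^*(P_{XY}))^{-3}\}$. So in this case, we can take

$$\lambda(\varepsilon) = \max\left\{ 12^3 \ln^3\frac{2}{\varepsilon},\ 3\Lambda_u\left(\frac{\nu^*(P_{XY}) + \nu(\mathbb{C}; P_{XY})}{2}, P_{XY}\right),\ \frac{64}{(\nu(\mathbb{C}; P_{XY}) - \nu^*(P_{XY}))^3} \right\}.$$

In either case, we have $\lambda(\varepsilon) \in \mathrm{Polylog}(1/\varepsilon)$.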
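The thresholds in the two cases can be checked numerically. The following Python sketch is illustrative only: the function names are hypothetical, `label_complexity_u` is a made-up numeric stand-in for $\Lambda_u((\nu^*(P_{XY}) + \nu(\mathbb{C}; P_{XY}))/2, P_{XY})$ (which depends on the distribution and is not computable in general), and rounding up to an integer budget is a choice made here for convenience.

```python
import math


def failure_term(n: int) -> float:
    """Bound 2*exp(-n^(1/3)/12) on the probability that the held-out comparison fails."""
    return 2.0 * math.exp(-(n ** (1.0 / 3.0)) / 12.0)


def lambda_benign(eps: float) -> int:
    """Budget 12^3 * ln^3(4/eps) from the benign noise case, rounded up."""
    return math.ceil(12 ** 3 * math.log(4.0 / eps) ** 3)


def lambda_misspecified(
    eps: float, nu_star: float, nu_c: float, label_complexity_u: float
) -> int:
    """Budget from the misspecified case, rounded up.

    nu_star and nu_c stand for nu*(P_XY) and nu(C; P_XY); their gap is
    assumed strictly positive. label_complexity_u is a hypothetical value
    of Lambda_u at the midpoint error level.
    """
    gap = nu_c - nu_star
    return math.ceil(
        max(
            12 ** 3 * math.log(2.0 / eps) ** 3,
            3.0 * label_complexity_u,
            64.0 / gap ** 3,
        )
    )
```

With a budget of at least `lambda_benign(eps)`, the sum `eps / 2 + failure_term(n)` stays within `eps`; with a budget of at least `lambda_misspecified(...)`, the bound `(nu_star + nu_c) / 2 + 2 * n ** (-1/3) + failure_term(n)` stays within `nu_c + eps`, matching the two inequalities in the proof.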

References

N. Abe and H. Mamitsuka. Query learning strategies using boosting and bagging. In Proceedings of the 15th International Conference on Machine Learning, 1998.

M. Alekhnovich, M. Braverman, V. Feldman, A. Klivans, and T. Pitassi. Learnability and automatizability. In Proceedings of the 45th Foundations of Computer Science, 2004.

K. Alexander. Probability inequalities for empirical processes and a law of the iterated logarithm. The Annals of Probability, 4:1041–1067, 1984.

M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

A. Antos and G. Lugosi. Strong minimax lower bounds for learning. Machine Learning, 30:31–56, 1998.

R. B. Ash and C. A. Doléans-Dade. Probability & Measure Theory. Academic Press, 2000.

M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, 2006a.

M.-F. Balcan, A. Blum, and S. Vempala. Kernels as features: On kernels, margins, and low-dimensional mappings. Machine Learning Journal, 65(1):79–94, 2006b.

M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Proceedings of the 20th Conference on Learning Theory, 2007.

M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.

M.-F. Balcan, S. Hanneke, and J. Wortman Vaughan. The true sample complexity of active learning. Machine Learning, 80(2–3):111–139, 2010.

J. Baldridge and A. Palmer. How well does active learning actually work? Time-based evaluation of cost-reduction strategies for language documentation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2009.

Z. Bar-Yossef. Sampling lower bounds via information theory. In Proceedings of the 35th Annual ACM Symposium on the Theory of Computing, 2003.

P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In Proceedings of the International Conference on Machine Learning, 2009.

A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems 23, 2010.

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36(4):929–965, 1989.

F. Bunea, A. B. Tsybakov, and M. Wegkamp. Sparsity oracle inequalities for the lasso. Electronic Journal of Statistics, 1:169–194, 2009.

C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers. In Proceedings of the 17th International Conference on Machine Learning, 2000.

R. Castro and R. D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.

D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.

S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems 18, 2005.

S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. In Proceedings of the 18th Conference on Learning Theory, 2005.

S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems 20, 2007.

S. Dasgupta, A. T. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. Journal of Machine Learning Research, 10:281–299, 2009.

O. Dekel, C. Gentile, and K. Sridharan. Robust selective sampling from single and multiple teachers. In Proceedings of the 23rd Conference on Learning Theory, 2010.

L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag New York, Inc., 1996.

R. M. Dudley. Real Analysis and Probability. Cambridge University Press, 2002.

Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997.

E. Friedman. Active learning for smooth problems. In Proceedings of the 22nd Conference on Learning Theory, 2009.

R. Gangadharaiah, R. D. Brown, and J. Carbonell. Active learning in example-based machine translation. In Proceedings of the 17th Nordic Conference on Computational Linguistics, 2009.

E. Giné and V. Koltchinskii. Concentration inequalities and asymptotic results for ratio type empirical processes. The Annals of Probability, 34(3):1143–1216, 2006.

S. A. Goldman and M. J. Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50:20–31, 1995.

S. Hanneke. Teaching dimension and the complexity of active learning. In Proceedings of the 20th Conference on Learning Theory, 2007a.

S. Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning, 2007b.

S. Hanneke. Adaptive rates of convergence in active learning. In Proceedings of the 22nd Conference on Learning Theory, 2009a.

S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Machine Learning Department, School of Computer Science, Carnegie Mellon University, 2009b.

S. Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.

S. Har-Peled, D. Roth, and D. Zimak. Maximum margin coresets for active and noise tolerant learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence,

D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150, 1992.

D. Haussler, N. Littlestone, and M. Warmuth. Predicting {0,1}-functions on randomly drawn points. Information and Computation, 115:248–292, 1994.

T. Hegedüs. Generalized teaching dimension and the query complexity of learning. In Proceedings of the 8th Conference on Computational Learning Theory, 1995.

L. Hellerstein, K. Pillaipakkamnatt, V. Raghavan, and D. Wilkins. How many queries are needed to learn? Journal of the Association for Computing Machinery, 43(5):840–862, 1996.

D. Helmbold, R. Sloan, and M. Warmuth. Learning nested differences of intersection-closed concept classes. Machine Learning, 5:165–196, 1990.

S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Batch mode active learning and its application to medical image classification. In Proceedings of the 23rd International Conference on Machine Learning, 2006.

M. Kääriäinen. Active learning in the non-realizable case. In Proceedings of the 17th International Conference on Algorithmic Learning Theory, 2006.

N. Karmarkar. A new polynomial-time algorithm for linear programming. Combinatorica, 4:373–395, 1984.

M. J. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. The MIT Press, 1994.

M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. Machine Learning, 17:115–141, 1994.

L. G. Khachiyan. A polynomial algorithm in linear programming. Soviet Mathematics Doklady, 20:191–194, 1979.

V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.

V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. Journal of Machine Learning Research, 11:2457–2485, 2010.

V. Koltchinskii. Oracle inequalities in empirical risk minimization and sparse recovery problems. In École d'Été de Probabilités de Saint-Flour XXXVIII-2008. Lecture Notes in Mathematics, 2033, Springer, 2011.

S. Li. Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics and Statistics, 4(1):66–70, 2011.

M. Lindenbaum, S. Markovitch, and D. Rusakov. Selective sampling for nearest neighbor classifiers. Machine Learning, 54:125–152, 2004.

T. Luo, K. Kramer, D. B. Goldgof, L. O. Hall, S. Samson, A. Remsen, and T. Hopkins. Active learning to recognize multiple types of plankton. Journal of Machine Learning Research, 6:589–613, 2005.

S. Mahalanabis. A note on active learning for smooth problems. arXiv:1103.3095, 2011.

E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27:1808–1829, 1999.

P. Massart and É. Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.

A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. In Proceedings of the 15th International Conference on Machine Learning, 1998.

P. Mitra, C. A. Murthy, and S. K. Pal. A probabilistic active support vector learning algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3):413–418, 2004.

J. R. Munkres. Topology. Prentice Hall, Inc., 2nd edition, 2000.

I. Muslea, S. Minton, and C. A. Knoblock. Active + semi-supervised learning = robust multi-view learning. In Proceedings of the 19th International Conference on Machine Learning, 2002.

R. D. Nowak. Generalized binary search. In Proceedings of the 46th Annual Allerton Conference on Communication, Control, and Computing, 2008.

L. Pitt and L. G. Valiant. Computational limitations on learning from examples. Journal of the Association for Computing Machinery, 35(4):965–984, 1988.

J. Poland and M. Hutter. MDL convergence speed for Bernoulli sequences. Statistics and Computing, 16:161–175, 2006.

G. V. Rocha, X. Wang, and B. Yu. Asymptotic distribution and sparsistency for l1-penalized parametric M-estimators with applications to linear SVM and logistic regression. arXiv:0908.1940v1, 2009.

D. Roth and K. Small. Margin-based active learning for structured output spaces. In European Conference on Machine Learning, 2006.

N. Roy and A. McCallum. Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the 18th International Conference on Machine Learning, 2001.

A. I. Schein and L. H. Ungar. Active learning for logistic regression: An evaluation. Machine Learning, 68(3):235–265, 2007.

G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In Proceedings of the 17th International Conference on Machine Learning, 2000.

B. Settles. Active learning literature survey. http://active-learning.net, 2010.

S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2, 2001.

A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.

L. G. Valiant. A theory of the learnable. Communications of the Association for Computing Machinery, 27(11):1134–1142, 1984.

A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.

V. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York, 1982.

V. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.

V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971.

A. Wald. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2):117–186, 1945.

L. Wang. Sufficient conditions for agnostic active learnable. In Advances in Neural Information Processing Systems 22, 2009.

L. Wang. Smoothness, disagreement coefficient, and the label complexity of agnostic active learning. Journal of Machine Learning Research, 12:2269–2292, 2011.

L. Wang and X. Shen. On L1-norm multiclass support vector machines. Journal of the American Statistical Association, 102(478):583–594, 2007.

L. Yang, S. Hanneke, and J. Carbonell. The sample complexity of self-verifying Bayesian active learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.
