Supplementary Materials for “A Study on Conjugate Gradient Stopping Criteria in Truncated Newton Methods for Linear Classification”

For the experiments we use the two-class classification data sets listed in Table I. All of them except yahookr and yahoojp are available from LIBSVM Data Sets (2007). Table I also gives the statistics of each data set.

Data set    #instances     #features     density     log₂(C_best), LR    log₂(C_best), L2
HIGGS       11,000,000     28            92.1057%    -6                  -12
w8a         49,749         300           3.8834%     8                   2
rcv1        20,242         47,236        0.1568%     6                   0
real-sim    72,309         20,958        0.2448%     3                   -1
news20      19,996         1,355,191     0.0336%     9                   3
url         2,396,130      3,231,962     0.0036%     -7                  -10
yahoojp     176,203        832,026       0.0160%     3                   -1
yahookr     460,554        3,052,939     0.0111%     6                   1
webspam     350,000        16,609,143    0.0220%     2                   -3
kdda        8,407,752      20,216,831    0.0001%     -3                  -5
kddb        19,264,097     29,890,095    0.0001%     -1                  -4
criteo      45,840,617     1,000,000     0.0039%     -15                 -12
kdd12       149,639,105    54,686,452    0.00002%    -4                  -11

Table I: Data statistics. The density is the average fraction of non-zero features per instance, given as a percentage. C_best is the regularization parameter selected by cross validation, reported separately for the LR and L2 losses.
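Because Table I reports density as a percentage, one natural reading is the fraction of non-zero feature values, averaged over instances. The following is a minimal sketch under that assumption; the function name `density` and the dict-based sparse representation are illustrative choices, not taken from the paper.

```python
def density(rows, n_features):
    """Average fraction of non-zero features per instance.

    rows: list of dicts mapping feature index -> value (one dict per instance).
    n_features: total number of features in the data set.
    """
    if not rows or n_features <= 0:
        raise ValueError("need at least one instance and one feature")
    # Count the stored non-zero values of each instance.
    nnz_per_row = [sum(1 for v in row.values() if v != 0) for row in rows]
    # Average of (nnz_i / n_features) over all instances.
    return sum(nnz_per_row) / (len(rows) * n_features)


# Toy data: 3 instances, 4 features, 3 non-zero values in total.
toy = [{0: 1.0, 3: 2.5}, {1: -1.0}, {}]
print(f"{100 * density(toy, 4):.4f}%")
```

Applied to a real data set, the same ratio would be computed from the sparse LIBSVM-format file rather than from in-memory dicts.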

I. Additional Experimental Results

In this section we present complete results for linear classification problems with either

• LR-loss or

• L2-loss.
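For reference, the two losses are commonly written as below in the linear-classification literature (for example in the LIBLINEAR formulation). The notation is an assumption, since this supplement does not restate it: w is the weight vector, C the regularization parameter, and (y_i, x_i), i = 1, ..., l, the training instances with labels y_i ∈ {−1, +1}.

```latex
\min_{\boldsymbol{w}} \; f(\boldsymbol{w})
  = \frac{1}{2}\boldsymbol{w}^{\top}\boldsymbol{w}
  + C \sum_{i=1}^{l} \xi\bigl(y_i \boldsymbol{w}^{\top}\boldsymbol{x}_i\bigr),
\qquad
\xi_{\mathrm{LR}}(z) = \log\bigl(1 + e^{-z}\bigr),
\qquad
\xi_{\mathrm{L2}}(z) = \max(0,\, 1 - z)^{2}.
```

Under this reading, the figures labeled Loss=LR use the logistic loss and those labeled Loss=L2 use the squared hinge loss of a linear SVM.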

References

LIBSVM Data Sets, 2007. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets.

[Plots: 13 panels, (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. x-axis: CG iterations; y-axis: (f − f*)/f*, log scale. Legend: ResConst01, ResConst09, QuadConst01, QuadConst09.]

Figure i: A comparison of different constant thresholds in the inner stopping condition. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations. Loss=LR and C = C_best in each sub-figure.
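To make the notion of a constant threshold in the inner stopping condition concrete, here is a minimal sketch of a CG loop that stops on a residual ratio. It assumes the residual-type criterion has the form ‖g + H s‖ ≤ η ‖g‖ and that labels such as ResConst01 and ResConst09 correspond to η = 0.1 and η = 0.9; both are assumptions, since this supplement does not restate the criteria. The helper names and the dense list-of-lists Hessian are illustrative only.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(H, v):
    return [dot(row, v) for row in H]


def truncated_cg(H, g, eta, max_iter=None):
    """Approximately solve H s = -g by CG for a symmetric positive definite H.

    Stops as soon as ||g + H s|| <= eta * ||g|| (assumed residual criterion)
    or after max_iter iterations. Returns (s, number of CG iterations).
    """
    n = len(g)
    max_iter = n if max_iter is None else max_iter
    s = [0.0] * n
    r = [-gi for gi in g]      # residual r = -g - H s, with s = 0
    d = r[:]                   # first search direction
    rr = dot(r, r)
    g_norm = math.sqrt(dot(g, g))
    iters = 0
    while iters < max_iter and math.sqrt(rr) > eta * g_norm:
        Hd = matvec(H, d)
        alpha = rr / dot(d, Hd)
        s = [si + alpha * di for si, di in zip(s, d)]
        r = [ri - alpha * hdi for ri, hdi in zip(r, Hd)]
        rr_new = dot(r, r)
        d = [ri + (rr_new / rr) * di for ri, di in zip(r, d)]
        rr = rr_new
        iters += 1
    return s, iters


# Toy problem: a loose threshold (0.9) needs no more CG steps than a tight one (0.1).
H = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
g = [1.0, -2.0, 0.5]
for eta in (0.9, 0.1):
    s, iters = truncated_cg(H, g, eta)
    print(eta, iters)
```

In a truncated Newton method this loop runs once per outer iteration, so the x-axis of the figures would correspond to the running sum of `iters`.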

[Plots: 13 panels, (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. x-axis: CG iterations; y-axis: (f − f*)/f*, log scale.]

Figure ii: A comparison of different constant thresholds in the inner stopping condition. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations. Loss=LR and C = 100 C_best in each sub-figure.

[Plots: 13 panels, (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. x-axis: CG iterations; y-axis: (f − f*)/f*, log scale. Legend: ResConst09, ResAdaL2, QuadConst09, QuadAdaL2.]

Figure iii: A comparison between adaptive and constant forcing sequences. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations. Loss=LR and C = C_best in each sub-figure.
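The difference between a constant and an adaptive forcing sequence can be sketched as follows. The exact adaptive rules behind labels such as ResAdaL2 and QuadAdaL2 are not restated in this supplement, so the sketch uses a generic gradient-norm-based form, η_k = min(c1, c2 ‖∇f(w_k)‖^c3), which is an assumption; the constants c1, c2, c3 and the function names are hypothetical.

```python
def constant_forcing(eta):
    """Forcing sequence that returns the same threshold at every outer iteration."""
    def rule(k, grad_norm):
        return eta
    return rule


def adaptive_forcing(c1=0.5, c2=1.0, c3=0.5):
    """Generic adaptive rule eta_k = min(c1, c2 * grad_norm**c3).

    The constants are hypothetical; the threshold tightens as the gradient
    norm shrinks, so later Newton systems are solved more accurately.
    """
    def rule(k, grad_norm):
        return min(c1, c2 * grad_norm ** c3)
    return rule


# Thresholds along a made-up sequence of decreasing gradient norms.
grad_norms = [10.0, 1.0, 0.01, 1e-6]
const_rule = constant_forcing(0.9)
ada_rule = adaptive_forcing()
for k, gn in enumerate(grad_norms):
    print(k, const_rule(k, gn), ada_rule(k, gn))
```

With a rule of this kind the inner solver is cheap far from the optimum and becomes stricter near it, which is the trade-off the adaptive curves in the figure explore.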

[Plots: 13 panels, (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. x-axis: CG iterations; y-axis: (f − f*)/f*, log scale.]

Figure iv: A comparison between adaptive and constant forcing sequences. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations. Loss=LR and C = 100 C_best in each sub-figure.

[Plots: 13 panels, (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. x-axis: CG iterations; y-axis: (f − f*)/f*, log scale. Legend: ResAdaL1, ResAdaL2, QuadAdaL2, QuadAdaL1, TR.]

Figure v: An investigation of the robustness of adaptive rules and a comparison with the trust region approach. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations. Loss=LR and C = C_best in each sub-figure.

[Plots: 13 panels, (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. x-axis: CG iterations; y-axis: (f − f*)/f*, log scale. Legend: ResAdaL1, ResAdaL2, QuadAdaL2, QuadAdaL1, TR.]

Figure vi: An investigation of the robustness of adaptive rules and a comparison with the trust region approach. We show the convergence of a truncated Newton method for logistic regression. The x-axis shows the cumulative number of CG iterations. Loss=LR and C = 100 C_best in each sub-figure.

[Plots: 13 panels, (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. x-axis: CG iterations; y-axis: Cosine, log scale from 10⁻¹ to 10⁰. Legend: QuadAdaL2, ResConst09, ResAdaL2.]

Figure vii: Cosines between the anti-gradient −∇f(w_k) and the resulting direction s_k, using different truncation rules. Iterations in which the cosine is below the 0.1 threshold are marked with a dedicated symbol. The x-axis shows the cumulative number of CG iterations. Loss=LR and C = C_best in each sub-figure.
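The quantity plotted here, the cosine of the angle between the anti-gradient −∇f(w_k) and the direction s_k, can be computed as in the following minimal sketch. The function names are illustrative, and the 0.1 threshold is taken from the caption; how the paper's implementation handles flagged iterations is not covered here.

```python
import math


def descent_cosine(grad, s):
    """Cosine of the angle between the anti-gradient -grad and the direction s."""
    num = -sum(g * x for g, x in zip(grad, s))
    den = math.sqrt(sum(g * g for g in grad)) * math.sqrt(sum(x * x for x in s))
    if den == 0.0:
        raise ValueError("gradient and direction must both be non-zero")
    return num / den


def is_flagged(grad, s, threshold=0.1):
    """True if the direction is nearly orthogonal to the anti-gradient."""
    return descent_cosine(grad, s) < threshold


grad = [1.0, 0.0]
print(descent_cosine(grad, [-1.0, 0.0]))   # steepest-descent direction
print(is_flagged(grad, [-0.05, 1.0]))      # almost orthogonal direction
```

A cosine close to 1 means the truncated CG direction is close to steepest descent, while a small positive cosine indicates a direction that is still a descent direction but nearly orthogonal to the anti-gradient.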

[Plots: 13 panels, (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. x-axis: CG iterations; y-axis: Cosine, log scale from 10⁻¹ to 10⁰.]

Figure viii: Cosines between the anti-gradient −∇f(w_k) and the resulting direction s_k, using different truncation rules. Iterations in which the cosine is below the 0.1 threshold are marked with a dedicated symbol. The x-axis shows the cumulative number of CG iterations. Loss=LR and C = 100 C_best in each sub-figure.

[Plots: 13 panels, (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. x-axis: CG iterations; y-axis: (f − f*)/f*, log scale. Legend: ResConst09_l2, ResAdaL2_l2, QuadConst09_l2, QuadAdaL2_l2, ResConst01_l2.]

Figure ix: A comparison between adaptive and constant forcing sequences. We show the convergence of a truncated Newton method for linear SVM. The x-axis shows the cumulative number of CG iterations. Loss=L2 and C = C_best in each sub-figure.

[Plots: 13 panels, (a) HIGGS, (b) w8a, (c) news20, (d) rcv1, (e) real-sim, (f) url, (g) yahoojp, (h) yahookr, (i) webspam, (j) kdda, (k) kddb, (l) criteo, (m) kdd12. x-axis: CG iterations; y-axis: (f − f*)/f*, log scale.]

Figure x: A comparison between adaptive and constant forcing sequences. We show the convergence of a truncated Newton method for linear SVM. The x-axis shows the cumulative number of CG iterations. Loss=L2 and C = 100 C_best in each sub-figure.
