Joint K-Means Quantization for ANN - Residual Vector Quantization based Techniques

4 CONTRIBUTIONS

4.2 Residual Vector Quantization based Techniques

4.2.2 Joint K-Means Quantization for ANN

Another RVQ-based VQ method for ANN proposed in this thesis is Joint K-Means Quantization (JKM). As mentioned in Chapters 3.3.2.1 and 4.2.1, RVQ's hierarchical structure separates the quantization problem into $M$ subproblems, and the solution of each subproblem strongly depends on the previous one. However, the solution proposed for RVQ does not consider this dependence. ERVQ claims to offer a joint training scheme, but the proposed algorithm only updates the codebooks generated by RVQ, which were already obtained independently of each other. Hence, the proposed codebook update does not really constitute a joint scheme.

Nevertheless, combining the hierarchical structure with a joint codebook generation strategy would increase the performance while retaining the low encoding complexity. Following this claim, Joint K-Means is proposed [P4]. JKM expands the “K-means”²² training on one of RVQ’s layers to all layers, providing a joint training scheme. In the training scheme of K-Means²³, an “expectation” step is performed first, where the vectors are assigned to the nearest codevectors. A “maximization” step then follows, where the codevectors are updated with the means of the assigned vectors. RVQ applies these steps for many iterations separately at each layer. In JKM, it is proposed to extend this to all layers.

²² K-Means clustering algorithm and Lloyd’s vector quantization are sometimes used interchangeably in the literature.

The “expectation-maximization” steps of JKM proceed as follows. In the “expectation” step, each vector is assigned to its “selected” codevector, and the residual is immediately calculated and transferred to the next layer, where the same operation is repeated until the final layer is reached. Then, in the “maximization” step, the codevectors at each layer are updated with the means of the vectors assigned to them. Therefore, while RVQ waits for the quantization on a layer to converge, JKM propagates the residuals through the layers during the iterations. Note that JKM does not assign the given vector to the nearest codevector; instead, it assigns it to the “selected” codevector, and this selection is performed by the encoding algorithm. JKM proposes a joint encoding algorithm that also takes the layer below the current layer into account while selecting the codevector from the current layer. Incorporating this encoding method into the training improves the codebook generation even further.
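The iteration described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation from [P4]: the function names, the initialization from sampled residuals, and the fixed iteration count are assumptions, and the nearest codevector stands in for the “selected” codevector that the joint encoder would provide.

```python
import numpy as np

def nearest(R, C):
    """Index of the nearest codevector in C (K, D) for each row of R (N, D)."""
    d = ((R[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def jkm_train(X, M, K, iters=20, seed=0):
    """Jointly train M residual codebooks with K codevectors each (sketch)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Assumed initialization: sample each layer's codebook from the
    # residuals left by the layers above it.
    codebooks, R = [], X.astype(float)
    for _ in range(M):
        C = R[rng.choice(N, size=K, replace=False)].copy()
        codebooks.append(C)
        R = R - C[nearest(R, C)]
    for _ in range(iters):
        # "Expectation": assign on each layer and pass the residual on
        # immediately, without waiting for that layer to converge.
        R, inputs, assigns = X.astype(float), [], []
        for C in codebooks:
            a = nearest(R, C)  # simplification: nearest, not jointly selected
            inputs.append(R)
            assigns.append(a)
            R = R - C[a]
        # "Maximization": move every codevector to the mean of the
        # layer inputs (residuals) that were assigned to it.
        for C, R_in, a in zip(codebooks, inputs, assigns):
            for k in range(K):
                members = R_in[a == k]
                if len(members) > 0:
                    C[k] = members.mean(axis=0)
    return codebooks

def quantization_error(X, codebooks):
    """Mean squared error after greedy layer-by-layer encoding."""
    R = X.astype(float)
    for C in codebooks:
        R = R - C[nearest(R, C)]
    return float((R ** 2).sum(axis=1).mean())
```

Replacing `nearest` inside the training loop with a joint encoder would bring the sketch closer to the scheme described in the text; the rest of the loop would stay the same.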

Encoding in RVQ is also performed independently for each layer in a nearest-neighbor fashion. In other words, the nearest codevector from the corresponding codebook is selected for each residual. However, this does not guarantee the minimum error. Let $\mathbf{c}_{1,a}$ be the closest codevector to $\mathbf{x}$, and let $\mathbf{c}_{2,a}$ be the closest codevector to the first residual $\mathbf{r}_1 = \mathbf{x} - \mathbf{c}_{1,a}$. Let $\mathbf{c}_{1,b}$ be a different codevector from the first codebook, i.e., $\mathbf{c}_{1,a} \neq \mathbf{c}_{1,b}$, and let $\mathbf{c}_{2,b}$ be a codevector from the second codebook. The suboptimality of this encoding scheme can be proven as follows:

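Lemma: Given $\|\mathbf{x} - \mathbf{c}_{1,a}\|_2^2 \le \|\mathbf{x} - \mathbf{c}_{1,b}\|_2^2$ and $\|(\mathbf{x} - \mathbf{c}_{1,a}) - \mathbf{c}_{2,a}\|_2^2 \le \|(\mathbf{x} - \mathbf{c}_{1,a}) - \mathbf{c}_{2,b}\|_2^2$, there exists at least one pair $\mathbf{c}_{1,b}$ and $\mathbf{c}_{2,b}$ that satisfies

$$\|(\mathbf{x} - \mathbf{c}_{1,a}) - \mathbf{c}_{2,a}\|_2^2 \ge \|(\mathbf{x} - \mathbf{c}_{1,b}) - \mathbf{c}_{2,b}\|_2^2 \tag{4.12}$$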

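Proof: Assume that $\mathbf{x} = \mathbf{c}_{1,b} + \mathbf{c}_{2,b}$. Then the right-hand side of (4.12) vanishes, and (4.12) turns into the following:

$$\|(\mathbf{x} - \mathbf{c}_{1,a}) - \mathbf{c}_{2,a}\|_2^2 \ge 0 \tag{4.13}$$

which is always true. Now, if one can show that the assumption $\mathbf{x} = \mathbf{c}_{1,b} + \mathbf{c}_{2,b}$ is valid, the proof is complete. If $\mathbf{x} = \mathbf{c}_{1,b} + \mathbf{c}_{2,b}$, then substituting it into the first inequality given in the lemma gives the following: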

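$$\|\mathbf{c}_{1,b} + \mathbf{c}_{2,b} - \mathbf{c}_{1,a}\|_2^2 \le \|\mathbf{c}_{2,b}\|_2^2 \tag{4.14}$$

Rearranging the terms in (4.14), one can obtain the inequality below:

$$\|\mathbf{c}_{2,b} - (\mathbf{c}_{1,a} - \mathbf{c}_{1,b})\|_2^2 \le \|\mathbf{c}_{2,b}\|_2^2 \tag{4.15}$$

which is true when $\|\mathbf{c}_{1,a} - \mathbf{c}_{1,b}\|_2^2 \le 2\langle \mathbf{c}_{2,b}, \mathbf{c}_{1,a} - \mathbf{c}_{1,b} \rangle$. For the second inequality in the lemma, substituting the assumption $\mathbf{x} = \mathbf{c}_{1,b} + \mathbf{c}_{2,b}$ gives the following inequality:

$$\|(\mathbf{c}_{1,b} + \mathbf{c}_{2,b}) - \mathbf{c}_{1,a} - \mathbf{c}_{2,a}\|_2^2 \le \|\mathbf{c}_{1,b} - \mathbf{c}_{1,a}\|_2^2 \tag{4.16}$$

Rearranging the terms in (4.16), one can obtain the inequality below:

$$\|(\mathbf{c}_{1,b} - \mathbf{c}_{1,a}) - (\mathbf{c}_{2,a} - \mathbf{c}_{2,b})\|_2^2 \le \|\mathbf{c}_{1,b} - \mathbf{c}_{1,a}\|_2^2 \tag{4.17}$$

which is true when $\|\mathbf{c}_{2,a} - \mathbf{c}_{2,b}\|_2^2 \le 2\langle \mathbf{c}_{1,b} - \mathbf{c}_{1,a}, \mathbf{c}_{2,a} - \mathbf{c}_{2,b} \rangle$. Since (4.15) and (4.17) can both hold for a suitable selection of codevectors, in other words they are not always false, $\mathbf{x} = \mathbf{c}_{1,b} + \mathbf{c}_{2,b}$ is a valid case; hence the proof is complete.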
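A small numerical instance may make the lemma concrete. The two codebooks below are hypothetical values chosen only so that both conditions of the lemma hold and $\mathbf{x} = \mathbf{c}_{1,b} + \mathbf{c}_{2,b}$; they do not come from the thesis. Layer-by-layer nearest-neighbor encoding is compared with an exhaustive search over all codevector pairs.

```python
import numpy as np
from itertools import product

# Hypothetical two-layer codebooks (illustrative values only).
C1 = np.array([[3.0, 0.0], [0.0, 0.0]])    # rows: c_{1,a}, c_{1,b}
C2 = np.array([[-0.5, 0.0], [2.0, 0.0]])   # rows: c_{2,a}, c_{2,b}
x = C1[1] + C2[1]                          # x = c_{1,b} + c_{2,b} = (2, 0)

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return float(((u - v) ** 2).sum())

# Layer-by-layer (RVQ-style) encoding: nearest codevector on each layer.
i = min(range(len(C1)), key=lambda k: sq_dist(x, C1[k]))
r1 = x - C1[i]
j = min(range(len(C2)), key=lambda k: sq_dist(r1, C2[k]))
greedy_error = sq_dist(r1, C2[j])

# Exhaustive search over every pair of codevectors.
best_pair = min(
    product(range(len(C1)), range(len(C2))),
    key=lambda p: sq_dist(x, C1[p[0]] + C2[p[1]]),
)
best_error = sq_dist(x, C1[best_pair[0]] + C2[best_pair[1]])

print("greedy:", (i, j), greedy_error)         # greedy: (0, 0) 0.25
print("exhaustive:", best_pair, best_error)    # exhaustive: (1, 1) 0.0
```

Here the first layer prefers $\mathbf{c}_{1,a}$ (squared distance 1 versus 4), yet the pair $(\mathbf{c}_{1,b}, \mathbf{c}_{2,b})$ reconstructs $\mathbf{x}$ exactly, so the greedy choice is strictly worse.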

In order to improve the encoding performance, “joint encoding” is proposed in JKM. Joint encoding is similar to beam search in AQ or OCKM, but much less complex, since it exploits the hierarchical structure, which reduces the number of required computations significantly. The joint encoding method searches for the codevector with the minimum quantization error in a small neighborhood of the nearest codevector. So, instead of only the nearest codevector, it is proposed to select the $H$ nearest codevectors and calculate the residuals for each of them. Then the same operation is repeated for each residual, giving $H^2$ candidates. The best $H$ according to the quantization error are selected, and the operations proceed until the final layer is reached.

To explain the computational costs of encoding in detail, the distance between the $m$-th layer residual $\mathbf{r}_m$ of the given vector $\mathbf{x}$ and the $k$-th codevector on the $m$-th layer, $\mathbf{c}_{m,k}$, can be rewritten as follows:

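$$
\begin{aligned}
d(\mathbf{r}_m, \mathbf{c}_{m,k}) &= \left\| \mathbf{x} - \sum_{l=1}^{m-1} \dot{\mathbf{c}}_l - \mathbf{c}_{m,k} \right\|_2^2 \\
&= \left\| \mathbf{x} - \sum_{l=1}^{m-1} \dot{\mathbf{c}}_l \right\|_2^2 - 2 \left\langle \mathbf{x} - \sum_{l=1}^{m-1} \dot{\mathbf{c}}_l,\ \mathbf{c}_{m,k} \right\rangle + \|\mathbf{c}_{m,k}\|_2^2 \\
&= \left\| \mathbf{x} - \sum_{l=1}^{m-1} \dot{\mathbf{c}}_l \right\|_2^2 - 2 \langle \mathbf{x}, \mathbf{c}_{m,k} \rangle + 2 \sum_{l=1}^{m-1} \langle \dot{\mathbf{c}}_l, \mathbf{c}_{m,k} \rangle + \|\mathbf{c}_{m,k}\|_2^2
\end{aligned}
\tag{4.18}
$$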

where $\dot{\mathbf{c}}_l$ is the nearest codevector on the $l$-th layer. For each layer, note that the first term is already calculated in the previous layers. The third and fourth terms can be retrieved from a look-up table. Hence, the second term should be calculated first for all the codevectors, which requires $O(KD)$ operations for one layer. The third and fourth terms require $O(mKH)$ look-ups and additions for the $m$-th layer. Finally, among all the distances the best $H$ are selected, which costs $O(KH \log H)$. This is repeated $M$ times, so the final cost of encoding is $O\left(MDK + \frac{(M-1)(M-2)}{2} KH + MKH \log H\right)$.
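As a quick sanity check, the cost expression above can be evaluated directly. The parameter values below are assumptions inferred from the printed operation counts rather than read from the parameter rows of Table 12: $M = 4$ and $M = 8$ layers for 32-bit and 64-bit codes, $K = 256$, $D = 128$ for SIFT1M and $D = 960$ for GIST1M, $H = 32$, and a base-2 logarithm. The function names are hypothetical. Under these assumptions, the formulas reproduce the operation counts listed in Table 12.

```python
from math import log2

def sobe_encoding_cost(M, K, D):
    """Operation count for the O(2MKD) encoding cost listed for SOBE."""
    return 2 * M * K * D

def jkm_encoding_cost(M, K, D, H):
    """Operation count for the JKM encoding cost expression (log base 2 assumed)."""
    inner_products = M * D * K                  # second term of (4.18), all layers
    lookups = (M - 1) * (M - 2) // 2 * K * H    # look-ups and additions
    selection = round(M * K * H * log2(H))      # keeping the best H candidates
    return inner_products + lookups + selection

# Assumed settings: (name, M, D); K = 256 and H = 32 throughout.
settings = [
    ("SIFT1M-32", 4, 128),
    ("SIFT1M-64", 8, 128),
    ("GIST1M-32", 4, 960),
    ("GIST1M-64", 8, 960),
]
for name, M, D in settings:
    print(name, sobe_encoding_cost(M, 256, D), jkm_encoding_cost(M, 256, D, 32))
# SIFT1M-32 262144 319488
# SIFT1M-64 524288 761856
# GIST1M-32 1966080 1171456
# GIST1M-64 3932160 2465792
```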

More details on this encoding scheme can be found in [P4] and [P5].
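The joint encoding procedure described in this section (keep the $H$ best partial encodings on each layer and expand each of them with its $H$ nearest codevectors) can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of [P4] or [P5]: the function name `joint_encode` is hypothetical, and distances are computed directly rather than through the look-up tables of (4.18).

```python
import numpy as np

def joint_encode(x, codebooks, H):
    """Beam-style encoding over residual codebooks (illustrative sketch).

    x         : (D,) vector to encode
    codebooks : list of (K, D) arrays, one per layer
    H         : number of candidates kept per layer
    Returns (codes, squared_error) of the best candidate found.
    """
    # Each candidate is (codes chosen so far, residual still to be encoded).
    beams = [((), np.asarray(x, dtype=float))]
    for C in codebooks:
        candidates = []
        for codes, r in beams:
            d = ((r[None, :] - C) ** 2).sum(axis=1)   # error if C[k] is chosen
            for k in np.argsort(d)[:H]:               # H nearest codevectors
                candidates.append((float(d[k]), codes + (int(k),), r - C[k]))
        # Up to H * H candidates from the second layer on; keep the best H.
        candidates.sort(key=lambda t: t[0])
        beams = [(codes, r) for _, codes, r in candidates[:H]]
    codes, r = beams[0]
    return codes, float(r @ r)
```

With $H = 1$ the sketch reduces to layer-by-layer nearest-neighbor encoding, so comparing the returned error for $H = 1$ and a larger $H$ on the same codebooks shows what the wider search gains.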

To conclude, JKM takes the lower layers into account during both the codebook generation and the vector encoding steps. As expected, this improves the quantization performance: in Table 11, JKM attains higher recall than SOBE wherever both results are listed. The tests on ANN benchmarks are shown in Table 11 and Table 12. JKM is also presented in comparison with the prior art in Table 16, Table 17 and Table 18, in Chapter 4.3.

Table 11: JKM Test Results

TEST RESULTS FOR SIFT1M, 32-BIT CODES

Method   recall@1   recall@10   recall@100
SOBE     0.100      0.348       0.731
JKM      0.121      0.402       0.790

TEST RESULTS FOR GIST1M, 32-BIT CODES

Method   recall@1   recall@10   recall@100
SOBE     0.064      0.189       0.403
JKM      0.077      0.213       0.511

TEST RESULTS FOR SIFT1M, 64-BIT CODES

Method   recall@1   recall@10   recall@100
SOBE     0.282      0.701       0.962
JKM      0.323      0.759       0.980

TEST RESULTS FOR GIST1M, 64-BIT CODES

Method   recall@1   recall@10   recall@100
SOBE     0.136      0.360       0.705

Table 12: Computational and Storage Costs of JKM

Encoding Cost for Different Datasets and Code Lengths (Number of Operations)

Method   Encoding Cost                              SIFT1M-32   SIFT1M-64   GIST1M-32   GIST1M-64
SOBE     O(2MKD)                                    262144      524288      1966080     3932160
JKM      O(MDK + ((M−1)(M−2)/2)·KH + MKH log H)     319488      761856      1171456     2465792

Storage Cost for Different Datasets and Code Lengths (MB)

Method   Storage Cost   SIFT1M-32   SIFT1M-64   GIST1M-32   GIST1M-64
SOBE     O(MKD)         1.00        2.00        7.5         15
JKM      O(MKD)         1.00        2.00        7.5         15

Parameter                     SIFT1M-32   SIFT1M-64   GIST1M-32   GIST1M-64
M: number of layers           128         128         960         960
K: number of codevectors      256         256         256         256
D: number of dimensions       8           8           4           4
H: number of candidates       32          32          32          32

In document Vector Quantization Techniques for Approximate Nearest Neighbor Search on Large-Scale Datasets (Page 72-76)