Fine Slice-Specific Regressor - Hierarchical regression learning for car pose estimation

After coarse group classification, fine regressor for each angle groupGj, j = 1,2,· · ·J is trained to learn regression functionsfj(x)that minimise the loss functionL(fj(xk), yk)

∀ hxk, yki | yk ∈ Gj between the prediction fj(xk)and yk. In mathematics, the object function for thejth regressor with the sum of squared loss can be written as

min N

i=1

kf(xk)−ykk22, (5.3)

where k · k2 denote the Euclidean norm. Regression forest [20] is a popular regression method with high computational efficiency and robustness. It learns an ensemble of decor- related regression trees by randomly selecting the training samples and features. In the training stage, each tree is grown independently with binary splitting strategy adopted in

each node, i.e. each node can have two child nodes. To cope with the limitation of the standard binary splitting method leading to less efficient tree model in reducing the empirical loss, K-clustering Regression Forests (KRF) [4] employ a more flexible splitting method that can allow each node to have more than two child nodes. Motivated by the strong performance of the K-clustering regression forest in vehicle viewpoint estimation in [4], we adopt the following loss function for thejth regressor as:

L(fj(xk), yk) = 1−cos(yk−fj(xk))∈[0,2] (5.4) for training fine regressors in the second step of our framework. Each node of K-clustering regression forests partitions the output space intoK clustersT ={T1,T2,· · · ,TK}, and the cluster labels are used to divide the input space into K disjoint subspaces R =

{R1,R2,· · · ,RK}. It is worth mentioning here, clusters inT are determined without considering the input space. Let us assume that T∗ ₌ _{T∗

1,T2∗,· · · ,TK∗} are the opti- mised clusters and A = {A1,A2,· · · ,AK}denoting a set of constant estimates associ- ated with each subspace, object function in (5.3) can thus be written as

T∗ = argmin T K X k=1 X i∈Tk 1−cos(y_ij− Ak), (5.5)

whereTk = {i : 1−cos(yji − Ak) 6 1−cos(yij − Al),∀1 6 l 6 K}. GivenT∗, the problem is cast as a multi-class classification problem, i.e. partitioningK disjoint input subspace R = {R1,R2,· · · ,RK}to preserve T can be equivalent to training samples xand their class labels {1,2,· · · , K}. As mentioned before, Support Vector Machines is generally adopted for benchmarking classification. For higher efficiency, we use linear kernel in (5.2). Since determining the parameters ofK, the size of clusters adopted in each node splitting, the straightforward choice is to tune via n-fold cross-validation. Alternatively, in [4], adaptive determination of the flexible number of child nodes for each node, namely Adaptive K-clustering Regression Forests (AKRF) was proposed by measuring Bayesian Information Criterion (BIC) [31, 32] to select the size of clustersK.

5.6 Experiments

We also applied our Hierarchical Sliding Slice Regression (HSSR) on EPFL multi-view car dataset. HOG features are extracted from resized 64×64image patches. Two experiments are conducted according to two settings of data split, even split and leave- one-sequence-out. We compared our results against a number of state-of-the-art methods

Table 5.1. Comparative evaluation of the state-of-the-art methods with the EPFL Multi-View Car dataset.

even split leave-one-sequence-out Methods mae (90%) mae (95%) mae (100%) mae (90%) mae (95%) mae (100%) Ozuysalet al. [12] – – 46.48o _– _– _– Torkiet al. [13] 19.40o _26.70o _33.98o _23.13o _26.85o _34.90o Fenziet al. [14] 14.51o _22.83o _31.27o _14.41o _22.72o _31.16o KPLS [4] 16.86o _21.20o _27.65o _– _– _– SVR [4] 17.38o _22.70o _29.14o _– _– _– BRF [4] 23.97o _30.95o _38.13o _– _– _– KRF [4] 8.32o _16.76o _24.80o _11.16o _14.99o _20.18o AKRF [4] 7.73o _16.18o _24.24o _15.74o _21.50o _27.42o HSSR (ours) 3.88o _11.98o _20.30o _8.31o _10.90o _14.24o

for car pose estimation. Besides Ozuysal et al. [12], the remaining methods cast car pose estimation into a regression problem. KPLS [33], SVR [19], BRF [4], KRF [4] and AKRF [4] employ the identical HOG features. Free parameters of KPLS, SVR, KRF and AKRF are tuned via leave-one-sequence-out cross-validation by following the original works. Mean absolute error maeto take the average of the absolute difference between predicted angles and the ground-truth is employed to evaluate and compare the performance of our approach. In addition, following the previous works, the mae of 90%- percentile of the absolute errors and that of 95%-percentile are also reported as robust performance measures.

Comparison with State-Of-The-Arts – The results in Table 5.1 verify that our method significantly outperforms all state-of-the-art methods with at least 16.25% marginal in reducing mae and25.96%and49.81%in reducing95%mae and90%mae, respectively, using the even split setting. Similar performance improvement is observed for the leave- one-sequence-out setting. Figure 5.3 illustrates the results for each sequence separately and compares to our direct competitors KRF and AKRF. Evidently, the proposed method performs better in most of the sequences. By exploiting target locality, our method can mitigate the suffering from the flipping errors (≈ 180◦,e.g. the sequences 3, 15 and 16). By adopting the cumulative scores introduced in Genget al. [?], we visualise the results generated by our method (HSSR) and the other state-of-the-arts methods in Figure 5.4. It is shown that the HSSR approach significantly improves the accuracy with both splitting methods.

Slice Construction – Given a fixed slice size of45◦, Table 5.2 compares varying slice step values. The results indicate that the combination of all different step sizes performs best. Besides the combination strategy, the best results for even splitting were achieved using the1/2×slice step (half overlap) while for the leave-one-sequence-out setting the 1/4×step performed the best (three quarters overlap). Evidently, all overlapping strate-

Figure 5.3. Comparison of HSSR (ours), KRF and AKRF on mae (100%) for each sequence (leave-one-sequence-out).

Table 5.2. Evaluation on the circular slice step size proportional to the45◦ slice size.

even split leave-one-sequence-out Slice step mae (90%) mae (95%) mae (100%) mae (90%) mae (95%) mae (100%)

1×slice 8.04o _16.78o _24.88o _10.75o _13.54o _17.34o 3/4× 4.30o _12.69o _20.98o _10.21o _12.65o _16.14o 1/2× 4.11o _12.14o _20.45o _8.98o _11.60o _15.52o 1/4× 4.97o _13.74o _21.98o _8.81o _11.10o _14.63o

Combination 3.88o _11.98o _20.30o _8.31o _10.90o _14.24o

gies (i.e. 3/4×, 1/2×, 1/4×, and combination) show higher accuracy than the non- overlapping slice spacing which verifies our sliding slice strategy.

Evaluation on Varying Size of Sliding Window – We also evaluated different size of

Table 5.3. Evaluation on the circular slice size.

even split leave-one-sequence-out

Slice size mae (90%) mae (95%) mae (100%) mae (90%) mae (95%) mae (100%)

180◦ 4.51o _12.66o _20.93o _9.65o _12.36o _16.90o

90◦ 5.12o _14.00o _22.24o _9.24o _11.83o _16.21o

45◦ _4.11o _12.14o _20.45o _8.98o _11.60o _15.52o

the slices by half slice step in our method and the results are shown in Table 5.3. Among them,45◦achieved the best results. Notably, HSSR is superior to state-of-the-art with all slice sizes (cf. Table 5.1).

(a) even split (b) leave-one-sequence-out

Figure 5.4.Comparative cumulative scores. The higher the better.

5.7 Summary

We proposed a two-step, coarse-to-fine, hierarchical approach for visual regression where one-vs-all SVM classifiers are used to find coarse regression groups and group-specific regressors are used to provide an accurate estimate. Our application was vehicle viewing angle estimation where axial symmetry brings additional challenges for regression. In this case, we formed the groups as circular overlapping slices and demonstrated how this approach leads to state-of-the-art accuracy on a public benchmark dataset. In our future work, we will extend the novel approach to other similar circular visual regression problems and study the effect of non-uniform slices and to further improve the approach.

6 CONCLUSION

The goal of car pose estimation is to predict the pose of vehicle in a given still image or video frame. Such a problem has its significance in intelligent transportation system and visual surveillance. Generally, this problem is considered as a regression problem duo to its continuously and cumulatively changing target space. However, direct regression mapping from the low-level imagery feature space to the target space is challeng- ing. Visual variation of different car types and the changing of light conditions can lead to inconsistency in such mapping relationships. A number of regression methods have been proposed to address this problem. Among these methods, K-clustering Regression Forests (KRF) has achieved outstanding performance by employing a more flexible target clustering based node splitting algorithm instead of selecting from pre-defined standard binary splitting rules. However, global regression lacks of power to describe the continuously changing of labels. Some problem-specific characteristics are ignored in existing regression methods for car pose estimation.

The two novel regression frameworks proposed in this thesis are both in a hierarchical manner and classification is adopted as first layer in order to make following regressors more robust. Part-Aware Target Coding (PATC) is a novel concept that SVM classifiers are employed to map the low-level imagery feature onto the proposed part-aware target codes. The predicted probabilities of the pose-sensitive parts combined with low-level imagery features as mid-level input features are used to train a strong regressor. Hierarchical Sliding Slice Regression (HSSR), which is inspired by previous works, is a coarse-to-fine framework. Coarse classifiers approximate a rough range known as a slice in the label space. A fine regressor, which is specifically trained for the slice, estimates the viewing angle. In this way, each regressor is more powerful to represent the label space than direct regression mapping. Moreover, the overlap defined by slice steps makes this regression framework perform more robustly.

Finally, both of the proposed frameworks have achieved better performance than the state- of-the-art methods applied on the benchmarking EPFL multi-view car dataset. While frameworks are designed specifically for car pose estimation, the novel concepts introduced in this thesis, part-aware coding and sliding slice, are not limited to applications on vehicle objects. Other regression problems that have similar characteristics, part- awareness or circular label space, can utilise the proposed frameworks to solve the in- consistent input-output problem. HSSR, as a part of the thesis has been submitted to IEEE Transactions on Intelligent Transportation Systems [42].

REFERENCES

[1] Matthias Dantone, Juergen Gall, Christian Leistner, and Luc Van Gool. Body parts dependent joint regressors for human pose estimation in still images. IEEE Trans- actions on Pattern Analysis and Machine Intelligence, 36(11):2131–2143, 2014.

[2] Jacob Foytik and Vijayan K Asari. A two-layer framework for piecewise linear manifold-based head pose estimation. International journal of computer vision, 101(2):270–287, 2013.

[3] Xin Geng and Yu Xia. Head pose estimation based on multivariate label distribution. InCVPR, 2014.

[4] Kota Hara and Rama Chellappa. Growing regression forests by classification: Appli- cations to object pose estimation. InComputer Vision–ECCV 2014, pages 552–567. Springer, 2014.

[5] Heikki Huttunen, Ke Chen, Abhishek Thakur, Artus Krohn-Grimberghe, Oguzhan Gencoglu, Xingyang Ni, Mohammed Al-Musawi, Lei Xu, and Hendrik Jacob Van Veen. Computer vision for head pose estimation: Review of a competition. InScandinavian Conference on Image Analysis, pages 65–75. Springer, 2015.

[6] Johan Nilsson, Jonas Fredriksson, and Anders CE Odblom. Reliable vehicle pose estimation using vision and a single-track model. IEEE Transaction on Intelligent Transportation Systems, 15(6):2630 – 2643, 2014.

[7] Angel Domingo Sappa, Fadi Dornaika, Daniel Ponsa, David Gerónimo, and Antonio López. An efficient approach to onboard stereo vision system pose estimation.IEEE Transaction on Intelligent Transportation Systems, 9(3):476 – 490, 2008.

[8] Hongsheng He, Zhenzhou Shao, and Jindong Tan. Recognition of car makes and models from a single traffic-camera image. IEEE Transaction on Intelligent Trans- portation Systems, 16(6):3182 – 3192, 2015.

[9] Chi-Chen Raxle Wang and Jenn-Jier James Lien. Automatic vehicle detection using local featuresâ ˘AˇTa statistical approach.IEEE Transaction on Intelligent Transporta- tion Systems, 9(1):83 – 96, 2008.

[10] Tongtong Chen, Ruili Wang, Bin Dai, Daxue Liu, and Jinze Song. Likelihood- field-model-based dynamic vehicle detection and tracking for self-driving. IEEE Transaction on Intelligent Transportation Systems, pp(99):1 – 17, 2016.

[11] Shunguang Wu, Stephen Decker, Peng Chang, Theodore Camus, and Jayan Eledath. Collision sensing by stereo vision and radar sensor fusion. IEEE Transaction on Intelligent Transportation Systems, 10(4):606 – 614, 2009.

[12] Mustafa Ozuysal, Vincent Lepetit, and Pascal Fua. Pose estimation for category specific multiview object localization. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 778–785. IEEE, 2009.

[13] Marwan Torki and Ahmed Elgammal. Regression from local features for viewpoint and pose estimation. InIEEE International Conference on Computer Vision (ICCV), pages 2603–2610. IEEE, 2011.

[14] Michele Fenzi, Laura Leal-Taixé, Bodo Rosenhahn, and Jörn Ostermann. Class gen- erative models based on feature regression for pose estimation of object categories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 755–762. IEEE, 2013.

[15] Patricia Cronin, Frances Ryan, and Michael Coughlan. Undertaking a literature review: a step-by-step approach. British journal of nursing, 17(1):38, 2008.

[16] Renee Elio, Jim Hoover, Ioanis Nikolaidis, Mohammad Salavatipour, Lorna Stewart, and Ken Wong. About computing science research methodology, 2011.

[17] Daniel Glasner, Meirav Galun, Sharon Alpert, Ronen Basri, and Gregory Shakhnarovich. Viewpoint-aware object detection and pose estimation. In 2011 International Conference on Computer Vision, pages 1275–1282. IEEE, 2011.

[18] Guodong Guo, Yun Fu, C.R. Dyer, and T.S. Huang. Head pose estimation: Classifi- cation or regression? InPattern Recognition, 2008. ICPR 2008. 19th International Conference on, pages 1–4, Dec 2008.

[19] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.

[20] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

[21] Anna Bosch, Andrew Zisserman, and Xavier Munoz. Image classification using random forests and ferns. In2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007.

[22] Min Sun, Pushmeet Kohli, and Jamie Shotton. Conditional regression forests for human pose estimation. InIEEE Conference on Computer Vision and Pattern Recog- nition (CVPR), pages 3394–3401. IEEE, 2012.

[23] Matthias Dantone, Juergen Gall, Gabriele Fanelli, and Luc Van Gool. Real-time facial feature detection using conditional regression forests. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2578–2585. IEEE, 2012.

[24] Gabriele Fanelli, Thibaut Weise, Juergen Gall, and Luc Van Gool. Real time head pose estimation from consumer depth cameras. InPR. 2011.

[25] Wei-Yin Loh. Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1):14–23, 2011.

[26] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893. IEEE, 2005.

[27] Meng Ding and Guoliang Fan. Articulated and generalized gaussian kernel cor- relation for human pose estimation. IEEE Transactions on Image Processing, 25(2):776–789, 2016.

[28] Seong G Kong and Ralph Oyini Mbouna. Head pose estimation from a 2d face image using 3d face morphing with depth parameters. IEEE Transactions on Image Processing, 24(6):1801–1808, 2015.

[29] Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

[30] Ke Chen and Zhaoxiang Zhang. Pedestrian counting with back-propagated information and target drift remedy. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2016.

[31] Rangasami Laksminarayana Kashyap. A bayesian comparison of different classes of dynamic models using empirical data. IEEE Transactions on Automatic Control, 22(5):715–727, 1977.

[32] Gideon Schwarz et al. Estimating the dimension of a model. The annals of statistics, 6(2):461–464, 1978.

[33] Roman Rosipal and Leonard J Trejo. Kernel partial least squares regression in repro- ducing kernel hilbert space. The Journal of Machine Learning Research, 2:97–123, 2002.

[34] Kuan-Hsien Liu, Shuicheng Yan, and C-C Jay Kuo. Age estimation via grouping and decision fusion. IEEE Transactions on Information Forensics and Security, 10(11):2408–2423, 2015.

[35] Lihi Zelnik-Manor and Pietro Perona. Self-tuning spectral clustering. InAdvances in neural information processing systems, pages 1601–1608, 2004.

[36] Gabriele Fanelli, Matthias Dantone, Juergen Gall, Andrea Fossati, and Luc Van Gool. Random forests for real time 3d face analysis. International Journal of Computer Vision, 101(3):437–458, 2013.

[37] Chen Huang, Xiaoqing Ding, and Chi Fang. Head pose estimation based on random forests for multiclass classification. In 20th International Conference on Pattern Recognition (ICPR), pages 934–937. IEEE, 2010.

[38] Alfred DeMaris. A tutorial in logistic regression. Journal of Marriage and the Family, pages 956–968, 1995.

[39] Ke Chen, Shaogang Gong, and Tao Xiang. Human pose estimation using structural support vector machines. In2011 IEEE International Conference on Computer Vi- sion Workshops (ICCV Workshops), pages 846–851. IEEE, 2011.

[40] Javier Orozco, Shaogang Gong, and Tao Xiang. Head pose classification in crowded scenes. InProceedings of the British Machine Vision Conference (BMVC), volume 1, page 3, 2009.

[41] Rafael Muñoz-Salinas, Enrique Yeguas-Bolivar, Alessandro Saffiotti, and R Medina-Carnicer. Multi-camera head pose estimation. Machine Vision and Ap- plications, 23(3):479–490, 2012.

[42] Dan Yang, Yanlin Qian, Ke Chen, Eleni Berki, and Joni-Kristian Kämäräinen. Hier- archical sliding slice regression for vehicle viewing angle estimation. IEEE Trans- action on Intelligent Transportation Systems, accepted with changes.

In document Hierarchical regression learning for car pose estimation (Page 45-54)