A Proportion Learning Algorithms with Density Peaks


Procedia Computer Science 91 ( 2016 ) 841 – 846

Peer-review under responsibility of the Organizing Committee of ITQM 2016 doi: 10.1016/j.procs.2016.07.092


Information Technology and Quantitative Management (ITQM 2016)


Limeng Cui^{a,b}, Zhiquan Qi^{b,∗}, Fan Meng^{b,c}

^{a} School of Computer and Control Engineering, UCAS, Beijing, China

^{b} Research Center on Fictitious Economy & Data Science, CAS, Beijing, China

^{c} School of Economics and Management, UCAS, Beijing, China

Abstract

As a powerful tool for weakly labeled learning, proportion learning has drawn much attention in recent years. In this problem, instances are grouped into bags and only the proportion of labels in each bag is known. It has broad applications, such as video event detection and the prediction of political elections. However, proportion learning remains challenging because existing methods are sensitive to the distribution of the data and rely on a complex iterative optimization procedure. To overcome these drawbacks, we propose a novel proportion learning method based on the distribution information of the data. We demonstrate the power of our algorithm on several test cases. In addition, our method is fast and robust to the shape of the data.

© 2016 The Authors. Published by Elsevier B.V.

Selection and/or peer-review under responsibility of ITQM2016.

Keywords: proportion learning; clustering; density peaks

1. Introduction

The problem of proportion learning has drawn much attention in recent years [1, 2, 3]. In this scenario, the instances are grouped into different bags. Although the label proportion of each bag is given, the label of each instance is unknown. The main purpose is to learn a model and find the label of each instance. As a powerful tool for weakly labeled learning, it has already been applied successfully in many fields, such as the prediction of political elections [1], visual attribute modeling [4], and video event detection [5].

This learning problem has received much attention because obtaining the labels of individual observations is not always possible. For example, whether a patient is infected should remain confidential between the patient and the treating physician. At the same time, scientists need data to predict the outbreak of a new type of influenza. As a result, storing only the label proportions of different groups may be a legally advisable approach.

Existing methods have two main drawbacks. On the one hand, these methods are sensitive to the distribution of the data. To be specific, in some methods [6] that are based on SVMs [7, 8, 9], the selection of the kernel function depends on the distribution of the data (linear or non-linear). Moreover, the parameters need to be manually tuned to achieve the best performance. On the other hand, the optimization procedure needs to be solved iteratively until the optimal solution is found, which is very time-consuming.

∗ Corresponding author. E-mail address: [email protected].

To overcome the two shortcomings mentioned above, we propose a novel proportion learning method based on the distribution information of the data. It utilizes density peaks to infer the prior distribution of the data. Specifically, we first find the density peaks of the data and initialize the data points in groups. After that, we assign class labels to the groups by minimizing the overall bag error. The label of each individual point can also be modified in this step. With the density of each point and the distances between points, we can predict the labels of new data. Experimental results demonstrate the precision and robustness of our method.

2. Related Work

2.1. Proportion Learning

Inverse Calibration (InvCal): Rueping [10] proposes a proportion learning framework that treats the mean of each bag as a super-instance, which is assumed to have a soft label corresponding to the label proportion. The drawback of InvCal is that the super-instances do not adequately represent the properties of the bags.

∝SVM: Yu et al. [6] propose a large-margin framework, which iteratively optimizes over the unknown instance labels and the known label proportions until the objective function converges.

2.2. Clustering

Numerous papers have been written on clustering over the past few decades [11, 12, 13, 14, 15, 16]. Here we list some of the representative methods.

K-means: K-means defines the center of a cluster as the mean value of the points within the cluster. Each point is assigned to the nearest center. The algorithm iteratively reduces the within-cluster variation until the clusters formed in the current round are the same as those formed in the previous round. However, because a data point is always assigned to the nearest center, K-means is not able to detect nonspherical clusters.

DBSCAN [17]: Density-based spatial clustering of applications with noise (DBSCAN) finds core objects, which are those that have dense neighborhoods. It connects core objects and their neighborhoods to form dense regions as clusters. For each point, a value ε must be set manually to specify the radius of its neighborhood. However, choosing an appropriate ε can be non-trivial.

Cluster-DP [18]: The algorithm is based on the assumptions that cluster centers are surrounded by neighbors with lower local density and that they are at a relatively large distance from any points with a higher local density. The local density $\rho_i$ and the distance $\delta_i$ from the points with higher density are calculated for each data point $i$. It is able to deal with nonspherical clusters and to automatically find the correct number of clusters.

3. Proportion Learning with Density Peaks

3.1. Problem Setting

Suppose we are given a data set $\{x_i, y_i^*\}_{i=1}^{N}$, where $x_i \in X$ and $y_i^* \in \{-1, +1\}$ denotes the unknown ground-truth label of $x_i$. The data set is grouped into $K$ bags $\{x_i \mid i \in B_k\}_{k=1}^{K}$, where $B_k$ is the set of indices of the points in the $k$-th bag. In this paper, we assume that the bags are disjoint.

The label proportion of the $k$-th bag can be defined as:

$$P_k := \frac{|\{i \mid i \in B_k,\ y_i^* = +1\}|}{|B_k|} \tag{1}$$

Assume the instance labels are explicitly modeled as $\{y_i\}_{i=1}^{N}$, where $y_i \in \{-1, +1\}$. The label proportion of the $k$-th bag can be modeled as:

$$p_k(y) = \frac{|\{i \mid i \in B_k,\ y_i = +1\}|}{|B_k|} \tag{2}$$

The bag error on the data set is defined as:

$$BE = \sum_{k=1}^{K} |p_k(y) - P_k| \tag{3}$$

Also, the bag error on the $k$-th bag is defined as:

$$BE_k = p_k(y) - P_k \tag{4}$$

Our goal is to find a proper mapping $\Pi : X \to \{-1, +1\}$ that minimizes $BE$ and the magnitude of each $BE_k$, $k = 1, 2, \ldots, K$.
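As a minimal sketch of the quantities in Eqs. (2)-(4), the following Python snippet computes the modeled proportion of each bag, the signed per-bag error, and the overall bag error. The function names and the six-point toy data are illustrative assumptions, not part of the original method description.

```python
def bag_proportion(labels, bag):
    """Fraction of points in `bag` (a list of indices) labeled +1, as in Eq. (2)."""
    return sum(1 for i in bag if labels[i] == +1) / len(bag)


def bag_errors(labels, bags, true_props):
    """Signed per-bag errors BE_k = p_k(y) - P_k, as in Eq. (4)."""
    return [bag_proportion(labels, bag) - P for bag, P in zip(bags, true_props)]


def total_bag_error(labels, bags, true_props):
    """Overall bag error BE = sum_k |p_k(y) - P_k|, as in Eq. (3)."""
    return sum(abs(e) for e in bag_errors(labels, bags, true_props))


# Hypothetical toy data: six points in two disjoint bags of three.
bags = [[0, 1, 2], [3, 4, 5]]
true_props = [2 / 3, 1 / 3]        # known proportions P_k
y = [+1, +1, +1, -1, -1, -1]       # a candidate labeling

print(bag_errors(y, bags, true_props))       # one signed error per bag
print(total_bag_error(y, bags, true_props))  # sum of absolute errors
```

In this toy case the first bag holds one +1 label too many and the second bag one too few, so the two signed errors have opposite signs while the overall error adds their magnitudes.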

3.2. Finding Density Peaks

We assume that data from the same class are distributed closely. So, to explore the distribution of the data points, we select the most representative point for each class. We assume that these points are surrounded by neighbors with lower local density and that they are far from points with a higher local density. Thus, two attributes are calculated for each data point $i$ to find the representative points: the local density $\rho_i$ of point $i$ and its distance $\delta_i$ from the points with higher density. The local density $\rho_i$ of point $i$ is defined as:

$$\rho_i = \sum_{j} \chi(d_{ij} - d_c) \tag{5}$$

where $\chi(x) = 1$ if $x < 0$ and $\chi(x) = 0$ otherwise, $d_{ij}$ is the distance between points $i$ and $j$, and $d_c$ is a threshold distance. Intuitively, $\rho_i$ is equal to the number of points that are closer than $d_c$ to point $i$.

$\delta_i$ is measured by computing the minimum distance between point $i$ and any other point with higher density:

$$\delta_i = \min_{j:\ \rho_j > \rho_i} d_{ij} \tag{6}$$

For the point with the highest density, we set $\delta_i = \max_j d_{ij}$. Thus, only the points with large values of both $\rho_i$ and $\delta_i$ are recognized as centers. Points near the density peaks have a $\rho$ and a $\delta$ lower than the centers. Noise points are isolated points, with high $\delta$ and low $\rho$. See the example in Fig. 1. Fig. 1(a) shows a 2D dataset that consists of 200 points. The density maxima are the two red points, which we denote as the density peaks $x_{c_1}$ and $x_{c_2}$ of the two clusters. Fig. 1(b) shows the relationship between $\delta_i$ and $\rho_i$ for each point. We call this representation the decision graph, in which points with high $\delta$ and $\rho$ are chosen to be the centers.
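The two attributes of Eqs. (5) and (6) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: Euclidean distance is assumed for $d_{ij}$, a point is not counted in its own density, every point tied for the highest density receives $\delta_i = \max_j d_{ij}$, and the helper `pick_centers`, which ranks points by the product $\rho_i \delta_i$, is a hypothetical stand-in for reading the centers off the decision graph.

```python
import math


def pairwise_distances(points):
    """Euclidean distance matrix d_ij (assumed metric)."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def local_density(dist, d_c):
    """rho_i: number of other points closer than d_c to point i (Eq. 5)."""
    n = len(dist)
    return [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]


def delta_and_parent(dist, rho):
    """delta_i: distance to the nearest point of higher density (Eq. 6).

    Also returns the index of that neighbor, or None when no point has a
    higher density (delta is then the largest distance from point i).
    """
    n = len(dist)
    delta, parent = [0.0] * n, [None] * n
    for i in range(n):
        higher = [j for j in range(n) if rho[j] > rho[i]]
        if higher:
            parent[i] = min(higher, key=lambda j: dist[i][j])
            delta[i] = dist[i][parent[i]]
        else:
            delta[i] = max(dist[i])
    return delta, parent


def pick_centers(rho, delta, n_centers=2):
    """Hypothetical selection rule: largest rho * delta first."""
    order = sorted(range(len(rho)), key=lambda i: rho[i] * delta[i], reverse=True)
    return order[:n_centers]


# Hypothetical toy data: two small clumps, each with one central point.
points = [(0, 0), (0, 1), (1, 0), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (10.5, 10.5)]
dist = pairwise_distances(points)
rho = local_density(dist, d_c=0.8)
delta, parent = delta_and_parent(dist, rho)
centers = pick_centers(rho, delta)
print(rho, parent, centers)
```

On this toy data the two central points (indices 3 and 7) have both the highest density and the largest $\delta$, so they are the ones selected as density peaks.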

Fig. 1. (a) A 2D dataset of 200 points, with axes x(1) and x(2). (b) The corresponding decision graph, with axes ρ and δ.


3.3. Learning with Label Proportion

3.3.1. Initialization

After the cluster centers have been found, each remaining point is assigned to the same cluster as its nearest neighbor of higher density. This assignment is done in a single pass, whereas other clustering algorithms iteratively optimize an objective function. The data set is separated into two clusters after the assignment (see Fig. 1(a)). Different colors correspond to different clusters. Next, we need to assign a class label to each of the two groups with respect to the proportion information. Specifically, in the first setting, we label the points in yellow, along with their center point, as class $+1$ and the remaining points as class $-1$. In the second setting, the class labels are assigned conversely. Then, the corresponding bag error $BE$ is calculated under both settings. Finally, we choose the setting with the lower value of $BE$.
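The initialization step can be sketched as follows. The densities, nearest higher-density neighbors (`parent`), and center indices are assumed to have been computed beforehand and are given here as hypothetical values; the function names are illustrative, and the sketch assumes that every non-center point has a neighbor of strictly higher density.

```python
def assign_clusters(rho, parent, centers):
    """Single pass in order of decreasing density: each non-center point
    inherits the cluster of its nearest neighbor of higher density."""
    cluster = {c: k for k, c in enumerate(centers)}
    for i in sorted(range(len(rho)), key=lambda i: -rho[i]):
        if i not in cluster:
            cluster[i] = cluster[parent[i]]
    return [cluster[i] for i in range(len(rho))]


def total_bag_error(labels, bags, true_props):
    """BE = sum over bags of |modeled proportion - given proportion|."""
    total = 0.0
    for bag, P in zip(bags, true_props):
        p = sum(1 for i in bag if labels[i] == +1) / len(bag)
        total += abs(p - P)
    return total


def choose_labeling(cluster, bags, true_props):
    """Try both cluster-to-class settings and keep the one with lower BE."""
    candidates = []
    for positive_cluster in (0, 1):
        labels = [+1 if c == positive_cluster else -1 for c in cluster]
        candidates.append((total_bag_error(labels, bags, true_props), labels))
    return min(candidates, key=lambda t: t[0])[1]


# Hypothetical inputs: eight points, two density peaks (indices 3 and 7).
rho = [1, 1, 1, 3, 1, 1, 1, 3]
parent = [3, 3, 3, None, 7, 7, 7, None]
centers = [3, 7]
bags = [[0, 1, 2, 4], [3, 5, 6, 7]]
true_props = [0.25, 0.75]            # given label proportions P_k

cluster = assign_clusters(rho, parent, centers)
labels = choose_labeling(cluster, bags, true_props)
print(cluster, labels)
```

Processing points in order of decreasing density guarantees that a point's higher-density neighbor already has a cluster when the point itself is reached, which is why a single pass suffices.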

3.3.2. Label Modiﬁcation

Now we need to modify the class labels of individual points with respect to the bag error $BE_k$ on each bag. First of all, we define the border region of each class as the set of points assigned to that class but lying within a distance $d_c$ of data points belonging to the other class. For each class, we choose the point with the highest density in its border region and denote its density by $\rho_b$, the border density. The points of the class whose density is lower than $\rho_b$ are considered the class halo.

If $BE_k > 0$, we assume that some points belonging to class $-1$ were assigned to class $+1$ in the $k$-th bag during the initialization step. We flip the labels of halo points assigned to class $+1$, in ascending order of their distance from the data points in class $-1$, until $BE_k = 0$ or no halo point remains. If $BE_k < 0$, we conduct the reverse procedure, flipping halo points from class $-1$ to class $+1$.
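A rough sketch of the halo definition and the flipping rule is given below. Several details are assumptions made for illustration: Euclidean distance is used, the densities are hypothetical hand-picked values, the target number of $+1$ labels in a bag is obtained by rounding $P_k |B_k|$, and distances to the opposite class are measured against the labels as they stand before the bag is modified.

```python
import math


def halo_mask(points, labels, rho, d_c):
    """Flag class-halo points: members of a class whose density is below
    rho_b, the highest density found in that class's border region."""
    n = len(points)
    halo = [False] * n
    for cls in (+1, -1):
        members = [i for i in range(n) if labels[i] == cls]
        others = [j for j in range(n) if labels[j] != cls]
        border = [i for i in members
                  if any(math.dist(points[i], points[j]) < d_c for j in others)]
        if not border:
            continue
        rho_b = max(rho[i] for i in border)
        for i in members:
            if rho[i] < rho_b:
                halo[i] = True
    return halo


def fix_bag(points, labels, bag, P, halo):
    """Flip halo labels inside one bag until its signed error reaches zero
    or no halo candidate is left. Returns a new label list."""
    labels = list(labels)
    target = round(P * len(bag))                  # +1 labels the bag should hold
    current = sum(1 for i in bag if labels[i] == +1)
    src = +1 if current > target else -1          # over-represented class
    others = [j for j in range(len(points)) if labels[j] == -src]
    if current == target or not others:
        return labels
    candidates = [i for i in bag if labels[i] == src and halo[i]]
    candidates.sort(key=lambda i: min(math.dist(points[i], points[j]) for j in others))
    for i in candidates:
        if current == target:
            break
        labels[i] = -src
        current -= src                            # one fewer (or one more) +1 label
    return labels


# Hypothetical toy data: six points on a line, one bag, P = 0.5.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
rho = [5, 6, 3, 2, 4, 6]                          # hand-picked densities
init_labels = [+1, +1, +1, +1, -1, -1]
halo = halo_mask(points, init_labels, rho, d_c=2.5)
new_labels = fix_bag(points, init_labels, bag=[0, 1, 2, 3, 4, 5], P=0.5, halo=halo)
print(halo, new_labels)
```

Here the bag initially holds four $+1$ labels instead of three, so $BE_k > 0$, and the single halo point of class $+1$ (index 3, the one closest to class $-1$) is flipped.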

As the bags are disjoint, the modification is conducted only once on each bag.

3.4. Predicting

The modeled labels of all points are obtained in the learning procedure. Now we have $\{x_i, y_i, \rho_i\}_{i=1}^{N}$. Suppose the testing set is $\{x_m\}_{m=1}^{M}$, which consists of $M$ data points.

For each test point $m$ in turn, we calculate its local density $\rho_m = \sum_i \chi(d_{mi} - d_c)$ and its distance $\delta_m = \min_{i:\ \rho_i > \rho_m} d_{mi}$ from the points with higher density. Then, each point in the testing set is assigned to the same class as its nearest neighbor of higher density in the training data set.
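The prediction rule can be sketched as follows, assuming Euclidean distance and hypothetical training values. The fallback to the plain nearest training point, used when no training point has a higher density than the test point, is an assumption of this sketch; the text does not specify that case.

```python
import math


def predict(train_points, train_labels, train_rho, test_points, d_c):
    """Give each test point the label of its nearest training point
    of higher density."""
    predictions = []
    for x in test_points:
        d = [math.dist(x, p) for p in train_points]
        rho_m = sum(1 for v in d if v < d_c)       # density of the test point
        higher = [i for i in range(len(train_points)) if train_rho[i] > rho_m]
        pool = higher or range(len(train_points))  # assumed fallback
        nearest = min(pool, key=lambda i: d[i])
        predictions.append(train_labels[nearest])
    return predictions


# Hypothetical training data: two clumps with learned labels and densities.
train_points = [(0, 0), (0, 1), (1, 0), (0.5, 0.5),
                (10, 10), (10, 11), (11, 10), (10.5, 10.5)]
train_labels = [-1, -1, -1, -1, +1, +1, +1, +1]
train_rho = [1, 1, 1, 3, 1, 1, 1, 3]
test_points = [(0.2, 1.2), (10.9, 9.8)]

predictions = predict(train_points, train_labels, train_rho, test_points, d_c=0.8)
print(predictions)
```

Each test point is handled independently, so the cost per prediction is one pass over the training set.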

4. Experiments

In order to evaluate our method, we compare it with InvCal [10] and ∝SVM [6]. Part of our Matlab code is built on the Cluster-DP code [18].

4.1. A Toy Experiment

To demonstrate the robustness of our method, we first show an experiment on a toy dataset. The dataset in Fig. 2 contains 200 points. We split this dataset into two bags of equal size. Different colors correspond to different classes. Different shapes indicate different bags.

From Fig. 3, we can see that our method is superior to the other methods. Both InvCal and ∝SVM confuse the two classes. In contrast, our method, which takes advantage of the data distribution, achieves near-perfect performance of 98.5%.

4.2. UCI Datasets

We compare the different methods on several datasets from the UCI repository. The details of the datasets are shown in Table 1. To test the robustness of our method, we apply 5-fold cross-validation. The training set is split randomly into bags of a fixed size: 2, 4, 8, 16, 32, or 64. The average precision of each method is calculated.
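The bag construction used in this protocol can be sketched as follows. The function names are illustrative; letting the last bag be smaller when the number of points is not a multiple of the bag size is an assumption, since the handling of the remainder is not stated.

```python
import random


def make_bags(n_points, bag_size, seed=0):
    """Randomly split indices 0..n_points-1 into disjoint bags of a fixed
    size (the last bag may be smaller; this is an assumption)."""
    idx = list(range(n_points))
    random.Random(seed).shuffle(idx)
    return [idx[s:s + bag_size] for s in range(0, n_points, bag_size)]


def label_proportions(true_labels, bags):
    """Proportion of +1 labels in each bag, the only supervision retained."""
    return [sum(1 for i in bag if true_labels[i] == +1) / len(bag) for bag in bags]


# Hypothetical example: ten labeled points split into bags of four.
true_labels = [+1, -1, +1, +1, -1, -1, +1, -1, +1, -1]
bags = make_bags(len(true_labels), bag_size=4, seed=1)
props = label_proportions(true_labels, bags)
print(bags, props)
```

After this step the instance labels of the training set are discarded and only the bags and their proportions are passed to the learner.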

Our method outperforms the alternatives in most cases. In addition, by exploiting the prior distribution of the data, the precision of our method remains largely stable as the bag size increases.


Fig. 2. Different colors correspond to different classes. Different shapes indicate different bags ($P_1 = 48\%$, $P_2 = 52\%$).

Fig. 3. Different colors correspond to different classes. Different shapes indicate different bags. (a) Result of InvCal. (b) Result of ∝SVM. (c) Result of our method.

5. Conclusion

Distribution information is of great importance, yet it has been widely ignored in recent methods. In this paper, we propose a fast proportion learning method that exploits the prior distribution information of the data. Our method builds upon density peak selection and a straightforward label assignment method. With the help of two attributes, we first find the density peaks of the data. Next, the remaining data points are separated into two groups, each point following its nearest neighbor of higher density. Finally, we assign a class label to each data point with respect to the proportion information. Experimental results show that our method performs better than the alternatives. In addition, it is robust to noise and can classify data of arbitrary shape.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant Nos. 91546201, 71331005, 71110107026, 61402429).

References

[1] N. Quadrianto, A. J. Smola, T. S. Caetano, Q. V. Le, Estimating labels from label proportions, The Journal of Machine Learning Research 10 (2009) 2349–2374.

[2] M. Stolpe, K. Morik, Learning from label proportions by optimizing cluster model selection, in: Machine Learning and Knowledge Discovery in Databases, Springer, 2011, pp. 349–364.

[3] G. Patrini, R. Nock, T. Caetano, P. Rivera, (almost) no label no cry, in: Advances in Neural Information Processing Systems, 2014, pp. 190–198.


Table 1. Datasets

Datasets   Size   Attributes   Classes
heart      270    13           2
haberman   306    3            2

Table 2. Average precision of each method for bag sizes 2 to 64.

Dataset    Method       2      4      8      16     32     64
heart      InvCal       82.37  77.01  73.27  65.88  56.03  56.08
heart      ∝SVM         80.55  78.5   78.36  80.17  71.62  73.27
heart      Our Method   79.62  78.53  79.12  78.72  75.65  74.22
haberman   InvCal       73.37  73.65  73.44  73.59  73.34  73.75
haberman   ∝SVM         73.99  72.94  70.66  64.46  60.64  61.45
haberman   Our Method   83.38  82.58  83.51  83.67  83.36  83.69
iris       InvCal       99.36  99.33  99.60  99.33  99.35  98.85
iris       ∝SVM         99.36  85.56  76.77  57.02  88.35  48.60
iris       Our Method   99.60  99.25  99.60  99.60  99.60  99.32

[4] T. Chen, F. X. Yu, J. Chen, Y. Cui, Y.-Y. Chen, S.-F. Chang, Object-based visual sentiment concept analysis and application, in: Proceedings of the ACM International Conference on Multimedia, ACM, 2014, pp. 367–376.

[5] K.-T. Lai, F. X. Yu, M.-S. Chen, S.-F. Chang, Video event detection by inferring temporal instance labels, in: Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, IEEE, 2014, pp. 2251–2258.

[6] F. X. Yu, D. Liu, S. Kumar, T. Jebara, S.-F. Chang, ∝SVM for learning with label proportions, in: Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 504–512.

[7] Z. Qi, Y. Tian, Y. Shi, Laplacian twin support vector machine for semi-supervised classification, Neural Networks 35 (2012) 46–53.

[8] Z. Qi, Y. Tian, Y. Shi, Twin support vector machine with universum data, Neural Networks 36 (2012) 112–119.

[9] Z. Qi, Y. Tian, Y. Shi, Robust twin support vector machine for pattern classification, Pattern Recognition 46 (1) (2013) 305–316.

[10] S. Rueping, SVM classifier estimation from group probabilities, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 911–918.

[11] S.-G. Lee, D.-K. Yun, Clustering categorical and numerical data: a new procedure using multidimensional scaling, International Journal of Information Technology & Decision Making 2 (01) (2003) 135–159.

[12] D. Zhou, C. J. Burges, Spectral clustering and transductive learning with multiple views, in: Proceedings of the 24th international conference on Machine learning, ACM, 2007, pp. 1159–1166.

[13] E. Elhamifar, R. Vidal, Sparse subspace clustering, in: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, 2009, pp. 2790–2797.

[14] G. Nan, S. Zhou, J. Kou, M. Li, Heuristic bivariate forecasting model of multi-attribute fuzzy time series based on fuzzy clustering, International Journal of Information Technology & Decision Making 11 (01) (2012) 167–195.

[15] Y. Marchetti, Q. Zhou, et al., Solution path clustering with adaptive concave penalty, Electronic Journal of Statistics 8 (1) (2014) 1569–1603.

[16] B. Rouba, S. N. Bahloul, A multicriteria clustering approach based on similarity indices and clustering ensemble techniques, International Journal of Information Technology & Decision Making 13 (04) (2014) 811–837.

[17] J. Sander, M. Ester, H.-P. Kriegel, X. Xu, Density-based clustering in spatial databases: The algorithm gdbscan and its applications, Data mining and knowledge discovery 2 (2) (1998) 169–194.