4.3 Robust Random Forest with Pairwise Constraints
4.3.2 Node Splitting with Pairwise Constraints
The conventional Random Forest described above takes only labeled data as input, and the limited amount of supervised data in our problem would lead to an obvious performance drop (X. Liu et al., 2013). To avoid this problem, we extend the current splitting strategy to take pairwise constraint data into training. According to Equation 4.2, the target of a split is to maximize the Gini-based split criterion at each node, which requires the proportion of well-labeled data belonging to each category. However, for partially labeled data, samples come in pairs, so when a Must-link or Cannot-link relation is broken by splitting, the Gini index calculated by $\sum_{i \neq j} p_i p_j$ does not work.
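For reference, the Gini index over class proportions can be computed as in the minimal sketch below. It uses the standard identity that the sum of $p_i p_j$ over $i \neq j$ equals $1 - \sum_i p_i^2$; the function name `gini` is hypothetical and not part of the original method.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels.

    Uses the identity sum_{i != j} p_i * p_j = 1 - sum_i p_i**2,
    where p_i is the proportion of samples with class i.
    """
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: an empty node is treated as pure
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["a", "a", "b", "b"]))  # 0.5
print(gini(["a", "a", "a"]))       # 0.0
```

This quantity needs a class label for every sample, which is exactly what pairwise constraints do not provide.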
In this situation, we introduce a new measure for both Must-link pairs $M$ and Cannot-link pairs $C$. In the pairwise constraint setting, instead of directly counting the samples that fall on the correct side of each node, we compute the ratio of broken links among all Must-link and Cannot-link pairs. The following equations show how this ratio is calculated.
Equation 4.6 counts the samples in the Must-link set $M$ whose partner falls into the same node, and Equation 4.7 counts all samples from the Must-link set $M$ that fall into the node $R$.
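$$N_M(R) = 2 \cdot \left|\{(x_\alpha, x_\beta) \mid x_\alpha \in R \wedge x_\beta \in R \wedge (x_\alpha, x_\beta) \in M\}\right| \qquad (4.6)$$

$$N_M^{\mathrm{total}}(R) = \left|\{x_\alpha \mid x_\alpha \in R \wedge (x_\alpha, x_\beta) \in M\}\right| + \left|\{x_\beta \mid x_\beta \in R \wedge (x_\alpha, x_\beta) \in M\}\right| \qquad (4.7)$$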
where $R$ denotes the current tree node, and $\alpha$ and $\beta$ index the two samples of a pairwise constraint. The count in Equation 4.6 is doubled because each pair contains two samples. Similarly to Must-link, we obtain the corresponding quantities for the Cannot-link set $C$ with Equations 4.8 and 4.9:
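$$N_C(R) = 2 \cdot \left|\{(x_\alpha, x_\beta) \mid x_\alpha \in R \wedge x_\beta \in R \wedge (x_\alpha, x_\beta) \in C\}\right| \qquad (4.8)$$

$$N_C^{\mathrm{total}}(R) = \left|\{x_\alpha \mid x_\alpha \in R \wedge (x_\alpha, x_\beta) \in C\}\right| + \left|\{x_\beta \mid x_\beta \in R \wedge (x_\alpha, x_\beta) \in C\}\right| \qquad (4.9)$$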
Equations 4.6 to 4.9 give, for each constraint set, the number of pairwise constraint samples whose partner falls into the same node, together with the total number of Must-link and Cannot-link samples in the node. This information is used below to calculate the ratio of successful splits for pairwise constraints.
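The counts in Equations 4.6 to 4.9 can be sketched as follows. This is a minimal illustration, assuming a node is represented as a set of sample ids and a constraint set as a list of id pairs; the name `pair_counts` and this data layout are assumptions, not taken from the original implementation.

```python
def pair_counts(node, pairs):
    """Return (N, N_total) for one constraint set at a node R.

    node  : set of sample ids currently in the node (assumed layout)
    pairs : iterable of (a, b) id pairs, either Must-link or Cannot-link
    """
    node = set(node)
    both_inside = 0  # pairs with both endpoints in the node
    endpoints = 0    # endpoints of constrained pairs that fall in the node
    for a, b in pairs:
        in_a, in_b = a in node, b in node
        endpoints += in_a + in_b
        if in_a and in_b:
            both_inside += 1
    # Each intact pair contributes two samples, hence the factor of 2.
    return 2 * both_inside, endpoints

must_link = [(0, 1), (2, 3), (4, 5)]
node = {0, 1, 2, 4}
print(pair_counts(node, must_link))  # (2, 4)
```

In the example, only the pair (0, 1) lies entirely inside the node, while samples 2 and 4 have lost their partners, so two of the four constrained samples in the node still have an intact link.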
Based on $N$ and $N^{\mathrm{total}}$ defined above, we propose two estimation functions as follows:

$$E_M = -\log \frac{N_M(R)}{N_M^{\mathrm{total}}(R)} \qquad (4.10)$$

$$E_C = -\log \frac{N_C^{\mathrm{total}}(R) - N_C(R)}{N_C^{\mathrm{total}}(R)} \qquad (4.11)$$
Equations 4.10 and 4.11 are the estimation functions for the Must-link and Cannot-link sets, respectively. Both are built on the ratio of successfully split samples to the total number of samples. In addition, to make the measure more robust to noisy data, we apply the log function, which makes its derivative smaller when the number of successes is close to the total number. There are other ways to define the estimation function, but in our experiments this was the simplest and most efficient solution.
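A minimal sketch of the two estimation functions is given below. The text does not say how a zero ratio or an empty constraint set is handled, so the clipping constant `EPS` and the convention of returning 0 when no constrained samples are present are assumptions made only to keep the example well defined.

```python
import math

EPS = 1e-9  # assumption: clip the ratio away from 0 so the log stays finite

def e_must(n, n_total):
    """Eq. 4.10: -log of the fraction of Must-link samples still paired."""
    if n_total == 0:
        return 0.0  # assumption: no constrained samples, no penalty
    return -math.log(max(n / n_total, EPS))

def e_cannot(n, n_total):
    """Eq. 4.11: -log of the fraction of Cannot-link samples already separated."""
    if n_total == 0:
        return 0.0  # assumption: no constrained samples, no penalty
    return -math.log(max((n_total - n) / n_total, EPS))

print(round(e_must(2, 4), 3))    # 0.693
print(round(e_cannot(2, 4), 3))  # 0.693
```

Both functions are zero when every constraint is satisfied and grow as more constraints are violated.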
To use the estimation functions proposed above for node splitting, we follow the idea of the original Random Forest discussed in Equation 4.2 and propose a similar split criterion for pairwise constraint data, which evaluates the tree before and after splitting:
$$\begin{aligned}
\Delta E(R) = {} & E_M(R) - \frac{|R^M_l|}{|R^M|} E_M(R^M_l) - \frac{|R^M_r|}{|R^M|} E_M(R^M_r) \\
& + E_C(R) - \frac{|R^C_l|}{|R^C|} E_C(R^C_l) - \frac{|R^C_r|}{|R^C|} E_C(R^C_r) \qquad (4.12)
\end{aligned}$$
The new target is then to find the split that maximizes Equation 4.12, which contains both the Must-link and the Cannot-link estimation.
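The pairwise split criterion can be sketched as below. The text does not define the weights in Equation 4.12 explicitly, so this example assumes that the size of the Must-link (or Cannot-link) part of a node is the number of constrained samples it contains, i.e. the total count from Equations 4.7 and 4.9. The helper names, the set-based node representation, and the clipping constant `EPS` are likewise assumptions.

```python
import math

EPS = 1e-9  # assumption: clip ratios away from 0 so the log stays finite

def counts(node, pairs):
    """(N, N_total) as in Eqs. 4.6-4.9 for a node given as a set of ids."""
    both = sum(1 for a, b in pairs if a in node and b in node)
    total = sum((a in node) + (b in node) for a, b in pairs)
    return 2 * both, total

def e_must(node, pairs):
    n, total = counts(node, pairs)
    return 0.0 if total == 0 else -math.log(max(n / total, EPS))

def e_cannot(node, pairs):
    n, total = counts(node, pairs)
    return 0.0 if total == 0 else -math.log(max((total - n) / total, EPS))

def gain(e_fn, parent, left, right, pairs):
    """Parent estimate minus the size-weighted estimates of the children."""
    _, total = counts(parent, pairs)
    if total == 0:
        return 0.0
    _, total_l = counts(left, pairs)
    _, total_r = counts(right, pairs)
    return (e_fn(parent, pairs)
            - total_l / total * e_fn(left, pairs)
            - total_r / total * e_fn(right, pairs))

def delta_e(parent, left, right, must, cannot):
    """Sum of the Must-link and Cannot-link gains, following Eq. 4.12."""
    return (gain(e_must, parent, left, right, must)
            + gain(e_cannot, parent, left, right, cannot))

parent = {0, 1, 2, 3}
must, cannot = [(0, 1)], [(2, 3)]
good = delta_e(parent, {0, 1, 2}, {3}, must, cannot)  # respects both links
bad = delta_e(parent, {0, 2, 3}, {1}, must, cannot)   # violates both links
print(good > bad)  # True
```

In this toy case, the split that keeps the Must-link pair together and separates the Cannot-link pair scores higher than the split that does the opposite.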
In addition, our algorithm considers the more complex case in which the pairwise data contain a lot of noise. As tree construction approaches the leaf nodes, the total numbers of pairwise samples $N_M^{\mathrm{total}}(R)$ and $N_C^{\mathrm{total}}(R)$ become smaller, and Equations 4.10 and 4.11 become too sensitive to noisy data. To avoid this problem, we introduce a combined split strategy as follows:

$$\Delta C(R) = \begin{cases} \Delta G(R) + \alpha \, \Delta E(R), & |L| < |M| + |N| \\ \Delta G(R), & \text{otherwise} \end{cases} \qquad (4.13)$$
where $|L|$, $|M|$, and $|N|$ refer to the number of well-labeled, Must-link, and Cannot-link data, respectively, and $\alpha$ is the learning rate for pairwise data. In practice, we usually choose a small value for $\alpha$, so that $\Delta E(R)$ has only a limited influence on tree construction.
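The combined strategy of Equation 4.13 reduces to a simple switch, sketched below. The function name `combined_gain` and the default `alpha=0.1` are hypothetical; the text only states that a small value is used.

```python
def combined_gain(delta_g, delta_e, n_labeled, n_must, n_cannot, alpha=0.1):
    """Eq. 4.13: add the weighted pairwise gain only when labeled data are scarce.

    delta_g   : Gini-based gain of the candidate split
    delta_e   : pairwise-constraint gain of the candidate split
    n_labeled : |L|, number of well-labeled data
    n_must    : |M|, number of Must-link data
    n_cannot  : |N|, number of Cannot-link data
    alpha     : weight of the pairwise term (0.1 is an illustrative value)
    """
    if n_labeled < n_must + n_cannot:
        return delta_g + alpha * delta_e
    return delta_g

print(combined_gain(0.3, 2.0, n_labeled=10, n_must=20, n_cannot=20))   # 0.5
print(combined_gain(0.3, 2.0, n_labeled=100, n_must=20, n_cannot=20))  # 0.3
```

With few labeled samples the pairwise term contributes to the score; once labeled data outnumber the constrained data, the criterion falls back to the Gini-based gain alone.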
In this paper, we assume that the partially labeled data are much more numerous than the well-labeled data but less accurate. To calculate Equation 4.13 more efficiently, pairs with a broken Must-link or a satisfied Cannot-link are removed from the child nodes, as shown in Equations 4.14 and 4.15. This data filtering strategy reduces the total number of pairwise data in each child node and makes the algorithm more efficient.
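$$M_{\mathrm{new}}(R) = \{(x_\alpha, x_\beta) \mid x_\alpha \in R \wedge x_\beta \in R \wedge (x_\alpha, x_\beta) \in M\} \qquad (4.14)$$

$$C_{\mathrm{new}}(R) = \{(x_\alpha, x_\beta) \mid x_\alpha \in R \wedge x_\beta \in R \wedge (x_\alpha, x_\beta) \in C\} \qquad (4.15)$$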
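The filtering in Equations 4.14 and 4.15 keeps only the pairs whose two samples both reach the child node. A minimal sketch, assuming nodes are sets of sample ids and constraints are lists of id pairs (the name `filter_pairs` is hypothetical):

```python
def filter_pairs(child, pairs):
    """Keep only the pairs with both samples in the child node (Eqs. 4.14, 4.15)."""
    child = set(child)
    return [(a, b) for a, b in pairs if a in child and b in child]

must_link = [(0, 1), (2, 3)]
cannot_link = [(1, 2), (4, 5)]
left_child = {0, 1, 2}

print(filter_pairs(left_child, must_link))    # [(0, 1)]
print(filter_pairs(left_child, cannot_link))  # [(1, 2)]
```

The same rule serves both sets: a Must-link pair that was split apart is already broken, and a Cannot-link pair that was split apart is already satisfied, so neither needs to be tracked further down the tree.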
The tree construction procedure is shown in Algorithm 1. It illustrates how our pairwise constraint node splitting can be merged into a normal Random Forest. When the split strategy is set to $\Delta C(R) = \Delta G(R)$, it works as a normal Random Forest; otherwise it works based on pairwise constraint information.