Effect of label noise in the complexity of classification problems


Universidade de São Paulo

2015-07


Neurocomputing, Amsterdam, v. 160, p. 108-119, Jul. 2015

http://www.producao.usp.br/handle/BDPI/50947

Downloaded from: Biblioteca Digital da Produção Intelectual - BDPI, Universidade de São Paulo



Luís P.F. Garcia (a,*), André C.P.L.F. de Carvalho (a), Ana C. Lorena (b)

(a) Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, Trabalhador São-carlense Av. 400, São Carlos, São Paulo 13560-970, Brazil

(b) Instituto de Ciência e Tecnologia, Universidade Federal de São Paulo, Talim St. 330, São José dos Campos, São Paulo 12231-280, Brazil

Article info

Article history: Received 11 March 2014; received in revised form 8 October 2014; accepted 11 October 2014; available online 11 February 2015.

Keywords: Classification; Label noise; Complexity measures; Noise filter

Abstract

Noisy data are common in real-world problems and may have several causes, like inaccuracies, distortions or contamination during data collection, storage and/or transmission. The presence of noise in data can affect the complexity of classification problems, making the discrimination of objects from different classes more difficult, and requiring more complex decision boundaries for data separation. In this paper, we investigate how noise affects the complexity of classification problems, by monitoring the sensitivity of several indices of data complexity in the presence of different label noise levels. To characterize the complexity of a classification dataset, we use geometric, statistical and structural measures extracted from data. The experimental results show that some measures are more sensitive than others to the addition of noise in a dataset. These measures can be used in the development of new preprocessing techniques for noise identification and novel label noise tolerant algorithms. We thereby show preliminary results on a new filter for noise identification, based on two of the complexity measures that were most sensitive to the presence of label noise.

1. Introduction

Noisy data can be regarded as objects that present inconsistencies in their predictive and/or target attribute values [1]. These inconsistencies can be either errors, absent information or unknown values [2]. Whereas noise needs to be identified and treated, secure data in a dataset must be preserved [3]. The term secure data usually refers to instances that are the core of the knowledge necessary to build accurate learning models.

There are various strategies and techniques in the literature to handle noisy data [4–8,3,9]. Generally, the identified noisy data are filtered and removed from the datasets. Nonetheless, it is usually hard to determine if a given instance is indeed noisy or not. Some recent proposals include designing classification techniques more tolerant and robust to noise, as surveyed in [10]. Regardless of the strategy employed to deal with noisy data, either data cleansing or the design of noise-tolerant learning algorithms, it is important to understand the effect of noise in the complexity of classification problems. This analysis can support the development of new techniques for dealing with noisy data more effectively.

This paper experimentally investigates the effect of distinct levels of label noise in the complexity of classification problems. To this end, we employ a series of statistical and geometric descriptors described in [11]. These indices account for the difficulty of a classification problem by analyzing some characteristics of the dataset and the predictive performance of some simple classification models induced using the dataset. The descriptors are grouped into three main categories: measures of overlapping between feature values, measures of class separability and measures of data geometry, topology and density. We also employ some structural indices, captured by representing the dataset through a graph structure [12]. These measures extract topological and structural properties from the graphs built [13,14].

Throughout the experiments, we identify those indices that are more sensitive to the presence of label noise and can thereby be used to support noise identiﬁcation. Afterwards, we preliminarily propose a simple preprocessing technique for label noise identiﬁcation, which uses two of the measures that were more sensitive to the presence of label noise. Nonetheless, we emphasize that the studies carried out can support the development of other data cleaning techniques and of novel noise-tolerant learning algorithms too.

The main contributions of this study can be summarized as follows:

- Show that the presence of label noise at different levels influences the complexity of a classification problem. This is performed by monitoring a group of measures able to characterize the complexity of a classification problem from different perspectives;

- Analyze a new set of features which characterizes the complexity of the problem by modeling a classification dataset through a graph structure. These measures consider distinct topological properties of the graph built from the underlying classification dataset;

- Highlight the measures that are most sensitive to label noise imputation and use some of them to propose a new preprocessing technique able to identify label noise in a dataset.

http://dx.doi.org/10.1016/j.neucom.2014.10.085

* Corresponding author. Tel.: +55 16 3373 8161; fax: +55 16 3373 9633.

E-mail addresses: [email protected], [email protected] (L.P.F. Garcia),

This paper is structured as follows. Section 2 defines label noise. Section 3 describes the measures employed in this paper to characterize the complexity of the noisy classification problems, while Section 4 presents the experimental methodology followed to evaluate the sensitivity of the complexity measures to label noise imputation. Section 5 presents and discusses the results achieved in this analysis. Section 6 describes a new filter for noise identification, based on a subset of the measures previously investigated, and presents experimental results on the performance of the technique. Finally, Section 7 concludes this paper.

2. Label noise

In supervised learning, a learning algorithm is applied to a dataset containing n pairs $(\vec{x}_i, y_i)$, where each $\vec{x}_i$ is a data point described by m predictive features and $y_i$ corresponds to the expected label of $\vec{x}_i$. In classification problems, this label corresponds to a class or a category. The learning algorithm then induces a classification model, which should be able to predict the label of new data points. The presence of noise in a dataset affects its quality and may impair the predictive performance of the classifiers induced.

When dealing with supervised classification problems, noise can be found in [15]: (a) predictive features and (b) target attribute. Noise in predictive features is introduced in one or more predictive attributes as a consequence of incorrect, absent or unknown values. On the other hand, noise in the target attribute occurs in the class labels. It can be caused by errors or subjectivity in data labeling, as well as by the use of inadequate information in the labeling process. Ultimately, noise in predictive features can lead to a wrong labeling of the data points, since they can be moved to the wrong side of the decision frontier.

The majority of the existing Machine Learning (ML) algorithms for solving classification problems try to minimize a cost function based on misclassification, assuming that the target label values in the dataset are correct. Due to the importance of the class information in supervised learning, in this paper we deal with noise in the target attribute only.

Learning in noisy environments has been largely investigated in recent decades [10]. Some authors prefer to deal with the problem while the classification model is induced, providing more robustness to the induced models. The pruning process in Decision Tree (DT) induction algorithms is an early initiative to increase their robustness to noisy data [16]. Nonetheless, if the noise level is high, the definition of the pruning degree can be challenging and can ultimately remove branches that are based on safe information too. Another example is the use of slack variables in Support Vector Machine (SVM) training [17], which allow some examples to be misclassified or to lie within the margins of separation between the classes. This introduces an additional parameter to be tuned during the SVM training: the regularization constant. This constant accounts for the amount of training examples that can be misclassified or be placed near the decision boundary induced.

Recent work addresses noise-tolerant classifiers, where a label noise model is learnt jointly with the classification model itself [18]. For this, typically, some information must be available about the label noise or its effects [10]. The learning algorithm can also be modified to embed data cleansing [19]. Other authors prefer to treat noise beforehand, in a preprocessing step. Filters are developed for this purpose, which scan the dataset for unreliable data [7,8,3]. The preprocessed dataset can then be used as input to any classification algorithm.

One of the contributions of this paper is experimentally showing how label noise affects the complexity of various classification datasets. Based on these results, we also present a new noise filtering technique. This evidences the importance of understanding the influence of label noise in classification problems. Furthermore, these results may contribute to the proposal of other filters and of novel noise-tolerant algorithms.

3. Complexity indices for describing data

Each noise-tolerant technique and cleansing filter has a distinct bias when dealing with noise. To better understand their particularities, it is important to know how noisy data affects a classification problem. According to [20], noisy data tends to increase the complexity of the classification problem. Therefore, the identification and removal of noise can simplify the geometry of the separation border between the problem classes [21].

Singh [22] recommends a technique that estimates the complexity of the classification problem using neighborhood information for the identification of outliers. Sáez et al. [23] use measures able to characterize the complexity of the classification problem to predict when a noise filter can be effectively applied to a dataset. Smith et al. [9] propose a measure to capture instance hardness, considering an instance as hard if it is misclassified by a diverse set of classification algorithms. The proposed instance hardness measure is afterwards included into the learning process in two ways. They first propose a modification of the error function minimized during neural network training, so that hard instances have a lower weight on the error function update. The second proposal is a filter that removes hard instances, which correspond to potentially noisy data. In [24], we used data complexity measures as input to classifiers which were able to successfully predict whether a given dataset contains noise or not. All of these previous works confirm the effect of noise in the complexity of the classification problem.

In this paper we evaluate in depth the effects of different noise levels in the complexity of the classification problems, by extracting different measures from the datasets and monitoring their sensitivity to noise imputation. According to Ho and Basu [11], the difficulty of a classification problem can be attributed to three main aspects: the ambiguity among the classes, the complexity of the separation between the classes, and the data sparsity and dimensionality. Usually, there is a combination of these aspects. They propose a set of geometrical and statistical descriptors able to characterize the complexity of the classification problem associated with a dataset. Originally proposed for binary classification problems [11], some of these measures were later extended to multiclass classification in [25,26]. For measures only suitable for binary classification problems, we first transform the multiclass problem into a set of binary classification subproblems by using the one-vs-all approach. The mean of the complexity values obtained in such subproblems is then used as an overall measure for the multiclass dataset.
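The one-vs-all averaging described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy `positive_fraction` measure are hypothetical, and any binary-only complexity measure with the same signature could be passed instead.

```python
from statistics import mean

def one_vs_all_mean(X, y, binary_measure):
    """Average a binary-only measure over the one-vs-all subproblems."""
    values = []
    for c in sorted(set(y)):
        # Relabel: 1 for the current class, 0 for every other class.
        y_bin = [1 if label == c else 0 for label in y]
        values.append(binary_measure(X, y_bin))
    return mean(values)

# Hypothetical stand-in for a real binary complexity measure:
# the fraction of examples that belong to the positive class.
def positive_fraction(X, y_bin):
    return sum(y_bin) / len(y_bin)

X = [[0.0], [1.0], [2.0], [3.0]]   # toy feature matrix
y = ["a", "a", "b", "c"]           # three classes -> three subproblems
overall = one_vs_all_mean(X, y, positive_fraction)
print(overall)
```

Each class yields one binary subproblem, so a dataset with C classes produces C values that are then averaged.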

The descriptors of Ho and Basu [11] can be divided into three categories:

- Measures of overlapping in the feature values: Assess the separability of the classes in a dataset. The discriminant power of each feature reflects its ambiguity level compared to the other features.

- Measures of class separability: Quantify the complexity of the decision boundaries separating the classes. They are usually based on linearity and on the distance between examples.

- Measures of geometry and topology: They extract features from the local (geometry) and global (topology) structure of the data to describe the separation between classes and data distribution.

Additionally, we characterize the classification dataset as a graph and extract some structural measures from it. Modeling a classification dataset through a graph allows capturing additional topological and structural information from a dataset. In fact, graphs are powerful tools for representing the information of relations between data [27]. Therefore, we include an additional class of complexity measures in the experiments:

- Measures of structural representation: They are measures extracted from a structural representation of the dataset using graphs, which are built taking into account the relationship among the examples.

The recent work of Smith et al. [9] also proposes a new set of measures, which are intended to understand why some instances are hard to classify. Since this type of analysis is not within the scope of this paper, these measures were not included in our experiments.

3.1. Measures of overlapping in feature values

Fisher's discriminant ratio (F1): Selects the feature that best discriminates the classes. It can be calculated by Eq. (1) for binary classification problems, and by Eq. (2) for problems with more than two classes (C classes). In these equations, m is the number of input features and $f_i$ is the i-th predictive feature:

$$F1 = \max_{i=1}^{m} \frac{(\mu_{c_1}^{f_i} - \mu_{c_2}^{f_i})^2}{(\sigma_{c_1}^{f_i})^2 + (\sigma_{c_2}^{f_i})^2} \tag{1}$$

$$F1 = \max_{i=1}^{m} \frac{\sum_{c_j=1}^{C} \sum_{c_k=c_j+1}^{C} p_{c_j} p_{c_k} (\mu_{c_j}^{f_i} - \mu_{c_k}^{f_i})^2}{\sum_{c_j=1}^{C} p_{c_j} (\sigma_{c_j}^{f_i})^2} \tag{2}$$

For continuous features, $\mu_{c_j}^{f_i}$ and $(\sigma_{c_j}^{f_i})^2$ are, respectively, the average and the variance of the feature $f_i$ within the class $c_j$. Nominal features are first mapped into numerical values and $\mu_{c_j}^{f_i}$ is their median value, while $(\sigma_{c_j}^{f_i})^2$ is the variance of a binomial distribution, as presented in Eq. (3), where $p_{\mu_{c_j}^{f_i}}$ is the median frequency and $n_{c_j}$ is the number of examples in the class $c_j$:

$$\sigma_{c_j}^{f_i} = \sqrt{p_{\mu_{c_j}^{f_i}} \left(1 - p_{\mu_{c_j}^{f_i}}\right) n_{c_j}} \tag{3}$$

High values of F1 indicate that at least one of the features in the dataset is able to linearly separate data from different classes. Low values, on the other hand, do not indicate that the problem is non-linear, but that there is no hyperplane orthogonal to one of the feature axes that separates the classes.
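As an illustration of Eq. (1), the sketch below computes F1 for a two-class dataset with continuous features. It is a minimal pure-Python version written for this text, not the DCoL code; the function name, the use of the population variance and the handling of a zero denominator are assumptions.

```python
from statistics import mean, pvariance

def fisher_ratio_f1(X, y):
    """Binary F1 (Eq. 1): largest per-feature ratio between the squared
    gap of the class means and the sum of the class variances."""
    classes = sorted(set(y))
    if len(classes) != 2:
        raise ValueError("this sketch covers the two-class case only")
    best = 0.0
    for i in range(len(X[0])):
        a = [row[i] for row, label in zip(X, y) if label == classes[0]]
        b = [row[i] for row, label in zip(X, y) if label == classes[1]]
        gap = (mean(a) - mean(b)) ** 2
        spread = pvariance(a) + pvariance(b)  # assumption: population variance
        if spread > 0:
            ratio = gap / spread
        else:
            # Assumption: constant features give inf if the means differ, else 0.
            ratio = float("inf") if gap > 0 else 0.0
        best = max(best, ratio)
    return best

# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X = [[0.0, 5.0], [1.0, 6.0], [10.0, 5.0], [11.0, 6.0]]
clean = [0, 0, 1, 1]
noisy = [0, 1, 1, 0]  # two labels swapped by hand

f1_clean = fisher_ratio_f1(X, clean)
f1_noisy = fisher_ratio_f1(X, noisy)
print(f1_clean, f1_noisy)
```

On this toy data the hand-made label swaps make the class means coincide for both features, so F1 collapses, which is the kind of sensitivity to label noise that the experiments monitor.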

Overlapping of the per-class bounding boxes (F2): This measure calculates the volume of the overlapping region on the feature values for a pair of classes. This overlapping considers the minimum and maximum values of each feature per class in the dataset. A product of the calculated values for each feature is generated. Eq. (4) illustrates F2, where $f_i$ is the feature i and $c_1$ and $c_2$ are two classes:

$$F2 = \prod_{i=1}^{m} \frac{\min(\max(f_i, c_1), \max(f_i, c_2)) - \max(\min(f_i, c_1), \min(f_i, c_2))}{\max(\max(f_i, c_1), \max(f_i, c_2)) - \min(\min(f_i, c_1), \min(f_i, c_2))} \tag{4}$$

In multiclass problems, the final result is the sum of the values calculated for the underlying binary subproblems. A low F2 value indicates that the features can discriminate the examples of distinct classes and have low overlapping.

Maximum individual feature efficiency (F3): Evaluates the individual efficacy of each feature by considering how much each feature contributes to the separation of the classes. This measure uses examples that are not in overlapping ranges and outputs an efficiency ratio of linear separability. Eq. (5) shows how F3 is calculated, where n is the number of examples in the training set and overlap is a function that returns the number of overlapping examples between two classes. High values of F3 indicate the presence of features whose values do not overlap between classes.

$$F3 = \max_{i=1}^{m} \frac{n - \mathrm{overlap}(\vec{x}_{c_1}^{f_i}, \vec{x}_{c_2}^{f_i})}{n} \tag{5}$$
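Eqs. (4) and (5) can be sketched for the binary case as below. The function names are hypothetical, and two choices are assumptions made for this illustration rather than details stated in the text: a negative overlap (disjoint ranges) is clipped to zero, and an example counts as overlapping when its value lies inside the interval shared by both classes.

```python
def _per_class_values(X, y, i):
    """Values of feature i split by class (binary problems only)."""
    c1, c2 = sorted(set(y))
    a = [row[i] for row, label in zip(X, y) if label == c1]
    b = [row[i] for row, label in zip(X, y) if label == c2]
    return a, b

def f2_overlap_volume(X, y):
    """Eq. (4): product over features of the normalized overlap length."""
    volume = 1.0
    for i in range(len(X[0])):
        a, b = _per_class_values(X, y, i)
        overlap = min(max(a), max(b)) - max(min(a), min(b))
        span = max(max(a), max(b)) - min(min(a), min(b))
        if span == 0:
            continue  # assumption: a constant feature contributes a factor of 1
        volume *= max(0.0, overlap) / span  # assumption: clip disjoint ranges to 0
    return volume

def f3_feature_efficiency(X, y):
    """Eq. (5): best fraction of examples lying outside the overlap interval."""
    n = len(X)
    best = 0.0
    for i in range(len(X[0])):
        a, b = _per_class_values(X, y, i)
        low, high = max(min(a), min(b)), min(max(a), max(b))
        overlapping = sum(1 for row in X if low <= row[i] <= high)
        best = max(best, (n - overlapping) / n)
    return best

# Toy one-feature dataset (hypothetical): class ranges [1, 3] and [2, 5].
X = [[1.0], [3.0], [2.0], [5.0]]
y = [0, 0, 1, 1]
f2_value = f2_overlap_volume(X, y)
f3_value = f3_feature_efficiency(X, y)
print(f2_value, f3_value)
```

Here the shared interval is [2, 3], which covers one quarter of the total range and contains two of the four examples.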

3.2. Measures of class separability

Distance of erroneous instances to a linear classifier (L1): This measure quantifies the linearity of data, since the classification of linearly separable data is considered a simpler classification task. L1 computes the sum of the distances of erroneous data to a hyperplane separating two classes. Support Vector Machines (SVM) with a linear kernel [17] are used to induce the hyperplane. This measure is used only for binary classification problems. Values equal to 0 indicate a linearly separable problem.

Training error of a linear classifier (L2): Measures the predictive performance of a linear classifier for the training data. It also uses an SVM with a linear kernel. Lower training errors indicate the linearity of the problem.

Fraction of points lying on the class boundary (N1): Estimates the complexity of the correct hypothesis underlying the data. Initially, a Minimum Spanning Tree (MST) is generated from the data, connecting the data points by their distance. The fraction of points from different classes that are connected in the MST is returned. High values of N1 indicate the need for more complex boundaries for separating the data.
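A minimal sketch of N1 follows, assuming Euclidean distances and a dense Prim's algorithm for the MST; the function name is hypothetical and the code is an illustration rather than the DCoL implementation.

```python
import math

def n1_boundary_fraction(X, y):
    """Fraction of points incident to an MST edge that joins two classes."""
    n = len(X)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known connection to the tree
    parent = [-1] * n
    best_dist[0] = 0.0
    boundary = set()
    for _ in range(n):
        # Prim's algorithm: add the closest vertex that is not in the tree yet.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if parent[u] >= 0 and y[u] != y[parent[u]]:
            boundary.update((u, parent[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(X[u], X[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    parent[v] = u
    return len(boundary) / n

# Toy data (hypothetical): two well-separated groups on a line.
X = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
clean = [0, 0, 0, 1, 1, 1]
noisy = [0, 1, 0, 1, 1, 1]  # label of the second point flipped by hand

n1_clean = n1_boundary_fraction(X, clean)
n1_noisy = n1_boundary_fraction(X, noisy)
print(n1_clean, n1_noisy)
```

With clean labels only the single edge bridging the two groups joins different classes; flipping one label inside a group adds two more such edges, so N1 grows.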

Average intra/inter class nearest neighbor distances (N2): The mean intra-class and inter-class distances use the k-nearest neighbor (k-NN) algorithm to analyse the spread of the examples from distinct classes. The intra-class distance considers the distance from each example to its nearest example in the same class, while the inter-class distance computes the distance of this example to its nearest example in another class. Eq. (6) illustrates N2.

$$N2 = \frac{\sum_{i=1}^{n} \mathrm{intra}(\vec{x}_i)}{\sum_{i=1}^{n} \mathrm{inter}(\vec{x}_i)} \tag{6}$$

Low N2 values indicate that examples of the same class are close to each other, while far from the examples of the other classes.

Leave-one-out error rate of the 1-nearest neighbor algorithm (N3): Evaluates how distinct the examples of different classes are by considering the error rate of the 1-nearest neighbor (1-NN) classifier, estimated with the leave-one-out strategy. Low values indicate a high separation of the classes.
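The two nearest-neighbor measures, N2 (Eq. 6) and N3, can be sketched in a few lines. The code below is an illustration under stated assumptions: Euclidean distance, at least two examples per class, and hypothetical function names.

```python
import math

def _nearest(X, i, candidates):
    """Index of the candidate closest to example i (Euclidean distance)."""
    return min(candidates, key=lambda j: math.dist(X[i], X[j]))

def n2_intra_inter_ratio(X, y):
    """Eq. (6): summed nearest same-class distances over summed
    nearest other-class distances."""
    intra = inter = 0.0
    for i in range(len(X)):
        same = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        other = [j for j in range(len(X)) if y[j] != y[i]]
        intra += math.dist(X[i], X[_nearest(X, i, same)])
        inter += math.dist(X[i], X[_nearest(X, i, other)])
    return intra / inter

def n3_loo_1nn_error(X, y):
    """Leave-one-out error rate of the 1-NN classifier."""
    errors = 0
    for i in range(len(X)):
        others = [j for j in range(len(X)) if j != i]
        if y[_nearest(X, i, others)] != y[i]:
            errors += 1
    return errors / len(X)

# Toy data (hypothetical): two tight groups far from each other.
X = [[0.0], [1.0], [10.0], [11.0]]
clean = [0, 0, 1, 1]
noisy = [0, 1, 1, 0]  # one label swapped in each group by hand

n2_clean = n2_intra_inter_ratio(X, clean)
n3_clean = n3_loo_1nn_error(X, clean)
n3_noisy = n3_loo_1nn_error(X, noisy)
print(n2_clean, n3_clean, n3_noisy)
```

On the clean labels every nearest neighbor shares the example's class, so N3 is zero; after the swaps every nearest neighbor carries the opposite label.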

3.3. Measures of geometry and topology

Nonlinearity of a linear classifier (L3): Creates a new dataset by the interpolation of training data. New examples are created by linear interpolation with random coefficients of points chosen from the same class. Next, an SVM classifier with linear kernel is induced and its error rate for the original data is recorded. It is sensitive to the spread and overlapping of the data points and is used for binary classification problems only. Low values indicate a high linearity.

Nonlinearity of the one-nearest neighbor classifier (N4): Has the same reasoning as L3, but using the 1-NN classifier instead of SVM.

Fraction of maximum covering spheres on data (T1): Builds hyperspheres centered on the data points. The radii of these hyperspheres are increased until they touch any example of a different class. Smaller hyperspheres inside larger ones are eliminated. It outputs the ratio of the number of hyperspheres formed to the total number of data points. Low values indicate a low number of hyperspheres due to a low complexity of the data representation.

There are other measures presented in [11,26] that were not employed in this work because, by definition, they do not vary when the label noise level is increased. One of them is the dimensionality of the dataset and another is the ratio of the number of features to the number of data points (data sparsity). Similarly, measures like the directional-vector Fisher's discriminant ratio (F1v) and collective feature efficiency (F4) from [26] were disregarded, since they have a concept similar to other measures already employed.

3.4. Measures of structural representation

Before using these measures, it is necessary to transform the classiﬁcation dataset into a graph. This graph must preserve the similarities and distances between examples, so that the data relationships are captured. Each data point will correspond to a node or vertex of the graph. Edges are added connecting all pairs of nodes or some of the pairs.

Several techniques can be used to build a graph for a dataset. The most common are the k-nearest neighbor (k-NN) and the ε-NN [28]. While k-NN connects a pair of vertices i and j whenever i is one of the k-nearest neighbors of j, ε-NN connects a pair of nodes i and j only if $d(i, j) < \epsilon$, where d is a distance function. We employed the ε-NN variant, since many edge- and degree-based measures would be fixed for k-NN, regardless of the level of noise inserted in a dataset. Afterwards, all edges between examples from different classes are pruned from the graph [28]. This is a postprocessing step that can be employed for labeled datasets, which takes into account the class information.
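The graph construction just described (ε-NN followed by pruning of edges between classes) can be sketched as follows. The function name, the Euclidean distance and the adjacency-dictionary representation are assumptions made for this illustration; the text does not state how ε is chosen, so it is a plain parameter here.

```python
import math

def epsilon_nn_graph(X, y, eps):
    """Undirected eps-NN graph whose edges between different classes are pruned.
    Returns an adjacency dictionary {vertex: set of neighbors}."""
    n = len(X)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            close = math.dist(X[i], X[j]) < eps   # eps-NN rule: d(i, j) < eps
            same_class = y[i] == y[j]             # pruning: keep same-class edges only
            if close and same_class:
                adj[i].add(j)
                adj[j].add(i)
    return adj

# Toy data (hypothetical): four equally spaced points on a line.
X = [[0.0], [1.0], [2.0], [3.0]]
graph_clean = epsilon_nn_graph(X, [0, 0, 1, 1], eps=1.5)
graph_noisy = epsilon_nn_graph(X, [0, 1, 1, 1], eps=1.5)  # first label flipped by hand
print(graph_clean)
print(graph_noisy)
```

Because pruning depends on the labels, a flipped label changes which edges survive, which is why the edge-based measures below can react to label noise.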

There are various measures able to characterize the topological and structural properties of a graph. Some of them come from the statistical characterization of complex networks [14]. We used some of these graph-based measures in this paper, which are referred to by their original nomenclature, as follows:

Number of edges (Edges): Total number of edges contained in the graph. High values for edge-related measures indicate that many of the vertices are connected and, therefore, that there are many regions of high density from the same class. This is true because of the postprocessing of edges connecting examples from different classes applied in this work. Thus, the dataset is regarded as having low complexity if it shows a high number of edges.

Average degree of the network (Degree): The degree of a vertex i is the number of edges connected to i. The average degree of a network is the average degree of all vertices in the graph. For undirected networks, it can be computed by Eq. (7), where $v_{ij}$ is equal to 1 if i and j are connected, and 0 otherwise:

$$\mathrm{degree} = \frac{1}{n} \sum_{i,j} v_{ij} \tag{7}$$

The same reasoning of edge-related measures applies to degree-based measures, since the degree of a vertex corresponds to the number of edges incident to it. Therefore, high values for the degree indicate the presence of many regions of high density from the same class, and the dataset can be regarded as having low complexity.

Average density of network (Density): The density of a graph is the ratio of the number of edges it contains to the number of possible edges that could be formed. The average density also allows capturing whether there are dense regions from the same class in the dataset. High values indicate the presence of such regions and a simpler dataset.
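The three counting measures defined so far (Edges, Degree and Density) follow directly from an adjacency structure. The sketch below assumes an undirected graph stored as a dictionary of neighbor sets; the function names and the toy graph are hypothetical.

```python
def edge_count(adj):
    """Edges: each undirected edge appears in two neighbor sets."""
    return sum(len(neighbors) for neighbors in adj.values()) // 2

def average_degree(adj):
    """Eq. (7): sum of v_ij over all ordered pairs, divided by n."""
    return sum(len(neighbors) for neighbors in adj.values()) / len(adj)

def density(adj):
    """Density: existing edges over the n(n-1)/2 possible edges."""
    n = len(adj)
    return edge_count(adj) / (n * (n - 1) / 2)

# Toy graph (hypothetical): a triangle plus one isolated vertex.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: set()}
print(edge_count(adj), average_degree(adj), density(adj))
```

An isolated vertex, such as an example whose neighbors all carry other labels after pruning, lowers all three values.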

Maximum number of components (MaxComp): Corresponds to the maximal number of connected components of a graph. In an undirected graph, a component is a subgraph with paths between all of its nodes. When a dataset shows a high overlapping between classes, the graph will probably present a large number of disconnected components, since connections between different classes are pruned from the graph. The minimal component will tend to be smaller in this case. Thus, we will assume that smaller values of the MaxComp measure represent more complex datasets.
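Connected components can be found with a breadth-first search, as in the sketch below. It returns the components themselves, so both their number and their sizes can be read from the result; which statistic an implementation of MaxComp reports is not settled by this sketch, and the function name and toy graph are hypothetical.

```python
from collections import deque

def connected_components(adj):
    """List of connected components (each a list of vertices) of an
    undirected graph given as {vertex: set of neighbors}."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            u = queue.popleft()
            component.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(component)
    return components

# Toy graph (hypothetical): a path of three vertices, one edge, one isolated vertex.
adj = {0: {1}, 1: {0, 2}, 2: {1}, 3: {4}, 4: {3}, 5: set()}
components = connected_components(adj)
sizes = sorted(len(c) for c in components)
print(len(components), sizes)
```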

Closeness centrality (Closeness): Average number of steps required to access every other vertex from a given vertex, which is the number of edges traversed in the shortest path between them. It can be computed by the inverse of the distance between the nodes, as shown in the following equation:

$$\mathrm{closeness} = \frac{1}{\sum_{i \neq j} d(v_{ij})} \tag{8}$$

Since the closeness measure uses the inverse of the shortest distance between vertices, larger values are expected for simpler datasets that will show low distances between examples from the same class.

Betweenness centrality (Betweenness): The vertex and edge betweenness are defined by the average number of shortest paths that traverse them. We employed the vertex variant. Eq. (9) represents the betweenness value of a vertex $v_j$, where $d(v_{il})$ is the total number of shortest paths from node i to node l and $d_j(v_{il})$ is the number of those paths that pass through j:

$$\mathrm{betweenness}(v_j) = \sum_{i \neq j \neq l} \frac{d_j(v_{il})}{d(v_{il})} \tag{9}$$

The Betweenness index will be small for simpler datasets, since the distance between the shortest paths and the paths which pass through j will be close.

Clustering coefficient (ClsCoef): Measures the probability that adjacent vertices of a graph are connected. The clustering coefficient of a vertex $v_i$ is given by the ratio of the number of edges between its neighbors and the maximum number of edges that could possibly exist between these neighbors. The ClsCoef measure will be larger for simpler datasets, which will produce large connected components joining vertices from the same class.
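The clustering coefficient of a vertex, and its average over the graph, can be sketched as below. The function names are hypothetical, and assigning a coefficient of 0 to vertices with fewer than two neighbors is an assumption of this illustration.

```python
def clustering_coefficient(adj, v):
    """Edges among the neighbors of v divided by the number of possible
    edges among them."""
    neighbors = list(adj[v])
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumption: undefined cases count as 0
    links = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if neighbors[b] in adj[neighbors[a]]
    )
    return links / (k * (k - 1) / 2)

def average_clustering(adj):
    """Mean of the per-vertex coefficients, as done for vertex-level measures."""
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

# Toy graph (hypothetical): a triangle 0-1-2 with a pendant vertex 3 attached to 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
cls_coef = average_clustering(adj)
print(cls_coef)
```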

Hub score (Hubs): Measures the score of each node by the number of connections it has to other nodes, weighted by the number of connections these neighbors have. That is, more connected vertices, which are also connected to highly connected vertices, have higher hub score. The hub score is expected to have a low mean for high complexity datasets, since strong vertices will become less connected to strong neighbors. For instance, hubs are expected at regions of high density from a given class. Therefore, simpler datasets with high density will show larger values for this measure.

Average Path Length (AvgPath): Average size of all shortest paths in the graph. It measures the efficiency of information spread in the network. It is illustrated by Eq. (10), where n represents the number of vertices of the graph and $d(v_{ij})$ is the shortest distance between vertices i and j.

$$\mathrm{AvgPath} = \frac{1}{n(n-1)} \sum_{i \neq j} d(v_{ij}) \tag{10}$$

For the AvgPath measure, high values are expected for low density graphs, indicating an increase in complexity.
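Both Closeness (Eq. 8) and AvgPath (Eq. 10) rely on shortest-path distances, which a breadth-first search provides for unweighted graphs. The sketch below is an illustration with hypothetical function names. Two choices are assumptions not stated in the text: unreachable vertex pairs are ignored in the sums, and the sum in Eq. (10) runs over ordered pairs.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    del dist[source]
    return dist

def closeness(adj, v):
    """Eq. (8) for one vertex: inverse of the summed distances to the others."""
    total = sum(bfs_distances(adj, v).values())
    return 1.0 / total if total > 0 else 0.0  # assumption: isolated vertex -> 0

def average_path_length(adj):
    """Eq. (10): summed shortest distances over ordered pairs, divided by n(n-1)."""
    n = len(adj)
    total = sum(sum(bfs_distances(adj, v).values()) for v in adj)
    return total / (n * (n - 1))

# Toy graph (hypothetical): a path 0 - 1 - 2.
adj = {0: {1}, 1: {0, 2}, 2: {1}}
mean_closeness = sum(closeness(adj, v) for v in adj) / len(adj)
avg_path = average_path_length(adj)
print(mean_closeness, avg_path)
```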

For those measures that are calculated for each vertex individually, we computed an average for all vertices in the graph. The graph measures used in this paper mainly evaluate the overlapping of the classes and their density.

A previous paper also investigated the use of complex-network measures to characterize supervised datasets [12]. It used part of the measures presented here to design meta-learning models able to predict the best performing model between a pair of classifiers for a given dataset. The authors also compared these measures to those from [11], but in a distinct scenario from the one adopted here. It is not clear whether they employ a postprocessing of the graph for removing edges between nodes of different classes, as done in this paper. Also, some of the measures employed in that paper are not suitable for our scenario and are not used here. One example is the number of nodes of the graph, which will not vary for a given dataset regardless of its noise level. The only measures in common with those used in [12] are the number of edges, the average clustering coefficient and the average degree. Morais and Prati [12] also do not analyse the expected values of the measures for simpler or more complex datasets. Besides introducing new measures, we also describe the behavior of all measures for simpler or complex problems. Moreover, we try to identify the best suited measures for detecting the presence of label noise in a dataset.

3.5. Summary of measures

Table 1 summarizes the measures employed to characterize the complexity of the datasets used in this study. For each measure, we present upper (Maximum value) and lower bounds (Minimum value) achievable and how they are associated with the increase or decrease of complexity of the classification problems (Complexity column). For a given measure, the value in column "Complexity" is "+" if higher values of the measure are observed for high complexity datasets, that is, when the measure value correlates positively to the complexity level. On the other hand, the "−" sign denotes the opposite, so that low values of the measure are obtained for high complexity datasets, denoting a negative correlation.

Most of the bounds were obtained considering the equations directly, while some of the graph-based bounds were experimentally defined. For instance, for the F1 measure, if the means of the feature values are always equal, meaning that the classes overlap for all features (an extreme case), the numerator of Eq. (2) will be 0. Similarly, a maximum value cannot be determined for F1, as it is dependent on the feature values of each dataset. We denote that by the "∞" value in the table. In the case of graph-based measures, we generated graphs representing simple and complex relations between the same number of data points and observed the measure values achieved. A simple graph would correspond to a case where the classes are well separated and there is a high number of connections between examples from the same class, while a complex dataset would correspond to a graph where examples of different classes are always next to each other and ultimately the connections between them are pruned according to our graph construction method.

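Table 1. Summary of measures.

| Type of measure | Measure | Minimum value | Maximum value | Complexity |
|---|---|---|---|---|
| Overlapping in feature values | F1 | 0 | +∞ | − |
| Overlapping in feature values | F2 | 0 | +∞ | + |
| Overlapping in feature values | F3 | 0 | 1 | − |
| Class separability | L1 | 0 | +∞ | + |
| Class separability | L2 | 0 | 1 | + |
| Class separability | N1 | 0 | 1 | + |
| Class separability | N2 | 0 | +∞ | + |
| Class separability | N3 | 0 | 1 | + |
| Geometry and topology | L3 | 0 | 1 | + |
| Geometry and topology | N4 | 0 | 1 | + |
| Geometry and topology | T1 | 0 | 1 | + |
| Structural representation | Edges | 0 | n(n−1)/2 | − |
| Structural representation | Degree | 0 | n−1 | − |
| Structural representation | MaxComp | 1 | n | − |
| Structural representation | Closeness | 0 | 1/(n−1) | − |
| Structural representation | Betweenness | 0 | (n−1)(n−2)/2 | + |
| Structural representation | Hubs | 0 | 1 | − |
| Structural representation | Density | 0 | 1 | − |
| Structural representation | ClsCoef | 0 | 1 | − |
| Structural representation | AvgPath | 1/(n(n−1)) | 0.5 | + |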


4. Experiments

This section presents the experiments performed to evaluate how the different data complexity measures from Section 3 behave in the presence of label noise for several public benchmark datasets. First, a set of classification benchmark datasets was chosen for the experiments. Different levels of label noise were afterwards added to each dataset. We then monitor how the complexity level of the datasets is affected by noise imputation. This is accomplished by:

1. Verifying the Spearman correlation between the measure values and the noise rates artificially imputed. This analysis allows identifying a set of measures more sensitive to the presence of noise in a dataset.

2. Evaluating the correlation between the measure values to identify those measures that (1) capture different concepts regarding noisy environments and (2) can be jointly used to support the development of new noise-handling techniques.

Next, we detail the experimental protocol previously outlined.
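The first analysis step relies on the Spearman rank correlation, which is the Pearson correlation computed on ranks. The sketch below is a small pure-Python version written for illustration, with average ranks for ties; the function names are hypothetical and the measure values in the example are made-up numbers, not results from the paper.

```python
from math import sqrt
from statistics import mean

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

noise_rates = [0.05, 0.10, 0.20, 0.40]
# Made-up measure values, for illustration only (not taken from the paper).
rising_measure = [0.06, 0.11, 0.19, 0.35]
falling_measure = [0.90, 0.70, 0.65, 0.30]

rho_up = spearman(noise_rates, rising_measure)
rho_down = spearman(noise_rates, falling_measure)
print(rho_up, rho_down)
```

A measure that grows monotonically with the noise rate yields a correlation close to 1, and one that shrinks monotonically yields a value close to -1; both indicate sensitivity to noise.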

4.1. Datasets

For the experiments, we selected artificial and real datasets. The artificial datasets were introduced and generously provided by Amancio et al. [29]. These authors generated artificial classification datasets based on multivariate Gaussians, with different levels of overlapping between the classes. For this paper we selected 180 balanced datasets (with the same number of examples per class) with 2 classes, containing 2, 10 and 50 features and with different overlapping rates for each of the number of features. We chose such datasets based on the observations of a recent work [9], which points out that class overlap seems to be a principal contributor to instance hardness and that noisy data can ultimately be considered hard instances.

For the real datasets, we selected 53 benchmarks from UCI and Keel repositories[30,31]for our experiments. As most of them are real datasets, it is not possible to assert that they are noiseless, although some of them are artiﬁcial and show no label incon-sistency. Nonetheless, a recent study showed that most of the datasets from UCI can be regarded as easy problems, once many classiﬁcation techniques are able to attain high accuracies when applied to them[32].Table 2summarizes the main characteristics of these datasets: number of examples (Examples), number of features (Features), number of classes (Class) and percentage of the examples in the majority class (%MC).

4.2. Noise imputation

In order to corrupt the datasets, we used the most common type of artificial noise imputation method for classification problems: uniform random addition [15]. In this approach, each example has the same probability of having its label replaced by another random label [33]. For each dataset, the noise was inserted at different levels, namely 5%, 10%, 20% and 40%. Thus, we were able to investigate the influence of increasing noise levels. Since the selection of examples was random, we generated 10 different noisy versions of each dataset for each noise level. Because most of the original benchmark datasets may already contain some noise, the imputed rates of 5%, 10%, 20% and 40% should be read as potential rather than actual noise levels.
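The imputation procedure described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function name `add_label_noise` is hypothetical, and the sketch assumes that exactly `round(rate * n)` examples are selected at random and that the replacement label always differs from the original one.

```python
import random

def add_label_noise(labels, rate, seed=None):
    """Return a noisy copy of `labels` and the indices whose labels were changed.

    Assumptions (not stated in the text): exactly round(rate * n) examples are
    corrupted, and the new label is drawn uniformly from the *other* classes.
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    n_noisy = round(rate * len(labels))
    noisy_idx = rng.sample(range(len(labels)), n_noisy)
    noisy = list(labels)
    for i in noisy_idx:
        # Choose among the remaining classes so the label really changes.
        alternatives = [c for c in classes if c != labels[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy, sorted(noisy_idx)

# Ten noisy versions per noise level, as in the protocol described above.
labels = ["a"] * 50 + ["b"] * 50
versions = {rate: [add_label_noise(labels, rate, seed=s) for s in range(10)]
            for rate in (0.05, 0.10, 0.20, 0.40)}
```

Returning the corrupted indices alongside the labels keeps the ground truth needed later to score a noise filter.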

4.3. Methodology

First, we created noisy versions of the original datasets from Section 4.1 by using the previously described systematic model of noise imputation. Although all datasets were partitioned by 10-fold cross-validation, noise was inserted in the training folds only. The complexity measures were extracted from the original training datasets and from their noisy versions.

Table 2

Summary of dataset characteristics: name, number of examples, number of features, number of classes and percentage of examples in the majority class (%MC).

| Dataset | Examples | Features | Class | %MC |
|---|---|---|---|---|
| abalone | 4153 | 9 | 19 | 16 |
| acute-inflammations | 120 | 8 | 2 | 58 |
| appendicitis | 106 | 8 | 2 | 80 |
| australian | 690 | 15 | 2 | 55 |
| backache | 180 | 32 | 2 | 86 |
| balance | 625 | 5 | 3 | 46 |
| banana | 5300 | 3 | 2 | 55 |
| blood-transfusion-service | 748 | 5 | 2 | 76 |
| breast-tissue | 106 | 10 | 6 | 20 |
| bupa | 345 | 7 | 2 | 57 |
| car | 1728 | 7 | 4 | 70 |
| cardiotocography | 2126 | 21 | 10 | 27 |
| cmc | 1473 | 10 | 3 | 42 |
| collins | 485 | 22 | 13 | 16 |
| connectionist-mines-vs-rocks | 208 | 61 | 2 | 53 |
| crabs | 200 | 6 | 2 | 50 |
| expgen | 207 | 80 | 5 | 57 |
| flags | 178 | 29 | 5 | 33 |
| flare | 1066 | 12 | 6 | 31 |
| glass | 205 | 10 | 5 | 37 |
| habermans-survival | 306 | 4 | 2 | 73 |
| hayes-roth | 160 | 5 | 3 | 40 |
| ionosphere | 351 | 34 | 2 | 64 |
| iris | 150 | 5 | 3 | 33 |
| kr-vs-kp | 3196 | 37 | 2 | 52 |
| led7digit | 500 | 8 | 10 | 11 |
| leukemia-haslinger | 100 | 51 | 2 | 51 |
| molecular-promoters | 106 | 58 | 2 | 50 |
| monk1 | 556 | 7 | 2 | 50 |
| monk2 | 601 | 7 | 2 | 65 |
| movement-libras | 360 | 91 | 15 | 6 |
| newthyroid | 215 | 6 | 3 | 69 |
| page-blocks | 5473 | 11 | 5 | 89 |
| parkinsons | 195 | 23 | 2 | 75 |
| phoneme | 5404 | 6 | 2 | 70 |
| pima | 768 | 9 | 2 | 65 |
| ringnorm | 7400 | 21 | 2 | 50 |
| saheart | 462 | 10 | 2 | 65 |
| segmentation | 2310 | 19 | 7 | 14 |
| spectf | 349 | 45 | 2 | 72 |
| statlog-german | 1000 | 21 | 2 | 70 |
| statlog-heart | 270 | 14 | 2 | 55 |
| tae | 151 | 6 | 3 | 34 |
| tic-tac-toe | 958 | 10 | 2 | 65 |
| titanic | 2201 | 4 | 2 | 67 |
| vehicle | 846 | 19 | 4 | 25 |
| vowel | 990 | 11 | 11 | 9 |
| waveform-5000 | 5000 | 41 | 3 | 33 |
| wdbc | 569 | 31 | 2 | 62 |
| wine | 178 | 14 | 3 | 39 |
| wine-quality | 6497 | 12 | 6 | 43 |
| yeast | 1484 | 9 | 9 | 31 |
| zoo | 84 | 17 | 4 | 48 |

To calculate the complexity measures described in Sections 3.1 to 3.3, we used the Data Complexity Library (DCoL) [26]. All distance-based measures employed the normalized Euclidean distance for continuous features and the overlap distance for nominal features (this distance is 0 for equal categorical values and 1 otherwise) [34]. To build the graph for the graph-based measures, we used the ϵ-NN algorithm, with the ϵ threshold value equal to 15%, as in [12]. The measures described in Section 3.4 were calculated using the Igraph library [35].
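The distance and graph construction described above can be illustrated with a short sketch. It is not the DCoL or Igraph code: the names `hetero_distance` and `epsilon_nn_graph` are hypothetical, the way numeric and nominal terms are combined (root of the mean squared per-feature distance) is an assumption, and so is the reading of the 15% threshold as ϵ = 0.15 on distances scaled to [0, 1]. Edges between examples of different classes are dropped, following the pruning mentioned in the description of the proposed filter.

```python
import math

def hetero_distance(a, b, ranges):
    """Distance in [0, 1] between two examples with mixed feature types.

    ranges[j] is None for a nominal feature, otherwise (max - min) of the
    numeric feature j. The combination rule is an assumption of this sketch.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:              # nominal feature: overlap distance (0 or 1)
            d = 0.0 if x == y else 1.0
        elif r == 0:               # constant numeric feature carries no information
            d = 0.0
        else:                      # numeric feature scaled by its range
            d = abs(x - y) / r
        total += d * d
    return math.sqrt(total / len(ranges))

def epsilon_nn_graph(X, y, ranges, eps=0.15):
    """Adjacency sets: i and j are linked if they are within eps and share a label."""
    adj = {i: set() for i in range(len(X))}
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if y[i] == y[j] and hetero_distance(X[i], X[j], ranges) < eps:
                adj[i].add(j)
                adj[j].add(i)
    return adj

# Toy data: one numeric feature (range 1.0) and one nominal feature.
X = [(0.0, "r"), (0.1, "r"), (1.0, "b"), (0.05, "r")]
y = [0, 0, 1, 1]
graph = epsilon_nn_graph(X, y, ranges=[1.0, None])
```

In this toy example the last point lies close to the first two but carries a different label, so it ends up isolated in the graph.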

As a result, we have one meta-level dataset that is employed in the subsequent experiments. It contains 20 meta-features (complexity and graph-based measures) describing the benchmark datasets and their noisy versions. This meta-level dataset therefore has 2173 examples: 53 (original datasets) + 53 (datasets) × 4 (noise levels) × 10 (random versions).

Two types of analysis were performed using the meta-level dataset:

1. Correlation between the measure values and the noise level of the datasets;

2. Correlation within the measure values.

The first analysis considers all measures. Its results are then used to select a subset of measures more sensitive to noise imputation, which is further analysed in the second correlation study.

The first analysis verifies whether there is a direct relation between the noise level of a dataset and the values of the measures extracted from it. This allows the identification of the measures that are more sensitive to the presence of noise. To this end, Spearman's rank correlation between the measure values and the different noise levels was calculated for all datasets. Those measures that present a significant correlation according to Spearman's statistical test (at the 95% confidence level) were selected to be analyzed further.
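As an illustration of the first analysis, the sketch below computes Spearman's rank correlation as the Pearson correlation of the ranks, with tied values sharing their mean rank. The function names and the example values are hypothetical, and the significance test is omitted; in practice a statistics library would normally provide both the coefficient and its p-value.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho; assumes neither input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical values of one measure at noise rates 0%, 5%, 10%, 20% and 40%.
noise_rate = [0, 5, 10, 20, 40]
measure = [0.10, 0.14, 0.19, 0.27, 0.41]
rho = spearman(noise_rate, measure)
```

A measure whose values rise monotonically with the noise rate, as in this made-up example, yields a coefficient of 1; a monotonically falling one yields −1.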

The second analysis evaluates the Spearman correlation between the measures with the highest sensitivity to the presence of noise according to the previous experimental results. It looks for overlap in the complexity concepts captured by these measures. Similar analyses are carried out in [9] for assessing the relationship between some of the instance hardness measures proposed there. While a high correlation could indicate that the measures are capturing the same complexity concepts, a low correlation indicates that the measures could complement each other, an issue that can be further explored.

5. Results of correlation analysis

This section presents the experimental results for the correlation analysis previously described. We have also evaluated the results for some artificial datasets, as described in Section 4.1. These results were quite similar to those observed for the benchmark datasets, with the difference that the absolute correlation values calculated were higher for the artificial datasets. The complete set of results, including those obtained for the artificial datasets and the datasets employed, can be consulted at http://lpfgarcia.github.io/rcorr/.

[Fig. 1: histograms of each measure, one panel per measure (F1, F2, F3, L1, L2, L3, N1, N2, N3, N4, T1, Edges, Degree, Density, MaxComp, Closeness, Betweenness, Hub, ClsCoef, AvgPath); x-axis: normalized range; y-axis: density; bar color: noise rate (0%, 5%, 10%, 20%, 40%).]

Fig. 1 presents histograms of the values of the complexity measures for all benchmark datasets when random noise is added. The bars are colored according to the amount of noise inserted, from 0% (original datasets) to 40%. The measure values were normalized considering all datasets to allow their direct comparison. Some of the measures are more sensitive to noise imputation and show clearly separated value ranges for different noise levels: N1, N3, Edges, Degree and Density. Other measures, such as Betweenness, do not show a clear contrast in their values for different noise levels.

Furthermore, Fig. 1 also shows that, as more noise is added to the datasets, the complexity of the classification problem tends to increase. This is reflected in the values of the majority of the complexity measures, which either increased or decreased when noise was added, in accordance with their positive or negative correlation to the complexity level, as shown in Table 1 (column "Complexity"). For instance, higher N1 values are expected for more complex datasets, and the N1 values indeed increased for higher levels of noise. On the other hand, lower F1 values are expected for more complex datasets, and we can observe that the F1 values tend to decrease as more noise is added to the datasets.

5.1. Correlation of measures to the noise level

Fig. 2 shows the correlation between the values of the measures and the different noise levels in the datasets. Positive and negative values are plotted in order to show clearly which measures are directly or inversely correlated to the noise levels. As the noise level increases, the values of the complexity measures either increase or decrease accordingly, indicating increases in the complexity level of the noisy datasets. The closer the value is to 1 or −1, the stronger the relation between the measure and the noise level.

According to the statistical test employed, 18 measures presented significant correlation to the noise levels, at 95% of confidence. Among the measures with direct correlation to the noise level, nine are basic complexity measures from the literature (N3, N1, N2, N4, L2, L1, T1, F2, and L3). These measures mainly capture: class separability (N3, N1, N2, L2 and L1), data topology according to a nearest neighbor classifier (N4, T1 and L3) and individual feature overlap (F2). Regarding the measures inversely related to the noise levels, two are basic complexity measures based on feature overlap (F1 and F3), while six are based on structural representation (Density, Hub, Degree, ClsCoef, Edges and MaxComp). Only the Closeness and Betweenness measures did not present significant correlation to the noise levels. As expected, the most prominent measures are the same ones that showed more distinct values for different noise levels in the histograms from Fig. 1.

Despite the statistical significance, some low correlation values can be noticed in Fig. 2. Only the measures N3, N1 and N2 presented correlation values higher than 0.5. These correlations were higher in the experiments with artificial datasets. This can be a result of the fact that, for real datasets, the amount of noise added is potential rather than actual.

5.2. Correlation between measures

In order to verify whether the measures capture similar or distinct information from data, we calculated pairwise correlations between their values. Only those measures considered more relevant in the previous analysis were considered. These measures were highlighted as more sensitive to noise imputation and are therefore candidates for use in noise identification.

Fig. 3 shows a heatmap of the correlation between these pairs of measures. Each column and row corresponds to a measure. Each box is colored according to the correlation values calculated, from gray (highest correlation, whether positive or negative) to white (lowest correlation). The correlation values are also shown inside the heatmap cells. We highlight in bold the correlation values that are not significant according to Spearman's correlation test (at the 95% confidence level). These pairs of measures correspond to those that can potentially complement each other.

According to the heatmap, various measures are weakly correlated to each other. Therefore, they capture distinct aspects of the data. As expected, the measures N1, N2, N3 and N4 from [11] are highly correlated. They are all based on nearest neighbor information. Despite the fact that all structural representation measures are extracted from a nearest neighbor graph, their correlation to N1, N2, N3 and N4 is low in several cases. Among the graph-based measures, high correlations are observed between Edges, Degree and MaxComp. Since the degree of a graph is calculated considering the number of its edges and number of connected components, this correlation is expected by definition. It is interesting to notice that many of the measures highlighted as distinguishing the noise levels have low correlation between them. This is particularly true for class separability measures (e.g., N3) when paired with the structural representation measures (e.g., Degree and Edges). Therefore, they could be combined to improve noise identification and handling. This issue is preliminarily investigated in the next section.

[Fig. 2: Spearman correlation between each measure and the noise level; measures shown: F1, Density, F3, Hub, Degree, ClsCoef, Edges, MaxComp, Closeness, Betweenness, AvgPath, L3, F2, T1, L1, L2, N4, N2, N1, N3; correlation axis from −1.0 to 1.0.]

6. Proposed noise filtering technique

Based on the previous experimental results, this section proposes and investigates a new filter, named GraphNN, for label noise identification. It uses two of the measures most correlated to the increasing noise levels in the datasets: the leave-one-out error rate of the 1-nearest neighbor classifier (N3) and the average degree of the network (Degree).
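To make the two ingredients concrete, the following sketch computes N3 and the average degree on toy data. The names `n3` and `average_degree` are hypothetical, and plain Euclidean distance on numeric features is assumed in place of the heterogeneous distance used in the experiments.

```python
import math

def n3(X, y):
    """Leave-one-out error rate of the 1-nearest-neighbor classifier."""
    errors = 0
    for i in range(len(X)):
        # Nearest neighbor of example i among all other examples.
        nearest = min((j for j in range(len(X)) if j != i),
                      key=lambda j: math.dist(X[i], X[j]))
        if y[nearest] != y[i]:
            errors += 1
    return errors / len(X)

def average_degree(adj):
    """Mean number of neighbors per vertex; adj maps a vertex to its neighbor set."""
    return sum(len(neighbors) for neighbors in adj.values()) / len(adj)

# Two well-separated groups; flipping one label raises N3 from 0 to 1/6.
X = [(0, 0), (0, 1), (1.5, 0), (5, 5), (5, 6), (6, 5)]
clean = [0, 0, 0, 1, 1, 1]
noisy = [0, 0, 1, 1, 1, 1]
```

On this toy data only the relabeled point is misclassified by its nearest neighbor, which is the behavior that makes N3 a natural estimate of the noise rate.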

The GraphNN filter identifies noisy examples by first constructing a graph from the dataset, as described in Section 3.4. Afterwards, it uses the degree of each vertex to flag an example as potentially noisy. When an example is mislabeled, it will probably be close to examples from another class. In this case, its edges to close examples will be pruned and the example will tend to have a low degree value. Safe examples, on the other hand, will be connected to a high number of examples from the same class and show a high degree value. Therefore, it is necessary to stipulate a threshold on the node degree so that its corresponding example can be considered noisy or not.

When a dataset has a large amount of noise, a larger number of examples will have a low degree value and the threshold value can be higher. On the other hand, for datasets with a lower noise level, a lower threshold value can be required. Otherwise, many safe examples will be regarded as noisy. Because it is difficult to select a single threshold value appropriate for all of the datasets employed in our experiments, we use the N3 measure value to estimate the percentage of noise in each dataset. This was the measure most correlated to the noise levels in our experiments and the one for which clearer limits on the values obtained for distinct noise levels can be observed. The ϵ value adopted to build the graph from data also influences the average degree of the nodes and can influence the threshold choice, although this was not studied in our experiments.

[Fig. 3: heatmap of the pairwise correlations between measures; color scale from −1.0 to 1.0. Cell values:]

| | F1 | F2 | F3 | L1 | L2 | L3 | N1 | N2 | N3 | N4 | T1 | Edges | Degree | Density | MaxComp | Hub | ClsCoef | AvgPath |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| F1 | 1 | −0.34 | 0.72 | −0.64 | −0.62 | 0.01 | −0.35 | −0.24 | −0.36 | −0.45 | −0.08 | −0.38 | −0.4 | 0 | −0.51 | −0.15 | 0.48 | −0.15 |
| F2 | −0.34 | 1 | −0.47 | 0.02 | 0.09 | 0.16 | 0.31 | 0.27 | 0.37 | 0.62 | −0.09 | 0.24 | 0.21 | −0.37 | 0.2 | −0.27 | −0.03 | 0.07 |
| F3 | 0.72 | −0.47 | 1 | −0.57 | −0.51 | 0.15 | −0.11 | −0.09 | −0.18 | −0.28 | −0.09 | −0.48 | −0.53 | −0.17 | −0.58 | −0.24 | 0.43 | −0.08 |
| L1 | −0.64 | 0.02 | −0.57 | 1 | 0.94 | −0.05 | 0.05 | −0.08 | 0.02 | −0.16 | −0.01 | 0.05 | 0.12 | 0.33 | 0.21 | 0.34 | −0.44 | 0.22 |
| L2 | −0.62 | 0.09 | −0.51 | 0.94 | 1 | 0.17 | 0.15 | −0.04 | 0.12 | −0.05 | −0.1 | 0.01 | 0.07 | 0.22 | 0.18 | 0.3 | −0.37 | 0.29 |
| L3 | 0.01 | 0.16 | 0.15 | −0.05 | 0.17 | 1 | 0.29 | 0.18 | 0.25 | 0.33 | −0.16 | 0.03 | −0.03 | −0.38 | 0.01 | −0.19 | 0.22 | 0.19 |
| N1 | −0.35 | 0.31 | −0.11 | 0.05 | 0.15 | 0.29 | 1 | 0.83 | 0.95 | 0.57 | 0.2 | 0 | −0.09 | −0.65 | 0.01 | −0.42 | −0.06 | 0.21 |
| N2 | −0.24 | 0.27 | −0.09 | −0.08 | −0.04 | 0.18 | 0.83 | 1 | 0.84 | 0.46 | 0.33 | 0.01 | −0.07 | −0.58 | −0.02 | −0.41 | −0.06 | 0.01 |
| N3 | −0.36 | 0.37 | −0.18 | 0.02 | 0.12 | 0.25 | 0.95 | 0.84 | 1 | 0.62 | 0.24 | 0.06 | −0.02 | −0.62 | 0.06 | −0.44 | −0.02 | 0.2 |
| N4 | −0.45 | 0.62 | −0.28 | −0.16 | −0.05 | 0.33 | 0.57 | 0.46 | 0.62 | 1 | 0.02 | 0.42 | 0.33 | −0.58 | 0.36 | −0.39 | 0.1 | 0.16 |
| T1 | −0.08 | −0.09 | −0.09 | −0.01 | −0.1 | −0.16 | 0.2 | 0.33 | 0.24 | 0.02 | 1 | 0.15 | 0.11 | −0.2 | 0.12 | −0.08 | −0.26 | −0.28 |
| Edges | −0.38 | 0.24 | −0.48 | 0.05 | 0.01 | 0.03 | 0 | 0.01 | 0.06 | 0.42 | 0.15 | 1 | 0.98 | −0.06 | 0.95 | −0.08 | −0.02 | −0.04 |
| Degree | −0.4 | 0.21 | −0.53 | 0.12 | 0.07 | −0.03 | −0.09 | −0.07 | −0.02 | 0.33 | 0.11 | 0.98 | 1 | 0.08 | 0.97 | 0.04 | −0.08 | −0.04 |
| Density | 0 | −0.37 | −0.17 | 0.33 | 0.22 | −0.38 | −0.65 | −0.58 | −0.62 | −0.58 | −0.2 | −0.06 | 0.08 | 1 | 0.09 | 0.78 | −0.36 | −0.06 |
| MaxComp | −0.51 | 0.2 | −0.58 | 0.21 | 0.18 | 0.01 | 0.01 | −0.02 | 0.06 | 0.36 | 0.12 | 0.95 | 0.97 | 0.09 | 1 | 0.12 | −0.17 | 0.1 |
| Hub | −0.15 | −0.27 | −0.24 | 0.34 | 0.3 | −0.19 | −0.42 | −0.41 | −0.44 | −0.39 | −0.08 | −0.08 | 0.04 | 0.78 | 0.12 | 1 | −0.52 | −0.03 |
| ClsCoef | 0.48 | −0.03 | 0.43 | −0.44 | −0.37 | 0.22 | −0.06 | −0.06 | −0.02 | 0.1 | −0.26 | −0.02 | −0.08 | −0.36 | −0.17 | −0.52 | 1 | 0.37 |
| AvgPath | −0.15 | 0.07 | −0.08 | 0.22 | 0.29 | 0.19 | 0.21 | 0.01 | 0.2 | 0.16 | −0.28 | −0.04 | −0.04 | −0.06 | 0.1 | −0.03 | 0.37 | 1 |


Therefore, GraphNN first orders all examples according to their degree values. The N3 value then delimits the fraction of lowest-degree examples that can be regarded as noisy. Furthermore, among the examples with a degree lower than the threshold, only those that are misclassified by the NN classifier used in N3 are considered noisy. Requiring both conditions makes the filter more robust in preserving safe examples.
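The steps just described can be put together in a small sketch. It is one possible reading of the procedure, not the authors' implementation: the name `graphnn_filter` is hypothetical, Euclidean distance with an absolute `eps` replaces the heterogeneous distance and the 15% threshold, and ties between examples of equal degree are broken by index order.

```python
import math

def graphnn_filter(X, y, eps):
    """Return the indices flagged as noisy by a GraphNN-style filter (sketch)."""
    n = len(X)
    dist = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]

    # Degree of each vertex in an epsilon-NN graph that keeps same-class edges only.
    degree = [sum(1 for j in range(n)
                  if j != i and y[i] == y[j] and dist[i][j] < eps)
              for i in range(n)]

    # Leave-one-out 1-NN: an example is misclassified when its nearest
    # neighbor carries a different label. The error rate is N3.
    def nearest(i):
        return min((j for j in range(n) if j != i), key=lambda j: dist[i][j])

    misclassified = [y[nearest(i)] != y[i] for i in range(n)]
    n3 = sum(misclassified) / n

    # N3 estimates the noise rate: inspect that fraction of lowest-degree examples.
    n_candidates = round(n3 * n)
    candidates = sorted(range(n), key=lambda i: degree[i])[:n_candidates]

    # Keep only the candidates that the 1-NN classifier also misclassifies.
    return sorted(i for i in candidates if misclassified[i])

# Toy data: two groups on a line plus one point (index 9) labeled 1 amid class 0.
X = [(0, 0), (0.4, 0), (0.8, 0), (1.2, 0), (1.6, 0),
     (5, 0), (5.4, 0), (5.8, 0), (6.2, 0), (0.8, 0.5)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
flagged = graphnn_filter(X, y, eps=1.0)
```

In the toy data the relabeled point has degree 0 and is the only example misclassified by its nearest neighbor, so it is the only one flagged.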

6.1. Experiments on noise filtering

We performed a set of experiments in order to analyse the performance of the GraphNN filter in identifying noisy examples. The same datasets used in our previous analysis were employed in these experiments.

Since noise was added in a controlled way, it is possible to know which examples are noisy and to record the performance of the filter in retrieving them. We therefore employ the F1-measure in this evaluation, as given by Eq. (11) and proposed in [3]. This measure combines the precision and recall values of a filter in noise identification: precision is defined as the number of correctly identified noisy cases divided by the number of examples identified by the filter as noisy; recall is the number of correctly identified noisy cases divided by the total number of noisy examples introduced in the dataset.

\[
F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{11}
\]
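The evaluation measure of Eq. (11) can be computed as in the following sketch, where `flagged` holds the indices a filter marks as noisy and `true_noisy` the indices corrupted during imputation. The function name is hypothetical, and returning 0 when a denominator is empty is a convention assumed here rather than one stated in the text.

```python
def filter_f1(flagged, true_noisy):
    """Precision, recall and F1 of a noise filter from two collections of indices."""
    flagged, true_noisy = set(flagged), set(true_noisy)
    hits = len(flagged & true_noisy)          # correctly identified noisy cases
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(true_noisy) if true_noisy else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Four examples flagged, three truly noisy, two in common.
precision, recall, f1 = filter_f1([1, 2, 3, 4], [2, 3, 5])
```

Here precision is 2/4 and recall is 2/3, so F1 equals 4/7.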

As baselines, we included some distance-based noise filtering techniques from the literature [36,4]. All of them employ the k-nearest neighbor (k-NN) algorithm to determine whether an example is suspicious or not. According to them, an example is consistent only if it is located close to examples from its class. Otherwise, the example is either incorrectly labeled or on the decision border. In the latter case, it is also considered unsafe, since small perturbations in a borderline example can move it to the wrong side of the decision border. This resembles our method, where the graph captures the similarity between examples from the same class.

The Edited Nearest Neighbor (ENN) technique removes an example if the majority label among its k nearest neighbors differs from its own label [37]. Repeated ENN (RENN) is a variation of ENN, which applies ENN repeatedly until all objects have the majority of their neighbors from the same class. In All-kNN (AENN), instead of using a fixed value of k, increasing values of k are considered [4]. At the end of each iteration, examples that have the majority of their neighbors from other classes are marked as noisy. These flagged examples are then eliminated from the dataset.
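A minimal sketch of the ENN rule follows, flagging examples rather than deleting them so that the result can be scored against the imputed noise. The name `enn_flags` is hypothetical, Euclidean distance is assumed, and ties in the neighbor vote are resolved by whichever label `Counter` encountered first, which is an arbitrary choice of this sketch.

```python
import math
from collections import Counter

def enn_flags(X, y, k=3):
    """Indices whose label disagrees with the majority label of their k nearest neighbors."""
    flagged = []
    for i in range(len(X)):
        # All other examples, ordered by distance to example i.
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: math.dist(X[i], X[j]))
        votes = Counter(y[j] for j in others[:k])
        majority = votes.most_common(1)[0][0]
        if majority != y[i]:
            flagged.append(i)
    return flagged

# Same toy layout as before: index 9 is labeled 1 but sits among class 0.
X = [(0, 0), (0.4, 0), (0.8, 0), (1.2, 0), (1.6, 0),
     (5, 0), (5.4, 0), (5.8, 0), (6.2, 0), (0.8, 0.5)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
suspects = enn_flags(X, y, k=3)
```

RENN would repeat this step on the remaining examples until nothing else is flagged, and All-kNN would take the examples flagged for any k from 1 up to a maximum value.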

6.2. Performance of the cleansing filter

The F1-measure values obtained by the filters in the retrieval of the artificially imputed noisy examples are shown in the heatmap of Fig. 4. In this figure, each column represents one filter, while each row corresponds to a dataset. The closer the color is to black, the higher the F1-score computed. The gray color represents an intermediate F1-score and the white color represents low F1-score levels. According to the heatmap, the GraphNN filter attained, on average, a good predictive performance for most of the datasets. For the banana, connectionist-mines-vs-rocks, crabs, kr-vs-kp, monk2, ringnorm, titanic and tic-tac-toe datasets, the performance decreased.

[Fig. 4: heatmap of the F-score of each filter per dataset; color scale marked at 0.25, 0.50 and 0.75. Cell values:]

| Dataset | AENN | ENN | RENN | GraphNN |
|---|---|---|---|---|
| abalone | 0.3 | 0.18 | 0.2 | 0.34 |
| acute-inflammations | 0.75 | 0.75 | 0.78 | 0.81 |
| appendicitis | 0.46 | 0.46 | 0.52 | 0.45 |
| australian | 0.42 | 0.42 | 0.46 | 0.37 |
| backache | 0.4 | 0.4 | 0.47 | 0.55 |
| balance | 0.55 | 0.55 | 0.64 | 0.62 |
| banana | 0.57 | 0.57 | 0.6 | 0.47 |
| blood-transfusion-service | 0.37 | 0.37 | 0.39 | 0.43 |
| breast-tissue | 0.49 | 0.49 | 0.53 | 0.53 |
| bupa | 0.24 | 0.24 | 0.25 | 0.29 |
| car | 0.31 | 0.47 | 0.49 | 0.61 |
| cardiotocography | 0.62 | 0.62 | 0.71 | 0.6 |
| cmc | 0.24 | 0.24 | 0.26 | 0.31 |
| collins | 0.3 | 0.46 | 0.49 | 0.67 |
| connectionist-mines-vs-rocks | 0.44 | 0.44 | 0.47 | 0.32 |
| crabs | 0.6 | 0.6 | 0.62 | 0.47 |
| expgen | 0.67 | 0.67 | 0.7 | 0.79 |
| flags | 0.36 | 0.29 | 0.33 | 0.41 |
| flare | 0.49 | 0.49 | 0.5 | 0.51 |
| glass | 0.43 | 0.49 | 0.5 | 0.52 |
| habermans-survival | 0.31 | 0.31 | 0.33 | 0.36 |
| hayes-roth | 0.4 | 0.4 | 0.42 | 0.37 |
| ionosphere | 0.45 | 0.45 | 0.48 | 0.39 |
| iris | 0.74 | 0.74 | 0.78 | 0.88 |
| kr-vs-kp | 0.54 | 0.54 | 0.56 | 0.49 |
| led7digit | 0.3 | 0.51 | 0.52 | 0.52 |
| leukemia-haslinger | 0.35 | 0.35 | 0.4 | 0.52 |
| molecular-promoters | 0.4 | 0.4 | 0.42 | 0.48 |
| monk1 | 0.39 | 0.39 | 0.42 | 0.49 |
| monk2 | 0.41 | 0.41 | 0.44 | 0.37 |
| movement-libras | 0.3 | 0.52 | 0.55 | 0.43 |
| newthyroid | 0.76 | 0.76 | 0.79 | 0.82 |
| page-blocks | 0.79 | 0.79 | 0.84 | 0.89 |
| parkinsons | 0.57 | 0.57 | 0.61 | 0.61 |
| phoneme | 0.53 | 0.53 | 0.55 | 0.55 |
| pima | 0.31 | 0.31 | 0.35 | 0.4 |
| ringnorm | 0.3 | 0.3 | 0.27 | 0.09 |
| saheart | 0.26 | 0.26 | 0.28 | 0.34 |
| segmentation | 0.76 | 0.76 | 0.81 | 0.84 |
| spectf | 0.43 | 0.43 | 0.45 | 0.5 |
| statlog-german | 0.26 | 0.26 | 0.29 | 0.37 |
| statlog-heart | 0.42 | 0.42 | 0.45 | 0.39 |
| tae | 0.34 | 0.34 | 0.36 | 0.31 |
| tic-tac-toe | 0.6 | 0.6 | 0.68 | 0.48 |
| titanic | 0.4 | 0.4 | 0.4 | 0.29 |
| vehicle | 0.45 | 0.45 | 0.49 | 0.5 |
| vowel | 0.31 | 0.83 | 0.87 | 0.62 |
| waveform-5000 | 0.42 | 0.42 | 0.46 | 0.57 |
| wdbc | 0.59 | 0.59 | 0.67 | 0.79 |
| wine | 0.3 | 0.43 | 0.47 | 0.51 |
| wine-quality | 0.72 | 0.72 | 0.78 | 0.76 |
| yeast | 0.41 | 0.38 | 0.42 | 0.51 |
| zoo | 0.49 | 0.86 | 0.91 | 0.74 |


The number of times each filter performed best in noise identification, over all noise rates and all datasets, is presented in Fig. 5. The best performance was obtained by GraphNN, followed by RENN and AENN. The ENN filter showed the worst performance.

7. Conclusion

This work investigated how label noise affects the complexity of classification tasks, by monitoring the values of simple measures extracted from datasets with increasing noise levels. Some of these measures had already been used in the literature for understanding and analyzing the complexity of classification tasks. Some other measures, based on modeling the datasets as graphs, were introduced in this paper.

Experimentally, measures able to capture characteristics like separability of the classes, alterations in the class boundary and densities within the classes were the ones most affected by the introduction of label noise in the data. Therefore, they are good candidates for further exploitation and to support the design of new noise identification techniques and noise-tolerant classification algorithms. Moreover, experimental results showed a low correlation between the basic complexity measures and the graph-based measures, stressing the relevance of exploring different views and representations of the data structure. Using two of the most prominent measures from these analyses, we devised a new noise filtering technique, which has shown promising results in noise identification.

The graph-based measures Edges, Degree and Density were highlighted in all the analyses carried out. This may have occurred because, when label noise is introduced, examples from distinct classes become closer to each other and are not connected in the graph. Among the standard data complexity measures, those that rely on nearest neighbor information, such as N1 and N3, were also better able to capture the effects of noise imputation. This is also due to the fact that label noise tends to affect the spatial proximity of data from different classes. Thus, the idea that data from the same class tend to be close to each other in the feature space, while far from examples from different classes, is reinforced.

Our experimental protocol and graph-based measures can also be used in other types of analysis, such as verifying the effects of data imbalance, feature selection and feature discretization, among others. It is also possible to use other combinations of measures to devise new preprocessing filters. We also plan to employ feature selection strategies to identify the measures best able to characterize noisy data. It would also be interesting to investigate how the graph-based measures are affected by the choice of the ϵ parameter used to build the graph. Finally, we plan to use some of the highlighted measures to develop new noise-tolerant algorithms and to compare GraphNN with other up-to-date noise filters. We created a webpage with additional information regarding the experimental results, the codes used for the experiments and the results for some artificial datasets. It can be found at http://lpfgarcia.github.io/rcorr/.

Acknowledgments

The authors would like to thank FAPESP, CNPq and CAPES for their financial support.

[Fig. 5: frequency of wins of each filter (AENN, ENN, GraphNN, RENN); x-axis: filters; y-axis: frequency of wins, from 0.0 to 0.5.]

References

[1] J.R. Quinlan, The effect of noise on concept learning, in: R.S.I. Michalski, J.G. Carboneel, Mitchell (Eds.), Machine Learning, Morgan Kaufmann Publishers Inc., 1986, pp. 149–166.

[2] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, Knowledge discovery and data mining: towards a unifying framework, in: E. Simoudis, J. Han, U.M. Fayyad (Eds.), KDD, AAAI Press, 1996, pp. 82–88.

[3] B. Sluban, D. Gamberger, N. Lavrac, Ensemble-based noise detection: noise ranking and visual performance evaluation, Data Mining Knowl. Discov. 28 (2) (2014) 265–303. http://dx.doi.org/10.1007/s10618-012-0299-1.

[4] I. Tomek, An experiment with the edited nearest-neighbor rule, IEEE Trans. Syst. Man Cybern. 6 (6) (1976) 448–452. http://dx.doi.org/10.1109/TSMC.1976.4309523.

[5] C.E. Brodley, M.A. Friedl, Identifying and eliminating mislabeled training instances, in: W.J. Clancey, D.S. Weld (Eds.), AAAI/IAAI, vol. 1, AAAI Press/The MIT Press, 1996, pp. 799–805.

[6] S. Verbaeten, A.V. Assche, Ensemble methods for noise elimination in classification problems, in: T. Windeatt, F. Roli (Eds.), Multiple Classifier Systems, Lecture Notes in Computer Science, vol. 2709, Springer, Berlin, Heidelberg, 2003, pp. 317–325. http://dx.doi.org/10.1007/3-540-44938-8_32.

[7] B. Sluban, D. Gamberger, N. Lavrac, Advances in class noise detection, in: H. Coelho, R. Studer, M. Wooldridge (Eds.), ECAI, Frontiers in Artificial Intelligence and Applications, vol. 215, IOS Press, 2010, pp. 1105–1106.

[8] L.P.F. Garcia, A.C. Lorena, A.C.P.L.F. Carvalho, A study on class noise detection and elimination, in: A.C. Lorena, C.E. Thomaz, A.T.R. Pozo (Eds.), 2012 Brazilian Symposium on Neural Networks (SBRN), IEEE, 2012, pp. 13–18. http://dx.doi.org/10.1109/SBRN.2012.49.

[9] M. Smith, T. Martinez, C. Giraud-Carrier, An instance level analysis of data complexity, Mach. Learn. 95 (2) (2014) 225–256. http://dx.doi.org/10.1007/s10994-013-5422-z.

[10] B. Frenay, M. Verleysen, Classification in the presence of label noise: a survey, IEEE Trans. Neural Netw. Learning Syst. 99 (2015) 1–25. http://dx.doi.org/10.1109/TNNLS.2013.2292894.

[11] T.K. Ho, M. Basu, Complexity measures of supervised classification problems, IEEE Trans. Pattern Anal. Mach. Intell. 24 (3) (2002) 289–300. http://dx.doi.org/10.1109/34.990132.

[12] G. Morais, R.C. Prati, Complex network measures for data set characterization, in: 2013 Brazilian Conference on Intelligent Systems (BRACIS), 2013, pp. 12–18. http://dx.doi.org/10.1109/BRACIS.2013.11.

[13] L.F. Costa, F.A. Rodrigues, G. Travieso, P.R.V. Boas, Characterization of complex networks: a survey of measurements, Adv. Phys. 56 (2008) 167–242.

[14] E. Kolaczyk, Statistical Analysis of Network Data: Methods and Models, in: Springer Series in Statistics, Springer, 2009.

[15] X. Zhu, X. Wu, Class noise vs. attribute noise: a quantitative study, Artif. Intell. Rev. 22 (3) (2004) 177–210. http://dx.doi.org/10.1007/s10462-004-0751-8.

[16] J.R. Quinlan, Induction of decision trees, Mach. Learn. 1 (1) (1986) 81–106. http://dx.doi.org/10.1007/BF00116251.

[17] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, Inc., 1995.

[18] E. Eskin, Detecting errors within a corpus using anomaly detection, in: Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, NAACL 2000, Association for Computational Linguistics, 2000, pp. 148–153.

[19] A. Ganapathiraju, J. Picone, Support vector machines for automatic data cleanup, in: INTERSPEECH, ISCA, 2000, pp. 210–213.

[20] L. Li, Y.S. Abu-Mostafa, Data Complexity in Machine Learning, Technical Report CaltechCSTR:2006.004, Caltech Computer Science, 2006.

[21] T.K. Ho, Data complexity analysis: linkage between context and solution in classification, in: Structural, Syntactic, and Statistical Pattern Recognition, vol. 5342 of Lecture Notes in Computer Science, 2008, pp. 986–995. http://dx.doi.org/10.1007/978-3-540-89689-0_102.

[22] S. Singh, Prism: a novel framework for pattern recognition, Pattern Anal. Appl. 6 (2) (2003) 134–149. http://dx.doi.org/10.1007/s10044-002-0186-2.

[23] J.A. Sáez, J. Luengo, F. Herrera, Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification, Pattern Recognit. 46 (1) (2013) 355–364. http://dx.doi.org/10.1016/j.patcog.2012.07.009.

[24] L.P.F. Garcia, A.C. Carvalho, A.C. Lorena, Noisy data set identification, in: J.-S. Pan, M. Polycarpou, M. Wozniak, A. Carvalho, H. Quintián, E. Corchado (Eds.), Hybrid Artificial Intelligent Systems, Lecture Notes in Computer Science, vol. 8073, Springer, Berlin, Heidelberg, 2013, pp. 629–638. http://dx.doi.org/10.1007/978-3-642-40846-5_63.

[25] R.A. Mollineda, J.S. Sánchez, J.M. Sotoca, Data characterization for effective prototype selection, in: J.S. Marques, N.P. de la Blanca, P. Pina (Eds.), Pattern Recognition and Image Analysis, Lecture Notes in Computer Science, vol. 3523, Springer, Berlin, Heidelberg, 2005, pp. 27–34. http://dx.doi.org/10.1007/11492542_4.

[26] A. Orriols-Puig, N. Maciá, T.K. Ho, Documentation for the Data Complexity Library in C++, Technical Report, La Salle – Universitat Ramon Llull, 2010.

[27] N. Ganguly, A. Deutsch, A. Mukherjee, Dynamics on and of Complex Networks: Applications to Biology, Computer Science, and the Social Sciences, Modeling and Simulation in Science, Engineering and Technology, Birkhäuser, Boston, 2009.

[28] X. Zhu, J. Lafferty, R. Rosenfeld, Semi-Supervised Learning with Graphs (Ph.D. Thesis), Carnegie Mellon University, Language Technologies Institute, School of Computer Science, 2005.

[29] D.R. Amancio, C.H. Comin, D. Casanova, G. Travieso, O.M. Bruno, F.A. Rodrigues, L. da F. Costa, A systematic comparison of supervised classifiers, PLoS ONE 9 (4) (2014) e94137. http://dx.doi.org/10.1371/journal.pone.0094137.

[30] K. Bache, M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, 2013.

[31] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, Mult.-Valued Logic Soft Comput. 17 (2–3) (2011) 255–287.

[32] N. Maciá, E. Bernadó-Mansilla, Towards UCI+: a mindful repository design, Inf. Sci. 261 (2014) 237–262. http://dx.doi.org/10.1016/j.ins.2013.08.059.

[33] C.-M. Teng, Correcting noisy data, in: I. Bratko, S. Dzeroski (Eds.), ICML, Morgan Kaufmann, 1999, pp. 239–248.

[34] C. Giraud-Carrier, T. Martinez, An Efficient Metric for Heterogeneous Inductive Learning Applications in the Attribute-Value Language, Technical Report, University of Bristol, Bristol, UK, 1995.

[35] G. Csardi, T. Nepusz, The Igraph software package for complex network research, InterJ. Complex Syst. 34 (2006) 695. URL http://igraph.sf.net.

[36] D.R. Wilson, T.R. Martinez, Reduction techniques for instance-based learning algorithms, Mach. Learn. 38 (3) (2000) 257–286.

[37] D.L. Wilson, Asymptotic properties of nearest neighbor rules using edited data, IEEE Trans. Syst. Man Cybern. 2 (3) (1972) 408–421.

Luís P.F. Garcia received the B.Sc. degree in Computer Engineering in 2010, from University of São Paulo, Brazil. He is currently a Ph.D. candidate at the Institute of Computer Sciences and Mathematics, University of São Paulo, Brazil. His current research interests include Machine Learning, Data Mining, Meta-learning and Data Preprocessing.

André C.P.L.F. de Carvalho is a Full-time Professor in the Department of Computer Science, University of São Paulo, Brazil, where he was head of Department from 2008 to 2010. He received his B.Sc. and M.Sc. degrees in Computer Science from the Universidade Federal de Pernambuco, Brazil. He received his Ph.D. degree in Electronic Engineering from the University of Kent, UK. He has published around 80 Journal and 200 Conference refereed papers. He has been involved in the organization of several conferences and journal special issues. His main interests are Machine Learning, Data Mining and Hybrid Intelligent Systems. He is on the editorial board of several journals and was a member of the Brazilian Computing Society, SBC, Council. He was the editor of the SBC/Elsevier textbook series until July 2011. Prof. de Carvalho is the Director of the Research Center for Machine Learning in Data Analysis of the Universidade de Sao Paulo, Brazil.

Ana C. Lorena has bachelor (2001) and Ph.D. (2006) degrees in Computer Science from the University of São Paulo – São Carlos. She is currently an Assistant Professor at the Federal University of São Paulo. She has experience in Computer Science, working mainly on the following topics: data mining, supervised machine learning, hybrid intelligent systems and support vector machines.