Optimization of Insert-Tray Matching using Machine Learning


UPTEC F 21056

Degree project, 30 credits, September 2021


Karolina Hedberg

Master's Programme in Engineering Physics


Faculty of Science and Technology, Uppsala University. Place of publication: Uppsala/Visby


Abstract

The manufacturing process of carbide inserts at Sandvik Coromant consists of several operations. During some of these, the inserts are positioned on trays. For some inserts the trays are pre-defined but for others the insert-tray matching is partly improvised. The goal of this thesis project is to examine whether machine learning can be used to predict which tray to use for a given insert. It is also investigated which insert features are determining for the choice of tray. The study is done with insert and tray data from four blasting operations and considers a set of standardized inserts since it is assumed that the tray matching for these is well tuned.

The algorithm that is used for the predictions is the supervised learning algorithm k-nearest neighbors. The problem of identifying the determining features is regarded as a feature selection problem and is done with the ReliefF algorithm. From the classification results it is seen that the classifiers are overfitting. The main reason for this is probably that the datasets contain features that together are uniquely defining for which tray is used. This was not detected during the feature selection since ReliefF identifies features that are individually relevant to the output. An idea to avoid overfitting the classifiers is to exclude these defining features from the dataset. Further work is thus recommended.

Faculty of Science and Technology, Uppsala University. Place of publication: Uppsala/Visby. Supervisor: Alvin Ljung. Subject reader: Per Mattsson. Examiner: Tomas Nyberg


Popular Science Summary

Since the concept of machine learning was introduced in the mid-1900s, interest in different types of learning algorithms has grown. Today machine learning is applied in several areas, such as banking and healthcare. The use of machine learning has also spread to the manufacturing industry, where artificial intelligence, an umbrella term that includes machine learning, is said to be part of the new generation of industry. This degree project was carried out in collaboration with Sandvik Coromant, and its goal was to investigate whether machine learning can be used to match inserts and trays in their production.

Sandvik Coromant is a manufacturer of inserts and cutting tools for different types of metalworking, such as turning, drilling and milling. During parts of the manufacturing process the inserts are placed on a load carrier, here called a tray, which can be likened to a serving tray with room for several inserts. To keep track of which tray should be used for a certain insert, this information is registered in a dataset.

This is not the case for all inserts, however, and for new inserts the choice of tray is currently partly improvised. The goal of this degree project was to identify, from a given dataset of inserts and trays, the insert properties that appear to determine the choice of tray, and to investigate whether machine learning can be used to optimize the choice of tray for a new insert.

The datasets used for this consisted of input values describing different properties of the inserts and output values describing which tray is used for each insert at a specific operation in the manufacturing.

The ReliefF algorithm was used to identify the determining properties. It identifies the individual properties, or input variables, in a dataset that have an effect on the output variable, in this case the tray.

The insert properties identified as relevant by ReliefF were then used to train a machine learning algorithm. In this project the supervised learning algorithm k-nearest neighbors was used. Like other supervised machine learning algorithms, it tries to generalize the relation between input and output variables from a given dataset. The aim is that the generalized model can be used to predict the output value for a new set of input values.

Experiments with the models trained in the project show the presence of overfitting, that is, the models have failed to generalize the information in the dataset. As a consequence, they cannot be used to determine trays for new inserts. The overfitting probably arises because the trays can be uniquely determined from a few properties in the dataset, which is a consequence of how tray data is stored and retrieved. This relation between a set of properties and the choice of tray also affects the interpretation of the concept of determining properties, since the properties in the dataset that uniquely define the choice of tray should intuitively be regarded as determining.

As a continuation of the project, it is recommended that the machine learning algorithm is retrained on a dataset from which these defining properties have been excluded, in the hope that the algorithm can generalize based on some of the other properties. Furthermore, experimentation with other types


Acknowledgements

I want to express my gratitude to my supervisor Alvin Ljung at Sandvik Coromant for guiding me throughout this project. Many thanks also to the other personnel at Sandvik Coromant who have contributed help and guidance in different ways. Lastly, I want to thank my subject reader Per Mattsson at Uppsala University for his good counsel and for reviewing this thesis.


1 Introduction

This master's thesis project is conducted at Sandvik Coromant in Gimo, a supplier of metal-cutting tools and tooling systems within the Sandvik industry group. At the factory in Gimo, Sandvik Coromant manufactures carbide inserts and tools used in the metalworking industry. The assortment includes inserts and tools for various types of metal-cutting, such as turning, drilling and milling. The carbide inserts are the cutting part of the tools and are made of cemented carbide. The inserts are replaceable, so they can be exchanged when they are worn out. The range of inserts manufactured by Sandvik Coromant can be divided into two sets, a set of standard articles and a set of special articles. The standard articles include inserts that are standardized by the International Organization for Standardization (ISO) and are usually in frequent production. The special articles can instead be inserts that are tailored to the requirements of a specific customer and are sometimes produced in very limited numbers.

The manufacturing process of carbide inserts at Sandvik Coromant is described in the video [1]. The carbide inserts are made of a combination of tungsten carbide (80%) and a metal matrix, which most commonly consists mainly of cobalt (20%). When producing an insert, the ingredients are first milled to the correct grain size together with a mixture of water, ethanol and an organic binder, creating a grey slurry. The slurry is then spray dried into a powder that is pressed into molds before the inserts are hardened in a sintering oven. After the sintering, the inserts are processed through different grinding and blasting operations in order to attain the wanted size, geometry and tolerance. Most inserts are also coated, either by physical vapor deposition (PVD) or by chemical vapor deposition (CVD).

During parts of the manufacturing process the inserts are positioned on trays. These trays vary both for different inserts and for different operations of manufacturing. For standard articles, the selection of trays is based on an empirically constructed dataset which contains suitable trays for each insert and operation. For the special articles however, the selection of trays is to some extent improvised. This improvised matching of inserts and trays is time consuming and may lead to a sub-optimal match which can cause inserts to fall off the trays during the different operations.

The hypothesis behind this project is that the matching between inserts and trays can be considered well-tuned for some families of inserts, in particular for ISO-articles with large production volumes. The idea is thus to train a machine learning model based on insert-tray data for a set of ISO-articles and examine whether the model manages to generalize such that it can be used to select trays for ISO-articles not included in the original dataset. The long-term goal for Sandvik Coromant is to examine whether it would be possible to use a machine learning model in order to predict suitable trays for special articles.

1.1 Objective

The objective of this project is to answer the following questions:

1. What insert features are determining for tray-matching?


2. Can we make good predictions of which trays should hold a given insert at a specific operation, if the model is trained on existing data for standard articles?

1.2 Limitations

The project is restricted to consider ISO-standardized inserts from the i700- and i701-families. The corresponding tray data is collected for four different operations of manufacturing. The chosen operations are the blasting operations BDOG 3G CLEAN, BULLDOG 3G ER, BDOG CLEAN and BULLDOG ER, which are performed for edge reinforcement, that is, to ensure that the edge of the inserts has the desired properties. Since the trays for these operations differ depending on which coating is used, it was decided to collect tray data assuming that the insert would be coated by PVD.

2 Theory

This section presents the basic theory of machine learning, with a focus towards supervised learning and classification, and describes the different parts of the machine learning process. It also contains detailed descriptions of algorithms for feature selection and supervised learning.

2.1 Machine Learning

Machine learning is a multidisciplinary field based on concepts from fields including statistics, artificial intelligence and information theory. The general problem of machine learning is to build computer programs that improve their performance automatically based on experience. This improvement by experience is what is referred to as learning. More formally, a computer program is said to learn if its performance at a task, evaluated with a performance measure, improves with experience [2].

Given this broad definition, the topic of machine learning includes several different types of algorithms, each with the goal to learn something from a set of data. Three types that are commonly mentioned are supervised learning, unsupervised learning and reinforcement learning [3]. Supervised learning includes learning algorithms that, given a dataset of inputs and the corresponding outputs, generalize the relation between these in order to predict the correct output value for each new combination of input values. In contrast, unsupervised learning algorithms are provided with a dataset of inputs only. The goal of these algorithms is to categorize the instances in the dataset based on their similarities. Reinforcement learning can be considered as somewhere between supervised and unsupervised learning. The algorithms get information about whether they provided a correct answer, but they do not get any guidance on how to improve. Instead, the algorithms have to try different strategies to find a way to get the answer correct [3].


2.1.1 Supervised Learning

Supervised learning requires a set of training examples which contains the correct output value $y_i$, $i = 1, 2, \ldots, n$, for each vector of input values $\bar{x}_i$ [4]. An algorithm is trained on the dataset to relate the output values to the values of the inputs [4]. The goal is to get the resulting model to generalize the information in the training set such that it can be used to predict the output values of a new set of input vectors [3]. An example of a dataset that can be used for supervised learning is a set of pictures with labels that describe what each picture represents. By applying a supervised learning algorithm to the dataset, it is possible to train a model that predicts the labels for new pictures. An algorithm of supervised learning is presented in Section 2.5.

2.1.2 Classification

The task of labeling pictures described in Section 2.1.1 is also an example of a classification problem. Classification is the task of assigning a qualitative output value $y_i$, often referred to as a class, to a vector of input values $\bar{x}_i$. This is as opposed to regression, where the assigned output value $y_i$ is quantitative. The different algorithms used for classification are commonly called classifiers. For supervised learning, a classifier is trained on a set of labeled training instances $(\bar{x}_1, y_1), \ldots, (\bar{x}_n, y_n)$. The goal is then to train a classifier that can perform well on a set of unseen test data, that is, correctly predict the classes for a set of unlabeled data not included in the training set [4].

If the number of classes for an output in the dataset is larger than two, the output is said to be non-binary. The classification problem is then a multi-class problem. If the considered dataset contains multiple non-binary outputs, the classification problem is a multi-output classification (MOC) problem. A naive solution to such problems is to train an independent classifier for each output, thus transforming the problem into several multi-class problems. By combining the predictions from each classifier, a complete prediction can be achieved. A drawback of this method is that it implicitly assumes independence between the outputs, which is not necessarily the case [5].

2.2 The Machine Learning Process

In [3], the task of applying machine learning to a dataset is roughly divided into six steps: Data collection and preparation, Feature selection, Algorithm choice, Parameter and model selection, Training and Evaluation. Each step is briefly described below.

The step of Data collection and preparation includes the collection of a relevant dataset and preparation of the dataset such that it can be analyzed more effectively [3]. Some common steps of data preparation are described in Section 2.3.

Before collecting the data, the size of the dataset should be considered. For a machine learning algorithm to perform well a considerable amount of data is needed. However, a large dataset also increases the computational costs. The problem is thus often to find a dataset that is sufficiently large without causing the algorithm to be excessively computationally expensive [3].

The Feature selection is made to identify the inputs or features that are presumed to be useful for the considered problem. This step requires some knowledge both of the available data and of the problem at hand [3]. The problem of feature selection in a pre-defined dataset is described further in Section 2.4.

Given the dataset with the chosen features, the next step is the Algorithm choice. This means to select a machine learning algorithm that is suitable both for the given dataset and for the considered problem [3]. For a given problem, there are often a number of different machine learning algorithms that can be found suitable.

When an algorithm has been chosen, the next part of the process is Parameter and model selection. Many machine learning algorithms have hyperparameters that have to be set manually or based on experimentation before training the model [3]. How the hyperparameters can be tuned to the data is described in Section 2.6.3.

In the Training step, the chosen machine learning algorithm, with its tuned parameters, is applied to the dataset [3]. The training is usually performed on a subset of instances in the dataset called the training set. The data not used for training is usually referred to as the test set and is used to evaluate the model.

The last step is the Evaluation. This means testing the model and evaluating its accuracy [3]. For supervised learning, the model can be evaluated on the test set by comparing the output values predicted by the model with the true values of the outputs.

2.3 Data Preparation

As mentioned in Section 2.2, the first step after the data collection is to prepare the data such that it is suitable for the purpose of machine learning. This includes removing erroneous datapoints and datapoints with missing input values [3]. Two other methods of data preparation are normalization and one-hot encoding, which are described in Section 2.3.1 and Section 2.3.2.

2.3.1 Normalization

When handling a dataset with numerical inputs it can be useful to normalize the input values into a fixed interval, such that the values of all inputs are in the same range. A common choice is to normalize the inputs in an interval between zero and one [6]. For an input value x its normalized value is then computed as

$$\mathrm{normalized}(x) = \frac{x - \min(x)}{\max(x) - \min(x)}, \tag{1}$$

where min(x) and max(x) are the minimum and maximum values of the input respectively.
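As a minimal sketch of Eq. (1), the function below scales one input column to the interval between zero and one. The function name, the example values and the handling of a constant column are assumptions made for illustration and are not taken from the thesis.

```python
def min_max_normalize(values):
    """Scale a list of numbers to [0, 1] following Eq. (1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant input carries no information, so map it to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical values of one numerical insert feature
feature_values = [2.0, 4.0, 6.0, 10.0]
normalized = min_max_normalize(feature_values)
print(normalized)
```

The minimum and maximum would normally be computed on the training data only and then reused for new instances, so that the test data does not influence the preparation step.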


2.3.2 One-hot Encoding

Many learning algorithms require the inputs in the dataset to be numerical. Any categorical inputs thus have to be transformed to numerical before the learning algorithm is applied. One common method of performing this transformation is to encode each category of the original categorical input as a new binary input, a method that is commonly known as one-hot encoding [7]. Consider the example with a categorical input Shape which is described by the categories {circular, quadratic, triangular}. An example set of data for this input is displayed in Figure 1. After one-hot encoding the original input Shape is transformed into three new inputs, one for each of the three categories. For an instance described as circular in the original dataset, the value of the new input circular is set to one in the one-hot encoded data while the two other input values, quadratic and triangular, are set to zero. Similarly, if an instance is described as quadratic, the new input quadratic is set to one while the two others are set to zero.

Figure 1: Example data for an input Shape before and after one-hot encoding.
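The transformation of the input Shape can be sketched in a few lines of Python. The helper name and the example rows are illustrative assumptions; only the three categories come from the text.

```python
def one_hot_encode(values, categories):
    """Return one dict per value with a binary entry for each category."""
    encoded = []
    for v in values:
        if v not in categories:
            raise ValueError(f"unknown category: {v!r}")
        # Exactly one entry is 1: the one matching the original category
        encoded.append({c: int(c == v) for c in categories})
    return encoded


categories = ["circular", "quadratic", "triangular"]
shapes = ["circular", "quadratic", "circular", "triangular"]  # illustrative data
rows = one_hot_encode(shapes, categories)
for shape, row in zip(shapes, rows):
    print(shape, row)
```

Each encoded row sums to one, which reflects that every instance belongs to exactly one category of the original input.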

2.4 Feature Selection

The inputs in a machine learning dataset can also be referred to as features [4]. As mentioned in Section 2.2, the process of choosing useful inputs for a machine learning problem is therefore known as feature selection. With feature selection, the number of irrelevant or redundant features in a given dataset is reduced. The potential benefits of this include cheaper computations, an improved prediction accuracy and better interpretability of the trained model [8].

There are several different methods of feature selection available. In [9], the methods are divided into filters and wrappers. Filters include feature selection algorithms that are applied as a pre-process of the data and thus perform the feature selection independently of any machine learning algorithm. Wrappers instead operate by evaluating different feature subsets by the performance of a machine learning algorithm. These methods are thus tuned for a specific algorithm and aim to choose features such that the prediction accuracy of that algorithm is optimized [9]. A third category of feature selection methods is embedded methods. These can be considered as a combination of filters and wrappers and, like wrappers, depend on a certain machine learning algorithm. The difference, however, is that embedded methods perform the feature selection during the construction of the model [8].


2.4.1 Relief

Relief [10] is a filter method inspired by instance-based learning that aims to select the statistically relevant features in a dataset. The idea of Relief is to compute the relevance or weight of each feature based on whether its feature values can distinguish the classes of nearby instances [11]. An intuitive interpretation of this idea can be obtained by considering a simple dataset as visualized in Figure 2. The dataset contains four instances divided into two classes represented by black circles and white squares. Each instance is described by the features $x_1$ and $x_2$. By observing the feature $x_1$ it is noticed that its value is similar for instances belonging to the same class. On the contrary, the feature $x_2$ has different values for instances of the same class. Thus, according to the Relief algorithm, the feature $x_1$ can be considered relevant to which class an instance belongs to while the feature $x_2$ is irrelevant.

Figure 2: Example dataset with four instances described by the features $x_1$ and $x_2$, and classified either as a black circle or a white square.

As in the example above, the original Relief algorithm [10] is limited to two-class classification problems. The feature weights are computed using a randomly chosen training instance and its corresponding near-hit and near-miss instances. These are defined as instances in the near neighborhood of the training instance, belonging to the same and the opposite class as the training instance respectively [10]. The distance between the instances can be computed using either the Euclidean or the Manhattan distance since these have proven to give similar results [12]. Algorithm 1 describes the Relief algorithm in pseudo-code. In this, each training instance $R_i$ is represented by a $p$-dimensional vector of feature values. The feature weights are initialized as a zero vector and are updated iteratively for different training instances $R_i$. The selection of training instances is done without replacement and the number of iterations $m$ is thus limited by the number of instances in the dataset. Since a larger $m$ gives more reliable weightings, a reasonable choice is to iterate over the whole set of training instances [11].


Algorithm 1: Relief

    Initiate weights W = (0, 0, ..., 0)
    for i = 1 to m do
        Select random training instance R_i
        Find near-hit instance NH and near-miss instance NM
        for A = 1 to p do
            $W[A] = W[A] - \mathrm{diff}(A, R_i, NH)^2/m + \mathrm{diff}(A, R_i, NM)^2/m$
        end for
    end for

The diff-function in the algorithm computes the difference between the values of a feature $A$ for two instances $I_1$ and $I_2$. It is defined as

$$\mathrm{diff}(A, I_1, I_2) = \begin{cases} 0, & \text{if } \mathrm{value}(A, I_1) \text{ and } \mathrm{value}(A, I_2) \text{ are the same} \\ 1, & \text{if } \mathrm{value}(A, I_1) \text{ and } \mathrm{value}(A, I_2) \text{ are different} \end{cases} \tag{2}$$

if the values of $A$ for $I_1$ and $I_2$ are nominal, and as

$$\mathrm{diff}(A, I_1, I_2) = \frac{|\mathrm{value}(A, I_1) - \mathrm{value}(A, I_2)|}{\max(A) - \min(A)}, \tag{3}$$

if they are numerical [12]. When using the Relief algorithm with the diff-functions as defined in (2) and (3), it should be assumed that all features in the dataset are either nominal or numerical [10]. Otherwise, the numerical features might be underestimated [12].
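Algorithm 1 together with the numerical diff-function in (3) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in the project: all features are numerical, there are exactly two classes with at least two instances each, the Manhattan distance on range-scaled features is used to find neighbors, and the toy data only imitates the situation in Figure 2.

```python
import random


def diff(a, inst1, inst2, ranges):
    """Numerical diff from Eq. (3): absolute difference scaled by the feature range."""
    lo, hi = ranges[a]
    if hi == lo:
        return 0.0
    return abs(inst1[a] - inst2[a]) / (hi - lo)


def relief(X, y, m=None, seed=0):
    """Relief feature weights for a two-class dataset with numerical features."""
    n, p = len(X), len(X[0])
    m = n if m is None else m
    ranges = [(min(row[a] for row in X), max(row[a] for row in X)) for a in range(p)]

    def distance(i, j):
        # Manhattan distance in the range-scaled feature space
        return sum(diff(a, X[i], X[j], ranges) for a in range(p))

    weights = [0.0] * p
    rng = random.Random(seed)
    # Instances are drawn without replacement, so m may not exceed n
    for i in rng.sample(range(n), m):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        near_hit = min(hits, key=lambda j: distance(i, j))
        near_miss = min(misses, key=lambda j: distance(i, j))
        for a in range(p):
            weights[a] += (diff(a, X[i], X[near_miss], ranges) ** 2
                           - diff(a, X[i], X[near_hit], ranges) ** 2) / m
    return weights


# Toy data in the spirit of Figure 2: x1 separates the classes, x2 does not
X = [[0.0, 0.0], [0.1, 1.0], [0.9, 0.1], [1.0, 0.9]]
y = ["circle", "circle", "square", "square"]
weights = relief(X, y)
print(weights)
```

On this toy data the weight of the first feature is positive and the weight of the second is negative, which matches the intuition that only $x_1$ distinguishes the classes.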

The weights computed by the Relief algorithm are normalized to the interval $[-1, 1]$ [11]. To determine whether a feature is relevant or not, its final weight is compared to a threshold $\tau$. All features with a weight above the threshold are selected as relevant [10]. In [13] it is concluded that statistically, the weight of a relevant feature is expected to be positive while the weight of an irrelevant feature is expected to be zero or negative. It is also shown that by use of Chebyshev's inequality, the value of $\tau$ can be restricted further as $0 < \tau \leq 1/\sqrt{\alpha m}$, where $\alpha$ is the probability of choosing an irrelevant feature as relevant and $m$ is the number of iterations. However, the value of $\tau$ can also be determined by inspection [13].
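To give a feeling for the size of this bound, the short sketch below evaluates the upper limit on the threshold for one choice of the two parameters. The function name and the numerical values are illustrative assumptions and are not taken from the thesis.

```python
import math


def tau_upper_bound(alpha, m):
    """Upper bound 1 / sqrt(alpha * m) on the relevance threshold tau."""
    if not 0 < alpha <= 1 or m <= 0:
        raise ValueError("alpha must lie in (0, 1] and m must be positive")
    return 1.0 / math.sqrt(alpha * m)


# Illustrative values: 5% accepted risk of selecting an irrelevant feature, 500 iterations
bound = tau_upper_bound(alpha=0.05, m=500)
print(round(bound, 3))
```

The bound shrinks as the number of iterations grows, so a longer run of the algorithm allows a stricter separation between relevant and irrelevant features.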

In [10] it is noted that a drawback with the Relief algorithm is that it fails to identify redundant features. Consider a dataset where two of the features are identical, such that their values are the same for each instance. If the features are relevant to the output, both of them will be selected by the Relief algorithm although only one of them would be sufficient to describe their impact on the input-output behavior. This means that the algorithm does not necessarily find the smallest possible subset of features.

2.4.2 ReliefF

ReliefF is an extension of the Relief algorithm to handle multi-class problems and datasets with incomplete and noisy data. The method was introduced by Kononenko in [11] and was further analyzed by Robnik-Šikonja and Kononenko in [12]. Instead of finding one near-hit and one near-miss for each training instance, the ReliefF algorithm finds the k nearest hits and the k nearest misses from each different class. The averaged contribution of all near hits and misses is then used to update the feature weights. The use of k neighbors to update the weights makes the algorithm less sensitive to noise in the feature values compared to the original Relief [12]. The ReliefF algorithm is described in pseudo-code in Algorithm 2.

Algorithm 2: ReliefF

    Initiate weights W = (0, 0, ..., 0)
    for i = 1 to m do
        Select random training instance R_i
        Find k nearest hits H_j
        for class C ≠ class(R_i) do
            Find k nearest misses M_j(C)
        end for
        for A = 1 to p do
            $W[A] = W[A] - \sum_{j=1}^{k} \mathrm{diff}(A, R_i, H_j)/(m \cdot k) + \sum_{C \neq \mathrm{class}(R_i)} \left[ \frac{P(C)}{1 - P(\mathrm{class}(R_i))} \sum_{j=1}^{k} \mathrm{diff}(A, R_i, M_j(C)) \right] / (m \cdot k)$
        end for
    end for

In order for ReliefF to handle incomplete datasets, the diff-function in (2) and (3) is altered such that

$$\mathrm{diff}(A, I_1, I_2) = 1 - P(\mathrm{value}(A, I_2) \mid \mathrm{class}(I_1)), \tag{4}$$

if one instance $I_1$ has an unknown value and

$$\mathrm{diff}(A, I_1, I_2) = 1 - \sum_{V}^{\#\mathrm{values}(A)} \left( P(V \mid \mathrm{class}(I_1)) \times P(V \mid \mathrm{class}(I_2)) \right), \tag{5}$$

if both instances $I_1$ and $I_2$ have unknown values [11].
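The multi-class weight update of Algorithm 2 can be sketched in plain Python as below. The sketch makes several simplifying assumptions that are not part of the thesis: the data is complete, so the diff-functions in (4) and (5) are not needed, all features are numerical, neighbors are ranked by the Manhattan distance on range-scaled features, class priors are estimated as relative class frequencies, and the three-class toy data is invented for illustration.

```python
import random
from collections import Counter


def relieff(X, y, k=2, m=None, seed=0):
    """ReliefF feature weights for complete, numerical data (see Algorithm 2)."""
    n, p = len(X), len(X[0])
    m = n if m is None else m
    spans = []
    for a in range(p):
        column = [row[a] for row in X]
        spans.append((max(column) - min(column)) or 1.0)  # guard constant features

    def diff(a, i, j):
        return abs(X[i][a] - X[j][a]) / spans[a]

    def nearest(i, candidates):
        # k closest candidates under the Manhattan distance of scaled features
        return sorted(candidates, key=lambda j: sum(diff(a, i, j) for a in range(p)))[:k]

    prior = {c: count / n for c, count in Counter(y).items()}
    weights = [0.0] * p
    rng = random.Random(seed)
    for i in rng.sample(range(n), m):
        hits = nearest(i, [j for j in range(n) if j != i and y[j] == y[i]])
        misses = {c: nearest(i, [j for j in range(n) if y[j] == c])
                  for c in prior if c != y[i]}
        for a in range(p):
            hit_term = sum(diff(a, i, j) for j in hits)
            # Misses of each class are weighted by P(C) / (1 - P(class(R_i)))
            miss_term = sum(
                prior[c] / (1.0 - prior[y[i]]) * sum(diff(a, i, j) for j in misses[c])
                for c in misses
            )
            weights[a] += (miss_term - hit_term) / (m * k)
    return weights


# Invented three-class data: feature 0 separates the classes, feature 1 does not
X = [[0.0, 0.2], [0.1, 0.9], [0.05, 0.5],
     [0.5, 0.1], [0.55, 0.8], [0.45, 0.4],
     [1.0, 0.3], [0.95, 1.0], [0.9, 0.6]]
y = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
weights = relieff(X, y, k=2)
print(weights)
```

With this data the first feature receives a positive weight and the second a negative one, so only the first would pass a positive threshold.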

2.5 k-Nearest Neighbors

The k-Nearest Neighbors algorithm (k-NN) is an instance-based method of supervised learning. This means that instead of constructing an explicit model, the learning of the algorithm consists only of storing the training data. When classifying a new instance, an instance-based algorithm examines the relation between this new instance and the previously stored training instances and then uses this to assign a class. This type of learning is sometimes also referred to as lazy learning since the algorithm delays all generalization of the training data until a new instance is to be classified [2]. The k-NN algorithm is considered


For k-NN it is assumed that all instances correspond to a datapoint in an n-dimensional space. When using the algorithm for classification, the class of a new instance is determined as the most common class among its k nearest neighbors in this space [2]. The distance between two instances $x_i$ and $x_j$ is defined as the Euclidean distance. Assuming that each instance $x$ is described by a feature vector $(a_1(x), a_2(x), \ldots, a_n(x))$, the Euclidean distance between $x_i$ and $x_j$ is expressed as

$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left( a_r(x_i) - a_r(x_j) \right)^2}, \tag{6}$$

where $a_r(x_i)$ and $a_r(x_j)$ are the values of the $r$th attribute of the instances $x_i$ and $x_j$ respectively [2].

Figure 3 visualizes an example dataset where the datapoints are described by the features x1 and x2. Each datapoint is also labeled either as a black circle or as a white square. The striped triangle is a new datapoint for the algorithm to classify. By applying the k-NN algorithm with k = 1 it is clear that the new instance will be classified as a white square, since this is the label of the single nearest neighbor. However, by instead choosing k = 3 the new instance will be classified as a black circle, since this is the label of the majority of the three nearest neighbors.

Figure 3: Visualization of how the k-NN algorithm uses the nearest neighbors to classify a new instance, represented by a striped triangle, as either a black circle or a white square.
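The effect of the choice of k can be reproduced with a small sketch of the k-NN classifier based on the distance in Eq. (6). The coordinates below are made up so that the nearest neighbor is a square while two of the three nearest neighbors are circles; they are not read off from Figure 3. Ties between classes are resolved in favor of the class encountered first among the neighbors, which is an assumption of this sketch.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two feature vectors, as in Eq. (6)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_predict(X_train, y_train, x_new, k):
    """Return the most common class among the k nearest training instances."""
    order = sorted(range(len(X_train)), key=lambda i: euclidean(X_train[i], x_new))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Made-up coordinates arranged so that the two choices of k disagree
X_train = [[1.0, 1.0], [1.6, 1.2], [1.2, 1.8], [3.0, 3.0], [3.2, 2.6]]
y_train = ["square", "circle", "circle", "square", "square"]
x_new = [1.2, 1.2]

print(knn_predict(X_train, y_train, x_new, k=1))
print(knn_predict(X_train, y_train, x_new, k=3))
```

Training consists only of keeping `X_train` and `y_train`, which illustrates why the method is described as lazy learning.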

A remark concerning the k-NN algorithm is that since the nearest neighbors of an instance depend on the distance between all its features, the algorithm is sensitive to the presence of irrelevant features in the data. This can be understood by considering a dataset where each instance is described by a set of 20 features, only two of which are relevant for a classification problem. Even though two instances have similar values for the two relevant features they may still be distant depending on the values of the 18 irrelevant features. This problem is referred to as the curse of dimensionality [2].
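The example with 20 features can be illustrated numerically. In the sketch below, two instances are almost identical in the two relevant features, and 18 irrelevant features are filled with synthetic uniform noise. All values are invented for illustration.

```python
import math
import random


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


rng = random.Random(0)
relevant_a = [0.50, 0.50]
relevant_b = [0.52, 0.49]  # nearly identical on the two relevant features

# 18 irrelevant features drawn uniformly from [0, 1] (synthetic noise)
noise_a = [rng.random() for _ in range(18)]
noise_b = [rng.random() for _ in range(18)]

d_relevant = euclidean(relevant_a, relevant_b)
d_all = euclidean(relevant_a + noise_a, relevant_b + noise_b)
print(f"relevant features only: {d_relevant:.3f}, all 20 features: {d_all:.3f}")
```

The distance over all 20 features is dominated by the noise terms, so the two instances no longer appear as close neighbors. This is one motivation for applying feature selection before k-NN.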


2.6 Evaluation

In order to estimate the performance of a machine learning model on new data, the performance of the model has to be evaluated on a set of data that has not been used to create the model. Thus, as mentioned in Section 2.2, the dataset should be divided into a training set and a test set. The training set is used to train the learning algorithm while the test set is used exclusively for evaluation. Evaluating the model on the test data provides an estimate of how well the model will manage to predict the output values on new unseen data. If the model instead was to be evaluated on training data, its performance would be overestimated since the model is expected to provide better predictions on data that have been used for training [6]. For classifiers, the performance on the test data can be measured by the number of correct or erroneous predictions. The proportion of correct predictions is referred to as the success rate [6].
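The success rate is simple to compute once the predicted and true classes of the test instances are available. The sketch below uses hypothetical tray labels that do not come from the project data.

```python
def success_rate(y_true, y_pred):
    """Proportion of predictions that match the true classes."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


y_true = ["tray_A", "tray_B", "tray_A", "tray_C"]  # hypothetical tray labels
y_pred = ["tray_A", "tray_B", "tray_C", "tray_C"]
rate = success_rate(y_true, y_pred)
print(rate)
```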

Provided that the dataset is large and assuming that both sets are representative of the data, dividing the data into a training and a test set is enough to provide a good estimate of the performance on new data. However, if the dataset is small, the amount of data that can be used for training and evaluation is limited. This results in a conflict of interests since a larger training set in general gives a better model, while a larger test set gives a better estimation of its performance [6].

2.6.1 Cross-validation

One common way to solve the complications which occur with a limited dataset is to use cross-validation. With a K-fold cross-validation, the dataset is divided into K parts of equal size. The training and evaluation of the model are iterated such that each part of the data is in turn used as test data, while the rest of the data is used for training [14]. An example of how the data is divided for each iteration in the case K = 3 is shown in Figure 4. In the first iteration the first part of the dataset is held out as test data, in the second iteration the second part and so on. After the last iteration all three parts of the data have been used for evaluation. The final estimation of the model performance is achieved by averaging the performance measure computed in each iteration [14]. The advantage of cross-validation is that while the model is evaluated on a set of data that is separate from that used for training, all data is eventually used for both training and testing [14].

Figure 4: Visualization of how the data is divided into training and test sets with K-fold cross-validation when K = 3.
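The division of the data in Figure 4 can be sketched as a generator of index lists. The function name is an assumption, and the folds are taken as contiguous blocks for readability; in practice the instances would typically be shuffled before splitting.

```python
def k_fold_indices(n, K):
    """Yield (train_idx, test_idx) pairs for K contiguous folds of n instances."""
    if not 2 <= K <= n:
        raise ValueError("K must be between 2 and the number of instances")
    base, extra = divmod(n, K)
    start = 0
    for fold in range(K):
        # The first `extra` folds get one additional instance when n is not divisible by K
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n=6, K=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every instance appears in exactly one test fold, and the final performance estimate is the average of the performance measure over the K folds.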


2.6.2 Overfitting

When constructing a machine learning model, one has to be aware of the problem of overfitting. Overfitting means that the model has fitted the training data too closely, such that it has captured noise and other irregularities in the data. This is undesirable since it prevents the model from generalizing [3]. An overfitted model is thus not able to provide correct predictions of the outputs on new data [4]. In the k-NN algorithm, overfitting can occur due to an unwise choice of the number of neighbors k. With a small k, the model risks becoming overly flexible and may overfit. This causes the number of correct predictions on the training data to become high while the number of correct predictions on the test data may be low. With a too large value of k, the model is not flexible enough and may instead generalize too much. This usually results in a poor performance on both the training and the test data [4].

The concept of overfitting can also be described in terms of bias and variance. A model that overfits the data is said to have a low bias but a high variance. For a model that underfits the data the bias is instead said to be high while the variance is low. In this setting, bias is the error that occurs when a real-life problem is approximated by a simplified model, while variance is a measure of how much the estimate of a model would change if the model were trained on a different set of data [4].

2.6.3 Tuning of hyperparameters

Some learning algorithms have hyperparameters that can be tuned in order to optimize the performance of the model. An example of such a hyperparameter is the parameter k in the k-NN algorithm [6]. As described in Section 2.6.2, the choice of k has to be made with care in order to avoid over- or underfitting the training data. The best performance on a test set is usually achieved by tuning k such that it fits the characteristics of the dataset. It is however important not to tune k by the performance on the test data, since this introduces optimistic bias in the final evaluation of the model [6]. Not only should the test data be held out of training, it should not be used in any step of creating the model. It might thus be necessary to divide the data into three independent sets: a training set, a validation set and a test set. The training set is used to train multiple models with different values of the hyperparameters, each of which is evaluated on the validation set. The hyperparameters that give the best performance on the validation data are then used to create a new model which is trained on the data from both the training and the validation set. This final model is then evaluated by its performance on the test data, which gives an estimate of the performance on unseen data [6].
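The three-set procedure can be sketched as follows. The example is a minimal illustration under stated assumptions: the data is synthetic and one-dimensional, the 60/20/20 split and the candidate values of k are chosen arbitrarily here, and the small k-NN helper is a stand-in for a real implementation.

```python
import random
from collections import Counter


def knn_predict(train, x, k):
    """train is a list of (value, label) pairs; x is a single number."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, data, k):
    """Fraction of (value, label) pairs in data that are predicted correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)


# Synthetic 1-D data: two overlapping classes (illustration only).
rng = random.Random(1)
data = [(rng.gauss(1.5 * label, 1.0), label)
        for label in (rng.choice([0, 1]) for _ in range(300))]
rng.shuffle(data)

# Three disjoint sets: 60 % training, 20 % validation, 20 % test (assumed sizes).
train, val, test = data[:180], data[180:240], data[240:]

# Tune k on the validation set only; the test set is never consulted here.
candidates = [1, 3, 5, 9, 15, 25, 41]
best_k = max(candidates, key=lambda k: accuracy(train, val, k))

# Retrain on training + validation data (for k-NN: store both sets),
# then evaluate once on the held-out test set.
final_train = train + val
test_acc = accuracy(final_train, test, best_k)
print("chosen k:", best_k, "test accuracy:", round(test_acc, 3))
```

Because the test set is touched only in the last step, the reported accuracy is not influenced by the choice of k.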

3 Method

This section contains descriptions of the datasets and how these are pre-processed as well as details concerning the implementation of the methods of feature selection (ReliefF) and classification (k-NN) that are used in this project.


3.1 Dataset

The datasets used to train and evaluate the classifiers for each operation consist of 18 input features describing the properties of the inserts, and two qualitative outputs defining which tray is used. The feature data is collected for ISO-standardized articles in the i700- and i701-families, giving a set of 5803 inserts. The corresponding output data is collected separately for each of the four considered operations; each operation thus gives rise to an individual dataset.

A short description of each input feature is given in Table 1. The feature SYS STDCODE is a unique identifier of each insert and is thus not used to train the classifiers. Instead, it is considered as the name of the insert. Its value consists of the combined values of ISO1-ISO7 and CHIPBRK, together with an internal code. The features ISO1-ISO7 are ISO-parameters. Each of these encodes information about one or multiple features of the inserts. For example, ISO1 encodes information about both the geometrical shape of the insert and its nose angle. Table 1 also states whether the features are regarded as numerical or categorical in the raw dataset. For the categorical features it is assumed that there exists no internal order between the categories, such that the features can be treated as nominal. As stated in Table 1, each ISO-parameter is regarded as categorical in the raw dataset. However, ISO-parameters that encode a single numerical feature are translated to the corresponding numerical values during the pre-processing of the data.

The set of input features is chosen to contain both features that are considered likely and features that are considered less likely to be important to the outputs. The features that are expected to be important are mainly those concerning the geometry of the inserts, while features such as the powder mixture or edge rounding are not expected to be determining for the choice of tray. Several of the chosen features are also related to one another and can thus be expected to correlate. Examples are the features CC and IC, which are related to the size of the inserts described by ISO5, and DENSITY and VOLUME, which are related to the mass of the inserts as described by WEIGHT.


Table 1: Description of the input features.

Feature       Description                                           Type
SYS STDCODE   Identifier                                            Categorical
CC            Radius of the smallest circle enclosing the insert    Numerical
IC            Radius of the largest circle enclosed by the insert   Numerical
ISO1          Shape                                                 Categorical
ISO2          Relief angle                                          Categorical
ISO3          Tolerance                                             Categorical
ISO4          Hole/Chipbreaker                                      Categorical
ISO5          Size                                                  Categorical
ISO6          Thickness                                             Categorical
ISO7          Corner radius                                         Categorical
GRADE         Powder mixture                                        Categorical
WEIGHT        Weight                                                Numerical
AREA          Lateral surface                                       Numerical
CHIPBRK       Chipbreaker                                           Categorical
HV3MEAN       Hardness                                              Numerical
DENSITY       Density                                               Numerical
VOLUME        Volume                                                Numerical
ERSIZE        Edge rounding                                         Numerical

Table 2 shows an example of output values for an instance in one of the datasets. The output variable PICKFILE defines which tray and pickfile is used, while POSITION describes how the insert is positioned on the tray. POSITION is either UP or DOWN depending on which side of the insert faces upwards. If no value is given it is assumed that the position of the insert is arbitrary. In the examined datasets, the value of POSITION is either DOWN or not given. The tray data is stored in the database PARAD and is fetched through an API based on the first six characters of the SYS STDCODE, that is, the values of the input features ISO1-ISO5. This means that all inserts with the same values for ISO1-ISO5 use the same tray.
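The consequence of this lookup rule, that the first six characters of the identifier fully determine the tray, can be checked with a short sketch. Only the "first six characters" rule is taken from the text; the identifier strings, the tray names and the function name `tray_key` are invented for illustration and do not come from PARAD.

```python
from collections import defaultdict

# Hypothetical records: (SYS STDCODE, PICKFILE). Codes and tray names are
# invented; they only mimic the idea of an ISO1-ISO5 prefix.
records = [
    ("CNMG120408-PM-X1", "TRAY_A"),
    ("CNMG120412-PM-X2", "TRAY_A"),
    ("CNMG160612-PR-X3", "TRAY_B"),
    ("TNMG160408-PF-X4", "TRAY_C"),
]


def tray_key(stdcode):
    """Return the lookup key: the first six characters (ISO1-ISO5)."""
    return stdcode[:6]


# Collect the set of trays observed for each key.
trays_per_key = defaultdict(set)
for stdcode, pickfile in records:
    trays_per_key[tray_key(stdcode)].add(pickfile)

# If the key fully determines the tray, every key maps to exactly one tray.
key_determines_tray = all(len(trays) == 1 for trays in trays_per_key.values())
print(dict(trays_per_key), key_determines_tray)
```

Running the same check on a real dataset would show directly whether any ISO1-ISO5 combination is associated with more than one tray.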

Table 2: Example of output values.

PICKFILE   POSITION
587 11     DOWN

3.2 Pre-processing of the Dataset

Before applying feature selection and using the datasets to train and test the classifiers, the datasets are pre-processed. The initial pre-processing is divided into two steps: cleaning of the datasets and translation of the ISO-parameters.


3.2.1 Cleaning of the Dataset

The cleaning of the datasets mainly consists of discarding instances without given output values. First, all instances without a given value for PICKFILE are removed and saved in a separate dataset. These sets of unlabeled instances are later used to test the final models. For POSITION it is assumed that a missing value means that the insert can be positioned arbitrarily. For instances with missing values for POSITION, a new class UNKNOWN is therefore introduced, indicating arbitrary positioning of the insert. During the cleaning, instances with missing values for any of the input features are also removed from the datasets.
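The cleaning steps can be summarized in a short sketch. It is an illustration only: the rows are invented, the feature list is shortened to three hypothetical columns, and it is assumed here that instances with missing input features are dropped from both the labeled and the unlabeled set.

```python
INPUT_FEATURES = ["IC", "ISO1", "WEIGHT"]  # shortened, hypothetical subset


def clean(rows):
    """Split rows into a cleaned labeled set and an unlabeled set."""
    labeled, unlabeled = [], []
    for row in rows:
        # Instances with a missing input feature are discarded entirely.
        if any(row.get(name) is None for name in INPUT_FEATURES):
            continue
        if row.get("PICKFILE") is None:
            # No tray given: keep aside for testing the final models.
            unlabeled.append(row)
            continue
        cleaned = dict(row)
        # A missing POSITION is read as "arbitrary" and gets its own class.
        if cleaned.get("POSITION") is None:
            cleaned["POSITION"] = "UNKNOWN"
        labeled.append(cleaned)
    return labeled, unlabeled


# Invented example rows; None marks a missing value.
rows = [
    {"IC": 12.7, "ISO1": "C", "WEIGHT": 9.1, "PICKFILE": "P1", "POSITION": "DOWN"},
    {"IC": 12.7, "ISO1": "C", "WEIGHT": 9.3, "PICKFILE": "P1", "POSITION": None},
    {"IC": 9.5, "ISO1": "T", "WEIGHT": 4.2, "PICKFILE": None, "POSITION": None},
    {"IC": None, "ISO1": "S", "WEIGHT": 6.0, "PICKFILE": "P2", "POSITION": "DOWN"},
]
labeled, unlabeled = clean(rows)
print(len(labeled), "labeled,", len(unlabeled), "unlabeled")
```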

After the cleaning, the labeled datasets for the operations BDOG 3G CLEAN, BULLDOG 3G ER and BULLDOG ER consist of 5214 inserts each, while the dataset for BDOG CLEAN consists of 4968 inserts. The number of different output combinations is 25-27 depending on the operation. The unlabeled datasets consist of 653 inserts for BDOG CLEAN and 207 inserts for the other operations.

3.2.2 Translation of the ISO-parameters

Table 3 defines the features encoded by each ISO-parameter. The value of each ISO-parameter is defined by the values of its encoded features and vice versa. Each ISO-parameter thus encodes information about the features of the inserts. The ISO-parameters ISO2, ISO5, ISO6 and ISO7 encode one numerical feature each. In the datasets, these ISO-parameters are thus directly translated to the numerical values of their encoded features and are henceforth regarded as numerical. The parameters ISO1, ISO3 and ISO4 are each described by a set of encoded features. For simplicity, these parameters are not translated but are instead regarded as categorical nominal features.

Table 3: Description of the ISO-parameters.

ISO-parameter   Encoded feature(s)
ISO1            Geometrical shape
                Nose angle [°]
ISO2            Relief angle [°]
ISO3            Tolerances of:
                  Cornerpoint [mm]
                  Thickness [mm]
                  Inscribed circle [mm]
ISO4            Hole
                Hole shape
                Chipbreaker type
ISO5            Cutting edge length [mm]
ISO6            Thickness [mm]
ISO7            Corner radius [mm]
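The translation of a single-valued ISO-parameter amounts to a table lookup, which can be sketched as below. The lookup tables are assumptions: the relief-angle letters and corner-radius codes follow the common ISO 1832 insert designation, but the tables actually used in the project are not listed in the text and may differ. ISO5 and ISO6 would be handled in the same way with their own tables.

```python
# Illustrative lookup tables (assumed, based on the common ISO 1832
# designation); they should be checked against the actual data source.
RELIEF_ANGLE_DEG = {"N": 0.0, "A": 3.0, "B": 5.0, "C": 7.0, "P": 11.0,
                    "D": 15.0, "E": 20.0, "F": 25.0, "G": 30.0}
CORNER_RADIUS_MM = {"04": 0.4, "08": 0.8, "12": 1.2, "16": 1.6}

# Parameters that encode exactly one numerical feature.
NUMERIC_TABLES = {"ISO2": RELIEF_ANGLE_DEG, "ISO7": CORNER_RADIUS_MM}


def translate(row):
    """Replace the codes of single-valued ISO-parameters by their numbers.

    Parameters that encode several features (ISO1, ISO3, ISO4) are left
    untouched and stay categorical.
    """
    out = dict(row)
    for name, table in NUMERIC_TABLES.items():
        out[name] = table[row[name]]
    return out


example = {"ISO1": "C", "ISO2": "N", "ISO7": "08"}
print(translate(example))
```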


3.3 Feature Selection

The feature selection is performed with a Python implementation of the ReliefF algorithm described in Section 2.4.2. The number of nearest hits and nearest misses is set to k = 10. Since the algorithm thus assumes that all instances have at least 10 neighbors with the same output value, any instance for which this is not true is discarded from the dataset. The number of iterations m is set to the remaining number of instances in the dataset, such that the algorithm runs over all instances. Since instances with missing input values are removed from the dataset during the pre-processing, the functions to estimate the difference between unknown values ((4) and (5)) are not implemented.

The ReliefF algorithm is applied to the datasets corresponding to each of the four operations, once for PICKFILE and once for POSITION. The result is thus two weight vectors for each operation, one describing the relevance of the features with respect to PICKFILE, and one describing the relevance with respect to POSITION. The selection of relevant features for each operation and output based on the ReliefF weights is made by inspection. In practice, this means selecting any feature with a weight higher than τ = 0.1 as relevant.
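The procedure in this section can be sketched in plain Python. The code below is a simplified illustration and not the implementation used in the project: it assumes the standard ReliefF update with k nearest hits and k nearest misses per class, a difference of 0 or 1 for nominal features and a range-scaled absolute difference for numerical features. The toy data and the function names are invented.

```python
from collections import Counter


def drop_small_classes(X, y, k):
    """Discard instances whose class has fewer than k other members."""
    counts = Counter(y)
    keep = [i for i, c in enumerate(y) if counts[c] > k]
    return [X[i] for i in keep], [y[i] for i in keep]


def relieff(X, y, k, nominal=()):
    """Return one ReliefF weight per feature (simplified sketch)."""
    n, d = len(X), len(X[0])
    # Value range of each numerical feature, used to scale differences to [0, 1].
    spans = [1.0] * d
    for a in range(d):
        if a not in nominal:
            col = [row[a] for row in X]
            spans[a] = (max(col) - min(col)) or 1.0

    def diff(a, r, s):
        if a in nominal:
            return 0.0 if r[a] == s[a] else 1.0
        return abs(r[a] - s[a]) / spans[a]

    def dist(r, s):
        return sum(diff(a, r, s) for a in range(d))

    prior = {c: cnt / n for c, cnt in Counter(y).items()}
    weights = [0.0] * d
    m = n  # one iteration per instance
    for i in range(n):
        by_class = {c: [] for c in prior}
        for j in range(n):
            if j != i:
                by_class[y[j]].append((dist(X[i], X[j]), j))
        for c, cand in by_class.items():
            nearest = [j for _, j in sorted(cand)[:k]]
            if c == y[i]:
                scale = -1.0  # nearest hits lower the weight
            else:
                scale = prior[c] / (1.0 - prior[y[i]])  # nearest misses raise it
            for j in nearest:
                for a in range(d):
                    weights[a] += scale * diff(a, X[i], X[j]) / (m * k)
    return weights


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [(0.0, 0.3), (0.1, 0.9), (0.2, 0.1), (0.9, 0.8), (1.0, 0.2), (0.8, 0.5)]
y = ["a", "a", "a", "b", "b", "b"]
X, y = drop_small_classes(X, y, k=2)
weights = relieff(X, y, k=2)

tau = 0.1  # threshold from the text
selected = [a for a, w in enumerate(weights) if w > tau]
print(weights, selected)
```

On the toy data the separating feature receives a clearly positive weight and the noise feature a negative one, so only the former passes the threshold.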

3.4 Classification

The classification is done using tools from the Python machine learning library scikit-learn [15]. The classifiers used to predict the trays for each operation are constructed with an implementation of the k-NN algorithm. For the evaluation of the classifiers each dataset is divided into a training set (80%) and a test set (20%). Before training the classifiers, all numerical features are normalized to the interval [0, 1] and the nominal features are one-hot encoded. The hyperparameter k is tuned in the interval [1, 100] using 5-fold cross-validation on the training set.

In addition to the evaluation on the test data, the classifiers are also used to predict the trays for those instances without a given class that were discarded from the datasets during the pre-processing, that is, the unlabeled dataset. For this, the classifiers are retrained on the complete dataset.
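A pipeline of this kind could look as follows in scikit-learn. The sketch is not the code used in the project: the data is synthetic, the nominal feature is stored as integer category codes, and placing the scaling and encoding inside the pipeline (so that they are fitted on the training folds only) is a design choice assumed here rather than taken from the text.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in data: two numerical columns and one nominal column
# (integer category codes). Not the data used in the project.
rng = np.random.default_rng(0)
n = 400
category = rng.integers(0, 4, size=n)
numeric = rng.normal(size=(n, 2)) + category[:, None]
X = np.column_stack([numeric, category])
y = (category + (numeric[:, 0] > 1.5)) % 3

# 80 % training data, 20 % test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), [0, 1]),                       # scale to [0, 1]
    ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),  # one-hot encode
])
pipeline = Pipeline([("prep", preprocess), ("knn", KNeighborsClassifier())])

# 5-fold cross-validation on the training set over k = 1, ..., 100.
search = GridSearchCV(pipeline, {"knn__n_neighbors": list(range(1, 101))}, cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
test_accuracy = search.score(X_test, y_test)
print("best k:", best_k, "test accuracy:", round(test_accuracy, 3))

# For predicting unlabeled instances, refit the tuned pipeline on all labeled data.
final_model = search.best_estimator_.fit(X, y)
```

Since `GridSearchCV` refits the best configuration on the whole training set by default, `search.score` evaluates a model that has never seen the test data.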

Since the datasets contain two outputs, the classification problem for each operation is a multi-output classification problem. For simplicity, the algorithm is run once for each output such that the result is two independent classifiers. The final predictions are then obtained by combining the predictions from the classifiers corresponding to each output.
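How the two independent predictions can be combined and scored is sketched below. It is assumed here that a combined prediction counts as correct only when both PICKFILE and POSITION are correct; the text does not state the definition explicitly. The tray name TRAY_X and the predictions are invented for illustration.

```python
def success_rates(true_pairs, pred_pickfile, pred_position):
    """Return per-output and combined success rates.

    true_pairs: list of (PICKFILE, POSITION) labels. A combined prediction is
    counted as correct only if both outputs are correct (assumed definition).
    """
    combined = list(zip(pred_pickfile, pred_position))
    n = len(true_pairs)
    pick_ok = sum(p == t[0] for p, t in zip(pred_pickfile, true_pairs)) / n
    pos_ok = sum(p == t[1] for p, t in zip(pred_position, true_pairs)) / n
    both_ok = sum(c == t for c, t in zip(combined, true_pairs)) / n
    return pick_ok, pos_ok, both_ok


# Invented labels and predictions from two separate classifiers.
true_pairs = [("587 11", "DOWN"), ("587 11", "UNKNOWN"),
              ("TRAY_X", "DOWN"), ("TRAY_X", "DOWN")]
pred_pickfile = ["587 11", "587 11", "587 11", "TRAY_X"]
pred_position = ["DOWN", "DOWN", "DOWN", "DOWN"]

rates = success_rates(true_pairs, pred_pickfile, pred_position)
print(rates)  # (0.75, 0.75, 0.5)
```

With this definition the combined success rate can never exceed the lower of the two individual success rates.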

4 Results

This section presents the feature weights as computed by the ReliefF algorithm and the success rates of the classifiers both when evaluated on a held-out test set and when evaluated on unlabeled data.

4.1 Feature weights

The feature weights computed by the ReliefF algorithm are similar for each of the examined operations. The visualizations of the feature weights for the operation BDOG 3G CLEAN in Figure 5 and Figure 6 are thus considered to be representative for all operations. Figure 5 visualizes the feature weights as computed by the ReliefF algorithm when PICKFILE was used as output. For this output, several of the features have similar weights. There are however some features that have weights close to or less than zero and thus can be considered irrelevant. The feature ISO1 stands out with a comparatively high weight.

Figure 5: Visualization of the feature weights computed by ReliefF for PICKFILE for the operation BDOG 3G CLEAN.

Figure 6 visualizes the feature weights when POSITION was used as output. In comparison with what is observed for PICKFILE, most of the features have low weights. The feature ISO4, however, stands out with a considerably higher weight than the rest. ISO2 and CHIPBRK also stand out as relevant for this output.


Figure 6: Visualization of the feature weights computed by ReliefF for POSITION for the operation BDOG 3G CLEAN.

4.1.1 Relevant features

Based on inspection of the weights computed by the ReliefF algorithm, the features ISO3, ISO7, GRADE, HV3MEAN, DENSITY and ERSIZE are regarded as irrelevant for PICKFILE and are thus disregarded when training the classifiers for this output. For POSITION, the features ISO2, ISO4 and CHIPBRK are regarded as relevant and are used to train the corresponding classifiers.

4.2 Classification results

This section presents the success rates both on a test set held out from the labeled dataset and on a set of previously unlabeled data.

4.2.1 Evaluation on a test set

In order to examine whether the feature selection improves the number of correct predictions on a test set, the k-NN algorithm is trained and evaluated once using all features and once using only the relevant features. The resulting success rates for each operation are presented in Tables 4-7. For each operation the success rate on the test set improves when the algorithm is trained only on the relevant features. This improvement stems from improved predictions of PICKFILE. The success rate for POSITION, in contrast, gets worse when the classifiers are trained only on the relevant features.

When the algorithm is trained only on the relevant features, as compared to when it is trained on all features, the value of the hyperparameter k also tends to be tuned to a lower value. This is especially evident for the classifiers trying to predict the output PICKFILE. For this output, the value of k is tuned to k = 1 for each operation when the model is trained on the relevant features. This means that the best performance of the classifier is achieved when it predicts the output of a new instance to be the same as that of the nearest neighbor in the training set.

Table 4: Success rates for operation BDOG 3G CLEAN.

Features            Success rate   Success rate   Success rate
                    PICKFILE       POSITION
All features        0.917          0.997          0.915
Relevant features   1.0            0.992          0.992

Table 5: Success rates for operation BULLDOG 3G ER.

Features            Success rate   Success rate   Success rate
                    PICKFILE       POSITION
All features        0.940          0.998          0.938

Table 6: Success rates for operation BDOG CLEAN.

Features            Success rate   Success rate   Success rate
                    PICKFILE       POSITION
All features        0.918          0.988          0.910

Table 7: Success rates for operation BULLDOG ER.

Features            Success rate   Success rate   Success rate
                    PICKFILE       POSITION
All features        0.902          0.997          0.900

4.2.2 Evaluation on unlabeled data

For the predictions of the unlabeled dataset, the k-NN algorithm is trained only on the relevant features. A sample of predictions for each operation was manually labeled as "correct", "can work" or "wrong" by a staff member at Sandvik Coromant with knowledge of the matching of inserts and trays. The samples were constructed by selecting one representative insert for each combination of the values of ISO1-ISO5 that occurred within the unlabeled dataset.

If different trays had been predicted for inserts with the same combination of the values of ISO1-ISO5, an insert representing the most common prediction for this combination was chosen. For BDOG 3G CLEAN, BULLDOG 3G ER and BULLDOG ER the sample consisted of predictions for ten inserts while the sample for BDOG CLEAN consisted of predictions for eleven inserts. The results for all operations were that approximately 20% of the inserts were con-


5 Discussion

The feature weights visualized in Figure 5 and Figure 6 show that the weights of the features differ between PICKFILE and POSITION. For PICKFILE, which describes which tray is used, the features with high relevance are mostly features that describe the geometry of the inserts. As mentioned in Section 3.1, these are also the features that are expected to have a high relevance for the choice of tray. Other features, such as ISO3, which describes the tolerance for some of the geometrical measures of the insert, are as expected assigned a low weight. The same observations hold for POSITION; for this output, however, the feature selection reduces the number of features considerably more, such that only three features remain as relevant. A fact that may undermine the result of the feature weights is that the ReliefF algorithm is here applied to a dataset with a combination of numerical and nominal features. As stated in Section 2.4.1, this is expected to cause the numerical features to be underestimated in relation to the nominal ones. This complicates the comparison of weights for nominal and numerical features and may also explain the comparatively high weights of the nominal features ISO1 and ISO4 for PICKFILE and POSITION respectively.

When using ReliefF as a method to identify determining features it is assumed that a determining feature can be interpreted as a feature that is individually relevant to the output. However, considering that the features ISO1-ISO5 are used to store and fetch the tray data, the combination of these uniquely defines the outputs in the datasets. What was realized near the end of the project is that it is thus sensible to argue that the features ISO1-ISO5 are determining in this dataset. The reason this was not detected by the ReliefF algorithm is, as noted above, that ReliefF aims to find the individual features with a statistical relevance to the outputs. Thus, if a set of features together defines the outputs, it does not necessarily mean that each of them is assigned a high weight by ReliefF. An example of this is, as already mentioned, the feature ISO3. Even though ISO3 is included in the feature set ISO1-ISO5, its individual relevance to both outputs is low.

A problem that arises from the fact that ISO1-ISO5 uniquely define the outputs is that there is a risk that a classifier trained on the dataset overfits to this relation. This will cause the classifier to generalize poorly to new, unseen inserts. The relation between ISO1-ISO5 and the outputs can be illustrated with an example of a dataset that describes a group of people by their name and social security number. The same name can occur multiple times, but the social security number will be unique for each person. Thus, it is not difficult to train a classifier to predict the name of a person in the dataset given their social security number. The classifier will however not generalize to people who are not present in the dataset. The main difference between the dataset in this example and the one that has been considered within the project is that while a social security number should only appear once in a list of people, the values of the features ISO1-ISO5 can be the same for several inserts. This complicates the training and evaluation of a classifier since it creates a group structure in the data. This structure, and the problems that follow, are best described by another example.


Figure 7 visualizes a dataset where each instance is defined by the features x₁ and x₂, and is labeled either as a circle, a square or a star. These classes are characterized by similar values for x₁ and x₂ such that the data is divided into distinct groups. When using the k-NN algorithm to classify a new instance, represented by a triangle, it is clear that an instance placed within one of the groups (Figure 7a) is easier to classify than an instance between the groups (Figure 7b). In the dataset in this project, the features ISO1-ISO5 group the data in a similar way to that in Figure 7. Thus, a new instance is expected to be easier to classify if its combination of values for ISO1-ISO5 is the same as for one or several of the training instances. Given how the dataset is divided into training, validation and test sets, there is a risk that instances with the same values for ISO1-ISO5 are present in all sets. Not only does this cause a bias in the evaluation of the classifiers on the test data, but it also allows the classifiers to overfit such that they generalize poorly to new data.

(a) The new instance is placed within a group in the training data.

(b) The new instance is placed between the groups in the training data.

Figure 7: Example of a dataset with a clear group structure. The striped triangle represents a new instance to be classified.
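The risk that the same groups appear in both the training and the test set can be quantified with a small sketch. The group keys below are hypothetical stand-ins for ISO1-ISO5 combinations, and the group-wise split is shown only as a contrast to the random split; it was not part of the evaluation in this project.

```python
import random


def shared_groups(train_groups, test_groups):
    """Fraction of test instances whose group also occurs in the training set."""
    seen = set(train_groups)
    return sum(g in seen for g in test_groups) / len(test_groups)


# Hypothetical group keys standing in for ISO1-ISO5 combinations:
# 40 groups with 25 instances each.
groups = [f"G{g:02d}" for g in range(40) for _ in range(25)]
rng = random.Random(0)

# (a) Random 80/20 split over instances.
shuffled = groups[:]
rng.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
random_leak = shared_groups(shuffled[:cut], shuffled[cut:])

# (b) Group-wise split: whole groups go to either training or test.
keys = sorted(set(groups))
rng.shuffle(keys)
test_keys = set(keys[: len(keys) // 5])
train_g = [g for g in groups if g not in test_keys]
test_g = [g for g in groups if g in test_keys]
group_leak = shared_groups(train_g, test_g)

print("random split:", random_leak, "group-wise split:", group_leak)
```

With a random split, a test instance will in most cases have group members in the training set, which corresponds to the situation in Figure 7a. With a group-wise split no test instance does, which corresponds to Figure 7b.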

The occurrence of these problems is confirmed by the classification results presented in Section 4.2. For the test data, the very high success rate is consistent with the suspicion that the classifiers are evaluated on data that is similar to or the same as the data they are trained on. Or, to return to the example in Figure 7, that the classifiers are evaluated on data in the same groups as the training data. The low values of k in the nearest neighbor algorithm also indicate that the classifiers are most likely overfitting to the features ISO1-ISO5. This is confirmed further by the classification results on the unlabeled dataset. For this dataset it is certain that the classifiers are evaluated on completely new combinations of values for ISO1-ISO5, that is, on instances between the groups in the training data. The poor success rate on the unlabeled data thus strengthens the suspicion that the classifiers overfit to the values of ISO1-ISO5.

With the problem of overfitting in mind, a probable explanation also arises for why the number of correct predictions for PICKFILE increases and the correct


features that can be considered as noise to the relation between ISO1-ISO5 and the outputs. This might allow the overfitting to increase such that it becomes even easier to classify an instance that has the same values of ISO1-ISO5 as in the training set. However, for POSITION, where several of the features ISO1-ISO5 are discarded, it is possible that the overfitting decreases, which instead gives a lower success rate.

5.1 Further Work

Based on the discussion above, the main complication in this project is the group structure of the data that arises due to the previously known relation between the features ISO1-ISO5 and the outputs. In a possible continuation of this work, measures should thus be taken to limit the impact of this relation.

One idea to prevent overfitting the classifiers is to exclude the features ISO1-ISO5 such that the classifiers are trained only on the remaining features. This way the group structure of the data is hopefully reduced, and the classifiers can generalize better to new inserts. Another thing to consider is that there may be other classification algorithms that are better suited to this problem and thus can provide better classification results. Experimentation with different classifiers is thus also recommended as part of a possible continuation of the project.

6 Conclusions

From the discussion it is clear that the answer to which features are determining depends on the definition of a determining feature, a concept that is open to interpretation. In this project, the initial approach was to regard a feature as determining if it was individually relevant to the outputs. The determining features were thus assumed to be the features identified as relevant by the ReliefF algorithm. What was later realized is that, considering that the outputs are fetched based on the features ISO1-ISO5, these features define the outputs in the dataset and can thus in a sense also be regarded as determining. In this case the determining features are instead defined as a set of features that together describe the relation to the outputs.

The defining relation between the features ISO1-ISO5 and the outputs proved to complicate the matter of training a well-performing classifier on the data. The classifiers in this project overfit to the relation such that they are unable to predict a suitable tray for new inserts. Furthermore, the evaluation of the classifiers on the test set is misleading since the performance does not correspond to the performance on the unlabeled dataset. As mentioned in Section 5.1, a solution to the problem of overfitting may be to exclude the features ISO1-ISO5 from the dataset. Another proposed continuation of the project is to compare the performance of classifiers trained with a few different classification algorithms in order to find the algorithm best suited to the problem.

Considering the results from this project, the overall conclusion is that further work is needed in order to provide satisfactory answers to the questions defined in the objective.


References

[1] Sandvik Coromant. How carbide inserts are made by Sandvik Coromant [video file]. 2017, Jan 24 [cited 2021 May 5]. Available from: https://www.youtube.com/watch?v=0QrynzJ lZ4

[2] Mitchell TM. Machine Learning. International Edition 1997. New York: McGraw-Hill; c1997.

[3] Marsland S. Machine Learning: An Algorithmic Perspective [Internet]. 2nd ed. Boca Raton: CRC Press; c2015. [cited 2021 April 18]. Available from: https://learning.oreilly.com/library/view/machine-learning-2nd/9781466583283/

[4] James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning - with Applications in R [Internet]. Springer, New York, NY; c2013. [cited 2021 March 12]. Available from: https://doi.org/10.1007/978-1-4614-7138-7

[5] Read J, Martino L, Olmos PM, Luengo D. Scalable multi-output label prediction: From classifier chains to classifier trellises. Pattern Recognition. 2015;48(6):2096-109.

[6] Witten IH, Frank E, Hall MA, Pal CJ. Data Mining: Practical Machine Learning Tools and Techniques [Internet]. 4th ed. Cambridge: Morgan Kaufmann; 2016. [cited 2021 May 10]. Available from: https://learning.oreilly.com/library/view/data-mining-4th/9780128043578/

[7] Brownlee J. Why One-Hot Encode Data in Machine Learning? 2017, July 28 [cited 2021 August 3]. In: Machine Learning Mastery [blog on the Internet]. Jason Brownlee; 2013 - . Available from: https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/

[8] Wang S, Tang J, Liu H. Feature Selection [Internet]. In: Sammut C, Webb GI, editors. Encyclopedia of Machine Learning and Data Mining. 2nd ed. Springer, Boston, MA; 2017.

[9] Kohavi R, John GH. Wrappers for feature subset selection. Artificial Intelligence. 1997;97(1-2):273-324.

[10] Kira K, Rendell LA. A Practical Approach to Feature Selection. In: Sleeman D, Edwards P, editors. Machine Learning: Proceedings of the Ninth International Workshop. Ninth International Machine Learning Conference (ML92). July 1-3, 1992; Aberdeen. San Mateo: Morgan Kaufmann; 1992. p. 249-56.

[11] Kononenko I. Estimating Attributes: Analysis and Extensions of RELIEF. In: Bergadano F, De Raedt L, editors. Machine Learning: ECML-94. European Conference on Machine Learning. April 6-8, 1994; Catania. Springer, Berlin, Heidelberg; 1994. p. 171-82.

[13] Kira K, Rendell LA. The Feature Selection Problem: Traditional Methods and a New Algorithm. AAAI-92 Proceedings. Tenth National Conference on Artificial Intelligence (AAAI-92). July 12-16, 1992; San Jose. Palo Alto: AAAI Press; 1992. p. 129-34.

[14] Theodoridis S. Machine Learning: A Bayesian and Optimization Perspective [Internet]. Academic Press; c2015. [cited 2021 May 18]. Available from: https://learning.oreilly.com/library/view/machine-learning/9780128015223/

[15] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825-30.
