
The Linux networking stack can benefit from “inferences” due to machine learning, which may be used in “smart” applications.

K-Means Clustering:

• K-Means is an algorithm for clustering data. Though computationally expensive, K-Means is good at creating clusters with very high learning rates.

• Since K-Means is an unsupervised algorithm, it allows for independent interpretation of the results.
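To make the clustering idea concrete, here is a minimal sketch of the standard K-Means loop (Lloyd's algorithm) in plain Python. It is not the code used in our tests; the two-feature samples and their meaning (delay, retransmission rate) are hypothetical values chosen only for illustration.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: returns (centroids, labels) for a list of tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Hypothetical normalized samples: (round-trip delay, retransmission rate)
samples = [(0.10, 0.05), (0.12, 0.07), (0.11, 0.06),
           (0.90, 0.85), (0.88, 0.90), (0.92, 0.87)]
centroids, labels = kmeans(samples, k=2)
```

No labels are supplied, so the algorithm only reports which samples group together; deciding what each group means is left to the person reading the result.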

Support Vector Machine (SVM):

• SVMs are a kind of supervised learning method that can be used for classification.

• SVMs use a subset of the training data in the decision function, making them memory-efficient.

• Using an SVM with the Radial Basis Function (RBF) kernel and a moderate C parameter proved computationally expensive but had the best training rates.
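The memory-efficiency point can be illustrated with a small sketch of how a trained RBF-kernel SVM evaluates a new sample: only the support vectors take part in the sum. The support vectors, coefficients, bias and gamma value below are hypothetical stand-ins for a trained model, not values from our experiments. The C parameter does not appear because it acts at training time, where it bounds the size of the coefficients.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svm_decision(x, support_vectors, dual_coefs, bias, gamma=1.0):
    """Sum over the support vectors only; other training points are not needed."""
    return sum(coef * rbf_kernel(sv, x, gamma)
               for sv, coef in zip(support_vectors, dual_coefs)) + bias

# Hypothetical trained model: each dual coefficient is alpha_i * y_i.
support_vectors = [(0.2, 0.2), (0.8, 0.8)]
dual_coefs = [-1.0, 1.0]
bias = 0.0

def classify(x):
    """Return +1 or -1 depending on the sign of the decision function."""
    score = svm_decision(x, support_vectors, dual_coefs, bias, gamma=2.0)
    return 1 if score >= 0 else -1
```

With this toy model, a sample near (0.8, 0.8) falls on the positive side and a sample near (0.2, 0.2) on the negative side.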

Decision Trees (DT):

• DTs are a supervised learning method for classification. By creating simple rules from labeled data features, a DT attempts to predict the value of the target variable.

• DTs have prediction costs logarithmic in the number of training sets. In fact, DTs are exceptional at using incomplete data sets and at training models with low numbers of sets while still yielding high similarity quotients with statistical methods like K-Means.
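As a rough illustration of "simple rules from labeled data", the sketch below learns a one-level tree (a decision stump): a single feature/threshold rule chosen to minimize misclassifications. A full DT repeats this split recursively on each side, which is where the logarithmic prediction cost of a balanced tree comes from. The sample values and the labels "normal" and "congested" are hypothetical.

```python
def best_stump(samples, labels):
    """Find the (feature, threshold) split with the fewest misclassifications."""
    best = None
    for f in range(len(samples[0])):
        for threshold in sorted({s[f] for s in samples}):
            left = [lab for s, lab in zip(samples, labels) if s[f] <= threshold]
            right = [lab for s, lab in zip(samples, labels) if s[f] > threshold]
            if not left or not right:
                continue  # a split must send samples to both sides
            left_label = max(set(left), key=left.count)     # majority vote
            right_label = max(set(right), key=right.count)  # majority vote
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]

def predict(stump, sample):
    """Apply the learned rule: one comparison per prediction."""
    f, threshold, left_label, right_label = stump
    return left_label if sample[f] <= threshold else right_label

# Hypothetical normalized samples: (delay, loss rate)
samples = [(0.1, 0.0), (0.2, 0.1), (0.8, 0.6), (0.9, 0.7)]
labels = ["normal", "normal", "congested", "congested"]
stump = best_stump(samples, labels)
```

Here the learned rule reads "if feature 0 is at most 0.2, predict normal; otherwise predict congested".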

USING MACHINE LEARNING TO OPTIMIZE LINUX NETWORKING

Nearest Centroid Classifier (NCC):

• NCC is a supervised NN model that uses centroids to define boundaries when classifying data sets. NCCs also are good at classifying sets where probability distributions are unknown and can respond to changes quickly.

• Since network delay patterns are likely to change as devices and users are introduced to or removed from the network, an NCC may be better at adapting to those changes.
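A nearest centroid classifier is small enough to sketch in a few lines: training computes one mean point per class, and prediction picks the class whose centroid is closest. Because refitting is just recomputing those means, the model can be refreshed cheaply when traffic patterns shift. The feature values and class names below are hypothetical.

```python
from collections import defaultdict

def fit_centroids(samples, labels):
    """Return {label: centroid}, where each centroid is the per-class mean."""
    groups = defaultdict(list)
    for sample, label in zip(samples, labels):
        groups[label].append(sample)
    return {label: tuple(sum(dim) / len(members) for dim in zip(*members))
            for label, members in groups.items()}

def predict(centroids, sample):
    """Pick the label whose centroid has the smallest squared distance."""
    return min(centroids,
               key=lambda label: sum((a - b) ** 2
                                     for a, b in zip(sample, centroids[label])))

# Hypothetical normalized samples: (delay, delay variation)
samples = [(0.1, 0.1), (0.2, 0.2), (0.8, 0.9), (0.9, 0.8)]
labels = ["near", "near", "far", "far"]
centroids = fit_centroids(samples, labels)
guess = predict(centroids, (0.25, 0.15))
```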

As an example of the outcomes of our testing, Figure 4 shows the results of classifying the normalized network data using K-Means clustering. Note from Figure 4 that four differentiated outcomes are clearly present: systems (1) “moving toward” or (2) “moving away” from each other, as well as systems experiencing (3) “signal saturation” and (4) “network congestion”. Note that regardless of the test scenario, the ML algorithm can discern valuable information about the networked systems. Although additional analysis can produce more information, these outcomes may be quite useful in certain applications.

Figure 4. ML classification outcomes of Linux kernel network data using the K-Means algorithm. The classified data clearly indicates four different outcomes for external networked systems.

The ideal outcome or classification for the different network testing scenarios is a clear separation between the four groups, as shown in Figure 4. Two of the cases are clearly and individually separated from the group. However, in the cases of congestion and signal over-saturation, the classification overlaps without a clear separation. This lowers the accuracy of the ML model. To improve the classification, we would need to include another parameter in the analysis that helps to discriminate between congestion and saturation.
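The effect of adding a discriminating parameter can be sketched with a toy comparison. The example below scores a nearest-centroid classifier twice on the same samples: once using a single feature on which the two classes overlap, and once with a second feature added. The data is invented, and "signal strength" is a hypothetical choice of extra parameter used only for illustration, not a result from our tests.

```python
def centroid(rows):
    """Mean point of a list of equal-length tuples."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def accuracy(data, features):
    """Nearest-centroid accuracy using only the selected feature indices."""
    def project(row):
        return tuple(row[i] for i in features)

    cents = {lab: centroid([project(r) for r in rows]) for lab, rows in data.items()}
    correct = total = 0
    for lab, rows in data.items():
        for row in rows:
            p = project(row)
            guess = min(cents, key=lambda c: sum((a - b) ** 2
                                                 for a, b in zip(p, cents[c])))
            correct += guess == lab
            total += 1
    return correct / total

# Hypothetical samples: (delay, signal strength); delay alone overlaps.
data = {
    "congestion": [(0.70, 0.20), (0.80, 0.25), (0.75, 0.15), (0.85, 0.22)],
    "saturation": [(0.72, 0.90), (0.82, 0.85), (0.78, 0.95), (0.80, 0.88)],
}
one_feature = accuracy(data, features=[0])
two_features = accuracy(data, features=[0, 1])
```

On this invented data the one-feature score is well below the two-feature score, which mirrors the overlap described above.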

In all cases, the ML techniques analyzed data that is consistently produced internal to the Linux kernel to gain new insights about the external network context. However, since these algorithms are operating on kernel-level data, the issue of efficiency is also important.

Figure 5 summarizes the accuracy and efficiency of classifying the network data using ML algorithms. From Figure 5, it’s clear that the SVM approach is most efficient: it achieves the highest accuracy using the fewest training samples. This means that the SVM classifier can quickly distinguish differences in the kernel data and can produce useful outcomes efficiently. It is also good practice to use a large number of examples to observe how the accuracy of each classifier converges. However, NN models that reach 100% accuracy may produce large failure rates due to overfitting.
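One way to observe this kind of convergence is to train on progressively larger slices of the data and score each model on held-out samples it never saw, which also guards against mistaking overfitting for accuracy. The sketch below does this with a one-dimensional nearest-centroid model on synthetic data; the generator, class centers and sample sizes are all assumptions made for illustration, not the setup behind Figure 5.

```python
import random

def nearest_centroid_fit(train):
    """Return {label: mean feature value} from (value, label) pairs."""
    sums = {}
    for x, label in train:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + x, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}

def accuracy(model, test):
    """Fraction of held-out samples assigned to the correct class."""
    hits = sum(min(model, key=lambda label: abs(x - model[label])) == y
               for x, y in test)
    return hits / len(test)

rng = random.Random(42)

def make(n):
    # Hypothetical 1-D feature: class 0 centered at 0.3, class 1 at 0.7.
    return [(rng.gauss(0.3 + 0.4 * (i % 2), 0.1), i % 2) for i in range(n)]

train, test = make(200), make(200)
# Accuracy on the held-out set as the training slice grows.
curve = [(n, accuracy(nearest_centroid_fit(train[:n]), test))
         for n in (2, 10, 50, 200)]
```

Plotting `curve` for each classifier gives the kind of accuracy-versus-training-size comparison that Figure 5 summarizes.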

Accurate and fast classification of network data using ML techniques may result in Linux systems that can autonomously react based on external network conditions. These reactions may include modifying kernel data, selecting alternate retransmission or transport protocols, or adjusting other internal parameters based on external context. This form of “inference” is a fundamental problem in systems communicating via a network. In our experiments, network data was collected directly from the existing kernel processes and used to train the four ML classifiers to produce interesting and useful inferences.

Our results suggest that the SVM approach may be a promising ML technique for inferring results from kernel networking data. The SVM approach reduces the number of nodes per layer, adjusts precision of the data without losing accuracy and reduces the latency of computation. The output of the inference layer provides the networking stack with additional information to handle different scenarios. The inference development represents a form of customizing network communication between endpoints for a variety of applications. Other network conditions may benefit from inference models that utilize different data produced by the networking stack.

As ML techniques are included in new designs and applications, they have become a valuable tool and part of the implementation process for smarter networks. Concepts related to “machine learning” and “artificial intelligence” may even be implemented inside the Linux kernel to improve networking performance. ◾

Figure 5. Accuracy and speed of classification algorithms when using normalized data from different networking scenarios.


Damian Valles is a second-year Assistant Professor in the Ingram School of Engineering at Texas State University. His goal is not to partially tear another Achilles Heel anytime soon while staying active. Damian welcomes your comments at [email protected] or Twitter: @VallesDamian.

Stan McClellan has been an avid Linux user and network experimenter for many years. One of his interests is using Linux to help Damian avoid another athletic injury. He can be reached at [email protected].

Send comments or feedback via http://www.linuxjournal.com/contact or email [email protected].

Resources

• “Introduction to Stream Control Transmission Protocol” by Jan Newmarch, LJ, September 2007

• Silver, D., Schrittwieser, J., et al., “Mastering the game of Go without human knowledge”, Nature 550: 354–359, Macmillan Publishers Limited, DOI: 10.1038/nature24270

• NVIDIA TensorRT Programmable Interface Accelerator

• “Introduction to K-means Clustering” by Andrea Trevino on December 6, 2016

• Towards Data Science: Support Vector Machine—Introduction to Machine Learning Algorithms by Rohith Gandhi on June 7, 2018

• Decision Trees (DTs): A. Navada, A. N. Ansari, S. Patil, and B. A. Sonkamble, “Overview of the use of decision tree algorithms in machine learning”, 2011 IEEE Control and System Graduate Research Colloquium, Shah Alam, 2011, pp. 37–42

• Nearest Centroid Classifier (NCC): V. Praveen, K. Kousalya and K. R. Prasanna Kumar, “A nearest centroid classifier-based clustering algorithm for solving vehicle routing problem”, 2016 2nd International Conference on Advances in Electrical, Electronics, Information, Communication, and Bioinformatics (AEEICB), Chennai, 2016, pp. 414–419
