Academic year: 2022

http://www.europeanjournalofscientificresearch.com

Improved K-Means Clustering Algorithm to Analyze Students Performance for Placement Training using R-Tool

T. Thilagaraj

Part-Time Ph.D (Category – B), R&D Centre, Bharathiar University, Coimbatore, and Assistant Professor in Computer Applications, Kongu Arts and Science College, Erode, Tamil Nadu, India
E-mail: thilagaraj.t@gmail.com

N. Sengottaiyan

Director and Professor, Sri Shanmugha College of Engineering and Technology, Sankari, Tamil Nadu, India

E-mail: nsriram3999@gmail.com

Abstract

Clustering is the procedure of organizing objects into groups based on some kind of similarity among them. Placement training is necessary for students' careers and for successful placement in industry. Selecting students for the training is important, and forming groups among them using some strategy is a difficult task: a common activity for all students will not lead each of them along the right path. Clustering the students is needed to overcome this problem. The K-Means clustering algorithm chooses its centroid randomly, so the clusters differ according to the centroid value; this failure to produce consistent results is the main drawback of K-Means. An improved K-Means clustering algorithm is therefore proposed that clusters student performance by computing an initial centroid and choosing the best centroid from it to produce the required clusters.

Keywords: K-means, clustering, data mining, placement

1. Introduction

Data mining is very useful for extracting hidden patterns from massive data. A variety of algorithms is available with good scalability and accuracy, and data mining continues to pose substantial problems while producing the solutions researchers expect [1]. A variety of tools is also available to deal with these issues, and researchers all over the world keep updating their methodologies to suit current situations. The most widely used techniques in data mining are association, clustering and classification.

The present paper deals with clustering, an unsupervised learning technique. Clustering does not estimate a target value; it segments the entire data set into homogeneous clusters and does not depend on a predefined object or class. Two categories of clustering algorithms are available: hierarchical and partitioning [2]. Here, educational data clustering groups students by academic performance for the placement training scenario, which helps the management understand the level of each group and provide better training for it.
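As an illustrative aside (not code from the paper, which applies K-Means in R), the hierarchical family mentioned above can be sketched in a few lines: instead of partitioning around centroids, it repeatedly merges the two closest groups until the desired number of clusters remains. The toy performance percentages below are invented for the example.

```python
def single_link_merge(points, k):
    """Naive agglomerative (hierarchical) clustering on 1-D values:
    repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link distance: closest pair across the two clusters
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]   # merge the closest pair of clusters
        del clusters[j]              # safe: j > i
    return clusters

scores = [35, 38, 62, 65, 88, 90]    # invented performance percentages
groups = single_link_merge(scores, 3)
```

A partitioning algorithm such as K-Means would instead fix k representative centroids up front and assign every point to its nearest one.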


The standard K-Means algorithm is the most widely used in various applications. Its objective is to produce groups of data objects that are similar to one another. The algorithm randomly selects initial centroids from among the available data objects, then calculates the distance from each centre to the other objects; objects at minimum distance are grouped together to form one cluster. If the assignment of a centroid changes, the cluster membership also changes according to the distance calculation [3].
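The sensitivity to random initialization can be seen in a minimal one-dimensional K-Means sketch (illustrative Python, not the paper's R code): different seeds pick different starting centroids and can settle on different groupings of the same marks.

```python
import random

def kmeans(values, k, seed, iters=20):
    """Minimal 1-D standard K-Means with random initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)        # random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # assign each value to its nearest centroid
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # recompute each centroid as the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(sorted(c) for c in clusters if c)

marks = [45, 47, 52, 70, 72, 74, 90, 93]     # invented sample marks
run_a = kmeans(marks, 2, seed=1)
run_b = kmeans(marks, 2, seed=7)
```

Both runs partition the same data, but because the starting centroids depend on the seed, the two partitions need not coincide, which is exactly the drawback the proposed algorithm removes.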

2. Related Work

Nazeer and Sebastian [4] proposed an enhanced K-Means algorithm that combines a systematic method for finding the initial centroids with a better way of assigning data points to clusters. Each centroid is reassigned by calculating the mean of its data points, and a heuristic method is used to improve the process. The algorithm is more efficient and also reduces complexity.

Yedla et al. [5] proposed a method that reduces time complexity by finding better initial centroids and assigning data points to clusters efficiently. The algorithm achieves good accuracy in less time than the original K-Means and needs no extra inputs such as threshold values, although the number of clusters is still required, as in all such models.

Fahim et al. [6] proposed a method that speeds up the distance calculations made among objects; its structure is simple enough for any researcher to understand easily.

Goyal and Kumar [7] proposed an algorithm that handles both consistent and non-consistent data sets and formulates the initial centroids to produce better clustering. The distance values are first sorted and divided into equal parts; the mean of each part then gives an initial centroid for the clustering.

Raval and Jani [8] proposed an algorithm in which the initial centroids are calculated and each data point is assigned to its nearest object; the mean and distance values between data points are then calculated using the middle value.

Duan et al. [9] proposed an algorithm that finds the density of all samples and passes the high-density samples to the next level; the clusters are then found using the k cluster centres.

Yuan et al. [10] proposed a method that finds object density using density-sensitive similarity measures. The minimum distance between any two candidate points is computed, the average density is calculated, and the initial centroids are found to make the clusters.

3. Proposed K-Means Clustering Algorithm

Input:
stid = {sid1, sid2, sid3, …, sidn} // Student Id
pv = {val1, val2, val3, …, valn} // Student performance percentage values

Output:
A set of 4 clusters, in the ranges low, medium, high and very high, found using the distance values.

Algorithm:

Step 1: The vectors 'stid' and 'pv' are passed as arguments to the function placement.

Step 2: Function placement(s, v) begins, where 's' and 'v' are the vectors that receive 'stid' and 'pv' respectively.

Step 3: Find the minimum value of the vector 'v' and store it in 'mv'. Then find the length of 'v' and store it in 'n'.


Step 4: To find the distance 'd', subtract 'mv' from 'v'.

Step 5: Find Max(d) and Min(d). Compute the initial centroid value (Max(d) - Min(d))/2 and store it in 'cd'.

Step 6: Assign the initial centroid value cd to cd1 and cd2. Initialize the vector variables dva, vd1, vd2, c1, c2, v1, v2 to zero for further processing.

Step 7: For each i = 1 to n.

7.1 For each j = 1 to n.

7.2 If the distance d[j] is equal to the initial centroid cd1, then store the distance d[j] in 'vd1', its corresponding vector value v[j] in 'v1' and the initial centroid value cd1 in 'c1'. End if.

7.3 If 'c1' is not equal to zero, break. End if.

7.4 End for.

7.5 If 'c1' is not equal to zero, break. End if.

7.6 Decrement the initial centroid 'cd1' by 1.

End for.

Step 8: For each i = 1 to n.

8.1 For each j = 1 to n.

8.2 If the distance d[j] is equal to the initial centroid cd2, then store the distance d[j] in 'vd2', its corresponding vector value v[j] in 'v2' and the initial centroid value cd2 in 'c2'. End if.

8.3 If 'c2' is not equal to zero, break. End if.

8.4 End for.

8.5 If 'c2' is not equal to zero, break. End if.

8.6 Increment the initial centroid 'cd2' by 1.

End for.

Step 9: Now calculate bd1 = cd - c1 and bd2 = c2 - cd.

Step 10: To find the best centroid, compare bd1 and bd2. If bd1 is less than or equal to bd2, assign 'v1' as the best centroid and store its distance value 'vd1' in 'dva'. If bd1 is greater than bd2, assign 'v2' as the best centroid and store its distance value 'vd2' in 'dva'.

Step 11: To find the first cluster, assign j = 1 and a1 = a2 = 0.

11.1 For each i = 1 to n.

11.2 If the distance value d[i] is less than or equal to the best centroid distance 'dva', then assign the student-id value s[i] to 'a1[j]' and the performance value v[i] to 'a2[j]', and increment j by 1. End if.

11.3 End for.

Step 12: To find the second cluster, assign j = 1 and b1 = b2 = 0.

12.1 For each i = 1 to n.

12.2 If the distance value d[i] is greater than the best centroid distance 'dva', then assign the student-id value s[i] to 'b1[j]' and the performance value v[i] to 'b2[j]', and increment j by 1. End if.

12.3 End for.

Step 13: Return a1, a2, b1, b2 as a list, then convert them to vectors for further clustering.

Step 14: Assign stid = a1 and pv = a2, then call the function placement again (go to Step 1).

Step 15: Two clusters are returned from the function placement: the low-range and medium-range clusters.

Step 16: Assign stid = b1 and pv = b2, then call the function placement again (go to Step 1).

Step 17: Another two clusters are returned from the function placement: the high-range and very-high-range clusters. The four ranges of clusters have now been found from the given values using their distance values.
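The steps above can be sketched compactly as follows (illustrative Python; the authors implemented the algorithm in R-Tool). The decrement/increment search of Steps 7–8 is written here as its direct equivalent: the nearest existing distance value at or below, and at or above, the initial centroid. The student ids and scores are invented for the example.

```python
def placement(s, v):
    """Split students (ids s, performance values v) into two clusters
    around the best centroid derived from the midpoint of the distances."""
    mv = min(v)                                   # Step 3
    d = [x - mv for x in v]                       # Step 4: distances
    cd = (max(d) - min(d)) / 2                    # Step 5: initial centroid
    # Steps 7-8: nearest actual distance value at or below / at or above cd
    below = max(x for x in d if x <= cd)
    above = min(x for x in d if x >= cd)
    # Steps 9-10: whichever lies closer to cd gives the best centroid distance
    dva = below if (cd - below) <= (above - cd) else above
    # Steps 11-12: split the students on the best centroid's distance value
    a = [(si, vi) for si, vi, di in zip(s, v, d) if di <= dva]
    b = [(si, vi) for si, vi, di in zip(s, v, d) if di > dva]
    return a, b

ids = list(range(1, 9))                           # invented student ids
scores = [35, 40, 55, 60, 65, 75, 85, 95]         # invented performance values

first, second = placement(ids, scores)            # Steps 13-14
low, medium = placement([i for i, _ in first], [p for _, p in first])
high, very_high = placement([i for i, _ in second], [p for _, p in second])
```

With these invented scores the split is low {1, 2, 3}, medium {4, 5}, high {6, 7} and very high {8}; because the centroid is derived deterministically from the data, repeating the run always yields the same four clusters.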


4. Results and Discussions

Figure 1: Overall Performance of Students

In the figure above, the overall performance parameter is calculated for 100 students as the average of their Academic, Technical, Aptitude and Interpersonal marks. The X axis shows the overall performance and the Y axis shows the student Id. The algorithm above was implemented using R-Tool.
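The overall-performance parameter described above is a simple per-student mean of the four marks; a minimal sketch follows (the field names and mark values are assumptions for illustration, since the paper does not publish its raw data, and the paper's own computation was done in R-Tool).

```python
# Hypothetical records for two students; the paper uses 100 students.
students = {
    101: {"academic": 72, "technical": 65, "aptitude": 58, "interpersonal": 70},
    102: {"academic": 88, "technical": 91, "aptitude": 84, "interpersonal": 79},
}

# Overall performance = mean of the four component marks per student.
overall = {sid: sum(marks.values()) / len(marks)
           for sid, marks in students.items()}
```

These overall values are the 'pv' vector that the placement function clusters.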

Figure 2: Two groups of cluster in overall Performance

In Figure 2, all of the students' performance values are clustered into two groups using the best centroid value.


Figure 3: First level of cluster group

In the figure above, the first part of the clustered group is separated and taken to the next level. The same methodology is applied again here to find the low- and medium-range cluster groups.

Figure 4: Two clusters found from first level group

In the figure above, the two clusters are displayed in two different shapes and colors: the filled square (red) marks the low-range cluster and the upward-pointing triangle (blue) marks the medium-range cluster.

Figure 5: Second level of cluster group

In the figure above, the remaining part of the initial cluster group is separated and taken to the next level, and the same method is applied again to cluster this group.

Figure 6: Two clusters grouped from second level of cluster

In the figure above, another two clusters are found: the solid circle (dark green) marks the high-range cluster and the square-plus (magenta) marks the very-high-range cluster.


Figure 7: The complete view of all clusters

In the figure above, the filled square (red) represents the low-range cluster, the upward-pointing triangle (blue) the medium-range cluster, the solid circle (dark green) the high-range cluster, and the square-plus (magenta) the very-high-range cluster.

Table 1: Student performance clustering sizes

S.No.  Category    Cluster Size
1      Low         3
2      Medium      43
3      High        28
4      Very High   26

The table above shows the sizes of the low-, medium-, high- and very-high-range clusters.

5. Conclusion

The standard K-Means algorithm, with its randomized initial centre, leads to different forms of clusters on each run. The proposed method instead measures the distance between the maximum and minimum data points to find the initial centroid, calculates the distances of the nearest objects from that initial centroid, and fixes the minimum-distance object as the best centroid. The clusters are then grouped using this centroid object. Because the centroid value is fixed, the clusters do not change across a series of iterations. The proposed method is well suited to finding the different levels of student performance, and it helps the management provide training methods suitable for students according to their levels.


References

[1] J. Wang and X. Su, "An improved K-Means clustering algorithm," in 2011 IEEE 3rd International Conference on Communication Software and Networks, 2011: IEEE, pp. 44-46.

[2] R. Vij and S. Kumar, "Improved k-means clustering algorithm for two dimensional data," in Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology, 2012: ACM, pp. 665-670.

[3] G. Usman, U. Ahmad, and M. Ahmad, "Improved k-means clustering algorithm by getting initial centroids," World Applied Sciences Journal, vol. 27, no. 4, pp. 543-551, 2013.

[4] K. A. Nazeer and M. Sebastian, "Improving the Accuracy and Efficiency of the k-means Clustering Algorithm," in Proceedings of the World Congress on Engineering, 2009, vol. 1: Association of Engineers, London, pp. 1-3.

[5] M. Yedla, S. R. Pathakota, and T. Srinivasa, "Enhancing K-means clustering algorithm with improved initial center," International Journal of computer science and information technologies, vol. 1, no. 2, pp. 121-125, 2010.

[6] A. Fahim, A. Salem, F. A. Torkey, and M. Ramadan, "An efficient enhanced k-means clustering algorithm," Journal of Zhejiang University-Science A, vol. 7, no. 10, pp. 1626-1633, 2006.

[7] M. Goyal and S. Kumar, "Improving the initial centroids of K-means clustering algorithm to generalize its applicability," Journal of The Institution of Engineers (India): Series B, vol. 95, no. 4, pp. 345-350, 2014.

[8] U. R. Raval and C. Jani, "Implementing and Improvisation of K-means Clustering," Int. J. Comput. Sci. Mob. Comput., vol. 5, no. 5, pp. 72-76, 2016.

[9] Y. Duan, Q. Liu, and S. Xia, "An improved initialization center k-means clustering algorithm based on distance and density," in AIP Conference Proceedings, 2018, vol. 1955, no. 1: AIP Publishing, p. 040046.

[10] Q. Yuan, H. Shi, and X. Zhou, "An optimized initialization center K-means clustering algorithm based on density," in 2015 IEEE International Conference on Cyber Technology in Automation, Control, and Intelligent Systems (CYBER), 8-12 June 2015 2015, pp. 790-794, doi: 10.1109/CYBER.2015.7288043.
