2017 2nd International Conference on Advances in Management Engineering and Information Technology (AMEIT 2017) ISBN: 978-1-60595-457-8

Application of Multi-Panel Scatter Plot in Data Analysis

Xiu-hua GENG 1 and Xin PAN 2,*

1 School of Software and Information Engineering of Beijing Information Vocational Technical College, Beijing 100018, China

2 College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia 010018, China

* Corresponding author

Keywords: R language, Lattice package, Multi-panel, Scatter plot, xyplot().

Abstract. With the arrival of the big data era, data visualization is becoming increasingly important. With its powerful features and open-source advantages, R is changing the "ecosystem" of quantitative analysis, and the lattice package of the R language provides a comprehensive system for visualizing univariate and multivariate data. This paper describes how to use the high-level plotting functions in the lattice package to show the relationships among variables under discrete or continuous conditioning variables.

Introduction

With the explosive growth of Internet data, traditional technology architectures find it increasingly difficult to meet the needs of massive data processing. Many new technologies and tools have emerged to address massive data storage and query latency [1]. R is a complete software system for data processing, computation and graphics; with its powerful features and attractive price (free), it is changing the "ecosystem" of quantitative analysis [2]. Lattice was originally an R implementation of the Trellis graphics of S-PLUS, and Trellis is a method of multivariate visualization [3]. The lattice package shows multivariate relationships intuitively through one-, two- or three-dimensional conditional plots, in which each subset of the data is displayed in a separate panel. Its main feature is that it decomposes the data into subsets according to a particular variable and draws one plot per subset; the xyplot() function draws multi-panel scatter plots in this way. This article explains the function and characteristics of xyplot() and shows how to apply it to display the distribution of variables, or their association with other variables, under discrete or continuous conditioning variables.

Xyplot Function

The typical formula for the xyplot() function is y ~ x | A. The variables to the left of the vertical bar (also called the pipe symbol) are the primary variables, and the variable to the right is the conditioning variable. y ~ x means that y and x are mapped to the vertical and horizontal axes respectively. y ~ x | A means that the scatter plot shows the relationship between y and x for each value of the conditioning variable A; the role of the conditioning variable is to generate the panels. When there is no conditioning variable, the result of xyplot() is similar to that of the ordinary plot() function, and the data are drawn in a single panel [4].
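The difference between the two formula forms can be checked without looking at the picture. The following is a minimal sketch, assuming the iris data set that ships with R; the object names single and multi are illustrative only.

```r
library(lattice)

# No conditioning variable: every observation is drawn in one panel.
single <- xyplot(Sepal.Length ~ Sepal.Width, data = iris)

# Conditioning on a factor: one panel per level of Species.
multi <- xyplot(Sepal.Length ~ Sepal.Width | Species, data = iris)

# xyplot() returns a "trellis" object rather than drawing immediately;
# dim() reports the number of levels of each conditioning variable.
class(multi)
dim(multi)

# Printing the object draws it on the current graphics device.
print(single)
print(multi)
```

At the interactive prompt the print() calls are not needed, because typing the object name prints it automatically.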

Table 1. Iris Data.

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
4.9           3.0          1.4           0.2          setosa
5.0           2.0          3.5           1.0          versicolor
7.1           3.0          5.9           2.1          virginica
6.3           2.9          5.6           1.8          virginica
6.4           3.2          4.7           1.4          versicolor

Conditioning variables are usually factors, but a continuous variable can also be used. When a continuous variable is used as a conditioning variable, each of its values is treated as a discrete value by default. Since such variables usually take many values, they need to be divided into intervals, which the functions shingle() and equal.count() can do [5]. In this paper, all graphics are based on the iris data set that comes with R (version 3.2.4); Table 1 shows part of it. The data set records measurements of 150 iris flowers. The measurements are Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, taken on three species of iris: setosa, versicolor and virginica.
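Before plotting, it can help to confirm the structure of the data set described above. This short sketch uses only base R functions on the built-in iris data frame.

```r
# iris ships with R (package datasets), so nothing has to be downloaded.
str(iris)               # four numeric measurements plus the factor Species
head(iris, 3)           # first three rows
table(iris$Species)     # number of flowers per species
sapply(iris[1:4], range)  # minimum and maximum of each measurement
```

Species is stored as a factor, which is why it can serve directly as a discrete conditioning variable.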

Multi-Panel Scatter Plots of Discrete Conditional Variables

Before using the lattice package, it must be loaded into memory with the command [6]:

> library(lattice)

To study the relationship between sepal length and sepal width using the data in Table 1, sepal width can be taken as the independent variable x and sepal length as the dependent variable y, giving the formula Sepal.Length ~ Sepal.Width.

The R command is as follows:

> xyplot(Sepal.Length ~ Sepal.Width, data = iris, xlab = "Sepal Width", ylab = "Sepal Length", main = "Iris Sepal Length and Sepal Width")

This graph differs little from one drawn with the plot() function, but unlike plot(), xyplot() can group the points by setting a parameter. For example, to study the relationship between sepal length and sepal width further, the species of the iris can be set as a grouping variable by adding groups = Species to the command. The R command is as follows:

> xyplot(Sepal.Length ~ Sepal.Width, data = iris, xlab = "Sepal Width", ylab = "Sepal Length", main = "Iris Sepal Length and Sepal Width", groups = Species, auto.key = TRUE)

The graph drawn by the above command is shown in Figure 1.

Figure 1. Single-panel grouping. Figure 2. Multi-panel grouping.

In Figure 1, each of the three species (setosa, versicolor and virginica) is drawn in its own color. The parameter auto.key = TRUE displays the legend. Although the species are distinguished by color in this graph, all the information is displayed in one panel and the colors are mixed together, which makes the graph inconvenient to read. We therefore draw the species on separate panels: we again plot sepal length against sepal width, but this time the species is set as the conditioning variable. The command is as follows:

> xyplot(Sepal.Length ~ Sepal.Width | Species, data = iris, xlab = "Sepal Width", ylab = "Sepal Length", main = "Iris Sepal Length and Sepal Width (Multi-panel)", layout = c(3, 1))

The formula for this plot is Sepal.Length ~ Sepal.Width | Species. Species, after the vertical bar, is a discrete conditioning variable with three values: setosa, versicolor and virginica. The plot therefore has three panels, and layout = c(3, 1) arranges them in three columns and one row, as shown in Figure 2.
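Because the layout argument is given as c(columns, rows), it is easy to transpose by accident. The sketch below, with illustrative object names, draws the same three panels once side by side and once stacked.

```r
library(lattice)

# layout = c(columns, rows): three panels side by side in a single row.
side_by_side <- xyplot(Sepal.Length ~ Sepal.Width | Species, data = iris,
                       layout = c(3, 1))

# Swapping the two numbers stacks the same panels in a single column.
stacked <- xyplot(Sepal.Length ~ Sepal.Width | Species, data = iris,
                  layout = c(1, 3))

print(side_by_side)
print(stacked)
```

The side-by-side arrangement shares the vertical axis across panels, which suits a comparison of sepal length between species.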

Analysis: The first panel shows the data for the setosa species, the second panel shows versicolor, and the third shows virginica. Figure 2 shows that, in general, the sepals of setosa are the shortest and those of virginica the longest, with versicolor in between. In the graph above there is only one dependent variable, sepal length, but in practice several dependent variables can be specified. For example, while observing the relationship between sepal length and sepal width, we may also want to see whether petal length is associated with sepal width. The formula for this case is Sepal.Length + Petal.Length ~ Sepal.Width | Species. The command is as follows:

> xyplot(Sepal.Length + Petal.Length ~ Sepal.Width | Species, data = iris, xlab = "Sepal Width", ylab = "Sepal Length", main = "Iris Sepal Length and Sepal Width (Multi-dependent Variables)", layout = c(3, 1), auto.key = TRUE)

The graphic is shown in Figure 3:

Figure 3. Multi-panel and Multi-dependent Variables. Figure 4. Multi-panel and Multi-independent Variables.

Analysis: Figure 3 still has three panels, but each panel shows two sets of data: the blue points at the top represent sepal length, and the magenta points at the bottom represent petal length. Figure 3 shows that, for a given sepal width, the sepals of setosa are much longer than its petals; the two lengths are closer for versicolor and closest for virginica.
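When the two responses overlap too much to read in one panel, they can be separated instead of overlaid. The following sketch relies on the outer argument of xyplot(), which the text above does not use; the object names and the axis label "Length" are illustrative choices.

```r
library(lattice)

# Default (outer = FALSE): both responses share each panel and are
# distinguished by color, as in the multi-dependent-variable plot.
overlaid <- xyplot(Sepal.Length + Petal.Length ~ Sepal.Width | Species,
                   data = iris, ylab = "Length", auto.key = TRUE)

# outer = TRUE: each response gets its own set of panels, so the plot
# has one panel per combination of species and response.
separate <- xyplot(Sepal.Length + Petal.Length ~ Sepal.Width | Species,
                   data = iris, ylab = "Length", outer = TRUE)

dim(overlaid)  # levels of each conditioning variable in the overlaid plot
dim(separate)  # the separated plot has an additional conditioning dimension

print(overlaid)
print(separate)
```

Overlaying makes the gap between sepal and petal length easy to see within a species, while separate panels avoid overplotting.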

Several independent variables can be specified in the same way. The command is as follows:

> xyplot(Sepal.Length ~ Sepal.Width + Petal.Width | Species, data = iris, xlab = "Sepal Width", ylab = "Sepal Length", main = "Iris Sepal Length and Sepal Width (Multi-independent Variables)", layout = c(3, 1), auto.key = TRUE)

The plot is shown in Figure 4.

Analysis: In Figure 4, the points on the left of each panel show the relationship between sepal length and petal width, and the points on the right show the relationship between sepal length and sepal width. Figure 4 shows that the sepal width of each species is generally greater than its petal width; the difference is largest for setosa, smallest for virginica, and intermediate for versicolor.

In addition, there can be multiple independent and dependent variables at the same time, for example:

Sepal.Length + Petal.Length ~ Sepal.Width + Petal.Width | Species

Multi-Panel Scatter Plots of Continuous Conditional Variables

Conditioning variables are usually factors, but a continuous variable can also be used. When a continuous variable is used as a conditioning variable, each of its values is treated as a discrete value by default, and such variables usually take many values. The continuous variable therefore needs to be split into intervals, which the function equal.count() can do. For example, suppose we want to use petal length as the conditioning variable when observing the relationship between sepal length and sepal width. Because petal length is continuous, we use equal.count() to divide it into several intervals, as follows:

> length <- equal.count(iris$Petal.Length, number = 4, overlap = 0)

The first parameter is the object to be split, here the Petal.Length variable of the iris table. The second parameter is the number of intervals, here set to 4. The third parameter is the degree of overlap, where 0 means that adjacent intervals do not overlap. This statement divides Petal.Length into four non-overlapping intervals, which lattice calls a shingle, and the result is then used as the conditioning variable. The command is as follows:

> xyplot(Sepal.Length ~ Sepal.Width | length, data = iris, xlab = "Sepal Width", ylab = "Sepal Length", main = "Iris Sepal Length and Sepal Width", sub = "Petal Length as Conditional Variable")

The plot is shown in Figure 5:

Figure 5. Multi-panel for Continuous Condition Variable.

Analysis: By default lattice fills the panels starting from the bottom left. In Figure 5, the panel in the second row of the first column therefore shows the irises with the shortest petals, while the irises with the longest petals are shown in the first row of the second column.
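The intervals produced by equal.count() and the order of the panels can both be inspected directly. The sketch below is an illustration rather than part of the original analysis: the name petal_len is chosen here to avoid masking the base function length(), and the as.table and layout settings are optional additions.

```r
library(lattice)

# Split petal length into four intervals with no requested overlap.
petal_len <- equal.count(iris$Petal.Length, number = 4, overlap = 0)

class(petal_len)    # a "shingle" object
levels(petal_len)   # lower and upper bound of each interval

# as.table = TRUE orders panels like a table, from the top left,
# so the interval with the shortest petals comes first when reading.
p <- xyplot(Sepal.Length ~ Sepal.Width | petal_len, data = iris,
            xlab = "Sepal Width", ylab = "Sepal Length",
            as.table = TRUE, layout = c(2, 2))
print(p)
```

The strip at the top of each panel shades the part of the petal length range that the panel covers, so the intervals can also be read from the plot.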

Conclusion

The lattice package of the R language provides a comprehensive system for visualizing univariate and multivariate data. This article describes how to use the high-level plotting function xyplot() in the lattice package to draw multi-panel scatter plots. Such a graph shows the distribution of some variables, or their association with other variables, under one or more conditioning variables. A conditioning variable can be discrete or continuous. If it is continuous, it needs to be divided into a number of intervals first.

References

[1] Yang Xia and Wu Dongwei. Application of R Language in Big Data Processing [J]. Science & Technology Information, 2013(23):19-20.

[2] Luis Torgo. Data Mining with R: Learning with Case Studies, Second Edition [M]. CRC Press, New York, NY, USA, 2016.

[3] [EB/OL]. [2017-01-09]. http://cm.bell-labs.com/cm/ms/departments/sia/project/trellis/index.html.

[4] Robert I. Kabacoff. R in Action: Data Analysis and Graphics with R [M]. Manning Publications Co., Shelter Island, 2011.

[5] Alain F. Zuur, Elena N. Ieno and Erik H.W.G. Meesters. A Beginner's Guide to R [M]. Springer Science+Business Media, LLC, 2009.
