Preliminary experiment with 3-D artificial data

7.2.1 Experiments with artificial data

7.2.1.1 Preliminary experiment with 3-D artificial data

We start by providing some initial impressions of the cartogram representation and by investigating the hypothesis that data outliers will be mapped onto areas of high distortion of the visualization space expressed as a cartogram. A simple statistic, described in [206] and extended to the GTM in [259], will be used to characterize to what extent a data point x_n can be considered to be an outlier. It is defined as

O_n = \sum_{k=1}^{K} r_{kn} \beta \| y_k - x_n \|^2,

where r_{kn} ≡ p(u_k | x_n) is the responsibility defined in equation (4.15). An outlier is expected to yield comparatively large values of O_n.
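As a minimal sketch of how this statistic could be evaluated, the NumPy snippet below computes O_n for all data points at once. The function names are hypothetical, and the `responsibilities` helper assumes the standard GTM form with isotropic Gaussian noise of inverse variance β and equal prior weights on the latent points; equation (4.15) itself is not reproduced here, so that helper should be read as an assumption rather than as the exact expression used in this work.

```python
import numpy as np

def squared_distances(X, Y):
    """Squared Euclidean distances ||y_k - x_n||^2, returned with shape (K, N)."""
    return ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)

def responsibilities(X, Y, beta):
    """r_kn under the ASSUMED standard GTM form: softmax over k of -beta/2 * ||y_k - x_n||^2."""
    log_r = -0.5 * beta * squared_distances(X, Y)
    log_r -= log_r.max(axis=0, keepdims=True)  # subtract the column maximum for numerical stability
    R = np.exp(log_r)
    return R / R.sum(axis=0, keepdims=True)    # each column sums to one

def outlier_statistic(X, Y, R, beta):
    """O_n = sum_k r_kn * beta * ||y_k - x_n||^2 for every data point n."""
    return beta * (R * squared_distances(X, Y)).sum(axis=0)

# Toy illustration with made-up values: two prototypes and two points,
# the second point lying far from both prototypes.
Y = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
X = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 5.0]])
beta = 1.0
R = responsibilities(X, Y, beta)
O = outlier_statistic(X, Y, R, beta)
```

In this toy case the second point receives the larger value of O_n, as expected for a point far from every prototype.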

For this experiment, a total of 1,500 3-D points were randomly drawn from three spherical Gaussians (500 points each), all with unit variance and with centres sitting at the vertices of an equilateral triangle. Using 3-D data allows the direct visualization of the model prototypes y_k (and, as a result, of the generated manifold) in the observed data space. The data were modeled using a GTM with a 20 × 20 grid of latent points.
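A data set of this kind could be generated along the following lines. This is only a sketch: the side length of the triangle, its placement in the z = 0 plane, and the random seed are not given in the text and are assumptions made for illustration.

```python
import numpy as np

def make_three_clusters(n_per_cluster=500, side=6.0, seed=0):
    """Draw points from three unit-variance spherical Gaussians in 3-D.

    `side` (triangle side length) and `seed` are hypothetical choices.
    """
    rng = np.random.default_rng(seed)
    # Vertices of an equilateral triangle lying in the z = 0 plane
    centres = np.array([
        [0.0, 0.0, 0.0],
        [side, 0.0, 0.0],
        [side / 2.0, side * np.sqrt(3.0) / 2.0, 0.0],
    ])
    # Unit variance in every direction: standard normal noise around each centre
    X = np.vstack([c + rng.standard_normal((n_per_cluster, 3)) for c in centres])
    labels = np.repeat(np.arange(3), n_per_cluster)
    return X, labels, centres

X, labels, centres = make_three_clusters()
```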

Nine outliers, in three groups of three each, were first added to the previously described data:

• Three outliers at the edges of the triangle (type A): These three outliers are away from the clusters, over the edges of the imaginary triangle defined by them, and within the plane in which this imaginary triangle would lie.

• Three outliers near the centroid of the triangle (type B): These three outliers are away from the clusters, near the centroid of the imaginary triangle defined by the three clusters, and within the plane in which this imaginary triangle would lie.

• Three outliers outside the triangle but not far away from the plane in which it lies (type C): One of them (C1) is located in the direction of one of the cluster centres, at right angles to the plane defined by the three cluster centres, but not too far from the cluster itself; a second one (C2) is located near the centroid of the imaginary triangle defined by the three clusters; and a third (C3) is located in between two clusters, over one of the edges of the imaginary triangle defined by the three clusters. All three are atypical in one way or another with respect to the rest of the data set. The GTM, though, fits these data very differently.
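The three outlier types can be made concrete with a small sketch. The exact coordinates used in the experiment are not reported, so every number below (triangle side, out-of-plane offset, in-plane offsets near the centroid) is a hypothetical choice that only reproduces the qualitative layout described above.

```python
import numpy as np

side = 6.0    # hypothetical triangle side length
height = 3.0  # hypothetical out-of-plane offset for type C

# Cluster centres at the vertices of an equilateral triangle in the z = 0 plane
centres = np.array([
    [0.0, 0.0, 0.0],
    [side, 0.0, 0.0],
    [side / 2.0, side * np.sqrt(3.0) / 2.0, 0.0],
])
centroid = centres.mean(axis=0)
midpoints = np.array([(centres[i] + centres[(i + 1) % 3]) / 2.0 for i in range(3)])
normal = np.array([0.0, 0.0, 1.0])  # unit normal to the triangle plane

# Type A: on the edges of the triangle, within its plane
type_a = midpoints.copy()
# Type B: near the centroid, within the plane (small offsets toward each vertex)
type_b = centroid + 0.1 * (centres - centroid)
# Type C: moderately off the plane, above a cluster (C1), the centroid (C2) and an edge (C3)
type_c = np.array([
    centres[0] + height * normal,
    centroid + height * normal,
    midpoints[0] + height * normal,
])

outliers = np.vstack([type_a, type_b, type_c])
```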

Results and discussion

The original data, together with the nine outliers, are superimposed in Figure 7.1 (top row, left) on the prototypes y_k and on the approximation of the manifold in which they lie, as generated by the GTM. This smoothly stretching manifold lies near the plane defined by the triangle of clusters, which means that the outliers have not exerted much influence on the GTM data fitting process. The latent space mapping of these data using the posterior mean projection described in Section 4.3 is also displayed in Figure 7.1 (top row, right).
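For illustration, the posterior mean projection can be sketched as the responsibility-weighted average of the latent points, which is its usual definition for the GTM. The helper names and the [-1, 1] range of the latent grid are assumptions; Section 4.3 is not reproduced here.

```python
import numpy as np

def latent_grid(n_side=20):
    """Regular n_side x n_side grid of latent points in [-1, 1]^2 (ASSUMED range)."""
    ticks = np.linspace(-1.0, 1.0, n_side)
    uu, vv = np.meshgrid(ticks, ticks)
    return np.column_stack([uu.ravel(), vv.ravel()])  # shape (K, 2)

def posterior_mean_projection(R, U):
    """Map each data point to sum_k r_kn * u_k.

    R: (K, N) responsibilities with columns summing to one; U: (K, 2) latent points.
    """
    return R.T @ U  # shape (N, 2)

U = latent_grid(20)  # K = 400 latent points, as in the 20 x 20 grid of the experiment
```

A point whose responsibility is concentrated on a single latent node is mapped onto that node, whereas a point with responsibility spread over several nodes is mapped in between them.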

The corresponding MF and cartogram can be seen in Figure 7.1 (center row, left and right, respectively). Areas of high distortion neatly separate the three clusters, and the area of highest distortion roughly corresponds to the central area of the imaginary cluster triangle. An interesting effect can be observed: the manifold is less distorted in the directions that join each pair of clusters (compare the rope-like MF features in Figure 7.1 (center row, left), which link the areas in which the clusters are mapped, with the manifold folding at the edges of the imaginary cluster triangle visualized in Figure 7.1 (top row, left)).
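To give a feel for what the MF measures, the sketch below approximates it numerically as sqrt(det(JᵀJ)), where J is the Jacobian of the mapping from latent space to data space. Note that this is a finite-difference stand-in computed from the prototype lattice, not the closed-form GTM expression; the function name and array layout are assumptions made for the example.

```python
import numpy as np

def magnification_factor(Y_grid, spacing):
    """Finite-difference estimate of sqrt(det(J^T J)) at every lattice node.

    Y_grid: (n_rows, n_cols, D) prototypes arranged on the latent lattice.
    spacing: distance between neighbouring latent points (same on both axes).
    """
    # Partial derivatives of y along the two latent axes
    dy_dv, dy_du = np.gradient(Y_grid, spacing, axis=(0, 1))
    # Entries of the 2 x 2 matrix J^T J at every node
    g11 = (dy_du * dy_du).sum(axis=-1)
    g22 = (dy_dv * dy_dv).sum(axis=-1)
    g12 = (dy_du * dy_dv).sum(axis=-1)
    # Clip tiny negative values caused by rounding before taking the root
    return np.sqrt(np.maximum(g11 * g22 - g12 ** 2, 0.0))

# Sanity check on a flat sheet y = (2u, 3v, 0): the area scaling is constant
ticks = np.linspace(-1.0, 1.0, 5)
uu, vv = np.meshgrid(ticks, ticks)
Y_flat = np.stack([2.0 * uu, 3.0 * vv, np.zeros_like(uu)], axis=-1)
mf_flat = magnification_factor(Y_flat, spacing=ticks[1] - ticks[0])
```

On a stretched or folded manifold the same computation yields larger values where neighbouring prototypes are pulled apart, which is the kind of region shown as high distortion in the MF map.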

The mapping of the nine outliers is quite telling. Outliers of type A are located in areas of relatively high distortion as measured by the MF, but they are not as well characterized as outliers by the O_n measure, as displayed in Figure 7.1 (bottom row). This is caused by the aforementioned lower distortion in the directions that join each pair of clusters, which results in a relatively higher concentration of prototypes. Outliers of type B, by contrast, are neatly mapped onto areas of relatively high distortion. This is consistent with their values of MF but, again, because they lie so close to the manifold, they are not well characterized by the O_n measure. In fact, the examples of types A and B illustrate a limitation of the O_n measure itself: it becomes a poor indicator of atypicality if outliers lie close to the manifold. Finally, outliers of type C show mixed behavior: those roughly over the triangle centroid and edges behave similarly to their counterparts of types A and B, whereas the one lying approximately perpendicular to the manifold and over one of the clusters is assigned to the prototypes that represent that cluster. Thus, even if all these points show high O_n values, the latter is assigned to a low-MF area and will thus not be visualized in the high-distortion areas of the cartogram.

Figure 7.1: Cartogram visualization for the first of the outlier experiments with 3-D data. Top row: left) direct visualization of the 3-D observed data together with the model prototypes y_k, linked according to the lattice of corresponding latent points (in an approximation of the GTM-defined manifold). The nine outliers are shown as gray symbols, characterized (A, B, C) as described in the main text; right) GTM visualization map using the posterior mean data projection. Center row: left) the MF, color-coded over the GTM visualization map, with scale column; right) cartogram representation. Bottom row: values of O_n versus MF for all data points, including the nine outliers.

Guideline 1: When atypical data are away from the areas of main data density but still near the model manifold, their mapping location can be unexpected and, as a result, they might not always end up in the areas of highest distortion. Rules here are likely to depend on the NLDR method used. For GTM, data points that are located near the manifold and away from the directions linking pairs of clusters are likely to end up in the highly distorted areas of the cartogram. Instead, data points that are located near the manifold and in the general directions linking pairs of clusters might well be mapped away from clusters but not in highly distorted cartogram areas. Finally, data points that are only moderately away from both the clusters and the manifold might not always be mapped either away from clusters or in areas of high distortion of the cartogram. In summary, the data analyst might benefit from isolating the data points mapped onto the high-distortion areas of the cartogram to further investigate them as potential outliers, while bearing in mind that some of the outliers might not be amenable to this characterization.
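One way an analyst might isolate such candidates is sketched below. It assigns each data point to its most responsible latent node (a mode projection, used here as a simplification of the posterior mean projection) and keeps the points whose node has an MF above a chosen quantile. The function name and the 0.9 quantile are hypothetical choices, not values taken from the experiments.

```python
import numpy as np

def high_distortion_candidates(R, mf_nodes, quantile=0.9):
    """Return indices of data points mapped onto high-MF latent nodes.

    R: (K, N) responsibilities; mf_nodes: (K,) MF value at each latent node.
    `quantile` is a HYPOTHETICAL cut-off that the analyst would need to tune.
    """
    # Mode projection: each point goes to the node with the largest responsibility
    assigned = R.argmax(axis=0)
    mf_points = mf_nodes[assigned]
    # Nodes above this MF value are treated as the high-distortion area
    threshold = np.quantile(mf_nodes, quantile)
    return np.flatnonzero(mf_points >= threshold), mf_points

# Toy illustration with made-up values: four nodes, one of them highly distorted
mf_nodes = np.array([1.0, 1.0, 1.0, 10.0])
R = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])
candidates, mf_points = high_distortion_candidates(R, mf_nodes)
```

As the guideline warns, points flagged this way are only candidates for further inspection, and atypical points lying close to the manifold along the directions that link clusters may not be flagged at all.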

For the next part of this experiment, a different set of three clear outliers (type D) was added to the original data described previously. These points were located farther away from the plane defined by the centres of the three clusters, at distances from it that were larger than the inter-cluster distances. The GTM was fitted to this augmented data set and the results are shown in Figure 7.2.

Figure 7.2: Cartogram visualization for the second outlier experiment with 3-D data. Representation as in the previous figure.

The direct visualization of the fitted manifold is revealing: just three outliers are enough to exert quite a pull on this manifold, stretching it considerably towards them (see Figure 7.2, top row, left). The result is that they are mapped onto latent points that are away from the clusters (see Figure 7.2, top row, right) and that correspond to the stretched part of the manifold, as seen in Figure 7.2 (center row, left). This is despite the fact that one of them (D1) is located in the direction of one of the cluster centres, at right angles to the plane defined by the three cluster centres; a second one (D2) is located approximately in the direction of the centroid of the imaginary triangle defined by the three clusters, at right angles to the same plane; and a third (D3) is located in the direction of one of the edges of the imaginary triangle defined by the three clusters, at right angles to the same plane.

This is unlike the previous experiment, where the location of the outliers in relation to the clusters affected their mapping location. Unsurprisingly, they all end up confined to the high-distortion area of the cartogram, displayed in Figure 7.2 (center row, right). This can again be quantitatively assessed by comparing the values of the statistic O_n and the MF, as reported in Figure 7.2 (bottom row): the three outliers simultaneously show high values of O_n and MF.

Guideline 2: Outliers clearly away from the areas of main data density are likely to be mapped into areas of high distortion. This should at least be the case for unregularized NLDR models or density models based on Gaussian distributions (or other distributions that do not behave well in the presence of outliers). Therefore, the data analyst might benefit from isolating the data points mapped onto the high-distortion areas of the cartogram to further investigate their atypicality.

In document Exploration of customer churn routes using machine learning probabilistic models (Page 107-111)