Unconstrained Clustering - The Use of Point Pattern Analysis in Archaeology: Some Methods and A

Unconstrained Clustering is a technique proposed by Robert Whallon (1984) for identifying activity areas from the spatial locations of several artifact types found on living floors. While this might seem a rather narrow focus, in the 1980s spatial statistics in North American archaeology seems to have been directed almost entirely at that objective. In his 1984 paper, Whallon critiques the then-current methods of spatial analysis, noting that all are constrained in various ways. For example, his Nearest Neighbour analysis (Whallon 1974) carries an implicit constraint: it cannot handle clusters of varying density. Consider two knapping stations, one where a single biface was produced and another where several were produced. The Nearest Neighbour distances will be lower in the denser cluster even though the two clusters might be compositionally identical. In contrast, Unconstrained Clustering was designed to minimize the number of constraints that come into play. For a more detailed description of Unconstrained Clustering see Whallon (1984) or Kintigh (1990).

As envisioned by Whallon (1984:244), the technique is a seven-step process. It begins by creating smoothed density contours for each artifact type. The smoothed values are then taken at each point (artifact location) and combined into a vector of values for that point. These vectors are then converted to proportional densities for each point, thus removing variable artifact density from the process. Cluster analysis is then used to combine data points into reasonably homogeneous clusters, which can then be plotted and examined for spatial integrity. If this integrity is demonstrated, then description of the clusters and their interpretation can proceed. Keith Kintigh (1990:192) proposed a variant of this procedure, which he considers to be “more direct and, I believe, theoretically preferable to Whallon’s proposed procedure.” This methodology substitutes local density calculations (Johnson 1984) for the smoothed density and proportional vectors used in Whallon’s procedure. Further, Kintigh (2015) went on to provide a software product (TFQA) that can be used to perform Unconstrained Clustering.
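The proportional-density step in Whallon's procedure can be illustrated with a minimal Python sketch. It assumes the smoothed densities have already been sampled at each artifact location; the function name and the example values are hypothetical and are not taken from Whallon (1984) or from TFQA.

```python
def proportional_densities(density_vector):
    """Rescale one point's per-type densities so that they sum to 1."""
    total = sum(density_vector)
    if total == 0:
        raise ValueError("point has zero density for every artifact type")
    return [value / total for value in density_vector]


# Hypothetical smoothed densities (flakes, cores, bifaces) at two points:
# the second station is four times denser but has the same composition.
sparse_station = [2.0, 1.0, 1.0]
dense_station = [8.0, 4.0, 4.0]

sparse_profile = proportional_densities(sparse_station)
dense_profile = proportional_densities(dense_station)
print(sparse_profile)  # [0.5, 0.25, 0.25]
print(dense_profile)   # [0.5, 0.25, 0.25]
```

Both stations end up with the same proportional vector, which is the sense in which variable artifact density is removed before clustering.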

The analysis in this study was conducted with TFQA, which is well documented; the documentation is available online (Kintigh 2015). Unconstrained Clustering in TFQA requires running three different programs, and while each is well documented on its own, how the three interact to carry out Unconstrained Clustering is not clearly described. The remainder of this section describes what was learned about conducting Unconstrained Clustering in TFQA through a series of questions to Kintigh, as well as trial and error.

In Kintigh’s procedure, the first step is to run a Local Density Analysis on all the types in the study area. Local Density Analysis is a procedure developed by Johnson (1984) that measures the local density of one artifact type around a second artifact type and divides this result by the density the first artifact type would have if it were randomly distributed over the study area. The size of the area around each point of the second type is a user-specified variable. This creates a statistic where a value of 1 indicates that the two artifact types are not associated, a value lower than 1 indicates that they are segregated, and a value greater than 1 indicates a greater degree of association. If multiple types are present, Kintigh’s software calculates a pair-wise matrix of the local density coefficient for each pair of types in the original data. This spatial statistic is a reasonable one on its own, but Kintigh’s software also includes the option to create a file that can be used for Unconstrained Clustering. The program in TFQA that performs this operation is LDEN. At this stage of Unconstrained Clustering, the key parameter entered is the radius of the area over which local density is calculated. This parameter is critical, since different radii can give better or worse results. There are no rules for selecting the radius, so the only approach is to try different values, run the entire Unconstrained Clustering process for each, and compare the results. In the Kellis 2 example in Chapter 5, five radii were tried: 5, 7.5, 10, 15 and 20 m. The 5 m and 7.5 m radii gave consistently worse results than the others and were discarded.
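The local density coefficient described above can be sketched in a few lines of Python. This is only an illustration of the ratio as worded here, not a reimplementation of LDEN: it applies no edge correction, it assumes the two types are different (so no point is counted around itself), and the function name and coordinates are hypothetical.

```python
import math


def local_density_coefficient(type_a, type_b, radius, study_area):
    """Density of type A within `radius` of type B points, relative to
    the density of type A over the whole study area."""
    circle_area = math.pi * radius ** 2
    # Count type A points falling within the radius of each type B point.
    counts = [
        sum(1 for ax, ay in type_a if math.hypot(ax - bx, ay - by) <= radius)
        for bx, by in type_b
    ]
    local_density = (sum(counts) / len(type_b)) / circle_area
    global_density = len(type_a) / study_area
    return local_density / global_density


# Hypothetical coordinates in metres on a 100 m x 100 m study area.
flakes = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (50.0, 50.0)]
cores_near = [(0.5, 0.5)]
cores_far = [(90.0, 90.0)]

associated = local_density_coefficient(flakes, cores_near, 2.0, 100.0 * 100.0)
segregated = local_density_coefficient(flakes, cores_far, 2.0, 100.0 * 100.0)
print(associated > 1.0, segregated < 1.0)  # True True
```

Re-running the same call with several radii gives a quick feel for how sensitive the coefficient is to that choice.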

The next stage of the analysis in TFQA is the program KMEANS, which performs the cluster analysis. One of the key user inputs to this program is the number of clusters expected. Kintigh’s (personal communication 2014) recommendation is to start with twice the number of clusters you expect. The program then proceeds iteratively, dividing the input points into more and more clusters. To measure goodness of fit for both the local density radius and the number of clusters, a statistic called the Sum Squared Error (SSE) is calculated, which essentially measures misfit in cluster assignment. Smaller values of the SSE represent a better fit to the data. The SSE is very high for one cluster and gradually falls as more clusters are split off. After some number of clusters is reached, division into further clusters produces only a slight reduction in SSE. Thus, the number of clusters can be determined fairly simply from a single run by examining the SSE values and locating the number of clusters beyond which only marginal improvement is seen. This number is essentially the “knee of the curve” in non-mathematical parlance. To find the best local density radius, it is necessary to run a series of Unconstrained Clusterings and track how the radius affects the SSE values. When comparing multiple runs, the best form of SSE to track is the %SSE, which expresses the SSE at the nth cluster as a percentage of the SSE with one cluster.
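The SSE, %SSE and "knee" reasoning can be made concrete with a short Python sketch. The SSE here is the usual k-means quantity (squared distance of each vector from its cluster centroid, summed over all points). The knee rule, including its `min_gain` threshold, is a hypothetical stand-in for visual inspection of the curve and is not something TFQA is known to compute; all values are invented for illustration.

```python
def sse(vectors, labels):
    """Sum of squared distances from each vector to its cluster centroid."""
    clusters = {}
    for vec, lab in zip(vectors, labels):
        clusters.setdefault(lab, []).append(vec)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            (v - c) ** 2 for vec in members for v, c in zip(vec, centroid)
        )
    return total


def percent_sse(sse_by_k):
    """Express each SSE as a percentage of the one-cluster SSE (first entry)."""
    base = sse_by_k[0]
    return [100.0 * s / base for s in sse_by_k]


def knee(pct_sse, min_gain=5.0):
    """Smallest cluster count after which one more cluster lowers the
    %SSE by less than `min_gain` percentage points (assumed rule)."""
    for k in range(1, len(pct_sse)):
        if pct_sse[k - 1] - pct_sse[k] < min_gain:
            return k
    return len(pct_sse)


# Four hypothetical points: one cluster versus two clusters.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(sse(points, [0, 0, 0, 0]))  # 104.0
print(sse(points, [0, 0, 1, 1]))  # 4.0

# Hypothetical SSE values for 1 to 5 clusters.
curve = percent_sse([200.0, 80.0, 20.0, 18.0, 17.0])
print(curve)        # [100.0, 40.0, 10.0, 9.0, 8.5]
print(knee(curve))  # 3
```

Because %SSE is scaled to the one-cluster solution, curves from runs with different local density radii can be laid over one another directly.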

While a smaller SSE does represent a better fit to the data, it does not indicate whether the result is statistically significant. The program also provides the option of conducting a series of randomizations on the input data, which can then be used to plot significance envelopes. If the SSE of the real data falls outside the range of these randomizations, the results are statistically significant, although the program does not calculate a p-value. In Figure 3-1, the narrow lower line is the actual %SSE. Clearly, as more clusters are created, the value decreases, with the knee of the curve occurring around 15 clusters. The thicker line higher in the graph represents the results of the randomizations, such that the slope envelopes are represented by the top and bottom of the thick line. However, simple examination of Figure 3-1 shows that the results of

The results of Unconstrained Clustering are then plotted with the program KMPLT. This program can plot both the actual versus random SSE values for various degrees of clustering and a map of the clusters, with a cluster number assigned to each point in the original data. The downside of the plot program (and to some extent all of TFQA) is that it was designed to run under DOS with a command-line user interface. While this is acceptable for running the calculations, the plots are not up to modern standards of publication quality. Further, KMPLT does not run in the current Windows Command Prompt and thus requires a third-party tool, DOSBox, to run at all. The resulting graph must then be captured with Print Screen. However, the values are all available and can be used as input to modern graphics programs.
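Since the values are available outside KMPLT, the comparison of the actual %SSE against the randomized runs can also be reproduced by hand. The Python sketch below assumes the %SSE curves have already been transcribed from the KMEANS output into plain lists; how they are exported is not covered here, and the function names and numbers are hypothetical.

```python
def randomization_envelope(random_curves):
    """Lowest and highest randomized %SSE at each cluster count."""
    lower = [min(values) for values in zip(*random_curves)]
    upper = [max(values) for values in zip(*random_curves)]
    return lower, upper


def outside_envelope(observed, random_curves):
    """Flag, per cluster count, whether the observed %SSE lies below
    every randomized run."""
    lower, _ = randomization_envelope(random_curves)
    return [obs < low for obs, low in zip(observed, lower)]


# Hypothetical %SSE values for 1 to 4 clusters.
observed = [100.0, 40.0, 10.0, 9.0]
random_runs = [
    [100.0, 70.0, 55.0, 45.0],
    [100.0, 72.0, 52.0, 44.0],
    [100.0, 68.0, 54.0, 47.0],
]

lower, upper = randomization_envelope(random_runs)
print(lower)  # [100.0, 68.0, 52.0, 44.0]
print(upper)  # [100.0, 72.0, 55.0, 47.0]
print(outside_envelope(observed, random_runs))  # [False, True, True, True]
```

The `lower` and `upper` lists correspond to the bottom and top of the thick randomization band, and can be passed to any plotting package together with the observed curve.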

Figure 3-1 is a plot of the %SSE and the associated slope envelopes from one of the runs with the Kellis 2 data. Figure 3-2 shows the plot of various clusters from this run.

Figure 3-2: TFQA Plot of Unconstrained Clustering

In document The Use of Point Pattern Analysis in Archaeology: Some Methods and Applications (Page 66-70)