Data Sources: Characteristics, Validation and Study Area
4.3 Geographical Data and Area Classifications
This section describes the boundary datasets and area classifications used in the research. The geography of GB is built on hierarchies of geographies based on a variety of systems. For example, the UK is divided by administrative, census, electoral, environmental, postal and historical boundaries. The complex hierarchies and linkages between these composite geographies are explained in greater detail by Dennett (2010), who demonstrates how the small area geographies eventually aggregate up to the national level. In doing so, Dennett (2010) also highlights the problem that not all lower level geographies are compatible. For example, electoral wards aggregate into both districts and parliamentary constituencies, but these two geographies cannot be harmonised with one another. As such, all the geographic boundaries used in this research will be based on a system in which each level can be aggregated into the next (Section 4.3.1).
In conjunction, area classifications involve the classification of areas into different groups on the basis of the similarity of characteristics of selected attribute features. They provide a unique way of bringing together spatial patterns from a range of variables, and identifying similarities and dissimilarities between areas (Webber and Craig, 1978; Everitt et al., 2001; Sleight, 2004; Vickers and Rees, 2007). Furthermore, a classification scheme is a convenient technique for organising a large dataset into groupings, which makes it much easier to process the information and to see patterns in the distribution of the different types of area (Vickers and Rees, 2007). Amongst the most commonly used area classifications are geodemographic classifications. Geodemographics is the analysis of people by where they live, and works on the principle that place and population are inextricably linked (Sleight, 2004).
Geodemographics can be said to be effective because similar people and households cluster spatially (Vickers and Rees, 2007). Consequently, knowing information about one person enables information about others in that locality to be broadly inferred (Weiss, 2000; Sleight, 2004). Furthermore, geodemographics has a long history of application in retailing (Birkin et al., 2002; 2010). In particular, it can be argued that geodemographics is a shorthand label for both the development and application of area typologies that have proven to be powerful discriminators of consumer behaviour and aids to market analysis (Brown, 1991).
4.3.1 Boundary Data
Boundary datasets define geographical areas and are essential for mapping attribute data that are not released as individual points. For this research, the boundary data comprise a combination of administrative and census boundaries. To achieve this, however, a considerable amount of data cleaning and manipulation had to be undertaken. For example, the level of geography recorded in the Acxiom Ltd micro data (see Section 4.6) is the postcode level. However, postcodes are not a spatially stable form of geography (Raper et al., 1992), as the building of new housing, commercial or industrial premises leads to changes in the postcode listings. Equally, demolition of property leads to postcodes becoming (temporarily at least) redundant. This causes problems when working with postcode data taken from different reference periods (as is the nature of this research) because the changes to the boundaries are difficult to reconcile with changes in the population (Raper et al., 1992). Additionally, postcode geographies (areas, districts and sectors) cannot be uniformly aggregated up into census geographies, and individual postcodes do not contain populations of similar size. Consequently, all responses from the Acxiom data were matched against corresponding Output Areas (OAs), the smallest spatial unit within the 2001 Census boundary system (OAs were also used for 2011 Census data dissemination). This was achieved by using the National Statistics Postcode Directory (NSPD), which provides details of the locations of current and historic postcodes along with details of other geographic areas in which each postcode is located (ONS, 2012a).
As the micro-level data was delivered for different years, the corresponding NSPD for each year was used to increase the likelihood of a match.
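The year-specific matching described above can be sketched as follows. This is a minimal illustration only: the directory extracts, the OA codes and the function names (`normalise_postcode`, `match_to_oa`) are hypothetical placeholders rather than real NSPD records or the procedure actually used, and a real directory would be read from file rather than typed in.

```python
from typing import Optional

# Hypothetical extracts of the postcode directory for two reference years,
# mapping postcode -> OA code. All codes are invented placeholders.
NSPD_BY_YEAR = {
    2005: {"LS2 9JT": "OA_0001", "LS6 1AN": "OA_0002"},
    2010: {"LS2 9JT": "OA_0001", "LS6 4QX": "OA_0003"},
}


def normalise_postcode(raw: str) -> str:
    """Upper-case a postcode and put a single space before the last three characters."""
    compact = "".join(raw.split()).upper()
    return f"{compact[:-3]} {compact[-3:]}"


def match_to_oa(postcode: str, survey_year: int) -> Optional[str]:
    """Look a postcode up in the directory for its own survey year.

    Returns None when the postcode is absent from that year's extract.
    """
    lookup = NSPD_BY_YEAR.get(survey_year, {})
    return lookup.get(normalise_postcode(postcode))


# Toy survey responses with inconsistently formatted postcodes.
responses = [
    {"id": 1, "postcode": "ls29jt", "year": 2005},
    {"id": 2, "postcode": "LS6 1AN", "year": 2010},  # absent from the 2010 toy extract
    {"id": 3, "postcode": "LS6  4QX", "year": 2010},
]

matched = [dict(r, oa=match_to_oa(r["postcode"], r["year"])) for r in responses]
unmatched_ids = [r["id"] for r in matched if r["oa"] is None]
```

Keeping unmatched responses visible (here via `unmatched_ids`) rather than silently dropping them makes it possible to report the match rate for each survey year.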
Once all the postcode data within the Acxiom micro data had been converted into OAs, the responses could be cleaned to remove all OAs outside Yorkshire and the Humber and London. This was done for every year of survey data to ensure consistency within the analysis. From this point, other geographies within the census boundary system could be matched to each respondent’s OA through a series of lookup tables. The 2001 and 2011 Census boundary structures form a much more appropriate system than postcodes. They consist of a hierarchical subdivision of UK local government areas of various types down to sub-authority areas, such as wards, and further to lower levels created specifically for census purposes, such as OAs. In addition to OAs, Super Output Areas (SOAs) and LADs were also linked to the respondent data. There are two layers of SOA geography: Lower Super Output Areas (LSOAs) and the larger Middle Super Output Areas (MSOAs), as described in Table 4.1. Built from groups of 2001 OAs, SOAs were designed to improve the reporting of small area statistics since they are of a consistent size and have fixed boundaries and more homogeneous populations. The comparability and stability of the geography is a key benefit to users of statistics, and one that cannot be provided by other small area geographies such as wards, parishes or postcodes. As such, SOAs were chosen as the smallest level of geography to be used within the regional level analysis (Chapters 6, 7 and 8). It was decided that OAs would be too small a geography because they would cause small-number problems when using the micro-level data.
Table 4.1 SOA description
Geography | Minimum population | Mean population | Construction
LSOA | 1,000 | 1,500 | Built from groups of OAs (typically five) and constrained by the boundaries of the Standard Table (ST) wards used for 2001 Census outputs.
MSOA | 5,000 | 7,200 | Built from groups of Lower Layer SOAs and constrained by the 2003 local authority boundaries used for 2001 Census outputs.
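The chain of lookup tables, the regional filter and the small-number concern can be illustrated with a short sketch. All codes and respondent records below are invented placeholders, the helper `attach_geographies` is hypothetical, and for brevity the sketch attaches the higher geographies before filtering by region, whereas the text above filters first.

```python
from collections import Counter

# Hypothetical lookup tables; the area codes are placeholders, not real ONS codes.
OA_TO_LSOA = {"OA1": "LSOA_A", "OA2": "LSOA_A", "OA3": "LSOA_B", "OA4": "LSOA_C"}
LSOA_TO_MSOA = {"LSOA_A": "MSOA_X", "LSOA_B": "MSOA_X", "LSOA_C": "MSOA_Y"}
MSOA_TO_LAD = {"MSOA_X": "Leeds", "MSOA_Y": "Manchester"}
LAD_TO_REGION = {"Leeds": "Yorkshire and the Humber", "Manchester": "North West"}
STUDY_REGIONS = {"Yorkshire and the Humber", "London"}


def attach_geographies(oa: str) -> dict:
    """Walk up the census hierarchy from an OA to its region."""
    lsoa = OA_TO_LSOA[oa]
    msoa = LSOA_TO_MSOA[lsoa]
    lad = MSOA_TO_LAD[msoa]
    return {"oa": oa, "lsoa": lsoa, "msoa": msoa, "lad": lad,
            "region": LAD_TO_REGION[lad]}


# One OA code per (toy) survey respondent.
respondent_oas = ["OA1", "OA1", "OA2", "OA3", "OA4"]
linked = [attach_geographies(oa) for oa in respondent_oas]

# Keep only respondents inside the study regions.
in_study = [r for r in linked if r["region"] in STUDY_REGIONS]

# Respondent counts per area at two levels of the hierarchy.
oa_counts = Counter(r["oa"] for r in in_study)
lsoa_counts = Counter(r["lsoa"] for r in in_study)
```

Comparing `oa_counts` with `lsoa_counts` shows the motivation for working at SOA level: the same respondents are spread more thinly across OAs than across LSOAs, so rates calculated per OA rest on fewer observations.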
In addition to the smaller geographic boundary data, LADs/UAs were also chosen for the more aggregate level analysis of GB (Section 4.2.1). It can be argued that a smaller geographic unit such as Census Area Statistics (CAS) wards would have been more appropriate; however, anything smaller would have been difficult to manage and visualise at the national level. LADs/UAs sit within the hierarchy of administrative areas relating to national and local government as well as the higher-level structure of census geography. The administrative boundary pyramid is complicated: not only are there several layers, but the boundaries of many of the layers in the hierarchy are subject to periodic or occasional change. One recent major reorganisation of local government took place in 2009, when nine new UAs were created. This involved the counties of Bedfordshire and Cheshire being abolished and each being split into two UAs. Five further counties were abolished and reconstituted as five separate UAs: Cornwall, County Durham, Northumberland, Shropshire and Wiltshire (ONS, 2012d). Additionally, in 2011, there were also plans to create two new UAs in Exeter and Suffolk, although these were revoked by Parliament. For this thesis, because much of the analysis is concerned with pre-recession trends, the original 2001 LAD boundary system with 409 districts will be used for consistency.
At this point, it is important to stress that any geographic analysis which organises data into discrete areal units presents a set of more conceptual problems associated with mapping variables at different scales. These include the Modifiable Areal Unit Problem (MAUP) and the related problem of the ecological fallacy. The MAUP is associated with the problem of organising data into discrete areal units (Openshaw, 1984). It has been a challenge for spatial analysts since it was first identified by Gehlke and Biehl (1934), and has been outlined by a number of subsequent authors (Openshaw, 1984; Wrigley et al., 1996). The MAUP contains two problems: the first relates to scale, the second to zoning. In terms of scale, patterns identified in data at one geographical scale (number of zones) may not present themselves at a different level of aggregation. In terms of zoning, two zone systems with the same number of areal units may give different patterns. The ecological fallacy emerges from the practice of ‘ecological inference’ described by King et al. (2004) and is the problem of inferring something at a lower level of aggregation from something observed at a higher level.
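Both components of the MAUP can be shown with a small numerical sketch. The grid, populations and counts below are invented purely for illustration, and `zone_rates` is a hypothetical helper: eight small areas on a 2 x 4 grid are aggregated in three different ways, and the resulting zone rates are compared.

```python
# Toy 2 x 4 grid of small areas, each with 100 residents; all values are invented.
# Index layout:   0 1 2 3   (top row, west to east)
#                 4 5 6 7   (bottom row)
POPULATION = [100] * 8
CASES = [10, 30, 30, 10,
         10, 30, 30, 10]


def zone_rates(zones):
    """Rate (cases / population) for each zone, where a zone is a tuple of area indices."""
    return [
        sum(CASES[i] for i in zone) / sum(POPULATION[i] for i in zone)
        for zone in zones
    ]


# Zoning effect: two systems with the same number of zones (four).
columns = [(0, 4), (1, 5), (2, 6), (3, 7)]     # pair areas north-south
row_pairs = [(0, 1), (2, 3), (4, 5), (6, 7)]   # pair areas west-east

# Scale effect: the same areas aggregated into only two zones.
halves = [(0, 1, 4, 5), (2, 3, 6, 7)]

rates_columns = zone_rates(columns)      # zones differ: low, high, high, low
rates_row_pairs = zone_rates(row_pairs)  # every zone has the same rate
rates_halves = zone_rates(halves)        # every zone has the same rate
```

The underlying small-area data are identical in all three cases. The north-south pairing preserves the contrast between the central and outer columns, whereas the west-east pairing and the coarser two-zone system both average it away.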
Whilst it will be impossible to avoid these two issues within the research, provisions will be made to reduce their impact as much as possible. For example, as stated, census geographies will be used because they contain much more uniform populations than, say, postcodes. Additionally, whenever boundary data or area classifications are used for spatial analysis, the smallest feasible level of geography will be used in order to retain important spatial patterns. Finally, and most importantly, the thesis will also take advantage of a unique set of individual household data (see Section 4.5) that are not constrained by geographic boundaries.
4.3.2 Output Area Classification (OAC)
The Office for National Statistics (ONS) 2001 Output Area Classification (OAC) groups geographic areas according to key characteristics that are common to the population in that grouping. The classification was produced jointly by the ONS and researchers at the University of Leeds (Vickers and Rees, 2007) and forms part of a suite of geodemographic area classifications that were produced by the ONS from the 2001 Census. For instance, classifications of LADs (discussed in Section 4.3.3), statistical wards and health areas are also available. However, the OAC, produced at the OA level, has a number of advantages over other classifications. First of all, it is the only classification accredited as a ‘National Statistic’ and represents a useful tool for identifying key results from the 2001 Census. Furthermore, for geographic analysis, OAs provide a more stable geography and a very fine resolution for data analysis. Additionally, unlike other classification schemes such as Mosaic (Experian), the methodology is fully documented and all the data used in the classification are available from the 2001 Census (Vickers and Rees, 2007). Consequently, it is more appropriate for academic research than some of the more up-to-date commercial segmentation packages.
The classification itself was produced using an extensive k-means cluster analysis based on 41 variables. These variables were chosen because they were the most successful at creating distinct clusters of people and are listed in Vickers and Rees (2007). The OAC assigns each output area to one of 7 ‘Supergroups’, 21 ‘Groups’ and 52 ‘Subgroups’. Table 4.2 provides a list of the classification names and demonstrates how the main ‘Supergroups’ divide into the more detailed ‘Groups’. The ‘Subgroup’ part of the classification has no cluster name associated with it and is therefore not included. For a more detailed description of the cluster groups, see Vickers and Rees (2007).
Table 4.2. OAC cluster names
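Supergroup | Supergroup cluster name | Group | Group cluster name
1 | Blue collar communities | 1a | Terraced blue collar
1 | Blue collar communities | 1b | Younger blue collar
1 | Blue collar communities | 1c | Older blue collar
2 | City Living | 2a | Transient communities
2 | City Living | 2b | Settled in the city
7 | Multicultural | 7a | Asian communities
7 | Multicultural | 7b | Afro-Caribbean communities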
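For readers unfamiliar with k-means, the clustering technique named above can be sketched in a few lines. This is a generic implementation of Lloyd's algorithm on invented two-variable data, not the procedure, variables or standardisation used to build the OAC; the function name `kmeans`, the toy points and the choice of starting centroids are all assumptions made for illustration.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until labels stop changing."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Six toy areas described by two variables already scaled to the range 0-1.
areas = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.10),
         (0.80, 0.90), (0.85, 0.80), (0.90, 0.95)]

# Start from two of the observations as initial centroids (an arbitrary choice).
labels, centres = kmeans(areas, centroids=[areas[0], areas[3]])
```

With these toy data the first three areas fall into one cluster and the last three into the other. In practice the result depends on the starting centroids and on how the input variables are scaled, which is why a fully documented methodology matters for reproducibility.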
It must be noted that an inevitable reservation with this style of geodemographic analysis is the degree of averaging which takes place, even when the OAs are relatively small neighbourhoods. ‘Senior Communities’, for example, cannot be expected to completely exclude younger residents (the ‘ecological fallacy’). Unfortunately, however, similar academic classifications of individuals or households were not available for this purpose. Longley and Singleton (2009) suggest a classification of households with specific reference to their online behaviours, and a general purpose individual and household classification is currently in development (Burns, forthcoming PhD).
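The averaging problem can be made concrete with a toy example. The age bands and counts below are invented, and labelling an area by its most common age band is a deliberate simplification of how a geodemographic label is derived.

```python
from collections import Counter

# Invented age bands for the ten residents of one hypothetical small area.
residents = ["65+"] * 6 + ["25-44"] * 3 + ["under 25"] * 1

# Label the area by its most common age band (a simplification of a cluster label).
counts = Counter(residents)
area_label, n_matching = counts.most_common(1)[0]

# Share of residents who do not fit the label attached to their area.
share_not_matching = (len(residents) - n_matching) / len(residents)
```

Here the area would be labelled by its older residents, yet four in ten residents belong to other age bands, so applying the area label to any one individual would frequently be wrong.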
4.3.3 Local Authority District Classification
As stated, the analysis of LADs will be carried out for GB to highlight some of the more aggregate level changes happening in the British grocery market. Therefore, in addition to the OAC (associated with the micro-level data), the classification of LADs developed by Vickers et al. (2003) will be used for the national level analysis. This area classification was also produced using 2001 Census data and assigns each district in the UK to a ‘Family’, ‘Group’ and ‘Class’ based on a range of socio-economic and demographic characteristics. Other general-purpose district level classifications are also available, such as the three tier system developed by the ONS (ONS, 2004) and the urban-rural classification produced by the Department for Environment, Food and Rural Affairs (DEFRA) (DEFRA, 2009). However, once again, the Vickers et al. (2003) classification has been selected for this analysis because of its comprehensive and transparent methodology and because it makes a more logical distinction between rural and urban areas than the ONS classification, whilst also separating London and prospering commuter areas from other districts.