6 RESEARCH DESIGN

6.1.3 Theory-driven versus Data-driven

As computing power and the volume of available information have grown, complex data-driven modelling techniques have come to offer an alternative to predetermined and reductionist models. A potential contradiction arises here: data-driven evaluation produces results that are treated as legitimate without being subject to theoretical restrictions and with less explicit criteria, whereas in standard theory-driven positivist modelling, data evaluation is legitimised only in the context of a well-formed theory. Data-driven techniques, such as REEM trees, do not require a theory base and can deal with the previously mentioned shortcomings and methodological errors inherent in neoclassical economics (Adam and Westlund, 2013). Thus, the results are expected to be more dynamic, complex and sophisticated (Kitchin, 2014:3).

However, it could be erroneous to focus only on data. Although data-driven modelling does not impose additional assumptions, it still requires contextualisation within existing knowledge; it may be limited in scope and may produce only one kind of knowledge (Crampton et al., 2012). While these algorithms enable rich representations of complex economic systems and advanced reasoning capabilities, a key challenge is finding the right balance between leveraging computational resources and applying theory, because the data may capture interactions that do not necessarily provide meaningful insights into the questions addressed. Having said that, all the available data about a phenomenon can be included in the modelling, and even those interactions that appear irrational according to our current understanding may be worth acknowledging.

As a result, this research project follows the paradigm of positivism but employs both data-driven and theory-driven modelling in an attempt to gain a more complete understanding of the phenomenon. Given the criticism of positivism, it seems reasonable to suggest that data-driven modelling could at least partly reduce its drawbacks. The availability of both larger datasets and more sophisticated techniques should allow greater complexity to be modelled. Although Hutchison's (1941) positivist seeks to reduce a phenomenon to universal and abstract principles and tends to fragment human behaviour, data-driven reasoning may help to counteract this reductionism by including relationships that were not predetermined within the analysis.

Overall, as a researcher, I adopt the belief that the world of social interactions exists independently of what I perceive it to be; it is a broadly rational, external entity and responsive to scientific modes of inquiry. Also, I believe that data analysis may be useful both before and after theory development. The combination of both could not only advance the knowledge of Small Business Rate Relief but also answer recent calls for evidence concerning this topic.

6.2 DATA

This section describes how the data was chosen, modified and, where appropriate, cleaned. It starts by reviewing the datasets used by UK researchers on topics related to BRs, since the methodology depends on the available data sources. This review influenced the choice of the datasets, which are described in Section 6.2.1. The section then expands on the key variables that were used in both the survival and productivity analyses and on how issues such as missingness and computation errors were resolved (Section 6.2.2.4).

6.2.1 Datasets

Previous studies suggested that macro-level analyses, particularly in TFP estimation, are being replaced by micro-level analyses. The Empirical Review discussed several UK papers dealing with issues related to BRs. These studies used various secondary data sources. These included the Census of Production (Mair, 1987), the Department of Environment floor-space data, Central Statistics Office estimates of county- and national-level gross domestic product (GDP) components (Bennett and Krebs, 1988), Inland Revenue Survey of Personal Incomes and Return of Rates (Blair, 1989; Mair, 1990), Ireland Revenue Valuation Agency and Investment Property Databank (Bond et al., 1996), Valuation Office Agency data (Bond et al., 2013), Chartered Institute of Public Finance and Accountancy data (Hilber et al., 2011) and a variety of local government financial statistics. Over time, aggregate data was replaced by more extensive micro datasets. This shift is due to the benefits of estimating TFP with micro-level data: it permits the direct comparison of an outcome variable across treated and control groups and, therefore, facilitates the estimation of treatment effects. More specifically, Del Gatto et al. (2011) argue that aggregate analysis plays a significant role in comparative, cross-country studies, but micro analysis permits the investigation of TFP patterns at a deeper level, controlling for issues such as imperfectly competitive markets, increasing returns and differences between firms.

This review pointed towards using the Annual Respondents Database (ARD). Recent papers related to both BRs and TFP have often employed the ARD. One of the most theoretically and empirically sophisticated studies uncovered in the Empirical Review Chapter, by Duranton et al. (2011), employed the ARD. Likewise, the recent and extensive estimates of TFP prepared by Harris, Moffat and their co-authors in various papers reviewed in the Methodology Review were based on the ARD.

Given its longitudinal structure and breadth, the ARD would enable the estimation of TFP and survival at the micro level, but only until 2008. However, the Office for National Statistics (ONS) Virtual Microdata Laboratory (VML) and the University of the West of England have extended this dataset by combining several other datasets to achieve better coverage and a better fit for productivity analysis. The new dataset, called the Annual Respondents Database X (ARDx), now contains harmonised variables from 1998 to 2015 with 42,000-65,000 annual observations. Thus, the following paragraphs describe ARDx, focusing on its structure, and introduce two other data sources²⁷ that were used to supplement ARDx. The procedures required by the UK Data Service to access and report on this data were followed owing to the sensitive nature of the data.

6.2.1.1 Annual Respondents Database X (ARDx)

Not only the coverage but also the variable base made ARDx the preferable dataset. As an extension of the ARD, the ARDx has a rich variable base (289-402 variables). It includes the chief variables in TFP estimation, such as labour, estimated capital, investment, materials and, most importantly, BRs expense. The surveys also cover diverse sectors: construction; retail; motor trades; catering and allied trades; wholesale; property; service trade sectors; and, from 2000, agriculture (partly), hunting, forestry and fishing. From an administrative perspective, the ARDx is created by combining two datasets: the Annual Respondents Database, with data from 1998 to 2008, and the Annual Business Survey (ABS), supplemented with employment data from the Business Register and Employment Survey, with data from 2009. These datasets are described in the following paragraphs to clarify the structure of ARDx and to highlight possible mismatches between pre- and post-2008 data.

²⁷ Sections 6.2.1.3 Business Structure Database (BSD) and 6.2.1.4 Prices Survey Microdata

The pre-2008 data in ARDx mainly comes from the ARD. The ARD²⁸ consists of two surveys, employment (ABI1) and financial information (ABI2). These were standardised into a single consistent format and linked by the Inter-Departmental Business Register (IDBR). ARDx starts in 1998 because the structure of the ABI (1998-2008) was more similar to that of the ABS (2008-2016). These two datasets can be combined because they have similar sampling procedures, structure and questions, and because both can be linked with the IDBR, which has existed only since 1997.

More recent (post-2008) data in ARDx comes from the Annual Business Survey (ABS). The Office for National Statistics (ONS) sends this postal survey to around 62,000 businesses in Great Britain each year. It is the most extensive business survey currently conducted by the ONS in terms of the combined number of respondents and variables it covers from around 600 different questions asked. The details of these businesses, registered for Value Added Tax (VAT) and/or Pay As You Earn (PAYE), are obtained from the ONS’s Inter-Departmental Business Register (IDBR). In a similar manner to ARD, the ABS’s population of legal units is stratified by SIC (2007), employment, and country using the information from the IDBR.
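The splicing of the two survey sources can be illustrated with a minimal sketch. It is not the procedure used by the data owners: the field names (`ruref`, `year`, `turnover`) and the exact cut-off (ARD records up to 2008, ABS records from 2009) are assumptions made for illustration only.

```python
# Hypothetical sketch: stack ARD-era and ABS-era records into one panel
# and keep a provenance flag so pre/post-2008 mismatches can be inspected.
ARD_YEARS = range(1998, 2009)   # assumed: 1998-2008 inclusive
ABS_FIRST_YEAR = 2009           # assumed: ABS records used from 2009

def stack_sources(ard_rows, abs_rows):
    """Return one list of records sorted by unit and year, tagged by source."""
    panel = []
    for row in ard_rows:
        if row["year"] in ARD_YEARS:
            panel.append({**row, "source": "ARD"})
    for row in abs_rows:
        if row["year"] >= ABS_FIRST_YEAR:
            panel.append({**row, "source": "ABS"})
    panel.sort(key=lambda r: (r["ruref"], r["year"]))
    return panel

ard = [{"ruref": "A1", "year": 2007, "turnover": 120.0},
       {"ruref": "A1", "year": 2008, "turnover": 130.0}]
abs_ = [{"ruref": "A1", "year": 2009, "turnover": 125.0},
        {"ruref": "A1", "year": 2008, "turnover": 999.0}]  # overlap year, dropped
panel = stack_sources(ard, abs_)
```

Keeping the `source` flag on every row makes it straightforward to test later whether an estimated effect is sensitive to the change of survey instrument.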

With regard to the sampling procedures, both the ARD and the ABS include all large businesses, whereas smaller businesses are sampled. The estimation might suffer from some bias, as smaller firms may receive a shorter form, which does not necessarily require detailed breakdowns of totals. Thus, for specific variables, the values may be acquired from third-party sources (e.g. HMRC) or estimated rather than returned by respondents. Approximately 60,000-75,000 businesses were surveyed each year between 1998 and 2008 (and ~15,000 between 1973 and 1997) using a postal survey method.
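The sampling design described above (a census of large businesses plus a sample of smaller ones) can be sketched as follows. This is a simplified illustration rather than the ONS procedure: it uses a single employment stratum instead of stratification by SIC, employment and country, and the threshold of 250 employees and the sampling fraction are hypothetical.

```python
import random

def draw_sample(firms, large_threshold, small_fraction, seed=0):
    """Take every large firm; sample small firms and attach design weights."""
    rng = random.Random(seed)
    large = [f for f in firms if f["employment"] >= large_threshold]
    small = [f for f in firms if f["employment"] < large_threshold]
    n_small = max(1, round(len(small) * small_fraction)) if small else 0
    sampled_small = rng.sample(small, n_small)
    # Each sampled small firm stands in for len(small) / n_small firms.
    weight_small = len(small) / n_small if n_small else 0.0
    sample = [{**f, "weight": 1.0} for f in large]
    sample += [{**f, "weight": weight_small} for f in sampled_small]
    return sample

firms = [{"ruref": f"R{i}", "employment": e}
         for i, e in enumerate([5, 8, 12, 20, 30, 40, 300, 450, 3, 9])]
sample = draw_sample(firms, large_threshold=250, small_fraction=0.25)
```

Because the weights are the inverse of the selection probability within each stratum, they sum to the population size, which is what allows sample totals to be grossed up.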

Although the survey provides substantial coverage, there are some inaccuracies. For instance, linking the datasets introduced several different variables reporting similar or closely linked figures. One of them is employment, for which figures are present from all three sources. The ONS produces many different measures of employment, including the Workforce Jobs and Annual Population Survey/Labour Force Survey. However, the ONS recommends using the Business Register and Employment Survey (BRES) for information on employment by detailed geography and industry (ONS, 2017). The most reliable recent (2009-2015) employment data in ARDx comes from the BRES, which is aimed at updating local unit information and business structures on the IDBR. This postal survey has approximately 80,000 sampled businesses covering approximately 500,000 sampled local units each year.
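One way to handle the overlapping employment variables is to apply a fixed order of preference per record. The sketch below assumes hypothetical field names (`emp_bres`, `emp_survey`, `emp_idbr`); only the preference for BRES in 2009-2015 comes from the text, and the fallback order after BRES is an assumption.

```python
def preferred_employment(record):
    """Return (value, source label) for the employment figure to use."""
    # BRES is preferred where it exists (2009-2015, per the text).
    if 2009 <= record["year"] <= 2015 and record.get("emp_bres") is not None:
        return record["emp_bres"], "BRES"
    # Assumed fallback order: survey response first, then the IDBR figure.
    for key, label in (("emp_survey", "survey"), ("emp_idbr", "IDBR")):
        if record.get(key) is not None:
            return record[key], label
    return None, "missing"

recent = {"year": 2012, "emp_bres": 14, "emp_survey": 15, "emp_idbr": 12}
older = {"year": 2004, "emp_bres": None, "emp_survey": 9, "emp_idbr": 10}
empty = {"year": 2010}
```

Returning the source label alongside the value keeps the choice auditable when employment enters the TFP and survival models.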

Overall, the data owners have used variable names from the ABS. They advise that names may not be the same, but the survey is mainly consistent with the ARD. The variables may be divided into three types according to their information source:

• IDBR-based information about the reporting unit
• Respondent’s answers to the ARD and ABS surveys
• Employment data from BRES.

6.2.1.2 Inter-Departmental Business Register (IDBR)

ARDx is preferable because it is cross-checked with administrative data. There are larger and more up-to-date data sources, but these cannot be accessed by non-civil servants without special permission. All ONS datasets contain IDBR reference numbers. These are anonymous but unique reference numbers assigned to business organisations. Their inclusion helped to combine the various datasets required to achieve the objectives. For the merging described in Section 6.2.2, it is important to define the different identifier levels in the IDBR. IDBR units can be grouped into three types: administrative, statistical and observation units. Figure 6:1 illustrates the principal relationships. The reporting and statistical units can be used for matching since they have unique identifiers. To estimate more precise effects, the data in this analysis was limited to firms having only one local unit. Thus, although there would be unique identifiers for each of the groups (reporting unit, enterprise group, enterprise and local unit), they would refer to the same information.

If this limitation were not imposed, as Figure 6:1 illustrates, the reporting unit would provide information on behalf of the company and have its own identifier (ruref). This information would be assigned to statistical units. A group of legal units under joint ownership is called an Enterprise Group (Entref). An Enterprise can be defined as the smallest combination of legal units (based on VAT and PAYE records), whilst a local unit is an enterprise or part of a company situated in a geographically identified place. Finally, administrative units refer to VAT trader and PAYE employer information supplemented with incorporated business data from Companies House.
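The restriction to firms with a single local unit, described above, can be sketched as a simple filter. The identifier `ruref` appears in the text; `luref` is a hypothetical name for the local unit identifier, and the sketch assumes a single-year snapshot of the register.

```python
from collections import Counter

def single_local_unit_rurefs(local_units):
    """Return the reporting units that own exactly one local unit."""
    counts = Counter(lu["ruref"] for lu in local_units)
    return {ruref for ruref, n in counts.items() if n == 1}

def restrict_panel(panel, local_units):
    """Keep only panel rows whose reporting unit has one local unit."""
    keep = single_local_unit_rurefs(local_units)
    return [row for row in panel if row["ruref"] in keep]

local_units = [{"ruref": "A1", "luref": "L1"},
               {"ruref": "B2", "luref": "L2"},
               {"ruref": "B2", "luref": "L3"}]   # B2 has two sites
panel = [{"ruref": "A1", "year": 2010}, {"ruref": "B2", "year": 2010}]
restricted = restrict_panel(panel, local_units)
```

With this filter in place, the reporting unit, enterprise and local unit identifiers all point to the same business, so location-specific variables such as BRs can be attributed without apportioning across sites.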

Figure 6:1 Relationships amongst reporting units, enterprise groups, enterprise

In document Policy evaluation with advanced analytics: non-domestic property tax reliefs (Page 133-138)