
Data validity, the implications of using commercial sources, and sample structures

According to the Encyclopaedia of Computer Science and Technology, 'rough data sets' are defined as those which include uncertain or inaccurate information (Düntsch and Gediga, 2000). It may seem odd at first to describe commercially available corporate accounting data in this way, especially given the audited nature of the published financial statements which are the primary source. Furthermore, we could expect that the market demand for online information would impose a market discipline that encourages high levels of completeness and accuracy. Nevertheless, the paper of Izadi et al. (2010)²⁴, presented at the 46th annual British Accounting Association Conference, demonstrates that there can be considerable uncertainty surrounding those data points which are left blank or recorded as not applicable in portfolio downloads. In fact, in some cases, the underlying information appears to be available, and can be retrieved using alternative methods of interrogating the data services, e.g. by examining the financial statement summaries for each individual firm rather than by downloading items for portfolios of companies. The paper also notes that there are cases where missing information is not in the database but can nevertheless be retrieved by referring to the primary source, e.g. the accounts published by the company involved.

Furthermore, the aforementioned paper demonstrates how some missing values may be deduced directly from the data that are available, by backfilling via the appropriate accounting identity. Finally, it shows that a few basic errors do creep into these data sets, in spite of the evident internal checks.
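As a minimal sketch of backfilling via an accounting identity, assume that net property, plant and equipment equals the gross figure less accumulated depreciation, with accumulated depreciation stored as a positive number. The function name `backfill_ppe`, the dictionary layout and the sample figures are hypothetical, and the sign convention is an assumption rather than a documented property of any database.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[float]]

# Assumed identity (hypothetical sign convention: depreciation held as a positive number):
#   TotalPropPlantEquipNet = TotalPropPlantEquipGross - AccumulatedDepreciation
GROSS = "TotalPropPlantEquipGross"
DEPR = "AccumulatedDepreciation"
NET = "TotalPropPlantEquipNet"


def backfill_ppe(record: Record) -> Record:
    """Return a copy of the record with a single missing PPE item deduced."""
    filled = dict(record)
    gross, depr, net = (filled.get(k) for k in (GROSS, DEPR, NET))
    n_missing = sum(v is None for v in (gross, depr, net))
    if n_missing != 1:
        return filled  # nothing to fill, or too little information to deduce a value
    if net is None:
        filled[NET] = gross - depr
    elif gross is None:
        filled[GROSS] = net + depr
    else:
        filled[DEPR] = gross - net
    return filled


# Illustrative firm-year with made-up figures (GBP millions).
example = {GROSS: 500.0, DEPR: 180.0, NET: None}
print(backfill_ppe(example)[NET])  # 320.0
```

A value can be deduced only when exactly one item in the identity is missing; with two or more gaps the record is returned unchanged.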

This thesis considers the data structure of the most common UK provider, Thomson One Banker, and specifically the Worldscope, Thomson and Extel data platforms. We are already aware from prior research that such financial data banks are not perfect substitutes, not only as the coverage of firms and accounting items varies across the databases, but also because there are differences in the way each database defines and constructs key variables (Alves et al, 2007).

²⁴ Using Rough Data Sets in Accounting Research: An Evaluation of the Integrity of UK Company Data in Commercial Databases (2010)

However, the main aim of this kind of work is not to compare these data structures and values per se, but rather to consider the potential for uncovering 'hidden values' amongst items that are reported in downloads as 'not applicable', or that are recorded ambiguously in the form of a dash, or simply left blank.

With reference to a recent paper on earnings management (Botsari and Meeks, 2008), the JBFA discussant suggests that the results reported in the paper may be influenced by the use of Worldscope data for the empirical analysis. The Worldscope financial statement data items are said to have been adjusted by Thomson analysts in order to reverse differences in local accounting practices, with the aim of enhancing their international comparability (Young, 2008). The use of standardised information in the database rather than as-reported numbers is potentially problematic because the comparability adjustments made by Worldscope may be conflated with the discretion exercised by company management in computing earnings, a problem which is exacerbated in the research design when such adjustments vary across the items that comprise estimated accruals. The evidence to support these claims is set out clearly in another paper (Alves et al, 2007), where the properties of items from the Worldscope income statement, cash flow statement and balance sheet are compared with those of corresponding items from Extel Financial and the Datastream Company Accounts Archive (both of which are said to contain as-reported data). The results are based on a single sample of UK firm-years that is common to all three databases, and they reveal some dramatic disparities. For instance, the mean and median Worldscope values for operating cash flow are 25 per cent lower than the Extel equivalents. The results reported by Tesco PLC illustrate this difference between operating cash flow computed according to UK GAAP and Worldscope's adjusted operating cash flow figure: Tesco's reported operating cash flows are £1,321 million for the financial year ending in February 1999, whilst Worldscope gives a figure of £955 million. Yet, at the same time, the operating profit is identical on the two delivery platforms.
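The size of the Tesco gap can be checked with a short calculation that uses only the two figures quoted above; the variable names are illustrative.

```python
# Figures quoted in the text for Tesco PLC, financial year ending February 1999 (GBP millions).
reported_ocf = 1321.0    # operating cash flow as reported under UK GAAP
worldscope_ocf = 955.0   # Worldscope's adjusted operating cash flow

gap = reported_ocf - worldscope_ocf
shortfall_pct = 100.0 * gap / reported_ocf

print(f"Gap: {gap:.0f} million; Worldscope is {shortfall_pct:.1f} per cent lower")
# Gap: 366 million; Worldscope is 27.7 per cent lower
```

On these figures the Worldscope value falls short of the reported amount by roughly 28 per cent, which is of the same order as the sample-wide difference of 25 per cent.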

The effects of database choice on accounting research have also been examined recently by Lara et al (2006). They regress the book value of shareholders' equity and earnings on the market value of the company, using EU data from seven sources for the period 1990–99. They conclude that much of the variation is attributable to differences in firm coverage across databases. In the US and Canada, Ulbricht and Weiner (2005) compare Worldscope and Compustat over the period from 1985 to 2003.

Full UK listed company data sets were formed in Worldscope and Thomson Financial in order to download accounting data. Banking firms and other financial institutions (GeneralIndustryClassification 4 and 5) were then excluded, as the corporate accounting identities used in the study apply only to industrial groupings 1-3 and 6. The data were downloaded for ten years, from 1999 to 2008, for all income statement and balance sheet items, including all subtotals. This work has not yet been extended to the completion of firm series that are interrupted by balance sheet date changes, where two fiscal periods end in the same calendar year. Such interruptions represent an additional source of 'hidden values' across the entire 'missing' financial statement set, which can generally be recovered directly from the database in question. The accounting identities that underlie the following line items were then evaluated:

CashAndSTInvestments
TotalInventories
TotalInvestments
TotalPropPlantEquipGross
AccumulatedDepreciation
TotalPropPlantEquipNet
TotalOtherAssets
TotalAssets
TotalCurrentLiabilities
LTDebtExclCapitalizedLeases
TotalLTDebt
DeferredTaxesBalSht
TotalLiabilities
TotalCommonEquity
TotalLiabAndShareholdersEquity
GrossIncome
TotalOperatingExpenses
OperatingIncomeAfterDepr
EarningsBeforeInterestAndTaxes
IncomeBefIncomeTaxes
IncomeBefExtraItemsAndPfdDiv
IncomeBefPreferredDividends
NetIncome
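The sample selection described before this list can be sketched as a simple filter. This is an illustration only and does not reproduce how the downloads were actually performed: the record layout, the field names `industry_class` and `fiscal_year`, and the sample rows are all hypothetical.

```python
from typing import Any, Dict, List

EXCLUDED_CLASSES = {4, 5}            # banking firms and other financial institutions (per the text)
FIRST_YEAR, LAST_YEAR = 1999, 2008   # ten-year window stated in the text


def in_sample(row: Dict[str, Any]) -> bool:
    """Keep non-financial firm-years that fall inside the study window."""
    return (
        row["industry_class"] not in EXCLUDED_CLASSES
        and FIRST_YEAR <= row["fiscal_year"] <= LAST_YEAR
    )


# Made-up rows for illustration only.
rows: List[Dict[str, Any]] = [
    {"firm": "A", "industry_class": 1, "fiscal_year": 2003},
    {"firm": "B", "industry_class": 4, "fiscal_year": 2003},
    {"firm": "C", "industry_class": 6, "fiscal_year": 1998},
]
sample = [r for r in rows if in_sample(r)]
print([r["firm"] for r in sample])  # ['A']
```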

Table 6.1 provides an indication of the initial results for a subset of the key aggregates in the above listing. Two line items that are often used in accounting research, Current Assets and Current Liabilities, are characterised by incomplete data that can be recovered by backfilling the missing value, by summing components and/or by logical deduction, e.g. from balance sheet net totals of other aggregates. The highest recovery was for Worldscope's Current Assets: 3900 backfilled firm-years.

In certain cases, the failed accounting identity could not be backfilled from other line items, as these were not all present. In such cases, the firm's financial statements on the system were consulted systematically, followed by any available copy of the relevant annual report. The highest recovery by this route was for Thomson's Total Receivables: 70 firm-years.

Finally, a number of errors were also noted. These could be verified not only because the financial statement clearing identity failed, but also because recourse to the original annual accounts confirmed the error. The greatest number of corrections was for Thomson's Sales: 23 firm-years.
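The three outcomes described above (values backfilled from an identity, values retrieved from the firm's statements or annual report, and confirmed errors) suggest a simple triage of each identity check. The sketch below is one possible reading, not the procedure actually used: it assumes an additive identity of the form total = sum of components, and the function name `triage`, the tolerance and the outcome labels are hypothetical.

```python
from typing import List, Optional


def triage(total: Optional[float],
           components: List[Optional[float]],
           tol: float = 0.5) -> str:
    """Classify one firm-year check of the assumed identity total == sum(components)."""
    values = [total] + components
    n_missing = sum(v is None for v in values)
    if n_missing == 0:
        # All items present: the identity either clears or flags a possible error
        # that still has to be verified against the annual accounts.
        return "ok" if abs(total - sum(components)) <= tol else "check_for_error"
    if n_missing == 1:
        return "backfill_from_identity"   # the single gap can be deduced
    return "consult_source_accounts"      # too many gaps to deduce anything


print(triage(100.0, [60.0, 40.0]))   # ok
print(triage(None, [60.0, 40.0]))    # backfill_from_identity
print(triage(100.0, [None, None]))   # consult_source_accounts
print(triage(100.0, [60.0, 45.0]))   # check_for_error
```

A failed identity is only a candidate error here; as the text notes, confirmation requires recourse to the original annual accounts.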

In summary, this provides initial evidence that points to a potential drawback in accounting research that is based on commercial data sets. Evidence has been provided of missing values that can be readily reconstructed from accounting identities, of other missing values that are retrievable either from the underlying database or from the source accounts, and finally of failed accounting identities that lead to the discovery of data errors.