Value, size and momentum on Equity indices – a likely example of selection bias

Allan Evans, PhD, Senior Researcher

Carsten Schmitz, PhD, Head of Research (Zurich)

Value, size and momentum have a long history as stock price predictors, and similar indicators have been applied to stock indices in order to predict the performance of one national index against another. Published back tests of trading systems based on these ideas have shown impressive performance, but in this paper we find that this performance does not continue past the publication dates.

We argue that selection bias at the time of publication has a part to play in the disappointing out‐of‐sample performance of these indicators. We show how the combination of estimation uncertainty and selective reporting can readily explain the observed deterioration in performance. Importantly, with a fuller understanding of these effects, the long‐term poor performance of the indicators could have been anticipated at the time.

Introduction

Efforts to find predictors for stock returns have a long history. Quantitative work on momentum goes back at least to the 1960s, with the observation that, over timescales of a few months, stocks that performed well in the past tend to also perform well in the future [1]. Later work showed some evidence for a negative effect (mean reversion) on a longer timescale [2, 3]. Valuation ratios also have a long history. The controversy over the ‘Value Line’ system in the 1960s [4, 5] is a well‐known example. Fama and French recently reviewed the evidence for value, size and momentum factors across the world’s stock markets [6].

Starting in the 1990s, a series of papers was published suggesting that similar effects could be seen in national stock indices [2, 3, 7, 8]. In analogy to cross‐sectional equity systems [9], countries were ranked on size, value or momentum indicators to form portfolios with long positions in the indices for the highest‐ranked countries and short positions for the lowest‐ranked countries. The published work showed excellent historical performance for such trading systems.

This paper reviews the performance of these cross‐country equity portfolios. Nearly twenty years have passed, so we have the advantage of a large set of historical data that was not available to the authors of the listed papers. In Section 1 we replicate the published evaluations in the period before 1995, but in Section 2 we show that the performance of the trading systems on more recent data is disappointing. We believe this is because of selection bias in the published results, and in Sections 3 and 4 we show how this may have led to over‐optimistic assessments of the systems. More importantly, we demonstrate how this could have been avoided at the time.

| Indicator | Publication | Sharpe ratio (published) | Sharpe ratio (replicated) |
| --- | --- | --- | --- |
| Momentum (MOM) | [8] | 0.85 | 0.85 |

1. Replicating the published work

The papers [2, 3, 8] use national stock market indices provided by MSCI [10] across 18 developed countries in the period 1970‐1995. Table 1 lists the four indicators calculated from this data set that were used in these papers.

The momentum indicator (MOM) is the total return of an index over the last year, ignoring the return from the last month. The value indicator (V) is the book‐to‐market ratio of the index and size (S) is the inverse free‐float market capitalisation. The book value and market capitalisation are each calculated for the index as a whole by summing the values for the component stocks. These three indicators are the same as those used in [8]. The mean‐reversion indicator (MR) is defined in a number of different ways in [2] and [3]. We choose the fractional return between 3 years and 1 year ago so that the MR indicator is nearly independent of the MOM indicator.
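As a minimal sketch of these definitions, assuming month‐end total‐return index levels held in a plain Python list (oldest first), the four indicators could be computed as below. The function names, the list‐based inputs and the exact month offsets are illustrative assumptions, not the implementation used in the cited papers. The sign with which MR enters the ranking is not fixed here.

```python
def momentum(levels):
    """MOM: return over the last 12 months, skipping the most recent month.

    `levels` are month-end total-return index levels, oldest first (assumed layout).
    """
    if len(levels) < 13:
        raise ValueError("need at least 13 month-end levels")
    return levels[-2] / levels[-13] - 1.0


def mean_reversion(levels):
    """MR: fractional return between 3 years ago and 1 year ago."""
    if len(levels) < 37:
        raise ValueError("need at least 37 month-end levels")
    return levels[-13] / levels[-37] - 1.0


def value(book_values, market_caps):
    """V: index-level book-to-market ratio, summing over component stocks."""
    return sum(book_values) / sum(market_caps)


def size(free_float_caps):
    """S: inverse of the index's total free-float market capitalisation."""
    return 1.0 / sum(free_float_caps)
```

Because MR ends where MOM begins (12 months back), the two indicators use non‐overlapping return windows, which is what keeps them nearly independent.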

Each indicator defines a distinct trading system which will be associated with a series of portfolios through time.

The indicator is calculated for each country at the end of a calendar month. The countries are then ranked by the values of their indicator. Out of the 18 countries, we take a positive position in the top six countries and a negative position in the bottom six. All positions are equal in their allocation, so the portfolio is dollar neutral. The positions are held for one month.
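The ranking step can be sketched as follows. This is an illustration under stated assumptions: the dictionary‐based interface, the function names and the alphabetical tie‐break are hypothetical choices (the papers do not specify how ties are resolved), and weights are expressed as fractions of the gross exposure on each side.

```python
def long_short_weights(indicator, n_side=6):
    """Equal-weight, dollar-neutral portfolio from one month's indicator values.

    `indicator` maps country -> indicator value at month end.
    The top `n_side` countries are held long, the bottom `n_side` short.
    """
    if len(indicator) < 2 * n_side:
        raise ValueError("not enough countries for disjoint long and short legs")
    # Ties are broken alphabetically by country name (an assumption).
    ranked = sorted(indicator, key=lambda c: (indicator[c], c))
    weights = {country: 0.0 for country in indicator}
    for country in ranked[-n_side:]:
        weights[country] = 1.0 / n_side
    for country in ranked[:n_side]:
        weights[country] = -1.0 / n_side
    return weights


def portfolio_return(weights, next_month_returns):
    """Return of the portfolio over the one-month holding period."""
    return sum(w * next_month_returns[c] for c, w in weights.items())
```

With 18 countries and `n_side=6`, the middle six countries receive zero weight and the long and short legs cancel in dollar terms.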

Table 1 shows Sharpe ratios from the published sources and from our own tests over the period 1970‐1995. We believe that the remaining differences between the published values and our own are due to small differences in the definition of the trading systems (for example, the way in which ties in the ranks are resolved).
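The Sharpe ratio convention is not restated in this paper. A common convention, assumed in the sketch below, is to divide the mean monthly return by its sample standard deviation and annualise by the square root of 12. No risk‐free rate is subtracted, on the assumption that a dollar‐neutral portfolio is self‐financing.

```python
import math
import statistics


def annualised_sharpe(monthly_returns, periods_per_year=12):
    """Annualised Sharpe ratio of a series of periodic portfolio returns.

    Assumes a self-financing (dollar-neutral) portfolio, so no risk-free
    rate is subtracted; this convention is an assumption of the sketch.
    """
    mean = statistics.mean(monthly_returns)
    sd = statistics.stdev(monthly_returns)  # sample standard deviation
    return mean / sd * math.sqrt(periods_per_year)
```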

Table 1. Indicators used in the cited publications to construct long‐short trading systems on national equity indices. The published Sharpe ratios are compared with the Sharpe ratios from our tests.

2. What happened next: out‐of‐sample performance

It should be noted that the Sharpe ratios in Table 1 are so‐called in‐sample results. The same data set (1970‐1995) that was used to develop or select the trading systems was used to calculate the results.

We have an advantage over the researchers who published the original work. Working in 2014, we can compute the performance of the systems on the market data from 1995 to 2014. This is an out‐of‐sample test. The results are shown in Figure 1 and Table 2.

Figure 1. Cumulative profit (positive) or loss (negative) in billion USD of the published systems over the in‐sample and out‐of‐sample periods (left and right of the vertical line).

A $100M long or short position is taken in each single country selected. The performance replications for MOM and MR start later due to the required price history, and the V replication starts in 1975 when the book‐value data become available. The combined performance of the four systems is shown in the bottom section.

All four systems have worse performance in the out‐of‐sample period than in the original test period. Two of the four systems lose money after 1995, and even the momentum system shows almost no profit after 2000.

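Table 1 (continued).

| Indicator | Publication | Sharpe ratio (published) | Sharpe ratio (replicated) |
| --- | --- | --- | --- |
| Mean reversion (MR) | [2, 3] | - | 0.51 |
| Value (V) | [8] | 0.84 | 0.86 |
| Size (S) | [8] | 0.63 | 0.60 |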

Table 2. Sharpe ratios for the four published systems, in‐sample and out‐of‐sample. The Sharpe ratios for the combined system are also given. All systems show a decline in performance.

| Indicator | Sharpe ratio (1970‐1995) | Sharpe ratio (1995‐2014) |
| --- | --- | --- |
| Momentum (MOM) | 0.85 | 0.57 |
| Mean reversion (MR) | 0.51 | -0.15 |
| Value (V) | 0.86 | -0.05 |
| Size (S) | 0.60 | 0.40 |
| TOTAL | 1.36 | 0.42 |

Table 3. Indicators for ten trading systems which could have been tested in the 1990s, including the four already tested.

| Symbol | Description | Published |
| --- | --- | --- |
| V | Book value/market cap | [8] |
| EM | Earnings/market cap | |
| S | 1/market cap | [8] |
| DIVYLD | Dividend yield | |
| CEM | Cash earnings/market cap | |
| MOM | 1‐year momentum | [8] |
| MOM6M | 6‐month momentum | |
| MOM2Y | 2‐year momentum | |
| MR | Mean reversion (3 years) | [2, 3] |
| MR2 | Mean reversion (2 years) | |

3. A family of trading systems: evidence for selection bias

As we saw in the previous section, the performance of the national stock index portfolios is disappointing after the publication of these papers. This might be explained in various different ways. Perhaps the poor out‐of‐sample performance is due to bad luck and will be reversed in the future. Or the market environment might have changed in a way which makes the systems perform more poorly.

However, the most likely explanation is selection bias. In fact, we should have expected a drop in performance even without the benefit of hindsight.

We can understand this better by looking at the family of trading systems which the authors selected from. Usually it is difficult to know exactly which trading systems were considered before one was selected. In this case, we can make a reasonable guess.

The MSCI data set that contains the book‐to‐market ratio and the market capitalisation for each country also contains a selection of other indicators, which have likely been studied alongside the published indicators. As further evidence, we find that in [7] (which is referred to by [8]) the correlations between index returns and some of these additional indicators are studied. The book‐to‐market (V) indicator is assessed as the most promising.

Figure 2. In‐sample (pre‐1995) Sharpe ratios of the ten equity index trading systems listed in Table 3. Those published and discussed in section 2 are highlighted in red. The mean (0.34) is indicated by a black line, and the one‐standard‐deviation range (σ=0.23) is shown as the red band around the mean. Error bars show the estimated sampling error from a formula given in [12].

Furthermore, the momentum and mean‐reversion systems use specific timescales suggested by successful systems trading individual equities. But other timescales, from one quarter to five years, have also been used in systems trading both indices and individual equities [9, 11].

These considerations suggest a wider group of ten trading systems which might have been tested by researchers in the 1990s alongside the published systems. They are listed in Table 3. To compare the performance of this family of systems across the entire time range, we use the 13 countries with data for all the indicators across the period 1975‐2014. The in‐sample results are shown in Figure 2.

Strikingly, all four of the published Sharpe ratios are above the average. This is very suggestive of active performance‐based selection of trading systems.

If the published trading systems were picked because they were among the best in a wider set of systems, then we would expect their out‐of‐sample performance to be worse than their in‐sample performance [13]. This phenomenon of ‘regression to the mean’ applies in many other contexts and often leads to controversy [14]. Students with the best test scores in a particular school year are expected to decline in performance the next year (leading to accusations that they are a neglected group). Patients selected because they have high blood pressure will tend to show a decrease in blood pressure in the next stage of a clinical trial (leading to a positive assessment of any treatment they are given – even if it is ineffective).

The reason for this decline is easy to understand. In each case there is a large random variation in the quantity measured. Trying to select the very best, we select the very lucky. And the very lucky are not likely to be as lucky next time.
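This effect can be illustrated with a small simulation. The sketch below assumes, purely for illustration, that all ten systems share the same true Sharpe ratio (0.34, the family mean reported in Figure 2) and that measured Sharpe ratios differ only by independent Gaussian noise with standard deviation 0.23. The choice of selecting the top four systems, the function name and the parameters are hypothetical.

```python
import random


def simulate_selection(n_systems=10, n_selected=4, true_sharpe=0.34,
                       noise_sd=0.23, n_trials=5000, seed=0):
    """Average in-sample and out-of-sample Sharpe ratio of the selected systems.

    Every system has the same true Sharpe ratio; only luck differs.
    """
    rng = random.Random(seed)
    in_total = 0.0
    out_total = 0.0
    for _ in range(n_trials):
        in_sample = [true_sharpe + rng.gauss(0.0, noise_sd)
                     for _ in range(n_systems)]
        out_sample = [true_sharpe + rng.gauss(0.0, noise_sd)
                      for _ in range(n_systems)]
        # Pick the systems that looked best in-sample.
        best = sorted(range(n_systems), key=lambda i: in_sample[i],
                      reverse=True)[:n_selected]
        in_total += sum(in_sample[i] for i in best)
        out_total += sum(out_sample[i] for i in best)
    n = n_trials * n_selected
    return in_total / n, out_total / n


in_mean, out_mean = simulate_selection()
print(f"selected systems: in-sample {in_mean:.2f}, out-of-sample {out_mean:.2f}")
```

Under these assumptions the selected systems look clearly better than the family mean in‐sample, yet their out‐of‐sample average falls back to the common true value, even though nothing about the systems has changed.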

4. Estimating the effect of selection bias

We now understand why trading systems selected for their performance in‐sample are expected to show a decline in performance outside the sample period. If we can estimate the size of this effect using only an in‐sample test, then it will not be necessary to wait until out‐of‐sample data are available; we can make a corrected estimate of the future performance of a trading system immediately. This type of calculation is a valuable tool for the analysis of trading systems and portfolio construction.

To do this, we need to know how much of the difference in performance between the systems is caused by real differences in the systems’ effectiveness and how much is due to luck. The luck here is random sampling error, caused by the use of a finite amount of data to estimate the Sharpe ratios. Real differences in effectiveness should be consistent between the in‐sample and out‐of‐sample periods, but luck will not persist into the out‐of‐sample period.

The standard formula [12] for the sampling error of Sharpe ratios gives values of between 0.22 and 0.25 for all the systems in the family (Table 3). These are the error bars in Figure 2. In every case, the estimated sampling error is close to the standard deviation σ=0.23 of the set of Sharpe ratios (the measured range of Sharpe ratios across the systems). Differences between the Sharpe ratios can therefore be attributed to random sampling error alone. They do not indicate true differences in performance.
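For independent and identically distributed returns, the result in [12] gives a standard error of approximately sqrt((1 + SR²/2)/T) for a per‐period Sharpe ratio SR estimated from T observations. The sketch below applies this to monthly data and annualises by the square root of 12. The 240‐month sample length in the usage line is an assumption for illustration (roughly 1975‐1995), and the function name is hypothetical.

```python
import math


def sharpe_standard_error(annual_sharpe, n_months, periods_per_year=12):
    """Approximate standard error of an annualised Sharpe ratio.

    Uses the IID large-sample result: Var(SR) ~ (1 + SR^2 / 2) / T for the
    per-period Sharpe ratio, then annualises by sqrt(periods_per_year).
    """
    sr_period = annual_sharpe / math.sqrt(periods_per_year)
    se_period = math.sqrt((1.0 + 0.5 * sr_period ** 2) / n_months)
    return se_period * math.sqrt(periods_per_year)


# Family-mean Sharpe ratio over an assumed 240-month sample.
print(round(sharpe_standard_error(0.34, 240), 2))
```

For Sharpe ratios of this size the correction term SR²/2 is small, so the standard error is driven almost entirely by the length of the sample. Shorter histories, such as those of systems needing several years of price data before they can trade, give somewhat larger errors.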

This has an important consequence. Given only in‐sample data, our best estimate of the future (out‐of‐sample) Sharpe ratio of one of the systems is not the in‐sample Sharpe ratio of that system: it is the mean in‐sample Sharpe ratio of the whole family (0.34 in this case). The Tweedie formula [15], a well‐known method for correcting selection bias, confirms this conclusion.
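One way to see this is a simple shrinkage calculation. Under the additional assumption that both the true Sharpe ratios and the sampling errors are normally distributed, Tweedie's formula reduces to linear shrinkage toward the family mean, with a weight equal to the share of the observed variance that is not sampling noise. The sketch below is an illustration under that assumption, with hypothetical names, and is not the calculation performed in [15].

```python
def shrunk_sharpe(observed, family_mean, family_sd, sampling_error):
    """Selection-bias-corrected Sharpe ratio under a normal-normal model.

    family_sd is the observed spread of Sharpe ratios across the family;
    sampling_error is the standard error of a single Sharpe estimate.
    """
    noise_var = sampling_error ** 2
    # Variance of true differences: observed spread minus noise, floored at zero.
    signal_var = max(family_sd ** 2 - noise_var, 0.0)
    total_var = signal_var + noise_var
    if total_var == 0.0:
        return observed  # no spread and no noise: nothing to correct
    weight = signal_var / total_var
    return family_mean + weight * (observed - family_mean)


# Value system: in-sample 0.86, family mean 0.34, spread and noise both 0.23.
print(shrunk_sharpe(0.86, 0.34, 0.23, 0.23))
```

When the spread across the family is no larger than the sampling error, the weight on the individual estimate is zero and every system's corrected Sharpe ratio equals the family mean.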

We can now look at the out‐of‐sample data to see whether they confirm this conclusion. Figure 3 shows the in‐sample and out‐of‐sample Sharpe ratios for the ten systems. It’s clear that the in‐sample and out‐of‐sample results are not correlated, and a statistical test confirms this. In other words, exceptionally ‘good’ systems in‐sample are not likely to remain good in out‐of‐sample tests. This is exactly what the analysis of the sampling error led us to expect. We have managed to foresee the drop in performance, due to selection bias, without requiring the extra 20 years of data.

A large change in the overall mean would suggest some change in the world’s financial systems between the two periods. In fact, the small decrease in the mean Sharpe ratio (0.34 to 0.25) is of the same order of magnitude as the change expected from random variations¹, so there is no statistical evidence for such a change.
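As a rough check of this statement, the arithmetic can be written out under two simplifying assumptions: that the sampling errors of the ten systems are independent (as the √10 factor in the footnote implies), and that the two periods are independent of each other. Both are approximations, since the systems trade the same indices.

```latex
\mathrm{SE}(\bar{S}) \approx \frac{0.23}{\sqrt{10}} \approx 0.07, \qquad
\mathrm{SE}\!\left(\bar{S}_{\mathrm{in}} - \bar{S}_{\mathrm{out}}\right) \approx \sqrt{2} \times 0.07 \approx 0.10, \qquad
\bar{S}_{\mathrm{in}} - \bar{S}_{\mathrm{out}} = 0.34 - 0.25 = 0.09
```

On these assumptions the observed change in the mean is slightly less than one standard error of the difference.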

Figure 3. In‐sample (pre‐1995) and out‐of‐sample (post‐1995) Sharpe ratios for the family of ten trading systems. The one‐standard deviation ranges and the mean values are shown in the side bars.

5. Conclusion

In this paper we have seen how published systems on national stock indices from the 1990s have underperformed in the following 20 years, and we have shown strong evidence that the decrease in performance was caused by cherry‐picking a set of indicators based on in‐sample performance. This selection is not always done by an individual researcher or group. For example, it is possible that different researchers tried the different systems, and only the ones who obtained positive results published their work. Or one group of researchers may evaluate a set of possible indicators (as in [7]), leading a different group of researchers (such as [8]) to make a particular choice of trading systems.

Investment managers will always select trading systems which performed well in the past, and we do not argue that this is a bad policy. But as this paper shows, the selection introduces a bias, which should be corrected so that the performance of individual trading systems is not over‐stated. Methods to do this are an important part of the arsenal of scientific quantitative investment. Recent concerns expressed in the academic literature and in the financial community [16, 17, 18] make it clear that not everyone working in finance has fully adopted these methods yet.

¹ The sampling error 0.23 divided by the square root of the number of systems, √10, gives a rough estimate of 0.07 for the standard error of the mean.

6. Acknowledgements

The authors are grateful to the Winton research department for discussions, suggestions and support in accessing data, and in particular to William Cobern for help with data preparation.

For correspondence please email [email protected]

References

1. R. Levy, “Relative strength as a criterion for investment selection,” Journal of finance, pp. 595‐610, 1967.

2. R. Balvers, “Mean reversion across national stock markets and parametric contrarian investment strategies,” Journal of finance, pp. 745‐772, 2000.

3. A. Richards, “Winner‐loser reversals in national stock market indices: can they be explained?,” IMF Working paper, 1997.

4. J. Shelton, “The Value Line contest: a test of predictability of stock‐price changes,” Journal of business, pp. 251‐269, 1967.

5. F. Black, “Yes, Virginia, there is hope: tests of the Value Line ranking system,” Financial analysts journal, vol. 29, 1973.

6. E. Fama and K. French, “Size, value, and momentum in international stock returns,” Journal of financial economics, pp. 457‐472, 2012.

7. L. Heckman, “Valuation ratios and cross‐country equity allocation,” Journal of investing, pp. 54‐63, 1996.

8. C. Asness, J. Liew and R. Stevens, “Parallels Between the Cross‐Sectional Predictability of Stock and Country Returns,” The Journal of Portfolio Management, pp. 79‐87, 1997.

9. N. Jegadeesh and S. Titman, “Returns to buying winners and selling losers: implications for stock market efficiency,” Journal of finance, pp. 65‐91, 1993.

10. “MSCI country and regional indices,” [Online]. Available: www.msci.com/products/indexes/country_and_regional/dm/.

11. F. DeBondt and R. Thaler, “Does the stock market overreact?,” Proceedings of the 43rd annual meeting of the American Finance Association, pp. 28‐30, 1985.

12. A. Lo, “The statistics of Sharpe Ratios,” Financial analysts journal, pp. 36‐52, 2002.

13. M. Roulston and D. Hand, “Blinded by optimism,” Winton Capital Management Working Paper, December 2013.

14. J. Kruger, “Superstition and the regression effect,” Skeptical Inquirer, March/April 1999.

15. B. Efron, “Tweedie’s formula and selection bias,” Journal of the American Statistical Association, pp. 1602‐1614, 2011.

16. D. Bailey, “Pseudo‐mathematics and financial charlatanism,” Notices of the American Mathematical Society, pp. 458‐471, 2014.

17. Wall Street Journal Online, “Huge returns at low risk? Not so fast,” 27 June 2014. [Online]. Available: blogs.wsj.com/moneybeat/2014/06/27/huge-returns-at-low-risk-not-so-fast/.

18. Vanguard, “Joined at the hip: ETF and index development,” July 2012. [Online]. Available: pressroom.vanguard.com/nonindexed/7.23.2012_Joined_at_the_hip.pdf.

Legal Disclaimer

This document has been prepared by Winton Capital Management Limited (“WCM”), which is authorised and regulated by the UK Financial Conduct Authority, registered as an investment adviser with the US Securities and Exchange Commission, registered with the US Commodity Futures Trading Commission and a member of the National Futures Association.

This document is provided for information purposes only and the information herein does not constitute an offer to sell or the solicitation of any offer to buy any securities.

The information herein is subject to updating and further verification and may be amended at any time and WCM is under no obligation to provide an updated version. WCM has used information in this document that it believes to be accurate and complete as of the date of this document. However, WCM does not make any representation or warranty, express or implied, as to the information’s accuracy or completeness, and accepts no liability for any inaccuracy or omission. No reliance should be placed on the information herein and WCM does not recommend that it serves as the basis of any investment decision.

This document may contain results based on simulated or hypothetical performance results that have certain inherent limitations. Unlike the results shown in an actual performance record, such results do not represent actual trading. Also, because such trades have not actually been executed, these results may have under‐ or over‐compensated for the impact, if any, of certain market factors, such as lack of liquidity. Simulated or hypothetical trading programs in general are also subject to the fact that they are designed with the benefit of hindsight. No representation is being made that any investment will or is likely to achieve profits or losses similar to those being shown using simulated data.

Unauthorised dissemination, copying, reproducing or transmitting of this information is strictly prohibited.