Please cite this article in press as: Makridakis, S., et al., The M4 Competition: Results, findings, conclusion and way forward. International Journal of Please cite this article in press as: Makridakis, S., et al., The M4 Competition: Results, findings, conclusion and way forward. International Journal of Forecasting
Forecasting (2018), (2018), https://doi.org/1https://doi.org/10.1016/j.ijforec0.1016/j.ijforecast.2018.06.001ast.2018.06.001..
The M4 Competition: Results, findings, conclusion and way
The M4 Competition: Results, findings, conclusion and way
forward
forward
Spyros Makridakis
Spyros Makridakis
aa,,bb,,*
*,, Evangelos Spiliotis
Evangelos Spiliotis
cc, Vassilios Assimakopoulos
, Vassilios Assimakopoulos
ccaaUniversity of Nicosia, Nicosia, CyprusUniversity of Nicosia, Nicosia, Cyprus b
bInstitute For the Future (IFF), Nicosia, CyprusInstitute For the Future (IFF), Nicosia, Cyprus
ccForecasting and Strategy Unit, School of Electrical and Computer Engineering, National Technical University of Athens, Zografou, GreeceForecasting and Strategy Unit, School of Electrical and Computer Engineering, National Technical University of Athens, Zografou, Greece
a r a r t i t i c l c l e e i i n f n f oo Keywords: Keywords: Forecasting competitions Forecasting competitions M Competitions M Competitions Forecasting accuracy Forecasting accuracy Prediction intervals (PIs) Prediction intervals (PIs) Time series methods Time series methods
Machine Learning (ML) methods Machine Learning (ML) methods Benchmarking methods Benchmarking methods Practice of forecasting Practice of forecasting a b s t r a c t a b s t r a c t
The M4 competition is the continuation of three previous competitions started more The M4 competition is the continuation of three previous competitions started more than 45 years ago whose purpose was to learn how to improve forecasting accuracy, and than 45 years ago whose purpose was to learn how to improve forecasting accuracy, and how such learning can be applied to advance the theory and practice of forecasting. The how such learning can be applied to advance the theory and practice of forecasting. The purpose of M4 was to replicate the results of the previous ones and extend them into purpose of M4 was to replicate the results of the previous ones and extend them into three directions: First significantly increase the number of series, second include Machine three directions: First significantly increase the number of series, second include Machine Learning (ML) forecasting methods, and third evaluate both
Learning (ML) forecasting methods, and third evaluate both point forecasts and predictionpoint forecasts and prediction int
intervervalsals.. ThefivemajorfindThefivemajorfindingingss ofof theM4theM4 ComCompetpetitiitionsare:onsare: 1.1. OutOfOutOf the17the17 mosmostt accaccurauratete methods, 12 were ‘‘combinations’’ of mostly statistical approaches. 2. The biggest surprise methods, 12 were ‘‘combinations’’ of mostly statistical approaches. 2. The biggest surprise was a ‘‘hybrid’’ approach that utilized both statistical and ML features. This method’s was a ‘‘hybrid’’ approach that utilized both statistical and ML features. This method’s average sMAPE was close to 10% more accurate than the combination benchmark used to average sMAPE was close to 10% more accurate than the combination benchmark used to compare the submitted methods. 3. The second most accurate method was a combination compare the submitted methods. 3. The second most accurate method was a combination of seven statistical methods and one ML one, with the weights for the averaging being of seven statistical methods and one ML one, with the weights for the averaging being calculated by a ML algorithm that was trained to minimize the forecasting. 4. The two calculated by a ML algorithm that was trained to minimize the forecasting. 4. The two most accurate methods also achieved an amazing success in specifying the 95% prediction most accurate methods also achieved an amazing success in specifying the 95% prediction int
intervervalsals corcorrecrectlytly.. 5.5. TheThe sixsix purpuree MLML metmethodhodss perperforformedmed poopoorlyrly,, witwithh nonnonee ofof thethemm beibeingng more accurate than the combination benchmark and only one being more accurate than more accurate than the combination benchmark and only one being more accurate than Naïve2. This paper presents some initial results of M4, its major findings and a logical Naïve2. This paper presents some initial results of M4, its major findings and a logical conclusion. Finally, it outlines what the authors consider to be the way forward for the conclusion. Finally, it outlines what the authors consider to be the way forward for the field of forecasting.
field of forecasting.
©
©2018 Published by Elsevier B.V. on behalf of International Institute of Forecasters.2018 Published by Elsevier B.V. on behalf of International Institute of Forecasters.
1.
1. IntroIntroductionduction
The M4 Competition ended on May 31, 2018. By the The M4 Competition ended on May 31, 2018. By the deadline, we had received 50 valid submissions predicting deadline, we had received 50 valid submissions predicting all 100,000 series and 18 valid submissions specifying all all 100,000 series and 18 valid submissions specifying all 100,000 prediction intervals (PIs). The numbers of 100,000 prediction intervals (PIs). The numbers of sub-missions are a little over 20% and 7%, respectively, of the missions are a little over 20% and 7%, respectively, of the 248 registrati
248 registrations to ons to partiparticipatcipate. e. We We assumassume e that 100,000that 100,000 series was a formidable number that discouraged many series was a formidable number that discouraged many of those who had registered from generating and of those who had registered from generating and submit-tin
tingg forforecaecastssts,, espespecieciallallyy thothosese whowho werweree utiutilizlizinging comcompleplexx
**
Corresponding author. Corresponding author.E-mail address:
E-mail address:[email protected]@unic.ac.cyic.ac.cy (S. Makridakis). (S. Makridakis).
machine learning (ML) methods that required demanding machine learning (ML) methods that required demanding computations. (A participant, living in California, told us computations. (A participant, living in California, told us of his huge electricity bill electricity bill from using his of his huge electricity bill electricity bill from using his 5
5 homhome e comcomputputers ers runrunninning g almalmost ost conconstastantlntly y oveover r 4.54.5 months.)
months.)
After a brief introduction concerning the objective of After a brief introduction concerning the objective of the M4, this short paper consists of two parts. First, we the M4, this short paper consists of two parts. First, we present the major findings of the competition and discuss present the major findings of the competition and discuss the
theirir impimpliclicatiationsforonsfor thetheoryandthetheoryand prapracticticece ofof forforecaecastisting.ng. A special issue of the IJF is scheduled to be published A special issue of the IJF is scheduled to be published next year that will cover all aspects of the M4 in detail, next year that will cover all aspects of the M4 in detail, describing the most accurate methods and identifying the describing the most accurate methods and identifying the reasons for their success. It will also consider the factors reasons for their success. It will also consider the factors that had negative influences on the performances of the that had negative influences on the performances of the
https://doi.org/10.1016/j.ijforecast.2018.06.001
https://doi.org/10.1016/j.ijforecast.2018.06.001
0169-2070/
Trusted by over 1 million members
Try Scribd
FREE for 30 days to access over 125 million titles without ads or interruptions!
Start Free Trial
2
2 S. S. MakMakridridakiakis s et et al. al. / / IntInternernatioational nal JouJournarnal l of of ForForecasecastinting g ( ( ) ) ––
majority of the methods submitted (33 of the 50), whose majority of the methods submitted (33 of the 50), whose accuracies were lower than that of the statistical accuracies were lower than that of the statistical bench-ma
markrk ususeded,, anandd wiwillll outoutlilinene whwhatat wewe coconsnsididerer toto bebe ththee wawayy forward for the field. Second, we briefly discuss some forward for the field. Second, we briefly discuss some pre-dictions/hypotheses that we made two and a half months dictions/hypotheses that we made two and a half months ago (detailed in a document e-mailed to R. J. Hyndman ago (detailed in a document e-mailed to R. J. Hyndman on 21/3/2018), predicting/hypothesizing about the actual on 21/3/2018), predicting/hypothesizing about the actual results of the M4. A detailed appraisal of these results of the M4. A detailed appraisal of these predic-tions/hypotheses will be left for the special issue.
tions/hypotheses will be left for the special issue.
We would like to emphasize that the most important We would like to emphasize that the most important objective of the M4 Competition, like that of the previous objective of the M4 Competition, like that of the previous three ones, has been to ‘‘learn how to improve the three ones, has been to ‘‘learn how to improve the fore-casting accuracy, and how such learning can be applied casting accuracy, and how such learning can be applied to advance the theory and practice of forecasting’’, thus to advance the theory and practice of forecasting’’, thus providing practical benefits to all those who are interested providing practical benefits to all those who are interested in the field. The M4 is an open competition whose series in the field. The M4 is an open competition whose series ha
haveve bebeenen avavaiailalablblee sisincncee ththee enendd ofof lalastst yeyearar,, bobothth ththrorougughh the M4 site
the M4 site ( (https://www.m4.unic.ac.cy/the-dataset/https://www.m4.unic.ac.cy/the-dataset/)) and and in the
in the M4comp2018 M4comp2018 R package ( R package (Montero-Manso, Netto,Montero-Manso, Netto,
& Talagala, 2018
& Talagala, 2018). In addition, the majority of the par-). In addition, the majority of the par-ticipants have been requested to deposit the code used ticipants have been requested to deposit the code used for generating their forecasts in GitHub (
for generating their forecasts in GitHub (https://github.https://github.
com/M4Competition/M4-methods
com/M4Competition/M4-methods)), along with a detailed, along with a detailed
description of their methods, while the code for the ten description of their methods, while the code for the ten M4 benchmarks has been available from GitHub since the M4 benchmarks has been available from GitHub since the beginning of the competition. This will allow interested beginning of the competition. This will allow interested parties to verify the accuracy and/or reproducibility of the parties to verify the accuracy and/or reproducibility of the forecasts of most of the methods submitted to the forecasts of most of the methods submitted to the compe-tit
titionion (fo(forr propropriprietaetaryry metmethodhods/ss/softoftwarware,e, thethe checheckickingng wilwilll be done by the organizers). This implies that individuals be done by the organizers). This implies that individuals and organizations will be able to download and utilize and organizations will be able to download and utilize most of the M4 methods, including the best one(s), and to most of the M4 methods, including the best one(s), and to ben
benefiefitt frofromm thetheirir enhenhancanceded accaccurauracy.cy. MorMoreoveeover,r, acaacademdemicic researchers will be able to analyze the factors that affect researchers will be able to analyze the factors that affect the forecasting accuracy in order to gain a better the forecasting accuracy in order to gain a better under-standing of them, and to conceive new ways of improving standing of them, and to conceive new ways of improving the accuracy. This is in contrast to other competitions, like the accuracy. This is in contrast to other competitions, like th
thososee ororgaganinizezedd byby KaKagglggle,e, whwhereree ththereree isis acactutualallyly aa ‘‘‘‘hohorsrsee race’’ that aims to identify the most accurate forecasting race’’ that aims to identify the most accurate forecasting method(s) without any consideration of the reasons method(s) without any consideration of the reasons in-volved with the aim of improving the forecasting volved with the aim of improving the forecasting perfor-mance in the future. To recapitulate, the objective of the mance in the future. To recapitulate, the objective of the M Competitions is to learn how to improve the forecasting M Competitions is to learn how to improve the forecasting accuracy, not to host a ‘‘horse race’’.
accuracy, not to host a ‘‘horse race’’. 2.
2. The five major finThe five major findings and the condings and the conclusion of M4clusion of M4 Below, we outline what we consider to be the five Below, we outline what we consider to be the five maj
majoror finfindindingsgs ofof thethe M4M4 ComCompetpetitiitionon andadvanandadvancece aa loglogicaicall conclusion from these findings.
conclusion from these findings.
•
• The The cocombmbininatatioionn ofof memeththododss wawass ththee kikingng ofof ththee M4M4..
Of the 17 most accurate methods, 12 were Of the 17 most accurate methods, 12 were
‘‘combi-competition (see below), which is a huge competition (see below), which is a huge improve-ment. It is noted that the best method in the M3 ment. It is noted that the best method in the M3 Competition (
Competition (Makridakis & Hibon, 2000Makridakis & Hibon, 2000)) was only was only 4% more accurate than the same combination. 4% more accurate than the same combination.
•
• The second most accuratThe second most accurate method was a e method was a combicombina-
na-tion of seven statistical methods and one ML one, tion of seven statistical methods and one ML one, with the weights for the averaging being calculated with the weights for the averaging being calculated by a ML algorithm that was trained to minimize the by a ML algorithm that was trained to minimize the for
forecaecastistingng erroerrorr thrthroughholdooughholdoutut testests.ts. ThiThiss metmethodhod was submitted jointly by Spain’s University of A was submitted jointly by Spain’s University of A Coruña and Australia’s Monash University.
Coruña and Australia’s Monash University.
•
• The most accurate and second most accurate meth-The most accurate and second most accurate
meth-ods also achieved an amazing success in specifying ods also achieved an amazing success in specifying the 95% PIs correctly. These are the first methods the 95% PIs correctly. These are the first methods we are aware of that have done so, rather than we are aware of that have done so, rather than underestimating the uncertainty considerably. underestimating the uncertainty considerably.
•
• Th Thee sisixx pupurere MLML memeththododss ththatat wewerere susubmbmitittetedd inin ththee
M4 all performed poorly, with none of them being M4 all performed poorly, with none of them being more accurate than Comb and only one being more more accurate than Comb and only one being more acc
accurauratete thathann NaïNaïve2ve2.. TheThesese resresultultss areare inin agragreemeementent with those of a recent study that was published in with those of a recent study that was published in PLOS ONE
PLOS ONE ( (Makridakis, Spiliotis, & Assimakopoulos,Makridakis, Spiliotis, & Assimakopoulos,
2018
2018).).
Our conclusion from the above findings is that the Our conclusion from the above findings is that the ac-curacy of individual statistical or ML methods is low, and curacy of individual statistical or ML methods is low, and that hybrid approaches and combinations of method are that hybrid approaches and combinations of method are the way forward for improving the forecasting accuracy the way forward for improving the forecasting accuracy and making forecasting more valuable.
and making forecasting more valuable. 2.1.
2.1. The comThe comb benchmb benchmarkark
One innovation of the M4 was the introduction of ten One innovation of the M4 was the introduction of ten sta
standandardrd metmethodhodss (bo(bothth stastatististicticalal andand ML)ML) forfor benbenchmchmark ark--ing the accuracy of the methods submitted to the ing the accuracy of the methods submitted to the compe-tition. From those ten, we selected one against which to tition. From those ten, we selected one against which to compare the submitted methods, namely Comb, a compare the submitted methods, namely Comb, a com-bination based on the simple arithmetic average of the bination based on the simple arithmetic average of the Simple, Holt and Damped exponential smoothing models. Simple, Holt and Damped exponential smoothing models. The reason for such a choice was that Comb is easy to The reason for such a choice was that Comb is easy to compute, can be explained/understood intuitively, is quite compute, can be explained/understood intuitively, is quite robust, and typically is more accurate than the individual robust, and typically is more accurate than the individual me
meththododss bebeiningg cocombmbinineded.. FoForr ininststanancece,, foforr ththee enentitirere sesett of of the M4 series, the sMAPE o
the M4 series, the sMAPE of Single was 13.09%, that f Single was 13.09%, that ofof HoltHolt was 13.77%, and that of Damped was 12.66%, while that of was 13.77%, and that of Damped was 12.66%, while that of Com
Combb waswas 12.12.55%55%.. TabTablele11shoshowsws thethe sMAsMAPEPE andtheandthe oveoveralralll weighted average (OWA) – for definitions, see
weighted average (OWA) – for definitions, see M4 Team M4 Team
((20182018) – of the Comb benchmark, the 17 methods that) – of the Comb benchmark, the 17 methods that per
perforformedmed betbetterter thathann sucsuchh aa benbenchmchmarkark,, andtheandthe twomosttwomost accurate ML submissions. The last two columns of
accurate ML submissions. The last two columns of Table 1 Table 1
show the percentage that each method was better/worse show the percentage that each method was better/worse than Comb in terms of both sMAPE and OWA.
than Comb in terms of both sMAPE and OWA. 2.
P P l l e e a a s s e e c c i i t t e e t t h h i i s s a a r r t t i i c c l l e e i i n n p p r r e e s s s s a a s s : : M M a a k k r r i i d d a a k k i i s s , , S S . . , , e e t t a a l l . . , , T T h h e e M M 4 4 C C o o m m p p e e t t i i t t i i o o n n : : R R e e s s u u l l t t s s , , f f i i n n d d i i n n g g s s , , c c o o n n c c l l u u s s i i o o n n a a n n d d w w a a y y f f o o r r w w a a r r d d . . I I n n t t e e r r n n a a t t i i o o n n a a l l J J o o u u r r n n a a l l o o f f F F o o r r e e c c a a s s t t i i n n g g ( ( 2 2 0 0 1 1 8 8 ) ) , , h h t t t t p p s s : : / / / / d d o o i i . . o o r r g g / / 1 1 0 0 . .1 1 0 0 1 1 6 6 / / j j . .i i j j f f o o r r e e c c a a s s t t . .2 2 0 0 1 1 8 8 . . 0 0 6 6 . . 0 0 0 0 1 1 . . S S . . M M a a k k r r i i d d a a k k i i s s e e t t a a l l . . / / I I n n t t e e r r n n a a t t i i o o n n a a l l J J o o u u r r n n a a l l o o f f F F o o r r e e c c a a s s t t i i n n g g ( ( ) ) – – 3 3 Table Table 11
The performances of the 17 methods submitted to M4 that had OWAs at least as good as that of the Comb. The performances of the 17 methods submitted to M4 that had OWAs at least as good as that of the Comb.
T
Tyyppe e AAuutthhoorr((ss) ) AAffffiilliiaattiioon n ssMMAAPPEEcc OWAOWAdd RankRankaa % improvement of % improvement of method over the method over the benchmark benchmark Yearly Yearly (23k) (23k) Quarterly Quarterly (24k) (24k) Monthly Monthly (48k) (48k) Others Others (5k) (5k) Average Average (100k) (100k) Yearly Yearly (23k) (23k) Quarterly Quarterly (24k) (24k) Monthly Monthly (48k) (48k) Others Others (5k) (5k) Average Average (100k) (100k) Benchmark for methods
Benchmark for methodsbb 1144..88448 8 1100..11775 5 1133..44334 4 44..99887 7 1122..55555 5 00..88667 7 00..88990 0 00..99220 0 11..00339 9 00..88998 8 119 9 ssMMAAPPE E OOWWAA H
Hyybbrriid d SSmmyyll, , SS. . UUbbeer r TTeecchhnnoollooggiiees s 1133..11776 6 99..66779 9 1122..11226 6 4..040114 4 1111..33774 4 0..707778 8 00..88447 7 00..88336 6 00..99220 0 00..88221 1 1 1 99..44% % 88..66%% Combination
Combination Montero-Manso, Montero-Manso, P.,P., Talagala, T., Hyndman, R. Talagala, T., Hyndman, R. J. & Athanasopoulos, G. J. & Athanasopoulos, G.
University of A Coruña & University of A Coruña & Monash University Monash University
1
133..55228 8 99..77333 3 1122..66339 9 44..11118 1 18 1 1..77220 0 00..77999 9 00..88447 7 00..88558 8 00..99114 0 .4 0 .88338 8 2 2 66..66% % 66..77%%
Combi
Combinatination on PawlPawlikowikowski, ski, M.,M., Chorowska, A. & Yanchuk, Chorowska, A. & Yanchuk, O.
O.
P
PrrooLLooggiissttiicca a SSoofft t 1133..99443 3 99..77996 6 1122..77447 7 33..33665 15 111..88445 5 00..88220 00 0..88555 5 00..88667 7 00..77442 02 0..88441 1 3 3 55..77% % 66..33%%
Combination
Combination Jaganathan, Jaganathan, S. S. & & Prakash,Prakash, P.
P.
IInnddiivviidduuaal l 1133..77112 2 99..88009 9 1122..44887 7 33..88779 9 1111..66995 5 00..88113 3 00..88559 9 00..88554 4 00..88882 2 00..88442 2 4 4 66..88% % 66..22%% Combi
Combinatination on FioruFiorucci, cci, J. A. & J. A. & LouzaLouzada, F. da, F. UnivUniversitersity of Brasy of Brasilia ilia && University of São Paulo University of São Paulo
1
133..66773 3 99..88116 6 1122..77337 7 44..44332 1 12 1 1..88336 6 00..88002 2 00..88555 5 00..88668 8 00..99335 0 .5 0 .88443 3 5 5 55..77% % 66..11%% Combi
Combinatination on PetroPetropoulopoulos, F. s, F. && Svetunkov, I. Svetunkov, I.
University of Bath & Lancaster University of Bath & Lancaster University
University
1
133..66669 9 99..88000 0 1122..88888 8 44..11005 1 15 1 1..88887 7 00..88006 6 00..88553 3 00..88776 6 00..99006 0 .6 0 .88448 8 6 6 55..33% % 55..66%% C
Coommbbiinnaattiioon n SShhaauubb, , DD. . HHaarrvvaarrd d EExxtteennssiioon n SScchhooool l 1133..66779 9 1100..33778 8 1122..88339 9 44..44000 10 122..00220 0 00..88001 01 0..99008 8 00..88882 2 00..99778 08 0..88660 0 7 7 44..33% % 44..22%% Sta
Statististictical al LegLegakiaki, , N. N. Z. Z. & & KouKoutsotsouriuri,, K.
K.
National Technical University of National Technical University of Athens
Athens
1
133..33666 6 1100..11555 5 1133..00002 2 44..66882 1 12 1 1..99886 6 00..77888 0 .8 0 .88998 8 00..99005 5 00..99889 0 .9 0 .88661 1 8 8 44..55% % 44..11%% Combi
Combinatination on DoornDoornik, J., Cik, J., Castlastle, J. &e, J. & Hendry, D.
Hendry, D.
U
Unniivveerrssiitty y oof f OOxxffoorrd d 1133..99110 0 1100..00000 0 1122..77880 0 33..88002 2 1111..99224 4 00..88336 6 00..88778 8 00..88881 1 0..808774 4 00..88665 5 9 9 55..00% % 33..77%% Combination
Combination Pedregal, Pedregal, D.J., D.J., Trapero, Trapero, J.J. R., Villegas, M. A. & R., Villegas, M. A. & Madrigal, J. J. Madrigal, J. J.
U
Unniivveerrssiitty y oof f CCaassttiillllaa--LLa a MMaanncchha 1 3a 1 3..88221 1 1100..00993 3 1133..11551 1 4..040112 2 1122..11114 4 00..88224 4 00..88883 3 00..88999 9 00..88995 5 00..88669 9 110 0 33..55% % 33..22%%
Sta
Statististictical al SpiSpilioliotistis, , E. E. && Assimakopoulos, V. Assimakopoulos, V.
National Technical University of National Technical University of Athens
Athens
1
133..88004 4 1100..11228 8 1133..11442 2 44..66775 5 1122..11448 8 00..88223 3 00..88889 9 00..99007 7 00..99775 5 00..88774 4 111 1 33..22% % 22..77%% Co
Combmbininatatioion Roubn Roubininchchteteinin, , A. A. WashWashiningtgton on StStatate e EmEmplployoymementnt Security Department Security Department
1
144..44445 5 1100..11772 2 1122..99111 1 44..44336 6 1122..11883 3 00..88550 0 00..88885 5 00..88881 1 00..99992 2 00..88776 6 112 2 33..00% % 22..44%% O
Otthheer r IIbbrraahhiimm, , MM. . GeGeoorrggiia a IInnssttiittuutte e oof f TTeecchhnnoollooggy y 1133..66777 7 1100..00889 9 1133..33221 1 44..77447 7 1122..11998 8 00..88005 5 00..88990 0 00..99221 1 11..00779 9 00..88880 0 113 3 22..88% % 22..00%% C
Coommbbiinnaattiioon n KKuullll, , MM.., , eet t aall. . UUnniivveerrssiitty y oof f TTaarrttu u 1144..00996 6 1111..11009 9 1133..22990 0 4..141669 9 1122..44996 6 0..808220 0 00..99660 0 00..99332 2 00..88883 3 00..88888 8 114 4 00..55% % 11..11%% Co
Combmbiinanatition Won Waaheheebeb, , WW. . UnUniiveversrsititi i TuTun n HuHusssseiein n OnOnnn Malaysia
Malaysia
1
144..77883 3 1100..00559 9 1122..77770 0 44..00339 9 1122..11446 6 00..88880 0 00..88880 0 00..99227 7 00..99004 4 00..88994 4 115 5 33..33% % 00..44%% Sta
Statististictical al DarDarin, in, S. S. & & SteStellwllwageagen, n, E. E. BusBusineiness ss ForForecaecast st SysSystemtemss (Forecast Pro)
(Forecast Pro)
1
144..66663 3 1100..11555 5 1133..00558 8 44..00441 1 1122..22779 9 00..88777 7 00..88887 7 00..88887 7 11..00111 1 00..88995 5 116 6 22..22% % 00..33%% Combi
Combinatination on DantDantas, as, T. T. & & OlivOliveira, eira, F. F. PontiPontificafical l CathoCatholic lic UnivUniversiersity ty of of Rio de Janeiro
Rio de Janeiro
1
144..77446 6 1100..22554 4 1133..44662 2 44..77883 3 1122..55553 3 00..88666 6 00..88992 2 00..99114 4 11..00111 1 00..88996 6 117 7 00..00% % 00..22%% B
Beesst t MML L TTrroottttaa, , BB. . IInnddiivviidduuaal l 1414..33997 7 1111..00331 1 1133..99773 3 44..55666 6 1122..88994 4 00..88559 9 00..99339 9 00..99441 1 00..99991 1 00..99115 5 2323 −−2.7%2.7% −−1.9%1.9%
2
2nnd d BBeesst t MML L BBoonntteemmppii, , GG. . UUnniivveerrssiitté é LLiibbrre e dde e BBrruuxxeellllees s 1166..66113 3 1111..77886 6 1144..88000 0 44..77334 4 1133..99990 0 11..00550 0 11..00772 2 1..010007 7 11..00551 1 11..00445 5 3377 −−11.4%11.4% −−16.4%16.4%
aaRank: rank calculated considering both the submitted methods (50) and the set of benchmarks (10).Rank: rank calculated considering both the submitted methods (50) and the set of benchmarks (10). b
bCom: the benchmark is the average (combination) of Simple, Holt and Damped exponential smoothing.Com: the benchmark is the average (combination) of Simple, Holt and Damped exponential smoothing. ccsMAPE: symmetric mean absolute percentage error.sMAPE: symmetric mean absolute percentage error.
d
4
4 S. S. MakMakridridakiakis s et et al. al. / / IntInternernatioational nal JouJournarnal l of of ForForecasecastinting g ( ( ) ) ––
foreca
forecastingsting enginengine.e. ItIt isis alsoalso interinterestinestingg thatthat thisthisforecaforecastingsting metho
methodd incorincorporateporatedd informinformationation fromfrom bothboth thethe indivindividualidual serie
series s and the whole and the whole datasdataset, thus exploitinet, thus exploiting g data in a data in a hi- hi-erarchical way. The remaining five methods all had erarchical way. The remaining five methods all had some-thi
thingng inin comcommonmon:: thetheyy utiutilizlizeded varvariouiouss forformsms ofof comcombinbininging mainly statistical methods, and most used holdout tests mainly statistical methods, and most used holdout tests to figure out the optimal weights for such a combination. to figure out the optimal weights for such a combination. The differences in forecasting accuracy between these five The differences in forecasting accuracy between these five methods were small, with the sMAPE difference being less methods were small, with the sMAPE difference being less than 0.2% and the OWA difference being just 0.01. It is also than 0.2% and the OWA difference being just 0.01. It is also interesting to note that although these five methods used interesting to note that although these five methods used a combination of mostly statistical models, some of them a combination of mostly statistical models, some of them utilized ML ideas for determining the optimal weights for utilized ML ideas for determining the optimal weights for averaging the individual forecasts.
averaging the individual forecasts.
The rankings of the six ML methods were 23, 37, 38, The rankings of the six ML methods were 23, 37, 38, 48, 54, and 57 out of a total of 60 (50 submitted and the 48, 54, and 57 out of a total of 60 (50 submitted and the 10 benchmarks). These results confirm our recent findings 10 benchmarks). These results confirm our recent findings ((Makridakis et al., 2018Makridakis et al., 2018) regarding the poor performances) regarding the poor performances of
of thethe MLML metmethodhodss relrelatiativeve toto stastatististicticalal oneones.s. HowHoweveever,r, wewe must emphasize that the performances of the ML methods must emphasize that the performances of the ML methods reported in our paper, as well as those found by
reported in our paper, as well as those found by Ahmed, Ahmed,
Atiya, Gayar, and El-Shishiny
Atiya, Gayar, and El-Shishiny ((20102010) and those submit-) and those submit-ted in the M4, used mainly generic guidelines and ted in the M4, used mainly generic guidelines and stan-dard algorithms that could be improved substantially. One dard algorithms that could be improved substantially. One possible direction for such an improvement could be the possible direction for such an improvement could be the development of hybrid approaches that utilize the best development of hybrid approaches that utilize the best characteristics of ML algorithms and avoid their characteristics of ML algorithms and avoid their disadvan-tag
tages,es, whiwhilele atat thethe samsamee timtimee explexploitoitinging thethe strstrongong feafeaturtureses of well-known statistical methods. We believe strongly in of well-known statistical methods. We believe strongly in the
the potpotententialial ofof MLML metmethodhods,s, whiwhichch areare stistillll inin thetheirir infinfancancyy as far as time series forecasting is concerned. However, to as far as time series forecasting is concerned. However, to be able to reach this potential, their existing limitations, be able to reach this potential, their existing limitations, such as preprocessing and overfitting, must be identified such as preprocessing and overfitting, must be identified and most important
and most importantly accepted by ly accepted by the ML the ML commucommunity,nity, rather than being ignored. Otherwise, no effective ways of rather than being ignored. Otherwise, no effective ways of correcting the existing problems will be devised.
correcting the existing problems will be devised. 2.3.
2.3. Prediction intervals Prediction intervals (PIs)(PIs)
Our knowledge of PIs is rather limited, and much could Our knowledge of PIs is rather limited, and much could be learned by studying the results of the M4 Competition. be learned by studying the results of the M4 Competition. Table 2
Table 2 presents the performance of the 12 methods (out presents the performance of the 12 methods (out of the 18 submitted) that achieved mean scaled interval of the 18 submitted) that achieved mean scaled interval scores (MSISs), (see
scores (MSISs), (see M4 Team, 2018 M4 Team, 2018,, for definitions) that for definitions) that we
werere atat leleasastt asas gogoodod asas ththatat ofof ththee NaNaïvïve1e1 memeththodod.. ThThee lalastst two columns of
two columns of Table 2 Table 2 show the percentages of cases in show the percentages of cases in which each method was better/worse than the Naive1 in which each method was better/worse than the Naive1 in terms of MSIS and the absolute coverage difference (ACD); terms of MSIS and the absolute coverage difference (ACD); i.e., the absolute difference between the coverage of the i.e., the absolute difference between the coverage of the method and the target set of 0.95. The performances of method and the target set of 0.95. The performances of two additional methods, namely ETS and auto.arima in two additional methods, namely ETS and auto.arima in the R
the R forecast forecast package ( package (Hyndman et al., 2018Hyndman et al., 2018)), have also, have also been included as benchmarks, but Naïve1 was preferred been included as benchmarks, but Naïve1 was preferred
of estimating the forecasting uncertainty. Once again, the of estimating the forecasting uncertainty. Once again, the hyb
hybridrid andand comcombinbinatiationon appapproaroachechess ledled toto thethe mosmostt corcorrecrectt PIs, with the statistical methods displaying significantly PIs, with the statistical methods displaying significantly worse performances and none of the ML methods worse performances and none of the ML methods man-aging to outperform the benchmark. However, it should aging to outperform the benchmark. However, it should be mentioned that, unlike the best ones, the majority of be mentioned that, unlike the best ones, the majority of the methods underestimated reality considerably, the methods underestimated reality considerably, espe-cially for the low frequency data (yearly and quarterly). cially for the low frequency data (yearly and quarterly). For example, the average coverages of the methods were For example, the average coverages of the methods were 0.80, 0.90, 0.92 and 0.93 for the yearly, quarterly, monthly 0.80, 0.90, 0.92 and 0.93 for the yearly, quarterly, monthly and other data, respectively. These findings demonstrate and other data, respectively. These findings demonstrate that standard forecasting methods fail to estimate the that standard forecasting methods fail to estimate the un-certainty properly, and therefore that more research is certainty properly, and therefore that more research is required to correct the situation.
required to correct the situation.
3.
3. The hyThe hypothepothesesses We
We now present the now present the ten predictionten predictions/hyps/hypothesotheses es wewe sent to the IJF Editor-in-Chief two and a half months ago sent to the IJF Editor-in-Chief two and a half months ago about the expected results of the M4. A much fuller and about the expected results of the M4. A much fuller and more detailed appraisal of these ten hypotheses is left more detailed appraisal of these ten hypotheses is left for the planned special issue of IJF. This section merely for the planned special issue of IJF. This section merely provides some short, yet indicative answers.
provides some short, yet indicative answers. 1.
1. The The forforecaecastisting ng accaccurauraciecies s of of simsimple ple metmethodhods,s, such as the eight statistical benchmarks included such as the eight statistical benchmarks included in M4, will not be too far from those of the most in M4, will not be too far from those of the most accurate methods.
accurate methods.
This is only partially true. On the one hand, the This is only partially true. On the one hand, the hyb
hybrid rid metmethod hod and and sevseveraeral l of of the the comcombinbinatiationon met
methodhods s all all outoutperperforformed med ComComb, b, the the stastatististicticalal bench
benchmark, by mark, by a a percenpercentage ranging between 9.4%tage ranging between 9.4% for the first and 5% for the ninth. These percentages for the first and 5% for the ninth. These percentages are higher than the 4% improvement of the best are higher than the 4% improvement of the best method in the M3 relative to the Comb benchmark. method in the M3 relative to the Comb benchmark. On the other hand, the improvements below the On the other hand, the improvements below the tenth method were lower, and one can question tenth method were lower, and one can question whether the level of improvement was exceptional, whether the level of improvement was exceptional, giv
givenen theadvantheadvancedced algalgoriorithmthmss utiutilizlizeded andand thecom- thecom-putational resources used.
putational resources used. 2.
2. The 95% PIs will underestimate reality consider- The 95% PIs will underestimate reality consider-abl
ably,y, andthisandthis undundereerestistimatimationon willwill incincreareasese asas thethe forecasting horizon lengthens.
forecasting horizon lengthens.
Indeed, except for the two top performing Indeed, except for the two top performing meth-ods of the competition, the majority of the ods of the competition, the majority of the remain-ing methods underestimated reality by providremain-ing ing methods underestimated reality by providing overly
overly narronarroww conficonfidencedence interintervals.vals.ThisThis isis especiespeciallyally true for the low frequency data (yearly and true for the low frequency data (yearly and quar-terly
terly), ), where longer-term forecastwhere longer-term forecasts s are are requirerequired,d, and therefore the uncertainty is higher. Increases in and therefore the uncertainty is higher. Increases in data availability and the generation of short-term data availability and the generation of short-term forecasts may also contribute to the improvements forecasts may also contribute to the improvements
P P l l e e a a s s e e c c i i t t e e t t h h i i s s a a r r t t i i c c l l e e i i n n p p r r e e s s s s a a s s : : M M a a k k r r i i d d a a k k i i s s , , S S . . , , e e t t a a l l . . , , T T h h e e M M 4 4 C C o o m m p p e e t t i i t t i i o o n n : : R R e e s s u u l l t t s s , , f f i i n n d d i i n n g g s s , , c c o o n n c c l l u u s s i i o o n n a a n n d d w w a a y y f f o o r r w w a a r r d d . . I I n n t t e e r r n n a a t t i i o o n n a a l l J J o o u u r r n n a a l l o o f f F F o o r r e e c c a a s s t t i i n n g g ( ( 2 2 0 0 1 1 8 8 ) ) , , h h t t t t p p s s : : / / / / d d o o i i . . o o r r g g / / 1 1 0 0 . .1 1 0 0 1 1 6 6 / / j j . .i i j j f f o o r r e e c c a a s s t t . .2 2 0 0 1 1 8 8 . . 0 0 6 6 . . 0 0 0 0 1 1 . . S S . . M M a a k k r r i i d d a a k k i i s s e e t t a a l l . . / / I I n n t t e e r r n n a a t t i i o o n n a a l l J J o o u u r r n n a a l l o o f f F F o o r r e e c c a a s s t t i i n n g g ( ( ) ) – – 5 5 Table Table 22
The performances of the 12 methods submitted to M4 with MSISs at least as good as that of the Naïve. The performances of the 12 methods submitted to M4 with MSISs at least as good as that of the Naïve.
T
Tyyppe e AAuutthhoorr((ss) ) AAffffiilliiaattiioon n MSMSIISScc ACDACDdd RankRankaa % improvement% improvement of method over of method over the benchmark the benchmark Yearly Yearly (23k) (23k) Quarterly Quarterly (24k) (24k) Monthly Monthly (48k) (48k) Others Others (5k) (5k) Average Average (100k) (100k) Yearly Yearly (23k) (23k) Quarterly Quarterly (24k) (24k) Monthly Monthly (48k) (48k) Others Others (5k) (5k) Average Average (100k) (100k) Naïve 1: Benchmark for methods
Naïve 1: Benchmark for methodsbb 5566..55554 4 1144..00773 3 1122..33000 0 3355..33111 1 2244..00555 5 00..22334 4 00..00884 4 00..00223 3 00..00119 9 00..00886 6 115 5 MMSSIIS S AACCDD H
Hyybbrriid d SSmmyyll, , SS. . UUbbeer r TTeecchhnnoollooggiiees s 2233..88998 8 88..55551 1 77..22005 5 2244..44558 8 1122..22330 0 00..00003 3 00..00004 4 00..00005 5 00..00001 1 00..00002 2 1 1 4499..22% % 9977..44%% Combi
Combinatination on MontMontero-Mero-Mansoanso, , P.,P., Talagala, T., Hyndman, R. Talagala, T., Hyndman, R. J. & Athanasopoulos, G. J. & Athanasopoulos, G.
University of A Coruña & University of A Coruña & Monash University Monash University
2
277..44777 7 99..33884 4 88..66556 6 3232..11442 2 1144..33334 4 00..00114 4 00..00116 6 00..00116 6 00..00227 7 0..000110 0 2 2 4400..44% % 8888..88%%
Combi
Combinatination on DoornDoornik, ik, J., CJ., Castlastle, Je, J. &. & Hendry, D.
Hendry, D.
U
Unniivveerrssiitty y oof f OOxxffoorrd d 3300..22000 0 99..88448 8 99..44994 4 2626..33220 0 1155..11883 3 00..00337 7 00..00229 9 00..00554 4 00..00339 9 00..00443 3 3 3 3366..99% % 5500..00%% Combi
Combinatination on FioruFiorucci, cci, J. AJ. A. & . & LouzaLouzada, da, F. F. UnivUniversiersity of ty of BrasBrasilia ilia && University of São Paulo University of São Paulo
3
355..88444 4 99..44220 0 88..00229 9 2626..77009 9 1155..66995 5 00..11664 4 00..00556 6 00..00228 8 00..00005 5 0..000665 5 5 5 3344..88% % 2244..66%% Combi
Combinatination on PetroPetropoulospoulos, , F. F. && Svetunkov, I. Svetunkov, I.
University of Bath & University of Bath & Lancaster University Lancaster University
3
355..99445 5 99..88993 3 88..22330 0 2727..77880 0 1155..99881 1 00..11771 1 00..00664 4 00..00335 5 00..00112 2 0..000772 2 6 6 3333..66% % 1166..44%% Co
Combmbininatatioion Roun Roubibincnchthteiein, n, A. A. WaWashshiningtgton on StStatatee Employment Security Employment Security Department Department 3 377..77006 6 99..99445 5 88..22333 3 2929..88666 6 1166..55005 5 00..11667 7 00..00449 9 00..00221 1 00..00004 4 0..000661 1 7 7 3311..44% % 2299..44%% St
Statatisistiticacal l TaTalalagagalala, , T.T.,, Athanasopoulos, G. & Athanasopoulos, G. & Hyndman, R. J. Hyndman, R. J.
M
Moonnaassh h UUnniivveerrssiitty y 3399..77993 3 1111..22440 0 99..88225 5 3737..22222 2 1188..44227 7 00..11665 5 00..00774 4 00..00556 6 00..00448 8 00..00885 5 8 8 2233..44% % 00..99%%
O
Otthheer r IIbbrraahhiimm, , MM. . GGeeoorrggiia a IInnssttiittuutte e oof f Technology Technology
4
455..00228 8 1111..22884 4 99..33990 0 5522..66004 4 2200..22002 2 00..22111 1 00..00886 6 00..00550 0 00..00111 1 0..000994 4 110 0 1166..00%% −−9.1%9.1%
Sta
Statististictical al IqbIqbal, al, A.,A.,SeeSeery, ry, S. S. & & SilSilviavia,, J.
J.
W
Weelllls s FFaarrggo o SSeeccuurriittiiees s 5533..44669 9 1111..22995 5 1111..11559 9 3232..77221 1 2222..00001 1 00..22114 4 00..00447 7 00..00448 8 00..00550 0 00..00886 6 111 1 88..55% % 00..11%% C
Coommbbiinnaattiioon n RReeiillllyy, , TT. . AAuuttoommaattiic c FFoorreeccaassttiinngg Systems, Inc. (AutoBox) Systems, Inc. (AutoBox)
5
533..99998 8 1133..55337 7 1100..33444 4 3344..66880 0 2222..33667 7 00..22993 3 00..11002 2 00..00557 7 00..00446 6 00..11221 1 112 2 77..00%% −−41.1%41.1%
Sta
Statististictical al WaiWainwrnwrighight, t, E., E., ButButz, z, E. E. && Raychaudhuri, S. Raychaudhuri, S.
O
Orraacclle e CCoorrppoorraattiioon n 5588..55996 6 1122..44226 6 99..77114 4 3311..00336 6 2222..66774 4 00..22995 5 00..11000 0 00..00556 6 00..00228 8 00..11220 0 113 3 55..77%% −−39.6%39.6%
Combi
Combinatination on SegurSegura-Hera-Heras, as, J.-V.J.-V.,, Vercher-Gonzál Vercher-González, ez, E.,E., Bermúdez-Edo, J. D. & Bermúdez-Edo, J. D. & Corberán-Vallet, A. Corberán-Vallet, A. Universidad Miguel Universidad Miguel Hernández & Universitat Hernández & Universitat de Valencia
de Valencia
5
522..33116 6 1155..00558 8 1111..22220 0 3333..66885 5 2222..77117 7 00..11332 2 00..00339 9 00..00114 4 00..00552 2 0..000449 9 114 4 55..66% % 4433..11%%
aaRank: rank calculated considering both the submitted methods (18) and the set of benchmarks (3).Rank: rank calculated considering both the submitted methods (18) and the set of benchmarks (3). b
bNaive: the benchmark is the Naïve 1 method.Naive: the benchmark is the Naïve 1 method. ccMSIS: mean scaled interval score.MSIS: mean scaled interval score.
d
6
6 S. S. MakMakridridakiakis s et et al. al. / / IntInternernatioational nal JouJournarnal l of of ForForecasecastinting g ( ( ) ) ––
Th
This is is is veveririfified ed ththrorough ugh ththe e peperfrforormamancnces es of of the
the respecrespective M4 tive M4 benchbenchmarksmarks. . Holt exponentiHolt exponentialal smo
smoothothingdisplingdisplaysanaysan OWAofOWAof 0.90.97171 (30(30thth pospositiition)on),, while Damped has an OWA of 0.907 (21st position), while Damped has an OWA of 0.907 (21st position), thus improving the forecasting accuracy by more thus improving the forecasting accuracy by more tha
thann 6.56.5%% whewhenn extextraprapolaolatintingg usiusingng damdampedped insinsteateadd of linear trends.
of linear trends. 4.
4. The majority of ML methods will not be more The majority of ML methods will not be more acc
accurauratete thathann stastatististicaticall oneones,s, altalthouhoughgh thetherere may may be
be onone e or or twtwo o plpleaeasasant nt susurprprisrises es whwherere e susuchch methods are superior to statistical ones, though methods are superior to statistical ones, though only by some small margin.
only by some small margin.
This was one of the major findings of the This was one of the major findings of the competi-tion, the demonstration that none of the pure ML tion, the demonstration that none of the pure ML used managed to outperform Comb and that only used managed to outperform Comb and that only one of them was more accurate than the Naive2. one of them was more accurate than the Naive2. The one pleasant surprise was the hybrid method The one pleasant surprise was the hybrid method intro
introduced by duced by Smyl, which Smyl, which incorpincorporateorated d featurfeatureses from both statistical and ML methods and led to a from both statistical and ML methods and led to a considerable improvement in forecasting accuracy. considerable improvement in forecasting accuracy. 5.
5. The The comcombinabinationtion ofof stastatististicaticall andand/or/or MLML metmethodhodss will
will produce produce more more accurate accurate results results than than the the bestbest of the individual methods combined.
of the individual methods combined.
This is a major finding, or better, a confirmation, This is a major finding, or better, a confirmation, from this competition, the demonstration that from this competition, the demonstration that com-bin
binatiation on is is one one of of the the besbest t forforecaecastisting ng prapracticticesces.. Sin
Sincece thethe majmajoriorityty ofof thethe parparticticipaipantsnts whowho comcombinbineded various methods used more or less the same pool various methods used more or less the same pool of models (most of which were the M4 statistical of models (most of which were the M4 statistical ben
benchmchmarkarks),s), anan intintereerestistingng quequestistionon is,is, ‘‘wh‘‘whichich fea fea--tures make a combination approach more accurate tures make a combination approach more accurate than others?’’
than others?’’ 6.
6. Seasonality will continue to dominate the fluctu-Seasonality will continue to dominate the fluctu-ations of the time series, while randomness will ations of the time series, while randomness will remain the most critical factor influencing the remain the most critical factor influencing the forecasting accuracy.
forecasting accuracy. Hav
Havinging corcorrelrelateatedd thesMAPEthesMAPE valvaluesues reporeportertedd perper se- se-ries for the 60 methods utilized within M4 with the ries for the 60 methods utilized within M4 with the corresponding estimate of various time series corresponding estimate of various time series char-acteristics
acteristics ((Kang, Hyndman, & Smith-Miles, 2017Kang, Hyndman, & Smith-Miles, 2017),), such
such asas randorandomnessmness,, trendtrend,, seasoseasonalitnality,y, linealinearityrity andand stability, our initial finding is that, on average, stability, our initial finding is that, on average, ran-domness is the most critical factor for determining domness is the most critical factor for determining the forecasting accuracy, followed by linearity. We the forecasting accuracy, followed by linearity. We al
alsoso fofounundd ththatat seseasasononalal titimeme seseririeses tetendnd toto bebe eaeasisierer to predict. This last point is a reasonable statement to predict. This last point is a reasonable statement if we consider that seasonal time series are likely to if we consider that seasonal time series are likely to be less noisy.
be less noisy. 7.
7. The sample size will not be a The sample size will not be a signifsignificant factoicant factor inr in improv
improving ing the the forecaforecasting accuracy sting accuracy of of statisstatisticaltical methods.
methods. Aft
Afterer corcorrelrelatiatingng thethe sMAsMAPEPE valvaluesues repreportorteded forfor eaceachh series with the lengths of these series, it seems that series with the lengths of these series, it seems that
This prediction/hypothesis is related closely to the This prediction/hypothesis is related closely to the following two claims (9 and 10). If the following two claims (9 and 10). If the character-istics of the M3 and M4 series are similar, then istics of the M3 and M4 series are similar, then the forecasting performances of the various the forecasting performances of the various partic-ipating methods will be close, especially for ipating methods will be close, especially for eco-nomic
nomic/busin/businessess data,data, wherewhere thethe datadata availavailabiliabilityty andand freq
frequenuencycy ofof thethe twotwo datdataseasetsts areare comcomparparablable.e. How How--ever, excessive experiments are required in order to ever, excessive experiments are required in order to answer this question, and therefore no firm answer this question, and therefore no firm conclu-sion can be drawn at this point. We can only state sion can be drawn at this point. We can only state that this is quite true for the case of the statistical that this is quite true for the case of the statistical M4 benchmarks.
M4 benchmarks. 9.
9. We would expect few, not statistically important,We would expect few, not statistically important, dif
differeferencences s in in the the chacharacracteriterististics cs of of the the serseriesies used in the M3 and M4 Competitions in terms of used in the M3 and M4 Competitions in terms of their seasonality, trend, trend-cycle and their seasonality, trend, trend-cycle and random-ness.
ness.
An analysis of the characteristics of the time An analysis of the characteristics of the time se-ries (
ries (Kang et al., 2017Kang et al., 2017) in each of the two datasets) in each of the two datasets demonstrated that the basic features of the series demonstrated that the basic features of the series were
were quiquitete simsimilailar,r, thothoughugh thethe M4M4 datdataa werweree slislightghtlyly more trended and less seasonal that those of M3. more trended and less seasonal that those of M3. How
Howeveever,r, aa mormoree detdetailaileded anaanalyslysisis wilwilll bebe perperforformedmed to
to veriverify fy thethese se preprelimliminainary ry finfindindings, gs, taktaking ing intintoo consideration possible correlations that may exist consideration possible correlations that may exist between such features. This conclusion is also between such features. This conclusion is also sup-ported through claim 10, as time series ported through claim 10, as time series character-ist
istics ics are are relrelateated d clocloselsely y to to the the perperforformanmances ces of of foreca
forecastinstingg methomethodsds ((PetropPetropoulosoulos,,MakridMakridakis,akis,AssiAssi-
-makopoulos, & Nikolopoulos, 2014
makopoulos, & Nikolopoulos, 2014).).
10.
10. There will be small, not statistically significant, There will be small, not statistically significant, differe
differences in nces in the accuracies of the the accuracies of the eight statisti-eight statisti-ca
call bebencnchmhmararksks bebetwetweenen ththee M3M3 anandd M4M4 dadatatasesetsts.. Similarities were evident when the performances Similarities were evident when the performances of the M4 benchmarks were compared across the of the M4 benchmarks were compared across the M3
M3 anand d M4 M4 dadatatasesetsts. . NoNot t ononly ly didid d ththe e raranknks s of of the benchmarks remain almost unchanged the benchmarks remain almost unchanged (Spear-man’s correlation coefficient was 0.95), but also the man’s correlation coefficient was 0.95), but also the abs
absoluolute te valvalues ues of of the errors the errors (sM(sMAPEAPE, , MASMASE E andand OWA) were very close. This is an indication that OWA) were very close. This is an indication that both the M3 and M4 data provide a good both the M3 and M4 data provide a good repre-sentation of the forecasting reality in the business sentation of the forecasting reality in the business world.
world.
4.
4. The waThe way forwy forwardard
Forecasting competitions provide the equivalent of lab Forecasting competitions provide the equivalent of lab experi
experiments in ments in physiphysics or cs or chemichemistry. The stry. The probleproblem m is thatis that sometimes we are not open-minded, but instead put our sometimes we are not open-minded, but instead put our biases and vested interests above the objective evidence biases and vested interests above the objective evidence from the competitions (
from the competitions (Makridakis, Assimakopoulos et al.,Makridakis, Assimakopoulos et al.,
in press
Competi-Trusted by over 1 million members
Try Scribd
FREE for 30 days to access over 125 million titles without ads or interruptions!
Start Free Trial
Cancel Anytime.
S.
S. MakMakridridakiakis s et et al. al. / / IntInternernatiationaonal l JourJournal nal of of ForForecaecastinsting g ( ( ) ) –– 77
ended, it still led other researchers to experiment with ended, it still led other researchers to experiment with the M3 data and propose additional, more accurate the M3 data and propose additional, more accurate fore-casting approaches. The major contributions of M4 have casting approaches. The major contributions of M4 have been (a) the introduction of a ‘‘hybrid’’ method, (b) the been (a) the introduction of a ‘‘hybrid’’ method, (b) the con
confirfirmatmationofionof thesuperithesuperioriorityty ofof thecombithecombinatnationionss andtheandthe inc
incorporporaoratiotion n of of ML ML algalgoriorithmthms s for for detdetermerminiining ng thetheirir weighting, and thus improving their accuracy, and (c) the weighting, and thus improving their accuracy, and (c) the impressive performances of some methods for impressive performances of some methods for determin-ing PIs.
ing PIs.
Rationality suggests that one must accept that all Rationality suggests that one must accept that all fore-casting approaches and individual methods have both casting approaches and individual methods have both ad-vantages and drawbacks, and therefore that one must be vantages and drawbacks, and therefore that one must be eclectic in exploiting such advantages while also avoiding eclectic in exploiting such advantages while also avoiding or minimizing their shortcomings. This is why the logical or minimizing their shortcomings. This is why the logical way forward is
way forward is the utilizatthe utilization of ion of hybrihybrid d and combinatand combinationion methods. We are confident that the 100,000 time series of methods. We are confident that the 100,000 time series of the M4 will become a testing ground on which forecasting the M4 will become a testing ground on which forecasting resear
researchers can chers can experiexperiment and ment and discodiscover ver new, increas-new, increas-ingly accurate forecasting approaches. Empirical evidence ingly accurate forecasting approaches. Empirical evidence mus
mustt guiguidede ourour effeffortortss toto impimprovrovee thethe forforecaecastistingng accaccurauracy,cy, rather than personal opinions.
rather than personal opinions. References
References
Ahm
Ahmed,N.ed,N.K.,AtiK.,Atiya,A.ya,A.F.,GayaF.,Gayar,N.r,N.E.,&E.,&El-El-ShiShishishiny,H.ny,H.(20(2010)10)..AnempirAnempiricaicall
comparison of machine learning models for time series forecasting.
comparison of machine learning models for time series forecasting.
Econometric Reviews
Econometric Reviews,, 29 29(5–6), 594–621.(5–6), 594–621.
Hyndman, R., Athanasopoulos, G., Bergmeir, C., Caceres, G., Chhay, L., & Hyndman, R., Athanasopoulos, G., Bergmeir, C., Caceres, G., Chhay, L., &
O’Har
O’Hara-Wia-Wild,ld, M.M. etet al.,al., (2018)(2018).. ForecForecast:ast: ForecForecastiastingng functfunctionsions forfor timetime series and linear models. R package version 8.3.
series and linear models. R package version 8.3. Kang
Kang, , Y., Hyndman, R. Y., Hyndman, R. J., & J., & SmithSmith-Mil-Miles, K. es, K. (2017(2017). ). VisuaVisualisilising fore-ng
fore-cast
casting ing algoalgorithrithm m perforperformancmance e using time using time serieseries s instinstance ance spacspaces.es.
International Journal of Forecasting
International Journal of Forecasting ,, 33 33(2), 345–358.(2), 345–358.
M4 Team (2018). M4 competitor’s guide: prizes and rules. See
M4 Team (2018). M4 competitor’s guide: prizes and rules. See https:// https://
www.m4.unic.ac.cy/wp-content/uploads/2018/03/M4-Competitors-Guide.pdf
Guide.pdf .. Mak
Makridridakiakis,s,S.,AnderS.,Andersensen,,A.,CarboA.,Carbone,R.,ne,R.,FilFildesdes,,R.,R.,HibHibon,M.,on,M.,etetal.(1982al.(1982).).
The accurac
The accuracy y of of extraextrapolatpolation (time series) methods: resultion (time series) methods: results s of of aa
forecasting competition.
forecasting competition. Journal of Forecasting Journal of Forecasting ,, 1 1, 111–153., 111–153.
Makridakis, S., Assimakopoulos, V., & Spiliotis, E. (2018). Objectivity,
Makridakis, S., Assimakopoulos, V., & Spiliotis, E. (2018). Objectivity,
re-producibility and replicability in forecasting research.
producibility and replicability in forecasting research. International International
Journal of
Journal of Forecasting Forecasting ,, 34 34(3) (in press).(3) (in press).
Mak
Makridridakiakis,s,S.,ChatfS.,Chatfielield,d,C.,HibonC.,Hibon,,M.,LawrM.,Lawrencence,e,M.,MillM.,Mills,s,T.,etT.,etal.(1993al.(1993).).
The
The M2-cM2-competompetitionition: : A A real-real-time time judgmejudgmentalntally ly based based forecforecastiastingng
study.
study. International Journal of Forecasting International Journal of Forecasting ,, 9 9(1), 5–22.(1), 5–22.
Makridakis, S., & Hibon, M. (1979). Accuracy of forecasting: an empirical
Makridakis, S., & Hibon, M. (1979). Accuracy of forecasting: an empirical
investigation.
investigation. Journal of the Royal Statistical Society, Series A (General) Journal of the Royal Statistical Society, Series A (General),,
142
142(2), 97–145.(2), 97–145.
Makridakis, S., & Hibon, M. (2000). The M3-Competition: results,
Makridakis, S., & Hibon, M. (2000). The M3-Competition: results,
con-clusions and implications.
clusions and implications. International Journal of Forecasting International Journal of Forecasting ,, 16 16(4),(4),
451–476.
451–476.
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018). Statistical and
Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018). Statistical and
machine learning forecasting methods: concerns and ways forward.
machine learning forecasting methods: concerns and ways forward.
PLOS ONE
PLOS ONE ,, 13 13(3), 1–26.(3), 1–26.
Montero-Manso, P., Netto, C., & Talagala, T. (2018). M4comp2018: Data Montero-Manso, P., Netto, C., & Talagala, T. (2018). M4comp2018: Data
from the M4-Competition. R package version 0.1.0. from the M4-Competition. R package version 0.1.0.
Petropoulos, F., Makridakis, S., Assimakopoulos, V., & Nikolopoulos, K.
Petropoulos, F., Makridakis, S., Assimakopoulos, V., & Nikolopoulos, K.
(2014). ‘Horses for courses’ in demand forecasting.
(2014). ‘Horses for courses’ in demand forecasting. European Journal European Journal
of Operational Research