Problems, causes and solutions when adopting continuous delivery—A systematic literature review

25  Download (0)

Full text

(1)

ContentslistsavailableatScienceDirect

Information

and

Software

Technology

journalhomepage:www.elsevier.com/locate/infsof

Problems,

causes

and

solutions

when

adopting

continuous

delivery—A

systematic

literature

review

Eero Laukkanen

a,∗

, Juha Itkonen

a

, Casper Lassenius

b,a a Department of Computer Science, PO Box 15400, FI-00076 AALTO, Finland

b Massachusetts Institute of Technology, Sloan School of Management, USA

a

r

t

i

c

l

e

i

n

f

o

Article history:

Received 2 December 2015 Revised 11 October 2016 Accepted 11 October 2016 Available online 12 October 2016 Keywords:

Continuous integration Continuous delivery Continuous deployment Systematic literature review

a

b

s

t

r

a

c

t

Context:Continuousdeliveryisasoftwaredevelopmentdisciplineinwhichsoftwareisalwayskept re-leasable.Theliteraturecontainsinstructionsonhowtoadoptcontinuousdelivery,buttheadoptionhas beenchallenginginpractice.

Objective:Inthisstudy,asystematicliteraturereviewisconductedtosurveythefacedproblemswhen adoptingcontinuousdelivery.Inaddition,weidentifycausesforandsolutionstotheproblems.

Method:Bysearchingfivemajorbibliographicdatabases,weidentified293articlesrelatedtocontinuous delivery.We selected30 ofthemforfurther analysisbasedonthemcontainingempiricalevidenceof adoptionofcontinuousdelivery,andfocusonpracticeinsteadofonlytooling.Weanalyzedtheselected articlesqualitativelyandextractedproblems,causesandsolutions.Theproblemsandsolutionswere the-maticallysynthesizedintoseventhemes:builddesign,systemdesign,integration,testing,release,human andorganizationalandresource.

Results:Weidentifiedatotalof40problems,28causalrelationshipsand29solutionsrelatedtoadoption ofcontinuousdelivery.Testingandintegrationproblemswerereportedmostoften,whilethemostcritical reportedproblemswererelatedtotestingandsystemdesign.Causally,systemdesign andtestingwere mostconnectedtootherthemes.Solutionsinthesystemdesign,resourceandhumanandorganizational themeshadthemostsignificantimpactontheotherthemes.Thesystemdesignandbuilddesignthemes hadtheleastreportedsolutions.

Conclusions:Whenadoptingcontinuousdelivery,problemsrelatedtosystemdesignarecommon, crit-icaland littlestudied.Thefoundproblems,causesand solutionscan beusedtosolveproblemswhen adoptingcontinuousdeliveryinpractice.

© 2016TheAuthors.PublishedbyElsevierB.V. ThisisanopenaccessarticleundertheCCBYlicense(http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Continuous delivery (CD)isa softwaredevelopmentdiscipline in whichthesoftware iskept insuch a state thatin principle,it could be released to its users at anytime [1,2]. Fowler [2] pro-poses that practicing CD reducesdeployment risk, allows believ-ableprogresstrackingandenablesfastuserfeedback.

WhileinstructionsonhowtoadoptCDhaveexistedfora cou-pleofyears [1],theindustry hasnot stilladoptedthe practiceat large[3],andthosewhohavetakensteps towardsCDhavefound itchallenging[4,5].Thisraises thequestionwhethertheindustry is lagging behindthebest practices orwhether the

implementa-∗ Corresponding author.

E-mail addresses: eero.laukkanen@aalto.fi(E. Laukkanen), juha.itkonen@aalto.fi (J. Itkonen), casper.lassenius@aalto.fi(C. Lassenius).

tion difficulty is higherand the payoff lowerthan speculated by theproponentsofCD.Inthisliteraturestudy,welookatproblems inadoptingCD,their causesandrelatedsolutions.Wedo not at-tempttounderstand thecost-benefitratioofCDimplementation, since currently there are not enough primary studies about the cost-benefitratioin ordertocreatea meaningfulliterature study onthesubject.

Inthis study,we attempt to createa synthesizedview of the literatureconsideringCDadoptionproblems,causesandsolutions. Ourmissionisnotjusttoidentifydifferentproblemconcepts,but alsotounderstandtheirrelationshipsandrootcauses,whichis re-flectedinthethreeresearchquestionsofthestudy:

RQ1. Whatcontinuousdelivery adoptionproblemshavebeen re-portedinmajorbibliographicdatabases?

http://dx.doi.org/10.1016/j.infsof.2016.10.001

(2)

RQ2.What causesforthecontinuousdeliveryadoption problems havebeenreportedinmajorbibliographicdatabases? RQ3.What solutions forthe continuousdelivery adoption

prob-lemshavebeenreportedinmajorbibliographicdatabases? Summarizing the literature is valuable for practitioners for whom the large amount of academic literature from various sourcesis not easily accessible.In addition,for research commu-nityourattemptprovidesagoodstartingpointforfutureresearch topics.

We believe this study provides an important contribution for thefield,becausewhileCDhasbeensuccessfullyadoptedinsome pioneeringcompanies, it is not known how generally applicable it is. For example, can testing and deployment be automated in all contexts or is it not feasible in some contexts? Furthermore, inmanycontexts whereCDhasnotbeenapplied,thereare signs ofproblemsthat CD isproposed to solve.Organizations who are developing software in contexts other than typical CD adoption would be eager to know what constraints CD has and would it be possible to adopt it in their context. We aim to address the decision-makingchallenge ofwhetherto adoptCDornot andto whatextent.

TounderstandthecurrentknowledgeabouttheproblemsofCD adoption,weconductedasystematicliteraturereview(SLR). Previ-ousliteraturestudieshavefocusedoncharacteristics[6,7]benefits [7,8],technicalimplementations [9], enablers[6,10]andproblems [6,7] of CD ora relatedpractice. Thus, there has been only two literaturestudiesthatinvestigatedproblems,andthey studiedthe problemsof thepractice itself, not adoption ofthe practice. Fur-thermore,one ofthe studies wasa mapping studyinstead ofan SLR,andanotheronefocusedonrapidreleases,thestrategyto re-leasewithtight interval,instead ofCD,thepracticetokeep soft-warereleasable. Therefore,toourknowledge, thisisthefirst SLR whichstudiesCDadoptionproblems,theircausesandsolutions.

This paper is structured asfollows. First, we give background informationaboutCDandinvestigatetheearlierSLRsinSection2. Next,weintroduce ourresearch goal andquestions,describe our methodology, andasses thequality ofthe study in Section 3. In Section 4, we introduce the results, which we further discuss in Section5.Finally,wepresentourconclusionsandideasforfuture workinSection6.

2. Backgroundandrelatedwork

Inthissection,wefirstdefinetheconcepts relatedtothe sub-jectofthestudy:continuousintegration(CI),continuousdelivery (CD)andcontinuousdeliveryadoption.CIisintroducedbeforeCD, becauseitisa predecessorandrequirementofCD.After defining theconcepts, weintroduceprevious literaturestudiesthatare re-latedtothesubject.

2.1.Continuousintegration

According toFowler [11], continuousintegration(CI)isa soft-ware development practice where software is integrated contin-uously duringdevelopment.In contrast,some projects have inte-gratedtheworkofindividualdevelopersorteamsonlyafter mul-tipledays, weeksoreven months ofdevelopment. When the in-tegrationisdelayed,thepossibilityandseverenessofconflicts be-tweendifferentlinesofworkincrease.

Good practice of CI requires all developers to integrate their worktoa commoncode repositoryona dailybasis[11].In addi-tion,aftereachintegration,thesystemshouldbebuiltandtested, toensurethat thesystemisstillfunctionalaftereachchangeand thatitissafeforotherstobuildontopofthenewchanges. Typ-ically, a CI server is used for tracking new changes from a ver-sioncontrolsystemandbuildingandtestingthesystemaftereach

Fig. 1. The conceptual difference between CI and CD in this study.

change[12].Ifthebuildortestsfailduetoachange,thedeveloper whohasmadethechangeisnotifiedaboutthefailure andeither thecauseofthefailureshouldbefixedorthechangereverted,in ordertokeepthesoftwarefunctional.

There exist only a few scientific studies that have investi-gated howwidelyCIispracticed intheindustry.StåhlandBosch [3]studiedtheCIpracticesinfiveSwedishsoftwareorganizations andfound that the practices were not really continuous: “activi-tiesarecarried outmuch moreinfrequentlythansome observers mightconsider to qualifyasbeingcontinuous”. In addition, Deb-bicheetal.[4]studiedalargeorganizationadoptingCIandfound multiple challenges. Based on these two studies, it seems that adoptingCIhasproventobedifficult,butwhyitisdifficultisnot knownatthemoment.

2.2. Continuousdelivery

Continuousdelivery (CD)is asoftware developmentdiscipline inwhich softwarecan bereleased toproduction atanytime [2]. The discipline is achieved through optimization, automatization andutilizationofthebuild,deploy,testandreleaseprocess[1].

CD extendsCI by continuously testingthat the softwareis of productionqualityandbyrequiringthatthereleaseprocessis au-tomated.The differencebetweenCIandCDisfurther highlighted inFig.1whereitisshownthat whileCIconsistsofonlya single stage,CDconsistsofmultiplestagesthatverifywhetherthe soft-wareisinreleasablecondition.However,oneshouldbeawarethat thetermsareuseddifferentlyoutsidethisstudy.Forexample,Eck etal.[10]usethetermCIwhiletheirdefinitionforitissimilarto thedefinitionofCDinthisstudy.

TheproposedbenefitsofCDareincreasedvisibility,faster feed-backandempowermentofstakeholders[1].However,whentrying toadoptCD,organizationshavefacednumerouschallenges[5].In thisstudy,weattempttounderstandthesechallengesindepth.

Continuous deployment is an extension to CD in which each change is built, testedand deployed to production automatically [13].Thus, in contrastto CD,there are no manual steps or deci-sionsbetweenadevelopercommitandaproductiondeployment. Themotivationforautomatingthedeploymenttoproductionisto gainfasterfeedbackfromproductionusetofixdefectsthatwould be otherwise tooexpensive to detect [13]. One should also note that continuous delivery anddeployment are used as synonyms outsidethisstudy.Forexample,Rodriguezetal.[7]usetheterm continuousdeployment whilethey refertothepracticeof contin-uousdelivery,sincetheydonotrequireautomaticdeploymentsto production.Whileitwouldbeinterestingtostudycontinuous de-ployment,we didnot findanyreportsof continuousdeployment implementationsfromthescientific literature. Therefore,wehave chosen to use continuous delivery as a primary concept in this study.

(3)

Fig. 2. Distinction between problems CD solves, problems preventing CD adoption, problems of CD and benefits of CD.

Table 1

Comparison of previous literature studies and this study.

Study Focus Type Findings

Ståhl and Bosch [8] CI SLR Benefits

Ståhl and Bosch [9] CI SLR Variation points where implementations differ

Eck et al. [10] CD SLR Adoption actions

Mäntylä et al. [6] Rapid releases Semi-systematic literature review Benefits, adoption actions and problems Rodriguez et al. [7] CD Systematic mapping study Characteristics, benefits and problems Adams and McIntosh [15] Release engineering Research agenda Characteristics

This study CD SLR Problems preventing adoption, their causes and solutions

2.3. Continuousdeliveryadoption

WedefineCDadoptionasthesetofactionsasoftware develop-mentorganizationhastoperforminordertoadoptCD(seeFig.2). Theactualsetofactionsdependsonthestartingpointofthe orga-nizationandthustheadoptioncanbedifferentfromcasetocase. Whilesome sourcesprovidespecificactionsthat needtobedone duringtheadoption[1,6,10],orsimplisticsequentialmodelsofthe adoption [14],thesemodelsare prescriptiveinnature andinreal lifetheadoptionmorelikelyrequiresiterationandcase-specific ac-tions, as even CI implementations differ among the practitioners [3].

To be ableto discuss previous literature andthe focus ofour study, we haveconstructed thefollowing concepts relatedto the CDadoption:

ProblemsCDsolvesare problemsthatarenot directlyrelated to CD,but CD is promised to solve them. These include,e.g., slowfeedbackofchangesanderror-pronereleases.

CDadoptionproblemsareproblemsthataredirectly prevent-ingCDadoptionandadditionalactionsneedtobedonetosolve them.Theseproblemsarethefocusofthisstudy.

Adoptionactionsneedto beperformedbyan organizationto adoptCD.

ProblemsofCDareproblemsthatemergewhenCDisadopted. Benefits of CDare positive effects that are achieved afterCD

hasbeenadopted.

The definitive source for CD [1] describes the problems CD solves, benefits of CD and suggests needed adoption actions. It brieflymentionsproblemspreventingCDadoption,butthereisno detaileddiscussion ofthem. Next,we introducerelatedacademic literaturestudies andshowthat neithertheyfocusonthesubject ofthisstudy,theproblemspreventingCDadoption.

2.4. Previousrelatedliteraturestudies

Toourknowledge,therehavebeensixliteraturestudiesrelated tothesubjectofthisstudy(Table1).Thesestudieshavereviewed the characteristics [7,15], benefits[6–8],variation in implementa-tions [9], adoption actions [6,10],and problems [6,7] ofCD or a related practice such as CI, rapid releases or release engineering

(Table 2). Rapid releases is a practice of releasing software fre-quently,sothatthetimebetweenreleasesishours,daysorweeks insteadofmonths[6].Releaseengineeringmeans thewhole pro-cessoftakingdevelopercodechangestoarelease[15].WeseeCD asa releaseengineeringpracticeandasan enabler for hourlyor dailyreleases,butwe donotseeitnecessaryifthetimebetween releasesismeasuredinweeks.Inaddition,practicingCDdoesnot implyreleasingrapidly,sinceonecankeepsoftwarereleasableall thetimebutstillperformtheactualreleasesmoreseldom.

The identified characteristics of CD are fast and frequent re-lease, flexible product design and architecture, continuous test-ing and quality assurance, automation, configuration manage-ment, customer involvement, continuousand rapid experimenta-tion,post-deploymentactivities,agileandleanandorganizational factors [7]. To avoid stretching the concept of CD and keep our studyfocused, weuse thedefinitionin thebook by Humbleand Farley[1],whichdoesnotincludeallthefactorsidentifiedby[7]. Forexample,we investigateCDasa developmentpracticewhere softwareiskeptreleasable,butnotnecessarilyreleasedfrequently. In addition, our definition does not necessarily imply tight cus-tomerinvolvementorrapidexperimentation.Instead,thefocusof ourstudyisonthecontinuoustestingandqualityassurance activi-ties,especiallyautomatedactivities.Ourviewismoreproperly de-scribedbythecharacteristicsofreleaseengineeringasdefinedby AdamsandMcIntosh:branchingandmerging,buildingandtesting, buildsystem,infrastructure-as-code,deploymentandrelease[15].

Proposed benefits of CI, CD or rapid releases are automated acceptance and unit tests [8], improved communication [8], in-creased productivity [6–8], increased project predictability as an effectoffindingproblems earlier[6–8],increasedcustomer satis-faction[6,7],shorter time-to-market [6,7],narrowertestingscope [6,7]andimprovedreleasereliabilityandquality[7].Theclaimed benefitsvarydependingonthefocusofthestudies.

Variation in implementations of CI can be in build duration, buildfrequency, buildtriggering,definitionoffailure andsuccess, fault duration, fault handling, integration frequency, integration on broken builds, integration serialization and batching, integra-tion target,modularization, pre-integrationprocedure, scope, sta-tus communication, test separation andtesting of new function-ality [9]. However, the variations listed here are limited only to theCIsystemsthatautomatetheactivitiesofbuilding,testingand

(4)

Table 2

Summary of the results from previous literature studies.

Results Results

Benefits Automated acceptance and unit tests [8] , improved communication [8] , increased productivity [6–8] , increased project predictability as an effect of finding problems earlier [6–8] , increased customer satisfaction [6,7] , shorter time-to-market [6,7] , narrower testing scope [6,7] , improved release reliability and quality [7] .

Variation points Build duration, build frequency, build triggering, definition of failure and success, fault duration, fault handling, integration frequency, integration on broken builds, integration serialization and batching, integration target, modularization, pre-integration procedure, scope, status communication, test separation, testing of new functionality [9] .

Adoption actions Devising an assimilation path, overcoming initial learning phase, dealing with test failures right away, introducing CD for complex systems, institutionalizing CD, clarifying division of labor, CD and distributed development, mastering test-driven development, providing CD with project start, CD assimilation metrics, devising a branching strategy, decreasing test result latency, fostering customer involvement in testing, extending CD beyond source code [10] . Parallel development of several releases, deployment of agile practices, automated testing, the involvement of product managers and pro-active customers, efficient build, test and release infrastructure [6] . Problems Increased technical debt [6] , lower reliability and test coverage [6] , lower customer satisfaction [6,7] , time pressure [6] , transforming

towards CD [7] , increased QA effort [7] , applying CD in the embedded domain [7] .

Characteristics Fast and frequent release, flexible product design and architecture, continuous testing and quality assurance, automation, configuration management, customer involvement, continuous and rapid experimentation, post-deployment activities, agile and lean, organizational factors [7] . Branching and merging, building and testing, build system, infrastructure-as-code, deployment and release [15] .

deployingthesoftware.Inaddition, thereshould be variationsin thepractices how the systems are used,but thesevariationsare notstudied inany literaturestudy. Ourfocusis not tostudy the variations, butwe see that becausethere is variation in the im-plementations, the problemsemerging during theadoption must vary toobetween cases. Thus, we cannot assume that the prob-lemsareuniversally generalizable,butonemustinvestigatethem case-specifically.

CDadoptionactionsaredevisinganassimilationpath, overcom-inginitiallearningphase,dealingwithtestfailuresrightaway, in-troducingCDforcomplexsystems,institutionalizingCD,clarifying divisionoflabor,CDanddistributeddevelopment,mastering test-drivendevelopment,providingCDwithprojectstart,CD assimila-tionmetrics, devising a branching strategy, decreasing test result latency, fosteringcustomer involvement intesting andextending CDbeyond source code[10]. Rapidreleases adoption actions are paralleldevelopmentofseveralreleases,deploymentofagile prac-tices,automatedtesting,theinvolvementofproductmanagersand pro-active customers and efficient build, test and release infras-tructure[6]. Theintention inthisstudyisto gostep furtherand investigatewhatkindofproblemsarisewhentheadoptionactions areattemptedtobeperformed.

ProposedproblemsofCDorrapidreleasesareincreased techni-caldebt[6],lowerreliabilityandtestcoverage[6],lowercustomer satisfaction[6,7], time pressure[6],transforming towards CD[7], increasedQAeffort[7]andapplyingCDintheembeddeddomain [7].Interestingly,previousliteraturestudieshavefoundthatthere isthebenefitofimprovedreliabilityandquality,butalsothe prob-lemoftechnicaldebt,lowerreliabilityandtestcoverage.Similarly, theyhaveidentifiedthebenefitofautomatedacceptanceandunit testsandnarrowertestingscope,butalsotheproblemofincreased QA effort. We do not believe that the differences are caused by the differentfocus of the literature studies. Instead, we see that since the benefits and problems seem to contradict each other, theymustbecasespecificandnotgeneralizable.Inthisstudy,we donotinvestigatetheproblemsoftheCDpracticeitself,butwe fo-cusontheproblemsthatemergewhenCDisadopted.Oneshould notthinktheseproblemsasgeneralcausalnecessities,butinstead instancesof problemsthat maybe presentin other adoptions or not.

As asummary,previous literaturestudieshaveidentifiedwhat CD[7]andreleaseengineering[15]are,verifiedthebenefitsofCD [7],CI[8]andrapidreleases[6],discovereddifferencesinthe im-plementationsofCI[9],understoodwhat isrequiredtoadoptCD [10]and rapid releases [6] andidentified problems of practicing CD[7]andrapidreleases [6](seeTable 2). However, noneofthe previous studies hasinvestigated whythe adoption efforts of CD

are failing in the industry.One ofthe studies acknowledged the problemwiththe adoption [7], butdidnot investigateit further, asit wasa systematicmappingstudy. Atthe same time thereis increasingevidencethatmanyorganizationshavenotadoptedCD yet[3].To addressthisgap intheprevious literature studies, we haveexecutedthisstudy.

3. Methodology

In this section, we present our research goal and questions, searchstrategy,filteringstrategy,dataextractionandsynthesisand studyevaluationmethods.Inaddition,wepresenttheselected ar-ticlesusedasdatasourcesanddiscusstheirqualityassessment.

3.1. Researchgoalandquestions

Thegoalofthispaperistoinvestigatewhatisreportedinthe majorbibliographicdatabasesabouttheproblemsthat preventor hinderCDadoptionandhowtheproblemscanbesolved.Previous software engineering research indicates that understanding com-plexproblemsrequiresidentifyingunderlyingcausesandtheir re-lationships[16].Thus,inordertostudyCDadoptionproblems,we need to study their causes too. Thisis reflected in the three re-searchquestionsofthispaper:

RQ1. Whatcontinuousdeliveryadoptionproblemshavebeen re-portedinmajorbibliographicdatabases?

RQ2. Whatcausesforthecontinuousdeliveryadoptionproblems havebeenreportedinmajorbibliographicdatabases? RQ3. What solutions for the continuousdelivery adoption

prob-lemshavebeenreportedinmajorbibliographicdatabases? Weanswertheresearchquestionsusingasystematicliterature reviewofempiricalstudiesofadoptionandpracticeofCDin real-worldsoftware development(see Section3.3forthedefinitionof real-worldsoftwaredevelopment).

Welimitourselvestomajorbibliographicdatabases,becauseit allows executing systematic searches and provides material that, in general, has more in-depth explanations and neutral tone. The bibliographic databases we used are listed in Table 3. The databases include not only research articles, but also, e.g., some books written by practitioners and experience reports. However, thedatabases donot containsome ofthematerial thatmight be relevantforthesubjectofstudy,e.g.,technicalreports,blogposts andvideo presentations.Whilethe excludedmaterial mighthave provided additional information, we believe that limiting to the majorbibliographicdatabasesprovides agoodcontributionon its

(5)

Fig. 3. An overview of the research process used in this study.

Table 3

Search results for each database in July 2014 and in February 2015. Search was executed for all years in July 2014, but only for years 2014–2015 in February 2015.

Database July 2014 February 2015 Total

Scopus 197 35 232

IEEE Explore 98 30 128

ACM Digital Library 139 30 169 ISI Web of Science 79 11 90

ScienceDirect 13 11 24

Total 526 117 643

own andthisworkcan be extendedinfuture.This limitation in-creases the reliability and validity of the material, but decreases theamountofreportsbypractitioners[17].

Welimit ourinvestigationtoproblemsthatarise when adopt-ingorpracticingCD.Wethusrefrainfromcollectingproblemsthat CD is meant to solve—an interesting study on its own. Further-more, we do not limit ourselvesto a strict definitionof CD.The reasonsarethat CDisafairlynewtopicandtheredoesnotexist muchliteraturementioningCDinthecontextofourstudy.Sinceit isclaimedthatCIisaprerequisiteforCD[1],weincludeitinour study.Similarly, continuousdeployment isclaimedtobea exten-sionofCD,andwe includeittoo.Wedothisbyincludingsearch termsforcontinuousintegrationandcontinuousdeployment.This way,wewillfindmaterialthatconsidersCDadoptionpath begin-ningfromCIadoptionandendingincontinuousdeployment.

We followed Kitchenham’s guidelines for conducting system-aticliteraturereviews[18],withtwoexceptions.First,wedecided toinclude multiplestudies ofthe sameorganization andproject, inorder to use all available information foreach identified case. We clearly identify such studies as depicting the same case in our analysis, results anddiscussion. The unit of analysis used is acase, nota publication.Second, insteadofusingdataextraction forms,we extracteddatabyqualitatively codingtheselected arti-cles,asmostof thepapers containedonly qualitative statements andlittlenumericaldata.Thecodingisdescribedinmoredetailin Section3.4.2.

The overall research process consisted of three steps: search strategy, filtering strategy and data extractionand synthesis (see Fig.3).Next,wewillintroducethesteps.

3.2.Searchstrategy

The search string used was “

(‘‘continuous

integration’’

OR

‘‘continuous

delivery’’

OR

‘‘continuous

deployment’’)

AND

software

”. The first parts ofthe stringwere thesubject ofthe study. The “software” stringwasincluded toexcludestudies thatrelatedtoother fields thansoftwareengineering;thesameapproachwasusedinan ear-lierSLR[9].The search stringwasappliedto titles, abstractsand keywords.Thesearchwasexecuted firstinJuly2014andagainin February2015.Thesecondsearchwasexecutedbecausetherehad beenrecentnewpublicationsinthearea.Bothsearchesprovided atotalof643results(Table3).Afterthefilteringstrategywas

(6)

ap-pliedandanarticlewasselectedforinclusion,weusedbackward snowballing[19],whichdidnot resultintheidentificationofany additionalstudies.

3.3.Filteringstrategy

We used two guiding principles when forming the filtering strategy:

Empirical:theincludedarticlesshouldcontaindatafrom real-lifesoftwaredevelopment.

CD practice: the included articles should contain data from continuous delivery as a practice. Some articles just describe toolchains, whichusually isseparated from thecontext of its use.

Withreal-lifesoftwaredevelopment,we meananactivity pro-ducing software meant to be used in real-life. For example, we includedarticles discussing the development ofindustrial, scien-tificandopensourcesoftwaresystems.Wealsoclassified develop-menthappeningin the contextof engineeringeducation as real-life, ifthe produced software was seen to be usable outside the coursecontext.However,softwaredevelopmentsimulationsor ex-periments were excluded to improvethe external validity ofthe evidence.For example,[20] wasexcluded, because it only simu-latessoftwaredevelopment.

First,we removedduplicateandtotallyunrelated articlesfrom theresults,whichleftuswith293articles(Fig.3).Next,we stud-iedtheabstractsoftheremainingpapers,andappliedthe follow-inginclusionandexclusioncriteria:

InclusionCriterion:areal-lifecaseisintroducedorstudied. ExclusionCriterion1:thepracticeoradoptionofcontinuous

in-tegration,deliveryordeploymentisnotstudied.

Exclusion Criterion 2: themain focusofthearticle isto evalu-ateanewtechnologyortoolinareal-lifecase.Thus,thearticle doesnotprovideinformationaboutthecaseitselforCD adop-tion.

ExclusionCriterion3:thetextisnotavailableinEnglish. Atotalof107articlespassedthecriteria.

Next,weacquiredfull-textversionsofthearticles.Wedidnot havedirectaccesstoonearticle,butanextension ofitwasfound tobeenpublishedasaseparatearticle[P11].Weappliedthe exclu-sioncriteriadiscussed above tothe full-textdocuments, assome ofthepapersturned outnot toincludeanyreal-world caseeven iftheabstractshadledustothinkso.Forexample,thetermcase studycan indeedmeana studyofa real-worldcase,butinsome papersitreferredtoprojectsnot usedinreal-life.Inaddition,we appliedthefollowingexclusioncriteriatothefull-texts:

ExclusionCriterion4:thearticleonlyrepeatsknownCDpractice definitions,butdoesnotdescribetheirimplementation. Exclusion Criterion 5: thearticleonlydescribesa technical

im-plementationofaCDsystem,notpractice.

Out of the 107 articles, 30 passed our exclusion criteria and wereincludedinthedataanalysis.

3.4.Dataextractionandsynthesis

We extracteddataandcodeditusingthreemethods.First,we used qualitative coding to ground the analysis.Second, we con-ducted contextual categorization and analysis to understand the contextualvarianceofthereportedproblems.Third,weevaluated thecriticality ofproblemstoprioritize thefound problems.Next, thesethreemethodsaredescribedseparatelyindepth.

3.4.1. Unitofanalysis

Inthispaper,theunit ofanalysisisan individualcaseinstead of an article, as severalpapers included multiple cases.A single casecould also be describedin multiple articles. The 30 articles reviewedhere discusseda total of35 cases.When referring to a case, weusecapitalC,e.g. [C1],andwhenreferring toan article, weusecapitalP,e.g.[P1].Ifanarticlecontainedmultiplecases,we usethe samecasenumberforall ofthembutdifferentiate them withasmallletter,e.g. [C9a]and[C9b]. Thereferred articlesand casesarelistedinaseparatebibliographyinAppendixA.

3.4.2. Qualitativecoding

We coded the data using qualitative coding, as most of the studieswere qualitativereports.We extractedthedataby follow-ing the coding procedures of grounded theory [21]. Coding was performedusingthefollowingsteps:conceptualcoding,axial cod-ingandselectivecoding.AllcodingworkwasdoneusingATLAS.ti [22]software.

During conceptual coding, articles were first examined for in-stances of problems that had emerged when adopting or doing CD.Wedidnot haveanypredefinedlistofproblems,so the pre-cisemethodwasopencoding.Identifyinginstancesofproblemsis highly interpretivework andsimply including problems that are namedexplicitlyproblemsorwithsynonyms,e.g. challenges,was notconsideredinclusiveenough.Forexample,thefollowingquote wascodedwiththecodes“problem” and“Ambiguoustestresult”, evenifitwasnotexplicitlymentionedtobeaproblem:

Sinceitisimpossibletopredictthereasonforabuildfailureahead oftime,werequiredextensiveloggingontheservertoallowusto determinethecauseofeachfailure.Thisleftuswithmegabytesof serverlogfileswitheachbuild.Thecauseofeachfailurehadtobe investigatedbytrollingthroughtheselargelogfiles.

–CaseC4

For each problem, we examined whether any solutions or causesforthatproblemwerementioned.Ifso,wecodedthe con-cepts as solutions and causes, respectively. The following quote wascodedwiththecodes“problem”,“largecommits”,“causefor” and“network latencies”. Thiscan betranslatedinto thesentence “networklatenciescausedtheproblemoflargecommits”.

Onaverage,developerscheckedinonceaday.Offshoredevelopers hadtodealwithnetworklatenciesandcheckedinlessfrequently; batchingupworkintosinglechangesets.

–CaseC13

Similarly,thefollowingquotewascodedwiththecodes “prob-lem”, “time-consuming testing”, “solution”, and “test segmenta-tion”. Thiscan be read as“test segmentation solves theproblem oftime-consumingtesting”.

We ended uprunning several differentCI builds largelybecause runningeverythinginonebuildbecameprohibitivelyslowandwe wantedthecheck-inbuildtorunquickly.

–CaseC13

During axial coding, we made connections between the codes formed during conceptual coding. We connected each solution codetoeveryproblemcodethatitwasmentionedtosolve. Simi-larly,weconnectedeachproblemcodetoeveryproblemcodethat it was mentioned causing. The reported causes are presented in Section4.2.Wedidnotseparateproblemandcausecodes,because oftencausescouldbeseenasproblemstoo.Ontheotherhand,we dividedthecodesstrictlytobeeitherproblemsorsolutions,even ifsomesolutions wereconsidered problematicinthearticles.For example,the solution “practicing smallcommits” can be difficult ifthe“networklatencies” problemispresent.Buttocodethis,we usedtheproblemcode“largecommits” intherelationto“network

(7)

Table 4

Case categories and categorization criteria.

Category Criteria Category Criteria

Publication time Number of developers

Pre 2010 year ≤ 2010 Small size < 20 Post 2010 year > 2010 Medium 20 ≤ size ≤ 100

Large size > 100

CD implementation maturity Commerciality

CI CI practice. Non-commercial E.g., open source or scientific development. CD CD or advanced CI practice. Commercial Commercial software development.

latencies”. Thecode“systemmodularization” was anexception to thisrule,beingcategorizedasboth aproblemandasolution, be-causesystemmodularizationinitselfcancausesomeproblemsbut alsosolveotherproblems.

Duringselectivecoding,onlythealreadyformedcodeswere ap-pliedto thearticles.Thistime, eveninstances,that discussedthe problem code but did not consider it as a faced problem, were coded to groundthe codes better andfindvariance in the prob-lemconcept.Alsosomeproblemconcepts werecombinedtoraise the abstractionlevel of coding.For example,the following quote wascodedwith“effort” duringselectivecoding:

Continuallymonitoringandnursingthesebuildshasasevere im-pact on velocity early on in theprocess, butalso saves time by identifyingbugsthatwouldnormallynotbeidentifieduntilalater pointintime.

–CaseC4

Inaddition,we employedthecode“preventedproblem” when a problem concept was mentioned to having been solved before becomingaproblem. Forexample,thefollowingquotewascoded with the codes “parallelization”, “prevented problem” and “time-consumingtesting”:

Furthermore, the testing system separates time consuming high level tests by detaching the complete automated test run to be done in parallel on different servers. So whenever a developer checks in anew version ofthesoftwarethecomplete automated setoftestsisrun.

–CaseC1

Finally, we employed the code“claimed solution” when some solutionwasclaimedtosolveaproblembutthesolutionwasnot implemented in practice. For example, the following quote was codedwiththecodes“problem”,“ambiguoustestresult”,“claimed solution” and“testadaptation”:

Therefore,ifaproblem isdetected,thereisaconsiderableamount of time invested following the software dependencies until find-ing where the problem is located. The separation of those tests intolower leveltaskswouldbean importantadvantagefor trou-bleshootingproblems,whileguaranteeingthathighleveltestswill workcorrectlyifthelowerlevelonesweresuccessful.

–CaseC15

3.4.3. Thematicsynthesis

During thematic synthesis [23], all the problem and solution codesweresynthesizedintothemes.Asastartingpointofthemes, we took the differentactivities of software development: design, integration,testingandrelease.Thedecisiontousethesethemesas a startingpoint wasdone aftertheprobleminstanceswere iden-tified andcoded. Thus,thethemeswere notdecided beforehand; theyweregroundedintheidentifiedproblemcodes.

If a problemoccurred duringor wascaused by an activity, it wasincludedinthetheme.Duringthefirstroundofsynthesis,we noticed that other themeswere requiredas well, andadded the themes ofhuman andorganizational andresource.Finally,the

de-signtheme wassplitintobuilddesignandsystem design, to sepa-ratethesedistinctconcepts.

3.4.4. Contextualcategorizationandanalysis

Wecategorizedeach reportedcaseaccordingtofourvariables: publicationtime,numberofdevelopers,CDimplementation matu-rityandcommerciality,asshowninTable4.Thecriteriawerenot constructedbeforehand,butinsteadafterthequalitativeanalysisof thecases,lettingthecategoriesinductivelyemergefromthedata. Whendataforthecategorizationwasnotpresentedinthearticle, the categorization wasinterpreted based on the case description bythefirstauthor.

TheCDimplementationmaturityofcaseswasdeterminedwith twosteps.First,ifacasedescribedCDadoption,its maturitywas determinedtobeCD,andifacasedescribedCIadoption,its ma-turitywasdeterminedto beCI.Next,advancedCIadoption cases that described continuous system-level quality assurance proce-dures were upgraded to CD maturity, because those cases had moresimilarityto CDcasesthan toCIcases.The upgraded cases wereC1,C4andC8.

After the categorization, we compared the problems reported betweendifferentcategories.Thecomparisonresultsarepresented inSection4.2.

3.4.5. Evaluationofcriticality

Weselected the mostcriticalproblemsforeach casein order to see which problems had the largest impact hindering the CD adoption.Thenumberofthemostcriticalproblemswasnot con-strainedanditvaried fromzerototwo problemsper case.There were two criteriaforchoosing themost criticalproblems.Either, themostsevereproblemsthatpreventedadoptingCD,or,themost criticalenablersthatallowedadoptingCD.

Enablingfactorswerecollectedbecause,insomecases,no criti-calproblemswerementioned,butsomecriticalenablerswere em-phasized. However, when the criticality assessments by different authorswere compared,itturnedoutthat theselectionofcritical enablerswas moresubjective than the selection of critical prob-lems. Thus, onlyone critical enabler wasagreed upon by all au-thors(unsuitablearchitectureincaseC8).

The most critical problems were extracted by three different methods:

Explicit: Ifthe articleasa wholeemphasizeda problem,orif it wasmentioned explicitly in thearticle that a problemwas themostcritical,thenthatproblemwasselectedasanexplicit criticalproblem.E.g,incaseC5,wheremultipleproblemswere given,onewasemphasizedasthemostcritical:

A unique challenge forAtlassian has been managing the on-line suite of products (i.e. the OnDemand products) that are deeply integrated withone another...Due to the complexity of cross-product dependencies, several interviewees believed this wasthemainchallengeforthecompanywhenadoptingCD.

(8)

Implicit:Theauthorsinterpretedwhichproblems,ifany,could be seen as themost critical.These interpretations were com-paredbetweentheauthorstomitigatebias,detaileddescription oftheprocessisgiveninSection3.5.

Causal: the causes given in the articles were taken into ac-count, by considering the more primary causesas more criti-cal.Forexample,incaseC3a,thecomplexbuildproblemcould be seenascritical,butitwasactually causedbytheinflexible buildproblem.

3.5.Validityofthereview

The search, filtering, data extraction and synthesis were first performedbythefirstauthor,causingsingleresearcherbias,which hadto be mitigated.The searchbias wasmitigatedby construct-ingthereviewprotocolaccordingtotheguidelinesbyKitchenham [18].Thisreviewprotocolwasreviewedbythetwootherauthors.

Wemitigatedthepaperselectionbiasbyhavingthetwoother authorsmake independent inclusion/exclusion decisionson inde-pendentrandomsamplesof200articleseachofthetotal293.The randomsamplingwasdonetolowertheeffortrequiredfor assess-ing the validity.This way, each paper was ratedby atleast two authors,and104ofthepaperswereratedbyallthree.

We measured inter-rater agreementusing Cohen’skappa [24], whichwas0.5–0.6,representingmoderateagreement[25].All dis-agreements(63papers)wereexamined,discussedandsolvedina meeting involvingall authors. All the disagreements were solved through discussion, and no modifications were made to the cri-teria.In conclusion,thefiltering ofabstracts wasevaluated tobe sufficientlyreliable.Thedataextractionandsynthesisbiasesinthe laterpartsofthestudyweremitigated byhavingthesecond and thirdauthorsreviewtheresults.

Bias in thecriticality assessmentwas mitigatedby havingthe firsttwoauthorsassessallthecasesindependentlyofeachother. Fromthetotalof35cases,therewere 12fullagreements,10 par-tialagreements and13 disagreements, partial agreements mean-ing that some ofthe selected codes were the samefor thecase, butsome werenot.All thepartialagreementsanddisagreements wereassessed alsobythethird authorandthe resultswerethen discussedtogether byalltheauthorsuntilconsensuswasformed. Thesediscussionshadanimpactnotonlyontheselected critical-ityassessmentsbutalsoonthecodes,whichfurtherimprovedthe reliabilityofthestudy.

3.6.Selectedarticles

When extracting data from the 30 articles (see Appendix A), we notedthat some of thearticles did not contain any informa-tionaboutproblemsrelatedtoadoptingCD.Thosearticlesarestill includedin this paperfor examination. The articles that did not containany additionalproblems were P3, P17, P19,P21 andP26. ArticleP3containedproblems, butthey were duplicate toArticle P2whichstudiedthesamecase.

Allthecaseswerereportedduringtheyears2002–2014(Fig.4). Thisisnotparticularly surprising,since continuousintegrationas apracticegainedmostattentionafterpublicationofextreme pro-grammingin1999[26].However, overhalf ofthe caseswere re-portedafter2010, whichshowsanincreasing interestinthe sub-ject. Seven of the cases considered CD (C5, C7, C14, C25a, C25b, C25c,C26).TheothercasesfocusedonCI.

Notall thearticlescontainedquotationsaboutproblemswhen adoptingCIorCD.Forexample,papersP21andP26contained de-taileddescriptionsofCIpractice,butdidnotlistanyproblems.In contrast,twopapersthathadthemostquotationswereP6with38 quotationsandP4with13quotations.Thisisduetothefactthat thesetwoarticles specificallydescribedproblems andchallenges.

Fig. 4. Number of cases reported per year. The year of the case was the latest year given in the report or, if missing, the publication year.

Other articles tended to describe the CI practice implementation without considering any observed problems. Furthermore, major failuresarenotoftenreportedbecauseofpublicationbias[18].

3.7. Studyqualityassessment

Ofthe included 30 articles,we considered nine articles to be scientific (P6, P7, P8, P11, P19, P20, P21, P28, P30), because they containeddescriptionsoftheresearchmethodologyemployed.The other 21articleswere consideredasdescriptivereports. However, only two of the selected scientific articles directly studied the problemsorchallenges(P6,P30),andtherefore,wedecidednotto separatetheresultsbasedonwhetherthesourcewasscientificor not.Instead, weaimed atextractingtheobservations and experi-encespresentedinthepapersratherthanopinionsorconclusions. Observationsand experiences can be considered morevalid than opinions,becausethey reflect thereality ofthe observerdirectly. Inthecontextofqualitativeinterviews,Pattonwrites:

Questionsaboutwhatapersondoesorhasdoneaimtoelicit be-haviors, experiences,actionsand activities thatwould havebeen observablehadtheobserverbeenpresent.

–Patton[27,p.349–350] 4. Results

Intotal,weidentified40problems,28causalrelationshipsand 29 solutions. In the next subsections, we explain these in de-tail.The resultsare augmented withquotes fromthearticles. An overviewoftheresultscanbeobtainedbyreadingonlythe sum-mariesatthebeginningofeach subsectionandaricherpictureof thefindingsisprovidedthroughthedetailedquotes.

4.1. Problems

Problemswerethematicallysynthesizedintoseventhemes.Five ofthesethemes arerelatedto the differentactivities ofsoftware development:builddesign,systemdesign,integration,testing,and release. Two of the themes are not connected to any individual part:humanandorganizationalandresource.Theproblemsinthe themesarelistedinTable5.

Thenumberofcaseswhichdiscussedeachproblemtheme var-ied (Fig. 5). Most of the cases discussed integration and testing problems,both ofthem beingdiscussed inat least16 cases.The

(9)

Table 5

Problem themes and related problems. Cases where a problem was prevented with a solution are marked with a star ( ∗).

Theme Problems

Build design Complex build [C2, C3a], inflexible build [C3a]

System design System modularization [C2, C17e, C21, C25a, C25b], unsuitable architecture [C3a, C8, C22, C26, C25c], internal dependencies [C5], database schema changes [C5, C7( ∗), C25c( )]

Integration Large commits [C3a, C5, C7( ∗), C13, C14( ), C22], merge conflicts [C2( ), C3a, C5, C14, C20( ), C21, C24], broken build [C3a, C5, C6, C8, C9a, C14, C17a], work blockage [C3a, C5( ∗), C11, C17a, C27], long-running branches [C7( ), C14, C24, C27], broken development flow [C3a], slow integration approval [C17a]

Testing Ambiguous test result [C2, C4, C6, C15, C17a, C27], flaky tests [C4, C6, C8, C11, C14, C22, C27], time-consuming testing [C1( ∗), C2( ), C3a, C3b( ∗), C11, C13, C14, C27], hardware testing [C1( ), C8], multi-platform testing [C2, C9b, C21], UI testing [C8, C14], untestable code [C22, C25b( ∗)], problematic deployment [C25a, C25b( ), C25c], complex testing [C21, C25c]

Release [all in C5] customer data preservation, documentation, feature discovery, marketing, more deployed bugs, third party integration, users do not like updates, deployment downtime [C25c( ∗)]

Human and organizational

Lack of discipline [C1( ∗), C6, C10, C11, C12, C14], lack of motivation [C5, C6, C19, C27], lack of experience [C5, C12, C27], more pressure [C5, C27], changing roles [C5], team coordination [C5], organizational structure [C26]

Resource Effort [C2, C3a, C4, C19, C17e, C26], insufficient hardware resources [C4, C5], network latencies [C13]

Fig. 5. Number of cases per problem theme.

Table 6

Build design problems. Problem Description

Complex build Build system, process or scripts are complicated or complex. Inflexible build The build system cannot be modified flexibly.

second mostreported problemswere systemdesign, human and organizational and resource problems,all of them handled in at least8cases.Finally,build designandreleaseproblemswere dis-cussed intwocasesonly. Mostofthe releaseproblemswere dis-cussedonlyinonecase[C5].

4.1.1. Builddesignproblems

Thebuilddesignthemecoveredproblemsthatwerecausedby build design decisions. The codes in the theme are describedin Table6.Thecodesinthethemewereconnectedandwere concur-rentin CaseC3a. Fromthat case, we caninfer that theinflexible buildactuallycausedthecomplexityofthebuild:

The Bubble team was just one team in a larger programme of work, and each team used the same build infrastructure and

buildtargetsfortheirmodules.Astheapplicationdeveloped, spe-cialcases inevitablycrept intoindividualteambuildsmaking the scriptsevenmorecomplexandmoredifficulttochange.

–CaseC3a

Inanothercase,itwasnotedthatsystemmodularizationofthe applicationincreasedthecomplexityofthebuild:

Finally,themodularnatureofAMBERrequiresacomplicatedbuild processwherecorrectdependencyresolutioniscritical.Developers whoalter thebuild orderor addtheir code areoften unfamiliar withGNU Makefiles, especiallyatthe levelofcomplexity as AM-BER’s.

–CaseC2

Complexbuildsaredifficulttomodify[C2]andsignificanteffort canbeneededtomaintainthem[C3a].Complexbuildscancause buildstobebrokenmoreoften[C3a].

4.1.2. Systemdesignproblems

The systemdesign theme covered problems that were caused bysystemdesigndecisions.Thecodesinthethemearedescribed inTable7.

(10)

Table 7

System design problems.

Problem Description

System modularization The system consists of multiple units, e.g., modules or services. Unsuitable architecture System architecture limits continuous delivery.

Internal dependencies Dependencies between parts of the software system. Database schema changes Software changes require changes of database schema.

Table 8

Integration problems.

Problem Description

Large commits Commits containing large amount of changes.

Merge conflicts Merging changes together reveals conflicts between changes. Broken build Build stays broken for long time or breaks often.

Work blockage Completing work tasks is blocked or prevented by broken build or other integrations in a queue. Long-running branches Code is developed in branches that last for long time.

Broken development flow Developers get distracted and the flow [28] of development breaks. Slow integration approval Changes are approved slowly to the mainline.

Systemmodularization. Systemmodularization wasthe most dis-cussedsystem designproblem: itwas mentioned infive articles. Whilethe codes systemmodularization andunsuitablearchitecture

canbeseentooverlapeachother,systemmodularizationis intro-ducedasaseparate codebecauseofitsunique properties;itwas mentionedto be both a problemanda solution.Forexample, in theCaseC2,itwassaidtocausebuild complexity,butinanother quoteitwassaidtopreventmergeconflicts:

Mergeconflicts arerare,as eachdevelopertypicallystaysfocused on asubsetofthecode,andwillonlyeditothersubsectionsina minor wayto fixsmall errors,or withpermission and collabora-tion.

–CaseC2

In another case, system modularization was said to ensure testabilityofindependentunits:

They designedmethodsand classesas isolatedserviceswith very small responsibilities and well-defined interfaces. This allows the team to test individual units independently and to write (mocks of)theinputsandoutputsofeachinterface.Italsoallowsthemto testtheinterfacesinisolation withouthavingtointeractwiththe entiresystem.

–CaseC25a

System modularization wasnot a problemon its own in any instance.Ratheritseffectsweretheproblems:increased develop-menteffort [C17e],testingcomplexity [C21] andproblematic de-ployment[C25a].

Unsuitablearchitecture. AnarchitectureunsuitableforCDwasthe secondmostdiscussedsystemdesignproblembybeingmentioned infourarticles. Again,unsuitable architecturewasnot a problem on its own but its effects were the problems: time-consuming testing[C3a],developmenteffort[C8],testability [C22,C25c] and problematicdeployment[C25c].Casesmentionedthatarchitecture wasunsuitableifitwasmonolithic[C22,C26],coupled[C3a], con-sistedofmultiplebranchesofcode[C8]ortherewereunnecessary serviceencapsulation[C25c].

Other systemdesignproblemswerediscussedlightlyina cou-pleofcasesonlyandthusarenotincludedherefordeeper analy-sis.

4.1.3. Integrationproblems

The integration theme covered issues that arise when the sourcecodeis integratedintothe mainline.Theproblems inthis themearedescribedinTable8.

Fig. 6. Reported causal relationships between integration problems and related testing problems.

All the codes in this theme are connected through reported causal relationships,see Fig.6.Some have tried toavoid integra-tion problemswith branching,but long-livingbranchesare actu-allymentionedtomakeintegrationmoretroublesomeinthelong run:

...as the development of the main code base goes on, branches diverge further and furtherfrom the trunk, making the ultimate merge ofthe branch back into the trunkan increasingly painful andcomplicatedprocess.

–CaseC24

Another interesting characteristic in the integration theme is the viciouscyclebetweenthecodes broken build, work blockage andmergeconflicts.ThisisemphasizedinCaseC3a:

Oncethebuildbreaks,theteamexperiencesa kindof“work out-age”. And thelonger thebuild is broken, the more difficult it is forchangestobemergedtogetheroncecorrected.Quiteoften,this mergeeffortresultsinfurtherbuildbreaksandsoon.

–CaseC3a

Largecommits. Largecommitsareproblematic,becausethey con-tainmultiplechangesthatcanconflictwithchangesmadeby oth-ers:

(11)

These largerchange setsmeantthatthereweremorefile merges requiredbeforeacheck-incouldbecompleted,furtherlengthening thetimeneededtocommit.

–CaseC3a

However, there are multiple reasonswhy developers do large commits: time-consuming testing [C3a], large features [C7, C14], network latencies [C13] and a slow integration approval process [C17a].Thus,todealwithlargecommits,onemustconsiderthese underlyingreasonsbehind.

Merge conflicts. Merge conflicts happen when changes made by differentdevelopersconflict.Solvingsuch aconflictcantake sub-stantialeffort:

Wehavefeltthepainofmerginglongrunningbranchestoomany times. Merge conflicts cantake hours to resolve and itis alltoo easytoaccidentallybreakthecodebase.

–CaseC14

Merge conflictscan be causedby long-running branches[C14, C24] or large commits [C3a].Alsodelays in thecommitting pro-cess,suchaslengthycodereviews,cancausemergeconflicts[C21]. Insomesituations,mergeconflictscanberarer:ifdeveloperswork ondifferentpartsofsourcecode[C2]orifthereisasmallamount ofdevelopers[C20].

Broken build. Broken build was the mostmentioned problem by being discussed in tenarticles. Broken buildsbecome a problem whenitishardtokeepabuildfixedandittakesasignificanteffort tofixthebuild:

TheBubbleteambuildwouldoftenbreakandstaybrokenforsome time(onone occasionfora fullmonthiteration)so a significant proportionofdeveloperstimewasspentfixingthebuild.

–CaseC3a

Ifabroken buildisnotfixedimmediately,feedbackfromother problems will not be gained. In addition, problematic code can spreadtootherdeveloperworkstations,causingtrouble:

Somenotedthatifcodewascommittedaftera buildfailure, that newcodecouldconceivablybeproblematictoo,butthe confound-ingfactorswouldmakeitdifficulttodetermineexactlywhere the problemwas.Similarly,otherdevelopersmayinadvertentlyobtain copiesofthecodewithoutrealizingitisinabrokenstate.

–CaseC6

However,ifdevelopersareofteninterruptedtofixthebuild,it willbreaktheirdevelopmentflowandtaketimefromothertasks:

The Bubble teams otherproblem was being often interrupted to fixthebuild.Thistooksignificanttimeawayfromdevelopingnew functionality.

–CaseC6

Reasons for broken builds were complex build [C3a], merge conflicts[C3a]andflakytests[C14].

Workblockage. When thecompletion ofa developmenttask,e.g. integration,isdelayed,itcausesaworkblockage:

ItshouldalsobenotedthattheSCMMainlinenodeaffordsno par-allelism:ifthereisablockage,asintervieweestestifyisfrequently thecase,iteffectivelyhaltstheentireproject.

–CaseC17a

Thereasoncanbethatabrokenbuildmustbefixed[C3a,C11] orthatthereareotherintegrationsinthequeue[C27].Inaddition todelays,workblockagescancausefurthermergeconflicts[C3a].

Long-running branches. Long-running branches easily lead to merge conflicts, and developing code in branches slows the fre-quencyofintegration.However, somecasesstillinsistonworking withmultiplebranches:

Comparedtosmallerproducts,whereallcodeismergedtoasingle branch,thedevelopmentmakesuseofmanybrancheswhichadds tothecomplexity.

–CaseC27

Thereisnotmuchevidencewhethertherearesituationswhen multiplebranchesarenecessary.Those whohavechosen towork withasinglebranch havebeensuccessfulwithit[C7, C14]. Nev-ertheless,a workingCI environment can help with solving large merge conflicts by providing feedback during the merge process [C24].

Brokendevelopmentflow. WhentheCIsystemdoesnotwork prop-erlyandfailuresinthesystemdistractdevelopersfromwritingthe software,thedevelopmentflow[28] mightgetbroken[C3a]. Bro-kendevelopmentflowdecreasesdevelopmentproductivity.

Slowintegrationapproval. Thespeedofintegrationcanbe slowed downbytoostrictapprovalprocesses:

Each change...must be manually approved by a project manager before it is allowed ontothe SCM Mainline. The consequence of thisisaqueuingsituation, withanelaborateticketsystemhaving sprunguptosupportit,wherelowpriority“deliveries” canbeput onholdforextendedperiodsoftime.

–CaseC17a

Aslow integration approval process is detrimental to CD, be-causeitleadstolargercommitsanddelaysfeedback.Codereview processesshouldbedesignedsothat theydonotcauseextensive delaysduringintegration.

4.1.4. Testingproblems

The testingproblemtheme includes problems relatedto soft-waretesting.TheproblemsaredescribedinTable9.Themost dis-cussedtestingproblemswereambiguoustestresult,flakytestsand time-consumingtesting,allofthembeingmentionedinatleastsix cases.

Ambiguous test result. An ambiguous test result means that the testresultdoesnotguidethedevelopertoaction:

...severalof theautomated activitiesdonot yielda clear“passor fail” result.Instead,theygeneratelogs,whicharetheninspectedin orderto determine whether there wereany problems—something onlyasmallminorityofprojectmembersactuallydo,orareeven capableofdoing.

–CaseC17a

Reasonsforambiguity canbe that not every commitistested [C2], analyzing the test result takes large amount of time [C4, C17a], the test results are not communicated to the developers [C6],there arenolow-level teststhatwouldpinpoint wherethe problemisexactly[C15] andthatthe testsmayfail regardlessof thecodechanges [C27].Inaddition toincreasedeffortto investi-gatethetestresult, ambiguitymakesit alsodifficultto assign re-sponsibilitytofixissuesandthusleadstolackofdiscipline[C6].

Flakytests. Teststhatcannotbetrustedbecausetheyfailrandomly cancauseproblems:

Test cases aresometimes unstable(i.e. likely to break or not re-flectingthefunctionality to be tested)and mayfail regardless of thecode.

(12)

Table 9 Testing problems.

Problem Description

Ambiguous test result Test result is not communicated to developers, is not an explicit pass or fail or it is not clear what broke the build. Flaky tests Tests that randomly fail sometimes.

Time-consuming testing Testing takes too much time.

Hardware testing Testing with special hardware that is under development or not always available. Multi-platform testing Testing with multiple platforms when developers do not have access to all of them. UI testing Testing the UI of the application.

Untestable code Software is in a state that it cannot be tested.

Problematic deployment Deployment of the software is time-consuming or error-prone. Complex testing Testing is complex, e.g., setting up environment.

Table 10 Release problems.

Problem Description

Customer data preservation Preserving customer data between upgrades.

Documentation Keeping the documentation in-sync with the released version. Feature discovery Users might not discover new features.

Marketing Marketing versionless system.

More deployed bugs Frequent releases cause more deployed bugs. Third party integration Frequent releases complicate third party integration. Users do not like updates Users might not like frequent updates.

Deployment downtime Downtime cannot be tolerated with frequent releases.

–CaseC27

The flakiness can be caused by timing issues [C4], transient problemssuchasnetworkoutages [C6],test/codeinteraction[C8], test environment issues [C11], UI tests [C14] or determinism or concurrencybugs [C14, C22].Flaky testshavecaused lackof dis-cipline[C14]andambiguityintestresults[C22,C27].

Time-consumingtesting. Gettingfeedback fromthe testscan take toolong:

One common opinion at the case company is that the feedback loopsfromtheautomatedregressiontestsaretoolong.Regression feedback timesarereportedto takeanywherefrom fourhoursto twodays.Thishighlightstheproblemofgettingfeedbackfrom re-gressiontestsuptotwodaysafterintegratingcode.

–CaseC27

Ifteststaketoolong,itcanleadtolargercommits[C3a],broken developmentflow[C3a]andlackofdiscipline[C11,C14].Reported reasonsfortoo long testswere unsuitablearchitecture [C3a] and unoptimizedUItests[C14].

Specifictesting problems. From therestoftestingproblems, hard-waretesting,multi-platform testingandUItestingproblemsare re-latedinthe sensethat they refertoproblemswithspecific kinds oftests. These tests make thetesting more complexand require moreefforttosetupandmanageautomatedtesting:

...becausetheUIisthepartofthesystemdesignthatchangesmost frequently,havingUI-basedtestingcandrivesignificanttrashinto automatedtests.

–CaseC8

Othertestingproblems. Untestablecode,problematicdeploymentand

complex testing, are more general problems that relate to each otherviasystemmodularization.Forexample,system modulariza-tionwasclaimedtomaketestingmorecomplex:

To simplify the development process, the platform was modu-larised; thismeantthateach APIhaditsowngit repository.This also madetesting morecomplex.Since theAPIsandcore compo-nentsareundercontinuouslydevelopmentbygroupswhichapply rapiddevelopmentmethodology,itwouldbeveryeasyforcertain

API to break othercomponentsand eventhewhole platform de-spitehavingpasseditsownunittest.

–CaseC21

Systemmodularizationwasreported tocauseproblematic de-ployment [C25a, C25c] and complex testing [C21, C25c]. On the other hand,system modularization was claimed to make testing andthedeploymentoftheindividualpartsofthesystem indepen-dent of other parts [C25b]. Thus, system modularization can re-move theproblemofuntestablecodeandmake deployment eas-ier.Thereforeoneneedstofindabalancebetweentheseproblems whendesigningmodularity.

4.1.5. Releaseproblems

Release problems(see Table10) causetrouble when the soft-ware isreleased.Releaseproblems werereportedonly inone ar-ticle[C5],withtheexceptionofdeploymentdowntimewhichwas mentionedintwoarticles[C5,C25c].

The lackof evidenceaboutreleaseproblems isa resulton its own.Mostofthearticlesfocusedonproblemsthat wereinternal, whereasreleaseproblemsmightbeexternaltothedevelopers.The exceptional article[C5]focused more on theimpact ofCD exter-nally,whichisoneofthereasonsitincludedmultiplerelease chal-lenges. To get more in-depth understanding of the release prob-lems,readersareencouragedtoreadtheArticleP6.

4.1.6. Humanandorganizationalproblems

Humanandorganizationalproblemsarenotrelatedtoany spe-cific development activity, but are general problems that relate tohumanandorganizationalaspects inCDadoption.These prob-lemsaredescribedinTable11.Themostreportedproblemsinthis themewerelackofdiscipline,lackofmotivationandlackof expe-rience.

Lackofdiscipline. Sometimesthesoftwareorganizationasawhole cannotkeeptotheprinciplesdefinedfortheCDdiscipline:

The second limitation is that violations reported by automated checkscanbeignoredbydevelopers,andunfortunatelyoftenthey are.

(13)

Table 11

Human and organizational problems. Problem Description

Lack of discipline Discipline to commit often, test diligently, monitor the build status and fix problems as a team. Lack of motivation People need to be motivated to get past early difficulties and effort.

Lack of experience Lack of experience practicing CI or CD.

More pressure Increased amount of pressure because software needs to be in always-releasable state. Changing roles Different roles need to adapt for collaboration.

Team coordination Increased need for team coordination.

Organizational structure Organizational structure, e.g., separation between divisions causes problems.

Table 12 Resource problems.

Problem Description

Effort Initially setting up continuous delivery requires effort. Insufficient hardware resources Build and test environments require hardware resources. Network latencies Network latencies hinder continuous integration.

This can mean discipline to committing often [C1], ensuring sufficient automated testing [C1], fixing issues found during the integration immediately [C6, C10, C12, C14] and testing changes on a developermachine before committing [C11]. Weak parts of a CI system can cause lack of discipline: ambiguous test result can make it difficult to determine who should fix integration is-sues [C6],time-consumingtestingcan makedevelopersskip test-ing on their own machines[C11] andhaving flakytests or time-consumingtestingcanleadtoignoringtestsresults[C14].

Lackofmotivation. Despitetheproposed benefitsofCD,everyone mightnot bemotivatedtoadoptit.Butinorderto achieve disci-pline,onemustinvolvethewholeorganizationtopracticeCD[C5]. Thisisespeciallydifficultwhenthereseemstobenotimefor im-provement:

But itwashard to convincethem thatwe needed to gothrough ourimplementation“humpofpain” togetthepiecesinplacethat would allow us to have continuous integration. I worked on a small teamandwe didn’tseem to haveany “extra” timeforme toworkontheinfrastructureweneeded.

–CaseC19

In addition torequired effort[C19],lack ofmotivation can be causedbyskepticismabouthowsuitableCDisinaspecificcontext [C27].

Lack of experience. Having inexperienced developers can make it difficulttopracticeCD:

[ChallengewhenadoptingCD:]alackofunderstandingoftheCD process by novice developers due to inconsistent documentation andalackofindustrystandards.

–CaseC5

Lack of experience can cause lack of understanding [C5, C12] andpeople easily drift into usingold habits[C27]. Lack of expe-rience can leadto a feeling ofmore pressure whenthe changeis drivenintheorganization:

Despite thepositivesupport andattitude towards theconceptof CI,teamsfeelthatmanagementwouldlikeittohappenfasterthan currently possiblewhichleadsto increasedpressure.Some devel-opers feel thatthey lack the confidenceand experience to reach desiredintegrationfrequencies.

–CaseC27

Otherhumanandorganizationalproblems. Changingroles,team co-ordinationandorganizationalstructurewerementionedonlybriefly

insinglecasesandlittleevidenceforthemispresented.Thusthey arenotdiscussedhereindepth.

4.1.7. Resourceproblems

Theresource problemswere relatedto theresources available forthe adoption. The problemsare listed in Table 12. Effortwas reportedinsixcases,insufficienthardwareresourcesintwocases andnetworklatenciesinonecase.

Effort. Effortwasmentionedwithtwodifferentmeanings.First,if thebuild systemis notrobust enough, itrequiresconstant effort tobefixed:

TheBubbleteamexpendedsignificanteffortworkingonthebuild. Thebuildwasverycomplexwithautomatedapplicationserver de-ployment,databasecreationandmoduledependencies.

–CaseC3a

Second,atthestartoftheadoption,an initialeffortisneeded forsettinguptheCDsystemandformonitoringit:

Continuallymonitoringandnursingthesebuildshasa severe im-pacton velocityearly on in the process, but also savestime by identifyingbugsthatwouldnormallynotbeidentifieduntilalater pointin time. Itis therefore extremelyimportant to getthe cus-tomer to buyinto thestrategy (...) While initiallysetting upthe framework is a time-consuming task, once this is accomplished, addingmoresuchbuildsisnot onlystraightforward,butalsothe mostnaturalapproachtosolvingother“non-functional” stories.

–CaseC4

Effortis neededforimplementing theCI system[C2, C4, C19, C26],monitoringandfixingbrokenbuilds[C3a,C4],workingwith acomplexbuild[C3a],workingwithmultiplebranches[C8], work-ingwithmultiplecomponents[C17e]andmaintainingtheCI sys-tem[C26].Accordingtoonecase,theperceivedinitialeffortto im-plementtheCIsystemcancauseasituationwhereitisdifficultto motivatestakeholdersfortheadoption[C19].

Hardwareresources. Hardwareresourcesare neededfortest envi-ronments,especiallyrobustnessandperformancetests:

Robustnessand performancebuildstend to be resource-intensive. Wechasedanumberofred-herringsearly onduetoapoor envi-ronment.Itis important to geta goodenvironmentto runthese tests.

–CaseC4

Alsonetworklatenciescannotbe tolerated,ifpresent,they dis-ruptcommittingsmallchanges[C13].

(14)

Table 13

Reported causal explanations. Theme Causes

Build design inflexible build → complex build [C3a] system modularization → complex build [C2] System design –

Integration complex build → broken build [C3a] broken build → work blockage [C3a] broken build → broken development flow [C3a] work blockage → merge conflicts [C3a] large commits → merge conflicts [C3a]

time-consuming testing → broken development flow [C3a] time-consuming testing → large commits [C3a]

network latencies → large commits [C13] slow integration approval → large commits [C17a] merge conflicts → broken build [C3a]

long-running branches → merge conflicts [C14] flaky tests → broken build [C6]

Testing unsuitable architecture → untestable code [C22] unsuitable architecture → time-consuming testing [C3a] system modularization → complex testing [C21, C25c] system modularization → problematic deployment [C25a, C25c] flaky tests → ambiguous test result [C22, C27]

Release –

Human & time-consuming testing → lack of discipline [C11, C14] organizational flaky tests → lack of discipline [C14]

effort → lack of motivation [C19]

ambiguous test result → lack of discipline [C6] lack of experience → more pressure [C27] Resource complex build → effort [C3a]

broken build → effort [C3a] unsuitable architecture → effort [C8] system modularization → effort [C17e]

Fig. 7. All reported causal explanations. Different themes are highlighted with colors. In addition, roots that do not have any underlying causes are underlined and leafs that do not have any effects are in italics.

4.2.Causesofproblems

To study the causes of the problems, we extracted reported causalexplanationsfromthearticles,seeTable13andFig.7.

4.2.1. Causesofbuilddesignproblems

There weretworeportedcausesforbuilddesignproblems: in-flexible build and system modularization. The first problem was

synthesizedunder the build design problem theme and the sec-ond underthe systemdesign problemtheme. This indicates that thebuilddesignisaffectedbythesystemdesign.

4.2.2. Causesofsystemdesignproblems

Noreportedcausesforsystemdesignproblemswerereported. Thisindicatesthatsystemdesignactivityisoneoftherootcauses

Figure

Updating...

References