The Journal of
Information Technology Management
Cutter
IT Journal
Vol. 24, No. 8, August 2011
Devops: A Software Revolution
in the Making?
Opening Statement
by Patrick Debois
Why Enterprises Must Adopt Devops to Enable Continuous Delivery
by Jez Humble and Joanne Molesky
Devops at Advance Internet: How We Got in the Door
by Eric Shamow
The Business Case for Devops: A Five-Year Retrospective
by Lawrence Fitzpatrick and Michael Dillon
Next-Generation Process Integration: CMMI and ITIL Do Devops
by Bill Phifer
Devops: So You Say You Want a Revolution?
by Dominica DeGrandis
At Arm’s Length
Automating the delivery process reduces the time development and operations have to spend together. This means that when dev and ops must interact, they can focus on the issues that really matter.
Arm in Arm
Development and operations often play cats and dogs during deployment. To stop the fighting, we must bring them closer together so they will start collaborating again.
“Some people get stuck on the word ‘devops,’ thinking that it’s just about development and operations working together. Systems thinking advises us to optimize the whole; therefore devops must apply to the whole organization, not only the part between development and operations.”
— Patrick Debois, Guest Editor
Cutter IT Journal®
Cutter Business Technology Council: Rob Austin, Ron Blitstein, Christine Davis, Tom DeMarco, Lynne Ellyn, Israel Gat, Tim Lister, Lou Mazzucchelli, Ken Orr, and Robert D. Scott
Editor Emeritus: Ed Yourdon
Publisher: Karen Fine Coburn
Group Publisher: Chris Generali
Managing Editor: Karen Pasley
Production Editor: Linda M. Dias
Client Services: [email protected]
Cutter IT Journal® is published 12 times
a year by Cutter Information LLC, 37 Broadway, Suite 1, Arlington, MA 02474-5552, USA (Tel: +1 781 648 8700; Fax: +1 781 648 8707; Email: [email protected]; Website: www.cutter.com). Print ISSN: 1522-7383; online/electronic ISSN: 1554-5946. ©2011 by Cutter Information LLC. All rights reserved. Cutter IT Journal® is a trademark of Cutter Information LLC. No material in this publication may be reproduced, eaten, or distributed without written permission from the publisher. Unauthorized reproduction in any form, including photocopying, downloading electronic copies, posting on the Internet, image scanning, and faxing is against the law. Reprints make an excellent training tool. For information about reprints and/or back issues of Cutter Consortium publications, call +1 781 648 8700 or email [email protected]. Subscription rates are US $485 a year in North America, US $585 elsewhere, payable to Cutter Information LLC. Reprints, bulk purchases, past issues, and multiple subscription and site license rates are available on request.
About Cutter IT Journal
Part of Cutter Consortium’s mission is to foster debate and dialogue on the business technology issues challenging enterprises today, helping organizations leverage IT for competitive advantage and business success. Cutter’s philosophy is that most of the issues that managers face are complex enough to merit examination that goes beyond simple pronouncements. Founded in 1987 as American Programmer by Ed Yourdon, Cutter IT Journal is one of Cutter’s key venues for debate.
The monthly Cutter IT Journal and its companion Cutter IT Advisor offer a variety of perspectives on the issues you’re dealing with today. Armed with opinion, data, and advice, you’ll be able to make the best decisions, employ the best practices, and choose the right strategies for your organization. Unlike academic journals, Cutter IT Journal doesn’t water down or delay its coverage of timely issues with lengthy peer reviews. Each month, our expert Guest Editor delivers articles by internationally known IT practitioners that include case studies, research findings, and experience-based opinion on the IT topics enterprises face today — not issues you were dealing with six months ago, or those that are so esoteric you might not ever need to learn from others’ experiences. No other journal brings together so many cutting-edge thinkers or lets them speak so bluntly.
Cutter IT Journal subscribers consider the Journal a “consultancy in print” and liken each month’s issue to the impassioned debates they participate in at the end of a day at a conference.
Every facet of IT — application integration, security, portfolio management, and testing, to name a few — plays a role in the success or failure of your organization’s IT efforts. Only Cutter IT Journal and Cutter IT Advisor deliver a comprehensive treatment of these critical issues and help you make informed decisions about the strategies that can improve IT’s performance.
Cutter IT Journal is unique in that it is written by IT professionals — people like you who face the same challenges and are under the same pressures to get the job done. The Journal brings you frank, honest accounts of what works, what doesn’t, and why. Put your IT concerns in a business context. Discover the best ways to pitch new ideas to executive management. Ensure the success of your IT organization in an economy that encourages outsourcing and intense international competition. Avoid the common pitfalls and work smarter while under tighter constraints. You’ll learn how to do all this and more when you subscribe to Cutter IT Journal.
Opening Statement
by Patrick Debois, Guest Editor
Despite all the great methodologies we have in IT, delivering a project to production still feels like going to war. Developers are nervous because they have the pressure of delivering new functionality to the customer as fast as possible. On the other side, operations resists making that change a reality because it knows change is a major cause of outages. So the usual finger-pointing begins when problems arise: “It’s a development problem”; “Oh no, it’s an operations problem.” This tragic scene gets repeated over and over again in many companies, much to the frustration of management and the business, which is not able to predict releases and deliver business value as expected and required.
The problem is further amplified by two industry trends: agile development and large-scale/cloud infrastructure. Agile development by its nature embraces change and fosters small but frequent deployments to production. This higher frequency puts additional pressure on operations, which becomes even more of a bottleneck in the delivery process. What is remarkable is that even though agile actively seeks collaboration from all its stakeholders, most agile projects did not extend themselves toward the operations people. Continuous integration and the testing department served as the traditional buffer between development and operations, and in that buffer, pressure has also been building up.
The second driver, large-scale/cloud infrastructure, forced operations to change the way they managed their infrastructure. Traditional manual labor didn’t cut it at the scale of thousands of machines. It’s good to want to do small increments and frequent changes, but you need to have the tools to support that. Virtualization enabled ops to spin up new environments very quickly, and cloud did away with the resource problem. The real differentiator came from two concepts: configuration management and infrastructure as code. The first enables you to describe the desired state of an infrastructure through a domain-specific language, allowing you to both create and manage your infrastructure using the same tools. Infrastructure as code is a result of the increase of SaaS and APIs used to handle infrastructure. Embracing both concepts allows operations to automate a lot of its work — not only the initial provisioning of a new machine, but the whole lifecycle. This gave birth to the concept of agile infrastructure.
As a response to these two drivers, devops was born. A number of people got together and started a grass-roots movement that set out to remove the traditional boundaries between development and operations. Some consider this picking up where “traditional agile” left off. After all, software doesn’t bring value unless it is deployed in production; otherwise, it’s just inventory. To tackle the problem, devops encourages cross-silo collaboration constantly, not only when things fail.
Operations becomes a valued member of the traditional agile process with an equal voice. Far too often, SLAs negotiated with a customer work counter to the changes requested by the same customer. Now both functional and nonfunctional requirements are evaluated for their business priority. This moves the usual discussion on priorities back before development starts. As both development and operations serve the same customer, the needs of both are discussed at the same time. Once priorities have been determined and work can start, developers pair together with operations people to get the job done. This pairing allows for better knowledge diffusion across the two traditionally separate teams. Issues such as stability, monitoring, and backup can be addressed immediately instead of being afterthoughts, and operations gets a better understanding of how the application works before it actually deploys it in production.
When it’s hard, do it more often. Like exercise, the more you practice deployment to production, the better you will get at it. Thanks to automation, operations can introduce small changes much more quickly to the infrastructure. Practicing deployment in development, testing, quality assurance, and production environments, and managing it from the same set of defined models in configuration management, leads to more repeatable, predictable results.
As both development and operations are now responsible for the whole, lines are beginning to blur. Developers are allowed to push changes to production (after going through an extensive set of tests), and just like their operations counterparts, they wear pagers in case of emergencies. They are now aware of the pain that issues cause, resulting in feedback on how they program things.
People active in continuous integration were among the first to reach out to operations. In the first article in this issue, Jez Humble and Joanne Molesky elaborate on the concept of an “automated deployment pipeline,” which extends the boundaries of software development to the deployment phase. All changes to a production environment follow the same flow, regardless of whether the changes are to code, infrastructure, or data. The deployment pipeline acts as sonar for issues: the further you get in the pipeline, the more robust the results get. And you want to track the issues as fast as possible, since the cost of fixing them goes up the further you go in the process.
Feedback is available to all people: those in operations learn what issues they might expect in production, while developers learn about the production environments. Nor is feedback only of a technical nature — management and the business can learn from production trial runs what customers really want and how they react. This moves product development even further into customer development and allows you to test-drive your ideas in almost real time.
In our second article, Eric Shamow shows that, like any other change, devops is clearly a discovery path — people and processes don’t change overnight. A common pattern in devops adoption seems to be that organizational changes go hand in hand with the introduction of new tools. A tool by itself will not change anything, but the way you use a tool can make a difference in behavior, resulting in a cultural change. Therefore, driving change requires management support; otherwise, the impact of the change will be limited to some local optimization of the global process. Shamow’s story also shows that a lot of organizations were already taking devops-like approaches, so for them it is nothing new. The great thing with the term “devops” is that people can use it to label their stories. And thanks to this common label, we can search for those stories and learn from others about what works or not.
All the focus in devops on automation to reduce cycle time and enhance repeatability makes you wonder if we will all become obsolete and be replaced by tools. There is no denying that automation can be an effective cost cutter, as our next article, a case study by Larry Fitzpatrick and Michael Dillon, clearly demonstrates. But we need to recognize the real opportunity: all the time won by automation gains us more time to design and improve our work. This is where the competitive advantage comes in. People sometimes lightly say that their most valuable asset is people, but it’s oh so true. They will give your business the edge it needs if you let them. Agility in coding, agility in systems — it takes time and effort to nurture these ideas, but the results can be astonishing. The feedback cycle makes the whole difference in quality and results.
With all this close collaboration, lines begin to blur. Will we all be developers? Does everybody deploy? In a multidisciplinary approach, you have either the option of generalizing specialists (the approach taken in the organization Fitzpatrick and Dillon discuss) or specializing generalists. Both can work, but if you are generalizing specialists, you get the extra depth of their specialization as well. This is sometimes referred to as having “T-shaped people” in the collaboration: people with in-depth knowledge (vertical) who are also able to understand the impact in a broad spectrum (horizontal). In the traditional organization, this would map to a matrix organization structure, similar to the one the authors describe. It’s important to rotate people through different projects and roles, and it’s equally important for the specialists to touch base again with their respective specialist groups to catch up with new global changes.
Many of the initial devops stories originated in Web 2.0 companies, startups, or large-scale Internet applications. In contrast with more traditional enterprises, these organizations have a much lighter and more flexible management structure. It’s worth noting that several of the large Web 2.0 companies such as Amazon and Flickr didn’t use the term “agile” but were working with a similar mindset. Small groups of both developers and operations people delivered new releases and understood that they were working toward the same goals for the same customers.
Given these origins, many people at first dismiss devops, thinking it will never work in their traditionally managed enterprise. They see it conflicting with existing frameworks like CMMI, agile, and Scrum. Consider, however, that most of the stories in this journal come from traditional enterprises, which should serve to debunk that myth. Traditional enterprises need to care even more about collaboration, as the number of people involved makes one-on-one communication more difficult. The term “devops” is only a stub for more global company collaboration.
Our next author, Bill Phifer, clearly shows there is no reason to throw existing good practices away. They might need to be adapted in execution, but nothing in them inherently conflicts with devops. Focusing particularly on CMMI for Development and ITIL V3, Phifer discusses how these two best practice models can be integrated to enable improved collaboration between development and operations. But as useful as this mapping of CMMI and ITIL functions to each other is, Phifer argues that it is not enough. Rather, the “result of effective integration between CMMI and ITIL as applied to the devops challenge is a framework that ensures IT is aligned with the needs of the business.” Indeed, we sometimes forget that the reason we do IT is for the business, and if the business doesn’t earn money, we can be out of a job before we know it. As Phifer emphasizes, this relation to the business is an important aspect in devops: some people get stuck on the word “devops,” thinking that it’s just about development and operations working together. While this feedback loop is important, it must be seen as part of the complete system. Systems thinking advises us to optimize the whole; therefore, devops must apply to the whole organization, not only the part between development and operations.
In our final article, Dominica DeGrandis focuses on just this kind of systems thinking. She talks about the leadership changes that will be required to enable the “devops revolution” and outlines how statistical process control can be applied to devops to increase predictability and thus customer satisfaction. DeGrandis concludes by describing her experience at Corbis (the same company where David Anderson developed his ideas on applying Kanban to software development), recounting how devops-friendly practices like continuous integration and smaller, more frequent releases greatly improved the IT organization’s ability to deliver business value.
Only by providing positive results to the business and management can IT reverse its bad reputation and become a reliable partner again. In order to do that, we need to break through blockers in our thought process, and devops invites us to challenge traditional organizational barriers. The days of top-down control are over — devops is a grass-roots movement similar to other horizontal revolutions, such as Facebook. The role of management is changing: no longer just directive, it is taking a more supportive role, unleashing the power of the people on the floor to achieve awesome results.
Patrick Debois is a Senior Consultant with Cutter Consortium’s Agile Product & Project Management practice. To understand current IT organizations, he has made a habit of changing both his consultancy role and the domain in which he works: sometimes as a developer, manager, sys admin, tester, or even as the customer. In 15 years of consultancy, the one thing that has annoyed him most is the great divide between all these groups. But times are changing now: being a player on the market requires you to get the “battles” between these silos under control. Mr. Debois first presented concepts on agile infrastructure at Agile 2008 in Toronto, and in 2009 he organized the first devopsdays conference. Since then he has been promoting the notion of “devops” to exchange ideas between these different organizational groups and show how they can help each other achieve better results in business. He’s a very active member of the agile and devops communities, sharing a lot of information on devops through his blog http://jedi.be/blog and Twitter (@patrickdebois). Mr. Debois can be reached at [email protected].
Why Enterprises Must Adopt Devops to Enable Continuous Delivery
by Jez Humble and Joanne Molesky
Roughly speaking those who work in connection with the [Automatic Computing Engine] will be divided into its masters and its servants. Its masters will plan out instruction tables for it, thinking up deeper and deeper ways of using it. Its servants will feed it with cards as it calls for them. They will put right any parts that go wrong. They will assemble data that it requires. In fact the servants will take the place of limbs. As time goes on the calculator itself will take over the functions both of masters and of servants.
— Alan Turing,
“Lecture on the Automatic Computing Engine”1
Turing was — as usual — remarkably prescient in this quote, which predicts the dev/ops division. In this article, we will show that this divide acts to retard the delivery of high-quality, valuable software. We argue that the most effective way to provide value with IT systems — and to integrate IT with the business — is through the creation of cross-functional product teams that manage services throughout their lifecycle, along with automation of the process of building, testing, and deploying IT systems. We will discuss the implications of this strategy in terms of how IT organizations should be managed and show that this model provides benefits for IT governance not just in terms of improved service delivery, but also through better risk management.
THE PREDICAMENT OF IT OPERATIONS
IT organizations are facing a serious challenge. On the one hand, they must enable businesses to respond ever faster to changing market conditions, serving customers who use an increasing variety of devices. On the other hand, they are saddled with ever more complex systems that need to be integrated and maintained with a high degree of reliability and availability. The division between projects and operations has become a serious constraint both on the ability of businesses to get new functionality to market faster and, ironically, on the ability of IT to maintain stable, highly available, high-quality systems and services.
The way organizations traditionally deliver technology has been codified in the established discipline of project management. However, while the projects that create new systems are often successful, these projects usually end with the first major release of the system — the point at which it gets exposed to its users. At this stage the project team disbands, the system is thrown over the wall to operations, and making further changes involves either creating a new project or work by a “business as usual” team. (The flow of value in this model is shown in Figure 1.) This creates several problems:
• Many developers have never had to run the systems they have built, and thus they don’t understand the tradeoffs involved in creating systems that are reliable, scalable, high performance, and high quality. Operations teams sometimes overcompensate for potential performance or availability problems by buying expensive kit that is ultimately never used.
• Operations teams are measured according to the stability of the systems they manage. Their rational response is to restrict deployment to production as much as possible so they don’t have to suffer the instability that releases of poor-quality software inevitably generate. Thus, a vicious cycle is created, and an unhealthy resentment between project teams and operations teams is perpetuated.
• Because there are several disincentives for teams to release systems from early on in their lifecycle, solutions usually don’t get near a production environment until close to release time. Operations teams’ tight control over the build and release of physical servers slows the ability to test the functionality and deployment of solutions. This means that their full production readiness is usually not assessed until it is too late to change architectural decisions that affect stability and performance.
• The business receives little real feedback on whether what is being built is valuable until the first release, which is usually many months after project approval. Several studies over the years have shown that the biggest source of waste in software development is features that are developed but are never or rarely used — a problem that is exacerbated by long release cycles.
• The funding model for projects versus operating expenses creates challenges in measuring the cost of any given system over its lifecycle. Thus, it is nearly impossible to measure the value provided to the business on a per-service basis.
• Due to the complexity of current systems, it is difficult to determine what should be decommissioned when a new system is up and running. The tendency is to let the old system run, creating additional costs and complexity that in turn drive up IT operating costs.
The upshot is that IT operations must maintain an ever-increasing variety of heterogeneous systems, while new projects add more. In most organizations, IT operations consumes by far the majority of the IT budget. If you could drive operating costs down by preventing and removing bloat within systems created by projects, you’d have more resources to focus on problem solving and continuous improvement of IT services.
DEVOPS: INTEGRATING PROJECT TEAMS AND IT OPERATIONS
Devops is about aligning the incentives of everybody involved in delivering software, with a particular emphasis on developers, testers, and operations personnel. A fundamental assumption of devops is that achieving both frequent, reliable deployments and a stable production environment is not a zero-sum game. Devops is an approach to fixing the first three problems listed above through culture, automation, measurement, and sharing.2 We will address each of these aspects in turn.
Culture
In terms of culture, one important step is for operations to be involved in the design and transition (development and deployment) of systems. This principle is in fact stated in the ITIL V3 literature.3 Representatives from IT operations should attend applicable inceptions, retrospectives, planning meetings, and showcases of project teams. Meanwhile, developers should rotate through operations teams, and representatives from project teams should have regular meetings with the IT operations people. When an incident occurs in production, a developer should be on call to assist in discovering the root cause of the incident and to help resolve it if necessary.
Automation
Automation of build, deployment, and testing is key to achieving low lead times and thus rapid feedback. Teams should implement a deployment pipeline4 to achieve this. A deployment pipeline, as shown in Figure 2, is a single path to production for all changes to a given system, whether to code, infrastructure and environments, database schemas and reference data, or configuration. The deployment pipeline models your process for building, testing, and deploying your systems and is thus a manifestation of the part of your value stream from check-in to release. Using the deployment pipeline, each change to any system is validated to see if it is fit for release, passing through a comprehensive series of automated tests. If it is successful, it becomes available for push-button deployment (with approvals, if required) to testing, staging, and production environments.
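As a rough illustration of the single-path idea, the following Python sketch models a pipeline in which every change, whatever its kind, must pass the same ordered stages before it can be deployed. The names Change, Stage, and Pipeline and the two toy checks are assumptions made for illustration; they do not correspond to any particular tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical names throughout: Change, Stage, and Pipeline are illustrative
# and are not taken from any real build or deployment tool.


@dataclass
class Change:
    """Any change headed for production: code, infrastructure, schema, or config."""
    kind: str
    description: str
    passed_stages: List[str] = field(default_factory=list)


@dataclass
class Stage:
    name: str
    check: Callable[[Change], bool]  # automated validation for this stage


class Pipeline:
    """Single path to production: every change runs the same ordered stages."""

    def __init__(self, stages: List[Stage]) -> None:
        self.stages = stages

    def validate(self, change: Change) -> bool:
        """Run the stages in order, stopping at the first failure."""
        change.passed_stages = []
        for stage in self.stages:
            if not stage.check(change):
                return False
            change.passed_stages.append(stage.name)
        return True

    def deploy(self, change: Change, environment: str, approved: bool = True) -> str:
        """Push-button deployment, allowed only for validated, approved changes."""
        if change.passed_stages != [stage.name for stage in self.stages]:
            raise ValueError("change has not passed every pipeline stage")
        if not approved:
            raise PermissionError("deployment requires approval")
        return f"deployed {change.kind} change to {environment}"


# Toy checks standing in for real builds and automated test suites.
pipeline = Pipeline([
    Stage("build", lambda change: bool(change.description)),
    Stage("automated tests", lambda change: "broken" not in change.description),
])

good = Change("database schema", "add index to orders table")
bad = Change("code", "broken login handler")

print(pipeline.validate(good))           # True
print(pipeline.deploy(good, "staging"))
print(pipeline.validate(bad))            # False
```

The point of the sketch is that a schema change and a code change share one route: neither can reach an environment without passing every stage, which is what gives the pipeline its audit trail.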
[Figure 1: diagram not reproduced. Labels: Business; Projects (measured on delivery); Operations (measured on stability); Project teams; Tech, ops, and app management; DBAs; Service desk; Users; Systems; Value stream (“concept to cash”).]
Deployments should include the automated provisioning of all environments, which is where virtualization, IaaS/PaaS, and data center automation tools such as Puppet, Chef, and BladeLogic come in handy. With automated provisioning and management, all configuration and steps required to recreate the correct environment for the current service are stored and maintained in a central location. This also makes disaster recovery much simpler, provided you regularly back up the source information and your data.
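To make the idea of recreating an environment from centrally stored configuration more concrete, here is a minimal, tool-agnostic Python sketch. It assumes a desired state described as a simple mapping and a converge step that changes only what differs. The resource names and the functions plan and converge are hypothetical; tools such as Puppet and Chef have their own languages and resource models, which are not reproduced here.

```python
from typing import Dict, List

# Hypothetical, tool-agnostic sketch. The resource names below are invented
# for illustration and do not follow any real tool's syntax.

# Desired state for one service, kept in version control alongside the code.
DESIRED_STATE: Dict[str, str] = {
    "package:nginx": "installed",
    "service:nginx": "running",
    "file:/etc/app.conf": "present",
}


def plan(desired: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """List the resources whose actual state differs from the desired state."""
    return sorted(name for name, state in desired.items() if actual.get(name) != state)


def converge(desired: Dict[str, str], actual: Dict[str, str]) -> Dict[str, str]:
    """Return the environment's state after applying only the needed changes."""
    updated = dict(actual)
    for name in plan(desired, actual):
        updated[name] = desired[name]  # a real tool would act on the machine here
    return updated


# Two environments that start out different end up identical.
fresh_machine: Dict[str, str] = {}
drifted_machine = {"package:nginx": "installed", "service:nginx": "stopped"}

staging = converge(DESIRED_STATE, fresh_machine)
production = converge(DESIRED_STATE, drifted_machine)

print(plan(DESIRED_STATE, drifted_machine))
print(staging == production)             # True
print(plan(DESIRED_STATE, production))   # [] because nothing is left to change
```

Because converging an already correct environment changes nothing, the same description can be applied repeatedly, which is also why rebuilding after a disaster reduces to applying the stored description and restoring the data.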
Measurement
Measurement includes monitoring high-level business metrics such as revenue or end-to-end transactions per unit time. At a lower level, it requires careful choice of key performance indicators, since people change their behavior according to how they are measured. For example, measuring developers according to test coverage can easily lead to many automated tests with no assertions. One way to help developers focus on creating more stable systems might be to measure the effect of releases on the stability of the affected systems. Make sure key metrics are presented on big, visible displays to everybody involved in delivering software so they can see how well they are doing.
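One possible way to measure the effect of releases on stability is to count the production incidents that begin shortly after each release. The Python sketch below assumes a 24-hour attribution window and uses invented release names and timestamps; both the window and the data are illustrative assumptions, not recommendations from the article.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def incidents_per_release(
    releases: Dict[str, datetime],
    incidents: List[datetime],
    window: timedelta = timedelta(hours=24),
) -> Dict[str, int]:
    """Count production incidents that began within `window` after each release."""
    counts: Dict[str, int] = {}
    for name, released_at in releases.items():
        counts[name] = sum(
            1
            for started_at in incidents
            if released_at <= started_at < released_at + window
        )
    return counts


# Hypothetical data: two releases and three incident start times.
releases = {
    "1.4.0": datetime(2011, 8, 1, 10, 0),
    "1.4.1": datetime(2011, 8, 8, 10, 0),
}
incidents = [
    datetime(2011, 8, 1, 11, 30),
    datetime(2011, 8, 1, 20, 0),
    datetime(2011, 8, 5, 9, 0),
]

print(incidents_per_release(releases, incidents))  # {'1.4.0': 2, '1.4.1': 0}
```

A count like this is only a starting point: an incident inside the window is not necessarily caused by the release, so the number is best read as a trend across many releases.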
In terms of process, a critical characteristic of your delivery process is lead time. Mary and Tom Poppendieck ask, “How long would it take your organization to deploy a change that involves just one single line of code? Do you do this on a repeatable, reliable basis?”5 Set a goal for this number, and work to identify and remove the bottlenecks in your delivery process. Often the biggest obstacle to delivering faster is the lengthy time required to provision and deploy to production-like environments for automated acceptance testing, showcases, and exploratory and usability testing, so this is a good place to start.
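Lead time and its bottlenecks can be computed from timestamps recorded as each change moves from check-in to release. The Python sketch below assumes four hypothetical stage names and two invented changes; it reports the lead time per change and the stage with the largest median duration.

```python
from datetime import datetime
from statistics import median
from typing import Dict, List

# Hypothetical stage names, in order from check-in to release.
STAGES = ["check-in", "build and test", "acceptance environment ready", "released"]


def stage_durations(timestamps: Dict[str, datetime]) -> Dict[str, float]:
    """Hours spent reaching each stage from the previous one."""
    durations: Dict[str, float] = {}
    for previous, current in zip(STAGES, STAGES[1:]):
        delta = timestamps[current] - timestamps[previous]
        durations[current] = delta.total_seconds() / 3600
    return durations


def lead_time_hours(timestamps: Dict[str, datetime]) -> float:
    """Total hours from check-in to release for one change."""
    return (timestamps[STAGES[-1]] - timestamps[STAGES[0]]).total_seconds() / 3600


def bottleneck(changes: List[Dict[str, datetime]]) -> str:
    """Stage with the largest median duration across all changes."""
    medians = {
        stage: median(stage_durations(change)[stage] for change in changes)
        for stage in STAGES[1:]
    }
    return max(medians, key=medians.get)


# Hypothetical timestamps for two changes.
changes = [
    {
        "check-in": datetime(2011, 8, 1, 9, 0),
        "build and test": datetime(2011, 8, 1, 9, 30),
        "acceptance environment ready": datetime(2011, 8, 3, 9, 30),
        "released": datetime(2011, 8, 3, 12, 30),
    },
    {
        "check-in": datetime(2011, 8, 4, 14, 0),
        "build and test": datetime(2011, 8, 4, 15, 0),
        "acceptance environment ready": datetime(2011, 8, 5, 15, 0),
        "released": datetime(2011, 8, 5, 16, 0),
    },
]

print([lead_time_hours(change) for change in changes])  # [51.5, 26.0]
print(bottleneck(changes))  # acceptance environment ready
```

In this invented data the wait for a production-like environment dominates the lead time, which matches the obstacle the text identifies as a good place to start.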
Sharing
Sharing operates at several levels. A simple but effective form of sharing is for development and operations teams to celebrate successful releases together. It also means sharing knowledge, such as making sure the relevant operations team knows what new functionality is coming their way as soon as possible, not on the day of the release. Sharing development tools and techniques to manage environments and infrastructure is also a key part of devops.
DIY Deployments
If you implemented all of the practices described above, testers and operations personnel would be able to self-service deployments of the required version of the system to their environments on demand, and developers would get rapid feedback on the production readiness of the systems they were creating. You would have the ability to perform deployments more frequently and have fewer incidents in production. By implementing continuous delivery, in which systems are production-ready and deployable throughout their lifecycle, you would also get rid of the “crunches” that characterize most projects as they move toward release day.
However, while the practices outlined above can help fix the new systems you have entering production, they don’t help you fix the string and duct tape that is holding your existing production systems together. Let’s turn our attention to that problem now.
Figure 2 — The deployment pipeline.
Single path to production for all changes to your system. Everything required to recreate your production system (except data) is in version control.
1. Version control: source code, infrastructure configuration, automated tests, database scripts, tool chain. Every change triggers a new release candidate.
2. Build binaries; run automated tests. Changes that pass automated tests are available for deployment.
3. Self-service automated deployment to testing environments.
4. Self-service automated deployment to staging and production. Approved changes can be deployed into user-facing environments on demand.
• The pipeline tells you the status of what’s currently in each environment
• Everyone gets visibility into the risk of each change
• People can self-service the deployments they want
• The pipeline provides complete traceability and auditing ability into all changes
DID RUBE GOLDBERG DRAW THE ARCHITECTURE DIAGRAM FOR YOUR PRODUCTION SYSTEMS?
We propose an old-fashioned method for simplifying your production systems, and it rests on an approach to managing the lifecycle of your services. Treat each strategic service like a product, managed end to end by a small team that has firsthand access to all of the information required to run and change the service (see Figure 3). Use the discipline of product management, rather than project management, to evolve your services. Product teams are completely cross-functional, including all personnel required to build and run the service. Each team should be able to calculate the cost of building and running the service and the value it delivers to the organization (preferably directly in terms of revenue).
There are many ways people can be organized to form product teams, but the key is to improve collaboration and share responsibilities for the overall quality of service delivered to the customers. As the teams come together and share, you will develop a knowledge base that allows you to make better decisions on what can be retired and when.
What is the role of a centrally run operations group in this model? The ITIL V3 framework divides the work of the operations group into four functions:6
1. The service desk
2. Technical management
3. IT operations management
4. Application management
Although all four functions have some relationship with application development teams, the two most heavily involved in interfacing with development teams are application management and technical management. The application management function is responsible for managing applications throughout their lifecycle. Technical management is also involved in the design, testing, release, and improvement of IT services, in addition to supporting the ongoing operation of the IT infrastructure.
In a product development approach, the central application management function goes away, subsumed into product teams. Nonroutine, application-specific requests to the service desk also go to the product teams. The technical management function remains but becomes focused on providing IaaS to product teams. The teams responsible for this work should also work as product teams. To be clear, in this model there is more demand for the skills, experience, and mindset of operations people who are willing to work to improve systems, but less for those who create “works of art”: manually configured production systems that are impossible to reproduce or change without their personal knowledge and presence.
Once your organization has reached some level of maturity in terms of the basics of devops as described in the previous section, you can start rearchitecting to reduce waste and unnecessary complexity. Select a service that is already in production but is still under active development and of strategic value to the business. Create a cross-functional product team to manage this service and create a new path to production, implemented using a deployment pipeline, for this service. When you are able to deploy to production using the deployment pipeline exclusively, you can remove the unused, redundant, and legacy infrastructure from your system.
[Figure 3 labels: products/services; product owner; users; value stream (“concept to cash”); operations; ops management; service desk; infrastructure.]
Finally, we are not proposing that the entire service portfolio be managed this way. This methodology is suitable for building strategic systems where the cost allocation model is not artificial. For utility systems that are necessary for the organization but do not differentiate you in the market, COTS software is usually the correct solution. Some of the principles and practices we present here can be applied to these services to improve delivery, but a dependence on the product owner to complete change will restrict how much you can do and how fast you can go. Certainly, once changes are delivered by a COTS supplier, you can test and deploy the changes much faster if you have the ability to provision suitable test and production environments on demand using automation.
DEVOPS AT AMAZON: IF IT’S A TRENDY BUZZWORD, THEY’VE BEEN DOING IT FOR YEARS
In 2001 Amazon made a decision to take its “big ball of mud”7 architecture and make it service-oriented. This involved not only changing the architecture of the company’s entire system, but also its team organization. In a 2006 interview, Werner Vogels, CTO of Amazon, gives a classic statement not only of the essence of devops, but also of how to create product teams and create a tight feedback loop between users and the business:
Another lesson we’ve learned is that it’s not only the technology side that was improved by using services. The development and operational process has greatly benefited from it as well. The services model has been a key enabler in creating teams that can innovate quickly with a strong customer focus. Each service has a team associated with it, and that team is completely responsible for the service, from scoping out the functionality, to architecting it, to building it, and operating it.
There is another lesson here: Giving developers operational responsibilities has greatly enhanced the quality of the services, both from a customer and a technology point of view. The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service.8
Many organizations attempt to create small teams, but they often make the mistake of splitting them functionally based on technology and not on product or service. Amazon, in designing its organizational structure, was careful to follow Conway’s Law: “Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.”9
IMPROVED RISK MANAGEMENT WITH CONTINUOUS DELIVERY
A common reason given for not trying devops and continuous delivery in IT shops is that this approach does not comply with industry standards and regulations. Two controls that are often cited are segregation of duties and change management.
Regulations and standards require organizations to prove they know what is happening and why, protect information and services, and perform accurate reporting. Most IT organizations are subject to some kind of regulation and implement controls in order to ensure they are in compliance. Controls are also essential to reducing the risk of having bad things happen that may affect the confidentiality, integrity, and availability of information.
Segregation of duties is a concept derived from the world of accounting to help prevent fraud and reduce the possibility of error. This control is required by regulations and standards such as SOX and PCI DSS. The relevant COBIT control states:
PO4.11 Segregation of Duties
Implement a division of roles and responsibilities that reduces the possibility for a single individual to compromise a critical process. Make sure that personnel are performing only authorized duties relevant to their respective jobs and positions.10
The spirit of the IT control is that one person should not have the ability to make preventable errors or introduce nefarious changes. At a basic level, you have checks and balances to make sure this doesn’t happen. This control can be implemented many different ways; an extreme interpretation of this control by some organizations is that development, operations, and support functions need to be functionally and physically separated and cannot talk to each other or view each other’s systems. Those of us who have experienced working in these organizations know that this level of control only serves to increase cycle time, delay delivery of valuable functionality and bug fixes, reduce collaboration, and increase frustration levels for everyone. Furthermore, this type of separation actually increases the risk of error and fraud due to the lack of collaboration and understanding between teams. All IT teams should be able to talk and collaborate with each other on how to best reach the common goal of successful, stable deployments to production. If your IT teams don’t talk and collaborate with each other throughout the service/product delivery lifecycle, bad things will happen.
A Better Way: The Automated Deployment Pipeline
Reducing the risk of error or fraud in the delivery process is better achieved through the use of an automated deployment pipeline as opposed to isolated and manual processes. It allows complete traceability from deployment back to source code and requirements. In a fully automated deployment pipeline, every command required to build, test, or deploy a piece of software is recorded, along with its output, when and on which machine it was run, and who authorized it. Automation also allows frequent, early, and comprehensive testing of changes to your systems (including validating conformance to regulations) as they move through the deployment pipeline.
This has three important effects:
1. Results are automatically documented and errors can be detected earlier, when they are cheaper to fix. The actual deployment to production with all associated changes has been tested before the real event, so everyone goes in with a high level of confidence that it will work. If you have to roll back for any reason, it is easier.
2. People downstream who need to approve or implement changes (e.g., change advisory board members, database administrators) can be automatically notified, at an appropriate frequency and level of detail, of what is coming their way. Thus, approvals can be performed electronically in a just-in-time fashion.
3. Automating all aspects of the pipeline, including provisioning and management of infrastructure, allows all environments to be locked down such that they can only be changed using automated processes approved by authorized personnel.
Thus, the stated goals of the change management process are achieved:11
• Responding to the customer’s changing business requirements while maximizing value and reducing incidents, disruptions, and rework
• Responding to business and IT requests for change that will align the services with the business needs
As well as meeting the spirit of the control, this approach makes it possible to conform to the letter of the control. Segregation of duties is achieved by having your release management system run all commands within the deployment pipeline as a special user created for this purpose. Modern release management systems allow you to lock down who can perform any given action and will record who authorized what, and when they did so, for later auditing. Compensating controls (monitoring, alerts, and reviews) should also be applied to detect unauthorized changes.
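To make the recording and lock-down ideas concrete, here is a minimal Python sketch of a command runner that checks who may authorize an action and records the command, its output, the machine, the time, and the authorizer. The `AUTHORIZED` policy table, the action names, and the in-memory `AUDIT_LOG` are hypothetical stand-ins; a real release management system would supply its own access control and durable audit storage.

```python
import socket
import subprocess
from datetime import datetime, timezone
from typing import Dict, List, Set

# Hypothetical policy: which people may authorize which pipeline actions.
AUTHORIZED: Dict[str, Set[str]] = {
    "deploy-staging": {"alice", "bob"},
    "deploy-production": {"alice"},
}

# Hypothetical in-memory audit trail; a real system would persist this.
AUDIT_LOG: List[dict] = []


def run_audited(action: str, command: List[str], authorized_by: str) -> dict:
    """Run a pipeline command only if authorized, and record what happened."""
    if authorized_by not in AUTHORIZED.get(action, set()):
        raise PermissionError(f"{authorized_by} may not authorize {action}")
    result = subprocess.run(command, capture_output=True, text=True)
    record = {
        "action": action,
        "command": command,
        "output": result.stdout + result.stderr,
        "exit_code": result.returncode,
        "machine": socket.gethostname(),
        "ran_at": datetime.now(timezone.utc).isoformat(),
        "authorized_by": authorized_by,
    }
    AUDIT_LOG.append(record)
    return record
```

The person who authorizes the action never runs the command by hand. The pipeline process runs it, which is the role of the special user described above, and an auditor can later read `AUDIT_LOG` to see who approved what and when.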
IMPLEMENTING CONTINUOUS DELIVERY
Continuous delivery enables businesses to reduce cycle time so as to get faster feedback from users, reduce the risk and cost of deployments, get better visibility into the delivery process itself, and manage the risks of software delivery more effectively. At the highest level of maturity, continuous delivery means knowing that you can release your system on demand with virtually no technical risk. Deployments become non-events (because they are done on a regular basis), and all team members experience a steadier pace of work with less stress and overtime. IT waits for the business, instead of the other way around. Business risk is reduced because decisions are based on feedback from working software, not vaporware based on hypothesis. Thus, IT becomes integrated into the business.
Achieving these benefits within enterprises requires the discipline of devops: a culture of collaboration between all team members; measurement of process, value, cost, and technical metrics; sharing of knowledge and tools; and regular retrospectives as an input to a process of continuous improvement.
From a risk and compliance perspective, continuous delivery is a more mature, efficient, and effective method for applying controls to meet regulatory requirements than the traditional combination of automated and manual activities, handoffs between teams, and last-minute heroics to get changes to work in production.
It is important not to underestimate the complexity of implementing continuous delivery. We have shown that objections to continuous delivery based on risk management concerns are the result of false reasoning and misinterpretation of IT frameworks and controls. Rather, the main barrier to implementation will be organizational. Success requires a culture that enables collaboration and understanding between the functional groups that deliver IT services.
The hardest part of implementing this change is to determine what will work best in your circumstances and where to begin. Start by mapping out the current deployment pipeline (path to production), engaging everyone who contributes to delivery to identify all items and activities required to make the service work. Measure the elapsed time and feedback cycles. Keep incrementalism and collaboration at the heart of everything you do, whether it’s deployments or organizational change.
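Measuring elapsed time on the current path to production can start as simple arithmetic. The Python sketch below assumes each mapped activity is recorded with the hours someone actively works on a change and the hours the change waits beforehand. The `Step` structure, the field names, and the sample figures are hypothetical and only illustrate the calculation.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Step:
    """One activity on the path to production."""
    name: str
    work_hours: float  # time someone actively spends on the change
    wait_hours: float  # time the change sits idle before this step starts


def summarize(steps: List[Step]) -> Dict[str, Union[float, str]]:
    """Total the elapsed time and point at the longest queue."""
    if not steps:
        raise ValueError("map at least one step before measuring")
    work = sum(step.work_hours for step in steps)
    wait = sum(step.wait_hours for step in steps)
    elapsed = work + wait
    return {
        "elapsed_hours": elapsed,
        "work_hours": work,
        "share_of_time_worked": work / elapsed if elapsed else 0.0,
        "longest_wait": max(steps, key=lambda step: step.wait_hours).name,
    }


# Hypothetical figures for illustration only.
CURRENT_PATH = [
    Step("build and unit test", work_hours=1, wait_hours=0),
    Step("manual regression test", work_hours=16, wait_hours=40),
    Step("change approval", work_hours=1, wait_hours=120),
    Step("production deployment", work_hours=4, wait_hours=18),
]
```

With these made-up numbers, `summarize(CURRENT_PATH)` reports 200 elapsed hours of which 22 are active work, and names the change approval queue as the longest wait, which suggests where a first incremental improvement might be aimed.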
ACKNOWLEDGMENTS
The authors would like to thank Evan Bottcher, Tim Coote, Jim Fischer, Jim Highsmith, John Kordyback, and Ian Proctor for giving their feedback on an early draft of this article.
ENDNOTES
1. Turing, Alan. The Essential Turing, edited by B. Jack Copeland. Oxford University Press, 2004.
2. Willis, John. “What Devops Means to Me.” Opscode, Inc., 16 July 2010 (www.opscode.com/blog/2010/07/16/what-devops-means-to-me).
3. ITIL Service Transition. ITIL V3. Office of Government Commerce (OGC), 2007. (ITIL published a “Summary of Updates” in August 2011; see www.itil-officialsite.com.)
4. Humble, Jez, and David Farley. “Continuous Delivery: Anatomy of the Deployment Pipeline.” informIT, 7 September 2010 (www.informit.com/articles/article.aspx?p=1621865).
5. Poppendieck, Mary, and Tom Poppendieck. Implementing Lean Software Development: From Concept to Cash. Addison-Wesley Professional, 2006.
6. ITIL Service Transition. See 3.
7. Foote, Brian, and Joseph Yoder. “Big Ball of Mud.” In Pattern Languages of Program Design 4, edited by Neil Harrison, Brian Foote, and Hans Rohnert. Addison-Wesley, 2000.
8. Vogels, Werner. “A Conversation with Werner Vogels.” Interview by Jim Gray. ACM Queue, 30 June 2006.
9. Conway, Melvin E. “How Do Committees Invent?” Datamation, Vol. 14, No. 5, April 1968, pp. 28-31.
10. COBIT 4.1. IT Governance Institute (ITGI), 2007.
11. ITIL Service Transition. See 3.
Jez Humble is a Principal at ThoughtWorks Studios and author of Continuous Delivery, published in Martin Fowler’s Signature Series (Addison-Wesley, 2010). He has worked with a variety of platforms and technologies, consulting for nonprofits, telecoms, and financial services and online retail companies. His focus is on helping organizations deliver valuable, high-quality software frequently and reliably through implementing effective engineering practices. Mr. Humble can be reached at [email protected].
Joanne Molesky is an IT governance professional and works as a Principal Consultant with ThoughtWorks Australia. Her work focuses on helping companies to reduce risk, optimize IT processes, and meet regulatory compliance through the application of agile methodologies and principles in delivering IT services. She is certified in ITIL and COBIT Foundations and holds CISA and CRISC designations through ISACA. Ms. Molesky can be reached at jmolesky@thoughtworks.com.
Devops at Advance Internet: How We Got in the Door
by Eric Shamow
Advance Internet is a medium-sized company with roughly 75 employees and around 1,000 servers devoted to a wide array of internally written and modified applications. When I arrived here in 2009, the company was in the midst of rolling out a new configuration management system (CMS) that had been in the works for roughly three years. Multiple deployment failures and poor information transfer had resulted in a complete breakdown of communication between operations and development.
Developers were finding it increasingly difficult to get code onto test or production servers, or to get new servers provisioned. When they were deployed, it was often done incorrectly or without following release notes, and it often had to be redone several times. Operations had no information from development about when to expect releases, and when the releases did come, they would often be incomplete or inconsistent: they had configuration files in varied locations, contained poor or outdated release notes, and often depended on applications not installed from any repository. The entire operations team eventually left under various circumstances, and I was brought in as the new manager of systems operations to evaluate the situation, build a new team, and try to instill a culture of cooperation between development and operations.
I was not aware of “devops” as a movement at the time we started this project, although I became aware of it through my work and by reaching out to other professionals for suggestions and support. However, my experience discovering these problems and solving them from scratch led us straight down the devops path.
THE UNDERLYING ENVIRONMENT
There were several problems that became evident immediately. We began to tackle these issues individually, beginning with monitoring. At that time, the alerts we received were not relevant and timely, and metrics were unavailable, which led to a lot of guesswork and “from-the-gut” decision making. The build situation also needed to be fixed. New servers would be built by operations, be handed over to development for vetting, and then bounce back and forth in a three- or four-week cycle as corrections and modifications were made on both sides. By the end of this process, there was no clear record of who had done what, and the next build was inevitably condemned to failure. Even once we had gotten an idea of what went into a successful build, there was no automated means of repeating the process.
Monitoring and alerting turned out to be the easy part of the equation, but the one that would have the greatest impact on the tenor of the changes to come. Rather than attempting to fix the existing monitoring framework, we built a brand-new one using Nagios and PNP for graphing. We then developed an application severity matrix (based around the organization’s list of known applications and their relative importance) in order to form the basis for a per-application escalation chain. With these established, we then met group by group with each team and the relevant stakeholders for the application or server in question and painstakingly listed which services needed to be monitored, how critical those services were, and when alerts should escalate.
The key piece of this was involving not just people responsible for monitoring the system, but the people who “owned” the system on the business side. Because they were participating in the discussion, they could help determine when a software engineer should be paged or when an alert could be ignored until morning, when they themselves should be paged, and how long to wait before senior management was notified. Once these meetings were held, we were able to move over 80% of the servers in our original alerting system to the new one. The remaining outliers enabled us to convert “unknown unknowns” into “known unknowns”: we now had a map of those servers with no clear owner and little documentation, so we knew which servers required the most focus.
We also established a recurring quarterly schedule with the same stakeholders to review their application settings and determine whether we needed to make adjustments in thresholds and escalations.
Once this process was complete, access to our new Nagios install was given to everyone inside the technical organization, from the lowest-level support representative to the VP overseeing our group. We met with anyone who was interested and had several training classes that demonstrated how to navigate the interface and how to build graph correlation pages. The goal was to open the metrics to the organization. Operations may not be able to keep its eye on everything at all times, so with more people with different interests looking at our data, problems were more likely to become apparent and decisions could be made and fact-checked immediately.
In the meantime, the configuration piece was moving much more slowly. Advance had long grown past the point where server deployments were manual, but what was in place was not much better than that manual deploy. There were two build systems in place, CONF REPO and HOMR, and I will stop to discuss them here purely as cautionary tales. Prior groups had clearly attempted to improve server deployment before and failed. The question I faced was, why? And how could I avoid repeating those mistakes?
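The application severity matrix and per-application escalation chain described earlier in this section can be pictured as two small lookup tables. The Python sketch below is only an illustration of that idea: the application names, severity levels, contacts, and timings are hypothetical, and it does not reproduce Nagios configuration syntax or the matrix Advance actually used.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes an alert has gone unacknowledged
    contact: str        # who gets notified at that point


# Hypothetical escalation chains, one per severity level.
ESCALATION_CHAINS: Dict[str, List[EscalationStep]] = {
    "critical": [
        EscalationStep(0, "on-call engineer"),
        EscalationStep(15, "business owner"),
        EscalationStep(60, "senior management"),
    ],
    "low": [
        EscalationStep(0, "ticket queue"),
    ],
}

# Hypothetical severity matrix: application name to severity level.
SEVERITY_MATRIX: Dict[str, str] = {
    "ad-server": "critical",
    "internal-wiki": "low",
}


def contacts_to_notify(application: str, minutes_unacknowledged: int) -> List[str]:
    """Return everyone who should have been notified by now."""
    severity = SEVERITY_MATRIX.get(application)
    if severity is None:
        # A server with no known owner: a "known unknown" for operations.
        return ["operations (no known owner)"]
    return [
        step.contact
        for step in ESCALATION_CHAINS[severity]
        if step.after_minutes <= minutes_unacknowledged
    ]
```

Agreeing on the contents of the two tables is the part that needs the business owners in the room; the lookup itself is trivial once those decisions are written down.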
PREVIOUS ATTEMPTS AT A TOOLSET, AND MOVING FORWARD
In small-scale environments, making bulk configuration changes is not challenging. Servers are configured individually, and when a change must be made to multiple systems, administrators typically use a tool such as clusterSSH (which sends input from the user’s workstation to multiple remote systems) or a shell script, which repeats the same changes on all of these systems sequentially. As environments become larger, however, it becomes more difficult to maintain systems in this way, as any changes made in an ad hoc fashion must be carefully documented and applied across the board. Therefore, in large-scale environments, it is common to use a configuration management system, which tracks changes and applies them automatically to designated servers.
The original configuration environment evident at the time of my arrival at Advance was CONF REPO. CONF REPO was in fact a heavily modified version of CFEngine. At some point in time, an inadequately tested change in CFEngine apparently caused a widespread systems outage, and the ops team responded by modifying CFEngine’s default behavior. A script called cache_configs now attempted to generate a simulation of CFEngine’s cache from configuration files and automatically copied this to each host. The host would then check what was in the new cache against what was on the filesystem, and it would then stop and send an alert rather than replace the files if there were any discrepancies. Although most of this functionality could have been achieved by CFEngine out of the box, the team had elected to rewrite large parts of it in the new CONF REPO framework, with the result that as CONF REPO’s codebase diverged from CFEngine’s, the framework became unmaintainable. CONF REPO was dropped as an active concern but never fully retired, so many hosts remained on this system.
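The “check, then stop and alert” behavior described above can be illustrated with a checksum comparison. This Python sketch is a guess at the general shape of such a check, not the actual cache_configs script or CFEngine’s cache format: the manifest of expected SHA-256 digests and the function names are assumptions made for the example.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Digest of a file's current contents on disk."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_discrepancies(expected: Dict[str, str], root: Path) -> List[str]:
    """List managed files whose on-disk state differs from the expected digests."""
    problems: List[str] = []
    for relative_path, expected_digest in sorted(expected.items()):
        target = root / relative_path
        if not target.is_file():
            problems.append(f"{relative_path}: missing")
        elif sha256_of(target) != expected_digest:
            problems.append(f"{relative_path}: modified locally")
    return problems


def check_before_apply(expected: Dict[str, str], root: Path) -> str:
    """Stop and report instead of overwriting when the host has drifted."""
    problems = find_discrepancies(expected, root)
    if problems:
        return "ALERT, not replacing files: " + "; ".join(problems)
    return "OK: host matches expected state"
```

The contrast with the later “forceconfigs” approach is the branch in `check_before_apply`: local drift produces a report for a human rather than a silent overwrite.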
This was followed by an attempt at writing a config management system from scratch, called HOMR. HOMR was really a wrapper around Subversion and functioned as a pull-only tool. Changes would be checked into HOMR, and then “homr forceconfigs” would be run on the client. This would pull down all configs and replace existing ones without warning. The result was that HOMR wouldn’t be run on a host for weeks or months while individual admins tinkered with settings; then another admin on an entirely unrelated task (changing default SSH or NSCD settings, for instance) would run “homr forceconfigs” across the board and wipe out a bunch of servers’ configs. This often resulted in production outages.
My evaluation of this situation was that operations was trying to solve a discipline problem with a technological solution. CFEngine, the originally chosen configuration management tool, is a well-respected and widely used system. Had it been used properly and left in place, Advance would have had a CMS years ago. Instead, due to poor discipline surrounding testing of changes, the system was abandoned and replaced with increasingly convoluted and difficult-to-maintain frameworks. This was not the way forward.
Although there was appeal in simply returning to CFEngine, there was enough mistrust of the tool around the organization that I instead elected to get a fresh start by bringing in Puppet as our new CMS. The idea was to create a greenfield “safe zone” of servers that were properly configured and managed and then to slowly expand that zone to include more and more of our environment.
Advance is a CentOS/RHEL shop, so I started by building a Cobbler server and cleaning up our existing Kickstart environment so that we had a reasonable install configuration. I then took the base configuration out of HOMR, translated it into Puppet, and set up a bootstrap script in our Kickstart so that we could build Puppet machines with our base environment. We now had the ability to bring up new Puppet hosts. I wrapped the Puppet configuration directories in a Mercurial repository and shut off direct write access to this directory for the entire admin team; all changes would have to be made through Mercurial, ensuring accountability.
The next step was to start porting configs into configuration management (hereafter CM), building more hosts, and training my newly hired team of admins to use this system. More critically, I needed a way to get buy-in from suspicious developers and product owners (who had heard lots of stories of “new” configuration management systems before) so that I could lift up their existing systems, get their time and involvement in building CM classes reflecting their use, and ultimately rebuild their production environments to prove that we had in fact captured the correct configuration. This needed to work, it needed to be surrounded by accountability and visible metrics, and it needed to be done quickly.
BOOTSTRAPPING THE PROCESS
There were a few key pieces of learned wisdom I took into this next phase that helped ensure that it was a success. This was information I’d gathered during my years as a system administrator, and I knew that if I could leverage it, I’d get buy-in both from my new staff and from the rest of the organization. To wit:
• Unplanned changes are the root of most failures. Almost all organizations resist anything like change management until you can prove this to them, in black and white.
• Engineers do a much better job of selling internal initiatives to each other than management does, even if the manager in question is an engineer. When possible, train a trainer, then let him or her train the group.
• Engineers like to play with fun stuff. If you tie fun stuff to what they consider “boring stuff,” they will put up with a certain amount of the “boring” to get at the “fun.”
• Organizations generally distrust operations if operations has a lousy track record. The only way to fix this is to demonstrate repeatedly and over a long period of time that operations can execute to the business’s priorities.
The “unplanned changes” principle was the first item to tackle and the easiest to prove within my team. Most engineers are at least somewhat aware of this phenomenon, even if they think that they themselves are never responsible for such problems. Clear metrics, graphs, and logs made this stunningly clear. Problems were almost always caused by changes that had been made outside the approved process, even if those problems didn’t show up until a change made as part of the process ran up against the unplanned change weeks later. It took a lot longer to prove this to the business, because engineers will see the results in a log file and immediately draw the picture, whereas less technical stakeholders will require explanations and summary information. But over time, the concept is one that most people can understand intuitively; logs and audit trails made the case.
I was careful to pick a single person within the group to train on Puppet, a move that had a number of effects. It created something of a mystique around the tool (“Why does he get to play with it? I want in!”). It also enabled me to carefully shape that person’s methodology to ensure that it was in line with my thinking, while giving him a lot of latitude to develop his own best practices and standards. This was a true collaboration, and it gave me a second person who could now answer CM questions as well as I could and who could train others.
Finally, we used the carrot approach and moved the most exciting, interesting new projects into CM. Anything new and shiny went there, regardless of its importance to the business. Want to build a new backup server for everybody’s local Time Machine backups? Fine, but it’s got to go in CM, and I want to see someone else deploy the final build to ensure it works. A few months of this, combined with scheduled training, meant that we quickly had a full group of people at least nominally versed in how Puppet worked, how to interact with Mercurial, and how to push changes out to servers.
At this point we were successful, but we had not yet achieved our goals. We could easily have failed here through the same kind of testing failure that brought down CFEngine within our group. The key step was scaling the toolset to keep pace with the number of people involved.
SCALING
Once we were prepared to move beyond two people making commits to our repository, we needed to worry about integration testing. At this point I identified two possible strategies for dealing with this:
1. Automated integration testing with a toolset such as Cucumber
2. Manual testing with a development environment
Although the first was clearly the optimal solution, the two weren’t mutually exclusive. While the Cucumber solution would require a long period of bootstrapping and development, the development environment could be set up quickly and be used and tested immediately. We thus forked our efforts: I tasked my CM expert on the team with figuring out the development environment and how to get code moved around, and I began the longer-term work on the Cucumber solution.
An issue that immediately raised itself was how to deal with potential code conflicts and the code repository. The team we had built was comfortable with scripting and low-level coding but had limited familiarity with code repositories. My initial plan had been to use typical developer tools such as branches and tags or perhaps a more sophisticated Mercurial-specific toolset revolving around Mercurial Queues. As we began to introduce more junior admins into the process, however, we found an enormous amount of confusion around merging branches, the appropriate time to merge, how to deal with conflicts, and other typical small-group issues when people encounter distributed change control for the first time. Rather than attempt to educate a large number of people on this quickly, we decided to build a simple semaphore-based system to prevent two people from working on the same CM module simultaneously.
Some will argue that this solution is suboptimal, but it allowed us to work around the branching issue during the early phases of deployment. There was never (and still isn’t) any doubt that the “right” way to do this was with native repository technologies, but the key point for us was that not adhering to those enabled us to roll out a functioning environment with minimal conflicts and without the need to debug and train on complex merge issues. This kept our admins focused on the core elements of moving to real change control rather than tied up in the details of running it. The goal was to keep it simple initially to encourage more focus on getting systems into CM and writing good, clean CM entries.
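A semaphore of this kind can be as small as one lock file per CM module. The Python sketch below shows one way such a guard might work, assuming a shared directory that every admin’s tooling can reach. The function names, the `.lock` naming, and the use of exclusive file creation are assumptions for illustration, since the article does not describe how Advance’s system was built.

```python
import os
from pathlib import Path


class ModuleLockedError(RuntimeError):
    """Raised when another admin already holds the lock for a module."""


def acquire(lock_dir: Path, module: str, admin: str) -> Path:
    """Claim a CM module for editing; fail if someone else already has it."""
    lock_dir.mkdir(parents=True, exist_ok=True)
    lock_path = lock_dir / f"{module}.lock"
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        holder = lock_path.read_text().strip()
        raise ModuleLockedError(f"{module} is being edited by {holder}") from None
    with os.fdopen(fd, "w") as handle:
        handle.write(admin)  # record who holds the lock
    return lock_path


def release(lock_path: Path) -> None:
    """Give the module back once the change has been committed."""
    lock_path.unlink()
```

An admin would call `acquire(Path("locks"), "apache", "alice")` before editing the Apache module and `release(...)` after committing. Anyone else who tries in between is told who holds the module, which sidesteps merge conflicts at the cost of serializing work on each module.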
DISCIPLINE AND CODING STANDARDS
Again drawing from lessons learned as a systems administrator and dealing with maintaining admin-written scripts, I began work on Puppet, operating from a series of core principles that informed decision making around the environment:
• Coding standards are easy. Many style guides are available online, but if your tool of choice doesn’t have one, write your own. It does not have to be deeply complex, but your code should adhere to two principles: code must be clean and readable, and output must be similarly clean and machine parsable.
• Development should be as generic as possible so that nobody wastes time reinventing the wheel. If you are building a ticket server that runs behind Apache, take the time to build a generic Apache module that can be used by the entire team; if it uses MySQL, build that MySQL module. Treat the code pool as a commons and make everyone responsible for growing it and keeping it clean.
• Rely on self-correction. Once a critical mass is built and everyone is contributing to the same code base, peers won’t tolerate bad code, and you as a manager won’t have to execute as much oversight day-to-day. Oversight and periodic code review are still important, but it’s much more important to have a strong group ethic that contributing “quick” but poorly written code to the repository is unacceptable. The more that peers find errors in each other’s code and help to fix them, the more likely the system is to be stable as a whole and the better the code will likely integrate.
• Enforce the use of a development environment. If changes aren’t put into place in a real, running CM environment and run in that environment to verify that they function properly, there is almost no point in doing devops; you are just abstracting the process of “doing it live.” The key isn’t just to put code through a code repository; it’s to then do the same rigorous testing and deployment that developers must do before operations accepts their code as ready to go.
We did all of the above, to good effect. Within two to three months, everybody on the team had a few personal virtual machines hooked up to the development environment, which they could quickly redeploy to match real-world systems. Code was being contributed, shared, fixed, and extended by the entire group. When the CM system got upgraded and offered us new features we wanted to use, it took only very little training and some examples set by the specialist to enable the
CLOSING THE LOOP
None of the things we did to this point would have been helpful if we hadn’t achieved the goal of improved communication and better deployment. So how did the company respond to these changes?
Most of the concerns from management and other groups were focused on time to deliver new builds and server updates, the accuracy of those builds, and communication with development about the processes it needed to follow to get material onto a server. Prior to our changeover to devops-style configuration management, the CMS project was a particular sore point. Rebuilding an individual server took an average of two weeks and two full days of admin time. After the change, a single engineer could do that same build in half a day, without giving the build full attention, and with absolute precision. As a result, we were able to rebuild entire CMS environments in a day. Because developers worked with us to manage configuration files (only operations has commit rights, but we will often grant developers limited read access), it became very clear what needed to be in a package. And to facilitate faster development, for nonproduction or QA environments such as development and load, we used the Puppet ensure => latest directive for key packages. Consequently, developers only needed to check the newest version of their application package into a repository, and within 30 minutes Puppet would install it to their environments. In the meantime, our publicly available and transparent metrics enabled them to measure the results of these changes and ensure that changes were consistent across environments.
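To make the effect of ensure => latest concrete, here is a small illustrative model of the decision an agent run makes for one package. It is a simplification written for this discussion, not Puppet's implementation: the function names, the action labels, and the plain dotted-integer version scheme are all assumptions, and real package providers use their own version comparison rules.

```python
def parse_version(text):
    """Turn a dotted version such as '2.3.10' into a comparable tuple of ints.

    Assumes purely numeric, dot-separated versions (an illustrative simplification).
    """
    return tuple(int(part) for part in text.split("."))


def plan_package(installed, available, ensure="latest"):
    """Decide what a single agent run should do for one package.

    installed -- version string currently on the host, or None if absent
    available -- version strings published in the package repository
    ensure    -- 'latest' tracks the newest published version;
                 'present' installs only when the package is missing
    Returns a tuple of (action, resulting_version).
    """
    if not available:
        return ("noop", installed)  # nothing published, nothing to do
    newest = max(available, key=parse_version)
    if installed is None:
        return ("install", newest)
    if ensure == "latest" and parse_version(newest) > parse_version(installed):
        return ("upgrade", newest)
    return ("noop", installed)


if __name__ == "__main__":
    repo = ["2.3.9", "2.3.10"]  # a developer has just published 2.3.10
    print(plan_package("2.3.9", repo, ensure="latest"))
    print(plan_package("2.3.9", repo, ensure="present"))
```

With the "latest" setting, the host picks up the newly published version on its next run; with "present", an already installed package is left alone. That difference is why the directive suits fast-moving development and QA environments and is a riskier choice for production.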
These developments had two immediate effects. First, the folks on the old CMS project started talking loudly about how well our project was going, and soon other product managers whose applications suffered from a situation similar to the former CMS project began to request that their servers move into our new CM. We soon had a queue longer than we could handle of projects fighting to be moved into the new environment. For a while, my ability to move servers in lagged behind my ability to train my staff in doing so. Second, senior management immediately noticed the improvement in turnaround time and the disappearance of debugging time due to environment inconsistency. They immediately mandated that we roll our new CM out to as many environments as possible, with the goal of fully moving all of Advance into full CM by year’s end. Management recognized the importance of this approach and was willing to make it a business rather than a technological priority.
I should not underemphasize the role of PR in making this project a success. A tremendous amount of time went into meeting with product owners, project managers, development teams, and other stakeholders to assure them about what was changing in their new environment. I was also willing to bend a lot of rules in less critical areas to get what we wanted in place; we made some short-term engineering compromises to assuage doubts in exchange for the ability to gain better control of the system in its entirety. For instance, we would often waive our rules barring code releases on Fridays, or we would handle less urgent requests with an out-of-band response normally reserved for emergencies to ensure that the developers would feel a sense of mutual cooperation. The gamble was that allowing a few expediencies would enable us to establish and maintain trust with the developers so we could accomplish our larger goals.
That gamble has paid off. Puppet is now fully entrenched in every major application stack, and it is at roughly 60% penetration within our environment. The hosts not yet in the system are generally development or low-use production hosts that for a variety of reasons haven’t been ported; we continue slow progress on these. We have also begun to train engineers attached to the development teams in the system’s use, so that they can build modules for us that we can vet and deploy for new environments. This gives the development teams increased insight into and control over deployments, while allowing us to maintain oversight and standards compliance. We have also established some parts of our Mercurial repositories where developers can directly commit files (e.g., often-changing configuration files) themselves, while still leveraging our system and change history.
NEXT STEPS
We’re not finished yet — in fact we’re just getting started. Exposure to the devops community has left me full of fresh ideas for where to go next, but I would say that our top three priorities are integration and unit testing, elastic deployments, and visibility and monitoring improvements.
For integration and unit testing, we have already selected Cucumber and have begun writing a test suite, with the goal that all modifications to our Puppet environment will have to pass these tests in order to be committed. This suite would not be in lieu of, but rather in addition to, our development environment. The ultimate goal here is to enable development teams or product owners to write agile-style “user