Improve Your Exchange Deployment by Learning from a Massive Scale
By Tony Redmond
Exchange Online, running on the Office 365 cloud platform, operates more than 100,000 servers to support some 60 million mailboxes. At least, that’s the best-guess estimate based on information provided by Microsoft at the Exchange Conference. Whatever the exact figures, there’s no doubt that Exchange Online is a massive distributed environment that supports thousands of companies worldwide. The service is growing strongly, and some estimates predict that more mailboxes will be hosted in the cloud than on on-premises servers by the end of 2017.
Without a solid methodology for monitoring such an environment, Microsoft would struggle to keep the service running. Given the scale of Office 365, it is fair to ask whether any lessons learned from this experience can be applied to on-premises deployments.
What follows considers how those lessons apply to on-premises Exchange deployments and examines the steps that administrators can take to improve their operating environment.
MANAGING OFFICE 365
Of course, the team that designed, built, and operates Office 365 has many assets to help in its task. The Exchange engineering team is part of that effort, and its engineers are directly responsible for supporting the features they develop for “the service.” Other Microsoft engineering assets are available too, such as the talented engineers who work on Windows.
2014
Service Level Performance (SLA) guarantee of 99%
In addition, Microsoft Research has helped to develop the machine learning capabilities that underpin new features such as People View and Clutter. Finally, Microsoft has the strong balance sheet needed to support the investment of many billions of dollars required to build out a cloud datacenter infrastructure.
But all the assets in the world do not provide a simple answer to the problem of how to build and operate software at such massive scale. A lot of heavy lifting has been done to develop Exchange to the point where it can be run at the economic price point required for a competitive cloud service.
This work has happened over the last decade and includes areas such as storage (to make Exchange run well on low-cost JBOD disks), automation, and protocols.
Much of the output from this engineering work is found in the on-premises version of Exchange too. For example, Exchange 2013 introduces Managed Availability, which was developed to provide servers with the automatic “self-healing” capability necessary to recover from common problems.
It is one thing to deal with a problem with a failing Exchange server if only two or three servers are in use; it is quite another when thousands of servers run in multiple datacenters.
[Figure: datacenter generations. Generation 1 (1989-2005): colocation; server capacity; ~2.0 PUE; 20-year technology. Generation 2 (2007): density; rack density and deployment; 1.4-1.6 PUE; minimized resource impact. Generation 3 (2008): containers and pods; scalability and sustainability; ~1.2-1.5 PUE; air and water economization. Generation 4 (2010+): modular IT-PACs (pre-assembled components); reduced carbon, rightsized; 1.04-1.20 PUE; faster time to market.]
LEARNING FROM PREVIOUS GENERATIONS
As it built out the Office 365 program, Microsoft paid a lot of attention to the lessons learned from its previous Business Productivity Online Services (BPOS) cloud offering. BPOS suffered because it used an older generation of software products that were designed for enterprise deployment rather than the cloud. It also suffered because of a lack of automation and workflow capabilities.
THE IMPORTANCE OF AUTOMATION
Today’s Office 365 uses a sophisticated workflow engine called “Central Admin” (CA) that is capable of handling more than 50 million workflow tasks per month. The idea is to automate common tasks as much as possible so as to achieve a reliable and robust throughput of actions across the system. Tasks are expressed to CA in the form of scripted workflows in either C# or PowerShell. CA executes tasks on schedule to perform actions such as server deployment, database rebalancing within a Database Availability Group, and so on. More complex tasks such as the addition of new capacity to the service still need some human intervention and intelligence, but the application of a structured model and great attention to detail has enabled Microsoft to reduce the time necessary to complete even very complex tasks from weeks to days.
Office 365 servers are built to a standardized design. This does not mean that exactly the same components are used every time as this would be impossible in an industry where components change frequently. However, it does mean that a server will have the same general characteristics (CPU, disk, memory) and that software is installed in the same way on all servers of a specific type. Low-cost components such as disks are used in order to be able to provide users with features such as 100GB mailboxes while still being market competitive. Using JBOD disks brings a certain risk of a higher failure rate and indeed, across Office 365, hard disk failures are the most common event of the more than ten thousand hardware events that are handled monthly.
Exchange’s Active Manager will failover a database to a new server quickly if a disk problem is detected. It will also create a new copy of the failed database using the failures and will open support tickets automatically if an issue is detected like a failed
Even with such a sophisticated and smooth-running operation, Office 365 still runs into some problems. For instance, “stragglers” are servers that run out-of-date software versions that might provide an inconsistent service to users. Office 365 is in a constant state of server refresh to introduce new software builds that contain new features. With so many servers and so many updates, it can be expected that some updates don’t happen as well as they should, which is the usual reason why a straggler exists.
Figure 1: “RepairBox” (Source: Microsoft)
Services are standardized in that individual tenants cannot customize the functionality they receive from the service (some element of control is available through the “First Release” option).
Support also follows well-laid down paths where problems flow (sometimes more slowly than desired) through first-level phone support eventually to a small group who can actually make a change to a setting that controls how the service operates. Great attention to detail and absolute adherence to procedure are hallmarks of how Office 365 works.
MANAGING THE NETWORK
Given that Office 365 is a cloud service, it should come as no surprise that network is its most precious resource. Without sufficient high-quality bandwidth, users will be unable to connect to Office 365, migrations cannot transfer data from on-premises servers, and hybrid connectivity won’t work. Microsoft does not control the network used to connect to Office 365 as this is managed by a large set of Internet Service Providers (ISPs) around the world. Although the Internet was originally designed to survive a nuclear holocaust, local failures caused by cable problems, ISP datacenter issues, and hardware failure can all prevent access to Office 365.
Microsoft can’t control the Internet, but it can take control of its own destiny within the network that connects all of the Office 365 datacenters. That network is dedicated and tightly controlled and monitored. Automatic redundancy is deployed so that a temporary outage is contained and automatically addressed. Everything that can be done to ensure that the service is maintained is done, but even so, like all cloud services, the SLA delivered by Office 365 can only be guaranteed at the boundary of the cloud provider’s datacenters.
MONITORING
With so many servers in use, it should come as no surprise that Office 365 generates a reasonable number of signals relating to server and application operations. Microsoft built a “Data Insights Engine” (Figure 2) using Azure and SQL Azure to process these events, which can number up to 500 million per hour. The events are aggregated and analyzed to understand how the overall service is operating and to detect problems with individual components.
The general approach is that if a problem is being reported by many entities or different signals, then it must be true. By depending on signals from multiple sources, you can get close to 100% fidelity when it comes to the automatic detection of problems, or “Red Alerts” as they are known within Office 365. In addition, by analyzing signals from different sources, Office 365 is able to focus on where the root cause of the problem is likely to be with a high degree of accuracy and this, in turn, allows automatic recovery actions to be launched with a high degree of confidence that they will fix the problem. Taking a data-driven and analytic approach to the detection and resolution of problems is key to being able to operate at scale.
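The corroboration idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how the Data Insights Engine is implemented: the `Signal` record, its field names, and the `min_sources` threshold are all hypothetical. A component is flagged only when several distinct sources agree that it is failing.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set


class Signal(NamedTuple):
    """One observation about a component (all field names are hypothetical)."""
    source: str      # which probe, log, or counter produced the signal
    component: str   # the server or database the signal refers to
    failing: bool    # True when the source reports a problem


def detect_red_alerts(signals: Iterable[Signal], min_sources: int = 3) -> Dict[str, List[str]]:
    """Return components that at least `min_sources` distinct sources report as failing."""
    reporters: Dict[str, Set[str]] = defaultdict(set)
    for signal in signals:
        if signal.failing:
            # A set counts each source once, however often it repeats itself.
            reporters[signal.component].add(signal.source)
    return {
        component: sorted(sources)
        for component, sources in reporters.items()
        if len(sources) >= min_sources
    }


if __name__ == "__main__":
    sample = [
        Signal("event_log", "DB01", True),
        Signal("probe", "DB01", True),
        Signal("client_errors", "DB01", True),
        Signal("probe", "DB02", True),
    ]
    print(detect_red_alerts(sample))
```

With a threshold of three, a single noisy source cannot raise an alert on its own; lowering the threshold trades fidelity for sensitivity.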
In addition to its signal processing engine, Office 365 also uses much simpler techniques to know when something might be going wrong. For instance, if a spike in page views occurs for the Service Dashboard, it is likely that customers are checking the dashboard to know whether a problem exists with one of the applications running in the service. Such a spike can often be correlated with an output from the signal processing engine but sometimes it leads to a discovery of a problem that is identified by human beings. The characteristics of that problem can then be captured as a recognizable scenario for future automatic identification and resolution.
Figure 2: Office 365 Central Admin (source: Microsoft)
APPLYING LESSONS FROM OFFICE 365 TO ON-PREMISES EXCHANGE
On-premises administrators do not have the same levers available to control Exchange but they can apply some of the same basic principles in order to achieve a more reliable service from Exchange. Broadly speaking, most value is gained by focusing on the following areas:
• Standardization
• Simplification
• Automation
• Monitoring
Let’s examine each of these areas in more detail.
STANDARDIZATION
It does not make sense to install Exchange in a different way on every server. The general principle is that the configuration and layout of a server should not be so removed from the norm that valuable time is lost if an administrator has to master the details of the environment when a problem occurs. Every company will have different ideas as to how rigid server standardization should be, but here are some ideas to consider.
Table 1 lists some of the areas to consider when designing a standardized approach to server deployment for Exchange 2013.
Table 1: Areas to consider when standardizing server builds
The idea is that all servers designed for a specific purpose (for example, a multi-role Exchange server that is a member of a DAG) are equipped with broadly the same configuration. Computer components change over time as improvements are introduced to the market and it can be impossible to match up everything at a very precise level unless all servers are purchased from the same vendor at the same time. Therefore some difference is acceptable at a detailed level as long as the servers remain broadly the same. For example, it does not matter much if one server uses 4TB disks and another has 6TB disks as long as the same number of disks are available to each server and the software and files are installed in the same locations on all servers. It is more problematic if one server has markedly different performance characteristics than the others, as is the case when one machine is equipped with more physical CPUs that are faster than those available to other computers.
Area | Advice
Operating system | Install the same software (to build number) on all Exchange servers, including patches and fixes released through Windows Update.
Exchange | Install the same version of Exchange (including updates) on all servers. Only deploy multi-role servers unless an obvious and well-defined need is identified for a single-role server (such as an Edge server).
Databases | Within a DAG: have the same number of copies (at least three) for all databases, enable circular logging, and keep the transaction logs with their database.
Server-specific settings | Exchange holds most of its configuration data in Active Directory. Where server-specific configuration files or registry settings are used, care should be taken to ensure that the same values are used on all servers.
Disks | Have the same number of disks available to all servers. Put the Exchange binaries on the same disk on all servers. Use the same number of disks for Exchange databases. Consider using database autoreseed to provide additional resilience against disk failure.
Other installed software | Some servers will host specific products (for example, BlackBerry Enterprise Server), but in general it is best if the same tools and add-on products are available on all Exchange servers.
Going to this level of detail to achieve standard server builds is only practical when you deal in thousands of servers and can dictate the form factor and design of the desired server. The time then spent in design is more than recouped in operations.
However, it is still worth defining a small number of standard server types (for example, a small email server, a medium email server, and a large email server), and using them throughout makes the environment easier to manage than if many different kinds of servers were used.
Note that in some cases you are required to have the same software installed to create a DAG: mailbox servers in a DAG must run the same operating system and the same version of Exchange. Ideally speaking, these versions should be as close to identical as possible. In addition, given that outdated drivers (especially storage drivers) are a major known source of failure on Exchange servers, it makes sense to ensure that all of the hardware drivers used on Exchange servers are up-to-date throughout the organization.
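One way to check how closely servers follow a standard build is to compare an inventory of their settings and flag any server that differs from the majority. The Python sketch below assumes the inventory has already been collected (for example, by a PowerShell script) into a dictionary; the server names, setting names, and values are made up for illustration. The same check also surfaces servers that run an older build than their peers.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical inventory shape: server name -> {setting name: value}
Inventory = Dict[str, Dict[str, str]]
MISSING = "<missing>"


def find_drift(inventory: Inventory) -> Dict[str, List[Tuple[str, str, str]]]:
    """Map each non-conforming server to (setting, actual, expected) tuples.

    The expected value of a setting is simply the value most servers share.
    """
    settings = sorted({key for config in inventory.values() for key in config})
    drift: Dict[str, List[Tuple[str, str, str]]] = {}
    for setting in settings:
        counts = Counter(config.get(setting, MISSING) for config in inventory.values())
        expected, _ = counts.most_common(1)[0]
        for server, config in sorted(inventory.items()):
            actual = config.get(setting, MISSING)
            if actual != expected:
                drift.setdefault(server, []).append((setting, actual, expected))
    return drift


if __name__ == "__main__":
    example = {
        "EX01": {"exchange_build": "build-7", "db_disks": "12"},
        "EX02": {"exchange_build": "build-7", "db_disks": "12"},
        "EX03": {"exchange_build": "build-6", "db_disks": "12"},
    }
    for server, issues in find_drift(example).items():
        for setting, actual, expected in issues:
            print(f"{server}: {setting} is {actual}, expected {expected}")
```

Using the majority value as the baseline avoids maintaining a separate reference file, but it only works when most servers are already correct; an explicit baseline is safer in a small or badly drifted environment.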
SIMPLIFICATION
As already mentioned, the advice is to deploy only multi-role Exchange servers. This is actually against the practice in Office 365, where Microsoft uses dedicated mailbox and CAS servers. The scale at which Office 365 operates and the management practices used in software deployment (reduce a server to bare metal and reinstall instead of applying patches) make this example inapplicable to most on-premises environments. Using multi-role servers delivers maximum resilience to the organization (see “Why installing a multirole Exchange 2013 server is the best option”) and makes optimum use of available machine resources. It is also simpler if the same type of server is used everywhere.
Advocates of virtualization will say that deploying virtual Exchange servers is the best approach. This might be true, providing that best practices are followed in the deployment, that a clear and obvious business benefit is gained, and that the company has the operational maturity and experience to be able to extract the benefits of virtualization. In other situations, virtualization increases the overhead, cost, and complexity of a deployment and is therefore not aligned with the aim of simplifying the environment.
A vast array of client versions and protocols are available to connect to Exchange. Every client/version/protocol combination increases the complexity of support and introduces new potential for problems after an update to client software or the protocol it uses. A brief list of the potential clients for Exchange 2013 is:
• Outlook 2013 Professional Plus (SP1+ required for MAPI over HTTP)
• Outlook 2010
• Outlook 2007
• Exchange ActiveSync (EAS) clients running on Apple iOS, Windows Phone, and Android devices
• Outlook Web App (OWA) for PC browsers
• OWA for iOS
• OWA for Android
• IMAP4 clients
• POP3 clients
It makes sense to settle on one desktop client and (as allowed by user desire) a reasonable number of client/protocol choices. Mobile device policies can be used to control the versions of Exchange ActiveSync (EAS) clients that can connect so that, for instance, you might only allow Apple devices running iOS 8.1 or above to connect. Protocol settings on the mailbox can be used
AUTOMATION
The potential for error is high every time a human makes a change or performs a management operation on a server. This is not because human administrators are stupid. Rather, it is our intelligence that causes us to lose interest in mundane, oft-repeated operations, which then leads to mistakes and problems.
All scalable operations focus on eliminating human error through automation. As evident from its use in the product installation procedure, even the most complex Exchange operations can easily be scripted with PowerShell (a script that you can customize to perform an unattended installation of Exchange is available from the Microsoft TechNet Gallery). It makes sense for administrators to script as many common operations as possible so that operations from mailbox provisioning to reporting on database growth are performed in a consistent and predictable manner. It’s true that Exchange administrators are not all PowerShell gurus, but it is also true that many scripts and code snippets that can be repurposed and reused are available on web sites.
MONITORING
Even a small Exchange installation can produce thousands of events and other signals daily. No human being can make sense of so much data unless they spend an inordinate amount of time checking event logs and other sources to ensure that no out-of-ordinary occurrence goes unnoticed. Given that users expect 24x7 access to email, the task of ensuring that everything runs smoothly is very demanding indeed. Exchange 2013 helps to resolve some common issues with Managed Availability. Every minute, Managed Availability measures hundreds of health metrics from a server using a comprehensive set of probes, monitors, and responders. Its probes recover information from a variety of sources that is then analyzed by the monitors and, if necessary, a monitor might decide to invoke a responder to resolve an issue. Unfortunately, Managed Availability is a “black box” with no friendly user interface to allow an administrator to interrogate what’s happening on a server. A certain amount of good faith is needed that Managed Availability will do the right thing every time.
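The probe, monitor, and responder pattern can be illustrated with a small Python sketch. It is not how Managed Availability is implemented, and the class name, the failure threshold, and the state labels are assumptions made for the example. A monitor runs a probe, counts consecutive failures, and invokes its responder only once the threshold is reached.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Monitor:
    """Hypothetical monitor that ties one probe to one responder."""
    name: str
    probe: Callable[[], bool]        # returns True when the check passes
    responder: Callable[[], None]    # recovery action, e.g. restart a service
    failure_threshold: int = 3       # consecutive failures before responding
    consecutive_failures: int = 0
    history: List[str] = field(default_factory=list)

    def run_once(self) -> str:
        """Run the probe once and return the resulting state label."""
        if self.probe():
            self.consecutive_failures = 0
            state = "healthy"
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.responder()
                # Start counting again so the responder is not re-run on every cycle.
                self.consecutive_failures = 0
                state = "responded"
            else:
                state = "degraded"
        self.history.append(state)
        return state


if __name__ == "__main__":
    outcomes = iter([True, False, False, False, True])
    monitor = Monitor(
        name="example-check",
        probe=lambda: next(outcomes),
        responder=lambda: print("responder invoked"),
    )
    print([monitor.run_once() for _ in range(5)])
```

Requiring several consecutive failures before acting keeps a single transient error from triggering a disruptive recovery action, and the `history` list gives an administrator the visibility that a “black box” lacks.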
In fact, given that Office 365 provides its tenants with some nice reports, it is curious that this area is very weak in the on-premises version. PowerShell scripts can help close the gap and many examples of PowerShell-driven reports are available on the web. The scripts used to generate these reports can be customized for your environment.
Monitoring of a hybrid environment is particularly challenging. Data is available from the on-premises side that is unavailable from Office 365. Apart from PowerShell, no APIs are available to interrogate information about service throughput and other statistics. And even if you use PowerShell, Microsoft throttles its use within Office 365, which makes it harder to deal with large collections of objects such as mailboxes. While understandable because they protect the resources necessary to run a multi-tenant service, these restrictions can sometimes be frustrating when outsiders seek to gain an insight into what’s happening inside Office 365.
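When a service throttles requests, a common client-side tactic is to work through a large collection in small batches and back off when a batch is rejected. The Python sketch below shows that tactic in general terms. `ThrottledError`, the `handler` callback, and the default batch size and delays are hypothetical and do not correspond to any specific Office 365 or PowerShell interface.

```python
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class ThrottledError(Exception):
    """Hypothetical error raised by a handler when the service rejects a call."""


def process_in_batches(
    items: Sequence[T],
    handler: Callable[[Sequence[T]], List[R]],
    batch_size: int = 100,
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Process items batch by batch, retrying throttled batches with exponential backoff."""
    results: List[R] = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        for attempt in range(max_retries + 1):
            try:
                results.extend(handler(batch))
                break
            except ThrottledError:
                if attempt == max_retries:
                    raise  # give up after the final retry
                # Wait 1x, 2x, 4x, ... the base delay before trying again.
                sleep(base_delay * 2 ** attempt)
    return results
```

Passing `sleep` as a parameter keeps the function testable without real waiting; in practice the handler would wrap whatever call retrieves or updates a batch of mailboxes.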
The evidence from Office 365 is that Managed Availability does a good job. However, Office 365 is a highly standardized and structured environment that is unlikely to resemble your deployment. For this reason, it is wise to regard Managed Availability as just one part in the overall monitoring solution. Depending on your circumstances, you might need additional tools to handle situations such as mobile client management, statistic gathering and reporting (for example, mailbox growth over the last year), number and use of distribution groups, public folder usage, analysis of protocol logs, message tracking reports, and so on.
SUMMARY
Office 365 delivers a reliable service to tens of millions of users daily. It can only achieve this goal because Microsoft has invested heavily to automate its operations, and to monitor and understand the signals emitted from the infrastructure.
Smaller companies do not have the resources available to Microsoft, but it is possible to incorporate some of the same lessons into the operation of on-premises Exchange servers. As always, your knowledge of the company’s operating environment and business goals should guide you in selecting which lessons are appropriate and valuable.