Data Center Construction
(and management)
Johan Tordsson
Department of Computing Science
Last time
1. Common (Web) application architectures
– N-tier applications
• Load Balancers • Application Servers • Databases
2. Cloud application guidelines
– Scalability – Fault-tolerance – Some best practices – A few Amazon examples
Today
• Data centers – how to build (and operate)
– Servers – Network – Storage – Power – Cooling – Energy-efficiency
– … and a building to keep everything in…
• Conceptual overview only
– Details about these only relevant for those who actually build/operate data centers…
Data Center as a Computer
• Majority of cloud computing infrastructure consists of reliable services delivered through data centers
• Traditional co-location data centers
– Multiple servers and communications gear collocated due to common environmental & security needs
– Hosts a large number of relatively small or medium-sized applications, each running on a dedicated hardware infrastructure
• Data centers for cloud computing platforms
– Belongs to a single organization
– Uses a relatively homogeneous hardware and system software platform
– Common system management layer
Warehouse Scale Computers (WSC)
• Not just a collection of servers
– 100s to 1000s coordinated servers
– Typically runs on a virtualized platform
– Fault behavior & energy considerations have significant impact
– Needs to be considered as a single unit
• Must be highly manageable
– Deployment of software updates – Monitoring & system management
• Affordability
– Currently power public clouds such as Google, Amazon, Yahoo, Microsoft, etc…
– Soon to be affordable by Enterprises
• A rack of servers can easily have > 600 cores
What’s different about WSCs?
“As computation continues to move into the cloud, the computing platform of interest no longer resembles a pizza box or a refrigerator, but a warehouse full of computers. These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC).”
Google “Warehouse Style Computer” Data Center
The New Data Center Industry
• Data centers replace servers
• Container Computer for high efficiency and environmental conservation (Packaging, PUE, …)
• Bundled software for integrated service, high scalability, and availability
• Large Enterprise will bypass traditional server channels (IBM, HP, Dell, …)
– Purchase of entire data center directly from manufacturers
• Significant cost reductions • Horizontal scalability • High Availability
• Google already purchases directly from Taiwan
– Google is the 4th largest server manufacturer, but does not sell…
– Facebook’s opencompute.org project
Container Computers
Data Center Architecture
• Treat the entire data center as a computer
- Air flow analysis - Cooling architecture (thermal management) - Power/energy management
- Focus on ease of system and network management - What cannot be managed/monitored does not get deployed
• Modular and Scalable
- Card to Rack - Rack to Container - Container to Warehouse
• Explore low power, commodity CPU as a building block
Data center server hardware
• Standard servers • Standard networks • Standard storage • But at a very large scale • Comparison: Parallel computer
– Custom high-performance hardware (?) – Fast interconnection networks
Design Motivation
• Multicore CPUs in mid-range servers typically carry a price/performance benefit
– 2-5 times cheaper than top-of-the-line systems
• Many services are memory-bound
– Faster CPUs do not scale well for large services – Applications are larger-than-server anyway
• Slower CPUs are more power efficient
Cost comparison example
Server and network overview
• High-latency, low-price network
– Gigabit ethernet
• Hierarchy of commodity switches
Storage
• Increased space with distance • Increased latency and decreased bandwidth
Data center management tools
[Figure labels: Physical Cluster Deployment Tool; Virtual Cluster Provisioning; Power Management; Intra-Virtual-Cluster Load Balancing; Network/System Management; Security; Cloud Application Management Tool; Physical Compute Servers; Network]
Management (cont.)
• Virtualization Platform (virtualize everything)
– CPUs
– Storage (Filesystems) – Network
• Resource Management
– Provisioning of virtual clusters – Physical machine load balancing – Network traffic load balancing
• Power Management • Security
– Hypervisor protection – Isolation between clusters
• System Management • High Availability
– Physical component failure should not interrupt availability of virtual resources
• Cloud Applications management
• Unless a resource can be remotely managed, it should not be part of the data center…
Virtualization Platform
• Leverage existing hypervisors
– Allocation of virtual machine instances
– Monitor VM performance
– Virtual storage provisioning
– Intra-Virtual-Cluster load balancing
– Scalable data center network
– Isolation between virtual clusters
– Virtual machine migration
[Figure labels: physical nodes and storage servers; Mail, Bkup, HC, and AppXYZ virtual clusters; compute, data, and service nodes; system service daemons; Cloud OS agents]
Virtual Machine Management
• Objective
– Power Management
– Physical Machine Load Balancing
• Monitor runtime VM statistics
– Heuristic calculation to predict workloads
• Determine power down/up of machines
– Multi-dimensional bin packing (knapsack)
• CPU, network, disk
– VM migration algorithm
• Physical machine load balancing
– Migration of VM’s to other physical machine
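The bin-packing step above can be illustrated with a small sketch. It uses first-fit decreasing, a common heuristic for this kind of problem, which is an assumption here and not necessarily what any particular data center manager runs. The names `Host`, `consolidate`, and the normalized demand values are hypothetical, and every demand is expressed as a fraction of one physical machine's capacity.

```python
from dataclasses import dataclass, field

# Resource dimensions named on the slide: CPU, network, disk.
# Demands are fractions (0.0-1.0) of one physical machine's capacity.
DIMS = ("cpu", "net", "disk")


@dataclass
class Host:
    name: str
    used: dict = field(default_factory=lambda: {d: 0.0 for d in DIMS})
    vms: list = field(default_factory=list)

    def fits(self, demand: dict) -> bool:
        # A VM fits only if every dimension stays within capacity 1.0.
        return all(self.used[d] + demand[d] <= 1.0 + 1e-9 for d in DIMS)

    def place(self, vm_name: str, demand: dict) -> None:
        for d in DIMS:
            self.used[d] += demand[d]
        self.vms.append(vm_name)


def consolidate(vm_demands: dict) -> list:
    """Pack VMs onto hosts with first-fit decreasing (a heuristic, not optimal)."""
    for vm, demand in vm_demands.items():
        if any(demand[d] > 1.0 for d in DIMS):
            raise ValueError(f"{vm} does not fit on any single host")
    # Place the most demanding VMs first, ranked by their largest dimension.
    order = sorted(vm_demands, key=lambda v: max(vm_demands[v].values()), reverse=True)
    hosts = []
    for vm in order:
        demand = vm_demands[vm]
        target = next((h for h in hosts if h.fits(demand)), None)
        if target is None:
            # No powered-on host has room, so power up another one.
            target = Host(name=f"host{len(hosts)}")
            hosts.append(target)
        target.place(vm, demand)
    return hosts
```

Hosts that end up with no VMs in a new packing are candidates for power-down; moving VMs to match the new packing is where the migration algorithm comes in.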
Power
• To run servers • To run data center
Power (cont.)
• Uninterruptible Power Supply (UPS)
– Detects power failure
– Batteries (for short-term outage + switch) – (Diesel) Generator (long-term outage)
• Power Distribution Unit (PDU)
– Fancy socket w. power distribution and/or control
• Power usage breakdown
Power (cont.)
• Data centers major power users
– Common claim: ~4% of world electricity use
• Example: Facebook in Luleå (120MW)
– ~1BSEK (1’000’000’000 SEK) / year (list prices)
• Exponential growth of data center capacity & cheaper server hardware
– Power costs (will) dominate. Exponential power use?
• Cost breakdown (examples):
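As a rough check of the Luleå figure earlier on this slide, the yearly electricity bill can be estimated from the power rating. The 120 MW value is from the slide; the price of 1 SEK per kWh is an assumed round list price, and the facility is assumed to draw full power all year, so this is an upper-bound sketch.

```python
HOURS_PER_YEAR = 24 * 365


def annual_energy_cost(power_mw: float, price_per_kwh: float) -> float:
    """Yearly electricity cost for a facility drawing power_mw continuously."""
    kwh_per_year = power_mw * 1000 * HOURS_PER_YEAR
    return kwh_per_year * price_per_kwh


# 120 MW is from the slide; 1.0 SEK/kWh is an assumed list price.
lulea_cost_sek = annual_energy_cost(power_mw=120, price_per_kwh=1.0)
print(f"{lulea_cost_sek / 1e9:.2f} billion SEK per year")  # 1.05 billion SEK per year
```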
Cooling
• Keep heat-generating servers cool
• Computer Room Air Conditioners (CRAC)
– Like room air conditioners, for server rooms
• Very complex to model and design
– Airflow is 3D and non-linear
Cooling (cont.)
• Cooling by water
– Water cooling very close to servers
• Cooling by sea water
– Practically unlimited supply of cool water
• Example: Google Finland • Cooling by location
Energy-efficiency
• Not all power is used by servers… • Power Usage Effectiveness (PUE)
– Total facility power / power used for computing:
• Typical: 2.0
• State-of-the-art: ~1.2
• Quite a few variants of the definition
– Many to make data centers look good
• Others look at power source
– Carbon vs. solar vs. ….
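To make the PUE ratio concrete, here is a minimal sketch using the basic definition (total facility power divided by power used for computing). The function names and the 1000 kW example load are hypothetical.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """PUE = total facility power / power delivered to computing equipment."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def overhead_kw(it_equipment_kw: float, pue_value: float) -> float:
    """Power spent on cooling, power distribution, etc. for a given IT load."""
    return it_equipment_kw * (pue_value - 1.0)


# Hypothetical 1000 kW computing load at the two PUE levels from the slide.
for level in (2.0, 1.2):
    print(f"PUE {level}: {overhead_kw(1000, level):.0f} kW overhead")
```

At PUE 2.0 every watt of computing costs another watt of overhead; at 1.2 the overhead drops to about a fifth of a watt.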
Energy-efficiency (cont.)
• Non-linear server power usage
– Performance/power ratio changes with load
• High server utilization beneficial
– But not common by default
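Why the performance/power ratio changes with load can be seen from a simple model in which a server draws a fixed idle power plus a part that grows with utilization. The model and its numbers (100 W idle, 200 W peak) are illustrative assumptions, not measurements.

```python
def server_power_w(utilization: float, idle_w: float = 100.0, peak_w: float = 200.0) -> float:
    """Assumed model: fixed idle power plus a load-proportional part."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return idle_w + (peak_w - idle_w) * utilization


def relative_efficiency(utilization: float, idle_w: float = 100.0, peak_w: float = 200.0) -> float:
    """Work done per watt, normalized so that full load gives 1.0."""
    return utilization * peak_w / server_power_w(utilization, idle_w, peak_w)


for u in (0.1, 0.3, 0.5, 1.0):
    print(f"utilization {u:.0%}: efficiency {relative_efficiency(u):.2f}")
```

Under these assumptions a server at 10% load delivers less than a fifth of the work per watt that it delivers at full load, which is why high utilization pays off.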
Energy efficiency (cont.)
• 5k Google servers (6 months)
Energy efficiency (cont.)
• Consolidate workloads
– Power servers off – Or slow servers down
• Dynamic Voltage and Frequency Scaling (DVFS)
– Very hard to assess impact for bursty (rapidly changing) workloads
• Oscillations and un-wanted correlations • More next time…
• Consolidation requires software support
– Must be able to start/stop instances and autoscale
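The appeal of slowing servers down with DVFS follows from the standard approximation that dynamic CPU power is proportional to C * V^2 * f. The sketch below additionally assumes that supply voltage scales in proportion to frequency and ignores static (leakage) power; both are simplifications, and the function names are hypothetical.

```python
def relative_dynamic_power(freq_scale: float) -> float:
    """Dynamic power relative to full speed, assuming P ~ C * V^2 * f."""
    if not 0.0 < freq_scale <= 1.0:
        raise ValueError("freq_scale must be in (0, 1]")
    voltage_scale = freq_scale  # assumption: voltage tracks frequency
    return voltage_scale ** 2 * freq_scale


def relative_energy_per_task(freq_scale: float) -> float:
    """Energy for a fixed amount of work: power times the longer run time."""
    return relative_dynamic_power(freq_scale) / freq_scale


for s in (1.0, 0.8, 0.5):
    print(f"speed {s:.0%}: power {relative_dynamic_power(s):.2f}, "
          f"energy per task {relative_energy_per_task(s):.2f}")
```

The savings only materialize if the slower speed is still fast enough for the workload, which is exactly what is hard to judge when load changes rapidly.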
Costs for a Data Center
• How much performance is required?
– How many/fast servers, disks, networks etc.? – Size of data center, measured in watts
• How much power is needed?
– PUE?
– How much cooling? – Price of electricity?
• What additional physical equipment is needed?
– Redundancy of power and cooling
• Where to place it, given the above?
– Costs vs. location of users
– Very attractive to host data centers…
Cloud computing = cost cuts?
• Amazon EC2 examples
• Small VM, 3 years full use (est. server lifetime)
– Per h: $0.08*(24*365*3) -> ~$2100 (!) – Reserved: $300 + $0.013*(24*365*3) -> $640
• Rough estimate of costs for Amazon (according to ”data center as a computer”)
– Assume server cost 25% of total cost (TCO) – Standard $2k (list price) 1U server today:
• 32 cores + memory, disk etc. Total cost $8k • Estimate: Can deliver 64 Small VMs
– Revenue: $2100*64 ≈ $134k -> ~17 times the $8k total cost! – Amazon does not pay list prices
• 90% discount rumoured
• With 24/7 use, hourly prices are very high…
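The arithmetic on this slide can be reproduced as follows. All prices, the 64 VMs per server, and the $8k total cost are the slide's own example values; the function names are hypothetical. The ratio is computed against the $8k total cost, not the $2k server list price.

```python
HOURS_3_YEARS = 24 * 365 * 3  # 26,280 hours of continuous use


def on_demand_cost(hourly: float = 0.08, hours: int = HOURS_3_YEARS) -> float:
    """Cost of one small VM billed per hour for the full period."""
    return hourly * hours


def reserved_cost(upfront: float = 300.0, hourly: float = 0.013,
                  hours: int = HOURS_3_YEARS) -> float:
    """Cost of one reserved small VM: upfront fee plus reduced hourly rate."""
    return upfront + hourly * hours


def revenue_to_cost_ratio(revenue_per_vm: float, vms_per_server: int = 64,
                          total_cost: float = 8000.0) -> float:
    """Provider revenue per server divided by the estimated total cost."""
    return revenue_per_vm * vms_per_server / total_cost


print(f"on-demand: ${on_demand_cost():.0f}")
print(f"reserved:  ${reserved_cost():.0f}")
print(f"revenue / total cost: {revenue_to_cost_ratio(on_demand_cost()):.1f}x")
```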
Cloud cost life cycle
1. Develop service
– Run in-house for testing and very early use
2. Move to cloud-hosting
– To handle large scale-up of user base
3. Build own data center to cut hosting costs
– Once size of service is roughly known
– Unless major price cuts by IaaS providers, this will happen for more and more SaaS providers as server and data center costs drop…
Conclusions
• Data centers at warehouse scale
– More than just a group of servers – Holistic management perspective needed
• Standard solutions superior
– Off-the-shelf servers, networks, disks, etc. – Redundancy, scalability, etc. in software layer
Suggested reading
• ”The Datacenter as a Computer”
– Barroso & Hölzle (Google)
• Read (somewhat) carefully:
– Chapter 1, Chapter 3-5
• Focus on principles, ignore numbers (examples are a few years old...) • Skim:
– Chapter 2
• Overlaps texts from last lecture + Data management lecture
Next time….
• Thursday:
Data center #2: Autonomic management
– Data centers are large – Cloud services are complex – How to make these
• Configure themselves? • Optimize themselves? • Heal themselves?
• Delay project demos???
– From Thursday (31 May) to Monday (June 4)? – 3 hours, 13-16:
• 1h review + evaluation (me) • 2h presentation + demo (you)