CSE-E5430 Scalable Cloud Computing Lecture 7

(1)

CSE-E5430 Scalable Cloud Computing

Lecture 7

Keijo Heljanko

Department of Computer Science School of Science

Aalto University [email protected]

(2)

CSE-E5430 Scalable Cloud Computing (5 cr) 2/34

Flash Storage

I Currently one of the trends is the introduction of Flash memory SSDs (solid state disks) are replacing hard disks in many applications

I Flash capacity per Euro is increasing faster than hard disk capacity per Euro

I Random SSD read (and often also write) IOPS are more than 100 × those of high end hard disk read and write IOPS

(3)

Flash Storage (cnt.)

I Sequential read and write speeds are faster than those of the best hard disks

I As SSDs have no moving parts they are fail quite a bit less frequently than hard disks

I As such SSDs are a very good match for typical client laptop and desktop usage patterns

(4)

Flash Storage

I Flash has both strengths and limitations compared to hard disks, they are not a direct hard disk replacement for all server workloads

I Especially the write endurance of SSDs (the amount of data that the SSD is specified to write reliably during its lifetime) is quite often very small compared to hard disks

I For write intensive server workloads hard disks might still be the only economically viable option due to SSDs having to be replaced once write endurance is exhausted

(5)

Flash Organization

I A flash chip is organized in pages of some KBs in size (e.g., 2KB)

I A page read can load the data of a page into buffers that can be quickly randomly accessed

I A page write can only flip bits of the page from 0 to 1

I In order to change the bits from 1 to 0 and erase operation operating on a large number of blocks needs to be

performed

I An erase operation works on blocks that consist of several flash pages at the time (e.g., 128KB of data)

(6)

Flash Types

I Flash memory comes in several flavors, the main ones being: triple-level cell TLC (cheapest), multi-level cell MLC (cheap), and single-level cell SLC (expensive)

I Flash blocks can be erased: around 1000 times (TLC flash), 1000-10000 times (MLC flash), or 100000-1000000 times (SLC flash)

(7)

Flash Wear Levelling Algorithms

I Highly sophisticated algorithms are used for wear leveling, where the goal is to write each flash cell as evenly as possible

I When flash disk becomes almost full, the free space can become fragmented which can result in performance loss on writes as the SSD moves blocks around to make space for the written data

I Flash disks often use “spare capacity” set aside to make the fragmentation problems less severe

(8)

Sustained Flash Write Performance

I Because of the write leveling algoritms, the sustained Flash memory write speeds can sometimes be compromised: http:

//www.xssist.com/blog/[SSD]_Comparison_of_ Intel_X25-E_G1_vs_Intel_X25-M_G2.htm

(9)

Sustained Flash Write Performance (cnt.)

I Thankfully performance consistency has improved on newer SSD designs but long benchmarking is needed: http://hexus.net/tech/reviews/storage/ 57593-intel-ssd-dc-s3700-series-800gb/

(10)

Sustained Flash Write Performance (cnt.)

I Benchmarking another set of SSDs from:

http://www.storagereview.com/samsung_ssd_ 840_pro_review

(11)

Optimizing for Flash

I One should minimize the number of bytes written to Flash

I _{Less bytes written means less wear and longer lifetime}

expectancy of Flash

I _{Less bytes written means less frequent slow block erases,}

and this improves the overall Flash IOPS

I Heavy sequential write workloads (e.g., database logs) can be efficiently handled by arrays of hard disks without any problems with write endurance

I Operating systems support such as the TRIM command which tells to the SSD which data blocks can be safely discarded is very useful

(12)

Future of Storage Technologies

A mantra from Jim Gray of Microsoft in his 2006 presentation http://research.microsoft.com/en-us/um/people/ gray/talks/Flash_is_Good.ppt:

I Tape is Dead

I Disk is Tape

I Flash is Disk

(13)

Jim Gray on Disks (in 2006)

I Disk are cheap per capacity

I Sequential access full disk reads of take hours

I Random access full disk reads take weeks

I Thus most of disk should be treated as a cold-storage archive

(14)

Jim Gray on Flash (in 2006)

I Lots of IOPS

I Expensive compared to disks (but improving)

I Limited write endurance

(15)

Jim Gray on RAM (2006 numbers)

I Flash/disk is 100000-1000000 cycles away from the CPU

I RAM is 100 cycles away from the CPU

I Thus Jim Gray concludes that main memory databases are going to be common

I Also opens possibilities for new memory technologies between RAM and Flash, see e.g., Intel and Micron 3D X-point:

(16)

Some Rule of Thumb Numbers for Flash

For a single random access read, the following rules of thumb numbers apply:

I Main memory reference: 100 ns

I Flash drive access: 100 000 ns (1000x slower than memory)

(17)

Tape is Dead, Disk is Tape, RAM locality is King

(18)

Tape is Dead, Disk is Tape, RAM locality is King

I RAM (and SSDs) are radically faster than HDDs: One should use RAM/SSDs whenever possible

I RAM is roughly the same price as HDDs were a decade earlier

I _{Workloads that were viable with hard disks a decade ago}

are now viable in RAM

I _{One should only use hard disk based storage for datasets}

that are not yet economically viable in RAM (or SSD)

I The Big Data applications (HDD based massive storage)

should consist of applications that were not economically feasible a decade ago using HDDs

(19)

RAMCloud

John K. Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazires, Subhasish Mitra, Aravind Narayanan, Diego Ongaro, Guru M. Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, Ryan Stutsman:The case for RAMCloud. Commun. ACM 54(7): 121-130 (2011).

I Facebook in 2009 has been running in 2009 with enough RAM to fit 75% of their dataset (not counting images or video) in RAM

I If they would use enough RAM cache to hit 99% of the dataset, the random disk seek latency would still kill their average latency performance: (99% × 100 ns + 1% × 10 000 000 ns) = 100099 ns, which is way more that 100ns!

(20)

RAMCloud - Discussing the Numbers

I With Flash disks and 99% cache hit rate we get latencies: (99% × 100 ns + 1% × 100 000 ns) = 1099 ns, which is still 10x more that 100ns.

I The above concerns only average latency, not throughput, but similar reasoning also applies to aggregate IOPS numbers of RAM, Flash, and Disk

I If a system can have enough memory to have 100% of the working set in RAM, instead of Flash / Hard disk, way more IOPS can be served with the same number of servers.

(21)

RAMCloud - Limiting Factors

I The main problem in RAMcloud style designs to guarantee data persistence in case of power failure + quick recovery in case of server crash / power outage

I The second problem is to have enough low latency networking hardware and OS to keep RAM busy

(22)

RAM vs Flash vs Disk

The RAMCloud paper has the following diagram adapted from the paper: Andersen, D., Franklin, J., Kaminsky, M., et al.,

FAWN: A Fast Array of Wimpy Nodes, Proc. 22nd Symposium on Operating Systems Principles, 2009:

(23)

Cloud Computing - Definition by NIST (recap)

I The NIST Definition of Cloud Computing, SP 800-145, Sept. 2011

http://csrc.nist.gov/publications/nistpubs/ 800-145/SP800-145.pdf

I _{Cloud computing is a model for enabling ubiquitous,}

convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed offive essential characteristics,three service models, andfour deployment models.

(24)

Cloud Computing - Essential characteristics

I On-demand self-service

I _{Automated self-provisioning of services such as computing}

and storage I Broad network access

I _{Wide variety of clients accessing the cloud services over}

standard networks I Resource pooling

I Multi-tenancy of several customers on shared resources,

allows high utilization ratios I Rapid elasticity

I _{Computing capacity can be quickly scaled both up and}

down on customer demand I Measured Service

(25)

Cloud Computing - Service Models

I Software as a Service (SaaS)

I Examples: Salesforce.com, Google Apps for Business

I Platform as a Service (PaaS)

I _{Examples: Google App Engine, Heroku.com}

I Infrastructure as a Service (IaaS)

(26)

Cloud Computing - Deployment Models

I Private cloud

I _{Cloud operated by a single organization for its internal use}

I Community cloud

I _{Cloud provisioned for a community of users with shared}

concerns I Public cloud

I _{Cloud for open use by anyone from the general public}

I Hybrid cloud

I Composition of several of the above but still enabling data

(27)

Cloud and Open Source

I What kinds of open source projects are supporting the cloud in addition to the usual suspects such as Linux?

I Quite often successful open source projects start as clones of existing proprietary systems, so it makes sense to take a look at the proprietary industry leaders, and figuring out the open source alternatives to their software stacks

(28)

What is Inside Amazon IaaS Stack?

I Amazon Web Services is currently the leading commercial IaaS provider

I Many successful open source projects have started as clones of proprietary software

I What kinds of components are inside the Amazon Web Services IaaS stack?

(29)

Amazon Web Services - Top Level Categories

I Compute

I Content Delivery

I Database

I Deployment & Management

I E-Commerce

I Industry-specific Clouds

I Messaging

I Monitoring

I Networking

I Payments & Billing

I Storage

(30)

Amazon Compute and Open Source

I Some categories will have open source replacements, and many are currently under heavy development:

I Compute, virtual machines/Amazon EC2:

I _{OpenStack Compute}_{(http://www.openstack.com/)} I _{Open Nebula}_{(http://opennebula.org/)}

I _Eucalyptus_{(http://www.eucalyptus.com/)} I _{Currently OpenStack has the}_{largest momentum}_{, while}

Open Nebula is the most stable solution

I Compute, scalable batch processing, Elastic MapReduce (Google MapReduceclone):

I Based ofApache Hadoop

(http://hadoop.apache.org/)

I _{Also Microsoft has integrated Hadoop with Windows Azure}

cloud platform as well as its SQL server in the HDInsight offering: http://azure.microsoft.com/fi-fi/ services/hdinsight/

(31)

Amazon Storage and Open Source

I Storage, local network block storage/Amazon EBS:

I _Sheepdog_{project (http://www.osrg.net/sheepdog/)} I _{Ceph Rados}

(http://ceph.newdream.net/category/rados/) I Storage, global network object storage/Amazon S3:

I OpenStack Object Storage

(32)

Open Sourcing the Datacenter

I Open sourced datacenter design: Open Compute

(http://opencompute.org/) from Facebook

I _{Datacenter design based on heavy use of outside air}

cooling

I _{Custom server designs for Intel and AMD processors} I _{Designs of electrical systems and distributed UPSs} I _{All designs fully open sourced}

(33)

Google’s Infrastructure

I Google is the leading SaaS cloud provider, what new components do they use, and what are the open source replacements?

I _{Google Filesystem GFS (distributed filesystem) -}_Hadoop

Distributed Filesystem HDFS

I _{Google MapReduce (distributed batch processing)}

-Hadoop MapReduce - Apache Spark

I _{Google Bigtable / Megastore / Dremel (distributed}

database) -Hadoop Database (HBase)/Apache Cassandra/Apache Hive/Cloudera Impala-Spark SQL-Facebook Presto

I Google Chubby (distributed coordination service) /Apache

(34)

Google’s Infrastructure (cnt.)

I Many components have open source replacements developed by Google’s competitors: Yahoo!, Facebook, Twitter, . . .

I Google’s systems implement amassively distributed fault tolerant PaaS infrastructurefor the SaaS solutions from Google

I Runs on hardware made from commodity components

I Major focus on automated fault tolerance an multi-tenancy tominimize the number of administrators/computer

I Google has the“luxury of embarrassing parallelism”: Running a single word processing application is inherently sequential but running a million word processing