Cloud Computing:
What a Project Manager
Needs to Know
Dr. Patrick D. Allen, PMP
Purpose
Provide Project Managers with the very
basics of the three primary types of Clouds
and Cloud Computing, and the questions
they should ask when Clouds and their
project intersect
Overview
“Storage Clouds”
“Computing as a Service” Clouds
Questions PMs should ask
“Data-Focused” Clouds
Relational Databases vs Clouds
Map-Reduce and Accumulo examples
Questions PMs should ask
General Cloud questions PMs should ask
What’s a Cloud?
Three primary definitions of Clouds presented today:
1. Storage Cloud (just stores data; provides memory)
2. Compute-power as a Service (VMs)
Infrastructure as a Service or
Platforms as a Service or
Software as a Service
3. A Data-focused Cloud that also runs on VMs
E.g. Hadoop Distributed File System (HDFS) and data
processing
PMs need to make sure everyone understands which
type is being discussed
If you think you’re discussing a different one, confusion results
First Type: Storage Cloud
Just gives you a place to store electronic data
Music, photos, scanned documents, back-ups
Can’t run any calculations or run programs on it
Can’t do Big Data calculations on it
Many Cloud Service Providers offer storage as
one of their options; others specialize in just
storage
Internet Service Providers, such as Comcast,
also provide Cloud Storage
Online gaming (like Steam) allows storing saved
games on clouds
2nd Type of Cloud: Computing as a Service
Instead of using your own computers, you use a
Third-Party’s computers at another location (e.g., AWS’s EC2)
Usually all same hardware with a variety of Virtual
Machine (VM) configurations to meet customer needs
When hardware dies, it is seamlessly replaced
All hardware and infrastructure and physical security
headaches are the responsibility of the Third Party
You’re responsible for secure comms to and from the
data stores and the security on the machines you use
You only pay for what you use (memory, computing
power or number of virtual machines used)
Great for surge-type activities, such as the census
that’s run every ten years, or new venture start-ups
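The pay-for-what-you-use point can be checked with simple arithmetic. The sketch below is illustrative only: every rate and price in it is a made-up placeholder, not any provider's actual pricing, and `cloud_cost` and `owned_cost` are hypothetical helper names.

```python
# Illustrative only: all rates and prices below are assumed placeholder
# values, not real prices from any cloud provider.

def cloud_cost(vm_count, hours, vm_rate_per_hour,
               stored_gb, months, gb_rate_per_month):
    """Pay-per-use cost: VM-hours consumed plus storage held."""
    compute = vm_count * hours * vm_rate_per_hour
    storage = stored_gb * months * gb_rate_per_month
    return compute + storage

def owned_cost(server_count, purchase_price, yearly_upkeep, years):
    """Cost of owning servers sized for the peak, whether busy or idle."""
    return server_count * (purchase_price + yearly_upkeep * years)

# Hypothetical surge workload: 100 machines for 30 days, once.
surge = cloud_cost(vm_count=100, hours=30 * 24, vm_rate_per_hour=0.10,
                   stored_gb=5000, months=1, gb_rate_per_month=0.02)

# Hypothetical alternative: buy 100 servers and keep them for 3 years.
owned = owned_cost(server_count=100, purchase_price=3000,
                   yearly_upkeep=500, years=3)

print(f"cloud surge cost:      ${surge:,.2f}")
print(f"owned peak-sized cost: ${owned:,.2f}")
```

With real quotes substituted for the placeholders, the same comparison shows whether a workload is "surge-type" enough for renting to beat owning.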
2nd Type: Questions PMs Should Ask – 1
What’s the cost per unit of data stored (cents per gigabyte)?
What’s the cost per VM used?
How secure or private is my data when I store it on a
third-party platform?
What security or privacy guarantees are provided?
Will the PII be adequately protected?
Can I test Cloud security before I put real data there?
Am I starting a new business with limited investment?
Would a Cloud be useful for my Continuity of
Operations (COOP) plans?
It depends. Do your employees already regularly
perform remote operations like teleworking? Do you have a re-routing plan to get them to the Cloud?
2nd Type: Questions PMs Should Ask – 2
Can you store classified data on a cloud?
In a properly secured, government-accredited private cloud: maybe
If you are planning to use a Third-Party service: maybe
As a minimum, use a virtual private cloud (e.g., AWS VPC)
And located entirely in the U.S. (not distributed world wide)
Probably need to limit access to selected personnel at the service provider site (like no foreign access in US Gov Cloud)
US-Gov-only Cloud important for data under export control
Need your security department’s approval, which includes your plan and vetting the provider
Probably need penetration testing before use, including checks against “side channel” attacks
Not sure if this is yet being used for more than unclassified but sensitive data
For either case, always get a cyber security expert to prepare a risk assessment, and for classified data, a proper accreditation
Process for Approval for U & SBU Data
FedRAMP is a new standardized approach to
security assessment, authorization and
security monitoring for cloud-based products
and services
FedRAMP is mandatory for federal agency
cloud deployments and service models at the
low and moderate risk impact levels
Ref: http://www.gsa.gov/portal/category/102371
Ref: Gloria Larkin, “Cybersecurity and FedRAMP: A Mandatory Combination,” The Business Monthly, Aug 2012
3rd Type: Data-Focused Cloud – Definitions
Huge Data: Petabytes or larger amounts of data
HDFS is the Hadoop Distributed File System (more on this later)
Relational Database: Think rows and columns, densely populated (like a spreadsheet)
Structured non-relational databases: Cloud-based structured data technologies like Accumulo and HBase running on HDFS
Can be densely or sparsely populated
Tend to use flexible labels of length three to six (more later)
Many different types of data that may have some overlapping elements, but not the same across all types of data
If put into rows and columns it would be a huge table only sparsely populated
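The "huge table only sparsely populated" point can be seen by forcing differently shaped records into one set of columns and counting the filled cells. This is a minimal sketch: the records and field names below are invented purely for illustration and do not come from any real data source.

```python
# Hypothetical records from different kinds of sources; each source
# carries its own fields, with only "name" in common.
records = [
    {"name": "John Smith", "blood_type": "O+"},
    {"name": "John Smith", "license_no": "D123", "height": "5'10\""},
    {"name": "Peter Parker", "friends": 250},
    {"name": "Bruce Wayne", "interests": "bats"},
]

# A single relational-style table needs one column per field seen anywhere.
columns = sorted({field for record in records for field in record})

# Build the wide table; fields a record lacks become empty (None) cells.
wide_table = [[record.get(col) for col in columns] for record in records]

filled = sum(cell is not None for row in wide_table for cell in row)
total = len(wide_table) * len(columns)

print(f"columns: {columns}")
print(f"filled cells: {filled} of {total}")
```

Each new source adds columns that are empty for every other source, so the share of filled cells keeps falling as more types of data are combined.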
Relational Database Example
Name             Address        Age  Height
John Smith       Washington DC  35   5’10”
Jane Doe         Baltimore      29   5’8”
Fred Flintstone  Rockville      55   4’10”
Tony D. Tiger    Battle Creek   67   6’2”
Elmer Fudd       DeForest       60   4’6”
Peter Parker     New York       28   5’5”
Bruce Wayne      Gotham         36   6’1”
Roger Rabbit     Fantasyland    41   4’0”
Peter Rabbit     Rural Address  118  1’1”
White Rabbit     Wonderland     135  1’11”
Find the Names of those of Age > 25 but < 60, and > 5’ tall
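The query on this slide is the kind of well-defined question a relational database answers directly. A minimal sketch using Python's built-in `sqlite3` module follows; the table name `people`, the column names, and the conversion of heights to inches are assumptions made for this illustration.

```python
import sqlite3

# Rows from the relational example; heights converted to inches
# (an assumption made here so they can be compared numerically).
people = [
    ("John Smith", "Washington DC", 35, 70),
    ("Jane Doe", "Baltimore", 29, 68),
    ("Fred Flintstone", "Rockville", 55, 58),
    ("Tony D. Tiger", "Battle Creek", 67, 74),
    ("Elmer Fudd", "DeForest", 60, 54),
    ("Peter Parker", "New York", 28, 65),
    ("Bruce Wayne", "Gotham", 36, 73),
    ("Roger Rabbit", "Fantasyland", 41, 48),
    ("Peter Rabbit", "Rural Address", 118, 13),
    ("White Rabbit", "Wonderland", 135, 23),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, address TEXT, age INTEGER, height_in INTEGER)"
)
conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?)", people)

# Age > 25 but < 60, and taller than 5 feet (60 inches).
query = """
    SELECT name FROM people
    WHERE age > 25 AND age < 60 AND height_in > 60
    ORDER BY rowid
"""
names = [row[0] for row in conn.execute(query)]
conn.close()

print(names)
```

The four names returned are the same four people who appear in the sparse data example.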
Sparse Data Example
Sources: Medical Records, Drivers Licenses, Facebook, Dating Service
Name          City           Age  Height
John Smith    Washington DC  35   5’10”
Jane Doe      Baltimore      29   5’8”
Peter Parker  New York       28   5’5”
Bruce Wayne   Gotham         36   6’1”
[Figure: John Smith, Peter Parker and Bruce Wayne recur across the sources]
Accumulo Data Example
ID   Col. Family  Col. Qualifier  Time        Security  Value
001  Personal     Name            31 Apr ‘12  PII       John Smith
001  Personal     Age             31 Apr ‘12  PII       35
001  Personal     Height          31 Apr ‘12  PII       5’ 10”
001  Address      City            31 Apr ‘12  PII       Wash DC
001  Address      Street          31 Apr ‘12  PII       K Street
001  Address      Number          31 Apr ‘12  PII       810
002  Personal     Name            31 Apr ‘12  PII       Peter Parker
002  Personal     Age             31 Apr ‘12  PII       28
002  Personal     Height          31 Apr ‘12  PII       5’ 5”
002  Address      City            31 Apr ‘12  PII       New York
002  Address      Street          31 Apr ‘12  PII       72nd Street
002  Address      Number          31 Apr ‘12  PII       145
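The layout above (one entry per cell, keyed by ID, column family and column qualifier, with a time and a security label) can be mimicked in plain Python to show how lookups work. This is a sketch of the data model only, not the Accumulo client API. The `Entry` tuple and `scan` function are hypothetical names, and treating the security label as a single string that must appear in the reader's authorizations is a simplifying assumption.

```python
from collections import namedtuple

# One entry per cell, mirroring the columns of the example table.
Entry = namedtuple("Entry", "row_id family qualifier timestamp visibility value")

TS = "31 Apr '12"  # timestamp string copied as-is from the example table

entries = [
    Entry("002", "Personal", "Name", TS, "PII", "Peter Parker"),
    Entry("002", "Personal", "Age", TS, "PII", "28"),
    Entry("002", "Personal", "Height", TS, "PII", "5' 5\""),
    Entry("002", "Address", "City", TS, "PII", "New York"),
    Entry("002", "Address", "Street", TS, "PII", "72nd Street"),
    Entry("002", "Address", "Number", TS, "PII", "145"),
    Entry("001", "Personal", "Name", TS, "PII", "John Smith"),
    Entry("001", "Personal", "Age", TS, "PII", "35"),
    Entry("001", "Personal", "Height", TS, "PII", "5' 10\""),
    Entry("001", "Address", "City", TS, "PII", "Wash DC"),
    Entry("001", "Address", "Street", TS, "PII", "K Street"),
    Entry("001", "Address", "Number", TS, "PII", "810"),
]

# Keep entries sorted by key so all cells of one row sit next to each other.
entries.sort(key=lambda e: (e.row_id, e.family, e.qualifier))

def scan(table, row_id, authorizations):
    """Return {"Family:Qualifier": value} for one row, skipping any cell
    whose security label is not among the reader's authorizations."""
    return {
        f"{e.family}:{e.qualifier}": e.value
        for e in table
        if e.row_id == row_id and e.visibility in authorizations
    }

print(scan(entries, "002", {"PII"}))  # reader cleared for PII
print(scan(entries, "002", set()))    # reader with no authorizations
```

A row only stores the cells it actually has, so a record with three fields and a record with three hundred can share one table without empty columns.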
3rd Type: Data-Focused Cloud
Also runs on a VM farm, but uses a “Hadoop” or “Sector”
file management system (Hadoop is most widely used)
What does a Hadoop Distributed File System (HDFS) do for you?
Lets you store huge amounts of non-relational data
Automatically parallelizes the computations
Automatically sorts results of “map” step
Handles all of the overhead associated with storing,
locating and processing your data
Allows for Map-Reduce programs and Direct Access
Table-based searches using Hadoop to be run
Can find relationships not easily visible in unstructured data
3rd Type: Map-Reduce Program Example
Find the number of people per household in census data
Input: distributed databases of Household (HH) census data
Map: count members of each HH (Key = HH Size, Value = #)
  e.g. HH001, 3; HH002, 6; HH003, 4; HH004, 3
Hadoop auto-sorts the Map output
Reduce (Σ): add # of HH with N members, N = 1 to 25 (Key = #, Value = Total)
  e.g. 1, 3.5 M; 2, 9.6 M; 3, 6.8 M; 4, 5.3 M
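The household example can be imitated on one machine to show what each step contributes. The sketch below is one reading of the diagram, not Hadoop code: the map step emits (household size, 1), a plain sort stands in for Hadoop's automatic sort, and the reduce step adds up the ones for each size. The household member lists are invented sample data, and `map_step`, `reduce_step` and `run_job` are hypothetical names.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample data: household ID -> list of members.
households = {
    "HH001": ["a", "b", "c"],
    "HH002": ["d", "e", "f", "g", "h", "i"],
    "HH003": ["j", "k", "l", "m"],
    "HH004": ["n", "o", "p"],
}

def map_step(household_id, members):
    """Map: count the members of one household and emit (size, 1)."""
    yield (len(members), 1)

def reduce_step(size, counts):
    """Reduce: add up how many households have this size."""
    return (size, sum(counts))

def run_job(data):
    # Map phase: each household is processed independently, which is
    # what lets a real cluster spread the work across many machines.
    intermediate = [
        pair
        for household_id, members in data.items()
        for pair in map_step(household_id, members)
    ]
    # Stand-in for the automatic sort between map and reduce.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct key (household size).
    return dict(
        reduce_step(size, (count for _, count in group))
        for size, group in groupby(intermediate, key=itemgetter(0))
    )

result = run_job(households)
print(result)
```

On real census data the reduce output would be the "N members, total households" pairs shown on the slide.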
3rd Type: Map-Reduce Pros and Cons
Map-Reduce programs are good when:
You have huge data sets
Your data can't be managed in a relational database
You are not sure what types of queries you will want to run
You want to summarize the results of independent processes that can be applied to data in parallel
Map-Reduce programs are not a good fit when:
Your data can fit within a relational database, AND
The queries you plan to run are fairly well-defined
THEN, if you can answer your questions with an existing relational database in a reasonable amount of time, why bother with the overhead of a cloud?
3rd Type: Questions PMs Should Ask – 1
Do I even need to use a Cloud?
If you have well-structured reasonable amounts of data, stick with a relational database UNLESS you just want the compute power on demand (2nd Type of Cloud presented)
If it is required by external authorities (like a customer), yes
Do I have a lot of "surge" events, where I only need to store and process large amounts of data periodically?
Then using a cloud makes sense
Do I need to know how to write a Map-Reduce program or an Accumulo Table to use a Cloud?
No, you can use pre-defined programs, OR you need someone who knows how to write new ones for you
Do I need to know how to design a Map-Reduce program?
No, but it helps so you can ask for realistic output from the Cloud and really leverage the Cloud to solve your data problems
3rd Type: Questions PMs Should Ask – 2
Do I have access to an existing Cloud I could use?
If it meets your requirements, third-party Clouds work
Check the “fine print” on the guarantees, and whether the recourse offered is sufficient to cover the cost of a failure to meet them
Have a security expert do a risk assessment before committing
Do I need to build my own instead?
If you have security, privacy or proprietary needs not met by an existing Cloud, might want to build your own
Consider the ongoing maintenance costs (may be primary rationale for moving to a Cloud)
General Cloud Questions for PMs
Where is the Cloud located? Can it be restricted to U.S.?
Who gets access to it?
How are the communications to/from the cloud secured?
How does it ingest its data?
How does it store its data?
How do they secure your data at rest?
How does it delete its data? Can you test that it’s gone?
Does it keep your data separate from other people's data?
Do you need/want a virtual private cloud instead?
How often is the hardware upgraded?
How many versions of VMs can you choose from?
Summary Observations
Cloud computing is here to stay
Many more projects in the future will encounter Clouds in
some way that will impact the project
Need to be aware of the strengths and limitations of
Clouds and whether they are appropriate for your project
You may not have a choice whether or not to use a Cloud
This briefing listed some of the basic questions you
should ask as appropriate to your project
Hopefully some of the mystery (and hype) of the Cloud
has been dispelled by this talk
It is useful to be able to design a Map-Reduce program so
your expectations of the output are realistic
Always do a cyber risk assessment on a Cloud you plan to use
Contact Info
Dr. Patrick D. Allen
Johns Hopkins University Applied Physics Lab
11100 Johns Hopkins Road
MS 21-N246
Laurel, MD 20723-6099
443-778-9915 v
443-778-3838 f
Back-up: Terminology Relationship
                        GOOGLE                    APACHE
File System             Google File System (GFS)  Hadoop Distributed File System (HDFS)
Map Reduce Environment  Map Reduce                Hadoop (Map Reduce)
Structured Data         Big Table                 Accumulo
Back-up: Sample Map Reduce Program
Map algorithm
Map (key: sourceURL, value: text) {
    // emit one (target, source) pair for every link found in the page text
    for each (targetURL in text)
        EmitIntermediate (targetURL, sourceURL);
}
Reduce algorithm
Reduce (key: targetURL, values: sourceURLs) {
    // collect every source that links to this target
    sourceList[] = empty;
    for each (u in sourceURLs)
        add u to sourceList[];
    Emit (targetURL, sourceList[]);
}
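For readers who want to run the idea, here is a minimal single-machine sketch of the same map and reduce steps in Python. It is not Hadoop code. The regular expression used to find links, the `example.com` pages, and the names `map_links`, `reduce_links` and `run_job` are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Simplifying assumption: a link is any whitespace-delimited http(s) URL.
LINK = re.compile(r"https?://\S+")

def map_links(source_url, text):
    """Map: emit (targetURL, sourceURL) for every link in the page text."""
    for target_url in LINK.findall(text):
        yield (target_url, source_url)

def reduce_links(target_url, source_urls):
    """Reduce: build the list of sources that point at one target."""
    return (target_url, sorted(source_urls))

def run_job(documents):
    # Group intermediate pairs by key; a real framework does this grouping
    # (and the sorting) between the map and reduce phases.
    grouped = defaultdict(list)
    for source_url, text in documents.items():
        for target_url, src in map_links(source_url, text):
            grouped[target_url].append(src)
    return dict(
        reduce_links(target, sources)
        for target, sources in sorted(grouped.items())
    )

# Hypothetical pages and the links they contain.
docs = {
    "http://example.com/1": "see http://example.com/a and http://example.com/b",
    "http://example.com/2": "see http://example.com/a and http://example.com/c",
}

index = run_job(docs)
for target, sources in index.items():
    print(target, "<-", sources)
```

The output is a reverse link index: for each target page, the list of pages that link to it.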
Back-up: Map Reduce Example 2
Map: one task per source document (Doc 1, Doc 2, ..., Doc 10^9)
  Find targets for source 1:    targetURL a – URL1, targetURL b – URL1
  Find targets for source 2:    targetURL a – URL2, targetURL c – URL2
  Find targets for source 10^9: targetURL b – URL10^9, targetURL c – URL10^9, targetURL d – URL10^9
Sort by targetURL:
  targetURL a – URL1, targetURL a – URL2
  targetURL b – URL1, targetURL b – URL10^9
  targetURL c – URL2, targetURL c – URL10^9
  targetURL d – URL10^9
Reduce: one task per target, producing the sorted targetURL – sourceURL list
  Create list for targetURL a
  Create list for targetURL b
  Create list for targetURL c
  Create list for targetURL d