Flexible Storage Allocation
A. L. Narasimha Reddy
Department of Electrical and Computer Engineering Texas A & M University
Students: Sukwoo Kang (now at IBM Almaden) John Garrison
Outline
Big Picture
Part I: Flexible Storage Allocation
– Introduction and Motivation
– Design of Virtual Allocation
– Evaluation
Part II: Data Distribution in Networked Storage Systems
– Introduction and Motivation
– Design of User-Optimal Data Migration
– Evaluation
Part III: Storage Management across diverse devices
Conclusion
Storage Allocation
Allocate entire storage space at the time of the file system creation
Storage space owned by one operating system cannot be used by another
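[Figure: static allocation across three file systems, Windows NT (NTFS), Linux (ext2), and AIX (JFS); labels show 30 GB, 50 GB, 70 GB, 50 GB, and 98 GB, with the actual allocations marked and one file system running out of space]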
Big Picture
Memory systems employ virtual memory for several reasons
Current storage systems lack such flexibility
Current file systems allocate storage statically at the time of their creation
– Storage allocation: Space on the disk is not allocated well across multiple file systems
File Systems with Virtual Allocation
When a file system is created with X GB,
– Allows the file system to be created with only Y GB, where Y << X
– Remaining space used as one common available pool
– As the file system grows, the storage space can be allocated on demand
[Figure: three file systems, Windows NT (NTFS), Linux (ext2), and AIX (JFS), under virtual allocation; labels show 30 GB, 50 GB, 98 GB, actual allocations of 10 GB and 10 GB, 60 GB, 40 GB, and a 100 GB common storage pool]
Our Approach to Design
[Figure: physical disk, addressed by physical block address]
Employ Allocate-on-write policy
– Storage space is allocated when the data is written
– Writes all data to disk sequentially based on the time at which data is written to the device
– Once data is written, data can be accessed from the same location, i.e., data is updated in-place
Allocate-on-write Policy
[Figure: physical disk with one extent allocated by a write at t = t’]
Storage space is allocated by the unit of the extent when the data is written
Extent is a group of file system blocks
– Fixed size
– Retain more spatial locality
– Reduce information that must be maintained
Allocate-on-write Policy
[Figure: physical disk with Extent 0 allocated by a write at t = t’ and Extent 1 allocated by a write at t = t’’ (where t’’ > t’)]
Data is written to disk sequentially based on write-time
– Further writes to the same data are updated in-place
– VA (Virtual Allocation) requires an additional data structure
Block Map
[Figure: physical disk with Extents 0, 1, and 2 allocated by writes at t = t’ and t = t’’ (where t’’ > t’), together with the block map]
Block map keeps a mapping between logical storage locations and real (physical) storage locations
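The allocate-on-write policy and the block map can be illustrated with a minimal sketch. It is not the prototype's code: the class name `VirtualAllocator`, the extent size of 8 blocks, and the dictionary-based map are all assumptions chosen for illustration.

```python
EXTENT_BLOCKS = 8  # assumed extent size, in file system blocks


class VirtualAllocator:
    """Toy allocate-on-write mapper (hypothetical, for illustration only)."""

    def __init__(self, physical_extents):
        self.physical_extents = physical_extents  # disk capacity, in extents
        self.block_map = {}   # logical extent number -> physical extent number
        self.next_free = 0    # next unallocated physical extent, in write-time order

    def write(self, logical_block):
        """Return the physical block for a write, allocating an extent on first touch."""
        logical_extent, offset = divmod(logical_block, EXTENT_BLOCKS)
        if logical_extent not in self.block_map:
            if self.next_free >= self.physical_extents:
                raise RuntimeError("physical disk full; expansion needed")
            # First write to this extent: place it at the next sequential location.
            self.block_map[logical_extent] = self.next_free
            self.next_free += 1
        # Later writes reuse the same mapping, so data is updated in-place.
        return self.block_map[logical_extent] * EXTENT_BLOCKS + offset

    def read(self, logical_block):
        """Return the physical block for a read, or None if never written."""
        logical_extent, offset = divmod(logical_block, EXTENT_BLOCKS)
        physical_extent = self.block_map.get(logical_extent)
        if physical_extent is None:
            return None
        return physical_extent * EXTENT_BLOCKS + offset
```

In this sketch, physical extents are handed out in the order in which logical extents are first written, regardless of their logical addresses.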
VA Metadata
[Figure: physical disk holding Extents 0, 1, and 2 and the VA metadata; the block map is hardened to the VA metadata area]
This block map is maintained in memory and regularly written to disk for hardening against system failures
On-disk Layout & Storage Expansion
[Figure: on-disk layout. The physical disk holds FS metadata, VA metadata, and Extents 0, 1, and 2; the virtual disk continues with Extents 3 to 7; the storage expansion threshold is marked]
Storage Expansion
When the capacity is exhausted or reaches storage expansion threshold, a physical disk can be expanded to other available storage resources
– File system unaware of the actual space allocation and expansion
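A minimal sketch of threshold-triggered expansion follows. The class name `ExpandablePool`, the 0.8 threshold in the usage example, and the idea of a list of spare disk sizes are assumptions for illustration; the slides do not give the actual mechanism.

```python
class ExpandablePool:
    """Toy model of a physical disk that grows from a list of spare resources."""

    def __init__(self, disk_extents, spare_disks, threshold=0.9):
        self.capacity = disk_extents          # current capacity, in extents
        self.spare_disks = list(spare_disks)  # sizes of other available storage
        self.threshold = threshold            # assumed expansion threshold (fraction)
        self.used = 0

    def _expand(self):
        # Add the next spare resource, if any, without involving the file system.
        if self.spare_disks:
            self.capacity += self.spare_disks.pop(0)

    def allocate_extent(self):
        # Expand when usage has reached the threshold (or the disk is exhausted).
        if self.used >= self.threshold * self.capacity:
            self._expand()
        if self.used >= self.capacity:
            raise RuntimeError("no physical space left")
        self.used += 1
```

The caller (standing in for the file system) only ever calls `allocate_extent`; it never sees the capacity change.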
Write Operation
[Figure: write path. Application write request → file system → buffer/page cache layer (page, acknowledgement) → block I/O layer (VA): search the VA block map, allocate a new extent and update the mapping information → Disk 0 (FS metadata, VA metadata, Extents 1 to 3); hardening]
Read Operation
[Figure: read path. Application read request → file system → buffer/page cache layer → block I/O layer (VA): search the VA block map → Disk 0 (FS metadata, VA metadata, Extents 1 to 3)]
Allocate-on-write vs. Other Work
Key difference from log-structured file systems (LFS)
– Only allocation is done at the end of the log
– Updates are done in-place after allocation
LVM still ties up storage at the time of file system creation
Design Issues
Extent-based Policy Example (with Ext2)
– I (inode), B (data block), V (VA block map)
– A → B (B is allocated to A)
File system-based Policy Example (with Ext3 ordered mode)
VA Metadata Hardening (File System Integrity)
– Must keep certain update ordering of VA metadata and FS (meta)data
Design Issues (cont.)
Extent Size
– Larger extent size: Reduce block map size, retain more spatial locality, cause data fragmentation
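The block map side of this trade-off can be made concrete with a small calculation. The 8-byte map entry and the 100 GiB capacity are assumptions for illustration, not figures from the prototype.

```python
def block_map_size(capacity_bytes, extent_bytes, entry_bytes=8):
    """Return (number of map entries, map size in bytes) for a given extent size.

    entry_bytes is an assumed size for one logical-to-physical mapping entry.
    """
    entries = -(-capacity_bytes // extent_bytes)  # integer ceiling division
    return entries, entries * entry_bytes


GIB = 2 ** 30
for extent_bytes in (4 * 1024, 64 * 1024, 1024 * 1024):
    entries, size = block_map_size(100 * GIB, extent_bytes)
    print(f"extent {extent_bytes // 1024:5d} KiB: {entries:10d} entries, {size} bytes")
```

Under these assumptions, moving from 4 KiB to 1 MiB extents shrinks the map by a factor of 256, at the cost of allocating a full 1 MiB extent even for a single small write.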
Reclaiming allocated storage space of deleted files
– Needed to continue to provide the benefits of virtual allocation
– Without reclamation, possible to turn virtual allocation into static allocation
Interaction with RAID
– RAID remaps blocks to physical devices to provide device characteristics
– VA remaps blocks for flexibility
– Need to resolve performance impact of VA’s extent size and RAID’s chunk size
Spatial Locality Observations & Issues
Metadata and data separation
Data clustering: Reduce seek distance
Multiple file systems
Data placement policy
– Allocate hot data in a high data rate region of the disk
– Allocate hot data in the middle of the partition
Implementation & Experimental Setup
Virtual allocation prototype
– Kernel module for Linux 2.4.22
– Employ a hash table in memory for speeding up VA lookups
Setup
– A 3GHz Pentium 4 processor, 1GB main memory
– Red Hat Linux 9 with a 2.4.22 kernel
– Ext2 file system and Ext3 file system
Workloads
– Bonnie++ (Large-file workload)
– Postmark (Small-file workload)
– TPC-C (Database workload)
VA Metadata Hardening
Compare EXT2 and VA-EXT2-EX
Compare EXT3 and VA-EXT3-EX, VA-EXT3-FS
Reclaiming Allocated Storage Space
Reclaim operation for deleted large files
How to keep track of deleted files?
– Employed stackable file system: Maintain duplicated block bitmap
– Alternatively, could employ “Life or Death at Block-Level” (OSDI’04) work
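One way the duplicated block bitmap could drive reclamation is sketched below. This is an assumption about how the pieces fit together, not the prototype's logic: the function name, the list-based bitmap, and the rule "reclaim an extent only when every block in it is free" are all hypothetical.

```python
def reclaimable_extents(block_bitmap, block_map, extent_blocks):
    """Return logical extents that are allocated in the VA block map but whose
    blocks are all marked free (0) in the duplicated file system bitmap."""
    free = []
    for logical_extent in sorted(block_map):
        start = logical_extent * extent_blocks
        blocks = block_bitmap[start:start + extent_blocks]
        if not any(blocks):
            free.append(logical_extent)
    return free


# Three allocated extents of 4 blocks each; only the middle one is entirely free.
bitmap = [1, 1, 0, 0,  0, 0, 0, 0,  0, 1, 0, 0]
print(reclaimable_extents(bitmap, {0: 0, 1: 1, 2: 2}, 4))
```

Extents returned by such a scan could be handed back to the common pool, which is what keeps virtual allocation from degrading into static allocation.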
VA with RAID-5
[Figure panels: large-file workload; small-file workload; large-file workload with NVRAM]
Used Ext2 with software RAID-5 + VA
NVRAM-X%: X% of total VA metadata size
Data Placement Policy (Postmark)
VA NORMAL partition: Same data rate across a partition
VA ZCAV partition: Hot data is placed in the high data rate region of a partition
VA-NORMAL: start allocation from the outer cylinders
VA-MIDDLE: start allocation from the middle of a partition
Multiple File Systems
VA-7GB: 2 x 3.5GB partition, 30% utilization
VA-32GB: 2 x 16GB partition, 80% utilization
Used Postmark
VA-HALF: The 2nd file system is created after 40% of the 1st file system is written
VA-FULL: The 2nd file system is created after 80% of the 1st file system is written
Real-World Deployment of Virtual Allocation
Prototype built
VA in Networked Storage Environment
Flexible allocation provided by VA leads to
– Balancing locality vs. load balance issues
Part II: Data Distribution
Locality-based approach
– Use data migration (e.g. HP AutoRAID)
– Employ “hot” data migration from slower device (remote disk) to faster device (local disk)
Load balancing-based approach (Striping)
[Figure legend: hot data, cold data]
User-Optimal Data Migration
Locality is exploited first
– Data is migrated from Disk B to Disk A
Load balancing is also considered
– If the load on Disk A is too high, data is migrated from Disk A to Disk B
Migration Decision Issues
Where to migrate: Use I/O request response time
When to migrate: Migration threshold
– Initiate migration from Disk A to Disk B only when
How to migrate: Limit number of concurrent migrations (Migration token)
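The three decisions above can be sketched together as follows. The exact threshold condition is not recoverable from the slide text, so the inequality used here (source response time exceeds target response time by more than a fixed margin) is a stand-in assumption, as are the class and parameter names.

```python
class MigrationPlanner:
    """Toy migration gate: response-time comparison, threshold, and tokens."""

    def __init__(self, threshold_ms, tokens):
        self.threshold_ms = threshold_ms  # assumed migration threshold
        self.tokens = tokens              # limit on concurrent migrations

    def try_start(self, source_rt_ms, target_rt_ms):
        """Return True and take a token if a migration may start now."""
        # Assumed condition: migrate only when the measured I/O response time
        # on the source exceeds the target's by more than the threshold.
        if source_rt_ms - target_rt_ms <= self.threshold_ms:
            return False
        if self.tokens == 0:
            return False  # too many migrations already in flight
        self.tokens -= 1
        return True

    def finish(self):
        """Return the token once a migration completes."""
        self.tokens += 1
```

The threshold suppresses migrations whose expected benefit is small, and the token count bounds how much bandwidth migration itself can consume.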
Design Issues
Allocation policy
– Striping with user-optimal migration: will improve data access locality
– Sequential allocation with user-optimal migration: will improve load balancing
Multi-user environment
– Each user migrates data in a user-selfish manner
– Migrations will tend to improve the performance of all users over longer periods of time
Evaluation
Implemented as a kernel block device driver
Evaluated it using SPECsfs benchmark
Configuration
SPECsfs Performance Curve
Multi-User
Single-User Environment
Striping with user-optimal migration
Seq. allocation with user-optimal migration
Configuration: (Allocation Policy)-(Migration Policy)
– STR (Striping), SEQ (Seq. Alloc.), NOMIG (No migration), MIG (User-Optimal migration)
Single-User Environment (cont.)
Comparison between migration systems
– Migration based on locality: hot data (remote → local), cold data (local → remote)
Multi-User Environment - Striping
Server A: Load from 100 to 700
Server B: Load from 50 to 350
Multi-User Environment – Seq. Allocation
Server A: Load from 100 to 1100
Server B: Load from 30 to 480
Storage Management Across Diverse Devices
Flash storage becoming widely available
– More expensive than hard drives
– Faster random accesses
– Low power consumption
In Laptops now
In hybrid storage systems soon
Manage data across Different Devices
– Match application needs to device characteristics
– Optimize for performance, power consumption
Motivation
VFS Allows many file systems underneath
VFS maintains 1 to 1 mapping from namespace to storage
Can we provide different storage options for different files for a single user?
– /user1/file1 → storage system 1, /user2/file2 → storage system 2…
Normal File System Architecture
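[Figure: normal file system architecture. Applications (Calc, Impress, Writer, WinAmp) run in user space; in the kernel, VFS sits above Ext2 (/user1/*: /user1/file1, /user1/file2) and FAT32 (/user2/*: /user2/file3, /user2/file4), stored on a magnetic disk and a flash drive]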
Umbrella File System
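[Figure: Umbrella File System architecture. Applications (Calc, Impress, Writer, WinAmp) run in user space; in the kernel, UmbrellaFS sits below VFS and above Ext3, Ext2, and FAT32, which are stored on an encrypted magnetic disk, a magnetic disk, and a flash drive. The user sees /user1/file1 to /user1/file4; the underlying paths are /FS1/user1/file3, /FS2/user1/file1, /FS2/user1/file2, and /FS3/user1/file4]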
Example Data Organization
User View
– /usr
– /usr/dir1
– /usr/dir1/foo.avi, /usr/dir1/foo.txt, /usr/dir1/foo.jpg
Underlying data organization
– /media/usr, /media/usr/dir1, /media/usr/dir1/foo.avi
– /text/usr, /text/usr/dir1, /text/usr/dir1/foo.txt
– /images/usr, /images/usr/dir1, /images/usr/dir1/foo.jpg
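The mapping from the user view to the underlying organization in this example can be sketched as a simple lookup. The extension-to-branch table mirrors the paths shown above; the function name and the use of a dictionary are assumptions for illustration, not the UmbrellaFS implementation.

```python
import posixpath

# Branch chosen per file extension, taken from the example paths above.
BRANCH_BY_EXTENSION = {".avi": "/media", ".txt": "/text", ".jpg": "/images"}


def underlying_path(user_path):
    """Map a user-visible path to its path under the selected branch,
    or return None if no branch is configured for the extension."""
    extension = posixpath.splitext(user_path)[1]
    branch = BRANCH_BY_EXTENSION.get(extension)
    if branch is None:
        return None
    # The directory structure is replicated under each branch.
    return branch + user_path


for path in ("/usr/dir1/foo.avi", "/usr/dir1/foo.txt", "/usr/dir1/foo.jpg"):
    print(path, "->", underlying_path(path))
```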
Motivation - Policy-Based Storage
User or System administrator Choice
– Allow different types of files on different devices
– Reliability, performance, power consumption
Layered Architecture
– Leverage benefits of underlying file systems
– Map applications to file systems and underlying storage
Policy decisions can depend on namespace and metadata
– Example: Files not touched in a week → slow storage system
Rules Structure
Provided at mount time
User specified
Based on inode values (metadata) and filenames (namespace)
Provides array of branches
Umbrella File System
Sits under VFS to enforce policy
Policy enforced at open and close times
Policy also enforced periodically (less often)
UmbrellaFS acts as a “router” for files
– Not only based on namespace, but also metadata
Inode Rules Structure
Rule  Inode/Filename  Field               Match  Value                      Branch
1     Inode           file permissions    =      Read Only                  /fs1, /fs2
2     Filename        n/a                 n/a    n/a                        n/a
3     Inode           file creation time  >=     8:00 am, August 3rd, 2007  /fs2
4     Inode           file length         <      20 KB                      /fs3
…
Inode Rules
Provide in order of precedence
First match
Compare inode value to rule
– At file creation some inode values indeterminate – Pass over those rules
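First-match evaluation with indeterminate values can be sketched as below. The rule values come from the Inode Rules Structure table; the tuple layout, the field names, the interpretation of 20 KB as 20 * 1024 bytes, and the default branch are assumptions for illustration.

```python
import operator
from datetime import datetime

# (field, comparison, value, branches), listed in order of precedence.
RULES = [
    ("permissions", operator.eq, "read-only", ["/fs1", "/fs2"]),
    ("ctime", operator.ge, datetime(2007, 8, 3, 8, 0), ["/fs2"]),
    ("length", operator.lt, 20 * 1024, ["/fs3"]),
]


def select_branches(inode, rules=RULES, default=("/fs1",)):
    """Return the branches of the first rule the inode matches.

    Fields that are missing or None (indeterminate, e.g. at file creation)
    cause that rule to be passed over. The default branch is hypothetical.
    """
    for field, compare, value, branches in rules:
        actual = inode.get(field)
        if actual is None:
            continue  # value not known yet: skip this rule
        if compare(actual, value):
            return list(branches)
    return list(default)
```

For instance, a newly created file whose permissions and creation time are not yet set would skip the first two rules and be judged on length alone.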
Filename Rules Structure
Rule  Match String       Branch
1     /*.avi             /fs2, /fs1
2     /home/*.txt        /fs1
3     /home/jgarrison/*  /fs3
…
Filename Rules
Once the first filename rule is triggered, all filename rules are checked
Similar to longest prefix matching
Double index based on
– Path matching
– Filename matching
Example:
– Rules: /home/*/*.bar, /home/jgarrison/foo.bar – File: /home/jgarrison/foo.bar
– File matches second rule more closely (3 path length and 7 characters of file name vs. 3 path length and 4 characters of file name)
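The double index in this example can be sketched as a two-part score: the number of path components, then the number of literal (non-wildcard) characters in the filename part of the rule. The component-by-component matching with `fnmatch` and the requirement that rule and path have the same depth are simplifying assumptions, not the documented UmbrellaFS semantics.

```python
from fnmatch import fnmatchcase


def score(rule, path):
    """Return (path length, literal filename characters) if rule matches path,
    otherwise None. Matching is done one path component at a time."""
    rule_parts = rule.strip("/").split("/")
    path_parts = path.strip("/").split("/")
    if len(rule_parts) != len(path_parts):
        return None
    if not all(fnmatchcase(p, r) for r, p in zip(rule_parts, path_parts)):
        return None
    literal_chars = sum(1 for ch in rule_parts[-1] if ch not in "*?")
    return (len(rule_parts), literal_chars)


def best_rule(rules, path):
    """Check every rule and return the closest match, or None."""
    scored = [(score(rule, path), rule) for rule in rules]
    scored = [(s, rule) for s, rule in scored if s is not None]
    return max(scored)[1] if scored else None


rules = ["/home/*/*.bar", "/home/jgarrison/foo.bar"]
print(best_rule(rules, "/home/jgarrison/foo.bar"))
```

With these definitions the two rules score (3, 4) and (3, 7) for the example file, so the second rule is selected, in line with the comparison described above.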
Evaluation
Overhead
– Throughput
– CPU Limited
– I/O Limited
Example Improvement
UmbrellaFS Overhead
[Figure: Bonnie Read Overhead. Throughput (MB/s) against number of rules (1, 2, 4, 8, 16, 32) for inode rules and filename rules, with plain Ext2 as the baseline]
CPU Limited Benchmarks
I/O Limited Benchmarks
Flash vs. RAID5 Read Performance
Flash vs. RAID5 Write Performance
[Figure: Write Performance. Throughput (MB/s) against file size (kB, from 1 to 10000) for RAID 5 and Flash SSD]
Flash and Disk Hybrid System
Disks with Encryption hardware
[Figure: Encryption Example. Time (s) for partial encryption vs. full encryption]
Conclusion
Virtual allocation allows flexibility
– Improve the flexibility of managing storage across multiple file systems/platforms
Enabled user-optimal migration
– Balance disk access locality and load balance automatically and transparently
– Adapt to changes of workloads and loads in each storage device