Microsoft Jim Gray Systems Lab & University of Wisconsin, Madison1/20
Turbocharging the DBMS Buffer Pool using an SSD
Jaeyoung Do, Donghui Zhang, Jignesh M. Patel, David J. DeWitt, Jeffrey F. Naughton, Alan Halverson
Memory Hierarchy
For over three decades: Cache / DRAM / Disk (HDD).
Now, a disruptive change: where does the SSD fit in the hierarchy?
SSDs offer fast random I/Os, but are expensive.
SSD wisdom:
- Store hot data.
- Store data with random-I/O access.
Take Home Message
• Use an SSD to extend the buffer pool.
• Implemented in Microsoft SQL Server 2008 R2.
• Evaluated with TPC-C, TPC-E, and TPC-H.
• Up to 9X speedup.
Prior Art
• [Holloway09] A. L. Holloway. Chapter 4: Extending the Buffer Pool with a Solid State Disk. In Adapting Database Storage for New Hardware, UW-Madison Ph.D. thesis, 2009.
• [KV09] Koltsidas and Viglas. The Case for Flash-Aware Multi-Level Caching. University of Edinburgh Technical Report, 2009.
• [KVSZ10] B. M. Khessib, K. Vaid, S. Sankar, and C. Zhang. Using Solid State Drives as a Mid-Tier Cache in Enterprise Database OLTP Applications. TPCTC'10.
• [CMB+10] M. Canim, G. A. Mihaila, B. Bhattacharjee, K. A. Ross, and C. A. Lang. SSD Bufferpool Extensions for Database Systems. In VLDB'10. (State of the art.)
Research Issues
• Page flow
• SSD admission policy
• SSD replacement policy
• Implications for checkpointing
Implemented Designs
• Temperature-Aware Caching (TAC)
• Dual-Write (DW)
• Lazy-Cleaning (LC)
Page Flow
BP operations: read → evict → read → modify → evict
TAC writes a clean page to the SSD right after reading it from the disk.
[Diagram: three panels (TAC, Dual-Write, Lazy-Cleaning), each showing a clean page (C) moving among the disk, the SSD BP, and the memory buffer pool]
Page Flow
BP operations: read → evict → read → modify → evict
DW/LC write a clean page to the SSD upon eviction from the buffer pool.
[Diagram: three panels (TAC, Dual-Write, Lazy-Cleaning), each showing a clean page (C) moving among the disk, the SSD BP, and the memory buffer pool]
Page Flow
BP operations: read → evict → read → modify → evict
Reading a page from the SSD works the same in all three designs.
[Diagram: three panels (TAC, Dual-Write, Lazy-Cleaning), each showing a clean page (C) read from the SSD BP into the memory buffer pool]
Page Flow
BP operations: read → evict → read → modify → evict
Upon dirtying a page, TAC does not reclaim the SSD frame.
[Diagram: three panels (TAC, Dual-Write, Lazy-Cleaning), showing dirty pages (D) in the buffer pool and invalidated copies (I) on the SSD]
Page Flow
BP operations: read → evict → read → modify → evict
Upon evicting a dirty page:
- TAC and DW are write-through (the dirty page is written to the disk);
- LC is write-back (the dirty page is written to the SSD only; a lazy-cleaning process later writes it to the disk).
[Diagram: three panels (TAC, Dual-Write, Lazy-Cleaning), showing the evicted dirty page going to the disk (TAC/DW) or to the SSD BP (LC)]
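The eviction behavior above can be sketched in a few lines. This is a minimal toy model, not SQL Server's implementation; the class and function names are illustrative assumptions.

```python
# Toy sketch of how the three designs handle eviction from the memory
# buffer pool. Not SQL Server code; names and structure are illustrative.

class Page:
    def __init__(self, pid, dirty=False):
        self.pid = pid
        self.dirty = dirty

class SSDBufferPool:
    """Toy SSD buffer pool: maps page id -> 'clean' or 'dirty' (LC only)."""
    def __init__(self):
        self.pages = {}

    def write(self, page, dirty=False):
        self.pages[page.pid] = "dirty" if dirty else "clean"

def evict(page, design, ssd, disk_writes):
    """Handle eviction of a page from the memory buffer pool."""
    if page.dirty:
        if design in ("TAC", "DW"):
            # Write-through: the dirty page goes to the disk,
            # and the SSD copy (if any) is refreshed as clean.
            disk_writes.append(page.pid)
            ssd.write(page)
        else:  # "LC"
            # Write-back: the page goes to the SSD only; the
            # lazy-cleaning process flushes it to disk later.
            ssd.write(page, dirty=True)
    else:
        if design in ("DW", "LC"):
            # DW/LC admit clean pages to the SSD on eviction.
            ssd.write(page)
        # TAC already copied the page to the SSD when it was read.

def lazy_clean(ssd, disk_writes):
    """LC's background cleaner: flush dirty SSD pages to the disk."""
    for pid, state in list(ssd.pages.items()):
        if state == "dirty":
            disk_writes.append(pid)
            ssd.pages[pid] = "clean"
```

In this model, a TAC/DW eviction of a dirty page always costs a disk write, while LC defers that write to the cleaner, which is why LC helps most when dirty SSD pages are re-referenced before being cleaned.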
SSD Admission/Replacement Policies
• TAC
  – Admission: if the page is warmer than the coldest SSD page.
  – Replacement: the coldest page.
• DW/LC
  – Admission: if the page was loaded from disk using a random I/O.
  – Replacement: LRU-2.
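The two policy pairs can be sketched as follows. This is an illustrative sketch, not the paper's actual data structures; the temperature values and reference-time lists are assumed inputs.

```python
# Illustrative sketches of the admission/replacement policies.
# Temperatures and reference histories are assumed, simplified inputs.

def tac_admit(page_temp, ssd_temps, capacity):
    """TAC: admit if the SSD has a free frame, or if the page is warmer
    than the coldest SSD-resident page (which is then replaced).
    Returns (admit?, victim page id or None)."""
    if len(ssd_temps) < capacity:
        return True, None
    coldest = min(ssd_temps, key=ssd_temps.get)
    if page_temp > ssd_temps[coldest]:
        return True, coldest  # replace the coldest page
    return False, None

def lru2_victim(history):
    """DW/LC replacement: LRU-2 evicts the page whose second-most-recent
    reference is oldest. `history` maps page id -> list of reference
    times; pages referenced fewer than twice are evicted first."""
    def penultimate(refs):
        return refs[-2] if len(refs) >= 2 else float("-inf")
    return min(history, key=lambda pid: penultimate(history[pid]))
```

Note the contrast: TAC ranks pages by an explicit temperature, while DW/LC fall back to a recency-based heuristic (LRU-2) and filter admissions by I/O type instead.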
Implications for Checkpointing
• TAC/DW
  – No change, because every page in the SSD is clean.
• LC
  – Requires changes to handle the dirty pages in the SSD.
Experimental Setup
Machine: HP ProLiant DL180 G6 server
Processor: Intel® Xeon® L5520 2.27GHz (dual quad-core)
Memory: 20GB
Disks: 8X SATA 7200RPM 1TB
SSD: 140GB Fusion ioDrive 160 SLC
OS: Windows Server 2008 R2
TPC-C
[Chart: speedup relative to noSSD for TAC, DW, and LC, at database sizes of 100GB, 200GB, and 400GB]
Q: Why is LC so good?
A: Because TPC-C is update-intensive. In LC, dirty pages in the SSD are frequently re-referenced: 83% of the SSD references are to dirty SSD pages.
LC is up to 9X faster than noSSD, and up to 5X faster than DW/TAC.
TPC-E
[Chart: speedup relative to noSSD for TAC, DW, and LC, at database sizes of 100GB, 200GB, and 400GB]
Q: Why do the three designs have similar speedups?
A: Because TPC-E is read-intensive.
Q: Why does the highest speedup occur for the 200GB database?
A: For 400GB, a smaller fraction of the data is cached in the SSD; for 100GB, a larger fraction of the data is already cached in the memory BP.
TPC-H
[Chart: speedup relative to noSSD for TAC, DW, and LC, at database sizes of 45GB and 160GB]
Q: Why are the speedups smaller than for TPC-C or TPC-E?
A: Because most I/Os are sequential. For random I/Os, the Fusion ioDrive is 10X faster than the disks; for sequential I/Os, the 8 disks are 1.4X faster than the SSD.
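This reasoning admits a quick Amdahl-style estimate: if only the random-I/O fraction of the run time is accelerated (sequential I/O stays on the disks), the overall speedup is capped by that fraction. The function below is a back-of-the-envelope sketch; the 10X figure comes from the slide, while any particular random-I/O fraction is an assumed parameter, not a measurement.

```python
# Amdahl-style bound on speedup when only random I/O is accelerated.
# random_gain=10.0 reflects the slide's "Fusion is 10X faster" figure;
# frac_random is an assumed fraction of time spent in random I/O.

def amdahl_speedup(frac_random, random_gain=10.0):
    """Upper-bound overall speedup when the random-I/O fraction of the
    time is sped up by random_gain and the rest is unchanged."""
    return 1.0 / ((1.0 - frac_random) + frac_random / random_gain)
```

For example, if only half the time went to random I/O, the bound would be 1/(0.5 + 0.05) ≈ 1.8X, which is why a mostly sequential workload like TPC-H sees smaller gains than TPC-C or TPC-E.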
Disks are the Bottleneck
As long as the disks are the bottleneck…
[Charts: read and write I/O bandwidth (MB/s) over time (hours) to the 8 disks and to the SSD, for TPC-E 200GB; annotations: "SSD capacity reached!", "about half capacity"]
Long Ramp-up Time
If restarts are frequent, restarting from the SSD may reduce ramp-up time.
[Chart: throughput (tpsE, transactions/sec) over time (hours) for TAC, DW, and LC, TPC-E 200GB]
Q: Why does ramp-up take 10 hours?
A: Because the SSD is filled slowly, gated by the random-read speed of the disks.
Conclusions
• SSD buffer pool extension is a good idea.
  – We observed a 9X speedup (OLTP) and a 3X speedup (DSS).
• The choice of design depends on the update frequency.
  – For update-intensive (TPC-C) workloads, LC wins.
  – For read-intensive (TPC-E or TPC-H) workloads, DW/LC/TAC perform similarly.
• Mid-range SSDs may be good enough.
  – With 8 disks, only half of FusionIO's bandwidth is used.
• Caution: ramp-up time may be long.
  – If restarts are frequent, the DBMS should restart from the SSD.
Backup Slides
Architectural Change
[Diagram: before vs. after. Before: Buffer Manager over I/O Manager over Disk, with the BP in memory. After: an SSD Manager is added between the Buffer Manager and the I/O Manager, managing an SSD BP alongside the memory BP.]
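The layered lookup the diagram implies can be sketched as a read path that consults the memory BP first, then the SSD BP, and only then the disk. This is a hypothetical sketch of the architecture, not SQL Server's SSD Manager; all names are illustrative.

```python
# Hypothetical sketch of the layered read path: memory BP, then SSD BP
# (via the SSD Manager), then a disk read via the I/O Manager.
# Dicts stand in for the actual managers; not SQL Server code.

class ReadPath:
    def __init__(self):
        self.memory_bp = {}  # pid -> page image (memory buffer pool)
        self.ssd_bp = {}     # pid -> page image (SSD buffer pool)
        self.disk = {}       # pid -> page image (stable storage)

    def read(self, pid):
        """Return (page, source) for the requested page id."""
        if pid in self.memory_bp:          # 1. memory BP hit
            return self.memory_bp[pid], "memory"
        if pid in self.ssd_bp:             # 2. SSD BP hit
            page = self.ssd_bp[pid]
            self.memory_bp[pid] = page     # promote into memory
            return page, "ssd"
        page = self.disk[pid]              # 3. miss: read from disk
        self.memory_bp[pid] = page
        return page, "disk"
```

The point of the layering is that the buffer manager's callers are unchanged: a hit in either tier returns the page, and only a miss in both tiers reaches the I/O manager.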
Data Structures
Further Issues
• Aggressive filling
• SSD throttle control
• Multi-page I/O requests
• Asynchronous I/O handling
• SSD partitioning
• Gather writes