DISCONNECTED OPERATION IN THE CODA FILE SYSTEM
Background
We are back in the 1990s.
Network is slow and unstable
Terminal -> “powerful” client
– 33 MHz CPU, 16 MB RAM, 100 MB hard drive
Mobile Users appear
Disconnected Operation
Disconnected operation is a mode of operation that enables a client to continue accessing critical data during temporary failures of a shared data repository.
Key idea: caching data.
– Performance
– Availability
Design Overview
Coda is designed for an environment consisting of a large collection of untrusted Unix clients and a much smaller number of trusted Unix file servers.
Each Coda client has a local disk and can communicate with the servers over a high bandwidth network.
The Coda namespace is mapped to individual file servers at the granularity of subtrees called volumes.
Mechanisms for high availability:
(1) Server replication
VSG: volume storage group, the set of servers holding replicas of a volume
AVSG: accessible VSG, the subset of the VSG that the client can currently reach
(2) Disconnected operation
takes effect when the AVSG becomes empty.
An example depicts a typical scenario involving transitions between server replication and disconnected operation.
An Example
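The relationship between VSG, AVSG and the switch to disconnected operation can be sketched in a few lines of Python. This is only an illustration: the class `VolumeView`, its fields and the mode names are hypothetical and are not part of Coda.

```python
from dataclasses import dataclass, field


@dataclass
class VolumeView:
    """One client's view of a volume's replicas (hypothetical helper)."""
    vsg: frozenset                                # all servers holding a replica
    reachable: set = field(default_factory=set)   # servers the client can contact now

    @property
    def avsg(self):
        # The AVSG is the part of the VSG the client can currently reach.
        return self.vsg & self.reachable

    @property
    def mode(self):
        # Disconnected operation takes effect only when the AVSG is empty.
        return "server replication" if self.avsg else "disconnected"
```

As servers drop out of `reachable`, the AVSG shrinks but the client keeps relying on server replication; only when the last member disappears does `mode` become `"disconnected"`.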
Design Rationale
Scalability
– Callback cache coherence (inherited from AFS)
– Whole file caching
– Fat clients (security, integrity)
– Avoid system-wide rapid change
Portable workstations
(Powerful, lightweight and compact laptop computers)
Design Rationale – Replication
First vs Second Class Replication
Server replication (why?)
– First class replicas, higher quality
+ Persistent, physically secure
- Expensive
Client replication (i.e., cache copies)
– Second class replicas
Design Rationale – Replica Control
By definition, a network partition exists between a disconnected second class replica and all its first class associates.
Pessimistic
– Avoid conflicts by disallowing all partitioned writes, or by restricting reads and writes to a single partition
Optimistic
– Permit reads and writes everywhere
– More sophisticated: needs conflict detection
Hoarding
Hoard useful data for disconnection
Balance the needs of connected and disconnected operation.
– Cache size is restricted
– Unpredictable disconnections
Prioritized algorithm
User-defined hoard priority p: how interesting is the object?
Recent usage q
Object priority = f(p,q)
Kick out the one with lowest priority
+ Fully tunable
Everything can be customized
- Not tunable (?)
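A minimal sketch of the prioritized replacement idea follows. The slides only say that the object priority is some function f(p, q); the weighted sum and the weight `alpha` below are assumptions chosen for illustration, and the function names are hypothetical.

```python
def object_priority(p, q, alpha=0.75):
    """Combine hoard priority p and recent-usage score q.

    Assumption: f is a weighted sum; the slides do not specify f.
    """
    return alpha * p + (1 - alpha) * q


def choose_victim(cache):
    """Return the name of the cached object with the lowest priority.

    `cache` maps an object name to its (p, q) pair.
    """
    return min(cache, key=lambda name: object_priority(*cache[name]))
```

With a large `alpha` the user's hoard priority dominates, so a heavily used scratch file is still evicted before a hoarded file that has not been touched recently.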
Hoard Walking
We say that a cache is in equilibrium, signifying that it meets user expectations about availability, when no uncached object has a higher priority than a cached object.
Equilibrium – priority(uncached obj) <= priority(cached obj)
– Why might it be broken? Cache size is limited.
Walking: restore equilibrium
– Reload the HDB (hoard database), which may have been changed by others
– Reevaluate priorities in HDB and cache
– Enhanced callback
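The equilibrium condition and a simplified hoard walk can be sketched as below. This is an illustration under loud assumptions: every object occupies exactly one cache slot, priorities are already computed, and the helper names `in_equilibrium` and `hoard_walk` are hypothetical.

```python
def in_equilibrium(cached, uncached):
    """True when no uncached object has a higher priority than a cached one.

    Both arguments map an object name to its current priority.
    """
    if not cached or not uncached:
        return True  # the condition holds vacuously
    return max(uncached.values()) <= min(cached.values())


def hoard_walk(cached, uncached, capacity):
    """Restore equilibrium by keeping the `capacity` highest-priority objects.

    Assumption: each object takes one slot; real object sizes are ignored.
    """
    everything = {**cached, **uncached}
    ranked = sorted(everything, key=everything.get, reverse=True)
    keep = set(ranked[:capacity])
    new_cached = {n: pr for n, pr in everything.items() if n in keep}
    new_uncached = {n: pr for n, pr in everything.items() if n not in keep}
    return new_cached, new_uncached
```

For example, with `cached = {"a": 10, "b": 80}` and `uncached = {"c": 50}` the cache is out of equilibrium, and a walk with `capacity=2` fetches `c` and evicts `a`.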
Emulation
Act like a server
Record modified objects
Prepare to replay update activity
– Log based per volume
Persistence
– Metadata kept in RVM (recoverable virtual memory)
– Resource exhaustion
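A per-volume replay log might look roughly like the sketch below. It is only an in-memory illustration: the class `ReplayLog` and its record layout are hypothetical, and the persistence that the slides attribute to RVM is not modelled.

```python
from collections import defaultdict


class ReplayLog:
    """In-memory stand-in for a per-volume log of updates made while disconnected."""

    def __init__(self):
        self._by_volume = defaultdict(list)

    def record(self, volume, op, path, data=None):
        # Append the update in the order it happened so it can be replayed later.
        self._by_volume[volume].append((op, path, data))

    def entries(self, volume):
        # Return a copy so callers cannot modify the log by accident.
        return list(self._by_volume.get(volume, []))

    def free(self, volume):
        # Drop a volume's log, e.g. after a successful reintegration.
        self._by_volume.pop(volume, None)
```

Keeping one log per volume means each volume can be reintegrated, and its log freed, independently of the others.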
Reintegration
Replay algorithm
– Log is shipped to all AVSG members and replayed in parallel
– Transaction based
– Succeeded?
Yes: free the log, reset priorities
Conflict Handling
Only write/write conflicts matter
File vs. directory
– File: halt the entire reintegration process
– Dir: investigate more
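The all-or-nothing replay and the file conflict rule can be sketched together. Everything here is an assumption made for illustration: conflicts are detected with simple version counters, the server is a plain dictionary, the names `reintegrate` and `Conflict` are hypothetical, and the finer-grained handling of directory conflicts is not modelled.

```python
class Conflict(Exception):
    """Raised when a logged write collides with a write made on the server."""


def reintegrate(server, log):
    """Replay `log` against `server` as a single transaction.

    server: dict mapping path -> (version, data)
    log:    list of (path, base_version, new_data), where base_version is
            the version the client cached before disconnecting.
    """
    staged = dict(server)   # work on a copy so a failure leaves the server untouched
    replayed = set()
    for path, base_version, new_data in log:
        current_version, _ = staged.get(path, (0, None))
        # Write/write conflict: someone else updated the file in the meantime.
        if path not in replayed and current_version != base_version:
            raise Conflict(path)
        staged[path] = (current_version + 1, new_data)
        replayed.add(path)
    server.clear()
    server.update(staged)   # commit all updates at once
    return len(log)
```

A single conflicting file aborts the whole replay, which mirrors the "halt the entire reintegration process" rule above; updates earlier in the log are not applied either.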
Coda Evaluation
Hardware
– 386 laptop, DECstation 3100s
– 350 MB disk
How …?
– How long does reintegration take?
Answers
Duration of Reintegration
– A few hours (4 to 5) of disconnection -> about 1 min
Cache size
– 100 MB (disk) at the client is enough for a “typical” workday
Conflicts
– No conflicts at all! Why?
– Over 99% of modifications are made by the same person
Conclusion
Disconnected operation is a simple idea
Hard to implement in each stage
– Why?
An extended version of write-back cache?
– A write-back cache that prefetches critical data
Remember this slide?
We are back in the 1990s.
Network is slow and unstable
Terminal -> “powerful” client
– 33 MHz CPU, 16 MB RAM, 100 MB hard drive
Mobile Users appear
What’s now?
We are in the 2000s now.
Network is fast and reliable in the LAN
“Powerful” client -> very powerful client
– 2.4 GHz CPU, 1 GB RAM, 120 GB hard drive
Mobile Users everywhere
– IBM ThinkPad 10th anniversary
Do we still need disconnection?
WAN and wireless are slow and not very reliable
PDA is not very powerful
– 200 MHz StrongARM, 128 MB CF card
– Electric power constrained
LBFS (MIT) on WAN, Coda and Odyssey (CMU) for mobile users
What is the future?
We are in the 2010s now
High bandwidth, reliable wireless everywhere
Even a PDA is powerful
– 2 GHz, 1 GB RAM/Flash
– Unlimited kinetic or solar energy (?)
What will be the research topic in FS?