We have argued in favor of a BLOB access interface that is asynchronous and versioning-based, and that guarantees the atomic generation of a new snapshot each time the BLOB is updated.
To meet these properties, we propose a series of primitives. To enable asynchrony, control is returned to the application immediately after the invocation of a primitive, rather than after the operation initiated by the primitive has completed. When the operation completes, a callback function, supplied as a parameter to the primitive, is called with the result of the operation as its parameters. It is in the callback function that the calling application takes the appropriate actions based on the result of the operation.
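The calling convention described above can be sketched as follows. This is only an illustration in Python, not the interface of the actual system: the thread pool standing in for the BLOB service and the names async_primitive, demo and on_complete are assumptions made for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical worker pool standing in for the BLOB service (an assumption).
_pool = ThreadPoolExecutor(max_workers=2)


def async_primitive(operation, callback):
    """Start `operation` in the background and return immediately.

    When the operation finishes, `callback` is called with its result.
    """
    def run():
        callback(operation())

    _pool.submit(run)


def demo():
    done = threading.Event()
    results = []

    def on_complete(result):
        # The application reacts to the outcome here, not at the call site.
        results.append(result)
        done.set()

    async_primitive(lambda: 6 * 7, on_complete)
    # Control is back with the caller; the operation may still be running.
    done.wait(timeout=5)
    return results
```

Calling demo() returns [42]. The caller blocks on the event only to collect the result for the example; a real application would continue with other work after invoking the primitive.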
42 Chapter4– Design principles
4.2.1.1 Basic BLOB manipulation
CREATE(callback(id))
This primitive creates a new empty BLOB of size 0. The BLOB will be identified by its id, which is guaranteed to be globally unique. The callback function receives this id as its only parameter.
WRITE(id, buffer, offset, size, callback(v))
APPEND(id, buffer, size, callback(v))
The client can update the BLOB by invoking the corresponding WRITE or APPEND primitive. The initiated operation copies size bytes from a local buffer into the BLOB identified by id, either at the specified offset (in case of write), or at the end of the BLOB (in case of append).
Each time the BLOB is updated by invoking write or append, a new snapshot reflecting the changes and labeled with an incremental version number is generated. The semantics of the write or append primitive is to submit the update to the system and let the system decide when to generate the new snapshot. The actual version assigned to the new snapshot is not known by the client immediately: it becomes available only at the moment when the operation completes. This completion results in the invocation of the callback function, which is supplied by the system with the assigned version v as a parameter.
The following guarantees are associated with the above primitives.
Liveness: For each successful write or append operation, the corresponding snapshot is
eventually generated in a finite amount of time.
Total version ordering: If the write or append primitive is successful and returns version
number v, then the snapshot labeled with v reflects the successive application of all updates numbered 1 . . . v on the initially empty snapshot (conventionally labeled with version number 0), in this precise order.
Atomicity: Each snapshot appears to be generated instantaneously at some point between
the invocation of the write or append primitive and the moment it is available for reading.
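The total version ordering guarantee can be illustrated with a small in-memory model. This is a sketch only, not the mechanism used by the system: the class VersionedBlob is hypothetical, the callback is invoked synchronously for brevity, and the zero-byte padding applied when a write starts past the current end is an assumption that the text does not specify.

```python
class VersionedBlob:
    """Toy in-memory BLOB that keeps every snapshot (illustration only)."""

    def __init__(self):
        # Version 0 is the initially empty snapshot.
        self.snapshots = [b""]

    def write(self, buffer, offset, size, callback):
        latest = self.snapshots[-1]
        # Assumption: a write past the current end pads the gap with zero bytes.
        base = latest.ljust(offset, b"\0")
        new = base[:offset] + buffer[:size] + base[offset + size:]
        # Earlier snapshots are never modified; a new one is added instead.
        self.snapshots.append(new)
        # The assigned version is only reported through the callback.
        callback(len(self.snapshots) - 1)

    def append(self, buffer, size, callback):
        # An append is a write positioned at the end of the latest snapshot.
        self.write(buffer, len(self.snapshots[-1]), size, callback)
```

Appending b"hello" and then b" world" to a fresh VersionedBlob reports versions 1 and 2, and snapshot 2 holds b"hello world" while snapshot 1 still holds b"hello". Each snapshot v is thus the result of applying updates 1 to v, in order, to the empty snapshot 0.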
Once a snapshot has been successfully generated, its contents can be retrieved by calling the following primitive:
READ(id, v, buffer, offset, size, callback(result))
READ enables the client to read from the snapshot version v of BLOB id. This primitive results in replacing the contents of the local buffer with size bytes from v, starting at offset, if the snapshot has already been generated. The callback function receives a single parameter, result, a boolean value that indicates whether the read succeeded or failed. If v has not been generated yet, the read fails and result is false. A read also fails if the total size of the snapshot v is smaller than offset + size.
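The two failure conditions of READ can be made concrete with a minimal sketch. The representation of a BLOB as a Python list of bytes objects indexed by version is an assumption made for illustration, as is the synchronous invocation of the callback.

```python
def read(snapshots, v, buffer, offset, size, callback):
    """Copy `size` bytes of snapshot `v`, starting at `offset`, into `buffer`.

    `snapshots` is a hypothetical list of bytes objects indexed by version;
    `buffer` is a bytearray owned by the caller.
    """
    if v < 0 or v >= len(snapshots):
        callback(False)  # snapshot v has not been generated yet
        return
    snapshot = snapshots[v]
    if offset + size > len(snapshot):
        callback(False)  # snapshot is smaller than offset + size
        return
    buffer[:size] = snapshot[offset:offset + size]
    callback(True)
```

With snapshots = [b"", b"hello world"], reading 5 bytes at offset 6 from version 1 fills the buffer with b"world" and reports True, whereas reading from version 2, or reading 5 bytes at offset 8, reports False and leaves the buffer untouched.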
4.2 – Versioning as a key to support concurrency 43
4.2.1.2 Learning about new snapshot versions
Note that there must be a way to learn about both the generated snapshots and their sizes, in order to be able to specify meaningful values for v, offset and size. This is performed by using the following ancillary primitives:
GET_RECENT(id, callback(v, size))
GET_SIZE(id, v, callback(size))
The GET_RECENT primitive queries the system for a recent snapshot version of the BLOB id. The result of the query is the version number v, which is passed to the callback function together with the size of the associated snapshot. A positive value for v indicates success, while any negative value indicates failure. The system guarantees that: 1) v ≥ max(vk), for all snapshot versions vk that were successfully generated before the call is processed by the system; and 2) all snapshot versions with a number lower than or equal to v have successfully been generated as well and are available for reading. Note that this primitive is only intended to provide the caller with information about recent versions available: it does not involve strict synchronization and does not block the concurrent creation of new snapshots.
The GET_SIZE primitive is used to find out the total size of the snapshot version v for
BLOB id. This size is passed to the callback function once the operation has successfully
completed.
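A client would typically call these two primitives before READ in order to choose valid arguments. The sketch below assumes a hypothetical in-memory store in which snapshots are generated strictly in version order, so that the latest index satisfies both guarantees; the size value reported on failure is likewise an assumption.

```python
# Hypothetical in-memory store: BLOB id -> list of snapshots indexed by version.
store = {"blob-1": [b"", b"hello", b"hello world"]}


def get_recent(blob_id, callback):
    snapshots = store.get(blob_id)
    if snapshots is None:
        # A negative version signals failure; the size 0 is an assumption.
        callback(-1, 0)
        return
    # Snapshots are stored in order, so every version up to v exists.
    v = len(snapshots) - 1
    callback(v, len(snapshots[v]))


def get_size(blob_id, v, callback):
    # Error handling for unknown ids or versions is omitted in this sketch.
    callback(len(store[blob_id][v]))
```

For "blob-1", get_recent reports version 2 with size 11, and get_size for version 1 reports 5. The reported size bounds the offset and size values that a subsequent READ of that version may use.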
Most of the time, the GET_RECENT primitive is sufficient to learn about new snapshot versions that are generated in the system. However, some scenarios require the application to react to updates as soon as possible after they happen. In order to avoid polling for new snapshot versions, two additional primitives are available to subscribe (and unsubscribe) to notifications for snapshot generation events.
SUBSCRIBE(id, callback(v, size))
UNSUBSCRIBE(id)
Invoking the SUBSCRIBE primitive registers the interest of a process to receive a notification each time a new snapshot of the BLOB id is generated. The notification is performed by calling the callback function with two parameters: the snapshot version v of the newly generated snapshot and its total size. The same guarantees are offered for the version as with the GET_RECENT primitive. Invoking the UNSUBSCRIBE primitive unregisters the client from receiving notifications about new snapshot versions for a given BLOB id.
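The subscription mechanism follows the familiar observer pattern and can be sketched as below. The class Notifier models the registrations of a single client, and the method snapshot_generated is a hypothetical hook that the system would invoke after generating a snapshot; neither name comes from the interface above.

```python
class Notifier:
    """Toy notification registry for a single client (illustration only)."""

    def __init__(self):
        self.callbacks = {}  # BLOB id -> callback registered by this client

    def subscribe(self, blob_id, callback):
        self.callbacks[blob_id] = callback

    def unsubscribe(self, blob_id):
        # Unsubscribing from a BLOB that was never subscribed to is a no-op.
        self.callbacks.pop(blob_id, None)

    def snapshot_generated(self, blob_id, v, size):
        # Hypothetical hook invoked by the system after snapshot v is generated.
        callback = self.callbacks.get(blob_id)
        if callback is not None:
            callback(v, size)
```

A client that subscribes to "blob-1" receives the pair (v, size) for each snapshot of that BLOB generated afterwards, is not notified about other BLOBs, and stops receiving notifications once it unsubscribes.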
4.2.1.3 Cloning and merging
WRITE and APPEND facilitate efficient concurrent updates to the same BLOB without the need to synchronize explicitly. The client simply submits the update to the system and is guaranteed that it will be applied at some point. This works as long as there are no semantic conflicts between writes, such that the client need not be aware of what other concurrent clients are writing. Many distributed applications fall into this category. For example, the application spawns a set of distributed workers that concurrently process records in a large shared file, such that
each worker needs to update a different record or the worker does not care if its update to the record will be overwritten by a concurrent worker.
However, in other cases, the workers need to perform concurrent updates that might introduce semantic conflicts which need to be reconciled at a higher level. In this case, WRITE and APPEND cannot be used as-is, because the state of the BLOB at the time the update is effectively applied is unknown. While many systems introduce transactional support to deal with this problem, this approach is widely acknowledged both in industry and academia to have poor availability [50]. We propose a different approach that is again based on versioning and introduce another specialized primitive for this purpose:
CLONE(id, v, callback(new_id))
The CLONE primitive is invoked to create a new BLOB identified by new_id, whose initial snapshot version is not empty (as is the case with CREATE) but rather duplicates the content of snapshot version v of BLOB id. The new BLOB looks and acts exactly like the original; however, any subsequent updates to it are independent of the updates performed on the original. This enables the two BLOBs to evolve in divergent directions, much like the fork primitive on UNIX-like operating systems.
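The divergent evolution of a BLOB and its clone can be illustrated with a minimal in-memory sketch. The integer ids, the dictionary blobs and the choice to label the duplicated content as version 0 of the new BLOB are assumptions made for this example; the text does not fix the version label of the clone's initial snapshot.

```python
import itertools

_next_id = itertools.count(1)  # hypothetical generator of unique BLOB ids
blobs = {}                     # BLOB id -> list of snapshots indexed by version


def create(callback):
    blob_id = next(_next_id)
    blobs[blob_id] = [b""]     # version 0 is the empty snapshot
    callback(blob_id)


def append(blob_id, buffer, size, callback):
    snapshots = blobs[blob_id]
    snapshots.append(snapshots[-1] + buffer[:size])
    callback(len(snapshots) - 1)


def clone(blob_id, v, callback):
    new_id = next(_next_id)
    # Assumption: the duplicated content becomes version 0 of the new BLOB.
    # bytes objects are immutable, so both BLOBs can safely share the data.
    blobs[new_id] = [blobs[blob_id][v]]
    callback(new_id)
```

After cloning version 1 of a BLOB that holds b"base", appending b"-A" to the original and b"-B" to the clone yields b"base-A" and b"base-B" respectively. Neither update is visible in the other BLOB, and the cloned snapshot itself is shared rather than copied.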
Using CLONE, workers can isolate their own updates from updates of other concurrent workers, which eliminates the need to lock and wait. At a later point, these updates can be “merged back” into the original BLOB after detecting potential semantic conflicts and reconciling them. Another primitive is introduced for this purpose:
MERGE(sid, sv, soffset, size, did, doffset, callback(dv))
MERGE takes the region delimited by soffset and size from snapshot version sv of BLOB sid and writes it starting at doffset into BLOB did. The effect is the same as if READ(sid, sv, buffer, soffset, size, callback(result)) was issued, followed by WRITE(did, buffer, doffset, size, callback(dv)). The only difference is the fact that MERGE enables the system to perform this with negligible overhead, as unnecessary data duplication and data transfers to and from the client can be avoided.
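The stated equivalence between MERGE and a READ followed by a WRITE can be sketched as follows. The dictionary blobs, the BLOB names "orig" and "work" and the zero-byte padding past the end of the destination are assumptions; the sketch also copies bytes in memory and omits range checks, whereas the point of MERGE is precisely that the system can avoid such copies.

```python
# Hypothetical store: BLOB id -> list of snapshots indexed by version.
# "work" stands for a clone of version 1 of "orig" that was updated once.
blobs = {
    "orig": [b"", b"aaaaaaaa"],
    "work": [b"aaaaaaaa", b"aaBBBaaa"],
}


def merge(sid, sv, soffset, size, did, doffset, callback):
    # Equivalent of READ(sid, sv, buffer, soffset, size, ...).
    region = blobs[sid][sv][soffset:soffset + size]
    # Equivalent of WRITE(did, buffer, doffset, size, ...).
    latest = blobs[did][-1]
    # Assumption: a gap before doffset is padded with zero bytes.
    base = latest.ljust(doffset, b"\0")
    blobs[did].append(base[:doffset] + region + base[doffset + size:])
    # The callback receives the version assigned in the destination BLOB.
    callback(len(blobs[did]) - 1)
```

Merging 3 bytes at offset 2 of version 1 of "work" into "orig" at offset 2 generates version 2 of "orig" with content b"aaBBBaaa", while version 1 of "orig" remains b"aaaaaaaa".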
Both CLONE and MERGE can take advantage of differential updates, sharing unmodified
data and metadata between snapshots of different BLOBs. Since no data transfer is involved, this effectively results in the need to perform minimal metadata updates, which enables efficient semantic-based reconciliation, as described in Section 4.2.3.