

Implementation Issues

8.2 Maintaining Internal Transaction Representation

The design of internal transaction representation is very important to overall system performance and resource consumption. This section describes the main data structures shown in Figure 8.2 and how transaction readset and writeset are recorded.

8.2.1 Main Data Structures

A Client-Wide Transaction Database

To centralize data management, all the important transaction information is stored in a client-wide data structure called IOTDB, which contains the following items.

[Figure 8.2 diagram labels. IOTDB: misc data, SG, env DB, pending iot-list, running iot-list, committed iot-list. iotrep: misc data, iot spec, iot env, read set, write set, read vol-set, write vol-set (lists of vid). objrep: misc data, name list (name entries), fid, spec.]

This figure shows the main data structures used by the internal transaction representation. Each rectangular box corresponds to a major data item and the shaded areas represent data structures that are further explained in the figure or the subsequent discussion.

Transaction Lists

The most important IOTDB component is a group of transaction lists, each containing all the transactions in a particular state. For example, all the running transactions are in the running-iot-list and all the pending transactions are in the pending-iot-list. Each element of a list is a pointer to an iotrep, the internal representation of an IOT. The main purpose of using multiple transaction lists is to reduce the performance overhead resulting from frequent internal search activities. Note that terminated transactions are temporarily maintained in their lists and garbage collected by a periodic daemon.

Serialization Graph

The transaction serialization graph (SG) maintains the local dependencies among all live transactions, with each node representing an IOT or an IFT. SG nodes and edges are inserted and removed as transaction activities proceed. Internally, SG is represented by a group of doubly linked lists.

Environment Database

An environment database (envDB) is created to allow different transactions executed by the same user to share common environment variables.

Miscellaneous Data

There are miscellaneous data items mainly used by the basic internal transaction operations. An example is the wait-for graph for detecting transaction deadlock.
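As an illustration of how a wait-for graph can expose deadlock, the sketch below looks for a cycle with a depth-first search. The text does not say how the graph is stored or searched, so the dictionary representation, the function name `has_deadlock`, and the integer transaction identifiers are all assumptions made for this example.

```python
from typing import Dict, Set

def has_deadlock(wait_for: Dict[int, Set[int]]) -> bool:
    """Return True if the wait-for graph contains a cycle.

    wait_for[t] is the set of transactions that t is waiting on
    (a hypothetical representation, not the one used by the system).
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color: Dict[int, int] = {}

    def visit(t: int) -> bool:
        color[t] = GREY
        for u in wait_for.get(t, ()):
            c = color.get(u, WHITE)
            if c == GREY:          # back edge: u is already on the path
                return True
            if c == WHITE and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(color.get(t, WHITE) == WHITE and visit(t)
               for t in list(wait_for))
```

A cycle in this graph means that every transaction on it is waiting for another one on it, so none of them can make progress.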

Transaction Representation

The internal representation of an isolation-only transaction, referred to as iotrep in Figure 8.2, contains the following key elements.

Transaction Specification

This group of information contains both the identity and the conflict resolution requirement of the transaction. It includes the transaction identifier, the process-id and process group-id of the Unix process that invoked the transaction, the transaction's selected resolution option, and the pathname of the resolver executable file if the selected option is ASR.

Environment Information

iotrep stores the environment information needed for possible automatic resolution. As described in Chapter 7, such information contains the pathname of the transaction executable file, the command line arguments, the environment variable list, the umask

Readset and Writeset

Readset and writeset are the most important components of iotrep. They are represented by a doubly linked list with each element containing a pointer to objrep, the internal representation of an object accessed by the transaction. objrep records the information about all the access operations this transaction has performed on the object.

Volume Lists

Because of the need to frequently check transaction connectivity, a list of volumes read by a transaction is included in its iotrep, with each element containing the internal identifier of a volume. Similarly, the iotrep also maintains a list of volumes that are updated by the transaction.

Miscellaneous Data

There are miscellaneous data items in iotrep for recording information such as the current transaction connectivity, transaction execution time, etc.

Representation of a Transactionally Accessed Object

The most important information recorded in objrep is the fid of the object. The data item named spec in Figure 8.2 is a bitmap recording the sub-parts of the object that are actually read or written by the transaction. For a directory object, its objrep contains a list of the names that are accessed in the directory. Miscellaneous data items include a pointer to the shadow cache file of the object if one exists.
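The three structures described in this subsection can be summarized in a small sketch. It is only an outline under stated assumptions: Python lists stand in for the doubly linked lists, the field names and types are invented for illustration, the fid is a plain string, and the `move` helper is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ObjRep:
    """One object accessed by a transaction (objrep)."""
    fid: str                                  # identifier of the object
    spec: int = 0                             # bitmap of sub-parts read or written
    names: List[str] = field(default_factory=list)  # names accessed, for directories
    shadow: Optional[str] = None              # shadow cache file, if one exists

@dataclass
class IotRep:
    """One isolation-only transaction (iotrep)."""
    tid: int                                  # transaction identifier
    pid: int                                  # invoking Unix process
    pgid: int                                 # its process group
    resolution: str                           # selected resolution option, e.g. "ASR"
    resolver: Optional[str] = None            # resolver pathname when option is ASR
    readset: List[ObjRep] = field(default_factory=list)
    writeset: List[ObjRep] = field(default_factory=list)
    read_vols: List[int] = field(default_factory=list)   # volumes read
    write_vols: List[int] = field(default_factory=list)  # volumes updated

@dataclass
class IotDB:
    """Client-wide transaction database (IOTDB), one list per state."""
    lists: Dict[str, List[IotRep]] = field(
        default_factory=lambda: {"pending": [], "running": [], "committed": []})

    def move(self, t: IotRep, src: str, dst: str) -> None:
        """Move a transaction from one state list to another."""
        self.lists[src].remove(t)
        self.lists[dst].append(t)
```

Keeping one list per state means that a search for, say, running transactions never has to skip over pending or committed ones.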

8.2.2 Recording Transaction Readset/Writeset

The most frequent internal bookkeeping activity during transaction execution is recording readset and writeset. For every file access operation, the transaction system must detect whether it is performed on behalf of an ongoing transaction. Because such detection needs the process group-id associated with each file access operation, the communication interface between the kernel and the IOT-Venus is extended to pass such process information.

Extending the Kernel/Venus Interface

Client support in Coda is divided between a small in-kernel Mini-Cache [67] and a much larger user-level Venus cache manager. The main purpose of the Mini-Cache is to reduce the frequency of kernel/Venus communication by caching a small amount of information (such as the results of successful lookup calls) in the kernel. The Mini-Cache intercepts file system calls on Coda objects from the kernel Vnode layer [28, 57] and redirects them to the user-level Venus by exchanging messages through the Coda pseudo-device. The information about each operation passed from the Mini-Cache to Venus is defined in the Vnode interface and includes the operation code, the internal identifiers of the operands, and the ucred data about the user who issued the operation [28, 57]. Unfortunately, this information does not include the process information associated with the operation. To address this problem, we extended the Mini-Cache/Venus communication interface to pass this information, as shown in Figure 8.3.

[Figure 8.3 diagram labels: Application, IOT-Venus, Kernel, VFS/Vnode Layer, Coda MiniCache, uarea, vnode_opr, vnode_opr + process_info, process_info]

This figure illustrates the kernel extension needed for the IOT-Venus to obtain the necessary process information for every file access operation on Coda objects. The Mini-Cache packs the process information obtained from the kernel uarea into messages sent to the IOT-Venus.

Figure 8.3: Extending Kernel/Venus Communication with Process Information

Recording Readset/Writeset

Recording transaction readset and writeset involves searching and updating the relevant data structures. Upon receiving a new file access operation opr from the kernel, the first step the transaction system undertakes is checking whether opr belongs to an ongoing transaction. This is accomplished by linearly scanning all the transactions in the running-iot-list using the attached process information. If opr is found to belong to a currently running transaction T, we must search T's readset or writeset depending on whether opr is a read or update operation.

Suppose that opr is a read operation and has only one operand obj. The transaction system will linearly search through the linked list representing T's readset. If obj is not in the list, a new objrep is created and inserted into the list, storing information about obj and the sub-parts of obj accessed by opr. If obj is already in the readset, the spec bitmap in the objrep of obj is updated to include the sub-parts of obj accessed by opr. If obj is a directory, a new name may need to be inserted into the name-list of the objrep depending on the actions performed by opr. Update operations involving multiple operands can be processed in a similar manner. Note that the performance overhead caused by the linear search activities can be reduced by using more advanced data structures such as a hash table, particularly for large transactions accessing hundreds of objects.
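The read-recording steps described above can be sketched as follows. This is a minimal illustration, not the actual Venus code: transactions are matched on process group-id only, the bit values `DATA` and `STATUS` are hypothetical sub-part flags, and Python lists stand in for the linked lists.

```python
from dataclasses import dataclass, field
from typing import List, Optional

DATA = 0x1     # hypothetical bit: object data was read
STATUS = 0x2   # hypothetical bit: object status was read

@dataclass
class ObjRep:
    fid: str
    spec: int = 0
    names: List[str] = field(default_factory=list)

@dataclass
class IotRep:
    pgid: int
    readset: List[ObjRep] = field(default_factory=list)

def find_transaction(running: List[IotRep], pgid: int) -> Optional[IotRep]:
    """Linearly scan the running list for the transaction owning pgid."""
    for t in running:
        if t.pgid == pgid:
            return t
    return None

def record_read(running: List[IotRep], pgid: int, fid: str,
                parts: int, name: Optional[str] = None) -> bool:
    """Record a single-operand read; return False if no transaction owns it."""
    t = find_transaction(running, pgid)
    if t is None:
        return False
    for rep in t.readset:          # linear search of the readset
        if rep.fid == fid:
            break
    else:                          # not found: create and insert a new objrep
        rep = ObjRep(fid)
        t.readset.append(rep)
    rep.spec |= parts              # merge the sub-parts touched by this read
    if name is not None and name not in rep.names:
        rep.names.append(name)     # directory read: remember the name accessed
    return True
```

Replacing the inner loop with a dictionary keyed by fid would give the hash-table variant mentioned above, at the cost of an extra index to maintain.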

Detecting Abnormal Termination

The ability to accurately identify the scope of a running transaction influences the amount of search activity needed for recording transaction readset and writeset. If a transaction T fails to issue the end_iot call, or its program exits before the end_iot call can be made, T will remain in the running state and cause unnecessary internal search activities. Thus, we need a reliable mechanism to detect such abnormal transaction termination. Intuitively, solving this problem requires either the kernel to notify the IOT-Venus every time a process exits or the IOT-Venus to poll the kernel about whether the master process of a running transaction has exited. Both approaches are costly in performance and add complexity to the kernel/Venus communication interface.

We use a much simpler solution based on the observation that whenever a process exits, the kernel always closes all its open descriptors. We designate a special Coda object, /coda, and internally open it for read on behalf of any transaction at the beginning of its begin_iot call. The transaction system maintains a counter in iotrep which is incremented whenever /coda is opened by the transaction and decremented whenever it is closed by the transaction. If the counter reaches zero while the transaction is still in the running state, the transaction's master process has exited without calling end_iot, and the final decrement that brought the counter to zero resulted from the kernel closing the open /coda. Note that this approach assumes that transactions do not close /coda without opening it first.
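The counter logic can be sketched as below. The function names, the `state` strings, and the `sentinel_opens` field are assumptions made for this illustration, and the Python functions `begin_iot` and `end_iot` only stand in for the corresponding transaction calls. What the system does after detecting an abnormal exit is not shown.

```python
from dataclasses import dataclass

SENTINEL = "/coda"   # the designated special Coda object

@dataclass
class IotRep:
    tid: int
    state: str = "running"      # hypothetical state label
    sentinel_opens: int = 0     # opens of /coda minus closes of /coda

def on_open(t: IotRep, path: str) -> None:
    """Called for every open issued on behalf of transaction t."""
    if path == SENTINEL:
        t.sentinel_opens += 1

def on_close(t: IotRep, path: str) -> bool:
    """Called for every close; True means abnormal termination was detected."""
    if path != SENTINEL:
        return False
    t.sentinel_opens -= 1
    # Zero while still running: the kernel closed /coda at process exit.
    return t.sentinel_opens == 0 and t.state == "running"

def begin_iot(t: IotRep) -> None:
    """Start of a transaction: internally open /coda for read."""
    on_open(t, SENTINEL)

def end_iot(t: IotRep) -> None:
    """Normal end: the transaction leaves the running state."""
    t.state = "ended"           # hypothetical state label
```

Because a transaction that ended normally is no longer in the running state, the close of /coda at process exit does not raise a false alarm for it.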

8.3 Shadow Cache File Management

8.3.1 Shadow Cache File Organization

As discussed in Chapter 4, the transaction system maintains two entities for each shadow cache file: a disk container file holding the shadow content, and a shadow entry containing a pointer to the container file and a counter recording the number of live transactions that accessed the shadow content. An example of the internal organization of shadow cache files is shown in Figure 8.4. There is a central database (SCFDB) containing key information about shadow cache files and their management. The main components of SCFDB are a list of shadow entries and some data items used for managing shadow space allocation. The relation between a transaction and its shadow cache files is maintained by the shadow entry pointer stored in the relevant objrep belonging to the transaction.
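A shadow entry with its transaction counter might look like the sketch below. The class and method names are hypothetical, the container file is represented only by a path string, and reclaiming an entry as soon as its counter drops to zero is an assumption; the text states only that the counter records the number of live transactions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ShadowEntry:
    container: str     # path of the disk container file (hypothetical form)
    refs: int = 0      # live transactions that accessed this shadow content

class ScfDB:
    """Central database of shadow cache files (SCFDB), reduced to its entry list."""

    def __init__(self) -> None:
        self.entries: List[ShadowEntry] = []

    def create(self, container: str) -> ShadowEntry:
        entry = ShadowEntry(container)
        self.entries.append(entry)
        return entry

    def attach(self, entry: ShadowEntry) -> None:
        """A transaction's objrep now points at this entry."""
        entry.refs += 1

    def detach(self, entry: ShadowEntry) -> bool:
        """A transaction is no longer live; True if the entry was reclaimed."""
        entry.refs -= 1
        if entry.refs == 0:
            # Assumed policy: drop the entry once no live transaction uses it.
            self.entries.remove(entry)
            return True
        return False
```

With two transactions sharing one shadow entry, the entry survives the first `detach` and is dropped on the second.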

[Figure 8.4 diagram labels: SCFDB, misc data, shadow entries obj-1, obj-2, obj-2, obj-3, transactions T1 and T2]