An effective recovery under fuzzy checkpointing in
main memory databases
S.K. Woo, M.H. Kim*, Y.J. Lee
Department of Computer Science, Korea Advanced Institute of Science and Technology, 373-1, Kusung-Dong, Yusung-Gu, Taejon 305-701, South Korea
Received 13 October 1998; received in revised form 28 June 1999; accepted 30 June 1999
Abstract
In main memory databases, fuzzy checkpointing gives less transaction overhead due to its asynchronous backup feature. However, until now, fuzzy checkpointing has considered only physical logging schemes. The size of the physical log records is very large, and hence it incurs space and recovery processing overhead. In this paper, we propose a recovery method based on a hybrid logging scheme, which permits logical logging under fuzzy checkpointing. The proposed method significantly reduces the size of log data, and hence offers faster recovery processing than conventional fuzzy checkpointing with physical logging. © 2000 Elsevier Science B.V. All rights reserved.
Keywords: Database recovery; Main memory databases; Fuzzy checkpointing; Hybrid logging
1. Introduction
In main memory databases (MMDB), since the primary copy of the data resides in the main memory, MMDB provides much better performance than disk-resident databases (DRDB). Due to the significant decrease of memory cost with the fast increase in memory capacity, the importance of MMDB has been increasingly recognized [1]. However, due to the volatility of main memory, the updated data in MMDB must be flushed to backup databases on disks in order to maintain the consistency of the database against system failures. The recovery-related work, e.g. checkpointing and logging, involves disk I/Os, so it must impose little overhead on transaction processing while still allowing fast recovery at restart.
A lot of recovery methods for MMDB have been proposed in the literature [2–10]. So far, recovery methods based on fuzzy checkpointing, introduced in Ref. [4], have shown efficient performance because of the asynchronous backup feature of fuzzy checkpointing [5]. Because of this asynchronous feature, partially updated pages may be flushed to disks, so fuzzy checkpointing employs a physical logging scheme. This is because under fuzzy checkpointing, the last consistent database state is difficult to reestablish from the last complete checkpoint without physical logging. In general, the size of the physical log records is very large, which results in space overhead and long recovery time. Ref. [5] indicates that fuzzy checkpointing incurs longer recovery time than consistent checkpointing methods with logical logging because of the large size of physical log records.
There have been some studies in the past that attempted to reduce the size of log data. Refs. [2] and [4] present a log compression method, where the 'redo' parts of log records for aborted transactions and the 'undo' parts of log records for committed transactions are not maintained. In Refs. [7] and [8], redo log records are flushed to disks and undo log records are discarded when a transaction is committed. However, most of those methods are not based on fuzzy checkpointing. Ref. [5] employs a shadow updating policy to record only redo log data. The authors, however, also indicate that fuzzy checkpointing with physical logging incurs significant overhead, even though only redo log records are maintained through shadow updating.
In DRDB, there have been some works on logical logging under fuzzy checkpointing. Ref. [11] describes a penultimate fuzzy checkpointing method with logical logging. Ref. [12] introduces a recovery method called ARIES, which is based on fuzzy checkpointing. ARIES supports logical logging, which is, however, restricted to objects with increment or decrement kinds of operations, e.g. garbage collection and changes to the amount of free space. Note that the two fuzzy checkpointing methods mentioned above are for DRDB and are not directly applicable to MMDB. Since MMDB keeps data in main memory permanently without buffering activities, the idea of penultimate checkpointing in Ref. [11] and the scheme of un-flushing dirty pages in Ref. [12] cannot be applied to MMDB.
0950-5849/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S0950-5849(99)00057-9
www.elsevier.nl/locate/infsof
* Corresponding author. Tel.: +82-42-869-3530; fax: +82-42-869-3510.
E-mail addresses: [email protected] (S.K. Woo), [email protected].
In this paper, we propose a recovery method based on a hybrid logging scheme, which permits logical logging under fuzzy checkpointing. Since logical logging can replace a large physical log record by a single, smaller record, it reduces the size of log data needed for recovery, which in turn makes recovery processing faster. Even though we accommodate logical logging, we still keep the asynchronous backup feature of fuzzy checkpointing.
The rest of the paper is organized as follows. In Section 2, we propose our recovery policies including a hybrid logging scheme, revised fuzzy checkpointing, etc. We describe recovery processing and some considerations for our recovery method in Section 3 and Section 4, respectively. In Section 5, we analyze the performance of the proposed method. Finally, Section 6 gives concluding remarks.
2. Proposed recovery method
2.1. Basic policies
The database area in main memory consists of the MMDB area, the log buffer, and the shadow area. We assume that an entire database can be stored in the MMDB area. The log buffer is composed of several log pages where the log data of transactions are recorded. A log page is flushed into a log disk when it is full. We employ a shadow updating policy. The updated data of a transaction are temporarily stored in the shadow area. Then, these data are propagated to the MMDB area appropriately during the commit work of the transaction. Shadow updating provides some advantages for MMDB: reduced log space, reduced MMDB access, faster reload processing, and reduced UNDO time [13]. It also prevents the partial undo of a transaction and generates only redo log records. Furthermore, performance studies in Refs. [10] and [14] have shown that shadow updating in MMDB provides better performance on transaction processing and post-crash log processing.
We use a ping-pong policy [15] as a backup method. During checkpointing, only portions of the database that have been updated are written out to the backup database according to the ping-pong policy, so checkpoints alternate between two backup databases. This backup policy increases the number of pages to be flushed, but avoids the violation of write-ahead logging (WAL) under fuzzy checkpointing. The violation in MMDB may occur when fuzzy checkpointing is used carelessly [5]. Under shadow updating, a transaction writes its log records to a log page and reflects its updates on MMDB. Then, the transaction waits until the log page is flushed into the log disk. At this time, if checkpointing is in progress, a partially updated page in MMDB can be flushed to a backup database before the corresponding log records are flushed. This is the violation of WAL. When a failure occurs in this situation, the flushed page cannot be recovered, because there are no corresponding log records in the log disk. Here, the ping-pong policy can be used to avoid the violation of WAL. This is because the ping-pong policy maintains two backup databases, and hence the previous copy of a page being flushed always exists.
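The alternation between two backup databases can be illustrated with a small sketch. It is not the authors' implementation: the class name `PingPongBackup`, the per-backup dirty sets, and the dictionaries standing in for on-disk copies are assumptions made only for this example.

```python
class PingPongBackup:
    """Toy model of ping-pong checkpointing with two backup copies (hypothetical names)."""

    def __init__(self, num_pages):
        self.mmdb = {p: 0 for p in range(num_pages)}        # primary copy in memory
        self.backups = [dict(self.mmdb), dict(self.mmdb)]   # stand-ins for two disk copies
        # Assumption: an updated page must eventually reach BOTH backups,
        # so one dirty set is kept per backup.
        self.dirty = [set(), set()]
        self.current = 0                                     # backup used by the next checkpoint

    def update(self, page, value):
        self.mmdb[page] = value
        self.dirty[0].add(page)
        self.dirty[1].add(page)

    def checkpoint(self):
        """Flush pages that are dirty with respect to the current backup, then switch."""
        target = self.current
        flushed = sorted(self.dirty[target])
        for page in flushed:
            self.backups[target][page] = self.mmdb[page]
        self.dirty[target].clear()
        self.current = 1 - target
        return target, flushed
```

In this sketch a page updated once is written in two successive checkpoints, which mirrors the increased number of flushed pages, while the backup that is not being written always retains the previous copy of every page.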
2.2. Hybrid logging scheme
Under fuzzy checkpointing [4] in MMDB, the checkpointer flushes dirty pages without considering transaction activities. It may flush partially updated pages. Thus, physical logging is inevitable during checkpointing because only physical log records can reestablish a previous consistent database state without worrying about the current activities on the data [11]. However, during the time between two checkpointings, physical logging may not be necessary. In other words, logical logging can be used during the time when there is no flushing, i.e. the time between two consecutive checkpointings.
To support logical logging under fuzzy checkpointing, we need a mechanism that can reestablish the consistent database state where the logical log records were created. That is, the reestablished database state should be either a transaction-consistent or an action-consistent state. Otherwise, we cannot apply logical log records to the recovered checkpoint because the logical log records can only be effective when the database has a consistent state. In our scheme, we establish a transaction-consistent state from a fuzzy checkpoint by applying physical log records generated during the corresponding checkpointing, so it is possible to use logical logging during the interval between two fuzzy checkpointings. This is the basic idea of accommodating logical logging under fuzzy checkpointing. We refer to the approach as the Hybrid Logging Scheme.
[Fig. 1: commit processing model. Transaction Execution is followed by Commit_Work, which runs from Begin of Commit_Work to End of Commit_Work in five steps: (1) copy log to log buffer, (2) propagate to MMDB, (3) release shadow area, (4) release locks, (5) ensure WAL.]
Hybrid Logging Scheme: Write physical log records during checkpointing, and use logical log records otherwise.
2.3. Commit processing
When shadow updating policy is employed, the data updated by a transaction are first written to the shadow area, not to the MMDB area in place. The updated data are propagated to the MMDB area at the commit time. Fig. 1 shows the commit processing model in our recovery method, which is similar to the pre-commit scheme in Ref. [2].
Before reaching the beginning of its commit_work, a transaction either aborts or completes its normal operations. All the updated data are temporarily stored in the shadow area, and logical log records are stored in a temporary area called private log space. After the completion of all the normal operations, a transaction begins the commit processing that consists of five steps described as follows.
[Step 1] At the beginning of its commit_work, either the physical log records or logical log records of the transaction are copied to the current log page. When checkpointing is ongoing, physical log records are copied. Otherwise, logical log records are copied. Note that the physical log records can be obtained from the shadow area, and logical log records can be obtained from the private log space. The current log page is locked at the start of this step and unlocked at its end.
[Step 2] The updated data in the shadow area are propagated to the MMDB.
[Steps 3 and 4] The pages used in the shadow area and all the acquired locks on the data items are released. This is strict two-phase locking (S2PL), which guarantees the serializable execution of transactions in commit order at restart [11].
[Step 5] If the log page containing the log records of the transaction has not been yet flushed, the transaction waits until the log page is flushed; the end of this step is the completion of the commit_work.
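The five steps can be lined up in a short single-threaded sketch. Everything named here is hypothetical: the transaction is modeled as a dictionary with `shadow`, `private_log` and `locks` entries, the log page as a list, and `flush_log_page` as a callback that stands in for waiting on the log disk. Locking of the log page in Step 1 is omitted.

```python
def commit_work(txn, mmdb, log_page, checkpoint_active, flush_log_page):
    """Run the five commit steps for one transaction (single-threaded sketch)."""
    # Step 1: copy physical or logical log records to the current log page.
    if checkpoint_active:
        # Physical records are derived from the after-images in the shadow area.
        records = [("physical", page, image) for page, image in txn["shadow"].items()]
    else:
        # Logical records were collected in the private log space during execution.
        records = [("logical", op) for op in txn["private_log"]]
    log_page.extend(records)
    # Step 2: propagate the shadow copies to the MMDB area.
    mmdb.update(txn["shadow"])
    # Step 3: release the pages used in the shadow area.
    txn["shadow"] = {}
    # Step 4: release all locks held by the transaction.
    txn["locks"].clear()
    # Step 5: the commit completes only once the log page is on the log disk.
    flush_log_page(log_page)
    return records
```

Because only redo information is written and MMDB is touched only inside `commit_work`, an aborted transaction in this model never needs undo processing.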
During normal transaction execution, logical log records are produced for the operations of the transaction. These logical log records are stored in the private log space of the transaction. The private log space reduces contention for the log buffer and ensures that only the log records of committing transactions are written to the log buffer. This makes the log applying step at restart simple and fast.
At Step 1 of the commit processing, we first lock the current log page in the log buffer and then determine the type of log. If checkpointing is active, the physical log records are written to the log page; otherwise, the logical log records in the private log space are used. To indicate whether checkpointing is active, we use a global checkpointing flag, chkpt_flag. The flag is maintained by the checkpointer; it is set when the checkpointer writes the beginning record, BC, of checkpointing to the log buffer and reset when the checkpointer writes the ending record, EC. As the lock of the current log page is the synchronization point between transactions and the checkpointer, the checkpointer cannot start (or finish) its work while a transaction is in the middle of Step 1; that is, the checkpointer cannot write its BC (or EC) record to the log buffer, because a transaction holds a lock on the current log page for recording its log data. Moreover, under S2PL, after transaction T has read or written an object x, no other transaction can access x in a conflicting mode until after T has committed or aborted [11]. This guarantees that only one transaction at a time can update an object. Thus, overlapping of physical and logical logging does not occur.
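A minimal sketch of this synchronization follows, assuming a single current log page protected by one lock. The class `LogBuffer` and its method names are hypothetical; only `chkpt_flag`, BC and EC come from the text.

```python
import threading


class LogBuffer:
    """Current log page guarded by one lock shared by transactions and the checkpointer."""

    def __init__(self):
        self.lock = threading.Lock()
        self.page = []            # records in the current log page
        self.chkpt_flag = False   # True between the BC and EC records

    def write_commit_records(self, physical, logical):
        # Step 1: the flag is read and the records are written under the same lock,
        # so a BC or EC record cannot land in the middle of a transaction's records.
        with self.lock:
            kind = "physical" if self.chkpt_flag else "logical"
            for rec in (physical if self.chkpt_flag else logical):
                self.page.append((kind, rec))
            return kind

    def begin_checkpoint(self):
        with self.lock:
            self.page.append(("BC", None))
            self.chkpt_flag = True

    def end_checkpoint(self):
        with self.lock:
            self.page.append(("EC", None))
            self.chkpt_flag = False
```

Since the flag changes only while the lock is held, every transaction's records in this model are entirely physical or entirely logical.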
2.4. Checkpointing
When a new fuzzy checkpoint begins, the checkpointer writes a BC record to the current log page, and then flushes dirty data pages to the backup databases without considering locks and other transaction activities. When finishing the backup work, the checkpointer writes an EC record to the current log page, and flushes the log pages to log disks. After the log page with the EC record is written to disk, the checkpointer finally records the position of the BC record at the well-known location on disk. This is a normal fuzzy checkpointing process [5].
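The ordering of the checkpointer's actions can be sketched as below. This is an illustration under simplifying assumptions, not the procedure of Ref. [5]: lists and dictionaries stand in for the log buffer, log disk, backup database and the well-known disk location, and the function and parameter names are hypothetical.

```python
def fuzzy_checkpoint(mmdb, dirty_pages, backup, log_page, log_disk, anchor):
    """One fuzzy checkpoint: BC, flush dirty pages, EC, force the log, record BC position."""
    # 1. Write BC to the current log page and remember its position in the log.
    bc_position = len(log_disk) + len(log_page)
    log_page.append("BC")
    # 2. Flush dirty pages without regard to locks or transaction activity.
    for page in sorted(dirty_pages):
        backup[page] = mmdb[page]
    dirty_pages.clear()
    # 3. Write EC and force the log pages to the log disk.
    log_page.append("EC")
    log_disk.extend(log_page)
    log_page.clear()
    # 4. Only after EC is on disk, record where BC is at the well-known location.
    anchor["bc_position"] = bc_position
    return bc_position
```

Recording the BC position last means that a crash before step 4 leaves the anchor pointing at the previous complete checkpoint.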
There is one consideration related to the redo point for recovery when using the hybrid logging scheme. The redo point is the first log record to be applied to the reloaded database for recovery. As we do not consider the quiescence of transactions, some transactions may be updating MMDB (i.e. propagating pages in the shadow area to MMDB) at the BC, and hence partially updated pages can be flushed to the backup databases. Then, the redo point for database recovery is in the oldest log page among the log pages of transactions that are updating MMDB at the BC. That is, the redo point can be placed in front of the BC.
[Fig. 2: timeline of transactions T1 and T2, each with exec, logging and update phases, shown relative to BC, EC, the redo point and the crash.]
The problem related to the redo point is the type of log record of a transaction that updates pages at the BC; in our scheme the type is logical. This is because in the hybrid logging scheme the transaction obeys logical logging during the non-checkpointing interval. As an example, consider the situation in Fig. 2. Since the log records of transactions T1 and T2 are recorded before BC, the log records are logical. If the partially updated pages of T1 and T2 are flushed to the backup databases during checkpointing, we may not be able to recover those pages. The reason is that the logical log records cannot be applied to the partially updated pages.
Our solution for this problem is to delay backup processing until transactions T1 and T2 finish MMDB updating. By delaying the backup start time, the checkpointer can avoid the backup of partially updated pages whose corresponding logs are logical. Fig. 3 illustrates the concept of the delayed backup strategy with four transactions T1, T2, T3, and T4. The checkpointer does not begin the backup right after recording the BC, but delays the backup until all the transactions updating MMDB at the BC finish the updates.
In the figure, Tbackup is the starting point of the backup work. Some pages partially updated by T3 and T4 may be written to a backup database. However, consistent states of those pages can be reestablished because the type of log record of T3 and T4 is physical. Since the current log page in the log buffer is the synchronization point, the BC cannot be recorded to the log buffer during the logging of a transaction, i.e. until the end of logging by T1 in Fig. 3.
We can easily implement the delayed backup strategy by using an array variable num_updating_tr[ ]; num_updating_tr[k] counts the number of transactions that record their last log data (i.e. the commit of transaction) to log page k, but have not finished updating MMDB yet. The size of the array is the number of pages in the log buffer. A transaction first acquires the lock for log page k, and stores its log records to that log page. Then, it increases num_updating_tr[k] by one before releasing the lock on the log page. After the transaction finishes updating MMDB, num_updating_tr[k] is decreased by one. When num_updating_tr[k] is zero, it means that all the transactions whose log records were written to log page k have finished their MMDB updating steps. By using this variable, the checkpointer can determine the backup start time. That is, if BC is stored in log page i, the backup processing begins when num_updating_tr[k] is zero for all k ≤ i.
Fig. 3. The concept of delayed backup (timeline of T1 to T4 showing exec, logging and update phases relative to BC, Tbackup, EC and the redo point).
Fig. 4 shows the proposed fuzzy checkpointing procedure with the delayed backup strategy. After writing BC to the current log page i, the checkpointer waits while tail ≤ i. The variable tail points to the oldest log page with one or more MMDB updating transactions, i.e. those transactions that copied their logical log records in private log spaces to the log buffer, but have not finished propagating shadow pages to the MMDB.
Suppose BC is written to log page i and tail points to log page k where k < i. A transaction that recorded its log data to log page k decreases num_updating_tr[k] by one after finishing MMDB updating. If num_updating_tr[k] becomes zero, tail is set to the next log page whose num_updating_tr[ ] is not zero. The checkpointer begins backup processing if tail > i. Note that since all the updates at MMDB are processed in the main memory without disk I/Os, the delay time is very small. The analysis on this matter will be described in Section 5.2.
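The bookkeeping around num_updating_tr[ ] and tail can be sketched as follows. The class and method names are hypothetical, and two simplifications are assumed: log pages are numbered 0 to n-1 without wrap-around, and the value n is used for tail when no page has a pending updater. In a real system these calls would run under the log page lock.

```python
class DelayedBackup:
    """Tracks, per log page, transactions that logged their commit but still update MMDB."""

    def __init__(self, num_log_pages):
        self.num_updating_tr = [0] * num_log_pages
        self.tail = num_log_pages   # sentinel: no log page has a pending updater

    def logged(self, k):
        """A transaction wrote its log records to page k and starts updating MMDB."""
        self.num_updating_tr[k] += 1
        self.tail = min(self.tail, k)

    def finished_update(self, k):
        """A transaction that logged to page k finished propagating to MMDB."""
        self.num_updating_tr[k] -= 1
        if k == self.tail and self.num_updating_tr[k] == 0:
            # Advance tail to the next page that still has pending updaters.
            n = len(self.num_updating_tr)
            self.tail = next(
                (j for j in range(k + 1, n) if self.num_updating_tr[j] > 0), n
            )

    def backup_may_start(self, i):
        """BC is in log page i: the backup may start once tail > i."""
        return self.tail > i
```

The condition `tail > i` is equivalent to num_updating_tr[k] being zero for every k ≤ i, because tail always names the oldest page with a nonzero counter.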
By using the delayed backup strategy, the BC of the last complete checkpoint becomes the redo point for database recovery. That is, the delayed backup strategy guarantees that a transaction which wrote its log records to the log buffer before BC completes its MMDB updates before backup processing begins. Thus, only log records that are written to the log buffer after the BC need to be applied for recovery. Note that we do not have to search for the redo point because the position of BC is recorded at a well-known location on disk. This property is the basis of the log applying rule described in Section 3.
Under shadow updating, only transactions in the commit_work can update MMDB, which prevents the partial undo of a transaction and generates only redo log records. In the proposed method, each transaction writes all of its log records to the log buffer at its logging time (Step 1 in Fig. 1). Therefore, by using the delayed backup strategy together with shadow updating and the private log space, a transaction-consistent database state can be effectively reestablished after only log records generated during checkpointing are applied to the last complete fuzzy checkpoint.
2.5. Extension to consecutive checkpointing
There has been some research on consecutive fuzzy checkpointing [5,10], where the EC of a checkpoint becomes the BC of the next checkpoint. In this way, the checkpointer is always active. The proposed hybrid logging scheme can also be applied to consecutive checkpointing by partitioning MMDB into several segments and checkpointing the segments in a round-robin fashion. A segment consists of one or more pages. Every database object (relation, index, etc.) is stored in a segment.
In the segmented MMDB, while the checkpointer is flushing dirty pages in segment i, we use physical logging for objects in segment i and logical logging for objects in the other segments. The scheme is stated as follows.
Hybrid Logging Scheme for Segmented MMDB: Write physical log records for objects in the segment that is under checkpointing, and use logical log records for objects in the other segments.
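The per-segment rule amounts to a single comparison, sketched below together with the round-robin order in which segments are checkpointed. The function names and the 1-based segment numbering are assumptions made for illustration.

```python
def choose_log_type(object_segment, checkpointing_segment):
    """Return the log record type for an update to an object in `object_segment`."""
    return "physical" if object_segment == checkpointing_segment else "logical"


def round_robin(num_segments, start=1):
    """Yield segment ids 1..N forever, in the order the checkpointer visits them."""
    seg = start
    while True:
        yield seg
        seg = seg % num_segments + 1
```

With N segments and uniform access, only about one update in N is logged physically at any moment under this rule, which is where the log size reduction analyzed in Section 5 comes from.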
The ping-pong backup policy is used in the segmented MMDB as well. The checkpointer flushes dirty pages in each segment to one of two backup databases in a round-robin fashion. In order to adjust the redo point of each segment, we also apply the delayed backup strategy to the checkpointing of each segment. Before flushing dirty pages in a segment, the checkpointer delays the backup timing point. A consecutive checkpointing procedure on the segmented MMDB is given in Fig. 5. The $BC_{segid}$ record indicates the start of the checkpoint of segment segid as well as the end of the checkpoint of the previous segment. Whenever the checkpointer begins the backup of segment segid, it records the position of $BC_{segid}$ at the well-known location on disk, as in the case of the non-segmented MMDB.
3. Recovery processing
Since the non-segmented MMDB can be considered as a special case of the segmented MMDB with one segment, we only describe the recovery procedure for the segmented MMDB. The recovery processing consists of two phases: reloading the backup database and applying the log. In the reloading phase the last complete backed-up database is restored in main memory, and in the log applying phase log data are applied to the reloaded database. As described in the previous section, the redo point of segment i is $BC_i$ when the delayed backup strategy for each segment is used. To reestablish a consistent database state, we have to first determine the last complete checkpoint. Consider Fig. 6, which shows consecutive checkpointing in MMDB with four segments. In the non-segmented MMDB, Checkpoint1 is the last complete checkpoint. However, the backed-up database after checkpointing segment 1 in Checkpoint2 already includes the updated database images for all the log records about segment 1 generated during Checkpoint1; the delayed backup strategy guarantees this. The same argument can also be applied to segments 2 and 3. Therefore, the last complete checkpoint for recovery is Checkpoint2. This approach reduces the amount of log data required for recovery. Note that a similar approach has been proposed in Ref. [16], which, however, uses stable memory as the log buffer. Based on this approach, the size of log data to be read from disks can be further reduced in the proposed hybrid logging scheme.
When the backup database is reloaded into memory, the log records generated after the beginning of the last complete checkpoint, i.e. $BC_4$ in the example of Fig. 6, are applied to the reloaded database. At this time, we do not have to apply all the physical and logical log records to the reloaded backup database. This is because backup processing of segment i begins only after all transactions that wrote their log records before $BC_i$ finish their updates, according to the delayed backup strategy. This means that the pages backed up during checkpointing of segment i have all the after-images for the objects in segment i whose corresponding log records were generated before $BC_i$. Thus, we do not have to consider log records stored before $BC_i$ in recovery for segment i.
For example, consider Fig. 7, which shows the log records in the last complete checkpoint of Fig. 6. Here, $L^j_i$ (or $P^j_i$) denotes logical (or physical) log records for the objects in segment i, generated during checkpointing segment j. The checkpointing of segment 4 begins with $BC_4$, and the log records generated during that checkpoint consist of physical log records for the objects in segment 4 (i.e. $P^4_4$) and logical log records for objects in the other segments (i.e. $L^4_1$, $L^4_2$ and $L^4_3$). Now, consider the checkpointed image of segment 1 after checkpointing segment 1, i.e. just before $BC_2$. It reflects all the updates represented by $L^4_1$. Therefore, $L^4_1$ need not be applied to the reloaded backup database at all. Likewise, $L^4_2$ and $L^1_2$ need not be applied because the checkpointed image of segment 2 already reflects all the updates represented by $L^4_2$ and $L^1_2$. In other words, we can recover a segment by applying only logical log records for the objects in the previously recovered segments and physical log records of its own segment to the reloaded backup database. Those log records are circled in Fig. 7. This log applying method reduces the number of log records for recovery processing. The log applying policy is described as follows.
Log Applying Rule: Consider the recovery of the segmented MMDB with N segments. We first have to establish a consistent database state from the last complete checkpoint. Suppose the last complete checkpoint was made on segments $S_{i_1}, S_{i_2}, \ldots, S_{i_N}$ in this order, where $(i_1, i_2, \ldots, i_N)$ is an arrangement of $(1, 2, \ldots, N)$ in the round-robin order. We scan the log from $BC_{S_{i_1}}$. While we scan the log records generated during checkpointing segment $S_{i_k}$ $(k = 1, \ldots, N)$, we apply physical log records for the objects in $S_{i_k}$, i.e. log records denoted by $P^{i_k}_{i_k}$, and apply only logical log records for the objects in segments $S_{i_1}, \ldots, S_{i_{k-1}}$, i.e. log records denoted by $L^{i_k}_q$ where q is $i_1, \ldots, i_{k-1}$. After establishing the consistent database state from the last complete checkpoint, apply to MMDB all log records of the remaining committed transactions.
Fig. 6. Consecutive checkpointing in MMDB with four segments (sequence of segment checkpoints 1, 2, 3, 4, 1, 2, 3, with Checkpoint1, Checkpoint2 and the crash point marked).
[Fig. 7: log records of the last complete checkpoint in log order. After $BC_4$: $L^4_1$, $L^4_2$, $L^4_3$, $P^4_4$. After $BC_1$: $P^1_1$, $L^1_2$, $L^1_3$, $L^1_4$. After $BC_2$: $L^2_1$, $P^2_2$, $L^2_3$, $L^2_4$. After $BC_3$: $L^3_1$, $L^3_2$, $P^3_3$, $L^3_4$.]
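The selection performed by the log applying rule can be sketched as a filter over the log of the last complete checkpoint. The function name and the tuple layout `(kind, ckpt_segment, object_segment)` are assumptions for illustration, where kind "P" or "L" and the two segment ids play the roles of the superscript and subscript in the notation above.

```python
def records_to_apply(order, log):
    """Select log records of the last complete checkpoint per the log applying rule.

    order: segment ids in checkpoint order, e.g. [4, 1, 2, 3].
    log:   list of (kind, ckpt_segment, object_segment) tuples in log order,
           starting at the BC of order[0]; kind is "P" (physical) or "L" (logical).
    """
    position = {seg: k for k, seg in enumerate(order)}
    selected = []
    for kind, ckpt_seg, obj_seg in log:
        if kind == "P" and obj_seg == ckpt_seg:
            # Physical records for the segment being checkpointed.
            selected.append((kind, ckpt_seg, obj_seg))
        elif kind == "L" and position[obj_seg] < position[ckpt_seg]:
            # Logical records for segments checkpointed earlier in this round.
            selected.append((kind, ckpt_seg, obj_seg))
    return selected
```

For the four-segment example with order 4, 1, 2, 3 and one record group per segment in every segment checkpoint, this filter keeps 10 of the 16 groups: the four physical groups and six logical ones.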
Note that in the segmented MMDB under conventional fuzzy checkpointing that permits only physical logging, the main idea of the above log applying rule can also be used if the delayed backup strategy is provided. In that case, the logical log records in the above description are replaced by the corresponding physical log records.
4. Discussion
4.1. Fuzzy checkpoint state
To recover the database, the last complete checkpoint first needs to be reestablished by applying log records from the redo point to the ending record of the checkpoint. In the case of the hybrid logging scheme, the reestablished database state should be either a transaction-consistent or an action-consistent state. Otherwise we cannot apply logical log records to the recovered checkpoint because the logical log records can only be effective when the database has a consistent state [11].
In our method, we achieve a transaction-consistent database state from the last complete checkpoint, despite fuzzy checkpointing, by using shadow updating and a private log space for each transaction. This is because these policies enable the log records of a transaction to be copied to the log buffer as a unit at commit time.
4.2. Consistency of logical logging
According to the hybrid logging scheme, the logging mode switches from logical to physical logging at checkpointing, i.e. from an object-level log record to a page-level log record. To bridge this gap, we should ensure that the redo result of logical log records is page-action consistent. In other words, a redo operation of a logical log record must have the same result as when it was executed in normal processing. There is little research related to page-action consistency of logical logging [7,17]. Ref. [7] proposes an abstract data-type modeling of a logical operation. Ref. [17] presents a physiological logging scheme with another form of logical log record including page number information. Our policy below is based on Ref. [7].
First, we consider locking granularity. To maintain page-action consistency, we should prevent more than one transaction from concurrently updating the same page. That is, executions have to be strict on the level of pages. S2PL can guarantee the serializable execution on a page in the order of committed transactions in the log if we use page-level locking granularity.
An operation that needs a new page allocation requires special care. The corresponding redo logical log record must contain not only the operation requesting the allocation, but also the allocated page number. During the execution step, we determine the page to which an operation applies and allocate a new page if one is needed. The page number of the newly allocated page is stored in the log record. In the MMDB updating step at commit_work, the updated data in the shadow area for the new page are reflected to MMDB. When the operation is re-executed at restart, the allocation module reads the page number from the log record and allocates that same page.
4.3. Shadow area size
The limited size of the shadow area may cause performance degradation. However, the requirements of normal transactions in general are confined to a small subset of the entire database entities [18]. Ref. [19], a study based on actual reference strings, indicates that the minimum cache size for the DB cache is about 100 cache pages each of 200KB. Moreover, since the shadow area contains only portions of updated pages with smaller shadow granules, the size of the shadow area would be much less than this, as described in Ref. [13]. Some approaches proposed in Ref. [13] can also be used to minimize the size of the shadow area.
4.4. Correctness
In this section, we argue that our recovery method can restore the database to a consistent state which guarantees serialization of committed transactions. Since we make use of S2PL with page-level granularity under shadow updating, no page may be read or overwritten until the transaction that previously wrote into it terminates, either by aborting or by committing [11]. Thus, the log data of all transactions are recorded in the log in execution order. This is because all the committed transactions release the locked pages after the steps of logging and MMDB updating in our transaction processing model. By using shadow updating and the private log space, we can record only redo log data of committed transactions; this avoids undo of transactions at restart and makes the applying step simple. A transaction can commit only when the log page containing its log records is flushed to the log disk. Furthermore, the delayed backup strategy guarantees that the log records written before BC do not have to be applied, because the backed-up database generated by the backup processing includes all the images of the log records as described in the previous section. This also implies that the beginning record of the last complete checkpoint is the redo point. Thus, the proposed recovery method restores the database to a consistent state guaranteeing the serialization order, by applying the log records of committed transactions to the last complete checkpoint, starting from the redo point.
5. Performance analysis
In this section, we analyze the performance of our proposed method on the segmented MMDB with consecutive checkpointing. The metrics are the size of log data generated in a complete checkpoint (ChkptLogSize), the size of log data to be applied for recovering the last complete checkpoint (ApplyLogSize), and recovery time. ChkptLogSize has a direct effect on the recovery time. ApplyLogSize is for measuring the effect of the log applying rule.
5.1. Reduced ratio of log data
5.1.1. Without hotspot
MMDB is partitioned into N segments. We assume that all the segments are uniformly accessed by transactions; that is, log records generated during the checkpoint of a segment consist of S log records for each segment. Therefore, SN log records are generated during a segment checkpoint. Let P be the size of a physical log record and L be the size of a logical log record.
When all the segments are uniformly accessed, ChkptLogSize under physical logging only is the size of the log generated during one segment checkpoint, $(SN)P$, summed over the N segment checkpoints; that is,
$$\mathrm{ChkptLogSize}_{physical} = (SN)P \cdot N = SN^{2}P.$$
If the hybrid logging scheme is used, the log records generated in a checkpoint consist of physical log records for objects in the checkpointed segment and logical log records for objects in the other segments. The size of log data generated during a segment checkpoint is $SP + S(N-1)L$. Thus, ChkptLogSize in the hybrid logging is
$$\begin{aligned}
\mathrm{ChkptLogSize}_{hybrid} &= (SP + S(N-1)L)\,N \\
&= SN(P + LN - L) \\
&= SN^{2}L + SNP - SNL.
\end{aligned} \tag{1}$$
Next, we analyze ApplyLogSize. When only a physical logging scheme is used and our log applying rule is not considered, ApplyLogSize in the segmented MMDB has the same complexity as $\mathrm{ChkptLogSize}_{physical}$, i.e. $SN^{2}P$.
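When the proposed log applying scheme is used, ApplyLogSize in physical logging is
$$\begin{aligned}
\mathrm{ApplyLogSize}_{\mathrm{physical}} &= SP + (SP + SP) + \cdots + \bigl(SP + SP(N-1)\bigr) \\
&= SP\,(1 + 2 + 3 + \cdots + N) \\
&= SP\,\frac{N(N+1)}{2} = \tfrac{1}{2}SN^{2}P + \tfrac{1}{2}SNP.
\end{aligned} \tag{2}$$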
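When our proposed hybrid logging scheme is considered, only logical log records are applied to the previous segments by using the log applying rule. Thus, ApplyLogSize in the hybrid logging is
$$\begin{aligned}
\mathrm{ApplyLogSize}_{\mathrm{hybrid}} &= SP + (SP + SL) + (SP + 2SL) + \cdots + \bigl(SP + (N-1)SL\bigr) \\
&= SPN + SL\,\bigl(1 + 2 + 3 + \cdots + (N-1)\bigr) \\
&= SPN + SL\,\frac{(N-1)N}{2} = \tfrac{1}{2}SN^{2}L + SPN - \tfrac{1}{2}SNL.
\end{aligned} \tag{3}$$

Table 1
Ratio of size of log data

| Case | Logging | Log applying rule | ChkptLogSize | ApplyLogSize |
|---|---|---|---|---|
| 1 | Physical | X | $1$ | $1$ |
| 2 | Physical | O | $1$ | $\frac{1}{2} + \frac{1}{2N}$ |
| 3 | Hybrid | O | $\frac{L}{P} + \frac{1}{N} - \frac{L}{NP}$ | $\frac{L}{2P} + \frac{1}{N} - \frac{L}{2NP}$ |

[Fig. 8: ratio of log size (vertical axis, 0 to 1) versus number of segments (horizontal axis, 0 to 50); curves: ApplyLogSize of case 2, ChkptLogSize of case 3, ApplyLogSize of case 3.]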
To evaluate the effect of the hybrid logging and the log applying scheme, Table 1 shows the ratio of the log size in hybrid logging to that in only physical logging; that is, Eqs. (1)–(3) are each divided by the log size in only physical logging, $\mathrm{ChkptLogSize}_{\mathrm{physical}}$. According to Ref. [5], we assume that L is 64 words and P is 192 words. The ratio of the size of log data with varying N is shown in Fig. 8.
The size of log data decreases as the number of segments increases; the segment-dependent part of each ratio in Table 1 is inversely proportional to the number of segments. This is because the portion of physical log records in a segment checkpoint is reduced as the number of segments is increased. Even with a small number of segments, the size of log data generated during a checkpoint can be reduced to about half of that of only physical logging. For ApplyLogSize, we can reduce the size of log data to be applied for recovery by more than half by using both the hybrid logging and the log applying scheme.
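The ratios of Table 1 can be checked by summing the per-segment log sizes directly instead of using the closed forms. The following is a minimal sketch under the uniform-access assumption above; the function names are illustrative and not part of the original method, and exact fractions are used so that the comparison with Table 1 is not blurred by rounding.

```python
from fractions import Fraction


def chkpt_log_size(S, N, P, L, hybrid):
    """Log size generated over one complete checkpoint (N segment checkpoints)."""
    if hybrid:
        # Physical records for the checkpointed segment, logical for the other N - 1.
        per_segment = S * P + S * (N - 1) * L
    else:
        # Every one of the S * N records is physical.
        per_segment = S * N * P
    return per_segment * N


def apply_log_size(S, N, P, L, hybrid):
    """Log size applied at restart when the log applying rule is used."""
    other = L if hybrid else P  # record size applied to previously checkpointed segments
    # k counts the segments checkpointed earlier within the same complete checkpoint.
    return sum(S * P + k * S * other for k in range(N))


def table1_ratios(N, P=192, L=64, S=1):
    """Ratios of Table 1, normalised by ChkptLogSize under physical logging only."""
    base = Fraction(chkpt_log_size(S, N, P, L, hybrid=False))
    return {
        "case2_apply": apply_log_size(S, N, P, L, hybrid=False) / base,
        "case3_chkpt": chkpt_log_size(S, N, P, L, hybrid=True) / base,
        "case3_apply": apply_log_size(S, N, P, L, hybrid=True) / base,
    }


print(table1_ratios(4))
```

For N = 4 with L = 64 and P = 192, the three ratios come out as 5/8, 1/2 and 3/8, which agrees with the closed forms of cases 2 and 3 in Table 1.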
5.1.2. With hotspot
In this section, we consider hotspots of the database and measure only ChkptLogSize. We assume that a fraction $f_H$ of all database pages receives a fraction $(1 - f_H)$ of the accesses, e.g. the 20–80 rule.
Let H be the number of log records generated during a complete checkpoint of all segments. First, suppose the database is partitioned into two segments, hotspot and non-hotspot. Then, the generation rate of dirty pages in the hotspot segment may be equal to the access rate of the hotspot segment, $(1 - f_H)$. Thus, the number of log records generated during the checkpoint of the hotspot segment is $(1 - f_H)H$. When a uniform distribution over accessed positions is considered, a $(1 - f_H)$ portion of the $(1 - f_H)H$ log records is related to objects in the hotspot segment. Thus, $(1 - f_H)(1 - f_H)H$ log records are physical log records, and the remaining $f_H(1 - f_H)H$ log records are logical log records. A similar computation is also applied to the non-hotspot segment. Then, ChkptLogSize with the two segments is
$$\begin{aligned}
\mathrm{ChkptLogSize}_{2seg} &= (1 - f_H)(1 - f_H)HP + f_H(1 - f_H)HL + f_H f_H HP + f_H(1 - f_H)HL \\
&= (1 - f_H)H\bigl((1 - f_H)P + f_H L\bigr) + f_H H\bigl(f_H P + (1 - f_H)L\bigr).
\end{aligned}$$
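The two-segment case can be traced by counting the log records of each type separately. The sketch below is only an illustration of the expression above; the function name and the sample value H = 1000 are assumptions made for the example.

```python
def chkpt_log_size_two_segments(H, f_h, P, L):
    """ChkptLogSize for one hotspot and one non-hotspot segment.

    H   : log records generated during a complete checkpoint
    f_h : fraction of pages in the hotspot (receiving 1 - f_h of the accesses)
    P, L: sizes of a physical and a logical log record
    """
    # Records logged while each segment is being checkpointed.
    hot_records = (1 - f_h) * H
    cold_records = f_h * H

    # While the hotspot segment is checkpointed, the (1 - f_h) share of records
    # touching it is physical; the f_h share touching the other segment is logical.
    hot_size = hot_records * ((1 - f_h) * P + f_h * L)

    # Symmetric case for the non-hotspot segment.
    cold_size = cold_records * (f_h * P + (1 - f_h) * L)

    return hot_size + cold_size


size = chkpt_log_size_two_segments(H=1000, f_h=0.2, P=192, L=64)
print(size, size / (1000 * 192))
```

With H = 1000, f_H = 0.2, P = 192 and L = 64 this gives about 151040 words, roughly 79% of the 192000 words that physical logging alone would produce.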
We extend the above idea to N segments: $\lceil N f_H \rceil$ hotspot segments and $(N - \lceil N f_H \rceil)$ non-hotspot segments. Let $N_H$ be $\lceil N f_H \rceil$. We assume every hotspot (or non-hotspot) segment has the same access rate. Thus, the number of log records generated during the checkpoint of a hotspot segment is
$$(1 - f_H)\,\frac{H}{N_H}.$$
Among these log records, only a $(1 - f_H)/N_H$ portion is related to objects in the segment that is under checkpointing, and these log records are physical. The log records for the remaining hotspot segments and all non-hotspot segments are logical. Thus, the size of log data generated during the checkpoint of a hotspot segment is
$$\frac{(1 - f_H)H}{N_H}\left[\frac{(1 - f_H)}{N_H}P + \frac{(1 - f_H)}{N_H}(N_H - 1)L + f_H L\right]. \tag{4}$$
Similarly, the size of log data for a non-hotspot segment is given by
$$\frac{f_H H}{N - N_H}\left[\frac{f_H}{N - N_H}P + \frac{f_H}{N - N_H}(N - N_H - 1)L + (1 - f_H)L\right]. \tag{5}$$
Since there are $N_H$ hotspot segments and $(N - N_H)$ non-hotspot segments, the size of log data generated during the last complete checkpoint under the segmented MMDB, $\mathrm{ChkptLogSize}_{Nseg}$, is the sum of Eqs. (4) and (5) multiplied by $N_H$ and $(N - N_H)$, respectively.
Fig. 9. Effects of reduced log under hotspots. [Plot: ratio of log size (vertical axis, 0.3 to 1) versus number of segments (horizontal axis, 0 to 50); curves for hotspot rates of 20%, 30%, 40% and 50%.]

Table 2
Parameters and their defaults

| Symbol | Meaning | Default |
|---|---|---|
| $S_{db}$ | Database size | 512M words |
| $S_{lpg}$ | Log page size | 1024 words |
| $S_{page}$ | Page size | 8K words |
| $S_{rec}$ | Record size | 32 words |
| $S_{op}$ | Logical log entry size | 32 words |
| $S_{init}$ | Log header size | 32 words |
| $M$ | Concurrent transactions | 100 |
| $T_{rate}$ | Transaction arrival rate | 1000 TPS |
| $T_{seek}$ | Average seek time | 0.008 s |
| $T_{latency}$ | Average rotation time | 0.00417 s |
| $T_{transfer}$ | Average transfer time | 0.000313 s/page |
| $N_{bdisks}$ | Number of backup disks | 20 |
| $N_{act}$ | Actions per transaction | 5 |
| $P_{abort}$ | Abort probability | 0.05 |
| $f_H$ | Fraction of hotspot | 0.2 |
| $f_{hotact}$ | Fraction of actions to hotspot | $1 - f_H$ |

The size of log data in only physical logging is HP. Thus, the ratio of the log size with the hybrid logging to that of only physical logging is
$$(1 - f_H)\left[\frac{(1 - f_H)}{N_H} + \frac{(1 - f_H)}{N_H}(N_H - 1)\frac{L}{P} + f_H\frac{L}{P}\right] + f_H\left[\frac{f_H}{N - N_H} + \frac{f_H}{N - N_H}(N - N_H - 1)\frac{L}{P} + (1 - f_H)\frac{L}{P}\right]. \tag{6}$$
If we consider an equal access rate for each segment, $f_H = 1/2$, the result of Eq. (6) is
$$\frac{1}{N} + \frac{L}{P} - \frac{L}{NP},$$
which is the same as ChkptLogSize of case 3 in Table 1. Fig. 9 shows the ratio of $\mathrm{ChkptLogSize}_{Nseg}$ with varying N for some hotspot rates, with L of 64 words and P of 192 words. As we have more segments, the portion for physical logging is smaller. Thus, the whole size of log data is reduced. From this result, we see that the size of log data is minimum at a 50% hot spot rate. A 50% hot spot rate means all the segments are accessed uniformly by transactions. So, this result indicates that for MMDB segmentation, the access rate of segments is a more important partition factor than the size of the segments.
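The ratio of Eq. (6) is easy to evaluate for a given number of segments and hotspot rate. The sketch below is an illustration only: the function name is hypothetical, and the small tolerance inside the ceiling is an implementation detail added here to keep floating-point products such as N·f_H from rounding up spuriously.

```python
import math


def hybrid_ratio(N, f_h, P=192, L=64):
    """Ratio of ChkptLogSize (hybrid, N segments) to HP, following Eq. (6)."""
    n_hot = math.ceil(N * f_h - 1e-9)   # N_H = ceil(N * f_H), guarded against float noise
    n_cold = N - n_hot
    if n_hot < 1 or n_cold < 1:
        raise ValueError("need at least one hotspot and one non-hotspot segment")

    r = L / P
    # Contribution of the N_H hotspot segment checkpoints.
    hot = (1 - f_h) * (
        (1 - f_h) / n_hot + (1 - f_h) / n_hot * (n_hot - 1) * r + f_h * r
    )
    # Contribution of the N - N_H non-hotspot segment checkpoints.
    cold = f_h * (
        f_h / n_cold + f_h / n_cold * (n_cold - 1) * r + (1 - f_h) * r
    )
    return hot + cold


for f_h in (0.2, 0.3, 0.4, 0.5):
    print(f_h, round(hybrid_ratio(10, f_h), 4))
```

For N = 10, `hybrid_ratio(10, 0.5)` evaluates to about 0.4, which equals 1/N + L/P − L/(NP) from case 3 of Table 1, while `hybrid_ratio(10, 0.2)` gives about 0.55.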
5.2. Recovery time
5.2.1. Parameters
The parameters for analyzing the recovery time consist of the size of the database, the size of log-related structures, disk I/O time, the ratio of abort, etc. Table 2 shows these parameters and their default values. They are derived from Refs. [5] and [10].
5.2.2. Delay time
In our recovery method, checkpointing time increases by the delay time of the delayed backup strategy. Before measuring the recovery time, we analyze the delay time. After writing the BC record in the current log page, the checkpointer waits until all transactions in the updating step of commit_work finish their MMDB updating, rather than waiting until they commit. So, disk I/O is not involved in the delay, and the delay time is proportional to the amount of updating work and the number of transactions. The maximum delay time can therefore be no more than the sum of the updating times in a serial execution of the maximum number of concurrent transactions. Given the parameters in Table 2, the number of words updated by these transactions is $S_{rec} N_{act} M$. If reading or writing a word takes one CPU instruction, the maximum delay time equals the time of $S_{rec} N_{act} M$ instructions. Thus, with 100 MIPS of CPU processing power, the maximum delay time is about 0.00016 s. Since no disk I/O occurs during the delay, it is so small compared with the inter-checkpoint interval that it has little influence on that interval. Therefore, we do not consider the delay time in the checkpoint interval for the recovery time analysis.
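The bound on the delay time is a one-line calculation. As a quick check, the following sketch reproduces it from the Table 2 defaults, assuming as in the text one instruction per word and a 100 MIPS processor; the function name is illustrative.

```python
def max_delay_seconds(s_rec=32, n_act=5, m=100, mips=100):
    """Upper bound on the checkpoint delay: serial MMDB updating of M transactions."""
    words_updated = s_rec * n_act * m          # S_rec * N_act * M words
    instructions_per_second = mips * 1_000_000
    return words_updated / instructions_per_second


print(max_delay_seconds())
```

The 16000 updated words correspond to 16000 instructions, i.e. 0.00016 s at 100 MIPS.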
5.2.3. Analysis
The recovery time consists of the backup database reloading time, the log page reading time, and the log applying time. For simplicity, the recovery time of the last complete checkpoint is considered. The time to read the backup database, $T_{back}$, is
$$T_{back} = \frac{S_{db}}{S_{page}}\,(T_{seek} + T_{latency} + T_{transfer}).$$
The size of log data for recovering the last complete checkpoint, $S_{log}$, is
$$S_{log} = (1 - P_{abort})\,T_{rate}\,t_{icp}\,D_{redo},$$
where $t_{icp}$ is the inter-checkpoint interval and $D_{redo}$ is the size of redo log data per transaction. If only physical logging is used, $D_{redo} = S_{init} + S_{rec} N_{act}$. When only logical logging is used, we simply assume $D_{redo} = S_{op} + S_{init}$. The $D_{redo}$ of the hybrid logging is calculated according to the result of Fig. 9.
$t_{icp}$ is the period between the beginnings of consecutive checkpoints and is determined by the number of dirty pages and the I/O capability. According to Ref. [10], the expected number of dirty pages generated during time t, $N_{dirty}(t)$, is
$$N_{dirty}(t) = \left[1 - \left(1 - \frac{R_{spa}}{N_{page}}\right)^{f_{hotact} N_{act} T_{rate} t}\right] f_H N_{page} + \left[1 - \left(1 - \frac{R_{spa}}{N_{page}}\right)^{(1 - f_{hotact}) N_{act} T_{rate} t}\right] (1 - f_H) N_{page},$$
where $N_{page}$ is $S_{db}/S_{page}$. Since we use the ping-pong backup policy, the number of dirty pages to be flushed during time t, $N_{flush}(t)$, is $N_{dirty}(2t_{icp})$.
[Fig. 10: recovery time in seconds (vertical axis, 825 to 850) versus number of segments (horizontal axis, 0 to 50); curves: Physical, Hybrid - 20, Hybrid - 50, Logical.]

According to Ref. [5], the number of pages that can be written out to the disks during time t, $N_{io}(t)$, is given by
$$N_{io}(t) = N_{bdisks}\,\frac{t}{T_{seek} + T_{latency} + T_{transfer}}.$$
By setting
$$N_{flush}(t) = N_{io}(t), \tag{7}$$
we find the minimum $t_{icp}$.
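Eq. (7) has no closed-form solution, but it can be solved numerically. The sketch below uses plain bisection and is an illustration only. Two inputs are assumptions made here: $R_{spa}$ has no default in Table 2, so the value 2.2 is picked because it brings `n_dirty(1.0)` close to the $N_{dirty}(1)$ values reported in Table 3, and `window_factor` encodes how far back dirty pages are counted (2 follows the ping-pong statement above). The function names are hypothetical.

```python
def n_dirty(t, n_page, r_spa, f_h=0.2, f_hotact=0.8, n_act=5, t_rate=1000.0):
    """Expected number of dirty pages after t seconds (formula attributed to Ref. [10])."""
    untouched = 1.0 - r_spa / n_page
    hot = (1.0 - untouched ** (f_hotact * n_act * t_rate * t)) * f_h * n_page
    cold = (1.0 - untouched ** ((1.0 - f_hotact) * n_act * t_rate * t)) * (1.0 - f_h) * n_page
    return hot + cold


def n_io(t, n_bdisks=20, t_seek=0.008, t_latency=0.00417, t_transfer=0.000313):
    """Pages that the backup disks can write in t seconds."""
    return n_bdisks * t / (t_seek + t_latency + t_transfer)


def solve_t_icp(n_page, r_spa, window_factor=2.0, tol=1e-9):
    """Smallest positive t with N_dirty(window_factor * t) = N_io(t), by bisection."""
    def gap(t):
        return n_dirty(window_factor * t, n_page, r_spa) - n_io(t)

    lo = 1e-6
    if gap(lo) <= 0.0:
        return 0.0              # the disks keep up for any interval
    hi = 1.0
    while gap(hi) > 0.0:        # N_dirty saturates at n_page, so the gap turns negative
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if gap(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


N_PAGE = (512 * 2**20) // (8 * 2**10)   # S_db / S_page, reading M and K as binary prefixes
print(n_dirty(1.0, N_PAGE, r_spa=2.2))
print(solve_t_icp(N_PAGE, r_spa=2.2))
```

Under these assumptions `n_dirty(1.0, N_PAGE, 2.2)` evaluates to about 3378 pages for the default database size, in line with the first row of Table 3. The solved interval depends directly on the assumed `r_spa` and `window_factor`, so it should be read as an illustration of the procedure rather than as a reproduction of the reported values.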
In general, since disk reading and CPU processing can be overlapped, and since disk I/O time is much larger than CPU processing time, the log page reading time may be regarded as the total log processing time. Moreover, due to the locality of the log, that is, the sequential file, the actual average seek time may be only 25–33% of the given number ($T_{seek}$) [20]. Thus, the time to read the log, $T_{readlog}$, is
$$T_{readlog} = \frac{S_{log}}{S_{lpg}}\,(0.3\,T_{seek} + T_{latency} + T_{transfer}).$$
Fig. 10 shows the result of analyzing the recovery time, i.e. $T_{back} + T_{readlog}$. The recovery time of logical logging (Logical) is the ideal case, because we cannot use only logical logging with fuzzy checkpointing in MMDB. Our hybrid logging scheme (Hybrid) performs better than physical logging (Physical). As having more segments makes the portion of physical log data smaller, the recovery time of the hybrid logging converges to that of logical logging. With 20 segments, the recovery time gap between physical logging and the hybrid logging is about 20 s at a 20% hot spot rate. This is not a small amount of time in high-volume transaction processing: at the 1000 TPS rate, 20000 transactions can be processed during the gap.
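The two terms of the recovery time can be put together in a few lines. The sketch below is a minimal illustration using the Table 2 defaults, covering only the physical-only and logical-only bounds, since $D_{redo}$ for hybrid logging depends on the ratio read from Fig. 9. It assumes binary prefixes for M and K, and it takes $t_{icp}$ = 27.87 s from the 0.5 G words row of Table 3, on the assumption that this row corresponds to the 512M-word default database. All names are illustrative.

```python
# Defaults from Table 2 (sizes in words, times in seconds).
S_DB = 512 * 2**20
S_PAGE = 8 * 2**10
S_LPG = 1024
S_REC, S_OP, S_INIT = 32, 32, 32
N_ACT = 5
T_RATE = 1000.0
P_ABORT = 0.05
T_SEEK, T_LATENCY, T_TRANSFER = 0.008, 0.00417, 0.000313


def t_back():
    """Time to reload the backup database, one random I/O per page."""
    return (S_DB / S_PAGE) * (T_SEEK + T_LATENCY + T_TRANSFER)


def t_readlog(d_redo, t_icp):
    """Time to read the log of the last complete checkpoint."""
    s_log = (1.0 - P_ABORT) * T_RATE * t_icp * d_redo
    # Sequential log access: only about 30% of the average seek time is charged.
    return (s_log / S_LPG) * (0.3 * T_SEEK + T_LATENCY + T_TRANSFER)


def recovery_time(d_redo, t_icp):
    return t_back() + t_readlog(d_redo, t_icp)


D_REDO_PHYSICAL = S_INIT + S_REC * N_ACT   # header plus one record image per action
D_REDO_LOGICAL = S_OP + S_INIT             # one logical entry plus header
T_ICP = 27.87                              # assumed: Table 3, 0.5 G words row

for name, d_redo in (("physical", D_REDO_PHYSICAL), ("logical", D_REDO_LOGICAL)):
    print(name, round(recovery_time(d_redo, T_ICP), 2))
```

The hybrid scheme falls between these two bounds; scaling `D_REDO_PHYSICAL` by a ratio read from Fig. 9 would be one way to approximate it, though that step is not spelled out in the text.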
Fig. 11 shows the recovery time with varying transaction arrival rates ($T_{rate}$) and numbers of pages per action ($R_{spa}$) when the number of segments is 20. As the arrival rate of transactions and the number of pages per action increase, more dirty pages are produced, which increases the number of corresponding log records. In conventional fuzzy checkpointing, since these log records are physical, the increased number of log records affects the recovery time significantly. On the other hand, in our proposed hybrid logging scheme, since the portion of physical logging is considerably reduced, the influence of the increased number of log records is much smaller than in conventional fuzzy checkpointing.
Finally, we have analyzed the log processing time with varying sizes of databases. Fig. 12 shows the result. Since the time to read the backup database is fixed for a given database size, we compare recovery time with respect to log processing time. As in the previous results, our proposed scheme gives better performance than conventional fuzzy checkpointing based on physical logging. Here, even though the database size becomes larger, the number of pages to be flushed does not increase in proportion to the database size. This is because the parameters about update rates, e.g. $T_{rate}$ and $R_{spa}$, are fixed. However, a small increase in the number of flushed pages has a great influence on determining $t_{icp}$ in Eq. (7). The related data are described in Table 3, which shows the changes of $t_{icp}$ and the number of pages to be flushed per second with varying sizes of databases. Due to the increase of the checkpointing interval, the recovery time increases in proportion to the database size.

Fig. 11. Recovery time at various $T_{rate}$ and $R_{spa}$. [Two plots of recovery time in seconds: one versus transaction arrival rate (200 to 2000 trans./sec), one versus pages per action (0.2 to 2); curves: Physical, Hybrid - 20, Hybrid - 50, Logical.]

Fig. 12. Log processing time at various database sizes. [Plot: log processing time in seconds (0 to 300) versus database size (0.5 to 4 G words); curves: Physical, Hybrid - 20, Hybrid - 50, Logical.]

Table 3
Values of $t_{icp}$ and $N_{dirty}(1)$ for varying database sizes

| Database size (G words) | $t_{icp}$ (s) | $N_{dirty}(1)$ |
|---|---|---|
| 0.5 | 27.87 | 3378 |
| 1.0 | 55.74 | 3448 |
| 1.5 | 83.61 | 3472 |
| 2.0 | 111.48 | 3484 |
| 2.5 | 139.33 | 3491 |
| 3.0 | 167.22 | 3496 |
| 3.5 | 195.09 | 3500 |
| 4.0 | 222.96 | 3502 |
6. Concluding remarks
Fuzzy checkpointing is an efficient way to back up an MMDB due to its asynchronous flushing feature. Most previous works on fuzzy checkpointing in MMDB have used physical logging. Although physical logging is relatively simple to apply, it incurs space and recovery time overhead. In this paper, we have focused on the reduction of the size of log data and have proposed a recovery method based on the hybrid logging scheme. The hybrid logging scheme accommodates logical logging under fuzzy checkpointing, when applicable, and thus significantly reduces the size of log data. We have also presented an efficient log applying rule in the segmented MMDB, which results in efficient recovery processing by reducing the number of log records to be applied.
We have performed analyses to evaluate our proposed method. The result shows that in our method the size of log data is reduced by up to more than half, compared with that in only physical logging. We have shown that the size of log data decreases as the number of segments increases. We have also shown that the recovery time based on the proposed method can be close to the case where only logical logging is used. The result of the log applying rule shows that we can reduce the number of log records to be applied for recovery by more than half of the number of log records generated in the segmented MMDB. Thus, the hybrid logging scheme along with the log applying rule makes ordinary log processing as well as database recovery quite efficient.
References
[1] H. Garcia-Molina, K. Salem, Main memory database systems: an overview, IEEE Transactions on Knowledge and Data Engineering 4 (6) (1992) 509–516.
[2] D.J. Dewitt, R.H. Katz, F. Olken, L.D. Shapiro, M.R. Stonebraker, D. Wood, Implementation techniques for main memory database systems, in: Proceedings of ACM SIGMOD International Conference on Management of Data, 1984, pp. 1–8.
[3] M.H. Eich, A classification and comparison of main memory database recovery techniques, in: Proceedings of International Conference on Data Engineering, IEEE, 1987, pp. 332–339.
[4] R.B. Hagmann, A crash recovery scheme for a memory-resident database system, IEEE Transactions on Computers C-35 (9) (1986) 839–843.
[5] K. Salem, H. Garcia-Molina, Checkpointing memory-resident databases, in: Proceedings of International Conference on Data Engineering, 1989, pp. 452–462.
[6] L. Gruenwald, M.H. Eich, MMDB reload algorithms, in: Proceedings of ACM SIGMOD International Conference on Management of Data, ACM, 1991, pp. 397–405.
[7] H.V. Jagadish, A. Silberschatz, S. Sudarshan, Recovering from main-memory lapses, in: Proceedings of the 19th International Conference on Very Large Data Bases, 1993, pp. 391–404.
[8] T.J. Lehman, M.J. Carey, A recovery algorithm for a high-performance memory-resident database system, in: Proceedings of ACM SIGMOD International Conference on Management of Data, 1987, pp. 104–117.
[9] E. Levy, A. Silberschatz, Incremental recovery in main memory database systems, IEEE Transactions on Knowledge and Data Engineering 4 (6) (1992) 529–540.
[10] X. Li, M.H. Eich, Post-crash log processing for fuzzy checkpointing main memory databases, in: Proceedings of International Conference on Data Engineering, IEEE, 1993, pp. 117–124.
[11] P.A. Bernstein, V. Hadzilacos, N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, Reading, MA, 1987.
[12] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, P. Schwarz, ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging, ACM Transactions on Database Systems 17 (1) (1992) 94–162.
[13] M.H. Eich, Main memory database research directions, Technical Report TR 88-CSE-35, Southern Methodist University, 1988.
[14] V. Kumar, A. Burger, Performance measurement of main memory database recovery algorithms based on update-in-place and shadow approaches, IEEE Transactions on Knowledge and Data Engineering 4 (6) (1992) 567–571.
[15] K. Salem, H. Garcia-Molina, Crash recovery for memory-resident databases, Technical Report CS-TR-119-87, Department of Computer Science, Princeton University, November 1987.
[16] J.-L. Lin, M.H. Dunham, Segmented fuzzy checkpointing for main memory databases, in: Proceedings of ACM Symposium on Applied Computing, February 1996, pp. 158–165.
[17] J. Gray, A. Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann, Los Altos, CA, 1993.
[18] V. Kumar, Recovery in main memory database systems, in: Proceedings of Database and Expert Systems Applications, 1996, pp. 769–778.
[19] K. Elhardt, R. Bayer, A database cache for high performance and fast restart in database systems, ACM Transactions on Database Systems 9 (4) (1984) 503–525.
[20] J.L. Hennessy, D.A. Patterson, Computer Architecture: a Quantitative Approach, 2nd ed., Morgan Kaufmann, Los Altos, CA, 1996.