Algebraic Signatures for Scalable Distributed Data Structures

(1)

Algebraic Signatures for Scalable Distributed Data Structures

Abstract

Signatures detect changes to data objects. Numerous schemes are used, e.g., the cryptographically secure standards SHA1 or MD5 used in Computer Forensics to distinguish file versions. We propose a novel signature descending from the Karp-Rabin signatures, which we call algebraic signatures because it uses Galois Field calculations. One major consequence, new for any known signature scheme, is certain detection of small changes of parameterized size. More precisely, we detect for sure any change that does not exceed n-symbols for an n-symbol signature. For larger changes, the collision probability is typically negligible, as for the other known schemes. We apply the algebraic signatures to the Scalable Distributed Data Structures (SDDS). We filter at the SDDS client node the updates that do not actually change the records. We also manage the concurrent updates to data stored in the SDDS RAM buckets at the server nodes. We further use the scheme for the fast disk backup of these buckets. We sign our objects with 4-byte signatures, instead of 20-byte standard SHA-1 signatures that would be impractical for us. Our algebraic calculus is then also about twice as fast.

1. Introduction

A signature is a string of a few bytes intended to identify uniquely the contents of a data object (a record, a page, a file, etc.). Different signatures prove the inequality of the contents, while same signatures indicate their equality with overwhelmingly high probability. Signatures are a potentially useful tool to detect the updates or discrepancies among replicas (e.g. of files [Me83], [AA93], [BGMF88], [BL91], [FWA86], [Me91], [SBB90]). Their practical use requires further properties, since the updates often follow common

patterns. In a text document the cut/paste (switch) of n symbols usually dominates. A database record update often changes only relatively few bytes. Common updates should change the signatures, i.e., should not lead to collisions. The collision probability should be also uniformly low for every possible update, although no schemes can guarantee the signature change for any update. Many signatures schemes with additional “good” properties for specific applications have been proposed, e.g., for documents, against transmission errors, malicious alterations… [FC87], [KC96], [KC99]. The prominent public standard SHA1 provides a 160-bit secure hash signature, [N95], [KPS02] and supplements the MD5 signature used to ascertain integrity of disk images in computer forensics.

The concept of a signature appeared to us useful for the management of a Scalable Distributed Data Structure (SDDS).

Most known SDDS schemes are a hash LH* or LH*RS file, or a range partitioned RP* file [LNS06], [LS01], [LNS94], managed by the SDDS-2000 prototype system available for download [C02]. SDDS are intended for parallel or distributed databases on multicomputers, grid or P2P systems [NDLR00], [NDLR01], [LRS02]. The SDDS-2000 files reside in distributed RAM buckets and its access performance currently reaches more than 100 times improvement over the disk data.

The use of signatures is motivated in the SDDS context as follows. First, some applications of SDDS-2000 may need disk back up of the buckets from time to time. One needs then to find the only areas that changed in the bucket since the last backup. For reasons we detail later, but essentially because we initially did not design SDDS-2000 for this need, the traditional dirty bit approach became impractical in our case.

Signatures were apparently the only workable approach.

Thomas Schwarz

Comp. Engineering, Santa Clara University Santa Clara, CA 95053, USA

[email protected] Witold Litwin

CERIA, Université Paris 9 Pl. du Mal. de Lattre, 75016 Paris, Fr

[email protected]

(2)

Next, signing record updates proved useful. On frequently observed occasions, an application requests a record update, but the new and the old record are in fact identical. We compare the signatures of the before and after image to detect this problem and avoid unnecessary traffic. If transactions follow the two-step model, we can prevent dirty reads by calculating the signatures of the read set between reading and just before committing the writes.

Cryptographically secure signatures such as the 20B SHA-1 or MD5 do not fit our purpose well. Our n-symbol signature is the concatenation of n power series of Galois Field (GF) symbols in GF(2⁸) or GF(2¹⁶). While our algebraic signatures are not cryptographically secure, they exhibit a number of attractive properties. First, we may produce short signatures, sufficient for our goals, e.g., only 4B long. Next, the scheme is the first known, to the best of our knowledge, to guarantee that objects differing by n symbols have guaranteed different n-symbol signatures. Furthermore, the probability that a switch or any update leads to a collision is also sufficiently low, namely 2^-swith signature length in bits of s. We can also calculate the signature of an updated object from the signature of the update and from the previous signature. Finally, we may gather signatures in a tree of signatures, speeding up the localization of changes.

Our signatures are related to the Karp Rabin fingerprints [KR 87] [B93] [S88] and its generalizations in [C97].

Below, we first describe more in depth our motivating SDDS needs. Next, we recall basic properties of a GF.

Afterwards, we present our approach, we overview the implementation, and we discuss the experimental results. We conclude with directions for the further work.

2. Signatures for an SDDS

We recall that a Scalable Distributed Data Structure (SDDS) uses server nodes to store a file consisting of records or more generally objects. Records or objects have a unique key and are stored on each server in buckets. The data structure implements the key-based operations of inserts, deletes, and updates as well as scan queries. The application requests these operations from the SDDS client at its node. The client manages the query delivery through the network to the appropriate server(s) and receives the replies if any. The file scales with inserts through splits. Each split sends about half of a bucket to a new one, dynamically appended. The buckets reside for the processing entirely in the distributed RAM. We apply the signatures to the disk backup of SDDS buckets and to the record updates. While these were our motivating needs, others appeared later on and we present them in the conclusion.

2.1. File backup

We wish to backup an SDDS bucket B on disk. We only want to move those parts of the bucket that differ from the current disk copy. The traditional approach is to divide the buckets into pages of reasonably small granularity and maintain a dirty bit for each page. One resets the dirty bit when the page goes to the disk and sets it when the page is read back to RAM. One moves only dirty pages to disk. The implementation of this approach for our running prototype SDDS-2000 [C02] would demand refitting a large part of the existing code that was not designed for this capacity. As is often the case, this appeared an impossible task in practice.

The existing code is a large piece of software that updates the bucket in many places. Different students who left the team since, produced over the years different related functional parts of the code.

Our alternative approach calculates a signature for each page when data should move to the disk. This computation is independent of the history of the bucket page and does not interfere with the existing maintenance of a data structure.

This is the crucial advantage in our context.

More in detail, we provide the disk copy of the bucket with a signature map, which is simply the collection of all its page signatures. Before we move a page to disk, we recalculate its signature. If the signature is identical to the entry in the signature map, we do not write the page.

The slicing of the buckets into pages is somewhat arbitrary.

The signature map should fit entirely into RAM (or even the L2 cache using for example the prefetch macro). Smaller pages minimize transfer sizes, but increase the map and overall signature calculus overhead. One may expect the practical page size somewhere between 512B and 64KB. The best choice depends on the application. In any case, signature calculus speed is the challenge as it has to be small with respect to the disk write time. Another challenge is the practical absence of the collisions to avoid an update loss. The ideal case is a zero probability of a collision, but in practice, a probability at the order of magnitude of that of irrecoverable disk errors (e.g. writes to an adjacent track) or software failures is sufficient. The database community does not bother dealing with those unlikely failures anyway so that the possibility of a collision does not change fundamentally the situation.

Presently, we implement the signature map simply as a table, since it fits into RAM. Otherwise, the algebraic signature scheme allows to structure the map into a signature tree. In a tree, one may compute the signature at the node from the signatures of all descendents of the node. This speeds up the identification of the portions of the map where the signatures have changed (similar to [Me86] et al.) More on it in Section 4.1.

(3)

2.2. Record Updates

We recall that an update operation only manipulates the non-key part of R. Let R_bdenote the before-image of R and S_b its signature. The before-image is the content of R, subject to the update by a client. The result of the update of R is the after-image which we note R_a and its signature S_a. The update is normal if R_a depends on R_b, e.g., Salary := Salary + 0.01*Sales. The update is blind if Ra is set independently of Rb, e.g., if one requests Salary = 1000 or a house surveillance camera updates the stored image. The application needs R_b for a normal update and perhaps not for a blind one. In both cases, it is often not aware whether the actual result is effectively R_a ≠ R_b. As in the above examples for unlucky salespersons in these hard times, or as long as there is no burglar in the house.

The application nevertheless typically requests the update from the data management system that also typically executes it. This “trustworthy” policy, i.e., if there is an update request, then it had to be a data change, characterizes in particular all the DBMSs we are aware of. This is kind of surprising after all, since the policy can often cost a lot. Tough times can leave thousands of salespersons with no sale, leading to useless transfers between clients and servers and to the useless processing on both nodes of thousands of records with the tuples. Likewise, a security camera image is often a clip or movie of several Mbytes, leading to an equally futile effort.

Furthermore, on the server side, several clients may attempt to read or update concurrently the same SDDS record R. It is best to let every client read anyrecord without any wait. The subsequent updates should not however override each other.

Our approach to this classical constraint is freely inspired by the optimistic option of the concurrency control of MS-Access, which is not the traditional one in the database literature, e.g.

[LBK02].

The signatures for SDDS updates are useful in this context as follows. The application that needs R_b for a normal update, requests from the client the key search of R. When done with itsupdate calculus, the application returns to the client Rb and Ra. The client computes Sa and Sb. If Sa = Sb, then the update actually did not change the record. Such updates terminate at the client. Only if S_a ≠ Sb sends the client R_a and S_b to the server. The server accesses R and computes its signature S. If Sb = S, then the server updates R to R := Ra. Otherwise, it abandons the update. A concurrent update had to happen to R in the mean time, since the client read R_b and the server received its update R_a. If the new update proceeded, it would override that one, making the result non-serializable. The server notifies the client about the rollback, which in turn alerts the application. The application may read R again and redo the update.

For a blind update, the application provides only R_ato the client. The client computes S_a and sends the key of R_a to the server requesting S. The server computes S and sends it to the client as S_b. From this point on, the client and the server can proceed as in a normal update. Calculating and sending S alone as Sb, already avoids the transfer of Rb to the client. It may avoid further the useless transfer of R_ato the server.

These can be substantial savings, e.g., for the surveillance images.

The scheme does not need locks. In addition, as we have seen, the signature calculus saves the useless record transfers.

Besides, neither the key search, nor the insert or deletion operations need the signature calculus. Hence, none of these operations incurs the concurrency management overhead. All together, the degree of concurrency can be potentially high.

The scheme roughly corresponds to the R-Committed isolation level of the SQL3 standard. Its properties make it attractive to many applications that do not need the transaction management. Especially, if the search is the predominant operation, as one considers in general for an optimistic scheme

.

The scheme does not store the signatures. Hence, the storage overhead can be interestingly strictly zero. This is not possible for timestamps, probably used by MS Access, although that overhead is usually negligible, hence perfectly acceptable in practice. In fact, it can still be advantageous to vary the signature scheme so to store the signatures with their records. As we show later, the storage cost at the server usually turns out to be negligible, of only about 4B per signature. The client sends in this case also S_a to the server which stores it in the file with R_aif it accepts the update. When the client requests R it gets it with S. If the client requests S alone, the server simply extracts S from R, instead of dynamically calculating it. All together, one saves the S_b calculus at the client and that of S at the server. Also, and more significantly perhaps in practice, the signature calculus becomes entirely deported at the client. Hence, it is entirely parallel among the concurrent clients. This can enhance the update throughput even further.

Whether one stores the signature or not, the speed of the signature calculus is again clearly the main challenge. Since a record key search in or insert into an SDDS reaches speeds of 0.1 ms at present, the record signature calculus time cannot be longer than dozens of microseconds. Another challenge is, also again, the total or at least practical absence of the collisions, to avoid any update losses.

2.3. Searches

A frequent SDDS scan operation looks for all records that contain a string somewhere in a non-key field. If the SDDS contains many server nodes, if the search string is long, and if

(4)

there are few hits, then we can have the client send out only the signature and the length of the search string, let every server decide whether a sub-string of the right length in the non-key fields, and return the results. This procedure differs only in the exact signature definition from a distributed, Las Vegas pattern-matching algorithm based on Karp-Rabin fingerprints [KR87] and used extensively. Since the client evaluates the results sent by the various responding servers, no problems arise from collisions. Collisions are indeed more likely, since many more strings are being hashed.

3. Galois Fields

A Galois field (GF) is finite field. Addition and multiplication in a GF are associative, commutative, and distributive. There are neutral elements called zero and one for addition and multiplication respectively, and there exist inverse elements regarding addition and multiplication. We denote GF(2^f) a GF over the set of all binary strings of a certain length f. GF (2⁸) and GF (2¹⁶) are our main concern.

Their elements are respectively byte and 2-byte strings.

We identify each binary string with a binary polynomial in one formal unknown x. For example, we identify the string 101001 with the polynomial x⁵+x³+1. We further associate with the GF the generator polynomial g(x). This is a polynomial of degree f that cannot be written as a product of two other polynomials other than the trivial result of a multiplication of 1 with itself.

The addition of two elements in our GF is that of their binary polynomials. In practice, the sum of two strings is the XOR of the strings. The product of two elements is the binary polynomial obtained by multiplying the two operand polynomials and taking the remainder modulo g(x). There are several ways to implement this calculus. We use the logarithmic multiplication method we explain soon. It uses the primitive elements of a GF, which are elements with the following properties. The order of a non-zero element α of a Galois field noted ord (α) is the smallest non-zero exponent i such that αⁱ = 1.

All non-zero elements in a GF have a finite order. An element α ≠ 0 of a GF of size s is primitive, if ord(α) = s-1. It is well known that for any given primitive element α in a Galois field with s elements, all the non-zero elements in the field are different powers αⁱ, each with a uniquely determined exponent i, 0 ≤ i ≤ s-1. A GF can have several primitive elements.

In particular, any αⁱ is also a primitive element if i and s-1 are coprime, i.e., without non-trivial factors in common. Our GFs contain 2^f elements, hence the prime decomposition of 2^f- 1 does not contain the prime 2. For our basic values of f = 8,16, 2^f-1 has only few factors, hence there are relatively many

primitive elements. For example, for f=8 we count 127 primitive elements or roughly half the elements in the GF.

The logarithmic multiplication uses the primitive elements as follows. If α is a primitive element, then every non-zero element β is a power of α. If β = αⁱ, we call i the logarithm of β with respect to α and write i = logα(β) and we call β the antilogarithm of i with respect to α and write β = antilog_α(i).

The logarithms are uniquely determined if we choose i to be 0 ≤ i ≤ 2^f-2. We set log_α (0)= -∝ .

The multiplication is now given by the following formula with the addition modulo 2^f-1:

)).

( log ) ( (log

antilog β γ

γ

β ⋅ =

_α _α

+

_α

We implement GF multiplication on this basis as follows.

We create one table for logarithms of size 2^f symbols. We also create another one for antilogarithms of size 2^f⋅2. That table has two copies of the basic antilog table. It accommodates indices up to size 2^f⋅2 avoiding the slower modulo calculus of the formula. For our f’s, i.e. up to f = 16, both tables may fit also into the L1 or L2 cache of current processors (not all for f = 16). We also check for the special case of one of the operands being equal to 0. All together, we obtain the following simple C-pseudo-code:

GFElement mult(GFElement left, GFElement right) {

if(left==0||right == 0) return 0;

return

antilog[log[left]+log[right]];}

In terms of Assembly instructions, the typical execution costs of the body of the sub-program are two comparisons, four additions (three for table-look-up), three memory fetches and the return statement.

4. Algebraic Signatures

4.1. Basic properties

We call page P a string of l symbols p_i; i = 0..l-1. In our case, the symbols p_i are bytes or 2-byte words. The symbols are elements of a Galois field, GF (2^f) (f = 8, 16). We assume that l < 2^f-1.

Let α = (α1…α_n)be a vector of different non-zero elements of the Galois field. We call α the n-symbol signature base, or simply the base. The (n-symbol) P signature or, simply, P signature, based on α, is the vector

1 2

sig ( ) (sig ( ),sig ( ),...,sig ( ))

_α

=

_α _α _α

P P P

n

P

(5)

where for each α we set

1

sig ( )

_α

P = ∑

^l_i₌⁻0

p

_i

α

ⁱ^.

We call each coordinate of sig_α the component signature.

Some choices of the coordinates, perhaps pseudo-random, are possibly interesting for cryptographic needs. It is not our goal here. Our primary interest is nevertheless in α exhibiting the following pattern for some primitive α:

α = (α, α², α³…αⁿ) with n << ord(a) = 2^f - 1.

We denote sig_α in this case as sig_α,n. Clearly, the collision probability of sig_α,n can be at best 2^-nf. If n = 1, this is probably insufficient.

Our interest is also in signatures we denote sig²α,n whose all coordinates of α are primitive as we have discussed and which is:

α = (α, α², α⁴, α⁸…α²ⁿ).

The sig²α,n calculus schema offers potentially better randomization for n > 2, (for n ≤ 2, we have sig²_α,n= sig_α,n, as α² is primitive in any GF(2^f)). Intuitively, the highest possible order of each element should avoid some collisions that sig_α,n must leads to. We prove it more formally for the cut and paste later on.

The basic new property of sig_α,n calculus scheme is that any change of up to n symbols within P changes the signature for sure. This is our primary rationale in this scheme. More formally we stay this property as follows.

Proposition 1: Provided the page length l is l < ord(α) = 2^f – 1, sig_α,n signature discovers any change of up to n symbols per page.

Proof: As α is primitive and our GF is GF (2^f) we have ord(α) = 2^f – 1. Assume that the file symbols at locations i1, i2,

… i_n has been changed, but that the signatures of the original and the altered file are the same. Call d_νthe difference between the respective symbols in position iν. The difference of the component signatures is then:

2

1 1 1

0 , 0 , ... 0

n n n

i^νd_ν i^νd_ν ni^νd_ν

ν ν ν

α α α

= = =

∑ ∑ ∑

^.

The dν values are the solutions of a homogeneous linear system:

( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )

3

1 2 4

3

1 2 4

3

1 2 4

3

1 2 4

3

1 2 4

2

2 2 2 2

1 3 2

3 3 3 3

3 4

4 4 4 4

4

0.

n

n i

i i i i

i

i i i i

i

i i i i

i

i i i i

n n

n n i n n

i i i i

d d d d

d

α α α α α

⎛ ⎞

⎜ ⎟

⎛ ⎞

⎜ ⎟ ⎜ ⎟

⎜ ⎟ ⎜⎝ ⎟⎠

⎜ ⎟

⎝ ⎠

⋅ =

L L L

L M

L L L L L L

L

The coefficients in the first row are all different, since the exponents iv < ord(α). The matrix is of Vandermonde type, hence invertible. The vector of differences (d1,d2...d_n)^t is thus the zero vector. This contradicts our assumption. Thus sig_α,n signature detects any up to n-symbol change. qed

Notice that Proposition 1 trivially holds for sig²α,n with n ≤ 2. More generally, it proves best possible behavior of sig_α,n scheme for changes limited to n symbols. An application can however possibly change up to l > n symbols.

We now prove that sig_α,n scheme still exhibits a good behavior, with low collision probability, typically expected from a signature schema.

Proposition 2: Assuming the page length l < ord(α) and every possible page content equally likely, the signatures sig_α,n of two different pages collide (coincide) with probability of 2^-nf.

Proof: The n-symbol signature is a linear mapping between the vector spaces GF(2^f)^l and GF(2^f)ⁿ. This mapping is an epimorphism, i.e., every element in GF(2^f)ⁿ is the signature of some page, an element of GF(2^f)^l. Consider the map φ, which maps every page with all but the first n elements equal to zero to its signature. Thus, φ: GF(2^f)ⁿ→ GF(2^f)^l, (x1,…,x_n) → sigα,n

((x1,…,x_n,0,…0)), and:

( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )

1 2

2 3 4

2 2 2 2 1

2 2 3 4

3 3 3 3 2

3 2 3 4

3

4 4 4 4

4 2 3 4 4

2 3 4

φ( )

.

n

n n n n n

n n

(x ,x ,...,x )

x x x x

x

α α α α α

=

⎛ ⎞

⎜ ⎟ ⎛ ⎞

⎜ ⎟ ⎜ ⎟

⎜ ⎟ ⋅⎜ ⎟

⎜ ⎟ ⎜ ⎟

⎜ ⎟ ⎜ ⎟⎜ ⎟

⎜ ⎟ ⎝ ⎠

⎜ ⎟

⎝ ⎠

L L L

L M

L L L L L L

L

The matrix is again of Vandermonde type, hence is invertible. This implies that every possible vector in GF(2^f)ⁿ is the signature of a page with all but the first n symbols equal to zero, and of only one such page. Consider now an arbitrary vector s in GF(2^f)ⁿ. Each page of form (0,…,0,x_n+1,x_n+2,…, x_l)

(6)

has some vector t in GF(2^f)ⁿ as its signature. For any s, t there is then exactly one page (x1,…,x_n,0,…0) that has s-t as the signature. The page (x1,…,x_n,x_n+1,x_n+2,…, x_l) has therefore signature s. Thus, the number of pages that have signature s is that of all pages of form (0,…,0,x_n+1,x_n+2,…, x_l). There are 2^f(l-

n) such pages. There is furthermore 2^fl pages in total. A random choice of two pages leads thus to the same signature s with probability 2^f(l-n) / 2^fl, which is 2^-fn. Assuming the selections of all possible pages to be equally likely, our proposition follows. qed

Notice that Proposition 2 also characterizes sig²α,n for n ≤ 2.

Next, observe that we qualified our schemes algebraic claiming that they support some form of the algebraic calculus of new signatures from the signatures only. Here is one motivating property of sigα,n scheme. Its practical importance concerns pages where the change is localized to a rather small substring (or several substrings). The case is usual in databases, where attributes have typically rather few symbols.

We show that one may then update the sig_α,n signature only knowing the changed symbols and the before signature. This may clearly largely speed up the signature calculus with respect to the full calculus again, necessary for the more traditional signature schemes we are aware of, SHA-1 in particular. Formally:

Proposition 3: Let us change P = (p0, p1, … p_l-1) to page P’

where we replace the symbols starting in position r and ending with position s-1 with the string

q q

_r

,

_r₊₁

, , L q

_s₋₁. We define Δ-string as Δ = (δ0, δ1…δs-r-1) with δi = p_r+i – q_r+i . Then for each α in our base α we have:

sig ( ') sig ( )_α P = _α P +

α

^rsig ( )._α Δ

Proof: The difference between the signatures is:

1

0

sig ( ') sig ( )

( )

( ( ) )

( )

sig ( ).

s

i

i i

i r s

r i r

i i

i r s

r i r

i r i r s r

r i

i i r

P P

q p

α α

α

α α

α δ α

α

−

=

− −

=

− −

−

=

− −

=

−

= −

=

= Δ

∑

qed Finally for this section, we now prove the practicality of sig²α,n schema. We consider specifically the popular switch (cut/paste) operation. Proposition 1 provides for sure detection

of any switch of length up to ⎣n/2⎦. For many applications, e.g., involving text editing, the switch operations of length above are common. Proposition 2 does not cover this case.

We now seek to determine and minimize the collision probability in that context as well. The base α = α0, α1, … αn- 1, 0 ≤ i ≤ n-1, where every αi is primitive, has the largest possible ord (αi) for each αi. The sig²_α,n scheme appears intuitively preferable in this context to sig_α,n and the following proposition confirms the conjecture formally.

Figure 1: Illustration of the cut and paste operation.

Proposition 4: Assume an arbitrary page P and three indices r, s, t of appropriate sizes, Figure 1. Consider for P the scheme sigα where base α = α0,α1,…αn-1 has every ord(αi) above the length of P, 0 ≤ i ≤ n-1. Cut a string T of length t beginning with position r and move it into position s in P.

Assuming any possible T equally likely, the probability that sig_α(P) changes is 2^-nf.

Proof: We denote B the reminder B = P-T. B contains at least n symbols or T contains at least n symbols, Figure 1. We only treat the case of length(B) ≥ n, the other one is analogue.

Without loss of generality we assume a forward move of T within the file from position r to position s. A backward move just undoes this operation and thus has the same effect on the signature. Figure 1 defines names for the regions of the block and makes a spurious case distinction depending on whether r+t < s or not. For any α ∈{α0, α1…αn-1}, the α signature of the “before” page (the top scheme for both situations) is

sig ( old)

sig ( ) ^rsig ( ) ^{r t}sig ( ) ^{s t}sig ( ) P

A T B C

α

⁺ α

α

⁺ α

=

+ + + ^.

The after page signature is

sig (

new

)

sig ( )

^r

sig ( )

^s

sig ( )

^{s t}

sig ( ).

P

A B T C

α

⁺ α

=

+ + +

The difference of the two signatures is:

A

A T B

B T C C

P^new

r r+t s P^old

A

A T B

B T

C C

P^old P^new

r s r+t

(7)

( )

new old

sig ( ) sig ( )

sig ( ) sig ( ) sig ( ) sig ( )

( )sig ( ) ( )sig ( )

(1 )sig ( ) (1 )sig ( ) .

r r t r s

r s r r t

r s t

P P

T B B T

T B

α α

α α α α

α α

α α α α

α α α

+

− =

+ + + =

+ + +

This expression is zero only if the right hand side, or the following expression, where we use γi as an abbreviation, is zero:

1

size( )-1 1

1

0 1

0

(1 )(1 ) sig ( ) sig ( ) (1 )(1 ) sig ( )

.

i i

i

s t

i i

B n

s t

i i i i

n n

i i

T B

T b b

b

α α

ν ν

α ν ν

ν ν

ν ν ν

α α

α α α α

γ α

−

− −

= =

−

=

+ + +

= + + + +

= +

∑ ∑

∑

We now fix the whole situation with the exception of the first n symbols in B. The change in signature is:

1 1 1

0 0 1 1 1 1

0 0 0

1 1 1

0 1 1 0 1 1

0 0 0

, ,...,

( , ,..., ) , ,..., .

n n n

n n

n n n

n n

b b b

ν ν ν

γ α γ α γ α

γ γ γ α α α

− − −

− −

= = =

− − −

− −

= = =

+ + + =

+

⎛ ⎞

⎜ ⎟

⎝ ⎠

⎛ ⎞

⎜ ⎟

⎝ ⎠

∑ ∑ ∑

which is zero if and only if:

1 1 1

0 1 1

0 0 0

, ,...,

n n n

b b

n

b

ν ν ν

α α α

− − −

−

= = =

⎛ ⎞

⎜ ⎟

⎝ ∑ ∑ ∑ ⎠

0 1 1

( , ,..., γ γ γ

_n₋

).

=

The left hand side is a linear mapping in the (b0, b1, … b_n-1), which has a matrix that is invertible, because it has a Vandermonde type determinant. Therefore, there exists only one combination (b0, b1, … bn-1) that is mapped by the mapping onto the right hand vector. This combination will be attained for a randomly picked B with probability 2^-nf. qed

To obtain the strongest property of a sig_α,_n signature schema, one should thus use α whose α_i have the largest order.

The natural choice are primitive αi. Assuming the need for n > 2, this is precisely the rationale in sig²α,n. Notice that for GF (2¹⁶), the collision probability is already small enough in practice, as we discuss more in Section 5.2.

At this stage of our research, the choice of sig²α,n appears only as a trade-off between smaller probability of collision for possibly frequent updates (switches here), and the zero probability of collision for updates up to any n symbols. We are able only to conjecture that there is α in GF(2⁸) or GF(2⁸) for which Proposition 1 and 2 holds for sig²_α,n with n > 2. We did not pursue the investigation further. For our needs, n = 2

for GF (2¹⁶) was sufficient (Section 5.2). As sig²_α,2= sig_α,2, the properties of both schemes coincide anyway.

4.2. Compound Algebraic Signatures

Our signature schemes keep the property of sure detection of n-symbol change as long as the page size in symbols is at most 2^f – 2. For f = 16, the limit on the page size is almost 128 KB. Such granularity suffices for our purpose. One may have many pages in an SDDS bucket that can reach, e.g., 256 MB for SDDS-2000. The collections of the signatures in the bucket may be seen as a vector. We call it compound signature (of the bucket). More generally, we qualify a compound signature of m pages, as m-fold. The signature map of Section 2.1 implements a compound signature.

The practical interest of the compound signatures stretches beyond our motivating cases. One may usefully apply the concept as an alternative to a signature of an area A not exceeding the limit of ord (α)-1. To use an m-fold signature sig_α,n for instance, one may divide A into equally sized pages each provided with sigα,n. One locates then for sure and with the granularity of l /m any up to n-symbol change with a priori unknown location (hence Proposition 3 does not apply).

The price with respect to sig_α,n(A), i.e., sig_α,n over entire A with the granularity thus of l, is mainly the about m times larger storage overhead. In practice, one can search for m leading to a reasonable compromise. Notice that a yet alternative choice for the m times larger overhead if acceptable, is to enhance the sure change detection resolution to mn symbols anywhere in A, using sig_α,mn(A).

For larger m, further practical importance of the algebraic properties of m-fold sig_α,n scheme (and in fact similarly of sig²_α,nscheme) is that we may implement signature maps as trees speeding up the search for a changed sig_α,n. As we show below, with our schemes, we may algebraically, i.e., without reexamining the pages themselves compute the higher-level signatures (unlike for more traditional signature schemes we are aware of). If a sig_α,n changes, we may update the higher level once again only algebraically. All these capabilities of compound signatures can be of obvious interest to our SDDS file backup application.

The following proposition proves the algebraic properties we discuss for an area partitioned into two pages. Those can be furthermore of different sizes. This is sometimes a useful capability as well, e.g., when A starts with a relatively small index of the data the follow in A. It generalizes trivially to any larger m. It holds pages of different sizes as well.

Proposition 5: Consider that we concatenate two pages P1

and P2 of length l and m, l + m ≤ 2^f-1, into page (area) denoted P1⏐P2. Then, the signature sig_α,n (P1⏐P2) is as follows, where sig denotes sig_α,n :

(8)

sig_α,n (P1⏐P2) = (sig(P1)+a^l⋅sig(P2), sig(P1)+a^2l⋅sig(P2),…, sig(P1)+a^nl⋅sig(P2).

Proof: The proof consists in applying Lemma that follows

to each α coordinate α. qed

Lemma. We note sigα,1 as sigα. Then, the 1-symbol sigα

(P1⏐P2) signature is sigα (P1|P2) = sig(P1)+a^l sig(P2).

Proof: Assume that P1 = {s1,s2,…,s_l} and P2 = {s_l+1,s_l+2,…,s_l+m}. Then

1 2

1

sig ( | )

l m

v

P P s

α ν ν

+

α

=

= ∑

1 1

l l m

l

s

_ν ^ν

s

_ν ^ν

ν ν

α

⁺

α

= = +

= ∑ + ∑

1 1

1 2

= sig ( ) sig ( ).

l m

l l

l

s s

P P

ν ν

α α

α α α

α

+

= =

= +

+

∑ ∑

qed Proposition 5 holds analogously for sig²_α,n. Together, all the propositions we have formulated prove the potential of our two schemes. They have further algebraic properties we are currently investigating.

4.3. Reinterpretation of Symbols

For performance tuning, the following “meta”-proposition turns out to be important.

Proposition 6: Let ϕ be a function that maps the space of all symbols into itself and is one-to-one. (With other words, ϕ is a bijection of GF(2^f). ) Define the twisted signature

1

, 0

sig ( )

^l

( )

_i ⁱ

P

i

p

ϕ α

= ∑

⁻₌

ϕ α

and similarly

1 2

, , , ,

sig ( ) (sig ( ),sig ( ),...,sig ( ))

P P P

n

P

φ_α

=

ϕ α ϕ α ϕ α

Then Propositions 1 to 5 also apply mutatis mutandis to the twisted signatures.

The proof is simply by inspection.

5. Experimental Implementation

5.1. Calculus tuning

One can tune the signature calculus. First, according to Proposition 6, one may interpret the page symbols directly as logarithms. This saves a table look-up. The logarithms range from 0 to 2^f-2 (inclusively) with the additional value for log(0). One can set this one to 2^f-1. Next, the signature calculations form the product with αⁱ. This one has i as the logarithm. One does not need to look this value up neither.

The following pseudo-code for sig_α,1applies these properties.

It uses as parameters the address of an array representing the bucket and the size of the bucket. The constant TWO_TO_THE_F is 2^f. The type GFElement is an alias for the appropriate integer type.

GFElement signature(

GFElement *page, int pageLength) {

GFElement returnValue = 0;

for(int i=0; i< pageLength; i++) { if(page[i]!=TWO_TO_THE_F-1)

returnValue ^= antilog[i+page[i]];

}

return returnValue;

}

The application to the calculation of sig_α,n is easy.

In our file backup application, the bucket usually contains several pages so we typically calculate the compound signature. To tune the calculus, one should consider the best use of the processor caches, i.e., L1 and L2 caches on our Pentium machines. It seems advantageous to explore the cache lines on the log table. Then, we might gain by first looping upon the calculus of sigα,1 for all the pages, then move to sigα2

,1 and so on. Our experiments confirmed this intuition.

5.2. Experimental Performance

We have implemented motivating applications with sig_α,1 scheme for the experimental analysis. The testbed configuration consisted from 1.8 GHz Pentium P4 nodes and from 700 Mhz Pentium P3 nodes over a 100 Mbs Ethernet.

One implementation concerned the signature calculus schemes alone with simulated data. The experiments examined variants of the sig_α,ncalculus with respect to implementation issues and some differences with respect to the basic scheme. We have also experimented the sig²α,n whose calculus time turned out the same, understandably. Finally, we have ported the fastest algorithm of sig_α,ncalculus to SDDS-2000.

(9)

In both cases, we have set up for dividing the bucket into the pages of size of 16 KB with the 4-byte signature per page.

This choice appears as a reasonable compromise between the signature size, hence its calculus time, and the overall collision probability of order 2^-32, i.e. over 4*10^-9. For the record updates, we use the same signature size, but the record size is of 100 bytes. If we had to set up for SHA-1, the overhead would be 20 bytes per page or record, [N95]. Records had about 100 bytes in our experiments.

Internally, the bucket in SDDS-2000 has a RAM index as it is structured into a RAM B-tree. The index is small, a few KB at largest. Bucket size page does not make sense there. We set up for the page size of 128 B for the index. Next, for the record updates, we set up for the signature calculus on-the-fly only. While this seemed the most flexible choice, we recall that we could alternatively store the signature in each record, saving some calculus. The actual computation took place only for the updates. Inserts were not affected.

The analysis of the experiments with the actual SDDS-2000 implementation is presented in full in [L03]. The main results are as follows. The stand-alone experiments showed, somehow surprisingly, a large variation of the calculus time depending on the data symbols. The reason seemed to be the influence of the caches L1 and L2. For a given page size, the calculus time was linear with n for sig_α,n used. The actual calculus times of sig_α,2 we finally put into SDDS-2000 was of 20-30 ms per 1 MB of RAM bucket, manipulated as a mapped file. For SHA-1, our tests showed about 50-60 ms. The sigα,2 calculus time was also as wished in the order of dozens of microseconds for the index page or a record. This timing was linear with the bucket or record size, and, also somehow surprisingly, rather stable regardless of the algebraic signature scheme tested. The calculus time was smaller for a larger page: 64 KB versus 16 KB. This is probably due to the better cache use. The actual transfer time of 1 Mbyte of RAM to the disk is about 300 ms.

For both page sizes, the calculus in GF (2¹⁶) was faster than in GF (2⁸). This despite the fact that the logarithm table of the latter could entirely enter the process cache of each of our machines, accelerating thus the calculus. The former calculus used in turn more effectively the 4B words. Being faster, GF (2¹⁶) was our final choice for SDDS-2000.

We have experimented with using signatures to distinguish between normal and blind updates that change and those that in fact do not change a datum. We now call the latter pseudo updates. The experiments were carried over our implementation in SDDS-2000, according to the principles outlined in Section 2.2. Details of experiments are in [H03].

Globally, results confirm very fast processing of our signatures and of updates in general and the expected savings for pseudo- updates.

More precisely, the signature calculus speed on 1.8 GHz P4 nodes was under 5 μs / KB of data. The 700 Mhz P3 nodes were surprisingly (about 30 times) slower, as the time was of 158 μs / KB. The normal actual update time on P4 nodes was of 0,614 ms per 1KB record. The normal pseudo-update time was of 0,043 ms. Thus the saving was of about 90 % of the normal actual update processing time. Notice however, that the initial search time for the record is not included in these figures, being of 0,237 ms in this case. If we add this time, the gain is about 70 %. For a blind update, the times were slightly faster, as one could expect. They were respectively of 0,8372 ms and of 0,2707 ms. The gain for a pseudo update was thus again about 70 %. Recall that the execution times for a blind pseudo update include the key search, the calculus and transfer of the record signature at the server to the client.

For smaller records the processing times were naturally faster and the gain logically smaller. For instance, for 100 B records, the normal update times were, respectively, of 0,419 ms and of 0,03 ms, with the search time of 0,22 ms.

Likewise the blind update times were of 0,837 ms and of 0,271 ms. The gains were, also respectively, of about 50 %, with the search time for the normal update taken to the account. See [H03] for the results on the P3 configuration.

We also tested our signature based non-key string search, as presented in Section 2.3. Our code takes care of alignment problems since we use symbols of length 16 bits and search for ASCII text with a character size of 8. Preliminary tests on the SDDS-2000 gave 1.516 s per search of the bucket with 8000 records, each containing a 60 B non-key field, searched unsuccessfully, till the last record, for a 3 B string. This time includes the sequential scan of the bucket which was measured as 0.5 s. That time taken out, the search speed is 2 s/MB.

Comparatively, the search speed through the bucket for the same string using the XOR calculus alone was 1.504 s. The algebraic signature search appears thus very effective.

6. Conclusion and related work

Our signature scheme has the new (to our knowledge) property of guaranteed detection of small changes and also of algebraic operations over the signatures themselves. (Cohen [C97] describes general extension of Karp-Rabin fingerprinting to obtain similar algebraic properties for use in searching. These properties are also used in finding similarities amongst files, e.g. [A+00].) Together with the high probability of detection of any change, including switches, small overhead, and fast calculus, our approach proved useful for our motivating SDDS needs of bucket backups and record updates. The experimental fine tuning of the implementation of the signature calculus allowed us finally to successfully add the sig_α,n scheme to SDDS-2000 system.

Beyond, one should determine further algebraic properties

(10)

of the schemes. The conjecture of sure detection of changes also by sig²_α,n scheme remains to be proved or disproved.

Variants of the basic schemes should be studied. The use of signature trees for computing the compound signatures and explore the signature maps is an open research area. Finally, we did not explore the Prefetch macro, providing perhaps further savings to the calculus time through better L1 and L2 cache management.

Beyond these goals, one can apply our schemes to the automatic file eviction in presence of several files sharing an SDDS server whose RAM became insufficient for all the files simultaneously, [LSS02]. Next, the signatures appear a useful tool for the cache management at the SDDS client, allowing to keep the cache and server data synchronized.

There is also an interesting relationship between the algebraic signatures and the Reed-Salomon parity calculus we use for the high-availability SDDS LH*RS scheme, [LS00]. In this scheme, we use GF calculations to generate the parity symbols of a Reed Solomon code ECC (error-control-code).

LH*RS combines a small number m of servers into a reliability group and adds k parity servers to the ensemble. The parity servers store parity records whose non-key data consists of parity symbols. We can reconstruct contents of lost servers as long as we can access the data in m out of the m+k total servers in a reliability group. LH*RS needs to guarantee consistency between parity and data servers, i.e., parity and data servers need to have seen the same updates to records.

We have shown the existence of an algebraic relation between the signatures of data and parity records which can be used to confirm this consistency between parity and data buckets.

[X+02] proposes the same scheme in a non-SDDS context.

Our techniques should help also other database needs.

Especially, they appear attractive for a RAM-DBS that typically needs the RAM data imaging to the disk as well.

Very large Gbyte RAMs are now widely available, and make enhanced performance of such DBSs increasingly attractive with respect to those of traditional disk-based DBSs, [NDLR01]. Likewise, our signature based record update calculus at the SDDS client, should provide similar advantages for a client-server based DBS architecture in general.

Interesting possibilities appear further for the transactional concurrency control, beyond the avoidance of the lost updates only.

Acknowledgments

We thank Lionel Delafosse, Jim Gray and Peter Scheuermann for fruitful discussions. This work was partly supported by the research grants from Microsoft Research, and from the European Commission project ICONS project no. IST-2001-32429 and the SCU IBM grant 41102-COEN-RSCH-IG-IG09.\

Bibliography

[AA93Abdel-Ghaffar, ] K. A. S., El-Abbadi, A. Efficient Detection of Corrupted Pages in a Replicated File. ACM Symp. Distributed Computing, 1993, p. 219-227.

[A+00] Ajtai, M., Burns, R., Fagin, R., Long, D. Stockmeyer, L.

Compactly encoding unstructured input with differential

compression. www.almaden.ibm.com/cs/people/stock/diff7.ps. IBM Research Report RJ 10187, April 2000.

[BGMF88] Barbara, D., Garcia-Molina, H. , Feijoo, B. Exploiting Symmetries for Low-Cost Comparison of File Copies. Proc. Int. Conf.

Distributed Computing Systems, 1988, p. 471-479.

[BL91] Barbara, D., Lipton, R. J.: A class of Randomized Strategies for Low-Cost Comparison of File Copies.

IEEE Trans. Parallel and Distributed Systems, vol. 2(2), 1991, p.

160-170.

[B93] Broder, A. Some applications of Rabin's fingerprinting method.

In Capocelli, De Santis, and Vaccaro, (ed.), Sequences II: Methods in Communications, Security, and Computer Science, pages 143--152.

Springer-Verlag, 1993.

[C02] Ceria Web site. http://ceria.dauphine.fr/.

[C97] Cohen, J. Recursive Hashing Functions for n-Grams. ACM Trans. Information Systems, Vol 15(3), 1997, p. 291-320.

[FC87] Faloutsos, C., Christodoulakis, S.: Optimal Signature Extraction and Information Loss. ACM Trans. Database Systems, Vol. 12(5), p. 395-428.

[FWA86] Fuchs, W. Wu, K. L., Abraham, J. A. Low-Cost Comparison and Diagnosis of Large, Remotely Located Files. Proc.

Symp. Reliability Distributed Software and Database Systems, p. 67- 73, 1986.

[Go99] D. Gollmann: Computer Security, Wiley, 1999.

[H03] Hammadi, B. Suppresions et mises a jour dans le système SDDS-2000. CERIA research report CERIA-BD-DEA127.

[KR87] Karp, R. M., Rabin, M. O. Efficient randomized pattern- matching algorithms. IBM Journal of Research and Development, Vol. 31, No. 2, March 1987.

[KC 95] Jeong-Ki Kim, Jae-Woo Chang: “A new parallel Signature File Method for Efficient Information Retrieval”, CIKM 95, p. 66-73.

[KC99] S. Kocberber, F. Can: “Compressed Multi-Framed Signature Files: An Index Structure for Fast Information Retrieval”, SAS 99, p.

221-226.

[KPS02] Kaufman, C., Perlman, R., Speciner, M. Network Security:

Private Communication in a Public World, 2^nd Ed., Prentice-Hall 2002.

[LBK02] Lewis, Ph., M., Bernstein. A. Kifer, M. Database and transaction Processing. Addison & Wesley, 2002.

[LNS94] Litwin, W., Neimat, M-A., Schneider, D. RP* : A Family of Order-Preserving Scalable Distributed Data Structures. 20th Intl. Conf on Very Large Data Bases (VLDB), 1994.

[LNS96] Litwin, W., Neimat, M-A., Schneider, D. LH*: A Scalable Distributed Data Structure. ACM Transactions on Database Systems ACM-TODS, (Dec. 1996).

[LS00] Litwin, W., Schwarz, Th. LH*RS: A High-Availability Scalable Distributed Data Structure using Reed Solomon Codes.

ACM-SIGMOD 2000.

[LRS02] Litwin, W., Risch, T., Schwarz, Th. An Architecture for a Scalable Distributed DBS :Application to SQL Server 2000. 2nd Intl.

Workshop on Cooperative Internet Computing (CIC 2002), August, 2002, Hong Kong.

(11)

[LSS02] Litwin, W., Scheuermann, P., Schwarz Th. Evicting SDDS- 2000 Buckets in RAM to the Disk. CERIA Res. Rep. 2002-07-24, U.

Paris 9, 2002.

[Me83] Metzner, J. A Parity Structure for Large Remotely Located Data Files. IEEE Transactions on Computers, Vol. C – 32, No. 8, 1983.

[MA00] Michener, J., Acar, T. Managing System and Active-Content Integrity. Computer, July 2000, p. 108-110.

[L03] W., Litwin,, R., Mokadem, Th. SCHWARZ, S.J. Disk Backup Through Algebraic Signatures in Scalable Distributed Data Structures. 5-th Intl. Workshop on Distributed Data & Structures 5 (WDAS 5), Thessalonica, 2003.

[N95] Natl. Inst. of Standards and Techn. Secure Hash Standards.

FIPS PUB 180-1, (Apr. 1995).

[NDLR00] Ndiaye, Y., Diène A., W., Litwin, W., Risch, T. Scalable Distributed Data Structures for High-Performance Databases. Intl.

Workshop on Distr. Data Structures, (WDAS-2000). Carleton Scientific (publ.).

[NDLR01] Ndiaye, Y., Diène A., W., Litwin, W., Risch, T. AMOS- SDDS: A Scalable Distributed Data Manager for Windows Multicomputers. 14th Intl. Conference on Parallel and Distributed Computing Systems- PDCS 2001.

[S88] Sedgewick, R: Algorithms. Addison-Wesley Publishing Company. 1988.

[SBB90] Schwarz, Th., Bowdidge, B., Burkhard, W., Low Cost Comparison of Files, Int. Conf. on Distr. Comp. Syst., (ICDCS 90) , 196-201.

[X+03] Xin, Q., Miller, E, Long, D., Brandt, S., Litwin, W., Schwarz, T. Selecting Reliability Mechanisms for a Large Object-Based Storage System. 20^th Symposium on Mass Storage Systems and Technology. San Diego. 2003.