3.2 Textual merging

3.2.1 State-based merging technique

In state-based merging, deltas are derived by executing a text differentiation algorithm that compares two text files and generates the differences between them. The best-known text differentiation algorithm is diff(f1, f2) [71, 88], which compares the two text files f1 and f2 and generates an editing script (or difference report) with which one file can be transformed into the other. Several front-end tools have been built upon the diff algorithm, such as spiff [94] and flexible diff [96], which give users useful flexibility to control how the diff algorithm is performed and how the generated editing scripts are reported. For example, a user can instruct spiff to ignore differences between floating point numbers that are below a user-specified threshold, or to ignore white space and comments when differentiating program source code. The diff algorithm and a variety of its front-end tools have been used in most SCM systems. Various state-based textual merging algorithms have been devised based on the deltas derived by text differentiation algorithms. A typical three-way state-based textual merging algorithm is diff3(f1, fb, f2), which produces a merged text file from text files f1 and f2 and their baseline file fb by utilizing deltas derived by the diff algorithm. The diff3 textual merging algorithm has been widely used in SCM systems such as SCCS [112], RCS [142] and CVS [10].

The representation of deltas

Deltas between two text files are represented as editing scripts and can be used to transform one file into the other. For example, if D12 is the editing script from text file f1 to f2, then applying this editing script to f1 will transform f1 into f2. This transformation procedure is denoted as f2 = f1 ◦ D12 for the sake of explanation.

Editing scripts are described by three kinds of editing operations: add (a), delete (d), and change (c). An adding operation is denoted as a[SL, NoL, ToL] to represent adding NoL new lines of text ToL after line SL. A changing operation is denoted as c[SL, NoL, ToL] to represent replacing NoL lines starting from line SL with new lines of text ToL. A deletion operation is denoted as d[SL, NoL] to represent deleting NoL lines starting from line SL, which is equivalent to a changing operation c[SL, NoL, ToL] with empty ToL.

For example, as shown in Figure 3.3, an initial text file f1 describes four major topics on collaborative editing. File f1 can be revised into a new file f2 as follows: insert a new line “- Concurrency Control” as the first item; insert the word “Group” before the word “Undo”; and remove the item “- Workflow Management”. The editing script D12 describing the deltas from f1 to f2 consists of a list of editing operations [O1, O2, O3], where O1 = d[5, 1] or c[5, 1, “”] deletes the fifth line (i.e., the last line) “- Workflow Management”, O2 = c[3, 1, “- Group Undo”] replaces the third line “- Undo” with the new line “- Group Undo”, and O3 = a[1, 1, “- Concurrency Control”] adds a new line “- Concurrency Control” after the first line “Topics on collaborative editing:”. The operations are listed from the bottom of the file upwards, so every line number still refers to the original numbering of f1 at the moment its operation is applied.

Initial file f1:
    Topics on collaborative editing:
    - Consistency Maintenance
    - Undo
    - Usability Studies
    - Workflow Management

Revised file f2:
    Topics on collaborative editing:
    - Concurrency Control
    - Consistency Maintenance
    - Group Undo
    - Usability Studies

D12 = diff(f1, f2):
    O1 = d[5, 1] or c[5, 1, ""]
    O2 = c[3, 1, "- Group Undo"]
    O3 = a[1, 1, "- Concurrency Control"]

Figure 3.3: The representation of editing scripts
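The transformation f2 = f1 ◦ D12 can be illustrated with a minimal sketch. The tuple encoding of the operations and the function name `apply_script` are assumptions made here for illustration and are not part of diff itself; the sketch also assumes the operations are listed bottom-up, as in Figure 3.3, so that line numbers refer to the original file.

```python
from typing import List, Sequence, Tuple

# Hypothetical encoding of the three operations (line numbers are 1-based):
#   ("a", SL, NoL, ToL)  add the NoL lines in ToL after line SL
#   ("d", SL, NoL)       delete NoL lines starting from line SL
#   ("c", SL, NoL, ToL)  replace NoL lines starting from line SL with ToL
Operation = Tuple


def apply_script(lines: Sequence[str], script: Sequence[Operation]) -> List[str]:
    """Apply an editing script whose operations are ordered bottom-up."""
    out = list(lines)
    for op in script:
        kind, start, count = op[0], op[1], op[2]
        if kind == "a":
            # Inserting after 1-based line `start` means inserting at index `start`.
            out[start:start] = op[3]
        elif kind == "d":
            del out[start - 1:start - 1 + count]
        elif kind == "c":
            out[start - 1:start - 1 + count] = op[3]
        else:
            raise ValueError(f"unknown operation kind: {kind!r}")
    return out


f1 = [
    "Topics on collaborative editing:",
    "- Consistency Maintenance",
    "- Undo",
    "- Usability Studies",
    "- Workflow Management",
]

# D12 from Figure 3.3: O1, O2, O3 in the order given in the text.
D12 = [
    ("d", 5, 1),
    ("c", 3, 1, ["- Group Undo"]),
    ("a", 1, 1, ["- Concurrency Control"]),
]

f2 = apply_script(f1, D12)
```

Applying the same operations top-down would require shifting the later line numbers after each insertion or deletion, which is why the bottom-up order is convenient.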

Editing scripts are line-based in the sense that if there is a single change within a line, the whole line is regarded as changed. In Figure 3.3, the line-based editing operations O1 and O3 have captured the real editing actions of removing an existing item and inserting a new item, but O2 does not capture the real editing action of inserting a word within an existing item. The line-based representation of editing scripts can be even more coarse-grained: if there is a change in each of several consecutive lines, those consecutive lines as a whole, referred to as an editing block, are regarded as changed. For example, as shown in Figure 3.4, file f1 has been revised into a new file f2 by changing the word “Studies” in the item “- Usability Studies” from the plural to the singular “Study”, in addition to the revisions made in Figure 3.3. The editing scripts in Figure 3.4 are significantly different from those in Figure 3.3. In particular, consecutive lines 3, 4, and 5 constitute an editing block, with editing operation O1 = d[3, 3] or c[3, 3, “”] to delete three lines starting from line 3, and O2 = a[2, 2, “- Group Undo\n- Usability Study”] to insert two new lines after line 2. Clearly, editing operations O1 and O2 do not reflect the real editing actions performed by the user.

Initial file f1:
    Topics on collaborative editing:
    - Consistency Maintenance
    - Undo
    - Usability Studies
    - Workflow Management

Revised file f2:
    Topics on collaborative editing:
    - Concurrency Control
    - Consistency Maintenance
    - Group Undo
    - Usability Study

D12 = diff(f1, f2):
    O1 = d[3, 3] or c[3, 3, ""]
    O2 = a[2, 2, "- Group Undo\n- Usability Study"]
    O3 = a[1, 1, "- Concurrency Control"]

Figure 3.4: Coarse-grained representation of editing scripts
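The grouping of consecutive changed lines into one editing block can be observed with Python's standard `difflib` module. This is only an illustration: `difflib.SequenceMatcher` uses its own matching heuristic rather than the diff algorithms cited in this section, and the helper name `editing_blocks` is made up here.

```python
import difflib
from typing import List, Sequence, Tuple

f1 = [
    "Topics on collaborative editing:",
    "- Consistency Maintenance",
    "- Undo",
    "- Usability Studies",
    "- Workflow Management",
]
f2 = [
    "Topics on collaborative editing:",
    "- Concurrency Control",
    "- Consistency Maintenance",
    "- Group Undo",
    "- Usability Study",
]


def editing_blocks(src: Sequence[str], dst: Sequence[str]) -> List[Tuple[str, int, int, List[str]]]:
    """Return the non-equal blocks as (tag, i1, i2, new_lines).

    i1 and i2 are 0-based, half-open indices into src, as used by difflib.
    """
    matcher = difflib.SequenceMatcher(a=src, b=dst, autojunk=False)
    return [
        (tag, i1, i2, list(dst[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


blocks = editing_blocks(f1, f2)
for block in blocks:
    print(block)
```

Lines are compared as whole units, so the three adjacent lines 3 to 5 of f1 (indices 2 to 5 in difflib's half-open notation) come out as a single replaced block, even though the user made three separate edits of different kinds.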

It is the coarse granularity of derived editing scripts that causes the false conflict problem in state-based textual merging. In the diff3 merging algorithm, a false conflict occurs if an editing block covered by one editing operation overlaps with an editing block covered by another editing operation, and the two editing operations are contradictory in the sense that they change the same line(s) to different values. For example, as shown in Figure 3.5, the baseline file fb contains four items describing the topics on collaborative editing. It has been revised into file f1 as follows: add the word “Group” in the item “- Undo”; change the word “Studies” to “Study” in the item “- Usability Studies”; and remove the last item “- Workflow Management”. Concurrently, it has also been revised into another file f2 as follows: add a new item “- Concurrency Control” as the first item; add the word “Selective” in the item “- Undo”; and remove the last item “- Workflow Management”.

Editing script Db1 contains a single editing operation O¹₁ = c[3, 3, “- Group Undo\n- Usability Study”] to represent the deltas from fb to f1 (the superscript identifies the revised file and the subscript the operation within its editing script). Editing script Db2 contains editing operations O²₁ = d[5, 1], O²₂ = c[3, 1, “- Selective Undo”], and O²₃ = a[1, 1, “- Concurrency Control”] to represent the deltas from fb to f2. When f2 is merged with f1 to produce a merged text file fm, the diff3 algorithm is executed as follows. First, the editing block covered by O²₁ in Db2 contains the fifth line and the editing block covered by O¹₁ in Db1 contains the third, fourth, and fifth lines. These two editing blocks overlap, but editing operations O²₁ and O¹₁ are not contradictory because both intend to change the fifth line “- Workflow Management” to “”. Therefore, the effect of O²₁ is integrated into the merged file fm: the item “- Workflow Management” is removed. Second, the editing block covered by O²₂ in Db2 contains the third line and the editing block covered by O¹₁ in Db1 contains the third, fourth, and fifth lines. These two editing blocks overlap and editing operations O²₂ and O¹₁ are contradictory because O²₂ intends to change the third line to “- Selective Undo” while O¹₁ intends to change the same line to “- Group Undo”. As a result, a false conflict is raised and the merging algorithm simply keeps both changes in the merged file fm. Finally, the editing block covered by O²₃ in Db2 does not overlap with the editing block covered by O¹₁ in Db1. Therefore, the effect of O²₃ is integrated into the merged file fm: the item “- Concurrency Control” is added.

Baseline file fb:
    Topics on collaborative editing:
    - Consistency Maintenance
    - Undo
    - Usability Studies
    - Workflow Management

Revised file f1:
    Topics on collaborative editing:
    - Consistency Maintenance
    - Group Undo
    - Usability Study

Revised file f2:
    Topics on collaborative editing:
    - Concurrency Control
    - Consistency Maintenance
    - Selective Undo
    - Usability Studies

Db1 = diff(fb, f1):
    O¹₁ = c[3, 3, "- Group Undo\n- Usability Study"]

Db2 = diff(fb, f2):
    O²₁ = c[5, 1, ""]
    O²₂ = c[3, 1, "- Selective Undo"]
    O²₃ = a[1, 1, "- Concurrency Control"]

Merged file fm = diff3(f1, fb, f2):
    Topics on collaborative editing:
    - Concurrency Control
    - Consistency Maintenance
    <<<<<<< f1
    - Group Undo
    - Usability Study
    =======
    - Selective Undo
    - Usability Studies
    >>>>>>> f2

Figure 3.5: An example of three-way textual merging supported by diff3
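The overlap and contradiction checks walked through above can be sketched as follows. This is a simplified model of the reasoning in the text, not the actual diff3 implementation: the tuple encoding, the function names, and the rule that a replaced block maps baseline lines to new lines position by position (with surplus baseline lines treated as deleted) are all assumptions of this sketch.

```python
from typing import Dict, List, Optional, Tuple

# (SL, NoL, ToL): NoL baseline lines starting at 1-based line SL are replaced
# by the lines in ToL. An add a[SL, ...] is modelled with NoL = 0, i.e. it
# covers no baseline line (an assumption of this sketch).
Operation = Tuple[int, int, List[str]]


def line_effects(op: Operation) -> Dict[int, Optional[str]]:
    """Map each covered baseline line to its new text (None means deleted)."""
    start, count, new_lines = op
    effects: Dict[int, Optional[str]] = {}
    for offset in range(count):
        effects[start + offset] = new_lines[offset] if offset < len(new_lines) else None
    return effects


def classify(op_a: Operation, op_b: Operation) -> str:
    """Compare the editing blocks of two operations on the same baseline."""
    effects_a, effects_b = line_effects(op_a), line_effects(op_b)
    shared = sorted(set(effects_a) & set(effects_b))
    if not shared:
        return "no overlap"
    if all(effects_a[n] == effects_b[n] for n in shared):
        return "overlap, not contradictory"
    return "overlap, contradictory"


# Operations from Figure 3.5 (first digit: revised file, second: operation).
O11 = (3, 3, ["- Group Undo", "- Usability Study"])  # c[3, 3, ...] in Db1
O21 = (5, 1, [])                                     # c[5, 1, ""] in Db2
O22 = (3, 1, ["- Selective Undo"])                   # c[3, 1, ...] in Db2
O23 = (1, 0, ["- Concurrency Control"])              # a[1, 1, ...] in Db2

for name, op in [("O21", O21), ("O22", O22), ("O23", O23)]:
    print(name, "vs O11:", classify(op, O11))
```

Only the pair (O22, O11) is classified as contradictory, which corresponds to the conflict region in the merged file fm. Because the block of O11 is three lines wide, the unrelated change from “Studies” to “Study” is pulled into that conflict region as well.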

The reason why text differentiation algorithms generate line-based editing scripts is that generating fine-grained editing scripts is too time-consuming. The text differentiation algorithm proposed by Hunt et al. [72] has a time complexity of Ω(m × n) in terms of the string lengths m and n alone, and is even worse when the amount of difference between the two strings is taken into account. The algorithm proposed by Myers [93] performs significantly better when the two strings to be differentiated are not too different, and it has been widely implemented as the diff utility in various UNIX systems and most SCM systems. The algorithm has a time complexity of O(N × D) if D is small, where N is the total length of the two strings and D is the size of the minimum editing script between the two strings. However, if D is large (e.g., for two strings that are completely different, D = 2N), its time complexity will be O(N lg N + D²), even worse than the algorithm proposed by Hunt et al. [72]. Therefore, to make the algorithm efficient, the minimum editing script D between the two strings has to be small, and the compromise is to compare the two strings line by line to derive line-based editing scripts instead of comparing them character by character to derive character-based editing scripts [142].

(a) Character-based editing script: editing graph from (0, 0) to (7, 6) for source string ABCABBA and destination string CBABAC. Editing Script = 1D, 2D, 3IB, 6D, 7IC; D = 5

(b) Line-based editing script: editing graph from (0, 0) to (1, 1) for source ABCABBA and destination CBABAC, each treated as a single line. Editing Script = 1D, 1ICBABAC; D = 2

Figure 3.6: Character-based versus line-based editing script

For example, Figure 3.6(a) is the editing graph used in [93], where the source string is ABCABBA and the destination string is CBABAC. The editing graph has a vertex at each grid point (x, y), with x ∈ [0, 7] and y ∈ [0, 6]. The vertices of the editing graph are connected by horizontal, vertical, and diagonal edges to form a directed acyclic graph. Horizontal edges connect each vertex to its right neighbour, i.e., (x-1, y) → (x, y) for x ∈ [1, 7] and y ∈ [0, 6], corresponding to the deletion of the character at position (x, 0). Vertical edges connect each vertex to the neighbour below it, i.e., (x, y-1) → (x, y) for x ∈ [0, 7] and y ∈ [1, 6], corresponding to the insertion of the character at position (0, y). If the character at position (x, 0) is the same as the character at position (0, y), then there is a diagonal edge connecting vertex (x-1, y-1) to vertex (x, y), corresponding to keeping that common character.

According to the text differentiation algorithm proposed by Myers [93], the size of the minimum editing script for transforming the source string ABCABBA into the destination string CBABAC is 5 if the editing script is character-based, which is significantly larger than the size of the corresponding minimum line-based editing script, which is 2. If deltas are derived in a more fine-grained way, they can certainly help merging algorithms reduce the chance of reporting false conflicts. However, even fine-grained editing scripts do not necessarily capture the real editing actions performed by the user. Consequently, they may mislead merging algorithms into producing a merging result that is undesirable to the user. For example, in Figure 3.6(a), the character-based editing script contains five editing operations: 1D to delete the first character A, 2D to delete the second character B, 3IB to insert character B at position 3, 6D to delete the sixth character B, and 7IC to insert character C at position 7. But as shown in Figure 3.6(b), the real editing actions performed by the user could be 1D to delete all seven characters and 1ICBABAC to insert six new characters.
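The two values of D quoted above can be reproduced with a short sketch of the greedy forward search described by Myers [93]. It is a simplified version that returns only the size D of a minimum editing script (insertions plus deletions), not the script itself; the function name `min_edit_script_size` is chosen here for illustration, and modelling a whole string as a single line by wrapping it in a one-element list is an assumption of this example.

```python
from typing import Dict, Sequence


def min_edit_script_size(a: Sequence, b: Sequence) -> int:
    """Size D of a minimum insert/delete editing script turning a into b."""
    n, m = len(a), len(b)
    # furthest[k] is the largest x reached so far on diagonal k = x - y.
    furthest: Dict[int, int] = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]      # vertical edge: insert from b
            else:
                x = furthest[k - 1] + 1  # horizontal edge: delete from a
            y = x - k
            # Follow diagonal edges while the elements match.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            furthest[k] = x
            if x >= n and y >= m:
                return d
    return n + m


# Character-based comparison: every character is one element.
d_chars = min_edit_script_size("ABCABBA", "CBABAC")

# Line-based comparison: each string is treated as a single line.
d_lines = min_edit_script_size(["ABCABBA"], ["CBABAC"])

print(d_chars, d_lines)
```

The outer loop runs once per value of d up to D, which is why the search is fast when the two inputs are similar and slow when they differ widely; comparing whole lines keeps both the sequence lengths and D small.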

It should be pointed out that deltas are also important for storing versions in the repository in SCM systems. In the repository, versions are stored in such a way that the latest version is stored intact while old versions are stored as deltas, although the user interface completely hides this fact [124, 142]. Using deltas to represent versions in the repository is a space-time tradeoff: deltas reduce space consumption but increase access time. Statistics [142] show that the latest version is the one retrieved in 95 percent of all check-out cases, so it is worthwhile to reduce access time by storing the latest version intact.

Merging process

A state-based merging process involves a working site, the repository site, and data transfers between them. Time spent at the working site is denoted as Twok, time spent at the repository site is denoted as Trep, and time spent on data transfers between the working site and the repository site is denoted as Tcom. Another important measure is the system response time, denoted as Tres, which is the time interval between the moment the user issues a merging command and the moment the user regains control of the system to issue other commands at a working site. Tres is very important in measuring the performance of a merging process because it is visible to users and has a substantial impact on users' evaluation of the system. Twok, Trep, and Tcom are used to measure the consumption of computing cycles and network bandwidth, which may have a direct or indirect impact on Tres.

Figure 3.7 shows the state-based commit merging process, which merges the working copy W¹ₘ at Site 1 with version R0 in the repository to generate a new version R1. The process is described as follows:

1. Site 1 sends a request to the repository, asking whether its working copy with baseline version number 0 is committable or not.

2. The repository processes the request and sends back the reply, telling whether the request is permitted or not.

3. If the request is not permitted, the merging process fails. Otherwise, Site 1 spends time Twok to compress the working copy with a general file compression algorithm (optional) and to prepare the working copy for transfer. Counting for Tres ends at this step.

4. The working copy at Site 1 takes time Tcom to be transferred to the repository.

5. The repository spends time Trep to merge W¹ₘ with R0 to generate a new version R1. The repository first uncompresses Site 1's working copy W¹ₘ (optional), then executes the text differentiation algorithm D = diff(W¹ₘ, R0) to derive the deltas D between W¹ₘ and R0, and finally generates the new version R1 = W¹ₘ and replaces R0 with D.

Site 1 -> Repository: Msg = Req(0)
Repository -> Site 1: Msg = Ack(Yes/No)
Site 1: general compression (Twok); Tres ends here
Site 1 -> Repository: Msg = W¹ₘ (Tcom)
Repository: general uncompression; D = diff(W¹ₘ, R0); R1 = W¹ₘ; R0 = D (Trep)

Figure 3.7: A state-based commit merging process
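The repository side of the commit steps above, together with the reverse-delta storage scheme described earlier, can be sketched as follows. Everything in this block is illustrative: the class and function names are made up, `difflib.SequenceMatcher` stands in for the diff algorithm, compression and network transfer are left out, and the rule that a working copy is committable only when its baseline is the latest version is an assumption of this sketch.

```python
import difflib
from typing import List, Sequence, Tuple

# One delta entry: replace src[i1:i2] with the given lines (0-based, half-open).
Delta = List[Tuple[int, int, List[str]]]


def make_delta(src: Sequence[str], dst: Sequence[str]) -> Delta:
    """Derive a line-based delta D such that dst = src ◦ D."""
    matcher = difflib.SequenceMatcher(a=src, b=dst, autojunk=False)
    return [
        (i1, i2, list(dst[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_delta(src: Sequence[str], delta: Delta) -> List[str]:
    """Apply a delta bottom-up so that earlier indices stay valid."""
    out = list(src)
    for i1, i2, new_lines in reversed(delta):
        out[i1:i2] = new_lines
    return out


class Repository:
    """Keeps the latest version intact and older versions as reverse deltas."""

    def __init__(self, initial: Sequence[str]) -> None:
        self.latest = list(initial)    # R0, stored intact
        self.version = 0
        self.deltas: List[Delta] = []  # deltas[k] turns version k+1 into version k

    def committable(self, baseline_version: int) -> bool:
        # Assumed rule: only a working copy based on the latest version may commit.
        return baseline_version == self.version

    def commit(self, baseline_version: int, working_copy: Sequence[str]) -> bool:
        if not self.committable(baseline_version):
            return False
        self.deltas.append(make_delta(working_copy, self.latest))  # D = diff(W, R0)
        self.latest = list(working_copy)                           # R1 = W
        self.version += 1
        return True

    def checkout(self, version: int) -> List[str]:
        """Rebuild an old version by applying reverse deltas to the latest one."""
        lines = list(self.latest)
        for k in range(self.version - 1, version - 1, -1):
            lines = apply_delta(lines, self.deltas[k])
        return lines


r0 = ["Topics on collaborative editing:", "- Undo", "- Workflow Management"]
w1 = ["Topics on collaborative editing:", "- Group Undo"]

repo = Repository(r0)
first = repo.commit(0, w1)                  # baseline 0 is the latest version
second = repo.commit(0, w1 + ["- Extra"])   # baseline 0 is now out of date
```

Checking out the latest version costs nothing beyond a copy, whereas checking out version 0 has to apply the stored delta, which reflects the space-time tradeoff of delta storage.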

Figure 3.8 shows the state-based update merging process, which merges version R1 with the working copy W²ₙ at Site 2 to generate a new document state W²ₙ₊₁. The process is described as follows:

1. [...] 0 for the working copy.

2. The repository takes time Trep to process the request. It first retrieves version R0 = R1 ◦ D, then compresses files R0 and R1 (optional), and finally transfers the two files to Site 2.

3. The two files take time Tcom to arrive at Site 2.

4. Site 2 spends time Twok to merge version R1 into the working copy. It first uncompresses files R0 and R1 (optional), then derives deltas D1 (from W²ₙ to R1) by D1 = diff(W²ₙ, R1) and deltas D2 (from R0 to R1) by D2 = diff(R0, R1), and finally generates a new document state W²ₙ₊₁ by executing diff3(W²ₙ, R0, R1, D1, D2). Counting for the system response time Tres ends here.

As shown in Figure 3.7, in the commit merging process, Tres is short and independent of the files to be merged. This is good because users are not discouraged by the other delays involved in a commit merging process. Tcom depends on the size of the working copy to be committed into the repository. The time for deriving deltas dominates Trep at the repository site. By comparison, as shown in Figure 3.8, in the update merging process, Tres is determined by the whole merging process, which is mainly measured by Trep, Tcom, and Twok. First, Trep is dominated by the time spent on the retrieval of version R0, which depends on the number of editing operations in the deltas D. Second, Tcom depends on the total size of files R0 and R1. Finally, Twok is dominated primarily by the derivation of deltas D1 and D2, and secondarily by the execution of the merging algorithm to generate a new document state of the working copy. In general, Twok depends on the total size of the three files W²ₙ, R0, and R1, and on how much these three files differ.

To sum up, the time spent on delta-related tasks, such as transferring files across the network, deriving deltas, and applying deltas to one document to generate or retrieve another document, is significant in a state-based merging process. In particular, it has a fundamental impact on the system response time in an update merging process. Responsiveness can be very poor when the files are large, network resources are scarce, and/or the computation power of the hosts is low.

Site 2 -> Repository: Msg = Req(0, 1)
Repository: R0 = R1 ◦ D; general compression (Trep)
Repository -> Site 2: Msg = Ack(R0, R1) (Tcom)
Site 2: general uncompression; D1 = diff(W²ₙ, R1); D2 = diff(R0, R1); W²ₙ₊₁ = diff3(W²ₙ, R0, R1, D1, D2) (Twok)
Tres spans the whole process.

Figure 3.8: A state-based update merging process

In document INTERNET-BASED COLLABORATIVE PROGRAMMING TECHNIQUES AND ENVIRONMENTS (Page 97-108)