3.2 Textual merging
3.2.1 State-based merging technique
In state-based merging, deltas are derived by executing a text differentiation algorithm that compares two text files and generates the differences between them. The best-known text differentiation algorithm is diff(f1, f2) [71, 88], which compares two text files f1 and f2 and generates an editing script (or difference report) with which one file can be transformed into the other. Several front-end tools have been built upon the diff algorithm, such as spiff [94] and flexible diff [96], which give users some flexibility to control how the diff algorithm is performed and how the generated editing scripts are reported. For example, a user can instruct spiff to ignore differences between floating point numbers that are below a user-specified threshold, or to ignore white space and comments when differentiating program source code. The diff algorithm and a variety of its front-end tools have been used in most SCM systems. Various state-based textual merging algorithms have been devised based on the deltas derived by text differentiation algorithms. A typical three-way state-based textual merging algorithm is diff3(f1, fb, f2), which produces a merged text file from text files f1 and f2 and their baseline file fb by utilizing deltas derived by the diff algorithm. The diff3 textual merging algorithm has been widely used in SCM systems such as SCCS [112], RCS [142] and CVS [10].
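The kind of flexibility offered by such front-end tools can be sketched as a line comparison that ignores the amount of white space and small floating point differences. This is only an illustrative sketch in Python and not the actual implementation of spiff; the function names tokens_equal and lines_equal and the parameter tol are hypothetical.

```python
def tokens_equal(a: str, b: str, tol: float) -> bool:
    """Compare two tokens, treating numbers that differ by at most `tol` as equal."""
    try:
        return abs(float(a) - float(b)) <= tol
    except ValueError:
        # At least one token is not a number: fall back to exact comparison.
        return a == b


def lines_equal(line1: str, line2: str, tol: float = 0.0) -> bool:
    """Compare two lines token by token, ignoring the amount of white space."""
    t1, t2 = line1.split(), line2.split()
    return len(t1) == len(t2) and all(
        tokens_equal(x, y, tol) for x, y in zip(t1, t2)
    )


# Differences below the threshold are ignored, larger ones are reported.
same = lines_equal("x = 1.0001", "x   =  1.0002", tol=1e-3)
different = lines_equal("x = 1.0", "x = 2.0", tol=1e-3)
print(same, different)
```

A differentiation algorithm could use such a predicate in place of exact line equality, so that lines differing only in formatting or in insignificant digits do not appear in the editing script.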
The representation of deltas
Deltas between two text files are represented as editing scripts and can be used to transform one file into the other. For example, if D12 is the editing script from text file f1 to f2, then applying this editing script to f1 will transform f1 into f2. This transformation procedure is denoted as f2 = f1 ◦ D12 for the sake of explanation. Editing scripts are described by three kinds of editing operations: add (a), delete (d), and change (c). An adding operation is denoted as a[SL, NoL, ToL] to represent adding NoL new lines of text ToL after line SL. A changing operation is denoted as c[SL, NoL, ToL] to represent replacing NoL lines starting from line SL with new lines of text ToL. A deletion operation is denoted as d[SL, NoL] to represent deleting NoL lines starting from line SL, which is equivalent to a changing operation c[SL, NoL, ToL] with empty ToL.
For example, as shown in Figure 3.3, an initial text file f1 lists four major topics on collaborative editing. File f1 can be revised into a new file f2 as follows: insert a new line “- Concurrency Control” as the first item; insert the word “Group” before the word “Undo”; and remove the item “- Workflow Management”. The editing script D12 describing the deltas from f1 to f2 consists of a list of editing operations [O1, O2, O3], where O1 = d[5, 1] or c[5, 1, “”] deletes the fifth line (i.e., the last line) “- Workflow Management”, O2 = c[3, 1, “- Group Undo”] replaces the third line “- Undo” with the new line “- Group Undo”, and O3 = a[1, 1, “- Concurrency Control”] adds a new line “- Concurrency Control” after the first line “Topics on collaborative editing:”.
Initial file f1:
Topics on collaborative editing:
- Consistency Maintenance
- Undo
- Usability Studies
- Workflow Management

Revised file f2:
Topics on collaborative editing:
- Concurrency Control
- Consistency Maintenance
- Group Undo
- Usability Studies

D12 = diff(f1, f2):
O1 = d[5, 1] or c[5, 1, ""]
O2 = c[3, 1, "- Group Undo"]
O3 = a[1, 1, "- Concurrency Control"]

Figure 3.3: The representation of editing scripts
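The transformation f2 = f1 ◦ D12 of Figure 3.3 can be sketched in Python as follows. The tuple encoding (kind, SL, NoL, ToL), with ToL given as a list of lines, and the function name apply_script are assumptions made for illustration; the sketch also assumes that the operations refer to line numbers of the original file and do not touch overlapping lines.

```python
def apply_script(lines, script):
    """Apply editing operations (kind, SL, NoL, ToL) to a list of lines.

    SL is a 1-based line number in the ORIGINAL file, so operations are
    applied from the bottom of the file upwards to keep every SL valid.
    """
    out = list(lines)
    for kind, sl, nol, tol in sorted(script, key=lambda op: op[1], reverse=True):
        if kind == "a":      # add the lines in ToL after line SL
            out[sl:sl] = tol
        elif kind == "c":    # replace NoL lines starting from line SL with ToL
            out[sl - 1:sl - 1 + nol] = tol
        elif kind == "d":    # delete NoL lines starting from line SL
            del out[sl - 1:sl - 1 + nol]
        else:
            raise ValueError(f"unknown editing operation: {kind!r}")
    return out


f1 = [
    "Topics on collaborative editing:",
    "- Consistency Maintenance",
    "- Undo",
    "- Usability Studies",
    "- Workflow Management",
]
D12 = [
    ("d", 5, 1, []),                          # O1
    ("c", 3, 1, ["- Group Undo"]),            # O2
    ("a", 1, 1, ["- Concurrency Control"]),   # O3
]
f2 = apply_script(f1, D12)
print("\n".join(f2))
```

Applying the operations bottom-up is a design choice of this sketch: it avoids having to renumber later lines after each insertion or deletion.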
Editing scripts are line-based in the sense that if there is a single change within a line, the whole line is regarded as changed. In Figure 3.3, the line-based editing operations O1 and O3 have captured the real editing actions of inserting a new item and removing an existing item, but O2 does not capture the real editing action of inserting a word within an existing item. The line-based representation of editing scripts can be even more coarse-grained in the sense that if there is a change in each of several consecutive lines, those consecutive lines as a whole, referred to as an editing block, are regarded as changed. For example, as shown in Figure 3.4, file f1 has been revised into a new file f2 by changing the word “Studies” in the item “- Usability Studies” from plural to singular “Study”, in addition to the revisions made in Figure 3.3. The editing script in Figure 3.4 is significantly different from that in Figure 3.3. In particular, consecutive lines 3, 4, and 5 constitute an editing block with editing operation O1 = d[3, 3] or c[3, 3, “”] to delete 3 lines starting from line 3, and O2 = a[2, 2, “- Group Undo\n- Usability Study”] to insert two new lines after line 2. Clearly, editing operations O1 and O2 do not reflect the real editing actions performed by the user.
Initial file f1:
Topics on collaborative editing:
- Consistency Maintenance
- Undo
- Usability Studies
- Workflow Management

Revised file f2:
Topics on collaborative editing:
- Concurrency Control
- Consistency Maintenance
- Group Undo
- Usability Study

D12 = diff(f1, f2):
O1 = d[3, 3] or c[3, 3, ""]
O2 = a[2, 2, "- Group Undo\n- Usability Study"]
O3 = a[1, 1, "- Concurrency Control"]

Figure 3.4: Coarse-grained representation of editing scripts
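The grouping of consecutive changed lines into one editing block can be observed with any line-based differencing tool. The sketch below uses difflib from the Python standard library, which implements a different matching algorithm from the UNIX diff utility and is used here only for illustration; it reports blocks as insert, delete, or replace opcodes rather than as the a, d, and c operations above.

```python
import difflib

f1 = [
    "Topics on collaborative editing:",
    "- Consistency Maintenance",
    "- Undo",
    "- Usability Studies",
    "- Workflow Management",
]
f2 = [
    "Topics on collaborative editing:",
    "- Concurrency Control",
    "- Consistency Maintenance",
    "- Group Undo",
    "- Usability Study",
]

matcher = difflib.SequenceMatcher(a=f1, b=f2, autojunk=False)
# Keep only the changed blocks; indices are 0-based and end-exclusive.
blocks = [op for op in matcher.get_opcodes() if op[0] != "equal"]
for tag, i1, i2, j1, j2 in blocks:
    print(tag, "old:", f1[i1:i2], "new:", f2[j1:j2])
```

For these two files, the three consecutive lines “- Undo”, “- Usability Studies”, and “- Workflow Management” are reported as a single replaced block, even though the user performed three separate editing actions on them.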
It is the coarse granularity of derived editing scripts that causes the false conflict problem in state-based textual merging. In the diff3 merging algorithm, a false conflict occurs if an editing block covered by one editing operation overlaps with another editing block covered by another editing operation and these two editing operations are contradictory in the sense that they change the same line(s) to different values. For example, as shown in Figure 3.5, the baseline file fb contains four items describing the topics on collaborative editing. It has been revised into file f1 as follows: add the word “Group” in the item “- Undo”; change the word “Studies” to “Study” in the item “- Usability Studies”; and remove the last item “- Workflow Management”. Concurrently, it has also been revised into another file f2 as follows: add a new item “- Concurrency Control” as the first item; add the word “Selective” in the item “- Undo”; and remove the last item “- Workflow Management”.
Editing script Db1 contains editing operation O^1_1 = c[3, 3, “- Group Undo\n- Usability Study”] to represent deltas from fb to f1, where O^i_j denotes the j-th editing operation in Dbi. Editing script Db2 contains editing operations O^2_1 = d[5, 1], O^2_2 = c[3, 1, “- Selective Undo”] and O^2_3 = a[1, 1, “- Concurrency Control”] to represent deltas from fb to f2. When f2 is merged with f1 to produce a merged text file fm, the diff3 algorithm is executed as follows. First, the editing block covered by O^2_1 in Db2 contains the fifth line and the editing block covered by O^1_1 in Db1 contains the third, fourth, and fifth lines. These two editing blocks overlap, but editing operations O^2_1 and O^1_1 are not contradictory because both intend to change the fifth line “- Workflow Management” to “”. Therefore, the effect of O^2_1 is integrated into the merged file fm: the item “- Workflow Management” is removed. Second, the editing block covered by O^2_2 in Db2 contains the third line and the editing block covered by O^1_1 in Db1 contains the third, fourth, and fifth lines. These two editing blocks overlap and editing operations O^2_2 and O^1_1 are contradictory because O^2_2 intends to change the third line to “- Selective Undo” while O^1_1 intends to change the same line to “- Group Undo”. As a result, it is a false conflict and the merging algorithm simply keeps both changes in the merged file fm. Finally, the editing block covered by O^2_3 in Db2 does not overlap with the editing block covered by O^1_1 in Db1. Therefore, the effect of O^2_3 is integrated into the merged file fm: the item “- Concurrency Control” is added.
Baseline file fb:
Topics on collaborative editing:
- Consistency Maintenance
- Undo
- Usability Studies
- Workflow Management

Revised file f1:
Topics on collaborative editing:
- Consistency Maintenance
- Group Undo
- Usability Study

Db1 = diff(fb, f1):
O^1_1 = c[3, 3, "- Group Undo\n- Usability Study"]

Revised file f2:
Topics on collaborative editing:
- Concurrency Control
- Consistency Maintenance
- Selective Undo
- Usability Studies

Db2 = diff(fb, f2):
O^2_1 = c[5, 1, ""]
O^2_2 = c[3, 1, "- Selective Undo"]
O^2_3 = a[1, 1, "- Concurrency Control"]

Merged file fm = diff3(f1, fb, f2):
Topics on collaborative editing:
- Concurrency Control
- Consistency Maintenance
<<<<<<< f1
- Group Undo
- Usability Study
=======
- Selective Undo
- Usability Studies
>>>>>>> f2

Figure 3.5: An example of three-way textual merging supported by diff3
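The merging behaviour in Figure 3.5 can be approximated by the following simplified three-way merge in Python. It is a sketch under stated assumptions rather than the diff3 implementation itself: instead of comparing individual editing operations, it splits the files at baseline lines that both revisions kept and compares the blocks in between. The function names match_map and merge3 are hypothetical, and line matching is delegated to difflib.

```python
import difflib


def match_map(base, other):
    """Map each baseline line index to the index of its matching line in `other`."""
    matcher = difflib.SequenceMatcher(a=base, b=other, autojunk=False)
    mapping = {}
    for i, j, n in matcher.get_matching_blocks():
        for k in range(n):
            mapping[i + k] = j + k
    return mapping


def merge3(f1, fb, f2):
    """Simplified three-way merge of line lists f1 and f2 with baseline fb."""
    m1, m2 = match_map(fb, f1), match_map(fb, f2)
    # Baseline lines kept by BOTH revisions act as synchronisation points.
    sync = [b for b in range(len(fb)) if b in m1 and b in m2]
    merged = []
    i1 = i2 = ib = 0  # next unconsumed line in f1, f2, and fb
    for b in sync + [len(fb)]:  # len(fb) is a sentinel for the end of file
        e1, e2 = m1.get(b, len(f1)), m2.get(b, len(f2))
        c1, cb, c2 = f1[i1:e1], fb[ib:b], f2[i2:e2]
        if c1 == c2:      # same change (or no change) on both sides
            merged += c1
        elif c1 == cb:    # only f2 changed this block
            merged += c2
        elif c2 == cb:    # only f1 changed this block
            merged += c1
        else:             # both changed the block differently: conflict
            merged += ["<<<<<<< f1", *c1, "=======", *c2, ">>>>>>> f2"]
        if b < len(fb):
            merged.append(fb[b])  # the synchronisation line itself
        i1, i2, ib = e1 + 1, e2 + 1, b + 1
    return merged


title = "Topics on collaborative editing:"
fb = [title, "- Consistency Maintenance", "- Undo",
      "- Usability Studies", "- Workflow Management"]
f1 = [title, "- Consistency Maintenance", "- Group Undo", "- Usability Study"]
f2 = [title, "- Concurrency Control", "- Consistency Maintenance",
      "- Selective Undo", "- Usability Studies"]
fm = merge3(f1, fb, f2)
print("\n".join(fm))
```

With the three files of Figure 3.5 this sketch produces the same merged file fm as shown in the figure: the new first item is integrated, and lines 3 to 5 of the baseline end up in a single conflict block, including the unrelated change from “Studies” to “Study”.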
The reason why text differentiation algorithms generate line-based editing scripts is that it is too time-consuming to generate fine-grained editing scripts. The text differentiation algorithm proposed by Hunt et al. [72] has a time complexity of Ω(m × n) in terms of the string lengths m and n alone, and is even worse when taking into account how different the two strings are. The algorithm proposed by Myers [93] performs significantly better if the two strings to be differentiated are not too different, and it has been widely implemented as the diff utility in various UNIX systems and most SCM systems. The algorithm has a time complexity of O(N × D) if D is small, where N is the total length of the two strings and D is the size of the minimum editing script between the two strings. However, if D is large (e.g., for two strings that are completely different, D = 2N), its time complexity will be O(N lg N + D^2), even worse than the algorithm proposed by Hunt et al. [72]. Therefore, to make the algorithm efficient, the size D of the minimum editing script between the two strings has to be small, and the compromise is to compare the two strings line by line to derive line-based editing scripts instead of comparing them character by character to derive character-based editing scripts [142].
[Figure 3.6 shows two editing graphs for the source string ABCABBA and the destination string CBABAC.]
(a) Character-based editing script: grid from (0, 0) to (7, 6); Editing Script = 1D, 2D, 3IB, 6D, 7IC; D = 5
(b) Line-based editing script: grid from (0, 0) to (1, 1); Editing Script = 1D, 1ICBABAC; D = 2
Figure 3.6: Character-based versus line-based editing script
For example, Figure 3.6(a) is the editing graph used in [93], where the source string is ABCABBA and the destination string is CBABAC. The editing graph has a vertex at each point (x, y) in the grid (x ∈ [0, 7] and y ∈ [0, 6]). The vertices of the editing graph are connected by horizontal, vertical, and diagonal edges to form a directed acyclic graph. Horizontal edges connect each vertex to its right neighbour (i.e., (x-1, y) → (x, y) for x ∈ [1, 7] and y ∈ [0, 6], corresponding to the deletion of the character at position (x, 0)). Vertical edges connect each vertex to the neighbour below it (i.e., (x, y-1) → (x, y) for x ∈ [0, 7] and y ∈ [1, 6], corresponding to the insertion of the character at position (0, y)). If the character at position (x, 0) is the same as the character at position (0, y), then there is a diagonal edge connecting vertex (x-1, y-1) to vertex (x, y), corresponding to keeping that common character.
According to the text differentiation algorithm proposed by Myers [93], the size of the minimum editing script for transforming the source string ABCABBA into the destination string CBABAC is 5 if the editing script is character-based, which is significantly larger than the size of the corresponding minimum line-based editing script, which is 2. If deltas are derived in a more fine-grained way, they can certainly help merging algorithms reduce the chance of reporting false conflicts. However, even fine-grained editing scripts have not necessarily captured the real editing actions performed by the user. Consequently, they may mislead merging algorithms into producing a merging result undesirable to the user. For example, in Figure 3.6(a), the character-based editing script contains five editing operations: 1D to delete the first character A, 2D to delete the second character B, 3IB to insert character B at position 3, 6D to delete the sixth character B, and 7IC to insert character C at position 7. But as shown in Figure 3.6(b), the real editing actions performed by the user could be 1D to delete all seven characters and 1ICBABAC to insert six new characters.
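The two values of D quoted above can be reproduced with a short Python sketch of the greedy search over the editing graph, written here from the common textbook description of the algorithm in [93]. The function name ses_length is hypothetical, and the sketch computes only the size D of a minimum editing script, not the script itself.

```python
def ses_length(src, dst):
    """Return the size D of a minimum editing script (insertions + deletions).

    `src` and `dst` may be strings (character-based comparison) or lists
    of lines (line-based comparison).
    """
    n, m = len(src), len(dst)
    # furthest[k] is the largest x reached so far on diagonal k = x - y.
    furthest = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]        # vertical edge: insertion
            else:
                x = furthest[k - 1] + 1    # horizontal edge: deletion
            y = x - k
            # Follow diagonal edges (common elements) as far as possible.
            while x < n and y < m and src[x] == dst[y]:
                x, y = x + 1, y + 1
            furthest[k] = x
            if x >= n and y >= m:
                return d
    return n + m


d_char = ses_length("ABCABBA", "CBABAC")      # character-based comparison
d_line = ses_length(["ABCABBA"], ["CBABAC"])  # each string treated as one line
print(d_char, d_line)
```

The outer loop runs once per value of d up to D, which is why the running time grows with the size of the minimum editing script and why comparing whole lines, which keeps both N and D small, is so much cheaper than comparing characters.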
It should be pointed out that deltas are also important for storing versions in the repository in SCM systems. In the repository, versions are stored in such a way that the latest version is stored intact while old versions are stored as deltas, although the user interface completely hides this fact [124, 142]. Using deltas to represent versions in the repository is a space-time tradeoff: deltas reduce space consumption but increase access time. Statistics [142] show that the latest version is the one retrieved in 95 percent of all check-out cases; therefore, storing the latest version intact significantly reduces access time.
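This storage scheme can be sketched in Python as follows. The class Repository, its methods, and the delta encoding (start, end, new lines) are assumptions made for illustration and do not describe the storage format of any particular SCM system; deltas are derived with difflib.

```python
import difflib


def derive_delta(src, dst):
    """Return a delta D such that apply_delta(src, D) == dst."""
    matcher = difflib.SequenceMatcher(a=src, b=dst, autojunk=False)
    return [(i1, i2, dst[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


def apply_delta(src, delta):
    """Apply a delta bottom-up so that earlier line indices stay valid."""
    out = list(src)
    for i1, i2, new_lines in reversed(delta):
        out[i1:i2] = new_lines
    return out


class Repository:
    """Keeps the latest version intact and older versions as reverse deltas."""

    def __init__(self, first_version):
        self.latest = list(first_version)
        self.deltas = []  # deltas[k] turns version k+1 back into version k

    def commit(self, working_copy):
        # The old latest version is replaced by a delta from the new one.
        self.deltas.append(derive_delta(working_copy, self.latest))
        self.latest = list(working_copy)

    def checkout(self, version):
        lines = self.latest  # the latest version needs no delta at all
        for delta in reversed(self.deltas[version:]):
            lines = apply_delta(lines, delta)
        return list(lines)


r0 = ["Topics:", "- Undo", "- Usability Studies"]
r1 = ["Topics:", "- Group Undo", "- Usability Studies"]
r2 = ["Topics:", "- Concurrency Control", "- Group Undo"]
repo = Repository(r0)
repo.commit(r1)
repo.commit(r2)
print(repo.checkout(0))
```

Checking out the latest version returns the stored copy directly, whereas checking out version 0 has to apply every stored delta in turn, which illustrates the space-time tradeoff described above.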
Merging process
A state-based merging process involves a working site, the repository site, and data transfers between them. Time spent at the working site is denoted as Twok, time spent at the repository site is denoted as Trep, and time spent on data transfers between the working site and the repository site is denoted as Tcom. Another important measure is the system response time, denoted as Tres, which is the time interval between when the user issues a merging command and when the user regains control of the system to issue other commands at a working site. Tres is very important in measuring the performance of a merging process because it is visible to users and has a substantial impact on users’ evaluation of the system. Twok, Trep, and Tcom are used to measure the consumption of computing cycles and network bandwidth, which may have a direct or indirect impact on Tres.
Figure 3.7 shows the state-based commit merging process to merge the working copy W^1_m at Site 1 with version R0 in the repository to generate a new version R1. The process is described as follows:
1. Site 1 sends a request to the repository, asking whether its working copy with baseline version number 0 is committable or not.
2. The repository processes the request and sends back the reply, telling whether the request is permitted or not.
3. If the request is not permitted, the merging process fails. Otherwise, Site 1 spends time Twok to compress the working copy with general file compression algorithms (optional), and to prepare for transferring the working copy. Counting for Tres ends at this step.
4. The working copy at Site 1 takes time Tcom to be transferred to the repository.
5. The repository spends time Trep to merge W^1_m with R0 to generate a new version R1. The repository first uncompresses Site 1’s working copy W^1_m (optional), then executes the text differentiation algorithm D = diff(W^1_m, R0) to derive the deltas D between W^1_m and R0, and finally generates the new version R1 = W^1_m and replaces R0 with D.
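[Figure 3.7 is a message sequence diagram between Site 1 and the Repository. Site 1 sends Msg = Req(0); the Repository replies Msg = Ack(Yes/No); Site 1 performs general compression (Twok, where Tres ends) and sends Msg = W^1_m (Tcom); the Repository performs general uncompression, D = diff(W^1_m, R0), R1 = W^1_m, and R0 = D (Trep).]
Figure 3.7: A state-based commit merging process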
Figure 3.8 shows the state-based update merging process to merge version R1 with the working copy W^2_n at Site 2 to generate a new document state W^2_{n+1}. The process is described as follows:
0 for the working copy.
2. The repository takes time Trep to process the request. It first retrieves version R0 = R1 ◦ D, then compresses files R0 and R1 (optional), and finally transfers the two files to Site 2.
3. The two files take time Tcom to arrive at Site 2.
4. Site 2 spends time Twok to merge version R1 into the working copy. It first uncompresses files R0 and R1 (optional), then derives deltas D1 (from W^2_n to R1) by D1 = diff(W^2_n, R1) and deltas D2 (from R0 to R1) by D2 = diff(R0, R1), and finally generates a new document state W^2_{n+1} by executing diff3(W^2_n, R0, R1, D1, D2). Counting for the system response time Tres ends here.
As shown in Figure 3.7, in the commit merging process, Tres is short and independent of the files to be merged. This is good because users will not be discouraged by other delays involved in a commit merging process. Tcom is dependent on the size of the working copy to be committed into the repository. Time for deriving deltas dominates Trep at the repository site. By comparison, as shown in Figure 3.8, in the update merging process, Tres is determined by the whole merging process, which is mainly measured by Trep, Tcom, and Twok. First, Trep is dominated by the time spent on the retrieval of version R0, which is dependent on the number of editing operations in the deltas D. Second, Tcom is dependent on the total size of files R0 and R1. Finally, Twok is dominated primarily by the derivation of deltas D1 and D2, and secondarily by the execution of the merging algorithm to generate a new document state of the working copy. In general, Twok is dependent on the total size of the three files W^2_n, R0, and R1, and on how different these three files are.
To sum up, time spent on delta-related tasks, such as transferring files across the network, deriving deltas, and applying deltas to one document to generate or retrieve another document, is significant in a state-based merging process. In particular, it has a fundamental impact on the system response time in an update merging process. Responsiveness could be very poor when the files are large, network resources are scarce, and/or the computation power of hosts is low.
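[Figure 3.8 is a message sequence diagram between Site 2 and the Repository. Site 2 sends Msg = Req(0, 1); the Repository computes R0 = R1 ◦ D and performs general compression (Trep), then replies Msg = Ack(R0, R1) (Tcom); Site 2 performs general uncompression, D1 = diff(W^2_n, R1), D2 = diff(R0, R1), and W^2_{n+1} = diff3(W^2_n, R0, R1, D1, D2) (Twok). Tres spans the whole process.]
Figure 3.8: A state-based update merging process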