http://wrap.warwick.ac.uk/
Original citation:
Goldberg, Leslie Ann, Goldberg, Paul W., Phillips, C. A. and Sorking, G. B. (1996) Constructing computer virus phylogenies. University of Warwick. Department of
Computer Science. (Department of Computer Science Research Report). (Unpublished) CS-RR-297
Permanent WRAP url:
http://wrap.warwick.ac.uk/60984
Copyright and reuse:
The Warwick Research Archive Portal (WRAP) makes this work by researchers of the University of Warwick available open access under the following conditions. Copyright © and all moral rights to the version of the paper presented here belong to the individual author(s) and/or other copyright owners. To the extent reasonable and practicable the material made available in WRAP has been checked for eligibility before being made available.
Copies of full items can be used for personal research or study, educational, or not-for-profit purposes without prior permission or charge. Provided that the authors, title and full bibliographic details are credited, a hyperlink and/or URL is given for the original metadata page and the content is not changed in any way.
A note on versions:
Research
R.port
297
Constructing Computer
Virus
Phylogenies
Leslie
Ann
Goldberg,
Paul
W.
Goldberg,
Cynthia
A.
Phillips,
Gregory
B.
Sorkin
RR297
There has been much recent algorithmic work on the problem of reconstructing the evolutionary
history of biological species. Computer virus specialists are interested in
finding
the evolutionaryhistory of computer viruses - a virus is often written using code fragments from one or more other viruses, which are its immediate ancestors.
A
phylogeny for a collection of computer viruses is a directed acyclic graph whose nodes are the viruses and whose edges map ancestors to descendants and satisfy the property that each code fragment is "invented" only once. To providea simple explanation for the data, we consider the problem of constructing such a phylogeny with a minimum number of edges. In general this optimization problem is NP-complete; some
associated approximation problems are also hard, but others are easy. When tree solutions exist, they can be constructed and randomly sampled in polynomial time.
Department of Computer Science
University of Warwick Coventry
CV4lAL
United Kinedom
Constructing
Computer
Virus
Phylogenies
Leslie
Ann
Goldberg*Cynthia A. Phillipsl
Paul
W.
GoldberglGregory
B.
Sorkin$January 23,
1996Abstract
There has been much recent algorithmic work on the problem of reconstruct-ing the evolutionary history of biological species. computer virus specialists are interested in finding the evolutionary history of computer viruses
-
a virusis often written using code fragments from one
or
more other viruses, which are its immediate ancestors.A
phylogeny for a collection of computer virusesis a directed acyclic graph whose nodes are the viruses and whose edges map ancestors to descendants and satisfy the property that each code fragment is
,,invented" only once. To provide a simple explanation for the data, we con-sider the problem of constructing such a phylogeny with a minimum number of
edges.
In
general this optimization problem is NP-completel some associatedapproximation problems are also hard, but others are easy. When tree solutions
exist, they can be constructed and raudomly sampled in polynomial tin-re.
*lesl-ie6dcs.rarrick.ac.uk. Department of Computer Science, University of Warwick,
Coven-try CV4 7AL, United Kingdom. Part of this work was performed at Sandia National Laboratories
and was supported by the U.S. Department of Energy under contract DE-AC04-76AL85000. Part
of this work was supported by the ESPRIT Basic Research Action Programme of the EC under
contract ?141 (project ALCOM-IT).
tgoldbepr6helios.aston.ac.uk. Department of Applied Mathematics and Computer Science'
Aston University, Aston Triangle, Birmingham 84 7ET, United Kingdom. Part of this work was
performed at Sandia National Laboratories and was supported by the U.S. Department of Bnergy
under contract DE-AC04-76AL85000.
lcaphillocs.sandia.gov. Sandia National Labs, P.O. Box 5800, Albuquerque NM 87185. This work was performed under u.s. Department of Energy contract DE-ACO4-76AL85000.
1
Introduction
There are now several thousand different computer viruses
in
existence,with
new ones beingwritten
at a rate of 3 to 4 per day. Most of these are based upon previous ones: someone copies and modifies a virus, or creates a new viruswith
subroutines borrowed from one or moreancestors-For most purposes, a computer virus can be regarded as a fi.xed string of bytes,
each byte consisting
ofeight bits.
If
one virus is based on another, long substrings of the ancestor, say 20 bytes or more,will
appearin
the descendant. Using probabilitvrnodels
similar
to
those employedin
speech recognitionit
is
possibleto
estimatcthe probability
that
a given byte string occursin
several viruses by chance [1a];if
the
probability
is lowbut
the string does occurin
several viruses then we concludethat
it
waswritten for
onevirus,
and copiedinto
the others.
More detailsof
thepattern-matching approach used
to identify
duplicatedbyte
strings are given inSection 1.3.
We
wish
to
infer an
evolutionaryor
phylogenetichistory
for a
set
of
com-puter
viruses.l
As
most phylogeneticliterature
to
date has been based upon bi-ological evolution, we adoptthat
terminology.
Thus,the
virusesin
the
input
set5 :
{rt,...,sr}
are ca.lledspecies.
The
species are defi.nedby a
setof binary
characters
c =
{cr,...,ck}.A
binary
characterisafunction
c:
s *
{0,1}.
(In
general,
the
range ofa
character can bearbitrary,
but
the presenceor
absence of byte strings can be rnodeledwith
binary characters.) Each character c correspoldsto
a
bytestring,
with
c(s)
:
1if
the
string occursin
speciess
and c(s)=
0oth-erwise.
Ifc(s)
=
1, we saythat
speciess
hasor
contains characterc.
In
analogyrvith terminology from the logic synthesis area of computer circuit design, we d.efine
the
on-set
,5.of
a character cto
bethe
setof
all
species on whichits
valueis
1:.9.
=
{s
€5
| c(s):
1}. A
character c istrivial
if
l5.l
<
1.A trivial
character canbe ignored since
it
imposes no constraints on possible solutions.We assume
that
theinput
species areall
related:that
thebipartite
graphjoin-ing
speciesto
charactersthat
have themis
connected. Otherwise,the
connected components can be considered independently.We also assume
that
each code fragment is invented oniy once. For sufficiently long fragments this is justified by differencesin
programming style, the manypossi-ble orderings of unconstrained events,
etc.
We model the evolution of a set of viral specieswith
a directed graphin
which an edge si--tsj
indicatesthat
species s; is anancestor of species
s;
(i.e.s;
inherited some character(s) from s;).Definition:
A
phyloDAG
forinput
species5
and characters C is a directed acycLic graph(DAG)
with
node setS.
For each character c€
C, the subgraph induced by on-set ,5" is connected,in
the sensethat
from a singlearchetype
ac€
Sc there is adirected path,
within
5.,
to
every other s€
.t
.The phyloDAG model allows the possibility
that
a species may be derived from several ancestors ratherthan from
a single ancestor. For computer virusesthis
isappropriate, since a virus author could appropriate code from a variety ofsources.
It
i One of us (Sorkin)
is a member of the computer anti-virus group at IBM Research. The
group produces a commercial software product, IBM ANrrVm.us, and conducts related theoretical
research. Phylogenetic information may be useful for tracking global virus trends, while the byte
is also a plausible model for evolution of bacteria populations which have inherited genes
via
infection
by
bacteriophages[B]:the
genes evolveonly
once, and can betransferred
from
hostto
host by these viruses.A
phyloDAG
existsfor
anyinputs
(S,C): for
any
chronology ascribedto
the species(i.e. any
total
orderingof
the
speciesset),
the
directed graphwith
edgesfrom each species
to
al1later
species is a phyloDAG. However, every pair of speciesis
related
by an
edgein
this
graph.
Sincemost
virus
species presumably have few ancestols, we seek aMinimum
PhyloDAG,
onewith
a minimum number ofdirected edges.
We assume
that
the
input
is
givenin
thespecies s
€
,5, we are given a sortedlist
of theDefinition:
The
input
length
(.=
(.(S,C)=
number of characters is A=
lCl.following
compactformat: for
eachcharacters c for which c(s)
:
1.Dcec 15.1. The size
is
n:
lSl.
TheOur
approachto the
evolution problem correspondsto a
restricted model of evolution: onein
which we are not allowed to introduce hypothetical species outsideof
theilput
set.
This
modelis
well-suitedto
computer vituses, where because ofgood world-wide communications, sharing of data between anti-virus organizations,
and
the brief
history
involved, there arelikely
to
be very few
gapsin
our
viraldatabase
-
a
situation quite
differentfrom that
ir
biology.
Previouswork
onrestricted. models
of
evolutionwill
be discussedin
Section1.2.
Forour
model,if
additional
species could be introducedinto
a phyloDAG, there would always be atrivial
sparse phyloDAG: a star graphwith
the center an added species s such that c(s):1forall
ce
C.If
a phyloDAG's vertices are labeledwith
the vaLuesof
one character, thepos-tulate
that
no characteris
"invented" twice correspondsto
an assertionthat
theleis
at
most one directed edge labeled 0--+1. Thusthe
sequence of labels along anysource-to-leaf
path
is
describedby
the
regular expression 0+1*0*,that is'
zero or more 0's, followed by zeroor
more 1-'s, andfinally
zero or more 0's again.1.1
Paper
organization and
results
We
will
showin
Section 3.3that the
Minimum PhyloDAG
problemis
NP-hard'Furthermore,
if
NTIME(nPolvlosn) I
DTIME(nPolvlosn)(a
plausiblecomplexitv-theoretic assumption related
toP
I
NP),it
cannot be approximately solved withina
factor
lessthan
(1/16)log2(.in
polynomialtime.
In
fact,
we knowof
no
wayto
approximateMinimum PhyloDAG
within
any logarithmic
factor;
Section 3.3shows
that
variousnatural
greedy strategies(including
randomized ones) do not even approximatewithin
a factor of cn.Because
of
the
difEcultyof
the
phyloDAG problem, we considertwo
variants.In
thefirst
variant, we requirethat
each specieshavejust
one ancestot, sothat
thephyloDAG is an arborescence (a tree
with
edges directed away from aroot).
In
Sec-tion
2 we d.efine a0-1-0
phylogeny
to
be an arborescent phyloDAG's underlying und.irectedtree.
Species .9 and charactersC
maybe
consistentwith
zero' one' ormultiple
0-1
0 phylogenies. We give two polynomial-time algorithmsto
randomly sample0-1-0
phylogeniesif
any exist.The
first atomic-set
algorithm
(Section2.1)
computesa
concisedata
select
a
phylogenyuniformly
at
randomin
time
O(n(.).
Whenno
solution existsthe algorithm
returns awitness set:
a conciseindication
of
why there can be nophylogenetic tree.
The second
minimum spanning tree
algorithm (Section 2.2) characterizes a0-1
0 phylogenyof
theinput
species set as a minimum spanning tree(MST)
of a particular undirected edge-weighted graph. Using known algorithms for constructingand randomly sampling spanning trees,
a
0-1
0 phylogeny can be constructed i1deterministic
time
O((n
I
n2\ogn)or (with
highprobability)
in
rand.omized timeO((.n), and sampled uniformly at random in time
O(ln-t M(")),
whereM(n)is
thetime needed
to muitiply
two nx
n matrices. The NIST algorithm does not produce a concise witness when there is no0-1
0 phylogeny.The second variant of phyloDAG
is
simplyits
undirected analogue.A
phylo-graph
for species5
and characters C is an undirected graphwith
vertex set5,
withthe property
that
the
subgraph inducedby the
on-setof
each characterc €
C is connected.The
Minimum
Phylograph
problemis
find
a
phylographwith
theminimum number of edges. Theorem 3.1 shows
that
it
is hardto
approximateMin-imum
Phylographwithin
a factor less than(1/16)Iog2l,
while Theorem 3.4 shorvsthat
approximatingit
within
afactor of.lt(.
is easy.The model of computation used
in
this paper is the uniform-cost rand.om-accessmachine.
L.2
Related work
Previous
work
in
phylogeny has focused on constructing phyiogenetic frees. Horv, ever,the
problemof
modelingvirus
evolutionis
more suitedto
phylographs and phyloDAGs,in
which undirected cycles may arise. -A.sfar
as we know, ours is thefirst
phylogenetic workthat
allows cycles.Tliere
is
substantialliterature
on character-based phylogenies where each sub-graph inducedby
all
species sharing a statefor
a characteris
requiredto
becon-nected. This
problemis
calledthe
perfect
phylogeny
problem, and
is
Np-complete
for the
"unrestricted"
case (whereputative
species may be added) withgeneral characters [3, 18]. For the unrestricted case
n'ith
binary characters Gusfield gives an elegantO(nk)
algorithm [12], and for the restricted casewith
generaichar-acters Goldberg
et
aJ. [11] give an algorithm analogousto
the MST
algorithm ofSection 2.2.
Our
0-1
0 phylogeny problemis
similarto
a restricted version of thegeneral
character
compatibility
problem of Benhamet al.
[2]. There a character c mapseach species
s
to
a
subsetc(s) C
{0,1,2}
ratherthan
to
a single value; lhe leartesof the tree are the species
5;
for
each c and s, a single value from c(s) is chosen asa
label;
andthe
goalis
to find
a rooted perfect phylogenyin
whichthe
sequenceof
labels along any root-to-leafpath is
of
theform
0
--+I
---+2.
The
problem is NP-hard [21.1.3
Choosing
characters
For the
bulk of this
paperthe
set of species andthe
characters are simply inputs,but
at
an earlier stage theytoo
must be determined algorithmically.In
oneematical model of the problem, we are given a collection of byte strings
represent-ing
computer viruses (there might be 5,000 such strings, eachtypically
2,000 byteslong) and a similar collection representing legitimate computer code (perhaps 40,000 strings, each
typically
20,000 bytes long); we wishto
find all
strings of 20 bytes or morethat
occurin
at
least 2 virusesbut
in
no legitimate programs.Of
course thenumerical parameters
might
be varied,or
anothercriterion might
be substituted, e.g. the string's appearancein
at least 10 times as manyviral
programs as legitimate programs.The problems can be solved
in
linear space andtime
by a straightforwardappii-cation of suffi.x trees [7].
All
viral
and legitimate strings are concatenated together, separatedby
a special character, and a suffi.x treeis
constructed. The leaves of the suffix tree represent all sufixes of theinput
string, and the internal nodes-
viewedas paths
from
root
part-wayto
leaf-
denote prefixesof
sufrxes, whichis
to
savsubstrings
of
the
input string.
Depth-first search can be usedto
propagate, fromleaves
to root,
the number of times each substring appears, andin
fact the numbelof
timesit
appearsin
viruses and (separately)in
legitimate
strings;this
providesthe solution
to
the stated problem.z2
Computing a
0-1-0
phylogeny
The
casein
which
each species hasonly
one ancestoris of
specialinterest,
and correspondsto
casesin
whichthe
phyloDAGis
anarborescence
a treewith
alledges directed away from some
root.
There is a straightforward n:1 correspondence between arborescences and undirected trees: the undirected graph underlying anar-borescence is a tree; and each of the n possible rootings of a tree is an arboresceuce.3
Therefore we concentrate on undirected
0
1-0 phylogenies:Definition:
An
(undirected)o-1-0
phylogeny,
orphylogenetic
tree,
is a tree?
on species
5
with
characters C suchthat
each on-set5"
induces a sub-tree of7'
If
f
is a phyloDAG whose underlying graph is a tree?,
then7
is a0-1
0 phy-logeny as defined above: as each on-set5.
was connectedin
7,
it
is connectedin
?.
Also,if
?
is
a 0-1-0
phylogeny, any arborescence based on7
is
a phyloDAG: the archetypeof
any
character cis
the
speciesin
5.
closestto
the
root. In
this
sec-tion,
wewill
show howto
generate0-1-0
phylogenies, and howto
generate themuniformly at
random. Given a uniformly random phylogenetic tree, choosing a rootuniformly at
random generates a uniformly random arborescent phyloDAG.Because an arborescence can be rooted anywhere, a
0-1-0
phylogeny alone doesnot
determine an evolutionary chronology,but
it
can be usefulin
combination with externalinformation.
For exampleif
thefirst
species'identity
is known, the rest ofthe evolutionary history follows.
2For the
sake of accuracy, we note that this is not precisely the approach employed in practice
at IBM; we have simplified here for the sake of exposition'
tTh"r"
exist phyloDAGs whose underlying graphs are trees but which are not atborescences.An example, for species with characters (o), (ab), and (b), is (a)
-
(ab)*
(b). But since such2.r
The
atomic-set
algorithm
for
computing
0-1-0
phylogenies
As describedin
theIntroduction,
our atomic-set algorithm produces a datastruc-tute,
an AS-tree, which concisely represents all0
1
0 phylogenies for species5
andcharacters C, and can be used to generate an
arbitrary
solution or a solution chosenuniformly
at random.Generalizing
the
definitionof
the
on-setof a
character, definethe
on-setof
acollection of chara,cters to be the species having
allthose
characters:Sc:
fl"ecS..
Definition:
LetC
C C be a maximal(not
necessarilymarirnum)
set of chara-ctersfor
whichlsel >
2.
ThenA:
Se is anatomic
set
with defining
characters
C.Lemma
2.1
For
any atom'ic setA
and characterc,
either S.)
A
(cis
a def,ning cLmracter),orlS.nAl:
L (c is anon-defining
characterowned
by the sole speciess
€ l^t
aAl),
or
S.tr
A=
Aft
is
anavoiding
character).Proof.
Theonly
logical possibiiity missingis
that
15"n
al
)
2but
s.n
A
+
A,which would contradict the ma-ximality of ,4.'s set of defining characters.
r
An
atomic set can be constructedin
timeo(kn):
startwith
c
=
0 (sosc = s),
sweep through
all
characters c€
Cin
turn,
reject cif l.ti
n.g.l
<
1,but
otherwiseadd c
to
the
defining set, soe t:
Cu
{"}.
An
O((.)-time implementationof
thisalgorithm is described
in
the Appendix.Lemma 2.2
Supposeall
speciesin
3
are connected,i.e.
the bipartite graphjoin-ing
charactersto
speciesthat
haue themis
connected. Thenif
s1, s2€
S
haae no charactersin
cornmon, no phylogeny contains the edge (,tr,sz).Proof.
Suppose a phylogenetic tree7
contained (sr, sz), and delete("r,
"r)
to createa forest
7',
consisting of two trees. For any character c and any s, s'€
^9.,?
has a paths,.
. ., s'within
5..
The path does not include the edge(rr,
"r),
since not boths1 and s2 can be
in
5.,
so7/
containsthe
samepath.
Thusin
?/
thereis
a pathfrom
any species having character cto
anyother.
Given the connectedness of the species character graph, a seriesof
such pathsjoins
any speciesin
S
to
any other,contradicting the
fact that Tt
isnot
a connected graph.r
Lemma
2.3
If
A
is
an atomic set, thenin
any0-1-0
phylogeny,A's
inducedsub-graph
is
o, subtree.Proof.
In
a0-1-0
phylogenetic tree?,
the on-set of any characterc €
C inducesa
connected subgraph, thereforea
subtree.
A is
the
intersectionof
the
subtreescorresponding
to
A's defining chatacters, and the intersection of subtrees is itself asubtree.
I
Lemma 2.4
For
ang phylogenyT
and, atomic setA,
if
the subtreeTa
is
replaced,by anE other treeT'a on the set
A,
the resultant oaeralltreeTt
'is also a phylogerty.Proof.
For
any
characterc
and
species s,s/
€
.9.,
considerthe
(unique) paths,.
. .,s'
in
7.
If
.9"n
A
:
A,the path
never enters ,4, soit
is
unaffected (i.e. thehence no edges
within
A,
and is unaffected. Otherwise (by Lemma 2.1)S.I
A,
andif
thepath
through?
included any sub-paths throughTa
(in fact
there can be at most one), those sections could be replaced by sub-paths through{
(and thus stillwithin
5.).
So connectedness ofall
charactersin
?
implies the samefot
T',
and?'
is
a phylogeny.r
Lemma 2.5
For
any phylogenyT and atomic setA, if A
iscollapsed
-
replaced bya
single spec'ies"a"
hauingall
defining and non-defi'ning charactersof
A
(butnot
its auoiding characters), and the subtreeT4 is contracted, to the single species a, then the resultant ouerall treeTt
is
a phylogengfor
5'
:
(5 \
A)
U{o}.
Proof.
Same as previous.I
Lemma 2.6
If
(S,C)
hasan
atomic setA,
with
species s1,52 €.A
own'ing non' defining characters c1,c2 respectiuely, andif
5., n
5.,
*
A'
then thereis no
0
10 phylogeny
for
E.Proof.
Suppose there is a phylogeny?
for5.
Root7
at
any s3€
,56' ffS.r,
and lets,
be the lowest common ancestor of s1 and s2. Then thepath (all
pathsin
a tree are unique) from s1 to s2 passes throughs"; thepath
from s3 to s1 passes through s" (since s" is an ancestor of s1); and the path from s3 to s2 passes throughs'
(since .s,'is an ancestor of
sz). By
Lemma 2.3, Ainduces a subtree, so 51,s2€
A implies thatthe
s1-s2path is
containedin,4,
andin
particular
s"
€,4"
Similarly sr,s3
€
56' impliess, €
S"r,
and s2, s3e
56, impliess,
€
S.r.
Therefores"
€
AOSrrnS.r-Brrt
c1 and c2 are nondefining characterswith
distinct
owneIS, soAn5.,
nS"2-
A.a contradiction.
I
If
the
hypothesesof
Lemma 2.6 are sa,tisfied,we
saythat the
atomic
set A. characters cr,c2, and species s1, s2 provide awitness
attestingto
the non-existenceof any
0
1-0 phylogenetic tree.Lemma
2.7
Let
A
bean
atomic set, and, supposetltat no
srrs2,c1,c2 satisfy the conditionsof
Lernma2.6. As
before, "collaqtse"A
to
the single speciesa
hauing alldefi,ning and non-defi,ning characters of
A.
If
5'
:
(5
\
A)
U{a}
has a phylogeny,so does E.
Proof.
LetT'
be a phylogenyfor
5'.
Deiete rz andits
incident edges, and replacethem
with
the
setA
and any tree onA.
Additionally,
replace each edge (s, a) witir a single edge as follows.By
Lemma 2.2,.s and rr must share some character(s), which (since c, hasthen)
must be defining or nondefining characters ofA.
If. s and a share any non-defi'ning characters, those characters must have a single owners'
(or elseA,
these characters, andtheir
owners ale a negative witness),in
which case add the edge (","')'
Other-wise, s and a only share defining characters of
A,
in
which case add any edge (s, s')with
s/€
A.Replacement of each edge
(s,o)
with
an edge(","'),
s' € A,
meansthat
the tree components createdby
a's deletion areall
connectedto
the
treeon
A,
creating asingle tree
?.
Using arguments similarto
thosein
Lemma 2.4, all. characters indrrceIn
fact, the constructive nature of the proof of Lemma 2.7 immediately suggests theatomic-set
algorithm.
Starting from 56::
5,
repeatedly,find
an atomic setA;
and checkfor
a witness as per Lemma 2.6.If
one is found, terminate negatively.otherwise, collapse
A;
lo
a single new speciesa;,
and, re-define the species set to beS;
::
(5;-1
\
-A;) U{o;}.
Since each atomic set containsat
leasttwo
species, this reduces the number of species, and needs to be performed at most n-
1 times.We construct the AS-tree during
this
contraction phase. The leaves of theAS-tree
arethe
speciesin S,
andall
elementsof
any setA;
haven;
astheir
parent.Equivalently, the fi.nal n;
is
theroot of
the AS-tree, and each o1 hasall
species inA,
as children. This tree concisely representsall
possible phylogenies.Itlow, starting at the root of the AS-tree, we expand any node n; whose
parelt
isalready expanded using the method suggested by the proof of Lemma 2.7: Replace 4;
with
A;
and form any tree T; onA;.
For each old edge(",o;), if
s has a nondefiningcharacter
c
of A;,
add edge(s,owner4,(c));
otherwises
must haveonly
defining characters,in
which case add any edge (",",),
s,€
A;.Theorem
2.8
The algorithm aboae produces a phylogenyfor
S,c if
one erists, and.otherwise prod,uce-s
a
negatiaewitness.
If
the algorithmis
implemented.to
choosetrees
T;
uniformly
at
random,and
to
chooses' €
A;
uniformly
at
randomfor
def'ning-character edges(s,st),
thenit
producesa
un'iformly random undirectefl0-1
0 phylogeny.Proof.
Thefirst
assertion follorvs directiy from the preceding sequence of lemmas.If
we detecta
negative witness, we correctly terminate negativelyby
Lemma 2.6coupled
witir
Lenma 2.5.
Otherwise,by
Lemmas2.b
and 2.7, we can collapse theatomic set, solve the problem
on the
new set,and
"expand"the
collapsed set toa
0-1-0 phylogeny. The choices madein
the expansion phase are independent and leadto
different phylogenies. The uniform generationof
phylogenies follows fromthis
one-to-olle correspondence between phylogenies, and choicesin
the
algorithm. ISince we can generate a random
0
1
0 phylogeny from the AS-tree,it
conciselyrepresents
all
possible0-1
0 phylogenies.The
atomic-set algorithm produces an AS-treein
time
o(nL): in
eachof
theO(n)
collapsing iterations, wefind
an atomic set, checkfor
a witness, and collapse tlre set, each such operationtaking
fime O((.). (See the Appendix.)The expansion can be completed
in
timeo(nL).
There arco(n)
expansions. To expand node o;, we can produce a random tree on the setA;
in
timeO(lA,l),
sincea labeled tree on
r
nodes can be selecteduniformly
at
randornin O(r) time.
(See,for
example,[16].)
If
we store pointersto
ownersof
non-defining characters whenconstructing the AS-tree, we can connect
this
treeto its
neighborsin
timeO(l).
2.2
The
Minimum
Spanning
Tbee
algorithm
In
this
section we give a second algorithmfor
computing0
1-0 phylogenies.It
isvery simple, and is based on the observation that 0-1-0 phylogenies for species
5
and characters C correspondto
minimum-weight spanning trees (MSTs)of
a particularundirected edge-weighted graph
G(E,C).
(This observation was also usedin
[11] toThe graph G(E,C)
is a
complete graphon
5,
with
edge weights ur(s1,s2):
k
_
l{"
€ Clc(s1)=
"(rz):1}l.
It
can be constructedin
O(/n)
time.Theorem
2.9
0-1-0 phylogeni,esfor
(3,c)
are spanning trees ofG(s,c)
with weightnk
- (.
Furth.ermore,G(3,C)
has no spanning trees of smaller weight.Proof.
Every spanning tree ofG(5)
has weight at leastnk-
(,
since the contribution ofeachcharacterctothetotalweightisatleast(n
-1)-(lS.l -1).
Spanningtreesof
G(5)
with
weight nk- I
correspondto
treesin
which each on-set .9. is connected (see [11]).r
Because
of this
correspondence, phylogenies can be constructed(or
randomly sampled) by established algorithmsfor
constructing (or randomly sampling) MSTs.Prim's
algorithm
[17, 10] constructsan MST
of G
in
O(mlogm)
time,
where mis
the
number of edgesin
G,
and* :
(;)
for
G
:
G(S,C).
If
a faster algorithmis
required, Karger,Klein
and Tarjan's randomized algorithm constructs an MST,with
high probability , in O(m)
time [13]. (Their model of computation is a unit-costrand.om-access machine
with
the restrictionthat
the only operations allowed on edgeweights are
binary
comparisons. See also the other algorithms discussedin
[t3].)
Givel
an unweighted n-vertex graph, analgorithm
of
Colbourn,Myrvold
andNeufeld [5] selects a spanning tree
uniformly
at
randomin O(M(n))
tirne.a
(HereM(")
:
O(r2'3'u) is the time needed tomultiply
twon
x
n matrices[6].)
Colbourn and Jerrum [4] notethat
the algorithm can be usedto
select an MST of a weightedgraph G
uniformly at
randomh
O(M(n))
time:
construct a random spanning tree on each connected component of the subgraph of G induced bv the edges ofminimun
r,veight,
put
the
spanning trees'edgesinto
thefiual
solution, conttact the spannilgtrees, and repeat.
Compared
with
the atomic-set algorithm, theMST
approach has the advantageof using an unusually widely understood and simpie paradigm, a benefit echoed in the availability and effi.ciency of computer programs. However,
it
doesnot
supply astructural
representation ofall
possible phylogenies,ltor
a concise witness when nophylogeny exists.
3
Phylographs and phyloDAGs
Having considered the problem
of
constructing phylogenetic ttees, we nowturn
to phylogeniesthat
arenot
trees.In
particulal,
we consider the phylograph andphy-IoDAG problems that were defined
in
theIntroduction. In
Section 3.1 we prove thatit
is
hardto
approximate theoptimal
phylographwithin
better
than a logarithmic factor, andin
Section 3.2that
the natural greedy algorithm gives an approximationwithin
such afactor.
In
Section 3.3 we show boththat
it
is hardto
approximate theoptimal
phyloDAGwithin
better than
alogarithmic factor,
andthat
in
this
casethe
natural
greedy algorithm can perform very badly, even on average'aAnother randomized algorithm, due to Wilson [19], has an expected running time equal to the
3.1
Hardness
of approximation
of phylograph
The following theorem states
that
Miniumum Phylograph is hard, and (unless n6n-deterministic quasi-polynomial time is equivalentto
deterministic quasi-polynomialtime)
is hardto
approximate:Theorem
3.1
Giuen speciesS
and charactersC,
it
is
NP-cornpleteto
compute aphylograph
G
:
(E,E)
suchthat
lEl
is
min'imized. Also,
for
any
constantc
11f 16, unless NTIME(nnolvlosn;
:
DTIME(rrolrios';,
thereis
no polynomial-timealgorithm
to
compute a phylograph suchthatlEl
is within
afactor
ofclogr(
of the,ntinimum possible
ualue.
'We
will
use the following notationin
the proof of Theorem 3.1.Definition:
The
neighborhood of
a
vertexo
of
a
graphG
:
(v,E)
is the
setl/(u):
{o}
u
{tu :(u,w)
e
E}.A
dominating
set
of G is a set of verticesD
cv
whose neigirborhoods cover the graph:
U1ro
U1a;: y.
Lemma
3.2
computing aMinimum Dominating
set
D
of
a gra?thG
is
Np-cornplete.Proof.
Garey and Johnson [Oj indicate the proof as a simple reduction from Vertex Cover; noprimary
source is cited.r
The following
theorem,from
[15], showsthat
it
is
alsohard
to
approrinzateMinimum Dominating
Setto
within
better
than a
loqarithmic
factor: a
similarresult appears
in
[1].Theorem 3.3
(Lund
and Yannakakis)
Let c be: u, constantin
the range 0I
c 1114.
[JnlessIITIME(zrolvlogn)
:
DTIME(rrpolylo8"),
thereis
no porynornial-tznrcalgorithm
that
takes asinput
a graphG
and outputs a dominating setD
ofG
such thatlDl
is
wi.thin afactor
of clog2lvl
,f
theminimum
poss'ible ualue.We are now ready
to
prove Theorem 3.1.Proof.
We use an approximation-preserving reduction from Minimum Dominating Setto
Minimum
Phylograph. Given aninput
G:
(V,86) with
lVl=
u,
construct an instanceP
to
Minimum
Phylograph as follows:The
species setis
S
:
u
Ux
where
X
is a set of z3"auxiliary
vertices". For each pair of vertices{q,az}
€V(2),
define a character
with
on-set{at,uz}.
Thus any phylographfor
P
contains each edgein
the complete graph on7.
In addition, for each pair of vertices (a,r)
e
v
xx
we def.ne a character
with
on-set{r}
U .nf (o).If
Po:
(5, Eo) is an optimal phylograph forP
and D6 is a minimum dominating setfor
G,then
ltol
:
(? +
lxl
lDsl . To see this, observethat
the complete graph onV
addedtoV
xDe
is a phylograph forP,
solEsl
<
G)+lxl
llol.
On the other hand, every phylograph forP
has at least(i)
edges connecting speciesin
I/
and has at least lD6l edges adjacentto
eachr
€X.
Suppose we had an
algorithm
A that
could producea
phylograph (S,Ea)
forP with
lE.ql<
clogr((.) l-Usl edges.By the
constructionof
P,
some vertexr
€
X
is
connectedto
a dominating setD
for
Gwith
lrl
<
lnnlllXl
< clogr(l)lnolllXl
edges. Since
lEsl
:
(:;)+
lXl
llol,
we havelDl
SThus (since
lxl
:
v3 andlrol
>
1),lrl
< c(1+o(1))Iog2(')lDsl.
Now note that/
:
u(u-
L)+
ulxl
+
2lEGllXl
:
O(r5).
Thus,lDl <
5c(1+
o( 1)) 1og2(z) lD6l, which is contrary to Theorem 3.3if
c<
ll20,unless NTIME(7?polrlosn;= DTIME(rrolvlos';.
UsinglXl
:
ru2*' instead of v3 gives the constant 1/16.The
same reduction showsthat
an exactly minimum phylograph would containa minimum
dominating set, sothe
NP-completenessof
Minimum
Dominating Set impliesthat
of
Minimum Phylograph.r
3.2
Greedy
algorithm
for
phylograph
There
is a natural
greedyalgorithm
for the Minimum
Phylographproblem.'In
a
phylograph, every character's induced subgraph consistsof a
single connectedcomponent, so
the
greedyalgorithm "grows"
a
solution by iteratively
adding anedge
that
maximally reduces the number of connected components.The
samenotation
neededto
definethe algorithm
more precisely can be usedin
the proof
of its quality.
Given species5
and charactersC,
anda
setof
edgesP
c
5(z)
definethe
"cost" ofE
to
bef(E)=
f
.o-ponents(5.)
-lcl,
c€C
where components(5.) denotes the number of connected components in the subgraph
of
(5,-E)
induced by the on-set ofc.
Thusf(0) =
Lcecls.l
-
lcl
=
{'-
lcl'
andif
-E
is
a phylograph, f(E):
Icec
1-
lcl
:0.
For any edge set -B and any edge
e,let A6(e)
=
f
(E)-
f
(EU{ei)
be the amountby
which e decrea,ses the cost/.
The greedy algorithm begins rvith each species an isolated vertex, anditeratively
adds the edge rvhich rnaximally decreasesthe
cost,until
the costis 0.
(See the Appendixfor
pseudocode).Theorem
3.4
Suppose thatfor
speciesS
and charactersC,
of totalinput
length('
the
minimum
phylograph{e(7),...,"(")}
has cardinalityr.
Thenthe
greedyalgo-rithm
produces a phylographEc
of size lE6;l<
rln(/
-
lCl)
Proof. If
we have anypartial
solution, addingin all
r
edges of a minimumphylo-graph
will
certainly yield a phylograph. Sincer
mole edges are enoughto
complete thejob,
some edge (one of these, even) must take care of at least1/rth
of the cost.If
theinitial
cost was/(0),
and the greedy algorithm reducesit
by
1-
11,at
ea,c.hstep,
after
rln
f0)
steps the cost must be reduced below 1, and the algorithm must have terminated.5More formally,
for
any edge setE(0)
define a series of setsE(0)
q
..-
C
E(r),
where
E(i)
=
E(0)
u
{"(1),...,"(r)}
and the edgese(i)
are those of the minimumphylograph.
Notethat
E(r)
is a
phylograph, sinceit
containsthe
minimumphy-lograph.
Because components(with
respectto
any
character)only
become more connected asi
increases,for
any €,if
i
<
j
then A61;y(e)2
Artil(e).
Thusfor
anysThe same approach will not work for phyloDAGs. Since directed cycles are forbidden, chosen
edges constrain the addition offuture ones, and even ifthere was a solution ofsize r initially, there
may not be once some edges have been chosen sub-optimally.
starting set Zs,
r.
maxe€.S(z)
Ar(o)(")
=
U,freto-
1))-
f
(EID)
x= |
:
/(E(0))
-f@(,))
:
f(E(0)).
Comparing
the
first
andlast
quantities, we concludethat
there always exists anedge e
for
whichAE(ot(r)
>
f
(Eo)lr.
Therefore
the
greedyalgorithm
reducesthe
costby
a
factor
7*
7lr
at
eaclr step. Since theinitiai
cost is(
-
lcl,
the cost after rIn((.-
lCl)
steps of the greedy algorithm is at most(7-rlr)'r"(Llcl)U-
lcl)
<
1. The greedy algorithm therefore terminateswithin
rIn((.-
lCl)
steps, producing a phylograph of the same size.r
This complements the result of Theorem 3.1: Minimum Phylograph is apparently
hard
to
approximateto
better than
a factorof
(1116)log2l,
but
is
certainly easyto
approximateto
afactor of
ln((.-
lCl)
( ln/.
Sothe
thresholdfor
cat
which(chrl)-approximability
becomes feasible lies betrveen(1/t6)logr(e)
=
0.09, a1cl 1;closer bounds would be of some interest.
3.3
PhyloDAGs
We begin by observing
that
a phyioDAG cannot ahvays be obtained by directing theedges of a phylograph. Consider four species
rvith
s1 defined by characters (ir,c,d),s2
by (a,c,d),
s3by (o,6,d), and
saby
(a,D,c).
The
cycle s1,s2,s3,s4,s1is a
4_edge phylograph,
but
thereis no
wayto
directthe
edgesof the
cycleto
obtain
a phyloDAG: any acyclic orientationwill
create two archetypesfor
some character's on-set.we
now prove the following theorem, which is analogousto
Theorem 3.1.Theorem
3.5
Giuen spec'iesS
and charactersC,
it
is
Np-cctmpleteto
compute a phyloDAGG
:
(5,E)
suchthat
lEl
is
minimized.
Also,for
any
constantc
I
7f 16, unless NTIME(npolvlogn;:
DTIME(rrnolrlosn;, thereis
no polynomial-time algorzthmto
compute a plryloDAG such thatlEl
is
w'ithin afactor
of
clog,I
of the nz'in'imum po s s,ible u alue.Proof.
The
proof usesthe
same reduction asthe
proofof
Theorem3.1. Let
-86be the edge set
in
anoptimal
phyloDAGfor
P.
We must showthat
lEol
=
(;)
+ l)(l
llol
. The directionthat
differs from the proof of Theorem 3.1 is showingihat
given
adominating
set D6, we can constructaphyloDAG
of size(?
+lxt
lD6l.
To do so,f.rst
construct a phylograph (asin
the proofof
Theorem 3-.1). Then directedges having
both
end-pointsin
I/
accordingto
atotal
order on the verticesin
V,
and direct all remaining edges from vertices in
V
toward vertices inX.
The resultinedigraph has no directed cycles and each character has a unique archetype. Therefore,
it
is a phyloDAG. The rest of the proof is identicalto
that
of Theorem 3.1.r
As
alreadynoted,
the
natural
greedyalgorithm
doesnot
work well
for
phr'-IoDAGs: the phyloDAG problem seems
to
be moredifficult
because the prohibitionof cycles means
that
it
is possible for the greedy algorithm to add a "bad" edge which prevents other "good" edges from being addedlater. In
the remainder of this section, we give an example of a species setfor
which various natural greedy approaches forconstructing a phyloDAG lead
to
an f)(n )ratio
between the size (number of edges)of
the
constructedphyloDAG
andthe
sizeof the optimal
phyloDAG.A
randour-ized strategy has an
Q(n)
expectedratio
and hasa ratio of
A@llogn)
rvith
highprobability.
We construct
a
species set asfollows.
There aren
speci€s s1,...rsn,
and two distinguished speciess'
ands".
Now we addo
2n
chatacters shared by s/ and s//;o
2 characters shared by s/ and s;,for
i
:
1,...,n
r
1 character sharedby
r",
s;
ands;, for
7<
i,j I
n,i
I j.
Duplicating
characters forcesthe
orderin
which
a
greedyalgorithm
connectsspecies. We hide
this
duplication from an algorithmthat
checksfor
it
by
adding aset 54of dummy species, where lSal
=
[ogr(an).l.
There a1s2lr"sz(nn)l>
4r, distinct subsets of ^9a. We add one such subset to each of the 4n nonunique characters.6 Anoptimal
solution hasO(n)
edges, consisting of an edgefrom s'
to
s",
edgesfrom
-.'and
s"
to
each of thes;,
and edges froms' to
each speciesin
5a.A
phyloDAG has exactly one archetypefor
each character.A
greedy algorithmbegins
with
each species an isolated node, thus an archetypefor
each characterit
contains.
A
natural
edgeto
addin
a greedy fashionis
onethat
maximally reducesthe
numberof
archetypes (overall
characters).Of
course, we maynot
introduce directed cycles.There may be times where we can choose the direction of
the
edgeto
beintro-duced (for example
at
thefirst iteration)
and we showthat
the algorithm performsbadly
for
any of the following strategies:r
The direction is chosen arbitrarily.o
The directionis
chosenuniformly at random.
(The expected performance ofthe algorithm
is
badfor
this
example, andthe
example can be modified sothat
the bad performance occurswith
high probability.)o
The edgeis
directedout from
the nodewith
the larger number of characters.(This
anatural
way of breaking ties, since we expect ancestral nodesto
have manv characters.)6An algorithm may also check for dornination, where sa contains a subset of the characters
contained by s. We can remove the dominated species sa from the instance and later direct an edge
from s to sa in the phylogeny for the reduced set. To avoid this situation here, we add a character
lsa(i),sa(i{
1)} fori:1,...,14"1,
which chains the dummies together. This does not change theasymptotic size of the optimal solution.
A
greedy algorithm startsby putting
an edge between s/ and s//, and an edgebetween
s'(or
possiblys")
and each speciesin
^94. Thenit
adds edges between,s/ and thes;.
If
directions are chosenarbitrarily
we may assumethat
these edges are from s'to
s//, and from each of the s; tos/.
Henceit
is now impossibleto
add edgesftom
s" to
any ofthe si,
since they would create directed cycles.This
means thatin
orderto
prevent there being two archetypesfor
a character sharecl by s//, s; and -sj, species s, must be connectedto
s,
by an edge. This resultsi"
(l)edges.
Now consider the variant where the direction of an edge
is
chosen uniformly at random wheneverit
is
equally goodto
directit
either way.With
high probabiLity(i.e.
with
complementprobability
that
is
exponentially smallin
n),
therewill
beat
least nf4
edges directedfrom
the s;'s
to
s/.
If
the
edge betweens,
and s,, isdirected
the
wrongway (i.e.
from s/ to
s//)then
theses,
nodeswill
haveto
beconnected
in
a clique, resultingin
a quadratic number of edges.If
we now consider a species set consisting of a log(n) copies of the species set as described (for a positive constanta),
rve seethat
theoptimal
solution has O(nlogn)
nodes and edges, an<iwitlr
probability
at
least 1-
D-o, at
least oneof
those copieswill
have the eclqebetween
s'
and s// directed the wrong way, resuitingin
O(n2) edges.If
edges are directed away from nodeswith
higher numbers of characters,thel
the algorithm can be forcedto
takethe
"wrong" directionfor
the edges by addiugdummy characters
at
the nodes from which we want the edgesto
be directed.References
[1]
N'{. Bellare, S. Goldwasser, C. Lund, andA.
Russell. Efficient probabiiisticallv cireckable proofs and applications to approximation.In
Proceedings of the 25thAnnu,al
ACM
Symposium on the Tlteoryof
Computing, pages 294J04,Igg3.
12)
c.
Benham, s. Kannan, M. Paterson, andr.
\&arnow. Hen's teeth and whale'sfeet:
Generalized characters andtheir compatibility.
Journal of Mathemati,calBiology. To appear.
[3] H.
Bodlaender,M.
Fellows, andr.
warnow. Two
strikes against perfect phy-logeny.In
Proceed,ings of the 19th Internat'ional Colloquiurn on Automata, Lan-guages, and Programming, Lecture Notesin
Computer Science, pages 273 283. Springer Verlag, 1992.[4]
C. Colbourn andM.
Jerrum, 1995. Personal communication.[5]
c.
colbourn,
w.
Myrvold,
and
E. Neufeld. Two
algorithmsfor
unranking arborescences. Journal of Algorithms. To appear.[6]
D.
Coppersmith and S.Winograd.
Matrix
multiplication
via
arithmetic pro-gressions. Journal of Symbolic Computation, g:257 280, 1990.[7]
M.
Crochemore andW.
Rytter.
Tert
Algori.thrns.Oxford University
Press, 1994.[8]
R. Dulbecco andH.
Ginsberg. virology. J.B.Lippincott,
2ndedition,
1988.[9] M.
R.
Garey andD.
S. Johnson. Computers andIntractabilitg.
Freernan, SanFrancisco,
CA,
1979.[10]
A.
Gibbons.Algorithmic
Graph Theory. Cambridge University Press, 1985.[11] L. Goidberg, P. Goldberg,
c.
Phillips, E. sweedyk, and T.warnow.
computingtlre
phylogenetic numberto
find
good evolutionarytrees.
Discrete Appl'i'edMathernatics. 1996.
[12]
D. Gusfield.
Efrcient
algorithmsfor
inferring
evolutionary tr.ees. Networks,2I:12-28,7997.
[13]
D. Karger, P. I{lein, and R.Tarjan.
A randomized linear-time algorithm to find minimum spanning trees. Journal of the Associati,onfor Computing Machinery,42(2), 1995.
[14]
J.
Kephart
andW.
Arnold.
Automatic extraction
of
computervirus
signa-tures.
In
R.
Ford,editor,
Proceedingsof
thelth
VirusBulletin
Internation'o,1,Conference, pages
I79-I94. Virus Bulletin
Ltd;
1994.[15]
C.Lund
andl\{.
Yannakakis. On the hardnessof
approximating minimization problems.In
Proceedings of the 27th AnnualACM
Symposiunt onthe
Theoryof
Cornpu,ting, pages 286-293, 1993.[16]
A.
Nijenhuis andH. WIII.
Cctmbinatorial Algorithmsfor
Cornpu,ter.s and Cal-culators. Academic Ptess, 2ndedition,
1978.[17]
R.
Prim.
Shortest connection networks and some generalizatio:ns. BeLl Systertr'Technical J ournal. 36 : 1389-140 1, 1957.
[18]
M.
Steel. The
complexityof
reconstructing treesfrom
qualitative charactersand subtrees. Journal
of
Classifi,cat'ion,9:91 116, 1992.[19]
D. Wilson.
Generating random spanning trees morequickly
than
the
covertime.
Submittedfor
publication, 1995.Acknowledgements:
Wethank Phil
MacKenzie,Tom
Martin,
Madhu
Sudal, and David Wilsonfor
useful discussions.4
Appendix
This
appendix contains additional proofs and details.4.1
The
greedy
phylograph algorithm
Let
i
::
0 andE6l0)
::
AWhile
f@cQ))
)
0 dobegin
LeIi::il7
Let
e be an edgemaximizing
Ln"(;-rl(r)
Let
E6(i)
::
Ec(.i- l)u
{e}
endReturn the set
Ec
:
Ec(i)
4.2
An
O(t)-time algorithm
to
compute an atomic
set
\4/e presume
that
each speciesis
describedby a
sortedlist
of the
charactelsit
possesses. From
this,
constructa
descriptionof
each charactcr, as a sortedlist
ofthe
species possessingit.
This
can be donein
linear tirne:
loop through specics i;loop through cha,ractcrs J on
i;
add specics zlto
character 7's list.Now the basic algorithm is:
Let
As::
S
(Oth atomic set contains al1 species)Let
D :=
0 (set of defining characters isinitially
empty)Loop through characters
f,
and consider thelist
5;
of species having character i:(1) size
:=
lA,_1 O 5;l(2)
If
size(
2 thenA;
::
A;-1(3)
If
size)
2 thenA; := A;-1O
5;,
andD
::
D
U{i}
(4) nexti
We now shor,v how to compute the intersection size (step 1) and the intersection
itself (step 3) in linear
time.
This implemention gives a iinear-time algorithm overall.Let all the sets A be represented doubly, as both a sorted linked
list,
and as a binaryarray of leugth A
(rvith
1'sfor
species presentin,,1,0's
for species absent fromJ).
Cornputing the size of A O ,5; can be donein time
l5;l:
Over species s€
S;, sum up the binary arrav elernents ,,{[s]. Thus all iterations of step 1, together, take timt:of order
;-
1,9;1.Computing
A'
::
AL5;
can be donein
time lAl
f
1.9;l: The ordered Listfor
A/is
constructedbl'
steppingthrough the
orderedlists for
A
and ,5;in
synchrony, advancingin
thelist
rvith the smaller current value, and augmenting the Listfor
A/ when thelists for
A
and ,5; havethe
same current value. Thebinary
arrayfor
A,is formed by modifying
that
ofA,
which is no longer neededfor
any other purpose;the
list for.,{
is usedto
setall
1'sin
the array backto
0, and then thelist
for
A/ isused
to
set 1's appropriately.Thus over values
i
where step 3 is executed, thetotal
time consumed is of orderl2
L'
+\-tqt
| lJ t"tlz
since the execution of step 3 implies