Object-oriented Bayesian networks for complex forensic DNA profiling problems ∗

A. P. Dawid

University College London

J. Mortera

P. Vicard

Università Roma Tre

September 6, 2005

Abstract

We describe a flexible computational toolkit, based on object-oriented Bayesian networks (OOBNs), that can be used to model and solve a wide variety of complex problems of relationship testing using DNA profiles. In particular this can account for such complicating features as missing individuals, mutation, and null alleles. We show by example how to build a high-level representation of a disputed pedigree problem, and how to incorporate lower-level network models of the relevant complications. We illustrate the use of this toolkit with several examples, including disputed paternity with missing or additional measurements, and criminal identification. Using this technology, we investigate the effects on likelihood ratios of introducing mutation and/or null alleles, and show that these effects can be very substantial even when the underlying perturbations are very small.

Some key words and phrases: Bayesian network, DNA profile, missed allele, mutation, null allele, object-oriented, paternity testing, silent allele.

1 Introduction

DNA parentage testing and forensic identification are currently conducted using DNA profiles, comprising several highly polymorphic short tandem repeat (STR) genetic markers, each having a repertory of alleles (“repeat numbers”) that can typically be represented as small integers. The European standard AMPFlSTR® SGM Plus™ system uses ten such STR loci, plus amelogenin. All are on different chromosomes and so segregate independently. Polymerase chain reaction amplification now allows a profile to be obtained from very small amounts of DNA, even a single cell. For an account of the relevant biotechnology see e.g. Buckleton et al. (2004).

The forensic impact of such DNA evidence is most appropriately captured by calculating the corresponding likelihood ratio for comparing a pair of competing hypotheses (Evett and Weir 1998; Morling et al. 2002). However, this can become extremely challenging, both logically and computationally, in the presence of additional complicating features such as missing data on some individuals, mixed trace evidence, mutation, null alleles, etc. For example, in a paternity case the true father may appear to be excluded, when in fact a mutation has taken place, or an allele has not been recorded.

We have previously shown (Dawid et al. 2002; Mortera 2003; Mortera et al. 2003; Dawid 2003) how such complex problems can be addressed by structuring and analysing them with the aid of the computational technology of Bayesian networks (BN), also called Probabilistic Expert Systems (PES) (Cowell et al. 1999). These have been implemented in general purpose software such as Hugin.¹

∗ Research report No. 256, Department of Statistical Science, University College London. Date: September 2005.


A recent extension of this BN technology is the object-oriented Bayesian network (OOBN). This allows hierarchical definition and construction of a BN, utilising simple modular building blocks. Additional complexity can easily be introduced by adding new modules or refining existing ones. Object-oriented Bayesian network architectures have been described by Laskey and Mahoney (1997); Koller and Pfeffer (1997); Bangsø and Wuillemin (2000).

In this paper we describe a construction set of basic OOBN modules for DNA identification, and show how these can be flexibly combined to handle a wide variety of complex problems. Our networks have been built using Hugin version 6.4.

One specific complicating feature that we address is mutation, which can lead to a child having an allele that appears to have no source in either parent. Another is the possibility that observation of an individual’s genotype can be incomplete on account of a “null allele”, i.e. one that is not detected by the measuring apparatus. We further distinguish between the cases where this property is non-inherited (when we term the allele “missed”) or inherited (which we term a “silent” allele). An allele can be missed simply on account of sporadic equipment failure. A silent allele, on the other hand, might be the result of a mutation in the primer binding region, causing DNA amplification failure (Clayton et al. 2004). In this case only one allele is amplified and read, and the individual appears, wrongly, to be homozygous. This feature can be passed, by Mendelian inheritance, to a child, which, consequently, may again wrongly appear homozygous. We can thus easily have false evidence of exclusion, leading us to conclude, wrongly, that the alleged father is not the true father.

We apply our networks to analyse a number of specific forensic cases. We find that properly accounting for a small probability of a silent allele can have a dramatic effect. In particular, in paternity testing where we can also observe the putative father’s brother, this additional information can substantially change the probability of paternity in the presence of silent alleles.

The paper is organized as follows. In §2 we describe a variety of problems of civil and criminal forensic identification, and represent them as high-level disputed pedigree networks. Section 3 shows how DNA identification in such problems can be implemented by treating these as object-oriented Bayesian networks, having further internal structure that can be expressed by means of lower-level networks as described in §4. Modifications of the lower-level networks to incorporate various complicating features, viz. mutation, silent alleles, missed alleles and combinations of these features, are described in §5, §6, §7 and §8, respectively. In §9 we examine some numerical examples to illustrate the effects of taking proper account of the various complications considered. Section 10 presents further examples, showing the sometimes dramatic effect on the paternity ratio of accounting for silent alleles etc. when measurements can be obtained from relatives; while §11 presents a case of criminal identification. Closing remarks are given in §12. Appendix A develops some algebraic formulae for the paternity ratio, allowing for silent alleles, in a simple paternity problem when we can also observe the genotype of the putative father’s brother.

2 Pedigrees

We give particular attention to problems of testing paternity, or other family relationships, using DNA profile data. We always start by constructing a single pedigree to represent the relationships, whether known, assumed, or uncertain, between relevant individuals.

2.1 Nuclear family

Figure 1 is a simple pedigree representation for a nuclear family consisting of father f, mother m, and one child c (colour-coded blue for male, pink for female). Both f and m are instances of type founder, having no parents represented in the pedigree, whereas c is an instance of type child, having both parents represented. Cases where, say, only the individual’s father is known or observed can be handled by adding the unknown mother as an additional founder.


Figure 1: Pedigree for nuclear family

2.2 Simple disputed paternity

In the simplest case of disputed paternity, we have an alleged family triplet formed by a disputed child c, its undisputed mother m, and the putative father pf. The hypothesis of interest, H0, is that the putative father is the true father tf of the child; the alternative hypothesis H1 is that the true father is some unobserved alternative father, af, treated as drawn at random from the population.

A pictorial representation of this disputed pedigree is shown in Figure 2 (unobserved individuals being shown in a lighter shade). Each of m, pf and af is a founder, while c is a child. To represent the disputed identity of the true father tf we describe him as a query individual, and include an explicit “hypothesis node” tf=pf? to indicate that we have a choice between pf and af.

Figure 2: Pedigree for simple disputed paternity

We may have DNA profiles from m, c, and pf, constituting evidence E. The impact of this evidence is carried by the likelihood ratio in favour of paternity:

LR = Pr(E | H0) / Pr(E | H1). (1)

If we make some standard assumptions (Mendelian segregation, independent markers, known population allele frequencies), this can be calculated by a simple and well-known algebraic formula (Essen-Möller 1938).

2.3 Missing individuals

In more complex cases, DNA profiles may be missing for one or more members of the basic family triplet, but further information may be available in terms of profiles from known relatives. Forensic geneticists have not generally been able to handle such incomplete paternity data rigorously because of the more complex logical and computational analysis required.

Figure 3 and Figure 4 relate to the two incomplete paternity cases described and analysed by Dawid et al. (2002). They are variations on Figures 3 and 5 of that paper, extended to incorporate explicitly all relevant individuals, whether observed or unobserved.

In Case 1, as displayed in Figure 3, we have DNA from a disputed child c1, but not from its mother m1 nor from the putative father pf. We do however have DNA from c2, an undisputed child of pf by a different, unobserved, mother m2, as well as from an undisputed full brother b of pf. The sibling relationship is made explicit by the incorporation of the (unobserved) grandfather gf and grandmother gm, parents of both pf and b. Nodes gf, gm, m1, m2 and af are all instances of founder; pf, b, c1 and c2 are instances of child; and tf is an instance of query.

Case 2, displayed in Figure 4, is very similar, except that we now have DNA from both m1 and m2, and from two full brothers, b1 and b2, of pf.

Figure 3: Pedigree for incomplete paternity case 1

Figure 4: Pedigree for incomplete paternity case 2

2.4 Criminal identification

Such genetic networks can also be used in certain criminal cases, as well as for identification of victims of disasters.

The problem represented by Figure 5 is based on a real case. A body has been found, burnt beyond recognition, but there is reason to believe it might be that of a missing criminal cr. DNA is available from body, from the wife of cr, and from two children, c1 and c2, of cr and wife. The hypothesis node now indicates that cr might be identical to body; otherwise he is treated as an unobserved man, cr (unobs).

Figure 5: Pedigree for criminal identification case

Figure 6 describes a British cause célèbre, the case of James Hanratty (H), who was found guilty of murder and rape and hanged in 1962. In 1998 it was decided to apply modern DNA profiling technology to certain items of evidence from the original trial, which had been retained by the police, and a profile, taken to be from the culprit c (either H, or some other person o), was found. In an attempt to prove Hanratty’s innocence, his mother m and full brother b offered themselves for DNA profiling. In principle this might have excluded Hanratty, but in fact did not do so: the associated likelihood ratio in favour of his having left the crime trace was about 440. In 2001 Hanratty’s body was exhumed, and it was found that his DNA did indeed provide a full match to the crime profile, yielding an updated likelihood ratio of about 2.5 million.

Figure 6: The case of James Hanratty

3 Object-oriented networks for DNA identification

So far we have merely described the type of problem we wish to address. In order to assess the impact of the evidence in any but the simplest of such problems we shall generally have to make use of sophisticated computational tools. Our approach is based on building Bayesian networks to represent the assumed structure. These then allow insertion of the evidence and propagation of its effect throughout the network. In particular, we can find its impact on the comparison of competing hypotheses, e.g. as to paternity.

3.1 Object-oriented Bayesian networks

Dawid et al. (2002) showed how Bayesian networks can be built to represent problems such as described above, allowing one to obtain the correct likelihood ratio for the hypotheses based on all the available evidence. Here we describe a new, “object-oriented”, construction for such networks, which greatly simplifies and clarifies the specification process.

Version 6 of the Bayesian network (BN) software system Hugin supports hierarchical definition of a BN, whereby any network can itself contain repeated instances of some other generic (class) network or networks. We use bold face to indicate a network class, and teletype face to indicate an instance or regular node.

A class network is like a regular network, except that it can have interface nodes, of type input or output: an input node is displayed with a dotted outline, and an output node with a solid outline. Any network can have nodes that are themselves instances of other networks, in addition to regular nodes. Each instance of a class network within another network is displayed as a rounded rectangle, which can be expanded if desired to display its interface nodes; internal nodes remain hidden from view (although they can be accessed in “run” mode for entering findings or extracting updated probabilities). Arrows between nodes within the same network, or from output nodes to regular nodes in the containing network, represent, in the standard way, the probabilistic or functional dependence of that “child” node on its “parents” (Cowell et al. 1999). An input node can have at most one incoming arrow from a node in the containing network (which could itself be an output node of some other subnetwork): this is a “binding link”, indicating that these two nodes are to be identified.

All instances of a class have identical probabilistic structure, save that the table for an input node is a default, being overwritten in any instance where that node is bound to a node of the containing network. Only output nodes can be parents of external nodes (either regular nodes of the containing network, or input nodes of other subnetworks).

This architecture enables a convenient modular approach to problem specification. It is particularly natural and useful for genetic networks, where there is repetition, across different individuals, of such basic structures as Mendelian inheritance or mutation processes. Here we describe a set of simple class networks that can be pieced together as required, much like a child’s construction set, to represent a wide variety of problems. A specific application of this modular construction process to a complex problem involving mutation has previously been described by Dawid (2003). Note that the object-oriented structure is used purely for problem specification and network construction. Within the software the network is expanded internally into a regular Bayes net (which can be output if desired). Once an object-oriented network has been constructed, it can be used for individual case analysis in essentially the same way as a regular network: see Dawid et al. (2002) for illustrations. After entering evidence, computation and analysis are effected by standard propagation algorithms (Cowell et al. 1999), initiated by means of simple mouse clicks.

3.2 Bayesian networks for DNA identification

The pedigrees displayed in §2 above were constructed in Hugin 6.4. Over and above expressing family relationships, this allows us to describe the operation of genetic inheritance in detail. We do this in the context of forensic DNA profiles, each consisting of measurements on a collection of STR genetic markers (which we shall usually simply call “genes”).

An individual’s DNA profile consists of measurements on a number of DNA markers. For each such marker we observe a genotype, comprising the unordered pair of values (alleles) for its constituent genes (one maternally and one paternally inherited, although this distinction cannot usually be observed). When these alleles are the same the individual is called homozygous at that marker, else heterozygous. Current technology utilises STR markers, which have a repertory of 8–20 alleles that can commonly be described by a small integer. For present purposes these can be regarded as measured without error, except for the specific possibility of “silent” or “missed” alleles, as treated in §6 ff. below.

Each of our networks describes the inheritance of a single marker: distinct markers require distinct networks, but these will differ only in the details of the repertory of alleles, and their population frequencies. On entering the available DNA profile data for a marker we can use the system to calculate likelihood ratios for comparing hypotheses of interest. Throughout this paper we assume that the networks for different markers are entirely independent (given any of the hypotheses entertained), and calculate an overall likelihood ratio by simply multiplying the values obtained from each component marker network.
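As an illustration of this multiplication step only, the following Python sketch combines per-marker likelihood ratios into an overall value. It is not part of the Hugin networks: the function name combine_marker_lrs and the numerical values are hypothetical, chosen purely for illustration.

```python
import math

def combine_marker_lrs(marker_lrs):
    """Overall likelihood ratio as the product of per-marker likelihood ratios.

    Valid only under the independence assumption stated in the text.
    """
    if any(lr < 0 for lr in marker_lrs):
        raise ValueError("likelihood ratios must be non-negative")
    return math.prod(marker_lrs)

# Hypothetical per-marker values (not taken from any real case).
example_marker_lrs = [2.0, 5.0, 0.5]
overall_lr = combine_marker_lrs(example_marker_lrs)
```

A single marker with likelihood ratio 0 (an apparent exclusion) forces the overall value to 0, which is one reason why the treatment of mutation and null alleles matters.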

Note that colouring of nodes is purely for presentational purposes and has no effect on the analysis.


3.3 Nuclear family

Each of the three nodes in Figure 1 was defined as an instance of another, generic, class network, having hidden internal structure. Both f and m are instances of a class founder, while c is an instance of a class child.

In Figure 7, which is an expanded version of this network, we see that founder contains two output nodes: pg, representing the founder’s paternally inherited gene, and mg, representing the maternally inherited gene. As for child, in addition to output nodes pg and mg as for founder, it has input nodes fpg, fmg, mpg, mmg, representing respectively the child’s father’s paternal and maternal genes, and his/her mother’s paternal and maternal genes. The arrows into these represent binding links, specifying that these are identical copies of the associated gene nodes in the two parental networks.

Figure 7: Expanded pedigree for nuclear family

The above class networks contain still further hidden structure, defining the nature of the inheritance process and of the observable quantities (genotypes). This will be described in §4 below.

3.4 Simple disputed paternity

In Figure 2, m, pf and af are again instances of class founder, and c an instance of class child, exactly as described above. To model tf we need to construct a new network class query. Some details of this are shown in the partially expanded version of Figure 8. Internally, the output node tfpg is copied from either f1pg or f2pg, according as the Boolean variable tf=f1? is true or false; and similarly for tfmg. Input nodes f1pg and f1mg are bound to output nodes pg and mg of pf, while f2pg and f2mg are bound to output nodes pg and mg of af. Other connexions between the nodes in Figure 2 are made exactly as described in §3.3 above. We also include the explicit “hypothesis node” tf=pf?, bound to tf=f1?, in the top-level network: this node embodies H0 or H1 according as its value is true or false. We initially set these as equally likely, so that after propagation of evidence the ratio of their posterior probabilities can be interpreted as a likelihood ratio.
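The reason why equal prior probabilities turn the posterior ratio into a likelihood ratio is Bayes’ theorem in odds form. A short derivation, using the notation of equation (1):

```latex
\[
\frac{\Pr(H_0 \mid E)}{\Pr(H_1 \mid E)}
  = \frac{\Pr(E \mid H_0)}{\Pr(E \mid H_1)} \times \frac{\Pr(H_0)}{\Pr(H_1)}
  = \mathrm{LR} \times 1
  \qquad \text{when } \Pr(H_0) = \Pr(H_1) = \tfrac{1}{2}.
\]
```

With any other prior setting at the hypothesis node, the posterior ratio would have to be divided by the prior odds to recover the likelihood ratio.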

3.5 Further networks

We now have all the ingredients to represent more complex problems, such as described in §2.3 and §2.4. All one has to do is to insert and connect together, in obvious ways determined by the basic pedigree, instances of the already constructed networks founder, child and query, as well as a hypothesis node. Armed with this “construction set” we can represent and so solve a very wide variety of problems involving DNA profiles and disputed identity.

4 Detailed structure

Figure 8: Partially expanded pedigree for simple disputed paternity

4.1 Network founder

The internal structure of the network class founder is shown in Figure 9. The internal nodes pgin and mgin represent the random paternally and maternally inherited genes of the founder, and are themselves specified as instances of a class gene (not shown here), which consists of a single output node, also called gene. Associated with gene in this simple network is the appropriate repertory of allele values and their population frequencies.

Figure 9: Network founder

For our illustrations in this paper we use forensic marker VWA, having alleles ranging from 12 to 22 and probability table as given in Table 1. These are Austrian-German population allele frequencies.²

The output nodes pg and mg of founder are specified as identical copies of the internal gene node of pgin and mgin, respectively. Such duplication is necessary only because of limitations of Hugin, which currently does not allow a node to be both an input and an output node, nor an arrow to cross more than one level of the hierarchy.

Finally the internal node gt of founder is an instance of the class genotype, as displayed in Figure 10. Here gtmin and gtmax are defined (by means of Hugin expressions) as the minimum and maximum of the two input gene nodes pg and mg, and represent the observable genotype of an individual, being used for entering such genotype evidence when available; we colour such an “observation node” in green. The input nodes pg and mg of genotype are bound to nodes pg and mg of founder.

Figure 10: Network genotype
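The way in which gtmin and gtmax summarise an unordered pair of genes can be sketched outside Hugin. The Python fragment below is an illustrative sketch only: the allele frequencies are hypothetical (they are not the values of Table 1), and the function name genotype_distribution is ours.

```python
from itertools import product

# Hypothetical allele frequencies, for illustration only.
allele_freqs = {14: 0.2, 15: 0.3, 16: 0.5}

def genotype_distribution(freqs):
    """Distribution of (gtmin, gtmax) for a founder whose genes pg and mg
    are drawn independently from the population frequencies."""
    dist = {}
    for (pg, p), (mg, q) in product(freqs.items(), repeat=2):
        gt = (min(pg, mg), max(pg, mg))  # the order of pg and mg is not observable
        dist[gt] = dist.get(gt, 0.0) + p * q
    return dist

gt_dist = genotype_distribution(allele_freqs)
```

A heterozygous genotype collects the probability of both orderings of its two genes, whereas a homozygous genotype arises in only one way.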

4.2 Network child

The internal structure of network class child is displayed in Figure 11.

Figure 11: Network child

On the paternal (left-hand) side, the input nodes fpg and fmg of child are bound to the input nodes pg and mg of an instance fmeiosis of a network class mendel. This in turn has an output node cg, which is then copied identically to the output node pg of child (again, such duplication would ideally be avoided but at present cannot be). An identical structure holds for the maternal (right-hand) side of child. Finally pg and mg are fed into an instance gt of genotype, exactly as in founder, again allowing input of observed genotype data.

Figure 12 shows the internal structure of mendel. Its internal Boolean node cg=pg? is modelled as having a 50% chance of being true, in which case output node cg is identical with input node pg; else, when cg=pg? is false, cg is identical with input node mg. The effect is thus to transmit, at random, just one of the two parental genes, in accord with Mendelian segregation.

Figure 12: Network mendel
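The behaviour of mendel can be written down directly as a conditional distribution. The following Python sketch is an illustration only, not the Hugin specification; the function name mendel simply echoes the network class, and the argument values are hypothetical.

```python
def mendel(pg, mg):
    """Distribution of the transmitted gene cg, given the parent's own
    paternal gene pg and maternal gene mg (fair choice between the two)."""
    dist = {}
    for gene in (pg, mg):
        dist[gene] = dist.get(gene, 0.0) + 0.5
    return dist

heterozygous_parent = mendel(14, 16)  # each gene transmitted with probability 0.5
homozygous_parent = mendel(15, 15)    # the single allele is transmitted with certainty
```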

4.3 Network query

The internal structure of network query is shown in Figure 13. This contains only the input and output nodes as described in §2.2 above. When tf=f1? is true, tfpg copies f1pg and tfmg copies f1mg; when false, tfpg copies f2pg and tfmg copies f2mg.

4.4 Analysis

For case analysis the pedigree network describing a problem is used essentially as described in Section 2.2 of Dawid et al. (2002): each observed genotype is entered (as gtmin and gtmax) inside the instance gt of genotype within the relevant instance of founder or child. Then probability propagation is performed by the software, following which we calculate, as the ratio of the updated probabilities at node tf=pf?, the contribution to the likelihood ratio in favour of paternity based on these observations at this marker. The global likelihood ratio is obtained by multiplication of these contributions across all the markers measured.
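For the simple disputed paternity problem of §2.2, with no mutation or null alleles, the single-marker likelihood ratio can be checked by brute-force enumeration. The Python sketch below is such a check under stated assumptions, not the Hugin computation: the allele frequencies and observed genotypes are hypothetical, and all function names are ours.

```python
from itertools import product

# Hypothetical allele frequencies, for illustration only.
FREQS = {14: 0.2, 15: 0.3, 16: 0.5}

def genotype(pg, mg):
    """Observable genotype (gtmin, gtmax) of an unordered pair of genes."""
    return (min(pg, mg), max(pg, mg))

def founder_genotype_prob(gt, freqs):
    """Probability that a founder with two independent genes shows genotype gt."""
    return sum(freqs[a] * freqs[b]
               for a, b in product(freqs, repeat=2) if genotype(a, b) == gt)

def transmit_prob(allele, parent_gt):
    """Probability that a parent with genotype parent_gt passes on allele."""
    return sum(0.5 for g in parent_gt if g == allele)

def child_prob(child_gt, father_gt, mother_gt):
    """Pr(child genotype | parental genotypes) under Mendelian segregation."""
    total = 0.0
    for pg in set(father_gt):
        for mg in set(mother_gt):
            if genotype(pg, mg) == child_gt:
                total += transmit_prob(pg, father_gt) * transmit_prob(mg, mother_gt)
    return total

def paternity_lr(child_gt, mother_gt, pf_gt, freqs):
    """LR = Pr(E | H0) / Pr(E | H1); factors common to both hypotheses cancel."""
    # H0: the true father is pf, whose genotype is observed.
    numerator = child_prob(child_gt, pf_gt, mother_gt)
    # H1: the true father is af, a random man; sum over his unobserved genotype.
    af_genotypes = {genotype(a, b) for a, b in product(freqs, repeat=2)}
    denominator = sum(founder_genotype_prob(gt, freqs) * child_prob(child_gt, gt, mother_gt)
                      for gt in af_genotypes)
    if denominator == 0:
        raise ValueError("child genotype is impossible under H1 for this mother")
    return numerator / denominator

# Hypothetical evidence: child (14, 15), mother (14, 14), putative father (15, 16).
lr = paternity_lr((14, 15), (14, 14), (15, 16), FREQS)
```

In this example the child must have received allele 15 from the father, so the value agrees with the familiar expression 1/(2p), where p is the frequency of the paternal allele (here 0.3).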

4.5 Super-networks

We can even treat a “top-level” network, such as triplet, as a class, and create one instance of it for each marker. Since Hugin does not currently allow modification of the states of a node when reusing a network, we must first set up a single repertory of coded states in gene, and specify appropriate correspondences with the actual alleles of the marker under consideration; the allele frequencies are likewise edited appropriately for each marker. The resulting marker networks can then be analysed separately, and their several likelihood ratios multiplied together. Alternatively all the single-marker networks can be explicitly combined as instances within a single super-network, with the node tf=pf? (now made into an input node) in each instance bound to a new top-level hypothesis node tf=pf?. Then after entering the evidence on all individuals at all markers, and propagating, we can obtain directly the global likelihood ratio from that hypothesis node. Such super-networks are not ideally suited to the propagation algorithm used by Hugin, since the links to the top-level hypothesis node can create very large cliques, and thus severe computational inefficiencies. External combination of marker-specific calculations is preferable whenever (as in the cases considered here) this is possible. However in some more complex problems, e.g. those involving quantitative analysis of mixed samples (Cowell et al. 2004), there are additional quantities common to all markers, and then such a super-network may be the only way to proceed.

5 Mutation

It is easy to modify networks such as the above to account for possible mutation of genes in transmission from parent to child. We distinguish between a child’s original gene cog, identical with one of the parent’s own genes, and the actual gene cag available to the child, which may differ from cog because of mutation.

Mutation network “mut” We must first construct a new class network mut to model the relevant mutation process. This network should have og as an input node, and ag as an output node.

Revised network “mendel” We also modify the class mendel of Figure 12 as shown in Figure 14, renaming cg to cog (now made into an internal node) and binding this to input node og of an instance cag of mutation network mut. The output node ag of cag is then duplicated to supply the output node cg of mendel.

The overall effect is that the output of mendel now represents the result of mutation acting on top of Mendelian segregation.

Figure 14: Revised network mendel, incorporating mutation

As a very simple example, the network mut shown in Figure 15 implements the proportional mutation model: the actual gene ag is either identical to the original gene og, or else replaces that by a new gene sampled randomly from the population distribution, obtained from the output of an instance otherg of gene. The choice between these is made according to the outcome of a biased coin toss bcoin.

Figure 15: Network mut for proportional mutation model
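The conditional distribution of ag given og implied by the proportional mutation model can be sketched as follows. This is an illustration only: the symbol mu, standing for the probability that bcoin selects the fresh draw otherg, is our own notation, and the allele frequencies and the value of mu are hypothetical.

```python
def proportional_mutation(og, freqs, mu):
    """Pr(ag | og): keep og with probability 1 - mu, otherwise replace it
    by a gene drawn afresh from the population frequencies freqs."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must be a probability")
    return {a: (1.0 - mu) * (1.0 if a == og else 0.0) + mu * p
            for a, p in freqs.items()}

# Hypothetical inputs, for illustration only.
example_freqs = {14: 0.2, 15: 0.3, 16: 0.5}
ag_dist = proportional_mutation(14, example_freqs, 0.01)
```

Note that the fresh draw can return the original allele, so the probability that ag equals og slightly exceeds 1 - mu.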

For some mutation models we might wish to allow the mutation process to vary, according as it affects the paternal or the maternal line; in this case we need to incorporate a further Boolean input node p or m? in mut to specify the parental line. We then duplicate this in mendel, and bind these nodes together, as shown in Figure 16; and further modify child as in Figure 17, assigning probabilities 1 and 0 appropriately at nodes pline and mline (each bound to input node p or m? in the relevant instance fmeiosis or mmeiosis of mendel) to specify the relevant parental line.

Figure 16: Revised network mendel, incorporating mutation varying with parental line

Figure 17: Revised network child, incorporating mutation varying with parental line

For more complicated mutation models there may be further internal structure, and/or adjustable parameters, in mut. As an example, Figure 18 represents a “mixed mutation model” (Dawid et al. 2001; Vicard and Dawid 2004). This chooses, as ag, either the original gene og, or a mutated gene, represented by an instance mutg of the class mutg of Figure 19. The choice is controlled by a coin toss bcoin, with bias determined by parameters xi, related to the overall mutation rate, and rho, which can be set to allow for differential mutation rates in the male and female lines. The mutated gene mutg is itself obtained by selecting between the outputs of the proportional mutation model propmutg, an instance of gene, and that of the “single-step mutation model” onemutg, an instance of onestep (not shown here). A parameter h determines the selection probability. For further details of this model see Dawid (2003).³

Figure 18: Network mut for mixed mutation model

Figure 19: Network mutg for mixed mutation model

If we were only concerned with fixed values of the parameters, we could omit the parameter nodes and simply insert appropriate values into the conditional probability tables of the coin toss or other nodes that they affect. In that case we could proceed exactly as described above for the proportional mutation model. However, exploration of sensitivity to varying parameter values would then require direct editing of these conditional probability tables. To avoid this we have inserted explicit parameter nodes h, xi and rho, each having a discrete collection of numerical values we wish to experiment with, and specify the coin-toss probabilities etc. as algebraic expressions in these parameters. Since typically several instances of a network class containing such a parameter node will occur in the overall network, we need to ensure that any value set for the parameter is transferred to all those instances. The “traverse instance” feature of Hugin 6.4 enables this to be done easily.

Once an appropriate network mut has been built, and mendel (and possibly also child) modified as described above, pedigree networks constructed as in §2 will now automatically incorporate the additional possibility of mutation. No other changes are required.

³ Our network mut corresponds to the network ag of Dawid (2003), while our parameter xi is twice the parameter


5.1 Non-stationarity

A stationary mutation model is one for which the allele frequency distribution of a gene after mutation is identical with its distribution before mutation. The proportional mutation model described above is stationary, but in general the mixed mutation model is not. With non-stationary mutation, allele frequencies will change slightly from one generation to the next, and the very concept of a “population allele frequency distribution” dissolves into meaninglessness. A consequence of this is that we will get slightly different answers according as, say, our pedigree network does or does not include parents for node pf. For example, if we were to use the pedigree of Figure 3 to analyse the simple paternity problem of Figure 2, by inserting findings at m, pf and c, we would get a slightly different answer simply in view of the fact that a (now unobserved) brother is represented in the network. Various workarounds could be used to avoid this, but we have not felt it worthwhile following this route, on the grounds that there is no logically compelling reason to prefer raw over once-mutated, twice-mutated, ..., frequencies, and the numerical differences will in any case be small (vanishing completely for a stationary mutation process).
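The stationarity of the proportional mutation model can be verified in one line. In the derivation below the notation is ours, introduced only for this check: p_j denotes the population frequency of allele j, μ the probability that the coin toss selects a fresh draw, and δ_ij the Kronecker delta.

```latex
\[
\sum_i p_i \Pr(\mathit{ag} = j \mid \mathit{og} = i)
  = \sum_i p_i \bigl[ (1-\mu)\,\delta_{ij} + \mu\, p_j \bigr]
  = (1-\mu)\, p_j + \mu\, p_j \sum_i p_i
  = p_j .
\]
```

The final step uses only the fact that the frequencies sum to one; no comparable identity is available in general once a single-step component is mixed in.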

6 Silent alleles

6.1 Background and assumptions

A null or drop-out allele is one that is not recorded by the equipment used. When this can happen, what appears to be a homozygous genotype at some marker may not be so: an alternative explanation is that we are seeing just one band of a heterozygous genotype, the other band being null. This phenomenon will clearly affect the evidential interpretation of certain patterns of DNA profiles. Several papers in the literature have dealt with genetic aspects of dropout and how to allow for it in the analysis: Gill et al. (2000) develop formulae for the likelihood ratio, while dna·view, a programme developed by C. Brenner, contains modules to perform the calculations.

This phenomenon can occur for a number of reasons. One possibility is “run-off”, where the measuring apparatus used is simply unable to record certain allele values. Another is a mutation in the primer binding site, near to the target marker, leading to failure of the amplification process. In either of these cases a null allele will be inherited exactly like any other allele, distinct markers still being unlinked. We term such an inherited null allele silent. We construct networks to model and analyse this situation in §6.2 below.

Clayton et al. (2004) found that about 3 × 10⁻⁴ apparent mutations detected in paternity triplets were due to primer binding site mutations. They also suggest that such a mutation is likely to be preferentially associated with some specific allele or alleles of the target marker. For simplicity and demonstration purposes we have not taken account of this association, supposing instead that every allele has the same probability of becoming silent. Thus the models developed and the numerical values assumed here should be considered as purely illustrative: they are not recommendations for use in forensic laboratory casework.

Another possible explanation for a null allele is sporadic failure of the apparatus to record the correct allele value. In this case the property is not inherited; we refer to such a null allele as missed. We describe how to handle this situation in §7.

6.2

Networks for inherited silent alleles

We can construct Hugin networks to handle problems with inherited silent alleles by making minor modifications to the basic building blocks: specifically, to gene and genotype. We now make explicit use of the dummy value 99 to represent silence. Wherever any node in any network represents a gene, its state-space must be augmented with this value (in fact, to avoid further editing we already included this in our previous networks, giving it probability 0 in network gene).


Revised network “gene” The simple one-node network gene is now renamed gene0, and an instance gene0 of it is included in the new gene network shown in Figure 20. This has output node gene, equal to the output of gene0 unless the binary node silent takes the value 1, in which case gene is set to the silent value 99. The silence indicator silent is generated from Binomial(1, pr(silent)), depending on parameter node pr(silent): we have made this a discrete numerical node, so that we can vary its value (we consider values 0.000015, 0.00003, 0.0001, 0.0005, 0.001, 0.005 and 0.01). The overall effect is that, with probability pr(silent), any original allele value is transformed into a silent allele. The probability of a silent allele is thus pr(silent), while initial “real” allele frequencies are multiplied by 1 − pr(silent). A silent allele is inherited just like any other allele.

Figure 20: Network gene for founder gene, incorporating silent allele
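The marginal distribution of the output node gene described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the Hugin network itself; the function name and the three-allele population are hypothetical.

```python
SILENT = 99  # dummy state used in the text to represent a silent allele


def founder_gene_distribution(allele_freqs, pr_silent):
    """Distribution of output node `gene`: each real allele keeps a fraction
    1 - pr_silent of its frequency, and state 99 receives pr_silent."""
    dist = {allele: (1 - pr_silent) * p for allele, p in allele_freqs.items()}
    dist[SILENT] = pr_silent
    return dist


# Hypothetical three-allele population (not the VWA frequencies of Table 1).
freqs = {12: 0.2, 13: 0.5, 14: 0.3}
dist = founder_gene_distribution(freqs, pr_silent=0.001)
for allele, prob in sorted(dist.items()):
    print(allele, prob)
```

The result is still a proper distribution, since the real alleles sum to 1 − pr(silent) and the silent state contributes the remainder.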

Revised network “genotype” The network of Figure 10 for class genotype also needs to be modified, as shown in Figure 21, to account for the fact that silent alleles cannot be seen in observed genotypes. Nodes pg, mg and gtmin are defined as before. Previous node gtmax is renamed gtmax0, while new output node gtmax is equal to gtmax0 unless this has value 99, in which case it is set equal to gtmin, so mimicking a homozygous genotype. If both alleles are silent so will be both gtmin and gtmax, and nothing will be seen: an event which, though rare, has been known to occur (Clayton et al. 2004, Figure 1).

Figure 21: Network genotype, incorporating silent allele
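The deterministic part of the revised genotype network can be sketched as follows. The sketch assumes that gtmin and gtmax0 are the smaller and larger of the two inherited allele values, as their names suggest, and it relies on the dummy value 99 exceeding every real repeat number; the function name is hypothetical.

```python
SILENT = 99  # must be larger than every real allele value


def observed_genotype(pg, mg):
    """Map the paternal and maternal genes to the observed pair (gtmin, gtmax).

    A single silent allele is masked by the other allele, so the genotype
    looks homozygous; two silent alleles leave nothing visible (99, 99).
    """
    gtmin = min(pg, mg)
    gtmax0 = max(pg, mg)
    gtmax = gtmin if gtmax0 == SILENT else gtmax0
    return gtmin, gtmax


print(observed_genotype(12, 14))          # (12, 14): ordinary heterozygote
print(observed_genotype(12, SILENT))      # (12, 12): apparent homozygote
print(observed_genotype(SILENT, SILENT))  # (99, 99): nothing seen
```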

Again, once we have made the above replacements of lower level networks, we can simply reuse top-level pedigree networks such as in §2 — now automatically incorporating the possibility of silent alleles into these problems.

7

Missed alleles

Modelling of sporadically missing alleles is just as straightforward. These only affect the way in which a genotype is observed. We now use 99 to represent an unobserved “missed” value.

Observed allele network “geneobs” This new network, displayed in Figure 22, is very similar to that for gene in Figure 20. Node pr(missed) is a discrete numerical parameter node allowing us to set various values for the probability that an allele is missed (supposed independent of its value). The binary missingness indicator missed has a Binomial(1, pr(missed)) distribution. Input node gene0 represents an actual allele value, while output node gene, the possibly missed gene, replaces this by 99 if missed takes value 1.

Figure 22: Network geneobs for observed gene, incorporating missed allele

Revised network “genotype” We also revise the network genotype of Figure 10, as in Figure 23. New nodes pgobs and mgobs are instances of geneobs, thus transforming pg and mg according to the missingness process. Nodes gtmin, gtmax0 and gtmax are obtained from the resulting, possibly missing, alleles exactly as described in §6.2.

Figure 23: Network geneobs for observed genotype, incorporating missed allele

Yet again, existing pedigree networks can be reused, so as now to allow for missing alleles.
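To make the missingness process concrete, the following Python sketch enumerates the distribution of the observed genotype given the two true alleles. It is an illustration only: the function names are hypothetical, the two alleles are assumed to be missed independently, and gtmin and gtmax0 are assumed to be the minimum and maximum of the observed values.

```python
from itertools import product

MISSED = 99  # dummy state for an allele that was not recorded


def geneobs_distribution(gene0, pr_missed):
    """Distribution of the observed allele given the actual allele gene0."""
    if gene0 == MISSED:  # already unobservable (e.g. a silent allele)
        return {MISSED: 1.0}
    return {gene0: 1 - pr_missed, MISSED: pr_missed}


def observed_genotype_distribution(pg, mg, pr_missed):
    """Distribution of (gtmin, gtmax) when each allele is missed independently."""
    dist = {}
    pairs = product(geneobs_distribution(pg, pr_missed).items(),
                    geneobs_distribution(mg, pr_missed).items())
    for (a, pa), (b, pb) in pairs:
        gtmin, gtmax0 = min(a, b), max(a, b)
        gtmax = gtmin if gtmax0 == MISSED else gtmax0
        dist[(gtmin, gtmax)] = dist.get((gtmin, gtmax), 0.0) + pa * pb
    return dist


d = observed_genotype_distribution(12, 14, pr_missed=0.01)
for genotype, prob in sorted(d.items()):
    print(genotype, round(prob, 6))
```

For a true heterozygote {12,14} each apparent homozygote {12,12} or {14,14} arises with probability pr(missed)(1 − pr(missed)), which is why a missed allele can mimic the pattern produced by a silent one without being passed on to children.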

8

Combination

We can readily combine any or all of the complicating features so far introduced, thus allowing for the possible simultaneous existence of inherited silent alleles, sporadic missed alleles, and mutation; all within a wide variety of top-level pedigree networks incorporating further complications such as missing individuals. We simply include all the appropriate new and revised networks needed for the various extensions (when combining both silence and missingness, treated as operating independently, we use the network genotype constructed for missingness). Further modifications can generally be introduced quite easily: for example, when combining mutation and silence we have chosen to modify mendel, adding an extra arrow from cog to cg, to ensure that mutation out of or into a silent allele is not allowed.

In all circumstances the identical pedigree networks can be used. We have created a number of directories containing the appropriate lower-level networks for each combination of the above features. Using instances of founder, child, query, a pedigree network to describe a new problem can be constructed in any one of these, and simply dropped into any other, for immediate incorporation of the relevant additional features.


9

Examples

We now illustrate the effects of accounting for either the separate or the combined effects of silent alleles, missed alleles, and mutation. All examples refer to marker VWA, with population gene frequencies as given in Table 1.

We use the simple paternity pedigree network of Figure 2, extended, as described in §8, to allow for all the additional complications simultaneously. A mixed mutation model is assumed, with parameter values set to h = 0.9, rho = 0.5 and xi = 0.005081 (corresponding to a combined mutation rate of τ = 0.004982). When no mutation is allowed we set xi = 0.

After propagating the evidence, node tf=pf? contains the posterior probabilities of paternity and non-paternity. We set the prior probability of paternity to 0.5, so that we can interpret the ratio of the resulting (purely nominal) posterior probabilities as the likelihood ratio in favour of paternity, which we henceforth term the paternity ratio.
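The step from posterior probabilities to the paternity ratio is just Bayes' theorem in odds form (likelihood ratio = posterior odds / prior odds). A minimal Python sketch, with a hypothetical function name, is:

```python
def paternity_ratio(posterior_paternity, prior_paternity=0.5):
    """Likelihood ratio in favour of paternity, recovered as
    posterior odds divided by prior odds."""
    posterior_odds = posterior_paternity / (1 - posterior_paternity)
    prior_odds = prior_paternity / (1 - prior_paternity)
    return posterior_odds / prior_odds


# With the nominal prior 0.5 the prior odds are 1, so the ratio of the two
# posterior probabilities is itself the likelihood ratio.
print(paternity_ratio(0.75))  # 3.0
print(paternity_ratio(0.0))   # 0.0, an apparent exclusion
```

Any other prior gives the same likelihood ratio once the prior odds are divided out, which is why the choice of 0.5 is purely a convenience.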

In our examples both the child’s and the putative father’s genotypes are apparently homozygous. It is easy to see that (in the absence of mutation) if either the child or the putative father were heterozygous it would make no difference to introduce the possibility of a silent or a missed allele.

Since a silent allele is inherited while a missed allele only affects the recorded genotype, allowing for silence will typically have a much greater effect than allowing for missingness.

Example 9.1 The data are:

m: {12,20} pf: {18,18} c: {12,12}.

Note that the child’s observed allele 12 is extremely rare, having frequency $p_{12}$ = 0.03%; the mother’s other allele 20 is somewhat less rare, with $p_{20}$ = 1.4%; while the putative father’s observed allele 18 is common, with $p_{18}$ = 22%.

Table 2 shows the combined effects of silence and missingness with no mutation. Comparing the column pr(missed) = 0 with the row pr(silent) = 0, we see that the effect of silence alone is roughly 5 times that of missingness alone. On passing from pr(silent) = 0 to pr(silent) = 0.001, the value estimated by the American Association of Blood Banks, the paternity ratio goes from 0 to 3.53: instead of the evidence ruling the putative father out, when we introduce a small possibility of silence it actually favours paternity. Indeed, whenever pr(silent) ≥ 0.0001 all entries in the table give a paternity ratio greater than 1, favouring paternity (the effect of incorporating missingness in addition to silence being to reduce the paternity ratio slightly). Intuitively this is because, as soon as the probability of silence is comparable with that of allele 12, the child’s apparently homozygous genotype is well explained as really being truly heterozygous {12, silent}. This in turn is readily explained under paternity if the putative father also has a silent allele. A similar explanation based on a (non-inherited) missed allele is however much less convincing.
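For the silence-only case the quoted value can be checked by hand. The sketch below is a hand derivation under stated assumptions (no mutation, no missed alleles, the mother's heterozygous genotype taken at face value, the true father unrelated to pf under non-paternity), not the authors' network computation, and the function name is hypothetical. Under paternity pf must be truly {18, silent} and must transmit the silent allele; under non-paternity the child's second allele is 12 or silent, drawn from the population. The mother's contribution cancels, leaving

paternity ratio = ps / ((q18 + 2 ps)(q12 + ps)),

where ps = pr(silent) and qx = (1 − ps) px. The frequencies used are those quoted in the text ($p_{12}$ = 0.03%, and $p_{18}$ = 0.2162 from Example 9.3).

```python
def paternity_ratio_silent_only(p_child, p_father, pr_silent):
    """Paternity ratio for m = {a,b}, pf = {z,z}, c = {a,a} with z != a,
    allowing only for inherited silent alleles (hand-derived closed form)."""
    q_child = (1 - pr_silent) * p_child    # frequency of the child's allele a
    q_father = (1 - pr_silent) * p_father  # frequency of the father's allele z
    return pr_silent / ((q_father + 2 * pr_silent) * (q_child + pr_silent))


ratio = paternity_ratio_silent_only(p_child=0.0003, p_father=0.2162,
                                    pr_silent=0.001)
print(round(ratio, 2))  # 3.53
```

The formula also shows why the ratio first exceeds 1 once pr(silent) becomes comparable with the frequency of allele 12: the factor ps / (q12 + ps) then approaches 1 while 1 / (q18 + 2 ps) stays well above 1.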

Table 3 shows the combined effect of silence, missingness and mutation. In the absence of silence or missingness, a 6-step mutation would be required to explain the data under paternity, and this is highly improbable under our mixed mutation model. Comparing Table 3 with Table 2 one in fact observes a negligible additional effect of allowing for mutation. □

Example 9.2 Now consider data:

m: {12,20} pf: {13,13} c: {12,12}.

The mother’s and child’s genotypes are the same as in Example 9.1, while the putative father’s observed allele is now the relatively rare allele 13, with $p_{13}$ = 0.2%. The combined effects of silence and missingness are displayed in Table 4.

The impact of introducing the possibility of silence is overwhelming: for example, when pr(silent) = 0.01% the paternity ratio is 125. Compared with Example 9.1, the greater rarity of the putative father’s observed allele now makes the presence of a silent allele still more plausible. However the sheer magnitude of this effect is perhaps unexpected.


The effect of missingness alone is, however, similar to that in Example 9.1. The additional effect of allowing for missingness over that of silence is to decrease the paternity ratio, markedly so for pr(missed) ≥ 0.001.

The effect of further incorporating mutation can be seen in Table 5. Mutation by itself (pr(silent) = pr(missed) = 0) has quite an impact, giving a paternity ratio of 3.79; intuitively this is because paternity can now be well-explained by a 1-step mutation, and this is quite probable under the mixed model. This effect of mutation can still be seen when missingness is introduced, but essentially disappears as soon as silence is allowed. □

Example 9.3 The data are:

m: {16,16} pf: {18,18} c: {18,18}.

The undisputed mother is apparently incompatible with the child: she must therefore have a missed allele, or have transmitted a silent or mutated allele to her child. Given that $p_{18}$ ≈ 22% is much larger than any value considered for pr(silent) or pr(missed), we can be pretty sure, first that both pfgt and cgt are truly homozygous, and then that the child inherited allele 18 from its father. This has probability close to 1 under paternity, and close to $p_{18}$ = 0.2162 under non-paternity. Correspondingly the paternity ratio is close to 1/0.2162 ≈ 4.6 for any combination of the above explanations. This can be confirmed by calculations (not shown), using our networks. □

10

Additional individuals

Suppose that, in a simple disputed paternity case, the genotype bgt of the putative father’s full brother b has been observed, in addition to those of the basic triplet m, pf and c. The relevant pedigree is as shown in Figure 24.

Figure 24: Pedigree for paternity testing with additional individual

Under simple Mendelian segregation this additional observation is independent of paternity status given the triplet evidence, and so makes no difference to the impact of that evidence. However, once we allow for a silent or missed allele the paternity ratio can be affected by knowledge of the brother’s genotype, because it can help to distinguish whether the putative father is a true homozygote, or is truly heterozygous but with a silent or missed allele. The likelihood ratio in favour of paternity P based on just the triplet data D := (mgt, pfgt, cgt) is

$$L_D := \frac{\Pr(D \mid P)}{\Pr(D \mid \bar{P})}. \qquad (2)$$

The impact of the additional information carried by the brother’s data B := (bgt) is measured by

$$L_B := \frac{\Pr(B \mid D, P)}{\Pr(B \mid D, \bar{P})}, \qquad (3)$$

and the overall paternity ratio, taking account of both D and B, is

$$LR := L_D \times L_B. \qquad (4)$$

We can calculate $L_B$ directly by algebraic methods: this is developed in Appendix A. Alternatively we can compute $L_D$ and $LR$ by numerical propagation, and thus derive $L_B$ from (4).

Our computations were made using the pedigree network of Figure 24, together with appropriate lower-level networks to incorporate the effects of silence or missingness (we do not consider mutation here).

Example 10.1 To illustrate the possible effect of the additional measurement B on the paternity ratio, we consider an example where the triplet evidence D is as follows:

m: {12,15} pf: {14,14} c: {12,12}.

The putative father and child are both apparently homozygous, in a way that would be inconsistent with paternity under Mendelian segregation. However pf could still be the true father if he had a silent allele he might have passed to the child, or if one of his alleles was missed. Observation of his brother’s genotype can help to shed light on these possibilities.

Silent alleles. Table 6 displays the paternity ratio, allowing for silent alleles. The second column gives the paternity ratio $L_D$ based on the triplet data only. The later columns show the additional factor $L_B$ for various possible observations on the brother’s genotype bgt. The behaviour of this term is determined by its relationship to the putative father’s observed genotype pfgt.

In columns 3 and 4 we consider bgt = {16,20} and bgt = {12,17}: b is heterozygous, and does not share any allele (and in particular, not a silent allele) with pf. As is verified in Case 1 (a) in Appendix A, the additional observation B makes no difference whatsoever in this case: $L_B$ = 1 for all values of pr(silent).

However, when b is heterozygous but shares an allele with pf, the paternity ratio is reduced by this additional knowledge. Intuitively this is because it becomes more likely that pf is a true homozygote, and hence excluded from paternity. This effect is seen in columns 5 and 6 of Table 6 for the cases bgt = {12,14} and bgt = {14,17}, so that b and pf share allele 14. The fact that the additional paternity ratio factor is close to 0.5 is explained by the analysis of Case 1 (b) in Appendix A, since in our example we have $q_{14} \approx p_{14}$ = 0.1009, considerably larger than the various values considered for pr(silent). That analysis also explains why the results are the same in both these columns.

Column 7 refers to the case bgt = pfgt (= {14,14}). Since b could now have a silent allele the additional data do little to distinguish whether or not pf is a true homozygote. Indeed we see that the extra factor $L_B$ is very close to 1, and so essentially uninformative. This is explained in Case 2 (a) in Appendix A.

Finally we consider the case that b is apparently homozygous, but with bgt different from pfgt. With such a configuration pf and b might still share a silent allele, and the additional observation B therefore renders it more probable that pf is a false homozygote, who could have passed a silent allele down to the child. As a consequence the paternity ratio is increased.

In column 8 the brother exhibits a relatively common allele, bgt = {16,16}, where $p_{16}$ ≈ 20%. Even though this renders him likely to be a true homozygote, the effect on the paternity ratio of the uncertainty introduced by this extra information is to introduce a factor of around 6 for small $p_s$, reducing somewhat as $p_s$ increases.

In column 9 we take a very rare allele, bgt = {12,12}, where $p_{12}$ = 0.03%. The increase in the paternity ratio is now dramatic. The values here reflect the analysis of Case 2 (b) in Appendix A, where it is shown that the additional effect is particularly strong when the allele of the brother is rare, but the silent allele is rarer still. The limiting value of $L_B$ as $p_s \to 0$ here is 3334.33, though to come close to this value $p_s$ needs to be less than $10^{-6}$. The overall paternity ratio


Missing alleles.

Table 7 illustrates the effect of observing the brother when allowing for missing alleles. Now the principal determinant of the additional effect of observing b is whether or not he shares an allele with c.

Columns 3 (bgt = {16,20}), 6 (bgt = {14,17}), 7 (bgt = {14,14}) and 8 (bgt = {16,16}) involve cases where bgt and cgt have no common alleles. Since missing alleles occur independently in different individuals, observation of the brother carries very little additional information on paternity.

In columns 4 (bgt = {12,17}), 5 (bgt = {12,14}) and 9 (bgt = {12,12}) the brother and the child share allele 12. In this case, knowing that allele 12 is likely to be present in the paternal line, because it has been observed in the putative father’s brother, makes it more probable that pfgt, observed as {14,14}, was in fact {12,14}, but with allele 12 missed. This argument is strengthened further when bgt = {12,12}: whether this is a true homozygote or involves a silent allele, it provides evidence for pfgt truly being {12, s}. The strength of the effect is related to the rarity of allele 12. It decreases slowly as $p_s$ increases. □

Example 10.1 shows that when the possibility of silent or missed alleles is taken into account in a paternity testing problem where the putative father appears incompatible with the child, additional information on relatives of the putative father can have a dramatic effect on the paternity ratio.

An effect can also be seen in compatible cases.

Example 10.2 The triplet evidence D is now:

m: {12,15} pf: {13,13} c: {12,13}.

Paternity ratios allowing for silent alleles are shown in Table 8. The values of $L_D$ in column 2 are much greater than 1 because the triplet is compatible, but they decrease as pr(silent) increases since it is then more likely that pf carries a silent allele. When bgt is also observed, its additional effect depends on its type. From column 6 of Table 8 we see that there is no effect whatsoever when the brother is heterozygous with no allele in common with the child (bgt = {21,22}); otherwise there is some effect, which is most apparent in column 5, where bgt is apparently homozygous but different from pfgt: it then becomes more plausible that pf is in fact heterozygous with one silent allele.

The effect of allowing for missed alleles is shown in Table 9. In this case the most interesting configurations are those where b shares at least one allele with pf. In particular, column 4 shows that when the brother is heterozygous (bgt = {13,16}), for larger values of pr(missed) the paternity ratio decreases, since it is then more likely that pf is truly heterozygous but with a missed allele. On the other hand if bgt = pfgt (= {13,13}), the paternity ratio is increased by the additional information. □

11

Criminal Case

Here we analyse the criminal case represented by Figure 5. The identity of an unrecognisable body is unknown, and it is questioned whether it might be that of a criminal cr whose family had reported his disappearance. The DNA profiles of the criminal’s family members (his wife wife and their two children c1 and c2) were typed, and a DNA profile was also extracted from the bodily remains.

Two different hypothetical cases are analysed below, to investigate the possible effects of allowing for silent and/or null alleles (we do not illustrate the additional effects of mutation, which were small). We again use marker VWA with allele frequencies as in Table 1.

Example 11.1 The observed genotypes are:


Both c1 and c2 are apparently incompatible with being the children of body. Table 10 shows the likelihood ratio in favour of identity, body = cr, obtained by propagating the evidence in the network of Figure 5, incorporating lower level networks for silent and missed alleles. The likelihood ratio exceeds 1 for pr(silent) ≥ 0.0001. The effect of missingness alone is slight; when included in addition to silence it slightly reduces the likelihood ratio. □

Example 11.2 Here the DNA evidence is:

body:{16,16} wife: {13,14} c1: {13,13} c2: {14,16}.

The difference from Example 11.1 is that c2 is now compatible with being the child of body. Table 11 shows the results of propagating this evidence. When taking the possibility of silent alleles into account the general effect is, as might have been expected, to increase the likelihood ratio; however this is not so for small values of pr(silent) and pr(missed). The likelihood ratio again exceeds 1 when pr(silent) ≥ 0.0001. Additional allowance for missingness increases the likelihood ratio when pr(silent) ≤ 0.0001, while for pr(silent) ≥ 0.001 it slightly reduces the likelihood ratio. □

In both the above cases, an apparent exclusion can turn into strong positive evidence for identity as soon as we allow only a small probability of a silent allele. Allowing a small probability of a missed allele yields much weaker evidence in itself, but even here the overall effect of all the evidence could be strongly in favour of identity when there is no exclusion on any other marker.

12

Conclusions

This paper has illustrated how object-oriented Bayesian networks can be fruitfully applied to solving complex problems of forensic DNA identification and paternity testing. The modularity and flexibility of the approach allows ready application to numerous different cases and complicating features. A significant application is to accommodate potential allelic drop-out.

When a silent or missing allele is suspected, the ambiguity in the genotype can sometimes be resolved by retesting. In cases where this is impossible or proves ineffective, it has been common simply to discard the data (Leopoldino and Pena 2002), but it is better to perform an appropriate analysis that properly allows for the ambiguity. We have shown how this can be done using the computational methodology of OOBNs, and have used this to illustrate the sometimes striking impact of even very low levels of drop-out. In particular, as shown in §10, in the presence of silent alleles information on additional relatives can be very powerful in helping to resolve the ambiguity and assess the strength of the evidence.

In this work we have used a very simple model in which the probability of allelic drop-out is independent of the actual allele value. In fact small alleles may be less affected by degradation and so less likely to drop out. Also, as suggested by Clayton et al. (2004), silence due to primer binding site mutation is likely to be associated with the allele repeat number. It should be relatively straightforward to incorporate such more realistic dependencies into our OOBNs.

There are numerous further artifacts, such as stutter, drop-in etc., that can occur in DNA profiling and that we have not considered here. Again, most of these can be modelled by modifications to our basic modular structures, along the lines already described. We hope to address some of these issues in future work. Another important area where this approach could be applied is in the analysis of low copy number (LCN) DNA, which is particularly sensitive both to drop-out and to possible contamination. Whitaker et al. (2001) found that under low copy number conditions approximately 10% per locus of all heterozygotes exhibit allelic drop-out.

Object-oriented Bayesian networks will also be useful for analysing other problems of interest in forensic DNA identification. For example, Bayesian networks have been applied to the analysis of mixed DNA traces, where several individuals may have contributed to the DNA trace (Mortera et al. 2003; Cowell et al. 2004). In such cases allelic drop-out and other artifacts are known to occur quite often. Incorporation of these additional complicating features in modular object-oriented networks should be reasonably straightforward.


A

Appendix: The effect of observing the putative father’s brother

In this Appendix we develop algebraic formulae for the paternity ratio in certain cases where we wish to allow for silent alleles, and we can also measure the brother B of the putative father PF. In the absence of silent alleles, the brother’s genotype would contain no information relevant to the paternity query. In their presence, however, it can carry useful additional information. This happens when there is ambiguity as to PF’s full genotype, because his measured genotype appears homozygous; and observing his brother can then provide information relevant to resolving this ambiguity.

We confine attention to a single forensic marker; as always, assuming independence across different markers we can obtain the overall paternity ratio by multiplication across markers.

Notation We denote by $[x, y]$ an ordered genotype, where $x$ is the paternally inherited allele, and $y$ is the maternal allele. We further denote by $\langle x, y\rangle$ the corresponding unordered genotype (with possibly repeated values); and by {x, y} the measured genotype, identical with $\langle x, y\rangle$ when $x \neq y$, but with {x, x} ambiguously denoting the homozygous pair $\langle x, x\rangle$ or the pair $\langle x, s\rangle$ where $s$ is a silent allele. We denote the frequency of the silent allele $s$ by $p_s$, and that of any other allele $x$ by $q_x$.⁴

We consider a putative family triplet with measurements D on the genotypes of all individuals: mgt for mother M, pfgt for putative father PF, and cgt for child C. We have, in addition, measured the genotype bgt of a full brother B of PF. Under non-paternity we assume that the true father TF is unrelated to PF.

The impact of the additional information contained in B is carried by

$$L_B := \frac{\Pr(B \mid D, P)}{\Pr(B \mid D, \bar{P})}, \qquad (5)$$

the likelihood ratio in favour of paternity ($P$) as against non-paternity ($\bar{P}$), based on B, after taking account of D. The overall likelihood ratio in favour of paternity is then $L_D \times L_B$, where $L_D$ is that based only on the data D on the family triplet. In particular, there is no additional information in B just when $L_B = 1$.

We also define, for a seemingly homozygous putative father with pfgt = {z, z},

$$L_h := \frac{\Pr(B \mid \text{pfgt} = \langle z, z\rangle)}{\Pr(B \mid \text{pfgt} = \langle z, s\rangle)}. \qquad (6)$$

This is the likelihood ratio in favour of PF’s being truly homozygous, as against heterozygous with a silent allele, based on his brother’s data B. We denote $\lim_{p_s \to 0} L_h$ by $L_h^0$.

Inconsistent triplet

We shall here consider only triplets that are prima facie incompatible, but could be explained, under paternity, by silent alleles. These would have measured genotype data D of the form: mgt = {a, b}, pfgt = {z, z}, cgt = {a, a}, with $z \neq a$ (though we allow $a = b$). This is the pattern of Example 10.1, with $a = 12$, $b = 15$, $z = 14$, and $q_z \approx p_{14} = 0.1009$. We denote the brother’s measured genotype by {x, y}.

We shall require the following general results.

⁴ Under the assumptions made in §6, $q_x = (1 - p_s)\,p_x$ ($x \neq s$), where $p_x$ is the population frequency of allele $x$.


Lemma A.1 For prima facie incompatible triplet data D,

$$L_B^{-1} = 1 - \alpha\,(1 - L_h), \qquad (7)$$

where

$$\alpha = \frac{q_z}{q_z + 2p_s}. \qquad (8)$$

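Proof. Consider first $\Pr(B \mid D, P)$. Under paternity $P$, we can deduce from the family data D that PF must have unordered genotype $\langle z, s\rangle$, including a silent allele. Given this, the profile B of the brother is independent of those of M and C. Hence

$$\Pr(B \mid D, P) = \Pr(B \mid \text{pfgt} = \langle z, s\rangle). \qquad (9)$$

Under non-paternity, $\bar{P}$, the data on M and C are completely irrelevant to those on PF and B, and we have

$$\Pr(B \mid D, \bar{P}) = \Pr(B \mid \text{pfgt} = \langle z, s\rangle \text{ or } \langle z, z\rangle) = (1 - \alpha)\Pr(B \mid \text{pfgt} = \langle z, s\rangle) + \alpha \Pr(B \mid \text{pfgt} = \langle z, z\rangle), \qquad (10)$$

with $\alpha := \Pr(\text{pfgt} = \langle z, z\rangle \mid \text{pfgt} = \langle z, s\rangle \text{ or } \langle z, z\rangle) = q_z/(q_z + 2p_s)$.

The result now follows. □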
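For readers who want the final step of the proof spelled out, the following worked derivation fills it in. It uses only (6), (9) and (10), together with the genotype probabilities $q_z^2$ for $\langle z, z\rangle$ and $2 q_z p_s$ for $\langle z, s\rangle$, which are assumed here to follow from random mating with the frequencies defined in the Notation.

```latex
\begin{align*}
L_B^{-1}
  &= \frac{\Pr(B \mid D, \bar{P})}{\Pr(B \mid D, P)}
   = \frac{(1-\alpha)\Pr(B \mid \langle z,s\rangle)
           + \alpha \Pr(B \mid \langle z,z\rangle)}
          {\Pr(B \mid \langle z,s\rangle)} \\
  &= (1-\alpha) + \alpha L_h
   = 1 - \alpha\,(1 - L_h), \\[4pt]
\alpha
  &= \frac{q_z^2}{q_z^2 + 2 q_z p_s}
   = \frac{q_z}{q_z + 2 p_s}.
\end{align*}
```

Since $L_h \geq 0$, the first line also gives $L_B^{-1} \geq 1 - \alpha$, and $\alpha \to 1$ as $p_s \to 0$ gives $L_B^{-1} \to L_h^0$.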

Corollary A.2 $L_B = 1$ (the brother is uninformative as to paternity) if and only if $L_h = 1$ (the brother is uninformative as to silence).

Corollary A.3 $L_B \leq 1/(1 - \alpha) = 1 + q_z/(2p_s)$.

Corollary A.4 $L_B \to L_B^0 := (L_h^0)^{-1}$ as $p_s \to 0$.

Lemma A.5 Consider two full brothers B1 and B2, with respective ordered genotypes $[X, Y]$ and $[Z, W]$. Then

$$\Pr([X, Y] = [x, y] \mid [Z, W] = [z, w]) = \tfrac{1}{4}(\delta_{xz} + q_x)(\delta_{yw} + q_y),$$

where $\delta_{xz} := 1$ if $x = z$, 0 otherwise.

Proof. Let $I_P$ denote the event that B1 and B2 inherited the identical gene from their father. Then $\Pr(I_P) = \tfrac{1}{2}$, independently of the paternal allele, $Z$, of brother B2. Clearly $\Pr(X = x \mid Z = z, I_P) = \delta_{xz}$ and $\Pr(X = x \mid Z = z, \bar{I}_P) = q_x$, so that unconditionally $\Pr(X = x \mid Z = z) = \tfrac{1}{2}(\delta_{xz} + q_x)$. Similarly $\Pr(Y = y \mid W = w) = \tfrac{1}{2}(\delta_{yw} + q_y)$. The result follows since the maternal and paternal inheritance processes operate independently. □

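Corollary A.6

$$\Pr(\langle X, Y\rangle = \langle x, y\rangle \mid \langle Z, W\rangle = \langle z, w\rangle) = \begin{cases} \tfrac{1}{4}\{(\delta_{xz} + q_x)(\delta_{yw} + q_y) + (\delta_{xw} + q_x)(\delta_{yz} + q_y)\} & (x \neq y) \\ \tfrac{1}{4}(\delta_{xz} + q_x)(\delta_{xw} + q_x) & (x = y). \end{cases}$$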
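Lemma A.5 can be checked numerically by brute force. The Python sketch below enumerates the two alleles of a shared parent, lets each brother independently receive either allele with probability 1/2, and compares the resulting conditional probabilities with the formula. The function names and the three-state allele distribution (two real alleles plus the silent state 99) are hypothetical choices for illustration.

```python
from itertools import product


def sibling_allele_prob(x, z, freqs):
    """Pr(X = x | Z = z) for the alleles two full sibs inherit from the same
    parent, found by enumerating that parent's two alleles."""
    joint = 0.0  # accumulates Pr(X = x, Z = z)
    for (g1, p1), (g2, p2) in product(freqs.items(), repeat=2):
        parent = (g1, g2)
        pr_x = sum(0.5 for g in parent if g == x)  # Pr(X = x | parent)
        pr_z = sum(0.5 for g in parent if g == z)  # Pr(Z = z | parent)
        joint += p1 * p2 * pr_x * pr_z
    return joint / freqs[z]  # marginally Pr(Z = z) = q_z


def lemma_a5(x, y, z, w, freqs):
    """Closed form of Lemma A.5 for ordered genotypes [x, y] and [z, w]."""
    delta = lambda a, b: 1.0 if a == b else 0.0
    return 0.25 * (delta(x, z) + freqs[x]) * (delta(y, w) + freqs[y])


freqs = {12: 0.2, 14: 0.5, 99: 0.3}  # hypothetical; 99 plays the silent allele
max_gap = max(
    abs(sibling_allele_prob(x, z, freqs) * sibling_allele_prob(y, w, freqs)
        - lemma_a5(x, y, z, w, freqs))
    for x, y, z, w in product(freqs, repeat=4)
)
print(max_gap < 1e-12)  # True
```

The product of the paternal and maternal terms reflects the independence of the two inheritance processes used in the proof.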


Paternity ratio

We consider various cases, according to the relationship between the brother’s measured genotype {x, y} and those of the family triplet: mgt = {a, b}, pfgt = {z, z}, cgt = {a, a}.

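Case 1: $x \neq y$. This observation is equivalent to $\langle x, y\rangle$, with both $x$ and $y$ different from the silent allele $s$. Applying Corollary A.6 with $w = z$ for the numerator of (6) and with $w = s$ for the denominator, we obtain

$$L_h = \frac{2(\delta_{xz} + q_x)(\delta_{yz} + q_y)}{2q_xq_y + \delta_{xz}q_y + \delta_{yz}q_x}. \qquad (11)$$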

(a). If $x$ and $y$ are both different from $z$,⁵ (11) reduces to $2q_xq_y/(2q_xq_y) = 1$. So in this case no further information is obtained from the brother’s genotype.

(b). Otherwise, suppose $x = z$, $y \neq z$.⁶ We calculate

$$L_h = \frac{2(1 + q_z)q_y}{2q_zq_y + q_y} = 1 + \frac{1}{1 + 2q_z}.$$

Thus from (7)

$$L_B^{-1} = 1 + \frac{\alpha}{1 + 2q_z} = 1 + \frac{q_z}{(1 + 2q_z)(q_z + 2p_s)}.$$

In particular, $L_h \leq 2$, and $L_B \geq 1/(1 + \alpha) = 1 - q_z/(2q_z + 2p_s)$, with approximate equality so long as $q_z$ is small. This lower bound in turn exceeds 1/2, with approximate equality when $p_s$ is much smaller than $q_z$. The limit as $p_s \to 0$ is $L_B^0 = \tfrac{1}{2} \times (1 + 2p_z)/(1 + p_z)$.

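Case 2: $x = y$. Then $x \neq s$, and the observation is ambiguously either homozygous $\langle x, x\rangle$, or heterozygous $\langle x, s\rangle$. Hence $\Pr(B \mid \text{pfgt} = \langle z, z\rangle) = \Pr(\text{bgt} = \langle x, x\rangle \mid \text{pfgt} = \langle z, z\rangle) + \Pr(\text{bgt} = \langle x, s\rangle \mid \text{pfgt} = \langle z, z\rangle)$, and on applying Corollary A.6 we obtain

$$\Pr(B \mid \text{pfgt} = \langle z, z\rangle) = \tfrac{1}{4}(\delta_{xz} + q_x)^2 + \tfrac{1}{2}(\delta_{xz} + q_x)p_s = \tfrac{1}{4}(\delta_{xz} + q_x)(\delta_{xz} + q_x + 2p_s).$$

Similarly,

$$\Pr(B \mid \text{pfgt} = \langle z, s\rangle) = \tfrac{1}{4}\{(\delta_{xz} + q_x)(1 + q_x + p_s) + q_xp_s\}.$$

(a). If $x = z$,⁷ we obtain

$$\Pr(B \mid \text{pfgt} = \langle z, z\rangle) = \tfrac{1}{4}(1 + q_z)(1 + q_z + 2p_s),$$

$$\Pr(B \mid \text{pfgt} = \langle z, s\rangle) = \tfrac{1}{4}\{(1 + q_z)(1 + q_z + 2p_s) - p_s\}.$$

Since $p_s$ will be negligible in comparison with $(1 + q_z)(1 + q_z + 2p_s)$, which is at least 1, we see that in this case $L_h$ will be extremely close (though not exactly equal) to 1. Correspondingly so will be $L_B$, and the additional information in the brother’s genotype is virtually valueless.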

⁵ As for the case bgt = {16,20} in Example 10.1.
⁶ As for the case bgt = {12,14} in Example 10.1.
⁷ As for the case bgt = {14,14} in Example 10.1.


(b). Finally, for $x \neq z$ we similarly calculate

$$L_h = 1 - \frac{1}{1 + q_x + 2p_s}, \qquad L_B^{-1} = 1 - \frac{\alpha}{1 + q_x + 2p_s}.$$

It follows from Corollary A.4 that $L_B \to 1 + p_x^{-1}$ as $p_s \to 0$. When $p_x$ is small, the additional effect of observing the brother can thus be very substantial, even when the probability of a silent allele is extremely tiny.
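The case analysis above can be reproduced numerically by applying Corollary A.6 and Lemma A.1 directly, without the closed forms. The Python sketch below does this for the configurations of Example 10.1 (pfgt = {14,14}). It is an illustrative calculation, not the authors' network: the function names are hypothetical, and the frequencies are those quoted in the text ($p_{12}$ = 0.03%, $p_{14}$ = 0.1009, $p_{20}$ = 1.4%), with $p_{16}$ = 0.20 taken from the approximate value "≈ 20%".

```python
S = "s"  # label for the silent allele


def delta(a, b):
    return 1.0 if a == b else 0.0


def sib_unordered(x, y, z, w, q):
    """Corollary A.6: Pr(<X,Y> = <x,y> | <Z,W> = <z,w>) for full brothers."""
    if x == y:
        return 0.25 * (delta(x, z) + q[x]) * (delta(x, w) + q[x])
    return 0.25 * ((delta(x, z) + q[x]) * (delta(y, w) + q[y])
                   + (delta(x, w) + q[x]) * (delta(y, z) + q[y]))


def pr_measured(bgt, pf_true, q):
    """Pr(brother's measured genotype | PF's true unordered genotype)."""
    x, y = bgt
    z, w = pf_true
    if x != y:  # a visible heterozygote is unambiguous
        return sib_unordered(x, y, z, w, q)
    # an apparent homozygote {x,x} is truly <x,x> or <x,s>
    return sib_unordered(x, x, z, w, q) + sib_unordered(x, S, z, w, q)


def brother_factor(bgt, z, p, pr_silent):
    """L_B for an incompatible triplet with pfgt = {z,z}, via Lemma A.1."""
    q = {a: (1 - pr_silent) * pa for a, pa in p.items()}
    q[S] = pr_silent
    l_h = pr_measured(bgt, (z, z), q) / pr_measured(bgt, (z, S), q)
    alpha = q[z] / (q[z] + 2 * pr_silent)
    return 1 / (1 - alpha * (1 - l_h))


# Only the alleles needed here are listed, so these need not sum to 1.
p = {12: 0.0003, 14: 0.1009, 16: 0.20, 20: 0.014}
for bgt in [(16, 20), (12, 14), (14, 14), (16, 16), (12, 12)]:
    print(bgt, brother_factor(bgt, z=14, p=p, pr_silent=0.00003))
```

With these inputs the factor equals 1 for {16,20}, lies a little above 0.5 for {12,14}, is very close to 1 for {14,14}, is close to 6 for {16,16}, and for {12,12} approaches the limit 1 + 1/$p_{12}$ = 3334.33 only when pr(silent) is made far smaller than the values in Table 6.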

Acknowledgement

This research was supported by a Research Interchange Grant from the Leverhulme Trust. We are indebted to Steffen Lauritzen for extremely helpful suggestions.

References

Bangsø, O. and Wuillemin, P. H. (2000). Object Oriented Bayesian Networks: A framework for top-down specification of large Bayesian networks with repetitive structures. Technical report, Hewlett-Packard Laboratory for Normative Systems, Aalborg University.

Buckleton, J. S., Triggs, C. M., and Walsh, S. J. (ed.) (2004). Forensic DNA Evidence Interpretation. CRC Press.

Clayton, T. M., Hill, S. M., Denton, L. A., Watson, S. K., and Urquhart, A. J. (2004). Primer binding site mutations affecting the typing of STR loci contained within the AMPFlSTR® SGM Plus™ kit. Forensic Science International, 139, 255–9.

Cowell, R. G., Dawid, A. P., Lauritzen, S. L., and Spiegelhalter, D. J. (1999). Probabilistic Networks and Expert Systems. Springer, New York.

Cowell, R. G., Lauritzen, S. L., and Mortera, J. (2004). Identification and separation of DNA mixtures using peak area information using a probabilistic expert system. Research Report 25, Cass Business School, City University.

Dawid, A. P. (2003). An object-oriented Bayesian network for estimating mutation rates. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Jan 3–6 2003, Key West, Florida, (ed. C. M. Bishop and B. J. Frey). http://tinyurl.com/39bmh.

Dawid, A. P., Mortera, J., and Pascali, V. L. (2001). Non-fatherhood or mutation? A probabilistic approach to parental exclusion in paternity testing. Forensic Science International, 124, 55–61.

Dawid, A. P., Mortera, J., Pascali, V. L., and van Boxel, D. W. (2002). Probabilistic expert systems for forensic inference from genetic markers. Scandinavian Journal of Statistics, 29, 577–95.

Essen-Möller, E. (1938). Die Beweiskraft der Ähnlichkeit im Vaterschaftsnachweis. Theoretische Grundlagen. Mitteilungen der Anthropologischen Gesellschaft, 68, 9–53.

Evett, I. W. and Weir, B. S. (1998). Interpreting DNA Evidence. Sinauer, Sunderland, MA.

Gill, P., Whitaker, J., Flaxman, C., Brown, N., and Buckleton, J. (2000). An investigation of the rigor of interpretation rules for STRs derived from less than 100 pg of DNA. Forensic Science International, 112, 17–40.

Koller, D. and Pfeffer, A. (1997). Object-oriented Bayesian networks. In Proceedings of the 13th Annual Conference on Uncertainty in Artificial Intelligence, (ed. D. Geiger and P. Shenoy), pp. 302–13. Morgan Kaufmann Publishers, San Francisco.

Laskey, K. B. and Mahoney, S. M. (1997). Network fragments: Representing knowledge for constructing probabilistic models. In Proceedings of the 13th Annual Conference on Uncertainty in Artificial Intelligence, (ed. D. Geiger and P. Shenoy), pp. 334–41. Morgan Kaufmann Publishers, San Francisco.

Leopoldino, A. M. and Pena, S. D. J. (2002). The mutational spectrum of human autosomal tetranucleotide microsatellites. Human Mutation, 21, 71–9.

Morling, N., Allen, R. W., Carracedo, A., Geada, H., Guidet, F., Hallenberg, C., Martin, W., Mayr, W. R., Olaisen, B., Pascali, V. L., and Schneider, P. M. (2002). Paternity Testing Commission of the International Society of Forensic Genetics: Recommendations on genetic investigations in paternity cases. Forensic Science International, 129, 148–57.

Mortera, J. (2003). Analysis of DNA mixtures using Bayesian networks. In Highly Structured Stochastic Systems, (ed. P. J. Green, N. L. Hjort, and S. Richardson), chapter 1B, pp. 39–44. Oxford University Press.

Mortera, J., Dawid, A. P., and Lauritzen, S. L. (2003). Probabilistic expert systems for DNA mixture profiling. Theoretical Population Biology, 63, 191–205.

Vicard, P. and Dawid, A. P. (2004). A statistical treatment of biases affecting the estimation of mutation rates. Mutation Research, 547, 19–33.

Whitaker, J. P., Cotton, E. A., and Gill, P. (2001). A comparison of the characteristics of profiles produced with the AMPFlSTR® SGM Plus™ multiplex system for both standard and low copy number (LCN) STR DNA analysis. Forensic Science International, 123, 215–23.

allele      12      13      14      15      16      17      18      19      20      21      22
frequency   0.0003  0.0018  0.1009  0.1004  0.1949  0.2834  0.2162  0.0866  0.0137  0.0015  0.0003

Table 1: Population gene frequencies for marker VWA

                          pr(missed)
pr(silent)   0        0.000015   0.0001   0.001
0            0        0.0477     0.2503   0.7701
0.000015     0.2202   0.2557     0.4083   0.8136
0.0001       1.1555   1.1497     1.1238   1.0422
0.001        3.5297   3.5004     3.3462   2.4128

Table 2: Example 9.1: mgt = {12, 20}, pfgt = {18, 18}, cgt = {12, 12}. Combined effect of silent and missed alleles on likelihood ratio in favour of paternity.

                          pr(missed)
pr(silent)   0        0.000015   0.0001   0.001
0            0.0003   0.0477     0.2497   0.7694
0.000015     0.2195   0.2549     0.4071   0.8128
0.0001       1.1517   1.1461     1.1208   1.0413
0.001        3.5260   3.4968     3.3430   2.4114

Table 3: Example 9.1: mgt = {12, 20}, pfgt = {18, 18}, cgt = {12, 12}. Combined effect of silent and missed alleles together with mutation on likelihood ratio in favour of paternity.

                          pr(missed)
pr(silent)   0        0.000015   0.0001   0.001
0            0        0.0554     0.2875   0.8299
0.000015     26.02    24.98      18.08    3.80
0.0001       125.02   120.79     91.14    18.61
0.001        202.57   199.98     178.77   75.45

Table 4: Example 9.2: mgt = {12, 20}, pfgt = {13, 13}, cgt = {12, 12}. Combined effect of silent and missed alleles on likelihood ratio in favour of paternity.

                          pr(missed)
pr(silent)   0        0.000015   0.0001   0.001
0            3.79     3.64       2.99     1.48
0.000015     29.49    27.78      20.61    4.43
0.0001       127.30   120.96     92.97    19.19
0.001        203.01   199.14     179.19   75.73

Table 5: Example 9.2: mgt = {12, 20}, pfgt = {13, 13}, cgt = {12, 12}. Combined effect of silent and missed alleles together with mutation on likelihood ratio in favour of paternity.

                               LB with bgt =
pr(silent)   LD      {16,20}  {12,17}  {12,14}  {14,17}  {14,14}  {16,16}  {12,12}
0            0       1        1        0.546    0.546    1        6.13     3334
0.000015     0.472   1        1        0.546    0.546    1.0000   6.12     1595
0.0001       2.473   1        1        0.546    0.546    0.9999   6.07     403.7
0.001        7.485   1        1        0.551    0.551    0.9992   5.54     46.07
0.01         8.100   1        1        0.590    0.590    0.9932   3.19     5.45

Table 6: Example 10.1: mgt = {12, 15}, pfgt = {14, 14}, cgt = {12, 12}. Likelihood ratio in favour of paternity allowing for silent alleles: LD, without brother’s genotype. LB, additional effect of brother’s genotype.

                               LB with bgt =
pr(missed)   LD      {16,20}  {12,17}  {12,14}  {14,17}  {14,14}  {16,16}  {12,12}
0            0       1        5.94     5.94     0.9987   0.9973   1        10.88
0.000015     0.048   1.0000   5.94     5.94     0.9987   0.9973   1.0000   10.05
0.0001       0.251   1.0000   5.92     5.93     0.9987   0.9973   1.0000   8.04
0.001        0.771   0.9999   5.76     5.84     0.9987   0.9974   0.9999   6.14
0.01         0.973   0.9996   4.60     5.14     0.9988   0.9978   0.9997   4.90

Table 7: Example 10.1: mgt = {12, 15}, pfgt = {14, 14}, cgt = {12, 12}. Likelihood ratio in favour of paternity allowing for missed alleles: LD, without brother’s genotype. LB, additional effect of brother’s genotype.

                        LB with bgt =
pr(silent)   LD       {13,13}  {13,16}  {22,22}  {21,22}
0            555.55   1        1        1        1
0.000015     551.01   1.0000   1.0041   0.5118   1
0.0001       527.83   1.0000   1.0249   0.5158   1
0.001        409.70   1.0002   1.1144   0.6102   1
0.01         303.54   1.0007   1.0632   0.8703   1

Table 8: Example 10.2: mgt = {12, 15}, pfgt = {13, 13}, cgt = {12, 13}. Likelihood ratio in favour of paternity allowing for silent alleles: LD, without brother’s genotype. LB, additional effect of brother’s genotype.

                        LB with bgt =
pr(silent)   LD       {13,13}  {13,16}  {22,22}  {21,22}
0            555.55   1        1        1        1
0.000015     551.01   1.0082   0.9918   0.9927   0.9920
0.0001       527.83   1.0524   0.9501   0.9685   0.9569
0.001        409.55   1.3537   0.7385   0.9296   0.8890
0.01         300.96   1.6720   0.5980   0.9758   0.9631

Table 9: Example 10.2: mgt = {12, 15}, pfgt = {13, 13}, cgt = {12, 13}. Likelihood ratio in favour of paternity allowing for missed alleles: LD, without brother’s genotype. LB, additional effect of brother’s genotype.

                          pr(missed)
pr(silent)   0        0.000015   0.0001   0.001
0            0        0.0004     0.0002   0.0078
0.000015     0.3883   0.3823     0.3517   0.1956
0.0001       1.7563   1.7377     1.6394   1.0239
0.001        3.9567   3.9467     3.8901   3.3792
0.01         4.1576   4.1560     4.1467   4.0506

Table 10: Example 11.1: bogt = {16, 16}, wgt = {13, 14}, c1gt = {13, 13}, c2gt = {14, 14}. Effect of silent and missed alleles on likelihood ratio in favour of identification.

                          pr(missed)
pr(silent)   0         0.000015   0.0001    0.001
0            0         0.0845     0.5152    2.7005
0.000015     0.2175    0.2978     0.7071    2.7922
0.0001       1.3845    1.4429     1.7418    3.2977
0.001        9.3310    9.2850     9.0418    7.5166
0.01         20.6564   20.6138    20.3767   18.2226

Table 11: Example 11.2: bogt = {16, 16}, wgt = {13, 14}, c1gt = {13, 13}, c2gt = {14, 16}. Effect of silent and missed alleles on likelihood ratio in favour of identification.