Likelihood ratio, calibration and forensics

Evidence reporting in forensics requires the output of a speaker recognition system to be presented as a likelihood ratio, so as to adhere to modern fact-finding conventions in court [43]. An ideal system should be able to produce well-calibrated likelihood ratios. The likelihood ratio (LR) can be formulated as:

$$\mathrm{LR} = \frac{P(E \mid H_p, I)}{P(E \mid H_d, I)} \tag{2.1}$$

where $E$ is the speech trace from a crime scene, $H_p$ and $H_d$ are the prosecution and defense hypotheses, respectively, and $I$ represents other circumstances relevant to the case. In a forensic case, the fact finder, i.e., the judge or jury, can calculate the posterior odds from the LR and the prior odds derived from other evidence related to the case. This follows from Bayes' rule in odds form [26]:

$$\text{posterior odds} = \mathrm{LR} \times \text{prior odds} \tag{2.2}$$
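The odds-form update above can be illustrated with a minimal sketch. The numbers are hypothetical and chosen only to show the arithmetic; the function names are not taken from any toolkit.

```python
import math

def posterior_odds(lr, prior_odds):
    """Odds form of Bayes' rule: posterior odds = LR x prior odds."""
    return lr * prior_odds

def odds_to_probability(odds):
    """Convert odds in favour of the prosecution hypothesis to a probability."""
    return odds / (1.0 + odds)

# Hypothetical values, for illustration only.
lr = 100.0                 # LR reported by the expert witness
prior_odds = 1.0 / 1000.0  # prior odds held by the fact finder

post = posterior_odds(lr, prior_odds)
prob = odds_to_probability(post)

# In the log domain the update becomes a sum: log LR + log prior odds.
log10_post = math.log10(lr) + math.log10(prior_odds)

print(f"posterior odds = {post:.3f}, probability = {prob:.4f}")
```

Even a fairly large LR leaves the posterior odds below 1 when the prior odds are small, which is why the LR alone is not a decision.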

The Daubert ruling poses four questions about scientific evidence:

1. Is the evidence based on a testable theory or technique?
2. Has the theory or technique been peer reviewed?
3. In the case of a particular technique, does it have a known error rate and standards controlling its operation?
4. Is the underlying science generally accepted?

The Daubert ruling replaced the Frye ruling (1923), which did not cover all four points of the Daubert ruling. The likelihood ratio is believed to address all four points. Therefore, the forensic community is convinced that the LR is the proper way of reporting scientific evidence to court [119, 41, 61, 118, 149, 62]. According to [173], the likelihood ratio is well suited to presenting evidence in court because it uses the Bayesian interpretation of probability, and can therefore be used:

1. to assist scientists in assessing the value of scientific evidence,
2. to help jurists interpret judicial facts, and
3. to clarify the respective roles of scientists and of members of the court.

Here, the likelihood ratio is viewed as a measure of the value of the evidence. In Equation (2.2), the LR is the part produced by the expert witness. The court then combines the LR with prior odds, e.g., information from other evidence. It is also the court, not the expert, that makes the decision (posterior odds).

The likelihood ratios (LRs), often referred to as LR scores, should have a proper probabilistic meaning, and there must be a way to check that this meaning is correct. This is done through calibration, using a collection of LRs. Calibration in speaker recognition, i.e., likelihood ratio calibration, is a process in which the raw scores $s$ from a recognizer are transformed into calibrated log-likelihood ratios (LLRs). Calibration techniques in speaker recognition include linear calibration [17] and line-up calibration [180]. Linear calibration is considered the most common type, with the transformation formulated as:

$$\mathrm{LLR} = w_0 + w_1 s \tag{2.3}$$

where $w_0$ and $w_1$ are calibration parameters optimized through linear logistic regression on a set of training material [17, 179, 15]. Here, $w_0$ is known as the offset parameter and $w_1$ as the scaling parameter, since it multiplies the raw score $s$ directly. For proper calibration results, it is important to use disjoint databases for training the calibration parameters and for evaluation, such that the training and evaluation databases share no speech material and have disjoint speaker sets.
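As a rough illustration of how the offset and scaling parameters might be estimated, the sketch below fits a two-parameter logistic regression with Newton-Raphson steps in NumPy. It is not the Bosaris or FoCal implementation. The class weighting (equal total weight per class, i.e., an effective prior of 0.5, so that the fitted log-odds can be read as an LLR), the function names and the synthetic Gaussian scores are all assumptions made for this example.

```python
import numpy as np

def fit_linear_calibration(target_scores, nontarget_scores, n_iter=25):
    """Estimate (w0, w1) so that w0 + w1 * s approximates the natural-log LLR."""
    tar = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    s = np.concatenate([tar, non])
    y = np.concatenate([np.ones(tar.size), np.zeros(non.size)])
    # Assumption: weight both classes equally (effective prior 0.5), so the
    # fitted log-odds carry no prior term and can be read as an LLR.
    wts = np.where(y == 1, 0.5 / tar.size, 0.5 / non.size)
    X = np.column_stack([np.ones_like(s), s])  # columns: offset, raw score
    w = np.zeros(2)
    for _ in range(n_iter):  # Newton-Raphson on the weighted logistic loss
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (wts * (p - y))
        hess = (X * (wts * p * (1.0 - p))[:, None]).T @ X
        w -= np.linalg.solve(hess, grad)
    return w[0], w[1]

def calibrate(scores, w0, w1):
    """Apply the linear transformation LLR = w0 + w1 * s."""
    return w0 + w1 * np.asarray(scores, dtype=float)

# Synthetic training scores (hypothetical): same-speaker trials ~ N(+1, 1),
# different-speaker trials ~ N(-1, 1). The ideal natural-log LLR is then 2 * s.
rng = np.random.default_rng(0)
train_tar = rng.normal(1.0, 1.0, 2000)
train_non = rng.normal(-1.0, 1.0, 2000)
w0, w1 = fit_linear_calibration(train_tar, train_non)

# The fitted parameters are then applied to scores from a disjoint set.
eval_llrs = calibrate([-0.5, 0.0, 1.2], w0, w1)
print(w0, w1, eval_llrs)
```

In practice the parameters would be trained on one database and applied to scores from another, with no shared speakers or speech material.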

The concept of calibration was first proposed in the context of weather forecasting [34]. In the speaker recognition field, calibration is specifically highlighted through the concept of a proper scoring rule [17, 22]. An automatic speaker recognition system must produce reliable likelihood ratios in order to be used for evaluating and presenting evidence to court. Scores from such a system should therefore be calibrated in order to produce more reliable and less misleading likelihood ratios [148]. This paper illustrates the necessity of performing likelihood ratio calibration when using an automatic speaker recognition system in forensics.

Calibration can be carried out in either a parametric or a non-parametric way. One non-parametric method uses the Pool Adjacent Violators (PAV) algorithm, which finds the optimal non-linear calibration for the training data itself [196], but it is more often used for evaluating the quality of calibration. Logistic regression is a parametric calibration method [139]. It can be used not only to calibrate a single system (transforming raw scores to LLRs), but also to fuse multiple systems. In this thesis we have used two calibration toolkits: Bosaris [16] and its predecessor FoCal [13]. Both use logistic regression and the PAV algorithm for calibration and for evaluating the quality of calibration. They also provide functions normally used to evaluate speaker recognition systems, such as plotting the Detection Error Trade-off (DET) curve or computing performance measures.
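To make the non-parametric route more concrete, the following is a minimal pure-Python sketch of the PAV idea. It is not the Bosaris or FoCal code. It assumes binary labels (1 for same-speaker trials, 0 for different-speaker trials), ignores tied scores, and clips the fitted posteriors with a hypothetical `eps` so that the resulting LLRs stay finite.

```python
import math

def pav(values):
    """Non-decreasing least-squares fit to `values` (pool adjacent violators)."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Pool the last two blocks while their means violate monotonicity.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

def pav_llrs(scores, labels, eps=1e-6):
    """Return scores in ascending order with their PAV-calibrated LLRs."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # PAV on the sorted labels gives a monotonic posterior estimate.
    posteriors = pav([labels[i] for i in order])
    n_tar = sum(labels)
    # The posteriors reflect the training proportions; subtracting the
    # prior log-odds of the training set leaves a log-likelihood ratio.
    prior_log_odds = math.log(n_tar / (len(labels) - n_tar))
    llrs = []
    for p in posteriors:
        p = min(max(p, eps), 1.0 - eps)  # assumption: clip to keep LLRs finite
        llrs.append(math.log(p / (1.0 - p)) - prior_log_odds)
    return [scores[i] for i in order], llrs

# Hypothetical toy trials.
sorted_scores, llrs = pav_llrs([0.1, 0.4, 0.35, 0.8, 0.9], [0, 1, 0, 1, 1])
print(sorted_scores, llrs)
```

Because the mapping is fitted to the training data itself, it is mainly useful as a reference against which a parametric calibration can be compared.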

In this thesis, calibration is carried out using a linear approach. Additionally, an approach using quality measure functions (QMFs) is proposed as an extension to linear calibration [105, 107]. Another type of calibration, called shared scaling or categorical calibration, is also introduced in this thesis [105, 102]. In shared scaling calibration, trials under different conditions get their own offset parameter, while the scaling parameter is shared between all conditions. The main goal of all the calibration methods above is to produce calibrated LRs (or LLRs).
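One way to read the shared scaling description is as a linear calibration whose offset depends on the trial condition while the scale does not. The sketch below only illustrates that reading: the condition names and parameter values are invented, and the thesis may parameterize the method differently.

```python
def shared_scaling_llr(score, condition, offsets, scale):
    """LLR = offset for this condition + shared scale * raw score."""
    return offsets[condition] + scale * score

# Hypothetical parameters: one offset per condition, a single shared scale.
offsets = {"short_duration": -0.8, "long_duration": 0.3}
scale = 1.5

llr_short = shared_scaling_llr(2.0, "short_duration", offsets, scale)
llr_long = shared_scaling_llr(2.0, "long_duration", offsets, scale)
print(llr_short, llr_long)
```

The same raw score thus maps to different LLRs under different conditions, but the difference between any two conditions is a constant shift.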

In document Speaker Recognition System in Forensic Conditions: The Calibration and Evaluation of the Likelihood Ratio (Page 32-34)