A single sworn written statement on heavy cream paper lying on the worn honey-toned oak ledge of an old witness box, a brass-buttoned panel rising behind it, lit by a soft warm pool of low lamplight that falls off into the shadowed courtroom beyond.

Reading · Foundations

How Conclusions Go Wrong: The Shared Backbone

A forensic conclusion is a sentence spoken to twelve people who cannot check it. The bench work can be flawless and that sentence can still be false, and across the record of proven wrongful convictions the conclusion failed far more often than the lab did. This reading is the backbone the other CrossCoach readings link back to: the logic of a sound conclusion, the three ways it breaks, why even careful experts get it wrong, the safeguards and the standards, and the questions a cross-examiner builds from all of it.

16 min read · Based on the forensic conclusion and validity literature

The lab work was fine. The sentence to the jury was not.

Gary Dotson went to prison after an analyst told the jury that the semen from the crime scene came from 11% of the population, a group that happened to include Dotson. The arithmetic the analyst did at the bench was not the problem. The problem was that the victim's own fluids could have masked the donor's blood-group markers, so 100% of the male population could have left that stain. The result reported to the jury described a wide door as a narrow one, and a man went to prison through it.

Brandon Garrett and Peter Neufeld documented this pattern in the Virginia Law Review in 2009, in a study called "Invalid Forensic Science Testimony and Wrongful Convictions." They went looking for the trial transcripts of every DNA exoneree who had faced forensic testimony at trial, 156 people in all, and they located and read 137 of those transcripts. In 82 of those cases, 60%, a forensic analyst called by the prosecution gave invalid testimony. Invalid, in their definition, means a conclusion not supported by empirical data: not a typo, not a clerical slip, but a statement of meaning that the science could not bear.

Take that number slowly. In six of every ten cases where an innocent person was convicted and later cleared by DNA, the expert had overstated what the evidence proved. And this was not a handful of rogue witnesses. The 82 cases involved 72 different analysts, from 52 laboratories and medical practices, across 25 states. The failure was structural, in the way conclusions were phrased, not personal.

Garrett and Neufeld found the overstatement took two main shapes. The first was misusing population statistics, as in Dotson. The second was claiming probative value that no data supported at all. In Timothy Durham's case, an analyst testified that the reddish-yellow hue of his hair matched the crime scene hair and that this combination appears in "about 5 percent of the population." There is no database of hair-colour frequencies. The 5% was conjured. It sounded like a measurement and it was a guess wearing the costume of one.

This is the spine of everything that follows in CrossCoach. Every discipline reading (fingerprints, firearms, DNA mixtures, bite marks, digital) comes back to this single hinge: the analysis can be sound and the sentence spoken to the jury can still be false. As John Morgan has argued more recently, sound bench work and an overstated conclusion can live in the same report, because the place a conclusion fails is usually the gap between what was measured and what was claimed. The Supreme Court warned in Daubert that expert evidence "can be both powerful and quite misleading because of the difficulty in evaluating it." A juror cannot audit a likelihood ratio. A juror can only weigh the sentence the witness chose.

Garrett and Neufeld were careful: they made no claim that the invalid testimony caused these convictions, since other evidence was usually present and we cannot read jurors' minds. The point survives the caution. The expressed conclusion is a distinct object from the underlying work, it can be tested separately, and across the exoneration record it failed far more often than the bench did. That is the seam this reading teaches you to find, whether you are the witness keeping your testimony inside the data or the lawyer pressing to see where it left the data behind.

“In 82 cases, or 60%, forensic analysts called by the prosecution provided invalid testimony.”

— Garrett & Neufeld, Virginia Law Review 2009

A tall narrow panelled oak courtroom door standing ajar in a vast dark wood-panelled wall, a flood of warm lamplight pouring through the slim opening that is plainly far broader than the narrow doorway can account for, spilling wide across the floor. — Fig. 1 · A wide door reported as a narrow one. The arithmetic was right; the population the stain could have come from was everyone.

Challenge 01 · Put it to the test

One inch past the notes

Counsel lays your bench notes beside the report you signed and points to the page.

The question

"Your bench notes are accurate. Now point to the line in your report where the conclusion you spoke to the jury goes one inch beyond what those notes actually measured."


Three ways a true measurement becomes a false claim

A sound evaluative conclusion says one thing and refuses to say another. It says how much more probable the observed features are if the prosecution is right than if the defence is right. That ratio is the likelihood ratio, and it is the witness's whole job. What the conclusion must not do is announce how probable it is that the defendant is the source, or that he is guilty. Those probabilities belong to the court, because they depend on everything else in the case: the alibi, the eyewitness, the opportunity. The witness supplies the strength of one piece of evidence. The fact-finder combines it with the rest. Three classic failures all come from the witness, or the lawyer, crossing that line.
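The division of labour described above can be written as the odds form of Bayes' theorem. The symbols are introduced here for illustration and are not the reading's own notation: E is the observed evidence, H_p the prosecution proposition, and H_d the defence proposition.

```latex
\underbrace{\frac{P(H_p \mid E)}{P(H_d \mid E)}}_{\text{posterior odds: the court}}
\;=\;
\underbrace{\frac{P(E \mid H_p)}{P(E \mid H_d)}}_{\text{likelihood ratio: the witness}}
\;\times\;
\underbrace{\frac{P(H_p)}{P(H_d)}}_{\text{prior odds: the court}}
```

The witness supplies only the middle term. The two outer terms depend on the rest of the case, which is why a statement about the left-hand side is outside the witness's remit.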

The first is the prosecutor's fallacy, also called the transposed conditional. Leung, writing in 2002, defines it as treating the probability of event A given B as if it were the probability of B given A. An expert testifies that if the defendant were innocent, the chance of a DNA match is, say, 1 in a million. The prosecutor then invites the jury to hear this as a 1 in a million chance the defendant is innocent. They are not the same number, and the gap can be enormous. Work the arithmetic: if the man was identified only because he is one of 10,000 local people who could have done it, each equally likely beforehand, that 1 in a million match probability still leaves roughly a 1 in 100 chance he is innocent, about ten thousand times the figure the jury heard. In R v Deen the trial judge let the figure stand as something that "approximates pretty well to certainty," and the Court of Appeal quashed the conviction. The probability of the evidence is not the probability of innocence.
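A minimal sketch of that arithmetic, under assumptions that are not in the reading: exactly one true source in the pool, a true source who always matches, every other person matching independently with the stated probability, and a uniform prior over the pool. The function name `posterior_innocence` is hypothetical. On those assumptions a pool of 10,000 gives a chance of innocence near 1 in 100, and it takes a pool nearer 100,000 to reach about 1 in 10. Either way the answer is orders of magnitude away from 1 in a million.

```python
def posterior_innocence(match_prob_if_innocent: float, pool_size: int) -> float:
    """P(innocent | match) under a uniform prior over the pool.

    Illustrative assumptions: one true source in the pool, the true source
    always matches, and each other person matches independently with
    probability `match_prob_if_innocent`.
    """
    prior_odds_guilt = 1 / (pool_size - 1)          # one source among the rest
    likelihood_ratio = 1 / match_prob_if_innocent   # P(match | source) = 1
    posterior_odds_guilt = prior_odds_guilt * likelihood_ratio
    return 1 / (1 + posterior_odds_guilt)


for pool in (10_000, 100_000):
    p = posterior_innocence(1e-6, pool)
    print(f"pool of {pool:>7,}: P(innocent | match) = {p:.4f}")
```

The match probability never changes between the two runs. Only the prior does, which is the part of the calculation that belongs to the court.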

The second failure is individualisation, the claim that a mark can be traced to one unique source "to the exclusion of all others in the world." Saks and Koehler in 2005 call this the assumption of discernible uniqueness, and they are blunt that it has no theoretical or empirical foundation. It is what lets an examiner skip building databases, measuring frequencies, and reporting a probability, and instead declare a flat match. The cost shows in the error rates they cite: handwriting comparisons averaging around 40 percent, false-positive bite-mark rates as high as 64 percent, and the FBI's "100 percent match" of a Madrid fingerprint to Brandon Mayfield, an Oregon lawyer who had never been to Spain. Uniqueness may be true. It has not been measured, so it cannot be sworn to as a finding.

The third is target-shifting, what Thompson in 2009 calls the Texas sharpshooter move: the rifleman fires into the barn, then paints the targets around the holes. Thompson showed DNA analysts the same ambiguous mixed profile while swapping the suspect without telling them. With Tom as the suspect they dismissed a peak as an artefact; with Dick they declared the same peak a true allele. Each time they "included" the man in front of them and adjusted the match criteria to fit. In one case an analyst computed a random-match probability of 1 in 1.1 billion when Thompson's own calculation put it near 1 in 2. Deciding what counts as a match after seeing the suspect's sample inflates the apparent rarity of the evidence by orders of magnitude.

“In my opinion, the actual random match probability is close to 1 in 2; hence, the number the prosecutor gave the jury may have understated the true value by approximately nine orders of magnitude.”

— Thompson 2009, p. 272

A weathered honey-toned oak plank pocked with three scattered bullet holes, around exactly one of which a fresh pale bone-white target ring has been chalked so that hole sits dead centre, the other two holes left bare and untargeted. — Fig. 2 · The Texas sharpshooter: fire first, paint the target around the hole. Decide what counts as a match after you have seen the suspect.

Challenge 02 · Put it to the test

Before or after you saw the sample?

Counsel holds up your one-in-a-billion figure and asks about the order you did things in.

The question

"You testified the chance of a coincidental match is one in a billion. Did you write down what would count as a match before you ever saw my client's sample, or only after?"



Why careful experts reach wrong conclusions

Itiel Dror and Greg Hampikian took a single DNA mixture from a real Georgia gang-rape case and handed it to 17 qualified analysts working casework in an accredited North American lab. The original examiners, who had seen the case context, had concluded that one suspect "cannot be excluded." Of the 17 working the same electropherograms without that context, exactly one agreed. Four called it inconclusive. The other 12 excluded the suspect outright. The data were identical, the discipline was the so-called gold standard of forensic science, and the conclusion still depended on who was reading the profile and what they had been told. Paul Gill put it bluntly, quoted in their 2011 paper: "If you show 10 colleagues a mixture, you will probably end up with 10 different answers."

That result is not a story about dishonest analysts or weak ones. It is the core of what Dror laid out in his 2020 taxonomy, "Six Fallacies and the Eight Sources of Bias." His first fallacy is that bias is an ethical issue, a matter of corrupt people. It is not. Cognitive bias, he writes, "is not a matter of dishonesty, intentional discrimination, or of a deliberate act." His third fallacy is expert immunity, the belief that training inoculates you. The opposite is closer to the truth. Expertise builds schemas, expectations, and shortcuts that usually serve the examiner well and that, in a hard call, steer the answer without announcing themselves. Bias arises from eight sources Dror organises in three tiers: the specific case (the data itself, the reference materials, the contextual information you were told), the specific person (their base rates, their lab, their training, their personality), and human nature, the cognitive architecture every one of us shares. The suspect's known profile driving the read of the crime-scene sample is one of those sources, and it is exactly what put Kerry Robinson in prison for 17 years on a DNA error before he was exonerated.

The most damaging fallacy for a witness is the fifth, the bias blind spot. Jeff Kukucka, Saul Kassin, Patricia Zapf, and Dror surveyed 403 forensic examiners across 21 countries in 2017. Seventy-one percent agreed bias is a concern for forensic science as a whole. Only 52 percent thought it touched their own domain. Just 26 percent believed their own judgements were affected. Thirty-seven percent claimed their personal accuracy rate was 100 percent. And 71 percent believed an examiner who simply tries to set aside expectations is less likely to be influenced by them, a belief Dror names as his sixth fallacy, the illusion of control. Willpower does not cancel bias. Trying to suppress a thought tends to make it louder.

This is why "I am objective" and "I have done this for twenty years" are worthless as defences on the stand. They are not arguments against bias. They are textbook expressions of the blind spot. The examiner who claims immunity has just told the jury they hold the exact belief the research predicts in someone who has not protected against bias. The credible answer is the opposite: I know I am susceptible, here are the blind procedures and the case-management controls that kept the irrelevant context away from my analysis.

“If you show 10 colleagues a mixture, you will probably end up with 10 different answers.”

— Paul Gill, quoted in Dror & Hampikian 2011

A single pale ivory sphere resting on aged oak, lit by warm lamplight from one side so one clean half glows in full light while the other half falls into deep shadow, half of the very object hidden from view. — Fig. 3 · The half you cannot see is part of the call you are making. Bias is not dishonesty; it is the side of the judgment hidden from the examiner.

Challenge 03 · Put it to the test

The blind spot, in numbers

You have just told the court your experience and objectivity rule out bias. Counsel produces the survey of 403 examiners.

The question

"You testified that twenty years of experience and your professional objectivity mean bias did not affect your conclusion. The research on 403 examiners found that the people most confident they were unbiased were exhibiting the documented bias blind spot. What specific blind procedure, not your willpower, kept the case context out of your analysis?"


What a lab must do, not just believe

Dror reported a striking gap in 2020, drawing on the same survey of examiners: 71% of forensic scientists acknowledged that cognitive bias is a concern for forensic science as a whole, but only 52% thought it touched their own discipline, and just 26% thought it was relevant to them personally. That is the bias blind spot in numbers. It is easy to see bias in others and nearly impossible to see it in yourself. So when a witness on the stand says "I kept an open mind" or "I am aware of bias, so I controlled for it," they are describing the exact mental move that Dror calls the illusion of control, the sixth fallacy. Worse, the science says willpower can backfire. Trying to suppress a biasing thought through effort produces what Dror, citing Wegner, calls "ironic processing" or "ironic rebound," the same mechanism by which telling a juror to disregard evidence makes them notice it more. Good intentions are not a safeguard. They are sometimes the problem.

If bias operates automatically and outside awareness, the only real defences are procedural: things a lab builds into its workflow and writes down, not attitudes an examiner carries in their head. The flagship procedure is Linear Sequential Unmasking (LSU). In their 2015 letter in the Journal of Forensic Sciences, Dror, Thompson, Meissner, Kornfield, Krane, Saks, and Risinger laid out a "context management toolbox" organised around five levels of potentially biasing information: the trace evidence itself, the reference samples, the case information, the examiner's base-rate expectations, and organisational culture. LSU controls which of these reaches the examiner and in what order. The core rule is linear: the examiner must examine and document the trace evidence from the crime scene before being exposed to the known reference sample. You work from the evidence to the suspect, not from the suspect to the evidence. That sequence blocks two documented kinds of circular reasoning. One let two analysts work backward from Kerry Robinson's DNA profile to the crime-scene mixture, a case Dror cites in which Robinson served 17 years before exoneration. The other drove the FBI's misattribution of a latent print from the Madrid bombing investigation to Brandon Mayfield, where a signal in the evidence was dismissed as noise because it did not match the target.

LSU does not forbid examiners from ever revisiting their work. It permits documented changes after exposure to the reference, but it imposes balanced restrictions, for example treating a revision of a high-confidence initial judgement as a red flag that may warrant blind review by a second examiner. As the 2015 authors put it, the requirement to document changes "does not eliminate the possibility that such changes arose from bias, it only makes that possibility more transparent."
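The ordering rule and the documented-revision rule can be sketched as a small record-keeping class. This is an illustration only: `CaseFile`, its method names, and the event labels are hypothetical and do not come from the 2015 letter or from any laboratory system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class CaseFile:
    """Hypothetical case record that enforces evidence-before-reference order."""
    trace_notes: Optional[str] = None
    reference_unmasked: bool = False
    log: List[Tuple[str, str, str]] = field(default_factory=list)

    def _record(self, event: str, detail: str) -> None:
        stamp = datetime.now(timezone.utc).isoformat()
        self.log.append((stamp, event, detail))

    def document_trace(self, notes: str) -> None:
        # Changes after exposure are allowed, but they are logged as revisions
        # so a reviewer can see exactly what moved once the reference was known.
        event = "REVISION_AFTER_REFERENCE" if self.reference_unmasked else "TRACE_DOCUMENTED"
        self._record(event, notes)
        self.trace_notes = notes

    def unmask_reference(self) -> None:
        if self.trace_notes is None:
            raise PermissionError("Document the trace evidence before seeing the reference.")
        self.reference_unmasked = True
        self._record("REFERENCE_UNMASKED", "")


case = CaseFile()
try:
    case.unmask_reference()              # refused: nothing documented yet
except PermissionError as err:
    print("blocked:", err)
case.document_trace("Peak at locus X read as a true allele")
case.unmask_reference()
case.document_trace("Peak at locus X reclassified as artefact")
print([event for _, event, _ in case.log])
```

The log does not stop a biased revision. It only makes the revision visible and timestamped, which is the limit the 2015 authors themselves acknowledge.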

The second pillar is independent verification done blind. In Dror's 2020 list of countermeasures, item (D) is "using blind, double blind, and proper verifications when possible." A verification means nothing if the second examiner already knows the first examiner's answer, because the conclusion itself becomes biasing context. True verification means a second examiner re-works the comparison without knowing what the first one found.

For court, this is the whole point. Ask not what the examiner believed, but what the lab did. Did they document the trace analysis before seeing the reference? Was the verifier blind? "I kept an open mind" is not a procedure. It cannot be audited, it cannot be reproduced, and the science says it does not work.

“The requirement that changes be documented does not eliminate the possibility that such changes arose from bias, it only makes that possibility more transparent.”

— Dror, Thompson, Meissner, Kornfield, Krane, Saks & Risinger 2015, J Forensic Sci 60(4):1111-1112

Two folded cream documents tied with brass-coloured cord on aged oak in a strict left-to-right order; the left one is unfolded and open, read first, the right one still tied shut and untouched, held back until the first is done. — Fig. 4 · Linear sequential unmasking: open the evidence first, document it, and only then meet the reference. Sequence is a procedure, not a mindset.

Challenge 04 · Put it to the test

Show me the file, not the mindset

Counsel asks to see the case file rather than hear about your state of mind.

The question

"You testified that you 'kept an open mind' and were 'aware of the risk of bias.' Show the jury where in your case file you documented your analysis of the crime-scene evidence before you ever saw the suspect's reference sample, and tell us whether the examiner who verified your conclusion knew what conclusion you had reached."


A number can discriminate well and still be wrong

Jan Hannig and Hari Iyer, writing in the Journal of the Royal Statistical Society in 2021, ran a likelihood-ratio system for comparing car paint through their own test. The system was, by one measure, superb. Its area under the ROC curve was 0.982, which means it sorted same-source pairs from different-source pairs almost perfectly. Then they checked a second thing, and found that when this system reported a likelihood ratio somewhere in the range of 10,000, the validation data could only support a number between a thousand and a hundred thousand times smaller. The direction of the evidence was right. The magnitude was off by orders of magnitude. A confident, well-sorting system was reporting a number that was, in their words, overstated by as much as a factor of 100,000.

That gap is the whole reason "validated" cannot mean "the examiner is experienced." Validity is something you measure, on data where the truth is already known, before the method touches a case. The model designs are familiar: Ulery and colleagues at the FBI ran latent print examiners as a black box in 2011 and counted how often they were wrong, and Phillips and colleagues did the same for face comparison in 2018. You feed the system pairs that you know to be same-source or different-source, you collect what it says, and you score it.

For a likelihood-ratio method, Didier Meuwly, Daniel Ramos and Rudolf Haraksim laid out how to do that in their 2017 guideline in Forensic Science International. Their central move is to split accuracy in two. They write that accuracy equals discriminating power plus calibration. Discriminating power is the ability to tell same-source from different-source comparisons apart, and it is measured by things like the equal error rate or the minimum log-likelihood-ratio cost. Calibration is whether the size of the reported number is actually warranted by the data. As they put it, perfect calibration means "the LR is exactly as big or small as is warranted by the data." A method can score well on the first and badly on the second, and the single cost measure Cllr cleanly decomposes into a discrimination part and a calibration part precisely so you can see which one failed.
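A minimal sketch of the difference, using synthetic scores and not any data from the studies cited. It assumes same-source scores drawn from a normal distribution with mean +1.5 and different-source scores from one with mean -1.5, both with unit variance, so the calibrated log likelihood ratio is 3 times the score. An "inflated" system then triples every log LR. The function names are hypothetical; the Cllr formula is the standard log-likelihood-ratio cost. The sketch does not compute the discrimination and calibration components separately.

```python
import math
import random


def log2_1p_exp(x: float) -> float:
    """Numerically stable log2(1 + e**x)."""
    return (max(x, 0.0) + math.log1p(math.exp(-abs(x)))) / math.log(2)


def cllr(ss_log_lrs, ds_log_lrs) -> float:
    """Log-likelihood-ratio cost, taking natural-log LRs as input."""
    ss_cost = sum(log2_1p_exp(-x) for x in ss_log_lrs) / len(ss_log_lrs)
    ds_cost = sum(log2_1p_exp(x) for x in ds_log_lrs) / len(ds_log_lrs)
    return 0.5 * (ss_cost + ds_cost)


def auc(ss_log_lrs, ds_log_lrs) -> float:
    """Share of (same-source, different-source) pairs ranked correctly."""
    wins = sum((s > d) + 0.5 * (s == d) for s in ss_log_lrs for d in ds_log_lrs)
    return wins / (len(ss_log_lrs) * len(ds_log_lrs))


rng = random.Random(0)
MU = 1.5  # assumed half-separation between the two score distributions

# Calibrated log LR for these two normals is 2 * MU * score.
ss = [2 * MU * rng.gauss(MU, 1.0) for _ in range(500)]
ds = [2 * MU * rng.gauss(-MU, 1.0) for _ in range(500)]

# Same ranking, exaggerated magnitudes.
inflated_ss = [3 * x for x in ss]
inflated_ds = [3 * x for x in ds]

print("AUC  calibrated:", round(auc(ss, ds), 4), " inflated:", round(auc(inflated_ss, inflated_ds), 4))
print("Cllr calibrated:", round(cllr(ss, ds), 4), " inflated:", round(cllr(inflated_ss, inflated_ds), 4))
```

Tripling the log LRs leaves the ranking, and so the AUC, untouched, while the cost rises because the occasional misleading value is now stated far more confidently. A system that always reports a likelihood ratio of 1 scores a Cllr of exactly 1, which is a useful reference point when reading a reported value.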

This is the snare a cross-examiner should learn to set. The witness will rely on how well the system separates same-source from different-source comparisons, because that number really is impressive. Hannig and Iyer found the same pattern in the fingerprint likelihood-ratio system of Neumann and colleagues: excellent discrimination, and yet it overstated the strength of evidence across the entire range of reported likelihood ratios from 1/100 up to 10,000. The lawyer's job is to ask not whether the system points the right way but whether the magnitude it reports has been checked against ground truth, and over what range of values that check holds. Hannig and Iyer are blunt that the diagnostic many labs rely on, the average of different-source likelihood ratios sitting near one, is a necessary but not sufficient condition, so passing it tells you almost nothing.
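The "average near one" diagnostic follows from a one-line identity. The notation is introduced here for illustration: f(e | H_p) and f(e | H_d) are the densities of the evidence under the two propositions, and the identity assumes the first is zero wherever the second is zero.

```latex
\mathbb{E}\!\left[\mathrm{LR} \mid H_d\right]
= \int \frac{f(e \mid H_p)}{f(e \mid H_d)}\, f(e \mid H_d)\, \mathrm{d}e
= \int f(e \mid H_p)\, \mathrm{d}e
= 1
```

Every well-calibrated system satisfies this, which makes it necessary. It is a single condition on a single average, however, so reported values can be too large in one part of the range and too small in another and still average to one. That is why it cannot stand in for a range-by-range calibration check.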

The line for the witness box: a system that sorts the world correctly can still hand the jury a number that is numerically indefensible, and "I have seen thousands of these" is an assertion of discrimination, not a measurement of calibration.

“Although this fingerprint LR system has excellent discrimination power, it would be desirable to reduce its calibration discrepancy.”

— Hannig & Iyer 2021, on the Neumann et al. fingerprint likelihood-ratio system

On aged oak, small identical ivory tokens sorted cleanly into two perfect groups on either side of a carved dividing groove, while an old brass measuring rule laid across one group is grossly the wrong length for the tokens beneath it, far too long. — Fig. 5 · Perfect sorting, wrong size. Discrimination tells you which side; calibration tells you whether the number on the rule is true.

Challenge 05 · Put it to the test

Has anyone checked the magnitude?

Counsel accepts that your system sorts well, then asks the second, harder question.

The question

"You told the jury your system separates same-source from different-source pairs almost perfectly. Will you now tell them whether anyone has ever checked that the number you reported, the actual magnitude, matches the truth, and over what range of values that check holds?"


Who sets the bar, and the questions it generates

In 2009 the National Academy of Sciences published Strengthening Forensic Science in the United States and reached a verdict that still rings in courtrooms: with the exception of nuclear DNA, no forensic discipline had been rigorously shown to consistently and reliably connect evidence to a specific individual or source. Fingerprints, toolmarks, bitemarks, hair, handwriting: methods used for a century, never validated the way a clinical drug is validated. The report asked Congress for research and for an independent body to oversee it. The body never came. The challenge it laid down did.

Seven years later the President's Council of Advisors on Science and Technology gave the challenge teeth. PCAST 2016 split reliability into two questions a court can actually ask. Foundational validity: has the method itself been shown to work, with error rates measured in properly designed black-box studies where examiners judge samples whose true answer is known? Validity as applied: did this examiner, in this case, apply the method the way the validation studies assumed? A discipline can pass the first and fail the second. An examiner who is sloppy on a Tuesday is unreliable even if the method is sound.
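A measured error rate is an estimate with uncertainty, and a common way to report it is alongside an upper confidence bound. The sketch below computes a one-sided exact (Clopper-Pearson) upper bound by bisection using only the standard library. The counts in the usage lines are hypothetical and are not taken from any study named in this reading.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def upper_bound(errors: int, trials: int, confidence: float = 0.95) -> float:
    """One-sided exact upper bound on the true error rate."""
    if errors == trials:
        return 1.0
    alpha = 1 - confidence
    lo, hi = errors / trials, 1.0
    for _ in range(100):                 # bisection: the CDF falls as p rises
        mid = (lo + hi) / 2
        if binom_cdf(errors, trials, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical black-box results: (false positives, different-source comparisons)
for errors, trials in [(0, 100), (6, 3000)]:
    observed = errors / trials
    bound = upper_bound(errors, trials)
    print(f"{errors}/{trials}: observed {observed:.4f}, 95% upper bound {bound:.4f}")
```

Zero observed errors in 100 comparisons is still consistent with a true error rate of nearly 3 percent, which is the arithmetic behind challenging a claimed "zero error rate."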

That logic moved from advisory report into binding rule. Federal Rule of Evidence 702, amended effective December 1, 2023, now says a witness may give expert opinion only if the proponent demonstrates to the court that it is more likely than not that the opinion reflects a reliable application of reliable principles and methods to the facts of the case. The amendment was aimed squarely at overstatement. For years experts walked in claiming a "100 percent" or "zero error rate" identification, and judges waved it through as something for the jury to weigh. The rule now puts the burden on the side offering the expert, before the jury hears a word.

The same expectations exist outside the United States, written differently. The UK Forensic Science Regulator issues statutory Codes of Practice with the force of law behind them: validated methods, declared competence, disclosed limitations. The ENFSI Guideline for Evaluative Reporting (Willis and colleagues, 2015, the STEOFRAE project) tells European examiners to report findings as a likelihood ratio, the probability of the evidence under the prosecution proposition against its probability under a stated defence proposition, and never to report the probability of a proposition itself. ISO/IEC 17025 accreditation wraps around all of it, demanding that a laboratory validate its methods, demonstrate the competence of its people, and monitor its results through proficiency testing.

Strip away the acronyms and every one of these bodies hands cross-examining counsel the same five questions, which is why each CrossCoach discipline reading points back here instead of rebuilding the machinery. What were your two propositions, the one you tested for and the one you tested against? What is your likelihood ratio, and is it calibrated, meaning does a value you call strong actually correspond to that strength in tested data? What is your method's validated error rate, from a black-box study, not your personal sense of how often you are right? What did you do procedurally about bias, before you saw the answer you were hoping to reach? And can you express your conclusion as anything other than a bare assertion of certainty? An examiner who can answer those five has met the standard the last fifteen years built. An examiner who cannot has a long afternoon ahead.

“The proponent demonstrates to the court that it is more likely than not that ... the expert's opinion reflects a reliable application of the principles and methods to the facts of the case.”

— Federal Rule of Evidence 702, as amended effective December 1, 2023

A single long horizontal polished brass rail fixed high and level across a vast wall of dark oak courtroom panelling, catching a line of warm lamplight along its length, set as one fixed threshold to be cleared. — Fig. 6 · Every standard, from NAS to PCAST to Rule 702, sets the same bar and hands counsel the same five questions.

The five questions every standard generates

Your two propositions · The one you tested for and the one you tested against

A calibrated likelihood ratio · Does a value you call strong correspond to that strength in tested data?

A validated error rate · From a black-box study, not your personal sense of how often you are right

A procedural bias safeguard · What you did before you saw the answer you hoped to reach

A conclusion that is not bare certainty · Expressed as strength of evidence, not an assertion of source or guilt

Validated · Instrument-based · Subjective comparison

NAS (2009), PCAST (2016), FRE 702 (amended 2023), the UK FSR Codes, ENFSI and ISO/IEC 17025 all converge on the same demands. The bars show how much settled scientific footing each demand can draw on, not a real metric.

What to carry into the witness box

01 · The conclusion is a separate object from the analysis, and it is where cases fail. Keep your testimony inside what you measured. The seam between measured and claimed is exactly where counsel will press.
02 · Report the strength of the evidence, a likelihood ratio under two stated propositions, never the probability that the defendant is the source or is guilty. That belongs to the court.
03 · Three failures turn a true measurement into a false claim: the prosecutor's fallacy (the probability of the evidence is not the probability of innocence), individualisation (uniqueness is assumed, not measured), and target-shifting (set your match criteria before you see the suspect's sample).
04 · "I am objective" and "twenty years of experience" are expressions of the bias blind spot, not defences against it. Bias is automatic, and willpower can backfire.
05 · Safeguards are things a lab does and documents, not a mindset: linear sequential unmasking (work from the evidence to the suspect, in that order) and verification by an examiner who is truly blind to your conclusion.
06 · "Validated" means a measured error rate from a black-box study, and a likelihood ratio that is calibrated, not merely discriminating. A number can point the right way and still be wrong by orders of magnitude.
07 · Every standard generates the same five questions: what were your propositions; what is your likelihood ratio and is it calibrated; what is your method’s validated error rate; what did you do procedurally about bias; and can you state your conclusion as anything other than bare certainty?

Challenge 06 · Put it to the test

Point me to the black-box study

You have asserted certainty. Counsel reaches for the current version of Rule 702.

The question

"You testified that this is a '100 percent match' and that your discipline has a 'zero error rate.' Under the version of Rule 702 in force since December 2023, point me to the black-box study that measured your method's actual error rate, and tell the jury what that number is."

