CrossCoach
Reading · Facial comparison

Faces from Images: How Good Is the Match, Really?

A face on a grainy CCTV still, and a face in the dock. To a jury it looks like the easiest comparison in the building, because we all recognise faces all day without effort. That intuition is backwards. Matching two images of someone you do not know is one of the hardest and least validated tasks in forensic science. These six moves are how a cross-examiner exposes it, and how a careful examiner stays standing.

16 min read · Based on the forensic facial-comparison research
I

You are worse at this than you think

Recognising your own brother across a crowded street is effortless. You do it in poor light, at a bad angle, after twenty years. That feeling of ease is the danger, because the task you do in the lab and in court is a different task entirely, and it is much harder than it feels.

Vicki Bruce, Zoe Henderson, Craig Newman, and Mike Burton ran the most rigorous version of this in 2001. They filmed academic staff at the University of Glasgow on a real building-entry CCTV camera, the truly low-quality kind in use everywhere, then asked people to decide whether a video clip matched a single high-quality photograph held right next to it. No memory was involved. Both images were in front of the viewer at once. Glasgow participants who knew the staff personally scored about 92% correct, with discriminability (A prime) of 0.93, close to ceiling, despite the grainy video. Participants from the University of Paisley, who knew none of the faces, scored 70% overall, A prime of 0.75. On the hardest trials, where a video target was paired with a deliberately similar-looking stranger's photo, the unfamiliar group dropped to 56% correct. That is barely above guessing, on a side-by-side comparison, with no time limit, with the option to pause and replay the clip.

Look at that result closely. The familiar viewers were not better witnesses or sharper minds. They simply recognised the person, which lets you skip the perceptual comparison altogether. Strip the familiarity away and what remains is the comparison itself, and the comparison is where people fall apart. Bruce and colleagues were blunt about the courtroom: comparing a defendant against a CCTV image of an otherwise unfamiliar person is a practice that "should be avoided, as resemblances between images of otherwise unfamiliar faces can be misleading and such judgments are highly prone to error."
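For readers who want to see what the A prime figures above summarise, here is a minimal sketch of the standard nonparametric A' formula, which folds a hit rate and a false-alarm rate into a single discriminability score (0.5 is chance, 1.0 is perfect). The function name and the example rates are illustrative assumptions. They are not the hit and false-alarm rates from Bruce and colleagues' data, which are not reproduced here.

```python
def a_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Nonparametric discriminability A' (0.5 = chance, 1.0 = perfect)."""
    h, f = hit_rate, false_alarm_rate
    if h == f:
        # No ability to tell matches from non-matches.
        return 0.5
    if h > f:
        return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))
    # Below-chance responding mirrors the formula around 0.5.
    return 0.5 - ((f - h) * (1 + f - h)) / (4 * f * (1 - h))


# HYPOTHETICAL viewer: says "match" on 80% of true matches and 20% of non-matches.
print(round(a_prime(0.80, 0.20), 3))
```

The point of a score like this is that it separates how well a viewer tells matches from non-matches from how willing they are to say "match" at all, which a raw percent-correct figure cannot do.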

Jenkins and his colleagues in 2011 found the floor under that floor. They pulled 20 internet photos each of two Dutch TV celebrities, unknown to UK viewers, shuffled all 40, and asked UK undergraduates to sort them into piles by identity. Two people. The median answer was 7.5 identities, and not one of the 20 participants got it right. Photos of the same woman looked like different women. Run the identical task with Dutch viewers who knew the celebrities and they sorted it perfectly into two piles. Same cards, same image quality. The difference was entirely in the head of the viewer.

The reason cuts to the root of the discipline. Two photos of one face can differ more than two photos of two faces. Lighting, expression, camera, age, a smile versus a neutral mouth: these push the pixels around so much that "do these two images show the same person?" has no stable answer from appearance alone. Jenkins said it directly. Face photographs are "unsuitable as proof of identity," and the variation within one person's photos was large compared to the variation between different people.

When counsel asks whether the person on the CCTV is the same as the defendant, that is the unfamiliar-matching task, performed on the two worst kinds of image, by someone the law treats as an expert precisely because they do not know either person. The science says that is the hard direction.

Photos of the same face were often deemed too dissimilar to go together, leading participants falsely to fractionate a single identity into several identities.
Jenkins, White, Van Montfort & Burton (2011), Cognition
A row of surveillance stills of the same anonymous person, each frame so different in angle, light and expression that the captures read as several different people, separated into a few distinct groups as though sorted by identity.
Fig. 1 · One person, sorted into several. The same face captured at different angles and lights fractures into strangers when you cannot recognise it.
Challenge 01 · Put it to the test

The hard direction

Counsel turns from the CCTV still to the examiner, calm and deliberate.

The question

“You compared the CCTV image to the defendant and concluded they are the same man. You had never met either person before this case, correct? So is it fair to say you were performing exactly the unfamiliar-face matching task that Bruce in 2001 found drops to chance, and the photo-sorting task where Jenkins in 2011 found not one participant in twenty got the right answer?”

Your answer
II

The badge does not make you better

David White and his colleagues walked into the Sydney Passport Office in 2014 and tested 30 officers whose whole job is confirming that the person standing in front of them is the person in the photo. These were not students. They averaged eight and a half years on the job, and all but three had completed the office's training module on identity verification. On the live person-to-photo test, they got 10% of decisions wrong. The number that matters for a fraud examiner is the false acceptance rate: 14% of fraudulent photos were waved through as authentic. And the imposters were not even chosen to resemble the applicants closely. They were picked from a small, diverse pool of students.

It gets worse for the badge. When White put the passport officers head to head with first-year university students on a photo-to-photo matching test, there was no significant main effect of group. On the standardised Glasgow Face Matching Test, officers scored 79.2%, statistically indistinguishable from the published norm of 81.3%. Years on the job predicted nothing. Experience did not buy accuracy.

This is not one rogue office. White, Towler and Kemp pulled together every published comparison they could find: 29 separate professional-versus-novice contrasts across 12 papers, over 1,600 practitioners tested. Twelve of those comparisons showed no significant difference at all. Bank tellers and notaries tested by Papesh made 25% errors. Police officers in Towler's training study were actually less accurate than students. Passport officers asked to review a face-recognition candidate list, their single most common daily task, got it wrong on 1 in 2 trials and picked the wrong face 40% of the time. The fair summary: a group called "facial reviewers" beat novices by an aggregate of about 1.5 percentage points. That is noise.

The story does not end with debunking, and a truthful witness needs the other half. Phillips and his team in 2018 ran the most demanding test yet, built on deliberately hard image pairs, and gave participants three months. Trained forensic facial examiners (57 of them, from five continents) and natural super-recognisers clearly outperformed everyone else. Median accuracy for examiners landed at an AUC of 0.93, against 0.68 for students. In the White, Towler and Kemp meta-analysis, examiners beat novices by 13 percentage points, and police super-recognisers beat them by 14. These are real effects, large and replicated.

Read the fine print before you rely on it. Examiners did best when given long study times, not short ones. Their high-confidence false-alarm rate was tiny but not zero, roughly 0.9%. Some examiners in Phillips' sample still scored below the median student. And the single most accurate result in the whole study was not a human at all: it was an examiner fused with the best 2017 algorithm, more accurate than two examiners combined.

So expertise here is real, but it is narrow and conditional. It belongs to a tested subgroup, under the right conditions, with the right time, and it is never perfect. What does not exist is expertise that travels with a job title. The passport officer, the border guard, the detective: the literature says assume novice-level error until someone shows you a score.

Trained passport officers also perform poorly when matching unfamiliar faces. High error rates were consistent across three tests... length of time employed as a passport officer did not predict accuracy.
White, Kemp, Jenkins, Matheson & Burton (2014), PLoS ONE
Two anonymous people seen from overhead surveillance at two identical screens comparing faces, one in a uniform with a lanyard, one in plain clothes, their work and their results indistinguishable from above.
Fig. 2 · A uniform beside plain clothes, doing the identical task with identical results. Years on the job and a job title predicted nothing.
Challenge 02 · Put it to the test

Better than a stranger off the street?

Counsel sets down the witness CV and asks the relevance question plainly.

The question

“You hold a recognised qualification and you compare faces every day at work. On what evidence should this court treat your conclusion as more reliable than that of an untrained member of the public, given that passport officers with eight years' experience perform no better than first-year students?”

Your answer
III

The algorithm is not a second witness

On the Friday after Thanksgiving 2022, Randal Quran Reid was pulled over outside Atlanta and handcuffed on theft warrants from Louisiana, a state he had never set foot in. He spent six days in the DeKalb County jail. His family spent thousands of dollars before anyone figured out why: a Jefferson Parish detective had run a face from a store surveillance camera through Clearview AI, and the system pointed at Reid. The warrant affidavit said only that a "credible source" had identified the "heavyset black male." There was no source. There was an algorithm, and a detective who treated its output as confirmation. Reid was released only after his lawyer pointed out a mole on Reid's face that the actual thief did not have.

Porcha Woodruff was eight months pregnant when Detroit police arrested her in 2023 for carjacking. The chain was the same. A face-recognition search returned 73 candidates from a database of more than five million mugshots. Woodruff's years-old arrest photo was in the pile. Police built a six-person lineup around her and showed it to the complaining witness, who picked her. The ACLU's brief in her case calls this what it is: the algorithm chose the person who looked most like the suspect, then the witness was steered to that same person. The corroboration was not independent. It was the machine's guess, laundered through a human.

Two things make this a real problem in your discipline. The first is that the algorithms carry demographic error that is not evenly spread. Buolamwini and Gebru's 2018 Gender Shades audit ran three commercial classifiers and found error rates of up to 34.7 percent on darker-skinned women, against a maximum of 0.8 percent for lighter-skinned men. That audit tested gender classification, not identification, but it broke the myth that one accuracy number describes a system. NIST's demographic testing confirmed the pattern for matching: by the figures cited in the Woodruff brief, some algorithms misidentify Asian and African American faces up to 100 times more often than those of white men, and older Black women in some tests were over 3,000 times more likely to draw a false positive than younger Eastern European men. Howard and colleagues in 2019 showed why. Comparing faces within a single demographic group (same race, same gender, same age) raised the false-match rate more than 400-fold over comparisons across groups, driven mostly by shared race and gender. Lookalikes cluster.
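The arithmetic behind those multipliers is worth seeing once. As a rough sketch, the expected number of wrong faces that clear a match threshold in one search is the gallery size multiplied by the per-comparison false-match rate. The baseline rate below is a hypothetical figure chosen only to make the numbers readable, not a measured rate for any real system, and applying one multiplier to the whole gallery is a simplifying assumption.

```python
def expected_false_matches(gallery_size: int, false_match_rate: float) -> float:
    """Expected number of wrong faces clearing the match threshold in one search."""
    return gallery_size * false_match_rate


GALLERY_SIZE = 5_000_000  # order of the mugshot database described in the Woodruff search
BASELINE_FMR = 1e-6       # HYPOTHETICAL per-comparison false-match rate, for illustration only

# 100 reflects the "up to 100 times more often" figure quoted above.
for multiplier in (1, 100):
    rate = BASELINE_FMR * multiplier
    expected = expected_false_matches(GALLERY_SIZE, rate)
    print(f"false-match rate x{multiplier}: about {expected:,.0f} wrong candidates per search")
```

Under these assumed numbers, the same search that would surface a handful of lookalikes for one group surfaces hundreds for another. The system's headline accuracy has not changed, but the probe subject's exposure to a false hit has.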

The second problem is you. A candidate list is, by design, a list of people who resemble the probe. When a reviewer scans it, automation bias takes over. Parasuraman and Manzey describe it as a heuristic that replaces vigilant analysis with deference to the computer's advice. The reviewer never sees the dozens of other near-matches the search threw away, and human operators picking the right face from a candidate list err roughly half the time. The list does not make you a better witness. It makes you a worse one, because it has already told you the answer it wants.

In court, hold the line the vendors themselves hold. Clearview's own CEO said an arrest should never rest on a face-recognition search alone. The output is an investigative lead. It is a reason to look, not a finding that you saw.

Even if Clearview AI came up with the initial result, that is the beginning of the investigation by law enforcement to determine, based on other factors, whether the correct person has been identified.
Hoan Ton-That, CEO of Clearview AI, New York Times (2023)
A ranked grid of near-identical anonymous CCTV faces on a screen, the top-left face boxed by a bright machine selection outline before any person has looked, the rest of the look-alikes left cold and unmarked.
Fig. 3 · A candidate list is a wall of look-alikes the search chose. Rank one is already boxed before you compare a single feature.
Challenge 03 · Put it to the test

Two identifications, or one?

Counsel frames the search and the review as if they were two separate witnesses agreeing.

The question

“The algorithm returned my client's photo, and you, a trained examiner, then confirmed the match. Isn't that two independent identifications?”

Your answer
IV

Told the answer before you looked

In 2006, Itiel Dror and David Charlton ran a study that should change how you think about your own conclusions. They took fingerprint comparisons that experts had previously called matches, in real casework, and showed those same prints back to those same experts months later, this time wrapped in a story: these are the prints from the Madrid bombing that the FBI got wrong. Several experts reversed themselves. Same ridges, same examiner, different answer. The only thing that changed was what they had been told before they looked. That is contextual bias, and it is the founding result behind everything in this section.

Facial comparison is not immune. Rebecca Heyer and Carolyn Semmler made exactly this argument in 2013, applying the forensic confirmation bias framework of Kassin, Dror and Kukucka directly to facial image comparison. Their point is uncomfortable: ground truth is usually unknown, there is no statistical basis for quantifying a face match, and examiners routinely have access to information they cannot easily un-see. Heyer and Semmler had tested 149 experienced facial comparison specialists from Australian government agencies. On average the group was well calibrated, confidence tracking accuracy. The catch was the spread. Some specialists ran 20% more confident than they were accurate, others 20% less. Calibration on average tells you nothing about the examiner in the box.
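A small sketch shows how a well-calibrated group average can hide badly calibrated individuals. It treats over- or underconfidence as mean confidence minus accuracy, a common simple measure that is assumed here for illustration and is not necessarily the exact statistic Heyer and Semmler used. The three examiners and their scores are invented.

```python
from statistics import mean

# HYPOTHETICAL examiners: (mean confidence, proportion correct), both on a 0-1 scale.
examiners = {
    "A": (0.90, 0.70),  # about 20 points overconfident
    "B": (0.80, 0.80),  # well calibrated
    "C": (0.65, 0.85),  # about 20 points underconfident
}


def overconfidence(confidence: float, accuracy: float) -> float:
    """Positive = more confident than accurate; negative = less confident than accurate."""
    return confidence - accuracy


gaps = {name: overconfidence(conf, acc) for name, (conf, acc) in examiners.items()}
group_gap = mean(list(gaps.values()))

for name, gap in gaps.items():
    print(f"Examiner {name}: confidence minus accuracy = {gap:+.2f}")
print(f"Group average: {group_gap:+.2f}")
```

The group average comes out at roughly zero even though two of the three invented examiners are far from calibrated, which is why a group-level result cannot vouch for the one examiner giving evidence.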

The sharpest demonstration of how a machine primes a human comes from Dror, Wertheim, Fraser-Mackenzie and Walajtys in 2012. They slipped a manipulation into the real casework of 23 latent fingerprint examiners, all court-qualified, averaging over 19 years of experience. Across 3,680 candidate lists and 55,200 comparisons, the examiners did not know they were in a study. The researchers moved the true matching print to different positions in the algorithm's ranked list. The position changed the verdict. False identifications clustered at the top of the list, positions one and two, and they happened even when a better, actually-matching print was sitting lower down the same list. Examiners spent less time on lower-ranked candidates, and the rushed comparisons produced more missed identifications. The list ranking, not the ridge detail, was steering the conclusion.

Carry that finding into your discipline. A facial recognition system hands you a candidate gallery sorted best-match-first. Rank one carries a suggestion before you compare a single feature. Heyer and Semmler note that performance in facial review drops as the candidate list grows, and they explicitly call for the kind of in-casework testing Dror's team ran.

The remedies are procedural, not motivational. Heyer and Semmler endorse working linear: evaluate each face in isolation, document what you see, before you ever place the two side by side, with some agencies splitting the evaluation and comparison stages between different specialists. The broader principle is linear sequential unmasking, where you expose yourself to the reference and to context only after you have committed your assessment of the questioned image. They are blunt that good intentions are not protection. Pronin's bias blind spot means training someone about bias can leave them more confident they have beaten it, not less exposed.

In the box, counsel does not need to prove you are dishonest. They need to establish the order in which you learned things. Did you see the suspect's name, the arrest, the algorithm's rank one, or the detective's theory before you formed your conclusion? If yes, the conclusion you signed is the one you were handed.

These findings are not a function of the print itself; the same print is considered differently when presented at a lower position on the ranked list.
Dror, Wertheim, Fraser-Mackenzie & Walajtys (2012), J. Forensic Sci. 57(2):343-352
A surveillance screen showing an anonymous face, with a plain opaque card held up in the foreground directly in front of the screen so it reaches the eye first, the card sharply lit and read before the dim face waiting behind its edge.
Fig. 4 · The order you learn things in decides the answer. The context is taken in first, before the face behind it is ever examined.
Challenge 04 · Put it to the test

What did you know first?

Counsel sets aside the comparison itself and asks only about the order of events.

The question

“Before you formed your conclusion that these two faces matched, what had you already been told: the suspect's identity, the algorithm's rank-one candidate, the fact of an arrest, or the investigating officer's theory of the case?”

Your answer
V

What is the error rate of the method you used?

Counsel can dismantle your testimony with one question, and it has nothing to do with whether you got this case right. The question is simple: what is the measured error rate of the method you used? For forensic facial comparison, as it is actually practiced, the truthful answer is that nobody has measured it.

Take the metric approach first. Photoanthropometry, also called facial mapping, marks anatomical landmarks on a face, measures the distances between them, and converts those distances to proportionality indices so that two images can be compared. Reuben Moreton and Johanna Morley, both then at the Metropolitan Police Service, tested this in 2011. They pulled 25 individuals from the Home Office multipose database, each photographed at high resolution across 20 different camera angles, and they asked a basic question: if you only move the camera, do a person's facial proportions stay put? They do not. Every proportionality index changed significantly when the camera moved just 10 degrees in the vertical plane. Their conclusion was blunt: the variability in one person's measurements caused by camera angle alone can be as great as the variability between different people. They state that photoanthropometry, as currently practiced, is unsuitable even for elimination. That is the weaker claim, exclusion, and the method fails it.

It gets worse at courtroom image quality. In their CCTV footage, two different individuals shared an identical nose-width index at 2.5 metres, and at 5 metres four individuals shared the same value. Different faces become numerically indistinguishable as the pixels run out. And as Moreton and Morley note, this is a subjective assessment in which no estimations of error are given.

So examiners turn to the morphological method instead: feature-by-feature comparison, eyes against eyes, ears against ears, guided by FISWG and OSAC documents. Read those documents and notice what they are. The FISWG training guideline, version 1.0 from 2010, is exactly that, a guideline. It tells a trainee to be "aware of available and relevant statistics regarding facial shapes and relative frequency of occurrence." Aware of statistics that, for the most part, do not exist. There is no published table telling you how rare a particular ear shape or chin is in the population, so when you say two faces "match," you cannot say how many other people would match equally well.

The OSAC documentation standard is even more candid, and it works against you on the stand. OSAC 2022-S-0008 requires a facial comparison report to disclose, in writing, the "absence of citable empirical measures of performance" (section 4.3.6.1). The field's own standards body instructs you to admit, on paper, that there is no measured performance figure for what you did. These are documentation and guideline texts, not validation studies, and they say so in clear terms.

That gap is precisely what the National Academy of Sciences flagged in 2009 and what PCAST in 2016 and the UK Forensic Science Regulator have demanded since: demonstrated validity and a known error rate, established by empirical testing. For DNA that burden was met. For facial comparison it largely has not been. When counsel asks for your number, you will not have one, and the guidelines you relied on already told you to say so.

Absence of citable empirical measures of performance.
OSAC 2022-S-0008, required report disclosure, section 4.3.6.1
A round measuring gauge mounted on a surveillance screen over an anonymous face, its needle resting on a completely blank dial with no scale, no numbers and no markings, an instrument pointing at a face but measuring nothing.
Fig. 5 · An instrument aimed at the face, with a blank dial behind the needle. The method produces a reading, but no validated error rate to anchor it.
Challenge 05 · Put it to the test

Where is your number?

Counsel asks the validity question that has nothing to do with whether the examiner got this case right.

The question

“You testified that these two faces match. What is the validated error rate of the method you used to reach that conclusion, and what published study established it?”

Your answer
VI

When "strong support" means more than it should

Robert Neave had compared faces for about twenty years when he took the stand in R v Atkins. A medical artist at Manchester with forty years behind him, he examined indistinct CCTV from an armed robbery in west London, spent sixteen hours on the comparison with Dean Atkins, and excluded roughly twenty other known burglars plus Atkins's own brother. His conclusion came off a tidy ladder he had written into his report: level 0 "lends no support," up through level 5 "lends powerful support." He placed this case at the top of level 3 running into level 4, somewhere past "lends support" and toward "lends strong support." Pressed, he conceded there was no database, so the ladder was built on his own experience and nothing else. The Court of Appeal in 2009 admitted it anyway, provided the jury was told the rungs were subjective labels.

Consider what that ladder is. The court in R v Tang had already called the same kind of scale "no more than a series of convenient labels, arranged in an ascending hierarchy, that state a conclusion." In R v Gray, Mitting J had gone further: with no national database and no agreed formula, any expression of the degree of support "must be only the subjective opinion" of the witness, and "this court doubts whether such opinions should ever be expressed." Atkins overruled that doubt. So the word "strong" can reach a jury with no number underneath it, and the only safeguard is that someone says out loud it is a judgment, not a measurement.

The problem is what the jury then does with the word. Martire and her colleagues at UNSW ran 494 mock jurors through a larceny trial in 2013, varying whether the expert's strength came as a number or as the matching verbal label from the Association of Forensic Science Providers scale. Numbers moved belief toward guilt. The weak verbal label did the opposite. In the low-strength verbal condition, 61.7% of participants moved toward innocence after hearing evidence that pointed at the defendant. Nearly a quarter flipped outright from "more likely guilty" to "more likely not guilty" on hearing incriminating testimony. They called it the weak evidence effect: tell a jury your evidence "weakly supports" the prosecution, and a majority read it as helping the defence. The expert's intended meaning and the juror's reading came apart entirely.

Even at the top of the scale the testimony underperformed. When the expert handed over a likelihood ratio of 495,000, the median juror behaved as though it were about 1.4. Belief change for "450 times more likely" and "495,000 times more likely," three orders of magnitude apart, was statistically the same.
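To see how large that gap is, here is a minimal sketch of what a likelihood ratio is supposed to do to a listener's belief under Bayes' rule in odds form: posterior odds equal prior odds multiplied by the likelihood ratio. The 50% starting belief is a hypothetical value chosen for illustration, not a figure from Martire's study, and the function name is an assumption.

```python
def posterior_probability(prior_probability: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds x likelihood ratio."""
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


PRIOR = 0.5  # HYPOTHETICAL starting belief in guilt, for illustration only

for label, lr in [("as stated by the expert", 495_000), ("as the median juror behaved", 1.4)]:
    updated = posterior_probability(PRIOR, lr)
    print(f"LR {lr:,} ({label}): belief moves from {PRIOR:.2f} to {updated:.6f}")
```

On this assumed starting point, the stated ratio should carry a listener to near certainty, while an effective ratio of about 1.4 moves them only from 50% to a little over 58%.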

Edmond, Biber, Kemp and Porter spelled out the deeper hole in 2009. Facial mapping had never been validated. No study had measured an examiner's hit rate and false-positive rate, so no error rate existed. In Tang the court had treated reliability as "an extraneous idea." Your scale of support has nothing measured underneath it.

Counsel will ask what your words rest on. State your validated casework error rate, or concede you do not have one and that "strong support" is your experience speaking, not a measured probability. Then explain clearly that the jury can read the word as the opposite of what you mean.

A majority of those in the low/verbal condition (61.72%) responded in a manner incongruent with the evidence provided by the expert (taking inculpatory evidence to be exculpatory).
Martire, Kemp, Watkins, Sayle & Newell (2013), Law and Human Behavior
A tall ladder rising up a wall, its lower rungs solid metal but its upper rungs dissolving into faint, insubstantial outlines that could bear no weight, the higher and stronger the rung the less of it is actually there.
Fig. 6 · An ascending ladder of words with nothing solid under the top rungs. "Strong" reaches the jury carrying weight nothing has measured.
Words that carry more than the science can back

Each phrase a facial-comparison witness might reach for claims more certainty than a method with no measured error rate can support, or invites the jury to read it the wrong way. Swap each for language that states the limit honestly. Grounded in the facial-comparison research.

What to carry into the witness box
  • 01 Matching two images of a stranger is the hard direction, and familiar recognition tells you nothing about it. Name the task you actually performed, and own that not knowing the people is what makes it fragile.
  • 02 A job title is not evidence of skill. Passport officers with eight years on the job match no better than first-year students. Real examiner and super-recogniser ability exists, but it is tested, narrow, conditional on time and conditions, and never perfect. Assume novice-level error until someone shows a score.
  • 03 A face-recognition hit is an investigative lead, not a second identification. The algorithm carries uneven demographic error and has already told you the answer it wants. Do not let your review launder its guess into confirmation.
  • 04 Know the order you learned things in. If you saw the name, the rank-one candidate, the arrest, or the case theory first, the conclusion was handed to you. Work blind and linear, and be ready to say you did.
  • 05 Your method has no measured error rate. Photoanthropometry collapses at courtroom image quality, the morphological method has no feature-frequency statistics, and OSAC instructs you to disclose the absence of performance measures in writing.
  • 06 "Strong support" is not a number, and a jury can read it the opposite way you mean it. State your validated casework error rate or concede you have none, and explain what your words can and cannot carry.
Challenge 06 · Put it to the test

What does "strong" rest on?

You are on the stand. Counsel has saved the conclusion language for last.

The question

“You told this jury the similarities offer 'strong support' that the man on the camera is the defendant. What is the validated error rate behind that phrase, and if you have none, can you rule out that some of these jurors will read 'support' as helping my client rather than the Crown?”

Your answer