
Digital Forensics: The Evidence That Looks Like Maths
Of all the forensic evidence a court hears, digital evidence wears the best disguise. A fingerprint examiner says "in my opinion." A DNA analyst gives a probability. The digital examiner hands over a file, a timestamp, and a hash that matches to thirty-two characters, and it reads like arithmetic. The six moves below are the ones a good cross-examiner uses to show it isn't.
The presumption that computers don't lie
For twenty years, the Post Office prosecuted its own subpostmasters on the word of a computer. From 1999, branch after branch showed accounting shortfalls in a system called Horizon, and the Post Office took those shortfalls to court as proof of theft and fraud. Hundreds of people were convicted. Many were imprisoned, fined, or had their homes taken; some died still labelled thieves. The shortfalls were software bugs. In the words of the appeal courts now quashing conviction after conviction, it is likely the largest miscarriage of justice in British history.
It happened because of a legal default most examiners have never heard of and all of them rely on. In England and Wales, a court treats a computer, as a matter of law, as having worked correctly unless someone proves otherwise. Evidence from a computer is presumed reliable. Lawyers call it a rebuttable presumption. You can rebut it in theory. In practice it is close to impossible, especially when the system belongs to a "substantial institution" with the records and the lawyers on its side. The subpostmasters could never get inside Horizon to show it was broken, so the presumption held. It was wrong.
It was not always the law. Until 1999, section 69 of the Police and Criminal Evidence Act 1984 put the burden the other way. The prosecution had to show a computer was operating properly before its output could come in. The Law Commission decided that was a nuisance. Section 69, it said, served "no useful purpose," and on its recommendation the section was repealed in favour of a presumption that "mechanical instruments were in order at the material time." Computers, the Commission reasoned, counted as mechanical instruments. It also assured everyone that "such a regime would work fairly." Horizon is what fair looked like.
Calling a computer a mechanical instrument is where it goes wrong. A set of scales fails in obvious, physical ways you can see and repeat. Software fails in ways that hide. In 2022, twelve of Britain's most senior figures in computer reliability and evidence law (Ladkin, Littlewood, Thimbleby, Thomas, Murdoch and Mason among them) set out why. Every computer system contains bugs, and some, in their words, "rarely reveal themselves... because they can masquerade as normal behaviour." Even after a failure, it can be impossible to say whether the cause was a software defect or someone using the system wrongly. A machine that has run cleanly for years can still be running on a bug that surfaces for the first time today.
“A court will treat a computer as if it is working perfectly unless someone can show why that is not the case.”

American courts built the same problem one level down, around the tools. In 2006, Van Buskirk and Liu gave it a name. Courts had granted forensic software a "presumption of reliability," treating the output of EnCase and its rivals as accurate because it usually is. There was, they argued, no scientific basis for it. The assumption then runs the whole way down. The system is sound, the tool that reads it is sound, the examiner who reports the tool is sound, and nobody had to prove any of it.
None of this is a reason to attack computers in general on the stand. It is a reason to know what you are standing on. When you present a device's output as fact, you invite the court to assume the one thing Horizon proves it should not. Bohm and his colleagues propose a better position: treat reliability as a claim you have to back, not a default you inherit. If anyone asks, you should be able to point to the system's own error logs, its audit trail, and its change records.
Established fact, or inherited assumption?
Counsel sets a printout of the device report on the rail and begins, almost gently.
"You've presented this device's output to the court as established fact. That rests on the presumption that the system was operating correctly, the same presumption that sent innocent subpostmasters to prison. What can you actually produce, error logs, audit records, change-control history, to show this system was working correctly rather than asking the jury to assume it?"
"I couldn't find it, your honour, so it mustn't be there"
In 2010, the man who runs the testing lab published a confession for a title. James Lyle, the computer scientist who heads NIST's Computer Forensic Tool Testing program, called his paper "If error rate is such a simple concept, why don't I have one for my forensic tool yet?" He had been testing these tools since 2000. He still could not hand you a number.
The reason matters in the witness box. A soil test for some chemical has a stable error rate because its failures are random. They scatter, so you can model them. Digital tools don't fail like that. They fail systematically: the same drive, the same operating system, the same interface, and the tool drops the same data every time. Lyle's own example is the imaging tool SafeBack. On one test it copied 3,335,472 sectors and got 1,008 of them wrong, an error rate of 0.0003. On most other runs it was perfect: zero. Same tool. Whichever number you quote misrepresents all the others. A write blocker, as Lyle put it, does not have a 3% failure rate. It "either works or it fails." There is no average to give a jury.
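The arithmetic behind Lyle's point can be sketched in a few lines. The failing run uses the SafeBack figures quoted above; the three clean runs and their sector counts are assumptions added purely for illustration.

```python
# Per-run error rates for an imaging tool.
# Run 1 uses the SafeBack figures quoted in the text; the clean runs are
# hypothetical and only stand in for "most other runs were perfect".
runs = [
    {"label": "failing configuration", "sectors": 3_335_472, "bad": 1_008},
    {"label": "clean run A (assumed)", "sectors": 3_335_472, "bad": 0},
    {"label": "clean run B (assumed)", "sectors": 3_335_472, "bad": 0},
    {"label": "clean run C (assumed)", "sectors": 3_335_472, "bad": 0},
]


def error_rate(run):
    """Fraction of sectors copied incorrectly in a single run."""
    return run["bad"] / run["sectors"]


per_run = [error_rate(r) for r in runs]

# Pooling every run into one figure gives a number that describes none of them.
pooled = sum(r["bad"] for r in runs) / sum(r["sectors"] for r in runs)

for r, rate in zip(runs, per_run):
    print(f"{r['label']}: {rate:.6f}")
print(f"pooled across runs: {pooled:.6f}")
```

The pooled figure sits between zero and the failing run's rate, so it understates the one configuration that fails and overstates all the ones that do not. That is the sense in which there is no average to give a jury.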
“I couldn't find it your honour, it mustn't be there!”
A decade of that testing, written up by Barbara Guttman, Lyle and Rick Ayers in "Ten Years of Computer Forensic Tool Testing," reads as a catalogue of tools failing silently. Some skip the last sectors of an NTFS partition because "those sectors are not used to contain user data," and overwrite them with a block already acquired. When a bad sector turns up on a Linux ATA acquisition, seven readable sectors around it get replaced with zeros. Gone, no warning. The same tool over a different interface drops a different, variable number. None of it throws an error. The report still looks correct.
Then there are the people using the tools. In 2019 Graeme Horsman surveyed up to 100 working examiners. 76% were worried about the state of tool testing. 88% had personally hit erroneous results from a forensic tool. 79% admitted using a tool in a live investigation they had never tested themselves, on the vendor's word alone. The vendor's word is an EULA. EnCase's says the software "is not fault-tolerant" and the maker "does not warrant that the software is error-free." X-Ways is blunter: "the user must assume the entire risk of using the program." The usual fallback is dual-tool agreement. Run two tools, and if they match, call it confirmed. Horsman takes that apart. Two tools built on the same flawed code library will agree and both be wrong. Agreement is not proof. It is two witnesses who copied each other's homework.
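Horsman's objection to dual-tool agreement can be shown with a toy model. Everything here is hypothetical: the two "tools", the shared parser, and the record formats are invented for illustration and do not describe any real product.

```python
def shared_parser(records):
    """Hypothetical shared library. It silently skips any record whose
    format it does not recognise, and raises no error when it does so."""
    return [r["text"] for r in records if r["format"] == "v1"]


def tool_a(records):
    """Hypothetical tool A: wraps the shared parser and sorts the output."""
    return sorted(shared_parser(records))


def tool_b(records):
    """Hypothetical tool B: a different interface over the same parser."""
    return sorted(set(shared_parser(records)))


# Ground truth is known here only because we built the evidence ourselves.
# In a real examination it is not available.
evidence = [
    {"format": "v1", "text": "message one"},
    {"format": "v2", "text": "message two"},  # format the parser cannot read
]
ground_truth = sorted(r["text"] for r in evidence)

tools_agree = tool_a(evidence) == tool_b(evidence)
both_complete = tool_a(evidence) == ground_truth

print("tools agree:", tools_agree)
print("tools recovered everything:", both_complete)
```

The two outputs match because both inherit the same blind spot, so agreement between them says nothing about the record neither could read.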

That leads to the title of Horsman's other paper: "I couldn't find it your honour, it mustn't be there!" A tool returns nothing, and the examiner reads nothing as not there. But the tool may simply not parse that browser version, that compressed format, that unallocated region, a limit it never disclosed. Marshall and Paige put the deeper problem precisely: the inputs to a forensic exam are unknown. You have no ground truth to check the tool against. So when counsel asks how you know your tool found everything, the truthful answer is that you don't, and you can't, from the tool's output alone.
Where's the number?
Counsel holds up the tool report and asks for the one thing it does not contain.
"What is the measured error rate of the tool you used on this exhibit, for this device, this operating system, this interface, verified against ground truth? And if you can't give me a number, how can you tell this jury the tool found everything that was there?"
"Verified MD5" answers a question nobody asked
Radina Stoykova and her colleagues pulled 124 forensic reports out of the Norwegian police case-management system: 21 homicide and sexual-assault cases that all went to indictment, 187 seized devices. Then they checked something simple. Did the reports show the evidence had actually been verified? For 50 of the reported acquisitions, the report said nothing about whether integrity was checked at all. Eighteen said "MD5 was used" but printed no hash value. Three named both MD5 and SHA1, again with no value. In the whole dataset, exactly two acquisitions stated the algorithm and gave the number. Their conclusion was blunt. None of the 21 cases validated tool results or error rates, and it was "not possible to trace the digital forensic actions performed on each item or link the digital evidence to its source."
A hash is the one thing examiners are sure they got right, and most of the time it is not even in the report.
“Only for two reported acquisitions were the use of MD5 specified and the hash value provided.”
The deeper problem is the one counsel will press. Even a perfect, documented, matching MD5 proves a single thing: the copy is faithful to what was read off the device. It says nothing about whether you read off everything that mattered. Stoykova's framework separates the acquisition space (partitions, volumes, allocated versus unallocated) for exactly this reason. Manual extraction, she notes, "does not actually capture the digital data stored on the device." It captures only "the representation of data as provided by the device itself," and it changes the device while it does so. Selective extraction, a cloud-only pull, an encrypted partition you could not open: you hash the fraction you got, and the hash comes back immaculate. An immaculate hash of a fragment.
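A minimal sketch of that gap, using Python's standard hashlib. The device contents, the partition names and the split between them are hypothetical; the point is only that the hash check passes regardless of how much of the device was acquired.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the 32-character hexadecimal MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


# Hypothetical device: one readable partition and one encrypted partition
# that the examiner could not open.
readable = b"user partition: messages, photos, browser history"
encrypted = b"encrypted partition: contents never acquired"
device = readable + encrypted

# The tool acquires only what it could read, then a working copy is made.
acquired = readable
working_copy = bytes(acquired)

hash_at_acquisition = md5_hex(acquired)
hash_of_copy = md5_hex(working_copy)

copy_is_faithful = hash_at_acquisition == hash_of_copy
coverage = len(acquired) / len(device)

print("hash verified:", copy_is_faithful)
print(f"share of device actually acquired: {coverage:.0%}")
```

The verification succeeds while the coverage figure stays well below 100%. The hash answers "is the copy the same as what was read?" and is silent on "was everything read?"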
The hash itself is not the bedrock it is sold as either. Rasjid and colleagues reviewed the cryptographic collisions in the functions forensic tools rely on. MD5 collisions have been public since Wang's 2004 Eurocrypt result. SHA-1 fell to a full collision attack from Stevens and colleagues in 2017. Collisions have to exist: there are far more possible inputs than there are fixed-length digests, so by the pigeonhole principle two different files must share a value somewhere. Tools, Rasjid notes, "still use MD5 due to performance issues." A defence expert who knows this can ask whether your matching hash really proves the two objects are identical, and "yes, mathematically certain" is the wrong answer.
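The pigeonhole argument can be made concrete with a deliberately tiny digest. This sketch truncates SHA-256 to 16 bits, which is an assumption chosen so a collision turns up quickly; it illustrates why collisions must exist, not how the published MD5 and SHA-1 attacks work.

```python
import hashlib


def tiny_digest(data: bytes, hex_chars: int = 4) -> str:
    """First hex_chars characters of SHA-256: a toy digest (16 bits by default)."""
    return hashlib.sha256(data).hexdigest()[:hex_chars]


def find_collision(hex_chars: int = 4):
    """Hash distinct inputs until two share a toy digest.

    There are only 16**hex_chars possible digests, so among
    16**hex_chars + 1 distinct inputs a repeat is guaranteed.
    """
    seen = {}
    for i in range(16 ** hex_chars + 1):
        message = f"file-{i}".encode()
        digest = tiny_digest(message, hex_chars)
        if digest in seen:
            return seen[digest], message, digest
        seen[digest] = message
    raise AssertionError("unreachable: the pigeonhole principle guarantees a repeat")


first, second, shared = find_collision()
print(first, second, "share the toy digest", shared)
```

A real 128-bit or 160-bit digest has vastly more pigeonholes, which is why finding a collision by brute force is impractical and why the known attacks rely on cryptanalysis instead. The guarantee that collisions exist is the same.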

The live issue is rarely that someone forged your evidence with a chosen-prefix collision. That is exotic. The issue is rhetorical. "Verified MD5" sounds like proof that everything is correct. It is proof of copy fidelity for the data you happened to acquire, and nothing else. As Horsman puts it, the field has to verify its tools using its tools, "trapped in an infinite loop," and a matching hash does nothing to catch a tool that silently failed to parse a browser format or skipped a compressed region.
In the box, the mistake is letting "I verified the hash" stand in for "my extraction was complete and my interpretation correct." Those are three different claims, and the hash covers the first inch of the first one. When counsel asks how you know you recovered everything that mattered, the real answer is about scope and method and what the device, the cloud, or the encryption put out of your reach. It is not about a thirty-two-character string that matched.
What the hash does not cover
Counsel concedes the hash matched, then asks what it leaves untouched.
"You testified that the extraction was 'verified MD5.' That confirms your copy matches what your tool read off the phone. It tells the jury nothing about the encrypted partition you couldn't open, or the cloud backup you never pulled. So how do you know you recovered everything that mattered, and what, specifically, did your acquisition leave out?"
Fifty-three examiners, one drive, four stories
In 2021, Nina Sunde and Itiel Dror handed the same 3 GB disk image to 53 working digital forensic examiners across eight countries: Norway, India, the UK, Denmark, Finland, the Netherlands, Kenya and Canada. Same file, an old Windows XP machine owned by "Jean," timestamps from 2008. Same question: what happened, and was Jean involved? The traces on the drive were not exotic. The eleven they scored, including emails, chat logs, a spreadsheet and a USB mount, were the kind of thing the authors say "any competent examiner would find." Not one examiner found all eleven. Most found between five and eight. Fourteen found four or fewer.
That should trouble anyone who tells a court digital evidence is objective, and the second half of the study is sharper. Sunde and Dror split the 53 into four groups. The control group got only the bare scenario. The others got a line of context first. One was told Jean had confessed. One got an ambiguous wage-dispute story. One was told the police now believed Jean was innocent, framed in a phishing attack. The drive was identical for all of them. The context was a sentence.
“...if a DF examiner performs an analysis of an evidence file, and another DF examiner would do a re-analysis of the same evidence file, the chances of reaching consistent results are low.”
The examiners told innocence were the ones who looked and stopped. They observed the fewest traces, an average of 4.5 against 6.9 for the group nudged toward guilt. They did not find less because the data was different. They found less because they stopped searching once the file fit the story. Sunde and Dror state it directly: an examiner who believes the suspect is innocent observes fewer traces and "may have less information for developing explanations" of what actually happened. The bias did more than colour the conclusion. It set the moment the examination ended. One examiner reported it was "very likely" Jean was framed in a phishing attack, a confident finding built on a drive that, examined without that prompt, supported no such thing.
Then there is reliability, which the authors called their most important result. They measured agreement between examiners with Krippendorff's alpha, where 0.80 is strong and anything below 0.667 is inadequate. Across observations, interpretations and conclusions, every score came in low. The best was 0.51. Same evidence file, same context, and the chance that a second examiner would see, read and conclude the way the first did was poor. On individual traces, examiners split between "indicates guilt" and "indicates innocence." Opposite directions, same bytes.
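For readers who want to see what sits behind a figure like 0.51, here is a minimal sketch of Krippendorff's alpha for nominal labels. The function name, the three-examiner data and the two labels are hypothetical; they are not the study's data or the authors' code.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: one list per rated item, holding the labels the raters gave it.
    Missing ratings are simply left out; items with fewer than two labels
    cannot be paired and are skipped.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of ratings within an item, weighted by 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is ever used")
    return 1 - observed / expected


# Hypothetical: three examiners classify four traces.
ratings = [
    ["guilt", "guilt", "innocence"],
    ["guilt", "innocence", "innocence"],
    ["guilt", "guilt", "guilt"],
    ["innocence", "guilt", "innocence"],
]
alpha = krippendorff_alpha_nominal(ratings)
print(f"alpha = {alpha:.3f}")
```

Alpha is 1 when examiners agree on every item and falls toward 0 as their agreement approaches what chance alone would produce. The invented ratings above, where only one trace gets a unanimous reading, land far below the 0.667 floor mentioned in the text.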

The 2019 Sunde and Dror paper had already argued that digital forensics, alone among the major disciplines, had sidestepped the bias research that reshaped DNA and fingerprints, and that examiners meet case context as a matter of routine. In a side study they collected 30 case-submission forms from European and US units. Of those, 22 (73%) had an open "information about the case" box, and not one warned against putting task-irrelevant detail in it. The pipeline that hands you the drive also hands you the theory.
In the box, the question is not whether you are truthful. It is whether the conclusion was yours or the file's. Did you read the suspect interview before you analysed the image? Did you know there was a confession? Would the examiner at the next desk, given your drive and nothing else, have written your report?
Yours, or the file's?
Counsel asks not whether the examiner is honest, but what they knew, and when.
"Before you formed your conclusion about this drive, had you read the suspect's interview, been told of a confession, or heard the lead investigator's theory of the case? And if the examiner at the next desk had been handed your image with none of that, can you tell this court they'd have written your report?"
The account logged in. Nobody saw who was at the keyboard.
On 17 October 2003, at Southwark Crown Court, Aaron Caffrey walked free. The prosecution said he had flooded a computer server at the Port of Houston with data and shut it down. His machine held the attack tools. His machine held the connection. The jury acquitted anyway, because Caffrey said unknown hackers had taken control of his computer and launched the attack to frame him. The prosecution's own expert, Professor Neil Barrett, told the court he could find no Trojan on the machine. Caffrey's answer was simple. You cannot test every file, and a Trojan could delete itself and leave no trace. The CPS casenote records the detail that mattered: this was "the first case where a Trojan virus defence has been raised without a trace of a Trojan virus being found." No Trojan, and he still walked.
Six months earlier, in April 2003 at Reading, Karl Schofield walked free too. He was charged over 14 indecent images of children on his PC. A defence expert found a real Trojan on the machine, installed, Schofield told the Reading Evening Post, the day before the images were downloaded. The prosecution accepted it probably did the downloading. Two cases, one lesson for the box. The file was there, the connection was there, and it still was not enough. Susan Brenner called this the Trojan horse defence. Lawyers know its older cousin as SODDI, Some Other Dude Did It. The device is not the person.
“This is the first case where a Trojan virus defence has been raised without a trace of a Trojan virus being found.”
It gets worse. Writing in 2007, Simson Garfinkel laid out anti-forensics as a field with goals, one of which, in words he quotes from Liu and Brown, is to "implicate an innocent party by planting data." The tools exist. Timestomp overwrites the NTFS create, modify, access and change timestamps. Transmogrify disguises a text file as an executable so EnCase skips it. Slacker hides data in slack space. An attacker can plant, forge, back-date and erase, and Garfinkel makes the obvious point that the examiner who does not account for anti-forensics is the one most likely to miss it.

The timestamps deserve their own warning. Céline Vanini, Christopher Hargreaves and colleagues titled their 2024 paper with the question you will be asked: "Was the clock correct?" Often you cannot assume it was. System time comes from the device's own clock, and that clock drifts, gets reset, runs down with a failing battery, or is skewed on purpose. In a controlled experiment they backdated a virtual machine's clock by about three hours, and every file and history timestamp on it then lied by three hours. Their fix is a "time anchor," an artifact that carries both the local system time and an independent external time, such as a server timestamp baked into a browser cache header or a Google search URL. With one, you can say the clock was probably correct at that moment. Without one, a file-creation time is just a number the machine asserts about itself.
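The time-anchor idea reduces to a subtraction. This is a minimal sketch under stated assumptions: the function names, the dates and the three-hour skew are hypothetical, and it treats the offset as constant between the anchor and the file event, which a real examination would have to justify.

```python
from datetime import datetime, timedelta, timezone


def clock_offset(local_time: datetime, external_time: datetime) -> timedelta:
    """How far the device clock was from an independent source at one moment.
    A negative value means the device clock was running behind."""
    return local_time - external_time


def corrected(timestamp: datetime, offset: timedelta) -> datetime:
    """Apply a known offset to a timestamp taken from the device clock."""
    return timestamp - offset


# Hypothetical anchor: a cached web response recorded by the device at
# 18:05, whose server-supplied time says 21:05.
anchor_local = datetime(2024, 3, 23, 18, 5, tzinfo=timezone.utc)
anchor_server = datetime(2024, 3, 23, 21, 5, tzinfo=timezone.utc)
offset = clock_offset(anchor_local, anchor_server)

# A file-creation time as asserted by the same device clock.
file_created_local = datetime(2024, 3, 23, 18, 20, tzinfo=timezone.utc)
file_created_estimate = corrected(file_created_local, offset)

print("device clock offset:", offset)
print("estimated true creation time:", file_created_estimate.isoformat())
```

The anchor only vouches for the clock near the moment it was recorded. The further a timestamp sits from any anchor, the weaker the claim that the same offset applied.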
Who, and when, really?
Counsel takes the report's file-creation time and asks what really stands behind it.
"Examiner, your report says my client created this file at 9:05 p.m. on the 23rd. You took that time from the computer's own clock. What independent source did you use to confirm that clock was correct at that moment, and if you have none, how do you exclude that someone backdated it, or that a Trojan put the file there while my client slept?"
The number you can't defend, and what to say instead
In a 2020 paper, Eoghan Casey works through a murder case and arrives at a likelihood ratio "on the order of 1,000,000." The phone's data is a million times more probable if the device was at the crime scene than if it was somewhere else. Then he stops and refuses to say it. The number, he writes, is "ill-conditioned": the alternative is so rare that a small wobble in one input probability swings the answer by orders of magnitude. Presenting that as a precise figure would "disguise the broad and subjective opinion... into a scientific-looking result." His recommended wording is not a number at all. The evidence supports the device being at Location X "so extremely strongly that it would be precarious or problematic to express a precise numerical likelihood ratio."
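What "ill-conditioned" means can be seen with invented numbers. The probabilities below are assumptions for illustration, not Casey's inputs; they are chosen so the middle estimate gives a ratio of about a million.

```python
def likelihood_ratio(p_evidence_given_h1: float, p_evidence_given_h2: float) -> float:
    """Probability of the evidence under H1 divided by its probability under H2."""
    return p_evidence_given_h1 / p_evidence_given_h2


# Assumed: the data is quite likely if the device was at the scene.
p_given_scene = 0.9

# Assumed: three defensible-looking estimates of how likely the same data
# would be if the device was elsewhere. Each is "very rare"; they differ
# only by factors of ten that nobody could measure.
alternative_estimates = (9e-6, 9e-7, 9e-8)

lrs = [likelihood_ratio(p_given_scene, p) for p in alternative_estimates]
for p, lr in zip(alternative_estimates, lrs):
    print(f"P(evidence | elsewhere) = {p:.0e}  ->  LR = {lr:,.0f}")
```

Every one of those estimates supports the same verbal conclusion, yet the ratio moves across two orders of magnitude. Quoting any single figure would claim a precision the inputs do not have.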
Truthful testimony about digital evidence almost never sounds like certainty. It sounds like strength of evidence, stated against a named alternative.
“...so extremely strongly that it would be precarious or problematic to express a precise numerical likelihood ratio.”
Casey's method is buildable. His standardisation paper sets out a seven-stage process: observe, form hypotheses, infer, predict, then run a second-phase search that actively looks for evidence contradicting each hypothesis, and only then assign strength. His worked example is an intellectual-property case. Messages missing from a phone could mean they never existed, were deleted, or were not recovered. A second tool then pulls back deleted message remnants, including an attachment named "customers.xlsx." Now he can assign strength, high for the deletion hypothesis and low for the others. You assign strength to the evidence given a hypothesis, never to the probability the hypothesis is true. That last step belongs to the court, and reaching for it is the transposed-conditional fallacy that gets experts taken apart.
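The transposed-conditional fallacy is easiest to see in the odds form of Bayes' rule, where posterior odds equal the likelihood ratio times the prior odds. The ratio of 1,000 and the three priors below are hypothetical; the sketch only shows that the same evidence strength leads to very different probabilities depending on a prior the examiner does not own.

```python
def posterior_probability(likelihood_ratio: float, prior_probability: float) -> float:
    """Combine a likelihood ratio with a prior using the odds form of Bayes' rule."""
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1 + posterior_odds)


# Assumed strength of evidence: the examiner's part of the calculation.
lr = 1000

# Assumed priors: these depend on everything else in the case and belong
# to the court, not the examiner.
priors = (0.5, 0.01, 0.0001)
posteriors = [posterior_probability(lr, p) for p in priors]

for prior, post in zip(priors, posteriors):
    print(f"prior {prior:<7} -> probability hypothesis is true: {post:.3f}")
```

With an even prior the hypothesis looks near certain; with a prior of one in ten thousand the same evidence leaves it under ten percent. An examiner who says "there is a 99.9% chance the messages were deleted" has silently picked a prior and presented the result as a finding.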

All of that assumes the underlying work is sound. Stoykova's 2022 study of the Norwegian police asks whether it is, and the answer was hard. Across 21 real homicide and sexual-assault cases, the digital forensic actions could not be traced to their source, not one case validated tool results or error rates, and across every acquisition exactly two recorded a hash value. This is what NAS 2009 and PCAST 2016 mean by validity as applied: not the tool in theory, but what was actually done, documented, and able to be reproduced. Stoykova found the document trail often was not there.
There is also the tool that will not behave the same way twice. Gougherty's 2024 test of a large language model on scientific reports found it 50 times faster than a human and over 90% accurate on simple categories. It also invented pathogen-incidence figures for 53 of 100 reports where the real answer was "not stated," and returned the wrong records for some queries without flagging it. It does not check that its own output is internally consistent. A forensic process has to be deterministic: same input, same result. A language model, even at "temperature zero," is a probability machine that will fill a gap with confidence. Bolt one into your workflow and you have imported the exact certainty-shaped error this whole reading warns against.
A rough ranking of how much scientific footing a digital-evidence claim stands on, from what is genuinely demonstrable to what the research warns you cannot defend. Relative footing only, not a metric.
“You testified the phone was at the scene. Can you express that conclusion as anything other than a bare assertion of certainty?”
- 01 "The computer says so" is a presumption, not a finding. The law treats the system as reliable until someone proves otherwise, which is the default that convicted the subpostmasters. Be ready to back it with the system's error logs, audits and change records, or limit what you claim.
- 02 No one can give you an error rate for your tool, and "not found" is not "not there." The head of NIST's own testing program says so. Claim only what the tool was validated to do, on this device and this operating system.
- 03 A matching hash proves the copy is faithful. It does not prove your extraction was complete or your reading correct. State your acquisition scope and its blind spots, including the encrypted partition and the cloud data you never pulled.
- 04 A second examiner, handed your drive, would often not write your report. Know what case information you saw, and when. A conclusion formed after you learned of the confession is one worth questioning.
- 05 The account is not the person, and the clock is not the truth. Show an external time anchor rather than the machine's own word, and be ready to say how you ruled out planted or back-dated data.
- 06 Do not hand the jury a certainty you cannot defend. Name your alternative, say how strongly the evidence points away from it, and leave the probability of guilt to the court.
Anything but bare certainty
You are on the stand. Counsel has saved the hardest framing for last.
"You testified the phone was at the scene. Can you express that conclusion as anything other than a bare assertion of certainty? What is your alternative hypothesis, how strong is the evidence against it, and how confident can you truly be?"
- Bohm, N., Brown, N., Christie, B., Ladkin, P. B., Littlewood, B., Marshall, S., Mason, S., Murdoch, S., Newby, M., Rogers, P., Thimbleby, H., & Thomas, M. (2022). The legal rule that computers are presumed to be operating correctly: Unforeseen and unjust consequences. Digital Evidence and Electronic Signature Law Review, 19, 123-149.
- Van Buskirk, E., & Liu, V. T. (2006). Digital evidence: Challenging the presumption of reliability. Journal of Digital Forensic Practice, 1(1), 19-26.
- Lyle, J. R. (2010). If error rate is such a simple concept, why don't I have one for my forensic tool yet? Digital Investigation, 7, S135-S139.
- Guttman, B., Lyle, J. R., & Ayers, R. (2014). Ten years of computer forensic tool testing. Digital Evidence and Electronic Signature Law Review, 8, 139-144.
- Horsman, G. (2018). "I couldn't find it your honour, it mustn't be there!" Tool errors, tool limitations and user error in digital forensics. Science & Justice, 58(6), 433-440.
- Horsman, G. (2019). Tool testing and reliability issues in the field of digital forensics. Digital Investigation, 28, 163-175.
- Stoykova, R., Andersen, S., Franke, K., & Axelsson, S. (2022). Reliability assessment of digital forensic investigations in the Norwegian police. Forensic Science International: Digital Investigation, 40, 301351.
- Rasjid, Z. E., Soewito, B., Witjaksono, G., & Abdurachman, E. (2017). A review of collisions in cryptographic hash function used in digital forensic tools. Procedia Computer Science, 116, 381-392.
- Sunde, N., & Dror, I. E. (2019). Cognitive and human factors in digital forensics: Problems, challenges, and the way forward. Digital Investigation, 29, 101-108.
- Sunde, N., & Dror, I. E. (2021). A hierarchy of expert performance (HEP) applied to digital forensics: Reliability and biasability in digital forensics decision making. Forensic Science International: Digital Investigation, 37, 301175.
- Brenner, S. W., Carrier, B., & Henninger, J. (2004). The Trojan horse defense in cybercrime cases. Santa Clara Computer and High Technology Law Journal, 21(1), 1-53.
- Garfinkel, S. (2007). Anti-forensics: Techniques, detection and countermeasures. Proceedings of the 2nd International Conference on i-Warfare and Security, 77-84.
- Vanini, C., Hargreaves, C., van Beek, H., & Breitinger, F. (2024). Was the clock correct? Exploring timestamp interpretation through time anchors for digital forensic event reconstruction. Forensic Science International: Digital Investigation, 49, 301759.
- Casey, E., Nelson, A., & Hyde, J. (2020). Standardization of forming and expressing preliminary evaluative opinions on digital evidence. Forensic Science International: Digital Investigation, 32, 200888.
- National Research Council. (2009). Strengthening Forensic Science in the United States: A Path Forward. Washington, DC: The National Academies Press.
- President's Council of Advisors on Science and Technology. (2016). Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods. Washington, DC: Executive Office of the President.