Do PDF Tools Conform To The Specification?

By Prashant Anantharaman on 07 Sep 2022 @ Nullcon
📊 Presentation 📹 Video 🔗 Link
#pdf #data-protection #static-analysis #security-testing #software-security #application-pentesting #input-validation
Focus Areas: 🔒 Data Privacy & Protection, 🔐 Application Security, ⚙️ DevSecOps, 🦠 Malware Analysis

Abstract

The PDF specification has been popular since the 1990s as a common data transmission format. However, as more tools have implemented the standard, they have also deviated from it in subtle ways. Until now, the true extent of these deviations has not been cataloged. In this talk, I present a type checker that strictly enforces the constraints of the PDF specification.

I also present SPARTA, a novel tool I built that generates Rust code to type-check Portable Document Format (PDF) files. Our PDF checker has contributed to at least four significant clarifications and corrections to the PDF 2.0 specification and to various open-source PDF tools. In addition to our checker, we also built a practical tool, PDFFixer, to dynamically patch type errors in PDF files.

AI Generated Summary

The talk applies language-theoretic security (LangSec) principles to PDF file validation, arguing that parsers must fully recognize and reject malformed input before processing. It presents Sparta, a Rust-based tool that enforces strict conformance to the PDF specification using a machine-readable format (the Arlington DOM). Sparta generates a type checker from the DOM to validate PDF structure and applies automated “reducer” rules to fix common errors, producing a normalized, canonicalized output.
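
As a rough illustration of the check-then-fix idea described above, the following Python sketch validates a dictionary-shaped object against a small hand-written rule table and then applies one reducer that inserts a missing key. The rule table, the `check` and `reduce_missing_count` names, and the use of Python are all assumptions made for illustration. Sparta itself generates Rust from the Arlington DOM, and its actual rules are not reproduced here.

```python
# Hypothetical rule table: key -> (expected Python type, required?).
# Loosely modeled on a PDF page tree node; not copied from the Arlington DOM.
PAGES_RULES = {
    "Type": (str, True),
    "Kids": (list, True),
    "Count": (int, True),
    "Parent": (dict, False),
}

def check(obj, rules):
    """Return a list of type errors for obj under rules (empty means it conforms)."""
    errors = []
    for key, (expected, required) in rules.items():
        if key not in obj:
            if required:
                errors.append(f"missing required key /{key}")
        elif not isinstance(obj[key], expected):
            errors.append(f"/{key} should be {expected.__name__}")
    return errors

def reduce_missing_count(obj):
    """Illustrative reducer: insert /Count when absent (assumes /Kids holds only leaf pages)."""
    fixed = dict(obj)
    if "Count" not in fixed and isinstance(fixed.get("Kids"), list):
        fixed["Count"] = len(fixed["Kids"])
    return fixed

# A page tree node that is missing its /Count entry.
node = {"Type": "Pages", "Kids": [{"Type": "Page"}, {"Type": "Page"}]}
before = check(node, PAGES_RULES)
after = check(reduce_missing_count(node), PAGES_RULES)
```

Here `before` reports the missing key and `after` is empty, which mirrors the validate, repair, and re-validate flow in miniature. A real checker would also need to resolve indirect references and nested page trees, which this sketch deliberately ignores.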

Key findings include pervasive parser differentials (different text extraction tools produce inconsistent results from the same PDF) and widespread structural violations in real-world files. An evaluation of over 100 GB of PDFs from sources such as Common Crawl and archive.org revealed frequent errors: missing mandatory keys in page tree nodes (affecting ~20,000 files), incorrect language metadata, null references, and font dictionary issues. LaTeX-generated PDFs were particularly error-prone. Sparta successfully auto-fixed many of these issues by inserting missing keys and simplifying cross-reference structures.
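
A parser differential of the kind described above can be detected by running several extractors on the same file and grouping them by output. The sketch below assumes the extracted text is already available as strings. The tool names and sample outputs are hypothetical placeholders, not results from the talk.

```python
from collections import defaultdict

def normalize(text):
    """Collapse whitespace so layout-only differences are not reported."""
    return " ".join(text.split())

def group_by_output(outputs):
    """Map each distinct normalized output to the sorted list of tools that produced it."""
    groups = defaultdict(list)
    for tool, text in outputs.items():
        groups[normalize(text)].append(tool)
    return {text: sorted(tools) for text, tools in groups.items()}

def has_differential(outputs):
    """True when at least two tools disagree on the extracted text."""
    return len(group_by_output(outputs)) > 1

# Hypothetical outputs from three extractors run on one PDF.
outputs = {
    "tool_a": "Hello  world\n",
    "tool_b": "Hello world",
    "tool_c": "Hel lo world",
}
groups = group_by_output(outputs)
```

In this example `tool_a` and `tool_b` agree once whitespace is normalized, while `tool_c` splits a word differently, so `has_differential(outputs)` is true. How aggressively to normalize is a design choice: too little flags harmless layout differences, too much can hide real disagreements.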

Practical implications stress the necessity of separating parsing from business logic.

Disclaimer: This summary was auto-generated from the video transcript using AI and may contain inaccuracies. It is intended as a quick overview; always refer to the original talk for authoritative content.