Building an Automated Machine for Discovering Privacy Violations at Scale

By Suchakra Sharma on 26 Jan 2023 @ Usenix
πŸ“Ή Video πŸ”— Link
#static-analysis #secure-coding #data-protection #security-development-lifecycle #sast
Focus Areas: πŸ”’ Data Privacy & Protection , πŸ” Application Security , βš™οΈ DevSecOps , 🦠 Malware Analysis

Presentation Material

Abstract

While the most advanced digital watch in 1980 asked us to manually enter and store our phone book on the watch, modern smartwatches send our GPS location and heartbeat every second to cloud machines we know nothing about! To tackle this void of information about where our data flows, various regulations and privacy frameworks have been developed. While privacy conversations involve multiple stakeholders such as lawyers and privacy officers, the onus eventually falls on developers to write code that respects those regulations, or to fix the issues that get introduced. In this talk we discuss how tried and tested static analysis techniques such as taint tracking and dataflow analysis can be applied to large code bases at scale to help fix privacy leaks right at the source. What does it take to build such tooling? What challenges do we face, and how can you, as a developer or privacy engineer, fix privacy bugs in code?

AI Generated Summary

The talk presented research on applying static analysis techniques to proactively identify privacy vulnerabilities, specifically the unintended leakage of personally identifiable information (PII) in software code. The core argument was that privacy issues originate in code during development and can be detected before deployment by analyzing data flows statically, rather than reacting to leaks at runtime.

The speaker detailed foundational static analysis concepts: identifying data sources (e.g., user input), identifying sinks (e.g., log files, external APIs), and tracking tainted data along the program paths between them. A key technique discussed was the Code Property Graph (CPG), a queryable graph representation that merges the abstract syntax tree, control flow graph, and program dependence relationships into a single structure, making it possible to ask complex questions about how data moves through a codebase.
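
As a rough illustration, the kind of question a CPG enables might look like the following Joern-shell query. This is a minimal sketch, not code from the talk: the identifier pattern, the sendSms call name, and the reliance on the shell's preloaded cpg object are assumptions.

```scala
// Run inside the Joern shell, where `cpg` is the loaded Code Property Graph
// and the data-flow overlay is available.

// Hypothetical PII sources: any identifier whose name looks like a phone number.
def piiSources = cpg.identifier.name("(?i).*phone.*")

// Hypothetical sinks: logging/printing calls and an SMS-sending function.
def leakSinks = cpg.call.name("(?i)(log.*|print.*|sendSms)")

// Ask the graph which sinks are reachable from the PII sources,
// then pretty-print the discovered flows.
leakSinks.reachableByFlows(piiSources).p
```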

The primary tool introduced was Privado Scan, an open-source scanner built on the Joern CPG implementation. Its architecture consists of: 1) a rule engine that uses declarative YAML files to define PII sources and sensitive sinks, 2) a graph-generation component that builds the CPG for target codebases, and 3) a CLI/UI that outputs findings as JSON or through a visualization dashboard. The tool also scans dependencies and configurations to map data flows from variables such as phone numbers or passwords to logging statements, databases, or third-party services (e.g., SMS APIs).
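
To make the rule-engine idea concrete, here is a hedged sketch of how a declarative source/sink rule could drive a CPG traversal. Privado Scan's real rules are YAML files; the Scala case class, its field names, and the rule id below are illustrative stand-ins rather than the tool's actual schema, and the snippet again assumes the Joern shell's cpg object.

```scala
// Illustrative only: Privado Scan's rules are YAML; this models one as a
// plain Scala value to show how a rule could be compiled into a CPG query.
case class Rule(id: String, sourcePatterns: List[String], sinkPatterns: List[String])

val phoneRule = Rule(
  id = "Data.Contact.PhoneNumber",                   // hypothetical rule id
  sourcePatterns = List("(?i).*phone.*"),            // identifiers that look like phone numbers
  sinkPatterns = List("(?i)(log.*|sendSms|save.*)")  // logging, SMS, and persistence calls
)

// Turn the rule into a data-flow query over the loaded CPG.
def flowsFor(rule: Rule) = {
  val sources = cpg.identifier.name(rule.sourcePatterns: _*)
  val sinks   = cpg.call.name(rule.sinkPatterns: _*)
  sinks.reachableByFlows(sources)
}

flowsFor(phoneRule).p  // print each discovered source-to-sink path
```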

A live demonstration scanned a sample banking application, automatically detecting 11 data elements and revealing specific flows, such as a phone number variable being passed to an SMS service, written to a console, and stored in a database. The practical implication is that developers can integrate privacy checks directly into the software development lifecycle, shifting privacy “left” to catch leaks while the code is being written. The approach lets engineers understand and remediate data exposure paths without executing the application, offering preemptive analysis rather than reactive runtime monitoring. The tool’s open-source nature and customizable rules support community-driven refinement, helping privacy analysis scale.
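
For context, the kind of flow the demo surfaced could look roughly like the snippet below. This is not code from the sample banking application; the names (NotificationService, SmsClient, AccountsDb, notifyCustomer) are hypothetical and chosen only to show a single phone-number variable reaching all three sinks.

```scala
// Hypothetical code illustrating the pattern found in the demo:
// one phone-number variable reaching a console log, an SMS API, and a database.
trait SmsClient  { def send(to: String, body: String): Unit }
trait AccountsDb { def saveContact(phone: String): Unit }

class NotificationService(sms: SmsClient, db: AccountsDb) {
  def notifyCustomer(phoneNumber: String, message: String): Unit = {
    println(s"Sending notification to $phoneNumber") // PII written to the console
    sms.send(phoneNumber, message)                   // PII sent to a third-party SMS service
    db.saveContact(phoneNumber)                      // PII stored in a database
  }
}
```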

Disclaimer: This summary was auto-generated from the video transcript using AI and may contain inaccuracies. It is intended as a quick overview β€” always refer to the original talk for authoritative content. Learn more about our AI experiments.