Hi HN,
I'm one of the creators of HoundDog.ai (https://github.com/hounddogai/hounddog). We currently handle privacy scanning for Replit's 45M+ creators.
We built HoundDog because privacy compliance is usually a choice between manual spreadsheets and reactive runtime scanning. Runtime tools are useful for monitoring, but they only catch leaks after the code is live and the data has already moved, and they can miss code paths that aren't actively triggered in production.
HoundDog traces sensitive data in code during development and helps catch risky flows (e.g., PII leaking into logs or unapproved third-party SDKs) before the code is shipped.
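To make that concrete, here is a toy example of the kind of flow we mean (the names and code are illustrative, not taken from our docs or rules): a field holding PII gets written into a log line, which runtime monitoring only notices if that code path actually runs.

    import logging

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    def register_user(email: str, ssn: str) -> dict:
        account = {"email": email, "ssn": ssn}
        # Risky flow: PII ends up in application logs. A runtime monitor
        # only sees this if the path is exercised; a static scan flags it
        # at review time, before the code ships.
        logger.info("created account for %s (ssn=%s)", email, ssn)
        return account

    register_user("jane@example.com", "123-45-6789")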
The core scanner is a standalone Rust binary. It doesn't use LLMs, so it's local, deterministic, cheap, and fast. It can scan 1M+ lines of code in seconds on a standard laptop, and it supports 80+ sensitive data types (PII, PHI, CHD) and hundreds of data sinks (logs, SDKs, APIs, ORMs, etc.) out of the box.
We use AI internally to expand and scale our rules, identifying new data sources and sinks, but the execution is pure static analysis.
The scanner is free to use (no signups) so please try it out and send us feedback. I'll be around to answer any questions!
Is this looking for PII in my code, or trying to understand the code logic that handles PII?
Thanks for your question. I am one of the co-founders. It is the latter. We analyze the names of functions, methods, and variables to detect likely Personally Identifiable Information (PII), Protected Health Information (PHI), Cardholder Data (CHD), and authentication tokens using well-tuned patterns and language-specific rules. You can see the full list here: https://github.com/hounddogai/hounddog/blob/main/data-elemen...
When we find a match, we trace that data through the codebase across different paths and transformations, including reassignment, helper functions, and nested calls. We then identify where the data ultimately ends up, such as third-party SDKs (e.g. Stripe, Datadog, OpenAI), exposures through API protocols like REST, GraphQL, or gRPC, and functions that write to logs or local storage. Here's a list of all supported data sinks: https://github.com/hounddogai/hounddog/blob/main/data-sinks....
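As a rough sketch of what tracing across paths and transformations means in practice (the identifiers and the analytics client below are made up for illustration, not actual detections): the PII is renamed and passed through a helper before it reaches the sink, so a plain text search for the field name at the call site wouldn't find the flow.

    import json

    class AnalyticsClient:  # stand-in for a third-party SDK
        def track(self, payload: str) -> None:
            print("sending to vendor:", payload)

    analytics = AnalyticsClient()

    def build_event(props: dict) -> str:
        return json.dumps(props)

    def checkout(customer_email: str) -> None:
        contact = customer_email                    # reassignment
        event = build_event({"contact": contact})   # helper function
        analytics.track(event)                      # third-party data sink

    checkout("jane@example.com")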
Most privacy frameworks, including GDPR and the US privacy frameworks, require these flows to be documented, so we use your source code as the source of truth to keep privacy notices accurate and aligned with what the software is actually doing.
Interesting. Can I use the output to document sub-processors in our privacy notice, or is it specific to GDPR reporting only?
Great question. It is not limited to GDPR. The same output can be used to document sub-processors in your privacy notice and vendor disclosures, since it shows which third parties and services personal data flows to in your code.
Cool. Why not use LLMs for this kind of analysis? Cost or something else?
LLMs can find issues that traditional SAST misses, but today they are slow, expensive, and nondeterministic. SAST is fast and cheap, but requires heavy manual rule maintenance. Our approach combines the strengths of both. The scanning engine is fully rule-based and deterministic, with a rule language expressive enough to model code at compiler-level accuracy. AI is used only to generate broad rule coverage across thousands of patterns, without sacrificing scan performance or reliability.
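For a rough intuition of why the rule-based half stays deterministic and cheap (this is a toy sketch, not our actual rule language): the rules boil down to plain data, e.g. patterns over identifier names, so every scan produces the same result, and the expensive part, authoring and expanding those patterns, is where AI helps offline.

    import re

    # Toy rules: data element label -> pattern over identifier names.
    DATA_ELEMENT_RULES = {
        "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
        "ssn": re.compile(r"ssn|social[-_]?security", re.IGNORECASE),
    }

    def classify_identifier(name: str) -> list[str]:
        """Return the sensitive data types an identifier name likely refers to."""
        return [label for label, pattern in DATA_ELEMENT_RULES.items()
                if pattern.search(name)]

    print(classify_identifier("customer_email"))      # ['email']
    print(classify_identifier("social_security_no"))  # ['ssn']
    print(classify_identifier("order_total"))         # []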