EU AI Footprint Scanner

AST-based static analysis that detects AI/ML library use across a Python codebase, classified into simplified EU AI Act risk tiers. The first product I've shipped under Argus Intelligence.

status: shipped
started: 2026-03
updated: 2026-06
tags: AI/ML · Tools
repo: github.com/zhenee/argus-ai-footprint-scanner

What

A Python static analyser that walks the Abstract Syntax Tree of a codebase, finds every AI/ML library import and call, and classifies each finding into a three-tier risk model loosely inspired by the EU AI Act. Output is structured JSON listing every finding with file path, line number, library name, and risk tier — designed to feed straight into a compliance review.
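
To make the output shape concrete, here is a minimal sketch of how findings with the four fields named above might be serialised. The `Finding` class, its field names, and the top-level `findings` key are assumptions for illustration; the scanner's actual JSON schema may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Finding:
    """One detected AI/ML library use (field names are illustrative)."""
    file: str
    line: int
    library: str
    risk_tier: str


def to_report(findings):
    """Serialise findings to a JSON document, ordered by file then line."""
    ordered = sorted(findings, key=lambda f: (f.file, f.line))
    return json.dumps({"findings": [asdict(f) for f in ordered]}, indent=2)


if __name__ == "__main__":
    # Hypothetical paths and line numbers, not real scan results.
    print(to_report([
        Finding("app/features.py", 1, "numpy", "MINIMAL"),
        Finding("app/chat.py", 3, "openai", "HIGH"),
    ]))
```

A flat list of records like this is easy to filter by tier or group by file in whatever tooling a compliance review already uses.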

It ships in two forms: a CLI for ad-hoc and local runs, and a GitHub App that runs the same analysis on every pull request. The GitHub App is live — it posts findings as a GitHub Check Run plus a summary comment on the PR — and is sold as a subscription via Lemon Squeezy under Argus Intelligence.

Why

Many EU SMEs have no clear picture of what AI/ML code is running inside their products. Most of the EU AI Act’s provisions start to apply on 2 August 2026 (the GPAI obligations have applied since 2 August 2025), and the audit asks “what AI are you using?” before anything else. Manual code audits are slow, error-prone, and require specialised knowledge. The existing compliance tools are aimed at large enterprises with dedicated GRC teams, which leaves a real gap for a pragmatic, engineering-grade tool that fits in CI.

It’s also the first commercial product I’ve shipped under Argus Intelligence.

How

Pure-Python stack. An AST visitor built on the stdlib ast module, PyYAML for the risk-definition config, and pytest for the test harness. No heavy framework overhead, so the scanner starts quickly from cold.
AST-based detection rather than string matching — catches import openai and chained calls like openai.ChatCompletion.create() without false positives from comments or docstrings. Recursive attribute traversal handles dotted imports (google.generativeai) and walks back to the base name for chained calls.
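
The detection approach described above can be sketched with the stdlib `ast.NodeVisitor`. This is a minimal illustration, not the scanner's actual code: `WATCHED`, `dotted_name`, `watched_prefix`, `AIUsageVisitor`, and `scan` are hypothetical names, and the alias handling is one plausible way to resolve `import ... as ...`.

```python
import ast
from typing import Optional

# Hypothetical watch list; per the text, the real one comes from YAML config.
WATCHED = {"openai", "google.generativeai"}


def dotted_name(node: ast.AST) -> Optional[str]:
    """Rebuild 'a.b.c' by walking Attribute nodes back to the base Name."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None  # chain starts from a call result, subscript, etc.


def watched_prefix(name: str) -> Optional[str]:
    """Longest watched library that `name` equals or is nested under."""
    parts = name.split(".")
    for end in range(len(parts), 0, -1):
        candidate = ".".join(parts[:end])
        if candidate in WATCHED:
            return candidate
    return None


class AIUsageVisitor(ast.NodeVisitor):
    """Collects (line, kind, library) tuples for watched imports and calls."""

    def __init__(self) -> None:
        self.findings = []
        self.aliases = {}  # local name -> fully qualified name it refers to

    def visit_Import(self, node: ast.Import) -> None:
        for alias in node.names:
            if alias.asname:
                self.aliases[alias.asname] = alias.name
            self._record(node.lineno, "import", alias.name)

    def visit_ImportFrom(self, node: ast.ImportFrom) -> None:
        for alias in node.names:
            full = f"{node.module or ''}.{alias.name}"
            self.aliases[alias.asname or alias.name] = full
            self._record(node.lineno, "import", full)

    def visit_Call(self, node: ast.Call) -> None:
        name = dotted_name(node.func)
        if name:
            base, dot, rest = name.partition(".")
            resolved = self.aliases.get(base, base) + dot + rest
            self._record(node.lineno, "call", resolved)
        self.generic_visit(node)  # calls can be nested inside arguments

    def _record(self, line: int, kind: str, name: str) -> None:
        lib = watched_prefix(name)
        if lib:
            self.findings.append((line, kind, lib))


def scan(source: str):
    """Parse source text (never executing it) and return the findings."""
    visitor = AIUsageVisitor()
    visitor.visit(ast.parse(source))
    return visitor.findings


SAMPLE = '''
# import openai  (a comment: never reaches the AST)
import openai
import google.generativeai as genai

def ask(prompt):
    """Mentions openai.ChatCompletion.create() but is only a docstring."""
    model = genai.GenerativeModel("placeholder-model")
    return openai.ChatCompletion.create(model="placeholder", messages=[prompt])
'''

if __name__ == "__main__":
    for finding in scan(SAMPLE):
        print(finding)
```

Because comments are dropped by the parser and a docstring is a string constant rather than an Import or Call node, neither can produce a finding, which is the advantage over string matching.
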
Three-tier risk schema:
- HIGH — generative LLMs and direct AI APIs (openai, anthropic, cohere, gemini, mistral, ollama)
- LIMITED — foundation-model frameworks and orchestration (langchain, transformers, tensorflow, pytorch)
- MINIMAL — classical ML and numerical computing (scikit-learn, numpy, pandas)
Configurable. Risk definitions live in a YAML file. Customers can add internal libraries to a tier or move libraries between tiers without touching the scanner code.
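
As a rough illustration of config-driven classification, the sketch below uses a plain dict standing in for what `yaml.safe_load()` would return from the risk-definition file. The tier-to-library layout, the function names, and the internal library `acme_llm_client` are assumptions; the real `risk_definitions.yml` schema may look different.

```python
# Assumed YAML layout (illustrative only):
#
#   HIGH: [openai, anthropic, cohere]
#   LIMITED: [langchain, transformers]
#   MINIMAL: [numpy, pandas]
#
# yaml.safe_load() on such a file would yield a dict like this one.
DEFAULT_TIERS = {
    "HIGH": ["openai", "anthropic", "cohere"],
    "LIMITED": ["langchain", "transformers"],
    "MINIMAL": ["numpy", "pandas"],
}

# A customer-edited variant: transformers moved up to HIGH, and a
# hypothetical internal library added, with no scanner code changed.
CUSTOM_TIERS = {
    "HIGH": ["openai", "anthropic", "cohere", "transformers", "acme_llm_client"],
    "LIMITED": ["langchain"],
    "MINIMAL": ["numpy", "pandas"],
}

TIER_ORDER = ["HIGH", "LIMITED", "MINIMAL"]  # strictest first


def build_lookup(tiers):
    """Invert {tier: [libraries]} into {library: tier}.

    Tiers are applied least strict first, so if a library is listed
    twice by mistake the stricter tier wins.
    """
    lookup = {}
    for tier in reversed(TIER_ORDER):
        for library in tiers.get(tier, []):
            lookup[library] = tier
    return lookup


def classify(module, lookup):
    """Tier for a dotted module path, matched on its longest known prefix."""
    parts = module.split(".")
    for end in range(len(parts), 0, -1):
        tier = lookup.get(".".join(parts[:end]))
        if tier:
            return tier
    return None  # not an AI/ML library the config knows about


if __name__ == "__main__":
    default = build_lookup(DEFAULT_TIERS)
    custom = build_lookup(CUSTOM_TIERS)
    for module in ("openai", "transformers", "acme_llm_client.api", "requests"):
        print(module, classify(module, default), classify(module, custom))
```

Keeping the mapping in data rather than code means a tier change is a one-line config edit that can be reviewed like any other change.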

Honest limitations

It’s a technical risk-discovery tool, not a legal compliance certification. The output is engineering input to a compliance review — not a substitute for one. This framing is on every page of the product site and matters more than every feature combined.
Doesn’t detect dynamic imports like importlib.import_module("openai") — known limitation, may add later.
Library-centric, not data-flow. It tells you what’s imported, not what’s being done with it. A compliance review still needs human judgement about how the library is used.
The three-tier model is engineering shorthand, not a literal Annex I/III mapping of the EU AI Act. Useful as a starting point, not a final classification.
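
To illustrate the dynamic-import limitation listed above, the sketch below contrasts what ordinary import statements reveal with what a literal-argument check could additionally catch. It is a hypothetical illustration of one possible approach, not a feature of the scanner; the function names are invented here, and module names computed at runtime would stay invisible to any static pass.

```python
import ast


def static_imports(tree):
    """Module names taken from ordinary import statements only."""
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module)
    return names


def literal_dynamic_imports(tree):
    """Module names passed as string literals to importlib.import_module()
    or __import__(). Names built at runtime are not recoverable here."""
    names = set()
    for node in ast.walk(tree):
        if not (isinstance(node, ast.Call) and node.args):
            continue
        func = node.func
        is_import_module = (
            isinstance(func, ast.Attribute)
            and func.attr == "import_module"
            and isinstance(func.value, ast.Name)
            and func.value.id == "importlib"
        )
        is_dunder_import = isinstance(func, ast.Name) and func.id == "__import__"
        first = node.args[0]
        if (
            (is_import_module or is_dunder_import)
            and isinstance(first, ast.Constant)
            and isinstance(first.value, str)
        ):
            names.add(first.value)
    return names


SAMPLE = '''
import importlib
client = importlib.import_module("openai")
name = "anth" + "ropic"
other = importlib.import_module(name)
'''

if __name__ == "__main__":
    tree = ast.parse(SAMPLE)
    print(sorted(static_imports(tree)))
    print(sorted(literal_dynamic_imports(tree)))
```

In this sample the import-statement pass sees only `importlib`, the literal check recovers `openai`, and the module whose name is concatenated at runtime is missed by both.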

What’s next

GitLab CI support in v1.1.
An org-level dashboard for teams scanning across multiple repositories.
Continue expanding risk_definitions.yml to cover more libraries as the AI ecosystem moves.