
EU AI Footprint Scanner

AST-based static analysis that detects AI/ML library use across a Python codebase, classified into simplified EU AI Act risk tiers. The first product I'm shipping under Argus Intelligence.

status: active
started: 2026-03
updated: 2026-04
tags: AI/ML · Tools

What

A Python static analyser that walks the Abstract Syntax Tree of a codebase, finds every AI/ML library import and call, and classifies each finding into a three-tier risk model loosely inspired by the EU AI Act. Output is structured JSON listing every finding with file path, line number, library name, and risk tier — designed to feed straight into a compliance review.
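As an illustration of that output, a single finding might serialise as shown in the sketch below. The `Finding` class, the `to_report` helper, and the field names (`file`, `line`, `library`, `risk_tier`) are assumptions made for this example, not the scanner's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Finding:
    """One detected AI/ML library use (hypothetical field names)."""
    file: str
    line: int
    library: str
    risk_tier: str


def to_report(findings: list[Finding]) -> str:
    """Serialise findings as JSON, sorted by file and line for stable diffs."""
    ordered = sorted(findings, key=lambda f: (f.file, f.line))
    return json.dumps({"findings": [asdict(f) for f in ordered]}, indent=2)


report = to_report([
    Finding("app/features.py", 1, "numpy", "MINIMAL"),
    Finding("app/chat.py", 3, "openai", "HIGH"),
])
print(report)
```

Sorting before serialising keeps the report deterministic, which makes it easier to diff two scans of the same repository.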

Currently shipped as a CLI; the next milestone is a GitHub App that runs the same analysis on every pull request and posts findings as a PR comment.
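A PR comment could plausibly be rendered from the same findings as a small Markdown table. The sketch below covers only the formatting step; `render_pr_comment` and the dictionary keys are hypothetical, and posting the comment through the GitHub API is deliberately left out.

```python
def render_pr_comment(findings: list[dict]) -> str:
    """Render findings as a Markdown table suitable for a pull-request comment."""
    if not findings:
        return "No AI/ML library usage detected in this pull request."
    rows = [
        "| File | Line | Library | Risk tier |",
        "| --- | --- | --- | --- |",
    ]
    # Sort so the comment is stable across runs on the same commit.
    for f in sorted(findings, key=lambda f: (f["file"], f["line"])):
        rows.append(
            f"| `{f['file']}` | {f['line']} | `{f['library']}` | {f['risk_tier']} |"
        )
    return "\n".join(rows)


print(render_pr_comment([
    {"file": "app/chat.py", "line": 3, "library": "openai", "risk_tier": "HIGH"},
]))
```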

Why

Most EU SMEs have no idea what AI/ML code is running inside their products. The EU AI Act becomes generally applicable on 2 August 2026 (its GPAI obligations have applied since 2 August 2025), and any audit starts with the question “what AI are you using?”. Manual code audits are slow, error-prone, and require specialised knowledge. The existing compliance tools are aimed at large enterprises with dedicated GRC teams, which leaves a real gap for a pragmatic, engineering-grade tool that fits in CI.

It’s also the first commercial product I’m shipping under Argus Intelligence.

How

  • Pure-Python stack. An AST visitor built on the stdlib ast module, PyYAML for the risk-definition config, and pytest for the test harness. No heavy framework overhead, so the CLI starts quickly.
  • AST-based detection rather than string matching — catches import openai and chained calls like openai.ChatCompletion.create() without false positives from comments or docstrings. Recursive attribute traversal handles dotted imports (google.generativeai) and walks back to the base name for chained calls.
  • Three-tier risk schema:
    • HIGH — generative LLMs and direct AI APIs (openai, anthropic, cohere, gemini, mistral, ollama)
    • LIMITED — foundation-model frameworks and orchestration (langchain, transformers, tensorflow, pytorch)
    • MINIMAL — classical ML and numerical computing (scikit-learn, numpy, pandas)
  • Configurable. Risk definitions live in a YAML file. Customers can add internal libraries to a tier or move libraries between tiers without touching the scanner code.
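The detection approach described in this list can be sketched with the stdlib `ast` module alone. This is a minimal illustration, not the scanner's real code: `RISK_TIERS`, `match_library`, `base_name`, `AIUsageVisitor`, and `scan` are hypothetical names, and the tier mapping is hard-coded here where the real tool would load it from the YAML risk-definition file.

```python
import ast
from typing import Optional

# Hypothetical tier mapping keyed by import path (assumption for this sketch).
RISK_TIERS = {
    "openai": "HIGH",
    "google.generativeai": "HIGH",
    "langchain": "LIMITED",
    "numpy": "MINIMAL",
}


def match_library(dotted: str) -> Optional[str]:
    """Return the longest known prefix of a dotted module path, if any."""
    parts = dotted.split(".")
    for end in range(len(parts), 0, -1):
        candidate = ".".join(parts[:end])
        if candidate in RISK_TIERS:
            return candidate
    return None


def base_name(node: ast.AST) -> Optional[str]:
    """Walk an attribute chain such as a.b.c back to its leftmost name."""
    while isinstance(node, ast.Attribute):
        node = node.value
    return node.id if isinstance(node, ast.Name) else None


class AIUsageVisitor(ast.NodeVisitor):
    def __init__(self) -> None:
        self.aliases: dict[str, str] = {}  # local name -> library
        self.findings: list[tuple[int, str, str]] = []  # (line, library, tier)

    def _record(self, lineno: int, library: str) -> None:
        self.findings.append((lineno, library, RISK_TIERS[library]))

    def visit_Import(self, node: ast.Import) -> None:
        for alias in node.names:
            library = match_library(alias.name)
            if library:
                # "import a.b" binds "a"; "import a.b as c" binds "c".
                local = alias.asname or alias.name.split(".")[0]
                self.aliases[local] = library
                self._record(node.lineno, library)

    def visit_ImportFrom(self, node: ast.ImportFrom) -> None:
        library = match_library(node.module or "")
        if library:
            for alias in node.names:
                self.aliases[alias.asname or alias.name] = library
            self._record(node.lineno, library)

    def visit_Call(self, node: ast.Call) -> None:
        name = base_name(node.func)
        if name in self.aliases:
            self._record(node.lineno, self.aliases[name])
        self.generic_visit(node)  # keep descending into nested calls


def scan(source: str) -> list[tuple[int, str, str]]:
    visitor = AIUsageVisitor()
    visitor.visit(ast.parse(source))
    return visitor.findings


SAMPLE = '''\
# import langchain  (a comment, so not a finding)
import openai
import numpy as np

"""Docstring mentioning langchain is ignored too."""
reply = openai.ChatCompletion.create(model="m", messages=[])
arr = np.zeros(3)
'''

for finding in scan(SAMPLE):
    print(finding)
```

On the sample, this reports the two imports (lines 2 and 3) and the two calls (lines 6 and 7), while the commented-out import and the docstring produce nothing, because neither appears in the tree as an import or call node. The sketch does not resolve `from google import generativeai`, which a fuller version would need to handle.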

Honest limitations

  • It’s a technical risk-discovery tool, not a legal compliance certification. The output is engineering input to a compliance review, not a substitute for one. This framing appears on every page of the product site and matters more than all the features combined.
  • Doesn’t detect dynamic imports such as importlib.import_module("openai"). This is a known limitation; support may be added later.
  • Library-centric, not data-flow. It tells you what’s imported, not what’s being done with it. A compliance review still needs human judgement about how the library is used.
  • The three-tier model is engineering shorthand, not a literal Annex I/III mapping of the EU AI Act. Useful as a starting point, not a final classification.
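The dynamic-import gap could be narrowed for the simplest case, where a string literal is passed straight to `importlib.import_module` or `__import__`. The sketch below is one possible approach rather than a planned design; `literal_dynamic_imports` is a hypothetical helper, and it matches on the function name only, so it is a heuristic. Module names computed at runtime remain out of reach for static analysis.

```python
import ast

DYNAMIC_IMPORT_FUNCS = {"import_module", "__import__"}


def literal_dynamic_imports(source: str) -> list[tuple[int, str]]:
    """Return (line, module) for dynamic imports whose first argument is a string literal."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call) or not node.args:
            continue
        func = node.func
        # Covers importlib.import_module(...), import_module(...), and __import__(...).
        called = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        first = node.args[0]
        if (
            called in DYNAMIC_IMPORT_FUNCS
            and isinstance(first, ast.Constant)
            and isinstance(first.value, str)
        ):
            hits.append((node.lineno, first.value))
    return sorted(hits)  # ast.walk is breadth-first, so restore source order


SAMPLE = '''\
import importlib
client = importlib.import_module("openai")
name = "anth" + "ropic"
other = importlib.import_module(name)  # computed name: still missed
'''

print(literal_dynamic_imports(SAMPLE))  # [(2, 'openai')]
```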

What’s next

  • Build the GitHub App for CI integration, so findings are posted as PR comments instead of depending on ad-hoc runs on a developer’s laptop.
  • Add GitLab CI support in v1.1.
  • Continue expanding risk_definitions.yml to cover more libraries as the AI ecosystem moves.
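Because tier membership lives in risk_definitions.yml, covering a new library is a data change rather than a code change. The sketch below assumes a `tiers:` layout for that file and a layered override mechanism; both are assumptions for illustration, as the real schema is not shown here. The dictionaries stand in for what PyYAML's `yaml.safe_load` would return.

```python
# Assumed shape of risk_definitions.yml once parsed (hypothetical schema):
#
#   tiers:
#     HIGH: [openai, anthropic]
#     LIMITED: [langchain]
#     MINIMAL: [numpy]
#
DEFAULTS = {
    "tiers": {
        "HIGH": ["openai", "anthropic"],
        "LIMITED": ["langchain"],
        "MINIMAL": ["numpy"],
    }
}

# A customer override: add an internal library and move langchain up a tier.
# "acme_llm" is a made-up package name.
OVERRIDES = {"tiers": {"HIGH": ["acme_llm", "langchain"]}}


def build_lookup(*configs: dict) -> dict[str, str]:
    """Flatten tier -> [libraries] configs into library -> tier; later configs win."""
    lookup: dict[str, str] = {}
    for config in configs:
        for tier, libraries in config.get("tiers", {}).items():
            for library in libraries:
                lookup[library] = tier
    return lookup


lookup = build_lookup(DEFAULTS, OVERRIDES)
print(lookup["langchain"])  # HIGH (moved by the override)
print(lookup["acme_llm"])   # HIGH (added by the override)
print(lookup["numpy"])      # MINIMAL (unchanged default)
```

Inverting the config into a library-to-tier dictionary once at start-up keeps each lookup during the AST walk to a single dictionary access.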