DOM-first with vision fallback: architecture of a production browser agent

A two-phase design — cheap DOM locators first, vision-model fallback when the DOM doesn't answer — that made a browser-automation agent fast enough to run at scale and robust enough to handle canvas, WebGL, and selectorless UIs.

Published 2026-04-20 · AI/ML, agents, browser automation

Most browser-automation tutorials stop at “use Playwright, write a selector, click.” That works until it doesn’t, and the doesn’t cases are expensive in practice: canvas / WebGL views that expose no DOM at all, visually rendered components that skip semantic roles, and whatever JavaScript framework is hiding its clickable targets behind shadow DOM this week.

The opposite extreme, “screenshot the page, ask a vision model what to click,” works on those targets, but each call costs real money and takes real seconds. Doing every interaction that way is fine for a demo and ruinous at any production cadence.

The design that kept me sane was a two-phase loop: cheap DOM locators first, vision-model fallback only when the DOM doesn’t have the answer. That sounds obvious. The interesting part is the fallback protocol: when to give up on DOM, how to structure the vision call so it doesn’t hallucinate, and how to cache the expensive half so most runs never trigger it.

The loop

Per interaction step:

Phase 1: DOM. Run a short cascade of free locators: visible text match, ARIA role match, a few structural heuristics for the framework on the page. If the cascade resolves to exactly one element, click and move on. Zero API calls, low single-digit milliseconds. On reasonably-built web UIs this is where 70–80% of interactions land.
Phase 2: vision. Taken when Phase 1 returned zero candidates, or more than one, or the “clickable” thing turned out to be a canvas. Screenshot the viewport, send it to a vision model with a structured prompt, and execute the returned action.
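A minimal sketch of how the Phase 1 decision could be structured, under my own assumptions: Candidate, Locator, and resolveDom are hypothetical names rather than Playwright APIs, and “first locator in the cascade that yields exactly one non-canvas element wins” is one reading of the rule above, not necessarily the exact one used in production.

```typescript
// Hypothetical shapes for this sketch; none of these come from a real library.
type Candidate = { id: string; isCanvas: boolean };
type Locator = { name: string; find: (target: string) => Candidate[] };

type FallbackReason = 'no_match' | 'ambiguous' | 'canvas';
type Phase1Result =
  | { kind: 'hit'; element: Candidate; via: string }
  | { kind: 'fallback'; reason: FallbackReason };

// Walk the cascade in order. The first locator that resolves to exactly one
// non-canvas element wins. Otherwise the most recent non-empty result
// decides the reason handed to Phase 2.
function resolveDom(target: string, cascade: Locator[]): Phase1Result {
  let reason: FallbackReason = 'no_match';
  for (const locator of cascade) {
    const found = locator.find(target);
    const first = found[0];
    if (first === undefined) continue; // zero candidates from this locator
    if (found.length > 1) {
      reason = 'ambiguous';
      continue;
    }
    if (first.isCanvas) {
      reason = 'canvas';
      continue;
    }
    return { kind: 'hit', element: first, via: locator.name };
  }
  return { kind: 'fallback', reason };
}
```

Returning the fallback reason alongside the miss keeps the Phase 2 prompt informed: an ambiguous match and a canvas target probably deserve different wording in the goal passed to the vision model.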

Phase 2 is the part people get wrong.

CLASSIFY-then-ACT

The naïve vision prompt — “here’s a screenshot, the goal is to click the primary action; return coordinates” — hallucinates confidently. Vision models will cheerfully click a static hero image of a button, a menu icon that sort of looks right, or a disabled CTA behind a modal. The coordinates come back plausible every time, including when the right target isn’t on screen at all.

The fix is to split the decision into two structured fields and have the model commit to the first before reasoning about the second. The provider contract ends up looking like this:

interface VisionProvider {
  decide(screenshot: Buffer, goal: string): Promise<Decision>;
}

// Two-stage structured output in one call
type Decision = {
  screen: Screen;   // what am I looking at?
  action: Action;   // what should I do, given the screen class?
};

type Screen =
  | 'modal'        // a dialog blocking the main content
  | 'loading'      // transient spinner; wait, don't click
  | 'permission'   // cookie banner, consent, age gate
  | 'content'      // the app itself
  | 'unknown';     // bail out; human review

type Action =
  | { type: 'click'; x: number; y: number; label: string }
  | { type: 'wait'; durationMs: number; reason: string }
  | { type: 'give_up'; reason: string };

The screen classification front-loads the interpretive work. When the model commits to screen: 'loading', it can no longer coherently return action: 'click' on a spinner, because the click would contradict the classification it has just emitted. Note that the types above allow any screen/action pairing, so this is a property of the model’s output order, not something the compiler checks. Hallucinations drop noticeably. The ones that remain tend to be legitimate ambiguity (two reasonable primary actions on a real content screen), which is the class of error where you want a human in the loop anyway.
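Because the Decision type permits any screen/action pairing, one option is a small runtime check on the model’s answer before acting on it. The sketch below is an assumption on my part, not part of the original contract: the ALLOWED table and checkDecision are hypothetical names, the rows for 'loading' and 'unknown' follow the comments on the Screen type, and the remaining rows are guesses.

```typescript
type Screen = 'modal' | 'loading' | 'permission' | 'content' | 'unknown';
type Action =
  | { type: 'click'; x: number; y: number; label: string }
  | { type: 'wait'; durationMs: number; reason: string }
  | { type: 'give_up'; reason: string };
type Decision = { screen: Screen; action: Action };

// Which action types are acceptable for each screen class.
// 'loading' (wait, don't click) and 'unknown' (bail out) mirror the type
// comments; the other rows are assumptions for illustration.
const ALLOWED: Record<Screen, ReadonlyArray<Action['type']>> = {
  modal: ['click', 'give_up'],
  loading: ['wait', 'give_up'],
  permission: ['click', 'give_up'],
  content: ['click', 'wait', 'give_up'],
  unknown: ['give_up'],
};

// Pass consistent decisions through unchanged; downgrade inconsistent ones
// to give_up so they surface for review instead of being executed.
function checkDecision(d: Decision): Decision {
  const ok = ALLOWED[d.screen].some((t) => t === d.action.type);
  if (ok) return d;
  return {
    screen: d.screen,
    action: {
      type: 'give_up',
      reason: `inconsistent: ${d.action.type} on ${d.screen} screen`,
    },
  };
}
```

A check like this does not make the model more accurate; it only guarantees that a contradictory answer is never turned into a click.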

Caching the expensive half

The same button on the same screen has the same coordinates most of the time. So you cache Phase 2’s output, keyed on (target, screen_fingerprint), with a TTL.

A 14-day TTL fit my workload. What matters is that on repeat runs the cache hits before the vision call fires, which means the steady-state cost of the agent is close to zero — the $0.05-per-session bill only shows up on cold-cache, new-screen runs. For anything that repeats, the amortised cost collapses to the DOM-phase cost, which is free.

The cache-invalidation heuristic I settled on: bust the entry when Phase 2’s action fails (click landed on nothing, or the post-click screen doesn’t match expectations). You do eat one extra vision call per stale entry, but you keep the cache usefully fresh without having to instrument every upstream change.
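The cache described above can be sketched as a keyed map with a TTL and an explicit invalidate hook for failed actions. This is an illustrative in-memory version under stated assumptions: DecisionCache and its method names are hypothetical, the screen fingerprint is treated as an opaque string (how it is computed is not covered here), and a real deployment would presumably persist entries across runs.

```typescript
type Action =
  | { type: 'click'; x: number; y: number; label: string }
  | { type: 'wait'; durationMs: number; reason: string }
  | { type: 'give_up'; reason: string };

type CachedEntry = { action: Action; storedAt: number };

const FOURTEEN_DAYS_MS = 14 * 24 * 60 * 60 * 1000;

class DecisionCache {
  private entries = new Map<string, CachedEntry>();
  private ttlMs: number;
  private now: () => number;

  // The clock is injectable so expiry can be exercised without waiting.
  constructor(ttlMs: number = FOURTEEN_DAYS_MS, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Key on (target, screen_fingerprint); JSON keeps the pair unambiguous.
  private key(target: string, screenFingerprint: string): string {
    return JSON.stringify([target, screenFingerprint]);
  }

  // Return the cached action, or undefined on a miss or an expired entry.
  get(target: string, screenFingerprint: string): Action | undefined {
    const k = this.key(target, screenFingerprint);
    const entry = this.entries.get(k);
    if (entry === undefined) return undefined;
    if (this.now() - entry.storedAt > this.ttlMs) {
      this.entries.delete(k);
      return undefined;
    }
    return entry.action;
  }

  set(target: string, screenFingerprint: string, action: Action): void {
    this.entries.set(this.key(target, screenFingerprint), { action, storedAt: this.now() });
  }

  // Call when a cached action fails (missed click, unexpected next screen).
  invalidate(target: string, screenFingerprint: string): void {
    this.entries.delete(this.key(target, screenFingerprint));
  }
}
```

In the loop, get would run before any vision call, set after a successful Phase 2 action, and invalidate whenever a replayed action fails, which is the one extra vision call per stale entry mentioned above.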

Provider abstraction

I ran three vision backends behind the same contract and learned they aren’t interchangeable — each fit a different operating mode:

Local Qwen2.5-VL on a 3060 is unbeatable for sequential runs. Zero marginal cost, sub-second latency, private. The wall is GPU memory: parallel workers don’t fan out on one consumer card.
Gemini is what I switch to when I need concurrency. Cheap enough (~$0.0005/session in my workload) that parallel cost isn’t the concern, and the throughput is effectively unbounded from the client side.
Claude Haiku wins on hard classification cases where the screen is crowded or the targets are small. Roughly two orders of magnitude more expensive per session than Gemini, which is fine because I only route to it when the easier providers return low-confidence answers.

Keeping the provider contract minimal — decide(screenshot, goal) → Decision — made this routing layer possible without the core agent knowing or caring which backend answered.
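One way the low-confidence escalation could sit on top of that contract is sketched below. It rests on an assumption the contract itself does not state: that each backend can report a confidence score. ScoredDecision, ScoredProvider, EscalatingRouter, and the 0.8 threshold are all hypothetical, and Uint8Array stands in for Buffer so the sketch does not depend on Node typings.

```typescript
type Screen = 'modal' | 'loading' | 'permission' | 'content' | 'unknown';
type Action =
  | { type: 'click'; x: number; y: number; label: string }
  | { type: 'wait'; durationMs: number; reason: string }
  | { type: 'give_up'; reason: string };
type Decision = { screen: Screen; action: Action };

// Assumption: backends also return a confidence in [0, 1].
type ScoredDecision = Decision & { confidence: number };

interface ScoredProvider {
  name: string;
  decide(screenshot: Uint8Array, goal: string): Promise<ScoredDecision>;
}

// Try providers from cheapest to most capable and stop at the first
// confident answer. If nobody is confident, the last tier's answer is used.
class EscalatingRouter {
  private tiers: ScoredProvider[];
  private minConfidence: number;

  constructor(tiers: ScoredProvider[], minConfidence: number = 0.8) {
    this.tiers = tiers;
    this.minConfidence = minConfidence;
  }

  // Same shape as the core contract, so the agent cannot tell who answered.
  async decide(screenshot: Uint8Array, goal: string): Promise<Decision> {
    let last: ScoredDecision | undefined;
    for (const tier of this.tiers) {
      last = await tier.decide(screenshot, goal);
      if (last.confidence >= this.minConfidence) break;
    }
    if (last === undefined) {
      return { screen: 'unknown', action: { type: 'give_up', reason: 'no providers configured' } };
    }
    // Strip the confidence so callers only ever see a plain Decision.
    return { screen: last.screen, action: last.action };
  }
}
```

Ordering the tiers as local model, then Gemini, then Claude Haiku would match the cost ordering described above, though the right order depends on whether the run is sequential or concurrent.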

Failure modes I still hit

Three that I haven’t fully solved:

Sub-pixel precision. Vision models return integer coordinates on a normalised viewport (typically 1280×720). Responsive layouts at other viewports need per-viewport normalisation, and most of my click-misses come from that normalisation math rather than from the model.
Pure-canvas targets. When there’s no DOM to fall back to, every step is a vision call. Phase 1 never fires, the cache helps less than you’d expect because canvas state changes visibly, and the cost model gets ugly.
Confident-but-wrong classifications on screens where the right target isn’t visible. CLASSIFY-then-ACT reduces the rate; it doesn’t take it to zero.
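For the first of these, the per-viewport normalisation amounts to a linear rescale from the model’s reference viewport to the live one. A minimal sketch, assuming the screenshot sent to the model was a plain resize of the full viewport with no letterboxing or cropping; toViewport and MODEL_VIEWPORT are hypothetical names, and the 1280×720 default is the typical value quoted above.

```typescript
type Size = { width: number; height: number };
type Point = { x: number; y: number };

// Reference viewport the model's coordinates are expressed in.
const MODEL_VIEWPORT: Size = { width: 1280, height: 720 };

// Map a point from the reference viewport onto the live viewport.
// Each axis is scaled independently, which is only right if the screenshot
// was stretched to the reference size rather than padded.
function toViewport(p: Point, actual: Size, reference: Size = MODEL_VIEWPORT): Point {
  const x = Math.round((p.x * actual.width) / reference.width);
  const y = Math.round((p.y * actual.height) / reference.height);
  // Clamp so rounding can never push the click outside the viewport.
  return {
    x: Math.min(Math.max(x, 0), actual.width - 1),
    y: Math.min(Math.max(y, 0), actual.height - 1),
  };
}

// Centre of the reference viewport mapped onto a 1920×1080 viewport.
const centre = toViewport({ x: 640, y: 360 }, { width: 1920, height: 1080 });
// centre is { x: 960, y: 540 }
```

If the screenshot is instead scaled with its aspect ratio preserved and padded, the mapping needs a single scale factor plus an offset for the padding, and mixing up the two cases is an easy way to get clicks that land slightly off target.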

What I’d build next

The obvious direction, and the thing I’m actually working on, is replacing the general-purpose vision prompt with a task-specific detector for the class of UI elements the agent cares about. A YOLO model trained on buttons, dialogs, dismiss controls, and form fields would turn Phase 2 from a paid API call (about $0.05 per session on a cold cache) into a free local inference, while keeping the DOM-first design for cases where cheap locators already work.

The concrete project this points at: a vision-based QA testing agent that doesn’t depend on CSS selectors or test IDs to do its job. Selector-based automated tests break the moment the DOM changes; a human QA tester doesn’t, because a human looks at the screen. Replacing “read the DOM, compare to expected” with “look at the screen, compare to expected” is the same two-phase architecture, with the domain model (UI elements instead of content screens) retrained for the task.

That’s a writeup for when the numbers are real. This one is about the foundation underneath it.