A web application pentest is mostly browser work. You log in, click through flows, watch what the app does, change one value, and see what breaks. For an autonomous agent to test like a human, it needs to use a browser like a human: see the page, decide where to click, move and type the way a person does, and watch the network traffic the whole time. Scripted automation that fires synthetic clicks at CSS selectors does none of that well, and modern anti-bot defenses spot it instantly.
So we built HumanBrowser: an open-source, vision-driven, human-fidelity browser that any tool-calling LLM can drive in real time. It is the browser Nebula uses to test web applications, and it is designed from the ground up for one job, letting an AI agent operate a real browser convincingly enough to do real security work. This is how it works and why each piece matters.
The problem with "AI browser" tools
Most agent browser integrations take one of two shortcuts. They either drive a headless browser with raw Playwright or Selenium calls (instant, robotic, easily fingerprinted), or they screenshot the page and ask the model to guess pixel coordinates (slow, imprecise, and blind to the DOM). Both struggle with the thing pentesting actually requires: interacting with a live, stateful application the way a real user would, while capturing everything the browser sends and receives.
HumanBrowser takes a different path. It pairs a real Chromium with two layers the others skip: a humanized input driver, and a vision-friendly element model the LLM can reason about directly.
Seeing the page: Set-of-Mark
Before an agent can act, it has to understand what is on the screen. HumanBrowser uses a technique called Set-of-Mark. It runs JavaScript in the page to collect every visible interactive element, button, link, input, select, and tags each one with a stable number for that snapshot.
Instead of asking the model to write a CSS selector or guess coordinates, the agent gets a clean list: [3] BUTTON "Submit", [4] INPUT "email", and so on. The model can simply say "click 3." Each element comes back with its tag, its visible text or placeholder or aria-label (trimmed to 80 characters), and a selector the driver resolves internally. This is the difference between an agent that reasons about a page and one that gambles on pixels.
The eight-tool surface the agent works with is deliberately small: navigate, see, click, type_text, observe, wait_for_change, scroll, and keypress. see is the vision call that returns the Set-of-Mark snapshot; everything else acts on it. A small, well-defined tool set keeps the LLM's decisions grounded and auditable.
Acting like a human: the fidelity layer
This is where HumanBrowser earns its name. Real people do not move the mouse in straight lines or type at a constant rate, and anti-bot systems profile exactly those tells. HumanBrowser's driver reproduces human motor behavior with tunable, real parameters:
- Bezier mouse paths. The cursor follows a curved path, not a straight line, with a curvature factor applied per movement and a per-pixel jitter so no two paths are identical. Long throws are damped so the motion stays natural.
- Gaussian-jittered typing. Keystrokes land at a mean of about 120 ms apart with a 40 ms standard deviation, bounded to a human range, and with a small typo-and-correction rate so the cadence is not mechanically perfect.
- Read-pauses before high-stakes actions. Before an ordinary click the agent pauses briefly (roughly 0.3 to 1.2 seconds). Before a high-stakes action, a Submit, a payment, a delete, it pauses longer (roughly 0.8 to 2.4 seconds), the way a person hesitates before committing something irreversible.
- Humanized scrolling with variable step delays, and stealth patches via playwright-stealth that set
navigator.webdriverto false and normalize the other properties headless browsers leak.
These are not cosmetic. An agent that trips bot detection on the login page never reaches the part of the application worth testing. Human fidelity is what keeps the session alive long enough to find real bugs.
Watching everything: the pentest layer
A pentest browser has to be an observer as well as an actor. HumanBrowser's BrowserSession is described in its own source as the "multi-engine lifecycle owner for LLM-driven pentest sessions," and it is built for that role:
- Proxy presets for interception. One argument routes all browser traffic through Burp Suite, OWASP ZAP, mitmproxy, or a custom SOCKS5 upstream (
ProxyConfig.burp(),.zap(), and so on), withignore_https_errorsso the proxy's CA can MITM TLS. Every request the agent makes shows up in your interception tool, ready for the rest of the workflow. - Bounded network and console capture. The session records request and response events and console messages into bounded buffers (capped so a long session cannot exhaust memory), so the agent and the analyst can review exactly what the application sent and received.
- HAR recording and multi-engine support. Sessions can record a full HAR archive and run on Chromium, Firefox, or WebKit.
This is the bridge from "an agent clicked around" to "an agent ran a pentest." The humanized interaction drives the application; the capture layer turns every interaction into evidence.
Watching it happen: live theater
One of the most useful and unusual features is theater: a live, real-time view of what the browser is doing, watchable from any browser tab with no VNC or remote desktop. It works by capturing the Chrome DevTools Protocol screencast, publishing frames over Redis pub/sub, and streaming them to a WebSocket viewer backed by a static HTML page, with HMAC-signed viewer tokens so only authorized people can watch.
The pipeline is simple and low-latency:
Rendering diagram
For an autonomous pentest, this matters. When Nebula is driving a browser against a target, an operator can open the theater stream and watch in real time as the agent logs in, navigates, and probes, at roughly 30 frames per second with about 50 ms latency. Autonomy plus live observability is exactly what you want from a tool operating against production-like systems.
How any LLM drives it
HumanBrowser is model-agnostic. It ships ready-made tool schemas for Anthropic Claude, OpenAI-compatible APIs (which covers xAI Grok and Kimi K2.6), and Google Gemini, and a single universal entry point:
result = await agent.dispatch(tool_name, tool_args)
The loop is the same for every provider: get the schemas, hand them to the model, and for each tool call the model makes, dispatch it and return the result. The cycle looks like this:
Rendering diagram
Dropping a real, humanized, instrumented browser into an LLM tool-use loop takes well under 30 lines. That model-agnostic design is deliberate, because Nebula routes across five LLM providers depending on the task, and the browser has to work the same under all of them.
How Nebula uses it
Inside Nebula, HumanBrowser is wrapped by a scan-scoped adapter: one browser session per scan, lazy-launched on the first tool call and disposed when the scan completes. The eight vision tools are exposed to Nebula's agents under its own naming (browser_see, browser_human_click, browser_human_navigate, and so on), and the Set-of-Mark state is shared between calls, so a browser_see snapshot and the browser_human_click that follows it refer to the same numbered elements.
In practice, when Nebula's swarm decides a finding needs hands-on browser work, an agent gets a live Chromium it drives by sight, routed through the same interception and capture stack a human tester would use, with an operator able to watch the whole thing live. The agent sees the page, reasons about the numbered elements, acts with human-fidelity input, and the session records every request for analysis. It is the same workflow a skilled human runs, executed autonomously.
Why this is the right foundation
Autonomous web pentesting fails at the browser more often than at the brain. An agent that gets fingerprinted, fumbles selectors, or cannot see its own network traffic will not find real vulnerabilities no matter how capable the model behind it is. HumanBrowser exists so that layer is not the weak link: real browser, human-fidelity interaction, vision-based element reasoning, full traffic capture, proxy interception, and live observability, in a small open-source package any LLM can drive.
It is open source under Apache-2.0, because the foundation of trustworthy autonomous testing should be inspectable.
Takeaways
- Autonomous pentesting lives or dies at the browser layer. Scripted clicks get flagged; pixel-guessing is blind and slow.
- HumanBrowser pairs a real Chromium with human-fidelity input (Bezier mouse, Gaussian typing, read-pauses) and Set-of-Mark vision so an LLM can reason about elements directly.
- It is built for security work: Burp and ZAP proxy presets, bounded network and console capture, HAR recording, and live theater streaming for real-time observability.
- It is model-agnostic and open source, and it is the browser Nebula drives, one session per scan, to test web applications the way a person would.