Nebula: The Autonomous AI Penetration Testing Platform

How Breachline's Nebula runs continuous, autonomous penetration tests: a swarm of reasoning agents that chain real exploits, prove them, and report.

May 30, 2026 | 20 min read | 3,844 words | 15 sections | Breachline Labs

A modern engineering team ships dozens of times a day. Its attack surface, thousands of endpoints across web, APIs, cloud, and identity, changes by the hour. The standard answer to that risk is a penetration test once a quarter, scoped to a slice of the surface, delivered as a PDF two weeks later. By the time the report lands, the application it described is gone.

Nebula is Breachline's answer to that gap. It is an autonomous AI security platform that reasons about a target the way an experienced attacker does, runs continuously, and proves what it finds with working exploits. This paper explains what Nebula is, how it is built, and where it fits next to scanners, traditional pentests, and your existing workflows. It also draws a clear line between what the platform does today, in Early Access, and what is on the roadmap.

"The question is no longer whether AI can find vulnerabilities. It is whether your security program can keep pace with software that ships continuously and attackers who never stop looking." (Breachline Labs)

Executive Summary

Point-in-time testing was designed for software that changed a few times a year. Software no longer behaves that way, and neither do attackers. The result is a structural coverage gap: the surface a quarterly assessment never sees is exactly where breaches happen.

Nebula closes that gap with three ideas working together:

  • Reasoning, not signatures. Nebula forms hypotheses about how an application works, tests them, learns from the response, and adapts, instead of replaying a fixed payload library.
  • A swarm, not a single model. A team of specialist agents works the attack surface in parallel, spawning new specialists for each lead and chaining individual weaknesses into full attack paths.
  • Proof, not noise. Every finding ships with the exact request, the response that proves it, and step-by-step reproduction. If Nebula reports it, it is exploitable.

This whitepaper covers the structural failure of point-in-time testing, the company behind Nebula, how its multi-tier intelligence and agent swarm reason about a target, the platform architecture end to end, the integration and compliance layers, the security model, and an honest view of outcomes and cost.

1. The Problem: Testing That Cannot Keep Up

Continuous delivery, microservices, and ephemeral cloud infrastructure created an attack surface that expands faster than any human team can assess it. Traditional penetration testing was built for a slower world, and the mismatch is structural, not a matter of effort or skill.

The shape of the gap

| Reality of modern delivery | Point-in-time testing |
|---|---|
| Many deploys per day | A handful of assessments per year |
| Thousands of endpoints per app | Sampled coverage per engagement |
| Production minutes after merge | Report delivered days or weeks later |
| Near-zero cost to ship new code | High fixed cost per engagement |
| Attackers automate, around the clock | Human testers, in hours, in scope |

Two consequences follow. First, most of the surface is never tested in any given window, so risk accumulates between engagements. Second, the vulnerability classes that matter most are the ones automated scanners miss: broken object-level authorization, JWT algorithm confusion, race conditions, GraphQL-specific abuse, and business-logic flaws that no signature library can describe.

Legacy DAST tools do not fill the gap. They generate large volumes of low-confidence findings, most of them false positives, while systematically missing logic and authorization bugs. The industry did not need a faster scanner. It needed something that reasons.

2. About Breachline

Breachline Labs is a UK security company, headquartered in London, built on one conviction: security testing should be as continuous, comprehensive, and automated as the software it is meant to protect.

The team comes from offensive security research, large-scale AI systems, and enterprise security engineering. That combination matters, because the hard part is not running tools. It is reasoning about a target the way an attacker does, and building AI that does so reliably rather than confidently making things up.

Breachline pairs the Nebula platform with a UK red team. Nebula runs continuously on its own; the human team steps in for scoped, expert-led engagements and signs off the work that auditors, boards, and customers accept.

Our mission: find every vulnerability and prove every risk before attackers do, continuously and at scale.

That mission shapes every design decision in Nebula: why it produces working proof-of-concept exploits rather than theoretical findings, why it chains weaknesses into attack paths rather than listing them in isolation, and why it plugs into the tools you already use instead of demanding a new silo.

3. The Intelligence at the Core

Multi-tier intelligence, not one model on a checklist

Most "AI security" products wrap a single general-purpose model around a prompt. Nebula does not. It routes every task to the right tier of intelligence and cascades to a fallback if a tier fails:

| Tier | Role |
|---|---|
| Nebula Fast | Speed tier: classification, triage, routing decisions |
| Nebula Core | Primary agentic engine: long-context analysis and bulk reasoning |
| Nebula Max | Premium reasoning: zero-day hunting, exploit design, code review |

A lightweight task does not pay for deep reasoning, and a hard problem is never starved of it. The router scores reliability continuously and reorders the chain so that engines which hallucinate are demoted and dependable ones are promoted. There is no single-vendor lock-in and no single point of failure.
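
The routing idea can be illustrated with a minimal sketch. It is not Nebula's implementation: the `Engine` and `TierRouter` names, the success/failure counters, and the rule that an engine returning `None` counts as a failure are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Engine:
    """One reasoning engine in a tier's fallback chain (hypothetical)."""
    name: str
    run: Callable[[str], Optional[str]]  # assumed to return None on failure
    successes: int = 1  # start both counters at 1 so a new engine scores 0.5
    failures: int = 1

    @property
    def reliability(self) -> float:
        return self.successes / (self.successes + self.failures)


@dataclass
class TierRouter:
    chains: Dict[str, List[Engine]] = field(default_factory=dict)

    def route(self, tier: str, task: str) -> str:
        # Try the most reliable engine first, then cascade down the chain.
        ordered = sorted(self.chains[tier], key=lambda e: e.reliability, reverse=True)
        for engine in ordered:
            result = engine.run(task)
            if result is not None:
                engine.successes += 1
                return f"{engine.name}: {result}"
            engine.failures += 1  # demotes the engine on the next call
        raise RuntimeError(f"all engines failed for tier {tier!r}")
```

In this sketch an engine that fails once is tried later on the next call, which is one simple way to get the "demote and promote" behavior described above.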

Two brains, one platform

Nebula runs as a dual-brain system:

  • A conversational brain that talks to your team in plain language over Slack, Teams, or the web. You describe a target in any language, it asks clarifying questions, and it builds a complete profile.
  • An autonomous swarm brain that executes the engagement: it hunts, exploits, proves, and reports without a human in the loop.

A swarm that reasons like an attacker

The defining trait of an elite tester is not tool proficiency. It is reasoning. An elite tester forms a hypothesis, remembers that a low-severity finding from an hour ago might combine with something new, adapts when the application pushes back, and understands the intended business behavior well enough to notice when the code does something different.

Nebula encodes that loop and runs it across the whole surface at once. A lead agent orchestrates a swarm of specialist agents (recon, exploitation, authentication, cloud, and more), and spawns new specialists on demand for each lead it surfaces.

Every cycle runs in parallel across the full attack surface. No fatigue, no context loss, no scope drift.

A memory that compounds

Human experts get sharper on a target over time because they remember it. Nebula has a nine-layer memory that does the same: it retains which payloads bypassed your WAF, which endpoints were patched, and which attack chains still work. Each engagement feeds the next, so the second scan starts where the first left off rather than from zero.

Evidence over confidence

A reasoning system is only useful if you can trust its output. Nebula is built to refuse to guess: it extracts and quotes tool output before drawing conclusions, cites the evidence behind every claim, and runs layered hallucination checks with a best-of-N pass on the highest-severity findings. "I do not know" is an allowed answer. A confident fabrication is not.
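
Two of these checks are easy to sketch in isolation. The code below is an illustration under stated assumptions, not Nebula's pipeline: `quote_is_grounded`, `best_of_n`, the verbatim-substring rule, and the 60% quorum are hypothetical choices.

```python
from collections import Counter
from typing import Callable, Optional


def quote_is_grounded(quote: str, tool_output: str) -> bool:
    """Admit a claim only if its quoted evidence appears verbatim in the output."""
    return bool(quote.strip()) and quote in tool_output


def best_of_n(judge: Callable[[], str], n: int = 5, quorum: float = 0.6) -> Optional[str]:
    """Run a (presumably non-deterministic) judge n times.

    Returns the majority verdict if it reaches the quorum, otherwise None,
    which stands for the allowed answer "I do not know".
    """
    votes = Counter(judge() for _ in range(n))
    verdict, count = votes.most_common(1)[0]
    return verdict if count / n >= quorum else None
```

Returning `None` instead of the plurality answer is the design choice that matters here: a split vote becomes an explicit "unknown" rather than a confident guess.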

4. How Nebula Works: Platform Architecture

A Nebula engagement moves through a fixed pipeline: input, analysis, planning, execution, validation, and reporting. The phases below are the parts you see.

Phase 1: Reconnaissance and surface mapping

Before a single test runs, Nebula builds a model of the target that goes well beyond passive enumeration.

In this phase Nebula discovers every reachable endpoint (including undocumented and shadow APIs), fingerprints the stack, maps authentication patterns such as OAuth flows and JWT implementations, recovers API schemas through GraphQL introspection and OpenAPI parsing, and identifies the trust boundaries between services.

Phase 2: Autonomous vulnerability testing

Nebula tests the applicable vulnerability classes against each endpoint in parallel. Rather than firing every payload at every input (the approach that makes scanners noisy), it uses the technology fingerprint and observed behavior to choose the attack vectors most likely to land.
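
As a rough sketch of fingerprint-driven selection, assume each test class declares the stack traits it requires. The catalogue entries and trait names below are hypothetical, chosen only to mirror the vulnerability classes named in this paper.

```python
from typing import Dict, FrozenSet, List

# Hypothetical catalogue: each test class lists the stack traits it needs.
TEST_CATALOGUE: Dict[str, FrozenSet[str]] = {
    "jwt_algorithm_confusion": frozenset({"jwt"}),
    "graphql_introspection_abuse": frozenset({"graphql"}),
    "sql_injection": frozenset({"sql_database"}),
    "object_level_authorization": frozenset(),  # applies to any endpoint
}


def select_tests(fingerprint: FrozenSet[str]) -> List[str]:
    """Return only the test classes whose prerequisites the target exhibits."""
    return sorted(
        name
        for name, required in TEST_CATALOGUE.items()
        if required <= fingerprint  # subset check: all required traits observed
    )
```

A REST endpoint fingerprinted with JWT authentication would get the JWT and authorization tests here, and no GraphQL or SQL payloads at all.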

Findings are evidence-gated. Every reported issue includes the exact HTTP request that triggers it, the response that proves exploitation, and the steps to reproduce. The point is to leave you with effectively nothing to triage: confirmation happens in the sandbox, not in your inbox.

Phase 3: Exploit chaining and the attack graph

Individual vulnerabilities are rarely the real risk. Attack paths are.

A low-severity open redirect becomes a critical account takeover when combined with an OAuth flaw. A medium SSRF becomes cloud compromise when the instance exposes its metadata service. A "low" information leak becomes the first hop of a lateral-movement chain. Nebula models each confirmed weakness as a state transition in a directed attack graph, a map of how an attacker's capability grows as they move through the environment.

[Diagram: illustrative exploit chain in the attack graph]

The diagram above is an illustrative chain, not a measured result. The pattern it shows is the point: the steps that compose into a critical path are very often rated Low or Medium on their own, which is exactly why signature tools and time-boxed engagements deprioritize them. Nebula evaluates the impact of the whole path, not the individual scores.
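
The attack-graph idea can be sketched with a plain adjacency list and a breadth-first search. This is a minimal illustration, not Nebula's data model: the state names, the edge tuple layout, and the example graph (modeled on the SSRF example in the text) are assumptions, and the graph is illustrative rather than measured data.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Each edge is a confirmed weakness: (capability gained, finding, severity).
Edge = Tuple[str, str, str]
AttackGraph = Dict[str, List[Edge]]


def find_attack_path(graph: AttackGraph, start: str, goal: str) -> Optional[List[Edge]]:
    """Breadth-first search for the shortest chain of findings from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for edge in graph.get(state, []):
            next_state = edge[0]
            if next_state not in seen:
                seen.add(next_state)
                queue.append((next_state, path + [edge]))
    return None  # no chain reaches the goal


# Hypothetical graph: attacker capabilities as nodes, findings as edges.
graph: AttackGraph = {
    "unauthenticated": [("internal_http_access", "SSRF in URL fetcher", "Medium")],
    "internal_http_access": [("cloud_credentials", "Metadata service reachable", "Low")],
    "cloud_credentials": [("cloud_account_compromise", "Over-privileged instance role", "Medium")],
}
path = find_attack_path(graph, "unauthenticated", "cloud_account_compromise")
```

Every edge on the returned path is rated Low or Medium on its own, yet the path ends in account compromise, which is why the path, not the individual score, is the unit worth ranking.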

Phase 4: Real-time reporting

Findings arrive as they are confirmed, not in a document weeks later, and each lands in the right format for its audience.

| Report tier | Audience | Content |
|---|---|---|
| Executive Summary | CISO, Board | Risk posture, business impact, trend over time |
| Technical Report | Security, AppSec | Full exploit details, reproduction, remediation guidance |
| Compliance Report | Auditors, GRC | Findings mapped to PCI DSS, SOC 2, ISO 27001, NIST CSF, GDPR |
| Developer Report | Engineering | Findings per repository or service, with inline fix suggestions |

5. Core Capabilities

Continuous, not point-in-time

Nebula runs continuously against staging and production. An endpoint deployed overnight is in scope by the next cycle, and a config change that opens an SSRF vector is caught without waiting for the next quarterly window.
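
One simple way to picture "in scope by the next cycle" is a diff between two surface snapshots. The sketch below assumes an endpoint is identified by its method and path; the `diff_surface` helper is hypothetical.

```python
from typing import Dict, Set, Tuple

Endpoint = Tuple[str, str]  # (HTTP method, path)


def diff_surface(previous: Set[Endpoint], current: Set[Endpoint]) -> Dict[str, Set[Endpoint]]:
    """Compare two surface snapshots so the next cycle can prioritize changes."""
    return {
        "added": current - previous,      # new since the last cycle: test first
        "removed": previous - current,    # gone: retire related findings
        "unchanged": current & previous,  # still present: retest on schedule
    }
```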

Systematic OWASP Top 10 coverage

Nebula is designed to work through all ten categories of the OWASP Top 10 (2021 edition) systematically, including the design and logic categories that time-constrained engagements usually reach last, if at all.

| OWASP category | Typical scanner | Nebula approach |
|---|---|---|
| A01 Broken Access Control | Partial, sampled | Cross-user and cross-tenant validation |
| A02 Cryptographic Failures | Limited | Crypto analysis and JWT attacks |
| A03 Injection | Strong | All injection classes, all input vectors |
| A04 Insecure Design | Rarely tested | Business-logic and workflow analysis |
| A05 Security Misconfiguration | Moderate | Stack-aware misconfiguration testing |
| A06 Vulnerable Components | Tool-dependent | SBOM and dependency review |
| A07 Auth Failures | Moderate | Auth bypass and session testing |
| A08 Data Integrity Failures | Rarely tested | Deserialization and CI/CD supply chain |
| A09 Logging Failures | Almost never | Attack-traffic detection checks |
| A10 SSRF | Limited | Internal pivot, cloud metadata, full chains |

Coverage across the whole surface

[Diagram: coverage across the whole attack surface]

CI/CD integration

Nebula can run inside the deployment pipeline. A pull request or a deploy triggers a targeted scan of the changed surface, and the result comes back before the change reaches production.
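
A pipeline gate of this kind usually reduces to an exit code. The sketch below is a generic illustration, not a Nebula CLI or API: the severity labels, their ordering, and the `gate_exit_code` function are assumptions.

```python
from typing import Iterable

# Assumed severity scale, lowest to highest.
SEVERITY_ORDER = ["info", "low", "medium", "high", "critical"]


def gate_exit_code(severities: Iterable[str], fail_at: str = "critical") -> int:
    """Return 0 to let the pipeline proceed, 1 to block the change."""
    threshold = SEVERITY_ORDER.index(fail_at)
    blocking = [s for s in severities if SEVERITY_ORDER.index(s.lower()) >= threshold]
    return 1 if blocking else 0
```

A CI job could pass the severities of the findings introduced by the change and call `sys.exit(gate_exit_code(...))`, so a critical finding fails the check while lower-severity items are reported without blocking.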

6. The Integration Ecosystem

Nebula is built to work inside the tools your teams already use, not to add another place to check.

What that looks like in practice, as illustrative examples:

  • On-call engineer. A PagerDuty alert fires for a critical chain in production: SSRF into cloud credential theft. The alert carries the exact request, the chain visualization, and a link to the full finding, so the picture is clear before the laptop is even open.
  • Developer on a pull request. A GitHub status check from Nebula reports no new findings introduced by the change, and the branch is clear to merge. No review queue, no surprises later.
  • Security team on Monday. A Slack summary recaps the week's testing and links to the full report in the SIEM.
  • CISO before a board review. A dashboard pulls live data from Nebula: coverage over time, findings by severity and category, and posture against the frameworks that matter.

7. Compliance Automation

Every finding is mapped to the frameworks your organization answers to, translating a technical issue into the control language an auditor expects.

When an auditor asks for penetration-testing evidence, you export a compliance report: the findings, the tests run, and the clean results, timestamped and formatted for review.

8. Security and Trust

This is a tool that attacks systems on your behalf, so its own security model has to be airtight.

  • Sandboxed execution. Every exploit attempt runs in an isolated container, provisioned fresh per scan and destroyed after use, with network access limited to the authorized target. Nebula leaves no backdoors and no persistent access.
  • Graduated containment. Testing runs at an explicit containment level, from passive observation through active exploitation, so the intensity of an engagement always matches what you authorized.
  • Scope enforcement. Nebula operates only within the defined scope. Domain and IP boundaries, allowlisting, and rate limits keep its activity controlled and predictable, and every request it makes is logged for audit.
  • Data handling. Application data used to validate an exploit is processed in memory and discarded. Findings metadata is retained for reporting and trend analysis, with customer-controlled data-residency options.
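
Scope enforcement and graduated containment can be sketched together as a single authorization check. This is an illustration under assumptions, not Nebula's enforcement layer: the `Scope` and `Containment` names are hypothetical, the intermediate `ACTIVE_SCAN` level is invented (the text names only the two ends of the range), and rate limiting and audit logging are left out.

```python
import ipaddress
from dataclasses import dataclass
from enum import IntEnum
from typing import Tuple


class Containment(IntEnum):
    PASSIVE = 0       # observation only
    ACTIVE_SCAN = 1   # assumed intermediate level: non-destructive probing
    EXPLOITATION = 2  # active exploitation


@dataclass(frozen=True)
class Scope:
    domains: Tuple[str, ...]   # allowlisted domains (subdomains included)
    networks: Tuple[str, ...]  # allowlisted CIDR ranges
    max_level: Containment     # highest level the customer authorized

    def host_in_scope(self, host: str) -> bool:
        try:
            addr = ipaddress.ip_address(host)
        except ValueError:
            # Not an IP literal: match the domain itself or a true subdomain.
            h = host.lower().rstrip(".")
            return any(h == d or h.endswith("." + d) for d in self.domains)
        return any(addr in ipaddress.ip_network(n) for n in self.networks)

    def allows(self, host: str, level: Containment) -> bool:
        """Permit an action only if the host is in scope and the level is authorized."""
        return self.host_in_scope(host) and level <= self.max_level
```

Matching on `"." + d` rather than a bare suffix keeps a look-alike such as `evilexample.com` out of scope when only `example.com` is allowlisted.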

9. What Continuous, Autonomous Testing Changes

Nebula is in Early Access, so this section is about the structural difference the model makes, not a claim of measured customer results.

The core change is coverage over time. A point-in-time test inspects a sample of the surface on one day; the rest of the surface, and every change after that day, goes untested until the next engagement. Continuous autonomous testing inverts that. The whole surface is in scope, new code is tested as it ships, and findings arrive while they still describe the live system.

The second change is the unit of risk. Scanners and time-boxed tests report findings in isolation, which pushes Low and Medium items into a backlog. Because Nebula chains weaknesses into attack paths and scores the path, the items that actually compose into a breach surface as a priority instead of disappearing into a list.

The third change is triage cost. When every finding ships with a working proof of concept and reproduction steps, the security team spends its time on remediation and judgment calls, not on sorting real issues from scanner noise.

An illustrative cost model

The figures below are illustrative industry ranges to frame the economics, not Breachline-measured results. Use your own contract and salary numbers in place of these.

| Cost category | Point-in-time model | Continuous model |
|---|---|---|
| Recurring assessment spend | High fixed cost per engagement | Subscription |
| Scanner licensing | Separate line item | Included |
| Triage labor | Ongoing security-team time | Minimal, findings ship proven |
| False-positive resolution | Ongoing developer time | Minimal |
| Attack-surface coverage | Sampled | Full surface, continuous |

For context on the cost being managed against, IBM's Cost of a Data Breach report puts the global average cost of a breach in the millions of dollars per incident. The case for continuous testing is that the surface left untested between engagements is where many of those incidents begin.

10. Nebula Next to the Alternatives

| Capability | DAST Scanner | Traditional Pentest | Nebula |
|---|---|---|---|
| OWASP Top 10 coverage | Partial | Most | Systematic |
| Business-logic testing | No | Partial | Yes |
| Exploit chaining | No | Limited | Yes |
| Continuous operation | Yes | No | Yes |
| Proof-of-concept exploits | No | Yes | Yes |
| GraphQL-native testing | No | Limited | Yes |
| Race-condition detection | No | Rare | Yes |
| CI/CD integration | Limited | No | Yes |
| Cloud infrastructure testing | Limited | Scope-dependent | Yes |
| Real-time findings | Yes | No | Yes |
| Compliance mapping | Limited | Manual | Yes |
| Triage burden | High | Low | Near-zero, proof-gated |

11. Getting Started

Nebula is delivered as a SaaS platform that requires no on-premises infrastructure. Private-cloud and on-premises deployments are available for teams that need them.

  • Day one: the first comprehensive scan completes, including any exploit chains already present before Nebula arrived.
  • Week one: integrations are live, the pipeline gates on critical findings, and alerts flow to your team and SIEM.
  • Month one: trend data accumulates, showing how the surface is changing and where new risk is being introduced.

Enterprise plans include dedicated security-engineering support, custom scope and scan configuration, white-label reporting, SSO and role-based access control, custom compliance mapping, and scheduled engagements with the Breachline red team. A self-service tier for smaller teams is on the roadmap.

12. Conclusion

The organizations that handle security well in 2026 are not the ones with the largest team or the most expensive annual test. They are the ones that made testing continuous, automated, and part of how they ship.

Nebula is built for that: a swarm of reasoning agents, with multi-tier intelligence and a memory that compounds, that reasons about your attack surface the way a strong human tester would, but never sleeps, never runs out of time before it reaches the last endpoint, and never hands you a finding it cannot prove. It is in Early Access today, and it is honest about the line between what it does now and what is on the roadmap.

The open question is no longer whether AI can do this work. It is how long an organization keeps relying on an approach that samples its surface a few times a year while attackers work the rest of it continuously.

About Breachline Labs

Breachline Labs Limited builds autonomous AI security platforms that find and prove vulnerabilities before attackers do. Headquartered in London, United Kingdom, Breachline pairs the Nebula platform with a UK red team for expert-led engagements.

© 2026 Breachline Labs Limited. Nebula and the Breachline logo are trademarks of Breachline Labs Limited. All other trademarks are the property of their respective owners.