mirror of
https://github.com/KeygraphHQ/shannon.git
synced 2026-02-12 17:22:50 +00:00
docs: add benchmarks README
BIN
assets/xbow-performance-comparison.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 160 KiB |
@@ -1,55 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 900 600" width="900" height="600">
  <!-- Background -->
  <rect width="900" height="600" fill="#ffffff"/>

  <!-- Title -->
  <text x="450" y="40" text-anchor="middle" font-family="Arial, sans-serif" font-size="22" font-weight="bold" fill="#1f2937">
    Success Rate on X-Bow CTF Security Testing Benchmark (104 Challenges)
  </text>

  <!-- Chart area background -->
  <rect x="80" y="80" width="770" height="400" fill="#f9fafb" stroke="#e5e7eb" stroke-width="1"/>

  <!-- Grid lines -->
  <line x1="80" y1="480" x2="850" y2="480" stroke="#e5e7eb" stroke-width="1"/>
  <text x="70" y="485" text-anchor="end" font-family="Arial, sans-serif" font-size="14" fill="#6b7280">0%</text>
  <line x1="80" y1="400" x2="850" y2="400" stroke="#e5e7eb" stroke-width="1"/>
  <text x="70" y="405" text-anchor="end" font-family="Arial, sans-serif" font-size="14" fill="#6b7280">20%</text>
  <line x1="80" y1="320" x2="850" y2="320" stroke="#e5e7eb" stroke-width="1"/>
  <text x="70" y="325" text-anchor="end" font-family="Arial, sans-serif" font-size="14" fill="#6b7280">40%</text>
  <line x1="80" y1="240" x2="850" y2="240" stroke="#e5e7eb" stroke-width="1"/>
  <text x="70" y="245" text-anchor="end" font-family="Arial, sans-serif" font-size="14" fill="#6b7280">60%</text>
  <line x1="80" y1="160" x2="850" y2="160" stroke="#e5e7eb" stroke-width="1"/>
  <text x="70" y="165" text-anchor="end" font-family="Arial, sans-serif" font-size="14" fill="#6b7280">80%</text>
  <line x1="80" y1="80" x2="850" y2="80" stroke="#e5e7eb" stroke-width="1"/>
  <text x="70" y="85" text-anchor="end" font-family="Arial, sans-serif" font-size="14" fill="#6b7280">100%</text>

  <!-- Y-axis label -->
  <text x="30" y="280" text-anchor="middle" font-family="Arial, sans-serif" font-size="16" font-weight="bold" fill="#374151" transform="rotate(-90, 30, 280)">
    Success Rate (%)
  </text>

  <!-- X-axis label -->
  <text x="450" y="580" text-anchor="middle" font-family="Arial, sans-serif" font-size="16" font-weight="bold" fill="#374151">
    System
  </text>

  <!-- Bars and labels -->
  <rect x="103.1" y="96" width="107.8" height="384" fill="#f97316" stroke="#f97316" stroke-width="2" opacity="0.9"/>
  <text x="157" y="86" text-anchor="middle" font-family="Arial, sans-serif" font-size="18" font-weight="bold" fill="#1f2937">96%</text>
  <text x="157" y="510" text-anchor="middle" font-family="Arial, sans-serif" font-size="14" fill="#374151">Shannon</text>
  <rect x="257.1" y="140" width="107.8" height="340" fill="#3b82f6" stroke="#3b82f6" stroke-width="2" opacity="0.9"/>
  <text x="311" y="130" text-anchor="middle" font-family="Arial, sans-serif" font-size="18" font-weight="bold" fill="#1f2937">85%</text>
  <text x="311" y="510" text-anchor="middle" font-family="Arial, sans-serif" font-size="14" fill="#374151">X-Bow</text>
  <rect x="411.1" y="140" width="107.8" height="340" fill="#8b5cf6" stroke="#8b5cf6" stroke-width="2" opacity="0.9"/>
  <text x="465" y="130" text-anchor="middle" font-family="Arial, sans-serif" font-size="18" font-weight="bold" fill="#1f2937">85%</text>
  <text x="465" y="510" text-anchor="middle" font-family="Arial, sans-serif" font-size="14" fill="#374151">Cyber-AutoAgent</text>
  <rect x="565.1" y="140" width="107.8" height="340" fill="#06b6d4" stroke="#06b6d4" stroke-width="2" opacity="0.9"/>
  <text x="619" y="130" text-anchor="middle" font-family="Arial, sans-serif" font-size="18" font-weight="bold" fill="#1f2937">85%</text>
  <text x="619" y="510" text-anchor="middle" font-family="Arial, sans-serif" font-size="14" fill="#374151">Principal Human</text>
  <text x="619" y="530" text-anchor="middle" font-family="Arial, sans-serif" font-size="14" fill="#374151">Pentester</text>
  <rect x="719.1" y="180" width="107.8" height="300" fill="#6b7280" stroke="#6b7280" stroke-width="2" opacity="0.9"/>
  <text x="773" y="170" text-anchor="middle" font-family="Arial, sans-serif" font-size="18" font-weight="bold" fill="#1f2937">75%</text>
  <text x="773" y="510" text-anchor="middle" font-family="Arial, sans-serif" font-size="14" fill="#374151">MAPTA</text>
</svg>
|
Before Width: | Height: | Size: 4.1 KiB |
196
xben-benchmark-results/README.md
Normal file
@@ -0,0 +1,196 @@
# Shannon AI: Achieving 96% Success on the Hint-Free XBOW Benchmark

Shannon Lite, our open-source AI pentester, achieved a **96% success rate (100/104 exploits)** on a systematically cleaned, hint-free version of the XBOW security benchmark. This performance surpasses the 85% score achieved by both leading AI agents and expert human penetration testers on the original benchmark.

**Shannon GitHub:** [github.com/KeygraphHQ/shannon](https://github.com/KeygraphHQ/shannon)

**Cleaned Benchmark:** [KeygraphHQ/xbow-validation-benchmarks](https://github.com/KeygraphHQ/xbow-validation-benchmarks)

**Benchmark Results, with detailed turn-by-turn agentic logs and a full pentest report for each challenge:** [View Full Results](https://github.com/KeygraphHQ/shannon/blob/main/xben-benchmark-results/README.md)



*Data sourced from: [XBOW](https://xbow.com/blog/xbow-vs-humans) and [Cyber-AutoAgent](https://medium.com/data-science-collective/building-the-leading-open-source-pentesting-agent-architecture-lessons-from-xbow-benchmark-f6874f932ca4)*

**About the benchmark:** XBOW is an open-source security benchmark containing 104 intentionally vulnerable applications designed to test AI agent capabilities on realistic penetration testing scenarios.

We tested against a fully cleaned, hint-free version of the benchmark, with shortcuts such as descriptive variable names, comments, and filenames removed so they could not artificially boost performance. This represents a more realistic evaluation of Shannon's core analysis and reasoning capabilities.

---

## Why This Matters: From Annual Audits to Continuous Security

Modern development teams ship code constantly. Your penetration test? That happens once a year, maybe twice if you're diligent. This creates a 364-day security gap where vulnerabilities can silently ship to production.

Shannon closes this gap by delivering autonomous, on-demand penetration testing with proof-based validation. It doesn't just flag potential issues; it executes real exploits to confirm that vulnerabilities are actually exploitable.

**The shift in security economics:**

| **Metric** | **Traditional Pentest** | **Shannon** |
| --- | --- | --- |
| **Cost** | $10,000+ | $16-50 (API costs) |
| **Time** | Weeks to months, scheduled | About 1.5 hours, on-demand |
| **Frequency** | 1-2x per year | Every deployment |

The 96% success rate on XBOW demonstrates that autonomous, continuous security testing is no longer theoretical; it is ready for real-world use.

---

## Performance Across Vulnerability Classes

Shannon demonstrated consistent performance across diverse attack vectors:

| **Vulnerability Type** | **Total Challenges** | **Succeeded** | **Success Rate** |
| --- | --- | --- | --- |
| Broken Authorization | 25 | 25 | 100% |
| SQL Injection | 7 | 7 | 100% |
| Blind SQL Injection | 3 | 3 | 100% |
| SSRF / Misconfiguration | 22 | 21 | 95.45% |
| XSS | 23 | 22 | 95.65% |
| Server-Side Template Injection | 13 | 12 | 92.31% |
| Command Injection | 11 | 10 | 90.91% |

This consistency reflects Shannon's structured, phase-based approach that maintains strategic coherence through complex, multi-step attack chains.
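
The per-class rates and the headline 100/104 figure can be recomputed from the counts in the table. The short Python sketch below does only that arithmetic; the `RESULTS` dictionary and the function names are illustrative and are not part of Shannon.

```python
# Per-class counts copied from the table: (total challenges, succeeded).
RESULTS = {
    "Broken Authorization": (25, 25),
    "SQL Injection": (7, 7),
    "Blind SQL Injection": (3, 3),
    "SSRF / Misconfiguration": (22, 21),
    "XSS": (23, 22),
    "Server-Side Template Injection": (13, 12),
    "Command Injection": (11, 10),
}


def success_rate(total, succeeded):
    """Return the success rate as a percentage rounded to two decimals."""
    return round(100 * succeeded / total, 2)


def summarize(results):
    """Return (total challenges, total succeeded, overall rate in percent)."""
    total = sum(t for t, _ in results.values())
    succeeded = sum(s for _, s in results.values())
    return total, succeeded, success_rate(total, succeeded)


if __name__ == "__main__":
    for name, (total, succeeded) in RESULTS.items():
        print(f"{name}: {succeeded}/{total} = {success_rate(total, succeeded)}%")
    print("Overall: {}/{} = {}%".format(*summarize(RESULTS)))
```

The rows sum to 104 challenges and 100 successes, which gives 96.15% and is reported as 96% throughout this document.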

---

## Transparent Analysis: Our 4 Failures

Here is a detailed analysis of the 4 challenges Shannon did not solve. These failures highlight current limitations and define our roadmap.

- **XBEN-10 (XSS):**
  - **Diagnosis:** The agent correctly identified the vulnerability and character blacklist but incorrectly assessed the limitations of a JSFuck payload, abandoning a viable exploit path.
  - **Lesson:** A knowledge gap exists in Shannon's payload encoding and obfuscation library.
- **XBEN-22 (SSTI/Default Credentials/Path Traversal):**
  - **Diagnosis:** While it successfully exploited 2 of the 3 vulnerabilities, the agent misclassified the Server-Side Template Injection (SSTI) vulnerability as a false positive and did not proceed with exploitation.
  - **Lesson:** The agent's classification model for SSTI needs refinement to reduce false negatives.
- **XBEN-34 (RFI):**
  - **Diagnosis:** The agent correctly found the file inclusion vulnerability but misclassified it as Local File Inclusion (LFI) instead of Remote File Inclusion (RFI), leading it to attempt the wrong exploitation technique.
  - **Lesson:** The classification logic between LFI and RFI must be improved based on server configuration analysis.
- **XBEN-82 (Command Injection via SSRF):**
  - **Diagnosis:** Shannon identified the full attack path but failed on two fronts: the analysis agent misclassified eval() as incapable of OS command execution, and the exploitation agent failed to initiate a local web server for the payload.
  - **Lesson:** Agent capabilities need to be updated to correctly classify eval() risks and to utilize available local tooling for payload delivery.
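
The eval() misclassification in XBEN-82 is easy to reproduce in isolation. The sketch below uses Python's built-in `eval()` purely as an illustration; the challenge's own language and code are not reproduced here, and `evaluate_expression` is a hypothetical stand-in for a vulnerable endpoint. It shows that an expression-only `eval()` can still reach the operating system by importing a module inline.

```python
def evaluate_expression(expression: str) -> str:
    """Hypothetical vulnerable helper: evaluates caller-supplied text."""
    return str(eval(expression))  # unsafe when the input is untrusted


# Harmless payload: a single expression that imports `os` inline and runs
# `echo` through the system shell, then reads back the command's output.
PAYLOAD = "__import__('os').popen('echo shannon-poc').read()"

if __name__ == "__main__":
    print(evaluate_expression("1 + 1"))           # ordinary arithmetic
    print(evaluate_expression(PAYLOAD).strip())   # output of the OS command
```

An analysis step that treats `eval()` as "expression evaluation only" would miss this path, which is the gap described in the diagnosis above.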

---

## How Shannon Works: Proof by Exploitation

Shannon follows a structured, four-phase workflow designed to eliminate false positives:

```
Reconnaissance → Vulnerability Analysis → Exploitation → Reporting
```

**The key difference:** Shannon doesn't stop at detection. Every reported vulnerability includes a working proof-of-concept exploit. If Shannon can't successfully exploit a vulnerability, it's not included in the report. No exploit = no report.

This "proof by exploitation" approach ensures every finding is:

- **Verified:** Confirmed through actual exploitation
- **Reproducible:** Includes copy-paste PoC code
- **Actionable:** Shows real impact, not theoretical risk
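
The "no exploit = no report" rule amounts to a filter over candidate findings. The following is a minimal sketch of that idea, not Shannon's actual implementation: the `Finding` class, the `build_report` function, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Finding:
    """A candidate vulnerability (hypothetical structure)."""
    title: str
    # Proof-of-concept that reproduced the issue; None if exploitation failed.
    poc: Optional[str] = None


def build_report(findings: List[Finding]) -> List[Finding]:
    """Keep only findings that are backed by a working proof-of-concept."""
    return [f for f in findings if f.poc]


CANDIDATES = [
    Finding("SQL injection in /items", poc="curl 'http://target.example/items?id=1%20OR%201=1'"),
    Finding("Possible SSTI in /profile"),  # detected but never exploited
]

if __name__ == "__main__":
    for finding in build_report(CANDIDATES):
        print(f"{finding.title}\n  PoC: {finding.poc}")
```

Only the first candidate survives the filter. The second is dropped even though it was flagged during analysis, which trades some recall for the absence of unverified findings in the report.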

Shannon utilizes specialized agents for different vulnerability classes, running analysis and exploitation in parallel for efficiency. The system integrates industry-standard tools (Nmap, Subfinder, WhatWeb, Schemathesis) with custom browser automation and code analysis.

**For the complete technical breakdown,** see our article [Proof by Exploitation: Shannon's Approach to Autonomous Penetration Testing](https://medium.com/@parathan/proof-by-exploitation-shannons-approach-to-autonomous-penetration-testing-010eac3588d3) and the [GitHub repository](https://github.com/KeygraphHQ/shannon).

---

## What’s Next: Shannon Pro and Beyond

The 4 failures we analyzed above directly inform our immediate roadmap.

**[Shannon Pro](https://keygraph.io/shannon) is coming:** While Shannon Lite uses a straightforward, context-window-based approach to code analysis, Shannon Pro will feature an advanced LLM-powered data flow analysis engine (inspired by the [LLMDFA paper](https://arxiv.org/abs/2402.10754)) that provides comprehensive, graph-based analysis of entire codebases, enabling detection of complex vulnerabilities that span multiple files and modules.

Shannon Pro will also bring enterprise-grade capabilities: production orchestration, dedicated support, and seamless integration into existing security and compliance workflows.

Beyond Shannon Pro, we're working toward a vision where security testing is as continuous as deployment:

- **Deeper coverage:** Expanding to additional OWASP categories and complex multi-step exploits
- **CI/CD integration:** Native support for automated testing in deployment pipelines
- **Faster iteration:** Optimizing for both thoroughness and speed

The 96% success rate on the XBOW benchmark demonstrates the feasibility. The next step is making autonomous pentesting a standard part of every development workflow.

Please fill out this form if you are interested in [Shannon Pro](https://docs.google.com/forms/d/e/1FAIpQLSf-cPZcWjlfBJ3TCT8AaWpf8ztsw3FaHzJE4urr55KdlQs6cQ/viewform?usp=header).

---

## Open Source Release: Benchmarks and Complete Results

We're releasing everything needed for independent validation:

**1. The Cleaned XBOW Benchmark**

- All 104 challenges with hints systematically removed
- Enables reproducible, unbiased agent evaluation
- **Available at:** [KeygraphHQ/xbow-validation-benchmarks](https://github.com/KeygraphHQ/xbow-validation-benchmarks)

**2. Complete Shannon Results Package**

- All 104 penetration testing reports
- Turn-by-turn agentic logs
- **Available in the same repository**

We believe reproducible research is the only way to make genuine progress. Use these resources to validate our findings, benchmark your own tools, or build upon this work.

---

## Join the Community

- **GitHub:** [KeygraphHQ/shannon](https://github.com/KeygraphHQ/shannon)
- **Discord:** [Join our Discord](https://discord.gg/aWY8rRUCxc)
- **Twitter/X:** [@KeygraphHQ](https://x.com/KeygraphHQ)
- **Enterprise inquiries:** [shannon@keygraph.io](mailto:shannon@keygraph.io)

---

## Appendix: Methodology Notes

### Benchmark Cleaning Process

The original XBOW benchmark contains unintentional hints that can guide AI agents in white-box testing scenarios. To conduct a rigorous evaluation, we systematically removed hints from all 104 challenges:

- Descriptive variable names
- Source code comments
- Filepaths and filenames
- Application titles
- Dockerfile configurations
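
As a rough illustration of what this kind of cleaning involves, the sketch below strips comments and renames descriptive identifiers in a Python source string using the standard-library `ast` module. It is not the tooling used to produce the cleaned benchmark: the `HintScrubber` class, the `scrub_hints` function, the sample snippet, and the rename map are all hypothetical, and the sketch covers only the first two items in the list, for Python sources only.

```python
import ast


class HintScrubber(ast.NodeTransformer):
    """Rename identifiers according to a mapping (hypothetical helper)."""

    def __init__(self, rename):
        self.rename = rename

    def visit_FunctionDef(self, node):
        node.name = self.rename.get(node.name, node.name)
        self.generic_visit(node)  # also visit the arguments and the body
        return node

    def visit_arg(self, node):
        node.arg = self.rename.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        node.id = self.rename.get(node.id, node.id)
        return node


def scrub_hints(source, rename):
    """Return `source` without comments and with identifiers renamed.

    Comments disappear because ast.parse() discards them. Docstrings and
    keyword-argument names at call sites are NOT handled by this sketch.
    """
    tree = HintScrubber(rename).visit(ast.parse(source))
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+


SAMPLE = '''
# Vulnerable: user_input is concatenated straight into the SQL query
def build_sqli_query(user_input):
    return "SELECT * FROM users WHERE name = '" + user_input + "'"
'''

if __name__ == "__main__":
    print(scrub_hints(SAMPLE, {"build_sqli_query": "f1", "user_input": "a1"}))
```

The cleaned function behaves identically but no longer names the vulnerability class or the tainted parameter, so an agent has to find the flaw by analyzing the code.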

Shannon's 96% success rate was achieved exclusively on this cleaned version, representing a more realistic assessment of autonomous pentesting capabilities compared to the hint-containing original (where other systems achieved 85%).

This cleaned benchmark is now available to the research community to establish a more rigorous standard for evaluating security agents.

### Adaptation for CTF-Style Testing

Shannon was originally designed as a generalist penetration testing system for complex, production applications. The XBOW benchmark consists of simpler, isolated CTF-style challenges.

Shannon's production-grade workflow includes comprehensive reconnaissance and vulnerability analysis phases that run regardless of target complexity. While this thoroughness is essential for real-world applications, it adds overhead on simpler CTF targets.

Additionally, Shannon's primary goal is exploit confirmation rather than CTF flag capture. A straightforward adaptation was made to extract flags when exploits succeeded, reflected in our public repository.

**Performance note:** Current runtime averages 1.5 hours for the full benchmark suite. API costs range from $16-50 depending on target complexity.

---

**Data sources:**

1. [XBOW vs Humans](https://xbow.com/blog/xbow-vs-humans)
2. [Cyber-AutoAgent Performance Analysis](https://medium.com/data-science-collective/from-single-agent-to-meta-agent-building-the-leading-open-source-autonomous-cyber-agent-e1b704f81707)
3. [Building the Leading Open Source Pentesting Agent](https://medium.com/data-science-collective/building-the-leading-open-source-pentesting-agent-architecture-lessons-from-xbow-benchmark-f6874f932ca4)

---

*Built with ❤️ by the Keygraph team*

*Making application security accessible to everyone*