SWE-Bench Trajectory Analysis
A comprehensive analysis tool for comparing failed and successful software engineering trajectories across different SWE-bench dataset splits. This tool helps identify patterns in agent failures and provides actionable insights for improving software engineering agents.
Features
- Multi-Split Support: Analyze trajectories across SWE-bench splits (Verified, Lite, Test, Multimodal)
- Automated Comparison: Compare your agent's failed trajectories with successful trajectories from top-performing agents
- LLM-Powered Analysis: Uses language models, accessed through OpenRouter, to generate detailed failure analysis reports
- Batch Processing: Process multiple instances in parallel, up to a configurable number of workers
- Leaderboard Integration: Automatically downloads and analyzes top agent results from SWE-bench leaderboards
Requirements
- Python 3.8+
- Required packages (see requirements.txt):
  - datasets>=2.0.0 - For loading SWE-bench datasets
  - requests>=2.25.0 - For API calls
  - python-dotenv>=0.19.0 - For environment variable management
  - PyYAML>=6.0 - For configuration file parsing
  - boto3>=1.40.52 - For AWS S3 access
Installation
- Clone the repository and navigate to the directory:
    cd trajectory_analysis
- Install dependencies:
    pip install -r requirements.txt
- Set up environment variables:
    cp .env.example .env
    # Edit .env and add your OPENROUTER_API_KEY
Configuration
Edit config.yaml to configure your analysis:
    # Dataset configuration
    dataset:
      split: "Verified"  # Options: Verified, Lite, Test, Multimodal

    # Paths
    paths:
      txt_path: "/path/to/unsolved_instances.txt"
      intermediate_output_path: "/path/to/intermediate_output/"
      output_path: "/path/to/final_report.txt"
      logs_directory: "/path/to/failed_trajectories/"
      prompts_path: "/path/to/prompts_manager.py"
      tools_path: "/path/to/tools/"

    # Settings
    settings:
      max_workers: 20  # Maximum number of parallel requests to OpenRouter
      num_agents: 4    # Number of top SWE-bench leaderboard agents to download trajectories for
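A quick sanity check on the parsed configuration can catch typos before any API calls are made. The sketch below is illustrative only and is not taken from the tool's code: `validate_config`, `VALID_SPLITS`, and `REQUIRED_PATHS` are hypothetical names, and the function expects the dictionary that a YAML parser such as PyYAML's `yaml.safe_load` would return for the file above.

```python
from typing import Any, Dict, List

# Values and keys copied from the sample config.yaml above.
VALID_SPLITS = {"Verified", "Lite", "Test", "Multimodal"}
REQUIRED_PATHS = [
    "txt_path",
    "intermediate_output_path",
    "output_path",
    "logs_directory",
    "prompts_path",
    "tools_path",
]


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the config looks usable.

    Hypothetical helper: the real tool may validate differently.
    """
    problems: List[str] = []

    split = (config.get("dataset") or {}).get("split")
    if split not in VALID_SPLITS:
        problems.append(
            f"dataset.split must be one of {sorted(VALID_SPLITS)}, got {split!r}"
        )

    paths = config.get("paths") or {}
    for key in REQUIRED_PATHS:
        if not paths.get(key):
            problems.append(f"paths.{key} is missing or empty")

    settings = config.get("settings") or {}
    for key in ("max_workers", "num_agents"):
        value = settings.get(key)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            problems.append(
                f"settings.{key} must be a positive integer, got {value!r}"
            )

    return problems
```

Calling `validate_config(yaml.safe_load(open("config.yaml")))` and printing the returned list before starting a run is one way to use it.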
Usage
Basic Usage
- Prepare your data:
  - Create a text file listing unresolved instance IDs (one per line)
  - Ensure your failed trajectories are in the specified logs directory
- Run the analysis:
    python run.py
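The unresolved-instances file is plain text with one instance ID per line. As a minimal sketch of how such a file could be read, assuming blank lines and duplicate IDs should be ignored (the tool's own parsing may differ), a hypothetical `read_instance_ids` helper might look like this:

```python
from pathlib import Path
from typing import List


def read_instance_ids(txt_path: str) -> List[str]:
    """Read instance IDs (one per line), skipping blank lines and duplicates.

    Hypothetical helper; order of first appearance is preserved.
    """
    seen = set()
    ids: List[str] = []
    for line in Path(txt_path).read_text(encoding="utf-8").splitlines():
        instance_id = line.strip()
        if instance_id and instance_id not in seen:
            seen.add(instance_id)
            ids.append(instance_id)
    return ids
```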
Input Requirements
- Failed Trajectories: Your agent's failed trajectories in one of these formats:
  - Direct .log files named {instance_id}.log
  - Issues directory structure: issues/{numerical_id}-{instance_id}/logs/{timestamp}/run/{instance_id}/generator/{generator_id}/
- Unresolved Instances: Text file containing instance IDs that your agent failed to solve
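The two trajectory layouts above can be checked with a small lookup. This is a sketch under assumptions, not the tool's own discovery logic: `find_trajectory_paths` is a hypothetical name, and the `issues/` directory is assumed to sit directly under the configured logs directory.

```python
from pathlib import Path
from typing import List


def find_trajectory_paths(logs_directory: str, instance_id: str) -> List[Path]:
    """Return candidate trajectory locations for one instance ID.

    Hypothetical helper covering both layouts described above.
    """
    logs_root = Path(logs_directory)
    matches: List[Path] = []

    # Layout 1: a direct log file named {instance_id}.log
    direct_log = logs_root / f"{instance_id}.log"
    if direct_log.is_file():
        matches.append(direct_log)

    # Layout 2: issues/{numerical_id}-{instance_id}/logs/{timestamp}/
    #           run/{instance_id}/generator/{generator_id}/
    # Assumption: "issues" lives directly under logs_directory.
    pattern = f"issues/*-{instance_id}/logs/*/run/{instance_id}/generator/*"
    matches.extend(sorted(p for p in logs_root.glob(pattern) if p.is_dir()))

    return matches
```

An empty list signals a missing trajectory, which the caller can log and skip rather than abort the whole batch.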
Output
The tool generates:
- Individual Analysis: Detailed failure analysis for each instance in the intermediate output directory
- Comprehensive Report: Categorized analysis of all failures with improvement recommendations
Architecture
Core Components
- run.py: Main entry point and batch processing coordinator
- trajectory_discovery.py: Handles trajectory finding and leaderboard integration
- pairwise_analysis.py: LLM-powered failure analysis and dataset loading
- utils.py: Utility functions for file operations and trajectory parsing
- config.yaml: Configuration management
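To illustrate what batch coordination bounded by max_workers could look like, here is a minimal sketch using the standard library's `concurrent.futures`. It is not taken from run.py: `analyze_in_parallel` is a hypothetical name, and `analyze_instance` stands in for whatever per-instance analysis function the tool actually calls.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def analyze_in_parallel(
    instance_ids: Iterable[str],
    analyze_instance: Callable[[str], str],
    max_workers: int = 20,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run analyze_instance over all IDs with at most max_workers threads.

    Returns (results, errors), each keyed by instance ID, so that one
    failing instance does not stop the rest of the batch.
    """
    results: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_id = {
            pool.submit(analyze_instance, instance_id): instance_id
            for instance_id in instance_ids
        }
        for future in as_completed(future_to_id):
            instance_id = future_to_id[future]
            try:
                results[instance_id] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[instance_id] = str(exc)
    return results, errors
```

Threads suit this workload because each task spends most of its time waiting on network requests rather than computing.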
Supported SWE-bench Splits
| Split | Dataset | Instances | Description |
|---|---|---|---|
| Verified | princeton-nlp/SWE-bench_Verified | 500 | Verified high-quality instances |
| Lite | SWE-bench/SWE-bench_Lite | 300 | Smaller subset for faster testing |
| Test | SWE-bench/SWE-bench | 2,294 | Full test set |
| Multimodal | SWE-bench/SWE-bench_Multimodal | 510 | Instances with UI elements |
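The table maps each configured split name to a dataset identifier. A minimal sketch of that lookup is shown below, using the names from the table; `SPLIT_TO_DATASET` and `resolve_dataset_name` are hypothetical and may not match how the tool stores this mapping internally.

```python
# Split name (as written in config.yaml) -> dataset identifier from the table.
SPLIT_TO_DATASET = {
    "Verified": "princeton-nlp/SWE-bench_Verified",
    "Lite": "SWE-bench/SWE-bench_Lite",
    "Test": "SWE-bench/SWE-bench",
    "Multimodal": "SWE-bench/SWE-bench_Multimodal",
}


def resolve_dataset_name(split: str) -> str:
    """Translate a configured split name into its dataset identifier."""
    try:
        return SPLIT_TO_DATASET[split]
    except KeyError:
        valid = ", ".join(SPLIT_TO_DATASET)
        raise ValueError(
            f"Unknown split {split!r}; expected one of: {valid}"
        ) from None
```

The returned identifier is the kind of value that `datasets.load_dataset` accepts as its first argument; which partition of each dataset the tool requests is not stated here.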
Error Handling
The tool includes robust error handling for:
- Missing trajectory files
- API rate limiting and failures
- Dataset loading errors
- Invalid configuration parameters
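For API rate limiting and transient failures, a common approach is to retry with exponential backoff and jitter. The sketch below shows that general technique and is not the tool's actual implementation: `call_with_retries` and its parameters are hypothetical, and `request` is any zero-argument callable that raises on failure.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), retrying failures with exponential backoff plus jitter.

    The wait before retry n is base_delay * 2**(n - 1), plus up to 50% extra
    random jitter so parallel workers do not retry in lockstep.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("unreachable")
```

In practice the `except` clause would be narrowed to the errors worth retrying, such as timeouts or HTTP 429 responses, so that configuration mistakes fail immediately.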
License
This project is licensed under the MIT License - see the project LICENSE file for details.