Files
InfCode/analyzer
2025-10-29 17:52:31 +08:00
..
2025-10-29 17:52:31 +08:00
2025-10-29 17:52:31 +08:00
2025-10-29 17:52:31 +08:00
2025-10-29 17:52:31 +08:00
2025-10-29 17:52:31 +08:00
2025-10-29 17:52:31 +08:00
2025-10-29 17:52:31 +08:00

SWE-Bench Trajectory Analysis

A comprehensive analysis tool for comparing failed and successful software engineering trajectories across different SWE-bench dataset splits. This tool helps identify patterns in agent failures and provides actionable insights for improving software engineering agents.

Features

  • Multi-Split Support: Analyze trajectories across SWE-bench splits (Verified, Lite, Test, Multimodal)
  • Automated Comparison: Compare your agent's failed trajectories with successful trajectories from top-performing agents
  • LLM-Powered Analysis: Uses advanced language models to generate detailed failure analysis reports
  • Batch Processing: Process multiple instances in parallel for balanced analysis
  • Leaderboard Integration: Automatically downloads and analyzes top agent results from SWE-bench leaderboards

Requirements

  • Python 3.8+
  • Required packages (see requirements.txt):
    • datasets>=2.0.0 - For loading SWE-bench datasets
    • requests>=2.25.0 - For API calls
    • python-dotenv>=0.19.0 - For environment variable management
    • PyYAML>=6.0 - For configuration file parsing
    • boto3>=1.40.52 - For AWS S3 access

Installation

  1. Clone the repository and navigate to the directory:

    cd trajectory_analysis
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Set up environment variables:

    cp .env.example .env
    # Edit .env and add your OPENROUTER_API_KEY
    

Configuration

Edit config.yaml to configure your analysis:

# Dataset configuration
dataset:
  split: "Verified"  # Options: Verified, Lite, Test, Multimodal

# Paths
paths:
  txt_path: "/path/to/unsolved_instances.txt"
  intermediate_output_path: "/path/to/intermediate_output/"
  output_path: "/path/to/final_report.txt"
  logs_directory: "/path/to/failed_trajectories/"
  prompts_path: "/path/to/prompts_manager.py"
  tools_path: "/path/to/tools/"

# Settings
settings:
  max_workers: 20 # Maximum number of parallel requests to OpenRouter
  num_agents: 4 # Number of top SWE-bench leaderboard agents' to download trajectories for

Usage

Basic Usage

  1. Prepare your data:

    • Create a text file listing unresolved instance IDs (one per line)
    • Ensure your failed trajectories are in the specified logs directory
  2. Run the analysis:

    python run.py
    

Input Requirements

  • Failed Trajectories: Your agent's failed trajectories in one of these formats:

    • Direct .log files named {instance_id}.log
    • Issues directory structure: issues/{numerical_id}-{instance_id}/logs/{timestamp}/run/{instance_id}/generator/{generator_id}/
  • Unresolved Instances: Text file containing instance IDs that your agent failed to solve

Output

The tool generates:

  • Individual Analysis: Detailed failure analysis for each instance in the intermediate output directory
  • Comprehensive Report: Categorized analysis of all failures with improvement recommendations

Architecture

Core Components

  • run.py: Main entry point and batch processing coordinator
  • trajectory_discovery.py: Handles trajectory finding and leaderboard integration
  • pairwise_analysis.py: LLM-powered failure analysis and dataset loading
  • utils.py: Utility functions for file operations and trajectory parsing
  • config.yaml: Configuration management

Supported SWE-bench Splits

Split Dataset Instances Description
Verified princeton-nlp/SWE-bench_Verified 500 Verified high-quality instances
Lite SWE-bench/SWE-bench_Lite 300 Smaller subset for faster testing
Test SWE-bench/SWE-bench 2,294 Full test set
Multimodal SWE-bench/SWE-bench_Multimodal 510 Instances with UI elements

Error Handling

The tool includes robust error handling for:

  • Missing trajectory files
  • API rate limiting and failures
  • Dataset loading errors
  • Invalid configuration parameters

Contributor

License

This project is licensed under the MIT License - see the project LICENSE file for details.