mirror of
https://github.com/Tokfinity/InfCode.git
synced 2026-05-24 16:54:02 +02:00
@@ -126,6 +126,8 @@ The main configurations are in `config/config.yaml`:
|
|||||||
- Ensure Docker is installed and running correctly.
|
- Ensure Docker is installed and running correctly.
|
||||||
- Check logs for detailed execution steps.
|
- Check logs for detailed execution steps.
|
||||||
- Supports multiple LLM providers, tool formats are auto-selected based on configuration.
|
- Supports multiple LLM providers, tool formats are auto-selected based on configuration.
|
||||||
|
- We also provide an automated failure analysis tool based on result trace files to help you quickly identify the root cause of bad cases.[here](https://github.com/Tokfinity/InfCode/blob/main/analyzer/README.md).
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## Contact
|
## Contact
|
||||||
|
|||||||
@@ -0,0 +1,125 @@
|
|||||||
|
# SWE-Bench Trajectory Analysis
|
||||||
|
|
||||||
|
A comprehensive analysis tool for comparing failed and successful software engineering trajectories across different SWE-bench dataset splits. This tool helps identify patterns in agent failures and provides actionable insights for improving software engineering agents.
|
||||||
|
|
||||||
|
## Features
|
||||||
|
|
||||||
|
- **Multi-Split Support**: Analyze trajectories across SWE-bench splits (Verified, Lite, Test, Multimodal)
|
||||||
|
- **Automated Comparison**: Compare your agent's failed trajectories with successful trajectories from top-performing agents
|
||||||
|
- **LLM-Powered Analysis**: Uses advanced language models to generate detailed failure analysis reports
|
||||||
|
- **Batch Processing**: Process multiple instances in parallel for balanced analysis
|
||||||
|
- **Leaderboard Integration**: Automatically downloads and analyzes top agent results from SWE-bench leaderboards
|
||||||
|
|
||||||
|
## Requirements
|
||||||
|
|
||||||
|
- Python 3.8+
|
||||||
|
- Required packages (see `requirements.txt`):
|
||||||
|
- `datasets>=2.0.0` - For loading SWE-bench datasets
|
||||||
|
- `requests>=2.25.0` - For API calls
|
||||||
|
- `python-dotenv>=0.19.0` - For environment variable management
|
||||||
|
- `PyYAML>=6.0` - For configuration file parsing
|
||||||
|
- `boto3>=1.40.52` - For AWS S3 access
|
||||||
|
|
||||||
|
## Installation
|
||||||
|
|
||||||
|
1. Clone the repository and navigate to the directory:
|
||||||
|
```bash
|
||||||
|
cd trajectory_analysis
|
||||||
|
```
|
||||||
|
|
||||||
|
2. Install dependencies:
|
||||||
|
```bash
|
||||||
|
pip install -r requirements.txt
|
||||||
|
```
|
||||||
|
|
||||||
|
3. Set up environment variables:
|
||||||
|
```bash
|
||||||
|
cp .env.example .env
|
||||||
|
# Edit .env and add your OPENROUTER_API_KEY
|
||||||
|
```
|
||||||
|
|
||||||
|
## Configuration
|
||||||
|
|
||||||
|
Edit `config.yaml` to configure your analysis:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
# Dataset configuration
|
||||||
|
dataset:
|
||||||
|
split: "Verified" # Options: Verified, Lite, Test, Multimodal
|
||||||
|
|
||||||
|
# Paths
|
||||||
|
paths:
|
||||||
|
txt_path: "/path/to/unsolved_instances.txt"
|
||||||
|
intermediate_output_path: "/path/to/intermediate_output/"
|
||||||
|
output_path: "/path/to/final_report.txt"
|
||||||
|
logs_directory: "/path/to/failed_trajectories/"
|
||||||
|
prompts_path: "/path/to/prompts_manager.py"
|
||||||
|
tools_path: "/path/to/tools/"
|
||||||
|
|
||||||
|
# Settings
|
||||||
|
settings:
|
||||||
|
max_workers: 20 # Maximum number of parallel requests to OpenRouter
|
||||||
|
num_agents: 4 # Number of top SWE-bench leaderboard agents' to download trajectories for
|
||||||
|
```
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
|
||||||
|
### Basic Usage
|
||||||
|
|
||||||
|
1. **Prepare your data**:
|
||||||
|
- Create a text file listing unresolved instance IDs (one per line)
|
||||||
|
- Ensure your failed trajectories are in the specified logs directory
|
||||||
|
|
||||||
|
2. **Run the analysis**:
|
||||||
|
```bash
|
||||||
|
python run.py
|
||||||
|
```
|
||||||
|
|
||||||
|
### Input Requirements
|
||||||
|
|
||||||
|
- **Failed Trajectories**: Your agent's failed trajectories in one of these formats:
|
||||||
|
- Direct `.log` files named `{instance_id}.log`
|
||||||
|
- Issues directory structure: `issues/{numerical_id}-{instance_id}/logs/{timestamp}/run/{instance_id}/generator/{generator_id}/`
|
||||||
|
|
||||||
|
- **Unresolved Instances**: Text file containing instance IDs that your agent failed to solve
|
||||||
|
|
||||||
|
### Output
|
||||||
|
|
||||||
|
The tool generates:
|
||||||
|
- **Individual Analysis**: Detailed failure analysis for each instance in the intermediate output directory
|
||||||
|
- **Comprehensive Report**: Categorized analysis of all failures with improvement recommendations
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
### Core Components
|
||||||
|
|
||||||
|
- **`run.py`**: Main entry point and batch processing coordinator
|
||||||
|
- **`trajectory_discovery.py`**: Handles trajectory finding and leaderboard integration
|
||||||
|
- **`pairwise_analysis.py`**: LLM-powered failure analysis and dataset loading
|
||||||
|
- **`utils.py`**: Utility functions for file operations and trajectory parsing
|
||||||
|
- **`config.yaml`**: Configuration management
|
||||||
|
|
||||||
|
## Supported SWE-bench Splits
|
||||||
|
|
||||||
|
| Split | Dataset | Instances | Description |
|
||||||
|
|-------|---------|-----------|-------------|
|
||||||
|
| Verified | `princeton-nlp/SWE-bench_Verified` | 500 | Verified high-quality instances |
|
||||||
|
| Lite | `SWE-bench/SWE-bench_Lite` | 300 | Smaller subset for faster testing |
|
||||||
|
| Test | `SWE-bench/SWE-bench` | 2,294 | Full test set |
|
||||||
|
| Multimodal | `SWE-bench/SWE-bench_Multimodal` | 510 | Instances with UI elements |
|
||||||
|
|
||||||
|
## Error Handling
|
||||||
|
|
||||||
|
The tool includes robust error handling for:
|
||||||
|
- Missing trajectory files
|
||||||
|
- API rate limiting and failures
|
||||||
|
- Dataset loading errors
|
||||||
|
- Invalid configuration parameters
|
||||||
|
|
||||||
|
# Contributor
|
||||||
|
|
||||||
|
- [Hengzhi Zhang](https://henryzhang11.github.io)
|
||||||
|
|
||||||
|
## License
|
||||||
|
|
||||||
|
This project is licensed under the MIT License - see the project LICENSE file for details.
|
||||||
@@ -0,0 +1,17 @@
|
|||||||
|
# Configuration file for trajectory analysis
|
||||||
|
paths:
|
||||||
|
txt_path: "/root/henry/trajectory_analysis/unsolved.txt"
|
||||||
|
intermediate_output_path: "/root/henry/trajectory_analysis_test/per_instance_failure_analysis"
|
||||||
|
output_path: "/root/henry/trajectory_analysis_test/failure_analysis_report.txt"
|
||||||
|
logs_directory: "/root/peteWong/project/Code-agent/batch_out/batch-first/issues"
|
||||||
|
prompts_path: "/root/henry/Code-agent/src/managers/prompts/prompts_manager.py"
|
||||||
|
tools_path: "/root/henry/Code-agent/src/tools"
|
||||||
|
|
||||||
|
# Dataset configuration
|
||||||
|
dataset:
|
||||||
|
split: "Verified" # Options: Verified, Lite, Test, Multimodal (bash-only not available on Hugging Face)
|
||||||
|
|
||||||
|
# Optional: Add other configuration parameters
|
||||||
|
settings:
|
||||||
|
max_workers: 20
|
||||||
|
num_agents: 1
|
||||||
@@ -0,0 +1,179 @@
|
|||||||
|
import json
|
||||||
|
import re
|
||||||
|
import os
|
||||||
|
import requests
|
||||||
|
import time
|
||||||
|
from datetime import datetime
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import List, Optional, Callable
|
||||||
|
from datasets import load_dataset
|
||||||
|
import random
|
||||||
|
from dotenv import load_dotenv
|
||||||
|
|
||||||
|
|
||||||
|
# Load environment variables from .env file
|
||||||
|
load_dotenv()
|
||||||
|
|
||||||
|
def failure_analysis_prompt(problem_statement: str, hints_text: str, working_trajectory: str, failed_trajectory: str, code_agent_code: str) -> str:
|
||||||
|
"""
|
||||||
|
Generate a failure analysis prompt for comparing successful and failed trajectories.
|
||||||
|
|
||||||
|
Creates a structured prompt that asks an AI model to analyze why one trajectory
|
||||||
|
succeeded while another failed, and what the failed trajectory could have done
|
||||||
|
differently to succeed.
|
||||||
|
"""
|
||||||
|
return f"""Here's a software engineering issue: {problem_statement}.
|
||||||
|
|
||||||
|
Here's the hints: {hints_text}.
|
||||||
|
|
||||||
|
Here's the trajectory that fixed the issue: {working_trajectory}.
|
||||||
|
|
||||||
|
Here's the trajectory that failed: {failed_trajectory}.
|
||||||
|
|
||||||
|
Here's the prompts/tools code for the software engineering agent: {code_agent_code}.
|
||||||
|
|
||||||
|
Why does the first SWE-agent resolve the issue but the second failed?
|
||||||
|
|
||||||
|
What could the second agent have done differently to resolve the issue?
|
||||||
|
|
||||||
|
What improvements could be made to the prompt/tool usage?
|
||||||
|
|
||||||
|
Answer in the following format:
|
||||||
|
# Why the second agent failed
|
||||||
|
...
|
||||||
|
# What could the second agent have done differently
|
||||||
|
...
|
||||||
|
# What improvements could be made to the prompt/tool usage
|
||||||
|
..."""
|
||||||
|
|
||||||
|
def openrouter_gemini_client(prompt: str, max_retries: int = 10, base_delay: float = 1.0) -> str:
|
||||||
|
"""
|
||||||
|
Send a prompt to OpenRouter API using Gemini model for analysis.
|
||||||
|
|
||||||
|
Makes API calls to OpenRouter service using Google's Gemini 2.5 Pro model
|
||||||
|
with exponential backoff retry logic. Handles authentication, rate limiting,
|
||||||
|
and error recovery for robust API communication.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
prompt (str): The analysis prompt to send to the model
|
||||||
|
max_retries (int, optional): Maximum number of retry attempts. Defaults to 10.
|
||||||
|
base_delay (float, optional): Base delay in seconds for exponential backoff. Defaults to 1.0.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
str: The model's response content as a string
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
ValueError: If OPENROUTER_API_KEY environment variable is not set
|
||||||
|
Exception: If API request fails after all retry attempts or returns invalid response
|
||||||
|
|
||||||
|
Note:
|
||||||
|
Requires OPENROUTER_API_KEY environment variable to be set with valid API key.
|
||||||
|
Uses exponential backoff: 1s, 2s, 4s, 8s, 16s, etc. between retries.
|
||||||
|
"""
|
||||||
|
api_key = os.getenv('OPENROUTER_API_KEY')
|
||||||
|
if not api_key:
|
||||||
|
raise ValueError("OPENROUTER_API_KEY environment variable not found")
|
||||||
|
|
||||||
|
url = "https://openrouter.ai/api/v1/chat/completions"
|
||||||
|
|
||||||
|
headers = {
|
||||||
|
"Authorization": f"Bearer {api_key}",
|
||||||
|
"Content-Type": "application/json"
|
||||||
|
}
|
||||||
|
|
||||||
|
data = {
|
||||||
|
"model": "google/gemini-2.5-pro",
|
||||||
|
"messages": [
|
||||||
|
{
|
||||||
|
"role": "user",
|
||||||
|
"content": prompt
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"max_tokens": 200000, # Increased limit for comprehensive analysis
|
||||||
|
"temperature": 0.7 # Lower temperature for more consistent summaries
|
||||||
|
}
|
||||||
|
|
||||||
|
for attempt in range(max_retries):
|
||||||
|
try:
|
||||||
|
print(f"ATTEMPT: API attempt {attempt + 1}/{max_retries}...")
|
||||||
|
response = requests.post(url, headers=headers, json=data, timeout=30)
|
||||||
|
response.raise_for_status()
|
||||||
|
|
||||||
|
result = response.json()
|
||||||
|
|
||||||
|
if 'choices' not in result or not result['choices']:
|
||||||
|
raise Exception(f"Invalid API response: missing or empty choices")
|
||||||
|
|
||||||
|
content = result['choices'][0]['message']['content']
|
||||||
|
|
||||||
|
if not content or content.strip() == "":
|
||||||
|
print(f"WARNING: Empty content returned from API on attempt {attempt + 1}/{max_retries}")
|
||||||
|
if attempt < max_retries - 1:
|
||||||
|
delay = base_delay * (2 ** attempt) / (2 ** random.uniform(0, 1)) # Exponential backoff with jitter
|
||||||
|
print(f"WAIT: Waiting {delay:.2f} seconds before retry...")
|
||||||
|
time.sleep(delay)
|
||||||
|
continue
|
||||||
|
else:
|
||||||
|
raise Exception(f"Empty content returned from API after {max_retries} attempts")
|
||||||
|
|
||||||
|
print(f"SUCCESS: API request successful on attempt {attempt + 1}")
|
||||||
|
return content.strip()
|
||||||
|
|
||||||
|
except requests.exceptions.RequestException as e:
|
||||||
|
print(f"FAILED: Request failed on attempt {attempt + 1}/{max_retries}: {e}")
|
||||||
|
if attempt < max_retries - 1:
|
||||||
|
delay = base_delay * (2 ** attempt) / (2 ** random.uniform(0, 1)) # Exponential backoff with jitter
|
||||||
|
print(f"WAIT: Waiting {delay:.2f} seconds before retry...")
|
||||||
|
time.sleep(delay)
|
||||||
|
continue
|
||||||
|
else:
|
||||||
|
raise Exception(f"FAILED: OpenRouter API request failed after {max_retries} attempts: {e}")
|
||||||
|
|
||||||
|
except (KeyError, IndexError) as e:
|
||||||
|
raise Exception(f"FAILED: Unexpected response format from OpenRouter API: {e}")
|
||||||
|
|
||||||
|
raise Exception(f"FAILED: All {max_retries} attempts failed")
|
||||||
|
|
||||||
|
def get_swe_bench_problem_and_hints_for_split(instance_id: str, split_name: str) -> tuple[str, str]:
|
||||||
|
"""
|
||||||
|
Retrieve problem statement and hints for a specific SWE-Bench instance in a specific split.
|
||||||
|
|
||||||
|
Loads the appropriate SWE-Bench dataset and searches for the specified instance ID to extract the problem statement and hints text.
|
||||||
|
This data is used for generating analysis prompts and understanding the context of the software engineering problem.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
instance_id (str): The unique identifier for the SWE-Bench instance
|
||||||
|
(e.g., 'scikit-learn__scikit-learn-10908')
|
||||||
|
split_name (str): The split name (e.g., "Verified", "Lite", "Test", "Multimodal", "bash-only")
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
tuple[str, str]: A tuple containing (problem_statement, hints_text)
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
ValueError: If the specified instance_id is not found in the dataset
|
||||||
|
Note:
|
||||||
|
This function loads the full SWE-Bench dataset for the specified split, which may take some time on first execution due to dataset download and caching.
|
||||||
|
"""
|
||||||
|
dataset_mapping = {
|
||||||
|
"Verified": "princeton-nlp/SWE-bench_Verified",
|
||||||
|
"Lite": "SWE-bench/SWE-bench_Lite",
|
||||||
|
"Test": "SWE-bench/SWE-bench",
|
||||||
|
"Multimodal": "SWE-bench/SWE-bench_Multimodal",
|
||||||
|
# Note: bash-only dataset not yet available on Hugging Face
|
||||||
|
}
|
||||||
|
|
||||||
|
dataset_name = dataset_mapping.get(split_name)
|
||||||
|
if not dataset_name:
|
||||||
|
raise ValueError(f"No dataset mapping found for split '{split_name}'")
|
||||||
|
|
||||||
|
ds = load_dataset(dataset_name, split="test")
|
||||||
|
target_row = None
|
||||||
|
for row in ds:
|
||||||
|
if row['instance_id'] == instance_id:
|
||||||
|
target_row = row
|
||||||
|
break
|
||||||
|
if target_row is None:
|
||||||
|
raise ValueError(f"Instance ID '{instance_id}' not found in the {split_name} dataset")
|
||||||
|
problem_statement = target_row['problem_statement']
|
||||||
|
hints_text = target_row['hints_text']
|
||||||
|
return problem_statement, hints_text
|
||||||
@@ -0,0 +1,5 @@
|
|||||||
|
datasets>=2.0.0
|
||||||
|
requests>=2.25.0
|
||||||
|
python-dotenv>=0.19.0
|
||||||
|
PyYAML>=6.0
|
||||||
|
boto3>=1.40.52
|
||||||
+222
@@ -0,0 +1,222 @@
|
|||||||
|
import json
|
||||||
|
import os
|
||||||
|
import yaml
|
||||||
|
from typing import List, Dict, Any, Tuple, Optional
|
||||||
|
from pathlib import Path
|
||||||
|
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||||
|
import threading
|
||||||
|
from trajectory_discovery import find_failed_trajectory, find_successful_trajectory_for_split, download_top_agents_logs_for_split, count_top_agents_solved_for_split
|
||||||
|
from pairwise_analysis import get_swe_bench_problem_and_hints_for_split, failure_analysis_prompt, openrouter_gemini_client
|
||||||
|
|
||||||
|
|
||||||
|
def load_config(config_path: str = "config.yaml") -> Dict[str, Any]:
|
||||||
|
"""Load configuration from YAML file."""
|
||||||
|
with open(config_path, 'r', encoding='utf-8') as f:
|
||||||
|
return yaml.safe_load(f)
|
||||||
|
|
||||||
|
|
||||||
|
def extract_code_agent_code(prompts_path: str, tools_path: str) -> Dict[str, str]:
|
||||||
|
"""
|
||||||
|
Extract content from Code-agent directory including prompts_manager.py and tools/ files.
|
||||||
|
"""
|
||||||
|
result = ""
|
||||||
|
prompts_path = Path(prompts_path)
|
||||||
|
if prompts_path.exists():
|
||||||
|
try:
|
||||||
|
with open(prompts_path, 'r', encoding='utf-8') as f:
|
||||||
|
result += f"{prompts_path.name}: \n" + "```python\n" + f.read() + "```\n"
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error reading prompts_manager.py: {e}")
|
||||||
|
tools_path = Path(tools_path)
|
||||||
|
if tools_path.exists():
|
||||||
|
for file_path in tools_path.iterdir():
|
||||||
|
if file_path.is_file() and not file_path.name.startswith('test'):
|
||||||
|
try:
|
||||||
|
with open(file_path, 'r', encoding='utf-8') as f:
|
||||||
|
result += f"{file_path.name}: \n" + "```python\n" + f.read() + "```\n"
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error reading {file_path}: {e}")
|
||||||
|
return result
|
||||||
|
|
||||||
|
def analyze_failure_causes(instance_id: str, current_directory: str, code_agent_code: str, split_name: str, num_agents: int = 5) -> Optional[str]:
|
||||||
|
"""
|
||||||
|
Compare trajectories alongside test results to analyze specific failure causes.
|
||||||
|
|
||||||
|
This function finds both failed and successful trajectories for a given instance, retrieves the problem context, and uses an LLM to analyze why the failure occurred.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
instance_id: The instance ID to analyze
|
||||||
|
current_directory: Directory containing failed trajectories
|
||||||
|
code_agent_code: Code agent code for analysis
|
||||||
|
split_name: The split name (e.g., "Verified", "Lite", "Test", "Multimodal", "bash-only")
|
||||||
|
num_agents: Number of top agents to check for successful trajectories
|
||||||
|
"""
|
||||||
|
failed_trajectory_path = find_failed_trajectory(instance_id, current_directory)
|
||||||
|
if not failed_trajectory_path:
|
||||||
|
print(f"No failed trajectory found for {instance_id}")
|
||||||
|
return None
|
||||||
|
successful_trajectory_path = find_successful_trajectory_for_split(instance_id, split_name, num_agents)
|
||||||
|
if not successful_trajectory_path:
|
||||||
|
print(f"No successful trajectory found for {instance_id} in {split_name} split")
|
||||||
|
return None
|
||||||
|
try:
|
||||||
|
with open(failed_trajectory_path, 'r', encoding='utf-8') as f:
|
||||||
|
failed_content = f.read()
|
||||||
|
with open(successful_trajectory_path, 'r', encoding='utf-8') as f:
|
||||||
|
successful_content = f.read()
|
||||||
|
problem_statement, hints_text = get_swe_bench_problem_and_hints_for_split(instance_id, split_name)
|
||||||
|
prompt = failure_analysis_prompt(problem_statement, hints_text, successful_content, failed_content, code_agent_code)
|
||||||
|
analysis_result = openrouter_gemini_client(prompt)
|
||||||
|
formatted_result = f"Instance: {instance_id}\nAnalysis: {analysis_result}\n"
|
||||||
|
print(f"Analysis result for {instance_id}: {analysis_result}")
|
||||||
|
return formatted_result
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error analyzing {instance_id}: {e}")
|
||||||
|
return None
|
||||||
|
|
||||||
|
def batch_failure_analysis(unresolved_id_txt_path: str, intermediate_output_path: str, output_path: str, logs_directory: str, prompts_path: str, tools_path: str, split_name: str, max_workers: int = 20, num_agents: int = 5) -> None:
|
||||||
|
"""
|
||||||
|
Complete batch failure analysis pipeline for a specific split.
|
||||||
|
Process failed trajectories, find successful trajectories, analyze failure causes, and categorize them by cause. Generate a comprehensive report with recommendations and save it to output_path.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
unresolved_id_txt_path: Path to text file containing unresolved instance IDs
|
||||||
|
intermediate_output_path: Path to directory for intermediate output files
|
||||||
|
output_path: Path to final output report file
|
||||||
|
logs_directory: Directory containing failed trajectory logs
|
||||||
|
prompts_path: Path to prompts manager Python file
|
||||||
|
tools_path: Path to tools directory
|
||||||
|
split_name: The split name (e.g., "Verified", "Lite", "Test", "Multimodal", "bash-only")
|
||||||
|
max_workers: Maximum number of worker threads
|
||||||
|
num_agents: Number of top agents to analyze
|
||||||
|
"""
|
||||||
|
print(f"Regenerating top agents performance CSV for {split_name} split...")
|
||||||
|
count_top_agents_solved_for_split(split_name, num_agents)
|
||||||
|
print(f"Ensuring experiments repository and logs are up to date for {split_name} split...")
|
||||||
|
download_top_agents_logs_for_split("experiments", split_name, num_agents)
|
||||||
|
try:
|
||||||
|
with open(unresolved_id_txt_path, 'r', encoding='utf-8') as f:
|
||||||
|
unresolved_ids = [line.strip() for line in f.readlines() if line.strip()]
|
||||||
|
|
||||||
|
if not unresolved_ids:
|
||||||
|
print("No unresolved instances found in text file")
|
||||||
|
return
|
||||||
|
print(f"Found {len(unresolved_ids)} unresolved instances to analyze")
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error loading text file: {e}")
|
||||||
|
return
|
||||||
|
|
||||||
|
failure_analysis_results = []
|
||||||
|
skipped_instances = []
|
||||||
|
code_agent_code = extract_code_agent_code(prompts_path, tools_path)
|
||||||
|
# create the intermediate output directory if it doesn't exist
|
||||||
|
os.makedirs(intermediate_output_path, exist_ok=True)
|
||||||
|
|
||||||
|
def process_instance(instance_id):
|
||||||
|
print(f"\nProcessing: {instance_id}")
|
||||||
|
try:
|
||||||
|
analysis_result = analyze_failure_causes(instance_id, logs_directory, code_agent_code, split_name, num_agents)
|
||||||
|
if analysis_result:
|
||||||
|
with open(os.path.join(intermediate_output_path, f'{instance_id}.txt'), 'w', encoding='utf-8') as f:
|
||||||
|
f.write(analysis_result)
|
||||||
|
return (instance_id, analysis_result, None)
|
||||||
|
else:
|
||||||
|
print(f"Skipped {instance_id} - no trajectories found")
|
||||||
|
return (instance_id, None, "no trajectories found")
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error processing {instance_id}: {e}")
|
||||||
|
return (instance_id, None, str(e))
|
||||||
|
|
||||||
|
with ThreadPoolExecutor(max_workers=max_workers) as executor:
|
||||||
|
futures = [executor.submit(process_instance, instance_id) for instance_id in unresolved_ids]
|
||||||
|
for future in as_completed(futures):
|
||||||
|
instance_id, analysis_result, error = future.result()
|
||||||
|
if analysis_result:
|
||||||
|
failure_analysis_results.append(analysis_result)
|
||||||
|
else:
|
||||||
|
skipped_instances.append(instance_id)
|
||||||
|
# combine the intermediate output files into a single file
|
||||||
|
with open(output_path, 'w', encoding='utf-8') as f:
|
||||||
|
for file in os.listdir(intermediate_output_path):
|
||||||
|
if file.endswith('.txt'):
|
||||||
|
with open(os.path.join(intermediate_output_path, file), 'r', encoding='utf-8') as f1:
|
||||||
|
f.write(f1.read() + '\n')
|
||||||
|
# read the combined failure analysis report
|
||||||
|
with open(output_path, 'r', encoding='utf-8') as f:
|
||||||
|
failure_analysis_report = f.read()
|
||||||
|
batch_prompt = f"""Read failure analysis reports and group failure causes into categories.
|
||||||
|
|
||||||
|
Here are the failure analysis reports:
|
||||||
|
|
||||||
|
{failure_analysis_report}
|
||||||
|
|
||||||
|
Here are the prompts/tools code for the software engineering agent:
|
||||||
|
|
||||||
|
{code_agent_code}
|
||||||
|
|
||||||
|
Please read the above analysis and categorize the failure causes into categories.
|
||||||
|
|
||||||
|
For each category, please describe the failure cause in detail, list related instances, and suggest improvements for the prompt/tool usage to address the failure cause.
|
||||||
|
|
||||||
|
Format your response as:
|
||||||
|
|
||||||
|
# Problem 1:
|
||||||
|
**Description**: [Your analysis of this problem]
|
||||||
|
**Related Instances**:
|
||||||
|
- instance_id_1: [brief description of the failure]
|
||||||
|
- instance_id_2: [brief description of the failure]
|
||||||
|
**Improvements**: [Suggestions for improving the prompt/tool usage to address the failure cause]
|
||||||
|
|
||||||
|
# Problem 2:
|
||||||
|
**Description**: [Your analysis of this problem]
|
||||||
|
**Related Instances**:
|
||||||
|
- instance_id_3: [brief description of the failure]
|
||||||
|
- instance_id_4: [brief description of the failure]
|
||||||
|
**Improvements**: [Suggestions for improving the prompt/tool usage to address the failure cause]
|
||||||
|
|
||||||
|
Continue for all identified problems."""
|
||||||
|
|
||||||
|
try:
|
||||||
|
batch_analysis = openrouter_gemini_client(batch_prompt)
|
||||||
|
with open(output_path, 'w', encoding='utf-8') as f:
|
||||||
|
f.write(batch_analysis)
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error generating batch analysis: {e}")
|
||||||
|
return f"Error generating batch analysis: {e}"
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
config = load_config("config.yaml")
|
||||||
|
|
||||||
|
# Validate paths exist
|
||||||
|
paths = config["paths"]
|
||||||
|
for path_name in ["txt_path", "logs_directory", "prompts_path", "tools_path"]:
|
||||||
|
if not os.path.exists(paths[path_name]):
|
||||||
|
raise FileNotFoundError(f"Path {path_name}: {paths[path_name]} does not exist")
|
||||||
|
|
||||||
|
# Validate dataset split
|
||||||
|
split_name = config["dataset"]["split"]
|
||||||
|
valid_splits = ["Verified", "Lite", "Test", "Multimodal"] # bash-only not yet available on Hugging Face
|
||||||
|
if split_name not in valid_splits:
|
||||||
|
raise ValueError(f"split must be one of {valid_splits}, got: {split_name}")
|
||||||
|
|
||||||
|
# Validate max_workers
|
||||||
|
max_workers = config["settings"]["max_workers"]
|
||||||
|
if not isinstance(max_workers, int) or max_workers <= 0:
|
||||||
|
raise ValueError(f"max_workers must be positive integer, got: {max_workers}")
|
||||||
|
|
||||||
|
# Validate num_agents
|
||||||
|
num_agents = config["settings"]["num_agents"]
|
||||||
|
if not isinstance(num_agents, int) or num_agents <= 0:
|
||||||
|
raise ValueError(f"num_agents must be positive integer, got: {num_agents}")
|
||||||
|
|
||||||
|
batch_failure_analysis(
|
||||||
|
paths["txt_path"],
|
||||||
|
paths["intermediate_output_path"],
|
||||||
|
paths["output_path"],
|
||||||
|
paths["logs_directory"],
|
||||||
|
paths["prompts_path"],
|
||||||
|
paths["tools_path"],
|
||||||
|
split_name,
|
||||||
|
max_workers,
|
||||||
|
num_agents
|
||||||
|
)
|
||||||
@@ -0,0 +1,386 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
import os
|
||||||
|
import re
|
||||||
|
import subprocess
|
||||||
|
import sys
|
||||||
|
import json
|
||||||
|
import pathlib
|
||||||
|
import csv
|
||||||
|
import urllib.request
|
||||||
|
import argparse
|
||||||
|
from typing import Optional, List, Tuple, Dict, Any
|
||||||
|
from pathlib import Path
|
||||||
|
from concurrent.futures import ProcessPoolExecutor, as_completed
|
||||||
|
import multiprocessing
|
||||||
|
from utils import read_json_url_or_path
|
||||||
|
from datasets import load_dataset
|
||||||
|
from utils import find_failed_trajectory_in_issues_directory
|
||||||
|
|
||||||
|
LEADERBOARD_JSON = "https://raw.githubusercontent.com/swe-bench/swe-bench.github.io/master/data/leaderboards.json"
|
||||||
|
EXPERIMENTS_DIR = str(Path(__file__).parent / "experiments")
|
||||||
|
EXPERIMENTS_REPO_URL = "https://github.com/SWE-bench/experiments.git"
|
||||||
|
|
||||||
|
def top_agents_for_split(leaderboards_json: dict, split_name: str, num_agents: int = 5) -> list[dict]:
|
||||||
|
"""
|
||||||
|
Get the top N SWE-bench agents for a specific split from the leaderboard.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
leaderboards_json: The leaderboard JSON data
|
||||||
|
split_name: The split name (e.g., "Verified", "Lite", "Test", "Multimodal")
|
||||||
|
num_agents: Number of top agents to return (default: 5)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of top N agents sorted by resolution rate
|
||||||
|
"""
|
||||||
|
lbs = leaderboards_json.get("leaderboards") or leaderboards_json
|
||||||
|
split_lb = next((lb for lb in lbs if (lb.get("name") or "").lower() == split_name.lower()), None)
|
||||||
|
if not split_lb:
|
||||||
|
raise KeyError(f"Could not find '{split_name}' leaderboard in leaderboards.json")
|
||||||
|
rows = [r for r in split_lb.get("results", []) if not r.get("warning")]
|
||||||
|
rows.sort(key=lambda r: float(r.get("resolved", 0)), reverse=True)
|
||||||
|
return rows[:num_agents]
|
||||||
|
|
||||||
|
def read_submission_resolved_ids(submission_dir: pathlib.Path) -> set[str]:
|
||||||
|
"""
|
||||||
|
Read the resolved instance_ids from the results.json file in a SWE-bench verified submission folder.
|
||||||
|
"""
|
||||||
|
fp = submission_dir / "results" / "results.json"
|
||||||
|
if not fp.is_file():
|
||||||
|
return set()
|
||||||
|
try:
|
||||||
|
j = json.loads(fp.read_text(encoding="utf-8"))
|
||||||
|
return set(j.get("resolved", []))
|
||||||
|
except Exception:
|
||||||
|
return set()
|
||||||
|
|
||||||
|
def load_canonical_ids_for_split(split_name: str) -> list[str]:
|
||||||
|
"""
|
||||||
|
Load canonical instance IDs for a specific SWE-bench split.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
split_name: The split name (e.g., "Verified", "Lite", "Test", "Multimodal")
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of sorted instance IDs for the split
|
||||||
|
"""
|
||||||
|
dataset_mapping = {
|
||||||
|
"Verified": "princeton-nlp/SWE-bench_Verified",
|
||||||
|
"Lite": "SWE-bench/SWE-bench_Lite",
|
||||||
|
"Test": "SWE-bench/SWE-bench",
|
||||||
|
"Multimodal": "SWE-bench/SWE-bench_Multimodal",
|
||||||
|
# Note: bash-only dataset not available on Hugging Face
|
||||||
|
}
|
||||||
|
|
||||||
|
dataset_name = dataset_mapping.get(split_name)
|
||||||
|
if not dataset_name:
|
||||||
|
print(f"Warning: No dataset mapping found for split '{split_name}'")
|
||||||
|
return []
|
||||||
|
|
||||||
|
try:
|
||||||
|
ds = load_dataset(dataset_name, split="test")
|
||||||
|
ids = sorted(ds["instance_id"])
|
||||||
|
print(f"Loaded {len(ids)} instance IDs for {split_name} split")
|
||||||
|
return ids
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error loading dataset for {split_name}: {e}")
|
||||||
|
return []
|
||||||
|
|
||||||
|
def ensure_experiments_repo() -> pathlib.Path:
|
||||||
|
"""Clone or update the SWE-bench experiments repository."""
|
||||||
|
experiments_path = pathlib.Path(EXPERIMENTS_DIR).expanduser().resolve()
|
||||||
|
|
||||||
|
if experiments_path.exists() and (experiments_path / ".git").exists():
|
||||||
|
try:
|
||||||
|
subprocess.run(["git", "pull"], cwd=experiments_path, check=True, capture_output=True, text=True)
|
||||||
|
except subprocess.CalledProcessError as e:
|
||||||
|
print(f"Warning: Failed to update repository: {e.stderr}")
|
||||||
|
print("Continuing with existing version...")
|
||||||
|
else:
|
||||||
|
try:
|
||||||
|
subprocess.run(["git", "clone", EXPERIMENTS_REPO_URL, str(experiments_path)], check=True, capture_output=True, text=True)
|
||||||
|
print("Repository cloned successfully")
|
||||||
|
except subprocess.CalledProcessError as e:
|
||||||
|
print(f"Error: Failed to clone repository: {e.stderr}")
|
||||||
|
sys.exit(1)
|
||||||
|
return experiments_path
|
||||||
|
|
||||||
|
def count_top_agents_solved_for_split(split_name: str, num_agents: int = 5):
|
||||||
|
"""
|
||||||
|
Generate CSV showing which top agents solved which instances for a specific split.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
split_name: The split name (e.g., "Verified", "Lite", "Test", "Multimodal")
|
||||||
|
num_agents: Number of top agents to analyze
|
||||||
|
"""
|
||||||
|
# Ensure we have the latest experiments repository
|
||||||
|
ensure_experiments_repo()
|
||||||
|
split_root = Path(EXPERIMENTS_DIR) / "evaluation" / split_name.lower()
|
||||||
|
if not split_root.is_dir():
|
||||||
|
print(f"ERROR: {split_root} not found. Set EXPERIMENTS_DIR correctly.", file=sys.stderr)
|
||||||
|
sys.exit(2)
|
||||||
|
|
||||||
|
lb = read_json_url_or_path(LEADERBOARD_JSON)
|
||||||
|
top_agents = top_agents_for_split(lb, split_name, num_agents)
|
||||||
|
|
||||||
|
columns = []
|
||||||
|
local_union_ids: set[str] = set()
|
||||||
|
|
||||||
|
print(f"Top {num_agents} ({split_name}):")
|
||||||
|
for i, row in enumerate(top_agents, 1):
|
||||||
|
folder = row.get("folder")
|
||||||
|
name = row.get("name") or f"rank{i}"
|
||||||
|
print(f"{i:2d}. {name} | resolved={row.get('resolved')} | folder={folder}")
|
||||||
|
if not folder:
|
||||||
|
continue
|
||||||
|
subdir = split_root / folder
|
||||||
|
if not subdir.is_dir():
|
||||||
|
print(f"[warn] missing folder locally: {subdir}", file=sys.stderr)
|
||||||
|
continue
|
||||||
|
resolved = read_submission_resolved_ids(subdir)
|
||||||
|
columns.append((f"{i:02d}_{folder}", resolved))
|
||||||
|
# also collect skipped ids so rows don't get lost
|
||||||
|
try:
|
||||||
|
j = json.loads((subdir / "results" / "results.json").read_text(encoding="utf-8"))
|
||||||
|
local_union_ids |= set(j.get("resolved", []))
|
||||||
|
local_union_ids |= set(j.get("no_generation", []))
|
||||||
|
local_union_ids |= set(j.get("no_logs", []))
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
# decide row universe (instance_ids)
|
||||||
|
instance_ids = load_canonical_ids_for_split(split_name)
|
||||||
|
if not instance_ids:
|
||||||
|
instance_ids = sorted(local_union_ids)
|
||||||
|
print(f"[warn] using local union of IDs ({len(instance_ids)}). "
|
||||||
|
f"For guaranteed canonical IDs, install `datasets` and internet access.", file=sys.stderr)
|
||||||
|
|
||||||
|
out_csv = pathlib.Path(f"top_agents_performance_{split_name.lower()}.csv")
|
||||||
|
|
||||||
|
headers = ["instance_id"] + [col for col, _ in columns] + ["any_top_agents_solved"]
|
||||||
|
with out_csv.open("w", newline="", encoding="utf-8") as f:
|
||||||
|
w = csv.writer(f)
|
||||||
|
w.writerow(headers)
|
||||||
|
|
||||||
|
non_resolved_instances = []
|
||||||
|
|
||||||
|
for iid in instance_ids:
|
||||||
|
row_vals = [iid]
|
||||||
|
solved_by_any = 0
|
||||||
|
|
||||||
|
for _, resolved in columns:
|
||||||
|
solved = 1 if iid in resolved else 0
|
||||||
|
row_vals.append(solved)
|
||||||
|
if solved == 1:
|
||||||
|
solved_by_any = 1
|
||||||
|
|
||||||
|
row_vals.append(solved_by_any)
|
||||||
|
w.writerow(row_vals)
|
||||||
|
|
||||||
|
if solved_by_any == 0:
|
||||||
|
non_resolved_instances.append(iid)
|
||||||
|
|
||||||
|
print(f"\nResults saved to: {out_csv}")
|
||||||
|
print(f"\nNon-resolved instances ({len(non_resolved_instances)}):")
|
||||||
|
for instance_id in non_resolved_instances:
|
||||||
|
print(f" {instance_id}")
|
||||||
|
|
||||||
|
def find_failed_trajectory(instance_id: str, experiments_dir: str) -> Optional[str]:
|
||||||
|
"""
|
||||||
|
Find a failed trajectory for the given instance within the experiments directory (assume names like '{instance_id}.log').
|
||||||
|
If the file exists, return the path to the file.
|
||||||
|
Otherwise, call find_failed_trajectory_in_issues_directory to find the failed trajectory.
|
||||||
|
"""
|
||||||
|
trajectory_path = os.path.join(experiments_dir, instance_id + ".log")
|
||||||
|
if os.path.exists(trajectory_path):
|
||||||
|
return trajectory_path
|
||||||
|
else:
|
||||||
|
return find_failed_trajectory_in_issues_directory(instance_id, experiments_dir)
|
||||||
|
|
||||||
|
def find_successful_trajectory_for_split(instance_id: str, split_name: str, num_agents: int = 5) -> Optional[str]:
|
||||||
|
"""
|
||||||
|
Find a successful trajectory for the given instance in a specific split.
|
||||||
|
Checks the top agents performance CSV to find which agents solved the instance, then looks in each agent's directory for trajectory files.
|
||||||
|
Tries agents in order of ranking until a trajectory is found.
|
||||||
|
Special handling for 20250915_JoyCode agent where trajectories are in instance-named folders.
|
||||||
|
Automatically downloads logs and generates CSV if missing.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
instance_id: The instance ID to find trajectory for
|
||||||
|
split_name: The split name (e.g., "Verified", "Lite", "Test", "Multimodal")
|
||||||
|
num_agents: Number of top agents to check
|
||||||
|
"""
|
||||||
|
# Check if trajectory folders exist, download if missing
|
||||||
|
split_dir = os.path.join(EXPERIMENTS_DIR, "evaluation", split_name.lower())
|
||||||
|
if not os.path.exists(split_dir):
|
||||||
|
download_top_agents_logs_for_split(EXPERIMENTS_DIR, split_name, num_agents)
|
||||||
|
csv_path = os.path.join(os.path.dirname(__file__), f"top_agents_performance_{split_name.lower()}.csv")
|
||||||
|
if not os.path.exists(csv_path):
|
||||||
|
count_top_agents_solved_for_split(split_name, num_agents)
|
||||||
|
|
||||||
|
# Get all agents that solved this instance
|
||||||
|
solved_agents = _find_all_agents_for_instance(instance_id, csv_path)
|
||||||
|
if not solved_agents:
|
||||||
|
print(f"No agent found that solved {instance_id} in {split_name} split")
|
||||||
|
return None
|
||||||
|
|
||||||
|
# Try each agent until we find a trajectory
|
||||||
|
for agent_name in solved_agents:
|
||||||
|
print(f"Trying agent {agent_name} for {instance_id}")
|
||||||
|
# Special handling for JoyCode agent
|
||||||
|
if agent_name == "20250915_JoyCode":
|
||||||
|
trajectory_path = _find_joycode_trajectory_for_split(instance_id, EXPERIMENTS_DIR, split_name)
|
||||||
|
else:
|
||||||
|
trajectory_path = _find_regular_agent_trajectory_for_split(instance_id, agent_name, EXPERIMENTS_DIR, split_name)
|
||||||
|
|
||||||
|
if trajectory_path:
|
||||||
|
print(f"Found trajectory for {instance_id} using agent {agent_name}")
|
||||||
|
return trajectory_path
|
||||||
|
else:
|
||||||
|
print(f"No trajectory found for {instance_id} using agent {agent_name}")
|
||||||
|
|
||||||
|
return None
|
||||||
|
|
||||||
|
def _find_best_agent_for_instance(instance_id: str, csv_path: str) -> Optional[str]:
|
||||||
|
"""
|
||||||
|
Find the best agent that solved the given instance by checking the CSV.
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
with open(csv_path, 'r', encoding='utf-8') as f:
|
||||||
|
reader = csv.DictReader(f)
|
||||||
|
for row in reader:
|
||||||
|
if row['instance_id'] == instance_id:
|
||||||
|
for col_name in reader.fieldnames:
|
||||||
|
if col_name.startswith(('01_', '02_', '03_', '04_', '05_', '06_', '07_', '08_', '09_', '10_')) and row.get(col_name) == '1':
|
||||||
|
agent_name = col_name[3:]
|
||||||
|
return agent_name
|
||||||
|
return None
|
||||||
|
return None
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error reading CSV: {e}")
|
||||||
|
return None
|
||||||
|
return None
|
||||||
|
|
||||||
|
def _find_all_agents_for_instance(instance_id: str, csv_path: str) -> List[str]:
|
||||||
|
"""
|
||||||
|
Find all agents that solved the given instance by checking the CSV.
|
||||||
|
Returns a list of agent names sorted by their ranking (01_, 02_, etc.).
|
||||||
|
"""
|
||||||
|
agents = []
|
||||||
|
try:
|
||||||
|
with open(csv_path, 'r', encoding='utf-8') as f:
|
||||||
|
reader = csv.DictReader(f)
|
||||||
|
for row in reader:
|
||||||
|
if row['instance_id'] == instance_id:
|
||||||
|
for col_name in reader.fieldnames:
|
||||||
|
if col_name.startswith(('01_', '02_', '03_', '04_', '05_', '06_', '07_', '08_', '09_', '10_')) and row.get(col_name) == '1':
|
||||||
|
agent_name = col_name[3:]
|
||||||
|
agents.append(agent_name)
|
||||||
|
break
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error reading CSV: {e}")
|
||||||
|
return []
|
||||||
|
return agents
|
||||||
|
|
||||||
|
def _find_joycode_trajectory_for_split(instance_id: str, experiments_dir: str, split_name: str) -> Optional[str]:
|
||||||
|
"""
|
||||||
|
Find trajectory for JoyCode agent in a specific split (special case with instance-named folders).
|
||||||
|
"""
|
||||||
|
joycode_trajs_dir = os.path.join(experiments_dir, "evaluation", split_name.lower(), "20250915_JoyCode", "trajs")
|
||||||
|
instance_dir = os.path.join(joycode_trajs_dir, instance_id)
|
||||||
|
if not os.path.exists(instance_dir):
|
||||||
|
print(f"JoyCode trajectory directory not found: {instance_dir}")
|
||||||
|
return None
|
||||||
|
completed_logs_path = os.path.join(instance_dir, "completed_logs.txt")
|
||||||
|
if os.path.exists(completed_logs_path):
|
||||||
|
return os.path.abspath(completed_logs_path)
|
||||||
|
else:
|
||||||
|
print(f"completed_logs.txt not found in {instance_dir}")
|
||||||
|
return None
|
||||||
|
|
||||||
|
def _find_regular_agent_trajectory_for_split(instance_id: str, agent_name: str, experiments_dir: str, split_name: str) -> Optional[str]:
|
||||||
|
"""
|
||||||
|
Find trajectory for regular agents in a specific split (files named with instance_id).
|
||||||
|
"""
|
||||||
|
agent_trajs_dir = os.path.join(experiments_dir, "evaluation", split_name.lower(), agent_name, "trajs")
|
||||||
|
if not os.path.exists(agent_trajs_dir):
|
||||||
|
print(f"Agent trajectory directory not found: {agent_trajs_dir}")
|
||||||
|
return None
|
||||||
|
|
||||||
|
try:
|
||||||
|
for filename in os.listdir(agent_trajs_dir):
|
||||||
|
if instance_id in filename:
|
||||||
|
trajectory_path = os.path.join(agent_trajs_dir, filename)
|
||||||
|
if os.path.isfile(trajectory_path):
|
||||||
|
return os.path.abspath(trajectory_path)
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
return None
|
||||||
|
|
||||||
|
def download_single_agent_for_split(agent_info: tuple, experiments_dir: str, split_name: str) -> tuple[str, bool, str]:
|
||||||
|
"""
|
||||||
|
Download logs for a single agent in a specific split. Returns (agent_name, success, error_message).
|
||||||
|
"""
|
||||||
|
i, agent = agent_info
|
||||||
|
folder = agent.get("folder")
|
||||||
|
name = agent.get("name") or f"rank{i}"
|
||||||
|
|
||||||
|
if not folder:
|
||||||
|
return (name, False, "No folder specified")
|
||||||
|
|
||||||
|
# Check if trajectories already exist
|
||||||
|
trajs_path = os.path.join(experiments_dir, "evaluation", split_name.lower(), folder, "trajs")
|
||||||
|
if os.path.exists(trajs_path) and os.listdir(trajs_path):
|
||||||
|
print(f"[{name}] ✓ Trajectories already exist, skipping download")
|
||||||
|
return (name, True, "")
|
||||||
|
|
||||||
|
download_path = f"evaluation/{split_name.lower()}/{folder}"
|
||||||
|
cmd = [sys.executable, "-m", "analysis.download_logs", "--only_trajs", "--skip_existing", download_path]
|
||||||
|
try:
|
||||||
|
result = subprocess.run(cmd, cwd=experiments_dir, check=True)
|
||||||
|
print(f"[{name}] ✓ Download completed successfully")
|
||||||
|
return (name, True, "")
|
||||||
|
except Exception as e:
|
||||||
|
error_msg = f"Unexpected error: {e}"
|
||||||
|
print(f"[{name}] ✗ {error_msg}")
|
||||||
|
return (name, False, error_msg)
|
||||||
|
|
||||||
|
def download_top_agents_logs_for_split(experiments_dir: str = "experiments", split_name: str = "Verified", num_agents: int = 5) -> None:
|
||||||
|
"""
|
||||||
|
Download logs for all top N agents in a specific split using parallel processing.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
experiments_dir: Directory containing the experiments repository
|
||||||
|
split_name: The split name (e.g., "Verified", "Lite", "Test", "Multimodal")
|
||||||
|
num_agents: Number of top agents to download logs for (default: 5)
|
||||||
|
"""
|
||||||
|
ensure_experiments_repo()
|
||||||
|
print("Fetching leaderboard data...")
|
||||||
|
leaderboard_json = read_json_url_or_path(LEADERBOARD_JSON)
|
||||||
|
top_agents = top_agents_for_split(leaderboard_json, split_name, num_agents)
|
||||||
|
|
||||||
|
print(f"Checking/downloading trajectories for {len(top_agents)} agents in {split_name} split...")
|
||||||
|
|
||||||
|
agent_infos = [(i, agent) for i, agent in enumerate(top_agents, 1)]
|
||||||
|
|
||||||
|
with ProcessPoolExecutor(max_workers=10) as executor:
|
||||||
|
future_to_agent = {
|
||||||
|
executor.submit(download_single_agent_for_split, agent_info, experiments_dir, split_name): agent_info[1]
|
||||||
|
for agent_info in agent_infos
|
||||||
|
}
|
||||||
|
|
||||||
|
successful = 0
|
||||||
|
failed = 0
|
||||||
|
|
||||||
|
for future in as_completed(future_to_agent):
|
||||||
|
agent = future_to_agent[future]
|
||||||
|
try:
|
||||||
|
name, success, error_msg = future.result()
|
||||||
|
if success:
|
||||||
|
successful += 1
|
||||||
|
else:
|
||||||
|
failed += 1
|
||||||
|
except Exception as e:
|
||||||
|
name = agent.get("name", "Unknown")
|
||||||
|
print(f"[{name}] ✗ Exception: {e}")
|
||||||
|
failed += 1
|
||||||
|
print(f"\nDownload completed for {split_name} split. Success: {successful}, Failed: {failed}")
|
||||||
@@ -0,0 +1,101 @@
|
|||||||
|
import urllib.request
|
||||||
|
import json
|
||||||
|
import pathlib
|
||||||
|
import os
|
||||||
|
import re
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
|
||||||
|
def read_json_url_or_path(src: str) -> dict:
|
||||||
|
if src.startswith(("http://", "https://")):
|
||||||
|
with urllib.request.urlopen(src, timeout=60) as r:
|
||||||
|
return json.load(r)
|
||||||
|
p = pathlib.Path(src).expanduser().resolve()
|
||||||
|
with p.open("r", encoding="utf-8") as f:
|
||||||
|
return json.load(f)
|
||||||
|
|
||||||
|
# an "issues" directory containing "{numerical_id}-{instance_id}/logs/{timestamp}/run/{instance_id}/generator/{generator_id}/" and within it "notice.log" and "debug.log"
|
||||||
|
def find_failed_trajectory_in_issues_directory(instance_id: str, experiments_dir: str) -> Optional[str]:
|
||||||
|
"""
|
||||||
|
Find a failed trajectory for the given instance within the experiments directory.
|
||||||
|
|
||||||
|
The function searches for debug.log files that contain successful patches and returns
|
||||||
|
the path to the corresponding notice.log file.
|
||||||
|
|
||||||
|
The function searches through the directory structure:
|
||||||
|
{experiments_dir}/{numerical_id}-{instance_id}/logs/{timestamp}/run/{instance_id}/generator/{generator_id}/
|
||||||
|
"""
|
||||||
|
if not os.path.exists(experiments_dir):
|
||||||
|
return None
|
||||||
|
pattern = re.compile(r'^\d+-' + re.escape(instance_id) + r'$')
|
||||||
|
for item in os.listdir(experiments_dir):
|
||||||
|
if pattern.match(item):
|
||||||
|
instance_dir = os.path.join(experiments_dir, item)
|
||||||
|
if not os.path.isdir(instance_dir):
|
||||||
|
continue
|
||||||
|
logs_dir = os.path.join(instance_dir, 'logs')
|
||||||
|
if not os.path.exists(logs_dir):
|
||||||
|
continue
|
||||||
|
timestamp_dirs = []
|
||||||
|
for timestamp in os.listdir(logs_dir):
|
||||||
|
timestamp_path = os.path.join(logs_dir, timestamp)
|
||||||
|
if os.path.isdir(timestamp_path):
|
||||||
|
timestamp_dirs.append(timestamp)
|
||||||
|
timestamp_dirs.sort(reverse=True)
|
||||||
|
for timestamp in timestamp_dirs:
|
||||||
|
timestamp_path = os.path.join(logs_dir, timestamp)
|
||||||
|
run_dir = os.path.join(timestamp_path, 'run', instance_id, 'generator')
|
||||||
|
if not os.path.exists(run_dir):
|
||||||
|
continue
|
||||||
|
for generator_id in range(5):
|
||||||
|
generator_dir = os.path.join(run_dir, f"{generator_id:03d}")
|
||||||
|
debug_log_path = os.path.join(generator_dir, 'debug.log')
|
||||||
|
notice_log_path = os.path.join(generator_dir, 'notice.log')
|
||||||
|
if not os.path.exists(debug_log_path):
|
||||||
|
continue
|
||||||
|
if _contains_successful_patch(debug_log_path):
|
||||||
|
if os.path.exists(notice_log_path):
|
||||||
|
return os.path.abspath(notice_log_path)
|
||||||
|
return None
|
||||||
|
|
||||||
|
def _contains_successful_patch(debug_log_path: str) -> bool:
|
||||||
|
"""
|
||||||
|
Check if a debug.log file contains a successful patch.
|
||||||
|
|
||||||
|
Reads the last non-empty line, splits on 'result_data: ', parses the dict,
|
||||||
|
and checks if the golden_patch contains non-empty patch_content.
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
with open(debug_log_path, 'r', encoding='utf-8') as f:
|
||||||
|
lines = f.readlines()
|
||||||
|
last_line = None
|
||||||
|
for line in reversed(lines):
|
||||||
|
if line.strip():
|
||||||
|
last_line = line.strip()
|
||||||
|
break
|
||||||
|
if not last_line:
|
||||||
|
return False
|
||||||
|
if 'result_data: ' not in last_line:
|
||||||
|
return False
|
||||||
|
parts = last_line.split('result_data: ', 1)
|
||||||
|
if len(parts) != 2:
|
||||||
|
return False
|
||||||
|
dict_str = parts[1]
|
||||||
|
try:
|
||||||
|
result_data = eval(dict_str)
|
||||||
|
except:
|
||||||
|
return False
|
||||||
|
if not isinstance(result_data, dict):
|
||||||
|
return False
|
||||||
|
golden_patch = result_data.get('golden_patch')
|
||||||
|
if not isinstance(golden_patch, list) or len(golden_patch) == 0:
|
||||||
|
return False
|
||||||
|
first_patch = golden_patch[0]
|
||||||
|
if not isinstance(first_patch, dict):
|
||||||
|
return False
|
||||||
|
patch_content = first_patch.get('patch_content')
|
||||||
|
if not isinstance(patch_content, str) or not patch_content.strip():
|
||||||
|
return False
|
||||||
|
return True
|
||||||
|
except (IOError, UnicodeDecodeError, Exception):
|
||||||
|
return False
|
||||||
Reference in New Issue
Block a user