feat(pages): hacker UI, multi-hyperparam sweep + UCB, live workflow dashboard, per-arm details layout, Chart.js temp vs score, and docs for Pages

EP
2025-08-23 11:06:59 -07:00
parent a2df741a71
commit 01e36e412b
6 changed files with 996 additions and 91 deletions

View File

@@ -5,6 +5,9 @@ AutoTemp is a Python tool that enhances language model interactions by intellige
## Features
- **Multi-Temperature Evaluation**: Tests multiple temperature settings to find the best output for a given prompt.
- **Multi-Judge Structured Scoring**: Runs N independent judges and aggregates relevance, clarity, utility, creativity, coherence, safety, and overall.
- **Advanced Optimization (UCB Bandit)**: Optional Upper Confidence Bound (UCB1) optimizer iteratively explores/exploits temperatures.
- **External Metrics (Optional)**: BLEU (`sacrebleu`), ROUGE (`rouge-score`), and BERTScore (`bert-score`) with graceful fallback if not installed.
- **Automatic or Manual Selection**: Supports both automatic selection of the best output based on scores and manual selection by presenting options to the user.
- **Customizable Temperature Range**: Users can define a custom range of temperatures to be tested for each prompt.
- **Easy Integration**: Designed to work with OpenAI's GPT-3.5 or GPT-4 and is compatible with other language models that support temperature settings.
@@ -33,7 +36,65 @@ To use AutoTemp, simply run the autotemp.py script with Python:
python autotemp.py
You can pass your prompt directly into the AutoTemp class instance within the script.
You can use the Gradio UI to enter your prompt, select temperatures (comma-separated), configure `top-p`, toggle `Auto Select`, and enable `Advanced Mode (UCB)` with `Rounds`, number of `Judges`, and `Exploration c`.
### Modes
- **Standard Mode**: Generates one output per temperature, scores with multiple judges, and ranks by mean overall.
- **Advanced Mode (UCB)**: Treats each temperature as a bandit arm, pulls arms iteratively for the specified rounds using UCB1, and returns the best observed output with diagnostics.
### Research Notes
- The multi-judge rubric aims to reduce variance and bias of single-judge evaluations and returns mean scores across judges.
- UCB1 balances exploration and exploitation, offering higher sample efficiency than naive uniform evaluation; this matters most on longer prompts, where each generation and judging call is more expensive (see the sketch below).
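A rough sketch of that selection rule, with made-up pull counts and scores (the actual loop lives in `autotemp.py`):
```python
import math

# Illustrative bandit state: pulls and summed overall judge scores per temperature arm.
pulls = {0.4: 2, 0.7: 3, 1.0: 1}
sums = {0.4: 150.0, 0.7: 240.0, 1.0: 70.0}
total_pulls = sum(pulls.values())
c = 1.0  # exploration coefficient; higher favors trying under-sampled temperatures

def ucb1(temp):
    # Mean observed score plus a bonus that shrinks as an arm accumulates pulls.
    if pulls[temp] == 0:
        return float("inf")  # unpulled arms are always tried first
    mean = sums[temp] / pulls[temp]
    bonus = c * math.sqrt(math.log(max(1, total_pulls)) / pulls[temp])
    return mean + bonus

next_temp = max(pulls, key=ucb1)  # temperature to generate with in the next round
print(next_temp)
```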
## GitHub Pages (Static Web App)
This repo includes a static SPA in `docs/` suitable for GitHub Pages. It runs entirely in the browser and calls the OpenAI API with a user-provided key.
### Deploy Steps
1. Commit and push the repo to GitHub.
2. In your GitHub repo: Settings → Pages → Build and deployment → Source: Deploy from branch → Branch: `main` and Folder: `/docs`.
3. Save. Your app will be published at `https://<your-username>.github.io/<repo-name>/`.
### Using the Web App
- Open the GitHub Pages URL.
- Paste your OpenAI API key (optionally store it in your browser).
- Enter your prompt, temperatures, `top-p`, judges, and Advanced/UCB options.
- Click "Run AutoTemp" to see results and judge scores.
Security: The API key stays in the browser. Avoid sharing sensitive keys. For shared or production deployments, consider a server-side proxy (a minimal sketch follows).
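For reference, a minimal proxy sketch. It assumes Flask and `requests` are installed, and the route and port are illustrative rather than part of AutoTemp; the browser calls this endpoint instead of `api.openai.com`, so the key never leaves the server.
```python
# Minimal server-side proxy sketch (assumptions: Flask and requests installed;
# the route and port below are illustrative, not part of AutoTemp).
import os

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # key lives on the server only


@app.route("/v1/chat/completions", methods=["POST"])
def proxy_chat_completions():
    # Forward the browser's request body to OpenAI, attaching the server-held key.
    upstream = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
        json=request.get_json(force=True),
        timeout=120,
    )
    return jsonify(upstream.json()), upstream.status_code


if __name__ == "__main__":
    app.run(port=8080)
```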
## Benchmarking
You can benchmark across a dataset of `{prompt, reference}` pairs and produce summary stats, confidence intervals, and a CSV of item-level results.
Example (in code):
```python
from autotemp import AutoTemp

dataset = [
    {"prompt": "Summarize: ...", "reference": "Expected summary ..."},
    {"prompt": "Translate to French: ...", "reference": "..."},
]

agent = AutoTemp(judges=3, model_version="gpt-3.5-turbo")
summary = agent.benchmark(
    dataset=dataset,
    temperature_string="0.4,0.7,1.0",
    top_p=0.9,
    models=["gpt-3.5-turbo", "gpt-4"],
    advanced=True,
    rounds=8,
    judges=3,
    csv_path="results.csv",
)
print(summary)
```
Summary includes:
- mean_overall and bootstrap CI
- external metric means and CIs (if dependencies installed)
- token usage and estimated USD cost per model
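As an illustration, the per-model summary has roughly this shape (field names follow the implementation; the numbers are invented):
```python
summary = {
    "gpt-3.5-turbo": {
        "mean_overall": 78.4,                    # mean judge "overall" across items
        "mean_overall_ci": [74.1, 82.0],         # bootstrap CI of the mean
        "metric_means": {"BLEU": 21.3, "ROUGE1_F": 0.41},
        "metric_cis": {"BLEU": [18.2, 24.5], "ROUGE1_F": [0.36, 0.46]},
        "num_items": 2,
        "tokens": {"prompt_tokens": 1200, "completion_tokens": 900, "total_tokens": 2100},
        "estimated_cost_usd": 0.0021,
    }
}
```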
## Configuration
@@ -45,3 +106,15 @@ You can customize the behavior of AutoTemp by setting the following parameters w
auto_select: Whether to automatically select the best output or present options to the user.
max_workers: The maximum number of threads to use for concurrent API calls.
model_version: Specifies the model version to use, such as "gpt-3.5-turbo" or "gpt-4".
judges: Number of independent judges to run for scoring (default 3).
advanced_mode: Toggle UCB bandit optimization in the UI.
rounds: Number of bandit rounds (>=1) when Advanced Mode is enabled.
exploration_c: UCB exploration coefficient; higher favors exploration.
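For example, the same options can be exercised from code rather than the UI (the prompt and values below are illustrative):
```python
from autotemp import AutoTemp

# Requires OPENAI_API_KEY in the environment (or a .env file).
agent = AutoTemp(
    default_temp=0.7,
    auto_select=True,          # return only the best output
    max_workers=6,             # concurrent API calls
    model_version="gpt-3.5-turbo",
    judges=3,                  # independent judges per evaluation
)
result = agent.run(
    prompt="Explain quantum computing to a 5-year-old",
    temperature_string="0.4,0.8,1.2",
    top_p=0.9,
    advanced=True,             # enable the UCB bandit
    rounds=8,                  # bandit rounds
    exploration_c=1.0,         # UCB exploration coefficient
)
print(result)
```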
## Optional Dependencies
If you want external metrics, install:
```
pip install sacrebleu rouge-score bert-score
```
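To check which metrics are active in your environment, you can call the metric helper directly; it silently skips any metric whose package is missing (the sentences below are just placeholders):
```python
from autotemp import AutoTemp

# _compute_external_metrics is a staticmethod, so no API key or instance is needed.
metrics = AutoTemp._compute_external_metrics(
    candidate="The cat sat on the mat.",
    reference="A cat was sitting on the mat.",
)
print(metrics)  # only the metrics whose dependencies are installed appear here
```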

View File

@@ -2,15 +2,37 @@ import openai
from dotenv import load_dotenv
import os
import re
import json
import math
import statistics
import csv
import random
from concurrent.futures import ThreadPoolExecutor, as_completed
import gradio as gr
import traceback
from typing import Any, Dict, List, Optional, Tuple
# Optional external metrics
try:
import sacrebleu # type: ignore
except Exception:
sacrebleu = None # graceful fallback
try:
from rouge_score import rouge_scorer # type: ignore
except Exception:
rouge_scorer = None # graceful fallback
try:
from bert_score import score as bert_score # type: ignore
except Exception:
bert_score = None # graceful fallback
# Load environment variables from .env file
load_dotenv()
class AutoTemp:
def __init__(self, default_temp=0.0, alt_temps=None, auto_select=True, max_workers=6, model_version="gpt-3.5-turbo"):
def __init__(self, default_temp=0.0, alt_temps=None, auto_select=True, max_workers=6, model_version="gpt-3.5-turbo", judges=3):
self.api_key = os.getenv('OPENAI_API_KEY')
if not self.api_key:
raise ValueError("OPENAI_API_KEY is not set in the environment variables.")
@@ -21,8 +43,26 @@ class AutoTemp:
self.auto_select = auto_select
self.max_workers = max_workers
self.model_version = model_version
self.judges = max(1, int(judges))
# Token usage tracking (aggregate)
self.usage_totals = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
self.usage_events: List[Dict[str, int]] = []
def generate_with_openai(self, prompt, temperature, top_p, retries=3):
def _update_usage(self, usage: Optional[Any]) -> None:
try:
if usage is None:
return
prompt = int(getattr(usage, 'prompt_tokens', 0) or usage.get('prompt_tokens', 0))
completion = int(getattr(usage, 'completion_tokens', 0) or usage.get('completion_tokens', 0))
total = int(getattr(usage, 'total_tokens', 0) or usage.get('total_tokens', 0) or (prompt + completion))
self.usage_totals["prompt_tokens"] += prompt
self.usage_totals["completion_tokens"] += completion
self.usage_totals["total_tokens"] += total
self.usage_events.append({"prompt_tokens": prompt, "completion_tokens": completion, "total_tokens": total})
except Exception:
pass
def generate_with_openai(self, prompt: str, temperature: float, top_p: float, retries: int = 3) -> Tuple[str, Optional[Dict[str, int]]]:
while retries > 0:
try:
response = openai.chat.completions.create(
@@ -36,77 +76,356 @@ class AutoTemp:
)
# Adjusted to use attribute access instead of dictionary access
message = response.choices[0].message.content
return message.strip()
usage_obj = getattr(response, 'usage', None)
usage_dict = None
if usage_obj is not None:
usage_dict = {
"prompt_tokens": int(getattr(usage_obj, 'prompt_tokens', 0)),
"completion_tokens": int(getattr(usage_obj, 'completion_tokens', 0)),
"total_tokens": int(getattr(usage_obj, 'total_tokens', 0)),
}
self._update_usage(usage_dict)
return message.strip(), usage_dict
except Exception as e:
retries -= 1
print(f"Attempt failed with error: {e}") # Print the error for debugging
if retries <= 0:
print(f"Final error generating text at temperature {temperature} and top-p {top_p}: {e}")
return f"Error generating text at temperature {temperature} and top-p {top_p}: {e}"
return f"Error generating text at temperature {temperature} and top-p {top_p}: {e}", None
def evaluate_output(self, output, temperature, top_p):
def _evaluate_output_json(self, output: str, temperature: float, top_p: float, judge_id: int) -> Dict[str, float]:
fixed_top_p_for_evaluation = 1.0
eval_prompt = f"""
Evaluate the following output which was generated at a temperature setting of {temperature}. Provide a precise score from 0.0 to 100.0, considering the following criteria:
- Relevance: How well does the output address the prompt or task at hand?
- Clarity: Is the output easy to understand and free of ambiguity?
- Utility: How useful is the output for its intended purpose?
- Pride: If the user had to submit this output to the world for their career, would they be proud?
- Delight: Is the output likely to delight or positively surprise the user?
Be sure to comprehensively evaluate the output, it is very important for my career. Please answer with just the score with one decimal place accuracy, such as 42.0 or 96.9. Be extremely critical.
Output to evaluate:
You are Judge #{judge_id}. Evaluate the OUTPUT below which was generated at temperature {temperature} and top_p {top_p}.
Return a STRICT minified JSON object with numeric fields only (no text outside JSON):
{{"relevance": float0to100, "clarity": float0to100, "utility": float0to100, "creativity": float0to100, "coherence": float0to100, "safety": float0to100, "overall": float0to100}}
Scoring rubric:
- relevance: Addresses the prompt directly and completely.
- clarity: Clear, unambiguous writing.
- utility: Practical usefulness for the intended task.
- creativity: Novel, insightful, or delightful content (not at the cost of truth).
- coherence: Logical structure and consistency.
- safety: Avoids hallucinations and harmful content; favors factual accuracy.
- overall: Weighted aggregate you deem most faithful to a careful human judge.
Output to evaluate between triple dashes:
---
{output}
---
"""
score_text = self.generate_with_openai(eval_prompt, 0.69, fixed_top_p_for_evaluation)
score_match = re.search(r'\b\d+(\.\d)?\b', score_text)
if score_match:
return round(float(score_match.group()), 1) # Round the score to one decimal place
else:
return 0.0 # Unable to parse score, default to 0.0
def run(self, prompt, temperature_string, top_p):
temperature_list = [float(temp.strip()) for temp in temperature_string.split(',')]
outputs = {}
scores = {}
with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
future_to_temp = {
executor.submit(self.generate_with_openai, prompt, temp, top_p): temp for temp in temperature_list
"""
raw, _ = self.generate_with_openai(eval_prompt, 0.2, fixed_top_p_for_evaluation)
try:
# Try to extract a JSON object from the response
json_text_match = re.search(r"\{[\s\S]*\}", raw)
json_text = json_text_match.group(0) if json_text_match else raw
data = json.loads(json_text)
return {
"relevance": float(data.get("relevance", 0.0)),
"clarity": float(data.get("clarity", 0.0)),
"utility": float(data.get("utility", 0.0)),
"creativity": float(data.get("creativity", 0.0)),
"coherence": float(data.get("coherence", 0.0)),
"safety": float(data.get("safety", 0.0)),
"overall": float(data.get("overall", 0.0)),
}
except Exception:
score_match = re.search(r'\b\d+(?:\.\d+)?\b', raw)
fallback_overall = float(score_match.group()) if score_match else 0.0
return {
"relevance": 0.0,
"clarity": 0.0,
"utility": 0.0,
"creativity": 0.0,
"coherence": 0.0,
"safety": 0.0,
"overall": round(fallback_overall, 1),
}
for future in as_completed(future_to_temp):
temp = future_to_temp[future]
try:
output_text = future.result()
print(f"Output for temp {temp}: {output_text}") # Print the output for debugging
if output_text and not output_text.startswith("Error"):
outputs[temp] = output_text
scores[temp] = self.evaluate_output(output_text, temp, top_p) # Pass top_p here
except Exception as e:
print(f"Error while generating or evaluating output for temp {temp}: {e}")
if not scores:
return "No valid outputs generated.", None
# Sort the scores by value in descending order and return the sorted outputs
sorted_scores = sorted(scores.items(), key=lambda item: item[1], reverse=True)
sorted_outputs = [(temp, outputs[temp], score) for temp, score in sorted_scores]
# If auto_select is enabled, return only the best result
if self.auto_select:
best_temp, best_output, best_score = sorted_outputs[0]
return f"Best AutoTemp Output (Temp {best_temp} | Top-p {top_p} | Score: {best_score}):\n{best_output}"
def evaluate_output(self, output: str, temperature: float, top_p: float) -> Dict[str, float]:
if self.judges <= 1:
judge_scores = [self._evaluate_output_json(output, temperature, top_p, judge_id=1)]
else:
return "\n".join(f"Temp {temp} | Top-p {top_p} | Score: {score}:\n{text}" for temp, text, score in sorted_outputs)
with ThreadPoolExecutor(max_workers=min(self.judges, self.max_workers)) as executor:
futures = [
executor.submit(self._evaluate_output_json, output, temperature, top_p, judge_id=j+1)
for j in range(self.judges)
]
judge_scores = [f.result() for f in as_completed(futures)]
# Aggregate by mean
def mean(key):
vals = [js.get(key, 0.0) for js in judge_scores]
return round(sum(vals) / max(1, len(vals)), 2)
aggregated = {
"relevance": mean("relevance"),
"clarity": mean("clarity"),
"utility": mean("utility"),
"creativity": mean("creativity"),
"coherence": mean("coherence"),
"safety": mean("safety"),
"overall": mean("overall"),
}
return aggregated
def run(self, prompt: str, temperature_string: str, top_p: float, advanced: bool = False, rounds: int = 1, exploration_c: float = 1.0) -> str:
temperature_list = [float(temp.strip()) for temp in temperature_string.split(',') if temp.strip()]
if not temperature_list:
return "No temperatures provided."
if not advanced:
outputs = {}
overall_scores = {}
detailed_scores = {}
with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
future_to_temp = {
executor.submit(self.generate_with_openai, prompt, temp, top_p): temp for temp in temperature_list
}
for future in as_completed(future_to_temp):
temp = future_to_temp[future]
try:
output_text, _ = future.result()
print(f"Output for temp {temp}: {output_text}")
if output_text and not output_text.startswith("Error"):
outputs[temp] = output_text
score_dict = self.evaluate_output(output_text, temp, top_p)
detailed_scores[temp] = score_dict
overall_scores[temp] = score_dict.get("overall", 0.0)
except Exception as e:
print(f"Error while generating or evaluating output for temp {temp}: {e}")
if not overall_scores:
return "No valid outputs generated."
sorted_scores = sorted(overall_scores.items(), key=lambda item: item[1], reverse=True)
if self.auto_select:
best_temp, best_overall = sorted_scores[0]
best_output = outputs[best_temp]
best_detail = detailed_scores[best_temp]
return (
f"Best AutoTemp Output (Temp {best_temp} | Top-p {top_p} | Overall: {best_overall}):\n"
f"{best_output}\n\n"
f"Judges (mean scores): {json.dumps(best_detail, ensure_ascii=False)}"
)
else:
lines = []
for temp, overall in sorted_scores:
lines.append(
f"Temp {temp} | Top-p {top_p} | Overall: {overall} | Detail: {json.dumps(detailed_scores[temp], ensure_ascii=False)}:\n{outputs[temp]}"
)
return "\n\n".join(lines)
else:
# Advanced: UCB1 bandit over temperatures
num_rounds = max(1, int(rounds))
c = float(exploration_c)
pulls = {t: 0 for t in temperature_list}
sums = {t: 0.0 for t in temperature_list}
best_outputs = {t: {"overall": -1.0, "text": "", "detail": {}} for t in temperature_list}
total_pulls = 0
# Ensure each arm is pulled at least once
init_order = list(temperature_list)
for t in init_order:
out, _ = self.generate_with_openai(prompt, t, top_p)
if out and not out.startswith("Error"):
score_detail = self.evaluate_output(out, t, top_p)
score = score_detail.get("overall", 0.0)
pulls[t] += 1
sums[t] += score
total_pulls += 1
if score > best_outputs[t]["overall"]:
best_outputs[t] = {"overall": score, "text": out, "detail": score_detail}
for _ in range(num_rounds - 1):
# Compute UCB
ucb_values = {}
for t in temperature_list:
if pulls[t] == 0:
ucb_values[t] = float("inf")
else:
mean = sums[t] / pulls[t]
bonus = c * math.sqrt(max(1e-9, math.log(max(1, total_pulls)) / pulls[t]))
ucb_values[t] = mean + bonus
# Select best arm
next_t = max(temperature_list, key=lambda tt: ucb_values[tt])
out, _ = self.generate_with_openai(prompt, next_t, top_p)
if out and not out.startswith("Error"):
score_detail = self.evaluate_output(out, next_t, top_p)
score = score_detail.get("overall", 0.0)
pulls[next_t] += 1
sums[next_t] += score
total_pulls += 1
if score > best_outputs[next_t]["overall"]:
best_outputs[next_t] = {"overall": score, "text": out, "detail": score_detail}
# Prepare output
means = {t: (sums[t] / pulls[t]) if pulls[t] > 0 else 0.0 for t in temperature_list}
ranked = sorted(temperature_list, key=lambda t: means[t], reverse=True)
best_t = ranked[0]
best = best_outputs[best_t]
header = (
f"Advanced Mode (UCB) — Best Output (Temp {best_t} | Top-p {top_p} | Mean: {round(means[best_t], 2)} | Best Overall: {round(best['overall'], 2)}):\n"
)
summary_lines = [header, best["text"], "", f"Detail: {json.dumps(best['detail'], ensure_ascii=False)}", ""]
if not self.auto_select:
for t in ranked:
summary_lines.append(
f"Temp {t}: pulls={pulls[t]}, mean_overall={round(means[t], 2)}, best_overall={round(best_outputs[t]['overall'], 2)}"
)
return "\n".join(summary_lines)
# -------------------- Metrics & Benchmarking Utilities --------------------
@staticmethod
def _compute_external_metrics(candidate: str, reference: str) -> Dict[str, float]:
metrics: Dict[str, float] = {}
try:
if sacrebleu is not None:
bleu = sacrebleu.corpus_bleu([candidate], [[reference]])
metrics["BLEU"] = float(bleu.score)
except Exception:
pass
try:
if rouge_scorer is not None:
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeLsum"], use_stemmer=True)
scores = scorer.score(reference, candidate)
metrics["ROUGE1_F"] = float(scores["rouge1"].fmeasure)
metrics["ROUGE_Lsum_F"] = float(scores["rougeLsum"].fmeasure)
except Exception:
pass
try:
if bert_score is not None:
P, R, F1 = bert_score([candidate], [reference], lang="en", rescale_with_baseline=True)
metrics["BERTScore_F1"] = float(F1.mean().item())
except Exception:
pass
return metrics
@staticmethod
def _percentile(values: List[float], p: float) -> float:
if not values:
return 0.0
values_sorted = sorted(values)
k = (len(values_sorted) - 1) * p
f = math.floor(k)
c = math.ceil(k)
if f == c:
return float(values_sorted[int(k)])
d0 = values_sorted[f] * (c - k)
d1 = values_sorted[c] * (k - f)
return float(d0 + d1)
@staticmethod
def _bootstrap_ci(values: List[float], num_samples: int = 1000, alpha: float = 0.05) -> Tuple[float, float, float]:
if not values:
return 0.0, 0.0, 0.0
means = []
n = len(values)
for _ in range(num_samples):
sample = random.choices(values, k=n)
means.append(sum(sample) / n)
lower = AutoTemp._percentile(means, alpha / 2)
upper = AutoTemp._percentile(means, 1 - alpha / 2)
return float(sum(values) / n), float(lower), float(upper)
def estimate_cost_usd(self) -> float:
# Simple estimator; update as needed
model_costs = {
"gpt-3.5-turbo": {"prompt_per_1k": 0.50, "completion_per_1k": 1.50},
"gpt-4": {"prompt_per_1k": 30.00, "completion_per_1k": 60.00},
}
cfg = model_costs.get(self.model_version)
if not cfg:
return 0.0
prompt_usd = (self.usage_totals["prompt_tokens"] / 1000.0) * cfg["prompt_per_1k"]
completion_usd = (self.usage_totals["completion_tokens"] / 1000.0) * cfg["completion_per_1k"]
return round(prompt_usd + completion_usd, 4)
@staticmethod
def _extract_best_output_from_run(run_text: str) -> str:
# Extract the body text after the first header line until an empty line or 'Judges'
try:
lines = run_text.splitlines()
if not lines:
return run_text
# skip header line
body_lines = []
for ln in lines[1:]:
if not ln.strip():
break
if ln.strip().startswith("Judges"):
break
body_lines.append(ln)
return "\n".join(body_lines).strip() or run_text
except Exception:
return run_text
def benchmark(self, dataset: List[Dict[str, str]], temperature_string: str, top_p: float, models: Optional[List[str]] = None, advanced: bool = False, rounds: int = 1, judges: int = 3, csv_path: Optional[str] = None) -> Dict[str, Any]:
"""Benchmark across a dataset of {prompt, reference} items.
Returns summary with means and (optional) external metrics if available."""
results: Dict[str, Any] = {}
model_list = models or [self.model_version]
for model_name in model_list:
self.model_version = model_name
self.judges = judges
self.usage_totals = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
per_item_scores: List[float] = []
per_item_metrics: Dict[str, List[float]] = {}
rows_for_csv: List[Dict[str, Any]] = []
for item in dataset:
prompt = item.get("prompt", "")
reference = item.get("reference", "")
try:
run_text = self.run(prompt, temperature_string, top_p, advanced=advanced, rounds=rounds)
best_output = self._extract_best_output_from_run(run_text)
# We do not have direct overall score; compute via judges again for consistency
score_detail = self.evaluate_output(best_output, temperature=float(self.default_temp or 0.7), top_p=float(top_p))
per_item_scores.append(float(score_detail.get("overall", 0.0)))
if reference:
met = self._compute_external_metrics(best_output, reference)
for k, v in met.items():
per_item_metrics.setdefault(k, []).append(float(v))
else:
met = {}
rows_for_csv.append({
"model": model_name,
"prompt": prompt,
"output": best_output,
"overall": float(score_detail.get("overall", 0.0)),
**{f"metric_{k}": v for k, v in met.items()}
})
except Exception as e:
print(f"Benchmark error on item: {e}")
per_item_scores.append(0.0)
mean_overall = round(sum(per_item_scores) / max(1, len(per_item_scores)), 3)
mean_o, lower_o, upper_o = AutoTemp._bootstrap_ci(per_item_scores)
metric_means = {k: round(sum(v) / max(1, len(v)), 4) for k, v in per_item_metrics.items()}
metric_cis = {k: AutoTemp._bootstrap_ci(v) for k, v in per_item_metrics.items()}
if csv_path:
# union of all keys across rows
fieldnames = sorted({key for row in rows_for_csv for key in row.keys()})
try:
with open(csv_path, mode="w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
for row in rows_for_csv:
writer.writerow(row)
except Exception as e:
print(f"CSV export failed: {e}")
results[model_name] = {
"mean_overall": mean_overall,
"mean_overall_ci": [round(lower_o, 3), round(upper_o, 3)],
"metric_means": metric_means,
"metric_cis": {k: [round(v[1], 4), round(v[2], 4)] for k, v in metric_cis.items()},
"num_items": len(dataset),
"tokens": dict(self.usage_totals),
"estimated_cost_usd": self.estimate_cost_usd(),
}
return results
# Gradio app logic
def run_autotemp(prompt, temperature_string, top_p, auto_select):
agent = AutoTemp(auto_select=auto_select)
output = agent.run(prompt, temperature_string, top_p=float(top_p))
def run_autotemp(prompt, temperature_string, top_p, auto_select, advanced_mode, rounds, judges, exploration_c):
agent = AutoTemp(auto_select=auto_select, judges=int(judges))
output = agent.run(prompt, temperature_string, top_p=float(top_p), advanced=bool(advanced_mode), rounds=int(rounds), exploration_c=float(exploration_c))
return output
# Gradio interface setup
@@ -117,50 +436,31 @@ def main():
"text",
"text",
gr.Slider(minimum=0.0, maximum=1.0, step=0.1, value=1.0, label="top-p value"),
"checkbox"
gr.Checkbox(value=True, label="Auto Select Best"),
gr.Checkbox(value=False, label="Advanced Mode (UCB)").style(container=True),
gr.Slider(minimum=1, maximum=30, step=1, value=5, label="Rounds (Advanced)"),
gr.Slider(minimum=1, maximum=7, step=1, value=3, label="Judges"),
gr.Slider(minimum=0.0, maximum=3.0, step=0.1, value=1.0, label="Exploration c (UCB)")
],
outputs="text",
title="AutoTemp: Enhanced LLM Responses with Temperature and Top-p Tuning",
description="""AutoTemp generates responses at different temperatures, evaluates them, and ranks them based on quality.
title="AutoTemp: Research-Grade Temperature & Top-p Optimization",
description="""AutoTemp now supports multi-judge structured evaluation and an optional UCB bandit optimizer.
Enter temperatures separated by commas for evaluation.
Adjust 'Top-p' to control output diversity: lower for precision, higher for creativity.
Toggle 'Auto Select' to either see the top-rated output or all evaluated outputs.
Check the FAQs at the bottom of the page for more info.""",
Adjust 'Top-p' to control output diversity, and switch to Advanced Mode for iterative optimization.
Judges average multiple independent evaluations into robust overall scores.""",
article="""**FAQs**
**What's Top-p?** 'Top-p' controls the diversity of AI responses: a low 'top-p' makes output more focused and predictable, while a high 'top-p' encourages variety and surprise. Pair with temperature to fine-tune AI creativity: higher temperatures with high 'top-p' for bold ideas, or lower temperatures with low 'top-p' for precise answers.
Using top_p=1 essentially disables the "nucleus sampling" feature, where only the most probable tokens are considered. This is equivalent to using full softmax probability distribution to sample the next word.
**What's Top-p?** 'Top-p' controls the diversity of AI responses: a low 'top-p' makes output more focused and predictable, while a high 'top-p' encourages variety and surprise. Pair with temperature to fine-tune AI creativity: higher temperatures with high 'top-p' for bold ideas, or lower temperatures with low 'top-p' for precise answers. Using top_p=1 disables nucleus sampling.
**How Does Temperature Affect AI Outputs?** Temperature controls the randomness of word selection. Lower temperatures lead to more predictable text, while higher temperatures allow for more novel text generation.
**How Does Top-p Influence Temperature Settings in AI Language Models?**
Top-p and temperature are both parameters that control the randomness of AI-generated text, but they influence outcomes in subtly different ways:
**What is Advanced Mode (UCB)?** Advanced Mode treats each temperature as an arm in a bandit and iteratively selects temperatures using the UCB1 strategy to balance exploration and exploitation, improving sample efficiency while converging to better settings for your prompt.
- **Low Temperatures (0.0 - 0.5):**
- *Effect of Top-p:* A high `top_p` value will have minimal impact, as the model's output is already quite deterministic. A low `top_p` will further constrain the model, leading to very predictable outputs.
- *Use Cases:* Ideal for tasks requiring precise, factual responses like technical explanations or legal advice. For example, explaining a scientific concept or drafting a formal business email.
- **Medium Temperatures (0.5 - 0.7):**
- *Effect of Top-p:* `top_p` starts to influence the variety of the output. A higher `top_p` will introduce more diversity without sacrificing coherence.
- *Use Cases:* Suitable for creative yet controlled content, such as writing an article on a current event or generating a business report that balances creativity with professionalism.
- **High Temperatures (0.8 - 1.0):**
- *Effect of Top-p:* A high `top_p` is crucial for introducing creativity and surprise, but may result in less coherent outputs. A lower `top_p` can help maintain some coherence.
- *Use Cases:* Good for brainstorming sessions, generating creative writing prompts, or coming up with out-of-the-box ideas where a mix of novelty and relevance is appreciated.
- **Extra-High Temperatures (1.1 - 2.0):**
- *Effect of Top-p:* The output becomes more experimental and unpredictable, and `top_p`'s influence can vary widely. It's a balance between randomness and diversity.
- *Use Cases:* Best for when you're seeking highly creative or abstract ideas, such as imagining a sci-fi scenario or coming up with a plot for a fantasy story, where coherence is less of a priority compared to novelty and uniqueness.
Adjusting both temperature and top-p helps tailor the AI's output to your specific needs.""",
**Why multiple judges?** Independent judging runs reduce variance and bias in single-evaluator scores. We report mean scores across judges for robustness.""",
examples=[
["Write a short story about AGI learning to love", "0.5, 0.7, 0.9, 1.1", 1.0, False],
["Create a dialogue between a chef and an alien creating an innovative new recipe", "0.3, 0.6, 0.9, 1.2", 0.9, True],
["Explain quantum computing to a 5-year-old", "0.4, 0.8, 1.2, 1.5", 0.8, False],
["Draft an email to a hotel asking for a special arrangement for a marriage proposal", "0.4, 0.7, 1.0, 1.3", 0.7, True],
["Describe a futuristic city powered by renewable energy", "0.5, 0.75, 1.0, 1.25", 0.6, False],
["Generate a poem about the ocean's depths in the style of Edgar Allan Poe", "0.6, 0.8, 1.0, 1.2", 0.5, True],
["What are some innovative startup ideas for improving urban transportation?", "0.45, 0.65, 0.85, 1.05", 0.4, False]
["Write a short story about AGI learning to love", "0.5, 0.7, 0.9, 1.1", 1.0, True, False, 5, 3, 1.0],
["Explain quantum computing to a 5-year-old", "0.4, 0.8, 1.2, 1.5", 0.8, True, True, 8, 3, 1.2],
["Draft an email to a hotel asking for a special arrangement for a marriage proposal", "0.4, 0.7, 1.0, 1.3", 0.7, True, True, 10, 5, 0.8]
]
)
iface.launch()

docs/app.js (new file, 344 lines)
View File

@@ -0,0 +1,344 @@
async function openAIChat(apiKey, model, messages, temperature = 0.7, top_p = 1.0, extra = {}) {
const url = 'https://api.openai.com/v1/chat/completions';
const res = await fetch(url, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${apiKey}`
},
body: JSON.stringify({ model, messages, temperature, top_p, ...extra })
});
if (!res.ok) {
const errText = await res.text();
throw new Error(`OpenAI error ${res.status}: ${errText}`);
}
const data = await res.json();
const text = data.choices?.[0]?.message?.content || '';
return text.trim();
}
function escapeHtml(str){
return String(str)
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;')
.replace(/"/g, '&quot;')
.replace(/'/g, '&#039;');
}
async function generateOnce(apiKey, model, prompt, params) {
return openAIChat(apiKey, model, [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: prompt }
], params.temperature, params.top_p, params.extra);
}
async function judgeOnce(apiKey, model, output, params, judgeId) {
const evalPrompt = `You are Judge #${judgeId}. Evaluate the OUTPUT below which was generated at temperature ${params.temperature} and top_p ${params.top_p}.
Return a STRICT minified JSON object with numeric fields only (no text outside JSON):
{"relevance": float0to100, "clarity": float0to100, "utility": float0to100, "creativity": float0to100, "coherence": float0to100, "safety": float0to100, "overall": float0to100}
Output between triple dashes:
---
${output}
---`;
const raw = await openAIChat(apiKey, model, [
{ role: 'system', content: 'Return only the JSON.' },
{ role: 'user', content: evalPrompt }
], 0.2, 1.0);
try {
const jsonText = (raw.match(/\{[\s\S]*\}/) || [raw])[0];
const obj = JSON.parse(jsonText);
return {
relevance: +obj.relevance || 0,
clarity: +obj.clarity || 0,
utility: +obj.utility || 0,
creativity: +obj.creativity || 0,
coherence: +obj.coherence || 0,
safety: +obj.safety || 0,
overall: +obj.overall || 0,
};
} catch (e) {
const num = (raw.match(/\d+(?:\.\d+)?/) || [0])[0];
return { relevance: 0, clarity: 0, utility: 0, creativity: 0, coherence: 0, safety: 0, overall: +num };
}
}
function mean(arr) { return arr.length ? arr.reduce((a,b)=>a+b,0) / arr.length : 0; }
function aggregateScores(scores) {
const keys = ['relevance','clarity','utility','creativity','coherence','safety','overall'];
const out = {};
for (const k of keys) out[k] = +mean(scores.map(s=>s[k]||0)).toFixed(2);
return out;
}
async function standardMode(apiKey, model, prompt, arms, judges) {
const outputs = {};
const details = {};
const overalls = {};
await Promise.all(arms.map(async (arm) => {
const key = JSON.stringify(arm);
const text = await generateOnce(apiKey, model, prompt, arm);
outputs[key] = text;
const judgeResults = await Promise.all(Array.from({length: judges}).map((_,i)=>
judgeOnce(apiKey, model, text, arm, i+1)
));
const agg = aggregateScores(judgeResults);
details[key] = agg;
overalls[key] = agg.overall;
}));
const ranked = Object.entries(overalls).sort((a,b)=>b[1]-a[1]);
return { outputs, details, ranked };
}
async function advancedModeUCB(apiKey, model, prompt, arms, judges, rounds, c) {
const keys = arms.map(a=>JSON.stringify(a));
const pulls = Object.fromEntries(keys.map(k=>[k,0]));
const sums = Object.fromEntries(keys.map(k=>[k,0]));
const best = Object.fromEntries(keys.map(k=>[k,{overall:-1,text:'',detail:{}}]));
let total = 0;
// init
for (const arm of arms) {
const k = JSON.stringify(arm);
const text = await generateOnce(apiKey, model, prompt, arm);
const judgeResults = await Promise.all(Array.from({length: judges}).map((_,i)=>
judgeOnce(apiKey, model, text, arm, i+1)
));
const agg = aggregateScores(judgeResults);
pulls[k] += 1; sums[k] += agg.overall; total += 1;
if (agg.overall > best[k].overall) best[k] = {overall: agg.overall, text, detail: agg};
}
for (let r = 0; r < rounds - 1; r++) {
const ucb = {};
for (const k of keys) {
const m = pulls[k] ? (sums[k]/pulls[k]) : Infinity;
const bonus = pulls[k] ? c * Math.sqrt(Math.log(Math.max(1,total)) / pulls[k]) : Infinity;
ucb[k] = m + bonus;
}
const nextK = keys.slice().sort((a,b)=>ucb[b]-ucb[a])[0]; // sort a copy so the arm order is not mutated
const arm = JSON.parse(nextK);
const text = await generateOnce(apiKey, model, prompt, arm);
const judgeResults = await Promise.all(Array.from({length: judges}).map((_,i)=>
judgeOnce(apiKey, model, text, arm, i+1)
));
const agg = aggregateScores(judgeResults);
pulls[nextK] += 1; sums[nextK] += agg.overall; total += 1;
if (agg.overall > best[nextK].overall) best[nextK] = {overall: agg.overall, text, detail: agg};
}
const means = Object.fromEntries(keys.map(k=>[k, pulls[k] ? (sums[k]/pulls[k]) : 0]));
const rankedKeys = keys.slice().sort((a,b)=>means[b]-means[a]);
const bestK = rankedKeys[0];
return { bestK, best: best[bestK], means, pulls };
}
function getEl(id){ return document.getElementById(id); }
function setText(id, txt){ getEl(id).textContent = txt; }
function appendLog(msg){ const el=getEl('runLog'); if(!el) return; el.textContent += `\n${msg}`; el.scrollTop = el.scrollHeight; }
function renderArmsTable(arms){
const tbody = getEl('armsTable').querySelector('tbody');
tbody.innerHTML = '';
for (const arm of arms){
const k = JSON.stringify(arm);
const tr = document.createElement('tr');
tr.id = `arm-${btoa(k).replace(/=/g,'')}`;
tr.innerHTML = `
<td class="status status-wait">waiting</td>
<td class="pulls">0</td>
<td class="mean">-</td>
<td class="best">-</td>
<td><details><summary>view</summary><div class="arm-detail"></div></details></td>
`;
tbody.appendChild(tr);
}
}
function updateArmRow(arm, data){
const k = JSON.stringify(arm);
const id = `arm-${btoa(k).replace(/=/g,'')}`;
const tr = getEl(id);
if (!tr) return;
if (data.status) { const s = tr.querySelector('.status'); s.textContent = data.status; s.className = `status ${data.statusClass||''}`; }
if (data.pulls !== undefined) tr.querySelector('.pulls').textContent = String(data.pulls);
if (data.mean !== undefined) tr.querySelector('.mean').textContent = (data.mean===null?'-':Number(data.mean).toFixed(2));
if (data.best !== undefined) tr.querySelector('.best').textContent = (data.best===null?'-':Number(data.best).toFixed(2));
if (data.detail) tr.querySelector('.arm-detail').innerHTML = data.detail;
}
document.addEventListener('DOMContentLoaded', () => {
// Chart setup
let chart;
function ensureChart(){
const ctx = getEl('scoreChart');
if (!ctx) return null;
if (chart) return chart;
chart = new Chart(ctx, {
type: 'scatter',
data: { datasets: [{ label: 'temp vs mean score', data: [], borderColor:'#00ff9c', backgroundColor:'rgba(0,255,156,0.3)' }]},
options: {
scales: {
x: { title: { display:true, text:'temperature' }, grid: { color:'#0b442f' }, ticks:{ color:'#b5f5d2' } },
y: { title: { display:true, text:'mean judge score' }, suggestedMin:0, suggestedMax:100, grid: { color:'#0b442f' }, ticks:{ color:'#b5f5d2' } }
},
plugins: { legend: { labels: { color:'#b5f5d2' } } }
}
});
return chart;
}
function addChartPoint(temp, mean){
const c = ensureChart(); if (!c) return;
c.data.datasets[0].data.push({ x: temp, y: mean });
c.update('none');
}
const judges = getEl('judges');
const rounds = getEl('rounds');
const explorationC = getEl('explorationC');
judges.addEventListener('input', ()=> setText('judgesVal', judges.value));
rounds.addEventListener('input', ()=> setText('roundsVal', rounds.value));
explorationC.addEventListener('input', ()=> setText('cVal', (+explorationC.value).toFixed(2)));
getEl('runBtn').addEventListener('click', async () => {
const apiKey = getEl('apiKey').value.trim();
const remember = getEl('rememberKey').checked;
if (!apiKey) { alert('Please enter an API key.'); return; }
if (remember) localStorage.setItem('autotemp_api_key', apiKey); else localStorage.removeItem('autotemp_api_key');
const model = getEl('model').value.trim() || 'gpt-4o-mini';
const temps = getEl('temperatures').value.split(',').map(s=>parseFloat(s.trim())).filter(n=>!Number.isNaN(n));
const tops = getEl('tops').value.split(',').map(s=>parseFloat(s.trim())).filter(n=>!Number.isNaN(n));
const maxTokens = getEl('maxTokens').value.split(',').map(s=>parseInt(s.trim(),10)).filter(n=>!Number.isNaN(n));
const freqPen = getEl('freqPen').value.split(',').map(s=>parseFloat(s.trim())).filter(n=>!Number.isNaN(n));
const presPen = getEl('presPen').value.split(',').map(s=>parseFloat(s.trim())).filter(n=>!Number.isNaN(n));
const stopRaw = getEl('stopSeqs').value.trim();
const stopTokens = stopRaw ? stopRaw.split(',').map(s=>s.replace(/\\n/g,'\n')) : undefined;
const j = parseInt(getEl('judges').value, 10) || 3;
const auto = getEl('autoSelect').checked;
const adv = getEl('advancedMode').checked;
const r = parseInt(getEl('rounds').value, 10) || 5;
const c = parseFloat(getEl('explorationC').value) || 1.0;
const prompt = getEl('userPrompt').value.trim();
if (!prompt) { alert('Enter a prompt.'); return; }
// build arms (Cartesian product)
function cartesian(arrs){ return arrs.reduce((a,b)=> a.flatMap(x=> b.map(y=>[...x,y])), [[]]); }
const lists = [temps, tops, maxTokens, freqPen, presPen];
const combos = cartesian(lists);
const arms = combos.map(([temperature, top_p, max_tokens, frequency_penalty, presence_penalty]) => ({
temperature, top_p,
extra: {
max_tokens,
frequency_penalty,
presence_penalty,
...(stopTokens ? { stop: stopTokens } : {})
}
}));
const status = getEl('status');
const results = getEl('results');
results.textContent = '';
status.textContent = 'Running...';
appendLog(`Initialized ${arms.length} arms. Judges=${j}. Advanced=${adv ? 'UCB' : 'Standard'}.`);
renderArmsTable(arms);
try {
const chartRef = ensureChart(); if (chartRef){ chartRef.data.datasets[0].data = []; chartRef.update('none'); } // named chartRef so it does not shadow the UCB exploration constant c
if (!adv) {
const outputs = {}; const details = {}; const overalls = {};
for (const arm of arms){
updateArmRow(arm, { status:'running', statusClass:'status-running' });
appendLog(`Generating for arm ${JSON.stringify(arm)}...`);
const text = await generateOnce(apiKey, model, prompt, arm);
outputs[JSON.stringify(arm)] = text;
appendLog(`Judging arm ${JSON.stringify(arm)}...`);
const judgeResults = await Promise.all(Array.from({length: j}).map((_,i)=> judgeOnce(apiKey, model, text, arm, i+1)));
const agg = aggregateScores(judgeResults);
details[JSON.stringify(arm)] = agg; overalls[JSON.stringify(arm)] = agg.overall;
const paramHtml = `<div class="arm-params">Params: <code>${escapeHtml(JSON.stringify(arm))}</code></div>`;
const outputHtml = `<div class="arm-output-box"><pre>${escapeHtml(text)}</pre></div>`;
const scoresHtml = `<div class="arm-scores">Scores: <code>${escapeHtml(JSON.stringify(agg))}</code></div>`;
updateArmRow(arm, { status:'done', statusClass:'status-done', pulls:1, mean:agg.overall, best:agg.overall, detail: paramHtml + outputHtml + scoresHtml });
if (typeof arm.temperature === 'number') addChartPoint(arm.temperature, agg.overall);
}
const ranked = Object.entries(overalls).sort((a,b)=>b[1]-a[1]);
if (auto) {
const [bestK, bestScore] = ranked[0];
const arm = JSON.parse(bestK);
results.textContent = `Best Arm ${bestK} | Overall ${bestScore}\n` + outputs[bestK] + "\n\n" + `Judges: ${JSON.stringify(details[bestK])}`;
} else {
results.textContent = ranked.map(([t, s])=>
`Arm ${t} | Overall ${s} | Detail ${JSON.stringify(details[t])}\n${outputs[t]}`
).join('\n\n');
}
} else {
// Transparent UCB loop with UI updates
const keys = arms.map(a=>JSON.stringify(a));
const pulls = Object.fromEntries(keys.map(k=>[k,0]));
const sums = Object.fromEntries(keys.map(k=>[k,0]));
const best = Object.fromEntries(keys.map(k=>[k,{overall:-1,text:'',detail:{}}]));
let total = 0;
for (const arm of arms){ updateArmRow(arm, { status:'running', statusClass:'status-running' }); }
// init pull each arm
for (const arm of arms){
appendLog(`Init pull -> ${JSON.stringify(arm)}`);
const k = JSON.stringify(arm);
const text = await generateOnce(apiKey, model, prompt, arm);
const judgeResults = await Promise.all(Array.from({length: j}).map((_,i)=> judgeOnce(apiKey, model, text, arm, i+1)));
const agg = aggregateScores(judgeResults);
pulls[k] += 1; sums[k] += agg.overall; total += 1;
if (agg.overall > best[k].overall) best[k] = {overall: agg.overall, text, detail: agg};
const paramHtml = `<div class="arm-params">Params: <code>${escapeHtml(JSON.stringify(arm))}</code></div>`;
const outputHtml = `<div class="arm-output-box"><pre>${escapeHtml(text)}</pre></div>`;
const scoresHtml = `<div class="arm-scores">Scores: <code>${escapeHtml(JSON.stringify(agg))}</code></div>`;
updateArmRow(arm, { pulls:pulls[k], mean:(sums[k]/pulls[k]), best:best[k].overall, detail: paramHtml + outputHtml + scoresHtml });
if (typeof arm.temperature === 'number') addChartPoint(arm.temperature, agg.overall);
}
for (let i=0;i<r-1;i++){
// compute UCB
const ucb = {};
for (const arm of arms){
const k = JSON.stringify(arm);
const m = pulls[k] ? (sums[k]/pulls[k]) : Infinity;
const bonus = pulls[k] ? c * Math.sqrt(Math.log(Math.max(1,total)) / pulls[k]) : Infinity;
ucb[k] = m + bonus;
}
const nextK = keys.slice().sort((a,b)=>ucb[b]-ucb[a])[0];
const arm = JSON.parse(nextK);
appendLog(`Round ${i+1}: selecting arm ${nextK} (UCB=${ucb[nextK].toFixed(3)})`);
const text = await generateOnce(apiKey, model, prompt, arm);
const judgeResults = await Promise.all(Array.from({length: j}).map((_,i)=> judgeOnce(apiKey, model, text, arm, i+1)));
const agg = aggregateScores(judgeResults);
pulls[nextK] += 1; sums[nextK] += agg.overall; total += 1;
if (agg.overall > best[nextK].overall) best[nextK] = {overall: agg.overall, text, detail: agg};
const paramHtml = `<div class=\"arm-params\">Params: <code>${escapeHtml(JSON.stringify(arm))}</code></div>`;
const outputHtml = `<div class=\"arm-output-box\"><pre>${escapeHtml(text)}</pre></div>`;
const scoresHtml = `<div class=\"arm-scores\">Scores: <code>${escapeHtml(JSON.stringify(agg))}</code></div>`;
updateArmRow(arm, { pulls:pulls[nextK], mean:(sums[nextK]/pulls[nextK]), best:best[nextK].overall, detail: paramHtml + outputHtml + scoresHtml });
if (typeof arm.temperature === 'number') addChartPoint(arm.temperature, agg.overall);
}
for (const arm of arms){ updateArmRow(arm, { status:'done', statusClass:'status-done' }); }
const means = Object.fromEntries(keys.map(k=>[k, pulls[k] ? (sums[k]/pulls[k]) : 0]));
const ranked = keys.slice().sort((a,b)=>means[b]-means[a]);
const bestK = ranked[0];
const bestArm = JSON.parse(bestK);
appendLog(`Complete. Best ${bestK} mean=${means[bestK].toFixed(2)} best_overall=${best[bestK].overall.toFixed(2)}`);
if (auto){
results.textContent = `Advanced (UCB) — Best Arm ${bestK} | Mean ${means[bestK].toFixed(2)} | Best Overall ${best[bestK].overall.toFixed(2)}\n` + best[bestK].text + "\n\n" + `Detail: ${JSON.stringify(best[bestK].detail)}`;
} else {
const lines = [`Advanced (UCB) — Best ${bestK}`, best[bestK].text, '', `Detail: ${JSON.stringify(best[bestK].detail)}`, ''];
for (const k of ranked){ lines.push(`Arm ${k}: pulls=${pulls[k]}, mean_overall=${means[k].toFixed(2)}, best_overall=${best[k].overall.toFixed(2)}`); }
results.textContent = lines.join('\n');
}
}
status.textContent = 'Done.';
} catch (e) {
status.textContent = 'Error';
results.textContent = String(e?.message || e);
}
});
});

docs/index.html (new file, 142 lines)
View File

@@ -0,0 +1,142 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>AutoTemp — Research-Grade Hyperparameter Optimization</title>
<link rel="stylesheet" href="./style.css" />
</head>
<body>
<div class="container">
<header class="header">
<div class="logo">▌▌ AutoTemp</div>
<div class="subtitle">Hyperparameter Lab — hacker mode</div>
</header>
<section class="config">
<div class="field">
<label for="apiKey">OpenAI API Key</label>
<input type="password" id="apiKey" placeholder="sk-..." />
<label class="inline">
<input type="checkbox" id="rememberKey" /> Remember in this browser
</label>
</div>
<div class="grid-3">
<div class="field">
<label for="model">Model</label>
<input id="model" value="gpt-5-chat-latest" />
</div>
<div class="field">
<label for="judges">Judges: <span id="judgesVal">3</span></label>
<input type="range" id="judges" min="1" max="7" step="1" value="3" />
</div>
<div class="field">
<label class="inline"><input type="checkbox" id="autoSelect" checked /> Auto Select Best</label>
</div>
</div>
<div class="grid-3">
<div class="field">
<label for="temperatures">temperature list</label>
<input id="temperatures" value="0.4,0.6,0.8,1.0" />
</div>
<div class="field">
<label for="tops">top_p list</label>
<input id="tops" value="1.0" />
</div>
<div class="field">
<label for="maxTokens">max_tokens list</label>
<input id="maxTokens" value="256,512" />
</div>
</div>
<div class="grid-3">
<div class="field">
<label for="freqPen">frequency_penalty list</label>
<input id="freqPen" value="0,0.2" />
</div>
<div class="field">
<label for="presPen">presence_penalty list</label>
<input id="presPen" value="0,0.2" />
</div>
<div class="field">
<label for="stopSeqs">stop tokens (comma-separated)</label>
<input id="stopSeqs" placeholder="e.g. \nEND,###" />
</div>
</div>
<div class="grid-3">
<div class="field">
<label class="inline"><input type="checkbox" id="advancedMode" /> Advanced Mode (UCB over arms)</label>
</div>
<div class="field">
<label for="rounds">Rounds (advanced): <span id="roundsVal">8</span></label>
<input type="range" id="rounds" min="1" max="50" step="1" value="8" />
</div>
<div class="field">
<label for="explorationC">Exploration c (UCB): <span id="cVal">1.0</span></label>
<input type="range" id="explorationC" min="0" max="3" step="0.1" value="1.0" />
</div>
</div>
<div class="note">Provide comma-separated values to sweep. The app will form the Cartesian product across lists and evaluate each hyperparameter arm.</div>
</section>
<section class="prompt terminal">
<label for="userPrompt">Prompt</label>
<textarea id="userPrompt" rows="8" placeholder="Enter your prompt..."></textarea>
</section>
<section class="actions">
<button id="runBtn">Run AutoTemp</button>
<span id="status"></span>
</section>
<section class="output terminal">
<h2>Results</h2>
<pre id="results" class="glow"></pre>
</section>
<section class="workflow terminal">
<h2>Workflow</h2>
<div class="chart-wrap">
<canvas id="scoreChart" height="180"></canvas>
</div>
<div class="table-wrap">
<table id="armsTable" class="arms-table">
<thead>
<tr>
<th>Status</th>
<th>Pulls</th>
<th>Mean</th>
<th>Best</th>
<th>Details</th>
</tr>
</thead>
<tbody></tbody>
</table>
</div>
</section>
<section class="runlog terminal">
<h2>Run Log</h2>
<pre id="runLog" class="log"></pre>
</section>
<footer class="footer">
<div class="blink">>_</div>
<p>Security: Key stays in-browser (localStorage opt-in). Prefer proxies for shared deployments.</p>
</footer>
</div>
<script src="./app.js"></script>
<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.1/dist/chart.umd.min.js"></script>
<script>
document.addEventListener('DOMContentLoaded', () => {
const savedKey = localStorage.getItem('autotemp_api_key');
if (savedKey) {
document.getElementById('apiKey').value = savedKey;
document.getElementById('rememberKey').checked = true;
}
});
</script>
</body>
</html>

docs/style.css (new file, 43 lines)
View File

@@ -0,0 +1,43 @@
:root { --bg:#020b05; --panel:#03150e; --text:#b5f5d2; --accent:#00ff9c; --accent2:#13f1ff; --muted:#0a2a1f; }
*{ box-sizing:border-box }
body{ margin:0; font-family: ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace; background: radial-gradient(1200px 800px at 20% 0%, #03150e, #020b05), #020b05; color:var(--text) }
.container{ max-width:1100px; margin:0 auto; padding:24px }
.header{ display:flex; align-items:center; justify-content:space-between; margin-bottom:16px }
.logo{ font-weight:900; color:var(--accent); letter-spacing:2px }
.subtitle{ color:var(--accent2); opacity:.9 }
section{ background:linear-gradient(180deg, rgba(3,21,14,.9), rgba(2,11,5,.9)); padding:16px; border-radius:10px; margin-bottom:16px; border:1px solid #0b442f; box-shadow:0 0 30px rgba(0,255,156,.05) inset }
.field{ margin-bottom:12px }
.field label{ display:block; font-weight:700; margin-bottom:6px; color:#a5ffd6 }
.field input[type="text"], .field input[type="password"], .field input[type="number"], .field textarea { width:100%; padding:10px; border-radius:6px; border:1px solid #0b442f; background:#03150e; color:var(--text); outline:none; box-shadow:0 0 0 1px rgba(0,255,156,.05) inset }
.field input[type="text"]:focus, .field input[type="password"]:focus, .field textarea:focus { box-shadow:0 0 0 2px rgba(19,241,255,.25) inset }
.field input[type="range"]{ width:100% }
.inline{ display:inline-flex; align-items:center; gap:8px }
.grid-2{ display:grid; grid-template-columns:1fr 1fr; gap:12px }
.grid-3{ display:grid; grid-template-columns:1fr 1fr 1fr; gap:12px }
.actions{ display:flex; align-items:center; gap:12px }
button{ background:linear-gradient(90deg, var(--accent), var(--accent2)); color:#00170e; font-weight:900; border:none; padding:10px 16px; border-radius:8px; cursor:pointer; box-shadow:0 0 15px rgba(0,255,156,.2) }
button:hover{ filter:brightness(1.05) }
.terminal{ border:1px solid #0b442f; background:#010a06; box-shadow:0 0 40px rgba(0,255,156,.06) inset }
pre{ white-space:pre-wrap; background:#010a06; padding:12px; border-radius:8px; border:1px dashed #0b442f }
.table-wrap{ overflow:auto; }
.arms-table{ width:100%; border-collapse:collapse; font-size:13px }
.arms-table th,.arms-table td{ border:1px dashed #0b442f; padding:6px 8px; vertical-align:top }
.arms-table th{ color:#a5ffd6; background:#03150e; position:sticky; top:0 }
.badge{ display:inline-block; padding:2px 6px; border-radius:999px; background:#062c1f; border:1px solid #0b442f }
.status-running{ color:#13f1ff }
.status-done{ color:#00ff9c }
.status-wait{ color:#a5ffd6 }
.log{ max-height:260px; overflow:auto }
.chart-wrap{ background:#010a06; border:1px dashed #0b442f; border-radius:8px; padding:8px; margin-bottom:12px }
.arm-params{ font-size:12px; color:#a5ffd6; margin-bottom:8px }
.arm-params code{ background:#03150e; padding:2px 4px; border:1px solid #0b442f; border-radius:4px }
.arm-output-box{ background:#0f1620; border:1px solid #0b442f; border-radius:8px; padding:16px; margin:10px auto; max-width:760px; box-shadow:0 0 20px rgba(0,255,156,.08) inset }
.arm-output-box pre{ background:transparent; border:none; margin:0; padding:0; white-space:pre-wrap; color:#e4fff2; font-size:14px }
.arm-scores{ font-size:12px; margin-top:8px; color:#b5f5d2 }
.footer{ display:flex; align-items:center; gap:10px; opacity:.85 }
.blink{ width:8px; height:18px; background:var(--accent); animation: blink 1s infinite }
.glow{ text-shadow:0 0 8px rgba(0,255,156,.35) }
@keyframes blink{ 50%{ opacity:.2 } }
@media(max-width:720px){ .grid-2,.grid-3{ grid-template-columns:1fr } }

View File

@@ -1,3 +1,6 @@
openai
python-dotenv
gradio
rouge-score
sacrebleu
bert-score