Mirror of https://github.com/elder-plinius/AutoTemp.git (synced 2026-02-12 17:22:46 +00:00)
feat(pages): hacker UI, multi-hyperparam sweep + UCB, live workflow dashboard, per-arm details layout, Chart.js temp vs score, and docs for Pages
75 README.md
@@ -5,6 +5,9 @@ AutoTemp is a Python tool that enhances language model interactions by intellige

## Features

- **Multi-Temperature Evaluation**: Tests multiple temperature settings to find the best output for a given prompt.
- **Multi-Judge Structured Scoring**: Runs N independent judges and aggregates relevance, clarity, utility, creativity, coherence, safety, and overall scores.
- **Advanced Optimization (UCB Bandit)**: An optional Upper Confidence Bound (UCB1) optimizer iteratively explores and exploits temperatures.
- **External Metrics (Optional)**: BLEU (`sacrebleu`), ROUGE (`rouge-score`), and BERTScore (`bert-score`) with graceful fallback if not installed.
- **Automatic or Manual Selection**: Supports both automatic selection of the best output based on scores and manual selection by presenting options to the user.
- **Customizable Temperature Range**: Users can define a custom range of temperatures to be tested for each prompt.
- **Easy Integration**: Designed to work with OpenAI's GPT-3.5 or GPT-4 and is compatible with other language models that support temperature settings.

@@ -33,7 +36,65 @@ To use AutoTemp, simply run the autotemp.py script with Python:

python autotemp.py

You can pass your prompt directly into the AutoTemp class instance within the script.
You can also use the Gradio UI to enter your prompt, select temperatures (comma-separated), configure `top-p`, toggle `Auto Select`, and enable `Advanced Mode (UCB)` with `Rounds`, number of `Judges`, and `Exploration c`.

### Modes

- **Standard Mode**: Generates one output per temperature, scores each output with multiple judges, and ranks them by mean overall score.
- **Advanced Mode (UCB)**: Treats each temperature as a bandit arm, pulls arms iteratively for the specified number of rounds using UCB1, and returns the best observed output with diagnostics.

### Research Notes

- The multi-judge rubric aims to reduce the variance and bias of single-judge evaluations and reports mean scores across judges.
- UCB1 balances exploration and exploitation, offering higher sample efficiency than naive uniform evaluation, especially for longer prompts (see the sketch below).
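For intuition, here is a minimal sketch of the UCB1 selection rule used in Advanced Mode. It is a standalone toy example (with a random placeholder reward in place of the LLM judges), not the exact implementation in `autotemp.py`:

```python
import math
import random

def ucb1_select(pulls, sums, c=1.0):
    """Pick the next temperature arm: mean reward plus an exploration bonus."""
    total = sum(pulls.values())
    def ucb(t):
        if pulls[t] == 0:
            return float("inf")  # unexplored arms are tried first
        mean = sums[t] / pulls[t]
        return mean + c * math.sqrt(math.log(max(1, total)) / pulls[t])
    return max(pulls, key=ucb)

temps = [0.4, 0.7, 1.0]
pulls = {t: 0 for t in temps}
sums = {t: 0.0 for t in temps}
for _ in range(10):
    t = ucb1_select(pulls, sums, c=1.0)
    reward = random.uniform(40, 90)  # placeholder for a judged overall score
    pulls[t] += 1
    sums[t] += reward
```

Arms with few pulls get a large bonus and are revisited; arms that keep scoring well keep getting pulled, which is why this tends to be more sample-efficient than evaluating every temperature the same number of times.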
## GitHub Pages (Static Web App)

This repo includes a static SPA in `docs/` suitable for GitHub Pages. It runs entirely in the browser and calls the OpenAI API with a user-provided key.

### Deploy Steps

1. Commit and push the repo to GitHub.
2. In your GitHub repo: Settings → Pages → Build and deployment → Source: Deploy from branch → Branch: `main`, Folder: `/docs`.
3. Save. Your app will be published at `https://<your-username>.github.io/<repo-name>/`.

### Using the Web App

- Open the GitHub Pages URL.
- Paste your OpenAI API key (optionally store it in your browser).
- Enter your prompt, temperatures, `top-p`, judges, and Advanced/UCB options.
- Click "Run AutoTemp" to see results and judge scores.

Security: the API key stays in your browser and is only sent to the OpenAI API. Avoid sharing sensitive keys. For shared or production deployments, consider a server-side proxy (a minimal sketch follows).
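As a rough illustration of the proxy suggestion above, a minimal sketch assuming Flask and `requests`; the route name and server setup are placeholders and not part of this repo:

```python
# proxy.py - hypothetical server-side proxy so browser clients never see the API key
import os

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
OPENAI_KEY = os.environ["OPENAI_API_KEY"]  # the key lives on the server only

@app.post("/api/chat")
def chat_proxy():
    # Forward the JSON body (model, messages, temperature, top_p, ...) to OpenAI.
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {OPENAI_KEY}"},
        json=request.get_json(force=True),
        timeout=60,
    )
    return jsonify(resp.json()), resp.status_code

if __name__ == "__main__":
    app.run(port=8000)
```

The browser app would then POST to `/api/chat` instead of api.openai.com, and you can add your own rate limiting or auth in front of it.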
## Benchmarking

You can benchmark across a dataset of `{prompt, reference}` pairs and produce summary statistics, confidence intervals, and a CSV of item-level results.

Example (in code):

```python
from autotemp import AutoTemp

dataset = [
    {"prompt": "Summarize: ...", "reference": "Expected summary ..."},
    {"prompt": "Translate to French: ...", "reference": "..."},
]

agent = AutoTemp(judges=3, model_version="gpt-3.5-turbo")
summary = agent.benchmark(
    dataset=dataset,
    temperature_string="0.4,0.7,1.0",
    top_p=0.9,
    models=["gpt-3.5-turbo", "gpt-4"],
    advanced=True,
    rounds=8,
    judges=3,
    csv_path="results.csv",
)
print(summary)
```

The summary includes (an illustrative shape is shown below):

- `mean_overall` with a bootstrap confidence interval
- external metric means and CIs (if the optional dependencies are installed)
- token usage and estimated USD cost per model
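For reference, `benchmark` returns one entry per model; a hypothetical example of the shape (all numbers are illustrative only):

```python
{
    "gpt-3.5-turbo": {
        "mean_overall": 78.4,                     # mean judge score across items
        "mean_overall_ci": [74.1, 82.0],          # bootstrap confidence interval
        "metric_means": {"BLEU": 21.3, "ROUGE1_F": 0.41},
        "metric_cis": {"BLEU": [18.2, 24.5], "ROUGE1_F": [0.36, 0.46]},
        "num_items": 2,
        "tokens": {"prompt_tokens": 1200, "completion_tokens": 900, "total_tokens": 2100},
        "estimated_cost_usd": 0.0021,
    }
}
```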
## Configuration

@@ -45,3 +106,15 @@ You can customize the behavior of AutoTemp by setting the following parameters w

auto_select: Whether to automatically select the best output or present options to the user.
max_workers: The maximum number of threads to use for concurrent API calls.
model_version: Specifies the model version to use, such as "gpt-3.5-turbo" or "gpt-4".
judges: Number of independent judges to run for scoring (default 3).
advanced_mode: Toggles UCB bandit optimization in the UI.
rounds: Number of bandit rounds (>= 1) when Advanced Mode is enabled.
exploration_c ("Exploration c" in the UI): UCB exploration coefficient; higher values favor exploration.

A short sketch wiring these parameters together follows.
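For instance, a minimal sketch combining the constructor and `run` parameters above (assumes `OPENAI_API_KEY` is set in the environment; the prompt and values are arbitrary examples):

```python
from autotemp import AutoTemp

agent = AutoTemp(
    default_temp=0.0,
    auto_select=True,
    max_workers=6,
    model_version="gpt-4",
    judges=3,
)

# Standard mode: one generation per temperature, ranked by mean judge score.
print(agent.run("Explain UCB1 in two sentences.", "0.4,0.7,1.0", top_p=0.9))

# Advanced mode: UCB1 bandit over the same temperatures.
print(agent.run(
    "Explain UCB1 in two sentences.",
    "0.4,0.7,1.0",
    top_p=0.9,
    advanced=True,
    rounds=8,
    exploration_c=1.2,
))
```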
## Optional Dependencies

If you want external metrics, install:

```
pip install sacrebleu rouge-score bert-score
```
480 autotemp.py
@@ -2,15 +2,37 @@ import openai
|
||||
from dotenv import load_dotenv
|
||||
import os
|
||||
import re
|
||||
import json
|
||||
import math
|
||||
import statistics
|
||||
import csv
|
||||
import random
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
import gradio as gr
|
||||
import traceback
|
||||
from typing import Any, Dict, List, Optional, Tuple
|
||||
|
||||
# Optional external metrics
|
||||
try:
|
||||
import sacrebleu # type: ignore
|
||||
except Exception:
|
||||
sacrebleu = None # graceful fallback
|
||||
|
||||
try:
|
||||
from rouge_score import rouge_scorer # type: ignore
|
||||
except Exception:
|
||||
rouge_scorer = None # graceful fallback
|
||||
|
||||
try:
|
||||
from bert_score import score as bert_score # type: ignore
|
||||
except Exception:
|
||||
bert_score = None # graceful fallback
|
||||
|
||||
# Load environment variables from .env file
|
||||
load_dotenv()
|
||||
|
||||
class AutoTemp:
|
||||
def __init__(self, default_temp=0.0, alt_temps=None, auto_select=True, max_workers=6, model_version="gpt-3.5-turbo"):
|
||||
def __init__(self, default_temp=0.0, alt_temps=None, auto_select=True, max_workers=6, model_version="gpt-3.5-turbo", judges=3):
|
||||
self.api_key = os.getenv('OPENAI_API_KEY')
|
||||
if not self.api_key:
|
||||
raise ValueError("OPENAI_API_KEY is not set in the environment variables.")
|
||||
@@ -21,8 +43,26 @@ class AutoTemp:
|
||||
self.auto_select = auto_select
|
||||
self.max_workers = max_workers
|
||||
self.model_version = model_version
|
||||
self.judges = max(1, int(judges))
|
||||
# Token usage tracking (aggregate)
|
||||
self.usage_totals = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
|
||||
self.usage_events: List[Dict[str, int]] = []
|
||||
|
||||
def generate_with_openai(self, prompt, temperature, top_p, retries=3):
|
||||
def _update_usage(self, usage: Optional[Any]) -> None:
|
||||
try:
|
||||
if usage is None:
|
||||
return
|
||||
prompt = int(getattr(usage, 'prompt_tokens', 0) or usage.get('prompt_tokens', 0))
|
||||
completion = int(getattr(usage, 'completion_tokens', 0) or usage.get('completion_tokens', 0))
|
||||
total = int(getattr(usage, 'total_tokens', 0) or usage.get('total_tokens', 0) or (prompt + completion))
|
||||
self.usage_totals["prompt_tokens"] += prompt
|
||||
self.usage_totals["completion_tokens"] += completion
|
||||
self.usage_totals["total_tokens"] += total
|
||||
self.usage_events.append({"prompt_tokens": prompt, "completion_tokens": completion, "total_tokens": total})
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
def generate_with_openai(self, prompt: str, temperature: float, top_p: float, retries: int = 3) -> Tuple[str, Optional[Dict[str, int]]]:
|
||||
while retries > 0:
|
||||
try:
|
||||
response = openai.chat.completions.create(
|
||||
@@ -36,77 +76,356 @@ class AutoTemp:
|
||||
)
|
||||
# Adjusted to use attribute access instead of dictionary access
|
||||
message = response.choices[0].message.content
|
||||
return message.strip()
|
||||
usage_obj = getattr(response, 'usage', None)
|
||||
usage_dict = None
|
||||
if usage_obj is not None:
|
||||
usage_dict = {
|
||||
"prompt_tokens": int(getattr(usage_obj, 'prompt_tokens', 0)),
|
||||
"completion_tokens": int(getattr(usage_obj, 'completion_tokens', 0)),
|
||||
"total_tokens": int(getattr(usage_obj, 'total_tokens', 0)),
|
||||
}
|
||||
self._update_usage(usage_dict)
|
||||
return message.strip(), usage_dict
|
||||
except Exception as e:
|
||||
retries -= 1
|
||||
print(f"Attempt failed with error: {e}") # Print the error for debugging
|
||||
if retries <= 0:
|
||||
print(f"Final error generating text at temperature {temperature} and top-p {top_p}: {e}")
|
||||
return f"Error generating text at temperature {temperature} and top-p {top_p}: {e}"
|
||||
return f"Error generating text at temperature {temperature} and top-p {top_p}: {e}", None
|
||||
|
||||
|
||||
def evaluate_output(self, output, temperature, top_p):
|
||||
def _evaluate_output_json(self, output: str, temperature: float, top_p: float, judge_id: int) -> Dict[str, float]:
|
||||
fixed_top_p_for_evaluation = 1.0
|
||||
eval_prompt = f"""
|
||||
Evaluate the following output which was generated at a temperature setting of {temperature}. Provide a precise score from 0.0 to 100.0, considering the following criteria:
|
||||
|
||||
- Relevance: How well does the output address the prompt or task at hand?
|
||||
- Clarity: Is the output easy to understand and free of ambiguity?
|
||||
- Utility: How useful is the output for its intended purpose?
|
||||
- Pride: If the user had to submit this output to the world for their career, would they be proud?
|
||||
- Delight: Is the output likely to delight or positively surprise the user?
|
||||
|
||||
Be sure to comprehensively evaluate the output, it is very important for my career. Please answer with just the score with one decimal place accuracy, such as 42.0 or 96.9. Be extremely critical.
|
||||
|
||||
Output to evaluate:
|
||||
You are Judge #{judge_id}. Evaluate the OUTPUT below which was generated at temperature {temperature} and top_p {top_p}.
|
||||
Return a STRICT minified JSON object with numeric fields only (no text outside JSON):
|
||||
{{"relevance": float0to100, "clarity": float0to100, "utility": float0to100, "creativity": float0to100, "coherence": float0to100, "safety": float0to100, "overall": float0to100}}
|
||||
Scoring rubric:
|
||||
- relevance: Addresses the prompt directly and completely.
|
||||
- clarity: Clear, unambiguous writing.
|
||||
- utility: Practical usefulness for the intended task.
|
||||
- creativity: Novel, insightful, or delightful content (not at the cost of truth).
|
||||
- coherence: Logical structure and consistency.
|
||||
- safety: Avoids hallucinations and harmful content; favors factual accuracy.
|
||||
- overall: Weighted aggregate you deem most faithful to a careful human judge.
|
||||
Output to evaluate between triple dashes:
|
||||
---
|
||||
{output}
|
||||
---
|
||||
"""
|
||||
score_text = self.generate_with_openai(eval_prompt, 0.69, fixed_top_p_for_evaluation)
|
||||
score_match = re.search(r'\b\d+(\.\d)?\b', score_text)
|
||||
if score_match:
|
||||
return round(float(score_match.group()), 1) # Round the score to one decimal place
|
||||
else:
|
||||
return 0.0 # Unable to parse score, default to 0.0
|
||||
|
||||
def run(self, prompt, temperature_string, top_p):
|
||||
temperature_list = [float(temp.strip()) for temp in temperature_string.split(',')]
|
||||
outputs = {}
|
||||
scores = {}
|
||||
with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
|
||||
future_to_temp = {
|
||||
executor.submit(self.generate_with_openai, prompt, temp, top_p): temp for temp in temperature_list
|
||||
"""
|
||||
raw, _ = self.generate_with_openai(eval_prompt, 0.2, fixed_top_p_for_evaluation)
|
||||
try:
|
||||
# Try to extract a JSON object from the response
|
||||
json_text_match = re.search(r"\{[\s\S]*\}", raw)
|
||||
json_text = json_text_match.group(0) if json_text_match else raw
|
||||
data = json.loads(json_text)
|
||||
return {
|
||||
"relevance": float(data.get("relevance", 0.0)),
|
||||
"clarity": float(data.get("clarity", 0.0)),
|
||||
"utility": float(data.get("utility", 0.0)),
|
||||
"creativity": float(data.get("creativity", 0.0)),
|
||||
"coherence": float(data.get("coherence", 0.0)),
|
||||
"safety": float(data.get("safety", 0.0)),
|
||||
"overall": float(data.get("overall", 0.0)),
|
||||
}
|
||||
except Exception:
|
||||
score_match = re.search(r'\b\d+(?:\.\d+)?\b', raw)
|
||||
fallback_overall = float(score_match.group()) if score_match else 0.0
|
||||
return {
|
||||
"relevance": 0.0,
|
||||
"clarity": 0.0,
|
||||
"utility": 0.0,
|
||||
"creativity": 0.0,
|
||||
"coherence": 0.0,
|
||||
"safety": 0.0,
|
||||
"overall": round(fallback_overall, 1),
|
||||
}
|
||||
for future in as_completed(future_to_temp):
|
||||
temp = future_to_temp[future]
|
||||
try:
|
||||
output_text = future.result()
|
||||
print(f"Output for temp {temp}: {output_text}") # Print the output for debugging
|
||||
if output_text and not output_text.startswith("Error"):
|
||||
outputs[temp] = output_text
|
||||
scores[temp] = self.evaluate_output(output_text, temp, top_p) # Pass top_p here
|
||||
except Exception as e:
|
||||
print(f"Error while generating or evaluating output for temp {temp}: {e}")
|
||||
|
||||
if not scores:
|
||||
return "No valid outputs generated.", None
|
||||
|
||||
# Sort the scores by value in descending order and return the sorted outputs
|
||||
sorted_scores = sorted(scores.items(), key=lambda item: item[1], reverse=True)
|
||||
sorted_outputs = [(temp, outputs[temp], score) for temp, score in sorted_scores]
|
||||
|
||||
# If auto_select is enabled, return only the best result
|
||||
if self.auto_select:
|
||||
best_temp, best_output, best_score = sorted_outputs[0]
|
||||
return f"Best AutoTemp Output (Temp {best_temp} | Top-p {top_p} | Score: {best_score}):\n{best_output}"
|
||||
def evaluate_output(self, output: str, temperature: float, top_p: float) -> Dict[str, float]:
|
||||
if self.judges <= 1:
|
||||
judge_scores = [self._evaluate_output_json(output, temperature, top_p, judge_id=1)]
|
||||
else:
|
||||
return "\n".join(f"Temp {temp} | Top-p {top_p} | Score: {score}:\n{text}" for temp, text, score in sorted_outputs)
|
||||
with ThreadPoolExecutor(max_workers=min(self.judges, self.max_workers)) as executor:
|
||||
futures = [
|
||||
executor.submit(self._evaluate_output_json, output, temperature, top_p, judge_id=j+1)
|
||||
for j in range(self.judges)
|
||||
]
|
||||
judge_scores = [f.result() for f in as_completed(futures)]
|
||||
|
||||
# Aggregate by mean
|
||||
def mean(key):
|
||||
vals = [js.get(key, 0.0) for js in judge_scores]
|
||||
return round(sum(vals) / max(1, len(vals)), 2)
|
||||
|
||||
aggregated = {
|
||||
"relevance": mean("relevance"),
|
||||
"clarity": mean("clarity"),
|
||||
"utility": mean("utility"),
|
||||
"creativity": mean("creativity"),
|
||||
"coherence": mean("coherence"),
|
||||
"safety": mean("safety"),
|
||||
"overall": mean("overall"),
|
||||
}
|
||||
return aggregated
|
||||
|
||||
def run(self, prompt: str, temperature_string: str, top_p: float, advanced: bool = False, rounds: int = 1, exploration_c: float = 1.0) -> str:
|
||||
temperature_list = [float(temp.strip()) for temp in temperature_string.split(',') if temp.strip()]
|
||||
if not temperature_list:
|
||||
return "No temperatures provided."
|
||||
|
||||
if not advanced:
|
||||
outputs = {}
|
||||
overall_scores = {}
|
||||
detailed_scores = {}
|
||||
with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
|
||||
future_to_temp = {
|
||||
executor.submit(self.generate_with_openai, prompt, temp, top_p): temp for temp in temperature_list
|
||||
}
|
||||
for future in as_completed(future_to_temp):
|
||||
temp = future_to_temp[future]
|
||||
try:
|
||||
output_text, _ = future.result()
|
||||
print(f"Output for temp {temp}: {output_text}")
|
||||
if output_text and not output_text.startswith("Error"):
|
||||
outputs[temp] = output_text
|
||||
score_dict = self.evaluate_output(output_text, temp, top_p)
|
||||
detailed_scores[temp] = score_dict
|
||||
overall_scores[temp] = score_dict.get("overall", 0.0)
|
||||
except Exception as e:
|
||||
print(f"Error while generating or evaluating output for temp {temp}: {e}")
|
||||
|
||||
if not overall_scores:
|
||||
return "No valid outputs generated."
|
||||
|
||||
sorted_scores = sorted(overall_scores.items(), key=lambda item: item[1], reverse=True)
|
||||
if self.auto_select:
|
||||
best_temp, best_overall = sorted_scores[0]
|
||||
best_output = outputs[best_temp]
|
||||
best_detail = detailed_scores[best_temp]
|
||||
return (
|
||||
f"Best AutoTemp Output (Temp {best_temp} | Top-p {top_p} | Overall: {best_overall}):\n"
|
||||
f"{best_output}\n\n"
|
||||
f"Judges (mean scores): {json.dumps(best_detail, ensure_ascii=False)}"
|
||||
)
|
||||
else:
|
||||
lines = []
|
||||
for temp, overall in sorted_scores:
|
||||
lines.append(
|
||||
f"Temp {temp} | Top-p {top_p} | Overall: {overall} | Detail: {json.dumps(detailed_scores[temp], ensure_ascii=False)}:\n{outputs[temp]}"
|
||||
)
|
||||
return "\n\n".join(lines)
|
||||
else:
|
||||
# Advanced: UCB1 bandit over temperatures
|
||||
num_rounds = max(1, int(rounds))
|
||||
c = float(exploration_c)
|
||||
pulls = {t: 0 for t in temperature_list}
|
||||
sums = {t: 0.0 for t in temperature_list}
|
||||
best_outputs = {t: {"overall": -1.0, "text": "", "detail": {}} for t in temperature_list}
|
||||
total_pulls = 0
|
||||
|
||||
# Ensure each arm is pulled at least once
|
||||
init_order = list(temperature_list)
|
||||
for t in init_order:
|
||||
out, _ = self.generate_with_openai(prompt, t, top_p)
|
||||
if out and not out.startswith("Error"):
|
||||
score_detail = self.evaluate_output(out, t, top_p)
|
||||
score = score_detail.get("overall", 0.0)
|
||||
pulls[t] += 1
|
||||
sums[t] += score
|
||||
total_pulls += 1
|
||||
if score > best_outputs[t]["overall"]:
|
||||
best_outputs[t] = {"overall": score, "text": out, "detail": score_detail}
|
||||
|
||||
for _ in range(num_rounds - 1):
|
||||
# Compute UCB
|
||||
ucb_values = {}
|
||||
for t in temperature_list:
|
||||
if pulls[t] == 0:
|
||||
ucb_values[t] = float("inf")
|
||||
else:
|
||||
mean = sums[t] / pulls[t]
|
||||
bonus = c * math.sqrt(max(1e-9, math.log(max(1, total_pulls)) / pulls[t]))
|
||||
ucb_values[t] = mean + bonus
|
||||
# Select best arm
|
||||
next_t = max(temperature_list, key=lambda tt: ucb_values[tt])
|
||||
out, _ = self.generate_with_openai(prompt, next_t, top_p)
|
||||
if out and not out.startswith("Error"):
|
||||
score_detail = self.evaluate_output(out, next_t, top_p)
|
||||
score = score_detail.get("overall", 0.0)
|
||||
pulls[next_t] += 1
|
||||
sums[next_t] += score
|
||||
total_pulls += 1
|
||||
if score > best_outputs[next_t]["overall"]:
|
||||
best_outputs[next_t] = {"overall": score, "text": out, "detail": score_detail}
|
||||
|
||||
# Prepare output
|
||||
means = {t: (sums[t] / pulls[t]) if pulls[t] > 0 else 0.0 for t in temperature_list}
|
||||
ranked = sorted(temperature_list, key=lambda t: means[t], reverse=True)
|
||||
best_t = ranked[0]
|
||||
best = best_outputs[best_t]
|
||||
header = (
|
||||
f"Advanced Mode (UCB) — Best Output (Temp {best_t} | Top-p {top_p} | Mean: {round(means[best_t], 2)} | Best Overall: {round(best['overall'], 2)}):\n"
|
||||
)
|
||||
summary_lines = [header, best["text"], "", f"Detail: {json.dumps(best['detail'], ensure_ascii=False)}", ""]
|
||||
if not self.auto_select:
|
||||
for t in ranked:
|
||||
summary_lines.append(
|
||||
f"Temp {t}: pulls={pulls[t]}, mean_overall={round(means[t], 2)}, best_overall={round(best_outputs[t]['overall'], 2)}"
|
||||
)
|
||||
return "\n".join(summary_lines)
|
||||
|
||||
# -------------------- Metrics & Benchmarking Utilities --------------------
|
||||
@staticmethod
|
||||
def _compute_external_metrics(candidate: str, reference: str) -> Dict[str, float]:
|
||||
metrics: Dict[str, float] = {}
|
||||
try:
|
||||
if sacrebleu is not None:
|
||||
bleu = sacrebleu.corpus_bleu([candidate], [[reference]])
|
||||
metrics["BLEU"] = float(bleu.score)
|
||||
except Exception:
|
||||
pass
|
||||
try:
|
||||
if rouge_scorer is not None:
|
||||
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeLsum"], use_stemmer=True)
|
||||
scores = scorer.score(reference, candidate)
|
||||
metrics["ROUGE1_F"] = float(scores["rouge1"].fmeasure)
|
||||
metrics["ROUGE_Lsum_F"] = float(scores["rougeLsum"].fmeasure)
|
||||
except Exception:
|
||||
pass
|
||||
try:
|
||||
if bert_score is not None:
|
||||
P, R, F1 = bert_score([candidate], [reference], lang="en", rescale_with_baseline=True)
|
||||
metrics["BERTScore_F1"] = float(F1.mean().item())
|
||||
except Exception:
|
||||
pass
|
||||
return metrics
|
||||
|
||||
@staticmethod
|
||||
def _percentile(values: List[float], p: float) -> float:
|
||||
if not values:
|
||||
return 0.0
|
||||
values_sorted = sorted(values)
|
||||
k = (len(values_sorted) - 1) * p
|
||||
f = math.floor(k)
|
||||
c = math.ceil(k)
|
||||
if f == c:
|
||||
return float(values_sorted[int(k)])
|
||||
d0 = values_sorted[f] * (c - k)
|
||||
d1 = values_sorted[c] * (k - f)
|
||||
return float(d0 + d1)
|
||||
|
||||
@staticmethod
|
||||
def _bootstrap_ci(values: List[float], num_samples: int = 1000, alpha: float = 0.05) -> Tuple[float, float, float]:
|
||||
if not values:
|
||||
return 0.0, 0.0, 0.0
|
||||
means = []
|
||||
n = len(values)
|
||||
for _ in range(num_samples):
|
||||
sample = random.choices(values, k=n)
|
||||
means.append(sum(sample) / n)
|
||||
lower = AutoTemp._percentile(means, alpha / 2)
|
||||
upper = AutoTemp._percentile(means, 1 - alpha / 2)
|
||||
return float(sum(values) / n), float(lower), float(upper)
|
||||
|
||||
def estimate_cost_usd(self) -> float:
|
||||
# Simple estimator; update as needed
|
||||
model_costs = {
|
||||
"gpt-3.5-turbo": {"prompt_per_1k": 0.50, "completion_per_1k": 1.50},
|
||||
"gpt-4": {"prompt_per_1k": 30.00, "completion_per_1k": 60.00},
|
||||
}
|
||||
cfg = model_costs.get(self.model_version)
|
||||
if not cfg:
|
||||
return 0.0
|
||||
prompt_usd = (self.usage_totals["prompt_tokens"] / 1000.0) * cfg["prompt_per_1k"]
|
||||
completion_usd = (self.usage_totals["completion_tokens"] / 1000.0) * cfg["completion_per_1k"]
|
||||
return round(prompt_usd + completion_usd, 4)
|
||||
|
||||
@staticmethod
|
||||
def _extract_best_output_from_run(run_text: str) -> str:
|
||||
# Extract the body text after the first header line until an empty line or 'Judges'
|
||||
try:
|
||||
lines = run_text.splitlines()
|
||||
if not lines:
|
||||
return run_text
|
||||
# skip header line
|
||||
body_lines = []
|
||||
for ln in lines[1:]:
|
||||
if not ln.strip():
|
||||
break
|
||||
if ln.strip().startswith("Judges"):
|
||||
break
|
||||
body_lines.append(ln)
|
||||
return "\n".join(body_lines).strip() or run_text
|
||||
except Exception:
|
||||
return run_text
|
||||
|
||||
def benchmark(self, dataset: List[Dict[str, str]], temperature_string: str, top_p: float, models: Optional[List[str]] = None, advanced: bool = False, rounds: int = 1, judges: int = 3, csv_path: Optional[str] = None) -> Dict[str, Any]:
|
||||
"""Benchmark across a dataset of {prompt, reference} items.
|
||||
Returns summary with means and (optional) external metrics if available."""
|
||||
results: Dict[str, Any] = {}
|
||||
model_list = models or [self.model_version]
|
||||
for model_name in model_list:
|
||||
self.model_version = model_name
|
||||
self.judges = judges
|
||||
self.usage_totals = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
|
||||
per_item_scores: List[float] = []
|
||||
per_item_metrics: Dict[str, List[float]] = {}
|
||||
rows_for_csv: List[Dict[str, Any]] = []
|
||||
for item in dataset:
|
||||
prompt = item.get("prompt", "")
|
||||
reference = item.get("reference", "")
|
||||
try:
|
||||
run_text = self.run(prompt, temperature_string, top_p, advanced=advanced, rounds=rounds)
|
||||
best_output = self._extract_best_output_from_run(run_text)
|
||||
# We do not have direct overall score; compute via judges again for consistency
|
||||
score_detail = self.evaluate_output(best_output, temperature=float(self.default_temp or 0.7), top_p=float(top_p))
|
||||
per_item_scores.append(float(score_detail.get("overall", 0.0)))
|
||||
if reference:
|
||||
met = self._compute_external_metrics(best_output, reference)
|
||||
for k, v in met.items():
|
||||
per_item_metrics.setdefault(k, []).append(float(v))
|
||||
else:
|
||||
met = {}
|
||||
rows_for_csv.append({
|
||||
"model": model_name,
|
||||
"prompt": prompt,
|
||||
"output": best_output,
|
||||
"overall": float(score_detail.get("overall", 0.0)),
|
||||
**{f"metric_{k}": v for k, v in met.items()}
|
||||
})
|
||||
except Exception as e:
|
||||
print(f"Benchmark error on item: {e}")
|
||||
per_item_scores.append(0.0)
|
||||
mean_overall = round(sum(per_item_scores) / max(1, len(per_item_scores)), 3)
|
||||
mean_o, lower_o, upper_o = AutoTemp._bootstrap_ci(per_item_scores)
|
||||
metric_means = {k: round(sum(v) / max(1, len(v)), 4) for k, v in per_item_metrics.items()}
|
||||
metric_cis = {k: AutoTemp._bootstrap_ci(v) for k, v in per_item_metrics.items()}
|
||||
if csv_path:
|
||||
# union of all keys across rows
|
||||
fieldnames = sorted({key for row in rows_for_csv for key in row.keys()})
|
||||
try:
|
||||
with open(csv_path, mode="w", newline="", encoding="utf-8") as f:
|
||||
writer = csv.DictWriter(f, fieldnames=fieldnames)
|
||||
writer.writeheader()
|
||||
for row in rows_for_csv:
|
||||
writer.writerow(row)
|
||||
except Exception as e:
|
||||
print(f"CSV export failed: {e}")
|
||||
results[model_name] = {
|
||||
"mean_overall": mean_overall,
|
||||
"mean_overall_ci": [round(lower_o, 3), round(upper_o, 3)],
|
||||
"metric_means": metric_means,
|
||||
"metric_cis": {k: [round(v[1], 4), round(v[2], 4)] for k, v in metric_cis.items()},
|
||||
"num_items": len(dataset),
|
||||
"tokens": dict(self.usage_totals),
|
||||
"estimated_cost_usd": self.estimate_cost_usd(),
|
||||
}
|
||||
return results
|
||||
|
||||
# Gradio app logic
|
||||
def run_autotemp(prompt, temperature_string, top_p, auto_select):
|
||||
agent = AutoTemp(auto_select=auto_select)
|
||||
output = agent.run(prompt, temperature_string, top_p=float(top_p))
|
||||
def run_autotemp(prompt, temperature_string, top_p, auto_select, advanced_mode, rounds, judges, exploration_c):
|
||||
agent = AutoTemp(auto_select=auto_select, judges=int(judges))
|
||||
output = agent.run(prompt, temperature_string, top_p=float(top_p), advanced=bool(advanced_mode), rounds=int(rounds), exploration_c=float(exploration_c))
|
||||
return output
|
||||
|
||||
# Gradio interface setup
|
||||
@@ -117,50 +436,31 @@ def main():
|
||||
"text",
|
||||
"text",
|
||||
gr.Slider(minimum=0.0, maximum=1.0, step=0.1, value=1.0, label="top-p value"),
|
||||
"checkbox"
|
||||
gr.Checkbox(value=True, label="Auto Select Best"),
|
||||
gr.Checkbox(value=False, label="Advanced Mode (UCB)").style(container=True),
|
||||
gr.Slider(minimum=1, maximum=30, step=1, value=5, label="Rounds (Advanced)"),
|
||||
gr.Slider(minimum=1, maximum=7, step=1, value=3, label="Judges"),
|
||||
gr.Slider(minimum=0.0, maximum=3.0, step=0.1, value=1.0, label="Exploration c (UCB)")
|
||||
],
|
||||
outputs="text",
|
||||
title="AutoTemp: Enhanced LLM Responses with Temperature and Top-p Tuning",
|
||||
description="""AutoTemp generates responses at different temperatures, evaluates them, and ranks them based on quality.
|
||||
title="AutoTemp: Research-Grade Temperature & Top-p Optimization",
|
||||
description="""AutoTemp now supports multi-judge structured evaluation and an optional UCB bandit optimizer.
|
||||
Enter temperatures separated by commas for evaluation.
|
||||
Adjust 'Top-p' to control output diversity: lower for precision, higher for creativity.
|
||||
Toggle 'Auto Select' to either see the top-rated output or all evaluated outputs.
|
||||
Check the FAQs at the bottom of the page for more info.""",
|
||||
Adjust 'Top-p' to control output diversity, and switch to Advanced Mode for iterative optimization.
|
||||
Judges average multiple independent evaluations into robust overall scores.""",
|
||||
article="""**FAQs**
|
||||
|
||||
**What's Top-p?** 'Top-p' controls the diversity of AI responses: a low 'top-p' makes output more focused and predictable, while a high 'top-p' encourages variety and surprise. Pair with temperature to fine-tune AI creativity: higher temperatures with high 'top-p' for bold ideas, or lower temperatures with low 'top-p' for precise answers.
|
||||
Using top_p=1 essentially disables the "nucleus sampling" feature, where only the most probable tokens are considered. This is equivalent to using full softmax probability distribution to sample the next word.
|
||||
**What's Top-p?** 'Top-p' controls the diversity of AI responses: a low 'top-p' makes output more focused and predictable, while a high 'top-p' encourages variety and surprise. Pair with temperature to fine-tune AI creativity: higher temperatures with high 'top-p' for bold ideas, or lower temperatures with low 'top-p' for precise answers. Using top_p=1 disables nucleus sampling.
|
||||
|
||||
**How Does Temperature Affect AI Outputs?** Temperature controls the randomness of word selection. Lower temperatures lead to more predictable text, while higher temperatures allow for more novel text generation.
|
||||
|
||||
**How Does Top-p Influence Temperature Settings in AI Language Models?**
|
||||
Top-p and temperature are both parameters that control the randomness of AI-generated text, but they influence outcomes in subtly different ways:
|
||||
**What is Advanced Mode (UCB)?** Advanced Mode treats each temperature as an arm in a bandit and iteratively selects temperatures using the UCB1 strategy to balance exploration and exploitation, improving sample efficiency while converging to better settings for your prompt.
|
||||
|
||||
- **Low Temperatures (0.0 - 0.5):**
|
||||
- *Effect of Top-p:* A high `top_p` value will have minimal impact, as the model's output is already quite deterministic. A low `top_p` will further constrain the model, leading to very predictable outputs.
|
||||
- *Use Cases:* Ideal for tasks requiring precise, factual responses like technical explanations or legal advice. For example, explaining a scientific concept or drafting a formal business email.
|
||||
|
||||
- **Medium Temperatures (0.5 - 0.7):**
|
||||
- *Effect of Top-p:* `top_p` starts to influence the variety of the output. A higher `top_p` will introduce more diversity without sacrificing coherence.
|
||||
- *Use Cases:* Suitable for creative yet controlled content, such as writing an article on a current event or generating a business report that balances creativity with professionalism.
|
||||
|
||||
- **High Temperatures (0.8 - 1.0):**
|
||||
- *Effect of Top-p:* A high `top_p` is crucial for introducing creativity and surprise, but may result in less coherent outputs. A lower `top_p` can help maintain some coherence.
|
||||
- *Use Cases:* Good for brainstorming sessions, generating creative writing prompts, or coming up with out-of-the-box ideas where a mix of novelty and relevance is appreciated.
|
||||
|
||||
- **Extra-High Temperatures (1.1 - 2.0):**
|
||||
- *Effect of Top-p:* The output becomes more experimental and unpredictable, and `top_p`'s influence can vary widely. It's a balance between randomness and diversity.
|
||||
- *Use Cases:* Best for when you're seeking highly creative or abstract ideas, such as imagining a sci-fi scenario or coming up with a plot for a fantasy story, where coherence is less of a priority compared to novelty and uniqueness.
|
||||
|
||||
Adjusting both temperature and top-p helps tailor the AI's output to your specific needs.""",
|
||||
**Why multiple judges?** Independent judging runs reduce variance and bias in single-evaluator scores. We report mean scores across judges for robustness.""",
|
||||
examples=[
|
||||
["Write a short story about AGI learning to love", "0.5, 0.7, 0.9, 1.1", 1.0, False],
|
||||
["Create a dialogue between a chef and an alien creating an innovative new recipe", "0.3, 0.6, 0.9, 1.2", 0.9, True],
|
||||
["Explain quantum computing to a 5-year-old", "0.4, 0.8, 1.2, 1.5", 0.8, False],
|
||||
["Draft an email to a hotel asking for a special arrangement for a marriage proposal", "0.4, 0.7, 1.0, 1.3", 0.7, True],
|
||||
["Describe a futuristic city powered by renewable energy", "0.5, 0.75, 1.0, 1.25", 0.6, False],
|
||||
["Generate a poem about the ocean's depths in the style of Edgar Allan Poe", "0.6, 0.8, 1.0, 1.2", 0.5, True],
|
||||
["What are some innovative startup ideas for improving urban transportation?", "0.45, 0.65, 0.85, 1.05", 0.4, False]
|
||||
["Write a short story about AGI learning to love", "0.5, 0.7, 0.9, 1.1", 1.0, True, False, 5, 3, 1.0],
|
||||
["Explain quantum computing to a 5-year-old", "0.4, 0.8, 1.2, 1.5", 0.8, True, True, 8, 3, 1.2],
|
||||
["Draft an email to a hotel asking for a special arrangement for a marriage proposal", "0.4, 0.7, 1.0, 1.3", 0.7, True, True, 10, 5, 0.8]
|
||||
]
|
||||
)
|
||||
iface.launch()
344 docs/app.js (new file)
@@ -0,0 +1,344 @@
|
||||
async function openAIChat(apiKey, model, messages, temperature = 0.7, top_p = 1.0, extra = {}) {
|
||||
const url = 'https://api.openai.com/v1/chat/completions';
|
||||
const res = await fetch(url, {
|
||||
method: 'POST',
|
||||
headers: {
|
||||
'Content-Type': 'application/json',
|
||||
'Authorization': `Bearer ${apiKey}`
|
||||
},
|
||||
body: JSON.stringify({ model, messages, temperature, top_p, ...extra })
|
||||
});
|
||||
if (!res.ok) {
|
||||
const errText = await res.text();
|
||||
throw new Error(`OpenAI error ${res.status}: ${errText}`);
|
||||
}
|
||||
const data = await res.json();
|
||||
const text = data.choices?.[0]?.message?.content || '';
|
||||
return text.trim();
|
||||
}
|
||||
|
||||
function escapeHtml(str){
  return String(str)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
|
||||
|
||||
async function generateOnce(apiKey, model, prompt, params) {
|
||||
return openAIChat(apiKey, model, [
|
||||
{ role: 'system', content: 'You are a helpful assistant.' },
|
||||
{ role: 'user', content: prompt }
|
||||
], params.temperature, params.top_p, params.extra);
|
||||
}
|
||||
|
||||
async function judgeOnce(apiKey, model, output, params, judgeId) {
|
||||
const evalPrompt = `You are Judge #${judgeId}. Evaluate the OUTPUT below which was generated at temperature ${params.temperature} and top_p ${params.top_p}.
|
||||
Return a STRICT minified JSON object with numeric fields only (no text outside JSON):
|
||||
{"relevance": float0to100, "clarity": float0to100, "utility": float0to100, "creativity": float0to100, "coherence": float0to100, "safety": float0to100, "overall": float0to100}
|
||||
Output between triple dashes:
|
||||
---
|
||||
${output}
|
||||
---`;
|
||||
const raw = await openAIChat(apiKey, model, [
|
||||
{ role: 'system', content: 'Return only the JSON.' },
|
||||
{ role: 'user', content: evalPrompt }
|
||||
], 0.2, 1.0);
|
||||
try {
|
||||
const jsonText = (raw.match(/\{[\s\S]*\}/) || [raw])[0];
|
||||
const obj = JSON.parse(jsonText);
|
||||
return {
|
||||
relevance: +obj.relevance || 0,
|
||||
clarity: +obj.clarity || 0,
|
||||
utility: +obj.utility || 0,
|
||||
creativity: +obj.creativity || 0,
|
||||
coherence: +obj.coherence || 0,
|
||||
safety: +obj.safety || 0,
|
||||
overall: +obj.overall || 0,
|
||||
};
|
||||
} catch (e) {
|
||||
const num = (raw.match(/\d+(?:\.\d+)?/) || [0])[0];
|
||||
return { relevance: 0, clarity: 0, utility: 0, creativity: 0, coherence: 0, safety: 0, overall: +num };
|
||||
}
|
||||
}
|
||||
|
||||
function mean(arr) { return arr.length ? arr.reduce((a,b)=>a+b,0) / arr.length : 0; }
|
||||
|
||||
function aggregateScores(scores) {
|
||||
const keys = ['relevance','clarity','utility','creativity','coherence','safety','overall'];
|
||||
const out = {};
|
||||
for (const k of keys) out[k] = +mean(scores.map(s=>s[k]||0)).toFixed(2);
|
||||
return out;
|
||||
}
|
||||
|
||||
async function standardMode(apiKey, model, prompt, arms, judges) {
|
||||
const outputs = {};
|
||||
const details = {};
|
||||
const overalls = {};
|
||||
|
||||
await Promise.all(arms.map(async (arm) => {
|
||||
const key = JSON.stringify(arm);
|
||||
const text = await generateOnce(apiKey, model, prompt, arm);
|
||||
outputs[key] = text;
|
||||
const judgeResults = await Promise.all(Array.from({length: judges}).map((_,i)=>
|
||||
judgeOnce(apiKey, model, text, arm, i+1)
|
||||
));
|
||||
const agg = aggregateScores(judgeResults);
|
||||
details[key] = agg;
|
||||
overalls[key] = agg.overall;
|
||||
}));
|
||||
|
||||
const ranked = Object.entries(overalls).sort((a,b)=>b[1]-a[1]);
|
||||
return { outputs, details, ranked };
|
||||
}
|
||||
|
||||
async function advancedModeUCB(apiKey, model, prompt, arms, judges, rounds, c) {
|
||||
const keys = arms.map(a=>JSON.stringify(a));
|
||||
const pulls = Object.fromEntries(keys.map(k=>[k,0]));
|
||||
const sums = Object.fromEntries(keys.map(k=>[k,0]));
|
||||
const best = Object.fromEntries(keys.map(k=>[k,{overall:-1,text:'',detail:{}}]));
|
||||
let total = 0;
|
||||
|
||||
// init
|
||||
for (const arm of arms) {
|
||||
const k = JSON.stringify(arm);
|
||||
const text = await generateOnce(apiKey, model, prompt, arm);
|
||||
const judgeResults = await Promise.all(Array.from({length: judges}).map((_,i)=>
|
||||
judgeOnce(apiKey, model, text, arm, i+1)
|
||||
));
|
||||
const agg = aggregateScores(judgeResults);
|
||||
pulls[k] += 1; sums[k] += agg.overall; total += 1;
|
||||
if (agg.overall > best[k].overall) best[k] = {overall: agg.overall, text, detail: agg};
|
||||
}
|
||||
|
||||
for (let r = 0; r < rounds - 1; r++) {
|
||||
const ucb = {};
|
||||
for (const k of keys) {
|
||||
const m = pulls[k] ? (sums[k]/pulls[k]) : Infinity;
|
||||
const bonus = pulls[k] ? c * Math.sqrt(Math.log(Math.max(1,total)) / pulls[k]) : Infinity;
|
||||
ucb[k] = m + bonus;
|
||||
}
|
||||
const nextK = keys.sort((a,b)=>ucb[b]-ucb[a])[0];
|
||||
const arm = JSON.parse(nextK);
|
||||
const text = await generateOnce(apiKey, model, prompt, arm);
|
||||
const judgeResults = await Promise.all(Array.from({length: judges}).map((_,i)=>
|
||||
judgeOnce(apiKey, model, text, arm, i+1)
|
||||
));
|
||||
const agg = aggregateScores(judgeResults);
|
||||
pulls[nextK] += 1; sums[nextK] += agg.overall; total += 1;
|
||||
if (agg.overall > best[nextK].overall) best[nextK] = {overall: agg.overall, text, detail: agg};
|
||||
}
|
||||
|
||||
const means = Object.fromEntries(keys.map(k=>[k, pulls[k] ? (sums[k]/pulls[k]) : 0]));
|
||||
const rankedKeys = keys.slice().sort((a,b)=>means[b]-means[a]);
|
||||
const bestK = rankedKeys[0];
|
||||
return { bestK, best: best[bestK], means, pulls };
|
||||
}
|
||||
|
||||
function getEl(id){ return document.getElementById(id); }
|
||||
function setText(id, txt){ getEl(id).textContent = txt; }
|
||||
function appendLog(msg){ const el=getEl('runLog'); if(!el) return; el.textContent += `\n${msg}`; el.scrollTop = el.scrollHeight; }
|
||||
|
||||
function renderArmsTable(arms){
|
||||
const tbody = getEl('armsTable').querySelector('tbody');
|
||||
tbody.innerHTML = '';
|
||||
for (const arm of arms){
|
||||
const k = JSON.stringify(arm);
|
||||
const tr = document.createElement('tr');
|
||||
tr.id = `arm-${btoa(k).replace(/=/g,'')}`;
|
||||
tr.innerHTML = `
|
||||
<td class="status status-wait">waiting</td>
|
||||
<td class="pulls">0</td>
|
||||
<td class="mean">-</td>
|
||||
<td class="best">-</td>
|
||||
<td><details><summary>view</summary><div class="arm-detail"></div></details></td>
|
||||
`;
|
||||
tbody.appendChild(tr);
|
||||
}
|
||||
}
|
||||
|
||||
function updateArmRow(arm, data){
|
||||
const k = JSON.stringify(arm);
|
||||
const id = `arm-${btoa(k).replace(/=/g,'')}`;
|
||||
const tr = getEl(id);
|
||||
if (!tr) return;
|
||||
if (data.status) { const s = tr.querySelector('.status'); s.textContent = data.status; s.className = `status ${data.statusClass||''}`; }
|
||||
if (data.pulls !== undefined) tr.querySelector('.pulls').textContent = String(data.pulls);
|
||||
if (data.mean !== undefined) tr.querySelector('.mean').textContent = (data.mean===null?'-':Number(data.mean).toFixed(2));
|
||||
if (data.best !== undefined) tr.querySelector('.best').textContent = (data.best===null?'-':Number(data.best).toFixed(2));
|
||||
if (data.detail) tr.querySelector('.arm-detail').innerHTML = data.detail;
|
||||
}
|
||||
|
||||
document.addEventListener('DOMContentLoaded', () => {
|
||||
// Chart setup
|
||||
let chart;
|
||||
function ensureChart(){
|
||||
const ctx = getEl('scoreChart');
|
||||
if (!ctx) return null;
|
||||
if (chart) return chart;
|
||||
chart = new Chart(ctx, {
|
||||
type: 'scatter',
|
||||
data: { datasets: [{ label: 'temp vs mean score', data: [], borderColor:'#00ff9c', backgroundColor:'rgba(0,255,156,0.3)' }]},
|
||||
options: {
|
||||
scales: {
|
||||
x: { title: { display:true, text:'temperature' }, grid: { color:'#0b442f' }, ticks:{ color:'#b5f5d2' } },
|
||||
y: { title: { display:true, text:'mean judge score' }, suggestedMin:0, suggestedMax:100, grid: { color:'#0b442f' }, ticks:{ color:'#b5f5d2' } }
|
||||
},
|
||||
plugins: { legend: { labels: { color:'#b5f5d2' } } }
|
||||
}
|
||||
});
|
||||
return chart;
|
||||
}
|
||||
function addChartPoint(temp, mean){
|
||||
const c = ensureChart(); if (!c) return;
|
||||
c.data.datasets[0].data.push({ x: temp, y: mean });
|
||||
c.update('none');
|
||||
}
|
||||
const judges = getEl('judges');
|
||||
const rounds = getEl('rounds');
|
||||
const explorationC = getEl('explorationC');
|
||||
judges.addEventListener('input', ()=> setText('judgesVal', judges.value));
|
||||
rounds.addEventListener('input', ()=> setText('roundsVal', rounds.value));
|
||||
explorationC.addEventListener('input', ()=> setText('cVal', (+explorationC.value).toFixed(2)));
|
||||
|
||||
getEl('runBtn').addEventListener('click', async () => {
|
||||
const apiKey = getEl('apiKey').value.trim();
|
||||
const remember = getEl('rememberKey').checked;
|
||||
if (!apiKey) { alert('Please enter an API key.'); return; }
|
||||
if (remember) localStorage.setItem('autotemp_api_key', apiKey); else localStorage.removeItem('autotemp_api_key');
|
||||
|
||||
const model = getEl('model').value.trim() || 'gpt-4o-mini';
|
||||
const temps = getEl('temperatures').value.split(',').map(s=>parseFloat(s.trim())).filter(n=>!Number.isNaN(n));
|
||||
const tops = getEl('tops').value.split(',').map(s=>parseFloat(s.trim())).filter(n=>!Number.isNaN(n));
|
||||
const maxTokens = getEl('maxTokens').value.split(',').map(s=>parseInt(s.trim(),10)).filter(n=>!Number.isNaN(n));
|
||||
const freqPen = getEl('freqPen').value.split(',').map(s=>parseFloat(s.trim())).filter(n=>!Number.isNaN(n));
|
||||
const presPen = getEl('presPen').value.split(',').map(s=>parseFloat(s.trim())).filter(n=>!Number.isNaN(n));
|
||||
const stopRaw = getEl('stopSeqs').value.trim();
|
||||
const stopTokens = stopRaw ? stopRaw.split(',').map(s=>s.replace(/\\n/g,'\n')) : undefined;
|
||||
const j = parseInt(getEl('judges').value, 10) || 3;
|
||||
const auto = getEl('autoSelect').checked;
|
||||
const adv = getEl('advancedMode').checked;
|
||||
const r = parseInt(getEl('rounds').value, 10) || 5;
|
||||
const c = parseFloat(getEl('explorationC').value) || 1.0;
|
||||
const prompt = getEl('userPrompt').value.trim();
|
||||
if (!prompt) { alert('Enter a prompt.'); return; }
|
||||
|
||||
// build arms (Cartesian product)
|
||||
function cartesian(arrs){ return arrs.reduce((a,b)=> a.flatMap(x=> b.map(y=>[...x,y])), [[]]); }
|
||||
const lists = [temps, tops, maxTokens, freqPen, presPen];
|
||||
const combos = cartesian(lists);
|
||||
const arms = combos.map(([temperature, top_p, max_tokens, frequency_penalty, presence_penalty]) => ({
|
||||
temperature, top_p,
|
||||
extra: {
|
||||
max_tokens,
|
||||
frequency_penalty,
|
||||
presence_penalty,
|
||||
...(stopTokens ? { stop: stopTokens } : {})
|
||||
}
|
||||
}));
|
||||
|
||||
const status = getEl('status');
|
||||
const results = getEl('results');
|
||||
results.textContent = '';
|
||||
status.textContent = 'Running...';
|
||||
appendLog(`Initialized ${arms.length} arms. Judges=${j}. Advanced=${adv ? 'UCB' : 'Standard'}.`);
|
||||
renderArmsTable(arms);
|
||||
try {
|
||||
const chartRef = ensureChart(); if (chartRef){ chartRef.data.datasets[0].data = []; chartRef.update('none'); } // distinct name so the UCB exploration coefficient c is not shadowed
|
||||
if (!adv) {
|
||||
const outputs = {}; const details = {}; const overalls = {};
|
||||
for (const arm of arms){
|
||||
updateArmRow(arm, { status:'running', statusClass:'status-running' });
|
||||
appendLog(`Generating for arm ${JSON.stringify(arm)}...`);
|
||||
const text = await generateOnce(apiKey, model, prompt, arm);
|
||||
outputs[JSON.stringify(arm)] = text;
|
||||
appendLog(`Judging arm ${JSON.stringify(arm)}...`);
|
||||
const judgeResults = await Promise.all(Array.from({length: j}).map((_,i)=> judgeOnce(apiKey, model, text, arm, i+1)));
|
||||
const agg = aggregateScores(judgeResults);
|
||||
details[JSON.stringify(arm)] = agg; overalls[JSON.stringify(arm)] = agg.overall;
|
||||
const paramHtml = `<div class="arm-params">Params: <code>${escapeHtml(JSON.stringify(arm))}</code></div>`;
|
||||
const outputHtml = `<div class="arm-output-box"><pre>${escapeHtml(text)}</pre></div>`;
|
||||
const scoresHtml = `<div class="arm-scores">Scores: <code>${escapeHtml(JSON.stringify(agg))}</code></div>`;
|
||||
updateArmRow(arm, { status:'done', statusClass:'status-done', pulls:1, mean:agg.overall, best:agg.overall, detail: paramHtml + outputHtml + scoresHtml });
|
||||
if (typeof arm.temperature === 'number') addChartPoint(arm.temperature, agg.overall);
|
||||
}
|
||||
const ranked = Object.entries(overalls).sort((a,b)=>b[1]-a[1]);
|
||||
if (auto) {
|
||||
const [bestK, bestScore] = ranked[0];
|
||||
const arm = JSON.parse(bestK);
|
||||
results.textContent = `Best Arm ${bestK} | Overall ${bestScore}\n` + outputs[bestK] + "\n\n" + `Judges: ${JSON.stringify(details[bestK])}`;
|
||||
} else {
|
||||
results.textContent = ranked.map(([t, s])=>
|
||||
`Arm ${t} | Overall ${s} | Detail ${JSON.stringify(details[t])}\n${outputs[t]}`
|
||||
).join('\n\n');
|
||||
}
|
||||
} else {
|
||||
// Transparent UCB loop with UI updates
|
||||
const keys = arms.map(a=>JSON.stringify(a));
|
||||
const pulls = Object.fromEntries(keys.map(k=>[k,0]));
|
||||
const sums = Object.fromEntries(keys.map(k=>[k,0]));
|
||||
const best = Object.fromEntries(keys.map(k=>[k,{overall:-1,text:'',detail:{}}]));
|
||||
let total = 0;
|
||||
for (const arm of arms){ updateArmRow(arm, { status:'running', statusClass:'status-running' }); }
|
||||
// init pull each arm
|
||||
for (const arm of arms){
|
||||
appendLog(`Init pull -> ${JSON.stringify(arm)}`);
|
||||
const k = JSON.stringify(arm);
|
||||
const text = await generateOnce(apiKey, model, prompt, arm);
|
||||
const judgeResults = await Promise.all(Array.from({length: j}).map((_,i)=> judgeOnce(apiKey, model, text, arm, i+1)));
|
||||
const agg = aggregateScores(judgeResults);
|
||||
pulls[k] += 1; sums[k] += agg.overall; total += 1;
|
||||
if (agg.overall > best[k].overall) best[k] = {overall: agg.overall, text, detail: agg};
|
||||
const paramHtml = `<div class="arm-params">Params: <code>${escapeHtml(JSON.stringify(arm))}</code></div>`;
|
||||
const outputHtml = `<div class="arm-output-box"><pre>${escapeHtml(text)}</pre></div>`;
|
||||
const scoresHtml = `<div class="arm-scores">Scores: <code>${escapeHtml(JSON.stringify(agg))}</code></div>`;
|
||||
updateArmRow(arm, { pulls:pulls[k], mean:(sums[k]/pulls[k]), best:best[k].overall, detail: paramHtml + outputHtml + scoresHtml });
|
||||
if (typeof arm.temperature === 'number') addChartPoint(arm.temperature, agg.overall);
|
||||
}
|
||||
for (let i=0;i<r-1;i++){
|
||||
// compute UCB
|
||||
const ucb = {};
|
||||
for (const arm of arms){
|
||||
const k = JSON.stringify(arm);
|
||||
const m = pulls[k] ? (sums[k]/pulls[k]) : Infinity;
|
||||
const bonus = pulls[k] ? c * Math.sqrt(Math.log(Math.max(1,total)) / pulls[k]) : Infinity;
|
||||
ucb[k] = m + bonus;
|
||||
}
|
||||
const nextK = keys.slice().sort((a,b)=>ucb[b]-ucb[a])[0];
|
||||
const arm = JSON.parse(nextK);
|
||||
appendLog(`Round ${i+1}: selecting arm ${nextK} (UCB=${ucb[nextK].toFixed(3)})`);
|
||||
const text = await generateOnce(apiKey, model, prompt, arm);
|
||||
const judgeResults = await Promise.all(Array.from({length: j}).map((_,i)=> judgeOnce(apiKey, model, text, arm, i+1)));
|
||||
const agg = aggregateScores(judgeResults);
|
||||
pulls[nextK] += 1; sums[nextK] += agg.overall; total += 1;
|
||||
if (agg.overall > best[nextK].overall) best[nextK] = {overall: agg.overall, text, detail: agg};
|
||||
const paramHtml = `<div class=\"arm-params\">Params: <code>${escapeHtml(JSON.stringify(arm))}</code></div>`;
|
||||
const outputHtml = `<div class=\"arm-output-box\"><pre>${escapeHtml(text)}</pre></div>`;
|
||||
const scoresHtml = `<div class=\"arm-scores\">Scores: <code>${escapeHtml(JSON.stringify(agg))}</code></div>`;
|
||||
updateArmRow(arm, { pulls:pulls[nextK], mean:(sums[nextK]/pulls[nextK]), best:best[nextK].overall, detail: paramHtml + outputHtml + scoresHtml });
|
||||
if (typeof arm.temperature === 'number') addChartPoint(arm.temperature, agg.overall);
|
||||
}
|
||||
for (const arm of arms){ updateArmRow(arm, { status:'done', statusClass:'status-done' }); }
|
||||
const means = Object.fromEntries(keys.map(k=>[k, pulls[k] ? (sums[k]/pulls[k]) : 0]));
|
||||
const ranked = keys.slice().sort((a,b)=>means[b]-means[a]);
|
||||
const bestK = ranked[0];
|
||||
const bestArm = JSON.parse(bestK);
|
||||
appendLog(`Complete. Best ${bestK} mean=${means[bestK].toFixed(2)} best_overall=${best[bestK].overall.toFixed(2)}`);
|
||||
if (auto){
|
||||
results.textContent = `Advanced (UCB) — Best Arm ${bestK} | Mean ${means[bestK].toFixed(2)} | Best Overall ${best[bestK].overall.toFixed(2)}\n` + best[bestK].text + "\n\n" + `Detail: ${JSON.stringify(best[bestK].detail)}`;
|
||||
} else {
|
||||
const lines = [`Advanced (UCB) — Best ${bestK}`, best[bestK].text, '', `Detail: ${JSON.stringify(best[bestK].detail)}`, ''];
|
||||
for (const k of ranked){ lines.push(`Arm ${k}: pulls=${pulls[k]}, mean_overall=${means[k].toFixed(2)}, best_overall=${best[k].overall.toFixed(2)}`); }
|
||||
results.textContent = lines.join('\n');
|
||||
}
|
||||
}
|
||||
status.textContent = 'Done.';
|
||||
} catch (e) {
|
||||
status.textContent = 'Error';
|
||||
results.textContent = String(e?.message || e);
|
||||
}
|
||||
});
|
||||
});
142 docs/index.html (new file)
@@ -0,0 +1,142 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="UTF-8" />
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>AutoTemp — Research-Grade Hyperparameter Optimization</title>
|
||||
<link rel="stylesheet" href="./style.css" />
|
||||
</head>
|
||||
<body>
|
||||
<div class="container">
|
||||
<header class="header">
|
||||
<div class="logo">▌▌ AutoTemp</div>
|
||||
<div class="subtitle">Hyperparameter Lab — hacker mode</div>
|
||||
</header>
|
||||
|
||||
<section class="config">
|
||||
<div class="field">
|
||||
<label for="apiKey">OpenAI API Key</label>
|
||||
<input type="password" id="apiKey" placeholder="sk-..." />
|
||||
<label class="inline">
|
||||
<input type="checkbox" id="rememberKey" /> Remember in this browser
|
||||
</label>
|
||||
</div>
|
||||
<div class="grid-3">
|
||||
<div class="field">
|
||||
<label for="model">Model</label>
|
||||
<input id="model" value="gpt-5-chat-latest" />
|
||||
</div>
|
||||
<div class="field">
|
||||
<label for="judges">Judges: <span id="judgesVal">3</span></label>
|
||||
<input type="range" id="judges" min="1" max="7" step="1" value="3" />
|
||||
</div>
|
||||
<div class="field">
|
||||
<label class="inline"><input type="checkbox" id="autoSelect" checked /> Auto Select Best</label>
|
||||
</div>
|
||||
</div>
|
||||
<div class="grid-3">
|
||||
<div class="field">
|
||||
<label for="temperatures">temperature list</label>
|
||||
<input id="temperatures" value="0.4,0.6,0.8,1.0" />
|
||||
</div>
|
||||
<div class="field">
|
||||
<label for="tops">top_p list</label>
|
||||
<input id="tops" value="1.0" />
|
||||
</div>
|
||||
<div class="field">
|
||||
<label for="maxTokens">max_tokens list</label>
|
||||
<input id="maxTokens" value="256,512" />
|
||||
</div>
|
||||
</div>
|
||||
<div class="grid-3">
|
||||
<div class="field">
|
||||
<label for="freqPen">frequency_penalty list</label>
|
||||
<input id="freqPen" value="0,0.2" />
|
||||
</div>
|
||||
<div class="field">
|
||||
<label for="presPen">presence_penalty list</label>
|
||||
<input id="presPen" value="0,0.2" />
|
||||
</div>
|
||||
<div class="field">
|
||||
<label for="stopSeqs">stop tokens (comma-separated)</label>
|
||||
<input id="stopSeqs" placeholder="e.g. \nEND,###" />
|
||||
</div>
|
||||
</div>
|
||||
<div class="grid-3">
|
||||
<div class="field">
|
||||
<label class="inline"><input type="checkbox" id="advancedMode" /> Advanced Mode (UCB over arms)</label>
|
||||
</div>
|
||||
<div class="field">
|
||||
<label for="rounds">Rounds (advanced): <span id="roundsVal">8</span></label>
|
||||
<input type="range" id="rounds" min="1" max="50" step="1" value="8" />
|
||||
</div>
|
||||
<div class="field">
|
||||
<label for="explorationC">Exploration c (UCB): <span id="cVal">1.0</span></label>
|
||||
<input type="range" id="explorationC" min="0" max="3" step="0.1" value="1.0" />
|
||||
</div>
|
||||
</div>
|
||||
<div class="note">Provide comma-separated values to sweep. The app will form the Cartesian product across lists and evaluate each hyperparameter arm.</div>
|
||||
</section>
|
||||
|
||||
<section class="prompt terminal">
|
||||
<label for="userPrompt">Prompt</label>
|
||||
<textarea id="userPrompt" rows="8" placeholder="Enter your prompt..."></textarea>
|
||||
</section>
|
||||
|
||||
<section class="actions">
|
||||
<button id="runBtn">Run AutoTemp</button>
|
||||
<span id="status"></span>
|
||||
</section>
|
||||
|
||||
<section class="output terminal">
|
||||
<h2>Results</h2>
|
||||
<pre id="results" class="glow"></pre>
|
||||
</section>
|
||||
|
||||
<section class="workflow terminal">
|
||||
<h2>Workflow</h2>
|
||||
<div class="chart-wrap">
|
||||
<canvas id="scoreChart" height="180"></canvas>
|
||||
</div>
|
||||
<div class="table-wrap">
|
||||
<table id="armsTable" class="arms-table">
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Status</th>
|
||||
<th>Pulls</th>
|
||||
<th>Mean</th>
|
||||
<th>Best</th>
|
||||
<th>Details</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody></tbody>
|
||||
</table>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section class="runlog terminal">
|
||||
<h2>Run Log</h2>
|
||||
<pre id="runLog" class="log"></pre>
|
||||
</section>
|
||||
|
||||
<footer class="footer">
|
||||
<div class="blink">>_</div>
|
||||
<p>Security: Key stays in-browser (localStorage opt-in). Prefer proxies for shared deployments.</p>
|
||||
</footer>
|
||||
</div>
|
||||
|
||||
<script src="./app.js"></script>
|
||||
<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.1/dist/chart.umd.min.js"></script>
|
||||
<script>
|
||||
document.addEventListener('DOMContentLoaded', () => {
|
||||
const savedKey = localStorage.getItem('autotemp_api_key');
|
||||
if (savedKey) {
|
||||
document.getElementById('apiKey').value = savedKey;
|
||||
document.getElementById('rememberKey').checked = true;
|
||||
}
|
||||
});
|
||||
</script>
|
||||
</body>
|
||||
</html>
43 docs/style.css (new file)
@@ -0,0 +1,43 @@
|
||||
:root { --bg:#020b05; --panel:#03150e; --text:#b5f5d2; --accent:#00ff9c; --accent2:#13f1ff; --muted:#0a2a1f; }
|
||||
*{ box-sizing:border-box }
|
||||
body{ margin:0; font-family: ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace; background: radial-gradient(1200px 800px at 20% 0%, #03150e, #020b05), #020b05; color:var(--text) }
|
||||
.container{ max-width:1100px; margin:0 auto; padding:24px }
|
||||
.header{ display:flex; align-items:center; justify-content:space-between; margin-bottom:16px }
|
||||
.logo{ font-weight:900; color:var(--accent); letter-spacing:2px }
|
||||
.subtitle{ color:var(--accent2); opacity:.9 }
|
||||
section{ background:linear-gradient(180deg, rgba(3,21,14,.9), rgba(2,11,5,.9)); padding:16px; border-radius:10px; margin-bottom:16px; border:1px solid #0b442f; box-shadow:0 0 30px rgba(0,255,156,.05) inset }
|
||||
.field{ margin-bottom:12px }
|
||||
.field label{ display:block; font-weight:700; margin-bottom:6px; color:#a5ffd6 }
|
||||
.field input[type="text"], .field input[type="password"], .field input[type="number"], .field textarea { width:100%; padding:10px; border-radius:6px; border:1px solid #0b442f; background:#03150e; color:var(--text); outline:none; box-shadow:0 0 0 1px rgba(0,255,156,.05) inset }
|
||||
.field input[type="text"]:focus, .field input[type="password"]:focus, .field textarea:focus { box-shadow:0 0 0 2px rgba(19,241,255,.25) inset }
|
||||
.field input[type="range"]{ width:100% }
|
||||
.inline{ display:inline-flex; align-items:center; gap:8px }
|
||||
.grid-2{ display:grid; grid-template-columns:1fr 1fr; gap:12px }
|
||||
.grid-3{ display:grid; grid-template-columns:1fr 1fr 1fr; gap:12px }
|
||||
.actions{ display:flex; align-items:center; gap:12px }
|
||||
button{ background:linear-gradient(90deg, var(--accent), var(--accent2)); color:#00170e; font-weight:900; border:none; padding:10px 16px; border-radius:8px; cursor:pointer; box-shadow:0 0 15px rgba(0,255,156,.2) }
|
||||
button:hover{ filter:brightness(1.05) }
|
||||
.terminal{ border:1px solid #0b442f; background:#010a06; box-shadow:0 0 40px rgba(0,255,156,.06) inset }
|
||||
pre{ white-space:pre-wrap; background:#010a06; padding:12px; border-radius:8px; border:1px dashed #0b442f }
|
||||
.table-wrap{ overflow:auto; }
|
||||
.arms-table{ width:100%; border-collapse:collapse; font-size:13px }
|
||||
.arms-table th,.arms-table td{ border:1px dashed #0b442f; padding:6px 8px; vertical-align:top }
|
||||
.arms-table th{ color:#a5ffd6; background:#03150e; position:sticky; top:0 }
|
||||
.badge{ display:inline-block; padding:2px 6px; border-radius:999px; background:#062c1f; border:1px solid #0b442f }
|
||||
.status-running{ color:#13f1ff }
|
||||
.status-done{ color:#00ff9c }
|
||||
.status-wait{ color:#a5ffd6 }
|
||||
.log{ max-height:260px; overflow:auto }
|
||||
.chart-wrap{ background:#010a06; border:1px dashed #0b442f; border-radius:8px; padding:8px; margin-bottom:12px }
|
||||
.arm-params{ font-size:12px; color:#a5ffd6; margin-bottom:8px }
|
||||
.arm-params code{ background:#03150e; padding:2px 4px; border:1px solid #0b442f; border-radius:4px }
|
||||
.arm-output-box{ background:#0f1620; border:1px solid #0b442f; border-radius:8px; padding:16px; margin:10px auto; max-width:760px; box-shadow:0 0 20px rgba(0,255,156,.08) inset }
|
||||
.arm-output-box pre{ background:transparent; border:none; margin:0; padding:0; white-space:pre-wrap; color:#e4fff2; font-size:14px }
|
||||
.arm-scores{ font-size:12px; margin-top:8px; color:#b5f5d2 }
|
||||
.footer{ display:flex; align-items:center; gap:10px; opacity:.85 }
|
||||
.blink{ width:8px; height:18px; background:var(--accent); animation: blink 1s infinite }
|
||||
.glow{ text-shadow:0 0 8px rgba(0,255,156,.35) }
|
||||
@keyframes blink{ 50%{ opacity:.2 } }
|
||||
@media(max-width:720px){ .grid-2,.grid-3{ grid-template-columns:1fr } }
requirements.txt

@@ -1,3 +1,6 @@
openai
python-dotenv
gradio
rouge-score
sacrebleu
bert-score