Panel Ruling: Analyst Learning Mechanism — Calibration Without Memory

Date: 2026-03-30
Panel: Opus 4.6 (native, max reasoning) · Sonnet 4.6 · Gemini 3.1 Pro Preview · Grok 4.2 Reasoning
Verdict: UNANIMOUS on all questions
Related: analyst-briefing-ruling.md (Blind Analysis Protocol)


Context

The Blind Analysis Protocol (ruled earlier today) establishes that analysts go COLD every session — no memory, no market prices, independent oddsmakers. This ruling addresses the follow-up question: how do blind analysts improve over time if they have no memory of their mistakes?

Additionally, Opus 4.6 is added as a THIRD blind analyst (free via CLI bridge), and the pipeline architecture is updated accordingly.


Core Principle: Calibration Is Not Memory

Unanimous ruling. The distinction is critical:

Calibration corrections are instrument adjustments, not memories. A weather model that knows its temperature predictions run 2 degrees high is not "remembering" past forecasts — it is applying a correction factor derived from history. The model's working state is still fresh every session.
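
To make the distinction concrete, here is a minimal sketch (illustrative numbers and names, not pipeline code) of a correction factor being derived from settled history and then applied to a fresh, memoryless prediction:

```python
# Illustrative only: a correction factor is a statistic derived from settled
# history, not a replay of individual games. The session that uses it never
# sees the underlying records.
from statistics import mean

# (predicted home edge, actual home margin) from past settled games -- hypothetical data
history = [(6.0, 4.0), (3.5, 2.0), (7.0, 6.5), (5.0, 3.0)]

hca_bias = mean(pred - actual for pred, actual in history)  # e.g. +1.5 points too high

def corrected_home_edge(raw_edge: float) -> float:
    """Apply the derived correction to a fresh, memoryless prediction."""
    return raw_edge - hca_bias
```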


Updated Pipeline Architecture

1. DATA COLLECTION — scrapers collect fundamentals daily
2. RESEARCH — Grok + Sonar Pro (parallel, web search + deep research)
3. VALIDATION — 5 rule-based checks + Flash Lite contradiction detection
4. BLIND ANALYSIS (all three run in parallel, all cold, all independent):
   ├─ Sonnet 4.6 (via API)    — own lines + probability curves + evidence + conviction
   ├─ Gemini 3.1 Pro (via API) — own lines + probability curves + evidence + conviction
   └─ Opus 4.6 (via CLI bridge, FREE) — own lines + probability curves + evidence + conviction
5. COMPARISON LAYER — 3 analyst curves vs edge scanner math vs Kalshi prices vs Pinnacle lines
6. VERDICT OPUS (separate CLI session) — sees EVERYTHING, makes final pick + position sizing
7. POST-VERDICT — store predictions, track curves, settlement, calibration
8. WEEKLY — post-mortem runs, data accumulates (no prompt changes)
9. MONTHLY — compile per-analyst calibration digests, human reviews, update system prompts
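
A hypothetical, declarative view of this stage order; the names are illustrative, but it captures the key invariant that nothing upstream of the comparison layer sees market data:

```python
# Hypothetical stage table mirroring the pipeline above (names are
# illustrative; the real orchestrator is not shown in this ruling).
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    cadence: str           # "per-market", "daily", "weekly", "monthly"
    sees_market_data: bool

PIPELINE = [
    Stage("data_collection", "daily", False),
    Stage("research", "per-market", False),
    Stage("validation", "per-market", False),
    Stage("blind_analysis", "per-market", False),   # 3 analysts, parallel, cold
    Stage("comparison_layer", "per-market", True),  # first point where prices appear
    Stage("verdict_opus", "per-market", True),      # separate CLI session
    Stage("post_verdict", "per-market", True),
    Stage("weekly_postmortem", "weekly", True),
    Stage("monthly_calibration", "monthly", True),
]

assert all(not s.sees_market_data for s in PIPELINE[:4]), "nothing upstream of the comparison layer sees prices"
```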

Ruling 1: The Calibration Mechanism — "Bias Calibration Appendix"

Each analyst's system prompt includes a dedicated section at the end titled:

"CALIBRATION NOTES (based on your historical performance)"

This section contains 8-15 numbered calibration corrections. Each follows this format:

[N]. [CATEGORY]: [DIRECTION] by [MAGNITUDE] on average.
     Strongest in: [specific context where bias is worst]
     Weakest in: [context where bias is minimal]
     Sample: N=[count]. Confidence: [HIGH/MEDIUM/LOW].

Example Calibration Appendix:

CALIBRATION NOTES (based on your historical performance):

1. HOME COURT ADVANTAGE: You overestimate home court advantage by 1.3
   points on average.
   Strongest in: games where home team is favored by 5+
   Weakest in: divisional rivalry games (your HCA estimate is accurate there)
   Sample: N=82. Confidence: HIGH.

2. BACK-TO-BACK FATIGUE: You underweight B2B impact by 0.8 points.
   Strongest in: B2B + travel combinations
   Weakest in: home B2Bs (impact is smaller, your estimate is closer)
   Sample: N=34. Confidence: MEDIUM.

3. TOTALS — HIGH PACE: When both teams rank top-10 in pace, your totals
   run 3.5 points too low.
   Strongest in: early-season games with unsettled rotations
   Weakest in: playoff-contender matchups (your pace read improves)
   Sample: N=19. Confidence: MEDIUM.

4. INJURY IMPACT — STAR PLAYERS: You overestimate single star absence
   impact by 1.5 points.
   Strongest in: teams with strong depth (backup absorbs minutes well)
   Weakest in: teams with thin rosters (your estimate is accurate)
   Sample: N=28. Confidence: MEDIUM.

5. CONVICTION CALIBRATION: Your HIGH conviction picks hit at 58% against
   a 65% implied threshold. Reserve HIGH for stronger signals.
   Sample: N=45. Confidence: HIGH.

6. COMPRESSION BIAS: You compress the spread range — elite teams are
   underrated by ~1 point, bottom teams overrated by ~1 point.
   Strongest in: top-5 vs bottom-5 matchups
   Sample: N=40. Confidence: MEDIUM.

7. WELL CALIBRATED — TOTALS IN STANDARD PACE: Your totals estimates
   for average-pace matchups (both teams ranked 10-20 in pace) are
   accurate. Trust your baseline here.
   Sample: N=55. Confidence: HIGH.
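
As a sketch of how such an entry might be generated, assuming a hypothetical record shape for the tracked bias statistics:

```python
# Sketch: render one stored bias record into the numbered appendix format
# above. The record fields are assumptions about what the tracker stores.
def render_correction(i: int, rec: dict) -> str:
    lines = [
        f"{i}. {rec['category']}: {rec['direction']} by {rec['magnitude']} on average.",
        f"     Strongest in: {rec['strongest_in']}",
        f"     Weakest in: {rec['weakest_in']}",
        f"     Sample: N={rec['n']}. Confidence: {rec['confidence']}.",
    ]
    return "\n".join(lines)

example = {
    "category": "HOME COURT ADVANTAGE",
    "direction": "You overestimate home court advantage",
    "magnitude": "1.3 points",
    "strongest_in": "games where home team is favored by 5+",
    "weakest_in": "divisional rivalry games",
    "n": 82,
    "confidence": "HIGH",
}
print(render_correction(1, example))
```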

Key Rules:


Ruling 2: Update Cadence — Monthly, Sample-Size Triggered

Schedule:

Why not weekly:

Weekly variance in sports outcomes is enormous. A spread estimate "wrong" by 4 points might have been perfectly calibrated — the game was an outlier. Weekly updates chase noise and create corrections that reverse the following week.

Minimum sample sizes before a correction is injected:

| Bias Category | Minimum N |
| --- | --- |
| Broad metrics (home court, baseline spread error) | 50 |
| Situational (B2B, rest advantage, altitude) | 30 |
| Conviction calibration | 50 |
| Niche (divisional, conference, time-of-season) | 40 |
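
A minimal sketch of the sample-size gate, using the thresholds above (the category keys are illustrative):

```python
# Sketch of the sample-size gate for injecting corrections.
MIN_N = {
    "broad": 50,        # home court, baseline spread error
    "situational": 30,  # B2B, rest advantage, altitude
    "conviction": 50,
    "niche": 40,        # divisional, conference, time-of-season
}

def correction_eligible(category_type: str, n: int) -> bool:
    """A bias only becomes a calibration note once its sample clears the bar."""
    return n >= MIN_N[category_type]

assert correction_eligible("situational", 34)   # the B2B example above (N=34)
assert not correction_eligible("broad", 42)     # too small to trust yet
```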

First 4-6 weeks: NO CORRECTIONS

Let the analysts run pure and unmodified to establish clean baselines. You need to know each model's natural biases before correcting them; correcting early, on insufficient data, introduces artificial biases that are harder to detect than natural ones.


Ruling 3: Each Analyst Gets Its Own Calibration — Never Shared

Unanimous. Not optional.

Rationale: The entire value of a multi-analyst ensemble is uncorrelated errors. If Sonnet overvalues home court and Gemini undervalues it, a universal correction pushes them toward the same answer, destroying independence. Personalized digests let each model improve its own weaknesses while maintaining distinct analytical personalities.

Secondary benefit: Lets you measure which analyst RESPONDS to calibration. If after two months of corrections Sonnet still overvalues home court by the same amount, that tells you something about the model's resistance to prompt-based recalibration.
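
One possible way to quantify that responsiveness, sketched with hypothetical per-game residuals (predicted minus actual):

```python
# Sketch: does an analyst's residual bias shrink after its correction is injected?
from statistics import mean

def responsiveness(pre_residuals: list[float], post_residuals: list[float]) -> float:
    """Fraction of the pre-correction bias that disappeared afterward."""
    pre, post = mean(pre_residuals), mean(post_residuals)
    return 1.0 - abs(post) / abs(pre) if pre else 0.0

# e.g. one analyst's home-court residuals before vs after the month-1 digest
print(responsiveness([1.4, 1.1, 1.6, 1.2], [0.5, 0.2, 0.7, 0.4]))  # ~0.66 -> responsive
```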


Ruling 4: Calibration Metrics to Track

Tier 1 — Track immediately, feed back when N thresholds met:

| Metric | What It Catches |
| --- | --- |
| Mean spread error by team quality tier (elite/good/average/bad) | Compression bias — models make elite teams too weak, bad teams too strong |
| Mean total error by combined pace rank | Pace interaction errors — where totals go most wrong |
| Home court advantage residual (predicted HCA minus actual) | Single most common LLM bias in sports |
| Conviction calibration curve (% correct at each conviction tier) | Whether the conviction system is meaningful or noise |
| Rest/fatigue bias (B2B, 3-in-4, travel + B2B combos) | Models either overweight or underweight fatigue |
| Injury impact error (star absence, depth absence) | Models overweight star injuries, underweight depth injuries |

Tier 2 — Track once 60+ markets accumulated:

| Metric | What It Catches |
| --- | --- |
| Divisional/rivalry game bias | Narrative overweighting |
| Conference strength bias | Systematically over- or underrating a conference |
| Time-of-season bias | Late-season tanking, rest management, playoff motivation |
| Altitude/venue factor error | Small sample, but persistent if real |
| Marquee game bias | Drift toward public consensus on high-attention games |

Tier 3 — Track after 6 months:

| Metric | What It Catches |
| --- | --- |
| Error drift over time | Is the model's baseline shifting as its training data ages? |
| Variance stability | Is the model getting more or less consistent? |
| Cross-analyst correlation | Are Sonnet and Gemini making correlated errors? If so, the independence assumption is weakening |
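
As an example of a Tier 1 computation, here is a sketch of the conviction calibration curve over hypothetical pick records:

```python
# Sketch of one Tier 1 metric: the conviction calibration curve
# (% correct at each conviction tier). Records are hypothetical.
from collections import defaultdict

def conviction_curve(picks: list[dict]) -> dict[str, float]:
    """picks: [{'conviction': 'HIGH'|'MEDIUM'|'LOW', 'correct': bool}, ...]"""
    hits, totals = defaultdict(int), defaultdict(int)
    for p in picks:
        totals[p["conviction"]] += 1
        hits[p["conviction"]] += p["correct"]
    return {tier: hits[tier] / totals[tier] for tier in totals}

picks = [{"conviction": "HIGH", "correct": True},
         {"conviction": "HIGH", "correct": False},
         {"conviction": "LOW", "correct": True}]
print(conviction_curve(picks))  # {'HIGH': 0.5, 'LOW': 1.0}
```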

Do NOT track or feed back:


Ruling 5: Opus as Third Blind Analyst

Opus 4.6 runs through the CLI bridge at zero API cost. It serves as a third independent blind analyst alongside Sonnet and Gemini.

Design Requirements:

  1. Blind Opus and Verdict Opus are COMPLETELY SEPARATE roles. Different CLI sessions. Different system prompts. Zero shared context. No session-level contamination.

  2. Blind Opus runs FIRST. Output is captured, frozen, and the session ends. Verdict Opus runs in a completely separate session afterward.

  3. Blind Opus gets the same briefing format as Sonnet and Gemini: fundamentals only, no prices, no market data. Plus its own personalized calibration appendix.

  4. Blind Opus's calibration is tracked independently. Its track record generates its own digest. Its performance is evaluated separately from its verdict role.

  5. Randomize analyst labels in the verdict prompt. Present the three analyst outputs as "Analyst A, B, C" (randomized each session) so Verdict Opus does not develop meta-biases about which model to trust. This forces the verdict layer to evaluate reasoning quality, not source identity.
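
A minimal sketch of the label randomization, assuming the orchestrator keeps the mapping for settlement but never exposes it to Verdict Opus:

```python
# Sketch of label randomization before the verdict prompt is assembled.
import random

def anonymize(outputs: dict[str, dict]) -> tuple[dict[str, dict], dict[str, str]]:
    """Map model names to 'Analyst A/B/C' in a fresh random order each session."""
    models = list(outputs)
    random.shuffle(models)
    labels = [f"Analyst {chr(ord('A') + i)}" for i in range(len(models))]
    mapping = dict(zip(labels, models))   # kept for settlement, never shown to Verdict Opus
    return {label: outputs[model] for label, model in mapping.items()}, mapping

blinded, key = anonymize({"sonnet": {"spread": -4.0},
                          "gemini": {"spread": -2.5},
                          "opus":   {"spread": -6.0}})
```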

Verdict Opus Bonus — Historical Bias Awareness:

Feed Verdict Opus the calibration data of all three blind analysts. Example: "Analyst A predicts -4, but historically overvalues home favorites by 1.5. Analyst B predicts -2.5, well-calibrated in this matchup type. Analyst C predicts -6 with HIGH conviction but tends to compress spreads." This makes the verdict layer extremely powerful — it can adjust each analyst's output based on known biases before synthesizing the final line.
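
A sketch of how those bias annotations might be attached to the anonymized outputs before they reach Verdict Opus (all names and numbers are illustrative):

```python
# Sketch: annotate each anonymized analyst line with its calibration summary
# before Verdict Opus sees it. All names and numbers are illustrative.
def verdict_lines(outputs: dict[str, float], biases: dict[str, str]) -> str:
    return "\n".join(
        f"{label}: predicts {spread:+.1f} | historical tendency: {biases.get(label, 'none on file')}"
        for label, spread in outputs.items()
    )

print(verdict_lines(
    {"Analyst A": -4.0, "Analyst B": -2.5, "Analyst C": -6.0},
    {"Analyst A": "overvalues home favorites by ~1.5 pts",
     "Analyst B": "well-calibrated in this matchup type",
     "Analyst C": "compresses spreads (elite teams ~1 pt too weak)"},
))
```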


Ruling 6: Opus CLI Relay Mechanism

Input — File-Based Briefing:

  1. Orchestrator writes the analyst briefing to a file (e.g., /home/ubuntu/edgeclaw/data/pipeline/analyst-relay/opus_briefing_{date}_{sport}.txt)
  2. Keep input compressed and dense — structured data, not verbose prose
  3. Calibration digest goes first (frames everything), fundamentals second, task instructions last

Output — Structured JSON:

  1. System prompt explicitly states: "Your entire response must be valid JSON matching the schema. No prose before or after."
  2. Output captured to file (e.g., opus_output_{date}_{sport}.json)
  3. Orchestrator parses and feeds into comparison layer
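
Putting the input and output halves together, a minimal sketch of the relay loop; the bridge command itself is a placeholder, since this ruling does not specify the actual invocation:

```python
# Sketch of the file-based relay: briefing file in, JSON file out.
# "opus_cli_bridge" is a placeholder for the real bridge call.
import json
import subprocess
from pathlib import Path

RELAY_DIR = Path("/home/ubuntu/edgeclaw/data/pipeline/analyst-relay")

def relay_blind_opus(briefing_text: str, date: str, sport: str) -> dict:
    briefing_path = RELAY_DIR / f"opus_briefing_{date}_{sport}.txt"
    output_path = RELAY_DIR / f"opus_output_{date}_{sport}.json"
    briefing_path.write_text(briefing_text)

    # Placeholder command -- substitute the actual CLI bridge invocation here.
    subprocess.run(["opus_cli_bridge", str(briefing_path), str(output_path)], check=True)

    return json.loads(output_path.read_text())  # fails loudly if prose leaked around the JSON
```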

Execution:

For the Telegram bridge relay (alternative approach, using the existing Telegram bridge instead of direct CLI):


Summary

| Decision | Ruling |
| --- | --- |
| Learning mechanism | Calibration digests in system prompt — not memory |
| Digest structure | 8-15 corrections with category, direction, magnitude, N, confidence |
| What's included | Both biases AND what they get right |
| Update cadence | Monthly, sample-size triggered (N >= 30-50), human-reviewed |
| First 4-6 weeks | No corrections — establish clean baselines |
| Per-analyst or universal | Per-analyst only — never shared, never universal |
| Opus added | Third blind analyst via CLI bridge (free) |
| Blind vs Verdict Opus | Completely separate sessions, zero contamination |
| Analyst labels in verdict | Randomized (A/B/C) to prevent meta-biases |
| Verdict Opus sees | All 3 analyst outputs + their historical biases + all market data |
| Relay mechanism | File-based: briefing in, JSON out, full audit trail |
| Metrics tracked | 6 Tier 1 (immediate), 5 Tier 2 (at 60 markets), 3 Tier 3 (6 months) |

The analysts are "amnesiac experts" — they wake up every session with no memory of yesterday, but with a refined, mathematically corrected set of calibration notes about their own tendencies. Cold independence is preserved. Systematic improvement is achieved.

Source: ~/edgeclaw/results/panel-results/analyst-learning-ruling.md