Panel Ruling: Analyst Learning Mechanism — Calibration Without Memory

Date: 2026-03-30
Panel: Opus 4.6 (native, max reasoning) · Sonnet 4.6 · Gemini 3.1 Pro Preview · Grok 4.2 Reasoning
Verdict: UNANIMOUS on all questions
Related: analyst-briefing-ruling.md (Blind Analysis Protocol)


Context

The Blind Analysis Protocol (ruled earlier today) establishes that analysts go COLD every session — no memory, no market prices, independent oddsmakers. This ruling addresses the follow-up question: how do blind analysts improve over time if they have no memory of their mistakes?

Additionally, Opus 4.6 is added as a THIRD blind analyst (free via CLI bridge), and the pipeline architecture is updated accordingly.


Core Principle: Calibration Is Not Memory

Unanimous ruling. The distinction is critical:

Calibration corrections are instrument adjustments, not memories. A weather model that knows its temperature predictions run 2 degrees high is not "remembering" past forecasts — it is applying a correction factor derived from history. The model's working state is still fresh every session.
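
To make the distinction concrete, here is a minimal sketch (illustrative numbers and names, not pipeline code) of a correction factor being derived from settled history and then applied to a fresh, memoryless prediction:

```python
# Illustrative only: a correction factor is a statistic derived from settled
# history, not a replay of individual games. The session that uses it never
# sees the underlying records.
from statistics import mean

# (predicted home edge, actual home margin) from past settled games -- hypothetical data
history = [(6.0, 4.0), (3.5, 2.0), (7.0, 6.5), (5.0, 3.0)]

hca_bias = mean(pred - actual for pred, actual in history)  # e.g. +1.5 points too high

def corrected_home_edge(raw_edge: float) -> float:
    """Apply the derived correction to a fresh, memoryless prediction."""
    return raw_edge - hca_bias
```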


Updated Pipeline Architecture

1. DATA COLLECTION — scrapers collect fundamentals daily
2. RESEARCH — Grok + Sonar Pro (parallel, web search + deep research)
3. VALIDATION — 5 rule-based checks + Flash Lite contradiction detection
4. BLIND ANALYSIS (all three run in parallel, all cold, all independent):
   ├─ Sonnet 4.6 (via API)    — own lines + probability curves + evidence + conviction
   ├─ Gemini 3.1 Pro (via API) — own lines + probability curves + evidence + conviction
   └─ Opus 4.6 (via CLI bridge, FREE) — own lines + probability curves + evidence + conviction
5. COMPARISON LAYER — 3 analyst curves vs edge scanner math vs Kalshi prices vs Pinnacle lines
6. VERDICT OPUS (separate CLI session) — sees EVERYTHING, makes final pick + position sizing
7. POST-VERDICT — store predictions, track curves, settlement, calibration
8. WEEKLY — post-mortem runs, data accumulates (no prompt changes)
9. MONTHLY — compile per-analyst calibration digests, human reviews, update system prompts
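
A hypothetical, declarative view of this stage order; the names are illustrative, but it captures the key invariant that nothing upstream of the comparison layer sees market data:

```python
# Hypothetical stage table mirroring the pipeline above (names are
# illustrative; the real orchestrator is not shown in this ruling).
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    cadence: str           # "per-market", "daily", "weekly", "monthly"
    sees_market_data: bool

PIPELINE = [
    Stage("data_collection", "daily", False),
    Stage("research", "per-market", False),
    Stage("validation", "per-market", False),
    Stage("blind_analysis", "per-market", False),   # 3 analysts, parallel, cold
    Stage("comparison_layer", "per-market", True),  # first point where prices appear
    Stage("verdict_opus", "per-market", True),      # separate CLI session
    Stage("post_verdict", "per-market", True),
    Stage("weekly_postmortem", "weekly", True),
    Stage("monthly_calibration", "monthly", True),
]

assert all(not s.sees_market_data for s in PIPELINE[:4]), "nothing upstream of the comparison layer sees prices"
```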

Ruling 1: The Calibration Mechanism — "Bias Calibration Appendix"

Each analyst's system prompt includes a dedicated section at the end titled:

"CALIBRATION NOTES (based on your historical performance)"

This section contains 8-15 numbered calibration corrections. Each follows this format:

[N]. [CATEGORY]: [DIRECTION] by [MAGNITUDE] on average.
     Strongest in: [specific context where bias is worst]
     Weakest in: [context where bias is minimal]
     Sample: N=[count]. Confidence: [HIGH/MEDIUM/LOW].

Example Calibration Appendix:

CALIBRATION NOTES (based on your historical performance):

1. HOME COURT ADVANTAGE: You overestimate home court advantage by 1.3
   points on average.
   Strongest in: games where home team is favored by 5+
   Weakest in: divisional rivalry games (your HCA estimate is accurate there)
   Sample: N=82. Confidence: HIGH.

2. BACK-TO-BACK FATIGUE: You underweight B2B impact by 0.8 points.
   Strongest in: B2B + travel combinations
   Weakest in: home B2Bs (impact is smaller, your estimate is closer)
   Sample: N=34. Confidence: MEDIUM.

3. TOTALS — HIGH PACE: When both teams rank top-10 in pace, your totals
   run 3.5 points too low.
   Strongest in: early-season games with unsettled rotations
   Weakest in: playoff-contender matchups (your pace read improves)
   Sample: N=19. Confidence: MEDIUM.

4. INJURY IMPACT — STAR PLAYERS: You overestimate single star absence
   impact by 1.5 points.
   Strongest in: teams with strong depth (backup absorbs minutes well)
   Weakest in: teams with thin rosters (your estimate is accurate)
   Sample: N=28. Confidence: MEDIUM.

5. CONVICTION CALIBRATION: Your HIGH conviction picks hit at 58% against
   a 65% implied threshold. Reserve HIGH for stronger signals.
   Sample: N=45. Confidence: HIGH.

6. COMPRESSION BIAS: You compress the spread range — elite teams are
   underrated by ~1 point, bottom teams overrated by ~1 point.
   Strongest in: top-5 vs bottom-5 matchups
   Sample: N=40. Confidence: MEDIUM.

7. WELL CALIBRATED — TOTALS IN STANDARD PACE: Your totals estimates
   for average-pace matchups (both teams ranked 10-20 in pace) are
   accurate. Trust your baseline here.
   Sample: N=55. Confidence: HIGH.
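
As a sketch of how such an entry might be generated, assuming a hypothetical record shape for the tracked bias statistics:

```python
# Sketch: render one stored bias record into the numbered appendix format
# above. The record fields are assumptions about what the tracker stores.
def render_correction(i: int, rec: dict) -> str:
    lines = [
        f"{i}. {rec['category']}: {rec['direction']} by {rec['magnitude']} on average.",
        f"     Strongest in: {rec['strongest_in']}",
        f"     Weakest in: {rec['weakest_in']}",
        f"     Sample: N={rec['n']}. Confidence: {rec['confidence']}.",
    ]
    return "\n".join(lines)

example = {
    "category": "HOME COURT ADVANTAGE",
    "direction": "You overestimate home court advantage",
    "magnitude": "1.3 points",
    "strongest_in": "games where home team is favored by 5+",
    "weakest_in": "divisional rivalry games",
    "n": 82,
    "confidence": "HIGH",
}
print(render_correction(1, example))
```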

Key Rules:


Ruling 2: Update Cadence — Monthly, Sample-Size Triggered

Schedule:

Why not weekly:

Weekly variance in sports outcomes is enormous. A spread estimate "wrong" by 4 points might have been perfectly calibrated — the game was an outlier. Weekly updates chase noise and create corrections that reverse the following week.

Minimum sample sizes before a correction is injected:

| Bias Category | Minimum N |
| --- | --- |
| Broad metrics (home court, baseline spread error) | 50 |
| Situational (B2B, rest advantage, altitude) | 30 |
| Conviction calibration | 50 |
| Niche (divisional, conference, time-of-season) | 40 |
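
A minimal sketch of the sample-size gate, using the thresholds above (the category keys are illustrative):

```python
# Sketch of the sample-size gate for injecting corrections.
MIN_N = {
    "broad": 50,        # home court, baseline spread error
    "situational": 30,  # B2B, rest advantage, altitude
    "conviction": 50,
    "niche": 40,        # divisional, conference, time-of-season
}

def correction_eligible(category_type: str, n: int) -> bool:
    """A bias only becomes a calibration note once its sample clears the bar."""
    return n >= MIN_N[category_type]

assert correction_eligible("situational", 34)   # the B2B example above (N=34)
assert not correction_eligible("broad", 42)     # too small to trust yet
```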

First 4-6 weeks: NO CORRECTIONS

Let the analysts run pure and unmodified to establish clean baselines. You need to know each model's natural biases before correcting them; correcting early, on insufficient data, introduces artificial biases that are harder to detect than natural ones.


Ruling 3: Each Analyst Gets Its Own Calibration — Never Shared

Unanimous. Not optional.

Rationale: The entire value of a multi-analyst ensemble is uncorrelated errors. If Sonnet overvalues home court and Gemini undervalues it, a universal correction pushes them toward the same answer, destroying independence. Personalized digests let each model improve its own weaknesses while maintaining distinct analytical personalities.

Secondary benefit: Lets you measure which analyst RESPONDS to calibration. If after two months of corrections Sonnet still overvalues home court by the same amount, that tells you something about the model's resistance to prompt-based recalibration.
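
One possible way to quantify that responsiveness, sketched with hypothetical per-game residuals (predicted minus actual):

```python
# Sketch: does an analyst's residual bias shrink after its correction is injected?
from statistics import mean

def responsiveness(pre_residuals: list[float], post_residuals: list[float]) -> float:
    """Fraction of the pre-correction bias that disappeared afterward."""
    pre, post = mean(pre_residuals), mean(post_residuals)
    return 1.0 - abs(post) / abs(pre) if pre else 0.0

# e.g. one analyst's home-court residuals before vs after the month-1 digest
print(responsiveness([1.4, 1.1, 1.6, 1.2], [0.5, 0.2, 0.7, 0.4]))  # ~0.66 -> responsive
```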


Ruling 4: Calibration Metrics to Track

Tier 1 — Track immediately, feed back when N thresholds met:

| Metric | What It Catches |
| --- | --- |
| Mean spread error by team quality tier (elite/good/average/bad) | Compression bias — models make elite teams too weak, bad teams too strong |
| Mean total error by combined pace rank | Pace interaction errors — where totals go most wrong |
| Home court advantage residual (predicted HCA minus actual) | Single most common LLM bias in sports |
| Conviction calibration curve (% correct at each conviction tier) | Whether the conviction system is meaningful or noise |
| Rest/fatigue bias (B2B, 3-in-4, travel + B2B combos) | Models either overweight or underweight fatigue |
| Injury impact error (star absence, depth absence) | Models overweight star injuries, underweight depth injuries |

Tier 2 — Track once 60+ markets accumulated:

| Metric | What It Catches |
| --- | --- |
| Divisional/rivalry game bias | Narrative overweighting |
| Conference strength bias | Systematically over- or underrating a conference |
| Time-of-season bias | Late-season tanking, rest management, playoff motivation |
| Altitude/venue factor error | Small sample, but persistent if real |
| Marquee game bias | Drift toward public consensus on high-attention games |

Tier 3 — Track after 6 months:

| Metric | What It Catches |
| --- | --- |
| Error drift over time | Is the model's baseline shifting as its training data ages? |
| Variance stability | Is the model getting more or less consistent? |
| Cross-analyst correlation | Are Sonnet and Gemini making correlated errors? If so, the independence assumption is weakening |
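
As an example of a Tier 1 computation, here is a sketch of the conviction calibration curve over hypothetical pick records:

```python
# Sketch of one Tier 1 metric: the conviction calibration curve
# (% correct at each conviction tier). Records are hypothetical.
from collections import defaultdict

def conviction_curve(picks: list[dict]) -> dict[str, float]:
    """picks: [{'conviction': 'HIGH'|'MEDIUM'|'LOW', 'correct': bool}, ...]"""
    hits, totals = defaultdict(int), defaultdict(int)
    for p in picks:
        totals[p["conviction"]] += 1
        hits[p["conviction"]] += p["correct"]
    return {tier: hits[tier] / totals[tier] for tier in totals}

picks = [{"conviction": "HIGH", "correct": True},
         {"conviction": "HIGH", "correct": False},
         {"conviction": "LOW", "correct": True}]
print(conviction_curve(picks))  # {'HIGH': 0.5, 'LOW': 1.0}
```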

Do NOT track or feed back:


Ruling 5: Opus as Third Blind Analyst

Opus 4.6 runs through the CLI bridge at zero API cost. It serves as a third independent blind analyst alongside Sonnet and Gemini.

Design Requirements:

  1. Blind Opus and Verdict Opus are COMPLETELY SEPARATE roles. Different CLI sessions. Different system prompts. Zero shared context. No session-level contamination.

  2. Blind Opus runs FIRST. Output is captured, frozen, and the session ends. Verdict Opus runs in a completely separate session afterward.

  3. Blind Opus gets the same briefing format as Sonnet and Gemini: fundamentals only, no prices, no market data. Plus its own personalized calibration appendix.

  4. Blind Opus's calibration is tracked independently. Its track record generates its own digest. Its performance is evaluated separately from its verdict role.

  5. Randomize analyst labels in the verdict prompt. Present the three analyst outputs as "Analyst A, B, C" (randomized each session) so Verdict Opus does not develop meta-biases about which model to trust. This forces the verdict layer to evaluate reasoning quality, not source identity.
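
A minimal sketch of the label randomization, assuming the orchestrator keeps the mapping for settlement but never exposes it to Verdict Opus:

```python
# Sketch of label randomization before the verdict prompt is assembled.
import random

def anonymize(outputs: dict[str, dict]) -> tuple[dict[str, dict], dict[str, str]]:
    """Map model names to 'Analyst A/B/C' in a fresh random order each session."""
    models = list(outputs)
    random.shuffle(models)
    labels = [f"Analyst {chr(ord('A') + i)}" for i in range(len(models))]
    mapping = dict(zip(labels, models))   # kept for settlement, never shown to Verdict Opus
    return {label: outputs[model] for label, model in mapping.items()}, mapping

blinded, key = anonymize({"sonnet": {"spread": -4.0},
                          "gemini": {"spread": -2.5},
                          "opus":   {"spread": -6.0}})
```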

Verdict Opus Bonus — Historical Bias Awareness:

Feed Verdict Opus the calibration data of all three blind analysts. Example: "Analyst A predicts -4, but historically overvalues home favorites by 1.5. Analyst B predicts -2.5, well-calibrated in this matchup type. Analyst C predicts -6 with HIGH conviction but tends to compress spreads." This makes the verdict layer extremely powerful — it can adjust each analyst's output based on known biases before synthesizing the final line.
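
A sketch of how those bias annotations might be attached to the anonymized outputs before they reach Verdict Opus (all names and numbers are illustrative):

```python
# Sketch: annotate each anonymized analyst line with its calibration summary
# before Verdict Opus sees it. All names and numbers are illustrative.
def verdict_lines(outputs: dict[str, float], biases: dict[str, str]) -> str:
    return "\n".join(
        f"{label}: predicts {spread:+.1f} | historical tendency: {biases.get(label, 'none on file')}"
        for label, spread in outputs.items()
    )

print(verdict_lines(
    {"Analyst A": -4.0, "Analyst B": -2.5, "Analyst C": -6.0},
    {"Analyst A": "overvalues home favorites by ~1.5 pts",
     "Analyst B": "well-calibrated in this matchup type",
     "Analyst C": "compresses spreads (elite teams ~1 pt too weak)"},
))
```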


Ruling 6: Opus CLI Relay Mechanism

Input — File-Based Briefing:

  1. Orchestrator writes the analyst briefing to a file (e.g., /home/ubuntu/edgeclaw/data/pipeline/analyst-relay/opus_briefing_{date}_{sport}.txt)
  2. Keep input compressed and dense — structured data, not verbose prose
  3. Calibration digest goes first (frames everything), fundamentals second, task instructions last

Output — Structured JSON:

  1. System prompt explicitly states: "Your entire response must be valid JSON matching the schema. No prose before or after."
  2. Output captured to file (e.g., opus_output_{date}_{sport}.json)
  3. Orchestrator parses and feeds into comparison layer
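
Putting the input and output halves together, a minimal sketch of the relay loop; the bridge command itself is a placeholder, since this ruling does not specify the actual invocation:

```python
# Sketch of the file-based relay: briefing file in, JSON file out.
# "opus_cli_bridge" is a placeholder for the real bridge call.
import json
import subprocess
from pathlib import Path

RELAY_DIR = Path("/home/ubuntu/edgeclaw/data/pipeline/analyst-relay")

def relay_blind_opus(briefing_text: str, date: str, sport: str) -> dict:
    briefing_path = RELAY_DIR / f"opus_briefing_{date}_{sport}.txt"
    output_path = RELAY_DIR / f"opus_output_{date}_{sport}.json"
    briefing_path.write_text(briefing_text)

    # Placeholder command -- substitute the actual CLI bridge invocation here.
    subprocess.run(["opus_cli_bridge", str(briefing_path), str(output_path)], check=True)

    return json.loads(output_path.read_text())  # fails loudly if prose leaked around the JSON
```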

Execution:

For the Telegram bridge relay (alternative approach, using the existing Telegram bridge instead of direct CLI):


Summary

| Decision | Ruling |
| --- | --- |
| Learning mechanism | Calibration digests in system prompt — not memory |
| Digest structure | 8-15 corrections with category, direction, magnitude, N, confidence |
| What's included | Both biases AND what they get right |
| Update cadence | Monthly, sample-size triggered (N >= 30-50), human-reviewed |
| First 4-6 weeks | No corrections — establish clean baselines |
| Per-analyst or universal | Per-analyst only — never shared, never universal |
| Opus added | Third blind analyst via CLI bridge (free) |
| Blind vs Verdict Opus | Completely separate sessions, zero contamination |
| Analyst labels in verdict | Randomized (A/B/C) to prevent meta-biases |
| Verdict Opus sees | All 3 analyst outputs + their historical biases + all market data |
| Relay mechanism | File-based: briefing in, JSON out, full audit trail |
| Metrics tracked | 6 Tier 1 (immediate), 5 Tier 2 (at 60 markets), 3 Tier 3 (6 months) |

The analysts are "amnesiac experts" — they wake up every session with no memory of yesterday, but with a refined, mathematically corrected set of calibration notes about their own tendencies. Cold independence is preserved. Systematic improvement is achieved.

Source: ~/edgeclaw/results/panel-results/analyst-learning-ruling.md