Date: 2026-03-30
Panel: Opus 4.6 (native, max reasoning) · Sonnet 4.6 · Gemini 3.1 Pro Preview · Grok 4.2 Reasoning
Verdict: UNANIMOUS on all questions
Related: analyst-briefing-ruling.md (Blind Analysis Protocol)
The Blind Analysis Protocol (ruled earlier today) establishes that analysts go COLD every session — no memory, no market prices, independent oddsmakers. This ruling addresses the follow-up question: how do blind analysts improve over time if they have no memory of their mistakes?
Additionally, Opus 4.6 is added as a THIRD blind analyst (free via CLI bridge), and the pipeline architecture is updated accordingly.
Unanimous ruling. The distinction is critical:
Calibration corrections are instrument adjustments, not memories. A weather model that knows its temperature predictions run 2 degrees high is not "remembering" past forecasts — it is applying a correction factor derived from history. The model's working state is still fresh every session.
1. DATA COLLECTION — scrapers collect fundamentals daily
2. RESEARCH — Grok + Sonar Pro (parallel, web search + deep research)
3. VALIDATION — 5 rule-based checks + Flash Lite contradiction detection
4. BLIND ANALYSIS (all three run in parallel, all cold, all independent):
├─ Sonnet 4.6 (via API) — own lines + probability curves + evidence + conviction
├─ Gemini 3.1 Pro (via API) — own lines + probability curves + evidence + conviction
└─ Opus 4.6 (via CLI bridge, FREE) — own lines + probability curves + evidence + conviction
5. COMPARISON LAYER — 3 analyst curves vs edge scanner math vs Kalshi prices vs Pinnacle lines
6. VERDICT OPUS (separate CLI session) — sees EVERYTHING, makes final pick + position sizing
7. POST-VERDICT — store predictions, track curves, settlement, calibration
8. WEEKLY — post-mortem runs, data accumulates (no prompt changes)
9. MONTHLY — compile per-analyst calibration digests, human reviews, update system prompts
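The daily stages run as one job, with the blind-analysis fan-out in step 4 keeping all three analysts cold and independent. A minimal Python sketch of that fan-out, using stub analysts — all function and dict names here are hypothetical illustrations, not the actual pipeline code:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stub analysts: each receives the same cold briefing and
# returns its own line + conviction. In the real pipeline these are LLM
# calls (two via API, one via CLI bridge).
def sonnet(briefing):  return {"line": -4.0, "conviction": "HIGH"}
def gemini(briefing):  return {"line": -2.5, "conviction": "MEDIUM"}
def opus(briefing):    return {"line": -6.0, "conviction": "HIGH"}

BLIND_ANALYSTS = {"sonnet": sonnet, "gemini": gemini, "opus": opus}

def run_blind_stage(briefing):
    """Step 4: run all three analysts in parallel, cold and independent.
    The briefing contains fundamentals only -- no prices, no market data."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {name: pool.submit(fn, briefing)
                   for name, fn in BLIND_ANALYSTS.items()}
        return {name: f.result() for name, f in futures.items()}

outputs = run_blind_stage({"fundamentals": "..."})
print(sorted(outputs))  # ['gemini', 'opus', 'sonnet']
```

The parallel fan-out matters less for speed than for discipline: no analyst's output can leak into another's context, because each call starts from the same frozen briefing.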
Each analyst's system prompt includes a dedicated section at the end titled:
"CALIBRATION NOTES (based on your historical performance)"
This section contains 8-15 numbered calibration corrections. Each follows this format:
[N]. [CATEGORY]: [DIRECTION] by [MAGNITUDE] on average.
Strongest in: [specific context where bias is worst]
Weakest in: [context where bias is minimal]
Sample: N=[count]. Confidence: [HIGH/MEDIUM/LOW].
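One way to carry corrections in code and render them into the prompt template above — a sketch; the `CalibrationNote` class and its field names are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical in-code representation of one calibration correction;
# fields mirror the digest template above.
@dataclass
class CalibrationNote:
    category: str      # e.g. "HOME COURT ADVANTAGE"
    direction: str     # e.g. "You overestimate home court advantage"
    magnitude: str     # e.g. "1.3 points"
    strongest_in: str
    weakest_in: str
    n: int
    confidence: str    # HIGH / MEDIUM / LOW

    def render(self, idx: int) -> str:
        """Emit the correction in the exact digest format."""
        return (
            f"{idx}. {self.category}: {self.direction} by {self.magnitude} on average.\n"
            f"   Strongest in: {self.strongest_in}\n"
            f"   Weakest in: {self.weakest_in}\n"
            f"   Sample: N={self.n}. Confidence: {self.confidence}."
        )

note = CalibrationNote(
    category="HOME COURT ADVANTAGE",
    direction="You overestimate home court advantage",
    magnitude="1.3 points",
    strongest_in="games where home team is favored by 5+",
    weakest_in="divisional rivalry games",
    n=82,
    confidence="HIGH",
)
print(note.render(1))
```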
CALIBRATION NOTES (based on your historical performance):
1. HOME COURT ADVANTAGE: You overestimate home court advantage by 1.3
points on average.
Strongest in: games where home team is favored by 5+
Weakest in: divisional rivalry games (your HCA estimate is accurate there)
Sample: N=82. Confidence: HIGH.
2. BACK-TO-BACK FATIGUE: You underweight B2B impact by 0.8 points.
Strongest in: B2B + travel combinations
Weakest in: home B2Bs (impact is smaller, your estimate is closer)
Sample: N=34. Confidence: MEDIUM.
3. TOTALS — HIGH PACE: When both teams rank top-10 in pace, your totals
run 3.5 points too low.
Strongest in: early-season games with unsettled rotations
Weakest in: playoff-contender matchups (your pace read improves)
Sample: N=19. Confidence: MEDIUM.
4. INJURY IMPACT — STAR PLAYERS: You overestimate single star absence
impact by 1.5 points.
Strongest in: teams with strong depth (backup absorbs minutes well)
Weakest in: teams with thin rosters (your estimate is accurate)
Sample: N=28. Confidence: MEDIUM.
5. CONVICTION CALIBRATION: Your HIGH conviction picks hit at 58% against
a 65% implied threshold. Reserve HIGH for stronger signals.
Sample: N=45. Confidence: HIGH.
6. COMPRESSION BIAS: You compress the spread range — elite teams are
underrated by ~1 point, bottom teams overrated by ~1 point.
Strongest in: top-5 vs bottom-5 matchups
Sample: N=40. Confidence: MEDIUM.
7. WELL CALIBRATED — TOTALS IN STANDARD PACE: Your totals estimates
for average-pace matchups (both teams ranked 10-20 in pace) are
accurate. Trust your baseline here.
Sample: N=55. Confidence: HIGH.
Weekly variance in sports outcomes is enormous. A spread estimate "wrong" by 4 points might have been perfectly calibrated — the game was an outlier. Weekly updates chase noise and create corrections that reverse the following week.
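The noise floor is easy to quantify. Assuming per-game spread residuals on the order of 13 points — an illustrative ballpark, not a measured figure from this pipeline — the standard error of a measured "bias" shrinks only with sample size:

```python
import math

# Why weekly updates chase noise: with per-game spread residuals of
# roughly SIGMA points (illustrative assumption), the standard error of
# an estimated mean bias is SIGMA / sqrt(N).
SIGMA = 13.0

def std_error_of_mean_bias(n_games: int) -> float:
    return SIGMA / math.sqrt(n_games)

print(f"SE after 5 games:  {std_error_of_mean_bias(5):.1f} pts")   # ~5.8
print(f"SE after 50 games: {std_error_of_mean_bias(50):.1f} pts")  # ~1.8
```

After one week of picks in a category, a measured 4-point "bias" is well within one standard error of zero; after 50 games it would be a genuine signal. This is the arithmetic behind the monthly cadence.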
| Bias Category | Minimum N |
|---|---|
| Broad metrics (home court, baseline spread error) | 50 |
| Situational (B2B, rest advantage, altitude) | 30 |
| Conviction calibration | 50 |
| Niche (divisional, conference, time-of-season) | 40 |
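A minimal gate enforcing those thresholds — the category keys and function name are hypothetical:

```python
# Hypothetical gate for the minimum-N table above: a measured bias only
# becomes a digest correction once its sample crosses the threshold.
MIN_N = {
    "broad": 50,        # home court, baseline spread error
    "situational": 30,  # B2B, rest advantage, altitude
    "conviction": 50,   # conviction calibration
    "niche": 40,        # divisional, conference, time-of-season
}

def correction_eligible(category: str, n: int) -> bool:
    return n >= MIN_N[category]

print(correction_eligible("broad", 82))        # True  -- enough samples
print(correction_eligible("situational", 29))  # False -- one short of the bar
```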
Let analysts run pure and unmodified to establish clean baselines. You need to know each model's natural biases before correcting them. Correcting early, on insufficient data, introduces artificial biases that are harder to detect than the natural ones.
Unanimous. Not optional.
Rationale: The entire value of a multi-analyst ensemble is uncorrelated errors. If Sonnet overvalues home court and Gemini undervalues it, a universal correction pushes them toward the same answer, destroying independence. Personalized digests let each model improve its own weaknesses while maintaining distinct analytical personalities.
Secondary benefit: Lets you measure which analyst RESPONDS to calibration. If after two months of corrections Sonnet still overvalues home court by the same amount, that tells you something about the model's resistance to prompt-based recalibration.
| Metric | What It Catches |
|---|---|
| Mean spread error by team quality tier (elite/good/average/bad) | Compression bias — models make elite teams too weak, bad teams too strong |
| Mean total error by combined pace rank | Pace interaction errors — where totals go most wrong |
| Home court advantage residual (predicted HCA minus actual) | Single most common LLM bias in sports |
| Conviction calibration curve (% correct at each conviction tier) | Whether conviction system is meaningful or noise |
| Rest/fatigue bias (B2B, 3-in-4, travel + B2B combos) | Models either overweight or underweight fatigue |
| Injury impact error (star absence, depth absence) | Models overweight star injuries, underweight depth |
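The conviction calibration curve is straightforward to compute from settled picks. A sketch with toy data (the pick record is invented for illustration):

```python
from collections import defaultdict

# Hit rate per conviction tier, computed from settled picks. Compared
# against the implied threshold (e.g. 65% for HIGH), this shows whether
# the conviction system is meaningful or noise.
def conviction_curve(picks):
    """picks: list of (conviction_tier, won: bool) for settled markets."""
    tallies = defaultdict(lambda: [0, 0])  # tier -> [wins, total]
    for tier, won in picks:
        tallies[tier][1] += 1
        tallies[tier][0] += int(won)
    return {tier: wins / total for tier, (wins, total) in tallies.items()}

# Toy record: HIGH picks hitting 3 of 5, MEDIUM 2 of 4.
picks = [("HIGH", True), ("HIGH", True), ("HIGH", False),
         ("HIGH", True), ("HIGH", False),
         ("MEDIUM", True), ("MEDIUM", False),
         ("MEDIUM", True), ("MEDIUM", False)]
print(conviction_curve(picks))  # {'HIGH': 0.6, 'MEDIUM': 0.5}
```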
| Metric | What It Catches |
|---|---|
| Divisional/rivalry game bias | Narrative overweighting |
| Conference strength bias | Systematically over/underrating a conference |
| Time-of-season bias | Late-season tanking, rest management, playoff motivation |
| Altitude/venue factor error | Small sample but persistent if real |
| Marquee game bias | Drift toward public consensus on high-attention games |
| Metric | What It Catches |
|---|---|
| Error drift over time | Is the model's baseline shifting as training data ages? |
| Variance stability | Getting more or less consistent? |
| Cross-analyst correlation | Are Sonnet and Gemini making correlated errors? If yes, independence assumption is weakening |
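The cross-analyst check reduces to a Pearson correlation over the analysts' signed errors. A self-contained sketch — the error series here are invented for illustration:

```python
import math

# Pearson correlation of two analysts' signed spread errors. A high
# value means the ensemble's uncorrelated-errors assumption is weakening.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative error series (predicted minus actual, in points):
sonnet_errors = [1.5, -0.5, 2.0, -1.0, 0.5]
gemini_errors = [1.0, -1.0, 1.5, -0.5, 0.0]
r = pearson(sonnet_errors, gemini_errors)
print(f"cross-analyst error correlation: {r:.2f}")
```

A correlation drifting toward 1.0 over months would mean the two models fail in the same places, and the ensemble adds less than its analyst count suggests.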
Opus 4.6 runs through the CLI bridge at zero API cost. It serves as a third independent blind analyst alongside Sonnet and Gemini.
Blind Opus and Verdict Opus are COMPLETELY SEPARATE roles. Different CLI sessions. Different system prompts. Zero shared context. No session-level contamination.
Blind Opus runs FIRST. Output is captured, frozen, and the session ends. Verdict Opus runs in a completely separate session afterward.
Blind Opus gets the same briefing format as Sonnet and Gemini: fundamentals only, no prices, no market data. Plus its own personalized calibration appendix.
Blind Opus's calibration is tracked independently. Its track record generates its own digest. Its performance is evaluated separately from its verdict role.
Randomize analyst labels in the verdict prompt. Present the three analyst outputs as "Analyst A, B, C" (randomized each session) so Verdict Opus does not develop meta-biases about which model to trust. This forces the verdict layer to evaluate reasoning quality, not source identity.
Feed Verdict Opus the calibration data of all three blind analysts. Example: "Analyst A predicts -4, but historically overvalues home favorites by 1.5. Analyst B predicts -2.5, well-calibrated in this matchup type. Analyst C predicts -6 with HIGH conviction but tends to compress spreads." This makes the verdict layer extremely powerful — it can adjust each analyst's output based on known biases before synthesizing the final line.
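Label randomization and calibration annotation might be assembled like this — a sketch in which all names and example data are illustrative:

```python
import random

# Sketch of verdict-prompt assembly: analyst identities are shuffled into
# anonymous labels each session, and each output is annotated with that
# analyst's known calibration notes. Example data is illustrative.
def build_verdict_inputs(outputs, calibration, rng=random):
    """outputs: {analyst_name: prediction}; calibration: {analyst_name: note}."""
    names = list(outputs)
    rng.shuffle(names)  # fresh A/B/C mapping every session
    labeled = {}
    for label, name in zip("ABC", names):
        labeled[f"Analyst {label}"] = {
            "prediction": outputs[name],
            "known_biases": calibration[name],
        }
    return labeled  # Verdict Opus never sees the underlying model names

inputs = build_verdict_inputs(
    {"sonnet": "-4", "gemini": "-2.5", "opus": "-6 (HIGH conviction)"},
    {"sonnet": "overvalues home favorites by 1.5",
     "gemini": "well-calibrated in this matchup type",
     "opus": "tends to compress spreads"},
)
```

Because the shuffle happens per session, any meta-bias Verdict Opus forms about "Analyst A" decays instead of compounding.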
The relay is file-based, leaving a full audit trail: the briefing is written to `/home/ubuntu/edgeclaw/data/pipeline/analyst-relay/opus_briefing_{date}_{sport}.txt`, and Blind Opus's output is captured as `opus_output_{date}_{sport}.json`.

If using the existing Telegram bridge instead of direct CLI:
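A minimal sketch of the direct file relay (not the Telegram variant); the function name, timeout, and polling interval are illustrative choices, not part of the ruling:

```python
import json
import time
from pathlib import Path

RELAY_DIR = Path("/home/ubuntu/edgeclaw/data/pipeline/analyst-relay")

def relay_to_blind_opus(date: str, sport: str, briefing: str,
                        relay_dir: Path = RELAY_DIR,
                        timeout_s: float = 600.0, poll_s: float = 5.0) -> dict:
    """Write the briefing file, then poll until the Blind Opus session
    writes its JSON output back. Both files persist as the audit trail."""
    (relay_dir / f"opus_briefing_{date}_{sport}.txt").write_text(briefing)
    out_path = relay_dir / f"opus_output_{date}_{sport}.json"
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if out_path.exists():
            return json.loads(out_path.read_text())  # frozen analyst output
        time.sleep(poll_s)
    raise TimeoutError(f"no Blind Opus output for {date}/{sport}")
```

Once the JSON is read, the analyst session ends; Verdict Opus later consumes the frozen file rather than any live context, which is what guarantees zero contamination between the two roles.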
| Decision | Ruling |
|---|---|
| Learning mechanism | Calibration digests in system prompt — not memory |
| Digest structure | 8-15 corrections with category, direction, magnitude, N, confidence |
| What's included | Both biases AND what they get right |
| Update cadence | Monthly, sample-size triggered (N >= 30-50), human-reviewed |
| First 4-6 weeks | No corrections — establish clean baselines |
| Per-analyst or universal | Per-analyst only — never shared, never universal |
| Opus added | Third blind analyst via CLI bridge (free) |
| Blind vs Verdict Opus | Completely separate sessions, zero contamination |
| Analyst labels in verdict | Randomized (A/B/C) to prevent meta-biases |
| Verdict Opus sees | All 3 analyst outputs + their historical biases + all market data |
| Relay mechanism | File-based: briefing in, JSON out, full audit trail |
| Metrics tracked | 6 Tier 1 (immediate), 5 Tier 2 (at 60 markets), 3 Tier 3 (6 months) |
The analysts are "amnesiac experts" — they wake up every session with no memory of yesterday, but with a refined, mathematically corrected set of calibration notes about their own tendencies. Cold independence is preserved. Systematic improvement is achieved.