Panel Ruling: Analyst Briefing Design — Blind Analysis Protocol

Date: 2026-03-30
Panel: Opus 4.6 (native, max reasoning) · Sonnet 4.6 · Gemini 3.1 Pro Preview · Grok 4.2 Reasoning
Version: 2 (updated with no-pass constraint + boss clarifications)
Verdict: UNANIMOUS on core questions, boss override on thresholds + steam

Context

The desk analyst pipeline uses two independent AI analysts (Sonnet 4.6 + Gemini 3.1 Pro) to produce independent lines, probabilities, and analysis for sports betting markets on Kalshi. The edge scanner (pure math engine) finds mathematical edges by de-vigging Pinnacle lines and comparing against Kalshi prices. The analysts act as independent oddsmakers — they set their own lines from scratch using fundamental data only. Opus sees everything at the end and makes the final call.
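
The de-vig-and-compare step performed by the edge scanner can be sketched as follows. This is a minimal illustration, not the scanner's actual code: it assumes a two-way market quoted in American odds, proportional (multiplicative) vig removal, and a Kalshi YES price quoted in cents. The function names and the sample numbers are hypothetical.

```python
def american_to_implied(odds: int) -> float:
    """Convert American odds to the book's implied probability (vig included)."""
    if odds < 0:
        return -odds / (-odds + 100)
    return 100 / (odds + 100)


def devig_two_way(odds_a: int, odds_b: int) -> tuple[float, float]:
    """Remove the vig by proportional normalization so both sides sum to 1.

    Assumption: proportional normalization. The source does not say which
    de-vig method the edge scanner uses.
    """
    raw_a = american_to_implied(odds_a)
    raw_b = american_to_implied(odds_b)
    overround = raw_a + raw_b
    return raw_a / overround, raw_b / overround


def edge_vs_kalshi(fair_prob: float, kalshi_price_cents: float) -> float:
    """Raw edge: fair probability minus the Kalshi YES price as a probability."""
    return fair_prob - kalshi_price_cents / 100


# Hypothetical numbers, for illustration only.
fair_a, fair_b = devig_two_way(-110, -110)
edge = edge_vs_kalshi(fair_a, 47)
```

A positive `edge` means the de-vigged Pinnacle probability is higher than the Kalshi price implies. Fees and spread (the "net edge") are not modeled here.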

We are in model-building mode. No real money is at risk. Every market gets a pick from every layer. We are tracking performance to measure accuracy, calibration, and model improvement over time.

Pipeline Architecture

1. DATA COLLECTION — free scrapers collect hard facts daily
2. RESEARCH — Grok + Sonar Pro (parallel, web search + deep research)
3. VALIDATION — 5 rule-based checks + Flash Lite contradiction detection
4. BLIND ANALYSIS — Sonnet 4.6 + Gemini 3.1 Pro (independent, cold, no memory)
   - Each sets own lines (spread, total, ML, player props) from scratch
   - Each provides implied probabilities + evidence + conviction
   - Neither sees market prices, each other, or prior sessions
   - System builds probability curves from each analyst output
5. COMPARISON LAYER — analyst curves vs edge scanner math vs Kalshi vs Pinnacle
6. OPUS VERDICT — sees EVERYTHING, makes final pick + position sizing on every market
7. POST-VERDICT — store predictions, track curves, settlement, calibration

Ruling 1: Complete Price Blackout for Analysts

UNANIMOUS + Boss confirmed: Hide ALL prices and lines. No exceptions.

Analysts see NONE of the following:

Pinnacle lines, spreads, totals, moneylines, or odds
Kalshi contract prices, thresholds, or alternative line numbers
Edge scanner model probabilities, implied probabilities, or edge calculations
De-vigged probabilities from any source
Any number that originates from a market or model prediction
Sharp money signals, line movement, steam, cash/ticket percentages — NOTHING market-derived

The analysts do not see what the contract thresholds are. They are not told "price BOS -7.5" — they are told "set your own spread for Celtics vs 76ers." They build the line from scratch.

Rationale: Anchoring is not a risk — it is a certainty. AI models anchor on any number shown to them. The spread IS a price. Even showing "-6.5" without odds tells the analyst the market thinks Boston wins by ~7. The entire value of the analyst layer is independence. If they adjust from market prices, we are paying for a redundant signal.

Ruling 2: Sharp Money / Steam — EXCLUDED ENTIRELY

Boss override: No sharp money, no steam, no line movement at the analyst level.

Sharp money, line movement direction, cash/ticket divergence, reverse line movement — all excluded from analyst briefings entirely. This data goes directly to the Opus layer where it serves as a confirming/disconfirming signal alongside the analyst reports.

Ruling 3: Analyst Briefing Contents — Fundamentals Only

The analyst briefing contains ONLY raw performance data and situational facts. Nothing market-derived.

A. Game Context

Teams (full names, conference, division)
Date, time, venue (home/away)
Season records: overall W-L, last 10, home record, away record
Conference/division standing, playoff implications (locked in? fighting? eliminated?)
Head-to-head results this season (scores, not lines)
Recent form: last 5 game results with scores

B. Team Performance Metrics

Offensive Rating (points per 100 possessions) — season + last 10
Defensive Rating (points per 100 possessions) — season + last 10
Net Rating — season + last 10
Pace (possessions per game) — season + last 10
Home/away offensive and defensive splits
True shooting %, opponent true shooting %
Turnover rate, rebound rate, free throw rate

C. Situational Factors

Rest days for each team since last game
Back-to-back status + historical B2B performance record this season
Schedule density (games in last 7 days, upcoming 7 days)
Travel distance/timezone considerations
Altitude flag (Denver games for totals)

D. Personnel

Full injury report with status (OUT, DOUBTFUL, QUESTIONABLE, PROBABLE)
Key players: per-game stats (PPG, RPG, APG, usage rate) so analyst can estimate impact
Team record/net rating WITH vs WITHOUT injured player this season
Expected starting lineup if available
Notable rotation changes

E. Referee Data

Assigned referee crew
Crew season averages: total points in games officiated, foul rate per 48
Home team win rate under this crew
Over/under tendency (crew games vs league average total)

F. Market Assignment

Which games to analyze
Which market types to analyze per game (Spread, Total, ML, Player Props)
NO thresholds, NO lines, NO prices — just "set your own line for each market type"

EXPLICITLY EXCLUDED:

All market prices, lines, odds, spreads, totals from any source
All third-party model predictions
Sharp money, line movement, steam, cash/ticket data
Edge scanner output of any kind
Previous analyst reports from prior sessions
Consensus picks, expert picks, narrative commentary

Guiding principle: The analyst sees what an oddsmaker would have on day one — raw data, no market reference.
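
One way to enforce the blackout in code is an allowlist of briefing sections plus a fail-closed scan for known market-derived keys. The sketch below is an assumption about how such a guard could look. The section names mirror A through F above, but every key name in it is hypothetical.

```python
from typing import Any

# Hypothetical top-level section names, mirroring briefing sections A through F.
ALLOWED_SECTIONS = frozenset({
    "game_context", "team_metrics", "situational_factors",
    "personnel", "referee_data", "market_assignment",
})

# Hypothetical exact key names for market-derived data.
FORBIDDEN_KEYS = frozenset({
    "pinnacle_line", "kalshi_price", "devigged_prob", "edge",
    "line_movement", "steam", "ticket_pct", "model_prediction",
})


def find_forbidden(node: Any, path: str = "") -> list[str]:
    """Return the path of every forbidden key found anywhere in the structure."""
    hits: list[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}.{key}" if path else key
            if key in FORBIDDEN_KEYS:
                hits.append(here)
            hits.extend(find_forbidden(value, here))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            hits.extend(find_forbidden(item, f"{path}[{i}]"))
    return hits


def build_analyst_briefing(raw: dict[str, Any]) -> dict[str, Any]:
    """Keep only allowlisted sections, then fail closed if market data slipped in."""
    briefing = {k: v for k, v in raw.items() if k in ALLOWED_SECTIONS}
    leaks = find_forbidden(briefing)
    if leaks:
        raise ValueError(f"market-derived fields in analyst briefing: {leaks}")
    return briefing
```

The allowlist does most of the work: anything not explicitly permitted is dropped. The forbidden-key scan is a second line of defense for data nested inside an allowed section, and it raises rather than silently stripping so that a leak is noticed upstream.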

Ruling 4: Third-Party Model Predictions — HIDDEN

Unanimous: Do not show Sagarin, DRatings, Dimers, ESPN BPI, Massey, GameSim, or Prediction Tracker to analysts.

These are other models' opinions. Showing them creates anchoring and turns analysts into aggregators rather than independent oddsmakers.

Valid use: Collect and track against outcomes over 2-3 months. If a model consistently beats Pinnacle implied probabilities, integrate that signal at the Opus layer — never at the analyst layer.

Ruling 5: Analyst Output — What They Must Produce

Every analyst produces the following for EVERY game assigned:

{
  "game_id": "2026-03-30_BOS_PHI",
  "analyst": "Sonnet-4.6",
  "timestamp": "2026-03-30T14:23:00Z",

  "game_summary": "2-3 sentence matchup assessment from fundamentals",

  "spread_analysis": {
    "predicted_winner": "BOS",
    "predicted_margin": 7.0,
    "confidence_band": [4.5, 9.5],
    "implied_probability_curve": {
      "win_by_1_plus": 0.74,
      "win_by_4_plus": 0.62,
      "win_by_7_plus": 0.48,
      "win_by_10_plus": 0.33,
      "win_by_14_plus": 0.18,
      "win_by_20_plus": 0.06
    },
    "conviction": 4,
    "conviction_reasoning": "Boston 5.2 net rating advantage amplified at home. PHI missing Embiid — team is -4.3 net rating without him this season.",
    "evidence": [
      "BOS net rating +5.2 vs PHI -1.3 = 6.5 point fundamental gap",
      "BOS home court adds +3.1 to net rating historically",
      "PHI 12-18 without Embiid, -4.3 net rating differential",
      "Referee crew averages 2.1 more fouls/48 — deeper BOS roster benefits"
    ]
  },

  "total_analysis": {
    "predicted_total": 219.0,
    "confidence_band": [213.0, 225.0],
    "implied_probability_curve": {
      "over_205": 0.91,
      "over_210": 0.80,
      "over_215": 0.65,
      "over_220": 0.45,
      "over_225": 0.28,
      "over_230": 0.13
    },
    "conviction": 3,
    "conviction_reasoning": "Both teams near league average pace. No strong factors pushing over or under.",
    "evidence": [
      "BOS pace 99.2, PHI pace 98.8 — both middle of pack",
      "BOS scores 114.2 PPG at home, allows 107.1",
      "PHI scores 108.5 on road, allows 112.3"
    ]
  },

  "moneyline_analysis": {
    "predicted_winner": "BOS",
    "win_probability": 0.72,
    "opponent_win_probability": 0.28,
    "conviction": 4,
    "conviction_reasoning": "Clear talent + situational advantage.",
    "evidence": [
      "Net rating differential + home court + injury advantage",
      "BOS 28-8 at home this season"
    ]
  },

  "upset_scenario": "Philadelphia wins if Maxey scores 35+ and BOS shoots below 33% from three. PHI transition offense without Embiid is actually faster. BOS complacency in a seemingly easy matchup is the risk.",

  "key_uncertainties": [
    "If Embiid is upgraded to active, margin estimate drops to 3-4",
    "BOS has been coasting in late season — effort level uncertain"
  ],

  "data_gaps": [
    "No recent PHI practice reports available",
    "Unsure of BOS rotation plans with playoffs approaching"
  ]
}

Schema Rules:

Every field is required. No nulls, no N/A, no omissions.
Probability must be a specific decimal (0.72, not "around 70%").
Implied probability curves must be monotonically decreasing — system validates this.
Win probability + opponent probability must sum to 1.0.
Evidence must be specific and citable — not vague claims.
Conviction 1-5 scale with mandatory reasoning.
Upset scenario required even on high-conviction picks.
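
The mechanical parts of these rules can be checked programmatically. The following is a sketch of such a validator, not the system's actual one. It covers nulls, the conviction range, required reasoning and evidence, curve monotonicity, and the moneyline sum. Whether evidence is "specific and citable" still needs human or model review. Function names are hypothetical.

```python
import math
import re
from typing import Any

MARKET_SECTIONS = ("spread_analysis", "total_analysis", "moneyline_analysis")


def curve_is_non_increasing(curve: dict[str, float]) -> bool:
    """True if probabilities never rise as the threshold in the key rises.

    Assumes every key contains a number, e.g. "win_by_7_plus" or "over_215".
    """
    def threshold(key: str) -> float:
        return float(re.search(r"\d+(?:\.\d+)?", key).group())

    probs = [curve[key] for key in sorted(curve, key=threshold)]
    return all(a >= b for a, b in zip(probs, probs[1:]))


def find_nulls(node: Any, path: str) -> list[str]:
    """Collect the path of every null (None) value in a nested report."""
    if node is None:
        return [path]
    if isinstance(node, dict):
        return [p for k, v in node.items() for p in find_nulls(v, f"{path}.{k}")]
    if isinstance(node, list):
        return [p for i, v in enumerate(node) for p in find_nulls(v, f"{path}[{i}]")]
    return []


def validate_report(report: dict[str, Any]) -> list[str]:
    """Return a list of rule violations; an empty list means the report passes."""
    errors = [f"null value at {p}" for p in find_nulls(report, "report")]
    for name in MARKET_SECTIONS:
        section = report.get(name)
        if not isinstance(section, dict):
            errors.append(f"{name}: missing")
            continue
        conviction = section.get("conviction")
        if not (isinstance(conviction, int) and 1 <= conviction <= 5):
            errors.append(f"{name}: conviction must be an integer from 1 to 5")
        if not section.get("conviction_reasoning"):
            errors.append(f"{name}: conviction reasoning is required")
        if not section.get("evidence"):
            errors.append(f"{name}: evidence is required")
        curve = section.get("implied_probability_curve")
        numeric = isinstance(curve, dict) and all(
            isinstance(v, (int, float)) for v in curve.values()
        )
        if numeric and not curve_is_non_increasing(curve):
            errors.append(f"{name}: probability curve is not monotonically decreasing")
    ml = report.get("moneyline_analysis")
    if isinstance(ml, dict):
        total = (ml.get("win_probability") or 0) + (ml.get("opponent_win_probability") or 0)
        if not math.isclose(total, 1.0, abs_tol=1e-6):
            errors.append("moneyline_analysis: win probabilities must sum to 1.0")
    if not report.get("upset_scenario"):
        errors.append("upset_scenario is required")
    return errors
```

Returning a list of violations rather than raising on the first one lets the pipeline log every problem in a single pass.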

Ruling 6: Conviction Calibration — 1-5 Scale

Scale:

Level	Label	Meaning
1	VERY LOW	Genuine uncertainty. Making a pick but minimal fundamental basis. Near-random but tracked.
2	LOW	Weak basis. One or two data points, no strong directional story.
3	MODERATE	Reasonable basis. Several data points align. Standard operating pick.
4	HIGH	Strong basis. Multiple independent factors converge. Clear matchup story.
5	VERY HIGH	Exceptional. Rare clarity — overwhelming convergence of factors. Use sparingly (<10% of picks).

Rules:

Conviction is per-market-type (spread conviction can differ from total conviction).
Must include written reasoning for every conviction level.
Low conviction is still a full pick with a specific number. Not an excuse to hedge.
Base rate enforcement in prompt: "Most picks should be 2s and 3s. 4s and 5s should be rare. If you rate everything HIGH, you are miscalibrated."
Track conviction-weighted accuracy over time. If HIGH conviction picks do not outperform MODERATE, recalibrate.
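
The tracking rule above can be sketched as a small report over graded picks. This is an illustrative assumption about the data shape: each graded pick is a `(conviction, won)` pair, and the function names are hypothetical.

```python
from collections import Counter
from typing import Optional


def hit_rate_by_conviction(picks: list[tuple[int, bool]]) -> dict[int, float]:
    """Map each conviction level to the share of its graded picks that won."""
    totals = Counter(level for level, _ in picks)
    wins = Counter(level for level, won in picks if won)
    return {level: wins[level] / totals[level] for level in sorted(totals)}


def share_at_level(picks: list[tuple[int, bool]], level: int) -> float:
    """Fraction of all picks rated at the given level (e.g. 5 should stay under 0.10)."""
    if not picks:
        return 0.0
    return sum(1 for lv, _ in picks if lv == level) / len(picks)


def high_beats_moderate(rates: dict[int, float]) -> Optional[bool]:
    """Compare level 4 (HIGH) to level 3 (MODERATE); None if either has no data."""
    if 3 not in rates or 4 not in rates:
        return None
    return rates[4] > rates[3]
```

In practice each level needs enough graded picks for the comparison to mean anything. The sample-size threshold is left to the caller because the source does not specify one for this check.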

Ruling 7: Probability Curves — Model Building

Each analyst produces an implied probability curve for spreads and totals. The system:

Builds a full probability curve from the analyst's central estimate + confidence band
Compares curves: Analyst A curve vs Analyst B curve vs Edge Scanner curve vs Pinnacle implied vs Kalshi prices
Tracks curve accuracy over time using Brier scores at each threshold
Identifies which curve is best per sport, per market type, per analyst
Evolves weights — after 200+ markets, shift composite weights toward the best-performing curve

This is the core model-building mechanism. Over time, we learn: is Sonnet better at spreads? Is Gemini better at totals? Does the edge scanner beat both on ML? The clean independence makes this measurement meaningful.
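
The source does not specify how a curve is derived from the central estimate and confidence band. One possible construction, shown purely as an assumption, treats the outcome as normally distributed around the central estimate, with a standard deviation that combines a hypothetical game-level spread (`outcome_sd`) and the half-width of the analyst's band. All names and the value of `outcome_sd` are illustrative.

```python
from statistics import NormalDist


def build_curve(
    center: float,
    band: tuple[float, float],
    thresholds: list[float],
    outcome_sd: float,
) -> dict[float, float]:
    """Return P(outcome >= threshold) for each threshold.

    Assumptions (not from the source):
      * outcomes are normal around the analyst's central estimate
      * the band half-width is uncertainty about that estimate, added in
        quadrature to the game-level standard deviation `outcome_sd`
    """
    half_width = (band[1] - band[0]) / 2
    sigma = (outcome_sd ** 2 + half_width ** 2) ** 0.5
    dist = NormalDist(mu=center, sigma=sigma)
    return {t: 1 - dist.cdf(t) for t in thresholds}


# Hypothetical inputs echoing the spread example: margin 7.0, band [4.5, 9.5].
curve = build_curve(7.0, (4.5, 9.5), [1, 4, 7, 10, 14, 20], outcome_sd=12.0)
```

Under this construction the curve is non-increasing by design and passes through 0.5 at the central estimate. A wider band flattens it. Whether a normal shape fits real margins and totals is exactly the kind of question the Brier tracking is meant to answer.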

Ruling 8: Opus Verdict Layer — Sees Everything

Opus receives for every market:

Analyst A full report — lines, probabilities, curves, conviction, evidence, upset scenario
Analyst B full report — independent, same format
Edge scanner output — mathematical probability, raw/net edge, steam detection, tail confidence
Kalshi prices — current contract prices, bid/ask, volume
Pinnacle lines — the sharp benchmark
Third-party model consensus — Sagarin, DRatings, etc. (analysts never saw this)
Sharp money data — line movement, cash/ticket splits (analysts never saw this)

Opus Output (every market, no passing):

{
  "market": "BOS vs PHI — Spread",
  "final_line": "BOS -7.5",
  "final_probability": 0.51,
  "final_conviction": 4,
  "position_size": "FULL",
  "convergence_level": "FULL",
  "verdict_narrative": "Both analysts see BOS -7 to -8.5. Edge scanner math at -7.1. Pinnacle at -6.5. Analysts + math agree market is underpricing BOS margin. Sharp money confirms direction. Full position.",
  "analyst_agreement": true,
  "analyst_vs_market_divergence": 1.5,
  "tracking_flags": {
    "high_conviction_both": true,
    "contrarian_signal": false,
    "steam_aligned": true
  }
}

Position Sizing:

Signal Pattern	Size
Both analysts + math agree, market diverges	FULL
One analyst + math agree, other close	THREE_QUARTER
Analysts agree, math differs	HALF
Math only, analysts neutral	QUARTER
Everyone disagrees (still pick, edge scanner tiebreaks)	MINIMUM

MINIMUM is still a position. No passing. Every market gets a pick and a size. Minimum positions generate calibration data on low-confidence scenarios.
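
The sizing table can be read as a priority-ordered rule list with MINIMUM as the fallback, which guarantees that no market is ever passed. The sketch below assumes the agreement flags are computed upstream. How "agree", "close", and "neutral" are measured is not defined in this ruling, and the market-divergence condition on FULL is not modeled. The `Signals` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Agreement flags computed upstream; their thresholds are an open assumption."""
    a_agrees_with_math: bool
    b_agrees_with_math: bool
    analysts_agree: bool
    other_analyst_close: bool
    analysts_neutral: bool


def position_size(s: Signals) -> str:
    """Walk the table top to bottom; the final return enforces the no-pass rule."""
    if s.a_agrees_with_math and s.b_agrees_with_math:
        return "FULL"
    if (s.a_agrees_with_math or s.b_agrees_with_math) and s.other_analyst_close:
        return "THREE_QUARTER"
    if s.analysts_agree:
        return "HALF"
    if s.analysts_neutral:
        return "QUARTER"
    return "MINIMUM"
```

Because every branch returns a size, there is no code path that yields "no position".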

Ruling 9: Prompt Engineering Requirements

The analyst system prompt MUST include:

Independence framing: "You are an independent oddsmaker. You have NO access to current market prices. Build your lines entirely from the fundamental data provided. It is valuable for your line to differ significantly from markets — that divergence is the entire point of your role."
No-pass rule: "You MUST set a line and probability for every market type assigned. Low conviction is fine — it is still a pick with a number. There is no passing, no holding, no skipping."
Temporal isolation: Never show previous session outputs. Each session is fully independent.
Anti-inflation: "Most conviction ratings should be 2s and 3s. If you rate most picks 4+, you are miscalibrated. Be honest about uncertainty."
Evidence requirement: "Every claim must be backed by specific data from the briefing. Do not make assertions without evidence."

Ruling 10: Performance Tracking

Track the following over time:

Brier scores per analyst per market type — who is better at spreads vs totals vs ML?
Conviction calibration — do HIGH conviction picks actually outperform?
Curve accuracy at each threshold — which analyst's probability curve is closest to reality?
Analyst vs edge scanner vs composite — which signal source adds the most value?
Divergence value — when analysts disagree with the market, are they right more often than wrong?
Upset scenario accuracy — when the upset scenario materializes, did the analyst identify it?
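
For reference, the Brier score is the mean squared difference between forecast probability and the 0/1 outcome, so lower is better and a constant 0.5 forecast scores 0.25. A minimal sketch of per-threshold scoring follows. The record shape `(threshold_key, forecast, outcome)` is an assumption.

```python
from collections import defaultdict


def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean of (forecast - outcome)^2, where each outcome is 0 or 1."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


def brier_by_threshold(records: list[tuple[str, float, int]]) -> dict[str, float]:
    """Group (threshold_key, forecast, outcome) records and score each threshold."""
    grouped: dict[str, tuple[list[float], list[int]]] = defaultdict(lambda: ([], []))
    for key, forecast, outcome in records:
        grouped[key][0].append(forecast)
        grouped[key][1].append(outcome)
    return {key: brier_score(f, o) for key, (f, o) in grouped.items()}
```

Running `brier_by_threshold` separately per analyst and per market type gives the comparison grid described in the first and third items above.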

After 60-90 sessions, use this data to:

Tune analyst prompts
Adjust conviction scale definitions
Shift composite weights toward best-performing sources
Consider replacing underperforming analyst model

Summary

Decision	Ruling
Price hiding	Complete blackout — no lines, odds, probabilities, thresholds from any source
Sharp money / steam	Excluded entirely from analyst layer
Briefing contents	Raw fundamentals only — stats, rest, refs, injuries, matchup data
Third-party models	Hidden from analysts — tracked for correlation research only
Assignment format	Game + market types only — "set your own spread, total, ML" — no thresholds shown
Analyst output	Own lines + implied probability curves + evidence + conviction + upset scenario
Conviction	1-5 scale, per-market-type, mandatory reasoning, base rate enforcement
Probability curves	Built from analyst output, tracked against reality, compared across analysts
Opus verdict	Sees everything — analyst reports + Kalshi + Pinnacle + sharp money + models. Picks every market.
Position sizing	FULL to MINIMUM based on convergence — no passing
No-pass rule	Every layer picks every market. Low conviction = still a pick.

The architecture's value is independence at the analyst layer. Protect it aggressively. The probability curves built from blind analysis are the product — everything else is infrastructure to make those curves better over time.

Ruling 11: Analyst Model Roster — LOCKED

Date added: 2026-03-30
Status: HARD RULE — no substitutions

The blind analyst panel consists of exactly these three models. No substitutions, no swaps, no additions without a new panel ruling.

Seat	Model	Provider	API
Analyst 1	Claude Sonnet 4.6 (claude-sonnet-4-6)	Anthropic	Direct API
Analyst 2	Gemini 3.1 Pro Preview (google/gemini-3.1-pro-preview)	Google	OpenRouter
Analyst 3	gpt-oss-120b (openai/gpt-oss-120b)	OpenAI (open-weight)	OpenRouter

Verdict layer: Opus 4.6 via CLI bridge (free). NOT a blind analyst.

Why these three:

Three different providers (Anthropic, Google, OpenAI) — maximum training data diversity
Three different architectures — uncorrelated errors
gpt-oss-120b scored highest weight (40%) in Metaculus forecasting tournament
Research shows reasoning-heavy models have WORSE calibration — gpt-oss-120b is optimized for this
Total cost: ~$0.10 per full NBA slate for all 3 analysts

Rules:

No model may be replaced without a full 4-member panel ruling
No additional analysts may be added without a panel ruling (must stay at odd number: 3 or 5)
All three receive identical prompts and identical data
All three go cold every session — no memory, no cross-contamination
If a model fails, retry up to 3 times. After 3 failures, alert the boss via Telegram. Never substitute a different model.
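
The failure rule can be sketched as a retry wrapper that alerts and returns nothing rather than falling back to another model. This reads "retry up to 3 times" as three attempts in total, which is an interpretation. `call_model` and `alert` are hypothetical callables standing in for the real API call and the Telegram notification.

```python
from typing import Callable, Optional


def run_analyst(
    call_model: Callable[[], dict],
    alert: Callable[[str], None],
    model_id: str,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Call one locked analyst model; on repeated failure, alert and give up.

    There is deliberately no fallback model: a failed seat stays empty.
    """
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return call_model()
        except Exception as exc:  # broad on purpose: any failure counts as an attempt
            last_error = exc
    alert(f"{model_id} failed {max_attempts} times: {last_error}")
    return None
```

Returning `None` leaves the decision about the empty seat to the caller, which keeps the no-substitution rule out of reach of any automatic recovery logic.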

Ruling 12: Opus Verdict Layer — HYBRID (Fresh + Calibration Digest)

Date added: 2026-03-31
Council: Full 5-member council, unanimous
Status: LOCKED

Opus goes FRESH every session. No resume. No memory of past verdicts.

Before each session, Opus receives a Verdict Calibration Digest — aggregate pipeline statistics only.

Digest Contains:

Convergence class hit rates and ROI (A through E) — 30-day rolling
Position sizing calibration (FULL through MINIMUM hit rates)
Opus Adjustment Value — when Opus overrides composite by >3%, is it correct?
Known biases from last calibration cycle
Sport-specific and market-type breakdowns (when N >= 50)

Digest Never Contains:

Specific past picks or outcomes
Streak data or recent sequence
Team/matchup narratives from prior sessions
Any individual game results

Rules:

Minimum 20 graded markets per category before appearing in digest
30-day rolling window, not cumulative
Programmatically generated — no human or AI edits
Injected as system-level context, separate from game data
Below 40 graded plays per class, run pure fresh (no digest)
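
A sketch of how these gates might combine when the digest is generated follows. It rests on two interpretations that the rules do not settle: the 40-play floor is applied to every convergence class seen in the window (if any class falls short, the session runs pure fresh), and the 20-market minimum filters individual categories within a digest that does get built. The record type and field names are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta
from typing import NamedTuple, Optional

WINDOW_DAYS = 30
MIN_PER_CATEGORY = 20
MIN_PER_CLASS = 40


class GradedMarket(NamedTuple):
    graded_on: date
    convergence_class: str  # "A" through "E"
    category: str           # hypothetical label, e.g. "NBA spread"
    hit: bool


def build_digest(records: list[GradedMarket], today: date) -> Optional[dict]:
    """Return aggregate hit rates per category, or None to run pure fresh."""
    cutoff = today - timedelta(days=WINDOW_DAYS)
    recent = [r for r in records if r.graded_on >= cutoff]  # rolling window only

    # Assumed reading: every class present must have at least 40 graded plays.
    class_counts = Counter(r.convergence_class for r in recent)
    if not class_counts or min(class_counts.values()) < MIN_PER_CLASS:
        return None

    by_category: dict[str, list[bool]] = defaultdict(list)
    for r in recent:
        by_category[r.category].append(r.hit)

    # Aggregates only: no individual picks, games, or sequences leave this function.
    return {
        cat: {"n": len(hits), "hit_rate": sum(hits) / len(hits)}
        for cat, hits in sorted(by_category.items())
        if len(hits) >= MIN_PER_CATEGORY
    }
```

The output contains only counts and rates, which matches the "Digest Never Contains" list: nothing in it can identify a specific pick, streak, or matchup.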

Rationale:

Resume introduces anchoring, path dependence, and context pollution — every advisor flagged these as fatal. Pure fresh wastes real signal about pipeline performance. Hybrid gives Opus calibration without contamination.

Source: ~/edgeclaw/results/panel-results/analyst-briefing-ruling.md