Human Intelligence Desk — Final Specification

Date: 2026-03-16
Desk #: 16
Purpose: Track public predictions from human analysts across all markets, score them against actual outcomes, and feed accuracy data to every other desk.
Panel: Opus 4.6 | Gemini 3.1 Pro | GPT-4.1 | Grok 4.1 Fast | DeepSeek V3.1 | Llama 4 Maverick | Grok 4 Reasoning
Big Ideas Panel: Gemini Pro | GPT-4.1 | Grok 4 Fast | Sonar Reasoning Pro | Grok 4 Reasoning (judged by Opus)
Verdict: BUILD IT (unanimous) — start narrow (NFL ATS), expand after pipeline proven end-to-end.


The Core Idea

Every day, thousands of humans make public predictions: earnings estimates, sports picks, weather forecasts, FX targets, crypto calls. Most are never tracked. This desk logs them all, scores them when the outcome is known, and builds a running accuracy profile for every analyst.

The output is simple: who is actually good at predicting what, and who is consistently wrong?

The desk's real value is not the predictions themselves — you can read those anywhere. The value is the denominator. When Jim Cramer says "buy AAPL," the news tells you that. This desk tells you Cramer has been right 41% of the time on tech stocks over 200+ calls. That context transforms noise into signal. You cannot maintain that context in your head across dozens of analysts and thousands of predictions. The desk is a memory prosthetic with math attached.

[Opus — core framing]

Other desks use this two ways:

  1. Follow signals — when a historically accurate analyst makes a call in their strong sector, weight it higher
  2. Fade signals — when a consistently wrong analyst makes a confident call, consider the opposite

This is a meta-edge — it doesn't predict markets directly but identifies who can. It acts as a force multiplier for every other desk.

[Gemini Pro — "core infrastructure disguised as a new desk"]


Database Schema (SQLite)

File: data/db/human-intel.db
Mode: WAL (single writer, batch inserts, busy_timeout=5000)

-- Every human analyst/forecaster we track
CREATE TABLE analysts (
    id TEXT PRIMARY KEY,                    -- slug: "jim-cramer", "gdpnow", "estimize-consensus"
    name TEXT NOT NULL,                     -- display name
    source TEXT NOT NULL,                   -- where we found them: "estimize", "espn", "crypto-twitter", etc.
    type TEXT NOT NULL,                     -- "individual", "institution", "model", "consensus"
    sectors TEXT NOT NULL DEFAULT '[]',     -- JSON array: ["stocks", "earnings", "macro"]
    first_seen TEXT NOT NULL,               -- ISO date
    prediction_count INTEGER DEFAULT 0,
    active INTEGER DEFAULT 1,              -- 0 = no longer publishing
    metadata TEXT DEFAULT '{}',            -- JSON: twitter handle, publication, URL, notes
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
);

-- Every prediction we collect
CREATE TABLE predictions (
    id TEXT PRIMARY KEY,                    -- uuid
    analyst_id TEXT NOT NULL REFERENCES analysts(id),
    market TEXT NOT NULL,                   -- "stocks", "forex", "sports", "weather", "crypto", "macro", "commodities"
    sub_market TEXT,                        -- "earnings", "nfl", "temperature", "btc", "fed-rate", "crude-oil"
    prediction_type TEXT NOT NULL,          -- "directional", "point_estimate", "binary", "range"
    subject TEXT NOT NULL,                  -- what they're predicting: "AAPL Q1 EPS", "KC vs BUF spread", "June NFP"
    -- PANEL FIX: structured value fields instead of single TEXT blob
    predicted_value_numeric REAL,           -- for point estimates: 2.15, 180000, 72.5
    predicted_value_string TEXT,            -- for non-numeric: "KC -3.5", "bullish", "above 200k"
    predicted_range_low REAL,              -- for range predictions: lower bound
    predicted_range_high REAL,             -- for range predictions: upper bound
    predicted_direction TEXT,               -- "up", "down", "over", "under", "yes", "no" (for directional)
    confidence REAL,                        -- 0-1 ONLY if analyst stated explicit confidence. NULL otherwise.
    confidence_source TEXT,                 -- "stated" (analyst said it), "inferred" (Confidence Scorer LLM), NULL
    horizon TEXT,                           -- "intraday", "daily", "weekly", "monthly", "quarterly", "annual"
    target_date TEXT,                       -- when the prediction resolves: "2026-03-20"
    source_url TEXT,                        -- where we found it
    raw_text TEXT,                          -- original unprocessed text of the prediction (for auditing)
    parent_prediction_id TEXT,             -- links to prior version if analyst revised their call
    collected_at TEXT NOT NULL DEFAULT (datetime('now')),
    status TEXT NOT NULL DEFAULT 'pending', -- "pending", "resolved", "expired", "cancelled"
    FOREIGN KEY (parent_prediction_id) REFERENCES predictions(id)
);

-- Deduplication: one prediction per analyst per subject per target date (prevents near-dupes).
-- The Data Collector writes subject in the canonical format defined per market, so
-- "AAPL Q1 2026 EPS" and "Apple first quarter earnings per share" collapse to the same row.
CREATE UNIQUE INDEX idx_pred_normalized ON predictions(analyst_id, subject, target_date);

-- Actual outcomes (linked to predictions)
CREATE TABLE outcomes (
    id TEXT PRIMARY KEY,                    -- uuid
    prediction_id TEXT NOT NULL REFERENCES predictions(id),
    actual_value TEXT NOT NULL,             -- what actually happened: "2.18", "BUF won by 7"
    actual_value_numeric REAL,             -- numeric actual for point estimate comparison
    result TEXT NOT NULL,                   -- "correct", "incorrect", "partial", "push"
    brier_score REAL,                       -- 0.0 (perfect) to 1.0 (perfectly wrong). NULL if confidence not stated.
    error_magnitude REAL,                   -- for point estimates: |predicted - actual|
    error_pct REAL,                         -- for point estimates: |predicted - actual| / actual * 100
    difficulty REAL,                        -- market-implied probability (closing line, consensus). For skill scoring.
    resolved_at TEXT NOT NULL DEFAULT (datetime('now')),
    resolution_source TEXT,                 -- where we got the actual: "sec-edgar", "espn-api", "fred"
    resolution_criteria TEXT,              -- defined BEFORE collection: how we judge this outcome
    notes TEXT                              -- edge cases, pushes, adjustments
);

-- Precomputed accuracy scores (updated after each resolution batch)
CREATE TABLE scores (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    analyst_id TEXT NOT NULL REFERENCES analysts(id),
    market TEXT NOT NULL,                   -- sector or "all"
    horizon TEXT NOT NULL,                  -- time horizon or "all"
    period TEXT NOT NULL,                   -- "30d", "90d", "365d", "lifetime"
    total_predictions INTEGER NOT NULL,
    correct INTEGER NOT NULL,
    incorrect INTEGER NOT NULL,
    accuracy_pct REAL NOT NULL,             -- correct / total * 100
    avg_brier REAL,                         -- average Brier score (lower = better). NULL if no stated-confidence predictions.
    avg_error REAL,                         -- average error magnitude (for point estimates)
    calibration_score REAL,                 -- how well confidence matches accuracy
    -- PANEL ADDITIONS
    skill_score REAL,                       -- (accuracy - baseline) / (1 - baseline). Cross-market comparison.
    baseline_accuracy REAL,                 -- market base rate (50% for ATS, ~70% for Seattle rain in November, etc.)
    ci_lower REAL,                          -- Wilson confidence interval lower bound
    ci_upper REAL,                          -- Wilson confidence interval upper bound
    consensus_deviation REAL,              -- how much this analyst deviates from consensus (independence score)
    error_stddev REAL,                      -- standard deviation of errors (for Sharpe calculation)
    consensus_outperform REAL,             -- % of time analyst beat consensus
    streak_current INTEGER DEFAULT 0,       -- current streak (positive = correct, negative = wrong)
    streak_best INTEGER DEFAULT 0,          -- best correct streak ever
    fade_signal INTEGER DEFAULT 0,          -- 1 = this analyst is a reliable fade
    follow_signal INTEGER DEFAULT 0,        -- 1 = this analyst is worth following
    updated_at TEXT NOT NULL DEFAULT (datetime('now')),
    UNIQUE(analyst_id, market, horizon, period)
);

-- Desk self-scoring: tracks whether follow/fade signals actually work
CREATE TABLE desk_meta_scores (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    signal_type TEXT NOT NULL,              -- "follow" or "fade"
    signal_strength INTEGER NOT NULL,       -- 1-5
    market TEXT NOT NULL,
    total_signals INTEGER NOT NULL,
    correct INTEGER NOT NULL,
    accuracy_pct REAL NOT NULL,
    period TEXT NOT NULL,                    -- "30d", "90d", "lifetime"
    updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);
-- [Opus — "desk should track its own signal hit rate"]

-- Analyst aliases for name normalization
CREATE TABLE analyst_aliases (
    alias TEXT PRIMARY KEY,                 -- "Jim Cramer", "james cramer", "jimcramer"
    analyst_id TEXT NOT NULL REFERENCES analysts(id)
);
-- [Opus — analyst name normalization]

-- Indexes
CREATE INDEX idx_pred_analyst ON predictions(analyst_id);
CREATE INDEX idx_pred_market ON predictions(market);
CREATE INDEX idx_pred_status ON predictions(status);
CREATE INDEX idx_pred_target ON predictions(target_date);
CREATE INDEX idx_pred_parent ON predictions(parent_prediction_id);
CREATE INDEX idx_outcome_pred ON outcomes(prediction_id);
CREATE INDEX idx_scores_analyst ON scores(analyst_id);
CREATE INDEX idx_scores_market ON scores(market);
CREATE INDEX idx_scores_fade ON scores(fade_signal);
CREATE INDEX idx_scores_follow ON scores(follow_signal);

-- Full-text search on predictions
-- PANEL FIX: add INSERT/UPDATE/DELETE triggers to keep FTS in sync
-- External-content table (content='predictions') so the 'delete' commands in the sync
-- triggers below are valid and the text is stored only once.
CREATE VIRTUAL TABLE predictions_fts USING fts5(
    subject, predicted_value_string,
    content='predictions',
    tokenize='porter unicode61'
);

CREATE TRIGGER predictions_fts_insert AFTER INSERT ON predictions BEGIN
    INSERT INTO predictions_fts(rowid, subject, predicted_value_string)
    VALUES (NEW.rowid, NEW.subject, NEW.predicted_value_string);
END;

CREATE TRIGGER predictions_fts_delete AFTER DELETE ON predictions BEGIN
    INSERT INTO predictions_fts(predictions_fts, rowid, subject, predicted_value_string)
    VALUES ('delete', OLD.rowid, OLD.subject, OLD.predicted_value_string);
END;

CREATE TRIGGER predictions_fts_update AFTER UPDATE ON predictions BEGIN
    INSERT INTO predictions_fts(predictions_fts, rowid, subject, predicted_value_string)
    VALUES ('delete', OLD.rowid, OLD.subject, OLD.predicted_value_string);
    INSERT INTO predictions_fts(rowid, subject, predicted_value_string)
    VALUES (NEW.rowid, NEW.subject, NEW.predicted_value_string);
END;

REMOVED: desk_consumption table. Preventing re-alerting is a notification concern — consuming desks manage their own "last seen" state.

[Opus — "solves a problem that does not need a database solution"]

NOTE on entity types: Weather models (GFS, ECMWF) and consensus prices (Polymarket) are tracked as analysts with type="model" and type="consensus". These use type-specific scoring paths (for example, models are scored on accuracy and error only, with no calibration; see the Weather section).

[Opus — "schema mixes fundamentally different entity types"]


Data Sources & Collection Methods

1. Earnings Estimates

| Source | Method | Frequency | What We Get |
| --- | --- | --- | --- |
| Estimize (estimize.com) | Free API / RSS | Daily during earnings season | Consensus + individual EPS/revenue estimates |
| SEC EDGAR | EDGAR full-text search API (free) | After each earnings release | Actual reported EPS/revenue for resolution |
| Yahoo Finance | Public earnings calendar page scrape | Weekly | Wall Street consensus estimates |

Prediction type: point_estimate
Resolution: Compare estimated EPS to actual reported EPS. Use error_pct. Brier only if analyst stated explicit confidence.
Resolution standard: Non-GAAP adjusted EPS (most commonly reported). Define in resolution_criteria at collection time. [Panel consensus — define resolution criteria BEFORE collecting]
Subject format: "{TICKER} Q{N} {YEAR} EPS" or "{TICKER} Q{N} {YEAR} Revenue"
Correct threshold: Within 5% of actual. [Opus — market-specific thresholds]

2. Economic / Macro Forecasts

| Source | Method | Frequency | What We Get |
| --- | --- | --- | --- |
| FRED API (fred.stlouisfed.org) | REST API, free key | Daily | Actual macro data (GDP, CPI, NFP, unemployment) |
| Atlanta Fed GDPNow | Public page scrape | Every update (~weekly) | Real-time GDP estimate |
| Cleveland Fed CPI Nowcast | Public page scrape | Daily | Inflation nowcast |
| Fed/ECB/BOJ projections | PDF scrape from central bank sites | Quarterly | Official rate + growth projections |
| Survey of Professional Forecasters (Philadelphia Fed) | CSV download (free) | Quarterly | 40+ economist forecasts with names |

Prediction type: point_estimate or range
Resolution: FRED API provides actuals for all major series.
Resolution standard: GDP = third estimate (final revision). NFP = first revision (one month later). CPI = initial release.
Subject format: "{INDICATOR} {PERIOD}" e.g. "US GDP Q1 2026", "June 2026 NFP"
Correct thresholds:

| Indicator | Threshold |
| --- | --- |
| GDP | Within 0.5 percentage points |
| NFP | Within 15% or 30k jobs (whichever larger) |
| CPI | Within 0.2 percentage points |
| Fed funds rate | Within 25 bps |

[Opus — market-specific thresholds]

WARNING: Macro has tiny sample sizes. Quarterly frequency = 4-8 data points per forecaster per year. Statistical significance requires 3-5 years. Forex bank forecasts have the same problem.

[Opus — ROI is questionable for macro/forex at realistic sample sizes]

3. Commodities / Energy

| Source | Method | Frequency | What We Get |
| --- | --- | --- | --- |
| EIA API (api.eia.gov) | REST API, free key | Weekly | Crude oil inventory actuals + analyst consensus estimates |
| USDA WASDE reports | PDF scrape | Monthly | Crop production estimates vs actuals |
| Trading Economics | Public page scrape | As published | Commodity forecasts from analysts |

Prediction type: point_estimate
Resolution: EIA weekly inventory numbers, USDA final production numbers.

4. Sports

| Source | Method | Frequency | What We Get |
| --- | --- | --- | --- |
| ESPN Expert Picks | Page scrape | Weekly (NFL), daily (NBA/NHL) | Named expert picks against the spread |
| CBS Sports picks | Page scrape | Weekly/daily | Staff picks with records |
| The Odds API (existing) | REST API, free tier | Daily | Closing lines for resolution |
| Public handicapper accounts | RSS / X API (free tier) | Daily | Notable public picks |

Prediction type: directional (against the spread) or binary (moneyline)
Resolution: Final scores from our existing sports tracking sheets.
Subject format: "{AWAY} @ {HOME} {DATE} {PICK}" e.g. "BUF @ KC 2026-01-20 KC -3.5"
Correct threshold: Binary — covered or didn't. No margin needed.
Base rate: ~50% for ATS picks.
This is the most important market for V1.

[Panel consensus — start here. Cleanest data, fastest resolution, binary outcomes, no ambiguity.]

5. Weather

| Source | Method | Frequency | What We Get |
| --- | --- | --- | --- |
| NWS API (api.weather.gov) | REST API, free | 4x daily | Official NWS forecasts (temp, precip) |
| Open-Meteo API | REST API, free | 4x daily | GFS, ECMWF, ICON model outputs |
| Tropical Tidbits | Page scrape | During severe weather | Expert meteorologist forecasts |

Prediction type: point_estimate (temperature) or binary (rain/no rain)
Resolution: NWS observed data (METARs, station observations).
Subject format: "{CITY} {DATE} High Temp" or "{CITY} {DATE} Precipitation Y/N"
Correct threshold: Within 3 degrees F for temperature.

[Opus — market-specific thresholds]

Note: Weather models (GFS, EURO) are tracked as "analysts" with type="model". Uses type-specific scoring (accuracy only, no calibration). Lets us score which forecast model is most accurate per region/season.

6. Forex

| Source | Method | Frequency | What We Get |
| --- | --- | --- | --- |
| Major bank FX forecasts (Goldman, JPM, Citi, etc.) | PDF/page scrape from public research | Quarterly | End-of-quarter FX targets |
| TradingView public ideas | Page scrape | Weekly | Community FX calls with clear targets |
| CFTC COT data | CSV from cftc.gov | Weekly | Positioning data (context, not predictions) |

Prediction type: point_estimate or directional
Resolution: Actual FX rates from free forex APIs (exchangerate.host, frankfurter.app).
Subject format: "{PAIR} {TARGET_DATE} Target" e.g. "EUR/USD Q2 2026 Target"
Correct threshold: Within 2%.

[Opus — market-specific thresholds]

7. Crypto

| Source | Method | Frequency | What We Get |
| --- | --- | --- | --- |
| Notable CT accounts | X/Twitter API free tier or Nitter scrape | Daily | Public crypto calls from tracked accounts |
| Glassnode free tier | REST API | Weekly | On-chain analyst reports |
| CoinGlass | Public page scrape | Daily | Funding rate predictions, liquidation estimates |
| Polymarket / Kalshi | API | Daily | Market-implied probabilities (tracked as type="consensus") |

Prediction type: directional or point_estimate
Resolution: CoinGecko API (free) for price data.
Subject format: "{COIN} {HORIZON} {CALL}" e.g. "BTC 30d above 100k"
Correct threshold: Within 10% (high volatility market).

[Opus — market-specific thresholds]


Scoring Methodology

Two-Track Scoring System

[Panel consensus — the most important fix to the original spec]

Track 1: Brier Score (ONLY for predictions with stated confidence)

Brier = (forecast_probability - actual_outcome)^2

Brier scoring is reserved for predictions where the analyst stated an explicit probability (confidence_source = 'stated'). Predictions without stated confidence are never assigned a default probability; confidence inferred by the Confidence Scorer is handled separately under the hybrid rules in that role's section.

Track 2: Binary Accuracy (for ALL predictions)

accuracy_pct = correct_predictions / total_predictions * 100

A prediction is "correct" if it meets the market-specific threshold defined under Data Sources & Collection Methods (ATS pick covered the spread, EPS within 5% of actual, temperature within 3 degrees F, and so on).

This track runs on EVERY prediction regardless of whether confidence was stated. It is the primary metric for most analysts (ESPN experts, handicappers, bank forecasters).

Error Metrics (for point estimates)

error_magnitude = |predicted - actual|
error_pct = |predicted - actual| / |actual| * 100

Used for earnings, macro forecasts, temperature, FX targets — anything with a number.

Skill Score (difficulty-adjusted accuracy)

[Opus, Gemini Pro — "separates a good system from a world-class one"]

skill_score = (accuracy - baseline_accuracy) / (1.0 - baseline_accuracy)

Every prediction gets a difficulty field in outcomes: the market-implied probability of the outcome. For sports, this is derived from the closing line; for earnings, from the consensus estimate and the historical beat rate; for weather, from the climatological base rate.

Vig adjustment: Market-implied probabilities include the vig (house edge), so raw odds overstate baseline difficulty. A -110/-110 line implies 52.4% on each side (104.8% total); divide each side by the total implied probability (equivalently, scale by roughly (1 - vig), vig ≈ 4.5%) to recover the true 50/50 baseline. Use the de-vigged probability as the baseline.

[Grok 4 Fast — "detail nobody else caught"; Opus agreed]
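
A minimal sketch of the de-vig step on a two-sided American-odds line (function names are illustrative, not part of the spec):

```typescript
// Convert American odds to the probability implied by the price (includes vig).
function impliedProbability(americanOdds: number): number {
  return americanOdds < 0
    ? -americanOdds / (-americanOdds + 100)   // -110 -> 110 / 210 = 0.524
    : 100 / (americanOdds + 100);             // +120 -> 100 / 220 = 0.455
}

// Remove the vig by normalizing both sides so they sum to 1.
// A -110/-110 line (0.524 / 0.524) de-vigs to 0.50 / 0.50, the true ATS baseline.
function devigTwoWay(oddsA: number, oddsB: number): [number, number] {
  const a = impliedProbability(oddsA);
  const b = impliedProbability(oddsB);
  const overround = a + b;                    // ~1.048 for -110/-110 (the ~4.5-5% vig)
  return [a / overround, b / overround];
}

console.log(devigTwoWay(-110, -110));         // [0.5, 0.5]
```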

Without this, scores are uninterpretable. A 55% accuracy in NFL ATS is above the 50% base rate and genuinely impressive. A 55% accuracy predicting rain in Seattle in November is terrible (base rate ~70%).

Base rates by market:

| Market | Base Rate | Source |
| --- | --- | --- |
| Sports ATS | ~50% | By definition (line is set to split action) |
| Sports ML (favorites) | ~60-65% | Historical favorite win rate |
| Earnings beat/miss | ~70% | Historical beat rate |
| Weather (rain Y/N) | Varies by city/season | NOAA climatological data |
| Macro directional | ~50% | Random walk assumption |
| Crypto directional | ~50% | Random walk assumption |
| Forex directional | ~50% | Random walk assumption |

Calibration Score

Measures whether someone who says "80% confident" is right 80% of the time.

calibration = average |stated_confidence - actual_hit_rate| across confidence buckets

Buckets: 50-59%, 60-69%, 70-79%, 80-89%, 90-100%. Perfect calibration = 0.0. Most humans are overconfident (say 90%, right 65%).

Only computed for Track 1 (stated confidence) predictions.
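
A minimal sketch of the calibration computation, assuming each resolved Track 1 prediction carries its stated confidence and a correct/incorrect flag (types and names are illustrative):

```typescript
interface ResolvedTrack1 { confidence: number; correct: boolean }  // stated confidence, 0-1

// Buckets from the spec: 50-59%, 60-69%, 70-79%, 80-89%, 90-100%.
const BUCKETS: [number, number][] = [[0.5, 0.6], [0.6, 0.7], [0.7, 0.8], [0.8, 0.9], [0.9, 1.01]];

// calibration = average |stated_confidence - actual_hit_rate| across non-empty buckets.
function calibrationScore(preds: ResolvedTrack1[]): number | null {
  const gaps: number[] = [];
  for (const [lo, hi] of BUCKETS) {
    const inBucket = preds.filter(p => p.confidence >= lo && p.confidence < hi);
    if (inBucket.length === 0) continue;
    const avgConfidence = inBucket.reduce((s, p) => s + p.confidence, 0) / inBucket.length;
    const hitRate = inBucket.filter(p => p.correct).length / inBucket.length;
    gaps.push(Math.abs(avgConfidence - hitRate));
  }
  return gaps.length ? gaps.reduce((s, g) => s + g, 0) / gaps.length : null;  // null = no Track 1 data
}
```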

Time-Weighted Scoring

Recent predictions count more than old ones. Use exponential decay:

weight = e^(-days_ago / half_life)

Half-lives by horizon:

Wilson Confidence Intervals

[Opus, Gemini Pro, DeepSeek — "must add statistical significance"]

Every accuracy score includes Wilson confidence interval bounds (ci_lower, ci_upper). Signals only fire when the CI supports the threshold: a follow signal requires ci_lower above the market baseline, a fade signal requires ci_upper below it (see Signal Generation).

Formula:

Wilson CI = (p + z²/2n ± z√(p(1-p)/n + z²/4n²)) / (1 + z²/n)

where z = 1.96 for 95% confidence.

This prevents false signals from small samples. An analyst at 60% over 20 picks has a CI of roughly [39%, 78%] — NOT significant. At 60% over 100 picks, CI narrows to [50%, 69%] — now it means something.
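
A minimal sketch of the interval as used here; it reproduces the rough [39%, 78%] and [50%, 69%] bounds quoted above:

```typescript
// Wilson score interval for a binomial proportion; z = 1.96 for 95% confidence.
function wilsonInterval(correct: number, total: number, z = 1.96): { lower: number; upper: number } {
  if (total === 0) return { lower: 0, upper: 1 };
  const p = correct / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = p + z2 / (2 * total);
  const spread = z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total));
  return { lower: (center - spread) / denom, upper: (center + spread) / denom };
}

console.log(wilsonInterval(12, 20));   // ~{ lower: 0.39, upper: 0.78 } (not significant vs a 50% baseline)
console.log(wilsonInterval(60, 100));  // ~{ lower: 0.50, upper: 0.69 } (the interval has tightened meaningfully)
```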

Multiple Testing Correction

[Opus — "the multiple-testing problem"]

When testing hundreds of analysts simultaneously, apply Benjamini-Hochberg False Discovery Rate correction. With 200 analysts, ~10 will appear significant by chance. In practice this means either running the BH correction across the full candidate pool each scoring cycle before any follow/fade flag is set, or shrinking each analyst's estimate toward the market baseline with a Bayesian prior.

[Gemini Pro, DeepSeek — Bayesian approach as alternative]
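
A minimal sketch of the BH step over the analyst pool, assuming each candidate already carries a p-value (e.g., from a binomial test of the record against the market baseline); the types are illustrative:

```typescript
interface Candidate { analystId: string; pValue: number }

// Benjamini-Hochberg FDR: sort p-values ascending, find the largest k with
// p_(k) <= (k / m) * q, and accept candidates 1..k. q is the target false discovery rate.
function benjaminiHochberg(candidates: Candidate[], q = 0.05): Candidate[] {
  const sorted = [...candidates].sort((a, b) => a.pValue - b.pValue);
  const m = sorted.length;
  let cutoff = -1;
  sorted.forEach((c, i) => {
    if (c.pValue <= ((i + 1) / m) * q) cutoff = i;
  });
  return cutoff >= 0 ? sorted.slice(0, cutoff + 1) : [];
}

// With 200 analysts and q = 0.05, only records extreme enough to survive the
// rank-scaled threshold generate signals; the ~10 lucky coin-flippers do not.
```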


Signal Generation

Follow Signal (analyst is worth listening to)

An analyst gets follow_signal = 1 when ALL of these are true:

  1. Minimum predictions: market-specific minimum (see table below)
  2. Accuracy: ci_lower > baseline for this market (statistically significant outperformance)
  3. Recency: at least 5 predictions in last 90 days (still active)
  4. Consistency: no losing streak > 8 in last 30 predictions
  5. Calibration: within 15 percentage points (not wildly overconfident). Only checked if Track 1 predictions exist.

Fade Signal (analyst is reliably wrong — trade the opposite)

An analyst gets fade_signal = 1 when ALL of these are true:

  1. Minimum predictions: higher market-specific minimum (see table below)
  2. Accuracy: ci_upper < baseline for this market (statistically significant underperformance)
  3. Recency: at least 5 predictions in last 90 days
  4. Consistency: losing rate sustained across at least 2 different 30-day windows (not just one bad month)
  5. Not improving: 90-day accuracy is not trending up

Minimum Sample Sizes by Market

[Panel consensus — original minimums too low. Raised significantly.]

| Market | Min for Scoring | Min for Follow Signal | Min for Fade Signal |
| --- | --- | --- | --- |
| Sports (ATS) | 20 | 50 | 80 |
| Earnings (EPS) | 10 | 20 | 30 |
| Macro | 8 | 15 | 25 |
| Weather | 30 | 60 | 100 |
| Forex | 6 | 12 | 20 |
| Crypto | 15 | 30 | 50 |

Sports and weather minimums raised most aggressively because data is frequent and individual predictions are noisy. At 30 ATS picks, a fair coin flipper still has roughly an 18% chance of hitting 58%+ (18 of 30 correct). At 80 picks (47+ correct), that drops to about 7%.

[Opus — "the spec's 30-pick minimum is too low. For a reliable fade signal with sports data, you need 80-100+ picks"]

Signal Strength

Not all follows/fades are equal. Strength = 1 to 5:

| Strength | Follow Criteria | Fade Criteria |
| --- | --- | --- |
| 1 | ci_lower > baseline, 50+ picks | ci_upper < baseline, 80+ picks |
| 2 | ci_lower > baseline+3%, 75+ picks | ci_upper < baseline-3%, 100+ picks |
| 3 | ci_lower > baseline+5%, 100+ picks | ci_upper < baseline-5%, 125+ picks |
| 4 | ci_lower > baseline+8%, 150+ picks | ci_upper < baseline-8%, 150+ picks |
| 5 | ci_lower > baseline+10%, 200+ picks | ci_upper < baseline-10%, 200+ picks |
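
The table is deterministic, so it can live in code. A minimal sketch of the evaluation, assuming ci_lower, ci_upper, and baseline_accuracy are stored on a 0-1 scale (the helper itself is illustrative):

```typescript
interface ScoreRow {
  ci_lower: number; ci_upper: number; baseline_accuracy: number; total_predictions: number;
}

// Edge (in probability points) and minimum picks required for each tier, strongest first.
const FOLLOW_TIERS = [[0.10, 200], [0.08, 150], [0.05, 100], [0.03, 75], [0.00, 50]] as const;
const FADE_TIERS   = [[0.10, 200], [0.08, 150], [0.05, 125], [0.03, 100], [0.00, 80]] as const;

// Returns signal strength 5..1, or 0 if no tier is met.
function signalStrength(row: ScoreRow, type: 'follow' | 'fade'): number {
  const tiers = type === 'follow' ? FOLLOW_TIERS : FADE_TIERS;
  for (let i = 0; i < tiers.length; i++) {
    const [edge, minPicks] = tiers[i];
    const passes = type === 'follow'
      ? row.ci_lower > row.baseline_accuracy + edge
      : row.ci_upper < row.baseline_accuracy - edge;
    if (passes && row.total_predictions >= minPicks) return 5 - i;  // index 0 = strength 5
  }
  return 0;  // recency, consistency, and calibration checks still apply before the flag is set
}
```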

How Other Desks Consume This Data

Query Interface

interface HumanIntelQuery {
  market?: string;          // "sports", "crypto", "forex", etc.
  sub_market?: string;      // "nfl", "btc", "eur-usd"
  subject_search?: string;  // FTS search: "AAPL earnings"
  signal_type?: 'follow' | 'fade' | 'both';
  min_strength?: number;    // 1-5
  min_predictions?: number; // override default minimums
  horizon?: string;
  days_back?: number;       // only recent predictions
  limit?: number;
}

interface HumanIntelResult {
  analyst: { id: string; name: string; source: string; type: string };
  prediction: { subject: string; predicted_value_numeric: number | null; predicted_value_string: string | null;
                direction: string | null; confidence: number | null; confidence_source: 'stated' | 'inferred' | null; target_date: string };
  score: { accuracy_pct: number; avg_brier: number | null; skill_score: number; ci_lower: number; ci_upper: number;
           signal_type: 'follow' | 'fade' | null; signal_strength: number; prediction_count: number; baseline_accuracy: number };
  consensus?: { pct_agree: number; pct_disagree: number }; // how many other analysts agree
}

What Each Desk Gets

Sports Desk: "Which ESPN experts have ci_lower > 50% on ATS this season? What are they picking this week?"
Crypto Desk: "Which CT accounts called the last 3 moves correctly? What are they saying now?"
Forex Desk: "Goldman said EUR/USD 1.15 by Q2 — their FX forecast skill_score is -0.12 over 3 years. Fade signal strength 3."
Weather Desk: "GFS vs EURO: which model has lower error_pct for this region in the last 30 days?"
Macro Desk: "Survey of Professional Forecasters consensus for June NFP is 180k. The 3 analysts with highest skill_score say 210k+."
Earnings Desk: "Estimize consensus for AAPL is $2.15. Wall Street consensus is $2.10. Estimize has been closer 62% of the time."
Certainty Gap Desk: "When ALL tracked analysts agree on something (90%+ consensus), how often are they right? Is there a gap?"

Automated Briefing

When an analyst desk starts its session, it can request a briefing:

getBriefing(desk: "Sports", sub_market: "nfl", days_back: 7)

Returns: the active follow/fade signals for that sub-market, the pending predictions from analysts who currently carry a signal, and any signals added, upgraded, or dropped in the last days_back days.


Update Frequency & Execution Pipeline

Collection Schedule

| Source | Collection Frequency | Resolution Frequency |
| --- | --- | --- |
| ESPN/CBS expert picks | Daily | After games complete |
| Estimize earnings | Daily during earnings season | After each earnings release |
| FRED macro data | After each data release | Same (FRED IS the actuals) |
| GDPNow / CPI Nowcast | Each update (semi-weekly) | After official release |
| EIA inventories | Tuesday (estimates), Wednesday (actuals) | Wednesday |
| NWS / weather models | 4x daily | Next day (compare forecast to observed) |
| Bank FX forecasts | Quarterly | End of quarter |
| Crypto Twitter calls | Daily scan | Based on stated horizon |
| Polymarket/Kalshi | Daily | At contract settlement |
| USDA crop reports | Monthly | At report release |
| Survey of Prof. Forecasters | Quarterly | After data is released |

Speed of Resolution — Architectural Principle (V1)

[Big Ideas #9 — unanimous V1, value 8.8/10. "Architectural foundation, not feature."]

Most human prediction trackers resolve slowly — days to weeks after outcomes are known. Our edge: resolve within 2 hours of outcome availability.

| Event Type | Outcome Available | Our Resolution Target |
| --- | --- | --- |
| Sports games | Game ends ~11PM ET | Resolved by 1AM ET |
| Earnings reports | Released 4PM ET | Resolved by 5PM ET |
| Weather forecasts | Yesterday's actuals at 6AM | Resolved by 7AM ET |
| Macro data (NFP, CPI) | Released 8:30AM ET | Resolved by 9AM ET |
| EIA inventories | Released 10:30AM ET Wed | Resolved by 11AM ET |

Implementation: Poll free APIs (ESPN scores, Yahoo Finance earnings, FRED) every 15 minutes post-event window. LLM verifies outcomes from raw text if needed. DAG triggers scoring immediately after resolution.
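
A minimal sketch of the post-event polling loop described above; fetchOutcome is a placeholder for a source-specific adapter (ESPN scores, FRED, etc.), not an actual module:

```typescript
// Poll a resolution source every 15 minutes inside the post-event window,
// stop as soon as an outcome appears or the window expires (then queue for review).
async function pollForOutcome(
  fetchOutcome: () => Promise<string | null>,   // placeholder: wraps the real adapter
  windowHours = 6,
  intervalMs = 15 * 60 * 1000,
): Promise<string | null> {
  const deadline = Date.now() + windowHours * 60 * 60 * 1000;
  while (Date.now() < deadline) {
    const outcome = await fetchOutcome();
    if (outcome !== null) return outcome;       // hand off to RESOLVE then SCORE immediately
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  return null;                                   // window expired: flag for human review
}
```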

Why this matters: Consuming desks get tomorrow's signals based on tonight's results. That speed compounds over a season — if we resolve sports at 1AM and competitors resolve next morning, we have a 6-hour head start on signal generation.

Fallback: If API is slow or game goes to extra innings, queue for human review rather than waiting indefinitely. Disputed outcomes (protests, stat corrections) flagged for manual override.

[Grok 4 Reasoning — "API rate limits or downtimes delay resolution; disputed outcomes require manual overrides"]

DAG Execution (NOT Fixed Cron)

[Opus — "the cron schedule has a race condition"]

Steps are chained as a DAG — each step triggers the next on completion. No fixed-time scheduling for downstream steps.

COLLECT (source-specific schedules)
   ↓ (on completion)
VALIDATE (Data Validator checks extractions)
   ↓ (on completion)
RESOLVE (match predictions to outcomes)
   ↓ (on completion)
SCORE (recompute all affected analyst scores)
   ↓ (on completion)
SIGNAL (recompute follow/fade signals)
   ↓ (on completion)
NOTIFY (alert consuming desks of new/changed signals)

If resolution runs long (data source slow, game went to extra innings), downstream steps wait instead of running on stale data.
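
A minimal sketch of the chained execution, assuming each step is an async function that fires only when the previous one completes (the step names mirror the DAG above; the runner itself is illustrative):

```typescript
type Step = (batchId: string) => Promise<void>;

// Each downstream step runs only after the previous one resolves. There is no fixed-time cron,
// so a slow RESOLVE simply delays SCORE/SIGNAL/NOTIFY instead of letting them run on stale data.
async function runPipeline(batchId: string, steps: Record<string, Step>): Promise<void> {
  const order = ['collect', 'validate', 'resolve', 'score', 'signal', 'notify'];
  for (const name of order) {
    try {
      await steps[name](batchId);
    } catch (err) {
      // Halt the chain: better to produce no signal tonight than a signal built on bad data.
      console.error(`[pipeline ${batchId}] ${name} failed:`, err);
      return;
    }
  }
}
```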

Source Health Monitor (script, no LLM): alerts when any collector returns zero results or gets a 403/429. Runs as a check after each collection step.

[Panel consensus — new role]

Desk Self-Scoring — Killswitch (V1)

[Big Ideas #10 — Opus ruled V1 over split panel vote. "If you build a signal-generating desk without measuring whether its signals work, you're flying blind."]

The desk_meta_scores table (already in schema) tracks whether follow/fade signals actually produce winning outcomes for consuming desks.

How it works:

  1. When a follow/fade signal is generated, log it with timestamp + signal details
  2. When the underlying prediction resolves, check: did the signal's direction win?
  3. Aggregate hit rates by signal_type, strength, and market
  4. Killswitch threshold: If follow signals hit <52% over 100+ signals, the desk is not adding value. Review methodology or shut down.
  5. Monthly report to EDGE TEAM group with signal P&L summary

Must exist before the desk sends its first signal to another desk. This is non-negotiable.

[Grok Fast — "meta beats base"]
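
A minimal sketch of the killswitch check against aggregated desk_meta_scores rows (the row shape mirrors the table; the 52%-over-100-signals threshold comes from the list above):

```typescript
interface MetaScoreRow { signal_type: 'follow' | 'fade'; total_signals: number; correct: number }

// Returns true when follow signals have enough volume to judge and are not beating ~52%.
// Tripping this means: stop sending signals, review methodology (or shut the desk down).
function killswitchTripped(rows: MetaScoreRow[], minSignals = 100, minHitRate = 0.52): boolean {
  const follows = rows.filter(r => r.signal_type === 'follow');
  const total = follows.reduce((s, r) => s + r.total_signals, 0);
  const correct = follows.reduce((s, r) => s + r.correct, 0);
  if (total < minSignals) return false;          // not enough evidence yet
  return correct / total < minHitRate;
}
```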


Analyst Roles for This Desk

| Role | Model | Job |
| --- | --- | --- |
| Data Collector | TBD — AUDITION REQUIRED | Parse scraped pages, extract predictions, normalize to structured fields |
| Confidence Scorer | TBD — AUDITION REQUIRED | Classify implied confidence from language cues when analyst doesn't state a number |
| Score Calculator | Code (no LLM needed) | Pure math — Brier scores, accuracy %, calibration, skill scores, CIs |
| Signal Analyst | Grok 4.1 Fast (code for signals, LLM for narratives) | Signal generation is pure SQL. LLM writes the briefing narratives. |
| Contrarian | Sonnet 4.6 (reused) | Challenge signals — "this fade signal might be regression to mean, not real negative skill" |
| Data Validator / Curator | TBD — AUDITION REQUIRED | QA on parsed predictions before DB commit. Checks extraction against source text. |
| Outcome Resolver | Reuse Settlement AI (Gemini Flash) | Handle ambiguous outcomes: GAAP vs non-GAAP EPS, OT affecting spreads, GDP revisions |
| Source Health Monitor | Script (no LLM) | Alert when scrapers return zero results or APIs 403 |

Data Collector — Critical Role

[Opus, Gemini Pro — "catastrophic mistake to use Haiku here"]

This is NOT a simple parser. The Data Collector must:

  1. Distinguish genuine predictions from analysis and commentary that merely sounds like a prediction
  2. Extract subject, value, direction, and target date into the structured fields
  3. Handle texts that contain multiple predictions
  4. Detect revisions of earlier calls and map analyst names to canonical IDs (via analyst_aliases)

Haiku is NOT suitable. Haiku is fine for structured extraction from semi-structured data (JSON, tables). It is mediocre at interpreting intent and confidence from natural language. The marginal cost increase for a stronger model is trivial compared to the cost of corrupted data.

Audition candidates: Gemini Flash, Grok 4.1 Fast, GPT-4.1-mini, Sonnet
Test design: 30 real prediction snippets — 15 real predictions, 10 traps (analysis that looks like predictions), 5 multi-prediction texts. Score on field-level accuracy across subject, value, direction, and target date.

[Panel synthesis — audition spec]

Confidence Scorer — New Role (V1)

[GPT-4.1 — proposed new role. Big Ideas panel unanimous: separate LLM call. Opus judge ruling: AFTER Data Collector.]

Reads raw prediction text and classifies implied confidence from language cues. Runs AFTER the Data Collector extracts the prediction, BEFORE database commit.

Pipeline position: AFTER Data Collector. The Collector's job is "is this a prediction? who made it? what market?" The Confidence Scorer's job is "how confident are they?" These are different tasks. The Scorer needs the Collector's output (analyst identity, market type) to properly calibrate.

[Opus ruling — overruled GPT/Sonar who wanted it before]

Raw text → Data Collector (extract prediction) → Confidence Scorer (assign confidence) → Data Validator → DB commit

Separate LLM call: Yes (unanimous). Allows modular testing, A/B testing, and different model selection from Data Collector.

Model assignment: Haiku 4.5

[Opus ruling — overruled panel suggestions of GPT-3.5/Sonnet]

Input: raw text + context (source, analyst, market type)
Output format:

{ "confidence": 0.0-1.0, "reasoning": "string", "flagged": boolean }

Hybrid Brier scoring rules: stated confidence (confidence_source = 'stated') feeds the official Track 1 Brier and calibration metrics; inferred confidence (confidence_source = 'inferred') is stored and may be Brier-scored separately, but never mixes into the stated-confidence track; flagged texts get no confidence value at all.

[All 4 panelists converged on hybrid approach. Opus agreed.]

10 calibration examples:

[Grok Fast — "most operationally useful examples"]

| Input Text | Score | Reasoning |
| --- | --- | --- |
| "This is the lock of the week: KC covers easily." | 0.95 | Absolute terms like 'lock' and 'easily' = extreme confidence |
| "Absolute slam dunk: KC by 10+." | 0.98 | Hyperbolic 'slam dunk' + specific margin = near-certainty |
| "Based on stats, KC is a must-bet at -3.5." | 0.85 | Evidence-based + 'must-bet' = high confidence |
| "Picking KC to win." | 0.70 | Straightforward, no emphasis or doubt = moderate |
| "I think KC might cover, but it's close." | 0.60 | Qualifiers 'might' and 'close' = moderate uncertainty |
| "KC covers or it's a push; either way, avoid the under." | 0.65 | Multiple scenarios = diluted conviction |
| "I guess I'll go with KC, but I'm not thrilled." | 0.55 | Reluctance via 'guess' + 'not thrilled' = low confidence |
| "Possibly KC, depending on weather." | 0.50 | Conditional 'possibly' + 'depending' = low commitment |
| "KC wins because aliens told me—haha, just kidding, but seriously maybe." | 0.40 | Joke mixed with 'maybe' = unreliable |
| "Sure, KC covers because pigs fly—total joke pick." | -1.0 (flagged) | Sarcasm: 'pigs fly' + 'joke' = not a real prediction |

Failure modes: sarcasm or joke picks scored as genuine conviction (the flagged output exists for this), hyperbole-heavy writers systematically reading as overconfident, and slang drift over time (see Model Drift under Systemic Risks).

Audition required. Test with 30 real snippets covering all 10 example types above. Score on alignment with human judgment.

Signal Analyst — Split Role

[Opus — "signal generation logic is entirely rule-based"]

Signal computation (follow/fade thresholds, Wilson CIs, skill scores) is pure SQL/code. Deterministic and testable.

The LLM (Grok 4.1 Fast) only handles: writing the briefing narrative for consuming desks. "Goldman's FX desk has a -0.12 skill score on EUR/USD over 3 years, their latest call is 1.15 by Q2, this is a strength-3 fade."

Contrarian — Domain-Specific Prompting

[Opus — "needs different system prompt than general-purpose Contrarian"]

The Human Intel Contrarian must be given real statistical ammunition: confidence intervals, probability of observing the record by chance, comparison to similar analysts. System prompt focused on: regression to the mean, the chance that the observed record arises at this sample size, survivorship and selection effects, and whether the record is just consensus-following.

Without this data, the Contrarian role is theater.


Edge Cases & Rules

Cancelled/Postponed Events

Analyst Name Normalization

[Opus — "add analyst_aliases table for name normalization"]

Consensus as Analyst

Stale Analysts

Survivorship Bias

[Opus, DeepSeek — flagged as risk]

Public analysts who get famous for being right attract attention; analysts who are consistently wrong quietly disappear. The desk will naturally accumulate more data on "good" analysts (who keep publishing) and less on "bad" ones (who stop). This creates survivorship bias in the follow signal pool. The fade signal pool will be thin because genuinely bad forecasters either stop or get fired.

Mitigation: Track all analysts from first discovery regardless of activity level. Archive predictions even if the analyst deletes them. Report survivorship stats in desk meta-scores.

Correlated Predictions

[Opus — "10 analysts copying each other != 10 independent signals"]

When multiple analysts make the same call, check independence. If 10 ESPN experts all pick KC, that might be 10 independent signals or it might be 1 signal repeated 10 times. The consensus_deviation score in the scores table tracks this. Weight independent accurate analysts higher.

Data Retention

[Opus — "the spec says nothing about data retention"]


V1 Scope (Free Data Only)

V1 priority order:

  1. NFL/NBA/NHL ATS picks from ESPN — start here. Cleanest data, binary outcomes, fastest resolution, existing infrastructure.
  2. Estimize earnings estimates — API makes collection easy, clear resolution via SEC filings.
  3. Weather models (GFS vs ECMWF vs ICON) — already pulling this data for Kalshi weather bot.
  4. FRED macro + GDPNow — structured API, no scraping needed.

[Panel consensus — all 6 panelists said start with sports. 4 of 6 said sports + earnings for Phase 1.]

NOT in v1 (requires paid or complex setup): Crypto Twitter accounts, Polymarket/Kalshi, bank FX forecast scraping, and TradingView ideas — all deferred to Phase 3.

[Opus — "Crypto Twitter is the most signal-rich environment but excluded for good reason"]

Backfill Plan

[Opus — "the spec has no backfill plan. Starting from zero means no signals for months."]

Before going live, backfill historical data to validate the pipeline:

  1. Scrape ESPN expert pick archives for the current NFL/NBA/NHL season
  2. Pull historical Estimize data via their API for last 2 earnings seasons
  3. Download FRED historical forecast surveys (SPF has archives going back decades)
  4. Pull historical weather model verification data from NWS

This lets you validate the full collect -> resolve -> score pipeline against known outcomes, catch parsing and scoring bugs on historical data, and launch with preliminary accuracy profiles instead of waiting months for the first signals.


Implementation Checklist

Phase 1: MVP (Sports ATS only — 2-3 weeks)

[Panel consensus — "nail this one vertical end-to-end before expanding"]
V1 Big Ideas: Speed of Resolution, Skill Score, Confidence Scorer, Desk Self-Scoring

  1. Create SQLite database + all tables (with triggers)
  2. Build ESPN expert pick scraper
  3. Build outcome resolver (closing lines from Odds API, scores from sports sheets)
  4. Speed of Resolution architecture — poll APIs every 15 min post-event, resolve within 2 hours (Big Idea #9)
  5. Build scoring engine with unit tests (Brier, accuracy, skill score with vig adjustment, Wilson CIs) (Big Idea #7)
  6. Audition Data Collector role (30 real snippets test)
  7. Audition Confidence Scorer role — Haiku 4.5, ~$6/month (Big Ideas Panel)
  8. Audition Data Validator role
  9. Build DAG execution pipeline (collect -> validate -> confidence score -> resolve -> score -> signal)
  10. Desk self-scoring — desk_meta_scores table tracking signal hit rates from day 1 (Big Idea #10)
  11. Backfill current season ESPN picks
  12. Paper trade follow/fade signals for 30 days before trusting

[DeepSeek, Grok — "paper trade from day 1"]

Phase 2: Expand (+ Earnings + Weather + V2 Big Ideas — Month 2-3)

V2 Big Ideas: Update Tracking, Independence Scoring, Smart Consensus, Cluster Detection, Forecaster Sharpe

  1. Add Estimize collector + SEC EDGAR resolver
  2. Add weather model collectors (NWS, Open-Meteo)
  3. Add Signal Analyst (Grok for narratives) + Contrarian (Sonnet)
  4. Wire briefing interface to consuming desks
  5. Prediction Update Tracking — trajectory scoring for analyst revisions (Big Idea #1)
  6. Analyst Independence Scoring — consensus correlation, rolling 30-day windows (Big Idea #3)
  7. Smart Consensus — weighted aggregation with extremization, 10+ analysts, 50+ predictions (Big Idea #2)
  8. Prediction Cluster Detection — 3+ independent analysts agreeing, 1-hour buffer window (Big Idea #6)
  9. Forecaster Sharpe Ratio — consistency metric, rolling 50-prediction windows (Big Idea #8)

Phase 3: Mature (Month 4+)

  1. Add macro (FRED, GDPNow, SPF)
  2. Add crypto (CT accounts, Polymarket/Kalshi)
  3. Add forex (bank forecasts, TradingView)
  4. Build cross-market analyst fingerprinting
  5. Reasoning Extraction — extract + score prediction reasons (Big Idea #4, V3)
  6. Narrative Scoring — score belief systems as pseudo-analysts (Big Idea #5, V3)
  7. Add desk to dashboard
  8. Quarterly re-audition of LLM roles against fresh samples (Model Drift mitigation)

Risks (Panel Consensus)

  1. Data quality is the existential risk. Everything downstream depends on predictions being correctly captured. A single systematic parsing error corrupts an entire analyst's score. No recovery from bad data — must delete and re-score.

[All 6 panelists]

  2. Scrapers WILL break constantly. ESPN/CBS change HTML without warning. Budget for weekly scraper maintenance. Isolate scrapers behind adapters.

[All 6 panelists]

  3. False signals from small samples. Even with raised minimums, expect 2-3 false follow signals per scoring cycle. Wilson CIs + Bayesian approach mitigate but don't eliminate.

[Opus, Gemini Pro, DeepSeek]

  4. False confidence propagation. Systems that produce numerical scores create an illusion of precision. Always report CI bounds alongside point estimates. Consuming desks must understand these are probabilistic, not certain.

[Opus]

  5. Legal risk (low). Scraping ESPN/CBS violates TOS. Practical risk for non-commercial research is minimal. If this becomes a product, need licensing or API access.

[Gemini Pro, Grok]

  6. Survivorship bias. Good analysts keep publishing, bad ones disappear. Fade signal pool will be thin.

[Opus, DeepSeek]


Success Metrics (How We Know the Desk Is Working)

The desk's success is NOT its own accuracy — the desk doesn't predict anything. Success is measured by:

  1. Signal yield rate: What % of analyst-market pairs generate a follow or fade signal after 90 days? If below 5%, desk is consuming resources without producing actionable output.

[Opus]

  2. Signal persistence: When an analyst earns a follow signal, do they maintain above-threshold accuracy for the next 50 predictions? If follow signals decay to baseline within a month, they're capturing randomness.

[Opus]

  3. Downstream impact: When another desk acts on a follow/fade signal, does their outcome improve? Track in desk_meta_scores.

[Opus, Gemini Pro]

  4. Desk meta-score: If follow signals are hitting 51% after 6 months (barely above coin flip), the desk is producing noise and needs restructuring or shutdown.

[Opus — "the spec does not address what happens when the desk is wrong"]


V2 Features (Month 2-3) — Big Ideas Panel

[All items below reviewed by 5-model panel + Opus judge. Build order reflects dependencies.]

Prediction Update Tracking (Big Idea #1)

[Feasibility: 8.5/10 | Value: 7.8/10 | Opus: V2]

Don't just log the initial call. Track when analysts revise predictions as new information arrives.

Example: An analyst says "KC -3.5" on Monday, then shifts to "KC -1.5" on Thursday after an injury report. The current schema captures both (via parent_prediction_id), but V2 adds trajectory scoring.

Trajectory scoring: Did they move in the right direction when new info arrived? Score the update, not just the final call. This separates analysts who update well (Metaculus-style) from those who anchor on first impressions.
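
A minimal sketch of scoring a single revision against the eventual actual, assuming numeric predictions linked via parent_prediction_id (the +1/0/-1 rule is illustrative, not the settled V2 formula):

```typescript
// Did the revision move toward the eventual actual? +1 good update, -1 bad update, 0 no change.
function revisionDirectionScore(original: number, revised: number, actual: number): number {
  const before = Math.abs(original - actual);
  const after = Math.abs(revised - actual);
  if (after < before) return 1;    // moved toward the truth when new info arrived
  if (after > before) return -1;   // anchored worse or overreacted
  return 0;
}

// Example: "KC -3.5" revised to "KC -1.5", game lands KC -1: the Thursday update improved the call.
console.log(revisionDirectionScore(-3.5, -1.5, -1));  // 1
```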

Schema: Already in place — parent_prediction_id in predictions table links revisions. V2 adds:

-- Add to scores table
ALTER TABLE scores ADD COLUMN update_skill REAL;  -- trajectory accuracy: did revisions improve?
ALTER TABLE scores ADD COLUMN avg_revisions REAL;  -- how often they update (updating skill vs stubbornness)

Gotcha: Analysts might not explicitly state revisions. Data Collector must detect updates via keyword matching ("updating my pick", "changing my call") and same-subject matching within time windows.

[Grok 4 Reasoning — "could inflate scores if revisions are retroactively cherry-picked post-event"]

Analyst Independence Scoring (Big Idea #3)

[Feasibility: 6.3/10 | Value: 8.8/10 | Opus: V2. Gemini gave it 10/10 value. Prerequisite for Cluster Detection.]

Track correlation between each analyst's predictions and consensus. A 60% accurate analyst with LOW consensus correlation is far more valuable than a 62% accurate analyst who just picks the favorite every time.

Formula:

independence_score = 1 - pearson_correlation(analyst_predictions, market_consensus)
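
A minimal sketch of the computation, assuming analyst picks and the consensus direction are encoded as +1/-1 over a rolling 30-day window (the encoding and helpers are illustrative):

```typescript
// Pearson correlation of two equal-length series (e.g., analyst picks vs consensus, encoded +1/-1).
function pearson(xs: number[], ys: number[]): number {
  const n = xs.length;
  const meanX = xs.reduce((s, v) => s + v, 0) / n;
  const meanY = ys.reduce((s, v) => s + v, 0) / n;
  let cov = 0, varX = 0, varY = 0;
  for (let i = 0; i < n; i++) {
    cov += (xs[i] - meanX) * (ys[i] - meanY);
    varX += (xs[i] - meanX) ** 2;
    varY += (ys[i] - meanY) ** 2;
  }
  return varX === 0 || varY === 0 ? 0 : cov / Math.sqrt(varX * varY);  // degenerate series treated as uncorrelated
}

// independence_score = 1 - correlation with consensus: an analyst who always picks the
// consensus side scores near 0, a genuinely independent analyst scores closer to 1.
function independenceScore(analystPicks: number[], consensusPicks: number[]): number {
  return 1 - pearson(analystPicks, consensusPicks);
}
```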

Schema: Already partially in place — consensus_deviation in scores table. V2 computes it properly:

-- Computed as rolling 30-day correlation between analyst picks and consensus direction
-- Updated after each scoring cycle
-- Range: 0.0 (always agrees with consensus) to 1.0 (maximally independent)

How it's used: independent, accurate analysts get extra weight in Smart Consensus (#2) and are the only analysts eligible to form clusters (#6); a high accuracy score earned mostly by echoing consensus is discounted.

Gotcha: Correlation metrics skewed in low-volume markets. Use rolling 30-day windows to smooth noise.

[Grok 4 Reasoning — "use rolling correlations to smooth noise"]

Systemic risk: If all tracked analysts read the same sources, "independence" is an illusion. True independence requires different information sources, not just different conclusions.

[Gemini Pro — "data source monoculture" risk]

Smart Consensus — Weighted Aggregation (Big Idea #2)

[Feasibility: 7.3/10 | Value: 9.0/10 | Opus: V2. Unanimous highest-value V2 feature. Depends on #3 Independence + #7 Skill Score.]

Once 10+ analysts have 50+ resolved predictions in a market, compute a weighted average where weights are proportional to historical skill score. Compare this "smart consensus" to market consensus (closing line, Wall Street estimate). When they diverge, that is a genuine signal.

This is the core mechanism behind the Good Judgment Project's Superforecaster aggregation, and it consistently outperforms both individual forecasters and prediction markets.

Extremization formula (push aggregate toward tails):

extremized_probability = p^a / (p^a + (1-p)^a)

where a > 1 (typically 1.5-2.5). Corrects for the well-documented tendency of individual forecasters to be underconfident.
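
A minimal sketch of the aggregation-plus-extremization step; the weights, probabilities, and a = 2.0 below are illustrative values, not tuned parameters:

```typescript
// Skill-weighted average of individual probability forecasts.
function weightedAverage(probs: number[], weights: number[]): number {
  const totalWeight = weights.reduce((s, w) => s + w, 0);
  return probs.reduce((s, p, i) => s + p * weights[i], 0) / totalWeight;
}

// Push the aggregate toward 0 or 1: p^a / (p^a + (1-p)^a), with a > 1 (typically 1.5-2.5).
function extremize(p: number, a = 2.0): number {
  return Math.pow(p, a) / (Math.pow(p, a) + Math.pow(1 - p, a));
}

// Three skill-weighted forecasters aggregate to ~0.62 and extremize to ~0.72 with a = 2.0.
const aggregate = weightedAverage([0.60, 0.65, 0.60], [1.2, 0.9, 0.6]);
console.log(extremize(aggregate));
```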

Implementation:

-- Weighted consensus view
CREATE VIEW smart_consensus AS
SELECT
    p.market, p.sub_market, p.subject, p.target_date,
    SUM(p.predicted_value_numeric * s.skill_score) / SUM(s.skill_score) AS weighted_prediction,
    COUNT(DISTINCT p.analyst_id) AS analyst_count,
    AVG(s.consensus_deviation) AS avg_independence
FROM predictions p
JOIN scores s ON p.analyst_id = s.analyst_id AND p.market = s.market
WHERE s.total_predictions >= 50
    AND s.skill_score > 0  -- only include analysts with positive skill
    AND p.status = 'pending'
GROUP BY p.market, p.sub_market, p.subject, p.target_date
HAVING analyst_count >= 10;

Signal: When smart_consensus diverges from market consensus by >5%, flag as actionable signal for consuming desks.

Survivorship bias warning: Only accurate analysts accumulate 50 predictions, so the weighted consensus has built-in selection bias. Mitigate by including ALL analysts with min 20 predictions, not just the good ones — weight by skill score (which can be negative for bad analysts).

[Sonar Reasoning Pro — best gotcha on this idea]

Gotcha: Small sample sizes early on produce noisy weights. If analysts game by making safe predictions to build history, it dilutes the "smart" aspect. Mitigate with minimum variance thresholds.

[Grok 4 Reasoning]

Prediction Cluster Detection (Big Idea #6)

[Feasibility: 6.8/10 | Value: 8.5/10 | Opus: V2. Unanimous. Depends on #3 Independence.]

When 3+ independently accurate analysts suddenly agree on the same call, that's a stronger signal than any one of them alone. Detect these clusters in real-time.

Key insight: Weight by independence score. A cluster of independent thinkers agreeing is far more significant than a cluster of consensus followers echoing each other.

Implementation:

-- Detect clusters: 3+ analysts with independence > 0.6 agreeing on same subject
SELECT
    p.subject, p.target_date, p.predicted_direction,
    COUNT(DISTINCT p.analyst_id) AS cluster_size,
    AVG(s.skill_score) AS avg_skill,
    AVG(s.consensus_deviation) AS avg_independence,
    GROUP_CONCAT(a.name) AS analysts
FROM predictions p
JOIN scores s ON p.analyst_id = s.analyst_id AND p.market = s.market
JOIN analysts a ON p.analyst_id = a.id
WHERE p.status = 'pending'
    AND s.consensus_deviation > 0.6   -- independent thinkers only
    AND s.skill_score > 0             -- positive skill only
    AND s.total_predictions >= 30
GROUP BY p.subject, p.target_date, p.predicted_direction
HAVING cluster_size >= 3;

Signal strength: cluster_size x avg_independence x avg_skill = composite cluster score. Higher = stronger signal.

1-hour buffer window: Don't trigger cluster detection on every single prediction insert. Batch-check every hour to account for time zone differences in when analysts publish.

[Grok 4 Reasoning — practical timing detail]

Gotcha: False positives from coincidental agreements in popular markets (everyone picks the Patriots). The independence filter handles this — if they're all picking the obvious favorite, their independence scores will be low and the cluster won't fire.

Forecaster Sharpe Ratio (Big Idea #8)

[Feasibility: 8.0/10 | Value: 7.0/10 | Opus: V2. Supplementary consistency metric.]

Reward CONSISTENCY over sporadic brilliance.

forecaster_sharpe = avg_accuracy / stddev_of_errors

An analyst who is 58% accurate with low variance is more trustworthy for systematic trading than one who is 65% accurate but swings wildly between 80% and 40% streaks.

Schema: Already in place — error_stddev in scores table. V2 adds:

ALTER TABLE scores ADD COLUMN sharpe_ratio REAL;  -- avg_accuracy / error_stddev

Computed using rolling 50-prediction windows to handle regime changes (an analyst who was great in bull markets but terrible in bear markets).

[Grok 4 Reasoning — "rolling 50-pred windows" for regime handling]

Gotcha: Low sample sizes inflate stddev, making new analysts look inconsistent. Only compute for analysts with 30+ resolved predictions.


V3 Features (Month 4+) — Future Research

Reasoning Extraction (Big Idea #4)

[Feasibility: 5.3/10 | Value: 7.0/10 | Opus: V3. "LLM hallucination risk too high for early versions."]

When available, extract the stated REASON alongside the prediction ("KC covers because their defense is elite in cold weather"). Over time, score reasoning patterns: is "cold weather advantage" actually predictive? Do analysts who cite injury reports outperform those citing momentum?

Schema additions:

ALTER TABLE predictions ADD COLUMN reasoning_text TEXT;      -- extracted reason
ALTER TABLE predictions ADD COLUMN reasoning_category TEXT;  -- "injury", "weather", "momentum", "matchup", etc.

Minimum instances: Require 20+ predictions citing a reasoning category before scoring that category's predictive value.

[Grok 4 Reasoning — prevents overfitting on sparse categories]

Warning: LLM misclassification risk is high (sarcasm as serious reasoning, commentary as prediction). Defer until Data Collector is battle-tested.

Narrative Scoring (Big Idea #5)

[Feasibility: 4.8/10 | Value: 7.0/10 | Opus: V3. Unanimous. "Coolest idea, hardest to execute. Research project."]

Create pseudo-analyst entries for IDEAS, not people. Example: analyst_id for "inflation is transitory" or "AI bubble will pop." Track all predictions made by proponents of that narrative. Score the narrative itself.

Practical constraint: Limit to 10-20 predefined narratives to reduce ambiguity. Narratives evolve and overlap ("inflation transitory" vs "Fed pivot") — unbounded narrative detection creates double-counting chaos.

[Grok 4 Reasoning — "limit to 10-20 predefined narratives"]

Schema: New narratives table + analyst_narrative join table. LLM tags predictions to narratives post-ingestion.

May never be worth the cost. Evaluate after V2 data proves the base system works.


Systemic Risks (from Big Ideas Panel)

[Gemini Pro — "three systemic risks nobody else flagged"]

  1. Data Source Monoculture — If all tracked analysts read the same sources (ESPN, same Twitter accounts), "independence" is an illusion. True independence requires different information sources, not just different conclusions. Monitor for this by tracking source overlap.

  2. Goodhart's Law — Analysts will game the system once they know the scoring methodology. They could make vague, untrackable predictions or spam low-stakes calls to reset stats. Mitigation: Keep scoring methodology private. Only ingest predictions that are specific, measurable, and time-bound.

[Gemini Pro — "most important systemic risk"]

  3. Model Drift — The LLM models (Data Collector, Confidence Scorer) will degrade as language and jargon evolve. Sports betting slang in 2027 may differ from 2026. Mitigation: Quarterly re-audition of LLM roles against fresh real-world samples.
Source: ~/edgeclaw/results/human-intel-desk/spec-final.md