Stocks Desk — Data Collection Spec (MERGED FINAL)

Panel review: Claude Opus 4.6 (base) + DeepSeek V3.1, Gemini 3.1 Pro, GPT-4.1, Grok 4 Fast, Grok 4.1, Sonar Reasoning Pro

Merged: March 15, 2026. Opus is the complete base; panel additions tagged [PANEL: Model].


NOTE ON BROKER: Stock trade execution broker is TBD. All references in this spec use broker-agnostic language. The equity trading platform will be configured separately. Prediction markets (Kalshi, Polymarket) are unaffected by broker choice.

NOTE ON STACK: Data collection scripts run in Python (best financial ecosystem). The orchestration layer and prediction pipeline run in TypeScript/Node.js. Communication between them uses a shared PostgreSQL database with a job_queue table. See Section L for integration architecture. [PANEL: Opus — aplus additions]


What This Document Is

This is the complete data collection specification for the Stocks desk in the research pipeline. It covers equity trades on a TBD broker AND event contracts traded on Kalshi/Polymarket. An AI builder should be able to read this and know exactly what data to collect, from where, how often, in what format, and why it matters for finding mispriced stocks or prediction market contracts.

The Business Model (Three Venues)

Unlike sports desks that only trade prediction markets, the Stocks desk trades on THREE platforms:

  1. Equity broker (TBD) — Actual stock trades (buy/sell shares on individual equities). We buy and sell real stocks.
  2. Kalshi — Binary prediction markets on stock events: price thresholds ("TSLA above $300 by Friday"), earnings outcomes ("AAPL beats EPS estimate"), index levels ("S&P 500 above 5500 by month end").
  3. Polymarket — Similar binary contracts on crypto rails (USDC on Polygon). Includes M&A, regulatory outcomes, CEO changes, index milestones.

The cross-venue edge: Fundamental analysis tells us a stock is mispriced. The same analysis also tells us whether a Kalshi/Polymarket contract is mispriced. Example: if our fundamental model says NVDA fair value is $950 and Kalshi prices "NVDA above $900 by April 30" at 55 cents, our model implies ~75% probability — that contract is cheap. We trade BOTH venues simultaneously.

How Stocks Are Different From Sports/Options

An AI builder must understand these structural differences:

  1. No Sharp External Benchmark — In sports, sharp bookmakers set fair lines. For stock events, there IS no sharp external reference. We must BUILD the fair value estimate from fundamentals, flow data, and volatility modeling.

  2. Continuous Price Discovery — Stocks trade every second, 6.5 hours per day. Data changes constantly. Sports bets lock in at game time. Timing of data collection matters more.

  3. Multi-Factor Valuation — A stock's fair value depends on earnings, balance sheet, macro regime, sector rotation, institutional positioning, insider behavior, credit markets, and sentiment simultaneously. No single data source is sufficient.

  4. Prediction Market Carry Cost — Buying a Kalshi contract at 70 cents locks up 70 cents until settlement. The carry cost is: ModelProbability * SOFR_annual * DaysToExpiry / 365. At current SOFR (~4.3%), a 70-cent 90-day contract has ~0.76 cents of carry cost. Prediction market prices should trade BELOW option-implied probabilities by this carry cost. If they trade ABOVE, the prediction market is overpriced. Do NOT hardcode the risk-free rate — wire to live SOFR from FRED. [PANEL: Opus — carry cost correction]

  5. Regime Dependence — Stock strategies that work in bull markets fail in bear markets. Every data source must be tagged with which market regime it applies to.

  6. SEC Filings Are the Goldmine — The SEC requires public companies to disclose everything. This free government data is richer than any paid service for stocks.


SECTION A: PRICE AND MARKET DATA (Foundation)

1. Daily OHLCV for Full Universe

What: Daily Open, High, Low, Close (adjusted for splits/dividends), and Volume for every stock we track.

Universe:

What to pull per ticker:

Calculated technicals (on ingestion):

Primary source: Polygon.io REST API — start Day 1, not Month 2. Endpoint: GET /v2/aggs/grouped/locale/us/market/stocks/{date} (all tickers, one call per day). Historical backfill: GET /v2/aggs/ticker/{ticker}/range/1/day/{from}/{to}. Cost: $29/month. This is the most important infrastructure decision in the spec. [PANEL: Opus — moved Polygon to Day 1]

Fallback: yfinance (unofficial Yahoo Finance scraper) — use ONLY as backup when Polygon has an outage, and ONLY for historical backfill. yfinance breaks 2-4 times per year when Yahoo changes its undocumented API. It is NOT acceptable as a primary production source.

Health check: If Polygon returns HTTP 5xx or empty data for >5 minutes, automatically fall back to yfinance with an alert. Log every fallback event.

After-hours / pre-market data: Collect extended-hours price/volume via Polygon for catalysts that occur outside regular hours (earnings, guidance, M&A). Many moves begin in extended hours. [PANEL: GPT-4.1]

History: 10 years for S&P 500. Full available history for any traded stock.

Collection frequency: Daily at 5:00 PM ET (after market close + after-hours adjustments settle).

Historical data tagging: All rows must include effective date and index membership flags. Never overwrite past constituents or remove delisted tickers. [PANEL: GPT-4.1 — schema flaw fix]

Storage:

CREATE TABLE daily_prices (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    trade_date DATE NOT NULL,
    open REAL, high REAL, low REAL, close REAL,
    adj_close REAL, volume INTEGER,
    relative_volume REAL,
    rsi_14 REAL, macd REAL, macd_signal REAL,
    bb_upper REAL, bb_lower REAL, bb_pct REAL,
    atr_14 REAL,
    market_cap REAL, free_float REAL,
    is_sp500 INTEGER DEFAULT 0,
    is_ndx100 INTEGER DEFAULT 0,
    is_delisted INTEGER DEFAULT 0,
    delisted_date DATE,
    data_source TEXT DEFAULT 'polygon',
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, trade_date)
);
CREATE INDEX idx_daily_prices_ticker_date ON daily_prices(ticker, trade_date);
CREATE INDEX idx_daily_prices_date ON daily_prices(trade_date);

Why it matters: Foundation for ALL other signals. Relative volume spikes flag unusual activity before the cause is known. For prediction markets: current price trajectory directly determines probability of hitting Kalshi threshold prices.

2. Sector and Factor ETF Returns

What: Daily returns for all 11 GICS sectors and standard factor ETFs.

Sector ETFs (11): XLK, XLF, XLE, XLV, XLI, XLP, XLY, XLC, XLRE, XLB, XLU Factor ETFs: MTUM, QUAL, VLUE, SIZE, USMV, IWD (Value), IWF (Growth)

Calculated metrics:

Source: Polygon.io (primary). Also Fama-French factor data from Kenneth French Data Library (free CSVs). AQR factor datasets (free registration). Collection frequency: Daily at close. Monthly download of updated Fama-French/AQR data.

Storage:

CREATE TABLE factor_returns (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    trade_date DATE NOT NULL,
    ticker TEXT NOT NULL,
    daily_return REAL,
    sector TEXT,
    factor_type TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, trade_date)
);

Why it matters: Factor attribution on every trade reveals whether alpha came from stock selection or factor exposure. Regime-conditional factor performance guides position sizing.


SECTION B: SEC EDGAR PIPELINE (The Goldmine — All Free)

SEC EDGAR is the single richest free data source for stock analysis. Base URL: https://efts.sec.gov/LATEST/. User-Agent REQUIRED: "CompanyName AdminEmail". Rate limit: 10 requests/second (use 5/sec conservatively).

CRITICAL architecture note: Do NOT poll each of ~700 tickers individually for new filings. Instead, poll the EDGAR full-text search endpoint once per filing type per cycle. Example: https://efts.sec.gov/LATEST/search-index?forms=4&dateRange=custom&startdt={today}&enddt={today} returns ALL Form 4 filings for the day in one call. Maintain a last_seen_filing_id watermark per filing type to avoid reprocessing. This pattern applies to Form 4, Form 144, 13D, 8-K, CORRESP, and all other monitored types. [PANEL: Opus — EDGAR architecture fix]

Legal risk: SEC has explicitly asked high-frequency scrapers to slow down. Mitigation: use RSS feed + daily batching, respect User-Agent and delay requirements, cache aggressively, document scraping practices. Store raw XML locally — never re-fetch a filing you already have. [PANEL: Sonar]

3. Form 4 — Insider Trades

What: Corporate insiders (CEO, CFO, directors, 10%+ owners) must report stock purchases/sales within 2 business days of the trade.

What to extract per filing:

Name canonicalization: SEC Form 4 insider names have variations ('Robert J Smith' vs 'Bob Smith'), name changes (marriages, legal changes), and typos. Use fuzzy matching (Levenshtein distance) + CIK primary key (which is permanent). Build a insider_name_canonical table — do not assume names are consistent across filings. [PANEL: Sonar]

Source: SEC EDGAR. Filing search: https://efts.sec.gov/LATEST/search-index?forms=4&dateRange=custom&startdt={date}&enddt={date}. Parse XML from each filing. Store raw XML locally before parsing — SEC XML structure changes every 2-3 years and you'll need to re-parse.

Collection frequency: Poll the daily feed every 5 minutes during 4 PM - 8 PM ET (when most Form 4s are filed), every 30 minutes otherwise. Do NOT check every ticker individually every 2 hours.

Storage:

CREATE TABLE insider_trades (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filing_id TEXT UNIQUE NOT NULL,
    ticker TEXT NOT NULL,
    company_cik TEXT,
    insider_name TEXT,
    insider_cik TEXT,
    relationship TEXT,
    transaction_type TEXT,
    is_10b51 INTEGER DEFAULT 0,
    shares_traded REAL,
    price_per_share REAL,
    total_value REAL,
    shares_owned_after REAL,
    filing_date DATE,
    transaction_date DATE,
    raw_xml_path TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_insider_trades_ticker ON insider_trades(ticker);
CREATE INDEX idx_insider_trades_date ON insider_trades(filing_date);
CREATE INDEX idx_insider_trades_cik ON insider_trades(insider_cik);

CREATE TABLE insider_network (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    insider_cik TEXT NOT NULL,
    company_cik TEXT NOT NULL,
    relationship TEXT,
    first_seen DATE,
    last_seen DATE,
    UNIQUE(insider_cik, company_cik)
);

CREATE TABLE insider_name_canonical (
    insider_cik TEXT PRIMARY KEY,
    canonical_name TEXT NOT NULL,
    name_variants TEXT,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

Key signals:

4. Form 144 — Pre-Sale Notice of Restricted Stock

What: Insiders must file Form 144 BEFORE selling restricted shares. This is filed days to weeks BEFORE the actual Form 4. Form 144 is the LEADING indicator of Form 4.

Important caveat: The lag between Form 144 and Form 4 varies by insider sophistication — some file 144 on the same day they transact. The leading indicator claim must be validated with data before being used in core signals. Track the actual lag per insider CIK. [PANEL: Sonar — Form 144 validation caveat]

Source: SEC EDGAR, filing type "144". URL: https://efts.sec.gov/LATEST/search-index?forms=144&dateRange=custom&startdt={date}&enddt={date}

Parsing note: Form 144 filings are less structured than Form 4. Use LLM-assisted parsing for the text portions. Expect parse failures — build a failure queue and alert on failure rate >20%. [PANEL: DeepSeek — Form 144 parsing complexity]

Collection frequency: Daily scan.

Storage:

CREATE TABLE form144_notices (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filing_id TEXT UNIQUE NOT NULL,
    ticker TEXT,
    company_name TEXT,
    insider_name TEXT,
    insider_cik TEXT,
    shares_to_sell REAL,
    estimated_proceeds REAL,
    filing_date DATE,
    planned_sale_date DATE,
    parse_status TEXT DEFAULT 'ok',
    raw_text TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

Key signal: Form 144 filing = selling is COMING but hasn't happened yet. Compare timing to subsequent Form 4 to measure actual lag per insider.

5. 13F Institutional Holdings (Quarterly)

What: Every institutional investment manager with >$100M AUM must file 13F-HR quarterly, disclosing all equity holdings.

What to extract:

Smart Money Cohort (~50 funds):

Source: SEC EDGAR. Filing index: https://efts.sec.gov/LATEST/search-index?forms=13F-HR&dateRange=custom. Parse infotable.xml for positions.

Collection frequency: Quarterly. 13F due 45 days after quarter-end (Feb 14, May 15, Aug 14, Nov 14). Bulk parse within 24 hours of deadline.

13F-NT tracker: Also monitor https://efts.sec.gov/LATEST/search-index?forms=13F-NT for late filing notifications. A smart money fund filing late (requesting extension) is more likely hiding embarrassing positions than an early filer. Replace the "early filing = no positions to hide" signal with "13F-NT by smart money = something changed they're not ready to reveal." [PANEL: Opus — 13F-NT correction]

Storage:

CREATE TABLE institutional_holdings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filer_cik TEXT NOT NULL,
    filer_name TEXT,
    period_of_report DATE NOT NULL,
    cusip TEXT,
    ticker TEXT,
    company_name TEXT,
    shares REAL,
    market_value REAL,
    investment_discretion TEXT,
    option_type TEXT,
    is_smart_money INTEGER DEFAULT 0,
    prev_shares REAL,
    shares_change REAL,
    is_new_position INTEGER DEFAULT 0,
    is_liquidated INTEGER DEFAULT 0,
    filing_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(filer_cik, period_of_report, cusip)
);

CREATE TABLE smart_money_signals (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    signal_date DATE NOT NULL,
    funds_initiated INTEGER DEFAULT 0,
    funds_liquidated INTEGER DEFAULT 0,
    funds_increased INTEGER DEFAULT 0,
    funds_decreased INTEGER DEFAULT 0,
    overlap_count INTEGER DEFAULT 0,
    convergence_score REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, signal_date)
);

Key signals:

6. N-PORT Monthly Fund Holdings

What: Mutual fund monthly portfolio holdings with 60-day public disclosure lag. N-PORT includes data 13F does NOT: securities lending income, total return swaps, CDS held, liquidity classification.

Simplification: Focus on top 50 ETFs/mutual funds (Vanguard, BlackRock) that drive 80% of flow. Skip securities lending/CDS unless trading those specific names. [PANEL: Grok 4.1 — N-PORT scope]

Source: SEC EDGAR, filing type "NPORT-P". URL: https://efts.sec.gov/LATEST/search-index?forms=NPORT-P Collection frequency: Monthly. Parse within 24 hours of new filing.

Storage:

CREATE TABLE nport_holdings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filer_cik TEXT NOT NULL,
    filer_name TEXT,
    period_date DATE NOT NULL,
    cusip TEXT,
    ticker TEXT,
    company_name TEXT,
    shares REAL,
    market_value REAL,
    liquidity_classification TEXT,
    securities_lending_flag INTEGER DEFAULT 0,
    filing_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(filer_cik, period_date, cusip)
);

Key signals:

7. 13D/13G Activist + Large Position Disclosure

What: 13D = filer intends to influence management (activist). 13G = passive large position (>5% ownership).

What to extract:

Activist Quality Tier List:

Source: SEC EDGAR. https://efts.sec.gov/LATEST/search-index?forms=SC+13D,SC+13D/A,SC+13G Collection frequency: Daily scan at 6 PM ET.

Storage:

CREATE TABLE activist_filings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filing_id TEXT UNIQUE NOT NULL,
    filer_name TEXT,
    filer_cik TEXT,
    target_ticker TEXT,
    target_cik TEXT,
    filing_type TEXT,
    shares_acquired REAL,
    pct_outstanding REAL,
    purpose_text TEXT,
    activist_tier TEXT,
    is_13g_to_13d INTEGER DEFAULT 0,
    filing_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

Key signals:

8. 10-K/10-Q Financials via XBRL

What: Quarterly and annual financial statements in machine-readable format.

What to extract:

Derived signals (calculate on ingestion):

Source: SEC EDGAR XBRL API. https://data.sec.gov/api/xbrl/companyfacts/CIK{cik_padded}.json returns ALL XBRL facts ever filed. Note: XBRL parsing is complex due to inconsistent corporate tagging. Consider Financial Modeling Prep (FMP) API ($19/month) as an alternative that parses XBRL into clean JSON. [PANEL: Gemini — XBRL complexity]

Collection frequency: Parse within 24 hours of each filing. Cluster around earnings seasons.

Storage:

CREATE TABLE fundamentals (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    cik TEXT,
    period_end DATE NOT NULL,
    period_type TEXT,
    revenue REAL, net_income REAL, eps_basic REAL, eps_diluted REAL,
    cfo REAL, capex REAL, fcf REAL,
    total_assets REAL, total_liabilities REAL, equity REAL,
    shares_basic REAL, shares_diluted REAL,
    rd_expense REAL, sbc_expense REAL,
    buyback_amount REAL, total_debt REAL,
    accrual_ratio REAL, asset_growth REAL, share_issuance_pct REAL,
    altman_z_score REAL,
    filing_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, period_end, period_type)
);

9. DEF 14A Proxy Statements (Annual)

What: Executive compensation details, board composition, shareholder proposals.

What to extract:

Source: SEC EDGAR. Proxy season is March-May for most Fortune 500. Collection frequency: Daily during proxy season (Mar-Jun), weekly otherwise.

Storage:

CREATE TABLE exec_compensation (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    cik TEXT,
    fiscal_year INTEGER,
    ceo_name TEXT,
    total_comp REAL, cash_comp REAL, equity_comp REAL,
    equity_pct REAL,
    option_strike_cluster REAL,
    performance_target_change TEXT,
    filing_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, fiscal_year)
);

Key signal: Shift to equity-heavy + raised performance targets = management bullish. Option strike clustering = management defends that price level. For prediction markets: exec options at $300 strike and Kalshi "above $280" = management has financial incentive to keep stock above $300, making the $280 contract safer.

10. SEC Correspondence Letters (CORRESP)

What: When the SEC sends a company a letter questioning its filings, the correspondence is eventually made public. Companies receiving SEC comment letters underperform by ~8% over the next 12 months.

Simplification for v1: Use keyword alerts for "material weakness", "restatement", "going concern" rather than full NLP parsing on every CORRESP letter. Defer full CORRESP parsing to v2. [PANEL: Grok 4.1]

Source: SEC EDGAR, filing type "CORRESP". URL: https://efts.sec.gov/LATEST/search-index?forms=CORRESP Collection frequency: Daily scan.

Storage:

CREATE TABLE sec_correspondence (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filing_id TEXT UNIQUE NOT NULL,
    ticker TEXT,
    cik TEXT,
    filing_date DATE,
    has_material_weakness INTEGER DEFAULT 0,
    has_restatement INTEGER DEFAULT 0,
    has_going_concern INTEGER DEFAULT 0,
    keywords_found TEXT,
    filing_url TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

11. Form 424B2 Structured Products (Shared with Options Desk)

What: Banks file 424B2 supplements when issuing structured products (autocallables, barrier notes) tied to specific stocks. Reveals barrier/knockout levels and notional amounts.

Source: SEC EDGAR, filing type "424B2". Filter for equity-linked products. Start with top 5 issuers (Goldman, JPMorgan, Morgan Stanley, Citi, BofA). Collection frequency: Daily. This is a Month 2-3 build item.

Storage:

CREATE TABLE structured_products (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filing_id TEXT UNIQUE NOT NULL,
    issuer TEXT,
    underlying_ticker TEXT,
    product_type TEXT,
    barrier_level REAL,
    knockout_level REAL,
    notional_amount REAL,
    maturity_date DATE,
    filing_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

12. Filing Frequency Scanner + 8-K Real-Time Monitor

Standard filing alert types (hourly scan):

8-K real-time monitor: Poll https://efts.sec.gov/LATEST/search-index?forms=8-K&dateRange=custom&startdt={today}&enddt={today} every 60 seconds during 4:00 AM - 8:00 PM ET. Parse 8-K item numbers: [PANEL: Opus — 8-K real-time monitor]

Alert rule: Send immediate alert for Item 1.02, 2.05, 4.01, 5.02 on any stock with active prediction market contracts.

Simplification: Monitor 3 key 8-K patterns automatically; handle other filing types manually. [PANEL: Sonar]

Storage:

CREATE TABLE filing_alerts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filing_id TEXT UNIQUE NOT NULL,
    ticker TEXT,
    cik TEXT,
    filing_type TEXT,
    item_numbers TEXT,
    alert_priority TEXT,
    filing_datetime DATETIME,
    alert_sent INTEGER DEFAULT 0,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

SECTION C: PREDICTION MARKET DATA (Where We Trade)

13. Kalshi Stock Event Contracts

What: All stock-related Kalshi binary contracts.

Contract types:

What to pull per contract:

Source: Kalshi API. Correct base URL: https://trading-api.kalshi.com/trade-api/v2/. Store this in a config file or environment variable — do NOT hardcode it in multiple places. Add a health check before each collection cycle. [PANEL: Opus — Kalshi URL fix]

Auth: RSA-PSS signing (as documented in CLAUDE.md).

Collection frequency: Every 30 minutes during market hours. Every 5 minutes for contracts expiring within 24 hours.

Storage:

CREATE TABLE prediction_market_stocks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    platform TEXT NOT NULL,
    contract_id TEXT NOT NULL,
    event_description TEXT,
    underlying_ticker TEXT,
    threshold_price REAL,
    direction TEXT,
    expiration_date DATE,
    yes_price REAL, no_price REAL,
    volume REAL, open_interest REAL,
    order_book_depth TEXT,
    settlement_source TEXT,
    settlement_time TEXT,
    last_trade_price REAL,
    collected_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(platform, contract_id, collected_at)
);
CREATE INDEX idx_pm_ticker_date ON prediction_market_stocks(underlying_ticker, expiration_date);

14. Polymarket Stock Event Contracts

What: Same as Kalshi but on Polymarket. Different liquidity pool (crypto-native traders), so prices often diverge.

Additional Polymarket-specific data:

Source: Polymarket CLOB API: https://clob.polymarket.com/markets. Gamma API: https://gamma-api.polymarket.com/markets?tag=stocks.

Reliability note: Polymarket's CLOB API changes endpoints frequently. Add fallback to TheGraph subgraph queries (free, decentralized) and validate prices against multiple endpoints to detect oracle failures. [PANEL: Grok 4.1 — Polymarket reliability]

Cross-platform arbitrage caveat: Kalshi and Polymarket contracts on the "same" event frequently differ in: settlement sources, settlement times, and edge case handling (what happens if stock is halted on expiry day?). Use the settlement rule comparison engine (Section 15B) before treating any pair as arbitrageable. [PANEL: Opus — arbitrage caveat]

Collection frequency: Every 30 minutes during market hours.

15. Binary Probability Calculator (The Bridge)

What: Math that converts traditional stock analysis into Kalshi/Polymarket-comparable probabilities.

Inputs:

Method: Use Black-Scholes d2 probability as the primary calculation. Caveats for edge cases: d2 is unreliable for deep OTM contracts, extreme IV, or near-zero interest rates. Use full risk-neutral distribution (Breeden-Litzenberger) for high-accuracy cases. [PANEL: GPT-4.1 — Black-Scholes edge cases]

Carry cost adjustment: CarryAdjustedFairValue = ModelProbability - (ModelProbability * SOFR * DaysToExpiry / 365). When comparing to market price: if MarketPrice > CarryAdjustedFairValue, the contract is overpriced. If MarketPrice < CarryAdjustedFairValue, it is underpriced.

Real carry cost note: The 70-cent 30-day example in the original spec (0.29 cents carry) understates actual costs. Real carry includes: counterparty risk on Polymarket (venue default), Kalshi's settlement fee (verify current fee structure), withdrawal delays (funds lock 1-5 days), slippage on entry/exit. Account for all costs before declaring a contract mispriced. [PANEL: Sonar — carry cost correction]

Output: Probability per threshold per expiration, compared to prediction market prices. Collection frequency: Recalculated every time prediction market prices or model inputs update.

15A. Options Chain Data (Required Input for Binary Probability Calculator)

What: Daily options chains for all stocks with active Kalshi/Polymarket contracts plus top 50 most liquid optionable stocks.

This was the most critical gap in the original spec. The Binary Probability Calculator in Section 15 explicitly requires IV surface data. Without it, you cannot price prediction market contracts accurately. This must be a Day 1 build item, not a cross-desk dependency. [PANEL: Opus, Gemini, GPT-4.1, Grok 4 Fast, Grok 4.1 — universal consensus on this gap]

What to collect per contract:

Derived per-ticker metrics:

Source: Polygon.io options snapshot endpoint: GET /v3/snapshot/options/{underlyingAsset} — covered by the $29/month subscription already required for equity data.

Collection frequency: Daily for all tracked names. Intraday (every 15 min) for names with contracts expiring within 5 days.

Storage:

CREATE TABLE options_chains (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    expiration_date DATE NOT NULL,
    strike REAL NOT NULL,
    option_type TEXT NOT NULL,
    bid REAL, ask REAL, mid REAL,
    volume INTEGER, open_interest INTEGER,
    implied_volatility REAL,
    delta REAL, gamma REAL, theta REAL, vega REAL,
    underlying_price REAL,
    snapshot_timestamp DATETIME,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, expiration_date, strike, option_type, snapshot_timestamp)
);
CREATE INDEX idx_options_ticker_exp ON options_chains(ticker, expiration_date);

15B. Settlement Rules Comparison Engine

What: A database mapping Kalshi contracts to Polymarket equivalents, with explicit settlement rule comparison. Cross-platform arbitrage is ONLY valid when settlement sources, times, and edge cases match. [PANEL: Opus — settlement rules engine]

Storage:

CREATE TABLE settlement_rule_pairs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    kalshi_contract_id TEXT,
    polymarket_contract_id TEXT,
    event_description_match_score REAL,
    settlement_source_kalshi TEXT,
    settlement_source_polymarket TEXT,
    settlement_time_kalshi TEXT,
    settlement_time_polymarket TEXT,
    settlement_source_match INTEGER DEFAULT 0,
    settlement_time_match INTEGER DEFAULT 0,
    edge_case_differences TEXT,
    arbitrageable INTEGER DEFAULT 0,
    price_difference_cents REAL,
    last_checked DATETIME,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

16. Prediction Market Historical Settlement Database

What: All historically settled contracts with outcomes, for model calibration. [PANEL: Opus — settlement history as MUST-HAVE]

Why this is MUST-HAVE: Without settled contract history, you cannot calibrate the Binary Probability Calculator. You need to know: historically, when there was a 15+ point divergence between model and market, how often did the model win?

Sources: Kalshi API: GET /trade-api/v2/markets?status=settled. Polymarket: https://gamma-api.polymarket.com/markets?closed=true&tag=stocks. Backfill ALL available history on Day 1.

Storage:

CREATE TABLE prediction_market_settlements (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    platform TEXT NOT NULL,
    contract_id TEXT NOT NULL,
    event_description TEXT,
    underlying_ticker TEXT,
    threshold_price REAL,
    direction TEXT,
    expiration_date DATE,
    settlement_date DATE,
    settlement_value REAL,
    our_model_probability REAL,
    our_entry_price REAL,
    final_pre_settlement_price REAL,
    price_7d_before REAL, price_3d_before REAL, price_1d_before REAL,
    resolution_source TEXT,
    resolution_notes TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(platform, contract_id)
);

SECTION D: MACRO AND ECONOMIC DATA (FRED API — Free)

17. FRED Economic Data Suite

What: Federal Reserve Economic Data — free API for macro data that drives sector rotation, risk appetite, and market regime.

Source: FRED API (free, API key required). Rate limit: 120 requests/minute. Base URL: https://api.stlouisfed.org/fred/series/observations.

Series to pull:

Series ID Description Frequency
DFF Federal Funds Effective Rate Daily
SOFR Secured Overnight Financing Rate Daily
DGS10 10-Year Treasury Yield Daily
DGS2 2-Year Treasury Yield Daily
DGS30 30-Year Treasury Yield Daily
T10Y2Y 10Y-2Y Yield Spread (curve) Daily
T10Y3M 10Y-3M Yield Spread Daily
BAMLH0A0HYM2 ICE BofA HY OAS Spread Daily
BAMLC0A0CM ICE BofA IG OAS Spread Daily
BAMLC0A4CBBB BBB Corporate Bond Spread Daily
DTWEXBGS Trade-Weighted USD Index (Broad) Daily
VIXCLS CBOE VIX Close Daily
WALCL Fed Balance Sheet (Total Assets) Weekly
RRPONTSYD Reverse Repo Facility Usage Daily
WTREGEN Treasury General Account (TGA) Weekly
NFCI Chicago Fed Financial Conditions Weekly
TOTCI C&I Loans (bank lending pulse) Weekly
ICSA Initial Jobless Claims Weekly
CPIAUCSL CPI-U (All Items) Monthly
PCE Personal Consumption Expenditures Monthly
PAYEMS Total Nonfarm Payrolls Monthly
UMCSENT U Michigan Consumer Sentiment Monthly
UNRATE Unemployment Rate Monthly
USREC NBER Recession Indicator Monthly

Note on SOFR: Added explicitly as a required series. Wire SOFR into the Binary Probability Calculator's carry cost adjustment. Do not use DFF as a proxy — SOFR is the correct modern risk-free rate. [PANEL: Opus]

Data lag flag: Many FRED series update with 1-3 day lags. Flag stale data in the UI to prevent false signals from acting on outdated macro readings. [PANEL: DeepSeek]

Derived signals (calculate on ingestion):

Collection frequency: Daily at 7 AM ET for overnight releases. Weekly series checked on Fridays.

Storage:

CREATE TABLE macro_data (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    series_id TEXT NOT NULL,
    observation_date DATE NOT NULL,
    value REAL,
    is_revised INTEGER DEFAULT 0,
    revision_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(series_id, observation_date)
);
CREATE INDEX idx_macro_series_date ON macro_data(series_id, observation_date);

CREATE TABLE macro_derived (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    calc_date DATE NOT NULL UNIQUE,
    yield_curve_inverted INTEGER DEFAULT 0,
    hy_spread_zscore REAL,
    credit_regime TEXT,
    usd_momentum TEXT,
    liquidity_composite REAL,
    vrp_raw REAL,
    vrp_zscore REAL,
    vrp_regime TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

18. Market Regime Classification

What: Daily classification of market environment.

Simplified to 3 primary states + confidence score. The original 6-bucket system (BULL/BEAR/CORRECTION/CRISIS/RANGE-BOUND/ROTATION) over-categorizes. In practice the system oscillates between CORRECTION and RANGE-BOUND without materially different trading logic. Use 3 states with confidence %, add detail when data supports it: [PANEL: Sonar, Grok 4.1 — simplify regimes]

Also track correlation regime: [PANEL: Sonar — correlation regime as MUST-HAVE]

Storage:

CREATE TABLE market_regime (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    calc_date DATE NOT NULL UNIQUE,
    regime_label TEXT,
    confidence REAL,
    vix REAL,
    hy_spread REAL,
    yield_curve REAL,
    breadth_pct REAL,
    dispersion REAL,
    spx_hy_correlation REAL,
    high_correlation_flag INTEGER DEFAULT 0,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

SECTION E: SHORT INTEREST AND DARK POOL DATA

19. FINRA Short Interest (Bi-Monthly)

What: Total shares sold short per ticker.

Important: FINRA short interest is bi-monthly with ~10-day reporting lag. For real-time squeeze timing, supplement with: borrow fee spikes (real-time from broker or OCC), options open interest skew (puts vs calls), X mentions of "short squeeze". Acknowledge SI data is a lagging signal. [PANEL: Sonar, Grok 4.1]

What to collect:

Source: FINRA. https://www.finra.org/finra-data/browse-catalog/short-interest/data. Published bi-monthly.

Storage:

CREATE TABLE short_interest (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    settlement_date DATE NOT NULL,
    short_interest REAL,
    days_to_cover REAL,
    short_pct_float REAL,
    change_from_prior REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, settlement_date)
);

Key signals:

20. FINRA ATS (Dark Pool) Weekly Volume

What: Weekly share volume traded on each Alternative Trading System per ticker.

Source: FINRA ATS Transparency Data. https://ats-transparency.finra.org/otc/ats-nms-weekly-data. Published weekly. Collection frequency: Weekly download every Monday morning.

Storage:

CREATE TABLE dark_pool_volume (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    week_ending DATE NOT NULL,
    ats_name TEXT,
    share_volume REAL,
    trade_count INTEGER,
    pct_of_total_volume REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, week_ending, ats_name)
);

CBOE venue breakdown: Also collect daily lit venue market share from https://www.cboe.com/us/equities/market_statistics/ — free daily CSV. Compute dark_pool_ratio = dark_pool_volume / total_volume, and dark_pool_ratio_zscore vs 20-day rolling. A stock with normal total volume but 60% dark pool (vs 35% baseline) is being accumulated/distributed by institutions. [PANEL: Opus — venue breakdown upgraded from NICE-TO-HAVE]

21. REG SHO Threshold List (Daily, Free)

What: SEC publishes daily list of stocks with excessive failed-to-deliver (FTD) shares.

Source: https://cdn.finra.org/equity/regsho/daily/ Collection frequency: Daily.

Storage:

CREATE TABLE regsho_threshold (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    threshold_date DATE NOT NULL,
    exchange TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, threshold_date)
);

22. SEC Failure-to-Deliver (FTD) Data

What: SEC publishes twice-monthly FTD files with exact share counts per ticker. FTD data is MORE granular than the REG SHO threshold list (which is a binary flag). The REG SHO list only triggers AFTER FTDs exceed threshold for 13 consecutive settlement days — FTD data lets you see the buildup 1-2 weeks early. [PANEL: Opus — FTD as MUST-HAVE gap]

Source: https://www.sec.gov/data/foiadocsfailsdatahtm — free CSV files, published twice monthly with ~2-week reporting lag.

Storage:

CREATE TABLE ftd_data (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    settlement_date DATE NOT NULL,
    cusip TEXT,
    ticker TEXT NOT NULL,
    quantity_ftd REAL,
    description TEXT,
    price REAL,
    ftd_as_pct_float REAL,
    ftd_as_pct_avg_volume REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(settlement_date, ticker)
);

23. OCC Stock Borrow Rates + DTCC Swap Reporting

Source: OCC (theocc.com — free daily rates). DTCC (free, post-Archegos transparency rules). Collection frequency: Daily.

Storage:

CREATE TABLE borrow_rates (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    rate_date DATE NOT NULL,
    borrow_rate REAL,
    borrow_status TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, rate_date)
);

Key signal: Borrow rate > 5% = hard to borrow, put/call parity breaks. Borrow rate spike = short squeeze pressure building.

24. FINRA Margin Debt (Monthly)

What: Total margin debt in brokerage accounts — a retail leverage indicator. High margin debt often precedes corrections; extreme declines signal capitulation. [PANEL: Grok 4.1]

Source: FINRA Margin Statistics. https://www.finra.org/investors/learn-to-invest/advanced-investing/margin-statistics — free monthly CSV.

Storage:

CREATE TABLE margin_debt (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    report_month DATE NOT NULL UNIQUE,
    total_margin_debt_millions REAL,
    free_credit_cash_millions REAL,
    margin_debt_zscore REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

SECTION F: EARNINGS DATA

25. Earnings Calendar and Upcoming Dates

Source: FMP API free tier (250 requests/day). https://financialmodelingprep.com/api/v3/earning_calendar?from={date}&to={date}&apikey={KEY} Collection frequency: Daily update at 7 AM ET.

Storage:

CREATE TABLE earnings_calendar (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    report_date DATE NOT NULL,
    report_time TEXT,
    fiscal_quarter TEXT,
    eps_estimate REAL,
    rev_estimate REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, report_date)
);

26. Historical Earnings Surprise Database

What per earnings report:

Source: SEC EDGAR XBRL (actuals), FMP API (consensus estimates + surprise history). Initial backfill: last 5-10 years. Collection frequency: Daily during earnings season. Update reaction metrics at 1/5/10/20/60 day marks.

Storage:

CREATE TABLE earnings_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    report_date DATE NOT NULL,
    fiscal_quarter TEXT,
    eps_consensus REAL, eps_actual REAL, eps_surprise_pct REAL,
    rev_consensus REAL, rev_actual REAL, rev_surprise_pct REAL,
    guidance_direction TEXT,
    implied_move REAL, actual_move_1d REAL, actual_move_2d REAL,
    pre_run_5d REAL,
    drift_5d REAL, drift_10d REAL, drift_20d REAL, drift_60d REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, report_date)
);

27. Analyst Estimate Revisions + Dispersion

What to track:

Source: FMP API (free tier). Estimize (crowdsourced estimates, free basic tier) for social whisper consensus. Collection frequency: Daily snapshot after market close.

Storage:

CREATE TABLE estimate_revisions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    snapshot_date DATE NOT NULL,
    eps_consensus REAL, eps_high REAL, eps_low REAL,
    eps_dispersion REAL,
    revisions_up_7d INTEGER, revisions_down_7d INTEGER,
    revision_ratio_7d REAL,
    target_price_mean REAL, target_price_high REAL, target_price_low REAL,
    rating_strong_buy INTEGER, rating_buy INTEGER, rating_hold INTEGER,
    rating_sell INTEGER, rating_strong_sell INTEGER,
    estimize_eps REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, snapshot_date)
);

28. Earnings Call NLP Analysis (Month 2)

Simplification for v1: Use FinBERT sentiment alone initially. Skip the Fog Index and QoQ drift detection until you have a stable pipeline. The full NLP pipeline is a Month 2 item. [PANEL: DeepSeek, Gemini — simplify NLP]

NLP metrics (Month 2+):

Source: Transcript scrapers (Motley Fool, Seeking Alpha). Defer custom FinBERT local install; use the HuggingFace Inference API (free tier) instead.

Storage:

CREATE TABLE earnings_call_nlp (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    report_date DATE NOT NULL,
    finbert_sentiment TEXT,
    finbert_score REAL,
    uncertainty_word_pct REAL,
    forward_looking_ratio REAL,
    fog_index REAL,
    qoq_sentiment_delta REAL,
    parse_status TEXT DEFAULT 'ok',
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, report_date)
);

SECTION G: CORPORATE EVENTS AND CALENDAR

29. IPO Lockup Expiration Calendar

What to collect:

Source: IPOScoop (https://www.iposcoop.com/ipo-lockup-expirations/), MarketBeat. Cross-ref with SEC S-1 "Shares Eligible for Future Sale" section. Collection frequency: Weekly scan every Sunday. Flag lockups expiring within 30 days.

Storage:

CREATE TABLE ipo_lockups (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    ipo_date DATE,
    lockup_expiry DATE NOT NULL,
    insider_ownership_pct REAL,
    is_vc_backed INTEGER DEFAULT 0,
    supply_pressure_pct REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, lockup_expiry)
);

Key signal: Stocks underperform 10-15 days BEFORE lockup expiry. VC-backed + high insider ownership (>40%) = most pressure.

30. Index Reconstitution and Corporate Actions

What to track:

Exchange halt/resume data: Track Nasdaq Trader Halts page, NYSE, CBOE for trading halts. Critical for risk management — squeeze setups and M&A often trigger halts. All free and scrapeable. [PANEL: GPT-4.1]

Sources: S&P Dow Jones press releases (free), FTSE Russell, Yahoo Finance corporate actions, Nasdaq dividend calendar, SEC Form 10-12B (spin-offs). Collection frequency: Daily at 7 AM ET.

Storage:

CREATE TABLE corporate_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    event_date DATE NOT NULL,
    event_type TEXT NOT NULL,
    event_detail TEXT,
    is_sp500_change INTEGER DEFAULT 0,
    effective_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE trading_halts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    halt_datetime DATETIME,
    resume_datetime DATETIME,
    halt_reason TEXT,
    exchange TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

31. Buyback Blackout Periods

How to estimate: Earnings date minus 14 calendar days = blackout start. Earnings date + 2 business days = blackout end. Calculate % of S&P 500 market cap in blackout at any given time.

Collection frequency: Calculated from earnings calendar. Storage: Column in daily state table.

32. Corporate Supply Calendar (Secondaries, ATM Offerings)

What: Upcoming share issuance (secondary offerings, at-the-money offerings) directly impacts price via dilution. Missing this risks buying into an impending supply shock. [PANEL: DeepSeek]

Source: SEC Form S-3 and Form 424B filings (free). Monitor daily.

Storage:

CREATE TABLE share_issuance_calendar (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filing_id TEXT UNIQUE NOT NULL,
    ticker TEXT NOT NULL,
    filing_type TEXT,
    shares_offered REAL,
    offering_price REAL,
    dilution_pct REAL,
    offering_date DATE,
    filing_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

33. Daily State Table

What: One row per trading day with every relevant boolean flag and event marker.

Schema:

CREATE TABLE daily_state (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    trade_date DATE NOT NULL UNIQUE,
    day_of_week TEXT,
    is_monthly_opex INTEGER DEFAULT 0,
    is_quarterly_opex INTEGER DEFAULT 0,
    days_to_next_opex INTEGER,
    is_fomc_day INTEGER DEFAULT 0,
    is_fomc_eve INTEGER DEFAULT 0,
    days_to_next_fomc INTEGER,
    is_cpi_day INTEGER DEFAULT 0,
    is_nfp_day INTEGER DEFAULT 0,
    is_ppi_day INTEGER DEFAULT 0,
    is_pce_day INTEGER DEFAULT 0,
    econ_release_tier INTEGER,
    is_treasury_auction INTEGER DEFAULT 0,
    auction_tenor TEXT,
    num_sp500_reporting INTEGER DEFAULT 0,
    is_peak_earnings_week INTEGER DEFAULT 0,
    pct_sp500_in_blackout REAL,
    is_month_end INTEGER DEFAULT 0,
    is_quarter_end INTEGER DEFAULT 0,
    is_half_day INTEGER DEFAULT 0,
    is_holiday INTEGER DEFAULT 0,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

Note on data quality gate: The daily state table is generated at 6:00 AM ET before market open. If any CRITICAL data quality check fails (see Section M), this table can be used to block trading for that day.


SECTION H: SENTIMENT AND SOCIAL DATA

34. Reddit (Subreddits, Tiered Treatment)

r/wallstreetbets (WSB) — every 30 min during market hours:

r/stocks — every 2 hours:

r/investing — every 2 hours:

r/options — every 30 min:

Source: Reddit API / PRAW (free). Rate limits: 60 requests/minute authenticated. Stability note: Reddit's API access has become increasingly restricted since 2023. Build a fallback to direct HTML scraping of old.reddit.com for historical data when PRAW fails. Also consider Pushshift.io for historical data. [PANEL: Opus — Reddit reliability]

Storage:

CREATE TABLE reddit_sentiment (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    subreddit TEXT NOT NULL,
    collected_at DATETIME NOT NULL,
    mention_count INTEGER DEFAULT 0,
    mention_velocity REAL,
    wsb_momentum_score REAL,
    sentiment_score REAL,
    dd_quality_score REAL,
    macro_fear_index REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

35. X/Twitter Flow via Grok Search

Tier 1 — Market Structure (every 5 min): @unusual_whales, @spotgamma, @OptionsHawk, @WallStJesus, @SqueezeMetrics, @GarrettDeSimone Tier 2 — Macro & Fed (every 15 min): @NickTimiraos, @WalterBloomberg, @DeItaone, @Markets Tier 3 — Stock Analysis (every 30 min): @GaryBlack00, @DougKass, @elerianm Tier 4 — Retail Sentiment (every 60 min): @jimcramer (use as RETAIL ATTENTION signal only — NOT a directional contrarian signal; the Inverse Cramer ETF SJIM was liquidated in 2024 due to poor performance, suggesting the pure contrarian thesis is unreliable) [PANEL: Opus, Sonar — Cramer signal correction]

Pattern detection:

Source: Grok's built-in X search via xAI API (free, already have access).

Storage:

CREATE TABLE x_sentiment (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT,
    keyword_group TEXT,
    account_tier INTEGER,
    mention_count INTEGER,
    sentiment TEXT,
    collected_at DATETIME,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

36. Congressional Trading Disclosures

Source: Capitol Trades (https://www.capitoltrades.com/) or Quiver Quant (https://www.quiverquant.com/congresstrading/). Free to scrape. Supplement with QuiverQuant and flag records with confidence/lag indicators — STOCK Act filings are often delayed and incomplete. [PANEL: GPT-4.1 — congressional data lag]

Collection frequency: Daily.

Storage:

CREATE TABLE congressional_trades (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    politician_name TEXT,
    party TEXT,
    ticker TEXT NOT NULL,
    transaction_type TEXT,
    transaction_date DATE,
    disclosure_date DATE,
    amount_range TEXT,
    lag_days INTEGER,
    data_confidence TEXT DEFAULT 'medium',
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

37. Google Trends

Source: pytrends Python library (free, unofficial). Note: Google Trends has 1-5 day batching delay — it is NOT real-time. The category/brand divergence analysis is clever but defer to v2 given the lag and noise. Use simple ticker mention trends for v1. [PANEL: Sonar — Google Trends lag caveat]

Collection frequency: Daily.

Storage:

CREATE TABLE google_trends (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    trend_date DATE NOT NULL,
    search_interest INTEGER,
    buy_search_interest INTEGER,
    sell_search_interest INTEGER,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, trend_date)
);

38. Retail Flow Proxies

Sources (all free to scrape):

Simplification: Defer Webull (undocumented API, unstable) and eToro (non-US retail base, minimal predictive value for US equities) from the automated pipeline. Start with Fidelity + Stocktwits only. [PANEL: Opus — eToro removal; GPT-4.1, Sonar — simplify retail proxies]

Retail Consensus Indicator (derived composite): Composite of Stocktwits (30%), WSB (30%), r/stocks (25%), Fidelity (15%). Score >85 sustained for 3+ days = CONTRARIAN SELL. Score <15 sustained for 3+ days = CONTRARIAN BUY.

ICI Weekly Mutual Fund Flows: https://www.ici.org/research/stats/flows — free weekly data published Wednesday for prior week. This measures actual money movement by retail investors, not just social media chatter. Extreme outflows (z < -2) = retail capitulation = historically excellent buy signal. Extreme inflows (z > 2) = euphoria = caution. [PANEL: Opus — ICI fund flows]

Storage:

CREATE TABLE retail_flow_proxies (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    source TEXT NOT NULL,
    collected_at DATETIME NOT NULL,
    buy_ratio REAL,
    bull_bear_ratio REAL,
    retail_consensus_indicator REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE fund_flows (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    week_ending DATE NOT NULL UNIQUE,
    equity_domestic_flows_millions REAL,
    equity_world_flows_millions REAL,
    total_equity_flows_millions REAL,
    equity_flow_zscore REAL,
    extreme_outflow_flag INTEGER DEFAULT 0,
    extreme_inflow_flag INTEGER DEFAULT 0,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

39. Earnings Whisper / Social Consensus

What: The "real" expectation that traders have, which often differs from published consensus.

Sources:

Important: Do NOT build custom NLP for extracting EPS expectations from Reddit/X text — too noisy, too easy to game by bots. Use Estimize crowdsourced platform instead. [PANEL: Sonar — social whisper quality]

Key signal: Company beats Street consensus but MISSES Estimize = negative reaction ("beat and dump"). Company misses Street but matches Estimize = muted reaction.


SECTION I: ALTERNATIVE DATA (Free Sources — Build in Month 2+)

These are all month 2+ items. Do not build these before the core pipeline (Sections A-H) is stable and generating predictions.

40. USPTO Patent Data

What to track:

Defer: Trademark filing monitoring. Signal-to-noise ratio is terrible — companies file trademarks speculatively all the time. [PANEL: Opus]

Source: USPTO PatentsView API (free, no key needed). https://api.patentsview.org/patents/query Collection frequency: Weekly.

Storage:

CREATE TABLE patent_citations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    company_cik TEXT,
    ticker TEXT,
    period_date DATE NOT NULL,
    patent_count INTEGER,
    forward_citation_count INTEGER,
    citation_yoy_change REAL,
    abandonment_count INTEGER,
    assignment_count INTEGER,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

41. Job Posting Volume (Month 2)

Source: Company careers pages (direct scraping for top 100 holdings). Revealera.com (free tracker). Collection frequency: Weekly snapshot.

Storage:

CREATE TABLE job_postings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    snapshot_date DATE NOT NULL,
    total_postings INTEGER,
    revenue_roles INTEGER,
    cost_restructuring_roles INTEGER,
    mom_change_pct REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, snapshot_date)
);

Key signal: Posting volume drop >40% MoM = hiring freeze = operational problems incoming.

42. App Store Rankings (Manual Check — Do Not Automate)

Only relevant for ~15-20 consumer app companies. Do not build an automated pipeline. Check Sensor Tower free tier or data.ai free tier manually for specific tickers when actively trading them. [PANEL: Opus, GPT-4.1]

43. SaaS Pricing Page Monitoring — DEFERRED TO v2

Signal is too infrequent (major changes 1-2x/year), scraping is maintenance-intensive, and the signal is usually already priced in by the time it's detectable. [PANEL: Opus, Grok 4.1, Sonar]

44. GitHub Commit Velocity — DEFERRED TO v3

Only relevant for ~20 open-source-heavy tech companies. Signal-to-noise ratio is weak (commits can be noise, bots, linting). GitHub API limits throttle high-frequency collection. [PANEL: Opus, Gemini, DeepSeek, GPT-4.1]


SECTION J: CROSS-ASSET AND INTERNATIONAL SIGNALS

45. Credit-Equity Divergence

Source: FRED API (free). HY OAS (BAMLH0A0HYM2), IG OAS (BAMLC0A0CM). Already collected in Section D.

Key signal: CDX.HY widening 2+ days while SPX flat = equity about to catch down. CDX.HY tightening while SPX sells off = equity overreacting, buy the dip. ~70% directional accuracy.

46. Commodity-to-Sector Margin Mapping

Mappings (1-quarter lag):

Source: FRED commodity prices (free). https://fred.stlouisfed.org/categories/32217 Collection frequency: Daily.

Storage:

CREATE TABLE commodity_sector_signals (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    signal_date DATE NOT NULL,
    commodity TEXT NOT NULL,
    commodity_price REAL,
    commodity_30d_change_pct REAL,
    affected_sectors TEXT,
    signal_direction TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(signal_date, commodity)
);

47. CFTC Commitments of Traders (Equity Index Futures)

Source: CFTC.gov (free CSV downloads). Released Friday 3:30 PM ET, data from Tuesday — 3 days stale. Collection frequency: Weekly (shared with Futures desk).

Storage:

CREATE TABLE cot_equity (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    report_date DATE NOT NULL,
    contract TEXT NOT NULL,
    dealer_long REAL, dealer_short REAL,
    leveraged_fund_long REAL, leveraged_fund_short REAL,
    asset_manager_long REAL, asset_manager_short REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(report_date, contract)
);

48. International Leading Indicators (Simplified Scope)

Limit to the two highest-signal indicators. Defer Japan machinery orders, Australian iron ore, and China PMI to v2 — they're better suited to the Futures/Macro desk. [PANEL: Grok 4.1, Sonar]

Storage:

CREATE TABLE international_indicators (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    indicator_name TEXT NOT NULL,
    period_date DATE NOT NULL,
    value REAL,
    yoy_change_pct REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(indicator_name, period_date)
);

49. FDIC Call Reports (Bank Stocks — Simplified)

Simplification: Only collect 3 metrics from FDIC. Full FDIC report parsing (80+ fields) is overkill unless actively trading financial sector stocks. [PANEL: Sonar — FDIC simplification]

Key signal: Rising provision-to-loan ratio in CRE or C&I at multiple banks = systemic stress 1-2 quarters ahead.


SECTION K: DERIVED AND COMPUTED DATABASES

These are NOT collected from external sources. They are COMPUTED from data above and our own prediction history.

50. Failed Trade Pattern Database

Note: Build the schema on Day 1, but do not expect meaningful analysis until you have 50+ settled trades. Do not tune the categories until 200+ trades.

Root cause categories (simplified to 5 for launch): [PANEL: Opus — simplification from 10 to 5]

  1. WRONG_THESIS — fundamental analysis was incorrect
  2. WRONG_TIMING — thesis eventually correct but outside our window
  3. EXTERNAL_SHOCK — unforeseeable event
  4. EXECUTION_ERROR — sizing, entry/exit timing, or liquidity issue
  5. REGIME_SHIFT — market environment changed after entry

Add a free-text notes field for nuance. After 200 settled trades, analyze notes to determine if any category should be split.

Storage:

CREATE TABLE failed_trade_patterns (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    prediction_id TEXT NOT NULL,
    ticker TEXT NOT NULL,
    direction TEXT,
    entry_date DATE, exit_date DATE,
    return_pct REAL,
    market_regime TEXT,
    correlation_regime TEXT,
    root_cause TEXT,
    notes TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

51. Prediction Calibration Database (Brier Scores)

Storage:

CREATE TABLE prediction_calibration (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    prediction_id TEXT NOT NULL UNIQUE,
    desk TEXT DEFAULT 'stocks',
    model_name TEXT,
    confidence_pct REAL,
    confidence_bucket TEXT,
    outcome INTEGER,
    brier_score REAL,
    market_regime TEXT,
    time_horizon TEXT,
    settled_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_calibration_bucket ON prediction_calibration(confidence_bucket, market_regime);

52. Factor Exposure Attribution

Source: Calculated from Fama-French factors + sector returns + individual stock returns.

Storage:

CREATE TABLE factor_attribution (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    prediction_id TEXT NOT NULL,
    market_beta REAL,
    sector_beta REAL,
    size_factor REAL,
    value_factor REAL,
    momentum_factor REAL,
    stock_specific_alpha REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

53. Smart Money Convergence Score

Composite score per ticker per day: Insider buying (Form 4), institutional accumulation (13F), activist interest (13D), dark pool volume zscore, congressional trading, FTD trend.

Storage:

CREATE TABLE smart_money_convergence (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    calc_date DATE NOT NULL,
    insider_score REAL,
    institutional_score REAL,
    activist_score REAL,
    dark_pool_score REAL,
    congressional_score REAL,
    ftd_score REAL,
    composite_score REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, calc_date)
);

54. Squeeze Probability Score (Simplified)

Simplified version: Daily screening, not 15-minute updates (squeezes develop over days, not minutes). [PANEL: Opus — squeeze score simplification]

Screen criteria: SI > 20% of float AND days-to-cover > 5 AND borrow rate > 10%. Flag these for manual review. Add WSB mention velocity as a timing trigger.

Storage:

CREATE TABLE squeeze_scores (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    calc_date DATE NOT NULL,
    short_pct_float REAL,
    days_to_cover REAL,
    borrow_rate REAL,
    wsb_mention_velocity REAL,
    ftd_pct_float REAL,
    squeeze_flag INTEGER DEFAULT 0,
    squeeze_score REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, calc_date)
);

SECTION L: PYTHON-TYPESCRIPT INTEGRATION ARCHITECTURE

[PANEL: Opus — critical architecture addition, entire section]

The boundary: Python handles all data collection (financial Python ecosystem is superior — pandas, yfinance, praw, pytrends, nltk, textstat, FinBERT all have no good TypeScript equivalents). TypeScript/Node.js handles the orchestration layer, API serving, and prediction pipeline.

Communication layer: Shared database with a job queue table.

CREATE TABLE job_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_id TEXT UNIQUE NOT NULL,
    job_type TEXT NOT NULL,
    status TEXT DEFAULT 'pending',
    payload TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    started_at DATETIME,
    completed_at DATETIME,
    error_message TEXT
);

CREATE TABLE alerts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    alert_type TEXT NOT NULL,
    ticker TEXT,
    severity TEXT,
    message TEXT,
    payload TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    acknowledged INTEGER DEFAULT 0
);

Pattern:

  1. Node.js inserts jobs into job_queue (or Python cron inserts directly)
  2. Python workers poll job_queue for status = 'pending', mark started_at, execute, write results to target tables, set status = 'completed' or status = 'failed'
  3. For real-time alerts (8-K filings, squeeze flags, borrow rate spikes), Python writes to alerts table
  4. Node.js polls alerts table every 10 seconds for unacknowledged alerts

No message broker needed — the database handles this volume. PostgreSQL LISTEN/NOTIFY can be used for lower-latency alert delivery if needed.

API key management: Track ~8-10 API keys (FRED, FMP, Polygon, PRAW, etc.) in .env file. Document which keys are required vs optional. A single missing optional key should NOT crash the pipeline. Rotate keys quarterly.


SECTION M: DATA QUALITY MONITORING

[PANEL: Opus — critical addition, entire section]

A data quality monitoring job runs daily at 6:30 AM ET, BEFORE the trading day. Results go into data_quality_log. Any CRITICAL failure blocks trading for that day.

CREATE TABLE data_quality_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_name TEXT NOT NULL,
    check_timestamp DATETIME NOT NULL,
    check_type TEXT NOT NULL,
    status TEXT NOT NULL,
    details TEXT,
    rows_affected INTEGER,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

Checks to run:

  1. FRESHNESS: For each data source, verify last ingestion timestamp is within expected window. daily_prices should have data from yesterday. macro_data daily series should have data from yesterday. If EDGAR had no Form 4 filings all day (possible but rare), log as WARNING not CRITICAL.

  2. COMPLETENESS: Verify row counts. daily_prices should have ~700 rows per day. If only 50 rows, yfinance/Polygon failed partially.

  3. RANGE: Verify no prices are negative, no volumes are negative, VIX is between 5 and 100, RSI is between 0 and 100, all implied volatilities are between 0 and 5 (500%).

  4. CONSISTENCY: Verify adjusted close changes are within 25% of prior day (catches bad split adjustments). When Polygon and yfinance both have data for the same ticker, verify prices are within 1%.

  5. CROSS-SOURCE: If prediction market prices for a contract have not updated in >2 hours during market hours, flag as stale.

Failure thresholds:


IMPLEMENTATION CHECKLIST (Ordered by Priority)

Phase 1 — Core Infrastructure (Week 1, ~$29/month)

Phase 2 — SEC Pipeline (Week 2, $0)

Phase 3 — Sentiment + Events (Week 3, $0)

Phase 4 — NLP + Advanced (Month 2+)


API ENDPOINTS REFERENCE

Source URL Auth Rate Limit
Polygon.io https://api.polygon.io/ API key 5 calls/min (free), unlimited ($29/mo)
FRED https://api.stlouisfed.org/fred/ API key (free) 120/min
SEC EDGAR https://efts.sec.gov/LATEST/ User-Agent header 10 req/sec
SEC XBRL https://data.sec.gov/api/xbrl/ User-Agent header 10 req/sec
Kalshi https://trading-api.kalshi.com/trade-api/v2/ RSA-PSS signing Varies
Polymarket CLOB https://clob.polymarket.com/markets None (public) Varies
Polymarket Gamma https://gamma-api.polymarket.com/markets None (public) Varies
FINRA ATS https://ats-transparency.finra.org/ None Weekly file
FINRA SI https://www.finra.org/finra-data/ None Bi-monthly file
OCC Borrow Rates https://www.theocc.com/ None Daily file
CBOE Market Share https://www.cboe.com/us/equities/market_statistics/ None Daily CSV
FMP https://financialmodelingprep.com/api/v3/ API key (free tier) 250/day
Kalshi settlements https://trading-api.kalshi.com/trade-api/v2/markets?status=settled RSA-PSS Varies
Polymarket resolved https://gamma-api.polymarket.com/markets?closed=true None Varies
ICI Fund Flows https://www.ici.org/research/stats/flows None Weekly
CFTC COT https://www.cftc.gov/MarketReports/ None Weekly
USPTO PatentsView https://api.patentsview.org/patents/query None Free
Reddit/PRAW OAuth2 App credentials 60/min
Capitol Trades https://www.capitoltrades.com/ None Scraping
SEC FTD https://www.sec.gov/data/foiadocsfailsdatahtm None Twice monthly
TSMC Revenue https://investor.tsmc.com/english/monthly-revenue None Monthly

COLLECTION CADENCE SUMMARY

Frequency Data Sources
Every 60 seconds (market hours) 8-K filings (EDGAR real-time feed)
Every 5 minutes Kalshi contracts expiring <24h, X/Twitter Tier 1 accounts
Every 15 minutes X/Twitter Tier 2 accounts, Stocktwits per-ticker, options chains (near-expiry)
Every 30 minutes Kalshi/Polymarket all contracts, Reddit WSB + r/options, retail flow proxies
Every 2 hours Reddit r/stocks + r/investing, X/Twitter Tier 3
Every 60 minutes X/Twitter Tier 4
Daily 5:00 PM ET Daily prices (Polygon), sector/factor ETF returns
Daily 6:00 AM ET Daily state table generation
Daily 6:30 AM ET Data quality monitoring job
Daily 7:00 AM ET FRED macro data, earnings calendar, analyst revisions, corporate events
Daily 6:00 PM ET 13D activist scan, REG SHO threshold, congressional trades
Daily 7:00 PM ET Form 4 and Form 144 daily batch (after EDGAR prime window 4-8 PM)
Daily (market hours) Options chains (all tracked names), borrow rates, CORRESP keyword scan
Weekly Monday FINRA ATS dark pool volume, CBOE venue breakdown, Google Trends
Weekly Sunday IPO lockup calendar
Weekly Friday CFTC COT data
Weekly USPTO patent data, job posting volume (Month 2+)
Bi-monthly FINRA short interest, SEC FTD data
Monthly N-PORT fund holdings, ICI mutual fund flows, FINRA margin debt, TSMC revenue, Korean semi exports
Quarterly 13F institutional holdings (within 24h of deadline: Feb 14, May 15, Aug 14, Nov 14)
Annually DEF 14A proxy statements (proxy season Mar-Jun), FDIC call reports

STORAGE ESTIMATES (REVISED)

The original spec estimate of ~575 KB/day is too low. Realistic estimate accounts for raw filing text, options chain data, and index overhead. [PANEL: Opus — storage correction]

Revised per-day estimates (full universe ~700 tickers):

Data Source Size/Day
daily_prices (700 tickers) ~140 KB
options_chains (active names, daily snapshot) ~500 KB
insider_trades ~50 KB
macro_data + derived ~5 KB
prediction_market_stocks (Kalshi + Polymarket) ~100 KB
reddit_sentiment ~25 KB
x_sentiment ~20 KB
8-K filing alerts + text ~50 KB
earnings_call_nlp (seasonal, ~200 companies/quarter) ~25 KB
corporate_events + state table ~5 KB
borrow_rates + regsho ~10 KB
collection_log + data_quality_log ~10 KB

Revised daily total: ~3-4 MB/day (Opus's estimate; still trivially small)

Per-quarter estimates:

Annual totals:

Total disk budget: ~5 GB for 2 years (comfortable on 45GB disk with 15GB free)


COST SUMMARY

Period Cost What It Gets You
Week 1+ $29/month Polygon.io — primary price data + options chains (move to Day 1)
Optional Month 2+ $19/month FMP API (cleaner XBRL parsing vs raw SEC)
Year 1 total ~$290-$480 Polygon ($348) + optional FMP ($228)

Dependencies (Python data collectors):

Future cost notes:


CROSS-DESK SYNERGIES

Shared Infrastructure (Build Once, Use Everywhere)

Shared Component Reusable For
FRED macro data suite (23+ series) Futures, Forex, Options, Crypto
Market regime classifier (3 states + correlation) All desks
Kalshi/Polymarket scraper infrastructure Options, Crypto, Sports, Futures
Settlement rules comparison engine All prediction market desks
Historical settlements database All desks
Brier score + calibration framework All desks
Daily state table (OpEx, FOMC, economic releases) Options, Futures, Forex
SEC EDGAR base infrastructure (rate limiter, XML parser, file router) Options (424B2, Form 4)
Credit spread data + credit-equity divergence Options, Futures
Earnings calendar + surprise database Options (implied move analysis)
Reddit/social sentiment infrastructure Crypto, Sports
EDGAR rate limiting + caching strategy All desks

Shared with Options Desk

Shared with Futures Desk

Shared with Forex Desk

Shared with Crypto Desk


REALITY CHECK AND KEY RISKS

Is this realistic? Yes, with four conditions:

  1. Polygon.io ($29/month) starts Day 1, not Month 2. yfinance will break before you finish Week 1.
  2. EDGAR collection uses the event-driven feed pattern (watermark-based), not per-ticker polling.
  3. Data quality monitoring is built in Phase 1, not bolted on later.
  4. The alternative data section (Section I) stays in Month 2+ — resist the temptation to build GitHub commit tracking before the core pipeline is generating predictions.

Biggest operational risk: Data source fragility. The spec relies on several sources that break regularly: yfinance (breaks 2-4x/year), PRAW/Reddit API (increasingly restricted since 2023), Polymarket API (undocumented endpoints), web scrapers for congressional trades and IPO lockups. Budget ~20% of ongoing engineering time for scraper maintenance. EDGAR and FRED are rock-solid (government APIs with stability commitments) — everything else is built on sand to varying degrees.

First to break in production: yfinance (if used as primary — don't). Second: Reddit PRAW API. Third: Polymarket CLOB endpoint.

Top single change: Build the settlement rules comparison engine (Section 15B) before executing any cross-platform arbitrage. Kalshi and Polymarket settlement rules differ subtly and those differences flip binary outcomes. Trading on an assumed arbitrage without verifying rules is not arbitrage — it's an uncompensated bet on settlement rule differences.

What NOT to build (cost > value):


End of Stocks Desk Merged Final Spec — March 15, 2026


TODO: Upgrade Kalshi REST to WebSocket Feed

Status: NOT BUILT — add when this desk goes live for execution

Current state: All Kalshi data (prices, order books, trades) is fetched via REST API polling on cron schedules. This is fine for edge detection and monitoring, but NOT sufficient for live trade execution.

Why WebSocket matters:

Note: This desk primarily trades on other exchanges, but Kalshi has event markets (rate decisions, index levels, crypto milestones) that overlap with this desk. WebSocket upgrade applies when trading any Kalshi markets.

Added 2026-03-29 — upgrade REST to WS when desk moves to live execution

Source: ~/edgeclaw/results/spec-panel/stocks-desk/spec-final.md