Stocks Desk — Data Collection Spec (MERGED FINAL)

Panel review: Claude Opus 4.6 (base) + DeepSeek V3.1, Gemini 3.1 Pro, GPT-4.1, Grok 4 Fast, Grok 4.1, Sonar Reasoning Pro

Merged: March 15, 2026. Opus is the complete base; panel additions tagged [PANEL: Model].

NOTE ON BROKER: Stock trade execution broker is TBD. All references in this spec use broker-agnostic language. The equity trading platform will be configured separately. Prediction markets (Kalshi, Polymarket) are unaffected by broker choice.

NOTE ON STACK: Data collection scripts run in Python (best financial ecosystem). The orchestration layer and prediction pipeline run in TypeScript/Node.js. Communication between them uses a shared PostgreSQL database with a job_queue table. See Section L for integration architecture. [PANEL: Opus — aplus additions]

What This Document Is

This is the complete data collection specification for the Stocks desk in the research pipeline. It covers equity trades on a TBD broker AND event contracts traded on Kalshi/Polymarket. An AI builder should be able to read this and know exactly what data to collect, from where, how often, in what format, and why it matters for finding mispriced stocks or prediction market contracts.

The Business Model (Three Venues)

Unlike sports desks that only trade prediction markets, the Stocks desk trades on THREE platforms:

Equity broker (TBD) — Actual stock trades (buy/sell shares on individual equities). We buy and sell real stocks.
Kalshi — Binary prediction markets on stock events: price thresholds ("TSLA above $300 by Friday"), earnings outcomes ("AAPL beats EPS estimate"), index levels ("S&P 500 above 5500 by month end").
Polymarket — Similar binary contracts on crypto rails (USDC on Polygon). Includes M&A, regulatory outcomes, CEO changes, index milestones.

The cross-venue edge: Fundamental analysis tells us a stock is mispriced. The same analysis also tells us whether a Kalshi/Polymarket contract is mispriced. Example: if our fundamental model says NVDA fair value is $950 and Kalshi prices "NVDA above $900 by April 30" at 55 cents, our model implies ~75% probability — that contract is cheap. We trade BOTH venues simultaneously.

How Stocks Are Different From Sports/Options

An AI builder must understand these structural differences:

No Sharp External Benchmark — In sports, sharp bookmakers set fair lines. For stock events, there IS no sharp external reference. We must BUILD the fair value estimate from fundamentals, flow data, and volatility modeling.
Continuous Price Discovery — Stocks trade every second, 6.5 hours per day. Data changes constantly. Sports bets lock in at game time. Timing of data collection matters more.
Multi-Factor Valuation — A stock's fair value depends on earnings, balance sheet, macro regime, sector rotation, institutional positioning, insider behavior, credit markets, and sentiment simultaneously. No single data source is sufficient.
Prediction Market Carry Cost — Buying a Kalshi contract at 70 cents locks up 70 cents until settlement. The carry cost is: ModelProbability * SOFR_annual * DaysToExpiry / 365. At current SOFR (~4.3%), a 70-cent 90-day contract has ~0.76 cents of carry cost. Prediction market prices should trade BELOW option-implied probabilities by this carry cost. If they trade ABOVE, the prediction market is overpriced. Do NOT hardcode the risk-free rate — wire to live SOFR from FRED. [PANEL: Opus — carry cost correction]
Regime Dependence — Stock strategies that work in bull markets fail in bear markets. Every data source must be tagged with which market regime it applies to.
SEC Filings Are the Goldmine — The SEC requires public companies to disclose everything. This free government data is richer than any paid service for stocks.

SECTION A: PRICE AND MARKET DATA (Foundation)

1. Daily OHLCV for Full Universe

What: Daily Open, High, Low, Close (adjusted for splits/dividends), and Volume for every stock we track.

Universe:

S&P 500 constituents (503 tickers)
Nasdaq 100 constituents
Russell 2000 top 200 by volume (small-cap regime detection)
Any stock we have EVER made a prediction on (even if delisted — survivorship bias prevention)
Major ETFs: SPY, QQQ, IWM, DIA, XLF, XLE, XLK, XLB, XLI, XLP, XLU, XLV, XLY, XLC, XLRE, ARKK
Factor ETFs: MTUM (Momentum), QUAL (Quality), VLUE (Value), SIZE (Size), USMV (Min Vol)

What to pull per ticker:

OHLCV + adjusted close
52-week high/low
Average daily volume (20-day)
Relative volume (today's volume / 20-day average)
Market cap (shares outstanding * price)
Free float (shares outstanding minus insider/institutional locked)

Calculated technicals (on ingestion):

RSI (14-day)
MACD (12/26/9)
Bollinger Bands (20-day, 2 std dev) + %B position
ATR (14-day)
Short-term momentum (5-day, 10-day, 20-day returns)
Medium-term momentum (60-day, 120-day returns)

Primary source: Polygon.io REST API — start Day 1, not Month 2. Endpoint: GET /v2/aggs/grouped/locale/us/market/stocks/{date} (all tickers, one call per day). Historical backfill: GET /v2/aggs/ticker/{ticker}/range/1/day/{from}/{to}. Cost: $29/month. This is the most important infrastructure decision in the spec. [PANEL: Opus — moved Polygon to Day 1]

Fallback: yfinance (unofficial Yahoo Finance scraper) — use ONLY as backup when Polygon has an outage, and ONLY for historical backfill. yfinance breaks 2-4 times per year when Yahoo changes its undocumented API. It is NOT acceptable as a primary production source.

Health check: If Polygon returns HTTP 5xx or empty data for >5 minutes, automatically fall back to yfinance with an alert. Log every fallback event.

After-hours / pre-market data: Collect extended-hours price/volume via Polygon for catalysts that occur outside regular hours (earnings, guidance, M&A). Many moves begin in extended hours. [PANEL: GPT-4.1]

History: 10 years for S&P 500. Full available history for any traded stock.

Collection frequency: Daily at 5:00 PM ET (after market close + after-hours adjustments settle).

Historical data tagging: All rows must include effective date and index membership flags. Never overwrite past constituents or remove delisted tickers. [PANEL: GPT-4.1 — schema flaw fix]

Storage:

CREATE TABLE daily_prices (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    trade_date DATE NOT NULL,
    open REAL, high REAL, low REAL, close REAL,
    adj_close REAL, volume INTEGER,
    relative_volume REAL,
    rsi_14 REAL, macd REAL, macd_signal REAL,
    bb_upper REAL, bb_lower REAL, bb_pct REAL,
    atr_14 REAL,
    market_cap REAL, free_float REAL,
    is_sp500 INTEGER DEFAULT 0,
    is_ndx100 INTEGER DEFAULT 0,
    is_delisted INTEGER DEFAULT 0,
    delisted_date DATE,
    data_source TEXT DEFAULT 'polygon',
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, trade_date)
);
CREATE INDEX idx_daily_prices_ticker_date ON daily_prices(ticker, trade_date);
CREATE INDEX idx_daily_prices_date ON daily_prices(trade_date);

Why it matters: Foundation for ALL other signals. Relative volume spikes flag unusual activity before the cause is known. For prediction markets: current price trajectory directly determines probability of hitting Kalshi threshold prices.

2. Sector and Factor ETF Returns

What: Daily returns for all 11 GICS sectors and standard factor ETFs.

Sector ETFs (11): XLK, XLF, XLE, XLV, XLI, XLP, XLY, XLC, XLRE, XLB, XLU Factor ETFs: MTUM, QUAL, VLUE, SIZE, USMV, IWD (Value), IWF (Growth)

Calculated metrics:

Value vs Growth spread: IWD return minus IWF return (daily)
Small vs Large spread: IWM return minus SPY return (daily)
Sector rotation velocity: rolling 20-day correlation between each sector pair
Factor crowding: when one factor dramatically outperforms, it is crowded
Cross-sectional dispersion: std dev of S&P 500 stock returns (high = stock-picking matters, low = macro dominates)

Source: Polygon.io (primary). Also Fama-French factor data from Kenneth French Data Library (free CSVs). AQR factor datasets (free registration). Collection frequency: Daily at close. Monthly download of updated Fama-French/AQR data.

Storage:

CREATE TABLE factor_returns (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    trade_date DATE NOT NULL,
    ticker TEXT NOT NULL,
    daily_return REAL,
    sector TEXT,
    factor_type TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, trade_date)
);

Why it matters: Factor attribution on every trade reveals whether alpha came from stock selection or factor exposure. Regime-conditional factor performance guides position sizing.

SECTION B: SEC EDGAR PIPELINE (The Goldmine — All Free)

SEC EDGAR is the single richest free data source for stock analysis. Base URL: https://efts.sec.gov/LATEST/. User-Agent REQUIRED: "CompanyName AdminEmail". Rate limit: 10 requests/second (use 5/sec conservatively).

CRITICAL architecture note: Do NOT poll each of ~700 tickers individually for new filings. Instead, poll the EDGAR full-text search endpoint once per filing type per cycle. Example: https://efts.sec.gov/LATEST/search-index?forms=4&dateRange=custom&startdt={today}&enddt={today} returns ALL Form 4 filings for the day in one call. Maintain a last_seen_filing_id watermark per filing type to avoid reprocessing. This pattern applies to Form 4, Form 144, 13D, 8-K, CORRESP, and all other monitored types. [PANEL: Opus — EDGAR architecture fix]

Legal risk: SEC has explicitly asked high-frequency scrapers to slow down. Mitigation: use RSS feed + daily batching, respect User-Agent and delay requirements, cache aggressively, document scraping practices. Store raw XML locally — never re-fetch a filing you already have. [PANEL: Sonar]

3. Form 4 — Insider Trades

What: Corporate insiders (CEO, CFO, directors, 10%+ owners) must report stock purchases/sales within 2 business days of the trade.

What to extract per filing:

Ticker, CIK
Insider name, relationship (CEO/CFO/Director/VP/10% Owner)
Transaction type: P (purchase), S (sale), A (award), M (exercise)
Shares traded, price, total value
Shares owned after transaction
Whether it is a pre-planned 10b5-1 trade (noise) or open-market (signal)
Reporting person CIK (for cross-company network mapping)

Name canonicalization: SEC Form 4 insider names have variations ('Robert J Smith' vs 'Bob Smith'), name changes (marriages, legal changes), and typos. Use fuzzy matching (Levenshtein distance) + CIK primary key (which is permanent). Build a insider_name_canonical table — do not assume names are consistent across filings. [PANEL: Sonar]

Source: SEC EDGAR. Filing search: https://efts.sec.gov/LATEST/search-index?forms=4&dateRange=custom&startdt={date}&enddt={date}. Parse XML from each filing. Store raw XML locally before parsing — SEC XML structure changes every 2-3 years and you'll need to re-parse.

Collection frequency: Poll the daily feed every 5 minutes during 4 PM - 8 PM ET (when most Form 4s are filed), every 30 minutes otherwise. Do NOT check every ticker individually every 2 hours.

Storage:

CREATE TABLE insider_trades (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filing_id TEXT UNIQUE NOT NULL,
    ticker TEXT NOT NULL,
    company_cik TEXT,
    insider_name TEXT,
    insider_cik TEXT,
    relationship TEXT,
    transaction_type TEXT,
    is_10b51 INTEGER DEFAULT 0,
    shares_traded REAL,
    price_per_share REAL,
    total_value REAL,
    shares_owned_after REAL,
    filing_date DATE,
    transaction_date DATE,
    raw_xml_path TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_insider_trades_ticker ON insider_trades(ticker);
CREATE INDEX idx_insider_trades_date ON insider_trades(filing_date);
CREATE INDEX idx_insider_trades_cik ON insider_trades(insider_cik);

CREATE TABLE insider_network (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    insider_cik TEXT NOT NULL,
    company_cik TEXT NOT NULL,
    relationship TEXT,
    first_seen DATE,
    last_seen DATE,
    UNIQUE(insider_cik, company_cik)
);

CREATE TABLE insider_name_canonical (
    insider_cik TEXT PRIMARY KEY,
    canonical_name TEXT NOT NULL,
    name_variants TEXT,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

Key signals:

Cluster buying: 3+ insiders buying within 30 days predicts 6-month outperformance by 7-10% (academic evidence)
CEO/CFO buys stronger than Director buys
10b5-1 pre-planned sales are NOISE — filter them out
Cross-company network: when an insider who sits on boards of Company A and Company B buys stock in Company A, check Company B too

4. Form 144 — Pre-Sale Notice of Restricted Stock

What: Insiders must file Form 144 BEFORE selling restricted shares. This is filed days to weeks BEFORE the actual Form 4. Form 144 is the LEADING indicator of Form 4.

Important caveat: The lag between Form 144 and Form 4 varies by insider sophistication — some file 144 on the same day they transact. The leading indicator claim must be validated with data before being used in core signals. Track the actual lag per insider CIK. [PANEL: Sonar — Form 144 validation caveat]

Source: SEC EDGAR, filing type "144". URL: https://efts.sec.gov/LATEST/search-index?forms=144&dateRange=custom&startdt={date}&enddt={date}

Parsing note: Form 144 filings are less structured than Form 4. Use LLM-assisted parsing for the text portions. Expect parse failures — build a failure queue and alert on failure rate >20%. [PANEL: DeepSeek — Form 144 parsing complexity]

Collection frequency: Daily scan.

Storage:

CREATE TABLE form144_notices (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filing_id TEXT UNIQUE NOT NULL,
    ticker TEXT,
    company_name TEXT,
    insider_name TEXT,
    insider_cik TEXT,
    shares_to_sell REAL,
    estimated_proceeds REAL,
    filing_date DATE,
    planned_sale_date DATE,
    parse_status TEXT DEFAULT 'ok',
    raw_text TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

Key signal: Form 144 filing = selling is COMING but hasn't happened yet. Compare timing to subsequent Form 4 to measure actual lag per insider.

5. 13F Institutional Holdings (Quarterly)

What: Every institutional investment manager with >$100M AUM must file 13F-HR quarterly, disclosing all equity holdings.

What to extract:

Filer name, CIK
Filing period (quarter-end date)
Per holding: CUSIP, company name, shares, market value, investment discretion, put/call indicator
Calculate: new positions, liquidated positions, increased/decreased vs prior quarter

Smart Money Cohort (~50 funds):

University endowments: Yale, Harvard, Stanford, Princeton, MIT
Sovereign wealth: GIC, ADIA, Norges Bank
High-conviction value: Berkshire Hathaway, Baupost, Third Point, Pershing Square, Greenlight, Appaloosa, ValueAct
Tiger Cubs: Lone Pine, Viking, Coatue, Tiger Global, Maverick

Source: SEC EDGAR. Filing index: https://efts.sec.gov/LATEST/search-index?forms=13F-HR&dateRange=custom. Parse infotable.xml for positions.

Collection frequency: Quarterly. 13F due 45 days after quarter-end (Feb 14, May 15, Aug 14, Nov 14). Bulk parse within 24 hours of deadline.

13F-NT tracker: Also monitor https://efts.sec.gov/LATEST/search-index?forms=13F-NT for late filing notifications. A smart money fund filing late (requesting extension) is more likely hiding embarrassing positions than an early filer. Replace the "early filing = no positions to hide" signal with "13F-NT by smart money = something changed they're not ready to reveal." [PANEL: Opus — 13F-NT correction]

Storage:

CREATE TABLE institutional_holdings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filer_cik TEXT NOT NULL,
    filer_name TEXT,
    period_of_report DATE NOT NULL,
    cusip TEXT,
    ticker TEXT,
    company_name TEXT,
    shares REAL,
    market_value REAL,
    investment_discretion TEXT,
    option_type TEXT,
    is_smart_money INTEGER DEFAULT 0,
    prev_shares REAL,
    shares_change REAL,
    is_new_position INTEGER DEFAULT 0,
    is_liquidated INTEGER DEFAULT 0,
    filing_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(filer_cik, period_of_report, cusip)
);

CREATE TABLE smart_money_signals (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    signal_date DATE NOT NULL,
    funds_initiated INTEGER DEFAULT 0,
    funds_liquidated INTEGER DEFAULT 0,
    funds_increased INTEGER DEFAULT 0,
    funds_decreased INTEGER DEFAULT 0,
    overlap_count INTEGER DEFAULT 0,
    convergence_score REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, signal_date)
);

Key signals:

3+ smart money funds initiating same stock in same quarter = deep fundamental research already done
Ownership overlap: 5+ overlapping funds = "crowded" stock, vulnerable to contagion selling
For prediction markets: smart money piling in = Kalshi "above $X" contracts underpriced on 3-month horizon

6. N-PORT Monthly Fund Holdings

What: Mutual fund monthly portfolio holdings with 60-day public disclosure lag. N-PORT includes data 13F does NOT: securities lending income, total return swaps, CDS held, liquidity classification.

Simplification: Focus on top 50 ETFs/mutual funds (Vanguard, BlackRock) that drive 80% of flow. Skip securities lending/CDS unless trading those specific names. [PANEL: Grok 4.1 — N-PORT scope]

Source: SEC EDGAR, filing type "NPORT-P". URL: https://efts.sec.gov/LATEST/search-index?forms=NPORT-P Collection frequency: Monthly. Parse within 24 hours of new filing.

Storage:

CREATE TABLE nport_holdings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filer_cik TEXT NOT NULL,
    filer_name TEXT,
    period_date DATE NOT NULL,
    cusip TEXT,
    ticker TEXT,
    company_name TEXT,
    shares REAL,
    market_value REAL,
    liquidity_classification TEXT,
    securities_lending_flag INTEGER DEFAULT 0,
    filing_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(filer_cik, period_date, cusip)
);

Key signals:

Fund liquidating a position across 2+ consecutive months before it shows in 13F = 30-60 days early
Fund increasing securities lending = expects borrowing pressure (bearish)
Liquidity classification downgrade = fund may be forced to sell

7. 13D/13G Activist + Large Position Disclosure

What: 13D = filer intends to influence management (activist). 13G = passive large position (>5% ownership).

What to extract:

Filer name, target company (CIK, ticker)
Shares acquired, % of outstanding
Purpose of transaction (KEY FIELD — 13D requires stating intent)
Whether this is a 13G-to-13D CONVERSION

Activist Quality Tier List:

Tier A (>60% campaign success): Elliott, Starboard, ValueAct, Icahn, Third Point, Jana Partners, Trian
Tier B (40-60%): Engaged Capital, Land & Buildings, Sachem Head
Tier C (unknown/new): everyone else

Source: SEC EDGAR. https://efts.sec.gov/LATEST/search-index?forms=SC+13D,SC+13D/A,SC+13G Collection frequency: Daily scan at 6 PM ET.

Storage:

CREATE TABLE activist_filings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filing_id TEXT UNIQUE NOT NULL,
    filer_name TEXT,
    filer_cik TEXT,
    target_ticker TEXT,
    target_cik TEXT,
    filing_type TEXT,
    shares_acquired REAL,
    pct_outstanding REAL,
    purpose_text TEXT,
    activist_tier TEXT,
    is_13g_to_13d INTEGER DEFAULT 0,
    filing_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

Key signals:

Tier A activist 13D filing = go long immediately. Historical average 5-15% pop on filing.
13G-to-13D conversion = imminent activist campaign. 8-15% pop within 30 days.
For prediction markets: 13D filing on stock with active Kalshi contracts = those contracts reprice within hours.

8. 10-K/10-Q Financials via XBRL

What: Quarterly and annual financial statements in machine-readable format.

What to extract:

Revenue (total + by segment + by geography)
Net income, EPS (basic and diluted)
Cash from operations (CFO), CapEx, Free cash flow
Total assets, total liabilities, stockholders' equity
Total accruals (Net Income - CFO)
Shares outstanding (basic and diluted)
R&D expense, stock-based compensation
Share repurchase amount, debt levels
Geographic revenue split (for currency mismatch analysis)

Derived signals (calculate on ingestion):

Accrual ratio = (Net Income - CFO) / Total Assets. Important caveat: Post-Regulation Fair Disclosure, this anomaly has compressed from 5-10% to ~1-3% net of costs. Use as a confirmation signal, not a standalone bet. [PANEL: Sonar — accruals reality check]
Asset growth = YoY change in total assets. Growth >20% YoY historically underperforms, but factor has decayed post-publication. [PANEL: Opus — accrual anomaly caveat]
Share issuance = YoY change in diluted shares. Net issuance >5% of float = dilution drag. Net buyback >3% = management signaling undervaluation.
R&D-adjusted P/E (capitalize R&D instead of expensing)
Altman Z-Score = 1.2*(Working Capital/Total Assets) + 1.4*(Retained Earnings/Total Assets) + 3.3*(EBIT/Total Assets) + 0.6*(Market Cap/Total Liabilities) + 1.0*(Revenue/Total Assets). Z < 1.8 = distress zone. Kalshi "above $X" contracts on distressed companies are typically overpriced. [PANEL: Sonar — bankruptcy risk]

Source: SEC EDGAR XBRL API. https://data.sec.gov/api/xbrl/companyfacts/CIK{cik_padded}.json returns ALL XBRL facts ever filed. Note: XBRL parsing is complex due to inconsistent corporate tagging. Consider Financial Modeling Prep (FMP) API ($19/month) as an alternative that parses XBRL into clean JSON. [PANEL: Gemini — XBRL complexity]

Collection frequency: Parse within 24 hours of each filing. Cluster around earnings seasons.

Storage:

CREATE TABLE fundamentals (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    cik TEXT,
    period_end DATE NOT NULL,
    period_type TEXT,
    revenue REAL, net_income REAL, eps_basic REAL, eps_diluted REAL,
    cfo REAL, capex REAL, fcf REAL,
    total_assets REAL, total_liabilities REAL, equity REAL,
    shares_basic REAL, shares_diluted REAL,
    rd_expense REAL, sbc_expense REAL,
    buyback_amount REAL, total_debt REAL,
    accrual_ratio REAL, asset_growth REAL, share_issuance_pct REAL,
    altman_z_score REAL,
    filing_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, period_end, period_type)
);

9. DEF 14A Proxy Statements (Annual)

What: Executive compensation details, board composition, shareholder proposals.

What to extract:

CEO total compensation (cash + equity breakdown)
YoY change in equity vs cash compensation ratio
Option grant strike prices (management defends these price levels)
Performance-based vesting criteria and target changes

Source: SEC EDGAR. Proxy season is March-May for most Fortune 500. Collection frequency: Daily during proxy season (Mar-Jun), weekly otherwise.

Storage:

CREATE TABLE exec_compensation (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    cik TEXT,
    fiscal_year INTEGER,
    ceo_name TEXT,
    total_comp REAL, cash_comp REAL, equity_comp REAL,
    equity_pct REAL,
    option_strike_cluster REAL,
    performance_target_change TEXT,
    filing_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, fiscal_year)
);

Key signal: Shift to equity-heavy + raised performance targets = management bullish. Option strike clustering = management defends that price level. For prediction markets: exec options at $300 strike and Kalshi "above $280" = management has financial incentive to keep stock above $300, making the $280 contract safer.

10. SEC Correspondence Letters (CORRESP)

What: When the SEC sends a company a letter questioning its filings, the correspondence is eventually made public. Companies receiving SEC comment letters underperform by ~8% over the next 12 months.

Simplification for v1: Use keyword alerts for "material weakness", "restatement", "going concern" rather than full NLP parsing on every CORRESP letter. Defer full CORRESP parsing to v2. [PANEL: Grok 4.1]

Source: SEC EDGAR, filing type "CORRESP". URL: https://efts.sec.gov/LATEST/search-index?forms=CORRESP Collection frequency: Daily scan.

Storage:

CREATE TABLE sec_correspondence (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filing_id TEXT UNIQUE NOT NULL,
    ticker TEXT,
    cik TEXT,
    filing_date DATE,
    has_material_weakness INTEGER DEFAULT 0,
    has_restatement INTEGER DEFAULT 0,
    has_going_concern INTEGER DEFAULT 0,
    keywords_found TEXT,
    filing_url TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

11. Form 424B2 Structured Products (Shared with Options Desk)

What: Banks file 424B2 supplements when issuing structured products (autocallables, barrier notes) tied to specific stocks. Reveals barrier/knockout levels and notional amounts.

Source: SEC EDGAR, filing type "424B2". Filter for equity-linked products. Start with top 5 issuers (Goldman, JPMorgan, Morgan Stanley, Citi, BofA). Collection frequency: Daily. This is a Month 2-3 build item.

Storage:

CREATE TABLE structured_products (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filing_id TEXT UNIQUE NOT NULL,
    issuer TEXT,
    underlying_ticker TEXT,
    product_type TEXT,
    barrier_level REAL,
    knockout_level REAL,
    notional_amount REAL,
    maturity_date DATE,
    filing_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

12. Filing Frequency Scanner + 8-K Real-Time Monitor

Standard filing alert types (hourly scan):

S-3 shelf registration: company preparing to issue shares (dilution)
SC TO-T: tender offer (acquisition premium)
DEFA14A: proxy fight heating up
NT 10-K/NT 10-Q: late filing notification (strong negative signal)
15-12G: deregistering securities = going dark (bearish)

8-K real-time monitor: Poll https://efts.sec.gov/LATEST/search-index?forms=8-K&dateRange=custom&startdt={today}&enddt={today} every 60 seconds during 4:00 AM - 8:00 PM ET. Parse 8-K item numbers: [PANEL: Opus — 8-K real-time monitor]

Item 1.01: material agreement
Item 1.02: bankruptcy
Item 2.01: acquisition/disposition
Item 2.02: earnings (results of operations)
Item 2.05: restructuring costs
Item 4.01: auditor change — RED FLAG (companies that change auditors underperform 10-15% next year)
Item 5.02: departure of directors/officers
Item 7.01: Regulation FD disclosure
Item 8.01: other events

Alert rule: Send immediate alert for Item 1.02, 2.05, 4.01, 5.02 on any stock with active prediction market contracts.

Simplification: Monitor 3 key 8-K patterns automatically; handle other filing types manually. [PANEL: Sonar]

Storage:

CREATE TABLE filing_alerts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filing_id TEXT UNIQUE NOT NULL,
    ticker TEXT,
    cik TEXT,
    filing_type TEXT,
    item_numbers TEXT,
    alert_priority TEXT,
    filing_datetime DATETIME,
    alert_sent INTEGER DEFAULT 0,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

SECTION C: PREDICTION MARKET DATA (Where We Trade)

13. Kalshi Stock Event Contracts

What: All stock-related Kalshi binary contracts.

Contract types:

"Will [TICKER] be above/below $X on [DATE]?" — price thresholds
"Will [TICKER] beat/miss earnings EPS estimate?" — earnings outcomes
"Will S&P 500 be above/below X on [DATE]?" — index levels
"Will [TICKER] announce dividend cut/increase?" — corporate actions
Fed rate decisions, CPI/inflation outcomes (impacts equities)

What to pull per contract:

Contract ticker/ID, event description
Parsed fields: underlying ticker, threshold price, expiration date, direction
Yes price, No price (cents, 0-100)
Volume, open interest
Order book depth (top 5 bids and asks with quantities)
Settlement rules (exact settlement source and time)
Last trade price and timestamp

Source: Kalshi API. Correct base URL: https://trading-api.kalshi.com/trade-api/v2/. Store this in a config file or environment variable — do NOT hardcode it in multiple places. Add a health check before each collection cycle. [PANEL: Opus — Kalshi URL fix]

Auth: RSA-PSS signing (as documented in CLAUDE.md).

Collection frequency: Every 30 minutes during market hours. Every 5 minutes for contracts expiring within 24 hours.

Storage:

CREATE TABLE prediction_market_stocks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    platform TEXT NOT NULL,
    contract_id TEXT NOT NULL,
    event_description TEXT,
    underlying_ticker TEXT,
    threshold_price REAL,
    direction TEXT,
    expiration_date DATE,
    yes_price REAL, no_price REAL,
    volume REAL, open_interest REAL,
    order_book_depth TEXT,
    settlement_source TEXT,
    settlement_time TEXT,
    last_trade_price REAL,
    collected_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(platform, contract_id, collected_at)
);
CREATE INDEX idx_pm_ticker_date ON prediction_market_stocks(underlying_ticker, expiration_date);

14. Polymarket Stock Event Contracts

What: Same as Kalshi but on Polymarket. Different liquidity pool (crypto-native traders), so prices often diverge.

Additional Polymarket-specific data:

AMM pool depth (liquidity available)
USDC flow in/out of the contract
Resolution source and rules

Source: Polymarket CLOB API: https://clob.polymarket.com/markets. Gamma API: https://gamma-api.polymarket.com/markets?tag=stocks.

Reliability note: Polymarket's CLOB API changes endpoints frequently. Add fallback to TheGraph subgraph queries (free, decentralized) and validate prices against multiple endpoints to detect oracle failures. [PANEL: Grok 4.1 — Polymarket reliability]

Cross-platform arbitrage caveat: Kalshi and Polymarket contracts on the "same" event frequently differ in: settlement sources, settlement times, and edge case handling (what happens if stock is halted on expiry day?). Use the settlement rule comparison engine (Section 15B) before treating any pair as arbitrageable. [PANEL: Opus — arbitrage caveat]

Collection frequency: Every 30 minutes during market hours.

15. Binary Probability Calculator (The Bridge)

What: Math that converts traditional stock analysis into Kalshi/Polymarket-comparable probabilities.

Inputs:

IV surface from options chains (Section 15A — required input, not a nice-to-have)
Risk-free rate: dynamic SOFR from FRED (series: SOFR), NOT hardcoded 5%
Time to expiration
Current price
Fundamental fair value estimate

Method: Use Black-Scholes d2 probability as the primary calculation. Caveats for edge cases: d2 is unreliable for deep OTM contracts, extreme IV, or near-zero interest rates. Use full risk-neutral distribution (Breeden-Litzenberger) for high-accuracy cases. [PANEL: GPT-4.1 — Black-Scholes edge cases]

Carry cost adjustment: CarryAdjustedFairValue = ModelProbability - (ModelProbability * SOFR * DaysToExpiry / 365). When comparing to market price: if MarketPrice > CarryAdjustedFairValue, the contract is overpriced. If MarketPrice < CarryAdjustedFairValue, it is underpriced.

Real carry cost note: The 70-cent 30-day example in the original spec (0.29 cents carry) understates actual costs. Real carry includes: counterparty risk on Polymarket (venue default), Kalshi's settlement fee (verify current fee structure), withdrawal delays (funds lock 1-5 days), slippage on entry/exit. Account for all costs before declaring a contract mispriced. [PANEL: Sonar — carry cost correction]

Output: Probability per threshold per expiration, compared to prediction market prices. Collection frequency: Recalculated every time prediction market prices or model inputs update.

15A. Options Chain Data (Required Input for Binary Probability Calculator)

What: Daily options chains for all stocks with active Kalshi/Polymarket contracts plus top 50 most liquid optionable stocks.

This was the most critical gap in the original spec. The Binary Probability Calculator in Section 15 explicitly requires IV surface data. Without it, you cannot price prediction market contracts accurately. This must be a Day 1 build item, not a cross-desk dependency. [PANEL: Opus, Gemini, GPT-4.1, Grok 4 Fast, Grok 4.1 — universal consensus on this gap]

What to collect per contract:

Ticker, expiration date, strike, option type (call/put)
Bid, ask, mid, volume, open interest
Implied volatility, delta, gamma, theta, vega
Underlying price at snapshot time

Derived per-ticker metrics:

ATM IV (at-the-money implied volatility)
25-delta put IV, 25-delta call IV
Put-call skew = 25d put IV minus 25d call IV (steep = institutional hedging = bearish positioning)
IV term structure (front month vs 3-month vs 6-month ATM IV)
Implied probability distribution (Breeden-Litzenberger second derivative)

Source: Polygon.io options snapshot endpoint: GET /v3/snapshot/options/{underlyingAsset} — covered by the $29/month subscription already required for equity data.

Collection frequency: Daily for all tracked names. Intraday (every 15 min) for names with contracts expiring within 5 days.

Storage:

CREATE TABLE options_chains (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    expiration_date DATE NOT NULL,
    strike REAL NOT NULL,
    option_type TEXT NOT NULL,
    bid REAL, ask REAL, mid REAL,
    volume INTEGER, open_interest INTEGER,
    implied_volatility REAL,
    delta REAL, gamma REAL, theta REAL, vega REAL,
    underlying_price REAL,
    snapshot_timestamp DATETIME,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, expiration_date, strike, option_type, snapshot_timestamp)
);
CREATE INDEX idx_options_ticker_exp ON options_chains(ticker, expiration_date);

15B. Settlement Rules Comparison Engine

What: A database mapping Kalshi contracts to Polymarket equivalents, with explicit settlement rule comparison. Cross-platform arbitrage is ONLY valid when settlement sources, times, and edge cases match. [PANEL: Opus — settlement rules engine]

Storage:

CREATE TABLE settlement_rule_pairs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    kalshi_contract_id TEXT,
    polymarket_contract_id TEXT,
    event_description_match_score REAL,
    settlement_source_kalshi TEXT,
    settlement_source_polymarket TEXT,
    settlement_time_kalshi TEXT,
    settlement_time_polymarket TEXT,
    settlement_source_match INTEGER DEFAULT 0,
    settlement_time_match INTEGER DEFAULT 0,
    edge_case_differences TEXT,
    arbitrageable INTEGER DEFAULT 0,
    price_difference_cents REAL,
    last_checked DATETIME,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

16. Prediction Market Historical Settlement Database

What: All historically settled contracts with outcomes, for model calibration. [PANEL: Opus — settlement history as MUST-HAVE]

Why this is MUST-HAVE: Without settled contract history, you cannot calibrate the Binary Probability Calculator. You need to know: historically, when there was a 15+ point divergence between model and market, how often did the model win?

Sources: Kalshi API: GET /trade-api/v2/markets?status=settled. Polymarket: https://gamma-api.polymarket.com/markets?closed=true&tag=stocks. Backfill ALL available history on Day 1.

Storage:

CREATE TABLE prediction_market_settlements (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    platform TEXT NOT NULL,
    contract_id TEXT NOT NULL,
    event_description TEXT,
    underlying_ticker TEXT,
    threshold_price REAL,
    direction TEXT,
    expiration_date DATE,
    settlement_date DATE,
    settlement_value REAL,
    our_model_probability REAL,
    our_entry_price REAL,
    final_pre_settlement_price REAL,
    price_7d_before REAL, price_3d_before REAL, price_1d_before REAL,
    resolution_source TEXT,
    resolution_notes TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(platform, contract_id)
);

SECTION D: MACRO AND ECONOMIC DATA (FRED API — Free)

17. FRED Economic Data Suite

What: Federal Reserve Economic Data — free API for macro data that drives sector rotation, risk appetite, and market regime.

Source: FRED API (free, API key required). Rate limit: 120 requests/minute. Base URL: https://api.stlouisfed.org/fred/series/observations.

Series to pull:

Series ID	Description	Frequency
DFF	Federal Funds Effective Rate	Daily
SOFR	Secured Overnight Financing Rate	Daily
DGS10	10-Year Treasury Yield	Daily
DGS2	2-Year Treasury Yield	Daily
DGS30	30-Year Treasury Yield	Daily
T10Y2Y	10Y-2Y Yield Spread (curve)	Daily
T10Y3M	10Y-3M Yield Spread	Daily
BAMLH0A0HYM2	ICE BofA HY OAS Spread	Daily
BAMLC0A0CM	ICE BofA IG OAS Spread	Daily
BAMLC0A4CBBB	BBB Corporate Bond Spread	Daily
DTWEXBGS	Trade-Weighted USD Index (Broad)	Daily
VIXCLS	CBOE VIX Close	Daily
WALCL	Fed Balance Sheet (Total Assets)	Weekly
RRPONTSYD	Reverse Repo Facility Usage	Daily
WTREGEN	Treasury General Account (TGA)	Weekly
NFCI	Chicago Fed Financial Conditions	Weekly
TOTCI	C&I Loans (bank lending pulse)	Weekly
ICSA	Initial Jobless Claims	Weekly
CPIAUCSL	CPI-U (All Items)	Monthly
PCE	Personal Consumption Expenditures	Monthly
PAYEMS	Total Nonfarm Payrolls	Monthly
UMCSENT	U Michigan Consumer Sentiment	Monthly
UNRATE	Unemployment Rate	Monthly
USREC	NBER Recession Indicator	Monthly

Note on SOFR: Added explicitly as a required series. Wire SOFR into the Binary Probability Calculator's carry cost adjustment. Do not use DFF as a proxy — SOFR is the correct modern risk-free rate. [PANEL: Opus]

Data lag flag: Many FRED series update with 1-3 day lags. Flag stale data in the UI to prevent false signals from acting on outdated macro readings. [PANEL: DeepSeek]

Derived signals (calculate on ingestion):

Yield curve inversion flag (T10Y2Y < 0)
HY spread z-score (vs 1-year rolling mean/stdev)
Credit regime: "risk-on" when HY tightening + VIX < 20, "risk-off" when widening + VIX > 25
USD momentum (20-day vs 60-day moving average)
Liquidity composite: RRP drain + TGA drain = liquidity injection (bullish). RRP + TGA building = liquidity drain (bearish)
Bank lending pulse: C&I loans declining 3+ weeks = credit tightening = earnings recession coming
Variance Risk Premium (VRP): VRP = VIX²/252 - RealizedVariance_20d, where RealizedVariance_20d = annualized variance of SPY daily log returns over trailing 20 trading days. Also calculate VRP z-score vs 1-year rolling. VRP regime: OVERPRICED if z > 1, FAIR if -1 < z < 1, UNDERPRICED if z < -1. High VRP = prediction market contracts paying off on large moves are OVERPRICED. Low VRP = they are UNDERPRICED. [PANEL: Opus — VRP as MUST-HAVE]
Single-stock VRP for any ticker with active Kalshi contracts: stock_vrp = (ATM_IV²/252) - stock_realized_variance_20d. [PANEL: Opus]

Collection frequency: Daily at 7 AM ET for overnight releases. Weekly series checked on Fridays.

Storage:

CREATE TABLE macro_data (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    series_id TEXT NOT NULL,
    observation_date DATE NOT NULL,
    value REAL,
    is_revised INTEGER DEFAULT 0,
    revision_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(series_id, observation_date)
);
CREATE INDEX idx_macro_series_date ON macro_data(series_id, observation_date);

CREATE TABLE macro_derived (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    calc_date DATE NOT NULL UNIQUE,
    yield_curve_inverted INTEGER DEFAULT 0,
    hy_spread_zscore REAL,
    credit_regime TEXT,
    usd_momentum TEXT,
    liquidity_composite REAL,
    vrp_raw REAL,
    vrp_zscore REAL,
    vrp_regime TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

18. Market Regime Classification

What: Daily classification of market environment.

Simplified to 3 primary states + confidence score. The original 6-bucket system (BULL/BEAR/CORRECTION/CRISIS/RANGE-BOUND/ROTATION) over-categorizes. In practice the system oscillates between CORRECTION and RANGE-BOUND without materially different trading logic. Use 3 states with confidence %, add detail when data supports it: [PANEL: Sonar, Grok 4.1 — simplify regimes]

BULL: SPY above 200 DMA, VIX < 20, HY spread < 400bps, >70% stocks above 50 DMA
BEAR: SPY below 200 DMA, VIX > 25, HY spread > 500bps, <40% stocks above 50 DMA
UNCERTAIN: Everything else

Also track correlation regime: [PANEL: Sonar — correlation regime as MUST-HAVE]

When SPX/HY correlation > 0.85 (equity-credit co-movement spikes), diversification breaks. Tag all predictions made during high-correlation regime — they require larger expected moves to overcome reduced diversification.
Calculate: 60-day rolling correlation of SPY vs HYG (HY bond ETF).

Storage:

CREATE TABLE market_regime (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    calc_date DATE NOT NULL UNIQUE,
    regime_label TEXT,
    confidence REAL,
    vix REAL,
    hy_spread REAL,
    yield_curve REAL,
    breadth_pct REAL,
    dispersion REAL,
    spx_hy_correlation REAL,
    high_correlation_flag INTEGER DEFAULT 0,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

SECTION E: SHORT INTEREST AND DARK POOL DATA

19. FINRA Short Interest (Bi-Monthly)

What: Total shares sold short per ticker.

Important: FINRA short interest is bi-monthly with ~10-day reporting lag. For real-time squeeze timing, supplement with: borrow fee spikes (real-time from broker or OCC), options open interest skew (puts vs calls), X mentions of "short squeeze". Acknowledge SI data is a lagging signal. [PANEL: Sonar, Grok 4.1]

What to collect:

Short interest (total shares)
Days to cover (SI / average daily volume)
Short interest as % of float
Change vs prior report

Source: FINRA. https://www.finra.org/finra-data/browse-catalog/short-interest/data. Published bi-monthly.

Storage:

CREATE TABLE short_interest (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    settlement_date DATE NOT NULL,
    short_interest REAL,
    days_to_cover REAL,
    short_pct_float REAL,
    change_from_prior REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, settlement_date)
);

Key signals:

Days to cover > 5 = squeeze risk. Days to cover > 10 = extreme squeeze risk.
SI velocity (rate of change) + borrow cost = powerful squeeze/crowding signal.

20. FINRA ATS (Dark Pool) Weekly Volume

What: Weekly share volume traded on each Alternative Trading System per ticker.

Source: FINRA ATS Transparency Data. https://ats-transparency.finra.org/otc/ats-nms-weekly-data. Published weekly. Collection frequency: Weekly download every Monday morning.

Storage:

CREATE TABLE dark_pool_volume (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    week_ending DATE NOT NULL,
    ats_name TEXT,
    share_volume REAL,
    trade_count INTEGER,
    pct_of_total_volume REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, week_ending, ats_name)
);

CBOE venue breakdown: Also collect daily lit venue market share from https://www.cboe.com/us/equities/market_statistics/ — free daily CSV. Compute dark_pool_ratio = dark_pool_volume / total_volume, and dark_pool_ratio_zscore vs 20-day rolling. A stock with normal total volume but 60% dark pool (vs 35% baseline) is being accumulated/distributed by institutions. [PANEL: Opus — venue breakdown upgraded from NICE-TO-HAVE]

21. REG SHO Threshold List (Daily, Free)

What: SEC publishes daily list of stocks with excessive failed-to-deliver (FTD) shares.

Source: https://cdn.finra.org/equity/regsho/daily/ Collection frequency: Daily.

Storage:

CREATE TABLE regsho_threshold (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    threshold_date DATE NOT NULL,
    exchange TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, threshold_date)
);

22. SEC Failure-to-Deliver (FTD) Data

What: SEC publishes twice-monthly FTD files with exact share counts per ticker. FTD data is MORE granular than the REG SHO threshold list (which is a binary flag). The REG SHO list only triggers AFTER FTDs exceed threshold for 13 consecutive settlement days — FTD data lets you see the buildup 1-2 weeks early. [PANEL: Opus — FTD as MUST-HAVE gap]

Source: https://www.sec.gov/data/foiadocsfailsdatahtm — free CSV files, published twice monthly with ~2-week reporting lag.

Storage:

CREATE TABLE ftd_data (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    settlement_date DATE NOT NULL,
    cusip TEXT,
    ticker TEXT NOT NULL,
    quantity_ftd REAL,
    description TEXT,
    price REAL,
    ftd_as_pct_float REAL,
    ftd_as_pct_avg_volume REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(settlement_date, ticker)
);

23. OCC Stock Borrow Rates + DTCC Swap Reporting

Source: OCC (theocc.com — free daily rates). DTCC (free, post-Archegos transparency rules). Collection frequency: Daily.

Storage:

CREATE TABLE borrow_rates (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    rate_date DATE NOT NULL,
    borrow_rate REAL,
    borrow_status TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, rate_date)
);

Key signal: Borrow rate > 5% = hard to borrow, put/call parity breaks. Borrow rate spike = short squeeze pressure building.

24. FINRA Margin Debt (Monthly)

What: Total margin debt in brokerage accounts — a retail leverage indicator. High margin debt often precedes corrections; extreme declines signal capitulation. [PANEL: Grok 4.1]

Source: FINRA Margin Statistics. https://www.finra.org/investors/learn-to-invest/advanced-investing/margin-statistics — free monthly CSV.

Storage:

CREATE TABLE margin_debt (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    report_month DATE NOT NULL UNIQUE,
    total_margin_debt_millions REAL,
    free_credit_cash_millions REAL,
    margin_debt_zscore REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

SECTION F: EARNINGS DATA

25. Earnings Calendar and Upcoming Dates

Source: FMP API free tier (250 requests/day). https://financialmodelingprep.com/api/v3/earning_calendar?from={date}&to={date}&apikey={KEY} Collection frequency: Daily update at 7 AM ET.

Storage:

CREATE TABLE earnings_calendar (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    report_date DATE NOT NULL,
    report_time TEXT,
    fiscal_quarter TEXT,
    eps_estimate REAL,
    rev_estimate REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, report_date)
);

26. Historical Earnings Surprise Database

What per earnings report:

Report date, time (pre/post market), fiscal quarter
Consensus EPS estimate (mean, median, high, low, number of analysts)
Actual reported EPS
Surprise % = (Actual - Consensus) / |Consensus|
Revenue consensus, actual, surprise %
Guidance: raised/lowered/maintained/none
Stock price reaction: close-to-close on report day, 2-day cumulative
Pre-earnings run-up: 5-day return before earnings
Post-earnings drift: 5-day, 10-day, 20-day, 60-day returns after
Implied move (ATM straddle price / stock price) vs actual move

Source: SEC EDGAR XBRL (actuals), FMP API (consensus estimates + surprise history). Initial backfill: last 5-10 years. Collection frequency: Daily during earnings season. Update reaction metrics at 1/5/10/20/60 day marks.

Storage:

CREATE TABLE earnings_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    report_date DATE NOT NULL,
    fiscal_quarter TEXT,
    eps_consensus REAL, eps_actual REAL, eps_surprise_pct REAL,
    rev_consensus REAL, rev_actual REAL, rev_surprise_pct REAL,
    guidance_direction TEXT,
    implied_move REAL, actual_move_1d REAL, actual_move_2d REAL,
    pre_run_5d REAL,
    drift_5d REAL, drift_10d REAL, drift_20d REAL, drift_60d REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, report_date)
);

27. Analyst Estimate Revisions + Dispersion

What to track:

Current consensus for next 4 quarters and next 2 fiscal years
Revisions up/down in last 7/30/90 days
Revision ratio = up / (up + down)
Target price consensus, high, low
Rating distribution: strong buy, buy, hold, sell, strong sell
Estimate dispersion: std dev of analyst EPS estimates / mean EPS estimate. High dispersion = low consensus = high surprise risk. Contrarian: low-dispersion estimates beat more often (crowded consensus). Directly predicts Kalshi "beat/miss" contract mispricing. [PANEL: Sonar — analyst dispersion as MUST-HAVE]

Source: FMP API (free tier). Estimize (crowdsourced estimates, free basic tier) for social whisper consensus. Collection frequency: Daily snapshot after market close.

Storage:

CREATE TABLE estimate_revisions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    snapshot_date DATE NOT NULL,
    eps_consensus REAL, eps_high REAL, eps_low REAL,
    eps_dispersion REAL,
    revisions_up_7d INTEGER, revisions_down_7d INTEGER,
    revision_ratio_7d REAL,
    target_price_mean REAL, target_price_high REAL, target_price_low REAL,
    rating_strong_buy INTEGER, rating_buy INTEGER, rating_hold INTEGER,
    rating_sell INTEGER, rating_strong_sell INTEGER,
    estimize_eps REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, snapshot_date)
);

28. Earnings Call NLP Analysis (Month 2)

Simplification for v1: Use FinBERT sentiment alone initially. Skip the Fog Index and QoQ drift detection until you have a stable pipeline. The full NLP pipeline is a Month 2 item. [PANEL: DeepSeek, Gemini — simplify NLP]

NLP metrics (Month 2+):

FinBERT sentiment (positive/negative/neutral) via Hugging Face REST API
Uncertainty word frequency ("approximately", "potentially", "challenging")
Forward-looking statement ratio (future tense / total verbs)
Fog Index (readability/complexity) — higher = more obfuscation
QoQ changes in all metrics (drift detection)

Source: Transcript scrapers (Motley Fool, Seeking Alpha). Defer custom FinBERT local install; use the HuggingFace Inference API (free tier) instead.

Storage:

CREATE TABLE earnings_call_nlp (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    report_date DATE NOT NULL,
    finbert_sentiment TEXT,
    finbert_score REAL,
    uncertainty_word_pct REAL,
    forward_looking_ratio REAL,
    fog_index REAL,
    qoq_sentiment_delta REAL,
    parse_status TEXT DEFAULT 'ok',
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, report_date)
);

SECTION G: CORPORATE EVENTS AND CALENDAR

29. IPO Lockup Expiration Calendar

What to collect:

Company, ticker, IPO date, lockup expiry date
Insider ownership % of float
VC/PE-backed flag
Estimated supply pressure (lockup shares / current float)

Source: IPOScoop (https://www.iposcoop.com/ipo-lockup-expirations/), MarketBeat. Cross-ref with SEC S-1 "Shares Eligible for Future Sale" section. Collection frequency: Weekly scan every Sunday. Flag lockups expiring within 30 days.

Storage:

CREATE TABLE ipo_lockups (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    ipo_date DATE,
    lockup_expiry DATE NOT NULL,
    insider_ownership_pct REAL,
    is_vc_backed INTEGER DEFAULT 0,
    supply_pressure_pct REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, lockup_expiry)
);

Key signal: Stocks underperform 10-15 days BEFORE lockup expiry. VC-backed + high insider ownership (>40%) = most pressure.

30. Index Reconstitution and Corporate Actions

What to track:

S&P 500 additions/deletions (ad hoc, from S&P press releases)
Russell 2000/1000 reconstitution (annual in June)
MSCI rebalancing (quarterly)
Historical S&P 500 constituent changes with dates (survivorship bias prevention)
Stock splits (ratio, effective date)
Dividend ex-dates, record dates, amounts (special vs regular)
Spin-offs (parent, spin-off entity, distribution ratio, effective date)

Exchange halt/resume data: Track Nasdaq Trader Halts page, NYSE, CBOE for trading halts. Critical for risk management — squeeze setups and M&A often trigger halts. All free and scrapeable. [PANEL: GPT-4.1]

Sources: S&P Dow Jones press releases (free), FTSE Russell, Yahoo Finance corporate actions, Nasdaq dividend calendar, SEC Form 10-12B (spin-offs). Collection frequency: Daily at 7 AM ET.

Storage:

CREATE TABLE corporate_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    event_date DATE NOT NULL,
    event_type TEXT NOT NULL,
    event_detail TEXT,
    is_sp500_change INTEGER DEFAULT 0,
    effective_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE trading_halts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    halt_datetime DATETIME,
    resume_datetime DATETIME,
    halt_reason TEXT,
    exchange TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

31. Buyback Blackout Periods

How to estimate: Earnings date minus 14 calendar days = blackout start. Earnings date + 2 business days = blackout end. Calculate % of S&P 500 market cap in blackout at any given time.

Collection frequency: Calculated from earnings calendar. Storage: Column in daily state table.

32. Corporate Supply Calendar (Secondaries, ATM Offerings)

What: Upcoming share issuance (secondary offerings, at-the-money offerings) directly impacts price via dilution. Missing this risks buying into an impending supply shock. [PANEL: DeepSeek]

Source: SEC Form S-3 and Form 424B filings (free). Monitor daily.

Storage:

CREATE TABLE share_issuance_calendar (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filing_id TEXT UNIQUE NOT NULL,
    ticker TEXT NOT NULL,
    filing_type TEXT,
    shares_offered REAL,
    offering_price REAL,
    dilution_pct REAL,
    offering_date DATE,
    filing_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

33. Daily State Table

What: One row per trading day with every relevant boolean flag and event marker.

Schema:

CREATE TABLE daily_state (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    trade_date DATE NOT NULL UNIQUE,
    day_of_week TEXT,
    is_monthly_opex INTEGER DEFAULT 0,
    is_quarterly_opex INTEGER DEFAULT 0,
    days_to_next_opex INTEGER,
    is_fomc_day INTEGER DEFAULT 0,
    is_fomc_eve INTEGER DEFAULT 0,
    days_to_next_fomc INTEGER,
    is_cpi_day INTEGER DEFAULT 0,
    is_nfp_day INTEGER DEFAULT 0,
    is_ppi_day INTEGER DEFAULT 0,
    is_pce_day INTEGER DEFAULT 0,
    econ_release_tier INTEGER,
    is_treasury_auction INTEGER DEFAULT 0,
    auction_tenor TEXT,
    num_sp500_reporting INTEGER DEFAULT 0,
    is_peak_earnings_week INTEGER DEFAULT 0,
    pct_sp500_in_blackout REAL,
    is_month_end INTEGER DEFAULT 0,
    is_quarter_end INTEGER DEFAULT 0,
    is_half_day INTEGER DEFAULT 0,
    is_holiday INTEGER DEFAULT 0,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

Note on data quality gate: The daily state table is generated at 6:00 AM ET before market open. If any CRITICAL data quality check fails (see Section M), this table can be used to block trading for that day.

SECTION H: SENTIMENT AND SOCIAL DATA

34. Reddit (Subreddits, Tiered Treatment)

r/wallstreetbets (WSB) — every 30 min during market hours:

Ticker mention count per hour
Post score, comment count, awards count (conviction proxy)
Volume spike detection: flag when mentions exceed 3x 7-day average
WSB Momentum Score (0-100): composite of mention velocity, sentiment, YOLO dollars, awards
CONTRARIAN SIGNAL: WSB Momentum Score >90 for 3+ consecutive days AND ticker up >15% = SELL signal

r/stocks — every 2 hours:

DD flair posts: full text, ticker, author karma
DD quality score: author karma * upvote_ratio * (1 + log(comment_count))
New Ticker Alert: ticker appearing for first time in 30 days with positive DD = early retail discovery

r/investing — every 2 hours:

Macro keyword frequency: "recession", "fed", "rate cut", "crash", "bubble"
Macro Fear Index: ratio of fear keywords to greed keywords, 7-day moving average

r/options — every 30 min:

Call vs put sentiment ratio per ticker
LEAPS accumulation tracking

Source: Reddit API / PRAW (free). Rate limits: 60 requests/minute authenticated. Stability note: Reddit's API access has become increasingly restricted since 2023. Build a fallback to direct HTML scraping of old.reddit.com for historical data when PRAW fails. Also consider Pushshift.io for historical data. [PANEL: Opus — Reddit reliability]

Storage:

CREATE TABLE reddit_sentiment (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    subreddit TEXT NOT NULL,
    collected_at DATETIME NOT NULL,
    mention_count INTEGER DEFAULT 0,
    mention_velocity REAL,
    wsb_momentum_score REAL,
    sentiment_score REAL,
    dd_quality_score REAL,
    macro_fear_index REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

35. X/Twitter Flow via Grok Search

Tier 1 — Market Structure (every 5 min): @unusual_whales, @spotgamma, @OptionsHawk, @WallStJesus, @SqueezeMetrics, @GarrettDeSimone Tier 2 — Macro & Fed (every 15 min): @NickTimiraos, @WalterBloomberg, @DeItaone, @Markets Tier 3 — Stock Analysis (every 30 min): @GaryBlack00, @DougKass, @elerianm Tier 4 — Retail Sentiment (every 60 min): @jimcramer (use as RETAIL ATTENTION signal only — NOT a directional contrarian signal; the Inverse Cramer ETF SJIM was liquidated in 2024 due to poor performance, suggesting the pure contrarian thesis is unreliable) [PANEL: Opus, Sonar — Cramer signal correction]

Pattern detection:

Coordinated ticker pumping: 10+ low-follower accounts tweeting same ticker in 5 min = pump-and-dump, AVOID
Influencer cascade: Tier 1 account tweets ticker, 3+ Tier 2/3 amplify within 1 hour = institutional narrative forming
Sentiment divergence: X positive but price declining = trust price for intraday, trust sentiment for 5+ day horizon

Source: Grok's built-in X search via xAI API (free, already have access).

Storage:

CREATE TABLE x_sentiment (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT,
    keyword_group TEXT,
    account_tier INTEGER,
    mention_count INTEGER,
    sentiment TEXT,
    collected_at DATETIME,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

36. Congressional Trading Disclosures

Source: Capitol Trades (https://www.capitoltrades.com/) or Quiver Quant (https://www.quiverquant.com/congresstrading/). Free to scrape. Supplement with QuiverQuant and flag records with confidence/lag indicators — STOCK Act filings are often delayed and incomplete. [PANEL: GPT-4.1 — congressional data lag]

Collection frequency: Daily.

Storage:

CREATE TABLE congressional_trades (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    politician_name TEXT,
    party TEXT,
    ticker TEXT NOT NULL,
    transaction_type TEXT,
    transaction_date DATE,
    disclosure_date DATE,
    amount_range TEXT,
    lag_days INTEGER,
    data_confidence TEXT DEFAULT 'medium',
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

37. Google Trends

Source: pytrends Python library (free, unofficial). Note: Google Trends has 1-5 day batching delay — it is NOT real-time. The category/brand divergence analysis is clever but defer to v2 given the lag and noise. Use simple ticker mention trends for v1. [PANEL: Sonar — Google Trends lag caveat]

Collection frequency: Daily.

Storage:

CREATE TABLE google_trends (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    trend_date DATE NOT NULL,
    search_interest INTEGER,
    buy_search_interest INTEGER,
    sell_search_interest INTEGER,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, trend_date)
);

38. Retail Flow Proxies

Sources (all free to scrape):

Fidelity order flow: eresearch.fidelity.com/eresearch/gotoBL/fidelityTopOrders.jhtml — daily buy/sell ratios
Stocktwits: api.stocktwits.com/api/2/streams/symbol/{TICKER}.json (free, 200/hr). Bull/bear ratio >80% bullish = contrarian sell

Simplification: Defer Webull (undocumented API, unstable) and eToro (non-US retail base, minimal predictive value for US equities) from the automated pipeline. Start with Fidelity + Stocktwits only. [PANEL: Opus — eToro removal; GPT-4.1, Sonar — simplify retail proxies]

Retail Consensus Indicator (derived composite): Composite of Stocktwits (30%), WSB (30%), r/stocks (25%), Fidelity (15%). Score >85 sustained for 3+ days = CONTRARIAN SELL. Score <15 sustained for 3+ days = CONTRARIAN BUY.

ICI Weekly Mutual Fund Flows: https://www.ici.org/research/stats/flows — free weekly data published Wednesday for prior week. This measures actual money movement by retail investors, not just social media chatter. Extreme outflows (z < -2) = retail capitulation = historically excellent buy signal. Extreme inflows (z > 2) = euphoria = caution. [PANEL: Opus — ICI fund flows]

Storage:

CREATE TABLE retail_flow_proxies (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    source TEXT NOT NULL,
    collected_at DATETIME NOT NULL,
    buy_ratio REAL,
    bull_bear_ratio REAL,
    retail_consensus_indicator REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE fund_flows (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    week_ending DATE NOT NULL UNIQUE,
    equity_domestic_flows_millions REAL,
    equity_world_flows_millions REAL,
    total_equity_flows_millions REAL,
    equity_flow_zscore REAL,
    extreme_outflow_flag INTEGER DEFAULT 0,
    extreme_inflow_flag INTEGER DEFAULT 0,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

39. Earnings Whisper / Social Consensus

What: The "real" expectation that traders have, which often differs from published consensus.

Sources:

Estimize (crowdsourced estimates, free basic tier) — beats Wall Street consensus ~70% of the time
Social Whisper consensus already captured via Section 27 (estimate revisions) using Estimize integration

Important: Do NOT build custom NLP for extracting EPS expectations from Reddit/X text — too noisy, too easy to game by bots. Use Estimize crowdsourced platform instead. [PANEL: Sonar — social whisper quality]

Key signal: Company beats Street consensus but MISSES Estimize = negative reaction ("beat and dump"). Company misses Street but matches Estimize = muted reaction.

SECTION I: ALTERNATIVE DATA (Free Sources — Build in Month 2+)

These are all month 2+ items. Do not build these before the core pipeline (Sections A-H) is stable and generating predictions.

40. USPTO Patent Data

What to track:

Forward citations by company (moat strengthening/eroding signal)
Patent APPLICATIONS (pre-grant) — reveals R&D direction 18-24 months early
Patent ABANDONMENT (failure to pay maintenance fees) — distress signal
Patent ASSIGNMENT transfers (selling patents) — distressed IP monetization

Defer: Trademark filing monitoring. Signal-to-noise ratio is terrible — companies file trademarks speculatively all the time. [PANEL: Opus]

Source: USPTO PatentsView API (free, no key needed). https://api.patentsview.org/patents/query Collection frequency: Weekly.

Storage:

CREATE TABLE patent_citations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    company_cik TEXT,
    ticker TEXT,
    period_date DATE NOT NULL,
    patent_count INTEGER,
    forward_citation_count INTEGER,
    citation_yoy_change REAL,
    abandonment_count INTEGER,
    assignment_count INTEGER,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

41. Job Posting Volume (Month 2)

Source: Company careers pages (direct scraping for top 100 holdings). Revealera.com (free tracker). Collection frequency: Weekly snapshot.

Storage:

CREATE TABLE job_postings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    snapshot_date DATE NOT NULL,
    total_postings INTEGER,
    revenue_roles INTEGER,
    cost_restructuring_roles INTEGER,
    mom_change_pct REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, snapshot_date)
);

Key signal: Posting volume drop >40% MoM = hiring freeze = operational problems incoming.

42. App Store Rankings (Manual Check — Do Not Automate)

Only relevant for ~15-20 consumer app companies. Do not build an automated pipeline. Check Sensor Tower free tier or data.ai free tier manually for specific tickers when actively trading them. [PANEL: Opus, GPT-4.1]

43. SaaS Pricing Page Monitoring — DEFERRED TO v2

Signal is too infrequent (major changes 1-2x/year), scraping is maintenance-intensive, and the signal is usually already priced in by the time it's detectable. [PANEL: Opus, Grok 4.1, Sonar]

44. GitHub Commit Velocity — DEFERRED TO v3

Only relevant for ~20 open-source-heavy tech companies. Signal-to-noise ratio is weak (commits can be noise, bots, linting). GitHub API limits throttle high-frequency collection. [PANEL: Opus, Gemini, DeepSeek, GPT-4.1]

SECTION J: CROSS-ASSET AND INTERNATIONAL SIGNALS

45. Credit-Equity Divergence

Source: FRED API (free). HY OAS (BAMLH0A0HYM2), IG OAS (BAMLC0A0CM). Already collected in Section D.

Key signal: CDX.HY widening 2+ days while SPX flat = equity about to catch down. CDX.HY tightening while SPX sells off = equity overreacting, buy the dip. ~70% directional accuracy.

46. Commodity-to-Sector Margin Mapping

Mappings (1-quarter lag):

Copper/Aluminum up = Construction/Industrials margins compress
WTI Crude up = Airlines compress, Energy expand
Lumber up = Homebuilder margins compress
Natural Gas up = Chemicals compress, Utilities expand
Lithium/Cobalt up = EV makers compress (TSLA, RIVN, LCID)
Coffee/Cocoa up = Consumer staples compress (SBUX, HSY, MDLZ)

Source: FRED commodity prices (free). https://fred.stlouisfed.org/categories/32217 Collection frequency: Daily.

Storage:

CREATE TABLE commodity_sector_signals (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    signal_date DATE NOT NULL,
    commodity TEXT NOT NULL,
    commodity_price REAL,
    commodity_30d_change_pct REAL,
    affected_sectors TEXT,
    signal_direction TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(signal_date, commodity)
);

47. CFTC Commitments of Traders (Equity Index Futures)

Source: CFTC.gov (free CSV downloads). Released Friday 3:30 PM ET, data from Tuesday — 3 days stale. Collection frequency: Weekly (shared with Futures desk).

Storage:

CREATE TABLE cot_equity (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    report_date DATE NOT NULL,
    contract TEXT NOT NULL,
    dealer_long REAL, dealer_short REAL,
    leveraged_fund_long REAL, leveraged_fund_short REAL,
    asset_manager_long REAL, asset_manager_short REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(report_date, contract)
);

48. International Leading Indicators (Simplified Scope)

Limit to the two highest-signal indicators. Defer Japan machinery orders, Australian iron ore, and China PMI to v2 — they're better suited to the Futures/Macro desk. [PANEL: Grok 4.1, Sonar]

TSMC monthly revenue: https://investor.tsmc.com/english/monthly-revenue. Released ~10th of month. TSMC revenue up >20% YoY = chip demand boom. Directly predicts NVDA, AMD, INTC, MU earnings with 1-quarter lead.
Korean semiconductor exports: Korea Customs Service. Monthly. Similar chip demand lead.

Storage:

CREATE TABLE international_indicators (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    indicator_name TEXT NOT NULL,
    period_date DATE NOT NULL,
    value REAL,
    yoy_change_pct REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(indicator_name, period_date)
);

49. FDIC Call Reports (Bank Stocks — Simplified)

Simplification: Only collect 3 metrics from FDIC. Full FDIC report parsing (80+ fields) is overkill unless actively trading financial sector stocks. [PANEL: Sonar — FDIC simplification]

Quarterly C&I loan change (from Fed H.8, already in FRED)
CET1 ratio (from 10-Q, already in XBRL fundamentals)
Non-performing loan ratio (from 10-Q, already in XBRL)

Key signal: Rising provision-to-loan ratio in CRE or C&I at multiple banks = systemic stress 1-2 quarters ahead.

SECTION K: DERIVED AND COMPUTED DATABASES

These are NOT collected from external sources. They are COMPUTED from data above and our own prediction history.

50. Failed Trade Pattern Database

Note: Build the schema on Day 1, but do not expect meaningful analysis until you have 50+ settled trades. Do not tune the categories until 200+ trades.

Root cause categories (simplified to 5 for launch): [PANEL: Opus — simplification from 10 to 5]

WRONG_THESIS — fundamental analysis was incorrect
WRONG_TIMING — thesis eventually correct but outside our window
EXTERNAL_SHOCK — unforeseeable event
EXECUTION_ERROR — sizing, entry/exit timing, or liquidity issue
REGIME_SHIFT — market environment changed after entry

Add a free-text notes field for nuance. After 200 settled trades, analyze notes to determine if any category should be split.

Storage:

CREATE TABLE failed_trade_patterns (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    prediction_id TEXT NOT NULL,
    ticker TEXT NOT NULL,
    direction TEXT,
    entry_date DATE, exit_date DATE,
    return_pct REAL,
    market_regime TEXT,
    correlation_regime TEXT,
    root_cause TEXT,
    notes TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

51. Prediction Calibration Database (Brier Scores)

Storage:

CREATE TABLE prediction_calibration (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    prediction_id TEXT NOT NULL UNIQUE,
    desk TEXT DEFAULT 'stocks',
    model_name TEXT,
    confidence_pct REAL,
    confidence_bucket TEXT,
    outcome INTEGER,
    brier_score REAL,
    market_regime TEXT,
    time_horizon TEXT,
    settled_date DATE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_calibration_bucket ON prediction_calibration(confidence_bucket, market_regime);

52. Factor Exposure Attribution

Source: Calculated from Fama-French factors + sector returns + individual stock returns.

Storage:

CREATE TABLE factor_attribution (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    prediction_id TEXT NOT NULL,
    market_beta REAL,
    sector_beta REAL,
    size_factor REAL,
    value_factor REAL,
    momentum_factor REAL,
    stock_specific_alpha REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

53. Smart Money Convergence Score

Composite score per ticker per day: Insider buying (Form 4), institutional accumulation (13F), activist interest (13D), dark pool volume zscore, congressional trading, FTD trend.

Storage:

CREATE TABLE smart_money_convergence (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    calc_date DATE NOT NULL,
    insider_score REAL,
    institutional_score REAL,
    activist_score REAL,
    dark_pool_score REAL,
    congressional_score REAL,
    ftd_score REAL,
    composite_score REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, calc_date)
);

54. Squeeze Probability Score (Simplified)

Simplified version: Daily screening, not 15-minute updates (squeezes develop over days, not minutes). [PANEL: Opus — squeeze score simplification]

Screen criteria: SI > 20% of float AND days-to-cover > 5 AND borrow rate > 10%. Flag these for manual review. Add WSB mention velocity as a timing trigger.

Storage:

CREATE TABLE squeeze_scores (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker TEXT NOT NULL,
    calc_date DATE NOT NULL,
    short_pct_float REAL,
    days_to_cover REAL,
    borrow_rate REAL,
    wsb_mention_velocity REAL,
    ftd_pct_float REAL,
    squeeze_flag INTEGER DEFAULT 0,
    squeeze_score REAL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(ticker, calc_date)
);

SECTION L: PYTHON-TYPESCRIPT INTEGRATION ARCHITECTURE

[PANEL: Opus — critical architecture addition, entire section]

The boundary: Python handles all data collection (financial Python ecosystem is superior — pandas, yfinance, praw, pytrends, nltk, textstat, FinBERT all have no good TypeScript equivalents). TypeScript/Node.js handles the orchestration layer, API serving, and prediction pipeline.

Communication layer: Shared database with a job queue table.

CREATE TABLE job_queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_id TEXT UNIQUE NOT NULL,
    job_type TEXT NOT NULL,
    status TEXT DEFAULT 'pending',
    payload TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    started_at DATETIME,
    completed_at DATETIME,
    error_message TEXT
);

CREATE TABLE alerts (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    alert_type TEXT NOT NULL,
    ticker TEXT,
    severity TEXT,
    message TEXT,
    payload TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    acknowledged INTEGER DEFAULT 0
);

Pattern:

Node.js inserts jobs into job_queue (or Python cron inserts directly)
Python workers poll job_queue for status = 'pending', mark started_at, execute, write results to target tables, set status = 'completed' or status = 'failed'
For real-time alerts (8-K filings, squeeze flags, borrow rate spikes), Python writes to alerts table
Node.js polls alerts table every 10 seconds for unacknowledged alerts

No message broker needed — the database handles this volume. PostgreSQL LISTEN/NOTIFY can be used for lower-latency alert delivery if needed.

API key management: Track ~8-10 API keys (FRED, FMP, Polygon, PRAW, etc.) in .env file. Document which keys are required vs optional. A single missing optional key should NOT crash the pipeline. Rotate keys quarterly.

SECTION M: DATA QUALITY MONITORING

[PANEL: Opus — critical addition, entire section]

A data quality monitoring job runs daily at 6:30 AM ET, BEFORE the trading day. Results go into data_quality_log. Any CRITICAL failure blocks trading for that day.

CREATE TABLE data_quality_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_name TEXT NOT NULL,
    check_timestamp DATETIME NOT NULL,
    check_type TEXT NOT NULL,
    status TEXT NOT NULL,
    details TEXT,
    rows_affected INTEGER,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

Checks to run:

FRESHNESS: For each data source, verify last ingestion timestamp is within expected window. daily_prices should have data from yesterday. macro_data daily series should have data from yesterday. If EDGAR had no Form 4 filings all day (possible but rare), log as WARNING not CRITICAL.
COMPLETENESS: Verify row counts. daily_prices should have ~700 rows per day. If only 50 rows, yfinance/Polygon failed partially.
RANGE: Verify no prices are negative, no volumes are negative, VIX is between 5 and 100, RSI is between 0 and 100, all implied volatilities are between 0 and 5 (500%).
CONSISTENCY: Verify adjusted close changes are within 25% of prior day (catches bad split adjustments). When Polygon and yfinance both have data for the same ticker, verify prices are within 1%.
CROSS-SOURCE: If prediction market prices for a contract have not updated in >2 hours during market hours, flag as stale.

Failure thresholds:

CRITICAL (blocks trading): Price data missing for >50% of S&P 500, VIX missing, SOFR missing
HIGH (alert only): Form 4 filings not received for 3+ consecutive days, prediction market data stale
MEDIUM (log only): Individual ticker data gaps, minor completeness issues

IMPLEMENTATION CHECKLIST (Ordered by Priority)

Phase 1 — Core Infrastructure (Week 1, ~$29/month)

Set up database (SQLite for v1, with migration path to PostgreSQL documented)
Polygon.io subscription activated — Day 1, not Month 2
Daily OHLCV collector (Polygon primary, yfinance fallback) + health check
FRED macro collector (23 series + SOFR)
VRP calculation wired into macro_derived table
EDGAR event-driven feed architecture (watermark-based, not ticker-by-ticker)
Form 4 collector (event-driven feed, insider name canonicalization, raw XML storage)
Kalshi scraper (correct API URL in config, health check)
Polymarket scraper (TheGraph fallback)
Prediction market settlement history backfill
Binary Probability Calculator (SOFR-based carry cost, NOT hardcoded 5%)
Options chain collector (Polygon, required for the calculator)
Data quality monitoring job (6:30 AM ET daily)
Job queue table and Python worker template
Market regime classifier (3 states + correlation regime)
Daily state table generator

Phase 2 — SEC Pipeline (Week 2, $0)

13F quarterly processor (smart money cohort, overlap scores)
13F-NT tracker (late filings from smart money = signal)
13D activist tracker (daily EDGAR scan, tier classification)
Form 144 pre-sale notices (with parse failure queue)
Earnings calendar + historical earnings surprise database
Analyst estimate revisions + dispersion (Estimize integration)
FINRA short interest (bi-monthly)
FTD data (twice-monthly SEC CSV)
OCC borrow rates
REG SHO threshold list
FINRA ATS dark pool volume
CBOE venue breakdown (dark pool ratio zscore)
8-K real-time monitor (60-second polling, item number parser, alerts)
DEF 14A proxy analyzer
N-PORT monthly fund holdings (top 50 funds only)
Settlement rules comparison engine

Phase 3 — Sentiment + Events (Week 3, $0)

Reddit sentiment pipeline (PRAW + old.reddit fallback, WSB Momentum Score)
X/Twitter flow via Grok search (tiered accounts)
Retail flow proxies (Fidelity + Stocktwits only for v1)
ICI weekly mutual fund flows
FINRA margin debt (monthly)
Congressional trading scraper
Google Trends (simple ticker search, defer category/brand divergence)
IPO lockup calendar
Corporate events + exchange halt tracker
Corporate supply calendar (S-3/424B)
Commodity-sector margin signals
CFTC COT data
International indicators (TSMC + Korean semis only)
Buyback blackout period calculator
Smart money convergence score (composite)
Squeeze screening (daily, not 15-minute)
Failed trade pattern database (schema)
Brier score / prediction calibration (schema, start recording Day 1)

Phase 4 — NLP + Advanced (Month 2+)

Earnings call NLP (FinBERT via HuggingFace API, defer Fog Index)
SEC CORRESP keyword scanner (material weakness, restatement alerts)
SEC 424B2 structured products parser (NLP for prospectus text)
Altman Z-Score calculator (from XBRL fundamentals)
ETF creation/redemption flow tracker
USPTO patent tracker
Job posting volume tracker
FDIC call reports (simplified — 3 metrics only)
Factor attribution (Fama-French decomposition)

API ENDPOINTS REFERENCE

Source	URL	Auth	Rate Limit
Polygon.io	`https://api.polygon.io/`	API key	5 calls/min (free), unlimited ($29/mo)
FRED	`https://api.stlouisfed.org/fred/`	API key (free)	120/min
SEC EDGAR	`https://efts.sec.gov/LATEST/`	User-Agent header	10 req/sec
SEC XBRL	`https://data.sec.gov/api/xbrl/`	User-Agent header	10 req/sec
Kalshi	`https://trading-api.kalshi.com/trade-api/v2/`	RSA-PSS signing	Varies
Polymarket CLOB	`https://clob.polymarket.com/markets`	None (public)	Varies
Polymarket Gamma	`https://gamma-api.polymarket.com/markets`	None (public)	Varies
FINRA ATS	`https://ats-transparency.finra.org/`	None	Weekly file
FINRA SI	`https://www.finra.org/finra-data/`	None	Bi-monthly file
OCC Borrow Rates	`https://www.theocc.com/`	None	Daily file
CBOE Market Share	`https://www.cboe.com/us/equities/market_statistics/`	None	Daily CSV
FMP	`https://financialmodelingprep.com/api/v3/`	API key (free tier)	250/day
Kalshi settlements	`https://trading-api.kalshi.com/trade-api/v2/markets?status=settled`	RSA-PSS	Varies
Polymarket resolved	`https://gamma-api.polymarket.com/markets?closed=true`	None	Varies
ICI Fund Flows	`https://www.ici.org/research/stats/flows`	None	Weekly
CFTC COT	`https://www.cftc.gov/MarketReports/`	None	Weekly
USPTO PatentsView	`https://api.patentsview.org/patents/query`	None	Free
Reddit/PRAW	OAuth2	App credentials	60/min
Capitol Trades	`https://www.capitoltrades.com/`	None	Scraping
SEC FTD	`https://www.sec.gov/data/foiadocsfailsdatahtm`	None	Twice monthly
TSMC Revenue	`https://investor.tsmc.com/english/monthly-revenue`	None	Monthly

COLLECTION CADENCE SUMMARY

Frequency	Data Sources
Every 60 seconds (market hours)	8-K filings (EDGAR real-time feed)
Every 5 minutes	Kalshi contracts expiring <24h, X/Twitter Tier 1 accounts
Every 15 minutes	X/Twitter Tier 2 accounts, Stocktwits per-ticker, options chains (near-expiry)
Every 30 minutes	Kalshi/Polymarket all contracts, Reddit WSB + r/options, retail flow proxies
Every 2 hours	Reddit r/stocks + r/investing, X/Twitter Tier 3
Every 60 minutes	X/Twitter Tier 4
Daily 5:00 PM ET	Daily prices (Polygon), sector/factor ETF returns
Daily 6:00 AM ET	Daily state table generation
Daily 6:30 AM ET	Data quality monitoring job
Daily 7:00 AM ET	FRED macro data, earnings calendar, analyst revisions, corporate events
Daily 6:00 PM ET	13D activist scan, REG SHO threshold, congressional trades
Daily 7:00 PM ET	Form 4 and Form 144 daily batch (after EDGAR prime window 4-8 PM)
Daily (market hours)	Options chains (all tracked names), borrow rates, CORRESP keyword scan
Weekly Monday	FINRA ATS dark pool volume, CBOE venue breakdown, Google Trends
Weekly Sunday	IPO lockup calendar
Weekly Friday	CFTC COT data
Weekly	USPTO patent data, job posting volume (Month 2+)
Bi-monthly	FINRA short interest, SEC FTD data
Monthly	N-PORT fund holdings, ICI mutual fund flows, FINRA margin debt, TSMC revenue, Korean semi exports
Quarterly	13F institutional holdings (within 24h of deadline: Feb 14, May 15, Aug 14, Nov 14)
Annually	DEF 14A proxy statements (proxy season Mar-Jun), FDIC call reports

STORAGE ESTIMATES (REVISED)

The original spec estimate of ~575 KB/day is too low. Realistic estimate accounts for raw filing text, options chain data, and index overhead. [PANEL: Opus — storage correction]

Revised per-day estimates (full universe ~700 tickers):

Data Source	Size/Day
daily_prices (700 tickers)	~140 KB
options_chains (active names, daily snapshot)	~500 KB
insider_trades	~50 KB
macro_data + derived	~5 KB
prediction_market_stocks (Kalshi + Polymarket)	~100 KB
reddit_sentiment	~25 KB
x_sentiment	~20 KB
8-K filing alerts + text	~50 KB
earnings_call_nlp (seasonal, ~200 companies/quarter)	~25 KB
corporate_events + state table	~5 KB
borrow_rates + regsho	~10 KB
collection_log + data_quality_log	~10 KB

Revised daily total: ~3-4 MB/day (Opus's estimate; still trivially small)

Per-quarter estimates:

institutional_holdings (13F): ~7.5 MB
nport_holdings: ~10 MB
fundamentals: ~350 KB
smart_money_signals: ~210 KB

Annual totals:

Daily data: ~1.2 GB
Quarterly data: ~70 MB
Raw SEC XML (stored locally): ~500 MB
Earnings transcripts (90-day retention): ~500 MB
Annual total: ~2.3 GB

Total disk budget: ~5 GB for 2 years (comfortable on 45GB disk with 15GB free)

COST SUMMARY

Period	Cost	What It Gets You
Week 1+	$29/month	Polygon.io — primary price data + options chains (move to Day 1)
Optional Month 2+	$19/month	FMP API (cleaner XBRL parsing vs raw SEC)
Year 1 total	~$290-$480	Polygon ($348) + optional FMP ($228)

Dependencies (Python data collectors):

requests, beautifulsoup4, lxml (HTTP/parsing)
polygon-api-client (official Polygon Python SDK)
yfinance (fallback only)
pandas, numpy (data manipulation)
nltk, textstat (NLP readability — Month 2)
praw (Reddit API)
pytrends (Google Trends)
transformers (HuggingFace FinBERT — Month 2)
ratelimit (rate limiting decorator)
sqlite3 (built into Python)
Total: ~10 packages

Future cost notes:

Bloomberg, FactSet, Refinitiv, BoardEx — NOT needed. EDGAR + FRED covers 80% of what you need.
X API paid tier ($5,000/month) — NOT needed. Grok search is free and sufficient.
SimilarWeb ($100/month) — NOT needed for v1.

CROSS-DESK SYNERGIES

Shared Infrastructure (Build Once, Use Everywhere)

Shared Component	Reusable For
FRED macro data suite (23+ series)	Futures, Forex, Options, Crypto
Market regime classifier (3 states + correlation)	All desks
Kalshi/Polymarket scraper infrastructure	Options, Crypto, Sports, Futures
Settlement rules comparison engine	All prediction market desks
Historical settlements database	All desks
Brier score + calibration framework	All desks
Daily state table (OpEx, FOMC, economic releases)	Options, Futures, Forex
SEC EDGAR base infrastructure (rate limiter, XML parser, file router)	Options (424B2, Form 4)
Credit spread data + credit-equity divergence	Options, Futures
Earnings calendar + surprise database	Options (implied move analysis)
Reddit/social sentiment infrastructure	Crypto, Sports
EDGAR rate limiting + caching strategy	All desks

Shared with Options Desk

Form 4 insider trades (options desk uses for flow signal)
SEC 424B2 structured products (barrier levels and notional)
CFTC COT positioning
Kalshi/Polymarket scrapers
Short interest + borrow rates (put/call parity breakdowns on hard-to-borrow)
13F institutional holdings (concentration risk → gamma risk on exits)
X/Twitter Tier 1 (SpotGamma, Unusual Whales — same source, different filters)

Shared with Futures Desk

FRED macro data suite
Credit-equity divergence signals
Commodity-sector margin mapping (futures desk handles commodity side, stocks desk consumes sector signals)
CFTC COT data

Shared with Forex Desk

Currency mismatch risk (Forex desk provides USD forecasts; apply to companies with >30% foreign revenue)
FRED DXY and yield data

Shared with Crypto Desk

Polymarket liquidity monitoring (USDC depeg risk affects all Polymarket contracts)
Stablecoin flows as cross-asset risk appetite proxy

REALITY CHECK AND KEY RISKS

Is this realistic? Yes, with four conditions:

Polygon.io ($29/month) starts Day 1, not Month 2. yfinance will break before you finish Week 1.
EDGAR collection uses the event-driven feed pattern (watermark-based), not per-ticker polling.
Data quality monitoring is built in Phase 1, not bolted on later.
The alternative data section (Section I) stays in Month 2+ — resist the temptation to build GitHub commit tracking before the core pipeline is generating predictions.

Biggest operational risk: Data source fragility. The spec relies on several sources that break regularly: yfinance (breaks 2-4x/year), PRAW/Reddit API (increasingly restricted since 2023), Polymarket API (undocumented endpoints), web scrapers for congressional trades and IPO lockups. Budget ~20% of ongoing engineering time for scraper maintenance. EDGAR and FRED are rock-solid (government APIs with stability commitments) — everything else is built on sand to varying degrees.

First to break in production: yfinance (if used as primary — don't). Second: Reddit PRAW API. Third: Polymarket CLOB endpoint.

Top single change: Build the settlement rules comparison engine (Section 15B) before executing any cross-platform arbitrage. Kalshi and Polymarket settlement rules differ subtly and those differences flip binary outcomes. Trading on an assumed arbitrage without verifying rules is not arbitrage — it's an uncompensated bet on settlement rule differences.

What NOT to build (cost > value):

eToro CopyTrader (non-US retail base, undocumented API, minimal US equity signal)
Japan core machinery orders (only relevant for ~10 industrial stocks; belongs on Futures desk)
SaaS pricing page monitoring (scraper maintenance nightmare, 1-2x/year signal)
GitHub commit velocity (weak signal for stock prices, high maintenance, defer to v3 at earliest)
App store rankings (automate for ~15 consumer stocks only; use manual Sensor Tower check)
Trademark filings in new Nice Classification codes (speculative filings are constant noise)
Form ADV amendments for broad AUM drops (massive data volume, tiny signal; only monitor ADV for your 50 smart money cohort funds)
Full earnings call NLP suite in Week 1 (start with FinBERT only, add complexity in Month 2)

End of Stocks Desk Merged Final Spec — March 15, 2026

TODO: Upgrade Kalshi REST to WebSocket Feed

Status: NOT BUILT — add when this desk goes live for execution

Current state: All Kalshi data (prices, order books, trades) is fetched via REST API polling on cron schedules. This is fine for edge detection and monitoring, but NOT sufficient for live trade execution.

Why WebSocket matters:

REST polling introduces lag — by the time you see a price, it may already be gone
On tight edges (2-5 cents), even 1-2 seconds of lag can eat the entire edge
WebSocket gives real-time order book deltas and trade notifications
Essential for hitting specific fill prices and monitoring execution quality

Note: This desk primarily trades on other exchanges, but Kalshi has event markets (rate decisions, index levels, crypto milestones) that overlap with this desk. WebSocket upgrade applies when trading any Kalshi markets.

Added 2026-03-29 — upgrade REST to WS when desk moves to live execution

Source: ~/edgeclaw/results/spec-panel/stocks-desk/spec-final.md