Stocks Desk — Data Collection Spec (Mar 14, 2026)

Reviewed by 5-model panel: Flash (Financial), Maverick (Sentiment), Gemini Pro (Post-Mortem), Haiku (Practical), Opus (Contrarian)

What This Document Is

This is the complete data collection specification for the Stocks desk in the research pipeline. It covers equity trades on Robinhood AND event contracts traded on Kalshi/Polymarket. An AI builder should be able to read this and know exactly what data to collect, from where, how often, in what format, and why it matters for finding mispriced stocks or prediction market contracts.

The Business Model (Two Venues)

Unlike sports desks that only trade prediction markets, the Stocks desk trades on three platforms across two venue types (one stock brokerage and two prediction markets):

Robinhood — Actual stock trades (buy/sell shares on individual equities). We buy and sell real stocks.
Kalshi — Binary prediction markets on stock events: price thresholds ("TSLA above $300 by Friday"), earnings outcomes ("AAPL beats EPS estimate"), index levels ("S&P 500 above 5500 by month end").
Polymarket — Similar binary contracts on crypto rails (USDC on Polygon). Includes M&A, regulatory outcomes, CEO changes, index milestones.

The cross-venue edge: Fundamental analysis tells us a stock is mispriced on Robinhood. The same analysis also tells us whether a Kalshi/Polymarket contract is mispriced. Example: if our fundamental model says NVDA fair value is $950 and Kalshi prices "NVDA above $900 by April 30" at 55 cents, our model implies ~75% probability — that contract is cheap. We trade BOTH venues simultaneously.

How Stocks Are Different From Sports/Options

An AI builder must understand these structural differences:

No Sharp External Benchmark — In sports, sharp bookmakers set fair lines. For stock events, there IS no sharp external reference. We must BUILD the fair value estimate from fundamentals, flow data, and volatility modeling. This makes the Stocks desk data collection broader and more complex than sports desks.
Continuous Price Discovery — Stocks trade every second, 6.5 hours per day. Data changes constantly. Sports bets lock in at game time. This means timing of data collection matters more.
Multi-Factor Valuation — A stock's fair value depends on earnings, balance sheet, macro regime, sector rotation, institutional positioning, insider behavior, credit markets, and sentiment simultaneously. No single data source is sufficient.
Prediction Market Carry Cost — Buying a Kalshi contract at 70 cents locks up 70 cents until settlement. At 5% risk-free rate, a 70-cent 30-day contract costs ~0.29 cents in opportunity cost. Prediction market prices should trade BELOW option-implied probabilities by this carry cost. If they trade ABOVE, the prediction market is overpriced.
Regime Dependence — Stock strategies that work in bull markets fail in bear markets. Every data source must be tagged with which market regime it applies to. Sports bets work the same regardless of market regime.
SEC Filings Are the Goldmine — The SEC requires public companies to disclose everything. This free government data is richer than any paid service for stocks. Sports have no equivalent.
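The carry-cost arithmetic in the list above can be checked with a few lines. This is a minimal sketch that assumes simple (non-compounded) interest on a 365-day year; the function name carry_cost_cents is illustrative and not part of the pipeline.

```python
def carry_cost_cents(price_cents: float, annual_rate: float, days_to_settlement: int) -> float:
    """Opportunity cost, in cents, of locking up price_cents until settlement.

    Assumption: simple interest on a 365-day year.
    """
    return price_cents * annual_rate * days_to_settlement / 365.0


# The example from the text: 70-cent contract, 5% risk-free rate, 30 days.
cost = carry_cost_cents(70, 0.05, 30)
print(f"carry cost: {cost:.2f} cents")  # prints "carry cost: 0.29 cents"
```

Under this convention, a contract's fair price is roughly the option-implied probability (in cents) minus this carry cost.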

SECTION A: PRICE AND MARKET DATA (Foundation)

1. Daily OHLCV for Full Universe

What: Daily Open, High, Low, Close (adjusted for splits/dividends), and Volume for every stock we track.

Universe:

S&P 500 constituents (503 tickers)
Nasdaq 100 constituents
Russell 2000 top 200 by volume (small-cap regime detection)
Any stock we have EVER made a prediction on (even if delisted)
Major ETFs: SPY, QQQ, IWM, DIA, XLF, XLE, XLK, XLB, XLI, XLP, XLU, XLV, XLY, XLC, XLRE, ARKK
Factor ETFs: MTUM (Momentum), QUAL (Quality), VLUE (Value), SIZE (Size), USMV (Min Vol)

What to pull per ticker:

OHLCV + adjusted close
52-week high/low
Average daily volume (20-day)
Relative volume (today's volume / 20-day average)
Market cap (shares outstanding * price)
Free float (shares outstanding minus insider/institutional locked)

Calculated technicals (on ingestion):

RSI (14-day)
MACD (12/26/9)
Bollinger Bands (20-day, 2 std dev) + %B position
ATR (14-day)
Short-term momentum (5-day, 10-day, 20-day returns)
Medium-term momentum (60-day, 120-day returns)
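As an illustration of the "calculated on ingestion" step, here is a minimal RSI (14) sketch in pure Python. The spec does not say which smoothing to use; Wilder's smoothing is assumed here because it is the common default. The function name and the convention of returning 100 when there are no losses are assumptions.

```python
from typing import List, Optional


def rsi_wilder(closes: List[float], period: int = 14) -> List[Optional[float]]:
    """RSI aligned with closes; None until `period` price changes are available."""
    if len(closes) <= period:
        return [None] * len(closes)

    changes = [closes[i] - closes[i - 1] for i in range(1, len(closes))]
    gains = [max(c, 0.0) for c in changes]
    losses = [max(-c, 0.0) for c in changes]

    def to_rsi(avg_gain: float, avg_loss: float) -> float:
        if avg_loss == 0:
            return 100.0  # assumed convention when there are no losses
        return 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)

    # Seed with a simple average over the first `period` changes.
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period
    out: List[Optional[float]] = [None] * period
    out.append(to_rsi(avg_gain, avg_loss))

    # Wilder smoothing for every later bar.
    for i in range(period, len(changes)):
        avg_gain = (avg_gain * (period - 1) + gains[i]) / period
        avg_loss = (avg_loss * (period - 1) + losses[i]) / period
        out.append(to_rsi(avg_gain, avg_loss))
    return out


rising = rsi_wilder([float(x) for x in range(1, 31)])
falling = rsi_wilder([float(x) for x in range(30, 0, -1)])
```

A strictly rising series yields RSI 100 and a strictly falling one yields 0, which is a quick sanity check before running the function on real closes.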

RSI Divergence Detection (4 types — inherited from forex spec): Scan daily RSI (14) against price swings. All 4 types apply to stocks identically to forex:

Regular Bullish: Price makes lower low, RSI makes higher low = bearish trend losing momentum, reversal likely. High-value signal near support zones or after extended selloffs.
Regular Bearish: Price makes higher high, RSI makes lower high = bullish trend losing momentum, reversal likely. High-value signal near resistance or after extended rallies.
Hidden Bullish: Price makes higher low, RSI makes lower low = pullback within uptrend, continuation likely. Use to confirm buy-the-dip during bull regimes.
Hidden Bearish: Price makes lower high, RSI makes higher high = rally within downtrend, continuation lower likely. Use to confirm sell-the-rip during bear regimes.
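The four definitions above reduce to comparing two swing points. A minimal sketch follows, assuming swing detection has already been done upstream and that swing 1 is the earlier point; the function name and the "low"/"high" argument are illustrative. The returned labels match the divergence_type values stored per event.

```python
from typing import Optional


def classify_divergence(kind: str, price_1: float, price_2: float,
                        rsi_1: float, rsi_2: float) -> Optional[str]:
    """Classify a divergence from two swing points (swing 1 is earlier).

    kind="low" compares two swing lows, kind="high" compares two swing highs.
    Returns None when price and RSI agree (no divergence).
    """
    if kind == "low":
        if price_2 < price_1 and rsi_2 > rsi_1:
            return "regular_bullish"   # lower low in price, higher low in RSI
        if price_2 > price_1 and rsi_2 < rsi_1:
            return "hidden_bullish"    # higher low in price, lower low in RSI
    elif kind == "high":
        if price_2 > price_1 and rsi_2 < rsi_1:
            return "regular_bearish"   # higher high in price, lower high in RSI
        if price_2 < price_1 and rsi_2 > rsi_1:
            return "hidden_bearish"    # lower high in price, higher high in RSI
    else:
        raise ValueError("kind must be 'low' or 'high'")
    return None


example = classify_divergence("low", price_1=100.0, price_2=95.0, rsi_1=28.0, rsi_2=33.0)
```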

What to store per divergence event:

ticker: stock symbol
timeframe: daily (primary), weekly (confirmation)
divergence_type: "regular_bullish", "regular_bearish", "hidden_bullish", "hidden_bearish"
timestamp: detection candle
price_swing_1, price_swing_2: the two price swing points (level + timestamp)
rsi_swing_1, rsi_swing_2: RSI values at each swing
strength: slope difference between price and RSI swing lines (bigger = stronger signal)
market_regime: current regime from Section K when divergence forms (bull/bear/correction/crisis/range/rotation)
at_support_resistance: is price near a key level?
sector_etf_divergence: does the sector ETF show the same divergence type? (confirms sector-wide move vs single-stock noise)
earnings_within_5d: boolean — divergence near earnings is unreliable (event risk overrides technical signals)

Outcome tracking:

outcome_5d, outcome_10d, outcome_20d: price change after detection
reversal_occurred: for regular divergence — did price reverse? (boolean + magnitude)
continuation_occurred: for hidden divergence — did trend continue? (boolean + magnitude)

Storage: SQLite table rsi_divergences (shared schema with forex divergence_events table). Collection frequency: Calculated daily at 5:00 PM ET on ingestion. Weekly divergences recalculated on Friday close.

Source: Yahoo Finance via yfinance (free). Batch download handles 100+ tickers in one call. Rate limit ~2000/hr with rotating user agents. Add 0.5-1 second delay between individual calls. Backup: Polygon.io free tier (5 calls/min, delayed), Alpha Vantage (25 calls/day free), Tiingo (1000 symbols free). History: 10 years for S&P 500. Full available history for any traded stock. Collection frequency: Daily at 5:00 PM ET (after market close + after-hours adjustments settle). Storage: SQLite table daily_prices with columns (ticker, trade_date, open, high, low, close, adj_close, volume, relative_volume, rsi_14, macd, macd_signal, bb_upper, bb_lower, bb_pct, atr_14).
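A possible shape for the daily_prices table, using the column list above. The column types, the (ticker, trade_date) primary key, and the upsert_bar helper are assumptions; the sample bar is made-up data for a placeholder ticker.

```python
import sqlite3

DAILY_PRICES_DDL = """
CREATE TABLE IF NOT EXISTS daily_prices (
    ticker          TEXT NOT NULL,
    trade_date      TEXT NOT NULL,   -- ISO date, e.g. 2026-03-13
    open REAL, high REAL, low REAL, close REAL, adj_close REAL,
    volume          INTEGER,
    relative_volume REAL,
    rsi_14 REAL, macd REAL, macd_signal REAL,
    bb_upper REAL, bb_lower REAL, bb_pct REAL, atr_14 REAL,
    PRIMARY KEY (ticker, trade_date)
)
"""


def upsert_bar(conn: sqlite3.Connection, bar: dict) -> None:
    """Insert one bar, replacing any existing row for the same ticker and date.

    Column names come from trusted code, so they are formatted into the SQL;
    values are always bound as parameters.
    """
    cols = ", ".join(bar)
    marks = ", ".join("?" for _ in bar)
    conn.execute(
        f"INSERT OR REPLACE INTO daily_prices ({cols}) VALUES ({marks})",
        tuple(bar.values()),
    )


conn = sqlite3.connect(":memory:")
conn.execute(DAILY_PRICES_DDL)

bar = {"ticker": "XYZ", "trade_date": "2026-03-13", "open": 100.0, "high": 102.0,
       "low": 99.0, "close": 101.0, "adj_close": 101.0, "volume": 1_000_000}
upsert_bar(conn, bar)
# Re-ingesting the same day (e.g. after an adjustment) overwrites the row.
upsert_bar(conn, {**bar, "close": 101.5, "adj_close": 101.5})
row_count = conn.execute("SELECT COUNT(*) FROM daily_prices").fetchone()[0]
```

Note that INSERT OR REPLACE rewrites the whole row, so any column left out of the new bar is reset to NULL. If technicals are written in a separate pass, an UPDATE may be the safer choice.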

Why it matters: Foundation for ALL other signals. Every factor maps to "is this stock mispriced?" Relative volume spikes flag unusual activity before the cause is known. For prediction markets: current price trajectory directly determines probability of hitting Kalshi threshold prices.

2. Sector and Factor ETF Returns

What: Daily returns for all 11 GICS sectors and standard factor ETFs.

Sector ETFs (11): XLK, XLF, XLE, XLV, XLI, XLP, XLY, XLC, XLRE, XLB, XLU
Factor ETFs: MTUM, QUAL, VLUE, SIZE, USMV, IWD (Value), IWF (Growth)

Calculated metrics:

Value vs Growth spread: IWD return minus IWF return (daily)
Small vs Large spread: IWM return minus SPY return (daily)
Sector rotation velocity: rolling 20-day correlation between each sector pair
Factor crowding: when one factor dramatically outperforms, it is crowded
Cross-sectional dispersion: std dev of S&P 500 stock returns (high = stock-picking matters, low = macro dominates)
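The spread and dispersion metrics above are simple arithmetic on daily returns. A minimal sketch, with hypothetical function names and made-up return values; population standard deviation is an assumed choice for dispersion.

```python
from statistics import pstdev
from typing import List


def pct_return(prev_close: float, close: float) -> float:
    """Simple one-day return."""
    return close / prev_close - 1.0


def value_growth_spread(iwd_ret: float, iwf_ret: float) -> float:
    """IWD return minus IWF return."""
    return iwd_ret - iwf_ret


def small_large_spread(iwm_ret: float, spy_ret: float) -> float:
    """IWM return minus SPY return."""
    return iwm_ret - spy_ret


def cross_sectional_dispersion(stock_returns: List[float]) -> float:
    """Std dev of same-day returns across stocks (population std dev assumed)."""
    return pstdev(stock_returns)


# Made-up returns for one day.
vg = value_growth_spread(iwd_ret=0.004, iwf_ret=0.010)   # growth led value
disp = cross_sectional_dispersion([0.01, -0.01])
```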

Source: yfinance (free). Also Fama-French factor data from Kenneth French Data Library (free CSVs, 1963-present). AQR factor datasets (free registration). Collection frequency: Daily at close. Monthly download of updated Fama-French/AQR data. Storage: SQLite table factor_returns + Parquet files for academic factor data.

Why it matters: Factor attribution on every trade reveals whether alpha came from stock selection or factor exposure. Regime-conditional factor performance guides position sizing.

SECTION B: SEC EDGAR PIPELINE (The Goldmine -- All Free)

SEC EDGAR is the single richest free data source for stock analysis. Base URL: https://efts.sec.gov/LATEST/. User-Agent REQUIRED: "CompanyName AdminEmail". Rate limit: 10 requests/second (use 5/sec conservatively).
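A minimal sketch of an EDGAR client that honors the two requirements above: a declared User-Agent and a conservative 5 requests/second ceiling. The Throttle class, the helper names, and the placeholder User-Agent string are assumptions; only the standard library is used.

```python
import time
import urllib.request

USER_AGENT = "ExampleCo admin@example.com"  # placeholder: use the real company name and email
MAX_REQUESTS_PER_SEC = 5                    # conservative vs. the 10/sec limit


class Throttle:
    """Enforces a minimum gap between consecutive requests."""

    def __init__(self, max_per_sec: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_sec
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def build_request(url: str) -> urllib.request.Request:
    """Attach the required User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url: str, throttle: Throttle) -> bytes:
    """Throttled GET; performs network I/O, so it is not called in this sketch."""
    throttle.wait()
    with urllib.request.urlopen(build_request(url), timeout=30) as resp:
        return resp.read()


req = build_request("https://efts.sec.gov/LATEST/")
```

Sharing a single Throttle instance across all EDGAR collectors keeps the combined request rate under the ceiling.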

3. Form 4 -- Insider Trades

What: Corporate insiders (CEO, CFO, directors, 10%+ owners) must report stock purchases/sales within 2 business days of the trade.

What to extract per filing:

Ticker, CIK
Insider name, relationship (CEO/CFO/Director/VP/10% Owner)
Transaction type: P (purchase), S (sale), A (award), M (exercise)
Shares traded, price, total value
Shares owned after transaction
Whether it is a pre-planned 10b5-1 trade (noise) or open-market (signal)
Reporting person CIK (for cross-company network mapping)

Source: SEC EDGAR. Filing search: https://efts.sec.gov/LATEST/search-index?forms=4&dateRange=custom&startdt={date}&enddt={date}. Parse XML from each filing. Collection frequency: Every 2 hours during market hours (9 AM - 6 PM ET). Most filings hit EDGAR between 4-6 PM ET. Storage: SQLite tables insider_trades + insider_network (person-CIK to company-CIK mapping for cross-company cluster detection).

Key signals:

Cluster buying: 3+ insiders buying within 30 days predicts 6-month outperformance by 7-10% (academic evidence)
CEO/CFO buys stronger than Director buys
10b5-1 pre-planned sales are NOISE -- filter them out
Cross-company network: when an insider who sits on boards of Company A and Company B buys stock in Company A, check Company B too
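The cluster-buying rule (3+ insiders buying within 30 days, with 10b5-1 trades filtered out) can be sketched as below. The dict keys, the function name, and the inclusive 30-day window are assumptions; the trades are made-up data for a single ticker.

```python
from datetime import date, timedelta
from typing import Dict, List


def has_cluster_buy(trades: List[Dict], window_days: int = 30, min_insiders: int = 3) -> bool:
    """True if at least min_insiders distinct insiders made open-market
    purchases (code "P", not 10b5-1) within any window_days-long window."""
    buys = sorted(
        (t for t in trades if t["code"] == "P" and not t["is_10b5_1"]),
        key=lambda t: t["trade_date"],
    )
    for i, first in enumerate(buys):
        window_end = first["trade_date"] + timedelta(days=window_days)
        insiders = {t["insider"] for t in buys[i:] if t["trade_date"] <= window_end}
        if len(insiders) >= min_insiders:
            return True
    return False


trades = [
    {"insider": "CEO A", "code": "P", "trade_date": date(2026, 1, 5), "is_10b5_1": False},
    {"insider": "CFO B", "code": "P", "trade_date": date(2026, 1, 15), "is_10b5_1": False},
    {"insider": "Director C", "code": "P", "trade_date": date(2026, 1, 30), "is_10b5_1": False},
    {"insider": "VP D", "code": "S", "trade_date": date(2026, 1, 20), "is_10b5_1": True},
]
cluster = has_cluster_buy(trades)
```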

Why it matters: For prediction markets, insider cluster buying before a Kalshi "stock above $X" expiry means the contract is underpriced. Cross-desk: Options desk shares Form 4 data for options flow signals.

4. Form 144 -- Pre-Sale Notice of Restricted Stock (Contrarian's Top Find)

What: Insiders must file Form 144 BEFORE selling restricted shares. This is filed days to weeks BEFORE the actual Form 4 (which reports the completed trade). Form 144 is the LEADING indicator of Form 4.

Source: SEC EDGAR, filing type "144". URL: https://efts.sec.gov/LATEST/search-index?forms=144&dateRange=custom&startdt={date}&enddt={date}. Collection frequency: Daily scan. Storage: SQLite table form144_notices.

Key signal: Form 144 filing = selling is COMING but hasn't happened yet. You see it days/weeks before the market does. Compare volume and timing to subsequent Form 4 to validate the lag.

Why it matters: Almost nobody monitors Form 144 because the data is less structured than Form 4. An LLM can parse the text cheaply.

5. 13F Institutional Holdings (Quarterly)

What: Every institutional investment manager with >$100M AUM must file 13F-HR quarterly, disclosing all equity holdings.

What to extract:

Filer name, CIK
Filing period (quarter-end date)
Per holding: CUSIP, company name, shares, market value, investment discretion, put/call indicator
Calculate: new positions, liquidated positions, increased/decreased vs prior quarter
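The quarter-over-quarter comparison in the last bullet can be sketched as a diff of two CUSIP-to-shares maps for one filer. The function name and the placeholder CUSIP strings are illustrative.

```python
from typing import Dict, List


def diff_holdings(prior: Dict[str, int], current: Dict[str, int]) -> Dict[str, List[str]]:
    """Compare one filer's holdings (CUSIP -> shares) across two quarters."""
    changes: Dict[str, List[str]] = {
        "new": [], "liquidated": [], "increased": [], "decreased": [],
    }
    for cusip in sorted(set(prior) | set(current)):
        before = prior.get(cusip, 0)
        after = current.get(cusip, 0)
        if before == 0 and after > 0:
            changes["new"].append(cusip)
        elif before > 0 and after == 0:
            changes["liquidated"].append(cusip)
        elif after > before:
            changes["increased"].append(cusip)
        elif after < before:
            changes["decreased"].append(cusip)
    return changes


# Placeholder CUSIPs and made-up share counts.
q3 = {"CUSIP_A": 1000, "CUSIP_B": 500, "CUSIP_C": 200}
q4 = {"CUSIP_A": 1500, "CUSIP_C": 100, "CUSIP_D": 300}
delta = diff_holdings(q3, q4)
```

Share counts should be split-adjusted before diffing; otherwise a stock split shows up as a false "increased".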

Smart Money Cohort (track these specifically -- ~50 funds):

University endowments: Yale, Harvard, Stanford, Princeton, MIT
Sovereign wealth: GIC, ADIA, Norges Bank
High-conviction value: Berkshire Hathaway, Baupost, Third Point, Pershing Square, Greenlight, Appaloosa, ValueAct
Tiger Cubs: Lone Pine, Viking, Coatue, Tiger Global, Maverick

Source: SEC EDGAR. Filing index: https://efts.sec.gov/LATEST/search-index?forms=13F-HR&dateRange=custom. Parse infotable.xml for positions. Collection frequency: Quarterly. 13F due 45 days after quarter-end (Feb 14, May 15, Aug 14, Nov 14). Bulk parse within 24 hours of deadline. Also monitor daily for early filers (early filing = no positions they want to hide -- this IS signal). Storage: SQLite tables institutional_holdings + smart_money_signals (derived).

Key signals:

3+ smart money funds initiating the same stock in the same quarter = deep fundamental research already done. Follow with 3-12 month horizon.
Ownership overlap: 5+ overlapping funds = "crowded" stock, vulnerable to contagion selling during redemptions.
For prediction markets: smart money piling in = Kalshi "above $X" contracts are underpriced on 3-month horizon.

6. N-PORT Monthly Fund Holdings (3x Frequency of 13F)

What: Mutual fund monthly portfolio holdings. N-PORT filings are MONTHLY with 60-day public disclosure lag, vs 13F which is QUARTERLY with 45-day lag. N-PORT includes data 13F does NOT: securities lending income, total return swaps, CDS held, liquidity classification.

Source: SEC EDGAR, filing type "NPORT-P". URL: https://efts.sec.gov/LATEST/search-index?forms=NPORT-P. Collection frequency: Monthly. Parse within 24 hours of new filing. Storage: SQLite table nport_holdings.

Key signals:

Fund liquidating a position across 2+ consecutive months before it shows in quarterly 13F = you see it 30-60 days early
Fund increasing securities lending of a specific stock = they expect borrowing pressure (bearish)
Liquidity classification downgrade = fund may be forced to sell

Why it matters: Almost nobody parses N-PORT because the XML is complex. Massive information advantage for those who do.

7. 13D/13G Activist + Large Position Disclosure

What: 13D = filer intends to influence management (activist). 13G = passive large position (>5% ownership).

What to extract:

Filer name, target company (CIK, ticker)
Shares acquired, % of outstanding
Purpose of transaction (KEY FIELD -- 13D requires stating intent)
Whether this is a 13G-to-13D CONVERSION (the overlooked signal)

Activist Quality Tier List:

Tier A (>60% campaign success): Elliott, Starboard, ValueAct, Icahn, Third Point, Jana Partners, Trian
Tier B (40-60%): Engaged Capital, Land & Buildings, Sachem Head
Tier C (unknown/new): everyone else

Source: SEC EDGAR. https://efts.sec.gov/LATEST/search-index?forms=SC+13D,SC+13D/A,SC+13G. Collection frequency: Daily scan at 6 PM ET. Storage: SQLite table activist_filings.

Key signals:

Tier A activist 13D filing = go long immediately. Historical average 5-15% pop on filing + further upside.
13G-to-13D conversion = imminent activist campaign. Stock typically pops 8-15% within 30 days of conversion. Track all 13G filers by CIK; when same CIK files a 13D for same company, flag immediately.
For prediction markets: 13D filing on a stock with active Kalshi contracts = those contracts reprice within hours.
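The 13G-to-13D conversion flag described above amounts to remembering every (filer CIK, target CIK) pair that has filed a 13G and flagging the first 13D from the same pair. A minimal sketch; the tuple layout, the function name, and the placeholder CIKs are assumptions, and amendments ("/A") are treated like their base form.

```python
from typing import Iterable, List, Tuple

Filing = Tuple[str, str, str, str]  # (filed_date ISO, filer_cik, target_cik, form)


def find_conversions(filings: Iterable[Filing]) -> List[Tuple[str, str, str]]:
    """Return (filed_date, filer_cik, target_cik) for each 13D that follows
    an earlier 13G from the same filer on the same target."""
    seen_13g = set()
    conversions = []
    for filed, filer, target, form in sorted(filings):  # ISO dates sort chronologically
        key = (filer, target)
        base_form = form.replace("/A", "")
        if base_form == "SC 13G":
            seen_13g.add(key)
        elif base_form == "SC 13D" and key in seen_13g:
            conversions.append((filed, filer, target))
            seen_13g.discard(key)  # flag each conversion once
    return conversions


filings = [
    ("2025-06-01", "F1", "T1", "SC 13G"),
    ("2025-11-10", "F1", "T1", "SC 13D"),    # conversion
    ("2025-07-01", "F2", "T2", "SC 13D"),    # activist from day one, not a conversion
    ("2025-08-01", "F1", "T2", "SC 13G/A"),
]
flags = find_conversions(filings)
```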

8. 10-K/10-Q Financials via XBRL

What: Quarterly and annual financial statements in machine-readable format.

What to extract:

Revenue (total + by segment + by geography)
Net income, EPS (basic and diluted)
Cash from operations (CFO), CapEx, Free cash flow
Total assets, total liabilities, stockholders' equity
Total accruals (Net Income - CFO) -- the accruals anomaly signal
Shares outstanding (basic and diluted)
R&D expense, stock-based compensation
Share repurchase amount, debt levels
Geographic revenue split (for currency mismatch analysis)
Derivative/hedging disclosures (from footnotes)

Derived signals (calculate on ingestion):

Accrual ratio = (Net Income - CFO) / Total Assets. High accruals predict underperformance over 6-12 months (5-10% annual alpha).
Asset growth = YoY change in total assets. Growth >20% YoY underperforms by 5-7% next year.
Share issuance = YoY change in diluted shares. Net issuance >5% of float = dilution drag. Net buyback >3% = management signaling undervaluation.
R&D-adjusted P/E (capitalize R&D instead of expensing)
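The first three derived signals are one-line formulas. The sketch below uses made-up figures and hypothetical function names; the thresholds are the ones stated above, and YoY change in diluted shares is used as a proxy for issuance relative to float.

```python
def accrual_ratio(net_income: float, cfo: float, total_assets: float) -> float:
    """(Net Income - CFO) / Total Assets."""
    return (net_income - cfo) / total_assets


def yoy_change(current: float, prior: float) -> float:
    """Fractional year-over-year change."""
    return current / prior - 1.0


# Made-up figures (in $ millions and millions of shares).
accruals = accrual_ratio(net_income=120.0, cfo=90.0, total_assets=1000.0)
asset_growth = yoy_change(current=1000.0, prior=800.0)
share_change = yoy_change(current=108.0, prior=100.0)

high_asset_growth = asset_growth > 0.20   # >20% YoY asset growth
dilution_drag = share_change > 0.05       # net issuance above 5%
buyback_signal = share_change < -0.03     # net buyback above 3%
```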

Source: SEC EDGAR XBRL API. https://data.sec.gov/api/xbrl/companyfacts/CIK{cik_padded}.json returns ALL XBRL facts ever filed. Collection frequency: Parse within 24 hours of each filing. Cluster around earnings seasons. Storage: SQLite table fundamentals.

Why it matters: These fundamentals determine FAIR VALUE of "stock above $X" prediction market contracts.

9. DEF 14A Proxy Statements (Annual)

What: Executive compensation details, board composition, shareholder proposals.

What to extract:

CEO total compensation (cash + equity breakdown)
YoY change in equity vs cash compensation ratio
Option grant strike prices (management defends these price levels)
Performance-based vesting criteria and target changes

Source: SEC EDGAR. Proxy season is March-May for most Fortune 500. Collection frequency: Daily during proxy season (Mar-Jun), weekly otherwise. Storage: SQLite table exec_compensation.

Key signal: Shift to equity-heavy + raised performance targets = management bullish. Option strike clustering = management defends that price level. For prediction markets: exec options at $300 strike and Kalshi "above $280" = management has financial incentive to keep stock above $300, making the $280 contract safer.

10. SEC Correspondence Letters (CORRESP) -- Contrarian's Biggest Find

What: When the SEC sends a company a letter asking questions about its filings, the correspondence is eventually made public. Companies receiving SEC comment letters underperform by ~8% over the next 12 months (academic research). Almost nobody monitors this systematically.

Source: SEC EDGAR, filing type "CORRESP". URL: https://efts.sec.gov/LATEST/search-index?forms=CORRESP. Collection frequency: Daily scan. Storage: SQLite table sec_correspondence.

Key signal: New SEC correspondence on a stock = regulatory scrutiny = the SEC's accounting reviewers found something questionable. Short signal on a 3-12 month horizon.

11. Form 424B2 Structured Products (Shared with Options Desk)

What: Banks file 424B2 supplements when issuing structured products (autocallables, barrier notes) tied to specific stocks. Reveals which stocks banks are using as underliers, barrier/knockout levels, and notional amounts.

Source: SEC EDGAR, filing type "424B2". Collection frequency: Daily. Filter for equity-linked products. Start with top 5 issuers (Goldman, JPMorgan, Morgan Stanley, Citi, BofA) and SPY/QQQ underlyings. Storage: SQLite table structured_products (shared with Options desk).

Complexity note: Dense legal/financial text requiring NLP parsing. This is a Month 2-3 build item, not Week 1.

Why it matters: When $500M+ in autocallable notional concentrates at a specific barrier level, that barrier becomes a support/resistance zone due to dealer hedging.

12. Filing Frequency Scanner + Special Filing Types

What: Monitor unusual filing patterns that signal material events.

Filing types to watch:

8-K: 3+ in a week = something material happening
S-3 shelf registration: company preparing to issue shares (dilution)
SC TO-T: tender offer (acquisition premium)
DEFA14A: proxy fight heating up
NT 10-K/NT 10-Q: late filing notification (strong negative signal)
15-12G: deregistering securities = going dark (bearish)
Form ADV amendments: AUM drop >15% = forced selling coming

Source: SEC EDGAR RSS feed + full-text search. Collection frequency: Hourly during market hours. Storage: SQLite table filing_alerts.

SECTION C: PREDICTION MARKET DATA (Where We Trade)

13. Kalshi Stock Event Contracts

What: All stock-related Kalshi binary contracts.

Contract types:

"Will [TICKER] be above/below $X on [DATE]?" -- price thresholds
"Will [TICKER] beat/miss earnings EPS estimate?" -- earnings outcomes
"Will S&P 500 be above/below X on [DATE]?" -- index levels
"Will [TICKER] announce dividend cut/increase?" -- corporate actions
"Will there be a stock market correction (>10% decline)?" -- crash contracts
Fed rate decisions, CPI/inflation outcomes (impacts equities)

What to pull per contract:

Contract ticker/ID, event description
Parsed fields: underlying ticker, threshold price, expiration date, direction
Yes price, No price (cents, 0-100)
Volume, open interest
Order book depth (top 5 bids and asks with quantities)
Settlement rules (exact settlement source and time)
Last trade price and timestamp

Source: Kalshi API. https://api.elections.kalshi.com/trade-api/v2/markets?series_ticker=STOCK. Auth via RSA-PSS signing. Collection frequency: Every 30 minutes during market hours. Every 5 minutes for contracts expiring within 24 hours. Every 15 minutes for order book depth on active positions. Storage: SQLite table prediction_market_stocks.

14. Polymarket Stock Event Contracts

What: Same as Kalshi but on Polymarket. Different liquidity pool (crypto-native traders), so prices often diverge.

Additional Polymarket-specific data:

AMM pool depth (liquidity available)
USDC flow in/out of the contract
Resolution source and rules (often differ subtly from Kalshi)

Source: Polymarket CLOB API: https://clob.polymarket.com/markets. Gamma API: https://gamma-api.polymarket.com/markets?tag=stocks. No auth for public data. Collection frequency: Every 30 minutes during market hours. Storage: SQLite table prediction_market_stocks (shared with Kalshi, platform column distinguishes).

Cross-platform arbitrage: When the same stock event exists on both Kalshi AND Polymarket, compare prices. If Kalshi "TSLA above $1000" is 35 cents but Polymarket prices the same event at 42 cents, one is wrong. Buy cheap, sell expensive -- but CHECK SETTLEMENT RULES, they often differ subtly.
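The arithmetic behind the example: "selling expensive" is approximated here as buying NO on the venue where YES is dearer. This is a minimal sketch that assumes identical settlement rules, a NO price of 100 minus the YES price (no bid/ask spread), and fees passed in as a single number; the function name is illustrative.

```python
def locked_in_edge_cents(yes_price_a: float, yes_price_b: float, fees_cents: float = 0.0) -> float:
    """Edge per contract pair from buying YES on the cheaper venue and NO on
    the dearer one. Exactly one leg pays 100 cents at settlement, provided
    both venues settle the event identically (an assumption to verify)."""
    cheap_yes = min(yes_price_a, yes_price_b)
    dear_yes = max(yes_price_a, yes_price_b)
    cost = cheap_yes + (100.0 - dear_yes)   # YES on the cheap venue + NO on the dear venue
    return 100.0 - cost - fees_cents


# The example from the text: Kalshi YES at 35 cents, Polymarket YES at 42 cents.
edge = locked_in_edge_cents(35, 42)
print(edge)  # prints 7.0
```

The 7 cents is before fees, spread, and carry cost on the 93 cents of locked capital, any of which can erase the edge.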

15. Binary Probability Calculator (The Bridge)

What: Math that converts traditional stock analysis into Kalshi/Polymarket-comparable probabilities. Uses options-implied distributions (N(d2) from Black-Scholes, the risk-neutral probability of finishing above the strike) or fundamental fair value models to estimate the probability of a stock being above/below a specific price by a specific date.

Inputs: IV surface (from Options desk), risk-free rate (SOFR from FRED), time to expiration, current price, fundamental fair value estimate. Output: Probability per threshold per expiration, compared to prediction market prices. Collection frequency: Recalculated every time prediction market prices or model inputs update.
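A minimal sketch of the options-implied leg of the calculator, using the Black-Scholes N(d2) term with a single flat implied volatility instead of a full IV surface (a simplifying assumption). N(d2) is a risk-neutral probability, not a real-world forecast. Function names and the sample inputs are hypothetical.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_above(spot: float, threshold: float, iv: float, years: float,
               rate: float = 0.0, div_yield: float = 0.0) -> float:
    """Risk-neutral P(S_T > threshold) = N(d2) under Black-Scholes."""
    d2 = (math.log(spot / threshold) + (rate - div_yield - 0.5 * iv ** 2) * years) / (
        iv * math.sqrt(years)
    )
    return norm_cdf(d2)


def edge_vs_market(model_prob: float, yes_price_cents: float) -> float:
    """Model probability minus market-implied probability (positive = contract looks cheap)."""
    return model_prob - yes_price_cents / 100.0


# Hypothetical inputs: spot 920, threshold 900, 45% IV, 45 days, 4% rate.
p = prob_above(spot=920.0, threshold=900.0, iv=0.45, years=45 / 365, rate=0.04)
edge = edge_vs_market(p, yes_price_cents=55)
```

A fundamental fair-value model would replace prob_above with its own distribution; edge_vs_market stays the same.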

Why it matters: This is the core of the cross-venue edge. Without this bridge, we cannot compare our analysis to prediction market pricing.

SECTION D: MACRO AND ECONOMIC DATA (FRED API -- Free)

16. FRED Economic Data Suite

What: Federal Reserve Economic Data -- free API for macro data that drives sector rotation, risk appetite, and market regime.

Source: FRED API (free, API key required -- free registration at https://fred.stlouisfed.org/docs/api/api_key.html). Rate limit: 120 requests/minute.

Series to pull:

Series ID	Description	Frequency
DFF	Federal Funds Effective Rate	Daily
DGS10	10-Year Treasury Yield	Daily
DGS2	2-Year Treasury Yield	Daily
DGS30	30-Year Treasury Yield	Daily
T10Y2Y	10Y-2Y Yield Spread (curve)	Daily
T10Y3M	10Y-3M Yield Spread	Daily
BAMLH0A0HYM2	ICE BofA HY OAS Spread	Daily
BAMLC0A0CM	ICE BofA IG OAS Spread	Daily
BAMLC0A4CBBB	BBB Corporate Bond Spread	Daily
DTWEXBGS	Trade-Weighted USD Index (Broad)	Daily
VIXCLS	CBOE VIX Close	Daily
WALCL	Fed Balance Sheet (Total Assets)	Weekly
RRPONTSYD	Reverse Repo Facility Usage	Daily
WTREGEN	Treasury General Account (TGA)	Weekly
NFCI	Chicago Fed Financial Conditions	Weekly
TOTCI	C&I Loans (bank lending pulse)	Weekly
ICSA	Initial Jobless Claims	Weekly
CPIAUCSL	CPI-U (All Items)	Monthly
PCE	Personal Consumption Expenditures	Monthly
PAYEMS	Total Nonfarm Payrolls	Monthly
UMCSENT	U Michigan Consumer Sentiment	Monthly
UNRATE	Unemployment Rate	Monthly
USREC	NBER Recession Indicator	Monthly

Derived signals (calculate on ingestion):

Yield curve inversion flag (T10Y2Y < 0)
HY spread z-score (vs 1-year rolling mean/stdev)
Credit regime: "risk-on" when HY tightening + VIX < 20, "risk-off" when widening + VIX > 25
USD momentum (20-day vs 60-day moving average)
Liquidity composite: RRP drain + TGA drain = liquidity injection (bullish). RRP + TGA building = liquidity drain (bearish).
Bank lending pulse: C&I loans declining 3+ weeks = credit tightening = earnings recession coming
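Three of the derived signals above, sketched in pure Python. Assumptions: "tightening" is measured as a negative change in the HY spread over some lookback, the z-score uses the sample standard deviation, and "neutral" is an added label for days that match neither credit regime rule.

```python
from statistics import mean, stdev
from typing import List


def yield_curve_inverted(t10y2y: float) -> bool:
    """Inversion flag: 10Y-2Y spread below zero."""
    return t10y2y < 0


def zscore(latest: float, history: List[float]) -> float:
    """Z-score of latest vs. a trailing window (e.g. about 1 year of daily HY spreads)."""
    return (latest - mean(history)) / stdev(history)


def credit_regime(hy_spread_change: float, vix: float) -> str:
    """hy_spread_change < 0 means spreads are tightening."""
    if hy_spread_change < 0 and vix < 20:
        return "risk-on"
    if hy_spread_change > 0 and vix > 25:
        return "risk-off"
    return "neutral"  # assumed label; the spec defines only the two regimes


# Made-up values.
inverted = yield_curve_inverted(-0.15)
z = zscore(5.0, [1.0, 2.0, 3.0])
regime = credit_regime(hy_spread_change=-0.10, vix=16.0)
```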

Collection frequency: Daily at 7 AM ET for overnight releases. Weekly series checked on Fridays. Storage: SQLite table macro_data (series_id, date, value).

Why it matters: HY spread is THE leading indicator for sector rotation (tightening = cyclicals, widening = defensives). Credit spreads lead equity by 1-3 days. For prediction markets: macro regime determines whether "S&P above X" contracts are overpriced or underpriced.

17. Market Regime Classification

What: Daily classification of market environment into one of six regimes.

Regime rules (calculate daily):

BULL: SPY above 200 DMA, VIX < 20, HY spread < 400bps, >70% stocks above 50 DMA
BEAR: SPY below 200 DMA, VIX > 25, HY spread > 500bps, <40% stocks above 50 DMA
CORRECTION: SPY 5-10% below recent high, VIX 20-30, mixed breadth
CRISIS: VIX > 35, HY spread > 600bps, VIX term structure in backwardation, MOVE > 120
RANGE-BOUND: SPY between -5% and +5% of 50 DMA, VIX 15-22, low sector dispersion
ROTATION: Sector dispersion > 2 standard deviations above mean
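The rules above can overlap (for example, a CRISIS day usually also satisfies BEAR), so a classifier needs a precedence order. The sketch below assumes one: CRISIS, BEAR, CORRECTION, ROTATION, BULL, RANGE-BOUND, then an added UNCLASSIFIED fallback. "Mixed breadth" is not defined in the rules and is not checked, and "low sector dispersion" is taken to mean a below-average z-score. All field names and sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    spy: float
    spy_50dma: float
    spy_200dma: float
    spy_recent_high: float
    vix: float
    hy_spread_bps: float
    pct_above_50dma: float     # breadth, 0.0 to 1.0
    vix_backwardation: bool    # VIX term structure in backwardation
    move: float
    dispersion_z: float        # sector dispersion z-score vs. its own history


def classify_regime(s: Snapshot) -> str:
    drawdown = 1.0 - s.spy / s.spy_recent_high
    if s.vix > 35 and s.hy_spread_bps > 600 and s.vix_backwardation and s.move > 120:
        return "CRISIS"
    if (s.spy < s.spy_200dma and s.vix > 25 and s.hy_spread_bps > 500
            and s.pct_above_50dma < 0.40):
        return "BEAR"
    if 0.05 <= drawdown <= 0.10 and 20 <= s.vix <= 30:
        return "CORRECTION"    # "mixed breadth" not checked (undefined in the rules)
    if s.dispersion_z > 2:
        return "ROTATION"
    if (s.spy > s.spy_200dma and s.vix < 20 and s.hy_spread_bps < 400
            and s.pct_above_50dma > 0.70):
        return "BULL"
    if abs(s.spy / s.spy_50dma - 1.0) <= 0.05 and 15 <= s.vix <= 22 and s.dispersion_z < 0:
        return "RANGE-BOUND"
    return "UNCLASSIFIED"      # assumed fallback when no rule matches


calm = Snapshot(spy=600, spy_50dma=590, spy_200dma=560, spy_recent_high=602, vix=14,
                hy_spread_bps=300, pct_above_50dma=0.75, vix_backwardation=False,
                move=90, dispersion_z=0.3)
stress = Snapshot(spy=450, spy_50dma=500, spy_200dma=540, spy_recent_high=600, vix=40,
                  hy_spread_bps=700, pct_above_50dma=0.20, vix_backwardation=True,
                  move=150, dispersion_z=1.0)
labels = (classify_regime(calm), classify_regime(stress))
```

The confidence column in market_regime could record how many of a regime's conditions were met, which this sketch does not attempt.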

Source: Calculated from FRED data + price data. Also: VIX3M, VIX6M from CBOE (free), MOVE index from FRED. Collection frequency: Daily at close. Storage: SQLite table market_regime (date, regime_label, confidence, vix, hy_spread, yield_curve, breadth, dispersion).

Why it matters: Every strategy must be tagged with which regimes it works in. Bull-market strategies fail in bear markets.

SECTION E: SHORT INTEREST AND DARK POOL DATA

18. FINRA Short Interest (Twice Monthly)

What: Total shares sold short per ticker.

What to collect:

Short interest (total shares)
Days to cover (SI / average daily volume)
Short interest as % of float
Change vs prior report

Source: FINRA. https://www.finra.org/finra-data/browse-catalog/short-interest/data. Published twice a month (settlement dates around the 15th and the last business day), with ~10 day reporting lag. Collection frequency: Twice monthly, day after FINRA release. Storage: SQLite table short_interest.

Key signals:

Days to cover > 5 = squeeze risk. Days to cover > 10 = extreme squeeze risk.
SI velocity (rate of change) combined with borrow cost = powerful squeeze/crowding signal.
For prediction markets: heavily shorted + catalyst = sharp move potential. "Stock above $X" contracts get underpriced when bears are overcrowded.
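Days to cover and the two squeeze thresholds above, as a minimal sketch. The tier labels and function names are illustrative; the share counts are made up.

```python
def days_to_cover(short_interest_shares: float, avg_daily_volume: float) -> float:
    """Short interest divided by average daily volume."""
    return short_interest_shares / avg_daily_volume


def squeeze_risk(dtc: float) -> str:
    """Map days-to-cover onto the thresholds in the spec (labels are assumed)."""
    if dtc > 10:
        return "extreme"
    if dtc > 5:
        return "elevated"
    return "normal"


dtc = days_to_cover(short_interest_shares=12_000_000, avg_daily_volume=2_000_000)
risk = squeeze_risk(dtc)
```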

19. FINRA ATS (Dark Pool) Weekly Volume

What: Weekly share volume traded on each Alternative Trading System (dark pool) per ticker.

Source: FINRA ATS Transparency Data. https://ats-transparency.finra.org/otc/ats-nms-weekly-data. Published weekly, covers prior 4-week rolling period. Collection frequency: Weekly download every Monday morning. Storage: SQLite table dark_pool_volume (ticker, week_ending, ats_name, share_volume, trade_count, pct_of_total_volume).

Key signal: Dark pool activity ratio spike = institutional accumulation/distribution hidden from lit markets. Combined with insider trades (Form 4), confirms institutional accumulation thesis.

20. REG SHO Threshold List (Daily, Free)

What: The listing exchanges (and FINRA for OTC securities) publish a daily list of stocks with persistent, excessive failed-to-deliver (FTD) shares. These stocks have forced buy-in risk.

Source: SEC.gov + exchange websites (free). https://cdn.finra.org/equity/regsho/daily/. Collection frequency: Daily. Storage: SQLite table regsho_threshold.

Key signal: Consecutive days on threshold list + high short interest + rising call OI = squeeze setup. Cross-reference with FINRA SI data.

21. DTCC Equity Swap Reporting + OCC Stock Borrow Rates

What: Post-Archegos transparency rules require swap reporting. OCC publishes daily borrow rates.

Source: DTCC (free), OCC (theocc.com -- free daily rates). Collection frequency: Daily. Storage: SQLite table borrow_rates.

Key signal: Borrow rate > 5% = hard to borrow, put/call parity breaks, GEX calculations need adjustment. Borrow rate spike = short squeeze pressure building.

SECTION F: EARNINGS DATA

22. Earnings Calendar and Upcoming Dates

What: Next reporting date for all tracked tickers, plus before/after market flag.

Source: Yahoo Finance earnings calendar (free, scrapeable). Financial Modeling Prep (FMP) API free tier (250 requests/day). https://financialmodelingprep.com/api/v3/earning_calendar?from={date}&to={date}&apikey={KEY}. Collection frequency: Daily update at 7 AM ET. Storage: SQLite table earnings_calendar.

23. Historical Earnings Surprise Database

What: Complete earnings history for prediction calibration and post-mortem analysis.

What to collect per earnings report:

Report date, time (pre/post market), fiscal quarter
Consensus EPS estimate (mean, median, high, low, number of analysts)
Actual reported EPS
Surprise % = (Actual - Consensus) / |Consensus|
Revenue consensus, actual, surprise %
Guidance: raised/lowered/maintained/none
Stock price reaction: close-to-close on report day, 2-day cumulative
Pre-earnings run-up: 5-day return before earnings
Post-earnings drift: 5-day, 10-day, 20-day, 60-day returns after
Implied move (ATM straddle price / stock price) vs actual move
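The surprise and implied-move formulas from the list above, as a minimal sketch. Returning None for a zero consensus is an assumption (the formula is undefined there); the function names and sample numbers are illustrative.

```python
from typing import Optional


def surprise_pct(actual: float, consensus: float) -> Optional[float]:
    """(Actual - Consensus) / |Consensus|, in percent. None if consensus is 0."""
    if consensus == 0:
        return None
    return (actual - consensus) / abs(consensus) * 100.0


def implied_move(atm_straddle_price: float, stock_price: float) -> float:
    """ATM straddle price as a fraction of the stock price."""
    return atm_straddle_price / stock_price


beat = surprise_pct(actual=1.10, consensus=1.00)          # about +10%
smaller_loss = surprise_pct(actual=-0.40, consensus=-0.50)  # about +20%
move = implied_move(atm_straddle_price=8.0, stock_price=200.0)
```

Dividing by the absolute value keeps the sign meaningful when consensus is negative: a smaller-than-expected loss registers as a positive surprise.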

Source: SEC EDGAR XBRL (actuals), FMP API (consensus estimates + surprise history). Initial backfill: last 5-10 years. Collection frequency: Daily during earnings season. Update reaction metrics at 1/5/10/20/60 day marks. Storage: SQLite table earnings_history.

Why it matters: Build per-stock "earnings personality" -- does this stock typically sell off on beats? Compare implied move vs actual move to find mispriced options/prediction market contracts.

24. Analyst Estimate Revisions

What: Consensus EPS/revenue estimates and revision trends.

What to track:

Current consensus for next 4 quarters and next 2 fiscal years
Revisions up/down in last 7/30/90 days
Revision ratio = up / (up + down)
Target price consensus, high, low
Rating distribution: strong buy, buy, hold, sell, strong sell

Source: FMP API (free tier). Backup: yfinance (ticker.recommendations). Collection frequency: Daily snapshot after market close. Storage: SQLite table estimate_revisions. Store every daily snapshot to reconstruct revision path.

Key signal: Strong revision momentum = leading indicator. Predictions against strong revision trends have lower hit rates. "Stale consensus" (no revision in 60+ days) = unreliable.

25. Earnings Call NLP Analysis

What: Natural language processing on quarterly earnings call transcripts.

NLP metrics to compute:

Fog Index (readability/complexity) -- higher = more obfuscation
Sentiment score (Loughran-McDonald financial dictionary)
Uncertainty word frequency ("approximately", "potentially", "challenging")
Forward-looking statement ratio (future tense / total verbs)
FinBERT sentiment (Hugging Face, free) for financial text
Quarter-over-quarter changes in all metrics (drift detection)

Source: Motley Fool transcripts (free, scrapeable). Seeking Alpha (limited free). Use nltk + textstat for readability, FinBERT for sentiment. Collection frequency: Daily during earnings season, weekly otherwise. Storage: SQLite table earnings_call_nlp.

Key signal: Fog Index increasing QoQ = management obfuscating. Uncertainty words rising = trouble ahead. Sentiment drift without fundamental change = narrative management.

SECTION G: CORPORATE EVENTS AND CALENDAR

26. IPO Lockup Expiration Calendar

What: When insider lockup periods expire on recently-IPO'd stocks, creating selling pressure.

What to collect:

Company, ticker, IPO date, lockup expiry date
Insider ownership % of float
VC/PE-backed flag
Estimated supply pressure (lockup shares / current float)

Source: IPOScoop (https://www.iposcoop.com/ipo-lockup-expirations/), MarketBeat (https://www.marketbeat.com/ipos/lockup-expirations/). Cross-ref with SEC S-1 "Shares Eligible for Future Sale" section. Collection frequency: Weekly scan every Sunday. Flag lockups expiring within 30 days. Storage: SQLite table ipo_lockups.

Key signal: Stocks start underperforming 10-15 days BEFORE lockup expiry. Short 2 weeks early, cover 2-3 days after. VC-backed + high insider ownership (>40%) = most pressure. For prediction markets: Kalshi "above $X" contracts expiring near a lockup date are overpriced.

27. Index Reconstitution and Corporate Actions

What: S&P 500 additions/deletions, Russell rebalancing, stock splits, dividend ex-dates, spin-offs, M&A.

What to track:

S&P 500 additions/deletions (ad hoc, from S&P press releases)
Russell 2000/1000 reconstitution (June; FTSE Russell has announced a move to semi-annual reconstitution, adding November, beginning in 2026)
MSCI rebalancing (quarterly)
Historical S&P 500 constituent changes with dates (survivorship bias prevention)
Stock splits (ratio, effective date)
Dividend ex-dates, record dates, amounts (special vs regular)
Spin-offs (parent, spin-off entity, distribution ratio, effective date)
Float adjustment changes (secondary offerings, lockup expiry, buybacks)

Sources: S&P Dow Jones press releases (free), FTSE Russell, Yahoo Finance corporate actions, SEC Form 10-12B (spin-offs), Nasdaq dividend calendar. Collection frequency: Daily at 7 AM ET. Storage: SQLite tables corporate_events, sp500_changes, spinoffs.

Key signals:

S&P 500 addition = buy on announcement (3-7% pop into effective date). Deletion = short.
Spin-off = forced selling by index funds (if spin-off doesn't qualify for index). Buy the spin-off after forced selling subsides.
Dividend + high short interest approaching ex-date = buy 5 days before (dividend capture + short covering).

28. Buyback Blackout Periods

What: Most companies suspend discretionary share repurchases during self-imposed earnings blackout windows (typically 2 weeks before earnings through 2 days after). This removes a buyer from the market.

How to estimate: Earnings date minus 14 calendar days = blackout start. Earnings date + 2 business days = blackout end. Calculate % of S&P 500 market cap in blackout at any given time.

Collection frequency: Calculated from earnings calendar. Storage: Column in daily state table.

Why it matters: When >50% of S&P 500 by market cap is in buyback blackout, a significant source of demand is removed. Downside risk increases.
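
The estimation rule above (earnings date minus 14 calendar days through earnings date plus 2 business days) can be sketched as follows. This is an illustration under stated assumptions: business days here skip weekends only, not market holidays, and the function names and the (market_cap, earnings_date) input shape are hypothetical.

```python
from datetime import date, timedelta

def add_business_days(start: date, n: int) -> date:
    """Advance n weekdays (Mon-Fri). Market holidays are NOT skipped (assumption)."""
    current = start
    while n > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:
            n -= 1
    return current

def blackout_window(earnings_date: date) -> tuple:
    """Blackout start and end per the estimation rule above."""
    return earnings_date - timedelta(days=14), add_business_days(earnings_date, 2)

def pct_in_blackout(today: date, companies: list) -> float:
    """companies: list of (market_cap, earnings_date). Returns 0-100."""
    total = sum(cap for cap, _ in companies)
    if total == 0:
        return 0.0
    blocked = 0.0
    for cap, earnings_date in companies:
        start, end = blackout_window(earnings_date)
        if start <= today <= end:
            blocked += cap
    return 100.0 * blocked / total

window = blackout_window(date(2026, 4, 23))  # a Thursday
pct = pct_in_blackout(
    date(2026, 4, 15),
    [(300.0, date(2026, 4, 23)), (100.0, date(2026, 5, 20))],
)
```

The resulting percentage is what would populate pct_sp500_in_blackout in the daily state table.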

29. Daily State Table

What: One row per trading day with every relevant boolean flag and event marker.

Schema includes:

Date, day of week
OpEx flags: is_monthly_opex, is_quarterly_opex (quad witching), days_to_next_opex
Fed flags: is_fomc_day, is_fomc_eve, days_to_next_fomc
Economic data: is_cpi_day, is_nfp_day, is_ppi_day, is_pce_day, econ_release_tier (1/2/3)
Treasury: is_treasury_auction, auction_tenor
Earnings: num_sp500_reporting, is_peak_earnings_week
Buyback: pct_sp500_in_blackout
Month/quarter: is_month_end, is_quarter_end
Half days, holidays

Collection frequency: Generated daily at 6:00 AM ET (before market open). Storage: SQLite table daily_state.
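
As a sketch of how the OpEx columns might be derived, the snippet below treats monthly OpEx as the third Friday of the month and quarterly OpEx as the third Friday of March, June, September, and December. It does not adjust for a third Friday that falls on a market holiday, and days_to_next_opex counts calendar days; both are simplifying assumptions, and the function names are hypothetical.

```python
import calendar
from datetime import date, timedelta

def third_friday(year: int, month: int) -> date:
    """Third Friday of the given month."""
    first = date(year, month, 1)
    offset = (calendar.FRIDAY - first.weekday()) % 7  # days to first Friday
    return first + timedelta(days=offset + 14)

def opex_flags(d: date) -> dict:
    """is_monthly_opex / is_quarterly_opex for one date."""
    is_monthly = d == third_friday(d.year, d.month)
    is_quarterly = is_monthly and d.month in (3, 6, 9, 12)
    return {"is_monthly_opex": is_monthly, "is_quarterly_opex": is_quarterly}

def days_to_next_opex(d: date) -> int:
    """Calendar days until the next monthly OpEx (0 on OpEx day itself)."""
    opex = third_friday(d.year, d.month)
    if opex < d:
        year, month = (d.year + 1, 1) if d.month == 12 else (d.year, d.month + 1)
        opex = third_friday(year, month)
    return (opex - d).days
```

The other flags (FOMC, CPI, auctions, holidays) depend on published calendars and would be joined in from lookup tables rather than computed.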

SECTION H: SENTIMENT AND SOCIAL DATA

30. Reddit (4 Subreddits, Separate Treatment)

What: Retail sentiment from r/wallstreetbets, r/stocks, r/investing, r/options. Each has different signal characteristics.

r/wallstreetbets (WSB):

Ticker mention count per hour (parse $TICKER, cashtags)
Post score, comment count, awards count (conviction proxy)
YOLO position dollar aggregation per ticker
Volume spike detection: flag when mentions exceed 3x 7-day average
WSB Momentum Score (0-100): composite of mention velocity, sentiment, YOLO dollars, awards
CONTRARIAN SIGNAL: WSB Momentum Score >90 for 3+ consecutive days AND ticker up >15% = SELL signal. Retail euphoria at extremes is one of the most reliable contrarian indicators.
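
The spike rule and the contrarian rule above reduce to two small checks, sketched here. The function names are hypothetical, and the lookback used to measure "ticker up >15%" is not specified in this spec, so the caller is assumed to pass a price change computed over whatever window the desk settles on.

```python
def mention_spike(current_mentions: float, prior_7_days: list) -> bool:
    """True when mentions exceed 3x the average of the prior 7 days."""
    if not prior_7_days:
        return False
    average = sum(prior_7_days) / len(prior_7_days)
    return current_mentions > 3 * average

def wsb_contrarian_sell(daily_scores: list, price_change_pct: float) -> bool:
    """daily_scores: WSB Momentum Scores, most recent last.

    Fires when the last 3 daily scores are all >90 and the ticker is up >15%.
    """
    if len(daily_scores) < 3:
        return False
    return all(score > 90 for score in daily_scores[-3:]) and price_change_pct > 15
```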

r/stocks:

DD (Due Diligence) flair posts: full text, ticker, author karma
DD quality score: author karma * upvote_ratio * (1 + log(comment_count))
New Ticker Alert: ticker appearing for first time in 30 days with positive DD = early retail discovery
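
A sketch of the DD quality score formula above. Two points are assumptions rather than part of the spec: the logarithm is taken as natural log, and posts with zero comments use a log term of 0 so the score stays defined.

```python
import math

def dd_quality_score(author_karma: float, upvote_ratio: float, comment_count: int) -> float:
    """author_karma * upvote_ratio * (1 + log(comment_count))."""
    # log(0) is undefined, so treat posts with no comments as a log term of 0.
    log_term = math.log(comment_count) if comment_count >= 1 else 0.0
    return author_karma * upvote_ratio * (1 + log_term)
```

Because karma enters linearly, a few very high-karma authors can dominate the ranking; capping or log-scaling karma may be worth testing once data is collected.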

r/investing:

Macro keyword frequency: "recession", "fed", "rate cut", "crash", "bubble"
Macro Fear Index: ratio of fear keywords to greed keywords, 7-day moving average
Feeds into sector rotation signal

r/options:

Call vs put sentiment ratio per ticker
LEAPS accumulation tracking (30-day rolling window)
Cross-reference with Options desk flow data

Source: Reddit API / PRAW (free). Rate limits: 60 requests/minute authenticated. Collection frequency: Every 30 minutes during market hours for WSB and r/options. Every 2 hours for r/stocks and r/investing. Daily aggregation at 10 PM ET. Storage: SQLite table reddit_sentiment.

31. X/Twitter Flow via Grok Search (Free)

What: Real-time market intelligence from financial X accounts.

Tier 1 -- Market Structure (check every 5 min): @unusual_whales, @spotgamma, @OptionsHawk, @WallStJesus, @SqueezeMetrics, @GarrettDeSimone
Tier 2 -- Macro & Fed (every 15 min): @NickTimiraos, @WalterBloomberg, @DeItaone, @Markets
Tier 3 -- Stock Analysis (every 30 min): @GaryBlack00, @DougKass, @elerianm
Tier 4 -- Retail Sentiment (every 60 min): @jimcramer (INVERSE CRAMER IS A REAL SIGNAL -- academic studies show his picks underperform)

Keyword groups to monitor:

Squeeze/Momentum: "short squeeze", "gamma squeeze", "diamond hands", "YOLO"
Fear/Crash: "market crash", "circuit breaker", "capitulation", "panic selling"
Fed/Macro: "rate cut", "rate hike", "fed pivot", "FOMC", "CPI", "recession"
Earnings: "earnings beat", "earnings miss", "guidance raised", "priced in"

Pattern detection (automated):

Coordinated ticker pumping: 10+ low-follower accounts tweeting same ticker in 5 min = pump-and-dump, AVOID
Influencer cascade: Tier 1 account tweets ticker, 3+ Tier 2/3 amplify within 1 hour = institutional narrative forming
Sentiment divergence: X positive but price declining = trust price for intraday, trust sentiment for 5+ day horizon
Cramer Inverse: track accuracy score, when bullish + trailing accuracy <45% = contrarian sell
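
The coordinated-pumping rule (10+ low-follower accounts on the same ticker within 5 minutes) can be sketched as a sliding window over timestamped posts. The tuple layout, function name, and the 1,000-follower cutoff for "low-follower" are assumptions; the spec does not define that threshold.

```python
from collections import defaultdict

def coordinated_pump_tickers(tweets, max_followers=1000, min_accounts=10, window_sec=300):
    """tweets: iterable of (timestamp_sec, account, followers, ticker).

    Returns tickers where at least min_accounts distinct low-follower
    accounts posted within any window_sec span.
    """
    by_ticker = defaultdict(list)
    for ts, account, followers, ticker in tweets:
        if followers < max_followers:  # "low-follower" cutoff is an assumption
            by_ticker[ticker].append((ts, account))

    flagged = set()
    for ticker, events in by_ticker.items():
        events.sort()
        start = 0
        for end in range(len(events)):
            # Shrink the window until it spans at most window_sec.
            while events[end][0] - events[start][0] > window_sec:
                start += 1
            accounts = {account for _, account in events[start:end + 1]}
            if len(accounts) >= min_accounts:
                flagged.add(ticker)
                break
    return flagged
```

Counting distinct accounts rather than posts keeps one account spamming a ticker from triggering the flag on its own.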

Source: Grok's built-in X search via xAI API (free, already have access). Zero additional cost. Collection frequency: Per-tier schedule above during market hours. Every 2 hours outside. Storage: SQLite table x_sentiment.

32. Congressional Trading Disclosures

What: Members of Congress must disclose stock trades within 45 days (STOCK Act). Their trades historically outperform the market.

Source: Capitol Trades (https://www.capitoltrades.com/) or Quiver Quant (https://www.quiverquant.com/congresstrading/). Free to scrape. Collection frequency: Daily. Storage: SQLite table congressional_trades.

Key signal: Multiple members of Congress buying the same stock = insider knowledge of upcoming legislation/regulation. Especially relevant for defense, pharma, and tech stocks.

33. Google Trends

What: Search interest for tracked tickers and company names.

What to track:

Search interest for ticker symbols and company names
"buy [stock]" vs "sell [stock]" ratio
Category-level vs brand-level divergence (the contrarian edge: if "electric vehicles" trending up but "Tesla Model Y" trending down = Tesla losing category share BEFORE it shows in delivery numbers)

Source: pytrends Python library (free, unofficial but stable). Collection frequency: Daily (Google Trends has 1-day granularity for <90 day range). Storage: SQLite table google_trends.

Key signal: Search interest spike >3x 30-day average = retail attention surge. Bullish for 1-3 days, then contrarian (attention peak passed). Category/brand divergence >2 std dev from 12-month average = market share shift.

34. Retail Flow Proxies

What: Replacement for Robinhood's defunct popularity API.

Sources (all free to scrape):

Fidelity order flow: eresearch.fidelity.com/eresearch/gotoBL/fidelityTopOrders.jhtml -- daily buy/sell ratios. Skews institutional.
Webull comment/star count: api.webull.com (undocumented API, ~60/min rate limit). Rising star count = retail buying.
eToro CopyTrader: most copied traders' positions. When popular gurus pile into same name, it is overcrowded.
Stocktwits: api.stocktwits.com/api/2/streams/symbol/{TICKER}.json (free, 200/hr). Bull/bear ratio >80% bullish = contrarian sell.

Collection frequency: Fidelity/Webull every 30 min during market hours. eToro weekly. Stocktwits every 15 min. Storage: SQLite table retail_flow_proxies.

Derived signal -- Retail Consensus Indicator: Composite of Stocktwits (25%), WSB (25%), r/stocks (20%), Fidelity (15%), Google Trends (15%). Score >85 sustained for 3+ days = CONTRARIAN SELL. Score <15 sustained for 3+ days = CONTRARIAN BUY.
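
A minimal sketch of the Retail Consensus Indicator using the weights above. It assumes each component has already been normalized to a 0-100 scale (the normalization itself is not specified here), that one score is recorded per day, and the dictionary keys and function names are hypothetical.

```python
RETAIL_WEIGHTS = {
    "stocktwits": 0.25,
    "wsb": 0.25,
    "r_stocks": 0.20,
    "fidelity": 0.15,
    "google_trends": 0.15,
}

def retail_consensus(components: dict) -> float:
    """Weighted composite of 0-100 component scores."""
    return sum(weight * components[name] for name, weight in RETAIL_WEIGHTS.items())

def retail_contrarian_signal(daily_scores: list):
    """daily_scores: one composite per day, most recent last."""
    if len(daily_scores) < 3:
        return None
    last_three = daily_scores[-3:]
    if all(score > 85 for score in last_three):
        return "CONTRARIAN_SELL"
    if all(score < 15 for score in last_three):
        return "CONTRARIAN_BUY"
    return None
```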

35. Earnings Whisper / Social Consensus

What: The "real" expectation that traders have, which often differs from published consensus.

Sources:

Estimize (crowdsourced estimates, free basic tier). Estimize beats Wall Street consensus ~70% of the time.
Social Whisper: median of EPS/revenue estimates extracted from X, Reddit, StockTwits text (regex + NER parsing)

Key signal: Company beats Street consensus but MISSES Social Whisper = negative reaction ("beat and dump"). Company misses Street but matches Social Whisper = muted reaction.
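
The whisper logic above can be sketched as a median plus a three-way classification. The labels, function names, and the one-cent tolerance used for "matches Social Whisper" are assumptions for illustration.

```python
from statistics import median

def social_whisper(estimates: list) -> float:
    """Median of EPS estimates extracted from social text."""
    return median(estimates)

def classify_reaction(actual: float, street: float, whisper: float, tol: float = 0.01) -> str:
    """Apply the two key-signal rules; anything else returns NO_WHISPER_SIGNAL."""
    if actual > street and actual < whisper:
        return "BEAT_AND_DUMP"       # beat Street, missed the whisper
    if actual < street and abs(actual - whisper) <= tol:
        return "MUTED_REACTION"      # missed Street, matched the whisper
    return "NO_WHISPER_SIGNAL"

whisper = social_whisper([1.08, 1.10, 1.12])
```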

Collection frequency: Daily during earnings season for stocks reporting within 5 trading days.

SECTION I: ALTERNATIVE DATA (Free Sources)

36. USPTO Patent Data

What: Patent applications, grants, citations, abandonments, and assignments for tech companies.

What to track:

Forward citations by company (how often patents are cited by newer patents)
YoY citation change (moat strengthening/eroding signal)
Patent APPLICATIONS (pre-grant) -- reveals R&D direction 18-24 months before products ship
Patent ABANDONMENT (failure to pay maintenance fees) -- company giving up on technology
Patent ASSIGNMENT transfers (selling patents) -- distressed IP monetization
Trademark filings in new Nice Classification codes -- product launch detector (6-12 month lead)

Source: USPTO PatentsView API (free). The legacy endpoint https://api.patentsview.org/patents/query has been retired; its replacement, the PatentSearch API, requires a free API key, so confirm the current endpoint before building. Trademark: USPTO TSDR. Collection frequency: Weekly (patent data updates weekly, trademarks daily). Storage: SQLite table patent_citations + trademark_filings.

Why it matters: Patent abandonment in core technology = bearish for long-term moat. Assignment to NPE = company needs cash = distress signal.

37. GitHub Commit Velocity (Tech Companies)

What: For open-source-heavy tech companies (MSFT, GOOG, META, IBM, MDB, SNOW, DDOG), GitHub activity is a real-time proxy for developer ecosystem health.

What to track:

Commit velocity in core repos (weekly)
Star counts and growth rate
Community contributor growth
Fork count trends

Source: GitHub API (free, 5000 requests/hour authenticated). https://api.github.com/ Collection frequency: Weekly. Storage: SQLite table github_activity.

Key signal: Commit velocity declining >30% over 3 months = engineering problems. Community contributor growth accelerating = ecosystem moat strengthening.
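
One way to read "declining >30% over 3 months" is to compare total commits in the latest 13 weeks against the 13 weeks before that. That interpretation, and the function name, are assumptions.

```python
def commit_velocity_change(weekly_commits: list, weeks_per_window: int = 13) -> float:
    """Percent change between the latest window and the one before it.

    weekly_commits: commit counts per week, most recent last.
    """
    if len(weekly_commits) < 2 * weeks_per_window:
        raise ValueError("need at least two full windows of weekly data")
    recent = sum(weekly_commits[-weeks_per_window:])
    prior = sum(weekly_commits[-2 * weeks_per_window:-weeks_per_window])
    if prior == 0:
        return 0.0
    return 100.0 * (recent - prior) / prior

change = commit_velocity_change([100] * 13 + [60] * 13)
engineering_flag = change < -30
```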

38. Job Posting Volume as Company Health Proxy

What: Number and type of job postings reveals company health and strategic direction.

Source: Direct scraping of company careers pages. Indeed RSS feeds. Revealera.com (free tracker). Collection frequency: Weekly snapshot of top 100 holdings' career pages. Storage: SQLite table job_postings.

Key signal: Posting volume drop >40% MoM = hiring freeze = operational problems incoming. Surge in "revenue" roles = expects growth. Surge in "cost" roles (compliance, legal, restructuring) = defensive posture.

39. App Store Rankings (Consumer Tech)

What: App store download rankings for consumer-facing tech companies.

Source: Various free app analytics sites. Apple App Store RSS feed. Collection frequency: Weekly. Storage: SQLite table app_rankings.

40. SaaS Pricing Page Monitoring

What: Track pricing page changes for SaaS companies. Price increases signal pricing power; new free tiers signal demand weakness.

Source: Wayback Machine API for historical (https://web.archive.org/web/timemap/). Direct page monitoring for real-time. Collection frequency: Weekly. Storage: SQLite table pricing_changes.

Key signal: Price increase = revenue acceleration coming (buy). New free tier / discount language = demand concern (avoid).

SECTION J: CROSS-ASSET AND INTERNATIONAL SIGNALS

41. Credit-Equity Divergence

What: Corporate bond spreads lead equity by 1-3 days because credit markets are smarter/faster.

Source: FRED API (free). HY OAS (BAMLH0A0HYM2), IG OAS (BAMLC0A0CM). Collection frequency: Daily.

Key signal: HY OAS widening 2+ days while SPX flat = equity about to catch down. HY OAS tightening while SPX sells off = equity overreacting, buy the dip. ~70% directional accuracy.
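
A sketch of the divergence check, using the FRED HY OAS series (BAMLH0A0HYM2) as the spread input since that is the source listed for this item. The thresholds for "flat" (within 0.5%) and "sells off" (down 2% or more over the window) are assumptions, as are the function name and return labels.

```python
def credit_equity_signal(hy_oas, spx_close, days=2, flat_band_pct=0.5, selloff_pct=-2.0):
    """hy_oas, spx_close: daily series, most recent last.

    Needs days + 1 observations to see `days` consecutive spread moves.
    """
    if len(hy_oas) < days + 1 or len(spx_close) < days + 1:
        return None
    spreads = hy_oas[-(days + 1):]
    widening = all(later > earlier for earlier, later in zip(spreads, spreads[1:]))
    tightening = all(later < earlier for earlier, later in zip(spreads, spreads[1:]))
    spx_return_pct = 100.0 * (spx_close[-1] / spx_close[-(days + 1)] - 1)

    if widening and abs(spx_return_pct) <= flat_band_pct:
        return "EQUITY_CATCH_DOWN"
    if tightening and spx_return_pct <= selloff_pct:
        return "BUY_THE_DIP"
    return None
```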

42. Commodity-to-Sector Margin Mapping

What: Specific commodity price moves directly predict sector margins with 1-quarter lag.

Mappings:

Copper/Aluminum up = Construction/Industrials margins compress
WTI Crude up = Airlines compress, Energy expand
Lumber up = Homebuilder margins compress
Natural Gas up = Chemicals compress, Utilities expand
Lithium/Cobalt up = EV makers compress (TSLA, RIVN, LCID)
Coffee/Cocoa up = Consumer staples compress (SBUX, HSY, MDLZ)

Source: FRED commodity prices (free). https://fred.stlouisfed.org/categories/32217 Collection frequency: Daily. Storage: SQLite table commodity_sector_signals.

Key signal: Commodity moving >15% in 30 days = margin impact next quarter. When commodity rises but downstream stock prices do NOT adjust = market underpricing compression = short.
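
The mappings above can be held in a lookup table keyed by commodity, with the 15%-in-30-days trigger applied on top. This sketch only covers upward moves, because that is the direction the mappings describe; the key names and function name are hypothetical.

```python
# Sector impact when the commodity price RISES, per the mappings above.
UP_MOVE_IMPACT = {
    "copper": [("Construction", "compress"), ("Industrials", "compress")],
    "aluminum": [("Construction", "compress"), ("Industrials", "compress")],
    "wti_crude": [("Airlines", "compress"), ("Energy", "expand")],
    "lumber": [("Homebuilders", "compress")],
    "natural_gas": [("Chemicals", "compress"), ("Utilities", "expand")],
    "lithium": [("EV makers", "compress")],
    "cobalt": [("EV makers", "compress")],
    "coffee": [("Consumer staples", "compress")],
    "cocoa": [("Consumer staples", "compress")],
}

def margin_signals(commodity, price_30d_ago, price_now, threshold_pct=15.0):
    """Return (sector, margin_effect) pairs when the 30-day rise exceeds the threshold."""
    change_pct = 100.0 * (price_now - price_30d_ago) / price_30d_ago
    if change_pct > threshold_pct:
        return list(UP_MOVE_IMPACT.get(commodity, []))
    return []
```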

43. CFTC Commitments of Traders (Equity Index Futures)

What: Weekly positioning data showing dealer, leveraged fund, and asset manager positions in equity index futures.

Source: CFTC.gov (free CSV downloads). Collection frequency: Weekly (released Friday 3:30 PM ET, data from Tuesday -- 3 days stale). Storage: SQLite table cot_equity (shared with Futures desk).

44. International Leading Indicators

What: Non-US data points that predict US equity moves.

Sources (all free):

Korean semiconductor exports: Korea Customs Service. Monthly (1st of month). Korean semi export growth directly predicts NVDA, AMD, INTC, MU earnings with 1-quarter lead.
TSMC monthly revenue: https://investor.tsmc.com/english/monthly-revenue. Released ~10th of month. TSMC revenue up >20% YoY = chip demand boom.
Japan core machinery orders: Cabinet Office of Japan. Monthly. Leads global capex by 2-3 months.
China PMI (NBS + Caixin): Monthly (1st of month). Divergence between official (large SOEs) and Caixin (private sector) reveals split economy.
Australian iron ore exports: ABS. Monthly. Best real-time proxy for Chinese industrial activity (unmanaged data vs politically managed Chinese statistics).

Collection frequency: Monthly for all. Check on release dates. Storage: SQLite table international_indicators.

45. FDIC Call Reports (Bank Stocks)

What: Quarterly bank financial data with 80+ fields. More granular than 10-K/10-Q for banks and available weeks earlier.

Source: FFIEC Central Data Repository (CDR, free). https://cdr.ffiec.gov/public/. FFIEC bulk download: https://www.ffiec.gov/npw/FinancialReport/DataDownload Collection frequency: Quarterly, available 30 days after quarter-end. Storage: SQLite table fdic_call_reports.

Key signal: Rising provision-to-loan ratio in CRE or C&I at multiple banks = systemic sector stress, 1-2 quarters before earnings impact.

46. Fed H.8 Report + Stress Tests

What: Weekly aggregate bank lending data. Annual stress test results determine bank capital return capacity.

Source: Federal Reserve. H.8: https://www.federalreserve.gov/releases/h8/ (weekly, Friday). FRED series TOTCI. Stress tests: https://www.federalreserve.gov/supervisionreg/stress-tests.htm (annual, June/July). Collection frequency: H.8 weekly. Stress tests annually.

Key signal: C&I loans declining 3+ weeks = businesses not borrowing = earnings recession coming. Banks with CET1 >12% + clean loan book = buyback announcement post-stress test = buy before June.

SECTION K: DERIVED AND COMPUTED DATABASES

These are not collected from external sources. They are COMPUTED from data above and our own prediction history.

47. Failed Trade Pattern Database

What: Structured post-mortem for every prediction that lost money.

Fields per failed trade:

prediction_id, ticker, direction, entry/exit dates, return
market_regime at entry, factor_regime at entry
sector performance, earnings proximity, analyst revision trend
prediction_market_consensus, correlation_regime, VIX level

Root cause categories (10):

REGIME_CHANGE -- market/factor regime shifted after entry
WRONG_DIRECTION -- thesis was simply wrong
RIGHT_DIRECTION_WRONG_TIMING -- stock eventually did what we predicted, not in our window
EXTERNAL_SHOCK -- unpredictable event
CROWDED_TRADE -- too many on same side
FACTOR_ROTATION -- our factor stopped working
CORRELATION_BREAKDOWN -- portfolio correlations spiked
SIZING_ERROR -- direction right, size too large, stopped out on noise
INFORMATION_STALE -- data old, market already priced it in
LIQUIDITY_TRAP -- couldn't exit at expected price

Storage: SQLite table failed_trade_patterns + aggregate stats by category/regime/factor/sector/month. Update: Within 24 hours of every settled trade.
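
A possible shape for the failed_trade_patterns table, sketched with the standard-library sqlite3 module. Only a subset of the fields listed above is shown, the column names and types are assumptions, and the CHECK constraint restricts root_cause to the ten categories.

```python
import sqlite3

ROOT_CAUSES = (
    "REGIME_CHANGE", "WRONG_DIRECTION", "RIGHT_DIRECTION_WRONG_TIMING",
    "EXTERNAL_SHOCK", "CROWDED_TRADE", "FACTOR_ROTATION",
    "CORRELATION_BREAKDOWN", "SIZING_ERROR", "INFORMATION_STALE",
    "LIQUIDITY_TRAP",
)

def create_failed_trade_table(conn: sqlite3.Connection) -> None:
    """Create the table with a CHECK constraint on root_cause."""
    allowed = ", ".join(f"'{cause}'" for cause in ROOT_CAUSES)
    conn.execute(f"""
        CREATE TABLE IF NOT EXISTS failed_trade_patterns (
            prediction_id TEXT PRIMARY KEY,
            ticker        TEXT NOT NULL,
            direction     TEXT NOT NULL,
            entry_date    TEXT NOT NULL,
            exit_date     TEXT NOT NULL,
            return_pct    REAL NOT NULL,
            market_regime TEXT,
            factor_regime TEXT,
            vix_level     REAL,
            root_cause    TEXT NOT NULL CHECK (root_cause IN ({allowed}))
        )
    """)

def failures_by_cause(conn: sqlite3.Connection) -> dict:
    """Aggregate count of failed trades per root cause."""
    rows = conn.execute(
        "SELECT root_cause, COUNT(*) FROM failed_trade_patterns GROUP BY root_cause"
    )
    return dict(rows.fetchall())
```

Enforcing the category list in the schema means a mistyped root cause fails at insert time instead of silently creating an eleventh bucket in the aggregate stats.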

48. Prediction Calibration Database

What: Track calibration by confidence bucket. Are our 70% confidence predictions winning 70% of the time?

Storage: SQLite table prediction_calibration with Brier scores, by confidence bucket, by desk member, by regime, by time horizon.
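
Calibration by confidence bucket can be sketched as below: group predictions into 10-point buckets, then compare the realized hit rate with the stated confidence and compute the Brier score (mean squared error between confidence and outcome). The input shape and function name are assumptions; the breakdowns by desk member, regime, and horizon would add grouping keys on top of this.

```python
from collections import defaultdict

def calibration_by_bucket(predictions):
    """predictions: iterable of (confidence in 0-1, outcome as 0 or 1).

    Returns {bucket_lower_bound_pct: {"n", "hit_rate", "brier"}}.
    """
    buckets = defaultdict(list)
    for confidence, outcome in predictions:
        pct = round(confidence * 100)
        bucket = min(pct // 10 * 10, 90)  # 100% folds into the 90 bucket
        buckets[bucket].append((confidence, outcome))

    report = {}
    for bucket, rows in sorted(buckets.items()):
        n = len(rows)
        report[bucket] = {
            "n": n,
            "hit_rate": sum(outcome for _, outcome in rows) / n,
            "brier": sum((p - outcome) ** 2 for p, outcome in rows) / n,
        }
    return report

report = calibration_by_bucket([(0.7, 1)] * 7 + [(0.7, 0)] * 3)
```

In this example the 70% bucket wins 7 of 10, so it is perfectly calibrated; a well-calibrated desk shows hit_rate close to the bucket's confidence in every row.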

49. Factor Exposure Attribution

What: Decompose every trade's P&L into stock-specific alpha vs sector beta vs market beta vs factor exposure.

Source: Calculated from Fama-French factors + sector returns + individual stock returns. Storage: SQLite table factor_attribution.

50. Smart Money Convergence Score

What: Composite score combining insider buying (Form 4), institutional accumulation (13F), activist interest (13D), dark pool volume, and congressional trading into a single "smart money conviction" metric per ticker.

Storage: SQLite table smart_money_convergence.

51. Squeeze Probability Score

What: Composite (0-100) combining WSB mention velocity (20%), short interest + borrow fee (30%), X "squeeze" keyword frequency (15%), days to cover (15%), options skew inversion (10%), retail broker buy ratio (10%).

Storage: Calculated every 15 minutes during market hours. SQLite table squeeze_scores.

Signal thresholds: 0-30 = no squeeze risk. 30-60 = elevated. 60-80 = conditions forming. 80-100 = likely imminent. CRITICAL: once score >90 AND price spiked >30%, the squeeze is ENDING. Take profits.
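
A sketch of the Squeeze Probability Score with the weights and thresholds above. It assumes each component is already scaled to 0-100, treats each band's lower bound as inclusive (the ranges above overlap at 30, 60, and 80), and uses hypothetical key and function names.

```python
SQUEEZE_WEIGHTS = {
    "wsb_mention_velocity": 0.20,
    "short_interest_borrow_fee": 0.30,
    "x_squeeze_keywords": 0.15,
    "days_to_cover": 0.15,
    "options_skew_inversion": 0.10,
    "retail_buy_ratio": 0.10,
}

def squeeze_score(components: dict) -> float:
    """Weighted composite (0-100) of 0-100 component scores."""
    return sum(weight * components[name] for name, weight in SQUEEZE_WEIGHTS.items())

def squeeze_band(score: float) -> str:
    """Map a score to the signal bands; lower bounds are inclusive."""
    if score < 30:
        return "no squeeze risk"
    if score < 60:
        return "elevated"
    if score < 80:
        return "conditions forming"
    return "likely imminent"

def squeeze_ending(score: float, price_spike_pct: float) -> bool:
    """Take-profit condition: score >90 AND price already spiked >30%."""
    return score > 90 and price_spike_pct > 30
```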

BUILD PRIORITY

Week 1 -- Foundation ($0)

Day 1-2: SEC EDGAR Form 4 pipeline (insider trades + network mapping). Python script, ~400 lines. Cron: daily at 7 PM ET.

Day 2-3: FRED macro collector (23 series, credit-equity divergence baseline, sector rotation signals). Python script, ~150 lines. Cron: daily at 6 PM ET.

Day 3-4: Price/volume collector (S&P 500 + top 200 Russell 2000 via yfinance batch download, compute technicals). Python script, ~300 lines. Cron: daily at 5 PM ET.

Day 4-5: 13F quarterly processor (bulk download, XML parse, smart money cohort tracking, overlap scores). Python script, ~500 lines. Cron: quarterly within 48h of deadline.

Day 5-7: Signal integration (accruals calculator, asset growth, net share issuance, insider cluster detection, filing timing anomaly, Kalshi/Polymarket scrapers). Python signal module, ~400 lines.

Week 1 total: ~1,750 lines of Python, 5 cron jobs, $0.

Week 2 -- Expanded Free Sources ($0)

Day 8-9: 13D activist tracker (daily EDGAR scan, parse intent language, alert system). ~200 lines.

Day 9-10: DEF 14A proxy analyzer (executive compensation changes, equity/cash ratio, option strikes). ~300 lines.

Day 10-11: FINRA short interest + dark pool (bi-monthly SI parser, weekly ATS data). ~200 lines.

Day 11-12: IPO lockup calendar (scrape IPOScoop/MarketBeat, cross-ref S-1). ~200 lines.

Day 12-13: Form 144 pre-sale notices (EDGAR scan, parse restricted stock notices). ~150 lines.

Day 13-14: N-PORT monthly fund holdings (XML parse, month-over-month comparison). ~300 lines.

Week 2 total: ~1,350 additional lines, 6 more cron jobs, $0.

Week 3 -- Sentiment, NLP, and Alternative Data ($0)

Day 15-16: Reddit sentiment pipeline (PRAW, 4 subreddits, WSB Momentum Score, FinBERT). ~400 lines.

Day 16-17: Earnings call NLP (transcript scraping from Motley Fool, Fog Index, uncertainty scoring, sentiment). ~350 lines.

Day 17-18: USPTO patent tracker (PatentsView API, citations, applications, abandonments). ~250 lines.

Day 18-19: Congressional trading scraper (Capitol Trades / Quiver Quant). ~150 lines.

Day 19-20: Google Trends pipeline (pytrends, category vs brand divergence). ~200 lines.

Day 20-21: REG SHO threshold list + Earnings calendar/history + daily state table. ~300 lines.

Week 3 total: ~1,650 lines, 6 more cron jobs, $0.

Month 2-3 -- Advanced ($29/month for Polygon)

Polygon.io subscription ($29/month) -- replaces fragile Yahoo Finance dependency, adds real-time data, sweep detection
SEC Form 424B2 structured product parser -- NLP for prospectus text (shared with Options desk)
GitHub activity tracker for tech companies
Job posting volume tracker (company careers page scraping)
TRACE corporate bond data (FINRA, individual bond trades for single-name credit-equity divergence)
International indicators (Korean exports, TSMC revenue, Japan machinery, China PMI)
SaaS pricing page monitor (Wayback Machine + direct monitoring)
FDIC call reports (bank-specific)
Earnings Whisper / Estimize integration

What NOT to Build (Not Worth the Cost)

Bloomberg Terminal ($24,000/year) -- SEC EDGAR covers 80% of what you need
FactSet ($12,000+/year) -- EDGAR + FRED covers the fundamentals
Refinitiv ($10,000+/year) -- same as above
BoardEx ($5,000+/year) -- build insider networks from Form 4 data instead
S&P Capital IQ ($15,000+/year) -- EDGAR + FRED covers it
X API paid tier ($5,000/month) -- Grok search is free and sufficient
SimilarWeb ($100/month) -- only worth it if trading enough consumer/tech names

STORAGE ESTIMATES

Per-day estimates (full universe ~700 tickers):

Data Source	Size/Day
daily_prices	~140 KB
insider_trades	~50 KB
macro_data	~2 KB
stock_signals (composite)	~280 KB
earnings_call_nlp (seasonal)	~10 KB
credit_equity_signals	~6 KB
reddit_sentiment	~20 KB
x_sentiment	~15 KB
prediction_markets	~50 KB
collection_log	~3 KB

Daily total: ~575 KB/day

Per-quarter estimates:

institutional_holdings (13F): ~7.5 MB
nport_holdings: ~10 MB
fundamentals: ~350 KB
smart_money_signals: ~210 KB

Annual totals:

daily_prices: ~36 MB
insider_trades: ~12 MB
institutional_holdings: ~30 MB
earnings + NLP: ~4 MB
all other tables: ~10 MB
Annual total: ~90 MB/year

5 years of history: ~450 MB
With SQLite WAL + indexes: ~600 MB
With raw earnings transcripts (90-day retention): add ~500 MB/year

Total disk budget: ~3 GB (trivially small on 45GB disk with 15GB free)

COST SUMMARY

Period	Cost	What It Gets You
Weeks 1-3	$0	SEC EDGAR full pipeline, FRED macro, price/volume, Reddit, NLP, patents, Kalshi/Polymarket
Month 2+	$29/month	Polygon.io for real-time chains and reliable price data
Year 1 total	~$290-$320	Polygon only: 10-11 months at $29/month, depending on whether the subscription starts in Month 2 or Month 3

Dependencies (Python, recommended for all collectors):

requests, beautifulsoup4, lxml (HTTP/parsing)
yfinance, pandas, numpy (data)
nltk, textstat (NLP readability)
praw (Reddit API)
pytrends (Google Trends)
ratelimit (rate limiting decorator)
sqlite3 (built into Python)
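Total: 11 third-party packages, ~150 MB installed. FinBERT additionally requires transformers and torch, which are not included in that size estimate.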

CROSS-DESK SYNERGIES

Shared with Options Desk

Form 4 insider trades -- Options desk uses for flow signal, Stocks desk uses for directional signal + network mapping
SEC 424B2 structured products -- same filings, both desks need barrier levels and notional
CFTC COT positioning -- equity index futures positioning feeds both desks
Credit spreads (FRED) -- leads both equity and options volatility
Kalshi/Polymarket scrapers -- shared infrastructure, both desks trade on these platforms

Shared with Futures Desk

FRED macro data -- entire macro suite is shared
Credit-equity divergence signals -- credit leads equity, futures desk tracks the credit side
Commodity-sector margin mapping -- futures desk handles commodity side, stocks desk consumes for sector signals
Treasury auction results -- affects both equity and futures positioning

Shared with Forex Desk

Currency mismatch risk -- Forex desk provides USD forecasts, Stocks desk applies to companies with >30% foreign revenue
JPY/CHF carry trade composite -- Forex desk tracks, Stocks desk uses as equity risk indicator
FRED DXY and yield data -- shared infrastructure

Shared with Crypto Desk

BTC-SPX correlation monitoring -- regime-dependent correlation affects portfolio risk
Stablecoin flows -- crypto risk appetite proxy that leads equity sentiment
Polymarket liquidity -- USDC depeg risk affects all Polymarket contracts

Shared with All Desks

Kalshi/Polymarket scraper infrastructure -- one codebase, multiple desk filters
Market regime classification -- all desks need to know bull/bear/crisis state
Brier scoring and calibration tracking -- unified prediction evaluation across all desks
Settlement AI -- shared Grok 4.1 Fast settlement resolution

Source: ~/.claude/projects/-home-ubuntu-edgeclaw/memory/stocks-desk-data-inventory.md