Stocks Desk — Data Collection Spec (Mar 14, 2026)

Reviewed by 5-model panel: Flash (Financial), Maverick (Sentiment), Gemini Pro (Post-Mortem), Haiku (Practical), Opus (Contrarian)

What This Document Is

This is the complete data collection specification for the Stocks desk in the research pipeline. It covers equity trades on Robinhood AND event contracts traded on Kalshi/Polymarket. An AI builder should be able to read this and know exactly what data to collect, from where, how often, in what format, and why it matters for finding mispriced stocks or prediction market contracts.

The Business Model (Two Venues)

Unlike sports desks that only trade prediction markets, the Stocks desk trades two venue types (equities and prediction markets) across THREE platforms:

  1. Robinhood — Actual stock trades (buy/sell shares on individual equities). We buy and sell real stocks.
  2. Kalshi — Binary prediction markets on stock events: price thresholds ("TSLA above $300 by Friday"), earnings outcomes ("AAPL beats EPS estimate"), index levels ("S&P 500 above 5500 by month end").
  3. Polymarket — Similar binary contracts on crypto rails (USDC on Polygon). Includes M&A, regulatory outcomes, CEO changes, index milestones.

The cross-venue edge: Fundamental analysis tells us a stock is mispriced on Robinhood. The same analysis also tells us whether a Kalshi/Polymarket contract is mispriced. Example: if our fundamental model says NVDA fair value is $950 and Kalshi prices "NVDA above $900 by April 30" at 55 cents, our model implies ~75% probability — that contract is cheap. We trade BOTH venues simultaneously.

How Stocks Are Different From Sports/Options

An AI builder must understand these structural differences:

  1. No Sharp External Benchmark — In sports, sharp bookmakers set fair lines. For stock events, there IS no sharp external reference. We must BUILD the fair value estimate from fundamentals, flow data, and volatility modeling. This makes the Stocks desk data collection broader and more complex than sports desks.

  2. Continuous Price Discovery — Stocks trade every second, 6.5 hours per day. Data changes constantly. Sports bets lock in at game time. This means timing of data collection matters more.

  3. Multi-Factor Valuation — A stock's fair value depends on earnings, balance sheet, macro regime, sector rotation, institutional positioning, insider behavior, credit markets, and sentiment simultaneously. No single data source is sufficient.

  4. Prediction Market Carry Cost — Buying a Kalshi contract at 70 cents locks up 70 cents until settlement. At 5% risk-free rate, a 70-cent 30-day contract costs ~0.29 cents in opportunity cost. Prediction market prices should trade BELOW option-implied probabilities by this carry cost. If they trade ABOVE, the prediction market is overpriced.

  5. Regime Dependence — Stock strategies that work in bull markets fail in bear markets. Every data source must be tagged with which market regime it applies to. Sports bets work the same regardless of market regime.

  6. SEC Filings Are the Goldmine — The SEC requires public companies to disclose everything. This free government data is richer than any paid service for stocks. Sports have no equivalent.
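The carry cost arithmetic in item 4 can be reproduced with a few lines of Python. This is a minimal sketch assuming simple (non-compounded) interest on a 365-day year; the function names are illustrative and not part of any existing desk code.

```python
def carry_cost_cents(price_cents: float, annual_rate: float, days_to_settlement: int) -> float:
    """Opportunity cost, in cents, of capital locked up until settlement.

    Assumes simple interest on a 365-day year (an assumption of this sketch).
    """
    return price_cents * annual_rate * days_to_settlement / 365.0


def is_overpriced_vs_options(market_price_cents: float, option_implied_prob: float) -> bool:
    """Flag a contract trading ABOVE the option-implied probability (in cents)."""
    return market_price_cents > option_implied_prob * 100.0


# Item 4's example: a 70-cent contract, 5% risk-free rate, 30 days to settlement.
print(round(carry_cost_cents(70.0, 0.05, 30), 2))  # 0.29
```

The second helper encodes the rule from item 4: a contract priced above the option-implied probability is treated as overpriced.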


SECTION A: PRICE AND MARKET DATA (Foundation)

1. Daily OHLCV for Full Universe

What: Daily Open, High, Low, Close (adjusted for splits/dividends), and Volume for every stock we track.

Universe:

What to pull per ticker:

Calculated technicals (on ingestion):

RSI Divergence Detection (4 types — inherited from forex spec): Scan daily RSI (14) against price swings. All 4 types apply to stocks identically to forex:

What to store per divergence event:

Outcome tracking:

Storage: SQLite table rsi_divergences (shared schema with forex divergence_events table). Collection frequency: Calculated daily at 5:00 PM ET on ingestion. Weekly divergences recalculated on Friday close.

Source: Yahoo Finance via yfinance (free). Batch download handles 100+ tickers in one call. Rate limit ~2000/hr with rotating user agents. Add 0.5-1 second delay between individual calls. Backup: Polygon.io free tier (5 calls/min, delayed), Alpha Vantage (25 calls/day free), Tiingo (1000 symbols free). History: 10 years for S&P 500. Full available history for any traded stock. Collection frequency: Daily at 5:00 PM ET (after market close + after-hours adjustments settle). Storage: SQLite table daily_prices with columns (ticker, trade_date, open, high, low, close, adj_close, volume, relative_volume, rsi_14, macd, macd_signal, bb_upper, bb_lower, bb_pct, atr_14).
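A possible shape for the daily_prices table, using the column list above. The column types and the (ticker, trade_date) primary key are assumptions of this sketch; the spec only names the columns.

```python
import sqlite3

# Column names come from the spec; types and the primary key are assumptions.
DAILY_PRICES_DDL = """
CREATE TABLE IF NOT EXISTS daily_prices (
    ticker          TEXT NOT NULL,
    trade_date      TEXT NOT NULL,
    open            REAL,
    high            REAL,
    low             REAL,
    close           REAL,
    adj_close       REAL,
    volume          INTEGER,
    relative_volume REAL,
    rsi_14          REAL,
    macd            REAL,
    macd_signal     REAL,
    bb_upper        REAL,
    bb_lower        REAL,
    bb_pct          REAL,
    atr_14          REAL,
    PRIMARY KEY (ticker, trade_date)
)
"""


def init_daily_prices(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure daily_prices exists."""
    conn = sqlite3.connect(path)
    conn.execute(DAILY_PRICES_DDL)
    return conn


def upsert_bar(conn: sqlite3.Connection, ticker: str, trade_date: str,
               close: float, volume: int) -> None:
    """Insert or overwrite one bar so a re-run of the 5:00 PM job is idempotent."""
    conn.execute(
        "INSERT OR REPLACE INTO daily_prices (ticker, trade_date, close, volume) "
        "VALUES (?, ?, ?, ?)",
        (ticker, trade_date, close, volume),
    )
    conn.commit()


conn = init_daily_prices()
upsert_bar(conn, "AAPL", "2026-03-13", 200.0, 1000)  # illustrative values only
upsert_bar(conn, "AAPL", "2026-03-13", 201.0, 1100)  # same key, row is replaced
```

Keying on (ticker, trade_date) means a repeated ingestion replaces a bar instead of duplicating it.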

Why it matters: Foundation for ALL other signals. Every factor maps to "is this stock mispriced?" Relative volume spikes flag unusual activity before the cause is known. For prediction markets: current price trajectory directly determines probability of hitting Kalshi threshold prices.

2. Sector and Factor ETF Returns

What: Daily returns for all 11 GICS sectors and standard factor ETFs.

Sector ETFs (11): XLK, XLF, XLE, XLV, XLI, XLP, XLY, XLC, XLRE, XLB, XLU
Factor ETFs: MTUM, QUAL, VLUE, SIZE, USMV, IWD (Value), IWF (Growth)

Calculated metrics:

Source: yfinance (free). Also Fama-French factor data from Kenneth French Data Library (free CSVs, 1963-present). AQR factor datasets (free registration). Collection frequency: Daily at close. Monthly download of updated Fama-French/AQR data. Storage: SQLite table factor_returns + Parquet files for academic factor data.

Why it matters: Factor attribution on every trade reveals whether alpha came from stock selection or factor exposure. Regime-conditional factor performance guides position sizing.


SECTION B: SEC EDGAR PIPELINE (The Goldmine -- All Free)

SEC EDGAR is the single richest free data source for stock analysis. Base URL: https://efts.sec.gov/LATEST/. User-Agent REQUIRED: "CompanyName AdminEmail". Rate limit: 10 requests/second (use 5/sec conservatively).
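One way to honor the User-Agent requirement and the conservative 5 requests/second budget is a small client-side throttle. This is a sketch only: the Throttle class is hypothetical, the header value is the placeholder from the text above, and the clock and sleep functions are injectable purely so the logic can be tested without waiting.

```python
import time

# Placeholder from the spec; replace with the real company name and admin email.
EDGAR_HEADERS = {"User-Agent": "CompanyName AdminEmail"}


class Throttle:
    """Enforce a minimum gap between calls (hypothetical helper, not an SEC API)."""

    def __init__(self, max_per_second: float = 5.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


edgar_throttle = Throttle(max_per_second=5.0)
# Intended usage: call edgar_throttle.wait() before every EDGAR request and
# pass EDGAR_HEADERS with it.
```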

3. Form 4 -- Insider Trades

What: Corporate insiders (CEO, CFO, directors, 10%+ owners) must report stock purchases/sales within 2 business days of the trade.

What to extract per filing:

Source: SEC EDGAR. Filing search: https://efts.sec.gov/LATEST/search-index?forms=4&dateRange=custom&startdt={date}&enddt={date}. Parse XML from each filing. Collection frequency: Every 2 hours during market hours (9 AM - 6 PM ET). Most filings hit EDGAR between 4-6 PM ET. Storage: SQLite tables insider_trades + insider_network (person-CIK to company-CIK mapping for cross-company cluster detection).

Key signals:

Why it matters: For prediction markets, insider cluster buying before a Kalshi "stock above $X" expiry means the contract is underpriced. Cross-desk: Options desk shares Form 4 data for options flow signals.

4. Form 144 -- Pre-Sale Notice of Restricted Stock (Contrarian's Top Find)

What: Insiders must file Form 144 BEFORE selling restricted shares. This is filed days to weeks BEFORE the actual Form 4 (which reports the completed trade). Form 144 is the LEADING indicator of Form 4.

Source: SEC EDGAR, filing type "144". URL: https://efts.sec.gov/LATEST/search-index?forms=144&dateRange=custom&startdt={date}&enddt={date} Collection frequency: Daily scan. Storage: SQLite table form144_notices.

Key signal: Form 144 filing = selling is COMING but hasn't happened yet. You see it days/weeks before the market does. Compare volume and timing to subsequent Form 4 to validate the lag.

Why it matters: Almost nobody monitors Form 144 because the data is less structured than Form 4. An LLM can parse the text cheaply.

5. 13F Institutional Holdings (Quarterly)

What: Every institutional investment manager with >$100M AUM must file 13F-HR quarterly, disclosing all equity holdings.

What to extract:

Smart Money Cohort (track these specifically -- ~50 funds):

Source: SEC EDGAR. Filing index: https://efts.sec.gov/LATEST/search-index?forms=13F-HR&dateRange=custom. Parse infotable.xml for positions. Collection frequency: Quarterly. 13F due 45 days after quarter-end (Feb 14, May 15, Aug 14, Nov 14). Bulk parse within 24 hours of deadline. Also monitor daily for early filers (early filing = no positions they want to hide -- this IS signal). Storage: SQLite tables institutional_holdings + smart_money_signals (derived).

Key signals:

6. N-PORT Monthly Fund Holdings (3x Frequency of 13F)

What: Mutual fund monthly portfolio holdings. N-PORT filings are MONTHLY with 60-day public disclosure lag, vs 13F which is QUARTERLY with 45-day lag. N-PORT includes data 13F does NOT: securities lending income, total return swaps, CDS held, liquidity classification.

Source: SEC EDGAR, filing type "NPORT-P". URL: https://efts.sec.gov/LATEST/search-index?forms=NPORT-P Collection frequency: Monthly. Parse within 24 hours of new filing. Storage: SQLite table nport_holdings.

Key signals:

Why it matters: Almost nobody parses N-PORT because the XML is complex. Massive information advantage for those who do.

7. 13D/13G Activist + Large Position Disclosure

What: Both forms are triggered by crossing 5% beneficial ownership. 13D = filer intends to influence management (activist). 13G = passive large position.

What to extract:

Activist Quality Tier List:

Source: SEC EDGAR. https://efts.sec.gov/LATEST/search-index?forms=SC+13D,SC+13D/A,SC+13G Collection frequency: Daily scan at 6 PM ET. Storage: SQLite table activist_filings.

Key signals:

8. 10-K/10-Q Financials via XBRL

What: Quarterly and annual financial statements in machine-readable format.

What to extract:

Derived signals (calculate on ingestion):

Source: SEC EDGAR XBRL API. https://data.sec.gov/api/xbrl/companyfacts/CIK{cik_padded}.json returns ALL XBRL facts ever filed. Collection frequency: Parse within 24 hours of each filing. Cluster around earnings seasons. Storage: SQLite table fundamentals.
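Building the companyfacts URL is mostly a matter of padding the CIK. The sketch below assumes {cik_padded} means the CIK zero-padded to 10 digits; the function name is illustrative.

```python
def companyfacts_url(cik) -> str:
    """Return the XBRL companyfacts URL for a CIK given as int or string.

    Assumes the padded form is 10 digits with leading zeros.
    """
    cik_padded = str(int(cik)).zfill(10)
    return f"https://data.sec.gov/api/xbrl/companyfacts/CIK{cik_padded}.json"


# 320193 is Apple's CIK.
print(companyfacts_url(320193))
# https://data.sec.gov/api/xbrl/companyfacts/CIK0000320193.json
```

Passing the CIK through int() first means values that already carry leading zeros (for example "0000320193") produce the same URL.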

Why it matters: These fundamentals determine FAIR VALUE of "stock above $X" prediction market contracts.

9. DEF 14A Proxy Statements (Annual)

What: Executive compensation details, board composition, shareholder proposals.

What to extract:

Source: SEC EDGAR. Proxy season is March-May for most Fortune 500. Collection frequency: Daily during proxy season (Mar-Jun), weekly otherwise. Storage: SQLite table exec_compensation.

Key signal: Shift to equity-heavy + raised performance targets = management bullish. Option strike clustering = management defends that price level. For prediction markets: exec options at $300 strike and Kalshi "above $280" = management has financial incentive to keep stock above $300, making the $280 contract safer.

10. SEC Correspondence Letters (CORRESP) -- Contrarian's Biggest Find

What: When the SEC sends a company a letter asking questions about its filings, the correspondence is eventually made public. Companies receiving SEC comment letters underperform by ~8% over the next 12 months (academic research). Almost nobody monitors this systematically.

Source: SEC EDGAR, filing type "CORRESP". URL: https://efts.sec.gov/LATEST/search-index?forms=CORRESP Collection frequency: Daily scan. Storage: SQLite table sec_correspondence.

Key signal: New SEC correspondence on a stock = regulatory scrutiny = the SEC's accounting reviewers found something questionable. Short signal on a 3-12 month horizon.

11. Form 424B2 Structured Products (Shared with Options Desk)

What: Banks file 424B2 supplements when issuing structured products (autocallables, barrier notes) tied to specific stocks. Reveals which stocks banks are using as underliers, barrier/knockout levels, and notional amounts.

Source: SEC EDGAR, filing type "424B2". Collection frequency: Daily. Filter for equity-linked products. Start with top 5 issuers (Goldman, JPMorgan, Morgan Stanley, Citi, BofA) and SPY/QQQ underlyings. Storage: SQLite table structured_products (shared with Options desk).

Complexity note: Dense legal/financial text requiring NLP parsing. This is a Month 2-3 build item, not Week 1.

Why it matters: When $500M+ in autocallable notional concentrates at a specific barrier level, that barrier becomes a support/resistance zone due to dealer hedging.

12. Filing Frequency Scanner + Special Filing Types

What: Monitor unusual filing patterns that signal material events.

Filing types to watch:

Source: SEC EDGAR RSS feed + full-text search. Collection frequency: Hourly during market hours. Storage: SQLite table filing_alerts.


SECTION C: PREDICTION MARKET DATA (Where We Trade)

13. Kalshi Stock Event Contracts

What: All stock-related Kalshi binary contracts.

Contract types:

What to pull per contract:

Source: Kalshi API. https://api.elections.kalshi.com/trade-api/v2/markets?series_ticker=STOCK. Auth via RSA-PSS signing. Collection frequency: Every 30 minutes during market hours. Every 5 minutes for contracts expiring within 24 hours. Every 15 minutes for order book depth on active positions. Storage: SQLite table prediction_market_stocks.

14. Polymarket Stock Event Contracts

What: Same as Kalshi but on Polymarket. Different liquidity pool (crypto-native traders), so prices often diverge.

Additional Polymarket-specific data:

Source: Polymarket CLOB API: https://clob.polymarket.com/markets. Gamma API: https://gamma-api.polymarket.com/markets?tag=stocks. No auth for public data. Collection frequency: Every 30 minutes during market hours. Storage: SQLite table prediction_market_stocks (shared with Kalshi, platform column distinguishes).

Cross-platform arbitrage: When the same stock event exists on both Kalshi AND Polymarket, compare prices. If Kalshi "TSLA above $1000" is 35 cents but Polymarket prices the same event at 42 cents, one is wrong. Buy cheap, sell expensive -- but CHECK SETTLEMENT RULES, they often differ subtly.
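The TSLA example can be turned into a quick gross-edge check. This is a rough sketch under loud assumptions: both contracts settle identically, the NO side can be bought at 1 minus the YES price (no bid/ask spread), and fees are passed in by the caller. None of these hold automatically in practice.

```python
def gross_arb_edge(yes_price_a: float, yes_price_b: float,
                   fees_per_pair: float = 0.0) -> float:
    """Locked-in edge per $1 payout from buying YES on the cheaper venue and
    NO on the more expensive one. Prices are in dollars (0.35 = 35 cents).

    Assumes identical settlement rules and NO price = 1 - YES price.
    """
    cheap, rich = sorted((yes_price_a, yes_price_b))
    cost = cheap + (1.0 - rich)          # YES on cheap venue + NO on rich venue
    return 1.0 - cost - fees_per_pair    # exactly one leg pays out $1


# Kalshi at 35 cents, Polymarket at 42 cents, fees ignored.
print(round(gross_arb_edge(0.35, 0.42), 2))  # 0.07
```

A positive result is only a candidate; the settlement-rule check called out above still decides whether the two legs really offset.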

15. Binary Probability Calculator (The Bridge)

What: Math that converts traditional stock analysis into Kalshi/Polymarket-comparable probabilities. Uses options-implied distributions (d2 from Black-Scholes) or fundamental fair value models to calculate exact probability of a stock being above/below a specific price by a specific date.

Inputs: IV surface (from Options desk), risk-free rate (SOFR from FRED), time to expiration, current price, fundamental fair value estimate. Output: Probability per threshold per expiration, compared to prediction market prices. Collection frequency: Recalculated every time prediction market prices or model inputs update.

Why it matters: This is the core of the cross-venue edge. Without this bridge, we cannot compare our analysis to prediction market pricing.
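A minimal sketch of the options-implied leg of the calculator: the Black-Scholes N(d2) term is the risk-neutral probability that the stock finishes above the threshold. It assumes a single flat implied volatility (no skew), no dividends, and a continuously compounded rate; the function names are illustrative.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_above(spot: float, threshold: float, iv: float,
               rate: float, years: float) -> float:
    """Risk-neutral P(price > threshold at expiry) = N(d2).

    iv and rate are annualized decimals (0.40 = 40%), years is time to expiry.
    Assumes flat volatility and no dividends.
    """
    d2 = (math.log(spot / threshold) + (rate - 0.5 * iv ** 2) * years) / (
        iv * math.sqrt(years)
    )
    return norm_cdf(d2)


def edge_vs_market(model_prob: float, market_price_cents: float) -> float:
    """Model probability minus market-implied probability; positive = contract cheap."""
    return model_prob - market_price_cents / 100.0


p = prob_above(spot=100.0, threshold=100.0, iv=0.20, rate=0.0, years=1.0)
print(round(p, 4))  # 0.4602
```

With an IV surface, the same function would be called once per threshold and expiration using the volatility at that strike, which is closer to what the Inputs line describes.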


SECTION D: MACRO AND ECONOMIC DATA (FRED API -- Free)

16. FRED Economic Data Suite

What: Federal Reserve Economic Data -- free API for macro data that drives sector rotation, risk appetite, and market regime.

Source: FRED API (free, API key required -- free registration at https://fred.stlouisfed.org/docs/api/api_key.html). Rate limit: 120 requests/minute.

Series to pull:

Series ID      Description                         Frequency
DFF            Federal Funds Effective Rate        Daily
DGS10          10-Year Treasury Yield              Daily
DGS2           2-Year Treasury Yield               Daily
DGS30          30-Year Treasury Yield              Daily
T10Y2Y         10Y-2Y Yield Spread (curve)         Daily
T10Y3M         10Y-3M Yield Spread                 Daily
BAMLH0A0HYM2   ICE BofA HY OAS Spread              Daily
BAMLC0A0CM     ICE BofA IG OAS Spread              Daily
BAMLC0A4CBBB   BBB Corporate Bond Spread           Daily
DTWEXBGS       Trade-Weighted USD Index (Broad)    Daily
VIXCLS         CBOE VIX Close                      Daily
WALCL          Fed Balance Sheet (Total Assets)    Weekly
RRPONTSYD      Reverse Repo Facility Usage         Daily
WTREGEN        Treasury General Account (TGA)      Weekly
NFCI           Chicago Fed Financial Conditions    Weekly
TOTCI          C&I Loans (bank lending pulse)      Weekly
ICSA           Initial Jobless Claims              Weekly
CPIAUCSL       CPI-U (All Items)                   Monthly
PCE            Personal Consumption Expenditures   Monthly
PAYEMS         Total Nonfarm Payrolls              Monthly
UMCSENT        U Michigan Consumer Sentiment       Monthly
UNRATE         Unemployment Rate                   Monthly
USREC          NBER Recession Indicator            Monthly

Derived signals (calculate on ingestion):

Collection frequency: Daily at 7 AM ET for overnight releases. Weekly series checked on Fridays. Storage: SQLite table macro_data (series_id, date, value).

Why it matters: HY spread is THE leading indicator for sector rotation (tightening = cyclicals, widening = defensives). Credit spreads lead equity by 1-3 days. For prediction markets: macro regime determines whether "S&P above X" contracts are overpriced or underpriced.

17. Market Regime Classification

What: Daily classification of market environment into one of six regimes.

Regime rules (calculate daily):

Source: Calculated from FRED data + price data. Also: VIX3M, VIX6M from CBOE (free), MOVE index from FRED. Collection frequency: Daily at close. Storage: SQLite table market_regime (date, regime_label, confidence, vix, hy_spread, yield_curve, breadth, dispersion).

Why it matters: Every strategy must be tagged with which regimes it works in. Bull-market strategies fail in bear markets.


SECTION E: SHORT INTEREST AND DARK POOL DATA

18. FINRA Short Interest (Twice Monthly)

What: Total shares sold short per ticker.

What to collect:

Source: FINRA. https://www.finra.org/finra-data/browse-catalog/short-interest/data. Published twice a month (~15th and last business day), with ~10 day reporting lag. Collection frequency: Twice a month, day after FINRA release. Storage: SQLite table short_interest.

Key signals:

19. FINRA ATS (Dark Pool) Weekly Volume

What: Weekly share volume traded on each Alternative Trading System (dark pool) per ticker.

Source: FINRA ATS Transparency Data. https://ats-transparency.finra.org/otc/ats-nms-weekly-data. Published weekly, covers prior 4-week rolling period. Collection frequency: Weekly download every Monday morning. Storage: SQLite table dark_pool_volume (ticker, week_ending, ats_name, share_volume, trade_count, pct_of_total_volume).

Key signal: Dark pool activity ratio spike = institutional accumulation/distribution hidden from lit markets. Combined with insider trades (Form 4), confirms institutional accumulation thesis.

20. REG SHO Threshold List (Daily, Free)

What: Under SEC Regulation SHO, exchanges and FINRA publish a daily list of stocks with excessive fail-to-deliver (FTD) shares. These stocks have forced buy-in risk.

Source: SEC.gov + exchange websites (free). https://cdn.finra.org/equity/regsho/daily/ Collection frequency: Daily. Storage: SQLite table regsho_threshold.

Key signal: Consecutive days on threshold list + high short interest + rising call OI = squeeze setup. Cross-reference with FINRA SI data.

21. DTCC Equity Swap Reporting + OCC Stock Borrow Rates

What: Post-Archegos transparency rules require swap reporting. OCC publishes daily borrow rates.

Source: DTCC (free), OCC (theocc.com -- free daily rates). Collection frequency: Daily. Storage: SQLite table borrow_rates.

Key signal: Borrow rate > 5% = hard to borrow, put/call parity breaks, GEX calculations need adjustment. Borrow rate spike = short squeeze pressure building.


SECTION F: EARNINGS DATA

22. Earnings Calendar and Upcoming Dates

What: Next reporting date for all tracked tickers, plus before/after market flag.

Source: Yahoo Finance earnings calendar (free, scrapeable). Financial Modeling Prep (FMP) API free tier (250 requests/day). https://financialmodelingprep.com/api/v3/earning_calendar?from={date}&to={date}&apikey={KEY} Collection frequency: Daily update at 7 AM ET. Storage: SQLite table earnings_calendar.

23. Historical Earnings Surprise Database

What: Complete earnings history for prediction calibration and post-mortem analysis.

What to collect per earnings report:

Source: SEC EDGAR XBRL (actuals), FMP API (consensus estimates + surprise history). Initial backfill: last 5-10 years. Collection frequency: Daily during earnings season. Update reaction metrics at 1/5/10/20/60 day marks. Storage: SQLite table earnings_history.

Why it matters: Build per-stock "earnings personality" -- does this stock typically sell off on beats? Compare implied move vs actual move to find mispriced options/prediction market contracts.

24. Analyst Estimate Revisions

What: Consensus EPS/revenue estimates and revision trends.

What to track:

Source: FMP API (free tier). Backup: yfinance (ticker.recommendations). Collection frequency: Daily snapshot after market close. Storage: SQLite table estimate_revisions. Store every daily snapshot to reconstruct revision path.

Key signal: Strong revision momentum = leading indicator. Predictions against strong revision trends have lower hit rates. "Stale consensus" (no revision in 60+ days) = unreliable.

25. Earnings Call NLP Analysis

What: Natural language processing on quarterly earnings call transcripts.

NLP metrics to compute:

Source: Motley Fool transcripts (free, scrapeable). Seeking Alpha (limited free). Use nltk + textstat for readability, FinBERT for sentiment. Collection frequency: Daily during earnings season, weekly otherwise. Storage: SQLite table earnings_call_nlp.

Key signal: Fog Index increasing QoQ = management obfuscating. Uncertainty words rising = trouble ahead. Sentiment drift without fundamental change = narrative management.


SECTION G: CORPORATE EVENTS AND CALENDAR

26. IPO Lockup Expiration Calendar

What: When insider lockup periods expire on recently-IPO'd stocks, creating selling pressure.

What to collect:

Source: IPOScoop (https://www.iposcoop.com/ipo-lockup-expirations/), MarketBeat (https://www.marketbeat.com/ipos/lockup-expirations/). Cross-ref with SEC S-1 "Shares Eligible for Future Sale" section. Collection frequency: Weekly scan every Sunday. Flag lockups expiring within 30 days. Storage: SQLite table ipo_lockups.

Key signal: Stocks start underperforming 10-15 days BEFORE lockup expiry. Short 2 weeks early, cover 2-3 days after. VC-backed + high insider ownership (>40%) = most pressure. For prediction markets: Kalshi "above $X" contracts expiring near a lockup date are overpriced.

27. Index Reconstitution and Corporate Actions

What: S&P 500 additions/deletions, Russell rebalancing, stock splits, dividend ex-dates, spin-offs, M&A.

What to track:

Sources: S&P Dow Jones press releases (free), FTSE Russell, Yahoo Finance corporate actions, SEC Form 10-12B (spin-offs), Nasdaq dividend calendar. Collection frequency: Daily at 7 AM ET. Storage: SQLite tables corporate_events, sp500_changes, spinoffs.

Key signals:

28. Buyback Blackout Periods

What: Companies cannot repurchase shares during earnings blackout windows (typically 2 weeks before earnings through 2 days after). This removes a buyer from the market.

How to estimate: Earnings date minus 14 calendar days = blackout start. Earnings date + 2 business days = blackout end. Calculate % of S&P 500 market cap in blackout at any given time.
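The estimate above can be sketched as follows. Business days here skip weekends only; market holidays are ignored, which is an assumption of this sketch. The dictionaries of earnings dates and market caps are hypothetical inputs from the earnings calendar and price tables.

```python
from datetime import date, timedelta


def add_business_days(d: date, n: int) -> date:
    """Step forward n weekdays (weekends skipped, holidays NOT handled)."""
    while n > 0:
        d += timedelta(days=1)
        if d.weekday() < 5:  # Monday=0 ... Friday=4
            n -= 1
    return d


def blackout_window(earnings_date: date) -> tuple:
    """Blackout start = earnings date - 14 calendar days; end = + 2 business days."""
    return earnings_date - timedelta(days=14), add_business_days(earnings_date, 2)


def pct_cap_in_blackout(today: date, earnings_by_ticker: dict, cap_by_ticker: dict) -> float:
    """Share of total market cap (0-100) whose blackout window contains today."""
    total = sum(cap_by_ticker.values())
    in_blackout = 0.0
    for ticker, earnings_date in earnings_by_ticker.items():
        start, end = blackout_window(earnings_date)
        if start <= today <= end:
            in_blackout += cap_by_ticker[ticker]
    return 100.0 * in_blackout / total if total else 0.0


# Thursday 2026-04-30 earnings: window runs 2026-04-16 through Monday 2026-05-04.
print(blackout_window(date(2026, 4, 30)))
```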

Collection frequency: Calculated from earnings calendar. Storage: Column in daily state table.

Why it matters: When >50% of S&P 500 by market cap is in buyback blackout, a significant source of demand is removed. Downside risk increases.

29. Daily State Table

What: One row per trading day with every relevant boolean flag and event marker.

Schema includes:

Collection frequency: Generated daily at 6:00 AM ET (before market open). Storage: SQLite table daily_state.


SECTION H: SENTIMENT AND SOCIAL DATA

30. Reddit (4 Subreddits, Separate Treatment)

What: Retail sentiment from r/wallstreetbets, r/stocks, r/investing, r/options. Each has different signal characteristics.

r/wallstreetbets (WSB):

r/stocks:

r/investing:

r/options:

Source: Reddit API / PRAW (free). Rate limits: 60 requests/minute authenticated. Collection frequency: Every 30 minutes during market hours for WSB and r/options. Every 2 hours for r/stocks and r/investing. Daily aggregation at 10 PM ET. Storage: SQLite table reddit_sentiment.

31. X/Twitter Flow via Grok Search (Free)

What: Real-time market intelligence from financial X accounts.

Tier 1 -- Market Structure (check every 5 min): @unusual_whales, @spotgamma, @OptionsHawk, @WallStJesus, @SqueezeMetrics, @GarrettDeSimone
Tier 2 -- Macro & Fed (every 15 min): @NickTimiraos, @WalterBloomberg, @DeItaone, @Markets
Tier 3 -- Stock Analysis (every 30 min): @GaryBlack00, @DougKass, @elerianm
Tier 4 -- Retail Sentiment (every 60 min): @jimcramer (INVERSE CRAMER IS A REAL SIGNAL -- academic studies show his picks underperform)

Keyword groups to monitor:

Pattern detection (automated):

Source: Grok's built-in X search via xAI API (free, already have access). Zero additional cost. Collection frequency: Per-tier schedule above during market hours. Every 2 hours outside. Storage: SQLite table x_sentiment.

32. Congressional Trading Disclosures

What: Members of Congress must disclose stock trades within 45 days (STOCK Act). Their trades historically outperform the market.

Source: Capitol Trades (https://www.capitoltrades.com/) or Quiver Quant (https://www.quiverquant.com/congresstrading/). Free to scrape. Collection frequency: Daily. Storage: SQLite table congressional_trades.

Key signal: Multiple members of Congress buying the same stock = possible advance knowledge of upcoming legislation/regulation. Especially relevant for defense, pharma, and tech stocks.

33. Google Trends

What: Search interest for tracked tickers and company names.

What to track:

Source: pytrends Python library (free, unofficial but stable). Collection frequency: Daily (Google Trends has 1-day granularity for <90 day range). Storage: SQLite table google_trends.

Key signal: Search interest spike >3x 30-day average = retail attention surge. Bullish for 1-3 days, then contrarian (attention peak passed). Category/brand divergence >2 std dev from 12-month average = market share shift.
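The first rule above (spike >3x the 30-day average) can be sketched as a pure function over the stored daily values. The function name and the choice to return False when fewer than 30 days of history exist are assumptions of this sketch.

```python
def is_attention_spike(history: list, today_value: float,
                       multiple: float = 3.0, window: int = 30) -> bool:
    """True when today's search interest exceeds `multiple` x the trailing average.

    `history` holds prior daily values, oldest first, excluding today.
    Returns False if there is not yet a full window of history (assumption).
    """
    recent = history[-window:]
    if len(recent) < window:
        return False
    average = sum(recent) / len(recent)
    return average > 0 and today_value > multiple * average


baseline = [10.0] * 30                    # illustrative flat history
print(is_attention_spike(baseline, 35.0))  # True
print(is_attention_spike(baseline, 25.0))  # False
```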

34. Retail Flow Proxies

What: Replacement for Robinhood's defunct popularity API.

Sources (all free to scrape):

Collection frequency: Fidelity/Webull every 30 min during market hours. eToro/Schwab weekly. Stocktwits every 15 min. Storage: SQLite table retail_flow_proxies.

Derived signal -- Retail Consensus Indicator: Composite of Stocktwits (25%), WSB (25%), r/stocks (20%), Fidelity (15%), Google Trends (15%). Score >85 sustained for 3+ days = CONTRARIAN SELL. Score <15 sustained for 3+ days = CONTRARIAN BUY.
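A sketch of the composite and the sustained-extreme rule. It assumes each component has already been normalized to a 0-100 bullishness score, which the spec does not define; the dictionary keys are illustrative names.

```python
# Weights from the spec; the key names are illustrative.
RETAIL_WEIGHTS = {
    "stocktwits": 0.25,
    "wsb": 0.25,
    "r_stocks": 0.20,
    "fidelity": 0.15,
    "google_trends": 0.15,
}


def retail_consensus(components: dict) -> float:
    """Weighted composite of component scores, each assumed to be on 0-100."""
    return sum(weight * components[name] for name, weight in RETAIL_WEIGHTS.items())


def contrarian_signal(daily_scores: list, high: float = 85.0,
                      low: float = 15.0, min_days: int = 3):
    """Return a contrarian label when the last `min_days` scores are all extreme."""
    recent = daily_scores[-min_days:]
    if len(recent) < min_days:
        return None
    if all(score > high for score in recent):
        return "CONTRARIAN_SELL"
    if all(score < low for score in recent):
        return "CONTRARIAN_BUY"
    return None


print(contrarian_signal([70.0, 88.0, 90.0, 86.0]))  # CONTRARIAN_SELL
```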

35. Earnings Whisper / Social Consensus

What: The "real" expectation that traders have, which often differs from published consensus.

Sources:

Key signal: Company beats Street consensus but MISSES Social Whisper = negative reaction ("beat and dump"). Company misses Street but matches Social Whisper = muted reaction.

Collection frequency: Daily during earnings season for stocks reporting within 5 trading days.


SECTION I: ALTERNATIVE DATA (Free Sources)

36. USPTO Patent Data

What: Patent applications, grants, citations, abandonments, and assignments for tech companies.

What to track:

Source: USPTO PatentsView API (free, excellent, no key needed). https://api.patentsview.org/patents/query. Trademark: USPTO TSDR. Collection frequency: Weekly (patent data updates weekly, trademarks daily). Storage: SQLite table patent_citations + trademark_filings.

Why it matters: Patent abandonment in core technology = bearish for long-term moat. Assignment to NPE = company needs cash = distress signal.

37. GitHub Commit Velocity (Tech Companies)

What: For open-source-heavy tech companies (MSFT, GOOG, META, IBM, MDB, SNOW, DDOG), GitHub activity is a real-time proxy for developer ecosystem health.

What to track:

Source: GitHub API (free, 5000 requests/hour authenticated). https://api.github.com/ Collection frequency: Weekly. Storage: SQLite table github_activity.

Key signal: Commit velocity declining >30% over 3 months = engineering problems. Community contributor growth accelerating = ecosystem moat strengthening.

38. Job Posting Volume as Company Health Proxy

What: Number and type of job postings reveals company health and strategic direction.

Source: Direct scraping of company careers pages. Indeed RSS feeds. Revealera.com (free tracker). Collection frequency: Weekly snapshot of top 100 holdings' career pages. Storage: SQLite table job_postings.

Key signal: Posting volume drop >40% MoM = hiring freeze = operational problems incoming. Surge in "revenue" roles = expects growth. Surge in "cost" roles (compliance, legal, restructuring) = defensive posture.

39. App Store Rankings (Consumer Tech)

What: App store download rankings for consumer-facing tech companies.

Source: Various free app analytics sites. Apple App Store RSS feed. Collection frequency: Weekly. Storage: SQLite table app_rankings.

40. SaaS Pricing Page Monitoring

What: Track pricing page changes for SaaS companies. Price increases signal pricing power; new free tiers signal demand weakness.

Source: Wayback Machine API for historical (https://web.archive.org/web/timemap/). Direct page monitoring for real-time. Collection frequency: Weekly. Storage: SQLite table pricing_changes.

Key signal: Price increase = revenue acceleration coming (buy). New free tier / discount language = demand concern (avoid).


SECTION J: CROSS-ASSET AND INTERNATIONAL SIGNALS

41. Credit-Equity Divergence

What: Corporate bond spreads lead equity by 1-3 days because credit markets are smarter/faster.

Source: FRED API (free). HY OAS (BAMLH0A0HYM2), IG OAS (BAMLC0A0CM). Collection frequency: Daily.

Key signal: HY spreads (HY OAS from FRED, used here as a free proxy for CDX.HY) widening 2+ days while SPX flat = equity about to catch down. HY spreads tightening while SPX sells off = equity overreacting, buy the dip. ~70% directional accuracy.

42. Commodity-to-Sector Margin Mapping

What: Specific commodity price moves directly predict sector margins with 1-quarter lag.

Mappings:

Source: FRED commodity prices (free). https://fred.stlouisfed.org/categories/32217 Collection frequency: Daily. Storage: SQLite table commodity_sector_signals.

Key signal: Commodity moving >15% in 30 days = margin impact next quarter. When commodity rises but downstream stock prices do NOT adjust = market underpricing compression = short.

43. CFTC Commitments of Traders (Equity Index Futures)

What: Weekly positioning data showing dealer, leveraged fund, and asset manager positions in equity index futures.

Source: CFTC.gov (free CSV downloads). Collection frequency: Weekly (released Friday 3:30 PM ET, data from Tuesday -- 3 days stale). Storage: SQLite table cot_equity (shared with Futures desk).

44. International Leading Indicators

What: Non-US data points that predict US equity moves.

Sources (all free):

Collection frequency: Monthly for all. Check on release dates. Storage: SQLite table international_indicators.

45. FDIC Call Reports (Bank Stocks)

What: Quarterly bank financial data with 80+ fields. More granular than 10-K/10-Q for banks and available weeks earlier.

Source: FFIEC Central Data Repository (CDR), free. https://cdr.ffiec.gov/public/. FFIEC bulk download: https://www.ffiec.gov/npw/FinancialReport/DataDownload Collection frequency: Quarterly, available 30 days after quarter-end. Storage: SQLite table fdic_call_reports.

Key signal: Rising provision-to-loan ratio in CRE or C&I at multiple banks = systemic sector stress, 1-2 quarters before earnings impact.

46. Fed H.8 Report + Stress Tests

What: Weekly aggregate bank lending data. Annual stress test results determine bank capital return capacity.

Source: Federal Reserve. H.8: https://www.federalreserve.gov/releases/h8/ (weekly, Friday). FRED series TOTCI. Stress tests: https://www.federalreserve.gov/supervisionreg/stress-tests.htm (annual, June/July). Collection frequency: H.8 weekly. Stress tests annually.

Key signal: C&I loans declining 3+ weeks = businesses not borrowing = earnings recession coming. Banks with CET1 >12% + clean loan book = buyback announcement post-stress test = buy before June.


SECTION K: DERIVED AND COMPUTED DATABASES

These are not collected from external sources. They are COMPUTED from data above and our own prediction history.

47. Failed Trade Pattern Database

What: Structured post-mortem for every prediction that lost money.

Fields per failed trade:

Root cause categories (10):

  1. REGIME_CHANGE -- market/factor regime shifted after entry
  2. WRONG_DIRECTION -- thesis was simply wrong
  3. RIGHT_DIRECTION_WRONG_TIMING -- stock eventually did what we predicted, not in our window
  4. EXTERNAL_SHOCK -- unpredictable event
  5. CROWDED_TRADE -- too many on same side
  6. FACTOR_ROTATION -- our factor stopped working
  7. CORRELATION_BREAKDOWN -- portfolio correlations spiked
  8. SIZING_ERROR -- direction right, size too large, stopped out on noise
  9. INFORMATION_STALE -- data old, market already priced it in
  10. LIQUIDITY_TRAP -- couldn't exit at expected price

Storage: SQLite table failed_trade_patterns + aggregate stats by category/regime/factor/sector/month. Update: Within 24 hours of every settled trade.
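One way to keep the ten root cause categories consistent between code and storage is an enum plus a CHECK constraint, sketched below. Only the table name and the category names come from this spec; the columns (trade_id, ticker, pnl) and the helper functions are placeholder assumptions, since the per-trade field list is not reproduced here.

```python
import sqlite3
from enum import Enum


class RootCause(Enum):
    REGIME_CHANGE = 1
    WRONG_DIRECTION = 2
    RIGHT_DIRECTION_WRONG_TIMING = 3
    EXTERNAL_SHOCK = 4
    CROWDED_TRADE = 5
    FACTOR_ROTATION = 6
    CORRELATION_BREAKDOWN = 7
    SIZING_ERROR = 8
    INFORMATION_STALE = 9
    LIQUIDITY_TRAP = 10


def init_db(conn):
    # Column set is hypothetical; the CHECK rejects unknown categories.
    allowed = ", ".join(f"'{cause.name}'" for cause in RootCause)
    conn.execute(
        f"""CREATE TABLE IF NOT EXISTS failed_trade_patterns (
                trade_id   TEXT PRIMARY KEY,
                ticker     TEXT NOT NULL,
                root_cause TEXT NOT NULL CHECK (root_cause IN ({allowed})),
                pnl        REAL NOT NULL
            )"""
    )


def record_failure(conn, trade_id, ticker, root_cause, pnl):
    conn.execute(
        "INSERT INTO failed_trade_patterns VALUES (?, ?, ?, ?)",
        (trade_id, ticker, root_cause.name, pnl),
    )


def failures_by_cause(conn):
    """Aggregate count of failed trades per root cause category."""
    query = (
        "SELECT root_cause, COUNT(*) FROM failed_trade_patterns "
        "GROUP BY root_cause"
    )
    return dict(conn.execute(query))
```

The same GROUP BY pattern extends to the regime, factor, sector, and month aggregates once those columns exist.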

48. Prediction Calibration Database

What: Track calibration by confidence bucket. Are our 70% confidence predictions winning 70% of the time?

Storage: SQLite table prediction_calibration with Brier scores, by confidence bucket, by desk member, by regime, by time horizon.
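The calibration check can be sketched as below, assuming each settled prediction is reduced to a (stated probability, outcome) pair with outcome 1 for a win and 0 for a loss. The function names and the 10-point bucket width are assumptions; slicing by desk member, regime, or time horizon would apply the same functions to filtered subsets.

```python
from collections import defaultdict


def brier_score(predictions):
    """Mean squared error between stated probability and 0/1 outcome.
    Lower is better; always predicting 50% scores 0.25."""
    return sum((p - o) ** 2 for p, o in predictions) / len(predictions)


def calibration_by_bucket(predictions, width_pct=10):
    """Group predictions into confidence buckets and compare the average
    stated confidence with the realized win rate in each bucket."""
    buckets = defaultdict(list)
    top = 100 // width_pct - 1
    for p, o in predictions:
        # Work in whole percentage points to avoid float edge cases
        # such as 0.7 / 0.1 evaluating to 6.999...
        idx = min(round(p * 100) // width_pct, top)
        buckets[idx].append((p, o))

    report = {}
    for idx in sorted(buckets):
        items = buckets[idx]
        n = len(items)
        report[(idx * width_pct, (idx + 1) * width_pct)] = {
            "n": n,
            "mean_confidence": sum(p for p, _ in items) / n,
            "win_rate": sum(o for _, o in items) / n,
        }
    return report
```

A well-calibrated desk shows win_rate close to mean_confidence in every bucket; a persistent gap in one bucket points to systematic over- or under-confidence at that level.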

49. Factor Exposure Attribution

What: Decompose every trade's P&L into stock-specific alpha vs sector beta vs market beta vs factor exposure.

Source: Calculated from Fama-French factors + sector returns + individual stock returns. Storage: SQLite table factor_attribution.
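A simplified sketch of the decomposition for a single holding period, assuming the betas have already been estimated elsewhere (for example, by regressing the stock's history on the factor returns). The function name, the dictionary keys, and the choice to measure the sector component as sector return in excess of the market are assumptions; SMB and HML stand in for whichever Fama-French factors are used.

```python
def attribute_return(stock_ret, market_ret, sector_ret, factor_rets, betas):
    """Split one period's stock return into market, sector, factor, and
    stock-specific (alpha) components.

    factor_rets: e.g. {"SMB": 0.01, "HML": -0.005}
    betas: {"market": ..., "sector": ..., plus one entry per factor name}
    """
    market_part = betas["market"] * market_ret
    # Sector component is measured relative to the market to avoid
    # counting the market move twice.
    sector_part = betas["sector"] * (sector_ret - market_ret)
    factor_part = {name: betas[name] * ret for name, ret in factor_rets.items()}
    explained = market_part + sector_part + sum(factor_part.values())
    return {
        "market_beta": market_part,
        "sector_beta": sector_part,
        "factors": factor_part,
        "alpha": stock_ret - explained,  # residual: stock-specific return
    }
```

Multiplying each component by position size converts the return split into a P&L split for the factor_attribution table.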

50. Smart Money Convergence Score

What: Composite score combining insider buying (Form 4), institutional accumulation (13F), activist interest (13D), dark pool volume, and congressional trading into a single "smart money conviction" metric per ticker.

Storage: SQLite table smart_money_convergence.

51. Squeeze Probability Score

What: Composite (0-100) combining WSB mention velocity (20%), short interest + borrow fee (30%), X "squeeze" keyword frequency (15%), days to cover (15%), options skew inversion (10%), retail broker buy ratio (10%).

Update frequency: Calculated every 15 minutes during market hours. Storage: SQLite table squeeze_scores.

Signal thresholds: 0-30 = no squeeze risk. 30-60 = elevated. 60-80 = conditions forming. 80-100 = likely imminent. CRITICAL: once score >90 AND price spiked >30%, the squeeze is ENDING. Take profits.
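The weighting and thresholds above can be sketched as follows. The weights and band names come from this spec; the assumption that every component is first normalized to a 0-100 sub-score, the component key names, and the decision to assign boundary values (30, 60, 80) to the higher band are illustrative choices.

```python
# Weights in whole percentage points, as listed in the spec (sum = 100).
WEIGHTS = {
    "wsb_mention_velocity": 20,
    "short_interest_borrow_fee": 30,
    "x_squeeze_keywords": 15,
    "days_to_cover": 15,
    "options_skew_inversion": 10,
    "retail_buy_ratio": 10,
}


def squeeze_score(components):
    """Weighted composite (0-100). Each component is assumed to be
    pre-normalized to a 0-100 sub-score; values are clamped to that range."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    weighted = sum(
        WEIGHTS[name] * min(max(components[name], 0.0), 100.0)
        for name in WEIGHTS
    )
    return weighted / 100


def classify(score, price_change_pct=0.0):
    """Map a score to the spec's bands. The exit rule is checked first:
    score above 90 with a price spike above 30% means the squeeze is ending."""
    if score > 90 and price_change_pct > 30:
        return "ending_take_profits"
    if score >= 80:
        return "likely_imminent"
    if score >= 60:
        return "conditions_forming"
    if score >= 30:
        return "elevated"
    return "no_squeeze_risk"
```

Checking the exit rule before the bands matters: otherwise a score of 95 after a 35% spike would still read as "likely imminent" when the spec says to take profits.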


BUILD PRIORITY

Week 1 -- Foundation ($0)

Day 1-2: SEC EDGAR Form 4 pipeline (insider trades + network mapping). Python script, ~400 lines. Cron: daily at 7 PM ET.

Day 2-3: FRED macro collector (23 series, credit-equity divergence baseline, sector rotation signals). Python script, ~150 lines. Cron: daily at 6 PM ET.

Day 3-4: Price/volume collector (S&P 500 + top 200 Russell 2000 via yfinance batch download, compute technicals). Python script, ~300 lines. Cron: daily at 5 PM ET.

Day 4-5: 13F quarterly processor (bulk download, XML parse, smart money cohort tracking, overlap scores). Python script, ~500 lines. Cron: quarterly within 48h of deadline.

Day 5-7: Signal integration (accruals calculator, asset growth, net share issuance, insider cluster detection, filing timing anomaly, Kalshi/Polymarket scrapers). Python signal module, ~400 lines.

Week 1 total: ~1,750 lines of Python, 5 cron jobs, $0.

Week 2 -- Expanded Free Sources ($0)

Day 8-9: 13D activist tracker (daily EDGAR scan, parse intent language, alert system). ~200 lines.

Day 9-10: DEF 14A proxy analyzer (executive compensation changes, equity/cash ratio, option strikes). ~300 lines.

Day 10-11: FINRA short interest + dark pool (twice-monthly SI parser, weekly ATS data). ~200 lines.

Day 11-12: IPO lockup calendar (scrape IPOScoop/MarketBeat, cross-ref S-1). ~200 lines.

Day 12-13: Form 144 pre-sale notices (EDGAR scan, parse restricted stock notices). ~150 lines.

Day 13-14: N-PORT monthly fund holdings (XML parse, month-over-month comparison). ~300 lines.

Week 2 total: ~1,350 additional lines, 6 more cron jobs, $0.

Week 3 -- Sentiment, NLP, and Alternative Data ($0)

Day 15-16: Reddit sentiment pipeline (PRAW, 4 subreddits, WSB Momentum Score, FinBERT). ~400 lines.

Day 16-17: Earnings call NLP (transcript scraping from Motley Fool, Fog Index, uncertainty scoring, sentiment). ~350 lines.

Day 17-18: USPTO patent tracker (PatentsView API, citations, applications, abandonments). ~250 lines.

Day 18-19: Congressional trading scraper (Capitol Trades / Quiver Quant). ~150 lines.

Day 19-20: Google Trends pipeline (pytrends, category vs brand divergence). ~200 lines.

Day 20-21: REG SHO threshold list + Earnings calendar/history + daily state table. ~300 lines.

Week 3 total: ~1,650 lines, 6 more cron jobs, $0.

Month 2-3 -- Advanced ($29/month for Polygon)

  1. Polygon.io subscription ($29/month) -- replaces fragile Yahoo Finance dependency, adds real-time data, sweep detection
  2. SEC Form 424B2 structured product parser -- NLP for prospectus text (shared with Options desk)
  3. GitHub activity tracker for tech companies
  4. Job posting volume tracker (company careers page scraping)
  5. TRACE corporate bond data (FINRA, individual bond trades for single-name credit-equity divergence)
  6. International indicators (Korean exports, TSMC revenue, Japan machinery, China PMI)
  7. SaaS pricing page monitor (Wayback Machine + direct monitoring)
  8. FDIC call reports (bank-specific)
  9. Earnings Whisper / Estimize integration

What NOT to Build (Not Worth the Cost)


STORAGE ESTIMATES

Per-day estimates (full universe ~700 tickers):

Data Source                    Size/Day
daily_prices                   ~140 KB
insider_trades                 ~50 KB
macro_data                     ~2 KB
stock_signals (composite)      ~280 KB
earnings_call_nlp (seasonal)   ~10 KB
credit_equity_signals          ~6 KB
reddit_sentiment               ~20 KB
x_sentiment                    ~15 KB
prediction_markets             ~50 KB
collection_log                 ~3 KB

Daily total: ~575 KB/day
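As a quick arithmetic check, the per-table figures above can be summed directly. The dictionary simply restates the table; the function name is hypothetical.

```python
# Approximate per-day sizes in KB, copied from the table above.
DAILY_KB = {
    "daily_prices": 140,
    "insider_trades": 50,
    "macro_data": 2,
    "stock_signals": 280,
    "earnings_call_nlp": 10,
    "credit_equity_signals": 6,
    "reddit_sentiment": 20,
    "x_sentiment": 15,
    "prediction_markets": 50,
    "collection_log": 3,
}


def daily_total_kb(sizes=DAILY_KB):
    """Sum of the per-table daily estimates, in KB."""
    return sum(sizes.values())
```

The rows sum to 576 KB, consistent with the rounded ~575 KB/day figure.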

Per-quarter estimates:

Annual totals:

5 years of history: ~450 MB
With SQLite WAL + indexes: ~600 MB
With raw earnings transcripts (90-day retention): add ~500 MB/year

Total disk budget: ~3 GB (trivially small on 45GB disk with 15GB free)


COST SUMMARY

Period         Cost         What It Gets You
Weeks 1-3      $0           SEC EDGAR full pipeline, FRED macro, price/volume, Reddit, NLP, patents, Kalshi/Polymarket
Month 2+       $29/month    Polygon.io for real-time chains and reliable price data
Year 1 total   ~$290        Just Polygon after Month 2

Dependencies (Python, recommended for all collectors):


CROSS-DESK SYNERGIES

Shared with Options Desk

Shared with Futures Desk

Shared with Forex Desk

Shared with Crypto Desk

Shared with All Desks

Source: ~/.claude/projects/-home-ubuntu-edgeclaw/memory/stocks-desk-data-inventory.md