This is the complete data collection specification for the Stocks desk in the research pipeline. It covers equity trades on Robinhood AND event contracts traded on Kalshi/Polymarket. An AI builder should be able to read this and know exactly what data to collect, from where, how often, in what format, and why it matters for finding mispriced stocks or prediction market contracts.
Unlike sports desks that only trade prediction markets, the Stocks desk trades on THREE platforms: Robinhood (equities), Kalshi (event contracts), and Polymarket (event contracts).
The cross-venue edge: Fundamental analysis tells us a stock is mispriced on Robinhood. The same analysis also tells us whether a Kalshi/Polymarket contract is mispriced. Example: if our fundamental model says NVDA fair value is $950 and Kalshi prices "NVDA above $900 by April 30" at 55 cents, our model implies ~75% probability — that contract is cheap. We trade BOTH venues simultaneously.
An AI builder must understand these structural differences:
No Sharp External Benchmark — In sports, sharp bookmakers set fair lines. For stock events, there IS no sharp external reference. We must BUILD the fair value estimate from fundamentals, flow data, and volatility modeling. This makes the Stocks desk data collection broader and more complex than sports desks.
Continuous Price Discovery — Stocks trade every second, 6.5 hours per day. Data changes constantly. Sports bets lock in at game time. This means timing of data collection matters more.
Multi-Factor Valuation — A stock's fair value depends on earnings, balance sheet, macro regime, sector rotation, institutional positioning, insider behavior, credit markets, and sentiment simultaneously. No single data source is sufficient.
Prediction Market Carry Cost — Buying a Kalshi contract at 70 cents locks up 70 cents until settlement. At 5% risk-free rate, a 70-cent 30-day contract costs ~0.29 cents in opportunity cost. Prediction market prices should trade BELOW option-implied probabilities by this carry cost. If they trade ABOVE, the prediction market is overpriced.
Regime Dependence — Stock strategies that work in bull markets fail in bear markets. Every data source must be tagged with which market regime it applies to. Sports bets work the same regardless of market regime.
SEC Filings Are the Goldmine — The SEC requires public companies to disclose everything. This free government data is richer than any paid service for stocks. Sports have no equivalent.
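The carry-cost arithmetic from the Prediction Market Carry Cost item above can be sketched as follows. This is a minimal illustration, assuming simple interest on a 365-day year (which reproduces the ~0.29 cent figure); the function names are hypothetical and not part of any platform API.

```python
def carry_cost_cents(price_cents: float, annual_rate: float, days_to_settlement: int) -> float:
    """Opportunity cost, in cents, of locking up price_cents until settlement.

    Assumption: simple interest, actual/365 day count.
    """
    return price_cents * annual_rate * days_to_settlement / 365.0


def carry_adjusted_fair_price(option_implied_prob: float, annual_rate: float,
                              days_to_settlement: int) -> float:
    """Contract price (cents) at which price plus its own carry equals the
    option-implied payoff. Assumption: a $1 payout and the same simple-interest rule.
    """
    gross_cents = option_implied_prob * 100.0
    return gross_cents / (1.0 + annual_rate * days_to_settlement / 365.0)


if __name__ == "__main__":
    # The spec's example: 70-cent contract, 5% rate, 30 days to settlement.
    print(round(carry_cost_cents(70, 0.05, 30), 2))
    print(round(carry_adjusted_fair_price(0.70, 0.05, 30), 2))
```

A contract quoted above the option-implied probability is overpriced per the rule above; one quoted between the carry-adjusted price and the option-implied probability is merely not compensating for carry.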
What: Daily Open, High, Low, Close (adjusted for splits/dividends), and Volume for every stock we track.
Universe:
What to pull per ticker:
Calculated technicals (on ingestion):
RSI Divergence Detection (4 types — inherited from forex spec): Scan daily RSI (14) against price swings. All 4 types apply to stocks identically to forex:
What to store per divergence event:
- ticker: stock symbol
- timeframe: daily (primary), weekly (confirmation)
- divergence_type: "regular_bullish", "regular_bearish", "hidden_bullish", "hidden_bearish"
- timestamp: detection candle
- price_swing_1, price_swing_2: the two price swing points (level + timestamp)
- rsi_swing_1, rsi_swing_2: RSI values at each swing
- strength: slope difference between price and RSI swing lines (bigger = stronger signal)
- market_regime: current regime from Section K when divergence forms (bull/bear/correction/crisis/range/rotation)
- at_support_resistance: is price near a key level?
- sector_etf_divergence: does the sector ETF show the same divergence type? (confirms sector-wide move vs single-stock noise)
- earnings_within_5d: boolean -- divergence near earnings is unreliable (event risk overrides technical signals)

Outcome tracking:

- outcome_5d, outcome_10d, outcome_20d: price change after detection
- reversal_occurred: for regular divergence -- did price reverse? (boolean + magnitude)
- continuation_occurred: for hidden divergence -- did trend continue? (boolean + magnitude)

Storage: SQLite table rsi_divergences (shared schema with forex divergence_events table).
Collection frequency: Calculated daily at 5:00 PM ET on ingestion. Weekly divergences recalculated on Friday close.
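Once two consecutive swing points are identified, assigning the divergence_type label is a small comparison. The sketch below assumes the conventional definitions of the four types (regular = price makes a new extreme that RSI does not confirm; hidden = RSI makes a new extreme that price does not confirm). Swing detection itself is out of scope here, and the function name is hypothetical.

```python
from typing import Optional


def classify_divergence(swing_kind: str, price_1: float, price_2: float,
                        rsi_1: float, rsi_2: float) -> Optional[str]:
    """Label the divergence between two consecutive swings (1 = earlier, 2 = later).

    swing_kind is "low" when both points are swing lows and "high" when both
    are swing highs. Returns None when price and RSI agree (no divergence).
    """
    if swing_kind == "low":
        if price_2 < price_1 and rsi_2 > rsi_1:
            return "regular_bullish"   # lower low in price, higher low in RSI
        if price_2 > price_1 and rsi_2 < rsi_1:
            return "hidden_bullish"    # higher low in price, lower low in RSI
    elif swing_kind == "high":
        if price_2 > price_1 and rsi_2 < rsi_1:
            return "regular_bearish"   # higher high in price, lower high in RSI
        if price_2 < price_1 and rsi_2 > rsi_1:
            return "hidden_bearish"    # lower high in price, higher high in RSI
    else:
        raise ValueError("swing_kind must be 'low' or 'high'")
    return None
```

The returned strings match the divergence_type values stored in rsi_divergences, so the result can be written to the table directly.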
Source: Yahoo Finance via yfinance (free). Batch download handles 100+ tickers in one call. Rate limit ~2000/hr with rotating user agents. Add 0.5-1 second delay between individual calls.
Backup: Polygon.io free tier (5 calls/min, delayed), Alpha Vantage (25 calls/day free), Tiingo (1000 symbols free).
History: 10 years for S&P 500. Full available history for any traded stock.
Collection frequency: Daily at 5:00 PM ET (after market close + after-hours adjustments settle).
Storage: SQLite table daily_prices with columns (ticker, trade_date, open, high, low, close, adj_close, volume, relative_volume, rsi_14, macd, macd_signal, bb_upper, bb_lower, bb_pct, atr_14).
Why it matters: Foundation for ALL other signals. Every factor maps to "is this stock mispriced?" Relative volume spikes flag unusual activity before the cause is known. For prediction markets: current price trajectory directly determines probability of hitting Kalshi threshold prices.
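A possible shape for the daily_prices table, using the column list from the Storage line. The column types and the (ticker, trade_date) primary key are assumptions; the spec only names the columns. The helper name is hypothetical.

```python
import sqlite3

DAILY_PRICES_DDL = """
CREATE TABLE IF NOT EXISTS daily_prices (
    ticker          TEXT NOT NULL,
    trade_date      TEXT NOT NULL,   -- ISO date, e.g. 2024-01-02
    open            REAL,
    high            REAL,
    low             REAL,
    close           REAL,
    adj_close       REAL,
    volume          INTEGER,
    relative_volume REAL,
    rsi_14          REAL,
    macd            REAL,
    macd_signal     REAL,
    bb_upper        REAL,
    bb_lower        REAL,
    bb_pct          REAL,
    atr_14          REAL,
    PRIMARY KEY (ticker, trade_date)  -- assumption: one row per ticker per day
)
"""


def upsert_price(conn: sqlite3.Connection, row: dict) -> None:
    """Insert or overwrite one daily row.

    Note: INSERT OR REPLACE rewrites the whole row, so columns missing from
    `row` end up NULL. Pass the full record when re-ingesting a day.
    """
    cols = ", ".join(row)
    params = ", ".join(":" + key for key in row)
    conn.execute(f"INSERT OR REPLACE INTO daily_prices ({cols}) VALUES ({params})", row)
```

Keying on (ticker, trade_date) makes the 5:00 PM ET job safe to re-run: a second pass for the same day overwrites instead of duplicating.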
What: Daily returns for all 11 GICS sectors and standard factor ETFs.
Sector ETFs (11): XLK, XLF, XLE, XLV, XLI, XLP, XLY, XLC, XLRE, XLB, XLU
Factor ETFs: MTUM, QUAL, VLUE, SIZE, USMV, IWD (Value), IWF (Growth)
Calculated metrics:
Source: yfinance (free). Also Fama-French factor data from Kenneth French Data Library (free CSVs, 1963-present). AQR factor datasets (free registration).
Collection frequency: Daily at close. Monthly download of updated Fama-French/AQR data.
Storage: SQLite table factor_returns + Parquet files for academic factor data.
Why it matters: Factor attribution on every trade reveals whether alpha came from stock selection or factor exposure. Regime-conditional factor performance guides position sizing.
SEC EDGAR is the single richest free data source for stock analysis. Base URL: https://efts.sec.gov/LATEST/. User-Agent REQUIRED: "CompanyName AdminEmail". Rate limit: 10 requests/second (use 5/sec conservatively).
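Every EDGAR collector needs the same two things: the User-Agent header and a throttle held at the conservative 5 requests/second. A minimal sketch using only the standard library follows; the Throttle class and edgar_get helper are hypothetical names, and the header value is the placeholder from this spec, to be replaced with real contact details.

```python
import time
import urllib.request

# Placeholder from the spec; replace with the real company name and admin email.
EDGAR_HEADERS = {"User-Agent": "CompanyName AdminEmail"}


class Throttle:
    """Enforces a minimum gap between calls (5/sec -> 0.2 s between requests)."""

    def __init__(self, max_per_second: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock      # injectable so the throttle can be tested without waiting
        self._sleep = sleep
        self._last = None

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def edgar_get(url: str, throttle: Throttle, timeout: float = 30.0) -> bytes:
    """Fetch one EDGAR URL with the required header, respecting the throttle."""
    throttle.wait()
    request = urllib.request.Request(url, headers=EDGAR_HEADERS)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Sharing one Throttle(5) instance across all EDGAR jobs in a process keeps the combined request rate under the limit, not just the per-job rate.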
What: Corporate insiders (CEO, CFO, directors, 10%+ owners) must report stock purchases/sales within 2 business days of the trade.
What to extract per filing:
Source: SEC EDGAR. Filing search: https://efts.sec.gov/LATEST/search-index?forms=4&dateRange=custom&startdt={date}&enddt={date}. Parse XML from each filing.
Collection frequency: Every 2 hours from 9 AM to 6 PM ET (market hours plus the after-close filing window). Most filings hit EDGAR between 4-6 PM ET.
Storage: SQLite tables insider_trades + insider_network (person-CIK to company-CIK mapping for cross-company cluster detection).
Key signals:
Why it matters: For prediction markets, insider cluster buying before a Kalshi "stock above $X" expiry means the contract is underpriced. Cross-desk: Options desk shares Form 4 data for options flow signals.
What: Insiders must file Form 144 BEFORE selling restricted shares. This is filed days to weeks BEFORE the actual Form 4 (which reports the completed trade). Form 144 is the LEADING indicator of Form 4.
Source: SEC EDGAR, filing type "144". URL: https://efts.sec.gov/LATEST/search-index?forms=144&dateRange=custom&startdt={date}&enddt={date}
Collection frequency: Daily scan.
Storage: SQLite table form144_notices.
Key signal: Form 144 filing = selling is COMING but hasn't happened yet. You see it days/weeks before the market does. Compare volume and timing to subsequent Form 4 to validate the lag.
Why it matters: Almost nobody monitors Form 144 because the data is less structured than Form 4. An LLM can parse the text cheaply.
What: Every institutional investment manager with >$100M AUM must file 13F-HR quarterly, disclosing all equity holdings.
What to extract:
Smart Money Cohort (track these specifically -- ~50 funds):
Source: SEC EDGAR. Filing index: https://efts.sec.gov/LATEST/search-index?forms=13F-HR&dateRange=custom. Parse infotable.xml for positions.
Collection frequency: Quarterly. 13F due 45 days after quarter-end (Feb 14, May 15, Aug 14, Nov 14). Bulk parse within 24 hours of deadline. Also monitor daily for early filers (early filing = no positions they want to hide -- this IS signal).
Storage: SQLite tables institutional_holdings + smart_money_signals (derived).
Key signals:
What: Mutual fund monthly portfolio holdings. N-PORT filings are MONTHLY with 60-day public disclosure lag, vs 13F which is QUARTERLY with 45-day lag. N-PORT includes data 13F does NOT: securities lending income, total return swaps, CDS held, liquidity classification.
Source: SEC EDGAR, filing type "NPORT-P". URL: https://efts.sec.gov/LATEST/search-index?forms=NPORT-P
Collection frequency: Monthly. Parse within 24 hours of new filing.
Storage: SQLite table nport_holdings.
Key signals:
Why it matters: Almost nobody parses N-PORT because the XML is complex. Massive information advantage for those who do.
What: Both schedules are triggered by crossing 5% beneficial ownership. 13D = filer intends to influence management (activist). 13G = passive large position.
What to extract:
Activist Quality Tier List:
Source: SEC EDGAR. https://efts.sec.gov/LATEST/search-index?forms=SC+13D,SC+13D/A,SC+13G
Collection frequency: Daily scan at 6 PM ET.
Storage: SQLite table activist_filings.
Key signals:
What: Quarterly and annual financial statements in machine-readable format.
What to extract:
Derived signals (calculate on ingestion):
Source: SEC EDGAR XBRL API. https://data.sec.gov/api/xbrl/companyfacts/CIK{cik_padded}.json returns ALL XBRL facts ever filed.
Collection frequency: Parse within 24 hours of each filing. Cluster around earnings seasons.
Storage: SQLite table fundamentals.
Why it matters: These fundamentals determine FAIR VALUE of "stock above $X" prediction market contracts.
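A sketch of working with the companyfacts endpoint: building the URL with the CIK zero-padded to 10 digits, and pulling the most recent value of one concept out of the returned JSON. The nesting used here (facts -> taxonomy -> concept -> units -> unit -> list of entries with end, val, form, filed) is how the API lays out its response; the helper names and the choice of "latest by period end, then filing date" are assumptions.

```python
from typing import Optional


def companyfacts_url(cik: int) -> str:
    """Build the companyfacts URL; the CIK is zero-padded to 10 digits."""
    return f"https://data.sec.gov/api/xbrl/companyfacts/CIK{int(cik):010d}.json"


def latest_fact(companyfacts: dict, concept: str, taxonomy: str = "us-gaap",
                unit: str = "USD", form: Optional[str] = None) -> Optional[dict]:
    """Return the entry with the latest period end for one XBRL concept.

    `companyfacts` is the parsed JSON from the endpoint. Pass form="10-K" or
    form="10-Q" to restrict to annual or quarterly filings. Returns None when
    the company never reported the concept in that unit.
    """
    try:
        entries = companyfacts["facts"][taxonomy][concept]["units"][unit]
    except KeyError:
        return None
    if form is not None:
        entries = [e for e in entries if e.get("form") == form]
    if not entries:
        return None
    # ISO date strings sort chronologically, so plain string comparison is enough.
    return max(entries, key=lambda e: (e.get("end", ""), e.get("filed", "")))
```

Because one response holds every fact ever filed, a single request per ticker is enough to backfill the fundamentals table.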
What: Executive compensation details, board composition, shareholder proposals.
What to extract:
Source: SEC EDGAR. Proxy season is March-May for most Fortune 500.
Collection frequency: Daily during proxy season (Mar-Jun), weekly otherwise.
Storage: SQLite table exec_compensation.
Key signal: Shift to equity-heavy + raised performance targets = management bullish. Option strike clustering = management defends that price level. For prediction markets: exec options at $300 strike and Kalshi "above $280" = management has financial incentive to keep stock above $300, making the $280 contract safer.
What: When the SEC sends a company a letter asking questions about its filings, the correspondence is eventually made public. Companies receiving SEC comment letters underperform by ~8% over the next 12 months (academic research). Almost nobody monitors this systematically.
Source: SEC EDGAR, filing type "CORRESP". URL: https://efts.sec.gov/LATEST/search-index?forms=CORRESP
Collection frequency: Daily scan.
Storage: SQLite table sec_correspondence.
Key signal: New SEC correspondence on a stock = regulatory scrutiny = the SEC's accounting reviewers found something questionable. Short signal on a 3-12 month horizon.
What: Banks file 424B2 supplements when issuing structured products (autocallables, barrier notes) tied to specific stocks. Reveals which stocks banks are using as underliers, barrier/knockout levels, and notional amounts.
Source: SEC EDGAR, filing type "424B2".
Collection frequency: Daily. Filter for equity-linked products. Start with top 5 issuers (Goldman, JPMorgan, Morgan Stanley, Citi, BofA) and SPY/QQQ underlyings.
Storage: SQLite table structured_products (shared with Options desk).
Complexity note: Dense legal/financial text requiring NLP parsing. This is a Month 2-3 build item, not Week 1.
Why it matters: When $500M+ in autocallable notional concentrates at a specific barrier level, that barrier becomes a support/resistance zone due to dealer hedging.
What: Monitor unusual filing patterns that signal material events.
Filing types to watch:
Source: SEC EDGAR RSS feed + full-text search.
Collection frequency: Hourly during market hours.
Storage: SQLite table filing_alerts.
What: All stock-related Kalshi binary contracts.
Contract types:
What to pull per contract:
Source: Kalshi API. https://api.elections.kalshi.com/trade-api/v2/markets?series_ticker=STOCK. Auth via RSA-PSS signing.
Collection frequency: Every 30 minutes during market hours. Every 5 minutes for contracts expiring within 24 hours. Every 15 minutes for order book depth on active positions.
Storage: SQLite table prediction_market_stocks.
What: Same as Kalshi but on Polymarket. Different liquidity pool (crypto-native traders), so prices often diverge.
Additional Polymarket-specific data:
Source: Polymarket CLOB API: https://clob.polymarket.com/markets. Gamma API: https://gamma-api.polymarket.com/markets?tag=stocks. No auth for public data.
Collection frequency: Every 30 minutes during market hours.
Storage: SQLite table prediction_market_stocks (shared with Kalshi, platform column distinguishes).
Cross-platform arbitrage: When the same stock event exists on both Kalshi AND Polymarket, compare prices. If Kalshi "TSLA above $1000" is 35 cents but Polymarket prices the same event at 42 cents, one is wrong. Buy cheap, sell expensive -- but CHECK SETTLEMENT RULES, they often differ subtly.
What: Math that converts traditional stock analysis into Kalshi/Polymarket-comparable probabilities. Uses options-implied distributions (N(d2) from Black-Scholes, the risk-neutral probability of finishing above the threshold) or fundamental fair value models to calculate the probability of a stock being above/below a specific price by a specific date.
Inputs: IV surface (from Options desk), risk-free rate (SOFR from FRED), time to expiration, current price, fundamental fair value estimate.
Output: Probability per threshold per expiration, compared to prediction market prices.
Collection frequency: Recalculated every time prediction market prices or model inputs update.
Why it matters: This is the core of the cross-venue edge. Without this bridge, we cannot compare our analysis to prediction market pricing.
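The options-implied leg of the bridge reduces to the Black-Scholes N(d2) term. The sketch below assumes a single flat implied volatility for the threshold and expiry (in practice, read it off the IV surface at that strike), no dividends, and continuous compounding. N(d2) is a risk-neutral probability, not a real-world forecast. Function names are hypothetical.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_above(spot: float, threshold: float, iv: float,
               rate: float, years: float) -> float:
    """Risk-neutral probability that the stock finishes above `threshold`.

    d2 = [ln(S/K) + (r - sigma^2 / 2) * T] / (sigma * sqrt(T)); P = N(d2).
    """
    d2 = (math.log(spot / threshold) + (rate - 0.5 * iv ** 2) * years) / (iv * math.sqrt(years))
    return norm_cdf(d2)


def edge_cents(model_prob: float, market_price_cents: float) -> float:
    """Model probability (as cents per $1 payout) minus the contract's price."""
    return model_prob * 100.0 - market_price_cents


if __name__ == "__main__":
    # Illustrative inputs only: spot 950, threshold 900, 40% IV, 5% rate, 30 days.
    p = prob_above(950.0, 900.0, 0.40, 0.05, 30 / 365)
    print(f"P(above 900) = {p:.3f}, edge vs 55c = {edge_cents(p, 55):.1f}c")
```

A "below $X" contract is the complement, 1 - prob_above(...). The fundamental fair value model produces its own probability, which can be passed to edge_cents the same way.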
What: Federal Reserve Economic Data -- free API for macro data that drives sector rotation, risk appetite, and market regime.
Source: FRED API (free, API key required -- free registration at https://fred.stlouisfed.org/docs/api/api_key.html). Rate limit: 120 requests/minute.
Series to pull:
| Series ID | Description | Frequency |
|---|---|---|
| DFF | Federal Funds Effective Rate | Daily |
| DGS10 | 10-Year Treasury Yield | Daily |
| DGS2 | 2-Year Treasury Yield | Daily |
| DGS30 | 30-Year Treasury Yield | Daily |
| T10Y2Y | 10Y-2Y Yield Spread (curve) | Daily |
| T10Y3M | 10Y-3M Yield Spread | Daily |
| BAMLH0A0HYM2 | ICE BofA HY OAS Spread | Daily |
| BAMLC0A0CM | ICE BofA IG OAS Spread | Daily |
| BAMLC0A4CBBB | BBB Corporate Bond Spread | Daily |
| DTWEXBGS | Trade-Weighted USD Index (Broad) | Daily |
| VIXCLS | CBOE VIX Close | Daily |
| WALCL | Fed Balance Sheet (Total Assets) | Weekly |
| RRPONTSYD | Reverse Repo Facility Usage | Daily |
| WTREGEN | Treasury General Account (TGA) | Weekly |
| NFCI | Chicago Fed Financial Conditions | Weekly |
| TOTCI | C&I Loans (bank lending pulse) | Weekly |
| ICSA | Initial Jobless Claims | Weekly |
| CPIAUCSL | CPI-U (All Items) | Monthly |
| PCE | Personal Consumption Expenditures | Monthly |
| PAYEMS | Total Nonfarm Payrolls | Monthly |
| UMCSENT | U Michigan Consumer Sentiment | Monthly |
| UNRATE | Unemployment Rate | Monthly |
| USREC | NBER Recession Indicator | Monthly |
Derived signals (calculate on ingestion):
Collection frequency: Daily at 7 AM ET for overnight releases. Weekly series checked on Fridays.
Storage: SQLite table macro_data (series_id, date, value).
Why it matters: HY spread is THE leading indicator for sector rotation (tightening = cyclicals, widening = defensives). Credit spreads lead equity by 1-3 days. For prediction markets: macro regime determines whether "S&P above X" contracts are overpriced or underpriced.
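A sketch of the FRED ingestion path: building the series/observations request, dropping missing observations (FRED returns the string "." for them), and writing to macro_data. The (series_id, date) primary key is an assumption, and the helper names are hypothetical.

```python
import sqlite3
from urllib.parse import urlencode

FRED_OBS_URL = "https://api.stlouisfed.org/fred/series/observations"


def fred_observations_url(series_id: str, api_key: str, start: str = "") -> str:
    """Build the request URL for one series; `start` is an optional ISO date."""
    params = {"series_id": series_id, "api_key": api_key, "file_type": "json"}
    if start:
        params["observation_start"] = start
    return FRED_OBS_URL + "?" + urlencode(params)


def parse_observations(series_id: str, payload: dict) -> list:
    """Turn a parsed JSON response into (series_id, date, value) rows."""
    rows = []
    for obs in payload.get("observations", []):
        raw = obs.get("value", ".")
        if raw == ".":           # missing observation (holiday, not yet released)
            continue
        rows.append((series_id, obs["date"], float(raw)))
    return rows


def store_observations(conn: sqlite3.Connection, rows: list) -> None:
    """Upsert rows into macro_data, creating the table on first use."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS macro_data ("
        "series_id TEXT NOT NULL, date TEXT NOT NULL, value REAL, "
        "PRIMARY KEY (series_id, date))"   # assumption: one value per series per date
    )
    conn.executemany(
        "INSERT OR REPLACE INTO macro_data (series_id, date, value) VALUES (?, ?, ?)", rows
    )
    conn.commit()
```

Upserting rather than appending matters for FRED because monthly series such as PAYEMS are revised after first release.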
What: Daily classification of market environment into one of six regimes.
Regime rules (calculate daily):
Source: Calculated from FRED data + price data. Also: VIX3M, VIX6M from CBOE (free), MOVE index from FRED.
Collection frequency: Daily at close.
Storage: SQLite table market_regime (date, regime_label, confidence, vix, hy_spread, yield_curve, breadth, dispersion).
Why it matters: Every strategy must be tagged with which regimes it works in. Bull-market strategies fail in bear markets.
What: Total shares sold short per ticker.
What to collect:
Source: FINRA. https://www.finra.org/finra-data/browse-catalog/short-interest/data. Published twice a month (~15th and last business day), with ~10 day reporting lag.
Collection frequency: Twice a month, day after FINRA release.
Storage: SQLite table short_interest.
Key signals:
What: Weekly share volume traded on each Alternative Trading System (dark pool) per ticker.
Source: FINRA ATS Transparency Data. https://ats-transparency.finra.org/otc/ats-nms-weekly-data. Published weekly, covers prior 4-week rolling period.
Collection frequency: Weekly download every Monday morning.
Storage: SQLite table dark_pool_volume (ticker, week_ending, ats_name, share_volume, trade_count, pct_of_total_volume).
Key signal: Dark pool activity ratio spike = institutional accumulation/distribution hidden from lit markets. Combined with insider trades (Form 4), confirms institutional accumulation thesis.
What: SEC publishes daily list of stocks with excessive failed-to-deliver (FTD) shares. These stocks have forced buy-in risk.
Source: SEC.gov + exchange websites (free). https://cdn.finra.org/equity/regsho/daily/
Collection frequency: Daily.
Storage: SQLite table regsho_threshold.
Key signal: Consecutive days on threshold list + high short interest + rising call OI = squeeze setup. Cross-reference with FINRA SI data.
What: Post-Archegos transparency rules require swap reporting. OCC publishes daily borrow rates.
Source: DTCC (free), OCC (theocc.com -- free daily rates).
Collection frequency: Daily.
Storage: SQLite table borrow_rates.
Key signal: Borrow rate > 5% = hard to borrow, put/call parity breaks, GEX calculations need adjustment. Borrow rate spike = short squeeze pressure building.
What: Next reporting date for all tracked tickers, plus before/after market flag.
Source: Yahoo Finance earnings calendar (free, scrapeable). Financial Modeling Prep (FMP) API free tier (250 requests/day). https://financialmodelingprep.com/api/v3/earning_calendar?from={date}&to={date}&apikey={KEY}
Collection frequency: Daily update at 7 AM ET.
Storage: SQLite table earnings_calendar.
What: Complete earnings history for prediction calibration and post-mortem analysis.
What to collect per earnings report:
Source: SEC EDGAR XBRL (actuals), FMP API (consensus estimates + surprise history). Initial backfill: last 5-10 years.
Collection frequency: Daily during earnings season. Update reaction metrics at 1/5/10/20/60 day marks.
Storage: SQLite table earnings_history.
Why it matters: Build per-stock "earnings personality" -- does this stock typically sell off on beats? Compare implied move vs actual move to find mispriced options/prediction market contracts.
What: Consensus EPS/revenue estimates and revision trends.
What to track:
Source: FMP API (free tier). Backup: yfinance (ticker.recommendations).
Collection frequency: Daily snapshot after market close.
Storage: SQLite table estimate_revisions. Store every daily snapshot to reconstruct revision path.
Key signal: Strong revision momentum = leading indicator. Predictions against strong revision trends have lower hit rates. "Stale consensus" (no revision in 60+ days) = unreliable.
What: Natural language processing on quarterly earnings call transcripts.
NLP metrics to compute:
Source: Motley Fool transcripts (free, scrapeable). Seeking Alpha (limited free). Use nltk + textstat for readability, FinBERT for sentiment.
Collection frequency: Daily during earnings season, weekly otherwise.
Storage: SQLite table earnings_call_nlp.
Key signal: Fog Index increasing QoQ = management obfuscating. Uncertainty words rising = trouble ahead. Sentiment drift without fundamental change = narrative management.
What: When insider lockup periods expire on recently-IPO'd stocks, creating selling pressure.
What to collect:
Source: IPOScoop (https://www.iposcoop.com/ipo-lockup-expirations/), MarketBeat (https://www.marketbeat.com/ipos/lockup-expirations/). Cross-ref with SEC S-1 "Shares Eligible for Future Sale" section.
Collection frequency: Weekly scan every Sunday. Flag lockups expiring within 30 days.
Storage: SQLite table ipo_lockups.
Key signal: Stocks start underperforming 10-15 days BEFORE lockup expiry. Short 2 weeks early, cover 2-3 days after. VC-backed + high insider ownership (>40%) = most pressure. For prediction markets: Kalshi "above $X" contracts expiring near a lockup date are overpriced.
What: S&P 500 additions/deletions, Russell rebalancing, stock splits, dividend ex-dates, spin-offs, M&A.
What to track:
Sources: S&P Dow Jones press releases (free), FTSE Russell, Yahoo Finance corporate actions, SEC Form 10-12B (spin-offs), Nasdaq dividend calendar.
Collection frequency: Daily at 7 AM ET.
Storage: SQLite tables corporate_events, sp500_changes, spinoffs.
Key signals:
What: Companies cannot repurchase shares during earnings blackout windows (typically 2 weeks before earnings through 2 days after). This removes a buyer from the market.
How to estimate: Earnings date minus 14 calendar days = blackout start. Earnings date + 2 business days = blackout end. Calculate % of S&P 500 market cap in blackout at any given time.
Collection frequency: Calculated from earnings calendar.
Storage: Column in daily state table.
Why it matters: When >50% of S&P 500 by market cap is in buyback blackout, a significant source of demand is removed. Downside risk increases.
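The blackout estimate above can be sketched with the standard library alone. This version counts business days as Monday to Friday and ignores exchange holidays (an assumption; a trading calendar would be more accurate). The function names and the (market_cap, earnings_date) input shape are hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start: date, n: int) -> date:
    """Step forward n weekdays (Mon-Fri). Exchange holidays are NOT skipped."""
    current = start
    while n > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:
            n -= 1
    return current


def blackout_window(earnings_date: date) -> tuple:
    """Blackout start = earnings - 14 calendar days; end = earnings + 2 business days."""
    return earnings_date - timedelta(days=14), add_business_days(earnings_date, 2)


def pct_cap_in_blackout(as_of: date, universe) -> float:
    """Percent of total market cap in blackout on `as_of`.

    `universe` is an iterable of (market_cap, next_earnings_date) pairs.
    """
    total = 0.0
    blocked = 0.0
    for market_cap, earnings_date in universe:
        total += market_cap
        start, end = blackout_window(earnings_date)
        if start <= as_of <= end:
            blocked += market_cap
    return 100.0 * blocked / total if total else 0.0
```

The result feeds the daily state table directly; a value above 50 corresponds to the elevated downside-risk condition described above.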
What: One row per trading day with every relevant boolean flag and event marker.
Schema includes:
Collection frequency: Generated daily at 6:00 AM ET (before market open).
Storage: SQLite table daily_state.
What: Retail sentiment from r/wallstreetbets, r/stocks, r/investing, r/options. Each has different signal characteristics.
r/wallstreetbets (WSB):
r/stocks:
r/investing:
r/options:
Source: Reddit API / PRAW (free). Rate limits: 60 requests/minute authenticated.
Collection frequency: Every 30 minutes during market hours for WSB and r/options. Every 2 hours for r/stocks and r/investing. Daily aggregation at 10 PM ET.
Storage: SQLite table reddit_sentiment.
What: Real-time market intelligence from financial X accounts.
- Tier 1 -- Market Structure (check every 5 min): @unusual_whales, @spotgamma, @OptionsHawk, @WallStJesus, @SqueezeMetrics, @GarrettDeSimone
- Tier 2 -- Macro & Fed (every 15 min): @NickTimiraos, @WalterBloomberg, @DeItaone, @Markets
- Tier 3 -- Stock Analysis (every 30 min): @GaryBlack00, @DougKass, @elerianm
- Tier 4 -- Retail Sentiment (every 60 min): @jimcramer (INVERSE CRAMER IS A REAL SIGNAL -- academic studies show his picks underperform)
Keyword groups to monitor:
Pattern detection (automated):
Source: Grok's built-in X search via xAI API (free, already have access). Zero additional cost.
Collection frequency: Per-tier schedule above during market hours. Every 2 hours outside.
Storage: SQLite table x_sentiment.
What: Members of Congress must disclose stock trades within 45 days (STOCK Act). Their trades historically outperform the market.
Source: Capitol Trades (https://www.capitoltrades.com/) or Quiver Quant (https://www.quiverquant.com/congresstrading/). Free to scrape.
Collection frequency: Daily.
Storage: SQLite table congressional_trades.
Key signal: Multiple members of Congress buying the same stock = insider knowledge of upcoming legislation/regulation. Especially relevant for defense, pharma, and tech stocks.
What: Search interest for tracked tickers and company names.
What to track:
Source: pytrends Python library (free, unofficial but stable).
Collection frequency: Daily (Google Trends has 1-day granularity for <90 day range).
Storage: SQLite table google_trends.
Key signal: Search interest spike >3x 30-day average = retail attention surge. Bullish for 1-3 days, then contrarian (attention peak passed). Category/brand divergence >2 std dev from 12-month average = market share shift.
What: Replacement for Robinhood's defunct popularity API.
Sources (all free to scrape):
- eresearch.fidelity.com/eresearch/gotoBL/fidelityTopOrders.jhtml -- daily buy/sell ratios. Skews institutional.
- api.webull.com (undocumented API, ~60/min rate limit). Rising star count = retail buying.
- api.stocktwits.com/api/2/streams/symbol/{TICKER}.json (free, 200/hr). Bull/bear ratio >80% bullish = contrarian sell.

Collection frequency: Fidelity/Webull every 30 min during market hours. eToro/Schwab weekly. Stocktwits every 15 min.
Storage: SQLite table retail_flow_proxies.
Derived signal -- Retail Consensus Indicator: Composite of Stocktwits (25%), WSB (25%), r/stocks (20%), Fidelity (15%), Google Trends (15%). Score >85 sustained for 3+ days = CONTRARIAN SELL. Score <15 sustained for 3+ days = CONTRARIAN BUY.
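The Retail Consensus Indicator is a weighted sum plus a persistence check. The sketch below assumes each component has already been normalized to a 0-100 bullishness scale (the spec does not define the normalization), and the key and function names are hypothetical.

```python
from typing import Optional

# Weights from the spec; they sum to 1.0.
RCI_WEIGHTS = {
    "stocktwits": 0.25,
    "wsb": 0.25,
    "r_stocks": 0.20,
    "fidelity": 0.15,
    "google_trends": 0.15,
}


def retail_consensus(components: dict) -> float:
    """Weighted composite of five 0-100 component scores."""
    missing = RCI_WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(weight * components[name] for name, weight in RCI_WEIGHTS.items())


def contrarian_signal(daily_scores: list, min_days: int = 3) -> Optional[str]:
    """Check whether the most recent `min_days` scores all sit at an extreme.

    `daily_scores` is chronological, one composite score per day.
    """
    recent = daily_scores[-min_days:]
    if len(recent) < min_days:
        return None
    if all(score > 85 for score in recent):
        return "CONTRARIAN_SELL"
    if all(score < 15 for score in recent):
        return "CONTRARIAN_BUY"
    return None
```

Raising on a missing component, rather than silently re-weighting, keeps a failed scraper from shifting the score without anyone noticing.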
What: The "real" expectation that traders have, which often differs from published consensus.
Sources:
Key signal: Company beats Street consensus but MISSES Social Whisper = negative reaction ("beat and dump"). Company misses Street but matches Social Whisper = muted reaction.
Collection frequency: Daily during earnings season for stocks reporting within 5 trading days.
What: Patent applications, grants, citations, abandonments, and assignments for tech companies.
What to track:
Source: USPTO PatentsView API (free, excellent, no key needed). https://api.patentsview.org/patents/query. Trademark: USPTO TSDR.
Collection frequency: Weekly (patent data updates weekly, trademarks daily).
Storage: SQLite table patent_citations + trademark_filings.
Why it matters: Patent abandonment in core technology = bearish for long-term moat. Assignment to NPE = company needs cash = distress signal.
What: For open-source-heavy tech companies (MSFT, GOOG, META, IBM, MDB, SNOW, DDOG), GitHub activity is a real-time proxy for developer ecosystem health.
What to track:
Source: GitHub API (free, 5000 requests/hour authenticated). https://api.github.com/
Collection frequency: Weekly.
Storage: SQLite table github_activity.
Key signal: Commit velocity declining >30% over 3 months = engineering problems. Community contributor growth accelerating = ecosystem moat strengthening.
What: Number and type of job postings reveals company health and strategic direction.
Source: Direct scraping of company careers pages. Indeed RSS feeds. Revealera.com (free tracker).
Collection frequency: Weekly snapshot of top 100 holdings' career pages.
Storage: SQLite table job_postings.
Key signal: Posting volume drop >40% MoM = hiring freeze = operational problems incoming. Surge in "revenue" roles = expects growth. Surge in "cost" roles (compliance, legal, restructuring) = defensive posture.
What: App store download rankings for consumer-facing tech companies.
Source: Various free app analytics sites. Apple App Store RSS feed.
Collection frequency: Weekly.
Storage: SQLite table app_rankings.
What: Track pricing page changes for SaaS companies. Price increases signal pricing power; new free tiers signal demand weakness.
Source: Wayback Machine API for historical (https://web.archive.org/web/timemap/). Direct page monitoring for real-time.
Collection frequency: Weekly.
Storage: SQLite table pricing_changes.
Key signal: Price increase = revenue acceleration coming (buy). New free tier / discount language = demand concern (avoid).
What: Corporate bond spreads lead equity by 1-3 days because credit markets are smarter/faster.
Source: FRED API (free). HY OAS (BAMLH0A0HYM2), IG OAS (BAMLC0A0CM).
Collection frequency: Daily.
Key signal: CDX.HY (proxied here by the FRED HY OAS series) widening 2+ days while SPX flat = equity about to catch down. CDX.HY tightening while SPX sells off = equity overreacting, buy the dip. ~70% directional accuracy.
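The divergence rule can be sketched as a comparison of consecutive daily spread changes against the SPX return over the same window. The sketch uses the FRED HY OAS series as the spread input, since that is the only credit source listed. The 0.5% band that defines "flat" and "sells off" is an assumption, as are the function and label names.

```python
from typing import Optional


def credit_equity_signal(hy_spread: list, spx_close: list,
                         min_days: int = 2, flat_band: float = 0.005) -> Optional[str]:
    """Compare the last `min_days` daily spread moves with the SPX return.

    Both lists are chronological daily values aligned on the same dates.
    flat_band is the absolute SPX return treated as "flat" (assumed 0.5%).
    """
    if len(hy_spread) != len(spx_close) or len(hy_spread) < min_days + 1:
        return None
    window = hy_spread[-(min_days + 1):]
    changes = [later - earlier for earlier, later in zip(window, window[1:])]
    spx_return = spx_close[-1] / spx_close[-(min_days + 1)] - 1.0

    widening = all(change > 0 for change in changes)
    tightening = all(change < 0 for change in changes)

    if widening and abs(spx_return) <= flat_band:
        return "EQUITY_CATCH_DOWN"   # credit deteriorating, equity has not reacted
    if tightening and spx_return < -flat_band:
        return "BUY_THE_DIP"         # credit improving while equity sells off
    return None
```

Logging every emitted signal alongside the subsequent 1-3 day SPX return is what allows the ~70% accuracy figure to be checked against our own history.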
What: Specific commodity price moves directly predict sector margins with 1-quarter lag.
Mappings:
Source: FRED commodity prices (free). https://fred.stlouisfed.org/categories/32217
Collection frequency: Daily.
Storage: SQLite table commodity_sector_signals.
Key signal: Commodity moving >15% in 30 days = margin impact next quarter. When commodity rises but downstream stock prices do NOT adjust = market underpricing compression = short.
What: Weekly positioning data showing dealer, leveraged fund, and asset manager positions in equity index futures.
Source: CFTC.gov (free CSV downloads).
Collection frequency: Weekly (released Friday 3:30 PM ET, data from Tuesday -- 3 days stale).
Storage: SQLite table cot_equity (shared with Futures desk).
What: Non-US data points that predict US equity moves.
Sources (all free):
- https://investor.tsmc.com/english/monthly-revenue. Released ~10th of month. TSMC revenue up >20% YoY = chip demand boom.

Collection frequency: Monthly for all. Check on release dates.
Storage: SQLite table international_indicators.
What: Quarterly bank financial data with 80+ fields. More granular than 10-K/10-Q for banks and available weeks earlier.
Source: FFIEC Central Data Repository (CDR), free. https://cdr.ffiec.gov/public/. FFIEC bulk download: https://www.ffiec.gov/npw/FinancialReport/DataDownload
Collection frequency: Quarterly, available 30 days after quarter-end.
Storage: SQLite table fdic_call_reports.
Key signal: Rising provision-to-loan ratio in CRE or C&I at multiple banks = systemic sector stress, 1-2 quarters before earnings impact.
What: Weekly aggregate bank lending data. Annual stress test results determine bank capital return capacity.
Source: Federal Reserve. H.8: https://www.federalreserve.gov/releases/h8/ (weekly, Friday). FRED series TOTCI. Stress tests: https://www.federalreserve.gov/supervisionreg/stress-tests.htm (annual, June/July).
Collection frequency: H.8 weekly. Stress tests annually.
Key signal: C&I loans declining 3+ weeks = businesses not borrowing = earnings recession coming. Banks with CET1 >12% + clean loan book = buyback announcement post-stress test = buy before June.
These are not collected from external sources. They are COMPUTED from data above and our own prediction history.
What: Structured post-mortem for every prediction that lost money.
Fields per failed trade:
Root cause categories (10):
Storage: SQLite table failed_trade_patterns + aggregate stats by category/regime/factor/sector/month.
Update: Within 24 hours of every settled trade.
What: Track calibration by confidence bucket. Are our 70% confidence predictions winning 70% of the time?
Storage: SQLite table prediction_calibration with Brier scores, by confidence bucket, by desk member, by regime, by time horizon.
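Calibration tracking comes down to two computations: the Brier score over all settled predictions, and the hit rate per confidence bucket. The sketch below assumes ten-point buckets (70-79%, 80-89%, ...) and predictions given as (probability, outcome) pairs with outcome 1 for a win. The bucket width and function names are assumptions.

```python
from collections import defaultdict


def brier_score(predictions: list) -> float:
    """Mean squared error between stated probability and the 0/1 outcome."""
    return sum((prob - outcome) ** 2 for prob, outcome in predictions) / len(predictions)


def calibration_table(predictions: list) -> dict:
    """Group predictions into 10-point confidence buckets.

    Returns {bucket_floor_pct: {"n", "mean_confidence", "hit_rate"}}.
    Probabilities are rounded to the nearest whole percent before bucketing,
    which avoids float artifacts such as 0.7 / 0.1 evaluating just under 7.
    """
    buckets = defaultdict(list)
    for prob, outcome in predictions:
        floor_pct = min(int(round(prob * 100)) // 10, 9) * 10
        buckets[floor_pct].append((prob, outcome))

    table = {}
    for floor_pct in sorted(buckets):
        items = buckets[floor_pct]
        table[floor_pct] = {
            "n": len(items),
            "mean_confidence": sum(p for p, _ in items) / len(items),
            "hit_rate": sum(o for _, o in items) / len(items),
        }
    return table
```

A well-calibrated desk shows hit_rate close to mean_confidence in every bucket. Running the same functions on subsets (by desk member, regime, or time horizon) gives the breakdowns listed in the Storage line.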
What: Decompose every trade's P&L into stock-specific alpha vs sector beta vs market beta vs factor exposure.
Source: Calculated from Fama-French factors + sector returns + individual stock returns.
Storage: SQLite table factor_attribution.
What: Composite score combining insider buying (Form 4), institutional accumulation (13F), activist interest (13D), dark pool volume, and congressional trading into a single "smart money conviction" metric per ticker.
Storage: SQLite table smart_money_convergence.
What: Composite (0-100) combining WSB mention velocity (20%), short interest + borrow fee (30%), X "squeeze" keyword frequency (15%), days to cover (15%), options skew inversion (10%), retail broker buy ratio (10%).
Collection frequency: Calculated every 15 minutes during market hours.
Storage: SQLite table squeeze_scores.
Signal thresholds: 0-30 = no squeeze risk. 30-60 = elevated. 60-80 = conditions forming. 80-100 = likely imminent. CRITICAL: once score >90 AND price spiked >30%, the squeeze is ENDING. Take profits.
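The squeeze composite and its thresholds can be sketched as below. Each component is assumed to be pre-normalized to 0-100 (the spec does not define how). The ranges in the thresholds share their endpoints, so this sketch treats each lower bound as inclusive, which is also an assumption. Key, function, and label names are hypothetical.

```python
# Weights from the spec; they sum to 1.0.
SQUEEZE_WEIGHTS = {
    "wsb_mention_velocity": 0.20,
    "short_interest_borrow_fee": 0.30,
    "x_squeeze_keywords": 0.15,
    "days_to_cover": 0.15,
    "options_skew_inversion": 0.10,
    "retail_buy_ratio": 0.10,
}


def squeeze_score(components: dict) -> float:
    """Weighted 0-100 composite; raises KeyError if a component is missing."""
    return sum(weight * components[name] for name, weight in SQUEEZE_WEIGHTS.items())


def squeeze_label(score: float, price_spike_pct: float = 0.0) -> str:
    """Map a score to the spec's tiers, checking the exit condition first.

    price_spike_pct is the recent price run-up in percent.
    """
    if score > 90 and price_spike_pct > 30:
        return "ending_take_profits"   # the CRITICAL rule: squeeze is ending
    if score >= 80:
        return "likely_imminent"
    if score >= 60:
        return "conditions_forming"
    if score >= 30:
        return "elevated"
    return "no_squeeze_risk"
```

The exit rule is evaluated before the tier lookup so that a score above 90 with a >30% spike never reports as "likely_imminent".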
Day 1-2: SEC EDGAR Form 4 pipeline (insider trades + network mapping). Python script, ~400 lines. Cron: daily at 7 PM ET.
Day 2-3: FRED macro collector (23 series, credit-equity divergence baseline, sector rotation signals). Python script, ~150 lines. Cron: daily at 6 PM ET.
Day 3-4: Price/volume collector (S&P 500 + top 200 Russell 2000 via yfinance batch download, compute technicals). Python script, ~300 lines. Cron: daily at 5 PM ET.
Day 4-5: 13F quarterly processor (bulk download, XML parse, smart money cohort tracking, overlap scores). Python script, ~500 lines. Cron: quarterly within 48h of deadline.
Day 5-7: Signal integration (accruals calculator, asset growth, net share issuance, insider cluster detection, filing timing anomaly, Kalshi/Polymarket scrapers). Python signal module, ~400 lines.
Week 1 total: ~1,750 lines of Python, 5 cron jobs, $0.
Day 8-9: 13D activist tracker (daily EDGAR scan, parse intent language, alert system). ~200 lines.
Day 9-10: DEF 14A proxy analyzer (executive compensation changes, equity/cash ratio, option strikes). ~300 lines.
Day 10-11: FINRA short interest + dark pool (twice-monthly SI parser, weekly ATS data). ~200 lines.
Day 11-12: IPO lockup calendar (scrape IPOScoop/MarketBeat, cross-ref S-1). ~200 lines.
Day 12-13: Form 144 pre-sale notices (EDGAR scan, parse restricted stock notices). ~150 lines.
Day 13-14: N-PORT monthly fund holdings (XML parse, month-over-month comparison). ~300 lines.
Week 2 total: ~1,350 additional lines, 6 more cron jobs, $0.
Day 15-16: Reddit sentiment pipeline (PRAW, 4 subreddits, WSB Momentum Score, FinBERT). ~400 lines.
Day 16-17: Earnings call NLP (transcript scraping from Motley Fool, Fog Index, uncertainty scoring, sentiment). ~350 lines.
Day 17-18: USPTO patent tracker (PatentsView API, citations, applications, abandonments). ~250 lines.
Day 18-19: Congressional trading scraper (Capitol Trades / Quiver Quant). ~150 lines.
Day 19-20: Google Trends pipeline (pytrends, category vs brand divergence). ~200 lines.
Day 20-21: REG SHO threshold list + Earnings calendar/history + daily state table. ~300 lines.
Week 3 total: ~1,650 lines, 5 more cron jobs, $0.
Per-day estimates (full universe ~700 tickers):
| Data Source | Size/Day |
|---|---|
| daily_prices | ~140 KB |
| insider_trades | ~50 KB |
| macro_data | ~2 KB |
| stock_signals (composite) | ~280 KB |
| earnings_call_nlp (seasonal) | ~10 KB |
| credit_equity_signals | ~6 KB |
| reddit_sentiment | ~20 KB |
| x_sentiment | ~15 KB |
| prediction_markets | ~50 KB |
| collection_log | ~3 KB |
Daily total: ~575 KB/day
Per-quarter estimates:
Annual totals:
- 5 years of history: ~450 MB
- With SQLite WAL + indexes: ~600 MB
- With raw earnings transcripts (90-day retention): add ~500 MB/year
Total disk budget: ~3 GB (trivially small on 45GB disk with 15GB free)
| Period | Cost | What It Gets You |
|---|---|---|
| Weeks 1-3 | $0 | SEC EDGAR full pipeline, FRED macro, price/volume, Reddit, NLP, patents, Kalshi/Polymarket |
| Month 2+ | $29/month | Polygon.io for real-time chains and reliable price data |
| Year 1 total | ~$290 | Just Polygon after Month 2 |
Dependencies (Python, recommended for all collectors):