This is the complete data collection specification for the Futures desk in the research pipeline. It covers commodity, energy, metals, agriculture, and financial futures traded on prediction markets (Kalshi/Polymarket) and potentially direct futures exchanges (CME/NYMEX/ICE) later. An AI builder should be able to read this and know exactly what data to collect, from where, how often, and why.
Phase 1 (now): Trade prediction markets -- Kalshi and Polymarket offer binary contracts on commodity prices, economic data releases (CPI, NFP, GDP), Fed rate decisions, and Treasury yields. We collect data that tells us whether the market is mispricing these outcomes.
Phase 2 (later): Direct futures trading via a broker. The same data infrastructure powers both -- the only difference is execution venue.
The edge: Government agencies publish enormous amounts of free data that moves futures markets. Most retail traders watch headline numbers. We parse the sub-components, cross-reference positioning data (CFTC COT), track term structure shifts, and monitor alternative data (satellite, ship tracking) to identify mispricing hours to weeks before the market corrects.
An AI builder must understand these structural differences:
Physical Delivery Matters -- Unlike stocks, most commodity futures can be settled by actual physical delivery of barrels of oil, bushels of corn, or bars of gold. This creates real-world supply bottlenecks (warehouse stocks, pipeline capacity, port congestion) that directly affect prices. Inventory data is the single most important input.
Term Structure Is the Edge -- Futures trade in monthly contracts stretching years into the future. The shape of the price curve (contango = future months more expensive; backwardation = future months cheaper) reveals the market's supply/demand expectations. Changes in curve shape often signal price moves before the front-month price reacts.
Government Data Drives Everything -- Unlike stocks where earnings matter, commodity futures are driven by government reports: EIA petroleum inventories (Wednesdays), EIA natural gas storage (Thursdays), USDA WASDE crop reports (monthly), CFTC positioning data (Fridays), BLS inflation data (monthly). These are all free. The release schedule is the trading calendar.
Seasonality Is Structural -- Natural gas demand peaks in winter (heating) and summer (cooling). Crop prices follow planting and harvest cycles. Gasoline demand peaks in summer driving season. These patterns repeat every year with measurable hit rates and are baked into the curve shape.
Positioning Data Is Public -- The CFTC publishes weekly data showing exactly how commercial hedgers, swap dealers, and speculative funds are positioned. When speculators reach extreme positions (more than 2 standard deviations from their historical mean), reversals become much more likely. This data is free and has no equivalent in stock markets.
Cross-Commodity Relationships Are Tradeable -- Commodities exist in supply chains: crude oil becomes gasoline (crack spread), soybeans become meal and oil (crush spread), natural gas generates electricity (spark spread). These processing margins mean-revert. Ratio trades between related commodities (gold/silver, copper/gold, corn/wheat) have decades of statistical backing.
What: Every active Kalshi market related to commodities, economic data, and monetary policy. These are the contracts we actually trade.
Markets to track:
What to pull per market:
Source: Kalshi API (https://trading-api.kalshi.com/trade-api/v2)
Auth: RSA-PSS signed requests (existing Kalshi Edge Bot code handles this).
Collection frequency: Every 15 minutes during US market hours (9:30 AM - 4:00 PM ET). Hourly during off-hours. Snapshot at 8:00 AM ET (pre-market).
Why it matters: This is our execution venue. We need real-time pricing to compare against our fundamental analysis and find mispriced contracts.
What: Same categories as Kalshi. Different liquidity pool (crypto-native traders), so prices often diverge -- creating cross-venue arbitrage.
What to pull per market:
Source: Polymarket CLOB API (https://clob.polymarket.com/) -- free, no auth needed for read. Collection frequency: Same as Kalshi -- every 15 min market hours, hourly off-hours. Why it matters: Cross-venue price comparison. If Kalshi prices "CPI above 3.5%" at 40 cents and Polymarket prices it at 48 cents, one side is wrong. We buy the cheap side.
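The cross-venue comparison above can be sketched as a simple gap check. This is a minimal illustration, not production logic: the function name and the `min_gap` threshold (meant to cover fees and slippage) are assumptions, and it presumes both contracts resolve on the same outcome definition.

```python
def cross_venue_gap(kalshi_price, polymarket_price, min_gap=0.05):
    """Return (cheap_venue, gap) when the venues disagree by at least min_gap.

    Prices are in dollars per $1 payout (0.40 = 40 cents).
    min_gap is a hypothetical threshold, not a value from this spec.
    """
    gap = round(abs(kalshi_price - polymarket_price), 4)
    if gap < min_gap:
        return None  # difference too small to cover costs
    cheap = "kalshi" if kalshi_price < polymarket_price else "polymarket"
    return cheap, gap


# The CPI example from the text: 40 cents on Kalshi vs 48 cents on Polymarket.
print(cross_venue_gap(0.40, 0.48))
```

In practice the two venues often word their contracts differently (strike, resolution source, expiry), so a match on title alone is not enough to treat the gap as tradeable.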
What: The single most important weekly energy data release. Crude oil inventories, production, imports, and refinery utilization at the PADD (regional) level. Moves crude and product prices instantly on publication.
Source: EIA API (https://api.eia.gov/v2/)
API Key: Free -- register at https://www.eia.gov/opendata/register.php
Rate limit: 1000 requests/hour.
What to pull:
Key API endpoints:
GET /v2/petroleum/stoc/wstk/data/ with facets for product (EPC0=crude, EPM0=gasoline, EPD0=distillate) and duoarea (R10-R50 for PADDs, R60 for SPR)
GET /v2/petroleum/pnp/wiup/data/
GET /v2/petroleum/crd/crpdn/data/
GET /v2/petroleum/pri/spt/data/
Collection frequency: Weekly. Report released Wednesdays 10:30 AM ET (delayed 1 day if Monday holiday). Pull immediately on release. Store historical for 5-year seasonal comparison.
Why it matters: A crude inventory build of +5M barrels when the market expected a -2M barrel draw can move oil $2-3 in minutes. PADD-level granularity tells you WHERE the build/draw is (Cushing matters more than PADD 5). Refinery utilization signals product supply tightness.
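As a rough sketch of how the weekly stocks endpoint might be queried, the helper below only builds the request URL. The function name and the default `length` of 260 (about five years of weekly rows) are assumptions; the query parameter names (`facets[...][]`, `data[0]`, `sort[0][column]`) follow EIA v2 conventions and should be checked against the current API documentation before use.

```python
from urllib.parse import urlencode

EIA_BASE = "https://api.eia.gov/v2"


def eia_weekly_stocks_url(api_key, product="EPC0", duoarea="R10", length=260):
    """Build a URL for weekly petroleum stocks (hypothetical helper).

    product: EPC0=crude, EPM0=gasoline, EPD0=distillate
    duoarea: R10-R50 for PADDs, R60 for SPR
    length:  number of rows; 260 weekly rows is roughly 5 years (assumed default)
    """
    params = [
        ("api_key", api_key),
        ("frequency", "weekly"),
        ("data[0]", "value"),
        ("facets[product][]", product),
        ("facets[duoarea][]", duoarea),
        ("sort[0][column]", "period"),
        ("sort[0][direction]", "desc"),
        ("length", str(length)),
    ]
    return f"{EIA_BASE}/petroleum/stoc/wstk/data/?{urlencode(params)}"


print(eia_weekly_stocks_url("YOUR_KEY", product="EPC0", duoarea="R20"))
```

Building the URL separately from the HTTP call keeps the request easy to log and to count against the hourly rate limit.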
What: Weekly injection/withdrawal data for underground natural gas storage. The second most market-moving weekly report.
Source: EIA API -- GET /v2/natural-gas/stor/wkly/data/
What to pull:
Collection frequency: Weekly. Released Thursdays 10:30 AM ET. Why it matters: Natural gas is the most weather-sensitive commodity. A -150 bcf withdrawal when consensus is -120 bcf means heating demand exceeded expectations -- bullish. Storage levels vs the 5-year average tell you if the market has a supply cushion or is heading into a crisis.
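The surprise logic described above can be expressed in a few lines. This is a sketch under simple assumptions: storage changes are signed (withdrawals negative), consensus comes from some external survey not specified here, and the function names are illustrative.

```python
def storage_surprise(actual_bcf, consensus_bcf):
    """Compare the reported storage change with consensus.

    Both values are signed: withdrawals are negative, injections positive.
    Returns (surprise_bcf, bias).
    """
    surprise = actual_bcf - consensus_bcf
    if surprise < 0:
        bias = "bullish"   # larger draw or smaller injection than expected
    elif surprise > 0:
        bias = "bearish"   # smaller draw or larger injection than expected
    else:
        bias = "neutral"
    return surprise, bias


def pct_vs_five_year_avg(storage_bcf, five_year_avg_bcf):
    """Percent surplus (+) or deficit (-) of storage vs the 5-year average."""
    return 100.0 * (storage_bcf - five_year_avg_bcf) / five_year_avg_bcf


# The example from the text: -150 bcf actual vs -120 bcf consensus.
print(storage_surprise(-150, -120))
```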
What: Monthly 2-year forecast of US and global energy supply, demand, prices, and production.
Source: EIA website (https://www.eia.gov/outlooks/steo/) -- downloadable tables in Excel/CSV. What to pull:
Collection frequency: Monthly, released ~10th of each month. Why it matters: The STEO revision direction (not the level) is the signal. If EIA keeps revising production down month after month, it means supply is tighter than models predicted.
What: Monthly report on drilling efficiency and DUC (drilled but uncompleted) well count across 7 major US shale basins.
Source: EIA (https://www.eia.gov/petroleum/drilling/) -- downloadable spreadsheets. What to pull:
Collection frequency: Monthly. Why it matters: DUC count is a leading indicator of future production. If DUC inventory drops below ~4 months of completions, the industry cannot maintain production growth without adding rigs (which takes 6+ months to show up in output). This is a 3-6 month leading signal for supply tightness.
What: Weekly count of active drilling rigs in the US and Canada by basin and target (oil vs gas).
Source: Baker Hughes (https://rigcount.bakerhughes.com/) -- free weekly PDF/Excel, also available via Investing.com for historical data. What to pull:
Collection frequency: Weekly, released Fridays 1:00 PM ET. Why it matters: Rig count is a lagged indicator of production intentions. A sustained rig count decline while oil prices are stable means producers see lower future prices. Cross-reference with DUC count for the complete supply picture.
What: OPEC's own assessment of global supply, demand, and production compliance.
Source: OPEC website (https://www.opec.org/opec_web/en/publications/338.htm) -- free PDF/tables. What to pull:
Collection frequency: Monthly, typically released mid-month. Why it matters: OPEC controls ~30% of global oil supply. Their production compliance (or cheating) directly sets the supply/demand balance. The revision direction of their demand forecast signals whether they will extend, deepen, or unwind production cuts.
What: Processing margins that measure refinery profitability. Calculated from freely available price data.
Spreads to calculate:
Source: Calculate from EIA spot prices or Yahoo Finance futures (CL=F, RB=F, HO=F). What to store: Absolute value, Z-score vs 1/3/5-year history, percentile rank, seasonal adjustment. Collection frequency: Daily at market close. Why it matters: Crack spreads at extreme percentiles mean-revert. A gasoline crack at 95th percentile means refineries are printing money and will run harder (bearish gasoline, bullish crude demand). At 5th percentile, refineries cut runs (bullish gasoline, bearish crude). The 3-2-1 crack at >$40/bbl or <$10/bbl has historically been unsustainable.
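A minimal sketch of the 3-2-1 crack calculation and its percentile rank follows. It assumes RB and HO are quoted in $/gallon and CL in $/barrel (as on NYMEX), so product prices are converted at 42 gallons per barrel; the function names are illustrative, and the percentile definition (share of history at or below the value) is one of several reasonable conventions.

```python
GALLONS_PER_BARREL = 42


def crack_321(cl_usd_bbl, rb_usd_gal, ho_usd_gal):
    """3-2-1 crack spread in $/bbl: 2 parts gasoline + 1 part distillate vs 3 parts crude."""
    products_usd_bbl = (2 * rb_usd_gal + ho_usd_gal) * GALLONS_PER_BARREL / 3
    return products_usd_bbl - cl_usd_bbl


def percentile_rank(history, value):
    """Percent of historical observations at or below value."""
    at_or_below = sum(1 for x in history if x <= value)
    return 100.0 * at_or_below / len(history)


spread = crack_321(cl_usd_bbl=80.0, rb_usd_gal=2.50, ho_usd_gal=2.70)
print(round(spread, 2))
```

The same `percentile_rank` helper can be reused for any of the stored spread or ratio series once enough history is collected.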
What: Daily settlement prices for the three benchmark energy contracts.
Source: Yahoo Finance (CL=F, NG=F, BZ=F) for free delayed data. FRED for official daily spot (DCOILWTICO, DHHNGSP, DCOILBRENTEU). Collection frequency: Daily at market close. Store full history for seasonal analysis.
What: Exchange-monitored warehouse inventory levels for deliverable metals. The physical supply barometer.
What to pull per exchange:
Collection frequency: Daily for COMEX/LME. Weekly for SHFE. Why it matters: Warehouse stocks are the only transparent measure of physical supply. When COMEX registered gold drops sharply while price rises, physical demand is outpacing supply. LME cancelled warrants spiking = someone is aggressively pulling metal for physical use, not speculation.
What: How much gold central banks (especially China, India, Turkey, Poland) are buying or selling.
Source: World Gold Council (https://www.gold.org/goldhub/data/gold-reserves-by-country) -- quarterly report, free summary data. IMF IFS database (https://data.imf.org/) for monthly reserve changes. Collection frequency: Monthly (IMF data) and quarterly (WGC comprehensive). Why it matters: Central bank buying set records in 2022-2024. This is structural demand that sets a floor under gold prices. If PBOC (People's Bank of China) adds 30 tonnes in a month, that is not speculative -- it is a strategic reserve decision that will not reverse. Track cumulative purchases year-to-date vs prior years.
What: Key ratios between related metals that reveal macro regime and mean-revert.
Ratios to calculate daily:
Source: Yahoo Finance (GC=F, SI=F, PL=F, HG=F) or FRED (GOLDAMGBD228NLBM, SLVPRUSD). What to store: Ratio value, 20/60/200-day moving averages, Z-score vs 1/3/5-year history, percentile rank. Collection frequency: Daily at market close. Why it matters: Gold/silver ratio >85 has been a reliable silver buy signal for decades. Copper/gold ratio diverging from 10Y yields by >2 Z-scores means one market is wrong -- trade the convergence.
What: The price spread between Shanghai (SHFE) and London (LME) copper, adjusted for shipping, tariffs, and VAT.
Source: Calculate from SHFE and LME daily prices. Apply standard conversion (SHFE quotes in RMB/tonne, LME in USD/tonne). Use USD/CNY exchange rate for conversion. What to track:
Collection frequency: Daily. Why it matters: China consumes ~55% of global copper. The SHFE-LME arb window is the single best real-time indicator of Chinese copper demand. When the window opens (positive), copper prices globally tend to rise. The Contrarian panel flagged this as one of the highest-value free signals available.
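One way to sketch the arb-window calculation is to compare the SHFE price, converted to USD, with an estimated landed cost of LME metal. Everything beyond the two prices and the exchange rate is an assumption: the 13% VAT default, the zero import duty, and the freight/premium input are placeholders that should be replaced with verified figures.

```python
def shfe_lme_arb(shfe_rmb_per_t, lme_usd_per_t, usdcny,
                 vat_rate=0.13, import_duty=0.0, freight_premium_usd=0.0):
    """Import arb in USD/tonne. Positive = window open (importing is profitable).

    vat_rate, import_duty and freight_premium_usd are assumed inputs,
    not values taken from this spec.
    """
    landed_usd = (lme_usd_per_t + freight_premium_usd) * (1 + import_duty) * (1 + vat_rate)
    shfe_usd = shfe_rmb_per_t / usdcny  # SHFE quotes in RMB/tonne (VAT-inclusive)
    return shfe_usd - landed_usd


arb = shfe_lme_arb(79100, 9700, usdcny=7.0, freight_premium_usd=100)
print(round(arb, 2))  # positive means the import window is open
```

Tracking the sign and the day-over-day change of this value is usually more informative than its absolute level, since cost assumptions shift the whole series by a constant.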
What: World Agricultural Supply and Demand Estimates. The single most important monthly report for grain, oilseed, cotton, and livestock markets. Updated the ~10th of each month.
Source: USDA PSD Online API (https://apps.fas.usda.gov/PSDOnlineV2/api/) -- free, no key needed, returns JSON. Also downloadable from https://usda.library.cornell.edu/concern/publications/3t945q76s
What to pull per commodity (corn, soybeans, wheat, cotton, rice, sugar):
Collection frequency: Monthly, released ~10th of each month at 12:00 PM ET. Pull immediately on release. Store all historical estimates to calculate surprise magnitude. Why it matters: A 100M bushel cut to corn ending stocks can move corn futures 15-20 cents in seconds. The market trades the CHANGE in WASDE estimates, not the absolute level. If ending stocks are revised down 3 months in a row, the trend is tightening and the market has not fully priced it.
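The "revised down 3 months in a row" check can be sketched as a signed streak counter over stored monthly estimates. The function name and the sample values are illustrative only.

```python
def revision_streak(estimates):
    """Count consecutive same-direction revisions ending at the latest estimate.

    estimates: chronological list of monthly estimates for one line item
               (for example, corn ending stocks in million bushels).
    Returns a signed count: -3 means three straight downward revisions.
    """
    changes = [b - a for a, b in zip(estimates, estimates[1:])]
    streak = 0
    for change in reversed(changes):
        if change == 0:
            break
        direction = 1 if change > 0 else -1
        if streak != 0 and direction != (1 if streak > 0 else -1):
            break  # direction flipped, streak ends
        streak += direction
    return streak


# Hypothetical ending-stocks estimates over four monthly reports.
print(revision_streak([2100, 2050, 2000, 1900]))  # -3
```

The same counter applies to any report where the revision direction is the signal, as long as each monthly vintage is stored rather than overwritten.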
What: Weekly update on planting progress, crop condition ratings, and harvest progress. Published during the growing season (April-November).
Source: USDA NASS QuickStats API (https://quickstats.nass.usda.gov/api/api_GET/) -- free, requires API key (register at quickstats.nass.usda.gov). What to pull:
Collection frequency: Weekly, released Mondays 4:00 PM ET during growing season. Why it matters: A 5-percentage-point drop in corn Good/Excellent ratings in one week during pollination (July) can add $0.30-0.50/bushel. Condition ratings during critical growth windows (pollination for corn, pod-filling for soybeans) are the single best yield proxy before harvest.
What: Weekly data on US agricultural export commitments (sales) and actual shipments (inspections).
Source:
What to pull:
Collection frequency: Weekly (Thursday for sales, Monday for inspections). Why it matters: Export sales pace tells you if USDA's annual export forecast is too high or too low. If cumulative sales are running 20% ahead of last year at the same point, USDA will eventually raise the export estimate and lower ending stocks (bullish). Large China purchases move prices immediately.
What: Survey of farmer planting intentions for the upcoming growing season. Released once a year at the end of March.
Source: USDA NASS (https://usda.library.cornell.edu/concern/publications/x633f100h) and QuickStats API. What to pull:
Collection frequency: Annual (late March), with June Acreage update. Why it matters: This report sets the supply expectations for the entire marketing year. A surprise increase of 2M corn acres means ~300M more bushels of production, which can drop corn prices 5-10% in a day. The corn/soybean acreage split is driven by the fertilizer-to-grain price ratio -- track this leading indicator year-round.
What: Monthly inventory reports for livestock. Published ~20th of each month.
Source: USDA NASS via QuickStats API. What to pull:
Collection frequency: Monthly. Why it matters: Placements are the leading indicator. High placements now = heavy marketings (more supply) 4-6 months later = bearish live cattle futures. The breeding herd size sets the production trajectory for 12-18 months.
What: Monthly frozen meat and poultry inventory in US cold storage warehouses. The Contrarian panel's top agriculture find.
Source: USDA NASS (https://usda.library.cornell.edu/concern/publications/pg15bd892) and QuickStats API. What to pull:
Collection frequency: Monthly, released ~22nd. Why it matters: Almost nobody watches this report, but it is the best measure of protein surplus/deficit. Frozen stocks well below the 5-year average = the supply chain is running lean = any demand shock (e.g., China buying) will spike prices. Stocks well above average = hidden oversupply that headline live cattle/lean hog prices have not discounted.
What: South American production estimates from local government agencies. Brazil and Argentina are the #1 and #3 global soybean exporters.
Source:
What to pull:
Collection frequency: Monthly for Conab; weekly for Argentine exchanges during Nov-Apr growing season. Why it matters: Brazil produces ~150M tonnes of soybeans annually. A 10M tonne downward revision (due to drought or flooding) is equivalent to roughly 8-9% of US annual production. South American crop problems are the #1 driver of US grain price rallies. Track growing season weather in Mato Grosso (Brazil) and the Pampas (Argentina) via NOAA/NASA.
What: Results of US Treasury debt auctions. The bond market equivalent of earnings reports.
Source: Treasury Fiscal Data API (https://api.fiscaldata.treasury.gov/services/api/fiscal_service/v1/accounting/od/auctions_query) -- free, returns JSON. What to pull:
Collection frequency: On auction days (schedule known months in advance). 2Y/5Y auctions typically Tuesday/Wednesday, 10Y/30Y on Wednesday/Thursday. Why it matters: A weak 10Y auction (high tail, low bid-to-cover, low indirect %) can push yields up 5-10 bps in an afternoon and immediately reprice Kalshi rate markets. Three consecutive weak auctions at the same tenor = structural demand problem (bearish bonds, higher yields).
What: Daily yields across the full maturity curve.
Source: FRED API (https://api.stlouisfed.org/fred/series/observations) -- free, API key required (register at fred.stlouisfed.org). Rate limit: 120 requests/minute. FRED Series IDs:
Calculated metrics:
Collection frequency: Daily. Why it matters: Treasury yields are the gravitational field of all financial markets. Rising real yields = bearish gold, bearish growth stocks, bearish emerging markets. The shape of the curve (steepening vs flattening) determines which assets outperform.
What: The difference between nominal Treasury yields and TIPS (inflation-protected) yields. This IS the market's inflation expectation.
Source: FRED API -- T5YIE, T10YIE, T5YIFR. What to track:
Collection frequency: Daily. Why it matters: If 5Y breakevens are at 2.0% but trailing CPI is 3.5%, the market is betting inflation will drop sharply. If you disagree (because shelter CPI is sticky), Kalshi CPI contracts are mispriced. TIPS breakevens are the bridge between bond math and prediction market pricing.
What: The combined Treasury General Account balance and Reverse Repo Facility usage. Together, these measure how much cash is available to buy assets.
Source:
GET /v1/accounting/dts/dts_table_1 (Daily Treasury Statement, free)
What to calculate:
Collection frequency: Daily for RRP. Daily for TGA (from Daily Treasury Statement). Weekly for balance sheet (WALCL). Why it matters: Rising net liquidity = more cash chasing assets = bullish everything. Falling = tightening. The TGA drawdown in H2 2023 injected ~$800B in liquidity and powered the equity/commodity rally. Track upcoming Treasury refunding announcements for TGA trajectory.
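The spec does not spell out the net liquidity formula here, so the sketch below uses a widely quoted convention as an assumption: Fed balance sheet minus TGA minus RRP. All inputs are assumed to be in the same unit (billions of USD); the raw series come in different units and must be converted first.

```python
def net_liquidity(fed_balance_sheet_bn, tga_bn, rrp_bn):
    """Net liquidity in $bn under the assumed convention: balance sheet - TGA - RRP."""
    return fed_balance_sheet_bn - tga_bn - rrp_bn


def liquidity_change(prev, curr):
    """Change in net liquidity between two snapshots; positive = loosening."""
    return net_liquidity(*curr) - net_liquidity(*prev)


# Hypothetical snapshots: (balance sheet, TGA, RRP), all in $bn.
prev = (7700, 750, 1000)
curr = (7650, 700, 800)
print(liquidity_change(prev, curr))  # 200
```

Note that a falling balance sheet can coincide with rising net liquidity if TGA and RRP drain faster, which is why the three components are tracked separately.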
What: Market-implied probabilities of Fed rate changes at each upcoming FOMC meeting, derived from Fed Funds futures pricing.
Source: CME Group FedWatch tool (https://www.cmegroup.com/markets/interest-rates/cme-fedwatch-tool.html) -- free, scrapeable. Or calculate directly from 30-Day Fed Funds futures prices on FRED (FF1, FF2, etc.). What to track:
Collection frequency: Daily at market close. Extra snapshot after any major data release (CPI, NFP, GDP). Why it matters: FedWatch probabilities are the consensus. Kalshi rate contracts are the retail-weighted consensus. When they disagree by >5 percentage points on the same outcome, one side is wrong. We trade the gap.
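A small sketch of the two calculations implied above: the rate implied by a Fed Funds futures price (quoted as 100 minus the expected average rate for the contract month) and the gap check between FedWatch and Kalshi. The function names are illustrative, and the 5-point threshold is taken from the text.

```python
def implied_fed_funds_rate(futures_price):
    """Implied average fed funds rate (%) for the contract month."""
    return 100.0 - futures_price


def consensus_gap(fedwatch_prob, kalshi_prob, threshold=0.05):
    """Return Kalshi minus FedWatch probability if the gap exceeds threshold, else None.

    Probabilities are fractions (0.70 = 70%). A negative result means
    Kalshi prices the outcome lower than FedWatch does.
    """
    gap = kalshi_prob - fedwatch_prob
    return gap if abs(gap) > threshold else None


print(implied_fed_funds_rate(94.67))
print(consensus_gap(0.70, 0.62))
```

The gap is only meaningful when both probabilities refer to the same meeting and the same outcome bucket (for example, "cut of 25 bps or more").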
What: The Federal Reserve's System Open Market Account -- what the Fed actually owns on its balance sheet (Treasuries and MBS by maturity).
Source: NY Fed API (https://markets.newyorkfed.org/api/soma/summary.json and /soma/tsy/get/all/asof/{YYYY-MM-DD}.json) -- free. What to track:
Collection frequency: Weekly. Why it matters: The Fed is the largest single holder of Treasuries. Their QT (quantitative tightening) pace determines how much Treasury supply the private market must absorb. If they slow QT, it reduces supply pressure on bonds (bullish). If they pause QT entirely, it is a major signal.
What: Secured Overnight Financing Rate. The benchmark short-term interest rate that replaced LIBOR.
Source: FRED API -- series SOFR (daily). Also NY Fed API. Collection frequency: Daily. Why it matters: SOFR is the benchmark rate used to price most USD interest-rate derivatives. Spikes in SOFR relative to the Fed Funds target range signal funding stress in the repo market.
What: Settlement prices for the front 6-12 contract months of each major futures contract. This builds the "curve" that reveals the market's forward expectations.
Contracts to track:
Source: Yahoo Finance (symbol format: CLF26.NYM for Jan 2026 crude). Free, 15-min delay. Each contract month has a unique ticker. Upgrade path: Polygon.io ($29/month) or CME DataMine for official settlement data.
What to store per contract per day:
Collection frequency: Daily at market close (6:00 PM ET). Store historical for curve change analysis.
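Because each contract month has its own ticker, the symbols for the front of the curve can be generated from the standard futures month codes. This sketch follows the `CLF26.NYM` pattern given above; the function names are illustrative, the exchange suffix for other contracts is left as a parameter to be verified, and not every commodity lists all twelve months.

```python
# Standard futures month codes, January through December.
MONTH_CODES = "FGHJKMNQUVXZ"


def yahoo_contract_symbol(root, year, month, suffix="NYM"):
    """Build a symbol such as CLF26.NYM (root + month code + 2-digit year + exchange)."""
    return f"{root}{MONTH_CODES[month - 1]}{year % 100:02d}.{suffix}"


def front_months(root, start_year, start_month, count=12, suffix="NYM"):
    """List consecutive monthly contract symbols starting at the given month."""
    symbols = []
    year, month = start_year, start_month
    for _ in range(count):
        symbols.append(yahoo_contract_symbol(root, year, month, suffix))
        month += 1
        if month > 12:
            month, year = 1, year + 1
    return symbols


print(front_months("CL", 2025, 11, count=3))
# ['CLX25.NYM', 'CLZ25.NYM', 'CLF26.NYM']
```

For contracts with a restricted listing cycle, filter the generated list against the exchange's listed months before requesting data.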
What: Daily classification of each commodity's term structure shape and how it is changing.
Calculate from curve data:
Why it matters: A shift from contango to backwardation is one of the strongest bullish signals in commodities. It means near-term demand is outstripping supply. The reverse (backwardation to contango) signals demand weakening. Crude oil curve shape predicted the 2020 collapse (extreme contango = no one wants oil now) and the 2022 spike (extreme backwardation = everyone scrambling for immediate supply).
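The classification and regime-shift logic can be sketched as follows. The front-to-back slope measure and the 0.5% "flat" band are assumptions chosen for illustration; the spec does not fix either.

```python
def classify_curve(prices, flat_band_pct=0.5):
    """Classify curve shape from settlement prices ordered front month first.

    Returns (shape, slope_pct) where slope_pct is the back-month premium
    over the front month in percent. flat_band_pct is an assumed tolerance.
    """
    front, back = prices[0], prices[-1]
    slope_pct = 100.0 * (back - front) / front
    if slope_pct > flat_band_pct:
        shape = "contango"
    elif slope_pct < -flat_band_pct:
        shape = "backwardation"
    else:
        shape = "flat"
    return shape, slope_pct


def regime_shift(prev_shape, curr_shape):
    """Map a change in curve shape to the directional signal described in the text."""
    if (prev_shape, curr_shape) == ("contango", "backwardation"):
        return "bullish"
    if (prev_shape, curr_shape) == ("backwardation", "contango"):
        return "bearish"
    return None


print(classify_curve([80.0, 79.0, 78.0]))  # ('backwardation', -2.5)
```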
What: When each futures contract expires and rolls to the next month. Contracts stop trading at different dates.
Source: CME Group contract specifications pages (https://www.cmegroup.com/markets/energy/crude-oil/light-sweet-crude.contractSpecs.html) -- free. What to track:
Collection frequency: Static calendar, updated once per contract month. Volume crossover monitored daily during roll periods. Why it matters: Index fund rolls (Goldman Roll, GSCI, BCOM) create predictable selling pressure on the front month and buying pressure on the second month for 3-5 business days each month. This temporarily distorts the curve and creates short-term trading opportunities. The roll window is known months in advance.
What: The price difference between specific contract months for the same commodity.
Key spreads to track:
What to store: Spread value, Z-score vs 1/3/5-year history, percentile rank, seasonal pattern. Collection frequency: Daily. Why it matters: Calendar spreads at extreme percentiles mean-revert. A natgas Jan-Apr spread at the 95th percentile of its 5-year range = market pricing extreme winter tightness. If weather forecasts do not support it, sell the spread.
What: Weekly breakdown of futures positions by trader type across all major contracts. This is the most important positioning dataset in existence.
Trader categories:
Source: CFTC.gov -- Disaggregated Futures-Only report.
What to pull per commodity (15 key contracts: CL, BZ, NG, RB, HO, GC, SI, HG, ZC, ZS, ZW, SB, KC, ZN, ES):
Collection frequency: Weekly. Released Friday 3:30 PM ET. Data is as of prior Tuesday close (3-day lag).
What: Statistical context for raw COT positions. Raw numbers mean nothing without historical context.
Calculate for each trader category, each commodity:
Why it matters: Managed money at +2.5 sigma net long in crude oil = crowded long trade. If price fails to make new highs (momentum divergence), expect a liquidation-driven selloff. The conviction panel identified that managed money extremes WITHOUT dealer confirmation (dealers not moving the same direction) are the highest-probability reversal setups.
What: Monitor for swap dealer net position changing sign (from net long to net short, or vice versa).
Why it matters: Dealer flips are rare (happen maybe 2-4 times per year per commodity). When they occur, they signal a structural shift in the flow landscape. A dealer flip from net long to net short in crude oil, combined with commercial hedgers increasing shorts, = both the physical and financial smart money are bearish. This combination has historically preceded major trend changes by 2-6 weeks.
Detection rule: Swap dealer net position changes sign from prior week AND the absolute Z-score of the new position is >0.5 (not just noise around zero). Flag immediately in daily alert.
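The detection rule can be sketched directly. Two assumptions are made here: the Z-score threshold is read as an absolute value, and the Z-score is computed against a lookback window of prior weekly net positions supplied by the caller. The function names are illustrative.

```python
from statistics import mean, stdev


def zscore(history, value):
    """Z-score of value against a lookback window of prior observations."""
    sd = stdev(history)
    return 0.0 if sd == 0 else (value - mean(history)) / sd


def dealer_flip(prev_net, curr_net, history, min_abs_z=0.5):
    """True when net position changes sign and the new position is not noise near zero."""
    sign_changed = prev_net * curr_net < 0
    return sign_changed and abs(zscore(history, curr_net)) > min_abs_z


# Hypothetical weekly swap dealer net positions (thousands of contracts).
history = [-3, 2, -1, 4, -2, 1, 3, -4]
print(dealer_flip(prev_net=1, curr_net=-5, history=history))  # True
print(dealer_flip(prev_net=1, curr_net=-1, history=history))  # False
```

The same `zscore` helper serves the 1/3/5-year lookbacks by passing different history windows.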
What: Additional COT report formats that provide different angles on the same positioning data.
Collection frequency: Weekly (same release schedule as disaggregated). Why it matters: The TFF report is essential for Treasury futures positioning. Asset manager positioning in 10Y note futures (ZN) vs leveraged fund positioning tells you whether the bond move is driven by real money (more persistent) or leveraged speculation (more likely to reverse).
What: The Climate Prediction Center's probabilistic temperature and precipitation forecasts for the next 6-10 and 8-14 days.
Source: NOAA CPC (https://www.cpc.ncep.noaa.gov/products/predictions/)
Collection frequency: Updated daily by CPC. Why it matters: Natural gas demand is 70%+ weather-driven. A CPC forecast shift from "near normal" to "below normal" temperatures for the Midwest in winter can move natgas 5-10% because it means more heating demand. Agriculture depends on precipitation during critical growing windows (corn pollination in July needs adequate moisture).
What: Heating degree days (HDD) and cooling degree days (CDD), a quantitative measure of energy demand driven by temperature deviation from a 65°F baseline.
Source: NOAA (https://www.cpc.ncep.noaa.gov/products/analysis_monitoring/cdus/degree_days/) -- free. Also EIA includes degree day data in their weekly reports. What to track:
Collection frequency: Weekly (published with EIA gas storage). Daily forecasts from NWS. Why it matters: HDD/CDD data feeds directly into natural gas demand models. A winter running 10% above normal HDD = gas storage will deplete faster than expected = bullish gas prices. Cross-reference with EIA storage data to verify the demand signal.
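For reference, the standard degree-day arithmetic looks like this. The daily mean is taken as the average of the high and low, which is the usual convention; the function names are illustrative.

```python
BASE_TEMP_F = 65.0


def degree_days(daily_high_f, daily_low_f):
    """Return (HDD, CDD) for one day from the daily high and low in °F."""
    mean_temp = (daily_high_f + daily_low_f) / 2
    hdd = max(0.0, BASE_TEMP_F - mean_temp)  # heating demand when colder than 65°F
    cdd = max(0.0, mean_temp - BASE_TEMP_F)  # cooling demand when warmer than 65°F
    return hdd, cdd


def pct_vs_normal(actual_dd, normal_dd):
    """Percent deviation of cumulative degree days from the seasonal normal."""
    return 100.0 * (actual_dd - normal_dd) / normal_dd


print(degree_days(40, 20))        # (35.0, 0.0)
print(pct_vs_normal(1100, 1000))  # 10.0
```

For gas demand models, daily values are usually population-weighted or gas-weighted across regions before being summed; that weighting is outside this sketch.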
What: Weekly assessment of drought conditions across the continental US.
Source: https://droughtmonitor.unl.edu/DmData/DataDownload.aspx -- free CSV/Shapefile, updated weekly (released Thursdays, reflecting conditions through Tuesday). What to track:
Collection frequency: Weekly. Why it matters: Drought during critical growing periods (June-August for corn/soybeans) directly reduces yields. Mississippi River drought in fall causes barge freight costs to spike (bullish Gulf export basis, bearish interior farm prices). The 2012 drought pushed corn from $5 to $8/bushel.
What: ENSO (El Nino-Southern Oscillation) phase determines global weather patterns that affect agriculture, energy demand, and even hurricane activity for 6-18 months.
Source: NOAA CPC ENSO page (https://www.cpc.ncep.noaa.gov/products/analysis_monitoring/enso_advisory/) -- free. What to track:
Collection frequency: Monthly (CPC updates monthly, with weekly SST data). Why it matters: La Nina = drier conditions in Argentina/southern Brazil (bullish soybeans), colder US winters (bullish natgas), more Atlantic hurricanes (bullish Gulf energy). El Nino = opposite pattern. These effects have statistical backing across decades and inform seasonal positioning.
What: Tropical storm and hurricane positions, forecasts, and intensity for the Atlantic basin.
Source: NOAA National Hurricane Center (https://www.nhc.noaa.gov/gis/)
What to track:
Collection frequency: Every 6 hours during active storms. Daily scan during hurricane season (June 1 - November 30). Why it matters: A Category 3+ hurricane heading for the Gulf Coast can shut in 80%+ of Gulf oil production and force refinery evacuations. Gasoline crack spreads spike 200-300% during major Gulf storms. Even the forecast track (before landfall) moves prices.
What: Historical seasonal price patterns for each commodity, with statistical validation.
Calculate from 10-20 years of price data:
Collection frequency: Calculate once, update annually. Why it matters: Seasonal trades with hit rates >65% and positive Sharpe ratios are the baseline positioning. Layer fundamental data on top: a bullish seasonal pattern PLUS tight inventories PLUS extreme managed money short positioning = high-conviction trade.
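A minimal sketch of the seasonal screen follows. It assumes one return per year for the same calendar window, and it uses an unannualized mean-over-standard-deviation ratio as a stand-in for the Sharpe ratio; both choices, and the function names, are assumptions for illustration.

```python
from statistics import mean, stdev


def seasonal_stats(window_returns):
    """Return (hit_rate, return_to_risk) for one seasonal window.

    window_returns: one return per year for the same calendar window,
                    for example 0.05 for +5%.
    """
    hit_rate = sum(1 for r in window_returns if r > 0) / len(window_returns)
    sd = stdev(window_returns)
    ratio = mean(window_returns) / sd if sd > 0 else 0.0
    return hit_rate, ratio


def passes_screen(window_returns, min_hit_rate=0.65):
    """Apply the >65% hit rate and positive risk-adjusted return filter."""
    hit_rate, ratio = seasonal_stats(window_returns)
    return hit_rate > min_hit_rate and ratio > 0


# Hypothetical returns for one window over ten years.
returns = [0.05, 0.02, -0.01, 0.04, 0.03, -0.02, 0.06, 0.01, 0.02, 0.03]
print(seasonal_stats(returns)[0])  # 0.8
print(passes_screen(returns))      # True
```

With only 10-20 observations per window, hit rates carry wide uncertainty, so the screen is best treated as a filter for further fundamental confirmation rather than a standalone signal.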
What: Near-real-time fire/hotspot detection from satellite thermal sensors. Detects agricultural burning and wildfires that affect crop production.
Source: NASA FIRMS (https://firms.modaps.eosdis.nasa.gov/) -- free, near real-time (within 3 hours of satellite pass). API available. What to monitor:
Collection frequency: Daily during relevant seasons. Why it matters: Excessive burning in Brazil soy regions during Aug-Oct can signal either (a) aggressive clearing for next season's planting (bearish -- more area coming into production) or (b) drought-driven fires damaging existing crops (bullish). Cross-reference with Conab production estimates and drought data.
What: Mississippi River and tributary water levels at key gauges. The inland waterway system moves 60% of US grain exports.
Source: USGS real-time water data (https://waterdata.usgs.gov/nwis/rt) and Army Corps (https://rivergages.mvr.usace.army.mil/) -- free. Key gauges: Memphis, St. Louis, Vicksburg, Cairo, Thebes. What to track:
Collection frequency: Daily. Why it matters: Low Mississippi River levels in fall 2022 spiked barge freight from $15/ton to $100+/ton, effectively adding $1.00/bushel to the cost of moving grain to export. This shows up as a widening basis between Gulf export prices and interior farm prices. When river levels drop below 0 feet at Memphis, barge companies start draft restrictions.
These ratios are calculated daily from freely available price data and Z-scored against historical ranges.
| # | Ratio | Calculation | What It Means | Mean-Reversion Signal |
|---|---|---|---|---|
| 1 | Gold/Silver | GC / SI | Risk sentiment. >80 = recession fear, <60 = growth. | Z-score > 2 in either direction |
| 2 | Oil/NatGas | CL / NG | Energy substitution. ~6:1 = BTU parity (1 bbl of crude is about 5.8 MMBtu). | >25:1 = gas too cheap, <8:1 = gas too expensive |
| 3 | Copper/Gold | HG / GC | Global growth proxy. Tracks 10Y yields closely. | Divergence from 10Y yield > 2 Z-scores |
| 4 | Soybean/Corn | ZS / ZC | Planting decision driver. >2.5 = plant beans, <2.2 = plant corn. | Extreme readings in March signal acreage shift |
| 5 | Brent/WTI | BZ - CL | Logistics/geopolitics. Spread >$7 = WTI pipeline constrained. | Z-score > 2 from 3-year mean |
| 6 | Platinum/Gold | PL / GC | Industrial vs monetary. At historic lows. | Ratio < 0.5 = extreme long platinum reversion |
| 7 | Crack Spread | (2 x RB + HO) x 42 / 3 - CL (RB, HO in $/gal; CL in $/bbl) | Refinery margin. Mean ~$20. | >$40 or <$10 = unsustainable |
| 8 | Crush Spread | (ZM value + ZL value per bushel crushed) - ZS cost | Soybean processing margin. | Z-score > 2 = extreme |
| 9 | Spark Spread | Electricity price - (NG x heat rate) | Power plant profit from burning gas. | Regional; varies by market |
| 10 | Corn/Wheat | ZC / ZW | Feed grain substitution. | Corn >90% of wheat price = feed switching |
| 11 | Gold/Copper Ratio x DXY | (GC/HG) x DXY index | Combined risk sentiment filter. | Divergence from VIX = mispricing |
| 12 | Natgas/Crude BTU ratio | (NG x 6) / CL | Energy-equivalent pricing. | >0.5 = gas expensive, <0.25 = gas cheap vs oil |
Source for all: Yahoo Finance (GC=F, SI=F, CL=F, NG=F, HG=F, PL=F, ZC=F, ZS=F, ZW=F, RB=F, HO=F, BZ=F, ZM=F, ZL=F). FRED for DXY (DTWEXBGS). Collection frequency: Daily at market close. What to store: Ratio value, 20/60/200-day MAs, Z-score vs 1/3/5-year history, percentile rank.
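Most rows in the table are simple quotients, but the crush spread needs unit conversions. The sketch below uses the conventional board-crush yields (about 44 lb of meal and 11 lb of oil per 60 lb bushel), which are an assumption not stated in this spec. Inputs must be converted to the units shown; data feeds often quote soybeans in cents per bushel and soybean oil in cents per pound.

```python
MEAL_TONS_PER_BUSHEL = 0.022  # 44 lb of meal per bushel / 2000 lb per short ton
OIL_LB_PER_BUSHEL = 11        # pounds of oil per bushel


def board_crush(zs_usd_bu, zm_usd_short_ton, zl_usd_lb):
    """Gross processing margin in $/bushel under the assumed yields."""
    meal_value = zm_usd_short_ton * MEAL_TONS_PER_BUSHEL
    oil_value = zl_usd_lb * OIL_LB_PER_BUSHEL
    return meal_value + oil_value - zs_usd_bu


# Hypothetical prices: soybeans $12.00/bu, meal $350/short ton, oil $0.45/lb.
print(round(board_crush(12.00, 350.0, 0.45), 2))  # 0.65
```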
What: Commodities priced in USD have mechanical relationships with exporter currencies.
| Currency Pair | Commodity Link | Why |
|---|---|---|
| AUD/USD | Copper, Iron Ore | Australia is a major copper/iron ore exporter |
| USD/CAD | WTI Crude | Canada's economy is oil-dependent |
| BRL/USD | Soybeans, Coffee | Brazil is the #1 exporter of both |
| NOK/USD | Brent Crude | Norway's sovereign wealth fund is oil-funded |
| DXY (broad dollar) | Commodity basket | Inverse correlation: strong dollar = cheap commodities |
Source: FRED API for exchange rates (DEXUSAL, DEXCAUS, DEXBZUS, DEXNOUS, DTWEXBGS). What to calculate: 60-day rolling correlation between each currency pair and its linked commodity. Flag when the correlation breaks (e.g., AUD weakening while copper strengthens = divergence, someone is wrong). Collection frequency: Daily.
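A dependency-free sketch of the rolling correlation follows. It assumes the two input series are aligned by date and are daily returns rather than price levels (correlating levels tends to overstate the relationship); the function names are illustrative. Note also that the quote direction of each exchange-rate series (foreign currency per USD or USD per foreign currency) flips the sign of the correlation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # undefined for a constant series; treated as zero here
    return cov / sqrt(vx * vy)


def rolling_correlation(xs, ys, window=60):
    """Correlation over each trailing window; one value per day once the window is full."""
    return [
        pearson(xs[i - window:i], ys[i - window:i])
        for i in range(window, len(xs) + 1)
    ]


# Tiny illustrative series with a 3-day window.
fx = [1, 2, 3, 4, 5]
commodity = [2, 4, 6, 8, 10]
print(rolling_correlation(fx, commodity, window=3))
```

A "break" can then be flagged when the latest value departs from its own recent average by more than a chosen tolerance.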
What: Inter-commodity relationships driven by physical supply chains.
Chains to monitor:
Collection frequency: Weekly assessment of chain health using already-collected data. Why it matters: When one link in the chain moves but the connected commodities lag, the lagging commodity will catch up (or the leading one will revert). These are the highest-conviction spread trades.
What: China's official manufacturing Purchasing Managers' Index, released on the last day of each month for the current month. The earliest monthly indicator of Chinese industrial activity.
Source: National Bureau of Statistics of China (http://www.stats.gov.cn/english/) -- free, English version. Also: Trading Economics, Investing.com for parsed data. What to track:
Collection frequency: Monthly, last day of month. Why it matters: China consumes ~55% of global copper, ~50% of iron ore, ~20% of crude oil. A PMI move from 49.5 to 50.5 (crossing the expansion threshold) triggers copper rallies of 3-5% within a week. The new orders sub-index leads the headline by 1-2 months.
What: General Administration of Customs of China publishes monthly trade data including commodity import volumes.
Source: GACC (http://english.customs.gov.cn/) -- free, English version. Also available via Trading Economics. What to track:
Collection frequency: Monthly (released ~15th of following month). Why it matters: Chinese crude oil imports hit record levels in 2023-2024. Monthly import data tells you actual Chinese demand (not self-reported PMI surveys). A drop of >10% YoY in crude imports is a recession signal that crushes oil prices globally. Cross-reference with ship tracking data (Section L) for real-time estimates before official data release.
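The >10% year-over-year check reduces to a few lines. This is an illustrative sketch only: the `"YYYY-MM"` keying, the function names, and the volumes in the usage example are made up for the example and are not real GACC figures.

```python
def yoy_change(monthly_volumes, month):
    """Year-over-year change for `month` ("YYYY-MM"), or None if the
    current or year-ago observation is missing."""
    year, mm = month.split("-")
    prior = f"{int(year) - 1}-{mm}"
    if month not in monthly_volumes or prior not in monthly_volumes:
        return None
    return monthly_volumes[month] / monthly_volumes[prior] - 1.0

def crude_import_drop_signal(monthly_volumes, month, threshold=-0.10):
    """True when imports fell by more than 10% versus the same month last year."""
    change = yoy_change(monthly_volumes, month)
    return change is not None and change < threshold

# Hypothetical volumes (million tonnes), for illustration only.
volumes = {"2023-05": 50.0, "2024-05": 44.0}
signal = crude_import_drop_signal(volumes, "2024-05")  # 44/50 - 1 = -12%
```

Comparing the same calendar month avoids the Lunar New Year and seasonal distortions that month-over-month comparisons would pick up.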
What: Weekly warehouse inventory reports for commodities traded on SHFE (copper, aluminum, zinc, nickel, tin, lead, and rubber).
Source: SHFE website (https://www.shfe.com.cn/en/) -- free, English available. Collection frequency: Weekly (published each Friday). Why it matters: SHFE copper stocks drawdown = Chinese industrial demand is absorbing physical supply. Combined with SHFE-LME arb window data (Section C), this gives a complete picture of Chinese metals demand vs rest-of-world pricing.
What: South Korea publishes full-month export data on the 1st of the following month, plus a flash estimate covering the first 20 days of the current month -- among the earliest global trade data available. Korean exports are heavily weighted toward semiconductors, displays, and ships, making them a proxy for the global chip cycle and manufacturing.
Source: Korea Customs Service (https://unipass.customs.go.kr/ets/index_eng.do) -- free, English. What to track:
Collection frequency: Monthly (1st of each month) with 20-day flash mid-month. Why it matters: Korean semiconductor exports lead global chip demand by 1-2 months. When Korean chip exports turn positive YoY after a contraction, it signals the chip downcycle has bottomed -- bullish for copper (used in electronics), palladium (used in chip packaging), and tech-heavy commodity indices.
What: One row per trading day with every relevant flag and event marker. Analysts query "what kind of day is today?" and get every factor at once.
Schema:
daily_state:
date DATE PRIMARY KEY
-- Energy calendar
is_eia_petroleum_day BOOLEAN (Wednesday)
is_eia_natgas_day BOOLEAN (Thursday)
is_opec_report_day BOOLEAN (monthly, mid-month)
is_baker_hughes_day BOOLEAN (Friday 1 PM)
-- Agriculture calendar
is_wasde_day BOOLEAN (monthly, ~10th)
is_crop_progress_day BOOLEAN (Monday, growing season)
is_export_sales_day BOOLEAN (Thursday)
is_prospective_plantings BOOLEAN (end of March)
is_cattle_on_feed_day BOOLEAN (monthly, ~20th)
-- CFTC
is_cot_release_day BOOLEAN (Friday 3:30 PM)
-- Fed/Treasury
is_fomc_day BOOLEAN
is_fomc_eve BOOLEAN
days_to_next_fomc INTEGER
is_treasury_auction BOOLEAN
auction_tenor TEXT (e.g., "10Y", "30Y")
-- Macro releases
is_cpi_day BOOLEAN
is_ppi_day BOOLEAN
is_nfp_day BOOLEAN (first Friday of month)
is_gdp_day BOOLEAN
is_ism_pmi_day BOOLEAN (1st business day of month)
is_pce_day BOOLEAN
econ_release_tier INTEGER (1=CPI/NFP/FOMC, 2=PPI/GDP/ISM, 3=everything else)
-- Contract rolls
contracts_rolling TEXT (JSON list of symbols in roll window)
is_goldman_roll BOOLEAN (5th-9th business day of each month, the S&P GSCI roll window)
-- Weather
hurricane_active BOOLEAN
extreme_weather_flag TEXT (e.g., "polar vortex", "heat wave")
-- China
is_china_pmi_day BOOLEAN (last day of month)
is_china_customs_day BOOLEAN (monthly, ~15th)
-- General
is_month_end BOOLEAN
is_quarter_end BOOLEAN
is_half_day BOOLEAN
Collection frequency: Generated daily at 5:00 AM ET (ahead of the US morning data releases and the cash-session open; electronic futures trade nearly around the clock). Calendar data sourced from CME, USDA, EIA, BLS, Treasury release schedules (all published months in advance).
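A minimal sketch of the daily generator follows, covering only the flags that can be derived from the date itself. It assumes weekday rules with no holiday adjustment (in practice EIA, COT, and NFP releases shift around federal holidays, so the published schedules should override these defaults), and it treats month end as the last weekday of the month. The function names are hypothetical and the table is cut down to seven columns for brevity.

```python
import sqlite3
from datetime import date, timedelta

def is_last_weekday_of_month(d):
    """True if d is a weekday and the next weekday falls in another month."""
    nxt = d + timedelta(days=1)
    while nxt.weekday() >= 5:  # skip Saturday (5) and Sunday (6)
        nxt += timedelta(days=1)
    return d.weekday() < 5 and nxt.month != d.month

def build_daily_state(d):
    """Date-derived flags only; holiday shifts are NOT handled (assumption)."""
    wd = d.weekday()  # Monday = 0
    month_end = is_last_weekday_of_month(d)
    return {
        "date": d.isoformat(),
        "is_eia_petroleum_day": wd == 2,        # Wednesday
        "is_eia_natgas_day": wd == 3,           # Thursday
        "is_cot_release_day": wd == 4,          # Friday
        "is_nfp_day": wd == 4 and d.day <= 7,   # first Friday of month
        "is_month_end": month_end,
        "is_quarter_end": month_end and d.month in (3, 6, 9, 12),
    }

def save_daily_state(conn, row):
    """Upsert one row into a reduced daily_state table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS daily_state (
               date DATE PRIMARY KEY,
               is_eia_petroleum_day BOOLEAN,
               is_eia_natgas_day BOOLEAN,
               is_cot_release_day BOOLEAN,
               is_nfp_day BOOLEAN,
               is_month_end BOOLEAN,
               is_quarter_end BOOLEAN)"""
    )
    columns = ", ".join(row)
    marks = ", ".join("?" for _ in row)
    # SQLite has no native boolean type, so flags are stored as 0/1.
    values = [int(v) if isinstance(v, bool) else v for v in row.values()]
    conn.execute(
        f"INSERT OR REPLACE INTO daily_state ({columns}) VALUES ({marks})", values
    )
    conn.commit()
```

Event-driven columns such as `is_fomc_day`, `hurricane_active`, or `contracts_rolling` cannot be derived from the date and would be filled from the agency calendars and weather feeds instead.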
What: Classification of economic releases by market-moving potential.
Tier 1 (move futures 1%+ regularly):
Tier 2 (move futures 0.3-1%):
Tier 3 (minor, context only):
Source for release dates: BLS (https://www.bls.gov/schedule/), Census Bureau, USDA (https://www.nass.usda.gov/Publications/Calendar/), EIA, Treasury.
What: The Copernicus Sentinel-5P satellite measures nitrogen dioxide (NO2) concentrations globally with near-daily coverage. Refineries are major NO2 point sources. A sudden drop in NO2 over a refinery complex = unplanned outage, 12-48 hours before it shows up in news or official reports.
Source: Copernicus Open Access Hub (retired in 2023 in favor of the Copernicus Data Space Ecosystem) or Google Earth Engine. Dataset: COPERNICUS/S5P/NRTI/L3_NO2. Free, requires registration. Refinery complexes to monitor:
Collection frequency: Daily satellite pass (data available within 3 hours). Why it matters: An unplanned refinery outage at a major complex removes 500K-1M barrels/day of processing capacity, spiking product crack spreads. Satellite data gives you this information 12-48 hours before the company announces it. The Contrarian panel flagged this as the highest-value free alternative data source for energy markets.
What: Real-time electricity prices across US power grids (ERCOT, PJM, MISO, SPP, CAISO, NYISO).
Source: GridStatus.io (https://www.gridstatus.io/) -- free tier with real-time locational marginal prices (LMPs). What to track:
Collection frequency: Hourly during business hours. Why it matters: Electricity prices are a real-time proxy for natural gas demand (gas generates ~40% of US electricity). When ERCOT power prices spike, it means gas demand is spiking. This is faster than EIA weekly data and gives you a natgas demand signal before the Thursday storage report.
What: Automatic Identification System (AIS) data from commercial vessels. Every tanker, bulk carrier, and LNG carrier broadcasts position, speed, heading, destination, and cargo type.
Free Sources:
What to monitor:
Collection frequency: Daily scan. Why it matters: Ship positioning leads customs data by 2-4 weeks. If you count 15 VLCCs heading from the Arabian Gulf to China vs 8 last month, Chinese crude imports are going to increase -- and you know this weeks before GACC publishes the data (Section J). The Contrarian panel ranked this as the #1 informational asymmetry available for free.
What: D4 (biodiesel) and D6 (ethanol) RIN credit prices. RINs are compliance credits that refiners must purchase to meet Renewable Fuel Standard mandates.
Source: EPA EMTS data (https://www.epa.gov/fuels-registration-reporting-and-compliance-help/rins-generated-transactions) -- free. Also Argus/OPIS for daily RIN prices (paid, but weekly summaries are free via industry press). What to track:
Collection frequency: Weekly. Why it matters: D6 RINs are tied to the roughly 37% of US corn demand that goes to ethanol (corn -> ethanol -> D6 RINs). When D6 prices spike, ethanol margins improve, ethanol plants run harder, and corn demand increases. D4 prices affect soybean oil demand (soybean oil -> biodiesel -> D4 RINs). These compliance markets are invisible to most commodity traders but directly drive demand for the physical commodities.
What: Association of American Railroads publishes weekly rail carload and intermodal volume data.
Source: AAR (https://www.aar.org/data-research/rail-traffic-data/) -- free weekly summary. What to track:
Collection frequency: Weekly (released Wednesdays). Why it matters: Rail is the second-largest freight mode for commodities after pipelines. Falling coal carloads confirm natural gas displacement. Rising grain carloads during export season = strong demand. Falling intermodal = weak consumer goods demand (recessionary). This is a real-time economic activity indicator.
What: Composite index of dry bulk shipping rates, weighted across Capesize, Panamax, and Supramax vessel classes (Handysize was dropped from the composite in 2018 and is tracked through its own index).
Source: Free delayed data via FRED (not always current) or shipping news sites (Hellenic Shipping News, Splash247). Yahoo Finance: ^BDI. What to track:
Collection frequency: Daily. Why it matters: BDI decomposition tells you WHICH commodity trade flow is moving. Capesize spiking while Panamax is flat = iron ore/coal demand (China restocking) without grain demand increase. The composite BDI is a leading indicator of global trade volumes but only useful when decomposed by vessel class.
What: A systematic scoring framework to separate real signals from noise and traps. Every trade idea processed by the Futures desk gets scored on 6 components before the analysts see it.
Components:
| # | Component | Weight | Description |
|---|---|---|---|
| 1 | Fundamental Direction | 25 | Do government data sources (EIA, USDA, CFTC) agree on the direction? Score: unanimous agreement (25), majority (15), mixed (5), contradictory (0). |
| 2 | Positioning Confirmation | 20 | Is COT positioning confirming the trade? Score: dealer flip + commercial confirmation (20), one category confirming (12), managed money extreme without confirmation (5), all categories crowded same direction (0 -- everyone agrees = no edge). |
| 3 | Term Structure Signal | 15 | Does the futures curve shape support the trade? Score: backwardation supporting bullish trade or contango supporting bearish (15), neutral (8), curve shape contradicts trade direction (0). |
| 4 | Cross-Commodity Confirmation | 15 | Do related commodities and ratios confirm? Score: 3+ correlated assets moving in supporting direction (15), 1-2 supporting (8), related markets contradicting (0). |
| 5 | Seasonal Alignment | 10 | Is the trade aligned with seasonal patterns? Score: strong seasonal with >65% hit rate (10), moderate seasonal (5), fighting seasonal pattern (0). |
| 6 | Freshness / Timeliness | 15 | How recent is the data driving the signal? Score: data <24 hours old (15), 24 hours to 1 week (10), 1-2 weeks (5), >2 weeks (0). COT data is always at least 3 days stale at release (Tuesday positions, published Friday) -- cap at 10 for COT-driven signals. |
Total possible: 100. Minimum threshold: 55 to pass to analysts.
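The scoring arithmetic can be sketched as below. The component keys, the function name `conviction_score`, and the `cot_driven` flag are hypothetical names chosen for illustration; the weights, the 55-point threshold, and the COT freshness cap come from the table above. Assigning the per-component points (e.g., deciding whether agreement is "unanimous" or "majority") is left to upstream logic.

```python
# Maximum points per component, from the scoring table.
WEIGHTS = {
    "fundamental": 25,
    "positioning": 20,
    "term_structure": 15,
    "cross_commodity": 15,
    "seasonal": 10,
    "freshness": 15,
}
PASS_THRESHOLD = 55
COT_FRESHNESS_CAP = 10

def conviction_score(components, cot_driven=False):
    """Return (total, passes) for a dict of component points.

    Missing components score 0. Freshness is capped at 10 when the
    signal is driven by COT data.
    """
    total = 0
    for name, max_points in WEIGHTS.items():
        points = components.get(name, 0)
        if not 0 <= points <= max_points:
            raise ValueError(f"{name} must be between 0 and {max_points}")
        if name == "freshness" and cot_driven:
            points = min(points, COT_FRESHNESS_CAP)
        total += points
    return total, total >= PASS_THRESHOLD

# Example: majority fundamentals, one COT category confirming, supportive
# curve, 1-2 confirming assets, moderate seasonal, data under 24 hours old.
idea = {
    "fundamental": 15, "positioning": 12, "term_structure": 15,
    "cross_commodity": 8, "seasonal": 5, "freshness": 15,
}
score, passes = conviction_score(idea)
```

The same idea scored with `cot_driven=True` loses five freshness points, which is the cap doing its job.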
These rules reject a trade idea regardless of conviction score:
Circular Source Chain: If the bullish case for commodity X relies on data that itself was derived from commodity X's price, reject. Example: "gold should go up because gold ETF inflows are rising" -- the inflows ARE the price move, not a separate signal.
Staleness Window Exceeded: If the primary data source driving the signal has not updated in >2x its normal frequency (e.g., weekly data is >2 weeks old), reject. Stale data = you're trading yesterday's signal.
Regime Change Override: If the 60-day rolling correlation between the signal source and the commodity has flipped sign (e.g., was +0.7, now -0.3), the historical relationship has broken. Reject until the new regime is understood. Correlation panel specifically flagged this.
Narrative vs Data Mismatch: If the trade thesis relies on a qualitative narrative ("OPEC will cut because...") but quantitative data (actual production, tanker flows, inventory builds) contradicts it, reject the narrative. Data wins.
Single-Source Dependency: If the entire trade thesis rests on one data point from one source with no corroboration, reject. Require at least 2 independent data sources supporting the direction.
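Three of these rules are mechanical enough to check in code; the circular-source and narrative-mismatch rules need judgment about what the thesis actually claims and are not covered here. The sketch below assumes a trade idea arrives as a dictionary with the hypothetical keys shown; none of these names come from an existing system.

```python
from datetime import datetime, timedelta

def staleness_exceeded(last_update, normal_interval, now):
    """Staleness rule: data older than 2x its normal update frequency."""
    return (now - last_update) > 2 * normal_interval

def regime_changed(historical_corr, current_corr):
    """Regime rule: the 60-day correlation has flipped sign."""
    return historical_corr * current_corr < 0

def single_source(supporting_sources):
    """Dependency rule: fewer than 2 distinct supporting sources."""
    return len(set(supporting_sources)) < 2

def kill_reasons(idea, now):
    """Return the list of triggered kill rules (empty list = survives)."""
    reasons = []
    if staleness_exceeded(idea["last_update"], idea["normal_interval"], now):
        reasons.append("staleness_window_exceeded")
    if regime_changed(idea["historical_corr"], idea["current_corr"]):
        reasons.append("regime_change")
    if single_source(idea["sources"]):
        reasons.append("single_source_dependency")
    return reasons
```

Returning every triggered reason, rather than stopping at the first, gives analysts a full audit trail for rejected ideas. Note that distinct source names are only a proxy for independence: two feeds that republish the same underlying report should count as one.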
These don't auto-reject but add a warning flag visible to analysts:
Goal: Get all free government data flowing into SQLite. Daily futures prices. Prediction market scraping.
Goal: Build analytical layers that turn raw data into tradeable signals.
Goal: Add China data, alternative data, seasonal analysis, and conviction scoring.
NOT in monthly budget (research-only via Grok/Sonar search):
| Data Category | Daily Size | Annual Size |
|---|---|---|
| Futures prices + curves (20+ contracts, 6-12 months each) | ~200KB | ~50MB |
| EIA petroleum + natgas + production | ~50KB/week | ~3MB |
| USDA WASDE + crop progress + exports | ~100KB/week | ~5MB |
| CFTC COT (15 contracts, 3 report types) | ~200KB/week | ~10MB |
| Treasury yields + auctions + Fed data | ~50KB/day | ~12MB |
| Spreads + ratios + Z-scores | ~100KB/day | ~25MB |
| Kalshi + Polymarket markets (every 15 min) | ~2MB/day | ~500MB |
| Weather outlooks + HDD/CDD | ~20KB/day | ~5MB |
| China data + Korean exports | ~50KB/month | ~1MB |
| Daily state table + calendar | ~5KB/day | ~1MB |
| Metals warehouse stocks | ~20KB/day | ~5MB |
| Alt data (AIS, satellite, rail, BDI) | ~100KB/day | ~25MB |
| Total | ~2.6MB/day | ~640MB/year |
Note: Prediction market data (Kalshi/Polymarket at 15-min intervals) dominates storage, at roughly 500MB of the ~640MB annual total. Without intraday prediction market data, the total drops to ~140MB/year. SQLite handles either volume easily -- no external database needed.
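The annual figures follow from the per-period sizes in the table. As a quick check of that arithmetic, the sketch below sums the rows assuming 252 trading days, 52 weeks, and 12 months per year, and 1MB = 1000KB. These conventions are assumptions (prediction markets also trade on weekends, which would push their row higher), and the short row names are labels invented for the example.

```python
PERIODS_PER_YEAR = {"day": 252, "week": 52, "month": 12}

# (label, size in KB per period, period) -- sizes copied from the table.
ROWS = [
    ("futures_prices_curves", 200, "day"),
    ("eia", 50, "week"),
    ("usda", 100, "week"),
    ("cftc_cot", 200, "week"),
    ("treasury_fed", 50, "day"),
    ("spreads_ratios", 100, "day"),
    ("prediction_markets", 2000, "day"),
    ("weather", 20, "day"),
    ("china_korea", 50, "month"),
    ("daily_state", 5, "day"),
    ("metals_warehouse", 20, "day"),
    ("alt_data", 100, "day"),
]

def annual_mb(rows, exclude=()):
    """Sum annualized storage in MB, skipping any labels in `exclude`."""
    total_kb = sum(
        size_kb * PERIODS_PER_YEAR[period]
        for label, size_kb, period in rows
        if label not in exclude
    )
    return total_kb / 1000

total = annual_mb(ROWS)
without_prediction = annual_mb(ROWS, exclude=("prediction_markets",))
print(f"all rows: {total:.0f} MB/year")
print(f"excluding prediction markets: {without_prediction:.0f} MB/year")
```

Summing the rows this way puts the prediction market feed at roughly three quarters of the total, and either figure sits comfortably within what a single SQLite file can hold.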
| Period | Cost | What You Get |
|---|---|---|
| Weeks 1-3 | $0 | All government data (EIA, USDA, CFTC, FRED, BLS, Treasury, NOAA), free futures prices (Yahoo), prediction market scrapers, full analytical engine, conviction scoring |
| Month 2+ | $29/month | Polygon.io (shared with Options desk) for reliable price data |
| Month 3+ | $29-50/month | Add Quandl (now Nasdaq Data Link) premium if clean COT API format saves development time |
| Year 1 Total | ~$290-500 | Complete Futures desk data infrastructure |
The key insight from the Practical panel: 90%+ of the data that moves commodity markets is published for free by US government agencies (EIA, USDA, CFTC, BLS, Treasury, NOAA, Fed). The paid data services ($500+/month for Kpler, Genscape, Platts) are only marginally better than what you can get for free + satellite alt-data. Start with government data, add Polygon for reliability, and use Grok/Sonar web search to fill gaps.