Every trader who gets serious about systematic strategies goes through the same moment. They write their first backtest, tune the parameters, run the numbers — and see a Sharpe ratio that would make any quant fund jealous. The equity curve climbs smoothly to the upper right. Max drawdown is tight. CAGR crushes the index.
Then they take it live and it falls apart immediately.
This isn’t bad luck. It isn’t a market regime change. It’s the inevitable result of a backtest that was constructed incorrectly from the very beginning. Not during the strategy logic. Not in the parameter tuning. Before any of that — in the foundational assumptions about how historical data is built and used.
This post exists to name those failure modes clearly. Not abstractly. Not academically. With enough specificity that you can look at your own process and identify exactly where the contamination is coming from.
The False Promise of the Backtest
When most people talk about backtesting, they’re describing a process that goes roughly like this: download some historical price data, write a buy and sell rule, simulate it against the past, look at the output. If the output looks good, they have a strategy. If it looks bad, they adjust and re-run.
This is not research. This is curve fitting with extra steps.
A backtest doesn’t tell you that a strategy has edge. It tells you that a strategy would have had edge under the specific conditions modeled. The phrase “would have” is carrying enormous weight in that sentence. Every assumption baked into your data construction, every imprecision in your execution modeling, every look at the results before finalizing parameters: all of it widens the distance between “what worked in the past” and “what will work going forward.”
The problem has gotten worse, not better. Drag-and-drop backtesting platforms, AI-generated strategy code, and five-minute tutorials have lowered the barrier to running a backtest to near zero. What they haven’t done is lower the barrier to running one correctly. The result is a generation of systematic traders with extremely high confidence in results that mean almost nothing.
There are five specific ways backtests break before you write a single line of strategy logic. Every one of them is solvable. None of them are obvious if no one has pointed them out to you directly.
-
01
Survivorship Bias: Your Universe Only Contains Winners

When you download a list of S&P 500 stocks and backtest against them, you are not looking at the S&P 500 as it existed ten years ago. You are looking at the companies that survived long enough to still be in the index today. Every company that went bankrupt, got acquired, was delisted, or simply declined into irrelevance has been silently removed from your dataset.

The practical effect: every strategy you run on a survivorship-biased universe will look better than it actually is, because you’ve systematically excluded the worst outcomes. The companies most likely to blow up your long positions are the ones you never tested against.

This isn’t a minor statistical quibble. It can easily account for several percentage points of apparent annual return, enough to turn a real losing strategy into an apparent winner.
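One way to avoid this is to rebuild the universe as of each backtest date from a membership history, not from today's constituent list. The sketch below assumes a hypothetical table of ticker, date added, and date removed. The tickers and dates are invented for illustration, and `universe_on` is an illustrative helper, not part of any library. Real membership histories have to come from a point-in-time data source.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Membership:
    ticker: str
    added: date
    removed: Optional[date]  # None means still in the index


# Hypothetical membership history; tickers and dates are made up.
HISTORY = [
    Membership("AAA", date(2005, 1, 3), None),
    Membership("BBB", date(2008, 6, 2), date(2016, 3, 1)),    # later delisted
    Membership("CCC", date(2012, 9, 4), date(2019, 11, 15)),  # later acquired
    Membership("DDD", date(2018, 2, 1), None),
]


def universe_on(as_of: date, history=HISTORY) -> set:
    """Return the tickers that were index members on `as_of`."""
    return {
        m.ticker
        for m in history
        if m.added <= as_of and (m.removed is None or as_of < m.removed)
    }


# Survivorship-biased view: only names that are still members today.
todays_list = {m.ticker for m in HISTORY if m.removed is None}

# Point-in-time view: what a trader could actually have held in early 2015.
universe_2015 = universe_on(date(2015, 1, 2))
```

In this toy history the 2015 universe contains BBB and CCC, which later left the index, and excludes DDD, which had not yet joined. Today's list gets both wrong.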
-
02
Look-Ahead Bias: Your Strategy Knows the Future

Look-ahead bias happens when your backtest uses information that wouldn’t have been available at the moment the trade was actually taken. It sounds obvious when stated that way. It is remarkably easy to introduce accidentally.

The most common version: earnings data. If you’re using reported quarterly earnings in your signal, you need to know the exact date the filing was made public, not the period it covers. Companies routinely report earnings 30–45 days after the quarter ends. A backtest that treats Q3 earnings as available on October 1st is trading on data that didn’t exist until late October at the earliest, and often not until mid-November. Every trade in that window is contaminated.

The same problem appears in index rebalancing dates, analyst estimate revisions, corporate action data, and almost any fundamental dataset that updates on a lag. A single misaligned timestamp can inflate returns dramatically, and you will never see it in the backtest output.
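A common way to enforce this in pandas is an as-of join keyed on the publication date. The sketch below assumes a hypothetical fundamentals table with `period_end`, `filing_date`, and `eps` columns; the column names and every value are invented for illustration. `pd.merge_asof` attaches to each trading day the latest filing that was already public.

```python
import pandas as pd

# Hypothetical fundamentals: the period each figure covers and the date
# it became public. All values are invented for illustration.
earnings = pd.DataFrame({
    "period_end": pd.to_datetime(["2023-06-30", "2023-09-30"]),
    "filing_date": pd.to_datetime(["2023-08-03", "2023-11-02"]),
    "eps": [1.10, 1.25],
}).sort_values("filing_date")

# Trading days on which the signal is evaluated (must be sorted).
bars = pd.DataFrame({
    "date": pd.to_datetime(["2023-10-02", "2023-11-01", "2023-11-03"]),
})

# For each bar, take the most recent filing published strictly before
# that date. Keying on filing_date, not period_end, is what keeps the
# Q3 figure out of October.
pit = pd.merge_asof(
    bars,
    earnings,
    left_on="date",
    right_on="filing_date",
    direction="backward",
    allow_exact_matches=False,
)
```

Setting `allow_exact_matches=False` is a deliberately conservative choice: a filing released on a given day may arrive after the close, so it only becomes usable on the following bar.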
-
03
Data Snooping Bias: You Tested 200 Parameters and Reported the Best One

You hypothesize that a moving average crossover has edge. You test an SMA 20/50. The results are mediocre. You try 20/100. Better. You try 50/200. The equity curve looks interesting. You run 40 more combinations. One of them prints a 1.8 Sharpe ratio. You declare victory.

What you’ve actually done is search through random variation until one combination happened to fit the historical noise. This is p-hacking. The strategy didn’t find edge in the data. The data produced a number that matched the strategy, by chance, given enough attempts.

The damage compounds when you report only the best-performing parameter set without disclosing how many you tested. A Sharpe ratio on its own is meaningless without knowing how many attempts it was selected from.
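The selection effect is easy to reproduce with pure noise. The sketch below simulates 200 "strategies" whose daily returns are random draws with zero mean, so none of them has any edge by construction, and then picks the best Sharpe ratio. The count, the 1% daily volatility, and the ten-year length are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n_strategies = 200   # parameter combinations "tested"
n_days = 252 * 10    # ten years of daily returns

# Every row is a strategy with zero true edge: pure noise, 1% daily vol.
returns = rng.normal(loc=0.0, scale=0.01, size=(n_strategies, n_days))

# Annualised Sharpe ratio of each noise strategy.
sharpes = returns.mean(axis=1) / returns.std(axis=1, ddof=1) * np.sqrt(252)

best = sharpes.max()
median = np.median(sharpes)
print(f"tested: {n_strategies}, median Sharpe: {median:.2f}, best Sharpe: {best:.2f}")
```

The median sits near zero, as it should, while the best of the batch looks like a respectable strategy. Reporting only that maximum, without the count of 200, hides the fact that it was selected from noise.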
-
04
Overfitting: Your Strategy Memorized the Past Instead of Learning From It

Data snooping is the cause. Overfitting is the symptom. A strategy that has been tuned through enough iterations will eventually achieve near-perfect fit on in-sample data, not because it has identified a real market dynamic, but because it has memorized the specific noise pattern of the period you tested it on.

The classic tell: a strategy that performs brilliantly on the training data and immediately deteriorates on any out-of-sample period. If your 2010–2020 backtest looks excellent but your 2021–2024 holdout period shows random or negative returns, you don’t have a strategy. You have a model of 2010–2020.

The number of parameters in your model matters more than most people realize. Every free parameter is another degree of freedom for the optimizer to exploit. Simple strategies with fewer parameters are harder to overfit and more likely to reflect genuine edge.
-
05
Ignoring Transaction Costs: The Difference Between Gross and Net Is the Strategy

This one feels like it should be obvious, yet it is where most retail systematic strategies die. The backtest models frictionless execution. The live account does not.

The costs that matter: commissions (zero for equities on Alpaca, real for options), slippage (assume you execute at least 0.1% worse than the signal price on both entry and exit), and bid-ask spread (which widens meaningfully during volatile periods, exactly when many strategies want to trade). On higher-frequency strategies, these costs are not a rounding error. They are often the entire edge.

A strategy returning 14% gross might return 7% after realistic friction. That is a fundamentally different risk/return proposition. A strategy returning 8% gross might turn negative after costs. These are not equivalent outcomes. Model the friction before you trust any headline return number.
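A first-order estimate makes the drag concrete. The sketch below uses the 0.1% per-side slippage assumed above and subtracts it once on entry and once on exit for every round trip. The trade frequencies are hypothetical, and the helper `net_annual_return` ignores compounding, commissions, and spread, so it is a rough lower bound on friction, not a cost model.

```python
def net_annual_return(gross: float, round_trips: int, cost_per_side: float) -> float:
    """First-order estimate: gross annual return minus cost drag.

    Each round trip pays the cost twice (entry and exit). Compounding
    and position sizing are ignored, so treat this as a rough estimate.
    """
    return gross - round_trips * 2 * cost_per_side


COST_PER_SIDE = 0.001  # 0.1% slippage per side, as assumed in the text

# Hypothetical trade frequencies, chosen only to illustrate the drag.
for gross, trips in [(0.14, 35), (0.08, 35), (0.08, 50)]:
    net = net_annual_return(gross, trips, COST_PER_SIDE)
    print(f"gross {gross:.0%}, {trips} round trips/yr -> net {net:.1%}")
```

At 35 round trips a year the drag is about 7 percentage points, which is enough to halve a 14% gross return. At 50 round trips it exceeds an 8% gross return entirely.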
The Right Mental Model: A Science Experiment, Not a Profit Projection
Every one of the five failure modes above shares a common root cause: they emerge when a researcher approaches a backtest looking for confirmation rather than falsification. The goal becomes “find parameters that make this work” rather than “design a test that would catch it if it doesn’t.”
You are not trying to prove the strategy works.
You are trying to prove it doesn’t.
A properly designed backtest is an honest attempt to destroy your own hypothesis. Every safeguard — survivorship-free data, point-in-time fundamentals, fixed parameters before out-of-sample testing, realistic cost models — exists not to make the backtest harder but to make the results mean something.
The practical structure of an honest backtest setup is not complicated. Split your data before you start: training period, validation period, test period. Lock parameters after the training split. Never touch the test set until you’re done. One look at the test set is one look — not an iterative loop. Model your costs explicitly. Run the same parameters on instruments you didn’t optimize for and observe whether the behavior is consistent.
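The split itself can be sketched in a few lines. The helper below, `chronological_split`, is a hypothetical name, and the 60/20/20 proportions are an assumption for illustration. It cuts time-ordered rows into three contiguous blocks without shuffling, so the test set is always the most recent data.

```python
from typing import Sequence, Tuple


def chronological_split(
    rows: Sequence, train_frac: float = 0.6, val_frac: float = 0.2
) -> Tuple[Sequence, Sequence, Sequence]:
    """Split time-ordered rows into train / validation / test, oldest first.

    No shuffling: the test set is always the most recent slice, so
    nothing from the future leaks into parameter selection.
    """
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    n = len(rows)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


# Stand-in for ten years of daily bars, already sorted by date.
bars = list(range(2520))
train, val, test = chronological_split(bars)
```

For a pandas DataFrame, the same boundaries can be applied with `.iloc`. The discipline matters more than the code: tune on `train`, choose between candidates on `val`, and evaluate on `test` once.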
This approach produces fewer strategies that look good. It produces more strategies that actually are.
What a Contaminated Backtest Looks Like in Code
Here’s a minimal example of look-ahead bias in Python — the kind that’s trivially easy to introduce and impossible to catch by looking at returns alone:
```python
import pandas as pd
import yfinance as yf

df = yf.download('SPY', start='2015-01-01', end='2024-01-01')

# squeeze() yields a 1-D Series even if this yfinance version returns
# (field, ticker) MultiIndex columns for a single ticker
close = df['Close'].squeeze()
sma_20 = close.rolling(20).mean()
returns = close.pct_change()

# ❌ WRONG: look-ahead bias caused by the missing shift()
# The signal uses today's closing price to decide today's trade.
# A real system can only see yesterday's close at the open.
signal_wrong = (close > sma_20).astype(int)

# ✅ CORRECT: shift(1) ensures the signal uses only the prior bar's data
# (cast before shifting; the first bar has no prior signal, so stay flat)
signal = (close > sma_20).astype(int).shift(1).fillna(0)

strategy_returns = signal * returns
```
The difference is a single .shift(1). Without it, every trade in the backtest is using the closing price of the current bar to make a decision that would only be executable at the next bar’s open. Your strategy appears to buy at the signal bar’s close but is actually trading on a price it couldn’t have known when the order needed to be placed.
That one missing shift can add 1–3% of apparent annual return on a daily price signal. Invisible in the output. Catastrophic in live trading.
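One way to surface this kind of leak is a noise test: run the same rule on a synthetic random walk, where no strategy can have a real edge, and see whether the backtest still reports a profit. The sketch below is an illustration under assumed settings (5,000 synthetic days, 1% daily volatility, the same 20-day SMA rule). It is a sanity check, not a complete validation procedure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Random-walk prices: daily returns are zero-mean noise, so no
# strategy can have a true edge on this series.
noise = rng.normal(loc=0.0, scale=0.01, size=5000)
close = pd.Series(100.0 * np.cumprod(1.0 + noise))

sma_20 = close.rolling(20).mean()
returns = close.pct_change()
raw_signal = (close > sma_20).astype(int)

# Leaky: today's signal is paired with today's return.
leaky = (raw_signal * returns).mean() * 252

# Honest: yesterday's signal is paired with today's return.
honest = (raw_signal.shift(1).fillna(0) * returns).mean() * 252

print(f"annualised mean return on noise: leaky {leaky:.1%}, honest {honest:.1%}")
```

The leaky version earns a large positive return on pure noise, because a close above the average is itself partly caused by that day's up move. The shifted version hovers around zero. A backtest that makes money on a random walk is measuring its own leak.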
How We Build at Code Assassins
Every Lab that publishes on this platform has run through a six-stage pipeline designed specifically around the failure modes described above. The pipeline is not a formality — it is the entire point.
Stage 1 — Hypothesis and Data Pull: The hypothesis is written before any data is examined. The data universe is defined, sourced survivorship-free where possible, and validated for point-in-time accuracy before the backtest engine runs a single iteration.
Stage 2 — Backtest Engine: Research-speed parameter sweeps run across the training split only. The number of parameter combinations tested is logged and disclosed with the results. No optimization runs against the validation or test splits.
Stage 3 — Results Storage: Metrics are stored, not recomputed. Every test run is versioned. Parameter sets and their corresponding results are recorded in full, not just the best performer.
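The logging discipline in Stages 2 and 3 can be illustrated with a small sweep harness. This is a hypothetical sketch, not the platform's actual pipeline: `run_sweep`, `toy_evaluate`, the parameter names, and the record layout are all assumptions. The point it shows is that every run is recorded and the number tested travels with the best result.

```python
import itertools
import json


def run_sweep(param_grid: dict, evaluate) -> list:
    """Evaluate every parameter combination and keep every result."""
    names = sorted(param_grid)
    combos = itertools.product(*(param_grid[name] for name in names))
    records = []
    for run_id, values in enumerate(combos):
        params = dict(zip(names, values))
        records.append({"run_id": run_id, "params": params, "sharpe": evaluate(params)})
    return records


def toy_evaluate(params: dict) -> float:
    """Stand-in for a real backtest on the training split (hypothetical)."""
    distance = abs(params["fast"] - 20) + abs(params["slow"] - 100) / 10
    return round(1.0 / (1 + distance), 3)


grid = {"fast": [10, 20, 50], "slow": [50, 100, 200]}
records = run_sweep(grid, toy_evaluate)

best = max(records, key=lambda r: r["sharpe"])
report = {"n_tested": len(records), "best": best, "all_runs": records}

# One JSON line per run makes the full sweep auditable later.
log_lines = [json.dumps(r, sort_keys=True) for r in records]
```

Because `report` carries `n_tested` and `all_runs` alongside `best`, a reader can judge the headline number against how many attempts produced it.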
Stage 4 — Lab Post Builder: The written research document. Hypothesis, methodology, results, and a preliminary read on what the data is actually saying — before paper trading begins.
Stage 5 — Paper Trading: Before any capital touches the strategy, it runs in simulated live conditions through at least one market cycle. This is where execution reality diverges most clearly from backtest assumptions.
Stage 6 — Publish the Verdict: The full result — including the failure modes tested, the regimes where the strategy underperformed, and the realistic cost-adjusted metrics — is published publicly. Pass, fail, or inconclusive. No spin.
This is not a guarantee that our strategies work. It is a guarantee that when we say a strategy works, we’ve given that claim every honest opportunity to fail first.
What Comes Next
This post is the hub for the methodology series. Before we get into the specific failure modes in depth, the next post lays the statistical foundation everything else rests on — what risk actually looks like when you measure it precisely.
The bias series follows after that. Each post is a standalone reference, but read in sequence they build the complete foundation that every experiment on this platform is measured against.