It has been a month or two since I caught the “build and test trading strategies” bug again, a bug that my very first real gig gave me. Thinking about that job now, it was pretty absurd.
It was a six-month internship at what was then a big Italian asset management firm. The guy running the FX desk wanted to launch an absolute return fund that invested only in currencies. The concept was quite ahead of its time, especially in Italy, but the setup…amateurish at best. The FX team is typically a staff function that covers FX trade execution and manages FX risk across all funds; not exactly the focal point of the firm, resource-wise. It was a two-person team, the head and an analyst…plus me. To give you an example, I once saw the analyst checking an Excel sum on a desktop calculator because “sometimes Excel results are wrong”, bless her.
On day one, the boss gave me a pile of other banks’ research pieces and FX strategies and told me to reverse engineer them to check whether there was any real value that we could later build upon. I am still not sure if it was because I got a desk with four monitors and felt I was in the position I knew I belonged in, like those characters I read about in Market Wizards, but I (somehow) pulled it off. It was probably one of those cases where ignorance helps: if I had known what I know now about the kind of math, statistics and programming knowledge serious researchers have, I would not even have tried. But back then, a Bloomberg terminal (total prior experience: 20 hours, tops) and my VBA skills were more than enough to create a trading system that, I thought, kicked ass.
All the FX trading models out there are basically built on three or four factors and their combinations. I will go quite fast now because the topic of this post is backtesting, not a deep-dive on FX models, even if the subject is really interesting (at least for me). If you do not understand something, Google it. The most intuitive factor is carry: take a basket of currencies like the G10, the most liquid and freely traded ones (no RMB then), and rank them by yield. Then you go long the 3 at the top of the ranking and short the 3 at the bottom. Not that complicated, huh? The reason this factor works is intuitive: investors chase yield, whatever the asset. The carry factor is also a piece of empirical evidence against Uncovered Interest Rate Parity, which can be an added intellectual bonus.
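To make the ranking rule concrete, here is a minimal Python sketch. It is not the original model: the yields are invented for illustration, and the equal weighting of each leg and the name `carry_basket` are assumptions.

```python
def carry_basket(yields, n=3):
    """Rank currencies by yield: long the top n, short the bottom n.

    `yields` maps a currency code to its yield (any consistent unit).
    Equal weights within each leg are an assumption of this sketch.
    """
    if len(yields) < 2 * n:
        raise ValueError("need at least 2*n currencies")
    ranked = sorted(yields, key=yields.get, reverse=True)
    weights = {ccy: 0.0 for ccy in yields}
    for ccy in ranked[:n]:       # highest yielders
        weights[ccy] = 1.0 / n
    for ccy in ranked[-n:]:      # lowest yielders
        weights[ccy] = -1.0 / n
    return weights


# Made-up yields in percent, NOT market data.
example_yields = {
    "AUD": 5.5, "NZD": 6.8, "GBP": 4.6, "USD": 3.2, "CAD": 2.7,
    "NOK": 2.0, "EUR": 2.1, "SEK": 1.6, "CHF": 0.8, "JPY": 0.1,
}
weights = carry_basket(example_yields)
longs = sorted(c for c, w in weights.items() if w > 0)
shorts = sorted(c for c, w in weights.items() if w < 0)
print("long:", longs)
print("short:", shorts)
```

The basket is dollar-neutral by construction (the weights sum to zero). In a backtest, the ranking would be recomputed at every rebalancing date using only the yields known on that date.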
[for this post I wanted to use examples from a model designed by yours truly. Unfortunately:
- I tossed away the FX carry model years ago with my old laptop so I have to use graphs I found around the net/Bloomberg;
- all the models I implemented from @wifey’s list do not have much history because the ETFs they use are quite recent.]
I took the internship in 2005 and I back-tested all the models starting from 01/01/99, the day the euro was introduced. Here is a close representation of what my carry model equity line looked like:
You can imagine the excitement I felt: the line is basically an arrow pointing up to the right, accompanied by an absurd return and Sharpe ratio.
You have to be as naive as someone at his first internship not to realize that the same graph was on the desk of every bank (at the time they could still prop-trade), asset manager and hedge fund out there. To cut a long story short, here is how it went from there:
It continued to perform until 2007, so my optimism grew along with the line, higher and higher. Then the GFC came and…it never worked again. I am not an AQR analyst, but I think one of the reasons the performance flattened is that all central banks brought their rates to zero: it is hard for a system based on rate differentials to generate relevant signals when CBs use other tricks, like QE, to drive their monetary policy.
Here are some lessons I learned building this and other backtests.
Time Frame
The backtest should include as many regimes as possible, so that you can rule out that the result depends on a particular market condition that might change in the future. For example, an equity system should include at least one bull market and one bear market. Even if six years of daily signals might seem a long enough interval, my FX carry backtest was run over a period in which conditions were all the same. Consider how long value stocks underperformed growth: more than 10 years.
On the other hand, you do not want to use too long a period, one that includes years when markets were fundamentally different. Think about Bretton Woods for currencies, or commodities before retail traders could access them through ETFs. Including those periods in your test might generate false signals because the rules of the game were not the same.
What breaks a trading system is extreme risk events. Given their rarity, you want to include as many as possible to prove the system’s resilience, hence the need for a long testing period. But you should be able to realize when prior assumptions are not valid anymore because the rules have changed, and adapt your system accordingly (if not toss it into the bin altogether).
If your system relies on many parameters (moving averages, stop losses based on % moves, filters), you want to divide your period of analysis in two: an in-sample part where you define the parameters and an out-of-sample part where you see how robust your system is, to avoid curve fitting.
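A minimal sketch of that split, under loud assumptions: the prices are a synthetic random walk (not market data), the only parameter is a moving-average window, and the 60/40 split and the helper names are hypothetical choices made for illustration.

```python
import random


def ma_strategy_return(prices, window):
    """Total return of a long-or-cash rule: hold the asset from day t to t+1
    only if the close on day t is above its `window`-day simple moving average.
    The signal uses data up to day t only, so there is no look-ahead."""
    equity = 1.0
    for t in range(window - 1, len(prices) - 1):
        sma = sum(prices[t - window + 1 : t + 1]) / window
        if prices[t] > sma:
            equity *= prices[t + 1] / prices[t]
    return equity - 1.0


def split(prices, in_sample_fraction=0.6):
    """Chronological split: the first part to fit, the rest to validate."""
    cut = int(len(prices) * in_sample_fraction)
    return prices[:cut], prices[cut:]


def pick_window(prices, candidates):
    """Choose the window with the highest return on the given sample."""
    return max(candidates, key=lambda w: ma_strategy_return(prices, w))


# Synthetic random-walk prices, for illustration only.
random.seed(42)
prices = [100.0]
for _ in range(1499):
    prices.append(prices[-1] * (1 + random.gauss(0.0002, 0.01)))

in_sample, out_of_sample = split(prices)
candidates = [20, 50, 100, 200]
best = pick_window(in_sample, candidates)   # fitted on in-sample data only
print("window chosen in sample:", best)
print("in-sample return:    ", round(ma_strategy_return(in_sample, best), 4))
print("out-of-sample return:", round(ma_strategy_return(out_of_sample, best), 4))
```

The point is the discipline, not the numbers: the parameter is chosen without ever looking at the second part, and a large gap between the two returns is a hint of curve fitting.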
You can create an ad hoc stress test scenario to prove the system’s resilience, but this exercise is easier said than done. Markets are full of non-linear relationships and, more often than not, a particular stress depends on how other market participants will react (Soros’s reflexivity and policymakers’ actions). Even the greatest hedge fund will not survive a market with zero liquidity (nor will the financial system in general), so your stress scenario depends on where CBs will “draw the line”.
Inputs
The FX carry model is based on interest rate differentials…but which interest rate should you use? Consider the difference between the overnight rate and the 2-year rate: both have merits, the former because it represents current conditions and the latter because it is the market standard for short-term expectations. Today, if you use the O/N rate, the GBP has a higher yield than the USD. But 99% of market movements are based on expectations, so the 2-year rate might be better, and in that case the USD has a higher yield than the GBP. What if market participants start to focus on the 1-year rate instead of the 2-year, as they do now? Do not wait for a text alert on that.
The 60/40 portfolio is also a trading system with its own rules. But as I wrote in the past, the 60 and the 40 parts have different meanings for different investors: S&P500 vs World, Treasuries vs Global Aggregate, choose your fighter. Each will lead to different results.
The commodity space is even messier because a lot of research is based on spot prices but very few investors have the ability to replicate spot returns. The majority of us have to rely on investment vehicles based on futures, and the specific futures contract used, combined with the shape of the forward curve at each point in time, determines the investor’s end result.
Data Quality
When you go long USD / short EUR, your performance is driven not only by the spot rate but also by the interest rate differential between the two currencies. If you test your system only against historical €/$ spot prices, your real P&L will not match your system’s P&L. The FX market trades around the clock during the week, so the “end of day” price you get depends on where the database provider sets its own arbitrary “end of day”. If you get data from Yahoo Finance and YF sets the end of the day at 4 pm NYC time, then your system (or you) has to trade exactly at that point in time, all the time.
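The spot-versus-total-return point can be sketched with a simple deposit replication: borrow 1 EUR, convert it to USD at the spot rate, deposit the USD, convert back at the end and repay the EUR loan. The rates and spot levels below are invented, simple interest on an ACT/360 basis is a simplifying assumption, and bid/ask spreads are ignored.

```python
def spot_only_return(s0, s1):
    """Return per 1 EUR of notional from being long USD / short EUR,
    looking at spot alone. s0, s1 are EUR/USD quotes (USD per 1 EUR)."""
    return s0 / s1 - 1.0


def total_return(s0, s1, usd_rate, eur_rate, days, basis=360):
    """Same trade including carry: earn the USD deposit rate,
    pay the EUR borrowing rate (simple interest, assumed ACT/360)."""
    tau = days / basis
    usd_leg = (s0 / s1) * (1 + usd_rate * tau)  # USD deposit, converted back to EUR
    eur_leg = 1 + eur_rate * tau                # EUR loan to repay
    return usd_leg - eur_leg


# Invented numbers: spot ends where it started, USD yields 5%, EUR yields 2%.
flat_spot = spot_only_return(1.20, 1.20)
with_carry = total_return(1.20, 1.20, usd_rate=0.05, eur_rate=0.02, days=360)
print("spot only: ", round(flat_spot, 4))
print("with carry:", round(with_carry, 4))
```

With a flat spot rate the spot-only backtest shows nothing, while the real position earns the rate differential. For a carry strategy, which by construction is long the high yielders, leaving this out understates the P&L; for a strategy that ends up short them, it overstates it.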
If your system trades distributing ETFs, you have to add back to your P&L the dividends paid. If your system uses signals based on price, as many Technical Analysis systems do, you have to distinguish between a real price drop and the price drop after a dividend payment.
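One common way to handle both issues is to build a total-return index and compute both P&L and signals on it. A minimal sketch, with invented prices and an assumed convention that `dividends[t]` is the cash amount going ex-dividend on day `t`:

```python
def total_return_index(prices, dividends, start=100.0):
    """Index that reinvests dividends on the ex-date.

    prices[t]    : closing price on day t
    dividends[t] : cash dividend going ex on day t (0.0 on other days)
    """
    index = [start]
    for t in range(1, len(prices)):
        daily = (prices[t] + dividends[t]) / prices[t - 1]
        index.append(index[-1] * daily)
    return index


# Invented data: a 1.00 dividend goes ex on day 2 and the price drops by 1.00.
prices = [50.0, 50.5, 49.5, 49.5]
dividends = [0.0, 0.0, 1.0, 0.0]
tri = total_return_index(prices, dividends)

price_move = prices[2] / prices[1] - 1   # looks like a 2% drop
tr_move = tri[2] / tri[1] - 1            # economically, nothing happened
print("price move on ex-date:       ", round(price_move, 4))
print("total-return move on ex-date:", round(tr_move, 4))
```

A price-based signal fed with the raw series would read the ex-date as a sell-off; fed with the index, it sees a flat day.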
Then your system has to consider trading costs and slippage. Trading costs are easy to model (but you have to include them!) while slippage, the difference between the screen mid-price and the bid or ask where your order can be filled in full, is more art than science. Slippage depends on the instrument itself but also on the time of day you trade and the direction of your trade. If your system is mean reverting (in very loose terms, you buy when the market is down and vice versa), then you should expect smaller slippage compared to a system based on momentum. Slippage can be particularly painful for stop-loss orders: if you use a limit order, you risk that the market will ‘jump’ it and never fill it; if you use a market order…the next flash crash will teach you not to use market orders. Both cases mean your backtest should account for reasonable slippage, a rule that in practice no one is able to follow.
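A rough sketch of how frictions can be charged to each trade in a backtest. Everything here is an assumption made for illustration: costs are expressed per side in basis points of notional, they are simply subtracted from the gross return, and the numbers are invented rather than estimates for any real instrument.

```python
def net_return(gross_return, commission_bps, half_spread_bps, extra_slippage_bps=0.0):
    """Gross trade return minus round-trip frictions.

    All costs are per side (entry and exit each pay them), in basis points:
    - commission_bps     : explicit trading costs, the easy part
    - half_spread_bps    : distance from the screen mid to the bid or ask
    - extra_slippage_bps : impact beyond the quoted spread, the 'art' part
    """
    per_side = (commission_bps + half_spread_bps + extra_slippage_bps) / 10_000
    return gross_return - 2 * per_side


# Invented example: a trade that makes 0.50% before costs.
gross = 0.005
optimistic = net_return(gross, commission_bps=1, half_spread_bps=2)
cautious = net_return(gross, commission_bps=1, half_spread_bps=2, extra_slippage_bps=3)
print("gross:     ", gross)
print("optimistic:", round(optimistic, 4))
print("cautious:  ", round(cautious, 4))
```

Running the whole backtest under two or three slippage assumptions, rather than one, shows how much of the edge survives when fills are worse than the screen suggests.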
Risk Filters
No one follows the rule on slippage costs because no one accepts the #1 rule of backtesting: your worst drawdown is the one that has not happened yet. Myself included, the asshole that tests systems only while wearing his rose-colored glasses. This rule is really difficult to accept because it means you have to bin 99% of the work you do. Look at research in scientific papers: authors are more likely to bend reality to fit their initial hypothesis than to admit defeat, accept they wasted months or years of work and move on.
It is a very competitive field: it is easier to cut some corners and hope for the best (“the market is liquid enough to absorb all my orders”) than to apply layers and layers of precautions that eat into your projected returns.
In order to limit drawdowns, we can add “filters” to the trading system, rules that (try to) identify unfavorable periods and move the system to cash (out of the market) or from long to short. The most classic of these filters are Moving Averages used to identify trends: when the price is above the MA, go long, and when it is below, go out or short. The main advantage of MAs, their simplicity, is also their biggest drawback: they are based on nothing other than past prices.
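A minimal sketch of such a filter, assuming a simple (unweighted) moving average and a signal computed on the close for the following day. The function name and the toy prices are made up.

```python
def sma_filter_positions(prices, window, below=0):
    """Position to hold on the NEXT day, decided on each close.

    1       : close is above its simple moving average (stay long)
    `below` : close is at or below it (0 = cash, -1 = short)
    None    : not enough history yet
    """
    positions = []
    for t in range(len(prices)):
        if t + 1 < window:
            positions.append(None)
            continue
        sma = sum(prices[t - window + 1 : t + 1]) / window
        positions.append(1 if prices[t] > sma else below)
    return positions


# Toy prices, for illustration only.
prices = [10, 11, 12, 11, 9, 8, 9, 11]
print(sma_filter_positions(prices, window=3))
# [None, None, 1, 0, 0, 0, 1, 1]
print(sma_filter_positions(prices, window=3, below=-1))
# [None, None, 1, -1, -1, -1, 1, 1]
```

The filter exits one bar after the top and re-enters one bar after the bottom: it cannot avoid a drawdown, only cap it, because it knows nothing except past prices.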
Which time frame should your system use? A short MA has the advantage of signaling a change of trend quickly and the disadvantage of generating several false signals; a long MA reverses the pros and cons. Common time frames include 20, 50 and 200 days. Finding the best MA for your system is not that difficult: you can simply run several tests and pick the MA that generates the highest P&L. The issue with this approach is that you risk overfitting your curve, meaning the chosen MA might not work in the future as it did in the past. Look at the example below:
The risk filter gave no meaningful false signal (it feels like the red line did some catching up in 2020, so there was probably a bad signal when markets rebounded) and, excluding two minor episodes in 2002 and 2003, it basically triggers only to avoid the GFC fall, which overlaps with the strategy’s worst drawdown. To partially avoid the risk of overfitting, MPinvestit ran different MAs and found that the system performance was not dramatically different. Generally speaking, I find it conceptually acceptable to sacrifice some returns to lower drawdowns, the underpinning of any trend strategy; in this case, I would prefer to land on a standard MA like the 200-day and call it a day.
The more you personalize the filter, the higher the chance you will not stick with it in the future when a big out-of-sample drawdown hits. The biggest risk of any trading strategy is abandoning it at its worst point; for this reason, using signals that have some general meaning (see next paragraph) is better than blindly following something simply because it worked in the past within your particular strategy.
During and after the GFC, I tried different risk filters to ‘protect’ the FX carry model, after the fact at that point, from similar future crashes. I landed on an indicator that blends several risk-off factors like VIX, gold and corporate credit spreads, and applied an MA to that. Despite my desperate search for a sophisticated filter, the main lesson here is that no filter can save an underperforming strategy (as it was, at least in the 2008-2020 period).
Drawdown profiles, more than volatility, have a disproportionate effect on the total returns you can squeeze out of a strategy. This is why everyone is obsessed with them. The lower the drawdowns, in particular the max DD, the more you can leverage up returns. And by the way, it does not matter what your personal view on leverage is; if you are not using it, someone else will, and you will both ride the consequences: returns drying up at first and the (most probable) blow-up later.
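For reference, a minimal sketch of how max drawdown is measured on an equity line, plus a deliberately crude link to leverage. The equity values are invented, and `leverage_for_tolerance` is a hypothetical back-of-the-envelope helper: it assumes drawdowns scale linearly with leverage and ignores financing costs and compounding, so treat it as an upper bound at best.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline of an equity line, as a positive fraction."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, 1 - value / peak)
    return worst


def leverage_for_tolerance(max_dd, tolerated_dd):
    """Crude estimate of the leverage that keeps the historical max drawdown
    within `tolerated_dd`, assuming drawdowns scale linearly with leverage."""
    if max_dd <= 0:
        raise ValueError("no drawdown in the sample: the estimate is meaningless")
    return tolerated_dd / max_dd


# Invented equity line.
equity = [100, 120, 90, 110, 130, 104]
dd = max_drawdown(equity)
print("max drawdown:", dd)                               # 0.25 (from 120 to 90)
print("leverage for a 50% tolerance:", leverage_for_tolerance(dd, 0.50))  # 2.0
```

Halve the historical max DD and the same tolerance allows twice the leverage, which is why the drawdown profile matters so much. It is also why the #1 rule hurts: the leverage is sized on a drawdown that the future is free to exceed.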
And finally…
“A factor ceases to be a factor once it is published and its value is arb-ed away”. I cannot remember who said it, but I agree. The real alpha of any (real) strategy is in the rules and your propensity to stick with them. Going back to the FX carry strategy, it might start to perform again now that (some) CBs around the world have remembered that rates can also go up, not only down. Look at the poster child of this strategy, USD/JPY:
And one of the reasons it might perform again is that no professional money manager could have stuck to the strategy rules in the period 2008-2020. I mean, they could have…as long as they were managing only their own money.
I wanted to write this non-comprehensive, not very detailed list of lessons not only for people who might run, or have run, a backtest but also for the many more who will have a financial strategy, and the related backtests, pushed onto their desk at some point in their lives. Any strategy must have a pain point; if you do not see it, either the backtest was bogus (overfitting) or gains will be arb-ed away in the future. I would not bet you will be the one presented with the latter.
Follow me on Twitter @nprotasoni