The objective is to create a realistic investment strategy based on several machine learning techniques covered in the master's course Machine Learning Applied to Financial Data at HEC Montréal.
This repository was built as part of a course web project, and the data and techniques assume rather unrealistic constraints that would make the strategy completely unprofitable and extremely risky in real life. The contents of this repository do not constitute financial advice in any way, and we disclaim all legal liability for the use or misuse of this strategy.
We use an automatic S&P 500 sub-universe selection algorithm based on economic sectors and the stocks with the largest market capitalisation. At the beginning of every month, we subset a window of "recent" past data (e.g., two years), and for each combination of sector and corresponding stocks, we perform backward linear feature selection, followed by Elastic Net regression to anticipate returns and a Random Forest to anticipate their direction (up or down). We filter assets based on anticipated returns, direction and the Sharpe ratio. With the help of a modified minimum-variance optimization framework based on a different set of time-series-based forecasts of returns and volatility, we rebalance the portfolio by going long according to the optimized weights. We then hold until the beginning of the next month, when we close all positions and run the strategy again.
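One of the filters above is the historical Sharpe ratio. As a minimal sketch, assuming weekly simple returns, a constant weekly risk-free rate, and annualisation with 52 periods per year (all three are assumptions for illustration, and `sharpe_ratio` is a hypothetical helper, not a function from this repository), it could be computed as follows:

```python
import math
import statistics


def sharpe_ratio(weekly_returns, weekly_risk_free=0.0, periods_per_year=52):
    """Annualised Sharpe ratio of a series of weekly returns.

    Assumptions (not taken from the repository): simple returns,
    a constant risk-free rate, and sqrt(52) annualisation.
    """
    excess = [r - weekly_risk_free for r in weekly_returns]
    mean_excess = statistics.fmean(excess)
    vol = statistics.stdev(excess)  # sample standard deviation
    if vol == 0:
        # A flat return series has no defined Sharpe ratio.
        return float("nan")
    return mean_excess / vol * math.sqrt(periods_per_year)


example = sharpe_ratio([0.01, 0.02, 0.03])
```

Stocks could then be ranked by this value alongside the model forecasts; the exact weighting used by the strategy is not implied by this sketch.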
The backtest spans a 5-year period, using a rolling 24-month window to forecast the metrics for the subsequent month. Our data, however, consists of weekly Wednesday adjusted closing prices. We benchmarked the performance against the S&P 500 index. As the graph below shows, the portfolio has consistently outperformed the index on a monthly basis. However, the portfolio also displays significant drawdowns, that is, large declines from peaks, which indicates high volatility and a high level of risk.
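The rolling-window scheme can be sketched as below. This is an illustration only: it assumes rebalancing on the first calendar day of each month and a training window made of every Wednesday in the preceding 24 months, and the helper names (`add_months`, `wednesdays_between`, `rolling_windows`) are hypothetical rather than taken from the repository.

```python
from datetime import date, timedelta


def add_months(d, months):
    """Shift a date to the first day of the month `months` away (may be negative)."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def wednesdays_between(start, end):
    """Return all Wednesdays d with start <= d < end."""
    # date.weekday(): Monday is 0, so Wednesday is 2.
    d = start + timedelta(days=(2 - start.weekday()) % 7)
    out = []
    while d < end:
        out.append(d)
        d += timedelta(days=7)
    return out


def rolling_windows(first_rebalance, n_months, lookback_months=24):
    """Yield (rebalance_date, training_dates) for each monthly rebalance."""
    for k in range(n_months):
        rebalance = add_months(first_rebalance, k)
        train_start = add_months(rebalance, -lookback_months)
        yield rebalance, wednesdays_between(train_start, rebalance)


windows = list(rolling_windows(date(2020, 1, 1), n_months=3))
```

Each training window ends strictly before its rebalance date, so no observation from the month being forecast leaks into the fit.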
As mentioned in the Trading Strategy section, every month we use a fixed window of past data, in this case 2 years (24 months). For each window, we extract the features for that range, and we also compute two "dynamic features", given by the following:
- Shifted SARIMA(p,d,q) features for different combinations of p, d, q. The goal is to forecast the realized returns 4 weeks ahead, so that these forecasts are added to the historical data as predictive features. SARIMA (Seasonal Autoregressive Integrated Moving Average) is a statistical time-series model that combines AR terms (autoregression on the target), MA terms (a moving average of past noise), differencing and seasonality modelling to approximate the behaviour of the stochastic process in question. The mathematical expression for this model is lengthy and is omitted here for brevity.
- Shifted GARCH(1,1) features. This feature reflects a 4-week-ahead volatility forecast for the asset. Generalized Autoregressive Conditional Heteroskedasticity (GARCH) with specification (1,1) can be described as follows:
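In its standard textbook form, the conditional variance $\sigma_t^2$ of the return shock $\epsilon_t$ evolves as shown here. The symbols $\omega$, $\alpha_1$ and $\beta_1$ are notation assumed for this illustration and are not taken from the project's code:

$$ \epsilon_t = \sigma_t z_t, \qquad \sigma_t^2 = \omega + \alpha_1 \epsilon_{t-1}^2 + \beta_1 \sigma_{t-1}^2 $$

where $z_t$ is i.i.d. with zero mean and unit variance, $\omega > 0$, $\alpha_1, \beta_1 \geq 0$, and $\alpha_1 + \beta_1 < 1$ for the process to be covariance stationary.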
These features have two goals: (i) to create informative predictive features for the regression and classification models, and (ii) to serve as the future-shifted mean-return and volatility vectors used in the minimum-variance framework, instead of the raw historical data.
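To make the 4-weeks-ahead volatility forecast concrete, here is a minimal sketch of the multi-step variance recursion for a GARCH(1,1) model whose parameters have already been estimated. The function name and arguments (`omega`, `alpha`, `beta`, `last_resid`, `last_var`) are hypothetical, and the parameter values in the example are made up; in practice they would come from fitting the model on the training window.

```python
import math


def garch11_variance_forecast(omega, alpha, beta, last_resid, last_var, horizon=4):
    """Conditional variance forecasts for steps 1..horizon of a GARCH(1,1).

    The parameters are assumed to be already estimated; this function
    only iterates the forecast recursion.
    """
    # One step ahead uses the last observed shock and conditional variance.
    var = omega + alpha * last_resid ** 2 + beta * last_var
    forecasts = [var]
    # Further ahead, the expected squared shock equals the forecast variance,
    # so the ARCH and GARCH terms merge into (alpha + beta) * var.
    for _ in range(horizon - 1):
        var = omega + (alpha + beta) * var
        forecasts.append(var)
    return forecasts


# Made-up parameters, chosen so the long-run variance omega / (1 - alpha - beta) is 1.
variances = garch11_variance_forecast(0.1, 0.1, 0.8, last_resid=1.0, last_var=1.0)
vol_4w_ahead = math.sqrt(variances[-1])
```

As the horizon grows, the forecasts converge to the long-run variance `omega / (1 - alpha - beta)`, provided `alpha + beta < 1`.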
After feature selection, we fit an Elastic Net regression model, which combines the penalties of Lasso and Ridge regression in the following objective function:
$$ \underset{\beta}{\min} \dfrac{1}{N} \sum_{i=1}^{N}\ell (y_i, x_{i}^T\beta) + \lambda\left[ \alpha ||\beta||_1 + \dfrac{1}{2}(1-\alpha)||\beta||_{2}^{2} \right] $$
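As a rough illustration of how this objective is evaluated, the sketch below computes it for a given coefficient vector, with the ridge term added to the lasso term inside the bracket. It assumes a squared-error loss $\ell(y, \hat{y}) = (y - \hat{y})^2$, which the text leaves generic, and `elastic_net_objective` is a hypothetical helper rather than the repository's implementation.

```python
def elastic_net_objective(X, y, beta, lam, alpha):
    """Evaluate the Elastic Net objective for coefficients `beta`.

    X is a list of feature rows, y a list of targets.
    Squared-error loss is an assumption made for this sketch.
    """
    n = len(y)
    # Average loss over the N observations.
    loss = sum(
        (yi - sum(xij * bj for xij, bj in zip(xi, beta))) ** 2
        for xi, yi in zip(X, y)
    ) / n
    l1 = sum(abs(b) for b in beta)     # ||beta||_1  (Lasso part)
    l2_sq = sum(b * b for b in beta)   # ||beta||_2^2 (Ridge part)
    penalty = lam * (alpha * l1 + 0.5 * (1 - alpha) * l2_sq)
    return loss + penalty


value = elastic_net_objective(
    X=[[1.0, 0.0], [0.0, 1.0]], y=[1.0, 2.0], beta=[1.0, -2.0], lam=1.0, alpha=0.5
)
```

Setting `alpha=1` recovers the pure Lasso penalty and `alpha=0` the pure Ridge penalty, which is why `alpha` is usually tuned together with `lam`.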
Random Forest Model for Classification
Mathematically, a Random Forest can be described as a collection of decision trees
In the case of classification, with
That is, this is a majority vote over the predicted classes, taken across the random forest parameter space.
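The majority vote can be sketched in a few lines. This is a toy illustration: each "tree" is just a hypothetical callable returning a class label, and ties are broken by taking the smallest label, which is an arbitrary choice made for this sketch.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most frequent class label among the trees' predictions."""
    counts = Counter(tree_predictions)
    top = max(counts.values())
    # Tie-break by smallest label (arbitrary, for determinism only).
    return min(label for label, c in counts.items() if c == top)


def forest_predict(trees, x):
    """Classify x by majority vote over a collection of tree callables."""
    return majority_vote([tree(x) for tree in trees])


# Three toy decision stumps standing in for fitted trees.
toy_trees = [
    lambda x: "up" if x[0] > 0 else "down",
    lambda x: "up" if x[1] > 0 else "down",
    lambda x: "up" if x[0] + x[1] > 1 else "down",
]
prediction = forest_predict(toy_trees, [0.5, 0.2])
```

With the toy input above, two of the three stumps vote "up", so the forest predicts "up".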
In our case,
To choose the best performing stocks, we combine these methods along with the historical Sharpe ratio using the following heuristic:
- Hair Parra: Design, architecture, implementation, overall strategy, code cleaning and optimization, feature engineering, modelling and portfolio optimization, simulation, documentation, repository and project setup, final report.
- Prateek: Data collection, feature engineering, report.
- Xiao Xue: Modelling, testing, experimentation, final report.
- Kriti: Portfolio optimization, final report.
- Sector $G$ contains tickers $\{S_1, S_2, \dots, S_{|G|}\}$, where $|G|$ is the number of stocks per sector (before selection).
- For each ticker, we want to calculate the current window:
e.g. with