<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<title>Preface | Machine Learning for Factor Investing</title>
<meta name="author" content="Guillaume Coqueret and Tony Guida" />
<meta name="generator" content="placeholder" />
<meta property="og:title" content="Preface | Machine Learning for Factor Investing" />
<meta property="og:type" content="book" />
<meta name="twitter:card" content="summary" />
<meta name="twitter:title" content="Preface | Machine Learning for Factor Investing" />
<!-- JS -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.6/clipboard.min.js" integrity="sha256-inc5kl9MA1hkeYUt+EC3BhlIgyp/2jDIyBLS6k3UxPI=" crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/fuse.js/6.4.6/fuse.js" integrity="sha512-zv6Ywkjyktsohkbp9bb45V6tEMoWhzFzXis+LrMehmJZZSys19Yxf1dopHx7WzIKxr5tK2dVcYmaCk2uqdjF4A==" crossorigin="anonymous"></script>
<script src="https://kit.fontawesome.com/6ecbd6c532.js" crossorigin="anonymous"></script>
<script src="libs/header-attrs-2.11/header-attrs.js"></script>
<script src="libs/jquery-3.6.0/jquery-3.6.0.min.js"></script>
<link href="libs/bootstrap-4.6.0/bootstrap.min.css" rel="stylesheet" />
<script src="libs/bootstrap-4.6.0/bootstrap.bundle.min.js"></script>
<script src="libs/bs3compat-0.3.1/transition.js"></script>
<script src="libs/bs3compat-0.3.1/tabs.js"></script>
<script src="libs/bs3compat-0.3.1/bs3compat.js"></script>
<link href="libs/bs4_book-1.0.0/bs4_book.css" rel="stylesheet" />
<script src="libs/bs4_book-1.0.0/bs4_book.js"></script>
<script src="libs/kePrint-0.0.1/kePrint.js"></script>
<link href="libs/lightable-0.0.1/lightable.css" rel="stylesheet" />
<script src="https://cdnjs.cloudflare.com/ajax/libs/autocomplete.js/0.38.0/autocomplete.jquery.min.js" integrity="sha512-GU9ayf+66Xx2TmpxqJpliWbT5PiGYxpaG8rfnBEk1LL8l1KGkRShhngwdXK1UgqhAzWpZHSiYPc09/NwDQIGyg==" crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mark.js/8.11.1/mark.min.js" integrity="sha512-5CYOlHXGh6QpOFA/TeTylKLWfB3ftPsde7AnmhuitiTX4K5SqCLBeKro6sPS8ilsz1Q4NRx3v8Ko2IBiszzdww==" crossorigin="anonymous"></script>
<!-- CSS -->
</head>
<body data-spy="scroll" data-target="#toc">
<div class="container-fluid">
<div class="row">
<header class="col-sm-12 col-lg-3 sidebar sidebar-book">
<a class="sr-only sr-only-focusable" href="#content">Skip to main content</a>
<div class="d-flex align-items-start justify-content-between">
<h1>
<a href="index.html" title="">Machine Learning for Factor Investing</a>
</h1>
<button class="btn btn-outline-primary d-lg-none ml-2 mt-1" type="button" data-toggle="collapse" data-target="#main-nav" aria-expanded="true" aria-controls="main-nav"><i class="fas fa-bars"></i><span class="sr-only">Show table of contents</span></button>
</div>
<div id="main-nav" class="collapse-lg">
<form role="search">
<input id="search" class="form-control" type="search" placeholder="Search" aria-label="Search">
</form>
<nav aria-label="Table of contents">
<h2>Table of contents</h2>
<div id="book-toc"></div>
<div class="book-extra">
<p><a id="book-repo" href="#">View book source <i class="fab fa-github"></i></a></p>
</div>
</nav>
</div>
</header>
<main class="col-sm-12 col-md-9 col-lg-7" id="content">
<!--bookdown:title:end-->
<!--bookdown:title:start-->
<div id="preface" class="section level1 unnumbered">
<h1>Preface</h1>
<style>
.container-fluid main {
max-width: 60rem;
}
</style>
<p>This book is intended to cover some advanced modelling techniques applied to equity <strong>investment strategies</strong> that are built on <strong>firm characteristics</strong>. The content is threefold. First, we try to explain in simple terms the ideas behind most mainstream machine learning algorithms that are used in equity asset allocation. Second, we mention a wide range of academic references for the readers who wish to push a little further. Finally, we provide hands-on <strong>R</strong> code samples that show how to apply the concepts and tools to a realistic dataset, which we share to encourage <strong>reproducibility</strong>.</p>
<div id="what-this-book-is-not-about" class="section level2 unnumbered">
<h2>What this book is not about</h2>
<p>This book deals with machine learning (ML) tools and their applications in factor investing. Factor investing is a subfield of a large discipline that encompasses asset allocation, quantitative trading and wealth management. Its premise is that differences in the returns of firms can be explained by the characteristics of these firms. Thus, it departs from traditional analyses which rely on price and volume data only, like classical portfolio theory à la <span class="citation">Markowitz (<a href="solutions-to-exercises.html#ref-markowitz1952portfolio" role="doc-biblioref">1952</a>)</span>, or high frequency trading. For a general and broad treatment of Machine Learning in Finance, we refer to <span class="citation">Matthew F. Dixon, Halperin, and Bilokon (<a href="solutions-to-exercises.html#ref-dixon2020machine" role="doc-biblioref">2020</a>)</span>.</p>
<p>The topics we discuss are related to other themes that will not be covered in the monograph. These themes include:</p>
<ul>
<li>Applications of ML in <strong>other financial fields</strong>, such as <strong>fraud detection</strong> or <strong>credit scoring</strong>. We refer to <span class="citation">Ngai et al. (<a href="solutions-to-exercises.html#ref-ngai2011application" role="doc-biblioref">2011</a>)</span> and <span class="citation">Baesens, Van Vlasselaer, and Verbeke (<a href="solutions-to-exercises.html#ref-baesens2015fraud" role="doc-biblioref">2015</a>)</span> for general purpose fraud detection, to <span class="citation">Bhattacharyya et al. (<a href="solutions-to-exercises.html#ref-bhattacharyya2011data" role="doc-biblioref">2011</a>)</span> for a focus on credit cards and to <span class="citation">Ravisankar et al. (<a href="solutions-to-exercises.html#ref-ravisankar2011detection" role="doc-biblioref">2011</a>)</span> and <span class="citation">Abbasi et al. (<a href="solutions-to-exercises.html#ref-abbasi2012metafraud" role="doc-biblioref">2012</a>)</span> for studies on fraudulent financial reporting. On the topic of credit scoring, <span class="citation">G. Wang et al. (<a href="solutions-to-exercises.html#ref-wang2011comparative" role="doc-biblioref">2011</a>)</span> and <span class="citation">Brown and Mues (<a href="solutions-to-exercises.html#ref-brown2012experimental" role="doc-biblioref">2012</a>)</span> provide overviews of methods and some empirical results. Also, we do not cover ML algorithms for data sampled at higher (daily or intraday) frequencies (microstructure models, limit order book). The chapter from <span class="citation">Kearns and Nevmyvaka (<a href="solutions-to-exercises.html#ref-kearns2013machine" role="doc-biblioref">2013</a>)</span> and the recent paper by <span class="citation">Sirignano and Cont (<a href="solutions-to-exercises.html#ref-sirignano2019universal" role="doc-biblioref">2019</a>)</span> are good introductions on this topic.<br />
</li>
<li><strong>Use cases of alternative datasets</strong> that show how to leverage textual data from social media, satellite imagery, or credit card logs to predict sales, earnings reports, and, ultimately, future returns. The literature on this topic is still emerging (see, e.g., <span class="citation">Blank, Davis, and Greene (<a href="solutions-to-exercises.html#ref-blank2019using" role="doc-biblioref">2019</a>)</span>, <span class="citation">Jha (<a href="solutions-to-exercises.html#ref-jha2019implementing" role="doc-biblioref">2019</a>)</span> and <span class="citation">Z. T. Ke, Kelly, and Xiu (<a href="solutions-to-exercises.html#ref-ke2019predicting" role="doc-biblioref">2019</a>)</span>) but will likely blossom in the near future.<br />
</li>
<li><strong>Technical details</strong> of machine learning tools. While we do provide some insights into the specificities of some approaches (those we believe are important), the purpose of the book is not to serve as a reference manual on statistical learning. We refer to <span class="citation">Hastie, Tibshirani, and Friedman (<a href="solutions-to-exercises.html#ref-friedman2009elements" role="doc-biblioref">2009</a>)</span>, <span class="citation">Cornuejols, Miclet, and Barra (<a href="solutions-to-exercises.html#ref-cornuejols2011apprentissage" role="doc-biblioref">2018</a>)</span> (written in French), <span class="citation">James et al. (<a href="solutions-to-exercises.html#ref-james2013introduction" role="doc-biblioref">2013</a>)</span> (coded in R!) and <span class="citation">Mohri, Rostamizadeh, and Talwalkar (<a href="solutions-to-exercises.html#ref-mohri2018foundations" role="doc-biblioref">2018</a>)</span> for a general treatment of the subject.<a href="solutions-to-exercises.html#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> Moreover, <span class="citation">K.-L. Du and Swamy (<a href="solutions-to-exercises.html#ref-du2013neural" role="doc-biblioref">2013</a>)</span> and <span class="citation">Goodfellow et al. (<a href="solutions-to-exercises.html#ref-goodfellow2016deep" role="doc-biblioref">2016</a>)</span> are solid monographs on neural networks in particular, and <span class="citation">Sutton and Barto (<a href="solutions-to-exercises.html#ref-sutton2018reinforcement" role="doc-biblioref">2018</a>)</span> provide a self-contained and comprehensive tour of reinforcement learning.<br />
</li>
<li>Finally, the book does not cover methods of <strong>natural language processing</strong> (NLP) that can be used to evaluate sentiment which can in turn be translated into investment decisions. This topic has nonetheless been trending lately and we refer to <span class="citation">Loughran and McDonald (<a href="solutions-to-exercises.html#ref-loughran2016textual" role="doc-biblioref">2016</a>)</span>, <span class="citation">Cong, Liang, and Zhang (<a href="solutions-to-exercises.html#ref-cong2019analyzing" role="doc-biblioref">2019a</a>)</span>, <span class="citation">Cong, Liang, and Zhang (<a href="solutions-to-exercises.html#ref-cong2019textual" role="doc-biblioref">2019b</a>)</span> and <span class="citation">Gentzkow, Kelly, and Taddy (<a href="solutions-to-exercises.html#ref-gentzkow2019text" role="doc-biblioref">2019</a>)</span> for recent advances on the matter.</li>
</ul>
</div>
<div id="the-targeted-audience" class="section level2 unnumbered">
<h2>The targeted audience</h2>
<p>Who should read this book? This book is intended for two types of audiences. First, <strong>postgraduate students</strong> who wish to pursue their studies in quantitative finance with a view towards investment and asset management. The second target group consists of <strong>professionals from the money management industry</strong> who either seek to pivot towards allocation methods that are based on machine learning or are simply interested in these new tools and want to upgrade their set of competences. To a lesser extent, the book can serve <strong>scholars or researchers</strong> who need a manual with a broad spectrum of references both on recent asset pricing issues and on machine learning algorithms applied to money management. While the book covers mostly common methods, it also shows how to implement more exotic models, like causal graphs (Chapter <a href="causality.html#causality">14</a>), Bayesian additive trees (Chapter <a href="bayes.html#bayes">9</a>), and hybrid autoencoders (Chapter <a href="NN.html#NN">7</a>).</p>
<p>The book assumes basic knowledge in <strong>algebra</strong> (matrix manipulation), <strong>analysis</strong> (function differentiation, gradients), <strong>optimization</strong> (first and second order conditions, dual forms), and <strong>statistics</strong> (distributions, moments, tests, simple estimation methods like maximum likelihood). A minimal <strong>financial culture</strong> is also required: simple notions like stocks or accounting quantities (e.g., book value) will not be defined in this book. Lastly, all examples and illustrations are coded in R. A minimal familiarity with the language is sufficient to understand the code snippets, which rely heavily on the most common functions of the tidyverse (<span class="citation">Wickham et al. (<a href="solutions-to-exercises.html#ref-wickham2019welcome" role="doc-biblioref">2019</a>)</span>, www.tidyverse.org), and piping (<span class="citation">Bache and Wickham (<a href="solutions-to-exercises.html#ref-bache2014magrittr" role="doc-biblioref">2014</a>)</span>, <span class="citation">Mailund (<a href="solutions-to-exercises.html#ref-mailund2019pipelines" role="doc-biblioref">2019</a>)</span>).</p>
</div>
<div id="how-this-book-is-structured" class="section level2 unnumbered">
<h2>How this book is structured</h2>
<p>The book is divided into four parts.</p>
<p>Part I gathers preparatory material and starts with notations and data presentation (Chapter <a href="notdata.html#notdata">1</a>), followed by introductory remarks (Chapter <a href="intro.html#intro">2</a>). Chapter <a href="factor.html#factor">3</a> outlines the economic foundations (theoretical and empirical) of factor investing and briefly sums up the dedicated recent literature. Chapter <a href="Data.html#Data">4</a> deals with data preparation. It briefly recalls some basic tips and warns about some major issues.</p>
<p>Part II of the book is dedicated to predictive algorithms in supervised learning. Those are the most common tools that are used to forecast financial quantities (returns, volatilities, Sharpe ratios, etc.). They range from penalized regressions (Chapter <a href="lasso.html#lasso">5</a>), to tree methods (Chapter <a href="trees.html#trees">6</a>), encompassing neural networks (Chapter <a href="NN.html#NN">7</a>), support vector machines (Chapter <a href="svm.html#svm">8</a>) and Bayesian approaches (Chapter <a href="bayes.html#bayes">9</a>).</p>
<p>The next portion of the book bridges the gap between these tools and their applications in finance. Chapter <a href="valtune.html#valtune">10</a> details how to assess and improve the ML engines defined beforehand. Chapter <a href="ensemble.html#ensemble">11</a> explains how models can be combined and why that may often not be a good idea. Finally, one of the most important chapters (Chapter <a href="backtest.html#backtest">12</a>) reviews the critical steps of portfolio backtesting and mentions the mistakes that are frequently encountered at this stage.</p>
<p>The end of the book covers a range of advanced topics connected to machine learning more specifically. The first one is <strong>interpretability</strong>. ML models are often considered to be black boxes and this raises trust issues: how and why should one trust ML-based predictions? Chapter <a href="interp.html#interp">13</a> is intended to present methods that help understand what is happening under the hood. Chapter <a href="causality.html#causality">14</a> is focused on <strong>causality</strong>, which is both a much more powerful concept than correlation and also at the heart of many recent discussions in Artificial Intelligence (AI). Most ML tools rely on correlation-like patterns and it is important to underline the benefits of techniques related to causality. Finally, Chapters <a href="unsup.html#unsup">15</a> and <a href="RL.html#RL">16</a> are dedicated to non-supervised methods. The latter can be useful, but their financial applications should be wisely and cautiously motivated. <!-- Lastly, the final chapter (\@ref(NLP)) introduces standard approaches for the treatment of textual data. --></p>
</div>
<div id="companion-website" class="section level2 unnumbered">
<h2>Companion website</h2>
<p>This book is entirely available at <a href="http://www.mlfactor.com" class="uri">http://www.mlfactor.com</a>. It is important that not only the content of the book be accessible, but also the data and code that are used throughout the chapters. They can be found at <a href="https://github.com/shokru/mlfactor.github.io/tree/master/material" class="uri">https://github.com/shokru/mlfactor.github.io/tree/master/material</a>. The online version of the book will be updated beyond the publication of the printed version.</p>
</div>
<div id="why-r" class="section level2 unnumbered">
<h2>Why R?</h2>
<p>The supremacy of Python as <em>the</em> dominant ML programming language is a widespread belief. This is because almost all applications of deep learning (which is as of 2020 one of the most fashionable branches of ML) are coded in Python via TensorFlow or PyTorch.
The fact is that <strong>R</strong> has a <strong>lot</strong> to offer as well. First of all, let us not forget that one of the most influential textbooks in ML (<span class="citation">Hastie, Tibshirani, and Friedman (<a href="solutions-to-exercises.html#ref-friedman2009elements" role="doc-biblioref">2009</a>)</span>) is written by statisticians who code in R. Moreover, many statistics-orientated algorithms (e.g., BARTs in Section <a href="bayes.html#BART">9.5</a>) are primarily coded in R and not always in Python. The R offering in Bayesian packages in general (<a href="https://cran.r-project.org/web/views/Bayesian.html" class="uri">https://cran.r-project.org/web/views/Bayesian.html</a>) and in Bayesian learning in particular is probably unmatched.</p>
<p>There are currently several ML frameworks available in R.</p>
<ul>
<li><strong>caret</strong>: <a href="https://topepo.github.io/caret/index.html" class="uri">https://topepo.github.io/caret/index.html</a>, a compilation of more than 200 ML models;<br />
</li>
<li><strong>tidymodels</strong>: <a href="https://github.com/tidymodels" class="uri">https://github.com/tidymodels</a>, a recent collection of packages for ML workflow (developed by Max Kuhn at RStudio, which is a token of high quality material!);<br />
</li>
<li><strong>rtemis</strong>: <a href="https://rtemis.netlify.com" class="uri">https://rtemis.netlify.com</a>, a general purpose package for ML and visualization;<br />
</li>
<li><strong>mlr3</strong>: <a href="https://mlr3.mlr-org.com/index.html" class="uri">https://mlr3.mlr-org.com/index.html</a>, also a simple framework for ML models;<br />
</li>
<li><strong>h2o</strong>: <a href="https://github.com/h2oai/h2o-3/tree/master/h2o-r" class="uri">https://github.com/h2oai/h2o-3/tree/master/h2o-r</a>, a large set of tools provided by h2o (coded in Java);<br />
</li>
<li><strong>Open ML</strong>: <a href="https://github.com/openml/openml-r" class="uri">https://github.com/openml/openml-r</a>, the R version of the OpenML (www.openml.org) community.</li>
</ul>
<p>Moreover, via the <em>reticulate</em> package, it is possible (but not always easy) to benefit from Python tools as well. The most prominent example is the adaptation of the <em>tensorflow</em> and <em>keras</em> libraries to R. Thus, some very advanced Python material is readily available to R users. This is also true for other resources, like Stanford’s CoreNLP library (in Java) which was adapted to R in the package <em>coreNLP</em> (which we will not use in this book).</p>
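<p>To give a flavour of what this looks like in practice, here is a minimal sketch, assuming that the <em>reticulate</em> package is installed and that a Python installation with the <em>numpy</em> module can be found (neither is guaranteed on a given machine, hence the guard). The object names <em>np</em> and <em>res</em> are illustrative only.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">res &lt;- NULL                                              # Stays NULL if Python cannot be used
if (requireNamespace("reticulate", quietly = TRUE) &amp;&amp;    # Is the reticulate package installed?
    reticulate::py_module_available("numpy")) {          # Is the Python module numpy reachable?
  np  &lt;- reticulate::import("numpy")                     # Expose the Python module as an R object
  res &lt;- np$mean(c(1, 2, 3))                             # Call a Python function on an R vector
}
res                                                      # Result converted back to R (or NULL)</code></pre></div>
<p>The guard reflects the caveat above: the bridge to Python only works once a suitable Python environment is configured, which is the part that is not always easy.</p>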
</div>
<div id="coding-instructions" class="section level2 unnumbered">
<h2>Coding instructions</h2>
<p>One of the purposes of the book is to propose a large-scale tutorial of ML applications in financial predictions and portfolio selection. Thus, one keyword is <strong>REPRODUCIBILITY</strong>! In order to duplicate our results (up to possible randomness in some learning algorithms), you will need running versions of R and RStudio on your computer. The best books to learn R are also often freely available online. A short list can be found here <a href="https://rstudio.com/resources/books/" class="uri">https://rstudio.com/resources/books/</a>. The monograph <em>R for Data Science</em> is probably the most crucial.</p>
<p>In terms of coding requirements, we rely heavily on the <strong>tidyverse</strong>, which is a collection of <strong>packages</strong> (or libraries). The three packages we use most are <strong>dplyr</strong>, which implements simple data manipulations (filter, select, arrange), <strong>tidyr</strong>, which formats data in a tidy fashion, and <strong>ggplot2</strong>, for graphical outputs.</p>
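<p>As a minimal sketch of how these three packages fit together, the chunk below builds a small toy table, manipulates it, reshapes it and plots it. The data and the names <em>toy</em>, <em>toy_sub</em>, <em>toy_long</em> and <em>p</em> are purely illustrative assumptions and are not part of the book’s dataset.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(tidyverse)                                   # Loads dplyr, tidyr and ggplot2 (among others)

toy &lt;- tibble(                                       # Hypothetical data: 2 firms, 2 dates
  firm  = c("A", "A", "B", "B"),
  date  = as.Date(c("2020-01-31", "2020-02-29", "2020-01-31", "2020-02-29")),
  size  = c(1.2, 1.3, 0.4, 0.5),
  value = c(0.8, 0.7, 1.5, 1.6)
)

toy_sub &lt;- toy %&gt;%
  filter(size &gt; 1) %&gt;%                               # dplyr: keep rows with size above 1
  select(firm, date, size) %&gt;%                       # dplyr: keep three columns
  arrange(desc(date))                                # dplyr: sort by decreasing date

toy_long &lt;- toy %&gt;%
  pivot_longer(c(size, value),                       # tidyr: one row per firm-date-characteristic
               names_to = "characteristic",
               values_to = "level")

p &lt;- toy_long %&gt;%
  ggplot(aes(x = date, y = level, color = firm)) +   # ggplot2: map columns to aesthetics
  geom_line() +                                      # One line per firm
  facet_wrap(~ characteristic)                       # One panel per characteristic
p                                                    # Display the plot</code></pre></div>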
<p>A list of the packages we use can be found in Table <a href="preface.html#tab:packages">0.1</a> below. Packages with a star <span class="math inline">\(*\)</span> need to be installed via <em>bioconductor</em>.<a href="solutions-to-exercises.html#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a> Packages with a plus <span class="math inline">\(^+\)</span> need to be installed <strong>manually</strong>.<a href="solutions-to-exercises.html#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a></p>
<table>
<caption><span id="tab:packages">TABLE 0.1: </span> List of all packages used in the book.</caption>
<thead>
<tr class="header">
<th align="left"><em>Package</em></th>
<th align="left">Purpose</th>
<th align="center">Chapter(s)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left"><em>BART</em></td>
<td align="left">Bayesian additive trees</td>
<td align="center">10</td>
</tr>
<tr class="even">
<td align="left"><em>broom</em></td>
<td align="left">Tidy regression output</td>
<td align="center">5</td>
</tr>
<tr class="odd">
<td align="left"><em>CAM</em><span class="math inline">\(^+\)</span></td>
<td align="left">Causal Additive Models</td>
<td align="center">15</td>
</tr>
<tr class="even">
<td align="left"><em>caTools</em></td>
<td align="left">AUC curves</td>
<td align="center">11</td>
</tr>
<tr class="odd">
<td align="left"><em>CausalImpact</em></td>
<td align="left">Causal inference with structural time series</td>
<td align="center">15</td>
</tr>
<tr class="even">
<td align="left"><em>cowplot</em></td>
<td align="left">Stacking plots</td>
<td align="center">4 &amp; 13</td>
</tr>
<tr class="odd">
<td align="left"><em>breakDown</em></td>
<td align="left">Breakdown interpretability</td>
<td align="center">14</td>
</tr>
<tr class="even">
<td align="left"><em>dummies</em></td>
<td align="left">One-hot encoding</td>
<td align="center">8</td>
</tr>
<tr class="odd">
<td align="left"><em>e1071</em></td>
<td align="left">Support Vector Machines</td>
<td align="center">9</td>
</tr>
<tr class="even">
<td align="left"><em>factoextra</em></td>
<td align="left">PCA visualization</td>
<td align="center">16</td>
</tr>
<tr class="odd">
<td align="left"><em>fastAdaboost</em></td>
<td align="left">Boosted trees</td>
<td align="center">7</td>
</tr>
<tr class="even">
<td align="left"><em>forecast</em></td>
<td align="left">Autocorrelation function</td>
<td align="center">4</td>
</tr>
<tr class="odd">
<td align="left"><em>FNN</em></td>
<td align="left">Nearest Neighbors detection</td>
<td align="center">16</td>
</tr>
<tr class="even">
<td align="left"><em>ggpubr</em></td>
<td align="left">Combining plots</td>
<td align="center">11</td>
</tr>
<tr class="odd">
<td align="left"><em>glmnet</em></td>
<td align="left">Penalized regressions</td>
<td align="center">6</td>
</tr>
<tr class="even">
<td align="left"><em>iml</em></td>
<td align="left">Interpretability tools</td>
<td align="center">14</td>
</tr>
<tr class="odd">
<td align="left"><em>keras</em></td>
<td align="left">Neural networks</td>
<td align="center">8</td>
</tr>
<tr class="even">
<td align="left"><em>lime</em></td>
<td align="left">Interpretability</td>
<td align="center">14</td>
</tr>
<tr class="odd">
<td align="left"><em>lmtest</em></td>
<td align="left">Granger causality</td>
<td align="center">15</td>
</tr>
<tr class="even">
<td align="left"><em>lubridate</em></td>
<td align="left">Handling dates</td>
<td align="center">All (or many)</td>
</tr>
<tr class="odd">
<td align="left"><em>naivebayes</em></td>
<td align="left">Naive Bayes classifier</td>
<td align="center">10</td>
</tr>
<tr class="even">
<td align="left"><em>pcalg</em></td>
<td align="left">Causal graphs</td>
<td align="center">15</td>
</tr>
<tr class="odd">
<td align="left"><em>quadprog</em></td>
<td align="left">Quadratic programming</td>
<td align="center">12</td>
</tr>
<tr class="even">
<td align="left"><em>quantmod</em></td>
<td align="left">Data extraction</td>
<td align="center">4, 12</td>
</tr>
<tr class="odd">
<td align="left"><em>randomForest</em></td>
<td align="left">Random forests</td>
<td align="center">7</td>
</tr>
<tr class="even">
<td align="left"><em>rBayesianOptimization</em></td>
<td align="left">Bayesian hyperparameter tuning</td>
<td align="center">11</td>
</tr>
<tr class="odd">
<td align="left"><em>ReinforcementLearning</em></td>
<td align="left">Reinforcement Learning</td>
<td align="center">17</td>
</tr>
<tr class="even">
<td align="left"><em>Rgraphviz</em><span class="math inline">\(^*\)</span></td>
<td align="left">Causal graphs</td>
<td align="center">15</td>
</tr>
<tr class="odd">
<td align="left"><em>rpart</em> and <em>rpart.plot</em></td>
<td align="left">Simple decision trees</td>
<td align="center">7</td>
</tr>
<tr class="even">
<td align="left"><em>spBayes</em></td>
<td align="left">Bayesian linear regression</td>
<td align="center">10</td>
</tr>
<tr class="odd">
<td align="left"><em>tidyverse</em></td>
<td align="left">Environment for data science, data wrangling</td>
<td align="center">All</td>
</tr>
<tr class="even">
<td align="left"><em>xgboost</em></td>
<td align="left">Boosted trees</td>
<td align="center">7</td>
</tr>
<tr class="odd">
<td align="left"><em>xtable</em></td>
<td align="left">Table formatting</td>
<td align="center">4</td>
</tr>
</tbody>
</table>
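<p>For the starred and plus-marked packages, the following is only a sketch of one common way to proceed, not necessarily the procedure described in the footnotes. It assumes an internet connection, uses the <em>BiocManager</em> package for Bioconductor, and relies on a placeholder path (<em>local_archive</em>, hypothetical) for a source archive downloaded by hand.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Packages marked with a star: installation via Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")                  # BiocManager itself is available on CRAN
}
BiocManager::install("Rgraphviz")                  # Install a Bioconductor package

# Packages marked with a plus: manual installation from a local source archive
local_archive &lt;- "path/to/package.tar.gz"          # Hypothetical path, to be adapted
if (file.exists(local_archive)) {
  install.packages(local_archive, repos = NULL, type = "source")
}</code></pre></div>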
<p>Of all of these packages (or collections thereof), the <strong>tidyverse</strong> and <strong>lubridate</strong> are compulsory in almost all sections of the book. To install a new package in R, just type</p>
<p><code>install.packages("name_of_the_package")</code></p>
<p>in the console. Sometimes, because of function name conflicts (especially with the select() function), we use the syntax package::function() to make sure the function call is from the right source. The exact version of the packages used to compile the book is listed in the “<em>renv.lock</em>” file available on the book’s GitHub web page <a href="https://github.com/shokru/mlfactor.github.io" class="uri">https://github.com/shokru/mlfactor.github.io</a>. One minor comment is the following: while the functions <em>gather()</em> and <em>spread()</em> from the <em>tidyr</em> package have been superseded by <em>pivot_longer()</em> and <em>pivot_wider()</em>, we still use them because of their much more compact syntax.</p>
<p>As much as we could, we created short <strong>code chunks</strong> and commented each line whenever we felt it was useful. Comments are displayed at the end of a row and preceded by a single hashtag #.</p>
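<p>The two points above can be illustrated with a short sketch on a toy table. The data and the names <em>wide</em>, <em>sel</em>, <em>long_old</em> and <em>long_new</em> are hypothetical. The older and newer reshaping functions order rows differently, so the rows are sorted before being compared.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(tidyverse)

wide &lt;- tibble(firm  = c("A", "B"),                        # Hypothetical data in wide format
               size  = c(1.2, 0.4),
               value = c(0.8, 1.5))

sel &lt;- dplyr::select(wide, firm, size)                     # Explicit source: select() from dplyr

long_old &lt;- wide %&gt;%                                       # Older, compact syntax
  gather(key = "characteristic", value = "level", -firm)
long_new &lt;- wide %&gt;%                                       # Newer syntax
  pivot_longer(-firm, names_to = "characteristic", values_to = "level")

all.equal(as.data.frame(arrange(long_old, firm, characteristic)),   # Compare once rows are sorted
          as.data.frame(arrange(long_new, firm, characteristic)))</code></pre></div>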
<p>The book is constructed as a very big notebook, thus results are often presented below code chunks. They can be graphs or tables. Sometimes, they are simple numbers and are preceded with two hashtags ##. The example below illustrates this formatting.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><a href="preface.html#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="dv">1</span><span class="sc">+</span><span class="dv">2</span> <span class="co"># Example</span></span></code></pre></div>
<pre><code>## [1] 3</code></pre>
<p></p>
<p>The book can be viewed as a very big tutorial. Therefore, most of the chunks depend on previously defined variables. When replicating parts of the code (via online code), please make sure that <strong>the environment includes all relevant variables</strong>. A good practice is to always start by running all code chunks from Chapter <a href="notdata.html#notdata">1</a>. For the exercises, we often resort to variables created in the corresponding chapters.</p>
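<p>One simple way to check the environment before running a chunk is sketched below. The helper <em>missing_variables()</em> and the variable name <em>data_ml</em> are assumptions made for the illustration; the names to check depend on the chunk being replicated.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">missing_variables &lt;- function(vars, envir = globalenv()) {   # Hypothetical helper
  found &lt;- vapply(vars, exists, logical(1), envir = envir)   # Is each name defined?
  vars[!found]                                               # Return the names that are not
}

needed  &lt;- c("data_ml")                                      # Assumed name, adapt to the chunk
missing &lt;- missing_variables(needed)
if (length(missing) &gt; 0) {                                   # Warn before running the chunk
  message("Please first create: ", paste(missing, collapse = ", "))
}</code></pre></div>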
</div>
<div id="acknowledgments" class="section level2 unnumbered">
<h2>Acknowledgments</h2>
<p>The core of the book was prepared for a series of lectures given by one of the authors to master’s students in finance at EMLYON Business School and at the Imperial College Business School in the Spring of 2019. We are grateful to those students who asked fruitful questions and thereby contributed to improving the content of the book.</p>
<p>We are grateful to Bertrand Tavin and Gautier Marti for their thorough screening of the book. We also thank Eric André, Aurélie Brossard, Alban Cousin, Frédérique Girod, Philippe Huber, Jean-Michel Maeso, Javier Nogales and for friendly reviews; Christophe Dervieux for his help with bookdown; Mislav Sagovac and Vu Tran for their early feedback; John Kimmel for making this happen and Jonathan Regenstein for his availability, no matter the topic. Lastly, we are grateful for the anonymous reviews collected by John Kimmel, our original editor.</p>
</div>
<div id="future-developments" class="section level2 unnumbered">
<h2>Future developments</h2>
<p>Machine learning and factor investing are two immense research domains and the overlap between the two is also quite substantial and developing at a fast pace. The content of this book will always constitute a solid background, but it is naturally destined for obsolescence. Moreover, by construction, some subtopics and many references will have escaped our scrutiny. Our intent is to progressively improve the content of the book and update it with the latest ongoing research. We will be grateful for any comment that helps correct or update the monograph. Thank you for sending your feedback directly (via pull requests) on the book’s website which is hosted at <a href="https://github.com/shokru/mlfactor.github.io" class="uri">https://github.com/shokru/mlfactor.github.io</a>.</p>
</div>
</div>
</main>
<div class="col-md-3 col-lg-2 d-none d-md-block sidebar sidebar-chapter">
<nav id="toc" data-toggle="toc" aria-label="On this page">
<h2>On this page</h2>
<div id="book-on-this-page"></div>
<div class="book-extra">
<ul class="list-unstyled">
<li><a id="book-source" href="#">View source <i class="fab fa-github"></i></a></li>
<li><a id="book-edit" href="#">Edit this page <i class="fab fa-github"></i></a></li>
</ul>
</div>
</nav>
</div>
</div>
</div> <!-- .container -->
<footer class="bg-primary text-light mt-5">
<div class="container"><div class="row">
<div class="col-12 col-md-6 mt-3">
<p>"<strong>Machine Learning for Factor Investing</strong>" was written by Guillaume Coqueret and Tony Guida. It was last built on 2022-10-18.</p>
</div>
<div class="col-12 col-md-6 mt-3">
<p>This book was built by the <a class="text-light" href="https://bookdown.org">bookdown</a> R package.</p>
</div>
</div></div>
</footer>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
var src = "true";
if (src === "" || src === "true") src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML";
if (location.protocol !== "file:")
if (/^https?:/.test(src))
src = src.replace(/^https?:/, '');
script.src = src;
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
<script type="text/x-mathjax-config">const popovers = document.querySelectorAll('a.footnote-ref[data-toggle="popover"]');
for (let popover of popovers) {
const div = document.createElement('div');
div.setAttribute('style', 'position: absolute; top: 0; left: 0; width: 0; height: 0; overflow: hidden; visibility: hidden;');
div.innerHTML = popover.getAttribute('data-content');
var has_math = div.querySelector("span.math");
if (has_math) {
document.body.appendChild(div);
MathJax.Hub.Queue(["Typeset", MathJax.Hub, div]);
MathJax.Hub.Queue(function() {
popover.setAttribute('data-content', div.innerHTML);
document.body.removeChild(div);
})
}
}
</script>
</body>
</html>