<!DOCTYPE html>
<html>
  <head>
    <title>L2F</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    <link rel="stylesheet" href="fonts/quadon/quadon.css">
    <link rel="stylesheet" href="fonts/gentona/gentona.css">
    <link rel="stylesheet" href="slides_style.css">
  </head>
  <body>
    <textarea id="source">


class: center, middle

name:opening

## Lifelong Learning Forests


.center[


Joshua T. Vogelstein
<br>
[jovo@jhu.edu](mailto:jovo@jhu.edu)
| <http://brainx.io/L2F>
<br><br><br>
<img src="images/logo_jhu.png" STYLE="HEIGHT:100px;"/>

]

---

## Goals of this Talk

- Motivation
- Background
- Intuition
- Applications
- Everyone in the room understands

Please ask questions!

---
class: center, middle

# Motivation


---


<video width="320" height="550" controls>
  <source src="images/lion_knocking.mp4" type="video/mp4">
</video>

---

### A motivating example: psychopathy

- .blue[truth]: this guy is a psychopath
- .blue[sample]: psychopath index is high
- .blue[action]: kill him, keep him in jail, or release him
- .blue[loss]: cost of jailing
- .blue[risk]: false positives are quite bad!

Goal: release people if they are not "too" dangerous

--

But then....

- .blue[truth]: cured of psychopathy
- .blue[sample]: new index, now says low
- .blue[action]: can start treatment
- .blue[loss]: cost of treatment
- .blue[risk]: treating people that don't get better is not so bad


Goal: Can the judge (.blue[learner]) update her ruling?

---
class: center, middle

# Background


---

### Statistical Decision Theory

| object | space | definition |
|:--- |:--- |:--- |
| $\xi \sim P_\xi$ | $\Xi$ | .blue[true] latent state |
| $z   \sim  P_{z \vert \xi} = h(\xi)$ | $ \mathcal{Z}$ | noisily .blue[sampled] state |
| $a$ |  $\mathcal{A}$ | possible .blue[actions] |
| $g: \mathcal{Z} \to \mathcal{A}$ |  $\mathcal{G}$ | decision rule |
| $\ell(g(z), a) \to \mathbb{R}_+$ | $\mathcal{L}$ | .blue[loss] function |
| $f_n: \mathcal{Z}^n \times \mathcal{L} \to \mathcal{G}$ | $\mathcal{F}$ | .blue[learner] |
| $\mathbb{R}_P [ \ell (\hat{g}(z; f_n), a)]$ | $\mathcal{R}$ | .blue[risk] functional |


- Let $\varepsilon_n = \mathbb{R}[f_n] - \mathbb{R}^{*}$ be the .blue[residual]
- Goal: choose $f_n$ such that for many $P \in \mathcal{P}$,
$\varepsilon_n \rightarrow 0$
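
---

### Sketch: the Residual in a Toy Problem

A hedged illustration, not part of the theory above. Assumptions: $\xi \in \{0,1\}$ with equal probability, $h$ adds Gaussian noise with `SIGMA = 1`, the loss is 0-1, and the learner puts a threshold midway between the two class means. Under these assumptions the risk has a closed form, so the residual is exact.

```python
import random
from statistics import NormalDist, mean

SIGMA = 1.0  # assumed noise level of the sampling map h

def risk(threshold):
    """Exact 0-1 risk of the rule g(z) = 1 if z > threshold, else 0."""
    noise = NormalDist(0.0, SIGMA)
    false_pos = 1.0 - noise.cdf(threshold)   # xi = 0 but z > threshold
    false_neg = noise.cdf(threshold - 1.0)   # xi = 1 but z <= threshold
    return 0.5 * false_pos + 0.5 * false_neg

def learner(n, rng):
    """f_n: threshold at the midpoint of the two empirical class means."""
    zs = {0: [], 1: []}
    for _ in range(n):
        xi = rng.randint(0, 1)
        zs[xi].append(rng.gauss(xi, SIGMA))
    if not zs[0] or not zs[1]:
        return 0.0  # arbitrary fallback when one class is unobserved
    return 0.5 * (mean(zs[0]) + mean(zs[1]))

rng = random.Random(0)
bayes_risk = risk(0.5)  # by symmetry the optimal threshold is 0.5
residuals = {n: risk(learner(n, rng)) - bayes_risk for n in (10, 100, 10000)}
```

`residuals[n]` plays the role of $\varepsilon_n$: it is never negative and shrinks as $n$ grows.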

---

## Lifelong Learning Theory


- At $t=t'$, any of the following may change: $P\_{\xi}, h, \mathcal{Z}, \mathcal{A}, \ell, \mathbb{R}$
- We know when any of $\mathcal{Z}, \mathcal{A}, \ell, \mathbb{R}$ changes
- But not necessarily whether $P=P\_{z,\xi}$ changes
- Goal: choose $f_n$ such that, for many $P \in \mathcal{P}$, the following hold:

<!-- 1. $\varepsilon^{t<t^*}_n \rightarrow 0$ for many $P \in \mathcal{P}$
2. -->

| constraint | definition |
|:--- |:--- |
| $\varepsilon_n(t) \rightarrow 0$ | consistency |
| $\varepsilon_n(t>t') = \varepsilon_n(t)$ | no catastrophic forgetting |
| $\varepsilon_{n+m}(t>t') \rightarrow 0$ | continual learning |
| $\\#(f_{n+m}) - \\#(f_n)  \in \mathcal{O}(\log m)$ | space efficient |


What class of $f_n$ might have these properties?


---

##  Random Forests Might!

- Excellent empirical performance
  - Caruana et al. 2006 (ICML): "With excellent performance on all eight metrics, calibrated boosted trees were the best learning algorithm overall. .blue[Random forests] are close second."
  - Caruana et al. 2008 (ICML): "the method that performs consistently well across all dimensions is .blue[random forests]."
  - Delgado et al. 2014 (JMLR): "The classifiers most likely to be the bests are the .blue[random forest]."

--
- Strong theoretical performance
  - .blue[consistent]: $\varepsilon_n(RF_n)  \rightarrow 0$ for any $P \in \mathcal{P}$
  - .blue[space efficient]: $\\#(RF_n) \in \mathcal{O}(T n )$


---

### Brief Introduction to Random Forests

- Let $z_i=(x_i,y_i)$, where $ x_i \in \mathbb{R}^p$, $y_i \in \{0,1\}$
- Given $z_i$ for $i \in [n]$,
- For T trees:
  - Subsample $n' < n$ samples
  - At each node $j$, select  feature $d_j$ and threshold $\tau_j$ to split
  - For each child of $j$, repeat until a stopping criterion is met
  - Tree is encoded by $( d_j,\tau_j )_j$

All the magic is in choosing the $d_j$'s and $\tau_j$'s
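
---

### Sketch: a Tiny Random Forest on XOR

A minimal sketch of the loop above, not the implementation behind any figure in this talk. Assumptions: $d_j$ is drawn uniformly at random, $\tau_j$ minimizes weighted Gini impurity, a fixed `max_depth` is the stopping rule, and a tree is encoded as nested tuples `(d, tau, left, right)`. All names are illustrative.

```python
import random

def gini(labels):
    """Gini impurity of a non-empty list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def build_tree(data, rng, depth=0, max_depth=10):
    """Return a leaf (fraction of class 1) or a node (d_j, tau_j, left, right)."""
    labels = [y for _, y in data]
    if depth == max_depth or len(set(labels)) < 2:
        return sum(labels) / len(labels)
    d = rng.randrange(len(data[0][0]))            # feature d_j, chosen at random
    values = sorted({x[d] for x, _ in data})
    best = None
    for lo, hi in zip(values, values[1:]):        # candidate thresholds tau_j
        tau = 0.5 * (lo + hi)
        left = [(x, y) for x, y in data if x[d] <= tau]
        right = [(x, y) for x, y in data if x[d] > tau]
        if not left or not right:
            continue
        score = (len(left) * gini([y for _, y in left])
                 + len(right) * gini([y for _, y in right]))
        if best is None or score < best[0]:
            best = (score, tau, left, right)
    if best is None:                              # no valid split on this feature
        return sum(labels) / len(labels)
    _, tau, left, right = best
    return (d, tau, build_tree(left, rng, depth + 1, max_depth),
            build_tree(right, rng, depth + 1, max_depth))

def forest_predict(forest, x):
    """Average the leaf estimates of P(y = 1 | x) over all trees."""
    total = 0.0
    for tree in forest:
        while isinstance(tree, tuple):            # descend until a leaf is reached
            d, tau, left, right = tree
            tree = left if x[d] <= tau else right
        total += tree
    return total / len(forest)

rng = random.Random(0)
xs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(300)]
data = [(x, int((x[0] > 0) != (x[1] > 0))) for x in xs]               # XOR labels
forest = [build_tree(rng.sample(data, 150), rng) for _ in range(20)]  # n' < n
accuracy = sum((forest_predict(forest, x) > 0.5) == bool(y) for x, y in data) / len(data)
```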

---

### A Simple Example: XOR


<img src="images/sample_XOR.png" style="height: 200px;"/>

--

<img src="images/Fig1_posteriors.svg" style="height: 200px;"/>


---
class: center, middle

# Intuition

---

### Scenario 1: Semisupervised

- Given $(x_i,y_i)$ for $i \in [n]$, $\quad x_i \in \mathbb{R}^p$, $y_i \in \{0,1\}$
- After time $t'$, given $x_i$ for $i = n+1, \ldots, n+m$
- Assume (for now) that $P_x(t < t') = P_x(t>t')$

--

### What would Tukey do?

---

### Strategy \# 1:  Update $\tau_j$'s

- ".blue[Thm]" (Telgarsky-Vattani):
  - The set of local optima of Hartigan’s method is a (possibly strict) subset of local optima of Lloyd’s method.


- .blue[Algorithm]:
  - Use Hartigan's method to recursively update $\tau_j$'s

--


- ".blue[Conjecture 1]":
  - Updating $\tau\_j$'s this way yields $\varepsilon'\_{n+m} < \varepsilon\_{n}$
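
---

### Sketch: Hartigan Updates for One Threshold

A hedged sketch of the idea for a single node, assuming two clusters (the two children) and that `values` holds the node's feature $d_j$ for all $n+m$ points, labeled or not. The function name is illustrative. A point moves to the other child only if the move strictly lowers the 2-means cost, which is the single-point rule of Hartigan's method.

```python
def hartigan_threshold(values, tau):
    """Refine a 1-D split threshold with Hartigan's single-point moves (k = 2)."""
    side = [int(v > tau) for v in values]         # 0 = left child, 1 = right child
    if len(set(side)) < 2:
        return tau                                # one child is empty: keep tau
    moved = True
    while moved:
        moved = False
        for i, v in enumerate(values):
            a, b = side[i], 1 - side[i]
            A = [u for u, s in zip(values, side) if s == a]
            B = [u for u, s in zip(values, side) if s == b]
            if len(A) < 2:
                continue                          # never empty a cluster
            mu_a, mu_b = sum(A) / len(A), sum(B) / len(B)
            removal_gain = len(A) / (len(A) - 1) * (v - mu_a) ** 2
            insertion_cost = len(B) / (len(B) + 1) * (v - mu_b) ** 2
            if insertion_cost < removal_gain:     # move strictly lowers the cost
                side[i] = b
                moved = True
    left = [u for u, s in zip(values, side) if s == 0]
    right = [u for u, s in zip(values, side) if s == 1]
    return 0.5 * (sum(left) / len(left) + sum(right) / len(right))

new_tau = hartigan_threshold([0.0, 0.1, 0.2, 0.9, 1.0, 1.1], tau=0.15)
```

Here the initial threshold 0.15 cuts through the left group; the update moves it to the midpoint of the two cluster means.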

---

### Strategy \# 2:  Make trees deeper


- ".blue[Thm]" (DGL): For RF to be consistent:
  - Let $A(x)$ denote the "cell" containing $x$
  - diam$(A(x)) \rightarrow 0$ in probability
  - \# of points in $A(x) \rightarrow \infty$ in probability


- .blue[Algorithm]
  - Pass each new point down the tree,
  - At leaf node, decide whether to continue splitting
  - If so, split as per usual


- ".blue[Conjecture 2]"
    - Making trees deeper this way yields $\varepsilon'\_{n+m} < \varepsilon\_{n}$
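
---

### Sketch: Growing a Tree with New Points

A hedged sketch of the algorithm on the previous slide. The splitting rule is an assumption, since the slide leaves it open: a leaf splits once it holds more than `max_leaf` points, at the median of its widest dimension. This uses only $x$, so it works for unlabeled points; label estimates at the leaves are left out. `Node` is an illustrative class, not an existing API.

```python
class Node:
    def __init__(self, points):
        self.points = list(points)    # x's that have reached this leaf
        self.d = self.tau = self.left = self.right = None

    def insert(self, x, max_leaf=4):
        """Pass x down the tree, then decide whether its leaf should split."""
        node = self
        while node.d is not None:     # descend to a leaf
            node = node.left if x[node.d] <= node.tau else node.right
        node.points.append(x)
        if len(node.points) > max_leaf:
            node.split()

    def spread(self, j):
        column = [p[j] for p in self.points]
        return max(column) - min(column)

    def split(self):
        d = max(range(len(self.points[0])), key=self.spread)   # widest dimension
        values = sorted(p[d] for p in self.points)
        mid = len(values) // 2
        tau = 0.5 * (values[mid - 1] + values[mid])             # median cut
        left = [p for p in self.points if p[d] <= tau]
        right = [p for p in self.points if p[d] > tau]
        if left and right:            # split only if both children are non-empty
            self.d, self.tau = d, tau
            self.left, self.right = Node(left), Node(right)
            self.points = []

root = Node([])
for i in range(5):
    root.insert((float(i), 0.0))
```

Splitting on counts and widths mirrors the two conditions in the theorem: cells shrink while each keeps accumulating points.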

---

### Strategy \# 3: Update $d_j$'s

- Some  algorithm parameters can be estimated
- Let's think of the process at each node as a .blue[random projection]
- At each node, sample $A \sim F_A$

<img src="images/RF1.png" style="width: 100%;"/>


---

#### Strategy \# 3: Update $d_j$'s


- ".blue[Thm]" (Bayes): posterior = likelihood  $\times$ prior
  - $F_A$ has a low-dimensional parameterization
  - $d$ is the \# of non-zeros in the matrix
  - Option A: estimate .blue[which $d$'s] are doing well
  - Option B: estimate .blue[which dimensions] are doing well


- .blue[Algorithm]:
  - build T trees using a prior distribution over $d$
  - sample different $d$'s for each node
  - estimate posterior over $d$/dimensions
  - build T' more trees using posterior

- ".blue[Conjecture 3]":
  - Making trees stronger this way yields $\varepsilon'\_{n} < \varepsilon\_{n}$
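
---

### Sketch: Posterior over $d$

A hedged sketch of Option A. Assumptions: each node reports the $d$ it sampled and a yes/no flag for whether its split "did well" (how to judge that is left open here), and the prior is a set of Dirichlet-style pseudo-counts. Function names and the example records are made up for illustration.

```python
import random

def posterior_over_d(prior_counts, node_records):
    """Add one pseudo-count to d for every node whose split helped, then normalize."""
    counts = dict(prior_counts)
    for d, helped in node_records:
        if helped:
            counts[d] += 1.0
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}

def sample_d(weights, rng):
    """Draw one d for a new node according to the given weights."""
    ds = list(weights)
    return rng.choices(ds, weights=[weights[d] for d in ds], k=1)[0]

prior = {1: 1.0, 2: 1.0, 4: 1.0}    # uniform prior pseudo-counts over d
records = [(1, False), (2, True), (2, True), (4, False), (2, True), (1, True)]
posterior = posterior_over_d(prior, records)

rng = random.Random(0)
new_ds = [sample_d(posterior, rng) for _ in range(5)]   # d's for the T' new trees
```

With these records the posterior weights are 2/7, 4/7, and 1/7 for $d = 1, 2, 4$.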

---

#### Strategy \# 4: Make trees stronger & less correlated

- ".blue[Thm]" (Breiman): RF error is bounded by a function of tree "strength" and "correlation"


- .blue[Algorithm]
  - Add more parameters to  $F_A$
  - e.g., relax constraint of 1 non-zero per row

<img src="images/RerF1.png" style="width: 60%;"/>

- ".blue[Conjecture 4]"
  - Making trees stronger this way yields $\varepsilon'\_{n} < \varepsilon\_{n}$


---

#### $\hat{R}\_{RerF}(n) < \hat{R}\_{RF}(n)$

-  ~100 benchmark datasets from Delgado et al.
-  For continuous data, RerF is significantly better

<img src="images/error_histogram_benchmarks.svg" style="height: 400px;"/>


---

#### Strategy \# 5: Make trees learn dictionaries


- Putting an identity matrix in there changes nothing

<img src="images/RF_Dict1.png" style="width: 100%;"/>

---

#### Strategy \# 5: Make trees learn dictionaries

- Changing that matrix enables RF to choose a dictionary

<img src="images/RF_Dict2.png" style="width: 100%;"/>


---

#### Strategy \# 5: Make trees learn dictionaries

- Choose dictionary based on prior information

<img src="images/RF_Dict3.png" style="width: 100%;"/>

---

#### Strategy \# 5: Make trees learn dictionaries

- ".blue[Thm]" (Sulam):
  - Dictionary learning pursuit achieves the global optimum of the biconvex problem


- .blue[Algorithm]
  - Choose low-dimensional dictionary based on prior knowledge
  - Update posterior over dictionary after training


- ".blue[Conjecture 5]":
  - Making trees stronger this way yields $\varepsilon'\_{n} < \varepsilon\_{n}$


---

#### $\hat{R}\_{S-RerF}(n) < \hat{R}\_{RerF}(n)$


<img src="images/image_stripes.svg" style="width: 100%;"/>


---

### Scenario 2: "Covariate Shift" Learning

- Given $(x_i,y_i)$ for $i \in [n]$, $\quad x_i \in \mathbb{R}^p$, $y_i \in \{0,1\}$
- After time $t'$, given additional $x_i$ for $i = n+1, \ldots, n+m$
- Assume that we do not know whether $$P_x(t < t') \neq P_x(t>t')$$

---

### Strategy \# 6: Weight trees

- Learn $S$ trees under $P_x(t< t')$
- Learn $S'$ trees under $P_x(t>t')$
- For a new sample $x$, let $\hat{y}= \alpha RF\_{\mathcal{S}}(x) + (1-\alpha) RF\_{\mathcal{S}'}(x)$

The magic is in how to choose $\alpha$, which depends on:

- the difference between $n$ and $m$
- the difference between $P_x(t < t')$ and $P_x(t>t')$
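
---

### Sketch: Mixing the Two Forests

A hedged sketch of the weighted prediction. The rule for $\alpha$ below is a hypothetical heuristic, not a result from this talk: weight the old forest by its share of the samples, discounted by an assumed shift score in $[0, 1]$ (0 means no change in $P_x$, 1 means a complete change). Both function names are illustrative.

```python
def choose_alpha(n, m, shift):
    """Hypothetical heuristic: sample share of the old data, discounted by shift."""
    if not 0.0 <= shift <= 1.0:
        raise ValueError("shift must lie in [0, 1]")
    return (n / (n + m)) * (1.0 - shift)

def combine(p_old, p_new, alpha):
    """y_hat = alpha * RF_S(x) + (1 - alpha) * RF_S'(x)."""
    return alpha * p_old + (1.0 - alpha) * p_new

alpha_same = choose_alpha(300, 100, shift=0.0)      # no shift: trust old trees by sample share
alpha_changed = choose_alpha(300, 100, shift=1.0)   # full shift: ignore the old trees
y_hat = combine(p_old=0.9, p_new=0.1, alpha=alpha_same)
```

Any rule with the same two inputs could be swapped in; the point is only that $\alpha$ shrinks as the shift grows or as $m$ overtakes $n$.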


---

### Strategy \# 6: Weight trees


- ".blue[Thm]" (Shen): Our multiscale graph correlation (MGC) test has
$$\varepsilon_n(MGC) \leq \varepsilon_n(global)$$

- .blue[Algorithm]
  - Hypothesis test: $H_0: P_x(t < t') = P_x(t>t')$
  - Effect size estimate: $|| P_x(t < t') - P_x(t>t')||$
  - Estimate $\alpha$

- ".blue[Conjecture 6]"
  - $\forall \,  P\_x(t < t'), P\_x(t>t')$,   with high probability
  $$\varepsilon\_{n}^{\mathcal{S}, \mathcal{S'}}(\hat{\alpha}) < \varepsilon\_{n}^{\mathcal{S'}}$$

---

#### $\hat{R}\_{MGC}(n) < \hat{R}\_{global}(n)$


<img src="images/FigSimVisual.svg" style="width: 100%;"/>

---

#### $\hat{R}\_{MGC}(n) < \hat{R}\_{global}(n)$


<img src="images/FigHDPower.svg" style="width: 100%;"/>

---

### Scenario 3: The Whole Megilla

- Given $(x_i,y_i)$ for $i \in [n]$, $\quad x_i \in \mathbb{R}^p$, $y_i \in \{0,1\}$
- After time $t'$, given additional $x_i$ for $i = n+1, \ldots, n+m$
- .blue[Everything] might be changing dynamically; define the setting at any time as

$$\mathcal{S}\_t = (P\_{\xi}, h, \mathcal{Z}, \mathcal{A}, \ell, \mathbb{R})$$
- What to do?

---

### Strategy \# 7: Forest Demon

- ".blue[Thm]": we got nothing here :)


- .blue[Algorithm]
  -  A .y[Forest Demon] stores "meta-data" for each (tree, sample, setting) triple
  - The demon performs "random forestry", that is, chooses weights for each tree and sample based on the setting
   $$\hat{y} = \sum\_{k=1}^K \alpha^{\mathcal{S}}\_k T\_k(x)$$

- ".blue[Conjecture 7]"
  -  The forest demon solves L2M!
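
---

### Sketch: a Minimal Forest Demon

A hedged sketch of the weighted sum only. Assumptions: each tree is a callable returning an estimate for $x$, a setting is identified by a label such as `"before"`, and the weights are supplied rather than learned. `ForestDemon` and its methods are hypothetical names; how the demon should choose the weights is the open question.

```python
class ForestDemon:
    """Hypothetical bookkeeping: one weight per tree for each known setting."""

    def __init__(self, trees):
        self.trees = list(trees)     # each tree is a callable x -> estimate
        self.weights = {}            # setting label -> list of alpha_k
        self.meta = {}               # (tree index, sample id, setting) -> notes

    def set_weights(self, setting, alphas):
        if len(alphas) != len(self.trees):
            raise ValueError("need one weight per tree")
        self.weights[setting] = list(alphas)

    def record(self, tree_index, sample_id, setting, note):
        """Store meta-data for one (tree, sample, setting) triple."""
        self.meta[(tree_index, sample_id, setting)] = note

    def predict(self, x, setting):
        """y_hat = sum over k of alpha_k (for this setting) times T_k(x)."""
        alphas = self.weights[setting]
        return sum(a * tree(x) for a, tree in zip(alphas, self.trees))

trees = [lambda x: 0.0, lambda x: 1.0, lambda x: float(x > 0)]
demon = ForestDemon(trees)
demon.set_weights("before", [0.5, 0.25, 0.25])
demon.set_weights("after", [0.0, 0.0, 1.0])
demon.record(2, "sample-17", "after", "tree trained after the change")
```

Switching the setting changes which trees count, without retraining or discarding any of them.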

---

### Phase 1: Demonstrate L2Forests Can Satisfy the Criteria

1. Continual learning (updating old trees)
2. Avoids catastrophic forgetting (building new trees)
3. Goal driven perception (demon is given a new setting)
4. Selective plasticity (demon determines the relationship between current and past contexts)
5. Monitoring and safety (demon keeps track of everything required)

### Phase 2: Extend to Multimodal Data

- Requires learning multimodal trees
- Requires building a demon to integrate cross-modal information


---
class: center, middle

# Applications

---
class:center, middle

<img src="images/kent1.jpg" style="height: 100%;"/>

---
### \#1: Characterizing Psychopathy with Changing Sensors

- 10 years of scanning psychopaths (multi-modality)
- Frequent scanner updates meticulously documented (.blue[$h$])
- Estimate recidivism using the first $n$ samples of $(x_i, y_i)$
- Can accuracy improve by including $m$ additional $x_i$'s?

<img src="images/KentF1.png" style="width: 80%;"/>

---
### \#2: Characterizing Personality in Non-Stationary Life Span Data

- Bringing it back to &#x1F981;
- Lifespan data: ages 8-80 (.blue[$P$])
- Certain phenotypes are fixed (e.g., ethnicity, gender, IQ)
- Multimodal brain measurements  are dynamic with age
- Learn function using $(x_i,y_i)$ of kids
- Continue improving using only $x_i$ of adults

---
### Batch Effects are a Mess in these data

<img src="images/figure_batch.svg" style="width: 100%;"/>


---
class:  top,  left


.pull-left[
Theory Collaborators:
- Carey E. Priebe
- Cencheng Shen
- Minh Tang
- Tyler Tomita
- James Browne
- Randal Burns
- Mincent K. Tanzynski?
- Karl Kumbier?

Data Collaborators:
- Kent Kiehl (MRN)
- Bruce Rosen (MGH)
]


.pull-left[

Love Collaborators:
- &#x1F981;
- &#x1f60b;
- &#x1f46a;
- &#x1F30D;
- &#x1f320;

Questions?
- [jovo@jhu.edu](mailto:jovo@jhu.edu)
- [neurodata.io](http://neurodata.io)

We are hiring!
]


<img src="images/funding/nsf_fpo.png" STYLE="position:absolute; TOP:550px; LEFT:10px; HEIGHT:100px;"/>

<img src="http://brainx.io/images/funding/nih_fpo.png" STYLE="position:absolute; TOP:550px; LEFT:120px; HEIGHT:100px;"/>

<img src="http://brainx.io/images/funding/darpa_fpo.png" STYLE="position:absolute; TOP:550px; LEFT:230px; HEIGHT:100px;"/>

<img src="http://brainx.io/images/funding/iarpa_fpo.jpg" STYLE="position:absolute; TOP:550px; LEFT:430px; HEIGHT:100px;"/>

<img src="http://brainx.io/images/funding/kavli_fpo.png" STYLE="position:absolute; TOP:550px; LEFT:550px; HEIGHT:100px;"/>

<img src="http://brainx.io/images/funding/kndi_fpo.png" STYLE="position:absolute; TOP:550px; LEFT:650px; HEIGHT:100px;"/>


    </textarea>
    <!-- <script src="https://gnab.github.io/remark/downloads/remark-latest.min.js"></script> -->
    <script src="remark-latest.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/katex.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/contrib/auto-render.min.js"></script>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/katex.min.css">
    <script type="text/javascript">
      var options = {};
      var renderMath = function() {
        renderMathInElement(document.body);
        // or if you want to use $...$ for math,
        renderMathInElement(document.body, {delimiters: [ // mind the order of delimiters(!?)
            {left: "$$", right: "$$", display: true},
            {left: "$", right: "$", display: false},
            {left: "\\[", right: "\\]", display: true},
            {left: "\\(", right: "\\)", display: false},
        ]});
      }
      var slideshow = remark.create(options, renderMath);

    </script>
  </body>
</html>