
<!DOCTYPE html>
<html>

<head>
  <title>L2M 2yr</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <link rel="stylesheet" href="fonts/quadon/quadon.css">
  <link rel="stylesheet" href="fonts/gentona/gentona.css">
  <link rel="stylesheet" href="slides_style_i.css">
  <script type="text/javascript" src="assets/plotly/plotly-latest.min.js"></script>
</head>

<body>
  <textarea id="source">


<!-- TODO add slide numbers & maybe slide name -->

### Lifelong Learning Machines 


![:scale 60%](images/neurodata_blue.png)

Joshua T. Vogelstein | {[BME](https://www.bme.jhu.edu/),[CIS](http://cis.jhu.edu/), [ICM](https://icm.jhu.edu/), [KNDI](http://kavlijhu.org/)}@[JHU](https://www.jhu.edu/) | [neurodata](https://neurodata.io)
<br>
[jovo&#0064;jhu.edu](mailto:jovo@jhu.edu) | <http://neurodata.io/talks> | [@neuro_data](https://twitter.com/neuro_data)


---

## Outline 

- [Summary](#summary)
- [Definition](#def)
- [Scenarios](#scenarios)
- [Metrics](#metrics)
- [Algorithm](#alg)
- [Simulations](#sims)
- [Real](#real)
- [Theory](#theory)
- [Neurobiology](#neuro)
- [Discussion](#disc)


---

## Outline 

- Summary
- [Definition](#def)
- [Scenarios](#scenarios)
- [Metrics](#metrics)
- [Algorithm](#alg)
- [Simulations](#sims)
- [Real](#real)
- [Theory](#theory)
- [Neurobiology](#neuro)
- [Discussion](#disc)


---

## What is Lifelong Learning?

Lifelong learning occurs when 

1. there is a stream of .ye[continually changing] tasks, and 
2. performance improves by .ye[leveraging other tasks].

--


A system has .ye[weakly] lifelong learned if task performance improves .ye[on average], and has .ye[strongly] lifelong learned if task performance improves .ye[for each task].


--


Thus, the only way to lifelong learn is by .ye[transferring knowledge across tasks], ideally both .ye[forward] (to improve future task performance) and .ye[backward] (to improve past task performance).


---

## Key Accomplishments


1. Formalized Lifelong Learning as generalization of classical machine learning
1. Introduced novel evaluation criteria: forward and backward transfer efficiency
1. Proposed  omnidirectional transfer learning  framework by ensembling  representations
1. Implemented Lifelong Forests (LF) as a specific example 
1. Demonstrated LF uniquely exhibits 
    1. positive forward transfer 
    1. positive backward transfer 
    1. positive overall transfer
1. Conjectured theory promising to prove consistency and robustness
1. Described equivalence between Decision Forests and Deep Nets
1. Proposed extension to implement and improve via Deep Nets


---


## Current State of the Art 

Existing approaches used either 
1. fixed architecture/capacity, or
2. increasing capacity, with complicated architecture. 

But they did not demonstrate transfer (so perhaps did not lifelong learn).

---


##  Key Insights

1. Avoiding catastrophic forgetting only means that new data don't hurt performance on old tasks. Why stop there?
1. The key to lifelong learning is the ability to transfer knowledge from one setting to others (both past and future).
2. This can be achieved by
  1. learning new representation functions (as necessary)
  1. ensembling  representations to take actions.


Motivation: in biology, learning subsequent tasks often improves performance on past and future tasks.

The key is to update, with each sample, internal representations that are useful for multiple tasks.
  

---


## Proposed Metrics


**Transfer Efficiency**: The extent to which performance on a specific task improves by virtue of data from .ye[all other task data].

**Forward Transfer Efficiency**: The extent to which performance on a specific task improves by virtue of data from .ye[all past task data]. 

**Backward Transfer Efficiency**:  The extent to which performance on a specific task improves by virtue of data from .ye[all future task data]. 


---


## Key Claims


1. If you don't  transfer, you haven't lifelong learned, rather, you've .ye[sequentially compressed].
2. We propose the only algorithm in the literature that demonstrates weak or strong lifelong learning.


---
name:def

## Outline 

- [Summary](#summary)
- Definition
- [Scenarios](#scenarios)
- [Metrics](#metrics)
- [Algorithm](#alg)
- [Simulations](#sims)
- [Real](#real)
- [Theory](#theory)
- [Neurobiology](#neuro)
- [Discussion](#disc)


---

## What is Learning?


Given $n$ new data points in setting $\mathcal{S}$, assuming $P$, 

$f$ learns  when its performance $\mathcal{E}$ improves due to the data:

.center[$f$ learns when $\mathcal{E}(f_n) < \mathcal{E}(f_0)$.]

--


$\mathcal{E}(f_0)$ is the  performance of $f$ prior to seeing $n$ new data points. 


---

## What is Learning?


Given $n$ new .ye[data] points in .ye[setting] $\mathcal{S}$, .ye[assuming] $P$, 

.ye[$f$] learns  when its .ye[performance] $\mathcal{E}$ improves due to the data:

.center[$f$ learns when $\mathcal{E}(f_n) < \mathcal{E}(f_0)$.]

$\mathcal{E}(f_0)$ is the  performance of $f$ prior to seeing $n$ new data points. 


---
 

## What are data?

- $z_i \in \mathcal{Z}$ for $i \in [n]$ are samples
- Classification Example
  -  $Z_i = (X_i,Y_i)$ where $\mathcal{X}=\mathbb{R}^p$ and $\mathcal{Y}=\lbrace 0,1\rbrace$ 


---


## What is a Setting?


The setting is determined by the available resources:

-  .ye[Sample space]: $\mathcal{Z}$,  determined by available sensors
  - e.g., scalars, d-dimensional vectors, networks
- .ye[Action space]: $\mathcal{A}$,  determined by available actuators
  - e.g., {&rarr;, &larr;, &uarr;, &darr;, A,B}, {reject, fail to reject}, $\mathbb{R}$
- .ye[Query space]: $\mathcal{Q}$, determined by system's "interface"
  - e.g., in which cluster is $z$? what is this object?
- .ye[Constraints]: $\mathcal{C}$,  determined by hardware, time, money, subject matter expertise
  - e.g., $\mathcal{O}(n)$ training time, k-sparse, 8 GB


.ye[Setting]: the tuple $\mathcal{S} :=  (\mathcal{Z}, \mathcal{A}, \mathcal{Q}, \mathcal{C})$ 


---


## What are the Assumptions?

These are required to have any theoretical performance guarantees, though they can be quite general:

- The data, $(Z_1,\ldots, Z_n) \in \mathcal{Z}^n$, are sampled from some true but unknown distribution $P_Z \in \, \mathcal{P}_Z$
- A query, $q \in \mathcal{Q}$ is sampled from some true but unknown distribution $P_Q \in \, \mathcal{P}_Q$ 
- An optimal action, $a \in \mathcal{A}$, given $q$ is sampled from some true but unknown distribution $P\_{A \mid Q} \in \, \mathcal{P}_{A \mid Q}$ 

Let $P = P\_Z \otimes P\_{A, Q} \in \, \mathcal{P}$ denote the joint distribution over samples, queries, and optimal actions.
  
  
---
 

## What is $f$?

We get to choose this, though we must respect the resource constraints defined by the setting:

- A .ye[hypothesis], $h : \mathcal{Q} \rightarrow \mathcal{A}$ takes an action on the basis of a query
- $f_n$ is a  .ye[learner], which maps from a subset of $n$ samples in $\mathcal{Z}$ to a hypothesis $h \in \mathcal{H}$
- $f=f_1, f_2, \ldots$ is a sequence of learners, called a learning .ye[algorithm] 

$$f \in \mathcal{F} = \lbrace  f_n : \mathcal{Z}^n \rightarrow \mathcal{H}\rbrace$$ 


--


- Supervised machine learning example
  - $f$ is *RandomForestClassifier.fit*
  - $h$ is *RandomForestClassifier.predict*


<!-- --- -->

<!-- 
## What are Constraints?

Provided by subject matter expect and available resources, including:
- Distributional constraints $P \in \, \mathcal{P}$, 
  - e.g, mixture of K Gaussians, or convex
- Decision rule (or hypothesis) constraints, $h \in \mathcal{H}$, 
  - e.g., $\mathcal{O}(1)$, or k-sparse 
- Learning rule constraints, $f \in \mathcal{F}$, 
  - e.g., $\mathcal{O}(n)$, or Decision stump 

-->

<!-- 
  Let $\mathcal{C}= \lbrace \mathcal{P}, \mathcal{H}, \mathcal{F} \rbrace$. 
-->

<!-- Let $\mathcal{S} = \lbrace \mathcal{Z}, \mathcal{Q}, \mathcal{A}, \mathcal{P}, \mathcal{H}, \mathcal{F} \rbrace$. --> 


<!-- --- -->


<!-- ## Constraints  -->

<!-- $\mathcal{P}$, $\mathcal{H}$, and $\mathcal{F}$ are sets of constraints on learning  -->

<!-- 
| Constraint | Example | 
| :--- | :--- 
| interpretability | hyperplanes or sparse 
| complexity | $\mathcal{O}(n)$
| memory | $< 1$ gigabyte of memory for a given dataset
| time | $< 1$ sec on a specific hardware configuration for a given dataset
| scalability| must operate on distributed storage/compute
| power | $< 1$ watt on a given system for a given dataset 
| price | $< 1$ USD on a given system for a given dataset  
| hardware | must run on iPhone X
 -->


---


## What is Performance?


- .ye[Loss] quantifies the error of a specific action $a$ taken by $h$ for a query $q$, $\mathcal{L} = \lbrace  \ell: \mathcal{A} \times \mathcal{A} \to \mathbb{R} \rbrace$, 
  - e.g., 0-1 loss: $ \ell(a, a') := \mathbb{I}[a \neq a']. $

<!-- $\mathsf{l}, \mathsf{R}, \mathsf{E}, \mathsf{h}, l, R, E, h$ -->

--

- .ye[Risk] quantifies the loss over the whole query sample space, $\mathcal{R} = \lbrace R : \mathcal{H} \times \mathcal{L} \times \mathcal{P}\_{Q, A} \to \mathbb{R} \rbrace$ 
  - We think of this only as a function of $h \in \mathcal{H}$, because the loss and distribution effectively index the function
  - e.g., expected loss $ R(h) := R(h; \ell, P\_{Q, A}) = \ \mathbb{E}\_{Q, A}[\ell(h(Q), A)]. $ 


---


## What is Performance?


- Performance, also called  generalization .ye[error], quantifies risk over the distribution of possible training datasets,  $\mathcal{E} : \mathcal{F} \times \mathcal{R} \times \mathcal{P}\_{Z}  \to \mathbb{R}$,  
  - We think of this only as a function of $f\_n \in \mathcal{F}$, for the same reason as above 
  - e.g., expected risk: 
$ \mathcal{E}(f\_n) := \mathcal{E}(f\_n; R, P\_Z) = \mathbb{E}\_Z[R(f\_n(Z))]. $

<!-- TODO@ronak i put a \cdot in there. ok? -->

---


## Has $f$ Learned?


Given $n$ new data points in setting $\mathcal{S}$, assuming $P$, 

$f$ learns  when its performance $\mathcal{E}$ improves due to the data:

.center[$f$ learns when $\mathcal{E}(f_n) < \mathcal{E}(f_0)$.]


$\mathcal{E}(f_0)$ is the  performance of $f$ prior to seeing $n$ new data points, and therefore a function of
  - prior on $\theta$ 
  - inductive bias of $\mathcal{H}$
  - estimation bias of $f$
  - model bias of $\mathcal{P}$
  - pre-training
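
---

## Has $f$ Learned? A Sketch

A minimal sketch of how one might check $\mathcal{E}(f_n) < \mathcal{E}(f_0)$ by simulation. Everything here is an illustrative assumption rather than part of the framework: the two-Gaussian distribution, the nearest-class-mean learner, the constant guess used for $f_0$, and the helper names `sample`, `learner`, `risk`, and `generalization_error` are all hypothetical.

```python
import random

def sample(n, rng):
    """Draw n labeled points: y is a fair coin, x ~ Normal(2y - 1, 1)."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        data.append((rng.gauss(2 * y - 1, 1.0), y))
    return data

def learner(train):
    """Map a training set to a hypothesis h (nearest class mean)."""
    if not train:
        return lambda x: 0  # f_0: no data yet, always guess class 0
    means = {}
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        means[label] = sum(xs) / len(xs) if xs else 0.0
    return lambda x: min((0, 1), key=lambda c: abs(x - means[c]))

def risk(h, test):
    """Empirical 0-1 risk of hypothesis h on held-out queries."""
    return sum(h(x) != y for x, y in test) / len(test)

def generalization_error(n, reps=200, n_test=500, seed=0):
    """Average risk over many independently drawn training sets of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        h = learner(sample(n, rng))
        total += risk(h, sample(n_test, rng))
    return total / reps

err_0 = generalization_error(0)    # performance before seeing data
err_n = generalization_error(50)   # performance after n = 50 points
has_learned = err_n < err_0
```

Here `has_learned` compares Monte Carlo estimates of the two errors, so it is only as reliable as the number of repetitions allows.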


<!-- 
## What is a Setting?

A setting is defined by a septuple $\mathcal{S} = \lbrace \mathcal{Z}, \mathcal{A}, \mathcal{L}, \mathcal{R}, \mathcal{P},  \mathcal{H},  \mathcal{F} \rbrace$

| Object | Notation | Example
|:--- |:--- |:--- | 
| Measurements  | $ \mathcal{Z}^n$ |  $\mathbb{R}^p \times \lbrace 0, 1 \rbrace$ |
| Actions |  $\mathcal{A}$ |  {↑,↓,&larr;, &rarr;,B,A,start}
| Loss  | $\mathcal{L}: \mathcal{A} \to \mathbb{R}_+$  | $ (\hat{y} - y_*)^2$
| Risk  | $\mathcal{R}: \mathcal{P} \times \mathcal{L}  \to \mathbb{R}_+$  | $\mathbb{E}_P[ \mathcal{L}(a)]$
| Distributions | $\mathcal{P} := \lbrace P_Z \rbrace$ | Gaussian
| Hypotheses  | $\mathcal{H} = \lbrace h: \mathcal{Z} \to \mathcal{A} \rbrace$  | hyperplanes
| Algorithms | $\mathcal{F} = \lbrace f : 2^{\mathcal{Z}^n} \to \mathcal{H} \rbrace$  | *RandomForest.fit*
 -->


---


## What is a Learning Task?

- Given
  - a sample size $n$
  - a setting $\mathcal{S}$
- Assume a true but unknown distribution $P \in \, \mathcal{P}$ 
- Find $f$ that minimizes generalization error

$$f^* = \arg \min\_{f} \, \mathcal{E}\_P(f_n).$$


---


## What is Transfer Learning?  


- Let 
  - $t_i \in \lbrace 0, 1 \rbrace$ label each sample, where $0$ denotes .ye[source] and $1$ denotes .ye[target] 
  - $\mathcal{Z}' = (\mathcal{Z},\lbrace 0,1 \rbrace)$
- Assume $(Z\_i,T\_i)$  is sampled iid from $P\_{Z,T}$, $Z\_i | T\_i=t \sim P\_t$, 
- Let  $P = P_{Z,T} \otimes P_Q \in \, \mathcal{P}'$
- Define a transfer learning algorithm $f$ as a sequence 

$$ \mathcal{F}_{TL} = \lbrace f_n : (\mathcal{Z} \times \color{yellow}{\lbrace 0,1 \rbrace })^n \rightarrow \mathcal{H} \rbrace$$

- Let  $f^t$ denote the learner that only sees samples where $t_i=t$. 
- Identify the appropriate performance function.


---


## Has $f$ Transfer Learned?


Given $n$ source and target data points in transfer learning setting $\mathcal{S}\_{TL}$, assuming $P$, 

$f$ .ye[transfer] learns  when its performance $\mathcal{E}$ improves due to the source data:

.center[$f$ transfer learns when $\mathcal{E}(f_n) < \, \mathcal{E}(f_n^1)$.]

---


## A Transfer Learning Task?  


- Given
  - a sample size $n$
  - a transfer learning setting $\mathcal{S} := \mathcal{S}\_{TL} = ( \mathcal{Z}', \mathcal{A}, \mathcal{Q}, \mathcal{C} )$
- Assume a true but unknown distribution $P \in \, \mathcal{P}'$
- Find $f$ that minimizes generalization error
  $$f^* = \arg \min\_{f} \, \mathcal{E}\_P(f_n).$$
  

---


## What is Multitask Learning? 


- Let 
  - $t_i \in \color{yellow}{[T]}$ denote the task associated with sample $i$
  - $\mathcal{Z}' = (\mathcal{Z}, [T])$
- Assume $(Z\_i,T\_i)$  is sampled iid from $P\_{Z,T}$, $Z\_i | T\_i=t \sim P\_t$, 
- Let  $P = P_{Z,T} \otimes P_Q \in \, \mathcal{P}'$
- Define a multi-task learning algorithm $f$ as a sequence 

$$ \mathcal{F}_{MT} = \lbrace f_n : (\mathcal{Z} \times \color{yellow}{[T]} )^n \rightarrow \mathcal{H} \rbrace$$

- Let $f^t$ denote the learner that only sees samples where $t_i=t$. 
- Identify the appropriate performance functions $\color{yellow}{\lbrace \mathcal{E}_t \rbrace}$.


---


## Has $f$ Multitask Learned?


Given $n$ data points in multitask learning setting $\mathcal{S}\_{MT}$, assuming $P$, 


$f$ .ye[weakly multitask] learns when its  performances $\lbrace \mathcal{E}_t \rbrace$ improve due to other tasks' data .ye[on average]:

$$  \sum\_{t \in [T]} \mathcal{E}\_t(f\_n ) P(t) <  \sum\_{t \in [T]} \mathcal{E}\_t(f\_n^t) P(t),$$ 

<br>

$f$ .ye[strongly multitask] learns when its performances $\lbrace \mathcal{E}_t \rbrace$ improve due to other tasks' data .ye[for each task]:

$$ \mathcal{E}\_t(f_n) < \, \mathcal{E}\_t(f_n^t) \quad \forall t \in [T].$$ 


---


## What is Sequential Learning? 

- When:
  - the data arrive sequentially, potentially in batches of size $m$
  - the constraints  demand a fixed capacity
  - there is potential for additional, $\Xi$-valued side information 
- Then, a sequential learning algorithm updates existing hypotheses on the basis of new data, that is, 


$$\mathcal{F}_S = \lbrace  f : \color{yellow}{\mathcal{H}} \times {\mathcal{Z}^m} \times \Xi \rightarrow \mathcal{H} \rbrace$$ 


---


## A Sequential Learning Task?  


- Given
  - a number of rounds $n$
  - a sequential learning setting $( \mathcal{Z}, \mathcal{A}, \mathcal{Q}, \mathcal{C})$, where $\mathcal{C}$ includes that $f_n \in \mathcal{O}(1) \, \forall n$
- Assume a true but unknown distribution $P\_i \in \, \mathcal{P}$ that nature selects at each round $i$ (can be the same). 
- Assume (potentially) a pool of side information, $\Xi$.
- Find $f$ that minimizes a generalization error that is indexed by $n$
  $$f^* = \arg \min\_{f} \, \mathcal{E}(f, n).$$
  

---

## Two Special Cases

Online Learning (OL) and Reinforcement Learning (RL) are two special cases of sequential learning, in which:

--

- The distribution of data $P\_i$ can change at each round $i$, perhaps adversarially.

--


- Some side information $\Xi$ is required to learn, either in the form of:
  - $k$ experts, as in OL, or
  - the existence of some state transition matrix $[P\_{s, s' \mid a}]$ for each action $a \in \mathcal{A}$, as in RL.

---

## What is Online Learning?


- Let 
  - data arrive sequentially in $n$ rounds
  - we also observe the predictions of $k$ experts, collectively in $\Xi$
- Assume .ye[nothing]: $Q\_i \sim P_i \in \mathcal{P}$, where the distribution could be i.i.d., conditionally dependent, or adversarial
- Define a class of .ye[online] learning algorithms $f$ as maps

$$ \mathcal{F}_{O} = \lbrace f : \mathcal{H} \times \color{yellow}{\Xi} \rightarrow \mathcal{H} \rbrace$$


---

## An Online Learning Task?

- Given
  - an online learning setting $( \mathcal{Z}, \mathcal{A}, \mathcal{Q}, \mathcal{C})$, where $\mathcal{C}$ includes that $f \in \mathcal{O}(1) \, \forall n$
  - a risk $R\_i$ at each round $i$
  - expert advice $\xi\_i$ at each round $i$
- Find $f$ that minimizes .ye[regret]
$$f^* = \arg \min\_{f} \, \mathcal{E}(f, n), \quad \mathcal{E}(f, n) := \sum\_{i=1}^n R\_i(f(h\_{i-1}, \xi\_i)) - \min\_{h \in \mathcal{H}} \sum\_{i=1}^n R\_i(h).$$
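
---

## Regret: A Sketch

A minimal sketch of computing regret against the best fixed expert. The update rule is exponential weights (Hedge), which is swapped in here as one standard choice and is not claimed to be the algorithm intended by the slides. The synthetic losses, the learning rate `eta`, and the function name `hedge_regret` are assumptions. The learner's risk at each round is taken to be the weighted average of the expert losses.

```python
import math
import random

def hedge_regret(expert_losses, eta):
    """Regret of exponential weights against the best fixed expert.

    expert_losses: list of rounds, each a list of k losses in [0, 1].
    """
    k = len(expert_losses[0])
    log_w = [0.0] * k            # log-weights, one per expert
    cumulative = [0.0] * k       # running loss of each expert
    learner_loss = 0.0
    for losses in expert_losses:
        top = max(log_w)         # subtract the max for numerical stability
        w = [math.exp(lw - top) for lw in log_w]
        total = sum(w)
        # Learner's risk this round: weighted average of expert losses.
        learner_loss += sum(wj / total * lj for wj, lj in zip(w, losses))
        for j, lj in enumerate(losses):
            log_w[j] -= eta * lj
            cumulative[j] += lj
    return learner_loss - min(cumulative)

rng = random.Random(0)
n, k = 1000, 5
# Hypothetical losses: expert 0 is slightly better on average.
rounds = [[rng.uniform(0.0, 0.8 if j == 0 else 1.0) for j in range(k)]
          for _ in range(n)]
eta = math.sqrt(8 * math.log(k) / n)
regret = hedge_regret(rounds, eta)
bound = math.sqrt(n * math.log(k) / 2)   # classical worst-case bound for this eta
```

With this choice of `eta`, the regret stays below `bound` for any loss sequence in $[0, 1]$, so average regret per round shrinks as $n$ grows.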


---

### What is Reinforcement Learning?


- Let 
  - data (states) arrive sequentially in $n$ rounds
  - $\mathcal{Z}\_i$ be the space of past states and actions at round $i$
- Assume upon taking action $a$, state distribution changes according to some transition matrix $[P\_{s, s' \mid a}]$ (for finite $\mathcal{Q}$ and $\mathcal{A}$).
- Let $\mathcal{H}$ be the space of policies (hypotheses)
- Define a .ye[reinforcement] learning algorithm $f$ as a sequence

$$ \mathcal{F}_{R} = \lbrace f\_i : \color{yellow}{\mathcal{Z}_i} \rightarrow \mathcal{H} \rbrace$$


---

### A Reinforcement Learning Task?

- Given
  - reinforcement learning settings $( \mathcal{Z}\_i, \mathcal{A}, \mathcal{Q}, \mathcal{C})\_i$, where 
    - $\mathcal{Q}$ and $\mathcal{A}$ are the state and action spaces, respectively,
    - $\mathcal{Z}\_i = (\mathcal{Q} \times \mathcal{A})^{i-1}$ is the space of past state-action pairs
  - a discount rate $\gamma$
  - a reward function $\bar{R}$
- Find $f$ that maximizes .ye[expected reward] (equivalently, minimizes its negative)
$$ f^* = \arg \min\_{f} \, \mathcal{E}(f, n), \quad \mathcal{E}(f, n) := -\mathbb{E}\big[ \sum\_{i=0}^n \gamma^{n-i} \bar{R}(Q\_i, f(Z\_i))\big] $$


---


## What is a Lifelong  Task? 


Sequential multi-task learning, where
- $\mathcal{T}$ is a (countably) infinite set of tasks
- $T_n$ is the number of tasks observed after $n$ samples
- $n_t$ is the number of samples for task $t$
- Requires .ye[out of task] capabilities  


---


## What is a Lifelong  Task? 


- Let 
  - $t_i \in \color{yellow}{\mathcal{T}}$ denote the task associated with sample $i$
  - $\mathcal{Z}' = (\mathcal{Z}, \mathcal{T})$
- Assume $(Z\_i,T\_i)$  is sampled iid from $P\_{Z,T}$, $Z\_i | T\_i=t \sim P\_t$, 
- Let  $P = P_{Z,T} \otimes P_Q \in \, \mathcal{P}'$
- Define a lifelong learning algorithm $f=f_1, f_2, \ldots$ as a sequence

$$ \mathcal{F}_{L2} = \lbrace f_n : \mathcal{H} \times (\mathcal{Z} \times \color{yellow}{\mathcal{T}} )^n \rightarrow \mathcal{H} \rbrace$$

- Let $f^t$ denote the learner that only sees samples where $t_i=t$. 
- Identify the appropriate performance functions $\lbrace \mathcal{E}_t \rbrace$.


---


## What is Lifelong Learning? 


Given $n$ data points sequentially in lifelong learning setting $\mathcal{S}\_{L2}$, assuming $P$, 


$f$ .ye[weakly lifelong] learns when its  performances $\lbrace \mathcal{E}_t \rbrace$ improve due to other tasks' data .ye[on average]:

<!-- $$ \mathbb{E} \Big[ \sum\_{t \in [T\_n]} \mathcal{E}\_t(f\_n )  \Big]  < 
 \mathbb{E} \sum\_{t \in [T\_n]} \mathcal{E}\_t(f\_n^t),$$  -->


$$  \sum\_{t \in [T\_n]} n\_t \mathcal{E}\_t(f\_n )   < 
 \sum\_{t \in [T\_n]} n\_t \mathcal{E}\_t(f\_n^t) ,$$ 

 <!-- $$ \sum\_{i \in [n]} \mathcal{E}\_{t\_i}(f\_i ) 
 < 
  \sum\_{i \in [n]} \mathcal{E}\_{t\_i}(f\_i^{t_i}),$$  -->
 

<br>

$f$ .ye[strongly lifelong] learns when its performances $\lbrace \mathcal{E}_t \rbrace$ improve due to other tasks' data .ye[for each task]:

$$ \mathcal{E}\_t(f_n) < \, \mathcal{E}\_t(f_n^t) \quad \forall t \in [T\_n].$$ 
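
---

## Weak vs. Strong: A Sketch

A minimal sketch of the two criteria, assuming the per-task errors have already been estimated. The function names and all numbers below are hypothetical and serve only to show that a learner can satisfy the weak criterion while failing the strong one.

```python
def weakly_learned(err_all, err_single, n_per_task):
    """Weak criterion: the sample-weighted total error improves."""
    lhs = sum(n * e for n, e in zip(n_per_task, err_all))
    rhs = sum(n * e for n, e in zip(n_per_task, err_single))
    return lhs < rhs

def strongly_learned(err_all, err_single):
    """Strong criterion: the error improves on every task."""
    return all(a < s for a, s in zip(err_all, err_single))

# Hypothetical per-task errors, for illustration only.
err_single = [0.30, 0.25, 0.40]   # each task learned from its own data
err_all = [0.22, 0.27, 0.31]      # each task learned from all data
n_per_task = [100, 100, 200]

weak = weakly_learned(err_all, err_single, n_per_task)
strong = strongly_learned(err_all, err_single)
```

In this made-up case the second task gets slightly worse, so the learner is weakly but not strongly lifelong learning.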


---
name:scenarios

## Outline 

- [Summary](#summary)
- [Definition](#def)
- Scenarios
- [Metrics](#metrics)
- [Algorithm](#alg)
- [Simulations](#sims)
- [Real](#real)
- [Theory](#theory)
- [Neurobiology](#neuro)
- [Discussion](#disc)


---

## Lifelong Learning Taxonomy 


![:scale 100%](images/learning-taxonomy.svg)


---

## Ways Tasks can Differ


| Component | Notation | Examples |
| :--- | :--- | :--- 
| Sample Space | $\mathcal{Z}$ | another modality
| Action Space | $\mathcal{A}$ | class incremental, task incremental
| Query Space | $\mathcal{Q}$ | new queries allowed
| Constraints | $\mathcal{C}$ | added hardware
| Performance | $\mathcal{E}$ | $L_2 \to L_1$
| Distribution | $P$ | Gaussian to Log-Gaussian
| Task Awareness | $T_i$ | {aware, oblivious, semi-aware}

$2^6 \times 3 \approx 200$ ways tasks can differ.


---

## Assumptions on Nature

The curriculum decided by nature can be:
- task constant
- task semi-constant
- task non-constant
  - typically requires restrictions on amount of data per task, or side information (OL, RL, LL)


<!-- TODO@JV maybe move below slides to real data -->

---

## Vision Task 1


.pull-left[
- *CIFAR 100* is a popular image classification dataset with 100 classes of images. 
- 500 training images and 100 testing images per class.
- All images are 32x32 color images.
- CIFAR 10x10 breaks the 100-class problem into 10 tasks, each with 10 classes.
]

.pull-right[
<img src="images/l2m_18mo/cifar-10.png" style="position:absolute; left:450px; width:400px;"/>
]

<br>

--

Why it is a good scenario:
1. current reference dataset
2. a variant of this is "Class Incremental", where each subsequent task includes all previous tasks, which is harder


---

## Vision Task 2


- EfficientNet used several different image datasets 
- Each has different number of classes, samples
- Even within dataset, each image could be different size and aspect ratio
- Sequentially train on each dataset

![:scale 50%](images/12-datasets.png)

Why it is a good scenario:
1. Images are more real (different resolutions, scales, # classes)
2. Many metrics (localization, fine grained objects, texture, scene)
3. Benchmark results are available 

---


## Language Task 1


- 8,194,317 sentences from Wikipedia (downloaded from Facebook). 
- 156 languages
- Trained using unsupervised FastText embedding
- words, 2-4 char n-grams embedded into 16 dimensions
- selected 30 languages
- break into batches of 3 "related" languages

![:scale 50%](images/30-languages.png) 

Why it is a good scenario:
1. Public and real data
2. Not vision


---


## Language Task 2

.pull-left[
- same feature vectors as above
- labels now correspond to Microsoft Bing "dominant type"
- 10k training and 1k testing entities
- 20 classes (each with at least 11k samples)
- 4 classes per task

Why it is a good scenario:
1. Public  data 
2. Real application 
]

.pull-right[
![:scale 100%](images/bing-dominant-types.png) 
]


<!-- 

## Examples 

1. Task 2: $P$ and $\mathcal{A}$ are different, 
  - this is "Task Incremental Learning" (CIFAR 10x10)
1. Task 2: $P$ changes and  $\mathcal{A}$ gets bigger, 
  - this is "Class Incremental Learning"
1. Task 2:  $P$ changes, but $f$ does not know (continual learning) 
-->


---
name:metrics

## Outline 

- [Summary](#summary)
- [Definition](#def)
- [Scenarios](#scenarios)
- Metrics
- [Algorithm](#alg)
- [Simulations](#sims)
- [Real](#real)
- [Theory](#theory)
- [Neurobiology](#neuro)
- [Discussion](#disc)


---


## Transfer Efficiency (TE)


The transfer efficiency of learning algorithm $f$ for task $t$ is
$$  TE\_t(f) := 
    \frac{\mathcal{E}\_t(f^t_n)}{\mathcal{E}\_t(f_n)}.
$$

<br>

Algorithm $ f $ transfer learns if $ TE_t(f) > 1 $. 


---
 

## Forward / Backward TE 


- Let $f^{t_-}_n$ denote the algorithm with access to all data up to the last sample associated with task $t$.
<!-- - Let $\mathcal{D}_F^t = \{(X_i, Y_i, T_i) \in \, \mathcal{D} : i \leq n_t\}$ be the set of all data up to sample $n_t$. -->
- .ye[Forward] transfer efficiency of $ f $ for task $t$ is the improvement on task $t$ resulting from all data .ye[preceding] task $t$
$$    FTE\_t(f) := 
\frac{\mathcal{E}\_t(f^t\_n)}{\mathcal{E}\_t(f^{t\_-}\_n)}.
$$


--


<!-- ## Backward Transfer Efficiency -->


<!-- Backward Transfer Efficiency (BTE) for task $t$ measures the improvement on task $t$ resulting from all data occurring after the last sample $i$ with $T_i = j$.  -->


- .ye[Backward] transfer efficiency of $ f $ for task $t$ is the improvement on task $t$ resulting from all data .ye[after] task $t$ 


<!-- The backward transfer efficiency of $ f $ for task $t$ is  -->
$$    BTE\_t(f) := 
\frac{\mathcal{E}\_t(f^{t\_-}_n)}{\mathcal{E}_t(f_n)}.
$$

---

## TE Factorizes


$$  TE\_t(f) := 
    \frac{\mathcal{E}\_t(f^t\_n)}{\mathcal{E}\_t(f\_n)}
    = \frac{\mathcal{E}\_t(f^t\_n)}{\mathcal{E}\_t(f^{t\_-}\_n)}
    \times
    \frac{\mathcal{E}\_t(f^{t\_-}\_n)}{\mathcal{E}\_t(f\_n)}.
$$
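
---

## TE Factorizes: A Sketch

A small numerical check of the factorization, assuming the three generalization errors for a task have been estimated. The function name and the error values are hypothetical.

```python
def transfer_efficiencies(err_single, err_past, err_all):
    """Return (TE, FTE, BTE) for one task.

    err_single: error using only this task's data
    err_past:   error using all data up to the last sample of this task
    err_all:    error using all data
    """
    fte = err_single / err_past
    bte = err_past / err_all
    te = err_single / err_all
    return te, fte, bte

# Hypothetical errors, for illustration only.
te, fte, bte = transfer_efficiencies(0.30, 0.25, 0.20)
transfer_learns = te > 1
```

Because the middle error cancels, `te` equals `fte * bte` up to floating point rounding.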


---
name:alg

## Outline 

- [Summary](#summary)
- [Definition](#def)
- [Scenarios](#scenarios)
- [Metrics](#metrics)
- Algorithm
- [Simulations](#sims)
- [Real](#real)
- [Theory](#theory)
- [Neurobiology](#neuro)
- [Discussion](#disc)

---

## Basic Idea 

For each new task, 
1. learn a new representation function, 
2. apply it to all data from all tasks: the updated representation for everything is the composition of this new representation with existing representations.  
3. update all decision rules using this representation.

Notes:
- This linearly increases representation capacity.
- Without increasing representation capacity, performance on all tasks will necessarily drop to chance levels eventually as number of tasks increases.
- Thus, fixed capacity systems can only lifelong learn insofar as they are inefficient (unnecessarily big) for individual tasks.

---
 

## Composable Hypotheses 

.center[ .ye[$h(\cdot) := w \circ v \circ u (\cdot) = w(v(u(\cdot)))$]]

- Let $u$ be a .ye[transformer], which maps data to a new representation, 

$$ u : \mathcal{X}  \to \tilde{\mathcal{X}}$$

- Let $v$ be a .ye[voter], which operates on the transformed data and outputs votes on all possible actions 


$$ v : \tilde{\mathcal{X}} \to \mathcal{P}_{A|X}$$


- Let $w$ be a .ye[decider], which decides which action to take on the basis of the votes 


$$ w : \mathcal{P}_{A|X} \to \mathcal{A}$$
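
---

## Composable Hypotheses: A Sketch

A minimal sketch of $h = w \circ v \circ u$ as plain function composition. The particular transformer, voter, and decider below (a fixed projection, a logistic curve, and an argmax) are illustrative assumptions, as is the helper name `compose`.

```python
import math

def compose(*funcs):
    """Right-to-left composition: compose(w, v, u)(x) == w(v(u(x)))."""
    def h(x):
        for f in reversed(funcs):
            x = f(x)
        return x
    return h

def u(x):
    """Transformer: project a 2-D point onto the direction (1, 1)."""
    return x[0] + x[1]

def v(z):
    """Voter: votes over the actions {0, 1} via a logistic curve."""
    p1 = 1.0 / (1.0 + math.exp(-z))
    return (1.0 - p1, p1)

def w(votes):
    """Decider: pick the action with the largest vote."""
    return max(range(len(votes)), key=lambda a: votes[a])

h = compose(w, v, u)
action_pos = h((2.0, 1.0))
action_neg = h((-2.0, -1.0))
```

Keeping the three stages separate is what allows a transformer learned on one task to be paired with voters learned on another.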


---
 

## Simple Examples

- Linear Discriminant Analysis (shallow)
  - $u$: projection onto a line 
  - $v$: fraction of points per class over/under threshold
  - $w$: maximum a posteriori class 
--


- Decision Tree (deep)
  - $u$: union of polytopes
  - $v$: fraction of points per class per leaf node
  - $w$: maximum a posteriori class 

 
---


## Complicated Example


- Decision Forest 
  - $u_b$ for $B$ trees: union of overlapping polytopes
  - $v_b$ for $B$ trees: fraction of points per class per leaf node
  - $w$: maximum a posteriori class averaging over trees 
--


- Deep Nets 
  - $u$: "backbone" (all but last layer)
  - $v$: softmax layer
  - $w$: max 


---


## Key Idea 

- .ye[Different transformers can be composed with voters]
- Learn many different transformers $u_t(\cdot)$ 
- For each transformer $u\_{t'}$, learn a voter $v\_{t,t'}$ for each task $t$ 
- Use the decider to weight the various options 
- This is .ye[ensembling representations].

### Notes

- We learn a new representation for each task. 
- Dimensionality of internal representation grows linearly with number of tasks.


<!-- TODO@jv: somewhere must introduce the concept of adjusting representations -->


---
 

## Composable Learning

<br> 

|  Scenario | Composition 
|  :--- | :--- 
| Single task learning | $ h(\cdot) = w \circ v \circ u (\cdot)$
| Multiple independent task learning | $ h_t(\cdot) = w_t \circ  v_t \circ u_t (\cdot)$ 
| Single task ensemble learning |$ h(\cdot) = w \circ \bigcup_t [ v_t \circ u_t (\cdot)] $ 
| Multitask learning | $ h_t(\cdot) = w_t \circ  v  \circ \bigcup_t  u_t (\cdot)$
| .ye[Multitask ensemble representation learning]  | $ h\_t(\cdot) = w\_t \circ  \bigcup\_{t'}  [v\_{t,t'}  \circ    u\_{t'} (\cdot) ] $


---
 

## Lifelong Learning Schema


![:scale 100%](images/learning-schemas.svg)


- Any learner with an explicit internal representation is ok, 
  - e.g.,  decision trees, decision forests, deep networks 
- SVMs are not


---

## Pseudocode 

- Given  $\color{magenta}{j-1}$ transformers learned from the previous $\color{magenta}{j-1}$ datasets and  a new $\color{yellow}{j^{th}}$ dataset with task label $\color{yellow}{t_j}$, do:
- learn a new transformer using $\color{yellow}{j^{th}}$ data
- .magenta[reverse transfer update] for each of the $\color{magenta}{j-1}$ previous tasks: 
    1. transform a subset of the data through the $\color{yellow}{j^{th}}$ transformer
     (this requires having stored some of the data)
    2. learn a new voter using the $\color{yellow}{j^{th}}$ representation of data
    3. update decision rules by appending this additional voter
- .ye[forward transfer update] for all data associated with $\color{yellow}{j^{th}}$ task:
  1. transform a subset of the data through the $\color{yellow}{j^{th}}$ transformer
  2. transform through each of the $\color{magenta}{j-1}$ existing transformers 
  3. learn a new voter for all $j$ transformers 
  4. make decision rule by averaging over $j$ voters
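
---

## Pseudocode: A Sketch

The steps above can be sketched in a few lines. This is a toy under stated assumptions, not the Lifelong Forests implementation: the transformer is a median-split grid standing in for a decision forest, all data are stored rather than a subset, and the names `make_task`, `learn_transformer`, `learn_voter`, and `LifelongLearner` are hypothetical.

```python
import random
from collections import defaultdict

def make_task(n, flip, rng):
    """XOR (flip=False) or N-XOR (flip=True) on the square [-1, 1]^2."""
    data = []
    for _ in range(n):
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        y = int((x[0] > 0) != (x[1] > 0))
        data.append((x, 1 - y if flip else y))
    return data

def learn_transformer(data):
    """Transformer: split each coordinate at its median; map x to a cell."""
    cuts = []
    for d in range(2):
        vals = sorted(x[d] for x, _ in data)
        cuts.append(vals[len(vals) // 2])
    return lambda x: (x[0] > cuts[0], x[1] > cuts[1])

def learn_voter(u, data):
    """Voter: smoothed class-1 frequency within each cell of transformer u."""
    counts = defaultdict(lambda: [1, 1])
    for x, y in data:
        counts[u(x)][y] += 1
    def vote(x):
        c0, c1 = counts[u(x)]
        return c1 / (c0 + c1)
    return vote

class LifelongLearner:
    def __init__(self):
        self.transformers = []  # one per task seen so far
        self.stored = []        # stored data per task
        self.voters = []        # voters[t][j]: task t voter on transformer j

    def add_task(self, data):
        u_new = learn_transformer(data)
        # Reverse transfer update: old tasks get a voter on the new transformer.
        for t, old in enumerate(self.stored):
            self.voters[t].append(learn_voter(u_new, old))
        self.transformers.append(u_new)
        self.stored.append(data)
        # Forward transfer update: new task gets a voter on every transformer.
        self.voters.append([learn_voter(u, data) for u in self.transformers])

    def predict(self, x, t):
        """Decider for task t: average the votes, then threshold."""
        votes = [vote(x) for vote in self.voters[t]]
        return int(sum(votes) / len(votes) > 0.5)

rng = random.Random(0)
learner = LifelongLearner()
learner.add_task(make_task(500, False, rng))  # task 0: XOR
learner.add_task(make_task(500, True, rng))   # task 1: N-XOR

def accuracy(t, flip):
    test = make_task(1000, flip, rng)
    return sum(learner.predict(x, t) == y for x, y in test) / len(test)

acc_xor, acc_nxor = accuracy(0, False), accuracy(1, True)
```

After the second task arrives, each task has one voter per transformer, which is the linear growth in representation described earlier.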

---
 

## General Representations 

- Transformers learn representations 
- We desire representations that are sufficient for one task, and  useful for other tasks 
- Decision trees, decision forests, and deep nets (with ReLU nodes) .ye[partition] feature space into polytopes

--

![:scale 100%](images/deep-polytopes.png)


<!-- <img src="images/deep-polytopes.png" style="width:500px;"/> -->


---
name:sims 

## Outline 

- [Summary](#summary)
- [Definition](#def)
- [Scenarios](#scenarios)
- [Metrics](#metrics)
- [Algorithm](#alg)
- Simulations
- [Real](#real)
- [Theory](#theory)
- [Neurobiology](#neuro)
- [Discussion](#disc)


---


## A Transfer Example

- .ye[XOR]
  - Samples in the (0,0) and (1,1) quadrants are purple  
  - samples in the (0,1) and (1,0) quadrants are green 
- .lb[N-XOR]
  - Samples in the (0,0) and (1,1) quadrants are green  
  - samples in the (0,1) and (1,0) quadrants are purple 
- Optimal decision boundaries for both problems are coordinate axes

<img src="images/l2m_18mo/xor_nxor.png" style="width:475px" class="center"/> 

<!-- TODO@HH replace with svg of Gaussian XOR & N-XOR -->


---


## Lifelong Classifier 

<img src="images/columbia20/xor-nxor-all.png"  style="height:300px;">


<!-- TODO@HH replace with 3 lower panels of Fig 2 -->
<!-- TODO@HH add titles to left and middle panel saying "Forward Transfer" and "Reverse Transfer",  respectively-->

- .lb[Uncertainty Forest] uses 100 samples from XOR to learn partitions
- .ye[Lifelong Forest] uses 100 samples from XOR and $n$ samples from N-XOR to learn partitions


---


## Lifelong Classifier 

<img src="images/columbia20/xor-nxor-all2.png"  style="height:500px;">

<!-- TODO@HH replace with 3 panels of XOR vs R-XOR -->
<!-- TODO@HH add fig showing XOR-Chess & vice verse (so, 2x3 panels) -->

---
 

## Different # of Classes 

<img src="images/spiral-all.png"  style="height:500px;">


<!-- TODO@HH replaec with new spiral figure -->

---
 

## Graceful Forgetting

<img src="images/rxor-suite-new-row.png"  style="width:700px;">


<!-- TODO@HH replace with new graceful forgetting figure -->


---
name:real 

## Outline 

- [Summary](#summary)
- [Definition](#def)
- [Scenarios](#scenarios)
- [Metrics](#metrics)
- [Algorithm](#alg)
- [Simulations](#sims)
- Real
- [Theory](#theory)
- [Neurobiology](#neuro)
- [Discussion](#disc)


<!-- ## Consider an  example -->


<!-- TODO@JD: replace CIFAR10 image with same thing but using CIFAR100 images and categories (not urgent, show me the image first) -->

<!-- TODO@JV add multimodal example -->

---
 

## CIFAR-10x10 Previous SOTA


<img src="images/l2m_18mo/progressive_netsc.png" style="width:650px;"/>

Andrei A. Rusu et al. [Progressive Neural Networks](https://arxiv.org/abs/1606.04671), arXiv, 2016.
  
<!-- Seungwon Lee, James Stokes, and Eric Eaton. "[Learning Shared Knowledge for Deep Lifelong Learning Using Deconvolutional Networks](https://www.ijcai.org/proceedings/2019/393)." IJCAI, 2019. -->


---


## Forward Transfer Efficiency

- y-axis indicates .ye[forward transfer efficiency] (FTE), 
  - which is the ratio of "single task error" to "error using past tasks"
- each algorithm has a line
  - if the line .ye[increases], that means it is doing "forward transfer"


---

Lifelong Forests demonstrates the .ye[largest forward transfer].

<!-- ![:scale 100%](images/cifar-100-FTE.svg) -->
![:scale 100%](images/cifar-100-FTE_.svg)
---


## Backward Transfer Efficiency

- y-axis indicates .ye[backward transfer efficiency] (BTE), 
  - which is the ratio of "single task error" to "error using future tasks"
- each task will have a line
  - if the line .ye[increases], that means it is doing "backward transfer"
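
Since each task gets its own line, BTE can be sketched as a curve that is re-evaluated every time a later task is learned. The function name and the error values below are hypothetical.

```python
def backward_transfer_curve(single_task_error, errors_after_each_new_task):
    """BTE of one task, recomputed after each later task has been learned."""
    return [single_task_error / e for e in errors_after_each_new_task]

# Hypothetical error on task 1, measured again after each of three later tasks is added.
curve = backward_transfer_curve(0.30, [0.30, 0.28, 0.25])

# The line "increases" when every later value is at least as large as the one before it.
is_increasing = all(a <= b for a, b in zip(curve, curve[1:]))
```

An increasing curve means later tasks kept lowering the error on the earlier task.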


---
 

Lifelong Forests .ye[uniquely exhibits backward transfer].


![:scale 100%](images/cifar-100-BTE.svg)


---
 

Lifelong Forests uniquely exhibits .ye[strong] lifelong learning.

| Algorithm  | Average TE | Min TE |
|:---        |:---        |:---    |
| LF         | .ye[1.13 (&plusmn;0.01)] | .ye[1.10 (&plusmn;0.01)] |
| DF-CNN     | 0.75 (&plusmn;0.08) | 0.40 (&plusmn;0.01) |
| Online EWC | 0.96 (&plusmn;0.01) | 0.88 (&plusmn;0.01) |
| EWC        | 0.97 (&plusmn;0.01) | 0.91 (&plusmn;0.01) |
| SI         | 0.86 (&plusmn;0.02) | 0.75 (&plusmn;0.01) |
| LwF        | 1.00 (&plusmn;0.01) | 0.97 (&plusmn;0.01) |
| ProgNN     | 1.02 (&plusmn;0.01) | 0.97 (&plusmn;0.01) |
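
A rough sketch of how the table maps onto weak and strong lifelong learning, assuming "improves" means a transfer efficiency strictly above 1 and ignoring the reported error terms. The function name and threshold are illustrative choices.

```python
# (average TE, min TE) copied from the table; the error terms are omitted.
te_table = {
    "LF": (1.13, 1.10),
    "DF-CNN": (0.75, 0.40),
    "Online EWC": (0.96, 0.88),
    "EWC": (0.97, 0.91),
    "SI": (0.86, 0.75),
    "LwF": (1.00, 0.97),
    "ProgNN": (1.02, 0.97),
}

def lifelong_category(average_te, min_te, threshold=1.0):
    """Return 'strong', 'weak', or 'none' (the threshold of 1 is an assumption)."""
    if min_te > threshold:
        return "strong"  # every task improves
    if average_te > threshold:
        return "weak"    # tasks improve on average only
    return "none"

categories = {name: lifelong_category(*te) for name, te in te_table.items()}
strong = [name for name, c in categories.items() if c == "strong"]
```

Under this reading, only LF has a minimum TE above 1.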


<!-- <img src="images/all_TE.png" style="height:250px;" /> -->
<!-- <img src="images/all_TE_2.png" style="height:500px;" /> -->


---

Lifelong Forests' accuracy is worse than .ye[C]NNs with O(10M) parameters, and better than .ye[D]NNs with O(1M) parameters. 


![:scale 100%](images/cifar-100-E.svg)


---
class:inverse

## Language Task 1 

![:scale 100%](images/RTE_language.png)


---

## Language Task 2 

![:scale 100%](images/RTE_bing.png)


---
name:theory

## Outline 

- [Summary](#summary)
- [Definition](#def)
- [Scenarios](#scenarios)
- [Metrics](#metrics)
- [Algorithm](#alg)
- [Simulations](#sims)
- [Real](#real)
- Theory
- [Neurobiology](#neuro)
- [Discussion](#disc)


---


## What do classifiers do?

<br>

learn: given $(x_i,y_i)$, for $i \in [n]$, where $y_i \in \lbrace 0,1 \rbrace$
1. partition feature space into "parts",
2. compute the plurality label of the points in each part.


predict: given $x$
1. find its part, 
2. report the plurality vote in its part.
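
The learn and predict steps can be sketched with the simplest possible partition, a regular grid. The grid, the cell width, and the function names are assumptions for illustration; forests learn their partitions from the data instead.

```python
import numpy as np
from collections import Counter, defaultdict

def part_of(x, width):
    """Map a point to its part: the index of the grid cell that contains it."""
    return tuple(np.floor(np.asarray(x) / width).astype(int))

def learn(X, y, width=0.5):
    """Partition feature space into grid cells and store each cell's plurality label."""
    labels_in_part = defaultdict(list)
    for x_i, y_i in zip(X, y):
        labels_in_part[part_of(x_i, width)].append(int(y_i))
    return {part: Counter(labels).most_common(1)[0][0]
            for part, labels in labels_in_part.items()}

def predict(votes, x, width=0.5, default=0):
    """Find the part containing x and report its plurality vote (default if the part is empty)."""
    return votes.get(part_of(x, width), default)

# Toy data: the lower-left cell holds labels (0, 0, 1), the upper-right cell holds (1, 1).
X = np.array([[0.1, 0.1], [0.2, 0.3], [0.4, 0.2], [0.6, 0.7], [0.8, 0.9]])
y = np.array([0, 0, 1, 1, 1])
votes = learn(X, y)
```

Here `predict(votes, [0.3, 0.3])` falls in the lower-left cell and `predict(votes, [0.9, 0.6])` in the upper-right cell.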


---


## What can regressors do?

<br>

learn: given $(x_i,y_i)$, for $i \in [n]$, where $y_i \in \mathbb{R}$
1. partition feature space into "parts",
2. compute the average response of the points in each part.


predict: given $x$
1. find its part,
2. report the average in its part.
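
For regression, only the per-part summary changes: an average replaces the plurality vote. A one-dimensional sketch with fixed bin edges follows; the edges, data, and function names are hypothetical.

```python
import numpy as np

def learn_regressor(x, y, edges):
    """Average the responses that fall in each part (bin) defined by `edges`."""
    part = np.digitize(x, edges)
    return {int(p): float(np.mean(y[part == p])) for p in np.unique(part)}

def predict_regressor(averages, x_new, edges, default=0.0):
    """Find the part containing x_new and report the stored average."""
    return averages.get(int(np.digitize(x_new, edges)), default)

x = np.array([0.1, 0.2, 0.4, 0.6, 0.8])
y = np.array([1.0, 2.0, 3.0, 10.0, 20.0])
edges = np.array([0.5])  # two parts: x < 0.5 and x >= 0.5
averages = learn_regressor(x, y, edges)
```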


---


## The fundamental theorem of statistical pattern recognition


If each part is:

1. small enough, and 
2. has enough points in it, 

then given enough data, one can learn *perfectly, no matter what*! 


$$\mathcal{E}(f_n) \rightarrow \mathcal{E}^*,$$

where $\mathcal{E}^*$ is the Bayes optimal error.

-- Stone, 1977
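
For partition-based rules, one standard way to make the two conditions precise is sketched below. The symbols are introduced here for illustration and are not defined elsewhere in these slides: $A_n(x)$ is the part containing $x$ after $n$ samples, and $N_n(x)$ is the number of training points in that part.

$$\text{diam}(A_n(X)) \to 0 \quad \text{and} \quad N_n(X) \to \infty \quad \text{in probability}$$

The first condition is "small enough" and the second is "has enough points". This form of the result, which holds for any distribution of the data, is the one given in Devroye, Györfi, and Lugosi, A Probabilistic Theory of Pattern Recognition.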


<!-- NB: the parts can be overlapping (as in kNN) or not (as in histograms) -->


---


## The fundamental .ye[conjecture] of transfer learning


If each part is:

- small enough, and 
- has enough points in it, 

then given enough data, one can .ye[transfer learn] *no matter what*! 


-- jovo, 2020


Specifically, this means:
- as $n_0 \to \infty$, TE is at least $1$ 
- as $n_1 \to \infty$, $\mathcal{E}(f_n) \to \mathcal{E}^*$ 

<!-- TODO@ronak i added the above two things, does that seem right to you as a conjecture? -->

---
name:neuro 

## Outline 

- [Summary](#summary)
- [Definition](#def)
- [Scenarios](#scenarios)
- [Metrics](#metrics)
- [Algorithm](#alg)
- [Simulations](#sims)
- [Real](#real)
- [Theory](#theory)
- Neurobiology
- [Discussion](#disc)

---

## Neurobiology Background

- All brains start with 1 neuron, and *increase neural capacity* during embryonic and juvenile developmental stages 
<!-- - In many taxa, # of neurons increases throughout development  -->
<!-- - In all taxa, # synapses increase through juvenile state -->
- During development, basic concepts are established
<!-- - If natural stimuli are unavailable developmentally, such concepts never form  -->
- In adulthood, animals learn new concepts  by recombination 
- Concepts that are not combinations can never form 
- So a fixed-capacity system arises only after significant training 

<iframe width="560" height="315" src="https://www.youtube.com/embed/C2q3Dqv9PEA?start=5" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>


---

## How do brains learn? 

- "Partitioning", as implemented by a network, corresponds to only a subset of nodes responding to any given input 

<img src="images/rock20/Side-black.gif" style="height:230px;"/>
<img src="images/rock20/Front_of_Sensory_Homunculus.gif" style="height:230px;"/>
<img src="images/rock20/Rear_of_Sensory_Homunculus.jpg" style="height:230px;"/>


<!-- - Each connectome dynamically reconfigures at multiple time-scales to store novel information  -->
<!-- - Memory consolidation requires a physical reconfiguration implemented by a sequence of immediate early genes (IEGs) -->


---

## How do brains learn? 


- "Partitioning", as implemented by a network, corresponds to only a subset of nodes responding to any given input 
- A brain's connectome implements a partitioning of feature space 

<!-- <iframe width="560" height="315" src="videos/zebrafish_em_traces.m4v" frameborder="0" allow="encrypted-media" allowfullscreen></iframe> -->
<iframe width="560" height="315" src="https://www.youtube.com/embed/ykIj-9a_ss4?start=495" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---

## How do brains learn? 


- "Partitioning", as implemented by a network, corresponds to only a subset of nodes responding to any given input 
- A brain's connectome implements a partitioning of feature space 
- Each connectome dynamically reconfigures at multiple time-scales to store novel information 

<!-- <iframe width="560" height="315" src="videos/zebrafish_ca.m4v" frameborder="0" allow="encrypted-media" allowfullscreen></iframe> -->
<iframe width="560" height="315" src="https://www.youtube.com/embed/lppAwkek6DI" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>


---

## How do brains learn? 

- "Partitioning", as implemented by a network, corresponds to only a subset of nodes responding to any given input 
- A brain's connectome implements a partitioning of feature space 
- Each connectome dynamically reconfigures at multiple time-scales to store novel information 
- Memory consolidation requires a physical reconfiguration implemented by a sequence of immediate early genes (IEGs)


<!-- <video width="560" height="420" controls>
  <source src="videos/zebrafish_ca.m4v" type="video/mp4">
</video> 
-->

<!-- TODO@JV add video? -->

---

## NeuroExperiments 

- How does the brain select which neurons/synapses to modify to store new information?
  - The choice should maximize transfer efficiency 
- We can simultaneously observe neural and IEG activity during and after a learning event (e.g., a foot shock)
- We can identify the neural ensembles primed to learn with Arc-GFP 
- We can identify sets of ensembles of neural activity using jRGECO1a
- We can then discover the relationship between these two sets of ensembles of neurons

---
name:disc 

## Outline 

- [Summary](#summary)
- [Definition](#def)
- [Scenarios](#scenarios)
- [Metrics](#metrics)
- [Algorithm](#alg)
- [Simulations](#sims)
- [Real](#real)
- [Theory](#theory)
- [Neurobiology](#neuro)
- Discussion


---

## Extension #1: Streaming 

- Current implementation requires that all data for each task be batched
- Could stream trees per sample
- Would provide truly continual transfer
- Collaborators: JHU seedling (Braverman)

---

## Extension #2: Compression 

- Current implementation linearly grows internal representation with each new task
- Could compress internal representation after training to achieve a fixed representation space (e.g., using coresets)
- For forests, this could happen at the node or tree level 
- Collaborators: JHU seedling (Braverman)

---

## Extension #3: Replay

- Current implementation requires storing some data to achieve *backwards* transfer
- Could leverage replay to reduce dependence on ever-growing data storage
- Collaborators: Baylor (Tolias) & McNaughton (UCI+UCSD)

---

## Extension #4: Agent

- Current implementation's actions are labels and do not impact future data
- Could integrate into larger L2 system that incorporates agent based learning
- Collaborators: Raghavan (SRI)


---
 

## All possible extensions 

1. Allow fully sequential data 
2. Allow fixed capacity representation
3. Allow replay to support fixed capacity 
4. Allow agent based extension
5. Allow non-discrete tasks 


1. No implementation using deep nets
2. Tasks must be known (no implementation that imputes task ID)
3. Feature space must be the same for all tasks (no data fusion step)
4. Only unimodal data supported (no multimodal implementation)
5. Must grow rather than recruit new internal representations (no pre-training implemented)
6. Requires storing some samples to achieve backwards transfer (no replay capacity)
7. No support for specific modalities (e.g., images)


---

## Publications


1. H. Helm et al. Lifelong Learning Forests, 2020.
1. R. Mehta et al. A General Theory of Learnability, 2020.
1. R. Guo et al. [Estimating Information-Theoretic Quantities with Uncertainty Forests](https://arxiv.org/abs/1907.00325). arXiv, 2019.
1. R. Perry et al. Manifold Forests: Closing the Gap on Neural Networks. preprint, 2019.
1. C. Shen and J. T. Vogelstein. [Decision Forests Induce Characteristic Kernels](https://arxiv.org/abs/1812.00029). arXiv, 2019.
1. M. Madhyastha et al. [Geodesic Learning via Unsupervised Decision Forests](https://arxiv.org/abs/1907.02844). arXiv, 2019.

---

## Conferences 


1. J.T. Vogelstein et al. A biological implementation of lifelong learning in the pursuit of artificial general intelligence.  From Neuroscience to Artificially Intelligent Systems, 2020.
2. B. Pedigo et al.  A quantitative comparison of a complete connectome to artificial intelligence architectures. From Neuroscience to Artificially Intelligent Systems, 2020.


---
 

### Acknowledgements


<!-- <div class="small-container">
  <img src="faces/ebridge.jpg"/>
  <div class="centered">Eric Bridgeford</div>
</div>

<div class="small-container">
  <img src="faces/pedigo.jpg"/>
  <div class="centered">Ben Pedigo</div>
</div>

<div class="small-container">
  <img src="faces/jaewon.jpg"/>
  <div class="centered">Jaewon Chung</div>
</div> -->


<div class="small-container">
  <img src="faces/yummy.jpg"/>
  <div class="centered">yummy</div>
</div>

<div class="small-container">
  <img src="faces/lion.jpg"/>
  <div class="centered">lion</div>
</div>

<div class="small-container">
  <img src="faces/violet.jpg"/>
  <div class="centered">baby girl</div>
</div>

<div class="small-container">
  <img src="faces/family.jpg"/>
  <div class="centered">family</div>
</div>

<div class="small-container">
  <img src="faces/earth.jpg"/>
  <div class="centered">earth</div>
</div>


<div class="small-container">
  <img src="faces/milkyway.jpg"/>
  <div class="centered">milkyway</div>
</div>


##### JHU

<div class="small-container">
  <img src="faces/cep.png"/>
  <div class="centered">Carey Priebe</div>
</div>

<!-- <div class="small-container">
  <img src="faces/randal.jpg"/>
  <div class="centered">Randal Burns</div>
</div> -->


<!-- <div class="small-container">
  <img src="faces/cshen.jpg"/>
  <div class="centered">Cencheng Shen</div>
</div> -->


<!-- <div class="small-container">
  <img src="faces/bruce_rosen.jpg"/>
  <div class="centered">Bruce Rosen</div>
</div>


<div class="small-container">
  <img src="faces/kent.jpg"/>
  <div class="centered">Kent Kiehl</div>
</div> -->

<!-- <div class="small-container">
  <img src="faces/mim.jpg"/>
  <div class="centered">Michael Miller</div>
</div>

<div class="small-container">
  <img src="faces/dtward.jpg"/>
  <div class="centered">Daniel Tward</div>
</div> -->


<!-- <div class="small-container">
  <img src="faces/vikram.jpg"/>
  <div class="centered">Vikram Chandrashekhar</div>
</div>


<div class="small-container">
  <img src="faces/drishti.jpg"/>
  <div class="centered">Drishti Mannan</div>
</div> -->

<div class="small-container">
  <img src="faces/jesse.jpg"/>
  <div class="centered">Jesse Patsolic</div>
</div>

<!-- <div class="small-container">
  <img src="faces/falk_ben.jpg"/>
  <div class="centered">Benjamin Falk</div>
</div> -->

<!-- <div class="small-container">
  <img src="faces/kwame.jpg"/>
  <div class="centered">Kwame Kutten</div>
</div> -->

<!-- <div class="small-container">
  <img src="faces/perlman.jpg"/>
  <div class="centered">Eric Perlman</div>
</div> -->

<!-- <div class="small-container">
  <img src="faces/loftus.jpg"/>
  <div class="centered">Alex Loftus</div>
</div> -->

<!-- <div class="small-container">
  <img src="faces/bcaffo.jpg"/>
  <div class="centered">Brian Caffo</div>
</div> -->

<!-- <div class="small-container">
  <img src="faces/minh.jpg"/>
  <div class="centered">Minh Tang</div>
</div> -->

<!-- <div class="small-container">
  <img src="faces/avanti.jpg"/>
  <div class="centered">Avanti Athreya</div>
</div> -->

<!-- <div class="small-container">
  <img src="faces/vince.jpg"/>
  <div class="centered">Vince Lyzinski</div>
</div> -->

<!-- <div class="small-container">
  <img src="faces/dpmcsuss.jpg"/>
  <div class="centered">Daniel Sussman</div>
</div> -->

<!-- <div class="small-container">
  <img src="faces/youngser.jpg"/>
  <div class="centered">Youngser Park</div>
</div> -->

<!-- <div class="small-container">
  <img src="faces/shangsi.jpg"/>
  <div class="centered">Shangsi Wang</div>
</div> -->

<!-- <div class="small-container">
  <img src="faces/tyler.jpg"/>
  <div class="centered">Tyler Tomita</div>
</div> -->

<!-- <div class="small-container">
  <img src="faces/james.jpg"/>
  <div class="centered">James Brown</div>
</div> -->

<!-- <div class="small-container">
  <img src="faces/disa.jpg"/>
  <div class="centered">Disa Mhembere</div>
</div> -->

<!-- <div class="small-container">
  <img src="faces/gkiar.jpg"/>
  <div class="centered">Greg Kiar</div>
</div> -->


<!-- <div class="small-container">
  <img src="faces/jeremias.png"/>
  <div class="centered">Jeremias Sulam</div>
</div> -->


<div class="small-container">
  <img src="faces/meghana.png"/>
  <div class="centered">Meghana Madhyastha</div>
</div>
  

<!-- <div class="small-container">
  <img src="faces/percy.png"/>
  <div class="centered">Percy Li</div>
</div>
-->

<div class="small-container">
  <img src="faces/hayden.png"/>
  <div class="centered">Hayden Helm</div>
</div>


<div class="small-container">
  <img src="faces/rguo.jpg"/>
  <div class="centered">Richard Guo</div>
</div>

<div class="small-container">
  <img src="faces/ronak.jpg"/>
  <div class="centered">Ronak Mehta</div>
</div>

<div class="small-container">
  <img src="faces/jayanta.jpg"/>
  <div class="centered">Jayanta Dey</div>
</div>

##### Microsoft Research

<div class="small-container">
  <img src="faces/chwh-180x180.jpg"/>
  <div class="centered">Chris White</div>
</div>


<div class="small-container">
  <img src="faces/weiwei.jpg"/>
  <div class="centered">Weiwei Yang</div>
</div>

<div class="small-container">
  <img src="faces/jolarso150px.png"/>
  <div class="centered">Jonathan Larson</div>
</div>

<div class="small-container">
  <img src="faces/brtower-180x180.jpg"/>
  <div class="centered">Bryan Tower</div>
</div>


##### DARPA 
Hava, Ben, Robert, Jennifer, Ted.

</div>
<!-- <img src="images/funding/nsf_fpo.png" STYLE="HEIGHT:95px;"/> -->
<!-- <img src="images/funding/nih_fpo.png" STYLE="HEIGHT:95px;"/> -->
<!-- <img src="images/funding/darpa_fpo.png" STYLE=" HEIGHT:95px;"/> -->
<!-- <img src="images/funding/iarpa_fpo.jpg" STYLE="HEIGHT:95px;"/> -->
<!-- <img src="images/funding/KAVLI.jpg" STYLE="HEIGHT:95px;"/> -->
<!-- <img src="images/funding/schmidt.jpg" STYLE="HEIGHT:95px;"/> -->

---
<!-- class:center -->
background-image: url(images/l_and_v.jpeg)

# Questions?

<!-- <img src="images/l_and_v.jpeg" style=" height:600px;"/> -->

---
class: middle, inverse


## .center[Extra Slides]


---


## Background 

1. T. M. Tomita et al. [Sparse Projection Oblique Randomer Forests](https://arxiv.org/abs/1506.03410). arXiv, 2018.
2. J. Browne et al. [Forest Packing: Fast, Parallel Decision Forests](https://arxiv.org/abs/1806.07300). SIAM ICDM, 2018.

More info: [https://neurodata.io/sporf/](https://neurodata.io/sporf/)


---
class: middle 

# Biology


---
 

## Do brains do it?

--

(brains obviously learn)

1. Do brains partition feature space?
2. Is there some kind of "voting" occurring within each part?


---


## Brains partition  

- Feature space = the set of all possible inputs to a brain
- Partition = only a subset of "nodes" respond to any given input
- Examples
  1. visual receptive fields
  2. place fields / grid cells
  3. sensory homunculus

<br>

<img src="images/rock20/Side-black.gif" style="height:230px;"/>
<img src="images/rock20/Front_of_Sensory_Homunculus.gif" style="height:230px;"/>
<img src="images/rock20/Rear_of_Sensory_Homunculus.jpg" style="height:230px;"/>


---


## Brains vote

- Vote = the pattern of responses indicates which stimulus evoked the response

<img src="images/rock20/brody1.jpg" style="height:400px;" />


---
 

## Can Humans Backward Transfer?


- "Knowledge and skills from a learner’s first language are used and reinforced, deepened, and expanded upon when a learner is engaged in second language literacy tasks." -- [American Council on the Teaching of Foreign Languages](https://www.actfl.org/guiding-principles/literacy-language-learning)


---
 

## Proposed Experiments 

- Behavioral Experiment
  - Source Task: Delayed Match to Sample (DMS) on colors
  - Target Task A: Delayed Match to Not-Sample  on colors 
  - Target Task B: DMS on orientation 
- Measurements
  - Arc-GFP to identify which neurons could learn 
  - Ca2+-YFP to measure neural activity
  - Narp-RFP to identify which neurons actually consolidate
- Species 
  - Zebrafish (Engert)
  - Mouse (McNaughton and/or Tolias)
  - Human (Isik)


---
class: middle 

# Other Stuff


---


## Not So Clevr

<img src="images/not-so-clevr.png" style="width:650px" />


---

### RF is more computationally efficient 


<img src="images/s-rerf_6plot_times.png" style="width:750px;"/>


</textarea>
  <!-- <script src="https://gnab.github.io/remark/downloads/remark-latest.min.js"></script> -->
  <!-- <script src="remark-latest.min.js"></script> -->
  <script src="remark-latest.min.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/katex.min.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/contrib/auto-render.min.js"></script>
  <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/katex.min.css">
  <script type="text/javascript">

    var options = {};
    var renderMath = function () {
      renderMathInElement(document.body);
      // or if you want to use $...$ for math,
      renderMathInElement(document.body, {
        delimiters: [ // mind the order of delimiters(!?)
          { left: "$$", right: "$$", display: true },
          { left: "$", right: "$", display: false },
          { left: "\\[", right: "\\]", display: true },
          { left: "\\(", right: "\\)", display: false },
        ]
      });
    }

    remark.macros.scale = function (percentage) {
      var url = this;
      return '<img src="' + url + '" style="width: ' + percentage + '" />';
    };

    // var slideshow = remark.create({
    // Set the slideshow display ratio
    // Default: '4:3'
    // Alternatives: '16:9', ...
    // {
    // ratio: '16:9',
    // });

    var slideshow = remark.create(options, renderMath);


  </script>
</body>

</html>