---
title: Dimensionality Reduction Methods
subtitle: Statistical Data Analysis Exam <br/> <br/> <br/> <br/> <br/>
author:
  Povilas Gibas, <br/>
  PhD Candidate in Biochemistry <br/>
  Department of Biological DNA Modification, <br/>
  Vilnius University <br/> <br/>
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  xaringan::moon_reader:
    css: [default, metropolis, metropolis-fonts]
    lib_dir: libs
---
```{r, echo = FALSE, include = FALSE, warning = FALSE, message = FALSE}
library(dimRed)
library(ggplot2)
library(knitr)
library(magrittr)  # for the %>% pipe used in later chunks
library(plotly)
library(vegan)
opts_chunk$set(
  echo = FALSE,
  error = FALSE,
  message = FALSE,
  warning = FALSE,
  include = TRUE,
  results = TRUE,
  cache = TRUE,
  fig.align = "center"
)
```
# Content
- Curse of Dimensionality
    - Practical Problems
    - Theoretical Problems
- Features of Dimensionality Reduction Methods
    - Expected Functionalities
    - Characteristics
    - Categories
- Dimensionality Reduction Methods
    - Principal Component Analysis
    - Kernel Principal Component Analysis
    - Classical Multidimensional Scaling
    - Non-metric Multidimensional Scaling
    - Locally Linear Embedding
    - t-Distributed Stochastic Neighbor Embedding
- Tips for Effective Dimensionality Reduction
---
# Curse of Dimensionality
In nature, combining multiple simple units makes it possible to perform complex tasks (i.e. units are redundant and, after failing, can be replaced with others that achieve the same task).
The goal of dimensionality reduction (DR) is to **summarize a large set of parameters into a smaller set with no or less redundancy** (i.e. find a low-dimensional representation of the data while preserving its key properties).
High dimensionality problems:
- Practical
- Theoretical / Mathematical
---
# Practical Problems
*By essence, the world is multidimensional*
- **Processing of sensor arrays**: biomedical applications; ecology; seismology; weather forecasting; GPS; image processing.
- **Multivariate data analysis**: in contrast with sensor arrays, multivariate data analysis focuses on the analysis of measures that are related to each other but come from different types of sensors.
- **Data mining**: deals with more *exotic* data structures than arrays of numbers (e.g. text mining).
```{r, out.width = 400}
include_graphics("./figures/Diaz_plants.png")
```
---
# Theoretical Problems
## Curse of dimensionality & mathematical anomalies
*When the dimensionality grows, the well-known properties of the usual 2D or 3D Euclidean spaces make way for strange and annoying phenomena*
The term *curse of dimensionality* was first used by [Bellman, 1961](https://books.google.be/books?hl=en&lr=&id=iwbWCgAAQBAJ&oi=fnd&pg=PR9&ots=bDH6UqK66h&sig=2JT8vX8dg1-8jSqkwY9ACw3bbnc&redir_esc=y#v=onepage&q=curse&f=false) for an empty space phenomenon:
```{r, out.width = 1000, out.height = 300}
include_graphics("./figures/bellman.png")
```
---
# Theoretical Problems
## Ratio of hyper-cube & hyper-sphere
In a $D$-dimensional space, consider a sphere of radius $r$ inscribed in a cube of edge $2r$ (i.e. the cube circumscribes the sphere):
$V_{cube}\left(r\right) = \left(2r\right)^D$
$V_{sphere}\left(r\right) = \frac{\pi^{D/2}r^D}{\Gamma\left(1 + D/2\right)}$
When $D$ increases, the ratio between the volume of the sphere and the volume of the cube goes to zero:
$\lim_{D\to\infty} \frac{V_{sphere}\left(r\right)}{V_{cube}\left(r\right)} = 0$
---
# Theoretical Problems
## Ratio of hyper-cube & hyper-sphere
Implementation in `R`:
```{r, echo = TRUE}
volume_hypersphere <- function(r, D) {
  numer <- pi^(D / 2)
  denum <- gamma(D / 2 + 1)
  numer / denum * r^D
}
volume_hypercube <- function(r, D) {
  (2 * r)^D
}
# Ratio from D = 2 to D = 10
```
```{r, fig.asp = 0.4, out.width = 700, include = TRUE}
round(sapply(2:10, function(D) {volume_hypersphere(10, D) / volume_hypercube(10, D)}), 3)
include_graphics("./figures/hypercube.png")
```
---
# Theoretical Problems
## Hypervolume of a spherical shell
Consider 2 concentric spheres: the first with radius $r$, the second with a slightly smaller radius $r - \epsilon$ (i.e. $\epsilon$ is the thickness of the outer shell).
$\frac{V_{sphere}\left(r\right) - V_{sphere}\left(r - \epsilon\right)}{V_{sphere}\left(r\right)} = \frac{r^D - \left(r-\epsilon\right)^D}{r^D}$
When $D$ increases the ratio goes towards $1$ (the shell contains almost all the volume).
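A quick numerical check in `R` (a minimal sketch, not part of the original figure):
```{r, echo = TRUE}
# Fraction of a unit ball's volume lying in a thin outer shell (epsilon = 0.05)
shell_fraction <- function(D, eps = 0.05, r = 1) {
  (r^D - (r - eps)^D) / r^D
}
round(sapply(c(2, 10, 50, 100), shell_fraction), 3)
```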
```{r, out.height = 250, out.width = 400}
include_graphics("./figures/hyperspheres.png")
```
The probability mass versus dimension plot shows the fraction of probability mass within $0.95$ of a SD (yellow), between $0.95$ and $1.05$ SD (green), and beyond $1.05$ SD (white) [Bishop, 1995](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.679.1104&rep=rep1&type=pdf)
---
# Features of DR Methods
## Expected functionalities
The goal of DR methods is to **identify and eliminate the redundancies among the variables**. This requires:
1. Estimate the number of latent variables.
2. Embed data in order to reduce their dimensionality.
3. Embed data in order to recover the latent variable.
There's no *ideal* method that can perform all three tasks.
---
# Expected Functionalities
## Estimation of the number of latent variables
A certain process in nature may be generated from a small set of independent degrees of freedom, but it will usually appear in a more complex way due to a number of reasons:
- Measurement procedure
- Stochastic variation
Consider a sample of $D$-dimensional vectors generated by an unknown distribution. We assume that this distribution in data space is actually due to a small number $L < D$ of variables acting in combination, called **latent variables**. DR is achieved by defining a reverse mapping from data space onto latent space, so that every data point is assigned a representative in latent space.
---
# Expected Functionalities
## Estimation of the number of latent variables
The number of latent variables can be computed from a topological point of view, by estimating the intrinsic dimension(ality) of the data. When the intrinsic dimension $L$ of the data equals the dimension $D$ of the embedding space, there is no structure. On the contrary, when $L < D$, data points are often constrained to lie in a well-delimited subspace. Consequently, a low intrinsic dimension indicates that a topological object or structure underlies the data set.
```{r, out.height = 350}
data_topo <- loadDataSet("3D S Curve")
pd <- data.frame(data_topo@data, data_topo@meta)
plot_ly(x = pd$x, y = pd$y, z = pd$z, type = "scatter3d", mode = "markers", color = pd$x.1)
```
---
# Expected Functionalities
## Estimation of the number of latent variables
Usually the manifold is twisted or curved, and points on it are non-uniformly distributed.
Different approaches have been developed to cope with the intrinsic dimension (ID) estimation problem:
- Projection techniques look for the best subspace to project the data onto by:
    1. Minimizing a reconstruction error (e.g. in PCA it can be computed from the eigenvalues)
    2. Preserving pairwise distances (e.g. MDS)
- Fractal methods (Hausdorff dimension): based on the idea that the volume of a $D$-dimensional ball of radius $r$ scales as $r^D$, they count the number of points within a neighborhood of size $r$ and estimate the rate of growth of this number [Grassberger and Procaccia, 1983](https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.50.346)
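A rough Grassberger–Procaccia-style sketch (not from the original slides), reusing the dimRed S curve data set from these slides:
```{r, echo = TRUE}
# Correlation-dimension estimate: slope of log C(r) vs log r at small radii
X <- loadDataSet("3D S Curve", n = 500)@data
d <- as.vector(dist(X))
radii <- quantile(d, probs = seq(0.01, 0.10, by = 0.01))
C_r <- sapply(radii, function(r) mean(d < r))  # fraction of pairs within radius r
fit <- lm(log(C_r) ~ log(radii))
round(unname(coef(fit)[2]), 2)  # slope, expected near 2 for a 2D manifold in 3D
```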
---
# Expected Functionalities
## Embedding data for dimensionality reduction
Suppose there is a low $L$-dimensional hidden space where the real data generation mechanism lies, but we observe the data in a high $D$-dimensional space. Then there is a function $f: R^L \Rightarrow R^D$, called the embedding function, which takes the data from the $L$-dimensional space and maps it to the high-dimensional observable one as $y_i = f(x_i) + \epsilon_i$.
DR aims to re-embed data in such way that the manifold structure is preserved (the main problem is how to measure or characterize the structure of a manifold in order to preserve it).
```{r, out.width = 400, fig.asp = 0.6, dpi = 200}
pd <- embed(data_topo, "Isomap")
pd2 <- data.frame(pd@data@data, pd@data@meta)
ggplot(pd2, aes(iso.1, iso.2, color = x)) +
  geom_point() +
  scale_color_viridis_c() +
  labs(
    x = "Dimension1",
    y = "Dimension2",
    color = NULL
  ) +
  theme_classic()
```
---
# Expected Functionalities
## Embedding for latent variable separation
Additional constraints can be imposed on the desired $L$-dimension representation (generally not related to topology). For example, it is often assumed that the latent variables are independent from each other. In this case, the $L$-dimensional representation must also satisfy this property in order to state that the latent variables have been retrieved.
---
# Characteristics
## The model that data are assumed to follow
All methods assume that data sets are generated according to some specific model (e.g. linear / non-linear; continuous / discrete).
## The type of algorithm
- Batch methods - can't start working until the whole set of data is available.
- Online methods - used when data samples arrive one by one. Each time a new sample is available, online algorithms handle it independently from the previous ones. Unfortunately, such algorithms don't show the same desirable properties as algebraic procedures:
    - They can fall into a local optimum of the criterion (find a solution that is not exactly the best, but only an approximation).
    - They often require careful adjustment of several hyperparameters to speed up convergence.
---
# Characteristics
## The criterion to be optimized
- Mean square error - to compute it, the dimensionality is first reduced and then expanded back (provided that the data model can be reversed); a small `R` sketch follows this list.
$E_{codec} = E_y\{\Vert y - dec\left(cod\left(y\right)\right)\Vert^2_2\}$
- Variance - Projection that preserves the variance initially observable in the raw data.
- Pairwise distances - measured between the observations in the data set (from a topological point of view, the projection of the object should preserve its structure)
- Decorrelation - if the aim is latent variable separation. This criterion can be further enriched by making the estimated latent variables as independent as possible [independent component analysis](https://en.wikipedia.org/wiki/Independent_component_analysis).
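A minimal sketch of the mean-square (codec) error above, using PCA on the iris measurements with $k = 2$ retained components:
```{r, echo = TRUE}
# Empirical E_codec: project onto k PCs, reconstruct, and average the squared error
X     <- scale(iris[, 1:4], center = TRUE, scale = FALSE)
pca   <- prcomp(X, center = FALSE)
k     <- 2
X_hat <- pca$x[, 1:k] %*% t(pca$rotation[, 1:k])   # dec(cod(X))
mean(rowSums((X - X_hat)^2))
```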
---
# Categories
## Hard & soft DR
DR can be hard or soft, depending on the ratio of dimensionalities before and after reduction.
- Hard DR is suited for problems in which data have from hundreds to hundreds of thousands of variables. In such a case, a drastic DR is performed (by several orders of magnitude). Simple methods such as PCA or classical MDS can process very high-dimensional data and project them to a very low-dimensional space. Most nonlinear methods are less robust, due to their higher model complexity, and often fail to converge.
- Soft DR is suited for problems in which the data aren't too high-dimensional (less than a few tens of variables). Usually, the components are observed or measured values of different variables, which have a straightforward interpretation.
---
# Categories
## Linear & non-linear model
The distinction between methods based on a linear or a nonlinear model is probably the most straightforward way to classify them.
Nonlinear methods are often more powerful than linear ones, because the connection between the latent variables and the observed ones may be much richer than a simple matrix multiplication.
```{r, fig.asp = 0.4, dpi = 200}
pd <- loadDataSet("3D S Curve", n = 1000) %>%
  embed("PCA")
pd <- data.frame(pd@data@data)
ggplot(pd, aes(PC1, PC2)) +
  geom_point() +
  scale_color_viridis_c() +
  labs(
    x = "Dimension1",
    y = "Dimension2"
  ) +
  theme_classic()
```
---
# Categories
## Continuous & discrete model
*Concerns the continuity of the model*
- Continuous - DR is often achieved by a mapping between the initial and final spaces (a linear transformation in PCA).
- Discrete - the model consists of a finite set of interconnected points (e.g. self-organizing maps).
## Estimation of the dimensionality
*The presence of an estimator of the intrinsic dimensionality*
Most of the methods (except PCA) need the embedding dimension as an external hyper-parameter to be given by the user.
---
# Categories
## Implicit vs. explicit mapping
- Explicit methods search for a mapping from the original high-dimensional space to the low-dimensional space:
    1. Feature selection methods select only a few attributes of the total set of attributes.
    2. Feature extraction methods search for new features that are combinations of the original features.
- Implicit methods search for a low-dimensional space that best preserves certain properties of the data, such as distances or neighbourhood information. The term employed for these methods is embedding rather than dimension reduction or mapping.
---
# Categories
## Global & local methods
- Global methods try to recover the global information explicitly in the optimisation function.
- Local methods concentrate on recovering the local structure of the data, the global structure then emerges from the continuity of the local fits.
## Optional & mandatory vector quantization
When the amount of available data is very large one may work with a smaller set of representative observations - vector quantization can be applied (replaces the original observations in the data set with a smaller set of so-called prototypes or centroids).
---
# Categories
## Batch & online algorithm
*Observations may arrive consecutively or the whole data set may be available at once*
- Batch algorithms work on the whole data set at once (e.g. standard PCA).
- Online algorithms process observations one at a time as they arrive (e.g. iterative PCA).
The behavior of true online algorithms is rather complex, especially when the data sequence does not fulfill certain conditions (e.g. stationarity).
Fortunately, most online algorithms can be applied in a batch setting using the Robbins–Monro procedure (stochastic approximation). The latter simulates a possibly infinite data sequence by repeating a finite set of observations; by construction, this ensures some kind of stationarity.
---
# Categories
## Exact vs. approximate optimization
- Most often, batch algorithms result from analytical or algebraic developments that give the solution in closed form, like PCA. Given a finite data set, known in advance, they compute the optimal solution.
- In contrast, online or adaptive algorithms are often associated with generic optimization procedures like stochastic gradient descent. Such procedures don't offer strong guarantees about the result (the convergence may fail).
## Criterion To Be Optimized
- Pairwise distances between the points (i.e. distance preservation). As an advantage, distances are scalar values that can easily be compared to each other.
- Qualitative measure of proximities. The exact values of the distances do not matter (e.g. "point $b$ is closer to point $a$ than point $c$ is").
---
# Principal Component Analysis
PCA is the most basic technique for DR (and maybe even the oldest - [Pearson, 1901](https://www.tandfonline.com/doi/abs/10.1080/14786440109462720)).
```{r, out.height = 250, out.width = 400}
include_graphics("./figures/pearson.png")
```
### Goal
Find an orthonormal transformation matrix $P$ with which $Y = PX$ such that $C_Y = \frac{1}{n - 1} YY^{T}$ is diagonalized (i.e. PCA tries to de-correlate the original data by finding directions in which the variance is maximized, while the off-diagonal covariances are minimized).
???
The covariance matrix shows how much variables change together, and we want a representation where:
1. Maximum variance on the diagonal (max difference between samples);
2. Minimum off-diagonal variance (non-zero values are caused by noise and redundancy).
---
# Principal Component Analysis
### Algorithm
1. Scale - equalizes variance
2. Center - equalizes mean
3. Get similarity matrix $C_X$ (covariance, correlation, etc)
4. Calculate eigenvalues and eigenvectors
$C_Y = \frac{1}{n - 1}YY^{T} = \frac{1}{n - 1} PX(PX)^{T} = \frac{1}{n - 1} PXX^{T}P^{T} \\ \hspace{7.4mm} = \frac{1}{n - 1} PAP^{T} = \frac{1}{n - 1} V^{T}(V \Lambda V^{T})V = \frac{1}{n - 1} \Lambda$
where $A = XX^{T} = V \Lambda V^{T}$ and the rows of $P$ are chosen as the eigenvectors of $A$ (i.e. $P = V^{T}$).
Complexity of PCA is $\mathcal{O}(p^2n + p^3)$.
Implementation in `R` with `stats::prcomp`
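A minimal usage sketch (not from the original slides) on the built-in `iris` measurements:
```{r, echo = TRUE}
# PCA with stats::prcomp: center, scale, then inspect the variance explained
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
round(pca$sdev^2 / sum(pca$sdev^2), 3)  # proportion of variance per PC
head(pca$x[, 1:2], 3)                   # scores on the first two PCs
```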
### Assumptions
- Linearity
- Large variances have important structure
---
# Kernel Principal Component Analysis
If groups are linearly inseparable in the input space $R^{2}$, we can make them linearly separable by mapping them to a higher-dimensional space $R^3$. kPCA extends PCA to deal with nonlinear dependencies - map the data into a high-dimensional space and then perform PCA in this space ([Scholkopf et al., 1998](http://www.face-rec.org/algorithms/Kernel/kernelPCA_scholkopf.pdf)):
$\Phi: R^{2} \Rightarrow R^{3}$
$(x_1, x_2) \Rightarrow (x_1, x_2, x_1^2 + x_2^2)$
```{r, out.height = 250, out.width = 400}
include_graphics("./figures/kPCA.png")
```
---
# Kernel Principal Component Analysis
This can be computationally expensive, but we can use the *kernel trick* - the principal components can also be computed from the inner product matrix $K = X^TX$. We don't need to explicitly map all points into the high-dimensional space and do the calculations there; it is enough to obtain the kernel matrix.
Example of calculating the kernel matrix using a Gaussian kernel:
$K_{ij} = K(x_i, x_j) = \Phi(x_i)^T\Phi(x_j) = \exp\left(- \frac{\lVert x_i - x_j\rVert ^2}{2\sigma^2}\right)$
### Algorithm
1. Select kernel function
2. Calculate the kernel matrix
3. Center kernel matrix
4. Solve the eigenproblem
5. Project the data to each new dimension
Implementation in `R` with `kernlab::kpca`
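A minimal, hedged sketch assuming the `kernlab` package is installed (it is not loaded in the setup chunk), using a Gaussian (RBF) kernel:
```{r, echo = TRUE, eval = FALSE}
# Hedged sketch: kernel PCA with an RBF kernel via kernlab
library(kernlab)
X   <- as.matrix(iris[, 1:4])
kpc <- kpca(X, kernel = "rbfdot", kpar = list(sigma = 0.1), features = 2)
head(rotated(kpc))   # data projected onto the first two kernel PCs
```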
---
# Classical Multidimensional Scaling
Uses an eigenvalue decomposition on a transformed distance matrix to find an embedding that maintains the distances of the distance matrix.
### Algorithm
Given Euclidean distance matrix $D = (d_{ij})$ find $X = [x_1, ..., x_n]$ so that $\lVert x_i - x_j \rVert = d_{ij}$.
- Given proximity matrix $D$ set up squared matrix $D^2$
- Apply double centering:
$B=-\frac{1}{2}JD^2J$, where $J$ is centering matrix
- Use eigen-decomposition and determine $n$ largest eigenvalues and corresponding eigenvectors of $B$
- Compute $X$ by applying $X = V \Lambda^\frac{1}{2}$
---
# Classical Multidimensional Scaling
```{r, echo = TRUE, include = TRUE}
cmdscale_povilas <- function(D, dimensions = 2) {
  n <- attr(D, "Size")
  # squared proximity matrix D^2
  x <- matrix(0, n, n)
  x[lower.tri(x)] <- D^2
  x <- x + t(x)
  # double centering: B = -1/2 * J D^2 J
  rm <- x * 0 + rowMeans(x)
  cm <- t(x * 0 + colMeans(x))
  B <- -(x - rm - cm + mean(x)) / 2
  # eigen-decomposition and projection X = V Lambda^(1/2)
  e <- eigen(B, symmetric = TRUE)
  ev <- e$values[seq_len(dimensions)]
  evec <- e$vectors[, seq_len(dimensions), drop = FALSE]
  evec * rep(sqrt(ev), each = n)
}
mean(cmdscale_povilas(eurodist) - stats::cmdscale(eurodist))
```
---
# Non-metric Multidimensional Scaling
nMDS relies on the ranking of distances (it is an optimization process minimizing a *stress* function and is solved by iterative algorithms). A solution is found such that the rank order of distances between points in the ordination matches the rank order of dissimilarities between the points.
### Algorithm
Given dissimilarities matrix $D$ and number of wanted dimensions $n$
1. Generate an initial ordination of the sample units (starting configuration) in $n$-dimensional space (in `R` can be specified as `cmdscale(D, n)`)
2. Calculate the distances between each pair of sample units in the current ordination
3. Perform a monotonic regression of the distances vs. dissimilarities (Shepard plot)
4. Calculate the stress
5. Slightly move each point in the ordination in a way that will decrease stress
6. Repeat steps 2 – 5 until stress either approaches zero or stops decreasing
---
# Non-metric Multidimensional Scaling
There’s no way to know in advance what $n$ is (it is possible to use a scree plot of `stress ~ n`).
Implementation in `R` with `vegan::monoMDS`.
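A minimal sketch on the built-in `eurodist` distances (vegan is already loaded in the setup chunk):
```{r, echo = TRUE}
# Minimal sketch: 2D non-metric MDS of the eurodist road distances
nmds <- monoMDS(eurodist, k = 2)
round(nmds$stress, 3)   # final stress of the 2D solution
head(nmds$points, 3)    # ordination coordinates
```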
```{r, out.height = 200, out.width = 450, dpi = 400}
include_graphics("./figures/shepard.png")
# vegan::stressplot(vegan::metaMDS(dist(eurodist)), main = "Shepard diagram with monotonic regression")
```
---
# Locally Linear Embedding
Points that lie on a manifold in a high-dimensional space can be reconstructed through linear combinations of their neighborhoods. LLE assumes that each data point and its neighbors lie on or close to a locally linear patch of the manifold.
### Algorithm
1. Compute the neighbors of each data point $\vec{x}_{\,i}$ (number of neighbors is the only needed parameter)
2. Compute the weights $W_{ij}$ that best reconstruct each data point $\vec{x}_{\,i}$ from its neighbors
3. Compute the vectors $\vec{y}_{\,i}$ best reconstructed by the weights $W_{ij}$
4. Apply eigenvalue decomposition.
LLE is similar to Isomap but computationally cheaper because $W$ is sparse. Implementation in `R` with `lle::lle`.
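A minimal, hedged sketch assuming the `lle` package is installed (it is not loaded in the setup chunk); the data set is the same dimRed S curve used earlier:
```{r, echo = TRUE, eval = FALSE}
# Hedged sketch: LLE on the 3D S curve; m = target dimension, k = neighbours
library(lle)
X   <- loadDataSet("3D S Curve", n = 500)@data
emb <- lle(X, m = 2, k = 12)
head(emb$Y)   # the 2D embedding
```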
---
# t-dist Stochastic Neighbor Embedding
t-SNE uses the local relationships between points to create a low-dimensional mapping, which allows it to capture non-linear structure.
Why not LLE or kPCA?
- It handles the crowding problem very well (it makes the optimization "spread out" the medium-distance points to prevent crowding).
- It uses "stochastic neighbors" - there is no clear line between which points are neighbors of the other points (it takes both global and local structure into account).
```{r, out.height = 200, out.width = 450, dpi = 400}
include_graphics("./figures/crowding.png")
```
---
# t-dist Stochastic Neighbor Embedding
### Algorithm
- In the high-dimensional space, create a probability distribution that dictates the relationships between various neighboring points
1. Pick point $x_i$
2. Define the conditional probability of picking another point $x_j$ in the dataset as the neighbor of $x_i$:
$p_{j|i} = \frac{\exp(-\left \| x_i - x_j \right \|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(- \left \| x_i - x_k \right \|^2 / 2\sigma_i^2)}$
The probability is proportional to the probability density of a Gaussian centered at $x_i$; $\sigma_i$ is calculated using the hyperparameter *perplexity*.
- Recreate a low-dimensional space that follows that probability distribution as well as possible.
1. Use a Student t-distribution with a single degree of freedom (a Gaussian distribution has a "short tail", meaning nearby points are likely to be squashed together):
$q_{ij} =\frac{(1 + \left \| y_i - y_j \right \|^2 )^{-1}}{\sum_{k \neq l} (1 + \left \| y_k - y_l \right \|^2 )^{-1} }$
---
# t-dist Stochastic Neighbor Embedding
### Algorithm
2.1 Optimize the distribution according to Kullback–Leibler divergence:
$D_{\text{KL}}(P\parallel Q)=\sum _{x\in {\mathcal {X}}}P(x)\log \left({\frac {P(x)}{Q(x)}}\right)$
2.2 "Early compression" penalty - Adds penalty to the cost function at the early stages of optimization to make the distance between embedded points remain small.
2.3 "Early exaggeration" - all the $p_{ij}$ are exaggerated at the early stages of optimization. The effect of this is to force the values of $q_{ij}$ to become more focused on larger $p_{ij}$
---
# t-dist Stochastic Neighbor Embedding
### Interpretation
- Hyperparameters are important
```{r, out.height = 140, out.width = 1000, dpi = 400}
include_graphics("./figures/tsne1.png")
```
- Cluster sizes in a t-SNE plot mean nothing
```{r, out.height = 140, out.width = 1000, dpi = 400}
include_graphics("./figures/tsne2.png")
```
- Distances between clusters might not mean anything
---
# Tips for Effective DR
## Choose an appropriate method
- Nature of input data (continuous, categorical, count, distance data).
- When multidomain data is available one might apply DR to each dataset separately and then align them using [a Procrustes transformation](https://onlinelibrary.wiley.com/doi/abs/10.1002/bs.3830070216) or methods that allow integration of multiple datasets (STATIS).
- Resolution of the data (DR methods can be focused on recovering either global or local structures in the data).
- If observations in data have assigned class labels (and goal is to obtain a representation that best separates them into known categories) use *supervised DR techniques*. For example: partial least squares (PLS), linear discriminant analysis (LDA), neighborhood component analysis (NCA) and the bottleneck neural network classifier.
---
# Tips for Effective DR
## Preprocess continuous and count input data
Before applying DR, suitable data preprocessing is often necessary.
- Data centering (subtracting variable means) is a required step for PCA.
- Scaling variance (multiplying each measurement of a variable by a scalar factor so that the resulting feature has a variance of one) ensures an equal contribution from each variable. When the units of all variables are the same, normalizing feature variances is not advised, because it results in shrinkage of features containing strong signals and inflation of features with no signal.
- Log-transformation if changes in the data are multiplicative (variables measure percent increase/decrease).
- Variance-stabilizing transformation when the data exhibit a mean-variance trend in which features with higher means have higher variances.
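A minimal sketch of these preprocessing steps in base `R` (illustrative only):
```{r, echo = TRUE}
# Centering, unit-variance scaling, and a log transform on the iris measurements
X <- as.matrix(iris[, 1:4])
X_centered <- scale(X, center = TRUE, scale = FALSE)  # centering (required for PCA)
X_scaled   <- scale(X)                                # center + unit variance
X_log      <- log1p(X)                                # for multiplicative changes
round(apply(X_scaled, 2, var), 2)                     # each variable now has variance 1
```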
---
# Tips for Effective DR
## Handling categorical input data
- If the available measurements are categorical, use Multiple Correspondence Analysis (MCA), which quantifies categorical data by assigning numerical values to the objects and categories so that objects within the same category are close together and objects in different categories are far apart (a small sketch follows below).
- Mixed datasets - factor analysis of mixed data (PCA for quantitative variables and MCA for qualitative variables); or perform PCA on variables transformed using an *optimal quantification* (replace the variance in PCA by a chi-squared distance on category frequencies).
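A minimal, hedged sketch assuming the `FactoMineR` package is installed; the all-categorical data frame below is made up for illustration:
```{r, echo = TRUE, eval = FALSE}
# Hedged sketch: MCA on a small, made-up categorical data set
library(FactoMineR)
X <- data.frame(
  cyl  = factor(mtcars$cyl),
  gear = factor(mtcars$gear),
  am   = factor(mtcars$am)
)
res <- MCA(X, graph = FALSE)
round(res$eig, 2)   # eigenvalues and variance explained per dimension
```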
---
# Tips for Effective DR
## Use embedding methods to reduce similarity/dissimilarity
Choose a dissimilarity metric that provides the best summary of the data (a small `vegan::vegdist` sketch follows the list):
- Euclidean
- Manhattan (binary data)
- Jaccard (features are sparse)
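A minimal sketch computing such dissimilarities with `vegan::vegdist` (vegan is loaded in the setup chunk):
```{r, echo = TRUE}
# Euclidean, Manhattan and Jaccard dissimilarities on the iris measurements
X <- as.matrix(iris[, 1:4])
d_euc <- vegdist(X, method = "euclidean")
d_man <- vegdist(X, method = "manhattan")
d_jac <- vegdist(X, method = "jaccard")
round(as.matrix(d_jac)[1:3, 1:3], 2)
```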
## Apply the correct aspect ratio for visualizations
- DR plots should obey an aspect ratio consistent with the relative amount of information explained (i.e. the axis ratio of the plot should be consistent with the ratio between the corresponding eigenvalues).
- The ordering of the dimensions is not meaningful in many optimization-based DR methods (e.g. t-SNE).
---
# Tips for Effective DR
## Decide on the number of dimensions to retain
The first $n$ dimensions might explain an insufficient fraction of the variance, in which case higher-order components should also be retained.
- The Guttman-Kaiser rule - pick PCs with eigenvalues of at least $1$.
- The elbow rule - Choose a number of dimensions so that adding another dimension doesn't give much better modeling of the data.
- Randomization methods based on eigenvalues - randomize the values within variables in the data and conduct a PCA on the reshuffled data; in each randomization, evaluate statistics based on the eigenvalues (a small sketch follows the list):
    1. The observed eigenvalue for each axis.
    2. A pseudo-F-ratio calculated as each eigenvalue divided by the sum of the remaining (or smaller) eigenvalues.
    3. The ratio (or difference) between an eigenvalue and the next adjacent eigenvalue.
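A minimal sketch of such a randomization check (permuting values within variables), using base `R` and the iris measurements:
```{r, echo = TRUE}
# Compare observed PCA eigenvalues against eigenvalues from permuted data
X <- scale(iris[, 1:4])
obs_eig <- prcomp(X)$sdev^2                 # observed eigenvalues
null_eig <- replicate(200, {
  X_perm <- apply(X, 2, sample)             # shuffle each variable independently
  prcomp(X_perm)$sdev^2
})
# fraction of permutations whose first eigenvalue reaches the observed one
mean(null_eig[1, ] >= obs_eig[1])
```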
---
# Tips for Effective DR
## Understanding the meaning of the new dimensions
Correlation circles can be used to determine which original variables are associated with each other or with the generated dimensions. The angles between the feature vectors, or between a vector and a PC axis, are informative: vectors at approximately 0° (180°) to each other indicate that the corresponding variables are closely, positively (negatively) related, whereas vectors at a 90° angle indicate rough independence.
```{r, out.height = 250, out.width = 800, dpi = 400}
include_graphics("./figures/contribution.png")
```
---
# Tips for Effective DR
## Finding hidden signal
- If the goal is to identify discrete clusters, it's common practice to apply PCA first (intended as a noise-reduction step). The output generated by neighborhood embedding techniques, such as t-SNE, should not be used for clustering, as it preserves neither distances nor densities.
- Unlike discrete clusters, continuous changes in the data are less frequently identified.
```{r, out.height = 250, out.width = 800, dpi = 400}
include_graphics("./figures/Cluster_vs_Gradient.png")
```
---
# Tips for Effective DR
## Checking the robustness of results and quantify uncertainties
- If multiple PCs have very similar variances and the corresponding eigenvalues are almost exactly the same then their loadings can't be interpreted separately.
- When working with techniques that require parameter specification, one should also check the stability of results against different parameter settings, e.g., perplexity in t-SNE.
- Estimate uncertainties by constructing a collection of "bootstrap" datasets, i.e., random subsets of the data generated by resampling observations.
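A minimal sketch of a bootstrap stability check for the first PC loadings (illustrative only):
```{r, echo = TRUE}
# Bootstrap the observations and compare PC1 loadings with the original ones
X <- scale(iris[, 1:4])
pc1 <- prcomp(X)$rotation[, 1]
boot_cor <- replicate(200, {
  idx <- sample(nrow(X), replace = TRUE)    # resample observations
  b   <- prcomp(X[idx, ])$rotation[, 1]
  abs(cor(pc1, b))                          # |cor| handles arbitrary sign flips
})
round(quantile(boot_cor, c(0.05, 0.5, 0.95)), 3)
```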
```{r, out.height = 220, out.width = 350, dpi = 300}
include_graphics("./figures/bootstrap_uncertainties.png")
```