Showing 27 changed files with 237 additions and 2 deletions.
@@ -1,7 +1,7 @@
---
title: AudioMAE
tags:
- sapling
- paper-review
enableToc: false
---
## AudioMAE
@@ -0,0 +1,64 @@
---
title: CS231n Classification
tags:
- sapling
- short-notes
enableToc: false
---
### Introduction
Many other seemingly distinct Computer Vision tasks (such as object detection, segmentation) can be reduced to image classification.

Image classification is solved through a data-driven approach.

>[!question]
>Are there alternatives to the data-driven approach for image classification?

### Challenges
>[!info]
>The challenges below arise because Computer Vision algorithms take the raw representation of an image as a 3-D array of brightness values.

A non-exhaustive list of challenges:

- Viewpoint variation
- Scale variation
- Deformation
- Occlusion
- Illumination conditions
- Background clutter
- Intra-class variation

### Image Classification Pipeline
- **Input**: A set of N images, each labeled with one of K different classes. [training set]
- **Learning**: Use the training set to learn what every one of the classes looks like. [training a classifier] or [learning a model]
- **Evaluation**: Evaluate the quality of the classifier on a new set of images it has never seen before. Predictions should match up with the true answers (which we call the ground truth).

### Nearest Neighbour Classifier
- It is rarely used in practice.
- Given the training set and the test set, find the nearest neighbour of each test sample using the L1/L2 distance; a minimal sketch follows this list.
- [k-nearest neighbour classifier] Higher values of k have a smoothing effect that makes the classifier more resistant to outliers.
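
A minimal NumPy sketch of the nearest-neighbour classifier described above; the class name, the array shapes, and the choice of L1 distance with a majority vote are illustrative assumptions, not code from the course.

```python
# Hedged sketch of a (k-)nearest-neighbour classifier.
import numpy as np

class NearestNeighbour:
    def train(self, X, y):
        # "Training" only memorises the data: X is (N, D), y is (N,) integer labels.
        self.X_train = X
        self.y_train = y

    def predict(self, X, k=1):
        preds = np.empty(X.shape[0], dtype=self.y_train.dtype)
        for i, x in enumerate(X):
            # L1 (Manhattan) distance from the test sample to every training point.
            dists = np.sum(np.abs(self.X_train - x), axis=1)
            # Majority vote among the k closest training labels.
            nearest = self.y_train[np.argsort(dists)[:k]]
            preds[i] = np.bincount(nearest).argmax()
        return preds
```
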
### Validation sets for Hyperparameter tuning
- [Validation] Split the training set into a training set and a validation set. Use the validation set to tune all hyperparameters.
- At the end, run a single time on the test set and report the performance, ie, the generalisation.
- [Cross-validation] Split the training set into k folds, where one fold is used as the validation set in each turn, and the average over the performances on all validation folds is used for hyperparameter tuning. A sketch of this loop follows the figures below.
>[!question]
>Is the same model or a different model trained for each choice of validation fold from the k folds of the training set?
![[Pasted image 20240127171935.png]]
![[Pasted image 20240127172003.png]]
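
A hedged sketch of k-fold hyperparameter search for the k in k-NN; it assumes the NearestNeighbour class sketched earlier and integer class labels, and the fold count and candidate values are illustrative.

```python
# Average validation accuracy over folds for each candidate hyperparameter value.
import numpy as np

def cross_validate_k(X, y, candidate_ks, num_folds=5):
    folds_X = np.array_split(X, num_folds)
    folds_y = np.array_split(y, num_folds)
    results = {}
    for k in candidate_ks:
        accuracies = []
        for i in range(num_folds):
            X_val, y_val = folds_X[i], folds_y[i]
            X_tr = np.concatenate(folds_X[:i] + folds_X[i + 1:])
            y_tr = np.concatenate(folds_y[:i] + folds_y[i + 1:])
            clf = NearestNeighbour()
            clf.train(X_tr, y_tr)
            accuracies.append(np.mean(clf.predict(X_val, k=k) == y_val))
        results[k] = np.mean(accuracies)  # average over the folds
    return results
```
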

- Raw-pixel-based L1/L2 distance is counter-intuitive: images that are perceptually similar can be far apart in pixel space, and perceptually different images can be close.
- k-Nearest Neighbours is computationally expensive at test time, while training is trivial and cheap (it only stores the data).
#### Copy-Notes
- If there are many hyperparameters to estimate, you should err on the side of having a larger validation set to estimate them effectively.
- If you are concerned about the size of your validation data, it is best to split the training data into folds and perform cross-validation.
- If you can afford the computational budget, it is always safer to go with cross-validation (the more folds the better, but more expensive).
### Relevant Blogs
[[t-SNE.md|t-SNE]]
[[PCA.md|PCA]]
@@ -0,0 +1,26 @@
---
title: Lasso vs Ridge
tags:
- sapling
enableToc: false
---
### Introduction
Lasso and Ridge are regularisation methods used to find an optimally complex model, one that is as simple as possible while still performing well on the training data.

![[Pasted image 20240128033726.png]]
- An optimally complex model balances bias and variance.
### When Lasso When Ridge
- [Lasso]: To remove unnecessary features; a short sketch follows below.
![[Pasted image 20240128033627.png]]
- [Ridge]: To build a robust model
![[Pasted image 20240128033645.png]]
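
A small sketch contrasting the two penalties with scikit-learn; the synthetic data (only 3 of 10 features are informative) and the alpha values are illustrative assumptions.

```python
# Lasso tends to zero out irrelevant coefficients; Ridge only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0, 1.5] + [0.0] * 7)          # 7 irrelevant features
y = X @ true_w + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 2))   # irrelevant ones driven to exactly 0
print("Ridge coefficients:", np.round(ridge.coef_, 2))   # shrunk, but rarely exactly 0
```
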
### Feature Selection using Lasso
- The orange contour represents the regularisation term and the blue contour represents the error term.
- The points where the error-term and regularisation-term contours are tangential to one another are the candidate optimal solutions at which the cost function is minimised.
- In Ridge, the two contours are very unlikely to be tangential to one another on the x- or y-axis, so it is difficult to obtain a sparse solution.
- Since the Lasso constraint region has faces, corners, and edges in high dimensions, there is a high chance that the two contours are tangential to one another on the x- or y-axis.

![[Pasted image 20240128033122.png]]

![[Pasted image 20240128033437.png]]
@@ -0,0 +1,77 @@
---
title: PCA
tags:
- sapling
enableToc: true
---
### Introduction

Data is often high-dimensional, so it can neither be stored directly nor ignored completely.
Dimensionality reduction techniques are:
- [Filtering]: Leave out most of the dimensions and concentrate only on certain dimensions.
- [PCA]: Project the high-dimensional data onto a lower-dimensional subspace using linear or non-linear transformations (or projections).
>[!info]
>The basic idea is that n (the number of data items) should be greater than the number of dimensions.
![[Pasted image 20240127174656.png]]
The above is an example of PCA, which is a linear projection method.
### Detailed Explanation

- **Problem**: Approximating 2-D data points using a lower-dimensional representation, ie, 1-D.
- Details:
    - Instead of storing 2 values for each data point, we store 1 value per data point plus a vector V that is common across all the data points.
    - For each data point you have to store only this scalar value s, which gives the distance along this vector V.
![[Pasted image 20240127180635.png]]
- Other Details
    - You should choose the V that minimises the residual variance, ie, the difference between your original data and your projections.
    - It allows you to reconstruct the original data with the least possible error.
    - The projection onto the vector V is orthogonal.
    - You should pick V in the direction of the biggest spread of your data.
    - This can be extended to multiple components.
    - You can repeat this process and find the second component, which captures the second-biggest variance of the data, ie, principal component 2.
![[Pasted image 20240128034433.png]]

### Understanding SVD

>[!question] Need for SVD
>- The steps to implement PCA are expensive when X is very large or very small.
>- The best way to compute principal components is by using SVD.
>- SVD is one of the best linear transformation methods.

**PCA Implementation** (a NumPy sketch follows the steps below)
1. Subtract the mean from the data.
2. Scale each dimension by its variance.
3. Compute the covariance matrix S. Here X is the data matrix.
$$S = \frac{1}{N} X^T X$$
4. Compute the K largest eigenvectors of S. These eigenvectors are the principal components of the data set.
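
A minimal NumPy sketch of the four steps above; the function name, the use of `np.linalg.eigh`, and scaling by each dimension's standard deviation are illustrative assumptions.

```python
# Hedged PCA sketch via eigendecomposition of the covariance matrix.
import numpy as np

def pca(X, K):
    X = X - X.mean(axis=0)                 # 1. subtract the mean
    X = X / X.std(axis=0)                  # 2. scale each dimension by its spread
    S = (X.T @ X) / X.shape[0]             # 3. covariance matrix S = (1/N) X^T X
    eigvals, eigvecs = np.linalg.eigh(S)   # eigenpairs of the symmetric matrix S
    order = np.argsort(eigvals)[::-1]      # largest eigenvalues first
    components = eigvecs[:, order[:K]]     # 4. K largest eigenvectors = principal components
    return X @ components, components      # projected data and the components
```
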
#### What is SVD?

Any matrix X, whether square or rectangular, singular or not, can be decomposed into a product of three matrices: two orthogonal matrices U and V and a diagonal matrix D.

$$X = UDV^T$$

![[Pasted image 20240128141547.png]]

[PCA using SVD] SVD applied to S (the covariance matrix) is used to obtain its eigenvectors and eigenvalues.
- The columns of matrix U form the eigenvectors of S.
- The matrix D is diagonal, and its diagonal values are the eigenvalues in descending order.
- The eigenvectors have the same dimensionality as a single data point.
- **What does SVD have to do with Dimensionality Reduction?**
    - How does PCA help in dimensionality reduction?
    - If we reduce the number of dimensions from k to q (q < k),
    - the number of column vectors of U that we keep is reduced to q, ie, the data now lives in a q-dimensional hyper-plane inside a k-dimensional world. A sketch follows below.
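
A hedged sketch of PCA via SVD on the covariance matrix, mirroring the bullets above; the function name and array shapes are illustrative assumptions.

```python
# PCA via SVD: columns of U are the eigenvectors of S, D holds the eigenvalues.
import numpy as np

def pca_svd(X, q):
    X = X - X.mean(axis=0)              # centre the data
    S = (X.T @ X) / X.shape[0]          # covariance matrix
    U, D, Vt = np.linalg.svd(S)         # S = U D V^T; columns of U are eigenvectors of S
    components = U[:, :q]               # keep the q leading directions
    eigenvalues = D[:q]                 # eigenvalues, already in descending order
    return X @ components, components, eigenvalues
```
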

>[!notes] Intuition behind PCA using SVD Dimensionality Reduction
>When we reduce the dimensions from k to q (q < k), the points now lie in a q-dimensional hyper-plane inside a k-dimensional world. They can be stored as q-dimensional data points along with the eigenvectors, while the discarded eigenvalues indicate how much information we have lost.
>
>![[Pasted image 20240128142944.png]]

### Image recognition example

![[Pasted image 20240128143200.png]]

![[Pasted image 20240128143210.png]]

![[Pasted image 20240128143218.png]]
![[Pasted image 20240128143241.png]]
@@ -0,0 +1,34 @@
---
title: PolyGrad
tags:
- paper-review
enableToc: false
---
## PolyGrad

The paper ["World Models via Policy-Guided Trajectory Diffusion"](https://arxiv.org/abs/2312.08533) introduces a novel world-modelling approach, "Policy-Guided Trajectory Diffusion" (PolyGrad), that is not autoregressive and generates entire on-policy trajectories in a single pass through a diffusion model.

> [!info] Drawback of Autoregressive World Models
> Prediction error inevitably compounds as the trajectory length grows, because such models interleave predicting the next state with sampling the next action from the policy.

>[!question] Examples of On-policy and Off-Policy RL algorithms?
> SARSA and Q-Learning respectively.
> [[./on-policy-Vs-off-policy|On-Policy vs Off-Policy RL]]
### Model
- TBA
### Method
- TBA

### Techniques
- TBA
## Generalisability:

* TBA
## Limitations:

* TBA

## Extended Research Direction:

* TBA
@@ -0,0 +1,31 @@
---
title: On-policy vs Off-policy RL
tags:
- sapling
enableToc: false
---
### Introduction
- Reinforcement Learning is learning how to map situations to actions so as to maximise a numerical reward signal.
- A policy is a function that maps from states to actions.
- There are two types of policies:
    - **Target policy**: The policy being optimised for decision making.
    - **Behaviour policy**: The policy used to take actions in, ie, navigate, the environment.
- Off-policy RL algorithms have different behaviour and target policies; they can decouple data collection from training.
- On-policy RL algorithms have the same behaviour and target policy: the agent takes actions and learns using the same policy.

### Q-Learning is an Off-Policy RL Algorithm

- Say that the agent is randomly choosing actions to execute in the environment, ie, the behaviour policy is random.
- We get the Q value for (S, right) using the Bellman equation $$Q(S, \text{right}) = R + \gamma \max_a Q(S', a)$$
- Note that in the above equation we do not actually take the action a; it is selected based on our target policy but never executed.
- For most off-policy algorithms,
    - the target policy is greedy.
    - the behaviour policy can be random, $\epsilon$-greedy or greedy.
- This Q(S, right) target, observed on taking the action, will be used to update our estimate using the TD method.
- Data collection and learning of the target policy can thus be decoupled, so Q-Learning is an off-policy RL algorithm. The two update rules are sketched below.
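
To make the contrast concrete, here is a hedged sketch of the two tabular update rules; it assumes a discrete state/action space stored in a NumPy Q-table, with illustrative values for the step size and discount factor.

```python
# Q-learning (off-policy) vs SARSA (on-policy) tabular updates.
import numpy as np

alpha, gamma = 0.1, 0.99  # assumed step size and discount factor

def q_learning_update(Q, s, a, r, s_next):
    # Target uses the greedy (target-policy) action at s_next,
    # regardless of which action the behaviour policy actually executes: off-policy.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    # Target uses a_next, the action actually taken by the behaviour policy: on-policy.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```
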

>[!info] Q-Learning (off-policy TD control) for estimating $\pi \approx \pi_*$
>![[Pasted image 20240109185933.png]]

>[!info] Sarsa (on-policy TD control) for estimating $Q \approx q_*$
>![[Pasted image 20240109190020.png]]