---
title: "RL: Alpha"
tags:
  - sapling
enableToc: false
---
### Introduction
- The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that **evaluates** the actions taken rather than **instructs** by giving correct actions. This is what creates the need for active exploration, for an explicit search for good behavior.
- Why is exploration needed?
  - We cannot find the optimal policy through exploitation alone, since we receive only evaluative feedback.
The two main aspects of RL are
- **evaluative feedback**, i.e., how good the action taken was, but not whether it was the best or worst action possible.
- **associative property**, i.e., the best action depends on the situation. The k-armed bandit problem setting helps in understanding the non-associative, evaluative-feedback aspect of RL.
### K-armed Bandit Problem
- "Learning action-value estimates", i.e., action-value methods (see the sketch below)
  - $\epsilon$-greedy
    - Incremental update to estimate the value associated with each action
    - $\epsilon$ is the exploration probability
    - $NewEstimate \leftarrow OldEstimate + StepSize \cdot [Target - OldEstimate]$
    - A **decreasing StepSize for a stationary reward distribution** allows the estimates to converge, whereas a **constant StepSize is used for a non-stationary reward distribution**.
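A minimal sketch of the above; the class name `EpsilonGreedyAgent` and its interface are hypothetical assumptions for illustration (the actual implementation lives in the repo linked below):

```python
import numpy as np

class EpsilonGreedyAgent:
    """Hypothetical sketch of an epsilon-greedy action-value agent."""

    def __init__(self, k, epsilon=0.1, step_size=None, initial_q=0.0):
        self.k = k                    # number of arms
        self.epsilon = epsilon        # exploration probability
        self.step_size = step_size    # None -> sample average (1/N); float -> constant alpha
        self.q = np.full(k, initial_q, dtype=float)  # action-value estimates
        self.n = np.zeros(k)          # per-action selection counts

    def select_action(self):
        # Explore with probability epsilon, otherwise pick the greedy action.
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.k)
        return int(np.argmax(self.q))

    def update(self, action, reward):
        # NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
        self.n[action] += 1
        alpha = self.step_size if self.step_size is not None else 1.0 / self.n[action]
        self.q[action] += alpha * (reward - self.q[action])
```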
- Optimistic initial values
  - For a stationary reward distribution, setting optimistic initial values encourages exploration and faster convergence (see the snippet below).
  - For a non-stationary reward distribution, it has little effect, since the mean of the targets is constantly changing.
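As a usage note on the sketch above: optimistic initialization only changes the starting estimates. The `5.0` here is an illustrative assumption, chosen to be well above any realistic reward:

```python
# Every arm starts out looking better than any reward the agent will actually
# see, so each disappointing pull pushes even a purely greedy agent (epsilon=0)
# toward the arms it has not tried yet.
agent = EpsilonGreedyAgent(k=10, epsilon=0.0, initial_q=5.0)
```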
- Upper-Confidence-Bound (UCB) action selection
  - allows exploration, with preference given to actions based on how close their estimates are to being maximal and on the uncertainty in those estimates (see the sketch below)
  $$A_t = \underset{a}{\operatorname{argmax}} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$
  - If $N_t(a) = 0$, then $a$ is selected first.
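A minimal sketch of UCB selection; the helper name `ucb_action` is a hypothetical, while `q`, `n`, `t`, and `c` follow the notation in the formula above:

```python
import numpy as np

def ucb_action(q, n, t, c=2.0):
    """Pick argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ] for t >= 1."""
    untried = np.where(n == 0)[0]
    if untried.size > 0:
        return int(untried[0])           # N_t(a) = 0: a is selected first
    bonus = c * np.sqrt(np.log(t) / n)   # uncertainty bonus shrinks as N_t(a) grows
    return int(np.argmax(q + bonus))
```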
- "Learning a Numerical Preference" is an alternative to methods that estimate action values. | ||
- Gradient Bandit Algorithm | ||
- The numerical preference has no interpretation in terms of rewards, in contrast with estimated action values. (ie, which is estimated mean reward we get) | ||
$$Pr(A_t=a) = e^{H_t(a)}/{\Sigma_b e^{H_t(b)}} = \pi(a)$$ | ||
- Here $\pi(a)$ is the probability of taking action a at time t. | ||
- Learning algorithm for soft-max action preferences based on the idea of stochastic gradient ascent. | ||
- $H_{t+1}(A_t) = H_t(A_t) + \alpha * (R_t - \bar{R_t}) (1 - \pi_t(A_t))$ | ||
- $H_{t+1}(a) = H_t(a) - \alpha * (R_t - \bar{R_t})* \pi_t(A_t)\text{ ........} \forall a!=A_t$ | ||
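A minimal sketch of the gradient bandit updates above; the class name `GradientBanditAgent` is hypothetical, and the baseline $\bar{R}_t$ is assumed to be a running average of rewards:

```python
import numpy as np

class GradientBanditAgent:
    """Hypothetical sketch of the gradient bandit algorithm."""

    def __init__(self, k, alpha=0.1):
        self.k = k
        self.alpha = alpha        # step size for preference updates
        self.h = np.zeros(k)      # numerical preferences H_t(a)
        self.avg_reward = 0.0     # baseline R-bar, a running average of rewards
        self.t = 0

    def policy(self):
        # Soft-max over preferences: pi(a) = exp(H(a)) / sum_b exp(H(b)).
        e = np.exp(self.h - self.h.max())   # subtract max for numerical stability
        return e / e.sum()

    def select_action(self):
        return int(np.random.choice(self.k, p=self.policy()))

    def update(self, action, reward):
        self.t += 1
        self.avg_reward += (reward - self.avg_reward) / self.t
        pi = self.policy()
        step = self.alpha * (reward - self.avg_reward)
        # H(a) -= step * pi(a) for every a, then the extra += step for A_t
        # yields H(A_t) += step * (1 - pi(A_t)), matching both equations above.
        self.h -= step * pi
        self.h[action] += step
```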
The implementation of the above algorithms, along with more detailed rough notes, can be found [here](https://github.com/ps4vs/Deep-RL/tree/main/Chapter-2).
### Bandits vs Contextual Bandits vs Full RL Problems
- Associative search tasks (contextual bandits) are intermediate between the k-armed bandit problem and the full reinforcement learning problem.
- They are like the full reinforcement learning problem in that they involve learning a policy, but they are also like the k-armed bandit problem in that each action affects only the immediate reward.
- If actions are allowed to affect the next situation as well as the reward, then we have the full reinforcement learning problem.