
Commit

Quartz sync: Dec 26, 2023, 3:51 PM
ps4vs committed Dec 26, 2023
1 parent adb2bfc commit 42c097e
Showing 4 changed files with 59 additions and 12 deletions.
8 changes: 2 additions & 6 deletions content/notes/RL Alpha.md
@@ -1,7 +1,7 @@
---
title: "RL: Alpha"
tags:
- seed
- sapling
enableToc: false
---
### Introduction
@@ -35,8 +35,4 @@ Key insights include:
3. **Challenge with Random Play**: Learning efficiency drops significantly against a randomly moving opponent due to limited informative feedback.
4. **Learning Through Self-Play**: When the RL agent plays Tic-Tac-Toe against itself, it experiences a unique learning environment. Since both players (first and second) are the same agent, they update value functions of different sets of states based on their position in the game.
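A minimal sketch of the value update behind this self-play scheme (illustrative Python, not taken from the linked repository; the state encoding, `alpha`, and the default value of 0.5 are assumptions):

```python
def td_update(values, prev_state, next_state, alpha=0.1, default=0.5):
    """Nudge V(prev_state) toward V(next_state): V(s) <- V(s) + alpha * (V(s') - V(s)).

    `values` maps board states to estimated win probabilities; unseen states
    start at the assumed default of 0.5. In self-play, each of the two "players"
    applies this update to the states it visited from its own side of the game.
    """
    v_prev = values.get(prev_state, default)
    v_next = values.get(next_state, default)
    values[prev_state] = v_prev + alpha * (v_next - v_prev)
```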

More detailed/rough notes can be found [here](https://github.com/ps4vs/Deep-RL/tree/main/Chapter-1).
48 changes: 48 additions & 0 deletions content/notes/RL Beta.md
@@ -0,0 +1,48 @@
---
title: "RL: Alpha"
tags:
- sapling
enableToc: false
---
### Introduction
- The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that **evaluates** the actions taken rather than **instructs** by giving correct actions. This is what creates the need for active exploration, for an explicit search for good behavior.
- Why is exploration needed? Because the feedback is purely evaluative, exploitation alone cannot identify the optimal policy.

The two main aspects of RL are:
- **evaluative feedback**, i.e., how good the action taken was, but not whether it was the best or the worst action possible.
- **the associative aspect**, i.e., the best action depends on the situation. The k-armed bandit problem setting helps in understanding the non-associative, purely evaluative aspect of RL.


### K-armed Bandit Problem
- "Learning Action-value estimates", ie, Action-value Methods
- $\epsilon$-greedy
- Incremental update to estimate the value associated with each action
- The $\epsilon$ is the exploration probability
- $\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize} \cdot [\text{Target} - \text{OldEstimate}]$
- A **decreasing step size** (e.g., the sample-average $1/n$) allows the estimates to converge for a **stationary reward distribution**, whereas a **constant step size** is preferred for a **non-stationary reward distribution** because it weights recent rewards more heavily. (Minimal sketches of these methods appear after this list.)
- Optimistic Initial Values
- For a stationary reward distribution, optimistic initial values encourage early exploration and faster convergence.
- For a non-stationary reward distribution they have little effect, since the mean of the targets keeps changing.
- Upper-Confidence-Bound Action Selection
- Allows exploration with a preference among actions, i.e., based on how close their estimates are to being maximal and on the uncertainty in those estimates.
$$A_t = \arg\max_a \left[ Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}} \right]$$
- If $N_t(a)=0$, then $a$ is considered a maximizing action and is selected first.
- "Learning a Numerical Preference" is an alternative to methods that estimate action values.
- Gradient Bandit Algorithm
- The numerical preference has no interpretation in terms of reward, in contrast with estimated action values (which estimate the mean reward of each action).
$$\Pr(A_t=a) = \frac{e^{H_t(a)}}{\sum_b e^{H_t(b)}} = \pi_t(a)$$
- Here $\pi_t(a)$ is the probability of taking action $a$ at time $t$.
- Learning algorithm for soft-max action preferences based on the idea of stochastic gradient ascent.
- $H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \bar{R}_t)(1 - \pi_t(A_t))$
- $H_{t+1}(a) = H_t(a) - \alpha (R_t - \bar{R}_t)\,\pi_t(a) \qquad \forall\, a \neq A_t$
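Minimal Python sketches of these methods follow (illustrative code under assumed class names and parameters, not the linked repository's implementation). First, an $\epsilon$-greedy agent with the incremental update above; `step_size=None` falls back to sample averages (a decreasing step size, suited to stationary rewards), a constant `step_size` suits non-stationary rewards, and a large `initial_value` reproduces the optimistic-initial-values trick.

```python
import random

class EpsilonGreedyAgent:
    """Tabular k-armed bandit agent with incrementally updated action-value estimates."""

    def __init__(self, n_arms, epsilon=0.1, step_size=None, initial_value=0.0):
        self.epsilon = epsilon
        self.step_size = step_size            # None -> sample-average (1/n) step size
        self.q = [initial_value] * n_arms     # action-value estimates Q_t(a)
        self.n = [0] * n_arms                 # action counts N_t(a)

    def select_action(self):
        if random.random() < self.epsilon:    # explore with probability epsilon
            return random.randrange(len(self.q))
        best = max(self.q)                    # otherwise exploit, breaking ties randomly
        return random.choice([a for a, v in enumerate(self.q) if v == best])

    def update(self, action, reward):
        self.n[action] += 1
        alpha = self.step_size if self.step_size is not None else 1.0 / self.n[action]
        # NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
        self.q[action] += alpha * (reward - self.q[action])
```

A greedy agent with optimistic initialization is then just `EpsilonGreedyAgent(n_arms=10, epsilon=0.0, initial_value=5.0)`.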
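A corresponding sketch of Upper-Confidence-Bound action selection, matching the formula above; the function name and the default `c` are assumptions.

```python
import math

def ucb_action(q, n, t, c=2.0):
    """Pick argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]; untried arms are selected first."""
    for a, count in enumerate(n):
        if count == 0:                        # N_t(a) == 0: treat a as a maximizing action
            return a
    scores = [q[a] + c * math.sqrt(math.log(t) / n[a]) for a in range(len(q))]
    return max(range(len(q)), key=lambda a: scores[a])
```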
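And a sketch of the gradient bandit algorithm, keeping soft-max preferences $H_t(a)$ and using the incrementally tracked average reward as the baseline $\bar{R}_t$ (again illustrative, with assumed names):

```python
import math
import random

class GradientBanditAgent:
    """Learns numerical preferences H_t(a); actions are drawn from a soft-max over them."""

    def __init__(self, n_arms, alpha=0.1):
        self.alpha = alpha          # step size for the preference updates
        self.h = [0.0] * n_arms     # preferences H_t(a), no reward interpretation
        self.avg_reward = 0.0       # baseline \bar{R}_t, tracked incrementally
        self.t = 0

    def policy(self):
        m = max(self.h)             # subtract the max for numerical stability
        exp_h = [math.exp(h - m) for h in self.h]
        total = sum(exp_h)
        return [e / total for e in exp_h]          # pi_t(a)

    def select_action(self):
        return random.choices(range(len(self.h)), weights=self.policy())[0]

    def update(self, action, reward):
        self.t += 1
        self.avg_reward += (reward - self.avg_reward) / self.t   # incremental mean reward
        pi = self.policy()
        for a in range(len(self.h)):
            if a == action:
                self.h[a] += self.alpha * (reward - self.avg_reward) * (1 - pi[a])
            else:
                self.h[a] -= self.alpha * (reward - self.avg_reward) * pi[a]
```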


Implementations of the above algorithms, along with more detailed/rough notes, can be found [here](https://github.com/ps4vs/Deep-RL/tree/main/Chapter-2).

### Bandits vs Contextual Bandits vs Full RL problems

- Associative search tasks (contextual bandits) are intermediate between the k-armed bandit problem and the full reinforcement learning problem.
- They are like the full reinforcement learning problem in that they involve learning a policy, but they are also like the k-armed bandit problem in that each action affects only the immediate reward.
- If actions are allowed to affect the next situation as well as the reward, then we have the full reinforcement learning problem.
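As an illustration of the intermediate case, one simple (assumed, not from the book or the linked repository) way to handle a contextual bandit is to keep a separate table of action-value estimates per observed situation, while still treating every reward as immediate:

```python
import random
from collections import defaultdict

class ContextualEpsilonGreedy:
    """Epsilon-greedy bandit with one value table per context (associative search)."""

    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.q = defaultdict(lambda: [0.0] * n_arms)   # context -> Q(a)
        self.n = defaultdict(lambda: [0] * n_arms)     # context -> N(a)

    def select_action(self, context):
        if random.random() < self.epsilon:
            return random.randrange(self.n_arms)
        q = self.q[context]
        best = max(q)
        return random.choice([a for a, v in enumerate(q) if v == best])

    def update(self, context, action, reward):
        self.n[context][action] += 1
        q = self.q[context]
        q[action] += (reward - q[action]) / self.n[context][action]   # sample-average update
```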

5 changes: 3 additions & 2 deletions content/notes/Reinforcement Learning.md
@@ -4,11 +4,12 @@ tags:
- seed
enableToc: false
---
These are some notes/projects I made related to Reinforcement Learning
My notes/code/projects related to Reinforcement Learning
## Blogs
- [[RL Alpha |Reinforcement Learning: Alpha]]
- [[RL Beta |Reinforcement Learning: Beta]]

## Resources:
## Resources
- [Deep RL Hugging Face](https://huggingface.co/learn/deep-rl-course)
- [Reinforcement Learning: An Introduction, Richard Sutton and Andrew G](http://incompleteideas.net/book/RLbook2020.pdf)
- [Foundations of Deep RL Series, by Pieter Abbeel](https://youtu.be/Psrhxy88zww)
10 changes: 6 additions & 4 deletions content/notes/hitlist.md
@@ -4,12 +4,14 @@ tags:
- seed
---
![robotics](https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExdXVtbmV4ajVyYzBsNjNybmZ3M21lcTh0bTB0MHdnZmVibWx3eW15ZyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/TJyHZPUF4jNZRpbWkK/source.gif)
### Reading

### **Some things I love to do!!!**
###### Reading
* [https://distill.pub/](https://distill.pub/): in-depth explanations of many areas, though not updated since 2021.
#### Robotics
###### Robotics
* Create an autonomous drone from scratch using MIT's [Liquid Neural Networks code](https://github.com/makramchahine/drone_causality) repository and resources from https://hackaday.io/.


###### Research
- Benchmarking recent sequential models, including Mamba, Transformer, ResNet, RetNet, RNN, Liquid Neural Networks, and Neural ODEs, in ["Scalable-L2O"](https://github.com/VITA-Group/Scalable-L2O).


