diff --git a/content/notes/RL Alpha.md b/content/notes/RL Alpha.md
index 5b9436f..1e5a6b7 100644
--- a/content/notes/RL Alpha.md
+++ b/content/notes/RL Alpha.md
@@ -1,7 +1,7 @@
 ---
 title: "RL: Alpha"
 tags:
-  - seed
+  - sapling
 enableToc: false
 ---
 ### Introduction
@@ -35,8 +35,4 @@ Key insights include:
 3. **Challenge with Random Play**: Learning efficiency drops significantly against a randomly moving opponent due to limited informative feedback.
 4. **Learning Through Self-Play**: When the RL agent plays Tic-Tac-Toe against itself, it experiences a unique learning environment. Since both players (first and second) are the same agent, they update value functions of different sets of states based on their position in the game.
-
-
-
-
-
+More detailed/rough notes can be found [here](https://github.com/ps4vs/Deep-RL/tree/main/Chapter-1).
\ No newline at end of file
diff --git a/content/notes/RL Beta.md b/content/notes/RL Beta.md
new file mode 100644
index 0000000..ef6f550
--- /dev/null
+++ b/content/notes/RL Beta.md
@@ -0,0 +1,48 @@
+---
+title: "RL: Beta"
+tags:
+  - sapling
+enableToc: false
+---
+### Introduction
+- The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that **evaluates** the actions taken rather than **instructs** by giving correct actions. This is what creates the need for active exploration, for an explicit search for good behavior.
+- Why is exploration needed?
+  - Exploitation alone cannot find the optimal policy, because evaluative feedback only tells us how good the chosen action was, not which action is best.
+
+The two main aspects of RL are:
+- **evaluative feedback**, i.e., how good the action taken was, but not whether it was the best or worst action possible;
+- **associativity**, i.e., the best action depends on the situation. The k-armed bandit setting helps in understanding the non-associative, evaluative-feedback aspect of RL.
+
+### K-armed Bandit Problem
+- "Learning action-value estimates", i.e., action-value methods
+  - $\epsilon$-greedy
+    - Incrementally updates the value estimate associated with each action.
+    - $\epsilon$ is the exploration probability.
+    - $NewEstimate \leftarrow OldEstimate + StepSize \cdot [Target - OldEstimate]$
+    - A **decreasing step size** (for stationary reward distributions) allows the estimates to converge, whereas a **constant step size** is used for non-stationary reward distributions.
+  - Optimistic initial values
+    - For stationary reward distributions, optimistic initial values encourage early exploration and faster convergence.
+    - For non-stationary reward distributions, they have little effect, since the mean of the targets keeps changing.
+  - Upper-Confidence-Bound (UCB) action selection
+    - Allows exploration while preferring actions based on how close their estimates are to being maximal and on the uncertainty in those estimates.
+    $$A_t = \operatorname{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$
+    - If $N_t(a) = 0$, then $a$ is selected first.
+- "Learning a numerical preference" is an alternative to methods that estimate action values.
+  - Gradient bandit algorithm
+    - The numerical preference has no interpretation in terms of reward, in contrast to estimated action values (which estimate the mean reward of each action).
+    $$\Pr(A_t = a) = \frac{e^{H_t(a)}}{\sum_b e^{H_t(b)}} = \pi_t(a)$$
+    - Here $\pi_t(a)$ is the probability of taking action $a$ at time $t$.
+    - The learning algorithm for the soft-max action preferences is based on stochastic gradient ascent (see the sketch after this list):
+      - $H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \bar{R}_t)(1 - \pi_t(A_t))$
+      - $H_{t+1}(a) = H_t(a) - \alpha (R_t - \bar{R}_t)\, \pi_t(a) \qquad \forall\, a \neq A_t$
+
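+A minimal, self-contained sketch of two of the methods above: the $\epsilon$-greedy action-value method with its incremental update, and the gradient bandit algorithm, both on a stationary Gaussian testbed. This is an illustration rather than the linked implementation; the 10-armed testbed, the unit-variance reward noise, and the parameter values ($\epsilon = 0.1$, $\alpha = 0.1$) are assumptions made for the example.
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+def epsilon_greedy_bandit(k=10, steps=1000, epsilon=0.1):
+    """Sample-average epsilon-greedy on a stationary k-armed Gaussian testbed."""
+    q_true = rng.normal(0.0, 1.0, k)   # unknown true action values (assumed for the demo)
+    Q = np.zeros(k)                    # action-value estimates
+    N = np.zeros(k)                    # times each action has been selected
+    for _ in range(steps):
+        # Explore with probability epsilon, otherwise exploit the current estimates.
+        a = rng.integers(k) if rng.random() < epsilon else int(np.argmax(Q))
+        r = rng.normal(q_true[a], 1.0)  # noisy reward for the chosen arm
+        N[a] += 1
+        # NewEstimate <- OldEstimate + StepSize * [Target - OldEstimate],
+        # with the decreasing step size 1/N(a) suited to a stationary problem.
+        Q[a] += (r - Q[a]) / N[a]
+    return Q
+
+def gradient_bandit(k=10, steps=1000, alpha=0.1):
+    """Soft-max over numerical preferences H, updated by stochastic gradient ascent."""
+    q_true = rng.normal(0.0, 1.0, k)
+    H = np.zeros(k)                    # preferences; no reward interpretation
+    baseline = 0.0                     # incremental average reward, the baseline term
+    for t in range(1, steps + 1):
+        pi = np.exp(H - H.max())
+        pi /= pi.sum()                 # soft-max action probabilities pi_t(a)
+        a = rng.choice(k, p=pi)
+        r = rng.normal(q_true[a], 1.0)
+        baseline += (r - baseline) / t
+        # Every preference moves down by alpha * (R_t - baseline) * pi_t(a) ...
+        H -= alpha * (r - baseline) * pi
+        # ... and the taken action gets an extra +alpha * (R_t - baseline),
+        # so its net change is +alpha * (R_t - baseline) * (1 - pi_t(A_t)).
+        H[a] += alpha * (r - baseline)
+    return H
+
+print("epsilon-greedy estimates:", np.round(epsilon_greedy_bandit(), 2))
+print("gradient-bandit preferences:", np.round(gradient_bandit(), 2))
+```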
+
+Implementations of the above algorithms, along with more detailed/rough notes, can be found [here](https://github.com/ps4vs/Deep-RL/tree/main/Chapter-2).
+
+### Bandits vs Contextual Bandits vs Full RL Problems
+
+- Associative search tasks (contextual bandits) are intermediate between the k-armed bandit problem and the full reinforcement learning problem.
+- They are like the full reinforcement learning problem in that they involve learning a policy, but they are also like the k-armed bandit problem in that each action affects only the immediate reward.
+- If actions are allowed to affect the next situation as well as the reward, then we have the full reinforcement learning problem.
+
diff --git a/content/notes/Reinforcement Learning.md b/content/notes/Reinforcement Learning.md
index 90f5e29..73090ec 100644
--- a/content/notes/Reinforcement Learning.md
+++ b/content/notes/Reinforcement Learning.md
@@ -4,11 +4,12 @@ tags:
   - seed
 enableToc: false
 ---
-These are some notes/projects I made related to Reinforcement Learning
+My notes/code/projects related to Reinforcement Learning
 ## Blogs
 - [[RL Alpha |Reinforcement Learning: Alpha]]
+- [[RL Beta |Reinforcement Learning: Beta]]
-## Resources:
+## Resources
 - [Deep RL Hugging Face](https://huggingface.co/learn/deep-rl-course)
 - [Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto](http://incompleteideas.net/book/RLbook2020.pdf)
 - [Foundations of Deep RL Series, by Pieter Abbeel](https://youtu.be/Psrhxy88zww)
diff --git a/content/notes/hitlist.md b/content/notes/hitlist.md
index 5e1d1c6..f6f6a41 100644
--- a/content/notes/hitlist.md
+++ b/content/notes/hitlist.md
@@ -4,12 +4,14 @@ tags:
   - seed
 ---
 ![robotics](https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExdXVtbmV4ajVyYzBsNjNybmZ3M21lcTh0bTB0MHdnZmVibWx3eW15ZyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/TJyHZPUF4jNZRpbWkK/source.gif)
-### Reading
+
+### **Some things I'd love to do!**
+###### Reading
 * [https://distill.pub/](https://distill.pub/) In-depth explanations of multiple areas, though not updated since 2021.
-#### Robotics
+###### Robotics
 * Create an automatic drone from scratch using MIT's [Liquid Neural Networks code](https://github.com/makramchahine/drone_causality) repository and resources from https://hackaday.io/.
-
-
+###### Research
+- Benchmarking recent sequential models, including Mamba, Transformer, ResNet, RetNet, RNN, Liquid Neural Networks, and Neural ODEs, in ["Scalable-L2O"](https://github.com/VITA-Group/Scalable-L2O)