
Quartz sync: Dec 26, 2023, 3:57 PM
ps4vs committed Dec 26, 2023
1 parent bf7afa4 commit e68df0c
Showing 1 changed file with 7 additions and 7 deletions.
14 changes: 7 additions & 7 deletions content/notes/RL Beta.md
* Why is exploration needed?
	- We won't be able to find the optimal policy through exploitation alone, since we receive only evaluative feedback.

* The two main aspects of RL are
* **evaluative feedback**, i.e., how good the action taken was, but not whether it was the best or the worst action possible.
- **associative property**, i.e., the best action depends on the situation. The K-armed Bandit problem setting helps in understanding the non-associative, evaluative-feedback aspect of RL.


### K-armed Bandit Problem
- "Learning Action-value estimates", ie, Action-value Methods
- $\epsilon$-greedy
- Incremental update to estimate the value associated with each action
- Here $\epsilon$ is the exploration probability.
- $$NewEstimate \leftarrow OldEstimate + StepSize \cdot [Target - OldEstimate]$$
- Using a decreasing StepSize (e.g., the sample-average $1/n$) allows the estimates to converge for a stationary reward distribution, whereas a constant StepSize is used for a non-stationary reward distribution because it weights recent rewards more heavily; a minimal sketch of this incremental update appears after this list.
- Optimistic Initial Values
- For a stationary reward distribution, setting optimistic initial values encourages early exploration and faster convergence.
- For a non-stationary reward distribution it has little effect, since the mean of the targets keeps changing and the benefit of the initial values is only temporary.
- Upper-Confidence-Bound Action Selection
- Allows exploration while preferring actions based on how close their estimates are to being maximal and on the uncertainty in those estimates (a UCB selection sketch appears after this list).
- $$A_t = \operatorname{argmax}_a \left[\, Q_t(a) + c \sqrt{\ln t / N_t(a)} \,\right]$$
- If $N_t(a)=0$, then $a$ is considered a maximizing action and is selected first.
- "Learning a Numerical Preference" is an alternative to methods that estimate action values.
- Gradient Bandit Algorithm
- The numerical preference has no interpretation in terms of reward, in contrast with estimated action values (i.e., the estimated mean reward of each action).
- $$\Pr(A_t=a) = \frac{e^{H_t(a)}}{\sum_b e^{H_t(b)}} = \pi_t(a)$$
- Here $\pi_t(a)$ is the probability of taking action $a$ at time $t$.
- A learning algorithm for the soft-max action preferences, based on the idea of stochastic gradient ascent (a sketch appears after this list).
- $$H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \bar{R}_t)(1 - \pi_t(A_t))$$
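
Below is a minimal sketch of the $\epsilon$-greedy action-value method with the incremental update from this section, assuming NumPy; the class name `EpsilonGreedyBandit` and its parameters are illustrative, not part of the note. A high `initial_value` gives the optimistic-initial-values variant, and a constant `step_size` gives the non-stationary variant.

```python
import numpy as np

class EpsilonGreedyBandit:
    """Illustrative epsilon-greedy agent with incremental action-value estimates."""

    def __init__(self, k, epsilon=0.1, initial_value=0.0, step_size=None):
        self.epsilon = epsilon                            # exploration probability
        self.step_size = step_size                        # None -> sample-average 1/N; float -> constant
        self.q = np.full(k, initial_value, dtype=float)   # value estimates Q_t(a)
        self.n = np.zeros(k, dtype=int)                   # action counts N_t(a)

    def select_action(self):
        if np.random.rand() < self.epsilon:
            return int(np.random.randint(len(self.q)))    # explore: random action
        return int(np.argmax(self.q))                     # exploit: greedy action

    def update(self, action, reward):
        self.n[action] += 1
        alpha = self.step_size if self.step_size is not None else 1.0 / self.n[action]
        # NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
        self.q[action] += alpha * (reward - self.q[action])
```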
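A sketch of Upper-Confidence-Bound action selection under the same assumptions (NumPy arrays `q` of estimates and `n` of counts; the helper name `ucb_action` is illustrative):

```python
import numpy as np

def ucb_action(q, n, t, c=2.0):
    """Pick A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ].

    q: value estimates Q_t(a); n: action counts N_t(a); t: time step (t >= 1).
    """
    untried = np.where(n == 0)[0]
    if untried.size > 0:
        return int(untried[0])        # N_t(a) = 0: treat a as maximizing and select it first
    ucb = q + c * np.sqrt(np.log(t) / n)
    return int(np.argmax(ucb))
```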

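A sketch of the gradient bandit update under the same assumptions; `softmax_policy` and `gradient_bandit_update` are illustrative names. The decrement $H_{t+1}(a) = H_t(a) - \alpha (R_t - \bar{R}_t)\pi_t(a)$ for the non-selected actions $a \neq A_t$ is the standard companion update, included here even though it falls outside the visible part of the note.

```python
import numpy as np

def softmax_policy(h):
    """pi_t(a) = exp(H_t(a)) / sum_b exp(H_t(b))."""
    z = np.exp(h - np.max(h))         # subtract max for numerical stability
    return z / z.sum()

def gradient_bandit_update(h, action, reward, baseline, alpha=0.1):
    """One stochastic-gradient-ascent step on the preferences H_t.

    Selected action:      H <- H + alpha * (R_t - Rbar_t) * (1 - pi_t(A_t))
    Non-selected actions: H <- H - alpha * (R_t - Rbar_t) * pi_t(a)
    """
    pi = softmax_policy(h)
    h = np.array(h, dtype=float)              # work on a float copy of the preferences
    h -= alpha * (reward - baseline) * pi     # applies the non-selected-action decrement to every action
    h[action] += alpha * (reward - baseline)  # extra term for A_t, giving the (1 - pi_t(A_t)) factor overall
    return h
```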