
Quartz sync: Dec 26, 2023, 3:57 PM
ps4vs committed Dec 26, 2023
1 parent bf7afa4 commit e68df0c
Showing 1 changed file with 7 additions and 7 deletions.
14 changes: 7 additions & 7 deletions content/notes/RL Beta.md
* Why is exploration needed?
	- We won't be able to find the optimal policy through exploitation alone, since we receive only evaluative feedback.

* The two main aspects of RL are
* **evaluative feedback**, i.e., how good the action taken was, but not whether it was the best or the worst action possible.
- **associative property**, i.e., the best action depends on the situation. The K-armed Bandit problem setting helps in understanding the non-associative, evaluative-feedback aspect of RL.


### K-armed Bandit Problem
- "Learning Action-value estimates", ie, Action-value Methods
- $\epsilon$-greedy
- Incremental update to estimate the value associated with each action
- Here $\epsilon$ is the exploration probability.
- $$NewEstimate \leftarrow OldEstimate + StepSize \cdot [Target - OldEstimate]$$
- Using a decreasing StepSize (e.g., the sample-average $1/n$) allows the estimates to converge for a stationary reward distribution, whereas a constant StepSize is used for a non-stationary reward distribution because it weights recent rewards more heavily; a minimal sketch of this incremental update appears after this list.
- Optimistic Initial Values
- For a stationary reward distribution, setting optimistic initial values encourages early exploration and faster convergence.
- For a non-stationary reward distribution it has little effect, since the mean of the targets keeps changing and the benefit of the initial values is only temporary.
- Upper-Confidence-Bound Action Selection
- Allows exploration while preferring actions based on how close their estimates are to being maximal and on the uncertainty in those estimates (a UCB selection sketch appears after this list).
- $$A_t = \operatorname{argmax}_a \left[\, Q_t(a) + c \sqrt{\ln t / N_t(a)} \,\right]$$
- If $N_t(a)=0$, then $a$ is considered a maximizing action and is selected first.
- "Learning a Numerical Preference" is an alternative to methods that estimate action values.
- Gradient Bandit Algorithm
- The numerical preference has no interpretation in terms of reward, in contrast with estimated action values (i.e., the estimated mean reward of each action).
- $$\Pr(A_t=a) = \frac{e^{H_t(a)}}{\sum_b e^{H_t(b)}} = \pi_t(a)$$
- Here $\pi_t(a)$ is the probability of taking action $a$ at time $t$.
- A learning algorithm for the soft-max action preferences, based on the idea of stochastic gradient ascent (a sketch appears after this list).
- $$H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \bar{R}_t)(1 - \pi_t(A_t))$$
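
Below is a minimal sketch of the $\epsilon$-greedy action-value method with the incremental update from this section, assuming NumPy; the class name `EpsilonGreedyBandit` and its parameters are illustrative, not part of the note. A high `initial_value` gives the optimistic-initial-values variant, and a constant `step_size` gives the non-stationary variant.

```python
import numpy as np

class EpsilonGreedyBandit:
    """Illustrative epsilon-greedy agent with incremental action-value estimates."""

    def __init__(self, k, epsilon=0.1, initial_value=0.0, step_size=None):
        self.epsilon = epsilon                            # exploration probability
        self.step_size = step_size                        # None -> sample-average 1/N; float -> constant
        self.q = np.full(k, initial_value, dtype=float)   # value estimates Q_t(a)
        self.n = np.zeros(k, dtype=int)                   # action counts N_t(a)

    def select_action(self):
        if np.random.rand() < self.epsilon:
            return int(np.random.randint(len(self.q)))    # explore: random action
        return int(np.argmax(self.q))                     # exploit: greedy action

    def update(self, action, reward):
        self.n[action] += 1
        alpha = self.step_size if self.step_size is not None else 1.0 / self.n[action]
        # NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
        self.q[action] += alpha * (reward - self.q[action])
```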
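A sketch of Upper-Confidence-Bound action selection under the same assumptions (NumPy arrays `q` of estimates and `n` of counts; the helper name `ucb_action` is illustrative):

```python
import numpy as np

def ucb_action(q, n, t, c=2.0):
    """Pick A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ].

    q: value estimates Q_t(a); n: action counts N_t(a); t: time step (t >= 1).
    """
    untried = np.where(n == 0)[0]
    if untried.size > 0:
        return int(untried[0])        # N_t(a) = 0: treat a as maximizing and select it first
    ucb = q + c * np.sqrt(np.log(t) / n)
    return int(np.argmax(ucb))
```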

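A sketch of the gradient bandit update under the same assumptions; `softmax_policy` and `gradient_bandit_update` are illustrative names. The decrement $H_{t+1}(a) = H_t(a) - \alpha (R_t - \bar{R}_t)\pi_t(a)$ for the non-selected actions $a \neq A_t$ is the standard companion update, included here even though it falls outside the visible part of the note.

```python
import numpy as np

def softmax_policy(h):
    """pi_t(a) = exp(H_t(a)) / sum_b exp(H_t(b))."""
    z = np.exp(h - np.max(h))         # subtract max for numerical stability
    return z / z.sum()

def gradient_bandit_update(h, action, reward, baseline, alpha=0.1):
    """One stochastic-gradient-ascent step on the preferences H_t.

    Selected action:      H <- H + alpha * (R_t - Rbar_t) * (1 - pi_t(A_t))
    Non-selected actions: H <- H - alpha * (R_t - Rbar_t) * pi_t(a)
    """
    pi = softmax_policy(h)
    h = np.array(h, dtype=float)              # work on a float copy of the preferences
    h -= alpha * (reward - baseline) * pi     # applies the non-selected-action decrement to every action
    h[action] += alpha * (reward - baseline)  # extra term for A_t, giving the (1 - pi_t(A_t)) factor overall
    return h
```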