Commit
Showing 11 changed files with 168 additions and 23 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
--- | ||
title: Actor-Critic Algorithm | ||
date: 2023-10-07T14:57:55+08:00 | ||
draft: true | ||
categories: | ||
- Reinforcement Learning | ||
tags: | ||
--- | ||
## Policy | ||
### Adding a Baseline | ||
|
||
![image.png](https://cdn.statically.io/gh/SivanLaai/image-store-rep@master/note/20231008152433.png) | ||
|
||
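The expression above gives the update direction for gradient ascent, which increases the expected total reward (the objective is maximized, rather than a loss being minimized). | ||
In the ideal case, all three actions a, b, and c are sampled; c has the largest reward, a the next largest, and b the smallest. Because every action is sampled, the adjusted probability distribution looks like the one in the figure: although a starts with a low selection probability, its probability increases after one round of sampling. | ||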
|
||
![image.png](https://cdn.statically.io/gh/SivanLaai/image-store-rep@master/note/20231008160027.png) | ||
|
||
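Now suppose that a is never sampled. When all rewards are positive, the probabilities of b and c are both pushed up by the update, and because the probabilities must sum to 1, the probability of choosing a goes down. | ||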
|
||
How can we keep the probability of an unsampled action from dropping? | ||
|
||
Measure R against a baseline $b$, as in the following expression: | ||
![image.png](https://cdn.statically.io/gh/SivanLaai/image-store-rep@master/note/20231008160525.png) | ||
|
||
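With the baseline, the weighting term can be either positive or negative. The value of b has to be chosen by hand. | ||
|
||
The benefit is that actions whose reward is above b become more likely, while actions whose reward is below b become less likely. To some extent, this keeps the probability of unsampled actions from shrinking. | ||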
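The effect of the baseline can be checked numerically. The sketch below is a minimal illustration, not code from the original notes: it assumes a softmax policy over the three actions a, b, c, hypothetical reward values, and a hypothetical helper `policy_gradient_step`. Only b and c are sampled, and the probability of a after one update is compared with and without a baseline (here the mean of the sampled rewards, which is one common choice).

```python
import math

def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, samples, rewards, baseline=0.0, lr=0.1):
    """One gradient-ascent step on the mean of (R - b) * log pi(action)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for action in samples:
        weight = rewards[action] - baseline
        for k in range(len(logits)):
            # d log softmax(action) / d logit_k = 1[k == action] - probs[k]
            indicator = 1.0 if k == action else 0.0
            grad[k] += weight * (indicator - probs[k])
    n = len(samples)
    return [l + lr * g / n for l, g in zip(logits, grad)]

# Hypothetical rewards for a, b, c: c is largest, a next, b smallest.
rewards = [2.0, 1.0, 3.0]
logits = [0.0, 0.0, 0.0]      # start from a uniform policy
samples = [1, 2]              # only b and c are sampled; a never is

p_before = softmax(logits)[0]
p_no_baseline = softmax(policy_gradient_step(logits, samples, rewards))[0]

mean_reward = sum(rewards[s] for s in samples) / len(samples)
p_with_baseline = softmax(
    policy_gradient_step(logits, samples, rewards, baseline=mean_reward)
)[0]

print(p_before, p_no_baseline, p_with_baseline)
```

In this setup the probability of a drops noticeably without the baseline and stays almost unchanged with it, because b is now pushed down while c is pushed up.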
|
||
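## Critic | ||
|
||
The critic does not decide which action to take; instead, it scores the actor's actions. | ||
|
||
It learns a state value function: given a state s, it estimates the cumulative reward expected from that state onward. |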
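As a rough illustration of what such a state value function represents, the sketch below estimates $V(s)$ by averaging the discounted cumulative reward observed after each visit to $s$ (an every-visit Monte-Carlo estimate). The function name `estimate_state_values`, the discount `gamma`, and the toy episodes are assumptions made for this example; a practical critic would usually be a learned function approximator instead of a lookup table.

```python
from collections import defaultdict

def estimate_state_values(episodes, gamma=0.9):
    """Every-visit Monte-Carlo estimate of V(s).

    Each episode is a list of (state, reward) pairs, where reward is the
    reward received after leaving that state.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        ret = 0.0
        # Walk backwards so the return can be accumulated incrementally.
        for state, reward in reversed(episode):
            ret = reward + gamma * ret
            totals[state] += ret
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Two hypothetical episodes that both pass through s1 and then s2.
episodes = [
    [("s1", 0.0), ("s2", 1.0)],
    [("s1", 0.0), ("s2", 3.0)],
]
values = estimate_state_values(episodes, gamma=0.5)
print(values)
```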
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
--- | ||
title: Q-Learning Algorithm | ||
date: 2023-09-28T11:52:17+08:00 | ||
draft: false | ||
categories: | ||
- Reinforcement Learning | ||
tags: | ||
--- | ||
![image.png](https://cdn.statically.io/gh/SivanLaai/image-store-rep@master/note/20230928170535.png) | ||
|
||
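States $s_1,s_2$; actions $a_1,a_2$. | ||
Target ("reality") for the current step: $$Q(s_1,a_2)=r+\gamma \max_{a'}Q(s',a')$$ where $r$ is the reward received for taking action $a_2$ and reaching state $s'$. The target therefore contains both the immediate reward $r$ and the estimated value of the best next action $a'$. | ||
Estimate for the current step: $$Q(s_1,a_2)$$ | ||
The error between the target and the estimate for the current state is: $$r+\gamma \max_{a'}Q(s',a')-Q(s_1,a_2)$$ | ||
Here $\gamma$ is the discount factor applied to future rewards. The estimate of $Q(s_1,a_2)$ is then updated as: $$Q(s_1,a_2) \leftarrow Q(s_1,a_2)+\alpha \left(r+\gamma \max_{a'}Q(s',a')-Q(s_1,a_2)\right)$$ | ||
where $\alpha$ is the learning rate applied to the error. | ||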
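The update rule above can be sketched as a small tabular routine. This is a minimal illustration under assumptions: the function name `q_learning_update`, the dictionary-based Q-table keyed by `(state, action)`, and the numeric values are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import defaultdict

def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the error."""
    # "Reality": immediate reward plus discounted value of the best next action.
    target = reward + gamma * max(Q[(next_state, a)] for a in actions)
    # Error between the target and the current estimate.
    error = target - Q[(state, action)]
    # Move the estimate a fraction alpha of the way toward the target.
    Q[(state, action)] += alpha * error
    return error

Q = defaultdict(float)            # unseen entries default to 0.0
actions = ["a1", "a2"]
Q[("s2", "a1")] = 1.0
Q[("s2", "a2")] = 2.0

error = q_learning_update(Q, "s1", "a2", reward=1.0, next_state="s2",
                          actions=actions, alpha=0.5, gamma=0.9)
print(error, Q[("s1", "a2")])
```

With these numbers the target is $1 + 0.9 \times 2 = 2.8$, so the estimate for $Q(s_1,a_2)$ moves from 0 to 1.4.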
|
||
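Writing $Q(S_1)$ as shorthand for the actual return obtained starting from state $S_1$, we get: $$Q(S_1)=r_2+\gamma Q(S_2)=r_2+\gamma(r_3+\gamma Q(S_3))$$ | ||
Expanding further gives: $$Q(S_1)=r_2+\gamma r_3+\gamma^2 r_4+\gamma^3 r_5+\dots$$ so how much the distant steps contribute to learning depends on $\gamma$. With $\gamma=1$ every future reward is seen in full; with $\gamma=0$ no future reward is seen and only the immediate reward $r_2$ remains; normally $\gamma$ lies between 0 and 1. |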
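To make the role of $\gamma$ concrete, the following sketch evaluates the expanded sum for a short, hypothetical reward sequence. The helper name `discounted_return` and the reward values are assumptions for illustration only.

```python
def discounted_return(rewards, gamma):
    """Compute r_2 + gamma * r_3 + gamma^2 * r_4 + ... for a finite list."""
    total = 0.0
    # Accumulate from the last reward backwards: G = r + gamma * G_next.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

rewards = [1.0, 1.0, 1.0, 1.0]               # hypothetical r_2, r_3, r_4, r_5

full = discounted_return(rewards, 1.0)       # every reward counts in full
myopic = discounted_return(rewards, 0.0)     # only the immediate reward counts
half = discounted_return(rewards, 0.5)       # later rewards are shrunk geometrically
print(full, myopic, half)
```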
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,54 @@ | ||
--- | ||
title: Introduction | ||
date: 2023-09-28T10:36:03+08:00 | ||
draft: false | ||
categories: | ||
- Reinforcement Learning | ||
tags: | ||
--- | ||
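# Concepts | ||
In reinforcement learning the actions come without labels. Instead, each action is scored by a reward, and the algorithm learns from these scores over repeated iterations. | ||
## Classification by environment | ||
### Model-Free RL (no understanding of the environment) | ||
The agent takes whatever the environment gives it without trying to understand it. Here "model" refers to a model of the environment: the agent does not build one, so it cannot predict the outcome of an action and only learns the feedback after the action has been taken. | ||
- Q Learning | ||
- Sarsa | ||
- Policy Gradients | ||
### Model-Based RL (understands the environment) | ||
The agent builds a model of the real world: it first learns from past experience how the real world behaves, then uses that model to simulate feedback. The closer the simulated world is to the real one, the better the agent can pick the best action. | ||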
|
||
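## Classification by policy and value | ||
|
||
### Probability-based (Policy-Based RL) | ||
From the current state, compute the probability of every possible next action and sample from it, so any action may be chosen. (Can handle continuous actions by outputting a probability distribution.) | ||
- Policy Gradients | ||
### Value-based (Value-Based RL) | ||
From the current state, compute the value of every possible next action and choose the action with the highest value. (Discrete actions only; cannot handle continuous actions.) | ||
- Q Learning | ||
- Sarsa | ||
### Combining probability and value | ||
- Actor-Critic | ||
The actor chooses the action, and the critic scores it. | ||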
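The difference between the policy-based and value-based families shows up in how an action is picked from per-action numbers. The sketch below is only an illustration under assumptions: the helper names `policy_based_action` and `value_based_action`, the softmax conversion, and the score values are hypothetical and not taken from any particular library.

```python
import math
import random

def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def policy_based_action(action_scores, rng):
    """Sample an action index; every action keeps a nonzero chance."""
    probs = softmax(action_scores)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def value_based_action(action_values):
    """Pick the index of the action with the highest estimated value."""
    return max(range(len(action_values)), key=lambda i: action_values[i])

rng = random.Random(0)            # fixed seed so the run is repeatable
scores = [1.0, 2.0, 0.5]          # hypothetical per-action numbers

sampled = [policy_based_action(scores, rng) for _ in range(1000)]
greedy = value_based_action(scores)
print(sorted(set(sampled)), greedy)
```

Sampling visits all three actions over many draws, while the value-based rule always returns the single highest-valued action.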
|
||
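## Classification by episode/step updates | ||
|
||
### Episode updates (Monte-Carlo update) | ||
Update only after all steps of an episode have finished. | ||
- Policy Gradients | ||
- Monte-Carlo Learning | ||
### Step updates (temporal-difference update) | ||
Update after every single step within the current episode. | ||
- Improved Policy Gradients | ||
- Sarsa | ||
- Q Learning | ||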
|
||
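## Classification by online/offline learning | ||
### Online learning (on-policy) | ||
The agent must play the game itself and learn from its own play. | ||
- Sarsa | ||
- Sarsa(λ) | ||
### Offline learning (off-policy) | ||
The agent can learn by watching someone else play. It can learn while playing, or record how others play and learn from the record afterwards; learning and acting can be carried out separately. | ||
- Q Learning | ||
- Deep Q Network | ||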
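Assuming that "online" and "offline" here correspond to what is usually called on-policy and off-policy learning, the distinction can be seen in the update target each method uses. The sketch below contrasts the Sarsa and Q-learning targets; the function names, the dictionary Q-table, and the numbers are hypothetical and chosen only for illustration.

```python
def sarsa_target(Q, reward, next_state, next_action, gamma=0.9):
    """On-policy: bootstrap from the action the agent will actually take next."""
    return reward + gamma * Q[(next_state, next_action)]

def q_learning_target(Q, reward, next_state, actions, gamma=0.9):
    """Off-policy: bootstrap from the best next action, whichever one is taken."""
    return reward + gamma * max(Q[(next_state, a)] for a in actions)

# Hypothetical Q-table for the next state s2.
Q = {("s2", "a1"): 1.0, ("s2", "a2"): 2.0}
actions = ["a1", "a2"]

# Suppose the agent's own next move is a1, which is not the best-valued action.
on_policy = sarsa_target(Q, reward=0.0, next_state="s2",
                         next_action="a1", gamma=0.5)
off_policy = q_learning_target(Q, reward=0.0, next_state="s2",
                               actions=actions, gamma=0.5)
print(on_policy, off_policy)
```

Because the Q-learning target does not depend on which action is actually taken next, the experience can come from another player or from a stored record, which matches the description of offline learning above.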
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,7 +3,7 @@ title: Essays | |
date: 2022-03-25 09:34:25 | ||
draft: true | ||
password: 1233 | ||
tag: | ||
tags: | ||
- hide | ||
--- | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
|
||
# Source code | ||
Website source folder: ```source``` | ||
Database initialization script: ```init_sql.sql``` | ||
# Backing up the website | ||
|
||
- Open [Dashboard | HostGator Billing/Support System](https://portal.hostgator.com/) | ||
- Log in with your password, then go to cPanel->Files->File Manager | ||
|
||
![image.png](https://cdn.statically.io/gh/SivanLaai/image-store-rep@master/note/20230619210942.png) | ||
|
||
- Click the home directory and select public_html | ||
|
||
![image.png](https://cdn.statically.io/gh/SivanLaai/image-store-rep@master/note/20230619211144.png) | ||
- Right-click, choose Compress, and follow the prompts to create the backup archive | ||
|
||
![image.png](https://cdn.statically.io/gh/SivanLaai/image-store-rep@master/note/20230619211251.png) | ||
|
||
- Select the generated file, right-click it, and download it to your local machine as the backup | ||
|
||
# Backing up the database | ||
|
||
- Open [Dashboard | HostGator Billing/Support System](https://portal.hostgator.com/) | ||
- Log in with your password, then go to cPanel->DATABASES->phpMyAdmin | ||
|
||
![image.png](https://cdn.statically.io/gh/SivanLaai/image-store-rep@master/note/20230619211647.png) | ||
|
||
- Follow the steps shown in the screenshot below to save the backup file to your local machine | ||
|
||
![image.png](https://cdn.statically.io/gh/SivanLaai/image-store-rep@master/note/20230619211807.png) |