|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Introduction to Policy-Based Methods"
| 8 | + ] |
| 9 | + }, |
| 10 | + { |
| 11 | + "cell_type": "markdown", |
| 12 | + "metadata": {}, |
| 13 | + "source": [ |
| 14 | + "## Recap \n", |
| 15 | + "### Value-Based Methods:\n", |
| 16 | + "* Interaction $\\to$ Optimal Value Function $q_*$ $\\to$ Optimal Policy $\\pi_*$\n",
| 17 | + "\n", |
| 18 | + "\n", |
| 19 | + "* The value function is represented as a table, where rows correspond to states and columns correspond to actions.\n",
| 20 | + "* We then use this table to build the optimal policy.\n",
| 21 | + "\n", |
| 22 | + "* **BUT what about environments with much larger state spaces?**\n",
| 23 | + "  * So we investigated how to represent the optimal action-value function with a neural network, which formed the basis for the Deep Q-Learning algorithm.\n",
| 24 | + "  * Input dim: state_dim, Output dim: action_dim\n",
| 25 | + " \n",
| 26 | + "* But the important message here is that in both cases, whether we use a table for small state spaces or a neural network for much larger state spaces, we had to first **estimate** the optimal action-value function before we could tackle the optimal policy.\n",
| 27 | + "\n", |
| 28 | + "## Million Dollar Question\n", |
| 29 | + "* Can we directly find the optimal policy without worrying about a value function at all?\n", |
| 30 | + " * YES!! -- Policy-based methods.\n", |
| 31 | + " " |
| 32 | + ] |
| 33 | + }, |
| 34 | + { |
| 35 | + "cell_type": "markdown", |
| 36 | + "metadata": {}, |
| 37 | + "source": [ |
| 38 | + "# Policy Function Approximation" |
| 39 | + ] |
| 40 | + }, |
| 41 | + { |
| 42 | + "cell_type": "markdown", |
| 43 | + "metadata": {}, |
| 44 | + "source": [ |
| 45 | + "* How might we approach this idea of estimating an optimal policy?\n", |
| 46 | + "  * Let's take the CartPole example.\n",
| 47 | + "  * In this case the agent has **two** possible actions, and the dimension of the state space is **four**.\n",
| 48 | + "  * It can push the cart either left or right.\n",
| 49 | + "  * At each time step the agent picks one of the two actions.\n",
| 50 | + "  * We can construct a neural network that **approximates** the policy and accepts the state as input.\n",
| 51 | + " * As output, it can return the probability that the agent selects each possible action.\n", |
| 52 | + " * So if there are two possible actions, the output layer will have two nodes.\n", |
| 53 | + " * The agent uses this policy to interact with the environment by just passing the most recent state to the network.\n", |
| 54 | + "  * It outputs the action probabilities, and then the agent samples from those probabilities to select an action in response.\n",
| 55 | + "* Our objective then is to determine appropriate values for the network weights so that, for each state we pass into the network, it returns **action probabilities** where the optimal action is most likely to be selected.\n",
| 56 | + "* This will help the agent with its goal of maximizing expected return.\n",
| 57 | + "\n", |
| 58 | + "\n", |
| 59 | + "\n", |
| 60 | + "* This is an iterative process where the weights are initially set to random values.\n",
| 61 | + "* Then, as the agent interacts with the environment, it gradually amends those weights as it learns which strategies are best for maximizing reward. (A minimal sketch of such a policy network follows this cell.)"
| 62 | + ] |
| 63 | + }, |
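| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch of such a policy network in plain `numpy` (the hidden-layer size and the example state are illustrative assumptions, not part of the original text):\n",
| | + "```python\n",
| | + "import numpy as np\n",
| | + "\n",
| | + "state_dim, action_dim, hidden = 4, 2, 16   # CartPole: 4 state entries, 2 actions\n",
| | + "\n",
| | + "np.random.seed(0)\n",
| | + "W1 = 0.1 * np.random.randn(state_dim, hidden)    # the weights theta, initialized randomly\n",
| | + "W2 = 0.1 * np.random.randn(hidden, action_dim)\n",
| | + "\n",
| | + "def policy(state):\n",
| | + "    # Map a state to action probabilities via a softmax output layer.\n",
| | + "    h = np.tanh(state @ W1)\n",
| | + "    logits = h @ W2\n",
| | + "    exp = np.exp(logits - logits.max())          # numerically stable softmax\n",
| | + "    return exp / exp.sum()\n",
| | + "\n",
| | + "state = np.array([0.0, 0.1, 0.02, -0.1])         # an example CartPole-like observation\n",
| | + "probs = policy(state)\n",
| | + "action = np.random.choice(action_dim, p=probs)   # sample an action from the probabilities\n",
| | + "```"
| | + ]
| | + },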
| 64 | + { |
| 65 | + "cell_type": "markdown", |
| 66 | + "metadata": {}, |
| 67 | + "source": [ |
| 68 | + "## More on the Policy\n", |
| 69 | + "* The agent can use a simple neural network architecture to approximate a **Stochastic policy**. The agent passes the current environment state as input to the network, which returns action probabilities. Then, the agent samples from those probabilities to select an action.\n",
| 70 | + "* The same neural network architecture can be used to approximate a **Deterministic policy**. Instead of sampling from the action probabilities, the agent need only choose the greedy action.\n", |
| 71 | + "* **Softmax** is used as the output activation function. (A short action-selection sketch follows this cell.)"
| 72 | + ] |
| 73 | + }, |
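| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "The difference between the two uses of the same network can be shown in a few lines (a sketch, assuming `probs` comes from a softmax output layer as above):\n",
| | + "```python\n",
| | + "import numpy as np\n",
| | + "\n",
| | + "def select_action(probs, stochastic=True):\n",
| | + "    # Stochastic policy: sample from the action probabilities.\n",
| | + "    # Deterministic policy: take the greedy (most probable) action.\n",
| | + "    if stochastic:\n",
| | + "        return int(np.random.choice(len(probs), p=probs))\n",
| | + "    return int(np.argmax(probs))\n",
| | + "\n",
| | + "probs = np.array([0.7, 0.3])                   # example softmax output of the policy network\n",
| | + "print(select_action(probs, stochastic=True))   # sampled action: 0 or 1\n",
| | + "print(select_action(probs, stochastic=False))  # greedy action: always 0 here\n",
| | + "```"
| | + ]
| | + },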
| 74 | + { |
| 75 | + "cell_type": "markdown", |
| 76 | + "metadata": {}, |
| 77 | + "source": [ |
| 78 | + "## What about continuous action spaces?\n", |
| 79 | + "*** \n", |
| 80 | + "The CartPole environment has a discrete action space. So, how do we use a neural network to approximate a policy, if the environment has a continuous action space?\n", |
| 81 | + "\n", |
| 82 | + "As we have learned, in the case of **discrete** action spaces, the neural network has one node for each possible action.\n", |
| 83 | + "\n", |
| 84 | + "For **continuous** action spaces, the neural network has one node for each action entry (or index). For example, consider the action space of the **bipedal walker** environment.\n", |
| 85 | + "<img src = \"./images/a14.png\">\n", |
| 86 | + "In this case, any action is a vector of four numbers, so the output layer of the policy network will have four nodes.\n", |
| 87 | + "\n", |
| 88 | + "Since every entry in the action must be a number between -1 and 1, we will add a tanh activation function to the output layer.\n", |
| 89 | + "\n", |
| 90 | + "In the `MountainCarContinuous-v0` environment, the action is a single number, so the output layer has size 1, again with a tanh activation. (A sketch of such a network follows this cell.)"
| 91 | + ] |
| 92 | + }, |
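| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A sketch of a continuous-action policy network in `numpy` (the 24-dimensional state and the hidden size are assumptions based on the bipedal walker environment; the key point is the 4-node tanh output layer):\n",
| | + "```python\n",
| | + "import numpy as np\n",
| | + "\n",
| | + "state_dim, action_dim, hidden = 24, 4, 32   # BipedalWalker: 4 action entries, each in [-1, 1]\n",
| | + "\n",
| | + "np.random.seed(0)\n",
| | + "W1 = 0.1 * np.random.randn(state_dim, hidden)\n",
| | + "W2 = 0.1 * np.random.randn(hidden, action_dim)\n",
| | + "\n",
| | + "def policy(state):\n",
| | + "    # tanh on the output layer keeps every action entry between -1 and 1.\n",
| | + "    h = np.tanh(state @ W1)\n",
| | + "    return np.tanh(h @ W2)                  # shape (4,): one node per action entry\n",
| | + "\n",
| | + "action = policy(np.zeros(state_dim))\n",
| | + "# For MountainCarContinuous-v0 the same idea applies with action_dim = 1.\n",
| | + "```"
| | + ]
| | + },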
| 93 | + { |
| 94 | + "cell_type": "markdown", |
| 95 | + "metadata": {}, |
| 96 | + "source": [ |
| 97 | + "# Hill Climbing\n", |
| 98 | + "* So remember that the agent's goal is always to **maximize expected return**.\n",
| 99 | + "* Let's denote the expected return as $J$.\n", |
| 100 | + "* And weights in NeuralNetwork as $\\theta$.\n", |
| 101 | + "* And there's some mathematical relationship between $\\theta$ and the expected return $J$.\n", |
| 102 | + "* This is because $\\theta$ encodes the policy, which makes some actions more likely than others; those actions influence the rewards, and we sum up the rewards to get the return.\n",
| 103 | + "* The main idea is that it's possible to write the expected return $J$ as a function of $\\theta$, e.g. $J = F(\\theta)$.\n",
| 104 | + "> $J(\\theta) = \\sum_{\\tau} P(\\tau;\\theta)R(\\tau)$\n", |
| 105 | + "<br>Here the sum runs over trajectories $\\tau$: $P(\\tau;\\theta)$ is the probability of trajectory $\\tau$ under the policy with weights $\\theta$, and $R(\\tau)$ is the return collected along it. This is just the definition of an expected value: the sum, over all samples, of the probability of each sample times its value.\n",
| 106 | + "\n", |
| 107 | + "* And we want to optimize the values of $\\theta$ such that it **maximizes the EXPECTED RETURN** $J$.\n", |
| 108 | + "\n", |
| 109 | + "### Hill Climbing\n", |
| 110 | + "* We begin with an initially random set of weights $\\theta$. We collect a single **episode** with the policy that corresponds to those weights and then record the return.\n",
| 111 | + "* This return is an estimate of what the surface looks like at that value of $\\theta$.\n",
| 112 | + "* Now it's not going to be a perfect estimate, because the return we just collected is unlikely to be equal to the expected return, but in practice the estimate often turns out to be good enough.\n",
| 113 | + "* Then we add a little bit of noise to those weights to give us another set of candidate weights we can try.\n",
| 114 | + "* **To see how good those new weights are, we'll use the policy that they give us to again interact with the environment for an episode and add up the return.**\n",
| 115 | + "* If the new weights give us __more__ return than our current best estimate, we focus our attention on that new value, and then we just repeat by iteratively proposing new policies in the hope that they outperform the existing policy.\n",
| 116 | + "* In the event that they don't, we go back to our previous best policy.\n",
| 117 | + "\n", |
| 118 | + "### Working of the process\n", |
| 119 | + "* Consider the case that the neural network has only two weights.\n", |
| 120 | + "* The agent's goal is to maximize expected return $J$.\n", |
| 121 | + "* The weights in the neural network are $\\theta = (\\theta_0,\\theta_1)$\n", |
| 122 | + "* Then we can plot the expected return $J$ as a function of the values of both weights.\n", |
| 123 | + "\n", |
| 124 | + "* Once we get that function we can optimize our $\\theta$ values so that we can maximize the expected return function $J$.\n", |
| 125 | + "\n", |
| 126 | + "#### Gradient Ascent\n", |
| 127 | + "* **Gradient ascent** is similar to gradient descent.\n", |
| 128 | + " * Gradient descent steps in the **direction opposite the gradient**, since it wants to minimize a function.\n", |
| 129 | + "  * Gradient ascent is otherwise identical, except we step in the **direction of the gradient**, to reach the maximum. (A toy numeric sketch follows this cell.)\n",
| 130 | + " \n", |
| 131 | + "#### Local Maxima \n",
| 132 | + "As we saw above, **hill climbing** is a relatively simple algorithm that the agent can use to gradually improve the weights $\\theta$ in its policy network while interacting with the environment.\n",
| 133 | + "\n",
| 134 | + "Note, however, that it's **not** guaranteed to always yield the weights of the optimal policy. This is because we can easily get stuck in a local maximum. In this lesson, we'll learn about some policy-based methods that are less prone to this.\n"
| 135 | + ] |
| 136 | + }, |
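| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A toy one-parameter illustration of gradient ascent (the quadratic objective below is only a stand-in for $J(\\theta)$, chosen to show the update direction):\n",
| | + "```python\n",
| | + "def J(theta):                      # toy objective standing in for expected return\n",
| | + "    return -(theta - 3.0) ** 2     # maximized at theta = 3\n",
| | + "\n",
| | + "def dJ(theta):                     # its gradient\n",
| | + "    return -2.0 * (theta - 3.0)\n",
| | + "\n",
| | + "theta, lr = 0.0, 0.1\n",
| | + "for _ in range(100):\n",
| | + "    theta += lr * dJ(theta)        # gradient ASCENT: step WITH the gradient\n",
| | + "    # gradient descent on a loss would instead subtract: theta -= lr * gradient\n",
| | + "print(theta)                       # approaches 3.0\n",
| | + "```"
| | + ]
| | + },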
| 137 | + { |
| 138 | + "cell_type": "markdown", |
| 139 | + "metadata": {}, |
| 140 | + "source": [ |
| 141 | + "## Hill Climbing Pseudocode\n", |
| 142 | + "* Initialize the weights $\\theta$ in the policy arbitrarily.\n", |
| 143 | + "* Collect an episode with $\\theta$, and record the return $G$.\n", |
| 144 | + "* $\\theta_{best} \\gets \\theta; G_{best} \\gets G$\n", |
| 145 | + "* Add a little bit of random noise to $\\theta_{best}$, to get a new set of weights $\\theta_{new}$.\n", |
| 146 | + "* Collect an episode with $\\theta_{new}$, and record the return $G_{new}$\n", |
| 147 | + "* if $G_{new} > G_{best}$ then :\n", |
| 148 | + " * $G_{best} \\gets G_{new}$, $\\theta_{best} \\gets \\theta_{new}$\n", |
| 149 | + "* Repeat until we obtain the optimal policy. (A runnable sketch of this loop follows this cell.)\n",
| 150 | + "\n", |
| 151 | + "\n", |
| 152 | + "### What's the difference between G and J?\n", |
| 153 | + "Well .. in reinforcement learning, the goal of the agent is to find the value of the policy network weights $\\theta$ that maximizes **expected** return, which we have denoted by $J$.\n", |
| 154 | + "\n", |
| 155 | + "In the hill climbing algorithm, the values of $\\theta$ are evaluated according to how much return $G$ they collect in a **single episode**. To see why this might be a little bit strange, note that, due to randomness in the environment (and the policy, if it is stochastic), if we collect a second episode with the same values for $\\theta$, we'll likely get a different value for the return $G$. Because of this, the (sampled) return $G$ is not a perfect estimate of the expected return $J$, but it often turns out to be **good enough** in practice.\n",
| 156 | + " " |
| 157 | + ] |
| 158 | + }, |
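| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A runnable sketch of the pseudocode above, assuming OpenAI Gym with the classic 4-tuple `env.step` API and a simple linear softmax policy (the noise scale, episode limit, and the 195.0 solve threshold are illustrative assumptions, not part of the pseudocode):\n",
| | + "```python\n",
| | + "import numpy as np\n",
| | + "import gym                                   # assumes OpenAI Gym is installed\n",
| | + "\n",
| | + "env = gym.make('CartPole-v0')\n",
| | + "state_dim = env.observation_space.shape[0]   # 4\n",
| | + "action_dim = env.action_space.n              # 2\n",
| | + "\n",
| | + "def run_episode(theta, max_t=1000):\n",
| | + "    # Collect one episode with a linear softmax policy; return G (undiscounted sum of rewards).\n",
| | + "    state = env.reset()\n",
| | + "    G = 0.0\n",
| | + "    for _ in range(max_t):\n",
| | + "        logits = state @ theta\n",
| | + "        probs = np.exp(logits - logits.max())\n",
| | + "        probs /= probs.sum()\n",
| | + "        action = np.random.choice(action_dim, p=probs)\n",
| | + "        state, reward, done, _ = env.step(action)\n",
| | + "        G += reward\n",
| | + "        if done:\n",
| | + "            break\n",
| | + "    return G\n",
| | + "\n",
| | + "np.random.seed(0)\n",
| | + "theta_best = 1e-4 * np.random.randn(state_dim, action_dim)   # arbitrary initial weights\n",
| | + "G_best = run_episode(theta_best)\n",
| | + "noise_scale = 1e-2\n",
| | + "\n",
| | + "for episode in range(1000):\n",
| | + "    theta_new = theta_best + noise_scale * np.random.randn(state_dim, action_dim)\n",
| | + "    G_new = run_episode(theta_new)\n",
| | + "    if G_new > G_best:                       # keep the new weights only if they did better\n",
| | + "        theta_best, G_best = theta_new, G_new\n",
| | + "    if G_best >= 195.0:                      # CartPole-v0 is commonly considered solved near 195\n",
| | + "        break\n",
| | + "```"
| | + ]
| | + },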
| 159 | + { |
| 160 | + "cell_type": "markdown", |
| 161 | + "metadata": {}, |
| 162 | + "source": [ |
| 163 | + "## Beyond Hill Climbing \n", |
| 164 | + "\n", |
| 165 | + "We denoted the expected return by $J$. Likewise, we used $\\theta$ to refer to weights in the policy network. Then, since $\\theta$ encodes the policy, which influences how much reward the agent will likely receive, we know that $J$ is a function of $\\theta$.\n", |
| 166 | + "\n", |
| 167 | + "Despite the fact that we have no idea what that function $J = J(\\theta)$ looks like, the *hill climbing* algorithm helps us determine the value of $\\theta$ that maximizes it.\n", |
| 168 | + "\n", |
| 169 | + "## Note\n", |
| 170 | + "We refer to the general class of approaches that find $\\arg\\max_{\\theta} J(\\theta)$ by randomly perturbing the most recent best estimate as **stochastic policy search**. Likewise, we can refer to $J$ as an **objective function**, which just refers to the fact that we'd like to *maximize* it!\n",
| 171 | + "\n", |
| 172 | + "\n", |
| 173 | + "### Improvements in Hill Climbing algorithm\n", |
| 174 | + "* One small improvement to this approach is to choose a small number of neighboring policies at each iteration and pick the best among them.\n",
| 175 | + "* Generate a few candidate policies by perturbing the parameters randomly, and evaluate each policy by interacting with the environment.\n",
| 176 | + "* This gives us an idea of the **neighborhood** of the current policy.\n",
| 177 | + "* Now pick the candidate policy that looks most promising and iterate.\n",
| 178 | + "* This variation is known as **steepest-ascent hill climbing**, and it helps to reduce the risk of selecting a next policy that leads to a suboptimal solution.\n",
| 179 | + "\n", |
| 180 | + "## Simulated Annealing \n",
| 181 | + "* Simulated annealing uses a predefined schedule to control how the policy space is explored.\n",
| 182 | + "* Starting with a large noise parameter, that is, a broad neighborhood to explore, we gradually reduce the noise (or search radius) as we get closer and closer to the optimal solution.\n",
| 183 | + "\n", |
| 184 | + "\n", |
| 185 | + "## Adaptive Noise (Scaling)\n", |
| 186 | + "* Whenever we find a better policy than before, we're likely getting closer to the optimal policy.\n", |
| 187 | + "* This translates to reducing, or decaying, the variance of the Gaussian noise we add.\n",
| 188 | + "* But if we don't find a better policy, it's probably a good idea to increase our search radius and continue exploring from the current best policy.\n",
| 189 | + "* This small tweak to stochastic policy search makes it much less likely to get stuck, especially in domains with a complicated objective function. (A sketch of steepest-ascent hill climbing with adaptive noise scaling follows this cell.)"
| 190 | + ] |
| 191 | + }, |
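| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A sketch of steepest-ascent hill climbing with adaptive noise scaling. Here `evaluate(theta)` is assumed to run one episode and return G (for example, the `run_episode` function sketched earlier); the population size and the scale bounds are illustrative choices:\n",
| | + "```python\n",
| | + "import numpy as np\n",
| | + "\n",
| | + "def steepest_ascent(evaluate, theta, n_candidates=8, noise_scale=1e-2,\n",
| | + "                    min_scale=1e-3, max_scale=2.0, n_iterations=500):\n",
| | + "    # Steepest-ascent hill climbing with adaptive noise scaling.\n",
| | + "    # evaluate(theta) is assumed to return the (sampled) episode return for weights theta.\n",
| | + "    G_best = evaluate(theta)\n",
| | + "    for _ in range(n_iterations):\n",
| | + "        # Generate several neighboring candidates instead of just one.\n",
| | + "        candidates = [theta + noise_scale * np.random.randn(*theta.shape)\n",
| | + "                      for _ in range(n_candidates)]\n",
| | + "        returns = [evaluate(c) for c in candidates]\n",
| | + "        i = int(np.argmax(returns))\n",
| | + "        if returns[i] > G_best:\n",
| | + "            theta, G_best = candidates[i], returns[i]\n",
| | + "            noise_scale = max(min_scale, noise_scale / 2)   # improvement: shrink the search radius\n",
| | + "        else:\n",
| | + "            noise_scale = min(max_scale, noise_scale * 2)   # no improvement: widen the search radius\n",
| | + "    return theta, G_best\n",
| | + "```"
| | + ]
| | + },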
| 192 | + { |
| 193 | + "cell_type": "markdown", |
| 194 | + "metadata": {}, |
| 195 | + "source": [ |
| 196 | + "## More Black-Box Optimization\n", |
| 197 | + "\n", |
| 198 | + "All algorithms that we've learned about in this lesson can be classified as **black box optimization** techniques.\n", |
| 199 | + "\n", |
| 200 | + "**Black-box** refers to the fact that in order to find the value of $\\theta$ that maximizes the function $J = J(\\theta)$, we need only to estimate the value of $J$ at any potential value of $\\theta$.\n", |
| 201 | + "\n", |
| 202 | + "That is, neither hill climbing nor steepest-ascent hill climbing knows that we're solving a reinforcement learning problem, and they do not care that the function we're trying to maximize corresponds to the expected return.\n",
| 203 | + "\n",
| 204 | + "These algorithms only know that, for each value of $\\theta$, there's a corresponding **number**. We know that this number comes from using the policy corresponding to $\\theta$ to collect an episode and adding up the rewards, but the algorithms are not aware of this. To these algorithms, the way we evaluate $\\theta$ is considered a black box, and they don't worry about the details. They only care about finding the value of $\\theta$ that will maximize the number that comes out of the black box. (A toy sketch follows this cell.)\n",
| 205 | + "\n" |
| 206 | + ] |
| 207 | + }, |
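| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "A toy illustration of the black-box view: the optimizer only ever sees a function from $\\theta$ to a single number (the function body below is a placeholder, not a real environment roll-out):\n",
| | + "```python\n",
| | + "import numpy as np\n",
| | + "\n",
| | + "def black_box(theta):\n",
| | + "    # Stands in for: run one episode with the policy encoded by theta and add up the rewards.\n",
| | + "    # The optimizer never looks inside: theta goes in, a single noisy number comes out.\n",
| | + "    return -np.sum((theta - 1.0) ** 2) + 0.1 * np.random.randn()\n",
| | + "\n",
| | + "theta = np.zeros(4)\n",
| | + "score = black_box(theta)   # the only interface that hill climbing & co. rely on\n",
| | + "```"
| | + ]
| | + },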
| 208 | + { |
| 209 | + "cell_type": "markdown", |
| 210 | + "metadata": {}, |
| 211 | + "source": [ |
| 212 | + "## Cross-Entropy method\n", |
| 213 | + "* Hill climbing begins with a best guess for the weights, then it adds a little bit of noise to propose one new policy that might perform better.\n",
| 214 | + "* **Steepest-ascent Hill climbing**, does a little bit more work by generating several neighboring policies at each iteration.\n", |
| 215 | + "* But in both cases, only the best policy prevails.\n", |
| 216 | + "* In **steepest-ascent hill climbing** there's still a lot of useful information, from the candidates that aren't selected, that we're **throwing** out.\n",
| 217 | + "* Now we'll learn about some methods that leverage useful information from the weights that aren't selected as best.\n",
| 218 | + "#### The process\n", |
| 219 | + "* So what if, instead of selecting only the best policy, we selected the top 10 or 20 percent of them, and took the average?\n", |
| 220 | + "* This is what the **Cross-Entropy Method** does.\n",
| 221 | + "## Evolution Strategies\n", |
| 222 | + "* Another approach is to look at the return that was collected by __each__ candidate policy.\n", |
| 223 | + "* The next estimate of the best policy will be a **weighted** sum of all of these, where policies that got a higher return are given more say, or a higher weight.\n",
| 224 | + "* This technique is called **Evolution Strategies**\n", |
| 225 | + "#### Background of evolution strategies\n", |
| 226 | + "* The name originally comes from the idea of biological evolution, where the most successful individuals in the policy population have the most influence on the next generation or iteration.\n",
| 227 | + "* Evolution strategies is just another black-box optimization technique. (Sketches of both methods follow this cell.)"
| 228 | + ] |
| 229 | + }, |
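| | + {
| | + "cell_type": "markdown",
| | + "metadata": {},
| | + "source": [
| | + "Sketches of both ideas, assuming an `evaluate(theta)` function that returns the episode return for a given weight array (the population size, noise scale, and the exponential weighting in the ES update are illustrative choices):\n",
| | + "```python\n",
| | + "import numpy as np\n",
| | + "\n",
| | + "def cross_entropy_method(evaluate, theta, pop_size=50, elite_frac=0.2,\n",
| | + "                         sigma=0.5, n_iterations=100):\n",
| | + "    # Cross-entropy method: keep the top elite_frac of candidates and average them.\n",
| | + "    n_elite = int(pop_size * elite_frac)\n",
| | + "    for _ in range(n_iterations):\n",
| | + "        candidates = [theta + sigma * np.random.randn(*theta.shape)\n",
| | + "                      for _ in range(pop_size)]\n",
| | + "        returns = np.array([evaluate(c) for c in candidates])\n",
| | + "        elite_idx = returns.argsort()[-n_elite:]            # indices of the best candidates\n",
| | + "        theta = np.mean([candidates[i] for i in elite_idx], axis=0)\n",
| | + "    return theta\n",
| | + "\n",
| | + "def evolution_strategies(evaluate, theta, pop_size=50, sigma=0.5, n_iterations=100):\n",
| | + "    # Evolution strategies: new estimate is a return-weighted average of ALL candidates.\n",
| | + "    for _ in range(n_iterations):\n",
| | + "        candidates = [theta + sigma * np.random.randn(*theta.shape)\n",
| | + "                      for _ in range(pop_size)]\n",
| | + "        returns = np.array([evaluate(c) for c in candidates])\n",
| | + "        weights = np.exp(returns - returns.max())           # higher return -> higher weight\n",
| | + "        weights /= weights.sum()\n",
| | + "        theta = np.sum([w * c for w, c in zip(weights, candidates)], axis=0)\n",
| | + "    return theta\n",
| | + "```"
| | + ]
| | + },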
| 230 | + { |
| 231 | + "cell_type": "code", |
| 232 | + "execution_count": null, |
| 233 | + "metadata": {}, |
| 234 | + "outputs": [], |
| 235 | + "source": [] |
| 236 | + } |
| 237 | + ], |
| 238 | + "metadata": { |
| 239 | + "kernelspec": { |
| 240 | + "display_name": "Python 3", |
| 241 | + "language": "python", |
| 242 | + "name": "python3" |
| 243 | + }, |
| 244 | + "language_info": { |
| 245 | + "codemirror_mode": { |
| 246 | + "name": "ipython", |
| 247 | + "version": 3 |
| 248 | + }, |
| 249 | + "file_extension": ".py", |
| 250 | + "mimetype": "text/x-python", |
| 251 | + "name": "python", |
| 252 | + "nbconvert_exporter": "python", |
| 253 | + "pygments_lexer": "ipython3", |
| 254 | + "version": "3.6.5" |
| 255 | + } |
| 256 | + }, |
| 257 | + "nbformat": 4, |
| 258 | + "nbformat_minor": 2 |
| 259 | +} |