Reinforcement Learning Background
What follows is a series of definitions that may be useful to understand the concepts of reinforcement learning that are used when training an agent in an RLGym environment. Note that these definitions are not meant to be exhaustive, and we will formulate the reinforcement learning setting in a somewhat non-standard way to better align with the environments typically considered by practitioners using RLGym.
The Basics
A decision process, sometimes called an environment, is characterized by a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, and a state transition probability function $P(s_{t+1} \mid s_t, a_t)$.
When an agent executes an action $a_t$ in state $s_t$, the environment transitions to a new state $s_{t+1}$ according to $P(s_{t+1} \mid s_t, a_t)$.
For our purposes it is useful to further consider a set of observations $\mathcal{O}$, which are representations of states that an agent acts upon. The observation function $\phi : \mathcal{S} \to \mathcal{O}$ maps states to observations.
A policy $\pi$ is a function that maps an observation to a real-valued vector, $\pi : \mathcal{O} \to \mathbb{R}^n$.
An action function $f$ is a function that maps the output of a policy to an action, $f : \mathbb{R}^n \to \mathcal{A}$.
The complete state-to-action mapping is given by $a_t = f(\pi(\phi(s_t)))$. We will refer to this mapping as an agent.
Following each action $a_t$ in state $s_t$, a reward function generates a scalar reward, $r_t = R(s_t, a_t, s_{t+1})$.
We refer to a single interaction between the agent and the environment as a timestep, which contains the tuple $(s_t, a_t, r_t, s_{t+1})$.
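To make these pieces concrete, here is a minimal sketch of one episode of interaction. Everything in it (the random-walk transition, reward function, observation function, policy, and action function) is a hypothetical stand-in rather than any RLGym API; the point is just to show how each timestep $(s_t, a_t, r_t, s_{t+1})$ comes from composing $f(\pi(\phi(s_t)))$ with the environment.

```python
import random

# A toy 1-D random-walk "environment": the state is an integer position.
# All of these functions are illustrative stand-ins, not RLGym APIs.

def transition(state, action):
    """State transition: move left or right; |state| >= 5 is terminal."""
    return state + (1 if action == "right" else -1)

def reward_fn(state, action, next_state):
    """Reward function R(s_t, a_t, s_{t+1}): +1 for reaching the right edge."""
    return 1.0 if next_state >= 5 else 0.0

def obs_fn(state):
    """Observation function phi: here the observation is just the state as a float."""
    return [float(state)]

def policy(obs):
    """Policy pi: maps an observation to a real-valued vector (action scores)."""
    return [random.random(), random.random()]

def action_fn(policy_output):
    """Action function f: maps the policy output to a concrete action."""
    return "right" if policy_output[1] >= policy_output[0] else "left"

# One episode of interaction, collecting timesteps (s_t, a_t, r_t, s_{t+1}).
state, timesteps = 0, []
while abs(state) < 5:
    action = action_fn(policy(obs_fn(state)))   # the agent: f(pi(phi(s_t)))
    next_state = transition(state, action)
    reward = reward_fn(state, action, next_state)
    timesteps.append((state, action, reward, next_state))
    state = next_state
```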
Trajectories and Returns
We are concerned with two types of sequences:
- An episode: A complete sequence of timesteps starting from an initial state $s_0$ and ending with a terminal state $s_T$
- A trajectory: Any sequence of timesteps from some arbitrary state $s_t$ to some later state $s_{t+n}$

If all sequences of actions from any $s_t$ are guaranteed to eventually reach some terminal state $s_T$, we refer to this as a finite-horizon problem. If instead we allow the trajectory to continue indefinitely, we refer to this as an infinite-horizon problem.
A return $G_t$ is the cumulative reward obtained over a trajectory. In the finite-horizon case, the return can be written simply as

$$G_t = \sum_{k=0}^{T-t} r_{t+k}$$
However, to account for infinite-horizon problems, we must introduce a discount factor $\gamma$ to form the discounted return

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$
Note that this reduces back to our first equation for $G_t$ when $\gamma = 1$. This discount factor serves two purposes. First, it ensures convergence of $G_t$ as the trajectory length grows, so long as $\gamma < 1$, by forming a convergent geometric series when the rewards are bounded. Second, it acts as a form of temporal credit assignment by assigning more weight to rewards that were obtained closer to the current time $t$.
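As a quick illustration of the formula, the tiny helper below computes the discounted return for a finite list of rewards; the function name and defaults are my own, not part of any library.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * r_{t+k} for a finite list of rewards."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# With gamma = 1 this reduces to the plain (undiscounted) sum of rewards.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```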
Value Functions
The state value function, often just called the value function, is a function $V^\pi$ that maps states to the expected return of a policy $\pi$ at that state. It is given by

$$V^\pi(s_t) = \mathbb{E}_\pi\left[G_t \mid s_t\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t\right]$$
This is an important quantity to understand because it captures the quality of a policy at a given state. It should be emphasized that the value function considers only one specific policy, so every time we make even a tiny change to our agent's policy, the value function will change as well. One way to envision the value function is to imagine the agent being dropped into the game at some arbitrary state $s_t$. The value of the policy at that state is the return it would get on average if we allowed it to play from that state infinitely many times, restarting from the same state each time a terminal state is reached.
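That mental picture translates directly into a Monte Carlo estimate: play many episodes from the same state and average their discounted returns. The `rollout` argument below is a hypothetical callable that plays one episode from a given state under the current policy and returns the list of rewards it collected; this is a sketch of the idea, not an RLGym utility.

```python
def estimate_value(rollout, state, gamma=0.99, num_episodes=1000):
    """Monte Carlo estimate of V^pi(state): average the discounted return
    over many episodes that all start from the same state."""
    total = 0.0
    for _ in range(num_episodes):
        rewards = rollout(state)  # hypothetical: plays one episode from `state` under pi
        total += sum(gamma ** k * r for k, r in enumerate(rewards))
    return total / num_episodes
```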
The state-action value function, or Q function, is a function $Q^\pi$ that maps states and actions to the expected return of a policy at a state when the agent takes action $a_t$ at that state, then acts according to $\pi$ forever afterwards. It is given by

$$Q^\pi(s_t, a_t) = \mathbb{E}_\pi\left[G_t \mid s_t, a_t\right]$$
This quantity is similar to $V^\pi(s_t)$, but with the caveat that the agent must first take the action $a_t$ at state $s_t$ before acting according to the policy forever afterwards. Note that $Q^\pi$ can be written in terms of $V^\pi$ as

$$Q^\pi(s_t, a_t) = \mathbb{E}_\pi\left[r_t + \gamma V^\pi(s_{t+1}) \mid s_t, a_t\right]$$
The state-action advantage function, or advantage function, is the difference between the Q function and the value function at a state given an action. This is given by

$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$
Think of the advantage function as a measure of how much better it was to take the action $a_t$ at state $s_t$ than it would have been to just act according to the policy at that state.
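Since $Q^\pi(s_t, a_t)$ can be written as the expectation of $r_t + \gamma V^\pi(s_{t+1})$, a single observed timestep already gives a rough one-sample estimate of the advantage. The sketch below assumes `value` is some callable approximating $V^\pi$; both the function and its signature are illustrative, not a library API.

```python
def one_step_advantage(value, state, reward, next_state, gamma=0.99, done=False):
    """Rough one-sample estimate of A^pi(s_t, a_t):
    r_t + gamma * V(s_{t+1}) - V(s_t), where `value` approximates V^pi."""
    next_value = 0.0 if done else value(next_state)  # no future value past a terminal state
    return reward + gamma * next_value - value(state)
```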
The Learning Process
Most learning algorithms consider an objective function $J(\pi)$, which is a function that maps a policy to a real number. The goal of learning is then to find a policy that maximizes the objective function, i.e. $\pi^* = \arg\max_\pi J(\pi)$. A convenient choice for $J$ would be any of the Q function, value function, or advantage function. For our purposes we will focus on the advantage function, because the Proximal Policy Optimization (PPO) algorithm uses it as an objective.
Generalized Advantage Estimation
A common point of confusion for users of PPO is the Generalized Advantage Estimation (GAE) algorithm, which was written by Schulman et al. in 2015 and published at ICLR 2016.
Before we begin, be aware that I will be writing this explanation of GAE from the perspective of the value function $V^\pi$, rather than the advantage function $A^\pi$, but everything I say here can be said about $A^\pi$ as well. I chose this because I think it's easier to understand GAE for its use in computing stable targets for our critic, rather than stable advantages for our policy.
To understand how GAE works, we first need to understand an interesting fact about the value function: it can be written in terms of itself. Consider the following equalities:

$$V^\pi(s_t) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k}\right] = \mathbb{E}_\pi\left[r_t + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\right] = \mathbb{E}_\pi\left[r_t + \gamma V^\pi(s_{t+1})\right]$$
So long as the reward function is deterministic, this is equivalent to

$$V^\pi(s_t) = r_t + \gamma\, \mathbb{E}_\pi\left[V^\pi(s_{t+1})\right]$$
As you can see, we can write $V^\pi(s_t)$ either as the expectation of the return at the current state, or as the sum of the reward and the discounted value of the next state. Further, we can write similar equations for $V^\pi(s_t)$ with as many reward terms as we want. All of the following are equivalent:

$$V^\pi(s_t) = r_t + \gamma\, \mathbb{E}_\pi\left[V^\pi(s_{t+1})\right]$$

$$V^\pi(s_t) = r_t + \gamma r_{t+1} + \gamma^2\, \mathbb{E}_\pi\left[V^\pi(s_{t+2})\right]$$

$$V^\pi(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3\, \mathbb{E}_\pi\left[V^\pi(s_{t+3})\right]$$

$$\vdots$$
These equalities are important because they show us that there are as many ways to write $V^\pi(s_t)$ as there are timesteps in a trajectory. We care about that because, in practice, we don't know the actual value of $V^\pi(s)$ for any state. Instead, we collect one trajectory at a time, and consider the return we calculate from each timestep as a sample from the return distribution at that state. We then train our critic to predict the return we calculate for each state. This works because when we encounter the same state more than once we'll get a different return for it, so the critic will learn to predict the average return at that state. If we do this enough times, the critic will learn to predict the true value function.
However, when training the critic, one might look at the above equivalent ways of writing $V^\pi(s_t)$ and wonder, "which of these equations should I train the critic to predict?" To answer that question we will first rewrite the above equations by denoting each form of $V^\pi(s_t)$ as $V^{(n)}_t$, where $n$ is the number of reward terms used, and we will introduce our critic to the calculation by replacing $V^\pi$ with our approximation $\hat{V}$:

$$V^{(1)}_t = r_t + \gamma \hat{V}(s_{t+1})$$

$$V^{(2)}_t = r_t + \gamma r_{t+1} + \gamma^2 \hat{V}(s_{t+2})$$

$$V^{(3)}_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 \hat{V}(s_{t+3})$$

$$\vdots$$
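In code, each $V^{(n)}_t$ is just the first $n$ discounted rewards of the trajectory plus the critic's (discounted) estimate at the state where we stop. The helper below is a sketch under the assumption that `rewards[i]` and `states[i]` hold $r_i$ and $s_i$ for a single trajectory and that `critic` is any callable approximating the value function.

```python
def n_step_target(rewards, states, critic, t, n, gamma=0.99):
    """Compute V^(n)_t = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}
    + gamma^n * critic(s_{t+n}) from a single trajectory."""
    target = 0.0
    for k in range(n):
        target += (gamma ** k) * rewards[t + k]
    # Bootstrap with the critic at the state where the partial sum stops.
    target += (gamma ** n) * critic(states[t + n])
    return target
```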
Note that this means every $V^{(n)}_t$ is not the actual output of the value function at $s_t$, because the critic is only an approximation of the correct value. Next, recall that we train our critic to predict some target value $V^{\text{target}}_t$ by minimizing the mean-squared error between the predicted value $\hat{V}(s_t)$ and $V^{\text{target}}_t$. That is, we minimize the loss

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{V}(s_i) - V^{\text{target}}_i\right)^2$$

where $N$ is the batch size.
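The loss itself is just the ordinary mean-squared error between the critic's predictions and whatever targets we pick. The sketch below uses plain NumPy rather than any particular deep learning framework, so the names are mine.

```python
import numpy as np

def critic_loss(predicted_values, targets):
    """Mean-squared error between the critic's predictions V(s_i) and the
    targets: L = (1/N) * sum_i (V(s_i) - target_i)^2, with N the batch size."""
    predicted_values = np.asarray(predicted_values, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    return float(np.mean((predicted_values - targets) ** 2))
```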
With this in mind, GAE is somewhat an exercise in choosing the target value $V^{\text{target}}_t$. We could choose the target value to be $V^{(1)}_t$, or $V^{(2)}_t$, or $V^{(3)}_t$, etc., so which should we choose? The key insight behind GAE is that we can choose the average over all of them:

$$V^{\text{target}}_t = \frac{1}{T - t} \sum_{n=1}^{T - t} V^{(n)}_t$$

where $T$ is the final timestep of the trajectory.
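Taking that arithmetic mean over every available n-step target might look like the following, reusing the hypothetical `n_step_target` helper from the earlier sketch.

```python
def mean_target(rewards, states, critic, t, gamma=0.99):
    """Average V^(1)_t through V^(T-t)_t for one trajectory. Assumes `states`
    also contains the final state, so len(states) == len(rewards) + 1."""
    num_targets = len(rewards) - t
    targets = [n_step_target(rewards, states, critic, t, n, gamma)
               for n in range(1, num_targets + 1)]
    return sum(targets) / len(targets)
```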
This would work fine, but another insight made by Schulman et al. is that, as mentioned earlier, each time we compute a return from the same state the result might be different. This is because if we let an agent play a game from some state $s_t$ many times it may choose different actions each time, so the resulting trajectories might be different, which means the returns we calculate at $s_t$ might be different. In contrast to this, our estimate of the value function (the critic) is computed by a deterministic neural network, so it will give us the same output no matter how many times we feed it the same state.
Knowing that, we might consider $V^{(1)}_t$ to be an estimate of $V^\pi(s_t)$ that has a low variance because it only uses a single reward in the calculation, but in exchange it has a high bias because the critic has a lot of influence over the final calculation, and it will almost always be incorrect. In contrast, we might consider $V^{(T-t)}_t$ to be an estimate of $V^\pi(s_t)$ that has a high variance because it uses all the possible reward terms in the calculation, but in exchange it has a low bias because the critic is not used until the very end of the calculation.
It seems then that it might be useful to introduce a way to weight each term $V^{(n)}_t$ by some factor $\lambda$ so we can choose how to balance the trade-off between bias and variance. Schulman et al. do this by computing an exponentially weighted average of the terms instead of the arithmetic mean like we did above. This results in an equation that is remarkably similar to the discounted return, but with $\lambda$ in place of $\gamma$ and $V^{(n)}_t$ in place of $r_t$:

$$V^{\text{target}}_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} V^{(n)}_t$$
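On a finite trajectory the infinite sum has to be cut off somewhere, so the sketch below follows the usual convention of giving all of the leftover weight to the longest available target (which keeps the weights summing to one); as before, `n_step_target` is the hypothetical helper from earlier and `lam` is $\lambda$.

```python
def lambda_target(rewards, states, critic, t, gamma=0.99, lam=0.95):
    """Exponentially weighted average of the n-step targets:
    V^target_t = (1 - lam) * sum_n lam^(n-1) * V^(n)_t,
    truncated so the leftover weight goes to the longest target."""
    num_targets = len(rewards) - t
    target = 0.0
    for n in range(1, num_targets):
        weight = (1.0 - lam) * lam ** (n - 1)
        target += weight * n_step_target(rewards, states, critic, t, n, gamma)
    # The remaining weight, lam^(num_targets - 1), goes to the longest target.
    target += lam ** (num_targets - 1) * n_step_target(rewards, states, critic, t, num_targets, gamma)
    return target
```

With `lam=0` this collapses to the one-step target $V^{(1)}_t$, and with `lam=1` all of the weight lands on the longest, most Monte-Carlo-like target.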
And that's it! Now we can change the value of $\lambda$ to control the trade-off between bias and variance in the value targets. Higher values of $\lambda$ lead to more variance but less bias, and vice versa. The GAE paper does a little more algebra to rearrange the above equation in a way that is easier to compute, but the idea is the same.
Writing the equation for $V^{\text{target}}_t$ in this fashion gives us a single parameterized method of estimating $V^\pi(s_t)$ from a single trajectory that encompasses all the ways we might write $V^\pi(s_t)$. Because of this, it is a general way to estimate the value, so we call it a generalized value estimator, and if we had used the advantage function instead of the value function in this section we would have called it a generalized advantage estimator, like the paper. If you're interested in learning more about GAE, check out the paper by Schulman et al.
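For reference, the rearranged form that is typically implemented works backwards through the trajectory using one-step errors $\delta_t = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)$ and the recursion $A_t = \delta_t + \gamma\lambda A_{t+1}$. The sketch below follows that standard recursion for a single trajectory; the names are mine, and it assumes `values` holds the critic's estimates for every state in the trajectory, including the final one.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages and value targets for one trajectory.

    `rewards[t]` is r_t and `values[t]` is the critic's estimate of V(s_t);
    `values` must have one more entry than `rewards` (for the final state)."""
    advantages = [0.0] * len(rewards)
    next_advantage = 0.0
    for t in reversed(range(len(rewards))):
        # One-step error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
    # A common choice of value target adds the advantages back onto the critic's estimates.
    targets = [a + v for a, v in zip(advantages, values[:-1])]
    return advantages, targets
```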