0. Reinforcement Learning Background
What follows is a series of definitions that may be useful to understand the concepts of reinforcement learning when training an agent in an RLGym environment. Note that these definitions are not meant to be exhaustive, and we will formulate the reinforcement learning setting in a somewhat non-standard way to better align with the environments typically considered by practitioners using RLGym.
1. The Basics
A decision process $\mathcal{M}$, sometimes called an environment, is characterized by a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, and a state transition probability function $\mathcal{T}(s_{t+1} \mid s_t, a_t)$. To interact with an environment, an agent must follow a policy $\pi \in \Pi$, where $\Pi$ denotes the set of admissible policies.
When an agent executes an action $a_t$ in state $s_t$, the environment transitions to a new state $s_{t+1}$ according to $\mathcal{T}(s_{t+1} \mid s_t, a_t)$.
For our purposes it is useful to further consider a set of observations $\mathcal{O}$, which are representations of states that an agent acts upon. The observation function $\omega : \mathcal{S} \to \mathcal{O}$ maps states to observations, $o_t = \omega(s_t)$.
A policy is a function that maps an observation to a real-valued vector, $\pi : \mathcal{O} \to \mathbb{R}^n$.
In practice, a policy typically defines a distribution over actions. For discrete actions, $\pi$ directly outputs the probability mass for each action. For continuous actions, the most common approach is for the policy to parameterize some known distribution (e.g. a Gaussian distribution) rather than try to directly learn the density of the action space; sampling from such a parameterized distribution in a differentiable way is what is known as the reparameterization trick.
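As a concrete illustration, here is a minimal PyTorch sketch (the network sizes and class names are arbitrary, not part of any particular library) of a policy whose output parameterizes a categorical distribution for discrete actions or a Gaussian for continuous actions.

```python
import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    """Maps an observation to logits defining a categorical distribution over actions."""
    def __init__(self, obs_size, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_size, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, obs):
        logits = self.net(obs)
        return torch.distributions.Categorical(logits=logits)

class GaussianPolicy(nn.Module):
    """Maps an observation to the mean of a Gaussian; the log-std is a learned parameter."""
    def __init__(self, obs_size, action_size):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_size, 64), nn.ReLU(), nn.Linear(64, action_size))
        self.log_std = nn.Parameter(torch.zeros(action_size))

    def forward(self, obs):
        mean = self.mean_net(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())
```

For the Gaussian case, drawing samples with `rsample()` expresses each sample as `mean + std * noise`, which is the reparameterization trick in practice: the sample stays differentiable with respect to the policy parameters.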
An action function $f : \mathbb{R}^n \to \mathcal{A}$ is a function that maps the output of a policy to an action.
The complete state-to-action mapping is given by $a_t = f(\pi(\omega(s_t)))$. We will refer to this mapping as an agent.
Following each action $a_t$ in state $s_t$, a reward function generates a scalar reward $r_t = R(s_t, a_t)$.
We refer to a single interaction between the agent and the environment as a timestep, which contains $(s_t, a_t, r_t, s_{t+1})$.
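The following toy sketch ties these pieces together for a single timestep. Everything here is a stand-in for illustration (a trivial integer-valued state, a uniform policy, a hand-written transition and reward), not an RLGym API.

```python
import random

def obs_fn(state):            # omega: state -> observation (identity, for simplicity)
    return state

def policy(obs):              # pi: observation -> real-valued vector (uniform action probabilities)
    return [0.5, 0.5]

def action_fn(policy_out):    # f: policy output -> action (sample an action index)
    return random.choices(range(len(policy_out)), weights=policy_out)[0]

def env_step(state, action):  # T and R: produce the next state and a scalar reward
    next_state = state + (1 if action == 1 else -1)
    reward = 1.0 if next_state == 5 else 0.0
    return next_state, reward

# One timestep of interaction: (s_t, a_t, r_t, s_{t+1})
state = 0
action = action_fn(policy(obs_fn(state)))
next_state, reward = env_step(state, action)
timestep = (state, action, reward, next_state)
```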
1.1 Trajectories and Returns
We are concerned with two types of sequences:
- A trajectory: Any sequence of timesteps from some state $s_t$ to another state $s_{t+n}$.
- An episode: A special case of trajectory that begins with an initial state $s_0$ and ends with a terminal state $s_T$.
We will denote such sequences of timesteps as $\tau$.
These sequences help characterize two types of problems we will deal with. The first case happens when all sequences of actions from any $s_t$ are guaranteed to eventually reach some terminal state $s_T$. We call these environments episodic, or finite-horizon, because interacting with them for long enough is guaranteed to eventually form an episode. The second case happens when there are trajectories that will never end, that is, sequences of timesteps that never reach some $s_T$. We call these environments non-episodic or infinite-horizon because interacting with them forever is not guaranteed to form an episode.
This distinction is important when we consider a return $G_t$, which is the cumulative reward obtained from timestep $t$ onward along a trajectory. There is one return per timestep.
In the simplest episodic case, the return from timestep $t$ can be written as
$$G_t = \sum_{k=t}^{T} r_k = r_t + r_{t+1} + \dots + r_T.$$
For non-episodic or infinite-horizon settings, we introduce a discount factor $\gamma \in [0, 1)$ and define the discounted return from timestep $t$ as
$$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}.$$
Note that for finite-horizon episodic tasks, setting $\gamma = 1$ recovers the undiscounted return.
The discount factor serves two purposes. First, it ensures that the infinite-horizon return converges so long as $\gamma < 1$ by forming a convergent geometric series when the rewards are bounded. Second, it acts as a form of temporal credit assignment by assigning more weight to rewards that were obtained closer to the current timestep $t$.
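As a small worked example, the helper below (a plain-Python sketch) computes the discounted return for every timestep of a finished episode by working backwards through the rewards.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * r_{t+k} for each timestep of an episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: rewards [1, 0, 2] with gamma = 0.9 gives
# G_2 = 2, G_1 = 0 + 0.9*2 = 1.8, G_0 = 1 + 0.9*1.8 = 2.62
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))
```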
1.2 Value Functions
The state value function, often just called the value function, is a function that maps states to the expected return of a policy at that state. It is given by
$$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[ G_t \mid s_t = s \right].$$
This is an important quantity to understand because it captures the quality of a policy at a given state. It should be emphasized that the value function considers only one specific policy, so every time we make even a tiny change to our agent's policy, the value function will change as well. One way to envision the value function is to imagine the agent being dropped into the game at some arbitrary state $s$. The value of the policy at that state is the return it would get on average if we allowed it to play from that state infinitely many times, restarting from the same state each time a terminal state is reached.
The state-action value function, or Q function, is a function that maps states and actions to the expected return of a policy at a state when the agent takes action $a$ at that state, then acts according to $\pi$ forever afterwards. It is given by
$$Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ G_t \mid s_t = s, a_t = a \right].$$
This quantity is similar to $V^{\pi}(s)$, but with the caveat that the agent must first take the action $a$ at state $s$ before acting according to the policy forever afterwards. Note that, in general, we can write $V^{\pi}$ in terms of $Q^{\pi}$ as
$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\left[ Q^{\pi}(s, a) \right].$$
The state-action advantage function, or advantage function, is the difference between the Q function and the value function at a state given an action. This is given by
$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s).$$
Think of the advantage function as a measure of how much better it was to take the action $a$ at state $s$ than it would have been to just act according to the policy at that state.
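To make these relationships concrete, here is a tiny numeric sketch for a single state with three discrete actions (the Q-values and action probabilities are made up for illustration): $V^{\pi}$ is the policy-weighted average of $Q^{\pi}$, and the advantage is the difference.

```python
# Hypothetical values for a single state s with three discrete actions.
q_values = [2.0, 5.0, -1.0]        # Q(s, a) for each action
action_probs = [0.2, 0.5, 0.3]     # pi(a | s)

# V(s) = E_{a ~ pi}[ Q(s, a) ]
v = sum(p * q for p, q in zip(action_probs, q_values))   # 0.4 + 2.5 - 0.3 = 2.6

# A(s, a) = Q(s, a) - V(s)
advantages = [q - v for q in q_values]                    # [-0.6, 2.4, -3.6]
print(v, advantages)
```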
2. The Learning Process
This section will outline the general process of learning a policy from a set of trajectories. The derivation of the policy gradient shown here comes from OpenAI's Spinning Up blog.
Most learning algorithms optimize an objective function $J(\pi)$; for example, $J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ G_0 \right]$ (where $G_0$ denotes the return starting at time $t = 0$). The goal of learning is then to find a policy that maximizes the objective function, i.e. $\pi^* = \arg\max_{\pi \in \Pi} J(\pi)$. For our purposes we care about this expected-return objective. However, there are many options.
The most common way to maximize this function is a process called gradient ascent (though you may have heard of gradient descent, the concepts are the same), which is an iterative process by which a policy is repeatedly adjusted in the direction of the gradient of the objective function, $\nabla J(\pi)$. To do this in practice we will only consider policies that are differentiable functions parameterized by $\theta$, which we will write as $\pi_\theta$.
2.1 The Policy Gradient
This formulation of the policy and objective allows us to write an update rule for the parameters $\theta$ by gradient ascent on $J(\pi_\theta)$,
$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_{\theta_k}),$$
where $\alpha$ is referred to as the learning rate.
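In practice, deep learning libraries implement gradient descent rather than ascent, so this update is usually applied by descending on $-J$. A minimal PyTorch sketch of one update step, with a placeholder objective standing in for the estimated $J(\pi_\theta)$:

```python
import torch

theta = torch.zeros(4, requires_grad=True)      # stand-in for the policy parameters
optimizer = torch.optim.SGD([theta], lr=1e-2)   # lr is the learning rate alpha

# Placeholder objective J(theta); in RL this is estimated from sampled trajectories.
objective = -(theta - 1.0).pow(2).sum()

optimizer.zero_grad()
(-objective).backward()   # minimizing -J is the same as maximizing J
optimizer.step()
```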
To derive $\nabla_\theta J(\pi_\theta)$ we will first rewrite our objective in integral form as
$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ G(\tau) \right] = \int_{\tau} P(\tau \mid \theta)\, G(\tau).$$
Note that to evaluate this integral we need to know $P(\tau \mid \theta)$, which is the probability of a trajectory $\tau$ occurring under $\pi_\theta$. For a trajectory of length $T$ this is given by
$$P(\tau \mid \theta) = \rho_0(s_0) \prod_{t=0}^{T-1} \mathcal{T}(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t),$$
where $\rho_0(s_0)$ is the probability of the trajectory starting in state $s_0$.
That is, the probability of a trajectory given a policy is the product of the initial state probability, the probabilities of each state transition, and the probabilities of each action taken over that trajectory. This would be cumbersome to differentiate on its own, but because $P(\tau \mid \theta)$ specifies a probability function, we can employ the log-derivative trick to get
$$\nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta).$$
This may not seem useful at first, but the logarithm separates that nasty product from earlier into a summation:
$$\log P(\tau \mid \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \Big( \log \mathcal{T}(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t) \Big).$$
This means we can differentiate each term of $\log P(\tau \mid \theta)$ with respect to $\theta$ to get its gradient, and neither $\rho_0(s_0)$ nor $\mathcal{T}(s_{t+1} \mid s_t, a_t)$ is a function of $\theta$, so their contribution to the gradient is zero. Therefore, we can write its gradient as
$$\nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$
Circling back to our goal of finding $\nabla_\theta J(\pi_\theta)$, we have
$$\nabla_\theta J(\pi_\theta) = \nabla_\theta \int_{\tau} P(\tau \mid \theta)\, G(\tau) = \int_{\tau} \nabla_\theta P(\tau \mid \theta)\, G(\tau) = \int_{\tau} P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)\, G(\tau).$$
Finally we can plug the summation above back into this integral form of our objective to get
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G(\tau) \right].$$
Putting this together, we can approximate $\nabla_\theta J(\pi_\theta)$ over a batch of $N$ sampled timesteps as
$$\nabla_\theta J(\pi_\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta \log \pi_\theta(a_n \mid s_n)\, G_n,$$
where $G_n$ is the return from timestep $n$ onward. (Replacing the full-trajectory return $G(\tau)$ with the return from each timestep onward is valid because actions taken at a given time cannot affect rewards received before it.)
Note that $N$ is typically referred to as the batch size. Larger batch sizes will yield more accurate gradient estimates, but will also require more memory and time to compute.
One interesting property of this gradient is that it requires the probability of any action in a batch according to the policy to be non-zero ($\log(0)$ is undefined) and also not equal to one ($\log(1) = 0$). For this estimator of the policy gradient, the policy should be stochastic so that $\log \pi_\theta(a_n \mid s_n)$ is well-defined and informative. Deterministic policies can be handled by different estimators.
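Here is a sketch of this estimator in PyTorch. Given a batch of observations, the actions that were taken, and the return from each timestep, the surrogate loss below has the policy gradient as the negative of its gradient; the `policy` is assumed to return a `torch.distributions` object, as in the earlier sketch.

```python
import torch

def policy_gradient_loss(policy, obs_batch, action_batch, return_batch):
    """Surrogate loss: minimizing it performs gradient ascent on the estimated J."""
    dist = policy(obs_batch)                 # action distribution for each observation
    log_probs = dist.log_prob(action_batch)  # log pi_theta(a_n | o_n)
    # For multi-dimensional Gaussian actions, sum log-probs across action dimensions here.
    # Returns are treated as constants; only log_probs carry gradients.
    return -(log_probs * return_batch).mean()

# Typical usage:
# loss = policy_gradient_loss(policy, obs, actions, returns)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```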
2.2 Baselines and The Critic
When approximating $\nabla_\theta J(\pi_\theta)$ as we derived above, we might run into a problem in settings where returns have a high variance. We would rather find a method to estimate $\nabla_\theta J(\pi_\theta)$ that is less sensitive to the variance of returns. To do this, we will introduce a baseline $b(s_t)$, which is any function of the state (not dependent on the action), into our approximation:
$$\nabla_\theta J(\pi_\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta \log \pi_\theta(a_n \mid s_n)\, \big( G_n - b(s_n) \big).$$
A comprehensive analysis of $b(s_t)$ and what it should be is beyond the scope of this article, but the key concept to understand is that because $b(s_t)$ only depends on the state, it does not change the expectation of our gradient estimate at all:
$$\mathbb{E}_{a_t \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t) \right] = b(s_t)\, \mathbb{E}_{a_t \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] = 0.$$
However, $b(s_t)$ does change the variance of the gradient estimator. It turns out that a near-optimal choice of baseline (and the one we will use going forward) is $b(s_t) = V^{\pi}(s_t)$. One way we can intuit the usefulness of this choice is to think of $G_t$ as how the agent actually performed at the state $s_t$, and $V^{\pi}(s_t)$ as how the agent was expected to perform at that state, so we are weighting the gradient estimate such that it points away from actions that led to below-average performance and towards actions that led to above-average performance.
Now that we are equipped with a baseline, we need to figure out how to compute it. Since $V^{\pi}(s)$ is an expectation, we could try to approximate it by visiting every state $s$ many times and taking the average of the returns we get, but this is obviously impractical for environments with more than a few states. Instead, we will parameterize a function to approximate $V^{\pi}$ and optimize it by least squares regression.
Thankfully, learning a model of $V^{\pi}$ is much easier than learning $\pi$ itself. We will denote our model $V_\phi$ with parameters $\phi$. Our objective is then to learn $\phi$ such that $V_\phi(s_t) \approx V^{\pi}(s_t)$. Because $V^{\pi}(s_t)$ is an expectation of returns according to $\pi$, we can consider each $G_t$ a Monte Carlo sample from the distribution of returns under $\pi$. Therefore, we can approximate $V^{\pi}$ by performing gradient descent on
$$L(\phi) = \frac{1}{N} \sum_{n=1}^{N} \big( V_\phi(s_n) - G_n \big)^2.$$
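A matching critic sketch in PyTorch (sizes and names are illustrative): a small network $V_\phi$ trained by least squares regression toward the sampled returns.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """V_phi: maps an observation to a scalar value estimate."""
    def __init__(self, obs_size):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_size, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def critic_loss(critic, obs_batch, return_batch):
    # Least squares regression of V_phi(s_n) toward the Monte Carlo return G_n.
    return (critic(obs_batch) - return_batch).pow(2).mean()
```

With a trained critic, the baseline-adjusted policy gradient from above can be formed by replacing `return_batch` with `return_batch - critic(obs_batch).detach()` in the earlier surrogate loss.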
2.3 Value Targets and TD($\lambda$)
While what we have done so far is a valid way to compute value targets, we will encounter some issues in practice. Right now, to train our critic we need to traverse a full trajectory in order to compute even one $G_t$. In episodic environments with long episodes this can be impractical, and it is clearly impossible to do in non-episodic environments. To deal with these problems we need to introduce the famous Bellman equation, which describes a recursive relationship between $V^{\pi}(s_t)$ and $V^{\pi}(s_{t+1})$. The Bellman equation rewrites the value function as follows:
$$V^{\pi}(s_t) = \mathbb{E}_{\tau \sim \pi}\left[ G_t \right] = \mathbb{E}_{\tau \sim \pi}\left[ r_t + \gamma G_{t+1} \right],$$
which, so long as the reward function is deterministic, is equal to
$$V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi,\; s_{t+1} \sim \mathcal{T}}\left[ r_t + \gamma V^{\pi}(s_{t+1}) \right].$$
This is the Bellman equation. As you can see, this creates a recursive definition of the value function, where we can compute the value of one state as the sum of the reward at the current state and the discounted value of the next. Further, we can write similar equations for $V^{\pi}(s_t)$ with up to as many terms as there are timesteps in a trajectory. All of the following are equal:
$$V^{\pi}(s_t) = \mathbb{E}\left[ r_t + \gamma V^{\pi}(s_{t+1}) \right] = \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^2 V^{\pi}(s_{t+2}) \right] = \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 V^{\pi}(s_{t+3}) \right] = \cdots$$
When we chose the value function as our baseline earlier, we approximated it by least squares regression on the returns $G_t$. However, because all of these expansions are equal, it would be just as valid to train our critic by regression on any of these expansions of the Bellman equation. To make our lives easier let us rephrase the learning objective for the critic more generally as
$$L(\phi) = \frac{1}{N} \sum_{n=1}^{N} \big( V_\phi(s_n) - V_n^{\text{target}} \big)^2,$$
where $V_t^{\text{target}}$ is the learning target for $V_\phi(s_t)$, which we previously set to $G_t$. Let us now rewrite the above equalities and replace each $V^{\pi}$ with $V_\phi$. For convenience, we will label each expansion of the Bellman equation as $V_t^{(n)}$, where $n$ is the number of reward terms in the calculation, like so:
$$V_t^{(1)} = r_t + \gamma V_\phi(s_{t+1})$$
$$V_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V_\phi(s_{t+2})$$
$$V_t^{(3)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 V_\phi(s_{t+3})$$
$$\vdots$$
The question we are now faced with is which expansion to choose as a replacement for $V_t^{\text{target}}$ in our critic objective. We could choose $V_t^{(1)}$, or $V_t^{(2)}$, or $V_t^{(3)}$, etc., but the key insight behind TD($\lambda$) is that we can take an average over all $k$ available expansions rather than just choose one:
$$V_t^{\text{target}} = \frac{1}{k} \sum_{n=1}^{k} V_t^{(n)}.$$
This would work fine, but another important aspect to consider is the variance in each expansion $V_t^{(n)}$. Because there is variance in the returns, expansions with fewer reward terms will have less variance. However, this also means $V_\phi$ has a larger influence over those expansions, and the critic is an imperfect estimator of $V^{\pi}$, so there will be a higher bias in those targets. These facts outline a trade-off between variance and bias in our value targets; the more reward terms we use, the higher the variance, but the lower the bias. TD($\lambda$) addresses this trade-off by choosing a weighted average of all expansions, where the weight of the $n$-th expansion is proportional to $\lambda^{n-1}$ for some $\lambda \in [0, 1]$:
$$V_t^{\text{target}} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} V_t^{(n)}.$$
Choosing $V_t^{\text{target}}$ in this fashion gives us a single parameterized method of estimating $V^{\pi}(s_t)$ from a single trajectory that encompasses all the ways we might expand the Bellman equation, where $\lambda$ controls the relative amounts of bias and variance in the learning targets.
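The sketch below computes TD($\lambda$) value targets for a single finished trajectory using the equivalent recursive form $V_t^{\text{target}} = r_t + \gamma\big((1-\lambda)V_\phi(s_{t+1}) + \lambda V_{t+1}^{\text{target}}\big)$; setting $\lambda = 0$ recovers the one-step target $V_t^{(1)}$, while $\lambda = 1$ recovers the full bootstrapped return. The function name and argument layout are illustrative, not a specific library API.

```python
def td_lambda_targets(rewards, next_values, gamma=0.99, lam=0.95):
    """
    TD(lambda) value targets for one trajectory.

    rewards:     [r_0, ..., r_{T-1}]
    next_values: critic estimates [V(s_1), ..., V(s_T)]; use 0.0 for a terminal state.
    """
    targets = [0.0] * len(rewards)
    # At the end of the trajectory the only expansion left is the one-step bootstrap.
    next_target = next_values[-1]
    for t in reversed(range(len(rewards))):
        # Interpolate between bootstrapping on the critic and using the longer expansion.
        targets[t] = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * next_target)
        next_target = targets[t]
    return targets
```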