Monte-Carlo Learning

Monte Carlo methods are a family of algorithms that estimate an unknown quantity through random sampling. In reinforcement learning, they estimate the value of a state or state-action pair by playing a game or interacting with an environment over many episodes and observing the rewards obtained. The idea is to average the returns collected over a large number of episodes to approximate the true value. Monte Carlo methods require no model of the environment and do not rely on the Markov property, so they can be applied in situations where other methods may not be.

The basic intuition behind the Monte Carlo (MC) method is to learn from experience. Learning through sequences of states, actions, and rewards is key. Let’s consider an agent in state s1, taking action a1, receiving reward r1, and transitioning to state s2. This entire sequence is an experience.

The Bellman equation expresses the value of a state as the expected immediate reward plus the value of the next state. When we lack knowledge of the environment dynamics (the transition probabilities), we cannot compute this expectation directly. We can, however, still learn from what we do observe: the rewards. By sampling complete episodes and averaging the returns obtained after visiting a state, our estimate converges towards the true value as the number of episodes grows. This is the essence of Monte Carlo: learning through the average of sampled returns.

Basic Intuition of Monte-Carlo

  • Calculating Incremental Means
  • Evaluating a policy in model-free environments (policy evaluation)
  • The problem of exploration
  • On-policy policy improvement (control)
  • What is GLIE Monte-Carlo?
  • Importance Sampling
  • Off-policy policy improvement (control)

Calculating Incremental Means

To calculate the value function of a state (s), we average the returns obtained after visiting that state. However, instead of storing every return and dividing the sum by the total number of visits at the end, which is computationally expensive and memory-intensive, we can use incremental means (running averages). Incremental means allow us to update the estimate as each new return is encountered, without waiting for all returns to be collected.

Let’s derive the formula for incremental means step-by-step:

  1. The mean of the first n returns is:U(n) = (x(1) + x(2) + … + x(n)) / n
  2. Split off the most recent return:U(n) = (1 / n) * (x(n) + x(1) + … + x(n-1))
  3. The remaining summation is (n-1) times the mean up to step (n-1), which is U(n-1). Substituting this in the equation:U(n) = (1 / n) * (x(n) + (n-1) * U(n-1))
  4. Expanding the product:U(n) = U(n-1) + (1 / n) * x(n) – (1 / n) * U(n-1)
  5. Factoring out (1 / n):U(n) = U(n-1) + (1 / n) * (x(n) – U(n-1))

The final equation for incremental means tells us that the new mean U(n) is the previous estimate of the mean U(n-1) plus the difference between the newly obtained return x(n) and that previous estimate. This difference is called the error term, and it is weighted by a step size of 1/n, the reciprocal of the number of times the state has been visited across episodes. By updating the mean in the direction of this error term, we move the estimate towards the new mean without recomputing the full average.
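The derivation above can be checked with a minimal sketch (the function name is mine):

```python
def incremental_mean(values):
    """Running mean via the update U(n) = U(n-1) + (1/n) * (x(n) - U(n-1))."""
    mean = 0.0
    for n, x in enumerate(values, start=1):
        mean += (x - mean) / n  # nudge the estimate towards the new sample
    return mean

samples = [2.0, 4.0, 6.0, 8.0]
print(incremental_mean(samples))  # 5.0, identical to sum(samples) / len(samples)
```

Only the current estimate and the visit count need to be stored, which is what makes this form attractive for long-running learning.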

In summary, incremental means provide a computationally efficient way to update the mean estimate as new rewards are encountered. Now, let’s delve deeper into how we evaluate/predict a policy in Monte Carlo (MC) methods.

Policy evaluation

In policy evaluation with Monte Carlo (MC) methods in model-free environments, the value function is estimated by averaging the returns observed over episodes. Once we have the value function, we can construct/extract a policy from it.

To extract a policy from the state-value function, we maximize the state values. However, this approach faces two problems in model-free environments: we don’t know the transition probabilities (Pss’) and the next state (s’).

Alternatively, we can extract a policy from the state-action function (also known as the Q-function). The policy is obtained by maximizing the Q-values, which are the state-action values. By maximizing over the actions, we choose the action that yields the best Q-value. Therefore, for MC methods, we use the state-action values for both policy evaluation and improvement steps.

To calculate state-action values, we follow the same approach as with state values. The difference is that we now average returns for each state-action pair (s, a) instead of just for states (s). A state-action pair (s, a) is considered visited only if we have visited state s and taken action a from it. There are two variants of MC methods, depending on which visits contribute to the average: First-visit MC and Every-visit MC.

In First-visit MC, we average only the returns that follow the first visit to the (s, a) pair in each episode. In Every-visit MC, we average the returns that follow every visit to the (s, a) pair within an episode.
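A minimal sketch of first-visit MC prediction for Q-values, assuming episodes are given as lists of (state, action, reward) tuples (a hypothetical format; the function name is mine):

```python
from collections import defaultdict

def first_visit_mc_q(episodes, gamma=1.0):
    """For each episode, average the return that follows the FIRST
    occurrence of each (state, action) pair."""
    totals = defaultdict(float)  # sum of first-visit returns per (s, a)
    counts = defaultdict(int)    # number of first visits per (s, a)
    for episode in episodes:
        # index of the first visit of each pair in this episode
        first = {}
        for t, (s, a, _) in enumerate(episode):
            first.setdefault((s, a), t)
        # walk backwards to accumulate the return G_t
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = r + gamma * G
            if first[(s, a)] == t:  # only the first visit contributes
                totals[(s, a)] += G
                counts[(s, a)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}

episode = [("s1", "a1", 1.0), ("s2", "a1", 0.0), ("s1", "a1", 2.0)]
Q = first_visit_mc_q([episode])
```

Every-visit MC is the same sketch with the `first[(s, a)] == t` check removed, so that every occurrence of the pair adds its following return to the average.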

By averaging over state-action values, we obtain the mechanism for both policy evaluation and improvement. However, this approach faces the challenge of exploration. To address this, additional techniques such as epsilon-greedy or softmax exploration can be employed to ensure a balance between exploitation and exploration during the learning process.

The problem of exploration

Monte Carlo (MC) control resembles solving Markov decision processes with dynamic programming. It employs the concept of Generalized Policy Iteration (GPI): rather than estimating the optimal policy directly, we maintain estimates of both the action-value function and the policy. In a continuous process, the action-value function is refined to better evaluate the current policy, and the policy is improved with respect to the current action-value function.

The process of MC control can be summarized as follows:

  1. Initialization: Start with an initial policy, denoted as π.
  2. Policy Evaluation: Utilize the current policy π to calculate the action values. This step involves estimating the expected return for each state-action pair.
  3. Policy Improvement: Once the action values have been computed (through policy evaluation), update the policy by acting greedily with respect to these action values. The aim is to construct a new policy, denoted as π′, which is better than or equal to the original policy π.
  4. Iteration: Repeat steps 2 and 3. Alternating between policy evaluation and policy improvement gradually leads to convergence to an optimal policy.

By following this iterative process, Monte Carlo control allows us to refine the policy and action-value function, ultimately enabling the discovery of an optimal policy for the given problem.
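The loop above can be sketched on a toy one-state problem (all names and the environment are hypothetical, chosen only to make the evaluation/improvement alternation visible):

```python
import random

# Toy one-state episodic task: from state 0, action 1 ends the episode
# with reward +1, action 0 ends it with reward 0.
def step(state, action):
    return 1.0 if action == 1 else 0.0

def mc_control(num_episodes=500, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(0, 0): 0.0, (0, 1): 0.0}
    counts = {(0, 0): 0, (0, 1): 0}
    greedy = 0                      # 1. initialization: start with some policy
    for _ in range(num_episodes):
        # 2. policy evaluation: one episode under an epsilon-greedy policy
        action = greedy if rng.random() > epsilon else rng.choice([0, 1])
        G = step(0, action)         # the episode's return
        counts[(0, action)] += 1
        Q[(0, action)] += (G - Q[(0, action)]) / counts[(0, action)]
        # 3. policy improvement: act greedily with respect to the new Q
        greedy = max((0, 1), key=lambda a: Q[(0, a)])
    return greedy, Q

policy, Q = mc_control()
```

Even though each evaluation step uses only a single episode, the alternation steers the policy towards the action with the higher value, which is the GPI idea in miniature.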

On-policy policy improvement

In the on-policy control method, we aim to improve the policy used for evaluating the Q-function. In this approach, we evaluate a state-action value (Q-function) for a given state s using a policy π and then proceed to improve the same policy π through policy control. The policies employed in the on-policy method typically follow the epsilon-greedy strategy. Now, let’s delve into what an epsilon-greedy policy entails.

As discussed earlier, the desired property for Monte Carlo (MC) methods is π(a|s) > 0 for all states and actions; such a policy is referred to as an epsilon-soft policy. The policy utilized in on-policy control is a particular epsilon-soft policy known as the epsilon-greedy policy. It selects the greedy action with probability 1 − ϵ + ϵ/|A(s)|, where |A(s)| is the number of actions available in state s, and selects each other action with probability ϵ/|A(s)|. In summary, the epsilon-greedy policy ensures exploration by occasionally selecting non-greedy actions while predominantly favoring the greedy one.
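These probabilities can be written down directly; a small sketch (the function name is mine):

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over a list of
    Q-values: every action gets epsilon/|A(s)| as a base, and the greedy
    action gets an extra 1 - epsilon on top."""
    n = len(q_values)
    probs = [epsilon / n] * n
    greedy = max(range(n), key=lambda a: q_values[a])
    probs[greedy] += 1.0 - epsilon  # greedy: 1 - eps + eps/|A(s)|
    return probs

probs = epsilon_greedy_probs([0.1, 0.5, 0.2], epsilon=0.3)
print(probs)  # greedy action 1 gets 1 - 0.3 + 0.3/3, the others 0.3/3 each
```

Note that the probabilities always sum to 1 and every action keeps a non-zero probability, which is exactly the epsilon-soft property required above.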

To further refine the policy, we can adopt a policy-improvement step that entails selecting the action with the maximum Q-value for a given state, as we have previously discussed:

Policy improvement step: Choose the action that maximizes the Q-value for a given state.

This policy-improvement step, combined with the ideas of Generalized Policy Iteration (GPI), involves iterative evaluation and improvement until optimality is achieved. The following diagram illustrates the cycle of evaluation and improvement:

Cycle of evaluation and improvement in GPI.

It is important to note that GPI does not necessarily require full policy evaluation in a single pass. Instead, we can evaluate a policy for one episode and then act greedily based on the evaluated policy. The aim is to nudge the action values and policy closer to the optimal values. This concept is depicted in the diagram below:

GPI approach to policy evaluation and improvement.

Considering all the mechanisms we have discussed, an on-policy algorithm called Greedy in the Limit with Infinite Exploration (GLIE) can be presented.

What is GLIE Monte-Carlo?

The following algorithm represents the culmination of our previous discussions on the ideal policy formulation for Monte Carlo (MC):

  1. Infinite Exploration: We ensure that every state-action pair is visited an infinite number of times by continuously exploring.
  2. Gradual Transition to Greedy Policy: As we progress, the algorithm gradually shifts towards a more deterministic policy. This transition is accomplished by setting appropriate values for ϵ in the ϵ-greedy policy.

Determining the right values for ϵ:

  • Initial Policy: To promote exploration, we initialize the policy with random values. This encourages a higher exploration rate, which can be achieved by setting ϵ close to 1.
  • Later Time-Steps: To favor a more deterministic policy, we set ϵ close to 0.

How does GLIE MC set the ϵ value?

In GLIE MC, the value of ϵ is systematically reduced over time. Initially, ϵ is set to 1.0, and then it is gradually decreased, for example after each episode. As ϵ approaches 0, the policy becomes greedier, prioritizing exploitation over exploration. By reducing ϵ over many time steps while every state-action pair continues to be visited, GLIE Monte Carlo promotes convergence to an optimal policy.
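One schedule commonly used to satisfy the GLIE conditions (a sketch; other decaying schedules work too) sets ϵ to the reciprocal of the episode index:

```python
def glie_epsilon(k):
    """Epsilon for episode k under the schedule epsilon_k = 1/k:
    exploration never fully stops, but vanishes in the limit,
    so the policy becomes greedy as k grows."""
    return 1.0 / k

print([glie_epsilon(k) for k in (1, 10, 100, 1000)])  # 1.0, 0.1, 0.01, 0.001
```

Early episodes are almost fully random (ϵ close to 1), matching the "initial policy" bullet above, while late episodes are almost fully greedy (ϵ close to 0).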

Next, let’s delve into importance sampling, a crucial component of off-policy MC.

Importance Sampling

Importance sampling is a technique used in off-policy reinforcement learning, where two policies are involved: a behavior policy that explores the environment, and a target policy whose value function we want to learn from the behavior policy's experience. The objective is to estimate expectations under the target policy distribution π(a|s) using samples drawn from the behavior policy distribution b(a|s); importance sampling makes this possible.

There are two ways to use importance sampling to estimate the state-value function of the target policy: Ordinary Importance Sampling and Weighted Importance Sampling.

Ordinary Importance Sampling scales each return obtained from the behavior policy by its importance sampling ratio and divides the sum by the number of times the state was visited, i.e., it takes a simple average of the ratio-weighted returns.

On the other hand, Weighted Importance Sampling is a weighted average of the returns obtained from the behavior policy, with the importance sampling ratios as the weights. It is biased: for a single return, the ratio cancels between the numerator and denominator, so its expectation is Vb rather than Vπ (the bias shrinks as more returns are collected). Ordinary Importance Sampling is unbiased, but its variance is unbounded because the ratios can be arbitrarily large; the variance of Weighted Importance Sampling is bounded.
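The two estimators differ only in the denominator; a minimal sketch for a single state (the data format and function name are mine):

```python
def is_estimates(samples):
    """Ordinary vs weighted importance sampling for one state.
    samples: list of (rho, G) pairs, where rho is the importance sampling
    ratio accumulated along a trajectory and G is that trajectory's return."""
    n = len(samples)
    ratio_weighted = [rho * G for rho, G in samples]
    # simple average of ratio-weighted returns: unbiased, unbounded variance
    ordinary = sum(ratio_weighted) / n
    # ratio-weighted average: biased, but bounded variance
    total_rho = sum(rho for rho, _ in samples)
    weighted = sum(ratio_weighted) / total_rho
    return ordinary, weighted

ordinary, weighted = is_estimates([(2.0, 1.0), (0.5, 4.0)])
```

With a single sample the weighted estimate reduces to the raw return G (the ratio cancels), which is where its bias comes from.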

Off-policy policy improvement

As mentioned earlier, our goal is to estimate the target policy's values from the behavior policy's experience. As in ordinary Monte Carlo (MC) methods, we estimate the value of a state by averaging returns; in the off-policy case, each return is first scaled by the importance sampling ratio, the product over the episode of π(a_t|s_t) / b(a_t|s_t).

Where is this ratio used?

Since our objective is to estimate the value function Vπ for the target policy π, we use this ratio to transform the returns obtained from the behavior policy b. This allows us to estimate the value function Vπ based on the returns obtained from a different policy.

Off-policy Monte Carlo Control

To enhance the target policy using the returns obtained from the behavior policy, we employ off-policy Monte Carlo control. The target policy and the behavior policy are independent of each other. While the target policy can be deterministic and prioritize exploitation based on the Q-function, the behavior policy is designed to explore the environment. To achieve this, we need to ensure that we collect sufficient returns from the behavior policy for each state-action pair. The behavior policy is typically “soft,” meaning that every state-action pair has a non-zero probability of being selected, ensuring that each state-action pair is visited an infinite number of times.

Additionally, by utilizing incremental means, we can update the value function progressively towards the values of the target policy: after each episode, the running estimate is nudged towards the new return, with each return weighted by its importance sampling ratio. In practice, this update uses the weighted importance sampling form, with the cumulative sum of the ratios playing the role of the visit count.
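The incremental weighted importance sampling update described above can be sketched as follows (argument names are mine; the episode format and helper policies are assumptions). The episode is processed backwards so the ratio can be accumulated one step at a time:

```python
from collections import defaultdict

def weighted_is_control_update(Q, C, episode, target_greedy, b_prob, gamma=1.0):
    """One episode of incremental weighted importance sampling for
    off-policy MC control.
    episode: [(s, a, r), ...] generated by the behavior policy b.
    target_greedy(s): the deterministic target policy's action in s.
    b_prob(s, a): probability that b selects a in s.
    C accumulates the cumulative ratio weights used by the running mean."""
    G, W = 0.0, 1.0
    for s, a, r in reversed(episode):
        G = r + gamma * G
        C[(s, a)] += W
        # incremental mean, with W/C in place of 1/n
        Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
        if a != target_greedy(s):
            break          # the target policy never takes a: the ratio is 0
        W /= b_prob(s, a)  # multiply in pi(a|s)/b(a|s) = 1/b(a|s)
    return Q, C

Q, C = defaultdict(float), defaultdict(float)
Q, C = weighted_is_control_update(Q, C, [("s", 1, 2.0)],
                                  target_greedy=lambda s: 1,
                                  b_prob=lambda s, a: 0.5)
```

Because the target policy here is deterministic and greedy, the update can stop as soon as the behavior policy takes an action the target policy would not, since all earlier steps would receive zero weight.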