The cartpole problem is a classic reinforcement learning task in which the goal is to balance a pole on a cart that can move in two directions. These movements, combined with the cart's velocity, change the angle of the pole relative to the cart. The objective is simply to keep the pole balanced given the two controls: moving the cart left or right. At each step of an episode, the cart position, pole angle, and cart velocity are collected and processed in some form to determine the action expected to yield the highest reward or lowest penalty. The next action can be computed with various algorithms, two common ones being Reinforce and A2C.

Cartpole

The Reinforce algorithm is a policy-based algorithm, meaning the actor directly optimizes the policy function instead of relying on what is referred to as a value function (Beysolow II, 2019, p. 20). Reinforce tends to converge to solutions more reliably than a value-based algorithm such as Q-learning (Beysolow II, 2019, p. 21). Unfortunately, it suffers from high variance in log probabilities and cumulative reward values, which produces a noisy gradient and therefore a less-than-optimal learning situation (Yoon, 2019). The algorithm collects full sequences of states, actions, and rewards before updating the policy. It relies on Monte Carlo sampling, in which trajectories are sampled randomly to support approximation and uncover potentially unlearned rewards (Yoon, 2019). This also lets the agent sample actions according to the probability distribution defined by the policy.

To summarize, during training an actor using the Reinforce algorithm to solve the cartpole problem iterates through episodes, exploring and collecting information on potential rewards. It balances random decisions against exploiting its environment, which helps it approximate the expected return as it progresses through the episode gathering the optimal rewards. The algorithm uses the collected rewards to adjust the actor's weights, shifting the policy's probability mass toward the actions with the highest reward (Beysolow II, 2019, p. 21).
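The update described above can be sketched in Python. This is a minimal illustration, not the book's implementation: the linear softmax policy, the function names, and the hand-computed gradient are all assumptions made for the example.

```python
import math

# Illustrative Reinforce sketch for a 2-action problem (e.g. cart left/right).
# A linear softmax policy over a small state vector; the log-probability
# gradient is computed by hand rather than with an autodiff library.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_probs(weights, state):
    # One weight vector per action; logit_a = w_a . state
    logits = [sum(w * s for w, s in zip(wa, state)) for wa in weights]
    return softmax(logits)

def discounted_returns(rewards, gamma=0.99):
    # G_t = r_t + gamma * G_{t+1}, accumulated backwards over the episode
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def reinforce_update(weights, episode, lr=0.01, gamma=0.99):
    # episode: list of (state, action, reward) tuples collected first,
    # then used all at once to update the policy (Monte Carlo style).
    states, actions, rewards = zip(*episode)
    returns = discounted_returns(list(rewards), gamma)
    for state, action, g in zip(states, actions, returns):
        probs = policy_probs(weights, state)
        # d/d logit_k of log pi(a|s) is (1[k == a] - pi(k|s))
        for k in range(len(weights)):
            indicator = 1.0 if k == action else 0.0
            coeff = lr * g * (indicator - probs[k])
            weights[k] = [w + coeff * s for w, s in zip(weights[k], state)]
    return weights
```

A positive return increases the probability of the action that was taken, which is exactly the "localize its probability of the highest reward" behavior described above.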

The Q-learning algorithm, on the other hand, is a value-based algorithm. A value-based actor collects information about its surroundings just as a Reinforce actor does, but it uses a different exploration method called the epsilon-greedy algorithm. Here, epsilon is initialized to a value between zero and one that acts as the fraction of the time you want the actor to explore. To decide whether the algorithm should exploit or explore, you draw a random value between zero and one and compare it to epsilon: if the value is less than epsilon, the actor explores the environment; otherwise it exploits. As the algorithm develops a better understanding of the rewards, epsilon is decayed through various schedules to force the actor to exploit more frequently than it explores (Beysolow II, 2019, pp. 59-60).
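The explore/exploit rule and decay schedule can be sketched as follows. The function names and the multiplicative decay rate are illustrative assumptions, not the book's code.

```python
import random

# Illustrative epsilon-greedy selection: draw in [0, 1); below epsilon
# means explore (random action), otherwise exploit the best Q-value.

def epsilon_greedy(q_values, epsilon):
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

def decay_epsilon(epsilon, rate=0.995, floor=0.01):
    # One possible decay schedule (assumed): multiplicative decay toward
    # a small floor, so late episodes mostly exploit.
    return max(floor, epsilon * rate)
```

With epsilon near 1 the actor behaves almost randomly; as epsilon decays toward the floor it almost always takes the action with the highest estimated value.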

Finally, we can speak to the Advantage Actor-Critic (A2C) algorithm. A2C is considered a hybrid of value-based (Q-learning) and policy-based (Reinforce) algorithms, in which two models coexist to create an optimal process. The actor is typically built on a policy-based method like Reinforce, while the critic is built on a value-based method. In the cartpole problem, the actor determines the optimal action by estimating where to move the cart to produce the optimal pole angle. The critic then evaluates the action the actor has taken and provides feedback on how a more optimal action could have been chosen in that state (Beysolow II, 2019, p. 11). This critic feedback is used to update the actor's policy so it makes better decisions in the future. Over time the actor and critic work as a team that refines the weights of the policy.
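The actor/critic interaction described above can be sketched as a single one-step update. This is a simplified tabular illustration under assumed names and learning rates; real A2C implementations typically use neural networks for both models.

```python
import math

# Illustrative one-step actor-critic update for a small discrete problem:
# a tabular critic V and a tabular softmax actor (logits per state).

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(actor_logits, critic_v, s, a, r, s_next, done,
                      gamma=0.99, lr_actor=0.01, lr_critic=0.1):
    # Critic feedback: the TD error serves as the advantage estimate,
    # i.e. "how much better was this action than expected?"
    target = r if done else r + gamma * critic_v[s_next]
    advantage = target - critic_v[s]
    # Critic moves V(s) toward the bootstrapped target.
    critic_v[s] += lr_critic * advantage
    # Actor nudges log pi(a|s) in proportion to the advantage.
    probs = softmax(actor_logits[s])
    for k in range(len(actor_logits[s])):
        indicator = 1.0 if k == a else 0.0
        actor_logits[s][k] += lr_actor * advantage * (indicator - probs[k])
    return advantage
```

A positive advantage makes the chosen action more likely and raises the critic's value estimate; a negative advantage does the opposite, which is the feedback loop the paragraph describes.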

Policy gradient approaches differ from value-based approaches in how the policy is represented, how it is optimized, and in their convergence properties. A policy gradient maps states directly to a probability distribution over actions, with updates driven by the rewards observed for the actions taken in each state. A value-based approach instead focuses on a value function that estimates the return for each action based on cumulative future rewards, calculated from values acquired earlier in training. Both approaches require an exploration strategy. Policy gradients typically steer clear of the overestimation issues that value functions introduce in a value-based Q-learning model (Beysolow II, 2019, p. 20), but they require more samples to converge on an optimal path. Finally, a value-based model spends its time optimizing a value function rather than the parameters of the policy itself.
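To make the value-based side of this comparison concrete, here is the standard tabular Q-learning update; the parameters being optimized are Q-values rather than a policy. The function name and learning-rate defaults are illustrative assumptions.

```python
# Illustrative tabular Q-learning update:
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
# The bootstrapped max over next-state values is also the source of the
# overestimation bias mentioned above.

def q_learning_update(q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    target = r if done else r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]
```

Note that nothing here represents action probabilities: the greedy policy is only implied by taking the argmax over these learned values.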

Actor-Critic (A2C) and purely value- or policy-based approaches differ primarily in how many models participate in learning. In a policy- or value-based approach, a single model updates the weights used for decision-making, but in an A2C model there are two active models, each with a different approach. A2C explores efficiently because the value-based critic guides the policy-driven actor, whereas a value-based or policy-based approach alone may suffer from over-exploitation or high variance. Since A2C models can explore more effectively, they typically converge on an optimal path faster than a value- or policy-based approach alone (Beysolow II, 2019, p. 38). Value-based and policy-based approaches sit at opposite ends of the spectrum, and when combined they complement each other's shortcomings.

References

- Beysolow II, T. (2019). Applied Reinforcement Learning with Python. Apress.
- Yoon, C. (2019, February 6). Understanding Actor Critic Methods and A2C. Retrieved from Towards Data Science: https://web.archive.org/web/20200526014757/https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f?gi=a701bcc17ce