Mastering Robotic Manipulation with Reinforcement Learning: TQC and DDPG for Fetch Environments

Michael Kudlaty
January 1, 2025

Introduction

Robotic manipulation remains one of the most challenging frontiers in robotics research. Tasks like pushing, sliding, and picking and placing objects require complex coordination and decision-making in dynamic environments.  Reinforcement Learning (RL) has emerged as a powerful tool for learning these skills, enabling robots to learn through trial and error, without explicit programming for every scenario.

This blog post dives deep into using two off-policy RL algorithms, Truncated Quantile Critics (TQC) and Deep Deterministic Policy Gradient (DDPG), to tackle the Fetch environments provided by Gymnasium-Robotics: FetchPush-v2, FetchSlide-v2, and FetchPickAndPlace-v2. We'll explore the mathematical foundations of these algorithms, discuss their strengths and weaknesses, and provide code snippets to illustrate key concepts.

The Fetch Environments: A Testbed for Robotic Manipulation

The Fetch environments simulate a Fetch Mobile Manipulator robot tasked with interacting with a box on a table. Each environment presents a unique challenge:

  • FetchPush-v2: The robot must push a box to a target location on the table.
  • FetchSlide-v2: The robot must hit a box so that it slides to a target location that lies beyond the robot's reach, much farther away than in the pushing task.
  • FetchReach-v2: The robot must move its gripper to a target location in space.
  • FetchPickAndPlace-v2: The robot must pick up the box from the table and move it to a target location in 3D space.

Observations: Each environment uses a dictionary observation space with three keys:

  • observation: an array containing the gripper's position and velocity, the object's position and rotation, and the object's linear and angular velocities.
  • achieved_goal: the current position of the object (or of the gripper in FetchReach).
  • desired_goal: the target position the object (or gripper) should reach.

Actions: The action space is continuous and 4-dimensional: the first three components command the Cartesian displacement of the gripper (dx, dy, dz), and the fourth opens or closes the gripper.

Rewards: Sparse rewards are used. The agent receives -1 at every timestep while the object has not reached the goal, and 0 once it has (a short interaction example follows).
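
To make this concrete, below is a minimal sketch of interacting with one of these environments and inspecting the dictionary observation and the sparse reward (it assumes gymnasium and gymnasium-robotics are installed):

import gymnasium as gym
import gymnasium_robotics  # importing this package registers the Fetch environments

env = gym.make("FetchPush-v2")
obs, info = env.reset(seed=0)

# The observation is a dictionary with three keys
print(obs["observation"].shape)  # kinematic state of the gripper and the object
print(obs["achieved_goal"])      # current object position
print(obs["desired_goal"])       # target position

# Actions are 4-dimensional: gripper displacement (dx, dy, dz) and open/close command
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
print(reward)  # -1.0 until the object is within the goal threshold, then 0.0

env.close()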

DDPG: Learning Continuous Control with Deterministic Policies

Deep Deterministic Policy Gradient (DDPG) is an off-policy, actor-critic algorithm designed for continuous action spaces. It extends the ideas of Q-learning to the continuous domain by combining a deterministic policy (actor) with a Q-function approximator (critic).

Mathematical Underpinnings of DDPG

  1. Deterministic Policy: Unlike stochastic policies that output a probability distribution over actions, DDPG learns a deterministic policy, μ(s), that directly maps a state s to a specific action a. This is represented by a neural network (the actor).
  2. Q-function (Critic): DDPG utilizes a Q-function, Q(s,a), to estimate the expected return of taking action a in state s and following the policy μ thereafter. This is also approximated by a neural network (the critic).
  3. Bellman Equation: The critic is trained to satisfy the Bellman equation:

     $Q(s,a) = r(s,a) + \gamma Q(s', \mu(s'))$

     where:
     • r(s,a) is the immediate reward received after taking action a in state s.
     • γ is the discount factor, balancing immediate and future rewards.
     • s′ is the next state.
  4. Target Networks: DDPG uses target networks for both the actor and critic to stabilize training. These are time-delayed copies of the main networks whose parameters are updated slowly. The target Q-function, Q′, and target policy, μ′, are used to compute the target value:

     $y = r(s,a) + \gamma Q'(s', \mu'(s'))$

  5. Loss Function (Critic): The critic is trained by minimizing the mean squared Bellman error (MSBE):

     $L(\theta^Q) = \mathbb{E}\left[(y - Q(s,a))^2\right]$

     where θ^Q are the critic network parameters.
  6. Policy Gradient (Actor): The actor is updated by following the deterministic policy gradient, which aims to maximize the expected return:

     $\nabla_{\theta^\mu} J \approx \mathbb{E}\left[\nabla_a Q(s,a)\big|_{a=\mu(s)} \, \nabla_{\theta^\mu} \mu(s)\right]$

     where θ^μ are the actor network parameters. This is essentially saying "adjust the policy parameters in the direction that increases the Q-value."

Code Snippet (DDPG Critic Update):

Python

# Sample a mini-batch of transitions from the replay buffer
state, action, reward, next_state, done = replay_buffer.sample(batch_size)

# Compute the target Q-value using the target actor and target critic
with torch.no_grad():
    target_action = target_actor(next_state)
    target_Q = target_critic(next_state, target_action)
    y = reward + (1 - done) * gamma * target_Q

# Critic loss: mean squared Bellman error
critic_loss = torch.mean((y - critic(state, action)) ** 2)

# Update the critic
critic_optimizer.zero_grad()
critic_loss.backward()
critic_optimizer.step()
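
The snippet above covers only the critic. For completeness, a hedged sketch of the remaining DDPG updates, the actor step and the Polyak (soft) target-network updates, might look like the following (it assumes the same actor, critic, target networks, and optimizers as above, plus a soft-update coefficient tau):

# Actor update: maximize Q(s, mu(s)) by minimizing its negative
actor_loss = -critic(state, actor(state)).mean()

actor_optimizer.zero_grad()
actor_loss.backward()
actor_optimizer.step()

# Soft (Polyak) update of the target networks
with torch.no_grad():
    for param, target_param in zip(critic.parameters(), target_critic.parameters()):
        target_param.data.mul_(1 - tau).add_(tau * param.data)
    for param, target_param in zip(actor.parameters(), target_actor.parameters()):
        target_param.data.mul_(1 - tau).add_(tau * param.data)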

TQC: Enhancing Stability and Performance with Quantile Regression

Truncated Quantile Critics (TQC) builds on the same off-policy actor-critic framework (its original formulation uses Soft Actor-Critic as the base algorithm) and addresses the overestimation bias commonly found in Q-learning methods. It achieves this by using quantile regression to learn a distribution over returns instead of a single point estimate.

Mathematical Underpinnings of TQC

  1. Quantile Regression: Instead of estimating the expected return, TQC learns a distribution over returns represented by a set of quantiles. A quantile τ represents the value such that the probability of the return being less than that value is τ.
  2. Quantile Regression Loss: The quantile regression loss, also known as the Huber quantile loss, is used to train the critic to estimate these quantiles. For each quantile τ_i, the loss is defined as

     $\rho^{\kappa}_{\tau_i}(u) = \left|\tau_i - \delta_{\{u<0\}}\right| \mathcal{L}_\kappa(u)$

     where:
     • $\mathcal{L}_\kappa(u)= \begin{cases}\frac{1}{2} u^2 & \text { if }|u| \leq \kappa \\ \kappa\left(|u|-\frac{1}{2} \kappa\right) & \text { otherwise }\end{cases}$ is the Huber loss.
     • $\delta_{\{u<0\}}$ is the indicator that equals 1 when u < 0 and 0 otherwise.
     • κ is a hyperparameter that determines the point where the loss transitions from squared error to absolute error.
     • $u = y - Q_{\tau_i}(s,a)$ is the temporal-difference error for the quantile τ_i.
     This loss is asymmetric: when τ_i > 0.5 it penalizes underestimation (u > 0) more heavily, and when τ_i < 0.5 it penalizes overestimation (u < 0) more heavily. For example, at τ_i = 0.9 an underestimate is weighted by 0.9 while an overestimate is weighted by only 0.1, pulling the estimate toward the 90th percentile of the return distribution.
  3. Multiple Critics: TQC uses an ensemble of critics, each learning a distribution over returns. This further reduces overestimation bias and improves stability.
  4. Truncation: To further combat overestimation, the quantile estimates from all critics are pooled, sorted, and only the smallest ones are kept when computing the target value; the largest quantiles, which carry most of the overestimation, are discarded. This truncation of the upper tail of the return distribution is what gives the algorithm its name.

Code Snippet (TQC Critic Update with truncation):

# Sample a mini-batch of transitions from the replay buffer
state, action, reward, next_state, done = replay_buffer.sample(batch_size)

# N: number of quantiles per critic, num_critics: size of the ensemble,
# k: number of quantiles kept after truncation (k < num_critics * N)
with torch.no_grad():
    # Target action from the target actor
    target_action = target_actor(next_state)

    # Target quantiles from each target critic: (batch_size, num_critics, N)
    target_quantiles = torch.stack(
        [target_critic(next_state, target_action) for target_critic in target_critics],
        dim=1,
    )

    # Pool the quantiles from all critics and sort them: (batch_size, num_critics * N)
    pooled_quantiles, _ = torch.sort(target_quantiles.reshape(batch_size, -1), dim=1)

    # Truncate: keep only the k smallest quantiles, discarding the overestimated upper tail
    truncated_quantiles = pooled_quantiles[:, :k]  # (batch_size, k)

    # Bellman target for each kept quantile: (batch_size, k)
    target = reward.unsqueeze(-1) + (1 - done).unsqueeze(-1) * gamma * truncated_quantiles

# Current quantile estimates from each critic: (batch_size, num_critics, N)
critic_quantiles = torch.stack([critic(state, action) for critic in critics], dim=1)

# Pairwise TD errors between every target atom and every predicted quantile:
# (batch_size, num_critics, N, k)
u = target.unsqueeze(1).unsqueeze(1) - critic_quantiles.unsqueeze(-1)

# Quantile midpoints tau_i = (i - 0.5) / N: (1, 1, N, 1)
tau = (torch.linspace(0, 1, N + 1, device=device)[1:] - 0.5 / N).view(1, 1, -1, 1)

# Huber quantile loss, averaged over atoms and summed over critics
huber_loss = torch.where(u.abs() <= kappa, 0.5 * u.pow(2), kappa * (u.abs() - 0.5 * kappa))
quantile_loss = (tau - (u < 0).float()).abs() * huber_loss  # (batch_size, num_critics, N, k)
critic_loss = quantile_loss.mean(dim=(0, 2, 3)).sum()

# Update all critics (their parameters are assumed to share one optimizer)
critic_optimizer.zero_grad()
critic_loss.backward()
critic_optimizer.step()
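
The actor update is not shown above. In TQC the policy is trained to maximize the critics' estimate of the return, taken as the mean over all quantiles of the ensemble (the original, SAC-based formulation also adds an entropy bonus, which is omitted in this hedged, DDPG-style sketch):

# Actor update: maximize the mean of all quantile estimates from all critics
new_action = actor(state)
all_quantiles = torch.stack([critic(state, new_action) for critic in critics], dim=1)
actor_loss = -all_quantiles.mean()

actor_optimizer.zero_grad()
actor_loss.backward()
actor_optimizer.step()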

Advantages of TQC over DDPG

  • Reduced Overestimation Bias: TQC's use of quantile regression and truncation effectively mitigates the overestimation bias that can plague DDPG.
  • Improved Stability: Learning a distribution over returns and using an ensemble of critics leads to more stable training and performance.
  • Better Exploration: The distributional perspective can potentially encourage better exploration, as the agent is aware of the uncertainty in its value estimates.

Putting it all Together: Training TQC and DDPG on Fetch Environments

To train a TQC or DDPG agent on the Fetch environments, we need to:

  1. Define the Actor and Critic Networks: These can be multi-layer perceptrons (MLPs) with appropriate input and output dimensions based on the observation and action spaces. For TQC, we need multiple critic networks.
  2. Implement the Replay Buffer: This stores past experiences for off-policy learning.
  3. Implement the Training Loop: This involves sampling from the replay buffer, computing target values, updating the critic(s) and actor, and updating the target networks (a bare-bones sketch follows this list).
  4. Hyperparameter Tuning: Experiment with different learning rates, network architectures, batch sizes, number of quantiles (for TQC), etc., to find optimal settings.
  5. Evaluation: Periodically evaluate the agent's performance by running it in the environment and tracking the average reward.
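
As a rough illustration of steps 2 and 3, a bare-bones replay buffer and training loop could be structured as follows. This is only a sketch: agent, flatten_obs, total_steps, warmup_steps, and batch_size are placeholders for whatever agent class and preprocessing you build around the update rules shown earlier.

import random
from collections import deque

import numpy as np
import torch

# Step 2: a toy replay buffer that stores transitions and samples random mini-batches
class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.asarray, zip(*batch))
        return tuple(torch.as_tensor(x, dtype=torch.float32)
                     for x in (states, actions, rewards, next_states, dones))

# Step 3: training loop skeleton
replay_buffer = ReplayBuffer()
obs, info = env.reset()
for step in range(total_steps):
    action = agent.act(flatten_obs(obs))
    next_obs, reward, terminated, truncated, info = env.step(action)
    replay_buffer.add(flatten_obs(obs), action, reward, flatten_obs(next_obs), float(terminated))
    obs = next_obs
    if terminated or truncated:
        obs, info = env.reset()
    if step >= warmup_steps:
        agent.update(replay_buffer.sample(batch_size))  # critic, actor, and target updates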

Example of how to use the TQC implementation from SB3-Contrib (the Stable-Baselines3 extensions package) in the FetchReach environment:

import gymnasium as gym
import gymnasium_robotics  # registers the Fetch environments
from sb3_contrib import TQC
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("FetchReach-v2", render_mode="rgb_array")

# Create a model
model = TQC(
    "MultiInputPolicy",
    env,
    policy_kwargs={"net_arch": [512, 512, 512]},
    learning_starts=1000,
    target_update_interval=10,
    train_freq=(1, "step"),
    gradient_steps=1,
    verbose=1,
)

# Evaluate the model before training
print("Evaluation before training:")
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

# Train the agent for 10000 steps
model.learn(total_timesteps=10000)

# Evaluate the trained agent
print("Evaluation after training:")
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

env.close()
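
For the harder sparse-reward tasks (FetchPush, FetchSlide, FetchPickAndPlace), combining TQC with Hindsight Experience Replay (touched on again in the conclusion) typically makes a large difference. A hedged sketch using Stable-Baselines3's HerReplayBuffer, with illustrative rather than tuned settings:

import gymnasium as gym
import gymnasium_robotics  # registers the Fetch environments
from sb3_contrib import TQC
from stable_baselines3 import HerReplayBuffer

env = gym.make("FetchPickAndPlace-v2")

model = TQC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,                  # relabel each transition with 4 hindsight goals
        goal_selection_strategy="future",  # pick goals from later states of the same episode
    ),
    learning_starts=1000,
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
env.close()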

Conclusion

TQC and DDPG are powerful RL algorithms capable of solving complex robotic manipulation tasks like those presented in the Fetch environments. Understanding their mathematical underpinnings is crucial for effectively applying and tuning these algorithms. By combining the strengths of deterministic policies, quantile regression, and careful optimization techniques, we can train robots to perform intricate manipulations, bringing us closer to a future where robots seamlessly interact with the world around them.

This blog post provides a starting point for your journey into the world of RL for robotics. Further exploration can involve implementing these algorithms from scratch, experimenting with different network architectures, exploring more advanced techniques like Hindsight Experience Replay (HER), and applying these methods to real-world robotic platforms. The possibilities are vast, and the field is continuously evolving, offering exciting opportunities for innovation and discovery.

Updated On:
January 14, 2025