# TQC (Truncated Quantile Critics)
TQC is an extension of SAC (Soft Actor-Critic) that uses distributional reinforcement learning with quantile regression to control overestimation bias in the Q-function.
Paper: [Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics](https://arxiv.org/abs/2005.04269) (Kuznetsov et al., 2020)
## How it Works

TQC trains n_critics independent critic networks (default: 2). Instead of a single Q-value, each critic outputs a set of quantile estimates of the return distribution. When computing targets, the quantiles from all critics are pooled and sorted, and the largest ones are dropped; truncating the upper tail of the pooled distribution is what controls overestimation bias (see the sketch after the example below).

## Example Usage

```python
from ray.rllib.algorithms.tqc import TQCConfig

config = (
    TQCConfig()
    .environment("Pendulum-v1")
    .training(
        n_quantiles=25,                   # Number of quantiles per critic
        n_critics=2,                      # Number of critic networks
        top_quantiles_to_drop_per_net=2,  # Quantiles to drop for bias control
    )
)

algo = config.build()
for _ in range(100):
    result = algo.train()
    print(f"Episode reward mean: {result['env_runners']['episode_reward_mean']}")
```
## Configuration

| Parameter | Default | Description |
|---|---|---|
| n_quantiles | 25 | Number of quantiles for each critic network |
| n_critics | 2 | Number of critic networks |
| top_quantiles_to_drop_per_net | 2 | Number of top quantiles to drop per network when computing targets |
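With these defaults, each target is computed from 2 * 25 = 50 pooled quantile estimates, of which 2 * 2 = 4 (top_quantiles_to_drop_per_net * n_critics) are dropped, leaving 46.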
TQC inherits all SAC parameters, including:

- actor_lr, critic_lr, alpha_lr: Learning rates
- tau: Target network update coefficient
- initial_alpha: Initial entropy coefficient
- target_entropy: Target entropy for automatic alpha tuning

## Comparison with SAC

Each critic outputs n_quantiles quantile estimates, so targets are formed from all n_critics * n_quantiles pooled values, of which the top top_quantiles_to_drop_per_net * n_critics quantiles are dropped.

| Aspect | SAC | TQC |
|---|---|---|
| Critic Output | Single Q-value | n_quantiles quantile values |
| Number of Critics | 2 (twin_q) | n_critics (configurable) |
| Loss Function | Huber/MSE | Quantile Huber Loss |
| Target Q | min(Q1, Q2) | Truncated sorted quantiles |
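The quantile Huber loss is the standard asymmetric quantile-regression loss from distributional RL. Below is a minimal PyTorch sketch; the function name and signature are illustrative, not the API of this module.

```python
import torch

def quantile_huber_loss(pred_quantiles, target, kappa=1.0):
    """Quantile-regression Huber loss for one critic.

    pred_quantiles: (batch, n_quantiles) predicted quantiles.
    target:         (batch, n_targets) truncated target quantiles (detached).
    """
    n_quantiles = pred_quantiles.shape[1]
    # Quantile midpoints tau_i = (i + 0.5) / n_quantiles.
    tau = (torch.arange(n_quantiles, dtype=pred_quantiles.dtype) + 0.5) / n_quantiles

    # Pairwise TD errors, shape (batch, n_quantiles, n_targets).
    td = target.unsqueeze(1) - pred_quantiles.unsqueeze(2)

    # Elementwise Huber loss with threshold kappa.
    huber = torch.where(
        td.abs() <= kappa,
        0.5 * td.pow(2),
        kappa * (td.abs() - 0.5 * kappa),
    )

    # Asymmetric weighting |tau - 1{td < 0}| penalizes over- and
    # under-estimation differently at each quantile level.
    weight = (tau.view(1, -1, 1) - (td.detach() < 0).float()).abs()
    return (weight * huber / kappa).mean()
```

In the paper, every critic is trained against the same truncated target with this loss, and the per-critic losses are summed.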
## Citation

```bibtex
@article{kuznetsov2020controlling,
  title={Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics},
  author={Kuznetsov, Arsenii and Shvechikov, Pavel and Grishin, Alexander and Vetrov, Dmitry},
  journal={arXiv preprint arXiv:2005.04269},
  year={2020}
}
```