No more hand-tuning rewards: Masked constrained policy optimization for safe reinforcement learning

Stef Van Havermaet, Yara Khaluf, Pieter Simoens

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › Academic › peer-review

Abstract

In safe Reinforcement Learning (RL), the agent attempts to find policies that maximize the expectation of accumulated rewards while guaranteeing that its safety remains above a given threshold. Hence, it is straightforward to formalize safe RL problems with both a reward function and a safety constraint. We define safety as the probability of survival in environments where taking risky actions could lead to early termination of the task. Although the optimization problem is already constrained by a safety threshold, reward signals related to unsafe terminal states influence the original maximization objective of the task. Selecting the appropriate value of these signals is often a time-consuming and challenging reward-engineering task, which requires expert knowledge of the domain. This paper presents a safe RL algorithm, called Masked Constrained Policy Optimization (MCPO), in which the learning process is constrained by safety and excludes the unsafe reward signals. We develop MCPO as an extension of gradient-based policy search methods, in which the updates of the policy and the expected reward models are masked. Our method benefits from a high probability of satisfying the given constraints for every policy in the learning process. We validate the proposed algorithm in two continuous tasks. Our findings show that the proposed algorithm is able to neglect unsafe reward signals and thereby resolves the desired safety-performance trade-off without the need for hand-tuning rewards.
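The masking idea in the abstract can be sketched in a few lines: reward signals attached to unsafe terminal states are excluded from the return used for the policy and value updates, so their hand-tuned magnitude no longer leaks into the task objective (safety is instead enforced by the separate survival-probability constraint). The following is a minimal illustrative sketch, not the authors' implementation; the function name and the simple zero-out masking are assumptions for illustration.

```python
import numpy as np

def masked_returns(rewards, unsafe_terminal, gamma=0.99):
    """Discounted returns with the unsafe terminal reward masked out.

    Illustrative sketch only: if the episode ended in an unsafe terminal
    state, the reward attached to that state is excluded, so the returns
    fed to the policy/value updates depend only on task rewards.
    """
    rewards = np.asarray(rewards, dtype=float).copy()
    if unsafe_terminal:
        rewards[-1] = 0.0  # mask the unsafe terminal reward signal
    returns = np.zeros_like(rewards)
    running = 0.0
    # Standard backward pass computing discounted returns.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Safe episode: every reward contributes to the returns.
safe = masked_returns([1.0, 1.0, 1.0], unsafe_terminal=False, gamma=1.0)
# Unsafe episode ending in a crash penalty of -100: the penalty is masked,
# so its hand-tuned magnitude does not affect the learning objective.
unsafe = masked_returns([1.0, 1.0, -100.0], unsafe_terminal=True, gamma=1.0)
```

With the penalty masked, the two episodes differ only through the safety constraint, which is the trade-off the paper targets.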

Original language: English
Title of host publication: 20th International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2021
Publisher: International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS)
Pages: 1332-1340
Number of pages: 9
ISBN (Electronic): 9781713832621
Publication status: Published - 2021
Externally published: Yes
Event: 20th International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2021 - Virtual, Online
Duration: 3 May 2021 - 7 May 2021

Publication series

Name: Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS
Volume: 3
ISSN (Print): 1548-8403
ISSN (Electronic): 1558-2914

Conference/symposium

Conference/symposium: 20th International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2021
City: Virtual, Online
Period: 3/05/21 - 7/05/21

Keywords

  • Constrained Policy Optimization
  • Reward Engineering
  • Safe Reinforcement Learning
