Our paper on Tsallis reinforcement learning is accepted to RSS 2020

[2020.05.06]

The following paper has been accepted to Robotics: Science and Systems (RSS 2020):

  • Generalized Tsallis Entropy Reinforcement Learning and Its Application to Soft Mobile Robots by Kyungjae Lee, Sungyub Kim, Sungbin Lim, Sungjoon Choi, Mineui Hong, Jaein Kim, Yong-Lae Park, and Songhwai Oh
    • Abstract: In this paper, we present a new class of Markov decision processes (MDPs), called Tsallis MDPs, with Tsallis entropy maximization, which generalizes existing maximum entropy reinforcement learning (RL). A Tsallis MDP provides a unified framework for the original RL problem and RL with various types of entropy, including the well-known Shannon-Gibbs (SG) entropy, using an additional real-valued parameter, called an entropic index. By controlling the entropic index, we can generate various types of entropy, including the SG entropy, and a different entropy results in a different class of optimal policies in Tsallis MDPs. We also provide a full mathematical analysis of Tsallis MDPs. Our theoretical results enable the use of any positive entropic index in RL. To handle complex and large-scale problems, such as learning a controller for a soft mobile robot, we also propose a Tsallis actor-critic (TAC) method. We find that different types of RL problems call for different values of the entropic index, and empirically show that TAC with a properly chosen entropic index outperforms state-of-the-art actor-critic methods. Furthermore, to reduce the effort of finding a proper entropic index, we propose a linear scheduling method in which the entropic index increases linearly with the number of interactions. In simulations, linear scheduling shows fast convergence and performance comparable to TAC with the optimal entropic index, which is a useful property for real robot applications. We also apply TAC with linear scheduling to learn a feedback controller for a soft mobile robot, and it achieves the best performance among existing actor-critic methods in terms of convergence speed and the sum of rewards. Consequently, we empirically show that the proposed method efficiently learns a controller for soft mobile robots.
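
To give a feel for the idea, here is a minimal sketch of the Tsallis entropy under the common definition S_q(p) = (1 − Σᵢ pᵢ^q)/(q − 1), which recovers the Shannon-Gibbs entropy in the limit q → 1. This is an illustrative assumption about the entropy family the paper generalizes, not code from the paper itself; the function name and the linear-scheduling helper are hypothetical.

```python
import math

def tsallis_entropy(p, q):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1).

    For q -> 1 this converges to the Shannon-Gibbs entropy
    -sum_i p_i * log(p_i), the case used by standard
    maximum-entropy RL.
    """
    if abs(q - 1.0) < 1e-12:
        # q = 1: Shannon-Gibbs limit
        return -sum(pi * math.log(pi) for pi in p if pi > 0.0)
    return (1.0 - sum(pi ** q for pi in p)) / (q - 1.0)

def scheduled_entropic_index(step, total_steps, q_start=1.0, q_end=2.0):
    """Hypothetical linear schedule: the entropic index increases
    linearly with the number of environment interactions."""
    frac = min(step / total_steps, 1.0)
    return q_start + frac * (q_end - q_start)

# Example: near q = 1 the Tsallis entropy approaches the SG entropy.
p = [0.5, 0.25, 0.25]
sg = tsallis_entropy(p, 1.0)        # Shannon-Gibbs entropy
near_sg = tsallis_entropy(p, 1.001) # slightly above q = 1
```

Varying `q` away from 1 changes how strongly the entropy bonus rewards spreading probability mass, which is what induces the different classes of optimal policies mentioned in the abstract.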