RLvSL project page
Reinforcement Learning via Supervised Learning


Problem statement

The goal of Reinforcement Learning is to infer the best possible action policy from past interaction experience with the environment. In the general case of Markovian environments, such a control policy is defined as a mapping from states to actions. The starting point of the proposed research lies on the observation that such a control policy can be cast as a classification problem, where each state (input) is classified to one of the available actions (classes). The main question in this context is how to specify a complete classification problem, namely how to generate a set of correct training examples in the absence of a teacher.

It has been commonly conjectured by many researchers that the structure of optimal or good control policies for many real-world problems is not arbitrary, but rather follows very particular, yet unknown, structural patterns which relate to the characteristics of the domain. For example, a policy for balancing on a moving bicycle will certainly have large areas in the state space where the rider/learner uniformly chooses to displace its weight to the left (right) when the bicycle is leaning to the right (left). However, the borders and the exact shape of these areas are unknown, depend on various characteristics of the bicycle, and need to be learned from interaction samples.

This structural property of policies could not be exploited in the past for there were no means available to directly apply appropriate structure-discovery tools on policies. This research aims at eliminating this restriction and allowing for the first time the direct use of such tools, such as support vector machines, which map the input data to high-dimensional spaces where structure inference can be automated.

Transforming a Reinforcement Learning problem into a sequence of classification problems implies defining a set of successive policies, which relate our approach to the family of Policy Iteration algorithms. Policy Iteration generates a sequence of improving policies, hence, the training set used for each stage of the classification problems reflects the current improved policy. Improvements are carried out by choosing the best immediate action, based on an evaluation of the long-term reward associated with this action. Such an evaluation can be obtained by performing roll-outs, ie. extensive simulations of the action followed by policy application. However, such an evaluation comes at a cost: the one of actually performing the simulations. This first approach, formalized as the Roll-out Classification Policy Iteration (RCPI) algorithm, laid the foundations of the current research.

Roll-out sampling

In the initial RCPI algorithm, the sampling scheme used to obtain a classifier's training set performs K roll-outs for each state-action pair. It can be observed that most of the computational effort for evaluation and action selection — in order to build the training set — is then wasted on states which do not clearly present a dominating action. Finding a dominating action in such states implies performing a prohibitively large number of roll-outs to discriminate between two actions yielding very similar long-term reward, thus wasting a lot of computational resources for a low value in policy improvement. On the other hand, states for which there is a clearly dominating action often do not require to exhaust the K available roll-outs in order to identify this action.

The idea developed by the Roll-out Sampling Policy Iteration (RSPI) is to define utility functions for each state, indicating how profitable it might be to obtain samples from one state rather than from another one. A similar resource allocation setting belongs to the field of multi-armed bandit problems. Four different utility functions have been introduced and tested in RSPI. This state selection scheme is then combined with a measure of statistical relevance which indicates when a state has been sampled enough times to guarantee with a given probability that the dominating action found so far is the real dominating action.

Experimental results have shown that RSPI obtains equally good policies as those produced by RCPI on two classical benchmarks, but with significantly less effort. There remain some practical obstacles, especially concerning the state space coverage of state samples. These topics are still being investigated. This algorithmic work was followed by a theoretical analysis which established conditions under which policy improvement is guaranteed after a single step of policy iteration and bounds on the achieved distance to optimality.

Number of successful runs against number of state samples
Number of successful runs against number of state samples for each utility function and RCPI — pendulum domain.

Classifiers evaluation

To understand the problem of compactly representing policies using classifiers, we invested on studying the true underlying structure of the policies we are trying to represent. Using a customized algorithm, we managed to obtain the optimal control policy for two well-known control problems (inverted pendulum and mountain car) to a sufficient resolution. It became apparent that there is significant structure in the majority of the state space with small “noisy” areas. Subsequently, we focused on testing several known classification methods (Support Vector Machines, Neural Networks, Decision Trees, Nearest Neighbours, Naive Bayes) on the problem of representing this optimal policy to identify the strengths and weaknesses of each method. While these experiments assessed the accuracy of the classifiers in approximating the representation of the optimal policy, the performance of the represented policies were assessed by a different series of experiments, which also revealed the rate of performance improvement with respect to the amount of collected data. Nearest Neighbours classifiers seem to be the best choice for achieving good representation and performance within classifier-based policy iteration.

Mountain car problem — optimal policy
Mountain car problem — optimal policy.

Continuous action and state spaces

Controling real-world systems as robotic actuators often imply defining continuous commands to apply. However, most Reinforcement Learning methods are designed to cope with finite discrete action spaces. We developed a novel approach to continuous action optimization . This approach uses previous results from telecommunications systems. It defines a continuous action policy in terms of continuous adaptive action increments and learns the adaptation scheme from experience, relying on the temporal locality property of actions.

This method can be used in conjunction with any standard Reinforcement Learning algorithm. It requires very little tuning and very limited computationnal resources. Experiments on the bicyle riding and the inverted pendulum domains showed excellent performance, both in terms of solution quality and energy consumption. Indeed, the results obtained in terms of reaching the goal were comparable or better than the ones obtained with the standard discrete actions setup. Moreover, the total energy consumption was greatly decreased using our algorithm, which is a crucial feature for implementation on real systems.

Control of high-dimensional event-driven temporal processes

The idea of using classifiers to represent the policies at stake in some high-dimensional reinforcement learning problems was tested on the specific setting of complex explicit-event time-dependent problems. These problems describe the case where the learning agent's environment is subject to change. Most often, the underlying dynamics of such problems are only partially observable which leads to increased complexity in building an efficient policy and might benefit from the regularisation properties of statistical learning tools.

The main case of study involved optimizing a subway network management policy, by deciding to add or remove trains from service depending on the observable state of the process. Such a benchmark relied on the previous algorithmic contributions, especially roll-out evaluations.

This study pushed the use of statistical learning tools in another direction. First, we adapted the exploration strategy destined to obtain the classifier's training set so as to direct it towards the situations which were most likely to be encountered. Such an oriented exploration towards important zones used the structured nature of temporal problems. This implied defining notions of explored and unexplored zones, thus yielding another type of statistical learning problem, namely density function estimation from samples. Then, we maintained both a value function and a policy, each being defined on only part of the state space. Hence, the incremental improvement scheme of reinforcement learning which was called upon the classifier objects in order to represent successive policies updated both the classifiers used for policy representation and the regressors used as value functions.