CRL Task 6: Causal Imitation Learning

causalityreinforcement-learningmachine-learning

CRL Task 6: Causal Imitation Learning

And as quickly as that, we’re at task 6 of our quest. Causal imitation learning is perhaps the most fanciful-sounding, but at it’s core it remains as simple a goal as the challenge presented by Abbeel and Ng in 2004. The really exciting part here is that we have rigorous methods for combining our causal techniques with imitation learning procedures using RL. This is a really cutting edge research area and we only discuss one paper in detail here. Keep an eye out for exciting developments coming from left, right and centre!

Causal Imitation Learning

We now reach the final task presented by Bareinboim in his development of the causal reinforcement learning framework 1Bareinboim, E. (2020). Causal Reinforcement Learning. ICML 2020.. This task involves the interesting challenge of learning by expert demonstration - imitation learning. In their now classic paper, Abbeel and Ng 2Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. ICML. explored applying inverse reinforcement learning (IRL) techniques to the imitation learning procedure. Essentially, IRL learns a reward function that emphasises the observed expert trajectories. This is in contrast to the other common method of imitation learning known as behaviour cloning where an agent seeks to mimic the policy of the expert. Both these methods have been successful in their own right, but they make the strong assumption that actions of the expert are fully observed by the imitator. 3Zhang, J., Kumor, D., & Bareinboim, E. (2020). Causal imitation learning with unobserved confounders. addresses some of these shortcomings by introducing a complete graphical criterion for determining whether imitation learning is feasible given observational data and knowledge about the underlying causal process. Further, a sufficient algorithm for identifying an imitation policy when this criterion does not hold is presented.

Partially Observable SCM 3Zhang, J., Kumor, D., & Bareinboim, E. (2020). Causal imitation learning with unobserved confounders.: A POSCM is a tuple M,O,L\langle M, \boldsymbol{O}, \boldsymbol{L} \rangle. Here MM is an SCM, O\boldsymbol{O} represents the observed endogenous variables, and L\boldsymbol{L} represents the latent (unobserved) endogenous variables. The observed and latent variables are mutually exhaustive over the endogenous variables.

The task at hand is to determine the value of performing some intervention (action) that is part of the observed variable set, XOX \in \boldsymbol{O}. Assuming the reward is latent, we wish to identify a policy π\pi such that the expected reward E[Ydo(π)]\mathbb{E}[Y \mid do(\pi)] exceeds a certain minimum performance requirement, τ\tau. We say P(ydo(π))P(y \mid do(\pi)) is identifiable if for a subset of the exogenous variables, YV\boldsymbol{Y} \subseteq \boldsymbol{V}, the distribution P(ydo(π);M)P(y \mid do(\pi); M) is uniquely computable from the observation distribution and POSCM, MM. In other words, if we can identify the outcome from the observations of the expert behaviour in an imitation learning context. In fact, when the reward Y\boldsymbol{Y} is latent (not all yY\boldsymbol{y} \in \boldsymbol{Y} are observed), we cannot identify P(ydo(π))P(\boldsymbol{y} \mid do(\pi)) (see corollary 1 in 3Zhang, J., Kumor, D., & Bareinboim, E. (2020). Causal imitation learning with unobserved confounders.). We thus need more information to learn an effective imitation policy.

By assuming that the observed actions are demonstrated by an expert (exceeds a threshold), we relax the need to worry about non-identifiability issues. Further, we say a reward distribution P(y)P(\boldsymbol{y}) is imitable if there exists some policy in the policy space that can identify the distribution for some POSCM. For example, consider XWYX \rightarrow W \rightarrow Y with policy π(x)=P(x)\pi(x) = P(x). Then the interventional distribution is

P(ydo(π))=x,wP(yx)P(wx)π(x)=P(y).P(y \mid do(\pi)) = \sum_{x,w} P(y \mid x)P(w \mid x)\pi(x) = P(y).

In other words, in this simple case the unobserved reward distribution P(y)P(y) is imitable purely by observing realisations of action XX. Importantly, as previously discussed, the reward distribution remains unidentifiable. That is, imitability does not guarantee identifiability of the imitators reward distribution. In fact, Zhang et al. further show that if an expert and imitator share the policy space (possible actions), the the policy itself is always imitable. When the policy spaces do not agree, the criteria become more complicated. We now explore some graphical criteria for this case.

Imitation Backdoor 3Zhang, J., Kumor, D., & Bareinboim, E. (2020). Causal imitation learning with unobserved confounders.: Consider a causal system with diagram GG and policy space Π\Pi. We say Z\boldsymbol{Z} satisfies the imitation backdoor criterion (i-backdoor) with respect to G,Π\langle G, \Pi \rangle if and only if the ZZ is a subset of the parents of Π\Pi and YY is conditionally independent of XX given Z\boldsymbol{Z} in the graph with edges out of XX removed. Formally, ZPa(Π)\boldsymbol{Z} \subseteq Pa(\Pi) and (YXZ)GX.(Y \perp X \mid \boldsymbol{Z})_{G_{\underline{X}}}.

For an example of how this definition applies, consider the figure below. Specifically, (a) with set {Z}\{Z\}. Here we have {Z}\{Z\} is in the parents of the policy space (inclusive of the space itself). Also, removing edge XYX \rightarrow Y and conditioning on ZZ still leaves path XLYX \leftarrow L \rightarrow Y as a ‘backdoor’. In figure (b), however, the LYL \rightarrow Y edge is removed and we no longer have an i-backdoor set.

ZKB 2020 Fig1

Causal diagrams showing examples where imitation learning can or cannot occur. Blue variables indicate latent reward variable, while red variables represent action. Light red indicates the inputs to the policy space, and light blue represents the minimal imitation surrogates. Figure extracted from 3Zhang, J., Kumor, D., & Bareinboim, E. (2020). Causal imitation learning with unobserved confounders..

The i-backdoor criterion is used to characterise when imitation of an expert is possible given that the reward variable is unobserved.

Imitation by Backdoor 3Zhang, J., Kumor, D., & Bareinboim, E. (2020). Causal imitation learning with unobserved confounders.: Given a causal diagram GG with policy space Π\Pi, we say that the distribution P(y)P(y) is imitable with respect to G,Π\langle G, \Pi \rangle if and only if there exists some i-backdoor admissible set Z\boldsymbol{Z} in this causal system. Further, the policy itself can be determined and is given by π(xpa(Π))=P(xz)\pi(x \mid pa(\Pi)) = P(x \mid \boldsymbol{z}).

Applying this theorem, we can see that using set {Z}\{Z\} in the earlier figure (b), an imitating policy is learnable. Further, the policy itself is given by π(xz)=P(xz).\pi(x \mid z) = P(x \mid z). These results are impressive, but Zhang et al. further point out that the i-backdoor requirement is not necessary for imitation of expert performance. Consider the earlier figure (c) in which variable SS mediates all actions (intervention on XX) on the outcome (latent reward YY). We could imagine that learning a distribution over SS could be sufficient for imitation of the distribution of YY in this case. This train of thought motivates the following definitions and formalism’s.

Imitation Surrogate 4Kocaoglu, M., Shanmugam, K., & Bareinboim, E. (2017). Experimental design for learning causal graphs with latent variables. NIPS.: Consider a causal diagram GG with policy space Π\Pi. Take S\boldsymbol{S} as an arbitrary subset of the observations, O\boldsymbol{O}. We say S\boldsymbol{S} is an imitation surrogate (i-surrogate) with respect to G,Π\langle G, \Pi \rangle if (YX^S)GΠ(Y \perp \hat{X} \mid \boldsymbol{S})_{G \cup \Pi}. Here GΠG \cup \Pi is the graph obtained by adding directed edges from Pa(Π)Pa(\Pi) to XX. X^\hat{X} is a new parent of XX that has been added in this procedure.

We say that the imitation surrogate is minimal if there is no subset of it such that the subset is also an imitation surrogate. A simple example of this is visible in figure (c) where both {W,S}\{W,S\} and {S}\{S\} are surrogates, but {S}\{S\} is the minimal surrogate. Figure (d) poses an additional problem in that the addition of the collider ZZ to (c) means P(sdo(π))P(s \mid do(\pi)) is not identifiable even though we have an i-surrogate SS. The authors tackle this problem by realising that having a subspace of the policy space that yields an identifiable distribution is sufficient to solve the imitation learning task. Without delving into the details here, the authors of 3Zhang, J., Kumor, D., & Bareinboim, E. (2020). Causal imitation learning with unobserved confounders. implement a confounding robust imitation learning algorithm and apply it to several interesting problems in which the causal approach is shown to be superior to naive imitation learning approaches.

This completes the discussion about this task. I expected causal imitation learning to become an active area of research as this paper was only recently released, being the first of its kind. This also concludes the development of the theory necessary for engaging with state-of-the-art research in causal reinforcement learning.

Image credit: Header Image.

References

  1. Bareinboim, E. (2020). Causal Reinforcement Learning. ICML 2020.
  2. Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. ICML.
  3. Zhang, J., Kumor, D., & Bareinboim, E. (2020). Causal imitation learning with unobserved confounders.
  4. Kocaoglu, M., Shanmugam, K., & Bareinboim, E. (2017). Experimental design for learning causal graphs with latent variables. NIPS.
St John

Written by St John

Author of the Asking Why Blog - a personal blog and website with everything I find interesting.

Comments are being migrated. Check back soon.