CRL Task 1: Generalised Policy Learning

causalityreinforcement-learning

CRL Task 1: Generalised Policy Learning

In the previous blog post we developed some ideas and theory needed to discuss a causal approach to reinforcement learning. We formalised notions of multi-armed bandits (MABs), Markov Decision Processes (MDPs), and some causal notions. In this blog post we’ll finally get to developing some causal reinforcement learning ideas. The first of which is dubbed Task 1, for CRL can help solve. This is Generalised Policy Learning. Let’s begin.

Generalised Policy Learning

Reinforcement learning typically involves learning and optimising some policy about how to interact in an environment to maximise some reward signal. Typical reinforcement learning agents are trained in isolation, exploiting copious amounts of computing power and energy resources. In a crude manner of speaking, offline policy learning involves learning from a fixed set of collated data. Online policy learning typically involves learning on-the-fly, with the main constraint being time. This approach requires flexibility as data can change over time without any indication of such a change to the agent. In addition, state-of-the-art agents often take a substantial amount of time to train. Transfer learning seeks to solve this inefficiency in the learning process by applying previous knowledge and experience to boost learning performance, similar to how humans can exploit previous knowledge to solve novel tasks. The field of causal inference similarly deals with this problem of inferring effect from heterogeneous sources of data. A major problem in this process involves learning in the face of unobserved (hidden) confounders. We now discuss some ideas of how causal inference and modelling can be applied to multi-armed bandits (MABs) and Markov decision processes (MDPs) to boost learning performance by combining different modes of learning - observation and interventional.

One such paper that tackles this problem is 1Zhang, J., & Bareinboim, E. (2017). Transfer learning in multi-armed bandits: A causal approach. IJCAI-17:1340–1346. in which the authors combine transfer learning in (basic) reinforcement learning with causal inference theory. This is done in the context of two multi-armed bandit agents given access to a causal model of the environment. In the case where causal effects are not identifiable, the authors provide a method of extracting causal bounds from available knowledge contained in the available distributions. We now develop some of the theory presented in this paper.

Contextual bandits are discussed in 2Langford, J., & Zhang, T. (2007). The epoch-greedy algorithm for contextual multi-armed bandits. NIPS. and are a variation of MABs such that the agent can observe additional information (context) associated with the reward signal. The authors of 1Zhang, J., & Bareinboim, E. (2017). Transfer learning in multi-armed bandits: A causal approach. IJCAI-17:1340–1346. start by considering an off-policy learning problem such that agent AA follows some policy do(X=π(ϵ,u))do(X = \pi(\epsilon, u)) with context uUu \in U and noise ϵ\epsilon, resulting in joint distribution P(x,y,u)P(x,y,u). Another agent, AA^\prime, would similarly like to learn about the environment and exploit the experience of AA to boost its learning and quickly converging upon the optimal policy. This problem boils down to identifying the causal effect of an intervention on XX, given by E[Ydo(x)]\mathbb{E}[Y \mid do(x)]. That is, the expected outcome given that we experiment by intervening on XX. A more challenging scenario appears if we wish to transfer knowledge from this contextual bandit to a standard MAB agent, say BB (see figure below). If we denote by GXZG_{\overline{X}\underline{Z}} the subgraph obtained by deleting all edges directed into XX and all edges directed out of ZZ, then (YXU)G(Y \perp X \mid U)_{\underline{G}} means that by removing edges directed out of G, given the context UU, we obtain that YY is independent of XX. This is useful because we can then derive E[Ydo(x)]=uD(U)E[Yx,u]P(u)\mathbb{E}[Y \mid do(x)] = \sum_{u \in D(U)} \mathbb{E}[Y \mid x, u] P(u) (applying the second rule of dodo-calculus). In this case the average effect E[Ydo(x)]\mathbb{E}[Y \mid do(x)] was identifiable - there were not multiple causal structures inducing the same distribution. The authors note that dodo-calculus provides a complete method for identifying such causal effects but it is not useful for constructing such formulae for non-identifiable queries. For example, if agent BB cannot observe the context in which AA operates, dodo-calculus cannot identify the average effect E[Ydo(x)]\mathbb{E}[Y \mid do(x)] since different causal models can induce the same observational distribution P(x,y)P(x,y) with different expected rewards. This is a very important concept to note since naivete in the transfer process under these conditions can lead to negative impact on the performance of the target. In practice, however, we often don’t have access to the underlying SCMs beforehand, and thus cannot distinguish between two such models.

ZB 2017 1

At this point it is natural to assume that if the identifiability condition does not hold then prior data is not useful in the transfer process. Remarkably though, 1Zhang, J., & Bareinboim, E. (2017). Transfer learning in multi-armed bandits: A causal approach. IJCAI-17:1340–1346. shows that for non-identifiable tasks we can still obtain causal bounds over the expected rewards of the target agent. This is achieved by using prior knowledge to construct a general SCM compatible with all the available models. Let us consider this is some detail. Given an stochastic MAB problem such that it has a prior represented as a list of bounds over the expected rewards, then for any bandit arm xx, let μx[lx,hx]\mu_x \in [l_x,h_x]. WLOG, assume 0<lx<hx<10 < l_x < h_x < 1 and denote lmax=maxx=1,,Klxl_{max} = \max_{x = 1, \dots, K} l_x. Note that a K-MAB problem is simply a generalisation of the MAB problem to multiple independent bandits. The following theorem is presented and proved by the authors:

Theorem: Consider a K-MAB problem with rewards bounded in [0,1][0,1], with each arm x{1,,K}x \in \{ 1, \dots, K \}, and expected reward μx[lx,hx]\mu_x \in [l_x,h_x] s.t. 0<lx<hx<10 < l_x < h_x < 1. Taking f(t)=log(t)+3log(log(t))f(t) = \log(t) + 3\log(\log(t)), in the B-kl-UCB algorithm (shown in algorithm below), the number of draws of E[Nx(T)]\mathbb{E}[N_x(T)] for any sub-optimal arm aa is upper bounded for any horizon T3T \geq 3 as:

{0if hx<lmax4+4elog(log(T))if hx[lmax,μ)log(T)KL(μx,μ)+O(log(log(T))KL(μx,μ)) if hxμ\left\{\begin{array}{ll} 0 & \text {if } h_{x}<l_{\max } \\ 4+4 e \log (\log (T)) & \text {if } h_{x} \in\left[l_{\max }, \mu^{*}\right) \\ \frac{\log (T)}{K L\left(\mu_{x}, \mu^{*}\right)}+\mathcal{O}\left(\frac{\log (\log (T))}{K L\left(\mu_{x}, \mu^{*}\right)}\right) & \text { if } h_{x} \geq \mu^{*} \end{array}\right.

Though seemingly very abstract, this theorem tells us that if the causal bounds impose strong constraints over the arm’s distribution then the B-kl-UCB algorithm (see below) provides asymptotic improvements over its unbounded counterpart (the kl-UCB algorithm 3Garivier, A., & Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond., not presented here). This implies that constraints translate into different regret bounds for the MAB agent. We could expect that finding such constraints over our problem bounds would increase performance. The algorithm below should be self-contained. For more information refer to the source material.

B Kl UCB

Zhang and Bareinboim 4Zhang, J., & Bareinboim, E. (2019). Near-optimal reinforcement learning in dynamic treatment regimes. NeurIPS. extend similar ideas to the field of dynamic treatment regimes (DTRs) and personalised medicine. A DTR consists of a set of decision rules controlling the provided treatment at any given stage, given a patient’s conditions. The challenge is to apply online reinforcement learning algorithms to the problem of selecting optimal DTRs given observational data, with the hope that RLs sample efficiency success in other decision making processes can translate to DTRs. Policy learning in this case refers to the process of finding an optimal policy π\pi that maximises some outcome YY - usually the patient’s recovery or improvement in health markers. Often, however, the parameters of the DTR remain unknown and direct optimisation isn’t possible. Traditional algorithms rely on there being no unobserved confounders, while randomisation techniques are often not feasible in the medical domain. We certainly wouldn’t want doctors randomly testing strategies on patients to see ‘how it plays out’! Reinforcement learning offers an attractive set of techniques for DTRs as it should offer an efficient means to learn DTRs while balancing the exploration of state-space and exploitation of rewards. Existing RL techniques, however, are not applicable in the DTR context as they rely on the Markov property. DTRs are clearly non-Markovian as the treatment procedure at some point in the future is a function of past treatments. The authors formalise this in causal language as follows:

Dynamic Treatment Regime: A dynamic treatment regime (DTR) is a SCM U,V,F,P(u)\langle \boldsymbol{U}, \boldsymbol{V}, \boldsymbol{F}, P(\boldsymbol{u}) \rangle where the endogenous variables V={XK,SK,Y}\boldsymbol{V} = \{ \overline{\boldsymbol{X}}_K, \overline{\boldsymbol{S}}_K, Y \} are the total stages of interventions. Here XK\overline{\boldsymbol{X}}_K represents a sequence {X1,,XK}\{X_1,\dots,X_K\}. Values of exogenous variables U\boldsymbol{U} are drawn from the distribution P(u)P(\boldsymbol{u}). For stage k=1,,K:k = 1, \dots, K:

  1. XkX_k is a finite decision decided by a behaviour policy xkfk(sk,xk1,u)x_k \leftarrow f_k(\overline{\boldsymbol{s}}_k, \overline{\boldsymbol{x}}_{k-1}, \boldsymbol{u}).
  2. SkS_k is a finite state decided by a transition function skτk(xk1,sk1,u)s_k \leftarrow \tau_k(\overline{\boldsymbol{x}}_{k-1}, \overline{\boldsymbol{s}}_{k-1}, \boldsymbol{u}).
  3. YY is the primary outcome at the final state KK, decided by a reward function yr(xK,sK,u)y \leftarrow r(\overline{\boldsymbol{x}}_{K}, \overline{\boldsymbol{s}}_{K}, \boldsymbol{u}) bounded in [0,1][0,1].

A DTR MM^\ast induces some observational distribution P(xK,sK,u))P(\overline{\boldsymbol{x}}_{K}, \overline{\boldsymbol{s}}_{K}, \boldsymbol{u})), responsible for the data we observe without intervention. A policy π\pi for the DTR defines some sequence of stochastic interventions do(X1π1(X1S1),,πK(XKSK,XK1))do(X_1 \sim \pi_1(X_1 \mid \overline{\boldsymbol{S}_1}), \dots, \pi_K(X_K \mid \overline{\boldsymbol{S}}_K, \overline{\boldsymbol{X}}_{K-1})). These interventions induce an interventional distribution Pπ(xK,sK,y)=PxK(ysK)k=0K1Pxk(sk+1sk)πk+1(xk+1sk+1,xk),P_{\pi}\left(\overline{\boldsymbol{x}}_{K}, \overline{\boldsymbol{s}}_{K}, y\right)=P_{\overline{\boldsymbol{x}}_{K}}\left(y \mid \overline{\boldsymbol{s}}_{K}\right) \prod_{k=0}^{K-1} P_{\overline{\boldsymbol{x}}_{k}}\left(s_{k+1} \mid \overline{\boldsymbol{s}}_{k} \right) \pi_{k+1}\left(x_{k+1} \mid \overline{\boldsymbol{s}}_{k+1}, \overline{\boldsymbol{x}}_{k}\right), where Pxk(sk+1sk)P_{\overline{\boldsymbol{x}}_{k}}\left(s_{k+1} \mid \overline{\boldsymbol{s}}_{k} \right) is the transition distribution at stage kk, and PxKP_{\overline{\boldsymbol{x}}_{K}} is the reward distribution over the primary outcome. The expected cumulative reward is then Vπ(M)=Eπ[Y]V_\pi(M^\ast) = \mathbb{E}_{\pi}[Y], implying our task is to find π=argmaxπΠVπ(M).\pi^\ast = \textbf{argmax}_{\pi \in \Pi} V_{\pi}(M^\ast). In other words, we would like to find the policy that finds the best sequence of actions that leads to an optimal outcome. The notation VπV_\pi is deliberately chosen to correspond to value functions in RL literature.

The authors introduce the UC-DTR algorithm, presented in the algorithm below, to optimise an unknown DTR. This algorithm takes an optimism in the face of uncertainty approach - a common strategy in the reinforcement learning literature. Given only knowledge of the state and action domains, UC-DTR achieves near-optimal total regret bounds. This is really quite remarkable. Knowing only about the current state and the possible actions we can take, we have an algorithm to reach almost optimal outcomes in very few steps! We now delve into the algorithm a bit deeper and discuss the overall strategy it employs. First, a new policy πt\pi_t is proposed using samples

{SKi,XKi,Yi}i=1t1\{ \overline{\boldsymbol{S}}_K^i, \overline{\boldsymbol{X}}_K^i, Y^i \}_{i=1}^{t-1}

collected up until the current episode, tt. That is, the deciding agent exploits its current knowledge to propose what it believes to be a good policy choice. The empirical estimates for the expected reward and the transitional probabilities are calculated and used to consider a set of plausible DTRs in terms of a confidence region around these estimates. The optimal policy of the most optimistic DTR in the plausible DTR set is calculated and executed to collect the next set of samples. This is the optimism and the uncertainty we discussed earlier. This procedure is repeated until a tolerance level or specific episode is reached. The authors proceed to show that the the UC-DTR algorithm has cumulative regret that scales with

O~(KSXT),\tilde{\mathcal{O}}(K\sqrt{|\mathcal{\boldsymbol{S}}||\mathcal{\boldsymbol{X}}|T}),

where O~()\tilde{\mathcal{O}}(\cdot) is like Big-Oh notation but also ignores log\log-terms. Formally,

f=O~(g)k,f=O(glogk(g)).f = \tilde{\mathcal{O}}(g) \Leftrightarrow \exists k, f = \mathcal{O}(g \log^k(g)).

The proofs for this analysis are fairly involved and are provided in the appendix of 4Zhang, J., & Bareinboim, E. (2019). Near-optimal reinforcement learning in dynamic treatment regimes. NeurIPS..

Theorem: Fix tolerance parameter δ(0,1)\delta \in (0,1). With probability at least 1δ1 - \delta, it holds for any T>1T > 1 that the regret of UC-DTR is bounded by

R(T)12KSXTlog(2KSXT/δ)+4KTlog(2T/δ).R(T) \leq 12 K \sqrt{|\mathcal{S}||\mathcal{X}| T \log (2 K|\mathcal{S}||\mathcal{X}| T / \delta)}+4 K \sqrt{T \log (2 T / \delta)}.

Theorem: For any algorithm A\mathcal{A}, any natural numbers K1K \geq 1, and

Sk2andXk2|\boldsymbol{\mathcal{S}}^k| \geq 2 \quad\text{and}\quad |\boldsymbol{\mathcal{X}}^k| \geq 2

for any k{1,,K}k \in \{ 1, \dots, K \}, there is a DTR MM with horizon KK, state domains S\boldsymbol{\mathcal{S}} and action domain X,\boldsymbol{\mathcal{X}}, such that the expected regret of A\mathcal{A} after

TSXepisodes is at leastE[R(T)]0.05SXT.T \geq |\boldsymbol{\mathcal{S}}||\boldsymbol{\mathcal{X}}| \quad\text{episodes is at least}\quad \mathbb{E}[R(T)] \geq 0.05 \sqrt{|\boldsymbol{\mathcal{S}}||\boldsymbol{\mathcal{X}}| T}.

Together, these theorems indicate that UC-DTR is near-optimal given only the state and action domains. The authors further propose exploiting available observational data to improve performance of the online learning procedure in the face of unobserved confounders and non-indentifiability. Using observational data, the authors derive theoretically sound bounds on the the system dynamics in DTRs. We include the full UC-DTR algorithm below as it is indicative of a general algorithmic approach to similar problems in the causal reinforcement literature. The reader is encouraged to work through the steps to confirm that it matches the theory discussed above. The explanations for the steps are fairly self-contained and are not discussed further for brevity.

UC DTR

We have discussed some interesting theory that underlines much of the later work in this active area of research. The interested reader is encouraged to refer to 5Zhang, J., & Bareinboim, E. (2020). Designing optimal dynamic treatment regimes: A causal reinforcement learning approach. ICML. for extensions to dynamic treatment regimes. 6Namkoong, H., Keramati, R., Yadlowsky, S., & Brunskill, E. (2020). Off-policy policy evaluation for sequential decisions under unobserved confounding. and 7Zhang, J., & Bareinboim, E. (2020). Bounding causal effects on continuous outcomes. will interest the readers motivated by the development of further theory for generalised decision making and performance bounding. We are now ready to discuss the problem of finding where in a causal system we should intervene for optimal and efficient outcomes.

References

  1. Zhang, J., & Bareinboim, E. (2017). Transfer learning in multi-armed bandits: A causal approach. IJCAI-17:1340–1346.
  2. Langford, J., & Zhang, T. (2007). The epoch-greedy algorithm for contextual multi-armed bandits. NIPS.
  3. Garivier, A., & Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond.
  4. Zhang, J., & Bareinboim, E. (2019). Near-optimal reinforcement learning in dynamic treatment regimes. NeurIPS.
  5. Zhang, J., & Bareinboim, E. (2020). Designing optimal dynamic treatment regimes: A causal reinforcement learning approach. ICML.
  6. Namkoong, H., Keramati, R., Yadlowsky, S., & Brunskill, E. (2020). Off-policy policy evaluation for sequential decisions under unobserved confounding.
  7. Zhang, J., & Bareinboim, E. (2020). Bounding causal effects on continuous outcomes.
St John

Written by St John

Author of the Asking Why Blog - a personal blog and website with everything I find interesting.

Comments are being migrated. Check back soon.