CRL Task 4: Generalisability and Robustness

At this point we’ve developed a good sense of the technical theory of causal reinforcement learning. This next section brings together many important ideas and generalises notions of data transfer between different environments. This will prove important when we discuss imitation learning in the future. For those coming from a more pure reinforcement learning background, being able to generalise results about where and when we can transfer knowledge between related domains is clearly useful for a general agent. Let’s get going!

Generalisability and Robustness

One of the most important features of human intelligence is its ability to generalise and transfer causal knowledge across seemingly disparate domains. This makes powerful inferences and decision-making procedures possible even in foreign environments. Bareinboim and Pearl address the problem of transferring knowledge from data collected in heterogeneous domains $\Pi = \{ \pi_1,\dots,\pi_n\}$ to some target domain $\pi^\ast$, a problem known as $mz$-transportability. In [1] the authors establish a necessary and sufficient condition for deciding whether this transfer is feasible. In the sciences, powerful studies involving transfer of knowledge across related domains are known as meta-analyses or externally valid studies. The transfer of causal knowledge is known as transportability and is a crucial ability needed for artificial agents to automate the process of knowledge acquisition, discovery and learning.

Consider an example in which we would like to use knowledge of social science experiments done in Los Angeles (predicting outcome $Y$ with cause $X$, confounded by some age distribution $Z$) to make similar predictions in New York. Calling the distribution in Los Angeles $P(y \mid do(x))$, we would like to predict $R = P^\ast(y \mid do(x))$, the cause-effect relationship under a different age distribution in New York. We call the process which generates this difference in age across the populations a difference-generating factor. Such factors are represented by a set of selection variables $S$, denoted graphically by $\blacksquare$, which point at the mechanisms that differ. In this case we have $S \rightarrow Z$. We can then derive a remarkably simple transport formula as follows:

$$R = \sum_z P^\ast(y \mid do(x), z) P^\ast(z) = \sum_z P(y \mid do(x), z) P^\ast(z)$$
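The second equality can be sketched with $do$-calculus. This is only a sketch under assumptions the text does not spell out: $Z$ is not a descendant of $X$, and $S$ is separated from $Y$ given $X$ and $Z$ once the arrows into $X$ are removed. The symbol $s^\ast$ is introduced here for illustration and stands for conditioning on the target domain.

$$
\begin{aligned}
P^\ast(y \mid do(x)) &= P(y \mid do(x), s^\ast) \\
&= \sum_z P(y \mid do(x), z, s^\ast)\, P(z \mid do(x), s^\ast) \\
&= \sum_z P(y \mid do(x), z)\, P(z \mid s^\ast) \\
&= \sum_z P(y \mid do(x), z)\, P^\ast(z)
\end{aligned}
$$

The third line drops $s^\ast$ from the first factor (the separation assumption) and drops $do(x)$ from the second factor (intervening on $X$ cannot change a non-descendant of $X$).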

This deceptively simple formula tells us we can estimate $R$, an interventional quantity in the target domain, by plugging in the observational distribution $P^\ast(z)$ from that domain. This re-weights the $z$-specific interventional effects measured in the source domain by the age distribution of the target domain. Generalising such transport formulae is not trivial. We need to know whether $do$-calculus is complete, that is, whether $do$-calculus operations can always find a transport formula when one exists. Recall that causal models and induced diagrams encode relationships under a particular domain. A formalism that is helpful for the study of transfer of knowledge across causal domains is the notion of selection diagrams. These diagrams graphically encode the shared causal relations and difference-generating factors of different causal systems.
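To make the re-weighting concrete, here is a minimal numerical sketch in Python. All numbers, the `young`/`old` age strata and the function name `transport_effect` are hypothetical and chosen purely for illustration. The outcome is binary, so the tables store $P(y = 1 \mid do(x), z)$.

```python
def transport_effect(p_y_given_do_x_z, p_star_z):
    """Return {x: sum_z P(y=1 | do(x), z) * P*(z)} for every treatment x.

    p_y_given_do_x_z: dict mapping (x, z) -> P(y=1 | do(x), z), from the source
    p_star_z:         dict mapping z -> P*(z), observed in the target
    """
    if abs(sum(p_star_z.values()) - 1.0) > 1e-9:
        raise ValueError("P*(z) must sum to one")
    treatments = sorted({x for (x, _z) in p_y_given_do_x_z})
    return {
        x: sum(p_y_given_do_x_z[(x, z)] * pz for z, pz in p_star_z.items())
        for x in treatments
    }


# Hypothetical z-specific effects measured experimentally in Los Angeles.
p_source = {
    (0, "young"): 0.2, (0, "old"): 0.4,
    (1, "young"): 0.5, (1, "old"): 0.9,
}
# Hypothetical age distributions in the two cities.
p_z_los_angeles = {"young": 0.7, "old": 0.3}
p_z_new_york = {"young": 0.4, "old": 0.6}

effect_la = transport_effect(p_source, p_z_los_angeles)
effect_ny = transport_effect(p_source, p_z_new_york)
print(effect_la, effect_ny)
```

Only the weights change between the two calls: the same source-domain experimental table is reused, which is exactly what the transport formula licenses when the domains differ solely in the distribution of $Z$.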

Selection Diagrams [1]: Let $\langle M, M^\ast \rangle$ be a pair of structural causal models from domains $\langle \pi, \pi^\ast \rangle$, sharing causal diagram $G$. The pair $\langle M, M^\ast \rangle$ induces a selection diagram $D$ if $D$ obeys the following criteria: (1) every edge in $G$ is also an edge in $D$, and (2) $D$ contains an extra edge $S_i \rightarrow V_i$ whenever there might exist some discrepancy $f_i \neq f_i^\ast$ or $P(U_i) \neq P^\ast(U_i)$ between $M$ and $M^\ast$.
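The two criteria translate almost directly into code. The following is a small sketch, assuming a graph stored as a set of `(parent, child)` pairs. The function name `selection_diagram`, the `S_<name>` labelling of selection nodes and the three-variable example graph are all illustrative choices, not taken from the paper.

```python
def selection_diagram(edges, differing):
    """Return the edge set of a selection diagram D.

    edges:     iterable of (parent, child) pairs of the shared causal diagram G
    differing: variables V_i whose mechanism f_i or noise P(U_i) may differ
               between the two domains
    """
    edges = set(edges)
    vertices = {v for edge in edges for v in edge}
    unknown = set(differing) - vertices
    if unknown:
        raise ValueError(f"variables not in G: {sorted(unknown)}")
    d_edges = set(edges)                            # (1) keep every edge of G
    d_edges |= {(f"S_{v}", v) for v in differing}   # (2) add S_i -> V_i
    return d_edges


# Hypothetical diagram for the Los Angeles / New York example:
# age Z affects both treatment X and outcome Y, and only Z differs.
g = [("Z", "X"), ("Z", "Y"), ("X", "Y")]
d = selection_diagram(g, differing=["Z"])
print(sorted(d))
```

Note that the absence of a selection node is the informative part: a variable without an incoming $S$ edge is assumed to share its mechanism across both domains.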

The $S$ variables in the selection diagram identify the mechanisms where structural differences in the data-generating process take place between models under different domains. Knowledge of these structural overlaps of different causal domains allows us to formalise what it means to transfer knowledge between domains. This is the notion of $mz$-transportability discussed earlier. Simply put, knowledge is transferable between domains only if the causal effect $R$ can be determined from information available in the observational and interventional distributions.

$mz$-Transportability [1]: Let $\mathcal{D} = \{ D^{(1)}, \dots, D^{(n)} \}$ be a collection of selection diagrams with source domains $\Pi = \{ \pi_1,\dots,\pi_n \}$ and target domain $\pi^\ast$. Let $\boldsymbol{Z}_i$ be the variables on which experiments can be conducted in domain $\pi_i$. If $\langle P^i, I_z^i \rangle$ are the observational and interventional distributions, then the causal effect $R = P_{\boldsymbol{x}}^\ast (\boldsymbol{y})$ is said to be $mz$-transportable from $\Pi$ to $\pi^\ast$ in $\mathcal{D}$ if $P_{\boldsymbol{x}}^\ast(\boldsymbol{y})$ is uniquely computable from $\bigcup_{i=1,\dots,n} \langle P^i, I_z^i \rangle \cup \langle P^\ast, I_z^\ast \rangle$ in any model that induces $\mathcal{D}$.

The above condition has a counterpart that can be written in terms of $do$-calculus criteria.

Theorem: Let symbols be defined as above. The effect $R = P^{\ast} (\boldsymbol{y} \mid do(x))$ is $mz$-transportable from $\Pi$ to $\pi^\ast$ if the expression $P(\boldsymbol{y} \mid do(x), \boldsymbol{S}_1, \dots, \boldsymbol{S}_n)$ is reducible using the rules of $do$-calculus to an expression in which (1) do-operators that apply to subsets of $I_z^i$ have no $\boldsymbol{S}_i$-variables or (2) do-operators apply only to subsets of $I_z^\ast$.

The authors go on to show that $do$-calculus is complete for finding these transport formulae: whenever a transport formula exists, it can be derived with $do$-calculus. They also prove completeness for an established algorithm for computing transport formulae. Refer to the source material [1] for details of this algorithm.

BP 2014 Fig2

Figures (a) through (f) show illustrative examples of transportability in causal selection diagrams. These highlight the importance of the nature of unobserved confounders. (a) shows an example in which transportability of $R=P^\ast(y \mid do(x))$ is trivially solved by re-weighting the variable directly affected by the difference-generating variable, in this case $S \rightarrow Z$. (b) shows the simplest example in which one cannot transport a causal relation between domains. Even with randomisation on $X$, the causal effect is not uniquely computable due to unobserved confounders (UCs). (c) and (d) show examples where transportability of causal effects requires interventional information over $Z_1$ in $\pi_1$ and $Z_2$ in $\pi_2$, but not over $\{Z_1,Z_2\}$ in the combined domain. (e) and (f) show examples where transportability is only possible in the combined domain. Figure extracted from [1].

This process of transferring knowledge relates well to the concept of unifying big data. The ability to fuse multiple datasets, collected under heterogeneous conditions, without incurring large bias penalties is critically important for generalising an agent’s ability to learn under different conditions. Bareinboim and Pearl [2] review this problem of data fusion under the auspices of causal inference. Lee et al. [3] argue that identifiability and randomisation are two extremes in the approach to inferring cause-effect relationships from some combination of observations, experiments and prior (substantive) knowledge. In fact, $z$-identifiability (zID) generalises exactly this question for the case where all possible interventions (experiments) are available. The authors argue that this requirement is (obviously) not always reasonable and propose a generalisation, g-identifiability (gID, defined later), together with an algorithm that returns any expression derivable from an arbitrary collection of observations and experiments. The following theory introduces a strategy used to prove non-gID, which yields a graphical, necessary and sufficient condition for the causal decision problem of interest. We start by defining a c-component.

C-component [4]: Let causal graph $\mathcal{G}$ be such that a subset of its bidirected arcs forms a spanning tree over all its vertices. Then $\mathcal{G}$ is a confounded component (c-component).

With this definition in mind, notice the c-components in the figure above. We use $\mathcal{C}(\mathcal{G})$ to denote the set of c-components that partitions the vertices in $\mathcal{G}$ such that $\mathcal{C}(\mathcal{G}) = \{\boldsymbol{W}_i\}_{i=1}^k$ implies that $\mathcal{G}[\boldsymbol{W}_i]$ is a c-component for every $\boldsymbol{W}_i \subseteq \boldsymbol{V}$, the endogenous (visible) variables.
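Since a spanning tree of bidirected arcs exists exactly when the vertices are connected through bidirected arcs, the partition $\mathcal{C}(\mathcal{G})$ can be found as the connected components of the bidirected part of the graph. Below is a minimal sketch in plain Python. The function name `c_components` and the four-variable example are hypothetical.

```python
def c_components(vertices, bidirected):
    """Partition `vertices` into c-components.

    vertices:   iterable of observed variable names
    bidirected: iterable of (a, b) pairs, one per bidirected arc a <-> b
    Directed edges play no role in this partition, so they are not needed.
    """
    vertices = list(vertices)
    adjacency = {v: set() for v in vertices}
    for a, b in bidirected:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        # Depth-first search along bidirected arcs only.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical graph: X <-> Y <-> Z are confounded, W is not.
parts = c_components(["W", "X", "Y", "Z"], [("X", "Y"), ("Y", "Z")])
print(parts)
```

A variable with no bidirected arcs, such as `W` here, forms a singleton c-component.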

C-forest [3]: A causal graph $\mathcal{G}$ with root set $\boldsymbol{R}$ is an $\boldsymbol{R}$-rooted c-forest if $\mathcal{G}$ is a c-component with a minimal number of edges.

We now refer to the figure below. All of figures (a) through (b) are c-components since there are unobserved confounders (bidirected edges) spanning the vertices. Further, figures (a) through (c) are c-forests since they have a minimal number of spanning bidirected edges.
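One way to make "minimal number of edges" checkable is sketched below. It assumes a common reading of the c-forest condition that is not spelled out in the text above: the bidirected edges form a spanning tree, and every vertex has at most one child along directed edges. The root set is then taken to be the vertices without children. The function names and the toy graphs are hypothetical.

```python
def _connected(vertices, undirected_edges):
    """True if the undirected graph on `vertices` is connected."""
    adjacency = {v: set() for v in vertices}
    for a, b in undirected_edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = set(), [vertices[0]]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adjacency[node] - seen)
    return len(seen) == len(vertices)


def is_c_forest(vertices, directed, bidirected):
    """Check the assumed c-forest conditions described in the lead-in."""
    vertices = list(vertices)
    if not vertices:
        return False
    arcs = {frozenset(edge) for edge in bidirected if edge[0] != edge[1]}
    # Spanning tree: connected, with exactly |V| - 1 distinct bidirected arcs.
    if len(arcs) != len(vertices) - 1:
        return False
    if not _connected(vertices, [tuple(arc) for arc in arcs]):
        return False
    # At most one child per vertex along directed edges.
    children = {v: set() for v in vertices}
    for parent, child in directed:
        children[parent].add(child)
    return all(len(kids) <= 1 for kids in children.values())


def root_set(vertices, directed):
    """Vertices with no outgoing directed edge."""
    parents = {parent for parent, _child in directed}
    return {v for v in vertices if v not in parents}


# Hypothetical graphs: X -> Y with X <-> Y is a {Y}-rooted c-forest;
# adding a redundant bidirected arc to a triangle breaks minimality.
small = is_c_forest(["X", "Y"], [("X", "Y")], [("X", "Y")])
dense = is_c_forest(
    ["W", "X", "Y"],
    [("W", "Y"), ("X", "Y")],
    [("W", "X"), ("X", "Y"), ("W", "Y")],
)
print(small, dense, root_set(["X", "Y"], [("X", "Y")]))
```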

Hedge [3]: A hedge is a pair of $\boldsymbol{R}$-rooted c-forests $\langle \mathcal{F}, \mathcal{F}^\prime \rangle$ such that $\mathcal{F}^\prime \subseteq \mathcal{F}$.

By this definition, figures (a) and (b) are hedges because we can find two c-forests $\mathcal{F}$ and $\mathcal{F}^\prime$ such that $\mathcal{F}^\prime \subseteq \mathcal{F}$. Crucially, (c) is not a hedge since the spanning bidirected edges are not minimal. This type of structure prevents g-identifiability, which is now formalised and discussed.

LCB 2019 Fig3

Figure showing examples of hedges, c-components, c-forests, and thickets. These form graphical criteria for g-identifiability. Details are discussed in the text itself. Thickets are shown to preclude g-identifiability. Crucially, (d) is shown to be an overlap of hedges which forms a thicket. Figure extracted from [3].

g-Identifiability [3]: Let $\boldsymbol{X}, \boldsymbol{Y}$ be disjoint sets of variables, $\mathbb{Z} = \{\boldsymbol{Z}_i\}_{i=1}^m$ be a collection of sets of variables, and let $\mathcal{G}$ be a causal diagram. If $P_{\boldsymbol{x}}(\boldsymbol{y})$ is uniquely computable from distributions $\{ P(\boldsymbol{V} \mid do(\boldsymbol{z})) \}_{\boldsymbol{Z}\in\mathbb{Z}, \boldsymbol{z} \in dom(\boldsymbol{Z})}$ in any causal model which induces $\mathcal{G}$, we say that $P_{\boldsymbol{x}}(\boldsymbol{y})$ is $g$-identifiable from $\mathbb{Z}$ in $\mathcal{G}$. Here $P(\boldsymbol{V})$ is the probability distribution describing the natural state of the system (assumed to be available).

Simply put, we say the distribution is $g$-identifiable with respect to a collection of sets of intervenable variables in the causal system if the corresponding experiments are sufficient to uniquely compute it. These are the variables we intervene on by doing an experiment, as we discussed earlier. In this way it is a generalisation of the $z$-identifiability discussed earlier. We now introduce some more definitions needed for the non-gID criteria.

Hedgelet decomposition [3]: The hedgelet decomposition of a hedge $\langle \mathcal{F}, \mathcal{F}^\prime \rangle$ is the collection of hedgelets $\{ \mathcal{F}(\boldsymbol{W}) \}_{\boldsymbol{W} \in \mathcal{C}(\mathcal{F}^{\prime\prime})}$ (with $\mathcal{F}^{\prime\prime} = \mathcal{F}\setminus\mathcal{F}^{\prime}$) where each hedgelet $\mathcal{F}(\boldsymbol{W})$ is a subgraph of $\mathcal{F}$ made of (i) $\mathcal{F}[\boldsymbol{W}(\mathcal{F}) \cup \boldsymbol{W}]$ and (ii) $\mathcal{F}[De(\boldsymbol{W}_\mathcal{F})]$ without bidirected edges.

Referring back to the reference figure, some possible hedgelet decompositions are colour coded in blue and red to indicate distinct hedgelets, with purple used to indicate the shared variables (commonly root sets). This leads us nicely to the last definition we need for this criterion. Though this definition appears arbitrarily technical, it is rather intuitive once the reasoning is developed.

Thicket [3]: Let $\boldsymbol{R}$ be a non-empty set of variables and $\mathbb{Z}$ be a collection of sets of variables in $\mathcal{G}$. A thicket $\mathcal{J} \subseteq \mathcal{G}$ is an $\boldsymbol{R}$-rooted c-component consisting of a minimal c-component over $\boldsymbol{R}$ and hedges $\mathbb{F}_{\mathcal{J}} = \{ \langle \mathcal{F}_{\boldsymbol{Z}}, \mathcal{J}[\boldsymbol{R}] \rangle \mid \mathcal{F}_{\boldsymbol{Z}} \subseteq \mathcal{G} \setminus \boldsymbol{Z}, \boldsymbol{Z} \cap \boldsymbol{R} = \emptyset \}_{\boldsymbol{Z} \in \mathbb{Z}}$.

Let’s consider this definition step-by-step by considering figure (c). First, we notice the graph is a c-component that contains a minimal c-component. It does not necessarily need to be a c-forest itself. Next, we need a pair of $\boldsymbol{R}$-rooted c-forests $\langle \mathcal{F}_{\boldsymbol{Z}}, \mathcal{J}[\boldsymbol{R}] \rangle$. We select the graphs induced by sets $\{W,X_1,R\}$ and $\{W,X_2,R\}$ with $\mathbb{Z} = \{\{X_1\},\{X_2\}\}$ the collection of intervention sets. Then we have hedges $\langle \mathcal{F}_{X_1}, \mathcal{J}[R] \rangle$ and $\langle \mathcal{F}_{X_2}, \mathcal{J}[R] \rangle$ that overlap and have intervention sets $\boldsymbol{Z}\in\mathbb{Z}$ that do not intersect with the root set, $\boldsymbol{R}=\{R\}$. Basically, a thicket is an overlapping of hedges, and hedges were the ‘bad’ structure that prevented gID in the causal graph. Though this is a fairly involved procedure to do manually, especially on large causal graphs, it is algorithmically feasible as shown in Lee et al. The usefulness of this algorithm relies on the following result.

Thicket non-gID [3]: If there exists some thicket $\mathcal{J}$ for $P_{\boldsymbol{x}}(\boldsymbol{y})$ in causal graph $G$ with respect to intervention set $\mathbb{Z}$, then $P_{\boldsymbol{x}}(\boldsymbol{y})$ is not g-identifiable in $G$.

To make this idea explicit we include the following figure extracted from slides provided directly by Sanghack Lee, coauthor of several papers (including [3]) presented in this work [5].

ThicketsSanghackLeeSlides

Thicket structure for $P_x(y)$ identified as an overlap of distinct hedges, each coloured as a red rounded triangle. Extracted from [5].

This completes the required formalisms for identifying structural constraints from explicit causal models. This ties well into the next section, in which we discuss how we can apply this theory to learn causal structure from observational and interventional data. This is especially useful for allowing reinforcement learning agents to uncover causal structure.

Image credit: Header Image.

References

  1. Bareinboim, E., & Pearl, J. (2014). Transportability from multiple environments with limited experiments: Completeness results. NIPS.
  2. Bareinboim, E., & Pearl, J. (2016). Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences 113:7345–7352.
  3. Lee, S., Correa, J. D., & Bareinboim, E. (2019). General identifiability with arbitrary surrogate experiments. UAI.
  4. Tian, J., & Pearl, J. (2002). Studies in causal reasoning and learning.
  5. Lee, S. (2020). General identifiability with arbitrary surrogate experiments. AAAI 2020 (presented at UAI 2019).
Written by St John

Author of the Asking Why Blog - a personal blog and website with everything I find interesting.