The third paper in a series on how learning health systems can use routinely collected electronic health data (EHD) to advance knowledge and support continuous learning, this review describes how analytical methods for individual-level EHD, including regression approaches, interrupted time series (ITS) analyses, instrumental variables, and propensity score methods, can be used to address the question of whether an intervention “works.”

The two major potential sources of bias in non-experimental studies of health care interventions are that the treatment groups compared do not have the same probability of treatment or exposure and the potential for confounding by unmeasured covariates. Although very different, the approaches presented in this paper are all based on assumptions about data, causal relationships, and biases. For instance, regression approaches assume that the relationship between the treatment, outcome, and other variables is properly specified, that all of the variables are available for analysis (i.e., no unobserved confounders) and measured without error, and that the error term is independent and identically distributed. The instrumental variables approach requires identifying an instrument that is related to the assignment of treatment but otherwise has no direct effect on the outcome. Propensity score methods, on the other hand, assume that there are no unobserved confounders. The epidemiological designs discussed also make assumptions, for instance that individuals can serve as their own control.

To properly address these assumptions, analysts should conduct sensitivity analyses within the assumptions of each method to assess the potential impact of what cannot be observed. Researchers should also analyze the same data with different analytical approaches that make alternative assumptions, and apply the same methods to different data sets. Finally, different analytical methods, each subject to different biases, should be used in combination and together with different designs, to limit the potential for bias in the final results.

Learning health systems use routinely collected electronic health data (EHD) to advance knowledge and support continuous learning. Even without randomization, observational studies can play a central role as the nation’s health care system embraces comparative effectiveness research and patient-centered outcomes research. However, neither the breadth, timeliness, and volume of the available information nor sophisticated analytics allow analysts to confidently infer causal relationships from observational data. Rather, depending on the research question, careful study design and appropriate analytical methods can improve the utility of EHD.

This is the second paper in a series (see Box 1) on how learning health systems can use routinely collected electronic health data (EHD) to advance knowledge and support continuous learning, this review summarizes study design approaches, including choosing appropriate data sources, and methods for design and analysis of natural and quasi-experiments. The first paper

This is one of a series of four papers intended to (1) illustrate how existing electronic health data (EHD) can be used to improve performance in learning health systems, (2) describe how to frame research questions to use EHD most effectively, and (3) determine the basic elements of study design and analytical methods that can help to ensure rigorous results in this setting.

Paper 1, “Framing the Research Question,”

Paper 2, “Design of observational studies,”

Paper 3, this paper, describes how analytical methods for individual-level EHD, including regression approaches, interrupted time series (ITS) analyses, instrumental variables, and propensity score methods, can be used to better assess whether interventions improve outcomes of interest.

Paper 4, “Delivery system science,”

When the question is whether an intervention improves outcomes of interest, the second paper in this series illustrates how study design methods can help researchers identify valid results that better balance internal and external validity than RCTs. The methods discussed include choosing appropriate data sources, epidemiologic designs, methods for design of natural and quasi-experiments, and the use of logic models. The primary issue addressed by these evaluation designs is how to estimate the counterfactual – what would have happened if the intervention had not been implemented. Even with a strong design, however, the potential for bias remains.

Faced with the need to infer cause and effect when randomization is not feasible, statisticians and econometricians have developed a series of analytical methods for “causal analysis.” The current paper complements the second by describing how analytical methods for individual-level EHD, including regression approaches, interrupted time series (ITS) analyses, instrumental variables, and propensity score methods, can also be used to address the question of whether the intervention “works.” Each of these approaches addresses Cochran’s call for methods to adjust for differences in observed characteristics between treatment and control groups in order to isolate the effect of an intervention from other factors.

This paper does not attempt to serve as a textbook or describe these approaches in detail. Rather, it presents these methods in a consistent framework. Because the use of existing EHD is not yet well developed, some of the examples use other types of data but were chosen to illustrate the methods.

The methods discussed in this section primarily involve the use of individual-level EHD. Since each of these paradigms faces the same basic inference questions, there is some overlap in the material covered, and throughout we explain how each method relies on assumptions that are often not possible to verify with the existing data. This paper concludes with a discussion of “analyzing observational data like randomized experiments.” This is not so much an analytic method

As background for the methods described in this section, it is useful to clarify a framework for causal inference. The fundamental idea is that, for a given individual, the “effect” of a treatment is based on the difference between the outcome that would be observed if the person receives the treatment and what would be observed if the person receives the comparison condition instead (the counterfactual). The problem, of course, is that no single individual can receive

The regression-based approaches described assume that all of the factors that differentiate the treatment and control group members are represented in the observed variables and covariates. The instrumental variables approach identifies special variables (the “instruments”) that affect treatment but are unrelated to outcomes except through the treatment, and estimates how much of the variation in the treatment variable that is induced by the instrument – and only that induced variation – affects the outcome measure. Propensity score methods model the factors related to the probability of treatment assignment and, typically, match treated and untreated based on such probabilities. These models also assume that the causal model is correctly specified. This can be hard to assess, but directed acyclic graphs (DAGs) can be used to clarify assumptions about causal pathways and use their representation in graphical form to guide selection of covariates for statistical adjustment through structural equation models (SEM) or other approaches,

Perhaps the simplest and most intuitive approach to analyzing observational data is to fit a linear statistical model of the form

Y_{i} = β_{0} + β_{1}X_{i} + β_{2}Z_{i} + e_{i}    (1)

where

Y_{i} is the outcome variable for subject i

X_{i} is an indicator variable for the treatment, e.g. 1 for treatment group and 0 for control

β_{1} is the effect of the treatment, conditional on the covariates

Z_{i} represents other factors that influence the outcome

e_{i} is an independent and identically distributed (iid) error term.

The parameters in Equation 1 are typically estimated by ordinary least squares (OLS) methods, and such “OLS estimates” are commonly used to describe this regression approach. In this model, the fitted value of β_{1} estimates the effect of the treatment, and can be evaluated using standard statistical hypothesis tests. This approach can be extended as necessary if Y_{i} is categorical or dichotomous (e.g., logistic regression), there are multiple Z’s, or the relationship is non-linear. Another extension is known as a “difference-in-differences” approach, which uses the difference in an outcome variable before and after an intervention as Y_{i}, which can have the benefit of individuals serving as their own control.
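As a minimal sketch of this regression approach, the following Python code fits a model of the form Y_{i} = β_{0} + β_{1}X_{i} + β_{2}Z_{i} + e_{i} by OLS and computes a conventional t statistic for β_{1}. The data are simulated, and the sample size, coefficients, and the helper name `fit_ols` are illustrative assumptions rather than anything taken from the studies discussed here.

```python
import numpy as np

def fit_ols(y, design):
    """Return OLS coefficients and their conventional (iid-error) standard errors."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    dof = design.shape[0] - design.shape[1]
    sigma2 = resid @ resid / dof                      # residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)   # coefficient covariance matrix
    return coef, np.sqrt(np.diag(cov))

# Simulated (hypothetical) data: Z affects both treatment X and outcome Y.
rng = np.random.default_rng(0)
n = 2000
Z = rng.normal(size=n)
X = (rng.normal(size=n) + 0.8 * Z > 0).astype(float)   # 1 = treatment, 0 = control
Y = 1.0 + 2.0 * X + 1.5 * Z + rng.normal(size=n)       # true treatment effect is 2.0

design = np.column_stack([np.ones(n), X, Z])           # columns: intercept, X, Z
coef, se = fit_ols(Y, design)
beta1, beta1_se = coef[1], se[1]
t_stat = beta1 / beta1_se                              # tests H0: beta_1 = 0
print(f"beta_1 = {beta1:.2f} (SE {beta1_se:.2f}), t = {t_stat:.1f}")
```

Because Z is observed and the model is correctly specified in this simulation, the fitted β_{1} should land close to the simulated effect. The standard errors are valid only under the iid error assumption.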

Despite the simplicity of this approach, there are many ways that things can go wrong when it is applied to observational data. Most basically, regression approaches assume that the actual causal relationship between the treatment, outcome, and other variables is properly specified, that all of the variables are available for analysis and measured without error, and that the error term is independent and identically distributed. In particular, the two groups may be on different trajectories, and would not have exhibited the same difference after the intervention that they did before. The relationship could be improperly specified: the functional form could be incorrect, or a variable omitted from the model may have a relationship with Y, X, and/or Z. In addition, X and/or Z may depend, in part, on Y, for example if the treatment received (X) is dependent on Z (confounding bias) or Y (endogeneity or selection bias). Also, some of the Z may not be available for analysis, or measurement errors may affect X and/or Z. These problems could result in bias in the estimated treatment effect (β_{1}). They could also cause the error term e_{i} to not be iid, which would lead to incorrect confidence intervals and hypothesis tests.

It is standard practice in econometrics to assess omitted variable bias by identifying the available variables that are most closely related to the missing variable and seeing how the results change when these variables are dropped from the model. One never knows, however, how well these variables capture the effect of the missing variable, or of additional missing factors that may exist but are unknown to the researcher.
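The check described above can be sketched as follows. This is a simulation under stated assumptions, not an analysis of real data: `U` plays the role of the missing variable, `proxy` is a hypothetical observed variable closely related to it, and all coefficients are invented for illustration.

```python
import numpy as np

def treatment_coef(y, x, covariates):
    """OLS coefficient on x from regressing y on an intercept, x, and the covariates."""
    design = np.column_stack([np.ones(len(y)), x, *covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

rng = np.random.default_rng(1)
n = 20000
U = rng.normal(size=n)               # unobserved confounder (not available to the analyst)
proxy = U + rng.normal(size=n)       # observed variable closely related to U
other = rng.normal(size=n)           # observed covariate unrelated to U
X = (0.8 * U + rng.normal(size=n) > 0).astype(float)
Y = 1.0 + 2.0 * X + 1.5 * U + 0.5 * other + rng.normal(size=n)   # true effect is 2.0

with_proxy = treatment_coef(Y, X, [proxy, other])
without_proxy = treatment_coef(Y, X, [other])
shift = without_proxy - with_proxy   # how much the estimate moves when the proxy is dropped
print(f"with proxy: {with_proxy:.2f}, without proxy: {without_proxy:.2f}, shift: {shift:.2f}")
```

In this simulation the estimate moves noticeably when the proxy is dropped, which signals sensitivity to the missing variable. The estimate that includes the proxy is still biased away from the simulated effect, which illustrates why the size of the shift does not reveal how much bias remains.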

A variant of this approach, the regression discontinuity method,

The functional form of f(Z_{i}) could simply be linear (which would be modeled as β_{2}Z_{i}) or alternatively Y or Z could be transformed to fit this approach. As in other regression approaches, the fitted value of β_{1} estimates the effect of the treatment, and can be evaluated using standard statistical hypothesis tests. Figure

Figure: Regression Discontinuity Method. Source: Adapted from Dowd & Oakes.

The within-study comparisons literature has shown that RD analyses of observational data generally replicate RCT results well despite the use of different statistical methods to estimate the RD effect. If the functional form f(Z_{i}) is misspecified, however, the estimate of β_{1} will be biased. Furthermore, the estimated effect is considered valid only for observations close to the cutoff of the assignment variable Z, not more generally. Another challenge, relating to implementation, is ensuring that assignment to both the treatment and comparison conditions adheres strictly to the cutoff value of Z.
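A minimal sketch of a sharp regression discontinuity estimate is given below, assuming a linear f(Z_{i}) and strict assignment at the cutoff. The cutoff, bandwidths, simulated coefficients, and the helper name `rd_estimate` are hypothetical choices made for illustration.

```python
import numpy as np

def rd_estimate(y, x, z, cutoff, bandwidth):
    """Estimate beta_1 by OLS using only observations within `bandwidth` of the cutoff."""
    keep = np.abs(z - cutoff) <= bandwidth
    centered = z[keep] - cutoff                      # center Z at the cutoff
    design = np.column_stack([np.ones(keep.sum()), x[keep], centered])
    coef, *_ = np.linalg.lstsq(design, y[keep], rcond=None)
    return coef[1]

rng = np.random.default_rng(2)
n = 4000
cutoff = 50.0
Z = rng.uniform(0, 100, size=n)                      # assignment variable
X = (Z >= cutoff).astype(float)                      # treatment assigned strictly by the cutoff
Y = 10.0 + 3.0 * X + 0.2 * Z + rng.normal(scale=2.0, size=n)   # true jump at cutoff is 3.0

beta1_all = rd_estimate(Y, X, Z, cutoff, bandwidth=50.0)    # uses every observation
beta1_local = rd_estimate(Y, X, Z, cutoff, bandwidth=10.0)  # uses observations near the cutoff
print(f"all data: {beta1_all:.2f}, near cutoff: {beta1_local:.2f}")
```

Narrowing the bandwidth relies less on the assumed functional form but uses fewer observations, so the local estimate is less precise.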

To motivate the need for interrupted time series (ITS) methods,

Figure

Figure: Traditional Difference in Differences Analysis of RaPP Study. Source: Ross-Degnan and colleagues.

Figure: ITS Analysis of RaPP Study: Intervention Group Only. Source: Ross-Degnan and colleagues.

Figure

Figure: ITS Logic and Parameters Estimated by Segmented Linear Regression. Source: Ross-Degnan and colleagues.

Parameters of ITS Model. Source: Ross-Degnan and colleagues.

In this model, β_{2} is the effect of implementing policy 1 and β_{4} is the effect of implementing policy 2, conditional on the covariates.
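To make the roles of β_{2} and β_{4} concrete, the sketch below fits a segmented linear regression with two policy changes to a simulated monthly series. The parameterization (a baseline trend, plus a level-change indicator and a time-since-policy term for each policy) is an assumption chosen so that β_{2} and β_{4} are the level changes. The start months and coefficients are hypothetical, and covariates are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)
months = np.arange(48, dtype=float)
start1, start2 = 16, 32                          # hypothetical policy start months

policy1 = (months >= start1).astype(float)       # 1 once policy 1 is in effect
policy2 = (months >= start2).astype(float)       # 1 once policy 2 is in effect
since1 = np.clip(months - start1, 0, None)       # months elapsed since policy 1
since2 = np.clip(months - start2, 0, None)       # months elapsed since policy 2

# Simulated outcome with known level changes of -4 (policy 1) and -2 (policy 2).
Y = (50.0 + 0.3 * months
     - 4.0 * policy1 - 0.2 * since1
     - 2.0 * policy2 + 0.1 * since2
     + rng.normal(scale=0.3, size=months.size))

# Columns correspond to beta_0 ... beta_5 in the assumed parameterization.
design = np.column_stack([np.ones_like(months), months, policy1, since1, policy2, since2])
beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
level_change_1, level_change_2 = beta[2], beta[4]
print(f"policy 1 level change: {level_change_1:.2f}, policy 2 level change: {level_change_2:.2f}")
```

This sketch uses plain OLS and simulates independent errors. A real ITS analysis would also need to model the autocorrelation of the errors.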

One key assumption is that the baseline trend correctly reflects what would have happened after the intervention time point, had the intervention not occurred. This in turn depends on the trends within segments being linear and on the autocorrelation structure of the errors being correctly modeled. The New Hampshire example in Figure

The major threats to the validity of the ITS design are: confounding (i.e. a co-occurring intervention), selection (pre-intervention factors that affect inclusion in the intervention, such as volunteers), regression to the mean (groups selected on baseline values), instrumentation (changes in measurement or ascertainment), and history or maturation (some other event or natural process explains the observed effect).

The major threats to the reliability of ITS estimates are: unstable data and wild data points, low frequency outcomes (e.g., deaths), boundary conditions (e.g., percentages which are bounded between 0 and 100), short segments that inaccurately reflect the trend, changing denominator populations, and non-linear trends.

Ross-Degnan

check data quality: identify and remove outliers and implausible data, impute missing data;

contrast multiple outcomes or groups such as high-risk subgroups or differential response;

account for policy phase-in including anticipatory effects or post-intervention lag; and

test model assumptions including normality of errors, linearity of segments.

Although the description to this point assumes that there is only one group being followed, ITS analysis can also be used to compare two or more groups. Indeed, the results of an ITS analysis are strengthened if the comparison groups are matched by standardizing or using propensity scores (see below) or chosen using principles of natural and quasi-experimental design as discussed in Stoto.

In summary, the advantages of ITS analysis include an intuitive visual display, a direct estimate of effects, and the control it provides for common threats to validity. The limitations are that the ITS method requires reasonably stable series and relatively long segments. There can also be boundary problems and sensitivity to points near the ends of segments. Also, because ITS analysis uses aggregate data, there is no opportunity for patient-level adjustment, but one can use risk-adjusted rates.

The instrumental variables approach addresses the causal inference problem by identifying special variables (“instruments”) that affect the treatment that research subjects receive, but are unrelated to the outcomes they experience except through the treatment, and estimating how much of the variation in the treatment variable that is induced by the instrument – and only that induced variation – affects the outcome measure. The idea is that the instrument can be thought of as more plausibly randomly assigned than the treatment of interest, and that the instrument affects whether an individual takes the treatment but does not directly affect outcomes. A classic example of an IV is the distance subjects lived from a health care facility offering two types of emergency procedure.

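Specifically, this approach centers on two regression equations, generally fit by two-stage least squares:

X_{i} = α_{0} + α_{1}IV_{i} + α_{2}Z_{i} + f_{i}    (4)

Y_{i} = β_{0} + β_{1}\hat{X}_{i} + β_{2}Z_{i} + e_{i}    (5)

where

IV_{i} is the instrumental variable for subject i

\hat{X}_{i} is the predicted value of X_{i} after fitting equation 4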

β_{1} is the effect of the treatment, conditional on the covariates

e_{i} and f_{i} are iid error terms

and the rest of the variables are as defined above. Figure

Figure: Causal Diagrams for RCTs and Instrumental Variables. Source: Adapted from Dowd & Oakes.

For the instrumental variables approach to be effective, two critical assumptions must be true. The first is known as relevance, i.e. that there is a strong association between the IV and the treatment variable X. Using weak or poor instruments (those for which this association is not strong) can lead to biased and imprecise estimates of the treatment effect. The second assumption, exogeneity, is that the correlation between the IV and e_{i}, the error term in equation 5, must be zero. Leaving an important Z_{i} out of equation 4, perhaps because it was not available, can bias the estimated treatment effect (β_{1}). Unfortunately, there is no way to be sure that these conditions are met in any particular situation. There are tests to identify the best of multiple instruments conditional on having a good one, but none that test whether a particular instrument is truly exogenous, or whether any of the instruments are “good enough” to yield reasonably precise estimates. Baiocchi and colleagues demonstrate how the strength of an instrument relates to observed and unobserved covariates and discuss approaches for building stronger instruments and testing how well they work.
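The two-stage procedure can be sketched in a few lines of Python. This is a simulation in which both assumptions hold by construction: the instrument `IV` is relevant (it shifts X) and exogenous (it is independent of the unobserved confounder `U`). The treatment is simulated as continuous for simplicity, and all coefficients are illustrative assumptions.

```python
import numpy as np

def ols(y, design):
    """Return OLS coefficients for y regressed on the columns of design."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

rng = np.random.default_rng(4)
n = 20000
U = rng.normal(size=n)                    # unobserved confounder
Z = rng.normal(size=n)                    # observed covariate
IV = rng.normal(size=n)                   # instrument: affects X, no direct effect on Y
X = 0.8 * IV + 0.5 * Z + U + rng.normal(size=n)
Y = 1.0 + 2.0 * X + 0.5 * Z + 1.5 * U + rng.normal(size=n)   # true effect is 2.0

ones = np.ones(n)
# Stage 1: regress the treatment on the instrument and observed covariates.
stage1 = np.column_stack([ones, IV, Z])
X_hat = stage1 @ ols(X, stage1)
# Stage 2: regress the outcome on the predicted treatment and observed covariates.
beta_iv = ols(Y, np.column_stack([ones, X_hat, Z]))[1]
# Ordinary regression for comparison; biased here because U is omitted.
beta_ols = ols(Y, np.column_stack([ones, X, Z]))[1]
print(f"2SLS estimate: {beta_iv:.2f}, OLS estimate: {beta_ols:.2f}")
```

The point estimate from running the two stages by hand is the 2SLS estimate, but the standard errors reported by a naive second-stage OLS fit are not correct, so a dedicated 2SLS routine would be needed for inference.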

Rassen and colleagues

The study uses data from two sources: Pennsylvania’s (PA) Pharmaceutical Assistance Contract for the Elderly (PACE) program from 1994 to 2003, and British Columbia (BC) residents aged 65+ from 1996 to 2004. The comparison is between initiators of conventional vs. atypical APM therapy, and the outcome is mortality within 180 days of initiation (the index date). The available covariates reflect baseline patient characteristics (coexisting illnesses and use of health services) in the 6 months before the index date. Frailty, cognitive impairment, and ability to perform activities of daily living are all potentially important but are not available.

The study examined 25 different variants of PPP as an IV. The “base case” was the approach used in the original analysis: an indicator variable based on the physician’s current preference for conventional vs. atypical APM therapy. If the physician’s previous APM prescription was for a conventional APM, then for the next patient, the physician was classified as a “conventional APM prescriber.” Otherwise, the physician was classified as an “atypical APM prescriber.” Rassen and colleagues considered variants based on (1) preference assignment algorithm (e.g. the number of conventional APM prescriptions out of the previous 2-4 prescriptions), (2) cohort restrictions based on physician and patient characteristics, and (3) stratification criteria (e.g. patients of a similar age). Eventually they determined that restricting the analysis to primary care physicians produced the best instrument.

Table

Differences in Risk of All-cause Mortality Within 180 Days of Initiation of Conventional Versus Atypical APM Treatment

POPULATION AND VARIATION | EVENTS IN CONVENTIONAL APM GROUP | EVENTS IN ATYPICAL APM GROUP | UNADJUSTED OLS ESTIMATE | AGE/SEX-ADJUSTED OLS ESTIMATE | FULLY ADJUSTED OLS ESTIMATE^{a} | IV ANALYSIS ESTIMATE |
---|---|---|---|---|---|---|
Base case (unrestricted) | 1,806 | 2,307 | 4.46 (3.69, 5.23) | 4.49 (3.75, 5.22) | 3.55 (2.74, 4.37) | 4.00 (0.94, 7.06) |
Restricted to PCPs (R6) | 1,735 | 2,115 | 4.24 (3.41, 5.06) | 4.48 (3.68, 5.28) | 3.59 (2.70, 4.48) | 3.11 (-0.57, 6.79) |
Base case (unrestricted) | 1,307 | 1,628 | 2.69 (1.65, 3.73) | 2.47 (1.46, 3.49) | 3.91 (2.68, 5.13) | 7.69 (1.26, 14.12) |
Restricted to PCPs (R6) | 960 | 1,129 | 2.39 (1.07, 3.71) | 2.29 (0.98, 3.60) | 4.32 (2.71, 5.93) | 5.34 (-3.53, 14.21) |

^{a} Adjusted for age, sex, race, year of treatment, and history of diabetes, arrhythmia, cerebrovascular disease, myocardial infarction, congestive heart failure, hypertension, other ischemic heart disease, other cardiovascular disorders, dementia, delirium, mood disorders, psychotic disorders, other psychiatric disorders, antidepressant use, nursing home residence, and hospitalization. See text for description of the base case and restriction to PCPs. NOTE. The values within parentheses are 95 percent confidence intervals. Risk differences are expressed per 100 patients. Abbreviations: APM, antipsychotic medication; OLS, ordinary least squares; IV, instrumental variable; PCP, primary care physician. Source: Rassen and colleagues.

The estimates from an IV analysis often have larger standard errors, and this effect can be seen in Table

Rassen and colleagues

A final key assumption is that the instrument has no direct effect on the outcome. Rassen and colleagues argue that this is a reasonable assumption here, although it could be violated if, for example, physicians who prescribe a particular type of APM also tend to provide lower or higher quality of care in general.

Thus, even when done well, IV analysis is fraught with difficulty in interpretation because of the large standard errors. Based on these analyses, Rassen and colleagues conclude that PPP was at least a reasonably valid instrument in this setting, and implicitly that the IV estimates are superior to the OLS estimates. This type of careful analysis of whether a particular instrument is truly exogenous, or whether any of the instruments are “good enough” to yield reasonably precise estimates, is not common, and without it one cannot be sure that the results are valid.

Propensity score methods aim to equate treatment and comparison groups on a single variable, the probability of treatment, which is modeled from a set of observed characteristics and estimated on the pool of treatment group members and potential comparison group cases. The key to this is the propensity score, p, which is defined as the predicted probability of receiving the treatment given the observed covariates.

There are five basic steps involved in using propensity score methods:

1. Estimate

2. Use the propensity score to equate groups through matching, weighting, or sub-classification. Matching involves finding one or more comparison cases for each treated case that have similar values of

3. Check how well the equating worked to create balance in observed covariates. Since the goal is to reduce bias by forming groups that look similar on the observed covariates, we can see how well the matching worked by comparing the distributions of the covariates in the equated treatment and comparison groups.

4. Estimate the treatment effect by comparing outcomes in equated groups. With matching, this involves comparing outcomes in the matched groups (some weighting will be required if the number of matches selected for some treatment group observations differs from the number selected for other treatment group cases). Alternatively, use the weights described in step 2 to calculate the average treatment effect. With sub-classification, effects are estimated separately within each subclass and then aggregated. (Note that these approaches are the same ones used to calculate the balance measures in step 3.)

5. Conduct a sensitivity analysis to unobserved confounding. This can be done in a number of ways, for instance by positing an unobserved confounder and obtaining adjusted impact estimates if that confounder existed, given its assumed characteristics.
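The first four steps can be sketched as follows, using weighting as the equating method. This is a simulation under stated assumptions, not a reproduction of any study discussed here: the propensity model is a logistic regression fit by Newton-Raphson, the weights are inverse probability of treatment weights, and there is no unobserved confounding by construction. The helper names `fit_logistic` and `std_diff` are hypothetical, and the sensitivity analysis in the final step is not shown.

```python
import numpy as np

def fit_logistic(design, treated, iterations=25):
    """Fit a logistic regression by Newton-Raphson and return the coefficients."""
    coef = np.zeros(design.shape[1])
    for _ in range(iterations):
        p = 1.0 / (1.0 + np.exp(-design @ coef))
        gradient = design.T @ (treated - p)
        hessian = design.T @ (design * (p * (1.0 - p))[:, None])
        coef += np.linalg.solve(hessian, gradient)
    return coef

def std_diff(covariate, treated, weights):
    """Weighted difference in group means, scaled by the overall standard deviation."""
    m1 = np.average(covariate[treated == 1], weights=weights[treated == 1])
    m0 = np.average(covariate[treated == 0], weights=weights[treated == 0])
    return (m1 - m0) / covariate.std()

rng = np.random.default_rng(5)
n = 20000
Z = rng.normal(size=(n, 2))                                   # observed covariates
true_p = 1.0 / (1.0 + np.exp(-(0.8 * Z[:, 0] - 0.5 * Z[:, 1])))
X = (rng.uniform(size=n) < true_p).astype(float)              # treatment indicator
Y = 1.0 + 2.0 * X + 1.5 * Z[:, 0] + 1.0 * Z[:, 1] + rng.normal(size=n)   # true effect 2.0

# Estimate the propensity score p from the observed covariates.
design = np.column_stack([np.ones(n), Z])
p = 1.0 / (1.0 + np.exp(-design @ fit_logistic(design, X)))
# Equate the groups by inverse probability of treatment weighting.
w = X / p + (1.0 - X) / (1.0 - p)
# Check covariate balance before and after weighting.
unit = np.ones(n)
balance_before = [std_diff(Z[:, j], X, unit) for j in range(2)]
balance_after = [std_diff(Z[:, j], X, w) for j in range(2)]
# Estimate the average treatment effect as a difference in weighted means.
ate = (np.average(Y[X == 1], weights=w[X == 1])
       - np.average(Y[X == 0], weights=w[X == 0]))
naive = Y[X == 1].mean() - Y[X == 0].mean()
print(f"naive difference: {naive:.2f}, weighted estimate: {ate:.2f}")
```

Matching or sub-classification on `p` could be substituted for the weighting step. In either case the estimate is only as good as the assumption that all confounders are among the observed covariates.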

Schneeweiss and colleagues

To illustrate the use of propensity score analysis, Stuart and colleagues

As the propensity score, Stuart and colleagues

Figure

Standardized Differences Between the Experimental Groups on Covariates Before (Hollow Dots) and After (Solid Dots) Propensity Score Weighting

Although not directly related to assessing cause and effect relationships, Forrest and colleagues

For example, consider the case of biologic therapy for Crohn’s disease. Agents targeted to reduce TNFα-mediated inflammation are rational therapeutic choices; their efficacy has been demonstrated in adults by an RCT, and the REACH study

Forrest and colleagues

The primary outcome variables are clinical and steroid-free remission. Forrest and colleagues

Percentage of Trials Achieving Remission and Corticosteroid-free Remission During 26- and 52-Week Follow-up Periods

OUTCOME | DURATION OF FOLLOW-UP | INITIATOR TRIALS, % ACHIEVING OUTCOME (95% CI) | NON-INITIATOR TRIALS, % ACHIEVING OUTCOME (95% CI) |
---|---|---|---|
Clinical remission | 26 weeks | 54.4 (47.7–61.1) | 41.2 (38.2–44.2) |
Clinical remission | 52 weeks | 66.6 (60.3–72.8) | 56.2 (53.2–59.3) |
Corticosteroid-free remission | 26 weeks | 47.3 (40.6–53.9) | 31.2 (28.4–34.0) |
Corticosteroid-free remission | 52 weeks | 60.1 (53.7–66.5) | 47.5 (44.5–50.5) |
Clinical remission | 26 weeks | 54.8 (47.2–62.4) | 40.7 (36.5–45.0) |
Clinical remission | 52 weeks | 67.3 (60.1–74.4) | 55.6 (51.1–60.1) |
Corticosteroid-free remission | 26 weeks | 45.6 (38.1–53.1) | 30.8 (26.8–34.7) |
Corticosteroid-free remission | 52 weeks | 58.8 (51.5–66.2) | 47.0 (42.5–51.5) |

Note: Proportions were adjusted for patient age, gender, and race, disease location, duration, and phenotype, and concurrent medications, all measured at baseline of the trial. Adapted from Forrest and colleagues.

Based on these analyses, Forrest and colleagues

When the question is whether an intervention improves outcomes of interest, the second paper in this series

The two major potential sources of bias in non-experimental studies of health care interventions are that the treatment groups compared do not have the same probability of treatment or exposure and the potential for confounding by unmeasured covariates. This paper described a range of analytical methods for the analysis of individual data – deriving primarily from statistics and econometrics – that may help to address these problems. These methods include statistical methods such as regression approaches, propensity score methods, instrumental variables, and clinical trial methods.

Although these approaches are very different, they all are based on assumptions about data, causal relationships, and biases. Regression approaches assume that the actual causal relationship between the treatment, outcome, and other variables is properly specified, that all of the variables are available for analysis (i.e., no unobserved confounders) and measured without error, and that the error term is independent and identically distributed. The instrumental variables approach requires identifying an instrument that is related to the assignment of treatment but otherwise has no direct effect on the outcome. Propensity score methods, on the other hand, assume that there are no unobserved confounders. The epidemiological designs discussed also make assumptions, for instance that individuals can serve as their own control.

There are, however, three things that can be done. First, analysts should conduct sensitivity analyses within the assumptions of each method to assess the potential impact of what cannot be observed. It is standard practice in econometrics, for instance, to assess omitted variable bias, and similar analysis would be useful for all of the analytical methods described in this section. The second solution is to analyze the same data with different analytical approaches that make alternative assumptions, and to apply the same methods to different data sets (as Rassen and colleagues

Finally, using any of these methods effectively to obtain unbiased estimates requires knowledge about the setting, the behavior, and the population being studied. In their study of the effect of Medicare Part D, for instance, Stuart and colleagues