Introduction
Learning health systems use routinely collected electronic health data (EHD) to advance knowledge and support continuous learning. Even without randomization, observational studies can play a central role as the nation’s health care system embraces comparative effectiveness research and patient-centered outcomes research. However, neither the breadth, timeliness, and volume of the available information nor sophisticated analytics allow analysts to confidently infer causal relationships from observational data. Rather, depending on the research question, careful study design and appropriate analytical methods can improve the utility of EHD.
This paper is part of a series (see Box 1) on how learning health systems can use routinely collected electronic health data (EHD) to advance knowledge and support continuous learning. The first paper1 began by drawing a distinction between big-data-style analytics of EHD, with its claims that randomized studies are no longer necessary, and traditionalists who believe that without randomization little can be known with certainty. Of course this is a false distinction; some questions do not involve assessing a cause-and-effect relationship, but when causal assessment is necessary, observational studies of existing EHD can be a useful complement to RCTs. In particular, when the question is whether an intervention “works” – that is, improves outcomes of interest – causal inference is indeed critical, but appropriately designed and analyzed observational studies can yield valid results that better balance internal and external validity than RCTs.
Box 1. Series on Analytic Methods to Improve the Use of Electronic Health Data in a Learning Health System
This is one of four papers in a series intended to (1) illustrate how existing electronic health data (EHD) can be used to improve performance in learning health systems, (2) describe how to frame research questions to use EHD most effectively, and (3) determine the basic elements of study design and analytical methods that can help to ensure rigorous results in this setting.

When the question is whether an intervention improves outcomes of interest, the second paper in this series illustrates how study design methods can help researchers obtain valid results that better balance internal and external validity than RCTs. The methods discussed include choosing appropriate data sources, epidemiologic designs, methods for the design of natural and quasi-experiments, and the use of logic models. The primary issue addressed by these evaluation designs is how to estimate the counterfactual – what would have happened if the intervention had not been implemented. Even with a strong design, however, the potential for bias remains.
Faced with the need to infer cause and effect when randomization is not feasible, statisticians and econometricians have developed a series of analytical methods for “causal analysis.” The current paper complements the second by describing how analytical methods for individual-level electronic health data (EHD), including regression approaches, interrupted time series (ITS) analyses, instrumental variables, and propensity score methods, can also be used to address the question of whether the intervention “works.” Each of these approaches addresses Cochran’s call for methods to adjust for differences in observed characteristics between treatment and control groups in order to isolate the effect of an intervention from other factors.5, 6, 7 These methods are routinely used to study health interventions when randomization is not possible, and despite the potential for bias, researchers from various disciplines agree that each method has merit when implemented with care. The analytical methods discussed in this paper can be used in combination; doing so can limit the potential for bias in the final results if the individual methods are subject to different biases. These methods can also be combined with the design approaches discussed in the second paper in this series8 to limit threats to validity.
This paper does not attempt to serve as a textbook or describe these approaches in detail. Rather, it presents the methods in a consistent framework, leaving detailed information on each topic to the references. Because the use of existing EHD is not yet well developed, some of the examples use other types of data but were chosen to illustrate the methods.
The methods discussed in this section primarily involve the use of individual-level EHD. Since each of these paradigms faces the same basic inference questions, there is some overlap in the material covered, and throughout we explain how each method relies on assumptions that are often not possible to verify with the existing data. This paper concludes with a discussion of “analyzing observational data like randomized experiments.” This is not so much an analytic method per se as a general approach or framework that should cut across all the methods.
Causal Inference Framework
As background for the methods described in this section, it is useful to clarify a framework for causal inference. The fundamental idea is that, for a given individual, the “effect” of a treatment is the difference between the outcome that would be observed if the person received the treatment and what would be observed if the person received the comparison condition instead (the counterfactual). The problem, of course, is that no single individual can receive both the treatment and the comparison condition at the same time. For instance, in Stuart’s study of the Medicare Part D program discussed below, the treatment group members (dual eligibles) are likely sicker and older than the comparison group. Simply comparing dual eligibles to other Medicare or Medicaid recipients would yield a biased estimate of the intervention’s effect. Randomization solves this by creating a control group that, on average, is no different from the treatment group, so that in particular there is no confounding.
The regression-based approaches described below assume that all of the factors that differentiate the treatment and control group members are represented in the observed variables and covariates. The instrumental variables approach identifies special variables (the “instruments”) that affect treatment but are unrelated to outcomes except through the treatment, and estimates how much of the variation in the treatment variable that is induced by the instrument – and only that induced variation – affects the outcome measure. Propensity score methods model the factors related to the probability of treatment assignment and, typically, match treated and untreated individuals based on such probabilities. All of these models assume that the causal model is correctly specified. This can be hard to assess, but directed acyclic graphs (DAGs) can be used to clarify assumptions about causal pathways, and their graphical representation can guide the selection of covariates for statistical adjustment through structural equation models (SEM) or other approaches,9 although the details are beyond the scope of this paper.
Regression Approaches
Perhaps the simplest and most intuitive approach to analyzing observational data is to fit a linear statistical model of the form

Y_{i} = β_{0} + β_{1}X_{i} + β_{2}Z_{i} + e_{i} (Equation 1)

where
 Y_{i} is the outcome variable for subject i
 X_{i} is an indicator variable for the treatment, e.g. 1 for treatment group and 0 for control
 β_{1} is the effect of the treatment, conditional on the covariates
 Z_{i} represents other factors that influence the outcome
 e_{i} is an independent and identically distributed (iid) error term.
The parameters in Equation 1 are typically estimated by ordinary least squares (OLS), and such “OLS estimates” are commonly used to describe this regression approach. In this model, the fitted value of β_{1} estimates the effect of the treatment and can be evaluated using standard statistical hypothesis tests. This approach can be extended as necessary if Y_{i} is categorical or dichotomous (e.g., logistic regression), there are multiple Z’s, or the relationship is nonlinear. Another extension is known as a “difference-in-differences” approach, which uses the difference in an outcome variable before and after an intervention as Y_{i}; this can have the benefit of individuals serving as their own controls.
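As a concrete sketch (purely illustrative, using simulated data rather than EHD), the following Python snippet fits a model of the form in Equation 1 by ordinary least squares with numpy and contrasts the covariate-adjusted treatment estimate with a naive comparison of group means:

```python
import numpy as np

# Simulate data under Equation 1 (all numbers hypothetical): the covariate Z
# influences both treatment assignment and outcome, i.e. Z is a confounder.
rng = np.random.default_rng(0)
n = 5000
Z = rng.normal(size=n)                              # observed covariate
X = (Z + rng.normal(size=n) > 0).astype(float)      # treatment depends on Z
Y = 1.0 + 2.0 * X + 0.5 * Z + rng.normal(size=n)    # true treatment effect = 2.0

# OLS with the confounder Z included recovers beta_1
design = np.column_stack([np.ones(n), X, Z])
beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
b1_adjusted = beta[1]

# A naive comparison of group means (omitting Z) is biased upward here,
# because treated subjects have systematically higher Z
b1_naive = Y[X == 1].mean() - Y[X == 0].mean()
```

Here the adjusted estimate lands near the true effect of 2.0, while the naive difference in means absorbs part of Z's effect on the outcome.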
Despite the simplicity of this approach, there are many ways that things can go wrong when it is applied to observational data. Most basically, regression approaches assume that the actual causal relationship between the treatment, outcome, and other variables is properly specified, that all of the variables are available for analysis and measured without error, and that the error term is independent and identically distributed. In a difference-in-differences analysis, for instance, the two groups may be on different trajectories and would not have exhibited the same difference after the intervention that they did before. The relationship could be improperly specified: the functional form could be incorrect, or a variable omitted from the model may have a relationship with Y, X, and/or Z. In addition, X and/or Z may depend, in part, on Y – for example, if the treatment received (X) depends on Z (confounding bias) or on Y (endogeneity or selection bias). Also, some of the Z may not be available for analysis, or measurement errors may affect X and/or Z. These problems could result in bias in the estimated treatment effect (β_{1}). They could also cause the error term e_{i} to not be iid, which would lead to incorrect confidence intervals and hypothesis tests.
It is standard practice in econometrics to assess omitted variable bias by identifying the available variables that are most closely related to the missing variable and seeing how the results change when these variables are dropped from the model. One never knows, however, how well these variables capture the effect of the missing variable, or of additional missing factors that may exist but are unknown to the researcher.
Regression Discontinuity (RD) Method
A variant of this approach, the regression discontinuity method,10, 11 is used when assignment to treatment is based on a cutoff value of a continuous variable Z (i.e., X_{i} = 1 when Z_{i} exceeds the cutoff and 0 otherwise). The relationship between Y and Z is modeled as

Y_{i} = β_{0} + β_{1}X_{i} + f(Z_{i}) + e_{i} (Equation 2)
The functional form of f(Z_{i}) could simply be linear (which would be modeled as β_{2}Z_{i}) or alternatively Y or Z could be transformed to fit this approach. As in other regression approaches, the fitted value of β_{1} estimates the effect of the treatment, and can be evaluated using standard statistical hypothesis tests. Figure 1 illustrates this approach.
The within-study comparisons literature has shown that RD analyses of observational data generally replicate RCT results well despite the use of different statistical methods to estimate the RD effect.12 This model assumes that subjects just to either side of the cutoff value are likely to be similar in all relevant respects, including those unobserved. The challenge is ensuring that the relationship between Y and Z is completely and correctly modeled; otherwise the effect parameter β_{1} will be biased. Furthermore, the estimated effect is considered valid only for observations close to the cutoff of the assignment variable Z, not more generally. Another challenge, relating to implementation, is ensuring that assignment to both the treatment and comparison conditions adheres strictly to the cutoff value of Z.
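The cutoff-based logic of the RD method can be sketched in the same simulated style (all numbers hypothetical): treatment is assigned strictly by a cutoff on Z, and the jump at the discontinuity estimates the treatment effect.

```python
import numpy as np

# Hypothetical RD setup: assignment to treatment is determined entirely by
# whether the continuous variable Z crosses a cutoff.
rng = np.random.default_rng(1)
n = 4000
cutoff = 0.0
Z = rng.uniform(-1, 1, size=n)              # assignment variable
X = (Z >= cutoff).astype(float)             # strict assignment at the cutoff
Y = 0.5 + 1.5 * X + 0.8 * Z + rng.normal(scale=0.5, size=n)  # true effect = 1.5

# Fit Y = b0 + b1*X + f(Z) with f linear; b1 is the jump at the discontinuity
design = np.column_stack([np.ones(n), X, Z])
beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
b1 = beta[1]
```

If f(Z) were misspecified (say, the true relationship were curved but modeled as linear), part of the curvature would be absorbed into b1, illustrating the bias discussed above.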
Interrupted Time Series Analyses
To motivate the need for interrupted time series (ITS) methods,13 Ross-Degnan14 considers the Rational Prescribing in Primary Care (RaPP) cluster randomized trial of a tailored intervention to improve the use of antihypertensive and cholesterol medicines for primary prevention of cardiovascular disease (CVD) in Norway.15 The intervention (educational outreach by pharmacists with audit and feedback, plus computerized reminders in the EMR) was implemented in 70 practices including a total of 257 physicians. The control subjects received passive dissemination of evidence-based guidelines (69 practices; 244 physicians). Outcomes were measured monthly for all eligible patients in participating practices for one year before and after the intervention.
Figure 2 illustrates the traditional “difference-in-differences” analysis of the RaPP study, with prescribing of low-dose diuretics as the outcome variable. For the sake of comparison, the difference between the change in the treatment group and the change in the control group is 9.0 percent (95% C.I. 4.9% – 13.1%), a significant improvement. Figure 3 displays the ITS analysis of the RaPP study based only on the intervention group data. Displaying the results by month rather than annually shows that adherence to guidelines changed immediately after the intervention began, which is less likely to be due to some other cause than if the change had occurred at some other time. The monthly data also suggest that the effect not only did not drop off over time, but might have increased. In addition, even though the control group data were not used, the ITS estimate of the effect, 11.5 percent (95% C.I. 9.5% – 13.5%), is consistent with the randomized trial analysis.
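The difference-in-differences arithmetic itself is simple; the numbers below are hypothetical, not taken from the RaPP figures:

```python
# Difference-in-differences: the effect estimate is the pre-to-post change in
# the treatment group minus the change in the control group.
# (Illustrative proportions only.)
treat_pre, treat_post = 0.12, 0.26
control_pre, control_post = 0.11, 0.16
did = (treat_post - treat_pre) - (control_post - control_pre)
print(round(did, 3))  # 0.09, i.e. a 9 percentage point effect
```

Subtracting the control group's change is what distinguishes this from a simple before-after comparison: any secular trend shared by both groups cancels out.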
Figure 4 summarizes the logic of ITS analysis and shows how parameters can be estimated by segmented linear regression. Figure 5 shows how this approach can be extended to multiple time segments. In this example, New Hampshire Medicaid data on 860 multiple-drug recipients show the effect of the implementation of a reimbursement cap in August 1981, which was replaced by a $1 copay the following year. The statistical model is as follows:

Y_{t} = β_{0} + β_{1}time_{t} + β_{2}policy1_{t} + β_{3}time after policy1_{t} + β_{4}policy2_{t} + β_{5}time after policy2_{t} + e_{t} (Equation 3)

where time_{t} indexes the month, policy1_{t} and policy2_{t} indicate whether each policy is in effect at time t, and the “time after” terms count the time elapsed since each policy began.
In this model, β_{2} is the effect of implementing policy 1 and β_{4} is the effect of implementing policy 2, conditional on the covariates.
One key assumption is that the baseline trend correctly reflects what would have happened after the intervention time point had the intervention not occurred. This in turn depends on the trends within segments being linear and on the autocorrelation structure of the errors being correctly modeled. The New Hampshire example in Figure 5 illustrates how simple assumptions about linearity can lead to misleading conclusions. The most basic ITS model also depends on the assumption that there is no lag between when the intervention occurs and when its effects are reflected in the outcome measures. With sufficient time points, more complex and flexible nonlinear models could be used to relax these assumptions.
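A minimal sketch of segmented regression for a single interruption, with simulated monthly data (all parameter values hypothetical):

```python
import numpy as np

# Segmented regression for one interruption:
#   Y_t = b0 + b1*time_t + b2*policy_t + b3*time_after_t + e_t
# b2 is the immediate level change; b3 is the change in slope.
rng = np.random.default_rng(2)
months = np.arange(24)                              # 12 months pre, 12 post
policy = (months >= 12).astype(float)               # intervention indicator
time_after = np.where(months >= 12, months - 11, 0) # months since intervention

# Simulate: baseline level 40, slope 0.2/month, level jump +8, slope change +0.3
Y = 40 + 0.2 * months + 8.0 * policy + 0.3 * time_after + rng.normal(scale=0.5, size=24)

design = np.column_stack([np.ones(24), months, policy, time_after])
beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
level_change, slope_change = beta[2], beta[3]
```

In practice the errors of monthly series are often autocorrelated, so a production analysis would model that structure (e.g. with generalized least squares) rather than assume iid errors as this sketch does.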
The major threats to the validity of the ITS design are: confounding (i.e. a co-occurring intervention), selection (pre-intervention factors that affect inclusion in the intervention, such as volunteers), regression to the mean (groups selected on baseline values), instrumentation (changes in measurement or ascertainment), and history or maturation (some other event or natural process explains the observed effect).
The major threats to the reliability of ITS estimates are: unstable data and wild data points, low frequency outcomes (e.g., deaths), boundary conditions (e.g., percentages which are bounded between 0 and 100), short segments that inaccurately reflect the trend, changing denominator populations, and nonlinear trends.
Ross-Degnan16 suggests the following approaches for strengthening ITS studies:
 check data quality: identify and remove outliers and implausible data, impute missing data;
 contrast multiple outcomes or groups such as highrisk subgroups or differential response;
 account for policy phasein including anticipatory effects or postintervention lag; and
 test model assumptions including normality of errors, linearity of segments.
Although the description to this point assumes that there is only one group being followed, ITS analysis can also be used to compare two or more groups. Indeed, the results of an ITS analysis are strengthened if the comparison groups are matched by standardizing or using propensity scores (see below), or are chosen using principles of natural and quasi-experimental design as discussed in Stoto.17
In summary, the advantages of ITS analysis include an intuitive visual display, direct estimates of effects, and the control it provides for common threats to validity. The limitations are that the ITS method requires a reasonably stable series and relatively long segments. There can also be boundary problems and sensitivity to points near the ends of segments. Also, because ITS analysis uses aggregate data, there is no opportunity for patient-level adjustment, although one can use risk-adjusted rates.
Instrumental Variables
The instrumental variables approach addresses the causal inference problem by identifying special variables (“instruments”) that affect the treatment that research subjects receive but are unrelated to the outcomes they experience except through the treatment, and then estimating how much of the variation in the treatment variable that is induced by the instrument – and only that induced variation – affects the outcome measure. The idea is that the instrument can be thought of as more plausibly randomly assigned than the treatment of interest, and that the instrument affects whether an individual takes the treatment but does not directly affect outcomes. A classic example of an IV is the distance subjects lived from a health care facility offering two types of emergency procedure.18 The physician prescribing preference example below demonstrates how instruments can be found in commonly available EHD.
Specifically, this approach centers on two regression equations, generally fit by two-stage least squares:

X_{i} = α_{0} + α_{1}IV_{i} + α_{2}Z_{i} + f_{i} (Equation 4)

Y_{i} = β_{0} + β_{1}${\widehat{X}}_{i}$ + β_{2}Z_{i} + e_{i} (Equation 5)
where
 IV_{i} is the instrumental variable for subject i
 ${\widehat{X}}_{i}$ is the predicted value of X_{i} after fitting Equation 4
 β_{1} is the effect of the treatment, conditional on the covariates
 e_{i} and f_{i} are iid error terms
and the rest of the variables are as defined above. Figure 6 compares this approach to an RCT.
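A two-stage least squares estimate can be sketched as follows (simulated data; the instrument strength and effect size are hypothetical). The example also shows how an unobserved confounder biases naive OLS while the IV estimate recovers the true effect:

```python
import numpy as np

# Hypothetical two-stage least squares (Equations 4 and 5, without covariates).
rng = np.random.default_rng(3)
n = 20000
U = rng.normal(size=n)                       # unobserved confounder
IV = rng.normal(size=n)                      # instrument: affects X, not Y directly
X = 0.8 * IV + U + rng.normal(size=n)        # treatment depends on instrument and U
Y = 1.0 + 2.0 * X + U + rng.normal(size=n)   # true effect = 2.0; U biases naive OLS

# Stage 1: regress X on IV, keep the fitted values X_hat
d1 = np.column_stack([np.ones(n), IV])
a, *_ = np.linalg.lstsq(d1, X, rcond=None)
X_hat = d1 @ a

# Stage 2: regress Y on X_hat; the coefficient is the IV estimate of beta_1
d2 = np.column_stack([np.ones(n), X_hat])
b, *_ = np.linalg.lstsq(d2, Y, rcond=None)
b1_iv = b[1]

# Naive OLS of Y on X is biased by the unobserved confounder U
b_ols, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), X]), Y, rcond=None)
b1_ols = b_ols[1]
```

Note that even with 20,000 observations the IV estimate is noticeably less precise than OLS, consistent with the larger standard errors discussed below; a dedicated 2SLS routine would also correct the stage-2 standard errors, which this sketch does not.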
For the instrumental variables approach to be effective, two critical assumptions must be true. The first is known as relevance, i.e. that there is a strong association between the IV and the treatment variable X. Using weak or poor instruments (those for which this association is not strong) can lead to biased and imprecise estimates of the treatment effect. The second assumption, exogeneity, is that the correlation between the IV and e_{i}, the error term in Equation 5, must be zero. Leaving an important Z_{i} out of Equation 4, perhaps because it was not available, can bias the estimated treatment effect (β_{1}). Unfortunately, there is no way to be sure that these conditions are met in any particular situation. There are tests to identify the best of multiple instruments conditional on having a good one, but none that test whether a particular instrument is truly exogenous, or whether any of the instruments are “good enough” to yield reasonably precise estimates. Baiocchi and colleagues demonstrate how the strength of an instrument relates to observed and unobserved covariates and discuss approaches for building stronger instruments and testing how well they work.19
Example of Instrumental Variables: Physician Prescribing Preference
Rassen and colleagues20 illustrate the instrumental variables approach with a study of the risk of short-term mortality associated with the initiation of antipsychotic medication (APM). The premise is that the IV in this example, physician prescribing preference (PPP), predicts which drug a patient will be treated with but is otherwise unrelated to the outcome. This premise could be incorrect if, say, some of the patients had already been tried on an APM and had a bad reaction.
The study uses data from two sources: Pennsylvania’s (PA) Pharmaceutical Assistance Contract for the Elderly (PACE) program from 1994 to 2003, and British Columbia (BC) residents aged 65+ from 1996 to 2004. The comparison is between initiators of conventional vs. atypical APM therapy, and the outcome is mortality within 180 days of initiation (the index date). The available covariates reflect baseline patient characteristics (coexisting illnesses and use of health services) in the 6 months before the index date. Frailty, cognitive impairment, and the ability to perform activities of daily living are all potentially important but are not available.
The study examined 25 different variants of PPP as an IV. The “base case” was the approach used in the original analysis: an indicator variable based on the physician’s current preference for conventional vs. atypical APM therapy. If the physician’s previous APM prescription was for a conventional APM, then for the next patient, the physician was classified as a “conventional APM prescriber.” Otherwise, the physician was classified as an “atypical APM prescriber.” Rassen and colleagues considered variants based on (1) the preference assignment algorithm (e.g. the number of conventional APM prescriptions out of the previous 24 prescriptions), (2) cohort restrictions based on physician and patient characteristics, and (3) stratification criteria (e.g. patients of a similar age). Eventually they determined that restricting the analysis to primary care physicians produced the best instrument.21
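The base-case assignment rule could be sketched as follows; the function name and data layout are illustrative, not taken from the study:

```python
# Hypothetical sketch of the "base case" preference assignment: a physician is
# classified by the type of his or her most recent APM prescription.
def classify_prescriber(prior_prescriptions):
    """prior_prescriptions: chronological list of 'conventional'/'atypical'.
    Returns the prescriber classification, or None with no prior history."""
    if not prior_prescriptions:
        return None  # first patient: preference is undefined
    if prior_prescriptions[-1] == "conventional":
        return "conventional"
    return "atypical"
```

Because the classification is recomputed for each new patient, a physician's status as the instrument can change over time as his or her prescribing history accumulates.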
Table 1 displays the estimated differences in the risk of all-cause mortality within 180 days of initiation of conventional versus atypical APM treatment based on four different statistical models; the first three are based on different versions of OLS regression, and the fourth is the IV analysis estimate. The IV estimates are substantially different from the OLS estimates, which themselves vary. In the Pennsylvania base case (including all prescribing physicians), for instance, the IV estimate of excess risk is 7.69 per 100 patients, compared to OLS estimates that range from 2.47 to 3.91 per 100 patients.
| POPULATION AND VARIATION | EVENTS IN CONVENTIONAL APM GROUP | EVENTS IN ATYPICAL APM GROUP | UNADJUSTED OLS ESTIMATE | AGE/SEX-ADJUSTED OLS ESTIMATE | FULLY ADJUSTED OLS ESTIMATE^a | IV ANALYSIS ESTIMATE |
|---|---|---|---|---|---|---|
| BRITISH COLUMBIA | | | | | | |
| Base case (unrestricted) | 1,806 | 2,307 | 4.46 (3.69, 5.23) | 4.49 (3.75, 5.22) | 3.55 (2.74, 4.37) | 4.00 (0.94, 7.06) |
| Restricted to PCPs (R6) | 1,735 | 2,115 | 4.24 (3.41, 5.06) | 4.48 (3.68, 5.28) | 3.59 (2.70, 4.48) | 3.11 (−0.57, 6.79) |
| PENNSYLVANIA | | | | | | |
| Base case (unrestricted) | 1,307 | 1,628 | 2.69 (1.65, 3.73) | 2.47 (1.46, 3.49) | 3.91 (2.68, 5.13) | 7.69 (1.26, 14.12) |
| Restricted to PCPs (R6) | 960 | 1,129 | 2.39 (1.07, 3.71) | 2.29 (0.98, 3.60) | 4.32 (2.71, 5.93) | 5.34 (−3.53, 14.21) |
The estimates from an IV analysis often have larger standard errors, and this effect can be seen in Table 1. But are the IV estimates sufficiently less biased to justify this? Rassen and colleagues22 address this by examining the key assumptions of the IV approach. First, they assess instrument strength using a partial F test from the first-stage regression, which predicts treatment as a function of the instrument and covariates, and find that it is significant at the 5 percent level for all cohort definitions. They also report that the partial r² between instrument and treatment, conditional on the other covariates in the model, ranged from 0.028 to 0.099 across cohort definitions. Thus they conclude that the instrument strength is high.
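The partial r² diagnostic can be computed by regressing both the treatment and the instrument on the covariates and correlating the residuals; a sketch with simulated data (all values hypothetical):

```python
import numpy as np

# Partial r^2 between instrument and treatment, conditional on a covariate Z:
# regress X on Z and IV on Z, then square the correlation of the residuals.
rng = np.random.default_rng(4)
n = 5000
Z = rng.normal(size=n)                   # observed covariate
IV = rng.normal(size=n)                  # instrument
X = 0.3 * IV + 0.5 * Z + rng.normal(size=n)  # treatment

def residuals(v, covariate):
    d = np.column_stack([np.ones(len(v)), covariate])
    beta, *_ = np.linalg.lstsq(d, v, rcond=None)
    return v - d @ beta

r = np.corrcoef(residuals(X, Z), residuals(IV, Z))[0, 1]
partial_r2 = r ** 2
```

Higher values indicate a stronger instrument; values near zero signal a weak instrument, for which the cautions in the text about bias and imprecision apply.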
Rassen and colleagues23 also find that using PPP as an IV generally alleviated imbalances in non-psychiatry-related patient characteristics, making the instrument more plausibly randomly assigned than are APMs themselves. In Pennsylvania, for instance, the proportion of patients who were male was 15.1 percent in the atypical APM group and 20.1 percent in the conventional APM group, a difference of 5.0 percent. When stratified by the IV, the difference was reduced to 1.8 percent. Averaging over the 25 different variants of PPP, the overall imbalance was reduced by an average of 36 percent over the two cohorts.
A final key assumption is that the instrument has no direct effect on the outcome. Rassen and colleagues argue that this is a reasonable assumption here, although it could be violated if, for example, PPP is associated with higher or lower quality of care in general (e.g., if physicians who prefer a particular type of APM also tend to provide lower or higher quality of care overall).
Thus, even when done well, IV analysis can be difficult to interpret because of the large standard errors. Based on these analyses, Rassen and colleagues conclude that PPP was at least a reasonably valid instrument in this setting, and implicitly that the IV estimates are superior to the OLS estimates. This type of careful analysis of whether a particular instrument is truly exogenous, or whether any of the instruments are “good enough” to yield reasonably precise estimates, is not common, and without it one cannot be sure that the results are valid.
Propensity Score Methods
Propensity score methods aim to equate treatment and comparison groups on a single variable, the probability of treatment, which is modeled from a set of observed characteristics and estimated on the pool of treatment group members and potential comparison group cases. The key to this is the propensity score, p, which is defined as the predicted probability of receiving the treatment given the observed covariates.24, 25
There are five basic steps involved in using propensity score methods:
 Estimate p, typically using logistic regression, although nonparametric approaches such as random forests26 have been shown to potentially work better.
 Use the propensity score to equate groups through matching, weighting, or subclassification. Matching involves finding one or more comparison cases for each treated case that have similar values of p. Propensity scores can also be used to create weights p/(1−p), which give less weight to comparison subjects that look less like the treated group. Subclassification forms subgroups of individuals with similar propensity scores (for example, 10 subclasses defined by propensity score deciles).
 Check how well the equating worked to create balance in observed covariates. Since the goal is to reduce bias by forming groups that look similar on the observed covariates, we can see how well the matching worked by comparing the distributions of the covariates in the equated treatment and comparison groups.
 Estimate the treatment effect by comparing outcomes in the equated groups. With matching, this involves comparing outcomes in the matched groups (some weighting will be required if the number of matches selected differs across treatment group observations). Alternatively, use the weights described in step 2 to calculate the average treatment effect. With subclassification, effects are estimated separately within each subclass and then aggregated. (Note that these approaches are the same ones used to calculate the balance measures in step 3.)
 Conduct a sensitivity analysis to unobserved confounding. This can be done in a number of ways, for instance by positing an unobserved confounder and obtaining adjusted impact estimates if that confounder existed, given its assumed characteristics.
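Steps 1 through 3 can be sketched with simulated data (a minimal illustration under hypothetical assumptions, not a full implementation):

```python
import numpy as np

# Step 1: fit a propensity model by logistic regression (Newton-Raphson);
# Step 2: weight comparison subjects by the odds p/(1-p);
# Step 3: check covariate balance before and after weighting.
rng = np.random.default_rng(5)
n = 8000
Z = rng.normal(size=n)                            # observed covariate
p_true = 1 / (1 + np.exp(-(Z - 0.5)))             # treatment more likely for high Z
T = (rng.uniform(size=n) < p_true).astype(float)  # treatment indicator

# Step 1: logistic regression of T on Z via Newton-Raphson
Xd = np.column_stack([np.ones(n), Z])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-(Xd @ beta)))
    W = p * (1 - p)
    grad = Xd.T @ (T - p)                          # score vector
    hess = (Xd * W[:, None]).T @ Xd                # observed information
    beta += np.linalg.solve(hess, grad)
p = 1 / (1 + np.exp(-(Xd @ beta)))                 # estimated propensity scores

# Step 2: weighting by the odds (treated weight 1, comparisons p/(1-p))
w = np.where(T == 1, 1.0, p / (1 - p))

# Step 3: balance check -- difference in covariate means, raw vs. weighted
diff_before = Z[T == 1].mean() - Z[T == 0].mean()
z0, w0 = Z[T == 0], w[T == 0]
diff_after = Z[T == 1].mean() - (z0 * w0).sum() / w0.sum()
```

The weighted comparison group mean moves much closer to the treated group mean, which is exactly the balance property checked in step 3; step 4 would then compare weighted outcomes, and step 5 would probe sensitivity to an unobserved confounder.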
Schneeweiss and colleagues27 demonstrate how this method can be used with health care claims data to study the safety and effectiveness of medications. Using a multistep algorithm to implement a high-dimensional propensity score adjustment with claims data, the authors demonstrate improved effect estimates compared with adjustment limited to predefined covariates, when benchmarked against results expected from randomized trials. Other researchers, however, have found propensity score methods to yield very different estimates than those from an RCT using the same treatment group cases.28
Example of Propensity Score Methods: Medicare Part D Prescription Drug Program
To illustrate the use of propensity score analysis, Stuart and colleagues29 analyze the effect of the Medicare Part D prescription drug program, which became available in 2006, on individuals with serious mental illness. In particular, the study focused on individuals eligible for both Medicaid and Medicare, known as “dual eligibles,” who transitioned from state Medicaid coverage to commercial Medicare coverage, and asked about the impact on medication continuity and outcomes such as inpatient admissions, mortality, and the cost of care. The study was based on Medicare and Medicaid billing data, and the population was Maryland residents with schizophrenia, bipolar, or depressive disorders with dual eligibility on January 1, 2006.
As the propensity score, Stuart and colleagues30 estimated the probability of being a dual eligible. They fit a logistic regression relating eligibility to baseline measures of the key outcomes of interest, as well as diagnoses, demographics, and other covariates. The predicted value from this equation became each individual’s propensity score, p. They then analyzed the data using a difference-in-differences design, comparing the pre-post change in utilization between dual eligibles and comparable Medicaid-only patients. Stuart and colleagues tried two approaches to equating groups. The first was a 1:1 match based on the nearest propensity score. The second used the propensity scores in a “weighting by the odds” approach (with weights equal to p/(1−p)). The weighting approach allowed Stuart and colleagues to retain the full sample in the analysis; all duals are included as the “treatment” group, and all non-duals are included but weighted according to their similarity to the duals.
Figure 7 displays the standardized differences between the experimental groups (duals vs. nonduals) on a range of covariates before (hollow circles) and after (solid circles) propensity score weighting. The figure shows that the weighting reduced nearly all of the standardized differences, and after weighting all standardized biases were less than 0.2, a threshold used to indicate adequate balance. It is particularly reassuring that the baseline measures of some of the key outcomes (e.g. unique day counts [UDC] of six types of prescription drugs) are very well balanced after weighting.
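The balance metric plotted in Figure 7 is the standardized difference: the difference in (possibly weighted) means divided by a pooled standard deviation. A small sketch (one common formulation; variants exist):

```python
import numpy as np

def standardized_difference(x_treated, x_comparison, w_comparison=None):
    """Difference in means divided by the pooled standard deviation.
    w_comparison optionally supplies propensity score weights (e.g. p/(1-p))
    for the comparison group."""
    m1 = x_treated.mean()
    if w_comparison is None:
        m0 = x_comparison.mean()
    else:
        m0 = np.average(x_comparison, weights=w_comparison)
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_comparison.var(ddof=1)) / 2)
    return (m1 - m0) / pooled_sd
```

Computed for each covariate before and after weighting, values below the 0.2 threshold cited in the text indicate adequate balance.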
Analyzing Observational Data Like Randomized Experiments
Although not directly related to assessing cause-and-effect relationships, Forrest and colleagues31 have shown how methods developed for the analysis of clinical trials can address a number of challenges in analyzing observational data. Adapting the methods of Hernán and colleagues,32 which use observational data to simulate a controlled trial, can lead to robust estimates of comparative effectiveness.
For example, consider the case of biologic therapy for Crohn’s disease. Agents targeted to reduce TNFα-mediated inflammation are rational therapeutic choices; their efficacy has been demonstrated in adults by an RCT, and the REACH study33 and others have evaluated the single-group efficacy of biologics in children. However, there is currently no direct assessment of biologic agents in the pediatric population. And since some patients with moderate to severe Crohn’s disease will get better regardless of treatment, a comparison group of non-biologic-treated patients is needed to estimate treatment effects. ImproveCareNow (ICN) is a collaborative network of more than 50 pediatric GI practices established in 2007 to improve care for children with inflammatory bowel disease. Its dataset includes a large population of children with Crohn’s disease, with longitudinal follow-up, and so provides an opportunity to fill the gap in evidence about the effect of biologic therapy in children.34
Forrest and colleagues35 describe an observational study using ICN data designed to contrast the 6-month outcome experience (disease activity) of patients newly initiating biologic therapy with that of patients receiving only usual care. Their results are based on 198 children initiating treatment with new biologics and 1,157 trials (a “trial” refers to one child initiating treatment and being followed for a 6-month observation window) in 680 children in two control groups (biologic users pre-initiation and biologic non-users). The treatment and control groups differ in a number of respects; most importantly, the biologic initiators had a higher probability of colonic involvement and less concurrent medication use at baseline. The groups did not differ significantly in terms of the Pediatric Crohn’s Disease Activity Index (PCDAI), an 11-element composite index of disease activity that includes symptoms, exam findings, and lab results.
The primary outcome variables were clinical and steroid-free remission. Forrest and colleagues36 used multiple analytic methods to seek accurate estimates of the treatment effect. For instance, they calculated the posttest difference in the probability of remission during the trial and prepared empirical cumulative probability plots. Kaplan–Meier plots and a Cox proportional hazards model that controlled for patient characteristics both indicated a significant difference between the biologic and control groups in the cumulative probability of achieving either clinical response or remission. As summarized in Table 2, Forrest and colleagues37 also conducted a variety of pre- and posttest analyses, both unadjusted and adjusted (difference-in-differences GEE estimates) for demographic characteristics and medication use at baseline.
Table 2. Percentage of trials achieving each outcome (95% CI)

| Outcome | Duration of follow-up | Initiator trials, % achieving outcome (95% CI) | Noninitiator trials, % achieving outcome (95% CI) |
| --- | --- | --- | --- |
| Unadjusted |  |  |  |
| Clinical remission | 26 weeks | 54.4 (47.7–61.1) | 41.2 (38.2–44.2) |
| Clinical remission | 52 weeks | 66.6 (60.3–72.8) | 56.2 (53.2–59.3) |
| Corticosteroid-free remission | 26 weeks | 47.3 (40.6–53.9) | 31.2 (28.4–34.0) |
| Corticosteroid-free remission | 52 weeks | 60.1 (53.7–66.5) | 47.5 (44.5–50.5) |
| Adjusted |  |  |  |
| Clinical remission | 26 weeks | 54.8 (47.2–62.4) | 40.7 (36.5–45.0) |
| Clinical remission | 52 weeks | 67.3 (60.1–74.4) | 55.6 (51.1–60.1) |
| Corticosteroid-free remission | 26 weeks | 45.6 (38.1–53.1) | 30.8 (26.8–34.7) |
| Corticosteroid-free remission | 52 weeks | 58.8 (51.5–66.2) | 47.0 (42.5–51.5) |
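The Kaplan–Meier comparison described in the text can be sketched in a few lines. The estimator below and its toy follow-up data are purely illustrative and are not the study’s actual code or data; with remission as the “event,” 1 − S(t) is the cumulative probability of achieving remission by time t.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function S(t).

    times  -- follow-up time (e.g., weeks) for each trial
    events -- 1 if the outcome (e.g., remission) occurred, 0 if censored
    Returns a list of (event time, S(t)) pairs.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(ti for ti, e in zip(times, events) if e)):
        at_risk = sum(1 for ti in times if ti >= t)       # still under observation
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        survival *= 1 - d / at_risk                        # KM product-limit step
        curve.append((t, survival))
    return curve

# Toy data: weeks to remission (event=1) or censoring at end of window (event=0)
biologic = kaplan_meier([4, 8, 8, 12, 20, 26], [1, 1, 0, 1, 1, 0])
control = kaplan_meier([10, 18, 26, 26, 26, 26], [1, 1, 0, 0, 1, 0])
```

In the study itself these curves would be compared across the initiator and control groups, with the Cox model adjusting for baseline patient characteristics.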
Based on these analyses, Forrest and colleagues38 conclude that multiple analytic methods converge on consistent estimates of effect size. Substantively, they find that biologic agents are modestly more effective than other therapies for moderate/severe disease, and that their estimates are consistent with the limited existing prospective data. The number needed to treat (NNT) to achieve one additional clinical remission is approximately 8. For steroid-free remission the NNT is 5.2, and indeed the reduced steroid use may be a significant benefit of biologic treatment.
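The NNT arithmetic is simply the reciprocal of the absolute risk difference. A minimal sketch using the 26-week unadjusted clinical remission rates from Table 2 (the paper’s reported NNTs come from its own analyses, so this calculation is illustrative):

```python
def nnt(p_treated, p_control):
    """Number needed to treat: reciprocal of the absolute risk difference."""
    return 1.0 / (p_treated - p_control)

# 26-week unadjusted clinical remission: 54.4% of initiator trials vs 41.2%
# of noninitiator trials, an absolute difference of ~13.2 percentage points
print(round(nnt(0.544, 0.412)))  # → 8, matching the reported "approximately 8"
```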
Conclusions
When the question is whether an intervention improves outcomes of interest, the second paper in this series39 illustrates how study design methods can help researchers obtain valid results that better balance internal and external validity than RCTs. The current paper complements it by describing how analytical methods for individual-level EHD, including regression approaches, interrupted time series (ITS) analyses, instrumental variables, and propensity score methods, can also be used to address the question of whether the intervention “works.”
The two major potential sources of bias in nonexperimental studies of health care interventions are that the groups being compared do not have the same probability of receiving the treatment or exposure, and that unmeasured covariates may confound the comparison. This paper has described a range of analytical methods for the analysis of individual-level data, deriving primarily from statistics and econometrics, that can help to address these problems: regression approaches, propensity score methods, instrumental variables, and clinical trial methods.
Although these approaches are very different, all are based on assumptions about the data, the causal relationships, and the biases at play. Regression approaches assume that the actual causal relationship among the treatment, the outcome, and other variables is properly specified; that all relevant variables are available for analysis (i.e., no unobserved confounders) and measured without error; and that the error term is independent and identically distributed. The instrumental variables approach requires identifying an instrument that is related to the assignment of treatment but otherwise has no direct effect on the outcome. Propensity score methods, on the other hand, assume that there are no unobserved confounders. The epidemiological designs discussed also make assumptions, for instance that individuals can serve as their own controls.
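The instrumental variables logic can be illustrated with a two-stage least squares sketch on simulated data. Everything below (the data-generating process, the instrument, the variable names) is hypothetical; the point is only that the instrument-driven part of treatment recovers the true effect even when a confounder biases the naive regression.

```python
import random

random.seed(0)
n = 5000
confounder = [random.gauss(0, 1) for _ in range(n)]    # unobserved confounder
instrument = [random.randint(0, 1) for _ in range(n)]  # affects treatment only
treatment = [1.0 if z + u + random.gauss(0, 1) > 0.5 else 0.0
             for z, u in zip(instrument, confounder)]
outcome = [2.0 * t + u + random.gauss(0, 1)            # true treatment effect = 2
           for t, u in zip(treatment, confounder)]

def slope(x, y):
    """Univariate least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Naive regression of outcome on treatment is biased by the confounder
naive = slope(treatment, outcome)

# Stage 1: predict treatment from the instrument;
# Stage 2: regress outcome on the instrument-driven part of treatment
a = slope(instrument, treatment)
fitted = [a * z for z in instrument]  # the intercept drops out of the slope
iv = slope(fitted, outcome)           # close to the true effect of 2
```

With a single binary instrument this reduces to the classic Wald estimator; the naive estimate overstates the effect because the confounder raises both treatment probability and outcome.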
There are, however, three things that can be done. First, analysts should conduct sensitivity analyses within the assumptions of each method to assess the potential impact of what cannot be observed. It is standard practice in econometrics, for instance, to assess omitted-variable bias, and similar analyses would be useful for all of the analytical methods described in this section. The second solution is to analyze the same data with different analytical approaches that make alternative assumptions, and to apply the same methods to different data sets (as Rassen and colleagues40 did in the PPP study and Yih and colleagues41 did in their study of intussusception risk after rotavirus vaccination). Finally, different analytical methods, each subject to different biases, can be used in combination, and together with different designs, to limit the potential for bias in the final results.
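One widely used sensitivity analysis of the first kind, though not one used in the studies cited here, is VanderWeele and Ding’s E-value: the minimum strength of association, on the risk ratio scale, that an unmeasured confounder would need to have with both treatment and outcome to fully explain away an observed risk ratio.

```python
import math

def e_value(rr):
    """E-value for an observed risk ratio rr >= 1 (VanderWeele & Ding):
    rr + sqrt(rr * (rr - 1))."""
    return rr + math.sqrt(rr * (rr - 1))

# An observed RR of 1.3 would require an unmeasured confounder associated
# with both treatment and outcome at RR ~1.9 to explain the effect away
print(round(e_value(1.3), 2))  # → 1.92
```

Large E-values suggest that only a strong unmeasured confounder could account for the result; small ones signal that modest residual confounding could.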
Finally, using any of these methods effectively to obtain unbiased estimates requires knowledge about the setting, the behavior, and the population being studied. In their study of the effect of Medicare Part D, for instance, Stuart and colleagues42 limited their sample to Maryland residents, assuming that this would help to control for variation in state-level factors and policies. They also focused on patients with diagnoses of schizophrenia, bipolar disorder, or depressive disorders, reasoning that the impact of the new drug benefit would be similar across these groups, whereas the impact on individuals with other health conditions could be quite different.