Introduction

A learning health system is one in which “science, informatics, incentives, and culture are aligned for continuous improvement and innovation, with best practices seamlessly embedded in the care process, patients and families active participants in all elements, and new knowledge captured as an integral by-product of the care experience” [1]. According to Fineberg, [2] these systems are continuously committed to improvement: they are engaged in discovery, innovation, and implementation, embedding research into the process of care. In particular, learning health systems, which can range from national systems such as Medicare, to integrated health care delivery systems, to physician practice networks such as practice-based research networks, to individual hospitals, use routinely collected electronic health data (EHD) to advance knowledge and support continuous learning. Indeed, the increasing popularity of the term “learning health care system” signals the broad acceptance of the idea that routinely collected clinical data can—and indeed should—be used to advance knowledge and support continuous learning so as to improve patient care [3].

Okun and colleagues, [4] for instance, demonstrate how EHD can help to improve disease monitoring and tracking; better target medical services for improved health outcomes and cost savings; help patients and clinicians make better-informed decisions during clinical visits; avoid harm to patients and unnecessary costs associated with repeat testing and delivery of unsuccessful treatments; and accelerate and improve the use of research in routine medical care to answer medical questions more effectively and efficiently.

“Big data” – which includes but is not limited to EHD – has many uses in a learning health system [5, 6, 7, 8]. Some big data enthusiasts even seem to suggest that with enough data, randomized clinical trials (RCTs), long viewed as the gold standard in health research, are no longer necessary [9]. Schneeweiss [10] argues that much of this analysis can be automated without losing validity; Dahabreh and Kent [11] agree that such analyses are needed but are less sanguine about their potential.

However, despite the new tools available for EHD, none of this is simple. Whether health system data are used to manage the care of individuals, to carry out quality improvement (QI) studies, or to conduct comparative effectiveness research (CER), data in individuals’ medical records must be accurate and complete, and the collection of records available for analysis must be reasonably representative of the population served. Without randomizing a large number of subjects to a treatment or intervention of interest, research results can suffer from bias, especially confounding by factors not recorded in existing electronic health records (EHR). Indeed, some seem to feel that without RCTs, we know nothing with certainty [12]. However, depending on the research question, careful study design and appropriate analytical methods can improve the utility of EHD in a learning health system and even yield convincing causal inferences.

Project goals

The goals of this set of papers (see Box 1) are to (1) illustrate how existing EHD can be used to improve performance in learning health systems, (2) describe how to frame research questions to use EHD most effectively, and (3) determine the basic elements of study design and analytical methods that can help to ensure rigorous results in this setting. We explore how researchers can balance the rigor and internal validity of RCTs with the relevance and external validity of real-world observational studies based on EHD. There are many existing publications addressing one or more of these topics, and most health professionals are aware of some, but not all, of these methods and approaches.

Box 1. Series on Analytic Methods to Improve the Use of Electronic Health Data in a Learning Health System

This is one of four papers in a series intended to (1) illustrate how existing electronic health data (EHD) can be used to improve performance in learning health systems, (2) describe how to frame research questions to use EHD most effectively, and (3) determine the basic elements of study design and analytical methods that can help to ensure rigorous results in this setting.

  • Paper 1, this paper, focuses on clarifying the research question, including whether assessment of a causal relationship is necessary; why the randomized clinical trial (RCT) is regarded as the gold standard for assessing causal relationships; and how the conditions RCTs establish can be addressed in observational studies.
  • Paper 2, “Design of observational studies,” [13] addresses how study design approaches, including choosing appropriate data sources, methods for design and analysis of natural and quasi-experiments, and the use of logic models, can be used to reduce threats to validity in assessing whether interventions improve outcomes of interest.
  • Paper 3, “Analysis of observational studies,” [14] describes how analytical methods for individual-level EHD, including regression approaches, interrupted time series (ITS) analyses, instrumental variables, and propensity score methods, can be used to better assess whether interventions improve outcomes of interest.
  • Paper 4, “Delivery system science,” [15] addresses translation and spread of innovations, where a different set of questions comes into play: How and why does the intervention work? How can a model be amended or adapted to work in new settings? In these settings, causal inference is not the main issue, so a range of quantitative, qualitative, and mixed research designs is needed.

This paper will not attempt to serve as a textbook or to describe these approaches in detail. Rather, it presents these methods in a consistent framework, without providing detailed information on each topic. The intended audience is health system researchers, analysts, and managers who want to understand the range of methods that have been developed for the analysis of EHD, their strengths and weaknesses, and the circumstances in which it is appropriate to apply them. Even though the examples are drawn mainly from the large-scale research and evaluation studies where these methods have mostly been employed, the principles apply to topics relevant to smaller health systems and individual hospitals as well.

Organization of the series of papers

This set of papers begins with a discussion of the kinds of research questions that EHD can help address, noting how different kinds of evidence and assumptions are needed for each. We argue that when the question involves describing the current (and likely future) state of affairs, causal inference – and hence RCTs – is not relevant. When the question is whether an intervention improves outcomes of interest, causal inference is critical, but appropriately designed and analyzed observational studies can yield valid results that better balance internal and external validity than typical RCTs. When the question is one of translation and spread of innovations, a different set of questions comes into play: How and why does the intervention work? How can a model be amended or adapted to work in new settings? In these “delivery system science” settings, causal inference is not the main issue, so a range of quantitative, qualitative, and mixed research designs is needed. We then describe why RCTs are regarded as the gold standard for assessing cause and effect, how alternative approaches relying on observational data can be used to the same end, and how observational studies of EHD can be effective complements to RCTs. We also describe how RCTs can be a model for designing rigorous observational studies, developing an evidence base through iterative studies that build upon one another (i.e., confirmation across multiple investigations).

We recognize that randomized studies are and will continue to be critical for learning health systems. This includes studies with randomization conducted at the individual or group level, using stepped-wedge [16] and delayed-start [17] designs, as well as pragmatic clinical trials [18] and registry-based RCTs [19]. Although EHD increasingly is being used in such studies, and even though we draw lessons and inspiration from these designs, randomized studies per se are beyond the scope of this paper. We also do not address measurement or data quality issues.

The current paper is intended as an introduction to the series, focusing on framing the research question, describing why RCTs are the gold standard for establishing cause and effect, and exploring how elements of RCTs can be used to strengthen causal inference in observational studies. Additionally, to further assess the rigor and appropriateness of observational studies, we point researchers to the literature on within-study comparisons, which compare quasi-experimental designs to RCTs to assess whether, and under what circumstances, observational studies and RCTs have similar results. The other three papers in the series explore different perspectives on the use of observational data. The second paper [20] describes study design and analysis approaches for observational data that can be used to evaluate interventions, drawing primarily on the epidemiology, social sciences, and program evaluation literatures. This includes logic models and methods for the design and analysis of natural and quasi-experiments. The third paper, [21] drawing primarily on statistics and econometrics, discusses methods for the analysis of individual-level data, including regression approaches, interrupted time series (ITS) analyses, instrumental variables, and propensity score methods. The final paper [22] addresses methods from delivery system science for the evaluation of health care improvement initiatives. As indicated above, the questions are different, but the design and analytic approaches discussed in the second and third papers, along with other more qualitative and mixed methods such as realist and other theory-based evaluation approaches, can be useful.

What are the research questions?

Before we can consider the most appropriate methods for the design and analysis of observational studies relying on EHD, we must begin by clarifying the questions to be addressed. Observational EHD can be used for a number of purposes. When it is used in place of an RCT to draw inferences about an intervention’s effects, great caution is needed in the selection of a counterfactual, the econometric approach used in estimation, and the interpretation of results.

For other questions, assessing causality is less central, and the use of observational data less contentious. For instance, one common type of question learning health systems address relates to the current (and likely future) situation for patients, providers, or health care systems. Such questions might concern:

  • The magnitude of a disease or condition; for instance, the prevalence of diagnosed and undiagnosed diabetes among adults in medical practice X or in county Y.
  • The occurrence, timing, and patterns of adverse events; for instance, without changing practice, how many new cases of hospital-acquired infections should we expect next year? Or, how does the rate of central line-associated bloodstream infections (CLABSI) compare to peer institutions?
  • Cost and utilization of health care; for example, who are the most intensive users of health care (“super utilizers”) in the system?

For such descriptive questions, causal inference is simply not relevant. Indeed, because they relate to actual practice, existing EHD will likely provide more appropriate answers than RCTs, which are typically designed to yield precise answers in ideal contexts. Because causal inference is not an issue, these questions are beyond the scope of this series of papers.
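
As a concrete illustration, the following sketch computes the answer to the first descriptive question above, the prevalence of diagnosed type 2 diabetes among adult patients, from a toy EHD extract. It is purely hypothetical: the table, column names, and counts are invented, and a real analysis would draw on an EHR or claims database using a validated diagnosis code set.

    import pandas as pd

    # Toy EHD extract; all names, codes, and counts are invented for illustration
    patients = pd.DataFrame({
        "patient_id": [1, 2, 3, 4, 5],
        "age": [34, 61, 47, 72, 16],
        "diagnosis_codes": [["I10"], ["E11.9", "I10"], [], ["E11.65"], ["J45"]],
    })

    adults = patients[patients["age"] >= 18]
    # ICD-10 codes beginning with "E11" denote type 2 diabetes
    has_t2dm = adults["diagnosis_codes"].apply(
        lambda codes: any(code.startswith("E11") for code in codes)
    )
    print(f"Diagnosed type 2 diabetes prevalence: {has_t2dm.mean():.0%}")  # 50%

Even this simple computation depends on the completeness and accuracy of diagnosis coding, one of the data quality issues noted above as beyond the scope of this series.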

The second type of question essentially asks whether an intervention “works,” that is, improves outcomes of interest. For example, since the 2006 Massachusetts health care reform was a model for the Affordable Care Act enacted four years later, it is important to ask about the Massachusetts reform’s impact on health care utilization, morbidity, and mortality. Learning health systems are also focused on safety, asking questions such as: what is the risk of intussusception after vaccination with second-generation rotavirus vaccines? Questions about whether an intervention “works” must also ask “compared to what?” For instance, is the intervention compared to usual care or a placebo, or is it a question of comparative effectiveness, such as “what is the risk of short-term mortality associated with initiation of conventional vs. atypical antipsychotic medication (APM)?” It is also important to distinguish between effectiveness and efficacy, for instance, “what is the relative risk of mortality due to initiation of a conventional vs. atypical APM in actual practice?” (as compared to some idealized setting that might, for example, include only large academic medical centers). For this type of question causal inference is critical, but appropriately designed and analyzed observational studies can yield valid results that better balance internal and external validity than RCTs [23]. Indeed, for issues of comparative effectiveness as opposed to efficacy, observational studies of EHD offer many advantages over RCTs alone [24]. Study designs and analytical methods for these questions are the major focus of this paper.

A third type of question, common in health care improvement efforts, arises when the focus is on the translation and spread of interventions from the institutions where they were developed to others throughout the health care system. For example, consider a patient-centered medical home (PCMH) approach to managing patients with diabetes. These complex interventions might rely on combinations of team-based care, health information technology and registry functionality, care coordination and management, and quality-adjusted financial incentives, with the specific versions of these components depending on the context in which the intervention is employed. For instance, Psek and colleagues [25] describe the implementation of learning health system principles in the Geisinger Health System, identifying “evaluation and methodology” (activities and methodological approaches needed to identify, implement, measure, and disseminate learning initiatives) as one of nine learning health system framework components. In a similar description of Kaiser Permanente’s approach to a learning health system, Schilling and colleagues [26] identify “real-time sharing of meaningful performance data” as one of six building blocks.

In this “delivery system science” setting, questions about whether interventions could improve outcomes are mostly settled. Accordingly, in these settings attention is devoted not so much to whether the intervention works, but rather to (1) how and why the intervention works, (2) what works for whom and in what contexts, and (3) how a model can be amended to work in new settings [27]. Because the questions are different, the design and analysis approaches discussed in the second and third papers of this series, along with other more qualitative and mixed-methods approaches such as realist and other theory-based evaluation methods, are necessary. The main point is to match the design appropriately to the research question. Constraints related to context, capacity, and operational demands may rule out certain design options, as when randomization is not feasible or is ethically inappropriate. Further, operational considerations may require trade-offs in implementation strategies, for example, moving from a randomized cluster trial design to a phased, time-delayed approach. Learning health systems engaging in delivery science must strive for rigor, and researchers must balance the science with the trade-offs necessary for care delivery operations in the pursuit of generating operationally relevant evidence.

RCTs as a model for observational studies using EHD

RCTs are critical in health research because randomization with large numbers of subjects is the best method (i.e., it requires the fewest and most defensible assumptions) to ensure that an association between a treatment and a health outcome represents a cause and effect relationship. This is because, when carried out well, randomization ensures that the groups receiving different treatments differ at baseline only in those treatments. So, in a “clean” RCT with few or no complications, if differences in outcomes are observed, those effects must be due to the treatments being compared, and not to any pre-existing differences between the groups. This quality is what makes randomization the “gold standard” for questions of efficacy or, in terms of the questions above, for determining whether an intervention “works.”

However, aside from pragmatic trials, most RCTs are limited by restrictions on subjects to obtain homogeneity (e.g., excluding patients with comorbidities) or by their design and where they are carried out (e.g., in large medical centers rather than community hospitals); as a result, their participants often do not represent real-world patient populations. An additional drawback is that RCTs are often quite expensive. Because of their expense, RCTs often have small sample sizes, which limits their ability to detect rare adverse effects and to study heterogeneity of treatment effects. RCTs also typically have short durations, limiting the ability to discern long-term consequences. Thus, RCTs are generally good for internal validity (estimating effects for the research subjects in the study) but weak for external validity (estimating effects for some target population of interest). RCTs also face complexities such as non-compliance and missing data, which can make alternative designs attractive.

Observational studies using existing EHD offer the opportunity to investigate interventions and outcomes, often at lower cost and larger scale. In particular, Fleurence and colleagues [28] write that observational studies can play a central role as the nation’s health care system embraces CER. In fact, the field is moving towards consensus that future CER evidence is as likely to be drawn from the analysis of large observational databases as from clinical trials [29, 30, 31, 32, 33]. Investments under the American Recovery and Reinvestment Act (ARRA) addressed the need for improved data infrastructure for CER through support for administrative and clinical datasets, [34] much of which is suited for observational studies. Given these investments, additional efforts are needed to educate researchers not only about the availability of new data sources, but also about how best to match methods to their research questions and data [35]. In their review of ARRA-funded data infrastructure projects, O’Day et al. [36] found that about half of all data infrastructure projects sought to leverage EHR data sources for CER.

More recently, other initiatives have picked up where these efforts left off. For instance, PCORnet, the National Patient-Centered Clinical Research Network, is an initiative of the Patient-Centered Outcomes Research Institute (PCORI). It is designed to make conducting clinical research faster, easier, and less costly than is now possible by harnessing the power of large amounts of health data and patient partnerships [37]. One of the first PCORnet observational studies is examining the relationship between antibiotic use and weight gain during childhood [38]. Building on the work of the Observational Medical Outcomes Partnership (OMOP), the Observational Health Data Sciences and Informatics (OHDSI) collaborative strives to bring out the value of observational health data through large-scale analytics. Its research community enables active engagement across multiple disciplines (e.g., clinical medicine, biostatistics, computer science, epidemiology, life sciences) and spans multiple stakeholder groups (e.g., researchers, patients, providers, payers, product manufacturers, regulators) [39, 40]. The Food and Drug Administration’s (FDA) Sentinel Initiative enhances the FDA’s ability to proactively monitor the safety of medical products after they have reached the market by rapidly and securely accessing information from large amounts of electronic health care data, such as EHRs, insurance claims data, and registries, from a diverse group of data partners [41]. OptumLabs, a collaborative health care research and innovation center, and its private- and public-sector partners also undertake initiatives that harness the power of existing electronic health data to improve patient care and outcomes [42]. Existing EHD is also essential for the Center for Medicare and Medicaid Innovation’s new, rapid-cycle approach to evaluation, which aims to deliver frequent feedback to providers in support of continuous QI, while rigorously evaluating the outcomes of each model tested [43].

Strengths and weaknesses of observational studies

The strengths of observational studies that can be conducted using EHD are the large, diverse populations “under observation” and the relatively (and increasingly) complete information in electronic medical records (EMR) and other EHD on treatments administered and health outcomes experienced. In addition to representing what happens in the “real world,” as needed to estimate effectiveness, observational data representing large populations are increasingly available, making it possible to estimate how treatment effects vary across subgroups defined by demographic factors, geography, disease severity and comorbidity, physiological variables, and patient-reported measures. The data are already in electronic form, permitting more rapid analyses and possibly lower costs compared to typical RCTs. In addition, reporting bias may be minimized because the data are collected for operational rather than research purposes. According to the Network for Excellence in Health Innovation, [44] real-world evidence (RWE) – evidence from “real world” practice and utilization, outside of clinical trials – is seen as a way to tailor health care decision making more closely to the characteristics of individual patients. While RWE will not supplant the RCT, appropriate adoption of RWE by sponsors of new drugs and devices and by regulators could streamline or supplement data from RCTs. Analysis of real-world data can help expedite the generation of research hypotheses that sharpen the focus of clinical research and may augment conventional RCT data with data from patients whose diversity reflects real-world practice. Long-term observation of patient outcomes from the use of innovations in real-world settings generates further insight on safety and efficacy.

The major weakness of observational studies is the lack of randomization, creating the possibility, or indeed likelihood, of some degree of confounding and/or selection bias. EHD systems designed for health care operations typically lack complete information on the confounders (e.g., the patient’s home situation and caregiver status) needed to adjust for treatment and control group differences. In addition, the data available are typically not of “research grade,” i.e., not collected with the completeness and accuracy typical of a carefully planned RCT. There are also ethical considerations – primarily privacy and confidentiality – that limit researchers’ access to existing data for health research purposes [45], particularly when researchers are not employed by the health care system.

Using principles of RCTs to design better observational studies

Learning health systems often require evidence of a cause and effect relationship; simply observing an improvement in targeted outcomes, for example, does not tell us whether a given patient improved because of a particular treatment or would have improved even without it. Observational studies are generally regarded as establishing association but not necessarily causation. But in certain circumstances, well-designed observational studies can indeed contribute to, if not provide, evidence of a cause and effect relationship. The key issue is the assumptions being made, and assessing their validity in each study. Just as in RCTs that face complexities such as non-compliance or missing data, inference from observational designs requires both data and assumptions. In part to help motivate careful thinking about the design and analysis of observational studies, we consider the two major reasons why RCTs with a large number of subjects are considered the gold standard in health research.

First, RCTs are particularly useful in studying the efficacy of medications, devices, and health services provided to individuals – and have earned their reputation – because individual patient outcomes vary and are unpredictable. Only a fraction of patients respond favorably to even the most efficacious medications, and some people get better even without treatment. Moreover, confounding bias is often difficult to avoid in observational studies; income, education, insurance, and many other factors are often common causes of both treatments and health outcomes. The key advantage of an RCT is that randomization balances all potential confounders – measured and unmeasured – between comparison groups. RCTs thus provide good information about what would have happened to individuals had they instead received the other treatment condition (known as the “counterfactual”). Counterfactuals are unobservable outcomes that represent what would have happened if a treatment or exposure had not been applied. Through randomization of large numbers of subjects, RCTs produce observable data to substitute for the unobservable counterfactual conditions. Because randomization makes the treatment and control groups equivalent at baseline on both observable and unobservable characteristics, the treatment group’s outcomes are a good indication of what would have happened to the control group had its members instead received the treatment of interest, allowing reliable estimation of causal effects. As we discuss in the second and third papers in this series, both retrospective and prospective quasi-experimental designs provide matched and unmatched control options without randomization [46].
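
The logic of randomization can be made concrete with a small simulation of our own (not drawn from the studies cited here). An unmeasured severity variable drives both treatment choice and outcomes, so the naive observational comparison is badly biased, while random assignment recovers the true effect:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    severity = rng.normal(size=n)       # confounder, unmeasured in the EHD
    true_effect = 1.0                   # treatment raises the outcome by 1 unit

    # Observational setting: sicker patients are more likely to be treated
    p_treat = 1 / (1 + np.exp(-2 * severity))
    treated = rng.random(n) < p_treat
    outcome = true_effect * treated - 2 * severity + rng.normal(size=n)
    naive = outcome[treated].mean() - outcome[~treated].mean()

    # Randomized setting: assignment is independent of severity
    assigned = rng.random(n) < 0.5
    outcome_rct = true_effect * assigned - 2 * severity + rng.normal(size=n)
    rct = outcome_rct[assigned].mean() - outcome_rct[~assigned].mean()

    print(f"naive observational estimate: {naive:+.2f}")  # badly biased
    print(f"randomized estimate:          {rct:+.2f}")    # close to +1.00

In this simulation the confounding is strong enough to reverse the sign of the naive estimate, a reminder that, without randomization, a difference in outcomes cannot be read directly as a treatment effect.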

Sometimes it is not clear how the desired but unobservable counterfactual might be found in the real world. For example, under what conditions could we observe “black” persons being “white?” Such phenomena complicate the estimation of so-called race effects. In other studies, however, the counterfactual is clear. With tongue in cheek, for instance, Smith and Pell [47] challenge the conventional wisdom that parachutes can “prevent death and major injury after gravitational challenge,” i.e., jumping out of an airplane, because of the lack of RCTs on the subject. This is humorous because everyone knows what the counterfactual to the use of a parachute would be. But, unlike the case of medications and other individual-level interventions, the counterfactual for some systems-level changes can be quite clear. Consider the Keystone Initiative, a study by a collection of Michigan intensive care units (ICUs) of the impact of a simple checklist on health care-acquired infections. When the authors found that the median rate of infections at a typical ICU dropped from 2.7 per 1,000 patients to zero after three months, a decline sustained for 15 months of follow-up, [48] no one thought that the rate would have dropped so precipitously without the intervention.

Second, RCTs help establish a cause and effect relationship because they address the three criteria for a “contributory cause” [49, 50]. First, the cause must precede the effect. This is true by definition in an RCT because the outcomes are observed after the subjects are randomized. Second, there must be an individual-level association between cause and effect, that is, a difference in outcomes between those who receive the treatment and those who do not. Randomization ensures that there are no systematic differences between treatment and control groups at the time of the randomization/treatment assignment, even in unobserved variables. Finally, altering the cause must result in a change in the effect. Susser describes this third criterion as “direction,” which is indicated by the presence of consequential change. An active agent (such as the use of a medication) itself changes and, in turn, is shown to change the outcome. With a static determinant, the effect changes in consequence of a change or shift in a prior condition, such as having diabetes [51]. RCTs feature a deliberate allocation of subjects to treatment and control groups. In RCTs this allocation is done on a random basis, but deliberate allocation is also a feature of natural and quasi-experiments as well as of health care QI efforts; in those settings, however, allocation is neither under the control of the researcher nor done at random.

A third strength of RCTs is that they are carefully planned. Researchers should and do use lessons from RCTs to help design observational studies with just as much care. William Cochran, author of the seminal book Planning and Analysis of Observational Studies, [52] suggested that researchers carefully design observational studies by asking how the study would be conducted if it were possible to do it by controlled experimentation, [53] and recent authors [54, 55] have reinforced those ideas. In particular, observational studies should aim to replicate the following key features of an experiment:

  • Clear definition of treatment and comparison conditions,
  • Clear inclusion and exclusion criteria for the study,
  • Methods to adjust for differences in observed characteristics between groups as a way to mitigate the inherent differences (see the sketch after this list), and
  • Criteria for identifying when evidence of an association might suggest a causal relationship.
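
As a sketch of the third feature, the following simulation (hypothetical data and variable names of our own) uses one common adjustment approach, inverse-probability weighting based on propensity scores, which is discussed in the third paper in this series. Here the confounders are observed, so the weighted comparison recovers the true effect; no such method can correct for confounders that are not recorded:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    n = 50_000
    age = rng.normal(60, 10, n)         # observed confounder
    comorbidity = rng.poisson(2, n)     # observed confounder
    x = np.column_stack([age, comorbidity])

    # Treatment assignment depends only on the observed covariates
    logit = 0.05 * (age - 60) + 0.3 * (comorbidity - 2)
    treated = rng.random(n) < 1 / (1 + np.exp(-logit))
    outcome = 1.0 * treated + 0.02 * age + 0.5 * comorbidity + rng.normal(size=n)
    naive = outcome[treated].mean() - outcome[~treated].mean()  # confounded

    # Fit a propensity model, then weight each patient by the inverse
    # probability of the treatment actually received (IPTW)
    ps = LogisticRegression().fit(x, treated).predict_proba(x)[:, 1]
    weights = np.where(treated, 1 / ps, 1 / (1 - ps))
    adjusted = (np.average(outcome[treated], weights=weights[treated])
                - np.average(outcome[~treated], weights=weights[~treated]))

    print(f"naive estimate:    {naive:.2f}")     # biased by age and comorbidity
    print(f"IPTW estimate:     {adjusted:.2f}")  # close to the true 1.0

The contrast between the naive and weighted estimates shows what adjustment for observed characteristics can, and cannot, accomplish.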

Assessing causality in observational studies

The statistical methods for analysis of individual-level data discussed in the third paper in this series [56] all relate to Cochran’s third point. Regarding the last point, since as early as the 1950s, epidemiologists have used the “Bradford Hill criteria” to determine whether existing observational data – taken together with an understanding of the related science – are sufficient to conclude that there is a cause and effect relationship. These criteria, developed by the British statistician Sir Austin Bradford Hill, are summarized as follows:

  1. Strength of association between risk factor (cause) and disease (health effect), measured by the risk ratio (RR), odds ratio (OR), or other measure of association (see the sketch following this list),
  2. Consistency of association (studies in different settings produce similar results),
  3. Dose-response relationship (the risk of disease increases with the level or extent of exposure),
  4. Temporal relationship (the cause precedes the effect),
  5. Biological plausibility (based on scientific theory; timing and magnitude compatible with effect),
  6. Coherence (consistency with pre-existing theory and other knowledge), and
  7. Specificity (the cause is not associated with other health effects).
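
For the first criterion, the two most common measures of association can be computed directly from a 2 × 2 table of exposure by disease status; the counts below are purely illustrative:

    # 2x2 table of exposure by disease status (illustrative counts only)
    a, b = 90, 910    # exposed:   diseased, not diseased
    c, d = 10, 990    # unexposed: diseased, not diseased

    risk_ratio = (a / (a + b)) / (c / (c + d))  # ratio of risks: 0.09 / 0.01
    odds_ratio = (a * d) / (b * c)              # ratio of odds
    print(f"RR = {risk_ratio:.1f}, OR = {odds_ratio:.1f}")  # RR = 9.0, OR = 9.8

When the disease is rare, the OR closely approximates the RR, which is why case-control studies, which can estimate only the OR, remain informative about relative risk.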

Hill did not intend that these criteria be used in an algorithmic way, for instance, requiring that all be met before an association is judged to be causal; rather, they serve as reminders of the kinds of evidence needed to make that judgment.

For example, consider the evidence regarding smoking and lung cancer at the time of the first U.S. Surgeon General’s report, issued in 1964 (Cochran served as the lead statistician). No one today questions the causal relationship between smoking and lung cancer, even though it has never been tested in an RCT in humans. The evidence available in 1964 showed that the strength of association was high, with estimated RRs of greater than 10 in a range of studies. This association was also consistent; elevated RRs were found in case-control and cohort studies, in different populations, and so on. There was also evidence of a dose-response relationship in epidemiological studies showing a higher risk of lung cancer with increasing time smoked and number of cigarettes. The temporal relationship was clear – the epidemiological studies compared current cancer to earlier smoking – and coherence was demonstrated by increased risk at sites other than the lung. The biological plausibility of the relationship was demonstrated through randomized experimental studies in animals, higher risks at anatomical sites more exposed to tobacco smoke, and what was known at the time about the causes of cancer. The specificity of the relationship is the only criterion not met. In this example the evidence was strong, and the public health need for action compelling. In most other cases faced by learning health systems, the call will be much harder.

In addition to informing the careful design of observational studies, the within-study comparison literature can provide much guidance about observational study methods, as it is generating empirical evidence of the degree of correspondence in estimated treatment effects between observational studies and RCTs. This literature usually involves a head-to-head comparison of effects estimated in an RCT with those of an observational study, although the focus has recently begun to shift toward the circumstances under which replication is most likely, in addition to assessing correspondence between the two. Most past within-study comparisons dealt with interventions in job training and education, with relatively few studies in health, [57] but the literature is rapidly expanding. Much remains to be learned, especially about the observational study designs that are more commonplace because they are more feasible, but also more problematic in terms of the assumptions made, such as propensity score methods.