Several years after the Affordable Care Act became law, a number of innovative efforts aim to improve the health care system in the United States by transforming payment and improving patient outcomes through enhanced patient-centered care and quality of care [1]. Health care costs continue to rise, with Americans spending almost twice as much as individuals in other developed nations while experiencing worse health outcomes [2, 3, 4]. Although quality-based payment programs have been around for decades, the Centers for Medicare and Medicaid Services (CMS) has recently been charged with changing 80 percent of these payments from fee-for-service to value-based, where “value” can be thought of as benefit over cost, or, for this example, the quality of care over its cost. Additionally, CMS recently announced the Merit-based Incentive Payment System (MIPS) under the Medicare Access and CHIP Reauthorization Act (MACRA), which will start paying clinicians based on quality of care beginning in 2019 [5, 6].

One effort that intends to improve value is the U.S. Department of Health and Human Services’ Million Hearts initiative. This national effort aims to prevent one million heart attacks and strokes, in part by focusing on clinical quality measures (CQMs) that represent key steps to reduce heart attacks and strokes; providing these simpler, lower-cost (and higher-value) services has been shown to reduce deadly and complicated heart attacks and strokes [7]. Known as the “ABCS”, the measures are: Aspirin when appropriate (CMS164v5; NQF0068), Blood pressure control (CMS165v4; NQF0018), Cholesterol management (CMS347v0; NQF pending), and Smoking cessation (CMS138v4; NQF0028) [7]. To assist in this effort, the Agency for Healthcare Research and Quality (AHRQ) supports the EvidenceNOW grant initiative. This initiative provides support to thousands of primary care practices across seven regional cooperatives to reduce patients’ cardiovascular risk by improving adherence to the ABCS measures to 70 percent [8]. A mixed methods evaluation of EvidenceNOW is being conducted by Evaluating System Change to Advance Learning and Take Evidence to Scale (ESCALATES), and global technical assistance is being provided by a separate team [9]. As part of EvidenceNOW, a set of experienced informatics leads help practices understand how they can leverage technology to produce and improve the ABCS measures, including helping with the CQM definitions [10].

For EvidenceNOW, the ABCS measures are defined by standardized electronic CQMs, where the CQM definition, logic, and pertinent concepts are encoded into Health Quality Measure Format (HQMF) and value sets [11]. Value sets are grouped codes from standard vocabularies (e.g., SNOMED CT, RxNorm, LOINC, and ICD-10-CM) that are found in Electronic Health Records (EHRs) and represent the clinical concepts of interest. Although the Aspirin, Blood Pressure, and Smoking Cessation measures have been used extensively, the measures previously used for Cholesterol Management were outdated and needed revisions with new criteria from the American Heart Association. This required the creation of the HQMF and value set for the Cholesterol measure. Although measures endorsed by CMS are stored on its website and their value sets are held and updated by the National Library of Medicine in the Value Set Authority Center (VSAC), these value sets needed to be created rapidly, and no official measure existed yet [12, 13].

Because the Cholesterol measure had recently changed, the specific definition and value sets had not yet been created. Therefore, ESCALATES contracted a standard measure developer (Research or “Res.”) to produce these elements. Since this measure was also planned to be used in value-based care programs, another standard measure developer created a separate value set for the Cholesterol measure using CMS’ Measure Authoring Tool (CMS). This parallel process allowed for a natural comparison of the measure development process. As with any process using controlled terminologies applied to real health care, many issues may impact results. As Cimino warned in the “Desiderata for Controlled Terminologies”, “concept orientation” requires using conceptual building blocks in order for the concept to be complete and cover multiple levels of detail [14, 15]. He also discussed the inevitable risk of redundancy in controlled vocabularies and stated that coding information in multiple ways should generally be avoided [14]. Likewise, Chute et al. described the need for non-ambiguous, non-overlapping concepts in order to support “aggregate outcome analyses” [13, 16]. More recently, Winnenburg and Bodenreider found gaps in the “completeness” and “correctness” of CQM value sets on the VSAC, and suggested metrics for value set developers to identify common errors [17]. However, measure developers are not required to follow these precepts for approval; rather, the process requires more generic steps for the creation, validation, and implementation of measures, as outlined in the CMS Measures Management System Blueprint [18]. With the advent of MACRA and MIPS, the widespread reporting of CQM data continues to grow, with the goal of monitoring population health for those receiving health care services; facilitating more complete data in key areas is crucial for new payment incentives and for understanding public health concerns [19]. Dependable value sets – with proper concept orientation and consideration of redundancy, completeness, and correctness – are central to this aim.

The parallel measure development process from EvidenceNOW and CMS led to subsequent parallelism in Cholesterol measure implementations; some cooperatives were using the CMS value set and others were using the Res. value set in their implementations of the Cholesterol measure. Rector described the implementation challenges of medical terminologies and stated that: “[one] reason that medical terminology is hard is the complexity of clinical pragmatics … and the need for testing the pragmatics of terminologies implemented in software.” [20] In this study we sought to use this parallel measure development process to:

  1. understand the common differences that may occur as part of measure definition and value set creation;
  2. understand how variations in value sets change a measure’s meaning through implementation differences in a CQM calculation registry; and
  3. discuss alternative processes for CQM development and implementation.

Our hypothesis was that the decisions made in the value set creation process would lead to differences in the measure populations and performance estimates.


We performed our study in two stages. We first identified the difference in specifications – unique identifiers, concepts, code groups, and coding systems – between the two value sets used to define the same Cholesterol measure. We then implemented the various versions in a quality measure calculation registry to understand how the differences in the value sets affected calculated prevalence of risk and measure performance.

Measure specifications

The intent of the Cholesterol CQM is to measure the proportion of patients at high risk for cardiovascular events – largely heart attack and stroke – who are currently prescribed or taking statin medication. The measure has three denominators that relate to risk: patients with atherosclerotic cardiovascular disease (ASCVD); those with low-density lipoproteins (LDLs) greater than 190 mg/dL; and patients with diabetes between 40 and 75 years of age with LDLs greater than 70 mg/dL. All three groups have a high risk of initial or subsequent vascular events, and evidence shows that the use of statin medications reduces those events [21].

Value set analysis

The EvidenceNOW team sent the Res. and CMS value sets to us as flat files. The two value sets contained different human-readable terms to represent convergent clinical concepts, so we first created categories by grouping the convergent concepts between the two sets. See Appendix A for a complete list of all the OID names contained in the Res. value set and those contained in the CMS value set, as well as a list of the study categories used for comparison. For example, the Res. value set contained a concept (assigned to one unique object identifier, or “OID”) named ‘palliative care’, while the CMS value set contained four concepts (assigned to four distinct OIDs) with various names (e.g., ‘comfort measures’) that represented the same broader clinical concept. We assigned the four concepts in the CMS value set to a single category named ‘palliative care’ and compared it to the ‘palliative care’ concept from the Res. value set; in this example, both categories were defined by the same code group of 11 codes (ICD-9-CM, ICD-10-CM, and SNOMED CT). Next, we mapped each category to the specific measure criteria as defined by the HQMF (e.g., the category ‘ischemic vascular disease’ was mapped to the criteria for ‘denominator one’, which identifies patients with ASCVD). We also counted the OIDs used to specify distinct concepts and the codes from standard vocabularies that define each category. Any identifiers that belonged to the code system “grouping” were excluded from our analysis. These conceptual groupings used a single grouping OID to represent a collection of individual OIDs; the individual OIDs within the groupings were used in our analysis rather than the singular grouping OID.
The ASCVD denominator was the most complex – it includes the grouping of acute coronary syndromes, history of myocardial infarction, stable or unstable angina, coronary arterial revascularization, stroke or transient ischemic attack, and peripheral artery disease of atherosclerotic origin – and we explored specific code variations of the aggregate denominator and its sub-groups, and related them to taxonomy structure.
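The category-level comparison described above can be sketched with plain set operations. This is a minimal illustration rather than the tooling used in the study: the OIDs, the codes, and the helper names `flatten_oids`, `codes_for`, and `compare_category` are all hypothetical, and grouping OIDs are assumed to nest only one level deep.

```python
from typing import Dict, List, Set, Tuple

Code = Tuple[str, str]  # (code system, code); all codes below are made up


def flatten_oids(oids: Set[str], groupings: Dict[str, List[str]]) -> Set[str]:
    """Replace each grouping OID with its member OIDs (one level deep)."""
    flat: Set[str] = set()
    for oid in oids:
        flat.update(groupings.get(oid, [oid]))
    return flat


def codes_for(oids: Set[str], oid_codes: Dict[str, Set[Code]]) -> Set[Code]:
    """Union of the codes attached to every individual OID in a category."""
    codes: Set[Code] = set()
    for oid in oids:
        codes |= oid_codes.get(oid, set())
    return codes


def compare_category(res: Set[Code], cms: Set[Code]) -> Dict[str, Set[Code]]:
    """Split two code groups into shared codes and codes distinct to each."""
    return {"shared": res & cms, "res_only": res - cms, "cms_only": cms - res}


# Hypothetical data: one Res. OID versus one CMS grouping OID with two members.
GROUPINGS = {"cms.group": ["cms.a", "cms.b"]}
OID_CODES = {
    "res.palliative": {("ICD10CM", "X00.1"), ("SNOMEDCT", "111")},
    "cms.a": {("ICD10CM", "X00.1")},
    "cms.b": {("SNOMEDCT", "111"), ("SNOMEDCT", "222")},
}

res_oids = flatten_oids({"res.palliative"}, GROUPINGS)
cms_oids = flatten_oids({"cms.group"}, GROUPINGS)
result = compare_category(
    codes_for(res_oids, OID_CODES), codes_for(cms_oids, OID_CODES)
)
print(len(res_oids), len(cms_oids), sorted(result["cms_only"]))
```

Counting the flattened OIDs and the sizes of the three resulting sets gives the kind of per-category tallies that a comparison like this relies on.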

Implementation and comparison

The Integrated Care Coordination Information System (ICCIS) is a clinical quality measure calculation registry [22]. ICCIS contains data mapped from a variety of EHRs from over 500,000 patients into a star database format that facilitates value set queries. Implementation of CQMs in ICCIS currently requires processing of human-readable CQM definitions and writing queries in Structured Query Language (SQL) to try to find the right patient data from the right clinical sources to calculate CQM performance. The Cholesterol measure HQMF included logic for patients who met the denominator criteria (patients with ASCVD, elevated LDLs, and diabetes), numerator criteria (patients taking a statin medication), and those excluded from the CQM calculation if they met the exceptions criteria (palliative care, pregnancy, statin allergy). After we received the specifications for the Cholesterol measure, we implemented a query in ICCIS following the specific criteria for each value set, and queried against the OIDs for the two different versions of the measure; the measure logic remained constant. Initially, the global composite measures were implemented and a quality measure performance rate was measured. Then, the measure was divided into its three unique denominators and each was implemented individually. Next, denominator one (patients with ASCVD) was divided into aggregate clinical concepts to see which concepts, taxonomies, and individual codes were most responsible for the performance rate.
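To illustrate how holding the measure logic constant while swapping value sets can change the result, the sketch below runs one function against two hypothetical code groups. It is an in-memory Python illustration, not the SQL used in ICCIS; the patient records, codes, and the `performance` helper are invented for the example, and the exception handling (removing a patient from the denominator only when the numerator is not met) is an assumption based on the usual convention for denominator exceptions.

```python
from typing import Dict, Set


def performance(
    patients: Dict[str, Set[str]],
    denominator_codes: Set[str],
    numerator_codes: Set[str],
    exception_codes: Set[str],
) -> Dict[str, int]:
    """Apply fixed measure logic to one value-set definition."""
    denominator = {
        pid for pid, codes in patients.items() if codes & denominator_codes
    }
    numerator = {pid for pid in denominator if patients[pid] & numerator_codes}
    # Assumed convention: exceptions apply only to patients missing the numerator.
    exceptions = {
        pid for pid in denominator - numerator if patients[pid] & exception_codes
    }
    denominator -= exceptions
    return {
        "numerator": len(numerator),
        "denominator": len(denominator),
        "exceptions": len(exceptions),
    }


# Hypothetical patients, each with the set of codes found in their record.
PATIENTS = {
    "p1": {"dx-1", "statin"},
    "p2": {"dx-1"},
    "p3": {"dx-2", "statin"},
    "p4": {"dx-2", "pregnancy"},
    "p5": {"statin"},
}

# Two hypothetical definitions of the same denominator concept.
RES_DENOMINATOR = {"dx-1", "dx-2"}
CMS_DENOMINATOR = {"dx-1"}

res_result = performance(PATIENTS, RES_DENOMINATOR, {"statin"}, {"pregnancy"})
cms_result = performance(PATIENTS, CMS_DENOMINATOR, {"statin"}, {"pregnancy"})
print(res_result)
print(cms_result)
```

In this toy data the broader denominator captures two additional patients, which changes the denominator size, the exception count, and the resulting rate even though the logic is identical.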

Inclusion Criteria

The measures were implemented against data from five medium to large primary care clinics (e.g., family medicine, internal medicine) serving almost 47,000 patients, in urban, suburban, and rural regions in Oregon. The practitioners in each clinic used a fully functional EHR in their ambulatory work. We chose to implement the measures using data from these five clinics because they had suitable structured data in the ICCIS database to generate reliable quality measure performance rates. The three clinics excluded from our analysis did not have an adequate average number of medications per patient stored as structured data in the ICCIS database; therefore we could not have reliably queried whether the patients were taking a statin medication.


A synopsis of the vocabularies by measure criteria and category found in the Res. and CMS value sets is shown in Table 1.

Table 1

Counts of Cholesterol measure vocabularies by criteria and category (excluding grouping OIDs)



| Criteria | Category | OIDs (Res.) | OIDs (CMS) | Codes (Res.) | Codes (CMS) | Vocabularies (Res.) | Vocabularies (CMS) |
|---|---|---|---|---|---|---|---|
| Initial Pt | Encounters | 6 | 8 | 32 | 42 | SNOMED, CPT, HCPCS | SNOMED, CPT, HCPCS |
| Denominator 1 | CABG | 1 | 3 | 160 | 160 | ICD-9/10, SNOMED | ICD-9/10, SNOMED |
| Denominator 1 | Carotid Intervention | 1 | 3 | 757 | 757 | ICD-9/10, SNOMED | ICD-9/10, SNOMED |
| Denominator 1 | Ischemic Vascular Disease | 1 | 9 | 561 | 169 | ICD-9/10, SNOMED | ICD-9/10, SNOMED |
| Denominator 1 | Myocardial Infarction | 1 | 3 | 62 | 90 | ICD-9/10, SNOMED | ICD-9/10, SNOMED |
| Denominator 1 | PCI | 1 | 3 | 80 | 80 | ICD-9/10, SNOMED | ICD-9/10, SNOMED |
| Denominator 1 | Stroke | 2 | 3 | 197 | 198 | ICD-9/10, SNOMED | ICD-9/10, SNOMED |
| Denominator 2 | Hypercholesterolemia | 0 | 3 | 0 | 14 | | ICD-9/10, SNOMED |
| Denominator 2 | LDL Test | 1 | 1 | 9 | 9 | LOINC | LOINC |
| Denominator 3 | Diabetes | 1 | 3 | 225 | 224 | ICD-10, ICD-9, SNOMED | ICD-10, ICD-9, SNOMED |
| Exceptions | Breastfeeding | 1 | 2 | 12 | 25 | SNOMED | ICD-10, SNOMED |
| Exceptions | ESRD | 1 | 3 | 6 | 6 | ICD-9/10, SNOMED | ICD-9/10, SNOMED |
| Exceptions | Liver Disease | 1 | 9 | 85 | 162 | ICD-9/10, SNOMED | ICD-9/10, SNOMED |
| Exceptions | Medical Reason | 1 | 0 | 19 | 0 | SNOMED | |
| Exceptions | Palliative Care | 1 | 4 | 11 | 11 | ICD-9/10, SNOMED | ICD-9/10, SNOMED |
| Exceptions | Pregnancy | 1 | 3 | 1981 | 1981 | ICD-9/10, SNOMED | ICD-9/10, SNOMED |
| Exceptions | Statin Allergen | 1 | 3 | 35 | 38 | RxNorm | RxNorm |
| Numerator | Statin RXNORM | 1 | 3 | 70 | 71 | RxNorm | RxNorm |
| Aggregate | Full Value Set | 23 | 66 | 4302 | 4037 | ICD-10, ICD-9, SNOMED, RxNorm, LOINC, CPT, HCPCS | ICD-10, ICD-9, SNOMED, RxNorm, LOINC, CPT, HCPCS |

Globally, there were more unique OIDs in the CMS value set (66) than in the Res. value set (23); however, there were more unique codes in the Res. value set (4,302) than in the CMS value set (4,037). The OIDs in the two versions differed in their levels of granularity. The CMS value set contained more individual OIDs, each representing a specific concept (e.g., separate OIDs for Hepatitis A and Hepatitis B), while the Res. value set had fewer OIDs, each representing a broader conceptual grouping (e.g., one OID for Liver Disease, which included Hepatitis A and B). Some of the larger differences within each category are highlighted in Table 1.

Table 2 shows the performance rates for various implementations of the measure, as well as the number of patients who met the exceptions criteria. The composite measures had similar results, with the Res. version returning 4,451 patients and 71.9 percent adherence, and the CMS version returning 4,204 patients and 72.7 percent adherence. Despite these similar rates, there was noticeable variation for the stand-alone implementations of ‘ischemic vascular disease’ (IVD) and ‘myocardial infarction’ (MI). The Res. implementation of IVD returned 677 more patients in its denominator than the CMS implementation of IVD, and had a 3.5 percent lower adherence rate. The Res. implementation of MI had a 7.5 percent better performance rate than the CMS version of MI, and the Res. denominator returned 110 patients versus 258 patients in the CMS version.

Table 2

CQM Performance Rates in Composite Measures, Aggregate Concepts, & Distinct Denominators


| Criteria | Measure or clinical concept | Res. performance | CMS performance | Res. exceptions | CMS exceptions |
|---|---|---|---|---|---|
| Composite Measure | Statin Therapy for the Prevention and Treatment of Cardiovascular Disease (CMS347v0) | 3202/4451 (71.9%) | 3057/4204 (72.7%) | 327 | 356 |
| Denominator 1 | Ischemic Vascular Disease (including Stable/Unstable Angina & Artery Disease) | 1802/2305 (78.2%) | 1330/1628 (81.7%) | 93 | 82 |
| Denominator 1 | Myocardial Infarction | 96/110 (87.3%) | 206/258 (79.8%) | 2 | 13 |
| Denominator 1 | Stroke | 466/684 (68.1%) | 519/753 (68.9%) | 31 | 42 |
| Complete Denominator 1 | Patients aged 21 years and older at the beginning of the measurement period with clinical ASCVD | 1896/2487 (76.2%) | 1767/2305 (76.7%) | 100 | 118 |
| Complete Denominator 2 | Patients aged 21 years and older at the beginning of the measurement period with any fasting or direct laboratory test result of LDL-C >= 190 mg/dL | 196/445 (44.0%) | 196/440 (44.5%) | 3 | 8 |
| Complete Denominator 3 | Patients aged 40 through 75 years at the beginning of the measurement period with Type 1 or Type 2 Diabetes with the highest fasting or direct laboratory test result of LDL-C 70-189 mg/dL in the measurement year or two years prior to the beginning of the measurement period | 1441/1891 (76.2%) | 1375/1782 (77.2%) | 255 | 259 |
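The rates and differences quoted from Table 2 can be reproduced with a few lines of arithmetic. The sketch below uses the numerator and denominator counts from the table; the helper names `rate` and `summarize` are illustrative, and the rate difference is taken between the rounded percentages, which is an assumption about how the reported differences were derived.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int]  # (numerator, denominator)

# (Res. counts, CMS counts) for selected rows of Table 2.
TABLE_2: Dict[str, Tuple[Counts, Counts]] = {
    "Composite measure": ((3202, 4451), (3057, 4204)),
    "Ischemic vascular disease": ((1802, 2305), (1330, 1628)),
    "Myocardial infarction": ((96, 110), (206, 258)),
    "Stroke": ((466, 684), (519, 753)),
}


def rate(counts: Counts) -> float:
    """Performance rate as a percentage, rounded to one decimal place."""
    numerator, denominator = counts
    return round(100 * numerator / denominator, 1)


def summarize(res: Counts, cms: Counts) -> Dict[str, float]:
    """Rates for both versions plus Res.-minus-CMS differences."""
    res_rate, cms_rate = rate(res), rate(cms)
    return {
        "res_rate": res_rate,
        "cms_rate": cms_rate,
        "rate_diff": round(res_rate - cms_rate, 1),
        "denominator_diff": res[1] - cms[1],
    }


for name, (res, cms) in TABLE_2.items():
    print(name, summarize(res, cms))
```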

These differences can be explained by the codes that are distinct to one value set and not included in the other, as shown in Table 3. Across all categories, there were 224 ICD-10-CM codes (620 unique codes total) in the Res. value set that were not in the CMS value set, and 119 ICD-10-CM codes (370 codes total) in the CMS value set that were not in the Res. value set. For IVD in particular, the Res. value set included 128 distinct ICD-10-CM codes, and one code alone (I65.29) accounted for 4.1 percent of the patients included in the Res. composite denominator and 8.1 percent of the patients included when Res. denominator 1 is a stand-alone measure. Appendix B contains an example of the overlap of codes for the ‘MI’ category. Of the 72 codes that were distinct to one of the ‘MI’ value sets and not included in the other, a subset were used frequently and are included in Table 3.

Table 3

Most Prevalent Distinct Code Details for Denominator 1 Clinical Concepts


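| Clinical concept | Value set | Distinct codes (prevalence where given) |
|---|---|---|
| Stroke | CMS | G45.9 |
| Stroke | Res. | G45.8 |
| Ischemic Vascular Disease | CMS | I25.5 (1.51%) |
| Ischemic Vascular Disease | Res. | I65.29 |
| | | 414.00 (24.28%) |
| Myocardial Infarction | CMS | I25.2 (1.28%); 435.9 |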

Table 3 provides the global prevalence of the most prevalent ICD-10 and ICD-9 codes that are distinct to one of the two versions.

The category ‘ischemic vascular disease’ (IVD) had the largest variation in OIDs and code groups, with the Res. value set containing 1 OID and 561 individual codes, while the CMS value set contained 9 OIDs and 169 individual codes. The Res. value set contained several codes that were not included in CMS; for example, it included ten codes from the I75 ICD-10-CM series for different granularities of ‘atheroembolism’. Although the CMS value set contained fewer ICD-CM codes, it also included distinct codes that were not in the Res. version. Interestingly, the Res. value set included 42 codes from the I70 series for ‘atherosclerosis’; the CMS value set, meanwhile, included only 6 codes from the I70 series – and none of these 6 codes overlapped with the code group included in the Res. value set.

In addition to the ICD-CM codes, 75 percent of the IVD codes in Res. that are not in CMS were SNOMED codes. The CMS value set defined IVD with SNOMED codes related to ‘atherosclerosis and peripheral arterial disease’, ‘stable and unstable angina’, and ‘myocardial ischemia’ while the Res. included these and many more concepts including but not limited to ‘coronary arteriosclerosis’, ‘coronary artery disease’, and ‘coronary heart disease’. Figure 1 demonstrates the selected SNOMED codes for one parent (carotid artery occlusion) and six children. Only one child was in both versions, while the parent and three other children were in the Res. version. Two children related to “asymptomatic occlusion of carotid artery” were in neither version; these codes represent incidental findings and might be appropriately excluded. There was no rationale provided for inclusion/exclusion of codes.

Figure 1 

SNOMED Code Representation for Parent and Children of ‘Carotid Artery Occlusion’ in the Different Measures

Finally, one EvidenceNOW cooperative reported a poor mapping to statin drugs as stored in the variety of EHRs from which they extracted the measures. Although the value sets are very similar in count (70 versus 71 RxNorm codes, with complete overlap except for 1 medication), no clarification was made to address data mapped to differing levels of the hierarchies. Figure 2 shows the RxNorm hierarchy with drug class at the top and generic clinical products and branded products at the bottom. We downloaded the value set for statins from RxMix, which is an application that provides RxNorm, RxTerms, NDF-RT (web service for accessing the current National Drug File), RxClass, and interaction application programming interfaces (APIs) [23]. The Res. value set for the Cholesterol measure included 70 of the Semantic Clinical Drugs (SCDs) from the RxMix value set, while the CMS value set contained all 71 SCDs. Neither Cholesterol measure value set contained Semantic Branded Drugs (SBDs) or Generic Packs (GPCKs).
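The mapping problem can be pictured with a small sketch: if a value set lists only SCD-level concepts, a medication recorded at another level of the hierarchy (for example, as a branded drug or an ingredient) will not match unless it is first related back to its SCDs. Everything below is hypothetical, including the identifiers, the `RELATED_SCDS` lookup, and the `in_value_set` helper; real RxNorm concept identifiers and relationship data would be needed in practice.

```python
from typing import Dict, Optional, Set

# Hypothetical value set containing only SCD-level concepts.
VALUE_SET_SCDS: Set[str] = {"SCD:statinA-10mg-tab", "SCD:statinA-20mg-tab"}

# Hypothetical lookup from concepts at other levels to their related SCDs.
RELATED_SCDS: Dict[str, Set[str]] = {
    "IN:statinA": {"SCD:statinA-10mg-tab", "SCD:statinA-20mg-tab"},
    "SBD:brandA-10mg-tab": {"SCD:statinA-10mg-tab"},
}


def in_value_set(
    med_id: str,
    value_set: Set[str],
    related: Optional[Dict[str, Set[str]]] = None,
) -> bool:
    """True if the medication matches directly or through a related SCD."""
    if med_id in value_set:
        return True
    if related is None:
        return False
    return bool(related.get(med_id, set()) & value_set)


# A branded-drug record misses the value set unless it is mapped to an SCD.
direct = in_value_set("SBD:brandA-10mg-tab", VALUE_SET_SCDS)
mapped = in_value_set("SBD:brandA-10mg-tab", VALUE_SET_SCDS, RELATED_SCDS)
print(direct, mapped)
```

Treating an ingredient-level record as a match when any related SCD is in the value set is a design choice that may be over-inclusive; a value set that states the intended hierarchy level would remove the need to guess.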

Figure 2 

RxNorm Hierarchy


We identified key differences in the value set creation process for two similar definitions of the same CQM. Although this did not drastically affect the global measure performance, the two different versions did lead to variations in calculated prevalence of patients at high risk from key conditions, with a 7.5 percent difference in performance rates for the aggregate clinical concept ‘myocardial infarction’, and a 3.5 percent performance difference for the concept ‘ischemic vascular disease’. Furthermore, there were 41.5 percent (677) more patients included in the aggregate concept ‘IVD’ as defined by the Res. value set, and 135 percent (148) more patients included in the concept ‘myocardial infarction’ as defined by the CMS value set. The CQM requirements of new payment systems have operationalized many vocabularies, but deciding how to interpret broad terms such as ‘heart attack’ and ‘stroke’ when creating specific sets of codes intended to represent these conditions leads to variation and confusion. The implications of these differences extend beyond administrative concerns around measurement, and also directly impact delivery of targeted patient care. As technical assistance providers, we know, anecdotally, that high-functioning care teams use patient-level data from CQM reports to plan targeted quality improvement interventions for patients not meeting measures. Clinics depending on implementations of measures with inaccurate value sets will not be able to identify all of the relevant patients for these interventions.

It is difficult to make a judgment about whether the Res. or CMS value set is more clinically accurate based on the face validity of the performance rates alone. Upon careful perusal of the distinct code details, we observed that the Res. value set’s definition of ‘MI’ only included ICD-10-CM codes associated with an “initial encounter”, whereas the CMS value set also included codes associated with “subsequent” and “unspecified” encounters. We could not think of any rationale for limiting the ‘MI’ subgroup to only relevant diagnoses at “initial encounters”, and the Res. value set developers did not provide us with any criteria on which to base this decision; therefore, we inferred that the CMS value set is more clinically accurate for the concept ‘myocardial infarction’. For IVD, we observed more random variation in code differences, and observed that the same code groups represented different clinical concepts between the two value sets. For example, the Res. value set included 70 codes from the ‘I63’ ICD-10-CM series (which represents the concept ‘cerebral infarction’) and included these in its concept ‘IVD’, while the CMS value set included the ‘I63’ ICD-10-CM series in its concept ‘stroke’. We could not make a clear determination about clinical accuracy for the IVD subgroup given this overlap and ambiguity of the codes and concepts included. No logic was provided by measure developers about the decisions made in value set development.

It is also difficult to evaluate the long-term impact of these value set differences on CQM implementations. One might hope that the coding differences we observed in this natural experiment might be resolved in further value set development. In fact, we learned retrospectively that the OID which represents ‘MI’ in the Res. value set was updated in 2016 to include “subsequent” and “unspecified” encounters making it identical to the ‘MI’ concept definition from the CMS value set. From this, we infer that the Res. OID was corrected to include the codes that were erroneously omitted. However, given our current process, we did not have any way of knowing about this update without manually searching the VSAC for this particular OID. Our results presented in this paper reflect the 2015 version of the ‘MI’ OID, and not the most updated version. Cimino identified the need for controlled vocabularies to gracefully change over time: “if … a concept is changed in such a way as to alter its meaning, what happens to the ability of the aggregated patient data that are coded before and after the change?” [14] Moving forward, CQM implementers will need to overcome the challenge of inadvertently using antiquated value sets.

From discussions with the EvidenceNOW informatics leads and measure developers, we suggest a set of potential changes to the development and implementation process:

  1. The measure development process should require significant validation with extant data sets, utilizing the substantial number of EHR mapped data sets existing now to determine the most accurate code sets to include. The developers of the Cholesterol measure conducted workflow assessments to determine whether the measure’s data elements were integrated into a typical clinic’s documentation workflow as structured data, but did not implement the measure directly against any clinical datasets.
  2. Another suggestion is that when value sets need to be created for new measures, informaticians should be included in the creation to mitigate potential challenges with concept orientation, redundancy, completeness, and correctness. The developers may already be doing this, as they indicated that they have a team of clinical experts and coding experts who make decisions about the level of granularity of code groups to include/exclude.
  3. Additionally, value set designation should be changed to clearly highlight developers’ choices at different levels of the various hierarchies. An example of this would be indicating which RxNorm drug class is chosen, and the subsequent SCDs that come from that choice in a measure. The measure developers indicated that they make these decisions about inclusiveness on a case-by-case basis based on their understanding of the clinical concepts of interest.
  4. Finally, system vendors, CQM implementers, the VSAC, and the NQF need to establish integrated processes for adequately disseminating updates to value sets. Collaboration between these groups is necessary to make sure that CQM implementations reflect the most recent – and most accurate – versions of measure specifications. Promoting the use of APIs to the VSAC for updates, for example, could be a promising step to automate this process.
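As one way to picture the update check suggested in the last item, the sketch below compares a locally implemented copy of each value set with a freshly published copy and flags any OID whose code list has changed. It deliberately does not call any real VSAC interface: how the published copy is obtained is left open, and the OIDs, codes, and the helpers `fingerprint` and `stale_oids` are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List


def fingerprint(codes: Iterable[str]) -> str:
    """Order-independent digest of a value set's code list."""
    joined = "\n".join(sorted(set(codes)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def stale_oids(
    local: Dict[str, List[str]], published: Dict[str, List[str]]
) -> List[str]:
    """OIDs whose locally implemented codes differ from the published codes."""
    return sorted(
        oid
        for oid, codes in local.items()
        if oid in published and fingerprint(codes) != fingerprint(published[oid])
    )


# Hypothetical example: one OID gained codes after the local copy was taken.
LOCAL = {
    "oid.mi.example": ["CODE_A", "CODE_B"],
    "oid.stroke.example": ["CODE_X", "CODE_Y"],
}
PUBLISHED = {
    "oid.mi.example": ["CODE_A", "CODE_B", "CODE_C"],
    "oid.stroke.example": ["CODE_Y", "CODE_X"],
}

outdated = stale_oids(LOCAL, PUBLISHED)
print(outdated)
```

Run on a schedule, a check like this would surface a change such as the later revision of the ‘MI’ OID without requiring a manual search for each identifier.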


This study has two primary limitations. First, we analyzed the performance rates for the various versions of the measure against data in only five primary care clinics in Oregon. Thus, the practical significance of our findings is unclear given the small size of the study. Second, our research design is based on a case study of a single statin therapy clinical quality measure, so the extent to which our findings are generalizable beyond this measure is unclear. Future analysis could be done against data from several other clinics to further assess the differences in the prevalence of the at-risk patient groups based on these distinct value sets.

The goal of this paper was to show differences between point estimates of measures when different definitions of key concepts were chosen by developers. Although we could test whether each difference is statistically significant, the primary point is to show the difference itself. The IVD-only implementation of denominator 1 had the largest difference in the number of patients included; a chi-square test provides a p-value < .05, showing that the magnitude of difference can be large enough to be significant. A difference of 1-2 percent in CQM performance indicates that 1-2/100 additional patients would be given appropriate treatment and monitoring for their disease in the definition most aligned with the concepts intended to be measured. Depending on the measure set, this may be important clinically. In addition, we know the causality of the difference (the change in value sets), so detecting whether the change causes large or small differences is what is important.
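For readers who want to see what such a test looks like, the sketch below applies Pearson's chi-square test (one degree of freedom, no continuity correction) to the IVD adherence counts from Table 2. This is one possible reading of the comparison: the exact counts and test variant used are not stated here, and the test treats the two groups as independent samples even though the two denominators are drawn from the same clinics and overlap. The function name `chi_square_2x2` is illustrative.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square statistic and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# IVD rows of Table 2: patients meeting / not meeting the numerator.
res_met, res_total = 1802, 2305
cms_met, cms_total = 1330, 1628

chi2, p_value = chi_square_2x2(
    res_met, res_total - res_met, cms_met, cms_total - cms_met
)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

With these counts the statistic exceeds the 3.84 critical value for one degree of freedom, which is consistent with a p-value below .05.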


We discovered significant differences for the clinical concept ‘ischemic vascular disease’ – and surprisingly little difference in other concepts – in the new measure development decision-making for the same clinical measure. Thus, we conclude that the measure development process should require developers of value sets to provide the explicit criteria used in the choices made regarding inclusion and exclusion of concepts. This study revealed that it is very hard to understand decisions at different levels of the various hierarchies (e.g., SNOMED, RxNorm) because there is no specific guidance provided in current value sets. Without this information, we may be underspecifying broad medical terms. Therefore, when CQMs are implemented against clinical data, the ability to accurately identify patients at risk for key conditions may be impacted. Importantly, there is a need to evaluate value sets against clinical data before publishing them for implementation. In addition, assumptions about the temporality of a condition – whether it can be new, existing, and/or historical to qualify – made a significant difference in the value set creation. Endorsements are needed to encourage collaboration between system vendors, payers, informaticians, and clinicians so that integrated terminology sets can evolve.