About 30 percent, or 117 billion dollars, of Centers for Medicare and Medicaid Services (CMS) payments are now linked to quality of care delivery, with the goal of linking 90 percent of payments to quality of care by 2020 [1]. As a result, a new framework for rewarding health care providers for provision of high value care is needed in the United States. However, there have been substantial obstacles in establishing efficient and meaningful quality reporting systems and pay for performance programs, leading to increasing concerns from professional groups and health policy experts [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22].

One salient challenge for physician quality reporting systems is the significant gap between what is desirable to measure for establishing value-based payment arrangements and what is feasible for practices to report about the quality of care they deliver [2, 3, 5, 6, 8, 11, 14, 15, 16, 21, 23, 24, 25, 26, 27]. Although the wide-scale adoption of electronic health records (EHRs) has increased the availability of routinely collected structured (e.g., ICD diagnoses and CPT codes) and unstructured (e.g., free-text clinical notes) data that can be used to improve the reporting of quality of care, the bulk of information in the EHR is unstructured and not amenable to automated reporting.

Traditional quality measurement and reporting methods consider only structured EHR data. The high degree of organization facilitates automated reporting, but structured data provide a limited representation of each patient’s treatment across settings, their health outcomes, or why a clinical decision was made (e.g., a guideline was not followed because patients could not tolerate a high dose of medication) [12, 13, 23, 28, 29]. Unstructured EHR data, such as the free-text notes that a clinical team documents as part of a patient’s care process, are the opposite: they represent a rich and complex pool of clinical information that does not conform to a pre-specified format. Traditional methods for performance and quality reporting miss a substantial amount of information that is “locked” in clinical text. However, to warrant wide-scale adoption for population health management and measurement, clinical text analysis tools must be able to efficiently analyze an institution’s entire clinical text collection, easily adapt to new information extraction tasks, and demonstrate reliable performance across institutions and different EHR products.

Information extraction is the task of automatically extracting structured information from unstructured data sources. We developed and evaluated information extraction methods to assess the feasibility and potential benefit of improving EHR-based measurement for the Physician Quality Reporting System (PQRS). The PQRS is a physician quality reporting program implemented by the Centers for Medicare and Medicaid Services (CMS) under the Tax Relief and Health Care Act of 2006 [30]. In combination with the cost of care an eligible health professional delivers, the PQRS is used by Medicare as the basis for differential payments based on healthcare “value” [7, 18, 30, 31].

We discuss our research findings in the context of value-based payment for physicians. To the best of our knowledge, our work is the first systematic comparison of enhanced EHR-based methods with traditional methods in the setting of physician performance and quality measurement. Prior studies applying clinical text analysis for enhanced measurement largely reflect research on improving event detection for the patient safety domain, such as the adverse events measured by the Agency for Healthcare Research and Quality (AHRQ)’s Patient Safety Indicators (PSIs) and other surgery-related complications [32, 33, 34, 35, 36]. We sought to assess the feasibility and benefit of enhanced measurement, spanning multiple National Quality Strategy domains and healthcare settings. We also depart from prior work in the use of a hybrid framework for clinical text analysis, CLEVER (from CLinical EVEnt Recognizer) [37]. CLEVER is an open source tool that incorporates statistical term expansion components (i.e., word embedding) and semantic components (i.e., context analyzing rules) to expedite the development of rule-based extractors [38, 39].


Participants and Setting

Stanford Health Care (SHC) is an academic medical center located in Northern California. SHC provides inpatient and outpatient care for patients with high acuity disease, with a recent focus on primary care. During the time of our analysis, SHC used the Epic (Epic Systems, Verona, Wisconsin) EHR. Among the 646,973 patients that received care at SHC from 2008 through 2013, 178,794 senior patients, 4,213 dementia patients, 2,335 cataract surgery patients, and 7,414 ischemic stroke patients satisfied one or more of the ten PCPI denominator definitions. The SHC data for our study were provided by STRIDE (Stanford Translational Research Integrated Database Environment), which contains deidentified data for over 2 million patients [40]. We analyzed over 21 million notes in the Epic EHR, including Letters, Phone Encounter Logs, and Goals of Care notes, as well as more standard note types such as Progress Notes, Nursing Sign Out Notes, and ED Notes.

Physician Measures

Our study included ten physician performance and quality of care measures developed and approved by the Physician Consortium for Performance Improvement (PCPI), convened by the American Medical Association (AMA). Along with its name, PCPI group, and quality domain, each of our ten study measures is categorized by PCPI approval and PQRS adoption status in Figure 1.

Figure 1 

Physician Quality Measures by Title, National Quality Strategy Domain, Measure Type, and PCPI Approval Status

The measure technical specifications determined by PCPI standardize the collection of measures and are distributed, without modification, for claims and registry based reporting by CMS, private companies, and professional groups [41, 42]. Each measure’s technical specification includes the Measure Description, Measure Components, Measure Importance and Measure Designation. The Measure Description provides a short description of the measure; for example, the Measure Description for PQRS Measure #48, Urinary Incontinence Assessment, is “Percentage of female patients aged 65 years and older who were assessed for the presence or absence of urinary incontinence within 12 months”. The Measure Components refer to the Denominator Statement and the Numerator Statement. For example, the Measure Components for Measure #48, Urinary Incontinence Assessment, indicate the age, gender and CPT or HCPCS codes for identifying patients to include in the measure’s denominator, as well as the set of CPT codes for calculating the numerator. The Measure Importance describes the relevance of each measure, and the Measure Designation categorizes each measure by type and National Quality Strategy domain. We provide the Measure Description and the Numerator and Denominator Statements from the PCPI Measure Component section in our Appendix, Table A1.

Using PCPI technical specifications, we used only structured data to estimate the patients eligible for the denominator of each study measure. Keeping the denominator’s value fixed, we compared traditional structured and enhanced methods for estimating the numerator. We prioritized the choice of measures by identifying measures that we hypothesized would be under-coded in structured EHR fields. For example, a measurement approach that used structured data for the calculation of Measure #47, Advance Care Plan, could identify events relevant to the numerator by detecting a specific CPT code. However, most providers are unaware of these codes, and many relevant discussions between patients and physicians, as well as advance care plans, are still documented only in free text.

After recognizing that some measure numerators cannot be represented by the PQRS core coding systems (e.g., ICD, CPT, HCPCS), we also included measures that required disease recognition at a level of specificity that current coding systems do not support. For example, PQRS Measure #280, Staging of Dementia, is defined by the percentage of dementia patients that were staged as mild, moderate, or severe. The ICD allows for more granular dementia diagnoses such as presenile, senile, vascular and drug induced, but cannot capture information on whether a patient’s dementia is mild, moderate, or severe. Similarly, the PQRS core coding systems cannot be used to report Measure #191, “the percentage of patients aged 18 years and older with a diagnosis of uncomplicated cataract who had cataract surgery and no significant ocular conditions impacting the visual outcome of surgery and had best-corrected visual acuity of 20/40 or better (distance or near) achieved within 90 days following the cataract surgery patient reported 20/40 or greater vision within 90 days of cataract surgery”. Although International Classification of Diseases Version 10 Clinical Modification (ICD10-CM) allows for the indication of laterality, it has no code that can be used to indicate a patient with 20/40 or better vision.

Clinical Information Extraction

Our approach to the detection of events “locked” in clinical text was to employ an efficient and flexible framework for building custom extractors, called CLEVER. Our clinical information extraction system consisted of four steps. Figure 2 shows an overview of our system’s pipeline, based on a set of $n$ patients, where $p_i$ is the $i$th patient and $1 \le i \le n$; $m$ candidate events, where $\mathrm{cid}_j$ is the $j$th candidate event and $1 \le j \le m$; and a measurement observation window, $t$, of one year ($t = 365$).

Figure 2 

Overview of the Clinical Information Extraction Pipeline for Enhanced EHR-based Reporting

Step One of our information extraction pipeline was Terminology Construction. This step involved identifying the target concepts associated with the documentation of each measure in clinical text. Then, after using the UMLS and the SPECIALIST Lexicon to identify a set of high-quality biomedical “seed” terms for our target concepts, we used statistical term expansion techniques to identify new clinical terms that shared the same contexts [43, 44, 45].
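The statistical term expansion in Step One can be illustrated with a small sketch. It is not CLEVER’s implementation: the three-dimensional vectors, the example terms, the function name `expand_terms`, and the 0.95 similarity threshold are all invented for illustration. It assumes only that terms sharing contexts have nearby embedding vectors, so candidates are ranked by cosine similarity to a seed term.

```python
import math

# Toy embeddings, invented for illustration (not trained on clinical text).
EMBEDDINGS = {
    "incontinence": [0.9, 0.1, 0.0],
    "leakage": [0.8, 0.2, 0.1],
    "enuresis": [0.85, 0.05, 0.1],
    "cataract": [0.0, 0.9, 0.3],
    "dementia": [0.1, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand_terms(seeds, embeddings, threshold=0.95):
    """Return non-seed terms whose best similarity to any seed meets the threshold,
    ordered from most to least similar. The threshold is a hypothetical choice."""
    seed_vectors = [embeddings[s] for s in seeds if s in embeddings]
    if not seed_vectors:
        return []
    scores = {}
    for term, vector in embeddings.items():
        if term in seeds:
            continue
        best = max(cosine(vector, seed) for seed in seed_vectors)
        if best >= threshold:
            scores[term] = best
    return sorted(scores, key=scores.get, reverse=True)


expanded = expand_terms({"incontinence"}, EMBEDDINGS)
print(expanded)
```

In practice, candidate terms proposed this way would still need review before being added to a terminology, since distributional similarity also surfaces antonyms and unrelated co-occurring terms.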

Step Two, Pre-Processing, used our terminology to tag the target terms for our measure-specific target concepts and a general set of clinical concept modifiers included in CLEVER. For each target term tagged in clinical text, CLEVER extracted a range of high- and low-level features, such as the term’s target class; the surrounding context, or “snippet”, containing the target term; the note section, creation time and type (e.g., outpatient progress note); and the patient’s ID. After the features were populated into CLEVER’s intermediate event schema, we executed CLEVER’s rule-based labeling algorithm and added a column to the intermediate event schema, indicating a positive or non-positive label for each candidate event.
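A rough sketch of the Pre-Processing step is shown below. It is a simplified stand-in rather than CLEVER’s actual algorithm or schema: the `CandidateEvent` fields, the two target terms, the three negation cues, and the 40-character context window are assumptions chosen for illustration. Each term match yields a candidate event with its snippet, and a single context rule assigns a positive or non-positive label.

```python
import re
from dataclasses import dataclass

# Illustrative terminology and modifier cues (hypothetical, not CLEVER's lists).
TARGET_TERMS = {"advance directive": "ACP", "goals of care": "ACP"}
NEGATION_CUES = ("no ", "denies ", "without ")


@dataclass
class CandidateEvent:
    patient_id: str
    note_type: str
    target_class: str
    term: str
    snippet: str
    label: str  # "positive" or "non-positive"


def tag_note(patient_id, note_type, text, window=40):
    """Tag target terms in one note and label each candidate event."""
    events = []
    for term, target_class in TARGET_TERMS.items():
        for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            snippet = text[start:end]
            # Rule: a negation cue in the left context makes the event non-positive.
            left_context = text[start:match.start()].lower()
            negated = any(cue in left_context for cue in NEGATION_CUES)
            label = "non-positive" if negated else "positive"
            events.append(
                CandidateEvent(patient_id, note_type, target_class, term, snippet, label)
            )
    return events


positive_events = tag_note(
    "p1", "History and Physical", "Discussed goals of care with pt and her husband."
)
negated_events = tag_note(
    "p2", "Progress Note", "Patient has no advance directive on file."
)
print(positive_events[0].label, negated_events[0].label)
```

Substring cues such as `"no "` are deliberately naive; a production rule set would need token boundaries and scope limits to avoid false negation matches.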

In Step Three, Extraction, we used structured EHR data to estimate each measure’s Denominator and Numerator Statements. By matching on patient ID and year, we merged data from our structured EHR data sources with CLEVER’s intermediate event schema to create a candidate event matrix. We used note creation time to approximate the time of positive events extracted from clinical text, or in the case of events documented in structured data, visit time. Then, based on the approximated event time, all patients in the measure’s denominator were indexed from $t_1$, the date of their initial qualifying visit, through $t_{max}$, where “max” was the length of the measurement period indicated by each measure’s PCPI technical specification. In the final step, Patient-Level Reporting, we queried the candidate event matrix to calculate the value of each measure’s numerator and denominator, reporting the final measurement rate. CLEVER and the terminologies used for our experiments are publicly available and distributed as open source software under the MIT license. Additional details on our information extraction pipeline illustrated in Figure 2 are provided in our Appendix.
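The Extraction and Patient-Level Reporting steps can be sketched as follows. This is an illustration under stated assumptions, not the study’s code: the function name `measure_rates`, the `(patient_id, event_date)` tuple layout, the toy patients, and the choice to treat the observation window as inclusive at both ends are all hypothetical. The denominator is held fixed, and an event counts toward the numerator only if it falls within the window that starts at the patient’s initial qualifying visit.

```python
from datetime import date, timedelta


def measure_rates(denominator, structured_events, text_events, window_days=365):
    """Return (traditional_rate, enhanced_rate) as percentages.

    denominator: {patient_id: date of initial qualifying visit}
    structured_events, text_events: lists of (patient_id, event_date)
    """

    def patients_with_event(events):
        hits = set()
        for patient_id, when in events:
            first_visit = denominator.get(patient_id)
            if first_visit is None:
                continue  # patient is not eligible for this measure
            # Assumption: the window is inclusive at both ends.
            if first_visit <= when <= first_visit + timedelta(days=window_days):
                hits.add(patient_id)
        return hits

    traditional = patients_with_event(structured_events)
    enhanced = traditional | patients_with_event(text_events)
    total = len(denominator)
    return 100 * len(traditional) / total, 100 * len(enhanced) / total


# Toy data, invented for illustration.
eligible = {
    "p1": date(2010, 1, 15),
    "p2": date(2010, 3, 1),
    "p3": date(2010, 6, 10),
    "p4": date(2010, 9, 5),
}
coded = [("p1", date(2010, 2, 1))]
from_text = [
    ("p2", date(2010, 11, 20)),  # inside the window
    ("p3", date(2012, 1, 1)),    # outside the window
    ("p1", date(2010, 5, 5)),    # already counted from structured data
    ("p9", date(2010, 1, 1)),    # not in the denominator
]
traditional_rate, enhanced_rate = measure_rates(eligible, coded, from_text)
print(traditional_rate, enhanced_rate)
```

Counting patients rather than events keeps a patient with both a coded event and a text event from being counted twice in the enhanced numerator.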


The main study outcomes we report are the measurement rates for traditional and enhanced EHR-based quality measurement, and the precision, or positive predictive value (PPV), of enhanced event detection from clinical text.

We evaluated the precision of our enhanced quality measurement method using patients that qualified for the Numerator Component based on only unstructured EHR data. For each study measure, we randomly selected 100 patients that were potentially missed by traditional measurement methods for review by clinical experts. If a reviewer felt that enough evidence was present in a patient’s record to support the inclusion of the patient in the numerator, the reviewer was instructed to indicate true. For instances where there was a contraindication or insufficient evidence, reviewers were instructed to indicate false. To facilitate their case review, we included information from CLEVER’s intermediate event schema, e.g., the snippets for target terms, the note type in which the target term appeared, the note’s ID (NID), the time of the note, and the patient’s ID. Four example events that were selected for evaluation, with the ground truth (GT) labels assigned by clinical experts, appear in Table 1.
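As a worked illustration of the precision calculation, the sketch below computes the share of reviewed patients that experts marked true. The function name and the review outcome of 92 true labels out of 100 are hypothetical, chosen only so that the result corresponds to a precision of 0.92.

```python
def precision(expert_labels):
    """Fraction of reviewed candidates that experts confirmed (labelled True)."""
    if not expert_labels:
        raise ValueError("at least one reviewed case is required")
    confirmed = sum(1 for label in expert_labels if label)
    return confirmed / len(expert_labels)


# Hypothetical review outcome: 92 confirmed and 8 rejected out of 100 sampled patients.
reviewed = [True] * 92 + [False] * 8
print(precision(reviewed))
```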

Table 1

Evaluation Examples



T | Transition to comfort care HPI XX yo F from SNF Sent to ED for increased WOB. Per daughter and granddaughter (DPOA) pt would not want anything done (including fluids antibiotics meds) and would like to be allowed to pass peacefully. | History and Physical | 4325 | 346 | XXX
T | Discussed goals of care with pt and her husband. Pt brought in her advance directive which names her husband [name omitted] as her surrogate decision maker However, she has not documented her wishes with regards to life prolonging measures. | History and Physical | 6341 | 645 | XXX

T | The patient is also alert and oriented to person place and time. Distance Visual Acuity Right Eye Without correction With correction 20/40 -1 With Pin-Hole Autorefraction x Left Eye Without correction With correction 20/25 -1 With Pin-Hole Autorefraction x 3 | Letter | 3455 | 166 | XXX
F | Distance Visual Acuity Right Eye Without correction CF at 3' With correction With Pin-Hole NI Autorefraction x Left Eye Without correction 20/80 With correction With Pin-Hole Autorefraction x 3 Intraocular Pressure | Progress Note, Outpatient | 4425 | 169 | XXX

Notes: Examples for PQRS Measure #47, Advance Care Plan, and Measure #191, 20/40 or Better Visual Acuity within 90 Days Following Cataract Surgery, shown by expert-assigned ground truth label (GT), snippet, candidate event ID (CID), patient ID (PID), note ID (NID) and note type (NTYPE).


For our ten PQRS measures, we extracted seven measures from unstructured EHR data (i.e., clinical text) with 80 percent or higher precision. For the seven measures that enhanced measurement extracted with good performance, Table 2 shows the number of patients with clinical events that qualified for inclusion in each measure’s numerator in the “Patient Events” column. The number of patients in the denominator appears under “Eligible Patients.” On the left side of the “Patient Events” column, under “Structured Events,” is the total number of patients in the numerator identified with traditional measurement. Under “Unstructured Events” is the number of additional patients identified with enhanced measurement methods. We also show the precision of enhanced measurement for the Numerator Component, and in the last two columns of Table 2, we compare traditional and enhanced measurement rates, based on six years of annual reporting (2008 through 2013).

Table 2

Total Patient Events, Text-Based Extraction Precision and Multi-Year Measurement Rates for PQRS Measures Using Traditional and Enhanced Quality Reporting


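| Measure | Patient Events: Structured Events | Patient Events: Unstructured Events | Precision | Eligible Patients | Traditional Rate (%) | Enhanced Rate (%) |
|---|---|---|---|---|---|---|
| Measure #47 (NQF 0326): Advance Care Plan | 0 | 2412 | 0.92 | 181734 | 0.00 | 1.33 |
| Measure #48: Urinary Incontinence Assessment of Presence or Absence | 1002 | 7999 | 0.91 | 178794 | 0.56 | 5.03 |
| Measure #49: Urinary Incontinence Characterization | 578 | 2320 | 0.98 | 9001 | 6.42 | 32.20 |
| Measure #191 (NQF 0565): 20/40 or Better Visual Acuity within 90 Days Following Cataract Surgery | 0 | 423 | 0.76 | 2335 | 0.00 | 18.12 |
| Measure #280: Staging of Dementia | 0 | 574 | 0.89 | 4213 | 0.00 | 13.62 |
| Measure #286: Counseling Regarding Safety Concern | 0 | 144 | 0.82 | 4213 | 0.00 | 3.42 |
| Measure #287: Counseling Regarding Risks of Driving | 0 | 42 | 0.92 | 4213 | 0.00 | 1.00 |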

For the seven PQRS measures in Table 2, a total of 384,503 patients contributed to measure denominators, from 2008 to 2013. Among all patients eligible for PQRS reporting, traditional claims-based reporting methods identified 1,580 patients for inclusion in measure numerators. We found that our enhanced quality measurement approach identified an additional 13,914 patients missed by traditional methods with an average precision of 88 percent (95 percent CI: 83-93 percent). These additional events improved the assessment of PQRS measures spanning four different National Quality Strategy domains – i.e., Care-Coordination, Patient Safety, Effective Clinical Care, and Person and Caregiver Experience and Outcomes.
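The totals and rates reported for Table 2 follow directly from its counts, as the short sketch below shows. The counts are copied from Table 2; the variable and function names are ours, and the only assumption is that each rate is the numerator divided by the number of eligible patients, expressed as a percentage and rounded to two decimals.

```python
# (structured events, unstructured events, eligible patients), copied from Table 2.
TABLE_2 = {
    "47": (0, 2412, 181734),
    "48": (1002, 7999, 178794),
    "49": (578, 2320, 9001),
    "191": (0, 423, 2335),
    "280": (0, 574, 4213),
    "286": (0, 144, 4213),
    "287": (0, 42, 4213),
}


def rates(structured, unstructured, eligible):
    """Traditional and enhanced measurement rates, in percent."""
    traditional = round(100 * structured / eligible, 2)
    enhanced = round(100 * (structured + unstructured) / eligible, 2)
    return traditional, enhanced


total_structured = sum(row[0] for row in TABLE_2.values())
total_unstructured = sum(row[1] for row in TABLE_2.values())
total_eligible = sum(row[2] for row in TABLE_2.values())

for measure, row in TABLE_2.items():
    print(measure, rates(*row))
print(total_structured, total_unstructured, total_eligible)
```

The enhanced numerator adds the text-derived patients to the structured ones, which is why the enhanced rate can never fall below the traditional rate for the same denominator.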

For the two study measures that traditional methods detected from ICD codes (Measure #48, Urinary Incontinence Assessment, and Measure #49, Urinary Incontinence Characterization), enhanced quality measurement methods resulted in approximately eight-fold and four-fold increases, respectively, in the number of patients that satisfied the Numerator Statement of each measure. In addition, based on a traditional quality measurement approach, five PQRS measures reported zero patients in their numerator. As shown in Table 2, our enhanced measurement approach enabled PQRS measurement and reporting for the following PQRS measures: Staging of Dementia (574), Counseling Regarding Safety Concern (144), Counseling Regarding Risks of Driving (42), Advance Care Plan (2,412) and 20/40 or Better Visual Acuity within 90 Days Following Cataract Surgery (423).

To examine annual reporting trends based on our enhanced measurement method, Table 3 shows annual reporting rates. Similar to Table 2, the numerator and denominator of each measure appear in the “Patient Events” and “Eligible Patients” columns, respectively. To quantify the change in physician performance between consecutive reporting years, beginning with 2009, the column “Annual Improvement” reflects the difference, in percentage points, in the annual measurement rate between consecutive years. For example, Measure #280, Staging of Dementia, showed a 1.61 percentage point improvement in performance between 2010 and 2011. This measure continued to improve by another 1.67 percentage points between 2011 and 2012. None of our physician performance and quality measures increased in every year of the six-year period. Overall, however, we observed incremental improvements in the PQRS study measures over the six-year period and no dramatic decreases in reporting rates between consecutive years.

Table 3

PQRS Measurement Rates by Year as Measured by Enhanced Quality Measurement Reporting


| Measure | Year | Patient Events | Eligible Patients | Enhanced Rate (%) | Annual Improvement |
|---|---|---|---|---|---|
| Measure #47: Advance Care Plan (NQF 0326) | 2008 | 148 | 22146 | 0.67 | |
| | 2009 | 250 | 26297 | 0.95 | 0.28 |
| | 2010 | 380 | 31836 | 1.19 | 0.24 |
| | 2011 | 519 | 34986 | 1.48 | 0.29 |
| | 2012 | 695 | 40684 | 1.71 | 0.23 |
| | 2013 | 420 | 25785 | 1.63 | -0.08 |
| Measure #48: Urinary Incontinence Assessment | 2008 | 980 | 23508 | 4.17 | |
| | 2009 | 1331 | 27516 | 4.84 | 0.67 |
| | 2010 | 1578 | 30305 | 5.21 | 0.37 |
| | 2011 | 1777 | 33443 | 5.31 | 0.10 |
| | 2012 | 1972 | 39018 | 5.05 | -0.26 |
| | 2013 | 1363 | 25004 | 5.45 | 0.40 |
| Measure #49: Urinary Incontinence Characterization | 2008 | 316 | 23508 | 1.34 | |
| | 2009 | 452 | 27516 | 1.64 | 0.30 |
| | 2010 | 524 | 30305 | 1.73 | 0.09 |
| | 2011 | 578 | 33443 | 1.73 | 0.00 |
| | 2012 | 601 | 39018 | 1.54 | -0.19 |
| | 2013 | 427 | 25004 | 1.71 | 0.17 |
| Measure #191: Cataracts – 20/40 or Better Visual Acuity within 90 Days Following Cataract Surgery (NQF 0565) | 2008 | 25 | 384 | 6.51 | |
| | 2009 | 38 | 379 | 10.03 | 3.52 |
| | 2010 | 96 | 482 | 19.92 | 9.89 |
| | 2011 | 88 | 393 | 22.39 | 2.47 |
| | 2012 | 100 | 460 | 21.74 | -0.65 |
| | 2013 | 76 | 237 | 32.07 | 10.33 |
| Measure #280: Dementia Measure Group – Staging of Dementia | 2008 | 59 | 473 | 12.47 | |
| | 2009 | 74 | 576 | 12.85 | 0.38 |
| | 2010 | 95 | 792 | 11.99 | -0.86 |
| | 2011 | 117 | 860 | 13.60 | 1.61 |
| | 2012 | 155 | 1015 | 15.27 | 1.67 |
| | 2013 | 74 | 497 | 14.89 | -0.38 |
| Measure #286: Dementia Measure Group – Counseling Regarding Safety Concerns | 2008 | 4 | 473 | 0.85 | |
| | 2009 | 13 | 576 | 2.26 | 1.41 |
| | 2010 | 29 | 792 | 3.66 | 1.40 |
| | 2011 | 26 | 860 | 3.02 | -0.64 |
| | 2012 | 47 | 1015 | 4.63 | 1.61 |
| | 2013 | 25 | 497 | 5.03 | 0.40 |
| Measure #287: Dementia Measure Group – Counseling Regarding Risks of Driving | 2008 | 5 | 473 | 1.06 | |
| | 2009 | 5 | 576 | 0.87 | -0.19 |
| | 2010 | 8 | 792 | 1.01 | 0.14 |
| | 2011 | 10 | 860 | 1.16 | 0.15 |
| | 2012 | 8 | 1015 | 0.79 | -0.37 |
| | 2013 | 6 | 497 | 1.21 | 0.42 |

Notes: Measure, reporting year, the number of patient events (measure’s numerator), eligible patients (measure’s denominator), enhanced measurement rate (unstructured and structured data sources), and the annual rate of change from the prior year are indicated in each column. Years that show a decrease in the institutional annual performance of a quality measure appear in italics.
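The “Annual Improvement” column can be reproduced from the yearly counts, as sketched below for Measure #280, Staging of Dementia. The counts are copied from Table 3; the names are ours, and the sketch assumes that the column is the difference between consecutive rounded annual rates, which matches the published values for this measure.

```python
# (year, patient events, eligible patients) for Measure #280, copied from Table 3.
MEASURE_280 = [
    (2008, 59, 473),
    (2009, 74, 576),
    (2010, 95, 792),
    (2011, 117, 860),
    (2012, 155, 1015),
    (2013, 74, 497),
]

# Annual measurement rate in percent, rounded as in Table 3.
annual_rates = {
    year: round(100 * events / eligible, 2) for year, events, eligible in MEASURE_280
}

# Percentage-point change from the prior year, beginning with 2009.
years = sorted(annual_rates)
annual_improvement = {
    year: round(annual_rates[year] - annual_rates[prior], 2)
    for prior, year in zip(years, years[1:])
}

decrease_years = [year for year, change in annual_improvement.items() if change < 0]
print(annual_rates)
print(annual_improvement)
print(decrease_years)
```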


Quality measurement and reporting is evolving to accommodate a more comprehensive definition of quality in healthcare delivery. However, the gap between what is desirable to measure and what is possible for providers to report on the quality of care they deliver is significant. Given the critical role quality reporting systems have in the success of value-based reform, our work suggests that fundamental changes in EHR-based data analysis and automated reporting software are required to support value-based care.

Other studies of enhanced EHRs, although using somewhat different methodologies, have shown that unstructured EHR data provide a wealth of rich information that can be used for quality reporting. A key innovation of our work is the application of clinical text analysis to physician performance and quality reporting. We found that enhanced quality measurement and reporting systems have the potential to improve physician quality reporting systems in several important ways. First, in conjunction with information on the cost of care that physicians deliver, the ability to increase event detection across multiple National Quality Strategy domains provides more accurate and comprehensive information for establishing value-based spending arrangements. Whereas other studies focus on the detection of patient safety events, we demonstrate the ability of enhanced measurement to extract 13,914 additional events that were missed by traditional methods, spanning four different National Quality Strategy domains: Care-Coordination, Patient Safety, Effective Clinical Care, and Person and Caregiver Experience and Outcomes.

Second, we found that enhanced quality reporting methods can enable physician practices to calculate quality measures that are associated with overall patient care, including coordination of care and patient satisfaction. Although the role of the individual in healthcare is changing from that of a passive patient to an active consumer with increased financial responsibility for their healthcare costs, current quality reporting systems overwhelmingly focus on process measures that capture medical aspects of a patient’s care (e.g., cancer screening according to guidelines, or the administration of prophylaxis before surgery). To provide patients with meaningful information to compare costs and quality among providers, payment arrangements linked to quality of care must incorporate value from the perspective of multiple stakeholders, including patients. For example, our enhanced quality measurement approach resulted in over a four-fold increase in one PQRS measure from the Person and Caregiver Experience and Outcomes domain of the National Quality Strategy. In addition, enhanced quality measurement enabled the quantification of five study measures that could not be reported using traditional methods, including two PQRS measures from the Communication and Care-Coordination domain.

A third opportunity enabled through advancements in enhanced EHR-based systems is reporting efficiency. As of December 2015, over 450,000 eligible professionals chose a negative payment adjustment instead of participating in the PQRS program [2, 7]. Low participation has been attributed to a complex and labor-intensive measurement process, which has been estimated to cost over $40,000 per practice and over 14 billion dollars across all practices in 2015 [2, 7]. Without a “computable” quality measurement and reporting infrastructure for the whole EHR, including clinical text, physician quality reporting will continue to demand costly and labor-intensive manual chart abstraction. For example, all three measures from our Dementia Measure Group were detected only by our enhanced measurement approach. Even with a state-of-the-art EHR, a physician practice reporting the Dementia Measure Group must engage in a burdensome administrative process to satisfactorily participate in the 2016 PQRS program.

It is important to note that our study has several limitations. Not all measures were equally amenable to our enhanced measurement method. Specifically, we do not report measurement rates for the three PCPI measures from the Stroke and Rehabilitation group shown in Figure 1. Based on both traditional and enhanced measurement methods, the dysphagia measure, which is a retired PQRS measure, resulted in zero patients in the numerator and was omitted from our results. For the two Potentially Preventable Harmful Event measures from this group, our enhanced measurement method detected positive conditions of UTI or Stage 3 or greater decubitus, but could not determine whether the conditions were already present on hospital admission, and the precision was 10 percent and 14 percent, respectively. Since such low performance is unlikely to reflect meaningful results, and these measures have not been adopted by the PQRS, we also omitted them from our results.

The lack of a gold standard corpus to evaluate our clinical text analysis methods is also a limitation of our study. We demonstrate the benefit of enhanced quality reporting by comparing traditional and enhanced EHR-based measurement approaches, and showing substantial increases in the total number of additional events detected from unstructured sources, with good precision. However, the recall (or sensitivity) of our system is unknown. Similar to open domain information extraction tasks, the size and complexity of our corpus make it infeasible to manually annotate a sample of clinical notes that is large enough to provide a meaningful and unbiased estimate of system recall. A community resource of this type would be invaluable for developing and evaluating enhanced EHR-based measurement and quality reporting systems. Finally, a multi-site evaluation is needed to establish the generalizability of our methods.


For ten physician quality measures, we compared traditional reporting methods that considered only structured EHR data (e.g., diagnosis and procedural codes) with an enhanced EHR-based approach that included clinical text analysis. Based on our analysis of six years of EHR data from patients who visited Stanford Health Care, we found a total of 13,914 additional patient encounters relevant to the Numerator Component of measures adopted by Medicare’s Physician Quality Reporting System and identified relevant clinical events with good precision (88 percent; 95 percent CI: 83–93 percent). The additional patient encounters that we detected from clinical text encompassed four National Quality Strategy domains: Communication and Care-Coordination, Patient Safety, Effective Clinical Care, and Person and Caregiver Experience and Outcomes. In addition, for five PQRS measures that could not be detected using traditional methods, our enhanced approach made event detection feasible for quality measurement and reporting. Our study suggests that enhanced EHR-based methods have the potential to improve value-based payment arrangements by increasing the detection of clinical events related to physician performance and quality with good precision, by supporting automated reporting from unstructured EHR data across domains that are relevant to multiple stakeholders, including patients, and by expediting the costly and labor-intensive manual chart review process that is now associated with participating in programs such as the PQRS.