Aggregated data from electronic health records (EHRs) are increasingly being used for population health surveillance; however best practices for using this data source remain to be established . Before EHR data can be used for surveillance, it is necessary to establish that the health indicators reflect, with acceptable accuracy, the information contained within the EHR and the individual’s true health status. A number of studies have evaluated the feasibility of using EHRs for surveillance and the validity of indicators, both in the United States and around the world [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. The majority of these studies have compared EHR-based prevalence estimates with population survey estimates [2, 3, 4, 5, 8, 9, 10] or measured agreement of indicators between EHR data and the entire medical record [6, 7, 8, 9, 11, 12] or patient self-report . The results from these studies, however are largely self-contained and their generalizability to other EHR surveillance systems remains to be established.
Since 2007, the New York City (NYC) Department of Health and Mental Hygiene (DOHMH) has been assisting hundreds of ambulatory primary care practices with the uptake and use of EHRs and engaging in clinical and data quality improvement activities through the Primary Care Information Project (PCIP) . A key component of PCIP is the Hub Population Health System (the Hub), a network of practices using eClinicalWorks software. Using the Hub infrastructure, the NYC Macroscope was designed, in collaboration with the City University of New York School of Public Health, to estimate chronic disease and risk factor prevalence for the NYC adult population in care (defined as having seen a health care provider for primary care in the past year) . Practices included in the NYC Macroscope cohort met data documentation inclusion and exclusion criteria described elsewhere .
The development of the NYC Macroscope EHR surveillance system and the validity of indicators relative to the NYC Health and Nutrition Examination Survey (HANES), a gold standard population health examination survey, has been documented in a 3-part series previously published in eGEMS (Generating Evidence & Methods to improve patient outcomes) [4, 5, 16] These validation studies also compared individual-level NYC HANES data with EHR data from 48 NYC HANES participants whose data were included in the 2013 NYC Macroscope [4, 5]. Given the unique relationship the NYC DOHMH has with Hub Population Health System providers and the reliance of the NYC Macroscope on a single EHR software system, it remains unclear if NYC Macroscope indicators could be generalized to EHR surveillance systems using data from other providers or recorded on other EHR platforms. To determine the generalizability of NYC Macroscope indicators beyond that singular EHR platform, we examined the validity of the NYC Macroscope indicators for non-NYC Macroscope records from participants included in NYC HANES.
Data sources and sample
Data sources for this study were the 2013-14 NYC HANES, a population-based examination survey of NYC residents 20 years of age and older , and EHR data from a sub-sample of NYC HANES participants. Printed copies of EHRs were requested from providers via in-person visits, telephone, fax, and mail. Only records from providers practicing general internal medicine, family medicine, pediatrics, geriatrics, and adolescent medicine, based on the taxonomy associated with the National Provider Identifier , were included.
Of the 1,527 NYC HANES participants, 1,089 met eligibility criteria because they had reported visiting a health care provider for a “check-up, advice about a health problem, or basic care” within the previous year (i.e., “in care”), and did not have a proxy interview (Figure 1). Of these participants, 45 percent (491 individuals) signed a consent form and completed a Health Insurance Portability and Accountability Act waiver, granting us access to their medical records. We obtained printed copies of EHRs for 277 participants, of which 190 contained primary care data recorded within a year prior to the participant’s NYC HANES interview.
The 190 valid records consisted of 48 NYC Macroscope records and 142 non-NYC Macroscope records. The 142 non-NYC Macroscope records were received from 133 providers working in 89 medical practices throughout NYC, using more than 20 different EHR software systems. Of these non-NYC Macroscope records, 86 were received from 79 providers working in 49 medical practices that had attested to Stage 1 Meaningful Use, using more than 15 EHR software systems.
Chart review procedures
A team of medical and nursing students was trained to abstract relevant data from the participants’ medical records for the period starting from their NYC HANES interview date back to January 1, 2011. The abstracted data were stored on a secure and confidential internal server at the NYC DOHMH. Data were abstracted from the following structured fields: International Classification of Diseases Ninth Revision (ICD-9) diagnosis codes, body mass index (BMI), smoking, blood pressure, and laboratory values including total cholesterol and glycated hemoglobin (A1C). Data were also abstracted from unstructured fields, including provider notes and scanned documents. Scanned documents included external laboratory results not incorporated into the EHR interface, external medical records, consultation notes or records, and other relevant documents.
Data quality control
To ensure data quality, more than half (56 percent [107/190]) of the records were abstracted by multiple reviewers. Duplicate entries were compared, and discrepancies were investigated and corrected. Of the remaining 83 records reviewed by a single reviewer, 59 (71 percent) were abstracted by one of the 3 most accurate reviewers (inter-rater reliability [Kappa coefficient] among these 3 reviewers was 0.96). The remaining 24 charts were entered one time and reviewed by a supervisor to confirm data entry accuracy.
The NYC DOHMH and City University of New York Institutional Review Boards approved collection of HIPAA waivers from NYC HANES study participants, and the NYC DOHMH Institutional Review Board approved this study.
Self-reported data from NYC HANES were collected through an in-person interview, and objective clinical measures were obtained through physical examination and laboratory testing of a blood specimen. NYC Macroscope indicators were based on data from structured EHR fields that captured diagnoses, medications, vital signs, laboratory values, and smoking status.
Appendix table A1 provides definitions of the NYC Macroscope and NYC HANES indicators for prevalence measures of obesity, smoking, hypertension, diabetes, and hyperlipidemia. Obesity was defined as a BMI ≥30, using the most recent BMI recorded in the EHR; BMI in NYC HANES was calculated based on measured height and weight. Smoking was based on the most recently recorded smoking status in the EHR and was selfreported in NYC HANES. The diagnosis indicators for hypertension, diabetes, and hyperlipidemia were based on the presence of an ICD-9 diagnosis code in the EHR, and were contrasted to the NYC HANES measures of self-reported diagnosis. Augmented indicators for hypertension, diabetes, and hyperlipidemia were defined as an ICD-9 diagnosis code, or the most recent diagnostic blood pressure/laboratory values, or medications electronically prescribed to treat the condition, and were contrasted to the NYC HANES measures based on self-reported diagnosis and/or examination or laboratory findings .
Weighted demographic characteristics and health outcomes were compared between the NYC HANES chart review subsample (N=190) and the other NYC HANES in-care participants (N=945). Similar comparisons using unweighted data were made within the chart review sample, stratified by whether the participants’ records contributed to the NYC Macroscope (N=48) or not (N=142). Additionally, we described differences in sample distributions between NYC Macroscope (N=716,076) and the chart review sample (N=190). All comparisons were made using χ2 tests with a 2-sided a level set at 0.05.
NYC Macroscope indicator definitions were applied to the abstracted EHR data and computed outcome classifications were linked with NYC HANES outcome classifications for each individual. Agreement between the NYC Macroscope indicator classifications and the NYC HANES classifications was quantified using Cohen’s Kappa, sensitivity, and specificity. Kappa coefficients were interpreted using the criteria established by Landis and Koch that characterize agreement as slight (Kappa: 0.0-0.20), fair (0.21-0.40), moderate (0.41-0.60), substantial (0.61-0.80), and near perfect (0.81-1.0) . Sensitivity was characterized as high (0.90-1.00), moderate (0.70-0.89), and low (<0.70), and specificity was characterized as high (0.90-1.00), moderate (0.800.89), and low (<0.80). To aid interpretation, 95 percent confidence intervals were computed for sensitivity and specificity. All analyses were conducted using SAS 9.4 (SAS Institute, Cary, NC) and SAS-callable SUDAAN® software, version 11.0 (Research Triangle Institute, Research Triangle Park, NC).
The first sensitivity analysis restricted data to the subset of non-NYC Macroscope records from patients seen by providers who had attested to the Centers for Medicaid & Medicare Services EHR Stage 1 Meaningful use incentive program . Stage 1 Meaningful Use aims to optimize data capture and sharing through documentation and communication specifications (e.g., recording smoking status for patients 13 years of age and older; providing patients with timely access to their electronic health information). Attestation status was based on Centers for Medicaid & Medicare Services data through December 31, 2013, which coincided most closely with the attestation status achieved during the chart review look-back period [22, 23]. At that time, Stage 1 Meaningful Use was the highest level of attestation possible. This analysis was important because NYC Macroscope uses documentation quality criteria aligned with Stage 1 Meaningful Use as an inclusion criterion . The second sensitivity analysis assessed the impact of incorporating unstructured data on sample size and sensitivity for obesity and smoking status indicators, and on sensitivity for diagnosis and augmented indicators of hypertension, diabetes, and hyperlipidemia, using non-NYC Macroscope records.
Our final sample consisted of 190 NYC HANES participants in the chart review; 945 participants were excluded for various reasons (Figure 1).
When comparing demographics and health outcomes between people who were included or excluded from the study (Table 1), there were no significant differences with the exception of higher prevalence of hyperlipidemia (diagnosis) in the chart review sample (573 percent vs. 45.6 percent; P = 0.04). Within the chart review sample, there were no significant differences between participants with NYC Macroscope records and non-NYC Macroscope records. Overall, most participants in the chart review sample were >40 years of age (66.8 percent) and female (61.6 percent). In terms of risk factors and chronic diseases, 35.8 percent of the sample were obese, 14.3 percent were smokers, and based on diagnosis/augmented definitions, 15.2 percent/17.8 percent had diabetes, 37.1 percent/43.2 percent had hypertension, and 57.3 percent/62.2 percent had hyperlipidemia. Compared to the unweighted NYC Macroscope sample, the unweighted chart review sample was less likely to be in the 20-39 year age group (-3.1 percentage points), more likely to be female (+5.6 percentage points), and more likely to be from the city’s wealthiest neighborhoods (+9.3 percentage points). The weighted chart review sample also had higher prevalence of obesity (+8 percentage points), hypertension (+4.8 percentage points) and hyperlipidemia (+8 percentage points), as measured by NYC HANES, than the weighted NYC Macroscope sample [4, 5, 16].
|NYC MACROSCOPE SAMPLEa (N = 716,076)||2013-14 NYC HANES IN-CARE PARTICIPANTS
|NOT IN CHART REVIEW SAMPLE (N = 945)||OVERALL CHART REVIEW SAMPLE (N = 190)||CHART REVIEW SAMPLE
|NYC MACROSCOPE (N = 48)||NON-NYC MACROSCOPE (N = 142)|
|%||WEIGHTED %||WEIGHTED %||UNWEIGHTED %||UNWEIGHTED %|
|NEIGHBORHOOD POVERTY LEVELc|
|≥30% (very high)||23.4b||173||16.3||18.8||16.2|
Agreement and validity of indicators
In the total chart review sample combining both NYC Macroscope and non-NYC Macroscope records, indicator agreement between the medical record and NYC HANES ranged from fair (Kappa=0.30 for hyperlipidemia [diagnosis]) to near perfect (Kappa=0.86 for obesity) (Table 2). Sensitivity ranged from moderate (0.71 for hyperlipidemia [diagnosis]) to high (0.94 for diabetes [augmented]) and specificity ranged from low (0.59 for hyperlipidemia [diagnosis]) to high (0.98 for both smoking and diabetes [augmented]). Indicator agreement in the non-NYC Macroscope sample ranged from fair (Kappa=0.30 for hyperlipidemia [diagnosis]) to almost perfect (Kappa=0.89 for diabetes [augmented]). Sensitivity ranged from moderate (0.72 for hyperlipidemia [diagnosis]) to high (0.91 for both obesity and diabetes [augmented]) and specificity was high (>0.90) for all indicators other than hyperlipidemia (0.58 [diagnosis], 0.59 [augmented]). Overall, agreement, sensitivity, and specificity were similar between NYC Macroscope [4, 5] and non-NYC Macroscope records.
|INDICATOR||ALL PRACTICES||NYC MACROSCOPE PRACTICESb||NON-NYC MACROSCOPE PRACTICES
|OVERALL||RESTRICTED TO STAGE 1 MEANINGFUL USEc|
|Obesity||N=159||N = 44||N=115||N = 72|
|Sensitivity (95% CI)||0.91 (0.80, 0.97)||0.92 (0.64, 1.00)||0.91 (0.78, 0.97)||0.96 (0.80, 1.00)|
|Specificity (95% CI)||0.95 (0.89, 0.98)||0.97 (0.83, 1.00)||0.94 (0.86, 0.98)||0.98 (0.88, 1.00)|
|Smoking Status||N=151||N = 43||N=108||N=66|
|Sensitivity (95% CI)||0.83 (0.63, 0.95)||1.00 (0.54, 1.00)||0.78 (0.52, 0.94)||0.90 (0.56, 1.00)|
|Specificity (95% CI)||0.98 (0.94, 1.00)||1.00 (0.91, 1.00)||0.98 (0.92, 1.00)||0.96 (0.88, 1.00)|
|Sensitivity (95% CI)||0.87 (0.69, 0.96)||1.00 (0.66, 1.00)||0.81 (0.58, 0.95)||0.83 (0.52, 0.98)|
|Specificity (95% CI)||0.97 (0.93, 0.99)||0.95 (0.83, 0.99)||0.98 (0.93, 0.99)||0.97 (0.91, 1.00)|
|Sensitivity (95% CI)||0.94 (0.80, 0.99)||1.00 (0.69, 1.00)||0.91 (0.72, 0.99)||0.86 (0.57, 0.98)|
|Specificity (95% CI)||0.98 (0.94, 1.00)||0.97 (0.92, 1.00)||0.98 (0.94, 1.00)||0.99 (0.92, 1.00)|
|Sensitivity (95% CI)||0.84 (0.73, 0.92)||1.00 (0.78, 1.00)||0.79 (0.65, 0.89)||0.89 (0.72, 0.98)|
|Specificity (95% CI)||0.94 (0.89, 0.98)||1.00 (0.89, 1.00)||0.92 (0.85, 0.97)||0.91 (0.81, 0.97)|
|Sensitivity (95% CI)||0.84 (0.71, 0.89)||0.83 (0.59, 0.96)||0.80 (0.68, 0.89)||0.82 (0.65, 0.93)|
|Specificity (95% CI)||0.92 (0.85, 0.96)||0.93 (0.78, 0.99)||0.91 (0.83, 0.96)||0.88 (0.77, 0.96)|
|Sensitivity (95% CI)||0.71 (0.58, 0.82)||0.69 (0.39, 0.91)||0.72 (0.57, 0.84)||0.70 (0.50, 0.86)|
|Specificity (95% CI)||0.59 (0.44, 0.72)||0.62 (0.32, 0.86)||0.58 (0.41, 0.74)||0.52 (0.31, 0.72)|
|Sensitivity (95% CI)||0.80 (0.68, 0.89)||0.77 (0.46, 0.95)||0.81 (0.67, 0.91)||0.78 (0.58, 0.91)|
|Specificity (95% CI)||0.60 (0.43, 0.74)||0.60 (0.26, 0.88)||0.59 (0.41, 0.76)||0.52 (0.31, 0.73)|
Restriction to practices attesting Stage 1 Meaningful Use
When the non-NYC Macroscope sample was restricted to providers who attested to Stage 1 Meaningful Use (Table 2), sensitivity of the smoking indicator increased by 0.12 (from 0.78 to 0.90) and the sensitivity of the hypertension (diagnosis) indicator increased by 0.10 (from 0.79 to 0.89). Differences in other indicators were smaller and went in both directions. Differences in specificity were negligible.
When unstructured data were incorporated into the NYC Macroscope indicators for obesity and smoking status, the sample sizes increased by 1 and 3 records, respectively, and sensitivity changed minimally by 0.02. The incorporation of unstructured data into metabolic indicators had no or little impact on sensitivity of diabetes diagnosis, hypertension diagnosis, and all augmented indicators, but increased the sensitivity of the hyperlipidemia diagnosis indicator by 0.11. Across all metabolic conditions, all of the unstructured diagnoses (9 cases) were obtained from free text fields instead of scanned documents. (Full data not shown for this sensitivity analysis).
Overall, the indicators developed for the NYC Macroscope had generally good sensitivity and specificity in a sample of EHR records from outside of the NYC Macroscope when using NYC HANES as a reference. Relative to previously published data from a sample of 48 NYC Macroscope records [4, 5], Kappa coefficients in the non-NYC Macroscope sample ranged from 0.21 points lower to 0.05 points higher than in the NYC Macroscope sample. However, the impact of these quantitative differences on the characterization of agreement was minimal. We also found that restricting non-NYC Macroscope records to providers who attested to Stage 1 Meaningful Use had mixed influence on agreement across indicators, including no meaningful impact on agreement for diabetes indicators and augmented hypertension, improved agreement for obesity, smoking status, and hypertension diagnosis indicators, and decreased agreement for both hyperlipidemia indicators. Contrary to findings from a study of behavioral health records , we found that incorporating unstructured data did not improve indicator sensitivity. Our results suggest that several of the indicators developed for the NYC Macroscope EHR surveillance system are generally robust for application to populations outside of the Hub Population Health System. Jurisdictions outside of NYC can learn from these findings to accelerate the development of their own EHR surveillance systems and validation of indicators.
Among indicators, estimates of obesity prevalence had the best agreement between the NYC Macroscope and NYC HANES, as shown by the high Kappa coefficients, sensitivity, and specificity across all subsamples. Based on these data and findings from other studies [3, 10, 25, 26], we expect that, in most situations, EHR-based adult obesity prevalence estimates will accurately reflect the underlying prevalence in the population sampled. Observed differences at the population level will most likely reflect differences in sample composition rather than measurement error, as measurement error associated with height and weight in the primary care setting appears to be nonsystematic and is not differential by socioeconomic status, for example . Other jurisdictions may want to use the obesity indicator to carry out preliminary evaluations of the representativeness of their EHR data source before carrying out a full-scale multi-indicator validation study.
The smoking status indicator had good validity and agreement overall, with improved sensitivity when restricting the sample to records from providers who had attested to Stage 1 Meaningful Use. Smoking status sensitivity in the Stage 1 Meaningful Use sample, although lower than in NYC Macroscope records, was slightly higher than reported in two earlierstudies [11, 28]. One of these studies (Chen et al.) reported improvements in smoking indicator sensitivity over time . Previous studies reported that roughly one-third of primary care patients lacked data on smoking status in structured fields; [5, 29, 30] however, we anticipate that further uptake of Meaningful Use provisions will lead to less missing data on smoking status and in turn improve sensitivity. As a cautionary note, misclassification may be a problem among those with documented smoking status [31, 32]. For example, a former smoker might be misclassified as a current smoker because the provider did not update the patient’s smoking status in the EHR. Therefore, further efforts are warranted to improve documentation quality.
There was good agreement between the two data sources for diabetes indicators in both the non-NYC Macroscope sample overall and the restricted non-NYC Macroscope sample, with the augmented indicator having slightly better agreement than the diagnosis indicator. The observation that sensitivity was at least 0.14 points lower in non-NYC Macroscope records than in the NYC Macroscope sample supports our hypothesis that the unique relationship between the NYC DOHMH and Hub Population Health System providers contributes to high validity of the NYC Macroscope [4, 16, 33]. However, diabetes indicators in both the full and restricted non-NYC Macroscope samples had high sensitivity suggesting that these indicators may be acceptable for use in most other EHR systems, with or without Stage 1 Meaningful Use provider restriction, and may need only limited local validation. Our finding of the robustness of diabetes indicators is consistent with two studies from Britain; one compared selfreported diabetes diagnoses obtained in a survey of patients with EHR data  and the other compared the diabetes status classifications assigned by an algorithm with the complete medical record . Both studies found strong agreement between the two data sources. In addition, a study using data from the Women’s Health Initiative found a high level of agreement between self-reported diabetes and paper medical records .
The external validity of hypertension indicators was weaker relative to obesity smoking status, and diabetes indicators. Results for the augmented indicator, overall, were similar to those observed in the NYC Macroscope sample, when restricting non-NYC Macroscope records to Stage 1 Meaningful Use attestation, and similar to results for hypertension indicators used in other studies [7, 10]. The diagnosis indicator, on the other hand, was not as sensitive in the non-NYC Macroscope sample, especially when the records were not restricted to Stage 1 Meaningful Use. The higher sensitivity among the NYC Macroscope records might have been the result of NYC DOHMH’s data driven quality improvement efforts in improving blood pressure documentation by providers who contribute data to the NYC Macroscope relative to data from other providers. For this reason, we encourage jurisdictions interested in using EHR data for hypertension prevalence to devote resources to local validation research.
The hyperlipidemia indicators had low specificity in non-NYC Macroscope and NYC Macroscope samples alike. In both samples, sensitivity was borderline low for the diagnosis indicator and on the low side of moderate for the augmented indicator. There are many possible explanations for the poor validity of the hyperlipidemia indicators, including diagnoses in the EHR that are not captured in NYC HANES, which seems to be a more pervasive issue for hyperlipidemia than for diabetes and hypertension; [36, 37, 38, 39] the study’s 2-year laboratory lookback period, which is shorter than the approximate 5-year recommended cholesterol testing interval;  and factors guiding diagnostic decisions that are not captured in these indicators such as low-density lipoprotein levels and 10-year cardiovascular disease risk . If, contrary to our designation of NYC HANES as the reference standard, we accept the diagnoses in the EHR as accurate, the true hyperlipidemia prevalence could be substantially higher than the NYC HANES estimates suggest. Continued work is needed to develop a strong hyperlipidemia surveillance indicator.
An important limitation of this study was the small size of our sample and low statistical power. Even after combining NYC Macroscope and non-NYC Macroscope records, confidence intervals for sensitivity spanned more than 18 percentage points on average, making comparisons across indicators speculative at best. Another limitation was the year of data collection relative to secular changes in health information technology advances. The NYC Macroscope was designed in 2012 and data were captured in 2013 when Stage 1 Meaningful Use was the highest level of attestation. Documentation quality is likely to have improved considerably as the field moves towards the third and final stage of Meaningful Use attestation . It may be that the sensitivity of some of the measures, for example, smoking status, hypertension, and hyperlipidemia, is now higher than our findings suggest. We should also point out that our study was restricted to New York City residents. There may be factors that we have not assessed that would limit the generalizability of our findings to EHR systems in other geographic regions.
This study had several strengths. The chart review sample was small, but not significantly different from the entire NYC HANES cohort with regard to sociodemographic or health characteristics other than having higher prevalence of hyperlipidemia. Our results were consistent across provider subgroups. We evaluated 8 different indicators, 4 of which were based on direct physical or laboratory measurements (BMI and the 3 augmented measures). We also evaluated our indicators with and without provider inclusion criteria pertaining to documentation quality. Another strength is the study’s innovative design. The majority of EHR-based surveillance system validation studies are limited to comparisons of population-based prevalence estimates [2, 3, 8, 10, 11, 30, 42, 43, 44], inclusive of both measured and self-reported data. Chart review validation studies using individual-level data have only been carried out for a handful of EHR-based surveillance systems [6, 7, 9, 11, 12, 13]. Unlike these self-contained studies that compared indicators using EHR data with review of the complete medical record or patient self-report, our study compared indicators using EHR data with individual-level data from a gold standard population-based survey that employed a physical examination and laboratory testing. Thus, our findings offer a bridge between examination surveys and EHR-based surveillance. Furthermore, our findings are not specific to any single EHR or health care delivery system, as we demonstrated the external validity of NYC Macroscope indicators in a sample of 133 providers and 89 practices that use more than 20 different EHR software systems.
Our findings suggest that many of the EHR-based surveillance indicators developed and validated for the NYC Macroscope are generalizable for use in other EHR surveillance systems; that provider inclusion/exclusion criteria based on attesting to Stage 1 Meaningful use has variable impact on indicator validity; and, that incorporation of unstructured data into EHR-based surveillance indicators may not be cost effective, given the little added value seen in this study and the expense and difficulty of incorporating natural language processing. Other studies, however, have found that unstructured data may contain speculative and diagnostic data that can be missed by algorithms [9, 24, 45]. Thus, other jurisdictions should determine at a local level if unstructured data have a relevant impact on indicator validity. The indicators assessed in this study may be transferable to other settings, but we would recommend that other jurisdictions developing their own EHR surveillance systems start with a small chart review of records from contributing practices to help inform indicator development. The results from this study can serve as a guide as they develop and validate indicators specific to their own EHR surveillance system.