## Introduction

New methods are emerging that enable data mining for unsuspected drug and vaccine adverse reactions in large longitudinal databases, such as the United States Food and Drug Administration’s (FDA’s) Sentinel System [**1**], a distributed data network of administrative claims databases. Data mining is a technique for simultaneous monitoring of many exposure-outcome pairs without having to pre-specify particular pairs of interest. Data mining analyses have traditionally been performed using spontaneous reporting databases, which lack denominator data. Moreover spontaneous reports require suspicion by a health care worker or patient that an outcome is potentially the result of exposure to a given medical product, which means that these databases suffer from selective reporting of particular exposure-outcome pairs and persistent underreporting [**2**].

The longitudinal nature of administrative claims data provides the ability to systematically evaluate thousands of outcomes as potential adverse reactions. Data mining analyses using longitudinal data can act as a wide-ranging safety net, ensuring that rate and count data are collected and analyzed routinely. Such general safety surveillance can fulfill the congressional mandate to provide access to safety data summaries utilizing its new pharmacovigilance infrastructure [**1**] including identification of any new risks not previously identified, potential new risks, or known risks reported in unusual number [**3**].

Here, we focus on one data mining method that leverages these longitudinal data: the tree-based scan statistic [**4**]. Previously, it has been shown to perform well in postmarket medical product safety settings [**5****, 6****, 7**], and is planned to monitor nine-valent human papillomavirus vaccine exposure in the FDA’s Sentinel System [**8**]. Additionally, the United States Centers for Disease Control and Prevention have indicated that they intend to use the method in their vaccine monitoring system, the Vaccine Safety Datalink [**9****, 10**]. First, analytic datasets containing rate or count data for many disease outcome pairs are assembled using familiar epidemiologic designs that control for confounding. Second, data for these outcomes are organized into a hierarchical tree. For example, febrile seizures can be combined with other similar outcomes under a more general heading, e.g., convulsions. Table 1 shows a very small part of an example tree. Then, the tree-based scan statistic is calculated for the entire analytic dataset using maximum likelihood estimation and Monte Carlo hypothesis testing to automatically control for multiplicity among the many outcomes being evaluated.

TREE LEVEL | TREE NODE | TREE NODE NAME |
---|---|---|

1 | 06 | Diseases of the nervous system and sense organs |

2 | 06.04 | Epilepsy; convulsions |

3 | 06.04.02 | Convulsions |

4 | 06.04.02.00 | Convulsions |

5 / Leaf | ICD-9-CM 780.3 | Convulsions |

5 / Leaf | ICD-9-CM 780.31 | Febrile convulsions not otherwise specified |

5 / Leaf | ICD-9-CM 780.32 | Complex febrile convulsions |

5 / Leaf | ICD-9-CM 780.33 | Post traumatic seizures |

5 / Leaf | ICD-9-CM 780.39 | Other convulsions |

In the analyses described herein, the hierarchical tree is predefined based on clinical knowledge and used to structure data, and the main results are the expected statistical power The null hypothesis is that there is no elevated risk for any of the thousands of outcomes. Conceptually, this use of the tree is very different from the tree structures created by classification and regression trees (CART), another data-mining method. In those analyses, the trees are the results of the analyses and that work is aimed at tree generation itself.

The advantage of employing a pre-defined hierarchical tree structure to arrange the analytic dataset is that it allows one to “borrow strength” when a clinical concept may be coded in multiple ways. Therefore, it is unnecessary for the investigator to specify which set of codes is used to identify a particular clinical concept. Additionally, a clinical concept can be experienced somewhat differently by certain individuals in a population, and the tree allows biologically-related reactions to be aggregated.

Nelson et al. have published a comprehensive review paper that describes other data mining techniques in longitudinal data [**11**], including logistic regression approaches [**12****, 13**], and disproportionality analyses [**14****, 15****, 16****, 17****, 18****, 19****, 20**]. The former approaches execute multiple logistic regression analyses and then use post hoc techniques to adjust for multiple hypothesis testing. Thus, while the maximum likelihood estimation is similar, the multiplicity control is different and there is not a way to leverage the tree structure to account for variable ways that clinical concept is coded. Disproportionality approaches were originally designed for spontaneous reporting data and then extended to make use of newly available longitudinal data. They use shrinkage estimators to informally control for multiplicity rather than through formal hypothesis testing, and also do not make use of the tree structure. Nelson et al.’s paper does not formally compare methods. Brown et al. compared Poisson tree-based scan statistics to disproportionality analyses and found reasonable concordance with the two approaches [**6**]. In other words, when the methods are applied to the same empirical dataset, both techniques showed similar detection capability although no formal power studies were performed for this comparison.

The tree-based scan statistic is hypothesis-generating, in that it produces an early warning with respect to potential associations. Because thousands of outcomes are evaluated simultaneously, confounding control is design-based using familiar epidemiologic techniques such as confounder adjustment of expected counts, restriction, stratification, or matching. As with any other data mining method, statistically significant “alerts” generated using the tree-based scan statistic must be carefully evaluated using other pharmacoepidemiologic methods where confounding control is more specifically tailored to the exposure-outcome pair of concern. In addition to generating statistically significant alerts, the method will also produce estimates of relative risk and attributable risk.

Moore et al. have expressed concern regarding the potential for missed safety signals in automated data [**21**]. Here, we demonstrate how to assess the statistical power of the tree-based scan statistic allowing regulators to understand its statistical power for different sample sizes and outcome frequencies. Our work is part of a larger literature that studies the statistical power of other types of scan statistics [**22****, 23****, 24****, 25****, 26****, 27****, 28****, 29****, 30**]. These sample size calculations should be used in the same way that sample size calculations are used for traditional epidemiologic studies: to allow the investigator to decide whether to proceed with a study or to wait for more sample size to accrue based on the desired ability to detect particular effect sizes of interest.

Statistical power varies with the effect size, the sample size, and the frequency of the underlying outcome rate. We simulated data using a new user cohort design, which compared an exposed population to an unexposed population. We created known alternative hypotheses that generated clusters of excess risk in the tree structure. We then used the tree-based scan statistic to analyze these data.

Based on these preparatory-to-surveillance power simulations, regulators can properly frame the aforementioned mandatory safety data summaries at eighteen months postmarket, clearly spelling out what level of risk was detectable. Further, such simulations allow regulators to make key process decisions related to the timing of retrospective data-mining analyses.

## Methods

### Tree-Based Scan Statistics for Cohort Data

The tree-based scan statistic detects elevated frequencies of outcomes in electronic health data that have been grouped into hierarchical tree structures. In our case, the tree structure is derived from the Agency for Healthcare Research And Quality’s Multi-Level Clinical Classifications Software (MLCCS) (http://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp). The MLCCS groups outcomes into clinically meaningful categories and arranges them into four grouping levels. The broadest grouping identifies eighteen body systems and the narrowest grouping may contain multiple ICD-9-CM codes, forming a “branch.” Each individual ICD-9-CM code is a “leaf.” Any particular location on the tree – be it at the leaf or branch level – is referred to as a node. Table 1 shows an example branch.

We curated the full MLCCS tree by excluding ICD-9-CM outcome codes that 1) are unlikely to be caused by medical product exposures such as well care visits and pregnancy; 2) are unlikely to manifest within a few weeks after exposure, such as cancer; and 3) are common and of a less serious or unspecific nature, such as fever or diarrhea. Following the curation of the original thirteen thousand unique ICD-9-CM codes, we evaluated 6,162 ICD-9-CM codes which all represent individual leaves on the tree. Overall, there are 6,861 nodes on the tree. The curated tree is available upon request.

The null hypothesis being tested is that, for all nodes on the tree, an outcome is expected to occur in proportion to the underlying expected count that defined that node, as generated from a Poisson distribution. The alternative hypothesis is that one or more particular nodes on the tree have outcomes occurring with higher probability than the specified expected counts on those nodes.

A log-likelihood ratio was calculated for every node on the tree. The maximum among these log-likelihood ratios from the real data set is the test statistic for the entire analytic dataset. This maximum is compared with the maximum log-likelihood ratios that were calculated in the same way from simulated datasets generated under the null hypothesis. If the test statistic from the real dataset is among the 5 percent highest of all the maxima, the null hypothesis is rejected. The fact that it is the maxima over the whole tree is what adjusts for the multiple testing. This hypothesis testing method allows one to detect whether any node on the tree had clusters of excess outcomes that were statistically significant while adjusting for multiple testing inherent to evaluating more than six thousand nodes [**31**]. Specific details of this procedure are included in the eAppendix.

Tree-based scan statistics can be used unconditionally or one can condition on the total number of observed outcomes in the dataset. Mathematical expressions for both versions can be found in the eAppendix. Conditioning is a mechanism to control for situations when there is an across-the-board increase in health care utilization during a particular time period that is unrelated to the exposure of interest. This situation might occur commonly in vaccine safety surveillance when the cohort has follow-up tests or visits in the days immediately following their well-care visit when a vaccine was administered. The conditional tree-based scan statistic attenuates this health care utilization unrelated to the exposure by standardizing all diagnoses by the frequency with which they appear in the dataset.

### Simulated Datasets

To create the simulated datasets, we required background rates, and chose the exposure of interest to be quadrivalent human papillomavirus vaccine (Gardasil, Merck and Co. Inc.), identified by CPT code 90649. The choice of exposure is incidental to the power evaluations, but we chose this example to motivate how one might use these power evaluations in decision-making.

We extracted background rates for all the outcomes in the curated MLCCS tree from Florida Medicaid data using a cohort of 9-26 year olds from June 2006 to June 2009. All persons were minimally enrolled for 183 days in the health plan to ascertain chronic medical conditions and then began contributed time to the background rates. Contributed time was censored for any of the following criteria: 1) the last date of the study period, 2) disenrollment, 3) when the first incident outcome occurred with incidence criteria defined next, 4) or when a subsequent identical vaccination occurred. Vaccinated individuals only contributed unexposed time post-vaccination in days after the designated risk window. Never-vaccinated individuals were allowed to contribute time after the 183 day run-in period. Key metrics to describe the source data for the background rates are listed in Table 2. These background rates are used to calculate age-adjusted expected counts that are used by the Poisson-based tree-based scan statistic for comparison with the simulated observed counts in the risk window.

KEY METRICS | |
---|---|

Total person-years followed | 1,807,325 |

Total events | 256,117 |

Total persons | 24,369 |

Total exposed person-years | 1,664 |

Total expected events | 164.1 |

Total observed events in exposed time | 379 |

Outcome events were defined by ICD-9-CM codes and visit location/setting. An incident outcome was defined as the chronologically first third-level MLCCS outcome observed in the inpatient or emergency department setting, which was not observed during the prior 183 days in either the emergency department, inpatient or outpatient setting. This means that, even if it was a never before seen ICD-9-CM code, it was not counted if a different ICD-9-CM code belonging to the same third level MLCCS group, i.e. the same branch, was observed during the prior 183 days. For example, as shown in Table 1, a febrile seizure (ICD-9-CM 780.31) and a complex febrile seizure (ICD-9-CM 780.32) are part of the same branch at the third-level node on the MLCCS tree (06.04.02). Therefore, in order for a 780.31 code to be incident, none of those branch-level outcomes could have occurred in the previous 183 days.

### Alternative Hypotheses

To understand the statistical power to detect various effect sizes, we pre-defined effect sizes of interest ranging from 5 excess event per million doses to 500 excess events per million doses. We chose four different outcomes that have varying incidence rates and created known alternative hypotheses by injecting the risk at the leaf level (i.e., ICD-9-CM code) on the tree. The choices of outcomes also were incidental, but were required to be differing orders of magnitude in their base frequency in the dataset. We used Monte Carlo simulation to create multiple alternative datasets under both the null and known alternative hypotheses. The incidence rates and the known alternative hypotheses were inputs to stochastic Poisson processes. That is, these values allow us to calculate expected counts that serve as the parameter of interest for Poisson random draws. Using the maximum log-likelihood ratio as the test statistic, we computed the percentage of time an alert is raised when the type I error was set to 0.05. This output was the statistical power.

All analyses were performed using the power evaluation feature in the free TreeScan tool (www.treescan.org, v1.1.3), which calculates pure power of the analytic dataset. That is, when performing a power evaluation, we do not know which particular nodes give rise to the alert, only that an alert was generated. The probability of signaling on the particular node with the injected elevated risk is slightly lower than the pure power since there is an allowance for false positive alerts (i.e., 0.05). For actual analyses of real data (i.e., those that do not use the power evaluation feature), it always possible to determine which nodes individually alert.

We also explored the effect on statistical power of composite alternative hypotheses of risk. A composite alternative hypothesis is one in which the elevated risk is assigned to an outcome that is defined over a grouping of ICD-9-CM codes rather than being assigned to a singular ICD-9-CM code. Such a scenario is more likely to occur when multiple ICD-9-CM codes could be assigned for the same clinical concept. We used both optic neuritis and thrombocytopenia as examples. To illustrate, optic neuritis may be coded as 377.30 or 377.39, and in our source data, the latter was coded 10 times more frequently. When we created a simple injected risk scenario, then the risk was only elevated at one node on the tree, i.e., at the most frequently coded ICD-9-CM code. In contrast, when we created a complex injected risk scenario, then the risk was elevated at all nodes on the tree associated with the concept. Thrombocytopenia can be coded with eight different ICD-9-CM codes. We held effect sizes constant and performed the same statistical power analyses.

We also created artificial elevations in the occurrence of all outcomes uniformly throughout the tree on all nodes, representing an across-the-board increase in health care utilization during the risk window. We used these known alternative hypotheses to evaluate the conditional tree-based scan statistic that is designed to control for such utilization. For this comparison, we held effect sizes constant and compared the probability of rejecting the null hypothesis of the conditional and unconditional tree-based scan statistics.

### Mis-Specification of the Risk Window

In the scenarios described above, the risk window was perfectly specified, meaning that the true risk window was coincident with the observed risk window. Data-mining does not involve pre-specification of hypotheses of interest, and therefore there is a universal risk window applied to the 6000+ outcomes. Consequently we considered circumstances when the specified risk window is either too short or too long, and the consequent effects on statistical power. Appropriate risk window specification has been considered in detail elsewhere [**32**].

First, we considered the circumstance when the true risk window was longer, but encompassed the observed risk window. For example, the true risk window could occur 1–28 days post-vaccination whereas the observed risk window could occur 1–14 days post-vaccination. That is, exposed outcomes in the 15–28 days following vaccination would be missed. Losses in sensitivity underestimate the true attributable risk but do not bias the true relative risk when assuming a Poisson likelihood, i.e. the risk is constant over the relevant time period. It has the same effect as reducing the overall sample size. That is, specifying a too-short risk window simply means that one needs more vaccinees to attain the same statistical power.

Then, we considered the circumstance when specifying a too-long risk window, i.e. when the true risk window was shorter and contained within the observed risk window. In these circumstances, the true relative risk is diluted or washed out, but the attributable risk remains unbiased. Therefore, in these scenarios, we calculated the observed effect size and created the known alternative hypotheses accordingly.

## Result

### Simple and Complex Injected Risk

Figure 1 shows the statistical power to detect various attributable risks. We vary the total sample size among four outcomes of interest with underlying event rates that vary by orders of magnitude. In the population of interest, syncope (ICD-9-CM 780.2) occurs most frequently at 95.6 events per 1 million doses whereas optic neuritis occurs least frequently at 0.30 events per 1 million doses.

When using a fixed risk difference measure, it is more difficult to detect the identical risk difference in a more frequently occurring event because it takes many such events to provide adequate separation of the treatment and comparator group population. To illustrate, five excess events in the treatment group amounts to statistical noise in a commonly occurring outcome such as a headache. With rare events, separation between the two groups is observable even with few events, thereby generating higher statistical power to rule out the same attributable risk. For example, five excess cases are quite meaningful for some autoimmune diseases that are only expected to occur once in a million exposed. As expected, it is easier to detect the same risk differences with larger sample sizes.

In administrative data, clinical concepts may be coded uniquely (i.e., a singular ICD-9-CM code), or as a collection of codes. In Figure 1, we included examples of injected risk when the risk is spread among a collection of related ICD-9-CM codes (e.g.,thrombocytopenia) to observe differences between singular injections of risk. The expansion of the outcome definition to encompass several codes increases the expected counts for the clinical concept. The higher expected counts are equivalent to testing a different outcome with a higher underlying frequency. In other words, when we hold the attributable risk constant, the total expected counts – regardless of whether it was derived using a singular ICD-9-CM code or a collection of codes – corresponds to the statistical power.

### Adjusting for Across-the-Board Elevations in Health Care Utilization

Table 3 demonstrates the ability of the conditional versus unconditional tree-based scan statistic to properly control for across-the-board elevations in health care utilization that happen to occur in the risk window but are unrelated to the exposure. We compare actual type I error observed to allowable type I error (i.e., 0.05). The unconditional tree-based scan statistic inflates type I error when general utilization is increased by as little as 3 percent. Utilization increases of this magnitude are not unusual in administrative data and have been observed by the authors in other analyses as well as in the source data as seen in Table 2. However the conditional tree-based scan statistic continues to hold type I error to the allowable level even when across-the-board health care utilization increases by 500 percent.

GENERAL INCREASES IN HEALTH CARE UTILIZATION APPLIED TO DATASET |
|||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|

0% | 1% | 2% | 3% | 5% | 8% | 10% | 20% | 50% | 200% | 500% | |

Unconditional | 0.05 | 0.06 | 0.06 | 0.08 | 0.24 | 0.82 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 |

Conditional | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 |

Figure 2 is a comparison of the statistical power of the conditional versus unconditional tree-based scan statistic in the absence of general health care utilization increases in the risk window. We compare small sample sizes to illustrate situations when the conditional tree-based scan statistic performs less well than the unconditional tree-based scan statistic. Under these circumstances, the conditional tree-based scan statistic detects the attributable risk less often than the unconditional statistic because the small sample size intensifies the “attenuation effect” that occurs when conditioning on the total number of cases. For example, observe the situation of a 1000 vaccine sample size with a true attributable risk of 5 excess cases per 1000 vaccinees (5000 events per million doses in Figure 2). In this dataset, the caseload is almost entirely due to the risk (5 observed cases that are all excess cases of syncope). Therefore, the conditional tree-based scan statistic mistakes part of the “signal” for noise and subsequently has reduced statistical power to detect it properly. Compare the 89 percent statistical power in the unconditional analysis to the 79 percent power in the conditional. Particular situations that warrant concern include the very early uptake period or low exposure prevalence medical products.

Figure 3 is a comparison of the statistical power of the conditional tree-based scan statistic’s performance with varying levels of attributable risk and varying levels of general increased utilization unrelated to the health outcome of interest. As before, when using fixed attributable risks, it is easier to detect signals against very rare background rates. However the statistical power of the conditional tree-based scan statistic is lower for the same fixed risk difference because of the attenuation effect due. For example, the statistical power of the conditional tree-based scan statistic to detect an excess risk of 100 excess events per million doses in an event that occurs with the frequency of syncope in a 500,000 vaccinee population is 97 percent when there is no background elevation in overall health care utilization. If overall health care utilization is 50 percent higher for reasons unrelated to the exposure, the statistical power to detect the same risk difference drops to 59 percent.

### Risk Window Mis-Specification

Figure 4 demonstrates the effect on statistical power when a too-long risk window has been specified. The ratio defined in the figure represents the ratio of the too-long observed risk window to the true risk window contained within it. The losses in statistical power occur because of the “washing out” of the signal. For example, compare the circumstance when the specified window is twice as long as the true risk interval (i.e., the ratio is 2) and we are interested in the statistical power to detect 100 events per million doses for an outcome that occurs with the frequency of syncope. In Figure 1, the statistical power is 98 percent as compared to 29 percent in Figure 4. The too-short risk window does not result in loss of statistical power as a consequence of bias. Rather a too-short risk window is equivalent to having a smaller sample size. Therefore, one can refer to Figure 1 and treat a loss of risk window days as a loss of vaccinees.

### Sensitivity Analyses: Pruning the Tree

Alerts at the most aggregated nodes on the tree are typically not actionable because they are so general. For example, an alert raised for quadrivalent human papillomavirus vaccine and “diseases of the circulatory system” is unlikely to be useful information. However, hypothesis testing is performed at these nodes. We tested whether “pruning the tree” to eliminate hypothesis testing at the top two levels of aggregated nodes would result in an increase in statistical power The results were unaffected by this pruning because of the relatively small number of nodes there, i.e. 18 nodes at the root level as compared to 6000+ nodes at the leaf level.

## Discussion

We performed numerous simulations to examine the statistical power of both the unconditional and conditional Poisson tree-based scan statistic for cohort-type data. In studies with small sample sizes, the unconditional tree-based scan statistic had slightly higher statistical power to detect attributable risk than the conditional tree-based scan statistic. However the unconditional tree-based scan statistic inflated type I error even in the presence of low general increases in health care utilization following exposure. The conditional tree-based scan statistic controlled type I error well when faced with general increases in health care utilization following exposure but experienced slightly decreased power as a consequence of the increasing noise. We observed reductions in statistical power resulting from specifying a too-long risk window, and reductions in sample size from specifying a too-short risk window.

To give our statistical power study context, we considered an example problem of quadrivalent human papillomary virus vaccine, which is administered to 9–26 year olds. We further developed background rates based on their “unexposed time” when we considered exposed time to occur in the first 28 days following vaccination. These background rates were used to compute expected counts for various sample sizes. The statistical power concepts and trends demonstrated with this example should apply to all problems regardless of the source data or the particular tree being utilized. We focus here on demonstrating the process to perform statistical power calculations using the power evaluation feature within the TreeScan software.

We also use these tables to prepare for future vaccine safety monitoring (e.g., nine-valent human papillomary virus vaccine) in a population that is represented by the source data and by using the same tree [**8**]. First, we estimate the number of vaccinated individuals expected at eighteen months. If the expected sample size of vaccinated individuals is small (as it is in Figure 2) and the overall health care utilization following exposure is not expected to be elevated, then an unconditional analysis is preferred to a conditional analysis. However, if the expected number of vaccinated individuals is larger, or if overall health care utilization is expected to be elevated in the “designated risk window” for reasons unrelated to the exposure, then the conditional tree-based scan statistic is preferred because it minimizes false positive alerting. One could estimate this increased level of utilization in the source data. For example, in Table 2, comparing the total observed events in exposed time (i.e., 379 events) to the total expected events (i.e., 164.1 events) yields a 2.2x elevation in overall health care utilization. These source data have greater overall health care utilization in the time period immediately following vaccination, which is expected due to follow-up visits that occur closely after well-visits for reasons unrelated to vaccination. Then, an investigator could use Figure 3 to get a sense of the statistical power available for various underlying event frequencies.

Our preparatory-to-surveillance simulation demonstrates what magnitudes of risk can be ruled out or detected based on expected sample size at the time of performance of a TreeScan analysis. Regulators can use these simulations to contextualize what type of safety information can reasonably be available at the congressionally mandated eighteen month/10000 user postlicensure review. Further, if multiple TreeScan analyses are likely to be performed over the course of a medical product’s lifetime, these simulations can be used to optimize analyses and limit potential reuse of observational data [**33**].

Data mining analyses using tree-based scan statistics expand the safety net of pharmacovigilance, ensuring adequate monitoring of thousands of outcomes of interest while controlling for multiple hypothesis testing. They are an important complement to the existing armamentarium of knowledge generation about the effects of medical products, and we have shown how to estimate statistical power for such analyses.