To perform sample size calculations when using tree-based scan statistics in longitudinal observational databases.

Tree-based scan statistics enable data mining on epidemiologic datasets in which thousands of disease outcomes are organized into hierarchical tree structures, with automatic adjustment for multiple testing. We show how to evaluate the statistical power of the unconditional and conditional Poisson versions. The null hypothesis is that there is no increase in risk for any of the outcomes. The alternative is that one or more outcomes have an excess risk. We varied the excess risk, the total sample size, the frequency of the underlying event, and the level of across-the-board health care utilization. We also quantified the reduction in statistical power resulting from specifying a risk window that was too long or too short.

For 500,000 exposed people, we had at least 98 percent power to detect an excess risk of 1 event per 10,000 exposed for all outcomes. In the presence of potential temporal confounding due to across-the-board elevations of health care utilization in the risk window, the conditional tree-based scan statistic controlled type I error well, while the unconditional version did not.

Data mining analyses using tree-based scan statistics expand the pharmacovigilance toolbox, ensuring adequate monitoring of thousands of outcomes of interest while controlling for multiple hypothesis testing. These power evaluations enable investigators to design and optimize implementation of retrospective data mining analyses.

New methods are emerging that enable data mining for unsuspected drug and vaccine adverse reactions in large longitudinal databases, such as the United States Food and Drug Administration’s (FDA’s) Sentinel System [

The longitudinal nature of administrative claims data provides the ability to systematically evaluate thousands of outcomes as potential adverse reactions. Data mining analyses using longitudinal data can act as a wide-ranging safety net, ensuring that rate and count data are collected and analyzed routinely. Such general safety surveillance can fulfill the congressional mandate to provide access to safety data summaries utilizing the FDA's new pharmacovigilance infrastructure [

Here, we focus on one data mining method that leverages these longitudinal data: the tree-based scan statistic [

Example Branch of the Multi-Level Clinical Classifications Software Tree

TREE LEVEL | TREE NODE | TREE NODE NAME
---|---|---
1 | 06 | Diseases of the nervous system and sense organs
2 | 06.04 | Epilepsy; convulsions
3 | 06.04.02 | Convulsions
4 | 06.04.02.00 | Convulsions
5 / Leaf | ICD-9-CM 780.3 | Convulsions
5 / Leaf | ICD-9-CM 780.31 | Febrile convulsions not otherwise specified
5 / Leaf | ICD-9-CM 780.32 | Complex febrile convulsions
5 / Leaf | ICD-9-CM 780.33 | Post traumatic seizures
5 / Leaf | ICD-9-CM 780.39 | Other convulsions

In the analyses described herein, the hierarchical tree is predefined based on clinical knowledge and used to structure the data, and the main results are the expected statistical power to detect elevated risks. The null hypothesis is that there is no elevated risk for any of the thousands of outcomes. Conceptually, this use of the tree is very different from the tree structures created by classification and regression trees (CART), another data mining method. In those analyses, the trees are the results of the analyses, and that work is aimed at tree generation itself.

The advantage of employing a pre-defined hierarchical tree structure to arrange the analytic dataset is that it allows one to “borrow strength” when a clinical concept may be coded in multiple ways. Therefore, it is unnecessary for the investigator to specify which set of codes is used to identify a particular clinical concept. Additionally, a clinical concept can be experienced somewhat differently by certain individuals in a population, and the tree allows biologically-related reactions to be aggregated.

Nelson et al. have published a comprehensive review paper that describes other data mining techniques in longitudinal data [

The tree-based scan statistic is hypothesis-generating, in that it produces an early warning of potential associations. Because thousands of outcomes are evaluated simultaneously, confounding control is design-based, using familiar epidemiologic techniques such as confounder adjustment of expected counts, restriction, stratification, or matching. As with any other data mining method, statistically significant “alerts” generated using the tree-based scan statistic must be carefully evaluated using other pharmacoepidemiologic methods in which confounding control is tailored to the specific exposure-outcome pair of concern. In addition to generating statistically significant alerts, the method also produces estimates of the relative risk and attributable risk.

Moore et al. have expressed concern regarding the potential for missed safety signals in automated data [

Statistical power varies with the effect size, the sample size, and the frequency of the underlying outcome rate. We simulated data using a new user cohort design, which compared an exposed population to an unexposed population. We created known alternative hypotheses that generated clusters of excess risk in the tree structure. We then used the tree-based scan statistic to analyze these data.

Based on these preparatory-to-surveillance power simulations, regulators can properly frame the aforementioned mandatory safety data summaries at eighteen months postmarket, clearly spelling out what level of risk was detectable. Further, such simulations allow regulators to make key process decisions related to the timing of retrospective data-mining analyses.

The tree-based scan statistic detects elevated frequencies of outcomes in electronic health data that have been grouped into hierarchical tree structures. In our case, the tree structure is derived from the Agency for Healthcare Research and Quality’s Multi-Level Clinical Classifications Software (MLCCS) (

We curated the full MLCCS tree by excluding ICD-9-CM outcome codes that 1) are unlikely to be caused by medical product exposures, such as well-care visits and pregnancy; 2) are unlikely to manifest within a few weeks after exposure, such as cancer; and 3) are common and of a less serious or unspecific nature, such as fever or diarrhea. After curation of the original thirteen thousand unique ICD-9-CM codes, we evaluated 6,162 ICD-9-CM codes, each representing an individual leaf on the tree. Overall, there are 6,861 nodes on the tree. The curated tree is available upon request.

The null hypothesis being tested is that, for every node on the tree, outcomes occur in proportion to the underlying expected count for that node, with observed counts generated from a Poisson distribution. The alternative hypothesis is that one or more particular nodes on the tree have outcomes occurring with higher probability than the specified expected counts on those nodes.

A log-likelihood ratio is calculated for every node on the tree. The maximum of these log-likelihood ratios from the real dataset is the test statistic for the entire analytic dataset. This maximum is compared with the maximum log-likelihood ratios calculated in the same way from simulated datasets generated under the null hypothesis. If the test statistic from the real dataset is among the highest 5 percent of all the maxima, the null hypothesis is rejected. Taking the maximum over the whole tree is what adjusts for the multiple testing. This hypothesis testing method detects whether any node on the tree had a statistically significant cluster of excess outcomes while adjusting for the multiple testing inherent in evaluating more than six thousand nodes [
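For readers who wish to reproduce the testing procedure, the following is a minimal sketch, not the TreeScan implementation: it treats the nodes as a flat array of observed and expected counts (aggregation of leaf counts up the tree is assumed to have been done already) and uses the unconditional Poisson log-likelihood ratio.

```python
import numpy as np

rng = np.random.default_rng(42)

def max_llr(observed, expected):
    """Maximum unconditional Poisson log-likelihood ratio over all nodes.

    For a node with observed count c and expected count n, the LLR is
    c*ln(c/n) - (c - n) when c > n, and 0 otherwise (no excess risk).
    """
    c = observed.astype(float)
    n = expected
    llr = np.zeros_like(n)
    mask = c > n
    llr[mask] = c[mask] * np.log(c[mask] / n[mask]) - (c[mask] - n[mask])
    return llr.max()

def monte_carlo_p_value(observed, expected, n_sim=999):
    """Compare the real max-LLR with max-LLRs from null-simulated datasets."""
    t_real = max_llr(observed, expected)
    t_null = [max_llr(rng.poisson(expected), expected) for _ in range(n_sim)]
    # Standard Monte Carlo p-value; taking the maximum over the whole
    # tree is what provides the multiple-testing adjustment.
    return (1 + sum(t >= t_real for t in t_null)) / (1 + n_sim)
```

A dataset with a large excess at a single node yields a small p-value, while a dataset exactly matching its expectations cannot reject the null.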

Tree-based scan statistics can be used unconditionally, or one can condition on the total number of observed outcomes in the dataset. Mathematical expressions for both versions can be found in the eAppendix. Conditioning is a mechanism to control for situations in which there is an across-the-board increase in health care utilization during a particular time period that is unrelated to the exposure of interest. This situation may be common in vaccine safety surveillance, when the cohort has follow-up tests or visits in the days immediately following the well-care visit at which a vaccine was administered. The conditional tree-based scan statistic attenuates the effect of health care utilization unrelated to the exposure by standardizing all diagnoses by the frequency with which they appear in the dataset.
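The conditioning idea can be sketched as follows (an illustrative simplification with hypothetical counts, not the TreeScan implementation): expected counts are rescaled to sum to the observed total C, so a uniform across-the-board inflation of all counts leaves the rescaled expectations matching the observations and produces no signal.

```python
import numpy as np

def conditional_max_llr(observed, expected):
    """Maximum conditional (multinomial) log-likelihood ratio over nodes.

    Conditioning on the total observed count C: expected counts are
    rescaled to sum to C, and a node with count c and rescaled
    expectation n has LLR c*ln(c/n) + (C-c)*ln((C-c)/(C-n)) when c > n.
    (Nodes with c == C are omitted here for simplicity.)
    """
    c = observed.astype(float)
    C = c.sum()
    n = expected * (C / expected.sum())   # the conditioning step
    llr = np.zeros_like(n)
    mask = (c > n) & (c < C)
    llr[mask] = (c[mask] * np.log(c[mask] / n[mask])
                 + (C - c[mask]) * np.log((C - c[mask]) / (C - n[mask])))
    return llr.max()
```

With a uniform inflation of every count, the rescaled expectations equal the observed counts and the statistic is zero, which is how the conditional version absorbs across-the-board increases in utilization; an excess concentrated at one node still produces a positive statistic.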

To create the simulated datasets, we required background rates, and chose the exposure of interest to be quadrivalent human papillomavirus vaccine (Gardasil, Merck and Co. Inc.), identified by CPT code 90649. The choice of exposure is incidental to the power evaluations, but we chose this example to motivate how one might use these power evaluations in decision-making.

We extracted background rates for all the outcomes in the curated MLCCS tree from Florida Medicaid data using a cohort of 9-26 year olds from June 2006 to June 2009. All persons were enrolled in the health plan for a minimum of 183 days, to ascertain chronic medical conditions, and then began contributing time to the background rates. Contributed time was censored at the earliest of: 1) the last date of the study period, 2) disenrollment, 3) occurrence of the first incident outcome (with incidence criteria defined below), or 4) a subsequent identical vaccination. Vaccinated individuals contributed unexposed time only in the days after the designated post-vaccination risk window. Never-vaccinated individuals were allowed to contribute time after the 183-day run-in period. Key metrics describing the source data for the background rates are listed in Table
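The censoring rule amounts to taking the earliest of the four dates, when they occur; a minimal sketch with hypothetical argument names:

```python
from datetime import date

def censoring_date(study_end, disenrollment=None, first_incident=None,
                   next_same_vaccination=None):
    """End of contributed follow-up time: the earliest of the four
    censoring criteria; None means that criterion never occurred."""
    candidates = [d for d in (study_end, disenrollment, first_incident,
                              next_same_vaccination) if d is not None]
    return min(candidates)
```

For example, a person with an incident outcome before the end of the study period is censored at the outcome date.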

Key Metrics of the Source Data^{a} Used to Capture the Background Rates of Outcomes of Interest

KEY METRICS |
---|---
Total person-years followed | 1,807,325
Total events | 256,117
Total persons | 24,369
Total exposed person-years | 1,664
Total expected events | 164.1
Total observed events in exposed time | 379

^{a}These data are based on a 183-day lookback period, with an “exposed” risk window of 1–28 days following vaccination.

Outcome events were defined by ICD-9-CM codes and visit location/setting. An incident outcome was defined as the chronologically first third-level MLCCS outcome observed in the inpatient or emergency department setting that was not observed during the prior 183 days in the emergency department, inpatient, or outpatient setting. This means that, even if an ICD-9-CM code had never been seen before, it was not counted if a different ICD-9-CM code belonging to the same third-level MLCCS group, i.e., the same branch, was observed during the prior 183 days. For example, as shown in Table
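A simplified sketch of this incidence rule, using a hypothetical flat claims representation (one dict per diagnosis with person, date, third-level MLCCS group, and setting); a full implementation would also censor follow-up after the first incident outcome:

```python
def incident_outcomes(claims, washout_days=183):
    """First inpatient/ED outcome per third-level MLCCS group that was not
    preceded, within the washout window, by any claim (in any setting)
    from the same group for the same person."""
    claims = sorted(claims, key=lambda r: (r["person"], r["date"]))
    last_seen = {}   # (person, group) -> date of most recent claim
    incident = []
    for r in claims:
        key = (r["person"], r["group"])
        prior = last_seen.get(key)
        washed_out = prior is None or (r["date"] - prior).days > washout_days
        if washed_out and r["setting"] in ("inpatient", "ED"):
            incident.append(r)
        last_seen[key] = r["date"]   # all settings count toward history
    return incident
```

Note that an outpatient claim does not itself qualify as an incident outcome, but it does reset the washout clock for its MLCCS group.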

To understand the statistical power to detect various effect sizes, we pre-defined effect sizes of interest ranging from 5 excess events per million doses to 500 excess events per million doses. We chose four different outcomes with varying incidence rates and created known alternative hypotheses by injecting the risk at the leaf level (i.e., ICD-9-CM code) on the tree. The choices of outcomes were also incidental, but the outcomes were required to differ by orders of magnitude in their base frequency in the dataset. We used Monte Carlo simulation to create multiple datasets under both the null and known alternative hypotheses. The incidence rates and the known alternative hypotheses were inputs to stochastic Poisson processes. That is, these values allow us to calculate expected counts that serve as the parameters for Poisson random draws. Using the maximum log-likelihood ratio as the test statistic, we computed the percentage of time an alert was raised when the type I error was set to 0.05. This output was the statistical power.
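The power calculation reduces to two steps: estimate the critical value of the max-LLR from null simulations, then count how often alternative-hypothesis simulations exceed it. An illustrative sketch with a flat array of nodes and the unconditional Poisson statistic (the node index and injected excess are hypothetical inputs):

```python
import numpy as np

rng = np.random.default_rng(7)

def max_llr(observed, expected):
    """Maximum unconditional Poisson log-likelihood ratio over nodes."""
    c = observed.astype(float)
    llr = np.zeros_like(expected)
    mask = c > expected
    llr[mask] = c[mask] * np.log(c[mask] / expected[mask]) - (c[mask] - expected[mask])
    return llr.max()

def estimate_power(expected, node, excess, alpha=0.05, n_null=999, n_alt=1000):
    """Share of alternative-hypothesis datasets whose max-LLR exceeds the
    critical value estimated from null-hypothesis datasets."""
    null_stats = np.sort([max_llr(rng.poisson(expected), expected)
                          for _ in range(n_null)])
    critical = null_stats[int(np.ceil((1 - alpha) * (n_null + 1))) - 1]
    alt_mean = expected.copy()
    alt_mean[node] += excess            # inject the known excess risk
    alerts = sum(max_llr(rng.poisson(alt_mean), expected) > critical
                 for _ in range(n_alt))
    return alerts / n_alt
```

A large injected excess at a single rare node should yield power near one; a zero excess recovers approximately the nominal type I error.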

All analyses were performed using the power evaluation feature in the free TreeScan tool (

We also explored the effect on statistical power of composite alternative hypotheses of risk. A composite alternative hypothesis is one in which the elevated risk is assigned to an outcome defined over a grouping of ICD-9-CM codes rather than to a single ICD-9-CM code. Such a scenario is more likely to occur when multiple ICD-9-CM codes could be assigned for the same clinical concept. We used optic neuritis and thrombocytopenia as examples. To illustrate, optic neuritis may be coded as 377.30 or 377.39, and in our source data the latter was coded 10 times more frequently. In the simple injected risk scenario, the risk was elevated at only one node on the tree, i.e., at the most frequently coded ICD-9-CM code. In contrast, in the complex injected risk scenario, the risk was elevated at all nodes on the tree associated with the concept. Thrombocytopenia can be coded with eight different ICD-9-CM codes. We held effect sizes constant and performed the same statistical power analyses.

We also created artificial elevations in the occurrence of all outcomes uniformly throughout the tree on all nodes, representing an across-the-board increase in health care utilization during the risk window. We used these known alternative hypotheses to evaluate the conditional tree-based scan statistic that is designed to control for such utilization. For this comparison, we held effect sizes constant and compared the probability of rejecting the null hypothesis of the conditional and unconditional tree-based scan statistics.

In the scenarios described above, the risk window was perfectly specified, meaning that the true risk window coincided with the observed risk window. Data mining does not involve pre-specification of hypotheses of interest, and therefore a universal risk window is applied to the 6000+ outcomes. Consequently, we considered circumstances in which the specified risk window is either too short or too long, and the consequent effects on statistical power. Appropriate risk window specification has been considered in detail elsewhere [

First, we considered the circumstance in which the true risk window was longer than, but encompassed, the observed risk window. For example, the true risk window could be 1–28 days post-vaccination whereas the observed risk window was 1–14 days post-vaccination. That is, exposed outcomes in the 15–28 days following vaccination would be missed. Such losses in sensitivity underestimate the true attributable risk but do not bias the true relative risk when assuming a Poisson likelihood, i.e., when the risk is constant over the relevant time period. The mis-specification has the same effect as reducing the overall sample size. That is, specifying a too-short risk window simply means that one needs more vaccinees to attain the same statistical power.
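The sample-size equivalence can be seen with simple arithmetic (the daily rate and cohort size below are hypothetical): halving the risk window halves the expected exposed events, exactly as halving the cohort would.

```python
rate_per_person_day = 2e-4           # hypothetical background outcome rate
cohort = 500_000

full_window, short_window = 28, 14   # days post-vaccination

events_short_window = cohort * rate_per_person_day * short_window
events_half_cohort = (cohort / 2) * rate_per_person_day * full_window

# A 1-14 day window in the full cohort yields the same expected exposed
# events as a 1-28 day window in half the cohort.
assert events_short_window == events_half_cohort
```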

Second, we considered the circumstance of specifying a too-long risk window, i.e., when the true risk window was shorter than and contained within the observed risk window. In these circumstances, the true relative risk is diluted or washed out, but the attributable risk remains unbiased. Therefore, in these scenarios, we calculated the observed effect size and created the known alternative hypotheses accordingly.
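A worked example with hypothetical numbers makes the dilution concrete. Suppose the true risk window is days 1-14 with a relative risk of 2.0, the background rate is one event per day in the cohort, and the observed window is set to days 1-28:

```python
background_rate = 1.0           # events per day in the cohort (hypothetical)
true_rr, true_window, observed_window = 2.0, 14, 28

excess = (true_rr - 1.0) * background_rate * true_window      # 14 excess events
observed_events = background_rate * observed_window + excess  # 42 events seen
expected_events = background_rate * observed_window           # 28 expected

observed_rr = observed_events / expected_events
# The relative risk is diluted from 2.0 to 1.5, while the attributable
# risk (14 excess events) is unchanged.
```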

Figure

Statistical Power to Detect Various Attributable Risks

Note: This figure accounts for different background event rates, sample sizes, and coding algorithms (i.e., singular or multiple ICD-9-CM codes of interest). All simulations were performed with 99,999 iterations under the null hypothesis that observed counts for all nodes on the tree were expected to occur proportionately to the underlying expected counts; and with 10,000 iterations under the known alternative hypothesis using the unconditional tree-based scan statistic. Critical values were set at a signaling threshold of p = 0.05.

When using a fixed risk difference measure, it is more difficult to detect the identical risk difference in a more frequently occurring event, because many such events are needed to provide adequate separation between the treatment and comparator populations. To illustrate, five excess events in the treatment group amount to statistical noise for a commonly occurring outcome such as headache. With rare events, separation between the two groups is observable even with few events, thereby generating higher statistical power to rule out the same attributable risk. For example, five excess cases are quite meaningful for some autoimmune diseases that are expected to occur only once in a million exposed. As expected, it is easier to detect the same risk differences with larger sample sizes.

In administrative data, clinical concepts may be coded uniquely (i.e., a singular ICD-9-CM code), or as a collection of codes. In Figure

Table

Type I Error in the Conditional and Unconditional Tree-Based Scan Statistic under Conditions of Across-the-Board Elevations in Health Care Utilization^{a}

GENERAL INCREASE IN HEALTH CARE UTILIZATION APPLIED TO DATASET
 | 0% | 1% | 2% | 3% | 5% | 8% | 10% | 20% | 50% | 200% | 500%
---|---|---|---|---|---|---|---|---|---|---|---
Unconditional | 0.05 | 0.06 | 0.06 | 0.08 | 0.24 | 0.82 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00
Conditional | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05

^{a}All simulations were performed with 99,999 iterations under the null hypothesis that observed counts for all nodes on the tree were expected to occur proportionately to the underlying expected counts with a sample size of 500,000 vaccinees. Allowable type I error set to 0.05.

Figure

Statistical Power to Detect Various Attributable Risks with Various Sample Sizes Using Both the Unconditional Versus Conditional Tree-Based Scan Statistic in the Absence of Overall Increases in Health Care Utilization

Note: All simulations were performed with 99,999 iterations under the null hypothesis that observed counts for all nodes on the tree were expected to occur proportionately to the underlying expected counts; and with 10,000 iterations under the known alternative hypothesis. Critical values were set at a signaling threshold of p = 0.05.

Figure

Statistical Power to Detect Various Attributable Risks, Accounting for Different Background Event Rates and Different Levels of Overall Increases in Health Care Utilization

Note: All simulations were performed assuming a sample size of 500,000 vaccinees with 99,999 iterations under the null hypothesis that observed counts for all nodes on the tree were expected to occur proportionately to the underlying expected counts; and with 10,000 iterations under the known alternative hypothesis using a conditional tree-based scan statistic. Critical values were set at a signaling threshold of p = 0.05.

Figure

Statistical Power to Detect Various Attributable Risks When Mis-Specifying the Risk Window

Note: Ratio is the length of the observed/assumed risk window to the length of the true risk window. All simulations were performed assuming a sample size of 500,000 vaccinees with 99,999 iterations under the null hypothesis that observed counts for all nodes on the tree were expected to occur proportionately to the underlying expected counts; and with 10,000 iterations under the known alternative hypothesis using an unconditional tree-based scan statistic. Critical values were set at a signaling threshold of p = 0.05.

Alerts at the most aggregated nodes on the tree are typically not actionable because they are so general. For example, an alert raised for quadrivalent human papillomavirus vaccine and “diseases of the circulatory system” is unlikely to be useful information. However, hypothesis testing is performed at these nodes. We tested whether “pruning the tree” to eliminate hypothesis testing at the top two levels of aggregated nodes would result in an increase in statistical power. The results were unaffected by this pruning because of the relatively small number of nodes at those levels, i.e., 18 nodes at the root level compared with 6000+ nodes at the leaf level.

We performed numerous simulations to examine the statistical power of both the unconditional and conditional Poisson tree-based scan statistics for cohort-type data. In studies with small sample sizes, the unconditional tree-based scan statistic had slightly higher statistical power to detect attributable risk than the conditional version. However, the unconditional tree-based scan statistic inflated type I error even in the presence of low general increases in health care utilization following exposure. The conditional tree-based scan statistic controlled type I error well when faced with general increases in health care utilization following exposure, but experienced slightly decreased power as a consequence of the increased noise. We observed reductions in statistical power resulting from specifying a too-long risk window, and effective reductions in sample size from specifying a too-short risk window.

To give our statistical power study context, we considered an example problem of quadrivalent human papillomavirus vaccine, which is administered to 9–26 year olds. We developed background rates based on their “unexposed time,” with exposed time defined as the first 28 days following vaccination. These background rates were used to compute expected counts for various sample sizes. The statistical power concepts and trends demonstrated with this example should apply to all problems regardless of the source data or the particular tree being utilized. We focus here on demonstrating the process of performing statistical power calculations using the power evaluation feature within the TreeScan software.

We also use these tables to prepare for future vaccine safety monitoring (e.g., nine-valent human papillomavirus vaccine) in a population that is represented by the source data and by using the same tree [

Our preparatory-to-surveillance simulation demonstrates what magnitudes of risk can be ruled out or detected based on the expected sample size at the time a TreeScan analysis is performed. Regulators can use these simulations to contextualize what type of safety information can reasonably be available at the congressionally mandated eighteen-month/10,000-user postlicensure review. Further, if multiple TreeScan analyses are likely to be performed over the course of a medical product’s lifetime, these simulations can be used to optimize analyses and limit potential reuse of observational data [

Data mining analyses using tree-based scan statistics expand the safety net of pharmacovigilance, ensuring adequate monitoring of thousands of outcomes of interest while controlling for multiple hypothesis testing. They are an important complement to the existing armamentarium of knowledge generation about the effects of medical products, and we have shown how to estimate statistical power for such analyses.

All outcomes are first classified into a hierarchical tree structure described in the main paper. For each leaf i, let c_i be the observed number of outcomes and n_i the expected number of outcomes under the null hypothesis.

The next step is to define nodes on the tree. Each node G consists of one or more leaves that share a common branch; its observed count is c_G = Σ_{i∈G} c_i and its expected count is n_G = Σ_{i∈G} n_i.

The log likelihood ratio is derived from a Poisson-based maximum likelihood estimator and is:

LLR(G) = c_G ln(c_G / n_G) − (c_G − n_G), if c_G > n_G, and 0 otherwise

where c_G and n_G are the observed and expected counts for node G.

Log likelihood ratios are computed for computational convenience; results from them are identical to results based on likelihood ratios. The order in which the nodes are evaluated does not impact the results. The node with the maximum log likelihood ratio constitutes the test statistic.

The distribution of the test statistic under the null hypothesis is obtained by Monte Carlo simulation: many random datasets are generated under the null hypothesis, and the maximum log likelihood ratio is computed for each.

If the test statistic from the real dataset is among the highest 5 percent of the maxima from the random datasets, the null hypothesis is rejected.

When using the unconditional tree-based scan statistic described above, the null hypothesis is that each outcome occurs in proportion to the underlying background rate of the event, as given by the expected counts. In the conditional version, the null hypothesis is based on the relative magnitudes of the expected counts rather than the expected counts themselves, and the analysis is conditioned on the total number of outcomes in the whole tree. Thus, the statistical model is a multinomial distribution. A full derivation of the equations is in the paper by

Thus, we calculate the total number of outcomes in the risk window, C, and condition the analysis on this total.

Again, log likelihood ratios are used for computational convenience as opposed to likelihood ratios. The order in which the nodes are evaluated does not impact the results. The node with the maximum log likelihood ratio constitutes the test statistic.

The other difference occurs in the Monte Carlo simulation step. Now, every random dataset has to have the same total number of outcomes C, with counts generated from a multinomial distribution with probabilities proportional to the expected counts.
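Generating a null dataset with a fixed total is a single multinomial draw per dataset; a minimal sketch with hypothetical leaf expectations:

```python
import numpy as np

rng = np.random.default_rng(11)

expected = np.array([5.0, 3.0, 2.0])   # hypothetical leaf expectations
C = 10                                 # observed total, held fixed

# Under the conditional null, leaf counts follow a multinomial with
# probabilities proportional to the expected counts, so every simulated
# dataset has exactly the same total C.
simulated = rng.multinomial(C, expected / expected.sum())
```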

We gratefully acknowledge comments received from Katherine Yih and Jeffrey S. Brown on this project and manuscript as well as the project management efforts of Carolyn Balsbaugh.