When Is Evidence Sufficient?
Abstract
Traditional conceptualizations of evidence-based medicine rely heavily on randomized controlled trials. Although initiatives to broaden definitions of evidence have been advanced, they generally have not tied evidentiary criteria formally and quantitatively to the benefits and costs involved in a decision to adopt or reject an intervention. Decision analysis provides a framework for combining information to inform the adoption decision in this manner. Value-of-information analysis, a related methodology, helps to determine whether it is worthwhile to collect additional information as well as the type of research that would be most helpful.
A framework for making use of all available information in medical decision making and for deciding whether more is needed.
The decision to adopt a new intervention, such as a new medical technology or program, depends on evidence. Evidence can be difficult, time-consuming, and expensive to collect, however, and the true performance of an intervention typically remains uncertain even after evidence has been collected. This residual uncertainty makes it necessary to establish standards for deciding when evidence is adequate to adopt an intervention. A second question is whether additional evidence should be gathered to further reduce uncertainty and, if so, what type.
In this paper we argue that conventional approaches to both of these issues are inadequate because they do not yield prescriptions that will best promote health. This paper advances a broader conceptualization of evidence-based medicine (EBM) that uses decision analysis and value-of-information (VOI) analysis to determine whether an intervention should be adopted, whether additional evidence to further inform that decision is worth gathering, and what kind of information is of the greatest value. We also discuss the experience of the United Kingdom’s National Institute for Clinical Excellence (NICE) in using these methods and implications for U.S. research and policy.
Traditional Conceptualizations Of EBM
Traditional approaches to the evaluation of health interventions in general and new medical technologies in particular rely both on predetermined criteria for assigning weights to different types of evidence and on long-established methods of statistical hypothesis testing.
Randomized controlled trials.
The favored form of evidence has been the randomized controlled trial (RCT), which serves as the gold standard for clinical research, with rigorous scientific design and prespecified endpoints. Decisions about whether or not to adopt a new drug, device, procedure, or program usually hinge first on the availability and quality of experimental evidence from trials.
However, RCT-based evidence on its own is not sufficient for making these decisions. First, RCTs typically are not designed to address implementation issues (for example, the extent to which patients will comply with a drug regimen), which in some cases can be the key variable determining an intervention’s success. Second, available RCT evidence may not directly compare the intervention under evaluation to the most relevant alternative. For example, it may compare a new medication to placebo but not to the best available treatment. Third, RCTs may not inform all of the other assumptions critical to evaluating a technology (for example, disease costs). In addition, demanding the use of RCT evidence ignores the usefulness of other sources of information. For example, suppose a well-conducted observational (nonrandomized) study suggests that a treatment for a terminal illness is effective. Does it make sense to withhold this treatment from the population until a “proper” RCT can be conducted?
Statistical hypothesis testing.
The use of classical statistical hypothesis testing to interpret study results can also lead to suboptimal decisions. Although seldom codified, conventional decision-making criteria often rely on such testing. In the context of medical technologies, this refers to the assessment of the hypothesis, based on RCT data, that the new technology offers an improvement in health outcomes relative to the status quo.
This comparison is complicated by “noise” introduced by natural variation across individuals. For example, even if a new pain relief medication has no medicinal benefit relative to placebo, patients in the treatment group may report better results by chance. Classical statistics addresses this problem by calculating the probability that a difference at least as large as the one observed between the treatment and the comparator (in this case the placebo) would arise from noise alone if there were no “real” difference. Only if this probability is sufficiently small—typically 5 percent—is the treatment under investigation declared superior. In the example of the pain medication, a conventional decisionmaker would therefore reject adoption of this new treatment if this probability exceeds 5 percent. Proponents of the new medication would at that point have the option of gathering additional evidence (such as by conducting a larger RCT that would, as a result, have less noise) in an effort to make their case.
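To make the conventional criterion concrete, the calculation can be sketched in a few lines. The trial summary statistics below are hypothetical, and a simple normal approximation stands in for the more elaborate tests an actual RCT analysis would use.

```python
import math

def one_sided_p_value(mean_trt, mean_ctl, sd_trt, sd_ctl, n_trt, n_ctl):
    """P-value under the normal approximation: the probability of observing
    a difference at least this large if the treatment truly had no effect."""
    se = math.sqrt(sd_trt**2 / n_trt + sd_ctl**2 / n_ctl)
    z = (mean_trt - mean_ctl) / se
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z) for standard normal Z

# Hypothetical pain-score improvements: treatment versus placebo
p = one_sided_p_value(mean_trt=2.1, mean_ctl=1.5,
                      sd_trt=2.0, sd_ctl=2.0, n_trt=50, n_ctl=50)
adopt = p < 0.05  # the conventional 5 percent criterion
```

Here p is roughly 0.07, so the conventional rule rejects adoption even though the observed benefit may well be real; nothing in the calculation reflects the drug's cost, its side-effect profile, or the severity of the symptoms it treats.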
The problem with the conventional approach is that the adoption criteria, like the exclusive reliance on RCT studies, are not linked to broader concerns about benefits and risks (and resource use).1 For example, suppose we know that the new pain medication has a low risk of side effects, low cost, and the possibility of offering relief for patients with severe symptoms. In that case, does it really make sense to hold the candidate medication to the stringent 5 percent adoption criterion? Similarly, let us suppose that there is a candidate medication for patients with a terminal illness. If the evidence suggesting that it works has a 20 percent chance of representing only noise (and hence an 80 percent chance that the observed efficacy is real), does it make sense to withhold it from patients who might benefit from its use?
In fairness, there have been numerous initiatives to broaden conceptualizations of evidence that move away from simply considering the rigor of the research design. The U.S. Preventive Services Task Force (USPSTF), for example, noted recently the need to consider evidence “as a whole, including trade-offs among benefits, harms, and costs and the net benefit relative to other needs for optimal resource allocation.”2 Many other groups have developed guidance based on grading hierarchies and other approaches for classifying and using disparate sources of evidence.3 However, these initiatives generally lack specificity about how to make trade-offs. In particular, evidentiary criteria are not tied formally and quantitatively to the benefits, risks, and costs associated with an intervention and as a result do not maximize health benefits.
Toward A Broader Conceptualization Of EBM
Instead of establishing standards for evidence independently of the circumstances related to a particular intervention, these standards should take into account the potential benefits of the technology and the consequences of adopting it if supporting claims turn out to be false. Likewise, the decision to gather additional evidence should be based on what might be learned from that evidence, and how the new information might improve decision making.
Decision analysis.
Techniques from the field of decision analysis formalize the question of whether (provisionally) to adopt or reject an intervention. Decision analysis identifies the set of consequences of concern to the decisionmaker that might result from each available option (for example, the therapeutic effects and side effects associated with a drug, its direct costs, and its impact on social costs such as productivity losses) and determines their associated probabilities. Aggregating these probability-weighted consequences using an appropriate common metric yields an expected net impact for each option. Commonly used metrics include quality-adjusted life years (QALYs) lost and monetary equivalents.4
Decision analysis identifies as optimal the option that maximizes expected net benefits (for example, in terms of QALYs). Note that decision analysis does not prejudge evidence as acceptable or unacceptable based on its level (such as whether it comes from an RCT or an observational study), although the type of evidence will help determine its influence on an adoption decision. This influence depends on what the evidence reveals in a particular circumstance, not on a prespecified weight assigned to different types of evidence.
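As a stylized sketch of this aggregation, expected net monetary benefit can be computed for each option and the maximizing option selected. All probabilities, QALY gains, costs, and the willingness-to-pay threshold below are hypothetical.

```python
WTP = 50_000  # assumed willingness to pay per QALY, in dollars

def expected_net_benefit(scenarios, wtp=WTP):
    """Probability-weighted net monetary benefit: QALYs gained * WTP - cost."""
    return sum(p * (qalys * wtp - cost) for p, qalys, cost in scenarios)

# Each option maps to scenarios of (probability, QALY gain, cost).
options = {
    "new drug": [(0.7, 0.30, 8_000),   # drug works as hoped
                 (0.3, 0.05, 8_000)],  # drug works poorly
    "status quo": [(1.0, 0.10, 2_000)],
}

best = max(options, key=lambda name: expected_net_benefit(options[name]))
```

With these numbers the new drug narrowly wins ($3,250 versus $3,000 in expected net benefit) even though its efficacy is uncertain; the type of evidence behind each scenario enters only through the probabilities assigned to it.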
Value-of-information analysis.
In addition to deciding whether to adopt or reject an intervention, the decisionmaker must also decide whether the gathering of additional information is warranted. Here, VOI techniques are useful.5 VOI analysis evaluates the extent to which new evidence might improve expected benefits by reducing the chance for error and compares that improvement with the cost of the information. Once again, evidentiary considerations depend on the particular circumstances of a decision (the consequences of an error, what can be learned from additional evidence, and how new knowledge will change and improve the option identified as optimal), not on predetermined standards.
In general, VOI analysis prescribes the gathering of more information when the following conditions hold: (1) The research has the potential of changing what is thought to be the best available alternative intervention; (2) there is likely to be a large advantage of the new optimal intervention compared with the alternative now viewed as optimal; and (3) the cost of gathering new information is not too large compared to its value.6 We discuss each of these criteria in turn.
The potential to identify a new optimal alternative.
Evidence has value only if it affects the decision-making objective—for example, improved health or reduced resource spending. If the decisionmaker is already reasonably sure about which alternative is optimal, then gathering more information has little chance of affecting the ultimate choice and hence little chance of influencing health or resource use. In this case, there is little reason to gather that information.
The possibility of a large advantage with the new alternative.
Even if switching from the currently identified optimal action to an alternative turns out to be a good idea, the gain from switching has to outweigh the cost of gathering the information. If the potential advantage is at best limited, then gathering the information to identify a new optimal action will on net be ill advised. For example, conducting elaborate trials to evaluate a treatment for a rare condition will typically make less sense than conducting research to better understand the performance of treatments for common conditions (of similar severity).
The cost of gathering the information.
Gathering additional information involves resource expenditures, especially if it involves conducting complex studies (such as clinical trials), and it also takes time. In the interim, the decisionmaker must make do with available information. Suboptimal decision making in the interim may subject patients to inappropriate treatment. In some cases, it may make more sense to pursue somewhat less rigorous information (for example, data from uncontrolled studies) if that information can be gathered much more quickly.
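These criteria are combined formally in the expected value of perfect information (EVPI): the expected payoff of deciding after all uncertainty has been resolved, minus the expected payoff of committing now to the option that is best on average. A minimal Monte Carlo sketch follows, with entirely hypothetical numbers.

```python
import random

random.seed(0)
WTP = 50_000  # assumed willingness to pay per QALY, in dollars

def evpi(draws):
    """Per-patient expected value of perfect information: the payoff of
    deciding once uncertainty is resolved, minus the payoff of committing
    now to the option that is best on average."""
    n = len(draws)
    options = draws[0].keys()
    best_now = max(sum(d[o] for d in draws) / n for o in options)
    best_with_perfect_info = sum(max(d.values()) for d in draws) / n
    return best_with_perfect_info - best_now

# Hypothetical net monetary benefits for two options, simulated under
# uncertainty about the new drug's true QALY gain.
draws = []
for _ in range(20_000):
    qaly_gain = random.gauss(0.10, 0.08)  # uncertain effectiveness
    draws.append({"new drug": qaly_gain * WTP - 3_000, "status quo": 0.0})
```

Here the EVPI is on the order of $800 per patient. If that figure, scaled up by the size of the affected population, falls short of the cost of the proposed study, gathering the additional information is not worthwhile.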
The same techniques used to determine whether additional research is worthwhile can also help to determine how research efforts should be prioritized. For example, suppose that it is possible either to study the efficacy of a new treatment or to gather information about the progression of the condition it addresses. It is possible to quantify separately the extent to which research in each of these two areas is likely to improve decision making. In this way, given the decisionmaker’s objectives, the most “important” research can be identified.
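The per-parameter version of this idea (often called partial EVPI) can be sketched with discrete scenarios; every probability and value below is hypothetical. It quantifies separately how much perfect knowledge of each uncertain quantity would improve the decision.

```python
from itertools import product

WTP = 50_000  # assumed willingness to pay per QALY, in dollars

# Hypothetical discrete scenarios: (value, probability)
efficacy = [(0.00, 0.3), (0.15, 0.7)]    # QALY gain of the new treatment
cost     = [(2_000, 0.5), (6_000, 0.5)]  # treatment cost

def nb(e, c):
    """Net monetary benefit of each option for one parameter combination."""
    return {"new drug": e * WTP - c, "status quo": 0.0}

def expected_nb():
    """Expected net benefit of each option over all parameter scenarios."""
    out = {"new drug": 0.0, "status quo": 0.0}
    for (e, pe), (c, pc) in product(efficacy, cost):
        for option, value in nb(e, c).items():
            out[option] += pe * pc * value
    return out

def evppi(param):
    """Value of learning one parameter perfectly before deciding."""
    base = max(expected_nb().values())
    other = cost if param is efficacy else efficacy
    total = 0.0
    for x, px in param:  # condition on each possible value of the parameter
        cond = {"new drug": 0.0, "status quo": 0.0}
        for y, py in other:
            e, c = (x, y) if param is efficacy else (y, x)
            for option, value in nb(e, c).items():
                cond[option] += py * value
        total += px * max(cond.values())
    return total - base
```

With these numbers, resolving the efficacy uncertainty is worth $1,200 per patient, versus $375 for the cost uncertainty, so an efficacy study would be the higher research priority.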
An Example
An example demonstrates the application of decision analysis and VOI analysis to inform reimbursement decisions. In the late 1990s the decision to adopt or reject donepezil, a drug for mild or moderate Alzheimer’s disease, exemplified the dilemma typically facing decisionmakers confronting a new technology.
Data from a twenty-four-week RCT demonstrated that donepezil improved cognitive functioning compared with placebo. But was the drug worthwhile, given its added costs (roughly $1,400 per year)? Limiting consideration to the RCT evidence only, this question could not be answered comprehensively because the trial was limited in its duration and because it did not collect data on health economics, such as the cost of care.
Decision analysis allows a more complete evaluation because it instructs that all available information should be used to inform the decision. In this case, additional information about donepezil’s cost-effectiveness can be gathered from a computer simulation model that makes use of both the twenty-four-week RCT data for this treatment and other information.7 Such information might include a characterization of the disease’s progression both with and without the drug, based on expert opinion and other reports in the literature, disease treatment costs, and information about other parameters.
Results from one such model suggested that if one believed that the drug worked for only twenty-four weeks, then the expected net benefits would be negative.8 However, the model also predicted that if one assumed that the drug’s effect persisted longer than twenty-four weeks (assuming that patients continued on medication), then the expected net benefits could become positive. If one believed that the drug effect persisted longer than two years, the model predicted that it might even save money by offsetting other costs (such as delaying nursing home placement).
Although the analysis by one of us (Claxton) and colleagues suggested that adoption of donepezil may be optimal, it also showed that the benefits are uncertain (in part because of all the assumptions in the computer simulation model), which means that the possibility remains that adopting donepezil would be a mistake. As a result, gathering additional evidence about donepezil could on net be beneficial.
Claxton and colleagues also investigated the value of gathering information about donepezil’s performance after the end of the twenty-four-week period for which direct clinical evidence is available. This analysis showed that as the time horizon for which the performance information is available is extended outward, the value of the information grows. For example, the analysis showed that it is worth investing approximately $200 per patient to determine how well donepezil performs out to one year. It is worth around $350 to determine how well it works out to around four years.9
In this example, decision analysis and VOI analysis allow the decisionmaker to use all of the information available to determine the best way to proceed (provisionally adopt donepezil). However, in doing so, these techniques keep track of the uncertainty underlying the prescription and help the decisionmaker identify what additional evidence should be collected.
Recent U.K. Policy Experience
In many countries, decisions to adopt, reimburse, or issue guidance on health technologies are increasingly based on explicit cost-effectiveness analysis using a decision analytic framework.10 A prime example is the British National Institute for Clinical Excellence (NICE), whose recent guidance on the methods of technology appraisal reflects the importance of representing decision problems explicitly, synthesizing evidence from a range of sources, and facilitating the extrapolation of costs and effects over time and among patient groups and clinical settings.11 The guidance does not prescribe particular methods; rather, it aims to specify what is required to inform decisions consistent with the objectives of a health care system (maximizing health gain), given the resources available. The guidance stipulates that “all relevant comparators for the technology being appraised need to be included in the analysis [and] all relevant evidence needs to be assembled systematically and synthesised in a transparent and reproducible manner.”12 Requirements specified by the guidance include the development of cost-effectiveness estimates for technologies, calculation of uncertainty for these estimates, the use of decision analysis to identify optimal technologies, and the synthesis of evidence from various sources.
Although explicit valuation of additional evidence through VOI analysis is not required, it is recommended, to inform the research recommendations. The guidance states: “Candidate topics for future research can be identified on the basis of evidence gaps identified by the systematic review and cost-effectiveness analysis. These may be best prioritized by considering the value of additional information in reducing the degree of decision uncertainty.”13
A recent pilot study commissioned by NICE included a VOI analysis to support the research recommendations made as part of NICE guidance to the National Health Service (NHS) in England and Wales and to inform the deliberations of the NICE Research and Development Committee.14 The pilot study consisted of six case studies, based on reanalysis of recent technology assessment reports, designed to establish the feasibility and requirements of using VOI analysis and to explore possible implementation of this framework within NICE processes.
In addition, the U.K. National Coordinating Centre for Health Technology Assessment (NCCHTA), which commissions research for the NHS, conducted a series of case studies to establish whether these methods might contribute to the process of achieving the greatest return in outcomes such as health improvements from the resources available for health technology assessment. Claxton and colleagues describe three of these: use of screening in age-related macular degeneration; use of alternative manual physiotherapy techniques in asthma and in chronic obstructive pulmonary disease; and use of alternative long-term, low-dose antibiotics in children with recurrent urinary tract infections.15
These examples were selected for study because conventional evaluation of the available evidence did not yield sufficient information to definitively recommend (or reject) the interventions being considered. The decision-analytic framework used in the case studies demonstrated how evidence from a variety of sources could be used, including RCT studies, observational studies, pooled trial data, registry studies, and clinical judgment.
Overall, this pilot study demonstrated that the framework of decision analysis and VOI analysis can be applied to policy-relevant decision making in a timely way to inform the research prioritization and commissioning process. It also showed that the amount and type of evidence needed to inform decisions about health technologies is essentially an empirical issue: Different amounts and types of evidence are needed for different technologies, applied to different patient populations in different circumstances.
Critical Challenges
The experience of applying this explicit analytic framework to policy-relevant decision making in a timely way has, of course, highlighted a number of challenges. However, the critical challenges for the types of analyses noted above are not the analytic and VOI methods themselves but, rather, issues related to structuring decision problems, synthesizing evidence from a variety of sources, and characterizing uncertainty.
More specifically, these challenges include (1) ensuring a sufficiently wide scope for the assessment of a technology so that it includes all of the relevant alternative comparators and competing clinical strategies; (2) exploring and accounting for the additional uncertainty surrounding alternative but credible structural assumptions when extrapolating from limited evidence about clinical effects and natural history; (3) developing synthesis methods that can use both direct and indirect evidence while reflecting the potential bias and generalizability of evidence from different sources; (4) developing and applying evidence synthesis methods that can facilitate comparisons not directly made in the clinical trial evidence; (5) establishing the appropriate role and methods for the elicitation of judgments from “experts” when no evidence is available to inform particular estimates; and (6) developing efficient computational methods for VOI calculations.
The challenges detailed above are not, with a few exceptions, specific to VOI analysis or indeed to decision modeling generally.16 The issues surrounding the synthesis and interpretation of evidence, potential bias, and so forth have always been present in any informal and partial review of evidence. In fact, until quite recently, these challenging issues could be conveniently ignored by policymakers, clinicians, and analysts while decision making was opaque and based on implicit criteria and unspecified “weighing” of the evidence. We must confront these challenges as we move to explicit and transparent approaches. Indeed, one of the many advantages of adopting such approaches is that they expose important, and previously underdeveloped, methodological issues.
Implications For U.S. Research And Policy
The methods described above have important implications for U.S. research and policy, including those related to Section 1013 of the Medicare Prescription Drug, Improvement, and Modernization Act (MMA) of 2003, which calls for research on outcomes, comparative clinical effectiveness, and appropriateness of health care. Section 1013 offers a potentially important vehicle for considering and prioritizing future clinical research and, indeed, an opportunity to advance methods for synthesizing evidence beyond that based on RCTs.
The language of Section 1013 raises the question of how to address gaps in the research base and, in turn, how to inform decision making. In the United States there has been relatively little effort to prioritize research to understand whether we are asking the right questions and whether, broadly speaking, society is marshaling resources for gathering evidence efficiently. Recent discussions about gaps in the evidence base have tended to address the need to fund “pragmatic” clinical trials. Such calls are important in highlighting the fact that despite major increases in clinical research funding at the National Institutes of Health (NIH), the current U.S. clinical research enterprise may not be producing an adequate supply of information. As Sean Tunis and colleagues have noted, there is relatively little information produced on long-term effectiveness and health outcomes and few head-to-head comparisons with older, less costly agents.17
However, these calls alone are insufficient. As steps are taken to improve the infrastructure for pragmatic trials, we must also improve the infrastructure for decision analysis and VOI analysis at the institutional level, with funding from the NIH, the Agency for Healthcare Research and Quality (AHRQ), and the Centers for Medicare and Medicaid Services (CMS). Making use of all available information will improve decisions to adopt or reject interventions and will help determine whether it is worthwhile to gather additional information.
Footnotes
Karl Claxton is senior lecturer in the Department of Economics and Related Studies at the University of York (United Kingdom). Joshua Cohen is a senior research associate at the Harvard Center for Risk Analysis in Boston, Massachusetts. Peter Neumann (pneumann@hsph.harvard.edu) is an associate professor of policy and decision sciences at the Harvard School of Public Health, also in Boston.

