Chapter 1 Principles of experimental design
Although it is obviously true that statistical tests are not the only method for arriving at the ‘truth’, it is equally true that formal experiments generally provide the most scientifically valid research result. (Bailar III 1981)
1.1 Introduction
The validity of conclusions drawn from a statistical analysis crucially hinges on the manner in which the data are acquired, and even the most sophisticated analysis will not rescue a flawed experiment. Planning an experiment and thinking about the details of data acquisition is so important for a successful analysis that R. A. Fisher—who single-handedly invented many of the experimental design techniques we are about to discuss—famously wrote
To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. (Fisher 1938)
(Statistical) design of experiments provides the principles and methods for planning experiments and tailoring the data acquisition to an intended analysis. Design and analysis of an experiment are best considered as two aspects of the same enterprise: the goals of the analysis strongly inform an appropriate design, and the implemented design determines the possible analyses.
The primary aim of designing experiments is to ensure that valid statistical and scientific conclusions can be drawn that withstand the scrutiny of a determined skeptic. Good experimental design also considers that resources are used efficiently, and that estimates are sufficiently precise and hypothesis tests adequately powered. It protects our conclusions by excluding alternative interpretations or rendering them implausible. Three main pillars of experimental design are randomization, replication, and blocking, and we will invest substantial effort into fleshing out their effects on the subsequent analysis as well as their implementation in an experimental design.
An experimental design is always tailored towards predefined (primary) analyses and an efficient analysis and unambiguous interpretation of the experimental data is often straightforward from a good design. This does not prevent us from doing additional analyses of interesting observations after the data are acquired, but these analyses can be subjected to more severe criticisms and conclusions are more tentative.
In this chapter, we provide the wider context for using experiments in a larger research enterprise and informally introduce the main statistical ideas of experimental design. We use a comparison of two samples as our main example to study how design choices affect their comparison, but postpone a formal quantitative analysis to the next chapters.
1.2 A cautionary tale
Table 1.1: Measured enzyme levels of 20 mice; samples of 10 mice were prepared with kit A and the remaining 10 samples with kit B.

A | 8.96 | 8.95 | 11.37 | 12.63 | 11.38 | 8.36 | 6.87 | 12.35 | 10.32 | 11.99 |
B | 12.68 | 11.37 | 12.00 | 9.81 | 10.35 | 11.76 | 9.01 | 10.83 | 8.76 | 9.99 |
To illustrate some of the issues arising in the interplay of experimental design and analysis, we consider a simple example. We are interested in comparing the enzyme levels measured in processed blood samples from laboratory mice, when the preparation is done either with a kit from vendor A or with a kit from competitor B. The data in Table 1.1 show measured enzyme levels of 20 mice, with samples of 10 mice prepared with kit A and the remaining 10 samples with kit B.
One option for comparing the two kits is by looking at the difference in average enzyme levels, and we find an average level of 10.32 for vendor A and 10.66 for vendor B. We would like to interpret their difference of −0.34 as the difference due to the two preparation kits and conclude whether the two kits give equal results, or if measurements based on one kit are systematically different from those based on the other kit.
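As a small aside, the two averages and their difference can be verified directly from the values in Table 1.1; the following minimal Python sketch (not part of the original analysis) reproduces the numbers quoted above.

```python
# Enzyme levels from Table 1.1: 10 samples per preparation kit.
kit_a = [8.96, 8.95, 11.37, 12.63, 11.38, 8.36, 6.87, 12.35, 10.32, 11.99]
kit_b = [12.68, 11.37, 12.00, 9.81, 10.35, 11.76, 9.01, 10.83, 8.76, 9.99]

mean_a = sum(kit_a) / len(kit_a)   # about 10.32
mean_b = sum(kit_b) / len(kit_b)   # about 10.66
print(f"A: {mean_a:.2f}  B: {mean_b:.2f}  difference: {mean_a - mean_b:.2f}")
```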
Such an interpretation, however, is only valid if the two groups of mice and their measurements are identical in all aspects except the sample preparation kit. If we use one strain of mice for kit A and another strain for kit B, any difference might also be attributed to inherent differences between the strains. Similarly, if the measurements using kit B were conducted much later than those using kit A, any observed difference might be attributed to changes in, e.g., the mice selected, batches of chemicals used, device calibration, or any number of other influences. None of these competing explanations for an observed difference can be excluded from the given data alone, but good experimental design allows us to render them (almost) arbitrarily implausible.
A second aspect of our analysis is the inherent uncertainty in our calculated difference: if we repeat the experiment, the observed difference will change each time, and this variation is more pronounced for smaller numbers of mice, among other factors. If we do not use a sufficient number of mice in our experiment, the uncertainty associated with the observed difference might be too large, such that random fluctuations become a plausible explanation for the observed difference. Systematic differences between the two kits, of practically relevant magnitude in either direction, might then be compatible with the data, and we cannot draw any reliable conclusions from our experiment.
In each case, the statistical analysis—no matter how clever—was doomed before the experiment was even started, while simple ideas from statistical design of experiments would have prevented failure and provided correct and robust results with interpretable conclusions.
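To make the second concern (the sampling uncertainty of the observed difference) concrete, a rough sketch of a two-sample comparison of the Table 1.1 values is shown below; it anticipates the formal analysis postponed to the next chapters and uses a Welch t-test purely for illustration.

```python
import math
from scipy import stats

kit_a = [8.96, 8.95, 11.37, 12.63, 11.38, 8.36, 6.87, 12.35, 10.32, 11.99]
kit_b = [12.68, 11.37, 12.00, 9.81, 10.35, 11.76, 9.01, 10.83, 8.76, 9.99]

def mean_and_var(x):
    """Sample mean and (unbiased) sample variance."""
    m = sum(x) / len(x)
    v = sum((xi - m) ** 2 for xi in x) / (len(x) - 1)
    return m, v

(m_a, v_a), (m_b, v_b) = mean_and_var(kit_a), mean_and_var(kit_b)
diff = m_a - m_b
se = math.sqrt(v_a / len(kit_a) + v_b / len(kit_b))  # standard error of the difference
print(f"difference: {diff:.2f}, standard error: {se:.2f}")

# Welch two-sample t-test: is pure chance a plausible explanation?
t_stat, p_value = stats.ttest_ind(kit_a, kit_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")
```

With these data, the standard error of the difference turns out to be roughly twice as large as the difference itself, so random fluctuation alone remains a plausible explanation for the observed value of −0.34.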
1.3 The language of experimental design
By an experiment, we understand an investigation where the researcher has full control over selecting and altering the experimental conditions of interest, and we only consider investigations of this type. The selected experimental conditions are called treatments. An experiment is comparative if the responses to several treatments are to be compared or contrasted. The experimental units are the smallest subdivision of the experimental material to which a treatment can be assigned. All experimental units given the same treatment constitute a treatment group. Especially in biology, we often contrast responses to a control group to which some standard experimental conditions are applied; a typical example is using a placebo for the control group, and different drugs in the other treatment groups.
Multiple experimental units are sometimes combined into groupings or blocks; for example, mice are naturally grouped by litter, and samples by batches of chemicals used for their preparation. The values observed are called responses and are measured on the response units; these are often identical to the experimental units but need not be. More generally, we call any grouping of the experimental material a unit.
In our example, we selected the mice, used a single sample per mouse, deliberately chose the two specific vendors, and had full control over assigning a kit to a mouse. Here, the mice are the experimental units, the samples the response units, the two kits are the treatments, and the responses are the measured enzyme levels. Since we compare the average enzyme levels between treatments and choose which kit to assign to which sample, this is a comparative experiment.
In this example, we can identify the experimental units with the response units, because we have a single response per mouse and cannot distinguish a sample from its mouse in the analysis. By contrast, if we take two samples per mouse and use the same kit for both samples, then the mice are still the experimental units, but each mouse now has two response units associated with it. If we take two samples per mouse, but apply each kit to one of the two samples, then the samples are both the experimental and response units, while the mice are blocks that group the samples. If we only use one kit and determine the average enzyme level, then this investigation is still an experiment, but it is not comparative.
Finally, the design of an experiment determines the logical structure of the experiment; it consists of (i) a set of treatments; (ii) a specification of the experimental units (animals, cell lines, samples); (iii) a procedure for assigning treatments to units; and (iv) a specification of the response units and the quantity to be measured as a response.
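Purely as an illustration of this logical structure, the kit-comparison experiment could be written down as a small data structure; the field names below are an ad-hoc choice for this sketch and not part of any standard.

```python
# Hypothetical description of the design of the kit-comparison experiment,
# mirroring components (i)-(iv) above.
design = {
    "treatments": ["kit A", "kit B"],                            # (i)
    "experimental_units": [f"mouse {i}" for i in range(1, 21)],  # (ii)
    "assignment": "complete randomization, 10 mice per kit",     # (iii)
    "response": {                                                # (iv)
        "response_units": "one blood sample per mouse",
        "measurand": "enzyme level",
    },
}
```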
1.4 Experiment validity
Before we embark on the more technical aspects of experimental design, we discuss three components for evaluating an experiment’s validity: construct validity, internal validity, and external validity. These criteria are well-established in, e.g., educational and psychological research, but have more recently been proposed for animal research (Würbel 2017), where experiments are increasingly scrutinized for their scientific rationale and their design and intended analyses.
1.4.1 Construct validity
Construct validity concerns the choice of the experimental system for answering our research question. Is the system even capable of providing a relevant answer to the question?
Studying the mechanisms of a particular disease, for example, might require careful choice of an appropriate animal model that shows a disease phenotype and is amenable to experimental interventions. If the animal model is a proxy for drug development for humans, biological mechanisms must be sufficiently similar between animal and human physiologies.
Another important aspect of the construct is the quantity that we intend to measure (the measurand), and its relation to the quantity or property we are interested in. For example, we might measure the concentration of the same chemical compound once in a blood sample and once in a highly purified sample, and these constitute two different measurands, whose values might not be comparable. Often, the quantity of interest (e.g., liver function) is not directly measurable (or even quantifiable) and we measure a biomarker instead. For example, pre-clinical and clinical investigations may use concentrations of proteins or counts of specific cell types from blood samples, such as the CD4+ cell count used as a biomarker for immune system function. The problem of measurements and measurands is further discussed for statistics in (Hand 1996) and specifically for biological experiments in (Coxon, Longstaff, and Burns 2019).
1.4.2 Internal validity
The internal validity of an experiment concerns the soundness of the scientific rationale, statistical properties such as precision of estimates, and the measures taken against risk of bias. It refers to the validity of claims within the context of the experiment. Statistical design of experiments plays a prominent role in ensuring internal validity, and we briefly discuss the main ideas here before providing the technical details and an application to our example in the subsequent sections.
Scientific rationale and research question
The scientific rationale of a study is (usually) not immediately a statistical question. Translating a scientific question into a quantitative comparison amenable to statistical analysis is no small task and often requires substantial thought. It is a substantial, if non-statistical, benefit of using experimental design that we are forced to formulate a precise-enough research question and decide on the main analyses required for answering it before we conduct the experiment. For example, the question “Is there a difference between placebo and drug?” is insufficiently precise for planning a statistical analysis and determining an adequate experimental design. What exactly is the drug treatment? What concentration and how is it administered? How do we make sure that the placebo group is comparable to the drug group in all other aspects? What do we measure and what do we mean by “difference”? A shift in average response, a fold-change, a change in response before and after treatment?
There are almost never enough resources to answer all conceivable scientific questions in a statistical analysis. We therefore select a few primary outcome variables whose analysis answers the most important questions and design the experiment to ensure these variables can be estimated and tested appropriately. Other, secondary, outcome variables can still be measured and analyzed, but we are not willing to increase the size of the experiment to ensure that reliable conclusions can also be drawn for these variables.
The scientific rationale also enters the choice of a potential control group to which we compare responses. The quote
The deep, fundamental question in statistical analysis is ‘Compared to what?’ (Tufte 1997)
from Edward Tufte highlights the importance of this choice also for the statistical analyses of an experiment’s results.
Risk of bias
Experimental bias is a systematic difference in response between experimental units in addition to the difference caused by the treatments. The experimental units in the different groups are then not equal in all aspects except the treatment applied to them, and we saw several examples in Section 1.2.
Minimizing the risk of bias is crucial for internal validity. Experimental design offers several methods for this: randomization, the random assignment of treatments to units, which randomly distributes other differences over the treatment groups; blinding, the concealment of treatment assignments from the researcher and potential experimental subjects to prevent conscious or unconscious biased assignments (e.g., by treating more agile mice with our favourite drug and more docile ones with the competitor’s); sampling, the random selection of units for inclusion in the experiment; and predefining an analysis plan that details the intended analyses, including, for example, how to deal with missing data, to counteract criticisms of performing many comparisons and only reporting those with the desired outcome.
Precision and effect size
Another aspect of internal validity is the precision of estimates and the expected effect sizes. Is the experimental setup, in principle, able to detect a difference of relevant magnitude? Experimental design offers several methods for answering this question based on the expected heterogeneity of samples, the measurement error, and other sources of variation: power analysis is a technique for determining the number of samples required to reliably detect a relevant effect size and to provide estimates of sufficient precision. More samples yield more precision and more power, but we have to be careful that replication is done at the right level: simply measuring a biological sample multiple times yields more measured values, but constitutes pseudo-replication for the analysis. Replication should also ensure that the statistical uncertainties of estimates can be gauged from the data of the experiment itself, without requiring additional untestable assumptions. Finally, the technique of blocking can remove a substantial proportion of the variation and thereby increase power and precision if we find a way to apply it.
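As a sketch of what such a power analysis can look like in practice, the following example uses the statsmodels package for a two-group comparison; the assumed difference and standard deviation are invented numbers for illustration, not estimates from the kit example.

```python
from statsmodels.stats.power import TTestIndPower

# Invented planning assumptions: detect a difference of 1.0 enzyme units,
# expected standard deviation of 2.0, 80% power, 5% significance level.
effect_size = 1.0 / 2.0  # standardized effect size (Cohen's d)

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"required sample size per group: {n_per_group:.1f}")  # roughly 64
```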
1.4.3 External validity
The external validity of an experiment concerns its replicability and the generalizability of inferences. An experiment is replicable if its results can be confirmed by an independent new experiment, preferably by a different lab and researcher. Experimental conditions in the replicate experiment usually differ from the original experiment, which provides evidence that the observed effects are robust to such changes. A much weaker condition on an experiment is reproducibility, the property that an independent researcher draws equivalent conclusions based on the data from this particular experiment, using the same analysis techniques. Reproducibility requires publishing the raw data, details on the experimental protocol, and a detailed description of the statistical analyses, preferably with accompanying source code.
Reporting the results of an experiment so that others can reproduce and replicate them is no simple task, and requires sufficient information about the experiment and its analysis. Many scientific journals subscribe to reporting guidelines that are also helpful for planning an experiment. Two such guidelines are the ARRIVE guidelines for animal research (Kilkenny et al. 2010) and the CONSORT guidelines for clinical trials (Moher et al. 2010). Guidelines describing the minimal information required for reproducing experimental results have been developed for many types of experimental techniques, including microarrays (MIAME), RNA sequencing (MINSEQE), metabolomics (MSI) and proteomics (MIAPE) experiments, and the FAIRsharing initiative provides a more comprehensive collection (Sansone et al. 2019).
A main threat to replicability and generalizability is experimental conditions that are too tightly controlled, so that inferences only hold for a specific lab under the very specific conditions of the original experiment. Introducing systematic heterogeneity and using multi-center studies effectively broadens the experimental conditions and therefore the inferences for which internal validity is available.
For systematic heterogeneity, experimental conditions other than the treatments are systematically altered and treatment differences are estimated for each condition. For example, we might split the experimental material into several batches and use a different day of analysis, sample preparation, batch of buffer, measurement device, and lab technician for each of the batches. A more general inference is then possible if the effect size, effect direction, and precision are comparable between the batches, indicating that the treatment differences are stable over the different conditions.
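A minimal sketch of how such per-batch treatment differences might be summarized is given below; the data frame and its values are entirely hypothetical and only illustrate the bookkeeping.

```python
import pandas as pd

# Hypothetical long-format data: one row per measured sample.
data = pd.DataFrame({
    "batch":    ["day 1"] * 4 + ["day 2"] * 4,
    "kit":      ["A", "A", "B", "B"] * 2,
    "response": [10.1, 9.8, 10.5, 10.9, 11.2, 10.7, 11.0, 11.6],
})

# Treatment difference (A minus B) estimated separately within each batch;
# similar differences across batches support a more general inference.
per_batch = data.groupby("batch").apply(
    lambda d: d.loc[d["kit"] == "A", "response"].mean()
              - d.loc[d["kit"] == "B", "response"].mean()
)
print(per_batch)
```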
In multi-center experiments, the same experiment is conducted in several different labs and the results are compared and merged. Using a second laboratory already increases the replicability of animal studies substantially (Karp 2018), and differences between labs can be used for standardizing the treatment effects (Kafkafi et al. 2017). Multi-center approaches are very common in clinical trials and are often necessary to reach the required number of patient enrollments.
Generalizability of randomized controlled trials in medicine and animal studies often suffers from overly restrictive eligibility criteria. In clinical trials, patients are often included or excluded based on co-medications and co-morbidities, and the resulting sample of eligible patients might no longer be representative of the patient population. For example, Travers et al. (2007) used the eligibility criteria of 17 randomized controlled trials of asthma treatments and found that out of 749 patients, only a median of 6% (45 patients) would be eligible for an asthma-related randomized controlled trial. This puts a question mark on the relevance of the trials’ findings for asthma patients in general.
1.5 Reducing the risk of bias
1.5.1 Randomization of treatment allocation
If systematic differences other than the treatment exist between our treatment groups, then the effect of the treatment is confounded with these other differences and our estimates of treatment effects might be biased.
We remove such unwanted systematic differences from our treatment comparisons by randomizing the allocation of treatments to experimental units. In a completely randomized design, each experimental unit has the same chance of being subjected to any of the treatments, and any differences between the experimental units other than the treatments are randomly distributed over the treatment groups. Importantly, randomization is the only method that also protects our experiment against unknown sources of bias: we do not need to know all or even any of the potential differences, and yet their impact is eliminated from the treatment comparisons by random treatment allocation.
Randomization has two effects: (i) differences unrelated to treatment become part of the residual variance rendering the treatment groups more similar; and (ii) the systematic differences are thereby eliminated as sources of bias from the treatment comparison. In short,
Randomization transforms systematic variation into random variation.
In our example, a proper randomization would select 10 out of our 20 mice fully at random, such that each subset of 10 mice is equally likely to be chosen and each mouse has the same chance of ending up in either group. These ten mice are then assigned to kit A, and the remaining mice to kit B. This allocation is entirely independent of the treatments and of any properties of the mice.
To ensure completely random treatment allocation, some kind of random process needs to be employed. This can be as simple as shuffling a pack of 10 red and 10 black cards or we might use a software-based random number generator. Randomization is slightly more difficult if the number of experimental units is not known at the start of the experiment, such as when patients are recruited for an ongoing clinical trial (sometimes called rolling recruitment), and we want to have reasonable balance between the treatment groups at each stage of the trial.
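For illustration, a software-based complete randomization of the 20 mice could look like the following sketch; the seed is arbitrary and only makes the allocation reproducible.

```python
import random

random.seed(2718)             # arbitrary seed, for reproducibility only
mice = list(range(1, 21))     # mouse identifiers 1..20
random.shuffle(mice)          # random permutation of the mice

# First 10 shuffled mice receive kit A, the remaining 10 receive kit B.
allocation = {"kit A": sorted(mice[:10]), "kit B": sorted(mice[10:])}
print(allocation)
```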
Seemingly random assignments “by hand” are usually no less complicated than fully random assignments, but are always inferior. If surprising results ensue from the experiment, such assignments are subject to unanswerable criticism and suspicion of unwanted bias. Even worse are systematic allocations; they can only remove bias from known causes, and immediately raise red flags under the slightest scrutiny.
The problem of undesired assignments
Even with a fully random treatment allocation procedure, we might end up with an undesirable allocation. For our example, the treatment group of kit A might—just by chance—contain mice that are bigger or more active than those in the other treatment group. Statistical orthodoxy and some authors recommend using the design nevertheless, because only full randomization guarantees valid estimates of residual variance and unbiased estimates of effects. This argument, however, concerns the long-run properties of the procedure and seems of little help in this specific situation. Why should we care if the randomization yields correct estimates under replication of the experiment, if the particular experiment is jeopardized?
Another solution is to create a list of all possible allocations that we would accept and randomly choose one of these allocations for our experiment. The analysis should then reflect this restriction in the possible randomizations, which often renders this approach difficult to implement.
The most pragmatic method is to reject undesirable designs and compute a new randomization (Cox 1958). Undesirable allocations are unlikely to arise for large sample sizes, and we might accept a small bias in estimation for small sample sizes, when uncertainty in the estimated treatment effect is already high. In this approach, whenever we reject a particular outcome, we must also be willing to reject the outcome if we permute the treatment level labels. If we reject eight big and two small mice for kit A, then we must also reject two big and eight small mice. We must also be transparent and report a rejected allocation, so that a critic may weigh the risk of bias due to rejection against the risk of bias due to the rejected allocation.
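One way to implement this pragmatic re-randomization is sketched below, using hypothetical body weights as the balance criterion; the acceptance threshold of 1 gram is an arbitrary choice for illustration.

```python
import random

random.seed(1)

# Hypothetical body weights (grams) of the 20 mice, indexed 0..19.
weights = [22.1, 25.3, 23.8, 27.0, 21.5, 26.2, 24.4, 22.9, 25.8, 23.1,
           24.9, 22.4, 26.7, 23.5, 25.1, 21.9, 24.2, 26.0, 22.6, 24.7]

def randomize():
    """Completely random split of the 20 mice into two groups of 10."""
    mice = list(range(20))
    random.shuffle(mice)
    return mice[:10], mice[10:]

def weight_imbalance(group_a, group_b):
    mean = lambda g: sum(weights[i] for i in g) / len(g)
    return abs(mean(group_a) - mean(group_b))

# Re-randomize until the groups are acceptably balanced; a rejected
# allocation would also be rejected with the kit labels swapped, and any
# rejections should be reported.
group_a, group_b = randomize()
while weight_imbalance(group_a, group_b) > 1.0:
    group_a, group_b = randomize()

print("kit A:", sorted(group_a))
print("kit B:", sorted(group_b))
```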
1.5.2 Blinding
Bias in treatment comparisons is also introduced if treatment allocation is random, but responses cannot be measured entirely objectively, or if knowledge of the assigned treatment might affect the response. In clinical trials, for example, patients might (objectively) react differently when they know they are on a placebo treatment, an effect known as cognitive bias. In animal experiments, caretakers might report more abnormal behavior for animals on a more severe treatment. Cognitive bias can be eliminated by concealing the treatment allocation from participants of a clinical trial or from technicians, a technique called single-blinding.
If response measures are partially based on professional judgement (e.g., a pain score), patients or physicians might unconsciously report lower scores for a placebo treatment, a phenomenon known as observer bias. Its removal requires double blinding, where treatment allocations are additionally concealed from the experimentalist.
Blinding requires randomized treatment allocation to begin with and substantial effort might be needed to implement it. Drug companies, for example, have to go to great lengths to ensure that a placebo looks, tastes, and feels similar enough to the actual drug so that patients cannot unblind their treatment. Additionally, blinding is often done by coding the treatment conditions and samples, and statements about effect sizes and statistical significance are made before the code is revealed.
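A simple way to implement such coding is sketched below: the kit names are replaced by neutral codes, and the key linking codes to kits is kept sealed until effect sizes and significance statements are fixed. The labels, seed, and allocation are hypothetical choices for illustration.

```python
import random

random.seed(99)  # arbitrary seed

# Randomly decide which neutral code stands for which kit.
code_labels = ["T1", "T2"]
random.shuffle(code_labels)
blinding_key = dict(zip(["kit A", "kit B"], code_labels))  # kept sealed

# A (previously randomized) allocation of mice to kits, here invented.
allocation = {1: "kit A", 2: "kit B", 3: "kit B", 4: "kit A"}

# The analyst only ever sees the coded labels, never the kit names.
blinded_allocation = {mouse: blinding_key[kit] for mouse, kit in allocation.items()}
print(blinded_allocation)
```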
In clinical trials, double-blinding creates a conflict of interest. The attending doctors do not know which patient received which treatment, and thus an accumulation of side-effects cannot be linked to any treatment. For this reason, clinical trials always have a data monitoring committee composed of doctors, pharmacologists, and statisticians. At predefined intervals, the data from the trial are used for an interim analysis of efficacy and safety by members of the committee. If severe problems are detected, the committee might recommend altering or aborting the trial. The same might happen if one treatment already shows overwhelming evidence of superiority, such that it becomes unethical to withhold this better treatment from the other treatment groups.
1.5.3 Analysis plan and registration
An often overlooked but nevertheless severe source of bias is what has been termed ‘researcher degrees of freedom’ or ‘a garden of forking paths’ in the data analysis. For any set of data, there are many different options for its analysis: some results might be considered outliers and discarded, assumptions are made on error distributions and appropriate test statistics, different covariates might be included in a regression model. Often, multiple hypotheses are investigated and tested, and analyses are done separately on various (overlapping) subgroups. Hypotheses formed after looking at the data require additional care in their interpretation; almost never will \(p\)-values for these ad hoc or post hoc hypotheses be statistically justifiable. Only reporting those sub-analyses that gave ‘interesting’ findings invariably leads to biased conclusions and is called cherry-picking or \(p\)-hacking (or much less flattering names). Many different measured response variables invite fishing expeditions, where patterns in the data are sought without an underlying hypothesis.
The interpretation of a statistical analysis is always part of a larger scientific argument, and we should consider the necessary computations in relation to building that argument about the interpretation of the data. In addition to the statistical calculations, this interpretation requires substantial subject-matter knowledge and includes (many) non-statistical arguments. Two quotes highlight that experiment and analysis are a means to an end and not an end in themselves.
There is a boundary in data interpretation beyond which formulas and quantitative decision procedures do not go, where judgment and style enter. (Abelson 1995)
Often, perfectly reasonable people come to perfectly reasonable decisions or conclusions based on nonstatistical evidence. Statistical analysis is a tool with which we support reasoning. It is not a goal in itself. (Bailar III 1981)
The deliberate use of statistical analyses and their interpretation for supporting a larger argument was called statistics as principled argument (Abelson 1995). Employing useless statistical analysis without reference to the actual scientific question is surrogate science (Gigerenzer and Marewski 2014), and adaptive thinking is integral to meaningful statistical analysis (Gigerenzer 2002).
There is often a grey area between exploiting researcher degrees of freedom to arrive at a desired conclusion, and creative yet informed analyses of data. One way to navigate this area is to distinguish between exploratory studies and confirmatory studies. The former have no clearly stated scientific question, but are used to generate interesting hypotheses by identifying potential associations or effects that are then further investigated. Conclusions from these studies are very tentative and must be reported honestly. In contrast, standards are much higher for confirmatory studies, which investigate a clearly defined scientific question. Here, analysis plans and pre-registration of an experiment are now the accepted means for demonstrating lack of bias due to researcher degrees of freedom.
Analysis plans
The analysis plan is written before conducting the experiment and details the measurands and estimands, the hypotheses to be tested together with a power and sample size calculation, a discussion of relevant effect sizes, detection and handling of outliers and missing data, as well as steps for data normalization such as transformations and baseline corrections. If a regression model is required, its factors and covariates are outlined. Particularly in biology, measurements below the limit of quantification require special attention in the analysis plan.
In the context of clinical trials, the problem of estimands has become a recent focus of attention. The estimand is the target of a statistical estimation procedure, for example the true average difference in enzyme levels between the two preparation kits. A main problem in many studies is post-randomization events that can change the estimand, even if the estimation procedure remains the same. For example, if kit B fails to produce usable samples for measurement in five out of ten cases because the enzyme level was too low, while kit A could handle these enzyme levels perfectly fine, then this might severely exaggerate the observed difference between the two kits. Similar problems arise in drug trials, when some patients stop taking one of the drugs due to side-effects or other complications, and data are then available only for those patients without side-effects.
Pre-registration
Pre-registration of experiments is an even more stringent measure used in conjunction with an analysis plan and is becoming standard in clinical trials. Here, information about the trial, including the analysis plan, the procedure to recruit patients, and stopping criteria, is registered at a dedicated website, such as ClinicalTrials.gov or AllTrials.net, and stored in a database. Publications based on the trial then refer to this registration, such that reviewers and readers can compare what the researchers intended to do and what they actually did. A similar portal for pre-clinical and translational research is PreClinicalTrials.eu.
References
Abelson, R P. 1995. Statistics as Principled Argument. Lawrence Erlbaum Associates Inc.
Bailar III, J. C. 1981. “Bailar’s laws of data analysis.” Clinical Pharmacology & Therapeutics 20 (1): 113–19.
Cox, D R. 1958. Planning of Experiments. Wiley-Blackwell.
Coxon, Carmen H., Colin Longstaff, and Chris Burns. 2019. “Applying the science of measurement to biology: Why bother?” PLOS Biology 17 (6): e3000338. https://doi.org/10.1371/journal.pbio.3000338.
Fisher, R. 1938. “Presidential Address to the First Indian Statistical Congress.” Sankhya: The Indian Journal of Statistics 4: 14–17.
Gigerenzer, G. 2002. Adaptive Thinking: Rationality in the Real World. Oxford Univ Press. https://doi.org/10.1093/acprof:oso/9780195153729.003.0013.
Gigerenzer, G, and J N Marewski. 2014. “Surrogate Science: The Idol of a Universal Method for Scientific Inference.” Journal of Management 41 (2). SAGE Publications: 421–40. https://doi.org/10.1177/0149206314547522.
Hand, D J. 1996. “Statistics and the theory of measurement.” Journal of the Royal Statistical Society A 159 (3): 445–92. http://www.jstor.org/stable/2983326.
Kafkafi, Neri, Ilan Golani, Iman Jaljuli, Hugh Morgan, Tal Sarig, Hanno Würbel, Shay Yaacoby, and Yoav Benjamini. 2017. “Addressing reproducibility in single-laboratory phenotyping experiments.” Nature Methods 14 (5): 462–64. https://doi.org/10.1038/nmeth.4259.
Karp, Natasha A. 2018. “Reproducible preclinical research—Is embracing variability the answer?” PLOS Biology 16 (3): e2005413. https://doi.org/10.1371/journal.pbio.2005413.
Kilkenny, Carol, William J Browne, Innes C Cuthill, Michael Emerson, and Douglas G Altman. 2010. “Improving Bioscience Research Reporting: The ARRIVE Guidelines for Reporting Animal Research.” PLoS Biology 8 (6): e1000412. https://doi.org/10.1371/journal.pbio.1000412.
Moher, David, Sally Hopewell, Kenneth F Schulz, Victor Montori, Peter C Gøtzsche, P J Devereaux, Diana Elbourne, Matthias Egger, and Douglas G Altman. 2010. “CONSORT 2010 Explanation and Elaboration: updated guidelines for reporting parallel group randomised trials.” BMJ 340. BMJ Publishing Group Ltd. https://doi.org/10.1136/bmj.c869.
Sansone, Susanna-Assunta, Peter McQuilton, Philippe Rocca-Serra, Alejandra Gonzalez-Beltran, Massimiliano Izzo, Allyson L. Lister, and Milo Thurston. 2019. “FAIRsharing as a community approach to standards, repositories and policies.” Nature Biotechnology 37 (4): 358–67. https://doi.org/10.1038/s41587-019-0080-8.
Travers, Justin, Suzanne Marsh, Mathew Williams, Mark Weatherall, Brent Caldwell, Philippa Shirtcliffe, Sarah Aldington, and Richard Beasley. 2007. “External validity of randomised controlled trials in asthma: To whom do the results of the trials apply?” Thorax 62 (3): 219–33. https://doi.org/10.1136/thx.2006.066837.
Tufte, E. 1997. Visual Explanations: Images and Quantities, Evidence and Narrative. 1st ed. Graphics Press.
Würbel, Hanno. 2017. “More than 3Rs: The importance of scientific validity for harm-benefit analysis of animal research.” Lab Animal 46 (4). Nature Publishing Group: 164–66. https://doi.org/10.1038/laban.1220.