diff --git a/Dockerfile b/Dockerfile index 533036ef96a9b1459c3c8b2e04698b7a43eeec1d..2fd937f910a186846341667a9ff8213f0c3d08e4 100755 --- a/Dockerfile +++ b/Dockerfile @@ -34,7 +34,7 @@ CMD if [ "$pdfdocker" = "false" ] ; then \ && mv figure/* /output/figure/ ; \ else \ echo "compiling PDF inside Docker" \ - && Rscript -e "tinytex::install_tinytex()" --vanilla \ + # && Rscript -e "tinytex::install_tinytex()" --vanilla \ ## knit Rnw to tex and compile tex inside docker to PDF && Rscript -e "knitr::knit2pdf('"$FILE".Rnw')" --vanilla \ && mv "$FILE".pdf /output/ ; \ diff --git a/paper/rsabsence.Rnw b/paper/rsabsence.Rnw index 56ac3f36e9221ea1ea82636c3527562505c5c9c5..b10a6cd626dee911348504212ba7d0fb9823cdf0 100755 --- a/paper/rsabsence.Rnw +++ b/paper/rsabsence.Rnw @@ -131,9 +131,10 @@ paper by Douglas Altman and Martin Bland has since become a mantra in the statistical and medical literature \citep{Altman1995}. Yet, the misconception that a statistically non-significant result indicates evidence for the absence of an effect is unfortunately still widespread \citep{Makin2019}. Such a ``null -result'' -- typically characterized by a \textit{p}-value of $p > 0.05$ for the null -hypothesis of an absent effect -- may also occur if an effect is actually -present. For example, if the sample size of a study is chosen to detect an +result'' -- typically characterized by a \textit{p}-value of $p > 0.05$ for the +null hypothesis of an absent effect -- may also occur if an effect is actually +present. For example, if the sample size of a +study is chosen to detect an assumed effect with a power of 80\%, null results will incorrectly occur 20\% of the time when the assumed effect is actually present. Conversely, if the power of the study is lower, null results will occur more often. In general, the lower @@ -148,12 +149,13 @@ or Bayes factors \citep{Kass1995}, should be used from the outset. % two systematic reviews that I found which show that animal studies are very % much underpowered on average \citep{Jennions2003,Carneiro2018} -The contextualization\todo{replace contextualization with interpretation?} of null results becomes even more complicated in the -setting of replication studies. In a replication study, researchers attempt to -repeat an original study as closely as possible in order to assess whether -similar\todo{replace similar with consistent?} results can be obtained with new data \citep{NSF2019}. In the last decade, various large-scale replication projects have been conducted in diverse fields, from the biomedical to the social sciences - \citep[among -others]{Prinz2011,Begley2012,Klein2014,Opensc2015,Camerer2016,Camerer2018,Klein2018,Cova2018,Errington2021}. \todo{changed sentennce to not assume that their were only projects in biomed and soc sciences, but there might be more, or more in the pipeline} +The interpretation of null results becomes even more complicated in the setting +of replication studies. In a replication study, researchers attempt to repeat an +original study as closely as possible in order to assess whether consistent +results can be obtained with new data \citep{NSF2019}. In the last decade, +various large-scale replication projects have been conducted in diverse fields, +from the biomedical to the social sciences \citep[among +others]{Prinz2011,Begley2012,Klein2014,Opensc2015,Camerer2016,Camerer2018,Klein2018,Cova2018,Errington2021}. Most of these projects reported alarmingly low replicability rates across a broad spectrum of criteria for quantifying replicability. 
While most of these projects restricted their focus on original studies with statistically @@ -164,41 +166,37 @@ significant results (``positive results''), the \emph{Reproducibility Project: also attempted to replicate some original studies with null results. The RPP excluded the original null results from its overall assessment of -replication success (\textit{i.e.} the proportion of successful replications\todo{added by me, can be deleted again}), but the RPCB and the RPEP explicitly defined null results -in both the original and the replication study as a criterion for ``replication -success''. There are several logical problems with this ``non-significance'' -criterion. First, if the original study had low statistical power, a -non-significant result is highly inconclusive and does not provide evidence for -the absence of an effect. It is then unclear what exactly the goal of the -replication should be -- to replicate the inconclusiveness of the original -result? On the other hand, if the original study was adequately powered, a -non-significant result may indeed provide some evidence for the absence of an -effect when analyzed with appropriate methods, so that the goal of the -replication is clearer. However, the criterion does not distinguish between -these two cases. Second, with this criterion researchers can virtually always -achieve replication success by conducting two studies with very small sample -sizes, such that the \textit{p}-values are non-significant and the results are -inconclusive. \todo{I find the "second, ..." argument a bit unnecessary for our cause. Also because if you do a replication, you probably do not design the first study (to be of low power). Instead I would directly write "Second, if the goal of inference is to quantify the -evidence for the absence of an effect, the null hypothesis under which the \textit{p}-values are computed is misaligned with the goal."} This is because the null hypothesis under which the \textit{p}-values are -computed is misaligned with the goal of inference, which is to quantify the -evidence for the absence of an effect. We will discuss methods that are better -aligned with this inferential goal. % in Section~\ref{sec:methods}. -Third, the criterion does not control the error of falsely claiming the absence -of an effect at some predetermined rate. This is in contrast to the standard -replication success criterion of requiring significance from both studies -\citep[also known as the two-trials rule, see chapter 12.2.8 in][]{Senn2008}, -which ensures that the error of falsely claiming the presence of an effect is -controlled at a rate equal to the squared significance level (for example, -5\ $\times$ 5\% = 0.25\% for a 5\% significance level). The non-significance -criterion may be intended to complement the two-trials rule for null results, -but it fails to do so in this respect, which may be important to regulators, -funders, and researchers. We will now demonstrate these issues and potential -solutions using the null results from the RPCB. +replication success (i.e., the proportion of ``successful'' replications), but +the RPCB and the RPEP explicitly defined null results in both the original and +the replication study as a criterion for ``replication success''. There are +several logical problems with this ``non-significance'' criterion. First, if the +original study had low statistical power, a non-significant result is highly +inconclusive and does not provide evidence for the absence of an effect. 
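To see how inconclusive such a low-powered null result is, consider a minimal R sketch (the sample size and effect size are hypothetical and chosen only for illustration):

<< "power-null-sketch", eval = FALSE >>=
## probability of a null result (p > 0.05) in a two-sample t-test when a
## medium effect (SMD = 0.5) is truly present and n = 10 per group
pow <- power.t.test(n = 10, delta = 0.5, sd = 1, sig.level = 0.05)$power
pow      ## power is only about 0.19
1 - pow  ## a null result occurs about 81% of the time despite a true effect
@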
It is +then unclear what exactly the goal of the replication should be -- to replicate +the inconclusiveness of the original result? On the other hand, if the original +study was adequately powered, a non-significant result may indeed provide some +evidence for the absence of an effect when analyzed with appropriate methods, so +that the goal of the replication is clearer. However, the criterion does not +distinguish between these two cases. Second, with this criterion researchers can +virtually always achieve replication success by conducting a replication study +with a very small sample size, such that the \textit{p}-value is non-significant +and the result is inconclusive. This is because the null hypothesis under which +the \textit{p}-value is computed is misaligned with the goal of inference, which +is to quantify the evidence for the absence of an effect. We will discuss +methods that are better aligned with this inferential goal. Third, the criterion +does not control the error of falsely claiming the absence of an effect at some +predetermined rate. This is in contrast to the standard replication success +criterion of requiring significance from both studies \citep[also known as the +two-trials rule, see chapter 12.2.8 in][]{Senn2008}, which ensures that the +error of falsely claiming the presence of an effect is controlled at a rate +equal to the squared significance level (for example, 5\% $\times$ 5\% = 0.25\% +for a 5\% significance level). The non-significance criterion may be intended to +complement the two-trials rule for null results, but it fails to do so in this +respect, which may be important to regulators, funders, and researchers. We will +now demonstrate these issues and potential solutions using the null results from +the RPCB. -\section{Null results from the Reproducibility Project: Cancer Biology} -\label{sec:rpcb} - << "data" >>= ## data rpcbRaw <- read.csv(file = "../data/rpcb-effect-level.csv") @@ -242,11 +240,8 @@ rpcbNull <- rpcb %>% ## paper 41 (https://osf.io/qnpxv) - 1 Hazard ratio - sample size correspond to forest plot ## paper 47 (https://osf.io/jhp8z) - 2 r - sample size correspond to forest plot ## paper 48 (https://osf.io/zewrd) - 1 r - sample size do not correspond to forest plot for original study -@ - -\begin{figure}[!htb] -<< "2-example-studies", fig.height = 3.25 >>= +## 2 examples ## some evidence for absence of effect https://doi.org/10.7554/eLife.45120 I ## can't find the replication effect like reported in the data set :( let's take ## it at face value we are not data detectives @@ -267,8 +262,25 @@ plotDF1 <- rpcbNull %>% ## ## https://doi.org/10.7554/eLife.25306.012 ## plotDF1$no[plotDF1$id == study2] <- plotDF1$no[plotDF1$id == study2]*2 ## plotDF1$nr[plotDF1$id == study2] <- plotDF1$nr[plotDF1$id == study2]*2 -## create plot showing two example study pairs with null results + conflevel <- 0.95 +@ + +\section{Null results from the Reproducibility Project: Cancer Biology} +\label{sec:rpcb} + +Figure~\ref{fig:2examples} shows effect estimates on the standardized mean +difference scale with \Sexpr{round(100*conflevel, 2)}\% confidence intervals +from two RPCB study pairs. In both study pairs, the original and replication +studies are ``null results'' and therefore meet the non-significance criterion +for replication success (the two-sided \textit{p}-values are greater than 0.05 +in both the original and the replication study). However, intuition would +suggest that the conclusions in the two pairs are very different.
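As a minimal sketch of how the non-significance criterion operates (the estimates and standard errors below are hypothetical, a normal approximation is assumed, and this is not part of the paper's analysis code):

<< "nonsig-criterion-sketch", eval = FALSE >>=
## hypothetical SMD estimates and standard errors for an original and a
## replication study
est <- c(original = 0.2, replication = 0.15)
se <- c(original = 0.30, replication = 0.25)
## two-sided p-values for the point null hypothesis of no effect
p <- 2*(1 - pnorm(abs(est/se)))
all(p > 0.05)  ## non-significance criterion declares "replication success"
## the two-trials rule instead requires significance twice, so the error of
## falsely claiming an effect is at most 0.05^2 = 0.0025
@

Note that shrinking the sample sizes (and thereby inflating the standard errors) only makes this criterion easier to satisfy, which is the second problem discussed above.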
+ + +\begin{figure}[!htb] +<< "2-example-studies", fig.height = 3 >>= +## create plot showing two example study pairs with null results ggplot(data = plotDF1) + facet_wrap(~ label) + geom_hline(yintercept = 0, lty = 2, alpha = 0.3) + @@ -306,22 +318,20 @@ ggplot(data = plotDF1) + for the null hypothesis that the effect is absent.} \end{figure} -Figure~\ref{fig:2examples} shows standardized mean difference effect estimates -with \Sexpr{round(100*conflevel, 2)}\% confidence intervals from two RPCB study -pairs. In both study pairs, the original and replications studies are ``null results'' and therefore meet the non-significance criterion for -replication success (the two-sided \textit{p}-values are greater than 0.05 in both the -original and the replication study). However, intuition would suggest that the conclusions in the two pairs are very different. The original study from \citet{Dawson2011} and its replication both show large effect estimates in magnitude, but due to the small sample sizes, the -uncertainty of these estimates is large, too. If the sample sizes of the -studies were larger and the point estimates remained the same, intuitively both -studies would provide evidence for a non-zero effect\todo{Does this sentence add much information? I'd delete it and start the next one "With such low sample sizes used, ...".}. However, with the samples -sizes that were actually used, the results are inconclusive. In contrast, the +uncertainty of these estimates is large, too. +% If the sample sizes of the studies were larger and the point estimates +% remained the same, intuitively both studies would provide evidence for a +% non-zero effect\todo{Does this sentence add much information? I'd delete it +% and start the next one "With such low sample sizes used, ...".}. However, with +% the samples sizes that were actually used, +With such small sample sizes, the results seem inconclusive. In contrast, the effect estimates from \citet{Goetz2011} and its replication are much smaller in magnitude and their uncertainty is also smaller because the studies used larger -sample sizes. Intuitively, these studies seem to provide some evidence for a -zero (or negligibly small) effect. While these two examples show the qualitative +sample sizes. Intuitively, the results seem to provide some evidence for a zero +(or negligibly small) effect. While these two examples show the qualitative difference between absence of evidence and evidence of absence, we will now discuss how the two can be quantitatively distinguished.
The goal is then to reject -the % composite %% maybe too technical? +\citep{Wellek2010}. The method can also be used to assess whether an effect is +practically equivalent to an absent effect, usually zero. Using equivalence +testing as a way to deal with non-significant results has been suggested by +several authors \citep{Hauck1986, Campbell2018}. The main challenge is to +specify the margin $\Delta > 0$ that defines an equivalence range +$[-\Delta, +\Delta]$ in which an effect is considered as absent for practical +purposes. The goal is then to reject the % composite %% maybe too technical? null hypothesis that the true effect is outside the equivalence range. This is -in contrast to the usual null hypothesis of a superiority test which states that -the effect is zero, see Figure~\ref{fig:hypotheses} for an illustration. +in contrast to the usual null hypotheses of superiority tests, which state that +the effect is zero (two-sided) or at most zero (one-sided), see +Figure~\ref{fig:hypotheses} for an illustration. \begin{figure}[!htb] \begin{center} @@ -362,7 +372,7 @@ the effect is zero, see Figure~\ref{fig:hypotheses} for an illustration. \draw (3,0.2) -- (3,-0.2) node[below]{$0$}; \draw (4,0.2) -- (4,-0.2) node[below]{$+\Delta$}; - \node[text width=5cm, align=left] at (0,1.25) {\textbf{Equivalence}}; + \node[text width=5cm, align=left] at (0,1) {\textbf{Equivalence}}; \draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt}] (2.05,0.75) -- (3.95,0.75) node[midway,yshift=1.5em]{\textcolor{darkred2}{$H_1$}}; \draw [draw={darkblue2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}] (0,0.75) -- (1.95,0.75) node[pos=0.6,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}}; \draw [draw={darkblue2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}] (4.05,0.75) -- (6,0.75) node[pos=0.4,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}}; - \node[text width=5cm, align=left] at (0,2.5) {\textbf{Superiority}}; + \node[text width=5cm, align=left] at (0,2.15) {\textbf{Superiority}\\(two-sided)}; \draw [decorate,decoration={brace,amplitude=5pt}] (3,2) -- (3,2) node[midway,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}}; \draw[darkblue2] (3,1.95) -- (3,2.2); @@ -379,17 +389,17 @@
\draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}] (3.05,2) -- (6,2) node[pos=0.4,yshift=1.5em]{\textcolor{darkred2}{$H_1$}}; - % \node[text width=5cm, align=left] at (0,5.5) {\textbf{Superiority \\ (one-sided)}}; - % \draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}] - % (3.05,5) -- (6,5) node[pos=0.4,yshift=1.5em]{\textcolor{darkred2}{$H_1$}}; - % \draw [draw={darkblue2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}] - % (0,5) -- (3,5) node[pos=0.6,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}}; + \node[text width=5cm, align=left] at (0,3.45) {\textbf{Superiority}\\(one-sided)}; + \draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}] + (3.05,3.25) -- (6,3.25) node[pos=0.4,yshift=1.5em]{\textcolor{darkred2}{$H_1$}}; + \draw [draw={darkblue2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}] + (0,3.25) -- (3,3.25) node[pos=0.6,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}}; \draw [dashed] (2,0) -- (2,0.75); \draw [dashed] (4,0) -- (4,0.75); \draw [dashed] (3,0) -- (3,0.75); \draw [dashed] (3,1.5) -- (3,1.9); - % \draw [dashed] (3,3.9) -- (3,5); + \draw [dashed] (3,2.8) -- (3,3.2); \end{tikzpicture} \end{center} \caption{Null hypothesis ($H_0$) and alternative hypothesis ($H_1$) for @@ -397,25 +407,24 @@ the effect is zero, see Figure~\ref{fig:hypotheses} for an illustration. \label{fig:hypotheses} \end{figure} -\todo{shouldn't for superiority H1 have positive effect and H0 an effect < or equal to 0...?} - To ensure that the null hypothesis is falsely rejected at most $\alpha \times 100\%$ of the time, the standard approach is to declare equivalence if the $(1-2\alpha)\times 100\%$ confidence interval for the effect is contained within the equivalence range, for example, a 90\% confidence -interval for $\alpha = 5\%$ \citep{Westlake1972}. The procedure is equivalent to -two one-sided tests (TOST) for the null hypotheses of the effect being -greater/smaller than $+\Delta$ and $-\Delta$ being significant at level $\alpha$ \todo{this sentence confused me a bit, mainly the "being significant at ...". does it need a comma somewhere?} -\citep{Schuirmann1987}. A quantitative measure of evidence for the absence of an -effect is then given by the maximum of the two one-sided \textit{p}-values (the TOST -\textit{p}-value). A reasonable replication success criterion for null results may -therefore be to require that both the original and the replication TOST -\textit{p}-values be smaller than some level $\alpha$ (e.g., 0.05), or, equivalently, -that their $(1-2\alpha)\times 100\%$ confidence intervals are included in the -equivalence region. In contrast to the non-significance criterion, this -criterion controls the error of falsely claiming replication success at level -$\alpha^{2}$ when there is a true effect outside the equivalence margin, thus -complementing the usual two-trials rule. +interval for $\alpha = 5\%$ \citep{Westlake1972}. This procedure is equivalent +to declaring equivalence when two one-sided tests (TOST) for the null hypotheses +of the effect being greater/smaller than $+\Delta$ and $-\Delta$ are both +significant at level $\alpha$ \citep{Schuirmann1987}. A quantitative measure of +evidence for the absence of an effect is then given by the maximum of the two +one-sided \textit{p}-values (the TOST \textit{p}-value).
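As a small illustration of the TOST computation under a normal approximation (the estimate, standard error, and margin below are hypothetical):

<< "tost-sketch", eval = FALSE >>=
## hypothetical SMD estimate, standard error, and equivalence margin
est <- 0.1
se <- 0.2
Delta <- 0.74
pUpper <- pnorm((est - Delta)/se)      ## one-sided p-value for H0: SMD >= +Delta
pLower <- 1 - pnorm((est + Delta)/se)  ## one-sided p-value for H0: SMD <= -Delta
max(pUpper, pLower)  ## TOST p-value
## equivalent criterion at alpha = 0.05: 90% CI within [-Delta, +Delta]
all(abs(est + c(-1, 1)*qnorm(0.95)*se) < Delta)
@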
A reasonable +replication success criterion for null results may therefore be to require that +both the original and the replication TOST \textit{p}-values be smaller than +some level $\alpha$ (conventionally 0.05), or, equivalently, that their +$(1-2\alpha)\times 100\%$ confidence intervals are included in the equivalence +region. In contrast to the non-significance criterion, this criterion controls +the error of falsely claiming replication success at level $\alpha^{2}$ when +there is a true effect outside the equivalence margin, thus complementing the +usual two-trials rule. \begin{figure} @@ -505,7 +514,7 @@ ggplot(data = rpcbNull) + strip.background = element_rect(fill = alpha("tan", 0.4)), axis.text = element_text(size = 8)) @ -\caption{Standardized mean difference (SMD) effect estimates with +\caption{Effect estimates on standardized mean difference (SMD) scale with \Sexpr{round(conflevel*100, 2)}\% confidence interval for the ``null results'' and their replication studies from the Reproducibility Project: Cancer Biology \citep{Errington2021}. The identifier above each plot indicates (original @@ -516,11 +525,11 @@ ggplot(data = rpcbNull) + indicated in the plot titles. The dashed gray line represents the value of no effect ($\text{SMD} = 0$), while the dotted red lines represent the equivalence range with a margin of $\Delta = \Sexpr{margin}$, classified as - ``liberal'' by \citet[Table 1.1]{Wellek2010}. The \textit{p}-values $p_{\text{TOST}}$ - are the maximum of the two one-sided \textit{p}-values for the effect being less than - or greater than $+\Delta$ or $-\Delta$, respectively. The Bayes factors - $\BF_{01}$ quantify the evidence for the null hypothesis - $H_{0} \colon \text{SMD} = 0$ against the alternative + ``liberal'' by \citet[Table 1.1]{Wellek2010}. The \textit{p}-values + $p_{\text{TOST}}$ are the maximum of the two one-sided \textit{p}-values for + the null hypotheses of the effect being greater/less than $+\Delta$ and + $-\Delta$, respectively. The Bayes factors $\BF_{01}$ quantify the evidence + for the null hypothesis $H_{0} \colon \text{SMD} = 0$ against the alternative $H_{1} \colon \text{SMD} \neq 0$ with normal unit-information prior assigned to the SMD under $H_{1}$.} \label{fig:nullfindings} @@ -557,29 +566,29 @@ results by the RPCB.\footnote{There are four original studies with null effects analysis \citep{Errington2021}, we aggregated their SMD estimates into a single SMD estimate with fixed-effect meta-analysis and recomputed the replication \textit{p}-value based on a normal approximation. For the original - studies and single replication studies we report the \textit{p}-values as provided by - the RPCB.} Most of them showed non-significant \textit{p}-values ($p > 0.05$) in the -original study. In one of the considered papers (number 48) the original authors regarded two effects as null results despite their statistical significance. We see that -there are \Sexpr{nullSuccesses} ``success'' according to the non-significance -criterion (with $p > 0.05$ in original and replication study) out of total -\Sexpr{ntotal} null effects, as reported in Table 1 from~\citet{Errington2021}. -% , and which were therefore treated as null results also by the RPCB. + studies and the single replication studies we report the \textit{p}-values as + provided by the RPCB.} Most of them showed non-significant \textit{p}-values +($p > 0.05$) in the original study. 
In one of the considered papers (number 48) +the original authors regarded two effects as null results despite their +statistical significance. We see that there are \Sexpr{nullSuccesses} +``successes'' according to the non-significance criterion (with $p > 0.05$ in +original and replication study) out of a total of \Sexpr{ntotal} null effects, as +reported in Table 1 from~\citet{Errington2021}. We will now apply equivalence testing to the RPCB data. The dotted red lines -represent an equivalence range for the margin $\Delta = -\Sexpr{margin}$, % , for which the shown TOST \textit{p}-values are computed. -which \citet[Table 1.1]{Wellek2010} classifies as ``liberal''. However, even -with this generous margin, only \Sexpr{equivalenceSuccesses} of the -\Sexpr{ntotal} study pairs are able to establish replication success at the 5\% -level, in the sense that both the original and the replication 90\% confidence -interval fall within the equivalence range (or, equivalently, that their TOST -\textit{p}-values are smaller than $0.05$). For the remaining \Sexpr{ntotal - - equivalenceSuccesses} studies, the situation remains inconclusive and there is -no evidence for the absence or the presence of the effect. For instance, the -previously discussed example from \citet{Goetz2011} marginally fails the -criterion ($p_{\text{TOST}} = \Sexpr{formatPval(ptosto1)}$ in the original study -and $p_{\text{TOST}} = \Sexpr{formatPval(ptostr1)}$ in the replication), while -the example from \citet{Dawson2011} is a clearer failure +represent an equivalence range for the margin $\Delta = \Sexpr{margin}$, which +\citet[Table 1.1]{Wellek2010} classifies as ``liberal''. However, even with this +generous margin, only \Sexpr{equivalenceSuccesses} of the \Sexpr{ntotal} study +pairs are able to establish replication success at the 5\% level, in the sense +that both the original and the replication 90\% confidence intervals fall within +the equivalence range (or, equivalently, that their TOST \textit{p}-values are +smaller than $0.05$). For the remaining \Sexpr{ntotal - equivalenceSuccesses} +studies, the situation remains inconclusive and there is no evidence for the +absence or the presence of the effect. For instance, the previously discussed +example from \citet{Goetz2011} marginally fails the criterion +($p_{\text{TOST}} = \Sexpr{formatPval(ptosto1)}$ in the original study and +$p_{\text{TOST}} = \Sexpr{formatPval(ptostr1)}$ in the replication), while the +example from \citet{Dawson2011} is a clearer failure ($p_{\text{TOST}} = \Sexpr{formatPval(ptosto2)}$ in the original study and $p_{\text{TOST}} = \Sexpr{formatPval(ptostr2)}$ in the replication). @@ -588,19 +597,19 @@ $p_{\text{TOST}} = \Sexpr{formatPval(ptostr2)}$ in the replication). % We chose the margin $\Delta = \Sexpr{margin}$ primarily for illustrative % purposes and because effect sizes in preclinical research are typically much % larger than in clinical research. -The post-hoc determination of the equivalence margins is controversial. Ideally, -the margin should be determined on a case-by-case basis before the studies are +The post-hoc determination of equivalence margins is controversial. Ideally, the +margin should be determined on a case-by-case basis before the studies are conducted by researchers familiar with the subject matter. In the social and medical sciences, the conventions of \citet{Cohen1992} are typically used to classify SMD effect sizes ($\text{SMD} = 0.2$ small, $\text{SMD} = 0.5$ medium, $\text{SMD} = 0.8$ large).
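The margin conversions mentioned in the following paragraph can be computed directly; a short sketch (values rounded, using the conversion cited there):

<< "margin-conversion-sketch", eval = FALSE >>=
## convert log odds/hazard ratio margins to the SMD scale via
## SMD = sqrt(3)/pi * log(OR)
sqrt(3)/pi * log(1.3)   ## oncology margin, approximately 0.14
sqrt(3)/pi * log(1.25)  ## bioequivalence margin, approximately 0.12
@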
While effect sizes are typically larger in -preclinical research, it seems unrealistic to specify margins larger than 1 \todo{add "on the SMD scale"?} to -represent effect sizes that are absent for practical purposes. It could also be -argued that the chosen margin $\Delta = \Sexpr{margin}$ is too lax compared to -margins commonly used in clinical research; for instance, in oncology, a margin -of $\Delta = \log(1.3)$ is commonly used for log odds/hazard ratios, whereas in -bioequivalence studies a margin of \mbox{$\Delta = - \log(1.25) % = \Sexpr{round(log(1.25), 2)} +preclinical research, it seems unrealistic to specify margins larger than 1 on +the SMD scale to represent effect sizes that are absent for practical purposes. It +could also be argued that the chosen margin $\Delta = \Sexpr{margin}$ is too lax +compared to margins commonly used in clinical research; for instance, in +oncology, a margin of $\Delta = \log(1.3)$ is commonly used for log odds/hazard +ratios, whereas in bioequivalence studies a margin of +\mbox{$\Delta = \log(1.25) % = \Sexpr{round(log(1.25), 2)} $} is the convention. These margins would translate into much more stringent margins of $\Delta = % \log(1.3)\sqrt{3}/\pi = \Sexpr{round(log(1.3)*sqrt(3)/pi, 2)}$ and $\Delta = % \log(1.25)\sqrt{3}/\pi = @@ -609,15 +618,12 @@ the $\text{SMD} = (\surd{3} / \pi) \log\text{OR}$ conversion \citep[p. 233]{Cooper2019}. Therefore, we report a sensitivity analysis in Figure~\ref{fig:sensitivity}. The top plot shows the number of successful replications as a function of the margin $\Delta$ and for different TOST -\textit{p}-value thresholds. Such an ``equivalence curve'' approach was first proposed -by \citet{Hauck1986}. -% see also \citet{Campbell2021} for alternative approaches to post-hoc -% equivalence margin specification. -We see that for realistic margins between 0 and 1, the proportion of replication -successes remains below 50\%. To achieve a success rate of -11/15 = \Sexpr{round(11/15*100, 1)}\%, as is was achieved with the non-significance criterion, -unrealistic margins of $\Delta >$ 2 are required, highlighting the paucity of -evidence provided by these studies. +\textit{p}-value thresholds. Such an ``equivalence curve'' approach was first +proposed by \citet{Hauck1986}. We see that for realistic margins between 0 and +1, the proportion of replication successes remains below 50\%. To achieve a +success rate of 11/15 = \Sexpr{round(11/15*100, 1)}\%, as was achieved with +the non-significance criterion, unrealistic margins of $\Delta > 2$ are +required, highlighting the paucity of evidence provided by these studies. @@ -714,10 +720,10 @@ grid.arrange(plotA, plotB, ncol = 1) \caption{Number of successful replications of original null results in the RPCB as a function of the margin $\Delta$ of the equivalence test ($p_{\text{TOST}} \leq \alpha$ in both studies) or the standard deviation of - the normal prior distribution for the SMD effect size under the alternative - $H_{1}$ of the Bayes factor test ($\BF_{01} \geq \gamma$ in both studies). The - dashed gray lines represent the margin and standard deviation used in the main - analysis shown in Figure~\ref{fig:nullfindings}.} + the zero-mean normal prior distribution for the SMD effect size under the + alternative $H_{1}$ of the Bayes factor test ($\BF_{01} \geq \gamma$ in both + studies).
The dashed gray lines represent the margin and standard deviation + used in the main analysis shown in Figure~\ref{fig:nullfindings}.} \label{fig:sensitivity} \end{figure} @@ -751,8 +757,8 @@ respectively \citep{Jeffreys1961}. In contrast to the non-significance criterion, this criterion provides a genuine measure of evidence that can distinguish absence of evidence from evidence of absence. -When the observed data are dichotomized into positive (\mbox{$p < 0.05$}) or null -results (\mbox{$p > 0.05$}), the Bayes factor based on a null result is the +When the observed data are dichotomized into positive (\mbox{$p < 0.05$}) or +null results (\mbox{$p > 0.05$}), the Bayes factor based on a null result is the probability of observing \mbox{$p > 0.05$} when the effect is indeed absent (which is $95\%$) divided by the probability of observing $p > 0.05$ when the effect is indeed present (which is one minus the power of the study). For @@ -769,7 +775,7 @@ under $H_{1}$ will end up with different Bayes factors. Instead of specifying a single effect, one therefore typically specifies a ``prior distribution'' of plausible effects. Importantly, the prior distribution, like the equivalence margin, should be determined by researchers with subject knowledge and before -the data are observed\todo{are collected?}. +the data are collected. In practice, the observed data should not be dichotomized into positive or null results, as this leads to a loss of information. Therefore, to compute the Bayes @@ -782,11 +788,12 @@ an effect ($H_{1} \colon \text{SMD} \neq 0$) using a normal ``unit-information'' prior distribution\footnote{For SMD effect sizes, a normal unit-information prior is a normal distribution centered around the null value with a standard deviation corresponding to one observation. Assuming that the group means are - normally distributed \mbox{$\bar{X}_{1} \sim \Nor(\theta_{1}, 2\sigma^{2}/n)$} - and \mbox{$\bar{X}_{2} \sim \Nor(\theta_{2}, 2\sigma^{2}/n)$} with $n$ the + normally distributed + \mbox{$\overline{X}_{1} \sim \Nor(\theta_{1}, 2\sigma^{2}/n)$} and + \mbox{$\overline{X}_{2} \sim \Nor(\theta_{2}, 2\sigma^{2}/n)$} with $n$ the total sample size and $\sigma$ the known data standard deviation, the distribution of the SMD is - \mbox{$\text{SMD} = (\bar{X}_{1} - \bar{X}_{2})/\sigma \sim \Nor((\theta_{1} - \theta_{2})/\sigma, 4/n)$}. + \mbox{$\text{SMD} = (\overline{X}_{1} - \overline{X}_{2})/\sigma \sim \Nor\{(\theta_{1} - \theta_{2})/\sigma, 4/n\}$}. The standard deviation of the SMD based on one unit ($n = 1$) is hence 2, just as the unit standard deviation for log hazard/odds/rate ratio effect sizes \citep[Section 2.4]{Spiegelhalter2004}.} \citep{Kass1995b} for the effect size @@ -897,10 +904,11 @@ appropriately. Table~\ref{tab:recommendations} summarizes our recommendations. \item Compute the Bayes factors contrasting $H_{0} \colon \theta = \theta_{n}$ to $H_{1} \colon \theta \neq \theta_{n}$ for original and replication - data. Assuming a normal prior distribution - $\theta \given H_{1} \sim \Nor(m ,v)$, the Bayes factor is + data. 
Assuming a normal prior distribution + $\theta \given H_{1} \sim \Nor(m, v)$, + the Bayes factor is $$\BF_{01i} - = \sqrt{1 + v/\sigma^{2}_{i}} \, \exp\left[-\frac{1}{2} \left\{\frac{(\hat{\theta}_{i} - + = \sqrt{1 + \frac{v}{\sigma^{2}_{i}}} \, \exp\left[-\frac{1}{2} \left\{\frac{(\hat{\theta}_{i} - \theta_{n})^{2}}{\sigma^{2}_{i}} - \frac{(\hat{\theta}_{i} - m)^{2}}{\sigma^{2}_{i} + v} \right\}\right], ~ i \in \{o, r\}.$$ \item Declare replication success at level $\gamma > 1$ if @@ -952,8 +960,10 @@ power to make conclusive inferences regarding the absence of the effect. \section*{Acknowledgements} We thank the contributors of the RPCB for their tremendous efforts and for making their data publicly available. We thank Maya Mathur for helpful advice -with the data preparation. This work was supported by the Swiss National Science -Foundation (grant \href{https://data.snf.ch/grants/grant/189295}{\#189295}). +with the data preparation. Our acknowledgment of these individuals does not +imply their endorsement of our article. This work was supported by the Swiss +National Science Foundation (grant +\href{https://data.snf.ch/grants/grant/189295}{\#189295}). \section*{Conflict of interest} We declare no conflict of interest. diff --git a/paper/rsabsence.pdf b/paper/rsabsence.pdf new file mode 100644 index 0000000000000000000000000000000000000000..3b34b1a47d7a4beb6befbd68e3ffdfbc8273ba51 Binary files /dev/null and b/paper/rsabsence.pdf differ
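For completeness, a minimal R sketch of the Bayes factor formula above (the function and inputs are ours for illustration; the default $v = 4$ corresponds to the unit-information prior for SMDs described in the paper):

<< "bf-sketch", eval = FALSE >>=
## Bayes factor contrasting H0: theta = null against H1: theta != null
## with a normal prior N(m, v) for theta under H1
BF01 <- function(est, se, null = 0, m = 0, v = 4) {
  sqrt(1 + v/se^2) *
    exp(-0.5*((est - null)^2/se^2 - (est - m)^2/(se^2 + v)))
}
BF01(est = 0.1, se = 0.2)  ## about 8.9, i.e., evidence for H0 over H1
@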