@@ -131,9 +131,10 @@ paper by Douglas Altman and Martin Bland has since become a mantra in the
 statistical and medical literature \citep{Altman1995}. Yet, the misconception
 that a statistically non-significant result indicates evidence for the absence
 of an effect is unfortunately still widespread \citep{Makin2019}. Such a ``null
-result'' -- typically characterized by a \textit{p}-value of $p > 0.05$ for the null
-hypothesis of an absent effect -- may also occur if an effect is actually
-present. For example, if the sample size of a study is chosen to detect an
+result'' -- typically characterized by a \textit{p}-value of $p > 0.05$ for the
+null hypothesis of an absent effect -- may also occur if an effect is actually
+present. For example, if the sample size of a
+study is chosen to detect an
 assumed effect with a power of 80\%, null results will incorrectly occur 20\% of
 the time when the assumed effect is actually present. Conversely, if the power
 of the study is lower, null results will occur more often. In general, the lower
@@ -148,12 +149,13 @@ or Bayes factors \citep{Kass1995}, should be used from the outset.
 % two systematic reviews that I found which show that animal studies are very
 % much underpowered on average \citep{Jennions2003,Carneiro2018}
-The contextualization\todo{replace contextualization with interpretation?} of null results becomes even more complicated in the
-setting of replication studies. In a replication study, researchers attempt to
-repeat an original study as closely as possible in order to assess whether
-similar\todo{replace similar with consistent?} results can be obtained with new data \citep{NSF2019}. In the last decade, various large-scale replication projects have been conducted in diverse fields, from the biomedical to the social sciences
- \citep[among
-others]{Prinz2011,Begley2012,Klein2014,Opensc2015,Camerer2016,Camerer2018,Klein2018,Cova2018,Errington2021}. \todo{changed sentennce to not assume that their were only projects in biomed and soc sciences, but there might be more, or more in the pipeline}
+The interpretation of null results becomes even more complicated in the setting
+of replication studies. In a replication study, researchers attempt to repeat an
+original study as closely as possible in order to assess whether consistent
+results can be obtained with new data \citep{NSF2019}. In the last decade,
+various large-scale replication projects have been conducted in diverse fields,
+from the biomedical to the social sciences \citep[among
+others]{Prinz2011,Begley2012,Klein2014,Opensc2015,Camerer2016,Camerer2018,Klein2018,Cova2018,Errington2021}.
 Most of these projects reported alarmingly low replicability rates across a
 broad spectrum of criteria for quantifying replicability. While most of these
 projects restricted their focus on original studies with statistically
@@ -164,41 +166,37 @@ significant results (``positive results''), the \emph{Reproducibility Project:
 also attempted to replicate some original studies with null results.
 The RPP excluded the original null results from its overall assessment of
-replication success (\textit{i.e.} the proportion of successful replications\todo{added by me, can be deleted again}), but the RPCB and the RPEP explicitly defined null results
-in both the original and the replication study as a criterion for ``replication
-success''. There are several logical problems with this ``non-significance''
-criterion. First, if the original study had low statistical power, a
-non-significant result is highly inconclusive and does not provide evidence for
-the absence of an effect. It is then unclear what exactly the goal of the
-replication should be -- to replicate the inconclusiveness of the original
-result? On the other hand, if the original study was adequately powered, a
-non-significant result may indeed provide some evidence for the absence of an
-effect when analyzed with appropriate methods, so that the goal of the
-replication is clearer. However, the criterion does not distinguish between
-these two cases. Second, with this criterion researchers can virtually always
-achieve replication success by conducting two studies with very small sample
-sizes, such that the \textit{p}-values are non-significant and the results are
-inconclusive. \todo{I find the "second, ..." argument a bit unnecessary for our cause. Also because if you do a replication, you probably do not design the first study (to be of low power). Instead I would directly write "Second, if the goal of inference is to quantify the
-evidence for the absence of an effect, the null hypothesis under which the \textit{p}-values are computed is misaligned with the goal."} This is because the null hypothesis under which the \textit{p}-values are
-computed is misaligned with the goal of inference, which is to quantify the
-evidence for the absence of an effect. We will discuss methods that are better
-aligned with this inferential goal. % in Section~\ref{sec:methods}.
-Third, the criterion does not control the error of falsely claiming the absence
-of an effect at some predetermined rate. This is in contrast to the standard
-replication success criterion of requiring significance from both studies
-\citep[also known as the two-trials rule, see chapter 12.2.8 in][]{Senn2008},
-which ensures that the error of falsely claiming the presence of an effect is
-controlled at a rate equal to the squared significance level (for example,
-5\ $\times$ 5\% = 0.25\% for a 5\% significance level). The non-significance
-criterion may be intended to complement the two-trials rule for null results,
-but it fails to do so in this respect, which may be important to regulators,
-funders, and researchers. We will now demonstrate these issues and potential
-solutions using the null results from the RPCB.
+replication success (i.e., the proportion of ``successful'' replications), but
+the RPCB and the RPEP explicitly defined null results in both the original and
+the replication study as a criterion for ``replication success''. There are
+several logical problems with this ``non-significance'' criterion. First, if the
+original study had low statistical power, a non-significant result is highly
+inconclusive and does not provide evidence for the absence of an effect. It is
+then unclear what exactly the goal of the replication should be -- to replicate
+the inconclusiveness of the original result? On the other hand, if the original
+study was adequately powered, a non-significant result may indeed provide some
+evidence for the absence of an effect when analyzed with appropriate methods, so
+that the goal of the replication is clearer. However, the criterion does not
+distinguish between these two cases. Second, with this criterion researchers can
+virtually always achieve replication success by conducting a replication study
+with a very small sample size, such that the \textit{p}-value is non-significant
+and the result is inconclusive. This is because the null hypothesis under which
+the \textit{p}-value is computed is misaligned with the goal of inference, which
+is to quantify the evidence for the absence of an effect. We will discuss
+methods that are better aligned with this inferential goal. Third, the criterion
+does not control the error of falsely claiming the absence of an effect at some
+predetermined rate. This is in contrast to the standard replication success
+criterion of requiring significance from both studies \citep[also known as the
+two-trials rule, see chapter 12.2.8 in][]{Senn2008}, which ensures that the
+error of falsely claiming the presence of an effect is controlled at a rate
+equal to the squared significance level (for example, 5\% $\times$ 5\% = 0.25\%
+for a 5\% significance level). The non-significance criterion may be intended to
+complement the two-trials rule for null results, but it fails to do so in this
+respect, which may be important to regulators, funders, and researchers. We will
+now demonstrate these issues and potential solutions using the null results from
+the RPCB.
-\section{Null results from the Reproducibility Project: Cancer Biology}
 << "data" >>=
 ## data
 rpcbRaw <- read.csv(file = "../data/rpcb-effect-level.csv")
@@ -242,11 +240,8 @@ rpcbNull <- rpcb %>%
 ## paper 41 (https://osf.io/qnpxv) - 1 Hazard ratio - sample size correspond to forest plot
 ## paper 47 (https://osf.io/jhp8z) - 2 r - sample size correspond to forest plot
 ## paper 48 (https://osf.io/zewrd) - 1 r - sample size do not correspond to forest plot for original study
-<< "2-example-studies", fig.height = 3.25 >>=
+## 2 examples
 ## some evidence for absence of effect https://doi.org/10.7554/eLife.45120 I
 ## can't find the replication effect like reported in the data set :( let's take
 ## it at face value we are not data detectives
@@ -267,8 +262,25 @@ plotDF1 <- rpcbNull %>%
 ## ## https://doi.org/10.7554/eLife.25306.012
 ## plotDF1$no[plotDF1$id == study2] <- plotDF1$no[plotDF1$id == study2]*2
 ## plotDF1$nr[plotDF1$id == study2] <- plotDF1$nr[plotDF1$id == study2]*2
-## create plot showing two example study pairs with null results
 conflevel <- 0.95
+\section{Null results from the Reproducibility Project: Cancer Biology}
+Figure~\ref{fig:2examples} shows effect estimates on the standardized mean
+difference scale with \Sexpr{round(100*conflevel, 2)}\% confidence intervals
+from two RPCB study pairs. In both study pairs, the original and replication
+studies are ``null results'' and therefore meet the non-significance criterion
+for replication success (the two-sided \textit{p}-values are greater than 0.05
+in both the original and the replication study). However, intuition would
+suggest that the conclusions in the two pairs are very different.
+<< "2-example-studies", fig.height = 3 >>=
+## create plot showing two example study pairs with null results
 ggplot(data = plotDF1) +
     facet_wrap(~ label) +
     geom_hline(yintercept = 0, lty = 2, alpha = 0.3) +
@@ -306,22 +318,20 @@ ggplot(data = plotDF1) +
   for the null hypothesis that the effect is absent.}
-Figure~\ref{fig:2examples} shows standardized mean difference effect estimates
-with \Sexpr{round(100*conflevel, 2)}\% confidence intervals from two RPCB study
-pairs. In both study pairs, the original and replications studies are ``null results'' and therefore meet the non-significance criterion for
-replication success (the two-sided \textit{p}-values are greater than 0.05 in both the
-original and the replication study). However, intuition would suggest that the conclusions in the two pairs are very different.
 The original study from \citet{Dawson2011} and its replication both show large
 effect estimates in magnitude, but due to the small sample sizes, the
-uncertainty of these estimates is large, too. If the sample sizes of the
-studies were larger and the point estimates remained the same, intuitively both
-studies would provide evidence for a non-zero effect\todo{Does this sentence add much information? I'd delete it and start the next one "With such low sample sizes used, ...".}. However, with the samples
-sizes that were actually used, the results are inconclusive. In contrast, the
+uncertainty of these estimates is large, too.
+% If the sample sizes of the studies were larger and the point estimates
+% remained the same, intuitively both studies would provide evidence for a
+% non-zero effect\todo{Does this sentence add much information? I'd delete it
+% and start the next one "With such low sample sizes used, ...".}. However, with
+% the samples sizes that were actually used,
+With such small sample sizes, the results seem inconclusive. In contrast, the
 effect estimates from \citet{Goetz2011} and its replication are much smaller in
 magnitude and their uncertainty is also smaller because the studies used larger
-sample sizes. Intuitively, these studies seem to provide some evidence for a
-zero (or negligibly small) effect. While these two examples show the qualitative
+sample sizes. Intuitively, the results seem to provide some evidence for a zero
+(or negligibly small) effect. While these two examples show the qualitative
 difference between absence of evidence and evidence of absence, we will now
 discuss how the two can be quantitatively distinguished.
@@ -341,17 +351,17 @@ data.
 Equivalence testing was developed in the context of clinical trials to assess
 whether a new treatment -- typically cheaper or with fewer side effects than the
 established treatment -- is practically equivalent to the established treatment
-\citep{Wellek2010}. The method can also be used to assess
-whether an effect is practically equivalent to the value of an absent effect\todo{change to "practically equivalent to an absent effect, usually zero"? meaning without the "the value of "},
-usually zero. Using equivalence testing as a remedy for non-significant results\todo{"as a way to deal with / handle non-significant results". because it is not a remedy in the sense of an intervention against non-sign. results. }
-has been suggested by several authors \citep{Hauck1986, Campbell2018}. The main
-challenge is to specify the margin $\Delta > 0$ that defines an equivalence
-range $[-\Delta, +\Delta]$ in which an effect is considered as absent for
-practical purposes. The goal is then to reject
-the % composite %% maybe too technical?
+\citep{Wellek2010}. The method can also be used to assess whether an effect is
+practically equivalent to an absent effect, usually zero. Using equivalence
+testing as a way to deal with non-significant results has been suggested by
+several authors \citep{Hauck1986, Campbell2018}. The main challenge is to
+specify the margin $\Delta > 0$ that defines an equivalence range
+$[-\Delta, +\Delta]$ in which an effect is considered as absent for practical
+purposes. The goal is then to reject the % composite %% maybe too technical?
 null hypothesis that the true effect is outside the equivalence range. This is
-in contrast to the usual null hypothesis of a superiority test which states that
-the effect is zero, see Figure~\ref{fig:hypotheses} for an illustration.
+in contrast to the usual null hypotheses of superiority tests, which state that
+the effect is zero or smaller than zero; see Figure~\ref{fig:hypotheses} for an
+illustration.
@@ -362,7 +372,7 @@ the effect is zero, see Figure~\ref{fig:hypotheses} for an illustration.
       \draw (3,0.2) -- (3,-0.2) node[below]{$0$};
       \draw (4,0.2) -- (4,-0.2) node[below]{$+\Delta$};
-      \node[text width=5cm, align=left] at (0,1.25) {\textbf{Equivalence}};
+      \node[text width=5cm, align=left] at (0,1) {\textbf{Equivalence}};
       \draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt}]
       (2.05,0.75) -- (3.95,0.75) node[midway,yshift=1.5em]{\textcolor{darkred2}{$H_1$}};
       \draw [draw={darkblue2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}]
@@ -370,7 +380,7 @@ the effect is zero, see Figure~\ref{fig:hypotheses} for an illustration.
       \draw [draw={darkblue2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}]
       (4.05,0.75) -- (6,0.75) node[pos=0.4,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}};
-      \node[text width=5cm, align=left] at (0,2.5) {\textbf{Superiority}};
+      \node[text width=5cm, align=left] at (0,2.15) {\textbf{Superiority}\\(two-sided)};
       \draw [decorate,decoration={brace,amplitude=5pt}]
       (3,2) -- (3,2) node[midway,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}};
       \draw[darkblue2] (3,1.95) -- (3,2.2);
@@ -379,17 +389,17 @@ the effect is zero, see Figure~\ref{fig:hypotheses} for an illustration.
       \draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}]
       (3.05,2) -- (6,2) node[pos=0.4,yshift=1.5em]{\textcolor{darkred2}{$H_1$}};
-      % \node[text width=5cm, align=left] at (0,5.5) {\textbf{Superiority  \\ (one-sided)}};
-      % \draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}]
-      % (3.05,5) -- (6,5) node[pos=0.4,yshift=1.5em]{\textcolor{darkred2}{$H_1$}};
-      % \draw [draw={darkblue2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}]
-      % (0,5) -- (3,5) node[pos=0.6,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}};
+      \node[text width=5cm, align=left] at (0,3.45) {\textbf{Superiority}\\(one-sided)};
+      \draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}]
+      (3.05,3.25) -- (6,3.25) node[pos=0.4,yshift=1.5em]{\textcolor{darkred2}{$H_1$}};
+      \draw [draw={darkblue2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}]
+      (0,3.25) -- (3,3.25) node[pos=0.6,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}};
       \draw [dashed] (2,0) -- (2,0.75);
       \draw [dashed] (4,0) -- (4,0.75);
       \draw [dashed] (3,0) -- (3,0.75);
       \draw [dashed] (3,1.5) -- (3,1.9);
-      % \draw [dashed] (3,3.9) -- (3,5);
+      \draw [dashed] (3,2.8) -- (3,3.2);
   \caption{Null hypothesis ($H_0$) and alternative hypothesis ($H_1$) for
@@ -397,25 +407,24 @@ the effect is zero, see Figure~\ref{fig:hypotheses} for an illustration.
-\todo{shouldn't for superiority H1 have positive effect and H0 an effect < or equal to 0...?}
 To ensure that the null hypothesis is falsely rejected at most
 $\alpha \times 100\%$ of the time, the standard approach is to declare
 equivalence if the $(1-2\alpha)\times 100\%$ confidence interval for the effect
 is contained within the equivalence range, for example, a 90\% confidence
-interval for $\alpha = 5\%$ \citep{Westlake1972}. The procedure is equivalent to
-two one-sided tests (TOST) for the null hypotheses of the effect being
-greater/smaller than $+\Delta$ and $-\Delta$ being significant at level $\alpha$ \todo{this sentence confused me a bit, mainly the "being significant at ...". does it need a comma somewhere?}
-\citep{Schuirmann1987}. A quantitative measure of evidence for the absence of an
-effect is then given by the maximum of the two one-sided \textit{p}-values (the TOST
-\textit{p}-value). A reasonable replication success criterion for null results may
-therefore be to require that both the original and the replication TOST
-\textit{p}-values be smaller than some level $\alpha$ (e.g., 0.05), or, equivalently,
-that their $(1-2\alpha)\times 100\%$ confidence intervals are included in the
-equivalence region. In contrast to the non-significance criterion, this
-criterion controls the error of falsely claiming replication success at level
-$\alpha^{2}$ when there is a true effect outside the equivalence margin, thus
-complementing the usual two-trials rule.
+interval for $\alpha = 5\%$ \citep{Westlake1972}. This procedure is equivalent
+to declaring equivalence when two one-sided tests (TOST) for the null hypotheses
+of the effect being greater than $+\Delta$ and smaller than $-\Delta$ are both
+significant at level $\alpha$ \citep{Schuirmann1987}. A quantitative measure of
+evidence for the absence of an effect is then given by the maximum of the two
+one-sided \textit{p}-values (the TOST \textit{p}-value). A reasonable
+replication success criterion for null results may therefore be to require that
+both the original and the replication TOST \textit{p}-values be smaller than
+some level $\alpha$ (conventionally 0.05), or, equivalently, that their
+$(1-2\alpha)\times 100\%$ confidence intervals are included in the equivalence
+region. In contrast to the non-significance criterion, this criterion controls
+the error of falsely claiming replication success at level $\alpha^{2}$ when
+there is a true effect outside the equivalence margin, thus complementing the
+usual two-trials rule.
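+
+% Illustrative sketch, not part of the original analysis: assumes a normal
+% approximation for the effect estimate; the numerical values are hypothetical.
+As a minimal sketch, assume that the effect estimate $\hat{\theta}_{i}$ of study
+$i \in \{o, r\}$ is approximately normally distributed with standard error
+$\sigma_{i}$. Under this assumption, the TOST \textit{p}-value can be written as
+$$p_{\text{TOST},i} = \max\left\{\Phi\left(\frac{\hat{\theta}_{i} - \Delta}{\sigma_{i}}\right), \,
+  1 - \Phi\left(\frac{\hat{\theta}_{i} + \Delta}{\sigma_{i}}\right)\right\},$$
+where $\Phi$ denotes the standard normal cumulative distribution function. For
+the hypothetical values $\hat{\theta}_{i} = 0.2$, $\sigma_{i} = 0.3$, and
+$\Delta = 1$, the two one-sided \textit{p}-values are
+$\Phi(-2.67) \approx 0.004$ and $1 - \Phi(4) < 0.001$, so that
+$p_{\text{TOST},i} \approx 0.004$. Correspondingly, the 90\% confidence interval
+$0.2 \pm 1.645 \times 0.3$, that is, roughly $[-0.29, 0.69]$, lies within the
+equivalence range $[-1, 1]$.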
@@ -505,7 +514,7 @@ ggplot(data = rpcbNull) +
           strip.background = element_rect(fill = alpha("tan", 0.4)),
           axis.text = element_text(size = 8))
-\caption{Standardized mean difference (SMD) effect estimates with
+\caption{Effect estimates on the standardized mean difference (SMD) scale with
   \Sexpr{round(conflevel*100, 2)}\% confidence interval for the ``null results''
   and their replication studies from the Reproducibility Project: Cancer Biology
   \citep{Errington2021}. The identifier above each plot indicates (original
@@ -516,11 +525,11 @@ ggplot(data = rpcbNull) +
   indicated in the plot titles. The dashed gray line represents the value of no
   effect ($\text{SMD} = 0$), while the dotted red lines represent the
   equivalence range with a margin of $\Delta = \Sexpr{margin}$, classified as
-  ``liberal'' by \citet[Table 1.1]{Wellek2010}. The \textit{p}-values $p_{\text{TOST}}$
-  are the maximum of the two one-sided \textit{p}-values for the effect being less than
-  or greater than $+\Delta$ or $-\Delta$, respectively. The Bayes factors
-  $\BF_{01}$ quantify the evidence for the null hypothesis
-  $H_{0} \colon \text{SMD} = 0$ against the alternative
+  ``liberal'' by \citet[Table 1.1]{Wellek2010}. The \textit{p}-values
+  $p_{\text{TOST}}$ are the maximum of the two one-sided \textit{p}-values for
+  the null hypotheses of the effect being greater/less than $+\Delta$ and
+  $-\Delta$, respectively. The Bayes factors $\BF_{01}$ quantify the evidence
+  for the null hypothesis $H_{0} \colon \text{SMD} = 0$ against the alternative
   $H_{1} \colon \text{SMD} \neq 0$ with normal unit-information prior assigned
   to the SMD under $H_{1}$.}
@@ -557,29 +566,29 @@ results by the RPCB.\footnote{There are four original studies with null effects
   analysis \citep{Errington2021}, we aggregated their SMD estimates into a
   single SMD estimate with fixed-effect meta-analysis and recomputed the
   replication \textit{p}-value based on a normal approximation. For the original
-  studies and single replication studies we report the \textit{p}-values as provided by
-  the RPCB.} Most of them showed non-significant \textit{p}-values ($p > 0.05$) in the
-original study. In one of the considered papers (number 48) the original authors regarded two effects as null results despite their statistical significance. We see that
-there are \Sexpr{nullSuccesses} ``success'' according to the non-significance
-criterion (with $p > 0.05$ in original and replication study) out of total
-\Sexpr{ntotal} null effects, as reported in Table 1 from~\citet{Errington2021}.
-% , and which were therefore treated as null results also by the RPCB.
+  studies and the single replication studies we report the \textit{p}-values as
+  provided by the RPCB.} Most of them showed non-significant \textit{p}-values
+($p > 0.05$) in the original study. In one of the considered papers (number 48)
+the original authors regarded two effects as null results despite their
+statistical significance. We see that there are \Sexpr{nullSuccesses}
+``successes'' according to the non-significance criterion (with $p > 0.05$ in
+original and replication study) out of a total of \Sexpr{ntotal} null effects, as
+reported in Table 1 from~\citet{Errington2021}.
 We will now apply equivalence testing to the RPCB data. The dotted red lines
-represent an equivalence range for the margin $\Delta =
-\Sexpr{margin}$, % , for which the shown TOST \textit{p}-values are computed.
-which \citet[Table 1.1]{Wellek2010} classifies as ``liberal''. However, even
-with this generous margin, only \Sexpr{equivalenceSuccesses} of the
-\Sexpr{ntotal} study pairs are able to establish replication success at the 5\%
-level, in the sense that both the original and the replication 90\% confidence
-interval fall within the equivalence range (or, equivalently, that their TOST
-\textit{p}-values are smaller than $0.05$). For the remaining \Sexpr{ntotal -
-  equivalenceSuccesses} studies, the situation remains inconclusive and there is
-no evidence for the absence or the presence of the effect. For instance, the
-previously discussed example from \citet{Goetz2011} marginally fails the
-criterion ($p_{\text{TOST}} = \Sexpr{formatPval(ptosto1)}$ in the original study
-and $p_{\text{TOST}} = \Sexpr{formatPval(ptostr1)}$ in the replication), while
-the example from \citet{Dawson2011} is a clearer failure
+represent an equivalence range for the margin $\Delta = \Sexpr{margin}$, which
+\citet[Table 1.1]{Wellek2010} classifies as ``liberal''. However, even with this
+generous margin, only \Sexpr{equivalenceSuccesses} of the \Sexpr{ntotal} study
+pairs are able to establish replication success at the 5\% level, in the sense
+that both the original and the replication 90\% confidence interval fall within
+the equivalence range (or, equivalently, that their TOST \textit{p}-values are
+smaller than $0.05$). For the remaining \Sexpr{ntotal - equivalenceSuccesses}
+studies, the situation remains inconclusive and there is no evidence for the
+absence or the presence of the effect. For instance, the previously discussed
+example from \citet{Goetz2011} marginally fails the criterion
+($p_{\text{TOST}} = \Sexpr{formatPval(ptosto1)}$ in the original study and
+$p_{\text{TOST}} = \Sexpr{formatPval(ptostr1)}$ in the replication), while the
+example from \citet{Dawson2011} is a clearer failure
 ($p_{\text{TOST}} = \Sexpr{formatPval(ptosto2)}$ in the original study and
 $p_{\text{TOST}} = \Sexpr{formatPval(ptostr2)}$ in the replication).
@@ -588,19 +597,19 @@ $p_{\text{TOST}} = \Sexpr{formatPval(ptostr2)}$ in the replication).
 % We chose the margin $\Delta = \Sexpr{margin}$ primarily for illustrative
 % purposes and because effect sizes in preclinical research are typically much
 % larger than in clinical research.
-The post-hoc determination of the equivalence margins is controversial. Ideally,
-the margin should be determined on a case-by-case basis before the studies are
+The post-hoc determination of equivalence margins is controversial. Ideally, the
+margin should be determined on a case-by-case basis before the studies are
 conducted by researchers familiar with the subject matter. In the social and
 medical sciences, the conventions of \citet{Cohen1992} are typically used to
 classify SMD effect sizes ($\text{SMD} = 0.2$ small, $\text{SMD} = 0.5$ medium,
 $\text{SMD} = 0.8$ large). While effect sizes are typically larger in
-preclinical research, it seems unrealistic to specify margins larger than 1 \todo{add "on the SMD scale"?} to
-represent effect sizes that are absent for practical purposes. It could also be
-argued that the chosen margin $\Delta = \Sexpr{margin}$ is too lax compared to
-margins commonly used in clinical research; for instance, in oncology, a margin
-of $\Delta = \log(1.3)$ is commonly used for log odds/hazard ratios, whereas in
-bioequivalence studies a margin of \mbox{$\Delta =
-  \log(1.25) % = \Sexpr{round(log(1.25), 2)}
+preclinical research, it seems unrealistic to specify margins larger than 1 on
+the SMD scale to represent effect sizes that are absent for practical purposes. It
+could also be argued that the chosen margin $\Delta = \Sexpr{margin}$ is too lax
+compared to margins commonly used in clinical research; for instance, in
+oncology, a margin of $\Delta = \log(1.3)$ is commonly used for log odds/hazard
+ratios, whereas in bioequivalence studies a margin of
+\mbox{$\Delta = \log(1.25) % = \Sexpr{round(log(1.25), 2)}
   $} is the convention. These margins would translate into much more stringent
 margins of $\Delta = % \log(1.3)\sqrt{3}/\pi =
 \Sexpr{round(log(1.3)*sqrt(3)/pi, 2)}$ and $\Delta = % \log(1.25)\sqrt{3}/\pi =
@@ -609,15 +618,12 @@ the $\text{SMD} = (\surd{3} / \pi) \log\text{OR}$ conversion \citep[p.
 233]{Cooper2019}. Therefore, we report a sensitivity analysis in
 Figure~\ref{fig:sensitivity}. The top plot shows the number of successful
 replications as a function of the margin $\Delta$ and for different TOST
-\textit{p}-value thresholds. Such an ``equivalence curve'' approach was first proposed
-by \citet{Hauck1986}.
-% see also \citet{Campbell2021} for alternative approaches to post-hoc
-% equivalence margin specification.
-We see that for realistic margins between 0 and 1, the proportion of replication
-successes remains below 50\%. To achieve a success rate of
-11/15 = \Sexpr{round(11/15*100, 1)}\%, as is was achieved with the non-significance criterion,
-unrealistic margins of $\Delta >$ 2 are required, highlighting the paucity of
-evidence provided by these studies.
+\textit{p}-value thresholds. Such an ``equivalence curve'' approach was first
+proposed by \citet{Hauck1986}. We see that for realistic margins between 0 and
+1, the proportion of replication successes remains below 50\%. To achieve a
+success rate of 11/15 = \Sexpr{round(11/15*100, 1)}\%, as was achieved with
+the non-significance criterion, unrealistic margins of $\Delta >$ 2 are
+required, highlighting the paucity of evidence provided by these studies.
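+
+% Illustrative arithmetic for the log odds ratio to SMD conversion of the
+% clinical margins; rounded intermediate values are shown for transparency.
+As a worked check of the conversion of the clinical margins to the SMD scale,
+using natural logarithms and $\surd{3}/\pi \approx 0.551$, we obtain
+$$\Delta = \frac{\surd{3}}{\pi} \log(1.3) \approx 0.551 \times 0.262 \approx 0.14
+  \quad \text{and} \quad
+  \Delta = \frac{\surd{3}}{\pi} \log(1.25) \approx 0.551 \times 0.223 \approx 0.12,$$
+both of which are far below the margins required to reach the success rate of
+the non-significance criterion.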
@@ -714,10 +720,10 @@ grid.arrange(plotA, plotB, ncol = 1)
 \caption{Number of successful replications of original null results in the RPCB
   as a function of the margin $\Delta$ of the equivalence test
   ($p_{\text{TOST}} \leq \alpha$ in both studies) or the standard deviation of
-  the normal prior distribution for the SMD effect size under the alternative
-  $H_{1}$ of the Bayes factor test ($\BF_{01} \geq \gamma$ in both studies). The
-  dashed gray lines represent the margin and standard deviation used in the main
-  analysis shown in Figure~\ref{fig:nullfindings}.}
+  the zero-mean normal prior distribution for the SMD effect size under the
+  alternative $H_{1}$ of the Bayes factor test ($\BF_{01} \geq \gamma$ in both
+  studies). The dashed gray lines represent the margin and standard deviation
+  used in the main analysis shown in Figure~\ref{fig:nullfindings}.}
@@ -751,8 +757,8 @@ respectively \citep{Jeffreys1961}. In contrast to the non-significance
 criterion, this criterion provides a genuine measure of evidence that can
 distinguish absence of evidence from evidence of absence.
-When the observed data are dichotomized into positive (\mbox{$p < 0.05$}) or null
-results (\mbox{$p > 0.05$}), the Bayes factor based on a null result is the
+When the observed data are dichotomized into positive (\mbox{$p < 0.05$}) or
+null results (\mbox{$p > 0.05$}), the Bayes factor based on a null result is the
 probability of observing \mbox{$p > 0.05$} when the effect is indeed absent
 (which is $95\%$) divided by the probability of observing $p > 0.05$ when the
 effect is indeed present (which is one minus the power of the study). For
@@ -769,7 +775,7 @@ under $H_{1}$ will end up with different Bayes factors. Instead of specifying a
 single effect, one therefore typically specifies a ``prior distribution'' of
 plausible effects. Importantly, the prior distribution, like the equivalence
 margin, should be determined by researchers with subject knowledge and before
-the data are observed\todo{are collected?}.
+the data are collected.
 In practice, the observed data should not be dichotomized into positive or null
 results, as this leads to a loss of information. Therefore, to compute the Bayes
@@ -782,11 +788,12 @@ an effect ($H_{1} \colon \text{SMD} \neq 0$) using a normal ``unit-information''
 prior distribution\footnote{For SMD effect sizes, a normal unit-information
   prior is a normal distribution centered around the null value with a standard
   deviation corresponding to one observation. Assuming that the group means are
-  normally distributed \mbox{$\bar{X}_{1} \sim \Nor(\theta_{1}, 2\sigma^{2}/n)$}
-  and \mbox{$\bar{X}_{2} \sim \Nor(\theta_{2}, 2\sigma^{2}/n)$} with $n$ the
+  normally distributed
+  \mbox{$\overline{X}_{1} \sim \Nor(\theta_{1}, 2\sigma^{2}/n)$} and
+  \mbox{$\overline{X}_{2} \sim \Nor(\theta_{2}, 2\sigma^{2}/n)$} with $n$ the
   total sample size and $\sigma$ the known data standard deviation, the
   distribution of the SMD is
-  \mbox{$\text{SMD} = (\bar{X}_{1} - \bar{X}_{2})/\sigma \sim \Nor((\theta_{1} - \theta_{2})/\sigma, 4/n)$}.
+  \mbox{$\text{SMD} = (\overline{X}_{1} - \overline{X}_{2})/\sigma \sim \Nor\{(\theta_{1} - \theta_{2})/\sigma, 4/n\}$}.
   The standard deviation of the SMD based on one unit ($n = 1$) is hence 2, just
   as the unit standard deviation for log hazard/odds/rate ratio effect sizes
   \citep[Section 2.4]{Spiegelhalter2004}.} \citep{Kass1995b} for the effect size
@@ -897,10 +904,11 @@ appropriately. Table~\ref{tab:recommendations} summarizes our recommendations.
       \item Compute the Bayes factors contrasting
             $H_{0} \colon \theta = \theta_{n}$ to
             $H_{1} \colon \theta \neq \theta_{n}$ for original and replication
-            data. Assuming a normal prior distribution
-            $\theta \given H_{1} \sim \Nor(m ,v)$, the Bayes factor is
+            data. Assuming a normal prior distribution,
+            % $\theta \given H_{1} \sim \Nor(m ,v)$,
+            the Bayes factor is
-            = \sqrt{1 + v/\sigma^{2}_{i}} \, \exp\left[-\frac{1}{2} \left\{\frac{(\hat{\theta}_{i} -
+            = \sqrt{1 + \frac{v}{\sigma^{2}_{i}}} \, \exp\left[-\frac{1}{2} \left\{\frac{(\hat{\theta}_{i} -
                   \theta_{n})^{2}}{\sigma^{2}_{i}} - \frac{(\hat{\theta}_{i} - m)^{2}}{\sigma^{2}_{i} + v}
               \right\}\right], ~ i \in \{o, r\}.$$
       \item Declare replication success at level $\gamma > 1$ if
@@ -952,8 +960,10 @@ power to make conclusive inferences regarding the absence of the effect.
 We thank the contributors of the RPCB for their tremendous efforts and for
 making their data publicly available. We thank Maya Mathur for helpful advice
-with the data preparation. This work was supported by the Swiss National Science
-Foundation (grant \href{https://data.snf.ch/grants/grant/189295}{\#189295}).
+with the data preparation. Our acknowledgment of these individuals does not
+imply their endorsement of our article. This work was supported by the Swiss
+National Science Foundation (grant
+\href{https://data.snf.ch/grants/grant/189295}{\#189295}).
 \section*{Conflict of interest}
 We declare no conflict of interest.
