diff --git a/Dockerfile b/Dockerfile
index 533036ef96a9b1459c3c8b2e04698b7a43eeec1d..2fd937f910a186846341667a9ff8213f0c3d08e4 100755
--- a/Dockerfile
+++ b/Dockerfile
@@ -34,7 +34,7 @@ CMD if [ "$pdfdocker" = "false" ] ; then \
     && mv figure/* /output/figure/ ; \
     else \
     echo "compiling PDF inside Docker" \
-    && Rscript -e "tinytex::install_tinytex()" --vanilla \
+    # && Rscript -e "tinytex::install_tinytex()" --vanilla \
     ## knit Rnw to tex and compile tex inside docker to PDF
     && Rscript -e "knitr::knit2pdf('"$FILE".Rnw')" --vanilla \
     && mv "$FILE".pdf  /output/ ; \
diff --git a/paper/rsabsence.Rnw b/paper/rsabsence.Rnw
index 56ac3f36e9221ea1ea82636c3527562505c5c9c5..b10a6cd626dee911348504212ba7d0fb9823cdf0 100755
--- a/paper/rsabsence.Rnw
+++ b/paper/rsabsence.Rnw
@@ -131,9 +131,10 @@ paper by Douglas Altman and Martin Bland has since become a mantra in the
 statistical and medical literature \citep{Altman1995}. Yet, the misconception
 that a statistically non-significant result indicates evidence for the absence
 of an effect is unfortunately still widespread \citep{Makin2019}. Such a ``null
-result'' -- typically characterized by a \textit{p}-value of $p > 0.05$ for the null
-hypothesis of an absent effect -- may also occur if an effect is actually
-present. For example, if the sample size of a study is chosen to detect an
+result'' -- typically characterized by a \textit{p}-value of $p > 0.05$ for the
+null hypothesis of an absent effect -- may also occur if an effect is actually
+present. For example, if the sample size of a study is chosen to detect an
 assumed effect with a power of 80\%, null results will incorrectly occur 20\% of
 the time when the assumed effect is actually present. Conversely, if the power
 of the study is lower, null results will occur more often. In general, the lower
@@ -148,12 +149,13 @@ or Bayes factors \citep{Kass1995}, should be used from the outset.
 % two systematic reviews that I found which show that animal studies are very
 % much underpowered on average \citep{Jennions2003,Carneiro2018}
 
-The contextualization\todo{replace contextualization with interpretation?} of null results becomes even more complicated in the
-setting of replication studies. In a replication study, researchers attempt to
-repeat an original study as closely as possible in order to assess whether
-similar\todo{replace similar with consistent?} results can be obtained with new data \citep{NSF2019}. In the last decade, various large-scale replication projects have been conducted in diverse fields, from the biomedical to the social sciences
- \citep[among
-others]{Prinz2011,Begley2012,Klein2014,Opensc2015,Camerer2016,Camerer2018,Klein2018,Cova2018,Errington2021}. \todo{changed sentennce to not assume that their were only projects in biomed and soc sciences, but there might be more, or more in the pipeline}
+The interpretation of null results becomes even more complicated in the setting
+of replication studies. In a replication study, researchers attempt to repeat an
+original study as closely as possible in order to assess whether consistent
+results can be obtained with new data \citep{NSF2019}. In the last decade,
+various large-scale replication projects have been conducted in diverse fields,
+from the biomedical to the social sciences \citep[among
+others]{Prinz2011,Begley2012,Klein2014,Opensc2015,Camerer2016,Camerer2018,Klein2018,Cova2018,Errington2021}.
 Most of these projects reported alarmingly low replicability rates across a
 broad spectrum of criteria for quantifying replicability. While most of these
 projects restricted their focus on original studies with statistically
@@ -164,41 +166,37 @@ significant results (``positive results''), the \emph{Reproducibility Project:
 also attempted to replicate some original studies with null results.
 
 The RPP excluded the original null results from its overall assessment of
-replication success (\textit{i.e.} the proportion of successful replications\todo{added by me, can be deleted again}), but the RPCB and the RPEP explicitly defined null results
-in both the original and the replication study as a criterion for ``replication
-success''. There are several logical problems with this ``non-significance''
-criterion. First, if the original study had low statistical power, a
-non-significant result is highly inconclusive and does not provide evidence for
-the absence of an effect. It is then unclear what exactly the goal of the
-replication should be -- to replicate the inconclusiveness of the original
-result? On the other hand, if the original study was adequately powered, a
-non-significant result may indeed provide some evidence for the absence of an
-effect when analyzed with appropriate methods, so that the goal of the
-replication is clearer. However, the criterion does not distinguish between
-these two cases. Second, with this criterion researchers can virtually always
-achieve replication success by conducting two studies with very small sample
-sizes, such that the \textit{p}-values are non-significant and the results are
-inconclusive. \todo{I find the "second, ..." argument a bit unnecessary for our cause. Also because if you do a replication, you probably do not design the first study (to be of low power). Instead I would directly write "Second, if the goal of inference is to quantify the
-evidence for the absence of an effect, the null hypothesis under which the \textit{p}-values are computed is misaligned with the goal."} This is because the null hypothesis under which the \textit{p}-values are
-computed is misaligned with the goal of inference, which is to quantify the
-evidence for the absence of an effect. We will discuss methods that are better
-aligned with this inferential goal. % in Section~\ref{sec:methods}.
-Third, the criterion does not control the error of falsely claiming the absence
-of an effect at some predetermined rate. This is in contrast to the standard
-replication success criterion of requiring significance from both studies
-\citep[also known as the two-trials rule, see chapter 12.2.8 in][]{Senn2008},
-which ensures that the error of falsely claiming the presence of an effect is
-controlled at a rate equal to the squared significance level (for example,
-5\ $\times$ 5\% = 0.25\% for a 5\% significance level). The non-significance
-criterion may be intended to complement the two-trials rule for null results,
-but it fails to do so in this respect, which may be important to regulators,
-funders, and researchers. We will now demonstrate these issues and potential
-solutions using the null results from the RPCB.
+replication success (i.e., the proportion of ``successful'' replications), but
+the RPCB and the RPEP explicitly defined null results in both the original and
+the replication study as a criterion for ``replication success''. There are
+several logical problems with this ``non-significance'' criterion. First, if the
+original study had low statistical power, a non-significant result is highly
+inconclusive and does not provide evidence for the absence of an effect. It is
+then unclear what exactly the goal of the replication should be -- to replicate
+the inconclusiveness of the original result? On the other hand, if the original
+study was adequately powered, a non-significant result may indeed provide some
+evidence for the absence of an effect when analyzed with appropriate methods, so
+that the goal of the replication is clearer. However, the criterion does not
+distinguish between these two cases. Second, with this criterion researchers can
+virtually always achieve replication success by conducting a replication study
+with a very small sample size, such that the \textit{p}-value is non-significant
+and the result is inconclusive. This is because the null hypothesis under which
+the \textit{p}-value is computed is misaligned with the goal of inference, which
+is to quantify the evidence for the absence of an effect. We will discuss
+methods that are better aligned with this inferential goal. Third, the criterion
+does not control the error of falsely claiming the absence of an effect at some
+predetermined rate. This is in contrast to the standard replication success
+criterion of requiring significance from both studies \citep[also known as the
+two-trials rule, see chapter 12.2.8 in][]{Senn2008}, which ensures that the
+error of falsely claiming the presence of an effect is controlled at a rate
+equal to the squared significance level (for example, 5\% $\times$ 5\% = 0.25\%
+for a 5\% significance level). The non-significance criterion may be intended to
+complement the two-trials rule for null results, but it fails to do so in this
+respect, which may be important to regulators, funders, and researchers. We will
+now demonstrate these issues and potential solutions using the null results from
+the RPCB.
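+As a side check, the error control of the two-trials rule at the squared
+significance level can be illustrated with a short simulation, exploiting the
+fact that \textit{p}-values are uniformly distributed under the null
+hypothesis; the code below is a purely illustrative sketch and not part of the
+RPCB analysis.
+<< "two-trials-simulation", eval = FALSE >>=
+## illustrative check: under the null hypothesis of no effect, p-values are
+## uniform, so two independent studies are both significant at level alpha
+## with probability alpha^2
+set.seed(42)
+alpha <- 0.05
+nsim <- 1e6
+po <- runif(nsim) # original study p-values under the null
+pr <- runif(nsim) # replication study p-values under the null
+mean(po < alpha & pr < alpha) # should be close to alpha^2 = 0.0025
+@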
 
 
-\section{Null results from the Reproducibility Project: Cancer Biology}
-\label{sec:rpcb}
-
 << "data" >>=
 ## data
 rpcbRaw <- read.csv(file = "../data/rpcb-effect-level.csv")
@@ -242,11 +240,8 @@ rpcbNull <- rpcb %>%
 ## paper 41 (https://osf.io/qnpxv) - 1 Hazard ratio - sample size correspond to forest plot
 ## paper 47 (https://osf.io/jhp8z) - 2 r - sample size correspond to forest plot
 ## paper 48 (https://osf.io/zewrd) - 1 r - sample size do not correspond to forest plot for original study
-@
-
 
-\begin{figure}[!htb]
-<< "2-example-studies", fig.height = 3.25 >>=
+## 2 examples
 ## some evidence for absence of effect https://doi.org/10.7554/eLife.45120 I
 ## can't find the replication effect like reported in the data set :( let's take
 ## it at face value we are not data detectives
@@ -267,8 +262,25 @@ plotDF1 <- rpcbNull %>%
 ## ## https://doi.org/10.7554/eLife.25306.012
 ## plotDF1$no[plotDF1$id == study2] <- plotDF1$no[plotDF1$id == study2]*2
 ## plotDF1$nr[plotDF1$id == study2] <- plotDF1$nr[plotDF1$id == study2]*2
-## create plot showing two example study pairs with null results
+
 conflevel <- 0.95
+@
+
+\section{Null results from the Reproducibility Project: Cancer Biology}
+\label{sec:rpcb}
+
+Figure~\ref{fig:2examples} shows effect estimates on the standardized mean
+difference scale with \Sexpr{round(100*conflevel, 2)}\% confidence intervals
+from two RPCB study pairs. In both study pairs, the original and replication
+studies are ``null results'' and therefore meet the non-significance criterion
+for replication success (the two-sided \textit{p}-values are greater than 0.05
+in both the original and the replication study). However, intuition would
+suggest that the conclusions in the two pairs are very different.
+
+
+\begin{figure}[!htb]
+<< "2-example-studies", fig.height = 3 >>=
+## create plot showing two example study pairs with null results
 ggplot(data = plotDF1) +
     facet_wrap(~ label) +
     geom_hline(yintercept = 0, lty = 2, alpha = 0.3) +
@@ -306,22 +318,20 @@ ggplot(data = plotDF1) +
   for the null hypothesis that the effect is absent.}
 \end{figure}
 
-Figure~\ref{fig:2examples} shows standardized mean difference effect estimates
-with \Sexpr{round(100*conflevel, 2)}\% confidence intervals from two RPCB study
-pairs. In both study pairs, the original and replications studies are ``null results'' and therefore meet the non-significance criterion for
-replication success (the two-sided \textit{p}-values are greater than 0.05 in both the
-original and the replication study). However, intuition would suggest that the conclusions in the two pairs are very different.
 
 The original study from \citet{Dawson2011} and its replication both show large
 effect estimates in magnitude, but due to the small sample sizes, the
-uncertainty of these estimates is large, too. If the sample sizes of the
-studies were larger and the point estimates remained the same, intuitively both
-studies would provide evidence for a non-zero effect\todo{Does this sentence add much information? I'd delete it and start the next one "With such low sample sizes used, ...".}. However, with the samples
-sizes that were actually used, the results are inconclusive. In contrast, the
+uncertainty of these estimates is large, too.
+With such small sample sizes, the results seem inconclusive. In contrast, the
 effect estimates from \citet{Goetz2011} and its replication are much smaller in
 magnitude and their uncertainty is also smaller because the studies used larger
-sample sizes. Intuitively, these studies seem to provide some evidence for a
-zero (or negligibly small) effect. While these two examples show the qualitative
+sample sizes. Intuitively, the results seem to provide some evidence for a zero
+(or negligibly small) effect. While these two examples show the qualitative
 difference between absence of evidence and evidence of absence, we will now
 discuss how the two can be quantitatively distinguished.
 
@@ -341,17 +351,17 @@ data.
 Equivalence testing was developed in the context of clinical trials to assess
 whether a new treatment -- typically cheaper or with fewer side effects than the
 established treatment -- is practically equivalent to the established treatment
-\citep{Wellek2010}. The method can also be used to assess
-whether an effect is practically equivalent to the value of an absent effect\todo{change to "practically equivalent to an absent effect, usually zero"? meaning without the "the value of "},
-usually zero. Using equivalence testing as a remedy for non-significant results\todo{"as a way to deal with / handle non-significant results". because it is not a remedy in the sense of an intervention against non-sign. results. }
-has been suggested by several authors \citep{Hauck1986, Campbell2018}. The main
-challenge is to specify the margin $\Delta > 0$ that defines an equivalence
-range $[-\Delta, +\Delta]$ in which an effect is considered as absent for
-practical purposes. The goal is then to reject
-the % composite %% maybe too technical?
+\citep{Wellek2010}. The method can also be used to assess whether an effect is
+practically equivalent to an absent effect, usually zero. Using equivalence
+testing as a way to deal with non-significant results has been suggested by
+several authors \citep{Hauck1986, Campbell2018}. The main challenge is to
+specify the margin $\Delta > 0$ that defines an equivalence range
+$[-\Delta, +\Delta]$ in which an effect is considered as absent for practical
+purposes. The goal is then to reject the % composite %% maybe too technical?
 null hypothesis that the true effect is outside the equivalence range. This is
-in contrast to the usual null hypothesis of a superiority test which states that
-the effect is zero, see Figure~\ref{fig:hypotheses} for an illustration.
+in contrast to the usual null hypotheses of superiority tests, which state that
+the effect is zero or smaller than zero, see Figure~\ref{fig:hypotheses} for an
+illustration.
 
 \begin{figure}[!htb]
   \begin{center}
@@ -362,7 +372,7 @@ the effect is zero, see Figure~\ref{fig:hypotheses} for an illustration.
       \draw (3,0.2) -- (3,-0.2) node[below]{$0$};
       \draw (4,0.2) -- (4,-0.2) node[below]{$+\Delta$};
 
-      \node[text width=5cm, align=left] at (0,1.25) {\textbf{Equivalence}};
+      \node[text width=5cm, align=left] at (0,1) {\textbf{Equivalence}};
       \draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt}]
       (2.05,0.75) -- (3.95,0.75) node[midway,yshift=1.5em]{\textcolor{darkred2}{$H_1$}};
       \draw [draw={darkblue2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}]
@@ -370,7 +380,7 @@ the effect is zero, see Figure~\ref{fig:hypotheses} for an illustration.
       \draw [draw={darkblue2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}]
       (4.05,0.75) -- (6,0.75) node[pos=0.4,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}};
 
-      \node[text width=5cm, align=left] at (0,2.5) {\textbf{Superiority}};
+      \node[text width=5cm, align=left] at (0,2.15) {\textbf{Superiority}\\(two-sided)};
       \draw [decorate,decoration={brace,amplitude=5pt}]
       (3,2) -- (3,2) node[midway,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}};
       \draw[darkblue2] (3,1.95) -- (3,2.2);
@@ -379,17 +389,17 @@ the effect is zero, see Figure~\ref{fig:hypotheses} for an illustration.
       \draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}]
       (3.05,2) -- (6,2) node[pos=0.4,yshift=1.5em]{\textcolor{darkred2}{$H_1$}};
 
-      % \node[text width=5cm, align=left] at (0,5.5) {\textbf{Superiority  \\ (one-sided)}};
-      % \draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}]
-      % (3.05,5) -- (6,5) node[pos=0.4,yshift=1.5em]{\textcolor{darkred2}{$H_1$}};
-      % \draw [draw={darkblue2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}]
-      % (0,5) -- (3,5) node[pos=0.6,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}};
+      \node[text width=5cm, align=left] at (0,3.45) {\textbf{Superiority}\\(one-sided)};
+      \draw [draw={darkred2},decorate,decoration={brace,amplitude=5pt,aspect=0.4}]
+      (3.05,3.25) -- (6,3.25) node[pos=0.4,yshift=1.5em]{\textcolor{darkred2}{$H_1$}};
+      \draw [draw={darkblue2},decorate,decoration={brace,amplitude=5pt,aspect=0.6}]
+      (0,3.25) -- (3,3.25) node[pos=0.6,yshift=1.5em]{\textcolor{darkblue2}{$H_0$}};
 
       \draw [dashed] (2,0) -- (2,0.75);
       \draw [dashed] (4,0) -- (4,0.75);
       \draw [dashed] (3,0) -- (3,0.75);
       \draw [dashed] (3,1.5) -- (3,1.9);
-      % \draw [dashed] (3,3.9) -- (3,5);
+      \draw [dashed] (3,2.8) -- (3,3.2);
     \end{tikzpicture}
   \end{center}
   \caption{Null hypothesis ($H_0$) and alternative hypothesis ($H_1$) for
@@ -397,25 +407,24 @@ the effect is zero, see Figure~\ref{fig:hypotheses} for an illustration.
   \label{fig:hypotheses}
 \end{figure}
 
-\todo{shouldn't for superiority H1 have positive effect and H0 an effect < or equal to 0...?}
-
 To ensure that the null hypothesis is falsely rejected at most
 $\alpha \times 100\%$ of the time, the standard approach is to declare
 equivalence if the $(1-2\alpha)\times 100\%$ confidence interval for the effect
 is contained within the equivalence range, for example, a 90\% confidence
-interval for $\alpha = 5\%$ \citep{Westlake1972}. The procedure is equivalent to
-two one-sided tests (TOST) for the null hypotheses of the effect being
-greater/smaller than $+\Delta$ and $-\Delta$ being significant at level $\alpha$ \todo{this sentence confused me a bit, mainly the "being significant at ...". does it need a comma somewhere?}
-\citep{Schuirmann1987}. A quantitative measure of evidence for the absence of an
-effect is then given by the maximum of the two one-sided \textit{p}-values (the TOST
-\textit{p}-value). A reasonable replication success criterion for null results may
-therefore be to require that both the original and the replication TOST
-\textit{p}-values be smaller than some level $\alpha$ (e.g., 0.05), or, equivalently,
-that their $(1-2\alpha)\times 100\%$ confidence intervals are included in the
-equivalence region. In contrast to the non-significance criterion, this
-criterion controls the error of falsely claiming replication success at level
-$\alpha^{2}$ when there is a true effect outside the equivalence margin, thus
-complementing the usual two-trials rule.
+interval for $\alpha = 5\%$ \citep{Westlake1972}. This procedure is equivalent
+to declaring equivalence when two one-sided tests (TOST) for the null hypotheses
+of the effect being greater than $+\Delta$ and smaller than $-\Delta$ are both
+significant at level $\alpha$ \citep{Schuirmann1987}. A quantitative measure of
+evidence for the absence of an effect is then given by the maximum of the two
+one-sided \textit{p}-values (the TOST \textit{p}-value). A reasonable
+replication success criterion for null results may therefore be to require that
+both the original and the replication TOST \textit{p}-values be smaller than
+some level $\alpha$ (conventionally 0.05), or, equivalently, that their
+$(1-2\alpha)\times 100\%$ confidence intervals are included in the equivalence
+region. In contrast to the non-significance criterion, this criterion controls
+the error of falsely claiming replication success at level $\alpha^{2}$ when
+there is a true effect outside the equivalence margin, thus complementing the
+usual two-trials rule.
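+For effect estimates that can be assumed to be approximately normally
+distributed around the true effect, the TOST \textit{p}-value is
+straightforward to compute; the following sketch uses purely hypothetical
+numbers for illustration.
+<< "tost-sketch", eval = FALSE >>=
+## TOST p-value for estimate est with standard error se and margin Delta,
+## assuming approximate normality of the estimate
+ptost <- function(est, se, margin) {
+    p1 <- pnorm((est - margin)/se)                     # H0: effect >= +margin
+    p2 <- pnorm((est + margin)/se, lower.tail = FALSE) # H0: effect <= -margin
+    max(p1, p2) # TOST p-value
+}
+ptost(est = 0.1, se = 0.3, margin = 1) # hypothetical numbers
+@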
 
 
 \begin{figure}
@@ -505,7 +514,7 @@ ggplot(data = rpcbNull) +
           strip.background = element_rect(fill = alpha("tan", 0.4)),
           axis.text = element_text(size = 8))
 @
-\caption{Standardized mean difference (SMD) effect estimates with
+\caption{Effect estimates on the standardized mean difference (SMD) scale with
   \Sexpr{round(conflevel*100, 2)}\% confidence interval for the ``null results''
   and their replication studies from the Reproducibility Project: Cancer Biology
   \citep{Errington2021}. The identifier above each plot indicates (original
@@ -516,11 +525,11 @@ ggplot(data = rpcbNull) +
   indicated in the plot titles. The dashed gray line represents the value of no
   effect ($\text{SMD} = 0$), while the dotted red lines represent the
   equivalence range with a margin of $\Delta = \Sexpr{margin}$, classified as
-  ``liberal'' by \citet[Table 1.1]{Wellek2010}. The \textit{p}-values $p_{\text{TOST}}$
-  are the maximum of the two one-sided \textit{p}-values for the effect being less than
-  or greater than $+\Delta$ or $-\Delta$, respectively. The Bayes factors
-  $\BF_{01}$ quantify the evidence for the null hypothesis
-  $H_{0} \colon \text{SMD} = 0$ against the alternative
+  ``liberal'' by \citet[Table 1.1]{Wellek2010}. The \textit{p}-values
+  $p_{\text{TOST}}$ are the maximum of the two one-sided \textit{p}-values for
+  the null hypotheses of the effect being greater/less than $+\Delta$ and
+  $-\Delta$, respectively. The Bayes factors $\BF_{01}$ quantify the evidence
+  for the null hypothesis $H_{0} \colon \text{SMD} = 0$ against the alternative
   $H_{1} \colon \text{SMD} \neq 0$ with normal unit-information prior assigned
   to the SMD under $H_{1}$.}
 \label{fig:nullfindings}
@@ -557,29 +566,29 @@ results by the RPCB.\footnote{There are four original studies with null effects
   analysis \citep{Errington2021}, we aggregated their SMD estimates into a
   single SMD estimate with fixed-effect meta-analysis and recomputed the
   replication \textit{p}-value based on a normal approximation. For the original
-  studies and single replication studies we report the \textit{p}-values as provided by
-  the RPCB.} Most of them showed non-significant \textit{p}-values ($p > 0.05$) in the
-original study. In one of the considered papers (number 48) the original authors regarded two effects as null results despite their statistical significance. We see that
-there are \Sexpr{nullSuccesses} ``success'' according to the non-significance
-criterion (with $p > 0.05$ in original and replication study) out of total
-\Sexpr{ntotal} null effects, as reported in Table 1 from~\citet{Errington2021}.
-% , and which were therefore treated as null results also by the RPCB.
+  studies and the single replication studies we report the \textit{p}-values as
+  provided by the RPCB.} Most of them showed non-significant \textit{p}-values
+($p > 0.05$) in the original study. In one of the considered papers (number 48)
+the original authors regarded two effects as null results despite their
+statistical significance. We see that there are \Sexpr{nullSuccesses}
+``successes'' according to the non-significance criterion ($p > 0.05$ in both
+the original and the replication study) out of a total of \Sexpr{ntotal} null
+effects, as reported in Table 1 from~\citet{Errington2021}.
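+
+The fixed-effect aggregation described in the footnote amounts to an
+inverse-variance weighted average of the SMD estimates, as in the following
+sketch (all numbers are hypothetical).
+<< "fema-sketch", eval = FALSE >>=
+## fixed-effect meta-analysis of SMD estimates est with standard errors se,
+## with a p-value based on a normal approximation
+fema <- function(est, se) {
+    w <- 1/se^2 # inverse-variance weights
+    estPooled <- sum(w*est)/sum(w)
+    sePooled <- sqrt(1/sum(w))
+    p <- 2*pnorm(abs(estPooled/sePooled), lower.tail = FALSE)
+    c(estimate = estPooled, se = sePooled, p = p)
+}
+fema(est = c(0.2, -0.1), se = c(0.4, 0.3)) # hypothetical numbers
+@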
 
 We will now apply equivalence testing to the RPCB data. The dotted red lines
-represent an equivalence range for the margin $\Delta =
-\Sexpr{margin}$, % , for which the shown TOST \textit{p}-values are computed.
-which \citet[Table 1.1]{Wellek2010} classifies as ``liberal''. However, even
-with this generous margin, only \Sexpr{equivalenceSuccesses} of the
-\Sexpr{ntotal} study pairs are able to establish replication success at the 5\%
-level, in the sense that both the original and the replication 90\% confidence
-interval fall within the equivalence range (or, equivalently, that their TOST
-\textit{p}-values are smaller than $0.05$). For the remaining \Sexpr{ntotal -
-  equivalenceSuccesses} studies, the situation remains inconclusive and there is
-no evidence for the absence or the presence of the effect. For instance, the
-previously discussed example from \citet{Goetz2011} marginally fails the
-criterion ($p_{\text{TOST}} = \Sexpr{formatPval(ptosto1)}$ in the original study
-and $p_{\text{TOST}} = \Sexpr{formatPval(ptostr1)}$ in the replication), while
-the example from \citet{Dawson2011} is a clearer failure
+represent an equivalence range for the margin $\Delta = \Sexpr{margin}$, which
+\citet[Table 1.1]{Wellek2010} classifies as ``liberal''. However, even with this
+generous margin, only \Sexpr{equivalenceSuccesses} of the \Sexpr{ntotal} study
+pairs are able to establish replication success at the 5\% level, in the sense
+that both the original and the replication 90\% confidence intervals fall within
+the equivalence range (or, equivalently, that their TOST \textit{p}-values are
+smaller than $0.05$). For the remaining \Sexpr{ntotal - equivalenceSuccesses}
+studies, the situation remains inconclusive and there is no evidence for the
+absence or the presence of the effect. For instance, the previously discussed
+example from \citet{Goetz2011} marginally fails the criterion
+($p_{\text{TOST}} = \Sexpr{formatPval(ptosto1)}$ in the original study and
+$p_{\text{TOST}} = \Sexpr{formatPval(ptostr1)}$ in the replication), while the
+example from \citet{Dawson2011} is a clearer failure
 ($p_{\text{TOST}} = \Sexpr{formatPval(ptosto2)}$ in the original study and
 $p_{\text{TOST}} = \Sexpr{formatPval(ptostr2)}$ in the replication).
 
@@ -588,19 +597,19 @@ $p_{\text{TOST}} = \Sexpr{formatPval(ptostr2)}$ in the replication).
 % We chose the margin $\Delta = \Sexpr{margin}$ primarily for illustrative
 % purposes and because effect sizes in preclinical research are typically much
 % larger than in clinical research.
-The post-hoc determination of the equivalence margins is controversial. Ideally,
-the margin should be determined on a case-by-case basis before the studies are
+The post-hoc determination of equivalence margins is controversial. Ideally, the
+margin should be determined on a case-by-case basis before the studies are
 conducted by researchers familiar with the subject matter. In the social and
 medical sciences, the conventions of \citet{Cohen1992} are typically used to
 classify SMD effect sizes ($\text{SMD} = 0.2$ small, $\text{SMD} = 0.5$ medium,
 $\text{SMD} = 0.8$ large). While effect sizes are typically larger in
-preclinical research, it seems unrealistic to specify margins larger than 1 \todo{add "on the SMD scale"?} to
-represent effect sizes that are absent for practical purposes. It could also be
-argued that the chosen margin $\Delta = \Sexpr{margin}$ is too lax compared to
-margins commonly used in clinical research; for instance, in oncology, a margin
-of $\Delta = \log(1.3)$ is commonly used for log odds/hazard ratios, whereas in
-bioequivalence studies a margin of \mbox{$\Delta =
-  \log(1.25) % = \Sexpr{round(log(1.25), 2)}
+preclinical research, it seems unrealistic to specify margins larger than 1 on
+the SMD scale to represent effect sizes that are absent for practical purposes. It
+could also be argued that the chosen margin $\Delta = \Sexpr{margin}$ is too lax
+compared to margins commonly used in clinical research; for instance, in
+oncology, a margin of $\Delta = \log(1.3)$ is commonly used for log odds/hazard
+ratios, whereas in bioequivalence studies a margin of
+\mbox{$\Delta = \log(1.25) % = \Sexpr{round(log(1.25), 2)}
   $} is the convention. These margins would translate into much more stringent
 margins of $\Delta = % \log(1.3)\sqrt{3}/\pi =
 \Sexpr{round(log(1.3)*sqrt(3)/pi, 2)}$ and $\Delta = % \log(1.25)\sqrt{3}/\pi =
@@ -609,15 +618,12 @@ the $\text{SMD} = (\surd{3} / \pi) \log\text{OR}$ conversion \citep[p.
 233]{Cooper2019}. Therefore, we report a sensitivity analysis in
 Figure~\ref{fig:sensitivity}. The top plot shows the number of successful
 replications as a function of the margin $\Delta$ and for different TOST
-\textit{p}-value thresholds. Such an ``equivalence curve'' approach was first proposed
-by \citet{Hauck1986}.
-% see also \citet{Campbell2021} for alternative approaches to post-hoc
-% equivalence margin specification.
-We see that for realistic margins between 0 and 1, the proportion of replication
-successes remains below 50\%. To achieve a success rate of
-11/15 = \Sexpr{round(11/15*100, 1)}\%, as is was achieved with the non-significance criterion,
-unrealistic margins of $\Delta >$ 2 are required, highlighting the paucity of
-evidence provided by these studies.
+\textit{p}-value thresholds. Such an ``equivalence curve'' approach was first
+proposed by \citet{Hauck1986}. We see that for realistic margins between 0 and
+1, the proportion of replication successes remains below 50\%. To achieve a
+success rate of 11/15 = \Sexpr{round(11/15*100, 1)}\%, as was achieved with
+the non-significance criterion, unrealistic margins of $\Delta > 2$ are
+required, highlighting the paucity of evidence provided by these studies.
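+
+The conversion of the conventional margins from the log odds/hazard ratio
+scale to the SMD scale mentioned above can be reproduced in one line each; the
+values match those reported in the text.
+<< "margin-conversion", eval = FALSE >>=
+## convert conventional log odds/hazard ratio margins to the SMD scale
+## using SMD = (sqrt(3)/pi)*logOR (Cooper et al., 2019, p. 233)
+log(1.3)*sqrt(3)/pi  # oncology margin, approximately 0.14
+log(1.25)*sqrt(3)/pi # bioequivalence margin, approximately 0.12
+@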
 
 
 
@@ -714,10 +720,10 @@ grid.arrange(plotA, plotB, ncol = 1)
 \caption{Number of successful replications of original null results in the RPCB
   as a function of the margin $\Delta$ of the equivalence test
   ($p_{\text{TOST}} \leq \alpha$ in both studies) or the standard deviation of
-  the normal prior distribution for the SMD effect size under the alternative
-  $H_{1}$ of the Bayes factor test ($\BF_{01} \geq \gamma$ in both studies). The
-  dashed gray lines represent the margin and standard deviation used in the main
-  analysis shown in Figure~\ref{fig:nullfindings}.}
+  the zero-mean normal prior distribution for the SMD effect size under the
+  alternative $H_{1}$ of the Bayes factor test ($\BF_{01} \geq \gamma$ in both
+  studies). The dashed gray lines represent the margin and standard deviation
+  used in the main analysis shown in Figure~\ref{fig:nullfindings}.}
 \label{fig:sensitivity}
 \end{figure}
 
@@ -751,8 +757,8 @@ respectively \citep{Jeffreys1961}. In contrast to the non-significance
 criterion, this criterion provides a genuine measure of evidence that can
 distinguish absence of evidence from evidence of absence.
 
-When the observed data are dichotomized into positive (\mbox{$p < 0.05$}) or null
-results (\mbox{$p > 0.05$}), the Bayes factor based on a null result is the
+When the observed data are dichotomized into positive (\mbox{$p < 0.05$}) or
+null results (\mbox{$p > 0.05$}), the Bayes factor based on a null result is the
 probability of observing \mbox{$p > 0.05$} when the effect is indeed absent
 (which is $95\%$) divided by the probability of observing $p > 0.05$ when the
 effect is indeed present (which is one minus the power of the study). For
@@ -769,7 +775,7 @@ under $H_{1}$ will end up with different Bayes factors. Instead of specifying a
 single effect, one therefore typically specifies a ``prior distribution'' of
 plausible effects. Importantly, the prior distribution, like the equivalence
 margin, should be determined by researchers with subject knowledge and before
-the data are observed\todo{are collected?}.
+the data are collected.
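+
+For a normal prior distribution, the Bayes factor has a simple closed form
+(also given in Table~\ref{tab:recommendations}); the following is a minimal
+sketch with hypothetical numbers.
+<< "bf-sketch", eval = FALSE >>=
+## Bayes factor BF01 for H0: theta = theta0 vs. H1: theta != theta0, based on
+## an estimate est with standard error se, and a normal prior with mean m and
+## variance v for theta under H1
+bf01 <- function(est, se, theta0 = 0, m = 0, v) {
+    sqrt(1 + v/se^2)*
+        exp(-0.5*((est - theta0)^2/se^2 - (est - m)^2/(se^2 + v)))
+}
+## unit-information prior for SMDs: zero mean, standard deviation 2 (v = 4)
+bf01(est = 0.1, se = 0.3, v = 4) # hypothetical numbers
+@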
 
 In practice, the observed data should not be dichotomized into positive or null
 results, as this leads to a loss of information. Therefore, to compute the Bayes
@@ -782,11 +788,12 @@ an effect ($H_{1} \colon \text{SMD} \neq 0$) using a normal ``unit-information''
 prior distribution\footnote{For SMD effect sizes, a normal unit-information
   prior is a normal distribution centered around the null value with a standard
   deviation corresponding to one observation. Assuming that the group means are
-  normally distributed \mbox{$\bar{X}_{1} \sim \Nor(\theta_{1}, 2\sigma^{2}/n)$}
-  and \mbox{$\bar{X}_{2} \sim \Nor(\theta_{2}, 2\sigma^{2}/n)$} with $n$ the
+  normally distributed
+  \mbox{$\overline{X}_{1} \sim \Nor(\theta_{1}, 2\sigma^{2}/n)$} and
+  \mbox{$\overline{X}_{2} \sim \Nor(\theta_{2}, 2\sigma^{2}/n)$} with $n$ the
   total sample size and $\sigma$ the known data standard deviation, the
   distribution of the SMD is
-  \mbox{$\text{SMD} = (\bar{X}_{1} - \bar{X}_{2})/\sigma \sim \Nor((\theta_{1} - \theta_{2})/\sigma, 4/n)$}.
+  \mbox{$\text{SMD} = (\overline{X}_{1} - \overline{X}_{2})/\sigma \sim \Nor\{(\theta_{1} - \theta_{2})/\sigma, 4/n\}$}.
   The standard deviation of the SMD based on one unit ($n = 1$) is hence 2, just
   as the unit standard deviation for log hazard/odds/rate ratio effect sizes
   \citep[Section 2.4]{Spiegelhalter2004}.} \citep{Kass1995b} for the effect size
@@ -897,10 +904,11 @@ appropriately. Table~\ref{tab:recommendations} summarizes our recommendations.
       \item Compute the Bayes factors contrasting
             $H_{0} \colon \theta = \theta_{n}$ to
             $H_{1} \colon \theta \neq \theta_{n}$ for original and replication
-            data. Assuming a normal prior distribution
-            $\theta \given H_{1} \sim \Nor(m ,v)$, the Bayes factor is
+            data. Assuming a normal prior distribution
+            $\theta \given H_{1} \sim \Nor(m, v)$, the Bayes factor is
             $$\BF_{01i}
-            = \sqrt{1 + v/\sigma^{2}_{i}} \, \exp\left[-\frac{1}{2} \left\{\frac{(\hat{\theta}_{i} -
+            = \sqrt{1 + \frac{v}{\sigma^{2}_{i}}} \, \exp\left[-\frac{1}{2} \left\{\frac{(\hat{\theta}_{i} -
                   \theta_{n})^{2}}{\sigma^{2}_{i}} - \frac{(\hat{\theta}_{i} - m)^{2}}{\sigma^{2}_{i} + v}
               \right\}\right], ~ i \in \{o, r\}.$$
       \item Declare replication success at level $\gamma > 1$ if
@@ -952,8 +960,10 @@ power to make conclusive inferences regarding the absence of the effect.
 \section*{Acknowledgements}
 We thank the contributors of the RPCB for their tremendous efforts and for
 making their data publicly available. We thank Maya Mathur for helpful advice
-with the data preparation. This work was supported by the Swiss National Science
-Foundation (grant \href{https://data.snf.ch/grants/grant/189295}{\#189295}).
+with the data preparation. Our acknowledgment of these individuals does not
+imply their endorsement of our article. This work was supported by the Swiss
+National Science Foundation (grant
+\href{https://data.snf.ch/grants/grant/189295}{\#189295}).
 
 \section*{Conflict of interest}
 We declare no conflict of interest.
diff --git a/paper/rsabsence.pdf b/paper/rsabsence.pdf
new file mode 100644
index 0000000000000000000000000000000000000000..3b34b1a47d7a4beb6befbd68e3ffdfbc8273ba51
Binary files /dev/null and b/paper/rsabsence.pdf differ