diff --git a/paper/rsabsence.Rnw b/paper/rsabsence.Rnw
index 9322348b5bd1dd0cc099dc6e66ee0b641e652fb6..35f4249e0a427adbcc1d1c37bf20dafefaad34c8 100755
--- a/paper/rsabsence.Rnw
+++ b/paper/rsabsence.Rnw
@@ -48,7 +48,7 @@ opts_chunk$set(fig.height = 4,
                eval = TRUE)
 
 ## should sessionInfo be printed at the end?
-Reproducibility <- TRUE
+Reproducibility <- FALSE
 
 ## packages
 library(ggplot2) # plotting
@@ -148,13 +148,12 @@ or Bayes factors \citep{Kass1995}, should be used from the outset.
 % two systematic reviews that I found which show that animal studies are very
 % much underpowered on average \citep{Jennions2003,Carneiro2018}
 
-The contextualization of null results becomes even more complicated in the
+The contextualization\todo{replace contextualization with interpretation?} of null results becomes even more complicated in the
 setting of replication studies. In a replication study, researchers attempt to
 repeat an original study as closely as possible in order to assess whether
-similar results can be obtained with new data \citep{NSF2019}. There have been
-various large-scale replication projects in the biomedical and social sciences
-in the last decade \citep[among
-others]{Prinz2011,Begley2012,Klein2014,Opensc2015,Camerer2016,Camerer2018,Klein2018,Cova2018,Errington2021}.
+similar\todo{replace similar with consistent?} results can be obtained with new data \citep{NSF2019}. In the last decade, various large-scale replication projects have been conducted in fields ranging from the biomedical to the social sciences
+\citep[among
+others]{Prinz2011,Begley2012,Klein2014,Opensc2015,Camerer2016,Camerer2018,Klein2018,Cova2018,Errington2021}. \todo{changed sentence so as not to assume that there were only projects in the biomedical and social sciences; there might be more, or more in the pipeline}
 Most of these projects reported alarmingly low replicability rates across a
 broad spectrum of criteria for quantifying replicability. While most of these
 projects restricted their focus on original studies with statistically
@@ -165,7 +164,7 @@ significant results (``positive results''), the \emph{Reproducibility Project:
 also attempted to replicate some original studies with null results.
 
 The RPP excluded the original null results from its overall assessment of
-replication success, but the RPCB and the RPEP explicitly defined null results
+replication success (\textit{i.e.}, the proportion of successful replications\todo{added by me, can be deleted again}), but the RPCB and the RPEP explicitly defined null results
 in both the original and the replication study as a criterion for ``replication
 success''. There are several logical problems with this ``non-significance''
 criterion. First, if the original study had low statistical power, a
@@ -179,7 +178,8 @@ replication is clearer. However, the criterion does not distinguish between
 these two cases. Second, with this criterion researchers can virtually always
 achieve replication success by conducting two studies with very small sample
 sizes, such that the \textit{p}-values are non-significant and the results are
-inconclusive. This is because the null hypothesis under which the \textit{p}-values are
+inconclusive. \todo{I find the ``Second, ...'' argument a bit unnecessary for our cause, also because in a replication setting one typically does not design the original study (to have low power). Instead I would directly write: ``Second, if the goal of inference is to quantify the
+evidence for the absence of an effect, the null hypothesis under which the \textit{p}-values are computed is misaligned with the goal.''} This is because the null hypothesis under which the \textit{p}-values are
 computed is misaligned with the goal of inference, which is to quantify the
 evidence for the absence of an effect. We will discuss methods that are better
 aligned with this inferential goal. % in Section~\ref{sec:methods}.
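+% A minimal sketch of the small-sample issue, using hypothetical numbers (the
+% per-group sample size, true effect size, and number of simulations are
+% placeholders, not RPCB values):
+<<"sketch-small-samples", echo = TRUE, eval = FALSE>>=
+## With tiny samples, two-sided p-values are rarely significant even when a
+## non-zero effect exists, so two such "null results" would count as
+## replication success under the non-significance criterion
+set.seed(42)
+n <- 3       # hypothetical sample size per group
+theta <- 0.5 # hypothetical true standardized mean difference
+pvals <- replicate(10000, t.test(rnorm(n, mean = theta), rnorm(n))$p.value)
+mean(pvals > 0.05) # proportion of non-significant results despite an effect
+@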
@@ -308,17 +308,16 @@ ggplot(data = plotDF1) +
 
 Figure~\ref{fig:2examples} shows standardized mean difference effect estimates
 with \Sexpr{round(100*conflevel, 2)}\% confidence intervals from two RPCB study
-pairs. Both are ``null results'' and meet the non-significance criterion for
+pairs. In both pairs, the original and the replication study show ``null results'' and therefore meet the non-significance criterion for
 replication success (the two-sided \textit{p}-values are greater than 0.05 in both the
-original and the replication study), but intuition would suggest that these two
-pairs are very much different.
+original and the replication study). However, intuition would suggest that the conclusions in the two pairs are very different.
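+% A minimal sketch of this intuition with hypothetical estimates and standard
+% errors (placeholders, not the actual RPCB data): both results are
+% non-significant, yet their confidence intervals tell very different stories.
+<<"sketch-two-nulls", echo = TRUE, eval = FALSE>>=
+## Wald confidence interval and two-sided p-value for an effect estimate
+smdSummary <- function(est, se, level = 0.95) {
+  z <- qnorm(1 - (1 - level)/2)
+  c(estimate = est, lower = est - z*se, upper = est + z*se,
+    p = 2*pnorm(-abs(est/se)))
+}
+smdSummary(est = 1.5, se = 1)     # large estimate, wide interval: inconclusive
+smdSummary(est = 0.05, se = 0.15) # small estimate, narrow interval:
+                                  # compatible with an absent effect
+@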
 
 The original study from \citet{Dawson2011} and its replication both show large
 effect estimates in magnitude, but due to the small sample sizes, the
-uncertainty of these estimates is very large, too. If the sample sizes of the
+uncertainty of these estimates is large, too. If the sample sizes of the
 studies were larger and the point estimates remained the same, intuitively both
-studies would provide evidence for a non-zero effect. However, with the samples
-sizes that were actually used, the results seem inconclusive. In contrast, the
+studies would provide evidence for a non-zero effect\todo{Does this sentence add much information? I would delete it and start the next one with ``With such low sample sizes, ...''}. However, with the sample
+sizes that were actually used, the results are inconclusive. In contrast, the
 effect estimates from \citet{Goetz2011} and its replication are much smaller in
 magnitude and their uncertainty is also smaller because the studies used larger
 sample sizes. Intuitively, these studies seem to provide some evidence for a
@@ -331,7 +330,7 @@ discuss how the two can be quantitatively distinguished.
 \label{sec:methods}
 There are both frequentist and Bayesian methods that can be used for assessing
 evidence for the absence of an effect. \citet{Anderson2016} provide an excellent
-summary of both approaches in the context of replication studies in psychology.
+summary of such methods in the context of replication studies in psychology.
 We now briefly discuss two possible approaches -- frequentist equivalence
 testing and Bayesian hypothesis testing -- and their application to the RPCB
 data.
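+% A minimal sketch of both approaches for a single standardized mean
+% difference, using hypothetical inputs (estimate, standard error, equivalence
+% margin, and prior standard deviation are placeholders, not the settings used
+% for the RPCB data); the Bayes factor shown is one simple normal-normal
+% formulation:
+<<"sketch-equivalence-bf", echo = TRUE, eval = FALSE>>=
+## Equivalence test (two one-sided tests): both one-sided p-values against
+## the margins -Delta and +Delta must be below alpha to declare equivalence
+tost <- function(est, se, margin, alpha = 0.05) {
+  pLower <- pnorm((est + margin)/se, lower.tail = FALSE) # H0: theta <= -margin
+  pUpper <- pnorm((est - margin)/se)                      # H0: theta >= +margin
+  c(pEquivalence = max(pLower, pUpper),
+    equivalent = as.numeric(max(pLower, pUpper) < alpha))
+}
+## Bayes factor comparing H0: theta = 0 against H1: theta ~ N(0, priorSD^2),
+## based on a normal likelihood for the effect estimate
+bf01 <- function(est, se, priorSD) {
+  dnorm(est, mean = 0, sd = se) /
+    dnorm(est, mean = 0, sd = sqrt(se^2 + priorSD^2))
+}
+tost(est = 0.05, se = 0.15, margin = 0.3) # hypothetical margin Delta = 0.3
+bf01(est = 0.05, se = 0.15, priorSD = 1)  # BF01 > 1 favours the null
+@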
@@ -343,8 +342,8 @@ Equivalence testing was developed in the context of clinical trials to assess
 whether a new treatment -- typically cheaper or with fewer side effects than the
 established treatment -- is practically equivalent to the established treatment
 \citep{Wellek2010}. The method can also be used to assess
-whether an effect is practically equivalent to the value of an absent effect,
-usually zero. Using equivalence testing as a remedy for non-significant results
+whether an effect is practically equivalent to the value of an absent effect\todo{change to ``practically equivalent to an absent effect, usually zero''? meaning without the ``the value of''},
+usually zero. Using equivalence testing as a remedy for non-significant results\todo{``as a way to deal with / handle non-significant results'', because it is not a remedy in the sense of an intervention against non-significant results.}
 has been suggested by several authors \citep{Hauck1986, Campbell2018}. The main
 challenge is to specify the margin $\Delta > 0$ that defines an equivalence
 range $[-\Delta, +\Delta]$ in which an effect is considered as absent for