-
Fanny Wegner authoredFanny Wegner authored
Phylogeny based on SNPs
-
We will use three softwares to get to our trees:
-
snippy
for variant calling from reads and generating a SNP alignment. -
seqtk
for manipulating our alignment. -
IQtree
for generating ML phylogenies.
-
-
More info on software used:
Create a SNP alignment
-
The required input for
snippy
is:- a reference genome (can be multiple contigs)
- sequence reads (in our case, the Illumina trimmed, paired reads)
- specification of results folder (will be generated by
snippy
)
Perhaps: Here, we are mostly interested in analysing the samples that we know from cgMLST are associated with the outbreak:
SaH1P10622
SaH1P30922
SaH1P41122
SaH1P51022
SaH2P70722We will call SNPs against two different references:
-
Against the type strain or reference strain of S. aureus (https://bacdive.dsmz.de/strain/14487)
-
Against the hybrid assembly of the oldest strain of our outbreak in Hospital 1. Note: The sample labels are constructed as follows:
SaH#P#MMYY
Sa = Staphyloccocus aureus
H# = hospital
P# = patient
MMYY = {month year) of collection
The oldest sample is therefore SaH1P10622.
-
You can find the references here
/shares/amr.imm.uzh/data/projects/Blockcourse_BIO296/FS2023/Participants_data/illumina/ref
-
The singularity image for
snippy
is here/shares/amr.imm.uzh/bioinfo/singularity/snippy_4.6.0--hdfd78af_2.sif
-
How to run snippy:
First you call variants for each sample running the general command (anything in [] are just place holders for your parameters):
# CALL VARIANTS # for each sample snippy --reference [reference] --R1 [forward_reads] --R2 [reverse_reads] --outdir [outdir] --report --cpus $SLURM_CPUS_PER_TASK
After all samples have finished, you can use the command
snippy-core
to generate an alignment of all core SNPs. "Core site" here refers to all sites that are present in all samples. If the core site is variable, it is a "core SNP" (ignoring indels). As input tosnippy-core
, all output directories from your SNP calling in the previous step need to be named.# GENERATE A CORE SNP ALIGNMENT snippy-core --prefix core [outdir] [outdir2] [outdir3] --ref [reference]
This command will give you various output, but among them is
core.aln
andcore.full.aln
. Both are fasta multiple sequence alignment, the first contains only the core SNPs, the other contains the full alignment of your reference and the samples inlcuding any invariant sites.In the next step you clean up the full alignment, which will remove any potential ambiguous characters (can sometimes produce problems with tree-building tools).
# CLEAN UP THE ALIGNMENT snippy-clean_full_aln core.full.aln > clean.full.aln
TIP FOR JOB ARRANGEMENT:
For this task it could make sense to write two scripts, with one being nested within the other. For example, the SNP-calling step could be a job array script, that is called from within another script that then uses this output to generate the MSA and the tree.
For this to work, you just submit your job array script from within the other submission script as usual with
sbatch
. However, in order for the "outer" script to wait for all launched child jobs to finish, you need to specify the flag-W
.this snippy.sh script submits another job array script within:
#!/usr/bin/env bash #SBATCH --time=10:00:00 #SBATCH --mem-per-cpu=20G #SBATCH --cpus-per-task=8 #SBATCH --job-name=snippy #SBATCH --output=snippy_%j.out #SBATCH --error=snippy_%j.err # THIS IS MY MAIN SCRIPT command1 command2 # Submit batch job and wait for results of all child jobs sbatch -W myarrayjob.sh # continue with the rest of the script command3 command4
Inspect the SNP alignment
-
Call SNPs for each reference. Inspect the output
snippy
gives. How many core SNPs have been called for each reference? -
Optional: You can also inspect the alignment (clean.full.aln or core.aln) using AliView (http://www.ormbunkar.se/aliview/downloads/), a lightweight alignment sequence viewer that you can install locally.
Calculate ML phylogenies
-
We will work with the clean.full.aln. First, we want the alignment without the reference. You can do that in many ways. One option is to use
seqtk
, a handy toolbox for sequence analysis. The singularity image can be found here:/shares/amr.imm.uzh/bioinfo/singularity/seqtk_1.3--h7132678_4.sif
The command we want to use is called
subseq
which extracts all fasta sequences that are specified in an input text file (samples.ls
):seqtk subseq [in.fa] samples.ls > [out.fa]
In our case this
samples.ls
file would look like this:SaH1P10622 SaH1P30922 SaH1P41122 SaH1P51022 SaH2P70722
The new MSA without reference will be the input for the tree-building software
IQtree
. -
How to run IQtree:
The singularity image for
IQtree
can be found here:/shares/amr.imm.uzh/bioinfo/singularity/iqtree_2.2.0.3--hb97b32f_1.sif
The basic command structure is the following:
iqtree2 -s [input.fa] -B 1000 -T AUTO -m MFP --prefix [prefix]
This means, IQtree will do a bootstrap (
-B
) of 1000 replicates, use the best number of CPUs possible given what you have requested (-T AUTO
), do an automatic model selection step using ModelFinder Plus (-m MFP
) and attach the specified prefix to any of the output. -
All output from the model selection and tree calculation will be in the .log file. Which model has been used, respectively?
You can find more info on the implemented models here: http://www.iqtree.org/doc/Substitution-Models#dna-models
-
The actual tree in Newik format will be stored in the .treefile.
-
Inspect the phylogeny using Figtree - download it from https://github.com/rambaut/figtree/releases if you have not already done so. You should be able to open the .treefile directly via the SFTP feature of your terminal client.
-
When opening the .treefile, Figtree will ask you to give a label to the node values. These are your bootstrap values. You can choose display them under "Node Labels".
-
In figtree, root the tree using the oldest sequence.
-
Play around with Figtree and the different methods of displaying your data.
-
What do the different scales mean?
I need help
Talk to the tutors, we're happy to help. Otherwise you can also find the scripts here:
-
Copy the following scripts to wherever you want them:
cp /shares/amr.imm.uzh/data/projects/Blockcourse_BIO296/FS2023/Participants_data/illumina/02_scripts/snippy* [target_directory] cp /shares/amr.imm.uzh/data/projects/Blockcourse_BIO296/FS2023/Participants_data/illumina/02_scripts/iqtree* [target_directory]
Change the variables to your needs.
DOWNLOAD RESULTS
The complete output for the SNP calling, MSA and tree can be found here, download it into a [target_directory]
of your choice.
In every snippy folder you will find a folder called iqtree
containing the phylogenetic tree (the .tre file).
# Typestrain as reference
cp -r /shares/amr.imm.uzh/data/projects/Blockcourse_BIO296/FS2023/Participants_data/illumina/03_output_mrsa/snippy_ref_typestrain [target_directory]
# Hybrid assembly as reference
cp -r /shares/amr.imm.uzh/data/projects/Blockcourse_BIO296/FS2023/Participants_data/illumina/03_output_mrsa/snippy_ref_hybrid [target_directory]