Skip to content
Snippets Groups Projects

Phylogeny based on SNPs

Create a SNP alignment

  • The required input for snippy is:

    • a reference genome (can be multiple contigs)
    • sequence reads (in our case, the Illumina trimmed, paired reads)
    • specification of results folder (will be generated by snippy)

    Perhaps: Here, we are mostly interested in analysing the samples that we know from cgMLST are associated with the outbreak:

    SaH1P10622
    SaH1P30922
    SaH1P41122
    SaH1P51022
    SaH2P70722

    We will call SNPs against two different references:

    1. Against the type strain or reference strain of S. aureus (https://bacdive.dsmz.de/strain/14487)

    2. Against the hybrid assembly of the oldest strain of our outbreak in Hospital 1. Note: The sample labels are constructed as follows:

      SaH#P#MMYY

      Sa = Staphyloccocus aureus
      H# = hospital
      P# = patient
      MMYY = {month year) of collection

    The oldest sample is therefore SaH1P10622.

  • You can find the references here

    /shares/amr.imm.uzh/data/projects/Blockcourse_BIO296/FS2023/Participants_data/illumina/ref
  • The singularity image for snippy is here

    /shares/amr.imm.uzh/bioinfo/singularity/snippy_4.6.0--hdfd78af_2.sif
  • How to run snippy:

    First you call variants for each sample running the general command (anything in [] are just place holders for your parameters):

    # CALL VARIANTS 
    # for each sample
    snippy --reference [reference] --R1 [forward_reads] --R2 [reverse_reads] --outdir [outdir] --report --cpus $SLURM_CPUS_PER_TASK

    After all samples have finished, you can use the command snippy-core to generate an alignment of all core SNPs. "Core site" here refers to all sites that are present in all samples. If the core site is variable, it is a "core SNP" (ignoring indels). As input to snippy-core, all output directories from your SNP calling in the previous step need to be named.

    # GENERATE A CORE SNP ALIGNMENT
    snippy-core --prefix core [outdir] [outdir2] [outdir3] --ref [reference]

    This command will give you various output, but among them is core.aln and core.full.aln. Both are fasta multiple sequence alignment, the first contains only the core SNPs, the other contains the full alignment of your reference and the samples inlcuding any invariant sites.

    In the next step you clean up the full alignment, which will remove any potential ambiguous characters (can sometimes produce problems with tree-building tools).

    # CLEAN UP THE ALIGNMENT
    snippy-clean_full_aln core.full.aln > clean.full.aln

    TIP FOR JOB ARRANGEMENT:

    For this task it could make sense to write two scripts, with one being nested within the other. For example, the SNP-calling step could be a job array script, that is called from within another script that then uses this output to generate the MSA and the tree.

    For this to work, you just submit your job array script from within the other submission script as usual with sbatch. However, in order for the "outer" script to wait for all launched child jobs to finish, you need to specify the flag -W.

    this snippy.sh script submits another job array script within:

    #!/usr/bin/env bash
    #SBATCH --time=10:00:00
    #SBATCH --mem-per-cpu=20G
    #SBATCH --cpus-per-task=8
    #SBATCH --job-name=snippy
    #SBATCH --output=snippy_%j.out
    #SBATCH --error=snippy_%j.err
    
    # THIS IS MY MAIN SCRIPT
    command1
    command2
    
    # Submit batch job and wait for results of all child jobs
    sbatch -W myarrayjob.sh
    
    # continue with the rest of the script 
    command3
    command4
    

Inspect the SNP alignment

  • Call SNPs for each reference. Inspect the output snippy gives. How many core SNPs have been called for each reference?

  • Optional: You can also inspect the alignment (clean.full.aln or core.aln) using AliView (http://www.ormbunkar.se/aliview/downloads/), a lightweight alignment sequence viewer that you can install locally.


Calculate ML phylogenies

  • We will work with the clean.full.aln. First, we want the alignment without the reference. You can do that in many ways. One option is to use seqtk, a handy toolbox for sequence analysis. The singularity image can be found here:

    /shares/amr.imm.uzh/bioinfo/singularity/seqtk_1.3--h7132678_4.sif 

    The command we want to use is called subseq which extracts all fasta sequences that are specified in an input text file (samples.ls):

    seqtk subseq [in.fa] samples.ls > [out.fa]

    In our case this samples.ls file would look like this:

    SaH1P10622
    SaH1P30922
    SaH1P41122
    SaH1P51022
    SaH2P70722

    The new MSA without reference will be the input for the tree-building software IQtree.

  • How to run IQtree:

    The singularity image for IQtree can be found here:

    /shares/amr.imm.uzh/bioinfo/singularity/iqtree_2.2.0.3--hb97b32f_1.sif

    The basic command structure is the following:

    iqtree2 -s [input.fa] -B 1000 -T AUTO -m MFP --prefix [prefix]

    This means, IQtree will do a bootstrap (-B) of 1000 replicates, use the best number of CPUs possible given what you have requested (-T AUTO), do an automatic model selection step using ModelFinder Plus (-m MFP) and attach the specified prefix to any of the output.

  • All output from the model selection and tree calculation will be in the .log file. Which model has been used, respectively?

    You can find more info on the implemented models here: http://www.iqtree.org/doc/Substitution-Models#dna-models

  • The actual tree in Newik format will be stored in the .treefile.

  • Inspect the phylogeny using Figtree - download it from https://github.com/rambaut/figtree/releases if you have not already done so. You should be able to open the .treefile directly via the SFTP feature of your terminal client.

  • When opening the .treefile, Figtree will ask you to give a label to the node values. These are your bootstrap values. You can choose display them under "Node Labels".

  • In figtree, root the tree using the oldest sequence.

  • Play around with Figtree and the different methods of displaying your data.

  • What do the different scales mean?

I need help

Talk to the tutors, we're happy to help. Otherwise you can also find the scripts here:

  • Copy the following scripts to wherever you want them:

    cp /shares/amr.imm.uzh/data/projects/Blockcourse_BIO296/FS2023/Participants_data/illumina/02_scripts/snippy* [target_directory]
    
    cp /shares/amr.imm.uzh/data/projects/Blockcourse_BIO296/FS2023/Participants_data/illumina/02_scripts/iqtree* [target_directory]

Change the variables to your needs.

DOWNLOAD RESULTS

The complete output for the SNP calling, MSA and tree can be found here, download it into a [target_directory] of your choice.

In every snippy folder you will find a folder called iqtree containing the phylogenetic tree (the .tre file).

# Typestrain as reference
cp -r /shares/amr.imm.uzh/data/projects/Blockcourse_BIO296/FS2023/Participants_data/illumina/03_output_mrsa/snippy_ref_typestrain [target_directory]

# Hybrid assembly as reference
cp -r /shares/amr.imm.uzh/data/projects/Blockcourse_BIO296/FS2023/Participants_data/illumina/03_output_mrsa/snippy_ref_hybrid [target_directory]