## Phylogeny based on SNPs

* We will use three softwares to get to our trees:
1) `snippy` for variant calling from reads and generating a SNP alignment.
2) `seqtk` for manipulating our alignment.
3) `IQtree` for generating ML phylogenies.

* More info on software used:

https://github.com/tseemann/snippy

https://github.com/lh3/seqtk

http://www.iqtree.org

### Create a SNP alignment

* The required input for `snippy` is:
- a reference genome (can be multiple contigs)
- sequence reads (in our case, the Illumina trimmed, paired reads)
- specification of results folder (will be generated by `snippy`)

Perhaps: Here, we are mostly interested in analysing the samples that we know from cgMLST are associated with the outbreak:

SaH1P10622 
SaH1P30922 
SaH1P41122 
SaH1P51022 
SaH2P70722

We will call SNPs against **two different references**:
1) Against the type strain or reference strain of *S. aureus* (https://bacdive.dsmz.de/strain/14487)
2) Against the hybrid assembly of the oldest strain of our outbreak in Hospital 1.
*Note:* The sample labels are constructed as follows:

SaH#P#MMYY

Sa = *Staphyloccocus aureus* 
H# = hospital 
P# = patient 
MMYY = {month year) of collection

The oldest sample is therefore SaH1P10622.

* You can find the references here
```
/shares/amr.imm.uzh/data/projects/Blockcourse_BIO296/FS2023/Participants_data/illumina/ref
```

* The singularity image for `snippy` is here
```
/shares/amr.imm.uzh/bioinfo/singularity/snippy_4.6.0--hdfd78af_2.sif
```

* **How to run snippy:**

First you call variants for each sample running the general command (anything in [] are just place holders for your parameters):
```
# CALL VARIANTS
# for each sample
snippy --reference [reference] --R1 [forward_reads] --R2 [reverse_reads] --outdir [outdir] --report --cpus $SLURM_CPUS_PER_TASK
```

*After all samples have finished*, you can use the command `snippy-core` to generate an alignment of all core SNPs. "Core site" here refers to all sites that are present in *all* samples. If the core site is variable, it is a "core SNP" (ignoring indels). As input to `snippy-core`, all output directories from your SNP calling in the previous step need to be named.

```
# GENERATE A CORE SNP ALIGNMENT
snippy-core --prefix core [outdir] [outdir2] [outdir3] --ref [reference]
```

This command will give you various output, but among them is `core.aln` and `core.full.aln`. Both are fasta multiple sequence alignment, the first contains only the core SNPs, the other contains the full alignment of your reference and the samples inlcuding any invariant sites.

In the next step you clean up the full alignment, which will remove any potential ambiguous characters (can sometimes produce problems with tree-building tools).

```
# CLEAN UP THE ALIGNMENT
snippy-clean_full_aln core.full.aln > clean.full.aln
```

**TIP FOR JOB ARRANGEMENT**:

For this task it could make sense to write two scripts, with one being nested within the other.
For example, the SNP-calling step could be a job array script, that is called from within another script that then uses this output to generate the MSA and the tree.

For this to work, you just submit your job array script from within the other submission script as usual with `sbatch`. However, in order for the "outer" script to wait for all launched child jobs to finish, you need to specify the flag `-W`.

this snippy.sh script submits another job array script within:
```
#!/usr/bin/env bash
#SBATCH --time=10:00:00
#SBATCH --mem-per-cpu=20G
#SBATCH --cpus-per-task=8
#SBATCH --job-name=snippy
#SBATCH --output=snippy_%j.out
#SBATCH --error=snippy_%j.err

# THIS IS MY MAIN SCRIPT
command1
command2

# Submit batch job and wait for results of all child jobs
sbatch -W myarrayjob.sh

# continue with the rest of the script
command3
command4

```

### Inspect the SNP alignment

* Call SNPs for each reference. Inspect the output `snippy` gives. How many core SNPs have been called for each reference?

* Optional: You can also inspect the alignment (clean.full.aln or core.aln) using AliView (http://www.ormbunkar.se/aliview/downloads/), a lightweight alignment sequence viewer that you can install locally.

### Calculate ML phylogenies

* We will work with the clean.full.aln. First, we want the alignment without the reference. You can do that in many ways. One option is to use `seqtk`, a handy toolbox for sequence analysis. The singularity image can be found here:

```
/shares/amr.imm.uzh/bioinfo/singularity/seqtk_1.3--h7132678_4.sif
```

The command we want to use is called `subseq` which extracts all fasta sequences that are specified in an input text file (`samples.ls`):

```
seqtk subseq [in.fa] samples.ls > [out.fa]
```

In our case this `samples.ls` file would look like this:
```
SaH1P10622
SaH1P30922
SaH1P41122
SaH1P51022
SaH2P70722
```

The new MSA without reference will be the input for the tree-building software `IQtree`.

* **How to run IQtree:**

The singularity image for `IQtree` can be found here:
```
/shares/amr.imm.uzh/bioinfo/singularity/iqtree_2.2.0.3--hb97b32f_1.sif
```

The basic command structure is the following:

```
iqtree2 -s [input.fa] -B 1000 -T AUTO -m MFP --prefix [prefix]
```
This means, IQtree will do a bootstrap (`-B`) of 1000 replicates, use the best number of CPUs possible given what you have requested (`-T AUTO`), do an automatic model selection step using ModelFinder Plus (`-m MFP`) and attach the specified prefix to any of the output.

* All output from the model selection and tree calculation will be in the .log file. Which model has been used, respectively?

You can find more info on the implemented models here: http://www.iqtree.org/doc/Substitution-Models#dna-models

* The actual tree in Newik format will be stored in the .treefile.

* Inspect the phylogeny using **Figtree** - download it from https://github.com/rambaut/figtree/releases if you have not already done so. You should be able to open the .treefile directly via the SFTP feature of your terminal client.

* When opening the .treefile, Figtree will ask you to give a label to the node values. These are your bootstrap values. You can choose display them under "Node Labels".

* In figtree, root the tree using the oldest sequence.

* Play around with Figtree and the different methods of displaying your data.

* What do the different scales mean?

### I need help

Talk to the tutors, we're happy to help. Otherwise you can also find the scripts here:

* Copy the following scripts to wherever you want them:

```
cp /shares/amr.imm.uzh/data/projects/Blockcourse_BIO296/FS2023/Participants_data/illumina/02_scripts/snippy* [target_directory]

cp /shares/amr.imm.uzh/data/projects/Blockcourse_BIO296/FS2023/Participants_data/illumina/02_scripts/iqtree* [target_directory]
```

Change the variables to your needs.

### DOWNLOAD RESULTS

The complete output for the SNP calling, MSA and tree can be found here, download it into a `[target_directory]` of your choice.

In every snippy folder you will find a folder called `iqtree` containing the phylogenetic tree (the .tre file).

```bash
# Typestrain as reference
cp -r /shares/amr.imm.uzh/data/projects/Blockcourse_BIO296/FS2023/Participants_data/illumina/03_output_mrsa/snippy_ref_typestrain [target_directory]

# Hybrid assembly as reference
cp -r /shares/amr.imm.uzh/data/projects/Blockcourse_BIO296/FS2023/Participants_data/illumina/03_output_mrsa/snippy_ref_hybrid [target_directory]
```