Upload New File

Commit 9e5535ef authored 2 years ago by Fanny Wegner
--- a/exercises/UNIX_HPC_exercise_instructions.md
+++ b/exercises/UNIX_HPC_exercise_instructions.md
+## Before you start
+* If you have questions about any command, you can always type `man [command]` to get an explanation and all possible additional options. 
+* **Exercise solutions:** All solutions are embedded in this document and are hidden by default,  but
+  you can reveal them by clicking on the drop-down menu, like this one:
+  <details><summary><b>Exercise solution</b></summary>
+  This would reveal the answer...
+  </details>
+   We encourage you to *not* look at the solution too quickly, and try to solve the exercise without it. Remember you can always ask the course teachers for help. 
+* **Exercise material:** Download the [ompA.zip](ompA.zip) archive to your local computer. Leave it zipped. 
+<br>
+<br>
+## Exercise 1 - Navigating the filesystem on command line
+**Objective:** get familiar with Science Cluster, get familiar with navigating the directory tree and listing the content of directories.
+1. **Print your current working directory** with the `pwd` command. This will
+   show you where you currently are in the directory tree.
+2. **List the contents of your current working directory** with `ls`, `ls -l` and `ls -a`.
+   * What do the `-l` and `-a` options do?
+   * **Hint:** you can use `man ls` to display the help for the `ls` command.
+   To exit the help, simply type `q` on your keyboard.
+   * *Note:* one-letter options can be grouped together, so `ls -la` is the same as `ls -l -a`.
+   * *Note:* some options have both a "short" and a "long" form. E.g. `ls -a` is the short form for `ls --all`.
+   * Science Cluster has some pre-defined "aliases" or shorthands for often-used commands. See what happens when you type `ll` or `la`. 
+3. **Navigate to your data filesystem by typing `cd data`.** Once you are in `data/`, navigate to its parent directory with `cd ..`. 
+   * Where are you now?
+   * Navigate to your data filesystem using the complete path `/data/$USER`, not the link in your home. Again, navigate to its parent directory. Where are you now?
+   * What happens if you now type just `cd`?
+4. **Try to run the command `cd .`**. What happens? What does the `.` stand for?
+<br>
+<details><summary><b>Exercise solution</b></summary>
+<p>
+1. Printing the current working directory:
+    ```sh
+    pwd
+    ```
+2. Listing the content of your `home` directory with different `ls` options:
+   ```sh
+   ls       # Prints the names of files and directories
+   ls -l    # Lists the content of the current directory in "long listing" format. This
+            # provides additional details for each file/directory, such as
+            # its permissions, its size and its last modified date.
+   ls -a    # Adding the "-a" option additionally displays hidden files and
+            # directories. These are files/directories whose name starts with
+            # a dot ".".
+            # Hidden files are often used to store program configurations.
+   ll       # This alias is the same as "ls -lFh"
+   la       # This alias is the same as "ls -lA"
+   ```
+   **Tip:** You can define a shorthand, a so-called `alias`, for longer commands that you use often. Science Cluster already provides some useful pre-defined aliases, among them `ll` (short for `ls -lFh`) and `la` (short for `ls -lA`).
+3. Navigating to your `data` filesystem and understanding links:
+   ```sh
+   cd data/
+   pwd         # this shows you your location relative to the symlink you followed
+   cd ..
+   pwd         # you are back to your home
+   ```
+   When navigating to a directory through a link, your path through the tree of the filesystem will reflect that and `cd ..` will bring you back to your `home`. 
+   ```sh
+   cd /data/$USER
+   pwd      # this shows you the actual location /data/$USER
+   cd ..
+   pwd      # you are now in /data
+   ll       # this can take a moment   
+   ```
+   When navigating to your `data` space through its true full path, the parent directory is `/data`. Typing `ls` here will show you the `data` spaces of all users on Science Cluster. However, you don't have permission to access any data space other than your own!
+   Typing `cd` from anywhere is a shorthand for `cd ~` or `cd /home/$USER`. 
+4. The `.` symbol is a shortcut for the current directory. So running `cd .`
+   has no effect since it simply changes to the same directory we are already
+   in.
+   The `.` shortcut is useful in some situations. E.g. if you want to copy
+   a file to the current directory you can do `cp /file/to/copy .`, or you
+   can run an executable located in the current directory with `./run_me.sh`.
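
The difference between the path through a link and the true location (solution 3) can be reproduced anywhere with a throwaway link. The following is a minimal sketch, not part of the exercise: `demo_target` and `demo_link` are hypothetical names created in a temporary directory, and `pwd -P` is used to print the physical path with links resolved.

```sh
#!/bin/bash
# Sketch: logical vs. physical paths when following a symbolic link.
# "demo_target" and "demo_link" are hypothetical names in a scratch directory.
workdir=$(cd "$(mktemp -d)" && pwd -P)   # resolve links in the temp path itself
mkdir "$workdir/demo_target"
ln -s "$workdir/demo_target" "$workdir/demo_link"

cd "$workdir/demo_link"
logical=$(pwd)        # the path as you typed it, through the link
physical=$(pwd -P)    # the real location, with the link resolved
echo "logical:  $logical"
echo "physical: $physical"

cd ..                 # the shell steps back along the logical path
after_cd=$(pwd)
echo "after cd ..: $after_cd"
```

Here `cd ..` returns to the directory that contains the link, not to the parent of the link's target, which mirrors what happens with `data` in your `home`.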
+</p>
+</details>
+<br>
+## Exercise 2 - Creating and moving directories and files
+**Objective:** transfer files to Science Cluster. Learn to use the `mkdir`, `cp`, `mv` and `rm` commands.
+1. **Transfer the downloaded `ompA.zip`** from your local computer with your command line/SFTP client to your `home` directory (drag and drop) on Science Cluster.
+2. **Copy the `ompA.zip` file from your `home` to `data`.** Navigate to `data` and unzip the archive. 
+   * *Note:* The command for unzipping is `unzip [archive]`. 
+   * Look at the content of the extracted archive. What has happened?
+3. **Create a new directory called `intro_to_unix`**. 
+4. **Move the fasta file `ompA_short_ref.fasta` into the new directory.** 
+   * Look at the content of `intro_to_unix` without entering the directory. 
+5. **Move the rest of the fasta files**, but NOT the zip archive into `intro_to_unix`. 
+   * *Hint:* Using the wildcard character `*` can help you. For example, all files ending with `.jpg` can be expressed by typing `*.jpg`. 
+   * Now unzip the `ompA.zip` again as before. This time use the wildcard `*` to move *all* files into `intro_to_unix`.
+6. **Go to the directory `intro_to_unix` and rename the zip archive.** The new name does not matter. 
+7. **Delete the renamed zip archive.**
+8. **Create a new directory called `archive` inside `intro_to_unix`.**
+   * Copy `ompA.zip` from your `home` directory into the new directory. 
+   * Move the whole `archive` directory including its content to the parent directory of `intro_to_unix`.
+   * Now delete the directory `archive` including its content. 
+   * *Hint:* You have to use an additional flag for the `rm` command. Look at the manual of the command and try to figure out which one.  
+   * *Note:* Empty directories can be deleted with `rmdir` directly. 
+9. **Create a symbolic link in your `home` directory to your `intro_to_unix` directory.**
+<details><summary><b>Exercise solution</b></summary>
+<p>
+1. If you have trouble with the transfer, please ask a student/tutor. 
+2. Copying files:
+   ```sh
+   cp ompA.zip data/
+   cd data
+   unzip ompA.zip       # This extracts the files from the archive while also preserving the archive. 
+   ```
+3. Creating a new directory:
+   ```sh
+   mkdir intro_to_unix
+   ```
+   **Tip:** with the `-p` option, `mkdir` also creates any missing parent directories and does not complain if the directory already exists. This can be useful in a script. 
+4. Moving files:
+   ```sh
+   mv ompA_short_ref.fasta intro_to_unix/
+   ll intro_to_unix/ 
+   ```
+   Giving any location or even file as a parameter to "ls" will list the files that you specify. You don't always need to navigate to a directory to inspect its content.   
+5. Moving files using wildcards:
+   ```sh
+   mv *.fasta intro_to_unix/  # This moves all files with the ending ".fasta"
+   unzip ompA.zip
+   mv * intro_to_unix/        # Overwrites existing files with the same name. mv also complains that
+                              # intro_to_unix cannot be moved into itself; this error is harmless here.
+   ```
+6. Renaming files:
+   ```sh
+   cd intro_to_unix/
+   mv ompA.zip fasta_files.zip
+   ```
+7. Deleting files:
+   ```sh
+   rm fasta_files.zip
+   ```
+8. Moving and deleting directories:
+   ```sh
+   mkdir archive
+   cp ~/ompA.zip archive/
+   mv archive/ ..        # ".." is the parent directory of intro_to_unix
+   cd ..
+   rm -r archive/        # the -r flag deletes the directory recursively, i.e. including its content
+   ```
+9. Creating links:
+   ```sh
+   cd          # Remember, this is a shorthand to get "home"
+   ln -s data/intro_to_unix/ unix_exercise
+   cd unix_exercise/
+   pwd
+   ```
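
The `mkdir -p` tip and the remark that `ls` accepts a location as a parameter can be tried out safely in a scratch directory. This is a small sketch under assumptions: the directory names `results/run1/logs` are hypothetical and not part of the exercise material.

```sh
#!/bin/bash
# Sketch: behaviour of "mkdir -p" (hypothetical directory names).
scratch=$(mktemp -d)
cd "$scratch"

mkdir -p results/run1/logs   # creates results/, results/run1/ and results/run1/logs/
mkdir -p results/run1/logs   # second call: the directory exists, still no error
status=$?
echo "exit status of the second call: $status"

ls results/run1              # inspect a directory without entering it
```

Without `-p`, the first call would fail because `results/` does not exist yet, and the second call would fail because the directory is already there.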
+</p>
+</details>
+<br>
+## Exercise 3 - Reading and manipulating files, input and output
+**Objective:** become familiar with handling files, chaining commands and redirecting output
+1. **Concatenate all fasta files into one file.**
+2. **Extract only the headers from your multi-fasta file.**
+   * Save the header in a new file. 
+   * *Hint:* What do these headers have in common that you could use to extract them?
+3. **Look at the content of the multi-fasta file using `head`, `tail`, `cat`, `more` and `less`** to explore the different behaviour. 
+   * Print the top and bottom 5 lines instead of the top 10 (default). 
+   * *Hint:* Check out the `man` pages for optional parameters. 
+4. **Count the number of sequences that contain the peptide SNVYGKNHDTGVSP.**
+5. **Extract the variant numbers of the variant ompA sequences.**
+6. **Create a new file called `myOmpA`** with `touch`. 
+   * Print the statement "This is my favourite ompA sequence:" and append it to the file. 
+   * Append one of the ompA sequences to the file.
+<details><summary><b>Exercise solution</b></summary> 
+<p>
+1. Concatenating files:
+   ```sh
+   cat *.fasta > all_ompA.fasta     # ">" creates (or overwrites) the output file, so rerunning does not duplicate sequences
+   ```
+2. Using grep:
+   ```sh
+   grep ">" all_ompA.fasta       # Quotation marks are necessary: without them the shell would treat > as output redirection instead of a character to search for!
+   grep ">" all_ompA.fasta > headers
+   ```
+3. Exploring files using different tools:
+   ```sh
+   head all_ompA.fasta
+   head -n 5 all_ompA.fasta       # show only the first 5 lines
+   tail all_ompA.fasta
+   tail -n 5 all_ompA.fasta
+   cat all_ompA.fasta
+   # "cat" prints the complete file to standard output. 
+   more all_ompA.fasta 
+   # With "more" you stay in your command line. You scroll through the file with the space bar; when you reach the end of the file, you get your prompt back.
+   less all_ompA.fasta
+   # With "less" the file opens in a separate view, apart from your command line, and you can move backward and forward. Type 'g' to jump to the top of the file and 'SHIFT+g' to jump to the end. Typing '/' followed by a search term highlights all matches. Type 'q' to quit.
+   ```
+4. Counting lines:
+   ```sh
+   grep "SNVYGKNHDTGVSP" all_ompA.fasta | wc -l        # 5 sequences
+   ```
+   *Note:* This counts the number of lines in which the pattern was found, not the number of occurrences. 
+5. Extracting parts of a character string:
+   ```sh
+   grep "variant" all_ompA.fasta | cut -d '-' -f 3       # or use the headers file
+   ```
+6. Concatenating input and files:
+   ```sh
+   touch myOmpA
+   echo "This is my favourite ompA sequence:" >> myOmpA
+   cat ompA_variant_008.fasta >> myOmpA
+   ```
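
The note in solution 4 about lines versus occurrences can be checked on a tiny made-up file. This is a sketch only: the three demo lines are invented, and `grep -o` (print each match on its own line) is an extension available in GNU and BSD grep rather than something the exercise itself uses.

```sh
#!/bin/bash
# Sketch: counting matching lines vs. counting occurrences (invented demo data).
demo=$(mktemp)
printf 'AAGTAAGT\nCCCC\nAAGT\n' > "$demo"

lines=$(grep -c "AAGT" "$demo")          # number of lines containing the pattern
hits=$(grep -o "AAGT" "$demo" | wc -l)   # number of individual matches

echo "matching lines: $lines"
echo "occurrences:    $hits"
rm "$demo"
```

The demo file has two lines containing the pattern but three occurrences in total, because the first line contains it twice. `grep -c` gives the same count as `grep ... | wc -l`.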
+</p>
+</details>
+<br>
+## Exercise 4 - Variables and control structures
+**Objective:** handle variables, arrays and for loops; write a simple script. 
+1. **On the command line, create a variable `var1` containing the string "ompA".** Print the variable to standard output. 
+2. **On the command line, loop through all fasta files starting with "ompA" using `var1`** and print the name to standard output.
+   * Once you type `for` and press `ENTER`, the shell will recognise the syntax and you can continue writing the for-loop line by line. It will evaluate the whole command only after you entered `done`. 
+3. **In your editor, write a simple script called `copy_and_rename.sh`.** 
+   **Preparation:** It's easiest if you write your script locally and then transfer it via SFTP of your command line client (Termius/Notepad++) to the cluster into your `intro_to_unix` directory.
+   **For BBEdit users:** You can turn on syntax highlighting either by saving the file as a .sh file or by selecting "Unix Shell Script" in the drop-down menu at the very bottom (it says "Text file" by default). 
+   **For Notepad++ users:** You can turn on syntax highlighting by selecting "shell" in the "Language" menu. 
+   * In a for loop, go through all fasta files starting with "ompA". 
+   * Copy each fasta file. For the name of the new copy, replace "ompA" with "outerMembraneProteinA". 
+   * Run the script from within your `intro_to_unix` folder. 
+   * *Hint:* Think about our previous exercise where we extracted the variant number. You can use a similar approach here. 
+   * **Tip:** Always keep in mind what exactly is stored in your respective variables. A filename? An index? Try to also think about what you need as input for each command. For example, do you want to extract something from within a file or from a filename? 
+   * *Hint:* Don't forget the shebang! 
+4. **Rewrite your script to make it more flexible/reusable in the future.** 
+   * For example, defining variables that contain certain things (paths, names, strings etc.) that are used in the code but you might want to change in the future could be a good idea.  
+5. **On the command line, create an array that contains the numbers from 5 to 10.**
+   * Print out the complete array to standard output. 
+   * Print out the 3rd element. 
+   * Write a for loop that prints out each element. 
+6. **Modify the script `copy_and_rename.sh` further.** 
+   * Include an array that contains the filenames of all fasta files starting with "ompA". Use it in your loop. 
+   * Create a folder for each "variant" of ompA (e.g. `001`, `short`, etc.). 
+   * Copy each fasta into the respective variant folder. Keep the renaming as before.  
+<details><summary><b>Exercise solution</b></summary> 
+<p>
+1. Creating a variable:
+   ```sh
+   var1="ompA"
+   echo $var1
+   ```
+2. Writing a for loop: 
+   ```sh
+   for x in $var1*.fasta
+   do
+      echo $x
+   done
+   ```
+3. Writing a simple script that copies and renames files:
+   ```sh
+   #!/bin/bash
+   for x in ompA*.fasta
+   do
+      # save file ending in variable
+      ending=$(ls $x | cut -d '_' -f 2,3)    # extract 2nd and 3rd field
+      # copy and rename file
+      cp $x outerMembraneProteinA_$ending
+   done
+   ```
+4. To make it more reusable, you can use additional variables whose values can be easily changed and adapted to new use cases.   
+   ```sh
+   #!/bin/bash
+   # Define variables prior to your actual code and use these throughout instead of hard-coded names
+   name1="ompA"
+   name2="outerMembraneProteinA"
+   for x in $name1*.fasta
+   do
+      # save file ending in variable
+      ending=$(ls $x | cut -d '_' -f 2,3)
+      # copy and rename file
+      cp $x ${name2}_$ending
+   done
+   ```
+   You could even change the parameters of the `cut` command in the for loop to make it more general, or you could include variables for the paths of the current working directory and a target directory, and so on. This version will suffice here.  
+5. Creating an array:
+   ```sh
+   numbers=(5 6 7 8 9 10)
+   echo ${numbers[@]}      # don't forget the curly brackets
+   echo ${numbers[2]}      # indices start at 0
+   # iterating through the array elements
+   for i in ${numbers[@]}
+   do
+      echo $i
+   done
+   # OR iterating through the array indices
+   for i in ${!numbers[@]}
+   do
+      echo ${numbers[$i]}
+   done
+   # OR iterating through indices defined by yourself (which you use for the array in this case)
+   for i in {0..5}
+   do
+      echo ${numbers[$i]}
+   done
+   ```
+6. Modifying the script further to use arrays:
+   ```sh
+   #!/bin/bash
+   name1="ompA"
+   name2="outerMembraneProteinA"
+   ompAfiles=($(ls $name1*.fasta))     # remember: x=(x1 x2 x3) creates an array, y=$(command) assigns the output of a command to a variable. Here these two are combined.  
+   echo ${ompAfiles[@]}                # including echos of variables can be a good sanity check for your code
+   for x in ${ompAfiles[@]}
+   do
+      # save file ending in variable
+      ending=$(echo $x | cut -d '_' -f 2,3)
+      echo $ending    
+      # extract ompA variant 
+      variant=$(echo $x | cut -d '_' -f 3 | cut -d '.' -f 1)   # you can pipe as many commands as you like
+      echo $variant   
+      # create new directory 
+      mkdir -p $variant     # -p: no error if the folder already exists
+      # copy file into new directory and rename
+      cp $x $variant/${name2}_$ending
+   done
+   ```
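
As an alternative to piping filenames through `cut`, bash can trim strings itself with parameter expansion. The sketch below is one possible variation, not the course's solution: it only prints what the new names would be for two filenames from the archive and does not copy or create anything. The variable `stem` is introduced here purely for illustration.

```sh
#!/bin/bash
# Sketch: deriving the new name and the variant with parameter expansion.
name1="ompA"
name2="outerMembraneProteinA"

for x in ompA_variant_008.fasta ompA_short_ref.fasta   # plain strings, no files are touched
do
   ending=${x#"${name1}_"}   # remove the leading "ompA_"
   stem=${x%.fasta}          # remove the trailing ".fasta"
   variant=${stem##*_}       # keep only what follows the last "_"
   echo "$x -> ${name2}_$ending (variant: $variant)"
done
```

Quoting variables as in `"$x"` is a good habit in scripts, since it keeps filenames with spaces intact.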
+</p>
+</details>
+<br>
+## Exercise 5 - Science Cluster
+**Objective:** running software from singularity images and submitting jobs to Science Cluster
+**BEFORE WE START**
+Please run the following command:
+```sh
+echo "export SINGULARITY_BINDPATH=/scratch,/data,/home/$USER,/shares/amr.imm.uzh" >> $HOME/.bashrc
+source $HOME/.bashrc
+```
+<br>
+1. **Create a job submission script for Science Cluster called `msa.sh`.** This script shall take the ompA sequences and generate a multiple sequence alignment (MSA). In your script:  
+   * Request the following resources: 
+      * 4 CPUs
+      * 8 GB memory 
+      * 30 min runtime 
+      * Give the job a sensible name to identify it
+   * Load the `singularityce` module
+   * The software for generating the MSA is called `mafft`. It is provided as a Singularity image, which can be found here: `/shares/amr.imm.uzh/bioinfo/singularity/mafft_7.505--hec16e2b_0.sif`. 
+      `mafft` is convenient because it automatically recognises whether you want to align DNA or protein sequences. 
+   * The command for `mafft` is as follows: 
+      ```sh
+      mafft input.fasta > output.fasta
+      ``` 
+   * The input fasta file is the multi-fasta file you generated in the previous exercise. 
+   * Submit the job. 
+   * Watch the state of the job using `squeue`. 
+   * Explore the output files `[jobname].out` and `[jobname].err`
+<details><summary><b>Exercise solution</b></summary> 
+<p>
+1. Writing a submission script:
+   ```sh
+   #!/usr/bin/env bash
+   #SBATCH --time=00:30:00
+   #SBATCH --mem=8G
+   #SBATCH --cpus-per-task=4
+   #SBATCH --job-name=msa
+   #SBATCH --output=msa_%j.out
+   #SBATCH --error=msa_%j.err
+   # load the Singularity module
+   module load singularityce
+   /shares/amr.imm.uzh/bioinfo/singularity/mafft_7.505--hec16e2b_0.sif mafft all_ompA.fasta > all_ompA_msa.fasta
+   ```
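
Submitting and monitoring the job could look roughly like the sketch below. It assumes that the script is saved as `msa.sh` in the current directory and that the Slurm commands `sbatch` and `squeue` are on your path, as on a cluster login node. The variable `slurm_available` is only there so the sketch also runs, without doing anything, on a machine without Slurm.

```sh
#!/bin/bash
# Sketch: submit msa.sh and follow the job (assumes a Slurm login node).
if command -v sbatch >/dev/null 2>&1; then
   slurm_available="yes"
   # --parsable prints only the job ID (plus the cluster name on multi-cluster setups)
   jobid=$(sbatch --parsable msa.sh)
   echo "Submitted job $jobid"
   squeue -u "$USER"          # list your pending and running jobs
   # Once the job no longer appears in squeue, inspect its output files:
   # cat "msa_${jobid}.out" "msa_${jobid}.err"
else
   slurm_available="no"
   echo "sbatch not found: run these commands on the cluster's login node"
fi
```

With the `--output` and `--error` settings from the script above, `%j` is replaced by the job ID, so the files are named after the number that `sbatch` reports.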
+</p>
+</details>
+<br>