diff --git a/exercises/UNIX_HPC_exercise_instructions.md b/exercises/UNIX_HPC_exercise_instructions.md new file mode 100644 index 0000000000000000000000000000000000000000..a470a2d39b06e6dbdaea4b00972e8e4fc412e55a --- /dev/null +++ b/exercises/UNIX_HPC_exercise_instructions.md @@ -0,0 +1,494 @@ +## Before you start + +* If you have questions about any command, you can always type `man [command]` to get an explanation and all possible additional options. + +* **Exercise solutions:** All solutions are embedded in this document and are hidden by default, but + you can reveal them by clicking on the drop-down menu, like this one: + + <details><summary><b>Exercise solution</b></summary> + This would reveal the answer... + </details> + + We encourage you to *not* look at the solution too quickly, and try to solve the exercise without it. Remember you can always ask the course teachers for help. + + +* **Exercise material:** Download the [ompA.zip](ompA.zip) archive to your local computer. Leave it zipped. + +<br> +<br> + +## Exercise 1 - Navigating the filesystem on command line + +**Objective:** get familiar with Science Cluster, get familiar with navigating the directory tree and listing the content of directories. + +1. **Print your current working directory** with the `pwd` command. This will + show you where you currently are in the directory tree. + +2. **List the contents of your current working directory** with `ls`, `ls -l` and `ls -a`. + * What do the `-l` and `-a` options do? + * **Hint:** you can use `man ls` to display the help for the `ls` command. + To exit the help, simply type `q` on your keyboard. + * *Note:* one-letter options can be grouped together, so `ls -la` is the same as `ls -l -a`. + * *Note:* some options have both a "short" and a "long" form. E.g. `ls -a` is the short form for `ls --all`. + * Science Cluster has some pre-defined "aliases" or shorthands for often-used commands. See what happens when you type `ll` or `la`. + +3. 
**Navigate to your data filesystem by typing `cd data`.** Once you are in `data/`, navigate to its parent directory with `cd ..`. + * Where are you now? + * Navigate to your data filesystem using the complete path `/data/$USER`, not the link in your home. Again, navigate to its parent directory. Where are you now? + * What happens if you now type just `cd`? + +4. **Try to run the command `cd .`**. What happens? What does the `.` stand for? + + + +<br> +<details><summary><b>Exercise solution</b></summary> +<p> + +1. Printing the current working directory: + + ```sh + pwd + ``` + +2. Listing the content of your `home` directory with different `ls` options: + + ```sh + ls # Prints the names of files and directories + ls -l # Lists the content of the current directory in "long listing" format. This + # provides additional details for each file/directory, such as + # its permissions, its size and its last modified date. + ls -a # Adding the "-a" option additionally displays hidden files and + # directories. These are files/directories whose name starts with + # a dot ".". + # Hidden files are often used to store program configurations. + ll # This alias is the same as "ls -lFh" + la # This alias is the same as "ls -lA" + ``` + + **Tip:** It is possible to define a shorthand for longer commands that you use often, a so-called `alias`. On Science Cluster, there are already some useful pre-defined aliases, among them `ll` (standing for `ls -lFh`) and `la` (standing for `ls -lA`). + +3. Navigating to your `data` filesystem and understanding links: + + ```sh + cd data/ + pwd # this shows you your location relative to the symlink you followed + cd .. 
+ pwd # you are now in /data + ll # this can take a moment + ``` + + When navigating to your `data` space through its true full path, the parent directory is `/data`. Listing this directory (with `ls` or `ll`) will show you the `data` spaces of all users on Science Cluster. However, you don't have permission to access any of them except your own! + + Typing `cd` from anywhere is a shorthand for `cd ~` or `cd /home/$USER`. + + +4. The `.` symbol is a shortcut for the current directory. So running `cd .` + has no effect since it simply changes to the same directory we are already + in. + + The `.` shortcut is useful in some situations. E.g. if you want to copy + a file to the current directory you can do `cp /file/to/copy .`, or you + can run an executable located in the current directory with `./run_me.sh`. + + + +</p> +</details> +<br> + +## Exercise 2 - Creating and moving directories and files + +**Objective:** transfer files to Science Cluster. Learn to use the `mkdir`, `cp`, `mv` and `rm` commands. + + +1. **Transfer the downloaded `ompA.zip`** from your local computer to your `home` directory on Science Cluster, using your command line/SFTP client (drag and drop). + +2. **Copy the `ompA.zip` file from your `home` to `data`.** Navigate to `data` and unzip the archive. + * *Note:* The command for unzipping is `unzip [archive]`. + * Look at the content of the extracted archive. What has happened? + +3. **Create a new directory called `intro_to_unix`**. + +4. **Move the fasta file `ompA_short_ref.fasta` into the new directory.** + * Look at the content of `intro_to_unix` without entering the directory. + +5. **Move the rest of the fasta files**, but NOT the zip archive, into `intro_to_unix`. + * *Hint:* Using the wildcard character `*` can help you. For example, all files ending with `.jpg` can be expressed by typing `*.jpg`. + * Now unzip the `ompA.zip` again as before. This time use the wildcard `*` to move *all* files into `intro_to_unix`. + +6. 
**Go to the directory `intro_to_unix` and rename the zip archive.** The new name does not matter. + +7. **Delete the renamed zip archive.** + +8. **Create a new directory called `archive` inside `intro_to_unix`.** + * Copy `ompA.zip` from your `home` directory into the new directory. + * Move the whole `archive` directory including its content to the parent directory of `intro_to_unix`. + * Now delete the directory `archive` including its content. + * *Hint:* You have to use an additional flag for the `rm` command. Look at the manual of the command and try to figure out which one. + * *Note:* Empty directories can be deleted with `rmdir` directly. + +9. **Create a symbolic link in your `home` directory to your `intro_to_unix` directory.** + + +<details><summary><b>Exercise solution</b></summary> +<p> + +1. If you have trouble with the transfer, please ask a student/tutor. + +2. Copying files: + ```sh + cp ompA.zip data/ + cd data + unzip ompA.zip # This extracts the files from the archive while also preserving the archive. + ``` + +3. Creating a new directory: + + ```sh + mkdir intro_to_unix + ``` + + **Tip:** with the `-p` option, `mkdir` also creates any missing parent directories and does not report an error if the directory already exists. This can be useful in a script. + +4. Moving files: + ```sh + mv ompA_short_ref.fasta intro_to_unix/ + ll intro_to_unix/ + ``` + + You can pass any directory or file as an argument to `ls`, and it will list exactly what you specify. You don't always need to navigate to a directory to inspect its content. + +5. Moving files using wildcards: + ```sh + mv *.fasta intro_to_unix/ # This moves all files with the ending ".fasta" + unzip ompA.zip + mv * intro_to_unix/ # It will overwrite existing files with the same name + # mv reports an error for intro_to_unix itself (a directory cannot be moved into itself); this can be ignored + ``` + +6. Renaming files: + ```sh + cd intro_to_unix/ + mv ompA.zip fasta_files.zip + ``` + +7. Deleting files: + ```sh + rm fasta_files.zip + ``` + +8. Moving and deleting directories: + ```sh + mkdir archive + cp ~/ompA.zip archive/ + mv archive/ .. # ".." refers to 
the parent directory, here the one that contains intro_to_unix. + cd .. + rm -r archive/ # the -r flag will recursively delete the directory, meaning it will also delete its content. + ``` + +9. Creating links: + ```sh + cd # Remember, this is a shorthand to get "home" + ln -s data/intro_to_unix/ unix_exercise + cd unix_exercise/ + pwd + ``` +</p> +</details> +<br> + + +## Exercise 3 - Reading and manipulating files, input and output + +**Objective:** become familiar with handling files, chaining commands and redirecting output. + +1. **Concatenate all fasta files into one file.** + +2. **Extract only the headers from your multi-fasta file.** + * Save the headers in a new file. + * *Hint:* What do these headers have in common that you could use to extract them? + +3. **Look at the content of the multi-fasta file using `head`, `tail`, `cat`, `more` and `less`** to explore the different behaviour. + * Print the top and bottom 5 lines instead of the top 10 (default). + * *Hint:* Check out the `man` pages for optional parameters. + +4. **Count the number of sequences that contain the peptide SNVYGKNHDTGVSP.** + +5. **Extract the variant numbers of the variant ompA sequences.** + +6. **Create a new file called `myOmpA`** with `touch`. + * Print the statement "This is my favourite ompA sequence:" and append it to the file. + * Append one of the ompA sequences to the file. + + +<details><summary><b>Exercise solution</b></summary> +<p> + +1. Concatenating files: + + ```sh + cat *.fasta > all_ompA.fasta # ">" overwrites the output file; ">>" would append and duplicate the sequences if you run the command twice + ``` +2. Using grep: + + ```sh + grep ">" all_ompA.fasta # Quotation marks are necessary because ">" is meant as a literal character, not as output redirection! + + grep ">" all_ompA.fasta > headers + ``` +3. Exploring files using different tools: + ```sh + head all_ompA.fasta + head -n 5 all_ompA.fasta # show only the first 5 lines + + tail all_ompA.fasta + tail -n 5 all_ompA.fasta # show only the last 5 lines + + cat all_ompA.fasta + # "cat" prints the complete file to standard output. 
+ + more all_ompA.fasta + # "more" shows the file page by page directly in your terminal. Scroll through the file with the space bar; when you reach the end of the file, you get your prompt back. + + less all_ompA.fasta + # "less" opens the file in a separate full-screen view, where you can scroll backward and forward. Type 'g' to jump to the top of the file and 'SHIFT+g' to jump to the end. Typing '/' followed by a search term highlights all matches. Type 'q' to quit. + ``` +4. Counting lines: + ```sh + grep "SNVYGKNHDTGVSP" all_ompA.fasta | wc -l # 5 sequences + ``` + *Note:* This counts the number of lines in which the pattern was found, not the number of occurrences. A peptide that is split across a line break in the fasta file would not be found. + +5. Extracting parts of a character string: + ```sh + grep "variant" all_ompA.fasta | cut -d '-' -f 3 # or use the headers file + ``` +6. Appending text and files: + ```sh + touch myOmpA + echo "This is my favourite ompA sequence:" >> myOmpA + cat ompA_variant_008.fasta >> myOmpA + ``` +</p> +</details> +<br> + + +## Exercise 4 - Variables and control structures + +**Objective:** handle variables, arrays and for loops; write a simple script. + + +1. **On the command line, create a variable `var1` containing the string "ompA".** Print the variable to standard output. + +2. **On the command line, loop through all fasta files starting with "ompA" using `var1`** and print each name to standard output. + * Once you have typed the `for` line and pressed `ENTER`, the shell will recognise the syntax and you can continue writing the for loop line by line. It will evaluate the whole command only after you have entered `done`. + +3. **In your editor, write a simple script called `copy_and_rename.sh`.** + + **Preparation:** It's easiest if you write your script locally and then transfer it to your `intro_to_unix` directory on the cluster via the SFTP function of your command line client (Termius/Notepad++). 
+ + **For BBEdit users:** You can turn on syntax highlighting either by saving the file as a .sh file or by selecting "Unix Shell Script" in the drop-down menu at the very bottom (it says "Text file" by default). + + **For Notepad++ users:** You can turn on syntax highlighting by selecting "shell" in the "Language" menu. + + * In a for loop, go through all fasta files starting with "ompA". + * Copy each fasta file. For the name of the new copy, replace "ompA" with "outerMembraneProteinA". + * Run the script from within your `intro_to_unix` folder. + * *Hint:* Think about our previous exercise where we extracted the variant number. You can use a similar approach here. + * **Tip:** Always keep in mind what exactly is stored in each of your variables. A filename? An index? Also think about what you need as input for each command. For example, do you want to extract something from within a file or from a filename? + * *Hint:* Don't forget the shebang! + +4. **Rewrite your script to make it more flexible/reusable in the future.** + * For example, it can be a good idea to define variables for things (paths, names, strings, etc.) that are used in the code but that you might want to change in the future. + +5. **On the command line, create an array that contains the numbers from 5 to 10.** + * Print out the complete array to standard output. + * Print out the 3rd element. + * Write a for loop that prints out each element. + + +6. **Modify the script `copy_and_rename.sh` further.** + * Include an array that contains the filenames of all fasta files starting with "ompA". Use it in your loop. + * Create a folder for each "variant" of ompA (e.g. `001`, `short`, etc.). + * Copy each fasta into the respective variant folder. Keep the renaming as before. + + + +<details><summary><b>Exercise solution</b></summary> +<p> + +1. Creating a variable: + ```sh + var1="ompA" + echo $var1 + ``` + +2. 
Writing a for loop: + + ```sh + for x in $var1*.fasta + do + echo $x + done + ``` + +3. Writing a simple script that copies and renames files: + ```sh + #!/bin/bash + for x in ompA*.fasta + do + # save file ending in variable + ending=$(ls $x | cut -d '_' -f 2,3) # extract 2nd and 3rd field + # copy and rename file + cp $x outerMembraneProteinA_$ending + + done + ``` + +4. To make it more reusable, you can use additional variables whose values can be easily changed and adapted to new use cases. + + ```sh + #!/bin/bash + + # Define variables prior to your actual code and use these throughout instead of hard-coded names + name1="ompA" + name2="outerMembraneProteinA" + + for x in $name1*.fasta + do + # save file ending in variable + ending=$(ls $x | cut -d '_' -f 2,3) + # copy and rename file + cp $x ${name2}_$ending + done + ``` + + You could even change the parameters of the `cut` command in the for loop to make it more general, or you could include variables for the paths of the current working directory and a target directory, and so on, but this version will suffice here. + +5. Creating an array: + ```sh + numbers=(5 6 7 8 9 10) + echo ${numbers[@]} # don't forget the curly brackets + echo ${numbers[2]} # indices start at 0, so this is the 3rd element + + # iterating through the array elements + for i in ${numbers[@]} + do + echo $i + done + # OR iterating through the array indices + for i in ${!numbers[@]} + do + echo ${numbers[$i]} + done + # OR iterating through indices defined by yourself (which you use for the array in this case) + for i in {0..5} + do + echo ${numbers[$i]} + done + + ``` + +6. Modifying the script further to use arrays: + + ```sh + #!/bin/bash + + name1="ompA" + name2="outerMembraneProteinA" + + ompAfiles=($(ls $name1*.fasta)) # remember: x=(x1 x2 x3) creates an array, y=$(command) assigns the output of a command to a variable. Here these two are combined. 
+ echo ${ompAfiles[@]} # including echos of variables can be a good sanity check for your code + + for x in ${ompAfiles[@]} + do + # save file ending in variable + ending=$(echo $x | cut -d '_' -f 2,3) + echo $ending + + # extract ompA variant + variant=$(echo $x | cut -d '_' -f 3 | cut -d '.' -f 1) # you can pipe as many commands as you like + echo $variant + + # create new directory + mkdir -p $variant # -p: no error if the folder already exists + + # copy file into new directory and rename + cp $x $variant/${name2}_$ending + + done + ``` + + +</p> +</details> +<br> + + +## Exercise 5 - Science Cluster + +**Objective:** running software from Singularity images and submitting jobs to Science Cluster + +**BEFORE WE START** +Please run the following commands: +```sh +echo "export SINGULARITY_BINDPATH=/scratch,/data,/home/$USER,/shares/amr.imm.uzh" >> $HOME/.bashrc +source $HOME/.bashrc +``` + +<br> + +1. **Create a job submission script for Science Cluster called `msa.sh`.** This script should take the ompA sequences and generate a multiple sequence alignment (MSA). In your script: + * Request the following resources: + * 4 CPUs + * 8 GB memory + * 30 min runtime + * Give the job a sensible name to identify it + * Load the `singularityce` module + * The software for generating the MSA is called `mafft`. It is provided as a Singularity image, which can be found here: `/shares/amr.imm.uzh/bioinfo/singularity/mafft_7.505--hec16e2b_0.sif`. + + `mafft` is convenient to use because it automatically recognises whether you want to align DNA or proteins. + * The command for `mafft` is as follows: + ```sh + mafft input.fasta > output.fasta + ``` + * The input fasta file is the multi-fasta file you generated in Exercise 3. + * Submit the job. + * Watch the state of the job using `squeue`. 
+ * Explore the output files `[jobname]_[jobid].out` and `[jobname]_[jobid].err` + +<details><summary><b>Exercise solution</b></summary> +<p> + +1. Writing a submission script: + + ```sh + #!/usr/bin/env bash + #SBATCH --time=00:30:00 + #SBATCH --mem=8G + #SBATCH --cpus-per-task=4 + #SBATCH --job-name=msa + #SBATCH --output=msa_%j.out + #SBATCH --error=msa_%j.err + + # load the Singularity module + module load singularityce + + # run mafft from the Singularity image on the multi-fasta file + /shares/amr.imm.uzh/bioinfo/singularity/mafft_7.505--hec16e2b_0.sif mafft all_ompA.fasta > all_ompA_msa.fasta + ``` + + +</p> +</details> +<br> +