Commit 9e5535ef authored by Fanny Wegner
## Before you start
* If you have questions about any command, you can always type `man [command]` to get an explanation and all possible additional options.
* **Exercise solutions:** All solutions are embedded in this document and are hidden by default, but
you can reveal them by clicking on the drop-down menu, like this one:
<details><summary><b>Exercise solution</b></summary>
This would reveal the answer...
</details>
We encourage you to *not* look at the solution too quickly, and try to solve the exercise without it. Remember you can always ask the course teachers for help.
* **Exercise material:** Download the [ompA.zip](ompA.zip) archive to your local computer. Leave it zipped.
<br>
<br>
## Exercise 1 - Navigating the filesystem on command line
**Objective:** get familiar with Science Cluster, with navigating the directory tree and with listing the contents of directories.
1. **Print your current working directory** with the `pwd` command. This will
show you where you currently are in the directory tree.
2. **List the contents of your current working directory** with `ls`, `ls -l` and `ls -a`.
* What do the `-l` and `-a` options do?
* **Hint:** you can use `man ls` to display the help for the `ls` command.
To exit the help, simply type `q` on your keyboard.
* *Note:* one-letter options can be grouped together, so `ls -la` is the same as `ls -l -a`.
* *Note:* some options have both a "short" and a "long" form. E.g. `ls -a` is the short form for `ls --all`.
* Science Cluster has some pre-defined "aliases" or shorthands for often-used commands. See what happens when you type `ll` or `la`.
3. **Navigate to your data filesystem by typing `cd data`.** Once you are in `data/`, navigate to its parent directory with `cd ..`.
* Where are you now?
* Navigate to your data filesystem using the complete path `/data/$USER`, not the link in your home. Again, navigate to its parent directory. Where are you now?
* What happens if you now type just `cd`?
4. **Try to run the command `cd .`**. What happens? What does the `.` stand for?
<br>
<details><summary><b>Exercise solution</b></summary>
<p>
1. Printing the current working directory:
```sh
pwd
```
2. Listing the content of your `home` directory with different `ls` options:
```sh
ls # Prints the names of files and directories
ls -l # Lists the content of the current directory in "long listing" format. This
# provides additional details for each file/directory, such as
# its permissions, its size and its last modified date.
ls -a # Adding the "-a" option additionally displays hidden files and
# directories. These are files/directories whose name starts with
# a dot ".".
# Hidden files are often used to store program configurations.
ll # On Science Cluster, this alias stands for "ls -lFh"
la # This alias is the same as "ls -lA"
```
**Tip:** It is possible to define a shorthand for longer commands that you use often, a so-called `alias`. On Science Cluster, some useful aliases are already pre-defined, among them `ll` (standing for `ls -lFh`) and `la` (standing for `ls -lA`).
3. Navigating to your `data` filesystem and understanding links:
```sh
cd data/
pwd # this shows you your location relative to the symlink you followed
cd ..
pwd # you are back to your home
```
When navigating to a directory through a link, your path through the tree of the filesystem will reflect that and `cd ..` will bring you back to your `home`.
```sh
cd /data/$USER
pwd # this shows you the actual location /data/$USER
cd ..
pwd # you are now in /data
ll # this can take a moment
```
When navigating to your `data` space through its true full path, the parent directory is `/data`. Typing `ls` here will show you the `data` spaces of all users on Science Cluster. However, you don't have permission to access any data space other than your own!
Typing `cd` from anywhere is a shorthand for `cd ~` or `cd /home/$USER`.
4. The `.` symbol is a shortcut for the current directory. So running `cd .`
has no effect since it simply changes to the same directory we are already
in.
The `.` shortcut is useful in some situations. E.g. if you want to copy
a file to the current directory you can do `cp /file/to/copy .`, or you
can run an executable located in the current directory with `./run_me.sh`.
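
The difference between the path you followed and the true location can be made visible with `pwd -P`, which prints the physical path with all symbolic links resolved. The following is a minimal sketch that works in a throwaway directory instead of your real `home` and `data`; the names `real_data` and `data_link` are made up for illustration.

```sh
# Work in a temporary directory so nothing in your home is touched
# (its physical path is resolved first, in case the temp location is itself a link)
sandbox=$(cd "$(mktemp -d)" && pwd -P)
mkdir "$sandbox/real_data"                        # hypothetical target directory
ln -s "$sandbox/real_data" "$sandbox/data_link"   # hypothetical symbolic link

cd "$sandbox/data_link"
logical=$(pwd)       # path as you navigated it, through the link
physical=$(pwd -P)   # true location with the link resolved
echo "logical:  $logical"
echo "physical: $physical"

cd ..                # follows the logical path, back to the sandbox
after_cd=$(pwd)
echo "after cd ..: $after_cd"
```

On Science Cluster, the same idea should apply to the `data` link in your `home`: `pwd` shows the path through the link, `pwd -P` the location under `/data`.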
</p>
</details>
<br>
## Exercise 2 - Creating and moving directories and files
**Objective:** transfer files to Science Cluster. Learn to use the `mkdir`, `cp`, `mv` and `rm` commands.
1. **Transfer the downloaded `ompA.zip`** from your local computer to your `home` directory on Science Cluster, using the SFTP view of your command line client (drag and drop).
2. **Copy the `ompA.zip` file from your `home` to `data`.** Navigate to `data` and unzip the archive.
* *Note:* The command for unzipping is `unzip [archive]`.
* Look at the content of the extracted archive. What has happened?
3. **Create a new directory called `intro_to_unix`**.
4. **Move the fasta file `ompA_short_ref.fasta` into the new directory.**
* Look at the content of `intro_to_unix` without entering the directory.
5. **Move the rest of the fasta files**, but NOT the zip archive into `intro_to_unix`.
* *Hint:* Using the wildcard character `*` can help you. For example, all files ending with `.jpg` can be expressed by typing `*.jpg`.
* Now unzip the `ompA.zip` again as before. This time use the wildcard `*` to move *all* files into `intro_to_unix`.
6. **Go to the directory `intro_to_unix` and rename the zip archive.** The new name does not matter.
7. **Delete the renamed zip archive.**
8. **Create a new directory called `archive` inside `intro_to_unix`.**
* Copy `ompA.zip` from your `home` directory into the new directory.
* Move the whole `archive` directory including its content to the parent directory of `intro_to_unix`.
* Now delete the directory `archive` including its content.
* *Hint:* You have to use an additional flag for the `rm` command. Look at the manual of the command and try to figure out which one.
* *Note:* Empty directories can be deleted with `rmdir` directly.
9. **Create a symbolic link in your `home` directory to your `intro_to_unix` directory.**
<details><summary><b>Exercise solution</b></summary>
<p>
1. If you have trouble with the transfer, please ask a student/tutor.
2. Copying files:
```sh
cp ompA.zip data/
cd data
unzip ompA.zip # This extracts the files from the archive while also preserving the archive.
```
3. Creating a new directory:
```sh
mkdir intro_to_unix
```
**Tip:** with the `-p` option, `mkdir` also creates any missing parent directories and does not complain if the directory already exists. This can be useful in a script.
4. Moving files:
```sh
mv ompA_short_ref.fasta intro_to_unix/
ll intro_to_unix/
```
Giving any directory or file as an argument to `ls` lists exactly what you specify. You don't always need to navigate to a directory to inspect its content.
5. Moving files using wildcards:
```sh
mv *.fasta intro_to_unix/ # This moves all files with the ending ".fasta"
unzip ompA.zip
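mv * intro_to_unix/ # Moves everything, including ompA.zip, and overwrites existing files with the same name.
# "mv" prints an error for intro_to_unix itself, because a directory cannot be moved into itself.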
```
6. Renaming files:
```sh
cd intro_to_unix/
mv ompA.zip fasta_files.zip
```
7. Deleting files:
```sh
rm fasta_files.zip
```
8. Moving and deleting directories:
```sh
mkdir archive
cp ~/ompA.zip archive/
mv archive/ .. # ".." is the parent directory; it can be chained (e.g. "../..") to go further up the file tree.
cd ..
rm -r archive/ # the -r flag will recursively delete the directory, meaning it will also delete its content.
```
9. Creating links:
```sh
cd # Remember, this is a shorthand to get "home"
ln -s data/intro_to_unix/ unix_exercise
cd unix_exercise/
pwd
```
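
Because the shell expands a wildcard before the command runs, you can preview what a pattern matches with `echo` before moving anything. A minimal sketch in a throwaway directory; the files created here are empty stand-ins that only mimic the names of the exercise material.

```sh
sandbox=$(mktemp -d)
cd "$sandbox"
touch ompA_short_ref.fasta ompA_variant_001.fasta ompA.zip   # empty stand-in files
mkdir intro_to_unix

matched=$(echo *.fasta)    # the wildcard is replaced by all matching file names
echo "Would move: $matched"

mv *.fasta intro_to_unix/  # only the fasta files are moved, the zip archive stays
remaining=$(ls)
echo "Left behind: $remaining"
```

Previewing first is a cheap safeguard, since `mv` and `rm` act on every match without asking.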
</p>
</details>
<br>
## Exercise 3 - Reading and manipulating files, input and output
**Objective:** become familiar with handling files, chaining commands and redirecting output
1. **Concatenate all fasta files into one file.**
2. **Extract only the headers from your multi-fasta file.**
* Save the header in a new file.
* *Hint:* What do these headers have in common that you could use to extract them?
3. **Look at the content of the multi-fasta file using `head`, `tail`, `cat`, `more` and `less`** to explore the different behaviour.
* Print the top and bottom 5 lines instead of the top 10 (default).
* *Hint:* Check out the `man` pages for optional parameters.
4. **Count the number of sequences that contain the peptide SNVYGKNHDTGVSP.**
5. **Extract the variant numbers of the variant ompA sequences.**
6. **Create a new file called `myOmpA`** with `touch`.
* Print the statement "This is my favourite ompA sequence:" and append it to the file.
* Append one of the ompA sequences to the file.
<details><summary><b>Exercise solution</b></summary>
<p>
1. Concatenating files:
```sh
cat *.fasta > all_ompA.fasta # ">" creates or overwrites the file; ">>" would append the sequences again on every re-run
```
2. Using grep:
```sh
grep ">" all_ompA.fasta # Quotation marks are necessary because ">" is meant as a character, not a command!
grep ">" all_ompA.fasta > headers
```
3. Exploring files using different tools:
```sh
head all_ompA.fasta
head -n 5 all_ompA.fasta # show only the first 5 lines
tail all_ompA.fasta
tail -n 5 all_ompA.fasta
cat all_ompA.fasta
# "cat" prints the complete file to standard output.
more all_ompA.fasta
# with "more" you stay in your command line. You scroll through the file with the space bar; when you reach the end of the file, you get your prompt back.
less all_ompA.fasta
# with "less", the document opens in a separate view apart from your command line and you can move backward and forward. Type 'g' to jump to the top of the file and 'SHIFT+g' to jump to the end. Typing '/' followed by a search term highlights all matches. Type 'q' to quit.
```
4. Counting lines:
```sh
grep "SNVYGKNHDTGVSP" all_ompA.fasta | wc -l # 5 sequences
```
*Note:* This counts the number of lines in which the pattern was found, not the number of occurrences.
5. Extracting parts of a character string:
```sh
grep "variant" all_ompA.fasta | cut -d '-' -f 3 # or use the headers file
```
6. Concatenating input and files:
```sh
touch myOmpA
echo "This is my favourite ompA sequence:" >> myOmpA
cat ompA_variant_008.fasta >> myOmpA
```
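
As a variation on `grep ... | wc -l`, the `-c` option of `grep` prints the number of matching lines directly, and counting header lines gives the number of sequences in a multi-fasta file. The sketch below builds a tiny fasta file whose headers and sequences are invented for illustration; they are not taken from `ompA.zip`.

```sh
sandbox=$(mktemp -d)
cd "$sandbox"
# Invented two-sequence fasta file, for illustration only
printf '>seq1 example\nMKKTAIAIAV\n>seq2 example\nMKATAIAVAL\n' > toy.fasta

n_seqs=$(grep -c "^>" toy.fasta)     # lines starting with ">", i.e. number of sequences
n_hits=$(grep -c "KKTA" toy.fasta)   # lines containing the peptide "KKTA"
echo "sequences: $n_seqs, lines with KKTA: $n_hits"
```

Like the pipe to `wc -l`, this counts lines rather than occurrences, so a peptide that spans a line break in a wrapped sequence would be missed.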
</p>
</details>
<br>
## Exercise 4 - Variables and control structures
**Objective:** handle variables, arrays and for loops; write a simple script.
1. **On the command line, create a variable `var1` containing the string "ompA".** Print the variable to standard output.
2. **On the command line, loop through all fasta files starting with "ompA" using `var1`** and print the name to standard output.
* Once you type `for` and press `ENTER`, the shell will recognise the syntax and you can continue writing the for-loop line by line. It will evaluate the whole command only after you entered `done`.
3. **In your editor, write a simple script called `copy_and_rename.sh`.**
**Preparation:** It's easiest if you write your script locally and then transfer it via SFTP of your command line client (Termius/Notepad++) to the cluster into your `intro_to_unix` directory.
**For BBEdit users:** You can turn on syntax highlighting either by saving the file as a .sh file or by selecting "Unix Shell Script" in the drop-down menu at the very bottom (it says "Text file" by default).
**For Notepad++ users:** You can turn on syntax highlighting by selecting "shell" in the "Language" menu.
* In a for loop, go through all fasta files starting with "ompA".
* Copy each fasta file. For the name of the new copy, replace "ompA" with "outerMembraneProteinA".
* Run the script from within your `intro_to_unix` folder.
* *Hint:* Think about our previous exercise where we extracted the variant number. You can use a similar approach here.
* **Tip:** Always keep in mind what exactly is stored in your respective variables. A filename? An index? Try to also think about what you need as input for each command. For example, do you want to extract something from within a file or from a filename?
* *Hint:* Don't forget the shebang!
4. **Rewrite your script to make it more flexible/reusable in the future.**
* For example, defining variables that contain certain things (paths, names, strings etc.) that are used in the code but you might want to change in the future could be a good idea.
5. **On the command line, create an array that contains the numbers from 5 to 10.**
* Print out the complete array to standard output.
* Print out the 3rd element.
* Write a for loop that prints out each element.
6. **Modify the script `copy_and_rename.sh` further.**
* Include an array that contains the filenames of all fasta files starting with "ompA". Use it in your loop.
* Create a folder for each "variant" of ompA (e.g., `001`, `short`, etc.).
* Copy each fasta into the respective variant folder. Keep the renaming as before.
<details><summary><b>Exercise solution</b></summary>
<p>
1. Creating a variable:
```sh
var1="ompA"
echo $var1
```
2. Writing a for loop:
```sh
for x in $var1*.fasta
do
echo $x
done
```
3. Writing a simple script that copies and renames files:
```sh
#!/bin/bash
for x in ompA*.fasta
do
# save file ending in variable
ending=$(ls $x | cut -d '_' -f 2,3) # extract 2nd and 3rd field
# copy and rename file
cp $x outerMembraneProteinA_$ending
done
```
4. To make it more reusable, you can use additional variables whose values can be easily changed and adapted to new use cases.
```sh
#!/bin/bash
# Define variables prior to your actual code and use these throughout instead of hard-coded names
name1="ompA"
name2="outerMembraneProteinA"
for x in $name1*.fasta
do
# save file ending in variable
ending=$(ls $x | cut -d '_' -f 2,3)
# copy and rename file
cp $x ${name2}_$ending
done
```
You could even change the parameters of the `cut` command in the for loop to make it more general, or you could include variables for the paths of the current working directory and a target directory, and so on. This version will suffice here.
5. Creating an array:
```sh
numbers=(5 6 7 8 9 10)
echo ${numbers[@]} # don't forget the curly brackets
echo ${numbers[2]} # indices start at 0
# iterating through the array elements
for i in ${numbers[@]}
do
echo $i
done
# OR iterating through the array indices
for i in ${!numbers[@]}
do
echo ${numbers[$i]}
done
# OR iterating through indices that you define yourself (here they are used as indices for the array)
for i in {0..5}
do
echo ${numbers[$i]}
done
```
6. Modifying the script further to use arrays:
```sh
#!/bin/bash
name1="ompA"
name2="outerMembraneProteinA"
ompAfiles=($(ls $name1*.fasta)) # remember: x=(x1 x2 x3) creates an array, y=$(command) assigns the output of a command to a variable. Here these two are combined.
echo ${ompAfiles[@]} # including echos of variables can be a good sanity check for your code
for x in ${ompAfiles[@]}
do
# save file ending in variable
ending=$(echo $x | cut -d '_' -f 2,3)
echo $ending
# extract ompA variant
variant=$(echo $x | cut -d '_' -f 3 | cut -d '.' -f 1) # you can pipe as many commands as you like
echo $variant
# create new directory
mkdir -p $variant # -p: no error if the folder already exists
# copy file into new directory and rename
cp $x $variant/${name2}_$ending
done
```
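
The pieces of a file name can also be derived without calling `ls`, `echo` or `cut`, using the shell's built-in parameter expansion: `${x#prefix}` removes a prefix, `${x%suffix}` removes a suffix and `${x##*_}` keeps what follows the last underscore. A minimal sketch, assuming all files follow the `ompA_<something>_<variant>.fasta` naming seen in the exercise; the files created here are empty stand-ins in a throwaway directory.

```sh
sandbox=$(mktemp -d)
cd "$sandbox"
touch ompA_short_ref.fasta ompA_variant_008.fasta   # empty stand-in files

name1="ompA"
name2="outerMembraneProteinA"

for x in "$name1"*.fasta
do
    rest=${x#"$name1"}     # strip the leading "ompA", e.g. "_variant_008.fasta"
    base=${x%.fasta}       # strip the ".fasta" ending, e.g. "ompA_variant_008"
    variant=${base##*_}    # keep the part after the last "_", e.g. "008"
    mkdir -p "$variant"
    cp "$x" "$variant/$name2$rest"   # the quotes keep names with spaces intact
done
ls -R
```

Quoting the variables is a habit worth adopting early: without the quotes, a file name containing a space would be split into two arguments.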
</p>
</details>
<br>
## Exercise 5 - Science Cluster
**Objective:** running software from singularity images and submitting jobs to Science Cluster
**BEFORE WE START**
Please run the following command:
```sh
echo "export SINGULARITY_BINDPATH=/scratch,/data,/home/$USER,/shares/amr.imm.uzh" >> $HOME/.bashrc
source $HOME/.bashrc
```
<br>
1. **Create a job submission script for Science Cluster called `msa.sh`.** This script shall take the ompA sequences and generate a multiple sequence alignment (MSA). In your script:
* Request the following resources:
* 4 CPUs
* 8 GB memory
* 30 min runtime
* Give the job a sensible name to identify it
* Load the `singularityce` module
* The software for generating the MSA is called `mafft`. It is provided as a singularity image, which can be found here: `/shares/amr.imm.uzh/bioinfo/singularity/mafft_7.505--hec16e2b_0.sif`.
`mafft` conveniently recognises automatically whether you want to align DNA or protein sequences.
* The command for `mafft` is as follows:
```sh
mafft input.fasta > output.fasta
```
* The input fasta file is the multi-fasta file you generated in the previous exercise.
* Submit the job.
* Watch the state of the job using `squeue`.
* Explore the output files `[jobname].out` and `[jobname].err`
<details><summary><b>Exercise solution</b></summary>
<p>
1. Writing a submission script:
```sh
#!/usr/bin/env bash
#SBATCH --time=00:30:00
#SBATCH --mem=8G
#SBATCH --cpus-per-task=4
#SBATCH --job-name=msa
#SBATCH --output=msa_%j.out
#SBATCH --error=msa_%j.err
# load the Singularity module
module load singularityce
# run mafft inside the singularity image
singularity exec /shares/amr.imm.uzh/bioinfo/singularity/mafft_7.505--hec16e2b_0.sif mafft all_ompA.fasta > all_ompA_msa.fasta
```
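
The script still has to be submitted and monitored. A minimal sketch, assuming `msa.sh` sits in the current directory and that you are on a login node where Slurm's `sbatch` and `squeue` are available; the surrounding check is only there so that the snippet does not fail when tried elsewhere.

```sh
# Assumption: msa.sh is in the current directory and Slurm is installed
if command -v sbatch >/dev/null 2>&1 && [ -f msa.sh ]; then
    sbatch msa.sh          # submit the job; Slurm prints the job ID
    squeue -u "$USER"      # list your own pending and running jobs
    submitted=yes
else
    echo "sbatch or msa.sh not found: run this on the cluster, next to msa.sh"
    submitted=no
fi
```

Once the job no longer appears in the `squeue` output, it has finished and the `.out` and `.err` files can be inspected with `less` or `cat`.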
</p>
</details>
<br>