Fanny Wegner authored
Before you start
-
If you have questions about any command, you can always type man [command] to get an explanation and all possible additional options.
-
Exercise solutions: All solutions are embedded in this document and are hidden by default, but you can reveal them by clicking on the drop-down menu, like this one:
Exercise solution
This would reveal the answer...
We encourage you not to look at the solution too quickly, and to try to solve the exercise without it first. Remember you can always ask the course teachers for help.
-
Exercise material: Download the ompA.zip archive to your local computer. Leave it zipped.
Exercise 1 - Navigating the filesystem on the command line
Objective: get familiar with Science Cluster, with navigating the directory tree and with listing the content of directories.
-
Print your current working directory with the pwd command. This will show you where you currently are in the directory tree.
-
List the contents of your current working directory with ls, ls -l and ls -a.
- What do the -l and -a options do?
- Hint: you can use man ls to display the help for the ls command. To exit the help, simply type q on your keyboard.
- Note: one-letter options can be grouped together, so ls -la is the same as ls -l -a.
- Note: some options have both a "short" and a "long" form. E.g. ls -a is the short form for ls --all.
- Science Cluster has some pre-defined "aliases" or shorthands for often-used commands. See what happens when you type ll or la.
-
Navigate to your data filesystem by typing cd data. Once you are in data/, navigate to its parent directory with cd .. (two dots).
- Where are you now?
- Navigate to your data filesystem using the complete path /data/$USER, not the link in your home. Again, navigate to its parent directory. Where are you now?
- What happens if you now type just cd?
-
Try to run the command cd . (a single dot). What happens? What does the "." stand for?
Exercise solution
-
Printing the current working directory:
pwd
-
Listing the content of your home directory with different ls options:
ls      # Prints the names of files and directories
ls -l   # Lists the content of the current directory in "long listing" format.
        # This provides additional details for each file/directory, such as
        # its permissions, its size and its last modified date.
ls -a   # Adding the "-a" option additionally displays hidden files and
        # directories. These are files/directories whose name starts with
        # a dot ".".
        # Hidden files are often used to store program configurations.
ll      # This alias is the same as "ls -lFh"
la      # This alias is the same as "ls -lA"
Tip: It is possible to define a shorthand for longer commands that you use often, a so-called alias. On Science Cluster, there are already some pre-defined useful aliases, among them ll (standing for ls -lFh) and la (standing for ls -lA).
-
Navigating to your data filesystem and understanding links:
cd data/
pwd     # this shows you your location relative to the symlink you followed
cd ..
pwd     # you are back in your home
When navigating to a directory through a link, your path through the tree of the filesystem will reflect that, and cd .. will bring you back to your home.
cd /data/$USER
pwd     # this shows you the actual location /data/$USER
cd ..
pwd     # you are now in /data
ll      # this can take a moment
When navigating to your data space through its true full path, the parent directory is /data. Typing ls here will show you the data spaces of all users on Science Cluster. However, you don't have permission to access any of them other than your own!
Typing cd from anywhere is a shorthand for cd ~ or cd /home/$USER.
-
The "." symbol is a shortcut for the current directory. So running cd . has no effect, since it simply changes to the directory we are already in.
The "." shortcut is useful in some situations. E.g. if you want to copy a file to the current directory you can do cp /file/to/copy . (with a trailing dot), or you can run an executable located in the current directory with ./run_me.sh.
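The difference between following the link and using the true path can be reproduced in a throwaway sandbox. The sketch below is only an illustration: the directory names storage/alice and home/alice are made up and merely stand in for /data/$USER and your home. It relies on pwd -P and cd -P, which resolve symbolic links before doing their work.

```shell
#!/usr/bin/env bash
# Sandbox with made-up names: "storage/alice" plays the role of /data/$USER,
# "home/alice/data" is the symbolic link that points to it.
sandbox=$(mktemp -d)
sandbox=$(cd "$sandbox" && pwd -P)   # resolve any links in the temp path itself
mkdir -p "$sandbox/storage/alice" "$sandbox/home/alice"
ln -s "$sandbox/storage/alice" "$sandbox/home/alice/data"

cd "$sandbox/home/alice/data"
logical=$(pwd)        # the path as you walked it, through the link
physical=$(pwd -P)    # the true location with all links resolved
echo "logical:  $logical"
echo "physical: $physical"

cd ..                 # follows the logical path: back in the "home"
after_logical=$(pwd)

cd "$sandbox/home/alice/data"
cd -P ..              # resolves the link first: parent of the real directory
after_physical=$(pwd)
echo "cd ..    -> $after_logical"
echo "cd -P .. -> $after_physical"

cd / && rm -r "$sandbox"   # clean up
```

The first cd .. ends in home/alice, the second one in storage, which mirrors what happens with ~/data versus /data/$USER on the cluster.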
Exercise 2 - Creating and moving directories and files
Objective: transfer files to Science Cluster. Learn to use the mkdir, cp, mv and rm commands.
-
Transfer the downloaded ompA.zip from your local computer to your home directory on Science Cluster with your command line/SFTP client (drag and drop).
-
Copy the ompA.zip file from your home to data. Navigate to data and unzip the archive.
- Note: The command for unzipping is unzip [archive].
- Look at the content of the extracted archive. What has happened?
-
Create a new directory called intro_to_unix.
-
Move the fasta file ompA_ref_short.fasta into the new directory.
- Look at the content of intro_to_unix without entering the directory.
-
Move the rest of the fasta files, but NOT the zip archive, into intro_to_unix.
- Hint: Using the wildcard character * can help you. For example, all files ending with .jpg can be expressed by typing *.jpg.
- Now unzip the ompA.zip again as before. This time use the wildcard * to move all files into intro_to_unix.
-
Go to the directory intro_to_unix and rename the zip archive. The new name does not matter.
-
Delete the renamed zip archive.
-
Create a new directory called archive inside intro_to_unix.
- Copy ompA.zip from your home directory into the new directory.
- Move the whole archive directory including its content to the parent directory of intro_to_unix.
- Now delete the directory archive including its content.
- Hint: You have to use an additional flag for the rm command. Look at the manual of the command and try to figure out which one.
- Note: Empty directories can be deleted with rmdir directly.
-
Create a symbolic link in your home directory to your intro_to_unix directory.
Exercise solution
-
If you have trouble with the transfer, please ask a student/tutor.
-
Copying files:
cp ompA.zip data/
cd data
unzip ompA.zip    # This extracts the files from the archive while also preserving the archive.
-
Creating a new directory:
mkdir intro_to_unix
Tip: with the -p option, mkdir also creates any missing parent directories and does not complain if the directory already exists. This can be useful in a script.
-
Moving files:
mv ompA_ref_short.fasta intro_to_unix/
ll intro_to_unix/
If you give ls (or an alias such as ll) a directory or a file as an argument, it lists that location instead of the current directory. You don't always need to navigate to a directory to inspect its content.
-
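Moving files using wildcards:
mv *.fasta intro_to_unix/    # This moves all files with the ending ".fasta"
unzip ompA.zip
mv * intro_to_unix/          # This overwrites existing files with the same name.
                             # "*" also matches intro_to_unix itself, so mv reports that it
                             # cannot move the directory into itself. This is expected here.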
-
Renaming files:
cd intro_to_unix/
mv ompA.zip fasta_files.zip
-
Deleting files:
rm fasta_files.zip
-
Moving and deleting directories:
mkdir archive
cp ~/ompA.zip archive/
mv archive/ ..     # ".." is the parent directory of intro_to_unix. You can chain it
                   # ("../..") to go further upward in the file tree.
cd ..
rm -r archive/     # the -r flag deletes the directory recursively, meaning it also deletes its content.
-
Creating links:
cd    # Remember, this is a shorthand to get "home"
ln -s data/intro_to_unix/ unix_exercise
cd unix_exercise/
pwd
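Because mv with a wildcard overwrites files without asking, it can help to preview what a pattern matches before moving anything. The following is a minimal sketch in a temporary sandbox. The files are empty stand-ins that only mimic the exercise names, and it assumes your mv supports the -n (no-clobber) option, as the GNU and BSD versions do.

```shell
#!/usr/bin/env bash
# Sandbox with empty stand-in files; the names mimic the exercise files.
sandbox=$(mktemp -d)
cd "$sandbox"
touch ompA_ref_short.fasta ompA_variant_001.fasta ompA.zip
mkdir intro_to_unix

# 1. Preview: echo prints what the shell expands the pattern to.
preview=$(echo *.fasta)
echo "Would move: $preview"

# 2. Move only once the preview looks right.
#    -n refuses to overwrite files that already exist in the target.
mv -n *.fasta intro_to_unix/

# 3. Check the result without entering the directory.
moved=$(ls intro_to_unix)
left=$(ls)
echo "Moved:"
echo "$moved"
echo "Left behind:"
echo "$left"

cd / && rm -r "$sandbox"   # clean up
```

The zip archive and the directory itself stay where they are, because neither matches *.fasta.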
Exercise 3 - Reading and manipulating files, input and output
Objective: become familiar with handling files, chaining commands and redirecting output.
-
Concatenate all fasta files into one file.
-
Extract only the headers from your multi-fasta file.
- Save the headers in a new file.
- Hint: What do these headers have in common that you could use to extract them?
-
Look at the content of the multi-fasta file using head, tail, cat, more and less to explore the different behaviour.
- Print the top and bottom 5 lines instead of the top 10 (default).
- Hint: Check out the man pages for optional parameters.
-
Count the number of sequences that contain the peptide SNVYGKNHDTGVSP.
-
Extract the variant numbers of the variant ompA sequences.
-
Create a new file called myOmpA with touch.
- Print the statement "This is my favourite ompA sequence:" and append it to the file.
- Append one of the ompA sequences to the file.
Exercise solution
-
Concatenating files:
cat ompA*.fasta > all_ompA.fasta    # "ompA*.fasta" does not match the output file itself, so the command is safe to run again
-
Using grep:
grep ">" all_ompA.fasta    # Quotation marks are necessary because ">" is meant as a character here, not as the redirection operator!
grep ">" all_ompA.fasta > headers
-
Exploring files using different tools:
head all_ompA.fasta
head -n 5 all_ompA.fasta    # show only the first 5 lines
tail all_ompA.fasta
tail -n 5 all_ompA.fasta    # show only the last 5 lines
cat all_ompA.fasta          # "cat" prints the complete file to standard output.
more all_ompA.fasta         # With "more" you remain in your command line. You scroll through
                            # the file with the space bar; when you reach the end of the file,
                            # you get your prompt back.
less all_ompA.fasta         # With "less" you see the document in a separate view, apart from
                            # your command line, and you can go backward and forward.
                            # Type 'g' to get to the top of the file, 'SHIFT+g' to the end.
                            # Typing '/' followed by a search term highlights all instances.
                            # Type 'q' to quit.
-
Counting lines:
grep "SNVYGKNHDTGVSP" all_ompA.fasta | wc -l # 5 sequences
Note: This counts the number of lines in which the pattern was found, not the number of occurrences.
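To see the difference between matching lines and occurrences, here is a small sketch on made-up data (the sequences and the motif GKN are invented for the illustration, they are not taken from the ompA files). It uses grep -c to count matching lines and grep -o, available in GNU and BSD grep, to print every single match on its own line.

```shell
#!/usr/bin/env bash
# Made-up example data: the motif "GKN" occurs twice in seq1 and once in seq2.
sample=$(mktemp)
printf '>seq1\nAAGKNTTGKNAA\n>seq2\nCCGKNCC\n>seq3\nCCCCCC\n' > "$sample"

lines_piped=$(grep "GKN" "$sample" | wc -l)      # matching lines, as in the solution above
lines_counted=$(grep -c "GKN" "$sample")         # the same number, counted by grep itself
occurrences=$(grep -o "GKN" "$sample" | wc -l)   # -o prints each match on its own line

echo "matching lines (grep | wc -l): $lines_piped"
echo "matching lines (grep -c):      $lines_counted"
echo "occurrences (grep -o | wc -l): $occurrences"

rm "$sample"   # clean up
```

Two lines match, but the motif occurs three times, because seq1 contains it twice.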
-
Extracting parts of a character string:
grep "variant" all_ompA.fasta | cut -d '_' -f 3 # or use the headers file
-
Concatenating input and files:
touch myOmpA
echo "This is my favourite ompA sequence:" >> myOmpA
cat ompA_variant_008.fasta >> myOmpA
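The solutions in this exercise use both ">" and ">>". As a quick reminder of how they differ, the sketch below writes to a temporary file (created with mktemp, so none of your exercise files are touched).

```shell
#!/usr/bin/env bash
out=$(mktemp)

echo "first"  >  "$out"    # ">" creates the file or replaces its content
echo "second" >> "$out"    # ">>" appends to the end of the file
after_append=$(cat "$out")

echo "third"  >  "$out"    # ">" again: the two earlier lines are gone
after_overwrite=$(cat "$out")

echo "After appending:"
echo "$after_append"
echo "After overwriting:"
echo "$after_overwrite"

rm "$out"   # clean up
```

This is why appending with ">>" is the right choice for myOmpA, while a file that should be rebuilt from scratch is better written with ">".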
Exercise 4 - Variables and control structures
Objective: handle variables, arrays and for loops; write a simple script.
-
On the command line, create a variable var1 containing the string "ompA". Print the variable to standard output.
-
On the command line, loop through all fasta files starting with "ompA" using var1 and print each name to standard output.
- Once you type the first line of the for loop and press ENTER, the shell will recognise the syntax and you can continue writing the for loop line by line. It will evaluate the whole command only after you have entered done.
-
In your editor, write a simple script called copy_and_rename.sh.
Preparation: It's easiest if you write your script locally and then transfer it to the cluster, into your intro_to_unix directory, via the SFTP function of your client (Termius/Notepad++).
For BBEdit users: You can turn on syntax highlighting either by saving the file as a .sh file or by selecting "Unix Shell Script" in the drop-down menu at the very bottom (it says "Text file" by default).
For Notepad++ users: You can turn on syntax highlighting by selecting "shell" in the "Language" menu.
- In a for loop, go through all fasta files starting with "ompA".
- Copy each fasta file. For the name of the new copy, replace "ompA" with "outerMembraneProteinA".
- Run the script from within your intro_to_unix folder.
- Hint: Think about our previous exercise where we extracted the variant number. You can use a similar approach here.
- Tip: Always keep in mind what exactly is stored in your respective variables. A filename? An index? Try to also think about what you need as input for each command. For example, do you want to extract something from within a file or from a filename?
- Hint: Don't forget the shebang!
-
Rewrite your script to make it more flexible/reusable in the future.
- For example, defining variables that contain certain things (paths, names, strings etc.) that are used in the code but you might want to change in the future could be a good idea.
-
On the command line, create an array that contains the numbers from 5 to 10.
- Print out the complete array to standard output.
- Print out the 3rd element.
- Write a for loop that prints out each element.
-
Modify the script copy_and_rename.sh further.
- Include an array that contains the filenames of all fasta files starting with "ompA". Use it in your loop.
- Create a folder for each "variant" of ompA (e.g., 001, short etc.).
- Copy each fasta file into the respective variant folder. Keep the renaming as before.
Exercise solution
-
Creating a variable:
var1="ompA"
echo $var1
-
Writing a for loop:
for x in $var1*.fasta
do
    echo $x
done
-
Writing a simple script that copies and renames files:
#!/bin/bash
for x in ompA*.fasta
do
    # save file ending in variable
    ending=$(echo "$x" | cut -d '_' -f 2,3)    # extract 2nd and 3rd field
    # copy and rename file
    cp "$x" "outerMembraneProteinA_$ending"
done
-
To make it more reusable, you can use additional variables whose values can be easily changed and adapted to new use cases.
#!/bin/bash
# Define variables prior to your actual code and use these throughout instead of hard-coded names
name1="ompA"
name2="outerMembraneProteinA"
for x in "$name1"*.fasta
do
    # save file ending in variable
    ending=$(echo "$x" | cut -d '_' -f 2,3)
    # copy and rename file
    cp "$x" "${name2}_$ending"
done
You could even change the parameters of the cut command in the for loop to make it more general, or you could include variables for the paths of the current working directory and a target directory, and so on, but this version will suffice here.
-
Creating an array:
numbers=(5 6 7 8 9 10)
echo ${numbers[@]}    # don't forget the curly brackets
echo ${numbers[2]}    # the 3rd element; indices start at 0
# iterating through the array elements
for i in ${numbers[@]}
do
    echo $i
done
# OR iterating through the array indices
for i in ${!numbers[@]}
do
    echo ${numbers[$i]}
done
# OR iterating through indices defined by yourself (which you use for the array in this case)
for i in {0..5}
do
    echo ${numbers[$i]}
done
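As a possible shortcut, the same array can be filled with brace expansion instead of typing every number. The sketch below assumes bash (arrays and brace expansion are not part of plain sh); the variable names count, third and last are only chosen for this illustration.

```shell
#!/usr/bin/env bash
numbers=({5..10})                  # brace expansion writes out 5 6 7 8 9 10 for you
count=${#numbers[@]}               # number of elements in the array
third=${numbers[2]}                # indices start at 0, so index 2 is the 3rd element
last=${numbers[$((count - 1))]}    # the last index is always count - 1

echo "all:   ${numbers[*]}"
echo "count: $count"
echo "third: $third"
echo "last:  $last"
```

Using ${#numbers[@]} instead of a hard-coded upper index keeps a loop correct when the array changes length.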
-
Modifying the script further to use arrays:
#!/bin/bash
name1="ompA"
name2="outerMembraneProteinA"
# remember: x=(x1 x2 x3) creates an array, y=$(command) assigns the output
# of a command to a variable. Here these two are combined.
ompAfiles=($(ls $name1*.fasta))
echo ${ompAfiles[@]}    # including echos of variables can be a good sanity check for your code
for x in ${ompAfiles[@]}
do
    # save file ending in variable
    ending=$(echo $x | cut -d '_' -f 2,3)
    echo $ending
    # extract ompA variant
    variant=$(echo $x | cut -d '_' -f 3 | cut -d '.' -f 1)    # you can pipe as many commands as you like
    echo $variant
    # create new directory
    mkdir -p $variant    # -p: throws no error if the folder already exists
    # copy file into new directory and rename
    cp $x $variant/${name2}_$ending
done
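The script above is one way to solve the task. As an alternative sketch, not the course's reference solution, the array can be filled directly from the wildcard and the variant can be extracted with bash parameter expansion instead of ls and cut. The demo runs in a temporary sandbox with empty stand-in files whose names mimic the exercise files; the array variants is only there to show what was extracted.

```shell
#!/usr/bin/env bash
# Sandbox with empty stand-in files; the names mimic the exercise files.
sandbox=$(mktemp -d)
cd "$sandbox"
touch ompA_ref_short.fasta ompA_variant_001.fasta ompA_variant_008.fasta

name1="ompA"
name2="outerMembraneProteinA"

ompAfiles=("$name1"*.fasta)          # the wildcard fills the array directly, no ls needed
echo "Found ${#ompAfiles[@]} files"

variants=()
for x in "${ompAfiles[@]}"           # quoted, so each filename stays one element
do
    variant="${x##*_}"               # remove everything up to the last "_"  (001.fasta)
    variant="${variant%.fasta}"      # remove the ".fasta" suffix            (001)
    variants+=("$variant")
    mkdir -p "$variant"
    # "${x#${name1}_}" strips the old prefix, e.g. leaving variant_001.fasta
    cp "$x" "$variant/${name2}_${x#${name1}_}"
done

echo "Variants: ${variants[*]}"
copied=$(ls 001)
echo "Content of 001: $copied"

cd / && rm -r "$sandbox"   # clean up
```

Quoting the expansions keeps the loop working even for filenames that contain spaces, which the ls-based version would split into several words.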
Exercise 5 - Science Cluster
Objective: running software from singularity images and submitting jobs to Science Cluster
BEFORE WE START: Please run the following commands:
echo "export SINGULARITY_BINDPATH=/scratch,/data,/home/$USER,/shares/amr.imm.uzh" >> $HOME/.bashrc
source $HOME/.bashrc
-
Create a job submission script for Science Cluster called msa.sh. This script shall take the ompA sequences and generate a multiple sequence alignment (MSA). In your script:
-
Request the following resources:
- 4 CPUs
- 8 GB memory
- 30 min runtime
- Give the job a sensible name to identify it
-
Load the singularityce module.
-
The software for generating the MSA is called mafft. It is installed via its singularity image, which can be found here: /shares/amr.imm.uzh/bioinfo/singularity/mafft_7.505--hec16e2b_0.sif. mafft is a very convenient software which can automatically recognise whether you want to align DNA or proteins.
-
The command for mafft is as follows:
mafft input.fasta > output.fasta
-
The input fasta file is the multi-fasta file you generated in the previous exercise.
-
Submit the job.
-
Watch the state of the job using squeue.
-
Explore the output files of the job, [jobname].out and [jobname].err, as well as the actual MSA.
-
Exercise solution
-
Writing a submission script:
#!/usr/bin/env bash
#SBATCH --time=00:30:00
#SBATCH --mem=8G
#SBATCH --cpus-per-task=4
#SBATCH --job-name=msa
#SBATCH --output=msa_%j.out
#SBATCH --error=msa_%j.err
# In the two file names above, %j is replaced by the job ID.
# load the Singularity module
module load singularityce
/shares/amr.imm.uzh/bioinfo/singularity/mafft_7.505--hec16e2b_0.sif mafft all_ompA.fasta > all_ompA_msa.fasta
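Submitting and watching the job could then look roughly like the sketch below. It assumes that msa.sh is in the current directory and that the Slurm commands sbatch and squeue are available, as they are on the cluster; anywhere else the script only prints a message. The variables job_script, job_id and status are names chosen for this illustration.

```shell
#!/usr/bin/env bash
# Run this in the directory that contains msa.sh (the submission script above).
job_script="msa.sh"

if command -v sbatch >/dev/null 2>&1 && [ -f "$job_script" ]; then
    # --parsable makes sbatch print just the job ID (optionally followed by ";cluster")
    job_id=$(sbatch --parsable "$job_script" | cut -d ';' -f 1)
    echo "Submitted job $job_id"
    squeue -u "$USER"    # lists your pending and running jobs
    echo "Once the job has finished, look at msa_${job_id}.out and msa_${job_id}.err"
    status="submitted"
else
    echo "sbatch or $job_script not found, nothing was submitted"
    status="skipped"
fi
```

Typing sbatch msa.sh followed by squeue -u $USER on the command line does the same thing by hand; the job ID is then printed by sbatch in its confirmation message.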