
Before you start

  • If you have questions about any command, you can always type man [command] to get an explanation and all possible additional options.

  • Exercise solutions: All solutions are embedded in this document and are hidden by default, but you can reveal them by clicking on the drop-down menu, like this one:

    Exercise solution This would reveal the answer...

    We encourage you to not look at the solution too quickly, and try to solve the exercise without it. Remember you can always ask the course teachers for help.

  • Exercise material: Download the ompA.zip archive to your local computer. Leave it zipped.



Exercise 1 - Navigating the filesystem on command line

Objective: get familiar with Science Cluster, get familiar with navigating the directory tree and listing the content of directories.

  1. Print your current working directory with the pwd command. This will show you where you currently are in the directory tree.

  2. List the contents of your current working directory with ls, ls -l and ls -a.

    • What do the -l and -a options do?
    • Hint: you can use man ls to display the help for the ls command. To exit the help, simply type q on your keyboard.
    • Note: one-letter options can be grouped together, so ls -la is the same as ls -l -a.
    • Note: some options have both a "short" and a "long" form. E.g. ls -a is the short form for ls --all.
    • Science Cluster has some pre-defined "aliases" or shorthands for often-used commands. See what happens when you type ll or la.
  3. Navigate to your data filesystem by typing cd data. Once you are in data/, navigate to its parent directory with cd .. (two dots).

    • Where are you now?
    • Navigate to your data filesystem using the complete path /data/$USER, not the link in your home. Again, navigate to its parent directory. Where are you now?
    • What happens if you now type just cd?
  4. Try to run the command cd . (a single dot). What happens? What does the . stand for?


Exercise solution

  1. Printing the current working directory:

    pwd
  2. Listing the content of your home directory with different ls options:

    ls       # Prints the names of files and directories
    ls -l    # Lists the content of the current directory in "long listing" format. This
             # provides additional details for each file/directory, such as
             # its permissions, its size and its last modified date.
    ls -a    # Adding the "-a" option additionally displays hidden files and
             # directories. These are files/directories whose name starts with
             # a dot ".".
             # Hidden files are often used to store program configurations.
    ll       # This alias is the same as "ls -lFh"
    la       # This alias is the same as "ls -lA"

    Tip: It is possible to define a shorthand for longer commands that you use often, a so-called alias. On Science Cluster, there are already some useful pre-defined aliases, among them ll (standing for ls -lFh) and la (standing for ls -lA).

  3. Navigating to your data filesystem and understanding links:

    cd data/
    pwd         # this shows you your location relative to the symlink you followed
    cd ..
    pwd         # you are back to your home

    When navigating to a directory through a link, your path through the tree of the filesystem will reflect that and cd .. will bring you back to your home.

    cd /data/$USER
    pwd      # this shows you the actual location /data/$USER
    cd ..
    pwd      # you are now in /data
    ll       # this can take a moment   

    When navigating to your data space through its true full path, the parent directory is /data. Typing ls here will show you the data spaces of all users on Science Cluster. However, you don't have permission to access any data space other than your own!

    Typing cd from anywhere is a shorthand for cd ~ or cd /home/$USER.

  4. The . symbol is a shortcut for the current directory. So running cd . has no effect since it simply changes to the same directory we are already in.

    The . shortcut is useful in some situations. E.g. if you want to copy a file to the current directory you can do cp /file/to/copy ., or you can run an executable located in the current directory with ./run_me.sh.
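
The difference between following a link and using the true path can be reproduced in a throwaway directory. The sketch below is only an illustration: the directory names (storage, home, alice) are invented for the demo and do not exist on Science Cluster. It assumes a shell whose pwd and cd accept the standard -L (logical) and -P (physical) options, as bash does.

```shell
# Build a throwaway tree that mimics a home directory containing a symbolic
# link called "data". All names here are made up for the demo.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/storage/alice" "$sandbox/home/alice"
ln -s "$sandbox/storage/alice" "$sandbox/home/alice/data"

cd "$sandbox/home/alice/data"
logical=$(pwd -L)     # the path as you typed it, symlink included
physical=$(pwd -P)    # the real location, with the symlink resolved
echo "logical:  $logical"
echo "physical: $physical"

cd ..                 # follows the logical path, so this lands in home/alice
after_cd=$(pwd -L)
echo "after cd ..:    $after_cd"

cd "$sandbox/home/alice/data"
cd -P ..              # -P resolves the link first, so this lands in storage
after_cd_p=$(pwd -P)
echo "after cd -P ..: $after_cd_p"
```

Plain pwd and cd behave like the -L variants by default, which is why cd .. after cd data brings you back to your home.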


Exercise 2 - Creating and moving directories and files

Objective: transfer files to Science Cluster. Learn to use the mkdir, cp, mv and rm commands.

  1. Transfer the downloaded ompA.zip from your local computer to your home directory on Science Cluster, using your command line/SFTP client (drag and drop).

  2. Copy the ompA.zip file from your home to data. Navigate to data and unzip the archive.

    • Note: The command for unzipping is unzip [archive].
    • Look at the content of the extracted archive. What has happened?
  3. Create a new directory called intro_to_unix.

  4. Move the fasta file ompA_ref_short.fasta into the new directory.

    • Look at the content of intro_to_unix without entering the directory.
  5. Move the rest of the fasta files, but NOT the zip archive into intro_to_unix.

    • Hint: Using the wildcard character * can help you. For example, all files ending with .jpg can be expressed by typing *.jpg.
    • Now unzip the ompA.zip again as before. This time use the wildcard * to move all files into intro_to_unix.
  6. Go to the directory intro_to_unix and rename the zip archive. The new name does not matter.

  7. Delete the renamed zip archive.

  8. Create a new directory called archive inside intro_to_unix.

    • Copy ompA.zip from your home directory into the new directory.
    • Move the whole archive directory including its content to the parent directory of intro_to_unix.
    • Now delete the directory archive including its content.
    • Hint: You have to use an additional flag for the rm command. Look at the manual of the command and try to figure out which one.
    • Note: Empty directories can be deleted with rmdir directly.
  9. Create a symbolic link in your home directory to your intro_to_unix directory.

Exercise solution

  1. If you have trouble with the transfer, please ask a student/tutor.

  2. Copying files:

    cp ompA.zip data/
    cd data
    unzip ompA.zip       # This extracts the files from the archive while also preserving the archive. 
  3. Creating a new directory:

    mkdir intro_to_unix

    Tip: with the -p option, mkdir also creates any missing parent directories and does not complain if the directory already exists. This can be useful in a script.

  4. Moving files:

    mv ompA_ref_short.fasta intro_to_unix/
    ll intro_to_unix/ 

    Giving any location or even file as a parameter to "ls" will list the files that you specify. You don't always need to navigate to a directory to inspect its content.

  5. Moving files using wildcards:

    mv *.fasta intro_to_unix/  # This moves all files with the ending ".fasta"
    unzip ompA.zip
    mv * intro_to_unix/        # It will overwrite existing files with the same name.
                               # mv also reports that intro_to_unix cannot be moved
                               # into itself; this message can be ignored.
  6. Renaming files:

    cd intro_to_unix/
    mv ompA.zip fasta_files.zip
  7. Deleting files:

    rm fasta_files.zip
  8. Moving and deleting directories:

    mkdir archive
    cp ~/ompA.zip archive/
    mv archive/ ..        # ".." is the parent directory of intro_to_unix. You can chain it ("../..") to go up several levels.
    cd ..
    rm -r archive/        # the -r flag will recursively delete the directory, meaning it will also delete its content.
  9. Creating links:

    cd          # Remember, this is a shorthand to get "home"
    ln -s data/intro_to_unix/ unix_exercise
    cd unix_exercise/
    pwd
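
Because mv and rm act immediately and without an undo, it can help to preview what a wildcard expands to before using it. The following is a minimal sketch in a temporary sandbox; the files are empty placeholders created with touch, not the real course data, and notes.txt is a hypothetical extra file added to show what the wildcard leaves behind.

```shell
# Toy sandbox; the files below are empty placeholders, not the course data.
sandbox=$(mktemp -d)
cd "$sandbox"
touch ompA_ref_short.fasta ompA_variant_001.fasta ompA.zip notes.txt
mkdir -p intro_to_unix/archive      # -p also creates missing parent directories

# Preview what the wildcard expands to BEFORE handing it to mv or rm.
echo "would move:" *.fasta

# Move only the FASTA files; ompA.zip and notes.txt stay behind.
mv *.fasta intro_to_unix/

# rmdir only removes empty directories; rm -r removes a directory and its content.
cp ompA.zip intro_to_unix/archive/
rmdir intro_to_unix/archive 2>/dev/null || echo "rmdir refused: directory not empty"
rm -r intro_to_unix/archive

# A symbolic link gives the directory a second name.
ln -s intro_to_unix unix_exercise
ls unix_exercise/
```

Replacing the command name with echo, as in the preview line, is a cheap way to check a wildcard, since echo only prints the expanded file names.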

Exercise 3 - Reading and manipulating files, input and output

Objective: become familiar with handling files, chaining commands and redirecting output.

  1. Concatenate all fasta files into one file.

  2. Extract only the headers from your multi-fasta file.

    • Save the header in a new file.
    • Hint: What do these headers have in common that you could use to extract them?
  3. Look at the content of the multi-fasta file using head, tail, cat, more and less to explore the different behaviour.

    • Print the top and bottom 5 lines instead of the top 10 (default).
    • Hint: Check out the man pages for optional parameters.
  4. Count the number of sequences that contain the peptide SNVYGKNHDTGVSP.

  5. Extract the variant numbers of the variant ompA sequences.

  6. Create a new file called myOmpA with touch.

    • Print the statement "This is my favourite ompA sequence:" and append it to the file.
    • Append one of the ompA sequences to the file.
Exercise solution

  1. Concatenating files:

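    cat ompA*.fasta > all_ompA.fasta     # ">" overwrites the output file; the pattern "ompA*" keeps all_ompA.fasta itself out of the input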
  2. Using grep:

    grep ">" all_ompA.fasta       # Quotation marks are necessary because ">" is meant as a literal character, not as the redirection operator!
    
    grep ">" all_ompA.fasta > headers
  3. Exploring files using different tools:

    head all_ompA.fasta
    head -n 5 all_ompA.fasta       # show only the first 5 lines
    
    tail all_ompA.fasta
    tail -n 5 all_ompA.fasta
    
    cat all_ompA.fasta
    # "cat" prints the complete file to standard output.
    
    more all_ompA.fasta 
    # "more" shows the file one screen at a time without leaving your command line. Scroll with the space bar; when you reach the end of the file, you get your prompt back.
    
    less all_ompA.fasta
    # "less" opens the file in a separate full-screen view, and you can scroll backward and forward. Type 'g' to jump to the top of the file and 'SHIFT+g' to jump to the end. Typing '/' followed by a search term highlights all matches. Press 'q' to quit.
  4. Counting lines:

    grep "SNVYGKNHDTGVSP" all_ompA.fasta | wc -l        # 5 sequences

    Note: This counts the number of lines in which the pattern was found, not the number of occurrences.

  5. Extracting parts of a character string:

    grep "variant" all_ompA.fasta | cut -d '_' -f 3       # or use the headers file
  6. Appending text and files:

    touch myOmpA
    echo "This is my favourite ompA sequence:" >> myOmpA
    cat ompA_variant_008.fasta >> myOmpA
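
The difference between > (overwrite) and >> (append) is easy to miss when a command is run more than once. The sketch below illustrates it on toy data: the two records are invented for the demo and are not real ompA sequences, and the file names are hypothetical. It also uses grep -c, which prints the number of matching lines directly.

```shell
# Toy data only: two invented records, not the real ompA sequences.
sandbox=$(mktemp -d)
cd "$sandbox"
printf '>toy_variant_001\nMKKTAIAIAV\nSNVYGKNHDT\n' > toy_variant_001.fasta
printf '>toy_variant_002\nMKKTAIALAV\n' > toy_variant_002.fasta

# ">" empties the target first, so repeating the command never duplicates records.
cat toy_variant_*.fasta > all_toy.fasta
cat toy_variant_*.fasta > all_toy.fasta

# grep -c prints the number of matching lines, like "grep ... | wc -l".
# The ^ anchor restricts matches to lines that START with ">".
n_records=$(grep -c '^>' all_toy.fasta)
echo "records after two runs with '>':  $n_records"

# ">>" appends instead, so running it twice doubles the content.
cat toy_variant_*.fasta >> appended.fasta
cat toy_variant_*.fasta >> appended.fasta
n_appended=$(grep -c '^>' appended.fasta)
echo "records after two runs with '>>': $n_appended"

# Field extraction: split each header on "_" and keep the third field.
variants=$(grep '^>' all_toy.fasta | cut -d '_' -f 3)
echo "$variants"
```

Note that the output files are named so that they do not match the input wildcard; otherwise a second run would try to read the output file as input.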

Exercise 4 - Variables and control structures

Objective: handle variables, arrays and for loops; write a simple script.

  1. On the command line, create a variable var1 containing the string "ompA". Print the variable to standard output.

  2. On the command line, loop through all fasta files starting with "ompA" using var1 and print the name to standard output.

    • Once you type for and press ENTER, the shell will recognise the syntax and you can continue writing the for-loop line by line. It will evaluate the whole command only after you entered done.
  3. In your editor, write a simple script called copy_and_rename.sh.

    Preparation: It's easiest if you write your script locally and then transfer it into your intro_to_unix directory on the cluster, using the SFTP function of your command line client (Termius/Notepad++).

    For BBEdit users: You can turn on syntax highlighting either by saving the file as a .sh file or by selecting "Unix Shell Script" in the drop-down menu at the very bottom (it says "Text file" by default).

    For Notepad++ users: You can turn on syntax highlighting by selecting "shell" in the "Language" menu.

    • In a for loop, go through all fasta files starting with "ompA".
    • Copy each fasta file. For the name of the new copy, replace "ompA" with "outerMembraneProteinA".
    • Run the script from within your intro_to_unix folder.
    • Hint: Think about our previous exercise where we extracted the variant number. You can use a similar approach here.
    • Tip: Always keep in mind what exactly is stored in your respective variables. A filename? An index? Try to also think about what you need as input for each command. For example, do you want to extract something from within a file or from a filename?
    • Hint: Don't forget the shebang!
  4. Rewrite your script to make it more flexible/reusable in the future.

    • For example, defining variables that contain certain things (paths, names, strings etc.) that are used in the code but you might want to change in the future could be a good idea.
  5. On the command line, create an array that contains the numbers from 5 to 10.

    • Print out the complete array to standard output.
    • Print out the 3rd element.
    • Write a for loop that prints out each element.
  6. Modify the script copy_and_rename.sh further.

    • Include an array that contains the filenames of all fasta files starting with "ompA". Use it in your loop.
    • Create a folder for each "variant" of ompA (e.g., 001, short, etc.).
    • Copy each fasta into the respective variant folder. Keep the renaming as before.
Exercise solution

  1. Creating a variable:

    var1="ompA"
    echo $var1
  2. Writing a for loop:

    for x in $var1*.fasta
    do
       echo $x
    done
  3. Writing a simple script that copies and renames files:

    #!/bin/bash
    for x in ompA*.fasta
    do
       # save file ending in variable
       ending=$(echo "$x" | cut -d '_' -f 2,3)    # extract 2nd and 3rd field of the filename
       # copy and rename file
       cp "$x" "outerMembraneProteinA_$ending"
    done
  4. To make it more reusable, you can use additional variables whose values can be easily changed and adapted to new use cases.

    #!/bin/bash
    
    # Define variables prior to your actual code and use these throughout instead of hard-coded names
    name1="ompA"
    name2="outerMembraneProteinA"
    
    for x in "$name1"*.fasta
    do
       # save file ending in variable
       ending=$(echo "$x" | cut -d '_' -f 2,3)
       # copy and rename file
       cp "$x" "${name2}_$ending"
    done

    You could even turn the parameters of the cut command inside the for loop into variables to make it more general, or you could include variables for the paths of the current working directory and a target directory, and so on. This version will suffice here.

  5. Creating an array:

    numbers=(5 6 7 8 9 10)
    echo "${numbers[@]}"    # don't forget the curly brackets
    echo "${numbers[2]}"    # 3rd element: indices start at 0
    
    # iterating through the array elements
    for i in "${numbers[@]}"
    do
       echo "$i"
    done
    # OR iterating through the array indices
    for i in "${!numbers[@]}"
    do
       echo "${numbers[$i]}"
    done
    # OR iterating through indices that you define yourself (here they match the array's indices)
    for i in {0..5}
    do
       echo "${numbers[$i]}"
    done
    
  6. Modifying the script further to use arrays:

    #!/bin/bash
    
    name1="ompA"
    name2="outerMembraneProteinA"
    
    ompAfiles=($(ls $name1*.fasta))     # remember: x=(x1 x2 x3) creates an array, y=$(command) assigns the output of a command to a variable. Here these two are combined.  
    echo ${ompAfiles[@]}                # including echos of variables can be a good sanity check for your code
    
    for x in ${ompAfiles[@]}
    do
       # save file ending in variable
       ending=$(echo $x | cut -d '_' -f 2,3)
       echo $ending    
    
       # extract ompA variant 
       variant=$(echo $x | cut -d '_' -f 3 | cut -d '.' -f 1)   # you can pipe as many commands as you like
       echo $variant   
       
       # create new directory 
       mkdir -p "$variant"    # -p: no error if the folder already exists
       
       # copy file into new directory and rename
       cp $x $variant/${name2}_$ending
       
    done
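
As an alternative to piping filenames through cut, the shell can trim a variable's value itself with parameter expansion. The sketch below is one possible variant of the script above, not the course's reference solution. It runs in a temporary sandbox on empty placeholder files, and the variable names old, new, stem and variant are chosen here for illustration.

```shell
# Toy sandbox with empty placeholder files named like the course files.
sandbox=$(mktemp -d)
cd "$sandbox"
touch ompA_ref_short.fasta ompA_variant_001.fasta

old="ompA"
new="outerMembraneProteinA"

for f in "$old"_*.fasta
do
    ending=${f#"$old"_}     # strip the leading "ompA_"          e.g. variant_001.fasta
    stem=${f%.fasta}        # strip the trailing ".fasta"        e.g. ompA_variant_001
    variant=${stem##*_}     # keep what follows the last "_"     e.g. 001

    mkdir -p "$variant"
    cp "$f" "$variant/${new}_$ending"
done

ls -R
```

Here ${var#pattern} removes a matching prefix and ${var%pattern} a matching suffix; doubling the symbol (## or %%) removes the longest possible match instead of the shortest. Quoting "$f" keeps the script working even if a filename contains spaces.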

Exercise 5 - Science Cluster

Objective: running software from singularity images and submitting jobs to Science Cluster

BEFORE WE START Please run the following command:

echo "export SINGULARITY_BINDPATH=/scratch,/data,/home/$USER,/shares/amr.imm.uzh" >> $HOME/.bashrc
source $HOME/.bashrc

  1. Create a job submission script for Science Cluster called msa.sh. This script shall take the ompA sequences and generate a multiple sequence alignment (MSA). In your script:
    • Request the following resources:

      • 4 CPUs
      • 8 GB memory
      • 30 min runtime
      • Give the job a sensible name to identify it
    • Load the singularityce module

    • The software for generating the MSA is called mafft. It is installed via its singularity image which can be found here: /shares/amr.imm.uzh/bioinfo/singularity/mafft_7.505--hec16e2b_0.sif.

      mafft is a very convenient software which can recognise if you want to align DNA or proteins automatically.

    • The command for mafft is as follows:

      mafft input.fasta > output.fasta
    • The input fasta file is the multi-fasta file you generated in the previous exercise.

    • Submit the job.

    • Watch the state of the job using squeue.

    • Explore the output files of the job [jobname].out and [jobname].err as well as the actual MSA.

Exercise solution

  1. Writing a submission script:

    #!/usr/bin/env bash
    #SBATCH --time=00:30:00
    #SBATCH --mem=8G
    #SBATCH --cpus-per-task=4
    #SBATCH --job-name=msa
    #SBATCH --output=msa_%j.out
    #SBATCH --error=msa_%j.err
    
    # load the Singularity module
    module load singularityce
    
    /shares/amr.imm.uzh/bioinfo/singularity/mafft_7.505--hec16e2b_0.sif mafft all_ompA.fasta > all_ompA_msa.fasta
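
The script is submitted with sbatch msa.sh, and squeue -u $USER lists your own jobs while they are pending or running. Once the job has finished, a quick sanity check is to confirm that the alignment contains as many records as the input. The sketch below shows one way to do this; the helper functions count_records and same_record_count are hypothetical names introduced here, and the two files are tiny invented stand-ins for all_ompA.fasta and all_ompA_msa.fasta.

```shell
# count_records FILE: print the number of FASTA records (header lines) in FILE.
count_records() {
    grep -c '^>' "$1"
}

# same_record_count INPUT MSA: succeed only if both files hold the same number of records.
same_record_count() {
    [ "$(count_records "$1")" -eq "$(count_records "$2")" ]
}

# Toy stand-ins for the real input and alignment (invented sequences).
sandbox=$(mktemp -d)
printf '>a\nMKT\n>b\nMT\n'  > "$sandbox/input.fasta"
printf '>a\nMKT\n>b\nM-T\n' > "$sandbox/msa.fasta"

if same_record_count "$sandbox/input.fasta" "$sandbox/msa.fasta"
then
    status="ok"
else
    status="mismatch"
fi
echo "record check: $status"
```

On the cluster, the same check would be called with the real file names in place of the toy files. If the counts differ, the .err file of the job is the first place to look.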