Verified commit 91d4e764, authored by Devin Routh, committed by GitLab UZH
README Edits

# Scientific Workflows Course
## File Descriptions
This repository contains code to run a fully nested cross validation workflow via PyTorch Lightning on the "mnist" dataset of images for the Science IT Scientific Workflows course, with the goals described [below](#goals). (Previous versions of the repo with TensorFlow used the "fashion_mnist" images.) Descriptions of the files are as follows:
- The "README.md" is a standard file that should be added (as a best practice) to all code repositories.
- The ".gitignore" file is another standard file that can/should be added (as a best practice) to many code repositories. It ensures that unwanted files that exist within a Git repository are not tracked; e.g., checkpoints for Jupyter notebooks.
- The "Lightning_Tutorial_Jupyter.ipynb" file shows the analysis code within a Jupyter notebook (allowing for assessment of printed data objects and a better understanding of the code).
- The "Benchmark_Results.ipynb" file shows a simple analysis of the timings that were recorded within the code. Note: this notebook uses a separate R kernel environment, which uses only 2 packages (`tidyverse` and `ggplot2`).
- The "Lightning_Tutorial_Jupyter.py" file is a ready-to-use Python script for calling the code directly from a command line. The command line arguments are ordered as follows: (1) the number of cores to use; (2) the _k_ in k-fold cross validation; (3) the number of images to include in the analysis (currently limited to values between the specified _k_ value and 50,000).
- The "environment.yml" file details the packages that are used in the Conda environment that underlies the principal analytical code in this repository. It is detailed in the Conda section below.
- The "container_info.md" file has in-depth directions and guidance on using Singularity.
- The "TF_Reference" subdirectory is a reference of all previous TensorFlow code and outputs. Feel free to explore this code and these outputs to compare/contrast the 2 workflows.
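As a minimal sketch of how the three positional arguments listed above could be read and validated, consider the following (the function and variable names are illustrative; the repository's actual argument parsing may differ):

```python
def parse_args(argv):
    """Read the positional arguments described above (illustrative sketch)."""
    n_cores = int(argv[0])   # (1) number of cores to use
    k_folds = int(argv[1])   # (2) the k in k-fold cross validation
    n_images = int(argv[2])  # (3) number of images to include in the analysis
    # The README limits the image count to values between k and 50,000
    if not (k_folds <= n_images <= 50_000):
        raise ValueError("number of images must lie between k and 50,000")
    return n_cores, k_folds, n_images

print(parse_args(["3", "5", "500"]))  # (3, 5, 500)
```

In the real script, these values would come from `sys.argv[1:]` rather than a hard-coded list.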
## Goals
In practice, the main analytical code in this repository would be written to compute a rigorous and unbiased accuracy estimate for a full model selection/training procedure. For Science IT purposes, this repository (and the analytical code contained within it) intends to accomplish the following goals:
- create an example analytical script that can be slightly altered for benchmarking purposes; e.g., the following aspects of the script can be adjusted to compare the speeds using various parallelization strategies:
```
conda env create -f environment.yml
```
First, the underlying Conda environment should be created using the standard Conda commands; i.e., `conda create ...`, `source activate ENVNAME`, `conda install ...`. An example of a basic Conda environment creation can be found [here](https://docs.s3it.uzh.ch/how-to_articles/how_to_use_conda/#create-your-environment).
Once a Conda environment has been successfully created and refined, with all of the necessary packages, you can use the `conda list` or the `conda env export --name ENVNAME` command to list all package dependencies in the environment along with their specific versions. You can then use these details to build the `environment.yml` file, creating a new environment via `conda env create --file environment.yml` as demonstrated above. For reference, the original environment was built by manually installing packages via the following commands:
```
# !! Please note:
# `mamba install` is used here instead of `conda install`; the 2 are interchangeable, but Mamba is often faster for installations
module load mamba
mamba create -n "torch" python=3.12
source activate torch
mamba install -c pytorch -c nvidia pytorch torchvision torchaudio pytorch-cuda
mamba install conda-forge::pytorch-lightning
mamba install scikit-learn
mamba install pandas
mamba install matplotlib
mamba install pynvml
mamba install py-cpuinfo
```
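For illustration, an `environment.yml` exported from such an environment might look roughly like the following. This is a hypothetical sketch: the repository's actual file will list exact, pinned package versions produced by `conda env export`.

```yaml
name: torch
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.12
  - pytorch
  - torchvision
  - pytorch-cuda
  - pytorch-lightning
  - scikit-learn
  - pandas
  - matplotlib
  - pynvml
  - py-cpuinfo
```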
To add this environment as a [custom kernel in Jupyter](https://docs.s3it.uzh.ch/cluster/apps/user_guide/#custom-kernels-in-jupyter) if you intend on using it on the ScienceApps, run the following code on a ScienceCluster terminal after creating the environment:
```
module load anaconda3
source activate torch
mamba install ipykernel
ipython kernel install --user --name torch
```
## Singularity (i.e., a Docker-style workflow that operates on shared cluster systems)
## Prepare the benchmark using Singularity
To use Singularity, create the Singularity Image file from the definition file using:
```
sudo singularity build torch.sif singularity.def
```
Then transfer the `torch.sif` file to your system of interest and sandbox it:
```
singularity build --sandbox torch torch.sif
```
After that, use either `singularity shell` or `singularity exec` to run the code (i.e., integrate these commands with a sample run, such as `python Lightning_Tutorial_Jupyter.py 3 5 500 tfb`).
There are ScienceCloud images that come equipped with Singularity pre-installed; i.e., the image titled "Singularity 3.8 Ubuntu 20.04 (2021-07-06)" that's selectable from the "Source" menu of the ScienceCloud dashboard "Launch Instance" menu.
It is also possible to create a Conda environment within a Singularity container.
---
For reference: further installation instructions for PyTorch Lightning, including platform-specific and GPU instructions, can be found [here](https://lightning.ai/docs/pytorch/stable/).
## Running the code from the command line
Once the software environment has been created, it will be possible to run the analysis code directly from the command line. To do so, activate your chosen environment (e.g., `source activate torch` on the ScienceCluster) then run:
```
python Lightning_Tutorial_Jupyter.py 3 5 500 tfb
```
As currently implemented, the command line code requires 4 positional arguments, as in the sample invocation above. The first time the code runs, it needs an internet connection to download the training data.
## Code Description
As noted above, this code demonstrates a nested K-Fold cross validation example that assesses the generalized accuracy of 2 PyTorch based models that operate on the ["mnist" dataset](https://pytorch.org/vision/main/generated/torchvision.datasets.MNIST.html).
### What is k-fold cross validation?
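In k-fold cross validation, the dataset is split into _k_ disjoint folds; each fold is held out once for evaluation while the remaining _k_ − 1 folds are used for training. The splitting step can be sketched in pure Python (an illustrative helper, not the repository's actual code, which relies on scikit-learn-style utilities):

```python
def kfold_indices(n_samples, k):
    """Split range(n_samples) into k folds; each fold is held out once."""
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # Distribute any remainder across the first few folds
        stop = start + fold_size + (1 if i < remainder else 0)
        folds.append(list(range(start, stop)))
        start = stop
    # Pair each held-out fold with the indices of all remaining folds
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j in range(k) if j != i for idx in folds[j]]
        splits.append((train, test))
    return splits

splits = kfold_indices(10, 5)
# Each of the 5 splits holds out a disjoint fold of 2 samples
```

A nested cross validation repeats this idea twice: an outer loop estimates generalization accuracy, while an inner loop on each outer training set selects the model.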
Depending on the details of your own analytical code, one or both types of parallelization may be applicable.
#### Multithread / Multi-GPU parallelization
The first sort of parallelization is specific to the analytical code. In this repository, such parallelization is illustrated in the "Lightning_Tutorial_Jupyter.ipynb". Within this notebook, you can see that the `Parallel` function from the `joblib` library is used to spread the multiple test/validate/test permutations across multiple CPUs.
Moreover, because this code uses PyTorch Lightning, the `Trainer` arguments (e.g., `accelerator`, `devices`, and `strategy`) can be used to parallelize training across multiple GPUs.
Notice that these 2 methods of parallelization are specific to the software language (Python), the tools being used (PyTorch models), and the analysis being run (multiple iterations of models with different train/validate/test sets). In other words, implementing this sort of parallelism requires knowledge of the code itself and may not be possible depending on the analysis and/or the tools being used.
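The pattern of spreading fold evaluations across workers can be sketched with the standard library's `concurrent.futures` (an analogous illustration, not the repository's joblib code; joblib's `Parallel` typically uses worker processes, while threads keep this sketch compact and self-contained):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_fold(fold_id):
    """Stand-in for training and scoring one train/validate/test permutation."""
    # A real implementation would fit a model here; we return a dummy score.
    return fold_id, 1.0 / (fold_id + 1)

# Spread the k fold evaluations across workers, analogous to what
# joblib's Parallel does with multiple CPUs.
with ThreadPoolExecutor(max_workers=3) as pool:
    scores = dict(pool.map(evaluate_fold, range(5)))
```

Each fold evaluation is independent of the others, which is exactly why this part of the workflow parallelizes well.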
##### Monitoring GPU Usage
Common sense might dictate that the greater the number of cores or GPUs you use, the faster your code will run; in practice, this is not always the case.
![Amdahl's Law](Images/1920px-AmdahlsLaw.svg.png)
In technical terms, [Amdahl's law](https://en.wikipedia.org/wiki/Amdahl%27s_law) states that "the overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used". Otherwise stated, Amdahl's law means that you can only parallelize your workflow up to a certain maximal point of efficiency (unique to the application / analysis code itself), after which increased resources provided for parallelization (i.e., adding more CPUs or GPUs) will not result in any greater efficiency. In reality, after reaching the "optimal threshold", greater increases in provided resources may actually **significantly decrease** the efficiency of the code's execution. This dropoff in efficiency can be seen in the SpeedUp charts within the "Benchmark_Results.ipynb" notebook.
In order to find the optimal level of hardware for efficiency gains, you will simply need to test multiple levels of provided hardware and chart/log the efficiency as resources increase. For this context, efficiency is measured as "speedup", which is the non-parallelized implementation time divided by the parallelized implementation time (see the use of the `mutate` function in cell 5 of the "Benchmark_Results.ipynb" notebook). After the "sweet spot" of parallelization is reached, no greater efficiency can be accomplished (but efficiency can greatly decrease).
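Amdahl's law can also be written as a formula: with a parallelizable fraction p of the runtime and N workers, the theoretical speedup is S(N) = 1 / ((1 - p) + p / N). A small sketch (the numeric values are illustrative, not benchmarks from this repository):

```python
def amdahl_speedup(p, n):
    """Theoretical speedup with parallelizable fraction p on n workers."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallelizable, speedup is bounded:
# the limit as n grows is 1 / (1 - p) = 20x, no matter how many
# workers are added, and amdahl_speedup(0.95, 1024) is already ~19.6.
```

Measured speedup (serial time divided by parallel time, as charted in the benchmark notebook) will typically fall below this theoretical curve because of communication and scheduling overhead.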
**TL;DR**: a greater number of cores or GPUs **does not** mean an automatic linear increase in efficiency. You will need to perform a benchmarking process to determine where the "sweet spot" for parallelization exists.