## Parallelization: What is it exactly and what options do I have?
Commonly, you can consider 2 basic strategies for parallelizing your work:
1. parallelization within the analytical code itself (either across multiple CPUs or GPUs); this is sometimes called ["multithreading"](https://docs.s3it.uzh.ch/cluster/parallelisation/#multithreaded-application)
2. parallelization at the task level
Depending on the details of your own analytical code, one or both types of parallelism may be possible. Only *you*, the domain expert and the author of your own code, will be able to make the most informed decision about parallelism in your workflow.
#### Multithread / Multi-GPU Parallelization
The first sort of parallelization is specific to the analytical code. In this repository, such parallelization is illustrated in the "Lightning_Tutorial_Jupyter.ipynb" notebook. Within this notebook, you can see that the `Parallel` function from the `joblib` library is used to spread the multiple train/validate/test permutations across multiple CPUs.
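As a rough illustration of this pattern (not the exact code from the notebook; `run_fold` and its body are placeholders), `joblib` distributes independent function calls across CPU cores like so:

```python
from joblib import Parallel, delayed

def run_fold(fold_idx):
    # placeholder for one train/validate/test permutation; the real notebook
    # would train and evaluate a model here and return its score
    return fold_idx ** 2

# distribute the independent folds across 4 CPU cores;
# n_jobs=-1 would use every available core instead
scores = Parallel(n_jobs=4)(delayed(run_fold)(k) for k in range(5))
print(scores)  # [0, 1, 4, 9, 16]
```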
Moreover, because this code uses PyTorch Lightning, the `trainer()` options can be used to distribute the data across multiple GPUs.
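For reference, and only as a sketch of the general PyTorch Lightning `Trainer` API rather than the exact options used in the notebook, multi-GPU training can be requested along these lines:

```python
import pytorch_lightning as pl

# illustrative only: request 2 GPUs and let Lightning run
# distributed data-parallel (DDP) training across them
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp", max_epochs=10)
# trainer.fit(model, train_loader, val_loader)  # model and dataloaders defined elsewhere
```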
Notice that these 2 methods of parallelization are specific to the software language (Python), the tools being used (PyTorch models), and the analysis being run (multiple iterations of models with different train/validate/test sets). In other words, implementing this sort of parallelism requires knowledge of the code itself and may not be possible depending on the analysis and/or the tools being used.
##### Monitoring GPU Usage
If you would like to monitor GPU usage, you can similarly execute the analytical code in a background terminal/process then run the following command in another terminal that's part of the same session (i.e., located on the same computer or node):
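(The exact invocation used in this repository is not shown in this excerpt; a sketch that matches the description below would be:)

```bash
nvidia-smi -i "$CUDA_VISIBLE_DEVICES" -l 2 \
    --query-gpu=gpu_name,memory.total,memory.used,memory.free \
    --format=csv -f nvidia-smi.log
```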
In this case, the `nvidia-smi` tool will record details for the devices assigned via `$CUDA_VISIBLE_DEVICES`, sampling every 2 seconds (via the `-l 2` flag), returning the `gpu_name` and memory details, and writing everything in CSV format to a file named `nvidia-smi.log`.
#### Task/Job Parallelization
The second sort of parallelization occurs at the "task" or "job" level. Because this code can be executed with multiple parameters (e.g., for `k` in k-fold cross validation), multiple executions of the code can be run completely independently of one another. Parameter sweeps are commonly able to take advantage of this type of parallelization.
This type of circumstance has been termed ["embarrassingly parallel"](https://en.wikipedia.org/wiki/Embarrassingly_parallel). It is "embarrassingly", "delightfully", or "pleasingly" parallel because none of the executions of the code requires any interaction with the others. This means that, in a large enough computational environment, *all* iterations of the code can potentially run simultaneously. Such a workflow can be accomplished on cluster-type systems using the array framework demonstrated in the "submit_array_job.sh" script template.
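As a rough sketch of what such an array submission can look like (the "submit_array_job.sh" template in this repository is the authoritative version; the resource values and the `train.py` script name below are placeholders), each array task receives its own `$SLURM_ARRAY_TASK_ID`, which can be mapped to a parameter such as the fold index:

```bash
#!/usr/bin/env bash
#SBATCH --job-name=kfold-array
#SBATCH --array=0-4            # one array task per fold; adjust to your sweep
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00

# each task runs the same code with a different fold index
python train.py --fold "$SLURM_ARRAY_TASK_ID"
```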
Furthermore, you can parallelize runs with multiple array jobs on different partitions of a large cluster-type environment (e.g., the ScienceCluster) by queuing jobs across multiple partitions simultaneously, as illustrated by the "submit_slurm_jobs.sh" script. In this context, each independent array job will start as soon as resources become available for it.
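Purely as an illustration (the partition names and the loop below are hypothetical, not necessarily how "submit_slurm_jobs.sh" is written), submitting to several partitions can be as simple as:

```bash
# submit the same array script to several partitions; each job starts
# as soon as its partition has free resources
for partition in partition_a partition_b; do
    sbatch --partition="$partition" submit_array_job.sh
done
```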
### How do I determine the optimal number of cores / GPUs to use?
If you can indeed perform parallelization at the code level (i.e., a multithreaded or multi-GPU application), the next question becomes: what level of parallelization is optimal?
Common sense might suggest that the more cores or GPUs you use, the faster your workflow will become. However, per Amdahl's law, this is not necessarily the case.

In technical terms, [Amdahl's law](https://en.wikipedia.org/wiki/Amdahl%27s_law) states that "the overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used". Otherwise stated, you can only parallelize your workflow up to a certain maximal point of efficiency (unique to the application / analysis code itself), after which additional resources provided for parallelization (i.e., adding more CPUs or GPUs) will not result in any greater efficiency. In reality, after this "optimal threshold" is reached, further increases in provided resources may actually **significantly decrease** the efficiency of the code's execution. This dropoff in efficiency can be seen in the SpeedUp charts within the "Lightning_Tutorial_Jupyter.ipynb" notebook.
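Formally, if $p$ is the fraction of the runtime that can be parallelized and $N$ is the number of workers (CPUs or GPUs), Amdahl's law bounds the achievable speedup by

$$S(N) = \frac{1}{(1 - p) + \frac{p}{N}}$$

so even as $N \to \infty$ the speedup approaches $1/(1 - p)$: the serial fraction sets a hard ceiling, and real-world overheads such as communication and synchronization push measured performance below (and eventually away from) this bound.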
In order to find the optimal level of hardware for efficiency gains, you will simply need to test multiple levels of provided hardware and chart/log the efficiency as resources increase. In this context, efficiency is measured as "speedup", which is the non-parallelized implementation time divided by the parallelized implementation time (see the use of the `mutate` function in cell 5 of the "Lightning_Tutorial_Jupyter.ipynb" notebook). After the "sweet spot" of parallelization is reached, no greater efficiency is achieved (and efficiency can greatly decrease).
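As a minimal, self-contained illustration of that calculation (the timings below are invented purely for demonstration, not measurements from the notebook):

```python
# hypothetical wall-clock times (in seconds) for 1, 2, 4, and 8 CPUs
timings = {1: 120.0, 2: 65.0, 4: 40.0, 8: 38.0}

serial_time = timings[1]  # the non-parallelized baseline
speedup = {n: serial_time / t for n, t in timings.items()}
print(speedup)
# the flattening (and eventual drop) of this curve marks the "sweet spot"
```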
**TL;DR**: a greater number of cores or GPUs **does not** mean an automatic linear increase in efficiency. You will need to perform a benchmarking process to determine where the "sweet spot" for parallelization exists.
Here, parallelization means parallel computing: the simultaneous use of multiple compute resources to solve a computational problem. Refer to ["parallelization.md"](parallelization.md) for a broader overview of parallel computing.