More info at https://linuxize.com/post/how-to-use-linux-screen/ .
Source: https://developer.nvidia.com/system-management-interface
## Job Monitoring with Slurm
The job scheduler Slurm can report statistics for jobs that can help you understand what resources to request. Here is a short guide to the steps for resource monitoring.
_note_: Slurm makes measurements every 60 seconds, which may result in instantaneous spikes not being measured.
_hint_: [Resource monitor](https://cctools.readthedocs.io/en/latest/resource_monitor/) from CCTools is very useful as it can track most subprocesses and has a relatively small overhead. It also tracks much more than Slurm does.
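As a rough sketch of how `resource_monitor` is typically invoked (the wrapped script name is hypothetical; see the linked documentation for the exact options):
```bash
# Wrap the command to be measured after "--"; a summary of its resource
# usage is written to files starting with the given prefix when it finishes.
$ resource_monitor -O my_measurements -- ./my_pipeline.sh
```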
### Running Jobs
For currently running jobs, you can use [`squeue`](https://slurm.schedmd.com/squeue.html), and specifically for your own jobs: `squeue -u $USER` or `squeue --me`. The output of this command shows the **JOBID** in the first column (for running jobs only).
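For example:
```bash
# List only your own jobs; the JOBID is in the first column
$ squeue --me
# Print just the job IDs, one per line, without the header
$ squeue --me --noheader --format %i
```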
With the **JOBID** for your currently running job, you can run [`sstat`](https://slurm.schedmd.com/sstat.html).
```bash
$ sstat --all --format JobID,NTasks,MaxRSS,MinCPU,AveCPU -j <JOBID>
```
_hint_: add [`watch`](https://manpages.ubuntu.com/manpages/latest/en/man1/watch.1.html) before `sstat` for automatic refreshing
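For example, to refresh the statistics every 30 seconds (the interval is only a suggestion):
```bash
$ watch -n 30 sstat --all --format JobID,NTasks,MaxRSS,MinCPU,AveCPU -j <JOBID>
```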
While a job runs, you can also use [`scontrol`](https://slurm.schedmd.com/scontrol.html) to check what resources Slurm has allocated, along with other important details of your job:
```bash
$ scontrol show job <JOBID>
```
### Completed Jobs
To find job IDs for jobs that have stopped running, use [`sacct`](https://slurm.schedmd.com/sacct.html). The following example lists your own jobs from the last 30 days:
```bash
$ sacct -u $USER -S `date --date="-30 days" +%Y-%m-%d` -X -o JobID,JobName,Start,End,Elapsed,State
```
From the first column, select a job ID. A job with a longer elapsed time is usually easier to interpret, so choose one of those if you have it. Then inspect the job in more detail:
```bash
$ sacct --format=JobID,ReqMem,MaxRSS,MaxVMSize,Elapsed,NCPUS,SystemCPU,UserCPU -j <JOBID>
```
The value of `CPUTime` is `NCPUS * Elapsed`.
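For example, a job that used 4 CPUs for two hours (`NCPUS=4`, `Elapsed=02:00:00`) reports `CPUTime=08:00:00`.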
_hint_: `sacct -e` shows the available output columns.
## Finding the Bottlenecks
Adding timestamps to our code to identify the bottleneck can be tedious and time-consuming. We should rather use dedicated tools that can measure the instructions or memory usage at the function or code-line level. This activity is called **profiling** and the tools are known as **profilers**.
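For instance, assuming a Python workload in a (hypothetical) script `my_script.py`, the standard-library `cProfile` module gives a per-function breakdown:
```bash
# Run the script under the profiler and sort the report by cumulative time
$ python -m cProfile -s cumulative my_script.py
```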
High-level languages (e.g., Python, R) are not designed to control the cache and the data locality.
In practice we can reduce the execution time by paying attention to memory usage and IO operations. We should copy the data close to the compute infrastructure (or even into memory).
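As a minimal sketch of this idea, assuming the cluster provides node-local scratch space exposed through a `$TMPDIR`-like variable (the variable name and the paths below are assumptions, not cluster-specific facts):
```bash
#!/bin/bash
#SBATCH --time=01:00:00

# Stage the input data from the shared filesystem onto node-local storage
cp /path/to/shared/input.dat "$TMPDIR"/

# Run the (hypothetical) analysis on the local copy and write locally too
./my_analysis "$TMPDIR"/input.dat > "$TMPDIR"/results.out

# Copy the results back to the shared filesystem before the job ends
cp "$TMPDIR"/results.out /path/to/shared/
```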
## Estimating the Hardware Requirements
The sampling approach described below works well for embarrassingly parallel problems.

**Sample!**

1. Take a representative small sample from a batch, whether it is data, files, or parameters. If the batch is heterogeneous, take a few representative samples.
2. If possible, run a pipeline or workflow start-to-finish with that sample.
3. Run the same pipeline with more resources and ask yourself: does it speed up?
4. Extrapolate the time needed for the entire data or parameter set from this small sample.

Doing this will also help you find bugs or errors in the code much faster than when starting with the entire data or parameter set.
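A minimal sketch of this workflow, assuming a hypothetical pipeline script `run_pipeline.sh` that takes an input directory:
```bash
# 1-2. Time the pipeline start-to-finish on a small representative sample
$ time ./run_pipeline.sh sample_subset/

# 3. Repeat with more CPUs and check whether it actually speeds up
$ sbatch --cpus-per-task=4 --wrap "./run_pipeline.sh sample_subset/"

# 4. Extrapolate: if a 1% sample takes 5 minutes, the full set needs roughly
#    500 minutes of the same resources, plus a safety margin.
```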