More info at https://linuxize.com/post/how-to-use-linux-screen/ .
Source: https://developer.nvidia.com/system-management-interface
## Job Monitoring with Slurm
The job scheduler Slurm can report statistics for jobs that can help you understand what resources to request. Here is a short guide to the steps for resource monitoring.
_note_: Slurm makes measurements every 60 seconds, which may result in instantaneous spikes not being measured.
_hint_: [Resource monitor](https://cctools.readthedocs.io/en/latest/resource_monitor/) from CCTools is very useful as it can track most subprocesses and has a relatively small overhead. It also tracks much more than Slurm does.
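As a rough sketch of how `resource_monitor` is typically invoked (the wrapped script name is hypothetical; see the linked documentation for the exact options):
```bash
# Wrap the command to be measured after "--"; a summary of its resource
# usage is written to files starting with the given prefix when it finishes.
$ resource_monitor -O my_measurements -- ./my_pipeline.sh
```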
### Running Jobs
For currently running jobs, you can use [`squeue`](https://slurm.schedmd.com/squeue.html), and specifically for your own jobs: `squeue -u $USER` or `squeue --me`. The output of this command shows the **JOBID** in the first column (for running jobs only).
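For example:
```bash
# List only your own jobs; the JOBID is in the first column
$ squeue --me
# Print just the job IDs, one per line, without the header
$ squeue --me --noheader --format %i
```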
With the **JOBID** for your currently running job, you can run [`sstat`](https://slurm.schedmd.com/sstat.html).
```bash
$ sstat --all --format JobID,NTasks,MaxRSS,MinCPU,AveCPU -j <JOBID>
```
_hint_: add [`watch`](https://manpages.ubuntu.com/manpages/latest/en/man1/watch.1.html) before `sstat` for automatic refreshing
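For example, to refresh the statistics every 30 seconds (the interval is only a suggestion):
```bash
$ watch -n 30 sstat --all --format JobID,NTasks,MaxRSS,MinCPU,AveCPU -j <JOBID>
```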
While a job runs, you can also use [`scontrol`](https://slurm.schedmd.com/scontrol.html) to check what resources Slurm has allocated, along with other important details of your job:
```bash
$ scontrol show job <JOBID>
```
### Completed Jobs
To find job IDs for jobs that have stopped running, use [`sacct`](https://slurm.schedmd.com/sacct.html). The following example lists your own jobs from the last 30 days:
```bash
$ sacct -u $USER -S `date --date="-30 days" +%Y-%m-%d` -X -o JobID,JobName,Start,End,Elapsed,State
```
From the first column, select a job ID. A job with a longer elapsed time is usually easier to interpret, so choose one of those if you have it. Then inspect the job in more detail:
```bash
$ sacct --format=JobID,ReqMem,MaxRSS,MaxVMSize,Elapsed,NCPUS,SystemCPU,UserCPU -j <JOBID>
```
The value of `CPUTime` is `NCPUS * Elapsed`.
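For example, a job that used 4 CPUs for two hours (`NCPUS=4`, `Elapsed=02:00:00`) reports `CPUTime=08:00:00`.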
_hint_: `sacct -e` shows the available output columns.
## Finding the Bottlenecks
Adding timestamps to our code to identify the bottleneck can be tedious and time-consuming. We should rather use dedicated tools that can measure the instructions or memory usage at the function or code-line level. This activity is called **profiling** and the tools are known as **profilers**.
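For instance, assuming a Python workload in a (hypothetical) script `my_script.py`, the standard-library `cProfile` module gives a per-function breakdown:
```bash
# Run the script under the profiler and sort the report by cumulative time
$ python -m cProfile -s cumulative my_script.py
```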
High-level languages (e.g., Python, R) are not designed to control the cache and the data locality.
In practice we can reduce the execution time by paying attention to memory usage and IO operations. We should copy the data close to the compute infrastructure (or even into memory).
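As a minimal sketch of this idea, assuming the cluster provides node-local scratch space exposed through a `$TMPDIR`-like variable (the variable name and the paths below are assumptions, not cluster-specific facts):
```bash
#!/bin/bash
#SBATCH --time=01:00:00

# Stage the input data from the shared filesystem onto node-local storage
cp /path/to/shared/input.dat "$TMPDIR"/

# Run the (hypothetical) analysis on the local copy and write locally too
./my_analysis "$TMPDIR"/input.dat > "$TMPDIR"/results.out

# Copy the results back to the shared filesystem before the job ends
cp "$TMPDIR"/results.out /path/to/shared/
```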
## Estimating the Hardware Requirements
The sampling approach described below works well for embarrassingly parallel problems.

**Sample!**

1. Take a representative small sample from a batch, whether it is data, files, or parameters. If the batch is heterogeneous, take a few representative samples.
2. If possible, run a pipeline or workflow start-to-finish with that sample.
3. Run the same pipeline with more resources and ask yourself: does it speed up?
4. Extrapolate the time needed for the entire data or parameter set from this small sample.

Doing this will also help you find bugs or errors in the code much faster than when starting with the entire data or parameter set.
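A minimal sketch of this workflow, assuming a hypothetical pipeline script `run_pipeline.sh` that takes an input directory:
```bash
# 1-2. Time the pipeline start-to-finish on a small representative sample
$ time ./run_pipeline.sh sample_subset/

# 3. Repeat with more CPUs and check whether it actually speeds up
$ sbatch --cpus-per-task=4 --wrap "./run_pipeline.sh sample_subset/"

# 4. Extrapolate: if a 1% sample takes 5 minutes, the full set needs roughly
#    500 minutes of the same resources, plus a safety margin.
```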