Verified Commit 5199a289 authored by Andrei Plamada

improve monitoring

parent e96eb979
@@ -67,7 +67,7 @@ When benchmarking:
- make sure the system is not used for other tasks (we can check this by monitoring the resources)
- it is hard to compare different infrastructures (we should avoid doing it)
- it is hard to compared when using different inputs (we should avoid doing it)
- it is hard to compare when using different inputs (we should avoid doing it)
- IO operations make the benchmarking less predictable (more later)
## Resource Monitoring
@@ -131,7 +131,7 @@ Source: https://developer.nvidia.com/system-management-interface
## Job Monitoring with Slurm
The job scheduler Slurm can report statistics for jobs that can help you understand what resources to request. Here is a short guide to the steps for resource monitoring.
The job scheduler Slurm can report statistics for jobs that can help you understand what resources to request.
_note_: Slurm makes measurements every 60 seconds, which may result in instantaneous spikes not being measured.
@@ -139,7 +139,9 @@ _hint_: [Resource monitor](https://cctools.readthedocs.io/en/latest/resource_mon
### Running Jobs
For currently running jobs, you can use [`squeue`](https://slurm.schedmd.com/squeue.html), and specifically for your own jobs: `squeue -u $USER` or `squeue --me`. The output of this command shows the **JOBID** in the first column (for running jobs only).
For pending, currently running, and completing jobs you can use [`squeue`](https://slurm.schedmd.com/squeue.html), and specifically for your own jobs: `squeue -u $USER` or `squeue --me`. The output of this command shows the **JOBID** in the first column.
You can check only your running jobs with `squeue -u $USER --states=RUNNING`.
With the **JOBID** for your currently running job, you can run [`sstat`](https://slurm.schedmd.com/sstat.html).
@@ -157,26 +159,26 @@ $ scontrol show job <JOBID>
### Access the Corresponding Compute Node
You cannot connect to the compute nod via ssh. You can access the node similar to [running an interactive session](https://docs.s3it.uzh.ch/how-to_articles/how_to_run_an_interactive_session/):
You cannot connect to the compute node via ssh. You can access the node similar to [running an interactive session](https://docs.s3it.uzh.ch/how-to_articles/how_to_run_an_interactive_session/):
```bash
$ srun --pty --interactive --jobid <JOBID> bash -l
```
If the job is using multiple nodes, you can request a given one using `--nodelist=<NODE>`.
If you run a multi-node job, you can request a given node using `--nodelist=<NODE>`.
### Completed Jobs
To find job IDs for jobs that have stopped running, use [`sacct`](https://slurm.schedmd.com/sacct.html), and in this example, my user's jobs from the last 30 days:
To find the JobID for jobs that have finished running, use [`sacct`](https://slurm.schedmd.com/sacct.html), and in this example, my user's jobs from the last 30 days:
```bash
$ sacct -u $USER -S `date --date="-30 days" +%Y-%m-%d` -X -o JobID,JobName,Start,End,Elapsed,State
```
From the first column, select a job ID for a job. A job that ran for longer may be easier to understand, so if you have one, choose a job with a longer elapsed time.
From the first column, select a JobID for a job. A job that ran for longer may be easier to understand, so if you have one, choose a job with a longer elapsed time.
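With many rows, picking the longest job by eye gets tedious. The following is a minimal sketch, not part of the original guide. It assumes the job list is produced with `sacct -n -P -o JobID,Elapsed` (no header, pipe-separated) and that `Elapsed` has the form `[DD-]HH:MM:SS`. The helper name `longest_job` and the sample job IDs are hypothetical.

```bash
# Read "JobID|Elapsed" lines on stdin and print the JobID with the longest Elapsed.
# Assumes Elapsed looks like [DD-]HH:MM:SS.
longest_job() {
    awk -F'|' '
        # Convert [DD-]HH:MM:SS to seconds; d, t and days are awk-local variables.
        function secs(e,   d, t, days) {
            days = 0
            if (index(e, "-") > 0) { split(e, d, "-"); days = d[1]; e = d[2] }
            split(e, t, ":")
            return days * 86400 + t[1] * 3600 + t[2] * 60 + t[3]
        }
        { s = secs($2); if (NR == 1 || s > max) { max = s; id = $1 } }
        END { if (NR > 0) print id }
    '
}

# Made-up sample rows standing in for real sacct output.
# On a cluster the input could instead come from, for example:
#   sacct -u "$USER" -S "$(date --date='-30 days' +%Y-%m-%d)" -X -n -P -o JobID,Elapsed
printf '101|00:05:00\n102|1-00:00:00\n103|10:00:00\n' | longest_job   # prints 102
```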
```bash
$ sacct --format=JobID,ReqMem,MaxRSS,MaxVMSize,Elapsed,NCPUS,SystemCPU,UserCPU -j <JOBID>
$ sacct --format=JobID,ReqMem,MaxRSS,MaxVMSize,Elapsed,NCPUS,SystemCPU,UserCPU -j <JobID>
```
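One way to read this output is to compare `MaxRSS` (peak memory used) with `ReqMem` (memory requested). The sketch below is an illustration under assumptions: both values are taken to carry a plain `K`, `M`, `G`, or `T` suffix. Some Slurm versions append an extra letter to `ReqMem` to mark per-CPU or per-node requests, which this sketch does not handle. The function names `to_mib` and `mem_usage_percent` are hypothetical.

```bash
# Convert a size such as 1048576K, 512M, or 4G to whole MiB.
# Assumption: the value ends in exactly one of K, M, G, T.
to_mib() {
    printf '%s\n' "$1" | awk '{
        unit = substr($0, length($0))
        num  = substr($0, 1, length($0) - 1) + 0
        if      (unit == "K") f = 1 / 1024
        else if (unit == "M") f = 1
        else if (unit == "G") f = 1024
        else if (unit == "T") f = 1024 * 1024
        else { print "unsupported unit: " $0 > "/dev/stderr"; exit 1 }
        printf "%.0f\n", num * f
    }'
}

# Percentage of the requested memory that was actually used (integer division).
mem_usage_percent() {
    echo $(( 100 * $(to_mib "$1") / $(to_mib "$2") ))
}

mem_usage_percent 1048576K 4G   # MaxRSS=1048576K, ReqMem=4G; prints 25
```

A low percentage suggests the memory request could probably be reduced for similar jobs, keeping in mind that Slurm samples usage periodically and may miss short spikes.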
The value for `CPUTime` is `NCPUS * Elapsed`
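As a worked example of this formula, the sketch below converts an `Elapsed` value to seconds and multiplies it by `NCPUS`. It assumes `Elapsed` has the form `[DD-]HH:MM:SS`. The function names are hypothetical and not part of Slurm.

```bash
# Convert an Elapsed value of the form [DD-]HH:MM:SS to seconds.
elapsed_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; rest = $0
        # Jobs longer than a day carry a "DD-" prefix.
        if (index(rest, "-") > 0) { split(rest, d, "-"); days = d[1]; rest = d[2] }
        split(rest, t, ":")
        print days * 86400 + t[1] * 3600 + t[2] * 60 + t[3]
    }'
}

# CPUTime in seconds = NCPUS * Elapsed.
cpu_time_seconds() {
    echo $(( $1 * $(elapsed_to_seconds "$2") ))
}

cpu_time_seconds 4 00:30:00   # 4 CPUs for 30 minutes; prints 7200
```

Comparing this number with the sum of `SystemCPU` and `UserCPU` gives a rough idea of how much of the allocated CPU time the job actually used.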