improve monitoring

5199a289 · Andrei Plamada · e96eb979 · 5199a289
Verified Commit 5199a289 authored 5 months ago by Andrei Plamada
--- a/Resource_Monitoring_and_Benchmarking/README.md
+++ b/Resource_Monitoring_and_Benchmarking/README.md
@@ -67,7 +67,7 @@ When benchmarking:

 - make sure the system is not used for other tasks (we can check this by monitoring the resources)
 - it is hard to compare different infrastructures (we should avoid doing it)
- it is hard to compared when using different inputs (we should avoid doing it)
+- it is hard to compare when using different inputs (we should avoid doing it)
 - IO operations make the benchmarking less predictable (more later)

 ## Resource Monitoring
@@ -131,7 +131,7 @@ Source: https://developer.nvidia.com/system-management-interface

 ## Job Monitoring with Slurm

-The job scheduler Slurm can report statistics for jobs that can help you understand what resources to request. Here is a short guide to the steps for resource monitoring.
+The job scheduler Slurm can report statistics for jobs that can help you understand what resources to request.

 _note_: Slurm makes measurements every 60 seconds, which may result in instantaneous spikes not being measured.

@@ -139,7 +139,9 @@ _hint_: [Resource monitor](https://cctools.readthedocs.io/en/latest/resource_mon

 ### Running Jobs

-For currently running jobs, you can use [`squeue`](https://slurm.schedmd.com/squeue.html), and specifically for your own jobs: `squeue -u $USER` or `squeue --me`. The output of this command shows the **JOBID** in the first column (for running jobs only).
+For pending, currently running, and completing jobs you can use [`squeue`](https://slurm.schedmd.com/squeue.html), and specifically for your own jobs: `squeue -u $USER` or `squeue --me`. The output of this command shows the **JOBID** in the first column.
+
+You can check only your running jobs with `squeue -u $USER --states=RUNNING`.

 With the **JOBID** for your currently running job, you can run [`sstat`](https://slurm.schedmd.com/sstat.html).

@@ -157,26 +159,26 @@ $ scontrol show job <JOBID>

 ### Access the Corresponding Compute Node

-You cannot connect to the compute nod via ssh. You can access the node similar to [running an interactive session](https://docs.s3it.uzh.ch/how-to_articles/how_to_run_an_interactive_session/):
+You cannot connect to the compute node via ssh. Instead, you can access the node in a way similar to [running an interactive session](https://docs.s3it.uzh.ch/how-to_articles/how_to_run_an_interactive_session/):

 ```bash
 $ srun --pty --interactive --jobid <JOBID> bash -l
 ```

-If the job is using multiple nodes, you can request a given one using `--nodelist=<NODE>`.
+If you run a multi-node job, you can request a given node using `--nodelist=<NODE>`.

 ### Completed Jobs

-To find job IDs for jobs that have stopped running, use [`sacct`](https://slurm.schedmd.com/sacct.html), and in this example, my user's jobs from the last 30 days:
+To find the JobID of jobs that have finished running, use [`sacct`](https://slurm.schedmd.com/sacct.html). This example lists my user's jobs from the last 30 days:

 ```bash
 $ sacct -u $USER -S `date --date="-30 days" +%Y-%m-%d` -X -o JobID,JobName,Start,End,Elapsed,State
 ```

-From the first column, select a job ID for a job. A job that ran for longer may be easier to understand, so if you have one, choose a job with a longer elapsed time.
+From the first column, select a JobID for a job. A job that ran for longer may be easier to understand, so if you have one, choose a job with a longer elapsed time.

 ```bash
-$ sacct --format=JobID,ReqMem,MaxRSS,MaxVMSize,Elapsed,NCPUS,SystemCPU,UserCPU -j <JOBID>
+$ sacct --format=JobID,ReqMem,MaxRSS,MaxVMSize,Elapsed,NCPUS,SystemCPU,UserCPU -j <JobID>
 ```

 The value for `CPUTime` is `NCPUS * Elapsed`.
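
The relation `CPUTime = NCPUS * Elapsed` can be checked by hand with a small helper. The following is a minimal sketch, not part of the original README: the function names `elapsed_to_seconds` and `cpu_time_seconds` are hypothetical, and it assumes `Elapsed` is given as `HH:MM:SS` with an optional `DD-` day prefix.

```bash
#!/usr/bin/env bash

# Convert an elapsed time of the assumed form [DD-]HH:MM:SS to seconds.
elapsed_to_seconds() {
  local elapsed="$1" days=0 h m s
  # Split off the optional day prefix, e.g. "1-02:03:04".
  if [[ "$elapsed" == *-* ]]; then
    days="${elapsed%%-*}"
    elapsed="${elapsed#*-}"
  fi
  IFS=: read -r h m s <<< "$elapsed"
  # 10# forces base 10 so that values such as "08" are not parsed as octal.
  echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# CPUTime in seconds = NCPUS * Elapsed (in seconds).
cpu_time_seconds() {
  local ncpus="$1" elapsed="$2"
  echo $(( ncpus * $(elapsed_to_seconds "$elapsed") ))
}

# Example: 4 CPUs for 1 hour 30 minutes.
cpu_time_seconds 4 01:30:00   # prints 21600
```

Comparing this product with the sum of `UserCPU` and `SystemCPU` reported by `sacct` gives a rough idea of how much of the allocated CPU time the job actually used.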