add GPU to monitoring

6d1cfe16 · Andrei Plamada · 55eddca7 · 6d1cfe16
Verified Commit 6d1cfe16 authored 5 months ago by Andrei Plamada
--- a/Resource_Monitoring_and_Benchmarking/README.md
+++ b/Resource_Monitoring_and_Benchmarking/README.md
@@ -27,7 +27,7 @@ Moreover, with Science IT computing infrastructure, the cost contribution is bas
 The aim of this guide is to introduce techniques that will help to:

 - measure the total execution time of your application
- understand your code performance by examining the system utilization (CPU and Memory)
+- understand your code performance by examining the system utilization (CPU, Memory, GPU)
 - find the bottlenecks by measuring the runtime and memory usage of an isolated piece of code (Python/R)
 - avoid common pitfalls in assessing the performance on various systems
 - use Slurm to understand the resource usage of your jobs (running or completed)
@@ -70,7 +70,7 @@ When benchmarking:
 - it is hard to compare runs that use different inputs (we should avoid doing it)
 - IO operations make the benchmarking less predictable (more later)

-## Resource Monitoring (CPU and Memory)
+## Resource Monitoring

 First we want to understand the utilization of the system and what processes are active.
 Each operating system has established tools for this:
@@ -119,6 +119,16 @@ $ exit # the session terminates when you exit (run it in the session)

 More info at https://linuxize.com/post/how-to-use-linux-screen/ .

+### `nvtop`
+
+[`nvtop`](https://github.com/Syllo/nvtop) is an `htop`-like task monitor for GPUs and accelerators.
+
+### `nvidia-smi`
+
+> The NVIDIA System Management Interface (`nvidia-smi`) is a command line utility, based on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.
+
+Source: https://developer.nvidia.com/system-management-interface
+
 ## Finding the Bottlenecks

 Adding timestamps to our code to identify the bottleneck can be tedious and time consuming. We should rather use dedicated tools that measure the instructions or memory usage at the function or code line level. This activity is called **profiling** and the tools are known as **profilers**.
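
As an illustration of function-level profiling, here is a minimal sketch using Python's standard-library `cProfile` and `pstats` modules. It is not necessarily the profiler the guide goes on to recommend. The functions `slow_sum`, `fast_sum`, `workload`, and `profile_workload` are hypothetical names invented for this example.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical hot spot: sums squares in a pure-Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_sum(n):
    """Hypothetical cheap step: closed-form formula for the same sum."""
    return (n - 1) * n * (2 * n - 1) // 6


def workload():
    """Toy application that calls both functions."""
    return slow_sum(200_000), fast_sum(200_000)


def profile_workload():
    """Run workload() under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = workload()
    profiler.disable()

    # Collect the report in a string instead of printing it directly.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep only the five most expensive entries.
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


if __name__ == "__main__":
    result, report = profile_workload()
    print(report)
```

In the printed report, `slow_sum` should account for nearly all of the cumulative time, which points to the loop as the bottleneck without any manual timestamps.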