GPU Monitoring

tmam can collect real-time GPU metrics from NVIDIA and AMD GPUs and display them in the dashboard's GPU Analytics view.


Enabling GPU Monitoring

Pass collect_gpu_stats=True to init():

from tmam import init

init(
    url="http://localhost:5050/api/sdk",
    public_key="pk-tmam-xxxxxxxx",
    secret_key="sk-tmam-xxxxxxxx",
    application_name="my-gpu-app",
    collect_gpu_stats=True,
)

tmam will auto-detect whether an NVIDIA or AMD GPU is present.


GPU Vendor Requirements

NVIDIA GPUs

Install the pynvml library (NVIDIA Management Library Python bindings):

pip install pynvml

Requires NVIDIA drivers to be installed on the host. Works with any CUDA-capable GPU.
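To confirm pynvml can see your GPUs before enabling collection, a quick standalone check can help. This helper is our own illustration, not part of tmam; it returns an empty list when pynvml is missing or no NVIDIA driver is loaded:

```python
def list_nvidia_gpus():
    """Return (index, name) pairs for visible NVIDIA GPUs, or [] if
    pynvml is not installed or NVML cannot initialize (no driver)."""
    try:
        import pynvml
        pynvml.nvmlInit()
    except Exception:  # ImportError, or NVMLError when no driver/GPU
        return []
    try:
        gpus = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            # Older pynvml versions return bytes rather than str
            if isinstance(name, bytes):
                name = name.decode()
            gpus.append((i, name))
        return gpus
    finally:
        pynvml.nvmlShutdown()

print(list_nvidia_gpus())  # e.g. [(0, "NVIDIA A100-SXM4-40GB")], or [] on a CPU-only host
```

An empty list here means tmam's NVIDIA path will find nothing either.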

AMD GPUs

Install the amdsmi library:

pip install amdsmi

Requires AMD ROCm drivers to be installed on the host.
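An analogous pre-flight check for AMD hosts is sketched below. This helper is our own illustration, not part of tmam, and amdsmi function names have shifted between releases, so treat the exact calls as an assumption:

```python
def count_amd_gpus():
    """Return the number of GPUs amdsmi can see, or 0 if amdsmi is
    not installed or ROCm is not set up. Illustrative helper only."""
    try:
        import amdsmi
        amdsmi.amdsmi_init()
    except Exception:  # ImportError, or init failure without ROCm
        return 0
    try:
        return len(amdsmi.amdsmi_get_processor_handles())
    finally:
        amdsmi.amdsmi_shut_down()

print(count_amd_gpus())  # 0 on a host without ROCm
```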


Metrics Collected

For each GPU in the system, tmam collects the following metrics, tagged with the GPU index, UUID, and name:

Metric Name           OTel Name              Description
Utilization           gpu.utilization        Core utilization %
Encoder Utilization   gpu.enc.utilization    Video encoder utilization %
Decoder Utilization   gpu.dec.utilization    Video decoder utilization %
Temperature           gpu.temperature        Temperature in °C
Fan Speed             gpu.fan_speed          Fan speed (NVIDIA only)
Memory Available      gpu.memory.available   Available VRAM in MB
Memory Total          gpu.memory.total       Total VRAM in MB
Memory Used           gpu.memory.used        Used VRAM in MB
Memory Free           gpu.memory.free        Free VRAM in MB
Power Draw            gpu.power.draw         Current power draw in W
Power Limit           gpu.power.limit        Power limit in W

All metrics are tagged with:

  • gpu.index — GPU index (0, 1, 2...)
  • gpu.uuid — GPU UUID
  • gpu.name — GPU model name
  • service.name — your application_name
  • deployment.environment — your environment
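Put together, each exported data point pairs a value with the attribute set above. The sketch below is our own illustration of that shape (tmam actually emits OTel data points, not plain dicts):

```python
def gpu_metric_point(name, value, *, index, uuid, gpu_name, service, env):
    """Bundle a metric value with the attributes tmam tags it with.
    Illustrative only -- field layout is not tmam's wire format."""
    return {
        "name": name,
        "value": value,
        "attributes": {
            "gpu.index": index,
            "gpu.uuid": uuid,
            "gpu.name": gpu_name,
            "service.name": service,
            "deployment.environment": env,
        },
    }

point = gpu_metric_point(
    "gpu.memory.used", 8192,
    index=0, uuid="GPU-1234", gpu_name="NVIDIA A100",
    service="my-gpu-app", env="dev",
)
print(point["attributes"]["gpu.index"])  # 0
```

The attribute set is what lets the dashboard break metrics down per GPU on multi-GPU systems.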

Viewing GPU Metrics

In the dashboard, navigate to Analytics → GPU to see:

  • GPU utilization over time
  • Memory usage (used vs. total)
  • Temperature and power draw
  • Per-GPU breakdowns for multi-GPU systems

No GPU Detected

If collect_gpu_stats=True but no supported GPU is found, tmam logs:

Tmam GPU Instrumentation Error: No supported GPUs found.
If this is a non-GPU host, set `collect_gpu_stats=False` to disable GPU stats.

This does not affect other tracing or metrics collection — it is non-fatal.
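If the same codebase runs on both GPU and CPU-only hosts, you can gate the flag at startup instead of hard-coding it and avoid the warning entirely. The has_gpu helper below is our own convenience pattern, not part of tmam:

```python
def has_gpu():
    """True when pynvml can initialize NVML, i.e. an NVIDIA driver is
    present. Our own helper; add an amdsmi probe for AMD hosts."""
    try:
        import pynvml
        pynvml.nvmlInit()
        pynvml.nvmlShutdown()
        return True
    except Exception:  # ImportError, or NVMLError when no driver/GPU
        return False

print(has_gpu())  # False on a CPU-only host
```

You would then pass collect_gpu_stats=has_gpu() to init().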


Example: LLM + GPU Monitoring

from tmam import init
from transformers import pipeline

init(
    url="http://localhost:5050/api/sdk",
    public_key="pk-tmam-xxxxxxxx",
    secret_key="sk-tmam-xxxxxxxx",
    application_name="local-llm",
    environment="dev",
    collect_gpu_stats=True,  # monitor GPU while running inference
)

# Transformers calls are auto-instrumented
generator = pipeline("text-generation", model="gpt2", device=0)
output = generator("The future of AI is", max_new_tokens=50)
print(output[0]["generated_text"])

While inference runs, tmam records both the LLM span (tokens, latency) and GPU metrics (VRAM usage, utilization) — correlatable by timestamp in the dashboard.