Monitoring GPU Performance
NVIDIA GPU
tmam uses OpenTelemetry to provide visibility into your NVIDIA GPU performance. It tracks key metrics such as utilization, temperature, memory usage, and power consumption, so you can monitor hardware health and optimize AI workloads effectively.
Using Python SDK
Install tmam
Initialize tmam in your application
Add the following lines to your application code:
import tmam

tmam.init(
    otlp_endpoint="YOUR_OTEL_ENDPOINT",
    otlp_headers="YOUR_OTEL_HEADERS",
    collect_gpu_stats=True,  # enable GPU metrics collection
)
Replace YOUR_OTEL_ENDPOINT with the URL of your OpenTelemetry backend, such as http://127.0.0.1:4318 if you are using OpenLIT and a local OTel Collector. Replace YOUR_OTEL_HEADERS with any headers your backend requires.
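To avoid hardcoding the endpoint and headers, the arguments to tmam.init can be assembled from environment variables. The sketch below is one possible approach, not part of tmam itself. The helper names build_init_kwargs and init_from_env are hypothetical. The variable names OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS are the standard OpenTelemetry ones, but whether tmam reads them on its own is not stated here, and passing the headers through as a plain string is an assumption based on the placeholder shown above.

```python
import os


def build_init_kwargs(env=None):
    """Build keyword arguments for tmam.init from environment variables.

    Hypothetical helper: falls back to the local collector URL used as
    an example in this guide when no endpoint is configured.
    """
    env = os.environ if env is None else env
    kwargs = {
        "otlp_endpoint": env.get(
            "OTEL_EXPORTER_OTLP_ENDPOINT", "http://127.0.0.1:4318"
        ),
        "collect_gpu_stats": True,  # GPU metrics are the point of this guide
    }
    headers = env.get("OTEL_EXPORTER_OTLP_HEADERS")
    if headers:
        # Assumption: the header string is passed through unchanged.
        kwargs["otlp_headers"] = headers
    return kwargs


def init_from_env():
    """Initialize tmam using the environment-derived arguments."""
    import tmam  # imported lazily so the helper above works without tmam

    tmam.init(**build_init_kwargs())
```

Calling init_from_env() once at application startup would then take the place of the explicit tmam.init call.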
GPU Metrics
This section highlights the GPU-related OpenTelemetry metrics collected by tmam. These metrics provide clear insights into GPU performance and resource utilization, making it easier to monitor system behavior in applications that rely on LLMs or other compute-intensive workloads. They complement trace data and are ideal for building dashboards to track GPU usage and performance in real time.
Metric Name | Description | Unit | Type | Attributes |
---|---|---|---|---|
gpu.utilization | GPU utilization as a percentage | percent | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
gpu.enc.utilization | GPU encoder utilization as a percentage | percent | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
gpu.dec.utilization | GPU decoder utilization as a percentage | percent | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
gpu.temperature | GPU temperature in degrees Celsius | Celsius | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
gpu.fan_speed | GPU fan speed as an integer from 0 to 100 | Integer | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
gpu.memory.available | Available GPU Memory in MB | MB | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
gpu.memory.total | Total GPU Memory in MB | MB | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
gpu.memory.used | Used GPU Memory in MB | MB | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
gpu.memory.free | Free GPU Memory in MB | MB | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
gpu.power.draw | GPU Power Draw in Watts | Watt | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
gpu.power.limit | GPU Power Limit in Watts | Watt | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
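Dashboards often combine these gauges into ratios, such as memory used against memory total or power draw against power limit. The following sketch shows one way to derive such values and flag GPUs that need attention. It assumes the readings for each GPU have already been collected into a dictionary keyed by the metric names above. The function names, the sample values, and the thresholds (85 degrees Celsius, 90% memory use) are illustrative assumptions, not tmam defaults.

```python
def derived_gpu_stats(sample):
    """Compute usage ratios from one GPU's gauge readings."""
    mem_total = sample["gpu.memory.total"]
    power_limit = sample["gpu.power.limit"]
    return {
        # Guard against a zero denominator, e.g. when a limit is unreported.
        "memory_used_ratio": (
            sample["gpu.memory.used"] / mem_total if mem_total else None
        ),
        "power_draw_ratio": (
            sample["gpu.power.draw"] / power_limit if power_limit else None
        ),
    }


def gpus_over_threshold(samples, max_temp_c=85.0, max_memory_ratio=0.9):
    """Return the gpu_index of each GPU that is too hot or nearly out of memory."""
    flagged = []
    for sample in samples:
        memory_ratio = derived_gpu_stats(sample)["memory_used_ratio"]
        too_hot = sample["gpu.temperature"] > max_temp_c
        too_full = memory_ratio is not None and memory_ratio > max_memory_ratio
        if too_hot or too_full:
            flagged.append(sample["gpu_index"])
    return flagged


# Hypothetical readings for two GPUs (memory in MB, power in Watts).
samples = [
    {
        "gpu_index": 0,
        "gpu.temperature": 62,
        "gpu.memory.used": 8192,
        "gpu.memory.total": 16384,
        "gpu.power.draw": 150,
        "gpu.power.limit": 300,
    },
    {
        "gpu_index": 1,
        "gpu.temperature": 88,
        "gpu.memory.used": 4096,
        "gpu.memory.total": 16384,
        "gpu.power.draw": 280,
        "gpu.power.limit": 300,
    },
]

print(derived_gpu_stats(samples[0]))
print(gpus_over_threshold(samples))
```

In practice the same ratios would usually be computed in the query layer of your observability backend, grouped by gpu_index or gpu_uuid.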