Monitoring GPU Performance

NVIDIA GPU

tmam leverages OpenTelemetry to provide visibility into your NVIDIA GPU performance. Track key metrics such as utilization, temperature, memory usage, and power consumption to monitor hardware health and optimize AI workloads effectively.

Using Python SDK

1. Install tmam

2. Initialize tmam in your application

Add the following two lines to your application code:

import tmam

tmam.init(
    otlp_endpoint="YOUR_OTEL_ENDPOINT",
    otlp_headers="YOUR_OTEL_HEADERS",
    collect_gpu_stats=True,  # enable GPU metrics collection
)

Replace YOUR_OTEL_ENDPOINT with the URL of your OpenTelemetry backend, for example http://127.0.0.1:4318 if you are using OpenLIT and a local OTel Collector. Replace YOUR_OTEL_HEADERS with the headers your backend expects, if any.
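If you would rather not hard-code the endpoint and headers, one option is to read them from the environment before calling tmam.init. The sketch below is a minimal illustration, not part of tmam: the helper name build_init_kwargs is hypothetical, and it assumes you export the standard OpenTelemetry variables OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS yourself. Whether tmam reads these variables on its own is not stated here, so the helper passes them explicitly.

```python
import os

# Local OTel Collector address used as the example above.
DEFAULT_ENDPOINT = "http://127.0.0.1:4318"


def build_init_kwargs(env=None):
    """Collect keyword arguments for tmam.init from environment variables.

    Hypothetical helper: only the three parameter names shown in the
    initialization step (otlp_endpoint, otlp_headers, collect_gpu_stats)
    are taken from the documentation.
    """
    env = os.environ if env is None else env
    kwargs = {
        "otlp_endpoint": env.get("OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULT_ENDPOINT),
        "collect_gpu_stats": True,
    }
    headers = env.get("OTEL_EXPORTER_OTLP_HEADERS")
    if headers:
        # Passed through unchanged; the exact format tmam expects is an assumption.
        kwargs["otlp_headers"] = headers
    return kwargs


if __name__ == "__main__":
    # Printing instead of calling tmam.init keeps the sketch runnable
    # on a machine where tmam is not installed.
    print(build_init_kwargs())
```

In an application, the result would be passed on as tmam.init(**build_init_kwargs()).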

GPU Metrics

This section highlights the GPU-related OpenTelemetry metrics collected by tmam. These metrics provide clear insights into GPU performance and resource utilization, making it easier to monitor system behavior in applications that rely on LLMs or other compute-intensive workloads. They complement trace data and are ideal for building dashboards to track GPU usage and performance in real time.

| Metric Name | Description | Unit | Type | Attributes |
|---|---|---|---|---|
| gpu.utilization | GPU Utilization in percentage | percent | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
| gpu.enc.utilization | GPU encoder Utilization in percentage | percent | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
| gpu.dec.utilization | GPU decoder Utilization in percentage | percent | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
| gpu.temperature | GPU Temperature in Celsius | Celsius | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
| gpu.fan_speed | GPU Fan Speed (0-100) as an integer | Integer | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
| gpu.memory.available | Available GPU Memory in MB | MB | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
| gpu.memory.total | Total GPU Memory in MB | MB | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
| gpu.memory.used | Used GPU Memory in MB | MB | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
| gpu.memory.free | Free GPU Memory in MB | MB | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
| gpu.power.draw | GPU Power Draw in Watts | Watt | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
| gpu.power.limit | GPU Power Limit in Watts | Watt | Gauge | telemetry.sdk.name, gen_ai.application_name, gen_ai.environment, gpu_index, gpu_name, gpu_uuid |
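Because every metric above is a raw gauge, dashboards often combine pairs of them into ratios, such as memory used versus total or power draw versus limit. The following sketch shows one way that arithmetic might look. It does not query tmam or a GPU: the GpuSample class, the summarize function, and the 85 degree threshold are hypothetical names and values chosen for illustration, and the sample numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One reading of selected gauges for a single GPU (hypothetical container)."""
    memory_used_mb: float    # gpu.memory.used
    memory_total_mb: float   # gpu.memory.total
    power_draw_w: float      # gpu.power.draw
    power_limit_w: float     # gpu.power.limit
    temperature_c: float     # gpu.temperature


def percent(part, whole):
    """Return part/whole as a percentage, or 0.0 when whole is not positive."""
    return 100.0 * part / whole if whole > 0 else 0.0


def summarize(sample, temp_limit_c=85.0):
    """Derive dashboard-style ratios from raw gauge values.

    temp_limit_c is an illustrative threshold, not a value defined by tmam.
    """
    return {
        "memory_used_pct": percent(sample.memory_used_mb, sample.memory_total_mb),
        "power_used_pct": percent(sample.power_draw_w, sample.power_limit_w),
        "too_hot": sample.temperature_c >= temp_limit_c,
    }


if __name__ == "__main__":
    # Made-up reading: half the memory and half the power budget in use.
    sample = GpuSample(
        memory_used_mb=8192,
        memory_total_mb=16384,
        power_draw_w=150,
        power_limit_w=300,
        temperature_c=70,
    )
    print(summarize(sample))
```

In practice these ratios are usually computed in the metrics backend's query language, grouped by gpu_index or gpu_uuid, rather than in application code.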