DCGM Projects

DCGM

NVIDIA Data Center GPU Manager (DCGM) is a robust suite of tools for managing, monitoring, and diagnosing NVIDIA datacenter GPUs in cluster environments.

DCGM provides core GPU administration capabilities: active health monitoring, comprehensive diagnostics, and system alerts. It simplifies cluster management by collecting continuous GPU telemetry, including metrics such as SM clock frequency (DCGM_FI_DEV_SM_CLOCK) and memory temperature (DCGM_FI_DEV_MEMORY_TEMP). Infrastructure teams can run it standalone or integrate it into cluster management tools and resource schedulers. For containerized environments, the DCGM-Exporter component exposes GPU telemetry in Prometheus format, which suits platforms such as Kubernetes and helps teams track GPU reliability and uptime across the datacenter.
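As a rough illustration of how the telemetry above might be consumed, the following Python sketch parses text in the Prometheus exposition format and flags GPUs whose memory temperature exceeds a limit. The sample text is hand-written for illustration and was not captured from a real DCGM-Exporter instance; the `gpu` label, the helper names `parse_metrics` and `hot_gpus`, and the 85 °C limit are all assumptions. The parser is deliberately simplified and does not handle escaped quotes or braces inside label values.

```python
import re

# Illustrative sample only: hand-written Prometheus-style text,
# not real output from a DCGM-Exporter instance.
SAMPLE = """\
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0"} 1410
DCGM_FI_DEV_SM_CLOCK{gpu="1"} 1395
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
DCGM_FI_DEV_MEMORY_TEMP{gpu="0"} 62
DCGM_FI_DEV_MEMORY_TEMP{gpu="1"} 91
"""

# metric_name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def hot_gpus(samples, limit_c):
    """Return sorted GPU labels whose memory temperature exceeds limit_c."""
    return sorted(
        labels.get("gpu", "?")
        for name, labels, value in samples
        if name == "DCGM_FI_DEV_MEMORY_TEMP" and value > limit_c
    )


if __name__ == "__main__":
    # 85.0 is a hypothetical alert limit chosen for this example.
    print(hot_gpus(parse_metrics(SAMPLE), limit_c=85.0))
```

In a real deployment the text would typically come from the exporter's metrics endpoint, and a Prometheus server with alerting rules would usually take the place of hand-rolled parsing like this.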

https://developer.nvidia.com/dcgm