Technology
DCGM
NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing, monitoring, and diagnosing NVIDIA data center GPUs in cluster environments.
DCGM covers the core GPU administration tasks: active health monitoring, diagnostics, and system alerts. It collects continuous GPU telemetry and identifies each metric by a named field, such as SM clock frequency (DCGM_FI_DEV_SM_CLOCK) and memory temperature (DCGM_FI_DEV_MEMORY_TEMP). Infrastructure teams can run it standalone or integrate it into cluster management tools and resource schedulers. For containerized environments, the DCGM-Exporter component exposes GPU telemetry in a format that Prometheus can scrape, including from Kubernetes clusters, so operators can spot degraded or failing GPUs before they affect uptime.
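To make the telemetry fields concrete, here is a minimal sketch of reading the two fields named above from metrics text in the Prometheus exposition format, which is the format DCGM-Exporter serves. The sample text, its values, the `gpu` label layout, and the helper names `parse_metrics` and `hot_gpus` are assumptions made for illustration, not output captured from a real system or part of any DCGM API.

```python
import re

# Hypothetical sample in Prometheus text exposition format.
# Values and labels are invented for illustration only.
SAMPLE_METRICS = """\
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0"} 1410
DCGM_FI_DEV_SM_CLOCK{gpu="1"} 1395
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
DCGM_FI_DEV_MEMORY_TEMP{gpu="0"} 62
DCGM_FI_DEV_MEMORY_TEMP{gpu="1"} 88
"""

# metric_name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text, wanted):
    """Return {gpu_id: {field_name: value}} for the wanted field names."""
    readings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None or match.group("name") not in wanted:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        gpu = labels.get("gpu", "unknown")
        readings.setdefault(gpu, {})[match.group("name")] = float(match.group("value"))
    return readings


def hot_gpus(readings, limit_c):
    """Return sorted GPU ids whose memory temperature exceeds limit_c."""
    return sorted(
        gpu
        for gpu, fields in readings.items()
        if fields.get("DCGM_FI_DEV_MEMORY_TEMP", 0.0) > limit_c
    )


if __name__ == "__main__":
    fields = {"DCGM_FI_DEV_SM_CLOCK", "DCGM_FI_DEV_MEMORY_TEMP"}
    readings = parse_metrics(SAMPLE_METRICS, fields)
    print(readings)
    print("GPUs above 85 C:", hot_gpus(readings, 85.0))
```

In a real deployment the text would come from the exporter's metrics endpoint rather than a hard-coded string, and the 85 C threshold here is an arbitrary placeholder, not a recommended limit.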