Monitoring GPU Usage on Linux with dcgm-exporter - Chen Ming's Personal Site

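1. Introduction

In modern data centers, GPUs are no longer just graphics accelerators; they play an increasingly important role in machine learning and scientific computing. Getting the most out of them requires visibility into how they are actually running. dcgm-exporter is a tool from NVIDIA, built on DCGM (Data Center GPU Manager), that exposes GPU telemetry in a format Prometheus can scrape, which makes monitoring and analysis straightforward.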

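2. Prerequisites

An NVIDIA GPU is installed and its driver is configured.
Docker is installed.
The NVIDIA Container Toolkit is installed, since the --gpus flag used below depends on it.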

3. Installation and Configuration

3.1 Pull the dcgm-exporter image

Pull the dcgm-exporter image from the NVIDIA container registry (nvcr.io):

docker pull nvcr.io/nvidia/k8s/dcgm-exporter:<version>

Replace <version> with a currently available image tag.

3.2 Run dcgm-exporter

Run the dcgm-exporter container with Docker and publish its port on the host:

docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:<version>

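This starts dcgm-exporter in the background (-d), gives the container access to all GPUs (--gpus all), and maps container port 9400 to the same port on the host. Because of --rm, the container is removed as soon as it stops.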
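If the container is started from a script rather than by hand, the command above can be assembled programmatically. The following is a minimal sketch, not part of dcgm-exporter itself: the function name build_run_command and the host_port parameter are hypothetical, and the only flags used are the ones already shown above.

```python
IMAGE = "nvcr.io/nvidia/k8s/dcgm-exporter"


def build_run_command(version: str, host_port: int = 9400) -> list[str]:
    """Return the argv for the `docker run` command shown above.

    `version` must be a real image tag; the literal placeholder
    "<version>" is rejected so it cannot slip into a script unnoticed.
    """
    if not version or "<" in version or ">" in version:
        raise ValueError("replace <version> with a real image tag")
    return [
        "docker", "run", "-d", "--gpus", "all", "--rm",
        # host_port is a hypothetical knob; the container side stays 9400
        "-p", f"{host_port}:9400",
        f"{IMAGE}:{version}",
    ]
```

The returned list can be passed to subprocess.run(..., check=True). Building an argv list instead of a single shell string avoids quoting problems.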

4. Verify dcgm-exporter

Query the metrics endpoint that dcgm-exporter exposes:

curl http://localhost:9400/metrics

If everything is working, the response contains a series of metrics whose names start with DCGM_.
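The same check can be automated. The sketch below fetches the endpoint and keeps only the samples whose names start with DCGM_. It is a simplified parser for the Prometheus text format, written for illustration; the function names are hypothetical, and it assumes the exporter is reachable at the URL used above.

```python
import re
from urllib.request import urlopen

# A sample line looks like: name{label="value",...} 12.5
# The {labels} part is optional.
_SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(url: str = "http://localhost:9400/metrics") -> str:
    """Download the raw metrics text (assumes the exporter is running)."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_dcgm_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Return (name, labels, value) for every DCGM_ sample in `text`."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        m = _SAMPLE_RE.match(line)
        if m is None or not m.group("name").startswith("DCGM_"):
            continue
        labels = dict(_LABEL_RE.findall(m.group("labels") or ""))
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples
```

Calling parse_dcgm_metrics(fetch_metrics()) and checking that the result is non-empty gives a simple health check for the exporter.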

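5. Prometheus Integration

5.1 Configure Prometheus

In the Prometheus configuration file (usually prometheus.yml), add a scrape job for the dcgm-exporter endpoint. If the file already has a scrape_configs section, append the job to that list instead of adding a second scrape_configs key. The localhost target assumes Prometheus runs on the same host as the exporter: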

scrape_configs:
  - job_name: 'dcgm'
    static_configs:
      - targets: ['localhost:9400']

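5.2 Restart Prometheus

Restart the Prometheus service to apply the new configuration (this command assumes Prometheus runs as a systemd service):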

sudo systemctl restart prometheus
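After the restart, it is worth confirming that Prometheus is actually scraping the new job. One way is to ask the Prometheus HTTP API for the built-in up metric of the dcgm job. The sketch below assumes Prometheus listens on its default address http://localhost:9090; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def query_prometheus(base_url: str, promql: str) -> str:
    """Run an instant query and return the raw JSON response body."""
    with urlopen(build_query_url(base_url, promql), timeout=5) as resp:
        return resp.read().decode("utf-8")


def targets_up(response_body: str) -> dict[str, bool]:
    """Map each scraped instance to True if its `up` value is 1."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        r["metric"].get("instance", "?"): r["value"][1] == "1"
        for r in payload["data"]["result"]
    }
```

For example, targets_up(query_prometheus("http://localhost:9090", 'up{job="dcgm"}')) should report True for localhost:9400 once the first scrape has succeeded. An empty result usually means the job has not been loaded or has not been scraped yet.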

6. Grafana Visualization (Optional)

If Grafana is installed, you can import the NVIDIA DCGM Exporter Dashboard to visualize the GPU metrics.

6.1 Add a data source

In Grafana, add Prometheus as a data source.

6.2 Import the dashboard

Import the prebuilt NVIDIA DCGM Exporter Dashboard (ID 12239) into Grafana.

NVIDIA DCGM Exporter Dashboard: https://grafana.com/grafana/dashboards/12239

7. Example Metrics

The following are examples of GPU metrics that dcgm-exporter exposes:

DCGM_FI_DEV_GPU_UTIL: GPU utilization (%)
DCGM_FI_DEV_FB_USED: framebuffer (GPU memory) used (MiB)
DCGM_FI_DEV_GPU_TEMP: GPU temperature (°C)
DCGM_FI_DEV_POWER_USAGE: power draw (W)
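To illustrate how these four metrics might be consumed outside Grafana, the sketch below groups samples by GPU and flags GPUs above a temperature limit. The input shape, the friendly field names, and the limit passed by the caller are all assumptions made for this example; dcgm-exporter does not define them.

```python
from collections import defaultdict

# Metric name -> friendly field name (the field names are hypothetical).
FIELDS = {
    "DCGM_FI_DEV_GPU_UTIL": "util_percent",
    "DCGM_FI_DEV_FB_USED": "fb_used_mib",
    "DCGM_FI_DEV_GPU_TEMP": "temp_c",
    "DCGM_FI_DEV_POWER_USAGE": "power_w",
}


def summarize(samples: list[tuple[str, str, float]]) -> dict[str, dict[str, float]]:
    """Group (metric_name, gpu_index, value) samples into one record per GPU.

    Metrics not listed in FIELDS are ignored.
    """
    per_gpu: dict[str, dict[str, float]] = defaultdict(dict)
    for name, gpu, value in samples:
        field = FIELDS.get(name)
        if field is not None:
            per_gpu[gpu][field] = value
    return dict(per_gpu)


def over_temperature(summary: dict[str, dict[str, float]], limit_c: float) -> list[str]:
    """Return the sorted GPU indices whose temperature exceeds `limit_c`.

    GPUs without a temperature reading are never flagged.
    """
    return sorted(
        gpu for gpu, fields in summary.items()
        if fields.get("temp_c", float("-inf")) > limit_c
    )
```

The temperature limit is deliberately left to the caller, because a sensible value depends on the GPU model and the cooling setup.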

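By deploying dcgm-exporter and configuring it as a Prometheus scrape target, we can conveniently monitor GPU usage. Combined with Grafana's visualization, this gives a much clearer picture of how the GPUs are running and supports better-informed resource scheduling and system tuning decisions.

GitHub: https://github.com/NVIDIA/dcgm-exporter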
