docs/advance/grafana_prometheus.md
Author: https://github.com/meituan-search
Last updated: 12/05/2025.
verl provides an additional training monitoring capability that uses Prometheus and Grafana to display rollout information during training, enhancing system observability and facilitating further performance optimization.
The system automatically configures Prometheus to scrape metrics from rollout servers, eliminating manual configuration steps.
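For reference, the auto-generated prometheus.yml contains a scrape job of roughly the following shape. The job name and target list below are illustrative assumptions only; the actual file is produced automatically and may differ:

```yaml
# Illustrative sketch only: the real prometheus.yml is auto-generated by verl.
scrape_configs:
  - job_name: rollout_servers        # hypothetical job name
    scrape_interval: 15s
    static_configs:
      - targets:
          - "10.0.0.2:8000"          # hypothetical rollout server host:port
          - "10.0.0.3:8000"
```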
The figures below show the performance of Qwen3-235B on the AIME2024 dataset with a response length of 20k, where the emergence of a long-tail problem is clearly observable.

The following figure presents fully asynchronous training of the Qwen3-235B model. Here, resource idleness is distinctly noticeable, indicating that rollout resources can be reduced.

These two examples illustrate the necessity of system observability.
The overall workflow consists of the following steps:
First, set the necessary environment variables and start the Ray service.
Reference: configure-manage-dashboard
```bash
# Master node environment variables
export GF_SERVER_HTTP_PORT=3000         # Grafana service default port (customizable)
export PROMETHEUS_PORT=9090             # Prometheus service default port (customizable)
export RAY_HEAD_PORT=6379               # Ray master node port (customizable)
export RAY_DASHBOARD_PORT=8265          # Ray dashboard default port (customizable)
export GRAFANA_PATHS_DATA=/tmp/grafana  # Grafana data storage directory (customizable)
export RAY_GRAFANA_HOST="http://${master_ip}:${GF_SERVER_HTTP_PORT}"  # Grafana address associated with Ray
export RAY_PROMETHEUS_HOST="http://${master_ip}:${PROMETHEUS_PORT}"   # Prometheus address associated with Ray

# Start Ray on the master node
ray start --head --port=${RAY_HEAD_PORT} --dashboard-port=${RAY_DASHBOARD_PORT}

# Start Ray on worker nodes (master_addr is the master node's address)
ray start --address=${master_addr}:${RAY_HEAD_PORT}
```
Verification: Visit http://master_ip:8265 to confirm Ray has started successfully.
Grafana is used to display metrics collected by Prometheus (such as cache hit rate, throughput, etc.):
```bash
# Master node
nohup grafana-server \
    --config /tmp/ray/session_latest/metrics/grafana/grafana.ini \
    --homepath /usr/share/grafana \
    web > grafana.log 2>&1 &
```
Verification: Visit http://master_ip:3000 to confirm Grafana has started successfully (default credentials: admin/admin).
If you need to change the port, modify the GF_SERVER_HTTP_PORT environment variable, and grafana-server will automatically recognize it.
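Equivalently, the port can be set in the grafana.ini passed via --config: Grafana's `[server]` section recognizes `http_port`, and environment variables of the form `GF_<SECTION>_<KEY>` override the file's values. Note that the grafana.ini here is generated by Ray, so the environment-variable route is usually more convenient:

```ini
[server]
http_port = 3000
```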
Prometheus is responsible for scraping metrics from vLLM services and storing them as time-series data:
```bash
# Master node
nohup prometheus \
    --config.file /tmp/ray/session_latest/metrics/prometheus/prometheus.yml \
    --web.enable-lifecycle \
    --web.listen-address=:${PROMETHEUS_PORT} \
    > prometheus.log 2>&1 &
```
Verification: Visit http://master_ip:9090 to confirm Prometheus service has started successfully.
Start verl training with the following parameters configured:
Required Configuration:

```
actor_rollout_ref.rollout.mode="async"
actor_rollout_ref.rollout.disable_log_stats=False
actor_rollout_ref.rollout.prometheus.enable=True
```

Optional Configuration:

- actor_rollout_ref.rollout.prometheus.port=9090 (can be omitted when using the default port)
- actor_rollout_ref.rollout.prometheus.file="/tmp/ray/session_latest/metrics/prometheus/prometheus.yml" (can be omitted when using the default path)
- actor_rollout_ref.rollout.prometheus.served_model_name="Qwen3-235B" (by default, served_model_name uses model_path.split("/")[-1] for data statistics; users can customize another alias with this option)

Shell Script Example:
```bash
WORKING_DIR=${WORKING_DIR:-"${PWD}"}
RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}

rollout_mode="async"
rollout_name="vllm" # Options: sglang or vllm
if [ "$rollout_mode" = "async" ]; then
    export VLLM_USE_V1=1
    return_raw_chat="True"
fi

# Synchronous training
ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
    --working-dir "${WORKING_DIR}" \
    -- python3 -m verl.trainer.main_ppo \
    data.return_raw_chat=${return_raw_chat} \
    actor_rollout_ref.rollout.name=${rollout_name} \
    actor_rollout_ref.rollout.mode=${rollout_mode} \
    actor_rollout_ref.rollout.disable_log_stats=False \
    actor_rollout_ref.rollout.prometheus.enable=True
    ...

# Asynchronous training
ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
    --working-dir "${WORKING_DIR}" \
    -- python3 -m verl.experimental.fully_async_policy.fully_async_main \
    data.return_raw_chat=${return_raw_chat} \
    actor_rollout_ref.rollout.name=${rollout_name} \
    actor_rollout_ref.rollout.mode=${rollout_mode} \
    actor_rollout_ref.rollout.disable_log_stats=False \
    actor_rollout_ref.rollout.prometheus.enable=True
    ...
```
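As noted in the configuration above, when prometheus.served_model_name is not set, the metric label falls back to the last component of the model path. A minimal sketch of that model_path.split("/")[-1] convention (the helper name here is hypothetical, not part of verl's API):

```python
def default_served_model_name(model_path: str) -> str:
    """Mirror of the documented model_path.split("/")[-1] default."""
    return model_path.split("/")[-1]

# A local checkpoint path and a bare model name both resolve to the model name.
print(default_served_model_name("/models/Qwen/Qwen3-235B"))  # Qwen3-235B
print(default_served_model_name("Qwen3-235B"))               # Qwen3-235B
```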
After task execution, verify that Prometheus is correctly collecting metrics.
Verification: Visit the Prometheus interface at http://master_ip:9090 and search for `vllm:` or `sglang:` to confirm metrics are being reported correctly.
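Once metrics appear, a few example PromQL queries can be evaluated in the Prometheus UI. The metric names below follow vLLM's naming scheme but can vary across vLLM versions, so treat them as assumptions to adapt:

```promql
# Requests currently being processed by the rollout engine
vllm:num_requests_running

# Generation token throughput over the last minute
rate(vllm:generation_tokens_total[1m])

# KV-cache utilization (0 to 1)
vllm:gpu_cache_usage_perc
```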
Troubleshooting:

If no metrics appear:

- Check AgentLoopManager to find the rollout server port.
- Visit http://master_ip:server_port/metrics to verify server metrics are available.
- Confirm actor_rollout_ref.rollout.disable_log_stats=False is set.

After task execution, log in to Grafana to view and customize monitoring dashboards.
Login: Visit http://master_ip:3000 (default credentials: admin/admin)
Import Dashboard: Dashboards → New → Import → Upload dashboard JSON file

Available Dashboards: