packages/chip/scripts/alphachip/nebius_h200_runbook.md
Use this runbook when local 16 GB VRAM is too small or too slow. AlphaChip's published Ariane-scale recipe used 8x V100 for training plus many CPU collect workers; a single H200 should be enough for first E1 experiments, but wall-clock time will still depend on how many CPU collect jobs feed Reverb.
git clone <this-repo> e1-chip
cd e1-chip/packages/chip
git clone https://github.com/google-research/circuit_training.git external/circuit_training
git -C external/circuit_training checkout r0.0.4
scripts/alphachip/build_container.sh
For a GPU image:
ALPHACHIP_GPU_IMAGE=1 scripts/alphachip/build_container.sh
The lighter path currently in use is to build the normal AlphaChip image and then add the CUDA 12.2 user-space wheels that TensorFlow 2.15 expects:
scripts/alphachip/build_container.sh
ALPHACHIP_IMAGE=circuit_training:e1-r0.0.4-cuda-pip \
scripts/alphachip/build_cuda_runtime_image.sh
If Docker's NVIDIA runtime is unavailable but the GPU devices are visible, the local wrappers can use direct device mounts:
ALPHACHIP_GPU_MODE=manual USE_GPU=True scripts/alphachip/run_toy_training.sh
USE_GPU=True NUM_COLLECT_JOBS=8 scripts/alphachip/run_toy_training.sh
Prepare the E1 soft-macro benchmark from an OpenLane DEF:
ALPHACHIP_BENCH_DIR=/shared/alphachip/e1_softmacro_full \
scripts/alphachip/prepare_e1_softmacro_benchmark.sh \
--def pd/openlane/runs/<run>/46-openroad-detailedrouting/e1_chip_top.def \
--cols 16 \
--rows 16
Measure the OpenROAD-derived baseline:
ALPHACHIP_COMPARE_DIR=/shared/alphachip/e1_softmacro_full/compare \
scripts/alphachip/compare_proxy_costs.sh /shared/alphachip/e1_softmacro_full
Run a first integrated GPU experiment:
ALPHACHIP_BENCH_DIR=/shared/alphachip/e1_softmacro_full \
ALPHACHIP_RUN_DIR=/shared/alphachip/e1_softmacro_full_train \
USE_GPU=True \
NUM_COLLECT_JOBS=8 \
SEQUENCE_LENGTH=257 \
OBS_MAX_NUM_NODES=512 \
OBS_MAX_NUM_EDGES=8192 \
OBS_MAX_GRID_SIZE=16 \
TRAIN_ITERATIONS=5 \
EPISODES_PER_ITERATION=16 \
PER_REPLICA_BATCH_SIZE=16 \
scripts/alphachip/run_e1_softmacro_training.sh
The integrated runner starts Reverb, CPU collectors, the learner, and the upstream evaluator. The evaluator writes:
<ALPHACHIP_RUN_DIR>/run_00/eval_output/rl_opt_placement.plc
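When the comparison step is scripted right after training, it can help to block until the evaluator has actually written the placement file. The following is a minimal sketch, not part of the repository scripts: `wait_for_file` is a hypothetical helper, and treating "file exists and is non-empty" as "evaluator finished" is an assumption about how the evaluator writes its output.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in scripts/alphachip): poll until a file exists and
# is non-empty, or give up after a timeout.
# Usage: wait_for_file <path> [timeout_seconds] [poll_interval_seconds]
wait_for_file() {
  local path="$1"
  local timeout="${2:-3600}"
  local interval="${3:-30}"
  local waited=0
  while [ ! -s "$path" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${waited}s waiting for $path" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  # Print the path so callers can capture it.
  echo "$path"
}
```

For example, `wait_for_file "$ALPHACHIP_RUN_DIR/run_00/eval_output/rl_opt_placement.plc" 7200 60` would wait up to two hours, checking once a minute.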
Compare the exported placement:
ALPHACHIP_PLC=/shared/alphachip/e1_softmacro_full_train/run_00/eval_output/rl_opt_placement.plc \
ALPHACHIP_COMPARE_DIR=/shared/alphachip/e1_softmacro_full/compare \
scripts/alphachip/compare_proxy_costs.sh /shared/alphachip/e1_softmacro_full
The current full E1 benchmark can be packaged for upload:
scripts/alphachip/package_nebius_payload.sh /tmp/e1-alphachip/e1_softmacro_full
This writes:
build/alphachip/nebius/e1_alphachip_payload.tar.gz
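Before uploading, a quick integrity check on the archive can catch a truncated or missing payload. The sketch below is an illustration only: `check_payload` is a hypothetical helper, and it verifies only that the archive can be listed, not that its contents are complete.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in scripts/alphachip): fail early if the payload
# archive is missing or unreadable, otherwise print its member count.
check_payload() {
  local archive="${1:-build/alphachip/nebius/e1_alphachip_payload.tar.gz}"
  if [ ! -f "$archive" ]; then
    echo "missing archive: $archive" >&2
    return 1
  fi
  # tar -tzf lists members without extracting; a corrupt gzip stream makes it fail.
  local listing
  listing="$(tar -tzf "$archive")" || {
    echo "unreadable archive: $archive" >&2
    return 1
  }
  if [ -z "$listing" ]; then
    echo 0
  else
    printf '%s\n' "$listing" | wc -l | tr -d ' '
  fi
}
```

Running `check_payload` with no arguments from `packages/chip` checks the default output path named above.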
After extracting that archive on an H200 host, the remote one-command runner is:
NUM_COLLECT_JOBS=8 \
TRAIN_ITERATIONS=5 \
EPISODES_PER_ITERATION=16 \
PER_REPLICA_BATCH_SIZE=16 \
scripts/alphachip/run_h200_payload.sh
run_h200_payload.sh builds circuit_training:e1-r0.0.4 first, then derives
circuit_training:e1-r0.0.4-cuda-pip, runs the OpenROAD proxy baseline, trains,
and re-runs the comparison against the evaluator-exported AlphaChip PLC.
The Nebius CLI is configured in this workspace, but federation auth may need to be refreshed before VM commands work:
nebius profile list
nebius compute instance list --parent-id project-e00kfz6cpr00q21z892vec
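Because stale federation auth only shows up when a command fails, a small preflight wrapper can make that failure explicit before a longer script continues. This is a sketch under assumptions: `preflight` is a hypothetical helper, and it only reports that the wrapped command exited non-zero; it does not know why the command failed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command quietly and report if it fails.
# Usage: preflight <command> [args...]
preflight() {
  if ! "$@" >/dev/null 2>&1; then
    echo "preflight failed: $*" >&2
    echo "if this is a Nebius CLI call, federation auth may need a refresh" >&2
    return 1
  fi
}
```

For example, `preflight nebius compute instance list --parent-id project-e00kfz6cpr00q21z892vec || exit 1` stops a script before it attempts any VM commands.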
For larger runs, either keep the integrated runner or split the jobs manually:
run ppo_reverb_server on the Reverb host, many ppo_collect jobs on CPU hosts,
train_ppo --use_gpu on the H200 host, and learning.eval pointed at the same
variable-container server. Match the job split in the upstream docs/ARIANE.md and
tune sequence_length to the number of movable E1 soft macros.
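The integrated experiment above pairs a 16 x 16 soft-macro grid with `SEQUENCE_LENGTH=257`, which suggests one step per movable soft macro plus one extra step. The sketch below encodes that inference; the helper name is hypothetical, and both the "one macro per grid cell" and the "+1" rule are assumptions read off this runbook's own values, so check them against the upstream configuration before relying on them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive SEQUENCE_LENGTH from the soft-macro grid.
# Assumption: one movable soft macro per grid cell, plus one extra step,
# matching SEQUENCE_LENGTH=257 for --cols 16 --rows 16 in this runbook.
sequence_length_for_grid() {
  local cols="$1"
  local rows="$2"
  echo $(( cols * rows + 1 ))
}
```

For example, `SEQUENCE_LENGTH="$(sequence_length_for_grid 16 16)"` reproduces the value used in the first integrated GPU experiment.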