PyTorch Data Loading Benchmark


This module contains scripts for benchmarking PyTorch data loading performance across file system implementations, including the Alluxio POSIX API.

Run single-node benchmarking

  • Launch the Alluxio cluster with a master and a worker.
  • Launch Alluxio FUSE to mount the Alluxio namespace at the host path /mnt/alluxio-fuse/ on the benchmarking node.
  • Download the demo image luqqiu/alluxioloadagent:latest, which is built from the Dockerfile included in this module.
  • Start the Docker container with the following command:
docker run -it --rm --name loadtest -e NVIDIA_VISIBLE_DEVICES= -v `pwd`:/v/ -v /mnt:/mnt:rshared -w /v luqqiu/alluxioloadagent:latest bash
  • Prepare the file name list in inputdata.csv, one file path per line, WITHOUT the common Alluxio path prefix. The common path prefix is passed to load.py separately with the -p option.
  • Run the load script:
./run-test.sh 2 load.py --workers 2 --file_name_list inputdata.csv --number_of_files 10000 \
  -p /mnt/alluxio-fuse/data/
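
The file name list can be produced by hand or with a small helper. The following is a minimal sketch, not part of this module: the function name write_file_list is hypothetical, and it assumes that inputdata.csv holds one relative path per line with no header row. Check load.py for the exact format it expects.

```python
import os


def write_file_list(prefix, output_path, limit=None):
    """Write the paths of all files under `prefix` to `output_path`.

    Paths are written relative to `prefix`, one per line, so the common
    prefix (for example /mnt/alluxio-fuse/data/) is not repeated.
    Assumption: the list has no header row.
    """
    relative_paths = []
    for root, _dirs, files in os.walk(prefix):
        for name in files:
            full_path = os.path.join(root, name)
            relative_paths.append(os.path.relpath(full_path, prefix))

    # Sort so that repeated runs produce the same list.
    relative_paths.sort()
    if limit is not None:
        relative_paths = relative_paths[:limit]

    with open(output_path, "w") as out:
        for path in relative_paths:
            out.write(path + "\n")
    return relative_paths
```

For the command above, a call such as write_file_list("/mnt/alluxio-fuse/data/", "inputdata.csv", limit=10000) would list at most as many files as --number_of_files requests.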

Run multi-node benchmarking

  • Launch the Alluxio cluster with a master and a worker.
  • Launch Alluxio FUSE to mount the Alluxio namespace at the host path /mnt/alluxio-fuse/ on each benchmarking node.
  • Launch the Docker container on each benchmarking node.
  • Prepare the file name list, named inputdata.csv, in the Docker container on each node.
  • Run the load script. For example, to benchmark data loading performance on two nodes, run the following command on node one:
export MASTER_ADDR=${NODE_ONE_HOSTNAME} \
&& export MASTER_PORT=${NODE_ONE_PORT} \
&& export WORLD_SIZE=2 \
&& export RANK=0 \
&& ./run-test.sh 2 load.py --workers 2 --file_name_list inputdata.csv --number_of_files 10000 \
-p /mnt/alluxio-fuse/data/

Change RANK=0 to RANK=1 and run the same command on the other node.
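
A mistyped or missing variable on one node typically leaves the other node waiting for a peer, so it can help to check the four variables before launching. MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK are the names that PyTorch's env:// initialization of torch.distributed reads. The sketch below is only a pre-flight check under that assumption: the names DistEnv and read_dist_env are hypothetical, and how run-test.sh and load.py consume the variables is not shown in this README.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    master_addr: str
    master_port: int
    world_size: int
    rank: int


def read_dist_env(env=None):
    """Read and validate the rendezvous variables exported above."""
    env = os.environ if env is None else env
    required = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
    missing = [key for key in required if key not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))

    cfg = DistEnv(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        world_size=int(env["WORLD_SIZE"]),
        rank=int(env["RANK"]),
    )
    # Each participant needs a distinct rank in [0, WORLD_SIZE).
    if not 0 <= cfg.rank < cfg.world_size:
        raise ValueError(
            "RANK=%d is outside [0, %d)" % (cfg.rank, cfg.world_size)
        )
    return cfg
```

Calling read_dist_env() with no arguments reads the current process environment. With the values from the example, node one yields rank 0 and the other node yields rank 1, both with a world size of 2.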

Run multi-node benchmarking with Arena

Arena can be used to run the benchmark across multiple nodes on Kubernetes.

arena --loglevel info submit pytorch --name=test-job --gpus=0 --workers=2 --cpu 4 --memory 32G \
--image=luqqiu/alluxioloadagent:latest --selector alluxio-master=false --data-dir=/mnt/ \
--sync-mode=git --sync-source=https://github.com/Alluxio/alluxio.git \
"export MASTER_ADDR=test-job-master-0 && export MASTER_PORT=12425 \
&& /root/code/alluxio/integration/tools/benchmark/pytorch/run-test.sh 2 /root/code/alluxio/integration/tools/benchmark/pytorch/load.py \
--workers 2 --file_name_list inputdata.csv --number_of_files 10000 \
-p /mnt/alluxio-fuse/data"

Please refer to the Distributed PyTorch Training Guide for more information about how to launch a PyTorch script on multiple nodes.

Get the names of the benchmarking pods:

kubectl get pods

The benchmark result of each node is shown in the logs of the corresponding Kubernetes pod:

kubectl logs test-job-master-0
kubectl logs test-job-worker-0
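
To compare results side by side, the logs of all pods can be gathered in one step. The following is a minimal sketch that wraps the kubectl logs command shown above. The function name fetch_pod_logs is hypothetical, and the sketch assumes kubectl is on the PATH and already configured for the cluster.

```python
import subprocess


def fetch_pod_logs(pod_names, run=subprocess.run):
    """Return a dict mapping each pod name to the output of `kubectl logs`.

    `run` defaults to subprocess.run and can be swapped for a stub when
    no cluster is available.
    """
    logs = {}
    for pod in pod_names:
        # check=True raises CalledProcessError if kubectl exits non-zero.
        result = run(
            ["kubectl", "logs", pod],
            capture_output=True,
            text=True,
            check=True,
        )
        logs[pod] = result.stdout
    return logs
```

For the job above, fetch_pod_logs(["test-job-master-0", "test-job-worker-0"]) would return the log text of both pods keyed by pod name.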

Special thanks

Special thanks to Kevin Cai and Zifan Ni for contributing these benchmark scripts.