PyTorch data loading with Mountpoint

This directory shows a simple example of how to use Mountpoint for Amazon S3 as a PyTorch data loader via the torchdata library. It trains a ResNet-50 model for a few epochs using PyTorch Lightning, with synthetic ImageNet-style training data, stored in S3 as shards in WebDataset format. The idea is that because Mountpoint exposes your S3 bucket as a local file system, you can just use standard file-oriented loaders rather than writing S3-specific code, and get great performance.

Getting started

To get started, launch an EC2 instance with a GPU (we used a g5.2xlarge), choosing the AWS Deep Learning AMI GPU PyTorch 2.0.1 (Amazon Linux 2) as your AMI.

When using the AWS Deep Learning AMI, you'll need to activate the preconfigured virtual environment for PyTorch:

source activate pytorch

Perform a simple clone of this repository if you don't already have it available on the instance.

git clone https://github.com/awslabs/mountpoint-s3.git

Move to the PyTorch example directory (examples/pytorch) within the Mountpoint repository:

cd mountpoint-s3/examples/pytorch

Install the dependencies for the example code:

python -m pip install -r requirements.txt

Install Mountpoint if you don't already have it:

wget https://s3.amazonaws.com/mountpoint-s3-release/latest/x86_64/mount-s3.rpm
sudo yum install ./mount-s3.rpm

To generate and upload the training shards to an S3 bucket (in this example, amzn-s3-demo-bucket), run:

python resnet.py make s3://amzn-s3-demo-bucket/shard-data/ --num-images 50000

This will upload about 5GB worth of shards to your bucket. With this prepared, you're ready to start running training iterations.

Running training iterations

Now to run the training loop:

python resnet.py train s3://amzn-s3-demo-bucket/shard-data/ --source-kind mountpoint --batch-size 256 --max-epochs 3

The --source-kind argument controls how the data is loaded from S3:

mountpoint spawns a Mountpoint instance and accesses it as a local file system with the FileOpener datapipe from torchdata
s3io uses the S3FileLoader datapipe from torchdata (formerly the Amazon S3 plugin for PyTorch)
fsspec uses the FSSpecFileOpener datapipe from torchdata (which uses the fsspec and s3fs libraries)

Results

Here are some indicative results (keep in mind this is a synthetic test). We ran training on a g5.2xlarge EC2 instance (8 vCPUs, 1x NVIDIA A10G Tensor Core GPU) in the us-west-2 region running AWS Deep Learning AMI GPU PyTorch 2.0.1 (Amazon Linux 2) 20230609. The instance had PyTorch 2.0.1, PyTorch Lightning 2.0.6, NVIDIA driver 525.85.12, and CUDA 11.8. Training ran on 50,000 images at a batch size of 256, and trained ResNet-50 in 16-bit mixed precision for 5 epochs. We saw the following results:

Data loader	Images/sec
Mountpoint	526.9
S3 IO	470.1
FSSpec	75.5