site/en/r1/guide/performance/benchmarks.md

Benchmarks

Overview

A selection of image classification models was tested across multiple platforms to create a point of reference for the TensorFlow community. The Methodology section details how the tests were executed and links to the scripts used.

Results for image classification models

InceptionV3 (arXiv:1512.00567), ResNet-50 (arXiv:1512.03385), ResNet-152 (arXiv:1512.03385), VGG16 (arXiv:1409.1556), and AlexNet were tested using the ImageNet data set. Tests were run on Google Compute Engine, Amazon Elastic Compute Cloud (Amazon EC2), and an NVIDIA® DGX-1™. Most of the tests were run with both synthetic and real data. Testing with synthetic data was done by using a tf.Variable set to the same shape as the data each model expects for ImageNet. We believe it is important to include real-data measurements when benchmarking a platform: they load-test both the underlying hardware and the framework's ability to prepare data for actual training. We start with synthetic data to remove disk I/O as a variable and to set a baseline; real data is then used to verify that the TensorFlow input pipeline and the underlying disk I/O can saturate the compute units. Throughput in the tables below is reported in images per second.
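The synthetic-data setup can be sketched in a few lines. The snippet below is an illustration, not the benchmark code: it builds one fixed ImageNet-shaped batch in memory, so a timed step involves no disk reads, decoding, or augmentation. The 224x224 input size and the `training_step` stand-in are assumptions for illustration; each benchmarked model uses its own input resolution.

```python
import numpy as np

# Illustrative ImageNet-style input shape (an assumption; each model
# in the benchmark uses its own input resolution).
BATCH, HEIGHT, WIDTH, CHANNELS = 64, 224, 224, 3

# One fixed in-memory batch, analogous to the tf.Variable used for the
# synthetic-data tests: nothing is read or preprocessed per step, so
# step time measures compute alone.
synthetic_batch = np.random.uniform(
    size=(BATCH, HEIGHT, WIDTH, CHANNELS)).astype(np.float32)

def training_step(batch):
    # Stand-in for a real forward/backward pass.
    return float(batch.mean())

loss = training_step(synthetic_batch)
print(synthetic_batch.shape)
```

Timing this loop, and then the same loop with batches read from disk, separates compute cost from input-pipeline cost, which is the comparison the synthetic-vs-real tables below make.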

Training with NVIDIA® DGX-1™ (NVIDIA® Tesla® P100)


Details and additional results are in the Details for NVIDIA® DGX-1™ (NVIDIA® Tesla® P100) section.

Training with NVIDIA® Tesla® K80


Details and additional results are in the Details for Google Compute Engine (NVIDIA® Tesla® K80) and Details for Amazon EC2 (NVIDIA® Tesla® K80) sections.

Distributed training with NVIDIA® Tesla® K80


Details and additional results are in the Details for Amazon EC2 Distributed (NVIDIA® Tesla® K80) section.

Compare synthetic with real data training

NVIDIA® Tesla® P100

(Chart: synthetic vs. real data training throughput on NVIDIA® Tesla® P100; the same numbers appear in the detail tables below.)

NVIDIA® Tesla® K80

(Chart: synthetic vs. real data training throughput on NVIDIA® Tesla® K80; the same numbers appear in the detail tables below.)

Details for NVIDIA® DGX-1™ (NVIDIA® Tesla® P100)

Environment

  • Instance type: NVIDIA® DGX-1™
  • GPU: 8x NVIDIA® Tesla® P100
  • OS: Ubuntu 16.04 LTS with tests run via Docker
  • CUDA / cuDNN: 8.0 / 5.1
  • TensorFlow GitHub hash: b1e174e
  • Benchmark GitHub hash: 9165a70
  • Build Command: bazel build -c opt --copt=-march="haswell" --config=cuda //tensorflow/tools/pip_package:build_pip_package
  • Disk: Local SSD
  • DataSet: ImageNet
  • Test Date: May 2017

Batch size and optimizer used for each model are listed in the table below. In addition to the batch sizes listed in the table, InceptionV3, ResNet-50, ResNet-152, and VGG16 were tested with a batch size of 32. Those results are in the Other Results section.

| Options | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16 |
| --- | --- | --- | --- | --- | --- |
| Batch size per GPU | 64 | 64 | 64 | 512 | 64 |
| Optimizer | sgd | sgd | sgd | sgd | sgd |

Configuration used for each model.

| Model | variable_update | local_parameter_device |
| --- | --- | --- |
| InceptionV3 | parameter_server | cpu |
| ResNet-50 | parameter_server | cpu |
| ResNet-152 | parameter_server | cpu |
| AlexNet | replicated (with NCCL) | n/a |
| VGG16 | replicated (with NCCL) | n/a |
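These settings correspond to flags of the tf_cnn_benchmarks script. As a hedged sketch (flag spellings follow the benchmark repository of this era, and the data path is a hypothetical placeholder), the AlexNet row above maps to roughly:

```shell
# AlexNet on 8 GPUs, replicated variables with NCCL, batch size 512.
# --data_dir is a placeholder path; omit it to run with synthetic data.
python tf_cnn_benchmarks.py \
  --model=alexnet \
  --batch_size=512 \
  --num_gpus=8 \
  --variable_update=replicated \
  --use_nccl=True \
  --data_name=imagenet \
  --data_dir=/data/imagenet
```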

Results


Training synthetic data

| GPUs | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16 |
| --- | --- | --- | --- | --- | --- |
| 1 | 142 | 219 | 91.8 | 2987 | 154 |
| 2 | 284 | 422 | 181 | 5658 | 295 |
| 4 | 569 | 852 | 356 | 10509 | 584 |
| 8 | 1131 | 1734 | 716 | 17822 | 1081 |

Training real data

| GPUs | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16 |
| --- | --- | --- | --- | --- | --- |
| 1 | 142 | 218 | 91.4 | 2890 | 154 |
| 2 | 278 | 425 | 179 | 4448 | 284 |
| 4 | 551 | 853 | 359 | 7105 | 534 |
| 8 | 1079 | 1630 | 708 | N/A | 898 |

Training AlexNet with real data on 8 GPUs was excluded from the graph and table above because it maxed out the input pipeline.
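A useful way to read these tables is as scaling efficiency: throughput on N GPUs divided by N times the single-GPU throughput. A small sketch using the synthetic-data numbers above (treating each table entry as per-second throughput):

```python
# Scaling efficiency on the DGX-1 synthetic-data results above:
# N-GPU throughput divided by N times the 1-GPU throughput.
single_gpu = {"InceptionV3": 142, "ResNet-50": 219, "VGG16": 154}
eight_gpu = {"InceptionV3": 1131, "ResNet-50": 1734, "VGG16": 1081}

efficiency = {
    model: eight_gpu[model] / (8 * base)
    for model, base in single_gpu.items()
}
for model, eff in efficiency.items():
    print(f"{model}: {eff:.0%}")
```

InceptionV3 and ResNet-50 scale nearly linearly here, while VGG16, with its large fully connected layers, loses roughly 12% of ideal throughput at 8 GPUs.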

Other Results

The results below are all with a batch size of 32.

Training synthetic data

| GPUs | InceptionV3 | ResNet-50 | ResNet-152 | VGG16 |
| --- | --- | --- | --- | --- |
| 1 | 128 | 195 | 82.7 | 144 |
| 2 | 259 | 368 | 160 | 281 |
| 4 | 520 | 768 | 317 | 549 |
| 8 | 995 | 1485 | 632 | 820 |

Training real data

| GPUs | InceptionV3 | ResNet-50 | ResNet-152 | VGG16 |
| --- | --- | --- | --- | --- |
| 1 | 130 | 193 | 82.4 | 144 |
| 2 | 257 | 369 | 159 | 253 |
| 4 | 507 | 760 | 317 | 457 |
| 8 | 966 | 1410 | 609 | 690 |

Details for Google Compute Engine (NVIDIA® Tesla® K80)

Environment

  • Instance type: n1-standard-32-k80x8
  • GPU: 8x NVIDIA® Tesla® K80
  • OS: Ubuntu 16.04 LTS
  • CUDA / cuDNN: 8.0 / 5.1
  • TensorFlow GitHub hash: b1e174e
  • Benchmark GitHub hash: 9165a70
  • Build Command: bazel build -c opt --copt=-march="haswell" --config=cuda //tensorflow/tools/pip_package:build_pip_package
  • Disk: 1.7 TB Shared SSD persistent disk (800 MB/s)
  • DataSet: ImageNet
  • Test Date: May 2017

Batch size and optimizer used for each model are listed in the table below. In addition to the batch sizes listed in the table, InceptionV3 and ResNet-50 were tested with a batch size of 32. Those results are in the Other Results section.

| Options | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16 |
| --- | --- | --- | --- | --- | --- |
| Batch size per GPU | 64 | 64 | 32 | 512 | 32 |
| Optimizer | sgd | sgd | sgd | sgd | sgd |

The configuration used for each model was variable_update equal to parameter_server and local_parameter_device equal to cpu.

Results


Training synthetic data

| GPUs | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16 |
| --- | --- | --- | --- | --- | --- |
| 1 | 30.5 | 51.9 | 20.0 | 656 | 35.4 |
| 2 | 57.8 | 99.0 | 38.2 | 1209 | 64.8 |
| 4 | 116 | 195 | 75.8 | 2328 | 120 |
| 8 | 227 | 387 | 148 | 4640 | 234 |

Training real data

| GPUs | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16 |
| --- | --- | --- | --- | --- | --- |
| 1 | 30.6 | 51.2 | 20.0 | 639 | 34.2 |
| 2 | 58.4 | 98.8 | 38.3 | 1136 | 62.9 |
| 4 | 115 | 194 | 75.4 | 2067 | 118 |
| 8 | 225 | 381 | 148 | 4056 | 230 |

Other Results

Training synthetic data

| GPUs | InceptionV3 (batch size 32) | ResNet-50 (batch size 32) |
| --- | --- | --- |
| 1 | 29.3 | 49.5 |
| 2 | 55.0 | 95.4 |
| 4 | 109 | 183 |
| 8 | 216 | 362 |

Training real data

| GPUs | InceptionV3 (batch size 32) | ResNet-50 (batch size 32) |
| --- | --- | --- |
| 1 | 29.5 | 49.3 |
| 2 | 55.4 | 95.3 |
| 4 | 110 | 186 |
| 8 | 216 | 359 |

Details for Amazon EC2 (NVIDIA® Tesla® K80)

Environment

  • Instance type: p2.8xlarge
  • GPU: 8x NVIDIA® Tesla® K80
  • OS: Ubuntu 16.04 LTS
  • CUDA / cuDNN: 8.0 / 5.1
  • TensorFlow GitHub hash: b1e174e
  • Benchmark GitHub hash: 9165a70
  • Build Command: bazel build -c opt --copt=-march="haswell" --config=cuda //tensorflow/tools/pip_package:build_pip_package
  • Disk: 1 TB Amazon EFS (burst 100 MiB/sec for 12 hours, continuous 50 MiB/sec)
  • DataSet: ImageNet
  • Test Date: May 2017

Batch size and optimizer used for each model are listed in the table below. In addition to the batch sizes listed in the table, InceptionV3 and ResNet-50 were tested with a batch size of 32. Those results are in the Other Results section.

| Options | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16 |
| --- | --- | --- | --- | --- | --- |
| Batch size per GPU | 64 | 64 | 32 | 512 | 32 |
| Optimizer | sgd | sgd | sgd | sgd | sgd |

Configuration used for each model.

| Model | variable_update | local_parameter_device |
| --- | --- | --- |
| InceptionV3 | parameter_server | cpu |
| ResNet-50 | replicated (without NCCL) | gpu |
| ResNet-152 | replicated (without NCCL) | gpu |
| AlexNet | parameter_server | gpu |
| VGG16 | parameter_server | gpu |

Results


Training synthetic data

| GPUs | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16 |
| --- | --- | --- | --- | --- | --- |
| 1 | 30.8 | 51.5 | 19.7 | 684 | 36.3 |
| 2 | 58.7 | 98.0 | 37.6 | 1244 | 69.4 |
| 4 | 117 | 195 | 74.9 | 2479 | 141 |
| 8 | 230 | 384 | 149 | 4853 | 260 |

Training real data

| GPUs | InceptionV3 | ResNet-50 | ResNet-152 | AlexNet | VGG16 |
| --- | --- | --- | --- | --- | --- |
| 1 | 30.5 | 51.3 | 19.7 | 674 | 36.3 |
| 2 | 59.0 | 94.9 | 38.2 | 1227 | 67.5 |
| 4 | 118 | 188 | 75.2 | 2201 | 136 |
| 8 | 228 | 373 | 149 | N/A | 242 |

Training AlexNet with real data on 8 GPUs was excluded from the graph and table above because our EFS setup did not provide enough throughput.

Other Results

Training synthetic data

| GPUs | InceptionV3 (batch size 32) | ResNet-50 (batch size 32) |
| --- | --- | --- |
| 1 | 29.9 | 49.0 |
| 2 | 57.5 | 94.1 |
| 4 | 114 | 184 |
| 8 | 216 | 355 |

Training real data

| GPUs | InceptionV3 (batch size 32) | ResNet-50 (batch size 32) |
| --- | --- | --- |
| 1 | 30.0 | 49.1 |
| 2 | 57.5 | 95.1 |
| 4 | 113 | 185 |
| 8 | 212 | 353 |

Details for Amazon EC2 Distributed (NVIDIA® Tesla® K80)

Environment

  • Instance type: p2.8xlarge
  • GPU: 8x NVIDIA® Tesla® K80
  • OS: Ubuntu 16.04 LTS
  • CUDA / cuDNN: 8.0 / 5.1
  • TensorFlow GitHub hash: b1e174e
  • Benchmark GitHub hash: 9165a70
  • Build Command: bazel build -c opt --copt=-march="haswell" --config=cuda //tensorflow/tools/pip_package:build_pip_package
  • Disk: 1.0 TB EFS (burst 100 MB/sec for 12 hours, continuous 50 MB/sec)
  • DataSet: ImageNet
  • Test Date: May 2017

The batch size and optimizer used for the tests are listed in the table below. In addition to the batch sizes listed in the table, InceptionV3 and ResNet-50 were tested with a batch size of 32. Those results are in the Other Results section.

| Options | InceptionV3 | ResNet-50 | ResNet-152 |
| --- | --- | --- |
| Batch size per GPU | 64 | 64 | 32 |
| Optimizer | sgd | sgd | sgd |

Configuration used for each model.

| Model | variable_update | local_parameter_device | cross_replica_sync |
| --- | --- | --- | --- |
| InceptionV3 | distributed_replicated | n/a | True |
| ResNet-50 | distributed_replicated | n/a | True |
| ResNet-152 | distributed_replicated | n/a | True |

To simplify server setup, EC2 instances (p2.8xlarge) running worker servers also ran parameter servers. Equal numbers of parameter servers and worker servers were used with the following exceptions:

  • InceptionV3: 8 instances / 6 parameter servers
  • ResNet-50: (batch size 32) 8 instances / 4 parameter servers
  • ResNet-152: 8 instances / 4 parameter servers
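The colocated layout above can be sketched as a cluster-spec builder: every instance hosts a worker, and the first few instances additionally host a parameter server. This is an illustration of the layout, not the benchmark's setup code; hostnames and ports are hypothetical placeholders.

```python
# Every instance runs a worker; a subset of the same instances also
# runs a parameter server (hostnames/ports are made-up placeholders).
def cluster_spec(num_instances, num_ps, worker_port=5000, ps_port=5001):
    hosts = [f"node{i}" for i in range(num_instances)]
    return {
        "worker": [f"{h}:{worker_port}" for h in hosts],
        "ps": [f"{h}:{ps_port}" for h in hosts[:num_ps]],
    }

# InceptionV3 at 64 GPUs: 8 instances, 6 of which also run a ps.
spec = cluster_spec(8, 6)
print(len(spec["worker"]), len(spec["ps"]))
```

Running parameter servers on the worker instances avoids paying for separate machines, at the cost of sharing NIC bandwidth between gradient pushes and input data.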

Results


Training synthetic data

| GPUs | InceptionV3 | ResNet-50 | ResNet-152 |
| --- | --- | --- | --- |
| 1 | 29.7 | 52.4 | 19.4 |
| 8 | 229 | 378 | 146 |
| 16 | 459 | 751 | 291 |
| 32 | 902 | 1388 | 565 |
| 64 | 1783 | 2744 | 981 |

Other Results


Training synthetic data

| GPUs | InceptionV3 (batch size 32) | ResNet-50 (batch size 32) |
| --- | --- | --- |
| 1 | 29.2 | 48.4 |
| 8 | 219 | 333 |
| 16 | 427 | 667 |
| 32 | 820 | 1180 |
| 64 | 1608 | 2315 |

Methodology

The benchmark script (tf_cnn_benchmarks.py in the tensorflow/benchmarks GitHub repository) was run on the various platforms to generate the above results.

To make the results as repeatable as possible, each test was run 5 times and the results were averaged. GPUs were run in their default state on the given platform; for the NVIDIA® Tesla® K80 this means leaving GPU Boost enabled. For each test, 10 warmup steps were run and the next 100 steps were averaged.
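The warmup-and-average procedure can be sketched as follows. The per-step times here are made up for illustration; only the averaging logic mirrors the methodology described above.

```python
# Methodology sketch: per test, skip 10 warmup steps, average the next
# 100 steps, then average the resulting throughput across 5 runs.
WARMUP_STEPS = 10
MEASURED_STEPS = 100
RUNS = 5

def run_throughput(step_times, batch_size=64):
    # Drop warmup steps and convert the measured window to images/sec.
    measured = step_times[WARMUP_STEPS:WARMUP_STEPS + MEASURED_STEPS]
    return batch_size * len(measured) / sum(measured)

# Hypothetical constant step time of 0.45 s for each of the 5 runs.
runs = [[0.45] * (WARMUP_STEPS + MEASURED_STEPS) for _ in range(RUNS)]
average = sum(run_throughput(r) for r in runs) / RUNS
print(round(average, 1))
```

Discarding warmup steps matters because the first iterations include one-time costs (graph setup, memory allocation, autotuning) that would otherwise drag down the reported steady-state throughput.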