FasterTransformer T5

FasterTransformer T5 implements the HuggingFace T5 model (https://huggingface.co/t5-base).

Introduction

This document describes what FasterTransformer provides for the T5 model, explaining the workflow and optimizations. We also provide a guide to help users run the T5 model on FasterTransformer. Finally, we provide a benchmark to demonstrate the speed of FasterTransformer on T5.

Supported features

  • Checkpoint converter
    • Huggingface
    • Megatron
    • NeMo Megatron
  • Data type
    • FP32
    • FP16
    • BF16
  • Feature
    • Multi-GPU multi-node inference
    • Dynamic random seed
    • Stop tokens
    • Beam search and sampling are both supported
    • Loading FP32 or FP16 weights
  • Frameworks
    • PyTorch
    • Triton backend

Model architecture

Workflow

The source code is located in src/fastertransformer/models/t5.

  • Constructor of T5 Encoder

| Classification | Name | Data Type | Description |
| :---: | :---: | :---: | :--- |
| [0] | max_batch_size | size_t | Deprecated, moved to input |
| [1] | max_seq_len | size_t | Deprecated, moved to input |
| [2] | head_num | size_t | Head number for model configuration |
| [3] | size_per_head | size_t | Size per head for model configuration |
| [4] | inter_size | size_t | The intermediate size of the feed-forward network. It is often set to 4 * head_num * size_per_head |
| [5] | d_model | size_t | The embedding dimension of the transformer input |
| [6] | num_layer | size_t | Number of transformer layers for model configuration |
| [7] | num_bucket_or_max_seq_len | size_t | Number of buckets in relative position embedding, or max sequence length for absolute position embedding |
| [8] | max_distance | size_t | Max distance for relative position embedding |
| [9] | sm | int | The compute capability of the GPU |
| [10] | q_scaling | float | Used to scale the query before the batched multiplication of query and key |
| [11] | stream | cudaStream_t | CUDA stream |
| [12] | cublas_wrapper | cublasMMWrapper* | Pointer to the cuBLAS wrapper, declared in src/fastertransformer/utils/cublasMMWrapper.h |
| [13] | allocator | IAllocator* | Pointer to the memory allocator, declared in src/fastertransformer/utils/allocator.h |
| [14] | is_free_buffer_after_forward | bool | If set to true, FasterTransformer allocates the buffer before forward and frees it afterward. When the allocator is based on a memory pool, setting this to true may help reduce memory usage during inference |
| [15] | attention_type | AttentionType | Determines whether to fuse the attention and whether to remove padding, declared in src/fastertransformer/layers/attention_layers/BaseAttentionLayer.h |
| [16] | sparse | bool | Whether to use sparsity. Experimental feature |
| [17] | activation_type | ActivationType | Determines the activation in the FFN, declared in src/fastertransformer/layers/attention_layers/FfnLayer.h |
| [18] | layernorm_type | LayerNormType | Determines whether to use pre-layernorm or post-layernorm, declared in src/fastertransformer/kernels/layernorm_kernels.h. Note that only pre_layernorm is supported at the moment |
| [19] | tensor_para | NcclParam | Tensor parallelism information, declared in src/fastertransformer/utils/nccl_utils.h |
| [20] | pipeline_para | NcclParam | Pipeline parallelism information, declared in src/fastertransformer/utils/nccl_utils.h |
| [21] | prompt_learning_start_id | int | The start id of virtual tokens in p/prompt-tuning |
| [22] | prompt_learning_type | PromptLearningType | The type of prompt learning used when loading the prompt embedding in the constructor. FT currently supports no_prompt, soft_prompt, prefix_prompt, and p_prompt_tuning |
| [23] | custom_all_reduce_comm | AbstractCustomComm | Custom all-reduce communication for model parallelism. Only supported with 8-way tensor parallelism |
| [24] | enable_custom_all_reduce | int | Flag to enable custom all-reduce |
  • Input of T5 Encoder

| Name | Tensor/Parameter Shape | Location | Data Type | Description |
| :---: | :---: | :---: | :---: | :--- |
| input_ids | [batch_size, seq_len] | GPU | int | The input ids |
| sequence_length | [batch_size] | GPU | int | The lengths of the input ids |
| inputs_embeds | [batch_size, seq_len, d_model] | GPU | fp32/fp16/bf16 | Optional. The embedding after embedding lookup. If this input is not null, it is used as the transformer input instead |
| prompt_learning_task_name_ids | [batch_size] | CPU | int | Optional. Task name ids for prompt learning |
| request_prompt_lengths | [batch_size] | GPU | int | Optional. Lengths of the prefix soft prompt embeddings; describes how many soft-prompt tokens are in each sentence |
| request_prompt_embedding | [batch_size, max_prompt_length, hidden_units] | GPU | float | Optional. Prefix soft prompt embedding. FT concatenates it with the results of the embedding lookup kernel |
| ia3_tasks | [batch_size] | GPU | int | Optional. Which IA3 weights to use for each sequence |
  • Output of T5 Encoder

| Name | Tensor/Parameter Shape | Location | Data Type | Description |
| :---: | :---: | :---: | :---: | :--- |
| output_hidden_state | [batch_size, sequence_length, d_model] | GPU | fp32/fp16/bf16 | The output of the transformer layer |
  • Constructor of T5 Decoding

| Classification | Name | Data Type | Description |
| :---: | :---: | :---: | :--- |
| [0] | max_batch_size | size_t | Deprecated, moved to input |
| [1] | max_seq_len | size_t | Deprecated, moved to input |
| [2] | mem_max_seq_len | size_t | Deprecated, moved to input |
| [3] | beam_width | size_t | Deprecated, moved to input |
| [4] | head_num | size_t | Head number for model configuration |
| [5] | size_per_head | size_t | Size per head for model configuration |
| [6] | inter_size | size_t | The intermediate size of the feed-forward network. It is often set to 4 * head_num * size_per_head |
| [7] | d_model | size_t | The embedding dimension of the transformer input |
| [8] | num_layer | size_t | Number of transformer layers for model configuration |
| [9] | vocab_size | size_t | Vocabulary size for model configuration |
| [10] | num_bucket | size_t | Number of buckets in relative position embedding, or max sequence length for absolute position embedding |
| [11] | max_distance | size_t | Max distance for relative position embedding |
| [12] | q_scaling | float | Used to scale the query before the batched multiplication of query and key |
| [13] | start_id | int | Start id for the vocabulary |
| [14] | end_id | int | End id for the vocabulary |
| [15] | beam_search_diversity_rate | float | Deprecated, moved to input |
| [16] | top_k | size_t | Deprecated, moved to input |
| [17] | top_p | float | Deprecated, moved to input |
| [18] | temperature | float | Deprecated, moved to input |
| [19] | len_penalty | float | Deprecated, moved to input |
| [20] | repetition_penalty | float | Deprecated, moved to input |
| [21] | stream | cudaStream_t | CUDA stream |
| [22] | cublas_wrapper | cublasMMWrapper* | Pointer to the cuBLAS wrapper, declared in src/fastertransformer/utils/cublasMMWrapper.h |
| [23] | allocator | IAllocator* | Pointer to the memory allocator, declared in src/fastertransformer/utils/allocator.h |
| [24] | is_free_buffer_after_forward | bool | If set to true, FasterTransformer allocates the buffer before forward and frees it afterward. When the allocator is based on a memory pool, setting this to true may help reduce memory usage during inference |
| [25] | cuda_device_prop | cudaDeviceProp* | Pointer to CUDA device properties, used to query hardware properties such as the size of shared memory |
| [26] | tensor_para | NcclParam | Tensor parallelism information, declared in src/fastertransformer/utils/nccl_utils.h |
| [27] | pipeline_para | NcclParam | Pipeline parallelism information, declared in src/fastertransformer/utils/nccl_utils.h |
| [28] | activation_type | ActivationType | Determines the activation in the FFN, declared in src/fastertransformer/layers/attention_layers/FfnLayer.h |
| [29] | tie_word_embeddings | bool | A flag controlling the scale of the transformer output |
| [30] | custom_all_reduce_comm | AbstractCustomComm | Custom all-reduce communication for model parallelism. Only supported with 8-way tensor parallelism |
| [31] | enable_custom_all_reduce | int | Flag to enable custom all-reduce |
  • Input of T5 Decoding

| Name | Tensor/Parameter Shape | Location | Data Type | Description |
| :---: | :---: | :---: | :---: | :--- |
| encoder_output | [batch_size, mem_max_seq_len, memory_d_model] | GPU | fp32/fp16/bf16 | The output of the T5 Encoder |
| encoder_sequence_length | [batch_size] | GPU | int | The sequence lengths of the encoder input/output |
| stop_words_list | [batch_size, 2, stop_words_length] | GPU | int | Optional. When FT generates a word in this list, it stops the generation. An extension of the stop id (see the sketch after these tables for the layout) |
| bad_words_list | [batch_size, 2, bad_words_length] | GPU | int | Optional. Words in this list will never be sampled |
| start_id | [batch_size] | CPU | int | Optional. If FT receives this input, it replaces the default start id |
| end_id | [batch_size] | CPU | int | Optional. If FT receives this input, it replaces the default end id |
| runtime_top_k | [1] or [batch_size] | CPU | uint | Optional. top_k value for top-k sampling |
| runtime_top_p | [1] or [batch_size] | CPU | float | Optional. top_p value for top-p sampling |
| beam_search_diversity_rate | [1] or [batch_size] | CPU | float | Optional. A hyperparameter for simple diverse decoding |
| temperature | [1] or [batch_size] | CPU | float | Optional. Temperature applied to logits for both beam search and sampling |
| len_penalty | [1] or [batch_size] | CPU | float | Optional. Length penalty applied to logits, for beam search only |
| repetition_penalty | [1] or [batch_size] | CPU | float | Optional. Repetition penalty applied to logits for both beam search and sampling. Exclusive with presence_penalty |
| presence_penalty | [1] or [batch_size] | CPU | float | Optional. Presence penalty (an additive type of repetition penalty) applied to logits for both beam search and sampling. Exclusive with repetition_penalty |
| min_length | [1] or [batch_size] | CPU | int | Optional. Minimum number of tokens to generate |
| random_seed | [1] or [batch_size] | CPU | unsigned long long int | Optional. Random seed to initialize the random table in sampling |
| ia3_tasks | [batch_size] | GPU | int | Optional. Which IA3 weights to use for each sequence |
  • Output of T5 Decoding

| Name | Tensor/Parameter Shape | Location | Data Type | Description |
| :---: | :---: | :---: | :---: | :--- |
| output_ids | [batch_size, beam_width, max_output_seq_len] | GPU | int | The output ids; contains both the input ids and the generated ids |
| sequence_length | [batch_size, beam_width] | GPU | int | The lengths of the output ids |
| output_log_probs | [batch_size, beam_width, request_output_seq_len] | GPU | float | Optional. Records the log probability of the logits at each step for sampling |
| cum_log_probs | [batch_size, beam_width] | GPU | float | Optional. Cumulative log probability of the generated sentences |
| cross_attentions | [num_layer / pipeline_para_size, batch_size, beam_width, head_num / tensor_para_size, max_seq_len, mem_max_seq_len] | GPU | float | Optional. The attention scores of the cross attention |
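The layout of stop_words_list and bad_words_list is only implied by their [batch_size, 2, length] shape and is easy to get wrong. Below is a minimal PyTorch sketch of one plausible construction, assuming row 0 holds the token ids of all words concatenated and row 1 holds the exclusive end offset of each word, padded with -1; verify this against the FT examples before relying on it.

```python
# Hedged sketch of building a bad_words_list tensor for T5 Decoding.
# Assumed layout per the [batch_size, 2, length] shape: row 0 = token ids of
# all banned words concatenated; row 1 = exclusive end offset of each word,
# padded with -1 for unused slots.
import torch

batch_size = 2

# Two banned "words": [73] and [1024, 5]
ids     = [73, 1024, 5]   # concatenated token ids
offsets = [1, 3, -1]      # word 0 ends at index 1, word 1 at index 3; -1 pads
bad_words = torch.tensor([ids, offsets], dtype=torch.int32, device="cuda")  # tables above place this on GPU
bad_words_list = bad_words.unsqueeze(0).repeat(batch_size, 1, 1)            # [batch_size, 2, 3]
print(bad_words_list.shape)  # torch.Size([2, 2, 3])
```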

Optimization

  1. Kernel optimization: First, since the sequence length of the query in SelfAttention and CrossAttention is always 1 during decoding, we use a custom fused multi-head attention kernel. Second, we fuse many small operations into one kernel. For example, AddBiasResidualLayerNorm combines adding the bias, adding the residual of the previous block, and the layer normalization into a single kernel. Third, we optimize the top-k operation and sampling to accelerate beam search and sampling. Finally, to avoid recomputing the keys and values of previous steps, we allocate a buffer to store them at each step. Although this costs some additional memory, it saves the cost of recomputation, of allocating a buffer at each step, and of concatenation.
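For intuition, the unfused math that the AddBiasResidualLayerNorm kernel replaces looks roughly like the following PyTorch sketch (a reference computation with made-up sizes, not FT's actual kernel):

```python
# Unfused reference for AddBiasResidualLayerNorm: three separate ops (bias add,
# residual add, layer norm) that FT fuses into one kernel to cut memory traffic.
import torch
import torch.nn.functional as F

def add_bias_residual_layernorm(x, bias, residual, gamma, beta, eps=1e-6):
    h = x + bias       # add the bias of the previous GEMM
    h = h + residual   # add the residual of the previous block
    return F.layer_norm(h, h.shape[-1:], weight=gamma, bias=beta, eps=eps)

d_model = 512
x, bias = torch.randn(8, 1, d_model), torch.randn(d_model)
residual = torch.randn(8, 1, d_model)
gamma, beta = torch.ones(d_model), torch.zeros(d_model)
out = add_bias_residual_layernorm(x, bias, residual, gamma, beta)
```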

Setup

The following section lists the requirements to use FasterTransformer.

Requirements

  • CMake >= 3.13 for PyTorch
  • CUDA 11.0 or newer version
  • NCCL 2.10 or newer version
  • Python: only verified on Python 3.
  • PyTorch: verified on 1.10.0; >= 1.5.0 should work.

We recommend using an NGC image such as nvcr.io/nvidia/pytorch:22.09-py3.

For more information about how to get started with NGC containers, see the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation.

For those unable to use the NGC container, to set up the required environment or create your own container, see the versioned NVIDIA Container Support Matrix.

Build FasterTransformer

Prepare

You can choose the PyTorch and Python versions you want. Here, for PyTorch users, we suggest the image nvcr.io/nvidia/pytorch:22.09-py3, which contains PyTorch 1.13.0 and Python 3.8. For TensorFlow users, we suggest the image nvcr.io/nvidia/tensorflow:22.09-tf2-py3, which contains TensorFlow 2.9.1 and Python 3.8.

```bash
nvidia-docker run -ti --shm-size 5g --rm nvcr.io/nvidia/pytorch:22.09-py3 bash
git clone https://github.com/NVIDIA/FasterTransformer.git
mkdir -p FasterTransformer/build
cd FasterTransformer/build
git submodule init && git submodule update
```

Build the project

  • Note: the xx of -DSM=xx in the following scripts means the compute capability of your GPU. The following table shows the compute capability of common GPUs.

| GPU | Compute Capability |
| :---: | :---: |
| P40 | 60 |
| P4 | 61 |
| V100 | 70 |
| T4 | 75 |
| A100 | 80 |
| A30 | 80 |
| A10 | 86 |

By default, -DSM is set to 70, 75, 80 and 86. Specifying more -DSM values increases compile time, so we suggest setting -DSM only for the device you use. Here, we use xx as an example for convenience.
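If you are unsure of your GPU's compute capability, you can query it; a quick check with PyTorch (assuming it is available in your container):

```python
# Print the -DSM value matching the compute capability of GPU 0.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"-DSM={major}{minor}")  # e.g. -DSM=80 on A100
```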

  1. build with PyTorch

    ```bash
    cmake -DSM=xx -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..
    make -j12
    ```

    This will build the TorchScript custom class. Please make sure that PyTorch >= 1.5.0. A sketch of loading the resulting library follows this list.

  2. build with TensorRT. You can use the nvcr.io/nvidia/pytorch:22.09-py3 docker image, too.

    ```bash
    cmake -DSM=xx -DCMAKE_BUILD_TYPE=Release -DBUILD_TRT=ON -DBUILD_MULTI_GPU=ON ..
    make -j12
    ```
  3. build with TensorFlow 2

    ```bash
    cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_TF2=ON -DTF_PATH=/usr/local/lib/python3.8/dist-packages/tensorflow/ -DBUILD_MULTI_GPU=ON ..
    make -j12
    ```
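As mentioned in step 1, the PyTorch build produces a TorchScript custom-class library that the example scripts load before constructing FT models. A minimal loading sketch (the library name and path below are assumptions and vary across FT versions, so check your build output):

```python
# Hypothetical sketch: load the FT custom-op library from the build directory.
import torch

torch.classes.load_library("./lib/libth_transformer.so")  # assumed name/path
```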

How to use

Translation process

  1. Run FasterTransformer T5 on PyTorch

    Please install the dependencies before running the demos:

    ```bash
    pip install -r ../examples/pytorch/t5/requirement.txt
    ```

    1.1 Generate the gemm_config.in file:

    ./bin/t5_gemm can generate the best GEMM configuration.

    Assume the settings of decoding are as follows.

    • batch_size = 8
    • beam_width = 4
    • max_mem_seq_len = 32
    • encoder_d_model = 512
    • encoder_head_num = 8
    • encoder_size_per_head = 64
    • encoder_inter_size = 2048
    • decoder_d_model = 512
    • decoder_head_num = 8
    • decoder_size_per_head = 64
    • decoder_inter_size = 2048
    • decoder_vocab_size = 32128
    • data_type = 0 (FP32) or 1 (FP16) or 2 (BF16)
    • tensor_para_size = 2

    Then the following scripts can generate the best GEMM configuration under such settings and record the configuration into the gemm_config.in file.

    ```bash
    ./bin/t5_gemm 8 4 32 512 8 64 2048 512 8 64 2048 32128 0 2 1
    ```

    If the application may have multiple different shapes (like different batch sizes), users can run it multiple times and set is_append to true. For example

    ```bash
    ./bin/t5_gemm 8 4 32 512 8 64 2048 512 8 64 2048 32128 0 2 1 # bs 8, not append, will create a new gemm_config.in
    ./bin/t5_gemm 16 4 32 512 8 64 2048 512 8 64 2048 32128 0 2 1 # bs 16, append results to the existing gemm_config.in
    ```

    1.2 Run the PyTorch T5 example:

    ```bash
    python ../examples/pytorch/t5/translate_example.py \
            --batch_size 32 \
            --beam_width 4 \
            --max_seq_len 128 \
            --data_type fp32 \
            --test_time 0123 \
            --sampling_topk 4 \
            --model t5-small
    ```

    The data type can be fp32, fp16, or bf16. Judging from the sample output below, the digits passed to --test_time select which pipelines are timed (0: HF beam search, 1: FT beam search, 2: HF sampling, 3: FT sampling).

    The outputs should be similar to the following:

    ```bash
    [INFO] hf-beamsearch translates 94 batches taking 157.58 sec to translate 62591 tokens, BLEU score: 26.21, 397 tokens/sec.
    [INFO] ft-beamsearch translates 94 batches taking 14.45 sec to translate 61763 tokens, BLEU score: 26.45, 4274 tokens/sec.
    [INFO] hf-sampling translates 94 batches taking 99.17 sec to translate 62022 tokens, BLEU score: 25.35, 625 tokens/sec.
    [INFO] ft-sampling translates 94 batches taking 7.93 sec to translate 62096 tokens, BLEU score: 17.61, 7827 tokens/sec.
    ```

    1.3 Run T5 with model parallelism

    Note that mpirun -n 4 launches four processes, matching tensor_para_size * pipeline_para_size = 2 * 2 below.

    ```bash
    mpirun -n 4 --allow-run-as-root \
      python ../examples/pytorch/t5/translate_example.py \
            --batch_size 32 \
            --beam_width 4 \
            --max_seq_len 128 \
            --data_type fp32 \
            --test_time 0123 \
            --sampling_topk 4 \
            --model t5-small \
            --tensor_para_size 2 \
            --pipeline_para_size 2
    ```
  2. Run FasterTransformer T5 on TensorRT

    Please install transformers before running the demos:

    ```bash
    pip install -r ../examples/pytorch/t5/requirement.txt
    ```

    ```bash
    # get T5Model weights for the test (requires Internet access or a pre-downloaded model)
    # Note that the model is saved in ./ft_t5_small/1-gpu, not ./ft_t5_small
    python ../examples/tensorrt/t5/extractT5ModelToBIN.py \
            -in_file t5-small \
            -saved_dir ./ft_t5_small

    python ../examples/tensorrt/t5/testT5Plugin.py \
            --batch_size 32 \
            --beam_width 4 \
            --max_seq_len 128 \
            --data_type fp16 \
            --ckpt_path ./ft_t5_small/1-gpu
    ```
  • Input/Output Tensor/Parameter of T5Encoder Plugin

| Classification | Tensor/Parameter Shape | Data Type | Description |
| :---: | :---: | :---: | :--- |
| input tensor | | | |
| [0] | [batch_size, max_seq_len] | int32 | input tokens after tokenization |
| [1] | [batch_size] | int32 | real sequence length of each input |
| input parameter | | | |
| [0] | [] | int32 | max_batch_size |
| [1] | [] | int32 | max_seq_len |
| [2] | [] | int32 | beam_width (keep the same as decoding) |
| [3] | [] | int32 | sm |
| [4] | [] | int32 | useFP16 |
| [5] | [] | string | checkpoint path of the converted FT model |
| output tensor | | | |
| [0] | [batch_size, max_seq_len, d_model] | float32/float16 | encoder output |
  • Input/Output Tensor/Parameter of T5Decoding Plugin

| Classification | Tensor/Parameter Shape | Data Type | Description |
| :---: | :---: | :---: | :--- |
| input tensor | | | |
| [0] | [batch_size, max_seq_len, d_model] | float32/float16 | encoder output |
| [1] | [batch_size] | int32 | real sequence length of each input |
| [2] | [1] or [batch_size] | int32 | top_k |
| [3] | [1] or [batch_size] | float32 | top_p |
| [4] | [1] or [batch_size] | float32 | beam_search_diversity_rate |
| [5] | [1] or [batch_size] | float32 | temperature |
| [6] | [1] or [batch_size] | float32 | len_penalty |
| [7] | [1] or [batch_size] | float32 | repetition_penalty |
| input parameter | | | |
| [0] | [] | int32 | max_batch_size |
| [1] | [] | int32 | max_seq_len |
| [2] | [] | int32 | mem_max_seq_len |
| [3] | [] | int32 | beam_width |
| [4] | [] | int32 | useFp16 |
| [5] | [] | string | checkpoint path of the converted FT model |
| output tensor | | | |
| [0] | [batch_size, beam_width, max_seq_len] | float32/float16 | decoding output |
| [1] | [batch_size, beam_width] | float32/float16 | real sequence length of each output |

The model configuration is stored in config.ini under the checkpoint path. For example, after running

```bash
python ../examples/tensorrt/t5/extractT5ModelToBIN.py \
            -in_file t5-small \
            -saved_dir ./ft_t5_small
```

users can see the model configuration in ./ft_t5_small/1-gpu/config.ini.
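To inspect the converted configuration, a small sketch using Python's configparser (printing every section avoids assuming particular key names):

```python
# Dump every section/key of the converted model's config.ini.
import configparser

cfg = configparser.ConfigParser()
cfg.read("./ft_t5_small/1-gpu/config.ini")
for section in cfg.sections():
    print(f"[{section}]")
    for key, value in cfg[section].items():
        print(f"  {key} = {value}")
```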

  3. Run FasterTransformer T5 on TensorFlow 2

    Please install the dependencies before running the demos:

    ```bash
    pip install -r ../examples/tensorflow/t5/requirement.txt
    ```

    3.1 Generate the gemm_config.in file:

    ./bin/t5_gemm can generate the best GEMM configuration.

    Assume the settings of decoding are as follows.

    • batch_size = 8
    • beam_width = 4
    • max_mem_seq_len = 32
    • encoder_d_model = 512
    • encoder_head_num = 8
    • encoder_size_per_head = 64
    • encoder_inter_size = 2048
    • decoder_d_model = 512
    • decoder_head_num = 8
    • decoder_size_per_head = 64
    • decoder_inter_size = 2048
    • decoder_vocab_size = 32128
    • data_type = 0 (FP32) or 1 (FP16) or 2 (BF16)
    • tensor_para_size = 2

    Then the following scripts can generate the best GEMM configuration under such settings and record the configuration into the gemm_config.in file.

    ```bash
    ./bin/t5_gemm 8 4 32 512 8 64 2048 512 8 64 2048 32128 0 2 1
    ```

    3.2 Run the TensorFlow T5 example:

    ```bash
    python ../examples/tensorflow/t5/translate_example.py \
        --batch_size 32 \
        --beam_width 4 \
        --max_seq_len 128 \
        --data_type fp32 \
        --test_time 13 \
        --sampling_topk 4 \
        --model t5-small
    ```

    The data type can be fp32, fp16, or bf16.

    The outputs should be similar to the following:

    ```bash
    2022-11-09 01:34:30,687 __main__ [INFO] ft-beamsearch translates 94 batches taking 17.75 sec to translate 99719 tokens, BLEU score: 25.38, 5617 tokens/sec. (62035 words, 3494 words/sec)
    2022-11-09 01:34:30,687 __main__ [INFO] ft-sampling translates 94 batches taking 14.80 sec to translate 99745 tokens, BLEU score: 25.36, 6740 tokens/sec. (62029 words, 4191 words/sec)
    ```

    3.3 Run the TensorFlow T5 v1_1 example:

    ```bash
    python ../examples/tensorflow/t5/translate_example.py \
        --batch_size 32 \
        --beam_width 4 \
        --max_seq_len 128 \
        --data_type fp32 \
        --test_time 0123 \
        --sampling_topk 4 \
        --max_ite 10 \
        --model google/t5-v1_1-small
    ```

    The data type can be fp32, fp16, or bf16.

    The outputs should be similar to the following:

    ```bash
    2022-11-18 08:53:25,056 __main__ [INFO] hf-beamsearch translates 10 batches taking 218.60 sec to translate 12261 tokens, BLEU score: 0.41, 56 tokens/sec. (8198 words, 38 words/sec)
    2022-11-18 08:53:25,056 __main__ [INFO] ft-beamsearch translates 10 batches taking 3.28 sec to translate 10488 tokens, BLEU score: 0.47, 3201 tokens/sec. (7016 words, 2141 words/sec)
    2022-11-18 08:53:25,056 __main__ [INFO] hf-sampling translates 10 batches taking 124.60 sec to translate 11607 tokens, BLEU score: 0.29, 93 tokens/sec. (7842 words, 63 words/sec)
    2022-11-18 08:53:25,056 __main__ [INFO] ft-sampling translates 10 batches taking 2.91 sec to translate 11755 tokens, BLEU score: 0.29, 4034 tokens/sec. (8113 words, 2784 words/sec)
    ```

Running UL2 on FasterTransformer PyTorch op

UL2 (Unifying Language Learning Paradigms) is published by Google. The following is its introduction:

UL2 is a unified framework for pretraining models that are universally effective across datasets and setups. UL2 uses Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. UL2 introduces a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes.

This section shows how to serve UL2 with the FasterTransformer PyTorch op, using HuggingFace's model.

  1. Download the model (this takes some time because the model is about 40 GB)

    ```bash
    sudo apt-get install git-lfs
    git lfs install
    git lfs clone https://huggingface.co/google/ul2
    ```
  2. Convert the checkpoint to FT

    Because loading the UL2 model in PyTorch and preprocessing it takes a long time, and summarization.py only supports loading FT models from binary files, we first convert the PyTorch checkpoint to the FasterTransformer format with the converter huggingface_t5_ckpt_convert.py. (With -inference_tensor_para_size 2, the converted weights should end up under ul2/c-models/2-gpu, analogous to the 1-gpu directory noted earlier.)

    ```bash
    python3 ../examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py \
            -saved_dir ul2/c-models \
            -in_file ul2/ \
            -inference_tensor_para_size 2 \
            -weight_data_type fp32
    ```
  3. Run UL2 on summarization task

    ```bash
    mpirun -n 2 python3 ../examples/pytorch/t5/summarization.py  \
                          --ft_model_location ul2/c-models/ \
                          --hf_model_location ul2/ \
                          --test_ft \
                          --data_type bf16 \
                          --tensor_para_size 2
    ```

    The results should look like the following:

    ```bash
    rouge1 : 23.673944166014593
    rouge2 : 5.946485383012474
    rougeL : 14.749827731626247
    rougeLsum : 20.217932008044144
    ```

Running t5-v1.1

  1. Download the model (this may take some time depending on your connection)

    ```bash
    sudo apt-get install git-lfs
    git lfs install
    git lfs clone https://huggingface.co/google/t5-v1_1-base
    ```
  2. Convert the checkpoint to FT

    ```bash
    python3 ../examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py \
            -saved_dir t5-v1_1-base/c-models \
            -in_file t5-v1_1-base/ \
            -inference_tensor_para_size 1 \
            -weight_data_type fp32
    ```
  3. Run t5-v1.1 on summarization task

    ```bash
    python3 ../examples/pytorch/t5/summarization.py  \
            --ft_model_location t5-v1_1-base/c-models/ \
            --hf_model_location t5-v1_1-base/ \
            --test_ft \
            --test_hf
    ```

    The results should look like the following:

    ```bash
    Hugging Face (total latency: 21.826529 sec)
    rouge1 : 10.786476875527406
    rouge2 : 1.8231246974441166
    rougeL : 8.652689713627165
    rougeLsum : 10.326607305635523
    Faster Transformers (total latency: 7.036808000000001 sec)
    rouge1 : 10.91735083630513
    rouge2 : 1.8454654301092783
    rougeL : 8.76872604148143
    rougeLsum : 10.453229536094794
    ```
  • Note that these models are not fine-tuned, so running with FP16 or setting topk > 1 may lead to unstable results.

Running mt5

  1. Download the model (this may take some time depending on your connection)

    ```bash
    sudo apt-get install git-lfs
    git lfs install
    git lfs clone https://huggingface.co/google/mt5-base
    ```
  2. Convert the checkpoint to FT

    ```bash
    python3 ../examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py \
            -saved_dir mt5-base/c-models \
            -in_file mt5-base/ \
            -inference_tensor_para_size 1 \
            -weight_data_type fp32
    ```
  3. Run mt5 on summarization task

    ```bash
    python3 ../examples/pytorch/t5/summarization.py  \
            --ft_model_location mt5-base/c-models/ \
            --hf_model_location mt5-base/ \
            --test_ft \
            --test_hf
    ```

    The results should look like the following:

    ```bash
    Hugging Face (total latency: 3.143815 sec)
    rouge1 : 4.636193727758547
    rouge2 : 0.20661157024793395
    rougeL : 3.7990194456844026
    rougeLsum : 4.274724726798723
    Faster Transformers (total latency: 1.3952859999999998 sec)
    rouge1 : 4.726148174547172
    rouge2 : 0.20818875780707846
    rougeL : 3.8698557495145516
    rougeLsum : 4.3507453221528
    ```
    • Note that these models are not fine-tuned, so running with FP16 or setting topk > 1 may lead to unstable results.

Performance

Hardware settings:

  • CPU: Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
  • V100-16GB (with mclk 877MHz, pclk 1380MHz) with Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (dgx-1 server)
  • A100-40GB
  • A100-80GB (with mclk 1593, pclk 1410) with AMD EPYC 7742 64-Core Processor

To run the following benchmark, we need to install the Unix computing tool "bc":

```bash
apt-get install bc
```

End-to-end translation performance on PyTorch

We demonstrate the throughput of HuggingFace and FT for end-to-end translation on V100 and A100. We skip the BLEU score because the scores of PyTorch, FT Decoder, and FT Decoding are close.

Although the BLEU scores of all methods are close, the results may be slightly different and the number of generated tokens may also differ. So, we use throughput rather than latency to show the performance in this benchmark.

T5-3B on A100-80GB

  • T5-3B on FP16 with beamsearch

| Batch Size | beamsearch | Precision | FT Decoding Throughput (token/sec) |
| :--------: | :--------: | :-------: | :--------------------------------: |
| 1 | 4 | FP16 | 192 |
| 1 | 32 | FP16 | 140 |
| 8 | 4 | FP16 | 787 |
| 8 | 32 | FP16 | 271 |
| 32 | 4 | FP16 | 1540 |
| 32 | 32 | FP16 | OOM |
| 128 | 4 | FP16 | 1907 |
| 128 | 32 | FP16 | OOM |

When the batch size is 32 and the beam width is 32, the k/v caches require about 90 GB and lead to OOM.
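A back-of-the-envelope estimate, assuming t5-3b dimensions (24 decoder layers, 32 heads, d_kv = 128) and 128-token source/target lengths; all of these numbers are assumptions rather than values from this guide:

```python
# Rough FP16 k/v-cache estimator: decoder self-attention plus cross-attention
# caches, each storing K and V per layer for every (batch, beam) hypothesis.
def kv_cache_gib(batch, beam, layers=24, kv_hidden=32 * 128,
                 out_len=128, mem_len=128, bytes_per_elem=2):
    self_attn  = 2 * layers * batch * beam * out_len * kv_hidden * bytes_per_elem
    cross_attn = 2 * layers * batch * beam * mem_len * kv_hidden * bytes_per_elem
    return (self_attn + cross_attn) / 1024**3

print(f"{kv_cache_gib(32, 32):.0f} GiB")  # ~96 GiB, the same ballpark as above
```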

  • T5-3B on FP16 with sampling

| Batch Size | sampling | Precision | FT Decoding Throughput (token/sec) |
| :--------: | :------: | :-------: | :--------------------------------: |
| 1 | 4 | FP16 | 218 |
| 1 | 0.5 | FP16 | 217 |
| 8 | 4 | FP16 | 932 |
| 8 | 0.5 | FP16 | 908 |
| 32 | 4 | FP16 | 2416 |
| 32 | 0.5 | FP16 | 2344 |
| 128 | 4 | FP16 | 5004 |
| 128 | 0.5 | FP16 | 4891 |

T5-base on A100-40GB

  • T5-base on FP32 with beamsearch

| Batch Size | beamsearch | Precision | Huggingface Throughput (token/sec) | FT Decoding Throughput (token/sec) | FT Decoding Speedup |
| :--------: | :--------: | :-------: | :---: | :---: | :---: |
| 1 | 4 | FP32 | 46 | 422 | 9.17 |
| 1 | 32 | FP32 | 34 | 339 | 9.97 |
| 8 | 4 | FP32 | 194 | 1779 | 9.17 |
| 8 | 32 | FP32 | 98 | 516 | 5.26 |
| 32 | 4 | FP32 | 486 | 2939 | 6.04 |
| 32 | 32 | FP32 | OOM | OOM | - |
| 128 | 4 | FP32 | 810 | 3445 | 4.25 |
| 128 | 32 | FP32 | OOM | OOM | - |

  • T5-base on FP16 with beamsearch

| Batch Size | beamsearch | Precision | Huggingface Throughput (token/sec) | FT Decoding Throughput (token/sec) | FT Decoding Speedup |
| :--------: | :--------: | :-------: | :---: | :---: | :---: |
| 1 | 4 | FP16 | 44 | 671 | 15.25 |
| 1 | 32 | FP16 | 25 | 517 | 20.68 |
| 8 | 4 | FP16 | 139 | 2807 | 20.19 |
| 8 | 32 | FP16 | 77 | 1573 | 20.42 |
| 32 | 4 | FP16 | 368 | 7102 | 19.29 |
| 32 | 32 | FP16 | 123 | 1830 | 14.87 |
| 128 | 4 | FP16 | 656 | 11312 | 17.24 |
| 128 | 32 | FP16 | OOM | 1845 | - |

  • T5-base on FP32 with sampling

| Batch Size | sampling | Precision | Huggingface Throughput (token/sec) | FT Decoding Throughput (token/sec) | FT Decoding Speedup |
| :--------: | :------: | :-------: | :---: | :---: | :---: |
| 1 | 4 | FP32 | 66 | 334 | 5.06 |
| 1 | 0.5 | FP32 | 65 | 323 | 4.97 |
| 8 | 4 | FP32 | 217 | 1887 | 8.70 |
| 8 | 0.5 | FP32 | 200 | 1765 | 8.83 |
| 32 | 4 | FP32 | 718 | 5211 | 7.26 |
| 32 | 0.5 | FP32 | 656 | 4731 | 7.21 |
| 128 | 4 | FP32 | 2115 | 8782 | 4.15 |
| 128 | 0.5 | FP32 | 1805 | 8212 | 4.55 |

  • T5-base on FP16 with sampling

| Batch Size | sampling | Precision | Huggingface Throughput (token/sec) | FT Decoding Throughput (token/sec) | FT Decoding Speedup |
| :--------: | :------: | :-------: | :---: | :---: | :---: |
| 1 | 4 | FP16 | 46 | 746 | 16.21 |
| 1 | 0.5 | FP16 | 43 | 706 | 16.41 |
| 8 | 4 | FP16 | 212 | 3293 | 15.53 |
| 8 | 0.5 | FP16 | 191 | 3049 | 15.96 |
| 32 | 4 | FP16 | 501 | 8783 | 17.53 |
| 32 | 0.5 | FP16 | 432 | 7961 | 18.42 |
| 128 | 4 | FP16 | 1426 | 18137 | 12.71 |
| 128 | 0.5 | FP16 | 1414 | 16680 | 11.79 |

T5-base on V100-16GB

  • T5-base on FP32 with beamsearch

| Batch Size | beamsearch | Precision | Huggingface Throughput (token/sec) | FT Decoding Throughput (token/sec) | FT Decoding Speedup |
| :--------: | :--------: | :-------: | :---: | :---: | :---: |
| 1 | 4 | FP32 | 28 | 257 | 9.17 |
| 1 | 32 | FP32 | 20 | 175 | 8.75 |
| 8 | 4 | FP32 | 105 | 953 | 9.07 |
| 8 | 32 | FP32 | 50 | 196 | 3.92 |
| 32 | 4 | FP32 | 247 | 1400 | 5.66 |
| 32 | 32 | FP32 | 0 | OOM | x |
| 128 | 4 | FP32 | 0 | 1448 | x |
| 128 | 32 | FP32 | OOM | OOM | x |

  • T5-base on FP16 with beam search

| Batch Size | beamsearch | Precision | Huggingface Throughput (token/sec) | FT Decoding Throughput (token/sec) | FT Decoding Speedup |
| :--------: | :--------: | :-------: | :---: | :---: | :---: |
| 1 | 4 | FP16 | 21 | 359 | 17.09 |
| 1 | 32 | FP16 | 14 | 250 | 17.85 |
| 8 | 4 | FP16 | 76 | 1418 | 18.65 |
| 8 | 32 | FP16 | 40 | 526 | 13.15 |
| 32 | 4 | FP16 | 221 | 2962 | 13.40 |
| 32 | 32 | FP16 | OOM | 684 | x |
| 128 | 4 | FP16 | 345 | 4079 | 11.82 |
| 128 | 32 | FP16 | OOM | OOM | x |

  • T5-base on FP32 with sampling

| Batch Size | sampling | Precision | Huggingface Throughput (token/sec) | FT Decoding Throughput (token/sec) | FT Decoding Speedup |
| :--------: | :------: | :-------: | :---: | :---: | :---: |
| 1 | 4 | FP32 | 26 | 226 | 8.69 |
| 1 | 0.5 | FP32 | 27 | 219 | 8.11 |
| 8 | 4 | FP32 | 115 | 1153 | 10.02 |
| 8 | 0.5 | FP32 | 130 | 1075 | 8.26 |
| 32 | 4 | FP32 | 327 | 3021 | 9.23 |
| 32 | 0.5 | FP32 | 297 | 2773 | 9.33 |
| 128 | 4 | FP32 | 1162 | 4184 | 3.60 |
| 128 | 0.5 | FP32 | 797 | 3975 | 4.98 |

  • T5-base on FP16 with sampling

| Batch Size | sampling | Precision | Huggingface Throughput (token/sec) | FT Decoding Throughput (token/sec) | FT Decoding Speedup |
| :--------: | :------: | :-------: | :---: | :---: | :---: |
| 1 | 4 | FP16 | 19 | 364 | 19.15 |
| 1 | 0.5 | FP16 | 20 | 353 | 17.65 |
| 8 | 4 | FP16 | 83 | 1733 | 20.87 |
| 8 | 0.5 | FP16 | 98 | 1599 | 16.31 |
| 32 | 4 | FP16 | 337 | 4517 | 13.40 |
| 32 | 0.5 | FP16 | 301 | 4207 | 13.97 |
| 128 | 4 | FP16 | 956 | 8519 | 8.91 |
| 128 | 0.5 | FP16 | 723 | 7997 | 11.06 |

T5-small on V100-16GB

  • T5-small on FP32 with beamsearch

| Batch Size | beamsearch | Precision | Huggingface Throughput (token/sec) | FT Decoding Throughput (token/sec) | FT Decoding Speedup |
| :--------: | :--------: | :-------: | :---: | :---: | :---: |
| 1 | 4 | FP32 | 51 | 626 | 12.27 |
| 1 | 32 | FP32 | 30 | 413 | 13.76 |
| 8 | 4 | FP32 | 192 | 2462 | 12.82 |
| 8 | 32 | FP32 | 72 | 563 | 7.81 |
| 32 | 4 | FP32 | 383 | 4316 | 11.26 |
| 32 | 32 | FP32 | 104 | 668 | 6.42 |
| 128 | 4 | FP32 | 554 | 4747 | 8.56 |
| 128 | 32 | FP32 | OOM | OOM | x |

  • T5-small on FP16 with beamsearch

| Batch Size | beamsearch | Precision | Huggingface Throughput (token/sec) | FT Decoding Throughput (token/sec) | FT Decoding Speedup |
| :--------: | :--------: | :-------: | :---: | :---: | :---: |
| 1 | 4 | FP16 | 35 | 776 | 22.17 |
| 1 | 32 | FP16 | 28 | 553 | 19.75 |
| 8 | 4 | FP16 | 163 | 3467 | 21.26 |
| 8 | 32 | FP16 | 71 | 1140 | 16.05 |
| 32 | 4 | FP16 | 365 | 7154 | 19.60 |
| 32 | 32 | FP16 | 108 | 1359 | 12.58 |
| 128 | 4 | FP16 | 524 | 11285 | 21.53 |
| 128 | 32 | FP16 | 0 | 942※ | 0.00 |

※: Out of memory on a single GPU; run with 2-way tensor parallelism.

  • T5-small on FP32 with sampling

| Batch Size | sampling | Precision | Huggingface Throughput (token/sec) | FT Decoding Throughput (token/sec) | FT Decoding Speedup |
| :--------: | :------: | :-------: | :---: | :---: | :---: |
| 1 | 4 | FP32 | 60 | 577 | 9.61 |
| 1 | 0.5 | FP32 | 57 | 524 | 9.19 |
| 8 | 4 | FP32 | 243 | 2821 | 11.60 |
| 8 | 0.5 | FP32 | 221 | 2345 | 10.61 |
| 32 | 4 | FP32 | 765 | 7865 | 10.28 |
| 32 | 0.5 | FP32 | 634 | 6365 | 10.03 |
| 128 | 4 | FP32 | 2238 | 12134 | 5.42 |
| 128 | 0.5 | FP32 | 1611 | 10439 | 6.47 |

  • T5-small on FP16 with sampling

| Batch Size | sampling | Precision | Huggingface Throughput (token/sec) | FT Decoding Throughput (token/sec) | FT Decoding Speedup |
| :--------: | :------: | :-------: | :---: | :---: | :---: |
| 1 | 4 | FP16 | 46 | 934 | 20.30 |
| 1 | 0.5 | FP16 | 42 | 862 | 20.52 |
| 8 | 4 | FP16 | 194 | 3510 | 18.09 |
| 8 | 0.5 | FP16 | 182 | 3235 | 17.77 |
| 32 | 4 | FP16 | 592 | 10692 | 18.06 |
| 32 | 0.5 | FP16 | 553 | 9008 | 16.28 |
| 128 | 4 | FP16 | 1921 | 19446 | 10.12 |
| 128 | 0.5 | FP16 | 1307 | 16810 | 12.86 |