Megatron Core To TRTLLM Export Documentation

examples/export/trtllm_export/README.md

This guide walks you through using the Megatron Core export to convert models to the TRTLLM format.

Contents

1. Quick Start
2. GPU Export
3. Future work

1. Quick Start

This section walks you through the flow of converting an MCore GPT model to TRTLLM format using single-device mode. The code can be found at gpt_single_device_cpu_export.py.

NOTE: For faster performance, if your entire model fits into GPU memory, pre-transfer the model state dict to the GPU and then call the get_trtllm_pretrained_config_and_model_weights function.
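As a minimal pure-Python sketch of that preparation step (the helper name `prepare_state_dict` is illustrative, not part of the export API; lists stand in for tensors, and with a real model `move` would be something like `lambda t: t.to("cuda")`):

```python
def prepare_state_dict(state_dict, move=lambda v: v):
    """Drop None-valued (_extra_state) entries and apply `move`
    to each remaining value, e.g. to transfer tensors to the GPU."""
    return {k: move(v) for k, v in state_dict.items() if v is not None}

# Toy example with lists standing in for tensors:
example = {"decoder.weight": [1.0, 2.0], "decoder._extra_state": None}
print(prepare_state_dict(example))  # {'decoder.weight': [1.0, 2.0]}
```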

1.1 Understanding The Code

STEP 1 - We initialize model parallel and other default arguments. We initialize TP and PP to 1 so that we can get the full model state dict on the CPU.

```python
initialize_distributed(tensor_model_parallel_size=1, pipeline_model_parallel_size=1)
```

STEP 2 - We load the model using the model_provider_function. NOTE: We create a simple GPT model.

```python
transformer_config = TransformerConfig(
    num_layers=2,
    hidden_size=64,  # Needs to be at least 32 times num_attention_heads
    num_attention_heads=2,
    use_cpu_initialization=True,
    pipeline_dtype=torch.float32,
)

gpt_model = GPTModel(
    config=transformer_config,
    transformer_layer_spec=get_gpt_layer_local_spec(),
    vocab_size=100,
    max_sequence_length=_SEQUENCE_LENGTH,
)

# Optionally you can also load a model using this code:
# sharded_state_dict = gpt_model.sharded_state_dict(prefix='')
# checkpoint = dist_checkpointing.load(sharded_state_dict=sharded_state_dict, checkpoint_dir=checkpoint_path)
# gpt_model.load_state_dict(checkpoint)
```

STEP 3 - Instantiate the TRTLLM Helper. For the GPT model we instantiate trtllm_helper as shown below.

```python
seq_len_interpolation_factor = None
if hasattr(gpt_model, "rotary_pos_emb"):
    seq_len_interpolation_factor = gpt_model.rotary_pos_emb.seq_len_interpolation_factor

trtllm_helper = TRTLLMHelper(
    transformer_config=gpt_model.config,
    model_type=ModelType.gpt,
    position_embedding_type=gpt_model.position_embedding_type,
    max_position_embeddings=gpt_model.max_position_embeddings,
    rotary_percentage=gpt_model.rotary_percent,
    rotary_base=gpt_model.rotary_base,
    moe_tp_mode=2,
    multi_query_mode=False,
    activation="gelu",
    seq_len_interpolation_factor=seq_len_interpolation_factor,
    share_embeddings_and_output_weights=gpt_model.share_embeddings_and_output_weights,
)
```

STEP 4 - Get the TRTLLM weights and configs. To convert model weights to TRTLLM weights and configs, we use the single_device_converter. We pass in the model state dict and the export config. In this example we use an inference TP size of 2 for the export.

```python
model_state_dict = {}
for key, val in gpt_model.state_dict().items():
    # val is None for _extra_state layers; we filter those out
    if val is not None:
        model_state_dict[key] = val

export_config = ExportConfig(inference_tp_size=2)
weight_list, config_list = trtllm_helper.get_trtllm_pretrained_config_and_model_weights(
    model_state_dict=model_state_dict,
    dtype=DataType.bfloat16,
    export_config=export_config,
)
```
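With inference_tp_size=2, the converter returns one weight dict and one config per inference rank. As a rough illustration of what tensor-parallel partitioning means (a toy column split on nested lists; the function name is made up for this sketch, and the real single_device_converter applies per-layer split rules internally):

```python
def split_for_tp(weight_rows, tp_size):
    """Toy column-split of a 2D weight (list of rows) across tp_size ranks."""
    ncols = len(weight_rows[0])
    assert ncols % tp_size == 0, "columns must divide evenly across ranks"
    shard = ncols // tp_size
    # One sub-matrix per rank, each holding a contiguous block of columns
    return [
        [row[r * shard:(r + 1) * shard] for row in weight_rows]
        for r in range(tp_size)
    ]

w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
print(split_for_tp(w, 2))  # [[[1, 2], [5, 6]], [[3, 4], [7, 8]]]
```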

STEP 5 - Build the TRTLLM engine. The following code builds the TRTLLM engine.

```python
for trtllm_model_weights, trtllm_model_config in zip(weight_list, config_list):
    trtllm_helper.build_and_save_engine(
        max_input_len=256,
        max_output_len=256,
        max_batch_size=8,
        engine_dir='/opt/megatron-lm/engine',
        trtllm_model_weights=trtllm_model_weights,
        trtllm_model_config=trtllm_model_config,
        lora_ckpt_list=None,
        use_lora_plugin=None,
        max_lora_rank=64,
        lora_target_modules=None,
        max_prompt_embedding_table_size=0,
        paged_kv_cache=True,
        remove_input_padding=True,
        paged_context_fmha=False,
        use_refit=False,
        max_num_tokens=None,
        max_seq_len=512,
        opt_num_tokens=None,
        max_beam_width=1,
        tokens_per_block=128,
        multiple_profiles=False,
        gpt_attention_plugin="auto",
        gemm_plugin="auto",
    )
```

1.2 Running The Code

An example run script is shown below.

```sh
# In a workstation
MLM_PATH=/path/to/megatron-lm
CONTAINER_IMAGE=gitlab-master.nvidia.com:5005/dl/joc/nemo-ci/trtllm_0.12/train:pipe.17669124-x86

docker run -it --gpus=all --ipc=host -v $MLM_PATH/:/opt/megatron-lm $CONTAINER_IMAGE bash

# Inside the container run the following.
cd /opt/megatron-lm/

CUDA_VISIBLE_DEVICES=0 torchrun --nproc-per-node 1 examples/export/trtllm_export/single_device_export/gpt_single_device_cpu_export.py
```

2. GPU Export

You can use gpt_distributed_gpu_export.py to run a more optimized, on-device, distributed version of TRTLLM export. Internally this uses the distributed_converter to convert model weights on device. In the single-device version you collect all the model weights on CPU/GPU, convert them to TRTLLM format, and then store the engine on disk. In the GPU version you load each individual state dict on the GPUs, convert it on the device itself, and store the engine on disk.
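The contrast between the two flows can be sketched as follows (toy functions with made-up names, not the actual converter API):

```python
def single_device_flow(rank_shards, convert, build_engine):
    """Single-device export: gather every rank's shard into one full
    state dict, convert it once, then build an engine per entry."""
    full_state = {k: v for shard in rank_shards for k, v in shard.items()}
    converted = convert(full_state)  # one converted entry per inference rank
    return [build_engine(c) for c in converted]

def distributed_flow(rank_shards, convert_on_device, build_engine):
    """Distributed export: each rank converts only its own shard on its
    own GPU and builds its engine locally."""
    return [build_engine(convert_on_device(shard)) for shard in rank_shards]
```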

To run the GPU version:

```sh
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc-per-node 2 examples/export/trtllm_export/distributed_export/gpt_distributed_gpu_export.py
```

3. Future work

The following are planned for future releases.

  • Pipeline parallelism for export (work in progress)
  • GPU export for more models (work in progress for some models)
  • Refit functionality
  • vLLM support