<!--
# Copyright (c) 2021-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# Deploy Triton Inference Server on PAI-EAS

## Description

This repository contains information about how to deploy NVIDIA Triton Inference Server in EAS (Elastic Algorithm Service) on Alibaba Cloud.

- EAS provides a simple way for deep learning developers to deploy their models on Alibaba Cloud.
- Using the Triton processor is the recommended way to deploy Triton Inference Server on EAS. Users can deploy a Triton server simply by preparing models and creating an EAS service with the processor type set to `triton`.
- Models should be uploaded to Alibaba Cloud's OSS (Object Storage Service). The model repository in OSS will be mounted onto a local path visible to Triton Server.
- This documentation uses Triton's own example models for the demo. The ONNX Inception v3 model can be obtained with the `fetch_models.sh` script.

## Prerequisites

- Register an Alibaba Cloud account and make sure you can use EAS via `eascmd`, a command-line tool to create, stop, or scale services on EAS.
- Before creating an EAS service, buy dedicated resource groups (CPU or GPU) on EAS following this document.
- Make sure you can use OSS (Object Storage Service); the models should be uploaded into your own OSS bucket.

## Demo Instructions

### Prepare a model repository directory in OSS

Download the ONNX Inception v3 model via `fetch_models.sh`. Then use `ossutil`, a command-line tool for OSS, to upload the model to the OSS directory of your choice.

```
./ossutil cp inception_v3_onnx/ oss://triton-model-repo/models
```
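For the mounted repository to be loadable by Triton, the uploaded directory should follow Triton's standard model-repository layout: one directory per model, containing `config.pbtxt` and numbered version subdirectories. A sketch of what the demo model directory is expected to look like (the labels file name is illustrative; it is only needed if `config.pbtxt` references one):

```
inception_v3_onnx/
├── config.pbtxt           # model configuration (name, platform, inputs, outputs)
├── inception_labels.txt   # optional label file referenced by config.pbtxt
└── 1/                     # version directory
    └── model.onnx         # the ONNX model file
```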

### Create the Triton service with a JSON config via eascmd

The following is the JSON configuration used when creating a Triton server on EAS.

```json
{
  "name": "<your triton service name>",
  "processor": "triton",
  "processor_params": [
    "--model-repository=oss://triton-model-repo/models",
    "--allow-grpc=true",
    "--allow-http=true"
  ],
  "metadata": {
    "instance": 1,
    "cpu": 4,
    "gpu": 1,
    "memory": 10000,
    "resource": "<your resource id>",
    "rpc.keepalive": 3000
  }
}
```
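Hand-editing JSON is error-prone, so one option is to generate the config file programmatically and let `json.dump` guarantee valid syntax. A minimal sketch (the service name, resource id, and OSS path mirror the example above and are placeholders to replace with your own values):

```python
import json

# Build the EAS service description as a plain dict; every value below
# mirrors the example config above and should be replaced with your own.
service = {
    "name": "<your triton service name>",
    "processor": "triton",
    "processor_params": [
        "--model-repository=oss://triton-model-repo/models",
        "--allow-grpc=true",
        "--allow-http=true",
    ],
    "metadata": {
        "instance": 1,
        "cpu": 4,
        "gpu": 1,
        "memory": 10000,
        "resource": "<your resource id>",
        "rpc.keepalive": 3000,
    },
}

# Write the file that `eascmd create` consumes.
with open("triton.config", "w") as f:
    json.dump(service, f, indent=2)
```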

Only `processor` and `processor_params` differ from a normal EAS service.

| params | details |
|--------|---------|
| `processor` | Name should be `triton` to use Triton on EAS |
| `processor_params` | List of strings; every element is a parameter for `tritonserver` |
```
./eascmd create triton.config
[RequestId]: AECDB6A4-CB69-4688-AA35-BA1E020C39E6
+-------------------+------------------------------------------------------------------------------------------------+
| Internet Endpoint | http://1271520832287160.cn-shanghai.pai-eas.aliyuncs.com/api/predict/test_triton_processor     |
| Intranet Endpoint | http://1271520832287160.vpc.cn-shanghai.pai-eas.aliyuncs.com/api/predict/test_triton_processor |
|             Token | MmY3M2ExZGYwYjZiMTQ5YTRmZWE3MDAzNWM1ZTBiOWQ3MGYxZGNkZQ==                                       |
+-------------------+------------------------------------------------------------------------------------------------+
[OK] Service is now deploying
[OK] Successfully synchronized resources
[OK] Waiting [Total: 1, Pending: 1, Running: 0]
[OK] Waiting [Total: 1, Pending: 1, Running: 0]
[OK] Running [Total: 1, Pending: 0, Running: 1]
[OK] Service is running
```

### Query the Triton service with the Python client

Install Triton's Python client:

```
pip install tritonclient[all]
```

A demo that queries the Inception model:

```python
import numpy as np
import time
from PIL import Image

import tritonclient.http as httpclient

URL = "<service url>"
HEADERS = {"Authorization": "<service token>"}

# Input tensor: one 299x299 RGB image in NHWC layout.
input_img = httpclient.InferInput("input", [1, 299, 299, 3], "FP32")
# Use one of the ImageNet cat images, or any cat image you like.
img = Image.open("./cat.png").resize((299, 299))
img = np.asarray(img).astype("float32") / 255.0
input_img.set_data_from_numpy(img.reshape([1, 299, 299, 3]), binary_data=True)

output = httpclient.InferRequestedOutput(
    "InceptionV3/Predictions/Softmax", binary_data=True
)
triton_client = httpclient.InferenceServerClient(url=URL, verbose=False)

start = time.time()
for i in range(10):
    results = triton_client.infer(
        "inception_v3_onnx", inputs=[input_img], outputs=[output], headers=HEADERS
    )
    res_body = results.get_response()
    elapsed_ms = (time.time() - start) * 1000
    if i == 0:
        print("model name: ", res_body["model_name"])
        print("model version: ", res_body["model_version"])
        print("output name: ", res_body["outputs"][0]["name"])
        print("output shape: ", res_body["outputs"][0]["shape"])
    print("[{}] Avg rt(ms): {:.2f}".format(i, elapsed_ms))
    start = time.time()
```
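The demo prints only timing information. To turn the softmax response into predictions, the output tensor can be fetched with `results.as_numpy(...)` and decoded with numpy. A sketch (the output name and the 1001-class shape follow the demo above; the dummy scores and the class index are stand-ins for a real server response):

```python
import numpy as np

def top_k_predictions(scores: np.ndarray, k: int = 5):
    """Return (index, score) pairs for the k highest softmax scores."""
    flat = scores.reshape(-1)
    top = np.argsort(flat)[::-1][:k]  # indices of the k largest scores
    return [(int(i), float(flat[i])) for i in top]

# In the real client you would call:
#   scores = results.as_numpy("InceptionV3/Predictions/Softmax")
# Here a dummy softmax vector stands in for the server response.
scores = np.full((1, 1001), 1e-4, dtype=np.float32)
scores[0, 282] = 0.9  # pretend class index 282 wins
print(top_k_predictions(scores, k=3))
```

Mapping indices to class names requires the labels file shipped with the model.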

Running the script prints output similar to the following:

```
[0] Avg rt(ms): 86.05
[1] Avg rt(ms): 52.35
[2] Avg rt(ms): 50.56
[3] Avg rt(ms): 43.45
[4] Avg rt(ms): 41.19
[5] Avg rt(ms): 40.55
[6] Avg rt(ms): 37.24
[7] Avg rt(ms): 37.16
[8] Avg rt(ms): 36.68
[9] Avg rt(ms): 34.24
[10] Avg rt(ms): 34.27
```
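Note that the first request is noticeably slower because it includes connection setup and warm-up. When benchmarking, it is common to drop the first iteration and average the rest; for the run above:

```python
# Per-iteration latencies (ms) copied from the output above.
latencies = [86.05, 52.35, 50.56, 43.45, 41.19, 40.55,
             37.24, 37.16, 36.68, 34.24, 34.27]

steady = latencies[1:]  # drop the warm-up iteration
mean_ms = sum(steady) / len(steady)
print(f"steady-state mean: {mean_ms:.2f} ms")  # prints "steady-state mean: 40.77 ms"
```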

## Additional Resources

See Alibaba Cloud's documentation to learn more about how to use OSS or EAS.

## Known Issues

- The Binary Tensor Data Extension is not fully supported yet. For users who want a service with the binary extension, it is only available in the cn-shanghai region of PAI-EAS.
- Currently only HTTP/1 is supported, hence gRPC cannot be used to query Triton servers on EAS. HTTP/2 will be officially supported in a short time.
- Users should not mount a whole OSS bucket when launching the Triton processor, but rather a sub-directory (at any depth) in the bucket. Otherwise the mounted path will not be as expected.
- Not all Triton Server parameters are supported on EAS. The following parameters are supported:
```
model-repository
log-verbose
log-info
log-warning
log-error
exit-on-error
strict-model-config
strict-readiness
allow-http
http-thread-count
pinned-memory-pool-byte-size
cuda-memory-pool-byte-size
min-supported-compute-capability
buffer-manager-thread-count
backend-config
```
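For example, a `processor_params` list restricted to flags from this list might look like the following (the values are illustrative, not recommendations):

```json
"processor_params": [
  "--model-repository=oss://triton-model-repo/models",
  "--allow-http=true",
  "--http-thread-count=16",
  "--strict-model-config=false",
  "--log-verbose=1",
  "--exit-on-error=true"
]
```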