# Deploy Triton Inference Server in Alibaba Cloud EAS

This repository contains information about how to deploy NVIDIA Triton Inference Server in EAS (Elastic Algorithm Service) of Alibaba Cloud.
Download the ONNX Inception v3 model with the fetch_models.sh script. Then use ossutil, a command-line tool for OSS, to upload the model to an OSS directory of your choice:

```bash
./ossutil cp -r inception_v3_onnx/ oss://triton-model-repo/models
```
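Triton reads models from the repository using its standard layout, so after the upload the OSS path should look roughly like this (a sketch, assuming the usual `<model-name>/<version>/model.onnx` convention and an optional `config.pbtxt`):

```
triton-model-repo/models/
└── inception_v3_onnx/
    ├── config.pbtxt
    └── 1/
        └── model.onnx
```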
The following is the JSON we use when creating a Triton service on EAS; save it as `triton.config`:
```json
{
  "name": "<your triton service name>",
  "processor": "triton",
  "processor_params": [
    "--model-repository=oss://triton-model-repo/models",
    "--allow-grpc=true",
    "--allow-http=true"
  ],
  "metadata": {
    "instance": 1,
    "cpu": 4,
    "gpu": 1,
    "memory": 10000,
    "resource": "<your resource id>",
    "rpc.keepalive": 3000
  }
}
```
Only `processor` and `processor_params` differ from a normal EAS service.

| Params | Details |
|---|---|
| `processor` | Must be `triton` to use Triton on EAS. |
| `processor_params` | A list of strings; each element is a command-line parameter for `tritonserver`. |
Create the service with the eascmd client:

```bash
./eascmd create triton.config
```
```
[RequestId]: AECDB6A4-CB69-4688-AA35-BA1E020C39E6
+-------------------+------------------------------------------------------------------------------------------------+
| Internet Endpoint | http://1271520832287160.cn-shanghai.pai-eas.aliyuncs.com/api/predict/test_triton_processor    |
| Intranet Endpoint | http://1271520832287160.vpc.cn-shanghai.pai-eas.aliyuncs.com/api/predict/test_triton_processor |
| Token             | MmY3M2ExZGYwYjZiMTQ5YTRmZWE3MDAzNWM1ZTBiOWQ3MGYxZGNkZQ==                                       |
+-------------------+------------------------------------------------------------------------------------------------+
[OK] Service is now deploying
[OK] Successfully synchronized resources
[OK] Waiting [Total: 1, Pending: 1, Running: 0]
[OK] Waiting [Total: 1, Pending: 1, Running: 0]
[OK] Running [Total: 1, Pending: 0, Running: 1]
[OK] Service is running
```
Install the Triton client library:

```bash
pip install tritonclient[all]
```
```python
import numpy as np
import time
from PIL import Image
import tritonclient.http as httpclient
from tritonclient.utils import InferenceServerException

URL = "<service url>"
HEADERS = {"Authorization": "<service token>"}

input_img = httpclient.InferInput("input", [1, 299, 299, 3], "FP32")
# Use one of the cat images from ImageNet, or any cat image you like
img = Image.open('./cat.png').resize((299, 299))
img = np.asarray(img).astype('float32') / 255.0
input_img.set_data_from_numpy(img.reshape([1, 299, 299, 3]), binary_data=True)

output = httpclient.InferRequestedOutput(
    "InceptionV3/Predictions/Softmax", binary_data=True
)
triton_client = httpclient.InferenceServerClient(url=URL, verbose=False)

start = time.time()
for i in range(11):
    results = triton_client.infer(
        "inception_v3_onnx", inputs=[input_img], outputs=[output], headers=HEADERS
    )
    res_body = results.get_response()
    elapsed_ms = (time.time() - start) * 1000
    if i == 0:
        print("model name: ", res_body["model_name"])
        print("model version: ", res_body["model_version"])
        print("output name: ", res_body["outputs"][0]["name"])
        print("output shape: ", res_body["outputs"][0]["shape"])
    print("[{}] Avg rt(ms): {:.2f}".format(i, elapsed_ms))
    start = time.time()
```
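The script above only reports latency; to turn the softmax output into actual predictions you can rank the probabilities. Below is a minimal sketch that uses a synthetic probability vector in place of `results.as_numpy("InceptionV3/Predictions/Softmax")`, since it assumes no live service (the 1001-class shape follows the TF-Slim Inception convention):

```python
import numpy as np

# Stand-in for: probs = results.as_numpy("InceptionV3/Predictions/Softmax")
# The real call returns a [1, 1001] float32 probability vector.
rng = np.random.default_rng(0)
probs = rng.random((1, 1001)).astype("float32")
probs /= probs.sum()

# Indices of the five highest-probability classes, best first
top5 = np.argsort(probs[0])[::-1][:5]
for idx in top5:
    print("class {}: {:.4f}".format(idx, probs[0][idx]))
```

Mapping the indices to human-readable labels requires the ImageNet label file that matches the model's class ordering.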
Running the Python script produces output like the following:
```
[0] Avg rt(ms): 86.05
[1] Avg rt(ms): 52.35
[2] Avg rt(ms): 50.56
[3] Avg rt(ms): 43.45
[4] Avg rt(ms): 41.19
[5] Avg rt(ms): 40.55
[6] Avg rt(ms): 37.24
[7] Avg rt(ms): 37.16
[8] Avg rt(ms): 36.68
[9] Avg rt(ms): 34.24
[10] Avg rt(ms): 34.27
```
See the following resources to learn more about how to use Alibaba Cloud's OSS or EAS.
These `tritonserver` launch parameters can be passed as elements of `processor_params`:

- `model-repository`
- `log-verbose`
- `log-info`
- `log-warning`
- `log-error`
- `exit-on-error`
- `strict-model-config`
- `strict-readiness`
- `allow-http`
- `http-thread-count`
- `pinned-memory-pool-byte-size`
- `cuda-memory-pool-byte-size`
- `min-supported-compute-capability`
- `buffer-manager-thread-count`
- `backend-config`
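For example, a `processor_params` fragment that raises logging verbosity and pre-allocates a CUDA memory pool on GPU 0 might look like this (a sketch; the pool size of 512 MB is illustrative, not a recommendation):

```json
"processor_params": [
    "--model-repository=oss://triton-model-repo/models",
    "--log-verbose=1",
    "--cuda-memory-pool-byte-size=0:536870912"
]
```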