docs_new/docs/basic_usage/aws_sagemaker.mdx
Deploy SGLang on Amazon SageMaker AI endpoints using the AWS Deep Learning Container (DLC) for SGLang. The SageMaker image variant accepts model configuration via environment variables and serves on port 8080.
This guide uses the pre-built DLC image. To build and deploy your own container instead, see Method 7: Run on AWS SageMaker in the installation guide.
AWS publishes pre-built, security-patched SGLang DLCs. The SageMaker GPU image is available from the
Amazon ECR registry (account 763104351884) in each supported region. For example, in us-west-2:
763104351884.dkr.ecr.us-west-2.amazonaws.com/sglang:server-sagemaker-cuda-v1.0
For the full list of image tags, see the Available DLC Images reference. For region-specific account IDs and supported regions, see Region Availability.
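When deploying to several regions, it can help to assemble the image URI from its parts instead of hard-coding it. The helper below is a minimal sketch: the function name dlc_image_uri is hypothetical, and the default account ID and tag are simply the values from the us-west-2 example above. Some regions may use a different account ID or domain suffix, so check Region Availability before relying on the defaults.

```python
def dlc_image_uri(
    region: str,
    tag: str = "server-sagemaker-cuda-v1.0",
    account_id: str = "763104351884",
) -> str:
    """Build an ECR image URI for the SGLang DLC (illustrative helper).

    Assumes the standard ECR hostname layout
    <account>.dkr.ecr.<region>.amazonaws.com; override account_id for
    regions that publish the image under a different account.
    """
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/sglang:{tag}"


image_uri = dlc_image_uri("us-west-2")
print(image_uri)
```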
The SageMaker image resolves the model in this order:
1. SM_SGLANG_MODEL_PATH environment variable: explicit Hugging Face ID or path.
2. /opt/ml/model: when SageMaker mounts model artifacts via ModelDataUrl or ModelDataSource, the entrypoint uses this path by default.

For gated models, also pass HF_TOKEN.
Any SM_SGLANG_* environment variable is converted to a --<name> SGLang server argument
(for example, SM_SGLANG_CONTEXT_LENGTH=4096 becomes --context-length 4096).
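The mapping from environment variables to server arguments, together with the model path default, can be sketched as follows. This is an illustration of the rule described above, not the container's actual entrypoint: the function name sglang_args_from_env is hypothetical, and details such as how the real entrypoint treats boolean flags or orders its arguments are not covered here.

```python
from typing import Dict, List, Mapping

PREFIX = "SM_SGLANG_"
DEFAULT_MODEL_DIR = "/opt/ml/model"  # where SageMaker mounts model artifacts


def sglang_args_from_env(env: Mapping[str, str]) -> List[str]:
    """Translate SM_SGLANG_* variables into --flag value pairs (sketch)."""
    flags: Dict[str, str] = {}
    for name, value in env.items():
        # Only variables with the SM_SGLANG_ prefix are considered.
        if name.startswith(PREFIX) and len(name) > len(PREFIX):
            # SM_SGLANG_CONTEXT_LENGTH -> --context-length
            flag = "--" + name[len(PREFIX):].lower().replace("_", "-")
            flags[flag] = value
    # Without an explicit SM_SGLANG_MODEL_PATH, fall back to the mount point.
    flags.setdefault("--model-path", DEFAULT_MODEL_DIR)
    args: List[str] = []
    for flag in sorted(flags):  # sorted only to make the output deterministic
        args += [flag, flags[flag]]
    return args


print(sglang_args_from_env({"SM_SGLANG_CONTEXT_LENGTH": "4096"}))
```

Variables without the SM_SGLANG_ prefix, such as HF_TOKEN, are left out of the argument list in this sketch.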
# Deploy with the SageMaker Python SDK
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer

model = Model(
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/sglang:server-sagemaker-cuda-v1.0",
    role="arn:aws:iam::<account_id>:role/<role_name>",
    predictor_cls=Predictor,
    # Model to load: a Hugging Face ID or a path
    env={"SM_SGLANG_MODEL_PATH": "openai/gpt-oss-20b"},
)

predictor = model.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=1,
    # Host AMI with NVIDIA drivers compatible with the CUDA image
    inference_ami_version="al2023-ami-sagemaker-inference-gpu-4-1",
    serializer=JSONSerializer(),
)

# Send a chat completion request
response = predictor.predict({
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "What is deep learning?"}],
    "max_tokens": 256,
})
print(response)

# Cleanup
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)
# Deploy with boto3
import json
import boto3

sm = boto3.client("sagemaker")
smrt = boto3.client("sagemaker-runtime")

# Register the container image and its environment as a SageMaker model
sm.create_model(
    ModelName="sglang-model",
    PrimaryContainer={
        "Image": "763104351884.dkr.ecr.us-west-2.amazonaws.com/sglang:server-sagemaker-cuda-v1.0",
        "Environment": {"SM_SGLANG_MODEL_PATH": "openai/gpt-oss-20b"},
    },
    ExecutionRoleArn="arn:aws:iam::<account_id>:role/<role_name>",
)

sm.create_endpoint_config(
    EndpointConfigName="sglang-config",
    ProductionVariants=[{
        "VariantName": "default",
        "ModelName": "sglang-model",
        "InstanceType": "ml.g5.2xlarge",
        "InitialInstanceCount": 1,
        # Host AMI with NVIDIA drivers compatible with the CUDA image
        "InferenceAmiVersion": "al2023-ami-sagemaker-inference-gpu-4-1",
    }],
)

# Create the endpoint and block until it is in service
sm.create_endpoint(EndpointName="sglang-endpoint", EndpointConfigName="sglang-config")
sm.get_waiter("endpoint_in_service").wait(EndpointName="sglang-endpoint")

# Send a chat completion request
resp = smrt.invoke_endpoint(
    EndpointName="sglang-endpoint",
    ContentType="application/json",
    Body=json.dumps({
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "What is deep learning?"}],
        "max_tokens": 256,
    }),
)
print(json.loads(resp["Body"].read()))

# Cleanup
sm.delete_endpoint(EndpointName="sglang-endpoint")
sm.delete_endpoint_config(EndpointConfigName="sglang-config")
sm.delete_model(ModelName="sglang-model")
When ModelDataUrl (or ModelDataSource) points to a tarball or S3 prefix, SageMaker mounts the contents
at /opt/ml/model. The entrypoint defaults --model-path to that location, so SM_SGLANG_MODEL_PATH
can be omitted:
model.tar.gz
├── config.json # standard model files (Hugging Face layout)
├── tokenizer.json
└── *.safetensors
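As a rough illustration of the S3-artifact case, the sketch below builds a CreateModel request that sets ModelDataUrl and leaves out the Environment block entirely, so the entrypoint falls back to /opt/ml/model. The helper name create_model_request and the S3 location s3://<bucket>/<prefix>/model.tar.gz are placeholders, not values from this guide.

```python
from typing import Any, Dict


def create_model_request(
    model_name: str, image_uri: str, role_arn: str, model_data_url: str
) -> Dict[str, Any]:
    """Build keyword arguments for the SageMaker CreateModel call (sketch)."""
    return {
        "ModelName": model_name,
        "PrimaryContainer": {
            "Image": image_uri,
            # SageMaker extracts this archive to /opt/ml/model, so no
            # SM_SGLANG_MODEL_PATH is set here.
            "ModelDataUrl": model_data_url,
        },
        "ExecutionRoleArn": role_arn,
    }


request = create_model_request(
    model_name="sglang-model",
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/sglang:server-sagemaker-cuda-v1.0",
    role_arn="arn:aws:iam::<account_id>:role/<role_name>",
    model_data_url="s3://<bucket>/<prefix>/model.tar.gz",  # placeholder location
)
print(request)
```

The resulting dictionary can be passed to boto3 as sm.create_model(**request), after which the endpoint configuration and endpoint are created the same way as in the boto3 example.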
- inference_ami_version: set this explicitly, because the default SageMaker host AMI has NVIDIA drivers that are incompatible with CUDA 13 images. See the ProductionVariant API reference for valid values.
- Request payloads, as in the examples above, follow the /v1/chat/completions schema.