dstack
======

`dstack <https://github.com/dstackai/dstack>`__ is an open-source alternative to Kubernetes and Slurm, designed to simplify GPU allocation and AI workload orchestration for ML teams across top clouds, on-prem clusters, and accelerators.

Prerequisites
-------------

Before you start, install dstack by following the `installation instructions <https://dstack.ai/docs/installation/>`__.
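A minimal sketch, assuming a pip-based setup (the installation guide also covers other options):

.. code:: bash

    # Install the dstack server and CLI (pip-based install)
    pip install -U "dstack[all]"
    # Start the dstack server locally
    dstack server

Once the dstack server is up, you can initialize your workspace as shown below: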

.. code:: bash

    mkdir dstack-qwen-deploy && cd dstack-qwen-deploy
    dstack init

Deploy Qwen3-30B-A3B
--------------------

Deploy Qwen3-30B-A3B on instances from the cloud providers configured in your ``~/.dstack/server/config.yml`` file.
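If no backends are configured yet, a minimal sketch of ``~/.dstack/server/config.yml`` might look like this (AWS with default credentials is just one example backend; see the dstack docs for others):

.. code:: yaml

    # Minimal sketch of ~/.dstack/server/config.yml -- example backend only
    projects:
      - name: main
        backends:
          - type: aws
            creds:
              type: default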

You can use SGLang, TGI, or vLLM to serve the model. Here we use SGLang as an example.

Create a `service <https://dstack.ai/docs/concepts/services/>`__ configuration file named ``serve-30b.dstack.yml`` with the following content:

.. code:: yaml

    type: service
    name: qwen3-30b-a3b

    image: lmsysorg/sglang:latest
    env:
      - MODEL_ID=Qwen/Qwen3-30B-A3B

    commands:
      - python3 -m sglang.launch_server
          --model-path $MODEL_ID
          --port 8000
          --trust-remote-code

    port: 8000
    model: Qwen/Qwen3-30B-A3B

    resources:
      gpu: 80GB:1

.. note:: For other inference backends such as vLLM or TGI, visit the `dstack Inference Examples <https://dstack.ai/examples/#inference>`__ documentation.
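For instance, a vLLM-based variant of the same service might look like the following sketch (the image tag and serve flags are illustrative; check the vLLM documentation for the exact options):

.. code:: yaml

    type: service
    name: qwen3-30b-a3b

    # Illustrative vLLM image; pin a specific tag for reproducibility
    image: vllm/vllm-openai:latest
    env:
      - MODEL_ID=Qwen/Qwen3-30B-A3B

    commands:
      - vllm serve $MODEL_ID --port 8000

    port: 8000
    model: Qwen/Qwen3-30B-A3B

    resources:
      gpu: 80GB:1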

Go ahead and apply the service configuration:

.. code:: bash

    dstack apply -f serve-30b.dstack.yml
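After submitting the run, you can check its status and stream logs from the CLI (the run name below matches the ``name`` field in the configuration above):

.. code:: bash

    # List runs and their statuses
    dstack ps
    # Stream logs for the service
    dstack logs qwen3-30b-a3b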

Access the Service
------------------

Once the service is successfully deployed, you can access its endpoint in the following ways:

.. tab-set::

   .. tab-item:: curl

      Access the service endpoint at ``<dstack server URL>/proxy/services/<project name>/<run name>/``:

      .. code:: bash

          curl http://localhost:3000/proxy/services/main/qwen3-30b-a3b/v1/chat/completions \
              -H 'Content-Type: application/json' \
              -H 'Authorization: Bearer <dstack token>' \
              -d '{
                  "model": "Qwen/Qwen3-30B-A3B",
                  "messages": [
                      {
                          "role": "user",
                          "content": "Compose a poem that explains the concept of recursion in programming."
                      }
                  ]
              }'

      .. note::
          When starting the dstack server, an admin token is automatically generated:

          .. code:: text

              The admin token is "bbae0f28-d3dd-4820-bf61-8f4bb40815da"
              The server is running at http://127.0.0.1:3000/

   .. tab-item:: Chat UI

      Access dstack's Chat UI at ``<dstack server URL>/projects/<project name>/models/<run name>/``:

      .. image:: https://dstack.ai/static-assets/static-assets/images//dstack-qwen-ui.png

.. dropdown:: Gateway
   :icon: info
   :animate: fade-in

   Running services for development purposes doesn't require setting up a gateway.

   However, you'll need a gateway in the following cases:

   * To use auto-scaling or rate limits
   * To enable HTTPS for the endpoint and map it to your domain
   * If your service requires WebSockets
   * If your service cannot work with a path prefix

   For detailed information about gateway configuration and usage, refer to the `dstack documentation on gateways <https://dstack.ai/docs/concepts/gateways/>`__.
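   For reference, a minimal sketch of a gateway configuration (the backend, region, and domain below are illustrative placeholders; see the gateways documentation for the exact fields):

   .. code:: yaml

       # Sketch of a gateway configuration -- values are placeholders
       type: gateway
       name: qwen-gateway
       backend: aws
       region: eu-west-1
       domain: example.com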

Replicas and Auto Scaling
-------------------------

You can auto-scale the service by specifying additional options in ``serve-30b.dstack.yml``:

* Set ``replicas: min..max`` to define the minimum and maximum number of replicas
* Configure ``scaling`` rules to determine when to scale up or down

Below is a complete configuration example with auto-scaling enabled:

.. code:: yaml

    type: service
    name: qwen3-30b-a3b

    image: lmsysorg/sglang:latest
    env:
      - MODEL_ID=Qwen/Qwen3-30B-A3B

    commands:
      - python3 -m sglang.launch_server
          --model-path $MODEL_ID
          --port 8000
          --trust-remote-code

    port: 8000
    model: Qwen/Qwen3-30B-A3B

    resources:
      gpu: 80GB:1

    # Minimum and maximum number of replicas
    replicas: 1..4
    scaling:
      # Requests per second
      metric: rps
      # Target metric value
      target: 10

.. note:: The ``scaling`` property requires a gateway to be set up.

See also
--------

* **Fleets**: Create cloud and on-prem clusters using `Fleets <https://dstack.ai/docs/concepts/fleets/>`__.
* **Dev Environments**: Experiment and test before deploying to production using `Dev Environments <https://dstack.ai/docs/concepts/dev-environments/>`__.
* **Tasks**: Schedule single-node or distributed training using `Tasks <https://dstack.ai/docs/concepts/tasks/>`__.
* **Services**: Deploy models as secure, auto-scaling OpenAI-compatible endpoints using `Services <https://dstack.ai/docs/concepts/services/>`__.
* **Metrics**: Monitor performance with automatically tracked metrics via CLI or UI using `Metrics <https://dstack.ai/docs/guides/metrics/>`__.