# Build a Streaming API for a Large Language Model

```{include}
:start-after: <!-- start llm-streaming-intro -->
:end-before: <!-- end llm-streaming-intro -->
```

## Service Schemas

```{include}
:start-after: <!-- start llm-streaming-schemas -->
:end-before: <!-- end llm-streaming-schemas -->
```
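The schemas themselves are defined in the included snippet above. As a rough mental model, a streaming LLM service needs one document type for the request (the prompt plus generation limits) and one for each streamed response chunk. The sketch below uses plain dataclasses as stand-ins; the field and class names are illustrative assumptions, not the tutorial's actual definitions, which use DocArray document types.

```python
from dataclasses import dataclass


# Hypothetical request schema: what the client sends once.
@dataclass
class PromptDocument:
    prompt: str      # text to complete
    max_tokens: int  # upper bound on generated tokens


# Hypothetical response schema: what the service yields per token.
@dataclass
class ModelOutputDocument:
    token_id: int        # id of the newly generated token
    generated_text: str  # text generated so far


request = PromptDocument(prompt="what is the capital of France?", max_tokens=10)
chunk = ModelOutputDocument(token_id=42, generated_text="Paris")
```

In the real service these would be DocArray documents so they can be serialized and streamed over the wire, but the request/response split is the same.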
```{admonition} Note
:class: note

Thanks to DocArray's flexibility, you can build highly adaptable services. For instance, you can use Tensor types to stream token logits back to the client efficiently, and implement complex token-sampling strategies on the client side.
```
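To make the note above concrete, here is one way a client could turn a streamed vector of logits into a sampled token. This is a hedged sketch, not part of the tutorial's code: it assumes the service has already delivered the logits as a plain list, and shows temperature-scaled softmax sampling done entirely client-side.

```python
import math
import random


def sample_from_logits(logits, temperature=1.0, seed=None):
    """Sample a token index from raw logits via temperature-scaled softmax.

    This is the kind of client-side strategy the note refers to: the server
    streams logits, and the client decides how to pick the next token.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# A very low temperature makes the distribution sharply peaked, so sampling
# behaves almost exactly like greedy argmax decoding:
print(sample_from_logits([0.1, 5.0, 0.3], temperature=0.01, seed=0))  # → 1
```

Because the sampling happens on the client, you can change temperature, top-k, or any other strategy without redeploying the service.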

## Service initialization

```{include}
:start-after: <!-- start llm-streaming-init -->
:end-before: <!-- end llm-streaming-init -->
```

## Implement the streaming endpoint

```{include}
:start-after: <!-- start llm-streaming-endpoint -->
:end-before: <!-- end llm-streaming-endpoint -->
```
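Whatever serving framework wraps it, a streaming endpoint boils down to an async generator that runs one generation step per iteration and yields a response document for each new token, so the server can flush tokens to the client as they are produced. The sketch below shows that core loop in isolation; the fake token list stands in for a real model's per-step output and is purely illustrative.

```python
import asyncio

# Toy stand-in for an LLM's one-token-at-a-time generation.
FAKE_TOKENS = ["Paris", " is", " the", " capital"]


async def stream_tokens(prompt: str, max_tokens: int):
    """Async generator: yield the text generated so far, one token at a time.

    A real streaming endpoint has this shape — each iteration runs a model
    step, then yields a response document immediately rather than waiting
    for the full completion.
    """
    generated = ""
    for token in FAKE_TOKENS[:max_tokens]:
        await asyncio.sleep(0)  # cede control, as a real model call would
        generated += token
        yield generated


async def main():
    async for partial in stream_tokens("what is the capital of France?", max_tokens=3):
        print(partial)


asyncio.run(main())
```

This prints `Paris`, then `Paris is`, then `Paris is the` — three partial responses instead of one final answer, which is exactly the behavior the streaming endpoint in the tutorial provides.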

## Serve and send requests

```{include}
:start-after: <!-- start llm-streaming-serve -->
:end-before: <!-- end llm-streaming-serve -->
```