ExecuTorch

ExecuTorch is a lightweight runtime for model inference on edge devices. It exports a PyTorch model into a portable, ahead-of-time format. A small C++ runtime plans memory and dispatches operations to hardware-specific backends. Execution order and memory behavior are known before the model runs on device, so inference overhead is low.

Export a Transformers model with the optimum-executorch library.

<hfoptions id="export"> <hfoption id="CLI">

bash

optimum-cli export executorch \
    --model "HuggingFaceTB/SmolLM2-135M-Instruct" \
    --task "text-generation" \
    --recipe "xnnpack" \
    --output_dir="./smollm2_exported"

</hfoption> <hfoption id="Python">

from transformers import AutoTokenizer
from optimum.executorch import ExecuTorchModelForCausalLM

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"

# Export the model with the XNNPACK recipe
model = ExecuTorchModelForCausalLM.from_pretrained(
    model_id,
    recipe="xnnpack",
)
# Write the exported program to disk
model.save_pretrained("./smollm2_exported")

# Load the matching tokenizer for use at inference time
tokenizer = AutoTokenizer.from_pretrained(model_id)

</hfoption> </hfoptions>
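The CLI export can also be driven from a script, which makes it easy to confirm that a `.pte` file was actually written. The sketch below is a minimal illustration that reuses only the flags shown in the command above; the helper names `build_export_command`, `find_pte_files`, and `export_model` are hypothetical and not part of optimum-executorch.

```python
import subprocess
from pathlib import Path


def build_export_command(model_id: str, recipe: str, output_dir: str) -> list[str]:
    """Return the optimum-cli export invocation as an argument list."""
    return [
        "optimum-cli", "export", "executorch",
        "--model", model_id,
        "--task", "text-generation",
        "--recipe", recipe,
        "--output_dir", output_dir,
    ]


def find_pte_files(output_dir: str) -> list[Path]:
    """Return every .pte file under output_dir, sorted by path."""
    return sorted(Path(output_dir).rglob("*.pte"))


def export_model(model_id: str, recipe: str, output_dir: str) -> list[Path]:
    """Run the export and raise if no .pte file was produced."""
    # check=True raises CalledProcessError when optimum-cli exits non-zero.
    subprocess.run(build_export_command(model_id, recipe, output_dir), check=True)
    pte_files = find_pte_files(output_dir)
    if not pte_files:
        raise FileNotFoundError(f"no .pte file found in {output_dir}")
    return pte_files
```

Calling `export_model("HuggingFaceTB/SmolLM2-135M-Instruct", "xnnpack", "./smollm2_exported")` would run the same export as the CLI tab and return the paths of the exported programs.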

Transformers integration

The export and inference workflow combines several Transformers components with Optimum and the ExecuTorch runtime.

1. [~PreTrainedModel.from_pretrained] loads the model weights in safetensors format.
2. Optimum applies graph optimizations and runs torch.export to create a model.pte file targeting your hardware backend.
3. [AutoTokenizer] or [AutoProcessor] loads the tokenizer or processor files and runs during inference.
4. At runtime, a C++ runner class executes the .pte file on the ExecuTorch runtime.

c++

#include <iostream>
#include <optional>
#include <string>
#include <utility>

#include <executorch/extension/llm/runner/text_llm_runner.h>

using namespace executorch::extension::llm;

int main() {
  // Load the tokenizer and create the runner
  auto tokenizer = load_tokenizer("path/to/tokenizer.json", nullptr, std::nullopt, 0, 0);
  auto runner = create_text_llm_runner("path/to/model.pte", std::move(tokenizer));
  if (!runner) {
    std::cerr << "Failed to create the runner" << std::endl;
    return 1;
  }

  // Load the model
  runner->load();

  // Configure generation
  GenerationConfig config;
  config.max_new_tokens = 100;
  config.temperature = 0.8f;

  // Generate text and stream each token to stdout as it is produced
  runner->generate("The capital of France is", config,
    [](const std::string& token) { std::cout << token << std::flush; },
    nullptr);

  return 0;
}
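
To sanity-check an export from Python before building the C++ runner, the exported model object can also generate text directly. This is a minimal sketch, assuming the `text_generation` helper and its `tokenizer`, `prompt`, and `max_seq_len` arguments behave as described in the optimum-executorch README. The function name `generate_with_executorch` is hypothetical, and the imports are deferred so the heavy dependencies load only when the function is called.

```python
MODEL_ID = "HuggingFaceTB/SmolLM2-135M-Instruct"


def generate_with_executorch(prompt: str, max_seq_len: int = 64) -> str:
    """Export the model with the XNNPACK recipe and generate a completion."""
    # Deferred imports: transformers and optimum-executorch must be installed.
    from transformers import AutoTokenizer
    from optimum.executorch import ExecuTorchModelForCausalLM

    # Load the checkpoint and export it for the XNNPACK backend.
    model = ExecuTorchModelForCausalLM.from_pretrained(MODEL_ID, recipe="xnnpack")

    # Load the tokenizer that pairs with the checkpoint.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # Assumption: text_generation runs the exported program on the ExecuTorch
    # runtime and returns the decoded text.
    return model.text_generation(
        tokenizer=tokenizer,
        prompt=prompt,
        max_seq_len=max_seq_len,
    )
```

Calling `generate_with_executorch("The capital of France is")` would download the checkpoint, export it, and return the generated string, so the first call can take a while.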