Qualcomm QNN Export for Ultralytics YOLO Models

Deploying computer vision models on Qualcomm Snapdragon devices requires a model format tuned for the Qualcomm AI Engine Direct (QNN) runtime. Exporting Ultralytics YOLO models to the QNN format lets you run accelerated, on-device inference across Snapdragon CPU, Adreno GPU, and Hexagon NPU hardware found in billions of mobile phones, laptops, automotive systems, and IoT devices. This guide walks through how to export YOLO to Qualcomm QNN and deploy it for fast, low-power inference on Snapdragon hardware.

What is Qualcomm QNN?

Qualcomm AI Engine Direct — commonly referred to as QNN and distributed as part of the Qualcomm AI Runtime (QAIRT) SDK — is Qualcomm's low-level inference stack for Snapdragon processors. It provides a unified API with backend-specific libraries that target the Snapdragon CPU, the Adreno GPU, and the Hexagon Tensor Processor (HTP), the dedicated neural network processing unit (NPU) inside modern Snapdragon SoCs. QNN gives developers full-stack access to these Snapdragon AI accelerators and is the modern successor to the older Snapdragon Neural Processing Engine (SNPE) SDK. It powers on-device AI across the Snapdragon 8 Gen 2, 8 Gen 3, and 8 Elite mobile platforms, Snapdragon X laptops, and automotive and XR products.

Why Export to Qualcomm QNN?

Snapdragon is one of the most widely deployed mobile compute platforms in the world. Exporting Ultralytics YOLO to the Qualcomm QNN format unlocks the dedicated AI hardware on these devices:

  • Hexagon NPU acceleration: Running YOLO on the Hexagon Tensor Processor typically delivers higher throughput at lower power than CPU inference, which suits real-time inference and always-on computer vision on Snapdragon.
  • On-device and offline: QNN inference runs entirely on the Snapdragon device, so there are no cloud round-trips, latency stays low, and data never leaves the device.
  • Quantized efficiency: QNN export quantizes YOLO to INT8 weights with 16-bit activations, the Hexagon NPU's preferred accuracy/performance balance, shrinking model size and maximizing frames per second on battery-powered hardware.
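  • One workflow, many devices: The same QNN export workflow covers Snapdragon CPU, Adreno GPU, and Hexagon NPU targets across the Snapdragon 8 Gen 2, 8 Gen 3, and 8 Elite families and beyond. Each exported context binary is finalized for one Hexagon HTP architecture, selected with the name argument.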
  • Production-ready Qualcomm AI stack: QNN (Qualcomm AI Engine Direct / QAIRT) is Qualcomm's current, actively maintained on-device AI runtime and the recommended replacement for SNPE.

QNN Export Format

Ultralytics compiles YOLO models to QNN locally using the ONNX Runtime QNN Execution Provider (the pip-installable onnxruntime-qnn package, which bundles the QAIRT libraries). The exporter converts your model to ONNX, quantizes it with calibration data to 16-bit activations and INT8 weights (the recommended balance for the Hexagon NPU), then initializes an ONNX Runtime session with context-binary caching enabled — this compiles the quantized graph into a QNN context binary embedded in <model>_qnn.onnx. No Qualcomm account, cloud upload, or separate SDK download is required.
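The context-binary caching step described above can be sketched with ONNX Runtime's EP-context session configuration keys (`ep.context_enable`, `ep.context_embed_mode`, `ep.context_file_path`). This is a minimal illustration, not the exporter's actual code: whether Ultralytics sets exactly these entries is an assumption, and `context_cache_entries`, `apply_entries`, and the intermediate file name `yolo26n.onnx` are hypothetical.

```python
from pathlib import Path


def context_cache_entries(onnx_path: str, embed: bool = True) -> dict[str, str]:
    """Build session config entries that ask ONNX Runtime to dump an EP context model."""
    src = Path(onnx_path)
    return {
        # Generate a context-cache model when the session is created
        "ep.context_enable": "1",
        # "1" embeds the compiled binary inside the .onnx; "0" writes a separate file
        "ep.context_embed_mode": "1" if embed else "0",
        # Output path following the <model>_qnn.onnx naming used in this guide
        "ep.context_file_path": str(src.with_name(f"{src.stem}_qnn.onnx")),
    }


def apply_entries(session_options, entries: dict[str, str]) -> None:
    """Copy the entries onto an onnxruntime.SessionOptions instance."""
    for key, value in entries.items():
        session_options.add_session_config_entry(key, value)


# Hypothetical quantized ONNX produced earlier in the pipeline
entries = context_cache_entries("yolo26n.onnx")
print(entries)
```

Creating an `InferenceSession` with options prepared this way, on a host where the QNN Execution Provider is available, is what triggers compilation. The helper only prepares the configuration, so it can be inspected without QNN installed.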

Unlike the cloud-based Qualcomm AI Hub, which compiles and profiles models on Qualcomm-hosted Snapdragon devices and requires a Qualcomm account, the Ultralytics QNN export runs entirely on your own machine with a single export(format="qnn") call. You get the same QNN/QAIRT runtime target — Snapdragon CPU, Adreno GPU, and Hexagon NPU — without sign-up, upload limits, or queue times, and it drops straight into the standard YOLO export workflow.

The exported *_qnn.onnx file is self-contained: it embeds the QNN context binary and ONNX metadata such as class names, image size, and task.

Key Features of QNN Models

  • Quantization: The model is quantized to 16-bit activations and INT8 weights with the ONNX Runtime QNN QDQ flow and a calibration dataset, the Hexagon NPU's recommended accuracy/performance balance. Learn more about model quantization.
  • Fully Local Compilation: The context binary is generated entirely on your host machine — no Qualcomm account, API token, or cloud upload.
  • Full Snapdragon Acceleration: Run inference on the Hexagon NPU (HTP), Adreno GPU, or CPU through a single unified runtime.
  • Broad Device Reach: Target the wide range of Snapdragon platforms shipping in phones, PCs (Windows on Snapdragon), automotive, XR, and embedded products.
  • Precompiled Context Binary: Shipping a context binary minimizes on-device graph compilation, reducing model load latency on the target.
  • Self-Contained Output: The exported ONNX file includes the precompiled QNN context binary and metadata for straightforward deployment.
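To make the quantization trade-off in the list above concrete, the sketch below applies a generic asymmetric (affine) quantizer at 8 and 16 bits and compares the round-trip error. It is an illustration of the arithmetic only. It is not the ONNX Runtime QDQ implementation, and the sample values are made up.

```python
def quantize_roundtrip(values, num_bits):
    """Affine-quantize floats to unsigned integers, then dequantize them again."""
    qmin, qmax = 0, 2**num_bits - 1
    # Include 0.0 in the range so that zero is exactly representable
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    quantized = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    dequantized = [(q - zero_point) * scale for q in quantized]
    return dequantized, scale


def max_error(values, num_bits):
    """Largest absolute difference between the inputs and their round-trip values."""
    dequantized, _ = quantize_roundtrip(values, num_bits)
    return max(abs(a - b) for a, b in zip(values, dequantized))


activations = [-1.0, -0.2, 0.0, 0.37, 2.5]  # hypothetical activation samples
err8 = max_error(activations, 8)
err16 = max_error(activations, 16)
print(f"8-bit max error:  {err8:.6f}")
print(f"16-bit max error: {err16:.6f}")
```

The 16-bit grid is 256 times finer over the same range. This is why activations, whose ranges vary with the input image, are kept at 16 bits, while weights, whose ranges are fixed after training, are stored as INT8.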

Supported Tasks

QNN export supports the standard task set available in each model family, including YOLO26 semantic segmentation.

| Task                  | Supported |
| --------------------- | --------- |
| Object Detection      | ✅        |
| Instance Segmentation | ✅        |
| Semantic Segmentation | ✅        |
| Pose Estimation       | ✅        |
| OBB Detection         | ✅        |
| Classification        | ✅        |

Export to QNN: Converting Your YOLO Model

Export an Ultralytics YOLO model to QNN format for deployment on Snapdragon hardware. The context binary is finalized for a target Hexagon Tensor Processor (HTP) architecture, which you select with the name argument — the same argument used to target a chip in RKNN export.

Supported HTP Architectures

Pass the target architecture via name (e.g. name="73"). Valid values:

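| `name` | Hexagon HTP | Snapdragon platform                    |
| ------ | ----------- | -------------------------------------- |
| `68`   | v68         | Snapdragon 888                         |
| `69`   | v69         | Snapdragon 8 Gen 1 / 8+ Gen 1          |
| `73`   | v73         | Snapdragon 8 Gen 2, X Elite (default)  |
| `75`   | v75         | Snapdragon 8 Gen 3                     |
| `79`   | v79         | Snapdragon 8 Elite                     |
| `81`   | v81         | Snapdragon 8 Elite Gen 5               |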
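If your deployment code knows the target device by its marketing name, a small lookup built from the table above can pick the `name` value. This is a sketch: `HTP_ARCH_BY_PLATFORM` and `htp_name_for` are hypothetical helpers, not part of the Ultralytics API.

```python
# Mapping transcribed from the Supported HTP Architectures table
HTP_ARCH_BY_PLATFORM = {
    "Snapdragon 888": "68",
    "Snapdragon 8 Gen 1": "69",
    "Snapdragon 8+ Gen 1": "69",
    "Snapdragon 8 Gen 2": "73",
    "Snapdragon X Elite": "73",
    "Snapdragon 8 Gen 3": "75",
    "Snapdragon 8 Elite": "79",
    "Snapdragon 8 Elite Gen 5": "81",
}


def htp_name_for(platform_name: str) -> str:
    """Return the export `name` value for a Snapdragon platform listed in the table."""
    try:
        return HTP_ARCH_BY_PLATFORM[platform_name]
    except KeyError:
        known = ", ".join(sorted(HTP_ARCH_BY_PLATFORM))
        raise ValueError(f"Unknown platform {platform_name!r}; known: {known}") from None


print(htp_name_for("Snapdragon 8 Gen 3"))
```

The returned string can be passed directly as `name=` to `model.export(format="qnn", ...)`.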

!!! note "Platform support"

QNN export uses the `onnxruntime-qnn` package. Prebuilt wheels are published for **Windows (x64 and ARM64)** and **Linux ARM64 (aarch64)**; on **Linux x86-64** build ONNX Runtime from source with `--use_qnn` (no prebuilt wheel is published, and macOS is not a supported QNN host). QNN context-binary generation runs on an x64 host — Windows x64 or Linux x86-64 — and does not require a Snapdragon device for the export step.

Installation

To install the required packages, run:

!!! tip "Installation"

=== "CLI"

    ```bash
    # Install the required package for YOLO
    pip install ultralytics
    ```

The onnxruntime-qnn package (which provides the ONNX Runtime QNN Execution Provider and bundles the QAIRT libraries) is installed automatically on first export. For detailed installation instructions and best practices, check our Ultralytics Installation guide. If you run into difficulties while installing the required packages, consult our Common Issues guide for solutions and tips.

Usage

The QNN format supports the Export, Predict, and Validate modes. Inference and validation run on Qualcomm Snapdragon hardware through ONNX Runtime's QNN Execution Provider (the same onnxruntime-qnn package used for export). Export your model, then load the exported model on a Snapdragon device to run inference or validate its accuracy.

!!! example "Export"

=== "Python"

    ```python
    from ultralytics import YOLO

    # Load a YOLO26 model
    model = YOLO("yolo26n.pt")

    # Export to Qualcomm QNN format (int8=True is enforced automatically), targeting an HTP architecture via 'name'
    # 'name' can be one of 68, 69, 73, 75, 79, 81 (Snapdragon 888, 8 Gen 1, 8 Gen 2, 8 Gen 3, 8 Elite, 8 Elite Gen 5)
    model.export(format="qnn", name="73")  # creates 'yolo26n_qnn.onnx'
    ```

=== "CLI"

    ```bash
    # Export a YOLO26n PyTorch model to Qualcomm QNN format for the target HTP architecture
    # 'name' can be one of 68, 69, 73, 75, 79, 81 (Snapdragon 888, 8 Gen 1, 8 Gen 2, 8 Gen 3, 8 Elite, 8 Elite Gen 5)
    yolo export model=yolo26n.pt format=qnn name=73 # creates 'yolo26n_qnn.onnx'
    ```

!!! example "Predict"

=== "Python"

    ```python
    from ultralytics import YOLO

    # Load the exported QNN model (on a Snapdragon device with onnxruntime-qnn)
    model = YOLO("yolo26n_qnn.onnx")

    # Run inference
    results = model("https://ultralytics.com/images/bus.jpg")
    ```

=== "CLI"

    ```bash
    # Run inference with the exported QNN model
    yolo predict model=yolo26n_qnn.onnx source='https://ultralytics.com/images/bus.jpg'
    ```

!!! example "Validate"

=== "Python"

    ```python
    from ultralytics import YOLO

    # Load the exported QNN model (on a Snapdragon device with onnxruntime-qnn)
    model = YOLO("yolo26n_qnn.onnx")

    # Validate accuracy on the COCO8 dataset
    metrics = model.val(data="coco8.yaml")
    ```

=== "CLI"

    ```bash
    # Validate the exported QNN model
    yolo val model=yolo26n_qnn.onnx data=coco8.yaml
    ```

Export Arguments

| Argument   | Type             | Default        | Description                                                                                                                                                                                        |
| ---------- | ---------------- | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `format`   | `str`            | `'qnn'`        | Target format for the exported model, defining compatibility with the Qualcomm QNN runtime.                                                                                                        |
| `imgsz`    | `int` or `tuple` | `640`          | Desired image size for the model input. Can be an integer for square images or a tuple `(height, width)`.                                                                                          |
| `batch`    | `int`            | `1`            | Specifies the export model batch size, which is baked into the generated QNN context binary.                                                                                                       |
| `name`     | `str`            | `'73'`         | Target Hexagon HTP architecture version: `68`, `69`, `73`, `75`, `79`, or `81` (Snapdragon 888, 8 Gen 1, 8 Gen 2, 8 Gen 3, 8 Elite, 8 Elite Gen 5). The context binary is finalized for this architecture. |
| `int8`     | `bool`           | `True`         | Enables INT8 quantization. Required for QNN HTP export and automatically set to `True` if not specified.                                                                                           |
| `data`     | `str`            | `'coco8.yaml'` | Dataset configuration file used for INT8 calibration. Specifies the calibration image source.                                                                                                      |
| `fraction` | `float`          | `1.0`          | Fraction of the calibration dataset to use for INT8 quantization.                                                                                                                                  |
| `device`   | `str`            | `None`         | Specifies the device for the ONNX export step: GPU (`device=0`) or CPU (`device=cpu`).                                                                                                             |

!!! note "Precision"

QNN export quantizes the model to **16-bit activations and INT8 weights** — the recommended accuracy/performance balance for the Hexagon NPU — using the [ONNX Runtime QDQ quantization](https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html) flow with calibration images from `data`. `int8=True` is enforced automatically.

For more details about the export process, visit the Ultralytics documentation page on exporting.
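Because the batch size and HTP architecture are baked into the context binary, it can help to validate export arguments before starting a long calibration run. The helper below is a sketch of such a check based on the arguments table above. `build_qnn_export_args` is a hypothetical function, and the exporter may perform its own, different validation.

```python
VALID_HTP_ARCHS = ("68", "69", "73", "75", "79", "81")


def build_qnn_export_args(name="73", imgsz=640, batch=1, data="coco8.yaml", fraction=1.0):
    """Validate QNN export arguments and return them as keyword arguments."""
    name = str(name)  # accept 73 as well as "73"
    if name not in VALID_HTP_ARCHS:
        raise ValueError(f"name must be one of {VALID_HTP_ARCHS}, got {name!r}")
    if not 0.0 < fraction <= 1.0:
        raise ValueError(f"fraction must be in (0, 1], got {fraction}")
    if batch < 1:
        raise ValueError(f"batch must be >= 1, got {batch}")
    return {
        "format": "qnn",
        "name": name,
        "imgsz": imgsz,
        "batch": batch,
        "int8": True,  # required for QNN HTP export
        "data": data,
        "fraction": fraction,
    }


args = build_qnn_export_args(name=75, fraction=0.5)
print(args)
# With ultralytics installed: YOLO("yolo26n.pt").export(**args)
```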

Output Structure

After a successful export, a self-contained ONNX file is created:

yolo26n_qnn.onnx   # ONNX wrapping the precompiled QNN context binary and metadata

The yolo26n_qnn.onnx file embeds the QNN context binary and is loaded by ONNX Runtime with the QNN Execution Provider on the Snapdragon device. It also carries model metadata such as class names, image size, and task in ONNX metadata_props.
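ONNX `metadata_props` stores every value as a string, so a custom pipeline has to decode entries such as the class names back into Python objects. The sketch below assumes the values are written as Python literals; that encoding and the sample dictionary are assumptions for illustration. On a device, the raw mapping can be read with `session.get_modelmeta().custom_metadata_map` from ONNX Runtime.

```python
import ast


def decode_metadata(raw: dict[str, str]) -> dict[str, object]:
    """Turn string-valued ONNX metadata into Python objects where possible."""
    decoded = {}
    for key, value in raw.items():
        try:
            decoded[key] = ast.literal_eval(value)  # lists, dicts, numbers
        except (ValueError, SyntaxError):
            decoded[key] = value  # plain strings such as the task name
    return decoded


# Hypothetical metadata, shaped like the fields this guide mentions
sample = {
    "task": "detect",
    "imgsz": "[640, 640]",
    "names": "{0: 'person', 1: 'bicycle'}",
}
meta = decode_metadata(sample)
print(meta["task"], meta["imgsz"], meta["names"][0])
```

`ast.literal_eval` only accepts literals, so it is a safer choice than `eval` for metadata that comes from a model file.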

Deploying Exported YOLO QNN Models

QNN models run on Qualcomm Snapdragon hardware, making on-device model deployment straightforward. On a Snapdragon device with onnxruntime-qnn installed, run the exported model directly with the Ultralytics API (yolo predict/yolo val, see Usage above) — Ultralytics loads the context binary through the ONNX Runtime QNN Execution Provider and selects the HTP (NPU), GPU, or CPU backend.

For custom pipelines, you can also load the context-binary ONNX directly with ONNX Runtime. onnxruntime-qnn is a plugin Execution Provider, so register it at runtime:

```python
import numpy as np
import onnxruntime as ort
import onnxruntime_qnn as qnn_ep

# On the Snapdragon device, register the QNN plugin EP and select its device(s)
ort.register_execution_provider_library("QNNExecutionProvider", qnn_ep.get_library_path())
devices = [d for d in ort.get_ep_devices() if d.ep_name == "QNNExecutionProvider"]

# Route the session to the HTP (NPU) backend
options = ort.SessionOptions()
options.add_provider_for_devices(devices, {"backend_path": qnn_ep.get_qnn_htp_path()})
session = ort.InferenceSession("yolo26n_qnn.onnx", sess_options=options)

# Placeholder input (batch=1, imgsz=640); replace with a preprocessed image, float32 NCHW
input_tensor = np.zeros((1, 3, 640, 640), dtype=np.float32)
outputs = session.run(None, {"images": input_tensor})
```

Because the QNN context binary is precompiled, the session loads quickly without recompiling the graph on-device.
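When you feed the session yourself, the image has to be resized to the fixed input size baked into the context binary. One common approach is letterboxing, which preserves aspect ratio and pads the remainder. The sketch below computes only the geometry and is a generic illustration, not Ultralytics' preprocessing code. `letterbox_geometry` and `unletterbox_point` are hypothetical helpers.

```python
def letterbox_geometry(width: int, height: int, size: int = 640):
    """Return (scale, resized (w, h), padding (left, top)) to fit an image into size x size."""
    scale = min(size / width, size / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_left = (size - new_w) // 2
    pad_top = (size - new_h) // 2
    return scale, (new_w, new_h), (pad_left, pad_top)


def unletterbox_point(x: float, y: float, scale: float, pad: tuple[int, int]):
    """Map a point from model-input coordinates back to the original image."""
    return (x - pad[0]) / scale, (y - pad[1]) / scale


# Example: an 810x1080 portrait image into a 640x640 input
scale, resized, pad = letterbox_geometry(810, 1080)
print(resized, pad)
```

The same `scale` and `pad` values are needed again after inference to map predicted box coordinates back onto the original image.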

A typical end-to-end workflow:

  1. Train your model using Ultralytics Train Mode
  2. Export to QNN format using model.export(format="qnn") on a supported export host (see the Platform support note above)
  3. Deploy the exported *_qnn.onnx file to your Snapdragon device
  4. Run inference with ONNX Runtime and the QNN Execution Provider, selecting the HTP, GPU, or CPU backend

Real-World Applications

YOLO models running on Qualcomm Snapdragon hardware are well suited for a wide range of edge AI applications:

  • Smartphones: Real-time object detection and scene understanding in camera and photo apps with NPU acceleration.
  • Windows on Snapdragon: On-device computer vision in Copilot+ PCs without offloading to the cloud.
  • Automotive: Driver monitoring, occupant detection, and ADAS features on Snapdragon Digital Chassis platforms.
  • XR and Wearables: Low-power, low-latency perception for AR/VR headsets and smart glasses.
  • IoT and Robotics: Efficient vision inference on Snapdragon-powered cameras, drones, and embedded systems.

Summary

In this guide, you've learned how to export Ultralytics YOLO models to the Qualcomm QNN format locally with the ONNX Runtime QNN Execution Provider. The export pipeline converts your model to ONNX, then compiles it into a QNN context binary on your host machine — no Qualcomm account or cloud required — producing a *_qnn.onnx file optimized for Snapdragon CPU, Adreno GPU, and Hexagon NPU hardware via the QNN/QAIRT runtime.

The combination of Ultralytics YOLO and Qualcomm's on-device AI stack provides an effective solution for running advanced computer vision workloads across the broad Snapdragon ecosystem.

For other on-device and mobile deployment targets, see the related ONNX, CoreML, NCNN, TFLite, ExecuTorch, RKNN, Sony IMX500, and TensorRT export guides. To compare formats before shipping, use Benchmark mode. For the full list of formats and options, visit the Export mode documentation and the integrations guide page.

FAQ

How do I export my Ultralytics YOLO model to QNN format?

You can export your model using the export() method in Python or via the CLI with format="qnn". The export first creates an ONNX model, then compiles it locally into a QNN context binary using the ONNX Runtime QNN Execution Provider. The onnxruntime-qnn package is installed automatically on first export.

!!! example

=== "Python"

    ```python
    from ultralytics import YOLO

    model = YOLO("yolo26n.pt")
    model.export(format="qnn")
    ```

=== "CLI"

    ```bash
    yolo export model=yolo26n.pt format=qnn
    ```

Do I need a Qualcomm account or cloud access?

No. QNN export runs entirely on your local machine using the onnxruntime-qnn package, which bundles the QAIRT libraries. No Qualcomm account, API token, or network access is required.

How does Ultralytics QNN export compare to Qualcomm AI Hub?

Qualcomm AI Hub is Qualcomm's cloud service for compiling, profiling, and benchmarking models on hosted Snapdragon devices, and it requires a Qualcomm account. Ultralytics QNN export targets the same QNN/QAIRT runtime (Snapdragon CPU, Adreno GPU, and Hexagon NPU) but compiles the context binary locally with the ONNX Runtime QNN Execution Provider, with no account, no upload, and no queue. It takes you from a .pt model to a Snapdragon-ready build directly inside the standard YOLO export workflow.

Which platforms can I export on?

onnxruntime-qnn provides prebuilt wheels for Windows (x64 and ARM64) and Linux ARM64 (aarch64); on Linux x86-64 build ONNX Runtime from source with --use_qnn (no prebuilt wheel is published, and macOS is not a supported QNN host). Context-binary generation runs on an x64 host — Windows x64 or Linux x86-64 — and does not require a physical Snapdragon device.

How do I run YOLO on a Qualcomm Snapdragon NPU?

Export with model.export(format="qnn"), copy the resulting yolo26n_qnn.onnx file to your Snapdragon device, and run yolo predict model=yolo26n_qnn.onnx source=image.jpg (or yolo val). Ultralytics loads the context binary through the ONNX Runtime QNN Execution Provider and runs it on the Hexagon NPU — see Deploying Exported YOLO QNN Models.

What is the difference between QNN and SNPE?

QNN (Qualcomm AI Engine Direct, part of the QAIRT SDK) is Qualcomm's current inference stack and the recommended replacement for the older Snapdragon Neural Processing Engine (SNPE) SDK. New deployments should target QNN.

Can I run a QNN model with yolo predict and yolo val?

Yes, on a Qualcomm Snapdragon device with onnxruntime-qnn installed — YOLO("yolo26n_qnn.onnx") loads the context binary through the QNN Execution Provider and runs predict/val like any other format. On an x86 host without QNN hardware the model cannot execute, since the context binary targets the Snapdragon NPU.

What is the output of a QNN export?

The export creates a self-contained context-binary ONNX file (e.g., yolo26n_qnn.onnx) with class names, image size, task, and other model metadata embedded in ONNX metadata_props.