Back to Ultralytics

YOLO26 Deployment Options Compared

docs/en/guides/model-deployment-options.md

8.4.8017.6 KB
Original Source

Comparative Analysis of YOLO26 Deployment Options

YOLO26 supports more than 20 deployment options, each tuned for a different runtime, hardware target, or platform — from PyTorch and ONNX to TensorRT, OpenVINO, CoreML, and dedicated edge-NPU formats. Picking the right one balances inference speed, hardware constraints, and ease of integration. This guide compares every option so you can choose the best fit for your application, then move on to model deployment best practices to deploy it reliably.

<p align="center"> <iframe loading="lazy" width="720" height="405" src="https://www.youtube.com/embed/QkCsj2SvZc4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen> </iframe>

<strong>Watch:</strong> How to Choose the Best Ultralytics YOLO26 Deployment Format for Your Project | TensorRT | OpenVINO 🚀

</p>

Deployment is the stage in the computer vision project workflow where a trained model starts doing real work, so the format you export to has a direct impact on speed, cost, and portability.

How to Select the Right Deployment Option for Your YOLO26 Model

When it's time to deploy your YOLO26 model, selecting a suitable export format is very important. As outlined in the Ultralytics YOLO26 export documentation, the model.export() function converts your trained model into a variety of formats tailored to diverse environments and performance requirements.

The ideal format depends on your model's intended operational context and hardware.

!!! tip "Skip the manual export"

For managed deployment without manual export, [Ultralytics Platform](https://platform.ultralytics.com) provides ready-to-use [inference endpoints](../platform/deploy/endpoints.md) with auto-scaling across 43 global regions.

YOLO26's Deployment Options

Here is a short description of each format and when to reach for it. For the full export walkthrough, see the export documentation; for the side-by-side criteria, jump to the comparison table.

  • PyTorch (.pt): The native training and inference format, offering maximum flexibility and CUDA GPU acceleration — ideal for research and prototyping with no export step required.
  • TorchScript (torchscript): Serializes the model for a Python-free C++ runtime, suited to production systems where Python is unavailable.
  • ONNX (onnx): A framework-agnostic interchange format with broad cross-platform and hardware support through ONNX Runtime.
  • OpenVINO (openvino): Intel's toolkit for optimized inference on Intel CPUs, integrated GPUs, and NPUs, common in IoT and edge computing.
  • TensorRT (engine): NVIDIA's high-performance runtime delivering top-tier GPU inference with FP16 and INT8 optimization.
  • CoreML (coreml): Apple's on-device format for iOS, macOS, watchOS, and tvOS, using the Apple Neural Engine.
  • TF SavedModel (saved_model): TensorFlow's standard format for scalable server-side serving with TensorFlow Serving.
  • TF GraphDef (pb): A frozen static-graph TensorFlow format for environments that need a fixed computation graph.
  • TF Lite (tflite): A lightweight TensorFlow runtime for on-device inference on mobile and embedded hardware.
  • TF Edge TPU (edgetpu): Compiles TF Lite models for Google Coral Edge TPU accelerators.
  • TF.js (tfjs): Runs models directly in the browser with no backend, accelerated through WebGL.
  • PaddlePaddle (paddle): Baidu's deep learning framework, popular in China, with broad hardware support.
  • MNN (mnn): A lightweight, high-performance inference engine optimized for mobile and embedded ARM and x86-64 systems.
  • NCNN (ncnn): A high-performance, lightweight inference framework tuned for mobile ARM devices.
  • Sony IMX500 (imx): Exports for Sony's IMX500 intelligent vision sensor with on-chip processing, such as the Raspberry Pi AI Camera.
  • Rockchip RKNN (rknn): Targets Rockchip NPUs on embedded boards with FP16 and INT8 quantization.
  • ExecuTorch (executorch): PyTorch's native on-device runtime for mobile (iOS and Android) and embedded systems via XNNPACK.
  • Axelera AI (axelera): Compiles for Axelera's Metis AIPU (up to 856 TOPS) over PCIe or M.2 for high-throughput edge inference.
  • DEEPX (deepx): Targets DEEPX NPU hardware with INT8 quantization for embedded edge inference.
  • Qualcomm QNN (qnn): On-device inference on Snapdragon Hexagon NPU, Adreno GPU, and CPU through the Qualcomm AI stack.

For an additional edge target, the Hailo integration compiles YOLO detection models to Hailo HEF. It is not a direct model.export() target: detection models are exported to ONNX first, then compiled to HEF with the external Hailo Dataflow Compiler for Hailo-8, Hailo-8L, and Hailo-15 accelerators.

Deployment Options Compared

The following table summarizes the deployment options for YOLO26 models across the criteria that usually drive the choice. For an in-depth look at each format, see the export formats documentation.

Deployment OptionPerformance BenchmarksCompatibility and IntegrationCommunity Support and EcosystemCase StudiesMaintenance and UpdatesSecurity ConsiderationsHardware Acceleration
PyTorchGood flexibility; may trade off raw performanceExcellent with Python librariesExtensive resources and communityResearch and prototypesRegular, active developmentDependent on deployment environmentCUDA support for GPU acceleration
TorchScriptBetter for production than PyTorchSmooth transition from PyTorch to C++Specialized but narrower than PyTorchIndustry where Python is a bottleneckConsistent updates with PyTorchImproved security without full PythonInherits CUDA support from PyTorch
ONNXVariable depending on runtimeHigh across different frameworksBroad ecosystem, supported by many orgsFlexibility across ML frameworksRegular updates for new operationsEnsure secure conversion and deployment practicesVarious hardware optimizations
OpenVINOOptimized for Intel hardwareBest within Intel ecosystemSolid in computer vision domainIoT and edge with Intel hardwareRegular updates for Intel hardwareRobust features for sensitive applicationsTailored for Intel hardware
TensorRTTop-tier on NVIDIA GPUsBest for NVIDIA hardwareStrong network through NVIDIAReal-time video and image inferenceFrequent updates for new GPUsEmphasis on securityDesigned for NVIDIA GPUs
CoreMLOptimized for on-device Apple hardwareExclusive to Apple ecosystemStrong Apple and developer supportOn-device ML on Apple productsRegular Apple updatesFocus on privacy and securityApple neural engine and GPU
TF SavedModelScalable in server environmentsWide compatibility in TensorFlow ecosystemLarge support due to TensorFlow popularityServing models at scaleRegular updates by Google and communityRobust features for enterpriseVarious hardware accelerations
TF GraphDefStable for static computation graphsIntegrates well with TensorFlow infrastructureResources for optimizing static graphsScenarios requiring static graphsUpdates alongside TensorFlow coreEstablished TensorFlow security practicesTensorFlow acceleration options
TF LiteSpeed and efficiency on mobile/embeddedWide range of device supportRobust community, Google backedMobile applications with minimal footprintLatest features for mobileSecure environment on end-user devicesGPU and DSP among others
TF Edge TPUOptimized for Google's Edge TPU hardwareExclusive to Edge TPU devicesGrowing with Google and third-party resourcesIoT devices requiring real-time processingImprovements for new Edge TPU hardwareGoogle's robust IoT securityCustom-designed for Google Coral
TF.jsReasonable in-browser performanceHigh with web technologiesWeb and Node.js developers supportInteractive web applicationsTensorFlow team and community contributionsWeb platform security modelEnhanced with WebGL and other APIs
PaddlePaddleCompetitive, easy to use and scalableBaidu ecosystem, wide application supportRapidly growing, especially in ChinaChinese market and language processingFocus on Chinese AI applicationsEmphasizes data privacy and securityIncluding Baidu's Kunlun chips
MNNHigh-performance for mobile devicesMobile and embedded ARM systems and X86-64 CPUMobile/embedded ML communityMobile systems efficiencyHigh performance maintenance on mobile devicesOn-device security advantagesARM CPUs and GPUs optimizations
NCNNOptimized for mobile ARM-based devicesMobile and embedded ARM systemsNiche but active mobile/embedded ML communityAndroid and ARM systems efficiencyHigh performance maintenance on ARMOn-device security advantagesARM CPUs and GPUs optimizations
Sony IMX500On-sensor inference at very low powerSony IMX500 sensor, Raspberry Pi AI CameraSony AITRIOS ecosystemOn-camera edge AISony SDK and MCT toolchain updatesData stays on the sensorSony IMX500 on-chip accelerator
Rockchip RKNNOptimized for Rockchip NPUsRockchip SoC boards (e.g. RK3588)Rockchip developer communityEmbedded SBC and edge devicesRockchip RKNN-Toolkit updatesOn-device local inferenceRockchip NPU
ExecuTorchEfficient on-device PyTorch runtimeiOS, Android, embedded via XNNPACKBacked by the PyTorch projectMobile and embedded appsMaintained alongside PyTorchOn-device inference keeps data localXNNPACK and mobile CPU/GPU backends
Axelera AIVery high throughput (up to 856 TOPS)Metis AIPU over PCIe or M.2Axelera Voyager SDKHigh-throughput edge inferenceAxelera SDK updatesOn-premises edge inferenceAxelera Metis AIPU
DEEPXINT8-optimized NPU inferenceDEEPX NPU hardwareDEEPX developer tools (dx_com, dx_engine)Embedded edge inferenceDEEPX SDK and runtime updatesOn-device local inferenceDEEPX NPU
Qualcomm QNNFast on-device Snapdragon inferenceSnapdragon Hexagon NPU, Adreno GPU, CPUQualcomm AI Hub ecosystemMobile and edge Snapdragon devicesQualcomm AI stack (QAIRT) updatesOn-device inference keeps data localSnapdragon Hexagon NPU

This comparison gives you a high-level overview. For deployment, weigh the specific requirements and constraints of your project against each option, and consult the linked integration guide for the format you choose.

Conclusion

YOLO26's wide range of export formats lets you tailor a model to almost any environment, from a cloud GPU server to an on-sensor edge camera. Once you have picked a format, follow model deployment best practices for optimization, troubleshooting, and security, and lean on the Ultralytics community when you hit a snag.

FAQ

What are the deployment options available for YOLO26 on different hardware platforms?

Ultralytics YOLO26 supports various deployment formats, each designed for specific environments and hardware platforms. Key formats include:

  • PyTorch for research and prototyping, with excellent Python integration.
  • TorchScript for production environments where Python is unavailable.
  • ONNX for cross-platform compatibility and hardware acceleration.
  • OpenVINO for optimized performance on Intel hardware.
  • TensorRT for high-speed inference on NVIDIA GPUs.

Each format has unique advantages. For a detailed walkthrough, see our export process documentation.

How do I improve the inference speed of my YOLO26 model on an Intel CPU?

To enhance inference speed on Intel CPUs, you can deploy your YOLO26 model using Intel's OpenVINO toolkit. OpenVINO offers significant performance boosts by optimizing models to leverage Intel hardware efficiently.

  1. Convert your YOLO26 model to the OpenVINO format using the model.export() function.
  2. Follow the detailed setup guide in the Intel OpenVINO Export documentation.

For more insights, check out our blog post.

Can I deploy YOLO26 models on mobile devices?

Yes, YOLO26 models can be deployed on mobile devices using TensorFlow Lite (TF Lite) for both Android and iOS platforms. TF Lite is designed for mobile and embedded devices, providing efficient on-device inference.

!!! example

=== "Python"

    ```python
    # Export command for TFLite format
    model.export(format="tflite")
    ```

=== "CLI"

    ```bash
    # CLI command for TFLite export
    yolo export model=yolo26n.pt format=tflite
    ```

For more details on deploying models to mobile, refer to our TF Lite integration guide.

What factors should I consider when choosing a deployment format for my YOLO26 model?

When choosing a deployment format for YOLO26, consider the following factors:

  • Performance: Some formats like TensorRT provide exceptional speeds on NVIDIA GPUs, while OpenVINO is optimized for Intel hardware.
  • Compatibility: ONNX offers broad compatibility across different platforms.
  • Ease of Integration: Formats like CoreML or TF Lite are tailored for specific ecosystems like iOS and Android, respectively.
  • Community Support: Formats like PyTorch and TensorFlow have extensive community resources and support.

For a comparative analysis, refer to our export formats documentation.

How can I deploy YOLO26 models in a web application?

To deploy YOLO26 models in a web application, you can use TensorFlow.js (TF.js), which allows for running machine learning models directly in the browser. This approach eliminates the need for backend infrastructure and provides real-time performance.

  1. Export the YOLO26 model to the TF.js format.
  2. Integrate the exported model into your web application.

For step-by-step instructions, refer to our guide on TensorFlow.js integration.