bindings/python/notebook/windows(arm64).ipynb
This notebook demonstrates how to use the NexaAI SDK for various AI inference tasks on NPU devices, including: large language models (LLM), vision language models (VLM), text embeddings, speech recognition (ASR), document reranking, computer vision (OCR), and speaker diarization.
If you prefer, we also offer a video tutorial for the installation. Check it out here.
NexaAI requires Python 3.11 – 3.13 (ARM64 build) on Windows ARM. Please download and install the official ARM64 Python installer (e.g., python-3.11.1-arm64.exe) from python.org. Make sure you read the instructions below carefully before proceeding.
❗ IMPORTANT: Make sure you select "Add python.exe to PATH" on the first screen of the installation wizard.
🛑 Make sure you restart the terminal or your IDE after installation.
⚠️ Do not use Conda or x86 builds; they are incompatible with native ARM64 binaries. If you are in a conda environment, run `conda deactivate` first.
Verify the installation:
In case your PATH gets overridden by an environment manager, we recommend running the following commands to restore the PATH variable from the system settings.
$systemPath = [Environment]::GetEnvironmentVariable('Path', 'Machine')
$userPath = [Environment]::GetEnvironmentVariable('Path', 'User')
$env:Path = "$userPath;$systemPath"
Then verify that your Python executable has the correct architecture and version (3.11 – 3.13):
python -c "import sys; print(f'Python version: {sys.version}')"
Your output should look like:
Python version: 3.11.0 (main, Oct 24 2022, 18:15:22) [MSC v.1933 64 bit (ARM64)]
The output must show a version between 3.11 and 3.13 and the ARM64 architecture. If it shows AMD64 or an incorrect version, try the following:
- Run `conda deactivate` to deactivate the current conda environment (in case the `python` executable points to the x86 version).
- Make sure the ARM64 Python comes before the x86 Python in your PATH: press the Win key, type `env`, and hit Enter to open the "Edit the system environment variables" setting. Click the "Environment Variables..." button, select `Path`, and click "Edit...". Move the ARM64 Python entries above the x86 ones, then click OK several times to close all the dialogs and save the changes.

Then `cd` to the project root directory (`cd path/to/nexa-sdk`), create and activate a virtual environment, and install the SDK:
python -m venv nexaai-env
nexaai-env\Scripts\activate
pip install nexaai -v
Select the Jupyter kernel from `nexaai-env`, or the custom virtual environment you created. The kernel should reload automatically in most IDEs. Run the following code to ensure you have the right kernel running:
import sys
import platform
# ANSI color codes
RED = "\033[91m"
GREEN = "\033[92m"
YELLOW = "\033[93m"
BOLD = "\033[1m"
RESET = "\033[0m"
min_ver = (3, 11)
max_ver = (3, 13)
current_ver = sys.version_info
arch = platform.machine()
if not (min_ver <= (current_ver.major, current_ver.minor) <= max_ver) or arch.lower() != "arm64":
print("\n" + "=" * 80)
print(f"{BOLD}{RED}WARNING: Your Python version or architecture is not compatible.{RESET}")
print(f"Detected version: {current_ver.major}.{current_ver.minor}, architecture: {arch}")
print(f"{YELLOW}Required: Python 3.11 - 3.13 & architecture 'arm64'.{RESET}")
print("=" * 80)
print(f"{RED}DO NOT continue to the following code!{RESET}\n")
print("To install arm64 Python:")
print(" - Download Python 3.11-3.13 for arm64 from https://www.python.org/downloads/")
print(" - Install and verify by running: python3 --version and python3 -c 'import platform; print(platform.machine())'")
print(" - Launch Jupyter and make sure to select the arm64 Python kernel in 'Kernel > Change kernel'.")
sys.exit(1)
else:
print(f"{GREEN}[VERIFICATION PASSED] Python version and architecture are correct. You may continue to the following sections.{RESET}")
Before running any examples, you need to set up your NexaAI authentication token.
Replace "YOUR_NEXA_TOKEN_HERE" with your actual NexaAI token from https://sdk.nexa.ai/:
import os
# Replace "YOUR_NEXA_TOKEN_HERE" with your actual token from https://sdk.nexa.ai/
os.environ["NEXA_TOKEN"] = "YOUR_NEXA_TOKEN_HERE"
# Suppress HF warnings
os.environ["HF_HUB_VERBOSITY"] = "error"
assert os.environ.get("NEXA_TOKEN", "").startswith("key/"), "ERROR: NEXA_TOKEN must start with 'key/'. Please check your token."
Using NPU-accelerated large language models for text generation and conversation. Llama3.2-3B-NPU-Turbo is specifically optimized for NPU.
import io
import os
from nexaai import LLM, GenerationConfig, ModelConfig, LlmChatMessage
def llm_npu_example():
"""LLM NPU inference example"""
print("=== LLM NPU Inference Example ===")
# Model configuration
# Use huggingface Repo ID
model_name = "NexaAI/Llama3.2-3B-NPU-Turbo"
# Alternatively, use local path
# model_name = os.path.expanduser(r"~\.cache\nexa.ai\nexa_sdk\models\NexaAI\Llama3.2-3B-NPU-Turbo\weights-1-3.nexa")
plugin_id = "npu"
max_tokens = 100
system_message = "You are a helpful assistant."
print(f"Loading model: {model_name}")
print(f"Using plugin: {plugin_id}")
# Create model instance
config = ModelConfig()
llm = LLM.from_(model=model_name, plugin_id=plugin_id, config=config)
# Create conversation history
conversation = [LlmChatMessage(role="system", content=system_message)]
# Example conversations
test_prompts = [
"What is artificial intelligence?",
"Explain the benefits of on-device AI processing.",
"How does NPU acceleration work?"
]
for i, prompt in enumerate(test_prompts, 1):
print(f"\n--- Conversation {i} ---")
print(f"User: {prompt}")
# Add user message
conversation.append(LlmChatMessage(role="user", content=prompt))
# Apply chat template
formatted_prompt = llm.apply_chat_template(conversation)
# Generate response
print("Assistant: ", end="", flush=True)
response_buffer = io.StringIO()
gen = llm.generate_stream(formatted_prompt, GenerationConfig(max_tokens=max_tokens))
result = None
try:
while True:
token = next(gen)
print(token, end="", flush=True)
response_buffer.write(token)
except StopIteration as e:
result = e.value
# Get profiling data
if result and hasattr(result, "profile_data") and result.profile_data:
print(f"\n{result.profile_data}")
# Add assistant response to conversation history
conversation.append(LlmChatMessage(role="assistant", content=response_buffer.getvalue()))
print("\n" + "=" * 50)
llm_npu_example()
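The loop above appends every turn to `conversation`, so a long session will eventually exceed the model's context window. A minimal, hypothetical history-trimming helper (it counts messages rather than tokens; `trim_history` and `max_turns` are names introduced here, not part of the SDK):

```python
def trim_history(conversation, max_turns=6):
    """Keep the system message plus the last `max_turns` messages.

    `conversation` is a list of chat messages whose first entry is the
    system message; older user/assistant turns are dropped first.
    """
    if len(conversation) <= max_turns + 1:
        return conversation
    return [conversation[0]] + conversation[-max_turns:]

# Example with plain dicts standing in for LlmChatMessage:
history = [{"role": "system"}] + [{"role": "user", "turn": i} for i in range(10)]
trimmed = trim_history(history, max_turns=4)
print(len(trimmed))  # 5: the system message plus the last 4 turns
```

Call this on `conversation` before `apply_chat_template` if you hit context-length limits; a token-aware version would measure the formatted prompt instead of counting messages.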
Using NPU-accelerated vision language models for multimodal understanding and generation. OmniNeural-4B supports joint processing of images and text.
import os
import io
from nexaai import (
GenerationConfig,
ModelConfig,
VlmChatMessage,
VlmContent,
)
from nexaai.vlm import VLM
def vlm_npu_example():
"""VLM NPU inference example"""
print("=== VLM NPU Inference Example ===")
# Model configuration
# Use huggingface repo ID
model_name = "NexaAI/OmniNeural-4B"
# Alternatively, use local path
# model_name = os.path.expanduser(r"~\.cache\nexa.ai\nexa_sdk\models\NexaAI\OmniNeural-4B\weights-1-8.nexa")
plugin_id = "npu"
max_tokens = 100
system_message = "You are a helpful assistant that can understand images and text."
image_path = '/your/image/path' # Replace with actual image path if available
print(f"Loading model: {model_name}")
print(f"Using plugin: {plugin_id}")
# Check for image existence
if not (image_path and os.path.exists(image_path)):
print(
f"\033[93mWARNING: The specified image_path ('{image_path}') does not exist or was not provided. Multimodal prompts will not include image input.\033[0m")
# Create model instance
config = ModelConfig()
vlm = VLM.from_(model=model_name, config=config, plugin_id=plugin_id)
# Create conversation history
conversation = [
VlmChatMessage(
role="system",
contents=[VlmContent(type="text", text=system_message)]
)
]
# Example multimodal conversations
test_cases = [
{
"text": "What do you see in this image?",
"image_path": image_path
}
]
for i, case in enumerate(test_cases, 1):
print(f"\n--- Multimodal Conversation {i} ---")
print(f"User: {case['text']}")
# Build message content
contents = []
if case['text']:
contents.append(VlmContent(type="text", text=case['text']))
# Add image content if available
if case['image_path'] and os.path.exists(case['image_path']):
contents.append(VlmContent(type="image", text=case['image_path']))
print(f"Including image: {case['image_path']}")
# Add user message
conversation.append(VlmChatMessage(role="user", contents=contents))
# Apply chat template
formatted_prompt = vlm.apply_chat_template(conversation)
# Generate response
print("Assistant: ", end="", flush=True)
response_buffer = io.StringIO()
# Prepare image and audio paths
image_paths = [case['image_path']] if case['image_path'] and os.path.exists(case['image_path']) else None
audio_paths = None
gen = vlm.generate_stream(
formatted_prompt,
config=GenerationConfig(
max_tokens=max_tokens,
image_paths=image_paths,
audio_paths=audio_paths
)
)
result = None
try:
while True:
token = next(gen)
print(token, end="", flush=True)
response_buffer.write(token)
except StopIteration as e:
result = e.value
# Get profiling data
if result and hasattr(result, "profile_data") and result.profile_data:
print(f"\n{result.profile_data}")
# Add assistant response to conversation history
conversation.append(
VlmChatMessage(
role="assistant",
contents=[
VlmContent(type="text", text=response_buffer.getvalue())
]
)
)
print("\n" + "=" * 50)
vlm_npu_example()
Using NPU-accelerated embedding models for text vectorization and similarity computation. embeddinggemma-300m-npu is a lightweight embedding model specifically optimized for NPU.
import os
from nexaai.embedding import Embedder
def embedder_npu_example():
"""Embedder NPU inference example"""
print("=== Embedder NPU Inference Example ===")
# Model configuration
# Use huggingface repo ID
model_name = "NexaAI/embeddinggemma-300m-npu"
# Alternatively, use local path
# model_name = os.path.expanduser(r"~\.cache\nexa.ai\nexa_sdk\models\NexaAI\embeddinggemma-300m-npu\weights-1-2.nexa")
plugin_id = "npu"
batch_size = 2
print(f"Loading model: {model_name}")
print(f"Using plugin: {plugin_id}")
print(f"Batch size: {batch_size}")
# Create embedder instance
embedder = Embedder.from_(model=model_name, plugin_id=plugin_id)
print('Embedder loaded successfully!')
# Get embedding dimension
dim = embedder.embedding_dim()
print(f"Dimension: {dim}")
# Example texts
texts = [
"On-device AI is a type of AI that is processed on the device itself, rather than in the cloud.",
"Nexa AI allows you to run state-of-the-art AI models locally on CPU, GPU, or NPU — from instant use cases to production deployments.",
"A ragdoll is a breed of cat that is known for its long, flowing hair and gentle personality.",
"The capital of France is Paris.",
"NPU acceleration provides significant performance improvements for AI workloads."
]
query = "what is on device AI"
print(f"\n=== Generating Embeddings ===")
print(f"Processing {len(texts)} texts...")
# Generate embeddings
result = embedder.embed(
texts=texts,
batch_size=batch_size,
)
embeddings = result.embeddings
print(f"Successfully generated {len(embeddings)} embeddings")
# Display embedding information
print(f"\n=== Embedding Details ===")
for i, (text, embedding) in enumerate(zip(texts, embeddings)):
print(f"\nText {i + 1}:")
print(f" Content: {text}")
print(f" Embedding shape: {len(embedding)} dimensions")
print(f" First 10 elements: {embedding[:10]}")
print("-" * 70)
# Query processing
print(f"\n=== Query Processing ===")
print(f"Query: '{query}'")
query_result = embedder.embed(
texts=[query],
batch_size=1,
)
query_embedding = query_result.embeddings[0]
print(f"Query embedding shape: {len(query_embedding)} dimensions")
# Similarity analysis
print(f"\n=== Similarity Analysis (Inner Product) ===")
similarities = []
for i, (text, embedding) in enumerate(zip(texts, embeddings)):
inner_product = sum(a * b for a, b in zip(query_embedding, embedding))
similarities.append((i, text, inner_product))
print(f"\nText {i + 1}:")
print(f" Content: {text}")
print(f" Inner product with query: {inner_product:.6f}")
print("-" * 70)
# Sort and display most similar texts
similarities.sort(key=lambda x: x[2], reverse=True)
print(f"\n=== Similarity Ranking Results ===")
for rank, (idx, text, score) in enumerate(similarities, 1):
print(f"Rank {rank}: [{score:.6f}] {text}")
return embeddings, query_embedding, similarities
embeddings, query_emb, similarities = embedder_npu_example()
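The inner-product ranking above assumes the embeddings have comparable magnitudes. If your embeddings are not normalized, cosine similarity is a common alternative; a minimal sketch in plain Python (the `cosine_similarity` helper is introduced here, not part of the SDK):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

For normalized embeddings the inner product and cosine similarity coincide, so the ranking in the example above would be unchanged.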
Using NPU-accelerated speech recognition models for speech-to-text transcription. parakeet-npu provides high-quality speech recognition with NPU acceleration.
import os
import time
from nexaai.asr import ASR
def asr_npu_example():
"""ASR NPU inference example"""
print("=== ASR NPU Inference Example ===")
# Model configuration
# Use huggingface Repo ID
model_name = "NexaAI/parakeet-npu"
# Alternatively, use local path
# model_name = os.path.expanduser(r"~\.cache\nexa.ai\nexa_sdk\models\NexaAI\parakeet-npu\weights-1-5.nexa")
plugin_id = "npu"
# Example audio file (replace with your actual audio file)
audio_file = r"path/to/audio" # Replace with actual audio file path
print(f"Loading model: {model_name}")
print(f"Using plugin: {plugin_id}")
audio_path = os.path.expanduser(audio_file)
if not os.path.exists(audio_path):
raise FileNotFoundError(f"Audio file not found: {audio_path}")
# Create ASR instance
asr = ASR.from_(
model=os.path.expanduser(model_name),
plugin_id=plugin_id,
device_id=None,
)
print('ASR model loaded successfully!')
print(f"\n=== Starting Transcription ===")
start_time = time.time()
# Perform transcription
result = asr.transcribe(
audio_path=audio_path,
language="en",
timestamps="segment",
beam_size=5,
)
end_time = time.time()
transcription_time = end_time - start_time
# Display results
print(f"\n=== Transcription Results ===")
print(f"Transcription: {result.transcript}")
print(f"Processing time: {transcription_time:.2f} seconds")
return result
result = asr_npu_example()
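A common way to interpret the processing time above is the real-time factor: processing time divided by audio duration, where values below 1.0 mean faster-than-real-time transcription. A hedged sketch for WAV input using only the standard library (`wav_duration` is a helper introduced here, not part of the SDK, and assumes an uncompressed WAV file):

```python
import wave

def wav_duration(path):
    """Duration of an uncompressed WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / float(wf.getframerate())

# Usage, assuming `audio_path` and `transcription_time` from the example above:
# rtf = transcription_time / wav_duration(audio_path)
# print(f"Real-time factor: {rtf:.3f} (< 1.0 means faster than real time)")
```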
Using NPU-accelerated reranking models for document reranking. jina-v2-rerank-npu ranks documents by their relevance to a query.
import os
from nexaai.rerank import Reranker
def reranker_npu_example():
"""Reranker NPU inference example"""
print("=== Reranker NPU Inference Example ===")
# Model configuration
# Use huggingface repo ID
model_name = "NexaAI/jina-v2-rerank-npu"
# Alternatively, use local path
# model_name = os.path.expanduser(r"~\.cache\nexa.ai\nexa_sdk\models\NexaAI\jina-v2-rerank-npu\weights-1-4.nexa")
plugin_id = "npu"
batch_size = 4
print(f"Loading model: {model_name}")
print(f"Using plugin: {plugin_id}")
print(f"Batch size: {batch_size}")
# Create reranker instance
reranker = Reranker.from_(
model=os.path.expanduser(model_name),
plugin_id=plugin_id,
)
# Example queries and documents
queries = [
"Where is on-device AI?",
"What is NPU acceleration?",
"How does machine learning work?",
"Tell me about computer vision"
]
documents = [
"On-device AI is a type of AI that is processed on the device itself, rather than in the cloud.",
"NPU acceleration provides significant performance improvements for AI workloads on specialized hardware.",
"Edge computing brings computation and data storage closer to the sources of data.",
"A ragdoll is a breed of cat that is known for its long, flowing hair and gentle personality.",
"The capital of France is Paris, a beautiful city known for its art and culture.",
"Machine learning is a subset of artificial intelligence that enables computers to learn without being explicitly programmed.",
"Computer vision is a field of artificial intelligence that trains computers to interpret and understand visual information.",
"Deep learning uses neural networks with multiple layers to model and understand complex patterns in data."
]
print(f"\n=== Document Reranking Test ===")
print(f"Number of documents: {len(documents)}")
# Rerank for each query
for i, query in enumerate(queries, 1):
print(f"\n--- Query {i} ---")
print(f"Query: '{query}'")
print("-" * 50)
# Perform reranking
result = reranker.rerank(
query=query,
documents=documents,
batch_size=batch_size,
)
scores = result.scores
# Create (document, score) pairs and sort
doc_scores = list(zip(documents, scores))
doc_scores.sort(key=lambda x: x[1], reverse=True)
# Display ranking results
print("Reranking results:")
for rank, (doc, score) in enumerate(doc_scores, 1):
print(f" {rank:2d}. [{score:.4f}] {doc}")
# Display most relevant documents
print(f"\nMost relevant documents (top 3):")
for rank, (doc, score) in enumerate(doc_scores[:3], 1):
print(f" {rank}. {doc}")
print("=" * 80)
return reranker
reranker = reranker_npu_example()
Run NPU-accelerated computer vision tasks (e.g., OCR/text recognition) on images.
import os
from nexaai.cv import CV
def cv_ocr_example():
"""CV OCR example"""
print("=== CV OCR Example ===")
# Use huggingface repo ID
model_name = "NexaAI/paddleocr-npu"
# Alternatively, use local path
# model_name = os.path.expanduser(r"~\.cache\nexa.ai\nexa_sdk\models\NexaAI\paddleocr-npu\weights-1-1.nexa")
image_path = r"path/to/image"
image_path = os.path.expanduser(image_path)
if not os.path.exists(image_path):
raise FileNotFoundError(f"Image file not found: {image_path}")
cv = CV.from_(
model=os.path.expanduser(model_name),
capabilities=0, # 0=OCR
plugin_id='npu',
)
results = cv.infer(image_path)
print(f"Number of results: {len(results.results)}")
for result in results.results:
print(f"[{result.confidence:.2f}] {result.text}")
cv_ocr_example()
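OCR output often contains low-confidence fragments. A small, hypothetical post-processing helper that keeps only results above a confidence threshold and joins them into a single string (it assumes each result exposes the `confidence` and `text` fields printed above; `join_ocr_text` is introduced here, not part of the SDK):

```python
from types import SimpleNamespace

def join_ocr_text(results, min_confidence=0.5):
    """Join OCR fragments above a confidence threshold into one string."""
    return " ".join(r.text for r in results if r.confidence >= min_confidence)

# Example with stand-in result objects:
fake = [SimpleNamespace(confidence=0.92, text="Hello"),
        SimpleNamespace(confidence=0.31, text="~#!"),
        SimpleNamespace(confidence=0.88, text="world")]
print(join_ocr_text(fake))  # Hello world
```

In the example above you would call it as `join_ocr_text(results.results)`; tune the threshold to your images.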
Run NPU-accelerated speech diarization tasks on audio files.
import os
from nexaai.diarize import Diarize
def diarize_example():
"""Diarize NPU inference example"""
print("=== Diarize NPU Inference Example ===")
# Use huggingface repo ID
model_name = "NexaAI/Pyannote-NPU"
# Alternatively, use local path
# model_name = os.path.expanduser(r"~\.cache\nexa.ai\nexa_sdk\models\NexaAI\Pyannote-NPU\weights-1-1.nexa")
plugin_id = "npu"
audio_path = r"path/to/audio" # Replace with actual audio file path
print(f"Loading model: {model_name}")
print(f"Using plugin: {plugin_id}")
audio_path = os.path.expanduser(audio_path)
if not os.path.exists(audio_path):
raise FileNotFoundError(f"Audio file not found: {audio_path}")
# Create Diarize instance
diarize = Diarize.from_(
model=os.path.expanduser(model_name),
plugin_id=plugin_id,
device_id=None,
)
print('Diarize model loaded successfully!')
print(f"\n=== Starting Diarization ===")
# Perform diarization
result = diarize.infer(
audio_path=audio_path,
min_speakers=0, # Auto-detect
max_speakers=0, # No limit
)
# Display results
print(f"\n=== Diarization Results ===")
print(f"Number of speakers: {result.num_speakers}")
print(f"Duration: {result.duration:.2f}s")
print(f"Number of segments: {len(result.segments)}")
print("\nSegments:")
for segment in result.segments:
print(
f"[{segment.start_time:.2f}s - {segment.end_time:.2f}s] {segment.speaker_label}"
)
return result
result = diarize_example()
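Diarization output often splits one speaker's turn into many short segments. A hedged helper that merges consecutive segments sharing a speaker label (it assumes the segment fields printed above: `start_time`, `end_time`, `speaker_label`; `merge_segments` is introduced here, not part of the SDK):

```python
from types import SimpleNamespace

def merge_segments(segments):
    """Merge consecutive segments that share a speaker label."""
    merged = []
    for seg in segments:
        if merged and merged[-1]["speaker"] == seg.speaker_label:
            merged[-1]["end"] = seg.end_time  # extend the previous turn
        else:
            merged.append({"speaker": seg.speaker_label,
                           "start": seg.start_time,
                           "end": seg.end_time})
    return merged

# Example with stand-in segments:
segs = [SimpleNamespace(speaker_label="A", start_time=0.0, end_time=1.0),
        SimpleNamespace(speaker_label="A", start_time=1.0, end_time=2.5),
        SimpleNamespace(speaker_label="B", start_time=2.5, end_time=4.0)]
print(merge_segments(segs))
# [{'speaker': 'A', 'start': 0.0, 'end': 2.5}, {'speaker': 'B', 'start': 2.5, 'end': 4.0}]
```

Calling `merge_segments(result.segments)` on the output above gives a more readable turn-by-turn view.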