docs/source/en/api/pipelines/longcat_audio_dit.md
LongCat-AudioDiT is a text-to-audio diffusion model from Meituan LongCat. The diffusers integration exposes a standard [`DiffusionPipeline`] interface for text-conditioned audio generation.
This pipeline was adapted from the LongCat-AudioDiT reference implementation: https://github.com/meituan-longcat/LongCat-AudioDiT
This pipeline can be loaded from a local directory or a Hugging Face Hub repository in diffusers format, that is, one containing `text_encoder/`, `transformer/`, `vae/`, `tokenizer/`, and `scheduler/` subfolders.
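As a rough sketch of that layout, the helper below checks whether a local directory contains the five subfolders listed above before it is passed to `from_pretrained`. The function name `missing_components` is hypothetical and not part of diffusers; it only checks that the folders exist, not that their contents are valid.

```python
from pathlib import Path

# Subfolders named in the text above for a diffusers-format checkpoint.
EXPECTED_SUBFOLDERS = ("text_encoder", "transformer", "vae", "tokenizer", "scheduler")


def missing_components(checkpoint_dir):
    """Return the expected subfolders that are absent from checkpoint_dir.

    Hypothetical helper for illustration; not part of the diffusers API.
    """
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_SUBFOLDERS if not (root / name).is_dir()]
```

An empty list suggests the directory has the expected top-level layout; any names returned are the subfolders still missing.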
```python
import soundfile as sf
import torch
from diffusers import LongCatAudioDiTPipeline

# Load the diffusers-format checkpoint in half precision and move it to the GPU.
pipeline = LongCatAudioDiTPipeline.from_pretrained(
    "ruixiangma/LongCat-AudioDiT-1B-Diffusers",
    torch_dtype=torch.float16,
)
pipeline = pipeline.to("cuda")

prompt = "A calm ocean wave ambience with soft wind in the background."
audio = pipeline(
    prompt,
    audio_duration_s=5.0,
    num_inference_steps=16,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(42),  # fixed seed for reproducibility
).audios[0, 0]  # first batch item, first channel

sf.write("longcat.wav", audio, pipeline.sample_rate)
```
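The `[0, 0]` indexing in the example can be illustrated without running the model. The sketch below uses a NumPy placeholder array laid out as `(batch, channels, samples)`; the 24000 Hz sample rate and the all-zero contents are assumptions for illustration only (in practice the rate comes from `pipeline.sample_rate`), and the exact array type and length returned by the pipeline are not confirmed here.

```python
import numpy as np

SAMPLE_RATE = 24_000  # assumed value for illustration; read pipeline.sample_rate in practice
DURATION_S = 5.0

# Placeholder standing in for pipeline output with layout (batch, channels, samples).
num_samples = int(DURATION_S * SAMPLE_RATE)
audios = np.zeros((1, 1, num_samples), dtype=np.float32)

# [0, 0] selects the first batch item and its first channel: a 1-D waveform.
mono = audios[0, 0]

# Duplicate the mono waveform into two identical channels, shape (samples, 2),
# which is the (frames, channels) layout soundfile.write accepts for stereo data.
stereo = np.stack([mono, mono], axis=-1)

# Duration implied by the waveform length and the sample rate.
duration_s = mono.shape[0] / SAMPLE_RATE
```

Dividing the waveform length by the sample rate is a quick way to check that a generated clip has roughly the duration that was requested.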
- `audio_duration_s` is the most direct way to control output duration.
- Use `generator=torch.Generator("cuda").manual_seed(42)` to make generation reproducible.
- The output has shape `(batch, channels, samples)`; use `.audios[0, 0]` to get a single waveform.
- `audio.unsqueeze(0).repeat(1, 2, 1)` repeats the audio along the channel dimension, for example to duplicate a mono signal into two channels.

[[autodoc]] LongCatAudioDiTPipeline
  - all
  - __call__
  - from_pretrained