docs/source/en/model_doc/clvp.md
This model was released on 2023-05-12 and added to Hugging Face Transformers on 2023-11-10.
The CLVP (Contrastive Language-Voice Pretrained Transformer) model was proposed in [Better speech synthesis through scaling](https://arxiv.org/abs/2305.07243) by James Betker.
The abstract from the paper is the following:
*In recent years, the field of image generation has been revolutionized by the application of autoregressive transformers and DDPMs. These approaches model the process of image generation as a step-wise probabilistic processes and leverage large amounts of compute and data to learn the image distribution. This methodology of improving performance need not be confined to images. This paper describes a way to apply advances in the image generative domain to speech synthesis. The result is TorToise - an expressive, multi-voice text-to-speech system.*
This model was contributed by [Susnato Dhar](https://huggingface.co/susnato). The original code can be found [here](https://github.com/neonbjb/tortoise-tts).
The use of the [`ClvpModelForConditionalGeneration.generate()`] method is strongly recommended for Tortoise usage. Note that CLVP expects the audio to be sampled at 22.05 kHz, unlike many other audio models, which expect 16 kHz; this is why the example below resamples the audio.

In brief, the model works as follows:

- The [`ClvpTokenizer`] tokenizes the text input, and the [`ClvpFeatureExtractor`] extracts the log mel-spectrogram from the desired audio.
- [`ClvpConditioningEncoder`] takes those text tokens and audio representations and converts them into embeddings conditioned on the text and audio.
- [`ClvpForCausalLM`] uses those embeddings to generate multiple speech candidates.
- Each speech candidate is passed through the speech encoder ([`ClvpEncoder`]), which converts it into a vector representation, and the text encoder ([`ClvpEncoder`]) converts the text tokens into the same latent space.
- At the end, each speech vector is compared with the text vector to find the speech candidate most similar to the text.
- [`ClvpModelForConditionalGeneration.generate()`] compresses all of the logic described above into a single method.

Example:
```python
import datasets
from transformers import ClvpModelForConditionalGeneration, ClvpProcessor

# Define the text and load an audio example from the Hugging Face Hub using the `datasets` library.
text = "This is an example text."

ds = datasets.load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", datasets.Audio(sampling_rate=22050))
sample = ds[0]["audio"]

# Define the processor and the model.
processor = ClvpProcessor.from_pretrained("susnato/clvp_dev")
model = ClvpModelForConditionalGeneration.from_pretrained("susnato/clvp_dev", device_map="auto")

# Generate the processor output and the model output.
processor_output = processor(raw_speech=sample["array"], sampling_rate=sample["sampling_rate"], text=text, return_tensors="pt").to(model.device)
generated_output = model.generate(**processor_output)
```
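
The contrastive comparison from the explanation above can also be run by hand. The snippet below is a minimal sketch rather than the canonical path: it reuses `model` and `processor_output` from the example, assumes the processor output exposes `input_ids` and `input_features`, and relies on the default generation settings (a single speech candidate unless a `generation_config` requesting more return sequences is supplied).

```python
import torch

# Minimal sketch: reuses `model` and `processor_output` from the example above.
with torch.no_grad():
    # Project the text tokens into the shared CLVP latent space.
    text_embeds = model.get_text_features(input_ids=processor_output["input_ids"])

    # With no precomputed `speech_ids`, speech candidates are first generated from
    # the text tokens and the log mel-spectrogram, then projected into the same space.
    speech_embeds = model.get_speech_features(
        input_ids=processor_output["input_ids"],
        input_features=processor_output["input_features"],
    )

# Rank candidates by cosine similarity with the text vector; the most similar
# candidate is the best match for the input text.
scores = torch.nn.functional.cosine_similarity(speech_embeds, text_embeds, dim=-1)
best_candidate = scores.argmax().item()
```

[`ClvpModelForConditionalGeneration.generate()`] performs this candidate generation and contrastive ranking internally, which is why it is the recommended entry point.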
[[autodoc]] ClvpConfig

[[autodoc]] ClvpEncoderConfig

[[autodoc]] ClvpDecoderConfig

[[autodoc]] ClvpTokenizer
    - save_vocabulary

[[autodoc]] ClvpFeatureExtractor
    - __call__

[[autodoc]] ClvpProcessor
    - __call__
    - decode
    - batch_decode

[[autodoc]] ClvpModelForConditionalGeneration
    - forward
    - generate
    - get_text_features
    - get_speech_features

[[autodoc]] ClvpForCausalLM

[[autodoc]] ClvpModel

[[autodoc]] ClvpEncoder

[[autodoc]] ClvpDecoder