docs/source/en/model_doc/pop2piano.md
This model was released on 2022-11-02 and added to Hugging Face Transformers on 2023-08-21.
The Pop2Piano model was proposed in Pop2Piano : Pop Audio-based Piano Cover Generation by Jongho Choi and Kyogu Lee.
Piano covers of pop music are widely enjoyed, but generating them from music is not a trivial task. It requires great expertise with playing piano as well as knowing different characteristics and melodies of a song. With Pop2Piano you can directly generate a cover from a song's audio waveform. It is the first model to directly generate a piano cover from pop audio without melody and chord extraction modules.
Pop2Piano is an encoder-decoder Transformer model based on T5. The input audio is transformed to its waveform and passed to the encoder, which transforms it to a latent representation. The decoder uses these latent representations to generate token ids in an autoregressive way. Each token id corresponds to one of four different token types: time, velocity, note and 'special'. The token ids are then decoded to their equivalent MIDI file.
The abstract from the paper is the following:
Piano covers of pop music are enjoyed by many people. However, the task of automatically generating piano covers of pop music is still understudied. This is partly due to the lack of synchronized {Pop, Piano Cover} data pairs, which made it challenging to apply the latest data-intensive deep learning-based methods. To leverage the power of the data-driven approach, we make a large amount of paired and synchronized {Pop, Piano Cover} data using an automated pipeline. In this paper, we present Pop2Piano, a Transformer network that generates piano covers given waveforms of pop music. To the best of our knowledge, this is the first model to generate a piano cover directly from pop audio without using melody and chord extraction modules. We show that Pop2Piano, trained with our dataset, is capable of producing plausible piano covers.
This model was contributed by Susnato Dhar. The original code can be found here.
pip install pretty-midi==0.2.9 essentia==2.1b6.dev1034 librosa scipy
Please note that you may need to restart your runtime after installation.
Pop2PianoForConditionalGeneration.generate() can lead to variety of different results.from datasets import load_dataset
from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor
model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano", device_map="auto")
processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")
ds = load_dataset("sweetcocoa/pop2piano_ci", split="test")
inputs = processor(
audio=ds["audio"][0]["array"], sampling_rate=ds["audio"][0]["sampling_rate"], return_tensors="pt"
)
model_output = model.generate(input_features=inputs["input_features"], composer="composer1")
tokenizer_output = processor.batch_decode(
token_ids=model_output, feature_extractor_output=inputs
)["pretty_midi_objects"][0]
tokenizer_output.write("./Outputs/midi_output.mid")
import librosa
from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor
audio, sr = librosa.load("<your_audio_file_here>", sr=44100) # feel free to change the sr to a suitable value.
model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano", device_map="auto")
processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")
inputs = processor(audio=audio, sampling_rate=sr, return_tensors="pt").to(model.device)
model_output = model.generate(input_features=inputs["input_features"], composer="composer1")
tokenizer_output = processor.batch_decode(
token_ids=model_output, feature_extractor_output=inputs
)["pretty_midi_objects"][0]
tokenizer_output.write("./Outputs/midi_output.mid")
import librosa
from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor
# feel free to change the sr to a suitable value.
audio1, sr1 = librosa.load("<your_first_audio_file_here>", sr=44100)
audio2, sr2 = librosa.load("<your_second_audio_file_here>", sr=44100)
model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano", device_map="auto")
processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")
inputs = processor(audio=[audio1, audio2], sampling_rate=[sr1, sr2], return_attention_mask=True, return_tensors="pt").to(model.device)
# Since we now generating in batch(2 audios) we must pass the attention_mask
model_output = model.generate(
input_features=inputs["input_features"],
attention_mask=inputs["attention_mask"],
composer="composer1",
)
tokenizer_output = processor.batch_decode(
token_ids=model_output, feature_extractor_output=inputs
)["pretty_midi_objects"]
# Since we now have 2 generated MIDI files
tokenizer_output[0].write("./Outputs/midi_output1.mid")
tokenizer_output[1].write("./Outputs/midi_output2.mid")
Pop2PianoFeatureExtractor and Pop2PianoTokenizer):import librosa
from transformers import Pop2PianoFeatureExtractor, Pop2PianoForConditionalGeneration, Pop2PianoTokenizer
# feel free to change the sr to a suitable value.
audio1, sr1 = librosa.load("<your_first_audio_file_here>", sr=44100)
audio2, sr2 = librosa.load("<your_second_audio_file_here>", sr=44100)
model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano", device_map="auto")
feature_extractor = Pop2PianoFeatureExtractor.from_pretrained("sweetcocoa/pop2piano")
tokenizer = Pop2PianoTokenizer.from_pretrained("sweetcocoa/pop2piano")
inputs = feature_extractor(
audio=[audio1, audio2],
sampling_rate=[sr1, sr2],
return_attention_mask=True,
return_tensors="pt",
)
# Since we now generating in batch(2 audios) we must pass the attention_mask
model_output = model.generate(
input_features=inputs["input_features"],
attention_mask=inputs["attention_mask"],
composer="composer1",
)
tokenizer_output = tokenizer.batch_decode(
token_ids=model_output, feature_extractor_output=inputs
)["pretty_midi_objects"]
# Since we now have 2 generated MIDI files
tokenizer_output[0].write("./Outputs/midi_output1.mid")
tokenizer_output[1].write("./Outputs/midi_output2.mid")
[[autodoc]] Pop2PianoConfig
[[autodoc]] Pop2PianoFeatureExtractor - call
[[autodoc]] Pop2PianoForConditionalGeneration - forward - generate
[[autodoc]] Pop2PianoTokenizer - call
[[autodoc]] Pop2PianoProcessor - call