VQ-KD Training

VQ-KD (vector-quantized knowledge distillation) trains the tokenizer to reconstruct the teacher model's semantic features rather than the original pixels. The result is a highly compact semantic codebook for masked image modeling.
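
The idea above can be illustrated with a minimal sketch, which is not the repository's implementation. It assumes the codebook lookup picks the entry with the highest cosine similarity to the encoder feature, and that the reconstruction loss is one minus the cosine similarity to the teacher feature (matching `--rec_loss_type cosine` in the training command). The function names and the toy 2-D codebook are hypothetical, and the decoder that maps codes back to the teacher's feature space is omitted.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine_similarity(a, b):
    """Dot product of the two l2-normalized vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def quantize(feature, codebook):
    """Return (index, code) of the codebook entry most similar to `feature`."""
    index = max(range(len(codebook)),
                key=lambda i: cosine_similarity(feature, codebook[i]))
    return index, codebook[index]

def cosine_rec_loss(decoded, teacher):
    """Assumed loss: 0 when `decoded` points in the same direction as `teacher`."""
    return 1.0 - cosine_similarity(decoded, teacher)

# Hypothetical toy codebook: 3 entries of dimension 2.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# An encoder feature is replaced by its nearest code; only the index is stored.
index, code = quantize([0.9, 0.2], codebook)
print("selected code index:", index)

# A decoded feature aligned with the teacher feature gives (near) zero loss.
loss = cosine_rec_loss([2.0, 0.0], [1.0, 0.0])
print("reconstruction loss:", loss)
```

Because the loss compares directions in the teacher's feature space instead of pixel values, codes end up grouping patches by semantic content rather than by low-level appearance.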

Example: Training VQ-KD Tokenizer on ImageNet-1k

The VQ-KD tokenizer can be trained on ImageNet-1k using a DGX box with 8 V100-32GB GPUs:

```bash
python -m torch.distributed.launch --nproc_per_node=8 run_vqkd_training.py \
    --data_set image_folder \
    --data_path /path/to/imagenet-1k/train \
    --eval_data_path /path/to/imagenet-1k/eval \
    --output_dir /path/to/save/your_model \
    --log_dir /path/to/save/your_model \
    --process_type default \
    --train_interpolation bicubic \
    --min_crop_scale 0.08 \
    --model vqkd_encoder_base_decoder_3x768x12_clip \
    --teacher_input_size 224 \
    --codebook_n_emd 8192 \
    --codebook_emd_dim 32 \
    --quantize_kmeans_init \
    --rec_loss_type cosine \
    --batch_size 64 \
    --opt adamw \
    --opt_betas 0.9 0.99 \
    --weight_decay 1e-4 \
    --warmup_epochs 10 \
    --epochs 100 \
    --save_ckpt_freq 20
```

  • --model: the encoder, decoder, and teacher model can be changed in modeling_vqkd.py as needed.
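
To get a feel for the scale of this run, the short calculation below derives the global batch size and step counts from the flags above. It is a rough sketch under two assumptions that the source does not state: that `--batch_size` is counted per GPU process, and that the last partial batch of each epoch is dropped. The image count is the standard ImageNet-1k training set size.

```python
NUM_GPUS = 8              # --nproc_per_node=8
BATCH_PER_GPU = 64        # --batch_size 64 (assumed to be per process)
EPOCHS = 100              # --epochs 100
WARMUP_EPOCHS = 10        # --warmup_epochs 10
TRAIN_IMAGES = 1_281_167  # ImageNet-1k training images

# Samples consumed per optimizer step across all GPUs.
global_batch = NUM_GPUS * BATCH_PER_GPU

# Assumes the incomplete final batch of each epoch is dropped.
steps_per_epoch = TRAIN_IMAGES // global_batch

total_steps = steps_per_epoch * EPOCHS
warmup_steps = steps_per_epoch * WARMUP_EPOCHS

print("global batch size:", global_batch)
print("steps per epoch:  ", steps_per_epoch)
print("total steps:      ", total_steps)
print("warmup steps:     ", warmup_steps)
```

If fewer GPUs are available, keeping the global batch size the same (for example 4 GPUs with a per-GPU batch of 128, memory permitting) is one way to stay close to this configuration.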

Example: Encode images

One can compress the input image into quantized codes like this:

```bash
python test_get_code.py
```
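
As a rough illustration of how compact the quantized codes are, the sketch below counts the tokens and bits needed for one image. It assumes a 224×224 input and a patch size of 16 (matching the ViT-B/16 teachers listed in the Model Zoo), plus the 8192-entry codebook from the training example. None of these values are read from `test_get_code.py`.

```python
IMAGE_SIZE = 224       # assumed input resolution
PATCH_SIZE = 16        # assumed patch size (ViT-B/16 style)
CODEBOOK_SIZE = 8192   # --codebook_n_emd in the training example

# One code per non-overlapping patch.
grid_side = IMAGE_SIZE // PATCH_SIZE
tokens_per_image = grid_side * grid_side

# Bits needed to store one index in the range [0, CODEBOOK_SIZE).
bits_per_token = (CODEBOOK_SIZE - 1).bit_length()

code_bits = tokens_per_image * bits_per_token
raw_bytes = IMAGE_SIZE * IMAGE_SIZE * 3  # uncompressed 8-bit RGB

print("tokens per image:", tokens_per_image)
print("bits per token:  ", bits_per_token)
print("code size (bits):", code_bits)
print("raw size (bytes):", raw_bytes)
```

Under these assumptions each image is represented by a 14×14 grid of integer indices, which is the form of target a masked image modeling objective predicts.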

Model Zoo

We provide some trained VQ-KD tokenizers here.

| model name | encoder layers | decoder layers | teacher model | codebook usage | weight |
|---|---|---|---|---|---|
| vqkd_encoder_base_decoder_1x768x12_clip | 12 | 1 | CLIP ViT-B/16 | 100% | link |
| vqkd_encoder_base_decoder_3x768x12_clip | 12 | 3 | CLIP ViT-B/16 | 97% | link |
| vqkd_encoder_base_decoder_1x768x12_dino | 12 | 1 | DINO ViT-B/16 | 100% | link |
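
The codebook usage column can be read as the share of codebook entries that are actually selected. The sketch below shows one plausible way to compute it, assuming usage means the fraction of entries chosen at least once over a set of encoded images. This definition and the helper name `codebook_usage` are assumptions, not taken from the repository.

```python
def codebook_usage(code_indices, codebook_size):
    """Fraction of codebook entries that appear at least once in `code_indices`."""
    used = set(code_indices)
    if any(not 0 <= i < codebook_size for i in used):
        raise ValueError("code index outside the codebook range")
    return len(used) / codebook_size

# Toy example: a 4-entry codebook and the codes emitted for a few patches.
codes = [0, 3, 3, 1, 0, 2, 1, 3]
usage = codebook_usage(codes, codebook_size=4)
print(f"codebook usage: {usage:.0%}")

# If some entries are never selected, usage drops below 100%.
partial = codebook_usage([0, 0, 1], codebook_size=4)
print(f"codebook usage: {partial:.0%}")
```

A value below 100% means some entries were never selected, so part of the codebook's capacity goes unused.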