The proposed VQ-KD (vector-quantized knowledge distillation) trains the visual tokenizer to reconstruct the semantic features of a teacher model rather than the original pixels, which yields a highly compact semantic codebook for masked image modeling.
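To make the objective concrete, the sketch below shows the shape of the VQ-KD loss when training with `--rec_loss_type cosine`: encode, quantize against the codebook, decode, and regress the teacher's patch features with a cosine distance. The `encoder`, `quantizer`, `decoder`, and `teacher` arguments are placeholders standing in for the corresponding modules in `modeling_vqkd.py`, not the repo's exact API.

```python
import torch
import torch.nn.functional as F

def vqkd_loss(images, encoder, quantizer, decoder, teacher):
    """Sketch of the VQ-KD objective: reconstruct teacher features, not pixels.

    All four modules are placeholders for the components in modeling_vqkd.py;
    their signatures here are illustrative, not the repo's exact API.
    """
    with torch.no_grad():
        target = teacher(images)            # (B, N, D) patch features from e.g. CLIP

    z = encoder(images)                     # (B, N, D_code) continuous patch codes
    z_q, codebook_loss, _ = quantizer(z)    # nearest-codebook lookup (straight-through)
    rec = decoder(z_q)                      # (B, N, D) reconstructed teacher features

    # Cosine reconstruction loss (--rec_loss_type cosine): push each
    # reconstructed patch feature toward the teacher's feature direction.
    rec_loss = (1 - F.cosine_similarity(rec, target, dim=-1)).mean()
    return rec_loss + codebook_loss
```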
The VQ-KD model can be trained on ImageNet-1k using a DGX box (8 V100-32GB):
```bash
python -m torch.distributed.launch --nproc_per_node=8 run_vqkd_training.py \
    --data_set image_folder \
    --data_path /path/to/imagenet-1k/train \
    --eval_data_path /path/to/imagenet-1k/eval \
    --output_dir /path/to/save/your_model \
    --log_dir /path/to/save/your_model \
    --process_type default \
    --train_interpolation bicubic \
    --min_crop_scale 0.08 \
    --model vqkd_encoder_base_decoder_3x768x12_clip \
    --teacher_input_size 224 \
    --codebook_n_emd 8192 \
    --codebook_emd_dim 32 \
    --quantize_kmeans_init \
    --rec_loss_type cosine \
    --batch_size 64 \
    --opt adamw \
    --opt_betas 0.9 0.99 \
    --weight_decay 1e-4 \
    --warmup_epochs 10 \
    --epochs 100 \
    --save_ckpt_freq 20
```
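One flag worth calling out: `--quantize_kmeans_init` initializes the codebook from k-means centroids of encoder features rather than randomly, which helps keep codebook usage high (compare the usage column in the table below). The following is a rough, self-contained sketch of that initialization idea, not the repo's implementation; `kmeans_init_codebook` is a hypothetical helper.

```python
import torch

def kmeans_init_codebook(features, n_codes=8192, iters=10):
    """Toy k-means initialization for a VQ codebook (illustrative only).

    features: (M, D) flattened encoder outputs from a warm-up batch, M >= n_codes.
    Returns (n_codes, D) centroids to copy into the codebook embedding weight.
    """
    # Seed centroids with a random subset of the features.
    idx = torch.randperm(features.size(0))[:n_codes]
    centroids = features[idx].clone()
    for _ in range(iters):
        # Assign every feature to its nearest centroid (an M x n_codes distance
        # matrix, so keep the warm-up batch small in practice).
        assign = torch.cdist(features, centroids).argmin(dim=1)
        # Recompute each centroid as the mean of its assigned features.
        sums = torch.zeros_like(centroids)
        counts = torch.zeros(n_codes, dtype=features.dtype, device=features.device)
        sums.index_add_(0, assign, features)
        counts.index_add_(0, assign, torch.ones_like(assign, dtype=features.dtype))
        nonempty = counts > 0
        centroids[nonempty] = sums[nonempty] / counts[nonempty].unsqueeze(1)
    return centroids
```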
- `--model`: the encoder, decoder, and teacher model can be modified in `modeling_vqkd.py` according to your needs.

An input image can be compressed into quantized codes like this:
```bash
python test_get_code.py
```
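Programmatically, the tokenizer maps each 16x16 patch of a 224x224 image to one of the 8192 codebook indices. A hedged sketch of that flow, assuming you run from the `beit2` directory so `modeling_vqkd` registers its models with timm, and that the tokenizer exposes `get_codebook_indices` as used in `test_get_code.py` (check that script for the exact loading arguments):

```python
import torch
from timm.models import create_model

import modeling_vqkd  # noqa: F401  (registers the vqkd_* models with timm)

# Hypothetical loading flow; test_get_code.py shows the repo's actual usage.
tokenizer = create_model('vqkd_encoder_base_decoder_3x768x12_clip', pretrained=False)
state = torch.load('/path/to/tokenizer_checkpoint.pth', map_location='cpu')
tokenizer.load_state_dict(state.get('model', state))  # assumes a 'model' key or a raw state dict
tokenizer.eval()

images = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image batch
with torch.no_grad():
    codes = tokenizer.get_codebook_indices(images)
print(codes.shape)                     # (1, 196): one code index per 16x16 patch
```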
We provide several trained VQ-KD tokenizers below.
| model name | encoder layers | decoder layers | teacher model | codebook usage | weight |
|---|---|---|---|---|---|
| vqkd_encoder_base_decoder_1x768x12_clip | 12 | 1 | CLIP ViT-B/16 | 100% | link |
| vqkd_encoder_base_decoder_3x768x12_clip | 12 | 3 | CLIP ViT-B/16 | 97% | link |
| vqkd_encoder_base_decoder_1x768x12_dino | 12 | 1 | DINO ViT-B/16 | 100% | link |
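Here, codebook usage is the fraction of the 8192 codes that are actually hit when tokenizing a held-out set; low usage signals codebook collapse. A simple way to measure it, assuming a `tokenizer` loaded as sketched above and an iterable of preprocessed image batches:

```python
import torch

def codebook_usage(tokenizer, batches, n_codes=8192):
    """Fraction of codebook entries used at least once over the given batches."""
    used = torch.zeros(n_codes, dtype=torch.bool)
    with torch.no_grad():
        for images in batches:
            codes = tokenizer.get_codebook_indices(images)
            used[codes.flatten().cpu()] = True
    return used.float().mean().item()
```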