# Masked Autoencoders Are Scalable Vision Learners (MAE)


A TF2 implementation of [Masked Autoencoders (MAE)](https://arxiv.org/abs/2111.06377).

## ImageNet pretrain

| Model       | Resolution | Patch size | Batch size | Epochs | Target pixel norm | Val MSE |
| ----------- | ---------- | ---------- | ---------- | ------ | ----------------- | ------- |
| (a) ViT-L14 | 224x224    | 14         | 4096       | 800    | no                | 0.2456  |
| (b) ViT-L14 | 224x224    | 14         | 4096       | 800    | yes               | 0.3630  |
| (c) ViT-L16 | 224x224    | 16         | 4096       | 800    | yes               | 0.3866  |
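The "target pixel norm" column toggles MAE's normalized-pixel reconstruction target: each patch is standardized by its own mean and variance before the MSE is computed, which is why the "yes" rows report a higher validation MSE (the loss is measured against a differently scaled target, not against raw pixels). A minimal NumPy sketch of that loss; the function name and shapes are illustrative, not this repo's API:

```python
import numpy as np

def normalized_target_mse(pred, patches, eps=1e-6):
    """MSE against per-patch-normalized pixel targets.

    pred, patches: [num_patches, patch_dim] arrays. With target pixel
    normalization enabled, each patch is standardized by its own mean
    and variance before the reconstruction loss is taken.
    """
    mean = patches.mean(axis=-1, keepdims=True)
    var = patches.var(axis=-1, keepdims=True)
    target = (patches - mean) / np.sqrt(var + eps)
    return ((pred - target) ** 2).mean()
```

Without normalization, the same loss would simply compare `pred` to the raw `patches`, so MSE values from the two settings are not directly comparable.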

## ImageNet linear probing

| Model   | Resolution | Patch size | Base learning rate | Batch size | Init checkpoint | Epochs | Top-1 acc | Dashboard   |
| ------- | ---------- | ---------- | ------------------ | ---------- | --------------- | ------ | --------- | ----------- |
| ViT-L14 | 224x224    | 14         | 0.1                | 16384      | (b)             | 90     | 72.8      | -           |
| ViT-L16 | 224x224    | 16         | 0.1                | 16384      | (c)             | 90     | 73.0      | -           |
| ViT-L16 | 224x224    | 16         | 0.1                | 16384      | norm            | 90     | 73.9      | Table 1 (d) |
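In linear probing, the pretrained encoder (the "init checkpoint" column says which pretrain run supplies it) is frozen, and only a linear classifier on top of its features is trained. The repo trains that layer with SGD at the base learning rates and batch sizes in the table; purely to illustrate the idea, here is a closed-form ridge-regression probe over fixed features, with NumPy standing in for the frozen ViT encoder (all names are illustrative):

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes, l2=1e-3):
    """Closed-form ridge-regression probe on frozen features.

    features: [n, d] outputs of a frozen, pretrained encoder.
    labels:   [n] integer class ids.
    Returns a [d, num_classes] weight matrix; the encoder is never
    updated, only this linear layer is fit.
    """
    onehot = np.eye(num_classes)[labels]
    d = features.shape[1]
    return np.linalg.solve(features.T @ features + l2 * np.eye(d),
                           features.T @ onehot)

def probe_accuracy(w, features, labels):
    """Top-1 accuracy of the linear probe on the given features."""
    return (np.argmax(features @ w, axis=-1) == labels).mean()
```

Because the encoder is frozen, probe accuracy measures how linearly separable the pretrained representations already are, which is why it lags the finetuning numbers below.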

## ImageNet finetune

| Model   | Resolution | Patch size | Base learning rate | Batch size | Init checkpoint | Epochs | Top-1 acc | Dashboard        |
| ------- | ---------- | ---------- | ------------------ | ---------- | --------------- | ------ | --------- | ---------------- |
| ViT-L14 | 224x224    | 14         | 0.001              | 1024       | (a)             | 50     | 84.4      | -                |
| ViT-L14 | 224x224    | 14         | 0.001              | 1024       | (b)             | 50     | 85.3      | -                |
| ViT-L14 | 224x224    | 14         | 0.00075            | 1024       | (b)             | 50     | 85.4      | -                |
| ViT-L14 | 224x224    | 14         | 0.0001             | 4096       | scratch         | 200    | 82.4      | -                |
| ViT-L16 | 224x224    | 16         | 0.001              | 1024       | (c)             | 50     | 84.9      | -                |
| ViT-L16 | 224x224    | 16         | 0.001              | 1024       | no-norm         | 50     | 84.9      | Table 1 (d)      |
| ViT-L16 | 224x224    | 16         | 0.001              | 1024       | norm            | 50     | 85.4      | paper section 4. |
| ViT-L16 | 224x224    | 16         | 0.0001             | 4096       | scratch         | 200    | 82.5      | paper section 4. |
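The MAE paper sets the actual optimizer step size from the "base learning rate" via the linear scaling rule `lr = base_lr * batch_size / 256`. This README does not say whether the implementation applies the same rule, but assuming it does, the effective rates implied by the tables can be computed as:

```python
def scaled_lr(base_lr, batch_size, base_batch=256):
    """Linear learning-rate scaling rule: lr = base_lr * batch_size / 256.

    Assumed convention from the MAE paper; not confirmed for this repo.
    """
    return base_lr * batch_size / base_batch

# e.g. finetune rows: base_lr 0.001 at batch 1024 -> effective lr 0.004
# linear probing rows: base_lr 0.1 at batch 16384 -> effective lr 6.4
```

Under that assumption, the large probing batch size (16384) is what makes the seemingly large base rate of 0.1 behave sensibly.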

Known discrepancies with the paper:

- Linear probing top-1 accuracy (w/ norm) is ~0.9 below the paper's result with patch size 16.
- Finetune top-1 accuracy (w/ norm) is ~0.5 below the paper's result with patch size 16.