# Masked Autoencoders Are Scalable Vision Learners (MAE)


A TF2 implementation of [Masked Autoencoders (MAE)](https://arxiv.org/abs/2111.06377).

## ImageNet pretrain

| Model       | Resolution | Patch size | Batch size | Epochs | Target pixel norm | Val MSE |
| ----------- | ---------- | ---------- | ---------- | ------ | ----------------- | ------- |
| (a) ViT-L14 | 224x224    | 14         | 4096       | 800    | no                | 0.2456  |
| (b) ViT-L14 | 224x224    | 14         | 4096       | 800    | yes               | 0.3630  |
| (c) ViT-L16 | 224x224    | 16         | 4096       | 800    | yes               | 0.3866  |
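The "target pixel norm" column toggles MAE's normalized-pixel reconstruction target: each patch is standardized by its own mean and variance before the MSE is computed, which is why the "yes" rows report a higher validation MSE (the loss is measured against a differently scaled target, not against raw pixels). A minimal NumPy sketch of that loss; the function name and shapes are illustrative, not this repo's API:

```python
import numpy as np

def normalized_target_mse(pred, patches, eps=1e-6):
    """MSE against per-patch-normalized pixel targets.

    pred, patches: [num_patches, patch_dim] arrays. With target pixel
    normalization enabled, each patch is standardized by its own mean
    and variance before the reconstruction loss is taken.
    """
    mean = patches.mean(axis=-1, keepdims=True)
    var = patches.var(axis=-1, keepdims=True)
    target = (patches - mean) / np.sqrt(var + eps)
    return ((pred - target) ** 2).mean()
```

Without normalization, the same loss would simply compare `pred` to the raw `patches`, so MSE values from the two settings are not directly comparable.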

## ImageNet linear probing

| Model   | Resolution | Patch size | Base learning rate | Batch size | Init checkpoint | Epochs | Top-1 acc | Dashboard   |
| ------- | ---------- | ---------- | ------------------ | ---------- | --------------- | ------ | --------- | ----------- |
| ViT-L14 | 224x224    | 14         | 0.1                | 16384      | (b)             | 90     | 72.8      | -           |
| ViT-L16 | 224x224    | 16         | 0.1                | 16384      | (c)             | 90     | 73.0      | -           |
| ViT-L16 | 224x224    | 16         | 0.1                | 16384      | norm            | 90     | 73.9      | Table 1 (d) |
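In linear probing, the pretrained encoder (the "init checkpoint" column says which pretrain run supplies it) is frozen, and only a linear classifier on top of its features is trained. The repo trains that layer with SGD at the base learning rates and batch sizes in the table; purely to illustrate the idea, here is a closed-form ridge-regression probe over fixed features, with NumPy standing in for the frozen ViT encoder (all names are illustrative):

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes, l2=1e-3):
    """Closed-form ridge-regression probe on frozen features.

    features: [n, d] outputs of a frozen, pretrained encoder.
    labels:   [n] integer class ids.
    Returns a [d, num_classes] weight matrix; the encoder is never
    updated, only this linear layer is fit.
    """
    onehot = np.eye(num_classes)[labels]
    d = features.shape[1]
    return np.linalg.solve(features.T @ features + l2 * np.eye(d),
                           features.T @ onehot)

def probe_accuracy(w, features, labels):
    """Top-1 accuracy of the linear probe on the given features."""
    return (np.argmax(features @ w, axis=-1) == labels).mean()
```

Because the encoder is frozen, probe accuracy measures how linearly separable the pretrained representations already are, which is why it lags the finetuning numbers below.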

## ImageNet finetune

| Model   | Resolution | Patch size | Base learning rate | Batch size | Init checkpoint | Epochs | Top-1 acc | Dashboard        |
| ------- | ---------- | ---------- | ------------------ | ---------- | --------------- | ------ | --------- | ---------------- |
| ViT-L14 | 224x224    | 14         | 0.001              | 1024       | (a)             | 50     | 84.4      | -                |
| ViT-L14 | 224x224    | 14         | 0.001              | 1024       | (b)             | 50     | 85.3      | -                |
| ViT-L14 | 224x224    | 14         | 0.00075            | 1024       | (b)             | 50     | 85.4      | -                |
| ViT-L14 | 224x224    | 14         | 0.0001             | 4096       | scratch         | 200    | 82.4      | -                |
| ViT-L16 | 224x224    | 16         | 0.001              | 1024       | (c)             | 50     | 84.9      | -                |
| ViT-L16 | 224x224    | 16         | 0.001              | 1024       | no-norm         | 50     | 84.9      | Table 1 (d)      |
| ViT-L16 | 224x224    | 16         | 0.001              | 1024       | norm            | 50     | 85.4      | paper section 4. |
| ViT-L16 | 224x224    | 16         | 0.0001             | 4096       | scratch         | 200    | 82.5      | paper section 4. |
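The MAE paper sets the actual optimizer step size from the "base learning rate" via the linear scaling rule `lr = base_lr * batch_size / 256`. This README does not say whether the implementation applies the same rule, but assuming it does, the effective rates implied by the tables can be computed as:

```python
def scaled_lr(base_lr, batch_size, base_batch=256):
    """Linear learning-rate scaling rule: lr = base_lr * batch_size / 256.

    Assumed convention from the MAE paper; not confirmed for this repo.
    """
    return base_lr * batch_size / base_batch

# e.g. finetune rows: base_lr 0.001 at batch 1024 -> effective lr 0.004
# linear probing rows: base_lr 0.1 at batch 16384 -> effective lr 6.4
```

Under that assumption, the large probing batch size (16384) is what makes the seemingly large base rate of 0.1 behave sensibly.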

Known discrepancies with the paper:

- Linear probing top-1 accuracy (w/ norm) is ~0.9 below the paper's result with patch size 16.
- Finetune top-1 accuracy (w/ norm) is ~0.5 below the paper's result with patch size 16.