MODELHUB

The access code for the Baidu links is `swin`.

ImageNet-1K and ImageNet-22K Pretrained Swin-V1 Models

| name | pretrain | resolution | acc@1 | acc@5 | #params | FLOPs | FPS | 22K model | 1K model |
|------|----------|------------|-------|-------|---------|-------|-----|-----------|----------|
| Swin-T | ImageNet-1K | 224x224 | 81.2 | 95.5 | 28M | 4.5G | 755 | - | github/baidu/config/log |
| Swin-S | ImageNet-1K | 224x224 | 83.2 | 96.2 | 50M | 8.7G | 437 | - | github/baidu/config/log |
| Swin-B | ImageNet-1K | 224x224 | 83.5 | 96.5 | 88M | 15.4G | 278 | - | github/baidu/config/log |
| Swin-B | ImageNet-1K | 384x384 | 84.5 | 97.0 | 88M | 47.1G | 85 | - | github/baidu/config |
| Swin-T | ImageNet-22K | 224x224 | 80.9 | 96.0 | 28M | 4.5G | 755 | github/baidu/config | github/baidu/config |
| Swin-S | ImageNet-22K | 224x224 | 83.2 | 97.0 | 50M | 8.7G | 437 | github/baidu/config | github/baidu/config |
| Swin-B | ImageNet-22K | 224x224 | 85.2 | 97.5 | 88M | 15.4G | 278 | github/baidu/config | github/baidu/config |
| Swin-B | ImageNet-22K | 384x384 | 86.4 | 98.0 | 88M | 47.1G | 85 | github/baidu | github/baidu/config |
| Swin-L | ImageNet-22K | 224x224 | 86.3 | 97.9 | 197M | 34.5G | 141 | github/baidu/config | github/baidu/config |
| Swin-L | ImageNet-22K | 384x384 | 87.3 | 98.2 | 197M | 103.9G | 42 | github/baidu | github/baidu/config |
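The accuracy/compute trade-off in the table above can be queried programmatically. The sketch below is purely illustrative (standard-library Python, a few rows copied from the table; the helper name `best_under_budget` is hypothetical): it picks the highest-acc@1 checkpoint that fits a FLOPs budget.

```python
# Illustrative subset of (name, pretrain, FLOPs in G, acc@1) rows
# copied from the Swin-V1 table above.
ROWS = [
    ("Swin-T", "ImageNet-1K", 4.5, 81.2),
    ("Swin-S", "ImageNet-1K", 8.7, 83.2),
    ("Swin-B", "ImageNet-22K", 15.4, 85.2),
    ("Swin-L", "ImageNet-22K", 34.5, 86.3),
]

def best_under_budget(rows, max_gflops):
    """Return the highest-acc@1 entry whose FLOPs fit the budget."""
    fitting = [r for r in rows if r[2] <= max_gflops]
    return max(fitting, key=lambda r: r[3]) if fitting else None

print(best_under_budget(ROWS, 10.0))  # -> ('Swin-S', 'ImageNet-1K', 8.7, 83.2)
```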

ImageNet-1K and ImageNet-22K Pretrained Swin-V2 Models

| name | pretrain | resolution | window | acc@1 | acc@5 | #params | FLOPs | FPS | 22K model | 1K model |
|------|----------|------------|--------|-------|-------|---------|-------|-----|-----------|----------|
| SwinV2-T | ImageNet-1K | 256x256 | 8x8 | 81.8 | 95.9 | 28M | 5.9G | 572 | - | github/baidu/config |
| SwinV2-S | ImageNet-1K | 256x256 | 8x8 | 83.7 | 96.6 | 50M | 11.5G | 327 | - | github/baidu/config |
| SwinV2-B | ImageNet-1K | 256x256 | 8x8 | 84.2 | 96.9 | 88M | 20.3G | 217 | - | github/baidu/config |
| SwinV2-T | ImageNet-1K | 256x256 | 16x16 | 82.8 | 96.2 | 28M | 6.6G | 437 | - | github/baidu/config |
| SwinV2-S | ImageNet-1K | 256x256 | 16x16 | 84.1 | 96.8 | 50M | 12.6G | 257 | - | github/baidu/config |
| SwinV2-B | ImageNet-1K | 256x256 | 16x16 | 84.6 | 97.0 | 88M | 21.8G | 174 | - | github/baidu/config |
| SwinV2-B<sup>*</sup> | ImageNet-22K | 256x256 | 16x16 | 86.2 | 97.9 | 88M | 21.8G | 174 | github/baidu/config | github/baidu/config |
| SwinV2-B<sup>*</sup> | ImageNet-22K | 384x384 | 24x24 | 87.1 | 98.2 | 88M | 54.7G | 57 | github/baidu/config | github/baidu/config |
| SwinV2-L<sup>*</sup> | ImageNet-22K | 256x256 | 16x16 | 86.9 | 98.0 | 197M | 47.5G | 95 | github/baidu/config | github/baidu/config |
| SwinV2-L<sup>*</sup> | ImageNet-22K | 384x384 | 24x24 | 87.6 | 98.3 | 197M | 115.4G | 33 | github/baidu/config | github/baidu/config |

Note:

- The SwinV2-B<sup>*</sup> and SwinV2-L<sup>*</sup> models at both the 256x256 and 384x384 input resolutions are fine-tuned from the same pre-trained model, which uses a smaller input resolution of 192x192.
- SwinV2-B<sup>*</sup> (384x384) achieves 78.08 acc@1 on ImageNet-1K-V2, while SwinV2-L<sup>*</sup> (384x384) achieves 78.31.

ImageNet-1K Pretrained Swin MLP Models

| name | pretrain | resolution | acc@1 | acc@5 | #params | FLOPs | FPS | 1K model |
|------|----------|------------|-------|-------|---------|-------|-----|----------|
| Mixer-B/16 | ImageNet-1K | 224x224 | 76.4 | - | 59M | 12.7G | - | official repo |
| ResMLP-S24 | ImageNet-1K | 224x224 | 79.4 | - | 30M | 6.0G | 715 | timm |
| ResMLP-B24 | ImageNet-1K | 224x224 | 81.0 | - | 116M | 23.0G | 231 | timm |
| Swin-T/C24 | ImageNet-1K | 256x256 | 81.6 | 95.7 | 28M | 5.9G | 563 | github/baidu/config |
| SwinMLP-T/C24 | ImageNet-1K | 256x256 | 79.4 | 94.6 | 20M | 4.0G | 807 | github/baidu/config |
| SwinMLP-T/C12 | ImageNet-1K | 256x256 | 79.6 | 94.7 | 21M | 4.0G | 792 | github/baidu/config |
| SwinMLP-T/C6 | ImageNet-1K | 256x256 | 79.7 | 94.9 | 23M | 4.0G | 766 | github/baidu/config |
| SwinMLP-B | ImageNet-1K | 224x224 | 81.3 | 95.3 | 61M | 10.4G | 409 | github/baidu/config |

Note: C24 means each head has 24 channels.
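Fixing the per-head channel count determines the head count at each stage. A minimal sketch of that arithmetic, assuming the standard Swin-T stage widths of 96/192/384/768 (an assumption; the widths are not stated in this table, and `heads_per_stage` is a hypothetical helper):

```python
# Standard Swin-T embed dims per stage (assumed, not from the table above).
STAGE_DIMS = [96, 192, 384, 768]

def heads_per_stage(stage_dims, head_dim):
    """Derive the number of attention heads per stage when each head
    is fixed to `head_dim` channels (the "C24" naming convention)."""
    # Each stage width must split evenly into fixed-size heads.
    assert all(d % head_dim == 0 for d in stage_dims)
    return [d // head_dim for d in stage_dims]

print(heads_per_stage(STAGE_DIMS, 24))  # -> [4, 8, 16, 32]
```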

ImageNet-22K Pretrained Swin-MoE Models

| name | #experts | k | router | resolution | window | IN-22K acc@1 | IN-1K/ft acc@1 | IN-1K/5-shot acc@1 | 22K model |
|------|----------|---|--------|------------|--------|--------------|----------------|--------------------|-----------|
| Swin-MoE-S | 1 (dense) | - | - | 192x192 | 8x8 | 35.5 | 83.5 | 70.3 | github/baidu/config |
| Swin-MoE-S | 8 | 1 | Linear | 192x192 | 8x8 | 36.8 | 84.5 | 75.2 | github/baidu/config |
| Swin-MoE-S | 16 | 1 | Linear | 192x192 | 8x8 | 37.6 | 84.9 | 76.5 | github/baidu/config |
| Swin-MoE-S | 32 | 1 | Linear | 192x192 | 8x8 | 37.4 | 84.7 | 75.9 | github/baidu/config |
| Swin-MoE-S | 32 | 1 | Cosine | 192x192 | 8x8 | 37.2 | 84.3 | 75.2 | github/baidu/config |
| Swin-MoE-S | 64 | 1 | Linear | 192x192 | 8x8 | 37.8 | 84.7 | 75.7 | - |
| Swin-MoE-S | 128 | 1 | Linear | 192x192 | 8x8 | 37.4 | 84.5 | 75.4 | - |
| Swin-MoE-B | 1 (dense) | - | - | 192x192 | 8x8 | 37.3 | 85.1 | 75.9 | config |
| Swin-MoE-B | 8 | 1 | Linear | 192x192 | 8x8 | 38.1 | 85.3 | 77.2 | config |
| Swin-MoE-B | 16 | 1 | Linear | 192x192 | 8x8 | 38.7 | 85.5 | 78.2 | config |
| Swin-MoE-B | 32 | 1 | Linear | 192x192 | 8x8 | 38.6 | 85.5 | 77.9 | config |
| Swin-MoE-B | 32 | 1 | Cosine | 192x192 | 8x8 | 38.5 | 85.3 | 77.3 | config |
| Swin-MoE-B | 32 | 2 | Linear | 192x192 | 8x8 | 38.6 | 85.5 | 78.7 | - |
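In the table above, `k` is the number of experts each token is routed to per MoE layer. A minimal sketch of generic top-k gating (standard-library Python, a common MoE pattern, not the repository's implementation; `topk_gating` is a hypothetical helper):

```python
import math

def topk_gating(logits, k):
    """Pick the top-k experts for one token and renormalize their
    softmax weights so they sum to 1 (a common top-k MoE gate)."""
    # Softmax over all expert logits (shifted by the max for stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k highest-probability experts, renormalized.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    z = sum(probs[i] for i in top)
    return {i: probs[i] / z for i in top}

# With k=1 a token uses exactly one expert; with k=2 it blends two.
print(topk_gating([0.1, 2.0, -1.0, 0.5], k=1))  # expert 1 with weight 1.0
```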

SimMIM Pretrained Swin-V2 Models

Please note that starting July 2024, all SimMIM-pretrained Swin-V2 models will be stored in the Hugging Face repository; refer to it for more details.

- Model size only includes the backbone weights and excludes weights in the decoders/classification heads.
- Batch size for all models is set to 2048.
- Validation loss is calculated on the ImageNet-1K validation set.
- Fine-tuned acc@1 refers to the top-1 accuracy on the ImageNet-1K validation set after fine-tuning.
| name | model size | pre-train dataset | pre-train iterations | validation loss | fine-tuned acc@1 | pre-trained model | fine-tuned model |
|------|------------|-------------------|----------------------|-----------------|------------------|-------------------|------------------|
| SwinV2-Small | 49M | ImageNet-1K 10% | 125k | 0.4820 | 82.69 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K 10% | 250k | 0.4961 | 83.11 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K 10% | 500k | 0.5115 | 83.17 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K 20% | 125k | 0.4751 | 83.05 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K 20% | 250k | 0.4722 | 83.56 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K 20% | 500k | 0.4734 | 83.75 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K 50% | 125k | 0.4732 | 83.04 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K 50% | 250k | 0.4681 | 83.67 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K 50% | 500k | 0.4646 | 83.96 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K | 125k | 0.4728 | 82.92 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K | 250k | 0.4674 | 83.66 | huggingface | huggingface |
| SwinV2-Small | 49M | ImageNet-1K | 500k | 0.4641 | 84.08 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K 10% | 125k | 0.4822 | 83.33 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K 10% | 250k | 0.4997 | 83.60 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K 10% | 500k | 0.5112 | 83.41 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K 20% | 125k | 0.4703 | 83.86 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K 20% | 250k | 0.4679 | 84.37 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K 20% | 500k | 0.4711 | 84.61 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K 50% | 125k | 0.4683 | 84.04 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K 50% | 250k | 0.4633 | 84.57 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K 50% | 500k | 0.4598 | 84.95 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K | 125k | 0.4680 | 84.13 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K | 250k | 0.4626 | 84.65 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-1K | 500k | 0.4588 | 85.04 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-22K | 125k | 0.4695 | 84.11 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-22K | 250k | 0.4649 | 84.57 | huggingface | huggingface |
| SwinV2-Base | 87M | ImageNet-22K | 500k | 0.4614 | 85.11 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K 10% | 125k | 0.4995 | 83.69 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K 10% | 250k | 0.5140 | 83.66 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K 10% | 500k | 0.5150 | 83.50 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K 20% | 125k | 0.4675 | 84.38 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K 20% | 250k | 0.4746 | 84.71 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K 20% | 500k | 0.4960 | 84.59 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K 50% | 125k | 0.4622 | 84.78 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K 50% | 250k | 0.4566 | 85.38 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K 50% | 500k | 0.4530 | 85.80 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K | 125k | 0.4611 | 84.98 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K | 250k | 0.4552 | 85.45 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-1K | 500k | 0.4507 | 85.91 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-22K | 125k | 0.4649 | 84.61 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-22K | 250k | 0.4586 | 85.39 | huggingface | huggingface |
| SwinV2-Large | 195M | ImageNet-22K | 500k | 0.4536 | 85.81 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-1K 20% | 125k | 0.4789 | 84.35 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-1K 20% | 250k | 0.5038 | 84.16 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-1K 20% | 500k | 0.5071 | 83.44 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-1K 50% | 125k | 0.4549 | 85.09 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-1K 50% | 250k | 0.4511 | 85.64 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-1K 50% | 500k | 0.4559 | 85.69 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-1K | 125k | 0.4531 | 85.23 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-1K | 250k | 0.4464 | 85.90 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-1K | 500k | 0.4416 | 86.34 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-22K | 125k | 0.4564 | 85.14 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-22K | 250k | 0.4499 | 85.86 | huggingface | huggingface |
| SwinV2-Huge | 655M | ImageNet-22K | 500k | 0.4444 | 86.27 | huggingface | huggingface |
| SwinV2-giant | 1.06B | ImageNet-1K 50% | 125k | 0.4534 | 85.44 | huggingface | huggingface |
| SwinV2-giant | 1.06B | ImageNet-1K 50% | 250k | 0.4515 | 85.76 | huggingface | huggingface |
| SwinV2-giant | 1.06B | ImageNet-1K 50% | 500k | 0.4719 | 85.51 | huggingface | huggingface |
| SwinV2-giant | 1.06B | ImageNet-1K | 125k | 0.4513 | 85.57 | huggingface | huggingface |
| SwinV2-giant | 1.06B | ImageNet-1K | 250k | 0.4442 | 86.12 | huggingface | huggingface |
| SwinV2-giant | 1.06B | ImageNet-1K | 500k | 0.4395 | 86.46 | huggingface | huggingface |
| SwinV2-giant | 1.06B | ImageNet-22K | 125k | 0.4544 | 85.39 | huggingface | huggingface |
| SwinV2-giant | 1.06B | ImageNet-22K | 250k | 0.4475 | 85.96 | huggingface | huggingface |
| SwinV2-giant | 1.06B | ImageNet-22K | 500k | 0.4416 | 86.53 | huggingface | huggingface |
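Since validation loss is computed on the ImageNet-1K validation set, it is a natural criterion for choosing among pre-training checkpoints of the same model and dataset. A purely illustrative sketch (standard-library Python, rows copied from the SwinV2-Base / full ImageNet-1K entries above; `pick_by_val_loss` is a hypothetical helper):

```python
# (pre-train iterations, validation loss, fine-tuned acc@1) rows copied
# from the SwinV2-Base / ImageNet-1K entries in the SimMIM table above.
BASE_IN1K = [
    (125_000, 0.4680, 84.13),
    (250_000, 0.4626, 84.65),
    (500_000, 0.4588, 85.04),
]

def pick_by_val_loss(rows):
    """Select the pre-training checkpoint with the lowest validation loss."""
    return min(rows, key=lambda r: r[1])

iters, loss, acc = pick_by_val_loss(BASE_IN1K)
print(iters, acc)  # -> 500000 85.04
```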

SimMIM Pretrained Swin-V1 Models

ImageNet-1K Pre-trained and Fine-tuned Models

| name | pre-train epochs | pre-train resolution | fine-tune resolution | acc@1 | pre-trained model | fine-tuned model |
|------|------------------|----------------------|----------------------|-------|-------------------|------------------|
| Swin-Base | 100 | 192x192 | 192x192 | 82.8 | google/config | google/config |
| Swin-Base | 100 | 192x192 | 224x224 | 83.5 | google/config | google/config |
| Swin-Base | 800 | 192x192 | 224x224 | 84.0 | google/config | google/config |
| Swin-Large | 800 | 192x192 | 224x224 | 85.4 | google/config | google/config |
| SwinV2-Huge | 800 | 192x192 | 224x224 | 85.7 | / | / |
| SwinV2-Huge | 800 | 192x192 | 512x512 | 87.1 | / | / |