June 26, 2025
- MobileNetV5 backbone (w/ encoder only variant) for Gemma 3n image encoder
- Version 1.0.16 released
June 23, 2025
- Add `F.grid_sample` based 2D and factorized pos embed resize to NaFlexViT. Faster when many different sizes are in use (based on an example by https://github.com/stas-sl). A minimal sketch of the idea appears at the end of this section.
- Further speed up patch embed resample by replacing vmap with matmul (based on a snippet by https://github.com/stas-sl).
- Add 3 initial native-aspect NaFlexViT checkpoints created while testing: trained on ImageNet-1k with 3 different pos embed configs and otherwise identical hparams (loading example below the table).
| Model | Top-1 Acc | Top-5 Acc | Params (M) | Eval Seq Len |
|---|---|---|---|---|
| naflexvit_base_patch16_par_gap.e300_s576_in1k | 83.67 | 96.45 | 86.63 | 576 |
| naflexvit_base_patch16_parfac_gap.e300_s576_in1k | 83.63 | 96.41 | 86.46 | 576 |
| naflexvit_base_patch16_gap.e300_s576_in1k | 83.50 | 96.46 | 86.63 | 576 |
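Assuming the weights are published on the Hub, the checkpoints above load by name like any other `timm` model:

```python
import timm

# Load one of the native-aspect NaFlexViT checkpoints by its full tag
model = timm.create_model('naflexvit_base_patch16_par_gap.e300_s576_in1k', pretrained=True)
```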
- Support gradient checkpointing for `forward_intermediates` and fix some checkpointing bugs. Thanks https://github.com/brianhou0208
- Add 'corrected weight decay' (https://arxiv.org/abs/2506.02285) as an option to AdamW (legacy), Adopt, Kron, Adafactor (BV), Lamb, LaProp, Lion, NadamW, RmsPropTF, and SGDW optimizers
- Switch PE (Perception Encoder) ViT models to use native `timm` weights instead of remapping on the fly
- Fix CUDA stream bug in prefetch loader
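For reference, the core of an `F.grid_sample` based 2D pos embed resize looks roughly like the sketch below. This is an illustrative reimplementation of the technique, not timm's actual code; the function name and the layout assumptions (square row-major grid, no prefix tokens) are mine.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed_2d(pos_embed: torch.Tensor, new_hw: tuple) -> torch.Tensor:
    """Bilinearly resample a (1, H*W, C) grid pos embed to a new (H, W) grid."""
    _, num_tokens, dim = pos_embed.shape
    old_h = old_w = int(num_tokens ** 0.5)  # assumes a square grid, no class token
    new_h, new_w = new_hw
    # (1, H*W, C) -> (1, C, H, W) so grid_sample can treat it as an image
    grid = pos_embed.reshape(1, old_h, old_w, dim).permute(0, 3, 1, 2)
    # Normalized sample coordinates in [-1, 1] for the target grid
    ys = torch.linspace(-1, 1, new_h, device=pos_embed.device, dtype=pos_embed.dtype)
    xs = torch.linspace(-1, 1, new_w, device=pos_embed.device, dtype=pos_embed.dtype)
    yy, xx = torch.meshgrid(ys, xs, indexing='ij')
    coords = torch.stack([xx, yy], dim=-1).unsqueeze(0)  # (1, new_h, new_w, 2), (x, y) order
    out = F.grid_sample(grid, coords, mode='bilinear', align_corners=True)
    # (1, C, new_h, new_w) -> (1, new_h*new_w, C)
    return out.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)
```

The sampling-grid formulation also batches naturally across several target sizes in one call, which is presumably where it pays off over per-size interpolation when many sizes are in play.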
June 5, 2025
- Initial NaFlexVit model code. NaFlexVit is a Vision Transformer with:
  - Encapsulated embedding and position encoding in a single module
  - Support for nn.Linear patch embedding on pre-patchified (dictionary) inputs
  - Support for NaFlex variable aspect, variable resolution (SigLIP-2: https://arxiv.org/abs/2502.14786)
  - Support for FlexiViT variable patch size (https://arxiv.org/abs/2212.08013)
  - Support for NaViT fractional/factorized position embedding (https://arxiv.org/abs/2307.06304)
- Existing vit models in `vision_transformer.py` can be loaded into the NaFlexVit model by adding the `use_naflex=True` flag to `create_model` (see the sketch at the end of this section)
- Some native weights coming soon
- A full NaFlex data pipeline is available that allows training / fine-tuning / evaluating with variable aspect / size images
  - To enable in `train.py` and `validate.py`, add the `--naflex-loader` arg; it must be used with a NaFlexVit
  - To evaluate an existing (classic) ViT loaded in NaFlexVit model w/ NaFlex data pipe:
    `python validate.py /imagenet --amp -j 8 --model vit_base_patch16_224 --model-kwargs use_naflex=True --naflex-loader --naflex-max-seq-len 256`
- The training has some extra args / features worth noting:
  - The `--naflex-train-seq-lens` argument specifies which sequence lengths to randomly pick from per batch during training
  - The `--naflex-max-seq-len` argument sets the target sequence length for validation
  - Adding `--model-kwargs enable_patch_interpolator=True --naflex-patch-sizes 12 16 24` will enable random patch size selection per batch w/ interpolation
  - The `--naflex-loss-scale` arg changes the loss scaling mode per batch relative to the batch size; `timm` NaFlex loading changes the batch size for each seq len
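Putting the pieces above together, loading a classic ViT into the NaFlexVit implementation looks like this (a minimal sketch; it assumes the NaFlexVit forward also accepts standard image tensors, not just the pre-patchified dictionary inputs):

```python
import timm
import torch

# Load a classic ViT's pretrained weights into the NaFlexVit implementation
model = timm.create_model('vit_base_patch16_224', pretrained=True, use_naflex=True)
model.eval()

# A standard fixed-size tensor input, as a sanity check that the remap worked
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # torch.Size([1, 1000])
```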
May 28, 2025
- Add a number of small/fast models thanks to https://github.com/brianhou0208
  - SwiftFormer - (ICCV2023) SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications
  - FasterNet - (CVPR2023) Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks
  - SHViT - (CVPR2024) SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design
  - StarNet - (CVPR2024) Rewrite the Stars
  - GhostNet-V3 - GhostNetV3: Exploring the Training Strategies for Compact Models
- Update EVA ViT (closest match) to support Perception Encoder models (https://arxiv.org/abs/2504.13181) from Meta, loading Hub weights, but I still need to push dedicated `timm` weights
- Add some flexibility to ROPE impl
- Big increase in number of models supporting `forward_intermediates()` and some additional fixes thanks to https://github.com/brianhou0208
  - DaViT, EdgeNeXt, EfficientFormerV2, EfficientViT(MIT), EfficientViT(MSRA), FocalNet, GCViT, HGNet/V2, InceptionNeXt, Inception-V4, MambaOut, MetaFormer, NesT, Next-ViT, PiT, PVT V2, RepGhostNet, RepViT, ResNetV2, ReXNet, TinyViT, TResNet, VoV
- TNT model updated w/ new weights and `forward_intermediates()` support, thanks to https://github.com/brianhou0208
- Add `local-dir:` pretrained schema; can use `local-dir:/path/to/model/folder` as the model name to source model / pretrained cfg & weights in Hugging Face Hub format (config.json + weights file) from a local folder (see the sketch at the end of this section)
- Fixes and improvements for ONNX export
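As a usage sketch of the `local-dir:` schema (the path is a placeholder; the folder is assumed to hold a Hub-format config.json plus a weights file):

```python
import timm

# Source pretrained cfg & weights from a local Hugging Face Hub style folder
model = timm.create_model('local-dir:/path/to/model/folder', pretrained=True)
```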
What's Changed
- Fix arg merging of sknet, old seresnet. Fix #2470 by @rwightman in #2471
- Fix onnx export by @rwightman in #2475
- Add local-dir: schema support for model loading (config + weights) from folder by @rwightman in #2476
- Fix: Allow img_size to be int or tuple in PatchEmbed by @sddongxh in #2477
- Add LightlyTrain Integration for Pretraining Support by @yutong-xiang-97 in #2474
- Check forward_intermediates features against forward_features output by @rwightman in #2483
- More models support forward_intermediates by @brianhou0208 in #2482
- Update README.md by @atharva-pathak in #2484
- remove `download` argument from torch_kwargs for torchvision `ImageNet` class by @ryan-caesar-ramos in #2486
- Update TNT-(S/B) model weights and add feature extraction support by @brianhou0208 in #2480
- Add EVA ViT based PE (Perceptual Encoder) impl by @rwightman in #2487
- Add SwiftFormer, SHViT, StarNet, FasterNet and GhostNetV3 by @brianhou0208 in #2499
- A cleaned up beit3 remap onto vision_transformer.py vit by @rwightman in #2503
- Initial NaFlex ViT model and training support by @rwightman in #2466
- Forgot to compact attention pool branches after verifying by @rwightman in #2507
- Throw exception on non-directory path for pretrained weights by @emmanuel-ferdman in #2510
- Add corrected weight decay to several optimizers by @rwightman in #2511
- Doing some Claude enabled docstring, type annotation and other cleanup by @rwightman in #2504
- Fix #2513, be explicit about stream devices by @rwightman in #2515
- Update legacy AdamW impl so it has a multi-tensor impl like NAdamW (n… by @rwightman in #2517
- Fix `head_dim` reference in `AttentionRope` class of `attention.py` by @amorehead in #2519
- Refactor patch and pos embed resampling based on feedback from https://github.com/stas-sl by @rwightman in #2518
- Add initial weights for my first 3 naflexvit_base models by @rwightman in #2523
- Support gradient checkpointing in `forward_intermediates()` by @brianhou0208 in #2501
- Update README: add references for additional supported models by @brianhou0208 in #2526
- MobileNetV5 by @rwightman in #2527
New Contributors
- @sddongxh made their first contribution in #2477
- @yutong-xiang-97 made their first contribution in #2474
- @atharva-pathak made their first contribution in #2484
- @ryan-caesar-ramos made their first contribution in #2486
- @emmanuel-ferdman made their first contribution in #2510
- @amorehead made their first contribution in #2519
Full Changelog: v1.0.15...v1.0.16