
Release v1.0.16

@rwightman rwightman released this 26 Jun 18:44
7101adb

June 26, 2025

  • MobileNetV5 backbone (w/ encoder-only variant) for the Gemma 3n image encoder
  • Version 1.0.16 released

June 23, 2025

  • Add F.grid_sample based 2D and factorized pos embed resize to NaFlexViT. Faster when many different image sizes are used (based on an example by https://github.com/stas-sl).
  • Further speed up patch embed resample by replacing vmap with matmul (based on a snippet by https://github.com/stas-sl).
  • Add 3 initial native-aspect NaFlexViT checkpoints created while testing, trained on ImageNet-1k with 3 different pos embed configs and otherwise identical hparams:
| Model | Top-1 Acc | Top-5 Acc | Params (M) | Eval Seq Len |
|---|---|---|---|---|
| naflexvit_base_patch16_par_gap.e300_s576_in1k | 83.67 | 96.45 | 86.63 | 576 |
| naflexvit_base_patch16_parfac_gap.e300_s576_in1k | 83.63 | 96.41 | 86.46 | 576 |
| naflexvit_base_patch16_gap.e300_s576_in1k | 83.50 | 96.46 | 86.63 | 576 |
  • Support gradient checkpointing for forward_intermediates and fix some checkpointing bugs. Thanks to https://github.com/brianhou0208
  • Add 'corrected weight decay' (https://arxiv.org/abs/2506.02285) as an option to the AdamW (legacy), Adopt, Kron, Adafactor (BV), Lamb, LaProp, Lion, NadamW, RmsPropTF, and SGDW optimizers
  • Switch PE (perception encoder) ViT models to use native timm weights instead of remapping on the fly
  • Fix cuda stream bug in prefetch loader
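The F.grid_sample based pos embed resize mentioned above can be sketched in plain PyTorch. This is an illustrative standalone function, not timm's actual implementation; the function name and signature are assumptions:

```python
import torch
import torch.nn.functional as F


def resize_pos_embed_grid_sample(pos_embed, old_hw, new_hw):
    """Resample a 2D pos embed (1, H*W, C) to a new grid size via F.grid_sample.

    Hypothetical helper sketching the approach; timm's internals may differ.
    """
    H, W = old_hw
    h, w = new_hw
    _, _, C = pos_embed.shape
    # (1, H*W, C) -> (1, C, H, W) for grid_sample
    pe = pos_embed.reshape(1, H, W, C).permute(0, 3, 1, 2)
    # Normalized sampling grid in [-1, 1]; grid[..., 0] is x, grid[..., 1] is y
    ys = torch.linspace(-1, 1, h, dtype=pe.dtype)
    xs = torch.linspace(-1, 1, w, dtype=pe.dtype)
    gy, gx = torch.meshgrid(ys, xs, indexing='ij')
    grid = torch.stack([gx, gy], dim=-1).unsqueeze(0)  # (1, h, w, 2)
    out = F.grid_sample(pe, grid, mode='bilinear', align_corners=True)
    # (1, C, h, w) -> (1, h*w, C)
    return out.permute(0, 2, 3, 1).reshape(1, h * w, C)
```

Because `grid_sample` accepts an arbitrary sampling grid in one call, resizing to many different target sizes avoids the per-size interpolation setup cost.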

June 5, 2025

  • Initial NaFlexVit model code. NaFlexVit is a Vision Transformer with:
    1. Encapsulated embedding and position encoding in a single module
    2. Support for nn.Linear patch embedding on pre-patchified (dictionary) inputs
    3. Support for NaFlex variable aspect, variable resolution (SigLip-2: https://arxiv.org/abs/2502.14786)
    4. Support for FlexiViT variable patch size (https://arxiv.org/abs/2212.08013)
    5. Support for NaViT fractional/factorized position embedding (https://arxiv.org/abs/2307.06304)
  • Existing ViT models in vision_transformer.py can be loaded into the NaFlexVit model by adding the use_naflex=True flag to create_model
    • Some native weights coming soon
  • A full NaFlex data pipeline is available that allows training / fine-tuning / evaluating with variable aspect / size images
    • To enable in train.py and validate.py, add the --naflex-loader arg; it must be used with a NaFlexVit model
  • To evaluate an existing (classic) ViT loaded in the NaFlexVit model w/ the NaFlex data pipeline:
    • python validate.py /imagenet --amp -j 8 --model vit_base_patch16_224 --model-kwargs use_naflex=True --naflex-loader --naflex-max-seq-len 256
  • Training has some extra args worth noting:
    • The --naflex-train-seq-lens argument specifies which sequence lengths to randomly pick from per batch during training
    • The --naflex-max-seq-len argument sets the target sequence length for validation
    • Adding --model-kwargs enable_patch_interpolator=True --naflex-patch-sizes 12 16 24 will enable random patch size selection per batch w/ interpolation
    • The --naflex-loss-scale arg changes the loss scaling mode per batch relative to the batch size; the timm NaFlex loader changes the batch size for each seq len
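Point 2 of the NaFlexVit feature list above (nn.Linear patch embedding on pre-patchified inputs) can be sketched in plain PyTorch. The `patchify` helper here is illustrative, not timm's actual API:

```python
import torch
import torch.nn as nn


def patchify(images: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Flatten an image batch (B, C, H, W) into patch tokens (B, N, C*P*P)."""
    B, C, H, W = images.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image dims must be divisible by patch size"
    patches = images.unfold(2, P, P).unfold(3, P, P)  # (B, C, H//P, W//P, P, P)
    patches = patches.permute(0, 2, 3, 1, 4, 5)       # (B, H//P, W//P, C, P, P)
    return patches.reshape(B, -1, C * P * P)          # (B, N, C*P*P)


# An nn.Linear over pre-patchified tokens is equivalent to the usual Conv2d
# patch embedding, but it also works directly on variable-length patch
# sequences, which is what a dictionary-style NaFlex input provides.
embed = nn.Linear(3 * 16 * 16, 768)
tokens = embed(patchify(torch.randn(2, 3, 224, 224), 16))  # (2, 196, 768)
```

The linear layer's weight is just the Conv2d patch-embed kernel reshaped to (embed_dim, C*P*P), so the two formulations are interchangeable.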

May 28, 2025


Full Changelog: v1.0.15...v1.0.16