Eva02 weight_init #2565
-
I read in the paper that, in their experiments, the usual trunc_normal weight initialisation didn't work well and they used xavier_normal instead. I see that in the eva.py file, the _init_weights for linear layers is the same regardless of whether SwiGLU is used or not. The SwiGLU module also has its own weight initialisation function, but that one does not use xavier either. I am wondering what the reasoning for this is. Related topic:
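For context, here is a minimal sketch of the two schemes being contrasted, using PyTorch's torch.nn.init helpers (the helper names below are illustrative, not the exact functions in eva.py):

```python
import torch.nn as nn
from torch.nn.init import trunc_normal_, xavier_normal_

def init_linear_trunc_normal(m: nn.Module, std: float = 0.02) -> None:
    # Truncated-normal init on Linear weights (the usual timm-style default).
    if isinstance(m, nn.Linear):
        trunc_normal_(m.weight, std=std)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

def init_linear_xavier_normal(m: nn.Module) -> None:
    # Xavier (Glorot) normal init, the scheme the EVA-02 paper reports working better.
    if isinstance(m, nn.Linear):
        xavier_normal_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# Usage: model.apply(init_linear_xavier_normal) vs model.apply(init_linear_trunc_normal)
```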
-
@TonyCongqianWang I could add support for the other init approach, but the main reason it's not there is that it only matters for training from scratch, and I'm pretty sure the observation made in the EVA02 paper was for training from scratch with their combo of MIM + CLIP-style pretraining on very large datasets, which may or may not have an impact for other types of pretraining. All in all, given that most people are fine-tuning the EVA models from EVA weights, it's lower priority to add (and, more time-consumingly, test) the alternative init mode.
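If such an option were added, it could presumably look something like the sketch below (the `weight_init` argument and helper name are hypothetical, not timm's actual API):

```python
import torch.nn as nn
from torch.nn.init import trunc_normal_, xavier_normal_

def init_weights_eva(model: nn.Module, weight_init: str = 'trunc_normal') -> None:
    """Hypothetical opt-in init mode; this helper and its argument are
    illustrative names, not part of timm's current API."""
    def _init(m: nn.Module) -> None:
        if isinstance(m, nn.Linear):
            if weight_init == 'xavier_normal':
                xavier_normal_(m.weight)           # from-scratch scheme reported in the EVA-02 paper
            else:
                trunc_normal_(m.weight, std=0.02)  # existing default-style scheme
            if m.bias is not None:
                nn.init.zeros_(m.bias)
    model.apply(_init)
```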
-
No init fns are called in the training loop; different models have slightly different approaches to weight init due to the origins of the code and evolving ideas. It has been a back-burner TODO of mine to better unify this into a more consistent API that can be used with meta-device init to do a proper two-phase init process.
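A rough sketch of what such a two-phase, meta-device init could look like (an assumed pattern for illustration, not timm's current API):

```python
import torch
import torch.nn as nn
from torch.nn.init import trunc_normal_

def _init_weights(m: nn.Module) -> None:
    # The explicit init pass run in phase 2.
    if isinstance(m, nn.Linear):
        trunc_normal_(m.weight, std=0.02)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# Phase 1: build the model on the meta device -- no memory is allocated and
# no constructor-time init actually runs (requires PyTorch >= 2.0).
with torch.device('meta'):
    model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

# Phase 2: materialize real (uninitialized) storage, then apply the init fn.
model = model.to_empty(device='cpu')
model.apply(_init_weights)
```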