I think the correct way to code the training step is this:

    optimizer.zero_grad()
    # Forward pass
    outputs = model(images)
    loss = criterion(outputs, labels)
    # Backward and optimize
    loss.backward()
    optimizer.step()

not this:

    # Forward pass
    outputs = model(images)
    loss = criterion(outputs, labels)
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
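For anyone who wants to check whether the placement of optimizer.zero_grad() changes the result, here is a minimal sketch that trains two copies of the same model, one with each ordering, and compares the final parameters. The tiny nn.Linear model, the random batches, and the train helper with its zero_grad_first flag are all made up for illustration; they are not from the original tutorial. As far as I understand, the forward pass does not read .grad, so what matters is that zero_grad() runs somewhere before loss.backward().

```python
import copy

import torch
import torch.nn as nn


def train(model, batches, zero_grad_first):
    """Hypothetical helper: run one pass over `batches` with either ordering."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for images, labels in batches:
        if zero_grad_first:
            optimizer.zero_grad()  # ordering 1: clear grads before the forward pass
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        if not zero_grad_first:
            optimizer.zero_grad()  # ordering 2: clear grads just before backward
        # Backward and optimize
        loss.backward()
        optimizer.step()
    return model


torch.manual_seed(0)
base = nn.Linear(4, 3)  # toy stand-in for the real model (assumption)
# Random stand-in data: 5 batches of 8 samples, 4 features, 3 classes
batches = [(torch.randn(8, 4), torch.randint(0, 3, (8,))) for _ in range(5)]

# Start both runs from identical weights and feed them identical data
model_a = train(copy.deepcopy(base), batches, zero_grad_first=True)
model_b = train(copy.deepcopy(base), batches, zero_grad_first=False)

same = all(
    torch.allclose(p, q)
    for p, q in zip(model_a.parameters(), model_b.parameters())
)
print("identical parameters:", same)
```

If the two runs end with matching parameters, the choice between the orderings is mainly about readability; putting zero_grad() at the top of the loop does make it harder to forget.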