
Commit 69dc5bb

Merge branch 'main' into delete-one-more
2 parents: c033ee2 + da1f9d0

File tree

4 files changed: +159 −68

.github/scripts/check_redirects.sh
.github/workflows/check-redirects.yml
intermediate_source/memory_format_tutorial.py
recipes_source/recipes/tuning_guide.py


.github/scripts/check_redirects.sh

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+#!/bin/bash
+
+if [ "$CURRENT_BRANCH" == "$BASE_BRANCH" ]; then
+  echo "Running on $BASE_BRANCH branch. Skipping check."
+  exit 0
+fi
+
+
+# Get list of deleted or renamed files in this branch compared to base
+DELETED_FILES=$(git diff --name-status $BASE_BRANCH $CURRENT_BRANCH --diff-filter=DR | awk '{print $2}' | grep -E '\.(rst|py|md)$' | grep -v 'redirects.py')
+# Check if any deleted or renamed files were found
+if [ -z "$DELETED_FILES" ]; then
+  echo "No deleted or renamed files found. Skipping check."
+  exit 0
+fi
+
+echo "Deleted or renamed files:"
+echo "$DELETED_FILES"
+
+# Check if redirects.py has been updated
+REDIRECTS_UPDATED=$(git diff --name-status $BASE_BRANCH $CURRENT_BRANCH --diff-filter=AM | grep 'redirects.py' && echo "yes" || echo "no")
+
+if [ "$REDIRECTS_UPDATED" == "no" ]; then
+  echo "ERROR: Files were deleted or renamed but redirects.py was not updated. Please update .github/scripts/redirects.py to redirect these files."
+  exit 1
+fi
+
+# Check if each deleted file has a redirect entry
+MISSING_REDIRECTS=0
+for FILE in $DELETED_FILES; do
+  # Convert file path to URL path format (remove extension and adjust path)
+  REDIRECT_PATH=$(echo $FILE | sed -E 's/(.+)_source\/(.+)\.(py|rst|md)$/\1\/\2.html/')
+
+  # Check if this path exists in redirects.py as a key. We don't check for values.
+  if ! grep -q "\"$REDIRECT_PATH\":" redirects.py; then
+    echo "ERROR: Missing redirect for deleted file: $FILE (should have entry for \"$REDIRECT_PATH\")"
+    MISSING_REDIRECTS=1
+  fi
+done
+
+if [ $MISSING_REDIRECTS -eq 1 ]; then
+  echo "ERROR: Please add redirects for all deleted/renamed files to redirects.py"
+  exit 1
+fi
+
+echo "All deleted/renamed files have proper redirects. Check passed!"

.github/workflows/check-redirects.yml

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+name: Check Redirects for Deleted or Renamed Files
+
+on:
+  pull_request:
+    paths:
+      - '*/**/*.rst'
+      - '*/**/*.py'
+      - '*/**/*.md'
+
+jobs:
+  check-redirects:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - name: Run redirect check script
+        run: |
+          chmod +x ./.github/scripts/check_redirects.sh
+          ./.github/scripts/check_redirects.sh
+        env:
+          BASE_BRANCH: ${{ github.base_ref }}
+          CURRENT_BRANCH: ${{ github.head_ref }}
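Editor's note: the check only greps redirects.py for a quoted key followed by a colon, so it is agnostic to what the redirect points at. A sketch of the entry shape that grep would match (hypothetical structure; the actual redirects.py is not part of this diff, and the paths below are invented for illustration):

# Hypothetical redirects.py shape; check_redirects.sh only looks for the
# quoted key with a trailing colon, e.g. "intermediate/old_tutorial.html":
redirects = {
    "intermediate/old_tutorial.html": "https://pytorch.org/tutorials/index.html",
}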

intermediate_source/memory_format_tutorial.py

Lines changed: 29 additions & 5 deletions
@@ -1,13 +1,28 @@
 # -*- coding: utf-8 -*-
 """
-(beta) Channels Last Memory Format in PyTorch
+Channels Last Memory Format in PyTorch
 *******************************************************
 **Author**: `Vitaly Fedyunin <https://github.com/VitalyFedyunin>`_
 
-What is Channels Last
----------------------
+.. grid:: 2
 
-Channels last memory format is an alternative way of ordering NCHW tensors in memory preserving dimensions ordering. Channels last tensors ordered in such a way that channels become the densest dimension (aka storing images pixel-per-pixel).
+    .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
+       :class-card: card-prerequisites
+
+       * What is the channels last memory format in PyTorch?
+       * How can it be used to improve performance on certain operators?
+
+    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
+       :class-card: card-prerequisites
+
+       * PyTorch v1.5.0
+       * A CUDA-capable GPU
+
+#########################################################################
+# Overview - What is channels last?
+# ---------------------------------
+
+The channels last memory format is an alternative way of ordering NCHW tensors in memory preserving dimensions ordering. Channels last tensors ordered in such a way that channels become the densest dimension (aka storing images pixel-per-pixel).
 
 For example, classic (contiguous) storage of NCHW tensor (in our case it is two 4x4 images with 3 color channels) look like this:
 
@@ -19,7 +34,7 @@
 .. figure:: /_static/img/channels_last_memory_format.png
    :alt: channels_last_memory_format
 
-Pytorch supports memory formats (and provides back compatibility with existing models including eager, JIT, and TorchScript) by utilizing existing strides structure.
+Pytorch supports memory formats by utilizing the existing strides structure.
 For example, 10x3x16x16 batch in Channels last format will have strides equal to (768, 1, 48, 3).
 """
 
@@ -387,3 +402,12 @@ def attribute(m):
 #
 # If you have feedback and/or suggestions for improvement, please let us
 # know by creating `an issue <https://github.com/pytorch/pytorch/issues>`_.
+
+######################################################################
+# Conclusion
+# ----------
+#
+# This tutorial introduced the "channels last" memory format and demonstrated
+# how to use it for performance gains. For a practical example of accelerating
+# vision models using channels last, see the post
+# `here <https://pytorch.org/blog/accelerating-pytorch-vision-models-with-channels-last-on-cpu/>`_.
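Editor's note: the strides quoted in the hunk are easy to verify with stock PyTorch (a minimal sketch, independent of the tutorial code):

import torch

x = torch.empty(10, 3, 16, 16)               # NCHW batch, contiguous by default
print(x.stride())                            # (768, 256, 16, 1)

# Same data, channels last: channels become the densest dimension
x = x.to(memory_format=torch.channels_last)
print(x.stride())                            # (768, 1, 48, 3), as stated in the diff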

recipes_source/recipes/tuning_guide.py

Lines changed: 59 additions & 63 deletions
@@ -8,10 +8,38 @@
 techniques often can be implemented by changing only a few lines of code and can
 be applied to a wide range of deep learning models across all domains.
 
+.. grid:: 2
+
+    .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
+       :class-card: card-prerequisites
+
+       * General optimization techniques for PyTorch models
+       * CPU-specific performance optimizations
+       * GPU acceleration strategies
+       * Distributed training optimizations
+
+    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
+       :class-card: card-prerequisites
+
+       * PyTorch 2.0 or later
+       * Python 3.8 or later
+       * CUDA-capable GPU (recommended for GPU optimizations)
+       * Linux, macOS, or Windows operating system
+
+Overview
+--------
+
+Performance optimization is crucial for efficient deep learning model training and inference.
+This tutorial covers a comprehensive set of techniques to accelerate PyTorch workloads across
+different hardware configurations and use cases.
+
 General optimizations
 ---------------------
 """
 
+import torch
+import torchvision
+
 ###############################################################################
 # Enable asynchronous data loading and augmentation
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
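Editor's note: the async data loading section this context line introduces boils down to DataLoader worker processes; a typical configuration looks like this (standard torch.utils.data API; the dataset and parameter values are illustrative):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 8), torch.randint(0, 2, (1024,)))

# num_workers > 0 moves loading/augmentation off the training process;
# pin_memory=True speeds up host-to-GPU copies of the returned batches
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)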
@@ -90,7 +118,7 @@
 # setting it to zero, for more details refer to the
 # `documentation <https://pytorch.org/docs/master/optim.html#torch.optim.Optimizer.zero_grad>`_.
 #
-# Alternatively, starting from PyTorch 1.7, call ``model`` or
+# Alternatively, call ``model`` or
 # ``optimizer.zero_grad(set_to_none=True)``.
 
 ###############################################################################
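Editor's note: the set_to_none idiom referenced in this hunk, as a runnable sketch (toy model and optimizer for illustration):

import torch

model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model(torch.randn(4, 8)).sum().backward()
optimizer.step()

# Setting gradients to None instead of filling them with zeros skips a
# memset and a read-modify-write in the next backward pass.
optimizer.zero_grad(set_to_none=True)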
@@ -129,7 +157,7 @@ def gelu(x):
 ###############################################################################
 # Enable channels_last memory format for computer vision models
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-# PyTorch 1.5 introduced support for ``channels_last`` memory format for
+# PyTorch supports ``channels_last`` memory format for
 # convolutional networks. This format is meant to be used in conjunction with
 # `AMP <https://pytorch.org/docs/stable/amp.html>`_ to further accelerate
 # convolutional neural networks with
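Editor's note: converting a convolutional model and its input to channels_last is a two-line change (a sketch using the public API; the ResNet-50 choice and batch shape are illustrative):

import torch
import torchvision

model = torchvision.models.resnet50().cuda().to(memory_format=torch.channels_last)
x = torch.randn(8, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)

with torch.autocast(device_type="cuda"):  # pairs with AMP, as the hunk notes
    out = model(x)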
@@ -250,65 +278,6 @@ def gelu(x):
 #
 # export LD_PRELOAD=<jemalloc.so/tcmalloc.so>:$LD_PRELOAD
 
-###############################################################################
-# Use oneDNN Graph with TorchScript for inference
-# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-# oneDNN Graph can significantly boost inference performance. It fuses some compute-intensive operations such as convolution, matmul with their neighbor operations.
-# In PyTorch 2.0, it is supported as a beta feature for ``Float32`` & ``BFloat16`` data-types.
-# oneDNN Graph receives the model’s graph and identifies candidates for operator-fusion with respect to the shape of the example input.
-# A model should be JIT-traced using an example input.
-# Speed-up would then be observed after a couple of warm-up iterations for inputs with the same shape as the example input.
-# The example code-snippets below are for resnet50, but they can very well be extended to use oneDNN Graph with custom models as well.
-
-# Only this extra line of code is required to use oneDNN Graph
-torch.jit.enable_onednn_fusion(True)
-
-###############################################################################
-# Using the oneDNN Graph API requires just one extra line of code for inference with Float32.
-# If you are using oneDNN Graph, please avoid calling ``torch.jit.optimize_for_inference``.
-
-# sample input should be of the same shape as expected inputs
-sample_input = [torch.rand(32, 3, 224, 224)]
-# Using resnet50 from torchvision in this example for illustrative purposes,
-# but the line below can indeed be modified to use custom models as well.
-model = getattr(torchvision.models, "resnet50")().eval()
-# Tracing the model with example input
-traced_model = torch.jit.trace(model, sample_input)
-# Invoking torch.jit.freeze
-traced_model = torch.jit.freeze(traced_model)
-
-###############################################################################
-# Once a model is JIT-traced with a sample input, it can then be used for inference after a couple of warm-up runs.
-
-with torch.no_grad():
-    # a couple of warm-up runs
-    traced_model(*sample_input)
-    traced_model(*sample_input)
-    # speedup would be observed after warm-up runs
-    traced_model(*sample_input)
-
-###############################################################################
-# While the JIT fuser for oneDNN Graph also supports inference with ``BFloat16`` datatype,
-# performance benefit with oneDNN Graph is only exhibited by machines with AVX512_BF16
-# instruction set architecture (ISA).
-# The following code snippets serves as an example of using ``BFloat16`` datatype for inference with oneDNN Graph:
-
-# AMP for JIT mode is enabled by default, and is divergent with its eager mode counterpart
-torch._C._jit_set_autocast_mode(False)
-
-with torch.no_grad(), torch.cpu.amp.autocast(cache_enabled=False, dtype=torch.bfloat16):
-    # Conv-BatchNorm folding for CNN-based Vision Models should be done with ``torch.fx.experimental.optimization.fuse`` when AMP is used
-    import torch.fx.experimental.optimization as optimization
-    # Please note that optimization.fuse need not be called when AMP is not used
-    model = optimization.fuse(model)
-    model = torch.jit.trace(model, (example_input))
-    model = torch.jit.freeze(model)
-    # a couple of warm-up runs
-    model(example_input)
-    model(example_input)
-    # speedup would be observed in subsequent runs.
-    model(example_input)
-
 
 ###############################################################################
 # Train a model on CPU with PyTorch ``DistributedDataParallel``(DDP) functionality
@@ -426,9 +395,8 @@ def gelu(x):
 # * enable AMP
 #
 # * Introduction to Mixed Precision Training and AMP:
-#   `video <https://www.youtube.com/watch?v=jF4-_ZK_tyc&feature=youtu.be>`_,
 #   `slides <https://nvlabs.github.io/eccv2020-mixed-precision-tutorial/files/dusan_stosic-training-neural-networks-with-tensor-cores.pdf>`_
-# * native PyTorch AMP is available starting from PyTorch 1.6:
+# * native PyTorch AMP is available:
 #   `documentation <https://pytorch.org/docs/stable/amp.html>`_,
 #   `examples <https://pytorch.org/docs/stable/notes/amp_examples.html#amp-examples>`_,
 #   `tutorial <https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html>`_
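Editor's note: for context on the AMP bullets, the canonical training-step pattern from the linked documentation looks like this (a sketch; toy model and synthetic data, requires a CUDA GPU):

import torch

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

data = torch.randn(32, 128, device="cuda")
target = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.cross_entropy(model(data), target)
scaler.scale(loss).backward()   # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)          # unscales gradients; skips the step on inf/nan
scaler.update()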
@@ -536,3 +504,31 @@ def gelu(x):
 # approximately constant number of tokens (and variable number of sequences in a
 # batch), other models solve imbalance by bucketing samples with similar
 # sequence length or even by sorting dataset by sequence length.
+
+###############################################################################
+# Conclusion
+# ----------
+#
+# This tutorial covered a comprehensive set of performance optimization techniques
+# for PyTorch models. The key takeaways include:
+#
+# * **General optimizations**: Enable async data loading, disable gradients for
+#   inference, fuse operations with ``torch.compile``, and use efficient memory formats
+# * **CPU optimizations**: Leverage NUMA controls, optimize OpenMP settings, and
+#   use efficient memory allocators
+# * **GPU optimizations**: Enable Tensor cores, use CUDA graphs, enable cuDNN
+#   autotuner, and implement mixed precision training
+# * **Distributed optimizations**: Use DistributedDataParallel, optimize gradient
+#   synchronization, and balance workloads across devices
+#
+# Many of these optimizations can be applied with minimal code changes and provide
+# significant performance improvements across a wide range of deep learning models.
+#
+# Further Reading
+# ---------------
+#
+# * `PyTorch Performance Tuning Documentation <https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html>`_
+# * `CUDA Best Practices <https://pytorch.org/docs/stable/notes/cuda.html>`_
+# * `Distributed Training Documentation <https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`_
+# * `Mixed Precision Training <https://pytorch.org/docs/stable/amp.html>`_
+# * `torch.compile Tutorial <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_
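Editor's note: the torch.compile takeaway in the new conclusion, as a minimal sketch (toy model; the first call compiles, later calls reuse the optimized graph):

import torch

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
compiled = torch.compile(model)  # fuses ops; compilation happens on first call

out = compiled(torch.randn(8, 64))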
