
Commit 69dc5bb

Merge branch 'main' into delete-one-more
2 parents: c033ee2 + da1f9d0

File tree

4 files changed: +159 −68

.github/scripts/check_redirects.sh
.github/workflows/check-redirects.yml
intermediate_source/memory_format_tutorial.py
recipes_source/recipes/tuning_guide.py


.github/scripts/check_redirects.sh

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+#!/bin/bash
+
+if [ "$CURRENT_BRANCH" == "$BASE_BRANCH" ]; then
+  echo "Running on $BASE_BRANCH branch. Skipping check."
+  exit 0
+fi
+
+
+# Get list of deleted or renamed files in this branch compared to base
+DELETED_FILES=$(git diff --name-status $BASE_BRANCH $CURRENT_BRANCH --diff-filter=DR | awk '{print $2}' | grep -E '\.(rst|py|md)$' | grep -v 'redirects.py')
+# Check if any deleted or renamed files were found
+if [ -z "$DELETED_FILES" ]; then
+  echo "No deleted or renamed files found. Skipping check."
+  exit 0
+fi
+
+echo "Deleted or renamed files:"
+echo "$DELETED_FILES"
+
+# Check if redirects.py has been updated
+REDIRECTS_UPDATED=$(git diff --name-status $BASE_BRANCH $CURRENT_BRANCH --diff-filter=AM | grep 'redirects.py' && echo "yes" || echo "no")
+
+if [ "$REDIRECTS_UPDATED" == "no" ]; then
+  echo "ERROR: Files were deleted or renamed but redirects.py was not updated. Please update .github/scripts/redirects.py to redirect these files."
+  exit 1
+fi
+
+# Check if each deleted file has a redirect entry
+MISSING_REDIRECTS=0
+for FILE in $DELETED_FILES; do
+  # Convert file path to URL path format (remove extension and adjust path)
+  REDIRECT_PATH=$(echo $FILE | sed -E 's/(.+)_source\/(.+)\.(py|rst|md)$/\1\/\2.html/')
+
+  # Check if this path exists in redirects.py as a key. We don't check for values.
+  if ! grep -q "\"$REDIRECT_PATH\":" redirects.py; then
+    echo "ERROR: Missing redirect for deleted file: $FILE (should have entry for \"$REDIRECT_PATH\")"
+    MISSING_REDIRECTS=1
+  fi
+done
+
+if [ $MISSING_REDIRECTS -eq 1 ]; then
+  echo "ERROR: Please add redirects for all deleted/renamed files to redirects.py"
+  exit 1
+fi
+
+echo "All deleted/renamed files have proper redirects. Check passed!"

.github/workflows/check-redirects.yml

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+name: Check Redirects for Deleted or Renamed Files
+
+on:
+  pull_request:
+    paths:
+      - '*/**/*.rst'
+      - '*/**/*.py'
+      - '*/**/*.md'
+
+jobs:
+  check-redirects:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - name: Run redirect check script
+        run: |
+          chmod +x ./.github/scripts/check_redirects.sh
+          ./.github/scripts/check_redirects.sh
+        env:
+          BASE_BRANCH: ${{ github.base_ref }}
+          CURRENT_BRANCH: ${{ github.head_ref }}
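Editor's note: the check only greps redirects.py for a quoted key followed by a colon, so it is agnostic to what the redirect points at. A sketch of the entry shape that grep would match (hypothetical structure; the actual redirects.py is not part of this diff, and the paths below are invented for illustration):

# Hypothetical redirects.py shape; check_redirects.sh only looks for the
# quoted key with a trailing colon, e.g. "intermediate/old_tutorial.html":
redirects = {
    "intermediate/old_tutorial.html": "https://pytorch.org/tutorials/index.html",
}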

intermediate_source/memory_format_tutorial.py

Lines changed: 29 additions & 5 deletions
@@ -1,13 +1,28 @@
 # -*- coding: utf-8 -*-
 """
-(beta) Channels Last Memory Format in PyTorch
+Channels Last Memory Format in PyTorch
 *******************************************************
 **Author**: `Vitaly Fedyunin <https://github.com/VitalyFedyunin>`_
 
-What is Channels Last
----------------------
+.. grid:: 2
 
-Channels last memory format is an alternative way of ordering NCHW tensors in memory preserving dimensions ordering. Channels last tensors ordered in such a way that channels become the densest dimension (aka storing images pixel-per-pixel).
+    .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
+       :class-card: card-prerequisites
+
+       * What is the channels last memory format in PyTorch?
+       * How can it be used to improve performance on certain operators?
+
+    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
+       :class-card: card-prerequisites
+
+       * PyTorch v1.5.0
+       * A CUDA-capable GPU
+
+#########################################################################
+# Overview - What is channels last?
+# ---------------------------------
+
+The channels last memory format is an alternative way of ordering NCHW tensors in memory preserving dimensions ordering. Channels last tensors ordered in such a way that channels become the densest dimension (aka storing images pixel-per-pixel).
 
 For example, classic (contiguous) storage of NCHW tensor (in our case it is two 4x4 images with 3 color channels) look like this:
 
@@ -19,7 +34,7 @@
 .. figure:: /_static/img/channels_last_memory_format.png
    :alt: channels_last_memory_format
 
-Pytorch supports memory formats (and provides back compatibility with existing models including eager, JIT, and TorchScript) by utilizing existing strides structure.
+Pytorch supports memory formats by utilizing the existing strides structure.
 For example, 10x3x16x16 batch in Channels last format will have strides equal to (768, 1, 48, 3).
 """
 
@@ -387,3 +402,12 @@ def attribute(m):
 #
 # If you have feedback and/or suggestions for improvement, please let us
 # know by creating `an issue <https://github.com/pytorch/pytorch/issues>`_.
+
+######################################################################
+# Conclusion
+# ----------
+#
+# This tutorial introduced the "channels last" memory format and demonstrated
+# how to use it for performance gains. For a practical example of accelerating
+# vision models using channels last, see the post
+# `here <https://pytorch.org/blog/accelerating-pytorch-vision-models-with-channels-last-on-cpu/>`_.
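Editor's note: the strides quoted in the hunk are easy to verify with stock PyTorch (a minimal sketch, independent of the tutorial code):

import torch

x = torch.empty(10, 3, 16, 16)               # NCHW batch, contiguous by default
print(x.stride())                            # (768, 256, 16, 1)

# Same data, channels last: channels become the densest dimension
x = x.to(memory_format=torch.channels_last)
print(x.stride())                            # (768, 1, 48, 3), as stated in the diff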

recipes_source/recipes/tuning_guide.py

Lines changed: 59 additions & 63 deletions
@@ -8,10 +8,38 @@
 techniques often can be implemented by changing only a few lines of code and can
 be applied to a wide range of deep learning models across all domains.
 
+.. grid:: 2
+
+    .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
+       :class-card: card-prerequisites
+
+       * General optimization techniques for PyTorch models
+       * CPU-specific performance optimizations
+       * GPU acceleration strategies
+       * Distributed training optimizations
+
+    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
+       :class-card: card-prerequisites
+
+       * PyTorch 2.0 or later
+       * Python 3.8 or later
+       * CUDA-capable GPU (recommended for GPU optimizations)
+       * Linux, macOS, or Windows operating system
+
+Overview
+--------
+
+Performance optimization is crucial for efficient deep learning model training and inference.
+This tutorial covers a comprehensive set of techniques to accelerate PyTorch workloads across
+different hardware configurations and use cases.
+
 General optimizations
 ---------------------
 """
 
+import torch
+import torchvision
+
 ###############################################################################
 # Enable asynchronous data loading and augmentation
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
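Editor's note: the async data loading section this context line introduces boils down to DataLoader worker processes; a typical configuration looks like this (standard torch.utils.data API; the dataset and parameter values are illustrative):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 8), torch.randint(0, 2, (1024,)))

# num_workers > 0 moves loading/augmentation off the training process;
# pin_memory=True speeds up host-to-GPU copies of the returned batches
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)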
@@ -90,7 +118,7 @@
 # setting it to zero, for more details refer to the
 # `documentation <https://pytorch.org/docs/master/optim.html#torch.optim.Optimizer.zero_grad>`_.
 #
-# Alternatively, starting from PyTorch 1.7, call ``model`` or
+# Alternatively, call ``model`` or
 # ``optimizer.zero_grad(set_to_none=True)``.
 
 ###############################################################################
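Editor's note: the set_to_none idiom referenced in this hunk, as a runnable sketch (toy model and optimizer for illustration):

import torch

model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model(torch.randn(4, 8)).sum().backward()
optimizer.step()

# Setting gradients to None instead of filling them with zeros skips a
# memset and a read-modify-write in the next backward pass.
optimizer.zero_grad(set_to_none=True)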
@@ -129,7 +157,7 @@ def gelu(x):
 ###############################################################################
 # Enable channels_last memory format for computer vision models
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-# PyTorch 1.5 introduced support for ``channels_last`` memory format for
+# PyTorch supports ``channels_last`` memory format for
 # convolutional networks. This format is meant to be used in conjunction with
 # `AMP <https://pytorch.org/docs/stable/amp.html>`_ to further accelerate
 # convolutional neural networks with
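Editor's note: converting a convolutional model and its input to channels_last is a two-line change (a sketch using the public API; the ResNet-50 choice and batch shape are illustrative):

import torch
import torchvision

model = torchvision.models.resnet50().cuda().to(memory_format=torch.channels_last)
x = torch.randn(8, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)

with torch.autocast(device_type="cuda"):  # pairs with AMP, as the hunk notes
    out = model(x)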
@@ -250,65 +278,6 @@ def gelu(x):
 #
 # export LD_PRELOAD=<jemalloc.so/tcmalloc.so>:$LD_PRELOAD
 
-###############################################################################
-# Use oneDNN Graph with TorchScript for inference
-# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-# oneDNN Graph can significantly boost inference performance. It fuses some compute-intensive operations such as convolution, matmul with their neighbor operations.
-# In PyTorch 2.0, it is supported as a beta feature for ``Float32`` & ``BFloat16`` data-types.
-# oneDNN Graph receives the model’s graph and identifies candidates for operator-fusion with respect to the shape of the example input.
-# A model should be JIT-traced using an example input.
-# Speed-up would then be observed after a couple of warm-up iterations for inputs with the same shape as the example input.
-# The example code-snippets below are for resnet50, but they can very well be extended to use oneDNN Graph with custom models as well.
-
-# Only this extra line of code is required to use oneDNN Graph
-torch.jit.enable_onednn_fusion(True)
-
-###############################################################################
-# Using the oneDNN Graph API requires just one extra line of code for inference with Float32.
-# If you are using oneDNN Graph, please avoid calling ``torch.jit.optimize_for_inference``.
-
-# sample input should be of the same shape as expected inputs
-sample_input = [torch.rand(32, 3, 224, 224)]
-# Using resnet50 from torchvision in this example for illustrative purposes,
-# but the line below can indeed be modified to use custom models as well.
-model = getattr(torchvision.models, "resnet50")().eval()
-# Tracing the model with example input
-traced_model = torch.jit.trace(model, sample_input)
-# Invoking torch.jit.freeze
-traced_model = torch.jit.freeze(traced_model)
-
-###############################################################################
-# Once a model is JIT-traced with a sample input, it can then be used for inference after a couple of warm-up runs.
-
-with torch.no_grad():
-    # a couple of warm-up runs
-    traced_model(*sample_input)
-    traced_model(*sample_input)
-    # speedup would be observed after warm-up runs
-    traced_model(*sample_input)
-
-###############################################################################
-# While the JIT fuser for oneDNN Graph also supports inference with ``BFloat16`` datatype,
-# performance benefit with oneDNN Graph is only exhibited by machines with AVX512_BF16
-# instruction set architecture (ISA).
-# The following code snippets serves as an example of using ``BFloat16`` datatype for inference with oneDNN Graph:
-
-# AMP for JIT mode is enabled by default, and is divergent with its eager mode counterpart
-torch._C._jit_set_autocast_mode(False)
-
-with torch.no_grad(), torch.cpu.amp.autocast(cache_enabled=False, dtype=torch.bfloat16):
-    # Conv-BatchNorm folding for CNN-based Vision Models should be done with ``torch.fx.experimental.optimization.fuse`` when AMP is used
-    import torch.fx.experimental.optimization as optimization
-    # Please note that optimization.fuse need not be called when AMP is not used
-    model = optimization.fuse(model)
-    model = torch.jit.trace(model, (example_input))
-    model = torch.jit.freeze(model)
-    # a couple of warm-up runs
-    model(example_input)
-    model(example_input)
-    # speedup would be observed in subsequent runs.
-    model(example_input)
-
 
 ###############################################################################
 # Train a model on CPU with PyTorch ``DistributedDataParallel``(DDP) functionality
@@ -426,9 +395,8 @@ def gelu(x):
 # * enable AMP
 #
 # * Introduction to Mixed Precision Training and AMP:
-#   `video <https://www.youtube.com/watch?v=jF4-_ZK_tyc&feature=youtu.be>`_,
 #   `slides <https://nvlabs.github.io/eccv2020-mixed-precision-tutorial/files/dusan_stosic-training-neural-networks-with-tensor-cores.pdf>`_
-# * native PyTorch AMP is available starting from PyTorch 1.6:
+# * native PyTorch AMP is available:
 #   `documentation <https://pytorch.org/docs/stable/amp.html>`_,
 #   `examples <https://pytorch.org/docs/stable/notes/amp_examples.html#amp-examples>`_,
 #   `tutorial <https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html>`_
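Editor's note: for context on the AMP bullets, the canonical training-step pattern from the linked documentation looks like this (a sketch; toy model and synthetic data, requires a CUDA GPU):

import torch

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

data = torch.randn(32, 128, device="cuda")
target = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.cross_entropy(model(data), target)
scaler.scale(loss).backward()   # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)          # unscales gradients; skips the step on inf/nan
scaler.update()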
@@ -536,3 +504,31 @@ def gelu(x):
 # approximately constant number of tokens (and variable number of sequences in a
 # batch), other models solve imbalance by bucketing samples with similar
 # sequence length or even by sorting dataset by sequence length.
+
+###############################################################################
+# Conclusion
+# ----------
+#
+# This tutorial covered a comprehensive set of performance optimization techniques
+# for PyTorch models. The key takeaways include:
+#
+# * **General optimizations**: Enable async data loading, disable gradients for
+#   inference, fuse operations with ``torch.compile``, and use efficient memory formats
+# * **CPU optimizations**: Leverage NUMA controls, optimize OpenMP settings, and
+#   use efficient memory allocators
+# * **GPU optimizations**: Enable Tensor cores, use CUDA graphs, enable cuDNN
+#   autotuner, and implement mixed precision training
+# * **Distributed optimizations**: Use DistributedDataParallel, optimize gradient
+#   synchronization, and balance workloads across devices
+#
+# Many of these optimizations can be applied with minimal code changes and provide
+# significant performance improvements across a wide range of deep learning models.
+#
+# Further Reading
+# ---------------
+#
+# * `PyTorch Performance Tuning Documentation <https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html>`_
+# * `CUDA Best Practices <https://pytorch.org/docs/stable/notes/cuda.html>`_
+# * `Distributed Training Documentation <https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`_
+# * `Mixed Precision Training <https://pytorch.org/docs/stable/amp.html>`_
+# * `torch.compile Tutorial <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_
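Editor's note: the torch.compile takeaway in the new conclusion, as a minimal sketch (toy model; the first call compiles, later calls reuse the optimized graph):

import torch

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
compiled = torch.compile(model)  # fuses ops; compilation happens on first call

out = compiled(torch.randn(8, 64))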
