
Commit efaf961

Authored by brianjo, malfet, guyang3532, mcarilli, ucalyptus, and others
Updating Nightly Build Branch (#1159)
* Fix typo (#1118) In PyTorch tutorial, `torch` should be installed rather than `torchaudio`
* Recover the attributes of torch in memory_format_tutorial (#1112) Co-authored-by: Brian Johnson <[email protected]>
* fix bugs for data_loading_tutorial and dcgan_faces_tutorial (#1092)
* Update autocast in dispatcher tutorial (#1128)
* draft
* fixes
* dont overrun the line
* Corrected model.resnet50() spelling (#1139) Spelling mistake led to errors for beginners.
* Fix typo & Minor changes (#1138) Thanks for the fixes @codingbowoo!
* Run win_test_worker manually (#1142) Merging to clean up a build issue.
* Disable `pytorch_windows_builder_worker` config (#1143) See #1141
* Update index.rst (#1140) Fixed incorrect link.
* Update index.rst Fix to broken link.
* LSTM's -> LSTMs in sequence_models_tutorial.py docs (#1136) Co-authored-by: Brian Johnson <[email protected]>
* Added Ray Tune Hyperparameter Tuning Tutorial (#1066)
* Added Ray Tune Hyperparameter Tuning Tutorial
* Use nightly ray release
* Fix checkpoint API Co-authored-by: Brian Johnson <[email protected]>
* Fix typo in "Introduction to Pytorch" tutorial (in NLP tutorial) (#1145)
* Fix typo in "Introduction to Pytorch" tutorial (in Pytorch for NLP tutorials)
* Dummy commit, to restart CI
* Revert dummy commit, to restart CI
* Revert whitespace changes
* Install torch not torchvision (#1153) Small update to recipe that instructs users to install `torch` not `torchaudio`
* Python recipe for automatic mixed precision (#1137) (see the sketch after this message)
* fdsa
* Tutorial runs
* clarify one scaler per convergence run
* adjust sizes, dont run illustrative sections
* satisfying ocd
* MORE
* fdsa
* details
* rephrase
* fix formatting
* move script to recipes
* hopefully moved to recipes
* fdsa
* add amp_tutorial to toctree
* amp_tutorial -> amp_recipe
* looks like backtick highlights dont render in card_description
* correct path for amp_recipe.html
* arch notes and saving/restoring
* formatting
* fdsa
* Clarify autograd-autocast interaction for custom ops
* touchups Co-authored-by: Brian Johnson <[email protected]>
* Fix model to be properly exported to ONNX (#1144) Co-authored-by: Brian Johnson <[email protected]>
* Dist rpc merge (#1158)
* Create distributed_rpc_profiling.rst
* Update recipes_index.rst
* Add files via upload
* Update recipes_index.rst
* Fix typo "asynchronizely" -> "asynchronously" (#1154)
* Update dist_overview with additional information. (#1155) Summary: 1) Added DDP + RPC tutorial. 2) Added a pointer to PT Distributed CONTRIBUTING.md. Test Plan: Verified by loading the page locally. Reviewers: sentinel Subscribers: Tasks: Tags: Co-authored-by: pritam <[email protected]>
* Add Performance Tuning guide recipe (#1161)
* Performance Tuning Guide - initial commit
* Minor tweaks
* Switched profiling guide thumbnail to pytorch logo
* Converted Tuning Guide to 80 chars/line
* Split tuning guide into general, GPU-specific and distributed optimizations.
* WAR to fix generation of header for 1st section
* Minor fixes
* Implemented changes suggested during initial review
* Changed 'addition assignment' to 'addition'
* Removed sentences about 1 CPU core for DataParallel training
* Reordering of layers is recommended only for DDP(find_unused_parameters=True)
* Fixed formatting
* s/constructors/model constructors and s/match/roughly match
* Fixed typos
* A fix for one line comment when removing runnable code. (#1165)

Co-authored-by: v-jizhang <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
Co-authored-by: guyang3532 <[email protected]>
Co-authored-by: mcarilli <[email protected]>
Co-authored-by: Sayantan Das <[email protected]>
Co-authored-by: 장보우 Bowoo Jang <[email protected]>
Co-authored-by: Alan deLevie <[email protected]>
Co-authored-by: krfricke <[email protected]>
Co-authored-by: Vijay Viswanathan <[email protected]>
Co-authored-by: J. Randall Hunt <[email protected]>
Co-authored-by: Thiago Crepaldi <[email protected]>
Co-authored-by: Peter Whidden <[email protected]>
Co-authored-by: Pritam Damania <[email protected]>
Co-authored-by: pritam <[email protected]>
Co-authored-by: Szymon Migacz <[email protected]>
Co-authored-by: Jinlin Zhang <[email protected]>
Co-authored-by: v-jizhang <[email protected]>
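The "Python recipe for automatic mixed precision (#1137)" item above adds recipes_source/recipes/amp_recipe.py. As a quick orientation, here is a minimal sketch of the training pattern that recipe documents, including the "one scaler per convergence run" point called out in its sub-commits. The model, data, and hyperparameters below are hypothetical placeholders, and a CUDA device is assumed.

import torch

# Minimal mixed-precision training sketch; requires a CUDA device.
model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

# One GradScaler per convergence run (reused across all of its iterations).
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    data = torch.randn(32, 128, device="cuda")
    target = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad()
    # Run the forward pass (and loss computation) under autocast...
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(data), target)
    # ...but scale the loss, step, and update the scaler outside of it.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()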
1 parent ecddcd1 commit efaf961

24 files changed, +1647 −46 lines

.circleci/config.yml

Lines changed: 7 additions & 6 deletions

@@ -562,10 +562,11 @@ workflows:
             branches:
               only:
                 - master
-      - pytorch_windows_build_worker:
-          name: win_test_worker
-          filters:
-            branches:
-              only:
-                - master
+      # - pytorch_windows_build_worker:
+      #     name: win_test_worker
+      #     type: approval
+      #     filters:
+      #       branches:
+      #         only:
+      #           - master
 

.jenkins/remove_runnable_code.py

Lines changed: 8 additions & 0 deletions

@@ -16,9 +16,17 @@
         if line.startswith('#'):
             ret_lines.append(line)
             state = STATE_NORMAL
+        elif ((line.startswith('"""') or line.startswith('r"""')) and
+                line.endswith('"""')):
+            ret_lines.append(line)
+            state = STATE_NORMAL
         elif line.startswith('"""') or line.startswith('r"""'):
             ret_lines.append(line)
             state = STATE_IN_MULTILINE_COMMENT_BLOCK_DOUBLE_QUOTE
+        elif ((line.startswith("'''") or line.startswith("r'''")) and
+                line.endswith("'''")):
+            ret_lines.append(line)
+            state = STATE_NORMAL
         elif line.startswith("'''") or line.startswith("r'''"):
             ret_lines.append(line)
             state = STATE_IN_MULTILINE_COMMENT_BLOCK_SINGLE_QUOTE
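To see why the two new branches matter, here is a standalone sketch (not the actual .jenkins script; the helper below is simplified) of the state transition they fix: a docstring that opens and closes on the same line should leave the parser in the normal state instead of switching it into the multiline-comment state.

STATE_NORMAL = "normal"
STATE_IN_BLOCK = "in_multiline_comment"

def classify(lines):
    """Return the parser state after reading each line."""
    state = STATE_NORMAL
    states = []
    for line in lines:
        if state == STATE_IN_BLOCK:
            if line.endswith('"""'):
                state = STATE_NORMAL
        elif (line.startswith('"""') or line.startswith('r"""')) and \
                line.endswith('"""') and len(line) > 3:
            # One-line docstring: stay in the normal state (the new branch).
            state = STATE_NORMAL
        elif line.startswith('"""') or line.startswith('r"""'):
            state = STATE_IN_BLOCK
        states.append(state)
    return states

print(classify(['"""One-line comment."""', 'print("runnable code")']))
# ['normal', 'normal'] -- without the one-line branch, the first line would
# leave the parser stuck in the multiline-comment state.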

_static/img/ray-tune.png

58.4 KB
14.5 KB
41.9 KB

advanced_source/dispatcher.rst

Lines changed: 85 additions & 24 deletions

@@ -105,6 +105,8 @@ speaking, the structure of your registrations will look like this:
   that provides implementations for all basic operators on the XLA dispatch
   key.
 
+.. _autograd-support:
+
 Adding autograd support
 -----------------------
 
@@ -229,38 +231,97 @@ Autocast
 ^^^^^^^^
 
 The Autocast dispatch key implements support for
-`automatic mixed precision <https://developer.nvidia.com/automatic-mixed-precision>`_
-(AMP). An autocast kernel typically modifies the operation of an operator by casting the
-input arguments to some precision before carrying out the operation. For some
-operations, it is numerically safe to cast to lower precision, which is how AMP
-can achieve speed ups and reduced memory usage without sacrificing much
-accuracy. A nontrivial autocast kernel looks something like this:
+`automatic mixed precision (AMP) <https://pytorch.org/docs/stable/amp.html>`_.
+An autocast wrapper kernel typically casts incoming ``float16`` or ``float32`` CUDA tensors
+to some preferred precision before running the op.
+For example, matmuls and convolutions on floating-point CUDA tensors usually run faster
+and use less memory in ``float16`` without impairing convergence.
+Autocast wrappers only have an effect in
+`autocast-enabled contexts <https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.autocast>`_.
+
+Here's an autocast wrapper for a hypothetical custom matmul, along with its registration:
 
 .. code-block:: cpp
 
+  // Autocast-specific helper functions
+  #include <ATen/autocast_mode.h>
+
   Tensor mymatmul_autocast(const Tensor& self, const Tensor& other) {
     c10::impl::ExcludeDispatchKeyGuard no_autocast(c10::DispatchKey::Autocast);
-    return mymatmul(autocast::_cast(at::kHalf, self), autocast::_cast(at::kHalf, other));
+    return mymatmul(at::autocast::cached_cast(at::kHalf, self),
+                    at::autocast::cached_cast(at::kHalf, other));
   }
 
+  TORCH_LIBRARY_IMPL(myops, Autocast, m) {
+    m.impl("mymatmul", mymatmul_autocast);
+  }
+
+``cached_cast(kHalf, tensor)`` casts ``tensor`` to ``float16`` if ``tensor`` is CUDA and ``float32``,
+otherwise, it leaves ``tensor`` unchanged (c.f. the
+`eligibility policy <https://pytorch.org/docs/stable/amp.html#op-eligibility>`_ for natively autocasted ops).
+This ensures if the network calls ``mymatmul`` on any mixture of ``float16`` and ``float32`` CUDA tensors,
+``mymatmul`` runs in ``float16``. Meanwhile, calls to ``mymatmul`` with non-CUDA, integer-type, or ``float64``
+inputs are unaffected. Using ``cached_cast`` to follow the native eligibility policy in your own autocast wrapper
+is recommended, but not required. For example, if you wanted to force ``float16`` execution for all input types,
+you could ``return mymatmul(self.half(), other.half());`` instead of using ``cached_cast``.
+
 Notice that, like our autograd kernels, we exclude the ``Autocast`` key from
-dispatch before redispatching. By default, if no autocast kernel is provided,
-we simply fallthrough directly to the regular operator implementation (no
-autocasting occurs.) (We didn't use ``myadd`` for this example, since pointwise
-addition doesn't do autocasting and should just fall through).
-
-When should an autocast kernel be registered? Unfortunately, there aren't
-cut-and-dry rules for when you should cast to a lower precision. You can
-get a sense for what operators have autocasting behavior by looking at
-the `AMP documentation
-<https://pytorch.org/docs/master/amp.html#op-specific-behavior>`_. Some other
-general rules:
-
-* Operations that do reductions should be carried out in float32,
-* Any operation with multiple float tensor inputs has to standardize them
-  to a common precision, and
-* Any operation that does a convolution or gemm under the hood should
-  probably be float16
+dispatch before redispatching.
+
+By default, if no autocast wrapper is provided,
+we fallthrough directly to the regular operator implementation (no
+autocasting occurs). (We didn't use ``myadd`` for this example, since pointwise
+addition doesn't need autocasting and should just fall through.)
+
+When should an autocast wrapper be registered? Unfortunately, there aren't
+cut-and-dried rules for an op's preferred precision. You can
+get a sense for some native ops' preferred precisions by looking at the
+`cast lists <https://pytorch.org/docs/master/amp.html#op-specific-behavior>`_.
+General guidance:
+
+* Ops that do reductions should probably execute in ``float32``,
+* Any op that does a convolution or gemm under the hood should
+  probably execute in ``float16``, and
+* Other ops with multiple floating-point tensor inputs should standardize
+  them to a common precision (unless the implementation supports inputs with different precisions).
+
+If your custom op falls into the third category, the ``promote_type`` template
+helps figure out the widest floating-point type present among input tensors, which is
+the safest choice for the execution type:
+
+.. code-block:: cpp
+
+  #include <ATen/autocast_mode.h>
+
+  Tensor my_multiple_input_op_autocast(const Tensor& t0, const Tensor& t1) {
+    c10::impl::ExcludeDispatchKeyGuard no_autocast(c10::DispatchKey::Autocast);
+    // The required at::kHalf argument is an optimistic initial guess.
+    auto exec_type = at::autocast::promote_type(at::kHalf, t0, t1);
+    return my_multiple_input_op(at::autocast::cached_cast(exec_type, t0),
+                                at::autocast::cached_cast(exec_type, t1));
+  }
+
+If your custom op is :ref:`autograd-enabled<autograd-support>`, you only need to write and register
+an autocast wrapper for the same name onto which the autograd wrapper is registered.
+For example, if you wanted an autocast wrapper for the ``myadd`` function shown
+in the autograd section, all you'd need is
+
+.. code-block:: cpp
+
+  Tensor myadd_autocast(const Tensor& self, const Tensor& other) {
+    c10::impl::ExcludeDispatchKeyGuard no_autocast(c10::DispatchKey::Autocast);
+    return myadd(at::autocast::cached_cast(<desired dtype>, self),
+                 at::autocast::cached_cast(<desired dtype>, other));
+  }
+
+  TORCH_LIBRARY_IMPL(myops, Autocast, m) {
+    m.impl("myadd", myadd_autocast);
+  }
+
+There are no separate gymnastics to make the backward method autocast compatible.
+However, the backward method defined in your custom autograd function will run in the same
+dtype as autocast sets for the forward method, so you should choose a ``<desired dtype>``
+suitable for both your forward and backward methods.
 
 Batched
 ^^^^^^^
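For a Python-side view of what the new Autocast wrapper buys you, here is a hypothetical usage sketch. It assumes the custom library from this tutorial has been compiled and loaded so the op is reachable as torch.ops.myops.mymatmul, and that a CUDA device is available; neither the library path nor the loading step comes from the diff above.

import torch

# Hypothetical: assumes the C++ extension registering myops::mymatmul (and its
# Autocast wrapper from the diff above) has been built and loaded, e.g. via
# torch.ops.load_library("build/libmyops.so").
a = torch.randn(8, 8, device="cuda")                       # float32
b = torch.randn(8, 8, device="cuda", dtype=torch.float16)  # float16

with torch.cuda.amp.autocast():
    # The Autocast wrapper runs first and cached_cast-s both inputs to float16
    # before redispatching, so mixed-precision inputs are accepted and the
    # result comes back as float16.
    out = torch.ops.myops.mymatmul(a, b)
print(out.dtype)  # torch.float16

# Outside an autocast-enabled context the wrapper falls through to the regular
# kernel, which sees the original float32/float16 tensors unchanged.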

beginner_source/data_loading_tutorial.py

Lines changed: 1 addition & 1 deletion

@@ -374,7 +374,7 @@ def __call__(self, sample):
 #
 
 dataloader = DataLoader(transformed_dataset, batch_size=4,
-                        shuffle=True, num_workers=4)
+                        shuffle=True, num_workers=0)
 
 
 # Helper function to show a batch

beginner_source/dcgan_faces_tutorial.py

Lines changed: 1 addition & 1 deletion

@@ -591,7 +591,7 @@ def forward(self, input):
         # Format batch
         real_cpu = data[0].to(device)
         b_size = real_cpu.size(0)
-        label = torch.full((b_size,), real_label, device=device)
+        label = torch.full((b_size,), real_label, dtype=torch.float, device=device)
         # Forward pass real batch through D
         output = netD(real_cpu).view(-1)
         # Calculate loss on all-real batch
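For context on this one-line dcgan fix: `real_label` is a plain Python number in the tutorial, and depending on the PyTorch version `torch.full` may infer a non-floating dtype from an integer fill value, which `BCELoss` then rejects because the discriminator output is floating point. A small hedged illustration (the batch size and label value below are arbitrary, not taken from the tutorial):

import torch

b_size, real_label = 4, 1  # arbitrary batch size; an integer label value

# With the explicit dtype, the labels always match the discriminator's
# floating-point output, regardless of how a given PyTorch version infers
# the dtype of torch.full from an integer fill value.
label = torch.full((b_size,), real_label, dtype=torch.float)
print(label.dtype)  # torch.float32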

beginner_source/dist_overview.rst

Lines changed: 10 additions & 0 deletions

@@ -195,3 +195,13 @@ RPC Tutorials are listed below:
    `@rpc.functions.async_execution <https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.functions.async_execution>`__
    decorator, which can help speed up inference and training. It uses similar
    RL and PS examples employed in the above tutorials 1 and 2.
+5. The `Combining Distributed DataParallel with Distributed RPC Framework <../advanced/rpc_ddp_tutorial.html>`__
+   tutorial demonstrates how to combine DDP with RPC to train a model using
+   distributed data parallelism combined with distributed model parallelism.
+
+
+PyTorch Distributed Developers
+------------------------------
+
+If you'd like to contribute to PyTorch Distributed, please refer to our
+`Developer Guide <https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md>`_.
