Commit 2f4e5c3

Update ddp_minGPT to remove FSDP1 references (#3442)
1 parent 8476a99 commit 2f4e5c3

File tree

1 file changed
intermediate_source/ddp_series_minGPT.rst

Lines changed: 11 additions & 12 deletions
@@ -26,10 +26,11 @@ Authors: `Suraj Subramanian <https://github.com/subramen>`__
    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
       :class-card: card-prerequisites

-      - Familiarity with `multi-GPU training <../beginner/ddp_series_multigpu.html>`__ and `torchrun <../beginner/ddp_series_fault_tolerance.html>`__
-      - [Optional] Familiarity with `multinode training <ddp_series_multinode.html>`__
-      - 2 or more TCP-reachable GPU machines (this tutorial uses AWS p3.2xlarge instances)
       - PyTorch `installed <https://pytorch.org/get-started/locally/>`__ with CUDA on all machines
+      - Familiarity with `multi-GPU training <../beginner/ddp_series_multigpu.html>`__ and `torchrun <../beginner/ddp_series_fault_tolerance.html>`__
+      - [Optional] Familiarity with `multinode training <ddp_series_multinode.html>`__
+      - 2 or more TCP-reachable GPU machines for multi-node training (this tutorial uses AWS p3.2xlarge instances)
+

 Follow along with the video below or on `youtube <https://www.youtube.com/watch/XFsFDGKZHh4>`__.
@@ -63,25 +64,23 @@ from any node that has access to the cloud bucket.

 Using Mixed Precision
 ~~~~~~~~~~~~~~~~~~~~~~~~
-To speed things up, you might be able to use `Mixed Precision <https://pytorch.org/docs/stable/amp.html>`__ to train your models.
-In Mixed Precision, some parts of the training process are carried out in reduced precision, while other steps
-that are more sensitive to precision drops are maintained in FP32 precision.
+To speed things up, you might be able to use `Mixed Precision <https://pytorch.org/docs/stable/amp.html>`__ to train your models.
+In Mixed Precision, some parts of the training process are carried out in reduced precision, while other steps
+that are more sensitive to precision drops are maintained in FP32 precision.


 When is DDP not enough?
 ~~~~~~~~~~~~~~~~~~~~~~~~
 A typical training run's memory footprint consists of model weights, activations, gradients, the input batch, and the optimizer state.
-Since DDP replicates the model on each GPU, it only works when GPUs have sufficient capacity to accomodate the full footprint.
+Since DDP replicates the model on each GPU, it only works when GPUs have sufficient capacity to accomodate the full footprint.
 When models grow larger, more aggressive techniques might be useful:

-- `activation checkpointing <https://pytorch.org/docs/stable/checkpoint.html>`__: Instead of saving intermediate activations during the forward pass, the activations are recomputed during the backward pass. In this approach, we run more compute but save on memory footprint.
-- `Fully-Sharded Data Parallel <https://pytorch.org/docs/stable/fsdp.html>`__: Here the model is not replicated but "sharded" across all the GPUs, and computation is overlapped with communication in the forward and backward passes. Read our `blog <https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff>`__ to learn how we trained a 1 Trillion parameter model with FSDP.
-
+- `Activation checkpointing <https://pytorch.org/docs/stable/checkpoint.html>`__: Instead of saving intermediate activations during the forward pass, the activations are recomputed during the backward pass. In this approach, we run more compute but save on memory footprint.
+- `Fully-Sharded Data Parallel <https://docs.pytorch.org/docs/stable/distributed.fsdp.fully_shard.html>`__: Here the model is not replicated but "sharded" across all the GPUs, and computation is overlapped with communication in the forward and backward passes. Read our `blog <https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff>`__ to learn how we trained a 1 Trillion parameter model with FSDP.

 Further Reading
 ---------------
 - `Multi-Node training with DDP <ddp_series_multinode.html>`__ (previous tutorial in this series)
 - `Mixed Precision training <https://pytorch.org/docs/stable/amp.html>`__
-- `Fully-Sharded Data Parallel <https://pytorch.org/docs/stable/fsdp.html>`__
+- `Fully-Sharded Data Parallel tutorial <https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html>`__
 - `Training a 1T parameter model with FSDP <https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff>`__
-- `FSDP Video Tutorial Series <https://www.youtube.com/playlist?list=PL_lsbAsL_o2BT6aerEKgIoufVD_fodnuT>`__
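
For readers skimming this change, here is a minimal, hedged sketch of the idea described in the "Using Mixed Precision" section of the diff above, using the stock torch.amp autocast and GradScaler APIs from recent PyTorch releases. The model, optimizer, and loader below are illustrative placeholders, not the tutorial's minGPT trainer objects.

    import torch

    # Placeholder model/optimizer/data -- not the minGPT trainer objects.
    model = torch.nn.Linear(1024, 1024).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    scaler = torch.amp.GradScaler("cuda")   # rescales the loss so FP16 gradients don't underflow
    loader = [(torch.randn(8, 1024), torch.randn(8, 1024)) for _ in range(4)]

    for inputs, targets in loader:
        inputs, targets = inputs.cuda(), targets.cuda()
        optimizer.zero_grad(set_to_none=True)
        # Ops inside autocast run in reduced precision where that is safe;
        # precision-sensitive ops stay in FP32, as the tutorial text describes.
        with torch.amp.autocast("cuda", dtype=torch.float16):
            loss = torch.nn.functional.mse_loss(model(inputs), targets)
        scaler.scale(loss).backward()   # backward on the scaled loss
        scaler.step(optimizer)          # unscales gradients, then steps the optimizer
        scaler.update()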
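The updated links in this commit point at the FSDP2 fully_shard API rather than the original FSDP1 wrapper. As a rough sketch only, assuming a recent PyTorch build where torch.distributed.fsdp.fully_shard is available and a launch via torchrun so that a process group and LOCAL_RANK are set up, sharding a toy model could look like the following; the TransformerEncoder here is a stand-in, not the tutorial's minGPT model.

    import os

    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import fully_shard  # FSDP2 entry point (recent PyTorch)

    def main():
        # Assumes launch via `torchrun --nproc_per_node=N <script>.py`.
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

        # Toy stand-in for the minGPT model used in the tutorial.
        model = torch.nn.TransformerEncoder(
            torch.nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
            num_layers=4,
        ).cuda()

        # Shard each block first, then the root module, so parameter all-gathers
        # can overlap with computation in the forward and backward passes.
        for block in model.layers:
            fully_shard(block)
        fully_shard(model)

        # Build the optimizer after sharding so it sees the sharded parameters.
        optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

        x = torch.randn(8, 16, 256, device="cuda")
        loss = model(x).sum()
        loss.backward()
        optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()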
