Fix duplicate labels and other docs build warnings #9446

Merged · 2 commits · Jul 16, 2025
3 changes: 2 additions & 1 deletion docs/source/conf.py
@@ -27,7 +27,6 @@
"sphinx.ext.napoleon",
"sphinx.ext.viewcode",
"sphinxcontrib.katex",
"sphinx.ext.autosectionlabel",
"sphinx_copybutton",
# "sphinx_panels",
# "myst_parser", # Will be activated by myst_nb
@@ -38,6 +37,8 @@
extensions = pytorch_extensions + [
"myst_nb"
]
+# Automatically generate section anchors for selected heading level
+myst_heading_anchors = 3

# Users must manually execute their notebook cells
# with the correct hardware accelerator.
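With `sphinx.ext.autosectionlabel` removed, cross-references resolve through the GitHub-style slug anchors that MyST generates, which avoids the duplicate-label warnings that identically named sections used to trigger. A minimal sketch of what `myst_heading_anchors = 3` provides (the prose around the link is illustrative):

```md
<!-- Headings of level 1-3 automatically receive GitHub-style slug anchors. -->
## Sharding-Aware Host-to-Device Data Loading

<!-- The generated anchor can then be linked from the same document: -->
See [](#sharding-aware-host-to-device-data-loading) for details.
```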
2 changes: 1 addition & 1 deletion docs/source/contribute/cpp_debugger.md
@@ -54,7 +54,7 @@ We suggest the following steps:

At this point, your PyTorch is built with debugging symbols and ready to debug
with GDB. However, we recommend debugging with VSCode. For more information, see
-{ref}`Debug with VSCode`.
+[](#debug-with-vscode)

### Verify your file is built

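This is the pattern applied throughout the PR: the old `{ref}` role depended on a label that `sphinx.ext.autosectionlabel` derived from the heading text, while the new link targets the slug anchor generated by `myst_heading_anchors`. A sketch, assuming the target section is a heading of level 3 or less titled "Debug with VSCode":

```md
<!-- Before: label auto-created by sphinx.ext.autosectionlabel -->
{ref}`Debug with VSCode`

<!-- After: slug anchor generated by myst_heading_anchors = 3 -->
[](#debug-with-vscode)
```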
5 changes: 1 addition & 4 deletions docs/source/contribute/plugins.md
@@ -45,7 +45,7 @@ you can test with the placeholder `LIBRARY` device type. For example:
[device(type='xla', index=0), device(type='xla', index=1), device(type='xla', index=2), device(type='xla', index=3)]

To register your device type automatically for users as well as to
-handle extra setup for e.g. multiprocessing, you may implement the
+handle extra setup, for example, multiprocessing, you may implement the
`DevicePlugin` Python API. PyTorch/XLA plugin packages contain two key
components:

@@ -65,9 +65,6 @@ class CpuPlugin(plugins.DevicePlugin):
that identifies your `DevicePlugin`. For example, to register the
`EXAMPLE` device type in a `pyproject.toml`:

-```{=html}
-<!-- -->
-```
[project.entry-points."torch_xla.plugins"]
example = "torch_xla_cpu_plugin:CpuPlugin"

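The deleted `{=html}` fence is pandoc conversion residue that served no purpose under MyST. One way the remaining entry-point snippet could be fenced for syntax highlighting is sketched below; the diff context does not show how the surrounding file actually fences it:

````md
```toml
[project.entry-points."torch_xla.plugins"]
example = "torch_xla_cpu_plugin:CpuPlugin"
```
````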
1 change: 1 addition & 0 deletions docs/source/features/pallas.md
@@ -95,6 +95,7 @@ output = torch.ops.xla.paged_attention(
)
```

+(pallas-integration-example)=
#### Integration Example

The vLLM TPU integration utilizes [PagedAttention
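`#### Integration Example` is a level-4 heading, one level deeper than `myst_heading_anchors = 3` covers, so it receives no automatic anchor; the explicit `(pallas-integration-example)=` target keeps it linkable. A sketch of referencing such a target (the link phrasing is illustrative):

```md
(pallas-integration-example)=
#### Integration Example

<!-- An explicit target is referencable by name, e.g. via the ref role: -->
{ref}`pallas-integration-example`
```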
19 changes: 12 additions & 7 deletions docs/source/learn/_pjrt.md
@@ -1,3 +1,7 @@
+---
+orphan: true
+---
+
# PJRT Runtime

PyTorch/XLA has migrated from the TensorFlow-based XRT runtime to the
@@ -39,7 +43,7 @@ the `runtime` tag.
per device. On TPU v2 and v3 in PJRT, workloads are multiprocess and
multithreaded (4 processes with 2 threads each), so your workload
should be thread-safe. See [Multithreading on TPU
-v2/v3](#multithreading-on-tpu-v2v3) and the [Multiprocessing section
+v2/v3](multithreading-on-tpu-v2v3) and the [Multiprocessing section
of the API
guide](https://github.com/pytorch/xla/blob/master/API_GUIDE.md#running-on-multiple-xla-devices-with-multi-processing)
for more information. Key differences to keep in mind:
@@ -267,7 +271,7 @@ for more information about TPU architecture.
from .
- Under XRT, the server process is the only process that interacts
with the TPU devices, and client processes don't have direct access
-to the TPU devices. When profiling a single-host TPU (e.g. v3-8 or
+to the TPU devices. When profiling a single-host TPU (e.g. v3-8 or
v4-8), you would normally see 8 device traces (one for each TPU
core). With PJRT, each process has one chip, and a profile from that
process will show only 2 TPU cores.
@@ -282,11 +286,12 @@ for more information about TPU architecture.
each TPU host
(`[gcloud compute tpus tpu-vm scp](https://cloud.google.com/sdk/gcloud/reference/alpha/compute/tpus/tpu-vm/scp)`)
and run the code on each host in parallel
-(e.g. `[gcloud compute tpus tpu-vm ssh --workers=all --command="PJRT_DEVICE=TPU python run.py"](https://cloud.google.com/sdk/gcloud/reference/alpha/compute/tpus/tpu-vm/ssh)`)
+(e.g. `[gcloud compute tpus tpu-vm ssh --workers=all --command="PJRT_DEVICE=TPU python run.py"](https://cloud.google.com/sdk/gcloud/reference/alpha/compute/tpus/tpu-vm/ssh)`)
- `xm.rendezvous` has been reimplemented using XLA-native collective
communication to enhance stability on large TPU pods. See below for
more details.

+(multithreading-on-tpu-v2v3)=
### Multithreading on TPU v2/v3

On TPU v2 and v3, **distributed workloads always run multithreaded**,
@@ -332,7 +337,7 @@ implementation:
- Because XLA does not permit collective operations to run on a subset
of workers, all workers must participate in the `rendezvous`.

-If you require the old behavior of `xm.rendezvous` (i.e. communicating
+If you require the old behavior of `xm.rendezvous` (i.e. communicating
data without altering the XLA graph and/or synchronizing a subset of
workers), consider using [`torch.distributed.barrier`](https://pytorch.org/docs/stable/distributed.html#torch.distributed.barrier)
@@ -358,7 +363,7 @@ from the PyTorch documentation. Keep in mind these constraints:
*New in PyTorch/XLA r2.0*

When using PJRT with `torch.distributed` and
-`[torch.nn.parallel.DistributedDataParallel](https://github.com/pytorch/xla/blob/master/docs/ddp.md)`
+`[torch.nn.parallel.DistributedDataParallel](https://github.com/pytorch/xla/blob/master/docs/source/perf/ddp.md)`
we strongly recommend using the new `xla://` `init_method`, which
automatically finds the replica IDs, world size, and master IP by
querying the runtime. For example:
@@ -398,9 +403,9 @@ Note: For TPU v2/v3, you still need to import
`torch.distributed` is still experimental.

For more information about using `DistributedDataParallel` on
-PyTorch/XLA, see [ddp.md](./ddp.md) on TPU V4. For an example that uses
+PyTorch/XLA, see [ddp.md](../perf/ddp.md) on TPU V4. For an example that uses
DDP and PJRT together, run the following [example
-script](../test/test_train_mp_imagenet.py) on a TPU:
+script](../../../test/test_train_mp_imagenet.py) on a TPU:

``` bash
PJRT_DEVICE=TPU python xla/test/test_train_mp_mnist.py --ddp --pjrt_distributed --fake_data --num_epochs 1
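The new front matter marks `_pjrt.md` as an orphan, telling Sphinx not to emit the "document isn't included in any toctree" warning for this legacy page. The pattern in isolation (the page title is illustrative):

```md
---
orphan: true
---

<!-- Sphinx builds the page but suppresses the
     "document isn't included in any toctree" warning. -->
# Legacy Page Kept Outside the Toctree
```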
2 changes: 1 addition & 1 deletion docs/source/learn/pytorch-on-xla-devices.md
@@ -103,7 +103,7 @@ XLA. The model definition, dataloader, optimizer and training loop can
work on any device. The only XLA-specific code is a couple of lines that
acquire the XLA device and materialize the tensors. Calling `torch_xla.sync()`
at the end of each training iteration causes XLA to execute its current
-graph and update the model's parameters. See {ref}`XLA Tensor Deep Dive`
+graph and update the model's parameters. See [](#xla-tensor-deep-dive)
for more on how XLA creates graphs and runs
operations.

4 changes: 2 additions & 2 deletions docs/source/learn/troubleshoot.md
@@ -164,7 +164,7 @@ disable execution analysis by `PT_XLA_DEBUG_LEVEL=1`). To use
PyTorch/XLA efficiently, we expect the same model code to be run for
every step and compilation to happen only once per graph. If you keep
seeing `Compilation Cause`, you should try to dump the IR/HLO following
-{ref}`Common Debugging Environment Variables Combinations` and
+[](#common-debugging-environment-variables-combinations) and
compare the graphs for each step and understand the source of the
differences.

@@ -313,7 +313,7 @@ If your model shows bad performance, keep in mind the following caveats:
*Solution*:

- For most ops we can lower them to XLA to fix it. Check out
-{ref}`Get A Metrics Report` to find out the
+[](#get-a-metrics-report) to find out the
missing ops and open a feature request on
[GitHub](https://github.com/pytorch/xla/issues).

4 changes: 2 additions & 2 deletions docs/source/perf/spmd_advanced.md
@@ -3,7 +3,7 @@
This guide covers advanced topics with SPMD. Please read the
[SPMD user guide](https://github.com/pytorch/xla/blob/master/docs/spmd_basic.md) as a prerequisite.

-### Sharding-Aware Host-to-Device Data Loading
+## Sharding-Aware Host-to-Device Data Loading

SPMD takes a single-device program, shards it, and executes it in parallel.

@@ -38,7 +38,7 @@ train_loader = pl.MpDeviceLoader(
)
```

-### Virtual device optimization
+## Virtual device optimization

PyTorch/XLA normally transfers tensor data asynchronously from host to device once the tensor is defined. This is to overlap the data transfer with the graph tracing time. However, because SPMD allows the user to modify the tensor sharding _after_ the tensor has been defined, we need an optimization to prevent unnecessary transfer of tensor data back and forth between host and device. We introduce Virtual Device Optimization, a technique to place the tensor data on a virtual device SPMD:0 first, before uploading to the physical devices when all the sharding decisions are finalized. Every tensor data in SPMD mode is placed on a virtual device, SPMD:0. The virtual device is exposed to the user as an XLA device XLA:0 with the actual shards on physical devices, like TPU:0, TPU:1, etc.
