🐛 Bug
Hello!
I'm trying to run a basic distributed all_reduce program on two CPU nodes using torch_xla. I'm simulating a distributed environment with two Docker containers connected by a network, and I set all the necessary environment variables at Docker startup.
Interestingly, the program running on one node waits for the program on the other node to start executing. However, the all_reduce does not appear to be performed correctly: the resulting tensor stays unchanged, equal to tensor([1.]).
Almost the same example works correctly on two GPU nodes, so maybe I'm missing some detail.
To Reproduce
Here is the code.
import os
import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.xla_backend
import torch_xla.distributed.xla_multiprocessing as xmp
import torch.distributed as dist


def main(rank):
    dist.init_process_group('xla', init_method='xla://')
    device = xm.xla_device()
    rank = xr.global_ordinal()
    world_size = xr.world_size()

    print(f'Device type: {xr.device_type()}')
    print(f'rank_arg: {rank} xr.global_ordinal(): {rank} xr.world_size(): {world_size} dist.world_size(): {dist.get_world_size()}')
    print(f'xr.global_runtime_device_count(): {xr.global_runtime_device_count()}')

    tensor = torch.tensor([rank + 1]).float().to(device)
    print(f'[Rank {rank}] Before all_reduce: {tensor}')

    reduced_tensor = xm.all_reduce(xm.REDUCE_SUM, tensor)
    xm.mark_step()
    print(f'[Rank {rank}] After all_reduce: {reduced_tensor}')


if __name__ == '__main__':
    # This function is a wrapper of multiprocessing spawn to allow the user to run the script
    # with the torchrun command line as well. Each process will only be able to access the
    # device assigned to the current process.
    # torch_xla.launch(main, args=())

    # Without xmp.spawn, PJRT is not enabled
    xmp.spawn(main)
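For a cross-check, here is a minimal sketch of the same collective issued through torch.distributed instead of xm.all_reduce, so it goes through the xla process group created by init_process_group (the check function name and the tensor values are only illustrative, not part of the original run):

import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.xla_backend  # registers the 'xla' backend
import torch_xla.distributed.xla_multiprocessing as xmp


def check(index):
    dist.init_process_group('xla', init_method='xla://')
    device = xm.xla_device()
    ordinal = xr.global_ordinal()

    # Each worker contributes ordinal + 1, so a working SUM across two workers
    # must change the value; an unchanged tensor points at a single-worker group.
    tensor = torch.tensor([ordinal + 1.0], device=device)
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    xm.mark_step()
    print(f'[ordinal {ordinal}] world={dist.get_world_size()} after all_reduce: {tensor.cpu()}')


if __name__ == '__main__':
    xmp.spawn(check)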
Here is my Docker Compose file:
services:
  cpu_xla_0:
    image: cpu_xla
    container_name: cpu_xla_0
    shm_size: 16GB
    volumes:
      - ./:/shared
    working_dir: /shared
    networks:
      - cpu_xla_network
    tty: true          # -it
    stdin_open: true
    environment:
      - WORLD_SIZE=2
      - RANK=0
      - MASTER_ADDR=cpu_xla_0
      - MASTER_PORT=1234
      - PJRT_DEVICE=CPU
      - CPU_NUM_DEVICES=1

  cpu_xla_1:
    image: cpu_xla
    container_name: cpu_xla_1
    shm_size: 16GB
    volumes:
      - ./:/shared
    working_dir: /shared
    networks:
      - cpu_xla_network
    tty: true
    stdin_open: true
    environment:
      - WORLD_SIZE=2
      - RANK=1
      - MASTER_ADDR=cpu_xla_0
      - MASTER_PORT=1234
      - PJRT_DEVICE=CPU
      - CPU_NUM_DEVICES=1

networks:
  cpu_xla_network:
    name: cpu_xla_network
    driver: bridge
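Before running the reproduction, a quick sanity check inside each container can print the rendezvous variables and what the runtime reports (a minimal sketch using only the variables defined above; whether the global device count should cover both containers on CPU is exactly what I'm unsure about):

import os
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr

# Rendezvous-related variables injected by the compose file above.
for var in ('MASTER_ADDR', 'MASTER_PORT', 'RANK', 'WORLD_SIZE', 'PJRT_DEVICE', 'CPU_NUM_DEVICES'):
    print(f'{var}={os.environ.get(var)}')

# Devices visible to this process; if the two containers were joined into one
# runtime, the global count would be expected to reflect both workers.
print('xm.get_xla_supported_devices():', xm.get_xla_supported_devices())
print('xr.global_runtime_device_count():', xr.global_runtime_device_count())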
Expected behavior
I expect the resulting tensor after the all_reduce to be equal to tensor([2.]).
Environment
- Reproducible on XLA backend = CPU
- torch_xla version == 2.6