all_reduce doesn't work on multiple CPUs #9496

@darudimimi

Description

🐛 Bug

Hello!

I'm trying to run a basic distributed all_reduce program on two CPU nodes using torch_xla. I'm simulating a distributed environment using two Docker containers connected by a network. I set up all the necessary environment variables during Docker startup.

Interestingly, the program on one node waits for the program on the other node to start, so the two processes do find each other. However, the all_reduce itself does not seem to execute correctly: the tensor stays unchanged at tensor([1.]).

Almost the same example works correctly on two GPU nodes, so maybe I'm missing some detail.
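
As a sanity check, the network rendezvous between the two containers can be verified independently of torch_xla with a plain gloo all_reduce that reuses the same MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE environment variables. This is only a sketch, not something from my failing runs:

import torch
import torch.distributed as dist

def main():
    # init_method='env://' reads MASTER_ADDR, MASTER_PORT, RANK and
    # WORLD_SIZE from the environment set in the compose file.
    dist.init_process_group('gloo', init_method='env://')

    rank = dist.get_rank()
    tensor = torch.tensor([rank + 1]).float()

    # With ranks 0 and 1 contributing 1. and 2., both ranks should
    # print 3. if the network between the containers works.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print(f'[Rank {rank}] gloo all_reduce result: {tensor}')

if __name__ == '__main__':
    main()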

To Reproduce

Here is the code.

import os
import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.xla_backend
import torch_xla.distributed.xla_multiprocessing as xmp
import torch.distributed as dist

def main(rank_arg):
    dist.init_process_group('xla', init_method='xla://')
    device = xm.xla_device()

    rank = xr.global_ordinal()
    world_size = xr.world_size()

    print(f'Device type: {xr.device_type()}')
    print(f'rank_arg: {rank_arg} xr.global_ordinal(): {rank} xr.world_size(): {world_size} dist.get_world_size(): {dist.get_world_size()}')
    print(f'xr.global_runtime_device_count(): {xr.global_runtime_device_count()}')

    tensor = torch.tensor([rank + 1]).float().to(device)
    print(f'[Rank {rank}] Before all_reduce: {tensor}')

    # Passing a single tensor to xm.all_reduce returns the reduced tensor
    # instead of reducing in place.
    reduced_tensor = xm.all_reduce(xm.REDUCE_SUM, tensor)

    # mark_step() forces the pending XLA graph to execute.
    xm.mark_step()

    print(f'[Rank {rank}] After all_reduce: {reduced_tensor}')


    

if __name__ == '__main__':
    # torch_xla.launch is a wrapper around multiprocessing spawn that also allows running
    # the script with the torchrun command line. Each process only sees the device
    # assigned to the current process.
    # torch_xla.launch(main, args=())

    # Without xmp.spawn, PJRT does not get enabled
    xmp.spawn(main)
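
For comparison, the same sum could also be requested through the torch.distributed API, since the script already initializes the xla process group. This is only a sketch of a snippet that could be appended at the end of main() above; dist.all_reduce reduces the tensor in place:

    # Same collective, but via the xla backend registered by
    # torch_xla.distributed.xla_backend; dist_tensor is reduced in place.
    dist_tensor = torch.tensor([rank + 1]).float().to(device)
    dist.all_reduce(dist_tensor, op=dist.ReduceOp.SUM)
    xm.mark_step()
    print(f'[Rank {rank}] After dist.all_reduce: {dist_tensor}')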

Here is my docker compose file:

services:
  cpu_xla_0:
    image: cpu_xla
    container_name: cpu_xla_0
    shm_size: 16GB
    volumes:
      - ./:/shared
    working_dir: /shared
    networks:
      - cpu_xla_network
    tty: true # -it
    stdin_open: true
    environment:
      - WORLD_SIZE=2
      - RANK=0
      - MASTER_ADDR=cpu_xla_0
      - MASTER_PORT=1234
      - PJRT_DEVICE=CPU
      - CPU_NUM_DEVICES=1

  cpu_xla_1:
    image: cpu_xla
    container_name: cpu_xla_1
    shm_size: 16GB
    volumes:
      - ./:/shared
    working_dir: /shared
    networks:
      - cpu_xla_network
    tty: true
    stdin_open: true
    environment:
      - WORLD_SIZE=2
      - RANK=1
      - MASTER_ADDR=cpu_xla_0 
      - MASTER_PORT=1234
      - PJRT_DEVICE=CPU
      - CPU_NUM_DEVICES=1


networks:
  cpu_xla_network:
    name: cpu_xla_network
    driver: bridge
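
To rule out the compose environment itself, a quick check like the following can be run in each container before the repro script; it only prints what the Python process sees (again a sketch, not part of my original runs):

import os

# RANK should differ between cpu_xla_0 and cpu_xla_1,
# everything else should match across the two containers.
for key in ('WORLD_SIZE', 'RANK', 'MASTER_ADDR', 'MASTER_PORT',
            'PJRT_DEVICE', 'CPU_NUM_DEVICES'):
    print(key, '=', os.environ.get(key))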

Expected behavior

I expect the resulting tensor after all_reduce to be tensor([2.]).

Environment

  • Reproducible on XLA backend = CPU
  • torch_xla version == 2.6

Labels

bug (Something isn't working), distributed (SPMD and other distributed things)
