all_reduce doesn't work on multiple CPUs #9496

@darudimimi

Description

🐛 Bug

Hello!

I'm trying to run a basic distributed all_reduce program on two CPU nodes using torch_xla. I'm simulating a distributed environment using two Docker containers connected by a network. I set up all the necessary environment variables during Docker startup.

Interestingly, the program on one node waits for the program on the other node to start, so the two processes do find each other. However, the all_reduce itself does not seem to execute correctly: the tensor stays unchanged at tensor([1.]).

Almost the same example works correctly on two GPU nodes, so maybe I'm missing some detail.
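
As a sanity check, the network rendezvous between the two containers can be verified independently of torch_xla with a plain gloo all_reduce that reuses the same MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE environment variables. This is only a sketch, not something from my failing runs:

import torch
import torch.distributed as dist

def main():
    # init_method='env://' reads MASTER_ADDR, MASTER_PORT, RANK and
    # WORLD_SIZE from the environment set in the compose file.
    dist.init_process_group('gloo', init_method='env://')

    rank = dist.get_rank()
    tensor = torch.tensor([rank + 1]).float()

    # With ranks 0 and 1 contributing 1. and 2., both ranks should
    # print 3. if the network between the containers works.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    print(f'[Rank {rank}] gloo all_reduce result: {tensor}')

if __name__ == '__main__':
    main()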

To Reproduce

Here is the code.

import os
import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.xla_backend
import torch_xla.distributed.xla_multiprocessing as xmp
import torch.distributed as dist

def main(rank_arg):
    dist.init_process_group('xla', init_method='xla://')
    device = xm.xla_device()

    rank = xr.global_ordinal()
    world_size = xr.world_size()

    print(f'Device type: {xr.device_type()}')
    print(f'rank_arg: {rank_arg} xr.global_ordinal(): {rank} xr.world_size(): {world_size} dist.get_world_size(): {dist.get_world_size()}')
    print(f'xr.global_runtime_device_count(): {xr.global_runtime_device_count()}')

    tensor = torch.tensor([rank + 1]).float().to(device)
    print(f'[Rank {rank}] Before all_reduce: {tensor}')

    # Passing a single tensor to xm.all_reduce returns the reduced tensor
    # instead of reducing in place.
    reduced_tensor = xm.all_reduce(xm.REDUCE_SUM, tensor)

    # mark_step() forces the pending XLA graph to execute.
    xm.mark_step()

    print(f'[Rank {rank}] After all_reduce: {reduced_tensor}')


    

if __name__ == '__main__':
    # torch_xla.launch is a wrapper around multiprocessing spawn that also allows running
    # the script with the torchrun command line. Each process only sees the device
    # assigned to the current process.
    # torch_xla.launch(main, args=())

    # Without xmp.spawn, PJRT does not get enabled
    xmp.spawn(main)
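
For comparison, the same sum could also be requested through the torch.distributed API, since the script already initializes the xla process group. This is only a sketch of a snippet that could be appended at the end of main() above; dist.all_reduce reduces the tensor in place:

    # Same collective, but via the xla backend registered by
    # torch_xla.distributed.xla_backend; dist_tensor is reduced in place.
    dist_tensor = torch.tensor([rank + 1]).float().to(device)
    dist.all_reduce(dist_tensor, op=dist.ReduceOp.SUM)
    xm.mark_step()
    print(f'[Rank {rank}] After dist.all_reduce: {dist_tensor}')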

Here is my docker compose file:

services:
  cpu_xla_0:
    image: cpu_xla
    container_name: cpu_xla_0
    shm_size: 16GB
    volumes:
      - ./:/shared
    working_dir: /shared
    networks:
      - cpu_xla_network
    tty: true # -it
    stdin_open: true
    environment:
      - WORLD_SIZE=2
      - RANK=0
      - MASTER_ADDR=cpu_xla_0
      - MASTER_PORT=1234
      - PJRT_DEVICE=CPU
      - CPU_NUM_DEVICES=1

  cpu_xla_1:
    image: cpu_xla
    container_name: cpu_xla_1
    shm_size: 16GB
    volumes:
      - ./:/shared
    working_dir: /shared
    networks:
      - cpu_xla_network
    tty: true
    stdin_open: true
    environment:
      - WORLD_SIZE=2
      - RANK=1
      - MASTER_ADDR=cpu_xla_0 
      - MASTER_PORT=1234
      - PJRT_DEVICE=CPU
      - CPU_NUM_DEVICES=1


networks:
  cpu_xla_network:
    name: cpu_xla_network
    driver: bridge
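
To rule out the compose environment itself, a quick check like the following can be run in each container before the repro script; it only prints what the Python process sees (again a sketch, not part of my original runs):

import os

# RANK should differ between cpu_xla_0 and cpu_xla_1,
# everything else should match across the two containers.
for key in ('WORLD_SIZE', 'RANK', 'MASTER_ADDR', 'MASTER_PORT',
            'PJRT_DEVICE', 'CPU_NUM_DEVICES'):
    print(key, '=', os.environ.get(key))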

Expected behavior

I expect the resulting tensor after all_reduce to be tensor([2.]).

Environment

  • Reproducible on XLA backend = CPU
  • torch_xla version == 2.6

Labels

bug (Something isn't working), distributed (SPMD and other distributed things)
