How to adapt fs_model for torch.distributed #271

Open
opened 2022-05-27 17:03:16 +02:00 by BeaverInGreenland · 0 comments
BeaverInGreenland commented 2022-05-27 17:03:16 +02:00 (Migrated from github.com)

Inspired by [this tutorial](https://leimao.github.io/blog/PyTorch-Distributed-Training/#Launching-Distributed-Training), I am trying to adapt the scripts `train.py`, `data/data_loader_Swapping.py`, and `models/fs_model.py` for `torch.distributed.launch`.

I'm stuck at an error when I try to apply `torch.nn.parallel.DistributedDataParallel` to the model:

```python
temp_model = fsModel()

device = torch.device("cuda:{}".format(local_rank))
temp_model = temp_model.to(device)
model = torch.nn.parallel.DistributedDataParallel(temp_model, device_ids=[local_rank], output_device=local_rank)

model.initialize(opt)
```

This is the error I get:

```
AssertionError: DistributedDataParallel is not needed when a module doesn't have any parameter that requires a gradient.
```
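For context: `DistributedDataParallel` asserts when the wrapped module has no parameters with `requires_grad=True` at wrap time. If `fsModel` constructs its networks inside `initialize(opt)` rather than `__init__` (a common pattern in pix2pixHD-style codebases, though I haven't verified this for fs_model.py), the module is empty at the point where DDP wraps it, which would produce exactly this assertion. A minimal sketch reproducing the condition, using a hypothetical `LazyModel` as a stand-in:

```python
import torch.nn as nn

class LazyModel(nn.Module):
    """Stand-in for a model that defers network construction to initialize()."""

    def __init__(self):
        super().__init__()
        # No parameters created here -- wrapping in DDP at this point
        # triggers the "not needed" assertion.

    def initialize(self):
        # Parameters only appear after initialize() runs.
        self.net = nn.Linear(4, 4)

model = LazyModel()
print(sum(p.numel() for p in model.parameters()))  # 0 -> DDP would assert here

model.initialize()
print(sum(p.numel() for p in model.parameters()))  # 20 -> safe to wrap in DDP now
```

If this is the cause, moving `model.initialize(opt)` (and `.to(device)`) before the `DistributedDataParallel(...)` call should avoid the assertion.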

Has anyone successfully adapted the repo for multi-GPU training?
