I am trying to run the debug tutorial on a multi-node SLURM cluster (8 nodes, 4 GPUs per node). With dp_shard = 4, tp_size = 1, dp_replicate = 8, I get this error:
.......lingua/data.py", line 506, in distribute_data_to_rank
[rank30]: return rank_to_jsonl_iterator_params[rank]
[rank30]: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
[rank30]: IndexError: list index out of range
At first, I thought the reason was that the length of my rank_to_jsonl_iterator_params was less than the number of running processes, but by printing rank I later found that it can go up to 59, which is even higher than my process count (dp_degree = 32).
After some investigation, I think it may be caused by this code block in train.py:
# build dataloader
# need dp world size and rank
dp_mesh = world_mesh["dp_replicate"]
dp_degree = dp_mesh.size()  # 8
dp_rank = dp_mesh.get_local_rank()  # [0-7]
if args.distributed.dp_shard > 1:
    dp_rank = dp_rank * dp_degree + world_mesh["dp_shard"].get_local_rank()  # [0-7] * 8 + [0-3] = [0-59]
    dp_degree *= world_mesh["dp_shard"].size()  # 8 * 4 = 32

logger.info(f"Running on dp rank : {dp_rank}")  # [0-59]
logger.info(f"Running on dp size : {dp_degree}")  # 32
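To double-check the arithmetic without a cluster, here is a small pure-Python sketch. It assumes only that the two local ranks range over [0-7] and [0-3], as in the comments above. The function names and the alternative formula (multiplying by the dp_shard size rather than the dp_replicate size) are my own guess at what might have been intended, not something confirmed by the repository.

```python
from itertools import product

# Mesh sizes taken from the setup described in this issue.
DP_REPLICATE = 8
DP_SHARD = 4


def current_dp_rank(replicate_rank: int, shard_rank: int) -> int:
    # Mirrors the expression in train.py: the replicate rank is
    # multiplied by the dp_replicate size (dp_degree before the update).
    return replicate_rank * DP_REPLICATE + shard_rank


def row_major_dp_rank(replicate_rank: int, shard_rank: int) -> int:
    # Hypothetical alternative (my assumption): multiply by the dp_shard
    # size so the (replicate, shard) pair is flattened row-major.
    return replicate_rank * DP_SHARD + shard_rank


# Every (replicate_rank, shard_rank) coordinate in the 8 x 4 mesh.
coords = list(product(range(DP_REPLICATE), range(DP_SHARD)))
current = [current_dp_rank(r, s) for r, s in coords]
alternative = [row_major_dp_rank(r, s) for r, s in coords]
dp_degree = DP_REPLICATE * DP_SHARD

# Ranks that would index past a list of length dp_degree.
out_of_range = [rank for rank in current if rank >= dp_degree]

print(max(current), dp_degree)                        # 59 32
print(len(out_of_range))                              # 16
print(sorted(alternative) == list(range(dp_degree)))  # True
```

With the current expression, 16 of the 32 processes get a dp_rank of 32 or more, which would explain the IndexError when indexing a list of length 32. The alternative formula covers [0-31] exactly once, but I do not know whether that ordering is what the data loader expects.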
Referenced code: lingua/lingua/data.py, line 508 in f24c8e9
Referenced code: lingua/apps/main/train.py, line 243 in f24c8e9
So here is my question: is it normal that dp_rank can be higher than dp_size, or am I misunderstanding something? Thank you.