Commit

Reducing deepspeed timeout to 10mins (#132)
Signed-off-by: Mustafa Eyceoz <[email protected]>
Maxusmusti authored Jul 9, 2024
1 parent 22639da commit 2b744af
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/instructlab/training/main_ds.py
@@ -490,7 +490,7 @@ def main(args):
     #### distributed init #####
     torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
     args.local_rank = int(os.environ["LOCAL_RANK"])
-    deepspeed.init_distributed(timeout=timedelta(minutes=360))
+    deepspeed.init_distributed(timeout=timedelta(minutes=10))
     args.global_rank = torch.distributed.get_rank()
     tensor = torch.ByteTensor([False]).cuda()
     torch.distributed.all_reduce(tensor)
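
For scale, the two timeout values can be compared with the standard library alone. This sketch is not part of the commit: the names OLD_TIMEOUT, NEW_TIMEOUT, and describe are illustrative, and it only inspects the timedelta values passed to deepspeed.init_distributed without initializing any process group.

```python
from datetime import timedelta

# Values taken from the removed and added lines of the diff.
OLD_TIMEOUT = timedelta(minutes=360)
NEW_TIMEOUT = timedelta(minutes=10)


def describe(timeout: timedelta) -> str:
    """Render a timeout as seconds plus its H:MM:SS form."""
    return f"{timeout.total_seconds():.0f} s ({timeout})"


print("before:", describe(OLD_TIMEOUT))  # before: 21600 s (6:00:00)
print("after: ", describe(NEW_TIMEOUT))  # after:  600 s (0:10:00)
print("ratio: ", OLD_TIMEOUT / NEW_TIMEOUT)  # ratio:  36.0
```

With the shorter value, a stalled collective (for example, when one rank has crashed) should surface as a timeout after roughly 10 minutes instead of 6 hours. How the timeout is actually enforced depends on the communication backend and its configuration.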