
medaka predict: concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending. #536

mariusmessemaker opened this issue Oct 25, 2024 · 1 comment

mariusmessemaker commented Oct 25, 2024

Describe the bug
When executing hundreds of medaka smolecule runs in parallel, I attempted to allocate CPUs based on the expected complexity of each smolecule.fa file, as follows (a sketch of this rule is shown after the list):

  • High complexity: Files with more than 600 alignments and over 150 regions receive 3 threads.
  • Medium complexity: Files with 50-600 alignments and 12-150 regions receive 2 threads.
  • Low complexity: All other files receive 1 thread.
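
A minimal sketch of that allocation rule (the helper is hypothetical, not part of medaka; it only restates the thresholds above):

# Hypothetical helper: pick a thread count for one smolecule.fa file
# based on its alignment and region counts, using the thresholds above.
def threads_for_file(n_alignments: int, n_regions: int) -> int:
    if n_alignments > 600 and n_regions > 150:
        return 3  # high complexity
    if 50 <= n_alignments <= 600 and 12 <= n_regions <= 150:
        return 2  # medium complexity
    return 1      # low complexity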

Alternatively, I tried setting a uniform allocation of 2 threads for all smolecule.fa files, regardless of complexity.

I allocate 1 GB of memory per CPU of a run (i.e., 100 GB of memory for a run with 100 CPUs in total), so each thread is allocated ~1 GB of memory. However, since this is not a hard requirement for a medaka smolecule task to start, some smolecule.fa tasks may have less memory available during the run.

I use the following medaka smolecule command (e.g., for 3 threads):

medaka smolecule --method spoa --depth 2 --length 50 --threads 3 --batch_size 100 --chunk_len 1000 --chunk_ovlp 500 --model r1041_e82_400bps_hac_v4.2.0 medaka_tmp_out_dir smolecule.fa

In most cases, consensus sequence generation completes successfully for smolecule.fa files. However, in certain cases:

  • When threads are allocated based on file complexity, approximately 20% of the smolecule.fa files produce the error message below during the sampling or prediction stage, or simply stall at the same position in the log without emitting the error message.
  • When a uniform allocation of 2 threads is used, approximately 1% of the smolecule.fa files fail or stall in the same way.

Error message:
(during sampling process)

[05:15:03 - Sampler] Initializing sampler for consensus of region 1906:0-1546.
[05:15:03 - Feature] Processed 1800:0.0-1546.0 (median depth 3.0)
[05:15:03 - Sampler] Took 0.02s to make features.
[05:15:03 - Sampler] Initializing sampler for consensus of region 1967:0-1542.
[05:15:03 - Feature] Processed 1967:0.0-1541.0 (median depth 3.0)
[05:15:03 - Sampler] Took 0.01s to make features.
[05:15:03 - Sampler] Initializing sampler for consensus of region 1984:0-1542.
[05:15:03 - Feature] Processed 1906:0.0-1545.0 (median depth 3.0)
[05:15:03 - Sampler] Took 0.03s to make features.
[05:15:03 - Sampler] Initializing sampler for consensus of region 1988:0-1544.
[05:15:03 - Feature] Processed 1984:0.0-1541.0 (median depth 3.0)
[05:15:03 - Sampler] Took 0.03s to make features.
[05:15:03 - Feature] Processed 1988:0.0-1543.0 (median depth 3.0)
[05:15:03 - Sampler] Took 0.06s to make features.
Traceback (most recent call last):
  File "/home/m.messemaker/miniconda3/envs/py310_nanopore_tcr_consensus_v3/bin/medaka", line 11, in <module>
    sys.exit(main())
  File "/home/m.messemaker/miniconda3/envs/py310_nanopore_tcr_consensus_v3/lib/python3.10/site-packages/medaka/medaka.py", line 836, in main
    args.func(args)
  File "/home/m.messemaker/miniconda3/envs/py310_nanopore_tcr_consensus_v3/lib/python3.10/site-packages/medaka/smolecule.py", line 498, in main
    _ = fut.result()
  File "/home/m.messemaker/miniconda3/envs/py310_nanopore_tcr_consensus_v3/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/m.messemaker/miniconda3/envs/py310_nanopore_tcr_consensus_v3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

(during prediction process)

[07:01:51 - PWorker] Batches in cache: 8.
[07:01:51 - PWorker] 22.8% Done (0.2/0.9 Mbases) in 377.1s
[07:05:04 - PWorker] Batches in cache: 8.
[07:05:04 - PWorker] 28.6% Done (0.2/0.9 Mbases) in 569.6s
[07:05:38 - PWorker] Batches in cache: 8.
[07:05:38 - PWorker] 34.3% Done (0.3/0.9 Mbases) in 604.3s
[07:05:51 - PWorker] Batches in cache: 8.
[07:05:51 - PWorker] 40.1% Done (0.3/0.9 Mbases) in 616.9s
[07:06:41 - PWorker] Batches in cache: 8.
[07:06:41 - PWorker] 45.8% Done (0.4/0.9 Mbases) in 666.8s
[07:06:47 - PWorker] Batches in cache: 8.
[07:08:53 - PWorker] Batches in cache: 8.
[07:08:53 - PWorker] 57.3% Done (0.5/0.9 Mbases) in 799.3s
Traceback (most recent call last):
  File "/home/m.messemaker/miniconda3/envs/py310_nanopore_tcr_consensus_v3/bin/medaka", line 11, in <module>
    sys.exit(main())
  File "/home/m.messemaker/miniconda3/envs/py310_nanopore_tcr_consensus_v3/lib/python3.10/site-packages/medaka/medaka.py", line 836, in main
    args.func(args)
  File "/home/m.messemaker/miniconda3/envs/py310_nanopore_tcr_consensus_v3/lib/python3.10/site-packages/medaka/smolecule.py", line 498, in main
    _ = fut.result()
  File "/home/m.messemaker/miniconda3/envs/py310_nanopore_tcr_consensus_v3/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/m.messemaker/miniconda3/envs/py310_nanopore_tcr_consensus_v3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Alternatively, medaka smolecule simply stalls at the same position in the log without emitting the error message.

Lastly, neither the error nor the stalling occurs when I run with 25 threads per smolecule.fa file, but this approach prevents me from completing all iterations over my smolecule.fa files, as too many threads remain idle for each file.
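
For context, BrokenProcessPool is raised by Python's concurrent.futures whenever a worker process in the pool dies abruptly, for example when the kernel's OOM killer terminates it. A minimal, medaka-independent sketch of how the exception surfaces:

import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def worker():
    # Simulate a worker being killed abruptly (as an out-of-memory kill would).
    os._exit(1)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        fut = pool.submit(worker)
        try:
            fut.result()
        except BrokenProcessPool as exc:
            print("worker terminated abruptly:", exc)

This is why the traceback points at fut.result() in smolecule.py: the parent process only surfaces the abrupt death of a child when it collects the result.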

Is this error due to a memory allocation issue?

  • Should I allocate more memory per thread?
  • Would decreasing the batch size in medaka predict improve stability? (See the example command below.)
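
For example, the same command as above with the batch size halved (the value 50 is purely illustrative):

medaka smolecule --method spoa --depth 2 --length 50 --threads 2 --batch_size 50 --chunk_len 1000 --chunk_ovlp 500 --model r1041_e82_400bps_hac_v4.2.0 medaka_tmp_out_dir smolecule.fa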

Thank you so much for your help again!
Marius
Example log showing the number of alignments and regions:

head region_cluster9_consensus_smolecule.log
[05:02:46 - Smolecule] Given one input file, subreads are assumed to be grouped by read.
[05:02:46 - Smolecule] Running spoa pre-medaka consensus for all reads.
[05:04:40 - POAManager] Created 709 consensus with 709 alignments.
[05:04:40 - Smolecule] Writing medaka input bam for 709 reads.
[05:04:41 - Smolecule] Running medaka consensus.
[05:04:43 - Predict] Processing region(s): 191:0-1541 133:0-1542 227:0-1542 468:0-1542 642:0-1542 70:0-1542 322:0-1542 529:0-1542 551:0-1542 631:0-1542 703:0-1542 125:0-1542 355:0-1542 358:0-1542 386:0-1542 783:0-1542 1173:0-1542 76:0-1543 128:0-1541 140:0-1542 201:0-1542 287:0-1542 290:0-1542 309:0-1542 319:0-1542 346:0-1542 373:0-1542 435:0-1542 460:0-1542 489:0-1542 549:0-1543 581:0-1542 598:0-1542 604:0-1542 706:0-1546 749:0-1542 779:0-1542 790:0-1544 818:0-1546 934:0-1542 1012:0-1542 1046:0-1542 1073:0-1542 1268:0-1542 1956:0-1540 1981:0-1541 31:0-1542 51:0-1542 82:0-1542 152:0-1543 156:0-1543 182:0-1542 216:0-1542 233:0-1543 243:0-1542 257:0-1540 324:0-1542 328:0-1541 357:0-1542 371:0-1542 384:0-1531 385:0-1542 394:0-1542 415:0-1542 417:0-1541 452:0-1542 453:0-1548 457:0-1543 470:0-1542 479:0-1542 482:0-1542 490:0-1546 515:0-1543 525:0-1542 573:0-1542 610:0-1542 644:0-1544 663:0-1542 699:0-1542 710:0-1542 718:0-1542 733:0-1542 734:0-1542 755:0-1542 785:0-1542 807:0-1542 808:0-1542 977:0-1541 989:0-1543 1015:0-1542 1020:0-1547 1024:0-1542 1044:0-1542 1050:0-1542 1066:0-1542 1113:0-1542 1156:0-1542 1162:0-1542 1170:0-1543 1193:0-1541 1338:0-1542 1431:0-1542 3:0-1543 15:0-1542 23:0-1542 37:0-1542 44:0-1542 69:0-1542 91:0-1544 95:0-1542 101:0-1543 105:0-1543 111:0-1542 119:0-1542 120:0-1546 124:0-1542 130:0-1540 162:0-1547 169:0-1540 171:0-1542 175:0-1542 185:0-1541 196:0-1542 199:0-1542 228:0-1541 231:0-1542 246:0-1541 258:0-1542 263:0-1542 269:0-1542 274:0-1543 276:0-1542 301:0-1542 310:0-1542 327:0-1542 356:0-1542 360:0-1543 364:0-1542 366:0-1542 367:0-1542 370:0-1540 376:0-1542 377:0-1542 380:0-1542 402:0-1542 406:0-1542 407:0-1542 421:0-1542 428:0-1539 438:0-1541 450:0-1554 456:0-1542 472:0-1544 475:0-1544 495:0-1541 499:0-1543 533:0-1542 535:0-1542 536:0-1542 538:0-1542 559:0-1542 565:0-1541 571:0-1542 580:0-1542 592:0-1542 595:0-1544 601:0-1542 620:0-1541 633:0-1541 634:0-1541 640:0-1542 660:0-1542 673:0-1542 686:0-1542 700:0-1550 701:0-1539 719:0-1542 724:0-1542 777:0-1542 781:0-1542 788:0-1541 829:0-1544 831:0-1540 835:0-1542 837:0-1542 840:0-1542 854:0-1542 859:0-1542 862:0-1543 888:0-1548 897:0-1542 908:0-1542 917:0-1542 929:0-1544 931:0-1542 972:0-1542 986:0-1543 997:0-1542 1000:0-1542 1004:0-1542 1009:0-1545 1014:0-1542 1028:0-1545 1056:0-1543 1061:0-1544 1062:0-1543 1070:0-1542 1134:0-1542 1143:0-1543 1147:0-1542 1176:0-1542 1201:0-1542 1234:0-1542 1286:0-1542 1306:0-1551 1325:0-1542 1395:0-1542 1453:0-1542 1502:0-1542 1514:0-1542 1558:0-1542 28:0-1544 46:0-1543 52:0-1542 56:0-1553 75:0-1547 78:0-1542 81:0-1545 97:0-1542 99:0-1544 106:0-1546 108:0-1545 117:0-1542 121:0-1542 122:0-1548 127:0-1540 131:0-1544 132:0-1543 138:0-1543 143:0-1542 148:0-1543 163:0-1542 166:0-1542 184:0-1542 189:0-1542 205:0-1543 208:0-1544 217:0-1544 224:0-1543 229:0-1542 232:0-1541 250:0-1540 253:0-1543 260:0-1542 281:0-1542 285:0-1542 300:0-1547 308:0-1542 318:0-1542 323:0-1544 326:0-1543 332:0-1555 338:0-1545 339:0-1544 344:0-1542 347:0-1542 361:0-1542 362:0-1541 382:0-1541 387:0-1542 391:0-1542 393:0-1542 404:0-1543 405:0-1542 411:0-1542 430:0-1543 436:0-1543 439:0-1536 442:0-1542 445:0-1542 461:0-1543 463:0-1542 476:0-1542 478:0-1540 487:0-1543 492:0-1542 503:0-1542 508:0-1542 518:0-1541 519:0-1542 522:0-1543 527:0-1545 531:0-1544 543:0-1545 550:0-1545 558:0-1543 566:0-1542 583:0-1542 597:0-1542 600:0-1553 602:0-1544 603:0-1542 607:0-1543 612:0-1544 614:0-1543 628:0-1542 629:0-1542 630:0-1542 632:0-1542 638:0-1543 639:0-1547 645:0-1543 655:0-1542 680:0-1543 682:0-1543 691:0-1543 716:0-1543 
730:0-1542 731:0-1543 737:0-1542 739:0-1543 746:0-1542 751:0-1542 756:0-1543 759:0-1542 762:0-1542 768:0-1544 769:0-1542 786:0-1545 794:0-1542 795:0-1541 824:0-1543 861:0-1543 877:0-1540 896:0-1543 898:0-1542 899:0-1544 905:0-1542 913:0-1542 918:0-1543 921:0-1542 923:0-1542 925:0-1544 932:0-1543 936:0-1546 958:0-1542 984:0-1543 1008:0-1542 1017:0-1549 1021:0-1542 1029:0-1542 1051:0-1542 1064:0-1540 1072:0-1549 1099:0-1541 1110:0-1542 1111:0-1541 1122:0-1542 1123:0-1543 1126:0-1540 1131:0-1542 1135:0-1542 1138:0-1537 1140:0-1543 1146:0-1542 1151:0-1543 1174:0-1543 1182:0-1544 1185:0-1542 1190:0-1542 1191:0-1543 1195:0-1543 1209:0-1544 1218:0-1542 1221:0-1545 1225:0-1542 1226:0-1542 1229:0-1543 1233:0-1540 1239:0-1542 1245:0-1545 1246:0-1542 1255:0-1542 1257:0-1542 1259:0-1542 1267:0-1542 1274:0-1544 1280:0-1543 1304:0-1542 1305:0-1542 1313:0-1541 1314:0-1543 1318:0-1544 1327:0-1544 1336:0-1543 1345:0-1544 1359:0-1542 1374:0-1544 1376:0-1542 1391:0-1542 1393:0-1544 1409:0-1543 1422:0-1542 1425:0-1546 1435:0-1542 1446:0-1543 1465:0-1546 1471:0-1542 1478:0-1544 1499:0-1542 1511:0-1542 1524:0-1541 1528:0-1542 1644:0-1542 1663:0-1543 1664:0-1543 1697:0-1542 1784:0-1542 1908:0-1542 1996:0-1542 2006:0-1541 2208:0-1538 2219:0-1542 1:0-1542 11:0-1547 29:0-1546 63:0-1540 64:0-1543 65:0-1542 72:0-1542 73:0-1545 77:0-1542 86:0-1546 88:0-1543 104:0-1545 110:0-1542 113:0-1545 123:0-1542 144:0-1544 150:0-1542 158:0-1543 160:0-1546 165:0-1542 167:0-1543 168:0-1542 172:0-1535 179:0-1545 181:0-1542 197:0-1541 200:0-1544 202:0-1547 206:0-1546 211:0-1543 215:0-1542 222:0-1543 235:0-1544 236:0-1546 237:0-1542 242:0-1543 244:0-1542 247:0-1544 252:0-1542 261:0-1543 270:0-1542 275:0-1546 279:0-1543 291:0-1542 292:0-1546 294:0-1543 304:0-1544 305:0-1544 329:0-1541 331:0-1543 335:0-1541 337:0-1544 341:0-1544 345:0-1549 349:0-1548 353:0-1542 365:0-1542 368:0-1542 372:0-1542 378:0-1541 388:0-1543 390:0-1548 395:0-1545 399:0-1545 400:0-1542 403:0-1542 414:0-1542 422:0-1542 425:0-1543 427:0-1545 433:0-1547 440:0-1542 441:0-1543 447:0-1542 449:0-1540 458:0-1542 466:0-1542 474:0-1542 480:0-1541 483:0-1542 493:0-1547 496:0-1542 509:0-1542 514:0-1543 517:0-1540 532:0-1543 537:0-1547 541:0-1543 547:0-1544 548:0-1548 552:0-1547 557:0-1545 560:0-1543 567:0-1539 574:0-1543 577:0-1543 582:0-1540 587:0-1544 590:0-1546 591:0-1541 596:0-1543 608:0-1542 615:0-1543 618:0-1543 626:0-1542 641:0-1544 646:0-1543 659:0-1543 666:0-1538 667:0-1544 669:0-1542 670:0-1547 671:0-1547 674:0-1544 678:0-1543 683:0-1543 692:0-1542 693:0-1544 695:0-1549 715:0-1544 722:0-1541 727:0-1542 728:0-1546 732:0-1544 742:0-1539 743:0-1540 745:0-1541 750:0-1545 757:0-1542 758:0-1545 770:0-1541 775:0-1545 776:0-1541 778:0-1544 782:0-1541 789:0-1546 793:0-1540 796:0-1541 797:0-1543 799:0-1545 802:0-1546 811:0-1542 815:0-1540 819:0-1544 820:0-1543 826:0-1541 834:0-1542 839:0-1546 842:0-1544 844:0-1542 851:0-1542 852:0-1543 853:0-1545 857:0-1544 863:0-1543 866:0-1545 869:0-1544 870:0-1544 871:0-1543 873:0-1544 878:0-1545 882:0-1548 891:0-1544 903:0-1549 909:0-1543 928:0-1543 938:0-1541 942:0-1541 954:0-1540 957:0-1543 959:0-1542 960:0-1542 962:0-1550 964:0-1542 966:0-1542 973:0-1541 975:0-1582 981:0-1542 983:0-1542 987:0-1545 988:0-1543 993:0-1542 1013:0-1548 1016:0-1544 1023:0-1542 1027:0-1547 1032:0-1542 1033:0-1542 1036:0-1540 1037:0-1541 1039:0-1542 1040:0-1544 1041:0-1542 1063:0-1552 1065:0-1540 1071:0-1544 1090:0-1544 1094:0-1542 1096:0-1542 1104:0-1546 1108:0-1543 1132:0-1543 1153:0-1542 1165:0-1542 1171:0-1542 1172:0-1551 1175:0-1545 1178:0-1545 
1179:0-1541 1188:0-1541 1206:0-1544 1212:0-1542 1220:0-1543 1223:0-1543 1230:0-1542 1240:0-1542 1241:0-1544 1243:0-1545 1250:0-1547 1251:0-1541 1263:0-1541 1264:0-1545 1272:0-1543 1277:0-1543 1289:0-1545 1293:0-1540 1301:0-1543 1331:0-1544 1342:0-1541 1343:0-1544 1360:0-1539 1365:0-1542 1366:0-1543 1371:0-1543 1380:0-1542 1382:0-1541 1386:0-1543 1388:0-1543 1392:0-1545 1433:0-1544 1439:0-1546 1448:0-1540 1449:0-1542 1457:0-1543 1462:0-1546 1489:0-1542 1509:0-1542 1523:0-1544 1542:0-1542 1561:0-1546 1567:0-1543 1584:0-1543 1589:0-1546 1592:0-1542 1609:0-1542 1612:0-1543 1620:0-1544 1631:0-1547 1636:0-1545 1641:0-1543 1651:0-1540 1661:0-1540 1692:0-1545 1702:0-1542 1703:0-1542 1710:0-1540 1718:0-1543 1721:0-1545 1730:0-1539 1734:0-1537 1736:0-1545 1758:0-1543 1779:0-1542 1785:0-1544 1800:0-1547 1906:0-1546 1967:0-1542 1984:0-1542 1988:0-1544 2007:0-1540 2021:0-1541 2095:0-1549 2192:0-1541 2214:0-1545 2296:0-1541 2298:0-1540

Environment:

  • Installation method: conda
  • OS: 22.04.5 LTS
  • medaka version: 2.0.1
  • No GPU

Additional information

  • median length of my input subreads (n = 4,323,293 subreads): 1525 bp
  • min length of my input subreads (n = 4,323,293 subreads): 1500 bp
  • max length of my input subreads (n = 4,323,293 subreads): 1636 bp

mariusmessemaker commented Oct 29, 2024

Update: I tried making memory a hard requirement for a medaka smolecule task to start. Specifically, across multiple runs I increased the memory required for a task to start from 1 GB to 4.25 GB in steps of 0.25 GB, while uniformly requiring at least 2 CPUs per task. Interestingly, even with the requirement raised to 4.25 GB, roughly 1% of the smolecule.fa files still stall at this position in the log (the error from the comment above is no longer returned):

[21:19:52 - PWorker] 96.4% Done (4.8/4.9 Mbases) in 1559.7s
[21:19:53 - PWorker] Processed 97 batches
[21:19:53 - PWorker] All done, 0 remainder regions.
[21:19:53 - Predict] Finished processing all regions.
[21:19:56 - Smolecule] Running medaka stitch.
[21:19:56 - DataIndx] Loaded 1/1 (100.00%) sample files.
[21:19:58 - DataIndx] Loaded 1/1 (100.00%) sample files.
[21:19:58 - DataIndx] Loaded 1/1 (100.00%) sample files.
[21:19:58 - DataIndx] Loaded 1/1 (100.00%) sample files.

In addition, medaka smolecule does not stall on the same smolecule.fa files across different runs, so the issue does not appear to be specific to memory or to particular smolecule.fa files.

Again, thank you for your help with resolving this issue!
