ValueError: Missing variants frozenset #393

Open
Jorisvansteenbrugge opened this issue Nov 29, 2024 · 3 comments
Labels
bug Something isn't working
Jorisvansteenbrugge commented Nov 29, 2024

Description of the bug

When running the pipeline with --run_ancestry, an error occurs during the process 'PGSCATALOG_PGSCCALC:PGSCCALC:APPLY_SCORE:SCORE_AGGREGATE (reference)'. The full command error is listed below, but the key error message is:

File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/calc/cli/aggregate_cli.py", line 50, in verify_variants
      raise ValueError(f"Missing variants {diff}")
  ValueError: Missing variants frozenset({'5:53383709:C:CAA', '4:125831837:A:AATAT'})

I have tested the pipeline before with another dataset, which did not result in the error. Similarly, running the pipeline on the current dataset without ancestry analysis does not result in the error.

Do you have any idea what might be causing this issue?

Command used and terminal output

$  nextflow run pgsc_calc/main.nf -c slurm.config -profile slurm --input /path/to/output_samplesheet.csv --pgs_id PGS000004 --target_build GRCh38 --outdir testOutdir --keep_multiallelic true --keep_ambiguous true --run_ancestry /path/to/pgsc_HGDP+1kGP_v1.tar.zst


Command error:

  INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
  INFO:    Environment variable SINGULARITYENV_NXF_TASK_WORKDIR is set, but APPTAINERENV_NXF_TASK_WORKDIR is preferred
  INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
  pgscatalog.calc.cli.aggregate_cli: 2024-11-29 12:26:33 INFO     Checking variant overlap
  pgscatalog.calc.cli.aggregate_cli: 2024-11-29 12:26:33 INFO     Read 295 from reference_ALL_additive_0.sscore.vars
  pgscatalog.calc.cli.aggregate_cli: 2024-11-29 12:26:33 INFO     Read 297 from reference_ALL_additive_0.scorefile.gz
  Traceback (most recent call last):
    File "/app/pgscatalog.utils/.venv/bin/pgscatalog-aggregate", line 8, in <module>
      sys.exit(run_aggregate())
               ^^^^^^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/calc/cli/aggregate_cli.py", line 76, in run_aggregate
      [verify_variants(x) for x in score_paths]
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/calc/cli/aggregate_cli.py", line 76, in <listcomp>
      [verify_variants(x) for x in score_paths]
       ^^^^^^^^^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/calc/cli/aggregate_cli.py", line 50, in verify_variants
      raise ValueError(f"Missing variants {diff}")
  ValueError: Missing variants frozenset({'5:53383709:C:CAA', '4:125831837:A:AATAT'})

Work dir:
  /hpc/diaggen/users/joris/PRS_data/pgs_calc/work/56/e9595f52fdfeef44e9b06bf84af104

Container:
  /hpc/diaggen/software/singularity_cache/ghcr.io-pgscatalog-pygscatalog-pgscatalog-utils-1.4.4-singularity.img


Relevant files

[nextflow.log](https://github.com/user-attachments/files/17959011/nextflow.log)


System information

Nextflow version: 24.10.1
HPC
slurm
singularity
Rocky 8.10

Jorisvansteenbrugge added the bug label Nov 29, 2024

nebfield (Member) commented Dec 2, 2024

Thanks for the bug report!

This error happens when the variants that are output by the pgscatalog-match process don't perfectly match the variants used by plink to calculate the scores. We always want to make sure these two variant sets are perfectly consistent.
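
As a rough illustration of that consistency check, here is a minimal sketch. It is not the actual pgscatalog-utils code: the function name `check_variant_overlap` and the direction of the comparison are assumptions, guided only by the log above (297 variants read from the scorefile, 295 from the .sscore.vars file, and two IDs reported missing).

```python
def check_variant_overlap(matched_ids, scored_ids):
    """Raise ValueError if any matched variant was not used for scoring.

    Hypothetical helper, not the pgscatalog-utils implementation.
    """
    # Variants present in the matching output but absent from the scoring output
    diff = frozenset(matched_ids) - frozenset(scored_ids)
    if diff:
        raise ValueError(f"Missing variants {diff}")


# Illustrative IDs: the first two are copied from the error above, the third is made up
matched = {"5:53383709:C:CAA", "4:125831837:A:AATAT", "1:1000:A:G"}
scored = {"1:1000:A:G"}

try:
    check_variant_overlap(matched, scored)
except ValueError as err:
    print(err)
```

Any ID left over in the difference means the two steps disagree about which variants contribute to the score, so failing loudly is safer than aggregating a score built from a different variant set.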

It's interesting this happens with the reference panel. I think it has something to do with the matching parameters --keep_multiallelic and --keep_ambiguous (which are both usually false). Does the error still happen if you remove these parameters?
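
To help narrow this down, the two IDs from the error can be inspected directly. The sketch below assumes the IDs have the form chromosome:position:allele:allele, which is inferred from the error message rather than confirmed. `describe_variant` is a hypothetical helper. Whether a site is multiallelic cannot be read from a single ID, so that is not checked here.

```python
# Complementary bases; a SNP whose two alleles are complements is strand-ambiguous
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def describe_variant(variant_id):
    """Classify an ID of the assumed form chrom:pos:allele:allele."""
    chrom, pos, allele_a, allele_b = variant_id.split(":")
    is_snp = len(allele_a) == 1 and len(allele_b) == 1
    return {
        "chrom": chrom,
        "pos": int(pos),
        # Alleles of different length indicate an insertion or deletion
        "is_indel": len(allele_a) != len(allele_b),
        # A/T and C/G SNPs look identical on both strands
        "is_ambiguous": is_snp and COMPLEMENT.get(allele_a) == allele_b,
    }


for vid in ["5:53383709:C:CAA", "4:125831837:A:AATAT"]:
    print(vid, describe_variant(vid))
```

Both reported IDs have alleles of different lengths, so they are indels rather than strand-ambiguous SNPs.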

Jorisvansteenbrugge (Author) commented

Thank you for taking a look!

The pipeline does run successfully without --keep_multiallelic and --keep_ambiguous set to true. I enabled these two settings to get a slightly higher match rate, as I was testing the pipeline with a small sample set :)

nebfield (Member) commented Dec 5, 2024

Great 🚀 that's helpful, thank you. I'll leave this issue open to investigate properly and fix in our next release - but that probably won't be until early next year sometime 😅

nebfield added this to the v2.1.0 milestone Dec 5, 2024