gffutils database build failed with UNIQUE constraint failed: features.id #12

Open
yeeus opened this issue Jul 2, 2024 · 0 comments

yeeus commented Jul 2, 2024

Useful and exciting tool! However, when I ran LiftOn with this command:

lifton MF2_mat.v1.0.fa ~/rawdata/GRCh38/ref/GRCh38.p14.new_name.fa -sc 0.95 -copies -g ~/rawdata/GRCh38/ref/GCF_000001405.40_GRCh38.p14_genomic.gff -polish -o CN1v1.0_mat.lifton.gff -c -cds -ad RefSeq -f type.list -exclude_partial -t 10 -D

I got this error:

**********************
** Running miniprot **
**********************
gffutils database build failed with UNIQUE constraint failed: features.id

The stderr log also contains many duplicate-ID warnings and ends with a ValueError:

$tail -50 CN1v1.0/Mat/06.lifton/GRCh38/lifton.sh.sbatch.e
2024-07-01 15:57:42,479 - WARNING - Duplicate lines in file for id '4c10942085bce244cfce502d028bd6f1'; ignoring all but the first
2024-07-01 15:57:42,479 - WARNING - Duplicate lines in file for id '7ab0878db5cb4b1d34b527f6f36432d5'; ignoring all but the first
2024-07-01 15:57:42,479 - WARNING - Duplicate lines in file for id '7e9c75fdd3787d0324c845de6e12c07e'; ignoring all but the first
2024-07-01 15:57:42,479 - WARNING - Duplicate lines in file for id 'dff13391ae5c98698f19335853b321e5'; ignoring all but the first
2024-07-01 15:57:42,479 - WARNING - Duplicate lines in file for id '4c10942085bce244cfce502d028bd6f1'; ignoring all but the first
2024-07-01 15:57:42,480 - WARNING - Duplicate lines in file for id '7ab0878db5cb4b1d34b527f6f36432d5'; ignoring all but the first
2024-07-01 15:57:42,480 - WARNING - Duplicate lines in file for id '7e9c75fdd3787d0324c845de6e12c07e'; ignoring all but the first
2024-07-01 15:57:42,480 - WARNING - Duplicate lines in file for id 'dff13391ae5c98698f19335853b321e5'; ignoring all but the first
2024-07-01 15:57:42,480 - WARNING - Duplicate lines in file for id '4c10942085bce244cfce502d028bd6f1'; ignoring all but the first
2024-07-01 15:57:42,480 - WARNING - Duplicate lines in file for id '7ab0878db5cb4b1d34b527f6f36432d5'; ignoring all but the first
2024-07-01 15:57:42,481 - WARNING - Duplicate lines in file for id '7e9c75fdd3787d0324c845de6e12c07e'; ignoring all but the first
2024-07-01 15:57:42,481 - WARNING - Duplicate lines in file for id 'dff13391ae5c98698f19335853b321e5'; ignoring all but the first
2024-07-01 15:57:44,131 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,131 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,131 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,131 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,132 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,132 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,132 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,132 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:45,000 - INFO - Populating features table and first-order relations: 3903620 features
2024-07-01 15:57:45,001 - INFO - Updating relations
2024-07-01 15:58:19,940 - INFO - Creating relations(parent) index
2024-07-01 15:58:23,229 - INFO - Creating relations(child) index
2024-07-01 15:58:27,449 - INFO - Creating features(featuretype) index
2024-07-01 15:58:30,206 - INFO - Creating features (seqid, start, end) index
2024-07-01 15:58:33,525 - INFO - Creating features (seqid, start, end, strand) index
2024-07-01 15:58:37,309 - INFO - Running ANALYZE features
>> Creating miniprot annotation database : ./lifton_output/miniprot/miniprot.gff3
2024-07-01 15:58:39,206 - INFO - Populating features
2024-07-01 16:00:27,613 - INFO - Populating features table and first-order relations: 1912405 features
2024-07-01 16:00:27,613 - INFO - Updating relations
2024-07-01 16:00:37,349 - INFO - Creating relations(parent) index
2024-07-01 16:00:37,940 - INFO - Creating relations(child) index
2024-07-01 16:00:38,686 - INFO - Creating features(featuretype) index
2024-07-01 16:00:39,703 - INFO - Creating features (seqid, start, end) index
2024-07-01 16:00:41,177 - INFO - Creating features (seqid, start, end, strand) index
2024-07-01 16:00:42,823 - INFO - Running ANALYZE features
Traceback (most recent call last):
  File "/slurm/home/zju/zhanglab/chenquanyu/mambaforge/envs/liftoff/bin/lifton", line 8, in <module>
    sys.exit(main())
  File "/slurm/home/zju/zhanglab/chenquanyu/mambaforge/envs/liftoff/lib/python3.10/site-packages/lifton/lifton.py", line 352, in main
    run_all_lifton_steps(args)
  File "/slurm/home/zju/zhanglab/chenquanyu/mambaforge/envs/liftoff/lib/python3.10/site-packages/lifton/lifton.py", line 290, in run_all_lifton_steps
    tree_dict = intervals.initialize_interval_tree(l_feature_db, features)
  File "/slurm/home/zju/zhanglab/chenquanyu/mambaforge/envs/liftoff/lib/python3.10/site-packages/lifton/intervals.py", line 12, in initialize_interval_tree
    tree_dict[chromosome].add(gene_interval)
  File "/slurm/home/zju/zhanglab/chenquanyu/mambaforge/envs/liftoff/lib/python3.10/site-packages/intervaltree/intervaltree.py", line 324, in add
    raise ValueError(
ValueError: IntervalTree: Null Interval objects not allowed in IntervalTree: Interval(45020029, 45020029, 'CDS_51812')
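
The `ValueError` above is raised because the interval's begin and end are equal (45020029 in both positions). A minimal sketch for locating such records, assuming the null interval originates from a 1-bp feature (GFF start equal to end) in one of the intermediate annotation files; which file it comes from is not established by the log. The function name `find_single_base_features` and the sample rows are hypothetical and not part of LiftOn:

```python
import io


def find_single_base_features(lines):
    """Yield (seqid, featuretype, start, end) for GFF rows where start == end."""
    for line in lines:
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete 9-column GFF record
        start, end = int(cols[3]), int(cols[4])
        if start == end:
            yield cols[0], cols[2], start, end


# Made-up records for illustration only.
sample = io.StringIO(
    "##gff-version 3\n"
    "chr1\tminiprot\tCDS\t100\t100\t.\t+\t0\tParent=MP000001\n"
    "chr1\tminiprot\tCDS\t200\t260\t.\t+\t0\tParent=MP000001\n"
)
single_base = list(find_single_base_features(sample))
print(single_base)
```

GFF coordinates are 1-based and inclusive, so a feature with start equal to end is a valid 1-bp record. If it is stored as the interval (start, end) it has zero length; storing it as the half-open interval (start - 1, end) would be one possible way to keep it non-empty.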

The GFF file I provided was downloaded from NCBI (GRCh38 RefSeq). Looking at it, I found that some IDs are repeated many times, which may be what causes the error in the miniprot step (Liftoff, by contrast, created unique IDs):

$rg -v '^#' ~/rawdata/GRCh38/ref/GCF_000001405.40_GRCh38.p14_genomic.gff | cut -f 9 | awk -F '[=|;]' '{print $2}' | sort | uniq -c | sort -nr | head
362 cds-NP_001254479.2
358 cds-XP_016860308.1
335 cds-XP_016860310.1
335 cds-XP_016860309.1
316 cds-XP_047301616.1
312 cds-XP_047301617.1
312 cds-NP_001243779.1
311 cds-NP_596869.4
309 cds-XP_024308863.1
299 cds-XP_047301619.1
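
The same tally can be reproduced in Python, keyed by feature type as well as ID. This is a sketch using only the standard library; the helper name `count_ids` and the sample rows are made up for illustration. Note that under the GFF3 specification a CDS split across several exons is written as multiple lines sharing one ID, so repeated `cds-` IDs are expected in a RefSeq file; whether they are what triggers the failure is not established here.

```python
import io
from collections import Counter


def count_ids(lines):
    """Count occurrences of each (featuretype, ID) pair in GFF3 lines."""
    counts = Counter()
    for line in lines:
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete 9-column GFF record
        # Column 9 holds semicolon-separated key=value attributes.
        for field in cols[8].split(";"):
            key, _, value = field.partition("=")
            if key == "ID":
                counts[(cols[2], value)] += 1
                break
    return counts


# Made-up records for illustration only.
sample = io.StringIO(
    "chr1\tRefSeq\tCDS\t10\t20\t.\t+\t0\tID=cds-A;Parent=rna-A\n"
    "chr1\tRefSeq\tCDS\t30\t40\t.\t+\t2\tID=cds-A;Parent=rna-A\n"
    "chr1\tRefSeq\tmRNA\t10\t40\t.\t+\t.\tID=rna-A;Parent=gene-A\n"
)
id_counts = count_ids(sample)
repeated = [(key, n) for key, n in id_counts.most_common() if n > 1]
print(repeated)
```

Grouping by feature type shows at a glance whether the repeats are confined to CDS rows or also affect genes and transcripts.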

So I think the way the miniprot step handles these IDs needs to be fixed.
Best wishes!

Kuanhao-Chao self-assigned this Jul 9, 2024