Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic PGEN writer #8

Open
wants to merge 14 commits into
base: main
Choose a base branch
from
Open

Basic PGEN writer #8

wants to merge 14 commits into from

Conversation

biona001
Copy link
Member

@biona001 biona001 commented Sep 8, 2022

This pr adds the write_pgen functionality which saves a dosage matrix into a PGEN file. Any suggestion is highly appreciated.

  • In write_pgen, all hard-call genotypes are missing, while dosage values are saved in data track 4 of variant records
  • Currently write_pgen is about ~6x slower than reading variant by variant, but I'm not sure how to optimize further. This line takes up roughly 70% of the time but I'm not sure if we can replace it with anything else
    storage[1] = hton(x)
  • psam and pvar files are also outputted. I'm not sure if it's better to output a pvar.zst instead, but that would require adding CodecZlib.jl as another dependency

If this seems fine, I still need to add some documentation

@codecov-commenter
Copy link

codecov-commenter commented Sep 8, 2022

Codecov Report

Merging #8 (ba32323) into main (931ca5d) will increase coverage by 4.15%.
The diff coverage is 88.42%.

@@            Coverage Diff             @@
##             main       #8      +/-   ##
==========================================
+ Coverage   68.51%   72.67%   +4.15%     
==========================================
  Files           9       10       +1     
  Lines         432      527      +95     
==========================================
+ Hits          296      383      +87     
- Misses        136      144       +8     
Impacted Files Coverage Δ
src/PGENFiles.jl 28.57% <ø> (ø)
src/write.jl 88.42% <88.42%> (ø)
src/iterator.jl 56.66% <0.00%> (+3.33%) ⬆️
src/structs.jl 91.30% <0.00%> (+8.69%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@kose-y
Copy link
Member

kose-y commented Sep 8, 2022

Thanks, this looks awesome. Are all the variant record types 0x40?

@biona001
Copy link
Member Author

biona001 commented Sep 8, 2022

Yeah all variant record types are 0x40. Also apparently my tests are failing on windows, let me debug this...

@kose-y
Copy link
Member

kose-y commented Sep 15, 2022

Have you checked if the output file works with Plink 2?

@biona001
Copy link
Member Author

Sorry, haven't had much time to work on this pr this past week.

No I haven't yet.... To merge this pr, I still need to:

  1. Check output works with PLINK2 and BASIL (a Lasso solver that I want to use)
  2. Write some docs
  3. Maybe improve efficiency. It just seems odd to me that writing is 6x slower than reading.

@kose-y
Copy link
Member

kose-y commented Sep 15, 2022

The issue is that we are not closing the file we opened.

@biona001
Copy link
Member Author

Sorry, do you mean the open(...) do io blocks? e.g.

open(pgen_filename * ".pgen", "w") do io

I think that syntax closes the io automatically when the block is done.

@kose-y
Copy link
Member

kose-y commented Sep 15, 2022 via email

@kose-y
Copy link
Member

kose-y commented Dec 14, 2022

@biona001 Please let me know if you think it is ready to be merged.

@biona001
Copy link
Member Author

I have not checked if the result can be run/read by other PGEN readers. I haven't had time to do that yet

@kose-y
Copy link
Member

kose-y commented Dec 14, 2022

Thanks for the update. By the way, adding close(p.io) (with p a Pgen struct) at the end of the tests might fix the test failures on Windows.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants