-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Writing a nan value should leave a gap #132
Comments
That seems reasonable. Under the hood I have to iterate over the vector anyway, so testing for and skipping NaNs shouldn't be a big deal (just need to look up the C API for numpy and find how to do that). |
On a related note, if zeros are provided in |
They're written, exactly for the reason you suspected. |
Some more details. So, PyBigWig can save the whole vector of a chromosome,
but then it saves the NaNs as
which is relatively fast, but this produces a larger bigwig file. It is also possible to look for segments
This is much slower, but the resulting file is smaller too. An example of times/file size:
|
I'm surprised there's a difference in size and that the performance difference is so large. The actual data should be the same in either case as are the zoom levels created when the file is closed. I don't fully understand what's going on there. Do you have any data handy that I could use for testing? I have some higher priority things on my plate at the moment but this is vexing me. |
Following up from @rcasero's observation, I'm providing a reproducible example showing that even when the indices of the intervals with non-NaN values are passed to
import pyBigWig
import numpy as np
# Initialise array as fake genomic signal for one chromosome
data = np.ones(1000)
# Replace half of the array with NaNs
data[500:1000] = np.nan
# Create dictionary from array
data_dict = dict()
data_dict['chr1'] = data
def to_bigwig_create_gaps(x, filename):
"""Save dictionary to bigWig by creating gaps
"""
with pyBigWig.open(filename, 'wb') as bw:
# add header (list of (chromosome, length) tuples)
bw.addHeader([(k, len(v)) for k, v in x.items()])
for chrom in x.keys():
valid_idx = ~np.isnan(x[chrom])
for chrom in x.keys():
if np.all(valid_idx):
# no NaN values, so it can be written in one go
bw.addEntries(chrom, 0, values=x[chrom], span=1, step=1)
else:
# we skip writing NaN values by giving the location of every non-nan value, and considering it a
# segment of span=1
bw.addEntries(chrom, np.where(valid_idx)[0], values=x[chrom][valid_idx], span=1)
to_bigwig_create_gaps(data_dict, "by_gaps.bigwig") As pointed out by @rcasero above, the array can also be saved segment-wise resulting in files without
import itertools
def to_bigwig_segment_wise(x, filename):
"""Save dictionary to bigWig segment-wise
"""
with pyBigWig.open(filename, 'wb') as bw:
# add header (list of (chromosome, length) tuples)
bw.addHeader([(k, len(v)) for k, v in x.items()])
for chrom in x.keys():
# first segment is at the beginning of the chromosome
pos = 0
# group values into segments using as criterion whether they are NaN or not
for k, g in itertools.groupby(x[chrom], lambda a: ~np.isnan(a)):
# values in current segment
g = np.array(list(g))
# if this is a segment of valid numbers, write it to file
if k:
bw.addEntries(chrom, pos, values=g, span=1, step=1)
# move pointer to the beginning of next segment
pos += len(g)
to_bigwig_segment_wise(data_dict, "segment_wise.bigwig") The resulting file size of |
Another colleagued figured out what was going on with the different file sizes. Basically, the different versions of I think the package would benefit from that being documented, specifying what kind of wiggle block each |
Hi,
Not sure whether to consider this a bug, but given that
bw.values("chr1", start, end)
producesnan
values for missing values, it seems thatbw.addEntry()
withnan
values should produce gaps in the output file, rather than writing them asnan
as it currently does.There's a way to create the gaps in compact notation for a
numpy.array
vectorx
that containsnan
valuesbut perhaps this is something that should be enforced for consistency?
The text was updated successfully, but these errors were encountered: