How to represent large arrays #15
Comments
I totally agree, and think we should favor representations in this form. There is some push back as it is then necessary to do some work indexing into the arrays, but I think it is worth it for the points mentioned above. |
I should ping @sunqm on this issue. I forgot earlier. Sorry... |
Agreed, we might want to roll this into our units tech as well, since that follows a similar format. |
Another way to store large arrays is to hold the binary data directly, e.g., base64-encoded in the JSON. Most languages can understand this fairly easily and we do not have to go through the standard serialization/deserialization process, where some data can be lost. It kills readability, but for larger arrays it's likely the way to go. |
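A minimal sketch of what that could look like, assuming NumPy and base64; the key names ("dtype", "shape", "data") are illustrative, not part of the spec:

```python
import base64
import json
import numpy as np

# Hypothetical layout: raw little-endian bytes, base64-encoded, plus enough
# metadata (dtype and shape) to reconstruct the array on the other side.
arr = np.arange(12, dtype="<f8").reshape(3, 4)
payload = {
    "dtype": "<f8",  # '<' marks little-endian explicitly
    "shape": list(arr.shape),
    "data": base64.b64encode(arr.tobytes()).decode("ascii"),
}
text = json.dumps(payload)

# Round-trip: decode the bytes and restore shape/dtype without any loss.
loaded = json.loads(text)
restored = np.frombuffer(
    base64.b64decode(loaded["data"]), dtype=loaded["dtype"]
).reshape(loaded["shape"])
assert np.array_equal(arr, restored)
```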
As a binary format, does big endian / little endian need to be considered? |
@sunqm One must know the byte ordering, which is indicated by the ... |
I'm not sure if supporting binary representation is that important. One is free to use the spec with a JSON alternative that natively supports (efficient) storage of binary data and takes care of byte order. One could see it as a problem to be solved outside the spec. |
I agree with @tovrstra. The binary format seems hard to test and debug. I think a readable text form is definitely needed, and the binary format can be an optional field if performance or file size matters in certain scenarios. |
Since we are talking about a data transfer spec, I do think this is within scope. The following is in relation to the sparse-array issue, but I suggest it could be a binary solution as well... What I am implementing for my first try at Jmol, just for atom names and symbols in topology.atoms, is a simple run-length encoding of arrays with a first-element descriptor:
My decoder simply first checks the first array value and, if it reads the run-length-encoding marker, decodes accordingly. My suggestion is to allow different array "headers" like this -- so that it is still just an array being read:
The context should handle the data type, but if that is also desired, then perhaps a simple qualifier such as:
for example. Very simple and flexible. Endianness is important -- and should be specified. I prefer little-endian, which is very common. "shape" seems like an odd choice for a dimensionality descriptor. Why not just "dim": [4,4]? I realize that some may think it better to have some sort of other meta tag that would indicate array type. Here are the arguments against that:
In fact, another option for long arrays would be to use a true data URI:
Alternatively, you then really don't need the array at all:
but then that is a totally different solution from a run-length-encoded array, where you need both a data-type header and numeric array data. But maybe we would want to have:
I might experiment with that. Bob |
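To make the run-length-encoding idea concrete, here is a minimal sketch of an array with a first-element descriptor and a small decoder. The "_RLE_" marker and the (value, count) layout are illustrative assumptions, not the exact format proposed above:

```python
# Hypothetical run-length-encoded array: a leading string marker followed by
# alternating (value, repeat-count) pairs. The exact layout is an assumption.
encoded = ["_RLE_", "C", 6, "H", 12, "O", 6]

def decode_array(arr):
    """Expand an array if its first element is the RLE marker; otherwise
    return it unchanged, so plain arrays still parse as plain arrays."""
    if not arr or arr[0] != "_RLE_":
        return arr
    out = []
    for value, count in zip(arr[1::2], arr[2::2]):
        out.extend([value] * count)
    return out

print(decode_array(encoded))      # ['C', 'C', 'C', 'C', 'C', 'C', 'H', ...]
print(decode_array([1.0, 2.0]))   # a plain array passes through untouched
```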
As a note, you can use backticks to format verbatim code blocks:
import numpy as np
print(np.arange(5)) |
@BobHanson The name "shape" comes from the numpy API (https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.ndarray.shape.html), but it seems to be a better choice than "dim" even without considering Numpy conventions. In the Numpy API, "dimension" or "dim" is used to refer to the number of elements in shape, which is consistent with common terminology: a 1D array corresponds to a vector, a 2D array corresponds to a matrix, but there can also be higher-dimensional arrays. Likewise, "shape" is also a name consistent with common terminology, e.g. it determines whether a 2D array is square or rectangular. I like your ideas regarding the encoding, but what do you do in case a list just happens to start with |
OK, "shape" it is.
_RLE_
I think something like this would work.
- These are controlled arrays in a defined context; the parser will always
know to expect a float or int and not a string
- Encoding could be only for numerical arrays, thus no conflict.
- I suggest using the "data:" syntax, not the "_RLE_" that I have there
now. That would provide forward-flexibility.
- The important points are:
-- the author should have the choice to compress or not compress -- not
the requirement to do that.
-- ideally, the compression method should be able to be applied anywhere
in the schema, not only at
designated places.
-- at the same time, a parser should not have to read two key/value pairs
just to read an array.
Thus my suggestion that the first element of a sequential numerical array
optionally be a string that
describes the compression method.
|
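A sketch of how a parser might dispatch on such an optional first-element descriptor, using the "data:" style suggested above. The URI layout and the helper name are assumptions made for illustration:

```python
import base64
import numpy as np

def read_numeric_array(arr):
    """Return a NumPy array, expanding an optional first-element descriptor.

    A plain list of numbers is returned as-is; a list whose first element is a
    hypothetical "data:" URI is decoded from base64. The layout is illustrative.
    """
    if arr and isinstance(arr[0], str) and arr[0].startswith("data:"):
        # e.g. "data:float64;base64,<payload>" -- an assumed form, not a spec
        header, payload = arr[0].split(",", 1)
        dtype = header[len("data:"):].split(";")[0]
        return np.frombuffer(base64.b64decode(payload), dtype=dtype)
    return np.asarray(arr)

plain = read_numeric_array([1.0, 2.0, 3.0])
packed = base64.b64encode(np.array([1.0, 2.0, 3.0]).tobytes()).decode("ascii")
decoded = read_numeric_array(["data:float64;base64," + packed])
assert np.allclose(plain, decoded)
```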
Putting compression inside the spec itself could cause quite a few issues of interoperability as these libraries can be relatively heavy to import into languages. Why not just compress the whole JSON file while sending it between points?
I don't quite understand why this is important; reading multiple pairs compared to checking the first index of the array doesn't seem terribly different. |
On Tue, Dec 19, 2017 at 9:06 AM, Daniel Smith wrote:
> Putting compression inside the spec itself could cause quite a few issues of interoperability as these libraries can be relatively heavy to import into languages.
Why would that be? The interface libraries will have to be written no matter how it is done. What makes this "heavy"? I'm missing your point there.
> Why not just compress the whole JSON file while sending it between points?
> at the same time, a parser should not have to read two key/value pairs just to read an array.
JSON files are gzip-compressed automatically in transmission always. Compression is not really the issue. A 64-bit float such as "1.2345678E-5," is 12 bytes; its binary form is 8 bytes, and its base64 form is probably about 11 bytes as well. No huge gain there.
The issues are:
(1) parsing of numbers rather than reading them from binary, which is slower.
(2) data integrity -- binary-to-ASCII-to-binary is lossy; binary-to-base64-to-binary is lossless.
I am just stating my preference for in-line array header compression indicators rather than some sort of specialized supplemental key, as is done in many binary formats. It may seem a bit odd as part of a JSON format, but it is quite customary in data transfer formats that involve binary data.
> I don't quite understand why this is important, reading multiple pairs compared to checking the first index of the array doesn't seem terribly different.
It's actually considerably different. If for every array one has to read two keys, one of which may not exist, just to find out if it is compressed, two hashtable lookups have to be done, and there have to be two separate keys in the hashtable (and it has to be a hashtable that is surrounding the array, not the array by itself or another array wrapping it); if just checking a first element for type, that's trivial, compact, easy to pass along without loss, and totally doable.
|
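A rough illustration of the size and round-trip points above (a sketch, not a benchmark; exact sizes depend on formatting choices):

```python
import base64
import struct

# Size comparison for a single double, along the lines of the figures above.
text = "1.2345678E-5"
raw = struct.pack("<d", float(text))         # '<' = little-endian, 'd' = float64
b64 = base64.b64encode(raw).decode("ascii")
print(len(text), len(raw), len(b64))         # 12 text chars, 8 raw bytes, 12 base64 chars

# Round-trip integrity: base64 of the raw bytes is exact by construction.
assert struct.unpack("<d", base64.b64decode(b64))[0] == float(text)

# A decimal text round-trip is only exact if enough digits are printed.
y = 0.1 + 0.2                                # 0.30000000000000004
print(float(f"{y:.10g}") == y)               # False: 10 significant digits lose bits
print(float(repr(y)) == y)                   # True: repr keeps full precision
```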
I share some of @dgasmith's concerns: what I think we are looking for is a spec that can be implemented using an off-the-shelf JSON parser. Optimizing for hand-written JSON parsers, and extending for highly efficient binary transport, seems to be pushing things in a very different direction. For C++ we currently use jsoncpp, and I am looking at a more modern JSON header library. In Fortran Bert used this, and in Python we usually just import the json module. This was part of the reason there was push back on using JSON pointers; many parsers do not support them at this point.
The more we put in the spec, the more difficult it will be to develop support in each language, and I think that will make it less likely to be adopted by quantum codes etc. Personally I would rather develop an HDF5 or binary sister format than attempt to create a hybrid mixing JSON with binary encoders, compression, etc. The HDF5 dependency raises the bar quite a bit too, though, and effectively requires linking to the HDF5 C library, to the best of my knowledge at this point.
If you look at the work done in XDMF, it is interesting to note how they use very standard XML and point to external binary resources which might be raw binary blobs, HDF5, or similar. We have seen some great uptake, but they took great care not to mix XML and binary.
I entirely agree that work must be done whatever the spec, but if we step too far outside of JSON, to the point of custom parsers, etc., I think this becomes quite a different project. There is the proposal of vendor extensions that would enable experimentation with whatever we like, such as these encoded arrays. Just my opinions; there is also the BSON spec, which I am a huge fan of, that would enable direct encoding of arrays -- we use it in databases already -- and making use of that could be a great option for end-to-end binary support where many details are already hashed out (I haven't tried to use it for transport yet). |
For Jmol -- and visualization in general, I think -- it doesn't really matter if the format is lossless or not, binary or not, compressed or not, because the visualization is an end product. I thought, though, people had expressed an interest in lossless transfer between calculational modules in terms of intermediate workflow. That's the reason to add a binary array option for JSON. I might have had that wrong. But if people do want to be able to transfer actual calculational results (density matrices, for example, perhaps even just coordinates) losslessly between modules, then you have to allow some sort of binary representation in the schema.
Nothing mentioned so far suggests you couldn't use an off-the-shelf JSON parser. What is being discussed is the next step -- making meaning of that JSON-parsed data structure. Run-length decoders are about 10 lines long. They are standard and trivial. Base64 is also very standard and off-the-shelf.
I'm not seeing the point Marcus is making. JSON pointers (layered JSON) are a very different animal. That really does take specialized parsers.
BSON is a possibility, sure.
|
My vote would also be to get a text-based array format finalized first. As mentioned, different binary representations can be tried out as "extensions", since it's unclear what the right thing to do is. It might be the case that it would be best to support several different (binary) representations of arrays, and signal the choice at the top of the file. |
I'm also in favor of keeping the spec as simple as possible and introducing fully orthogonal extensions later to obtain compact representations. |
Large arrays (e.g. density matrices when using a lot of basis functions) may cause some efficiency issues. At the moment all arrays are represented as a list of (lists of ...) numbers. This has some limitations:
Writing a double precision number as text in JSON gives about +150% overhead. This may be solved to some extent with BSON or HDF5, provided arrays can be recognized easily before they are written to the file.
Reading a long list of data, then detecting that it is an array and then converting it to an array is slow and very annoying in some programming languages. It would be very convenient to know in advance that the list is an array with a given shape and data type.
Both aspects of this issue can be fixed by representing arrays as follows, and by handling such data appropriately when reading and writing a BSON or HDF5 file.
It would even be possible to extend the JSON format to encode the array with base64 encoding (only +33% overhead). I know that such practical details are not really a part of the JSON schema. However, it is relevant to define the schema such that arrays can be read and written efficiently.
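A minimal sketch of one possible representation along these lines, with explicit "shape" and "dtype" fields plus a flat "data" list; the key names are illustrative, not part of the schema:

```python
import json
import numpy as np

# Illustrative layout: metadata first, then a flat list of numbers, so a reader
# can allocate and type the array before touching the data.
document = json.loads("""
{
  "density_matrix": {
    "dtype": "float64",
    "shape": [2, 2],
    "data": [1.0, 0.25, 0.25, 1.0]
  }
}
""")

entry = document["density_matrix"]
array = np.array(entry["data"], dtype=entry["dtype"]).reshape(entry["shape"])
print(array)
```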