Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit ArrayData to a single array? #3213

Closed
ConradJohnston opened this issue Jul 26, 2019 · 3 comments
Closed

Limit ArrayData to a single array? #3213

ConradJohnston opened this issue Jul 26, 2019 · 3 comments

Comments

@ConradJohnston
Copy link
Contributor

Perhaps this in an unpopular opinion, but what is the motivation for allowing multiple arrays to be stored in a single ArrayData object?

In the use case where there is only one array, one of the tedious consequences is needing to keep track of the name of that array.
For example:

from aiida import orm 
array = orm.ArrayData()
array.set_array('one_lonely_array', np.ones(3))
array.get_shape('one_lonely_array')

I know one can get the names of the stored arrays, but that adds more machinery to what should be a simple operation. In my mind, an ArrayData should represent only one array and so this code should look more like this:

from aiida import orm 
array = orm.ArrayData(array=np.ones(3)
array.get_shape()

This would be much more consistent with how Numpy works and would be more intuitive for users.

If there is a need to store many arrays, then this should be a different data object, in the same way that TrajectoryData is a separate class from StructureData.

Comments, queries, and abuse on a postcard, please?

@giovannipizzi
Copy link
Member

Hi Conrad, thanks a lot for the suggestion!

Here is the reasoning: there are cases (e.g. a trajectory data) where multiple arrays make sense only together (both in terms of the length of one of their dimensions, but also physically).
You don't want to have multiple nodes for a single trajectory (one for the cells, one for the positions, one for the chemical symbols, ...).
Note, also, that while the numpy array is one array, their .npy storage format is a collection of arrays (and indeed what we use internally).

I see two options:

  • you just define a couple of utility functions to write and read (possibly with validations) arrays in a ArrayData that contain a single array, something along the line (metacode, untested):

     SINGLEARRAY_NAME = 'singlearray'
     
     def create_singlearray(array):
         node = ArrayData()
         node.set_array(SINGLEARRAY_NAME, array)
         return node
     
     def get_singlearray(node):
         # TODO: validate here that there the list of array names is equal to the list [SINGLEARRAY_NAME]
         return node.get_array(SINGLEARRAY_NAME)

    This is good for a function you are writing yourself, I believe.

  • If you want to make this functionality available to everyone, we should define a SingleArrayData class, in the same spirit of the SinglefileData. There, you can just either inherit from ArrayData and add the methods you want, or maybe in this case it's simpler to redefine methods and logic to do what you want.

I hope this helps. I'll keep this issue open for a while for discussion, then in a few weeks we'll decide if we'll make a wontfix or we convert to a real issue.

@ConradJohnston
Copy link
Contributor Author

ConradJohnston commented Aug 2, 2019

Hi @giovannipizzi ,

Thanks for your input. Yes, I do indeed see the value of keeping related arrays together for the likes of trajectories. When I was thinking about this previously, I was focused on volumetric data like electron density, etc. Originally, I was thinking that ArrayData should be a collection of some SingleArrayData objects, but I appreciate now that this could be a bit ridiculous for the DB and for the graph.

If you think it's worth having a SingleArrayData convenience class, I'll have a crack at some point. I suppose the nice thing about adding, rather than changing, is that you don't risk breaking anyone's plugin.
In any case, this definitely falls in the "nice to have" category and is by no means a necessity.

@sphuber
Copy link
Contributor

sphuber commented Jun 1, 2024

In #6097 and a related PR I implemented functionality that essentially covers your use case. If an ArrayData only stores a single array, get_array() no longer requires a name, it will automatically fetch the single array.

@sphuber sphuber closed this as completed Jun 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants