
Use marshal for save/load operators to improve performance #124

Merged
merged 20 commits into ProjectQ-Framework:develop on Jul 23, 2017

Conversation

Spaceenter
Contributor

No description provided.

@Spaceenter
Contributor Author

@babbush @jarrodmcc @idk3

@damiansteiger
Contributor

damiansteiger left a comment

I would suggest not using marshal over HDF5 because:

  1. It is Python specific.
  2. The format is not guaranteed to be identical across Python versions.
  3. It is not secure, as I could put malicious data into it.

https://docs.python.org/2/library/marshal.html

@idk3
Collaborator

idk3 commented Jul 22, 2017

@damiansteiger the first two points are true, but the third isn't - see e.g. https://stackoverflow.com/questions/26931919/marshal-unserialization-not-secure - marshal doesn't allow you to insert arbitrary code (as e.g. pickle / eval would) and there are no known exploits for it.

The main reasons we want to change from h5py to marshal are speed (loading ~5x faster, saving ~10x+ faster) and, more significantly, memory usage. In the tests I've done with saving/loading larger operators (with >10 M terms), the conversion to HDF5 blows up in memory, getting as high as ~30 GB for a 2 GB FermionOperator. marshal for the same file only uses ~8 GB by comparison (though I'm still interested in lighter solutions if you know of any!). This means that loading the kind of FermionOperators we'd be interested in saving in the first place is almost impossible with HDF5.
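
For context, a minimal sketch of the marshal-based save/load being discussed (the function names are illustrative rather than the exact PR code, and it assumes the operator exposes a terms dict with tuple keys and numeric coefficients):

    import marshal

    def save_terms(operator, operator_type, file_path):
        # marshal only handles built-in types, so cast every coefficient to plain complex.
        terms = {term: complex(coeff) for term, coeff in operator.terms.items()}
        with open(file_path, 'wb') as f:
            marshal.dump((operator_type, terms), f)

    def load_terms(file_path):
        # Returns the operator type tag and the raw terms dict; rebuilding the
        # operator object from the dict is left to the caller.
        with open(file_path, 'rb') as f:
            operator_type, terms = marshal.load(f)
        return operator_type, terms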

@idk3 idk3 self-requested a review July 22, 2017 23:06
@@ -223,3 +217,19 @@ def get_file_path(file_name, data_directory):
file_path = data_directory + '/' + file_name

return file_path


def numpy_scalar(number):
Collaborator

What if we were to cast all float types to complex instead? This makes me wonder why we don't just make the coefficients complex when they're defined - is there a reason we need things like numpy.complex128? @damiansteiger @babbush @jarrodmcc

@babbush
Contributor

babbush commented Jul 22, 2017

@damiansteiger: while I appreciate your concerns, right now performance is significantly more important for us. I see that it might be problematic for people to try to share data between different versions of Python, or with another language, but that is much less of a concern than the fact that we cannot run the calculations we need to run without this pull request, or something similar that dramatically brings down memory usage. Right now the security concerns appear to be hypothetical, as Ian points out. So before we make the new release Sunday night, we'd like to have this PR accepted.

babbush
babbush previously approved these changes Jul 22, 2017
for k, v in operator.terms.items()],
ensure_ascii=False).encode('utf8'))
tm = operator.terms
f = open(file_path, 'wb')
Collaborator

Should use a context manager here (with open(file_path, 'wb') as f:).

Contributor Author

Done.

ensure_ascii=False).encode('utf8'))
tm = operator.terms
f = open(file_path, 'wb')
marshal.dump((operator_type, {k: numpy_scalar(tm[k]) for k in tm}), f)
Collaborator

After looking at this a little more, I'm not sure numpy_scalar will do anything fundamentally different from casting to complex directly - do you have any thoughts on this @Spaceenter ? Using complex instead is a little cleaner (no extra function needed) and also a bit faster.

By doing that as well as using itertools' izip and imap, this line can be made ~25% faster (dict(izip(iterkeys(tm), imap(complex, itervalues(tm))))), though it's by no means a bottleneck :p

Contributor Author

In my opinion, the coefficients of both FermionOperator and QubitOperator should always be of native Python complex type when the operators are originally constructed, so that we don't need to apply the transformations/castings at all here. Is it possible to make that change in FermionOperator and QubitOperator?
If not, I agree that casting to complex directly here is cleaner.
@idk3 @babbush @jarrodmcc

Contributor

I am not sure we can trust that it will always be initialized this way. Even if it is, inevitably people will write things like:
my_fermion_operator *= my_numpy_array[i]
which may cause the coefficient to become a numpy type. So I think we are going to need to cast directly.
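
As a quick illustration (assuming a coefficient that has silently become a numpy scalar), marshal only accepts built-in types, which is why the explicit cast matters:

    import marshal
    import numpy

    coeff = 1.0 * numpy.float64(0.5)   # coefficients end up as numpy scalars this way
    marshal.dumps(complex(coeff))      # fine: built-in complex is marshalable
    # marshal.dumps(coeff)             # raises ValueError: unmarshallable object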

Contributor Author

Done.

babbush
babbush previously approved these changes Jul 23, 2017
@idk3
Collaborator

idk3 commented Jul 23, 2017

@Spaceenter The current failure is due to my mistake - it seems izip isn't available in Python 3 at all. I think map and zip from future.builtins or future_builtins should do it.

edit: from future_builtins import map, zip does it.

@@ -153,8 +153,8 @@ def save_operator(operator, file_name=None, data_directory=None):

tm = operator.terms
with open(file_path, 'wb') as f:
marshal.dump((operator_type, dict(izip(tm.iterkeys(),
imap(complex, tm.itervalues())))), f)
marshal.dump((operator_type, dict(zip(tm.keys(),
Collaborator

For the dictionary parts you should still be able to use iterkeys and itervalues from future.utils! i.e. iterkeys(tm).

Collaborator

I'm not sure, actually - I thought that Python 3's d.keys() was 2's d.iterkeys(). In fact I didn't know about viewkeys at all. It looks like both should work, but I know that future.utils.iterkeys passes tests for both 2 and 3 - it would be interesting to see if there's anything different with viewkeys, though I don't think we need any of its special properties like the dynamic view.

Contributor Author

Seems like tests for both 2 and 3 passed - should we go with this? Or do you want a change?

Collaborator

Should be fine as is! If it's significantly slower I'll make another PR after this one gets through, but I doubt there'll be any issues.

@damiansteiger
Contributor

@Spaceenter
Thanks for providing this solution to @babbush's and @idk3's performance problem. I am sure they will use your solution, as it solves their problem, especially since they don't have time to work out something different. 🥇

@idk3
Thanks for pointing out the performance issues with the previous approach. This is extremely helpful for putting this pull request into perspective (please always add comments explaining why a pull request is made) and provides one argument for switching to marshal.

Let me explain in more detail why I have raised objections about using marshal:

  1. A Python-specific data format is / will be a real problem in the ProjectQ Framework. Here is why:
    We have decided that the ProjectQ Framework uses Python as its main language and everything has to work in Python. However, as Python is slower than, e.g., C++, we also implement code which is a performance bottleneck in C++ (e.g. the simulator in ProjectQ). This gives the users a choice between the pure Python version or the faster C++ version. E.g. C++ Implementation of FermionOperator and QubitOperator #13 wants to have a C++ implementation of FermionOperators and QubitOperators. If we choose a language-independent format, then the faster C++ code can read and write independently without going through Python, which would kill performance.

  2. According to its own documentation, marshal is not guaranteed to be Python version or Python implementation independent. Since the first release of ProjectQ and FermiLib we have spent a lot of time making sure that we are Python version independent. Just because Travis CI passes for both Python 2 and Python 3 does not guarantee that other Python implementations won't fail reading this format.

  3. If the maintainers of the marshal module think they need to warn against unmarshalling data from untrusted sources, I think it is fair to assume an exploit is possible.

I am not yet convinced the performance argument outweighs any of these three points (maybe point 3 a little, but certainly not 1 and 2). Have you tried the following:

  1. Make a toy example problem which demonstrates your performance bottleneck (maybe somebody else browsing through the repo has a solution...)
  2. Play around with the different options of h5py to increase performance (a rough sketch of one direction is below), or ask an HDF5 expert...
  3. Figure out whether the performance bottleneck is actually the JSON step; it might be faster to create a custom datatype without going through JSON at all...
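
On point 2, a rough, untested sketch of the kind of h5py layout I mean - storing the coefficients as one chunked, compressed numeric dataset and the term keys as variable-length strings, instead of a single JSON blob (names here are illustrative only):

    import h5py
    import numpy

    def save_terms_hdf5(operator, file_path):
        terms = operator.terms
        keys = list(terms.keys())
        coeffs = numpy.array([complex(terms[k]) for k in keys], dtype=numpy.complex128)
        with h5py.File(file_path, 'w') as f:
            # Coefficients as a chunked, compressed numeric dataset instead of JSON text.
            f.create_dataset('coefficients', data=coeffs, chunks=True, compression='gzip')
            # Term keys as variable-length strings; repr/eval (or a custom encoding)
            # round-trips the tuple keys.
            str_dt = h5py.special_dtype(vlen=str)
            f.create_dataset('terms', data=[repr(k) for k in keys], dtype=str_dt)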

@babbush and @idk3
I can understand that you are under time pressure, and we are all trying our best to make sure FermiLib works for everyone. This means that we don't accept pull requests which solve a performance problem for a particular Google project while sacrificing fundamental design principles of the ProjectQ Framework that affect all other users. This means (hopefully very rarely) that there will be larger discussions to give everyone time to present their arguments and potentially find a better solution. Nevertheless, one great advantage of our framework being open source is that time pressure is almost never a valid argument for shortcutting proper code review and finding a good solution, because there is always the following way out:

For the time being, you could just copy this code (which uses marshal and solves your immediate performance problem) and solve your problem that way. Is there a specific reason why this util function has to be inside FermiLib?
Just submit the code internally at Google and start using it, or put this one function temporarily on your own GitHub account, until we have found a solution which works for everyone.

@babbush
Contributor

babbush commented Jul 23, 2017

I understand and respect your objections. However, for us to keep developing FermiLib we need to be able to sometimes strike a compromise between functionality today and purity of engineering practices. Currently, FermiLib is implemented entirely in Python and marshal is passing for all tested versions of Python. While theoretically possible, there are no known security exploits in marshal. Simply put, that is good enough for us.

The solution is in some ways suboptimal, so it makes sense to open an issue and eventually work towards a better solution. We can document what is going on with examples and wait for this hypothetical person who is browsing our repository to suggest something better. But currently the memory spikes associated with the existing approach are completely unacceptable. We need a new release of FermiLib to happen Sunday night (as we discussed) and to include the marshal solution. Otherwise, we cannot check the new version into Google3, and that will cause us unacceptable delays. Considering that almost all FermiLib pull requests currently come from people involved in this project, I think this is a completely fair reason. Our projects are largely responsible for moving FermiLib forward, so we need to be practical. This solution does work for everyone who actually uses the code today. Please accept this pull request in time for the new release.

If you want to discuss further, please call me.

@damiansteiger
Contributor

damiansteiger commented Jul 23, 2017

This is an open source project, so technical discussions should happen openly and not behind closed doors (e.g. phone calls), so that everyone using FermiLib can see why certain design decisions have been made.

Here is a compromise:
I will remove my code review block so that @babbush can merge and unblock his time-critical project. I am only doing this because FermiLib is still in an alpha release and users should be aware that the code can change at any time. But my experience has taught me that it is much harder to get someone to write a nice solution to this problem (one that does not have objections 1-3) if a "(not) good enough" solution that works for some people already exists. Hence, @babbush, are you willing to fix this problem before we make the first release?

@Spaceenter
Contributor Author

Thanks for the discussions.

I actually have another idea to make this work without making it Python specific. Inside Google, almost all large engineering projects use protobuf as a no-brainer choice for representing data models (complex or not), because it is designed to work across different languages and has much better serialization than formats like JSON.

So instead of using a native Python dictionary to represent FermionOperator and QubitOperator (and a few other objects like InteractionRDM, MoleculeOperator, etc.), we would define a protobuf message for each of them and then write the serialized protobuf binary to a file. https://developers.google.com/protocol-buffers/docs/pythontutorial#parsing-and-serialization
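
As a rough illustration of what that could look like (the message, field, and module names below are hypothetical, not an existing FermiLib proto):

    # Hypothetical .proto definition (not part of FermiLib today):
    #
    #   message FermionOperatorProto {
    #     message Term {
    #       repeated int32 factors = 1;    // flattened (mode index, raise/lower) pairs
    #       double coefficient_real = 2;
    #       double coefficient_imag = 3;
    #     }
    #     repeated Term terms = 1;
    #   }
    #
    # Assuming the generated module were named operators_pb2, saving could look like:
    import operators_pb2  # hypothetical generated code

    def save_fermion_operator(operator, file_path):
        proto = operators_pb2.FermionOperatorProto()
        for term, coeff in operator.terms.items():
            entry = proto.terms.add()
            entry.factors.extend(index for factor in term for index in factor)
            c = complex(coeff)
            entry.coefficient_real = c.real
            entry.coefficient_imag = c.imag
        with open(file_path, 'wb') as f:
            f.write(proto.SerializeToString())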

This is obviously more complex than marshal, and we'd need some time to develop and test.
@babbush @damiansteiger

@babbush
Contributor

babbush commented Jul 23, 2017

@damiansteiger I agree we should try to keep discussions open. I also appreciate your point about it being more difficult to get people to fix issues when they are "mostly working" already and not badly in need of repair (as is the case now).

I will not personally promise to fix this myself. The first reason is that I am really not the right person to do it - I am not aware of any acceptable solutions, and there are other people on this project who would be more suitable (such as yourself or @Spaceenter). The second reason is that this is not a critical issue for me, whereas there are many other features I'd like to implement. But I will certainly encourage others to try to come up with a better solution before we move out of alpha. @Spaceenter, you just suggested moving to protocol buffers, which sounds like a reasonable fix to me. Is this something you could commit to trying to do in the next few weeks?

Hopefully this is enough for you to remove the block. I assume that if @Spaceenter agrees to help fix this then that will be good enough for you?

@Spaceenter
Contributor Author

Yes, I'll have a bit more free time later next week; I'll then try to send a PR with protobuf and work with @idk3 to test the performance. In my opinion, other people who use projectq/fermilib in the future will also benefit from protobuf, as they might use Java, Go, JS, etc. to access the operators, and having protobuf should make that much easier. Anyway, we can discuss more once I actually have the PR ready.
@babbush @damiansteiger @idk3

@babbush
Contributor

babbush commented Jul 23, 2017

Great, thank you @Spaceenter! I agree that moving to protocol buffers makes a lot of sense for the long run. I suppose this should be enough for @damiansteiger to merge this PR.

@babbush babbush merged commit 2482ab2 into ProjectQ-Framework:develop Jul 23, 2017
@Spaceenter Spaceenter deleted the mar branch September 5, 2017 19:42