PROPOSAL: Refactor EID class into a pseudo-UUID #25666

Merged
merged 2 commits into from
Dec 4, 2024

mike-spa
Contributor

@mike-spa mike-spa commented Nov 27, 2024

Resolves: see previous discussion here.

The purpose of the EID class is to give a unique ID to every EngravingObject. It was initially conceived as a helper for debugging, but I can't say I have ever used it for that purpose. Instead, it has become clear to me that assigning unique IDs to items has extremely useful applications in establishing relationships between items in our files. Saving the start and end measures of a SystemLock was an obvious first one, but the real big one to me will be part-score linking, as I've pointed out here.

The previous implementation of EID had a few shortcomings for this purpose:

  • The IDs were incremental and were assigned when constructing every EngravingObject. But because many EngravingObjects are created (or re-created) at the layout stage (and in some cases even at the drawing stage...), even doing nothing could cause the ID count to increase.
  • A 32-bit ID is fine as a debugging tool, but is way too small to be used with the assumption of uniqueness. The previous point also compounds the problem, because it's not inconceivable for the counter to overflow in a big score.

So, here are a few directions I've proposed:

  • I've removed EID as a member variable of EngravingObject.
  • No new EID gets created in the EngravingObject constructor.
  • The item-EID relationship is stored in a register that exists only in the MasterScore.
  • When reading a file, if we encounter items with a saved EID code, we register them and use the register to re-establish all the relationships we need.
  • When writing items to file, we check the register again. If the item is already registered, then we reuse the same EID, otherwise we create and register a new one. This has a couple of advantages:
      • items that already had an EID are re-saved with the same EID, which eliminates unnecessary diffs in the mscx file (very relevant if we imagine collaboration features in future);
      • the unique EIDs are generated only when saving, which means they won't have any significant performance impact at runtime.
  • All of the above is done only for the items which need it (which means only measures, at the moment).
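The reuse-on-write behaviour described in the list above can be sketched roughly as follows. This is only an illustration under assumptions: `Item`, `Eid`, `EidRegister`, `registerItem` and `eidForWrite` are hypothetical names, not the classes or methods in this PR.

```cpp
#include <cstdint>
#include <random>
#include <unordered_map>

// Hypothetical stand-ins for illustration; not the MuseScore classes.
struct Item {};

struct Eid {
    std::uint64_t first = 0;
    std::uint64_t second = 0;
    bool operator==(const Eid& o) const { return first == o.first && second == o.second; }
};

class EidRegister
{
public:
    // Called while reading: remember the EID that was stored in the file.
    void registerItem(const Item* item, Eid eid) { m_itemToEid[item] = eid; }

    // Called while writing: reuse the registered EID, or create one on demand.
    Eid eidForWrite(const Item* item)
    {
        auto it = m_itemToEid.find(item);
        if (it != m_itemToEid.end()) {
            return it->second; // same EID as before, so no spurious diff in the file
        }
        const std::uint64_t first = m_rng();
        const std::uint64_t second = m_rng();
        Eid eid { first, second };
        m_itemToEid.emplace(item, eid);
        return eid;
    }

private:
    std::unordered_map<const Item*, Eid> m_itemToEid;
    std::mt19937_64 m_rng { std::random_device {}() };
};
```

Because nothing is generated until `eidForWrite` is called, items that are created and destroyed during layout never consume an ID in this sketch.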

About the implementation details:

  • There is a choice to be made between using progressive or random IDs. I decided to go with random IDs because it is quite clearly the more future-proof choice (again, very relevant for online collaboration). Unit testing may become a bit more annoying, but giving up future-proofing to avoid testing annoyance doesn't seem like a good tradeoff. Furthermore, thanks to the fact that IDs are actually reused when saving, random IDs may actually not be much of a problem in most tests.
  • A 64-bit random number is probably enough for our needs. But again, I'd rather future-proof and be certain, so I decided to create 128-bit IDs.
  • Didn't want to use strings to represent the IDs to avoid heap allocations, string comparisons etc.
  • Also didn't want to import external dependencies to create "true" UUIDs cause that's completely overkill imo.
  • The best tradeoff between simplicity and safety that I found was to generate a pseudo-128-bit UUID as a pair of pseudo-randomly generated 64-bit integers. This can be easily done with standard library tools and gives us a space of about 10^38 values, which means we're fine 😅.
  • Given that the EID register needs to be queried both by item and by eid, I actually implemented it as two separate and "opposite" unordered maps. Which feels wrong, so I'm still trying to think of a better solution. Perhaps just accepting that one of the two query directions is searched sequentially is the better alternative.
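A minimal sketch of the two ideas above: an ID made of two 64-bit random integers, and a register backed by two opposite unordered maps. All names here (`Eid`, `EidHash`, `createNewEid`, `EidRegister` and its methods) are assumptions made for illustration and are not taken from the PR's code.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <random>
#include <unordered_map>

struct Item {}; // hypothetical stand-in for EngravingObject

struct Eid {
    std::uint64_t first = 0;
    std::uint64_t second = 0;
    bool operator==(const Eid& o) const { return first == o.first && second == o.second; }
};

// Both halves are already random, so a simple combination is enough here.
struct EidHash {
    std::size_t operator()(const Eid& e) const
    {
        return std::hash<std::uint64_t>{}(e.first) ^ (std::hash<std::uint64_t>{}(e.second) << 1);
    }
};

// Draws two 64-bit values from a Mersenne Twister seeded by std::random_device.
inline Eid createNewEid()
{
    static std::mt19937_64 rng { std::random_device {}() };
    const std::uint64_t first = rng();
    const std::uint64_t second = rng();
    return Eid { first, second };
}

class EidRegister
{
public:
    // Every registration inserts into both maps so that each lookup direction is O(1) on average.
    void registerItemEid(Eid eid, const Item* item)
    {
        m_eidToItem[eid] = item;
        m_itemToEid[item] = eid;
    }

    const Item* itemFromEid(const Eid& eid) const
    {
        auto it = m_eidToItem.find(eid);
        return it != m_eidToItem.end() ? it->second : nullptr;
    }

    Eid eidFromItem(const Item* item) const
    {
        auto it = m_itemToEid.find(item);
        return it != m_itemToEid.end() ? it->second : Eid {}; // {0, 0} means "not registered" in this sketch
    }

private:
    std::unordered_map<Eid, const Item*, EidHash> m_eidToItem;
    std::unordered_map<const Item*, Eid> m_itemToEid;
};
```

One caveat of this sketch: seeding `std::mt19937_64` from a single `std::random_device` value only provides 32 bits of seed entropy, so a real implementation might want a stronger seeding strategy.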

xml.tag("eid", eid.toStdString());
}

bool TWrite::itemNeedsWriteEid(const EngravingObject* item)
Contributor

why not write for all elements?

Contributor Author

I'd like to do some performance testing on a big file. I'll come back to you about this :)

Comment on lines 56 to 58
namespace std {
template<>
struct hash<mu::engraving::EID>
Contributor

I believe this would better be written as

Suggested change
- namespace std {
- template<>
- struct hash<mu::engraving::EID>
+ template<>
+ struct std::hash<mu::engraving::EID>

because otherwise some IDEs start thinking that the std namespace is ours...
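For reference, a self-contained sketch of the qualified-name form suggested above. The `demo::Eid` type and the hash combination are assumptions for illustration only; the actual EID class and its hash function may look different.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>

namespace demo {
// Hypothetical ID type standing in for mu::engraving::EID.
struct Eid {
    std::uint64_t first = 0;
    std::uint64_t second = 0;
    bool operator==(const Eid& o) const { return first == o.first && second == o.second; }
};
}

// Qualified-name form: the specialization is declared without reopening namespace std.
template<>
struct std::hash<demo::Eid>
{
    std::size_t operator()(const demo::Eid& e) const noexcept
    {
        return std::hash<std::uint64_t>{}(e.first) ^ (std::hash<std::uint64_t>{}(e.second) << 1);
    }
};
```

With the specialization in place, `std::unordered_set<demo::Eid>` and `std::unordered_map<demo::Eid, T>` work without passing a custom hasher.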

@shoogle
Contributor

shoogle commented Nov 29, 2024

The method looks good, but do we really need 128 bit IDs?

  • YouTube video IDs only contain 11 base64 characters (i.e. 64 bits) and that's already overkill.

  • With 64 bits, you can generate 5 billion IDs before the probability of a collision reaches 50%.

  • 5 billion is less than 1 billionth of the 2^64 available IDs.
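The figures in the list above follow from the usual birthday-problem approximation, sketched here with $N = 2^{64}$ possible IDs and $n$ IDs drawn uniformly at random (the symbols $N$, $n$ and $p$ are introduced only for this illustration):

$$
p(n) \approx 1 - e^{-n^2 / (2N)}, \qquad
p(n) = \tfrac{1}{2} \;\Rightarrow\; n \approx \sqrt{2N \ln 2} = 2^{32}\sqrt{2 \ln 2} \approx 5.06 \times 10^{9}
$$

$$
\frac{n}{N} \approx \frac{5.06 \times 10^{9}}{1.84 \times 10^{19}} \approx 2.7 \times 10^{-10} < 10^{-9}
$$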

IDs generated in series will never conflict, so it's only IDs generated in parallel that you need to worry about. That's an awful lot of pull requests that people would have to open on your score repository before you would see any conflicts in terms of IDs. In reality, you would run into ordinary merge conflicts long before you encounter any ID collisions.

If a score does happen to contain two elements with the same ID, you could show an error message and rely on the user to fix it manually. Only developers store scores in Git repositories, so it's only developers who could encounter this problem.

When we finally do multi-movement scores properly, depending on the implementation, IDs might only need to be unique for a single movement rather than an entire score, which would reduce the likelihood of collisions even more.

Didn't want to use strings to represent the IDs to avoid heap allocations, string comparisons etc.

What about in the MSCX? A 128-bit number will be 39 decimal digits long, but only 22 characters in base64.

| Base | Random 128-bit number | Digits |
|---|---|---|
| 10 (decimal) | 174786058559363532376011904777350482855 | 39 |
| 16 (hex) | 837e9206009fb14ea565e4fddb7b0ba7 | 32 |
| 32 | QN7JEBQAT6YU5JLF4T65W6YLU4====== | 26 plus padding |
| 64 | g36SBgCfsU6lZeT923sLpw== | 22 plus padding |

Since we know the result is supposed to be 128 bits, we can discard the trailing = padding present in base32 and base64.

96 bits requires 29 decimal digits, or 16 characters in base64

| Base | Random 96-bit number | Digits |
|---|---|---|
| 10 (decimal) | 34580252689845652563297092806 | 29 |
| 16 (hex) | 6fbc1d569248774adc5cc4c6 | 24 |
| 32 | N66B2VUSJB3UVXC4YTDA==== | 20 plus padding |
| 64 | b7wdVpJId0rcXMTG | 16 |

64 bits requires 20 decimal digits, or 11 characters in base64

| Base | Random 64-bit number | Digits |
|---|---|---|
| 10 (decimal) | 11280125859929617183 | 20 |
| 16 (hex) | 9c8b0eb476cd731f | 16 |
| 32 | TSFQ5NDWZVZR6=== | 13 plus padding |
| 64 | nIsOtHbNcx8= | 11 plus padding |
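The digit counts in these tables can be reproduced with a couple of small helpers. This is just a sketch; the function names are made up for illustration.

```cpp
#include <cmath>

// Characters needed to write any value of `bits` bits in base 2^bitsPerChar
// (4 for hex, 5 for base32, 6 for base64), ignoring '=' padding.
constexpr int charsForPowerOfTwoBase(int bits, int bitsPerChar)
{
    return (bits + bitsPerChar - 1) / bitsPerChar; // integer ceiling division
}

// Decimal digits of the largest `bits`-bit value, 2^bits - 1.
// 2^bits is never a power of ten for bits >= 1, so floor(bits * log10(2)) + 1
// gives the digit count of both 2^bits and 2^bits - 1.
inline int decimalDigits(int bits)
{
    return static_cast<int>(std::floor(bits * std::log10(2.0))) + 1;
}
```

For example, `charsForPowerOfTwoBase(128, 6)` gives the 22 base64 characters and `decimalDigits(128)` the 39 decimal digits listed in the first table.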

@mike-spa
Contributor Author

Thank you all for the comments! I'll look into it and come back to you. Meanwhile, I've been wondering whether to import old EIDs into the new format, but it turns out we can't do that, because of this issue. In fact, the vtests were crashing because at least one file had a repeated EID. So I think I'll just discard the old EIDs.

@mike-spa
Contributor Author

Ok, here are my findings.

  • @igorkorsukov I've checked, and the performance cost of registering all the items (as opposed to just the Measures) is completely negligible, so I'm taking your suggestion here, thanks!
  • @shoogle git conflicts are obviously not the main problem, I just used them to illustrate the point. Maybe MuseScore will never become an online editor with real-time collaboration in Google Docs style. But if we want to at least keep that door open, sequential IDs are just not an option.
  • Do we really need 128-bit IDs? My answer would be that we don't have to, but given that the cost is negligible (see later) I don't see why not. Assuming that we opt for random IDs, the large file I've used for this test needs on the order of 10^6 IDs. From the resource you've shared, a 64-bit random number would have a conflict probability on the order of 10^-6 (one in a million). That is small, but it is not zero (and the file can be bigger still). With 128 bits, it is effectively zero.
  • If a score does happen to contain two elements with the same ID, you could show an error message and rely on the user to fix it manually. I think it's a bit more tricky than that. If we now start using IDs to establish relationships (which I think we should), then having an ID conflict isn't just about changing a number, it's also a file corruption, cause it means we've completely lost the element that we were referring to. So I don't think we should even plan for ID conflicts, we should just make them impossible.
  • What about in the MSCX? I was referring to the runtime representation of EIDs in preferring to use a tuple of uint64_t rather than a string. In the mscx I represent the EID as two dash-separated HEX strings, like
    <eid>29d2e029bab9ae53-412833ef6deb5a77</eid>.
    This makes it easy to parse it back into two integers when reading (and also gives me an easy way to discern and ignore the old EID codes).
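As a rough illustration of the dash-separated hex format above, here is one way to write and parse such a string. The names `Eid`, `toHexString` and `fromHexString`, the zero-padding to 16 digits and the exact rejection rules are assumptions of this sketch, not the PR's implementation.

```cpp
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

struct Eid {
    std::uint64_t first = 0;
    std::uint64_t second = 0;
};

// Formats the two halves as zero-padded lowercase hex, separated by a dash.
inline std::string toHexString(const Eid& eid)
{
    char buf[34]; // 16 + 1 + 16 characters plus the terminating NUL
    std::snprintf(buf, sizeof(buf), "%016" PRIx64 "-%016" PRIx64, eid.first, eid.second);
    return buf;
}

// Returns false for anything that is not "<hex>-<hex>"; a legacy EID written
// as a plain number has no dash and is therefore rejected.
inline bool fromHexString(const std::string& s, Eid& out)
{
    const std::size_t dash = s.find('-');
    if (dash == std::string::npos || dash == 0 || dash + 1 == s.size()) {
        return false;
    }
    const std::string parts[2] = { s.substr(0, dash), s.substr(dash + 1) };
    std::uint64_t values[2] = { 0, 0 };
    for (int i = 0; i < 2; ++i) {
        if (parts[i].size() > 16
            || parts[i].find_first_not_of("0123456789abcdefABCDEF") != std::string::npos) {
            return false;
        }
        values[i] = std::stoull(parts[i], nullptr, 16); // at most 16 hex digits, so no overflow
    }
    out.first = values[0];
    out.second = values[1];
    return true;
}
```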

Now, in terms of cost, long story short is that it's completely negligible, even with 128bit IDs. The large file I've used (Beethoven 9 with parts generated) required about 10^6 individual IDs.

  • Increased memory usage from registering all the 10^6 IDs is around 1% (about 50MB on a total memory usage of 5GB).
  • The time to generate the random IDs (EID::createNew) is barely measurable by the profiler.
  • Registering all the IDs (which means performing insertions into two opposite maps, EIDRegister::registerItemEID) takes a total of 725 ms, which is practically unmeasurable against the total time it takes to parse the file, which is over 4 minutes.
  • Out of curiosity, I tried replacing unordered with ordered maps, and the insertion time became 2000 ms (still not a lot, but almost 3 times slower). Not that it matters, but I thought it was interesting @cbjeukendrup.

@shoogle
Contributor

shoogle commented Nov 29, 2024

@mike-spa The main cost I'm worried about is seeing these long IDs in files and in diffs, but separating it into pieces goes a long way towards addressing that. In practice, I would only have to glance at the first half of two IDs to see if they are the same.

  • Hyphen: <eid>29d2e029bab9ae53-412833ef6deb5a77</eid>
  • Underscore: <eid>29d2e029bab9ae53_412833ef6deb5a77</eid>

Perhaps you could separate the pieces with underscores rather than hyphens? This would enable the entire ID to be selected with double click, for an easy Ctrl+F search in a large file.

@mike-spa
Contributor Author

mike-spa commented Dec 3, 2024

I've made some final changes! I've implemented a simple base-64 encoding, so the resulting string in the file is a few characters shorter, and I've used the underscore as the separator, so now the EIDs in our files are written like this:
<eid>JmhnU8B5YOF_UfuUZNkL07F</eid>.
Please let me know if you think this is good to go, in which case I'd like to merge it soon :)
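To illustrate the idea of writing each 64-bit half as 11 base-64 characters joined by an underscore, here is a minimal sketch. The alphabet (the standard base64 one) and the digit order (least significant first) are assumptions of this example, as are all the function names; the encoding actually implemented in the PR may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Assumed alphabet (the standard base64 one); the real alphabet may differ.
static const char BASE64_CHARS[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Writes a 64-bit value as exactly 11 base-64 digits, least significant first.
inline std::string toBase64(std::uint64_t value)
{
    std::string out;
    for (int i = 0; i < 11; ++i) {
        out.push_back(BASE64_CHARS[value & 63]); // lowest 6 bits become the next digit
        value >>= 6;
    }
    return out;
}

// Inverse of toBase64; returns false on wrong length, unknown characters or overflow.
inline bool fromBase64(const std::string& text, std::uint64_t& value)
{
    if (text.size() != 11) {
        return false;
    }
    const std::string alphabet(BASE64_CHARS);
    std::uint64_t result = 0;
    for (std::size_t i = text.size(); i-- > 0;) {
        const std::size_t digit = alphabet.find(text[i]);
        if (digit == std::string::npos) {
            return false;
        }
        if (i == 10 && digit > 15) {
            return false; // 11 digits hold 66 bits; the top 2 bits must be zero
        }
        result = (result << 6) | digit;
    }
    value = result;
    return true;
}

// Joins the two encoded halves with an underscore, 23 characters in total.
inline std::string eidToString(std::uint64_t first, std::uint64_t second)
{
    return toBase64(first) + "_" + toBase64(second);
}
```

Since the alphabet has no underscore, the separator can never be confused with a digit, and the whole ID stays selectable with a double click.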

@shoogle
Contributor

shoogle commented Dec 3, 2024

@mike-spa, that's a massive improvement, thanks!

Contributor

@shoogle shoogle left a comment

I approve, though @igorkorsukov or @cbjeukendrup might want to give this a final look before it gets merged.

I notice you don't write IDs in test mode, probably because non-native formats don't have IDs (e.g. MIDI, MusicXML), so you'd get different random IDs each time, which would cause the tests to fail.

But it's a shame not to test whether IDs, spanner links, and parts links are preserved.

One idea might be to seed the PRNG to a specific value in test mode, so you get the same IDs generated every time. (Maybe something for a future PR?)
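A minimal sketch of that idea, assuming a hypothetical `EidGenerator` class (not part of this PR) that either seeds itself from the system or takes a fixed seed in test mode:

```cpp
#include <cstdint>
#include <random>

struct Eid {
    std::uint64_t first = 0;
    std::uint64_t second = 0;
    bool operator==(const Eid& o) const { return first == o.first && second == o.second; }
};

class EidGenerator
{
public:
    // Normal mode: seed from std::random_device.
    EidGenerator()
        : m_rng(std::random_device {}()) {}

    // Test mode: a fixed seed makes the sequence of IDs reproducible.
    explicit EidGenerator(std::uint64_t fixedSeed)
        : m_rng(fixedSeed) {}

    Eid next()
    {
        const std::uint64_t first = m_rng();
        const std::uint64_t second = m_rng();
        return Eid { first, second };
    }

private:
    std::mt19937_64 m_rng;
};
```

The output sequence of `std::mt19937_64` for a given seed is fixed by the C++ standard, so two runs with the same seed would produce the same IDs on any conforming platform, provided the items are written in the same order.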

Contributor

@igorkorsukov igorkorsukov left a comment

all good for me :)

@mike-spa
Contributor Author

mike-spa commented Dec 4, 2024

@shoogle Yes, we discussed testing with @cbjeukendrup for exactly that reason. I should point out that even "legacy" EIDs weren't written in test mode (they were written only for the Score object, don't know why). At the moment EIDs have only one use case, namely SystemLocks. But as we start adopting them more, they should definitely be included in tests, and I think that seeding the RNG with a constant value may indeed be a great solution. In the meantime, I'll merge this :)
Thanks all for your help!

@mike-spa mike-spa merged commit 6b9b3b0 into musescore:master Dec 4, 2024
11 checks passed
5 participants