PROPOSAL: Refactor EID class into a pseudo-UUID #25666

mike-spa · 2024-11-27T15:27:30Z

Resolves: see previous discussion here.

The EID class has the purpose of giving a unique ID to every EngravingObject. It was initially conceived as a helper for debugging, but I can't say I have ever used it for that purpose. Instead, it has become clear to me that assigning unique IDs to items has extremely useful applications in establishing relationships between items in our files. Saving the start- and end measure of a SystemLock was an obvious first one, but the real big one to me will be part-score linking, as I've pointed out here.

The previous implementation of EID had a few shortcomings for this purpose:

The IDs were incremental and were assigned when constructing every EngravingObject. But because many EngravingObjects are created (or re-created) at the layout stage (and in some cases even at drawing stage...), even doing nothing could cause the ID count to increase.
A 32-bit ID is fine as a debugging tool, but is way too small to be used with the assumption of uniqueness. The previous point also compounds the problem, because it's not inconceivable for the counter to overflow in a big score.

So, here's a few directions I've proposed:

I've removed EID as a member variable of EngravingObject.
No new EID gets created in the EngravingObject constructor.
The item-EID relationship is stored in a register that exists only in the MasterScore.
When reading a file, if we encounter items with a saved EID code, we register them and use the register to re-establish all the relationships we need.
When writing items to file, we check the register again. If the item is already registered, then we reuse the same EID, otherwise we create and register a new one. This has a couple of advantages:
items that already had an EID are re-saved with the same EID, which eliminates unnecessary diffs in the mscx file (very relevant if we imagine collaboration features in future);
the unique EIDs are generated only when saving, which means they won't have any significant performance impact in the runtime.
All of the above is done only for the items which need it (which means only measures, at the moment).
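
To make the read/write flow above concrete, here is a minimal sketch of such a register. It is not the code from this PR: `Item`, `EIDHash`, `eidForWrite` and `itemFromEID` are hypothetical names, the signature of `registerItemEID` is assumed, and the real register would live in the MasterScore.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <random>
#include <unordered_map>

// Hypothetical stand-in for an engraving item; only its address matters here.
struct Item {};

// A 128-bit ID stored as two 64-bit halves.
struct EID {
    std::uint64_t first = 0;
    std::uint64_t second = 0;
    bool operator==(const EID& other) const { return first == other.first && second == other.second; }
};

struct EIDHash {
    std::size_t operator()(const EID& eid) const
    {
        return std::hash<std::uint64_t>()(eid.first) ^ (std::hash<std::uint64_t>()(eid.second) << 1);
    }
};

class EIDRegister
{
public:
    // Reading a file: remember the ID that was saved for this item.
    void registerItemEID(const EID& eid, const Item* item)
    {
        m_eidToItem[eid] = item;
        m_itemToEid[item] = eid;
    }

    // Writing a file: reuse the registered ID, or create and register one on demand.
    EID eidForWrite(const Item* item)
    {
        auto it = m_itemToEid.find(item);
        if (it != m_itemToEid.end()) {
            return it->second;
        }
        EID eid{ m_rng(), m_rng() };
        registerItemEID(eid, item);
        return eid;
    }

    // Resolving a saved relationship: returns nullptr for an unknown ID.
    const Item* itemFromEID(const EID& eid) const
    {
        auto it = m_eidToItem.find(eid);
        return it != m_eidToItem.end() ? it->second : nullptr;
    }

private:
    std::unordered_map<EID, const Item*, EIDHash> m_eidToItem;
    std::unordered_map<const Item*, EID> m_itemToEid;
    std::mt19937_64 m_rng{ std::random_device{}() };
};
```

Because `eidForWrite` returns the stored ID whenever one exists, saving the same score twice produces identical IDs, which is what keeps the mscx diff quiet.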

About the implementation details:

There is a choice to be made between using progressive or random IDs. I decided to go with random IDs because it is quite clearly the most future-proof choice (again, very relevant for online collaboration). Unit testing may become a bit more annoying, but not future-proofing ourselves to avoid testing annoyance doesn't seem like a good tradeoff. Furthermore, thanks to the fact that IDs are actually reused when saving, random IDs may actually not be much of a problem in most tests.
A 64-bit random number is probably enough for our needs. But again, I'd rather future-proof and be certain, so I decided to create 128-bit IDs.
Didn't want to use strings to represent the IDs to avoid heap allocations, string comparisons etc.
Also didn't want to import external dependencies to create "true" UUIDs cause that's completely overkill imo.
The best tradeoff between simplicity and safety that I found was to generate a pseudo-128bit-UUID as a pair of pseudo-randomly-generated 64bit integers. This can be easily done with standard library tools and gives us a space on the order of 10^38, which means we're fine 😅.
Given that the EID register needs to be queried both by item and by eid, I actually implemented it as two separate and "opposite" unordered maps. Which feels wrong, so I'm still trying to think of a better solution. Perhaps just accepting that one of the two query directions is searched sequentially is the better alternative.
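
As an illustration of the "pair of 64-bit integers" idea, a sketch using only the standard library could look like the following. The function name `createNewEID` and the seeding strategy are assumptions for this example, not necessarily what `EID::createNew` does in the PR.

```cpp
#include <cstdint>
#include <random>

// A pseudo-UUID: 128 bits held as two 64-bit integers, so no heap allocation
// and no string comparison is needed at runtime.
struct EID {
    std::uint64_t first = 0;
    std::uint64_t second = 0;
};

// Draws two consecutive 64-bit outputs from a Mersenne Twister engine.
// The engine is seeded once, from several std::random_device draws, because a
// single draw only provides 32 bits of seed material.
// Note: the static engine is not protected against concurrent calls.
inline EID createNewEID()
{
    static std::mt19937_64 engine = [] {
        std::random_device rd;
        std::seed_seq seq{ rd(), rd(), rd(), rd() };
        return std::mt19937_64(seq);
    }();
    return EID{ engine(), engine() };
}
```

Two 64-bit halves give 2^128 possible values, roughly 3.4 × 10^38.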

igorkorsukov · 2024-11-28T14:11:50Z

src/engraving/rw/write/twrite.cpp

+    xml.tag("eid", eid.toStdString());
+}
+
+bool TWrite::itemNeedsWriteEid(const EngravingObject* item)


why not write for all elements?

I'd like to do some performance testing on a big file. I'll come back to you about this :)

cbjeukendrup · 2024-11-29T00:41:02Z

src/engraving/infrastructure/eid.h

+namespace std {
+template<>
+struct hash<mu::engraving::EID>


I believe this would better be written as

Suggested change

-namespace std {
-template<>
-struct hash<mu::engraving::EID>
+template<>
+struct std::hash<mu::engraving::EID>

because otherwise some IDEs start thinking that the std namespace is ours...
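
For reference, a complete version of the qualified form might look like this. The `demo::EID` type and the way the two halves are combined are assumptions made for this sketch; the qualified syntax is accepted by current compilers, though some older GCC releases rejected it.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>

namespace demo {
// Stand-in for mu::engraving::EID, reduced to what hashing needs.
struct EID {
    std::uint64_t first = 0;
    std::uint64_t second = 0;
    bool operator==(const EID& other) const { return first == other.first && second == other.second; }
};
}

// Qualified specialization declared at global scope: namespace std is never reopened.
template<>
struct std::hash<demo::EID>
{
    std::size_t operator()(const demo::EID& eid) const noexcept
    {
        const std::size_t h1 = std::hash<std::uint64_t>()(eid.first);
        const std::size_t h2 = std::hash<std::uint64_t>()(eid.second);
        return h1 ^ (h2 << 1);
    }
};

// With the specialization visible, EID works as a key without naming a hasher.
using EIDSet = std::unordered_set<demo::EID>;
```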

shoogle · 2024-11-29T05:30:27Z

The method looks good, but do we really need 128 bit IDs?

YouTube video IDs only contain 11 base64 characters (i.e. 64 bits) and that's already overkill.
With 64 bits, you can generate 5 billion IDs before the probability of a collision reaches 50%.
5 billion is less than 1 billionth of the 2^64 available IDs.
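
These figures can be reproduced with the standard birthday approximation, p ≈ 1 − exp(−n² / 2^(bits+1)). The two helper functions below are a sketch written for this note (the names are made up), not part of the PR.

```cpp
#include <cmath>

// Approximate probability of at least one collision among n uniformly random
// IDs of the given width, using p = 1 - exp(-n^2 / 2^(bits + 1)).
inline double collisionProbability(double n, int bits)
{
    return -std::expm1(-(n * n) / std::ldexp(1.0, bits + 1));
}

// Inverse of the formula above: how many IDs can be generated before the
// collision probability reaches p, i.e. n = sqrt(-2 * 2^bits * ln(1 - p)).
inline double idsForProbability(double p, int bits)
{
    return std::sqrt(-2.0 * std::ldexp(1.0, bits) * std::log1p(-p));
}
```

Calling `idsForProbability(0.5, 64)` gives about 5.06 × 10^9, which matches the 5 billion quoted above.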

IDs generated in series will never conflict, so it's only IDs generated in parallel that you need to worry about. That's an awful lot of pull requests that people would have to open on your score repository before you would see any conflicts in terms of IDs. In reality, you would run into ordinary merge conflicts long before you encounter any ID collisions.

If a score does happen to contain two elements with the same ID, you could show an error message and rely on the user to fix it manually. Only developers store scores in Git repositories, so it's only developers who could encounter this problem.

When we finally do multi-movement scores properly, depending on the implementation, IDs might only need to be unique for a single movement rather than an entire score, which would reduce the likelihood of collisions even more.

Didn't want to use strings to represent the IDs to avoid heap allocations, string comparisons etc.

What about in the MSCX? A 128-bit number will be 39 decimal digits long, but only 22 characters in base64.

Base	Random 128-bit Number	Digits
10 (decimal)	`174786058559363532376011904777350482855`	39
16 (hex)	`837e9206009fb14ea565e4fddb7b0ba7`	32
32	`QN7JEBQAT6YU5JLF4T65W6YLU4======`	26 plus padding
64	`g36SBgCfsU6lZeT923sLpw==`	22 plus padding

Since we know the result is supposed to be 128 bits, we can discard the trailing = padding present in base32 and base64.

96 bits requires 29 decimal digits, or 16 characters in base64

Base	Random 96-bit Number	Digits
10 (decimal)	`34580252689845652563297092806`	29
16 (hex)	`6fbc1d569248774adc5cc4c6`	24
32	`N66B2VUSJB3UVXC4YTDA====`	20 plus padding
64	`b7wdVpJId0rcXMTG`	16

64 bits requires 20 decimal digits, or 11 characters in base64

Base	Random 64-bit Number	Digits
10 (decimal)	`11280125859929617183`	20
16 (hex)	`9c8b0eb476cd731f`	16
32	`TSFQ5NDWZVZR6===`	13 plus padding
64	`nIsOtHbNcx8=`	11 plus padding

mike-spa · 2024-11-29T09:41:00Z

Thank you all for the comments! I'll look into it and come back to you. Meanwhile, I've been wondering whether to import old EIDs into the new format, but it turns out we can't do that, because of this issue. In fact, the vtests were crashing because at least one file had a repeated EID. So I think I'll just discard the old EIDs.

mike-spa · 2024-11-29T15:16:28Z

Ok, here are my findings.

@igorkorsukov I've checked, and the performance cost of registering all the items (as opposed to just the Measures) is completely negligible, so I'm taking your suggestion here, thanks!
@shoogle git conflicts are obviously not the main problem, I just used them to illustrate the point. Maybe MuseScore will never become an online editor with real-time collaboration in Google-Docs style. But if we want to at least keep that door open, sequential IDs are just not an option.
Do we really need 128bit IDs? My answer would be that we don't have to, but given that the cost is negligible (see later) I don't see why not. Assuming that we opt for random IDs, the large file I've used for this test needs in the order of 10^6 IDs. From the resource you've shared, a 64bit random number would have a conflict probability in the order of 10^-6 (one in a million). That is small, but is not zero (and the file can be bigger still). With 128bit, it is effectively zero.
If a score does happen to contain two elements with the same ID, you could show an error message and rely on the user to fix it manually. I think it's a bit more tricky than that. If we now start using IDs to establish relationships (which I think we should), then having an ID conflict isn't just about changing a number, it's also a file corruption, cause it means we've completely lost the element that we were referring to. So I don't think we should even plan for ID conflicts, we should just make them impossible.
What about in the MSCX? I was referring to the runtime representation of EIDs in preferring to use a tuple of uint64_t rather than a string. In the mscx I represent the EID as two dash-separated HEX strings, like
<eid>29d2e029bab9ae53-412833ef6deb5a77</eid>.
This makes it easy to parse it back into two integers when reading (and also gives me an easy way to discern and ignore the old EID codes).
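
A sketch of how the dash-separated hex form could be parsed back into two integers follows. The function names are hypothetical, and since the exact shape of the old EID codes is not shown in this thread, the sketch simply rejects anything that is not two hex fields of 1 to 16 digits joined by a single dash.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

struct EID {
    std::uint64_t first = 0;
    std::uint64_t second = 0;
};

// Parses 1 to 16 hexadecimal digits into a 64-bit value; false on anything else.
inline bool parseHex64(const std::string& s, std::uint64_t& out)
{
    if (s.empty() || s.size() > 16) {
        return false;
    }
    std::uint64_t value = 0;
    for (char c : s) {
        int digit = 0;
        if (c >= '0' && c <= '9') {
            digit = c - '0';
        } else if (c >= 'a' && c <= 'f') {
            digit = c - 'a' + 10;
        } else if (c >= 'A' && c <= 'F') {
            digit = c - 'A' + 10;
        } else {
            return false;
        }
        value = (value << 4) | static_cast<std::uint64_t>(digit);
    }
    out = value;
    return true;
}

// Splits "hhhh-hhhh" at the dash and parses both halves.
// Returns false (leaving `out` untouched) when the text does not match.
inline bool parseEID(const std::string& text, EID& out)
{
    const std::size_t dash = text.find('-');
    if (dash == std::string::npos) {
        return false;
    }
    EID eid;
    if (!parseHex64(text.substr(0, dash), eid.first) || !parseHex64(text.substr(dash + 1), eid.second)) {
        return false;
    }
    out = eid;
    return true;
}
```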

Now, in terms of cost, long story short is that it's completely negligible, even with 128bit IDs. The large file I've used (Beethoven 9 with parts generated) required about 10^6 individual IDs.

Increased memory usage from registering all the 10^6 IDs is around 1% (about 50MB on a total memory usage of 5GB).
The time to generate the random IDs (EID::createNew) is barely measurable by the profiler.
Registering all the IDs (which means performing insertion in two opposite maps, EIDRegister::registerItemEID) takes a total of 725 ms, which is negligible compared with the total time it takes to parse the file, which is over 4 minutes.
Out of curiosity, I tried replacing unordered with ordered maps, and the insertion time became 2000 ms (still not a lot, but almost 3 times slower). Not that it matters, but I thought it was interesting @cbjeukendrup.

shoogle · 2024-11-29T19:52:57Z

@mike-spa The main cost I'm worried about is seeing these long IDs in files and in diffs, but separating it into pieces goes a long way towards addressing that. In practice, I would only have to glance at the first half of two IDs to see if they are the same.

Hyphen: <eid>29d2e029bab9ae53-412833ef6deb5a77</eid>
Underscore: <eid>29d2e029bab9ae53_412833ef6deb5a77</eid>

Perhaps you could separate the pieces with underscores rather than hyphens? This would enable the entire ID to be selected with double click, for an easy Ctrl+F search in a large file.

mike-spa · 2024-12-03T09:45:09Z

I've made some final changes! I've implemented a simple base-64 encoding, so the resulting string in the file is a few characters shorter, and I've used the underscore for separator, so now the EIDs in our files are written like this
<eid>JmhnU8B5YOF_UfuUZNkL07F</eid>.
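
To show why each half comes out as 11 characters, here is a sketch of one possible encoding: every 64-bit half is written as 11 base-64 digits (4 + 10 × 6 bits) and the halves are joined with an underscore. The alphabet and the function names are assumptions; the thread does not show which alphabet the PR actually uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

struct EID {
    std::uint64_t first = 0;
    std::uint64_t second = 0;
};

// Assumed alphabet (standard base64 order); it must not contain the '_' separator.
static const char kAlphabet[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Writes a 64-bit value as exactly 11 base-64 digits, most significant first.
inline std::string encode64(std::uint64_t value)
{
    std::string out(11, 'A');
    for (int i = 10; i >= 0; --i) {
        out[static_cast<std::size_t>(i)] = kAlphabet[value & 63];
        value >>= 6;
    }
    return out;
}

// Reverses encode64; rejects wrong lengths, unknown characters and overflow.
inline bool decode64(const std::string& s, std::uint64_t& out)
{
    if (s.size() != 11) {
        return false;
    }
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char* pos = s[i] == '\0' ? nullptr : std::strchr(kAlphabet, s[i]);
        if (!pos) {
            return false;
        }
        const std::uint64_t digit = static_cast<std::uint64_t>(pos - kAlphabet);
        if (i == 0 && digit > 15) {
            return false; // the leading digit only carries the top 4 bits
        }
        value = (value << 6) | digit;
    }
    out = value;
    return true;
}

inline std::string toString(const EID& eid)
{
    return encode64(eid.first) + "_" + encode64(eid.second);
}

inline bool fromString(const std::string& text, EID& out)
{
    if (text.size() != 23 || text[11] != '_') {
        return false;
    }
    EID eid;
    if (!decode64(text.substr(0, 11), eid.first) || !decode64(text.substr(12), eid.second)) {
        return false;
    }
    out = eid;
    return true;
}
```

The fixed 23-character shape also means the earlier hex form with a dash is rejected outright rather than misread.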
Please let me know if you think this is good to go, in which case I'd like to merge it soon :)

shoogle · 2024-12-03T13:58:37Z

@mike-spa, that's a massive improvement, thanks!

shoogle

I approve, though @igorkorsukov or @cbjeukendrup might want to give this a final look before it gets merged.

I notice you don't write IDs in test mode, probably because non-native formats don't have IDs (e.g. MIDI, MusicXML), so you'd get different random IDs each time, which would cause the tests to fail.

But it's a shame not to test whether IDs, spanner links, and parts links are preserved.

One idea might be to seed the PRNG to a specific value in test mode, so you get the same IDs generated every time. (Maybe something for a future PR?)

igorkorsukov

all good for me :)

mike-spa · 2024-12-04T08:22:56Z

@shoogle Yes, we discussed testing with @cbjeukendrup for exactly that reason. I should point out that even "legacy" EIDs weren't written in test mode (they were written only for the Score object, don't know why). At the moment EIDs only have one use case, that is SystemLocks. But as we start adopting them more, they should definitely be included in tests, and I think that seeding the RNG with a constant value may indeed be a great solution. In the meantime, I'll merge this :)
Thanks all for your help!
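
The constant-seed idea for tests could be sketched as below. `EIDGenerator` is a hypothetical class for illustration only; nothing like it is confirmed to exist in the codebase.

```cpp
#include <cstdint>
#include <random>

struct EID {
    std::uint64_t first = 0;
    std::uint64_t second = 0;
};

class EIDGenerator
{
public:
    // Normal mode: seed from the platform entropy source, so IDs differ per run.
    EIDGenerator()
        : m_engine(std::random_device{}()) {}

    // Test mode: a fixed seed makes the whole ID sequence reproducible.
    explicit EIDGenerator(std::uint64_t seed)
        : m_engine(seed) {}

    EID next()
    {
        const std::uint64_t first = m_engine();
        const std::uint64_t second = m_engine();
        return EID{ first, second };
    }

private:
    std::mt19937_64 m_engine;
};
```

The raw output of `std::mt19937_64` for a given seed is fixed by the C++ standard, so reference files generated with a constant seed should compare equal across platforms as long as no distribution object sits between the engine and the ID.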

mike-spa requested a review from cbjeukendrup November 27, 2024 15:27

RomanPudashkin requested a review from igorkorsukov November 28, 2024 07:45

mike-spa force-pushed the reworkEID branch from 71ed2fc to 073f9bd Compare November 28, 2024 10:13

igorkorsukov reviewed Nov 28, 2024

cbjeukendrup reviewed Nov 29, 2024

mike-spa force-pushed the reworkEID branch from 073f9bd to 387dc34 Compare November 29, 2024 14:30

mike-spa force-pushed the reworkEID branch from 387dc34 to 006e7f5 Compare November 29, 2024 15:49

mike-spa added 2 commits December 2, 2024 10:48

Refactor EID into a pseudo-UUID

a678489

Fix/Update tests

e78aacd

mike-spa force-pushed the reworkEID branch from 006e7f5 to e78aacd Compare December 2, 2024 09:49

shoogle approved these changes Dec 3, 2024

igorkorsukov approved these changes Dec 4, 2024

mike-spa merged commit 6b9b3b0 into musescore:master Dec 4, 2024
11 checks passed

Jojo-Schmitz reviewed Dec 4, 2024
