-
-
Notifications
You must be signed in to change notification settings - Fork 387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PICARD-2466: Lookup function by using Catalog Number and other identifying info #2538
base: master
Are you sure you want to change the base?
Conversation
This is the first commit for a possible solution for PICARD-2466 also applies to PICARD-2793
Thanks to @kokon191 for the catno comparison code Cluster weights are sensible defaults which work for some of my tested files, but they will probably need some tweaking. The Metadata weights are sensible defaults guessed from the other weights. I haven't added them to File yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I am not sure that this actually resolves anything that is part of PICARD-2793.
@@ -72,6 +72,9 @@ | |||
'releasecountry': 2, | |||
'releasetype': 10, | |||
'totalalbumtracks': 5, | |||
'catalognumber': 60, | |||
'asin': 65, | |||
'barcode': 68, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unclear how these weights were chosen or why they are so high.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose these in the same way as the ones in metadata.py
but after trying to match ~80 albums (all files tagged with only catalognumber
), with 'catalognumber': 60,
it was able to correctly identify almost all of them (except for ~4), while with a lower value it would identify quite a few less.
But all these were guesses and as I am not familiar with the codebase yet they may very well be wrong (even tho they worked for my specific use case).
@@ -163,6 +163,9 @@ class Metadata(MutableMapping): | |||
('totaltracks', 5), | |||
('discnumber', 5), | |||
('totaldiscs', 4), | |||
('catalognumber', 30), | |||
('barcode', 33), | |||
('asin', 32), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unclear how these weights were chosen or why they are so high.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose these values so they are a bit higher than the highest one (title
), since they more uniquely identify releases/recordings/etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not had any experience with determining optimum search weightings in Picard, but I have designed complex weighted search algorithms in other search engines - and I know that getting the right weights is an art, and if you get them wrong then all sorts of funny results come up first.
I don't know whether @zas or @phw can help with determining whether these weights are good or bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also believe these weights are too high. This can easily overrule better matches based on title and artist. It's also one reason why I think we should go a different route.
metadata_counters[tag] = Counter() | ||
metadata_counters[tag][value] += 1 | ||
for tag in metadata_counters: | ||
self.metadata[tag] = metadata_counters[tag].most_common(1)[0][0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this additional code (which is only for these tags)?
And why has genre been added to these tags?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this additional code (which is only for these tags)?
This is in order to add those tags to the cluster so they can be used in the lookup.
And why has genre been added to these tags?
Because I thought it would make sense for a cluster to also have the most common genre in its files, there may also be other tags which makes sense to include (like releasecountry
or comment
for release disambiguation maybe?) and all these can be quickly added/removed to the lookup (which could also lead to PICARD-2793).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional code - I haven't read the existing code in detail, but I surely code to include these tags in the cluster should be co-located with the code that puts the existing tags into the cluster.
Genres should not IMO be added because they are highly subjective and IMO the MB implementation is far too free-form i.e. lacking in hierarchy and unstructured.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I had searched for that a bit after trying to just add catno=self.metadata['catalognumber'],
to the find_releases
call in Cluster.lookup_metadata()
(which didn't work) but I couldn't find anything.
Now I tried (on master
) printing self.metadata
right before the find_releases
call and the log just reads
store: {'album': ['Do the Dive [Limited Edition]'], 'albumartist': ['Call of Artemis'], 'totaltracks': ['4'], '~totalalbumtracks': ['4']}
deleted: set()
images: ["TagCoverArtImage from '/I/Raccolte/Musica/D4DJ Collection/Call of Artemis - Do the Dive [Limited Edition] (2022) [FLAC+CUE+LOG] {BRMM-10544}/01. Do the Dive.flac' of type front"]
length: 1066065
(all 4 files have catalognumber
set)
So it doesn't seem to get populated with the catalog number.
Unless I didn't understand something and those tag shouldn't be in self.metadata
Now that I think about it tho, in the UI I can see the cluster files' metadata, I'm gonna search if I can find where it gets them from.
Update: seems like the metadata box just gets the tags by iterating through the files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think generally this tag merging for clusters makes sense, and we should do something like this for some relevant tags. It makes the clusters far more useful if we add such lookups. Whether this includes genres or not is a different question, and currently I don't have any strong opinion on this either way.
I'm a bit concerned about performance here if this runs for every individual file added. But this would need to be measured. Optimization would be to ensure this runs only after all files have been added when clustering multiple files at once (i.e. usually when creating the cluster).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit concerned about performance here if this runs for every individual file added. But this would need to be measured. Optimization would be to ensure this runs only after all files have been added when clustering multiple files at once
I was also concerned about that, but from what I saw, self.update()
gets called only once after all files have been added (after for file in files
loop) in cluster.add_files()
, so it should run only one time for each cluster (as you said it needs to be tested tho).
(i.e. usually when creating the cluster).
Presumably this code would therefore be better placed inside wherever the cluster is created.
Actually my first approach was to do this, by passing the metadata objects of the various files to FileCluster
[1] and have a FileCluster.metadata
property returning a metadata object with the most common tags, but that would require quite a bit of changes, also in the tests (as it would change the arguments of FileCluster.add()
and Cluster
's constructor).
I have pushed these changes to koraynilay:identifiers-lookup-filecluster, if you want to see them (they should work, but the tests didn't all pass and there are still some problems (see 742b9d2's commit message for an example)).
[1] like this
# in cluster.py, Cluster.cluster()
@staticmethod
def cluster(files):
# ...
for file in files:
# ...
token = tokenize(album)
if token:
# Basically pass `metadata` here
- cluster_list[token].add(album, artist or various_artists, file)
+ cluster_list[token].add(metadata, file)
Hmm, yes, you're right, it doesn't relate directly, I included it because this change would also be needed for PICARD-2793's main use case (see from kokonut191's comments on JIRA). I can remove it from the PR description tho, since it actually is a separate issue. Also thanks for reviewing this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for opening this PR. I'm however not too sure about including this in the matching algorithm, especially not with such high weights. A mismatch in cat. no. or barcode would quickly lead to no match at all. IMHO if we use those identifiers in the matching they should have a low weight, so they will be used to push the result in favor of the releases with matching identifier, but don't spoil the search result completely.
Because questions about looking up by some identifier come up from time to time, but with different identifiers such as cat. no., barcode, disc ID, ISRC etc., I had been thinking about a different approach: We could add a context menu to clusters and files for looking up specifically by certain identifiers. This could be context aware and only offer lookups if the relevant tag is present.
The lookup would do an explicit search for the identifier in question on MB. It could either match the results automatically as it is done for "Lookup" and "Scan" or maybe show a search result window.
I think this would be the more flexible approach and would better fulfill the requirement of looking up by barcode, disc ID, cat. no., ISRC etc.
|
||
if 'barcode' in self and 'barcode' in weights: | ||
score = 1.0 if self['barcode'] == release['barcode'] else 0.0 | ||
parts.append((score, weights['barcode'])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proper comparison of barcodes likely should be a bit more complex then exact string match. Because of UPC vs EAN-13 it is e.g. not uncommon that UPCs are written with a leading zero. Compare also e.g. https://github.com/kellnerd/harmony/blob/1744a1a87ee8a8f8a0a59cbf8208e8f01ee9bae0/utils/gtin.ts#L66
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I did see that when I was looking at the docs for the search API, but it didn't come to mind when writing this code.
@@ -163,6 +163,9 @@ class Metadata(MutableMapping): | |||
('totaltracks', 5), | |||
('discnumber', 5), | |||
('totaldiscs', 4), | |||
('catalognumber', 30), | |||
('barcode', 33), | |||
('asin', 32), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also believe these weights are too high. This can easily overrule better matches based on title and artist. It's also one reason why I think we should go a different route.
metadata_counters[tag] = Counter() | ||
metadata_counters[tag][value] += 1 | ||
for tag in metadata_counters: | ||
self.metadata[tag] = metadata_counters[tag].most_common(1)[0][0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think generally this tag merging for clusters makes sense, and we should do something like this for some relevant tags. It makes the clusters far more useful if we add such lookups. Whether this includes genres or not is a different question, and currently I don't have any strong opinion on this either way.
I'm a bit concerned about performance here if this runs for every individual file added. But this would need to be measured. Optimization would be to ensure this runs only after all files have been added when clustering multiple files at once (i.e. usually when creating the cluster).
Presumably this code would therefore be better placed inside wherever the cluster is created. |
Thanks for reviewing it. Yeah that makes sense, unfortunately it didn't occur to me that there could be files tagged with a wrong identifier, as while trying to tag my files I was just thinking "if the catno matches, that means it's correct, whatever the other tags may be".
I agree this would be the better solution, so even if the matching doesn't automatically find the right release, it can still be found explicitly.
I'd prefer for it to match the results automatically, so that it can be done for a lot of clusters/files (as in my case, I could just select all ~80 clusters and press the "Lookup by catalogue number")
IMO it would be better to always offer those lookups, even if the tag isn't present (or show it if it's present in at least one file), because consider if the user has e.g.
selecting everything and pressing "lookup by catno" should match 1., 3. and 4., then pressing the normal lookup should match 2. and 5.. |
I added PICARD-2961 for the feature of a "Lookup by..." menu. This could work well for many similar feature requests for lookups, I linked the relevant tickets there. Let us know if this is something you would like to work on. If not I might actually look into this sometime in the future because I personally would like to have the lookup from disc IDs in tags handled by this. But currently this will take quite some time until I can get to this.
That's indeed always a tricky part with different data quality. A lot of Picard's matching is build on the fact that existing data might be bad. This is only a main reason why a lot of identifiers are not directly used. Unfortunately we don't really have a good solution for this. All of the matching is usually done with local testing. I have a my own collection and a folder of test stuff, but it is by no means comprehensive. What we would really need is all kinds of test collection data of varying quality, together with the intended result. Then those weights could be tuned to give best results. This is likely a project we should tackle in some way, but this is really a separate topic on its own.
Yes, I tend to agree.
Also makes sense. |
Unfortunately I have basically never worked with GUIs and I don't have a lot of time rn to understand how everything works, but if I start working on it I will comment here to let you guys know.
Maybe we can add a prompt when using "Lookup"/"Lookup by..." (or global option in the settings) like "trust these files are tagged correctly" and then use exact[1] tag (also usable for title/artist/album) matching on them. [1] would be better if it could also work with romanized tags (e.g. Russian or Japanese) and match the correct (prioritize romanized, but match also the non-romanized version), but I'm not sure how that would work (something like the artist translation field?) |
I'm +1 for generic lookup patterns since that's what I looked for in my ticket (PICARD-2793). Maybe even generic weights.
I would prefer it being a On the question of performance, I don't think it's any slower in the grand scheme of things. Or at least, that's what happens in my use case when I tried with a few thousand files. The slowest part has always been the network part/query with musicbrainz for me, and it easily outweighs anything else. |
Summary
Problem
When using the Lookup (Ctrl+L) function on clusters, Picard won't try to match using available release identifiers (like catalogue number or barcode), this PR aims to fix this.
The problem still applies to unclustered files (I haven't checked to them yet).
I searched online but couldn't find anything similar to this (except for the mentioned JIRA tickets) so I apologize if there already is something similar.
Solution
Make a dictionary containing a
Counter
for each tag, iterating through it and usingmost_common()
gives us the most common value in all files in the cluster for each tag specified intags_to_update
(Method shamelessly copied fromFileCluster
'sartist
andtitle
). This way we can populate the cluster's metadata with additional useful info.This PR also adds those identifiers to
Metadata.compare()
andMetadata.compare_to_release_parts()
(using their respective weights defined inCLUSTER_COMPARISON_WEIGHTS
)Action
Additional actions required: