
perfect 5th detection in mir_eval.key is asymmetric #337

Open
iansimon opened this issue Mar 2, 2021 · 7 comments
iansimon commented Mar 2, 2021

https://github.com/craffel/mir_eval/blob/576aad4e0b5931e7c697c078a1153c99b885c64f/mir_eval/key.py#L145

Is this intended to be asymmetric? E.g. if estimated = G major and reference = C major the score is 0.5, but if estimated = C major and reference = G major the score is zero.
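
For concreteness, a minimal reproduction of the behavior described above, assuming the `mir_eval.key.weighted_score(reference_key, estimated_key)` signature; the commented values are the scores reported in this issue:

```python
import mir_eval

# Reference C major, estimate a perfect fifth above (G major):
# credited as a "fifth error".
print(mir_eval.key.weighted_score('C major', 'G major'))  # 0.5

# Reference G major, estimate a perfect fifth below (C major):
# scored as a complete miss.
print(mir_eval.key.weighted_score('G major', 'C major'))  # 0.0
```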

craffel commented Mar 2, 2021

I do not know music theory :( @bmcfee ?

bmcfee commented Mar 2, 2021

From reading the comments, I think this is behaving to spec / consistent with the mirex implementation.

I also think it's weird, but 🤷

bmcfee commented Mar 2, 2021

To follow up on this a bit, yes, it's intended to be asymmetric, and the docstring is unambiguous about the behavior:

https://github.com/craffel/mir_eval/blob/576aad4e0b5931e7c697c078a1153c99b885c64f/mir_eval/key.py#L99-L115

I suppose this was originally intended to break symmetry between 5th and 4th estimation errors by only considering movement in one direction from the root. I did try to dig into this more in the mirex wiki, but there's not much provenance around how this metric came about. Maybe @jpauwels has some insight, as I think he was the most recent task captain for key detection?
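
To make the direction of the asymmetry concrete, the check being described amounts to something like the following sketch in pitch-class arithmetic (illustrative only, not the library's literal code):

```python
def is_fifth_error_asymmetric(reference_tonic, estimated_tonic):
    """Count a 'fifth error' only when the estimated tonic is a perfect
    fifth ABOVE the reference tonic (7 semitones up, modulo 12).
    Tonics are pitch classes 0-11 (C=0); the modes are assumed equal."""
    return (estimated_tonic - reference_tonic) % 12 == 7
```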

jpauwels commented Mar 8, 2021

Hi all, as far as I know, it was never intended to make a distinction between perfect fifth and perfect fourth errors (as long as the mode is the same). It certainly isn't the case in the current implementation used for MIREX. That doesn't mean it's impossible that there is some ambiguously worded description somewhere on the wiki. Are you referring to the following table: https://www.music-ir.org/mirex/wiki/2005:Audio_and_Symbolic_Key (copied over every year since)? In that case, it should be read as perfect fifth either way, not just perfect fifth above.

The canonical reference I use for this measure is Emilia Gomez's PhD thesis (section 4.3.1.2), but I suspect the exact weighting was determined collaboratively with other participants of MIREX 2005. Between us, I've never been a fan of this particular measure, too many magic numbers in there. I always suggest listing the error categories separately and did so in my own papers, but I still include this combination measure in MIREX for historical comparisons.
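
In other words, the intended relation is symmetric: tonics a perfect fifth apart in either direction (equivalently, a fifth up or a fourth up) should both receive the 0.5 credit. As a sketch alongside the one above (again illustrative, not the MIREX code):

```python
def is_fifth_error_symmetric(reference_tonic, estimated_tonic):
    """Count a 'fifth error' when the tonics are a perfect fifth apart in
    either direction: 7 semitones up or down (i.e. +7 or +5 modulo 12).
    Tonics are pitch classes 0-11 (C=0); the modes are assumed equal."""
    return (estimated_tonic - reference_tonic) % 12 in (5, 7)
```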

stefan-baumann commented Jul 23, 2021

Seconding what @jpauwels said; he actually stated this in the 2017 MIREX results, too:

The NEMA system was retired this year, since a bug has been found in the calculation of the results. Keys with tonics related by a fifth and the same mode (a.k.a. adjacent keys) are supposed to get a score of 0.5, but only ascending fifths (going from ground-truth to estimation) were counted, not descending ones. It has been brought to my attention that the description of the measure on the wiki has been ambiguous for years, and probably the NEMA implementer got confused by this. However, the intention has always been to count ascending and descending fifth (or fourth) relationships between the tonics (in my humble opinion).

This is taken directly from the 2017 MIREX Key Detection results (https://www.music-ir.org/mirex/wiki/2017:Audio_Key_Detection_Results).

So apparently the "official" behavior changed from only counting ascending ones to also counting descending ones in 2017. It would be really useful if both approaches were implemented, maybe even adding a warning to the "old" approach that notifies users that they are using a metric that might not correspond with the one they're expecting.

I personally just spent multiple days hunting for a bug in my own code while reproducing prior MIREX results, only to find that the cause was the one place I didn't expect it (who would expect the MIR algorithm evaluation library to output different results than the MIR algorithm evaluation campaign?). Adding a warning to the "old" approach could be really useful for cases like mine, without breaking prior code.

I'll happily submit a PR adding a second method (or an additional argument) and a warning in case the "old" method is used, if this is the direction the maintainers want to go with this.
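
For concreteness, one possible shape for such an opt-in (hypothetical wrapper and argument name, not necessarily what an eventual PR would look like):

```python
import warnings

import mir_eval


def weighted_score_compat(reference_key, estimated_key,
                          allow_descending_fifths=True):
    """Hypothetical wrapper illustrating the proposed option.

    With allow_descending_fifths=True, an estimate a perfect fifth below
    the reference (same mode) also receives the 0.5 credit; with False,
    the original asymmetric (pre-2017 MIREX) behavior is kept and a
    warning is emitted so users know which convention they are getting.
    """
    score = mir_eval.key.weighted_score(reference_key, estimated_key)
    if not allow_descending_fifths:
        warnings.warn('Using the asymmetric (pre-2017 MIREX) fifth scoring; '
                      'descending fifths are scored as 0.')
        return score
    # A descending fifth scores 0 in the asymmetric metric but 0.5 when the
    # arguments are swapped, so credit that case explicitly.
    if score == 0.0 and mir_eval.key.weighted_score(estimated_key,
                                                    reference_key) == 0.5:
        return 0.5
    return score
```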

bmcfee commented Jul 23, 2021

So apparently the "official" behavior changed from only counting ascending ones to also counting descending ones in 2017. It would be really useful if both approaches were implemented, maybe even adding a warning to the "old" approach that notifies users that they are using a metric that might not correspond with the one they're expecting.

I would support this change. AFAIK the mir_eval implementation was based on mirex in ~2014, so if mirex changed later, it's not surprising that we got out of sync.

I'll happily submit a PR adding a second method (or an additional argument) and a warning in case the "old" method is used, if this is the direction the maintainers want to go with this.

🚀 please do!

stefan-baumann commented Jul 23, 2021

@bmcfee Here you go - I implemented it the way I would intuitively have done it and created a draft PR for it at #339 (if the feedback on the API choice is positive, I'll implement the tests and mark it as ready).
