Replies: 7 comments
-
The other day we were talking about the difference between global and local key algorithms. The algorithms are different, but we use the same keyData object for both. We could develop an easy way to create a multi-dataset, but vocabularies are sometimes very dataset-specific (key vocabularies, for example).
Should we standardize the data vocabularies? Maybe, to begin with, we can add an id in the docstring describing the task and vocabulary. That way we know what we have.
I think it should be task-specific, or we could even support both options. We could also extend mir_eval to accept the JAMS format, but I don't think JAMS is widely used in the community (MIREX is). 15 years ago datasets were in RTF format because everybody thought it was the future... Let's be flexible with data types... Moreover, we could also add support for muda or mir_eval formats hahaha. There are more related projects we should think about. Well, I don't know much about any of this, but if I can help with some implementation, let me know.
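For instance, the tag could look something like this (just a rough sketch; the `task`/`vocabulary` field names are placeholders, not an existing mirdata convention):

```python
"""Loader for an example key dataset.

task: key_estimation (global)
vocabulary: 24 major/minor keys, e.g. "C# minor"
"""
```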
-
hi! this is a great idea!
more robust description: I think to_jams is a great example regarding standardisation. Maybe something can be done using this abstraction?
more liberal description: what would we need this for? If it's training DL models on various datasets, we can just do it with the current framework - see below.
slightly related to this topic: at some point we want to create at MTG a dataset explorer where you could search datasets for a given tag, e.g. beat_detection would give you all datasets with annotated beats.
I think we should create examples of loaders but not explicit loader classes inside the datasets. tensorflow changes so fast that we would need to keep up and adapt every 6 months. Something like:

```python
import tensorflow as tf

def gen():
    # yield examples (e.g. (audio, annotation) tuples) from one dataset
    ...

dataset1 = tf.data.Dataset.from_generator(gen, output_types=(tf.float32, tf.float32))
# add here other datasets
datasets = [dataset1]

# we can use sample_from_datasets
dataset = tf.data.experimental.sample_from_datasets(datasets)
# or we can use interleave for multiple generators
```

Some related idea: from my experience, it would be more useful to provide a tfrecords export example (chunks of tfrecords) for datasets which have a large number of files (which is a training bottleneck with a simple generator).
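A rough sketch of what that tfrecords export could look like (the `track_ids` list, the `load_track` helper, and the file naming are made up for illustration):

```python
import tensorflow as tf

def serialize_example(audio, annotation):
    # pack one (audio, annotation) pair into a tf.train.Example
    feature = {
        "audio": tf.train.Feature(float_list=tf.train.FloatList(value=audio)),
        "annotation": tf.train.Feature(float_list=tf.train.FloatList(value=annotation)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# write the dataset in chunks of e.g. 256 tracks per tfrecord file
chunk_size = 256
for shard, start in enumerate(range(0, len(track_ids), chunk_size)):
    with tf.io.TFRecordWriter(f"dataset-{shard:03d}.tfrecord") as writer:
        for track_id in track_ids[start : start + chunk_size]:
            audio, annotation = load_track(track_id)  # hypothetical loader returning two 1-d arrays
            writer.write(serialize_example(audio, annotation))
```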
I think we should provide track names or tensors/arrays. Then, the sampling and data splicing may be done within the framework of choice. To help with this we can provide examples. I have this code for splicing data loaded from tfrecords:
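A minimal sketch of that kind of splicing, assuming a parsed tfrecords dataset of aligned (audio, annotation) tensor pairs called `dataset` and a fixed chunk length (the names and chunk size here are illustrative, not the original code):

```python
import tensorflow as tf

CHUNK = 16000  # illustrative chunk length in samples

def splice(audio, annotation):
    # cut a full-length track into equal-length, non-overlapping chunks,
    # assuming audio and annotation are aligned 1-d tensors of the same length
    n_chunks = tf.shape(audio)[0] // CHUNK
    audio = tf.reshape(audio[: n_chunks * CHUNK], (n_chunks, CHUNK))
    annotation = tf.reshape(annotation[: n_chunks * CHUNK], (n_chunks, CHUNK))
    return tf.data.Dataset.from_tensor_slices((audio, annotation))

# `dataset` would be e.g. tf.data.TFRecordDataset(files).map(parse_fn)
spliced = dataset.flat_map(splice)
```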
-
To summarize some of the offline discussion - it sounds like the plan for now is:
-
@PRamoneda do you want to take the lead on this one after #336 is finished?
-
#176 is very relevant for this!
-
Other relevant ones: #223 and #227. I am now wondering whether we should write search functions first based on these ideas, and then write the generators. That would simplify the generators' code massively and also provide nice functionality for users. Thoughts @rabitt @nkundiushuti @genisplaja @PRamoneda?
-
Following up on this, PR #445 is a good starting point for these exploratory tools. We should decide whether we want to move forward with this functionality first, and what we want it to look like. I like the idea of having some search functions that allow exploring multiple datasets based on annotations or metadata, with the task-specific multi-dataset loaders written on top of those. I also like the idea of starting a hand-made prototype, with, say, melody as the goal task and a few search functions, to see how it goes. I'm happy to start drafting that with @PRamoneda, for instance. Thoughts?
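To make the idea concrete, a rough sketch of what one of those search functions could look like (the function itself is hypothetical, and the `_track_class` attribute is an assumption about the loader internals):

```python
import mirdata

def search_datasets(annotation_type):
    """Return names of datasets whose tracks have a given annotation attribute, e.g. "melody" or "beats"."""
    matches = []
    for name in mirdata.list_datasets():
        dataset = mirdata.initialize(name)
        # assumption: the Dataset object exposes its Track class as _track_class
        if hasattr(dataset._track_class, annotation_type):
            matches.append(name)
    return matches

# e.g. search_datasets("melody") could be the entry point for a melody multi-dataset loader
```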
-
(Thanks @gabolsgabs for this idea!)
We plan to add task-specific multi-dataset loaders to help merge multiple datasets into one. A few things to discuss:
1. Do we want to explicitly create loaders for both frameworks? Or can we create something general (e.g. a generator) that can be input to pytorch/tensorflow data loaders in a one-liner? For example, something that yields (audio, annotation) tuples (see the sketch after this list).
2. Do we yield full tracks, or do we provide splicing functionality, e.g. to make equal-length audio-annotation chunks?
3. What format should the annotations come in? For example, if you want to use the muda package for data augmentation, it's best if the loaders output jams object annotation data. For mir_eval, it needs to be in a mir_eval/mirex compatible format. For other use cases, you may want a time series, a piano-roll matrix, etc. Should we standardize this choice for all tasks, or make it task-specific?
For this reason, for now we were thinking of creating the task-specific loaders semi-manually (rather than searching all datasets for a common attribute name/object type), and generalizing later when possible, while keeping the same API. Thoughts?
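For the first point, a rough sketch of the "something general" option, assuming already-loaded mirdata datasets and a task-specific annotation attribute (here "melody", purely for illustration):

```python
def multi_dataset_generator(datasets, annotation_attr="melody"):
    """Yield (audio, annotation) tuples across several loaded mirdata datasets."""
    for dataset in datasets:
        for track_id, track in dataset.load_tracks().items():
            audio, sr = track.audio
            yield audio, getattr(track, annotation_attr)

# tensorflow one-liner (output types/shapes omitted for brevity):
#   tf.data.Dataset.from_generator(lambda: multi_dataset_generator(datasets), ...)
# pytorch: wrap the generator in a torch.utils.data.IterableDataset
```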