Replies: 7 comments
-
The other day we were talking about the difference between global and local key algorithms. The algorithms are different, but we use the same keyData object for both. We could develop an easy way to create a multi-dataset, but vocabularies are sometimes very dataset-specific (key vocabularies, for example).
Should we standardize the data vocabularies? Maybe, to begin with, we can add an id in the docstring describing the task and vocabulary. That way we know what we have.
I think it should be task-specific, or we could even support both options. We could also extend mir_eval to accept the JAMS format, but I don't think JAMS is widely used in the community (MIREX is). 15 years ago datasets were in RTF format because everybody thought it was the future... Let's be flexible with data types... Moreover, we could also add support for muda or mir_eval formats hahaha. There are more related projects we should think about. Well, I don't know much about any of this, but if I can help with some implementation, let me know.
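For instance, the tag could look something like this (just a rough sketch; the `task`/`vocabulary` field names are placeholders, not an existing mirdata convention):

```python
"""Loader for an example key dataset.

task: key_estimation (global)
vocabulary: 24 major/minor keys, e.g. "C# minor"
"""
```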
-
hi! this is a great idea!
more robust description: I think to_jams is a great example regarding standardisation. Maybe something can be done using this abstraction?
more liberal description: what would we need this for? If it's training DL models on various datasets, we can just do it with the current framework - see below.
slightly related to this topic: at some point we want to create at MTG a dataset explorer where you could search datasets for a given tag, e.g. beat_detection would give you all datasets with annotated beats.
I think we should create examples of loaders but not explicit loader classes inside the datasets. tensorflow changes so fast that we would need to keep up and adapt every 6 months. Something like:

```python
import tensorflow as tf

def gen():
    # yield examples (e.g. (audio, annotation) tuples) from one dataset
    ...

dataset1 = tf.data.Dataset.from_generator(gen, output_types=(tf.float32, tf.float32))
# add here other datasets
datasets = [dataset1]

# we can use sample_from_datasets
dataset = tf.data.experimental.sample_from_datasets(datasets)
# or we can use interleave for multiple generators
```

Some related idea: from my experience, it would be more useful to provide a tfrecords export example (chunks of tfrecords) for datasets which have a large number of files (which is a training bottleneck with a simple generator).
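A rough sketch of what that tfrecords export could look like (the `track_ids` list, the `load_track` helper, and the file naming are made up for illustration):

```python
import tensorflow as tf

def serialize_example(audio, annotation):
    # pack one (audio, annotation) pair into a tf.train.Example
    feature = {
        "audio": tf.train.Feature(float_list=tf.train.FloatList(value=audio)),
        "annotation": tf.train.Feature(float_list=tf.train.FloatList(value=annotation)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# write the dataset in chunks of e.g. 256 tracks per tfrecord file
chunk_size = 256
for shard, start in enumerate(range(0, len(track_ids), chunk_size)):
    with tf.io.TFRecordWriter(f"dataset-{shard:03d}.tfrecord") as writer:
        for track_id in track_ids[start : start + chunk_size]:
            audio, annotation = load_track(track_id)  # hypothetical loader returning two 1-d arrays
            writer.write(serialize_example(audio, annotation))
```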
I think we should provide track names or tensors/arrays. Then, the sampling and data splicing may be done within the framework of choice. To help with this we can provide examples. I have this code for splicing data loaded from tfrecords:
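A minimal sketch of that kind of splicing, assuming a parsed tfrecords dataset of aligned (audio, annotation) tensor pairs called `dataset` and a fixed chunk length (the names and chunk size here are illustrative, not the original code):

```python
import tensorflow as tf

CHUNK = 16000  # illustrative chunk length in samples

def splice(audio, annotation):
    # cut a full-length track into equal-length, non-overlapping chunks,
    # assuming audio and annotation are aligned 1-d tensors of the same length
    n_chunks = tf.shape(audio)[0] // CHUNK
    audio = tf.reshape(audio[: n_chunks * CHUNK], (n_chunks, CHUNK))
    annotation = tf.reshape(annotation[: n_chunks * CHUNK], (n_chunks, CHUNK))
    return tf.data.Dataset.from_tensor_slices((audio, annotation))

# `dataset` would be e.g. tf.data.TFRecordDataset(files).map(parse_fn)
spliced = dataset.flat_map(splice)
```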
-
To summarize some of the offline discussion - it sounds like the plan for now is:
-
@PRamoneda do you want to take the lead on this one after #336 is finished?
-
#176 is very relevant for this!
-
Other relevant ones: #223 and #227. I am now wondering whether we should write search functions first based on these ideas, and then write the generators. That would simplify the generators' code massively and also provide nice functionality for users. Thoughts @rabitt @nkundiushuti @genisplaja @PRamoneda?
-
Following up on this, PR #445 is a good starting point for these exploratory tools. We should decide whether we want to move forward with this functionality first, and what we want it to look like. I like the idea of having some search functions that allow exploring multiple datasets based on annotations or metadata, with the task-specific multi-dataset loaders written on top of those. I also like the idea of starting a hand-made prototype, with, say, melody as the goal task and a few search functions, to see how it goes. I'm happy to start drafting that with @PRamoneda, for instance. Thoughts?
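To make the idea concrete, a rough sketch of what one of those search functions could look like (the function itself is hypothetical, and the `_track_class` attribute is an assumption about the loader internals):

```python
import mirdata

def search_datasets(annotation_type):
    """Return names of datasets whose tracks have a given annotation attribute, e.g. "melody" or "beats"."""
    matches = []
    for name in mirdata.list_datasets():
        dataset = mirdata.initialize(name)
        # assumption: the Dataset object exposes its Track class as _track_class
        if hasattr(dataset._track_class, annotation_type):
            matches.append(name)
    return matches

# e.g. search_datasets("melody") could be the entry point for a melody multi-dataset loader
```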
-
(Thanks @gabolsgabs for this idea!)
We plan to add task-specific multi-dataset loaders to help merge multiple datasets into one. A few things to discuss:
1. Do we want to explicitly create loaders for both frameworks? Or can we create something general (e.g. a generator) that can be input to pytorch/tensorflow data loaders in a one-liner? For example, something that yields (audio, annotation) tuples (see the sketch after this list).
2. Do we yield full tracks, or do we provide splicing functionality, e.g. to make equal-length audio-annotation chunks?
3. What format should the annotations come in? For example, if you want to use the muda package for data augmentation, it's best if the loaders output jams object annotation data. For mir_eval, it needs to be in a mir_eval/mirex compatible format. For other use cases, you may want a time series, a piano-roll matrix, etc. Should we standardize this choice for all tasks, or make it task-specific?
For this reason, for now we were thinking of creating the task-specific loaders semi-manually (rather than searching all datasets for a common attribute name/object type), and generalizing later when possible, while keeping the same API. Thoughts?
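For the first point, a rough sketch of the "something general" option, assuming already-loaded mirdata datasets and a task-specific annotation attribute (here "melody", purely for illustration):

```python
def multi_dataset_generator(datasets, annotation_attr="melody"):
    """Yield (audio, annotation) tuples across several loaded mirdata datasets."""
    for dataset in datasets:
        for track_id, track in dataset.load_tracks().items():
            audio, sr = track.audio
            yield audio, getattr(track, annotation_attr)

# tensorflow one-liner (output types/shapes omitted for brevity):
#   tf.data.Dataset.from_generator(lambda: multi_dataset_generator(datasets), ...)
# pytorch: wrap the generator in a torch.utils.data.IterableDataset
```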