Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set minimal metadata also for empty bed datasets #17586

Merged
merged 2 commits into from
May 20, 2024

Conversation

wm75
Copy link
Contributor

@wm75 wm75 commented Mar 4, 2024

Currently, set_metadata does nothing for empty bed datasets, however, for tools like gops_intersect that use metadata_source="input1" for their output, this means that data_lines gets inherited from the input, which subsequently shows up in the blurb of the dataset and is very confusing for users, who see that Galaxy thinks there is data there, but then they can't display it.

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.

  • This is a refactoring of components with existing test coverage.

  • Instructions for manual testing are as follows:
    using built-in tools

    1. Create an interval with the "Create single interval" tool (e.g. with default settings)
    2. Use " Select lines that match an expression" with settings that don't match the interval (e.g. change "Matching" to "NOT Matching"

    The result will be an empty bed that claims to have 1 region without the patch and 0 regions (correct answer) with it.

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@github-actions github-actions bot added this to the 24.1 milestone Mar 4, 2024
@bernt-matthias
Copy link
Contributor

Seems to make total sense to me.

Would you then say that metadata_source only makes sense if the datatype does not set metadata (or may just set part of it)?

Since for text based datatypes this will always copy the number of lines one should make sure in a tool test that the number of lines is set correctly, isn't it?

Was also wondering about the 2nd sentence of the docs.

This copies the metadata information from the tool’s input dataset. This is particularly useful for interval data types where the order of the columns is not set.

So, maybe we can use this opportunity to improve the docs as well. Maybe add the info that built in metadata setting is still running (which was surprising for me.. but makes total sense).

Not sure if this should be a Bugfix and target an earlier branch.

@wm75
Copy link
Contributor Author

wm75 commented Mar 5, 2024

Would you then say that metadata_source only makes sense if the datatype does not set metadata (or may just set part of it)?

That's a bit difficult to do in practice. I guess one of the main reasons to use metadata_source is when a tool accepts different input formats and produces output in the same format as the input. With gops_intersect, for example this is exactly what's happening: the tool accepts "interval,gff" (meaning bed indirectly via interval), but intends to generate output in the exact format of the input. So you need metadata_source in the tool, but you also want to update the metadata with the set_metadata method of the datatype to reflect the output's actual content.

Since for text based datatypes this will always copy the number of lines one should make sure in a tool test that the number of lines is set correctly, isn't it?

That's a very reasonable thing to do it seems, yes 👍

@wm75
Copy link
Contributor Author

wm75 commented Mar 5, 2024

I guess one of the main reasons to use metadata_source is when a tool accepts different input formats and produces output in the same format as the input.

Ah wait, that's what format_source is for, of course. Hmm ...

@wm75
Copy link
Contributor Author

wm75 commented Mar 5, 2024

So I think metadata_source is used here for the case that input is of type interval, but doesn't have a header line that allows column auto-assignment, in which case the column assignment is taken from the input.
This happens when neither:

if first_line_is_header or line[0] == "#":
self.init_meta(dataset)
line = line.strip("#")
elems = line.split("\t")
for meta_name, header_list in alias_spec.items():
for header_val in header_list:
if header_val in elems:
# found highest priority header to meta_name
setattr(dataset.metadata, meta_name, elems.index(header_val) + 1)
break # next meta_name
break # Our metadata is set, so break out of the outer loop

nor:
else:
# Header lines in Interval files are optional. For example, BED is Interval but has no header.
# We'll make a best guess at the location of the metadata columns.
elems = line.split("\t")
if len(elems) > 2:
if overwrite or not dataset.metadata.element_is_set("chromCol"):
dataset.metadata.chromCol = 1
try:
int(elems[1])
if overwrite or not dataset.metadata.element_is_set("startCol"):
dataset.metadata.startCol = 2
except Exception:
pass # Metadata default will be used
try:
int(elems[2])
if overwrite or not dataset.metadata.element_is_set("endCol"):
dataset.metadata.endCol = 3

succeed.

So, in a sense, the documentation:

This is particularly useful for interval data types where the order of the columns is not set.

is correct. It's just complicated to understand what's going on.

@bernt-matthias
Copy link
Contributor

Maybe a good way to put it is that the metadata_source is the provider of the default values that may be overwritten by set_metadata?

@wm75
Copy link
Contributor Author

wm75 commented Mar 5, 2024

@bernt-matthias like this?

@mvdbeek mvdbeek self-requested a review May 14, 2024 14:28
@jdavcs jdavcs self-requested a review May 20, 2024 15:21
@mvdbeek mvdbeek merged commit 3f82a74 into galaxyproject:dev May 20, 2024
43 of 52 checks passed
@galaxyproject galaxyproject deleted a comment from github-actions bot May 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants