Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Output list type field as array of string in Yaml #107

Closed
haowang-bioinfo opened this issue May 15, 2018 · 9 comments
Closed

Output list type field as array of string in Yaml #107

haowang-bioinfo opened this issue May 15, 2018 · 9 comments
Assignees
Labels
discussion Not yet settled whether change in code is required.

Comments

@haowang-bioinfo
Copy link
Member

haowang-bioinfo commented May 15, 2018

The JSON/Yaml formats appear to be very useful in GEM curation, exchange and other things, among which one application is for visualization purpose. Now Yaml has been used as input for Metabolic Atlas website development.

However, the current writeYaml function treats the 'list' type fields (e.g. eccodes,subSystems) differently. It outputs the value as string when there is only one element, but output into array of string when there are multiple elements. This inconsistence causes problems when loading the file to public Yaml parser (such as in python), because the information cannot be correctly extracted if some elements have on value and others have multiple. Now this issue has been addressed in this branch: 81be856.

@haowang-bioinfo haowang-bioinfo self-assigned this May 15, 2018
@haowang-bioinfo haowang-bioinfo added the discussion Not yet settled whether change in code is required. label May 15, 2018
@BenjaSanchez
Copy link
Contributor

@Hao-Chalmers

  1. When writing the function I followed the cobrapy standard, and they actually also have this distinction when it's only 1 field or several: see example file in issue support for export to cobrapy-compatible yaml format #77, for instance reactions r_2246 (one ec-code) and r_2248 (2 ec-codes). So if we do what you suggest we would loose compatibility with them, however eventually we want to allow cobrapy users to contribute to our models.
  2. In our repos we should have .yml files as succinct as possible, so they can show the most amount of changes with the least amount of lines. This change would go against that principle.

So instead I would suggest a separate function that can convert the .yml to a parser-compatible version.

@edkerk
Copy link
Member

edkerk commented May 15, 2018

@Hao-Chalmers

As @BenjaSanchez said, the default writeYaml output should not be changed.

As alternative, include a parameter in the writeYaml function, e.g. extended or parseable, and set this to true if you want your precise format. The default for this parameter should be false, so the function continues to function as intended. If this parseable format requires a lot of changes, it might indeed make more sense to write a specific function for this purpose, either as part of RAVEN or as part of the relevant repository.

edit: renamed suggested parameter compact to extended as it's more accurate. parseable would be prefered though`

@haowang-bioinfo
Copy link
Member Author

haowang-bioinfo commented May 15, 2018

@BenjaSanchez, @edkerk

Here are my comments:

  1. The propsed change in this branch keeps the yml file in succinct style without introducing (many) addtional lines, and doesn't go aginst the compact principle.

  2. This function attemps to follow the JSON/Yaml scheme proposed by cobrapy. But any scheme should be open for adjustments toward better fit of practical needs, which is continuously changing. Actually cobrapy doesn't clearly specify this condition.

@BenjaSanchez
Copy link
Contributor

@Hao-Chalmers it would definetively include extra lines, for instance what is now:

- ec-code: 1.3.3.6

would then be:

- ec-code:
  - 1.3.3.6

@haowang-bioinfo
Copy link
Member Author

haowang-bioinfo commented May 15, 2018

True, it does. There won't be many though :-)

@pecholleyc
Copy link

@BenjaSanchez I understand you want to follow the cobrapy standard but the origin of this issue was because an inconsistency with the "subsystem" field in a reaction. It is defined as type string in JSON/Yaml cobrapy scheme whereas models like YEAST may have reactions with list of subsystems (which can makes sense).

The problem with annotation fields like 'ec-code' is different. I am not sure you are going to lose compatibility with cobrapy since annotation is defined as 'object'. IMO if multiple values for an annotation field is expected it should always be defined as list, it will simplify automatic parsing.

Moreover, if extra lines is a problem, lists can also be represented in an abbreviated form:

- ec-code:
  - 1.3.3.6

can be:

- ec-code: [1.3.3.6]

I also suggest to systematically double-quote strings for some of the fields, the current YAML export of HMR model cannot be properly parsed because of some metabolite's names includes YAML syntax e.g. '-['.

@edkerk
Copy link
Member

edkerk commented May 16, 2018

@pecholleyc
The primary aim of this function is compatibility with cobrapy (and of course if changes are made there, then we should update writeYaml). So if subSystems are currently not written identical to cobrapy output (and please verify this with actual output, because I have seen for instance for cobra toolbox that specified schemes are not always up-to-date), then please fix that specifically.

Also, as mentioned above, there is no problem to have an alternative output from writeYaml, but let's keep the default output cobrapy compatible. As suggested, you can specify a parameter parseable (with a default of 'false') or even cobrapy (with a default of 'true') to specify the exact formatting of the Yaml file. This alternative format could for instance include double-quoted strings. I just don't see why this would have to break the existing (default) output format?

@BenjaSanchez
Copy link
Contributor

@Hao-Chalmers actually they would be many new lines: the .yml file of yeast-GEM saved with your proposed change would be 82,202 lines, compared to 74,394 lines as it is currently stored.
And completely agree with @edkerk, we can have secondary ways of outputting the model but the way to store it in the repo should be the most compact option.

@haowang-bioinfo
Copy link
Member Author

haowang-bioinfo commented May 16, 2018

@edkerk A similar issue has been proposed in cobrapy repo for adjustment in implementation. Let's see how it proceeds.

@BenjaSanchez There won't be much differences with whether 82,202 or 74,394 lines, because both could not be managed by human eyes. And there would be no difference for digital storage.

The optional parsable output could be a solution, if it turned out to be optimal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion Not yet settled whether change in code is required.
Projects
None yet
Development

No branches or pull requests

4 participants