PINT's parsing of tim files needs to be revisited #1341

Open
aarchiba opened this issue Jul 13, 2022 · 2 comments

Comments

@aarchiba
Contributor

A number of issues and PRs have arisen suggesting that PINT's parsing of tim files should be revisited to ensure that we are parsing these files in an appropriate way. There is considerable discussion but it is spread over multiple issues and PRs, so I want to gather it here in hopes that we can establish what PINT's parsing should do, to be implemented later.

A summary of the problem

There are a number of formats for .tim file entries:

  • Parkes, Princeton, and ITOA. These are FORTRAN-style text formats with fixed columns where various things appear; these are mostly fairly easy to distinguish from one another although some nuance is necessary. White space between columns may or may not exist. These do not support custom flags.
  • Tempo2. This is a C/Python-style free-form text format with no fixed columns where fields are delimited by whitespace, and where the first field can and does contain more or less arbitrary non-whitespace text (often a filename). These support custom flags. By specification and custom, the presence of such TOAs is signalled by a command (see below) "FORMAT 1".
  • Commands. These include "FORMAT 1" to signal the presence of tempo2-format TOAs, "JUMP" to introduce a(n old-style) JUMP, "TIME" to signal that some TOAs need adjustment, and "INCLUDE" to signal that additional TOAs should be read from another file.
  • Comments. These are conventionally lines starting with "#" or "C" (though the latter can easily conflict with tempo2-format fields). By common custom, they may be created by prepending "C " to a TOA that has been excised; scripts to programmatically generate or recover these TOAs are common.
  • Informal comments. Traditionally TEMPO or TEMPO2 silently ignore any line they don't recognize, treating these as comments.
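
To make the categories above concrete, here is a minimal sketch of a first-pass line classifier. It is not PINT's actual code: `classify_line` and the `COMMANDS` tuple are hypothetical names, only the four commands mentioned above are recognized, and no attempt is made to tell the fixed-column formats apart from tempo2 (those all come back as "toa_candidate").

```python
# Only the commands named in this issue; a real reader would know more.
COMMANDS = ("FORMAT", "JUMP", "TIME", "INCLUDE")


def classify_line(line: str) -> str:
    """Return a coarse category for one .tim line (hypothetical helper)."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    # Conventional comment markers: "#" or a leading "C " prefix.
    if line.startswith("#") or line.startswith("C "):
        return "comment"
    # Commands are recognized by their first whitespace-delimited word.
    if stripped.split()[0].upper() in COMMANDS:
        return "command"
    # Everything else might be a TOA in any format, or an informal comment.
    return "toa_candidate"


for text in ["FORMAT 1", "# note", "C 1400.0 55000.123 1.0 gbt",
             "a.ar 1400.0 55000.123 1.0 gbt"]:
    print(f"{classify_line(text):14s} {text}")
```

Note that the third example line is classified as a comment even though it could equally be a tempo2-format TOA whose first field happens to be "C"; the classifier has no way to tell an excised TOA from a live one.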

The problem, or at least a problem, is that "in the wild" some files exist that intermix these different kinds of entry in a variety of ways. The flexibility of the tempo2 format means it can easily be confused with Parkes, Princeton, and ITOA TOAs. The flexibility of comment format also complicates parsing, and the existence of informal comments makes error reporting much more difficult.

How do programs address this?

How should we change PINT's parsing to improve this situation?

Suggestions that have been floated:

  • Enforce that all lines after FORMAT 1 in a given file are treated as tempo2-format. (Do we allow tempo2-format entries not flagged by FORMAT 1?)
  • Allow users to choose between strict and best-guess parsing modes. In strict mode, users declare that their files contain only tempo2-format TOAs signalled by FORMAT 1 and no informal comments, and any deviation raises an exception. In best-guess mode, PINT tries its best to make anything it sees into a TOA.
  • Make the smallest possible change to PINT's parsing necessary to fix existing bugs.
  • Declare ambiguous files the user's fault.
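
As a rough illustration of the strict versus best-guess idea, here is a sketch under several assumptions. The names `TimParseError`, `parse_tempo2_line`, and `read_tim` are made up for this example and are not PINT API. Only "#" comments and "FORMAT 1" are handled (so a JUMP or TIME line would be rejected in strict mode), and a tempo2 line is assumed to be `name freq MJD error site` followed by `-flag value` pairs. In lenient mode, a line that does not parse is skipped like an informal comment.

```python
class TimParseError(ValueError):
    """Raised when a line cannot be read as a tempo2-format TOA."""


def parse_tempo2_line(line):
    fields = line.split()
    if len(fields) < 5:
        raise TimParseError("expected at least 5 fields")
    name, freq, mjd, err, site = fields[:5]
    rest = fields[5:]
    if len(rest) % 2:
        raise TimParseError("flags must come in '-name value' pairs")
    flags = {}
    for key, value in zip(rest[::2], rest[1::2]):
        if not key.startswith("-"):
            raise TimParseError(f"flag name {key!r} does not start with '-'")
        flags[key[1:]] = value
    try:
        float(mjd)  # validate only; the string is kept to preserve precision
        toa = {"name": name, "freq": float(freq), "mjd": mjd,
               "error": float(err), "site": site, "flags": flags}
    except ValueError as exc:
        raise TimParseError(f"non-numeric field: {exc}") from exc
    return toa


def read_tim(lines, strict=True):
    toas, seen_format = [], False
    for n, line in enumerate(lines, start=1):
        s = line.strip()
        if not s or s.startswith("#"):
            continue
        if s.upper().split() == ["FORMAT", "1"]:
            seen_format = True
            continue
        try:
            if strict and not seen_format:
                raise TimParseError("TOA-like line before FORMAT 1")
            toas.append(parse_tempo2_line(s))
        except TimParseError as exc:
            if strict:
                raise TimParseError(f"line {n}: {exc}") from exc
            # best-guess mode: treat the line as an informal comment
    return toas
```

With `lines = ["FORMAT 1", "a.ar 1400.0 55000.123456789 1.5 gbt -be GUPPI", "this is a note"]`, `read_tim(lines, strict=False)` returns one TOA and silently drops the note, while `read_tim(lines)` raises `TimParseError` naming line 3.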

Issues/PRs where this is discussed: #1320 #1319 #1271 #730 #731

@dlakaplan
Contributor

Definitely in favor of having an optional strict mode

@kerrm
Contributor

kerrm commented Jul 21, 2022

I wanted to add just a bit more about the way tempo2 reads .tim files:

  • Yes, indeed, if "FORMAT" appears anywhere, it assumes the tempo2-style format. If "HEAD" appears anywhere (including after FORMAT), it assumes ".tpo" format, whatever that is. I'm guessing this use case doesn't happen often.
  • Only some mixed formats are supported.
  • FORMAT will cause non-tempo2 TOAs to parse incorrectly.
  • Lack of FORMAT will cause tempo2 TOAs to parse incorrectly.
  • Flags like JUMP/TIME etc. are accumulated as they are read, i.e. in parse order, just like tempo, and apply to all subsequent TOAs regardless of inferred format. You probably don't want this, and so if somehow you were successfully parsing a file with both tempo and tempo2-style TOAs, it would only work as expected if tempo2 preceded tempo.
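
The parse-order behaviour in the last bullet can be sketched as a toy. This is not tempo2's code: `apply_commands_in_order` is a hypothetical name, only TIME is modelled, and TIME is assumed to add an offset in seconds that accumulates and applies to every later line. The point is that the running state is attached to each subsequent TOA without checking which format that TOA is in.

```python
def apply_commands_in_order(lines):
    """Pair each non-command line with the TIME offset in force when it was read."""
    time_offset = 0.0
    out = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].upper() == "TIME":
            # Offsets accumulate; they are never reset by a change of TOA format.
            time_offset += float(fields[1])
            continue
        out.append((line, time_offset))
    return out


lines = [
    "a.ar 1400.0 55000.1 1.0 gbt",
    "TIME 1.0",
    "b.ar 1400.0 55001.1 1.0 gbt",
    "TIME -0.5",
    "c.ar 1400.0 55002.1 1.0 gbt",
]
for toa, offset in apply_commands_in_order(lines):
    print(offset, toa)
```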

In summary, sometimes tempo2 will successfully parse .tim files with mixed formats, but it's not a universal property, and many cases will definitely yield erroneous results.

Thus, the safe thing is to require that all TOAs in a file share the same format. This can also be managed with a metafile using "include".
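
A sketch of the metafile approach, with loud assumptions: `expand_includes` is a hypothetical helper, files are represented as an in-memory dict of name to lines rather than read from disk, and the file names are invented. Each line is tagged with the file it came from so that a reader could then check or parse each single-format file on its own terms.

```python
def expand_includes(files, name, _stack=()):
    """Yield (source_file, line) pairs, following INCLUDE commands depth-first."""
    if name in _stack:
        raise ValueError(f"circular INCLUDE of {name}")
    for line in files[name]:
        fields = line.split()
        if len(fields) == 2 and fields[0].upper() == "INCLUDE":
            yield from expand_includes(files, fields[1], _stack + (name,))
        else:
            yield name, line


# Hypothetical layout: one metafile pulling in two single-format files.
files = {
    "all.tim": ["INCLUDE old.tim", "INCLUDE new.tim"],
    "old.tim": ["(fixed-column TOA line)"],
    "new.tim": ["FORMAT 1", "a.ar 1400.0 55000.1 1.0 gbt"],
}
for source, line in expand_includes(files, "all.tim"):
    print(f"{source}: {line}")
```

Whether a FORMAT 1 seen in one included file should carry over into the next is exactly the kind of question this issue needs to settle; the sketch takes no position and only records where each line originated.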
