PINT's parsing of tim files needs to be revisited #1341

aarchiba · 2022-07-13T15:08:58Z

A number of issues and PRs have arisen suggesting that PINT's parsing of tim files should be revisited to ensure that we are parsing these files in an appropriate way. There is considerable discussion but it is spread over multiple issues and PRs, so I want to gather it here in hopes that we can establish what PINT's parsing should do, to be implemented later.

A summary of the problem

There are a number of formats for .tim file entries:

Parkes, Princeton, and ITOA. These are FORTRAN-style text formats with fixed columns where various things appear; these are mostly fairly easy to distinguish from one another although some nuance is necessary. White space between columns may or may not exist. These do not support custom flags.
Tempo2. This is a C/Python-style free-form text format with no fixed columns where fields are delimited by whitespace, and where the first field can and does contain more or less arbitrary non-whitespace text (often a filename). These support custom flags. By specification and custom, the presence of such TOAs is signalled by a command (see below) "FORMAT 1".
Commands. These include "FORMAT 1" to signal the presence of tempo2-format TOAs, "JUMP" to introduce a(n old-style) JUMP, "TIME" to signal that some TOAs need adjustment, and "INCLUDE" to signal that additional TOAs should be read from another file.
Comments. These are conventionally lines starting with "#" or "C" (though the latter can easily conflict with tempo2-format fields). By a common custom, they may be created by prepending "C " to a TOA that has been excised; scripts to programmatically generate or recover these TOAs are common.
Informal comments. Traditionally, TEMPO and TEMPO2 silently ignore any line they don't recognize, in effect treating it as a comment.

The problem, or at least a problem, is that "in the wild" some files exist that intermix these different kinds of entry in a variety of ways. The flexibility of the tempo2 format means it can easily be confused with Parkes, Princeton, and ITOA TOAs. The flexibility of comment format also complicates parsing, and the existence of informal comments makes error reporting much more difficult.
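To make the comment ambiguity concrete, here is a minimal sketch in Python. It is not PINT's code: the two helper functions, the file names, and the observatory codes are all made up for illustration. It shows how a rule of "any line starting with C is a comment" swallows a valid tempo2-format TOA whose first field happens to begin with "C", while a rule that requires whitespace after the "C" does not.

```python
def is_comment_naive(line: str) -> bool:
    """Hypothetical rule: any line starting with '#' or 'C' is a comment."""
    return line.startswith(("#", "C"))


def is_comment_conservative(line: str) -> bool:
    """Hypothetical rule: '#' at the start, or 'C' followed by whitespace."""
    return line.startswith("#") or line.startswith(("C ", "C\t"))


# An excised TOA, commented out by prepending "C " (made-up values).
excised = "C guppi_55000.ff 1400.000 55000.1234567890123 1.200 gbt"
# A valid tempo2-format TOA whose first field starts with "C" (made-up values).
valid_toa = "Crab_55000.ff 1400.000 55000.1234567890123 1.200 jb"

naive_result = is_comment_naive(valid_toa)                 # wrongly treated as a comment
conservative_result = is_comment_conservative(valid_toa)   # kept as a TOA
excised_result = is_comment_conservative(excised)          # still treated as a comment
```

Even the conservative rule is not airtight: a tempo2-format TOA whose first field is exactly "C" would still be read as a comment.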

How do programs address this?

Scott cites this as the best reference on TOA formats: http://tempo.sourceforge.net/ref_man_sections/toa.txt (although the TEMPO source code notes that some tim files in the wild deviate from this format).
TEMPO uses column-based machinery to sort Parkes, Princeton, and ITOA tim file entries apart; once it encounters FORMAT 1 all successive TOAs in that file are assumed to be tempo2-format. https://sourceforge.net/p/tempo/tempo/ci/master/tree/src/arrtim.f#l217
TEMPO2 uses column-based machinery to sort Parkes, Princeton, and ITOA tim file entries apart, but if a file contains a line starting with FORMAT anywhere, that file is assumed to contain only tempo2-format TOAs: https://bitbucket.org/psrsoft/tempo2/src/9f4f29abe564a3f907f8b97cef79385011391a23/readTimfile.C#lines-343
PINT currently essentially ignores "FORMAT 1" and treats every line as potentially any format. This leads to failures in parsing valid tempo2-format TOAs, such as those reported in "PINT misidentifies TOA format" (#1319) and "TOA parsing does not accept indented TEMPO2-format files" (#1271).
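The difference between the TEMPO and TEMPO2 behaviours described above can be sketched as follows. This is a simplified Python model of the two policies as summarized in this issue, not a port of either program: the function name `tempo2_mode_per_line` and the policy labels are invented here, and the TOA lines are placeholders.

```python
from typing import List


def tempo2_mode_per_line(lines: List[str], policy: str) -> List[bool]:
    """Return, for each line, whether tempo2-format parsing is in effect.

    policy="tempo":  tempo2-format applies from the "FORMAT 1" line onward.
    policy="tempo2": tempo2-format applies to the whole file if any line
                     starts with "FORMAT".
    """
    if policy == "tempo2":
        whole_file = any(line.startswith("FORMAT") for line in lines)
        return [whole_file] * len(lines)
    if policy == "tempo":
        in_effect = []
        seen_format = False
        for line in lines:
            if line.split()[:2] == ["FORMAT", "1"]:
                seen_format = True
            in_effect.append(seen_format)
        return in_effect
    raise ValueError(f"unknown policy: {policy!r}")


# Placeholder file: one non-tempo2 line, then FORMAT 1, then a tempo2-style line.
example = [
    "(a Princeton-format TOA line would go here)",
    "FORMAT 1",
    "a.ff 1400.0 55000.123 1.5 gbt",
]

tempo_view = tempo2_mode_per_line(example, "tempo")
tempo2_view = tempo2_mode_per_line(example, "tempo2")
```

Under the "tempo" policy the first line is still sorted by the column-based machinery, while under the "tempo2" policy it is read as tempo2-format, which is how a later FORMAT line can change the parsing of an earlier TOA.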

How should we change PINT's parsing to improve this situation?

Suggestions that have been floated:

Enforce that all lines after FORMAT 1 in a given file are treated as tempo2-format. (Do we allow tempo2-format entries not flagged by FORMAT 1?)
Allow users to choose between strict and best-guess parsing modes. In a strict mode, users can specify that their files are supposed to contain only tempo2-format TOAs signalled by FORMAT 1 and no informal comments, and any deviation raises an exception. In a best-guess mode, PINT tries its best to make anything it sees into a TOA.
Make the smallest possible change to PINT's parsing necessary to fix existing bugs.
Declare ambiguous files the user's fault.
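As a rough illustration of the strict versus best-guess idea, here is a sketch in Python. It is not a proposal for PINT's actual API: `parse_tim`, `parse_tempo2_toa`, and the `strict` argument are hypothetical names. It also simplifies heavily: only tempo2-format TOAs are recognized, and the commands named in this issue are recognized but not acted on. A real best-guess mode would also try the Parkes, Princeton, and ITOA formats.

```python
from typing import Dict, List

COMMANDS = {"FORMAT", "JUMP", "TIME", "INCLUDE"}  # the commands named in this issue


def parse_tempo2_toa(line: str) -> Dict[str, object]:
    """Parse 'name freq mjd error site [-flag value ...]' or raise ValueError."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected at least 5 whitespace-delimited fields")
    name, freq, mjd, error, site = fields[:5]
    rest = fields[5:]
    if len(rest) % 2 != 0 or not all(k.startswith("-") for k in rest[0::2]):
        raise ValueError("flags must come in '-name value' pairs")
    float(mjd)  # validate only; the string is kept to avoid losing precision
    return {
        "name": name,
        "freq": float(freq),
        "mjd": mjd,
        "error": float(error),
        "site": site,
        "flags": {k[1:]: v for k, v in zip(rest[0::2], rest[1::2])},
    }


def parse_tim(lines: List[str], strict: bool = False) -> List[Dict[str, object]]:
    """Strict mode raises on anything unexpected; best-guess mode skips it."""
    toas = []
    seen_format = False
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if stripped.split()[0] in COMMANDS:
            if stripped.split()[:2] == ["FORMAT", "1"]:
                seen_format = True
            continue  # commands are recognized but not acted on in this sketch
        if strict and not seen_format:
            raise ValueError(f"line {number}: TOA before 'FORMAT 1'")
        try:
            toas.append(parse_tempo2_toa(stripped))
        except ValueError as err:
            if strict:
                raise ValueError(f"line {number}: {err}") from err
            # Best-guess mode: treat the line as an informal comment.
    return toas


# Made-up file contents: one valid TOA and one informal comment.
example = [
    "FORMAT 1",
    "a.ff 1400.0 55000.123 1.5 gbt -fe L-wide",
    "this is an informal comment",
]
best_guess_toas = parse_tim(example)  # skips the informal comment
```

Calling `parse_tim(example, strict=True)` on the same input raises a `ValueError` that names line 3, which is the kind of error reporting that informal comments otherwise make impossible.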

Issues/PRs where this is discussed: #1320 #1319 #1271 #730 #731


dlakaplan · 2022-07-14T15:34:25Z

Definitely in favor of having an optional strict mode

kerrm · 2022-07-21T15:33:25Z

I wanted to add just a bit more about the way tempo2 reads .tim files:

Yes, indeed, if "FORMAT" appears anywhere, it assumes the tempo2-style format. If "HEAD" appears anywhere (including after FORMAT), it assumes ".tpo" format, whatever that is. I'm guessing this use case doesn't happen often.
Only some mixed formats are supported.
FORMAT will cause non-tempo2 TOAs to parse incorrectly.
Lack of FORMAT will cause tempo2 TOAs to parse incorrectly.
Commands like JUMP and TIME are accumulated as they are read, i.e. in parse order, just like tempo, and apply to all subsequent TOAs regardless of inferred format. You probably don't want this, and so if somehow you were successfully parsing a file with both tempo and tempo2-style TOAs, it would only work as expected if tempo2 preceded tempo.

In summary, sometimes tempo2 will successfully parse .tim files with mixed formats, but it's not a universal property, and many cases will definitely yield erroneous results.

Thus, the safe thing is to require that all TOAs in a file share the same format. This can also be managed with a metafile using "include".
