Implement Feat/expectations #307

Conversation
Can read, say, `test/problems/hello/expectations.yaml` and verify the expectations set therein (rather than looking at directory-defined expectations or `@EXPECTED_GRADES@`).

Also introduces short forms like `AC` for `ACCEPTED`.

Here’s the output for … Note, for instance, the feedback for …
Just a quick check, commenting on the first things that popped into my mind 🙂
```python
pass # TODO (Thore): make this shut up!
#error(
#    f'Submission {self.short_path} must have @EXPECTED_RESULTS@. Defaulting to ACCEPTED.'
#)
```
This error is probably still necessary. DOMjudge does not support an `expectations.yaml`, and reads the `@EXPECTED_RESULTS@` tag to know which final verdicts are allowed. So if the submission is not in one of the four default directories, and it does not have such a tag, DOMjudge will default to `ACCEPTED`, which is not intended in most cases 😅

I think that the `@EXPECTED_RESULTS@` tag converted to `expectations.yaml` syntax would look something like:
```yaml
rejected/broken.py: # contains `@EXPECTED_RESULTS@: TIME_LIMIT_EXCEEDED, WRONG_ANSWER`
  permitted: [AC, TLE, WA] # only these verdicts may appear
  required: [TLE, WA] # at least one of these verdicts must appear
```
It would be nice if the existence of an `@EXPECTED_RESULTS@` tag could be checked for consistency with the `expectations.yaml`. Perhaps we could even infer the tag from the YAML on export, but that may result in too much seemingly magic behaviour?
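(Such a consistency check could be quite small. A rough sketch, assuming the tag text and the `permitted` set are already at hand; the helper names and the short-form table are mine, not existing BAPCtools code:)

```python
import re

# Hypothetical short-form table for this sketch only.
SHORT = {
    'ACCEPTED': 'AC',
    'WRONG_ANSWER': 'WA',
    'TIME_LIMIT_EXCEEDED': 'TLE',
    'RUN_TIME_ERROR': 'RTE',
}


def parse_expected_results_tag(source: str) -> set[str] | None:
    """Return the verdicts listed in an @EXPECTED_RESULTS@ tag, or None if absent."""
    m = re.search(r'@EXPECTED_RESULTS@:\s*([A-Z_, ]+)', source)
    if m is None:
        return None
    return {SHORT.get(v.strip(), v.strip()) for v in m.group(1).split(',') if v.strip()}


def tag_consistent_with_permitted(tag_verdicts: set[str], permitted: set[str]) -> bool:
    """Cheap sanity check: every final verdict the tag allows should also be a
    permitted testcase verdict according to expectations.yaml."""
    return tag_verdicts <= permitted


# For the example above:
# tag_consistent_with_permitted({'TLE', 'WA'}, {'AC', 'TLE', 'WA'})  # -> True
```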
All correct, and all (in my mind) part of the tool more than of the expectations framework.

Note that the framework does allow file-based expectations; in principle, a submission itself can supply “its own” expectations, which then compose with expectations in `expectations.yaml` in well-defined ways.

I’ve silenced `@EXPECTED_RESULTS@` in my branch simply because there are too many decisions involved in how to handle it that I don’t feel strongly about. (A tiny method that translates `@EXPECTED_RESULTS@` into specific `expectations.BaseExpectations` would be the quickest way to be compatible. Very easy to do.)
By the way, if

```yaml
permitted: [AC, TLE, WA]
required: [TLE, WA]
```

is sufficiently popular, the cleanest way is to elevate it to a common abbreviation next to `accepted`, `does not terminate`, etc. In principle, there are 16 different possible `permitted` subsets (two of which make no sense), each of which can have a number of `required` subsets. I just took the standard set and added a few that seemed useful to me (such as `not accepted`).

If somebody has a large repository of `@EXPECTED_RESULTS@`-tagged files, we could run statistics on them! Please do.
Running the following command on my folder that contains all Git repos that I've worked with in the past years (FPC from 2018, BAPC+preliminaries from 2020, NWERC from 2022):

```
$ grep -r @EXPECTED_RESULTS@ 2> /dev/null | grep 20 | cut -d ':' -f 3 | sed 's/ //g' | sort | uniq -c
```

- The `grep 20` is to only select repositories that have a year number (2018–2023), because I want to exclude things like BAPCtools and domjudge that use this tag many times in their tests.
- The `sed` command removes the optional space after the comma that separates the verdicts.
- The verdicts may still have different orders (e.g. both `ACCEPTED,WRONG_ANSWER` and `WRONG_ANSWER,ACCEPTED` occur), but I've manually aggregated those in the results below.
```
  3 ACCEPTED,RUN_TIME_ERROR
  2 ACCEPTED,RUN_TIME_ERROR,TIME_LIMIT_EXCEEDED
  2 ACCEPTED,RUN_TIME_ERROR,WRONG_ANSWER
 47 ACCEPTED,TIME_LIMIT_EXCEEDED
 16 ACCEPTED,WRONG_ANSWER
 10 RUN_TIME_ERROR,TIME_LIMIT_EXCEEDED
  3 RUN_TIME_ERROR,TIME_LIMIT_EXCEEDED,WRONG_ANSWER
 22 RUN_TIME_ERROR,WRONG_ANSWER
 49 TIME_LIMIT_EXCEEDED,WRONG_ANSWER
```
After running this command, I realize that my earlier example only works if the tag only has non-AC verdicts. For example, a combined `AC,TLE` verdict may better be written without `required`:

```yaml
mixed/sometimes-too-slow.py: # contains `@EXPECTED_RESULTS@: ACCEPTED, TIME_LIMIT_EXCEEDED`
  permitted: [AC, TLE] # only these verdicts may appear
```

Adding the `required` field is probably not even strictly necessary in any of these cases, but for the cases that do not include `AC`, it's probably helpful if the tool warns about a submission that we expect to give some non-`AC` verdict, but that is in fact `AC`.

And as a final remark, I'm finding it difficult to switch my mental model from "the verdict that the submission receives" to "the set of verdicts that the test cases receive". The fact that DOMjudge (and `@EXPECTED_RESULTS@`) use the former and `expectations.yaml` uses the latter probably doesn't help either. 😂
Thank you for this. Also for this formulation:

> I'm finding it difficult to switch my mental model from "the verdict that the submission receives" to "the set of verdicts that the test cases receive".

That is exactly why I’m doing this, and I’m trying to be terminologically stringent about it. To repeat this in my own words:

- The separation of these two concepts is crucial for even defining the semantics of, e.g., `time_limit_exceeded`, in accordance with the specification.
- Another separation is between the behaviour of the construction and validation tool (which is used by authors) and the electronic judge (which is used by contest organisers, teachers, and solvers). The former (say, BAPCtools or problemtools) needs a way of saying “no testcase for the brute-force solution can ever be `WA`, but `TLE` and `RTE` are acceptable, and the final verdict shan’t be `AC`”, and construction tools should support this to ensure problem quality. On the other hand, the judging environment (Kattis, DOMjudge, etc.) needs well-defined semantics for what the final verdict is (or maybe subtask scores for IOI), and when it is OK to abort prematurely to reduce server load. For instance, a judge can report the lexicographically first (by testcase path) non-`AC` verdict (see the sketch below).
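(For concreteness, a minimal sketch of such a “lexicographically first non-`AC`” rule; purely illustrative, no particular judge is claimed to implement it exactly this way:)

```python
def first_error_verdict(verdict_by_testcase: dict[str, str]) -> str:
    """Final verdict under a 'first error' rule: AC if every testcase is AC,
    otherwise the verdict of the lexicographically first (by path) non-AC testcase."""
    for path in sorted(verdict_by_testcase):
        if verdict_by_testcase[path] != 'AC':
            return verdict_by_testcase[path]
    return 'AC'


# first_error_verdict({'secret/1-a': 'AC', 'secret/2-b': 'TLE', 'secret/3-c': 'WA'})  # -> 'TLE'
```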
By the way, I’m happy to add `final` to the expectations specification, and tried many ideas. Here’s how that could look:

```yaml
final: TLE
permitted: [AC, TLE, WA]
required: [TLE]
```
This is then either a requirement about the lexicographic ordering of testcases, ensuring that the `TLE` verdict is encountered before the `WA` verdict (what the spec calls `first_error`), or that the judge has an ordering of rejection verdicts that makes `TLE` worse than `WA`, continues grading after encountering `WA`, and then reports the worst error.

But I don’t really see a need for `final`, nor is the semantics clear to me. This is the kind of feedback I’m soliciting here. (In my current BAPCtools integration, these issues would need to be resolved to understand whether the final verdict should be red or green.) See, for instance, submission `bad-tle-wa.py` in the large screenshot above, which fails (because it gets a `WA`), yet gets a green `TLE` (which, after all, is a permitted testcase verdict). I have no opinion on what the semantics or visualisation should be.
Thanks for the elaboration! I don't think that a `final` expectation is needed, indeed 🙂
I’ve updated (b4d4df9) the pydoc documentation of `expectations.yaml` to spell out the workflow that avoids `expectations.yaml` altogether and specifies expectations submission-by-submission by writing

```python
registry['wrong_answer/greedy.py'] = Expectations({'sample': 'accepted'})
```

This would be my suggestion for translating submissions that contain `@EXPECTED_RESULTS@`. But I simply do not understand the semantics of the latter, because I can’t wrap my head around

> Use this to annotate the possible outcomes that a submission can have, that will be accepted by the Judging verifier.

in the DOMjudge appendix.

As far as I can tell, the final verdict (“outcome”) in DOMjudge depends on `results_prio`, which is not part of the problem package, and in the default setting seems to be what the problem package specification calls `first_error`. Thus, the outcome depends on the lexicographic ordering of the testcases. I don’t know if this granularity is intended or used, nor do I know how to translate it.
For DOMjudge, the final verdict not only depends on `results_prio`, but also on whether lazy judging is enabled: https://www.domjudge.org/docs/manual/8.2/judging.html#lazy-judging-and-results-priority

> When [lazy judging is] enabled, judging of a submission will stop when a highest priority result has been found for any testcase. You can find these priorities under the `results_prio` setting. In the default configuration, when enabling this, judging will stop with said verdict when a testcase results in e.g. `run-error`, `timelimit` or `wrong-answer`. When a testcase is correct (lower priority), judging will continue to the next test case. In other words, to arrive at a verdict of correct, all testcases will have been evaluated, while any of the ‘error’ verdicts will immediately return this answer for the submission and the other testcases will never be tested [...].
>
> [...]
>
> When not using lazy judging, all testcases will always be ran for each submission. The `results_prio` list will then determine which of the individual testcase results will be the overall submission result: the highest priority one. In case of a tie, the first occurring testcase result with highest priority is returned.
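(To make the quoted non-lazy rule concrete, a toy model; the priority numbers below are placeholders, not DOMjudge’s actual `results_prio` configuration:)

```python
# Placeholder priorities: higher wins. The real values are configurable in DOMjudge.
RESULTS_PRIO = {'correct': 1, 'wrong-answer': 99, 'timelimit': 99, 'run-error': 99}


def overall_result(testcase_results: list[str]) -> str:
    """Highest-priority testcase result wins; on a tie, the first occurrence wins
    (max() returns the first maximal element, which matches that tie-break)."""
    return max(testcase_results, key=lambda r: RESULTS_PRIO[r])


# overall_result(['correct', 'timelimit', 'wrong-answer'])  # -> 'timelimit'
```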
I think we can assume the default DOMjudge settings for BAPCtools. At least, we have never changed this setting in DOMjudge in the past years 🙂 Note that lazy judging, especially when running multiple runners in parallel, may not necessarily return the result of the first non-accepted test case. In the cases where we use `@EXPECTED_RESULTS@`, I think we usually don't care about the ordering, as long as the submission never receives any other verdict than a verdict from this list (i.e. we don't use `first_error` currently, but I also don't think we knew that it existed). Thus, at least for now, I don't think we have to translate `first_error` to expectations.
I would translate `@EXPECTED_RESULTS@: <verdict_list>` as follows (and it would be nice if BAPCtools did this automatically when an `expectations.yaml` does not exist); a small sketch of this translation follows the list:

- if `AC` in `<verdict_list>`:

  ```yaml
  permitted: <verdict_list>
  ```

- if `AC` not in `<verdict_list>`:

  ```yaml
  permitted: [AC, *<verdict_list>] # some test cases may pass, so also include `AC`
  required: <verdict_list> # at least one of the specified verdicts must appear for at least one of the test cases
  ```
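(In Python the translation could be as small as this sketch; the dict-shaped return value is only for illustration, the actual BAPCtools types may differ:)

```python
def translate_expected_results(verdicts: list[str]) -> dict[str, list[str]]:
    """Translate an @EXPECTED_RESULTS@ verdict list (already in short form)
    into permitted/required expectations, following the two rules above."""
    if 'AC' in verdicts:
        return {'permitted': verdicts}
    # Some test cases may still pass, so AC stays permitted; at least one of
    # the listed non-AC verdicts must actually appear.
    return {'permitted': ['AC', *verdicts], 'required': verdicts}


# translate_expected_results(['TLE', 'WA'])
#   -> {'permitted': ['AC', 'TLE', 'WA'], 'required': ['TLE', 'WA']}
# translate_expected_results(['AC', 'TLE'])
#   -> {'permitted': ['AC', 'TLE']}
```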
- Imports are usually ordered with system libraries first, followed by a
  newline, followed by local includes. Both groups are sorted alphabetically,
  and `import` comes before `from ... import`.

See [doc/expectations](doc/expectations.md)
You probably didn't mean to delete the entire `readme.md`? 😛
```python
# TODO Q from Thore: ok to use self here instead of problem?
def get_expectations_registry(self):
```
It's probably best to stay consistent with existing code, so I would use `problem` here instead of `self`, unless we rename the first parameter of all class methods from `problem` to `self`. I don't mind either way, as long as it's consistent, so this decision is better for Ragnar to make 🙂
I'm not really a fan of `self` somehow. I think it's because it's not always clear in which class you're currently reading code, with the big functions everywhere, so naming things after what they are seems clearer generally. (Especially given that `self` doesn't really mean anything special inside the member functions anyway.)

So yes, I'd prefer `problem`.
We just came across #313. Currently the lazy-judging situation is kind of messy/ad-hoc: …

This means that … With expectations this all becomes much nicer: we can simply stop on the first non-… Then the only edge case that I would like to keep is to keep running on …
```
@@ -0,0 +1,279 @@
# Expectations
```
Very nice writeup :)
```
#common
#range
permitted?: #verdicts // only these verdicts may appear
required?: #verdicts // at least one of these verdicts must appear
```
The “at least one” here is, I think, the most non-intuitive part of the spec, but also what is wanted/needed/useful. Requiring at least one `WA` and at least one `TLE` (in a single selection of testcases) is weird/rare anyway.

Maybe `require_one` would be a better key? But I like the simplicity of `required`, so probably this is just something to get used to.
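(For reference, a set-based sketch of how I read the permitted/required semantics; not the actual code in `bin/expectations.py`:)

```python
def satisfies(observed: set[str],
              permitted: set[str] | None = None,
              required: set[str] | None = None) -> bool:
    """observed:  the verdicts the selected testcases actually received.
    permitted: only these verdicts may appear (if given).
    required:  at least one of these verdicts must appear (if given)."""
    if permitted is not None and not observed <= permitted:
        return False
    if required is not None and not (observed & required):
        return False
    return True


# satisfies({'AC', 'TLE'}, permitted={'AC', 'TLE', 'WA'}, required={'TLE', 'WA'})  # -> True
# satisfies({'AC'},        permitted={'AC', 'TLE', 'WA'}, required={'TLE', 'WA'})  # -> False
```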
Given that duplicate yaml dictionary keys are disallowed, there may not be a nice way at all to specify multiple required verdicts for a single testgroup. Probably that needs `require_all:` or so.
One way around this could be to additionally allow `required: [[WA], [TLE]]` here: a list of lists, where each inner list behaves like the current `required` key, i.e. at least one of its verdicts must be present.
Exactly. For a while the keys were called `all`, `any`, `only`, `some` and so on, but it wasn’t clear even to myself. I agree that it feels as if `require_all` should be there, but (again) I haven’t seen a sufficient use case.

If it’s important to a problem author, the author can always be more precise:

```yaml
thore.cpp:
  secret/\d+-makes_thore_wrong: wrong answer
  secret/\d+-makes_thore_slow: time limit exceeded
```

However, I’m more than willing to reintroduce `only`, `all`, `some`. I found that very neat. Other naming suggestions are welcome (`subset`, `superset`, `intersects`). `require_any` and `require_all` are very clear, but make my eyes bleed.
Oh right, being more precise down to the individual-testcase level sounds good enough to me. This is probably a rare case anyway, and this is a sufficiently good workaround IMO. So let's just keep `required:`.

(Maybe add this workaround as a comment/example somewhere.)
```yaml
# Specification for subgroups works "all the way down to the testcase",
# though it's seldom needed:
funky.cpp:
```
I'm sorry to disappoint, but duplicate dictionary keys are not allowed by yaml, so this would have to be merged into the `funky.cpp` above.
* accepted: Every test case is `AC`.
* wrong answer: Every test case receives `AC` or `WA`; _some_ test case receives `WA`.
* time limit exceeded: Every test case receives `AC` or `TLE`; _some_ test case receives `TLE`.
* runtime exception: Every test case receives `AC` or `RTE`; _some_ test case receives `RTE`.
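(Reading these four shorthands as permitted/required pairs may make the pattern clearer; the following is my paraphrase of the quoted lines, not the spec's own notation:)

```python
# Each common shorthand expressed as the permitted/required pair it abbreviates.
SHORTHANDS = {
    'accepted':            {'permitted': {'AC'}},
    'wrong answer':        {'permitted': {'AC', 'WA'},  'required': {'WA'}},
    'time limit exceeded': {'permitted': {'AC', 'TLE'}, 'required': {'TLE'}},
    'runtime exception':   {'permitted': {'AC', 'RTE'}, 'required': {'RTE'}},
}
```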
I will trip over `time limit` (with a space) vs `runtime` (no space) all the time. Oh well. Let's make nice IDE integration to autocomplete all options here :)

For reference, currently `time_limit_exceeded/` and `run_time_error/` directories are used.
Again, deliberate (and a pet peeve).

- runtime refers to a part of the software life cycle. It appears in runtime error, or in phrases like “at runtime”. It can be spelt runtime or run time.
- running time is the execution time of a program, or the complexity of an algorithm.

I spent an hour yesterday communicating with a student about why their Kattis submission failed, and it turned out that the student thought “run time error” meant “the running time is wrong”, so we discussed the wrong thing for a long time. So our terminological sloppiness has really bad consequences.

So I do what I can to separate runtime and running time, including a strong preference for the unspaced orthography. Also, `runtime` remains the more common spelling.
Ah, I only now realize you also renamed `error` to `exception`. Works for me.

(Maybe we should actually rename the directory as well then.)
(These details are not really important to me, by the way, and shouldn’t get in the way of anything.)
I'm super happy with where this is going :) One question that remains to be answered (probably in the spec) is what will be the meaning of the existing directories. Probably we want each directory to directly map to the equivalent shorthand you introduce, but in the current spec …
Just to make sure: I explicitly have no opinion about what …
I think my question was mostly: … We could even extend that with …
What you suggest is what I would do (and I deliberately chose terminology to make this suggestion tempting), but the decision belongs to the problem package format, not to the expectations framework.
I'll review the code as well. I suppose the next step is to get the last details worked out in the spec and then we can merge it here. (I'd also be happy to merge before that.)
This is my first attempt at implementing the expectations framework into BAPC.

Draft documentation is in doc/expectations.md, and the file `bin/expectations.py` has extensive documentation (including doctests). This is a very early draft, and mainly an invitation for feedback.

Why care?

Correct semantics for required and permitted verdicts

Out of the box, an immediate improvement over the current situation is that it implements various verification conventions correctly, such as `time_limit_exceeded` and `run_time_exception`.

Here’s an example of a submission (in `test/problems/spanishinquisition`, where all the examples for the expectations framework reside) that fails for two different reasons. It lives in `wrong_answer`, but gets `TLE`. The tool now correctly reports this, by distinguishing between permitted verdicts (here, “all verdicts must be `AC` or `WA`”) and required verdicts (here, “some verdict must be `WA`”). Both violations are reported.

Of course, the normal behaviour is still supported. A submission in `accepted` that gets `WA` gets …

Note that the tool is more helpful in line 2 and tells us where to find the violation (here, because the submission resides in `accepted`), and which verdicts would have been permitted (here, `ACCEPTED`).

You can specify your own conventions for what should reside in `submissions/other` or `submissions/mixed`; neither the framework nor the tool cares.

More fine-grained test data specification

Requirements can, by default, be specified for all of `data/`. But it is also possible to be more fine-grained. A contest that wants to ensure that the submissions in `wrong_answer` still get accepted on all the samples can just write …

This rule now applies to all submissions in `wrong_answer`.

The specification of “which testdata to match” is just a regex, so you can go crazy: …

This can be useful for specifying specific testcases that are aimed at “killing” specific submissions (or vice versa). More generally, this allows specifying the behaviour of submissions on subdirectories of data, useful for expressing things like “the greedy submissions are expected to pass on the data in this directory, but not that one”, or problems with subtasks.
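(As an illustration of how such regex-keyed testdata matching could work; the anchoring convention below is my assumption, not a statement about the implementation:)

```python
import re


def matching_testcases(pattern: str, testcase_paths: list[str]) -> list[str]:
    """Select testcases whose path matches the given regex, anchored at the
    start of the path (an assumption made for this sketch)."""
    rx = re.compile(pattern)
    return [p for p in testcase_paths if rx.match(p)]


# matching_testcases(r'secret/\d+-makes_thore_wrong',
#                    ['sample/1', 'secret/03-makes_thore_wrong', 'secret/07-other'])
#   -> ['secret/03-makes_thore_wrong']
```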
More fine-grained submission specification

Matching of submissions is by prefix; the standard prefixes `accepted`, `wrong_answer`, etc. just happen to match the paths of the submissions that reside in the corresponding directories. As with `superstitious.py` above, you can specify a specific submission by giving its full name, or anything in between. If Thore wants to ensure that his submissions (say, `th-greedy.py`) in `wrong_answer` at least pass sample, he can add
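an entry along the following lines (an illustrative guess on my part, not the PR's own snippet):

```yaml
# Illustrative guess at the shape of such an entry.
wrong_answer/th-greedy.py:
  sample: accepted
```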