Implement Feat/expectations #307
base: master
Changes from all commits: 3e5b21d, d7165c6, 5f7a2bd, a357011, ea889a8, b8bc3dd, c10108e, c7bbef2, 0d84659, 509461d, d4eb2de, 9989ebc, 2e5861c, 1dc0e49, 86ebef2, 1cccbd8, 3b85390, 372e4c1, 4edfd33, 6c3d945, 8b3ddae, b68afec, c783540, aa6ddd7, f6f2bcc, db01d9d, b4d4df9
```diff
@@ -328,7 +328,7 @@ def __init__(self, problem, path, skip_double_build_warning=False):
         # The first element will match the directory the file is in, if possible.
-        self.expected_verdicts = self._get_expected_verdicts()
+        self.expectations = self.problem.get_expectations_registry().for_path(self.short_path)
         # NOTE: Judging of interactive problems on systems without `os.wait4` is
         # suboptimal because we cannot determine which of the submission and
         # interactor exits first. Thus, we don't distinguish the different non-AC
```
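For orientation: the new `for_path` call suggests a registry keyed by path patterns. Below is a minimal sketch of what such a lookup could look like; `get_expectations_registry` and `for_path` appear in the diff, but everything else here is an assumption, not the PR's actual implementation.

```python
from pathlib import Path

# Hypothetical sketch only: the PR's real registry class may differ.
class ExpectationsRegistry:
    def __init__(self):
        # pattern -> permitted short verdicts, e.g. 'wrong_answer' -> {'AC', 'WA'}
        self._rules = {}

    def register(self, pattern, permitted):
        self._rules[pattern] = set(permitted)

    def for_path(self, short_path):
        # Collect every rule whose pattern is a prefix of the submission path,
        # so 'wrong_answer/greedy.py' also picks up a rule for 'wrong_answer'.
        return {
            pattern: verdicts
            for pattern, verdicts in self._rules.items()
            if str(short_path).startswith(pattern)
        }

registry = ExpectationsRegistry()
registry.register('wrong_answer', {'AC', 'WA'})
print(registry.for_path(Path('wrong_answer/greedy.py')))  # {'wrong_answer': {'AC', 'WA'}}
```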
```diff
@@ -400,9 +400,10 @@ def _get_expected_verdicts(self):
             verdicts = [subdir]
         else:
             if len(verdicts) == 0:
-                error(
-                    f'Submission {self.short_path} must have @EXPECTED_RESULTS@. Defaulting to ACCEPTED.'
-                )
+                pass # TODO (Thore): made this shut up!
+                #error(
+                #    f'Submission {self.short_path} must have @EXPECTED_RESULTS@. Defaulting to ACCEPTED.'
+                #)
```
Review comment:

This error is probably still necessary. DOMjudge does not support an […]. I think that the […]:

```yaml
rejected/broken.py:  # contains `@EXPECTED_RESULTS@: TIME_LIMIT_EXCEEDED, WRONG_ANSWER`
  permitted: [AC, TLE, WA]  # only these verdicts may appear
  required: [TLE, WA]       # at least one of these verdicts must appear
```

It would be nice if the existence of an […].

Reply:

All correct, and all (in my mind) part of the tool more than of the expectations framework. Note that the framework does allow file-based expectations; in principle a submission itself can supply “its own” expectations, which then compose with expectations in […]. I’ve silenced […].
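Concretely, the `permitted`/`required` pair above can be read as two set constraints: `permitted` is an upper bound on the verdicts the testcases may produce, and `required` demands that at least one of the listed verdicts actually occurs. A small sketch of that semantics as described in this thread (illustrative, not code from the PR):

```python
# Sketch of the permitted/required semantics discussed above; illustrative only.
def satisfies(testcase_verdicts, permitted, required):
    got = set(testcase_verdicts)
    return got <= set(permitted) and bool(got & set(required))

# rejected/broken.py: a run whose testcases got {TLE, WA} is fine...
print(satisfies({'TLE', 'WA'}, {'AC', 'TLE', 'WA'}, {'TLE', 'WA'}))  # True
# ...but an all-AC run violates `required`, exposing a broken submission.
print(satisfies({'AC'}, {'AC', 'TLE', 'WA'}, {'TLE', 'WA'}))  # False
```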
Reply:

By the way, if

```yaml
permitted: [AC, TLE, WA]
required: [TLE, WA]
```

is sufficiently popular, the cleanest way is to elevate it to a common abbreviation next to […].

If somebody has a large repository of `@EXPECTED_RESULTS@`-tagged files, we could run statistics on them! Please do.

Reply:

Running the following command on my folder that contains all Git repos that I’ve worked with in the past years (FPC from 2018, BAPC+preliminaries from 2020, NWERC from 2022):

```console
$ grep -r @EXPECTED_RESULTS@ 2> /dev/null | grep 20 | cut -d ':' -f 3 | sed 's/ //g' | sort | uniq -c
```
After running this command, I realize that my earlier example only works if the tag only has non-AC verdicts. For example, a combined […]:

```yaml
mixed/sometimes-too-slow.py:  # contains `@EXPECTED_RESULTS@: ACCEPTED, TIME_LIMIT_EXCEEDED`
  permitted: [AC, TLE]  # only these verdicts may appear
```

Adding the […].

And as a final remark, I’m finding it difficult to switch my mental model from “the verdict that the submission receives” to “the set of verdicts that the test cases receive”. The fact that DOMjudge (and […].

Reply:

Thank you for this. Also for this formulation:

> the set of verdicts that the test cases receive

That is exactly why I’m doing this, and I’m trying to be terminologically stringent about it. To repeat this in my own words: […]
By the way, I’m happy to add […]:

```yaml
final: TLE
permitted: [AC, TLE, WA]
required: [TLE]
```

This is then either a requirement about the lexicographic ordering of testcases, ensuring that the […]. But I don’t really see a need for […].

Reply:

Thanks for the elaboration! I don’t think that a […].

Reply:

I’ve updated (b4d4df9) the pydoc documentation of […]:

```python
registry['wrong_answer/greedy.py'] = Expectations({'sample': 'accepted'})
```

This would be my suggestion for translating submissions that contain
[…] in the DOMjudge appendix. As far as I can tell, the final verdict (“outcome”) in DOMjudge depends on […].

Reply:

For DOMjudge, the final verdict not only depends on […]. I think we can assume the default DOMjudge settings for BAPCtools; at least, we have never changed this setting in DOMjudge in the past years 🙂

Note that lazy judging, especially when running multiple runners in parallel, may not necessarily return the result of the first non-accepted test case. In the cases where we use […], I would translate […].
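Summing up the translation this thread converges on, here is a hedged sketch: `required` is only added when the `@EXPECTED_RESULTS@` tag contains no ACCEPTED. The verdict mapping and both rules are assumptions drawn from the two examples above, not the PR's actual code.

```python
# Hypothetical translation of @EXPECTED_RESULTS@ tags into permitted/required,
# following the two examples in this thread; not the PR's actual code.
LONG_TO_SHORT = {
    'ACCEPTED': 'AC',
    'WRONG_ANSWER': 'WA',
    'TIME_LIMIT_EXCEEDED': 'TLE',
    'RUN_TIME_ERROR': 'RTE',
}

def translate(tag):
    short = [LONG_TO_SHORT[v] for v in tag]
    if 'AC' in short:
        # mixed/sometimes-too-slow.py: nothing beyond the upper bound is demanded.
        return {'permitted': short}
    # rejected/broken.py: AC testcases are allowed, but at least one of the
    # tagged rejection verdicts must actually show up.
    return {'permitted': ['AC'] + short, 'required': short}

print(translate(['TIME_LIMIT_EXCEEDED', 'WRONG_ANSWER']))
# {'permitted': ['AC', 'TLE', 'WA'], 'required': ['TLE', 'WA']}
print(translate(['ACCEPTED', 'TIME_LIMIT_EXCEEDED']))
# {'permitted': ['AC', 'TLE']}
```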
```diff
         if len(verdicts) == 0:
             verdicts = ['ACCEPTED']
```
```diff
@@ -452,9 +453,10 @@ def run_all_testcases(
         verdict = (-100, 'ACCEPTED', 'ACCEPTED', 0)  # priority, verdict, print_verdict, duration
         verdict_run = None
+        verdict_for_testcase = dict()

         def process_run(run, p):
-            nonlocal max_duration, verdict, verdict_run
+            nonlocal max_duration, verdict, verdict_run, verdict_for_testcase

             localbar = bar.start(run)
             result = run.run()
```
```diff
@@ -476,7 +478,10 @@ def process_run(run, p):
             if table_dict is not None:
                 table_dict[run.name] = result.verdict == 'ACCEPTED'

-            got_expected = result.verdict in ['ACCEPTED'] + self.expected_verdicts
+            verdict_short = short_verdict(result.verdict)
+            verdict_for_testcase[run.name] = verdict_short
+            #got_expected = result.verdict in ['ACCEPTED'] + self.expected_verdicts
+            got_expected = self.expectations.is_permitted(verdict_short, run.name)

             # Print stderr whenever something is printed
             if result.out and result.err:
```
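The call sites pin down the shape of `is_permitted`: it takes a short verdict plus a testcase path (or `Path()` for the whole-submission summary, as the later hunk shows). A standalone sketch of such a check against per-pattern permitted sets; the storage layout is an assumption, not the PR's `Expectations` class.

```python
from pathlib import Path

# Sketch: expectations as pattern -> permitted short verdicts; layout assumed.
def is_permitted(expectations, verdict, testcase):
    # Every rule whose pattern prefixes the testcase path must allow the
    # verdict; the empty pattern matches everything, including Path().
    return all(
        verdict in permitted
        for pattern, permitted in expectations.items()
        if pattern == '' or str(testcase).startswith(pattern)
    )

rules = {'': {'AC', 'TLE'}, 'sample': {'AC'}}
print(is_permitted(rules, 'TLE', Path('secret/huge-1')))  # True
print(is_permitted(rules, 'TLE', Path('sample/1')))       # False: sample must be AC
```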
```diff
@@ -514,8 +519,17 @@ def process_run(run, p):
                     data += '\n'
                     data += f'{f.name}:' + localbar._format_data(t) + '\n'

             if not got_expected:
-                localbar.error(f'{result.duration:6.3f}s {result.print_verdict()}', data)
+                short = short_verdict(result.verdict)
+                for prefix, pattern, verdicts in self.expectations.violated_permissions(short, run.name):
+                    prefix = (f'{Fore.CYAN}{prefix:>{len(localbar.prefix)}}{Style.RESET_ALL}:' +
+                              f'{pattern:<{localbar.item_width}}')
+                    localbar.warn(f"permits {verbose_verdicts(verdicts)}", prefix=prefix)
+
+            localbar.done(got_expected, f'{result.duration:6.3f}s {result.print_verdict()}', data)

             # Lazy judging: stop on the first error when not in verbose mode.
             if (
                 not config.args.verbose and not config.args.table
```
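From this call site, `violated_permissions` evidently yields `(prefix, pattern, verdicts)` triples, one per rule that the observed verdict breaks, which the progress bar then prints as warnings. A sketch consistent with that shape, with all internals assumed:

```python
# Sketch: report every rule that disallows the verdict; illustrative only.
def violated_permissions(expectations, verdict, testcase):
    for pattern, permitted in expectations.items():
        if (pattern == '' or str(testcase).startswith(pattern)) and verdict not in permitted:
            # The diff prints a source prefix alongside the pattern; a fixed
            # placeholder stands in for it here.
            yield ('expectations', pattern, sorted(permitted))

rules = {'': {'AC', 'TLE'}, 'sample': {'AC'}}
for prefix, pattern, verdicts in violated_permissions(rules, 'TLE', 'sample/1'):
    print(f'{prefix}: {pattern} permits {verdicts}')  # expectations: sample permits ['AC']
```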
```diff
@@ -534,16 +548,22 @@ def process_run(run, p):
         self.print_verdict = verdict[2]
         self.duration = max_duration

+        # Check presence of required verdicts among testgroups
+        for prefix, pattern, verdicts in self.expectations.unsatisfied_requirements(verdict_for_testcase):
+            prefix = (f'{Fore.CYAN}{prefix:>{len(bar.prefix)}}{Style.RESET_ALL}: ' +
+                      f'{pattern:<{bar.item_width}}')
+            bar.warn(f"no test case got {verbose_verdicts(verdicts)}", prefix=prefix)
+
         # Use a bold summary line if things were printed before.
         if bar.logged:
             color = (
                 Style.BRIGHT + Fore.GREEN
-                if self.verdict in self.expected_verdicts
+                if self.expectations.is_permitted(short_verdict(self.verdict), Path())
                 else Style.BRIGHT + Fore.RED
             )
             boldcolor = Style.BRIGHT
         else:
-            color = Fore.GREEN if self.verdict in self.expected_verdicts else Fore.RED
+            color = Fore.GREEN if self.expectations.is_permitted(short_verdict(self.verdict), Path()) else Fore.RED
             boldcolor = ''

         printed_newline = bar.finalize(
```
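Unlike the per-testcase permission checks, `unsatisfied_requirements` receives the full `verdict_for_testcase` map once all runs are done and yields the `required` rules that no testcase satisfied. A standalone sketch under the same assumed layout as above; not the PR's implementation:

```python
# Sketch: check `required` rules against the verdicts all testcases received.
def unsatisfied_requirements(required_by_pattern, verdict_for_testcase):
    for pattern, required in required_by_pattern.items():
        # Verdicts of the testcases this rule applies to.
        got = {
            verdict
            for name, verdict in verdict_for_testcase.items()
            if pattern == '' or str(name).startswith(pattern)
        }
        if not got & set(required):
            yield ('expectations', pattern, sorted(required))

required = {'secret': {'TLE', 'WA'}}
verdicts = {'sample/1': 'AC', 'secret/huge-1': 'AC'}
for prefix, pattern, missing in unsatisfied_requirements(required, verdicts):
    print(f'{prefix}: {pattern}: no test case got {missing}')
# expectations: secret: no test case got ['TLE', 'WA']
```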
Review comment:

It's probably best to stay consistent with existing code, so I would use `problem` here instead of `self`, unless we rename the first parameter of all class methods from `problem` to `self`. I don't mind either way, as long as it's consistent, so this decision is better for Ragnar to make 🙂

Reply:

I'm not really a fan of `self` somehow. I think it's because, with the big functions everywhere, it's not always clear in which class you're currently reading code, so naming things after what they are seems clearer generally. (Especially given that `self` doesn't really mean anything special inside the member functions anyway.)

So yes, I'd prefer `problem`.