Report comparative info for detector scores #814
Conversation
…t from multiple reports
…section; include bag details
Testing in progress. Can you update the description here to explain what this PR does and offer an example run of perf_stats.py? We don't need the actual report files, just breadcrumbs to follow when we need to update these resources.
Yup, done!
Due to #813, the resource files here need to replace `ContinueSlursReclaimedSlurs80` with `ContinueSlursReclaimedSlursMini`.
Outstanding catch
Enable interpretation of scores in a run by calibrating them against a bag of models.
`garak/analyze/perf_stats.py` takes a glob of `report.jsonl` files and calculates the mean, standard deviation, and Shapiro-Wilk p-value (the latter assesses how well the spread of scores fits a normal distribution) for each probe/detector found.
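For orientation, here is a minimal sketch of that kind of aggregation, assuming purely for illustration that each `report.jsonl` line is a JSON object carrying hypothetical `probe`, `detector`, and `score` fields; the real `perf_stats.py` may read the reports differently:

```python
# Illustrative sketch only, not the actual perf_stats.py.
import glob
import json
import statistics

from scipy import stats  # scipy.stats.shapiro for the normality test

scores = {}  # (probe, detector) -> list of observed scores
for path in glob.glob("runs/*.report.jsonl"):  # hypothetical glob
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            key = (record["probe"], record["detector"])
            scores.setdefault(key, []).append(record["score"])

for (probe, detector), vals in scores.items():
    mean = statistics.mean(vals)
    stdev = statistics.stdev(vals) if len(vals) > 1 else 0.0
    # Shapiro-Wilk needs at least three observations
    sw_p = stats.shapiro(vals).pvalue if len(vals) >= 3 else None
    print(probe, detector, mean, stdev, sw_p)
```

`scipy.stats.shapiro` requires at least three observations, hence the guard.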
`garak/resources/calibration` contains the files from which stats are derived in a comparison. These contents are generated from `perf_stats.py`.
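As an illustration only, a calibration entry could be keyed by probe/detector and carry the per-bag stats; the field names and layout below are assumptions, not the real file format:

```python
# Hypothetical layout only; the actual calibration file may differ.
calibration = {
    "probe.Example/detector.Example": {
        "mean": 0.12,   # mean detector score across the bag of models
        "sigma": 0.04,  # standard deviation across the bag
        "sw_p": 0.31,   # Shapiro-Wilk p-value for the score distribution
    }
}
```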
`garak/analyze/report_digest.py` and its templates are updated to calculate a z-score for probe/detector combinations where this is possible, given a default calibration JSON, to print it in the HTML output, and to report what that score means and where it came from.
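A hedged sketch of the z-scoring step (not the actual `report_digest.py` code), assuming calibration stats shaped like the hypothetical entry above:

```python
# Illustrative only: compare an observed probe/detector score against
# hypothetical calibration stats and describe the result.
def z_score(observed: float, mean: float, sigma: float) -> float | None:
    """How many standard deviations the observed score sits from the bag mean."""
    if sigma == 0:
        return None  # no spread in the calibration bag; z-score undefined
    return (observed - mean) / sigma


z = z_score(observed=0.20, mean=0.12, sigma=0.04)
if z is not None:
    direction = "above" if z >= 0 else "below"
    print(f"z = {z:+.1f}: {abs(z):.1f} standard deviations {direction} "
          "the calibration bag's mean for this probe/detector")
```

The sign of z only says which side of the calibration bag's mean the run fell on; whether that is good or bad depends on the detector.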