-
Notifications
You must be signed in to change notification settings - Fork 260
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
data file override support with precedence #916
data file override support with precedence #916
Conversation
Signed-off-by: Jeffrey Martin <[email protected]>
Signed-off-by: Jeffrey Martin <[email protected]>
* returns first found instance of filename * custom `Path` raises exception when: * path escape attempt is detected * no file matching request exists
Signed-off-by: Jeffrey Martin <[email protected]>
* move to tools path * update to overwrite default `data` file Signed-off-by: Jeffrey Martin <[email protected]>
Looks like there is some work to do with support for this on python 3.10 apparently! |
Flagging a tension here with payload enumeration. Currently payload enum globs but uses its own directory traversal code, rather than |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions, looks good so far. Will be some extra work involved in aligning the payloads PR to the proposals here. The general separation is for non-code that can also be overridden should move into data; the rest remains in resources.
_config.transient.data_dir / "data", | ||
_config.transient.package_dir / "data", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
realising now that these aren't particularly transient pieces of information, so might go elsewhere in config. one advantage of this PR is that if this is worth updating one day, the update can be centralised in data
and made in far fewer places.
Signed-off-by: Jeffrey Martin <[email protected]>
from garak.detectors.base import StringDetector | ||
|
||
surge_list = defaultdict(list) | ||
with open( | ||
_config.transient.package_dir / "resources" / "profanity_en.csv", | ||
data_path / "profanity_en.csv", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth looking at as part of a larger refactor -- do we want all these datasets in our git? Or should we have them in some other place e.g. HuggingFace hub, and the garak.data
module can also manage downloading these files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have discussed in the past that we may want tooling that will download all datasets to create an offline
deployment capability.
I suspect there could be some expansion on garak.data
for handling access to known datasets. Treating HF as the specific location that tooling can register data as available from might be a good direction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes HF drops connections and things go wrong, so I prefer keeping smaller things closer. What that cutoff is, I don'k know - and it does also mean that garak will grow bigger over time.
Agree some expansion could work, perhaps using HF by default with a backup URI also (maybe a garak-data
repo)
/ "tap" | ||
/ "data" | ||
/ "tap_jailbreaks.txt" | ||
garak._config.transient.cache_dir / "data" / "tap" / "tap_jailbreaks.txt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since probes.tap
references the prompts_location
as data_path / "tap" / "tap_jailbreaks.txt"
, do we want this to reflect that as well? The idea is to keep your TAP results such that you can use them (or share them with us!) in subsequent probes. Though perhaps that's a larger issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree these should be captured better for re-use. I will defer that change to a future iteration though. I think there is a larger issue to address as we have a few techniques that generate these intermediate files that could be valuable from a research context.
Depending on land order I am thinking this can be a use case to improve |
Signed-off-by: Jeffrey Martin <[email protected]>
@@ -92,7 +92,8 @@ def _gen_prompts(self, term): | |||
def __init__(self, config_root=_config): | |||
super().__init__(config_root) | |||
|
|||
self.data_dir = _config.transient.cache_dir / "resources" / "wn" | |||
self.data_dir = _config.transient.cache_dir / "data" / "wn" | |||
self.data_dir.parent.mkdir(mode=0o740, parents=True, exist_ok=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting item here, the parent
must exist before download however if the wn
folder already exists the raise when attempting access below will be a different exception type and fails to recover.
This was masked
previously by the fact that _config.transient.cache_dir / "resources"
is created in early initialization for the plugin_cache.json
file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, interesting. Where do we want to check for / make XDG dirs? And then I guess should we be making subdirs using mkdir -p
-equiv?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we want to check for / make XDG dirs?
https://github.com/leondz/garak/blob/0e1d96e3892e14564aaece56c1ee9f92cfdd7dc0/garak/_config.py#L68-L71
parents=True
is the equivalent to -p
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, missed that. This looks good. Expect to move wn
code to its own module/class once we know where those will live - did we settle on resources
? Other options could be apis
, external
, apps
, common
, & not lib
. Or is this a conversation to conclude once this PR lands?
Signed-off-by: Jeffrey Martin <[email protected]>
Signed-off-by: Jeffrey Martin <[email protected]>
Signed-off-by: Jeffrey Martin <[email protected]>
Signed-off-by: Jeffrey Martin <[email protected]>
Signed-off-by: Jeffrey Martin <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Signed-off-by: Jeffrey Martin <[email protected]>
NICE |
Fix #891
data
files to allow user's to override project provided datasets using$XDG_DATA_DIR/garak/data
to mirrorgarak/data
from the package.cached files
and do not change location or behavior in the revisiongcg
Consumers of the
garak.data.path
object can traverse the data directory as a standardPath
object and will be returned existing files based on the order precedence from the supported locations on disk.A possible drawback to the current implementation exists regarding directory interaction, if a folder exists at more than one path location the higher precedence path will be returned, when performing
glob
or iteration actions with such an instance the only files in the selected path on disk will be evaluated. This is currently considered an acceptable limitation.Verification
List the steps needed to make sure this thing works
python -m pytest tests/
garak -m nim -p donotanswer.MisinformationHarms -g 1 -n meta/llama3-70b-instruct --parallel_attempts 5
donotanswer.MisinformationHarms
garak -m nim -p donotanswer.MisinformationHarms -g 1 -n meta/llama3-70b-instruct --parallel_attempts 5
155
attempts10
attempts (count depends-n
count passed tohead
)Note: #870 provides new
data
files that will need to migrate to this pattern depending on PR landing order.