Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use XDG paths for configuration data and caching #799

Merged

Conversation

jmartin-tech
Copy link
Collaborator

Fix #479

Support using XDG ver 0.8 for project data.

Specifically support:

ENV VAR DEFAULT
$XDG_DATA_HOME $HOME/.local/share
$XDG_CONFIG_HOME $HOME/.config
$XDG_CACHE_HOME $HOME/.cache

Project name garak is appended to each location.

This is represents the followina breaking changes to project expecations:

  • report_prefix passed either at the command line or as config file option
    • set filename values only
    • no longer overrides report_dir
  • report_dir passed as a config file option
    • when provided as a relative path will be prepend with <xdg_data_home>/garak
    • provided as an absolute path will be used as the output directory
  • default user/site configuration file garak.site.yaml has moved
    • previously <basedir>/garak.site.yaml
    • updated location <xdg_config_home>/garak/garak.site.yaml

Additional changes (not considered breaking changes):

  • nltk data is placed in <xdg_cache_home>/garak if not already found in the environment
  • visual_jailbreak downloaded artifacts are placed in <xdg_cache_home>/garak/resources
  • generated data for beast/gcg/tap are placed in <xdg_cache_home>/garak/resources

Comment on lines 52 to 60
advbench_path = (
garak._config.transient.cache_dir
/ "resources"
/ "advbench"
/ "harmful_behaviors.csv"
)
if advbench_base_path.is_file() and not advbench_path.is_file():
shutil.copy2(advbench_base_path, advbench_path)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a future iteration helper methods from _config for loading existing resource files may be appropriate, allowing the user to place files in the cache_dir or data_dir to override shipped versions.

garak/_plugins.py Outdated Show resolved Hide resolved
@@ -64,7 +65,7 @@
gcg_parser.add_argument(
"--outfile",
type=str,
default=gcg_resource_data / "gcg_prompts.txt",
default=resource_data / "gcg_prompts.txt",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is output, should this go to the cache location by default?

Suggested change
default=resource_data / "gcg_prompts.txt",
default=gcg_resource_data / "gcg_prompts.txt",

Thoughts?

Copy link
Collaborator Author

@jmartin-tech jmartin-tech Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more on this, it may be reasonable to create utility functions for read and write of resource files, and enforce a coding standard of access where a read file handle to a resource is returned based on two levels of access, first searching for the file in XDG_DATA_HOME/garak/PATH_TO_RESOURCE_FILE and falling back to return BASEDIR/PATH_TO_RESOURCE_FILE when not found. As a standard coding practice for the project all resources shipped with the project must be accessed through this utility.

Extending from this the standard for write to resource files would always place the file in XDG_DATA_HOME/garak to ensure the tool does not write to files in the installation path, but rather in the a user owned location. An additional benefit of this would be that the user could then override any resource file by placing their own version in XDG_DATA_HOME/garak/PATH_TO_RESOURCE_FILE. An immediate benefit of this would be seen in the suffix.GCGCached probe by allowing the cached set of prompts to be overwritten using this pattern. This could be further extended to centralize common as loadable files not embedded as strings in the codebase ensuring less churn unrelated to logic in the python files.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern sounds great - sorry I missed this comment during the last review. The pattern fits multiple places. I guess writing should include a message in log/cli regarding the selected path. Also appreciate the benefits re: moving config out of source.

Def out of scope for this PR though, right

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes out of scope here, I have ensured output files are consistently in cache_dir for now.

@@ -37,7 +37,8 @@

logger = getLogger(__name__)

gcg_resource_data = garak._config.transient.basedir / "resources" / "gcg" / "data"
resource_data = garak._config.transient.basedir / "resources"
gcg_resource_data = garak._config.transient.cache_dir / "resources" / "gcg" / "data"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider updating suffix.GCGCached to read from the cache location and copy from the project install if the file is not already in this cache_dir path.

Copy link
Collaborator Author

@jmartin-tech jmartin-tech Jul 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferring this idea for future iteration.

@jmartin-tech jmartin-tech force-pushed the feature/user-storage-consolidation branch 2 times, most recently from 73eb8f0 to cc75afc Compare July 23, 2024 20:04
Support using [XDG ver 0.8](https://specifications.freedesktop.org/basedir-spec/basedir-spec-latest.html) for project data.

Specifically support:
| ENV VAR          | DEFAULT            |
|------------------|--------------------|
| $XDG_DATA_HOME   | $HOME/.local/share |
| $XDG_CONFIG_HOME | $HOME/.config      |
| $XDG_CACHE_HOME  | $HOME/.cache       |

Project name `garak` is appended to each location.

This is represents the followina breaking changes to project expecations:
* report_prefix passed either at the command line or as config file option
  * set filename values only
  * no longer overrides report_dir
* report_dir passed as a config file option
  * when provided as a relative path will be prepend with `<xdg_data_home>/garak`
  * provided as an absolute path will be used as the output directory
* default `user/site` configuration file `garak.site.yaml` has moved
  * previously `<basedir>/garak.site.yaml`
  * updated location `<xdg_config_home>/garak/garak.site.yaml`

Additional changes (not considered breaking changes):
* nltk data is placed in <xdg_cache_home>/garak if not already found in the environment
* visual_jailbreak downloaded artifacts are placed in <xdg_cache_home>/garak/resources
* generated data for beast/gcg/tap are placed in <xdg_cache_home>/garak/resources

Signed-off-by: Jeffrey Martin <[email protected]>
@jmartin-tech jmartin-tech force-pushed the feature/user-storage-consolidation branch from 7e2430a to bdc2a41 Compare July 24, 2024 15:18
garak/_config.py Outdated Show resolved Hide resolved
garak/_config.py Outdated Show resolved Hide resolved
garak/_config.py Outdated Show resolved Hide resolved
garak/_plugins.py Outdated Show resolved Hide resolved
@@ -8,8 +8,10 @@


def start_logging():
from garak import _config
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from garak import _config
""" initialises logging. assumes garak _config has already been loaded. """
from garak import _config

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider enforcing this vs documenting it?

Suggested change
from garak import _config
from garak import _config
if not _config.loaded:
raise RuntimeError("Configuration must be loaded to start logging!")

Copy link
Collaborator Author

@jmartin-tech jmartin-tech Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking closer at this I don't think we need the warning at all, _config.transient.*_dir are all currently defined at load by import of the module and IMO starting logging functionality should not be restricted on runtime config being finalized.

>>> from garak import _config
>>> _config.transient.data_dir
PosixPath('/Users/jemartin/.local/share/garak')
>>>

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed 0cad75c, enforcing loaded requirement is not really valid as noted in my comment above, the value consumed in this method is available on import of _config.

garak/command.py Outdated
@@ -41,19 +44,25 @@ def start_run():
"⚠️ The current/default config is optimised for speed rather than thoroughness. Try e.g. --config full for a stronger test, or specify some probes."
)
_config.transient.run_id = str(uuid.uuid4()) # uuid1 is safe but leaks host info
# why is report prefix a condition of placing file in the report_dir?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if i'm answering the right question, report_prefix was flexible between being a simple string prepended to report filenames, to something containing a directory path. now that directory is explicitly handled separately within this pr, report_prefix no longer needs to support directories (and probably shouldn't, in fact).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Awesome, this PR takes the approach of making report_prefix a naming item only. I do like the idea of adding validation, that will ensure the value is not used for path traversal.

garak/resources/autodan/genetic.py Show resolved Hide resolved
@@ -342,7 +347,7 @@ def run_beast(
suffix_len: int = 40,
data_size: int = 20,
target: Optional[str] = "",
outfile: str = beast_resource_data / "suffixes.txt",
outfile: Path = beast_resource_data / "suffixes.txt",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is (expensive) results from a BEAST run, should it go in xdg data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sold on that as this data is being built on the system for the run and could be extracted from the .report.jsonl prompts of a run that used them. There is currently no check for if the file exists prior to building it, and the data is only valid for the specific model it was used with. I do think there could be some future enhancement that makes this more data class than cache just not based the current implementation and scope of this PR.

A clear output value that documents to the user where this was placed may be appropriate in a future expansion.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sold on that as this data is being built on the system for the run and could be extracted from the .report.jsonl prompts of a run that used them.

we have no code for this, though, right?

There is currently no check for if the file exists prior to building it, and the data is only valid for the specific model it was used with.

this suggests to me that it would make sense to place the output alongside run output data, using the same prefix

I do think there could be some future enhancement that makes this more data class than cache just not based the current implementation and scope of this PR.

yeah, this can def be split off into a separate issue for the related plugins. i do think we can get into trouble putting expensive results in a dir for transient content.

@@ -34,7 +35,7 @@
SAVE_RESULTS = True

resources_tap_data_file = (
garak._config.transient.basedir
garak._config.transient.cache_dir
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TAP has a pre-populated file in resources/tap/data that is distributed with garak. Maybe is makes sense to separate pre-loaded TAP resources from resources created while running TAP. The latter might belong in XDG data.

Copy link
Collaborator Author

@jmartin-tech jmartin-tech Jul 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reasoning here. Although this comment does expose that there is likely an enhancement to be done here as the cached version is only consumed in a live run not in TAPCached by default.

This again relates to this idea.

Comment on lines +26 to +27
_config.transient.data_dir
/ _config.reporting.report_dir
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to double check - does _config.reporting.report_dir always have to be a subdir of _config.transient.data_dir?

Copy link
Collaborator Author

@jmartin-tech jmartin-tech Jul 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, if report_dir is a fully resolved path it will be used as is, see the test_report_dir_full_path() test in test_config.py.

tests/test_config.py Show resolved Hide resolved
@leondz leondz added the architecture Architectural upgrades label Jul 30, 2024
@jmartin-tech jmartin-tech force-pushed the feature/user-storage-consolidation branch from 0cad75c to 49bc1eb Compare July 30, 2024 16:21
@jmartin-tech jmartin-tech merged commit 4e30a3f into NVIDIA:main Jul 30, 2024
16 checks passed
@jmartin-tech jmartin-tech deleted the feature/user-storage-consolidation branch July 30, 2024 16:29
@github-actions github-actions bot locked and limited conversation to collaborators Jul 30, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
architecture Architectural upgrades
Projects
None yet
Development

Successfully merging this pull request may close these issues.

put garak data all in one place
2 participants