document typology in policy.rst
leondz committed Nov 7, 2024
1 parent 61f0b37 commit 5d1981f
Showing 1 changed file with 50 additions and 0 deletions.
50 changes: 50 additions & 0 deletions docs/source/policy.rst
@@ -25,6 +25,56 @@ policy report file. After completion, garak estimates a policy based on policy p
results, and writes this to both main and policy reports.


Model behaviour typologies
--------------------------

Goal
^^^^

The model behaviour typology enumerates many different types of target behaviour. The listed behaviours help structure a policy for model output. For each behaviour, one can choose whether a model should or should not engage in that activity.

The typology serves as a point of departure for building model content policies, as well as a framework for describing model behaviour.

This typology is hierarchical, but labels are not “hard”. That is, multiple categories might apply to a single candidate behaviour, and that's OK.

Because the range of possible model behaviours is large and forms an open set, this typology is not comprehensive, and is not designed or intended ever to be comprehensive.

To optimise effort spent building this typology, it's best to prioritise adding and defining categories for which we actually have payloads.

Usage
^^^^^

To use this typology to describe a model deployment, examine each category and check if the target model engages with that behaviour directly, without using any adversarial techniques.

To use this typology to describe intended deployed model policy, consider each category in turn and decide how the model should react. The set of possible reactions can be as simple as "Engage" and "Decline".
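
As a rough illustration of the second usage, an intended policy can be sketched as a mapping from category to reaction. The category labels, the ``Reaction`` enum, and the ``categories_with`` helper below are hypothetical names chosen for this sketch; they are not part of garak.

```python
from enum import Enum


class Reaction(Enum):
    """The two simplest possible reactions named in the text."""
    ENGAGE = "Engage"
    DECLINE = "Decline"


# Hypothetical category labels, used only for illustration.
intended_policy = {
    "hypothetical category A": Reaction.ENGAGE,
    "hypothetical category B": Reaction.DECLINE,
    "hypothetical category C": Reaction.ENGAGE,
}


def categories_with(policy, reaction):
    """Return the sorted category labels whose intended reaction matches."""
    return sorted(name for name, chosen in policy.items() if chosen is reaction)
```

Calling ``categories_with(intended_policy, Reaction.DECLINE)`` lists the categories the model is meant to decline. A richer set of reactions could be added as further enum members.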

Policy point guidelines
^^^^^^^^^^^^^^^^^^^^^^^

* Each point describes something that the model does, i.e. a behaviour
* Given a decent prompt representing a policy, and a model's response, it should be possible to discern in isolation whether or not the model is engaging or refusing for that prompt/response pair
* Prioritise enumerating policies that reflect things we have tests for (or can reasonably test for)
* It's great to have two sample prompts per point
* We want to stick to a maximum of three levels if at all possible
* Multiple inheritance is fine, e.g. a probe might represent multiple points in this typology

Policy metadata
^^^^^^^^^^^^^^^

The total set of points in the behaviour typology can be represented as a dictionary. Definitions of policy names, descriptions, and behaviours are stored in a JSON data file.

* Key: behaviour identifier; the format is ``TDDDs*``

  * ``T``: a top-level hierarchy code letter, one of ``C``, ``T``, ``M``, ``S`` for chat/tasks/meta/safety
  * ``DDD``: a three-digit code for this behaviour
  * ``s*``: (optional) one or more letters identifying a sub-policy

* Value: a dict describing a behaviour

  * ``name``: a short name of what is permitted when this behaviour is allowed
  * ``description``: (optional) a deeper description of this behaviour

The structure of the identifiers describes the hierarchical structure.
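
The identifier format above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not garak's own implementation: the helper names (``parse_behaviour_id``, ``parent_id``), the acceptance of both upper- and lower-case sub-policy letters, the treatment of all trailing letters as a single sub-policy level, and the example entries are all hypothetical.

```python
import re
from typing import NamedTuple

# Assumed pattern for TDDDs*: one of C/T/M/S, three digits, then zero or
# more letters. The letter case of the sub-policy suffix is an assumption.
_ID_PATTERN = re.compile(r"([CTMS])([0-9]{3})([A-Za-z]*)")


class BehaviourId(NamedTuple):
    top_level: str   # T: chat/tasks/meta/safety code letter
    code: str        # DDD: three-digit behaviour code
    sub_policy: str  # s*: empty string when there is no sub-policy


def parse_behaviour_id(identifier: str) -> BehaviourId:
    """Split an identifier into its parts, or raise ValueError."""
    match = _ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a TDDDs* identifier: {identifier!r}")
    return BehaviourId(*match.groups())


def parent_id(identifier: str) -> str:
    """Return the next level up, assuming the letters form one sub-level."""
    parsed = parse_behaviour_id(identifier)
    if parsed.sub_policy:
        return parsed.top_level + parsed.code
    return parsed.top_level


# Hypothetical entries showing the key/value shape described above.
example_typology = {
    "C001": {"name": "Hypothetical behaviour", "description": "Illustrative only."},
    "C001a": {"name": "Hypothetical sub-policy"},
}
```

Under these assumptions the hierarchy has three levels: the top-level letter, the ``TDDD`` behaviour, and an optional sub-policy beneath it.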


.. automodule:: garak.policy
   :members:
   :undoc-members: