Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

detector: Use LLMaaJ to see if models are following payload instructions #992

Open
leondz opened this issue Nov 11, 2024 · 0 comments
Open
Labels
architecture Architectural upgrades detectors work on code that inherits from or manages Detector
Milestone

Comments

@leondz
Copy link
Collaborator

leondz commented Nov 11, 2024

Requires separation of:

  • Strings used in building prompts, as components
  • Malicious instruction
  • Description of bad behaviour prompt is trying to elicit

The thing we're testing for is set by the probe when building the prompt, is carried in the attempt, and is processed by the detector

@leondz leondz added the architecture Architectural upgrades label Nov 11, 2024
@leondz leondz added this to the 24.12 milestone Nov 11, 2024
@leondz leondz added the detectors work on code that inherits from or manages Detector label Nov 14, 2024
@leondz leondz changed the title feature: Use LLMaaJ to see if models are following payload instructions detector: Use LLMaaJ to see if models are following payload instructions Nov 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
architecture Architectural upgrades detectors work on code that inherits from or manages Detector
Projects
None yet
Development

No branches or pull requests

1 participant