detector: Use LLMaaJ to see if models are following payload instructions #992

leondz · 2024-11-11T16:07:08Z

Requires separation of:

Strings used in building prompts, as components
Malicious instruction
Description of bad behaviour prompt is trying to elicit

The thing we're testing for is set by the probe when building the prompt, is carried in the attempt, and is processed by the detector

leondz added the architecture Architectural upgrades label Nov 11, 2024

leondz added this to the 24.12 milestone Nov 11, 2024

leondz added the detectors work on code that inherits from or manages Detector label Nov 14, 2024

leondz changed the title ~~feature: Use LLMaaJ to see if models are following payload instructions~~ detector: Use LLMaaJ to see if models are following payload instructions Nov 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

detector: Use LLMaaJ to see if models are following payload instructions #992

detector: Use LLMaaJ to see if models are following payload instructions #992

leondz commented Nov 11, 2024

detector: Use LLMaaJ to see if models are following payload instructions #992

detector: Use LLMaaJ to see if models are following payload instructions #992

Comments

leondz commented Nov 11, 2024