FEAT: Supporting the new tongue tied Gandalf levels #356
base: main
…equest` to create normalizer requests
…rer and target, works for all 5 levels. Passes levels 1 to 3 within 1-30 queries (only level 1 prompt provided).
"source": [ | ||
"## Orchestrator\n", | ||
"\n", | ||
"We will build our own simple orchestrator that asks the Red Teaming model to produce a prompt and refine it at every turn, passing it the conversation so far with feedback from the scorer at each turn and uses chain-of-thought reasoning to refine the prompt. This is a streamlined sequential version of the PAIR orchestrator in [pair_orchestrator.py](../../../pyrit/orchestrator/pair_orchestrator.py), though the idea is older and can be traced back to other attempts at solving Gandalf challenges like [LLMFuzzAgent](https://github.com/corca-ai/LLMFuzzAgent) and [Gandalf vs. Gandalf](https://github.com/microsoft/gandalf_vs_gandalf)." |
Why do this instead of using the actual PAIR orchestrator (or TAP, or another one)?
A silly reason is that the PAIR orchestrator had a typo and was misbehaving (this PR also fixes it), but I also wanted to start with a streamlined orchestrator that I understood. It wasn't entirely clear to me how to parse and reformat responses to give more direct feedback to the attacker (perhaps by writing an ad-hoc scorer that provides that feedback and modifying the PAIR orchestrator to use it?).
Ultimately, I think we do want to use (and maybe generalize) the PAIR orchestrator to do this, and switch to a float_scale scorer that counts the number of successful LLM responses (in level 5, the same prompt needs to trick 3 different LLMs).
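To make the float_scale idea concrete, here is a minimal sketch of a score computed as the fraction of target LLMs that a single prompt tricked. It is plain Python and does not use PyRIT's scorer classes; the function name `fraction_tricked` and the list-of-booleans input are assumptions made for illustration only.

```python
from typing import Sequence


def fraction_tricked(successes: Sequence[bool]) -> float:
    """Return the share of target LLMs that were tricked, in [0.0, 1.0].

    `successes` holds one boolean per target LLM (hypothetical input shape;
    how each boolean is derived from a response is out of scope here).
    """
    if not successes:
        raise ValueError("at least one per-LLM result is required")
    # True counts as 1 and False as 0, so the sum is the number of successes.
    return sum(successes) / len(successes)
```

For level 5, a prompt that tricks two of the three LLMs would score 2/3 under this sketch, which gives the attacker a gradient to follow instead of a flat failure.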
This is the only thing I'm hesitant on before merging. Because we run these notebooks to test, I would rather not have this "new orchestrator" as a dependency as we release. Can you update to use PAIR (even if imperfectly)? You could also use TAP which is more robust and works identically if configured right.
I'll give it a try. A risk is that the glue code might end up being longer and harder to maintain than an ad-hoc orchestrator.
Glue code is still useful to us, because we want these orchestrators to be as generic as possible; so if there are pain points we want to make it easier if we can :)
Waiting on the orchestrator updates; ping the team when it's ready for a re-review
Description
This PR adds support for the new Gandalf levels (namely Tongue Tied) released on 29th August 2024. These levels require a custom scorer, since the goal is not to find a password.
Co-authored with @s-zanella
Tests and Documentation
Tests added for target and scorer in line with previous Gandalf modules.
A notebook with an ad-hoc orchestrator for the Tongue Tied levels, which works for all 5 levels. It passes levels 1 to 3 within 1 to 30 queries (only the level 1 prompt is provided).
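For readers who have not opened the notebook, the sequential refinement loop it describes can be sketched roughly as follows. This is not the notebook's code and does not use PyRIT classes: `run_attack`, the three callables, and the `max_turns=30` default (chosen to mirror the query budget quoted above) are all hypothetical stand-ins.

```python
from typing import Callable, List, Optional, Tuple

# One turn of history: (prompt sent, target response, scorer feedback).
Turn = Tuple[str, str, str]


def run_attack(
    attacker: Callable[[List[Turn]], str],      # hypothetical: history -> next prompt
    target: Callable[[str], str],               # hypothetical: prompt -> response
    scorer: Callable[[str], Tuple[bool, str]],  # hypothetical: response -> (success, feedback)
    max_turns: int = 30,
) -> Optional[str]:
    """Return the first prompt the scorer accepts, or None if the budget runs out."""
    history: List[Turn] = []
    for _ in range(max_turns):
        # The attacker sees the whole conversation so far, including feedback.
        prompt = attacker(history)
        response = target(prompt)
        success, feedback = scorer(response)
        if success:
            return prompt
        history.append((prompt, response, feedback))
    return None
```

Keeping the attacker, target, and scorer as plain callables is only a convenience for the sketch; in the PR these roles are filled by the red teaming model, the Gandalf target, and the custom scorer.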