FEAT: Supporting the new tongue tied Gandalf levels #356

Open
wants to merge 20 commits into
base: main

@donebydan donebydan commented Sep 3, 2024

Description

This PR adds support for the new Gandalf levels (namely Tongue Tied) released on 29th August 2024. These levels need a custom scorer, since the goal is not to find a password.

Co-authored with @s-zanella

Tests and Documentation

Tests added for target and scorer in line with previous Gandalf modules.
Added a notebook with an ad-hoc orchestrator for the Tongue Tied levels that works for all 5 levels. It passes levels 1 to 3 within 1-30 queries (only the level 1 prompt is provided).

@donebydan donebydan changed the title FEAT: Supporting (the first two) tongue tied Gandalf levels FEAT: Supporting the new tongue tied Gandalf levels Sep 5, 2024
"source": [
"## Orchestrator\n",
"\n",
"We will build our own simple orchestrator that asks the Red Teaming model to produce a prompt and refine it at every turn, passing it the conversation so far with feedback from the scorer at each turn and uses chain-of-thought reasoning to refine the prompt. This is a streamlined sequential version of the PAIR orchestrator in [pair_orchestrator.py](../../../pyrit/orchestrator/pair_orchestrator.py), though the idea is older and can be traced back to other attempts at solving Gandalf challenges like [LLMFuzzAgent](https://github.com/corca-ai/LLMFuzzAgent) and [Gandalf vs. Gandalf](https://github.com/microsoft/gandalf_vs_gandalf)."
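
The refinement loop described in this notebook cell can be sketched roughly as follows. This is a minimal, self-contained illustration under stated assumptions, not the notebook's code and not PyRIT's API: `refine_until_success`, the three callables, and the `max_turns` default are all hypothetical names and values chosen for the example.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical types for illustration; none of these are PyRIT classes.
Turn = Tuple[str, str, str]              # (prompt, response, scorer feedback)
Attacker = Callable[[List[Turn]], str]   # conversation so far -> next prompt
Target = Callable[[str], str]            # prompt -> target response
Scorer = Callable[[str], Tuple[bool, str]]  # response -> (success, feedback)


def refine_until_success(
    attacker: Attacker,
    target: Target,
    scorer: Scorer,
    max_turns: int = 30,  # arbitrary budget, an assumption of this sketch
) -> Optional[str]:
    """Sequentially refine a prompt until the scorer reports success."""
    history: List[Turn] = []
    for _ in range(max_turns):
        # The attacker sees every earlier attempt together with its feedback.
        prompt = attacker(history)
        response = target(prompt)
        success, feedback = scorer(response)
        if success:
            return prompt
        history.append((prompt, response, feedback))
    # Budget exhausted without a successful prompt.
    return None
```

In a real setup the attacker would be the red teaming model and the target a Gandalf level; here they are plain callables so the control flow is easy to follow.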
Contributor

Why do this instead of using the actual PAIR orchestrator (or TAP, or another one)?

Contributor

A silly reason is that the PAIR orchestrator had a typo and was misbehaving (this PR also fixes it), but I also wanted to start with a streamlined orchestrator that I understood. It wasn't entirely clear to me how to parse and reformat responses to give more direct feedback to the attacker (perhaps by writing an ad-hoc scorer that provides that feedback and modifying the PAIR orchestrator to use it?).

Ultimately, I think we do want to use (and maybe generalize) the PAIR orchestrator to do this, and switch to a float_scale scorer that counts the number of successful LLM responses (in level 5, the same prompt needs to trick 3 different LLMs).
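
As a rough illustration of the float-scale idea, the score could be the fraction of LLMs that a single prompt tricks. The sketch below is hypothetical and does not use PyRIT's scorer interface; `success_fraction` is an illustrative name, and it assumes the per-LLM outcomes are already available as booleans.

```python
from typing import Sequence


def success_fraction(successes: Sequence[bool]) -> float:
    """Return the share of LLM responses that counted as a success, in [0, 1]."""
    if not successes:
        raise ValueError("at least one LLM outcome is required")
    # True counts as 1 and False as 0 when summed.
    return sum(successes) / len(successes)
```

For level 5, a prompt that tricks two of the three LLMs would score two thirds, which gives the attacker a more graded signal than a single true/false result.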

Contributor

This is the only thing I'm hesitant on before merging. Because we run these notebooks to test, I would rather not have this "new orchestrator" as a dependency as we release. Can you update to use PAIR (even if imperfectly)? You could also use TAP which is more robust and works identically if configured right.

Contributor

I'll give it a try. A risk is that the glue code might end up being longer and harder to maintain than an ad-hoc orchestrator.

Contributor

Glue code is still useful to us, because we want these orchestrators to be as generic as possible; so if there are pain points we want to make it easier if we can :)

Contributor

@rlundeen2 rlundeen2 left a comment

Waiting on the orchestrator updates; ping the team when it's ready for a re-review

6 participants