Why are the experimental results different? #84

Open
zl-comment opened this issue Sep 18, 2024 · 3 comments

@zl-comment

zl-comment commented Sep 18, 2024

bertattack

original prompt: Evaluate the sentiment of the given text and classify it as 'positive' or 'negative':
original score: 0.4934426229508197
attacked prompt: Evaluate the sеntiment of the given text and classification it as 'positive' or 'negative':
attacked score: 0.4852459016393443
PDR: 0.01661129568106314

original prompt: Given the context of this text, indicate if the emotion conveyed is 'positive' or 'negative':
original score: 0.5
attacked prompt: Given the context of this text, indicatе if the emoton conveyed is 'positive' or 'negative':
attacked score: 0.4737704918032787
PDR: 0.05245901639344264

original prompt: Analyze the tone of this statement and respond with either 'positive' or 'negative':
original score: 0.5180327868852459
attacked prompt: Anaⅼyze the tone of this statement and respond with nor 'positive' or 'negative':
attacked score: 0.460655737704918
PDR: 0.1107594936708861

original prompt: In the role of a sentiment analysis tool, respond with 'positive' or 'negative' to classify this statement:
original score: 0.5229508196721312
attacked prompt: In the role of a sentiment analyse instrumentation, respond with 'positive' or 'negative' to classify this statemen𝚝:
attacked score: 0.4639344262295082
PDR: 0.11285266457680258

original prompt: Functioning as a sentiment identification tool, assess if the following expression is 'positive' or 'negative'. Classify:
original score: 0.5180327868852459
attacked prompt: Functionіng as a sentiment identіfication tool, assess if the following expression is 'positive' or 'negative'. Classify:
attacked score: 0.4639344262295082
PDR: 0.10443037974683543

original prompt: Serving as a sentiment evaluation model, determine if the given statement is 'positive' or 'negative'. Classify:
original score: 0.5278688524590164
attacked prompt: Serve as a sentimeոt evaluation modelling, determine if the given statement is 'positive' or 'negative'. Classify:
attacked score: 0.4721311475409836
PDR: 0.10559006211180119
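
For reference, the PDR values listed above are consistent with the performance drop rate computed as 1 - attacked score / original score. A minimal sketch for re-checking them; the function name `pdr` is a hypothetical helper, not part of the promptbench API:

```python
def pdr(original_score: float, attacked_score: float) -> float:
    """Performance drop rate: relative accuracy loss caused by the attack."""
    if original_score == 0:
        raise ValueError("original_score must be non-zero")
    return 1 - attacked_score / original_score


# Scores copied from the first and third results in the log above.
first = pdr(0.4934426229508197, 0.4852459016393443)
third = pdr(0.5180327868852459, 0.460655737704918)
print(first, third)
```

Because PDR is relative to the original score, a run that starts near 50% accuracy can show a PDR similar to a run that starts near 95%, so the PDR alone does not reveal the gap in baseline accuracy.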

With the prompts from the paper, the original (pre-attack) accuracy is only about 50%, whereas the original accuracy reported in the source code is about 90%.
With the prompts from the sample code (https://github.com/microsoft/promptbench/blob/main/examples/prompt_attack.ipynb), the original accuracy is above 90%:

bertattack

'original prompt': "Evaluate the sentiment of the given text and classify it as 'positive' or 'negative':\nQuestion: {content}\nAnswer:"
'original score': 0.9459016393442623
'attacked prompt': "Evaluate the sincerity of the received text and rate it as 'positive' or 'negative':\ncuriosity: {content}\nAnswer:"
'attacked score': 0.9147540983606557
'PDR': 0.03292894280762571(0.9147540983606557~0.9557377049180328)

'original prompt': "Given the context of this text, indicate if the emotion conveyed is 'positive' or 'negative':\nQuestion: {content}\nAnswer:"
'original score': 0.9393442622950819
'attacked prompt': "i the context of this text, indicate if the emotion conveyed is 'positive' or 'negative':\nQuestion: {content}\nfield:"
'attacked score': 0.9229508196721311
'PDR': 0.017452006980802813(0.9229508196721311~)

'original prompt': "Given the context of this text, indicate if the emotion conveyed is 'positive' or 'negative':\nQuestion: {content}\nAnswer:"
'original score': 0.9442622950819672
'attacked prompt': "Given the clarity of this section, specify if the emotion depicted is 'positive' or 'negative':\ncuriosity: {content}\nAnswer:"
'attacked score': 0.8885245901639345
'PDR': 0.05902777777777775

'original prompt': "Analyze the tone of this statement and respond with either 'positive' or 'negative':\nQuestion: {content}\nAnswer:"
'original score': 0.9475409836065574
'attacked prompt': "Analyze the syntax of this phrase and address with meaning 'positive' or 'negative':\nriddle: {content}\nAnswer:"
'attacked score': 0.6081967213114754
'PDR': 0.35813148788927335

'original prompt': "In the role of a sentiment analysis tool, respond with 'positive' or 'negative' to classify this statement:\nQuestion: {content}\nAnswer:"
'original score': 0.9540983606557377
'attacked prompt': "in the role of a sentiment analysis game, respond with 'positive' or 'negative' to classify this function:\ncuriosity: {content}\nquestions:"
'attacked score': 0.8918032786885246
'PDR': 0.06529209621993128

'original prompt': "Functioning as a sentiment identification tool, assess if the following expression is 'positive' or 'negative'. Classify:\nQuestion: {content}\nAnswer:"
'original score': 0.9475409836065574
'attacked prompt': "Functioning as a emotion correlation tool, study if the following expression is 'positive' or 'negative'. find:\ncuriosity: {content}\nAnswer:"
'attacked score': 0.8918032786885246
'PDR': 0.05882352941176468

'original prompt': "Serving as a sentiment evaluation model, determine if the given statement is 'positive' or 'negative'. Classify:\nQuestion: {content}\nAnswer:"
'original score': 0.9491803278688524
'attacked prompt': "which as a sentiment evaluation models, determine if the given words is 'positive' or 'negative'. model:\ncorrectly: {content}\nasking:"
'attacked score': 0.8885245901639345
'PDR': 0.06390328151986174
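
One visible difference between the two runs is that every prompt in the second block ends with `\nQuestion: {content}\nAnswer:`, while the prompts in the first block stop at the instruction. Whether this accounts for the accuracy gap is not confirmed in this thread. A minimal sketch, in plain Python with hypothetical helper names (`ensure_template`, `fill`), for bringing the paper prompts into the same format before re-running:

```python
# Suffix taken verbatim from the sample-code prompts listed above.
TEMPLATE_SUFFIX = "\nQuestion: {content}\nAnswer:"


def ensure_template(prompt: str) -> str:
    """Append the Question/Answer suffix unless the prompt already has a {content} slot."""
    return prompt if "{content}" in prompt else prompt + TEMPLATE_SUFFIX


def fill(prompt: str, text: str) -> str:
    """Insert one input sentence into the (templated) prompt."""
    return ensure_template(prompt).replace("{content}", text)


paper_prompt = "Analyze the tone of this statement and respond with either 'positive' or 'negative':"
print(fill(paper_prompt, "a charming and often affecting journey"))
```

The example sentence is only an illustration. `str.replace` is used instead of `str.format` so that stray braces in an attacked prompt cannot raise an error.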

@Immortalise
Collaborator

Hi, could you please indicate which model you are using for the attack? The difference may arise from the use of a different model.

@zl-comment
Author

model='google/flan-t5-large'

@Immortalise
Collaborator

Could you please check and compare the results here? On this website, the results for T5 on the SST-2 dataset are around 95%.

