add a critique model to the mix to improve the answers #23
Interesting idea. Here are my two cents on this. I am in favor of the idea of rating and critique; I think it is a good way of enhancing quality.

Issue 1: I agree with the 5 criteria for scoring. However, I believe that assigning the same weight to each one does not help increase quality: some criteria are more important than others. For instance, I feel that Coherence and Accuracy matter more than, say, Completeness or Relevance, and that Clarity is the least important. This critique model is intended to score outputs that will be fed to a subsequent layer, so these responses do not need to be clear to a human reader (it is not a human who will need to understand them). It is also not so important for these responses to be complete, because one of the aggregator model's jobs is to build a complete answer summarizing partial ones. The same goes for relevance: we can trust the aggregator model to pick and choose the most relevant bits from all the partial answers. (Also, don't forget that the aggregator will come up with its own response to the prompt, adding the knowledge of the previous models' answers to it.)

Issue 2: While the idea of moving responses with a passing score along the chain to subsequent layers is good, I am not so keen on the idea of throwing them back to the previous layer for revisions. One of the ideas of MoA, I believe, is to leverage smaller models with less compute cost to achieve the same performance as larger, more expensive ones. If we start adding loops of score -> revision -> rescoring -> re-revision and so on, we are basically making many requests to the same model for a response that is not guaranteed to pass the next layer's scoring (it is possible for a model to generate a lower-quality response based on a previous one). The revision step will potentially add cost without adding value, in my opinion.

Issue 3: I think adding critique to all layers is overkill. MoA already trusts the collective intelligence of many models, with each one contributing its knowledge to the overall aggregated and summarized response. Adding critique to all layers effectively makes the critique model more important than any other, which means that a bug, a poor choice of model, or a faulty training process will destroy the effectiveness of the MoA implementation. Since the critique model decides what is good enough to continue and what is not, a biased critique model can ensure biased final responses. As a real-world example: if the question is "what is the best animal" and my brain is the critique model, I can ensure that the final answer will be "cat" regardless of what the other layers say.

I believe that the best way to address issues is to attempt to provide a solution, so here are my opinions on how these issues can be fixed:

Issue 1: I would keep the same 5 criteria but give each a multiplier or a different score range. For instance, Coherence and Accuracy can go from 1 to 10, Completeness and Relevance from 1 to 7, and Clarity from 1 to 5. You still average all the results to get a final score. This ensures that a good Completeness score is significant, but not as important as the others.

Issue 2: I would just discard the non-passing response. If an answer is not good enough, it is not passed to the next layer, period. The subsequent layers will take care of generating a better response and will only be influenced by the passing ones from the previous layers.

Issue 3: I see two possible ways of dealing with the "enforced bias" problem. The first is to assign different passing scores to different layers; for instance, layer 1 has a passing score of 4, layer 2 a passing score of 5, and layer 3 a passing score of 7. This way, the critique model has less influence on the overall outcome. The second way, and the one I would prefer, is to apply the critique only before the last proposer layer. That way the last layer gets quality inputs while still having the freedom to add its own contribution to the responses, and the aggregator still has a variety of answers to aggregate, summarize, and complete.

To finish with this longer-than-life comment, I would propose that the critique model be a specialized evaluator, that is, not a generalist language model but one that excels at evaluating. It doesn't even need to be a language model; it could be a simple classifier network, maybe?
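To make the two fixes concrete, here is a minimal Python sketch of the weighted score ranges (Issue 1) and the discard-instead-of-revise filter (Issue 2). The function names and the `critic` interface are illustrative assumptions, not the project's actual API; per-layer passing scores (the first option for Issue 3) fall out naturally from the `passing_score` parameter.

```python
# Sketch of the proposals above (hypothetical interfaces, not the repo's API).
# Issue 1: each criterion gets its own score range, so the plain average still
# rewards completeness/relevance/clarity but weights them less.
CRITERION_MAX = {
    "coherence": 10,
    "accuracy": 10,
    "completeness": 7,
    "relevance": 7,
    "clarity": 5,
}

def weighted_score(scores: dict) -> float:
    """Clamp each criterion to its own maximum, then average the raw values."""
    clamped = [min(scores[name], cap) for name, cap in CRITERION_MAX.items()]
    return sum(clamped) / len(CRITERION_MAX)

# Issue 2: discard non-passing responses instead of looping back for revision.
def filter_responses(responses: list, critic, passing_score: float) -> list:
    """Keep only responses whose weighted score passes the layer's threshold.

    `critic(text)` is an assumed helper returning a dict of per-criterion
    scores. Different layers can pass different `passing_score` values
    (e.g. 4, 5, 7) to weaken the critic's influence on the final outcome.
    """
    return [r for r in responses if weighted_score(critic(r)) >= passing_score]
```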
Enhanced Workflow with Scoring Mechanism for Model-Only Process
Scoring Criteria
The criteria for evaluation will remain the same:
Clarity: How clear and understandable the response is.
Relevance: How relevant the response is to the prompt.
Accuracy: How factually correct the response is.
Completeness: How thoroughly the response addresses the prompt.
Coherence: How logically consistent the response is.
Each criterion is scored from 0 to 10, and the overall score is the average of these scores. The minimum passing score will be set at 7.
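For concreteness, the equal-weight scheme described here amounts to a plain average over the five criteria. A minimal sketch (names are illustrative):

```python
# Equal-weight scoring: each criterion in [0, 10], the overall score is the
# plain average, and 7 is the minimum passing score.
CRITERIA = ("clarity", "relevance", "accuracy", "completeness", "coherence")
PASSING_SCORE = 7

def overall_score(scores: dict) -> float:
    return sum(scores[c] for c in CRITERIA) / len(CRITERIA)

def passes(scores: dict) -> bool:
    return overall_score(scores) >= PASSING_SCORE
```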
Workflow Steps
Input Prompt:
The initial input is fed into the first layer.
Layer 1:
Three agents $A_{1,1}$, $A_{1,2}$, and $A_{1,3}$ process the input independently.
Intermediate outputs are generated and concatenated.
Critique 1 with Scoring:
A critique agent evaluates the concatenated output using the criteria (Clarity, Relevance, Accuracy, Completeness, Coherence).
Each criterion is scored from 0 to 10.
The overall score is the average of the criteria scores.
If the overall score is >= 7, the output is passed to Layer 2.
If the overall score is < 7, the output is sent back to Layer 1 for revision by the agents.
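As a sketch of how a layer plus its critique gate might be wired up (hypothetical helper names, not the repo's actual API): the agents answer independently, their outputs are concatenated, and the gate loops back for revision until the average score passes. The `max_rounds` cap is my addition; the workflow as stated does not bound the revision loop.

```python
from concurrent.futures import ThreadPoolExecutor

def run_layer(prompt: str, agents) -> str:
    """Run the layer's agents (e.g. A_{1,1}..A_{1,3}) on the same prompt
    in parallel and concatenate their intermediate outputs."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        responses = list(pool.map(lambda agent: agent(prompt), agents))
    return "\n\n".join(responses)

def critique_gate(prompt: str, agents, critic, passing_score: float = 7.0,
                  max_rounds: int = 3) -> str:
    """Score the concatenated layer output and send it back for revision
    until it passes. `critic(text)` is an assumed helper returning a dict
    of per-criterion scores in [0, 10]; the revision request simply appends
    the failing output so the agents can improve on it."""
    output = run_layer(prompt, agents)
    for _ in range(max_rounds):
        scores = critic(output)
        if sum(scores.values()) / len(scores) >= passing_score:
            return output                      # pass to the next layer
        # Revision round: re-run the layer with the failing output attached.
        output = run_layer(f"{prompt}\n\nPrevious attempt to revise:\n{output}",
                           agents)
    return output                              # best effort after the cap
```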
Layer 2:
The adjusted output from Critique 1 is processed by agents $A_{2,1}$, $A_{2,2}$, and $A_{2,3}$.
Intermediate outputs are generated and concatenated.
Critique 2 with Scoring:
A critique agent evaluates the outputs from Layer 2 using the same criteria.
Outputs are scored and averaged.
If the overall score is >= 7, the output is passed to Layer 3.
If the overall score is < 7, the output is sent back to Layer 2 for revision by the agents.
Layer 3:
The adjusted output from Critique 2 is processed by agents $A_{3,1}$, $A_{3,2}$, and $A_{3,3}$.
Intermediate outputs are generated and concatenated.
Critique 3 with Scoring:
A final critique agent evaluates the outputs from Layer 3.
Outputs are scored and averaged.
If the overall score is >= 7, the output is passed to Layer 4.
If the overall score is < 7, the output is sent back to Layer 3 for revision by the agents.
Layer 4:
The final adjusted output is processed by agent $A_{4,1}$.
The Final Output is produced.
Final Output:
The output from Layer 4 is the final output, having passed all critique evaluations and scoring criteria.
Diagram Summary:
Input Prompt -> Layer 1 -> Critique 1 with Scoring -> (Pass if score >= 7 or Revise if score < 7) -> Layer 2 -> Critique 2 with Scoring -> (Pass or Revise) -> Layer 3 -> Critique 3 with Scoring -> (Pass or Revise) -> Layer 4 -> Final Output
Example Diagram Description:
Input Prompt: Initial input is fed into Layer 1.
Layer 1: Agents $A_{1,1}$, $A_{1,2}$, and $A_{1,3}$ process the input independently, generating intermediate outputs which are concatenated.
Critique 1 with Scoring: A critique agent evaluates the concatenated output, scoring it on clarity, relevance, accuracy, completeness, and coherence. If the score is >= 7, the output passes to Layer 2; otherwise, it is sent back to Layer 1.
Layer 2: Agents $A_{2,1}$, $A_{2,2}$, and $A_{2,3}$ process the adjusted output, generating new intermediate outputs which are concatenated.
Critique 2 with Scoring: The critique agent evaluates the new outputs, scoring them as before. Outputs scoring >= 7 pass to Layer 3; others are sent back to Layer 2.
Layer 3: Agents $A_{3,1}$, $A_{3,2}$, and $A_{3,3}$ process the further adjusted output, generating final intermediate outputs which are concatenated.
Critique 3 with Scoring: The final critique agent evaluates and scores the outputs. Outputs scoring >= 7 pass to Layer 4; others are sent back to Layer 3.
Layer 4: The final agent $A_{4,1}$ processes the output to produce the final answer.
This workflow ensures each layer's output meets a quality threshold before advancing, thereby enhancing the final output's overall quality.
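Putting the whole diagram together, the flow reduces to a loop over proposer layers with a critique gate between each, followed by the single Layer 4 aggregator. This reuses the hypothetical `critique_gate` from the sketch above; all the callables are assumed interfaces for illustration, not the project's actual ones.

```python
def moa_pipeline(prompt, layers, critics, aggregator, passing_score=7.0):
    """Input Prompt -> Layer 1 -> Critique 1 -> ... -> Layer 3 -> Critique 3
    -> Layer 4 -> Final Output, as in the diagram summary above.

    `layers` is a list of agent lists (one per proposer layer), `critics`
    the matching critique callables, and `aggregator` the final single
    agent A_{4,1}."""
    output = prompt
    for agents, critic in zip(layers, critics):
        output = critique_gate(output, agents, critic, passing_score)
    return aggregator(output)   # Layer 4 produces the Final Output
```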