
Content Moderation #56

Open
Keyrxng opened this issue Nov 2, 2024 · 2 comments
Comments

@Keyrxng
Member

Keyrxng commented Nov 2, 2024

I've suggested a version of this before but I think it should be given another thought and re-evaluated.

Look at this:
[image: screenshot of the offending comment]

I luckily caught the message after only a minute or so, but had it not been caught immediately (not ana in particular, but any contributor within any partner org) and the contributor fell for the phishing attempt or whatever it may be, it would be a bad look for us when the solution is pretty simple.


Build a content moderation plugin that can be configured for different sorts of content moderation, focusing on the highest priority few for V1.

The obvious option is to use a very cheap LLM and moderate every comment that comes through. Comments could go through various pre-processing steps to reduce LLM usage. There are LLMs built specifically for this, but ol' trusty GPT would do just as well, I'd imagine.

I'm sure we could use a non-LLM approach and NLP it, but I'm unsure of the implementation details.
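To sketch what the pre-processing could look like: a cheap, deterministic gate that decides whether a comment needs to hit the LLM at all. Everything here is hypothetical (the interface, the trust signals, and the URL heuristic are assumptions, not the plugin's actual design):

```typescript
// Hypothetical context the plugin would assemble from the bot's database
// and the webhook payload; field names are illustrative only.
interface CommentContext {
  authorHasCommentedBefore: boolean; // prior activity in the org
  authorIsDbUser: boolean; // registered in the partner's user DB
  body: string;
}

// Cheap pre-filter: trusted authors posting link-free comments skip the
// LLM entirely; everything else gets forwarded for moderation.
function needsLlmModeration(ctx: CommentContext): boolean {
  const containsUrl = /https?:\/\/\S+/i.test(ctx.body);
  const isTrusted = ctx.authorHasCommentedBefore && ctx.authorIsDbUser;
  return containsUrl || !isTrusted;
}
```

Signals like `hasCommented || isDbUser` slot straight into `isTrusted`; the point is just that most comments from known contributors never incur an LLM call.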

@0x4007
Member

0x4007 commented Nov 2, 2024

It's a bit vague how exactly this is supposed to moderate that out. What does the prompt look like?

@Keyrxng
Member Author

Keyrxng commented Nov 2, 2024

> It's a bit vague how exactly this is supposed to moderate that out. What's the prompt look like?

  1. There are specific models built for it, but GPT would still kill it.
  2. The pre-processing would need to be iterated on, but things like `hasCommented || isDbUser`, that kind of thing.
  3. The prompt we'd need to iterate on too, but something to the effect of:

You are tasked with detecting fraudulent and/or malicious interactions via GitHub comments.
Primarily you are to assert that no comment ever contains:

  • a phishing attempt
  • ...

${orgName} supports the following:

  • a ticketing system, only ever through the exact-match url: ${config.ticketingUrl}
  • ${...config.onlySupportsStrings}

${orgName} will never:

  • ${...config.strings || config.defaults} ask for a user's password, or reach out via anything other than official urls

${...config.officialUrls}


We can make the prompt built practically entirely from the strings they pass in via the config, or they can build the whole prompt themselves if they want. We can iterate on a solid default prompt with config variables at first.
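A minimal sketch of that config-driven assembly, with an escape hatch for partners who want to supply the whole prompt. The config shape and field names are assumptions for illustration, not a committed schema:

```typescript
// Hypothetical partner config; the real plugin would define its own schema.
interface ModerationConfig {
  orgName: string;
  ticketingUrl: string;
  officialUrls: string[];
  supports: string[]; // extra "${orgName} supports" bullets
  neverDoes: string[]; // extra "${orgName} will never" bullets
  promptOverride?: string; // partners may replace the prompt wholesale
}

// Assemble the moderation prompt from config strings, falling back to a
// sensible default skeleton when no override is provided.
function buildModerationPrompt(config: ModerationConfig): string {
  if (config.promptOverride) return config.promptOverride;
  return [
    "You are tasked with detecting fraudulent and/or malicious interactions via GitHub comments.",
    "Primarily you are to assert that no comment ever contains:",
    "- a phishing attempt",
    "",
    `${config.orgName} supports the following:`,
    `- a ticketing system, only ever through the exact-match url: ${config.ticketingUrl}`,
    ...config.supports.map((s) => `- ${s}`),
    "",
    `${config.orgName} will never:`,
    "- ask for a user's password, or reach out via anything other than official urls",
    ...config.neverDoes.map((s) => `- ${s}`),
    "",
    "Official urls:",
    ...config.officialUrls.map((u) => `- ${u}`),
  ].join("\n");
}
```

The override-or-assemble split keeps the common case (a few config strings) cheap while still letting a partner own the full prompt text.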
