
Search #50

Closed
0x4007 opened this issue Sep 28, 2024 · 41 comments · Fixed by ubiquity-os-marketplace/command-ask#2

Comments

@0x4007 commented Sep 28, 2024

Description:

We have two goals that are closely aligned:

  1. Embedding-Based Search for Prior Conversations:
    We want a plugin that enables us to naturally ask questions related to previous conversations on GitHub using embeddings. This should ideally work across multiple GitHub organizations and repositories. For example, if someone asks, “What was the original reason for moving the LP tokens?”, the system should be able to search through all conversations and provide relevant information.

    This feature should include org-wide default search (with options to extend the search to multiple organizations as arguments).

    Example Context:

    It isn't clear to me if we redid the staking yet and if we should migrate. If so, perhaps we should make a new issue instead. We should investigate whether the missing LP tokens issue from the MasterChefV2.1 contract is critical to the decision of migrating or not.

    Next Step:

    • Implement embedding-based search across all issues and conversations (a minimal sketch follows this section).
  2. Temporary Slash Command for Context-Aware Search:
    As a stepping stone to the above, we propose a dedicated slash command /search to help contributors quickly search through existing threads and add value. The logic of this command would mimic that of the natural language embedding search, and eventually it would merge into the @UbiquityOS question-based syntax.

    Temporary Fix:

    • Use the same thresholds from the issue deduplication system while debugging and testing the logic in production.
    • Focus on conversation parsing and extracting key issue specifications to add value.

    Future Vision:

    • Eventually, we’d like to enhance this functionality with a browser extension to automatically surface relevant information while contributors are working on issues.

References

It isn't clear to me whether we redid the staking yet and whether we should migrate. If so, perhaps we should make a new issue instead.

What we have right now:

  1. MasterChefV2.1 contract (i.e. old staking) which seems to be working fine (almost)
  2. StakingFacet (i.e. new staking) which:
    a) is not used right now
    b) is not fully covered with tests, so it's hard to say that it works like the original MasterChef
    c) ideally requires an audit

The only issue with the MasterChefV2.1 contract (old staking) is the current GitHub issue about the missing LP tokens. If this issue is the result of owner actions or some other expected behavior, then it makes sense to save development time and simply use the old staking contract.

That is why we'd better find out the root cause of the missing LP tokens and then decide whether we should proceed with the old staking or migrate to the diamond's StakingFacet contract.

Originally posted by @rndquu in ubiquity/ubiquity-dollar#939 (comment)

There was a task carried out about this I remember, let me find it.
ubiquity-os/ubiquity-os-logger#35

Originally posted by @gentlementlegen in ubiquity-os-marketplace/text-conversation-rewards#132 (comment)

@0x4007 commented Sep 28, 2024

@sshivaditya2019 rfc

@sshivaditya2019

This can be easily accomplished with embeddings and a capable LLM that has a substantial context. The main challenge will be to maintain a large vector database where comments and conversations are stored and readily available. Instead of simply dumping entire message chains, these would need to be selectively curated.

@0x4007 commented Sep 29, 2024

What's a good time estimate?

@sshivaditya2019

What's a good time estimate?

If the comments have to be cleaned and cherry picked, for a good Context Retrieval, it should be around 1 week.

@sshivaditya2019

/start


ubiquity-os bot commented Sep 29, 2024

Deadline: Sun, Oct 6, 2:24 PM UTC
Beneficiary: 0xDAba6e01D15Db560b88C8F426b016801f79e1F69

Tip

  • Use /wallet 0x0000...0000 if you want to update your registered payment wallet address.
  • Be sure to open a draft pull request as soon as possible to communicate updates on your progress.
  • Be sure to provide timely updates to us when requested, or you will be automatically unassigned from the task.

@sshivaditya2019

@0x4007 Could you share the comments corpus or text? How will this work?

@0x4007 commented Sep 30, 2024

One consideration: I just realized we have to resolve a design problem first.

@Keyrxng is building a plugin where we can ask ChatGPT (with full linked issue conversation and pull context) any questions with the same syntax.

Perhaps it makes the most sense to also look up similar conversations (embeddings) and append their text to the LLM context window.

If there's some high match then append. If no high percentage match then don't append. I think this should make the user experience seamless when asking questions. In any case though, technically this should not be a new plugin. Look for the command-ask plugin.

I can't transfer this issue to that repository easily from my phone; I'll need to do it from my computer in a bit.

@sshivaditya2019 commented Sep 30, 2024

One consideration I just realized we have to resolve a design problem first.

@Keyrxng is building a plugin where we can ask ChatGPT (with full linked issue conversation and pull context) any questions with the same syntax.

Perhaps it makes the most sense to also look up similar conversations (embeddings) and appending their text to the LLM context window.

If there's some high match then append. If no high percentage match then don't append. I think this should make the user experience seamless when asking questions. In any case though, technically this should not be a new plugin. Look for the command-ask plugin.

I can't transfer this issue to that repository easily from my phone will need to from my computer in a bit

If that's the case, then I think this is blocked by the /gpt PR.

But this could be a separate plugin, as it deals with large troves of textual data and could probably be extended to chats on other platforms not related to code, like Telegram and Discord.

@0x4007 commented Sep 30, 2024

If that's the case then I think this is blocked by /gpt PR.

Well, perhaps we can have you take it over mid-week if it's not done. They are supposed to be focusing on the Telegram bridge plugin as a top priority, and that also isn't done.

But, this could be a separate plugin as it deals with large troves of textual data, and probably could be extended for chats over other platforms, not related to code like Telegram, Discord.

This is the philosophy we are taking with these plugins, especially the ones that focus on working with text.

For example, my vision is to have our conversation-rewards algorithm be generally applicable to Telegram conversations, etc. As for these AI-related features, yes, ideally as well.

Then UbiquityOS can be context-aware of every work input to the organization, which makes it more generally intelligent about everything happening.

@sshivaditya2019 commented Sep 30, 2024

If that's the case then I think this is blocked by /gpt PR.

Well perhaps we can have you take it over mid this week if its not done. They are supposed to be focusing on the Telegram bridge plugin as a top priority and that also isn't done.

I think they are almost done with it. So I can probably focus on the textual content/corpus required for this task in the meantime.

But, this could be a separate plugin as it deals with large troves of textual data, and probably could be extended for chats over other platforms, not related to code like Telegram, Discord.

This is the philosophy we are taking with these plugins, especially the ones that focus on working with text.

For example, my vision is to have our conversation-rewards algorithm be generally applicable for telegram conversations etc. As for these AI related features, yes ideally as well.

Then UbiquityOS can be context aware of every work input to the organization, which makes it more generally intelligent of everything happening.

Could you please share the links to the old issues/conversation threads? I can write a script for seeding them into the database.

@0x4007 commented Sep 30, 2024

Most of everything is within the @ubiquity organization (created in 2020), but we did break off our recent efforts into the new orgs @ubiquity-os and @ubiquity-os-marketplace.

Or if it's easier, please use the aggregated issues JSON in our directory.

It contains, at least, all the URLs to all the issues that we are monitoring for tasks/proposals. It does not include all their conversation contexts though.

I suppose the script can extract those URLs, and query the GitHub REST API for the conversations within each.
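
A rough sketch of what that script could look like, assuming the aggregated JSON yields plain issue URLs; `parseIssueUrl` and `fetchConversation` are hypothetical helpers built on Octokit, not an existing implementation:

```typescript
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// Parse "https://github.com/{owner}/{repo}/issues/{number}" into its parts.
function parseIssueUrl(url: string) {
  const [, owner, repo, , number] = new URL(url).pathname.split("/");
  return { owner, repo, issue_number: Number(number) };
}

// Fetch the issue body plus every comment for one tracked issue URL.
async function fetchConversation(url: string) {
  const { owner, repo, issue_number } = parseIssueUrl(url);
  const issue = await octokit.rest.issues.get({ owner, repo, issue_number });
  const comments = await octokit.paginate(octokit.rest.issues.listComments, {
    owner,
    repo,
    issue_number,
  });
  return [issue.data.body ?? "", ...comments.map((c) => c.body ?? "")];
}
```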

@Keyrxng commented Sep 30, 2024

Well perhaps we can have you take it over mid this week if its not done. They are supposed to be focusing on the Telegram bridge plugin as a top priority and that also isn't done.

Both are held back by review only.

As I understand this task:

  • we are backfilling the DB with all previous conversations across the ~700 tracked issues
  • we are using the backdated embeddings in order to find relevant text when @ubiquityos is queried
  • command-ask is going to embed the query, find similar/relevant content and feed it all to the LLM including the complete context command-ask is already aware of.

So this task really involves two rather simple parts:

  1. script to backfill the DB. (Adhering to new DB schema?)
  2. turn a command-ask query into an embedding, extract relevant text via the DB function findSimilarContent, and ask the LLM (a rough sketch of this flow is below).
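
A minimal sketch of part 2, assuming the DB function is exposed to supabase-js as an RPC; the snake_case RPC name, parameter names, and model choices here are guesses rather than the actual schema:

```typescript
import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

async function askWithSimilarContext(question: string, issueContext: string) {
  // 1. Embed the user's question.
  const embedding = (
    await openai.embeddings.create({ model: "text-embedding-3-small", input: question })
  ).data[0].embedding;

  // 2. Pull similar stored text via the DB function (name and params assumed).
  const { data: matches, error } = await supabase.rpc("find_similar_content", {
    query_embedding: embedding,
    match_threshold: 0.85,
    match_count: 10,
  });
  if (error) throw error;

  // 3. Append the matched text to the context command-ask already collected.
  const extraContext = (matches ?? []).map((m: { body: string }) => m.body).join("\n---\n");
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: "Answer using the linked issue context and prior conversations." },
      { role: "user", content: `${issueContext}\n\nRelevant prior conversations:\n${extraContext}\n\nQuestion: ${question}` },
    ],
  });
  return completion.choices[0].message.content;
}
```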

@sshivaditya2019

  • we are backfilling the DB with all previous conversation across the ~700 tracked issues

You’re correct that we are backfilling the database with conversation threads from various organizations. However, we will be selectively choosing content that is relevant to a central theme, which will be extracted using n-gram frequency analysis and topic modeling techniques. This approach is necessary to avoid including random discussions that don’t pertain to the actual concepts or topics, thereby helping to prevent model hallucination.

  • command-ask is going to embed the query, find similar/relevant content and feed it all to the LLM including the complete context command-ask is already aware of.

Pretty much that, except we need to apply stemming and lemmatization to ensure the LLM captures the context. We should also apply re-ranking techniques, as I am almost certain there will be overlapping contexts, so a reranking method like BM25 would be required.

So this task really involves two rather simple parts:

  1. script to backfill the DB. (Adhering to new DB schema?)

It would involve fetching the issues -> processing them -> dumping them into a SQL file -> and then executing the migrations.

  2. turn a command-ask query into an embedding, extract relevant text via the DB function findSimilarContent and ask LLM.

And incorporate some form of re-ranker and context distillation techniques to improve overall response quality while reducing model costs.
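
For reference, a self-contained toy of what a BM25-style re-ranking pass over candidates already retrieved by the embedding search could look like; this is illustrative only, not the scoring the plugin ships with:

```typescript
// Naive whitespace/punctuation tokenizer.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\W+/).filter(Boolean);
}

// Toy BM25 scorer; k1 and b are the usual defaults.
function bm25Rerank(query: string, candidates: string[], k1 = 1.5, b = 0.75) {
  const docs = candidates.map(tokenize);
  const avgLen = docs.reduce((sum, d) => sum + d.length, 0) / docs.length;
  const queryTerms = tokenize(query);

  // Inverse document frequency per query term.
  const idf = new Map<string, number>();
  for (const term of queryTerms) {
    const df = docs.filter((d) => d.includes(term)).length;
    idf.set(term, Math.log(1 + (docs.length - df + 0.5) / (df + 0.5)));
  }

  return candidates
    .map((text, i) => {
      const doc = docs[i];
      let score = 0;
      for (const term of queryTerms) {
        const tf = doc.filter((t) => t === term).length;
        score += (idf.get(term) ?? 0) * ((tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * doc.length) / avgLen)));
      }
      return { text, score };
    })
    .sort((x, y) => y.score - x.score);
}
```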

@sshivaditya2019

I think this would function very effectively as a separate plugin with a higher request or execution limit. It is likely to be used more frequently than the /gpt command, as more people will seek information about the platform or project across various channels such as Telegram, Discord, or even a web-based chatbot on the homepage.

@0x4007 commented Sep 30, 2024

I think this would function very effectively as a separate plugin with a higher request or execution limit. It is likely to be used more frequently than the /gpt command, as more people will seek information about the platform or project across various channels such as Telegram, Discord, or even a web-based chatbot on the homepage.

Much better UX to consolidate #50 (comment)

@sshivaditya2019

One consideration I just realized we have to resolve a design problem first.

@Keyrxng is building a plugin where we can ask ChatGPT (with full linked issue conversation and pull context) any questions with the same syntax.

Perhaps it makes the most sense to also look up similar conversations (embeddings) and appending their text to the LLM context window.

If there's some high match then append. If no high percentage match then don't append. I think this should make the user experience seamless when asking questions. In any case though, technically this should not be a new plugin. Look for the command-ask plugin.

I can't transfer this issue to that repository easily from my phone will need to from my computer in a bit

Could you please explain further? I’m not able to understand what you mean by "better UX."

@0x4007 commented Sep 30, 2024

@UbiquityOS what was the original reason for moving the LP tokens?

Refer to that part of the spec. Their command-ask plugin has this syntax.

@UbiquityOS is this considered a best practice? [1]

Imagine asking a senior colleague any question on a pull request or an issue. They will have context on all the other historical issues/pulls, as well as general knowledge from other projects. My vision consolidates both into a single natural interface of tagging the colleague and asking your question.

Footnotes

  1. a hypothetical review comment to settle a debate. By passing in the entire conversation and the diff hunk, ChatGPT should be able to provide a great answer.

@Keyrxng commented Sep 30, 2024

You’re correct that we are backfilling the database with conversation threads from various organizations. However, we will be selectively choosing content that is relevant to a central theme, which will be extracted using n-gram frequency analysis and topic modeling techniques. This approach is necessary to avoid including random discussions that don’t pertain to the actual concepts or topics, thereby helping to prevent model hallucination.

This sounds like a whole new classification/approach to the current single-comment-body embeddings that are happening. Does this imply that the curated text (and the embeddings for that whole body of aggregated chat) is going to be an amalgamation of only the relevant text from across the issue minus any noise? So will the embedding paint a more well-rounded picture with more info, or will text be chunked as it is now on a per-comment basis?

However, we will be selectively choosing content that is relevant to a central theme,

I'd have thought that the task is the theme for any given body of text; its parent theme would be the repository, and the parentmost theme the org to which it belongs.

Pretty much that, except we need to apply stemming and lemmatization to ensure the LLM captures the context. We should also apply re-ranking techniques, as I am almost certain there will be overlapping contexts, so a reranking method like BM25 would be required.

Why do we need to lobotomize and generalize the text when models are more than capable of comprehension without having to squash things into layman's terms? I fear that this would not be ideal given the highly nuanced and specialized topics that get discussed across tasks and PRs. I don't think we should do that.

Stemming, lemmatization, and large language models (LLMs) are key concepts in natural language processing (NLP). While stemming and lemmatization are classical text preprocessing techniques, LLMs such as GPT (including the model you are interacting with) represent an advanced approach to understanding and generating human language. Here's a breakdown of these concepts and how they relate to LLMs

And incorporate some form of re-ranker and context distillation techniques to improve overall response quality while reducing model costs.

What is it you are re-ranking? I do not understand that, sorry. Do you mean re-ranking the relevance of the set of embeddings you obtained on your first search, or obtaining a new set from the DB?

I looked into BM25 and bag-of-words, and I understand the basics of the concepts and how they'd actually be beneficial here. It's very different from the current embeddings approach though; is the idea ultimately to perform the search across the single-body embeddings as-is, or only on the amalgamated "blocks"/"themes"?

Could you please explain further? I’m not able to understand what you mean by "better UX."

Currently command-ask gets its context from comments and any linked issue(s). The intention now is that the user query is (likely) embedded, we use it to find relevant content among the embeddings we have stored, and then we append the respective text onto the conversation context command-ask already collects. Meaning there is only one "entry" or way to use the UbiquityOS chatbot, as there is only going to be "one". Multiple ways to invoke it for different specific query handling and results would get confusing.

ubiquity-os-marketplace/text-vector-embeddings#16 This PR consolidates everything into a single table, which we can use to search across all embeddings at once, or we can use both type and metadata to refine and restrict the scope of embeddings we search. Hybrid search with query embedding relevance, classification of the query (setup, dao_info, general, etc.), and metadata specifics which we can pull from params like @ubiquity-os/command-ask or the Context payload itself in most cases.

Will this new style of embedding you are using have its own sort of classification for use? Is that what you meant by themes?

Last question, I promise:

If the comments have to be cleaned and cherry picked, for a good Context Retrieval,

Are you manually processing all 700 tracked tasks (and however many associated PRs)? 👀👀 Or are you automating somehow?

We have a lot of old conversations on our GitHub spanning back to 2020.

Srsly last one: are only old conversations going to receive this treatment, or are we going to apply this same treatment to our current/future tasks also?

@sshivaditya2019 commented Oct 1, 2024

You’re correct that we are backfilling the database with conversation threads from various organizations. However, we will be selectively choosing content that is relevant to a central theme, which will be extracted using n-gram frequency analysis and topic modeling techniques. This approach is necessary to avoid including random discussions that don’t pertain to the actual concepts or topics, thereby helping to prevent model hallucination.

This sounds like a whole new classification/approach to the current single-comment-body embeddings that are happening. Does this imply that the curated text (and the embeddings for that whole body of aggregated chat) is going to be an amalgamation of only the relevant text from across the issue minus any noise? So the embedding will paint a more well rounded picture of more info or will text be chunked as it is now on a per-comment basis?

The single-comment-body embeddings are the same; the only change is in how they are selected. It’s more appropriate to view this as a temporary step in the process. Rather than dumping everything into the database, we are carefully selecting conversations that are genuinely relevant, which is crucial for preventing model hallucinations.

However, we will be selectively choosing content that is relevant to a central theme,

I'd have thought that the task is the theme for any given body of text, then it's parent theme would be the repository, the parentmost theme being the org in which it belongs.

A single issue or PR thread may have multiple themes that may not directly relate to the task at hand. For instance, a PR thread might include discussions about best practices that aren't relevant to the specific task or organization. Therefore, we need to adopt a nuanced approach to textual context rather than relying solely on code, as there is often significant implicit context and meaning involved.

Pretty much that, except we need to apply stemming and lemmatization to ensure the LLM captures the context. We should also apply re-ranking techniques, as I am almost certain there will be overlapping contexts, so a reranking method like BM25 would be required.

Why do we need to lobotomize and generalize the text when models are more than capable of comprehension without having to squash things into layman's terms. I fear that this would not be ideal given the highly nuanced and specialized topics that get discussed across tasks and PRs. I don't think we should do that.

In stemming and lemmatization, we reduce words to their root forms, which decreases the overall number of tokens without losing context. For instance, "running" becomes "run" through stemming, while lemmatization would convert both "running" and "ran" to "run." This process is particularly helpful for improving relevance scoring. I assume there may be conversations in other languages, but if that’s not the case, then this approach might not be very useful for English conversations.
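
A toy illustration of the token-reduction point, using a naive suffix stripper rather than a real Porter stemmer or a dictionary-based lemmatizer (which is what an actual pipeline would use):

```typescript
// Toy suffix-stripping "stemmer" purely to illustrate the normalization step;
// a real pipeline would use a proper Porter stemmer or a dictionary-based lemmatizer.
function naiveStem(word: string): string {
  return word.toLowerCase().replace(/(ing|ed|ly|s)$/, "");
}

const before = ["running", "runs", "ran", "quickly"];
const after = before.map(naiveStem);
console.log(after); // ["runn", "run", "ran", "quick"] — note "ran" needs lemmatization, not stemming
```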

Stemming, lemmatization, and large language models (LLMs) are key concepts in natural language processing (NLP). While stemming and lemmatization are classical text preprocessing techniques, LLMs such as GPT (including the model you are interacting with) represent an advanced approach to understanding and generating human language. Here's a breakdown of these concepts and how they relate to LLMs

And incorporate some form of re-ranker and context distillation techniques to improve overall response quality while reducing model costs.

What is it you are re-ranking I do not understand that sorry. You mean re-ranking the relevance of the set of embeddings you obtained on your first search or obtaining a new set from the DB?

Looked into BM25, bag-of-words and I understand the basics of the concepts and how they'd be beneficial here actually. Very different from the current embeddings approach though, is the idea ultimately to perform the search across the single-body-embeddings as-is or only on the amalgamated "blocks"/"themes"?

When comments are ranked based on cosine similarity, we can enhance retrieval performance by using rerankers. These rerankers simply reorder the results to help the LLM focus on the most important concepts. I’ve attached a reference[1] that highlights the significance of rerankers in RAG applications.


Could you please explain further? I’m not able to understand what you mean by "better UX."

Currently command-ask gets it's context from comments and any linked issue(s). The intention now is that the user query is (likely) embedded and we use it to find relevant content with the embeddings we have stored and then append the respective text onto the already collected conversation context command-ask currently collects. Meaning there is only one "entry" or way to use the UbiquityOS chatbot, as there is only going to be "one". Multiple ways to invoke it for different specific query handling and results would get confusing.

Will this new style of embedding you are using have it's own sort of classification for use? Is that what you meant by themes?

Everything is exactly the same; the themes serve merely as a way to refine the data, acting as a new feature extraction stage. The themes will not be utilized or included in the database.

Last question I promise;

If the comments have to be cleaned and cherry picked, for a good Context Retrieval,

Are you manually processing all 700 tracked tasks (and however many associated PRs)? 👀👀 Or are you automating somehow?

There won’t be any manual processing; it will be automated, but I will need to review at least some parts of it. This is a two-step process: first, we group the comment-issue-body into topics. While some manual work is necessary to ensure the tags are correct, this could potentially be automated using an LLM.

We have a lot of old conversations on our GitHub spanning to 2020.

Srsly last one. Only old conversations are going to receive this treatment or are we going to apply this same treatment to our current/future tasks also?

Only the old conversations will be used, although they can also be applied to newer tasks if needed. Essentially, this is a stable corpus that can serve as a source of truth, which is why "this treatment" is being implemented here.

[1] https://developer.nvidia.com/blog/enhancing-rag-pipelines-with-re-ranking/
[2] https://jina.ai/news/having-it-both-ways-combining-bm25-with-ai-reranking/
[3] https://lilianweng.github.io/posts/2024-07-07-hallucination/

@Keyrxng commented Oct 1, 2024

The single-comment-body embeddings are the same; the only change is in how they are selected. Rather than dumping everything into the database, we are carefully selecting conversations that are genuinely relevant.

So we're filtering out irrelevant comments and embedding only those related to the task spec. And would these be embedded as individual comments, or combined as an overview based on the "theme", kind of like a focused MD doc for each task?

I assume there may be conversations in other languages, but if that’s not the case, then this approach might not be very useful for English conversations.

We previously agreed that GitHub is primarily English-based. We’ve added translation support in the plugin-template, but that’s the only time we’ve addressed multilingual support afaik.

For instance, "running" becomes "run" through stemming, while lemmatization would convert both "running" and "ran" to "run."

This seems more like a hammer when we really need a scalpel, but I’m curious to see how it works, especially given our large dataset.

These rerankers simply reorder the results to help the LLM focus on the most important concepts.

I’m concerned about managing multiple AI providers. Using a provider that aggregates different AI models might simplify things, as we’d avoid juggling multiple API keys.

We used to rely on GPT to summarize and rank everything. The latest models are more than capable of handling entire raw-text task convos (multiple, actually), PRs, and diffs, generating structured, consistent Markdown summaries. We could make a structured template for an MD doc for each task, improving RAG and helping us build a clean dataset for fine-tuning/training our own model, which should ultimately be our end goal.

The themes serve merely as a way to refine the data, acting as a new feature extraction stage. The themes will not be utilized or included in the database.
First, we group the comment-issue-body into topics.

I’m unclear here but will wait for the implementation since you have a clear direction.


Only the old conversations will be used, although they can also be applied to newer tasks if needed.

If this creates a clear, focused overview of tasks, we should apply it to all tasks, not just old ones, as it seems more efficient than embedding every comment like we do now.


We should do something like this https://chatgpt.com/share/66fbe1e7-9240-8000-aa8d-f2b68f9ca142 in my opinion and treat each task like a document in the knowledge base of UbiquityOS, the same as big companies do with their in-house AI models like Salesforce, TelCom, etc.

onboarding bot QA: example of a document-based approach - ubq-testing/ask-plugin#2

@sshivaditya2019 commented Oct 1, 2024

So we're filtering out irrelevant comments and embedding only those related to the task spec. And would these be embedded as individual comments or combined as an overview based on the "theme", like a focused md doc for each task kind of?

I just reviewed the updated database schema and noticed we could add a nullable topic column. This would allow us to create a topic for queries, match it, and use the results to generate context for the LLM. However, with this approach, we’re primarily filtering out irrelevant comments and embeddings. Once we retrieve data from the database, we could summarize it, but I don’t think that’s necessary.
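
A small sketch of how a nullable topic column could scope retrieval before the similarity search runs; the table and column names here are assumptions, not the real schema:

```typescript
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

// Restrict retrieval to rows tagged with the query's topic before doing the
// similarity search. Table and column names are placeholders, not the real schema.
async function fetchTopicCandidates(topic: string, limit = 50) {
  const { data, error } = await supabase
    .from("content_embeddings")
    .select("id, plaintext, embedding")
    .eq("topic", topic)
    .limit(limit);
  if (error) throw error;
  return data;
}
```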

This seems more like a hammer when we really need a scalpel, but I’m curious to see how it works, especially given our large dataset.

This would be very useful for n-gram models or topic models like LDA. Several papers have noted improvements in LLM retrieval performance as well [1]. However, this is optional and would depend on the specific context and the information regarding its performance.

I’m concerned about managing multiple AI providers. Using a provider that aggregates different AI models might simplify things, as we’d avoid juggling multiple API keys.

We could utilize something like OpenRouter, which would enable us to access a broader range of models while remaining compatible with the OpenAI package structure.
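
A minimal example of that compatibility, assuming an OPENROUTER_API_KEY and an OpenRouter-style model slug (both placeholders):

```typescript
import OpenAI from "openai";

// The OpenAI SDK accepts a custom baseURL, so pointing it at OpenRouter keeps the
// same client code while opening up other model providers behind one API key.
const openrouter = new OpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

const completion = await openrouter.chat.completions.create({
  model: "anthropic/claude-3.5-sonnet", // OpenRouter-style model slug
  messages: [{ role: "user", content: "Summarize this task thread." }],
});
```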

We used to rely on GPT to summarize and rank everything. The latest models are more than capable of handling entire raw text task convos (multiple actually), PRs, and diffs, generating structured, consistent Markdown summaries. We could make a structured template for a MD doc for each task, improving RAG, and help us build a clean dataset for fine-tuning/training our own model, which should be our end goal ultimately.

We can use GPT, which processes context strictly in the order it's provided, without applying any intelligent reasoning. As a result, the responses will rely heavily on the sequence of information retrieved through cosine similarity-based vector search.

[1] Optimizing LLM Queries in Relational Workloads
[2] Premise Order Matters in Reasoning with Large Language Models

@Keyrxng commented Oct 2, 2024

I just reviewed the updated database schema and noticed we could add a nullable topic column. This would allow us to create a topic for queries, match it, and use the results to generate context for the LLM.

Afaik queries are not saved to the DB, only our embeddings, which would mean we'd need to determine a topic for every embedding we create to be able to match it, right? This is in addition to classifying it like dao_info, etc.

This would be very useful for n-gram models or topic models like LDA. Several papers have noted improvements in LLM retrieval performance as well [1]. However, this is optional and would depend on the specific context and the information regarding its performance.

I feel like we have no current baseline for our own comparison, as we do not even have a simple RAG chatbot yet. Also, afaik n-gram and topic models need quality embeddings/source docs to begin with; they work with documents/bodies of text, but our embedding system right now is so granular that our embedding content is likely the literal strings, and we are going to have tens or hundreds of thousands of these embeddings in a very short time. What I thought you were doing is improving the embeddings being stored so they are of higher quality to work with.

We could utilize something like openrouter, which would enable us to access a broader range of models while remaining compatible with the OpenAI package structure.

I'd be in favour of this as we'll be drowning in AI API keys in no time if we intend to continue to introduce lots of other models.

We can use GPT, which processes context strictly in the order it's provided, without applying any intelligent reasoning. As a result, the responses will rely heavily on the sequence of information retrieved through cosine similarity-based vector search.

I believe that these modern models bring their own intelligent reasoning and that's why they are so much better than only a couple of years ago.

My suggestion was that we use GPT to create a stable corpus of literal MD documents that cover the entire contents of a task succinctly but effectively, ranging back to 2020, and continue to do so when a task is completed. This way we are creating a very structured and uniform dataset (good for RAG, good for fine-tuning, good for training) which is not some kind of black box (as these models all are) and that we can actually review, read, and manually edit if necessary (highly doubtful).

So we structure a template, feed in the entire task and all of its contents, and have GPT do this for one repo or 30-40 tasks. They can be very easily QA'd, prompt-refined, and chatbot-tested, and then we do it for every tracked task (via the Batch API, as that'll cost a lot using the best models). The DB type is Task Doc or Task Summary if we need to restrict scope to only it, or it appears in non-refined search with all embeddings.

9/10 RAG chatbots operate on documentized datasets, so rather than reinvent the wheel for our V1 chatbot, let's do what is normally done and documentize our org via tasks to create a chronological timeline of evolution for our org as the foundation of our chatbot's knowledge base. Won't these n-gram models and topic models perform most optimally with a better collection of base embeddings to begin with? Additionally, with these complete task doc summaries, we could begin to fine-tune a model right away and even offer it as a service to partners.
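
A minimal sketch of the template idea: feed one task's full conversation to the model with a fixed section layout. The prompt wording, section names, and gpt-4o-mini choice are placeholders, not a settled design:

```typescript
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Hypothetical template for turning one task's full conversation into a uniform
// Markdown "task document" for the knowledge base.
const TASK_DOC_PROMPT = `You are documenting a completed task for an internal knowledge base.
Produce a Markdown document with these sections:
## Summary
## Problem / Specification
## Key Decisions (and why)
## Outcome
Keep it factual and cite comment authors where relevant.`;

async function buildTaskDoc(taskTitle: string, fullConversation: string) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: TASK_DOC_PROMPT },
      { role: "user", content: `Task: ${taskTitle}\n\n${fullConversation}` },
    ],
  });
  return completion.choices[0].message.content;
}
```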

https://community.openai.com/t/scaling-rag-chatbot-system-to-millions-of-documents/615386/2

https://chatgpt.com/share/66fd61e4-b41c-8000-bde7-ecbdbff51b2e - 1 question 1 answer convo

Recommendation
If the primary goal of your chatbot is to provide quick, accessible overviews or to assist in navigating project milestones and outcomes, then summarizing each task in a structured, document-like fashion would be more beneficial. This would make the bot user-friendly and efficient in handling common queries about project statuses or history.

However, if the chatbot needs to handle more technical or detailed queries that require digging into specific comments or code snippets, embedding each comment and task specification individually might be the better approach, albeit with more sophisticated natural language processing and data management systems in place to handle the complexity.

For a balanced approach, consider a hybrid model: summarize for high-level queries but retain the ability to refer to individual, unaltered records for detailed inquiries. This would require an indexing system that can efficiently switch between summary views and detailed views based on the nature of the query.

We intend on embedding our entire codebases, so when we need to handle technical queries about code we will pull from there. Chances are we will never ask about a specific comment, but more likely the "why" or "how", which would pertain to a task/codebase/repo etc.; contextually, a document covering the evolution of the task would contain the "why" and "how".

@0x4007 commented Oct 5, 2024

Whatever is easier. The idea is that you can get started on making the logic, and we can worry about consolidating the user interface later.


ubiquity-os-beta bot commented Oct 18, 2024

Note

This output has been truncated due to the comment length limit.

 [ 800 WXDAI ] 

@sshivaditya2019
Contributions Overview

| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Task | 1 | 800 |
| Issue | Comment | 1 | 20 |
| Review | Comment | 2 | 50 |

 [ 105.066 WXDAI ] 

@0x4007
Contributions Overview

| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Specification | 1 | 55.41 |
| Issue | Comment | 9 | 21.216 |
| Review | Comment | 38 | 28.44 |

 [ 23.1885 WXDAI ] 

@Keyrxng
Contributions Overview

| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Comment | 6 | 23.1885 |

@0x4007 commented Oct 18, 2024

@sshivaditya2019 can you paste the plugin configs under the completed issues so we can install and test?

0x4007 reopened this Oct 23, 2024
0x4007 closed this as completed Oct 23, 2024

ubiquity-os-beta bot commented Oct 23, 2024

! Failed to run comment evaluation. Relevance / Comment length mismatch!

0x4007 reopened this Oct 23, 2024
0x4007 closed this as completed Oct 23, 2024

ubiquity-os-beta bot commented Oct 23, 2024

! Failed to run comment evaluation. Relevance / Comment length mismatch!

@sshivaditya2019

@gentlementlegen could you regenerate the permit for this one?

@gentlementlegen

@sshivaditya2019 I see a generated result above, something wrong with it?

@sshivaditya2019 commented Nov 28, 2024

@gentlementlegen Yeah, this was generated when the permits were being reverted during execution, and it doesn't work.

@gentlementlegen

@sshivaditya2019 we are experiencing issues with the kernel, will regenerate this as soon as it is fixed.


+ Evaluating results. Please wait...


! Relevance / Comment length mismatch!


+ Evaluating results. Please wait...


 [ 801.7215 WXDAI ] 

@sshivaditya2019
Contributions Overview

| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Task | 1 | 800 |
| Issue | Comment | 2 | 1.7215 |
Conversation Incentives
Comment | Formatting | Relevance | Priority | Reward
This can be easily accomplished with embeddings and a capable LL…
2.78
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 50
  wordValue: 0.1
  result: 2.78
0.921.261
If the comments have to be cleaned and cherry picked, for a good…
1.33
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 21
  wordValue: 0.1
  result: 1.33
0.720.4605

 [ 183.658 WXDAI ] 

@0x4007
Contributions Overview

| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Specification | 1 | 56.68 |
| Issue | Comment | 10 | 59.47 |
| Review | Comment | 39 | 67.508 |
Conversation Incentives
Comment | Formatting | Relevance | Priority | Reward
**Description:**We have two goals that are closely aligned:1…
28.34
content:
  content:
    p:
      score: 0
      elementCount: 16
    ol:
      score: 0
      elementCount: 1
    li:
      score: 0.5
      elementCount: 6
    em:
      score: 0
      elementCount: 3
    ul:
      score: 0
      elementCount: 3
    h3:
      score: 1
      elementCount: 1
    a:
      score: 5
      elementCount: 3
  result: 19
regex:
  wordCount: 208
  wordValue: 0.1
  result: 9.34
1256.68
@sshivaditya2019 rfc
0.18
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 2
  wordValue: 0.1
  result: 0.18
020
What's a good time estimate?
0.46
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 6
  wordValue: 0.1
  result: 0.46
0.220.184
One consideration I just realized we have to resolve a design pr…
6.1
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 126
  wordValue: 0.1
  result: 6.1
0.9210.98
Well perhaps we can have you take it over mid this week if its n…
5.18
content:
  content:
    p:
      score: 0
      elementCount: 4
  result: 0
regex:
  wordCount: 104
  wordValue: 0.1
  result: 5.18
0.828.288
Most of everything is within the @ubiquity organization (created…
9.5
content:
  content:
    p:
      score: 0
      elementCount: 4
    a:
      score: 5
      elementCount: 1
  result: 5
regex:
  wordCount: 88
  wordValue: 0.1
  result: 4.5
0.7216.3
Much better UX to consolidate https://github.com/ubiquity-os/plu…
5.39
content:
  content:
    p:
      score: 0
      elementCount: 1
    a:
      score: 5
      elementCount: 1
  result: 5
regex:
  wordCount: 5
  wordValue: 0.1
  result: 0.39
0.5210.39
Refer to that part of the spec. Their `command-ask` plug…
4.58
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 90
  wordValue: 0.1
  result: 4.58
0.726.412
In recent times, we've been pretty good about this so just take …
2.97
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 54
  wordValue: 0.1
  result: 2.97
0.623.564
Whatever is easier. The idea is that you can get started on maki…
1.54
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 25
  wordValue: 0.1
  result: 1.54
0.822.464
@sshivaditya2019 can you paste the plugin configs under the comp…
1.11
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 17
  wordValue: 0.1
  result: 1.11
0.420.888
Looks like a pretty solid implementation
0.46
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 6
  wordValue: 0.1
  result: 0.46
0.220.184
You marked my comments as "resolved" but didn't implement the re…
0.88
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 13
  wordValue: 0.1
  result: 0.88
0.420.704
```suggestion"author": "Ubiquity DAO",``…
2.33
content:
  content:
    ul:
      score: 0
      elementCount: 1
    li:
      score: 0.5
      elementCount: 3
  result: 1.5
regex:
  wordCount: 12
  wordValue: 0.1
  result: 0.83
0.623.996
Pretty unusual syntax to mix async await and then
0.65
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 9
  wordValue: 0.1
  result: 0.65
0.720.91
Does it make sense to have two separate arrays for each data typ…
0.88
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 13
  wordValue: 0.1
  result: 0.88
0.520.88
Should be renamed to ```suggestionenv: { UBIQU…
0.32
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 4
  wordValue: 0.1
  result: 0.32
0.720.448
Can you explain to me what the similarity threshold is for?
0.77
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 11
  wordValue: 0.1
  result: 0.77
0.520.77
Maybe rename to llm.d.ts
0.46
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 6
  wordValue: 0.1
  result: 0.46
0.320.276
Rename to github-types.d.ts
0.46
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 6
  wordValue: 0.1
  result: 0.46
0.320.276
Perhaps this should default to "UbiquityOS"
0.46
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 6
  wordValue: 0.1
  result: 0.46
0.620.552
```suggestionlet readme;```
0
content:
  content: {}
  result: 0
regex:
  wordCount: 0
  wordValue: 0.1
  result: 0
0.220
```suggestion```
0
content:
  content: {}
  result: 0
regex:
  wordCount: 0
  wordValue: 0.1
  result: 0
0.120
Confused about the parameters and this method but i assume it wo…
1.28
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 20
  wordValue: 0.1
  result: 1.28
0.521.28
Probably makes sense to rename this to ask-llm
0.65
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 9
  wordValue: 0.1
  result: 0.65
0.420.52
My understanding is that the `-instruct` model we chose …
1.8
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 30
  wordValue: 0.1
  result: 1.8
0.521.8
What is this? Is this to summarize contents? Are you sure this i…
1.06
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 16
  wordValue: 0.1
  result: 1.06
0.420.848
Seems unexpected that you removed this because we generally use …
0.94
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 14
  wordValue: 0.1
  result: 0.94
0.320.564
Be sure to consider the instructions from the [readme](https://g…
6.65
content:
  content:
    p:
      score: 0
      elementCount: 2
    a:
      score: 5
      elementCount: 1
  result: 5
regex:
  wordCount: 27
  wordValue: 0.1
  result: 1.65
0.4211.32
Hard coding these things is the wrong approach then. This needs …
1.17
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 18
  wordValue: 0.1
  result: 1.17
0.320.702
You sure you want to remove formatting clues such as bullet poin…
3.29
content:
  content:
    p:
      score: 0
      elementCount: 3
  result: 0
regex:
  wordCount: 61
  wordValue: 0.1
  result: 3.29
0.321.974
This could benefit from clarifying that this is calculating the …
3.06
content:
  content:
    p:
      score: 0
      elementCount: 2
  result: 0
regex:
  wordCount: 56
  wordValue: 0.1
  result: 3.06
0.523.06
Commas seem wrong
0.25
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 3
  wordValue: 0.1
  result: 0.25
0.220.1
Returning an empty string always seems like a bad idea. This see…
1.28
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 20
  wordValue: 0.1
  result: 1.28
0.320.768
Optional seems wrong unless its an optimization to save tokens
0.71
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 10
  wordValue: 0.1
  result: 0.71
0.220.284
Is ten optimal
0.25
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 3
  wordValue: 0.1
  result: 0.25
0.320.15
Might be good to find and replace all GPT instances in the code …
1.06
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 16
  wordValue: 0.1
  result: 1.06
0.420.848
In case we haven't already: we should make another task for dyna…
1
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 15
  wordValue: 0.1
  result: 1
0.521
This seems wrong but i dont know the full context of how its use…
0.94
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 14
  wordValue: 0.1
  result: 0.94
0.320.564
Mixed feelings on this. They fall out of date so fast. Its usefu…
1.95
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 33
  wordValue: 0.1
  result: 1.95
0.321.17
Does this package still work? I thought we deleted it and rebran…
1.28
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 20
  wordValue: 0.1
  result: 1.28
0.421.024
Gold star? Non established. Will need to work on this asap. DM …
1.17
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 18
  wordValue: 0.1
  result: 1.17
0.120.234
Seems like we get what we pay for :)
0.59
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 8
  wordValue: 0.1
  result: 0.59
0.120.118
Your QA results are quite interesting. We should prompt and focu…
2.64
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 47
  wordValue: 0.1
  result: 2.64
0.221.056
Thats mostly fine. Any price these models charge us are orders o…
1.38
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 22
  wordValue: 0.1
  result: 1.38
0.220.552
Actually we should use mini because it has a much larger usable …
3.11
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 57
  wordValue: 0.1
  result: 3.11
0.221.244
This aligns with my expectations. Claude is really good at deali…
19.06
content:
  content:
    p:
      score: 0
      elementCount: 1
    br:
      score: 0
      elementCount: 1
    ul:
      score: 0
      elementCount: 1
    li:
      score: 0.5
      elementCount: 2
    a:
      score: 5
      elementCount: 2
  result: 11
regex:
  wordCount: 175
  wordValue: 0.1
  result: 8.06
0.3226.836
Hows this coming along?
0.32
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 4
  wordValue: 0.1
  result: 0.32
0.120.064
Okay merge and lets test.
0.39
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 5
  wordValue: 0.1
  result: 0.39
0.220.156
My last batch of comments is intended to be handled async becaus…
1.38
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 22
  wordValue: 0.1
  result: 1.38
0.120.276

 [ 235.439 WXDAI ] 

@Keyrxng
Contributions Overview

| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Comment | 6 | 235.439 |
Conversation Incentives
Comment | Formatting | Relevance | Priority | Reward
Both are held back by review only.As I understand this task:…
7.55
content:
  content:
    p:
      score: 0
      elementCount: 8
    ul:
      score: 0
      elementCount: 1
    li:
      score: 0.5
      elementCount: 5
    hr:
      score: 0
      elementCount: 1
    ol:
      score: 0
      elementCount: 1
  result: 2.5
regex:
  wordCount: 101
  wordValue: 0.1
  result: 5.05
1215.1
This sounds like a whole new classification/approach to the curr…
24.18
content:
  content:
    p:
      score: 0
      elementCount: 11
    a:
      score: 5
      elementCount: 1
  result: 5
regex:
  wordCount: 485
  wordValue: 0.1
  result: 19.18
0.8240.688
So we're filtering out irrelevant comments and embedding only th…
22.39
content:
  content:
    p:
      score: 0
      elementCount: 9
    hr:
      score: 0
      elementCount: 2
    a:
      score: 5
      elementCount: 2
  result: 10
regex:
  wordCount: 290
  wordValue: 0.1
  result: 12.39
0.9242.302
afaik queries are not saved to the DB only our embeddings which …
31.15
content:
  content:
    p:
      score: 0
      elementCount: 10
    a:
      score: 5
      elementCount: 2
  result: 10
regex:
  wordCount: 544
  wordValue: 0.1
  result: 21.15
0.9258.07
Our aim should be improving the effectiveness of our embeddings …
6.87
content:
  content:
    p:
      score: 0
      elementCount: 4
  result: 0
regex:
  wordCount: 145
  wordValue: 0.1
  result: 6.87
0.85211.679
There may be an issue with this tracked task json approach depen…
36.4
content:
  content:
    p:
      score: 0
      elementCount: 14
    img:
      score: 5
      elementCount: 4
    hr:
      score: 0
      elementCount: 1
    ul:
      score: 0
      elementCount: 1
    li:
      score: 0.5
      elementCount: 2
    a:
      score: 5
      elementCount: 1
  result: 26
regex:
  wordCount: 236
  wordValue: 0.1
  result: 10.4
0.75267.6

 [ 13.368 WXDAI ] 

@gentlementlegen
Contributions Overview

| View | Contribution | Count | Reward |
| --- | --- | --- | --- |
| Issue | Comment | 2 | 1.35 |
| Review | Comment | 2 | 12.018 |
Conversation Incentives
Comment | Formatting | Relevance | Priority | Reward
@sshivaditya2019 I see a generated result above, something wrong…
0.77
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 11
  wordValue: 0.1
  result: 0.77
0.320.462
@sshivaditya2019 we are experiencing issues with the kernel, wil…
1.11
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 17
  wordValue: 0.1
  result: 1.11
0.420.888
@sshivaditya2019 For the Supabase, does it need a brand new inst…
6.9
content:
  content:
    p:
      score: 0
      elementCount: 1
    a:
      score: 5
      elementCount: 1
  result: 5
regex:
  wordCount: 32
  wordValue: 0.1
  result: 1.9
0.5211.9
@sshivaditya2019 Sounds good, please poke me in telegram (`@…
0.59
content:
  content:
    p:
      score: 0
      elementCount: 1
  result: 0
regex:
  wordCount: 8
  wordValue: 0.1
  result: 0.59
0.120.118
