diff --git a/docs/source/_static/images/gsoc_llm_robin_week5.jpg b/docs/source/_static/images/gsoc_llm_robin_week5.jpg new file mode 100644 index 000000000..71be43086 Binary files /dev/null and b/docs/source/_static/images/gsoc_llm_robin_week5.jpg differ diff --git a/docs/source/posts/2024/2024-07-01-week-4-robin.rst b/docs/source/posts/2024/2024-07-01-week-4-robin.rst new file mode 100644 index 000000000..0172bff7b --- /dev/null +++ b/docs/source/posts/2024/2024-07-01-week-4-robin.rst @@ -0,0 +1,74 @@ +Week 4: Pipeline Improvements and Taking The Bot Public! +======================================================== + +.. post:: July 1 2024 + :author: Robin Roy + :tags: google + :category: gsoc + +Hi, I'm `Robin `_ and this is my blog about week 4. + +My goals for week 4 were to move my Google Colab notebooks to a proper Python script, improve the existing code, and make a working pipeline to upsert data easily. Also, the bot is public now :) Anyone reading this blog can join this `Discord Server `_ and ask questions right away! + +Things I did in Week 4 +---------------------- + +1) **Chunking tutorials and documentation** + +Earlier, only files fitting the context window of the embedding model were upserted. Otherwise, we'd have had to split the file in half and lose the overall context, which leads to information loss and messy retrieval. Now I've decided to upsert everything by splitting the information properly. By "properly", I mean it won't be a random split; there'll be logical reasoning behind every chunk. + +This area is still actively studied, and the whole concept is to find ideal chunks which are self-sufficient and contain the most information. This `notebook `_ details 6 different approaches. I read through them and some of their associated literature and decided we'll use `Recursive Character Text Splitting` and `Document Specific Splitting` for now.
There is no major reason for this; I just felt it'll work well for now (a reasoning-backed approach will come in a few weeks). There is a lot of experimentation we could do here; better chunking will result in better ``references`` generation and so on. + +So this is our current process: + - if normal function/class definition: no splitting, chunk as it is. + - if rst files, use the ``rst parser`` and split them with a chunk size of ~8000 tokens (the most Llama can take). RST files in FURY contain documentation & blog posts. + - if tutorial, try chunking as it is; if that's not possible, split at 8000 tokens. + +Function/class definitions are generally under 8000 tokens, so I've not added explicit checks for now; the model will trim the remainder if they are longer (I found some long classes later). + +2) **Move Colab files to a proper Python script** + +I did all the upsertion and experiments on Colab. It is messy and can't be used in production. We need a one-click approach to upsertion: something like pointing it at the `fury` directory and having it do everything. So I took the messy Colab code and made a Python script from it. + +One of my key goals is to separate core application logic from LLM/database providers. We should be able to swap them as needed without much fuss. I'll talk more about this in week 5. + +3) **Taking the bot public!** + +The whole point of making the bot is to help improve the productivity of FURY developers. So I decided to take it public on `this discord server `_. You can use it today! (Actually, you could've used it since the 20th of last month; this blog got delayed 😢) + +I'll observe what people are asking and then iterate towards making the bot better in that area. I think that'll be better than making the bot good at whatever I believe is best. + +4) **Minor bugfixes and stuff** + +Did some minor bug fixes on things like the Discord bot generation cutoff and error handling improvements.
It was the Discord message limit (<=2000 characters) that caused the generation to cut off; I split the message into parts to fix that. Error handling was improved across the board. I'll need to add logging later. + + +Minor Sidequest +~~~~~~~~~~~~~~~ + +This is in no way related to FURY, but it was fun so I thought I'd add it here :) + +So after midterms, I decided to go back home. To make the most of my time, I searched for things to do and found a local FOSS event (`link `_). It was organized by FOSS United Kochi and it's one of the major FOSS events in my state (Kerala, India). Met some Pythonistas! Explained what FURY is to them. I also ended up finding some lore (`link `_) about how GNU/Linux spread in Kerala, India. Also found some old FOSS event pictures (`this `_ one is talking about Python, 2003 World of Python). This was my first FOSS event outside campus so it was fun :) + + +What is coming up next week? +---------------------------- + +- Benchmarking +- Architecture Update + +Did you get stuck anywhere? +--------------------------- + +No, I did not get stuck. This week was more about learning and experimentation, so I think what I encountered is normal. + +LINKS: + +- `Discord Server `_ +- `A Text Splitting Guide `_ +- `GNU Case of Kerala `_ +- `2003 World of Python `_ +- `FOSS United Kochi `_ +- `Robin :) `_ + +Thank you for reading! diff --git a/docs/source/posts/2024/2024-07-01-week-5-robin.rst b/docs/source/posts/2024/2024-07-01-week-5-robin.rst new file mode 100644 index 000000000..b785be5f0 --- /dev/null +++ b/docs/source/posts/2024/2024-07-01-week-5-robin.rst @@ -0,0 +1,129 @@ +Week 5: LLM Benchmarking & Architecture Modifications +===================================================== + +.. post:: July 1 2024 + :author: Robin Roy + :tags: google + :category: gsoc + +Hi, I'm `Robin `_ and this is my blog about week 5. + +This week, we'll take all the things we did in the previous weeks and quantify them. Benchmarking an LLM is the process of grading the LLM's answers.
To grade properly, we need good rubrics, so that's what I worked on this week. I also made some architectural changes to simplify overall development. + +Things I did in Week 5 +---------------------- + +1) **Architectural Update** + +Earlier, this was our architecture: + + + .. raw:: html + + + +This had an obvious issue: all the core logic was inside the Discord Bot. So if I wanted to, say, use the LLM inference for a GitHub bot or for benchmarking, it wasn't possible. So I decided to cut the LLM logic out of the Discord Bot and make a new ``LLM Router``. It'll handle all the LLM logic from now on, and we do not directly call any endpoint other than this one. +It makes life simple. Every input going into the endpoint looks like this: + +.. code-block:: json + + { + "query": "Render a cube in fury", + "llm": "llama3-70b-8192", + "knn": "3", + "stream": false + } + +Every response coming out will look like this: + +.. code-block:: json + + { + "response": "Yes, this is how it would be done python import fury....", + "references": "1, 2, 3" + } + +What happens on the inside is completely abstracted away. You just call this and it'll + - call the embedding model + - pass the embeddings to the database + - pass the retrieved results to the LLM (which you can choose) + - return the LLM answer with references to you + +Currently, we support the ``ollama``, ``google`` and ``groq`` providers. That alone covers 20+ LLMs, and you can swap between them using ``/api/groq``, ``/api/google``, ``/api/ollama`` and so on. Adding another provider is simply adding another endpoint. + +So if you do + +``curl -X POST https://robinroy03-fury-engine.hf.space/api/groq/generate -H "Content-Type: application/json" -d '{"query": "How do I create a sphere in FURY?", "llm": "llama3-70b-8192", "knn": "3", "stream": false}'`` + + +You'll get a response from ``llama3-70b-8192`` using ``groq``.
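As a rough sketch, the same request could also be made from Python using only the standard library. The URL and payload fields are the ones from the curl example above; ``build_request`` and ``ask`` are hypothetical helper names used for illustration, not part of the project, and the reply is assumed to be a single JSON object because ``stream`` is off.

```python
import json
import urllib.request

# Base URL taken from the curl example in the post.
BASE_URL = "https://robinroy03-fury-engine.hf.space"


def build_request(provider, query, llm, knn=3, stream=False):
    """Build (url, body) for the router's generate endpoint.

    Hypothetical helper: the name and signature are illustrative only.
    """
    url = f"{BASE_URL}/api/{provider}/generate"
    payload = {
        "query": query,
        "llm": llm,
        "knn": str(knn),   # the post sends knn as a string
        "stream": stream,  # json.dumps turns Python False into JSON false
    }
    return url, json.dumps(payload).encode("utf-8")


def ask(provider, query, llm, knn=3):
    """POST the query and return the decoded JSON reply (assumes stream=False)."""
    url, body = build_request(provider, query, llm, knn=knn, stream=False)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as reply:
        return json.loads(reply.read().decode("utf-8"))
```

Calling ``ask("groq", "How do I create a sphere in FURY?", "llama3-70b-8192")`` should then return a dictionary with the ``response`` and ``references`` keys shown earlier, provided the endpoint is still reachable. Changing the first argument is all it takes to switch providers.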
If you do ``https://robinroy03-fury-engine.hf.space/api/google/generate`` you can call any Google Gemini model, such as ``gemini-1.5-pro`` or ``gemini-1.5-flash``. Same for ``ollama``. + +This could still be improved; it does not currently account for vision models. I did not add that because we currently use vision models only for benchmarking, and that is done locally. Benchmarking could also be streamlined; I avoided that because benchmarking is still in development, so I'd have to rewrite it every day. Presently you can use this core ``router`` for working LLM generation (you'll get the same thing you get from the Discord Bot, so if you have a website, all you have to do is call the API). + +This is our present architecture: + +.. image:: /_static/images/gsoc_llm_robin_week5.jpg + :alt: Present LLM architecture. + +It is the same thing as above, except we have two new components: ``LLM Engine`` and a ``Groq & Gemini`` endpoint. When we end up with a conversational setup (right now, it is one question and one answer), this architecture will be upgraded to accommodate that. My plan is to extend the LLM Engine to add that. Other features such as vision could also be added to this as needed. + +2) **Gemini Models added** + +As mentioned above, I added ``Gemini`` models this week. They have a decent free tier. Also, I'm studying the feasibility of fine-tuning using the ``Gemini`` models. + +3) **LLM Benchmarking** + +LLM Benchmarking is the process of evaluating the LLM output and giving it a score. With this, making the model better will simply be a matter of increasing the score. This area is still under development and the things I've tried here are the current standard procedures. To understand more about benchmarking, you can read `this `_, `this `_ and `this `_. This `course `_ is also amazing. + +I'll give a TL;DR anyway: +LLM benchmarking is essentially like writing an English Literature exam and getting the grades.
Your evaluator may give you a 4 or a 5, and the reasoning can vary. For the same answer, you may even get very different results from 2 different evaluators! Two common rubrics they use are ``groundedness`` (whether the answer follows from the material) and ``completion`` (whether the answer is complete, whether it fully answers the question with respect to the material). These are the same rubrics we'll use for LLM evaluation. For code, it's different. The code should compile and do exactly what it is supposed to. + +Now FURY Bot does 2 things: writing code and writing answers for common questions (on GitHub issues etc.). Presently, I've only collected data for coding questions, as they are much easier to evaluate and give a clear sense of direction (also I found more coding data). + +Evaluating FURY code can be done by: + 1) Running the code. + 2) Checking the output. + +Right now, we do this with ``pytest`` for the tests in the FURY repo. But this approach is tedious, as collecting questions and writing test cases takes a lot of time. The orientation of the 3D objects also matters, and LLM generation is not deterministic. So we are using a vision model, ``moondream2``, to check the LLM-generated output and verify if it is what we actually wanted. +At a high level, this is what we do (for now): + +- Take a QnA pair from the collected dataset (I've collected ~23 questions). +- Ask the LLM to generate FURY code for that (using the references). +- Run this generated code. +- Check the output using ``moondream2`` and verify whether it is what we wanted. + +There is also ``fast_eval``, which checks whether the code compiles and skips ``moondream2`` entirely. This is obviously faster and is also decently good (it is actually a pretty good heuristic). If it runs, assume it works :) + +These are our current stats: (from now on, we can finally talk using numbers) + +Coding benchmark: +~~~~~~~~~~~~~~~~~ +On ``fast_eval`` we have a success rate of ``47.83%`` for ``groq``.
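The "if it runs, assume it works" idea behind ``fast_eval`` can be sketched as below. This is only an illustration under assumptions, not the project's actual benchmark code: ``runs_cleanly`` and ``success_rate`` are hypothetical names, each snippet is run in a fresh Python process, and a timeout is added because generated FURY code may open a window and never exit.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def runs_cleanly(code: str, timeout: float = 60.0) -> bool:
    """Return True if `code` executes in a fresh interpreter with exit status 0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # Treat a hang (e.g. a blocking render window) as a failure.
            return False
    return result.returncode == 0


def success_rate(snippets: list[str]) -> float:
    """Percentage of snippets that ran without error, rounded to 2 places."""
    if not snippets:
        return 0.0
    passed = sum(runs_cleanly(snippet) for snippet in snippets)
    return round(100 * passed / len(snippets), 2)
```

For scale, 11 passing snippets out of 23 would give ``47.83`` and 3 out of 23 would give ``13.04``. Those counts are an inference from the ~23 collected questions, not figures stated in the post.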
+ +On ``normal_eval`` we have a success rate of ``13.04%`` for ``groq``. + +Note that ``moondream2`` also sometimes mistakes the output for something else. It was ``~45%`` when I checked manually. For now, I'm only going to focus on ``fast_eval``, as fixing ``moondream2`` is a distraction for the moment. (This actually gets very meta: there are projects that have benchmarks for the evaluator itself, and so on. `Read this `_.) + + +What is coming up next week? +---------------------------- + +- Better benchmark scores :) +- Line number highlighting @ references. +- Some ``references`` improvements. + +Did you get stuck anywhere? +--------------------------- + +No, I did not get stuck anywhere. + +LINKS: + +- `RAG Evaluation `_ +- `LLM Judge `_ +- `Advanced RAG `_ +- `Advanced Retrieval for AI `_ +- `Moondream2 `_ +- `Finding GPT-4 mistakes with GPT-4 `_ + +Thank you for reading!