The rationale of this project is to leverage existing Large Multi-Modal Models (LMMs) to engage meaningfully with astronomical images. The overarching goal is to fine-tune a language-and-vision model such as LLaVA on a curated dataset from the Galaxy Zoo project.
You can see examples of the Galaxy Zoo Talk discussions here:
https://www.zooniverse.org/projects/zookeeper/galaxy-zoo/talk/1270
The steps of the project are as follows:
- Explore the Galaxy Zoo Talk dataset.
- Read and understand the high-level details of the LLaVA and LLaVA-Med papers.
- Summarise the discussion text with an LLM, using either open-source or proprietary models.
- Curate the image-summary pairs for instruction tuning (see the formatting sketch after this list).
- Fine-tune the model.
- Evaluate the model.
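To make the curation step concrete, here is a minimal sketch of how image-summary pairs might be written out as conversation-style instruction-tuning records, loosely following the JSON layout used by the LLaVA repository. The field names, prompt wording, and file names here are illustrative assumptions and should be checked against whichever fine-tuning script you end up using.

```python
# Sketch: convert (subject id, image file, summary) pairs into
# conversation-style records for instruction tuning. The schema below is
# an assumption modelled on the LLaVA-style layout, not a fixed standard.
import json

def make_record(subject_id, image_filename, summary):
    return {
        "id": str(subject_id),
        "image": image_filename,
        "conversations": [
            {"from": "human",
             "value": "<image>\nDescribe the notable features of this galaxy."},
            {"from": "gpt", "value": summary},
        ],
    }

# Toy example pair; in practice these come from Galaxy Zoo subjects and
# the LLM-generated summaries of their Talk threads.
pairs = [
    (1001, "subject_1001.jpg",
     "A barred spiral galaxy with a prominent bulge and a possible tidal tail."),
]

records = [make_record(sid, img, summ) for sid, img, summ in pairs]
with open("galaxy_zoo_instruct.json", "w") as f:
    json.dump(records, f, indent=2)
```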
Figure: the architecture of the LLaVA model, where the pre-trained CLIP ViT-L/14 visual encoder is connected to the LLaMA decoder through a trainable projection layer.
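To illustrate that connection, below is a minimal sketch in plain PyTorch with toy tensors (no pretrained weights are loaded): a projection maps CLIP ViT-L/14 patch features into the LLaMA embedding space so the projected image tokens can be prepended to the text tokens fed to the decoder. The dimensions shown (1024 for ViT-L/14, 4096 for LLaMA-7B) are the standard ones; the original LLaVA paper uses a single linear projection, while later versions replace it with a small MLP.

```python
import torch
import torch.nn as nn

class LlavaStyleConnector(nn.Module):
    """Minimal sketch of the LLaVA-style bridge: a linear projection maps
    frozen CLIP ViT-L/14 patch features (1024-d) into the LLaMA embedding
    space (4096-d for the 7B model), so image tokens can sit in the same
    sequence as text tokens."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features, text_embeddings):
        # patch_features: (batch, num_patches, vision_dim) from the CLIP encoder
        # text_embeddings: (batch, seq_len, llm_dim) from the LLaMA embedding layer
        image_tokens = self.proj(patch_features)
        # Prepend the projected image tokens to the text tokens; the combined
        # sequence is what the LLaMA decoder attends over.
        return torch.cat([image_tokens, text_embeddings], dim=1)

# Toy shapes: batch of 2 images with 256 patches each, 16 text tokens.
vision_feats = torch.randn(2, 256, 1024)
text_embeds = torch.randn(2, 16, 4096)
connector = LlavaStyleConnector()
print(connector(vision_feats, text_embeds).shape)  # torch.Size([2, 272, 4096])
```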
You can watch the hack presentation by Jo during the telecon.
There is also a good video describing LMMs here: https://www.youtube.com/watch?v=mkI7EPD1vp8
Here is a list of references to get started on the subject:
LLM-specific resources:
- HuggingFace NLP course: https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt (a good reference for understanding the main parts of an NLP pipeline: tokenizers, embeddings, downstream tasks)
- HuggingFace Transformers (https://huggingface.co/docs/transformers/index)
- LangChain tutorials, e.g. how to summarise text: https://python.langchain.com/docs/modules/chains/popular/summarize.html (a minimal summarisation sketch is given after this list)
- OpenAI cookbook: https://github.com/openai/openai-cookbook. This example shows how you can summarise a paper: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_call_functions_for_knowledge_retrieval.ipynb
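As a starting point for the summarisation step, here is a minimal sketch using the Hugging Face transformers summarisation pipeline on a made-up Talk thread. The checkpoint "facebook/bart-large-cnn" is just one convenient open-source option; a LangChain summarisation chain or a proprietary API could be swapped in, and the example thread text is invented for illustration.

```python
# Sketch: summarise a (made-up) Galaxy Zoo Talk thread with an open-source
# model via the Hugging Face `transformers` summarisation pipeline.
from transformers import pipeline

summariser = pipeline("summarization", model="facebook/bart-large-cnn")

thread = (
    "Volunteer A: This looks like a barred spiral with a faint companion. "
    "Volunteer B: Agreed, and there may be a tidal tail to the lower left. "
    "Moderator: The SDSS spectrum suggests an ongoing merger."
)

summary = summariser(thread, max_length=60, min_length=15, do_sample=False)
print(summary[0]["summary_text"])
```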