# Multimodal language models for GalaxyZoo image interpretation

## Rationale

The aim of this project is to leverage existing Large Multimodal Models (LMMs) to engage meaningfully with astronomical images. The overarching goal is to fine-tune a language-and-vision model such as LLaVA on a curated dataset from the Galaxy Zoo project.

You can see examples of the Galaxy Zoo Talk discussions here:

https://www.zooniverse.org/projects/zookeeper/galaxy-zoo/talk/1270


The steps of the project are as follows:

  1. Explore the Galaxy Zoo Talk dataset.
  2. Read and understand the high-level details of the LLaVA and LLaVA-Med papers.
  3. Summarise the discussion text with an LLM, using either open-source or proprietary models.
  4. Curate the image-summary pairs for instruction tuning (see the sketch after this list).
  5. Fine-tune the model.
  6. Evaluate the model.
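As a rough illustration of step 4, the sketch below turns curated image-summary pairs into a LLaVA-style instruction-tuning file. The input file name, CSV column names, prompt text, and exact JSON schema are assumptions and may need adjusting to the specific LLaVA release used for fine-tuning.

```python
# Minimal sketch: build LLaVA-style instruction-tuning records from
# curated (image, summary) pairs. Schema and file names are assumptions.
import csv
import json
from pathlib import Path

PROMPT = "<image>\nDescribe the galaxy shown in this Galaxy Zoo cutout."

def build_records(pairs_csv: str) -> list[dict]:
    """Read rows of (image_path, summary) and emit LLaVA-style records."""
    records = []
    with open(pairs_csv, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            records.append({
                "id": f"galaxyzoo-{i:06d}",
                "image": row["image_path"],  # assumed column name
                "conversations": [
                    {"from": "human", "value": PROMPT},
                    {"from": "gpt", "value": row["summary"]},  # assumed column name
                ],
            })
    return records

if __name__ == "__main__":
    recs = build_records("galaxyzoo_pairs.csv")  # hypothetical input file
    Path("galaxyzoo_instruct.json").write_text(json.dumps(recs, indent=2))
    print(f"wrote {len(recs)} instruction-tuning records")
```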

The architecture of the LLaVA model connects the pre-trained CLIP ViT-L/14 visual encoder to the LLaMA decoder.
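As a rough illustration of that connection, here is a minimal PyTorch sketch. The class name is ours, and the dimensions (1024 for ViT-L/14, 4096 for LLaMA-7B) are illustrative; LLaVA v1 uses a single linear projection, while later versions use a small MLP.

```python
# Toy sketch of the LLaVA-style connection: frozen CLIP patch features are
# mapped by a trainable projection into the LLaMA token-embedding space.
import torch
import torch.nn as nn

class VisionToLanguageProjector(nn.Module):
    """Projects CLIP patch features into the decoder's embedding space."""
    def __init__(self, vision_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the CLIP encoder
        return self.proj(patch_features)  # (batch, num_patches, lm_dim)

# The projected "visual tokens" are concatenated with the text token
# embeddings and fed to the LLaMA decoder as a single sequence.
fake_clip_features = torch.randn(2, 256, 1024)  # stand-in for CLIP output
visual_tokens = VisionToLanguageProjector()(fake_clip_features)
print(visual_tokens.shape)  # torch.Size([2, 256, 4096])
```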

You can watch the hack presentation by Jo during the telecon.

There is also a good video describing multimodal language models here: https://www.youtube.com/watch?v=mkI7EPD1vp8

## Dataset

## References

Here is a list of references to get started on the subject.

LLM-specific resources: