Add mobileclip notebook (#1804)

openvinotoolkit · Mar 12, 2024 · 3b0b55a
1 parent 2de66b5
commit 3b0b55a

Showing 5 changed files with 977 additions and 1 deletion.
diff --git a/.ci/ignore_pip_conflicts.txt b/.ci/ignore_pip_conflicts.txt
@@ -16,4 +16,5 @@ notebooks/272-paint-by-example/272-paint-by-example.ipynb # gradio==3.44.1
 notebooks/273-stable-zephyr-3b-chatbot/273-stable-zephyr-3b-chatbot.ipynb # install requirements.txt after clone repo
 notebooks/279-mobilevlm-language-assistant/279-mobilevlm-language-assistant.ipynb # transformers<4.35
 notebooks/280-depth-anything/280-depth-anything.ipynb # install requirements.txt after clone repo
-notebooks/285-surya-line-level-text-detection/285-surya-line-level-text-detection.ipynb # requires python >=3.9
+notebooks/285-surya-line-level-text-detection/285-surya-line-level-text-detection.ipynb # requires python >=3.9
+notebooks/289-mobileclip-video-search/289-mobileclip-video-search.ipynb # install requirements.txt inside
diff --git a/.ci/spellcheck/.pyspelling.wordlist.txt b/.ci/spellcheck/.pyspelling.wordlist.txt
@@ -402,6 +402,7 @@ MLLM
 MLLMs
 MMVLM
 MLP
+MobileCLIP
 MobileLLaMA
 mobilenet
 MobileNet

diff --git a/notebooks/289-mobileclip-video-search/289-mobileclip-video-search.ipynb b/notebooks/289-mobileclip-video-search/289-mobileclip-video-search.ipynb
diff --git a/notebooks/289-mobileclip-video-search/README.md b/notebooks/289-mobileclip-video-search/README.md
@@ -0,0 +1,31 @@
+# Visual Content Search using MobileCLIP and OpenVINO™
+[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openvinotoolkit/openvino_notebooks/blob/main/notebooks/289-mobileclip-video-search/289-mobileclip-video-search.ipynb)
+
+![example.png](https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/4e241f82-548e-41c2-b1f4-45b319d3e519)
+
+Semantic visual content search is a machine learning task that uses either a text query or an input image to search a database of images (photo gallery, video) to find images that are semantically similar to the search query.
+Historically, building a robust search engine for images was difficult. One could search by features such as file name and image metadata, and use any context around an image (e.g. alt text or surrounding text if an image appears in a passage of text) to provide richer search capabilities. This was before the advent of neural networks that can identify images semantically related to a given user query.
+
+[Contrastive Language-Image Pre-Training (CLIP)](https://arxiv.org/abs/2103.00020) models provide the means through which you can implement a semantic search engine with a few dozen lines of code. The CLIP model has been trained on millions of pairs of text and images, encoding semantics from images and text combined. Given a text query, you can use CLIP to rank images by how closely they match the query and return the most relevant ones.
+
+In this tutorial, we consider how to use [MobileCLIP](https://arxiv.org/pdf/2311.17049.pdf) to implement a visual content search engine that finds relevant frames in a video.
+
+## Notebook Contents
+
+This tutorial provides step-by-step instructions on how to run PyTorch MobileCLIP with OpenVINO. It also provides an interactive user interface for searching for the video frames that are most relevant to text or image queries.
+The tutorial consists of the following steps:
+
+
+- Select model
+- Prepare PyTorch model
+- Run PyTorch model inference
+- Convert PyTorch model to OpenVINO IR
+- Run model inference with OpenVINO
+- Launch interactive demo for 
+
+
+## Installation Instructions
+
+This is a self-contained example that relies solely on its own code.<br>
+We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
+For details, please refer to the [Installation Guide](../../README.md).
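
The retrieval step that the README above describes, ranking video frames against a text or image query, can be sketched independently of the notebook. The following is a minimal sketch and not the notebook's code: it assumes that frame and query embeddings have already been produced by some image and text encoder, and it uses toy 3-dimensional vectors as stand-ins for real model outputs. The names `l2_normalize` and `rank_frames` are hypothetical and exist only for this illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_frames(query_embedding, frame_embeddings, top_k=3):
    """Return (frame_index, cosine_similarity) pairs, most similar first."""
    query = l2_normalize(query_embedding)
    scored = []
    for index, frame in enumerate(frame_embeddings):
        unit_frame = l2_normalize(frame)
        similarity = sum(q * f for q, f in zip(query, unit_frame))
        scored.append((index, similarity))
    # Sort by similarity, highest first, and keep the best top_k frames.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (assumption): one vector per video frame.
frame_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
# Toy query embedding (assumption): would come from a text or image encoder.
query_embedding = [0.9, 0.1, 0.0]

best = rank_frames(query_embedding, frame_embeddings, top_k=2)
for frame_index, similarity in best:
    print(f"frame {frame_index}: similarity {similarity:.3f}")
```

Normalizing both sides first means the ranking depends only on the direction of the embeddings, which is the usual way CLIP-style embeddings are compared. In a real pipeline the per-frame vectors would be computed once and reused for every query.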
diff --git a/selector/src/shared/notebook-tags.js b/selector/src/shared/notebook-tags.js
@@ -18,7 +18,10 @@ export const TASKS = /** @type {const} */ ({
     TEXT_TO_AUDIO: 'Text-to-Audio',
     AUDIO_TO_TEXT: 'Audio-to-Text',
     VISUAL_QUESTION_ANSWERING: 'Visual Question Answering',
+    IMAGE_CAPTIONING: 'Image Captioning',
     FEATURE_EXTRACTION: 'Feature Extraction',
+    TEXT_TO_IMAGE_RETRIEVAL: 'Text-to-Image Retrieval',
+    IMAGE_TO_TEXT_RETRIEVAL: 'Image-to-Text Retrieval'
   },
   CV: {
     IMAGE_CLASSIFICATION: 'Image Classification',