KorCLIP is a project that implements a Korean version of the CLIP (Contrastive Language-Image Pretraining) model.
import io
import requests
from PIL import Image
import torch
from torchvision import transforms as T
from transformers import AutoModel, AutoTokenizer
MODEL_PATH = "thisisiron/korclip-vit-base-patch32"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModel.from_pretrained(MODEL_PATH).to(device)
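url = "http://images.cocodataset.org/val2014/COCO_val2014_000000537955.jpg"
# Download the sample image and force 3-channel RGB for the transforms below
image = Image.open(io.BytesIO(requests.get(url).content)).convert("RGB")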
preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
image = preprocess(image).unsqueeze(0).to(device)
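# Candidate labels: dog, cat, turtle. padding=True lets labels of different token lengths share one batch.
text = tokenizer(["강아지", "고양이", "거북이"], padding=True, return_tensors="pt").to(device)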
with torch.no_grad():
    image_features = model.get_image_features(image)
    text_features = model.get_text_features(**text)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs)
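The `100.0` factor plays the role of CLIP's logit scale, which is usually applied to cosine similarities. The following is a minimal sketch of that variant, not the repository's code: it L2-normalizes both feature sets before the softmax and then picks the best label. The helper name `zero_shot_probs` and the toy 2-D features are assumptions made for illustration; with the real model you would pass `image_features` and `text_features` instead.

```python
import torch

def zero_shot_probs(image_features, text_features, scale=100.0):
    """Softmax over scaled cosine similarities between images and texts."""
    # Normalizing both sides turns the dot product into a cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    return (scale * image_features @ text_features.T).softmax(dim=-1)

labels = ["강아지", "고양이", "거북이"]  # dog, cat, turtle

# Toy 2-D stand-ins for real embeddings (made-up values, not model output).
toy_image = torch.tensor([[2.0, 0.0]])
toy_texts = torch.tensor([[1.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])

probs = zero_shot_probs(toy_image, toy_texts)
best = labels[probs.argmax(dim=-1).item()]
print(best, probs)
```

Whether normalization matches how the released checkpoint was trained is not stated here, so compare both variants on a few images before relying on either.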
- Image download
cd data
./download.sh
- Korean annotation download [link]
  - You need the `MSCOCO_train_val_Korean.json` file.
- Preprocessing (json -> csv)
python csv_converter.py
- Directory structure
    ├── data
    │   ├── **download.sh**
    │   ├── **csv_converter.py**
    │   ├── train2014.zip
    │   ├── val2014.zip
    │   ├── train2014/
    │   ├── val2014/
    │   ├── MSCOCO_train_val_Korean.json
    │   ├── COCO_train.csv
    │   └── COCO_val.csv
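For orientation, the json -> csv step could look roughly like the sketch below. This is not the contents of `csv_converter.py`: it assumes the annotation file is a list of records with a `file_path` string and a `caption_ko` list of Korean captions, and the CSV column names `image_path` and `caption` are hypothetical. It runs on a tiny made-up sample rather than the real file.

```python
import csv
import json
import tempfile
from pathlib import Path

def json_to_csv(json_path, csv_path, split):
    """Write one (image_path, caption) row per Korean caption in `split`."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["image_path", "caption"])  # hypothetical column names
        for rec in records:
            # Assumed keys: "file_path" and "caption_ko".
            if not rec["file_path"].startswith(split):
                continue
            for caption in rec["caption_ko"]:
                writer.writerow([rec["file_path"], caption])
                rows += 1
    return rows

# Tiny made-up sample standing in for MSCOCO_train_val_Korean.json.
sample = [
    {"file_path": "train2014/a.jpg", "caption_ko": ["강아지 한 마리", "풀밭의 강아지"]},
    {"file_path": "val2014/b.jpg", "caption_ko": ["고양이 한 마리"]},
]
workdir = Path(tempfile.mkdtemp())
json_path = workdir / "sample.json"
json_path.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")

n_train = json_to_csv(json_path, workdir / "COCO_train.csv", "train2014")
n_val = json_to_csv(json_path, workdir / "COCO_val.csv", "val2014")
print(n_train, n_val)
```

Writing one row per caption keeps each image-text pair independent, which is a common layout for contrastive training data; the actual script may use a different one.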
- Single-GPU
./run.sh 1
- Multi-GPU
./run.sh NUM_GPU
- Evaluation currently uses only one prompt template. I plan to add more datasets and templates in future evaluations.
- The following metrics come from a model trained only on the Korean "COCO2014" dataset.
python eval.py
| Dataset | Acc@1 (%) | Acc@5 (%) |
|---|---|---|
| CIFAR10 | 61.99 | 93.82 |
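Acc@1 and Acc@5 count a sample as correct when the true class is among the 1 or 5 highest-scoring classes. How `eval.py` computes them is not shown here, so the following is only a sketch of the usual top-k calculation; the function name `topk_accuracy` and the logits are made up for illustration.

```python
import torch

def topk_accuracy(logits, targets, ks=(1, 5)):
    """Percentage of samples whose true class is among the top-k logits."""
    max_k = max(ks)
    top = logits.topk(max_k, dim=-1).indices   # (N, max_k) class indices
    hits = top == targets.unsqueeze(-1)        # (N, max_k) boolean matches
    return {k: 100.0 * hits[:, :k].any(dim=-1).float().mean().item() for k in ks}

# Made-up logits for 4 samples over 10 classes.
logits = torch.zeros(4, 10)
logits[0, 3] = 5.0                        # true class ranked first
logits[1, 7], logits[1, 2] = 5.0, 4.0     # true class ranked second
logits[2, 1:7] = torch.arange(1.0, 7.0)   # true class outside the top 5
logits[3, 9] = 1.0                        # true class ranked first
targets = torch.tensor([3, 2, 0, 9])

acc = topk_accuracy(logits, targets)
print(acc)  # {1: 50.0, 5: 75.0}
```

In a zero-shot setting the logits would be the image-to-text similarities, with one text embedding per class name inserted into the prompt template.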
- You can refer to `infer.ipynb`.