Image retrieval from natural-language queries is typically done in one of the following ways:

Embedding Based Retrieval

  1. Embed images with CLIP
  2. Embed the query using the same model and do KNN search to retrieve semantically similar images.
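
The two steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the three-dimensional vectors are hand-written stand-ins for real CLIP embeddings, the file names are hypothetical, and `cosine_similarity` and `knn_search` are helper names introduced only for this example. A real system would obtain the vectors from a CLIP model and would likely use a vector index rather than a brute-force scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_search(query_embedding, image_embeddings, k=2):
    """Brute-force KNN: score every image and keep the k most similar."""
    scored = [
        (cosine_similarity(query_embedding, embedding), image_id)
        for image_id, embedding in image_embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [image_id for _, image_id in scored[:k]]


# Toy vectors standing in for CLIP image embeddings (assumption).
image_embeddings = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "harbor.jpg": [0.7, 0.2, 0.3],
}

# Toy vector standing in for the CLIP text embedding of the query (assumption).
query_embedding = [1.0, 0.0, 0.1]

top_images = knn_search(query_embedding, image_embeddings, k=2)
print(top_images)
```

Because the image and the query are embedded by the same model, they share one vector space, which is what makes a direct similarity comparison meaningful.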

Visual LLM Based Retrieval

  1. Describe each image using a visual LLM such as LLaVA or GPT-4V.
  2. Index the description and retrieve images by searching the descriptions.
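
As a rough sketch of step 2, the example below builds a small keyword index over image descriptions and ranks images by how many query terms they match. The descriptions are hard-coded placeholders for text a visual LLM might produce, and `tokenize`, `build_index`, and `search` are hypothetical helpers written for this illustration. In practice the descriptions could instead be embedded and searched semantically, or stored in a full-text search engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(descriptions):
    """Map each token to the set of image ids whose description contains it."""
    index = defaultdict(set)
    for image_id, description in descriptions.items():
        for token in tokenize(description):
            index[token].add(image_id)
    return index


def search(index, query):
    """Rank images by the number of distinct query tokens they match."""
    scores = defaultdict(int)
    for token in tokenize(query):
        for image_id in index.get(token, ()):
            scores[image_id] += 1
    # Most matches first; ties broken by image id for a stable order.
    return sorted(scores, key=lambda image_id: (-scores[image_id], image_id))


# Placeholder descriptions standing in for visual LLM output (assumption).
descriptions = {
    "img_001.jpg": "A red bicycle leaning against a brick wall.",
    "img_002.jpg": "Two dogs playing on a sandy beach.",
    "img_003.jpg": "A child riding a red bicycle on the beach.",
}

index = build_index(descriptions)
results = search(index, "red bicycle beach")
print(results)
```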

SQL Based Retrieval

Semantic search over descriptions and CLIP-based search both return images that are semantically similar to the query, so they can be less accurate when an exact match on image content is needed. Structured extraction with object detection enables querying by object class name using SQL.

  1. Run object detection models such as YOLOv9 or Grounding DINO on the images to extract objects.
  2. Write the object names, bounding boxes and other image metadata to a structured table.
  3. Query the structured table using SQL to retrieve images.
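
Steps 2 and 3 might look like the sketch below, which uses Python's built-in `sqlite3` module. The table name `detections`, its columns, and the sample rows are assumptions made for illustration; the rows stand in for the output of an object detector and are not real model results.

```python
import sqlite3

# In-memory database for the example; a real deployment would persist it.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per detected object.
conn.execute(
    """
    CREATE TABLE detections (
        image_id    TEXT,
        object_name TEXT,
        x1 REAL, y1 REAL, x2 REAL, y2 REAL,  -- bounding box corners
        score       REAL                      -- detector confidence
    )
    """
)

# Placeholder rows standing in for detector output (assumption).
rows = [
    ("park.jpg", "dog", 10, 20, 110, 140, 0.92),
    ("park.jpg", "person", 150, 30, 220, 260, 0.88),
    ("street.jpg", "person", 40, 50, 100, 240, 0.95),
    ("street.jpg", "car", 200, 120, 420, 260, 0.81),
    ("yard.jpg", "dog", 30, 60, 90, 120, 0.40),
    ("yard.jpg", "person", 120, 20, 200, 250, 0.90),
]
conn.executemany("INSERT INTO detections VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Retrieve images that contain both a dog and a person,
# counting only detections with confidence of at least 0.5.
query = """
    SELECT image_id
    FROM detections
    WHERE object_name IN ('dog', 'person') AND score >= 0.5
    GROUP BY image_id
    HAVING COUNT(DISTINCT object_name) = 2
    ORDER BY image_id
"""
matching_images = [row[0] for row in conn.execute(query)]
print(matching_images)
```

Unlike similarity search, this query is exact: an image is returned only if both object classes were detected above the confidence threshold, so `yard.jpg` is excluded because its dog detection scores below 0.5.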

Examples

Visual Search Engine for E-commerce