Lumina — Video & Image Search
January 1, 2026
A search engine for videos and images using CLIP-family models for vector embeddings. Indexed 2K+ YouTube videos (frame by frame) and 118K COCO images. Also includes Wikipedia search.
What is it?
A multimodal search engine for images and video frames. You describe what you're looking for in plain text, and it returns semantically matching images from 118K COCO photos and frames from 2K+ YouTube videos. No keyword tags, no metadata matching — pure vector similarity via CLIP embeddings.
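At its core, retrieval is cosine similarity between a text embedding and pre-computed image embeddings. A minimal sketch with NumPy — the mock random vectors stand in for real CLIP outputs, and all names here are illustrative:

```python
import numpy as np

def cosine_top_k(query: np.ndarray, index: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k index vectors most similar to the query."""
    # Normalize so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = m @ q
    return np.argsort(-scores)[:k].tolist()

# Mock 512-dim "frame embeddings" and a "text query" embedding.
rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 512))
query = frames[42] + 0.01 * rng.normal(size=512)  # near frame 42

print(cosine_top_k(query, frames, k=3)[0])  # frame 42 ranks first
```

In production this ranking is delegated to Pinecone or Qdrant, but the math is the same.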
Two systems, one frontend
Lumina is actually two separate pipelines merged into one UI:
Image search: jinaai/jina-clip-v1 (768-dim) embeddings stored in Pinecone (Serverless, AWS us-east-1, cosine similarity, index name: multi-beast). Built in ImageSearchEngine/.
Video search: openai/clip-vit-base-patch32 (512-dim) embeddings stored in Qdrant Cloud. Built in video_search/. The Qdrant Docker Compose file still names the container 'pinecone-local' — a leftover from the mid-project switch from Pinecone to Qdrant.
Production Flask API (gunicorn, 8 threads) serves both collections. Domains: images.cbproforge.com for images, multi.cbproforge.com / lumina-search.cbproforge.com for combined.
Video embedding pipeline
YouTube video metadata was collected by scraping 46 channels across tech, science, politics, and education (freeCodeCamp, Veritasium, 3Blue1Brown, Kurzgesagt, MKBHD, TED, CNN, and more) using yt-dlp --flat-playlist --dump-json.
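With --dump-json, yt-dlp emits one JSON object per video, one per line. A sketch of parsing that stream (the sample lines and field names are illustrative, not real yt-dlp output; in practice the command would be run via subprocess and its stdout parsed the same way):

```python
import json

# Simulated output of: yt-dlp --flat-playlist --dump-json <channel_url>
sample = """\
{"id": "dQw4w9WgXcQ", "title": "Video A", "channel": "freeCodeCamp"}
{"id": "abc123def45", "title": "Video B", "channel": "Veritasium"}
"""

def parse_flat_playlist(raw: str) -> list[dict]:
    """Parse yt-dlp's JSON-lines output into a list of video records."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]

videos = parse_flat_playlist(sample)
print(len(videos))  # 2
```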
For each video: download the lowest-quality version (first 30 seconds only, to save bandwidth) with yt-dlp, extract 10 evenly spaced frames with OpenCV, and generate a CLIP embedding for each frame. Progress is checkpointed every 25 videos so a failed run can resume. Random 2-5 s delays between videos and a 60 s break every 50 videos keep the scraper under rate limits.
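Extracting 10 evenly spaced frames reduces to picking evenly spaced frame indices and seeking to each one. A sketch — the index math is pure Python, and the cv2 calls (shown as comments) are the assumed usage based on the pipeline description:

```python
def evenly_spaced_indices(total_frames: int, n: int = 10) -> list[int]:
    """Pick n frame indices spread evenly across a video."""
    if total_frames <= 0:
        return []
    step = total_frames / n
    return [min(int(i * step), total_frames - 1) for i in range(n)]

print(evenly_spaced_indices(900, 10))  # [0, 90, 180, ..., 810]

# Assumed OpenCV usage:
#   cap = cv2.VideoCapture(path)
#   total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
#   for idx in evenly_spaced_indices(total):
#       cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
#       ok, frame = cap.read()
```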
This all ran on Kaggle's free T4 GPU. The data_persist/ folder shows the work split into three parallel workers: 0-5K, 5K-7.5K, 7.5K-10K videos.
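The checkpoint/resume pattern is what makes preemptible free GPUs workable. A minimal sketch of checkpointing every 25 items — the filename and JSON schema are hypothetical, not taken from the repo:

```python
import json
import os

CKPT = "checkpoint.json"  # hypothetical filename

def load_checkpoint() -> int:
    """Return the index of the next video to process (0 on a fresh run)."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["next_index"]
    return 0

def save_checkpoint(next_index: int) -> None:
    with open(CKPT, "w") as f:
        json.dump({"next_index": next_index}, f)

start = load_checkpoint()
for i in range(start, start + 100):
    # ... download video i, extract frames, embed ...
    if (i + 1) % 25 == 0:        # checkpoint every 25 videos
        save_checkpoint(i + 1)
```

If the Kaggle session dies, rerunning the script picks up from the last multiple of 25 instead of from zero.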
Multi-stage ranking
The video search doesn't just return raw nearest neighbors. It runs three stages:
1. Pure vector search (limit × 10 candidates)
2. Text scroll search on the caption and channel fields
3. Manual score boosting: +0.2 for keyword matches in caption/channel, +0.4 for a brand name in the channel title, scores clamped at 1.5 for exact text matches
This hybrid approach handles the common case where 'Andrej Karpathy neural networks' should surface Karpathy's channel videos even if the specific frame embedding isn't the closest vector.
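The boosting stage is simple to sketch. A hedged approximation — function and field names are illustrative, and the matching rules are simplified from the description above:

```python
def boost_score(base: float, query: str, caption: str, channel: str) -> float:
    """Re-rank one vector-search hit with the keyword boosts described above."""
    score = base
    terms = query.lower().split()
    text = f"{caption} {channel}".lower()
    if any(t in text for t in terms):
        score += 0.2              # keyword match in caption/channel
    if query.lower() in channel.lower():
        score += 0.4              # brand/name match in channel title
    return min(score, 1.5)        # clamp so text matches can't grow unbounded

print(boost_score(0.71, "karpathy", "Neural networks: zero to hero", "Andrej Karpathy"))
```

A hit with base similarity 0.71 whose channel title contains the query ends up around 1.31, outranking frames that are marginally closer in vector space but textually unrelated.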
Infrastructure and the /proxy_image endpoint
The Cloud Run base image (gcr.io/gemini-access-478409/video-search-base:latest) pre-downloads CLIP model weights during build. The app Dockerfile just copies main.py on top — no re-downloading models on every deploy.
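The pattern looks roughly like this — a sketch, not the actual Dockerfiles; package choices and the weight-download line are assumptions based on the description:

```dockerfile
# Base image: pre-bake the model weights once at build time
FROM python:3.11-slim
RUN pip install transformers torch flask gunicorn
# Download CLIP weights during build so deploys never re-fetch them
RUN python -c "from transformers import CLIPModel; \
    CLIPModel.from_pretrained('openai/clip-vit-base-patch32')"

# App Dockerfile: just layer the code on top of the pre-baked base
# FROM gcr.io/gemini-access-478409/video-search-base:latest
# COPY main.py /app/main.py
# CMD ["gunicorn", "--threads", "8", "main:app"]
```

Since the heavy layer is cached in the base image, an app deploy only rebuilds the thin COPY layer.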
The /proxy_image endpoint exists because many YouTube thumbnail URLs trigger CORS errors in browsers. The endpoint proxies the image through the server, bypassing client-side blocking. Small detail, but it prevented a lot of broken images in the UI.
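A minimal sketch of such a proxy in Flask — the allowlist, parameter name, and validation logic here are illustrative assumptions, not the repo's actual code:

```python
from urllib.parse import urlparse
from urllib.request import urlopen

from flask import Flask, Response, abort, request

app = Flask(__name__)

# Only proxy known thumbnail hosts (illustrative allowlist).
ALLOWED_HOSTS = {"i.ytimg.com", "img.youtube.com"}

@app.route("/proxy_image")
def proxy_image():
    url = request.args.get("url", "")
    if urlparse(url).hostname not in ALLOWED_HOSTS:
        abort(400)
    # Fetch server-side so the browser never makes the cross-origin request.
    with urlopen(url, timeout=10) as upstream:
        return Response(
            upstream.read(),
            content_type=upstream.headers.get("Content-Type", "image/jpeg"),
        )
```

The browser requests /proxy_image?url=..., the server fetches the thumbnail itself, and the response comes back same-origin, so no CORS policy applies.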
Key takeaways
- CLIP: dual encoder architecture, shared embedding space for text and images
- Kaggle free GPU workflow: T4, parallel workers across data slices, checkpoint/resume
- Qdrant vs Pinecone: self-hosted Docker vs managed cloud — the mid-project switch
- yt-dlp for structured YouTube data: flat playlist scraping, polite rate limiting
- Cloud Run base image pattern: pre-bake large model downloads to reduce deploy time