Lumina — Video & Image Search
January 1, 2026
A search engine for videos and images using CLIP-family models for vector embeddings. Indexed 2K+ YouTube videos (frame by frame) and 118K COCO images. Also includes Wikipedia search.
What is it?
A multimodal search engine for images and video frames. You describe what you're looking for in plain text, and it returns semantically matching images from 118K COCO photos and frames from 2K+ YouTube videos. No keyword tags, no metadata matching — pure vector similarity via CLIP embeddings.
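At its core, retrieval is cosine similarity between a text embedding and pre-computed image embeddings. A minimal sketch with NumPy — the mock random vectors stand in for real CLIP outputs, and all names here are illustrative:

```python
import numpy as np

def cosine_top_k(query: np.ndarray, index: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k index vectors most similar to the query."""
    # Normalize so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = m @ q
    return np.argsort(-scores)[:k].tolist()

# Mock 512-dim "frame embeddings" and a "text query" embedding.
rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 512))
query = frames[42] + 0.01 * rng.normal(size=512)  # near frame 42

print(cosine_top_k(query, frames, k=3)[0])  # frame 42 ranks first
```

In production this ranking is delegated to Pinecone or Qdrant, but the math is the same.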
Two systems, one frontend
Lumina is actually two separate pipelines merged into one UI:
Image search: jinaai/jina-clip-v1 (768-dim) embeddings stored in Pinecone (Serverless, AWS us-east-1, cosine similarity, index name: multi-beast). Built in ImageSearchEngine/.
Video search: openai/clip-vit-base-patch32 (512-dim) embeddings stored in Qdrant Cloud. Built in video_search/. The Qdrant Docker Compose file still names the container 'pinecone-local' — a leftover from the mid-project switch from Pinecone to Qdrant.
Production Flask API (gunicorn, 8 threads) serves both collections. Domains: images.cbproforge.com for images, multi.cbproforge.com / lumina-search.cbproforge.com for combined.
Video embedding pipeline
YouTube video metadata was collected by scraping 46 channels across tech, science, politics, and education (freeCodeCamp, Veritasium, 3Blue1Brown, Kurzgesagt, MKBHD, TED, CNN, and more) using yt-dlp --flat-playlist --dump-json.
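With --dump-json, yt-dlp emits one JSON object per video, one per line. A sketch of parsing that stream (the sample lines and field names are illustrative, not real yt-dlp output; in practice the command would be run via subprocess and its stdout parsed the same way):

```python
import json

# Simulated output of: yt-dlp --flat-playlist --dump-json <channel_url>
sample = """\
{"id": "dQw4w9WgXcQ", "title": "Video A", "channel": "freeCodeCamp"}
{"id": "abc123def45", "title": "Video B", "channel": "Veritasium"}
"""

def parse_flat_playlist(raw: str) -> list[dict]:
    """Parse yt-dlp's JSON-lines output into a list of video records."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]

videos = parse_flat_playlist(sample)
print(len(videos))  # 2
```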
For each video: download the lowest-quality version (first 30 seconds only, to save bandwidth) with yt-dlp, extract 10 evenly spaced frames with OpenCV, and generate a CLIP embedding for each frame. Progress is checkpointed every 25 videos so a failed run can resume. Random 2-5 s delays between videos and a 60 s break every 50 videos keep the scraper under rate limits.
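Extracting 10 evenly spaced frames reduces to picking evenly spaced frame indices and seeking to each one. A sketch — the index math is pure Python, and the cv2 calls (shown as comments) are the assumed usage based on the pipeline description:

```python
def evenly_spaced_indices(total_frames: int, n: int = 10) -> list[int]:
    """Pick n frame indices spread evenly across a video."""
    if total_frames <= 0:
        return []
    step = total_frames / n
    return [min(int(i * step), total_frames - 1) for i in range(n)]

print(evenly_spaced_indices(900, 10))  # [0, 90, 180, ..., 810]

# Assumed OpenCV usage:
#   cap = cv2.VideoCapture(path)
#   total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
#   for idx in evenly_spaced_indices(total):
#       cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
#       ok, frame = cap.read()
```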
This all ran on Kaggle's free T4 GPU. The data_persist/ folder shows the work split into three parallel workers: 0-5K, 5K-7.5K, 7.5K-10K videos.
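The checkpoint/resume pattern is what makes preemptible free GPUs workable. A minimal sketch of checkpointing every 25 items — the filename and JSON schema are hypothetical, not taken from the repo:

```python
import json
import os

CKPT = "checkpoint.json"  # hypothetical filename

def load_checkpoint() -> int:
    """Return the index of the next video to process (0 on a fresh run)."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["next_index"]
    return 0

def save_checkpoint(next_index: int) -> None:
    with open(CKPT, "w") as f:
        json.dump({"next_index": next_index}, f)

start = load_checkpoint()
for i in range(start, start + 100):
    # ... download video i, extract frames, embed ...
    if (i + 1) % 25 == 0:        # checkpoint every 25 videos
        save_checkpoint(i + 1)
```

If the Kaggle session dies, rerunning the script picks up from the last multiple of 25 instead of from zero.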
Multi-stage ranking
The video search doesn't just return raw nearest neighbors. It runs three stages:
1. Pure vector search (limit × 10 candidates)
2. Text scroll search on the caption and channel fields
3. Manual score boosting: +0.2 for keyword matches in caption/channel, +0.4 for a brand name in the channel title, scores clamped at 1.5 for exact text matches
This hybrid approach handles the common case where 'Andrej Karpathy neural networks' should surface Karpathy's channel videos even if the specific frame embedding isn't the closest vector.
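The boosting stage is simple to sketch. A hedged approximation — function and field names are illustrative, and the matching rules are simplified from the description above:

```python
def boost_score(base: float, query: str, caption: str, channel: str) -> float:
    """Re-rank one vector-search hit with the keyword boosts described above."""
    score = base
    terms = query.lower().split()
    text = f"{caption} {channel}".lower()
    if any(t in text for t in terms):
        score += 0.2              # keyword match in caption/channel
    if query.lower() in channel.lower():
        score += 0.4              # brand/name match in channel title
    return min(score, 1.5)        # clamp so text matches can't grow unbounded

print(boost_score(0.71, "karpathy", "Neural networks: zero to hero", "Andrej Karpathy"))
```

A hit with base similarity 0.71 whose channel title contains the query ends up around 1.31, outranking frames that are marginally closer in vector space but textually unrelated.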
Infrastructure and the /proxy_image endpoint
The Cloud Run base image (gcr.io/gemini-access-478409/video-search-base:latest) pre-downloads CLIP model weights during build. The app Dockerfile just copies main.py on top — no re-downloading models on every deploy.
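The pattern looks roughly like this — a sketch, not the actual Dockerfiles; package choices and the weight-download line are assumptions based on the description:

```dockerfile
# Base image: pre-bake the model weights once at build time
FROM python:3.11-slim
RUN pip install transformers torch flask gunicorn
# Download CLIP weights during build so deploys never re-fetch them
RUN python -c "from transformers import CLIPModel; \
    CLIPModel.from_pretrained('openai/clip-vit-base-patch32')"

# App Dockerfile: just layer the code on top of the pre-baked base
# FROM gcr.io/gemini-access-478409/video-search-base:latest
# COPY main.py /app/main.py
# CMD ["gunicorn", "--threads", "8", "main:app"]
```

Since the heavy layer is cached in the base image, an app deploy only rebuilds the thin COPY layer.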
The /proxy_image endpoint exists because many YouTube thumbnail URLs trigger CORS errors in browsers. The endpoint proxies the image through the server, bypassing client-side blocking. Small detail, but it prevented a lot of broken images in the UI.
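A minimal sketch of such a proxy in Flask — the allowlist, parameter name, and validation logic here are illustrative assumptions, not the repo's actual code:

```python
from urllib.parse import urlparse
from urllib.request import urlopen

from flask import Flask, Response, abort, request

app = Flask(__name__)

# Only proxy known thumbnail hosts (illustrative allowlist).
ALLOWED_HOSTS = {"i.ytimg.com", "img.youtube.com"}

@app.route("/proxy_image")
def proxy_image():
    url = request.args.get("url", "")
    if urlparse(url).hostname not in ALLOWED_HOSTS:
        abort(400)
    # Fetch server-side so the browser never makes the cross-origin request.
    with urlopen(url, timeout=10) as upstream:
        return Response(
            upstream.read(),
            content_type=upstream.headers.get("Content-Type", "image/jpeg"),
        )
```

The browser requests /proxy_image?url=..., the server fetches the thumbnail itself, and the response comes back same-origin, so no CORS policy applies.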
Key takeaways
- CLIP: dual encoder architecture, shared embedding space for text and images
- Kaggle free GPU workflow: T4, parallel workers across data slices, checkpoint/resume
- Qdrant vs Pinecone: self-hosted Docker vs managed cloud — the mid-project switch
- yt-dlp for structured YouTube data: flat playlist scraping, polite rate limiting
- Cloud Run base image pattern: pre-bake large model downloads to reduce deploy time