Description
Foundation models, combined with ample compute, make many moderately difficult vision tasks solvable with minimal custom code. This talk introduces an LLM-steerable pipeline that compiles a brief YAML spec into end-to-end segmentation, zero-shot classification, and optional geometry checks, executed on GPU clusters.
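A spec of this kind might look as follows; the field names and values here are illustrative assumptions, not the talk's actual schema:

```yaml
# Hypothetical pipeline spec — illustrative only
task: segment_and_classify
inputs:
  image_dir: /data/frames
segmentation:
  model: sam2
  proposals: automatic        # automatic mask generation, no point prompts
classification:
  model: clip
  prompts:
    - "a photo of a healthy leaf"
    - "a photo of a diseased leaf"
verification:
  model: blip3_vqa
  question: "Does this crop show a single leaf?"
geometry_checks:
  min_mask_area: 500          # reject tiny spurious masks
outputs:
  format: coco_json
```

The point of the design is that a multimodal LLM, not a human engineer, fills in such a spec from a few sample images and a task description.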
A remote multimodal LLM (e.g., ChatGPT) generates the configuration from sample images and a human description of the task; a Python runner on HPC invokes SAM2 for mask proposals, CLIP for prompt-driven labels, and optional BLIP-3 VQA for per-crop verification. Crucially, this workflow can double as a data engine: it produces large, reasonably clean pseudo-labeled datasets with little manual effort, enabling distillation into compact models that run without HPC.
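The labeling step can be sketched as follows. In the real runner, the embeddings come from CLIP's image and text encoders applied to SAM2 mask crops; here `embed_text` and the crop vector are hypothetical stubs so the control flow runs without GPUs or model weights. The decision rule itself is standard CLIP-style zero-shot classification: pick the text prompt whose embedding is most cosine-similar to the crop embedding.

```python
from math import sqrt

# Stub text embeddings — stand-ins for CLIP's text encoder output.
TEXT_EMBED = {
    "a photo of a cat": [1.0, 0.1, 0.0],
    "a photo of a dog": [0.0, 0.1, 1.0],
}

def embed_text(prompt: str) -> list[float]:
    return TEXT_EMBED[prompt]

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_label(crop_embedding: list[float], prompts: list[str]) -> str:
    """Assign the prompt whose text embedding is most similar to the crop."""
    return max(prompts, key=lambda p: cosine(crop_embedding, embed_text(p)))

# Stub crop embedding — stand-in for CLIP's image encoder on a SAM2 crop.
crop = [0.9, 0.2, 0.1]
print(zero_shot_label(crop, list(TEXT_EMBED)))  # -> a photo of a cat
```

In the full pipeline, each labeled crop would then optionally be passed to a VQA model for verification, and the surviving (mask, label) pairs accumulate into the pseudo-labeled set used for distillation.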