Vision‑Language Model (VLM)

A multimodal model that jointly processes images or video and text, enabling tasks such as image captioning and visual question answering (VQA).
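
The two tasks named above can be illustrated with a minimal sketch using the Hugging Face transformers library and publicly released BLIP checkpoints; the specific model IDs and the image path are assumptions for illustration, not part of this entry.

```python
from PIL import Image
from transformers import (
    BlipProcessor,
    BlipForConditionalGeneration,
    BlipForQuestionAnswering,
)

# Placeholder image path; substitute any RGB image.
image = Image.open("photo.jpg").convert("RGB")

# Image captioning: generate a free-form text description of the image.
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
cap_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
cap_inputs = cap_processor(images=image, return_tensors="pt")
caption_ids = cap_model.generate(**cap_inputs, max_new_tokens=30)
print(cap_processor.decode(caption_ids[0], skip_special_tokens=True))

# Visual question answering (VQA): condition generation on the image and a question.
vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
vqa_inputs = vqa_processor(
    images=image,
    text="How many people are in the photo?",  # hypothetical example question
    return_tensors="pt",
)
answer_ids = vqa_model.generate(**vqa_inputs)
print(vqa_processor.decode(answer_ids[0], skip_special_tokens=True))
```

In both cases the model fuses visual features from an image encoder with a text decoder, which is what lets a single VLM handle prompts that mix the two modalities.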