*Equal contribution
Columbia University
ViperGPT decomposes visual queries into interpretable steps.
Abstract
Answering visual queries is a complex task that requires
both visual processing and reasoning. End-to-end models,
the dominant approach for this task, do not explicitly differentiate between the two, limiting interpretability and generalization. Learning modular programs presents a promising
alternative, but has proven challenging due to the difficulty
of learning both the programs and modules simultaneously.
We introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models
into subroutines to produce a result for any query. ViperGPT
utilizes a provided API to access the available modules, and
composes them by generating Python code that is later executed. This simple approach requires no further training,
and achieves state-of-the-art results across various complex
visual tasks.
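To make the idea concrete, here is an illustrative sketch of the kind of Python program ViperGPT generates. The `ImagePatch` class and its `find` method follow the style of the API described in the paper, but are stubbed here with a fake detection list so the example runs standalone; a real program would call actual vision models.

```python
class ImagePatch:
    """Stub of an ImagePatch-style API. In ViperGPT, this would wrap
    real image pixels; here detections are faked with a list of names."""

    def __init__(self, objects=None):
        self.objects = objects or []

    def find(self, name):
        # Return one patch per detected instance of `name`.
        return [ImagePatch([o]) for o in self.objects if o == name]


def execute_query(image):
    # A program like the ones ViperGPT generates for a query such as
    # "How many muffins can each kid have for it to be fair?"
    muffins = image.find("muffin")
    kids = image.find("kid")
    return len(muffins) // len(kids)


image = ImagePatch(["muffin"] * 8 + ["kid"] * 2)
print(execute_query(image))  # 4
```

The generated program is ordinary Python, so it can be read, debugged, and executed step by step, which is where the interpretability comes from.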
Logical Reasoning
ViperGPT can perform logic operations because it directly executes Python code.
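As a minimal sketch of this point: boolean connectives in a query become plain Python operators in the generated code, evaluated exactly rather than approximated by a neural model. The `exists` helper below is a hypothetical stand-in for the API's existence check.

```python
def exists(objects, name):
    # Stand-in for an ImagePatch-style existence check.
    return name in objects


def answer(objects):
    # "Is there either a cat or a dog, but not both?"
    # Exclusive-or, expressed directly in Python.
    return exists(objects, "cat") != exists(objects, "dog")


print(answer(["cat", "tree"]))  # True
print(answer(["cat", "dog"]))   # False
```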
Spatial Understanding
We show ViperGPT's spatial understanding.
Knowledge
ViperGPT can access the knowledge of large language models.
Consistency
ViperGPT answers similar questions with consistent reasoning.
Math
ViperGPT can count and divide, all using Python.
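A sketch of what this looks like in a generated program: counting is `len()` over a list of detections, and division is Python's native operator. The `detections` and `people` lists stand in for the output of hypothetical `find()` calls.

```python
# Stand-ins for the results of find("apple") and find("person").
detections = ["apple"] * 9
people = ["person"] * 3

count = len(detections)           # counting is exact, not estimated
per_person = count / len(people)  # division is native Python arithmetic
print(count, per_person)          # 9 3.0
```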
Attributes
We show some ViperGPT examples involving attributes.
Relational Reasoning
ViperGPT reasons about relations between objects.
Negation
Negation is programmatic, not neural.
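A minimal sketch of that claim: a query containing "no" or "not" compiles to Python's `not` operator, so negation is evaluated exactly. The `exists` helper is again a hypothetical stand-in for the API's existence check.

```python
def exists(objects, name):
    # Stand-in for an ImagePatch-style existence check.
    return name in objects


def answer(objects):
    # "Is there no umbrella in the image?"
    # Negation is Python's `not`, not a learned behavior.
    return not exists(objects, "umbrella")


print(answer(["chair", "table"]))  # True
print(answer(["umbrella"]))        # False
```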
BibTeX
@article{surismenon2023vipergpt,
author = {Sur\'is, D\'idac and Menon, Sachit and Vondrick, Carl},
title = {ViperGPT: Visual Inference via Python Execution for Reasoning},
journal = {arXiv preprint arXiv:2303.08128},
year = {2023},
}