[Feature] Support InstructBLIP #1685

Status: Open. Wants to merge 5 commits into base: dev.
53 changes: 53 additions & 0 deletions configs/instructblip/README.md
@@ -0,0 +1,53 @@
# MiniGPT4

> Collaborator review comment, suggested change: `# MiniGPT4` -> `# InstructBLIP`


> [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500)

<!-- [ALGORITHM] -->

## Abstract

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although
vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced.

<div align=center>
<img src="https://github.com/open-mmlab/mmpretrain/assets/48375204/4211e0d8-951f-48d0-b81d-34be2e777390" width="80%"/>
</div>

## How to use it?

<!-- [TABS-BEGIN] -->

**Use the model**

```python
from mmpretrain import inference_model

result = inference_model('instructblip-vicuna7b_3rdparty-zeroshot_caption', 'demo/cat-dog.png')
print(result)
# {'pred_caption': 'a blanket next to each other in the grass\na cute puppy and kitten wallpapers'}
```

<!-- [TABS-END] -->
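
To caption several images without rebuilding the model on each call, the inferencer can be constructed once and reused. A minimal sketch (not part of this PR), assuming the standard `mmpretrain` inferencer API; the image paths are illustrative:

```python
from mmpretrain import ImageCaptionInferencer

# Build the captioning model once, then reuse it for a batch of images.
inferencer = ImageCaptionInferencer(
    'instructblip-vicuna7b_3rdparty-zeroshot_caption')
results = inferencer(['demo/cat-dog.png', 'demo/bird.JPEG'])
for res in results:
    print(res['pred_caption'])
```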

## Models and results

For the Vicuna model, please refer to the [MiniGPT-4 page](https://github.com/Vision-CAIR/MiniGPT-4) for preparation guidelines; a sketch of pointing the config at the prepared weights follows below.
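
Once the Vicuna-7B weights are prepared locally, the two hard-coded `name_or_path` fields in the config can be overridden to point at them. A hedged sketch (the local path below is a placeholder, not part of this PR):

```python
from mmengine.config import Config

cfg = Config.fromfile(
    'configs/instructblip/instructblip-vicuna7b_8xb32_caption.py')

# Hypothetical local directory holding the prepared Vicuna-7B weights.
vicuna_path = './checkpoints/vicuna-7b'
cfg.model.llm_tokenizer.name_or_path = vicuna_path
cfg.model.text_backbone.name_or_path = vicuna_path
```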

### Pretrained models

| Model | Params (M) | Flops (G) | Config | Download |
| :-------------------------------------------------- | :--------: | :-------: | :----------------------------------------------: | :--------------------------------------------------------------------------------: |
| `instructblip-vicuna7b_3rdparty-zeroshot_caption`\* | 8121.32 | N/A | [config](instructblip-vicuna7b_8xb32_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/instructblip/instruct-blip_vicuna7b_trimmed.pth) |

*Models with * are converted from the [official repo](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip). The config files of these models are only for inference. We haven't reproduced the training results.*
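
To illustrate how the inference-only config and the converted checkpoint fit together, a hedged sketch of running the zero-shot COCO caption test through `mmengine`; it assumes the COCO caption data from the base dataset config is available, and the work directory is a placeholder:

```python
from mmengine.config import Config
from mmengine.runner import Runner

cfg = Config.fromfile(
    'configs/instructblip/instructblip-vicuna7b_8xb32_caption.py')
cfg.load_from = ('https://download.openmmlab.com/mmclassification/v1/'
                 'instructblip/instruct-blip_vicuna7b_trimmed.pth')
cfg.work_dir = 'work_dirs/instructblip_caption'  # placeholder output dir

runner = Runner.from_cfg(cfg)
runner.test()  # zero-shot caption evaluation with the test dataloader
```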

## Citation

```bibtex
@article{dai2023instructblip,
title={InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning},
author={Dai, Wenliang and Li, Junnan and Li, Dongxu and Tiong, Anthony Meng Huat and Zhao, Junqi and Wang, Weisheng and Li, Boyang and Fung, Pascale and Hoi, Steven},
journal={arXiv preprint arXiv:2305.06500},
year={2023}
}
```
77 changes: 77 additions & 0 deletions configs/instructblip/instructblip-vicuna7b_8xb32_caption.py
@@ -0,0 +1,77 @@
_base_ = [
    '../_base_/datasets/coco_caption.py',
    '../_base_/default_runtime.py',
]

# model settings
model = dict(
    type='InstructBlipCaption',
    llm_tokenizer=dict(
        type='LlamaTokenizer',
        name_or_path=
        '/mnt/petrelfs/share_data/liuyuan/llm_weights/vicuna_weights_7b'),
> Collaborator review comment: don't use our path

    vision_encoder=dict(
        type='BEiTViT',
        # eva-g without the final layer
        arch=dict(
            embed_dims=1408,
            num_layers=39,
            num_heads=16,
            feedforward_channels=6144,
        ),
        img_size=224,
        patch_size=14,
        out_indices=-2,
        layer_scale_init_value=0.0,
        use_abs_pos_emb=True,
        use_rel_pos_bias=False,
        frozen_stages=39,
        final_norm=False,
        use_shared_rel_pos_bias=False,
        out_type='raw',
        pretrained=  # noqa
        'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_eva-g-p14_20230615-e908c021.pth'  # noqa
    ),
    text_backbone=dict(
        type='AutoModelForCausalLM',
        name_or_path=
        '/mnt/petrelfs/share_data/liuyuan/llm_weights/vicuna_weights_7b'),
> Collaborator review comment: same as above (don't use our path)

    Qformer=dict(
        type='Qformer',
        model_style='bert-base-uncased',
        vision_model_width=1408,
        add_cross_attention=True,
        cross_attention_freq=2,
        num_query_token=32),
    prompt='Write a short description for the image.',
    max_txt_len=30)

# schedule settings
optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05))

param_scheduler = [
    dict(
        type='CosineAnnealingLR',
        by_epoch=True,
        begin=0,
        end=10,
    )
]

train_cfg = dict(max_epochs=10)
val_cfg = dict()
test_cfg = dict()

# dataset settings
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='Resize',
        scale=(224, 224),
        interpolation='bicubic',
        backend='pillow'),
    dict(type='PackInputs', meta_keys=['image_id']),
]

val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader
33 changes: 33 additions & 0 deletions configs/instructblip/metafile.yml
@@ -0,0 +1,33 @@
Collections:
  - Name: InstructBLIP
    Metadata:
      Training Data:
        - COCO
        - VG
        - CC3M
        - CC12M
        - SBU
        - LAION-400M
      Architecture:
        - Transformer
        - Q-Former
    Paper:
      Title: 'InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning'
      URL: https://arxiv.org/abs/2305.06500
    README: configs/instructblip/README.md

Models:
  - Name: instructblip-vicuna7b_3rdparty-zeroshot_caption
    Metadata:
      FLOPs: null
      Parameters: xxx
    In Collection: InstructBLIP
    Results:
      - Task: Image Caption
        Dataset: COCO
        Metrics: null
    Weights: https://download.openmmlab.com/mmclassification/v1/instructblip/instruct-blip_vicuna7b_trimmed.pth
    Config: configs/instructblip/instructblip-vicuna7b_8xb32_caption.py
    Converted From:
      Weights: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/InstructBLIP/instruct_blip_vicuna7b_trimmed.pth
      Code: https://github.com/salesforce/LAVIS
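
For reviewers, the practical effect of this metafile is that the new checkpoint becomes discoverable and resolvable by name through the model index. A small hedged sketch, assuming the usual `mmpretrain` helper functions:

```python
from mmpretrain import get_model, list_models

# The metafile entry exposes the model name to the model index.
print(list_models('*instructblip*'))

# Resolves the Weights URL from the metafile and loads the checkpoint.
model = get_model(
    'instructblip-vicuna7b_3rdparty-zeroshot_caption', pretrained=True)
```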
4 changes: 3 additions & 1 deletion mmpretrain/models/multimodal/__init__.py
@@ -6,6 +6,7 @@
from .blip2 import * # noqa: F401,F403
from .chinese_clip import * # noqa: F401, F403
from .flamingo import * # noqa: F401, F403
from .instructblip import * # noqa: F401,F403
from .llava import * # noqa: F401, F403
from .minigpt4 import * # noqa: F401, F403
from .ofa import * # noqa: F401, F403
@@ -17,5 +18,6 @@
register_multimodal_placeholder([
    'Blip2Caption', 'Blip2Retrieval', 'Blip2VQA', 'BlipCaption',
    'BlipNLVR', 'BlipRetrieval', 'BlipGrounding', 'BlipVQA', 'Flamingo',
    'OFA', 'ChineseCLIP', 'InstructBlipCaption', 'MiniGPT4', 'Llava',
    'Otter'
], MODELS)
4 changes: 4 additions & 0 deletions mmpretrain/models/multimodal/instructblip/__init__.py
@@ -0,0 +1,4 @@
# Copyright (c) OpenMMLab. All rights reserved.
from .instructblip_caption import InstructBlipCaption

__all__ = ['InstructBlipCaption']
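
As a final sanity check (a sketch under assumptions, not part of the diff): with `InstructBlipCaption` exported here and listed among the multimodal placeholders above, the model described by the new config should be buildable through the `MODELS` registry:

```python
import mmpretrain.models  # noqa: F401  (ensures the model classes are registered)
from mmengine.config import Config
from mmpretrain.registry import MODELS

cfg = Config.fromfile(
    'configs/instructblip/instructblip-vicuna7b_8xb32_caption.py')
model = MODELS.build(cfg.model)  # an InstructBlipCaption instance
```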