Image-Text-to-Text
PyTorch
English
llava
1-bit
VLA
VLM
conversational
hongyuw committed · Commit 4249505 · verified · 1 Parent(s): a667793

Update README.md

Files changed (1)
  1. README.md +119 -6
README.md CHANGED
@@ -24,11 +24,44 @@ tags:
  ## Open Source Plan

  - ✅ Paper, Pre-trained VLM and evaluation code.
- - 🧭 Fine-tuned VLA models, pre-training and fine-tuning code.
- - 🧭 Pre-trained VLA.
-
-
- ## Evaluation on VQA
+ - ✅ Fine-tuned VLA code and models
+ - 🧭 Pre-training code and VLA.
+
+ ## Contents
+
+ - [BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation](#bitvla-1-bit-vision-language-action-models-for-robotics-manipulation)
+ - [Contents](#contents)
+ - [Checkpoints](#checkpoints)
+ - [Vision-Language](#vision-language)
+ - [Evaluation on VQA](#evaluation-on-vqa)
+ - [Vision-Language-Action](#vision-language-action)
+ - [OFT Training](#oft-training)
+ - [1. Preparing OFT](#1-preparing-oft)
+ - [2. OFT fine-tuning](#2-oft-fine-tuning)
+ - [Evaluation on LIBERO](#evaluation-on-libero)
+ - [Acknowledgement](#acknowledgement)
+ - [Citation](#citation)
+ - [License](#license)
+ - [Contact Information](#contact-information)
+
+ ## Checkpoints
+
+ | Model | Path |
+ | -------------- | ----- |
+ | BitVLA | [hongyuw/bitvla-bitsiglipL-224px-bf16](https://huggingface.co/hongyuw/bitvla-bitsiglipL-224px-bf16) |
+ | BitVLA finetuned on LIBERO-Spatial | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16) |
+ | BitVLA finetuned on LIBERO-Object | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_object-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_object-bf16) |
+ | BitVLA finetuned on LIBERO-Goal | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_goal-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_goal-bf16) |
+ | BitVLA finetuned on LIBERO-Long | [hongyuw/ft-bitvla-bitsiglipL-224px-libero_long-bf16](https://huggingface.co/hongyuw/ft-bitvla-bitsiglipL-224px-libero_long-bf16) |
+ | BitVLA w/ BF16 SigLIP | [hongyuw/bitvla-siglipL-224px-bf16](https://huggingface.co/hongyuw/bitvla-siglipL-224px-bf16) |
+
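If you prefer to fetch a checkpoint programmatically instead of through the web page, a minimal sketch using `huggingface_hub` looks like the following; the local directory is only an example, not a path from the original instructions.

```python
from huggingface_hub import snapshot_download

# Download the BitVLA master weights from the table above into a local folder.
path = snapshot_download(
    repo_id="hongyuw/bitvla-bitsiglipL-224px-bf16",
    local_dir="./checkpoints/bitvla-bitsiglipL-224px-bf16",  # example location
)
print("checkpoint files in:", path)
```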
+ *Note that we provide the master weights of BitVLA and perform online quantization. For actual memory savings, you may quantize the weights offline to 1.58-bit precision. We recommend using the [bitnet.cpp](https://github.com/microsoft/bitnet) inference framework to accurately measure the reduction in inference cost.*
+
+ *Due to limited resources, we have not yet pre-trained BitVLA on a large-scale robotics dataset. We are actively working to secure additional compute resources to conduct this pre-training.*
+
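The memory saving mentioned in the note comes from storing each ternary weight in roughly 2 bits instead of 16. Purely as an illustration, and not the packing layout used by bitnet.cpp, four ternary values can be packed into one byte as sketched below (the per-tensor scale would be stored separately); `pack_ternary` is a hypothetical helper, not part of the released code.

```python
import torch

def pack_ternary(w_ternary: torch.Tensor) -> torch.Tensor:
    """Pack a tensor of {-1, 0, +1} values into 2-bit codes, four per byte."""
    codes = (w_ternary + 1).flatten().to(torch.int64)           # {-1,0,1} -> {0,1,2}
    pad = (-codes.numel()) % 4
    codes = torch.cat([codes, codes.new_zeros(pad)]).view(-1, 4)
    slots = torch.tensor([1, 4, 16, 64], dtype=torch.int64)     # place each code in its 2-bit slot
    return (codes * slots).sum(dim=1).to(torch.uint8)

w = torch.randint(-1, 2, (4096, 4096))   # toy ternary weight matrix
packed = pack_ternary(w)
# ~0.25 bytes per weight after packing, versus 2 bytes per bf16 master weight
print(packed.numel() / w.numel())
```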
+ ## Vision-Language
+
+ ### Evaluation on VQA

  We use the [LMM-Eval](https://github.com/ustcwhy/BitVLA/tree/main/lmms-eval) toolkit to conduct evaluations on VQA tasks. We provide the [transformers repo](https://github.com/ustcwhy/BitVLA/tree/main/transformers) in which we modify [modeling_llava.py](https://github.com/ustcwhy/BitVLA/blob/main/transformers/src/transformers/models/llava/modeling_llava.py) and [modeling_siglip.py](https://github.com/ustcwhy/BitVLA/blob/main/transformers/src/transformers/models/siglip/modeling_siglip.py) to support the W1.58-A8 quantization.
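W1.58-A8 refers to ternary weights with a per-tensor scale and 8-bit activations. The authoritative implementation is in the modified modeling files linked above; the snippet below is only a minimal PyTorch sketch of BitNet-style online ("fake") quantization, assuming absmean scaling for weights and per-token absmax scaling for activations, with the straight-through estimator used during training omitted. The function names are illustrative, not taken from the repository.

```python
import torch
import torch.nn.functional as F

def weight_quant_w158(w: torch.Tensor) -> torch.Tensor:
    """Ternarize weights to {-1, 0, +1} times an absmean scale (W1.58)."""
    scale = w.abs().mean().clamp(min=1e-5)
    return (w / scale).round().clamp(-1, 1) * scale

def act_quant_a8(x: torch.Tensor) -> torch.Tensor:
    """Quantize activations to 8 bits with a per-token absmax scale (A8)."""
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    return (x * scale).round().clamp(-128, 127) / scale

w = torch.randn(1024, 4096)   # toy linear-layer weight (master weights stay in high precision)
x = torch.randn(2, 16, 4096)  # toy activations: (batch, tokens, hidden)
y = F.linear(act_quant_a8(x), weight_quant_w158(w))
print(y.shape)  # torch.Size([2, 16, 1024])
```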
 
 
@@ -59,9 +92,89 @@ bash eval-dense-hf.sh /YOUR_PATH_TO_EXP/bitvla-siglipL-224px-bf16

  Note that we provide the master weights of BitVLA and perform online quantization. For actual memory savings, you may quantize the weights offline to 1.58-bit precision. We recommend using the [bitnet.cpp](https://github.com/microsoft/bitnet) inference framework to accurately measure the reduction in inference cost.

+ ## Vision-Language-Action
+
+ ### OFT Training
+
+ #### 1. Preparing OFT
+ We fine-tune BitVLA with the OFT recipe from [OpenVLA-OFT](https://github.com/moojink/openvla-oft/tree/main). First, set up the environment required by that project; see [SETUP.md](https://github.com/moojink/openvla-oft/blob/main/SETUP.md) and [LIBERO.md](https://github.com/moojink/openvla-oft/blob/main/LIBERO.md) for detailed instructions.
+
+ ```
+ conda create -n bitvla python=3.10 -y
+ conda activate bitvla
+ pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu124
+
+ # or use the provided docker
+ # docker run --name nvidia_24_07 --privileged --net=host --ipc=host --gpus=all -v /mnt:/mnt -v /tmp:/tmp -d nvcr.io/nvidia/pytorch:24.07-py3 sleep infinity
+
+ cd BitVLA
+ pip install -e openvla-oft/
+ pip install -e transformers
+
+ cd openvla-oft/
+
+ # install LIBERO
+ git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
+ pip install -e LIBERO/
+ # in BitVLA
+ pip install -r experiments/robot/libero/libero_requirements.txt
+
+ # install bitvla
+ pip install -e bitvla/
+ ```
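As an optional sanity check that is not part of the original instructions, you can confirm that the CUDA build of PyTorch is active inside the `bitvla` environment before continuing:

```python
import torch

# Should report the cu124 build of torch 2.5.0 and True on a working GPU setup.
print(torch.__version__)
print(torch.cuda.is_available())
```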
+
+ We use the same dataset as OpenVLA-OFT for fine-tuning on LIBERO; you can download it from [HuggingFace](https://huggingface.co/datasets/openvla/modified_libero_rlds).
+
+ ```
+ git clone git@hf.co:datasets/openvla/modified_libero_rlds
+ ```
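The SSH clone above assumes a Hugging Face SSH key and Git LFS. If that is inconvenient, the same dataset can be downloaded with `huggingface_hub`, as with the checkpoints earlier; the target directory is only an example and should match the `--data_root_dir` used below.

```python
from huggingface_hub import snapshot_download

# Download the modified LIBERO RLDS dataset used by OpenVLA-OFT.
snapshot_download(
    repo_id="openvla/modified_libero_rlds",
    repo_type="dataset",
    local_dir="./modified_libero_rlds",  # example path; reuse it as --data_root_dir
)
```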
+
+ #### 2. OFT fine-tuning
+
+ First, convert the [BitVLA checkpoint](https://huggingface.co/hongyuw/bitvla-bitsiglipL-224px-bf16) to a format compatible with the VLA codebase:
+
+ ```
+ python convert_ckpt.py /path/to/bitvla-bitsiglipL-224px-bf16
+ ```
+
+ After that, you can fine-tune BitVLA with the following command; here we take LIBERO-Spatial as an example:
+
+ ```
+ torchrun --standalone --nnodes 1 --nproc-per-node 4 vla-scripts/finetune_bitnet.py \
+ --vla_path /path/to/bitvla-bitsiglipL-224px-bf16 \
+ --data_root_dir /path/to/modified_libero_rlds/ \
+ --dataset_name libero_spatial_no_noops \
+ --run_root_dir /path/to/save/your/ckpt \
+ --use_l1_regression True \
+ --warmup_steps 375 \
+ --use_lora False \
+ --num_images_in_input 2 \
+ --use_proprio True \
+ --batch_size 2 \
+ --grad_accumulation_steps 8 \
+ --learning_rate 1e-4 \
+ --max_steps 10001 \
+ --save_freq 10000 \
+ --save_latest_checkpoint_only False \
+ --image_aug True \
+ --run_id_note your_id
+ ```
+
+ ### Evaluation on LIBERO
+
+ You can download our fine-tuned BitVLA models from [HuggingFace](https://huggingface.co/collections/hongyuw/bitvla-68468fb1e3aae15dd8a4e36e). As an example, run the following script to evaluate on the LIBERO-Spatial suite:
+
+ ```
+ python experiments/robot/libero/run_libero_eval_bitnet.py \
+ --pretrained_checkpoint /path/to/ft-bitvla-bitsiglipL-224px-libero_spatial-bf16 \
+ --task_suite_name libero_spatial \
+ --info_in_path "information you want to show in path" \
+ --model_family "bitnet"
+ ```
+
  ## Acknowledgement

- This repository is built using [LMM-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) and [the HuggingFace's transformers](https://github.com/huggingface/transformers).
+ This repository is built using [LMM-Eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), [Hugging Face Transformers](https://github.com/huggingface/transformers), and [OpenVLA-OFT](https://github.com/moojink/openvla-oft).

  ## Citation