More than 40 GB of VRAM consumed by PaddleOCR-VL on an A100 GPU!!!

#59
by sayed99 - opened


This is running on Colab with the official code snippet from the model card, using the Transformers library. It consumed more than 40 GB of VRAM during inference on a single page, even though it is a simple page containing no more than 300 Arabic words. It also took more than 2 minutes to generate the result, which is far from optimal. I will investigate whether this is caused by a missing configuration, the attention implementation, or something else.

For reference, here is the sample image I ran it on:

page_1

I modified the code snippet to use the FlashAttention 2 implementation, and performance improved from 2 minutes to 19 seconds, while GPU VRAM usage dropped from the massive 45 GB to 3.3 GB.
Could I open a pull request with the new snippet?
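For context, here is a minimal sketch of the kind of change involved, assuming the model is loaded through the Transformers auto classes as in the model card (the exact model/processor classes, prompt format, and generation call should follow the official snippet); the relevant difference is the `attn_implementation` argument together with a half-precision dtype:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "PaddlePaddle/PaddleOCR-VL"

# Load in bfloat16 and request FlashAttention 2 instead of the default
# attention implementation; this is what cuts both latency and peak VRAM.
# flash-attn must be installed separately (e.g. `pip install flash-attn`).
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
    device_map="cuda",
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# The rest of the inference (building the prompt with the page image and
# calling model.generate) stays exactly as in the model card snippet.
```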

@sayed99, could you point to exactly where you made the change?

@maksym-ostapenko
Hi! I recently updated the README to modify the Transformers code and enable the use of FlashAttention 2. This change significantly reduces memory usage and improves performance.

Here's the pull request I opened for the model card README: Model Card PR

PaddlePaddle org


Contributions are highly welcome!

@sayed99 can you please share your notebook?
I have been trying to run the model on Colab with no luck.

@Vinci Hi! You can find the updated, optimized Colab code under the new section "Click to expand: Use flash-attn to boost performance and reduce memory usage" on the model card.
And no problem, here is the full notebook of the experiment. Please comment out the first part, though, as I assume it will crash on the free T4 GPU due to limited memory; go directly to the second section, which uses flash-attn.
paddle-paddle-inference
