prithivMLmods committed (verified) · Commit ade7797 · 1 Parent(s): bc484b8

Update README.md

Files changed (1): README.md (+107, -0)
README.md CHANGED
@@ -1,7 +1,114 @@
---
license: apache-2.0
pipeline_tag: image-to-text
language:
- en
- zh
base_model:
- prithivMLmods/Camel-Doc-OCR-062825
library_name: transformers
tags:
- Document
- KIE
- OCR
- VL
- Camel
- Openpdf
- text-generation-inference
- Extraction
- Linking
- Markdown
- .Md
---

![1.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/2noqQmhqzZ2qHIpCYJ29v.png)

# **Gliese-OCR-7B-Post1.0**

> The **Gliese-OCR-7B-Post1.0** model is a fine-tuned version of **Qwen2.5-VL-7B-Instruct**, optimized for **Document Retrieval**, **Content Extraction**, and **Analysis Recognition**. Built on the Qwen2.5-VL architecture, it strengthens document comprehension through focused training on the Opendoc2-Analysis-Recognition dataset, targeting document analysis and information extraction tasks.

# Key Enhancements

* **Context-Aware Multimodal Extraction and Linking for Documents**: Understands document context and establishes connections between multimodal elements within a document.

* **Enhanced Document Retrieval**: Designed to efficiently locate and extract relevant information from complex document structures and layouts.

* **Superior Content Extraction**: Optimized for precise extraction of structured and unstructured content from diverse document formats.

* **Analysis Recognition**: Specialized in recognizing and interpreting analytical content, charts, tables, and visual data representations.

* **State-of-the-Art Performance Across Resolutions**: Achieves competitive results on OCR and visual QA benchmarks such as DocVQA, MathVista, RealWorldQA, and MTVQA.

* **Video Understanding up to 20+ minutes**: Supports detailed comprehension of long-duration videos for content summarization, Q&A, and multimodal reasoning.

* **Visually-Grounded Device Interaction**: Enables mobile/robotic device operation via visual inputs and text-based instructions, using contextual understanding and decision-making logic.

# Quick Start with Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned checkpoint; torch_dtype="auto" selects a suitable precision
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Gliese-OCR-7B-Post1.0", torch_dtype="auto", device_map="auto"
)

# The processor bundles the tokenizer, chat template, and image preprocessing
processor = AutoProcessor.from_pretrained("prithivMLmods/Gliese-OCR-7B-Post1.0")

# One user turn containing an image and a text instruction
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template and prepare the vision inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then keep only the newly produced tokens before decoding
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
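
Since the card positions the model for document conversion, a natural variation of the Quick Start swaps the demo prompt for an OCR-style instruction. The sketch below reuses `model` and `processor` from the snippet above; the local file path, prompt wording, and token budget are illustrative assumptions rather than settings from the model card.

```python
# Reuses `model` and `processor` from the Quick Start snippet above.
# The image path and instruction are hypothetical examples.
ocr_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/scanned_page.png"},
            {"type": "text", "text": "Convert this page to Markdown, preserving headings and tables."},
        ],
    }
]

text = processor.apply_chat_template(ocr_messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(ocr_messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to("cuda")

# Full document pages usually need a larger token budget than a short caption
generated_ids = model.generate(**inputs, max_new_tokens=1024)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```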

# Intended Use

This model is intended for:

* Context-aware multimodal extraction and linking for complex document structures.
* High-fidelity document retrieval and content extraction from various document formats.
* Analysis recognition of charts, graphs, tables, and visual data representations.
* Document-based question answering for educational and enterprise applications.
* Extraction and LaTeX formatting of mathematical expressions from printed or handwritten content.
* Retrieval and summarization from long documents, slides, and multimodal inputs (see the video sketch after this list).
* Multilingual document analysis and structured content extraction for global use cases.
* Robotic or mobile automation with vision-guided contextual interaction.

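For the long-video and multimodal summarization use cases above, the same pipeline applies with a video entry in the message content. This is a minimal sketch that again reuses `model` and `processor` from the Quick Start section; the video path, `fps`, and `max_pixels` values are illustrative placeholders, not recommended defaults.

```python
# Minimal video-summarization sketch; reuses `model` and `processor` from above.
# The video path, fps, and pixel budget are illustrative assumptions.
video_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/lecture.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Summarize the key points of this video."},
        ],
    }
]

text = processor.apply_chat_template(video_messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(video_messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```
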
# Limitations

* May show degraded performance on extremely low-quality or occluded images.
* Not optimized for real-time applications on low-resource or edge devices due to computational demands.
* Variable accuracy on uncommon or low-resource languages/scripts.
* Long video processing may require substantial memory and is not optimized for streaming applications.
* Visual token settings affect performance; suboptimal configurations can impact results (a processor configuration sketch follows this list).
* In rare cases, outputs may contain hallucinated or contextually misaligned information.
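
Regarding the visual-token caveat above: the Qwen2.5-VL processor accepts `min_pixels` and `max_pixels` bounds that cap the per-image visual-token budget, trading accuracy against memory and latency. A minimal configuration sketch follows; the specific budget values are illustrative, not tuned recommendations.

```python
from transformers import AutoProcessor

# Each visual token corresponds to a 28x28 pixel patch, so these bounds cap the
# number of visual tokens per image. The values below are illustrative only.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28

processor = AutoProcessor.from_pretrained(
    "prithivMLmods/Gliese-OCR-7B-Post1.0",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```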