Create README.md

README.md +214 -0

	@@ -0,0 +1,214 @@

+---
+license: cc-by-nc-4.0
+language:
+- ko
+base_model:
+- google/gemma-3-4b-it
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- exam
+- question-generation
+- gemma-3
+- korean
+- xml
+- sft
+- dpo
+- grpo
+---
+# Gemma3 ExamGen (Korean, XML)
+**TL;DR**: A Gemma-3–based model fine-tuned to generate **Korean** university-level exam questions in **strict XML** (5 problems: 2 MCQ, 2 short-answer, 1 essay).
+> **Outputs are in Korean.**
+---
+## Overview
+Gemma3 ExamGen is a fine-tuned variant of `google/gemma-3-4b-it` that generates Korean university exam questions in a strict XML structure.
+Each response is trained to contain exactly five problems in a fixed format, with every problem covering a different concept.
+---
+## Intended Use
+- **Primary**: Generate Korean exam problems in XML.
+- **Output Language**: Korean only.
+- **Not for**: certifying factual correctness, grading, or deployment without human review.
+---
+## Training Pipeline
+- **Base**: `google/gemma-3-4b-it`
+- **Stages**: SFT → DPO → GRPO
+- **Method**: LoRA fine-tuning
+- **Data**: PDF-crawled educational materials (private)
+- **Filtering**: samples were filtered for XML validity and for non-repeating concepts across problems.
+---
+## Prompting Spec (Korean Prompt Template)
+> The model must always produce **Korean outputs**.
+> It strictly follows the XML schema and rules defined below.
+> To use the model, replace the `{KEYS}` placeholder with keywords and the `{PHRS}` placeholder with key sentences extracted from your source material.
+---
+### Prompt Template (in Korean)
+```text
+다음의 규칙을 준수하여 대학교 시험 문제 5개를 XML 형식으로 생성하세요.
+**응답 형식 (반드시 준수):**
+<problems>
+    <problem>
+        <number>1</number>
+        <type>객관식</type>
+        <content>문제내용</content>
+        <description>풀이과정</description>
+        <answer>답</answer>
+    </problem>
+    <problem>
+        <number>2</number>
+        <type>객관식</type>
+        <content>문제내용</content>
+        <description>풀이과정</description>
+        <answer>답</answer>
+    </problem>
+    <problem>
+        <number>3</number>
+        <type>단답형</type>
+        <content>문제내용</content>
+        <description>풀이과정</description>
+        <answer>답</answer>
+    </problem>
+    <problem>
+        <number>4</number>
+        <type>단답형</type>
+        <content>문제내용</content>
+        <description>풀이과정</description>
+        <answer>답</answer>
+    </problem>
+    <problem>
+        <number>5</number>
+        <type>주관식</type>
+        <content>문제내용</content>
+        <answer>답</answer>
+    </problem>
+</problems>
+**절대 규칙 (위반 시 응답 무효):**
+1. XML 태그 구조만 출력합니다. 다른 텍스트, 설명, 주석은 포함하지 않습니다.
+2. 모든 내용은 CDATA 섹션 없이 일반 텍스트로 작성합니다.
+3. 특수문자는 XML 엔티티로 작성합니다. (&lt;, &gt;, &amp;, &quot;, &apos;)
+**문제 생성 규칙:**
+- 총 5문제를 생성하며, 문제 유형은 다음 비율을 반드시 지킵니다: 객관식 2문제, 단답형 2문제, 주관식 1문제.
+- 각 문제의 <type>은 위 응답 형식에서 이미 지정된 값을 그대로 사용합니다.
+- 객관식 문제는 보기 기호를 ①, ②, ③, ④, ⑤ 형식으로 작성합니다.
+- 모든 문제는 서로 다른 주요 개념을 사용해야 하며, 동일 개념이나 동일 인물, 동일 사건을 다른 문제에서 재사용하지 않습니다.
+- 풀이과정과 답을 구체적으로 작성합니다.
+- 문제 내용에 따옴표, 수식, 특수문자 등을 자유롭게 사용할 수 있습니다.
+- 문제는 난이도와 표현 방식을 다양하게 구성합니다.
+**중요한 키워드:**
+{KEYS}
+**중요한 문장들:**
+{PHRS}
+```
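A minimal sketch of filling the two placeholders, assuming plain string substitution is sufficient. The helper name `fill_template` and the one-item-per-line layout are illustrative assumptions; this card does not document how keywords and sentences were laid out during training.

```python
# Hypothetical helper: only the placeholder names {KEYS} and {PHRS} come from the template.
def fill_template(template: str, keywords: list[str], sentences: list[str]) -> str:
    """Substitute both placeholders with one item per line (assumed layout)."""
    keys_block = "\n".join(f"- {k}" for k in keywords)
    phrs_block = "\n".join(f"- {s}" for s in sentences)
    # str.replace is used instead of str.format so that any other braces
    # in the template or in the inserted text are left untouched.
    return template.replace("{KEYS}", keys_block).replace("{PHRS}", phrs_block)


# Shortened stand-in for the full Korean template; paste the complete template in practice.
template = "**중요한 키워드:**\n{KEYS}\n**중요한 문장들:**\n{PHRS}"
prompt = fill_template(
    template,
    keywords=["엔트로피", "열역학 제2법칙"],
    sentences=["고립계의 엔트로피는 감소하지 않는다."],
)
print(prompt)
```

The resulting `prompt` string can be passed to the tokenizer as in the usage example.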
+---
+## Example Usage
+```python
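+import torch
+from transformers import AutoProcessor, AutoModelForImageTextToText
+model_id = "yongjin-KIM/gemma3-examgen"
+# Load the checkpoint in bfloat16 and let device_map="auto" place it on the available device(s).
+model = AutoModelForImageTextToText.from_pretrained(
+    model_id, torch_dtype=torch.bfloat16, device_map="auto"
+)
+processor = AutoProcessor.from_pretrained(model_id)
+tok = processor.tokenizer  # text-only use, so the tokenizer alone is enough
+prompt = """<Insert the Korean prompt template here and replace {KEYS} and {PHRS}>"""
+inputs = tok(prompt, return_tensors="pt").to(model.device)
+# Sample up to 2000 new tokens.
+outputs = model.generate(
+    **inputs,
+    max_new_tokens=2000,
+    temperature=0.7,
+    top_p=0.9,
+    do_sample=True,
+)
+# generate() returns the prompt followed by the continuation; decode only the
+# new tokens so the template's own <problems> skeleton is not printed again.
+new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
+print(tok.decode(new_tokens, skip_special_tokens=True))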
+```
+---
+## Output Format Guarantees
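+The model is trained toward the following properties. They are targets rather than hard guarantees (see the adherence rates under Evaluation and the cases under Limitations), so validate outputs before use:
+- **Well-formed XML** with no text outside the tags.
+- Exactly **5 `<problem>` blocks**.
+- Special characters escaped as XML entities (`&lt;`, `&gt;`, `&amp;`, `&quot;`, `&apos;`).
+- Fixed type order:
+  **객관식**, **객관식**, **단답형**, **단답형**, **주관식** (multiple-choice, multiple-choice, short-answer, short-answer, essay).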
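The constraints listed above can be checked mechanically before a generated exam is used. The following is a sketch using only the Python standard library; the function name `validate_exam_xml` and the error messages are hypothetical, and the rule that `<description>` is required only for problems 1 to 4 is inferred from the prompt template rather than stated by the author.

```python
import xml.etree.ElementTree as ET

# Type order taken from the prompt template: 2 MCQ, 2 short-answer, 1 essay.
EXPECTED_TYPES = ["객관식", "객관식", "단답형", "단답형", "주관식"]


def validate_exam_xml(xml_text: str) -> list[str]:
    """Return a list of rule violations; an empty list means every check passed."""
    try:
        root = ET.fromstring(xml_text.strip())
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    if root.tag != "problems":
        return [f"unexpected root element <{root.tag}>"]

    errors = []
    problems = root.findall("problem")
    if len(problems) != 5:
        errors.append(f"expected 5 <problem> blocks, found {len(problems)}")

    types = [(p.findtext("type") or "").strip() for p in problems]
    if types != EXPECTED_TYPES:
        errors.append(f"unexpected type order: {types}")

    for i, p in enumerate(problems, start=1):
        required = ["number", "content", "answer"]
        if i < 5:  # assumption: the template gives no <description> to problem 5
            required.append("description")
        for tag in required:
            if not (p.findtext(tag) or "").strip():
                errors.append(f"problem {i}: missing or empty <{tag}>")
    return errors


def _problem(n: int, kind: str) -> str:
    """Build a tiny synthetic problem for the demo below."""
    desc = f"<description>풀이 {n}</description>" if n < 5 else ""
    return (
        f"<problem><number>{n}</number><type>{kind}</type>"
        f"<content>문제 {n} &lt;예시&gt;</content>{desc}<answer>답 {n}</answer></problem>"
    )


sample = "<problems>" + "".join(
    _problem(i, kind) for i, kind in enumerate(EXPECTED_TYPES, start=1)
) + "</problems>"

print(validate_exam_xml(sample))                  # all checks pass
print(validate_exam_xml("<problems><problem>"))   # truncated output
```

Outputs that return a non-empty list can be regenerated or routed to manual review.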
+---
+## Evaluation
+| Metric | Description | Result |
+|--------|-------------|--------|
+| **Format adherence** | Ratio of outputs that are valid XML | 98.7% |
+| **Rule compliance** | Correct structure, tag order, and problem counts | 95.4% |
+| **Language quality** | Fluency and semantic coherence (human evaluation) | High |
+| **Automatic question-quality metrics** | RQUGE, NACo | Planned (work in progress) |
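The evaluation script is not published, so the following is only one plausible way to compute a format-adherence ratio like the one in the table: count the share of outputs that parse as well-formed XML. The function names and the three-item batch are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """True if the whole string parses as a single XML document."""
    try:
        ET.fromstring(xml_text.strip())
        return True
    except ET.ParseError:
        return False


def format_adherence(outputs: list[str]) -> float:
    """Fraction of outputs that are well-formed XML."""
    if not outputs:
        raise ValueError("no outputs to score")
    return sum(is_well_formed(o) for o in outputs) / len(outputs)


# Synthetic batch: one valid output, one truncated, one with text before the root tag.
batch = [
    "<problems><problem/></problems>",
    "<problems><problem>",
    "설명 텍스트 <problems/>",
]
print(f"{format_adherence(batch):.1%}")
```

A stricter rule-compliance score would additionally check the problem count, tag order, and type order for each parsed output.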
+---
+## Limitations
+- May occasionally omit `<description>` fields or produce overlong answers.
+- Factual correctness is not guaranteed.
+- Designed for **Korean text only**; English prompts are not supported.
+- How closely the problems follow the source material depends on the quality of the supplied `{KEYS}` and `{PHRS}`.
+---
+## Ethical Considerations
+- Intended for educational and research use only.
+- Should not be used for unsupervised or high-stakes exam generation.
+- All generated content should be **reviewed by a human instructor** before use.
+---
+## Model Details
+- **Base Model**: `google/gemma-3-4b-it`
+- **Architecture**: Decoder-only transformer
+- **Fine-tuning Method**: LoRA (r=8, α=32)
+- **Training Framework**: PEFT + TRL
+- **Training Hardware**: 2 × A100 (80GB)
+- **Training Duration**: ~48 hours
+- **Stages**: SFT → DPO → GRPO
+---
+## License
+- **Model**: CC-BY-NC-4.0
+- **Base Model**: Gemma-3 (Google), governed by the Gemma Terms of Use
+- **Dataset**: Private (PDF-crawled educational material)
+- **Intended Use**: Research / Non-commercial
+---
+## Maintainer
+**Author:** Yongjin Kim
+**Hugging Face:** [@yongjin-KIM](https://huggingface.co/yongjin-KIM)
+---