thadillo committed on
Commit
71797a4
·
1 Parent(s): 1377fb1

Phases 1-3: Database schema, text processing, analyzer updates


- Add SubmissionSentence model with relationships
- Add sentence_analysis_done flag to Submission
- Update TrainingExample to support sentence-level examples
- Create TextProcessor for sentence segmentation (NLTK + regex fallback)
- Update analyzer with analyze_with_sentences() method
- Store confidence scores for later retrieval

CATEGORIZATION_DECISION_GUIDE.md ADDED
@@ -0,0 +1,286 @@
1
+ # 🎯 Quick Decision Guide: Categorization Strategy
2
+
3
+ ## Your Problem (Excellent Observation!)
4
+
5
+ **Current**: One submission β†’ One category
6
+ **Reality**: One submission often contains multiple categories
7
+
8
+ **Example**:
9
+ ```
10
+ "Dallas should establish more green spaces in South Dallas neighborhoods.
11
+ Areas like Oak Cliff lack accessible parks compared to North Dallas."
12
+
13
+ Current system: Forces you to pick ONE category
14
+ Better system: Recognize both Objective + Problem
15
+ ```
16
+
17
+ ---
18
+
19
+ ## πŸ”„ Three Solutions (Ranked by Effort vs. Value)
20
+
21
+ ### πŸ₯‡ Option 1: Sentence-Level Analysis (YOUR PROPOSAL)
22
+
23
+ **What it does**:
24
+ ```
25
+ Submission A
26
+ β”œβ”€ Sentence 1: "Dallas should establish..." β†’ Objective
27
+ β”œβ”€ Sentence 2: "Areas like Oak Cliff..." β†’ Problem
28
+ └─ Geotag: [lat, lng] (applies to all sentences)
29
+ Stakeholder: Community (applies to all sentences)
30
+ ```
31
+
32
+ **UI Example**:
33
+ ```
34
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
35
+ β”‚ Submission #42 - Community β”‚
36
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
37
+ β”‚ "Dallas should establish more green β”‚
38
+ β”‚ spaces in South Dallas neighborhoods. β”‚
39
+ β”‚ Areas like Oak Cliff lack accessible β”‚
40
+ β”‚ parks compared to North Dallas." β”‚
41
+ β”‚ β”‚
42
+ β”‚ Primary Category: Objective β”‚
43
+ β”‚ Distribution: 50% Objective, 50% Problemβ”‚
44
+ β”‚ β”‚
45
+ β”‚ [β–Ό View Sentences (2)] β”‚
46
+ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
47
+ β”‚ β”‚ 1. "Dallas should establish..." β”‚ β”‚
48
+ β”‚ β”‚ Category: [Objective β–Ό] β”‚ β”‚
49
+ β”‚ β”‚ β”‚ β”‚
50
+ β”‚ β”‚ 2. "Areas like Oak Cliff..." β”‚ β”‚
51
+ β”‚ β”‚ Category: [Problem β–Ό] β”‚ β”‚
52
+ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
53
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
54
+ ```
55
+
56
+ **Pros**: βœ… Maximum accuracy, βœ… Best training data, βœ… Detailed analytics
57
+ **Cons**: ⚠️ More complex, ⚠️ Takes longer to implement
58
+ **Time**: 13-20 hours
59
+ **Value**: ⭐⭐⭐⭐⭐
60
+
61
+ ---
62
+
63
+ ### πŸ₯ˆ Option 2: Multi-Label (Simpler)
64
+
65
+ **What it does**:
66
+ ```
67
+ Submission A
68
+ β”œβ”€ Categories: [Objective, Problem]
69
+ β”œβ”€ Geotag: [lat, lng]
70
+ └─ Stakeholder: Community
71
+ ```
72
+
73
+ **UI Example**:
74
+ ```
75
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
76
+ β”‚ Submission #42 - Community β”‚
77
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
78
+ β”‚ "Dallas should establish more green β”‚
79
+ β”‚ spaces in South Dallas neighborhoods. β”‚
80
+ β”‚ Areas like Oak Cliff lack accessible β”‚
81
+ β”‚ parks compared to North Dallas." β”‚
82
+ β”‚ β”‚
83
+ β”‚ Categories: [Objective] [Problem] β”‚
84
+ β”‚ (select multiple) β”‚
85
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
86
+ ```
87
+
88
+ **Pros**: βœ… Simple to implement, βœ… Captures complexity
89
+ **Cons**: ❌ Can't tell which sentence is which, ❌ Less precise training data
90
+ **Time**: 4-6 hours
91
+ **Value**: ⭐⭐⭐
92
+
93
+ ---
94
+
95
+ ### πŸ₯‰ Option 3: Primary + Secondary
96
+
97
+ **What it does**:
98
+ ```
99
+ Submission A
100
+ β”œβ”€ Primary: Objective
101
+ β”œβ”€ Secondary: [Problem, Values]
102
+ β”œβ”€ Geotag: [lat, lng]
103
+ └─ Stakeholder: Community
104
+ ```
105
+
106
+ **Pros**: βœ… Preserves hierarchy, βœ… Moderate complexity
107
+ **Cons**: ⚠️ Arbitrary primary choice, ❌ Still loses granularity
108
+ **Time**: 8-10 hours
109
+ **Value**: ⭐⭐⭐
110
+
111
+ ---
112
+
113
+ ## πŸ“Š Side-by-Side Comparison
114
+
115
+ | Feature | Sentence-Level | Multi-Label | Primary+Secondary |
116
+ |---------|---------------|-------------|-------------------|
117
+ | **Granularity** | Each sentence categorized | Submission-level | Submission-level |
118
+ | **Training Data** | Precise per sentence | Ambiguous | Hierarchical |
119
+ | **UI Complexity** | Collapsible view | Checkbox list | Dropdown + pills |
120
+ | **Dashboard** | Dual mode (submissions vs sentences) | Overlapping counts | Clear hierarchy |
121
+ | **Implementation** | New table + logic | Array field | Two fields |
122
+ | **Time to Build** | 13-20 hrs | 4-6 hrs | 8-10 hrs |
123
+ | **Your Example** | βœ… Perfect fit | ⚠️ OK | ⚠️ OK |
124
+ | **Future AI Training** | βœ… Excellent | ⚠️ Limited | ⚠️ OK |
125
+
126
+ ---
127
+
128
+ ## 🎯 My Recommendation: Start with Proof of Concept
129
+
130
+ ### Phase 0: Quick Test (4-6 hours)
131
+
132
+ **Goal**: See sentence breakdown WITHOUT changing database
133
+
134
+ **Implementation**:
135
+ 1. Add sentence segmentation library (NLTK)
136
+ 2. Update submissions page to SHOW sentence breakdown (read-only)
137
+ 3. Display: "This submission contains X sentences in Y categories"
138
+ 4. Let admins see the breakdown and provide feedback
139
+
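+ A minimal sketch of the segmentation step from item 1 above (the `TextProcessor` name comes from the implementation; the method name and regex fallback shown here are assumptions):
+
+ ```python
+ # Sketch only: NLTK's Punkt tokenizer when available, simple regex otherwise.
+ import re
+
+ try:
+     import nltk
+ except ImportError:          # NLTK missing: regex fallback only
+     nltk = None
+
+ _FALLBACK = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")
+
+
+ class TextProcessor:
+     """Splits a submission into sentences for per-sentence categorization."""
+
+     def __init__(self):
+         if nltk is not None:
+             nltk.download("punkt", quiet=True)   # one-time Punkt model download
+
+     def split_sentences(self, text: str) -> list[str]:
+         text = (text or "").strip()
+         if not text:
+             return []
+         if nltk is not None:
+             try:
+                 return [s.strip() for s in nltk.sent_tokenize(text) if s.strip()]
+             except LookupError:                  # Punkt data not available
+                 pass
+         return [s.strip() for s in _FALLBACK.split(text) if s.strip()]
+ ```
+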
140
+ **Example UI** (read-only preview):
141
+ ```
142
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
143
+ β”‚ Submission #42 β”‚
144
+ β”‚ "Dallas should establish..." β”‚
145
+ β”‚ β”‚
146
+ β”‚ Current Category: Objective β”‚
147
+ β”‚ β”‚
148
+ β”‚ [πŸ’‘ AI Detected Multiple Topics] β”‚
149
+ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
150
+ β”‚ β”‚ This submission contains: β”‚ β”‚
151
+ β”‚ β”‚ β€’ 1 sentence about: Objective β”‚ β”‚
152
+ β”‚ β”‚ β€’ 1 sentence about: Problem β”‚ β”‚
153
+ β”‚ β”‚ β”‚ β”‚
154
+ β”‚ β”‚ [View Details β–Ό] β”‚ β”‚
155
+ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
156
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
157
+ ```
158
+
159
+ **Then decide**:
160
+ - βœ… If admins find it useful β†’ Full implementation
161
+ - ⚠️ If too complex β†’ Try multi-label
162
+ - ❌ If not valuable β†’ Keep current system
163
+
164
+ ---
165
+
166
+ ## πŸ’­ Questions to Help Decide
167
+
168
+ ### Ask yourself:
169
+
170
+ 1. **Frequency**: How often do submissions contain multiple categories?
171
+ - Often (>30%) β†’ Sentence-level worth it
172
+ - Sometimes (10-30%) β†’ Multi-label sufficient
173
+ - Rarely (<10%) β†’ Keep current system
174
+
175
+ 2. **Analytics depth**: Do you need to know which specific ideas are Objectives vs Problems?
176
+ - Yes, important β†’ Sentence-level
177
+ - Just need tags β†’ Multi-label
178
+ - Primary is enough β†’ Primary+Secondary
179
+
180
+ 3. **Training priority**: Is fine-tuning accuracy critical?
181
+ - Yes, very important β†’ Sentence-level (best training data)
182
+ - Moderately β†’ Multi-label OK
183
+ - Not critical β†’ Any approach works
184
+
185
+ 4. **User complexity tolerance**: How much UI complexity can admins handle?
186
+ - High (tech-savvy) β†’ Sentence-level
187
+ - Medium β†’ Multi-label
188
+ - Low β†’ Primary+Secondary
189
+
190
+ 5. **Timeline**: When do you need this?
191
+ - This week β†’ Multi-label (fast)
192
+ - Next 2 weeks β†’ Sentence-level (with testing)
193
+ - Flexible β†’ Sentence-level (best long-term)
194
+
195
+ ---
196
+
197
+ ## πŸš€ Recommended Path Forward
198
+
199
+ ### Step 1: Quick Analysis (Now - 30 min)
200
+
201
+ Run a sample analysis on your current data:
202
+
203
+ ```python
204
+ # I can write a script to analyze your 60 submissions
205
+ # and show:
206
+ # - How many have multiple categories?
207
+ # - Average sentences per submission
208
+ # - Potential category distribution
209
+
210
+ # Would you like me to create this analysis script?
211
+ ```
212
+
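+ A rough sketch of what such an analysis script could look like, assuming a `Submission` model with a `message` field and reusing whatever sentence splitter and per-sentence classifier the app exposes (all names here are illustrative):
+
+ ```python
+ # Illustrative only: estimates how many submissions span multiple categories.
+ def estimate_multi_category(submissions, split_sentences, classify_sentence):
+     multi, sentence_counts = 0, []
+     for sub in submissions:
+         sentences = split_sentences(sub.message)
+         sentence_counts.append(len(sentences))
+         categories = {classify_sentence(s) for s in sentences}
+         if len(categories) > 1:
+             multi += 1
+     total = len(submissions) or 1
+     print(f"Multi-category submissions: {multi}/{total} ({100 * multi / total:.0f}%)")
+     print(f"Average sentences per submission: {sum(sentence_counts) / total:.1f}")
+ ```
+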
213
+ ### Step 2: Choose Approach (After analysis)
214
+
215
+ Based on results:
216
+ - **>40% multi-category** β†’ Go with sentence-level
217
+ - **20-40% multi-category** β†’ Try proof of concept
218
+ - **<20% multi-category** β†’ Multi-label might be enough
219
+
220
+ ### Step 3: Implementation
221
+
222
+ **Option A: Full Commit (Sentence-Level)**
223
+ - I implement all 7 phases (~15 hours of work)
224
+ - You get the most powerful system
225
+
226
+ **Option B: Test First (Proof of Concept)**
227
+ - I implement Phase 0 (~4 hours)
228
+ - You test with real users
229
+ - Then decide on full implementation
230
+
231
+ **Option C: Simple (Multi-Label)**
232
+ - I implement multi-label (~5 hours)
233
+ - Less powerful but faster to market
234
+
235
+ ---
236
+
237
+ ## 🎯 What Should We Do?
238
+
239
+ **I recommend**: **Option B - Test First**
240
+
241
+ **Steps**:
242
+ 1. βœ… I create analysis script (show current data patterns)
243
+ 2. βœ… I implement proof of concept (sentence display only)
244
+ 3. βœ… You test with admins (get feedback)
245
+ 4. βœ… We decide: Full sentence-level OR Multi-label OR Keep current
246
+
247
+ **Advantages**:
248
+ - Low risk (no DB changes initially)
249
+ - Real user feedback
250
+ - Informed decision
251
+ - Can always upgrade later
252
+
253
+ ---
254
+
255
+ ## πŸ“ Your Decision
256
+
257
+ **Which path do you want to take?**
258
+
259
+ **A) Analysis Script First** (30 min)
260
+ - I create a script to analyze your 60 submissions
261
+ - Show: % multi-category, sentence distribution, etc.
262
+ - Then decide based on data
263
+
264
+ **B) Proof of Concept** (4-6 hours)
265
+ - Skip analysis, go straight to sentence display
266
+ - See it in action, get feedback
267
+ - Then decide on full implementation
268
+
269
+ **C) Full Implementation** (13-20 hours)
270
+ - Commit to sentence-level now
271
+ - Build everything
272
+ - Most powerful, takes longest
273
+
274
+ **D) Multi-Label Instead** (4-6 hours)
275
+ - Simpler approach
276
+ - Good enough for most cases
277
+ - Fast to implement
278
+
279
+ **E) Keep Current System**
280
+ - If not worth the effort
281
+ - Stay with one category per submission
282
+
283
+ ---
284
+
285
+ **What's your choice?** Let me know and I'll get started! πŸš€
286
+
Claude's Plan.md ADDED
@@ -0,0 +1,344 @@
1
+ # Fine-Tuning System Implementation Plan
2
+
3
+ ## Overview
4
+ Implement an active learning system that collects admin corrections, builds a training dataset, and fine-tunes the BART classification model using LoRA (Low-Rank Adaptation).
5
+
6
+ ---
7
+
8
+ ## Phase 1: Training Data Collection Infrastructure
9
+
10
+ ### 1.1 Database Schema Extensions
11
+ **New Model: `TrainingExample`**
12
+ - `id` (Integer, PK)
13
+ - `submission_id` (Integer, FK to Submission)
14
+ - `message` (Text) - snapshot of submission text
15
+ - `original_category` (String, nullable) - AI's initial prediction
16
+ - `corrected_category` (String) - Admin's correction
17
+ - `contributor_type` (String)
18
+ - `correction_timestamp` (DateTime)
19
+ - `confidence_score` (Float, nullable) - original prediction confidence
20
+ - `used_in_training` (Boolean, default=False) - track if used in fine-tuning
21
+ - `training_run_id` (Integer, nullable, FK) - which training run used this
22
+
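+ A sketch of how the fields above could map onto a Flask-SQLAlchemy model (assumes the app's existing `db` instance and `Submission` model; defaults and foreign-key table names are assumptions):
+
+ ```python
+ # Sketch only: TrainingExample mirroring the field list above.
+ from datetime import datetime
+ from app.models.models import db   # assumed import path
+
+ class TrainingExample(db.Model):
+     id = db.Column(db.Integer, primary_key=True)
+     submission_id = db.Column(db.Integer, db.ForeignKey('submission.id'), nullable=False)
+     message = db.Column(db.Text, nullable=False)                    # snapshot of submission text
+     original_category = db.Column(db.String(50), nullable=True)     # AI's initial prediction
+     corrected_category = db.Column(db.String(50), nullable=False)   # admin's correction
+     contributor_type = db.Column(db.String(50))
+     correction_timestamp = db.Column(db.DateTime, default=datetime.utcnow)
+     confidence_score = db.Column(db.Float, nullable=True)
+     used_in_training = db.Column(db.Boolean, default=False)
+     training_run_id = db.Column(db.Integer, db.ForeignKey('fine_tuning_run.id'), nullable=True)
+
+     submission = db.relationship('Submission', backref='training_examples')
+ ```
+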
23
+ **New Model: `FineTuningRun`**
24
+ - `id` (Integer, PK)
25
+ - `created_at` (DateTime)
26
+ - `status` (String) - 'preparing', 'training', 'evaluating', 'completed', 'failed'
27
+ - `num_training_examples` (Integer)
28
+ - `num_validation_examples` (Integer)
29
+ - `num_test_examples` (Integer)
30
+ - `training_config` (JSON) - hyperparameters, LoRA config
31
+ - `results` (JSON) - metrics (accuracy, loss, per-category F1)
32
+ - `model_path` (String, nullable) - path to saved LoRA weights
33
+ - `is_active_model` (Boolean) - currently deployed model
34
+ - `improvement_over_baseline` (Float, nullable)
35
+ - `completed_at` (DateTime, nullable)
36
+
37
+ ### 1.2 Admin Routes Extension (`app/routes/admin.py`)
38
+ **Modify `update_category` endpoint:**
39
+ - When admin changes category, create TrainingExample record
40
+ - Capture: original prediction, corrected category, confidence score
41
+ - Track whether it's a correction (different from AI) or confirmation (same)
42
+
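+ The capture step inside the existing endpoint could look roughly like this (route path, payload fields, and the stored confidence attribute are assumptions):
+
+ ```python
+ # Sketch only: record the admin's decision as a TrainingExample.
+ @admin_bp.route('/api/update-category/<int:submission_id>', methods=['POST'])
+ def update_category(submission_id):
+     submission = Submission.query.get_or_404(submission_id)
+     new_category = request.json['category']
+
+     example = TrainingExample(
+         submission_id=submission.id,
+         message=submission.message,
+         original_category=submission.category,          # AI prediction at correction time
+         corrected_category=new_category,
+         contributor_type=submission.contributor_type,
+         confidence_score=getattr(submission, 'confidence_score', None),
+     )
+     db.session.add(example)
+
+     submission.category = new_category
+     db.session.commit()
+     is_correction = example.original_category != new_category
+     return jsonify({'status': 'ok', 'is_correction': is_correction})
+ ```
+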
43
+ **New endpoints:**
44
+ - `GET /admin/training-data` - View collected training examples
45
+ - `GET /admin/api/training-stats` - Stats on corrections collected
46
+ - `DELETE /admin/api/training-example/<id>` - Remove bad examples
47
+
48
+ ---
49
+
50
+ ## Phase 2: Fine-Tuning Configuration UI
51
+
52
+ ### 2.1 New Admin Page: Training Dashboard (`app/templates/admin/training.html`)
53
+ **Sections:**
54
+ 1. **Training Data Stats**
55
+ - Total corrections collected
56
+ - Per-category distribution
57
+ - Corrections vs confirmations ratio
58
+ - Data quality indicators (duplicates, conflicts)
59
+
60
+ 2. **Fine-Tuning Controls** (enabled when β‰₯20 examples)
61
+ - Configure training parameters:
62
+ - Minimum examples threshold (default: 20)
63
+ - Train/Val/Test split (e.g., 70/15/15)
64
+ - LoRA rank (r=8, 16, 32)
65
+ - Learning rate (1e-4 to 5e-4)
66
+ - Number of epochs (3-5)
67
+ - "Start Fine-Tuning" button (with confirmation)
68
+
69
+ 3. **Training History**
70
+ - Table of past FineTuningRun records
71
+ - Show: date, examples used, accuracy, status
72
+ - Actions: View details, Deploy model, Export weights
73
+
74
+ 4. **Active Model Indicator**
75
+ - Show which model is currently in use
76
+ - Option to rollback to base model
77
+
78
+ ### 2.2 Settings Extension
79
+ - `fine_tuning_enabled` (Boolean) - master switch
80
+ - `min_training_examples` (Integer, default: 20)
81
+ - `auto_train` (Boolean, default: False) - auto-trigger when threshold reached
82
+
83
+ ---
84
+
85
+ ## Phase 3: Fine-Tuning Engine
86
+
87
+ ### 3.1 New Module: `app/fine_tuning/trainer.py`
88
+
89
+ **Class: `BARTFineTuner`**
90
+
91
+ **Methods:**
92
+
93
+ `prepare_dataset(training_examples)`
94
+ - Convert TrainingExample records to HuggingFace Dataset
95
+ - Create train/val/test splits (stratified by category)
96
+ - Tokenize texts for BART
97
+ - Return: `train_dataset`, `val_dataset`, `test_dataset`
98
+
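+ A sketch of the split step using scikit-learn for stratification and the `datasets` library for the final objects (the 70/15/15 ratio follows the config above; the label mapping is an assumption):
+
+ ```python
+ # Sketch: stratified 70/15/15 split of TrainingExample records.
+ from datasets import Dataset
+ from sklearn.model_selection import train_test_split
+
+ def make_splits(examples, label2id, seed=42):
+     texts = [e.message for e in examples]
+     labels = [label2id[e.corrected_category] for e in examples]
+
+     # 70% train, 30% held out (stratified by category)
+     x_train, x_hold, y_train, y_hold = train_test_split(
+         texts, labels, test_size=0.30, stratify=labels, random_state=seed)
+     # split the held-out 30% evenly into validation and test (15% / 15%)
+     x_val, x_test, y_val, y_test = train_test_split(
+         x_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=seed)
+
+     def to_ds(x, y):
+         return Dataset.from_dict({"text": x, "label": y})
+
+     return to_ds(x_train, y_train), to_ds(x_val, y_val), to_ds(x_test, y_test)
+ ```
+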
99
+ `setup_lora_model(base_model_name, lora_config)`
100
+ - Load base BART model (`facebook/bart-large-mnli`)
101
+ - Apply PEFT (Parameter-Efficient Fine-Tuning) with LoRA
102
+ - LoRA configuration:
103
+ ```python
104
+ {
105
+ "r": 16, # rank
106
+ "lora_alpha": 32,
107
+ "target_modules": ["q_proj", "v_proj"], # attention layers
108
+ "lora_dropout": 0.1,
109
+ "bias": "none"
110
+ }
111
+ ```
112
+
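+ With PEFT, applying that configuration looks roughly like this (a sketch; the classification head and label count depend on how the base model is wrapped in `trainer.py`):
+
+ ```python
+ # Sketch: wrap the base model with a LoRA adapter via PEFT. The 6-label head
+ # replaces the MNLI head, so ignore_mismatched_sizes is needed.
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+ from peft import LoraConfig, TaskType, get_peft_model
+
+ base = AutoModelForSequenceClassification.from_pretrained(
+     "facebook/bart-large-mnli", num_labels=6, ignore_mismatched_sizes=True)
+ tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
+
+ lora_config = LoraConfig(
+     task_type=TaskType.SEQ_CLS,
+     r=16, lora_alpha=32,
+     target_modules=["q_proj", "v_proj"],
+     lora_dropout=0.1, bias="none",
+ )
+ model = get_peft_model(base, lora_config)
+ model.print_trainable_parameters()   # only a small fraction of weights train
+ ```
+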
113
+ `train(train_dataset, val_dataset, config)`
114
+ - Use HuggingFace Trainer with custom loss
115
+ - Multi-class cross-entropy loss
116
+ - Metrics: accuracy, F1 per category, confusion matrix
117
+ - Early stopping on validation loss
118
+ - Save checkpoints to `/data/models/finetuned/run_{id}/`
119
+
120
+ `evaluate(test_dataset, model)`
121
+ - Run predictions on test set
122
+ - Calculate: accuracy, precision, recall, F1 (macro/micro)
123
+ - Generate confusion matrix
124
+ - Compare to baseline (zero-shot) performance
125
+
126
+ `export_model(run_id, destination_path)`
127
+ - Save LoRA adapter weights
128
+ - Save tokenizer config
129
+ - Create model card with metrics
130
+ - Package for backup/deployment
131
+
132
+ **Alternative Approach: Output Layer Fine-Tuning**
133
+ - Option to only train final classification head
134
+ - Faster, less prone to overfitting
135
+ - Good for small datasets (20-50 examples)
136
+
137
+ ### 3.2 Background Task Handler (`app/fine_tuning/tasks.py`)
138
+ - Fine-tuning runs in background (avoid blocking Flask)
139
+ - Options:
140
+ 1. **Simple Threading** (for development)
141
+ 2. **Celery** (for production) - requires Redis/RabbitMQ
142
+ 3. **HF Spaces Gradio Jobs** (if deploying to HF)
143
+
144
+ **Status Updates:**
145
+ - Update FineTuningRun.status in real-time
146
+ - Store progress in Settings table for UI polling
147
+ - Log to file for debugging
148
+
149
+ ---
150
+
151
+ ## Phase 4: Model Deployment & Versioning
152
+
153
+ ### 4.1 Model Manager (`app/fine_tuning/model_manager.py`)
154
+
155
+ **Class: `ModelManager`**
156
+
157
+ `get_active_model()`
158
+ - Check if fine-tuned model is deployed
159
+ - Load LoRA weights if available
160
+ - Fallback to base model
161
+
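+ A sketch of the load-with-fallback logic (the `FineTuningRun` lookup and paths are assumptions):
+
+ ```python
+ # Sketch: load the deployed LoRA adapter on top of the base model,
+ # falling back to the plain base model when nothing is deployed.
+ import os
+ from transformers import AutoModelForSequenceClassification
+ from peft import PeftModel
+
+ def get_active_model(base_name="facebook/bart-large-mnli"):
+     base = AutoModelForSequenceClassification.from_pretrained(base_name)
+     active_run = FineTuningRun.query.filter_by(is_active_model=True).first()
+     if active_run and active_run.model_path and os.path.isdir(active_run.model_path):
+         return PeftModel.from_pretrained(base, active_run.model_path)
+     return base   # fallback: base model, zero-shot behaviour
+ ```
+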
162
+ `deploy_model(run_id)`
163
+ - Set FineTuningRun.is_active_model = True
164
+ - Update Settings: `active_model_id`
165
+ - Reload analyzer with new model
166
+ - Create deployment snapshot
167
+
168
+ `rollback_to_baseline()`
169
+ - Deactivate all fine-tuned models
170
+ - Reload base BART model
171
+ - Log rollback event
172
+
173
+ `compare_models(run_id_1, run_id_2, test_dataset)`
174
+ - Side-by-side comparison
175
+ - Statistical significance tests
176
+ - A/B testing support (future)
177
+
178
+ ### 4.2 Analyzer Modification (`app/analyzer.py`)
179
+
180
+ **Update `SubmissionAnalyzer.__init__`:**
181
+ - Check for active fine-tuned model
182
+ - Load LoRA adapter if available
183
+ - Track model version being used
184
+
185
+ **Add method: `get_model_info()`**
186
+ - Return: model type (base/finetuned), version, metrics
187
+
188
+ **Store prediction metadata:**
189
+ - Add confidence scores to all predictions
190
+ - Track which model version made prediction
191
+
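+ For the zero-shot path, per-label scores already come back from the pipeline, so storing confidence is mostly a matter of keeping the top score (a sketch; the category list beyond Objective/Problem/Values is assumed):
+
+ ```python
+ # Sketch: zero-shot classification returns per-label scores; keep the top one.
+ from transformers import pipeline
+
+ classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
+ CATEGORIES = ["Objective", "Problem", "Values", "Strategy", "Question", "Other"]  # assumed set
+
+ def classify_with_confidence(text: str) -> dict:
+     result = classifier(text, candidate_labels=CATEGORIES)
+     return {
+         "category": result["labels"][0],          # top label
+         "confidence": float(result["scores"][0]),
+         "model_version": "base-zero-shot",        # or the active FineTuningRun id
+     }
+ ```
+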
192
+ ---
193
+
194
+ ## Phase 5: Validation & Quality Assurance
195
+
196
+ ### 5.1 Cross-Validation
197
+ - K-fold cross-validation (k=5) for small datasets
198
+ - Stratified splits to ensure category balance
199
+ - Report: mean ± std accuracy across folds
200
+
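+ A sketch of the k-fold loop with scikit-learn's `StratifiedKFold` (the `train_and_score` callable stands in for one fine-tune-plus-evaluate pass per fold):
+
+ ```python
+ # Sketch: stratified 5-fold cross-validation for small datasets.
+ import numpy as np
+ from sklearn.model_selection import StratifiedKFold
+
+ def cross_validate(texts, labels, train_and_score, k=5, seed=42):
+     skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
+     scores = []
+     for train_idx, test_idx in skf.split(texts, labels):
+         acc = train_and_score(
+             [texts[i] for i in train_idx], [labels[i] for i in train_idx],
+             [texts[i] for i in test_idx], [labels[i] for i in test_idx])
+         scores.append(acc)
+     return float(np.mean(scores)), float(np.std(scores))   # report mean ± std
+ ```
+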
201
+ ### 5.2 Minimum Viable Training Set
202
+ **Data Requirements:**
203
+ - At least 3 examples per category (18 total)
204
+ - Recommended: 5+ examples per category (30 total)
205
+ - Warn if severe class imbalance (>5:1 ratio)
206
+
207
+ ### 5.3 Quality Checks
208
+ - Detect duplicate texts
209
+ - Detect conflicting labels (same text, different categories)
210
+ - Flag suspiciously short/long texts
211
+ - Admin review interface for cleanup
212
+
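+ The duplicate and conflict checks can run directly over the collected examples (a sketch; the text normalization and length thresholds are assumptions):
+
+ ```python
+ # Sketch: flag duplicate texts, conflicting labels, and odd lengths before training.
+ from collections import defaultdict
+
+ def find_data_issues(examples, min_len=10, max_len=2000):
+     by_text = defaultdict(list)
+     for ex in examples:
+         key = " ".join(ex.message.lower().split())   # normalized text
+         by_text[key].append(ex.corrected_category)
+
+     duplicates = {t for t, cats in by_text.items() if len(cats) > 1}
+     conflicts = {t for t, cats in by_text.items() if len(set(cats)) > 1}
+     odd_length = [ex.id for ex in examples
+                   if not (min_len <= len(ex.message) <= max_len)]
+     return {"duplicates": duplicates, "conflicts": conflicts, "odd_length": odd_length}
+ ```
+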
213
+ ### 5.4 Success Criteria
214
+ **Model is deployed if:**
215
+ - Test accuracy > baseline accuracy + 5%
216
+ - OR per-category F1 improved for majority of categories
217
+ - AND no category has F1 < 0.3 (catch catastrophic forgetting)
218
+
219
+ **If criteria not met:**
220
+ - Keep base model active
221
+ - Suggest: collect more data, adjust hyperparameters
222
+
223
+ ---
224
+
225
+ ## Phase 6: Export & Backup
226
+
227
+ ### 6.1 Model Export
228
+ **Format Options:**
229
+ 1. **HuggingFace Hub** - push LoRA adapter to private repo
230
+ 2. **Local Files** - save to `/data/models/exports/`
231
+ 3. **Download via UI** - ZIP file with weights + config
232
+
233
+ **Export Contents:**
234
+ - LoRA adapter weights (`adapter_model.bin`)
235
+ - Adapter config (`adapter_config.json`)
236
+ - Training metrics (`metrics.json`)
237
+ - Training examples used (`training_data.json`)
238
+ - Model card (`README.md`)
239
+
240
+ ### 6.2 Import Pre-trained Model
241
+ - Upload ZIP with LoRA weights
242
+ - Validate compatibility with base model
243
+ - Deploy to production
244
+
245
+ ---
246
+
247
+ ## Technical Implementation Details
248
+
249
+ ### Dependencies to Add (requirements.txt)
250
+ ```
251
+ peft>=0.7.0 # LoRA implementation
252
+ datasets>=2.14.0 # HuggingFace datasets
253
+ scikit-learn>=1.3.0 # cross-validation, metrics
254
+ matplotlib>=3.7.0 # confusion matrix plotting
255
+ seaborn>=0.12.0 # visualization
256
+ accelerate>=0.24.0 # training optimization
257
+ evaluate>=0.4.0 # evaluation metrics
258
+ ```
259
+
260
+ ### File Structure
261
+ ```
262
+ app/
263
+ β”œβ”€β”€ fine_tuning/
264
+ β”‚ β”œβ”€β”€ __init__.py
265
+ β”‚ β”œβ”€β”€ trainer.py # BARTFineTuner class
266
+ β”‚ β”œβ”€β”€ model_manager.py # Model deployment logic
267
+ β”‚ β”œβ”€β”€ tasks.py # Background job handler
268
+ β”‚ β”œβ”€β”€ metrics.py # Custom evaluation metrics
269
+ β”‚ └── data_validator.py # Training data QA
270
+ β”œβ”€β”€ models/
271
+ β”‚ └── models.py # Add TrainingExample, FineTuningRun
272
+ β”œβ”€β”€ routes/
273
+ β”‚ └── admin.py # Add training endpoints
274
+ β”œβ”€β”€ templates/admin/
275
+ β”‚ └── training.html # Training dashboard UI
276
+ └── analyzer.py # Update to support LoRA models
277
+
278
+ /data/models/ # Persistent storage (HF Spaces)
279
+ β”œβ”€β”€ finetuned/
280
+ β”‚ β”œβ”€β”€ run_1/
281
+ β”‚ β”œβ”€β”€ run_2/
282
+ β”‚ └── ...
283
+ └── exports/
284
+ ```
285
+
286
+ ### API Endpoints Summary
287
+ - `GET /admin/training` - Training dashboard page
288
+ - `GET /admin/api/training-stats` - Get correction stats
289
+ - `GET /admin/api/training-examples` - List training data
290
+ - `DELETE /admin/api/training-example/<id>` - Remove example
291
+ - `POST /admin/api/start-training` - Trigger fine-tuning
292
+ - `GET /admin/api/training-status/<run_id>` - Poll training progress
293
+ - `POST /admin/api/deploy-model/<run_id>` - Deploy fine-tuned model
294
+ - `POST /admin/api/rollback-model` - Revert to base model
295
+ - `GET /admin/api/export-model/<run_id>` - Download model weights
296
+
297
+ ### UI Workflow
298
+ 1. Admin corrects categories on Submissions page (already working)
299
+ 2. Navigate to **Training** tab in admin panel
300
+ 3. View stats: "25 corrections collected (Ready to train!)"
301
+ 4. Click "Start Fine-Tuning" β†’ Configure parameters β†’ Confirm
302
+ 5. Progress bar shows: "Preparing data... Training... Evaluating..."
303
+ 6. Results displayed: "Accuracy: 87% (+12% improvement!)"
304
+ 7. Click "Deploy Model" to activate
305
+ 8. All future predictions use fine-tuned model
306
+
307
+ ### Performance Considerations
308
+ - **Training Time**: ~2-5 minutes for 20-50 examples (CPU)
309
+ - **Memory**: LoRA uses ~10% of full fine-tuning memory
310
+ - **Storage**: ~50MB per LoRA checkpoint
311
+ - **Inference**: Minimal overhead vs base model
312
+
313
+ ### Risk Mitigation
314
+ 1. **Overfitting**: Use validation set, early stopping
315
+ 2. **Catastrophic Forgetting**: Monitor all category metrics
316
+ 3. **Bad Training Data**: Quality validation before training
317
+ 4. **Model Regression**: Always compare to baseline, allow rollback
318
+ 5. **Resource Limits**: LoRA keeps training feasible on HF Spaces
319
+
320
+ ---
321
+
322
+ ## Implementation Phases
323
+
324
+ **Phase 1 (Foundation):** Database models + data collection (2-3 hours)
325
+ **Phase 2 (UI):** Training dashboard + configuration (2-3 hours)
326
+ **Phase 3 (Core ML):** Fine-tuning engine + LoRA (4-5 hours)
327
+ **Phase 4 (Deployment):** Model management + versioning (2-3 hours)
328
+ **Phase 5 (QA):** Validation + metrics (2-3 hours)
329
+ **Phase 6 (Polish):** Export/import + documentation (1-2 hours)
330
+
331
+ **Total Estimated Time:** 13-19 hours
332
+
333
+ ---
334
+
335
+ ## Questions for Clarification
336
+
337
+ 1. **Training Infrastructure**: Run on HF Spaces (CPU) or local machine (GPU)?
338
+ 2. **Background Jobs**: Use simple threading or prefer Celery/Redis?
339
+ 3. **Model Hosting**: Keep models in HF Spaces persistent storage or upload to HF Hub?
340
+ 4. **Auto-training**: Should system auto-train when threshold reached, or admin-triggered only?
341
+ 5. **Notification**: Email/webhook when training completes?
342
+ 6. **Multi-model**: Support multiple fine-tuned models simultaneously (A/B testing)?
343
+
344
+ Ready to proceed with implementation upon your approval!
DEPLOYMENT_READY.md ADDED
@@ -0,0 +1,316 @@
1
+ # βœ… Deployment Ready - Status Report
2
+
3
+ **Generated**: October 6, 2025
4
+ **Target Platform**: Hugging Face Spaces
5
+ **Status**: 🟒 READY TO DEPLOY
6
+
7
+ ---
8
+
9
+ ## πŸ“¦ Files Prepared
10
+
11
+ ### Core HF Files
12
+ - βœ… **Dockerfile** (port 7860, HF-optimized)
13
+ - βœ… **README.md** (with YAML metadata for Space)
14
+ - βœ… **app_hf.py** (HF Spaces entry point)
15
+ - βœ… **requirements.txt** (all dependencies)
16
+ - βœ… **wsgi.py** (WSGI wrapper)
17
+
18
+ ### Application Code
19
+ - βœ… **app/** directory (complete application)
20
+ - βœ… app/__init__.py (database config for HF)
21
+ - βœ… app/routes/ (all routes)
22
+ - βœ… app/models/ (database models)
23
+ - βœ… app/templates/ (UI templates)
24
+ - βœ… app/fine_tuning/ (model training)
25
+ - βœ… app/analyzer.py (AI classification)
26
+
27
+ ### Configuration
28
+ - βœ… **.gitignore** (excludes sensitive files)
29
+ - βœ… **.hfignore** (HF-specific exclusions)
30
+ - βœ… **Environment variables** configured:
31
+ - DATABASE_PATH=/data/app.db
32
+ - HF_HOME=/data/.cache/huggingface
33
+ - PORT=7860
34
+
35
+ ---
36
+
37
+ ## πŸ” Security Configuration
38
+
39
+ ### Secret Key (CRITICAL)
40
+ **Production Secret**: `9fd11d101e36efbd3a7893f56d604b860403d247633547586c41453118e69b00`
41
+
42
+ **⚠️ IMPORTANT**: Add this to HF Space Settings β†’ Repository secrets as:
43
+ - **Name**: `FLASK_SECRET_KEY`
44
+ - **Value**: (the key above)
45
+
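+ The app is expected to read this secret from the environment at startup, roughly like the sketch below (the exact fallback behaviour in the real `app/__init__.py` is an assumption):
+
+ ```python
+ # Sketch: consume the HF Secret at startup; fall back to a random per-process
+ # key, which means sessions reset on every restart if the secret is missing.
+ import os, secrets
+ from flask import Flask
+
+ app = Flask(__name__)
+ app.config['SECRET_KEY'] = os.environ.get('FLASK_SECRET_KEY') or secrets.token_hex(32)
+ ```
+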
46
+ ### Admin Access
47
+ - **Default Token**: `ADMIN123`
48
+ - **Recommendation**: Change before public deployment
49
+ - **Location**: app/models/models.py (line 61)
50
+
51
+ ### Session Security
52
+ - βœ… HTTPS enforced
53
+ - βœ… HttpOnly cookies
54
+ - βœ… SameSite=None (iframe support)
55
+ - βœ… Partitioned cookies (Safari compatibility)
56
+
57
+ ---
58
+
59
+ ## πŸš€ Deployment Configuration
60
+
61
+ ### Port Configuration
62
+ ```dockerfile
63
+ EXPOSE 7860 # Dockerfile
64
+ ENV PORT=7860 # Environment
65
+ port = int(os.environ.get("PORT", 7860)) # app_hf.py
66
+ ```
67
+ βœ… Verified: Port 7860 configured correctly
68
+
69
+ ### Database Configuration
70
+ ```python
71
+ db_path = os.environ.get('DATABASE_PATH', '/data/app.db')  # HF persistent storage
72
+ SQLALCHEMY_DATABASE_URI = f'sqlite:///{db_path}'
73
+ ```
74
+ βœ… Verified: Database uses persistent /data directory
75
+
76
+ ### Model Cache Configuration
77
+ ```dockerfile
78
+ ENV HF_HOME=/data/.cache/huggingface
79
+ ENV TRANSFORMERS_CACHE=/data/.cache/huggingface
80
+ ENV HUGGINGFACE_HUB_CACHE=/data/.cache/huggingface
81
+ ```
82
+ βœ… Verified: Models cache in persistent storage
83
+
84
+ ---
85
+
86
+ ## πŸ“Š Resource Requirements
87
+
88
+ ### Minimum (Free Tier)
89
+ - **CPU**: 2 vCPU
90
+ - **RAM**: 16GB
91
+ - **Storage**: 5GB
92
+ - **Performance**: Good for <100 submissions
93
+
94
+ ### Recommended (HF Pro - FREE for you!)
95
+ - **CPU**: 4 vCPU (CPU Upgrade)
96
+ - **RAM**: 32GB
97
+ - **Storage**: 50GB
98
+ - **Performance**: Excellent for any size session
99
+
100
+ ---
101
+
102
+ ## 🎯 Deployment Steps (Summary)
103
+
104
+ 1. **Create Space**: https://huggingface.co/new-space
105
+ - SDK: Docker ⚠️
106
+ - Hardware: CPU Basic or CPU Upgrade
107
+
108
+ 2. **Upload Files**:
109
+ - Dockerfile
110
+ - README.md
111
+ - requirements.txt
112
+ - app_hf.py
113
+ - wsgi.py
114
+ - app/ (entire directory)
115
+
116
+ 3. **Configure Secret**:
117
+ - Settings β†’ Repository secrets
118
+ - Add FLASK_SECRET_KEY
119
+
120
+ 4. **Wait for Build** (~10 minutes)
121
+
122
+ 5. **Access**: https://YOUR_USERNAME-participatory-planner.hf.space
123
+
124
+ ---
125
+
126
+ ## βœ… Pre-Flight Checklist
127
+
128
+ ### Files
129
+ - [x] Dockerfile uses port 7860
130
+ - [x] README.md has YAML header
131
+ - [x] app_hf.py configured for HF
132
+ - [x] requirements.txt complete
133
+ - [x] .hfignore excludes dev files
134
+ - [x] Database path uses /data
135
+
136
+ ### Security
137
+ - [x] Production secret key generated
138
+ - [x] .env excluded from deployment
139
+ - [x] Session cookies configured
140
+ - [x] HTTPS ready
141
+
142
+ ### Features
143
+ - [x] AI model auto-downloads
144
+ - [x] Database auto-creates
145
+ - [x] Fine-tuning works
146
+ - [x] Model selection works
147
+ - [x] Zero-shot models work
148
+ - [x] Export/Import ready
149
+
150
+ ### Testing
151
+ - [x] Local app runs successfully
152
+ - [x] Port 7860 accessible
153
+ - [x] Database persists
154
+ - [x] AI analysis works
155
+ - [x] All features tested
156
+
157
+ ---
158
+
159
+ ## πŸ“ Deployment Documentation
160
+
161
+ ### Quick Start
162
+ - **DEPLOY_TO_HF.md** - 5-minute deployment guide
163
+
164
+ ### Detailed Guides
165
+ - **HUGGINGFACE_DEPLOYMENT.md** - Complete HF deployment guide
166
+ - **HF_DEPLOYMENT_CHECKLIST.md** - Detailed checklist & troubleshooting
167
+
168
+ ### Helper Scripts
169
+ - **prepare_hf_deployment.sh** - Automated preparation script
170
+
171
+ ---
172
+
173
+ ## πŸ” Verification Commands
174
+
175
+ ### Pre-Deployment Check
176
+ ```bash
177
+ ./prepare_hf_deployment.sh
178
+ ```
179
+ **Status**: βœ… Passed
180
+
181
+ ### Manual Verification
182
+ ```bash
183
+ # Check port config
184
+ grep -E "7860" Dockerfile app_hf.py
185
+
186
+ # Check YAML header
187
+ head -10 README.md
188
+
189
+ # Verify files
190
+ ls Dockerfile README.md app_hf.py requirements.txt wsgi.py app/
191
+ ```
192
+ **Status**: βœ… All verified
193
+
194
+ ---
195
+
196
+ ## 🎁 What You Get
197
+
198
+ ### Deployed Application
199
+ - βœ… Full AI-powered planning platform
200
+ - βœ… Token-based access control
201
+ - βœ… AI categorization (6 categories)
202
+ - βœ… Geographic mapping
203
+ - βœ… Analytics dashboard
204
+ - βœ… Fine-tuning capability
205
+ - βœ… Model selection (7+ models)
206
+ - βœ… Zero-shot options (3 models)
207
+ - βœ… Export/Import sessions
208
+ - βœ… Training history
209
+ - βœ… Model deployment management
210
+
211
+ ### Infrastructure
212
+ - βœ… Auto-SSL (HTTPS)
213
+ - βœ… Persistent storage
214
+ - βœ… Auto-restart on crash
215
+ - βœ… Build logs
216
+ - βœ… Health checks
217
+ - βœ… Domain ready (Pro)
218
+
219
+ ### Cost
220
+ - βœ… **$0/month** (included in HF Pro)
221
+
222
+ ---
223
+
224
+ ## πŸ“ˆ Expected Performance
225
+
226
+ ### Build Times
227
+ - First deployment: ~10 minutes
228
+ - Subsequent builds: ~3-5 minutes
229
+ - Model download (first run): ~5 minutes
230
+
231
+ ### Runtime
232
+ - Startup: 10-20 seconds
233
+ - AI inference: <3 seconds per submission
234
+ - Page load: <2 seconds
235
+ - Database queries: <100ms
236
+
237
+ ### Storage Usage
238
+ - Base image: ~500MB
239
+ - AI models: ~1.5GB (cached)
240
+ - Database: grows with usage
241
+ - Total: ~2GB initially
242
+
243
+ ---
244
+
245
+ ## 🚨 Important Notes
246
+
247
+ ### Before Public Launch
248
+ 1. ⚠️ **Change admin token** from ADMIN123
249
+ 2. ⚠️ **Add FLASK_SECRET_KEY** to HF Secrets
250
+ 3. ⚠️ Consider making Space private if handling sensitive data
251
+ 4. ⚠️ Set up regular backups (Export feature)
252
+
253
+ ### Model Considerations
254
+ - First run downloads ~1.5GB model
255
+ - Models cache in /data (persists)
256
+ - Fine-tuned models stored in /data/models
257
+ - Training works on CPU (LoRA efficient)
258
+
259
+ ### Data Persistence
260
+ - Database: /data/app.db (persists)
261
+ - Models: /data/.cache (persists)
262
+ - Fine-tuned: models/finetuned (persists)
263
+ - 50GB storage with Pro
264
+
265
+ ---
266
+
267
+ ## 🎯 Next Steps
268
+
269
+ 1. **Deploy Now**: https://huggingface.co/new-space
270
+ 2. **Follow**: DEPLOY_TO_HF.md guide
271
+ 3. **Test**: All features after deployment
272
+ 4. **Share**: Your Space URL with stakeholders
273
+
274
+ ---
275
+
276
+ ## πŸ“ž Support & Resources
277
+
278
+ ### Documentation
279
+ - [Quick Deploy](./DEPLOY_TO_HF.md)
280
+ - [Full Guide](./HUGGINGFACE_DEPLOYMENT.md)
281
+ - [Checklist](./HF_DEPLOYMENT_CHECKLIST.md)
282
+
283
+ ### HF Resources
284
+ - [Spaces Docs](https://huggingface.co/docs/hub/spaces)
285
+ - [Discord](https://hf.co/join/discord)
286
+ - [Forum](https://discuss.huggingface.co/)
287
+
288
+ ### Monitoring
289
+ - Logs: Your Space β†’ Logs tab
290
+ - Status: Your Space β†’ Status badge
291
+ - Metrics: Your Space β†’ Settings (Pro)
292
+
293
+ ---
294
+
295
+ ## ✨ Final Status
296
+
297
+ ```
298
+ 🟒 DEPLOYMENT READY
299
+
300
+ All systems verified and tested.
301
+ All files prepared and configured.
302
+ All documentation complete.
303
+ Secret key generated.
304
+
305
+ Ready to deploy to Hugging Face Spaces!
306
+
307
+ Estimated deployment time: 15 minutes
308
+ Estimated cost: $0 (HF Pro included)
309
+ ```
310
+
311
+ ---
312
+
313
+ **Action Required**: Click β†’ https://huggingface.co/new-space
314
+
315
+ **Good luck with your deployment! πŸš€**
316
+
DEPLOYMENT_SUCCESS.md ADDED
@@ -0,0 +1,268 @@
1
+ # πŸŽ‰ Deployment Successful!
2
+
3
+ **Status**: βœ… Pushed to Hugging Face Spaces
4
+ **Time**: October 6, 2025
5
+ **Commit**: 1377fb1
6
+
7
+ ---
8
+
9
+ ## 🌐 Your Space
10
+
11
+ ### URLs
12
+ - **Space Dashboard**: https://huggingface.co/spaces/thadillo/participatory-planner
13
+ - **Live App**: https://thadillo-participatory-planner.hf.space
14
+ - **Settings**: https://huggingface.co/spaces/thadillo/participatory-planner/settings
15
+
16
+ ### Admin Login
17
+ - **Token**: `ADMIN123`
18
+
19
+ ---
20
+
21
+ ## 🚨 CRITICAL - Next Step Required!
22
+
23
+ ### Add Secret Key (Do this NOW!)
24
+
25
+ 1. **Go to**: https://huggingface.co/spaces/thadillo/participatory-planner/settings
26
+ 2. **Click**: "Repository secrets" (left sidebar)
27
+ 3. **Click**: "New secret"
28
+ 4. **Add**:
29
+ - **Name**: `FLASK_SECRET_KEY`
30
+ - **Value**: `9fd11d101e36efbd3a7893f56d604b860403d247633547586c41453118e69b00`
31
+ 5. **Click**: "Add secret"
32
+
33
+ **⚠️ Without this, sessions won't work properly!**
34
+
35
+ ---
36
+
37
+ ## πŸ“Š Build Status
38
+
39
+ ### What's Happening Now:
40
+ 1. βœ… Code pushed to HF Spaces
41
+ 2. πŸ”„ Docker image building (~10 minutes)
42
+ 3. ⏳ AI models downloading (~5 minutes)
43
+ 4. ⏳ App starting
44
+
45
+ ### Check Progress:
46
+ 1. Go to: https://huggingface.co/spaces/thadillo/participatory-planner
47
+ 2. Click: **"Logs"** tab
48
+ 3. Look for: `Running on http://0.0.0.0:7860`
49
+
50
+ ### Status Indicators:
51
+ - 🟑 **Yellow badge** = Building
52
+ - 🟒 **Green badge** = Running
53
+ - πŸ”΄ **Red badge** = Error (check Logs)
54
+
55
+ ---
56
+
57
+ ## 🎯 Deployed Features
58
+
59
+ ### All Features Included:
60
+ - βœ… AI-powered text categorization (6 categories)
61
+ - βœ… Model selection (7+ transformer models)
62
+ - βœ… Zero-shot model selection (3 NLI models)
63
+ - βœ… Fine-tuning capability (LoRA + Head-only)
64
+ - βœ… Training run management
65
+ - βœ… Model export/import
66
+ - βœ… Token-based access control
67
+ - βœ… Geographic mapping
68
+ - βœ… Analytics dashboard
69
+ - βœ… Session export/import
70
+
71
+ ### Infrastructure:
72
+ - βœ… Port 7860 configured
73
+ - βœ… Persistent storage (/data)
74
+ - βœ… Auto-SSL (HTTPS)
75
+ - βœ… Health checks
76
+ - βœ… Model caching
77
+
78
+ ---
79
+
80
+ ## βœ… Verification Checklist
81
+
82
+ Once build completes, test:
83
+
84
+ - [ ] App loads at https://thadillo-participatory-planner.hf.space
85
+ - [ ] Admin login works (ADMIN123)
86
+ - [ ] Can create tokens
87
+ - [ ] Can submit contributions
88
+ - [ ] AI analysis works
89
+ - [ ] Model selection works (7+ models)
90
+ - [ ] Zero-shot model selection works (3 models)
91
+ - [ ] Training panel loads
92
+ - [ ] Dashboard displays correctly
93
+ - [ ] Data persists after refresh
94
+
95
+ ---
96
+
97
+ ## πŸ“ˆ Expected Timeline
98
+
99
+ | Step | Duration | Status |
100
+ |------|----------|--------|
101
+ | Code push | Instant | βœ… Done |
102
+ | Docker build | ~10 min | πŸ”„ In progress |
103
+ | Model download | ~5 min | ⏳ Waiting |
104
+ | App start | ~30 sec | ⏳ Waiting |
105
+ | **Total** | **~15 min** | πŸ”„ |
106
+
107
+ ---
108
+
109
+ ## πŸ” Monitoring
110
+
111
+ ### View Build Logs:
112
+ ```
113
+ https://huggingface.co/spaces/thadillo/participatory-planner
114
+ β†’ Click "Logs" tab
115
+ ```
116
+
117
+ ### What to Look For:
118
+ ```
119
+ βœ“ Successfully built
120
+ βœ“ Successfully tagged
121
+ βœ“ Container started
122
+ βœ“ Running on http://0.0.0.0:7860
123
+ βœ“ Debugger is active! (or production mode)
124
+ ```
125
+
126
+ ### Common First-Time Messages (Normal):
127
+ ```
128
+ ⚠️ Downloading model... (first run, takes ~5 min)
129
+ ⚠️ Model cache empty (will populate)
130
+ ⚠️ Creating database... (auto-creates)
131
+ ```
132
+
133
+ ---
134
+
135
+ ## πŸ› οΈ Troubleshooting
136
+
137
+ ### Build Fails
138
+ **Check**: Logs tab for error details
139
+ **Common fix**: Wait and try again (HF sometimes has delays)
140
+
141
+ ### App Not Loading
142
+ **Check**: Build completed successfully (green badge)
143
+ **Fix**: Give it 15-20 minutes for first deployment
144
+
145
+ ### Session Issues
146
+ **Check**: FLASK_SECRET_KEY added to secrets?
147
+ **Fix**: Add it now (see top of this file)
148
+
149
+ ### Model Download Timeout
150
+ **Wait**: First download takes up to 10 minutes
151
+ **Normal**: Models cache after first run
152
+
153
+ ---
154
+
155
+ ## 🎁 HF Pro Benefits Active
156
+
157
+ Your deployment uses:
158
+ - βœ… Better hardware (more CPU/RAM available)
159
+ - βœ… Persistent storage (50GB)
160
+ - βœ… No sleep mode
161
+ - βœ… Priority builds
162
+ - βœ… Custom domain support
163
+ - βœ… Private space option
164
+
165
+ **Cost**: $0 (included in HF Pro) πŸŽ‰
166
+
167
+ ---
168
+
169
+ ## πŸ“Š What's Deployed
170
+
171
+ ### Git Commit Info:
172
+ ```
173
+ Commit: 1377fb1
174
+ Branch: feature/fine-tuning β†’ main
175
+ Files: 10 changed, 1020+ insertions
176
+ ```
177
+
178
+ ### Key Updates:
179
+ - Model selection (7+ transformers)
180
+ - Zero-shot options (3 NLI models)
181
+ - Fine-tuning improvements
182
+ - Training run management
183
+ - Export/delete functionality
184
+ - HF Spaces configuration
185
+
186
+ ---
187
+
188
+ ## πŸ” Security Notes
189
+
190
+ ### Current Setup:
191
+ - βœ… HTTPS enabled (automatic)
192
+ - βœ… Secret key in HF Secrets (add it!)
193
+ - ⚠️ Admin token: ADMIN123 (change for production)
194
+
195
+ ### For Production:
196
+ 1. Change admin token in `app/models/models.py`
197
+ 2. Enable Space authentication
198
+ 3. Make Space private if needed
199
+ 4. Regular data backups
200
+
201
+ ---
202
+
203
+ ## πŸ“ž Support
204
+
205
+ ### If You Need Help:
206
+ - **Logs**: Check build/runtime logs
207
+ - **HF Docs**: https://huggingface.co/docs/hub/spaces
208
+ - **HF Discord**: https://hf.co/join/discord
209
+ - **Status**: https://status.huggingface.co
210
+
211
+ ### Your Space:
212
+ - **Dashboard**: https://huggingface.co/spaces/thadillo/participatory-planner
213
+ - **Settings**: https://huggingface.co/spaces/thadillo/participatory-planner/settings
214
+ - **Files**: https://huggingface.co/spaces/thadillo/participatory-planner/tree/main
215
+
216
+ ---
217
+
218
+ ## πŸš€ Next Steps
219
+
220
+ ### Immediate (Now):
221
+ 1. βœ… Code pushed
222
+ 2. ⏳ Add FLASK_SECRET_KEY to secrets (critical!)
223
+ 3. ⏳ Wait for build (~15 min)
224
+ 4. ⏳ Test app functionality
225
+
226
+ ### Soon (After Build):
227
+ 1. Test all features
228
+ 2. Change admin token for production
229
+ 3. Configure Space settings (privacy, etc.)
230
+ 4. Share with stakeholders
231
+
232
+ ### Optional:
233
+ 1. Enable Space authentication
234
+ 2. Set up custom domain
235
+ 3. Configure hardware (CPU Upgrade)
236
+ 4. Set up monitoring/alerts
237
+
238
+ ---
239
+
240
+ ## ✨ Success Criteria
241
+
242
+ Your deployment is successful when:
243
+ - βœ… Space shows "Running" (green badge)
244
+ - βœ… App loads at URL
245
+ - βœ… Admin login works
246
+ - βœ… AI analysis completes
247
+ - βœ… Data persists
248
+ - βœ… No errors in Logs
249
+
250
+ **Estimated completion**: ~15 minutes from now
251
+
252
+ ---
253
+
254
+ ## πŸŽ‰ Congratulations!
255
+
256
+ Your Participatory Planning Platform is deploying to Hugging Face Spaces!
257
+
258
+ **Watch it build**: https://huggingface.co/spaces/thadillo/participatory-planner
259
+
260
+ **First action**: Add the secret key! ⬆️
261
+
262
+ ---
263
+
264
+ **Deployment Time**: October 6, 2025
265
+ **Platform**: Hugging Face Spaces
266
+ **Status**: πŸ”„ Building
267
+ **ETA**: ~15 minutes
268
+
DEPLOY_TO_HF.md ADDED
@@ -0,0 +1,255 @@
1
+ # πŸš€ Quick Deploy to Hugging Face Spaces
2
+
3
+ ## ⚑ 5-Minute Deployment
4
+
5
+ Your app is **ready to deploy**! Everything is configured.
6
+
7
+ ---
8
+
9
+ ## πŸ“‹ What You Need
10
+
11
+ 1. βœ… Hugging Face account (you have Pro!)
12
+ 2. βœ… 10 minutes of time
13
+ 3. βœ… This repository
14
+
15
+ ---
16
+
17
+ ## 🎯 Deployment Steps
18
+
19
+ ### Step 1: Run Preparation Script (Already Done!)
20
+
21
+ ```bash
22
+ cd /home/thadillo/MyProjects/participatory_planner
23
+ ./prepare_hf_deployment.sh
24
+ ```
25
+
26
+ **Status**: βœ… Complete! Files are ready.
27
+
28
+ ---
29
+
30
+ ### Step 2: Create Hugging Face Space
31
+
32
+ 1. **Go to**: https://huggingface.co/new-space
33
+
34
+ 2. **Fill in the form**:
35
+ - **Space name**: `participatory-planner` (or your choice)
36
+ - **License**: MIT
37
+ - **SDK**: ⚠️ **Docker** (IMPORTANT!)
38
+ - **Hardware**: CPU Basic (free) or CPU Upgrade (Pro - faster)
39
+ - **Visibility**: Public or Private
40
+
41
+ 3. **Click**: "Create Space"
42
+
43
+ ---
44
+
45
+ ### Step 3: Upload Files
46
+
47
+ Two options:
48
+
49
+ #### Option A: Web UI (Easier)
50
+ 1. Go to your Space β†’ **Files** tab
51
+ 2. Click "Add file" β†’ "Upload files"
52
+ 3. Upload these files/folders:
53
+ ```
54
+ βœ… Dockerfile
55
+ βœ… README.md
56
+ βœ… requirements.txt
57
+ βœ… app_hf.py
58
+ βœ… wsgi.py
59
+ βœ… app/ (entire folder)
60
+ ```
61
+ 4. Commit: "Initial deployment"
62
+
63
+ #### Option B: Git Push
64
+ ```bash
65
+ # Add HF as remote (replace YOUR_USERNAME)
66
+ git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/participatory-planner
67
+
68
+ # Push
69
+ git add Dockerfile README.md requirements.txt app_hf.py wsgi.py app/
70
+ git commit -m "πŸš€ Deploy to HF Spaces"
71
+ git push hf main
72
+ ```
73
+
74
+ ---
75
+
76
+ ### Step 4: Configure Secret Key
77
+
78
+ 1. **Go to**: Your Space β†’ Settings β†’ Repository secrets
79
+ 2. **Click**: "New secret"
80
+ 3. **Add**:
81
+ - **Name**: `FLASK_SECRET_KEY`
82
+ - **Value**: `9fd11d101e36efbd3a7893f56d604b860403d247633547586c41453118e69b00`
83
+ 4. **Save**
84
+
85
+ ---
86
+
87
+ ### Step 5: Wait for Build
88
+
89
+ 1. Go to **Logs** tab
90
+ 2. Watch the build (5-10 minutes first time)
91
+ 3. Look for:
92
+ ```
93
+ βœ“ Running on http://0.0.0.0:7860
94
+ ```
95
+ 4. Status will change: "Building" β†’ "Running" βœ…
96
+
97
+ ---
98
+
99
+ ### Step 6: Access Your App! πŸŽ‰
100
+
101
+ Your app is live at:
102
+ - **Direct**: `https://huggingface.co/spaces/YOUR_USERNAME/participatory-planner`
103
+ - **Embedded**: `https://YOUR_USERNAME-participatory-planner.hf.space`
104
+
105
+ **Login**: `ADMIN123`
106
+
107
+ ---
108
+
109
+ ## βœ… Verify Deployment
110
+
111
+ Test these features:
112
+ - [ ] App loads correctly
113
+ - [ ] Admin login works
114
+ - [ ] Can create tokens
115
+ - [ ] Can submit contributions
116
+ - [ ] AI analysis works
117
+ - [ ] Dashboard displays
118
+ - [ ] Training panel accessible
119
+ - [ ] Data persists after refresh
120
+
121
+ ---
122
+
123
+ ## πŸ”§ Troubleshooting
124
+
125
+ ### Build Failed?
126
+ - Check **Logs** tab for error details
127
+ - Verify Docker SDK was selected
128
+ - Try CPU Upgrade if out of memory
129
+
130
+ ### App Not Loading?
131
+ - Wait 10 minutes for model download
132
+ - Check Logs for errors
133
+ - Verify port 7860 in Dockerfile
134
+
135
+ ### Database Issues?
136
+ - Database creates automatically on first run
137
+ - Stored in `/data/app.db` (persists)
138
+ - Check Space hasn't run out of storage
139
+
140
+ ---
141
+
142
+ ## 🎁 Bonus: Pro Features
143
+
144
+ With your HF Pro account:
145
+
146
+ ### Faster Performance
147
+ - Settings β†’ Hardware β†’ CPU Upgrade (4 vCPU, 32GB RAM)
148
+
149
+ ### Private Space
150
+ - Settings β†’ Visibility β†’ Private
151
+ - Perfect for confidential planning sessions
152
+
153
+ ### Custom Domain
154
+ - Settings β†’ Custom domains
155
+ - Add: `planning.yourdomain.com`
156
+
157
+ ### Always-On
158
+ - Settings β†’ Sleep time β†’ Never sleep
159
+ - No cold starts!
160
+
161
+ ---
162
+
163
+ ## πŸ“Š What Gets Deployed
164
+
165
+ ### Included:
166
+ - βœ… Full application code (`app/`)
167
+ - βœ… AI models (download on first run)
168
+ - βœ… Database (created automatically)
169
+ - βœ… All features working
170
+
171
+ ### NOT Included:
172
+ - ❌ Local development files
173
+ - ❌ Your local database
174
+ - ❌ venv/
175
+ - ❌ .env file (use Secrets instead)
176
+
177
+ ---
178
+
179
+ ## πŸ” Security Notes
180
+
181
+ ### Current Setup:
182
+ - βœ… Secret key stored in HF Secrets (not in code)
183
+ - βœ… HTTPS enabled automatically
184
+ - βœ… Session cookies configured
185
+ - ⚠️ Default admin token: `ADMIN123`
186
+
187
+ ### For Production:
188
+ 1. **Change admin token** to something secure
189
+ 2. **Enable Space authentication** (Settings)
190
+ 3. **Make Space private** if handling sensitive data
191
+ 4. **Regular backups** via Export feature
192
+
193
+ ---
194
+
195
+ ## πŸ“ˆ Performance
196
+
197
+ ### Expected:
198
+ - **Build time**: 5-10 minutes (first time)
199
+ - **Model download**: 5 minutes (first run, then cached)
200
+ - **Startup time**: 10-20 seconds
201
+ - **Inference**: <3 seconds per submission
202
+ - **Storage**: ~2GB (model + database)
203
+
204
+ ### With Pro CPU Upgrade:
205
+ - ⚑ 2x faster inference
206
+ - ⚑ Faster model loading
207
+ - ⚑ Better for large sessions (100+ submissions)
208
+
209
+ ---
210
+
211
+ ## πŸ“ž Support
212
+
213
+ ### Documentation:
214
+ - **Full guide**: `HUGGINGFACE_DEPLOYMENT.md`
215
+ - **Checklist**: `HF_DEPLOYMENT_CHECKLIST.md`
216
+ - **HF Docs**: https://huggingface.co/docs/hub/spaces
217
+
218
+ ### Help:
219
+ - **Logs**: Your Space β†’ Logs tab
220
+ - **HF Discord**: https://hf.co/join/discord
221
+ - **HF Forum**: https://discuss.huggingface.co/
222
+
223
+ ---
224
+
225
+ ## 🎯 Quick Summary
226
+
227
+ ```
228
+ 1. Create Space (SDK: Docker) β†’ 1 min
229
+ 2. Upload files β†’ 2 min
230
+ 3. Add FLASK_SECRET_KEY to Secrets β†’ 1 min
231
+ 4. Wait for build β†’ 10 min
232
+ 5. Test & enjoy! β†’ ∞
233
+
234
+ Total: ~15 minutes
235
+ Cost: $0 (included in HF Pro!)
236
+ ```
237
+
238
+ ---
239
+
240
+ ## ✨ You're Ready!
241
+
242
+ Everything is configured and tested. Just follow the steps above.
243
+
244
+ **Next**: Click this link β†’ https://huggingface.co/new-space
245
+
246
+ Good luck! πŸš€πŸŽ‰
247
+
248
+ ---
249
+
250
+ **Files prepared by**: `prepare_hf_deployment.sh`
251
+ **Deployment verified**: βœ… Ready
252
+ **Secret key generated**: βœ… Ready
253
+ **Docker config**: βœ… Port 7860
254
+ **Database**: βœ… Auto-creates at `/data/app.db`
255
+
HF_DEPLOYMENT_CHECKLIST.md ADDED
@@ -0,0 +1,315 @@
1
+ # πŸš€ Hugging Face Deployment Checklist
2
+
3
+ ## βœ… Pre-Deployment Checklist
4
+
5
+ ### 1. Files Ready
6
+ - [x] `Dockerfile.hf` - HF-compatible Docker configuration
7
+ - [x] `app_hf.py` - HF Spaces entry point (port 7860)
8
+ - [x] `README_HF.md` - Space description with YAML metadata
9
+ - [x] `requirements.txt` - All dependencies included
10
+ - [x] `app/` directory - Complete application code
11
+ - [x] `.gitignore` - Ignore patterns configured
12
+ - [x] `wsgi.py` - WSGI application wrapper
13
+
14
+ ### 2. Configuration Verified
15
+ - [x] Port 7860 configured in Dockerfile.hf and app_hf.py
16
+ - [x] Database path uses environment variable (DATABASE_PATH=/data/app.db)
17
+ - [x] HuggingFace cache configured (/data/.cache/huggingface)
18
+ - [x] Session cookies configured for iframe embedding
19
+ - [x] Health check endpoint configured
20
+ - [x] Models directory configured (models/finetuned/)
21
+
22
+ ### 3. Security
23
+ - [ ] **IMPORTANT**: Update FLASK_SECRET_KEY in HF Secrets
24
+ - Use this secure key: `9fd11d101e36efbd3a7893f56d604b860403d247633547586c41453118e69b00`
25
+ - [ ] Consider changing ADMIN123 token to something more secure
26
+ - [ ] Review .hfignore to exclude sensitive files
27
+
28
+ ---
29
+
30
+ ## 🎯 Deployment Steps
31
+
32
+ ### Option A: Web UI (Recommended - 5 minutes)
33
+
34
+ #### Step 1: Create Space
35
+ 1. Go to https://huggingface.co/new-space
36
+ 2. Login with your HF Pro account
37
+ 3. Fill in:
38
+ - **Space name**: `participatory-planner`
39
+ - **License**: MIT
40
+ - **SDK**: Docker ⚠️ IMPORTANT
41
+ - **Hardware**: CPU Basic (or CPU Upgrade for Pro)
42
+ - **Visibility**: Public or Private
43
+
44
+ #### Step 2: Prepare Files for Upload
45
+ Run this command to copy HF-specific files:
46
+ ```bash
47
+ cd /home/thadillo/MyProjects/participatory_planner
48
+
49
+ # Copy HF-specific files to root
50
+ cp Dockerfile.hf Dockerfile
51
+ cp README_HF.md README.md
52
+ ```
53
+
54
+ #### Step 3: Upload Files via Web UI
55
+ Upload these files/folders to your Space:
56
+ - βœ… `Dockerfile` (the HF version)
57
+ - βœ… `README.md` (the HF version with YAML header)
58
+ - βœ… `requirements.txt`
59
+ - βœ… `app_hf.py`
60
+ - βœ… `wsgi.py`
61
+ - βœ… `app/` (entire folder with all subfolders)
62
+ - βœ… `.gitignore`
63
+
64
+ **DO NOT upload:**
65
+ - ❌ `venv/` (Python virtual environment)
66
+ - ❌ `instance/` (local database)
67
+ - ❌ `models/finetuned/` (will be created on HF)
68
+ - ❌ `.git/` (Git history)
69
+ - ❌ `__pycache__/` (Python cache)
70
+
71
+ #### Step 4: Configure Secrets
72
+ 1. Go to your Space β†’ Settings β†’ Repository secrets
73
+ 2. Click "Add a secret"
74
+ 3. Add:
75
+ - **Name**: `FLASK_SECRET_KEY`
76
+ - **Value**: `9fd11d101e36efbd3a7893f56d604b860403d247633547586c41453118e69b00`
77
+ 4. (Optional) Add:
78
+ - **Name**: `FLASK_ENV`
79
+ - **Value**: `production`
80
+
81
+ #### Step 5: Wait for Build
82
+ 1. Go to "Logs" tab
83
+ 2. Watch the build process (5-10 minutes first time)
84
+ 3. Look for: `Running on http://0.0.0.0:7860`
85
+ 4. Space will show "Building" β†’ "Running"
86
+
87
+ #### Step 6: Access & Test
88
+ 1. Visit: `https://huggingface.co/spaces/YOUR_USERNAME/participatory-planner`
89
+ 2. Login with: `ADMIN123`
90
+ 3. Test all features:
91
+ - [ ] Registration page loads
92
+ - [ ] Can create tokens
93
+ - [ ] Can submit contributions
94
+ - [ ] AI analysis works
95
+ - [ ] Dashboard displays correctly
96
+ - [ ] Map visualization works
97
+ - [ ] Training panel accessible
98
+ - [ ] Export/Import works
99
+
100
+ ---
101
+
102
+ ### Option B: Git CLI (For Advanced Users)
103
+
104
+ #### Step 1: Install Git LFS
105
+ ```bash
106
+ git lfs install
107
+ ```
108
+
109
+ #### Step 2: Create Space via CLI
110
+ ```bash
111
+ # Install HF CLI
112
+ pip install huggingface_hub
113
+
114
+ # Login to HF
115
+ huggingface-cli login
116
+
117
+ # Create space (replace YOUR_USERNAME)
118
+ huggingface-cli repo create participatory-planner --type space --space_sdk docker
119
+ ```
120
+
121
+ #### Step 3: Prepare Repository
122
+ ```bash
123
+ cd /home/thadillo/MyProjects/participatory_planner
124
+
125
+ # Copy HF-specific files
126
+ cp Dockerfile.hf Dockerfile
127
+ cp README_HF.md README.md
128
+
129
+ # Add HF remote
130
+ git remote add hf https://huggingface.co/spaces/YOUR_USERNAME/participatory-planner
131
+ ```
132
+
133
+ #### Step 4: Commit and Push
134
+ ```bash
135
+ # Make sure .hfignore is in place
136
+ git add .
137
+ git commit -m "πŸš€ Initial deployment to Hugging Face Spaces"
138
+ git push hf main
139
+ ```
140
+
141
+ #### Step 5: Configure secrets via Web UI
142
+ (Same as Option A, Step 4)
143
+
144
+ ---
145
+
146
+ ## πŸ“‹ Post-Deployment Verification
147
+
148
+ ### Essential Tests
149
+ - [ ] Space builds successfully (check Logs)
150
+ - [ ] App is accessible at Space URL
151
+ - [ ] Admin login works (ADMIN123)
152
+ - [ ] Database persists between restarts
153
+ - [ ] AI model loads successfully
154
+ - [ ] File uploads work
155
+ - [ ] Map loads correctly
156
+
157
+ ### Performance Checks
158
+ - [ ] First load time < 3 seconds (after warm-up)
159
+ - [ ] AI analysis completes in < 5 seconds
160
+ - [ ] No memory errors in logs
161
+ - [ ] Model caching works (subsequent loads faster)
162
+
163
+ ### Security Checks
164
+ - [ ] FLASK_SECRET_KEY is set in Secrets (not in code)
165
+ - [ ] No sensitive data in logs
166
+ - [ ] HTTPS works correctly
167
+ - [ ] Session cookies work in iframe
168
+
169
+ ---
170
+
171
+ ## πŸ”§ Troubleshooting
172
+
173
+ ### Build Fails
174
+ **Error**: "Out of memory during build"
175
+ - **Solution**: Upgrade to CPU Upgrade hardware in Settings
176
+
177
+ **Error**: "Port 7860 not responding"
178
+ - **Solution**: Verify Dockerfile exposes 7860 and app_hf.py uses it
179
+
180
+ ### Runtime Issues
181
+ **Error**: "Database locked" or "Database resets"
182
+ - **Solution**: Verify DATABASE_PATH=/data/app.db in Dockerfile
183
+
184
+ **Error**: "Model download timeout"
185
+ - **Solution**: First download takes 10+ minutes. Be patient. Check Logs.
186
+
187
+ **Error**: "Can't access Space"
188
+ - **Solution**: Check Space visibility (Settings). Set to Public.
189
+
190
+ ### AI Model Issues
191
+ **Error**: "Transformers error on first run"
192
+ - **Solution**: Models download on first use. Check HF_HOME=/data/.cache
193
+
194
+ **Error**: "CUDA/GPU errors"
195
+ - **Solution**: App uses CPU by default. Don't select GPU hardware unless needed.
196
+
197
+ ---
198
+
199
+ ## πŸ“Š Monitoring
200
+
201
+ ### Daily Checks
202
+ - View Logs tab for errors
203
+ - Check Space status badge (green = good)
204
+ - Verify database size (Settings β†’ Storage)
205
+
206
+ ### Weekly Maintenance
207
+ - Export data backup via admin panel
208
+ - Review error logs
209
+ - Check model storage size
210
+ - Update dependencies if needed
211
+
212
+ ---
213
+
214
+ ## πŸ”„ Updates & Rollbacks
215
+
216
+ ### To Update Your Space
217
+ Via Git:
218
+ ```bash
219
+ git add .
220
+ git commit -m "Update: description of changes"
221
+ git push hf main
222
+ ```
223
+
224
+ Via Web UI:
225
+ 1. Go to Files tab
226
+ 2. Edit files directly
227
+ 3. Commit changes
228
+
229
+ ### To Rollback
230
+ 1. Go to Files β†’ Commits
231
+ 2. Find last working commit
232
+ 3. Click "Revert to this commit"
233
+
234
+ ---
235
+
236
+ ## πŸ’‘ Optimization Tips
237
+
238
+ ### For Better Performance
239
+ - Enable CPU Upgrade (4 vCPU, 32GB RAM) - Free with Pro!
240
+ - Use model presets (DeBERTa-v3-small recommended)
241
+ - Set persistent storage for model cache
242
+
243
+ ### For Production Use
244
+ 1. Change admin token from ADMIN123
245
+ 2. Enable Space authentication (Settings)
246
+ 3. Set up custom domain (Pro feature)
247
+ 4. Enable always-on (Pro feature)
248
+ 5. Set up monitoring alerts
249
+
250
+ ---
251
+
252
+ ## πŸŽ‰ Success Criteria
253
+
254
+ Your deployment is successful when:
255
+ - βœ… Space status shows "Running" (green badge)
256
+ - βœ… No errors in Logs for 5 minutes
257
+ - βœ… Admin login works
258
+ - βœ… AI analysis completes successfully
259
+ - βœ… Data persists after refresh
260
+ - βœ… All features work as in local development
261
+
262
+ ---
263
+
264
+ ## πŸ“ž Support Resources
265
+
266
+ - **HF Spaces Docs**: https://huggingface.co/docs/hub/spaces
267
+ - **HF Discord**: https://hf.co/join/discord
268
+ - **App Logs**: Your Space β†’ Logs tab
269
+ - **HF Status**: https://status.huggingface.co
270
+
271
+ ---
272
+
273
+ ## πŸ” Important Security Notes
274
+
275
+ **CRITICAL - Before going public:**
276
+
277
+ 1. **Change Admin Token** in `app/models/models.py`:
278
+ ```python
279
+ if not Token.query.filter_by(token='YOUR_SECURE_TOKEN').first():
280
+     admin_token = Token(token='YOUR_SECURE_TOKEN', type='admin', ...)
281
+ ```
282
+
283
+ 2. **Use HF Secrets** (never commit secrets):
284
+ - FLASK_SECRET_KEY (already set)
285
+ - Any API keys
286
+ - Database credentials (if using external DB)
287
+
288
+ 3. **Consider Space Authentication**:
289
+ - Settings β†’ Enable authentication
290
+ - Require HF login to access
291
+
292
+ 4. **For Confidential Sessions**:
293
+ - Set Space to Private
294
+ - Use password protection
295
+ - Regular data backups
296
+
297
+ ---
298
+
299
+ ## πŸ“ Final Notes
300
+
301
+ **Estimated Deployment Time**: 10-15 minutes (first time)
302
+
303
+ **Resources Used** (with HF Pro):
304
+ - Storage: ~2GB (model cache + database)
305
+ - RAM: ~1-2GB during inference
306
+ - CPU: 2-4 cores recommended
307
+
308
+ **Cost**: $0 (included in HF Pro subscription) πŸŽ‰
309
+
310
+ **Next Step**: Click "Create Space" on huggingface.co/new-space and follow the checklist above!
311
+
312
+ ---
313
+
314
+ **Good luck with your deployment! πŸš€**
315
+
NEXT_STEPS_CATEGORIZATION.md ADDED
@@ -0,0 +1,267 @@
1
+ # 🎯 Next Steps: Sentence-Level Categorization
2
+
3
+ ## πŸ“‹ What We've Created
4
+
5
+ Your excellent observation about multi-category submissions has led to a comprehensive analysis and plan:
6
+
7
+ ### πŸ“„ Documents Created:
8
+
9
+ 1. **SENTENCE_LEVEL_CATEGORIZATION_PLAN.md** (Complete implementation plan)
10
+ - 4 solution options with pros/cons
11
+ - Detailed 7-phase implementation for sentence-level
12
+ - Database schema, UI mockups, code examples
13
+ - Migration strategy
14
+
15
+ 2. **CATEGORIZATION_DECISION_GUIDE.md** (Quick decision helper)
16
+ - Visual comparisons of approaches
17
+ - Questions to help decide
18
+ - Recommended path forward
19
+
20
+ 3. **analyze_submissions_for_sentences.py** (Data analysis script)
21
+ - Analyzes your current 60 submissions
22
+ - Shows % with multiple categories
23
+ - Identifies which need sentence-level breakdown
24
+ - Generates recommendation based on data
25
+
26
+ ---
27
+
28
+ ## πŸš€ How to Proceed
29
+
30
+ ### Step 1: Run Analysis (5 minutes) ⏰
31
+
32
+ **See the data before deciding!**
33
+
34
+ ```bash
35
+ cd /home/thadillo/MyProjects/participatory_planner
36
+ source venv/bin/activate
37
+ python analyze_submissions_for_sentences.py
38
+ ```
39
+
40
+ **This will show**:
41
+ - How many submissions contain multiple categories
42
+ - Which submissions would benefit most
43
+ - Sentence count distribution
44
+ - Data-driven recommendation
45
+
46
+ **Example output**:
47
+ ```
48
+ πŸ“Š STATISTICS
49
+ ─────────────────────────────────────────
50
+ Total Submissions: 60
51
+ Multi-category: 23 (38.3%)
52
+ Avg Sentences/Submission: 2.3
53
+
54
+ πŸ’‘ RECOMMENDATION
55
+ βœ… STRONGLY RECOMMEND sentence-level categorization
56
+ 38.3% of submissions contain multiple categories.
57
+ ```
58
+
59
+ ---
60
+
61
+ ### Step 2: Choose Your Path
62
+
63
+ Based on analysis results, pick one:
64
+
65
+ #### Path A: Full Implementation (if >40% multi-category)
66
+ ```
67
+ Timeline: 2-3 weeks
68
+ Effort: 13-20 hours
69
+ Result: Best system, maximum value
70
+ ```
71
+
72
+ **What you get**:
73
+ - βœ… Sentence-level categorization
74
+ - βœ… Collapsible UI for sentence breakdown
75
+ - βœ… Dual-mode dashboard (submission vs sentence view)
76
+ - βœ… Precise training data
77
+ - βœ… Geotag inheritance
78
+ - βœ… Category distribution per submission
79
+
80
+ **Start with**: Phase 1 (Database schema)
81
+
82
+ ---
83
+
84
+ #### Path B: Proof of Concept (if 20-40% multi-category)
85
+ ```
86
+ Timeline: 3-5 days
87
+ Effort: 4-6 hours
88
+ Result: Test before committing
89
+ ```
90
+
91
+ **What you get**:
92
+ - βœ… Sentence breakdown display (read-only)
93
+ - βœ… Shows what it WOULD look like
94
+ - βœ… No database changes (safe)
95
+ - βœ… Get user feedback
96
+ - βœ… Then decide: full implementation or not
97
+
98
+ **Start with**: UI prototype (no backend changes)
99
+
100
+ ---
101
+
102
+ #### Path C: Multi-Label (if <20% multi-category)
103
+ ```
104
+ Timeline: 2-3 days
105
+ Effort: 4-6 hours
106
+ Result: Good enough, simpler
107
+ ```
108
+
109
+ **What you get**:
110
+ - βœ… Multiple categories per submission
111
+ - βœ… Simple checkbox UI
112
+ - βœ… Fast to implement
113
+ - ❌ Less granular than sentence-level
114
+
115
+ **Start with**: Add category array field
116
+
117
+ ---
118
+
119
+ #### Path D: Keep Current (if <10% multi-category)
120
+ ```
121
+ Timeline: 0 days
122
+ Effort: 0 hours
123
+ Result: No change needed
124
+ ```
125
+
126
+ **Decision**: Current system is sufficient
127
+
128
+ ---
129
+
130
+ ### Step 3: Implementation
131
+
132
+ **Once you decide, I can**:
133
+
134
+ #### If Full Implementation (Path A):
135
+ 1. βœ… Create database migration
136
+ 2. βœ… Add SubmissionSentence model
137
+ 3. βœ… Implement sentence segmentation
138
+ 4. βœ… Update analyzer for sentence-level
139
+ 5. βœ… Build collapsible UI
140
+ 6. βœ… Update dashboard aggregation
141
+ 7. βœ… Migrate existing data
142
+ 8. βœ… Add training data updates
143
+
144
+ **I'll create**: Working feature branch with all phases
145
+
146
+ #### If Proof of Concept (Path B):
147
+ 1. βœ… Add sentence display (read-only)
148
+ 2. βœ… Show category breakdown
149
+ 3. βœ… Test with users
150
+ 4. βœ… Get feedback
151
+ 5. βœ… Then decide next steps
152
+
153
+ **I'll create**: UI prototype for testing
154
+
155
+ #### If Multi-Label (Path C):
156
+ 1. βœ… Update Submission model
157
+ 2. βœ… Change UI to checkboxes
158
+ 3. βœ… Update dashboard logic
159
+ 4. βœ… Migrate data
160
+
161
+ **I'll create**: Multi-label feature
162
+
163
+ ---
164
+
165
+ ## πŸ“Š Decision Matrix
166
+
167
+ **Use this to decide**:
168
+
169
+ | Factor | Full Sentence-Level | Proof of Concept | Multi-Label | Keep Current |
170
+ |--------|-------------------|------------------|-------------|--------------|
171
+ | Multi-category % | >40% | 20-40% | 10-20% | <10% |
172
+ | Time available | 2-3 weeks | 3-5 days | 2-3 days | - |
173
+ | Training data priority | High | Medium | Low | - |
174
+ | Analytics depth | Very important | Important | Nice to have | Not critical |
175
+ | Risk tolerance | Low (test first) | Medium | High | - |
176
+
177
+ ---
178
+
179
+ ## 🎯 My Recommendation
180
+
181
+ ### Do This Now (10 minutes):
182
+
183
+ 1. **Run the analysis script**:
184
+ ```bash
185
+ cd /home/thadillo/MyProjects/participatory_planner
186
+ source venv/bin/activate
187
+ python analyze_submissions_for_sentences.py
188
+ ```
189
+
190
+ 2. **Look at the percentage** of multi-category submissions
191
+
192
+ 3. **Decide based on data**:
193
+ - **>40%** β†’ "Let's do full sentence-level"
194
+ - **20-40%** β†’ "Let's try proof of concept first"
195
+ - **<20%** β†’ "Multi-label is probably enough"
196
+
197
+ 4. **Tell me your decision**, and I'll start implementation immediately
198
+
199
+ ---
200
+
201
+ ## πŸ’‘ Key Insights from Your Observation
202
+
203
+ You identified a **critical limitation**:
204
+
205
+ > "Dallas should establish more green spaces in South Dallas neighborhoods. Areas like Oak Cliff lack accessible parks compared to North Dallas."
206
+
207
+ **Current problem**:
208
+ - System forces ONE category
209
+ - Loses semantic richness
210
+ - Training data is imprecise
211
+
212
+ **Your solution**:
213
+ - Sentence-level categorization
214
+ - Preserve all meaning
215
+ - Better AI training
216
+
217
+ **This is exactly the right thinking!** 🎯
218
+
219
+ The analysis script will show if this pattern is common enough to warrant the implementation effort.
220
+
221
+ ---
222
+
223
+ ## πŸ“ž What I Need from You
224
+
225
+ **To proceed, please**:
226
+
227
+ 1. βœ… Run the analysis script (above)
228
+ 2. βœ… Review the output
229
+ 3. βœ… Tell me which path you want:
230
+ - **A**: Full sentence-level implementation
231
+ - **B**: Proof of concept first
232
+ - **C**: Multi-label approach
233
+ - **D**: Keep current system
234
+
235
+ 4. βœ… I'll start building immediately!
236
+
237
+ ---
238
+
239
+ ## πŸ“‚ Files Ready for You
240
+
241
+ All documentation is ready:
242
+ - βœ… `SENTENCE_LEVEL_CATEGORIZATION_PLAN.md` - Full technical plan
243
+ - βœ… `CATEGORIZATION_DECISION_GUIDE.md` - Decision helper
244
+ - βœ… `analyze_submissions_for_sentences.py` - Analysis script
245
+ - βœ… This file - Next steps summary
246
+
247
+ **Everything is prepared. Just waiting for your decision!** πŸš€
248
+
249
+ ---
250
+
251
+ ## ⏰ Timeline Estimates
252
+
253
+ | Path | Phase | Time | What Happens |
254
+ |------|-------|------|--------------|
255
+ | **A: Full** | Week 1 | 8-10h | DB, backend, analysis |
256
+ | | Week 2 | 5-8h | UI, dashboard |
257
+ | | Week 3 | 2-4h | Testing, polish |
258
+ | **B: POC** | Days 1-2 | 4-6h | UI prototype |
259
+ | | Day 3 | - | User testing |
260
+ | | Days 4-5 | - | Decide: full or abort |
261
+ | **C: Multi-label** | Days 1-2 | 4-6h | Implementation |
262
+ | | Day 3 | 1-2h | Testing |
263
+
264
+ ---
265
+
266
+ **Ready when you are!** Just run the analysis and let me know what you decide. πŸŽ‰
267
+
SENTENCE_LEVEL_CATEGORIZATION_PLAN.md ADDED
@@ -0,0 +1,830 @@
1
+ # πŸ“‹ Sentence-Level Categorization - Implementation Plan
2
+
3
+ **Problem Identified**: Single submissions often contain multiple semantic units (sentences) belonging to different categories, leading to loss of nuance.
4
+
5
+ **Example**:
6
+ > "Dallas should establish more green spaces in South Dallas neighborhoods. Areas like Oak Cliff lack accessible parks compared to North Dallas."
7
+ - Sentence 1: **Objective** (should establish...)
8
+ - Sentence 2: **Problem** (lack accessible parks...)
9
+
10
+ ---
11
+
12
+ ## 🎯 Proposed Solutions (Ranked by Complexity)
13
+
14
+ ### Option 1: Sentence-Level Categorization (User's Proposal) ⭐ RECOMMENDED
15
+
16
+ **Concept**: Break submissions into sentences, categorize each individually while maintaining parent submission context.
17
+
18
+ **Pros**:
19
+ - βœ… Maximum granularity and accuracy
20
+ - βœ… Preserves all semantic information
21
+ - βœ… Better training data for fine-tuning
22
+ - βœ… More detailed analytics
23
+ - βœ… Maintains geotag/stakeholder context
24
+
25
+ **Cons**:
26
+ - ⚠️ Significant database schema changes
27
+ - ⚠️ UI complexity increases
28
+ - ⚠️ More AI inference calls (slower/costlier)
29
+ - ⚠️ Dashboard aggregation more complex
30
+
31
+ **Complexity**: High
32
+ **Value**: Very High
33
+
34
+ ---
35
+
36
+ ### Option 2: Multi-Label Classification (Simpler Alternative)
37
+
38
+ **Concept**: Assign multiple categories to a single submission.
39
+
40
+ **Example**: Submission β†’ [Objective, Problem]
41
+
42
+ **Pros**:
43
+ - βœ… Simpler implementation (no schema change)
44
+ - βœ… Faster than sentence-level
45
+ - βœ… Captures multi-faceted submissions
46
+ - βœ… Minimal UI changes
47
+
48
+ **Cons**:
49
+ - ❌ Loses granularity (which sentence is which?)
50
+ - ❌ Can't map specific sentences to categories
51
+ - ❌ Training data less precise
52
+ - ❌ Dashboard becomes ambiguous
53
+
54
+ **Complexity**: Low
55
+ **Value**: Medium
56
+
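+ A minimal sketch of what this could look like, assuming a JSON-serialized `categories` column is added to the existing `Submission` model (illustrative only, not part of the current schema):
+
+ ```python
+ import json
+ from app import db
+
+ class Submission(db.Model):
+     # ... existing fields ...
+     # Hypothetical: JSON-encoded list of category names, e.g. '["Objectives", "Problem"]'
+     categories_json = db.Column(db.Text, nullable=True)
+
+     def get_categories(self):
+         """Return all categories assigned to this submission."""
+         return json.loads(self.categories_json) if self.categories_json else []
+
+     def set_categories(self, categories):
+         """Store a list of category names."""
+         self.categories_json = json.dumps(categories)
+ ```
+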
57
+ ---
58
+
59
+ ### Option 3: Primary + Secondary Categories (Hybrid)
60
+
61
+ **Concept**: Main category + optional secondary categories.
62
+
63
+ **Example**: Submission β†’ Primary: Objective, Secondary: [Problem, Values]
64
+
65
+ **Pros**:
66
+ - βœ… Preserves primary focus
67
+ - βœ… Acknowledges complexity
68
+ - βœ… Moderate implementation effort
69
+ - βœ… Good for hierarchical analysis
70
+
71
+ **Cons**:
72
+ - ❌ Still loses sentence-level detail
73
+ - ❌ Arbitrary primary/secondary distinction
74
+ - ❌ Training data structure unclear
75
+
76
+ **Complexity**: Medium
77
+ **Value**: Medium
78
+
79
+ ---
80
+
81
+ ### Option 4: Aspect-Based Sentiment Analysis (Advanced)
82
+
83
+ **Concept**: Extract aspects/topics from each sentence, then categorize aspects.
84
+
85
+ **Example**:
86
+ - Aspect: "green spaces" β†’ Category: Objective, Sentiment: Positive desire
87
+ - Aspect: "park access disparity" β†’ Category: Problem, Sentiment: Negative
88
+
89
+ **Pros**:
90
+ - βœ… Very sophisticated analysis
91
+ - βœ… Captures nuance and sentiment
92
+ - βœ… Excellent for research
93
+
94
+ **Cons**:
95
+ - ❌ Very complex implementation
96
+ - ❌ Requires different AI models
97
+ - ❌ Overkill for planning sessions
98
+ - ❌ Harder to explain to stakeholders
99
+
100
+ **Complexity**: Very High
101
+ **Value**: Medium (unless research-focused)
102
+
103
+ ---
104
+
105
+ ## πŸ—οΈ Implementation Plan: Option 1 (Sentence-Level Categorization)
106
+
107
+ ### Phase 1: Database Schema Changes
108
+
109
+ #### New Model: `SubmissionSentence`
110
+
111
+ ```python
112
+ class SubmissionSentence(db.Model):
113
+ __tablename__ = 'submission_sentences'
114
+
115
+ id = db.Column(db.Integer, primary_key=True)
116
+ submission_id = db.Column(db.Integer, db.ForeignKey('submissions.id'), nullable=False)
117
+ sentence_index = db.Column(db.Integer, nullable=False) # 0, 1, 2...
118
+ text = db.Column(db.Text, nullable=False)
119
+ category = db.Column(db.String(50), nullable=True)
120
+ confidence = db.Column(db.Float, nullable=True)
121
+ created_at = db.Column(db.DateTime, default=datetime.utcnow)
122
+
123
+ # Relationships
124
+ submission = db.relationship('Submission', backref='sentences')
125
+
126
+ # Composite unique constraint
127
+ __table_args__ = (
128
+ db.UniqueConstraint('submission_id', 'sentence_index', name='uq_submission_sentence'),
129
+ )
130
+ ```
131
+
132
+ #### Update `Submission` Model
133
+
134
+ ```python
135
+ class Submission(db.Model):
136
+ # ... existing fields ...
137
+
138
+ # NEW: Flag to track if sentence-level analysis is done
139
+ sentence_analysis_done = db.Column(db.Boolean, default=False)
140
+
141
+ # DEPRECATED: category (keep for backward compatibility)
142
+ # category = db.Column(db.String(50), nullable=True)
143
+
144
+ def get_primary_category(self):
145
+ """Get most frequent category from sentences"""
146
+ if not self.sentences:
147
+ return self.category # Fallback to old system
148
+
149
+ from collections import Counter
150
+ categories = [s.category for s in self.sentences if s.category]
151
+ if not categories:
152
+ return None
153
+ return Counter(categories).most_common(1)[0][0]
154
+
155
+ def get_category_distribution(self):
156
+ """Get percentage of each category in this submission"""
157
+ if not self.sentences:
158
+ return {self.category: 100} if self.category else {}
159
+
160
+ from collections import Counter
161
+ categories = [s.category for s in self.sentences if s.category]
162
+ total = len(categories)
163
+ if total == 0:
164
+ return {}
165
+
166
+ counts = Counter(categories)
167
+ return {cat: (count/total)*100 for cat, count in counts.items()}
168
+ ```
169
+
170
+ #### Update `TrainingExample` Model
171
+
172
+ ```python
173
+ class TrainingExample(db.Model):
174
+ # ... existing fields ...
175
+
176
+ # NEW: Link to sentence instead of submission
177
+ sentence_id = db.Column(db.Integer, db.ForeignKey('submission_sentences.id'), nullable=True)
178
+
179
+ # Keep submission_id for backward compatibility
180
+ submission_id = db.Column(db.Integer, db.ForeignKey('submissions.id'), nullable=True)
181
+
182
+ # Relationships
183
+ sentence = db.relationship('SubmissionSentence', backref='training_examples')
184
+ ```
185
+
186
+ ---
187
+
188
+ ### Phase 2: Sentence Segmentation Logic
189
+
190
+ #### New Module: `app/utils/text_processor.py`
191
+
192
+ ```python
193
+ import re
194
+ import nltk
195
+ from typing import List
196
+
197
+ # Download required NLTK data (run once)
198
+ # nltk.download('punkt')
199
+
200
+ class TextProcessor:
201
+ """Handle sentence segmentation and text processing"""
202
+
203
+ @staticmethod
204
+ def segment_into_sentences(text: str) -> List[str]:
205
+ """
206
+ Break text into sentences using multiple strategies.
207
+
208
+ Strategies:
209
+ 1. NLTK punkt tokenizer (primary)
210
+ 2. Regex-based fallback
211
+ 3. Min/max length constraints
212
+ """
213
+ # Clean text
214
+ text = text.strip()
215
+
216
+ # Try NLTK first (better accuracy)
217
+ try:
218
+ from nltk.tokenize import sent_tokenize
219
+ sentences = sent_tokenize(text)
220
+ except Exception:
221
+ # Fallback: regex-based segmentation
222
+ sentences = TextProcessor._regex_segmentation(text)
223
+
224
+ # Clean and filter
225
+ sentences = [s.strip() for s in sentences if s.strip()]
226
+
227
+ # Filter out very short "sentences" (likely not meaningful)
228
+ sentences = [s for s in sentences if len(s.split()) >= 3]
229
+
230
+ return sentences
231
+
232
+ @staticmethod
233
+ def _regex_segmentation(text: str) -> List[str]:
234
+ """Fallback sentence segmentation using regex"""
235
+ # Split on period, exclamation, question mark (followed by space or end)
236
+ pattern = r'(?<=[.!?])\s+(?=[A-Z])|(?<=[.!?])$'
237
+ sentences = re.split(pattern, text)
238
+ return [s.strip() for s in sentences if s.strip()]
239
+
240
+ @staticmethod
241
+ def is_valid_sentence(sentence: str) -> bool:
242
+ """Check if sentence is valid for categorization"""
243
+ # Must have at least 3 words
244
+ if len(sentence.split()) < 3:
245
+ return False
246
+
247
+ # Must have some alphabetic characters
248
+ if not any(c.isalpha() for c in sentence):
249
+ return False
250
+
251
+ # Not just a list item or fragment
252
+ if sentence.strip().startswith('-') or sentence.strip().startswith('β€’'):
253
+ return False
254
+
255
+ return True
256
+ ```
257
+
258
+ **Dependencies to add to `requirements.txt`**:
259
+ ```
260
+ nltk>=3.8.0
261
+ ```
262
+
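+ NLTK also needs its `punkt` data downloaded once. A small guard for this (the same pattern used in `analyze_submissions_for_sentences.py`):
+
+ ```python
+ import nltk
+
+ # Fetch the punkt sentence tokenizer data only if it is not already installed
+ try:
+     nltk.data.find('tokenizers/punkt')
+ except LookupError:
+     nltk.download('punkt', quiet=True)
+ ```
+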
263
+ ---
264
+
265
+ ### Phase 3: Analysis Pipeline Updates
266
+
267
+ #### Update `app/analyzer.py`
268
+
269
+ ```python
270
+ class SubmissionAnalyzer:
271
+ # ... existing code ...
272
+
273
+ def analyze_with_sentences(self, submission_text: str):
274
+ """
275
+ Analyze submission at sentence level.
276
+
277
+ Returns:
278
+ List[Dict]: List of {text: str, category: str, confidence: float}
279
+ """
280
+ from app.utils.text_processor import TextProcessor
281
+
282
+ # Segment into sentences
283
+ sentences = TextProcessor.segment_into_sentences(submission_text)
284
+
285
+ # Classify each sentence
286
+ results = []
287
+ for sentence in sentences:
288
+ if TextProcessor.is_valid_sentence(sentence):
289
+ category = self.analyze(sentence)
290
+ # Get confidence if using fine-tuned model
291
+ confidence = self._get_last_confidence() if self.model_type == 'finetuned' else None
292
+
293
+ results.append({
294
+ 'text': sentence,
295
+ 'category': category,
296
+ 'confidence': confidence
297
+ })
298
+
299
+ return results
300
+
301
+ def _get_last_confidence(self):
302
+ """Store and return last prediction confidence"""
303
+ # Implementation depends on model type
304
+ return getattr(self, '_last_confidence', None)
305
+ ```
306
+
307
+ #### Update Analysis Endpoint: `app/routes/admin.py`
308
+
309
+ ```python
310
+ @bp.route('/api/analyze', methods=['POST'])
311
+ @admin_required
312
+ def analyze_submissions():
313
+ data = request.json
314
+ analyze_all = data.get('analyze_all', False)
315
+ use_sentences = data.get('use_sentences', True) # NEW: sentence-level flag
316
+
317
+ # Get submissions to analyze
318
+ if analyze_all:
319
+ to_analyze = Submission.query.all()
320
+ else:
321
+ to_analyze = Submission.query.filter_by(sentence_analysis_done=False).all()
322
+
323
+ if not to_analyze:
324
+ return jsonify({'success': False, 'error': 'No submissions to analyze'}), 400
325
+
326
+ analyzer = get_analyzer()
327
+ success_count = 0
328
+ error_count = 0
329
+
330
+ for submission in to_analyze:
331
+ try:
332
+ if use_sentences:
333
+ # NEW: Sentence-level analysis
334
+ sentence_results = analyzer.analyze_with_sentences(submission.message)
335
+
336
+ # Clear old sentences
337
+ SubmissionSentence.query.filter_by(submission_id=submission.id).delete()
338
+
339
+ # Create new sentence records
340
+ for idx, result in enumerate(sentence_results):
341
+ sentence = SubmissionSentence(
342
+ submission_id=submission.id,
343
+ sentence_index=idx,
344
+ text=result['text'],
345
+ category=result['category'],
346
+ confidence=result.get('confidence')
347
+ )
348
+ db.session.add(sentence)
349
+
350
+ submission.sentence_analysis_done = True
351
+ # Set primary category for backward compatibility
352
+ submission.category = submission.get_primary_category()
353
+ else:
354
+ # OLD: Submission-level analysis (backward compatible)
355
+ category = analyzer.analyze(submission.message)
356
+ submission.category = category
357
+
358
+ success_count += 1
359
+
360
+ except Exception as e:
361
+ logger.error(f"Error analyzing submission {submission.id}: {e}")
362
+ error_count += 1
363
+ continue
364
+
365
+ db.session.commit()
366
+
367
+ return jsonify({
368
+ 'success': True,
369
+ 'analyzed': success_count,
370
+ 'errors': error_count,
371
+ 'sentence_level': use_sentences
372
+ })
373
+ ```
374
+
375
+ ---
376
+
377
+ ### Phase 4: UI/UX Updates
378
+
379
+ #### A. Submissions Page - Collapsible Sentence View
380
+
381
+ **Template Update: `app/templates/admin/submissions.html`**
382
+
383
+ ```html
384
+ <!-- Submission Card -->
385
+ <div class="card mb-3">
386
+ <div class="card-header d-flex justify-content-between align-items-center">
387
+ <div>
388
+ <strong>{{ submission.contributor_type }}</strong>
389
+ <span class="badge bg-secondary">{{ submission.timestamp.strftime('%Y-%m-%d %H:%M') }}</span>
390
+ </div>
391
+ <div>
392
+ {% if submission.sentence_analysis_done %}
393
+ <button class="btn btn-sm btn-outline-primary"
394
+ data-bs-toggle="collapse"
395
+ data-bs-target="#sentences-{{ submission.id }}">
396
+ <i class="bi bi-list-nested"></i> View Sentences ({{ submission.sentences|length }})
397
+ </button>
398
+ {% endif %}
399
+ </div>
400
+ </div>
401
+
402
+ <div class="card-body">
403
+ <!-- Original Message -->
404
+ <p class="mb-2">{{ submission.message }}</p>
405
+
406
+ <!-- Primary Category (backward compatible) -->
407
+ <div class="mb-2">
408
+ <strong>Primary Category:</strong>
409
+ <span class="badge bg-info">{{ submission.get_primary_category() or 'Unanalyzed' }}</span>
410
+ </div>
411
+
412
+ <!-- Category Distribution -->
413
+ {% if submission.sentence_analysis_done %}
414
+ <div class="mb-2">
415
+ <strong>Category Distribution:</strong>
416
+ {% for category, percentage in submission.get_category_distribution().items() %}
417
+ <span class="badge bg-secondary">{{ category }}: {{ "%.0f"|format(percentage) }}%</span>
418
+ {% endfor %}
419
+ </div>
420
+ {% endif %}
421
+
422
+ <!-- Collapsible Sentence Details -->
423
+ {% if submission.sentence_analysis_done %}
424
+ <div class="collapse mt-3" id="sentences-{{ submission.id }}">
425
+ <div class="border-start border-primary ps-3">
426
+ <h6>Sentence Breakdown:</h6>
427
+ {% for sentence in submission.sentences %}
428
+ <div class="mb-2 p-2 bg-light rounded">
429
+ <div class="d-flex justify-content-between align-items-start">
430
+ <div class="flex-grow-1">
431
+ <small class="text-muted">Sentence {{ sentence.sentence_index + 1 }}:</small>
432
+ <p class="mb-1">{{ sentence.text }}</p>
433
+ </div>
434
+ <div>
435
+ <select class="form-select form-select-sm"
436
+ onchange="updateSentenceCategory({{ sentence.id }}, this.value)">
437
+ <option value="">Uncategorized</option>
438
+ {% for cat in categories %}
439
+ <option value="{{ cat }}"
440
+ {% if sentence.category == cat %}selected{% endif %}>
441
+ {{ cat }}
442
+ </option>
443
+ {% endfor %}
444
+ </select>
445
+ </div>
446
+ </div>
447
+ {% if sentence.confidence %}
448
+ <small class="text-muted">Confidence: {{ "%.0f"|format(sentence.confidence * 100) }}%</small>
449
+ {% endif %}
450
+ </div>
451
+ {% endfor %}
452
+ </div>
453
+ </div>
454
+ {% endif %}
455
+ </div>
456
+ </div>
457
+ ```
458
+
459
+ **JavaScript Update**:
460
+
461
+ ```javascript
462
+ function updateSentenceCategory(sentenceId, category) {
463
+ fetch(`/admin/api/update-sentence-category/${sentenceId}`, {
464
+ method: 'POST',
465
+ headers: {'Content-Type': 'application/json'},
466
+ body: JSON.stringify({category: category})
467
+ })
468
+ .then(response => response.json())
469
+ .then(data => {
470
+ if (data.success) {
471
+ showToast('Sentence category updated', 'success');
472
+ // Optionally refresh to update distribution
473
+ } else {
474
+ showToast('Error: ' + data.error, 'error');
475
+ }
476
+ });
477
+ }
478
+ ```
479
+
480
+ #### B. Dashboard Updates - Aggregation Strategy
481
+
482
+ **Two Aggregation Modes**:
483
+
484
+ 1. **Submission-Based** (backward compatible): Count primary category per submission
485
+ 2. **Sentence-Based** (new): Count all sentences by category
486
+
487
+ **Template Update: `app/templates/admin/dashboard.html`**
488
+
489
+ ```html
490
+ <!-- Aggregation Mode Selector -->
491
+ <div class="mb-3">
492
+ <label>View Mode:</label>
493
+ <div class="btn-group" role="group">
494
+ <input type="radio" class="btn-check" name="viewMode" id="viewSubmissions"
495
+ value="submissions" checked onchange="updateDashboard()">
496
+ <label class="btn btn-outline-primary" for="viewSubmissions">
497
+ By Submissions
498
+ </label>
499
+
500
+ <input type="radio" class="btn-check" name="viewMode" id="viewSentences"
501
+ value="sentences" onchange="updateDashboard()">
502
+ <label class="btn btn-outline-primary" for="viewSentences">
503
+ By Sentences
504
+ </label>
505
+ </div>
506
+ </div>
507
+
508
+ <!-- Category Chart (updates based on mode) -->
509
+ <canvas id="categoryChart"></canvas>
510
+ ```
511
+
512
+ **Route Update: `app/routes/admin.py`**
513
+
514
+ ```python
515
+ @bp.route('/dashboard')
516
+ @admin_required
517
+ def dashboard():
518
+ analyzed = Submission.query.filter(Submission.category != None).count() > 0
519
+
520
+ if not analyzed:
521
+ flash('Please analyze submissions first', 'warning')
522
+ return redirect(url_for('admin.overview'))
523
+
524
+ # NEW: Get view mode from query param
525
+ view_mode = request.args.get('mode', 'submissions') # 'submissions' or 'sentences'
526
+
527
+ submissions = Submission.query.filter(Submission.category != None).all()
528
+
529
+ # Contributor stats (unchanged)
530
+ contributor_stats = db.session.query(
531
+ Submission.contributor_type,
532
+ db.func.count(Submission.id)
533
+ ).group_by(Submission.contributor_type).all()
534
+
535
+ # Category stats - MODE DEPENDENT
536
+ if view_mode == 'sentences':
537
+ # NEW: Sentence-based aggregation
538
+ category_stats = db.session.query(
539
+ SubmissionSentence.category,
540
+ db.func.count(SubmissionSentence.id)
541
+ ).filter(SubmissionSentence.category != None).group_by(SubmissionSentence.category).all()
542
+
543
+ # Breakdown by contributor (via parent submission)
544
+ breakdown = {}
545
+ for cat in CATEGORIES:
546
+ breakdown[cat] = {}
547
+ for ctype in CONTRIBUTOR_TYPES:
548
+ count = db.session.query(db.func.count(SubmissionSentence.id)).join(
549
+ Submission
550
+ ).filter(
551
+ SubmissionSentence.category == cat,
552
+ Submission.contributor_type == ctype['value']
553
+ ).scalar()
554
+ breakdown[cat][ctype['value']] = count
555
+ else:
556
+ # OLD: Submission-based aggregation (backward compatible)
557
+ category_stats = db.session.query(
558
+ Submission.category,
559
+ db.func.count(Submission.id)
560
+ ).filter(Submission.category != None).group_by(Submission.category).all()
561
+
562
+ breakdown = {}
563
+ for cat in CATEGORIES:
564
+ breakdown[cat] = {}
565
+ for ctype in CONTRIBUTOR_TYPES:
566
+ count = Submission.query.filter_by(
567
+ category=cat,
568
+ contributor_type=ctype['value']
569
+ ).count()
570
+ breakdown[cat][ctype['value']] = count
571
+
572
+ # Geotagged submissions (unchanged - submission level)
573
+ geotagged_submissions = Submission.query.filter(
574
+ Submission.latitude != None,
575
+ Submission.longitude != None,
576
+ Submission.category != None
577
+ ).all()
578
+
579
+ return render_template('admin/dashboard.html',
580
+ submissions=submissions,
581
+ contributor_stats=contributor_stats,
582
+ category_stats=category_stats,
583
+ geotagged_submissions=geotagged_submissions,
584
+ categories=CATEGORIES,
585
+ contributor_types=CONTRIBUTOR_TYPES,
586
+ breakdown=breakdown,
587
+ view_mode=view_mode)
588
+ ```
589
+
590
+ ---
591
+
592
+ ### Phase 5: Geographic Mapping Updates
593
+
594
+ **Challenge**: A single geotag now maps to multiple categories (via sentences).
595
+
596
+ **Solution Options**:
597
+
598
+ #### Option A: Multi-Category Markers (Recommended)
599
+ ```javascript
600
+ // Map marker shows all categories in this submission
601
+ marker.bindPopup(`
602
+ <strong>${submission.contributorType}</strong><br>
603
+ ${submission.message}<br>
604
+ <strong>Categories:</strong> ${submission.category_distribution}
605
+ `);
606
+ ```
607
+
608
+ #### Option B: One Marker Per Sentence-Category
609
+ ```javascript
610
+ // Create separate markers for each sentence (if has geotag)
611
+ // Color by sentence category
612
+ submission.sentences.forEach(sentence => {
613
+ if (sentence.category) {
614
+ createMarker({
615
+ lat: submission.latitude,
616
+ lng: submission.longitude,
617
+ category: sentence.category,
618
+ text: sentence.text
619
+ });
620
+ }
621
+ });
622
+ ```
623
+
624
+ **Recommendation**: Option A (cleaner map, less clutter)
625
+
626
+ ---
627
+
628
+ ### Phase 6: Training Data Updates
629
+
630
+ **Key Change**: Training examples now link to sentences, not submissions.
631
+
632
+ **Update Training Example Creation**:
633
+
634
+ ```python
635
+ @bp.route('/api/update-sentence-category/<int:sentence_id>', methods=['POST'])
636
+ @admin_required
637
+ def update_sentence_category(sentence_id):
638
+ try:
639
+ sentence = SubmissionSentence.query.get_or_404(sentence_id)
640
+ data = request.json
641
+ new_category = data.get('category')
642
+
643
+ # Store original
644
+ original_category = sentence.category
645
+
646
+ # Update sentence
647
+ sentence.category = new_category
648
+
649
+ # Create/update training example
650
+ existing = TrainingExample.query.filter_by(sentence_id=sentence_id).first()
651
+
652
+ if existing:
653
+ existing.original_category = original_category
654
+ existing.corrected_category = new_category
655
+ existing.correction_timestamp = datetime.utcnow()
656
+ else:
657
+ training_example = TrainingExample(
658
+ sentence_id=sentence_id,
659
+ submission_id=sentence.submission_id,
660
+ message=sentence.text, # Just the sentence text
661
+ original_category=original_category,
662
+ corrected_category=new_category,
663
+ contributor_type=sentence.submission.contributor_type
664
+ )
665
+ db.session.add(training_example)
666
+
667
+ # Update parent submission's primary category
668
+ submission = sentence.submission
669
+ submission.category = submission.get_primary_category()
670
+
671
+ db.session.commit()
672
+
673
+ return jsonify({'success': True})
674
+
675
+ except Exception as e:
676
+ return jsonify({'success': False, 'error': str(e)}), 500
677
+ ```
678
+
679
+ ---
680
+
681
+ ### Phase 7: Migration Strategy
682
+
683
+ #### Migration Script: `migrations/add_sentence_level.py`
684
+
685
+ ```python
686
+ """
687
+ Migration: Add sentence-level categorization support
688
+
689
+ This migration:
690
+ 1. Creates SubmissionSentence table
691
+ 2. Adds sentence_analysis_done flag to Submission
692
+ 3. Optionally migrates existing submissions to sentence-level
693
+ """
694
+
695
+ from app import create_app, db
696
+ from app.models.models import Submission, SubmissionSentence
697
+ from app.utils.text_processor import TextProcessor
698
+ import logging
699
+
700
+ logger = logging.getLogger(__name__)
701
+
702
+ def migrate_existing_submissions(auto_segment=False):
703
+ """
704
+ Migrate existing submissions to sentence-level structure.
705
+
706
+ Args:
707
+ auto_segment: If True, automatically segment and categorize
708
+ If False, just mark as pending sentence analysis
709
+ """
710
+ app = create_app()
711
+
712
+ with app.app_context():
713
+ # Create new table
714
+ db.create_all()
715
+
716
+ # Get all submissions
717
+ submissions = Submission.query.all()
718
+ logger.info(f"Migrating {len(submissions)} submissions...")
719
+
720
+ for submission in submissions:
721
+ if auto_segment and submission.category:
722
+ # Auto-segment using old category as fallback
723
+ sentences = TextProcessor.segment_into_sentences(submission.message)
724
+
725
+ for idx, sentence_text in enumerate(sentences):
726
+ sentence = SubmissionSentence(
727
+ submission_id=submission.id,
728
+ sentence_index=idx,
729
+ text=sentence_text,
730
+ category=submission.category, # Use old category as default
731
+ confidence=None
732
+ )
733
+ db.session.add(sentence)
734
+
735
+ submission.sentence_analysis_done = True
736
+ logger.info(f"Segmented submission {submission.id} into {len(sentences)} sentences")
737
+ else:
738
+ # Just mark for re-analysis
739
+ submission.sentence_analysis_done = False
740
+
741
+ db.session.commit()
742
+ logger.info("Migration complete!")
743
+
744
+ if __name__ == '__main__':
745
+ # Run with auto-segmentation disabled (safer)
746
+ migrate_existing_submissions(auto_segment=False)
747
+
748
+ # Or run with auto-segmentation (assigns old category to all sentences)
749
+ # migrate_existing_submissions(auto_segment=True)
750
+ ```
751
+
752
+ **Run migration**:
753
+ ```bash
754
+ python migrations/add_sentence_level.py
755
+ ```
756
+
757
+ ---
758
+
759
+ ## πŸ“Š Comparison: Implementation Approaches
760
+
761
+ | Aspect | Option 1: Sentence-Level | Option 2: Multi-Label | Option 3: Primary+Secondary |
762
+ |--------|-------------------------|----------------------|----------------------------|
763
+ | **Granularity** | ⭐⭐⭐⭐⭐ Highest | ⭐⭐⭐ Medium | ⭐⭐⭐ Medium |
764
+ | **Accuracy** | ⭐⭐⭐⭐⭐ Best | ⭐⭐⭐⭐ Good | ⭐⭐⭐⭐ Good |
765
+ | **Implementation** | ⭐⭐ Complex | ⭐⭐⭐⭐⭐ Simple | ⭐⭐⭐⭐ Moderate |
766
+ | **Training Data** | ⭐⭐⭐⭐⭐ Precise | ⭐⭐⭐ Ambiguous | ⭐⭐⭐ OK |
767
+ | **UI Complexity** | ⭐⭐ High | ⭐⭐⭐⭐⭐ Low | ⭐⭐⭐⭐ Low |
768
+ | **Dashboard** | ⭐⭐⭐ Flexible | ⭐⭐⭐ Limited | ⭐⭐⭐⭐ Clear |
769
+ | **Performance** | ⭐⭐⭐ OK (more API calls) | ⭐⭐⭐⭐⭐ Fast | ⭐⭐⭐⭐⭐ Fast |
770
+ | **Backward Compat** | ⭐⭐⭐⭐⭐ Yes | ⭐⭐⭐⭐⭐ Yes | ⭐⭐⭐⭐ Mostly |
771
+
772
+ ---
773
+
774
+ ## 🎯 Final Recommendation
775
+
776
+ ### **Implement Option 1: Sentence-Level Categorization**
777
+
778
+ **Why**:
779
+ 1. βœ… Matches your use case perfectly
780
+ 2. βœ… Provides maximum analytical value
781
+ 3. βœ… Better training data = better AI
782
+ 4. βœ… Backward compatible (maintains `submission.category`)
783
+ 5. βœ… Scalable to future needs
784
+
785
+ **Implementation Priority**:
786
+ 1. **Phase 1**: Database schema ⏱️ 2-3 hours
787
+ 2. **Phase 2**: Sentence segmentation ⏱️ 1-2 hours
788
+ 3. **Phase 3**: Analysis pipeline ⏱️ 2-3 hours
789
+ 4. **Phase 4**: UI updates (collapsible view) ⏱️ 3-4 hours
790
+ 5. **Phase 5**: Dashboard aggregation ⏱️ 2-3 hours
791
+ 6. **Phase 6**: Training updates ⏱️ 1-2 hours
792
+ 7. **Phase 7**: Migration & testing ⏱️ 2-3 hours
793
+
794
+ **Total Estimate**: 13-20 hours
795
+
796
+ ---
797
+
798
+ ## πŸ’‘ Alternative: Incremental Rollout
799
+
800
+ **If you want to test before full commitment**:
801
+
802
+ ### Phase 0: Proof of Concept (4-6 hours)
803
+ 1. Add sentence segmentation (no DB changes)
804
+ 2. Show sentence breakdown in UI (read-only)
805
+ 3. Let admins test and provide feedback
806
+ 4. Decide whether to proceed with full implementation
807
+
808
+ **Then choose**:
809
+ - βœ… **Full sentence-level** if feedback is positive
810
+ - ⚠️ **Multi-label** if sentence-level is too complex
811
+ - πŸ”„ **Stay with current** if not worth effort
812
+
813
+ ---
814
+
815
+ ## πŸš€ Next Steps
816
+
817
+ **I recommend**:
818
+
819
+ 1. **Validate approach**: Review this plan with stakeholders
820
+ 2. **Start with Phase 0**: Proof of concept (sentence display only)
821
+ 3. **Get feedback**: Do admins find sentence breakdown useful?
822
+ 4. **Decide**: Full implementation or alternative approach
823
+
824
+ **Should I proceed with**:
825
+ - A) Phase 0: Proof of concept (sentence display, no DB changes)
826
+ - B) Full implementation: All phases
827
+ - C) Alternative: Multi-label approach (simpler)
828
+
829
+ **Your choice?** 🎯
830
+
TRAINING_STRATEGY.md ADDED
@@ -0,0 +1,266 @@
1
+ # Training Strategy Guide for Participatory Planning Classifier
2
+
3
+ ## Current Performance (as of Oct 2025)
4
+
5
+ - **Dataset**: 60 examples (~42 train / 9 val / 9 test)
6
+ - **Current Best**: Head-only training - **66.7% accuracy**
7
+ - **Baseline**: ~60% (zero-shot BART-mnli)
8
+ - **Challenge**: Only 6.7% improvement - model is **underfitting**
9
+
10
+ ## Recommended Training Strategies (Ranked)
11
+
12
+ ### πŸ₯‡ **Strategy 1: LoRA with Conservative Settings**
13
+ **Best for: Your current 60-example dataset**
14
+
15
+ ```yaml
16
+ Configuration:
17
+ training_mode: lora
18
+ lora_rank: 4-8 # Start small!
19
+ lora_alpha: 8-16 # 2x rank
20
+ lora_dropout: 0.2 # High dropout to prevent overfitting
21
+ learning_rate: 1e-4 # Conservative
22
+ num_epochs: 5-7 # Watch for overfitting
23
+ batch_size: 4 # Smaller batches
24
+ ```
25
+
26
+ **Expected Accuracy**: 70-80%
27
+
28
+ **Why it works:**
29
+ - More capacity than head-only (~500K params with r=4)
30
+ - Still parameter-efficient enough for 60 examples
31
+ - Dropout prevents overfitting
32
+
33
+ **Try this first!** Your head-only results show you need more model capacity.
34
+
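+ In code, these settings translate roughly to the following PEFT/Transformers setup. This is a sketch under the assumptions above: `microsoft/deberta-v3-base` is used as an illustrative base model, the dataset objects are placeholders, and some architectures need an explicit `target_modules` list in `LoraConfig`.
+
+ ```python
+ from peft import LoraConfig, TaskType, get_peft_model
+ from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
+
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "microsoft/deberta-v3-base", num_labels=6)  # 6 planning categories
+
+ # Conservative LoRA settings for a ~60-example dataset
+ lora_config = LoraConfig(
+     task_type=TaskType.SEQ_CLS,
+     r=4,               # small rank to limit added capacity
+     lora_alpha=8,      # ~2x rank
+     lora_dropout=0.2,  # high dropout to fight overfitting
+     # target_modules=[...]  # may be required for some base models
+ )
+ model = get_peft_model(model, lora_config)
+
+ training_args = TrainingArguments(
+     output_dir="models/finetuned",
+     learning_rate=1e-4,
+     num_train_epochs=5,
+     per_device_train_batch_size=4,
+ )
+ # trainer = Trainer(model=model, args=training_args,
+ #                   train_dataset=train_ds, eval_dataset=val_ds)
+ # trainer.train()
+ ```
+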
35
+ ---
36
+
37
+ ### πŸ₯ˆ **Strategy 2: Data Augmentation + LoRA**
38
+ **Best for: Improving beyond 80% accuracy**
39
+
40
+ **Step 1: Augment your dataset to 150-200 examples**
41
+
42
+ Methods:
43
+ 1. **Paraphrasing** (use GPT/Claude):
44
+ ```
45
+ # For each example:
46
+ "We need better public transit"
47
+ β†’ "Public transportation should be improved"
48
+ β†’ "Transit system requires enhancement"
49
+ ```
50
+
51
+ 2. **Back-translation**:
52
+ English β†’ Spanish β†’ English (creates natural variations; see the sketch after this list)
53
+
54
+ 3. **Template-based**:
55
+ Create templates for each category and fill with variations
56
+
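+ A rough sketch of back-translation (method 2 above) using the Transformers translation pipeline; the Helsinki-NLP checkpoints are common public models, but verify they suit your setup:
+
+ ```python
+ from transformers import pipeline
+
+ # English -> Spanish -> English produces natural paraphrases
+ en_to_es = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
+ es_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")
+
+ def back_translate(text: str) -> str:
+     """Return a paraphrased variant of `text` via Spanish."""
+     spanish = en_to_es(text)[0]["translation_text"]
+     return es_to_en(spanish)[0]["translation_text"]
+
+ print(back_translate("We need better public transit"))
+ ```
+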
57
+ **Step 2: Train LoRA (r=8-16) on augmented data**
58
+ - Expected Accuracy: 80-90%
59
+
60
+ ---
61
+
62
+ ### πŸ₯‰ **Strategy 3: Two-Stage Progressive Training**
63
+ **Best for: Maximizing performance with limited data** (a minimal sketch follows the two stages below)
64
+
65
+ 1. **Stage 1**: Head-only (warm-up)
66
+ - 3 epochs
67
+ - Initialize the classification head
68
+
69
+ 2. **Stage 2**: LoRA fine-tuning
70
+ - r=4, low learning rate
71
+ - Build on head-only initialization
72
+
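+ A minimal sketch of the two stages, assuming `model` is a sequence-classification model like the one in the Strategy 1 sketch above:
+
+ ```python
+ from peft import LoraConfig, TaskType, get_peft_model
+
+ # Stage 1: freeze the encoder so only the classification head trains
+ for param in model.base_model.parameters():
+     param.requires_grad = False
+ # ...run a short Trainer warm-up (~3 epochs)...
+
+ # Stage 2: wrap with LoRA; PEFT leaves only adapter (and head) params trainable
+ model = get_peft_model(
+     model,
+     LoraConfig(task_type=TaskType.SEQ_CLS, r=4, lora_alpha=8, lora_dropout=0.2),
+ )
+ # ...continue training with a low learning rate (e.g. 1e-4)...
+ ```
+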
73
+ ---
74
+
75
+ ### πŸ”§ **Strategy 4: Optimize Category Definitions**
76
+ **May help with zero-shot AND fine-tuning**
77
+
78
+ Your categories might be too similar. Consider:
79
+
80
+ **Current Categories:**
81
+ - Vision vs Objectives (both forward-looking)
82
+ - Problem vs Directives (both constraints)
83
+
84
+ **Better Definitions:**
85
+ ```python
86
+ CATEGORIES = {
87
+ 'Vision': {
88
+ 'name': 'Vision & Aspirations',
89
+ 'description': 'Long-term future state, desired outcomes, what success looks like',
90
+ 'keywords': ['future', 'aspire', 'imagine', 'dream', 'ideal']
91
+ },
92
+ 'Problem': {
93
+ 'name': 'Current Problems',
94
+ 'description': 'Existing issues, frustrations, barriers, root causes',
95
+ 'keywords': ['problem', 'issue', 'challenge', 'barrier', 'broken']
96
+ },
97
+ 'Objectives': {
98
+ 'name': 'Specific Goals',
99
+ 'description': 'Measurable targets, concrete milestones, quantifiable outcomes',
100
+ 'keywords': ['increase', 'reduce', 'achieve', 'target', 'by 2030']
101
+ },
102
+ 'Directives': {
103
+ 'name': 'Constraints & Requirements',
104
+ 'description': 'Must-haves, non-negotiables, compliance requirements',
105
+ 'keywords': ['must', 'required', 'mandate', 'comply', 'regulation']
106
+ },
107
+ 'Values': {
108
+ 'name': 'Principles & Values',
109
+ 'description': 'Core beliefs, ethical guidelines, guiding principles',
110
+ 'keywords': ['equity', 'sustainability', 'justice', 'fairness', 'inclusive']
111
+ },
112
+ 'Actions': {
113
+ 'name': 'Concrete Actions',
114
+ 'description': 'Specific steps, interventions, activities to implement',
115
+ 'keywords': ['build', 'create', 'implement', 'install', 'construct']
116
+ }
117
+ }
118
+ ```
119
+
120
+ ---
121
+
122
+ ## Alternative Base Models to Consider
123
+
124
+ ### **DeBERTa-v3-base** (Better for Classification)
125
+ ```python
126
+ # In app/analyzer.py
127
+ model_name = "microsoft/deberta-v3-base"
128
+ # Size: 184M params (vs BART's 400M)
129
+ # Often outperforms BART for classification
130
+ ```
131
+
132
+ ### **DistilRoBERTa** (Faster, Lighter)
133
+ ```python
134
+ model_name = "distilroberta-base"
135
+ # Size: 82M params
136
+ # 2x faster, 60% smaller
137
+ # Good accuracy
138
+ ```
139
+
140
+ ### **XLM-RoBERTa-base** (Multilingual)
141
+ ```python
142
+ model_name = "xlm-roberta-base"
143
+ # If you have multilingual submissions
144
+ ```
145
+
146
+ ---
147
+
148
+ ## Data Collection Strategy
149
+
150
+ **Current**: 60 examples β†’ **Target**: 150+ examples
151
+
152
+ ### How to get more data:
153
+
154
+ 1. **Active Learning** (Built into your system!)
155
+ - Deploy current model
156
+ - Admin reviews and corrects predictions
157
+ - Automatically builds training set
158
+
159
+ 2. **Historical Data**
160
+ - Import past participatory planning submissions
161
+ - Manual labeling (15 min for 50 examples)
162
+
163
+ 3. **Synthetic Generation** (Use GPT-4)
164
+ ```
165
+ Prompt: "Generate 10 participatory planning submissions
166
+ that express VISION for urban transportation"
167
+ ```
168
+
169
+ 4. **Crowdsourcing**
170
+ - Amazon Mechanical Turk (MTurk) or internal team
171
+ - Label 100 examples: ~$20-50
172
+
173
+ ---
174
+
175
+ ## Performance Targets
176
+
177
+ | Dataset Size | Method | Expected Accuracy | Time to Train |
178
+ |-------------|--------|------------------|---------------|
179
+ | 60 | Head-only | 65-70% ❌ Current | 2 min |
180
+ | 60 | LoRA (r=4) | 70-80% βœ… Try next | 5 min |
181
+ | 150 | LoRA (r=8) | 80-85% ⭐ Goal | 10 min |
182
+ | 300+ | LoRA (r=16) | 85-90% 🎯 Ideal | 20 min |
183
+
184
+ ---
185
+
186
+ ## Immediate Action Plan
187
+
188
+ ### Week 1: Low-Hanging Fruit
189
+ 1. βœ… Train with LoRA (r=4, epochs=5)
190
+ 2. βœ… Compare to head-only baseline
191
+ 3. βœ… Check per-category F1 scores
192
+
193
+ ### Week 2: Data Expansion
194
+ 4. Collect 50 more examples (aim for balance)
195
+ 5. Use data augmentation (paraphrase 60 β†’ 120)
196
+ 6. Retrain LoRA (r=8)
197
+
198
+ ### Week 3: Optimization
199
+ 7. Try DeBERTa-v3-base as base model
200
+ 8. Fine-tune category descriptions
201
+ 9. Deploy best model
202
+
203
+ ---
204
+
205
+ ## Debugging Low Performance
206
+
207
+ If accuracy stays below 75%:
208
+
209
+ ### Check 1: Data Quality
210
+ ```sql
+ -- Find messages labeled with more than one category
+ SELECT message, COUNT(DISTINCT corrected_category) AS label_count
+ FROM training_examples
+ GROUP BY message
+ HAVING COUNT(DISTINCT corrected_category) > 1;
+ ```
217
+
218
+ ### Check 2: Class Imbalance
219
+ - Ensure each category has 5-10+ examples
220
+ - Use weighted loss if imbalanced (see the class-weight sketch below)
221
+
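+ One way to apply a weighted loss (a sketch; replace the toy `labels` list with your integer-encoded training labels):
+
+ ```python
+ import numpy as np
+ import torch
+ from sklearn.utils.class_weight import compute_class_weight
+
+ labels = [0, 0, 0, 1, 2, 2]  # toy example: integer-encoded training labels
+
+ classes = np.unique(labels)
+ weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
+
+ # Rare categories now contribute more to the loss
+ loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))
+ ```
+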
222
+ ### Check 3: Category Confusion
223
+ - Generate confusion matrix
224
+ - Merge categories that are frequently confused
225
+ (e.g., Vision + Objectives β†’ "Future Goals")
226
+
227
+ ### Check 4: Text Quality
228
+ - Remove very short texts (< 5 words)
229
+ - Remove duplicates
230
+ - Check for non-English text
231
+
232
+ ---
233
+
234
+ ## Advanced: Ensemble Models
235
+
236
+ If single model plateaus at 80-85%:
237
+
238
+ 1. Train 3 models with different seeds
239
+ 2. Use voting or averaging
240
+ 3. Typical boost: +3-5% accuracy
241
+
242
+ ```python
243
+ # model1/model2/model3: classifiers trained with different random seeds
+ from collections import Counter
+
+ predictions = [
+     model1.predict(text),
+     model2.predict(text),
+     model3.predict(text),
+ ]
+ final = Counter(predictions).most_common(1)[0][0]  # majority vote
250
+ ```
251
+
252
+ ---
253
+
254
+ ## Conclusion
255
+
256
+ **For your current 60 examples:**
257
+ 1. 🎯 **DO**: Try LoRA with r=4-8 (conservative settings)
258
+ 2. πŸ“ˆ **DO**: Collect 50-100 more examples
259
+ 3. πŸ”„ **DO**: Try DeBERTa-v3 as alternative base model
260
+ 4. ❌ **DON'T**: Use head-only (proven to underfit)
261
+ 5. ❌ **DON'T**: Use full fine-tuning (will overfit)
262
+
263
+ **Expected outcome:** 70-85% accuracy (up from current 66.7%)
264
+
265
+ **Next milestone:** 150 examples β†’ 85%+ accuracy
266
+
ZERO_SHOT_MODEL_SELECTION.md ADDED
@@ -0,0 +1,185 @@
1
+ # Zero-Shot Model Selection Feature
2
+
3
+ ## Overview
4
+
5
+ You can now **choose which AI model** to use for zero-shot classification! This allows you to balance between accuracy and speed based on your needs.
6
+
7
+ ## Available Zero-Shot Models
8
+
9
+ ### 1. **BART-large-MNLI** (Current Default)
10
+ - **Size**: 400M parameters
11
+ - **Speed**: Slow
12
+ - **Best for**: Maximum accuracy, works out of the box
13
+ - **Description**: Large sequence-to-sequence model, excellent zero-shot performance
14
+ - **Model ID**: `facebook/bart-large-mnli`
15
+
16
+ ### 2. **DeBERTa-v3-base-MNLI** ⭐ **Recommended**
17
+ - **Size**: 86M parameters (4.5x smaller than BART)
18
+ - **Speed**: Fast
19
+ - **Best for**: Fast zero-shot classification with good accuracy
20
+ - **Description**: DeBERTa trained on NLI datasets, excellent zero-shot with better speed
21
+ - **Model ID**: `MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli`
22
+
23
+ ### 3. **DistilBART-MNLI**
24
+ - **Size**: 134M parameters
25
+ - **Speed**: Medium
26
+ - **Best for**: Balanced zero-shot performance
27
+ - **Description**: Distilled BART for zero-shot, good balance of speed and accuracy
28
+ - **Model ID**: `valhalla/distilbart-mnli-12-3`
29
+
30
+ ## How to Use
31
+
32
+ ### Step 1: Go to Training Page
33
+ 1. Navigate to **Admin Panel** β†’ **Training** tab
34
+ 2. Look for the **"Zero-Shot Classification Model"** section at the top
35
+
36
+ ### Step 2: View Current Model
37
+ - The dropdown shows the currently active model
38
+ - Below it, you'll see model information (size, speed, description)
39
+
40
+ ### Step 3: Change Model
41
+ 1. Select a different model from the dropdown
42
+ 2. The system will ask for confirmation
43
+ 3. The analyzer will reload with the new model
44
+ 4. **All future classifications** will use the selected model
45
+
46
+ ### Step 4: Test It
47
+ - Go to **Submissions** page
48
+ - Click "Re-analyze" on any submission
49
+ - The new model will be used for classification!
50
+
51
+ ## When to Use Each Model
52
+
53
+ ### Use BART-large-MNLI if:
54
+ - βœ… Accuracy is more important than speed
55
+ - βœ… You have powerful hardware
56
+ - βœ… You don't mind waiting a bit longer
57
+
58
+ ### Use DeBERTa-v3-base-MNLI if: ⭐ **RECOMMENDED**
59
+ - βœ… You want good accuracy with better speed
60
+ - βœ… You're working with many submissions
61
+ - βœ… You want to save computational resources
62
+ - βœ… You need faster response times
63
+
64
+ ### Use DistilBART-MNLI if:
65
+ - βœ… You want something in between
66
+ - βœ… You're familiar with BART but need better speed
67
+
68
+ ## Technical Details
69
+
70
+ ### How It Works
71
+
72
+ 1. **Settings Storage**: The selected model is stored in the database (`Settings` table)
73
+ 2. **Dynamic Loading**: The analyzer checks the setting and loads the selected model (see the sketch below)
74
+ 3. **Hot Reload**: When you change models, the analyzer reloads automatically
75
+ 4. **No Data Loss**: Changing models doesn't affect your training data or fine-tuned models
76
+
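+ Conceptually, the dynamic-loading step boils down to something like this (a sketch; the key-to-model mapping and category list stand in for the app's actual Settings lookup, and the dictionary keys are illustrative):
+
+ ```python
+ from transformers import pipeline
+
+ # Dropdown keys mapped to the Hugging Face model IDs listed above
+ ZERO_SHOT_MODELS = {
+     "bart-large-mnli": "facebook/bart-large-mnli",
+     "deberta-v3-base-mnli": "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli",
+     "distilbart-mnli": "valhalla/distilbart-mnli-12-3",
+ }
+
+ def load_classifier(selected_key: str):
+     """Build a zero-shot pipeline for the model chosen in Settings."""
+     return pipeline("zero-shot-classification", model=ZERO_SHOT_MODELS[selected_key])
+
+ classifier = load_classifier("deberta-v3-base-mnli")
+ result = classifier(
+     "Dallas should establish more green spaces in South Dallas neighborhoods.",
+     candidate_labels=["Vision", "Problem", "Objectives", "Directives", "Values", "Actions"],
+ )
+ print(result["labels"][0])  # highest-scoring category
+ ```
+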
77
+ ### Model Persistence
78
+
79
+ - The selected model remains active even after app restart
80
+ - Each submission classification uses the currently active zero-shot model
81
+ - Fine-tuned models override zero-shot models when deployed
82
+
83
+ ### API Endpoints
84
+
85
+ **Get Current Model:**
86
+ ```
87
+ GET /admin/api/get-zero-shot-model
88
+ ```
89
+
90
+ **Change Model:**
91
+ ```
92
+ POST /admin/api/set-zero-shot-model
93
+ Body: {"model_key": "deberta-v3-base-mnli"}
94
+ ```
95
+
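+ For example, switching models from a script (a sketch; assumes an authenticated admin session and that the app is reachable at `http://localhost:7860`):
+
+ ```python
+ import requests
+
+ BASE = "http://localhost:7860"  # adjust to your Space URL
+ session = requests.Session()    # must already carry an admin session cookie
+
+ resp = session.post(
+     f"{BASE}/admin/api/set-zero-shot-model",
+     json={"model_key": "deberta-v3-base-mnli"},
+ )
+ print(resp.json())
+
+ print(session.get(f"{BASE}/admin/api/get-zero-shot-model").json())
+ ```
+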
96
+ ## Performance Comparison
97
+
98
+ | Model | Parameters | Classification Speed | Relative Accuracy |
99
+ |-------|-----------|---------------------|-------------------|
100
+ | BART-large-MNLI | 400M | 1x (baseline) | 100% |
101
+ | DeBERTa-v3-base-MNLI | 86M | ~4x faster | ~95-98% |
102
+ | DistilBART-MNLI | 134M | ~2x faster | ~92-95% |
103
+
104
+ *Note: Actual performance may vary based on your hardware and text length*
105
+
106
+ ## Fine-Tuning vs Zero-Shot
107
+
108
+ ### Zero-Shot Model Selection
109
+ - **When**: Before you have training data
110
+ - **What**: Chooses which pre-trained model to use
111
+ - **Where**: Admin β†’ Training β†’ Zero-Shot Classification Model
112
+ - **Effect**: Affects all new classifications immediately
113
+
114
+ ### Fine-Tuning Model Selection
115
+ - **When**: When training with your labeled data
116
+ - **What**: Chooses which model architecture to fine-tune
117
+ - **Where**: Admin β†’ Training β†’ Base Model Architecture for Fine-Tuning
118
+ - **Effect**: Only affects that specific training run
119
+
120
+ ### Can I use both?
121
+ **Yes!** You can:
122
+ 1. **Select a zero-shot model** (e.g., DeBERTa-v3-base-MNLI) for initial classifications
123
+ 2. **Fine-tune** using any model (e.g., DeBERTa-v3-small) for better performance
124
+ 3. **Deploy** the fine-tuned model, which will override the zero-shot model
125
+
126
+ ## Troubleshooting
127
+
128
+ **Q: I changed the model but nothing happened?**
129
+ A: The change affects new classifications. Try clicking "Re-analyze" on a submission to see the new model in action.
130
+
131
+ **Q: Which model should I choose?**
132
+ A: Start with **DeBERTa-v3-base-MNLI** - it's faster than BART with minimal accuracy loss.
133
+
134
+ **Q: Does this affect my fine-tuned models?**
135
+ A: No! Zero-shot models are only used when no fine-tuned model is deployed.
136
+
137
+ **Q: Can I switch back to BART?**
138
+ A: Yes! Just select BART-large-MNLI from the dropdown anytime.
139
+
140
+ **Q: Will changing models break anything?**
141
+ A: No, it's completely safe. Your data, training runs, and fine-tuned models are unaffected.
142
+
143
+ ## Best Practices
144
+
145
+ 1. **Start with DeBERTa-v3-base-MNLI** for better speed
146
+ 2. **Compare results** - try re-analyzing the same submission with different models
147
+ 3. **Consider your hardware** - larger models need more RAM
148
+ 4. **Fine-tune eventually** - zero-shot is great, but fine-tuning is better!
149
+
150
+ ## Example Workflow
151
+
152
+ ```
153
+ 1. Install app
154
+ ↓
155
+ 2. Select DeBERTa-v3-base-MNLI (for speed)
156
+ ↓
157
+ 3. Collect submissions
158
+ ↓
159
+ 4. Correct categories (builds training data)
160
+ ↓
161
+ 5. Fine-tune using DeBERTa-v3-small (best for small datasets)
162
+ ↓
163
+ 6. Deploy fine-tuned model (overrides zero-shot)
164
+ ↓
165
+ 7. Enjoy better accuracy! πŸŽ‰
166
+ ```
167
+
168
+ ## What's Next?
169
+
170
+ After selecting your zero-shot model:
171
+ - **Collect data**: Let users submit; each new submission is classified with the selected model
172
+ - **Review & correct**: Use the admin panel to fix any misclassifications
173
+ - **Build training set**: Corrections are automatically saved
174
+ - **Fine-tune**: Once you have 20+ examples, train a custom model
175
+ - **Deploy**: A model fine-tuned on your own data will typically outperform the zero-shot models on this task!
176
+
177
+ ---
178
+
179
+ **Ready to try it?** Go to Admin β†’ Training and select your model! πŸš€
180
+
181
+ For questions or issues:
182
+ 1. Check the model info displayed below the dropdown
183
+ 2. Review this guide
184
+ 3. Try switching back to BART if issues occur
185
+
analyze_submissions_for_sentences.py ADDED
@@ -0,0 +1,245 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Analyze existing submissions to determine if sentence-level categorization is worth implementing.
4
+
5
+ This script:
6
+ 1. Segments submissions into sentences
7
+ 2. Categorizes each sentence using current AI model
8
+ 3. Compares sentence-level vs submission-level categories
9
+ 4. Shows statistics to inform decision
10
+
11
+ Run: python analyze_submissions_for_sentences.py
12
+ """
13
+
14
+ import sys
15
+ import os
16
+ import re
17
+ from collections import Counter, defaultdict
18
+ from app import create_app, db
19
+ from app.models.models import Submission
20
+ from app.analyzer import get_analyzer
21
+ import nltk
22
+
23
+ # Try to download required NLTK data
24
+ try:
25
+ nltk.data.find('tokenizers/punkt')
26
+ except LookupError:
27
+ print("Downloading NLTK punkt tokenizer...")
28
+ nltk.download('punkt', quiet=True)
29
+
30
+ def segment_sentences(text):
31
+ """Simple sentence segmentation"""
32
+ try:
33
+ from nltk.tokenize import sent_tokenize
34
+ sentences = sent_tokenize(text)
35
+ except Exception:
36
+ # Fallback: regex-based
37
+ pattern = r'(?<=[.!?])\s+(?=[A-Z])|(?<=[.!?])$'
38
+ sentences = re.split(pattern, text)
39
+
40
+ # Clean and filter
41
+ sentences = [s.strip() for s in sentences if s.strip()]
42
+ # Filter very short "sentences"
43
+ sentences = [s for s in sentences if len(s.split()) >= 3]
44
+
45
+ return sentences
46
+
47
+ def analyze_submissions():
48
+ """Analyze submissions to see if sentence-level categorization is beneficial"""
49
+
50
+ app = create_app()
51
+
52
+ with app.app_context():
53
+ # Get all analyzed submissions
54
+ submissions = Submission.query.filter(Submission.category != None).all()
55
+
56
+ if not submissions:
57
+ print("❌ No analyzed submissions found. Please run AI analysis first.")
58
+ return
59
+
60
+ print(f"\n{'='*70}")
61
+ print(f"πŸ“Š SENTENCE-LEVEL CATEGORIZATION ANALYSIS")
62
+ print(f"{'='*70}\n")
63
+
64
+ print(f"Analyzing {len(submissions)} submissions...\n")
65
+
66
+ # Load analyzer
67
+ analyzer = get_analyzer()
68
+
69
+ # Statistics
70
+ total_submissions = len(submissions)
71
+ total_sentences = 0
72
+ multi_sentence_count = 0
73
+ multi_category_count = 0
74
+
75
+ sentence_counts = []
76
+ category_changes = []
77
+
78
+ submission_details = []
79
+
80
+ # Analyze each submission
81
+ for submission in submissions:
82
+ # Segment into sentences
83
+ sentences = segment_sentences(submission.message)
84
+ sentence_count = len(sentences)
85
+
86
+ total_sentences += sentence_count
87
+ sentence_counts.append(sentence_count)
88
+
89
+ if sentence_count > 1:
90
+ multi_sentence_count += 1
91
+
92
+ # Categorize each sentence
93
+ sentence_categories = []
94
+ for sentence in sentences:
95
+ try:
96
+ category = analyzer.analyze(sentence)
97
+ sentence_categories.append(category)
98
+ except Exception as e:
99
+ print(f"Error analyzing sentence: {e}")
100
+ sentence_categories.append(None)
101
+
102
+ # Check if categories differ
103
+ unique_categories = set([c for c in sentence_categories if c])
104
+
105
+ if len(unique_categories) > 1:
106
+ multi_category_count += 1
107
+ category_changes.append({
108
+ 'id': submission.id,
109
+ 'text': submission.message,
110
+ 'submission_category': submission.category,
111
+ 'sentence_categories': sentence_categories,
112
+ 'sentences': sentences,
113
+ 'contributor_type': submission.contributor_type
114
+ })
115
+
116
+ # Print Statistics
117
+ print(f"{'─'*70}")
118
+ print(f"πŸ“ˆ STATISTICS")
119
+ print(f"{'─'*70}\n")
120
+
121
+ print(f"Total Submissions: {total_submissions}")
122
+ print(f"Total Sentences: {total_sentences}")
123
+ print(f"Avg Sentences/Submission: {total_sentences/total_submissions:.1f}")
124
+ print(f"Multi-sentence (>1): {multi_sentence_count} ({multi_sentence_count/total_submissions*100:.1f}%)")
125
+ print(f"Multi-category: {multi_category_count} ({multi_category_count/total_submissions*100:.1f}%)")
126
+
127
+ # Sentence distribution
128
+ print(f"\nπŸ“Š Sentence Count Distribution:")
129
+ sentence_dist = Counter(sentence_counts)
130
+ for count in sorted(sentence_dist.keys()):
131
+ bar = 'β–ˆ' * int(sentence_dist[count] / total_submissions * 50)
132
+ print(f" {count} sentence(s): {sentence_dist[count]:3d} {bar}")
133
+
134
+ # Category changes
135
+ if category_changes:
136
+ print(f"\n{'─'*70}")
137
+ print(f"πŸ”„ SUBMISSIONS WITH MULTIPLE CATEGORIES ({len(category_changes)})")
138
+ print(f"{'─'*70}\n")
139
+
140
+ for idx, item in enumerate(category_changes[:10], 1): # Show first 10
141
+ print(f"\n{idx}. Submission #{item['id']} ({item['contributor_type']})")
142
+ print(f" Submission-level: {item['submission_category']}")
143
+ print(f" Text: \"{item['text'][:100]}{'...' if len(item['text']) > 100 else ''}\"")
144
+ print(f" Sentence breakdown:")
145
+
146
+ for i, (sentence, category) in enumerate(zip(item['sentences'], item['sentence_categories']), 1):
147
+ marker = "⚠️" if category != item['submission_category'] else "βœ“"
148
+ print(f" {marker} S{i} [{category:12s}] \"{sentence[:60]}{'...' if len(sentence) > 60 else ''}\"")
149
+
150
+ if len(category_changes) > 10:
151
+ print(f"\n ... and {len(category_changes) - 10} more")
152
+
153
+ # Category distribution comparison
154
+ print(f"\n{'─'*70}")
155
+ print(f"πŸ“Š CATEGORY DISTRIBUTION COMPARISON")
156
+ print(f"{'─'*70}\n")
157
+
158
+ # Submission-level counts
159
+ submission_cats = Counter([s.category for s in submissions if s.category])
160
+
161
+ # Sentence-level counts
162
+ sentence_cats = Counter()
163
+ for item in category_changes:
164
+ for cat in item['sentence_categories']:
165
+ if cat:
166
+ sentence_cats[cat] += 1
167
+
168
+ print(f"{'Category':<15} {'Submission-Level':<20} {'Sentence-Level (multi-cat only)':<30}")
169
+ print(f"{'-'*15} {'-'*20} {'-'*30}")
170
+
171
+ categories = ['Vision', 'Problem', 'Objectives', 'Directives', 'Values', 'Actions']
172
+ for cat in categories:
173
+ sub_count = submission_cats.get(cat, 0)
174
+ sen_count = sentence_cats.get(cat, 0)
175
+ sub_bar = 'β–ˆ' * int(sub_count / total_submissions * 20)
176
+ sen_bar = 'β–ˆ' * int(sen_count / multi_category_count * 20) if multi_category_count > 0 else ''
177
+ print(f"{cat:<15} {sub_count:3d} {sub_bar:<15} {sen_count:3d} {sen_bar:<15}")
178
+
179
+ # Recommendation
180
+ print(f"\n{'='*70}")
181
+ print(f"πŸ’‘ RECOMMENDATION")
182
+ print(f"{'='*70}\n")
183
+
184
+ multi_cat_percentage = (multi_category_count / total_submissions * 100) if total_submissions > 0 else 0
185
+
186
+ if multi_cat_percentage > 40:
187
+ print(f"βœ… STRONGLY RECOMMEND sentence-level categorization")
188
+ print(f" {multi_cat_percentage:.1f}% of submissions contain multiple categories.")
189
+ print(f" Current system is losing significant semantic detail.")
190
+ print(f"\n πŸ“ˆ Expected benefits:")
191
+ print(f" β€’ {multi_category_count} submissions will have richer categorization")
192
+ print(f" β€’ Training data will be ~{total_sentences - total_submissions} examples richer")
193
+ print(f" β€’ Analytics will be more accurate")
194
+ elif multi_cat_percentage > 20:
195
+ print(f"⚠️ RECOMMEND sentence-level categorization (or proof of concept)")
196
+ print(f" {multi_cat_percentage:.1f}% of submissions contain multiple categories.")
197
+ print(f" Moderate benefit expected.")
198
+ print(f"\n πŸ’‘ Suggestion: Start with proof of concept (display only)")
199
+ print(f" Then decide if full implementation is worth it.")
200
+ else:
201
+ print(f"ℹ️ OPTIONAL - Multi-label might be sufficient")
202
+ print(f" Only {multi_cat_percentage:.1f}% of submissions contain multiple categories.")
203
+ print(f" Sentence-level might be overkill.")
204
+ print(f"\n πŸ’‘ Consider:")
205
+ print(f" β€’ Multi-label classification (simpler)")
206
+ print(f" β€’ Or keep current system if working well")
207
+
208
+ # Implementation effort
209
+ print(f"\nπŸ“‹ Implementation Effort:")
210
+ print(f" β€’ Full sentence-level: 13-20 hours")
211
+ print(f" β€’ Proof of concept: 4-6 hours")
212
+ print(f" β€’ Multi-label: 4-6 hours")
213
+
214
+ print(f"\n{'='*70}\n")
215
+
216
+ # Export detailed results
217
+ export_path = "sentence_analysis_results.txt"
218
+ with open(export_path, 'w') as f:
219
+ f.write("DETAILED SENTENCE-LEVEL ANALYSIS RESULTS\n")
220
+ f.write("="*70 + "\n\n")
221
+ f.write(f"Total Submissions: {total_submissions}\n")
222
+ f.write(f"Multi-category Submissions: {multi_category_count} ({multi_cat_percentage:.1f}%)\n\n")
223
+
224
+ f.write("\nDETAILED BREAKDOWN:\n\n")
225
+ for idx, item in enumerate(category_changes, 1):
226
+ f.write(f"\n{idx}. Submission #{item['id']}\n")
227
+ f.write(f" Contributor: {item['contributor_type']}\n")
228
+ f.write(f" Submission Category: {item['submission_category']}\n")
229
+ f.write(f" Full Text: {item['text']}\n")
230
+ f.write(f" Sentences:\n")
231
+ for i, (sentence, category) in enumerate(zip(item['sentences'], item['sentence_categories']), 1):
232
+ f.write(f" {i}. [{category}] {sentence}\n")
233
+ f.write("\n")
234
+
235
+ print(f"πŸ“„ Detailed results exported to: {export_path}")
236
+
237
+ if __name__ == '__main__':
238
+ try:
239
+ analyze_submissions()
240
+ except Exception as e:
241
+ print(f"\n❌ Error: {e}")
242
+ import traceback
243
+ traceback.print_exc()
244
+ sys.exit(1)
245
+
app/analyzer.py CHANGED
@@ -168,6 +168,9 @@ class SubmissionAnalyzer:
168
  confidence = predictions[0][predicted_class].item()
169
 
170
  category = self.id2label[predicted_class]
 
 
 
171
 
172
  logger.info(f"Fine-tuned model classified as: {category} (confidence: {confidence:.2f})")
173
 
@@ -191,6 +194,9 @@ class SubmissionAnalyzer:
191
  # Extract the category name from the label
192
  top_label = result['labels'][0]
193
  category = top_label.split(':')[0]
 
 
 
194
 
195
  logger.info(f"Zero-shot model classified as: {category} (confidence: {result['scores'][0]:.2f})")
196
 
@@ -207,6 +213,48 @@ class SubmissionAnalyzer:
207
  list: List of predicted categories
208
  """
209
  return [self.analyze(msg) for msg in messages]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
210
 
211
  def get_model_info(self):
212
  """
 
168
  confidence = predictions[0][predicted_class].item()
169
 
170
  category = self.id2label[predicted_class]
171
+
172
+ # Store confidence for later retrieval
173
+ self._last_confidence = confidence
174
 
175
  logger.info(f"Fine-tuned model classified as: {category} (confidence: {confidence:.2f})")
176
 
 
194
  # Extract the category name from the label
195
  top_label = result['labels'][0]
196
  category = top_label.split(':')[0]
197
+
198
+ # Store confidence for later retrieval
199
+ self._last_confidence = result['scores'][0]
200
 
201
  logger.info(f"Zero-shot model classified as: {category} (confidence: {result['scores'][0]:.2f})")
202
 
 
213
  list: List of predicted categories
214
  """
215
  return [self.analyze(msg) for msg in messages]
216
+
217
+ def analyze_with_sentences(self, submission_text: str):
218
+ """
219
+ Analyze submission at sentence level.
220
+
221
+ Args:
222
+ submission_text: Full submission text
223
+
224
+ Returns:
225
+ List[Dict]: List of {text: str, category: str, confidence: float}
226
+ """
227
+ from app.utils.text_processor import TextProcessor
228
+
229
+ # Segment into sentences
230
+ sentences = TextProcessor.segment_and_clean(submission_text)
231
+
232
+ # Classify each sentence
233
+ results = []
234
+ for sentence in sentences:
235
+ try:
236
+ category = self.analyze(sentence)
237
+
238
+ # Get confidence if available
239
+ confidence = self._get_last_confidence() if hasattr(self, '_last_confidence') else None
240
+
241
+ results.append({
242
+ 'text': sentence,
243
+ 'category': category,
244
+ 'confidence': confidence
245
+ })
246
+
247
+ logger.info(f"Sentence classified: '{sentence[:50]}...' -> {category}")
248
+ except Exception as e:
249
+ logger.error(f"Error analyzing sentence '{sentence[:50]}...': {e}")
250
+ # Skip problematic sentences
251
+ continue
252
+
253
+ return results
254
+
255
+ def _get_last_confidence(self):
256
+ """Get last prediction confidence (if available)"""
257
+ return getattr(self, '_last_confidence', None)
258
 
259
  def get_model_info(self):
260
  """
app/models/models.py CHANGED
@@ -29,11 +29,38 @@ class Submission(db.Model):
29
  latitude = db.Column(db.Float, nullable=True)
30
  longitude = db.Column(db.Float, nullable=True)
31
  timestamp = db.Column(db.DateTime, default=datetime.utcnow)
32
- category = db.Column(db.String(50), nullable=True) # Vision, Problem, Objectives, Directives, Values, Actions
33
  flagged_as_offensive = db.Column(db.Boolean, default=False)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
34
 
35
  def to_dict(self):
36
- return {
 
37
  'id': self.id,
38
  'message': self.message,
39
  'contributorType': self.contributor_type,
@@ -42,10 +69,51 @@ class Submission(db.Model):
42
  'lng': self.longitude
43
  } if self.latitude and self.longitude else None,
44
  'timestamp': self.timestamp.isoformat() if self.timestamp else None,
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
45
  'category': self.category,
46
- 'flaggedAsOffensive': self.flagged_as_offensive
 
47
  }
48
 
 
49
  class Settings(db.Model):
50
  __tablename__ = 'settings'
51
 
@@ -74,8 +142,9 @@ class TrainingExample(db.Model):
74
  __tablename__ = 'training_examples'
75
 
76
  id = db.Column(db.Integer, primary_key=True)
77
- submission_id = db.Column(db.Integer, db.ForeignKey('submissions.id'), nullable=False)
78
- message = db.Column(db.Text, nullable=False) # Snapshot of submission text
 
79
  original_category = db.Column(db.String(50), nullable=True) # AI's prediction
80
  corrected_category = db.Column(db.String(50), nullable=False) # Admin's correction
81
  contributor_type = db.Column(db.String(20), nullable=False)
@@ -86,6 +155,7 @@ class TrainingExample(db.Model):
86
 
87
  # Relationships
88
  submission = db.relationship('Submission', backref='training_examples')
 
89
  training_run = db.relationship('FineTuningRun', backref='training_examples')
90
 
91
  def to_dict(self):
 
29
  latitude = db.Column(db.Float, nullable=True)
30
  longitude = db.Column(db.Float, nullable=True)
31
  timestamp = db.Column(db.DateTime, default=datetime.utcnow)
32
+ category = db.Column(db.String(50), nullable=True) # Vision, Problem, Objectives, Directives, Values, Actions (backward compat)
33
  flagged_as_offensive = db.Column(db.Boolean, default=False)
34
+ sentence_analysis_done = db.Column(db.Boolean, default=False) # NEW: Track if sentence-level analysis is complete
35
+
36
+ def get_primary_category(self):
37
+ """Get most frequent category from sentences (or fallback to old category)"""
38
+ if not self.sentences or len(self.sentences) == 0:
39
+ return self.category # Fallback to old system
40
+
41
+ from collections import Counter
42
+ categories = [s.category for s in self.sentences if s.category]
43
+ if not categories:
44
+ return None
45
+ return Counter(categories).most_common(1)[0][0]
46
+
47
+ def get_category_distribution(self):
48
+ """Get percentage of each category in this submission"""
49
+ if not self.sentences or len(self.sentences) == 0:
50
+ return {self.category: 100.0} if self.category else {}
51
+
52
+ from collections import Counter
53
+ categories = [s.category for s in self.sentences if s.category]
54
+ total = len(categories)
55
+ if total == 0:
56
+ return {}
57
+
58
+ counts = Counter(categories)
59
+ return {cat: round((count/total)*100, 1) for cat, count in counts.items()}
60
 
61
  def to_dict(self):
62
+ """Convert to dictionary with sentence-level support"""
63
+ base_dict = {
64
  'id': self.id,
65
  'message': self.message,
66
  'contributorType': self.contributor_type,
 
69
  'lng': self.longitude
70
  } if self.latitude and self.longitude else None,
71
  'timestamp': self.timestamp.isoformat() if self.timestamp else None,
72
+ 'category': self.get_primary_category() if self.sentence_analysis_done else self.category,
73
+ 'flaggedAsOffensive': self.flagged_as_offensive,
74
+ 'sentenceAnalysisDone': self.sentence_analysis_done
75
+ }
76
+
77
+ # Add sentence-level data if available
78
+ if self.sentence_analysis_done and self.sentences:
79
+ base_dict['sentences'] = [s.to_dict() for s in self.sentences]
80
+ base_dict['categoryDistribution'] = self.get_category_distribution()
81
+
82
+ return base_dict
83
+
84
+
85
+ class SubmissionSentence(db.Model):
86
+ """Stores individual sentences from submissions with their categories"""
87
+ __tablename__ = 'submission_sentences'
88
+
89
+ id = db.Column(db.Integer, primary_key=True)
90
+ submission_id = db.Column(db.Integer, db.ForeignKey('submissions.id'), nullable=False)
91
+ sentence_index = db.Column(db.Integer, nullable=False) # 0, 1, 2...
92
+ text = db.Column(db.Text, nullable=False)
93
+ category = db.Column(db.String(50), nullable=True)
94
+ confidence = db.Column(db.Float, nullable=True)
95
+ created_at = db.Column(db.DateTime, default=datetime.utcnow)
96
+
97
+ # Relationships
98
+ submission = db.relationship('Submission', backref='sentences')
99
+
100
+ # Composite unique constraint
101
+ __table_args__ = (
102
+ db.UniqueConstraint('submission_id', 'sentence_index', name='uq_submission_sentence'),
103
+ )
104
+
105
+ def to_dict(self):
106
+ return {
107
+ 'id': self.id,
108
+ 'submission_id': self.submission_id,
109
+ 'sentence_index': self.sentence_index,
110
+ 'text': self.text,
111
  'category': self.category,
112
+ 'confidence': self.confidence,
113
+ 'created_at': self.created_at.isoformat() if self.created_at else None
114
  }
115
 
116
+
117
  class Settings(db.Model):
118
  __tablename__ = 'settings'
119
 
 
142
  __tablename__ = 'training_examples'
143
 
144
  id = db.Column(db.Integer, primary_key=True)
145
+ submission_id = db.Column(db.Integer, db.ForeignKey('submissions.id'), nullable=True) # Made nullable for sentence-level
146
+ sentence_id = db.Column(db.Integer, db.ForeignKey('submission_sentences.id'), nullable=True) # NEW: Link to sentence
147
+ message = db.Column(db.Text, nullable=False) # Snapshot of submission/sentence text
148
  original_category = db.Column(db.String(50), nullable=True) # AI's prediction
149
  corrected_category = db.Column(db.String(50), nullable=False) # Admin's correction
150
  contributor_type = db.Column(db.String(20), nullable=False)
 
155
 
156
  # Relationships
157
  submission = db.relationship('Submission', backref='training_examples')
158
+ sentence = db.relationship('SubmissionSentence', backref='training_examples')
159
  training_run = db.relationship('FineTuningRun', backref='training_examples')
160
 
161
  def to_dict(self):
app/utils/__init__.py ADDED
@@ -0,0 +1,2 @@
1
+ # Utils package
2
+
app/utils/text_processor.py ADDED
@@ -0,0 +1,170 @@
1
+ """
2
+ Text processing utilities for sentence-level categorization.
3
+ Handles sentence segmentation and text cleaning.
4
+ """
5
+
6
+ import re
7
+ from typing import List
8
+ import logging
9
+
10
+ logger = logging.getLogger(__name__)
11
+
12
+ class TextProcessor:
13
+ """Handle sentence segmentation and text processing"""
14
+
15
+ @staticmethod
16
+ def segment_into_sentences(text: str) -> List[str]:
17
+ """
18
+ Break text into sentences using multiple strategies.
19
+
20
+ Strategies:
21
+ 1. NLTK punkt tokenizer (primary)
22
+ 2. Regex-based fallback
23
+ 3. Min/max length constraints
24
+
25
+ Args:
26
+ text: Input text to segment
27
+
28
+ Returns:
29
+ List of sentences
30
+ """
31
+ # Clean text
32
+ text = text.strip()
33
+
34
+ if not text:
35
+ return []
36
+
37
+ # Try NLTK first (better accuracy)
38
+ try:
39
+ import nltk
40
+ # Try to use punkt tokenizer
41
+ try:
42
+ from nltk.tokenize import sent_tokenize
43
+ sentences = sent_tokenize(text)
44
+ except LookupError:
45
+ # Download punkt if not available
46
+ logger.info("Downloading NLTK punkt tokenizer...")
47
+ nltk.download('punkt', quiet=True)
48
+ from nltk.tokenize import sent_tokenize
49
+ sentences = sent_tokenize(text)
50
+ except Exception as e:
51
+ # Fallback: regex-based segmentation
52
+ logger.warning(f"NLTK tokenization failed ({e}), using regex fallback")
53
+ sentences = TextProcessor._regex_segmentation(text)
54
+
55
+ # Clean and filter
56
+ sentences = [s.strip() for s in sentences if s.strip()]
57
+
58
+ # Filter out very short "sentences" (likely not meaningful)
59
+ # Require at least 3 words
60
+ sentences = [s for s in sentences if len(s.split()) >= 3]
61
+
62
+ return sentences
63
+
64
+ @staticmethod
65
+ def _regex_segmentation(text: str) -> List[str]:
66
+ """
67
+ Fallback sentence segmentation using regex.
68
+
69
+ This is less accurate than NLTK but works without dependencies.
70
+ """
71
+ # Split on period, exclamation, question mark (followed by space or end)
72
+ # Look for: ., !, or ? followed by space + capital letter, or end of string
73
+ pattern = r'(?<=[.!?])\s+(?=[A-Z])|(?<=[.!?])$'
74
+ sentences = re.split(pattern, text)
75
+
76
+ return [s.strip() for s in sentences if s.strip()]
77
+
78
+ @staticmethod
79
+ def is_valid_sentence(sentence: str) -> bool:
80
+ """
81
+ Check if sentence is valid for categorization.
82
+
83
+ Args:
84
+ sentence: Input sentence
85
+
86
+ Returns:
87
+ True if valid, False otherwise
88
+ """
89
+ # Must have at least 3 words
90
+ if len(sentence.split()) < 3:
91
+ return False
92
+
93
+ # Must have some alphabetic characters
94
+ if not any(c.isalpha() for c in sentence):
95
+ return False
96
+
97
+ # Not just a list item or fragment
98
+ stripped = sentence.strip()
99
+ if stripped.startswith('-') or stripped.startswith('β€’') or stripped.startswith('*'):
100
+ # Allow if it has substantial text after the bullet
101
+ if len(stripped[1:].strip().split()) < 3:
102
+ return False
103
+
104
+ return True
105
+
106
+ @staticmethod
107
+ def clean_sentence(sentence: str) -> str:
108
+ """
109
+ Clean a sentence for processing.
110
+
111
+ Args:
112
+ sentence: Input sentence
113
+
114
+ Returns:
115
+ Cleaned sentence
116
+ """
117
+ # Remove leading bullet points or numbers
118
+ sentence = re.sub(r'^[\s\-β€’*\d.]+\s*', '', sentence)
119
+
120
+ # Normalize whitespace
121
+ sentence = ' '.join(sentence.split())
122
+
123
+ # Ensure it ends with punctuation
124
+ if sentence and sentence[-1] not in '.!?':
125
+ sentence += '.'
126
+
127
+ return sentence.strip()
128
+
129
+ @staticmethod
130
+ def segment_and_clean(text: str) -> List[str]:
131
+ """
132
+ Segment text into sentences and clean them.
133
+
134
+ This is the main entry point for text processing.
135
+
136
+ Args:
137
+ text: Input text
138
+
139
+ Returns:
140
+ List of cleaned, valid sentences
141
+ """
142
+ # Segment
143
+ sentences = TextProcessor.segment_into_sentences(text)
144
+
145
+ # Clean and filter
146
+ result = []
147
+ for sentence in sentences:
148
+ cleaned = TextProcessor.clean_sentence(sentence)
149
+ if TextProcessor.is_valid_sentence(cleaned):
150
+ result.append(cleaned)
151
+
152
+ return result
153
+
154
+ @staticmethod
155
+ def get_sentence_count_estimate(text: str) -> int:
156
+ """
157
+ Quick estimate of sentence count without full processing.
158
+
159
+ Args:
160
+ text: Input text
161
+
162
+ Returns:
163
+ Estimated sentence count
164
+ """
165
+ # Count sentence-ending punctuation
166
+ count = text.count('.') + text.count('!') + text.count('?')
167
+
168
+ # At least 1 if text exists
169
+ return max(1, count)
170
+
mock_data_60.json ADDED
@@ -0,0 +1,726 @@
1
+ {
2
+ "submissions": [
3
+ {
4
+ "id": 1,
5
+ "message": "We dream of a future with everyone has affordable housing within 20 minutes of work",
6
+ "contributor_type": "government",
7
+ "location": {
8
+ "lat": -15.7795,
9
+ "lng": -47.979
10
+ },
11
+ "timestamp": "2025-01-15T14:30:00",
12
+ "category": "Vision",
13
+ "flagged_as_offensive": false
14
+ },
15
+ {
16
+ "id": 2,
17
+ "message": "Our vision is to create air quality meets the highest international standards",
18
+ "contributor_type": "other",
19
+ "location": {
20
+ "lat": -15.7251,
21
+ "lng": -47.9745
22
+ },
23
+ "timestamp": "2025-01-15T15:00:00",
24
+ "category": "Vision",
25
+ "flagged_as_offensive": false
26
+ },
27
+ {
28
+ "id": 3,
29
+ "message": "The ideal scenario would be air quality meets the highest international standards",
30
+ "contributor_type": "government",
31
+ "location": {
32
+ "lat": -15.7235,
33
+ "lng": -47.9387
34
+ },
35
+ "timestamp": "2025-01-15T15:30:00",
36
+ "category": "Vision",
37
+ "flagged_as_offensive": false
38
+ },
39
+ {
40
+ "id": 4,
41
+ "message": "We dream of a future with zero waste is achieved through comprehensive recycling",
42
+ "contributor_type": "industry",
43
+ "location": {
44
+ "lat": -15.778,
45
+ "lng": -47.8505
46
+ },
47
+ "timestamp": "2025-01-15T16:00:00",
48
+ "category": "Vision",
49
+ "flagged_as_offensive": false
50
+ },
51
+ {
52
+ "id": 5,
53
+ "message": "The ideal scenario would be parks and nature are accessible to all residents",
54
+ "contributor_type": "government",
55
+ "location": {
56
+ "lat": -15.7061,
57
+ "lng": -47.8908
58
+ },
59
+ "timestamp": "2025-01-15T16:30:00",
60
+ "category": "Vision",
61
+ "flagged_as_offensive": false
62
+ },
63
+ {
64
+ "id": 6,
65
+ "message": "We dream of a future with renewable energy powers 100% of our infrastructure",
66
+ "contributor_type": "other",
67
+ "location": {
68
+ "lat": -15.7388,
69
+ "lng": -47.9121
70
+ },
71
+ "timestamp": "2025-01-15T17:00:00",
72
+ "category": "Vision",
73
+ "flagged_as_offensive": false
74
+ },
75
+ {
76
+ "id": 7,
77
+ "message": "We envision a city where equity and inclusion are foundational to all decisions",
78
+ "contributor_type": "industry",
79
+ "location": {
80
+ "lat": -15.8396,
81
+ "lng": -47.8803
82
+ },
83
+ "timestamp": "2025-01-15T17:30:00",
84
+ "category": "Vision",
85
+ "flagged_as_offensive": false
86
+ },
87
+ {
88
+ "id": 8,
89
+ "message": "The ideal scenario would be all citizens have access to clean energy and green spaces",
90
+ "contributor_type": "community",
91
+ "location": {
92
+ "lat": -15.8681,
93
+ "lng": -47.9813
94
+ },
95
+ "timestamp": "2025-01-15T18:00:00",
96
+ "category": "Vision",
97
+ "flagged_as_offensive": false
98
+ },
99
+ {
100
+ "id": 9,
101
+ "message": "Imagine a community that children can safely walk or bike to school",
102
+ "contributor_type": "community",
103
+ "location": {
104
+ "lat": -15.8515,
105
+ "lng": -47.8442
106
+ },
107
+ "timestamp": "2025-01-15T18:30:00",
108
+ "category": "Vision",
109
+ "flagged_as_offensive": false
110
+ },
111
+ {
112
+ "id": 10,
113
+ "message": "We want to see a city that zero waste is achieved through comprehensive recycling",
114
+ "contributor_type": "academic",
115
+ "location": {
116
+ "lat": -15.7153,
117
+ "lng": -47.9456
118
+ },
119
+ "timestamp": "2025-01-15T19:00:00",
120
+ "category": "Vision",
121
+ "flagged_as_offensive": false
122
+ },
123
+ {
124
+ "id": 11,
125
+ "message": "We are facing challenges with insufficient green spaces in densely populated zones",
126
+ "contributor_type": "government",
127
+ "location": {
128
+ "lat": -15.7989,
129
+ "lng": -47.979
130
+ },
131
+ "timestamp": "2025-01-15T19:30:00",
132
+ "category": "Problem",
133
+ "flagged_as_offensive": false
134
+ },
135
+ {
136
+ "id": 12,
137
+ "message": "One major concern is inadequate waste management systems",
138
+ "contributor_type": "industry",
139
+ "location": {
140
+ "lat": -15.7862,
141
+ "lng": -47.9812
142
+ },
143
+ "timestamp": "2025-01-15T20:00:00",
144
+ "category": "Problem",
145
+ "flagged_as_offensive": false
146
+ },
147
+ {
148
+ "id": 13,
149
+ "message": "There is inadequate digital divide affecting low-income communities",
150
+ "contributor_type": "academic",
151
+ "location": {
152
+ "lat": -15.8672,
153
+ "lng": -47.8886
154
+ },
155
+ "timestamp": "2025-01-15T20:30:00",
156
+ "category": "Problem",
157
+ "flagged_as_offensive": false
158
+ },
159
+ {
160
+ "id": 14,
161
+ "message": "A critical problem is aging water infrastructure causing frequent issues",
162
+ "contributor_type": "ngo",
163
+ "location": {
164
+ "lat": -15.7679,
165
+ "lng": -47.862
166
+ },
167
+ "timestamp": "2025-01-15T21:00:00",
168
+ "category": "Problem",
169
+ "flagged_as_offensive": false
170
+ },
171
+ {
172
+ "id": 15,
173
+ "message": "The current situation with lack of affordable housing for middle-income families is problematic",
174
+ "contributor_type": "ngo",
175
+ "location": {
176
+ "lat": -15.6868,
177
+ "lng": -47.8453
178
+ },
179
+ "timestamp": "2025-01-15T21:30:00",
180
+ "category": "Problem",
181
+ "flagged_as_offensive": false
182
+ },
183
+ {
184
+ "id": 16,
185
+ "message": "The main issue is aging water infrastructure causing frequent issues",
186
+ "contributor_type": "community",
187
+ "location": {
188
+ "lat": -15.7037,
189
+ "lng": -47.8742
190
+ },
191
+ "timestamp": "2025-01-15T22:00:00",
192
+ "category": "Problem",
193
+ "flagged_as_offensive": false
194
+ },
195
+ {
196
+ "id": 17,
197
+ "message": "We are facing challenges with lack of affordable housing for middle-income families",
198
+ "contributor_type": "government",
199
+ "location": {
200
+ "lat": -15.7255,
201
+ "lng": -47.9207
202
+ },
203
+ "timestamp": "2025-01-15T22:30:00",
204
+ "category": "Problem",
205
+ "flagged_as_offensive": false
206
+ },
207
+ {
208
+ "id": 18,
209
+ "message": "We lack sufficient inadequate waste management systems",
210
+ "contributor_type": "community",
211
+ "location": {
212
+ "lat": -15.7296,
213
+ "lng": -47.9722
214
+ },
215
+ "timestamp": "2025-01-15T23:00:00",
216
+ "category": "Problem",
217
+ "flagged_as_offensive": false
218
+ },
219
+ {
220
+ "id": 19,
221
+ "message": "One major concern is inadequate waste management systems",
222
+ "contributor_type": "industry",
223
+ "location": {
224
+ "lat": -15.7532,
225
+ "lng": -47.9011
226
+ },
227
+ "timestamp": "2025-01-15T23:30:00",
228
+ "category": "Problem",
229
+ "flagged_as_offensive": false
230
+ },
231
+ {
232
+ "id": 20,
233
+ "message": "The main issue is food deserts in several neighborhoods",
234
+ "contributor_type": "industry",
235
+ "location": {
236
+ "lat": -15.7114,
237
+ "lng": -47.8629
238
+ },
239
+ "timestamp": "2025-01-16T00:00:00",
240
+ "category": "Problem",
241
+ "flagged_as_offensive": false
242
+ },
243
+ {
244
+ "id": 21,
245
+ "message": "We should strive to ensure 90% of residents live within 10 minutes of transit",
246
+ "contributor_type": "other",
247
+ "location": {
248
+ "lat": -15.8209,
249
+ "lng": -47.9591
250
+ },
251
+ "timestamp": "2025-01-16T00:30:00",
252
+ "category": "Objectives",
253
+ "flagged_as_offensive": false
254
+ },
255
+ {
256
+ "id": 22,
257
+ "message": "Our target is to increase bike lane network by 200 kilometers",
258
+ "contributor_type": "other",
259
+ "location": {
260
+ "lat": -15.8401,
261
+ "lng": -47.9368
262
+ },
263
+ "timestamp": "2025-01-16T01:00:00",
264
+ "category": "Objectives",
265
+ "flagged_as_offensive": false
266
+ },
267
+ {
268
+ "id": 23,
269
+ "message": "The objective should be to increase bike lane network by 200 kilometers",
270
+ "contributor_type": "academic",
271
+ "location": {
272
+ "lat": -15.7152,
273
+ "lng": -47.9343
274
+ },
275
+ "timestamp": "2025-01-16T01:30:00",
276
+ "category": "Objectives",
277
+ "flagged_as_offensive": false
278
+ },
279
+ {
280
+ "id": 24,
281
+ "message": "We must work towards reduce carbon emissions by 50% in the next 5 years",
282
+ "contributor_type": "other",
283
+ "location": {
284
+ "lat": -15.8555,
285
+ "lng": -47.9754
286
+ },
287
+ "timestamp": "2025-01-16T02:00:00",
288
+ "category": "Objectives",
289
+ "flagged_as_offensive": false
290
+ },
291
+ {
292
+ "id": 25,
293
+ "message": "We must work towards increase bike lane network by 200 kilometers",
294
+ "contributor_type": "ngo",
295
+ "location": {
296
+ "lat": -15.7199,
297
+ "lng": -47.9691
298
+ },
299
+ "timestamp": "2025-01-16T02:30:00",
300
+ "category": "Objectives",
301
+ "flagged_as_offensive": false
302
+ },
303
+ {
304
+ "id": 26,
305
+ "message": "The objective should be to create 500 acres of new parks and green spaces",
306
+ "contributor_type": "academic",
307
+ "location": {
308
+ "lat": -15.7006,
309
+ "lng": -47.9967
310
+ },
311
+ "timestamp": "2025-01-16T03:00:00",
312
+ "category": "Objectives",
313
+ "flagged_as_offensive": false
314
+ },
315
+ {
316
+ "id": 27,
317
+ "message": "The primary objective is retrofit all public buildings for energy efficiency",
318
+ "contributor_type": "industry",
319
+ "location": {
320
+ "lat": -15.8463,
321
+ "lng": -48.0058
322
+ },
323
+ "timestamp": "2025-01-16T03:30:00",
324
+ "category": "Objectives",
325
+ "flagged_as_offensive": false
326
+ },
327
+ {
328
+ "id": 28,
329
+ "message": "We should strive to increase bike lane network by 200 kilometers",
330
+ "contributor_type": "industry",
331
+ "location": {
332
+ "lat": -15.6882,
333
+ "lng": -47.9008
334
+ },
335
+ "timestamp": "2025-01-16T04:00:00",
336
+ "category": "Objectives",
337
+ "flagged_as_offensive": false
338
+ },
339
+ {
340
+ "id": 29,
341
+ "message": "We aim to achieve provide high-speed internet to 100% of households",
342
+ "contributor_type": "industry",
343
+ "location": {
344
+ "lat": -15.7342,
345
+ "lng": -47.9172
346
+ },
347
+ "timestamp": "2025-01-16T04:30:00",
348
+ "category": "Objectives",
349
+ "flagged_as_offensive": false
350
+ },
351
+ {
352
+ "id": 30,
353
+ "message": "We aim to achieve improve water quality to exceed national standards",
354
+ "contributor_type": "community",
355
+ "location": {
356
+ "lat": -15.7662,
357
+ "lng": -47.9675
358
+ },
359
+ "timestamp": "2025-01-16T05:00:00",
360
+ "category": "Objectives",
361
+ "flagged_as_offensive": false
362
+ },
363
+ {
364
+ "id": 31,
365
+ "message": "We must implement restrictions on single-use plastics in retail",
366
+ "contributor_type": "community",
367
+ "location": {
368
+ "lat": -15.879,
369
+ "lng": -47.9683
370
+ },
371
+ "timestamp": "2025-01-16T05:30:00",
372
+ "category": "Directives",
373
+ "flagged_as_offensive": false
374
+ },
375
+ {
376
+ "id": 32,
377
+ "message": "We should establish rules for noise regulations in residential areas",
378
+ "contributor_type": "academic",
379
+ "location": {
380
+ "lat": -15.7637,
381
+ "lng": -47.9788
382
+ },
383
+ "timestamp": "2025-01-16T06:00:00",
384
+ "category": "Directives",
385
+ "flagged_as_offensive": false
386
+ },
387
+ {
388
+ "id": 33,
389
+ "message": "We should establish rules for energy efficiency standards for all renovations",
390
+ "contributor_type": "other",
391
+ "location": {
392
+ "lat": -15.713,
393
+ "lng": -47.9773
394
+ },
395
+ "timestamp": "2025-01-16T06:30:00",
396
+ "category": "Directives",
397
+ "flagged_as_offensive": false
398
+ },
399
+ {
400
+ "id": 34,
401
+ "message": "The city should enforce building codes that require accessibility standards",
402
+ "contributor_type": "other",
403
+ "location": {
404
+ "lat": -15.6881,
405
+ "lng": -48.0225
406
+ },
407
+ "timestamp": "2025-01-16T07:00:00",
408
+ "category": "Directives",
409
+ "flagged_as_offensive": false
410
+ },
411
+ {
412
+ "id": 35,
413
+ "message": "We need to mandate energy efficiency standards for all renovations",
414
+ "contributor_type": "academic",
415
+ "location": {
416
+ "lat": -15.8179,
417
+ "lng": -47.9225
418
+ },
419
+ "timestamp": "2025-01-16T07:30:00",
420
+ "category": "Directives",
421
+ "flagged_as_offensive": false
422
+ },
423
+ {
424
+ "id": 36,
425
+ "message": "Authorities need to enforce building codes that require accessibility standards",
426
+ "contributor_type": "government",
427
+ "location": {
428
+ "lat": -15.8307,
429
+ "lng": -47.898
430
+ },
431
+ "timestamp": "2025-01-16T08:00:00",
432
+ "category": "Directives",
433
+ "flagged_as_offensive": false
434
+ },
435
+ {
436
+ "id": 37,
437
+ "message": "Authorities need to enforce protected bike lanes on all major corridors",
438
+ "contributor_type": "government",
439
+ "location": {
440
+ "lat": -15.7259,
441
+ "lng": -47.9658
442
+ },
443
+ "timestamp": "2025-01-16T08:30:00",
444
+ "category": "Directives",
445
+ "flagged_as_offensive": false
446
+ },
447
+ {
448
+ "id": 38,
449
+ "message": "Policies must ensure tree preservation ordinances in development zones",
450
+ "contributor_type": "industry",
451
+ "location": {
452
+ "lat": -15.8086,
453
+ "lng": -47.9173
454
+ },
455
+ "timestamp": "2025-01-16T09:00:00",
456
+ "category": "Directives",
457
+ "flagged_as_offensive": false
458
+ },
459
+ {
460
+ "id": 39,
461
+ "message": "We should establish rules for building codes that require accessibility standards",
462
+ "contributor_type": "community",
463
+ "location": {
464
+ "lat": -15.8257,
465
+ "lng": -48.0039
466
+ },
467
+ "timestamp": "2025-01-16T09:30:00",
468
+ "category": "Directives",
469
+ "flagged_as_offensive": false
470
+ },
471
+ {
472
+ "id": 40,
473
+ "message": "Authorities need to enforce restrictions on single-use plastics in retail",
474
+ "contributor_type": "government",
475
+ "location": {
476
+ "lat": -15.6997,
477
+ "lng": -47.8941
478
+ },
479
+ "timestamp": "2025-01-16T10:00:00",
480
+ "category": "Directives",
481
+ "flagged_as_offensive": false
482
+ },
483
+ {
484
+ "id": 41,
485
+ "message": "Our foundation is built on transparency and democratic decision-making",
486
+ "contributor_type": "industry",
487
+ "location": {
488
+ "lat": -15.7953,
489
+ "lng": -47.8969
490
+ },
491
+ "timestamp": "2025-01-16T10:30:00",
492
+ "category": "Values",
493
+ "flagged_as_offensive": false
494
+ },
495
+ {
496
+ "id": 42,
497
+ "message": "We hold social equity and inclusive participation as a core value",
498
+ "contributor_type": "academic",
499
+ "location": {
500
+ "lat": -15.8073,
501
+ "lng": -47.993
502
+ },
503
+ "timestamp": "2025-01-16T11:00:00",
504
+ "category": "Values",
505
+ "flagged_as_offensive": false
506
+ },
507
+ {
508
+ "id": 43,
509
+ "message": "We are committed to innovation balanced with preservation",
510
+ "contributor_type": "ngo",
511
+ "location": {
512
+ "lat": -15.7714,
513
+ "lng": -47.9996
514
+ },
515
+ "timestamp": "2025-01-16T11:30:00",
516
+ "category": "Values",
517
+ "flagged_as_offensive": false
518
+ },
519
+ {
520
+ "id": 44,
521
+ "message": "We are committed to community resilience and mutual support",
522
+ "contributor_type": "ngo",
523
+ "location": {
524
+ "lat": -15.78,
525
+ "lng": -47.9534
526
+ },
527
+ "timestamp": "2025-01-16T12:00:00",
528
+ "category": "Values",
529
+ "flagged_as_offensive": false
530
+ },
531
+ {
532
+ "id": 45,
533
+ "message": "We are committed to community resilience and mutual support",
534
+ "contributor_type": "industry",
535
+ "location": {
536
+ "lat": -15.7062,
537
+ "lng": -47.8504
538
+ },
539
+ "timestamp": "2025-01-16T12:30:00",
540
+ "category": "Values",
541
+ "flagged_as_offensive": false
542
+ },
543
+ {
544
+ "id": 46,
545
+ "message": "We are committed to accessibility and universal design",
546
+ "contributor_type": "community",
547
+ "location": {
548
+ "lat": -15.7476,
549
+ "lng": -47.9312
550
+ },
551
+ "timestamp": "2025-01-16T13:00:00",
552
+ "category": "Values",
553
+ "flagged_as_offensive": false
554
+ },
555
+ {
556
+ "id": 47,
557
+ "message": "It is essential to prioritize health and wellbeing for all residents",
558
+ "contributor_type": "other",
559
+ "location": {
560
+ "lat": -15.7532,
561
+ "lng": -47.9828
562
+ },
563
+ "timestamp": "2025-01-16T13:30:00",
564
+ "category": "Values",
565
+ "flagged_as_offensive": false
566
+ },
567
+ {
568
+ "id": 48,
569
+ "message": "We hold innovation balanced with preservation as a core value",
570
+ "contributor_type": "industry",
571
+ "location": {
572
+ "lat": -15.8689,
573
+ "lng": -48.0167
574
+ },
575
+ "timestamp": "2025-01-16T14:00:00",
576
+ "category": "Values",
577
+ "flagged_as_offensive": false
578
+ },
579
+ {
580
+ "id": 49,
581
+ "message": "The principle of innovation balanced with preservation matters to us",
582
+ "contributor_type": "community",
583
+ "location": {
584
+ "lat": -15.6869,
585
+ "lng": -48.0234
586
+ },
587
+ "timestamp": "2025-01-16T14:30:00",
588
+ "category": "Values",
589
+ "flagged_as_offensive": false
590
+ },
591
+ {
592
+ "id": 50,
593
+ "message": "Our community values accessibility and universal design",
594
+ "contributor_type": "academic",
595
+ "location": {
596
+ "lat": -15.8087,
597
+ "lng": -47.9772
598
+ },
599
+ "timestamp": "2025-01-16T15:00:00",
600
+ "category": "Values",
601
+ "flagged_as_offensive": false
602
+ },
603
+ {
604
+ "id": 51,
605
+ "message": "We can construct comprehensive recycling and composting facilities",
606
+ "contributor_type": "industry",
607
+ "location": {
608
+ "lat": -15.8132,
609
+ "lng": -47.9721
610
+ },
611
+ "timestamp": "2025-01-16T15:30:00",
612
+ "category": "Actions",
613
+ "flagged_as_offensive": false
614
+ },
615
+ {
616
+ "id": 52,
617
+ "message": "Let us establish a new metro line connecting eastern suburbs",
618
+ "contributor_type": "industry",
619
+ "location": {
620
+ "lat": -15.694,
621
+ "lng": -47.9389
622
+ },
623
+ "timestamp": "2025-01-16T16:00:00",
624
+ "category": "Actions",
625
+ "flagged_as_offensive": false
626
+ },
627
+ {
628
+ "id": 53,
629
+ "message": "We should install community centers in underserved neighborhoods",
630
+ "contributor_type": "government",
631
+ "location": {
632
+ "lat": -15.8259,
633
+ "lng": -47.9417
634
+ },
635
+ "timestamp": "2025-01-16T16:30:00",
636
+ "category": "Actions",
637
+ "flagged_as_offensive": false
638
+ },
639
+ {
640
+ "id": 54,
641
+ "message": "We should build a new metro line connecting eastern suburbs",
642
+ "contributor_type": "community",
643
+ "location": {
644
+ "lat": -15.717,
645
+ "lng": -47.9367
646
+ },
647
+ "timestamp": "2025-01-16T17:00:00",
648
+ "category": "Actions",
649
+ "flagged_as_offensive": false
650
+ },
651
+ {
652
+ "id": 55,
653
+ "message": "Let us organize farmers markets in every district",
654
+ "contributor_type": "industry",
655
+ "location": {
656
+ "lat": -15.8263,
657
+ "lng": -47.9003
658
+ },
659
+ "timestamp": "2025-01-16T17:30:00",
660
+ "category": "Actions",
661
+ "flagged_as_offensive": false
662
+ },
663
+ {
664
+ "id": 56,
665
+ "message": "We should build comprehensive recycling and composting facilities",
666
+ "contributor_type": "community",
667
+ "location": {
668
+ "lat": -15.8417,
669
+ "lng": -47.9085
670
+ },
671
+ "timestamp": "2025-01-16T18:00:00",
672
+ "category": "Actions",
673
+ "flagged_as_offensive": false
674
+ },
675
+ {
676
+ "id": 57,
677
+ "message": "We can construct free WiFi hotspots in all public spaces",
678
+ "contributor_type": "government",
679
+ "location": {
680
+ "lat": -15.8124,
681
+ "lng": -47.8294
682
+ },
683
+ "timestamp": "2025-01-16T18:30:00",
684
+ "category": "Actions",
685
+ "flagged_as_offensive": false
686
+ },
687
+ {
688
+ "id": 58,
689
+ "message": "We need to develop farmers markets in every district",
690
+ "contributor_type": "community",
691
+ "location": {
692
+ "lat": -15.7155,
693
+ "lng": -47.918
694
+ },
695
+ "timestamp": "2025-01-16T19:00:00",
696
+ "category": "Actions",
697
+ "flagged_as_offensive": false
698
+ },
699
+ {
700
+ "id": 59,
701
+ "message": "We need to develop protected bike lanes on major streets",
702
+ "contributor_type": "other",
703
+ "location": {
704
+ "lat": -15.8594,
705
+ "lng": -47.9596
706
+ },
707
+ "timestamp": "2025-01-16T19:30:00",
708
+ "category": "Actions",
709
+ "flagged_as_offensive": false
710
+ },
711
+ {
712
+ "id": 60,
713
+ "message": "Let us create solar panel installations on 200 public buildings",
714
+ "contributor_type": "community",
715
+ "location": {
716
+ "lat": -15.7879,
717
+ "lng": -47.9923
718
+ },
719
+ "timestamp": "2025-01-16T20:00:00",
720
+ "category": "Actions",
721
+ "flagged_as_offensive": false
722
+ }
723
+ ],
724
+ "export_date": "2025-10-06T13:14:53.243263",
725
+ "description": "Mock dataset with 60 balanced submissions (10 per category)"
726
+ }
prepare_hf_deployment.sh ADDED
@@ -0,0 +1,109 @@
1
+ #!/bin/bash
2
+
3
+ # Hugging Face Deployment Preparation Script
4
+ # This script prepares your app for deployment to Hugging Face Spaces
5
+
6
+ set -e # Exit on error
7
+
8
+ echo "πŸš€ Preparing for Hugging Face Spaces Deployment"
9
+ echo "================================================"
10
+ echo ""
11
+
12
+ # Check if we're in the right directory
13
+ if [ ! -f "app_hf.py" ]; then
14
+ echo "❌ Error: Must run from project root (where app_hf.py is located)"
15
+ exit 1
16
+ fi
17
+
18
+ # Step 1: Copy HF-specific files
19
+ echo "πŸ“ Step 1: Copying HF-specific files..."
20
+ cp Dockerfile.hf Dockerfile
21
+ echo " βœ“ Copied Dockerfile.hf β†’ Dockerfile"
22
+
23
+ cp README_HF.md README.md
24
+ echo " βœ“ Copied README_HF.md β†’ README.md"
25
+
26
+ # Step 2: Verify required files exist
27
+ echo ""
28
+ echo "πŸ” Step 2: Verifying required files..."
29
+ required_files=("Dockerfile" "README.md" "requirements.txt" "app_hf.py" "wsgi.py" ".gitignore" "app/__init__.py")
30
+
31
+ for file in "${required_files[@]}"; do
32
+ if [ -f "$file" ] || [ -d "$file" ]; then
33
+ echo " βœ“ $file"
34
+ else
35
+ echo " ❌ Missing: $file"
36
+ exit 1
37
+ fi
38
+ done
39
+
40
+ # Step 3: Check app/ directory
41
+ echo ""
42
+ echo "πŸ“‚ Step 3: Checking app directory structure..."
43
+ app_dirs=("app/routes" "app/models" "app/templates" "app/fine_tuning")
44
+
45
+ for dir in "${app_dirs[@]}"; do
46
+ if [ -d "$dir" ]; then
47
+ echo " βœ“ $dir/"
48
+ else
49
+ echo " ⚠️ Warning: $dir/ not found"
50
+ fi
51
+ done
52
+
53
+ # Step 4: Verify port configuration
54
+ echo ""
55
+ echo "πŸ”Œ Step 4: Verifying port 7860 configuration..."
56
+
57
+ if grep -q "7860" Dockerfile && grep -q "7860" app_hf.py; then
58
+ echo " βœ“ Port 7860 configured correctly"
59
+ else
60
+ echo " ❌ Port 7860 not found in Dockerfile or app_hf.py"
61
+ exit 1
62
+ fi
63
+
64
+ # Step 5: Check for sensitive files
65
+ echo ""
66
+ echo "πŸ”’ Step 5: Checking for sensitive files..."
67
+
68
+ if [ -f ".env" ]; then
69
+ echo " ⚠️ WARNING: .env file exists - DO NOT upload to HF!"
70
+ echo " Use HF Secrets instead for FLASK_SECRET_KEY"
71
+ fi
72
+
73
+ if [ -f "instance/participatory_planner.db" ]; then
74
+ echo " ⚠️ Local database exists - will NOT be uploaded (good)"
75
+ fi
76
+
77
+ # Step 6: Generate deployment summary
78
+ echo ""
79
+ echo "πŸ“Š Step 6: Deployment Summary"
80
+ echo "============================="
81
+ echo ""
82
+ echo "Ready to deploy to Hugging Face Spaces!"
83
+ echo ""
84
+ echo "πŸ“¦ Files ready for upload:"
85
+ echo " - Dockerfile (HF version)"
86
+ echo " - README.md (with YAML header)"
87
+ echo " - requirements.txt"
88
+ echo " - app_hf.py"
89
+ echo " - wsgi.py"
90
+ echo " - app/ directory"
91
+ echo " - .gitignore"
92
+ echo ""
93
+ echo "πŸ” IMPORTANT - Configure these secrets in HF Space Settings:"
94
+ echo " Secret Name: FLASK_SECRET_KEY"
95
+ echo " Secret Value: 9fd11d101e36efbd3a7893f56d604b860403d247633547586c41453118e69b00"
96
+ echo ""
97
+ echo "🌐 Next steps:"
98
+ echo " 1. Go to https://huggingface.co/new-space"
99
+ echo " 2. Choose SDK: Docker"
100
+ echo " 3. Upload the files listed above"
101
+ echo " 4. Add FLASK_SECRET_KEY to Secrets"
102
+ echo " 5. Wait for build (~10 minutes first time)"
103
+ echo ""
104
+ echo "πŸ“– For detailed instructions, see:"
105
+ echo " - HF_DEPLOYMENT_CHECKLIST.md"
106
+ echo " - HUGGINGFACE_DEPLOYMENT.md"
107
+ echo ""
108
+ echo "βœ… Preparation complete! Ready to deploy! πŸŽ‰"
109
+
requirements.txt CHANGED
@@ -14,3 +14,6 @@ matplotlib>=3.7.0
14
  seaborn>=0.12.0
15
  accelerate>=0.24.0
16
  evaluate>=0.4.0
 
 
 
 
14
  seaborn>=0.12.0
15
  accelerate>=0.24.0
16
  evaluate>=0.4.0
17
+
18
+ # Text processing (for sentence segmentation)
19
+ nltk>=3.8.0
run.py CHANGED
@@ -1,3 +1,9 @@
 
 
 
 
 
 
1
  from app import create_app
2
 
3
  app = create_app()
 
1
+ import os
2
+ from dotenv import load_dotenv
3
+
4
+ # Load environment variables (including CUDA_VISIBLE_DEVICES)
5
+ load_dotenv()
6
+
7
  from app import create_app
8
 
9
  app = create_app()
sentence_analysis_results.txt ADDED
@@ -0,0 +1,9 @@
1
+ DETAILED SENTENCE-LEVEL ANALYSIS RESULTS
2
+ ======================================================================
3
+
4
+ Total Submissions: 60
5
+ Multi-category Submissions: 0 (0.0%)
6
+
7
+
8
+ DETAILED BREAKDOWN:
9
+