IndoBERT NER - Surabaya Opinion Analysis
Model Description
This model is a fine-tuned version of indobenchmark/indobert-base-p2 for Named Entity Recognition (NER). It is specifically trained to detect entities in informal Indonesian text and Suroboyoan slang, focusing on social media complaints and public opinions regarding the city of Surabaya.
Model Details
- Base Model: indobenchmark/indobert-base-p2
- Training Dataset: Kiuyha/surabaya-ner-dataset
- Domain: Social Media (Twitter/X and Reddit)
- Language: Indonesian (Informal/Slang)
- Task: Token Classification (NER)
- License: Apache-2.0
- Tags: ner, bert, indobert, surabaya, social-media
Intended Use
This model is designed to extract three entity types:
- Locations (LOC) - Places, streets, parks, neighborhoods
- Persons (PERSON) - Names of individuals, public figures
- Organizations (ORG) - Government agencies, companies, institutions
Because the training data consists of informal social media posts, the model is intended to tolerate typos, slang, and the specific context of urban complaints (e.g., mentions of specific parks, streets, or government figures in Surabaya).
How to Use
You can use this model directly with the Hugging Face pipeline:
from transformers import pipeline
# Load the pipeline
# aggregation_strategy="simple" merges sub-tokens (B-LOC, I-LOC) into single entities
nlp = pipeline("ner", model="Kiuyha/surabaya-opinion-indobert-ner", aggregation_strategy="simple")
# Example text
text = "Saya sedang makan soto di Taman Bungkul Surabaya bersama Anies Baswedan."
# Run inference
result = nlp(text)
print(result)
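The comment about `aggregation_strategy="simple"` refers to merging BIO-tagged tokens into whole entities. The sketch below illustrates that idea on word-level tags. It is a simplified illustration, not the library's actual implementation, and the function name `merge_bio` and the `B-PERSON`/`I-PERSON` tag names are assumptions based on the `B-LOC`/`I-LOC` scheme mentioned above.

```python
from typing import List, Optional, Tuple


def merge_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group word-level BIO tags into (label, text) entity spans."""
    spans: List[Tuple[str, List[str]]] = []
    current: Optional[Tuple[str, List[str]]] = None  # entity being built
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # An I- tag with the same label extends the open entity.
            current[1].append(token)
        else:
            # "O" closes any open entity. A stray I- tag with no matching
            # open entity is dropped here (a simplification).
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]


tokens = ["makan", "soto", "di", "Taman", "Bungkul", "Surabaya",
          "bersama", "Anies", "Baswedan"]
tags = ["O", "O", "O", "B-LOC", "I-LOC", "I-LOC",
        "O", "B-PERSON", "I-PERSON"]
print(merge_bio(tokens, tags))
# [('LOC', 'Taman Bungkul Surabaya'), ('PERSON', 'Anies Baswedan')]
```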
Output Format
The pipeline returns a list of detected entities, each with its label, confidence score, matched text, and start/end character offsets into the input string. For the example above:
[
{
'entity_group': 'LOC',
'score': np.float32(0.96564007),
'word': 'taman bungkul surabaya',
'start': 26,
'end': 48
},
{
'entity_group': 'PERSON',
'score': np.float32(0.9307747),
'word': 'anies baswedan',
'start': 57,
'end': 71
}
]
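In the sample output the `word` field is lowercased, while `start` and `end` index into the original input string. A minimal sketch of recovering the original-cased text from those offsets, with an optional confidence filter, is shown below. The helper name `surface_forms` and the 0.95 threshold are illustrative assumptions; the entity dictionaries are copied from the sample output with scores written as plain floats, so the snippet runs without downloading the model.

```python
from typing import Dict, List, Tuple


def surface_forms(text: str, entities: List[Dict],
                  min_score: float = 0.0) -> List[Tuple[str, str]]:
    """Return (label, original-cased text) for entities above min_score."""
    return [
        (ent["entity_group"], text[ent["start"]:ent["end"]])
        for ent in entities
        if float(ent["score"]) >= min_score
    ]


text = "Saya sedang makan soto di Taman Bungkul Surabaya bersama Anies Baswedan."
# Copied from the sample output above (scores as plain floats).
result = [
    {"entity_group": "LOC", "score": 0.96564007,
     "word": "taman bungkul surabaya", "start": 26, "end": 48},
    {"entity_group": "PERSON", "score": 0.9307747,
     "word": "anies baswedan", "start": 57, "end": 71},
]

print(surface_forms(text, result))
# [('LOC', 'Taman Bungkul Surabaya'), ('PERSON', 'Anies Baswedan')]

# Illustrative threshold: keep only high-confidence entities.
print(surface_forms(text, result, min_score=0.95))
# [('LOC', 'Taman Bungkul Surabaya')]
```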
Training Data
The model was trained on the Surabaya NER Dataset (Kiuyha/surabaya-ner-dataset), split roughly 80/10/10 into train, validation, and test sets:
| Split | Sentences |
|---|---|
| Train | 6,577 |
| Validation | 822 |
| Test | 823 |
| Total | 8,222 |
The data consists of scraped social media posts, filtered for topics related to Surabaya city issues, including:
- Traffic and transportation
- Flooding and drainage
- Crime and safety
- Public utilities (water, electricity)
- Government services
- Infrastructure
Performance
The model was evaluated on the test set (823 sentences) and achieved the following results:
Test Set Classification Report
| Entity | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| LOC | 0.78 | 0.81 | 0.79 | 1,363 |
| ORG | 0.60 | 0.61 | 0.61 | 344 |
| PERSON | 0.79 | 0.77 | 0.78 | 231 |
| Micro avg | 0.75 | 0.77 | 0.76 | 1,938 |
| Macro avg | 0.72 | 0.73 | 0.73 | 1,938 |
| Weighted avg | 0.75 | 0.77 | 0.76 | 1,938 |
Performance Insights
- LOC (Location): Strongest class with a 79% F1-score, indicating good detection of Surabaya landmarks, streets, and neighborhoods
- PERSON: Precision (79%) and recall (77%) are balanced, giving a 78% F1-score; the model is about as reliable on person names as on locations, though recall could still be improved
- ORG (Organization): Lower performance (61% F1-score) reflects the challenge of identifying organizations in informal text where names may be abbreviated or mentioned indirectly
- Overall: 76% F1-score (micro avg) indicates solid performance on social media text with informal language and slang
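As a sanity check, the macro and weighted average rows of the classification report above can be recomputed from the three per-class rows. The sketch below does this with the rounded values from the table. Averages built from rounded inputs can differ from the true ones in the last digit, but here they reproduce the reported rows.

```python
# Per-class rows from the report: (precision, recall, f1, support)
report = {
    "LOC": (0.78, 0.81, 0.79, 1363),
    "ORG": (0.60, 0.61, 0.61, 344),
    "PERSON": (0.79, 0.77, 0.78, 231),
}

rows = list(report.values())
total = sum(row[3] for row in rows)

# Macro average: unweighted mean over classes.
macro = [round(sum(row[i] for row in rows) / len(rows), 2) for i in range(3)]

# Weighted average: mean over classes, weighted by support.
weighted = [
    round(sum(row[i] * row[3] for row in rows) / total, 2) for i in range(3)
]

print(total)     # 1938
print(macro)     # [0.72, 0.73, 0.73]
print(weighted)  # [0.75, 0.77, 0.76]
```

LOC accounts for about 70% of the test entities (1,363 of 1,938), which is why the weighted average sits above the macro average.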
Use Cases
- Social Media Monitoring: Automatically extract locations and entities from citizen complaints
- Urban Analytics: Track which locations are frequently mentioned in negative contexts
- Government Services: Route complaints to appropriate departments based on detected entities
- Sentiment Analysis: Combine with sentiment models to understand location-specific issues
- Research: Study public opinion patterns in Surabaya through entity extraction
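As one illustration of the urban analytics use case, the sketch below counts how often each location is mentioned across a batch of pipeline outputs. The helper `count_locations`, the 0.5 score threshold, and the sample entity dictionaries are all hypothetical; in practice `batch_results` would hold one pipeline output list per post.

```python
from collections import Counter
from typing import Dict, List


def count_locations(batch_results: List[List[Dict]],
                    min_score: float = 0.5) -> Counter:
    """Count LOC mentions across pipeline outputs, skipping low scores."""
    counts: Counter = Counter()
    for entities in batch_results:
        for ent in entities:
            if ent["entity_group"] == "LOC" and float(ent["score"]) >= min_score:
                counts[ent["word"]] += 1
    return counts


# Made-up outputs for three posts, in the same shape the pipeline returns.
batch_results = [
    [{"entity_group": "LOC", "score": 0.97, "word": "taman bungkul"}],
    [{"entity_group": "LOC", "score": 0.91, "word": "jalan ahmad yani"},
     {"entity_group": "ORG", "score": 0.88, "word": "pdam"}],
    [{"entity_group": "LOC", "score": 0.42, "word": "taman bungkul"},
     {"entity_group": "LOC", "score": 0.95, "word": "taman bungkul"}],
]

counts = count_locations(batch_results)
print(counts.most_common(2))
# [('taman bungkul', 2), ('jalan ahmad yani', 1)]
```

Counting on the lowercased `word` field groups differently cased mentions of the same place together, but it does not resolve slang variants or abbreviations of the same location.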
Limitations
- Optimized for informal Indonesian and the Suroboyoan dialect; it may not perform as well on formal text
- Trained specifically on Surabaya-related content; performance may vary on text about other Indonesian cities
- Best suited for social media text; other domains may require further fine-tuning
- Limited to three entity types: LOC, PERSON, and ORG
Citation
If you use this model in your research, please cite both the model and the dataset:
@misc{surabaya-ner-indobert,
author = {Kiuyha},
title = {IndoBERT NER - Surabaya Opinion Analysis},
year = {2024},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/Kiuyha/surabaya-opinion-indobert-ner}}
}
Contact
For questions or issues regarding this model, please open an issue on the model repository page.