IndoBERT NER - Surabaya Opinion Analysis

Model Description

This model is a fine-tuned version of indobenchmark/indobert-base-p2 for Named Entity Recognition (NER). It is specifically trained to detect entities in informal Indonesian text and Suroboyoan slang, focusing on social media complaints and public opinions regarding the city of Surabaya.

Model Details

Base Model: indobenchmark/indobert-base-p2
Training Dataset: Kiuyha/surabaya-ner-dataset
Domain: Social Media (Twitter/X and Reddit)
Language: Indonesian (Informal/Slang)
Task: Token Classification (NER)
License: Apache-2.0
Tags: ner, bert, indobert, surabaya, social-media

Intended Use

This model is designed to extract entities such as:

Locations (LOC) - Places, streets, parks, neighborhoods
Persons (PERSON) - Names of individuals, public figures
Organizations (ORG) - Government agencies, companies, institutions

Because it was trained on informal social media posts, the model is intended to cope with typos, slang, and the specific context of urban complaints (e.g., mentions of specific parks, streets, or government figures in Surabaya).

How to Use

You can use this model directly with the Hugging Face pipeline:

from transformers import pipeline

# Load the pipeline
# aggregation_strategy="simple" merges word pieces and consecutive B-/I- tags
# (e.g. B-LOC, I-LOC) into single entities
nlp = pipeline("ner", model="Kiuyha/surabaya-opinion-indobert-ner", aggregation_strategy="simple")

# Example text
text = "Saya sedang makan soto di Taman Bungkul Surabaya bersama Anies Baswedan."

# Run inference
result = nlp(text)

print(result)
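
The B-LOC / I-LOC tags mentioned in the comment above follow the BIO scheme: B- opens an entity, I- continues it, and O marks tokens outside any entity. The sketch below illustrates how such tags collapse into whole entities. It is a simplified illustration, not the Transformers implementation, and the tag sequence is written by hand on the assumption that the model uses B-/I- prefixes for LOC, PERSON, and ORG.

```python
def merge_bio(tokens, tags):
    """Merge word-level BIO tags into (label, text) spans."""
    spans = []
    current_label, current_words = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts: flush the previous one, if any.
            if current_words:
                spans.append((current_label, " ".join(current_words)))
            current_label, current_words = label, [token]
        elif prefix == "I":
            # Continuation of the current entity.
            current_words.append(token)
        else:
            # "O" tag: close any open entity.
            if current_words:
                spans.append((current_label, " ".join(current_words)))
            current_label, current_words = None, []
    if current_words:
        spans.append((current_label, " ".join(current_words)))
    return spans


# Hand-written tags for part of the example sentence (an assumption, not model output).
tokens = ["di", "Taman", "Bungkul", "Surabaya", "bersama", "Anies", "Baswedan"]
tags = ["O", "B-LOC", "I-LOC", "I-LOC", "O", "B-PERSON", "I-PERSON"]

spans = merge_bio(tokens, tags)
print(spans)
```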

Output Format

The pipeline returns a list of detected entities, each with its label, confidence score, lowercased text, and start/end character offsets into the input:

[
    {
        'entity_group': 'LOC',
        'score': np.float32(0.96564007),
        'word': 'taman bungkul surabaya',
        'start': 26,
        'end': 48
    },
    {
        'entity_group': 'PERSON',
        'score': np.float32(0.9307747),
        'word': 'anies baswedan',
        'start': 57,
        'end': 71
    }
]
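
The `word` values in the sample output are lowercased, while `start` and `end` index into the original string. A minimal sketch of using those offsets to recover the original casing and group entities by type is shown below. The helper name `group_entities` is made up for this example, and the scores are written as plain floats so the snippet runs without the model.

```python
from collections import defaultdict

text = "Saya sedang makan soto di Taman Bungkul Surabaya bersama Anies Baswedan."

# Sample pipeline output copied from the example above (scores as plain floats).
result = [
    {"entity_group": "LOC", "score": 0.96564007, "word": "taman bungkul surabaya", "start": 26, "end": 48},
    {"entity_group": "PERSON", "score": 0.9307747, "word": "anies baswedan", "start": 57, "end": 71},
]


def group_entities(text, entities):
    """Group entity surface forms by label, slicing the input to keep its casing."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


grouped = group_entities(text, result)
print(grouped)
# {'LOC': ['Taman Bungkul Surabaya'], 'PERSON': ['Anies Baswedan']}
```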

Training Data

The model was trained on the Surabaya NER Dataset:

Split	Sentences
Train	6,577
Validation	822
Test	823
Total	8,222

The data consists of scraped social media posts filtered for topics related to Surabaya city issues including:

Traffic and transportation
Flooding and drainage
Crime and safety
Public utilities (water, electricity)
Government services
Infrastructure

Performance

The model was evaluated on the test set (823 sentences) and achieved the following results:

Test Set Classification Report

Entity	Precision	Recall	F1-Score	Support
LOC	0.78	0.81	0.79	1,363
ORG	0.60	0.61	0.61	344
PERSON	0.79	0.77	0.78	231

Micro avg	0.75	0.77	0.76	1,938
Macro avg	0.72	0.73	0.73	1,938
Weighted avg	0.75	0.77	0.76	1,938

Performance Insights

LOC (Location): Strongest entity type with a 79% F1-score (precision 78%, recall 81%), indicating good detection of Surabaya landmarks, streets, and neighborhoods
PERSON: Precision (79%) is slightly higher than recall (77%), so predicted person names are usually correct but some mentions are missed
ORG (Organization): Lowest performance (61% F1-score) reflects the challenge of identifying organizations in informal text where names may be abbreviated or mentioned indirectly
Overall: 76% F1-score (micro avg) indicates solid performance on social media text with informal language and slang
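
For readers who want to see how the aggregate rows relate to the per-entity rows, the sketch below recomputes them from the rounded values in the report. Macro averaging is the plain mean over entity types, weighted averaging weights each type by its support, and micro averaging pools true positives and predictions across types. Because the inputs are rounded to two decimals, the true-positive and prediction counts are estimates, so the results are approximate.

```python
# Per-entity rows from the test-set report: (precision, recall, f1, support)
report = {
    "LOC": (0.78, 0.81, 0.79, 1363),
    "ORG": (0.60, 0.61, 0.61, 344),
    "PERSON": (0.79, 0.77, 0.78, 231),
}

total_support = sum(s for _, _, _, s in report.values())

# Macro average: unweighted mean over entity types.
macro_f1 = sum(f for _, _, f, _ in report.values()) / len(report)

# Weighted average: mean over entity types, weighted by support.
weighted_f1 = sum(f * s for _, _, f, s in report.values()) / total_support

# Micro average: pool counts across types.
# Estimated true positives = recall * support; estimated predictions = TP / precision.
true_pos = sum(r * s for _, r, _, s in report.values())
predicted = sum(r * s / p for p, r, _, s in report.values())
micro_p = true_pos / predicted
micro_r = true_pos / total_support
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)

print(f"macro F1    = {macro_f1:.2f}")
print(f"weighted F1 = {weighted_f1:.2f}")
print(f"micro P/R/F1 = {micro_p:.2f} / {micro_r:.2f} / {micro_f1:.2f}")
```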

Use Cases

Social Media Monitoring: Automatically extract locations and entities from citizen complaints
Urban Analytics: Track which locations are frequently mentioned in negative contexts
Government Services: Route complaints to appropriate departments based on detected entities
Sentiment Analysis: Combine with sentiment models to understand location-specific issues
Research: Study public opinion patterns in Surabaya through entity extraction
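
As a rough illustration of the urban analytics use case, the sketch below counts how often each location is mentioned across a batch of posts. The helper `count_locations`, the `min_score` threshold, the stand-in `fake_ner` function, and the example posts are all made up for this example. In practice the `nlp` pipeline from the How to Use section could be passed in place of `fake_ner`, since it returns dictionaries with the same keys.

```python
from collections import Counter


def count_locations(ner, posts, min_score=0.5):
    """Count LOC mentions across posts.

    `ner` is any callable that returns pipeline-style entity dicts for one string.
    """
    counts = Counter()
    for post in posts:
        for ent in ner(post):
            if ent["entity_group"] == "LOC" and ent["score"] >= min_score:
                # Slice the original post and lowercase it so variants merge.
                counts[post[ent["start"]:ent["end"]].lower()] += 1
    return counts


def fake_ner(post):
    """Stand-in for the real pipeline so the sketch runs offline (string matching only)."""
    found = []
    for name in ("Taman Bungkul", "Jalan Darmo"):
        idx = post.find(name)
        if idx != -1:
            found.append({
                "entity_group": "LOC",
                "score": 0.9,
                "word": name.lower(),
                "start": idx,
                "end": idx + len(name),
            })
    return found


# Invented example posts, not taken from the dataset.
posts = [
    "Banjir lagi di Jalan Darmo",
    "Macet parah dekat Taman Bungkul",
    "Lampu jalan mati di Jalan Darmo",
]

counts = count_locations(fake_ner, posts)
print(counts.most_common())
# [('jalan darmo', 2), ('taman bungkul', 1)]
```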

Limitations

Optimized for informal Indonesian and Suroboyoan dialect - may not perform as well on formal text
Trained specifically on Surabaya-related content - performance may vary on other Indonesian cities
Best suited for social media text - may require fine-tuning for other domains
Limited to three entity types: LOC, PERSON, and ORG

Citation

If you use this model in your research, please cite both the model and the dataset:

@misc{surabaya-ner-indobert,
  author = {Kiuyha},
  title = {IndoBERT NER - Surabaya Opinion Analysis},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Kiuyha/surabaya-opinion-indobert-ner}}
}

Contact

For questions or issues regarding this model, please open an issue on the model repository page.
