IndoBERT NER - Surabaya Opinion Analysis

Model Description

This model is a fine-tuned version of indobenchmark/indobert-base-p2 for Named Entity Recognition (NER). It is specifically trained to detect entities in informal Indonesian text and Suroboyoan slang, focusing on social media complaints and public opinions regarding the city of Surabaya.

Model Details

  • Base Model: indobenchmark/indobert-base-p2
  • Training Dataset: Kiuyha/surabaya-ner-dataset
  • Domain: Social Media (Twitter/X and Reddit)
  • Language: Indonesian (Informal/Slang)
  • Task: Token Classification (NER)
  • License: Apache-2.0
  • Tags: ner, bert, indobert, surabaya, social-media

Intended Use

This model is designed to extract entities such as:

  • Locations (LOC) - Places, streets, parks, neighborhoods
  • Persons (PERSON) - Names of individuals, public figures
  • Organizations (ORG) - Government agencies, companies, institutions

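Because the training data is drawn from social media, the model has seen typos, slang, and the specific context of urban complaints (e.g., mentions of specific parks, streets, or government figures in Surabaya). The Performance section below reports how well it handles this kind of text.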

How to Use

You can use this model directly with the Hugging Face pipeline:

from transformers import pipeline

# Load the token-classification pipeline.
# aggregation_strategy="simple" groups adjacent tokens tagged with the same
# entity type (e.g. B-LOC followed by I-LOC) into a single entity span.
nlp = pipeline("ner", model="Kiuyha/surabaya-opinion-indobert-ner", aggregation_strategy="simple")

# Example text
text = "Saya sedang makan soto di Taman Bungkul Surabaya bersama Anies Baswedan."

# Run inference
result = nlp(text)

print(result)

Output Format

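The pipeline returns a list of detected entities, each with its entity type, confidence score, matched text, and start/end character offsets into the input string. In the example below the word field is lowercased, so slice the original text with start and end if casing matters: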

[
    {
        'entity_group': 'LOC',
        'score': np.float32(0.96564007),
        'word': 'taman bungkul surabaya',
        'start': 26,
        'end': 48
    },
    {
        'entity_group': 'PERSON',
        'score': np.float32(0.9307747),
        'word': 'anies baswedan',
        'start': 57,
        'end': 71
    }
]
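
Since the word field in the output above is lowercased, the start and end offsets are the reliable way to recover the original surface form. The following is a minimal sketch, not part of the model or the pipeline API: the helper name restore_surface_forms and the min_score default of 0.5 are hypothetical choices for illustration. It runs on a hard-coded copy of the output above, so no model download is needed.

```python
from typing import Any


def restore_surface_forms(
    text: str,
    entities: list[dict[str, Any]],
    min_score: float = 0.5,  # hypothetical threshold, tune for your data
) -> list[tuple[str, str]]:
    """Return (label, original-cased span) pairs for entities scoring at least min_score."""
    spans = []
    for ent in entities:
        if float(ent["score"]) < min_score:
            continue
        # Slice the input with the character offsets instead of using ent["word"].
        spans.append((ent["entity_group"], text[ent["start"]:ent["end"]]))
    return spans


text = "Saya sedang makan soto di Taman Bungkul Surabaya bersama Anies Baswedan."

# Hard-coded copy of the pipeline output shown above (scores as plain floats).
entities = [
    {"entity_group": "LOC", "score": 0.96564007, "word": "taman bungkul surabaya", "start": 26, "end": 48},
    {"entity_group": "PERSON", "score": 0.9307747, "word": "anies baswedan", "start": 57, "end": 71},
]

print(restore_surface_forms(text, entities))
# [('LOC', 'Taman Bungkul Surabaya'), ('PERSON', 'Anies Baswedan')]
```

Raising min_score trades recall for precision; a suitable value depends on the application and should be chosen on held-out data.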

Training Data

The model was trained on the Surabaya NER Dataset:

Split        Sentences
Train        6,577
Validation   822
Test         823
Total        8,222
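
As a quick check on the figures above, the split sizes work out to roughly 80/10/10. The snippet below is only arithmetic on the numbers in the table; it does not load the dataset.

```python
# Sentence counts copied from the split table above.
splits = {"train": 6577, "validation": 822, "test": 823}

total = sum(splits.values())
print(f"total: {total}")

for name, count in splits.items():
    # Share of each split relative to the full dataset.
    print(f"{name}: {count / total:.1%}")
```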

The data consists of scraped social media posts filtered for topics related to Surabaya city issues including:

  • Traffic and transportation
  • Flooding and drainage
  • Crime and safety
  • Public utilities (water, electricity)
  • Government services
  • Infrastructure

Performance

The model was evaluated on the test set (823 sentences) and achieved the following results:

Test Set Classification Report

Entity         Precision   Recall   F1-Score   Support
LOC            0.78        0.81     0.79       1,363
ORG            0.60        0.61     0.61       344
PERSON         0.79        0.77     0.78       231
Micro avg      0.75        0.77     0.76       1,938
Macro avg      0.72        0.73     0.73       1,938
Weighted avg   0.75        0.77     0.76       1,938

Performance Insights

  • LOC (Location): Strongest class, with a 79% F1-score (78% precision, 81% recall), indicating good detection of Surabaya landmarks, streets, and neighborhoods
  • PERSON: Balanced precision (79%) and recall (77%) give a 78% F1-score on the smallest class by support (231 entities)
  • ORG (Organization): Lower performance (61% F1-score) reflects the challenge of identifying organizations in informal text where names may be abbreviated or mentioned indirectly
  • Overall: 76% F1-score (micro avg) on social media text with informal language and slang
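
To see how the macro and weighted averages relate to the per-entity rows, the sketch below recomputes both F1 averages from the rounded per-class F1-scores and supports in the classification report. It is plain arithmetic on the published numbers, not the evaluation script used for this model, and small differences from the report are possible because the inputs are already rounded to two decimals.

```python
# Per-class F1-score and support copied from the classification report above.
report = {
    "LOC": {"f1": 0.79, "support": 1363},
    "ORG": {"f1": 0.61, "support": 344},
    "PERSON": {"f1": 0.78, "support": 231},
}

total_support = sum(row["support"] for row in report.values())

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / total_support

print(f"total support: {total_support}")
print(f"macro F1:      {macro_f1:.2f}")
print(f"weighted F1:   {weighted_f1:.2f}")
```

The macro average sits below the weighted average because ORG, the weakest class, has the same influence as LOC in the macro figure but accounts for only 344 of the 1,938 entities in the weighted one.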

Use Cases

  • Social Media Monitoring: Automatically extract locations and entities from citizen complaints
  • Urban Analytics: Track which locations are frequently mentioned in negative contexts
  • Government Services: Route complaints to appropriate departments based on detected entities
  • Sentiment Analysis: Combine with sentiment models to understand location-specific issues
  • Research: Study public opinion patterns in Surabaya through entity extraction
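
As an illustration of the urban analytics use case, the sketch below counts how many posts mention each detected location. It assumes each post has already been run through the pipeline with aggregation_strategy="simple"; the helper name count_locations, the min_score threshold, and the sample entities are all hypothetical and are not real model output.

```python
from collections import Counter


def count_locations(posts_entities, min_score=0.5):
    """Count, per location string, the number of posts that mention it."""
    counts = Counter()
    for entities in posts_entities:
        # Use a set so a location repeated within one post is counted once.
        locations = {
            ent["word"]
            for ent in entities
            if ent["entity_group"] == "LOC" and ent["score"] >= min_score
        }
        counts.update(locations)
    return counts


# Hypothetical pipeline outputs for three posts (made up for illustration).
sample = [
    [{"entity_group": "LOC", "score": 0.97, "word": "taman bungkul"}],
    [
        {"entity_group": "LOC", "score": 0.91, "word": "taman bungkul"},
        {"entity_group": "ORG", "score": 0.88, "word": "pdam"},
    ],
    [{"entity_group": "LOC", "score": 0.42, "word": "jalan darmo"}],
]

print(count_locations(sample).most_common(3))
# [('taman bungkul', 2)]
```

Spelling variants and abbreviations of the same place are counted separately here; a real deployment would likely need a normalization step before counting.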

Limitations

  • Optimized for informal Indonesian and Suroboyoan dialect - may not perform as well on formal text
  • Trained specifically on Surabaya-related content - performance may vary on other Indonesian cities
  • Best suited for social media text - may require fine-tuning for other domains
  • Limited to three entity types: LOC, PERSON, and ORG

Citation

If you use this model in your research, please cite both the model and the dataset:

@misc{surabaya-ner-indobert,
  author = {Kiuyha},
  title = {IndoBERT NER - Surabaya Opinion Analysis},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Kiuyha/surabaya-opinion-indobert-ner}}
}

Contact

For questions or issues regarding this model, please open an issue on the model repository page.
