Model Card: NYC Touristy Text Classifier

Model Details

  • Model Type: A distilbert-base-uncased model fine-tuned for binary text classification.
  • Model Date: September 19, 2025
  • Developed by: [Your Name/Team Name]
  • Hugging Face Hub ID: zacCMU/2025-24679-text-distilbert-predictor

Intended Use

This model is designed to classify short, descriptive texts about locations in New York City as either touristy or not_touristy. It is intended for applications that aim to categorize user-generated content, filter location reviews, or analyze descriptive narratives about urban environments.

Training Data

The model was fine-tuned on the bareethul/nyc-landmark-descriptions dataset. This dataset contains descriptions of various locations, each labeled as touristy or not_touristy.

The training process utilized both the original and an augmented version of the dataset to improve robustness and generalization.

Training Procedure

The model was trained for 5 epochs using the Hugging Face Trainer API. The training process was configured with the following key hyperparameters:

  • Learning Rate: 2e-5
  • Batch Size: 8 per device for both training and evaluation
  • Weight Decay: 0.01
  • Evaluation Strategy: Performed at the end of each epoch
  • Best Model Selection: The model with the highest accuracy on the evaluation set was saved as the final version.
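
The hyperparameters above can be sketched as keyword arguments for transformers.TrainingArguments. This is a minimal sketch, not the original training script: the output directory name and the save strategy are assumptions, and the eval_strategy argument is named evaluation_strategy in older transformers releases. Selecting the best model by accuracy also assumes a compute_metrics function that returns an "accuracy" key was passed to the Trainer.

```python
# Hyperparameters stated in this card, expressed as keyword arguments
# for transformers.TrainingArguments.
TRAINING_KWARGS = {
    "output_dir": "distilbert-nyc-touristy",  # hypothetical path, not from the card
    "num_train_epochs": 5,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "weight_decay": 0.01,
    "eval_strategy": "epoch",  # "evaluation_strategy" in older transformers releases
    "save_strategy": "epoch",  # assumption: matches eval so the best checkpoint can be reloaded
    "load_best_model_at_end": True,
    "metric_for_best_model": "accuracy",
    "greater_is_better": True,
}


def build_training_args():
    """Build TrainingArguments from the settings above (imports transformers lazily)."""
    from transformers import TrainingArguments

    return TrainingArguments(**TRAINING_KWARGS)
```

Keeping the settings in a plain dictionary makes them easy to inspect or log without constructing the full training setup.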

Evaluation

The model's performance was evaluated on two separate datasets: the augmented test set and the original, un-augmented data (treated as an external validation set). The model achieved perfect scores across all standard classification metrics on both sets.

Test Results (Augmented Data):

Metric     Value
Accuracy   1.0000
F1         1.0000
Precision  1.0000
Recall     1.0000

External Validation Results (Original Data):

Metric     Value
Accuracy   1.0000
F1         1.0000
Precision  1.0000
Recall     1.0000
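
For reference, the four reported metrics can be computed from true and predicted labels as in the sketch below. It assumes touristy is treated as the positive class and that precision, recall, and F1 are computed for that class only; the card does not state which averaging the original evaluation used, so the function name and these choices are illustrative.

```python
def binary_metrics(y_true, y_pred, positive="touristy"):
    """Accuracy, precision, recall, and F1 for a binary labeling task.

    Assumption: `positive` names the positive class; the card does not say
    which class or averaging scheme the original evaluation used.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard every ratio against an empty denominator.
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}
```

When every prediction matches its true label and at least one example belongs to the positive class, all four values equal 1.0, which is the pattern reported in both tables above.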

Sample Prediction:

Input: 'Flower stalls line the avenues, petals bright against brownstone grit. Young lovers trade tulips, old friends share sunflowers, all believing in the promise of beauty for another day.'

True Label: not_touristy

Predicted: not_touristy (Confidence: 0.999)

Limitations and Ethical Considerations

  • Dataset Specificity: This model is highly specialized for the nyc-landmark-descriptions dataset. Its performance on text describing locations outside of New York City or on different styles of prose is not guaranteed.
  • Subjectivity: The labels touristy and not_touristy are inherently subjective and reflect the definitions used in the original dataset. The model's classifications may not align with every individual's perception.
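  • Potential for Overfitting: The perfect scores on both evaluation sets may reflect overfitting to the specific vocabulary and structure of the training data rather than genuine generalization. Because the augmented data is derived from the original dataset, the two evaluation sets are not fully independent of each other. Performance may differ on completely novel, real-world data.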

How to Use

You can use this model for inference with the pipeline function from the transformers library.
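
A minimal sketch of such a call follows. It assumes the transformers library (with a PyTorch backend) is installed and that the model can be downloaded from the Hub. The helper names classify and format_prediction are illustrative, and the label strings returned depend on the model's configuration; they may be touristy / not_touristy or generic names such as LABEL_0 / LABEL_1.

```python
MODEL_ID = "zacCMU/2025-24679-text-distilbert-predictor"


def format_prediction(result):
    """Render one pipeline result in the style of the sample prediction above."""
    return f"{result['label']} (Confidence: {result['score']:.3f})"


def classify(text):
    """Classify a single text; downloads the model from the Hub on first use."""
    from transformers import pipeline  # imported lazily so the helper above works without it

    classifier = pipeline("text-classification", model=MODEL_ID)
    # The pipeline returns a list with one {"label": ..., "score": ...} dict per input.
    return classifier(text)[0]


# Example call (requires network access):
# print(format_prediction(classify("Flower stalls line the avenues, petals bright against brownstone grit.")))
```

For repeated predictions, it is usually better to build the pipeline once and reuse it rather than recreating it on every call as this sketch does.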
