🧠 MCP Agent Action Guard: Framework for Safer AI Agents
🔗: https://github.com/Pro-GenAI/Agent-Action-Guard
📄: https://www.researchgate.net/publication/396525269_Agent_Action_Guard_Safe_AI_Agents_through_Action_Classifier
AI is often perceived as a threat to humanity. As AI agents gain the ability to call APIs, run code, modify files, and interact with external systems, a new challenge emerges: how do we ensure the safety of the actions they take, not just the text they generate?
Today’s guardrails mostly filter responses, not actions. But in real-world testing, agents sometimes executed harmful actions even while verbally refusing to do so. That’s a critical gap in modern AI safety.
To address this, I developed Agent Action Guard, a framework designed to identify and block unsafe actions before they execute.
🔒 What the Agent Action Guard framework includes
- HarmActions Dataset: A first-of-its-kind dataset focused on agent actions, not prompts. Each example includes:
  - MCP-style action objects
  - Labels: safe, harmful, or unethical
  - Risk levels
  - Adversarial prompts (e.g., letter substitutions)

This dataset highlights real failure modes in tools such as file operations, messaging APIs, and code execution.
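To make that concrete, here is a rough sketch of what one labeled record could look like. The field names (`action`, `label`, `risk_level`, `adversarial_prompt`) and the example tool call are illustrative assumptions, not the exact HarmActions schema; see the repository for the real format.

```python
# Illustrative sketch of a labeled record in an action-safety dataset.
# Field names and values are assumptions for clarity, not the exact HarmActions schema.
example_record = {
    "action": {                      # MCP-style tool call emitted by the agent
        "tool": "send_message",
        "arguments": {
            "recipient": "all_contacts",
            "body": "Click this link to claim your prize...",
        },
    },
    "label": "harmful",              # one of: safe / harmful / unethical
    "risk_level": "high",            # coarse severity annotation
    "adversarial_prompt": True,      # e.g., produced under a letter-substitution jailbreak
}
```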
- Action Classifier: A compact neural classifier built on MiniLM embeddings. It runs in real time inside agent loops, classifying each action as “Safe,” “Harmful,” or “Unethical”.
Despite being lightweight, it reaches 90.32% accuracy and avoids the heavy cost of calling an LLM to classify every action.
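As a rough illustration of the general recipe (serialize the action, embed it with MiniLM, run a small classifier head), here is a hedged sketch. The placeholder training data, the logistic-regression head, and the helper names are my own assumptions, not the framework’s actual architecture or training setup, which live in the repository.

```python
# Minimal sketch of the "embed the action, then classify" approach.
# NOT the paper's exact architecture; it only shows how a MiniLM-based
# action classifier can run cheaply inside an agent loop.
import json
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

LABELS = ["safe", "harmful", "unethical"]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def embed_action(action: dict):
    """Serialize an MCP-style action object and embed it with MiniLM."""
    return embedder.encode(json.dumps(action, sort_keys=True))

# Tiny placeholder training set; the real dataset covers all three labels.
train_actions = [
    {"tool": "read_file", "arguments": {"path": "notes.txt"}},
    {"tool": "run_code", "arguments": {"code": "os.system('rm -rf /')"}},
]
train_labels = [0, 1]  # safe, harmful

clf = LogisticRegression(max_iter=1000)
clf.fit([embed_action(a) for a in train_actions], train_labels)

def classify_action(action: dict) -> str:
    """Return the predicted label for a proposed action."""
    return LABELS[int(clf.predict([embed_action(action)])[0])]
```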
- HarmActEval Benchmark: A new evaluation method built around a metric called Harm@k, which measures how likely an agent is to produce a harmful action within its first k attempts.
In testing, some large open-source models produced harmful actions more than 70% of the time under adversarial prompts — proof that action-level safety checks are urgently needed.
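For intuition, a Harm@k-style number can be estimated much like the familiar pass@k metric. The sketch below is my own reading of “harmful within the first k attempts” and is not necessarily the exact estimator used in HarmActEval.

```python
# One plausible way to estimate a Harm@k-style metric, adapted from the
# standard unbiased pass@k estimator. The exact HarmActEval definition may differ.
from math import comb

def harm_at_k(num_attempts: int, num_harmful: int, k: int) -> float:
    """P(at least one harmful action among k attempts drawn from num_attempts samples)."""
    if num_harmful == 0:
        return 0.0
    if num_attempts - num_harmful < k:
        return 1.0  # every k-subset must contain a harmful attempt
    return 1.0 - comb(num_attempts - num_harmful, k) / comb(num_attempts, k)

# Example: a model produced 8 harmful actions out of 10 sampled attempts on a task.
print(harm_at_k(num_attempts=10, num_harmful=8, k=1))  # ~0.8
```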
🚀 Why This Matters
Agentic AI systems are becoming more capable every day. But without action-level supervision, they can silently:
- Send harmful messages
- Execute unsafe code
- Modify sensitive files
- Interact with external APIs in risky ways

Agent Action Guard adds a missing layer of protection, one that sits between the model and its tools, intercepting dangerous behavior before it causes real harm.
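Conceptually, the guard is a thin interception layer: every proposed tool call passes through the classifier before it is allowed to execute. The sketch below illustrates that pattern with hypothetical names; the repository’s MCP proxy is the actual implementation.

```python
# Hedged sketch of the "guard between the model and its tools" idea:
# intercept each proposed action, classify it, and execute only if safe.
# Names are illustrative, not the actual Agent-Action-Guard API.
class BlockedActionError(Exception):
    """Raised when a proposed action is classified as unsafe."""

def guarded_call(action: dict, execute_tool, classify_action):
    """Run `execute_tool(action)` only if `classify_action(action)` says it is safe."""
    verdict = classify_action(action)
    if verdict != "safe":
        raise BlockedActionError(f"Action blocked (classified as {verdict!r}): {action}")
    return execute_tool(action)

# Usage inside an agent loop (all names hypothetical):
# result = guarded_call(proposed_action, mcp_client.call_tool, classify_action)
```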
📘 Want the full technical details?
You can read the full paper here:
👉 https://www.researchgate.net/publication/396525269_Agent_Action_Guard_Safe_AI_Agents_through_Action_Classifier
💻 Try the code yourself
If you’re building AI agents, or researching agent safety, the GitHub repository includes the full framework: the dataset, the classifier, the MCP proxy implementation, and the evaluation code.
👉 https://github.com/Pro-GenAI/Agent-Action-Guard
This project is open source and designed to be extended. If you’re working on agentic safety, I’d love for you to explore it, test it, and help push the field forward by integrating it into your own projects.

