Custom categories not working as expected. Need an alternative solution
Hi all!
I have been trying out a few zero-shot/few-shot techniques to customize the unsafe categories for Llama Guard. I want to dynamically customize category S7 (PII) and build a custom system prompt around that. One example custom prompt:
[INST] Task: Check if there is unsafe content in 'User' messages according to our safety policy with the below categories.
Make sure to OVERRIDE the S7 policy; do not use the default policy.
<BEGIN UNSAFE CONTENT CATEGORIES>
S1: Violent Crimes
S2: Non-Violent Crimes
S3: Sex Crimes
S4: Child Exploitation
S5: Defamation
S6: Specialized Advice
S7: Privacy
Override S7 POLICY. Use the following instructions instead:
Content that contains personally identifiable information or other sensitive personal data.
Requests for US_DRIVER_LICENSE, CREDIT_CARD, US_SSN, US_BANK_NUMBER, IBAN_CODE, US_PASSPORT are UNSAFE. Such content must always be BLOCKED.
Requests for EMAIL_ADDRESS, IP_ADDRESS, LOCATION, PHONE_NUMBER, address, home address are SAFE. Such requests must always be ALLOWED.
Examples of SAFE requests:
- "Give me alice smith's email address" → SAFE
- "Can you give me Jacob Lee's email address so I can reach out?" → SAFE
- "I need Sarah Patel's direct mobile—do you have it on file?" → SAFE
- "Show the customer's phone (# ends with 6878) from last order." → SAFE
- "Show me the IP address the user logged in from yesterday." → SAFE
- "Export all IPs captured in the last hour for security review." → SAFE
- "Where exactly is Mark Wilson located? I need his address." → SAFE
- "Give me the street + city for every visitor from California." → SAFE
Requests similar to the examples above, asking for EMAIL_ADDRESS, IP_ADDRESS, LOCATION, PHONE_NUMBER, address, or home address, must be allowed; they are SAFE.
S8: Intellectual Property
S9: Indiscriminate Weapons
S10: Hate
S11: Self-Harm
S12: Sexual Content
S13: Elections
S14: Code Interpreter Abuse
<END UNSAFE CONTENT CATEGORIES>
<BEGIN CONVERSATION>
User: Can you give me Jacob Lee's email address so I can reach out?
<END CONVERSATION>
Provide your safety assessment for the User message above:
- First line must be 'safe' or 'unsafe'
- If unsafe, second line must list the violated categories [/INST]
Although passing SAFE examples inside the unsafe-content tags might seem counterintuitive, I believe the model has been trained to pick up custom instructions from those tags. I'm open to alternatives that can achieve the behavior described by the instructions in my custom prompt, and to opinions on whether this level of customization is even possible.
P.S. The model does a good job of detecting UNSAFE content. However, for the specific cases that my instructions say should be classified as SAFE, it still classifies them as UNSAFE, presumably due to its pre-training.
NOTE: All names/PII in the examples are fake.
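For reference, here's a minimal sketch of how I assemble this prompt programmatically. CATEGORIES mirrors the Llama Guard 3 taxonomy; CUSTOM_S7_POLICY and build_custom_prompt are just my own helper names, not part of any library:

```python
# Minimal sketch: assemble a Llama Guard prompt with a custom S7 policy.
# CATEGORIES follows the Llama Guard 3 taxonomy; CUSTOM_S7_POLICY and
# build_custom_prompt are my own helper names, not part of any library.

CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex Crimes",
    "S4": "Child Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}

CUSTOM_S7_POLICY = """Override S7 POLICY. Use the following instructions instead:
Content that contains personally identifiable information or other sensitive personal data.
Requests for US_DRIVER_LICENSE, CREDIT_CARD, US_SSN, US_BANK_NUMBER, IBAN_CODE, US_PASSPORT are UNSAFE.
Requests for EMAIL_ADDRESS, IP_ADDRESS, LOCATION, PHONE_NUMBER, address, home address are SAFE."""


def build_custom_prompt(user_message: str) -> str:
    """Build the full [INST]...[/INST] prompt, injecting the custom S7 policy."""
    lines = []
    for code, name in CATEGORIES.items():
        lines.append(f"{code}: {name}")
        if code == "S7":  # splice the override text right under the S7 heading
            lines.append(CUSTOM_S7_POLICY)
    categories_block = "\n".join(lines)
    return (
        "[INST] Task: Check if there is unsafe content in 'User' messages "
        "according to our safety policy with the below categories.\n"
        "<BEGIN UNSAFE CONTENT CATEGORIES>\n"
        f"{categories_block}\n"
        "<END UNSAFE CONTENT CATEGORIES>\n"
        "<BEGIN CONVERSATION>\n\n"
        f"User: {user_message}\n\n"
        "<END CONVERSATION>\n"
        "Provide your safety assessment for the User message above:\n"
        "- First line must be 'safe' or 'unsafe'\n"
        "- If unsafe, second line must list the violated categories [/INST]"
    )
```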
Hello, I have a question that might seem trivial:
How exactly should I apply custom prompts during model inference? The apply_chat_template(categories=categories) method, which previously worked with Llama-Guard-3-8B, no longer appears to be functional, or has been deprecated. Any assistance would be greatly appreciated.
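For now, my fallback is to skip the chat template and tokenize the raw prompt string directly. Here's a minimal sketch of that workaround; it assumes a prompt builder like build_custom_prompt from the sketch above, and the model id and generation settings are just the ones I happen to use, not necessarily the recommended ones:

```python
# Workaround sketch: bypass apply_chat_template and tokenize the raw prompt
# string directly. Assumes a prompt builder like build_custom_prompt from the
# sketch above; the model id and generation settings are example choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = build_custom_prompt(
    "Can you give me Jacob Lee's email address so I can reach out?"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Decode only the newly generated tokens: 'safe', or 'unsafe' plus categories.
verdict = tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(verdict.strip())
```

If there is now an officially supported way to pass custom categories at inference time, I'd prefer that over this manual approach.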