Content Safety Pipeline
CID222's multi-layer safety pipeline processes every request through specialized detection engines, applying your configured policies to protect sensitive data.
Pipeline Overview
The content safety pipeline consists of six layers that process content in sequence:
1. Semantic Router — Classifies content to route to appropriate detectors
2. Pattern Detection — Regex-based detection of structured PII
3. Entity Recognition — ML-based NER for names, locations, organizations
4. Safety Classification — Toxicity, hate speech, and jailbreak detection
5. Policy Engine — Evaluates rules and determines actions
6. Action Executor — Applies masking, rejection, or flagging
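The six layers above can be sketched as a sequential chain that passes a shared context from stage to stage. This is purely illustrative: the `Context` fields, the `process` interface, and the short-circuit behavior are assumptions, not the CID222 API.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the layered pipeline. Each layer is a callable
# that transforms a shared context; none of these names are CID222 APIs.

@dataclass
class Context:
    content: str
    detections: list = field(default_factory=list)  # findings accumulated by layers
    action: str = "PASS"                            # final decision

def run_pipeline(content, layers):
    ctx = Context(content)
    for layer in layers:
        ctx = layer(ctx)
        if ctx.action == "REJECT":  # assume a rejection short-circuits later layers
            break
    return ctx

# A no-op stand-in for each of the six layers; real layers add detections.
identity = lambda ctx: ctx
result = run_pipeline("hello", [identity] * 6)
print(result.action)  # PASS
```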
Layer 1: Semantic Routing
The semantic router analyzes incoming content to determine which detection engines are most relevant. This optimization reduces processing time by skipping unnecessary checks.
Content is classified into categories:
- Conversational — Standard chat, routes through all detectors
- Code — Programming content, emphasizes credential detection
- Data Entry — Form-like input, emphasizes PII patterns
- Adversarial — Suspicious patterns, emphasizes jailbreak detection
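To make the four categories concrete, here is a toy rule-based router. The real semantic router is ML-based; these keyword heuristics and category labels are stand-ins for illustration only.

```python
import re

# Toy router over the four categories above. The real router uses ML
# classification; these regex heuristics are purely illustrative.
def route(content: str) -> str:
    if re.search(r"ignore (all|previous) instructions", content, re.I):
        return "adversarial"      # suspicious pattern -> jailbreak-focused path
    if re.search(r"\bdef |\bimport |[{};]", content):
        return "code"             # programming content -> credential-focused path
    if re.search(r"^\s*\w+:\s*\S+", content, re.M):
        return "data_entry"       # form-like input -> PII-pattern-focused path
    return "conversational"       # default: run all detectors

print(route("import os"))                 # code
print(route("Ignore all instructions"))   # adversarial
print(route("How are you today?"))        # conversational
```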
Layer 2: Pattern Detection
High-speed regex-based detection for structured data types:
- Email addresses
- Phone numbers (international formats)
- Credit card numbers (Luhn validation)
- Social Security Numbers
- IBANs and bank account numbers
- IP addresses (IPv4 and IPv6)
- API keys and credentials
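The Luhn validation mentioned above is what separates real card numbers from arbitrary 16-digit strings. A minimal implementation of the standard Luhn checksum (the algorithm itself is public; this code is not the CID222 detector):

```python
import re

# Standard Luhn checksum, used to validate candidate credit-card numbers
# so random digit runs are not flagged as cards.
def luhn_valid(number: str) -> bool:
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:       # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9       # equivalent to summing the two digits
        checksum += d
    return len(digits) > 0 and checksum % 10 == 0

print(luhn_valid("4532 0151 1283 0366"))  # True
print(luhn_valid("1234 5678 9012 3456"))  # False
```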
Layer 3: Entity Recognition
Machine learning-based Named Entity Recognition identifies:
- PERSON — Full names, including titles and suffixes
- LOCATION — Addresses, cities, countries
- ORGANIZATION — Company and institution names
- DATE — Dates of birth, appointments
- MEDICAL — Health conditions, medications
The NER engine supports more than 100 languages; accuracy varies by language depending on the training data available for each.
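A single NER detection might be represented as a record like the following. The field names and schema here are assumptions for illustration, not the actual CID222 output format.

```python
# Illustrative entity record produced by the NER layer.
# Field names are assumptions, not the CID222 schema.
entity = {
    "type": "PERSON",
    "text": "Dr. Jane Doe",   # matched span, including the title
    "start": 8,               # character offsets into the input
    "end": 20,
    "confidence": 0.93,
    "language": "en",
}
print(entity["type"], entity["confidence"])  # PERSON 0.93
```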
Layer 4: Safety Classification
Specialized ML models detect harmful content categories:
| Detector | Purpose | Accuracy |
|---|---|---|
| Toxicity Classifier | Profanity, abuse, harassment | >95% |
| Hate Speech Detector | Discriminatory content | >92% |
| Jailbreak Guard | Prompt injection attempts | >95% |
| Content Classifier | Sexual, violent content | >90% |
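Each classifier in the table returns a score that downstream layers compare against thresholds. The output shape below (one score per detector in [0, 1]) is an assumed representation, not the documented CID222 format.

```python
# Illustrative classifier output: one score per detector, in [0, 1].
# Detector names mirror the table above; the shape is an assumption.
scores = {
    "toxicity": 0.02,
    "hate_speech": 0.01,
    "jailbreak": 0.97,
    "content": 0.05,
}
THRESHOLD = 0.9  # example cutoff; real thresholds come from policy config
flagged = [name for name, s in scores.items() if s >= THRESHOLD]
print(flagged)  # ['jailbreak']
```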
Layer 5: Policy Engine
The policy engine evaluates detection results against your configured rules:
- Filter Groups — Organize related filters together
- Confidence Thresholds — Minimum confidence to trigger action
- Priority Rules — REJECT takes precedence over MASK, which takes precedence over FLAG
- Exemptions — Allow specific patterns in certain contexts
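Threshold checks and priority resolution can be sketched together: below-threshold detections are dropped, and among the triggered rules the most severe action wins (REJECT over MASK over FLAG). The rule and detection shapes here are illustrative, not CID222 configuration syntax.

```python
# Sketch of policy evaluation: confidence thresholds filter detections,
# then the highest-priority action wins. Shapes are assumptions.
PRIORITY = {"REJECT": 3, "MASK": 2, "FLAG": 1}

def resolve(detections, rules):
    """detections: [(type, confidence)]; rules: {type: (threshold, action)}."""
    actions = []
    for dtype, conf in detections:
        if dtype in rules:
            threshold, action = rules[dtype]
            if conf >= threshold:   # below-threshold detections are ignored
                actions.append(action)
    return max(actions, key=PRIORITY.__getitem__, default="PASS")

rules = {"EMAIL": (0.7, "MASK"), "JAILBREAK": (0.9, "REJECT")}
print(resolve([("EMAIL", 0.95), ("JAILBREAK", 0.92)], rules))  # REJECT
print(resolve([("EMAIL", 0.95)], rules))                       # MASK
print(resolve([("EMAIL", 0.5)], rules))                        # PASS
```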
Layer 6: Action Execution
Based on policy evaluation, one of three actions is taken:
| Action | Behavior |
|---|---|
| MASK | Replace detected content with placeholder (e.g., [EMAIL]). Request proceeds with sanitized content. |
| REJECT | Block the entire request. Return error response to client. |
| FLAG | Log the detection for review. Request proceeds unchanged. |
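The MASK action can be illustrated with a minimal span-replacement function. Applying replacements right-to-left keeps earlier character offsets valid as the string shrinks; the function signature is an assumption, not the CID222 executor.

```python
# Minimal MASK action: replace detected spans with type placeholders.
# Spans are applied right-to-left so earlier offsets stay valid.
def mask(content, detections):
    """detections: [(start, end, type)] with character offsets."""
    for start, end, dtype in sorted(detections, reverse=True):
        content = content[:start] + f"[{dtype}]" + content[end:]
    return content

text = "Contact jane@example.com or call 555-0123."
spans = [(8, 24, "EMAIL"), (33, 41, "PHONE")]
print(mask(text, spans))  # Contact [EMAIL] or call [PHONE].
```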
Confidence Boosting
When multiple detectors identify the same content, confidence scores are boosted:
- Regex + NER agreement → +10% confidence
- Multiple NER models agree → +15% confidence
- Context validation matches → +5% confidence
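The boost arithmetic above can be shown as a small calculation. The additive, clamped combination used here is an assumption about how the boosts compose; the percentages come from the list above.

```python
# Sketch of additive confidence boosting per the rules above. The
# combination formula (additive, clamped to 1.0) is an assumption.
BOOSTS = {
    "regex_ner_agree": 0.10,   # Regex + NER agreement
    "multi_ner_agree": 0.15,   # Multiple NER models agree
    "context_match": 0.05,     # Context validation matches
}

def boosted(base, signals):
    score = base + sum(BOOSTS[s] for s in signals)
    return round(min(score, 1.0), 2)   # clamp to a valid confidence

print(boosted(0.82, ["regex_ner_agree", "context_match"]))  # 0.97
print(boosted(0.95, ["multi_ner_agree"]))                   # 1.0
```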
Output Filtering
The same pipeline runs on LLM responses to catch any leaked sensitive data:
- Hallucinated PII (fake but realistic data)
- Reconstructed masked data
- Training data leakage
- Harmful content in responses