Content Safety Pipeline
CID222's multi-layer safety pipeline processes every request through specialized detection engines, applying your configured policies to protect sensitive data.
Pipeline Overview
The content safety pipeline consists of six layers that process content in sequence:
1. Semantic Router — Classifies content to route to appropriate detectors
2. Pattern Detection — Regex-based detection of structured PII
3. Entity Recognition — ML-based NER for names, locations, organizations
4. Safety Classification — Toxicity, hate speech, and jailbreak detection
5. Policy Engine — Evaluates rules and determines actions
6. Action Executor — Applies masking, rejection, or flagging
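The six layers above can be sketched as a sequential chain that passes a shared context from stage to stage. This is purely illustrative: the `Context` fields, the `process` interface, and the short-circuit behavior are assumptions, not the CID222 API.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the layered pipeline. Each layer is a callable
# that transforms a shared context; none of these names are CID222 APIs.

@dataclass
class Context:
    content: str
    detections: list = field(default_factory=list)  # findings accumulated by layers
    action: str = "PASS"                            # final decision

def run_pipeline(content, layers):
    ctx = Context(content)
    for layer in layers:
        ctx = layer(ctx)
        if ctx.action == "REJECT":  # assume a rejection short-circuits later layers
            break
    return ctx

# A no-op stand-in for each of the six layers; real layers add detections.
identity = lambda ctx: ctx
result = run_pipeline("hello", [identity] * 6)
print(result.action)  # PASS
```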
Layer 1: Semantic Routing
The semantic router analyzes incoming content to determine which detection engines are most relevant. This optimization reduces processing time by skipping unnecessary checks.
Content is classified into categories:
- Conversational — Standard chat, routes through all detectors
- Code — Programming content, emphasizes credential detection
- Data Entry — Form-like input, emphasizes PII patterns
- Adversarial — Suspicious patterns, emphasizes jailbreak detection
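To make the four categories concrete, here is a toy rule-based router. The real semantic router is ML-based; these keyword heuristics and category labels are stand-ins for illustration only.

```python
import re

# Toy router over the four categories above. The real router uses ML
# classification; these regex heuristics are purely illustrative.
def route(content: str) -> str:
    if re.search(r"ignore (all|previous) instructions", content, re.I):
        return "adversarial"      # suspicious pattern -> jailbreak-focused path
    if re.search(r"\bdef |\bimport |[{};]", content):
        return "code"             # programming content -> credential-focused path
    if re.search(r"^\s*\w+:\s*\S+", content, re.M):
        return "data_entry"       # form-like input -> PII-pattern-focused path
    return "conversational"       # default: run all detectors

print(route("import os"))                 # code
print(route("Ignore all instructions"))   # adversarial
print(route("How are you today?"))        # conversational
```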
Layer 2: Pattern Detection
High-speed regex-based detection for structured data types:
- Email addresses
- Phone numbers (international formats)
- Credit card numbers (Luhn validation)
- Social Security Numbers
- IBANs and bank account numbers
- IP addresses (IPv4 and IPv6)
- API keys and credentials
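The Luhn validation mentioned above is what separates real card numbers from arbitrary 16-digit strings. A minimal implementation of the standard Luhn checksum (the algorithm itself is public; this code is not the CID222 detector):

```python
import re

# Standard Luhn checksum, used to validate candidate credit-card numbers
# so random digit runs are not flagged as cards.
def luhn_valid(number: str) -> bool:
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:       # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9       # equivalent to summing the two digits
        checksum += d
    return len(digits) > 0 and checksum % 10 == 0

print(luhn_valid("4532 0151 1283 0366"))  # True
print(luhn_valid("1234 5678 9012 3456"))  # False
```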
Layer 3: Entity Recognition
Machine learning-based Named Entity Recognition identifies:
- PERSON — Full names, including titles and suffixes
- LOCATION — Addresses, cities, countries
- ORGANIZATION — Company and institution names
- DATE — Dates of birth, appointments
- MEDICAL — Health conditions, medications
The NER engine supports more than 100 languages; accuracy varies by language depending on the training data available for each.
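A single NER detection might be represented as a record like the following. The field names and schema here are assumptions for illustration, not the actual CID222 output format.

```python
# Illustrative entity record produced by the NER layer.
# Field names are assumptions, not the CID222 schema.
entity = {
    "type": "PERSON",
    "text": "Dr. Jane Doe",   # matched span, including the title
    "start": 8,               # character offsets into the input
    "end": 20,
    "confidence": 0.93,
    "language": "en",
}
print(entity["type"], entity["confidence"])  # PERSON 0.93
```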
Layer 4: Safety Classification
Specialized ML models detect harmful content categories:
| Detector | Purpose | Accuracy |
|---|---|---|
| Toxicity Classifier | Profanity, abuse, harassment | >95% |
| Hate Speech Detector | Discriminatory content | >92% |
| Jailbreak Guard | Prompt injection attempts | >95% |
| Content Classifier | Sexual, violent content | >90% |
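Each classifier in the table returns a score that downstream layers compare against thresholds. The output shape below (one score per detector in [0, 1]) is an assumed representation, not the documented CID222 format.

```python
# Illustrative classifier output: one score per detector, in [0, 1].
# Detector names mirror the table above; the shape is an assumption.
scores = {
    "toxicity": 0.02,
    "hate_speech": 0.01,
    "jailbreak": 0.97,
    "content": 0.05,
}
THRESHOLD = 0.9  # example cutoff; real thresholds come from policy config
flagged = [name for name, s in scores.items() if s >= THRESHOLD]
print(flagged)  # ['jailbreak']
```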
Layer 5: Policy Engine
The policy engine evaluates detection results against your configured rules:
- Filter Groups — Organize related filters together
- Confidence Thresholds — Minimum confidence to trigger action
- Priority Rules — REJECT takes precedence over MASK, which takes precedence over FLAG
- Exemptions — Allow specific patterns in certain contexts
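Threshold checks and priority resolution can be sketched together: below-threshold detections are dropped, and among the triggered rules the most severe action wins (REJECT over MASK over FLAG). The rule and detection shapes here are illustrative, not CID222 configuration syntax.

```python
# Sketch of policy evaluation: confidence thresholds filter detections,
# then the highest-priority action wins. Shapes are assumptions.
PRIORITY = {"REJECT": 3, "MASK": 2, "FLAG": 1}

def resolve(detections, rules):
    """detections: [(type, confidence)]; rules: {type: (threshold, action)}."""
    actions = []
    for dtype, conf in detections:
        if dtype in rules:
            threshold, action = rules[dtype]
            if conf >= threshold:   # below-threshold detections are ignored
                actions.append(action)
    return max(actions, key=PRIORITY.__getitem__, default="PASS")

rules = {"EMAIL": (0.7, "MASK"), "JAILBREAK": (0.9, "REJECT")}
print(resolve([("EMAIL", 0.95), ("JAILBREAK", 0.92)], rules))  # REJECT
print(resolve([("EMAIL", 0.95)], rules))                       # MASK
print(resolve([("EMAIL", 0.5)], rules))                        # PASS
```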
Layer 6: Action Execution
Based on policy evaluation, one of three actions is taken:
| Action | Behavior |
|---|---|
| MASK | Replace detected content with placeholder (e.g., [EMAIL]). Request proceeds with sanitized content. |
| REJECT | Block the entire request. Return error response to client. |
| FLAG | Log the detection for review. Request proceeds unchanged. |
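The MASK action can be illustrated with a minimal span-replacement function. Applying replacements right-to-left keeps earlier character offsets valid as the string shrinks; the function signature is an assumption, not the CID222 executor.

```python
# Minimal MASK action: replace detected spans with type placeholders.
# Spans are applied right-to-left so earlier offsets stay valid.
def mask(content, detections):
    """detections: [(start, end, type)] with character offsets."""
    for start, end, dtype in sorted(detections, reverse=True):
        content = content[:start] + f"[{dtype}]" + content[end:]
    return content

text = "Contact jane@example.com or call 555-0123."
spans = [(8, 24, "EMAIL"), (33, 41, "PHONE")]
print(mask(text, spans))  # Contact [EMAIL] or call [PHONE].
```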
Confidence Boosting
When multiple detectors identify the same content, confidence scores are boosted:
- Regex + NER agreement → +10% confidence
- Multiple NER models agree → +15% confidence
- Context validation matches → +5% confidence
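The boost arithmetic above can be shown as a small calculation. The additive, clamped combination used here is an assumption about how the boosts compose; the percentages come from the list above.

```python
# Sketch of additive confidence boosting per the rules above. The
# combination formula (additive, clamped to 1.0) is an assumption.
BOOSTS = {
    "regex_ner_agree": 0.10,   # Regex + NER agreement
    "multi_ner_agree": 0.15,   # Multiple NER models agree
    "context_match": 0.05,     # Context validation matches
}

def boosted(base, signals):
    score = base + sum(BOOSTS[s] for s in signals)
    return round(min(score, 1.0), 2)   # clamp to a valid confidence

print(boosted(0.82, ["regex_ner_agree", "context_match"]))  # 0.97
print(boosted(0.95, ["multi_ner_agree"]))                   # 1.0
```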
Output Filtering
The same pipeline runs on LLM responses to catch any leaked sensitive data:
- Hallucinated PII (fake but realistic data)
- Reconstructed masked data
- Training data leakage
- Harmful content in responses