
Content Safety Pipeline

CID222's multi-layer safety pipeline processes every request through specialized detection engines, applying your configured policies to protect sensitive data.

Pipeline Overview

The content safety pipeline consists of six layers that process content in sequence (a code sketch follows the list):

  1. Semantic Router — Classifies content to route to appropriate detectors
  2. Pattern Detection — Regex-based detection of structured PII
  3. Entity Recognition — ML-based NER for names, locations, organizations
  4. Safety Classification — Toxicity, hate speech, and jailbreak detection
  5. Policy Engine — Evaluates rules and determines actions
  6. Action Executor — Applies masking, rejection, or flagging
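
A minimal sketch of the six-layer flow in Python; the function names, the run_pipeline helper, and the dict-shaped context are assumptions of this sketch, not CID222's actual API:

```python
def semantic_router(ctx):
    ctx["route"] = "conversational"   # Layer 1: choose the detector set
    return ctx

def pattern_detection(ctx):
    return ctx                        # Layer 2: regex PII scan

def entity_recognition(ctx):
    return ctx                        # Layer 3: ML-based NER

def safety_classification(ctx):
    return ctx                        # Layer 4: toxicity / jailbreak scores

def policy_engine(ctx):
    return ctx                        # Layer 5: rules -> actions

def action_executor(ctx):
    return ctx                        # Layer 6: mask / reject / flag

def run_pipeline(text):
    # Each layer reads and enriches a shared context dict.
    ctx = {"text": text, "detections": [], "actions": []}
    for layer in (semantic_router, pattern_detection, entity_recognition,
                  safety_classification, policy_engine, action_executor):
        ctx = layer(ctx)
    return ctx
```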

Layer 1: Semantic Routing

The semantic router analyzes incoming content to determine which detection engines are most relevant. This optimization reduces processing time by skipping unnecessary checks.

Content is classified into categories (a routing sketch follows the list):

  • Conversational — Standard chat, routes through all detectors
  • Code — Programming content, emphasizes credential detection
  • Data Entry — Form-like input, emphasizes PII patterns
  • Adversarial — Suspicious patterns, emphasizes jailbreak detection
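
As a sketch, this routing decision amounts to a lookup from category to detector set. The category keys below mirror the list above; the detector identifiers are invented for illustration:

```python
# Hypothetical routing table; detector names are made up for this sketch.
ROUTES = {
    "conversational": {"pattern", "ner", "safety", "jailbreak"},  # all detectors
    "code":           {"pattern", "credential"},
    "data_entry":     {"pattern", "ner"},
    "adversarial":    {"jailbreak", "safety"},
}

def detectors_for(category: str) -> set[str]:
    # Unknown categories fall back to the full detector set.
    return ROUTES.get(category, ROUTES["conversational"])

print(detectors_for("code"))  # {'pattern', 'credential'} (set order varies)
```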

Layer 2: Pattern Detection

High-speed regex-based detection for structured data types:

  • Email addresses
  • Phone numbers (international formats)
  • Credit card numbers (Luhn validation)
  • Social Security Numbers
  • IBANs and bank account numbers
  • IP addresses (IPv4 and IPv6)
  • API keys and credentials

Pattern detection runs in parallel and completes in under 5 ms for typical inputs.
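
The sketch below shows the shape of this layer for three of the listed types; the regexes are deliberately simplified stand-ins, not the production patterns:

```python
import re

# Simplified stand-ins for the production patterns; real-world variants
# (quoted local parts, national phone formats, etc.) need broader rules.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD":  re.compile(r"\b(?:\d[ -]?){13,19}\b"),
}

def luhn_ok(number: str) -> bool:
    # Standard Luhn checksum: double every second digit from the right.
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_pii(text: str):
    hits = []
    for label, rx in PATTERNS.items():
        for m in rx.finditer(text):
            if label == "CARD" and not luhn_ok(m.group()):
                continue  # drop card-shaped numbers that fail the checksum
            hits.append((label, m.span(), m.group()))
    return hits

print(find_pii("Mail jane@example.com, card 4111 1111 1111 1111"))
```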

Layer 3: Entity Recognition

Machine learning-based Named Entity Recognition identifies:

  • PERSON — Full names, including titles and suffixes
  • LOCATION — Addresses, cities, countries
  • ORGANIZATION — Company and institution names
  • DATE — Dates of birth, appointments
  • MEDICAL — Health conditions, medications

The NER engine supports over 100 languages with varying accuracy based on training data availability.
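
As an illustration of this layer's interface, here is a sketch using spaCy as a stand-in engine, since CID222's own NER models are not exposed; the label mapping is an assumption, and the example requires `pip install spacy` plus the en_core_web_sm model:

```python
import spacy

# spaCy labels -> the categories listed above; MEDICAL entities would need a
# domain-specific model, so they are omitted here.
LABEL_MAP = {
    "PERSON": "PERSON",
    "GPE": "LOCATION",
    "LOC": "LOCATION",
    "ORG": "ORGANIZATION",
    "DATE": "DATE",
}

nlp = spacy.load("en_core_web_sm")  # python -m spacy download en_core_web_sm

def recognize_entities(text: str):
    # Keep only entities whose spaCy label maps onto a category above.
    return [(LABEL_MAP[ent.label_], ent.text)
            for ent in nlp(text).ents if ent.label_ in LABEL_MAP]

print(recognize_entities("Jane Doe flew to Berlin on March 3rd."))
# typically: [('PERSON', 'Jane Doe'), ('LOCATION', 'Berlin'), ('DATE', 'March 3rd')]
```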

Layer 4: Safety Classification

Specialized ML models detect harmful content categories:

| Detector | Purpose | Accuracy |
| --- | --- | --- |
| Toxicity Classifier | Profanity, abuse, harassment | >95% |
| Hate Speech Detector | Discriminatory content | >92% |
| Jailbreak Guard | Prompt injection attempts | >95% |
| Content Classifier | Sexual, violent content | >90% |
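
These detectors ship inside CID222 and are not directly callable; as a rough stand-in showing the shape of this layer, a public classifier can be run the same way (the model below is an example from the Hugging Face hub, not one of the detectors in the table):

```python
# Stand-in only: a public toxicity classifier via Hugging Face transformers.
# Requires `pip install transformers torch`.
from transformers import pipeline

toxicity = pipeline("text-classification", model="unitary/toxic-bert")
result = toxicity("you people are worthless")[0]
print(result["label"], round(result["score"], 3))  # e.g. "toxic 0.98"
```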

Layer 5: Policy Engine

The policy engine evaluates detection results against your configured rules (a decision sketch follows the list):

  • Filter Groups — Organize related filters together
  • Confidence Thresholds — Minimum confidence to trigger action
  • Priority Rules — REJECT takes precedence over MASK, which takes precedence over FLAG
  • Exemptions — Allow specific patterns in certain contexts
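
A minimal sketch of that evaluation under a single confidence threshold; the detection format and the 0.85 threshold are assumptions of this example, not CID222 defaults:

```python
# Lower number = higher priority, matching REJECT > MASK > FLAG above.
PRIORITY = {"REJECT": 0, "MASK": 1, "FLAG": 2}

def decide(detections, threshold=0.85):
    # Keep detections that clear the confidence threshold, then return the
    # highest-priority action among them (None if nothing fires).
    actions = [d["action"] for d in detections if d["confidence"] >= threshold]
    return min(actions, key=PRIORITY.__getitem__, default=None)

print(decide([
    {"type": "EMAIL", "confidence": 0.99, "action": "MASK"},
    {"type": "JAILBREAK", "confidence": 0.91, "action": "REJECT"},
]))  # REJECT
```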

Layer 6: Action Execution

Based on policy evaluation, one of three actions is taken:

| Action | Behavior |
| --- | --- |
| MASK | Replace detected content with a placeholder (e.g., [EMAIL]). Request proceeds with sanitized content. |
| REJECT | Block the entire request. Return an error response to the client. |
| FLAG | Log the detection for review. Request proceeds unchanged. |
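
For the MASK action specifically, a sketch of placeholder substitution; the detection format is an assumption, and spans are applied right-to-left so earlier character offsets stay valid:

```python
def mask(text, detections):
    # Replace spans right-to-left so earlier character offsets stay valid.
    for d in sorted(detections, key=lambda d: d["span"][0], reverse=True):
        start, end = d["span"]
        text = text[:start] + f"[{d['type']}]" + text[end:]
    return text

print(mask("Contact jane@example.com today",
           [{"type": "EMAIL", "span": (8, 24)}]))
# -> Contact [EMAIL] today
```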

Confidence Boosting

When multiple detectors identify the same content, confidence scores are boosted:

  • Regex + NER agreement → +10% confidence
  • Multiple NER models agree → +15% confidence
  • Context validation matches → +5% confidence

Boosted confidence helps reduce false negatives while keeping false positives low.
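
Reading the bonuses as additive percentage points (an assumption of this sketch, as is the cap at 1.0), the boosts combine like this:

```python
def boosted(base, regex_ner_agree=False, multi_ner_agree=False,
            context_match=False):
    # Apply the bonuses from the list above; cap the result at 1.0.
    bonus = (0.10 * regex_ner_agree + 0.15 * multi_ner_agree
             + 0.05 * context_match)
    return min(1.0, base + bonus)

print(round(boosted(0.80, regex_ner_agree=True, context_match=True), 2))  # 0.95
```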

Output Filtering

The same pipeline runs on LLM responses to catch any leaked sensitive data (a guard sketch follows the list):

  • Hallucinated PII (fake but realistic data)
  • Reconstructed masked data
  • Training data leakage
  • Harmful content in responses
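
A sketch of wiring output filtering around a model reply; run_pipeline refers to the overview sketch near the top of this page, and the blocked-response message is an assumption:

```python
def guard_response(reply: str, run_pipeline) -> str:
    # Apply the same six layers to the model's reply before returning it.
    result = run_pipeline(reply)
    if "REJECT" in result["actions"]:
        # An assumed placeholder message; real deployments configure this.
        return "[response blocked by content safety policy]"
    return result["text"]
```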