Guardrails Service

The Guardrails Service is an optional Python microservice that provides ML-powered security scanning. When deployed alongside Declaw, it replaces the built-in regex scanners with production-grade models:

PII detection: Microsoft Presidio with Named Entity Recognition (NER) for unstructured PII (person names, organizations, addresses, passport numbers, etc.)
Prompt injection detection: qualifire/prompt-injection-sentinel HuggingFace model that scores content for injection likelihood

If the Guardrails Service is unreachable, the security proxy automatically falls back to the built-in regex scanners. No configuration change is required for this fallback.

Architecture

The security proxy sends scan requests to the Guardrails Service HTTP API at GUARDRAILS_URL/api/v1/scan. The service runs the requested scanners in parallel, and the proxy waits up to 10 seconds for the results.
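
As a rough illustration of this request path, the Python sketch below joins a base URL with the scan path and posts a JSON payload with a 10-second timeout. The names scan_url and post_scan are hypothetical helpers written for this example, not part of Declaw or the Guardrails Service; the proxy's real implementation is not shown in this document and may differ.

```python
import json
import urllib.request

SCAN_PATH = "/api/v1/scan"
PROXY_TIMEOUT_SECONDS = 10  # the proxy timeout stated in this document


def scan_url(base_url: str) -> str:
    """Join the base URL and the scan path without doubling slashes."""
    return base_url.rstrip("/") + SCAN_PATH


def post_scan(base_url: str, payload: dict) -> dict:
    """POST a scan request and return the decoded JSON response.

    Hypothetical helper: raises an OSError subclass (URLError, timeout)
    if the service cannot be reached in time.
    """
    request = urllib.request.Request(
        scan_url(base_url),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=PROXY_TIMEOUT_SECONDS) as response:
        return json.load(response)
```

Stripping a trailing slash from the base URL keeps the path correct whether GUARDRAILS_URL is set as http://host:8000 or http://host:8000/.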

Deploy on GCP

cd guardrails-service/iac/gcp
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars: set project_id, region
../scripts/deploy-gcp.sh

This provisions a GCP VM with the Guardrails Service installed and started via systemd. The service listens on port 8000.

Connect to Declaw

Set the GUARDRAILS_URL environment variable before running the Declaw deploy script. The script detects it automatically:

export GUARDRAILS_URL=http://<guardrails-vm-ip>:8000
./scripts/deploy-gcp.sh

Or set it manually on the Declaw VM after deployment:

# On the Declaw VM
echo "GUARDRAILS_URL=http://<guardrails-vm-ip>:8000" >> /etc/declaw/env
systemctl restart declaw-orchestrator

Manual connection

In the SDK, set GUARDRAILS_URL in the environment before creating sandboxes:

export GUARDRAILS_URL=http://<guardrails-vm-ip>:8000
export DECLAW_API_KEY=your-key
export DECLAW_DOMAIN=your-domain:8080

The orchestrator reads GUARDRAILS_URL at startup and passes it to each sandbox’s security proxy.

Guardrails Service API

The service exposes one scan endpoint and a health check:

POST /api/v1/scan
Content-Type: application/json

{
  "prompts": ["User input to scan..."],
  "scanners": [
    { "scanner_type": "pii_scanner" },
    { "scanner_type": "prompt_injection_scanner" }
  ]
}

Response:

{
  "scanner_responses": [
    {
      "scanner_type": "pii_scanner",
      "pii_scanner_response": {
        "entity_details": [
          { "entity_type": "PERSON", "entity_value": "John Doe", "masked_value": "<PERSON>", "confidence_score": 0.95, "start": 0, "end": 8 }
        ],
        "sanitized_response": "<PERSON> lives at 123 Main St"
      }
    },
    {
      "scanner_type": "prompt_injection_scanner",
      "prompt_injection_scanner_response": {
        "is_injection": false,
        "confidence_score": 0.03,
        "scanned_text": "User input to scan..."
      }
    }
  ]
}

Each scanner entry may include per-request overrides (e.g. "pii_scanner": { "confidence_threshold": 0.8, "entities": ["EMAIL_ADDRESS"] }) alongside scanner_type.
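
The request and response shapes above can be handled with a few small helpers. This is a minimal Python sketch: scanner_entry, build_scan_request, and summarize are hypothetical names, and keying the override object by the scanner type is an assumption generalized from the pii_scanner example above (other scanners may use different override keys).

```python
def scanner_entry(scanner_type, overrides=None):
    """Build one scanner entry, optionally with per-request overrides.

    Assumption: the override object is keyed by the scanner type,
    as in the pii_scanner example in this document.
    """
    entry = {"scanner_type": scanner_type}
    if overrides:
        entry[scanner_type] = overrides
    return entry


def build_scan_request(prompts, scanners):
    """Assemble the JSON body for POST /api/v1/scan."""
    return {"prompts": list(prompts), "scanners": list(scanners)}


def summarize(response):
    """Collect PII entity types and the injection verdict from a scan response."""
    summary = {"pii_entity_types": [], "is_injection": None}
    for item in response.get("scanner_responses", []):
        scanner_type = item.get("scanner_type")
        if scanner_type == "pii_scanner":
            details = item.get("pii_scanner_response", {}).get("entity_details", [])
            summary["pii_entity_types"] = [d["entity_type"] for d in details]
        elif scanner_type == "prompt_injection_scanner":
            body = item.get("prompt_injection_scanner_response", {})
            summary["is_injection"] = body.get("is_injection")
    return summary
```

For example, build_scan_request(["User input to scan..."], [scanner_entry("pii_scanner", {"confidence_threshold": 0.8}), scanner_entry("prompt_injection_scanner")]) produces a body in the same shape as the request shown above.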

Supported scanners

Scanner type	Model/Library	Detects
`pii_scanner`	Microsoft Presidio	50+ entity types including PERSON, ORG, LOCATION, PHONE_NUMBER, EMAIL, SSN, CREDIT_CARD, PASSPORT_NUMBER, US_DRIVER_LICENSE, IBAN_CODE, MEDICAL_LICENSE
`prompt_injection_scanner`	`qualifire/prompt-injection-sentinel` (HuggingFace ONNX)	Prompt injection likelihood score 0.0–1.0
`code_security_scanner`	Language classifier	Flags code that looks like exfiltration/exploitation payloads
`toxicity_scanner`	`unitary/toxic-bert` (ONNX)	Toxic / abusive content score 0.0–1.0
`invisible_text_scanner`	Pattern-based	Zero-width and other invisible Unicode characters

Invisible text scanner

The Guardrails Service includes a scanner that detects invisible Unicode characters used in prompt injection attacks:

\u200b  Zero-width space
\u200c  Zero-width non-joiner
\u200d  Zero-width joiner
\u2060  Word joiner
\ufeff  Byte order mark
\u00ad  Soft hyphen

These characters can be embedded in text that appears clean to the human eye but contains hidden instructions to the LLM.
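
To make the idea concrete, here is a small Python sketch that finds and strips the six characters listed above. It is an illustration of pattern-based detection, not the service's actual scanner; find_invisible and strip_invisible are hypothetical names.

```python
# The six code points listed in this document, written as escapes so they
# stay visible in source code.
INVISIBLE_CHARS = {
    "\u200b": "Zero-width space",
    "\u200c": "Zero-width non-joiner",
    "\u200d": "Zero-width joiner",
    "\u2060": "Word joiner",
    "\ufeff": "Byte order mark",
    "\u00ad": "Soft hyphen",
}


def find_invisible(text):
    """Return (index, code point, name) for each invisible character found."""
    return [
        (index, f"U+{ord(char):04X}", INVISIBLE_CHARS[char])
        for index, char in enumerate(text)
        if char in INVISIBLE_CHARS
    ]


def strip_invisible(text):
    """Return the text with the listed invisible characters removed."""
    return "".join(char for char in text if char not in INVISIBLE_CHARS)
```

Reporting the index and code point, rather than only a boolean, makes it easier to see where hidden characters were inserted when reviewing a flagged prompt.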

Model loading and caching

Models are downloaded once at service startup and cached on disk. The service is designed to load models in a non-blocking background thread, so it can accept requests before all models are ready; in this degraded mode it skips scanners whose models are not yet available. Model files are stored at /opt/guardrails/models/ on the service VM.
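
The loading behavior described above can be sketched with a background thread and a registry of ready models. This is a simplified Python illustration under stated assumptions, not the service's code: ModelRegistry and the loader callables are hypothetical, and real model loading is replaced by plain functions.

```python
import threading


class ModelRegistry:
    """Loads models off the request path and reports which scanners are ready."""

    def __init__(self, loaders):
        # loaders: scanner_type -> zero-argument callable returning a model
        self._loaders = dict(loaders)
        self._models = {}
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._load_all, daemon=True)

    def start(self):
        """Begin loading in the background and return immediately."""
        self._thread.start()

    def wait(self, timeout=None):
        """Block until loading finishes (useful in tests)."""
        self._thread.join(timeout)

    def _load_all(self):
        for scanner_type, load in self._loaders.items():
            try:
                model = load()
            except Exception:
                continue  # scanner stays unavailable; the others still load
            with self._lock:
                self._models[scanner_type] = model

    def available(self, requested):
        """Split requested scanner types into (ready, skipped)."""
        with self._lock:
            ready = [s for s in requested if s in self._models]
        skipped = [s for s in requested if s not in ready]
        return ready, skipped
```

A request handler built on this sketch would call available() with the requested scanner types, run only the ready ones, and report the skipped ones, which is one way to implement a degraded mode.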

Local development

Run the Guardrails Service locally with Docker:

cd guardrails-service
docker build -t guardrails-service .
docker run -p 8000:8000 guardrails-service

Or with the provided Docker Compose configuration:

cd guardrails-service
docker compose up -d

Then point Declaw at it:

export GUARDRAILS_URL=http://localhost:8000

Fallback behavior

If GUARDRAILS_URL is set but the service is unreachable:

The security proxy logs a warning
PII detection falls back to the built-in regex scanner (SSN, credit card, email, phone patterns)
Injection detection falls back to the built-in pattern library
No error is surfaced to the agent workload

The fallback is automatic and requires no code changes. The proxy checks liveness on each scan request with a 10-second timeout.

The built-in regex fallback does not support unstructured PII types like person_name. If your security policy depends on NER-based detection, monitor Guardrails Service availability and alert on fallback events in the audit log.
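
The fallback flow can be sketched as a try/except around the remote call. The Python below is illustrative only: the two regexes are simplified stand-ins rather than the built-in scanner's actual patterns, and scan_with_fallback, regex_scan, and the remote_scan callable are hypothetical names.

```python
import re

# Simplified stand-in patterns; the built-in scanner's real regexes
# are not shown in this document.
FALLBACK_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def regex_scan(text):
    """Find structured PII with regexes; cannot detect names or other NER types."""
    return [
        {"entity_type": name, "entity_value": match.group(0),
         "start": match.start(), "end": match.end()}
        for name, pattern in FALLBACK_PATTERNS.items()
        for match in pattern.finditer(text)
    ]


def scan_with_fallback(text, remote_scan, log_warning=print):
    """Try the remote service first; on a network failure, log and use regexes."""
    try:
        return {"source": "guardrails", "entities": remote_scan(text)}
    except OSError as exc:  # covers URLError, timeouts, and refused connections
        log_warning(f"guardrails unreachable, using regex fallback: {exc}")
        return {"source": "regex_fallback", "entities": regex_scan(text)}
```

Tagging each result with its source is one way to make fallback events easy to count and alert on.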
