What You’ll Learn
- The streaming rehydration challenge: tokens split across SSE chunk boundaries
- Policy configuration for streaming-compatible PII rehydration
- The proxy’s five-step buffering strategy (accumulate, detect, flush, replace, end)
- A concrete walkthrough of a 7-chunk SSE stream with token restoration
- Latency considerations for streaming vs. non-streaming modes
The Challenge
LLM APIs stream responses token-by-token via Server-Sent Events (SSE). A redaction token like [REDACTED_EMAIL_1] may be split across multiple chunks, so no single chunk contains the full token and a chunk-by-chunk search-and-replace would miss it.
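As a hypothetical illustration, the sketch below splits one response across three invented chunks and applies a naive per-chunk replacement. The chunk boundaries, the mapping, and the email address are assumptions made up for this example and are not taken from a real stream.

```python
# Hypothetical data: token, mapping, and chunk boundaries are invented.
token = "[REDACTED_EMAIL_1]"
mapping = {token: "alice@example.com"}
chunks = ["Reach me at [REDAC", "TED_EMA", "IL_1] any time."]

# Naive approach: replace the token inside each chunk independently.
naive = "".join(chunk.replace(token, mapping[token]) for chunk in chunks)

# No individual chunk contains the whole token, so nothing is replaced.
print(naive)
```

Because the token only exists once the chunks are joined, the naive output still contains the unrestored token, which is why the proxy has to buffer across chunk boundaries.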
PII scanning via the guardrails service is rolling out. This example demonstrates the SDK API and explains the expected behavior once the service is active on your account.
Prerequisites
This example is available in Python. TypeScript version coming soon.
Code Walkthrough
Policy configuration: streaming rehydration uses the same rehydrate_response=True flag.
The Proxy’s Buffering Strategy
When an SSE stream is active, the proxy applies a five-step strategy:
- Buffer accumulation: As SSE chunks arrive, the proxy accumulates text in an internal buffer rather than forwarding it immediately.
- Token boundary detection: The proxy scans the buffer for complete redaction tokens (e.g., [REDACTED_EMAIL_1]). It also checks for partial token prefixes at the buffer tail.
- Safe flush: Text confirmed not to contain complete or partial tokens is flushed to the client. Text that might be part of a token is held in the buffer.
- Token replacement: When a complete token is detected, the proxy replaces it with the original PII value from its mapping table and flushes the restored text.
- Stream end: When the SSE stream ends, any remaining buffered text is flushed, since no further chunks can arrive to complete a partial token.
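The five steps above can be sketched in a few lines of Python. This is a minimal illustration and not the proxy's actual implementation: the class name StreamRehydrator, the helper could_be_token_prefix, the assumed token grammar ([REDACTED_<TYPE>_<N>]), and the sample chunks and mapping are all hypothetical.

```python
import re

# Assumed token shape, e.g. [REDACTED_EMAIL_1]; the real grammar may differ.
TOKEN_RE = re.compile(r"\[REDACTED_[A-Z_]+_\d+\]")
TOKEN_PREFIX = "[REDACTED_"


def could_be_token_prefix(tail: str) -> bool:
    """Return True if `tail` might still grow into a complete token."""
    if len(tail) <= len(TOKEN_PREFIX):
        return TOKEN_PREFIX.startswith(tail)
    if not tail.startswith(TOKEN_PREFIX):
        return False
    # Deliberately permissive: holding back too much only delays output.
    return re.fullmatch(r"[A-Z_]*\d*", tail[len(TOKEN_PREFIX):]) is not None


class StreamRehydrator:
    """Hypothetical sketch of the buffering strategy described above."""

    def __init__(self, mapping: dict[str, str]) -> None:
        self.mapping = mapping  # redaction token -> original PII value
        self.buffer = ""

    def feed(self, chunk: str) -> str:
        self.buffer += chunk                              # buffer accumulation
        text = TOKEN_RE.sub(self._restore, self.buffer)   # detection + replacement
        hold_from = self._partial_start(text)             # partial prefix at tail?
        self.buffer = text[hold_from:]                    # keep the risky tail
        return text[:hold_from]                           # safe flush

    def finish(self) -> str:
        # Stream end: nothing more can arrive, so flush whatever is left.
        out, self.buffer = self.buffer, ""
        return out

    def _restore(self, match: re.Match) -> str:
        # Unknown tokens are passed through unchanged.
        return self.mapping.get(match.group(0), match.group(0))

    @staticmethod
    def _partial_start(text: str) -> int:
        i = text.rfind("[")
        if i != -1 and could_be_token_prefix(text[i:]):
            return i
        return len(text)


# Invented example stream: one token split across four of seven chunks.
mapping = {"[REDACTED_EMAIL_1]": "alice@example.com"}
rehydrator = StreamRehydrator(mapping)
chunks = ["Contact ", "[REDA", "CTED_", "EMAIL", "_1]", " for", " details."]
outputs = [rehydrator.feed(c) for c in chunks]
outputs.append(rehydrator.finish())
print(outputs)
print("".join(outputs))
```

In this sketch only the chunks that could belong to a token are held back, and everything else is forwarded as soon as it arrives. A production version would likely also cap how much text it is willing to hold, so that a stray "[REDACTED_" prefix cannot stall the stream indefinitely.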
Concrete Example
Suppose sandbox code calls an LLM API with:
Latency Considerations
Streaming rehydration introduces a small latency cost:
- Buffering delays delivery of chunks near token boundaries
- Most chunks (those without token prefixes) pass through with negligible delay
- The added latency is typically less than 50ms per token boundary
To avoid this overhead, set rehydrate_response=False. The proxy will then not buffer or scan response chunks.