Web Scraping with Network Policy

What You’ll Learn

How to create a sandbox with network={"allow_out": [...]} for domain-level scraping control
How to upload and run a scraping script using Python’s urllib (stdlib — no pip installs needed)
How to verify that allowed domains are reachable and blocked domains are not
The pattern for containing scrapers that should only access specific sources

Prerequisites

Declaw running locally or in the cloud (see Deployment)
DECLAW_API_KEY and DECLAW_DOMAIN set in your environment
Outbound network access from your Declaw instance to httpbin.org

This example is available in Python. TypeScript support coming soon.

Code Walkthrough

1. Create the sandbox with a network allow-list

from declaw import Sandbox

sbx = Sandbox.create(
    template="python",
    timeout=300,
    network={"allow_out": ["httpbin.org"]},
)

Only traffic destined for httpbin.org is allowed. All other outbound connections — including DNS for other domains and direct IP connections — are blocked by the TCP proxy.
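
The allow-list holds domain names rather than full URLs. If your scrape targets are stored as URLs, a small helper can derive the list. This is a sketch: allow_list_from_urls is a hypothetical name, not part of the Declaw SDK, and it assumes allow_out expects bare hostnames as in the example above.

```python
from urllib.parse import urlsplit


def allow_list_from_urls(urls):
    """Return the sorted, de-duplicated hostnames found in `urls`."""
    hosts = set()
    for url in urls:
        # .hostname is lowercased and excludes port and credentials
        host = urlsplit(url).hostname
        if host is None:
            raise ValueError(f"no hostname in URL: {url!r}")
        hosts.add(host)
    return sorted(hosts)


targets = ["http://httpbin.org/get", "https://httpbin.org:443/headers"]
network = {"allow_out": allow_list_from_urls(targets)}
print(network)  # {'allow_out': ['httpbin.org']}
```

The resulting dict can be passed as the network argument to Sandbox.create in place of the literal shown above.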

2. The scraper script

The scraper uses Python’s built-in urllib — no third-party packages required. The sandbox’s base Ubuntu image already has Python 3 installed:

SCRAPER_SCRIPT = """\
import urllib.request
import json

url = "http://httpbin.org/get"
print(f"Fetching: {url}")

req = urllib.request.Request(url, headers={"User-Agent": "Declaw-Sandbox/1.0"})
with urllib.request.urlopen(req, timeout=10) as resp:
    body = resp.read().decode("utf-8")
    data = json.loads(body)
    print(f"Status: {resp.status}")
    print(f"Origin IP: {data.get('origin', 'unknown')}")
    print(f"Headers sent: {json.dumps(data.get('headers', {}), indent=2)}")
    print("SUCCESS: Allowed domain is reachable")
"""
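
The script above lets any failure surface as a traceback. If the scraper should instead report why a fetch failed, a sketch like the following separates the common cases. describe_failure and fetch are hypothetical helper names; only the urllib and socket calls are standard library.

```python
import socket
import urllib.error
import urllib.request


def describe_failure(exc):
    """Map an exception raised during a fetch to a short description."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return f"HTTP error {exc.code}: server was reached but returned an error"
    if isinstance(exc, urllib.error.URLError):
        return f"unreachable: {exc.reason}"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timed out"
    return f"unexpected: {exc!r}"


def fetch(url, timeout=10):
    """Return the response body as text, or None if the fetch failed."""
    req = urllib.request.Request(url, headers={"User-Agent": "Declaw-Sandbox/1.0"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except Exception as exc:
        print(f"FAILED: {describe_failure(exc)}")
        return None
```

Inside a sandbox with a restrictive allow-list, a request to a domain outside the list would be expected to end in the "unreachable" or "timed out" branch rather than return a body.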

3. Prove blocked domains are unreachable

Use a TCP socket test rather than an HTTP request — the block applies at the TCP layer, so even raw socket connections to blocked IPs are refused:

BLOCKED_SOCKET_TEST = """\
import socket

target = "93.184.216.34"  # example.com IP
port = 80
try:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    s.connect((target, port))
    s.close()
    print("CONNECTED")
except Exception as e:
    print(f"BLOCKED: {e}")
"""
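
The run step in section 4 also uploads ALLOWED_SOCKET_TEST, which this walkthrough does not show. A minimal sketch, assuming it mirrors the blocked test but resolves httpbin.org first and prints the lines shown under Expected Output:

```python
ALLOWED_SOCKET_TEST = """\
import socket

host = "httpbin.org"
port = 80
try:
    # Resolve inside the sandbox so DNS for the allowed domain is exercised too
    ip = socket.gethostbyname(host)
    print(f"Resolved {host} to {ip}")
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    s.connect((ip, port))
    s.close()
    print("CONNECTED: Allowed domain is reachable at TCP level")
except Exception as e:
    print(f"BLOCKED: {e}")
"""
```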

4. Upload and run all three tests

try:
    # Test 1: Scrape the allowed domain
    sbx.files.write("/home/user/scraper.py", SCRAPER_SCRIPT)
    result = sbx.commands.run("python3 /home/user/scraper.py 2>&1")
    print(result.stdout)

    # Test 2: TCP socket to allowed domain (httpbin.org)
    sbx.files.write("/home/user/allowed_test.py", ALLOWED_SOCKET_TEST)
    result2 = sbx.commands.run("python3 /home/user/allowed_test.py 2>&1")
    print(result2.stdout)

    # Test 3: TCP socket to blocked domain (example.com)
    sbx.files.write("/home/user/blocked_test.py", BLOCKED_SOCKET_TEST)
    result3 = sbx.commands.run("python3 /home/user/blocked_test.py 2>&1")
    print(result3.stdout)
finally:
    sbx.kill()
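
To turn the printed results into a pass/fail check, the two socket test outputs can be inspected for the CONNECTED and BLOCKED markers that the scripts print. This is a sketch; policy_held is a hypothetical helper, and it relies only on those marker strings.

```python
def policy_held(allowed_stdout, blocked_stdout):
    """True if the allowed target connected and the blocked target did not."""
    allowed_lines = allowed_stdout.splitlines()
    blocked_lines = blocked_stdout.splitlines()
    allowed_ok = any(line.startswith("CONNECTED") for line in allowed_lines)
    blocked_ok = (
        any(line.startswith("BLOCKED") for line in blocked_lines)
        and not any(line.startswith("CONNECTED") for line in blocked_lines)
    )
    return allowed_ok and blocked_ok


# Sample strings taken from the markers the test scripts print
print(policy_held("CONNECTED: Allowed domain is reachable at TCP level\n",
                  "BLOCKED: timed out\n"))  # True
print(policy_held("CONNECTED\n", "CONNECTED\n"))  # False
```

In the run step above, this would be called as policy_held(result2.stdout, result3.stdout) before the sandbox is killed.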

Expected Output

The banner lines (--- ... ---), the "Attempting TCP connection" line, and the summary block are not printed by the snippets in the walkthrough, so running the code exactly as shown produces only the script output. IP addresses and the exact error text on the blocked connection will vary.

--- Creating Sandbox with Network Policy ---
  Policy: allow only httpbin.org outbound
  Sandbox created: sbx-abc123

--- Scraping Allowed Domain (httpbin.org) ---
Fetching: http://httpbin.org/get
Status: 200
Origin IP: 203.0.113.42
Headers sent: {
  "Host": "httpbin.org",
  "User-Agent": "Declaw-Sandbox/1.0"
}
SUCCESS: Allowed domain is reachable

--- TCP Socket Test: Allowed Domain (httpbin.org) ---
Resolved httpbin.org to 54.243.149.112
CONNECTED: Allowed domain is reachable at TCP level

--- TCP Socket Test: Blocked Domain (example.com) ---
Attempting TCP connection to 93.184.216.34:80 (example.com)...
BLOCKED: [Errno 110] Connection timed out

--- Network Policy Summary ---
  httpbin.org:  ALLOWED (network policy permits this domain)
  example.com:  BLOCKED (not in the allow list)

Use Cases

Price monitoring: Allow only the target retailer’s domain. The scraper cannot exfiltrate data to other servers or call home.
News aggregation: Allowlist a set of news site domains. Even if the scraped page contains malicious JavaScript or links, the sandbox cannot follow them to unauthorized destinations.
Competitive intelligence: Restrict the scraper to a defined list of competitor domains. Any unexpected outbound connection is blocked automatically.

Domain Allowlist vs IP Allowlist

The network policy uses domain names, not IP addresses. The proxy resolves the domain to an IP at connection time and enforces the rule at the TCP layer. This means:

allow_out: ["httpbin.org"] permits connections to any IP that httpbin.org resolves to
Direct IP connections (like 93.184.216.34) are blocked unless an allowlisted domain resolves to that IP at the time of the connection
CDNs and load balancers that share IPs across domains are handled correctly — the proxy checks the SNI (TLS) or Host header (HTTP) rather than just the IP
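
To make the last point concrete, here is a conceptual sketch of a host-based check for plain HTTP. It is not Declaw's implementation: both function names are hypothetical, and exact, case-insensitive matching is an assumption, since this page does not say how subdomains are treated.

```python
from typing import Iterable, Optional


def host_from_http_request(raw: bytes) -> Optional[str]:
    """Return the Host header value (port removed) from a raw HTTP/1.x request."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    for line in head.split("\r\n")[1:]:  # skip the request line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            # Drop an optional :port suffix (IPv6 literals are not handled here)
            return value.strip().lower().split(":", 1)[0]
    return None


def is_allowed(host: Optional[str], allow_out: Iterable[str]) -> bool:
    """Assumed rule: the host must exactly match an allow-list entry."""
    return host is not None and host in {d.lower() for d in allow_out}


allow_out = ["httpbin.org"]
# Two requests that could arrive at the same shared IP address
requests = [
    b"GET /get HTTP/1.1\r\nHost: httpbin.org\r\n\r\n",
    b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n",
]
for raw in requests:
    host = host_from_http_request(raw)
    print(host, "ALLOWED" if is_allowed(host, allow_out) else "BLOCKED")
# httpbin.org ALLOWED
# example.com BLOCKED
```

The decision depends on the name the client asked for, not on the destination IP, which is why two sites behind one CDN address can be treated differently.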

For CIDR-based rules or more fine-grained control, see Network Policies.
