Utilities > duckdb_warc
DuckDB extension to query WARC files. (In Development)
DuckDB WARC Extension
A DuckDB extension for parsing WARC (Web ARChive) records. Designed for use with Common Crawl's columnar index for efficient selective record fetching.
Features
parse_warc(BLOB|VARCHAR)scalar function to parse WARC records- Returns structured data: WARC headers, HTTP headers, and body
- Auto-detects gzip compression
- Works with Common Crawl byte-range fetching workflow
Installation
-- Load the extension
LOAD './build/release/extension/warc/warc.duckdb_extension';
Building
make configure
make release
Usage
parse_warc() Function
Parses a WARC record and returns a struct with:
| Field | Type | Description |
|---|---|---|
warc_version |
VARCHAR | WARC format version (e.g., "1.0") |
warc_headers |
VARCHAR | JSON object of WARC headers |
http_version |
VARCHAR | HTTP version (e.g., "HTTP/1.1") |
http_status |
INTEGER | HTTP status code (e.g., 200) |
http_headers |
VARCHAR | JSON object of HTTP headers |
http_body |
BLOB | Response body content (binary) |
Examples
Parse a local WARC file:
SELECT parse_warc(content) FROM read_blob('record.warc.gz');
Parse uncompressed WARC from text:
SELECT parse_warc(content) FROM read_text('record.warc');
Extract specific fields:
SELECT
(parse_warc(content)).http_status,
(parse_warc(content)).http_body
FROM read_blob('record.warc.gz');
Common Crawl Workflow
The recommended workflow for Common Crawl is:
- Query the columnar index (Parquet) to find records
- Fetch only the specific byte ranges you need
- Parse with this extension
-- Step 1: Query Common Crawl index to find a URL
-- (Use their Parquet index at s3://commoncrawl/cc-index/...)
-- Step 2: Download specific byte range
-- curl -r OFFSET-END https://data.commoncrawl.org/FILENAME > record.warc.gz
-- Step 3: Parse the record
SELECT
(parse_warc(content)).http_status,
(parse_warc(content)).http_body
FROM read_blob('record.warc.gz');
Example: Fetch example.com from Common Crawl
# Download only 945 bytes instead of 1.1GB WARC file
curl -s -r"46376769-46377713" \
"https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-47/segments/1762439342185.16/warc/CC-MAIN-20251106200718-20251106230718-00970.warc.gz" \
> record.warc.gz
SELECT
(parse_warc(content)).warc_version,
(parse_warc(content)).http_status,
(parse_warc(content)).http_body
FROM read_blob('record.warc.gz');
Output:
┌──────────────┬─────────────┬─────────────────────────────────────────────────┐
│ warc_version │ http_status │ http_body │
├──────────────┼─────────────┼─────────────────────────────────────────────────┤
│ 1.0 │ 200 │ <!doctype html><html lang="en"><head><title>... │
└──────────────┴─────────────┴─────────────────────────────────────────────────┘
Schema
parse_warc(BLOB) -> STRUCT(
warc_version VARCHAR,
warc_headers VARCHAR, -- JSON: {"WARC-Type": "response", "WARC-Date": "...", ...}
http_version VARCHAR,
http_status INTEGER,
http_headers VARCHAR, -- JSON: {"content-type": "text/html", ...}
http_body BLOB -- Binary body (use decode(http_body) for text)
)
Technical Details
- Built with Rust using the
warccrate (v0.4.0) - Uses
flate2for gzip decompression - Auto-detects compressed vs uncompressed input
- Compatible with DuckDB v1.4.2
License
MIT