Utilities > duckdb-web-archive-cdx
DuckDB extension to query the Internet Archive and CommonCrawl CDX APIs directly from SQL. (In Development)
duckdb-web-archive-cdx
DuckDB extension to query web archive CDX APIs directly from SQL.
Features
- Two data sources: Wayback Machine (1996-present) and Common Crawl (2008-present)
- Smart pushdowns:
SELECT,WHERE,LIMIT, andDISTINCT ONare pushed to the CDX API - Minimal requests: Only fetches what you need - no wasted bandwidth
- Response fetching: Archived page content is fetched only when
response.bodyis selected
Installation
INSTALL web_archive FROM community;
LOAD web_archive;
Quick Start: Track GitHub Stars Over Time
Query the Wayback Machine to see how DuckDB's GitHub stars grew over the years.
Using the html_query extension to extract data from archived HTML:
SELECT DISTINCT ON(year)
url[9:] as url,
strftime(timestamp::TIMESTAMP, '%Y-%m-%d') AS date,
html_query(response.body, '#repo-stars-counter-star', '@text') AS stars,
html_query(response.body, '#repo-network-counter', '@text') AS forks,
html_query(response.body, 'a[href="/duckdb/duckdb/releases"] span', '@text') AS releases,
html_query(response.body, 'a[href="/duckdb/duckdb/graphs/contributors"] span', '@text') AS contributors
FROM wayback_machine()
WHERE url = 'github.com/duckdb/duckdb'
AND timestamp > '2018-01-01'
AND statuscode = 200
AND mimetype = 'text/html'
ORDER BY date
LIMIT 10;
┌──────────────────────────┬────────────┬─────────┬─────────┬──────────┬──────────────┐
│ url │ date │ stars │ forks │ releases │ contributors │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├──────────────────────────┼────────────┼─────────┼─────────┼──────────┼──────────────┤
│ github.com/duckdb/duckdb │ 2021-04-06 │ NULL │ NULL │ 17 │ 59 │
│ github.com/duckdb/duckdb │ 2022-01-07 │ 4.1k │ 358 │ 22 │ 91 │
│ github.com/duckdb/duckdb │ 2023-01-08 │ 7.9k │ 745 │ 29 │ 170 │
│ github.com/duckdb/duckdb │ 2024-01-23 │ 14k │ 1.3k │ 36 │ 277 │
│ github.com/duckdb/duckdb │ 2025-01-04 │ 25.5k │ 2k │ 45 │ 389 │
└──────────────────────────┴────────────┴─────────┴─────────┴──────────┴──────────────┘
This query makes only 6 HTTP requests despite potentially matching thousands of archived snapshots:
| Request | URL |
|---|---|
| 1 | CDX API query with all filters and &collapse=timestamp:4&limit=10 |
| 2-6 | Fetch 5 archived pages (one per year) |
The extension automatically:
- Pushes
WHEREfilters (statuscode,mimetype) to the CDX API - Pushes
LIMIT 10to the API - Pushes
DISTINCT ON(year)as&collapse=timestamp:4(dedupe by year) - Only requests
response.bodyfor the 5 matching rows - Only fetches needed fields (
timestamp,original) via&fl=parameter
How Pushdown Works
1. Filter Pushdown (WHERE)
WHERE clauses are converted to CDX API parameters:
SELECT url FROM wayback_machine()
WHERE url LIKE 'example.com/%' -- Pushed as: url=example.com/*
AND statuscode = 200 -- Pushed as: &filter=statuscode:200
AND mimetype = 'text/html' -- Pushed as: &filter=mimetype:text/html
AND mimetype != 'application/pdf' -- Pushed as: &filter=!mimetype:application/pdf
AND timestamp > '2020-01-01' -- Pushed as: &from=20200101
LIMIT 10;
2. LIMIT Pushdown
LIMIT goes directly to the CDX API - no over-fetching:
-- CDX API receives &limit=5, returns exactly 5 records
SELECT url FROM wayback_machine()
WHERE url LIKE 'example.com/%'
LIMIT 5;
3. SELECT Pushdown (Projection)
Only requested columns are fetched via &fl= parameter:
-- Fast: CDX returns only url field (&fl=original)
SELECT url FROM wayback_machine()
WHERE url LIKE 'example.com/%' LIMIT 10;
-- Slower: Also downloads archived page content
SELECT url, response.body FROM wayback_machine()
WHERE url LIKE 'example.com/%' LIMIT 10;
4. DISTINCT ON Pushdown (Collapse)
DISTINCT ON is pushed as the CDX &collapse= parameter:
-- One snapshot per year: &collapse=timestamp:4
SELECT DISTINCT ON(year) url, timestamp
FROM wayback_machine() WHERE url = 'example.com';
-- One snapshot per year+month: &collapse=timestamp:6
SELECT DISTINCT ON(year, month) url, timestamp
FROM wayback_machine() WHERE url = 'example.com';
-- One per unique digest (content hash): &collapse=digest
SELECT DISTINCT ON(digest) url, timestamp
FROM wayback_machine() WHERE url LIKE 'example.com/%';
-- Multiple collapse params: &collapse=digest&collapse=statuscode
SELECT DISTINCT ON(digest, statuscode) url, timestamp
FROM wayback_machine() WHERE url LIKE 'example.com/%';
-- Prefix collapse (first 6 chars of urlkey): &collapse=urlkey:6
SELECT DISTINCT ON(urlkey[:6]) url, timestamp
FROM wayback_machine() WHERE url LIKE 'example.com/%';
Response Fetching
When you select response.body, the extension fetches archived pages:
-- No page fetching (CDX API only)
SELECT url, timestamp FROM wayback_machine()
WHERE url = 'example.com' LIMIT 5;
-- Fetches 5 archived pages from web.archive.org
SELECT url, timestamp, response.body FROM wayback_machine()
WHERE url = 'example.com' LIMIT 5;
Each archived page is fetched via:
https://web.archive.org/web/{timestamp}id_/{url}
Wayback Machine
-- Find archived snapshots
SELECT url, timestamp, statuscode
FROM wayback_machine()
WHERE url LIKE 'example.com/%'
AND statuscode = 200
LIMIT 10;
-- Get latest snapshot
SELECT url, timestamp, response.body
FROM wayback_machine()
WHERE url = 'example.com/about'
AND statuscode = 200
ORDER BY timestamp DESC
LIMIT 1;
Wayback Machine URL Patterns
WHERE url = 'example.com' -- Exact match
WHERE url LIKE 'example.com/%' -- Prefix match (paths)
WHERE url LIKE '%.example.com' -- Domain match (subdomains)
Wayback Machine Columns
| Column | Type | Description |
|---|---|---|
| url | VARCHAR | Original URL |
| urlkey | VARCHAR | SURT-formatted URL |
| timestamp | TIMESTAMP | Archive timestamp |
| year | INTEGER | Year extracted from timestamp |
| month | INTEGER | Month extracted from timestamp |
| statuscode | INTEGER | HTTP status code |
| mimetype | VARCHAR | Content MIME type |
| digest | VARCHAR | Content hash |
| length | BIGINT | Content length |
| response | STRUCT | Response with body and error fields |
Wayback Machine Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_results | BIGINT | 100 | Maximum results from CDX API |
| collapse | VARCHAR | - | Manual collapse field (prefer DISTINCT ON) |
| debug | BOOLEAN | false | Show cdx_url column with generated API URL |
| timeout | BIGINT | 180 | Timeout in seconds for fetching responses |
Common Crawl Index
-- Find HTML pages from a domain
SELECT url, timestamp, statuscode
FROM common_crawl_index()
WHERE url LIKE '%.example.com/%'
AND statuscode = 200
AND mimetype = 'text/html'
LIMIT 10;
-- Fetch page content (downloads from WARC files)
SELECT url, response.body
FROM common_crawl_index()
WHERE url LIKE 'https://example.com/%'
AND statuscode = 200
LIMIT 1;
-- Use specific crawl
SELECT url FROM common_crawl_index()
WHERE crawl_id = 'CC-MAIN-2025-47'
AND url LIKE '%.example.com/%'
LIMIT 10;
Common Crawl URL Patterns
WHERE url LIKE '%.example.com/%' -- Domain wildcard (subdomains)
WHERE url LIKE 'https://example.com/%' -- Prefix match
WHERE url SIMILAR TO '.*example\.com/$' -- Regex match
Common Crawl Columns
| Column | Type | Description |
|---|---|---|
| url | VARCHAR | Original URL |
| timestamp | TIMESTAMP | Crawl timestamp |
| crawl_id | VARCHAR | Crawl identifier (e.g., CC-MAIN-2025-47) |
| statuscode | INTEGER | HTTP status code |
| mimetype | VARCHAR | Content MIME type |
| digest | VARCHAR | Content hash |
| filename | VARCHAR | WARC filename |
| offset | BIGINT | Offset in WARC |
| length | BIGINT | Content length |
| response | STRUCT | Parsed WARC response (headers + body) |
| warc | STRUCT | WARC metadata |
Common Crawl Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_results | BIGINT | 100 | Maximum results from CDX API |
Examples
Track website changes over time
SELECT DISTINCT ON(year)
timestamp,
length,
digest
FROM wayback_machine()
WHERE url = 'example.com'
AND statuscode = 200
ORDER BY timestamp;
Find all unique domains from a TLD
SELECT DISTINCT regexp_extract(url, 'https?://([^/]+)', 1) as domain
FROM wayback_machine()
WHERE url LIKE '%.gov/%'
AND statuscode = 200
LIMIT 1000;
Export URLs to file
COPY (
SELECT url, timestamp
FROM common_crawl_index()
WHERE url LIKE '%.example.com/%'
AND statuscode = 200
LIMIT 10000
) TO 'urls.csv';
Data Sources
- Wayback Machine: https://web.archive.org - Continuous archiving since 1996
- Common Crawl: https://commoncrawl.org - Monthly web crawls since 2008
CDX API Documentation
- CDX Server API Reference: https://github.com/webrecorder/pywb/wiki/CDX-Server-API
- Common Crawl Index API: https://index.commoncrawl.org
- Wayback Machine CDX API: https://web.archive.org/cdx/search/cdx
License
MIT