duckdb-web-archive-cdx

DuckDB extension to query the Internet Archive and Common Crawl CDX APIs directly from SQL. (In development)

Features

  • Two data sources: Wayback Machine (1996-present) and Common Crawl (2008-present)
  • Smart pushdowns: SELECT, WHERE, LIMIT, and DISTINCT ON are pushed to the CDX API
  • Minimal requests: Only fetches what you need - no wasted bandwidth
  • Response fetching: Archived page content is fetched only when response.body is selected

Installation

INSTALL web_archive FROM community;
LOAD web_archive;

Quick Start: Track GitHub Stars Over Time

Query the Wayback Machine to see how DuckDB's GitHub stars grew over the years.

Using the html_query extension to extract data from archived HTML:

SELECT DISTINCT ON(year)
    url[9:] AS url,  -- strip the leading 'https://'
    strftime(timestamp::TIMESTAMP, '%Y-%m-%d') AS date,
    html_query(response.body, '#repo-stars-counter-star', '@text') AS stars,
    html_query(response.body, '#repo-network-counter', '@text') AS forks,
    html_query(response.body, 'a[href="/duckdb/duckdb/releases"] span', '@text') AS releases,
    html_query(response.body, 'a[href="/duckdb/duckdb/graphs/contributors"] span', '@text') AS contributors
FROM wayback_machine()
WHERE url = 'github.com/duckdb/duckdb'
  AND timestamp > '2018-01-01'
  AND statuscode = 200
  AND mimetype = 'text/html'
ORDER BY date
LIMIT 10;
┌──────────────────────────┬────────────┬─────────┬─────────┬──────────┬──────────────┐
│           url            │    date    │  stars  │  forks  │ releases │ contributors │
│         varchar          │  varchar   │ varchar │ varchar │ varchar  │   varchar    │
├──────────────────────────┼────────────┼─────────┼─────────┼──────────┼──────────────┤
│ github.com/duckdb/duckdb │ 2021-04-06 │ NULL    │ NULL    │ 17       │ 59           │
│ github.com/duckdb/duckdb │ 2022-01-07 │ 4.1k    │ 358     │ 22       │ 91           │
│ github.com/duckdb/duckdb │ 2023-01-08 │ 7.9k    │ 745     │ 29       │ 170          │
│ github.com/duckdb/duckdb │ 2024-01-23 │ 14k     │ 1.3k    │ 36       │ 277          │
│ github.com/duckdb/duckdb │ 2025-01-04 │ 25.5k   │ 2k      │ 45       │ 389          │
└──────────────────────────┴────────────┴─────────┴─────────┴──────────┴──────────────┘

This query makes only 6 HTTP requests despite potentially matching thousands of archived snapshots:

Request   URL
1         CDX API query with all filters and &collapse=timestamp:4&limit=10
2-6       Fetch 5 archived pages (one per year)

The extension automatically:

  • Pushes WHERE filters (statuscode, mimetype) to the CDX API
  • Pushes LIMIT 10 to the API
  • Pushes DISTINCT ON(year) as &collapse=timestamp:4 (dedupe by year)
  • Only requests response.body for the 5 matching rows
  • Only fetches needed fields (timestamp, original) via &fl= parameter
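
To see exactly what was sent, the debug parameter (documented under Wayback Machine Parameters below) adds a cdx_url column containing the generated API URL. A minimal sketch, assuming DuckDB's named-argument syntax:

SELECT cdx_url
FROM wayback_machine(debug := true)
WHERE url = 'github.com/duckdb/duckdb'
LIMIT 1;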

How Pushdown Works

1. Filter Pushdown (WHERE)

WHERE clauses are converted to CDX API parameters:

SELECT url FROM wayback_machine()
WHERE url LIKE 'example.com/%'      -- Pushed as: url=example.com/*
  AND statuscode = 200              -- Pushed as: &filter=statuscode:200
  AND mimetype = 'text/html'        -- Pushed as: &filter=mimetype:text/html
  AND mimetype != 'application/pdf' -- Pushed as: &filter=!mimetype:application/pdf
  AND timestamp > '2020-01-01'      -- Pushed as: &from=20200101
LIMIT 10;
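
For the query above, all five predicates collapse into a single CDX request. Illustrative only - assuming the public Wayback CDX endpoint and the parameters named in the comments, the combined request would look roughly like:

https://web.archive.org/cdx/search/cdx?url=example.com/*&filter=statuscode:200&filter=mimetype:text/html&filter=!mimetype:application/pdf&from=20200101&limit=10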

2. LIMIT Pushdown

LIMIT goes directly to the CDX API - no over-fetching:

-- CDX API receives &limit=5, returns exactly 5 records
SELECT url FROM wayback_machine()
WHERE url LIKE 'example.com/%'
LIMIT 5;

3. SELECT Pushdown (Projection)

Only requested columns are fetched via &fl= parameter:

-- Fast: CDX returns only url field (&fl=original)
SELECT url FROM wayback_machine()
WHERE url LIKE 'example.com/%' LIMIT 10;

-- Slower: Also downloads archived page content
SELECT url, response.body FROM wayback_machine()
WHERE url LIKE 'example.com/%' LIMIT 10;

4. DISTINCT ON Pushdown (Collapse)

DISTINCT ON is pushed as the CDX &collapse= parameter. CDX timestamps use the 14-digit YYYYMMDDhhmmss format, so collapsing on the first 4 characters groups by year and on the first 6 by year and month:

-- One snapshot per year: &collapse=timestamp:4
SELECT DISTINCT ON(year) url, timestamp
FROM wayback_machine() WHERE url = 'example.com';

-- One snapshot per year+month: &collapse=timestamp:6
SELECT DISTINCT ON(year, month) url, timestamp
FROM wayback_machine() WHERE url = 'example.com';

-- One per unique digest (content hash): &collapse=digest
SELECT DISTINCT ON(digest) url, timestamp
FROM wayback_machine() WHERE url LIKE 'example.com/%';

-- Multiple collapse params: &collapse=digest&collapse=statuscode
SELECT DISTINCT ON(digest, statuscode) url, timestamp
FROM wayback_machine() WHERE url LIKE 'example.com/%';

-- Prefix collapse (first 6 chars of urlkey): &collapse=urlkey:6
SELECT DISTINCT ON(urlkey[:6]) url, timestamp
FROM wayback_machine() WHERE url LIKE 'example.com/%';

Response Fetching

When you select response.body, the extension fetches archived pages:

-- No page fetching (CDX API only)
SELECT url, timestamp FROM wayback_machine()
WHERE url = 'example.com' LIMIT 5;

-- Fetches 5 archived pages from web.archive.org
SELECT url, timestamp, response.body FROM wayback_machine()
WHERE url = 'example.com' LIMIT 5;

Each archived page is fetched via the Wayback replay endpoint; the id_ flag requests the original response body without the replay UI's rewriting:

https://web.archive.org/web/{timestamp}id_/{url}

Wayback Machine

-- Find archived snapshots
SELECT url, timestamp, statuscode
FROM wayback_machine()
WHERE url LIKE 'example.com/%'
  AND statuscode = 200
LIMIT 10;

-- Get latest snapshot
SELECT url, timestamp, response.body
FROM wayback_machine()
WHERE url = 'example.com/about'
  AND statuscode = 200
ORDER BY timestamp DESC
LIMIT 1;

Wayback Machine URL Patterns

WHERE url = 'example.com'           -- Exact match
WHERE url LIKE 'example.com/%'      -- Prefix match (paths)
WHERE url LIKE '%.example.com'      -- Domain match (subdomains)

Wayback Machine Columns

Column      Type        Description
url         VARCHAR     Original URL
urlkey      VARCHAR     SURT-formatted URL
timestamp   TIMESTAMP   Archive timestamp
year        INTEGER     Year extracted from timestamp
month       INTEGER     Month extracted from timestamp
statuscode  INTEGER     HTTP status code
mimetype    VARCHAR     Content MIME type
digest      VARCHAR     Content hash
length      BIGINT      Content length
response    STRUCT      Response with body and error fields

Wayback Machine Parameters

Parameter    Type      Default   Description
max_results  BIGINT    100       Maximum results from CDX API
collapse     VARCHAR   -         Manual collapse field (prefer DISTINCT ON)
debug        BOOLEAN   false     Show cdx_url column with generated API URL
timeout      BIGINT    180       Timeout in seconds for fetching responses
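
A sketch combining these parameters (named-argument syntax and the parameter names above assumed):

SELECT url, timestamp, cdx_url
FROM wayback_machine(max_results := 500, timeout := 60, debug := true)
WHERE url = 'example.com'
LIMIT 5;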

Common Crawl Index

-- Find HTML pages from a domain
SELECT url, timestamp, statuscode
FROM common_crawl_index()
WHERE url LIKE '%.example.com/%'
  AND statuscode = 200
  AND mimetype = 'text/html'
LIMIT 10;

-- Fetch page content (downloads from WARC files)
SELECT url, response.body
FROM common_crawl_index()
WHERE url LIKE 'https://example.com/%'
  AND statuscode = 200
LIMIT 1;

-- Use specific crawl
SELECT url FROM common_crawl_index()
WHERE crawl_id = 'CC-MAIN-2025-47'
  AND url LIKE '%.example.com/%'
LIMIT 10;
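
Fetching response.body from Common Crawl presumably combines the filename, offset, and length columns into an HTTP range request against the public WARC store - conceptually:

https://data.commoncrawl.org/{filename}
Range: bytes={offset}-{offset + length - 1}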

Common Crawl URL Patterns

WHERE url LIKE '%.example.com/%'        -- Domain wildcard (subdomains)
WHERE url LIKE 'https://example.com/%'  -- Prefix match
WHERE url SIMILAR TO '.*example\.com/$' -- Regex match

Common Crawl Columns

Column      Type        Description
url         VARCHAR     Original URL
timestamp   TIMESTAMP   Crawl timestamp
crawl_id    VARCHAR     Crawl identifier (e.g., CC-MAIN-2025-47)
statuscode  INTEGER     HTTP status code
mimetype    VARCHAR     Content MIME type
digest      VARCHAR     Content hash
filename    VARCHAR     WARC filename
offset      BIGINT      Offset in WARC
length      BIGINT      Content length
response    STRUCT      Parsed WARC response (headers + body)
warc        STRUCT      WARC metadata

Common Crawl Parameters

Parameter    Type     Default   Description
max_results  BIGINT   100       Maximum results from CDX API
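
Assuming the same named-argument syntax as wayback_machine() - a sketch:

SELECT url FROM common_crawl_index(max_results := 1000)
WHERE url LIKE '%.example.com/%';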

Examples

Track website changes over time

SELECT DISTINCT ON(year)
    timestamp,
    length,
    digest
FROM wayback_machine()
WHERE url = 'example.com'
  AND statuscode = 200
ORDER BY timestamp;
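
To flag only the years where the page actually changed, compare each digest to the previous one - a sketch using a window function over the collapsed snapshots:

SELECT timestamp,
       digest IS DISTINCT FROM lag(digest) OVER (ORDER BY timestamp) AS changed
FROM (
    SELECT DISTINCT ON(year) timestamp, digest
    FROM wayback_machine()
    WHERE url = 'example.com'
      AND statuscode = 200
    ORDER BY timestamp
);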

Find all unique domains from a TLD

SELECT DISTINCT regexp_extract(url, 'https?://([^/]+)', 1) as domain
FROM wayback_machine()
WHERE url LIKE '%.gov/%'
  AND statuscode = 200
LIMIT 1000;

Export URLs to file

COPY (
  SELECT url, timestamp
  FROM common_crawl_index()
  WHERE url LIKE '%.example.com/%'
    AND statuscode = 200
  LIMIT 10000
) TO 'urls.csv';
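
The same pattern works for any format DuckDB's COPY supports - for example, Parquet:

COPY (
  SELECT url, timestamp
  FROM common_crawl_index()
  WHERE url LIKE '%.example.com/%'
    AND statuscode = 200
  LIMIT 10000
) TO 'urls.parquet' (FORMAT PARQUET);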

Data Sources

CDX API Documentation

License

MIT
