
Web Archiving

An effort to preserve the Web for future generations.


Tools & Software

Acquisition

ArchiveBox

A tool which maintains an additive archive from RSS feeds, bookmarks, and links using wget, Chrome headless, and other methods (formerly Bookmark Archiver). (In Development)

archivenow

A Python library to push web resources into on-demand web archives. (Stable)

Auto Archiver 1.1k updated 8d ago

Python script to automatically archive social media posts, videos, and images from a Google Sheets document. Read the article about Auto Archiver on bellingcat.com.

Browsertrix Crawler 1.0k updated yesterday

A Chromium-based, high-fidelity crawling system designed to run a complex, customizable browser-based crawl in a single Docker container. (Stable)

Brozzler

A distributed web crawler (爬虫) that uses a real browser (Chrome or Chromium) to fetch pages and embedded URLs and to extract links. (Stable)

Cairn

An npm package and CLI tool for saving webpages. (Stable)

Chronicler

Web browser with record and replay functionality. (In Development)

crau 64 updated 3y ago

crau is the way (most) Brazilians pronounce "crawl". It's the easiest command-line tool for archiving the Web and playing archives: you just need a list of URLs. (Stable)

Crawl

A simple web crawler in Golang. (Stable)

crocoite 46 (archived)

Crawl websites using headless Google Chrome/Chromium and save resources, static DOM snapshot and page screenshots to WARC files. (In Development)

DiskerNet 3.9k updated 4d ago

A non-WARC-based tool which hooks into the Chrome browser and archives everything you browse making it available for offline replay. (In Development)

F(b)arc

A command-line tool and Python library for archiving data from Facebook using the Graph API. (Stable)

freeze-dry 301 updated 3y ago

JavaScript library to turn a page into a static, self-contained HTML document; useful for browser extensions. (In Development)

grab-site 1.6k updated 10mo ago

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns. (Stable)

Heritrix Q&A 3.2k updated 16d ago

A discussion forum for asking questions and getting answers about using Heritrix.

Heritrix Walkthrough 10 updated 9y ago

(In Development)

html2warc 23 updated 2y ago

A simple script to convert offline data into a single WARC file. (Stable)

monolith

CLI tool to save a web page as a single HTML file. (Stable)

Obelisk 309 updated 1mo ago

Go package and CLI tool for saving a web page as a single HTML file. (Stable)

Scoop

High-fidelity, browser-based, single-page web archiving library and CLI for witnessing the web. (Stable)

SingleFile

Browser extension for Firefox/Chrome and CLI tool to save a faithful copy of a complete page as a single HTML file. (Stable)

Social Feed Manager

Open source software that enables users to create social media collections from Twitter, Tumblr, Flickr, and Sina Weibo public APIs. (Stable)

Squidwarc

An open source, high-fidelity, page interacting archival crawler that uses Chrome or Chrome Headless directly. (In Development)

twarc 1.4k updated 4mo ago

A command line tool and Python library for archiving Twitter JSON data. (Stable)

WAIL 394 updated 1y ago

A graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages; Python, Electron. (Stable)

Warcprox

WARC-writing MITM HTTP/S proxy. (Stable)

WARCreate

A Google Chrome extension for archiving an individual webpage or website to a WARC file. (Stable)

Warcworker

An open source, dockerized, queued, high fidelity web archiver based on Squidwarc with a simple web GUI. (Stable)

Wayback 2.2k updated 2d ago

A toolkit for snapshotting webpages to the Internet Archive, archive.today, IPFS, and beyond. (Stable)

Waybackpy 567 updated 2y ago

A Python interface and command-line tool for the Wayback Machine Save, CDX, and availability APIs. (Stable)

Web2Warc 25 updated 8y ago

An easy-to-use and highly customizable crawler that enables anyone to create their own little Web archives (WARC/CDX). (Stable)

Wget-lua 24 updated 10y ago

Wget with Lua extension. (Stable)

Wpull 604 updated 1y ago

A Wget-compatible (or remake/clone/replacement/alternative) web downloader and crawler. (Stable)

hyphe 374 updated 8d ago

A web crawler built for research use, with a graphical user interface for building web corpora made of lists of web actors and maps of the links between them. (Stable)

Mink 58 updated 7mo ago

A Google Chrome extension for querying Memento aggregators while browsing and integrating live-archived web navigation. (Stable)

PANDORÆ 16 updated 3mo ago

Desktop research software that plugs into a Solr endpoint to query, retrieve, normalize, and visually explore web archives. (Stable)

playback

A toolkit for searching archived webpages from the Internet Archive, archive.today, Memento, and beyond. (In Development)

webarchive-discovery 132 updated 4mo ago

Front-ends.

Shine 43 (archived)

A prototype web archives exploration UI, developed with researchers as part of the Big UK Domain Data for the Arts and Humanities project. (Stable)

SolrWayback 137 updated 15d ago

A project with a Java backend and a Vue.js frontend that offers free-text search and a built-in playback engine. It requires WARC files that have been indexed with the Warc-Indexer. The web application also has a wide range of data visualization and data export tools that can be used on the whole web archive. The SolrWayback 4 Bundle release contains all the software and dependencies in an out-of-the-box solution that is easy to install.

Warclight 50 (archived)

A Project Blacklight based Rails engine that supports the discovery of web archives held in the WARC and ARC formats. (In Development)

Wasp 27 updated 3y ago

A fully functional prototype of a personal web archive and search system. (In Development)

Utilities

ArchiveTools

Collection of tools to extract and interact with WARC files (Python).

bagnabit2warc

Convert a bag-nabit dataset stored in a ZIP into a full-content WARC.

Go Get Crawl

Extract web archive data using the Wayback Machine and Common Crawl. (Stable)

gowarcserver 17 updated 11mo ago

BadgerDB-based capture index (CDX) and WARC record server, used to index and serve WARC files (Go).

har2warc

Convert HTTP Archive (HAR) -> Web Archive (WARC) format (Python).

HTTPreserve linkstat

Command-line implementation of httpreserve.info to describe the status of a web page. It can be easily scripted and provides JSON output to enable querying through tools like jq. HTTPreserve Linkstat describes the current status, and the earliest and latest links on archive.org. (Golang). (Stable)

Internet Archive Library

A command line tool and Python library for interacting directly with archive.org. (Python). (Stable)

httrack2warc

Convert HTTrack archives to WARC format (Java).

MementoMap 11 updated 4y ago

A Tool to Summarize Web Archive Holdings (Python). (In Development)

MemGator 78 updated 1y ago

A Memento Aggregator CLI and Server (Golang). (Stable)

node-cdxj

CDXJ file parser (Node.js). (Stable)

OutbackCDX 38 updated 24d ago

RocksDB-based capture index (CDX) server supporting incremental updates and compression. Can be used as backend for OpenWayback, PyWb and Heritrix. (Stable)

py-wasapi-client 16 updated 6y ago

Command line application to download crawls from WASAPI (Python). (Stable)

tikalinkextract

Extract hyperlinks as a seed for web archiving from folders of document types that can be parsed by Apache Tika (Golang, Apache Tika Server). (In Development)

wasapi-downloader 7 updated 4mo ago

Java command line application to download crawls from WASAPI. (Stable)

warcdb 405 updated 1y ago

A command line utility (Python) for importing WARC files into a SQLite database. (Stable)

warcbench 9 updated 7mo ago

A tool for exploring, analyzing, transforming, recombining, and extracting data from WARC (Web ARChive) files.

warcdedupe

WARC deduplication tool (and WARC library) written in Rust. (In Development)

warc-safe 18 updated 3mo ago

Automatic detection of viruses and NSFW content in WARC files.

WarcPartitioner 1 updated 9y ago

Partition (W)ARC Files by MIME Type and Year. (Stable)

warcrefs 10 updated 7y ago

Web archive deduplication tools. (Stable)

webarchive-indexing 47 updated 8y ago

Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.

wikiteam

Tools for downloading and preserving wikis. (Stable)
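
Several of the utilities above (node-cdxj, OutbackCDX, gowarcserver) deal with CDX or CDXJ capture indexes. As a rough illustration of what one such index line holds, the sketch below parses a single CDXJ line in Python. It assumes the commonly used layout of a sort key, a 14-digit timestamp, and a JSON block separated by single spaces; the `CdxjRecord` type, the `parse_cdxj_line` helper, and the sample line are hypothetical and are not taken from any of the listed tools.

```python
import json
from typing import NamedTuple


class CdxjRecord(NamedTuple):
    """One capture entry from a CDXJ index (illustrative type, not a real API)."""
    key: str        # sort key, e.g. "com,example)/"
    timestamp: str  # capture time as YYYYMMDDhhmmss
    fields: dict    # decoded JSON block (url, mime, status, ...)


def parse_cdxj_line(line: str) -> CdxjRecord:
    """Split a CDXJ line into key, timestamp, and decoded JSON block.

    Assumes the key itself contains no spaces, so the first two
    space-separated tokens are the key and timestamp and the rest is JSON.
    """
    key, timestamp, payload = line.strip().split(" ", 2)
    return CdxjRecord(key, timestamp, json.loads(payload))


# Made-up sample line for demonstration only.
sample = (
    'com,example)/ 20140127171200 '
    '{"url": "http://example.com/", "mime": "text/html", "status": "200"}'
)
record = parse_cdxj_line(sample)
print(record.key, record.timestamp, record.fields["url"])
```

For real indexes, a dedicated parser such as the ones listed above is the safer choice, since it will handle header lines and format variants that this sketch ignores.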

Community Resources

Web Archiving Service Providers