Web Archiving
An effort to preserve the Web for future generations.
Contents
Training/Documentation
Tools & Software
Acquisition
A tool which maintains an additive archive from RSS feeds, bookmarks, and links using wget, Chrome headless, and other methods (formerly Bookmark Archiver). (In Development)
A Python library to push web resources into on-demand web archives; see the sketch after this list. (Stable)
Python script to automatically archive social media posts, videos, and images from a Google Sheets document. Read the article about Auto Archiver on bellingcat.com.
A Chromium-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. (Stable)
A distributed web crawler (爬虫) that uses a real browser (Chrome or Chromium) to fetch pages and embedded URLs and to extract links. (Stable)
crau is the way (most) Brazilians pronounce "crawl"; it's the easiest command-line tool for archiving the Web and playing archives: you just need a list of URLs. (Stable)
Crawl websites using headless Google Chrome/Chromium and save resources, a static DOM snapshot, and page screenshots to WARC files. (In Development)
A non-WARC-based tool which hooks into the Chrome browser and archives everything you browse, making it available for offline replay. (In Development)
A command-line tool and Python library for archiving data from Facebook using the Graph API. (Stable)
JavaScript library to turn a page into a static, self-contained HTML document; useful for browser extensions. (In Development)
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns. (Stable)
An open source, extensible, web-scale, archival quality web crawler. (Stable)
A simple script to convert offline data into a single WARC file. (Stable)
Go package and CLI tool for saving a web page as a single HTML file. (Stable)
High-fidelity, browser-based, single-page web archiving library and CLI for witnessing the web. (Stable)
Browser extension for Firefox/Chrome and CLI tool to save a faithful copy of a complete page as a single HTML file. (Stable)
An open source, high-fidelity, page-interacting archival crawler that uses Chrome or Chrome Headless directly. (In Development)
A command line tool and Python library for archiving Twitter JSON data. (Stable)
A graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages; Python, Electron. (Stable)
An open source, dockerized, queued, high fidelity web archiver based on Squidwarc with a simple web GUI. (Stable)
A toolkit for snapshotting webpages to the Internet Archive, archive.today, IPFS, and beyond. (Stable)
Wayback Machine Save, CDX, and availability API interface in Python, plus a command-line tool. (Stable)
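The push-style library above exposes a one-call API. A minimal sketch, assuming the archivenow library (the entry's description matches it, but treat the name and call shape as an assumption):

```python
# Hedged sketch: archivenow's push() submits a URI to an on-demand archive
# identified by a short code such as "ia" (Internet Archive) or "is"
# (archive.today) and returns the URI(s) of the resulting capture.
from archivenow import archivenow

result = archivenow.push("https://example.com", "ia")
print(result)
```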
Replay
Web Archive (WARC) indexing and replay using IPFS.
The open source project aimed at developing the Wayback Machine, the key software used by web archives worldwide to play back archived websites in the user's browser. (Stable)
Search & Discovery
A web crawler built for research use, with a graphical user interface for building web corpuses made of lists of web actors and maps of the links between them. (Stable)
A Google Chrome extension for querying Memento aggregators while browsing and integrating live-archived web navigation. (Stable)
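At the protocol level, a Memento aggregator lookup is a plain HTTP request. A minimal sketch, assuming the public Time Travel aggregator's JSON endpoint (the endpoint and response shape are assumptions drawn from the Memento Time Travel service, not from the extension above):

```python
# Hedged sketch: ask a Memento aggregator for the capture of a URI closest
# to a given datetime (14-digit YYYYMMDDhhmmss timestamp).
import requests

uri = "https://example.com/"
resp = requests.get(
    "http://timetravel.mementoweb.org/api/json/20200101000000/" + uri,
    timeout=30,
)
data = resp.json()
# The closest memento, if any, is reported under mementos -> closest.
print(data.get("mementos", {}).get("closest", {}).get("uri"))
```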
Utilities
Collection of tools to extract and interact with WARC files (Python).
Convert a bag-nabit dataset stored in a ZIP into a full-content WARC.
Extract web archive data using Wayback Machine and Common Crawl. (Stable)
BadgerDB-based capture index (CDX) and WARC record server, used to index and serve WARC files (Go).
Command line implementation of httpreserve.info to describe the status of a web page. Can be easily scripted and provides JSON output to enable querying through tools like jq. HTTPreserve Linkstat describes current status, and earliest and latest links on archive.org. (Golang). (Stable)
A command line tool and Python library for interacting directly with archive.org; see the first sketch after this list. (Python). (Stable)
RocksDB-based capture index (CDX) server supporting incremental updates and compression. Can be used as a backend for OpenWayback, PyWb and Heritrix; see the CDX query sketch after this list. (Stable)
Command line application to download crawls from WASAPI (Python). (Stable)
Extract hyperlinks as a seed for web archiving from folders of document types that can be parsed by Apache Tika (Golang, Apache Tika Server). (In Development)
Java command line application to download crawls from WASAPI. (Stable)
A command line utility (Python) for importing WARC files into a SQLite database; see the WARC-to-SQLite sketch after this list. (Stable)
A tool for exploring, analyzing, transforming, recombining, and extracting data from WARC (Web ARChive) files.
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
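The archive.org client listed above also works as a library. A minimal sketch of its documented search and download helpers (the collection and item names are just examples):

```python
# Hedged sketch using the internetarchive Python library: search for item
# identifiers in a collection, then download one item's files.
from internetarchive import download, search_items

for result in search_items("collection:nasa"):
    print(result["identifier"])  # each search result is a dict

# Downloads every file of the "nasa" item into ./nasa/ .
download("nasa", verbose=True)
```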
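Capture index (CDX) servers like the ones above answer URL-history queries over HTTP. A minimal sketch against the Internet Archive's public CDX endpoint as a stand-in (a self-hosted server's parameters may differ):

```python
# Hedged sketch: list captures of a URL from a CDX API. With output=json
# the response is a JSON array whose first row names the fields.
import requests

params = {"url": "example.com", "output": "json", "limit": "5"}
rows = requests.get("https://web.archive.org/cdx/search/cdx",
                    params=params, timeout=30).json()
header, captures = rows[0], rows[1:]
for capture in captures:
    record = dict(zip(header, capture))
    print(record["timestamp"], record["original"], record["statuscode"])
```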
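And a minimal sketch of what a WARC-to-SQLite import boils down to, written here with the warcio library and the standard library's sqlite3 rather than the tool above (schema and file names are illustrative):

```python
# Hedged sketch: walk WARC records and store a few header fields per record.
import sqlite3
from warcio.archiveiterator import ArchiveIterator

db = sqlite3.connect("archive.db")
db.execute("""CREATE TABLE IF NOT EXISTS records
              (warc_type TEXT, target_uri TEXT, warc_date TEXT)""")

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        h = record.rec_headers
        db.execute("INSERT INTO records VALUES (?, ?, ?)",
                   (h.get_header("WARC-Type"),
                    h.get_header("WARC-Target-URI"),
                    h.get_header("WARC-Date")))
db.commit()
```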
WARC I/O Libraries
A splittable Hadoop InputFormat for concatenated GZIP files (and *.warc.gz). (Stable)
Tools for reading/writing/validating WARC/ARC/GZIP files (Java). (Stable)
Parse WARC files or create WARC files using either Electron or chrome-remote-interface (Node.js). (Stable)
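For Python, the same record model can be written with the warcio library (not listed above, but a common choice); a minimal sketch with an illustrative payload:

```python
# Hedged sketch: write a single gzipped "response" record following
# warcio's documented WARCWriter pattern.
from io import BytesIO
from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

with open("example.warc.gz", "wb") as output:
    writer = WARCWriter(output, gzip=True)
    http_headers = StatusAndHeaders("200 OK",
                                    [("Content-Type", "text/html")],
                                    protocol="HTTP/1.1")
    record = writer.create_warc_record("http://example.com/", "response",
                                       payload=BytesIO(b"<html>hi</html>"),
                                       http_headers=http_headers)
    writer.write_record(record)
```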
Analysis
Web application for distributed compute analysis of Archive-It web archive collections. (Stable)
An Apache Spark framework (not only) for Web Archives that enables easy data processing, extraction, and derivation. (Stable)
Notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit. (Stable)
Archives Unleashed Toolkit (AUT) is an open-source platform for analyzing web archives with Apache Spark. (Stable)
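A hedged sketch of what analysis with the toolkit looks like from PySpark, assuming AUT's documented Python bindings and run inside a pyspark session launched with the aut package (the WARC path and column names are examples):

```python
# Hedged sketch: load WARCs into a DataFrame of pages and inspect it.
# `sc` and `sqlContext` are provided by the pyspark session.
from aut import WebArchive

archive = WebArchive(sc, sqlContext, "/path/to/warcs")
pages = archive.webpages()
pages.printSchema()
pages.select("crawl_date", "url").show(5)
```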
Quality Assurance
Data
A Java backend and Vue.js frontend project with free-text search and a built-in playback engine. Requires that WARC files have been indexed with the warc-indexer. The web application also has a wide range of data visualization and data export tools that can be used on the whole web archive. The SolrWayback 4 bundle release contains all the software and dependencies in an out-of-the-box solution that is easy to install.
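Under the hood, that free-text search is a Solr query. A minimal sketch of the kind of request the frontend issues, using plain HTTP against a local Solr (the core name and field names are assumptions, not a confirmed schema):

```python
# Hedged sketch: standard Solr select query with a free-text term.
import requests

params = {"q": "climate change", "rows": 10, "wt": "json"}
resp = requests.get("http://localhost:8983/solr/netarchivebuilder/select",
                    params=params, timeout=30)
for doc in resp.json()["response"]["docs"]:
    print(doc.get("crawl_date"), doc.get("url"))
```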