Web Archiving
An effort to preserve the Web for future generations.
Contents
Training/Documentation
Tools & Software
Acquisition
A tool which maintains an additive archive from RSS feeds, bookmarks, and links using wget, Chrome headless, and other methods (formerly Bookmark Archiver). (In Development)
A Python library to push web resources into on-demand web archives; see the sketch after this list. (Stable)
Python script to automatically archive social media posts, videos, and images from a Google Sheets document. Read the article about Auto Archiver on bellingcat.com.
A Chromium-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. (Stable)
A distributed web crawler (爬虫) that uses a real browser (Chrome or Chromium) to fetch pages and embedded URLs and to extract links. (Stable)
crau is the way (most) Brazilians pronounce "crawl"; it's the easiest command-line tool for archiving the Web and playing archives: you just need a list of URLs. (Stable)
Crawl websites using headless Google Chrome/Chromium and save resources, a static DOM snapshot, and page screenshots to WARC files. (In Development)
A non-WARC-based tool which hooks into the Chrome browser and archives everything you browse, making it available for offline replay. (In Development)
A command-line tool and Python library for archiving data from Facebook using the Graph API. (Stable)
JavaScript library to turn a page into a static, self-contained HTML document; useful for browser extensions. (In Development)
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns. (Stable)
An open source, extensible, web-scale, archival quality web crawler. (Stable)
A simple script to convert offline data into a single WARC file. (Stable)
Go package and CLI tool for saving a web page as a single HTML file. (Stable)
High-fidelity, browser-based, single-page web archiving library and CLI for witnessing the web. (Stable)
Browser extension for Firefox/Chrome and CLI tool to save a faithful copy of a complete page as a single HTML file. (Stable)
An open source, high-fidelity, page-interacting archival crawler that uses Chrome or Chrome Headless directly. (In Development)
A command line tool and Python library for archiving Twitter JSON data. (Stable)
A graphical user interface (GUI) atop multiple web archiving tools intended to be used as an easy way for anyone to preserve and replay web pages; Python, Electron. (Stable)
An open source, dockerized, queued, high fidelity web archiver based on Squidwarc with a simple web GUI. (Stable)
A toolkit for snapshotting webpages to the Internet Archive, archive.today, IPFS, and beyond. (Stable)
Wayback Machine Save, CDX, and availability API interface in Python, plus a command-line tool. (Stable)
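The push-style library above exposes a one-call API. A minimal sketch, assuming the archivenow library (the entry's description matches it, but treat the name and call shape as an assumption):

```python
# Hedged sketch: archivenow's push() submits a URI to an on-demand archive
# identified by a short code such as "ia" (Internet Archive) or "is"
# (archive.today) and returns the URI(s) of the resulting capture.
from archivenow import archivenow

result = archivenow.push("https://example.com", "ia")
print(result)
```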
Replay
Web Archive (WARC) indexing and replay using IPFS.
The open source project aimed at developing the Wayback Machine, the key software used by web archives worldwide to play back archived websites in the user's browser. (Stable)
Search & Discovery
A web crawler built for research use, with a graphical user interface for building web corpuses made of lists of web actors and maps of the links between them. (Stable)
A Google Chrome extension for querying Memento aggregators while browsing and integrating live-archived web navigation. (Stable)
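At the protocol level, a Memento aggregator lookup is a plain HTTP request. A minimal sketch, assuming the public Time Travel aggregator's JSON endpoint (the endpoint and response shape are assumptions drawn from the Memento Time Travel service, not from the extension above):

```python
# Hedged sketch: ask a Memento aggregator for the capture of a URI closest
# to a given datetime (14-digit YYYYMMDDhhmmss timestamp).
import requests

uri = "https://example.com/"
resp = requests.get(
    "http://timetravel.mementoweb.org/api/json/20200101000000/" + uri,
    timeout=30,
)
data = resp.json()
# The closest memento, if any, is reported under mementos -> closest.
print(data.get("mementos", {}).get("closest", {}).get("uri"))
```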
Utilities
Collection of tools to extract and interact with WARC files (Python).
Convert a bag-nabit dataset stored in a ZIP into a full-content WARC.
Extract web archive data using Wayback Machine and Common Crawl. (Stable)
BadgerDB-based capture index (CDX) and WARC record server, used to index and serve WARC files (Go).
Command line implementation of httpreserve.info to describe the status of a web page. Can be easily scripted and provides JSON output to enable querying through tools like jq. HTTPreserve Linkstat describes current status, and earliest and latest links on archive.org. (Golang). (Stable)
A command line tool and Python library for interacting directly with archive.org; see the first sketch after this list. (Python). (Stable)
RocksDB-based capture index (CDX) server supporting incremental updates and compression. Can be used as a backend for OpenWayback, PyWb and Heritrix; see the CDX query sketch after this list. (Stable)
Command line application to download crawls from WASAPI (Python). (Stable)
Extract hyperlinks as a seed for web archiving from folders of document types that can be parsed by Apache Tika (Golang, Apache Tika Server). (In Development)
Java command line application to download crawls from WASAPI. (Stable)
A command line utility (Python) for importing WARC files into a SQLite database; see the WARC-to-SQLite sketch after this list. (Stable)
A tool for exploring, analyzing, transforming, recombining, and extracting data from WARC (Web ARChive) files.
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
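The archive.org client listed above also works as a library. A minimal sketch of its documented search and download helpers (the collection and item names are just examples):

```python
# Hedged sketch using the internetarchive Python library: search for item
# identifiers in a collection, then download one item's files.
from internetarchive import download, search_items

for result in search_items("collection:nasa"):
    print(result["identifier"])  # each search result is a dict

# Downloads every file of the "nasa" item into ./nasa/ .
download("nasa", verbose=True)
```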
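Capture index (CDX) servers like the ones above answer URL-history queries over HTTP. A minimal sketch against the Internet Archive's public CDX endpoint as a stand-in (a self-hosted server's parameters may differ):

```python
# Hedged sketch: list captures of a URL from a CDX API. With output=json
# the response is a JSON array whose first row names the fields.
import requests

params = {"url": "example.com", "output": "json", "limit": "5"}
rows = requests.get("https://web.archive.org/cdx/search/cdx",
                    params=params, timeout=30).json()
header, captures = rows[0], rows[1:]
for capture in captures:
    record = dict(zip(header, capture))
    print(record["timestamp"], record["original"], record["statuscode"])
```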
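And a minimal sketch of what a WARC-to-SQLite import boils down to, written here with the warcio library and the standard library's sqlite3 rather than the tool above (schema and file names are illustrative):

```python
# Hedged sketch: walk WARC records and store a few header fields per record.
import sqlite3
from warcio.archiveiterator import ArchiveIterator

db = sqlite3.connect("archive.db")
db.execute("""CREATE TABLE IF NOT EXISTS records
              (warc_type TEXT, target_uri TEXT, warc_date TEXT)""")

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        h = record.rec_headers
        db.execute("INSERT INTO records VALUES (?, ?, ?)",
                   (h.get_header("WARC-Type"),
                    h.get_header("WARC-Target-URI"),
                    h.get_header("WARC-Date")))
db.commit()
```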
WARC I/O Libraries
A splittable Hadoop InputFormat for concatenated GZIP files (and *.warc.gz). (Stable)
Tools for reading/writing/validating WARC/ARC/GZIP files (Java). (Stable)
Parse WARC files or create WARC files using either Electron or chrome-remote-interface (Node.js). (Stable)
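For Python, the same record model can be written with the warcio library (not listed above, but a common choice); a minimal sketch with an illustrative payload:

```python
# Hedged sketch: write a single gzipped "response" record following
# warcio's documented WARCWriter pattern.
from io import BytesIO
from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

with open("example.warc.gz", "wb") as output:
    writer = WARCWriter(output, gzip=True)
    http_headers = StatusAndHeaders("200 OK",
                                    [("Content-Type", "text/html")],
                                    protocol="HTTP/1.1")
    record = writer.create_warc_record("http://example.com/", "response",
                                       payload=BytesIO(b"<html>hi</html>"),
                                       http_headers=http_headers)
    writer.write_record(record)
```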
Analysis
Web application for distributed compute analysis of Archive-It web archive collections. (Stable)
An Apache Spark framework (not only) for Web Archives that enables easy data processing, extraction, and derivation. (Stable)
Notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit. (Stable)
Archives Unleashed Toolkit (AUT) is an open-source platform for analyzing web archives with Apache Spark. (Stable)
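A hedged sketch of what analysis with the toolkit looks like from PySpark, assuming AUT's documented Python bindings and run inside a pyspark session launched with the aut package (the WARC path and column names are examples):

```python
# Hedged sketch: load WARCs into a DataFrame of pages and inspect it.
# `sc` and `sqlContext` are provided by the pyspark session.
from aut import WebArchive

archive = WebArchive(sc, sqlContext, "/path/to/warcs")
pages = archive.webpages()
pages.printSchema()
pages.select("crawl_date", "url").show(5)
```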
Quality Assurance
Data
A Java backend and Vue.js frontend project with free-text search and a built-in playback engine. Requires that WARC files have been indexed with the warc-indexer. The web application also has a wide range of data visualization and data export tools that can be used on the whole web archive. The SolrWayback 4 bundle release contains all the software and dependencies in an out-of-the-box solution that is easy to install.
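Under the hood, that free-text search is a Solr query. A minimal sketch of the kind of request the frontend issues, using plain HTTP against a local Solr (the core name and field names are assumptions, not a confirmed schema):

```python
# Hedged sketch: standard Solr select query with a free-text term.
import requests

params = {"q": "climate change", "rows": 10, "wt": "json"}
resp = requests.get("http://localhost:8983/solr/netarchivebuilder/select",
                    params=params, timeout=30)
for doc in resp.json()["response"]["docs"]:
    print(doc.get("crawl_date"), doc.get("url"))
```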