

Extract hyperlinks as a seed for web archiving from folders of document types that can be parsed by Apache Tika (Golang, Apache Tika Server). (In Development)



tikalinkextract

Tika client for httpreserve.

About

tikalinkextract requires a running Apache Tika HTTP server, which users must start themselves. Given a folder of files, it automates the batch submission of those files to Tika's text extraction mechanism, scans the extracted text for hyperlinks, and writes the links it finds to stdout. There are examples you can try below.

More information is available on the OPF website: Hyperlinks in your files? How to get them out using tikalinkextract

Demo

asciicast

Use with Wget

Extract the links from your files using the -seeds option:

./tikalinkextract -seeds -file archives-nz-demo/ > transferlinks.txt

Use the seeds to generate a WARC file:

wget -T 10 --tries=1 --page-requisites --span-hosts --convert-links  --execute robots=off --adjust-extension --no-directories --directory-prefix=output --warc-cdx --warc-file=accession --wait=0.1 --user-agent=httpreserve-wget/0.0.1 -i transferlinks.txt

See explainshell.com for a breakdown of each option in this command.


License

Apache Tika is licensed under the Apache License 2.0.

This tool is licensed under the GNU General Public License Version 3.
