Extract hyperlinks, for use as seeds for web archiving, from folders of documents in any format that Apache Tika can parse (Golang, Apache Tika Server). (In Development)
tikalinkextract
Tika client for httpreserve.
About
tikalinkextract requires a running Tika HTTP server, which users must start themselves. Given a folder of files, it automates sending each file to the server for text extraction. The extracted text is then scanned for hyperlinks, which are written to stdout. There are examples you can try below.
More information is available on the OPF website: Hyperlinks in your files? How to get them out using tikalinkextract
Demo
Use with Wget
Extract the links from your files using the -seeds option
./tikalinkextract -seeds -file archives-nz-demo/ > transferlinks.txt
Use the seeds to generate a WARC file
wget -T 10 --tries=1 --page-requisites --span-hosts --convert-links --execute robots=off --adjust-extension --no-directories --directory-prefix=output --warc-cdx --warc-file=accession --wait=0.1 --user-agent=httpreserve-wget/0.0.1 -i transferlinks.txt
See explainshell.com for a breakdown of the wget options used above.
Resources that might be useful
License
Tika is licensed under the Apache License 2.0.
This tool is licensed under the GNU General Public License Version 3.
