Nutch-sa
From KallestadWiki
Top URLs in your domain
bin/nutch readdb crawldb -topN NumberOfUrls outputdir
DB Stats
bin/nutch readdb crawldb -stats
List of Inbound and Outbound URLs
bin/nutch readlinkdb linkdb -dump outputdir
or
bin/nutch readlinkdb linkdb -url urltoquery
For the -url option, the URL must be fully qualified: it needs the protocol prefix (http://, https://, etc.) and, for directories, a trailing slash.
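Since an unqualified URL is an easy mistake to make, a small guard in front of the lookup may help. The following is only a sketch: the function names is_fully_qualified and query_inlinks are hypothetical (not part of Nutch), and it assumes you run it from the Nutch install directory with a link database named linkdb. It checks only the protocol prefix; it cannot tell whether a path is a directory, so the trailing slash is still up to you.

```shell
# Hypothetical helper (not part of Nutch): succeeds only if the URL
# starts with an http:// or https:// protocol prefix.
is_fully_qualified() {
  case "$1" in
    http://?*|https://?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: query the linkdb only when the URL passes the check.
# Assumes the current directory is the Nutch install directory and the
# link database is named "linkdb".
query_inlinks() {
  if is_fully_qualified "$1"; then
    bin/nutch readlinkdb linkdb -url "$1"
  else
    echo "not fully qualified: $1" >&2
    return 1
  fi
}
```

For example, query_inlinks http://www.example.com/docs/ would pass the check and run the lookup, while query_inlinks www.example.com/docs is rejected before Nutch is called.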
