Extracts URLs from feeds.
A multithreaded, queue-based fetcher adapted from Apache Nutch.
Parser for HTML documents only, which uses ICU4J to detect the character encoding.
A single-threaded fetcher with no internal queue.
Extracts URLs from sitemap files.
Provides common functionality for Bolts which emit tuples to the status stream, e.g.
Generates a partition key for a given URL based on the hostname, domain or IP address.
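As a rough illustration of what partitioning by hostname, domain or IP address could look like, here is a minimal sketch. It is not the library's implementation: the class name `PartitionKeySketch`, the `Mode` enum and the `partitionKey` method are hypothetical, and the domain logic naively keeps the last two labels of the hostname, whereas a production implementation would normally consult a public suffix list (so `example.co.uk` is handled incorrectly here).

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.UnknownHostException;
import java.util.Locale;

// Hypothetical sketch, not the actual class described above.
class PartitionKeySketch {

    // Hypothetical names for the three partitioning strategies.
    enum Mode { HOST, DOMAIN, IP }

    // Returns the partition key for a URL, or null if no host can be extracted.
    static String partitionKey(String url, Mode mode) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
        if (host == null) {
            return null;
        }
        host = host.toLowerCase(Locale.ROOT);

        switch (mode) {
            case DOMAIN: {
                // Assumption: naive "last two labels" rule, no public suffix list.
                String[] labels = host.split("\\.");
                if (labels.length <= 2) {
                    return host;
                }
                return labels[labels.length - 2] + "." + labels[labels.length - 1];
            }
            case IP: {
                try {
                    // May trigger a DNS lookup unless the host is already an IP literal.
                    return InetAddress.getByName(host).getHostAddress();
                } catch (UnknownHostException e) {
                    // Fall back to the hostname when resolution fails.
                    return host;
                }
            }
            default:
                return host;
        }
    }
}
```

With this sketch, `partitionKey("http://www.example.com/page", Mode.HOST)` yields `www.example.com` and the same URL with `Mode.DOMAIN` yields `example.com`. URLs sharing a key would be routed to the same downstream task, which is what makes per-host politeness possible in a fetcher.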
Copyright © 2018 DigitalPebble Ltd. All rights reserved.