
A

AbstractHttpProtocol - Class in com.digitalpebble.stormcrawler.protocol
 
AbstractHttpProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
AbstractIndexerBolt - Class in com.digitalpebble.stormcrawler.indexing
Abstract class to simplify writing IndexerBolts
AbstractIndexerBolt() - Constructor for class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
 
AbstractStatusUpdaterBolt - Class in com.digitalpebble.stormcrawler.persistence
Abstract bolt used to store the status of URLs.
AbstractStatusUpdaterBolt() - Constructor for class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
 
accept() - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexRule
Returns whether this rule is used for filtering in or filtering out.
ack(Tuple, String) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
Must be overridden for implementations where the actual writing can be delayed e.g.
activate() - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
 
activate() - Method in class com.digitalpebble.stormcrawler.spout.MemorySpout
 
AdaptiveScheduler - Class in com.digitalpebble.stormcrawler.persistence
Adaptive fetch scheduler. Checks by signature comparison whether a re-fetched page has changed: if it has, shrinks the fetch interval down to a minimum fetch interval; if not, increases the fetch interval up to a maximum.
AdaptiveScheduler() - Constructor for class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 
add(String, Metadata, Date) - Static method in class com.digitalpebble.stormcrawler.spout.MemorySpout
Add a new URL
addMeasurement(long) - Method in class com.digitalpebble.stormcrawler.util.CollectionMetric
 
addValue(String, String) - Method in class com.digitalpebble.stormcrawler.Metadata
 
addValues(String, Collection<String>) - Method in class com.digitalpebble.stormcrawler.Metadata
 
agentNames - Variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
 
allowForbidden - Variable in class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
 
AllowRedirParamName - Static variable in class com.digitalpebble.stormcrawler.Constants
 
allowRedirs() - Method in class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
 
ANCHORS_KEY_NAME - Static variable in class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
Metadata key name for tracking the anchors
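
The AdaptiveScheduler entry above describes shrinking the fetch interval when a page has changed and growing it otherwise, bounded by a minimum and a maximum. A minimal sketch of that idea follows, assuming a multiplicative update. The class name AdaptiveIntervalSketch, the method nextInterval and the exact formula are illustrative assumptions and are not taken from the StormCrawler source.

```java
// Sketch only: not the StormCrawler implementation.
// Intervals are expressed in minutes, as for INTERVAL_MIN and INTERVAL_MAX.
class AdaptiveIntervalSketch {

    /**
     * Hypothetical helper returning the next fetch interval.
     *
     * @param current current fetch interval in minutes
     * @param changed true if the signature of the re-fetched page differs
     * @param incRate assumed increment rate, e.g. 0.5f grows the interval by 50%
     * @param decRate assumed decrement rate, e.g. 0.5f shrinks the interval by 50%
     * @param min     minimum fetch interval in minutes
     * @param max     maximum fetch interval in minutes
     */
    static int nextInterval(int current, boolean changed,
                            float incRate, float decRate, int min, int max) {
        if (changed) {
            // Page changed: fetch more often, but never below the minimum.
            int shrunk = Math.round(current * (1.0f - decRate));
            return Math.max(min, shrunk);
        }
        // Page unchanged: fetch less often, but never above the maximum.
        int grown = Math.round(current * (1.0f + incRate));
        return Math.min(max, grown);
    }

    public static void main(String[] args) {
        System.out.println(nextInterval(200, true, 0.5f, 0.5f, 60, 1000));
        System.out.println(nextInterval(200, false, 0.5f, 0.5f, 60, 1000));
    }
}
```

Clamping after each update keeps the interval inside the configured bounds no matter how many consecutive changes or non-changes are observed.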

B

BasicURLFilter - Class in com.digitalpebble.stormcrawler.filtering.basic
Simple URL filters: can be used early in the filtering chain
BasicURLFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter
 
BasicURLNormalizer - Class in com.digitalpebble.stormcrawler.filtering.basic
 
BasicURLNormalizer() - Constructor for class com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer
 
BATCH_SIZE - Static variable in class com.digitalpebble.stormcrawler.spout.FileSpout
 

C

CACHE - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
 
cacheConfigParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
Parameter name to configure the cache (see http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html). Default value is "maximumSize=10000,expireAfterAccess=1h".
canonicalMetadataParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Field name to use for reading the canonical property of the metadata
checkCustomInterval(Metadata, Status) - Method in class com.digitalpebble.stormcrawler.persistence.DefaultScheduler
Returns the first matching custom interval
checkDomainMatchToUrl(String, String) - Static method in class com.digitalpebble.stormcrawler.util.CookieConverter
Helper method to check if url matches a cookie domain.
chooseTasks(int, List<Object>) - Method in class com.digitalpebble.stormcrawler.util.URLStreamGrouping
 
cleanup() - Method in class com.digitalpebble.stormcrawler.bolt.FetcherBolt
 
cleanup() - Method in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
 
cleanup() - Method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
cleanup() - Method in class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
 
cleanup() - Method in interface com.digitalpebble.stormcrawler.protocol.Protocol
 
cleanup() - Method in class com.digitalpebble.stormcrawler.protocol.ProtocolFactory
 
close() - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
 
CollectionMetric - Class in com.digitalpebble.stormcrawler.util
 
CollectionMetric() - Constructor for class com.digitalpebble.stormcrawler.util.CollectionMetric
 
collector - Variable in class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
 
collector - Variable in class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
 
com.digitalpebble.stormcrawler - package com.digitalpebble.stormcrawler
 
com.digitalpebble.stormcrawler.bolt - package com.digitalpebble.stormcrawler.bolt
 
com.digitalpebble.stormcrawler.filtering - package com.digitalpebble.stormcrawler.filtering
 
com.digitalpebble.stormcrawler.filtering.basic - package com.digitalpebble.stormcrawler.filtering.basic
 
com.digitalpebble.stormcrawler.filtering.depth - package com.digitalpebble.stormcrawler.filtering.depth
 
com.digitalpebble.stormcrawler.filtering.host - package com.digitalpebble.stormcrawler.filtering.host
 
com.digitalpebble.stormcrawler.filtering.metadata - package com.digitalpebble.stormcrawler.filtering.metadata
 
com.digitalpebble.stormcrawler.filtering.regex - package com.digitalpebble.stormcrawler.filtering.regex
 
com.digitalpebble.stormcrawler.filtering.robots - package com.digitalpebble.stormcrawler.filtering.robots
 
com.digitalpebble.stormcrawler.indexing - package com.digitalpebble.stormcrawler.indexing
 
com.digitalpebble.stormcrawler.parse - package com.digitalpebble.stormcrawler.parse
 
com.digitalpebble.stormcrawler.parse.filter - package com.digitalpebble.stormcrawler.parse.filter
 
com.digitalpebble.stormcrawler.persistence - package com.digitalpebble.stormcrawler.persistence
 
com.digitalpebble.stormcrawler.protocol - package com.digitalpebble.stormcrawler.protocol
 
com.digitalpebble.stormcrawler.protocol.file - package com.digitalpebble.stormcrawler.protocol.file
 
com.digitalpebble.stormcrawler.protocol.httpclient - package com.digitalpebble.stormcrawler.protocol.httpclient
 
com.digitalpebble.stormcrawler.protocol.selenium - package com.digitalpebble.stormcrawler.protocol.selenium
 
com.digitalpebble.stormcrawler.spout - package com.digitalpebble.stormcrawler.spout
 
com.digitalpebble.stormcrawler.util - package com.digitalpebble.stormcrawler.util
 
conf - Variable in class com.digitalpebble.stormcrawler.ConfigurableTopology
 
ConfigurableTopology - Class in com.digitalpebble.stormcrawler
 
ConfigurableTopology() - Constructor for class com.digitalpebble.stormcrawler.ConfigurableTopology
 
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter
 
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer
 
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.basic.SelfURLFilter
 
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter
 
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.host.HostURLFilter
 
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.metadata.MetadataFilter
 
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilterBase
 
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLNormalizer
 
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.robots.RobotsFilter
 
configure(Map, JsonNode) - Method in interface com.digitalpebble.stormcrawler.filtering.URLFilter
Called when this filter is being initialized
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.URLFilters
 
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.ContentFilter
 
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.DebugParseFilter
 
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter
 
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter
 
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.MD5SignatureParseFilter
 
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.XPathFilter
 
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.ParseFilter
Called when this filter is being initialized
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.ParseFilters
 
configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
 
configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
 
configure(Config) - Method in interface com.digitalpebble.stormcrawler.protocol.Protocol
 
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilter
Called when this filter is being initialised
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters
 
configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol
 
configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.SeleniumProtocol
 
configure(Map<String, Object>) - Method in class com.digitalpebble.stormcrawler.util.MetadataTransfer
 
configure(Map) - Method in class com.digitalpebble.stormcrawler.util.URLPartitioner
 
ConfUtils - Class in com.digitalpebble.stormcrawler.util
 
Constants - Class in com.digitalpebble.stormcrawler
 
CONTENT_DISPOSITION - Static variable in interface com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
CONTENT_ENCODING - Static variable in interface com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
CONTENT_LANGUAGE - Static variable in interface com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
CONTENT_LENGTH - Static variable in interface com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
CONTENT_LOCATION - Static variable in interface com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
CONTENT_MD5 - Static variable in interface com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
CONTENT_TYPE - Static variable in interface com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
ContentFilter - Class in com.digitalpebble.stormcrawler.parse.filter
Restricts the text of the main document based on the text value of an XPath expression (e.g.
ContentFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.ContentFilter
 
CookieConverter - Class in com.digitalpebble.stormcrawler.util
Helper to extract cookies from a cookie string.
CookieConverter() - Constructor for class com.digitalpebble.stormcrawler.util.CookieConverter
 
createDOM(Node, Node, Document, Map<String, String>) - Static method in class com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder
The internal helper that copies content from the specified Jsoup Node into a W3C Node.
createRule(boolean, String) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilter
 
createRule(boolean, String) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilterBase
Creates a new RegexRule.

D

deactivate() - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
 
deactivate() - Method in class com.digitalpebble.stormcrawler.spout.MemorySpout
 
DebugParseFilter - Class in com.digitalpebble.stormcrawler.parse.filter
Dumps the DOM representation of a document into a file
DebugParseFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.DebugParseFilter
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.FeedParserBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.FetcherBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.spout.MemorySpout
 
defaultfetchInterval - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 
defaultFetchIntervalParamName - Static variable in class com.digitalpebble.stormcrawler.Constants
 
DefaultScheduler - Class in com.digitalpebble.stormcrawler.persistence
Schedules a nextFetchDate based on the configuration
DefaultScheduler() - Constructor for class com.digitalpebble.stormcrawler.persistence.DefaultScheduler
 
DELETION_STREAM_NAME - Static variable in class com.digitalpebble.stormcrawler.Constants
 
depthKeyName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Metadata key name for tracking the depth
deserialize(ByteBuffer) - Method in class com.digitalpebble.stormcrawler.util.StringTabScheme
 
DomainParseFilter - Class in com.digitalpebble.stormcrawler.parse.filter
Adds domain (or host) to metadata - can be used later on for indexing
DomainParseFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter
 
drivers - Variable in class com.digitalpebble.stormcrawler.protocol.selenium.SeleniumProtocol
 
DummyIndexer - Class in com.digitalpebble.stormcrawler.indexing
Any tuple that went through all the previous bolts is sent to the status stream with a Status of FETCHED.
DummyIndexer() - Constructor for class com.digitalpebble.stormcrawler.indexing.DummyIndexer
 

E

emitOutlink(Tuple, URL, String, Metadata, String...) - Method in class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
Used for redirections or when discovering sitemap URLs
empty - Static variable in class com.digitalpebble.stormcrawler.Metadata
 
EMPTY_RULES - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
emptyNavigationFilters - Static variable in class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters
 
emptyParseFilter - Static variable in class com.digitalpebble.stormcrawler.parse.ParseFilters
 
emptyURLFilters - Static variable in class com.digitalpebble.stormcrawler.filtering.URLFilters
 
ERRORCACHE - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
 
errorFetchIntervalParamName - Static variable in class com.digitalpebble.stormcrawler.Constants
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.FeedParserBolt
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.FetcherBolt
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.indexing.DummyIndexer
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.indexing.StdOutIndexer
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
 
expressions - Variable in class com.digitalpebble.stormcrawler.parse.filter.XPathFilter
 
extractConfigElement(Map) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
If the config consists of a single key 'config', its values are used instead
extractMetaTags(DocumentFragment) - Method in class com.digitalpebble.stormcrawler.util.RobotsTags
 
extractRefreshURL(DocumentFragment) - Static method in class com.digitalpebble.stormcrawler.util.RefreshTag
 
extractResult(TimeReducerState) - Method in class com.digitalpebble.stormcrawler.util.PerSecondReducer
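
The extractConfigElement entry in this section says that a configuration consisting of a single key 'config' is replaced by that key's values. A minimal sketch of that behaviour follows; the class name ConfigUnwrapSketch, the method unwrap and the sample key example.key are hypothetical, and the real method in ConfUtils may differ in details.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: not the ConfUtils implementation.
class ConfigUnwrapSketch {

    /** Returns the nested map if conf has exactly one key named "config", else conf itself. */
    @SuppressWarnings("unchecked")
    static Map<String, Object> unwrap(Map<String, Object> conf) {
        if (conf.size() == 1) {
            Object inner = conf.get("config");
            if (inner instanceof Map) {
                return (Map<String, Object>) inner;
            }
        }
        return conf;
    }

    public static void main(String[] args) {
        Map<String, Object> inner = new HashMap<>();
        inner.put("example.key", 10); // hypothetical setting

        Map<String, Object> wrapped = new HashMap<>();
        wrapped.put("config", inner);

        // Both calls print the same flat map.
        System.out.println(unwrap(wrapped));
        System.out.println(unwrap(inner));
    }
}
```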
 

F

FeedParserBolt - Class in com.digitalpebble.stormcrawler.bolt
Extracts URLs from feeds
FeedParserBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.FeedParserBolt
 
FETCH_INTERVAL_KEY - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Key to store the current fetch interval value, must be listed in "metadata.persist".
FetcherBolt - Class in com.digitalpebble.stormcrawler.bolt
A multithreaded, queue-based fetcher adapted from Apache Nutch.
FetcherBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.FetcherBolt
 
fetchErrorCountParamName - Static variable in class com.digitalpebble.stormcrawler.Constants
 
fetchErrorFetchIntervalParamName - Static variable in class com.digitalpebble.stormcrawler.Constants
 
fetchIntervalDecRate - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 
fetchIntervalIncRate - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 
fieldNameForText() - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Returns the field name to use for the text or null if the text must not be indexed
fieldNameForURL() - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Returns the field name to use for the URL or null if the URL must not be indexed
FileProtocol - Class in com.digitalpebble.stormcrawler.protocol.file
 
FileProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
 
FileResponse - Class in com.digitalpebble.stormcrawler.protocol.file
 
FileResponse(String, Metadata, FileProtocol) - Constructor for class com.digitalpebble.stormcrawler.protocol.file.FileResponse
 
FileSpout - Class in com.digitalpebble.stormcrawler.spout
Reads the lines from a UTF-8 file and uses them as a spout.
FileSpout(String, String, Scheme) - Constructor for class com.digitalpebble.stormcrawler.spout.FileSpout
 
FileSpout(String, Scheme) - Constructor for class com.digitalpebble.stormcrawler.spout.FileSpout
 
FileSpout(Scheme, String...) - Constructor for class com.digitalpebble.stormcrawler.spout.FileSpout
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.basic.SelfURLFilter
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.host.HostURLFilter
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.metadata.MetadataFilter
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilterBase
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLNormalizer
This function does the replacements by iterating through all the regex patterns.
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.robots.RobotsFilter
 
filter(URL, Metadata, String) - Method in interface com.digitalpebble.stormcrawler.filtering.URLFilter
Returns null if the URL is to be removed, or a normalised representation of it, which may be the same as the input URL
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.URLFilters
 
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.ContentFilter
 
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.DebugParseFilter
 
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter
 
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter
 
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.MD5SignatureParseFilter
 
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.XPathFilter
 
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.ParseFilter
Called when parsing a specific page
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.ParseFilters
 
filter(RemoteWebDriver, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilter
The end result comes from the first filter to return non-null
filter(RemoteWebDriver, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters
 
filter(Metadata) - Method in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Determines which metadata should be persisted for a given document, including those which are not necessarily transferred to the outlinks
filterDocument(Metadata) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Determine whether a document should be indexed based on the presence of a given key/value or the RobotsTags.ROBOTS_NO_INDEX directive.
filterMetadata(Metadata) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Returns a mapping of field names to values for the metadata to index
filterOutlink(URL, String, Metadata, String...) - Method in class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
 
filterPathRepet(String) - Method in class com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter
 
FORBID_ALL_RULES - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
fromConf(Map) - Static method in class com.digitalpebble.stormcrawler.filtering.URLFilters
Loads and configures the URLFilters based on the Storm config if there is one; otherwise returns an empty URLFilter.
fromConf(Map) - Static method in class com.digitalpebble.stormcrawler.parse.ParseFilters
Loads and configures the ParseFilters based on the Storm config if there is one; otherwise returns an emptyParseFilter.
fromConf(Map) - Static method in class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters
Loads and configures the NavigationFilters based on the Storm config if there is one; otherwise returns an emptyNavigationFilters.
fromHTTPCode(int) - Static method in enum com.digitalpebble.stormcrawler.persistence.Status
Maps the HTTP Code to FETCHED, FETCH_ERROR or REDIRECTION
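
The URLFilter.filter entry in this section states the contract for URL filters: return null to remove the URL, or return a normalised representation. A minimal sketch of how a chain of such filters could be applied follows. It simplifies the real signature filter(URL, Metadata, String) to a function from String to String, and FilterChainSketch, applyAll, stripFragment, httpOnly and defaultChain are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Sketch only: not the URLFilters implementation.
class FilterChainSketch {

    /** Applies each filter in order; a null result removes the URL and stops the chain. */
    static String applyAll(List<UnaryOperator<String>> filters, String url) {
        String current = url;
        for (UnaryOperator<String> filter : filters) {
            current = filter.apply(current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    /** Hypothetical normaliser: drops the fragment part of the URL. */
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    /** Hypothetical filter: keeps only http and https URLs. */
    static String httpOnly(String url) {
        boolean ok = url.startsWith("http://") || url.startsWith("https://");
        return ok ? url : null;
    }

    /** Builds the two-step chain used in the example. */
    static List<UnaryOperator<String>> defaultChain() {
        List<UnaryOperator<String>> chain = new ArrayList<>();
        chain.add(FilterChainSketch::stripFragment);
        chain.add(FilterChainSketch::httpOnly);
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(applyAll(defaultChain(), "http://example.com/a#top"));
        System.out.println(applyAll(defaultChain(), "ftp://example.com/a"));
    }
}
```

Because each step receives the output of the previous one, normalisers and filters can share one chain, and cheap filters placed early avoid work on URLs that will be removed anyway.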

G

get(String) - Method in class com.digitalpebble.stormcrawler.parse.ParseData
 
get(String) - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 
getAgentString(Config) - Static method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
getAgentString(String, String, String, String, String) - Static method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
getAnchor() - Method in class com.digitalpebble.stormcrawler.parse.Outlink
 
getBoolean(Map<String, Object>, String, boolean) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
 
getCacheKey(URL) - Static method in class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
Compose unique key to store and access robot rules in cache for given URL
getConf() - Method in class com.digitalpebble.stormcrawler.ConfigurableTopology
 
getContent() - Method in class com.digitalpebble.stormcrawler.parse.ParseData
 
getContent() - Method in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
 
getContentLengthFetched() - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
Returns the number of bytes fetched per request when not cached
getCookies(String[], URL) - Static method in class com.digitalpebble.stormcrawler.util.CookieConverter
Gets a list of cookies based on the cookie strings taken from the response header and the target URL.
getEncoding() - Method in class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
 
getFirstValue(String) - Method in class com.digitalpebble.stormcrawler.Metadata
 
getFloat(Map<String, Object>, String, float) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
 
getHost(String) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
Returns the lowercased hostname for the url or null if the url is not well formed.
getHostSegments(URL) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
Partitions the hostname of the URL by "."
getHostSegments(String) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
Partitions the hostname of the URL by "."
getInstance(Map) - Static method in class com.digitalpebble.stormcrawler.persistence.Scheduler
Returns a Scheduler instance based on the configuration
getInstance(Map<String, Object>) - Static method in class com.digitalpebble.stormcrawler.util.MetadataTransfer
 
getInt(Map<String, Object>, String, int) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
 
getLong(Map<String, Object>, String, long) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
 
getMetadata() - Method in class com.digitalpebble.stormcrawler.parse.Outlink
 
getMetadata() - Method in class com.digitalpebble.stormcrawler.parse.ParseData
 
getMetadata() - Method in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
 
getMetaForOutlink(String, String, Metadata) - Method in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Determines which metadata should be transferred to an outlink.
getOutlinks() - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 
getOutputFields() - Method in class com.digitalpebble.stormcrawler.util.StringTabScheme
 
getPage(String) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
Returns the page for the url.
getParseMap() - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 
getPartition(String, Metadata) - Method in class com.digitalpebble.stormcrawler.util.URLPartitioner
Returns the host, domain or IP of a URL so that it can be partitioned for politeness, depending on the value of the config partition.url.mode.
getProtocol(URL) - Method in class com.digitalpebble.stormcrawler.protocol.ProtocolFactory
Returns an instance of the protocol to use for a given URL
getProtocolOutput(String, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
 
getProtocolOutput(String, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
 
getProtocolOutput(String, Metadata) - Method in interface com.digitalpebble.stormcrawler.protocol.Protocol
Fetches the content and additional metadata. IMPORTANT: the metadata returned within the response should contain only new, additional entries; there is no need to return the metadata passed in.
getProtocolOutput(String, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.SeleniumProtocol
 
getRobotRules(String) - Method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
getRobotRules(String) - Method in class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
 
getRobotRules(String) - Method in interface com.digitalpebble.stormcrawler.protocol.Protocol
 
getRobotRulesSet(Protocol, URL) - Method in class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
Gets the rules from robots.txt which apply to the given URL.
getRobotRulesSet(Protocol, String) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
 
getRobotRulesSet(Protocol, URL) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
 
getStatusCode() - Method in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
 
getString(Map<String, Object>, String) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
 
getString(Map<String, Object>, String, String) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
 
getTargetURL() - Method in class com.digitalpebble.stormcrawler.parse.Outlink
 
getText() - Method in class com.digitalpebble.stormcrawler.parse.ParseData
 
getValueAndReset() - Method in class com.digitalpebble.stormcrawler.util.CollectionMetric
 
getValues(String) - Method in class com.digitalpebble.stormcrawler.Metadata
 
getValues(String) - Method in class com.digitalpebble.stormcrawler.parse.ParseData
 
getValues(String, String) - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 
guessMimeType(String, String, byte[]) - Method in class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
 

H

handleResponse(HttpResponse) - Method in class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
 
HostURLFilter - Class in com.digitalpebble.stormcrawler.filtering.host
Filters URL based on the hostname.
HostURLFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.host.HostURLFilter
 
httpDateFormat - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Format dates in HTTP headers, cf.
HttpHeaders - Interface in com.digitalpebble.stormcrawler.protocol
A collection of HTTP header names.
HttpProtocol - Class in com.digitalpebble.stormcrawler.protocol.httpclient
Uses Apache httpclient to handle http and https
HttpProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
 
HttpRobotRulesParser - Class in com.digitalpebble.stormcrawler.protocol
This class is used for parsing robots.txt for URLs belonging to the HTTP protocol.
HttpRobotRulesParser(Config) - Constructor for class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
 

I

init(Map) - Method in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 
init(Map) - Method in class com.digitalpebble.stormcrawler.persistence.DefaultScheduler
 
init(Map) - Method in class com.digitalpebble.stormcrawler.persistence.Scheduler
 
init() - Method in class com.digitalpebble.stormcrawler.util.PerSecondReducer
 
INTERVAL_DEC_RATE - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Configuration property (float) to set the decrement rate.
INTERVAL_INC_RATE - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Configuration property (float) to set the increment rate.
INTERVAL_MAX - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Configuration property (int) to set the maximum fetch interval in minutes.
INTERVAL_MIN - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Configuration property (int) to set the minimum fetch interval in minutes.
isAllowAll() - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
 
isAllowed(String) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
 
isAllowNone() - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
 
isEmpty() - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 
isFeedKey - Static variable in class com.digitalpebble.stormcrawler.bolt.FeedParserBolt
 
isLocal - Variable in class com.digitalpebble.stormcrawler.ConfigurableTopology
 
isNoCache() - Method in class com.digitalpebble.stormcrawler.util.RobotsTags
 
isNoFollow() - Method in class com.digitalpebble.stormcrawler.util.RobotsTags
 
isNoIndex() - Method in class com.digitalpebble.stormcrawler.util.RobotsTags
 
isSitemapKey - Static variable in class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
 
iterator() - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
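The INTERVAL_MIN, INTERVAL_MAX, INTERVAL_INC_RATE and INTERVAL_DEC_RATE entries above describe an adaptive fetch interval bounded by a minimum and a maximum. A minimal sketch of that idea follows. The class name, the method name and the multiplicative formula are assumptions made for illustration and are not taken from AdaptiveScheduler, whose actual arithmetic may differ.

```java
// Hypothetical helper, not part of StormCrawler.
final class AdaptiveIntervalSketch {

    /**
     * Computes the next fetch interval in minutes.
     * Assumed rule: shrink by decRate when the page changed,
     * grow by incRate when it did not, then clamp to [min, max].
     */
    static int nextInterval(int current, boolean changed,
                            float incRate, float decRate,
                            int min, int max) {
        float next = changed
                ? current * (1.0f - decRate)
                : current * (1.0f + incRate);
        int rounded = Math.round(next);
        // Keep the result inside the configured bounds.
        return Math.max(min, Math.min(max, rounded));
    }
}
```

With a 100-minute interval, both rates at 0.5 and bounds of 60 and 1000 minutes, a changed page would be clamped to 60 minutes and an unchanged page would move to 150 minutes.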
 

J

jsoup2DOM(Document) - Static method in class com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder
Returns a W3C DOM that exposes the same content as the supplied Jsoup document.
jsoup2HTML(Document) - Static method in class com.digitalpebble.stormcrawler.parse.JSoupDOMBuilder
 
JSoupDOMBuilder - Class in com.digitalpebble.stormcrawler.parse
TODO use org.jsoup.helper.W3CDom instead?
JSoupParserBolt - Class in com.digitalpebble.stormcrawler.bolt
Parser for HTML documents only which uses ICU4J to detect the charset encoding.
JSoupParserBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
 

K

keySet() - Method in class com.digitalpebble.stormcrawler.Metadata
 

L

LAST_MODIFIED - Static variable in interface com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
LinkParseFilter - Class in com.digitalpebble.stormcrawler.parse.filter
ParseFilter to extract additional links with XPath; can be configured with e.g.
LinkParseFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter
 
loadConf(String, Config) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
 
loadListFromConf(String, Map) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
Return one or more Strings regardless of whether they are represented as a single String or a list in the config.
LOCATION - Static variable in interface com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
LOG - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
 
LOG - Static variable in class com.digitalpebble.stormcrawler.protocol.selenium.SeleniumProtocol
 
LOG - Static variable in class com.digitalpebble.stormcrawler.spout.FileSpout
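The loadListFromConf entry above returns one or more Strings whether the configuration holds a single String or a list. The sketch below shows one way such a lookup could behave. ConfListSketch and asStringList are hypothetical names, and the typed Map parameter is an assumption; this is not the ConfUtils implementation.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of StormCrawler.
final class ConfListSketch {

    /** Returns the value stored under key as a list of Strings. */
    static List<String> asStringList(String key, Map<String, Object> conf) {
        List<String> result = new ArrayList<>();
        Object value = conf.get(key);
        if (value == null) {
            // Assumption: a missing key yields an empty list.
            return result;
        }
        if (value instanceof Collection) {
            // The config holds a list: copy each element as a String.
            for (Object element : (Collection<?>) value) {
                result.add(String.valueOf(element));
            }
        } else {
            // The config holds a single value: wrap it in a list.
            result.add(value.toString());
        }
        return result;
    }
}
```

Callers can then iterate over the result without checking which form the configuration used.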
 

M

main(AbstractHttpProtocol, String[]) - Static method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
Called by extensions of this class
main(String[]) - Static method in class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
 
main(String[]) - Static method in class com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol
 
match(String) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexRule
Checks if a url matches this rule.
MaxDepthFilter - Class in com.digitalpebble.stormcrawler.filtering.depth
Filter out URLs whose depth is greater than maxDepth.
MaxDepthFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter
 
maxDepthKeyName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Metadata key name for tracking a non-default max depth
maxFetchErrorsParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
Number of successive FETCH_ERROR before status changes to ERROR
maxFetchInterval - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 
MD5SignatureParseFilter - Class in com.digitalpebble.stormcrawler.parse.filter
Computes a signature for a page, based on the binary content or text.
MD5SignatureParseFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.MD5SignatureParseFilter
 
MemorySpout - Class in com.digitalpebble.stormcrawler.spout
Stores URLs in memory.
MemorySpout(String...) - Constructor for class com.digitalpebble.stormcrawler.spout.MemorySpout
 
MemoryStatusUpdater - Class in com.digitalpebble.stormcrawler.persistence
Use in combination with the MemorySpout for testing in local mode.
MemoryStatusUpdater() - Constructor for class com.digitalpebble.stormcrawler.persistence.MemoryStatusUpdater
 
Metadata - Class in com.digitalpebble.stormcrawler
Wrapper around Map <String,String[]>
Metadata() - Constructor for class com.digitalpebble.stormcrawler.Metadata
 
Metadata(Map<String, String[]>) - Constructor for class com.digitalpebble.stormcrawler.Metadata
Wraps an existing HashMap into a Metadata object - does not clone the content
metadata2fieldParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Mapping between metadata keys and field names for indexing. Can be a list of values separated by a = or a single string.
MetadataFilter - Class in com.digitalpebble.stormcrawler.filtering.metadata
Filter out URLs based on metadata in the source document
MetadataFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.metadata.MetadataFilter
 
metadataFilterParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
List of metadata key + values to be used as a filter.
metadataPersistParamName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Parameter name indicating which metadata to persist for a given document but not transfer to outlinks.
MetadataTransfer - Class in com.digitalpebble.stormcrawler.util
Implements the logic of how the metadata should be passed to the outlinks, what should be stored back in the persistence layer, etc.
MetadataTransfer() - Constructor for class com.digitalpebble.stormcrawler.util.MetadataTransfer
 
metadataTransferClassParamName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Class to use for transferring metadata to outlinks.
metadataTransferParamName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Parameter name indicating which metadata to transfer to the outlinks and persist for a given document.
minFetchInterval - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
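The MD5SignatureParseFilter entry above computes a page signature from the binary content or the text. The sketch below shows how an MD5 signature can be produced with the JDK's MessageDigest. Md5SignatureSketch and md5Hex are hypothetical names, and the lowercase hex encoding and UTF-8 charset are assumptions; the filter itself may encode or store the signature differently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of StormCrawler.
final class Md5SignatureSketch {

    /** Returns the MD5 digest of the bytes as lowercase hex. */
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to an unsigned value before formatting.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Signature of extracted text, encoded as UTF-8 (an assumption). */
    static String md5Hex(String text) {
        return md5Hex(text.getBytes(StandardCharsets.UTF_8));
    }
}
```

Comparing the signature from the previous fetch with the new one is enough to tell whether the content changed, which is the comparison the AdaptiveScheduler description relies on.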
 

N

NavigationFilter - Class in com.digitalpebble.stormcrawler.protocol.selenium
 
NavigationFilter() - Constructor for class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilter
 
NavigationFilters - Class in com.digitalpebble.stormcrawler.protocol.selenium
Wrapper for the NavigationFilter defined in a JSON configuration
NavigationFilters(Map, String) - Constructor for class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters
Loads the filters from a JSON configuration file
needsDOM() - Method in class com.digitalpebble.stormcrawler.parse.filter.ContentFilter
 
needsDOM() - Method in class com.digitalpebble.stormcrawler.parse.filter.DebugParseFilter
 
needsDOM() - Method in class com.digitalpebble.stormcrawler.parse.filter.XPathFilter
 
needsDOM() - Method in class com.digitalpebble.stormcrawler.parse.ParseFilter
Specifies whether this filter requires a DOM representation of the document
needsDOM() - Method in class com.digitalpebble.stormcrawler.parse.ParseFilters
 
NEVER - Static variable in class com.digitalpebble.stormcrawler.persistence.DefaultScheduler
Date far in the future used for never-refetch items.
nextTuple() - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
 
nextTuple() - Method in class com.digitalpebble.stormcrawler.spout.MemorySpout
 
normaliseToMetadata(Metadata) - Method in class com.digitalpebble.stormcrawler.util.RobotsTags
Adds a normalised representation of the directives in the metadata

O

open(Map, TopologyContext, SpoutOutputCollector) - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
 
open(Map, TopologyContext, SpoutOutputCollector) - Method in class com.digitalpebble.stormcrawler.spout.MemorySpout
 
Outlink - Class in com.digitalpebble.stormcrawler.parse
 
Outlink(String) - Constructor for class com.digitalpebble.stormcrawler.parse.Outlink
 
Outlink(String, String) - Constructor for class com.digitalpebble.stormcrawler.parse.Outlink
 
overwriteLastModified - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 

P

ParseData - Class in com.digitalpebble.stormcrawler.parse
 
ParseData() - Constructor for class com.digitalpebble.stormcrawler.parse.ParseData
 
ParseData(String, Metadata) - Constructor for class com.digitalpebble.stormcrawler.parse.ParseData
 
ParseData(Metadata) - Constructor for class com.digitalpebble.stormcrawler.parse.ParseData
 
ParseFilter - Class in com.digitalpebble.stormcrawler.parse
Implementations of ParseFilter are responsible for extracting custom data from the crawled content.
ParseFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.ParseFilter
 
ParseFilters - Class in com.digitalpebble.stormcrawler.parse
Wrapper for the ParseFilters defined in a JSON configuration
ParseFilters(Map, String) - Constructor for class com.digitalpebble.stormcrawler.parse.ParseFilters
Loads the filters from a JSON configuration file
ParseResult - Class in com.digitalpebble.stormcrawler.parse
 
ParseResult() - Constructor for class com.digitalpebble.stormcrawler.parse.ParseResult
 
ParseResult(Map<String, ParseData>) - Constructor for class com.digitalpebble.stormcrawler.parse.ParseResult
 
parseRules(String, byte[], String, String) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
Parses the robots content using the SimpleRobotRulesParser from crawler commons
PARTITION_MODE_DOMAIN - Static variable in class com.digitalpebble.stormcrawler.Constants
 
PARTITION_MODE_HOST - Static variable in class com.digitalpebble.stormcrawler.Constants
 
PARTITION_MODE_IP - Static variable in class com.digitalpebble.stormcrawler.Constants
 
PARTITION_MODEParamName - Static variable in class com.digitalpebble.stormcrawler.Constants
 
PerSecondReducer - Class in com.digitalpebble.stormcrawler.util
Used to return an average value per second
PerSecondReducer() - Constructor for class com.digitalpebble.stormcrawler.util.PerSecondReducer
 
prepare(Map, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.FeedParserBolt
 
prepare(Map, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.FetcherBolt
 
prepare(Map, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
 
prepare(Map, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
 
prepare(Map, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
 
prepare(Map, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
 
prepare(Map, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
 
prepare(Map, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt
 
prepare(Map, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
 
prepare(Map, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.indexing.DummyIndexer
 
prepare(Map, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.indexing.StdOutIndexer
 
prepare(Map, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
 
prepare(WorkerTopologyContext, GlobalStreamId, List<Integer>) - Method in class com.digitalpebble.stormcrawler.util.URLStreamGrouping
 
Protocol - Interface in com.digitalpebble.stormcrawler.protocol
 
ProtocolFactory - Class in com.digitalpebble.stormcrawler.protocol
 
ProtocolFactory(Config) - Constructor for class com.digitalpebble.stormcrawler.protocol.ProtocolFactory
 
ProtocolResponse - Class in com.digitalpebble.stormcrawler.protocol
 
ProtocolResponse(byte[], int, Metadata) - Constructor for class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
 
put(String, String) - Method in class com.digitalpebble.stormcrawler.parse.ParseData
 
put(String, String, String) - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 
put(String, Metadata) - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 
putAll(Metadata) - Method in class com.digitalpebble.stormcrawler.Metadata
Puts all the metadata into the current instance

Q

QUEUE_MODE_DOMAIN - Static variable in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
 
QUEUE_MODE_HOST - Static variable in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
 
QUEUE_MODE_IP - Static variable in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
 

R

reduce(TimeReducerState, Object) - Method in class com.digitalpebble.stormcrawler.util.PerSecondReducer
 
RefreshTag - Class in com.digitalpebble.stormcrawler.util
 
RefreshTag() - Constructor for class com.digitalpebble.stormcrawler.util.RefreshTag
 
RegexRule - Class in com.digitalpebble.stormcrawler.filtering.regex
A generic regular expression rule.
RegexRule(boolean, String) - Constructor for class com.digitalpebble.stormcrawler.filtering.regex.RegexRule
Constructs a new regular expression rule.
RegexURLFilter - Class in com.digitalpebble.stormcrawler.filtering.regex
Filters URLs based on a file of regular expressions using the Java Regex implementation.
RegexURLFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilter
 
RegexURLFilterBase - Class in com.digitalpebble.stormcrawler.filtering.regex
An abstract class for implementing Regex URL filtering.
RegexURLFilterBase() - Constructor for class com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilterBase
 
RegexURLNormalizer - Class in com.digitalpebble.stormcrawler.filtering.regex
The RegexURLNormalizer is a URL filter that normalizes URLs by matching a regular expression and inserting a replacement string.
RegexURLNormalizer() - Constructor for class com.digitalpebble.stormcrawler.filtering.regex.RegexURLNormalizer
 
RemoteDriverProtocol - Class in com.digitalpebble.stormcrawler.protocol.selenium
Delegates the requests to one or more remote selenium servers.
RemoteDriverProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol
 
remove(String) - Method in class com.digitalpebble.stormcrawler.Metadata
 
resolveURL(URL, String) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
Resolves relative URLs and fixes a few java.net.URL errors in the handling of URLs with embedded params and pure query targets.
RESPONSE_COOKIES_HEADER - Static variable in class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
 
RobotRules - Class in com.digitalpebble.stormcrawler.protocol
Wrapper for BaseRobotRules which tracks the number of requests and length of the responses needed to get the rules.
RobotRules(BaseRobotRules) - Constructor for class com.digitalpebble.stormcrawler.protocol.RobotRules
 
RobotRulesParser - Class in com.digitalpebble.stormcrawler.protocol
This class uses crawler-commons for handling the parsing of robots.txt files.
RobotRulesParser() - Constructor for class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
 
ROBOTS_NO_CACHE - Static variable in class com.digitalpebble.stormcrawler.util.RobotsTags
 
ROBOTS_NO_FOLLOW - Static variable in class com.digitalpebble.stormcrawler.util.RobotsTags
 
ROBOTS_NO_FOLLOW_STRICT - Static variable in class com.digitalpebble.stormcrawler.util.RobotsTags
Whether to interpret the noFollow directive strictly (remove links) or not (remove anchor and do not track original URL).
ROBOTS_NO_INDEX - Static variable in class com.digitalpebble.stormcrawler.util.RobotsTags
 
RobotsFilter - Class in com.digitalpebble.stormcrawler.filtering.robots
URLFilter which discards URLs based on the robots.txt directives.
RobotsFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.robots.RobotsFilter
 
RobotsTags - Class in com.digitalpebble.stormcrawler.util
Normalises the robots instructions provided by the HTML meta tags or the HTTP X-Robots-Tag headers.
RobotsTags(Metadata) - Constructor for class com.digitalpebble.stormcrawler.util.RobotsTags
Get the values from the fetch metadata
RobotsTags() - Constructor for class com.digitalpebble.stormcrawler.util.RobotsTags
 
run(String[]) - Method in class com.digitalpebble.stormcrawler.ConfigurableTopology
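The RegexURLNormalizer entry above normalizes URLs by matching a regular expression and inserting a replacement string. The sketch below illustrates that technique with java.util.regex. RegexNormalizerSketch, addRule and normalize are hypothetical names, and the two example rules are assumptions; the rule file format used by RegexURLNormalizer is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of StormCrawler.
final class RegexNormalizerSketch {

    /** One pattern and the text that replaces every match. */
    private static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Appends a rule; rules are applied in insertion order. */
    RegexNormalizerSketch addRule(String regex, String replacement) {
        rules.add(new Rule(regex, replacement));
        return this;
    }

    /** Applies every rule in turn and returns the rewritten URL. */
    String normalize(String url) {
        String result = url;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.replacement);
        }
        return result;
    }
}
```

For instance, adding the rules "(?i);jsessionid=[0-9a-z]+" and "#.*$", both with an empty replacement, would strip a session id and a fragment from a URL while leaving the path and query untouched.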
 

S

schedule(Status, Metadata) - Method in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 
schedule(Status, Metadata) - Method in class com.digitalpebble.stormcrawler.persistence.DefaultScheduler
 
schedule(Status, Metadata) - Method in class com.digitalpebble.stormcrawler.persistence.Scheduler
Returns a Date indicating when the document should be refetched next, based on its status.
Scheduler - Class in com.digitalpebble.stormcrawler.persistence
 
Scheduler() - Constructor for class com.digitalpebble.stormcrawler.persistence.Scheduler
 
schedulerClassParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.Scheduler
Class to use for Scheduler.
SeleniumProtocol - Class in com.digitalpebble.stormcrawler.protocol.selenium
 
SeleniumProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.selenium.SeleniumProtocol
 
SelfURLFilter - Class in com.digitalpebble.stormcrawler.filtering.basic
Filters links to self
SelfURLFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.basic.SelfURLFilter
 
SET_LAST_MODIFIED - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Configuration property (boolean) controlling whether to set the "last-modified" metadata field when a page change is detected by signature comparison.
setAnchor(String) - Method in class com.digitalpebble.stormcrawler.parse.Outlink
 
setConf(Config) - Method in class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
 
setConf(Config) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
Set the Configuration object
setContent(byte[]) - Method in class com.digitalpebble.stormcrawler.parse.ParseData
 
setContentLengthFetched(int[]) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
Sets the number of bytes fetched per request when not cached
setLastModified - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 
setMetadata(Metadata) - Method in class com.digitalpebble.stormcrawler.parse.Outlink
 
setMetadata(Metadata) - Method in class com.digitalpebble.stormcrawler.parse.ParseData
 
setOutlinks(List<Outlink>) - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 
setTargetURL(String) - Method in class com.digitalpebble.stormcrawler.parse.Outlink
 
setText(String) - Method in class com.digitalpebble.stormcrawler.parse.ParseData
 
setValue(String, String) - Method in class com.digitalpebble.stormcrawler.Metadata
Set the value for a given key.
setValues(String, String[]) - Method in class com.digitalpebble.stormcrawler.Metadata
 
SIGNATURE_KEY - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Name of the signature key in metadata, must be defined as "keyName" in the configuration of MD5SignatureParseFilter.
SIGNATURE_MODIFIED_KEY - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Key to store the date when the signature has been changed, must be listed in "metadata.persist".
SIGNATURE_OLD_KEY - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Name of key to hold previous signature: a copy, not overwritten by MD5SignatureParseFilter, is added by com.digitalpebble.stormcrawler.parse.filter.SignatureCopyParseFilter.
SimpleFetcherBolt - Class in com.digitalpebble.stormcrawler.bolt
A single-threaded fetcher with no internal queue.
SimpleFetcherBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
 
SiteMapParserBolt - Class in com.digitalpebble.stormcrawler.bolt
Extracts URLs from sitemap files.
SiteMapParserBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
 
size() - Method in class com.digitalpebble.stormcrawler.Metadata
 
size() - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 
skipRobots - Variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
start(ConfigurableTopology, String[]) - Static method in class com.digitalpebble.stormcrawler.ConfigurableTopology
 
Status - Enum in com.digitalpebble.stormcrawler.persistence
 
STATUS_ERROR_CAUSE - Static variable in class com.digitalpebble.stormcrawler.Constants
 
STATUS_ERROR_MESSAGE - Static variable in class com.digitalpebble.stormcrawler.Constants
 
STATUS_ERROR_SOURCE - Static variable in class com.digitalpebble.stormcrawler.Constants
 
StatusEmitterBolt - Class in com.digitalpebble.stormcrawler.bolt
Provides common functionalities for Bolts which emit tuples to the status stream, e.g.
StatusEmitterBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
 
StatusStreamName - Static variable in class com.digitalpebble.stormcrawler.Constants
 
StdOutIndexer - Class in com.digitalpebble.stormcrawler.indexing
Indexer which generates fields for indexing and sends them to the standard output.
StdOutIndexer() - Constructor for class com.digitalpebble.stormcrawler.indexing.StdOutIndexer
 
StdOutStatusUpdater - Class in com.digitalpebble.stormcrawler.persistence
Dummy status updater which dumps the content of the incoming tuples to the standard output.
StdOutStatusUpdater() - Constructor for class com.digitalpebble.stormcrawler.persistence.StdOutStatusUpdater
 
store(String, Status, Metadata, Date) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
 
store(String, Status, Metadata, Date) - Method in class com.digitalpebble.stormcrawler.persistence.MemoryStatusUpdater
 
store(String, Status, Metadata, Date) - Method in class com.digitalpebble.stormcrawler.persistence.StdOutStatusUpdater
 
storeHTTPHeaders - Variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
StringTabScheme - Class in com.digitalpebble.stormcrawler.util
Converts a byte array into URL + metadata
StringTabScheme() - Constructor for class com.digitalpebble.stormcrawler.util.StringTabScheme
 
StringTabScheme(Status) - Constructor for class com.digitalpebble.stormcrawler.util.StringTabScheme
 
submit(Config, TopologyBuilder) - Method in class com.digitalpebble.stormcrawler.ConfigurableTopology
Submits the topology with the name taken from the configuration
submit(String, Config, TopologyBuilder) - Method in class com.digitalpebble.stormcrawler.ConfigurableTopology
Submits the topology under a specific name

T

textFieldParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Field name to use for storing the text of a document
toASCII(String) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
 
toProtocolResponse() - Method in class com.digitalpebble.stormcrawler.protocol.file.FileResponse
 
toString() - Method in class com.digitalpebble.stormcrawler.Metadata
 
toString(String) - Method in class com.digitalpebble.stormcrawler.Metadata
Returns a String representation of the metadata with one K/V per line
toString() - Method in class com.digitalpebble.stormcrawler.parse.Outlink
 
toString() - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 
toUNICODE(String) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
 
trackDepthParamName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Parameter name indicating whether to track the depth from seed.
trackPathParamName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Parameter name indicating whether to track the url path or not.
TRANSFER_ENCODING - Static variable in interface com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
ttl - Variable in class com.digitalpebble.stormcrawler.ConfigurableTopology
 

U

urlFieldParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Field name to use for storing the url of a document
URLFilter - Interface in com.digitalpebble.stormcrawler.filtering
Unlike Nutch, URLFilters can normalise the URLs as well as filtering them.
URLFilterBolt - Class in com.digitalpebble.stormcrawler.bolt
 
URLFilterBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
 
URLFilters - Class in com.digitalpebble.stormcrawler.filtering
Wrapper for the URLFilters defined in a JSON configuration
URLFilters(Map, String) - Constructor for class com.digitalpebble.stormcrawler.filtering.URLFilters
Loads the filters from a JSON configuration file
URLPartitioner - Class in com.digitalpebble.stormcrawler.util
Generates a partition key for a given URL based on the hostname, domain or IP address.
URLPartitioner() - Constructor for class com.digitalpebble.stormcrawler.util.URLPartitioner
 
URLPartitionerBolt - Class in com.digitalpebble.stormcrawler.bolt
Generates a partition key for a given URL based on the hostname, domain or IP address.
URLPartitionerBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt
 
urlPathKeyName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Metadata key name for tracking the source URLs
URLStreamGrouping - Class in com.digitalpebble.stormcrawler.util
 
URLStreamGrouping() - Constructor for class com.digitalpebble.stormcrawler.util.URLStreamGrouping
 
URLStreamGrouping(String) - Constructor for class com.digitalpebble.stormcrawler.util.URLStreamGrouping
 
URLUtil - Class in com.digitalpebble.stormcrawler.util
Utility class for URL analysis
useCacheParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
Parameter name to indicate whether the internal cache should be used for discovered URLs.
useCookies - Variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
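The URLPartitioner entry above generates a partition key from the hostname, domain or IP address of a URL. The sketch below covers the hostname case only, since the domain and IP cases need a public suffix list or a DNS lookup. HostPartitionKeySketch and hostKey are hypothetical names, and returning null for an unparsable URL is an assumption rather than documented behaviour.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not part of StormCrawler.
final class HostPartitionKeySketch {

    /**
     * Returns the lowercased hostname of the URL, or null when the
     * URL cannot be parsed or has no host component.
     */
    static String hostKey(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

Two URLs on the same host produce the same key, so grouping a stream on that key sends them to the same task.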
 

V

valueForURL(Tuple) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Returns the value to be used as the URL for indexing purposes; if present, the canonical value is used instead.
valueOf(String) - Static method in enum com.digitalpebble.stormcrawler.persistence.Status
Returns the enum constant of this type with the specified name.
values() - Static method in enum com.digitalpebble.stormcrawler.persistence.Status
Returns an array containing the constants of this enum type, in the order they are declared.

X

XPathFilter - Class in com.digitalpebble.stormcrawler.parse.filter
Simple ParseFilter to illustrate and test the interface.
XPathFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.XPathFilter
 

_

_collector - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
 

Copyright © 2017 DigitalPebble Ltd. All rights reserved.