Q: Topologies? Spouts? Bolts? I am confused!
Q: Do I need a Storm cluster to run StormCrawler?
A: No. It can run in local mode and will just use the Storm libraries as dependencies. It makes sense to install Storm in pseudo-distributed mode though so that you can use its UI to monitor the topologies.
Q: Why Apache Storm?
A: Apache Storm is an elegant framework with simple concepts, and it provides a solid platform for distributed stream processing. It gives us fault tolerance and guaranteed data processing out of the box. The project is also very dynamic and backed by a thriving community. Last but not least, it is released under the Apache License 2.0.
Q: How fast is StormCrawler?
A: This depends mainly on the diversity of hostnames and on your politeness settings. For instance, if you have 1M URLs from the same host and have set a delay of 1 second between requests, then the best you'll be able to do is 86,400 pages per day. In practice the figure will be lower than that, because of the time needed for fetching the content (which itself depends on your network and how large the documents are), parsing it, indexing it and so on. This is true of any crawler, not just StormCrawler.
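The arithmetic above can be sketched as a small helper. This is only a back-of-the-envelope upper bound, not something StormCrawler provides: the function name `max_pages_per_day` is hypothetical, and it assumes that politeness is enforced per host, that all hosts are fetched in parallel, and that fetching, parsing and indexing take no time at all.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def max_pages_per_day(num_hosts: int, delay_secs: float) -> int:
    """Upper bound on pages fetched per day (hypothetical helper).

    Assumes each host is hit at most once every `delay_secs` seconds
    and that every other cost (network, parsing, indexing) is zero.
    """
    if num_hosts < 1 or delay_secs <= 0:
        raise ValueError("need at least one host and a positive delay")
    return int(num_hosts * SECONDS_PER_DAY / delay_secs)


# One host with a 1 second delay: the example from the answer above.
print(max_pages_per_day(1, 1.0))    # 86400
# The same delay spread over 100 distinct hosts.
print(max_pages_per_day(100, 1.0))  # 8640000
```

The second call illustrates why hostname diversity matters: the per-host delay stays the same, but the ceiling scales with the number of hosts that can be fetched concurrently.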
Q: Speaking of which, why should I use it instead of, say, Apache Nutch?
A: This tutorial on using Nutch and StormCrawler for indexing with CloudSearch gives you some idea of how they compare in their methodology and performance. We also ran a comparative benchmark on a larger crawl.
In a nutshell (pardon the pun), Nutch proceeds by batch steps: it selects the URLs to fetch, fetches them, parses them, then updates its database with the new information about the URLs it just processed and adds the newly discovered URLs. The generate and update steps take longer and longer as the crawl grows, and the resources are used unevenly: when fetching, there is little CPU or disk used, whereas when doing all the other activities you are not fetching anything at all, which is a waste of time and bandwidth.
StormCrawler proceeds differently and does everything at the same time, which makes better use of the physical resources of the cluster. It can also potentially accommodate more use cases, e.g. when URLs naturally come as streams or when low latency is a must. URLs also get indexed as they are fetched and not as a batch. On a more subjective note, and apart from being potentially more efficient, StormCrawler is more modern, easier to understand and build, nicer to use, more versatile and more actively maintained (but this is just my opinion and I am of course completely biased).
Apache Nutch is a great tool though, which we used for years with many of our customers at DigitalPebble, and it can also do things that StormCrawler cannot currently do out of the box, such as deduplication or advanced scoring like PageRank.
Q: Do I need some sort of external storage? And if so, then what?
A: Yes, you'll need to store the URLs to fetch somewhere. The type of storage to use depends on the nature of your crawl. If your crawl is not recursive, i.e. you just want to process specific pages and/or won't discover new pages through more than one path, then you could use messaging queues like RabbitMQ, AWS SQS or Apache Kafka. All you'll need is a Spout implementation that can read from them and generate Tuples for the crawl topology. Luckily there are many existing resources you can leverage. You might even find that the FileSpout is all you need. An important consideration with queues is that you might want to bucket the URLs to fetch per host, domain or IP into separate queues and have one instance of the Spout per queue. This way you'll get a good diversity of URLs down the topology and optimal performance.
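The bucketing idea can be illustrated with a minimal sketch. It is not StormCrawler code: `bucket_by_host` is a hypothetical helper, and plain in-memory lists stand in for the real queues (RabbitMQ, SQS, Kafka and so on), each of which would be read by its own Spout instance.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def bucket_by_host(urls):
    """Group URLs by hostname; each bucket stands in for one queue."""
    buckets = defaultdict(list)
    for url in urls:
        # urlsplit().hostname is lower-cased, or None if the URL has no host.
        host = urlsplit(url).hostname or ""
        buckets[host].append(url)
    return dict(buckets)


seeds = [
    "https://example.com/a",
    "https://example.org/index.html",
    "https://example.com/b",
]
for host, queue in bucket_by_host(seeds).items():
    print(host, len(queue))
```

Bucketing by domain or IP instead of hostname would only change how the key is derived; the one-queue-per-key layout stays the same.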
If your crawl is recursive and there is a possibility that URLs which are already known are discovered multiple times, then a queue won't help, as it would add the same URL to the queue every time it is discovered. This would be very inefficient. Instead you should use a storage where the keys are unique, such as a relational database. StormCrawler has several resources you can use in the external modules, such as the ones for Solr, Elasticsearch or SQL.
The advantage of using StormCrawler is that it is both modular and flexible. You can plug it into pretty much any storage you want.
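To make the difference with a queue concrete, here is a minimal sketch of a store keyed on the URL, using an in-memory SQLite database. The table layout, the `DISCOVERED` status value and the `add_url` helper are assumptions chosen for illustration; this is not the schema used by StormCrawler's SQL module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The URL is the primary key, so each URL can be stored only once.
conn.execute("CREATE TABLE urls (url TEXT PRIMARY KEY, status TEXT NOT NULL)")


def add_url(conn, url):
    """Insert a newly discovered URL; return True only if it was unknown."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO urls (url, status) VALUES (?, 'DISCOVERED')",
        (url,),
    )
    # rowcount is 1 when a row was inserted, 0 when the key already existed.
    return cur.rowcount == 1


print(add_url(conn, "https://example.com/"))  # True
print(add_url(conn, "https://example.com/"))  # False
print(conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0])  # 1
```

With a plain queue, the second discovery would have produced a second message and therefore a second fetch; with unique keys it becomes a no-op.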
Q: Is StormCrawler polite?
Q: How do I know when a crawl is finished?
A: This is actually not straightforward. Storm topologies are designed to run continuously and there is no standard mechanism to determine when they should be terminated. Unless you implement a bespoke way of terminating the topology, you'll need to keep an eye on the progress of the crawl and stop it manually.