This is a simple multithreaded crawling library I developed (consider it a beta version, but it should work fine).
If you need the source code, let me know.
If you want to send feedback, feel free to post a comment.
Here is an example of how to use the library:
1) the library dispatches an event each time a new document is loaded, so register a listener for this event;
2) set the maximum number of concurrent threads;
3) start crawling from the site's full URL.
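import java.net.MalformedURLException;
import java.net.URL;

MultiThreadCrawler crawler = new MultiThreadCrawler();

// 1) Listen for the event dispatched each time a document is loaded.
crawler.addDocumentCrawledEventListener(new DocumentCrawledListener() {
    public void documentCrawled(DocumentCrawledEvent e)
    {
        System.out.println("" + e.getDocument().getUrl());
        // Uncomment to also print how many URLs are still waiting to be crawled:
        //System.out.println("" + ((MultiThreadCrawler)e.getSource()).getRemainingUrlsCount());
    }
});

try
{
    // 2) Cap the number of concurrent threads.
    crawler.setMaxConcurrentThreads(48);
    // 3) Start crawling; the URL must include the scheme (http://).
    crawler.startCrawling(new URL("FULL_URL_OF_THE_SITE(WITH_HTTP)"));
} catch (MalformedURLException ex)
{
    // The start URL was not valid: handle or report the error here.
}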
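If you are curious how the pattern used above (a listener notified per document, plus a cap on concurrent threads) can be put together, here is a minimal sketch. It is not the code inside RomeosaCrawler.jar: the class CrawlSketch, the method crawl and the in-memory link map are all hypothetical names made up for illustration, and the "site" is a plain map of page to outgoing links, so nothing touches the network.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.function.Consumer;

// Hypothetical sketch: none of these names come from the library.
class CrawlSketch {

    // Crawls an in-memory link graph (page -> outgoing links).
    // The listener is called once per crawled page, from a worker thread.
    // Returns the set of all pages that were reached.
    static Set<String> crawl(Map<String, List<String>> links, String start,
                             int maxThreads, Consumer<String> listener) {
        Set<String> visited = ConcurrentHashMap.newKeySet();
        // The fixed-size pool plays the role of the "max concurrent threads" setting.
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        // Tracks unfinished page tasks; the initial party is the calling thread.
        Phaser pending = new Phaser(1);

        visited.add(start);
        submit(pool, pending, links, start, visited, listener);

        // Blocks until every submitted page task has finished.
        pending.arriveAndAwaitAdvance();
        pool.shutdown();
        return visited;
    }

    private static void submit(ExecutorService pool, Phaser pending,
                               Map<String, List<String>> links, String page,
                               Set<String> visited, Consumer<String> listener) {
        // Register before handing the task to the pool, so the count
        // cannot reach zero while work is still being scheduled.
        pending.register();
        pool.execute(() -> {
            try {
                listener.accept(page); // the "document crawled" notification
                for (String next : links.getOrDefault(page, List.of())) {
                    // add() returns true only for the first thread to see a URL,
                    // so each page is scheduled exactly once.
                    if (visited.add(next)) {
                        submit(pool, pending, links, next, visited, listener);
                    }
                }
            } finally {
                pending.arriveAndDeregister();
            }
        });
    }
}
```

Calling CrawlSketch.crawl(links, "/", 4, url -> System.out.println(url)) would print each reachable page once, in no guaranteed order. Because the listener runs on worker threads in this sketch, anything it writes to should be thread-safe; whether the real library calls its listeners from worker threads is an assumption I have not stated above, so treat your listener the same way to be safe.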
Finally, you can download the library here:
RomeosaCrawler.jar (23.68 KB)