Multithreaded crawling library in Java

Thursday, 7 January 2010 18:05 by romeosa

This is a simple multithreaded crawling library I developed (consider it a beta version, but it should work fine).
If you need the source code, let me know.
If you want to send feedback, feel free to post a comment.

Here is an example of how to use the library:
1) the library dispatches an event whenever a new document is loaded, so register a listener for this event;

2) set the maximum number of concurrent threads;

3) start crawling.

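// Plus the imports for the library's own classes.
import java.net.MalformedURLException;
import java.net.URL;

MultiThreadCrawler crawler = new MultiThreadCrawler();

// 1) Listen for the event fired each time a document is loaded.
crawler.addDocumentCrawledEventListener(
	new DocumentCrawledListener()
	{
		public void documentCrawled(DocumentCrawledEvent e)
		{
			System.out.println(e.getDocument().getUrl());
			// Uncomment to also print how many URLs are still waiting:
			//System.out.println(
			//	((MultiThreadCrawler) e.getSource()).getRemainingUrlsCount());
		}
	});
try
{
	// 2) Set the maximum number of concurrent threads.
	crawler.setMaxConcurrentThreads(48);
	// 3) Start crawling; pass the full URL of the site, including "http://".
	crawler.startCrawling(new URL("FULL_URL_OF_THE_SITE(WITH_HTTP)"));
}
catch (MalformedURLException ex)
{
	// DO SOMETHING
}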
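
For anyone curious how a listener, a thread limit and a "start crawling" call can fit together, here is a minimal, self-contained sketch. It is not the code inside RomeosaCrawler.jar: the names ListenerCrawlerSketch, PageListener, addListener, setMaxThreads and crawl are hypothetical, and an in-memory map of links stands in for real HTTP requests so the example runs without a network. It assumes Java 9 or later.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;

// Illustrative only: a hypothetical mini-crawler, not the library's implementation.
class ListenerCrawlerSketch {

	// Hypothetical callback, similar in spirit to a "document crawled" listener.
	public interface PageListener {
		void pageCrawled(String url);
	}

	// Stand-in for the web: maps each URL to the links found on that page.
	private final Map<String, List<String>> links;
	private final List<PageListener> listeners = new CopyOnWriteArrayList<>();
	// Thread-safe set of URLs already scheduled, so no page is visited twice.
	private final Set<String> seen = ConcurrentHashMap.newKeySet();
	private int maxThreads = 4;

	ListenerCrawlerSketch(Map<String, List<String>> links) {
		this.links = links;
	}

	void addListener(PageListener listener) {
		listeners.add(listener);
	}

	void setMaxThreads(int maxThreads) {
		this.maxThreads = maxThreads;
	}

	// Crawls every page reachable from start and blocks until all are done.
	void crawl(String start) {
		// A fixed pool caps the number of pages processed at the same time.
		ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
		// The Phaser tracks pending pages; the first party is the calling thread.
		Phaser pending = new Phaser(1);
		submit(start, pool, pending);
		pending.arriveAndAwaitAdvance(); // returns once every page task has finished
		pool.shutdown();
	}

	private void submit(String url, ExecutorService pool, Phaser pending) {
		if (!seen.add(url)) {
			return; // already scheduled by another task
		}
		pending.register(); // one more page in flight
		pool.execute(() -> {
			try {
				// Fire the event on the worker thread that "loaded" the page.
				for (PageListener listener : listeners) {
					listener.pageCrawled(url);
				}
				// Schedule the outgoing links before this task reports completion.
				for (String next : links.getOrDefault(url, List.of())) {
					submit(next, pool, pending);
				}
			} finally {
				pending.arriveAndDeregister(); // this page is done
			}
		});
	}
}
```

Note that in this sketch the listener runs on the worker threads, so several callbacks can execute at once. A listener that collects results should write to a thread-safe structure such as a set from ConcurrentHashMap.newKeySet().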

 

 

And finally, you can download the library here:

RomeosaCrawler.jar (23.68 KB)
