Java Open Source Projects
A web crawler (also known as a Web spider or Web robot) is a program or automated script which browses the World Wide Web in a methodical, automated manner. Other less frequently used names for Web crawlers are ants, automatic indexers, bots, and worms (Kobayashi and Takeda, 2000).
This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).
A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.(From Wikipedia)

Web Crawlers

Heritrix - Internet Archive Crawler
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix (sometimes spelled heretrix , or misspelled or missaid as h...
Web-Harvest
Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverag...
WebSPHINX - A Personal, Customizable Web Crawler
WebSPHINX ( Web site- S pecific P rocessors for H TML IN formation e X traction) is a Java class library and interactive development environment for web crawlers. A web crawler (al...
JSpider - a Web Spider Engine
JSpider is: A highly configurable and customizable Web Spider engine . Developed under the LGPL Open Source license In 100% pure Java You can use it to : Check your sit...
JoBo
JoBo is a simple program to download complete websites to your local computer. Internally it is basically a web spider. The main advantage to other download tools is that it can au...
bitmechanic- spindle
Spindle is a web indexing/search tool built on top of the Lucene toolkit. It includes a HTTP spider that is used to build the index, and a search class that is used to search the i...
Arachnid Web Spider Framework
Arachnid is a Java-based web spider framework. It includes a simple HTML parser object that parses an input stream containing HTML content. Simple Web spiders can be created by sub...
LARM
LARM is a 100% Java search solution for end-users of the Jakarta Lucene search engine framework. It contains methods for indexing files, database tables, and a crawler for indexing...
Arale - A java web spider
Arale can download entire web sites or specific resources from the web. Arale can also render dynamic sites to static pages. I wrote this utility in 2001 to familiarize myself with...
WebLech URL Spider
WebLech is a fully featured web site download/mirror tool in Java, which supports many features required to download websites and emulate standard web-browser behaviour as much as ...

List of Companies, Suppliers, Distributors, Importers & Exporters
Add to favorites | Contact US | English Books | Why and How | Sitemap Generator
Copyright © 2007 - 2008 BizDrv.com