Web Crawler - The Effective Ones

[What is a Web Crawler]
A web crawler is a computer program that automatically browses the Internet. It is also known as a spider, bot, automatic indexer, or web robot.

A web crawler performs the following tasks:

1. It starts crawling from a given seed URL.
2. It retrieves the contents of the seed web page.
3. It identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
4. It subsequently crawls the retrieved URLs.
5. It also performs additional tasks such as URL validation and checking the freshness of content by comparing it against previously stored content.
6. It can also detect the language of a web page.
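
Step 3 above (finding hyperlinks and turning them into frontier candidates) can be sketched in Python using only the standard library. This is a minimal illustration, not a production parser: the names LinkExtractor and extract_links are made up for this example, and the choice to keep only http/https links and drop #fragments is an assumption about what a crawler would want.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) URLs found in html, in page order, without duplicates."""
    parser = LinkExtractor()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL and strip any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        # Basic URL validation: keep web links only (assumed policy for this sketch).
        if urlparse(absolute).scheme in ("http", "https") and absolute not in links:
            links.append(absolute)
    return links
```

Calling extract_links with the URL of a page and its HTML returns the list of URLs that would be appended to the crawl frontier.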

[Challenges]

Three characteristics of the web make crawling a very difficult task:

1. Its large volume:
The large volume implies that the crawler can only download a fraction of the web pages within a given time, so it needs to prioritize its downloads.

2. Its fast rate of change:
Content is constantly changing, so what you crawled an hour ago might have changed by now. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that pages have already been updated or even deleted.

3. Dynamic page generation:
Pages generated on the fly, for example from URL query parameters, can produce a practically unlimited number of distinct URLs, many of them leading to the same content.

Since it is very difficult to crawl the complete web, it is paramount that you crawl the web intelligently.

[Policies for Intelligent Crawling]

1. Selection Policy: states which pages to download.

2. Re-visit Policy: states when to check for changes to the pages.

3. Politeness Policy: states how to avoid overloading web sites.

4. Parallelization Policy: states how to coordinate distributed web crawlers.
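
As an illustration of the politeness policy, here is a rough Python sketch that combines two common courtesies: honoring a site's robots.txt rules and leaving a minimum gap between requests to the same host. The class name PolitenessGate, its methods, and the default delay are assumptions made for this example; only RobotFileParser and its parse and can_fetch methods come from the Python standard library.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Hypothetical helper: decides whether and when a URL may be fetched."""

    def __init__(self, user_agent, min_delay=1.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.min_delay = min_delay  # seconds between hits on one host (assumed value)
        self.clock = clock          # injectable so the logic can be tested without sleeping
        self.robots = {}            # host -> parsed robots.txt rules
        self.last_fetch = {}        # host -> time of the most recent request

    def load_robots(self, host, robots_txt):
        """Parse the text of a robots.txt file already downloaded for this host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def allowed(self, url):
        """True if no rules are known for the host or the rules permit this URL."""
        parser = self.robots.get(urlparse(url).netloc)
        return parser is None or parser.can_fetch(self.user_agent, url)

    def wait_time(self, url):
        """Seconds the crawler should still wait before hitting this URL's host."""
        last = self.last_fetch.get(urlparse(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record_fetch(self, url):
        """Remember that the host of this URL was just contacted."""
        self.last_fetch[urlparse(url).netloc] = self.clock()
```

A crawler would typically call allowed before fetching, sleep for wait_time seconds if it is non-zero, and call record_fetch after each request. Treating a host with no loaded rules as allowed is a simplification of this sketch.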

[Pseudo code for a Web Crawler]


Figure - Web Crawler Pseudo-Code
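
A crawler loop along the lines described in this article might look like the following Python sketch. It is an illustration written for this text, not a reproduction of the figure: the function names crawl and fetch_page, the max_pages limit, and the breadth-first (first in, first out) visiting order are all assumptions. The link extraction step is passed in as a function so the loop itself stays short.

```python
from collections import deque
from urllib.request import urlopen


def fetch_page(url, timeout=10):
    """Download a page and return its text, or None on common network or URL errors."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return None


def crawl(seed_url, fetch, extract_links, max_pages=100):
    """Visit pages starting from seed_url and return a dict of url -> content.

    fetch(url) returns the page text or None; extract_links(url, text)
    returns the URLs found on that page. Both are supplied by the caller.
    """
    frontier = deque([seed_url])  # URLs waiting to be visited (the crawl frontier)
    seen = {seed_url}             # every URL ever queued, so nothing is queued twice
    pages = {}                    # successfully downloaded pages

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()  # oldest first, giving breadth-first order
        content = fetch(url)
        if content is None:       # skip pages that could not be retrieved
            continue
        pages[url] = content
        for link in extract_links(url, content):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

Passing fetch_page as the fetch argument makes the loop download real pages. Swapping the deque for a priority queue would be one way to plug in a selection policy, and a delay before each fetch would be one way to add politeness.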
Copyright © 2012 - Zimbio, Inc. Some rights reserved.