Sunday, March 18, 2007

Types of Google Crawlers

# Googlebot: crawls pages for our web index and our news index
# Googlebot-Mobile: crawls pages for our mobile index
# Googlebot-Image: crawls pages for our image index
# Mediapartners-Google: crawls pages to determine AdSense content. We only use this bot to crawl your site if you show AdSense ads on your site.
# AdsBot-Google: crawls pages to measure AdWords landing page quality.

How to block Googlebot using robots.txt

User-agent: Googlebot
Disallow: /
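
To check what a record like this does before publishing it, the rules can be fed to Python's standard urllib.robotparser module. This is only a rough sketch: the example.com URL and the SomeOtherBot name are placeholders, and the stdlib parser is not guaranteed to match Googlebot's own matching in every case.

```python
from urllib.robotparser import RobotFileParser

# The record from the post: block Googlebot from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com and SomeOtherBot are placeholders for illustration only.
googlebot_allowed = can_fetch(ROBOTS_TXT, "Googlebot", "http://example.com/page.html")
other_allowed = can_fetch(ROBOTS_TXT, "SomeOtherBot", "http://example.com/page.html")
print("Googlebot allowed:", googlebot_allowed)
print("Other bot allowed:", other_allowed)
```

Googlebot is refused, while a bot with no record of its own is unaffected.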

Allowing Googlebot only
If you want to block access to all bots other than Googlebot, you can use the following syntax:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

Googlebot follows the line directed at it, rather than the line directed at everyone.
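
The same rule can be tried out with Python's standard urllib.robotparser. A minimal sketch, assuming the Googlebot record carries an empty Disallow line (which blocks nothing); example.com and SomeOtherBot are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Everyone is blocked, except that Googlebot gets its own record.
# The empty "Disallow:" under Googlebot means "nothing is disallowed".
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# example.com is a placeholder host.
url = "http://example.com/index.html"
googlebot_ok = parser.can_fetch("Googlebot", url)
other_ok = parser.can_fetch("SomeOtherBot", url)
print("Googlebot:", googlebot_ok)
print("SomeOtherBot:", other_ok)
```

Googlebot uses its own record and is let in; every other bot falls back to the * record and is kept out.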

The Allow extension
Googlebot recognizes an extension to the robots.txt standard called Allow.

The Allow line works exactly like the Disallow line. Simply list a directory or page you want to allow.

You may want to use Disallow and Allow together.
To block access to all pages in a subdirectory except one, you could use the following entries:

User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html

Those entries would block all pages inside the folder1 directory except for myfile.html.
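
How a Disallow and an Allow line combine can be sketched in a few lines of Python. This assumes the most specific (longest) matching path wins, with Allow winning a tie, which is how Google describes its own handling; other parsers, including Python's urllib.robotparser, may instead take the first matching line in file order. The is_allowed helper is hypothetical and ignores wildcards.

```python
def is_allowed(rules, path):
    """Decide whether path may be crawled.

    rules is a list of (directive, path_prefix) tuples for one bot,
    e.g. ("Disallow", "/folder1/"). Assumption: the longest matching
    prefix wins, and Allow wins when two matches have equal length.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            is_allow = directive.lower() == "allow"
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len = len(prefix)
                allowed = is_allow
    return allowed


# The Googlebot record from the post.
rules = [
    ("Disallow", "/folder1/"),
    ("Allow", "/folder1/myfile.html"),
]

for path in ("/folder1/myfile.html", "/folder1/other.html", "/folder2/page.html"):
    print(path, "->", "allowed" if is_allowed(rules, path) else "blocked")
```

Only /folder1/other.html comes out blocked: myfile.html matches the longer Allow path, and /folder2/ matches no rule at all.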

If you block Googlebot and want to allow another of Google's bots (such as Googlebot-Mobile), you can allow access to that bot using the Allow rule. For instance:

User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Mobile
Allow: /
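
Which record a given Google bot ends up following can be sketched as below. This assumes each crawler obeys only the record whose User-agent name matches its own name most specifically, falling back to * when nothing else matches. The pick_group helper and the groups dictionary are hypothetical illustrations, not part of any library.

```python
def pick_group(groups, crawler):
    """Return the User-agent token of the record that crawler should obey.

    groups maps a User-agent token to that record's rule list.
    Assumption: the longest token that the crawler name starts with
    wins; "*" is used only when no named record matches.
    """
    crawler = crawler.lower()
    best = None
    for token in groups:
        lowered = token.lower()
        if lowered != "*" and crawler.startswith(lowered):
            if best is None or len(lowered) > len(best):
                best = token
    if best is None and "*" in groups:
        best = "*"
    return best


# Records from the post: Googlebot is blocked, Googlebot-Mobile is let in.
groups = {
    "Googlebot": [("Disallow", "/")],
    "Googlebot-Mobile": [("Allow", "/")],
}

for bot in ("Googlebot", "Googlebot-Mobile", "Slurp"):
    print(bot, "->", pick_group(groups, bot))
```

Googlebot-Mobile picks its own record rather than the shorter Googlebot one, so it is not caught by the Disallow: / line. Slurp matches no record here and is therefore unrestricted.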

Meanwhile, Yahoo's bot is called Slurp. To block a directory, such as /cgi-bin/, for Slurp:

User-agent: Slurp
Disallow: /cgi-bin/

Read more about Googlebot here
Read more about Yahoo's bot here