> Robots.txt Introduction, bots and spiders crawling the web
NilsC
post Feb 10 2005, 11:18 PM
Post #1


To Err Is Human, To Forgive Divine

Group: Members
Posts: 558
Joined: 24-December 04
From: http://www.ultimatekayakfishing.com/
Member No.: 1,871



Search engines look in the "root" directory for robots.txt. This file tells the spider / bot (called "User-agent" from now on) which files it can harvest and which folders it can harvest from. This is called "The Robots Exclusion Standard".

The format (syntax) of the robots.txt file has to be followed. It consists of records that have 2 fields. The first is the "User-agent" line; the second is one or more "Disallow" lines.

Syntax is <field> ":" <value>
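
As a rough illustration of the field/value syntax, here is a minimal Python sketch that splits one line into its two parts. The helper name `parse_line` is hypothetical, lower-casing the field name is an assumption made for convenience, and only whole-line comments (lines starting with #) are treated specially.

```python
def parse_line(line):
    """Split one robots.txt line into (field, value).

    Returns None for blank lines, whole-line comments, and lines
    without a ":" separator. (Hypothetical helper, for illustration only.)
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#") or ":" not in stripped:
        return None
    field, value = stripped.split(":", 1)
    return field.strip().lower(), value.strip()


print(parse_line("User-agent: Googlebot"))
print(parse_line("Disallow: /cgi-bin/"))
```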

You should create the file with Unix line endings. Good text editors or one of the Linux line editors will work.
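
If the file was already saved with Windows line endings, one possible fix is a small Python script. This is only a sketch; the function name `to_unix_line_endings` and the sample text are made up for illustration.

```python
def to_unix_line_endings(text):
    """Convert Windows (CRLF) and old Mac (CR) line endings to Unix (LF)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Sample content as a Windows editor might have saved it.
windows_text = "User-agent: *\r\nDisallow: /cgi-bin/\r\n"
unix_text = to_unix_line_endings(windows_text)
print(repr(unix_text))
```

When writing the result back to disk from Python, opening the file with `open("robots.txt", "w", newline="\n")` keeps the LF endings on any platform.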

***WARNING***


Do not use your HTML editor to create the robots.txt file unless it has a "text mode" to edit in. Notepad can be used if your FTP client has



First we talk about the "User Agents"

We have all seen "Googlebot", who hangs out here enough to be a Moderator, so we use that as an example.
The User-agent line specifies the bot that this record is for:

User-agent: Googlebot

or you can specify all bots/spiders using wildcard "*":

User-agent: *

There are lists of all known User-agents on the net and you can check your log files to find User-agents that hit your site.
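
Checking the log files can be partly automated. The sketch below assumes the server writes the common "combined" log format, where the user agent is the last quoted field on each line; the name `count_user_agents` and the sample lines are invented for illustration.

```python
import re
from collections import Counter

# Assumption: the log uses the common "combined" format, in which the
# user agent is the last double-quoted field on each line.
USER_AGENT_AT_END = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(log_lines):
    """Count how many requests each user agent made."""
    counts = Counter()
    for line in log_lines:
        match = USER_AGENT_AT_END.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines, not real traffic.
sample_log = [
    '192.0.2.1 - - [10/Feb/2005:23:18:00 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "Googlebot/2.1"',
    '192.0.2.1 - - [10/Feb/2005:23:18:05 +0000] "GET /index.html HTTP/1.0" 200 512 "-" "Googlebot/2.1"',
    '192.0.2.9 - - [10/Feb/2005:23:19:00 +0000] "GET /index.html HTTP/1.0" 200 512 "-" "msnbot/0.3"',
]
for agent, hits in count_user_agents(sample_log).most_common():
    print(hits, agent)
```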

Now comes the second part of the record, the "Disallow"

It may have one or more lines depending on how much restriction you need. It's called "Disallow" because all useragents are allowed to harvest all files and folders unless you specify otherwise. This example tells useragents that they are not allowed to harvest the file my_email.html (the path starts with "/", relative to the site root):

Disallow: /my_email.html

To tell the useragents that a directory is off limits, use this format:

Disallow: /cgi-bin/

Now useragents (conforming bots and spiders) will stay out of that directory.

Can I use wildcards when setting up the rules? In a way you can because if you use:

Disallow: /private

This would block /private.html and /private/index.html and any other files residing in the /private/ folder, because the value is matched as a prefix of the path. Don't put "Disallow:" alone on a line unless that is what you intend, because the empty value is interpreted as "good to go, no restrictions".
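
The prefix behaviour can be tried out with Python's standard-library `urllib.robotparser`. This is a sketch that assumes this parser is a fair stand-in for a conforming bot; the bot name `ExampleBot` and the page paths are made up.

```python
from urllib.robotparser import RobotFileParser

# Note: no blank line between User-agent and Disallow; this parser
# treats a blank line as the end of a record.
rules = """\
User-agent: *
Disallow: /private
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "ExampleBot" is a hypothetical user agent.
blocked_file = parser.can_fetch("ExampleBot", "/private.html")
blocked_folder = parser.can_fetch("ExampleBot", "/private/index.html")
open_page = parser.can_fetch("ExampleBot", "/public.html")
print(blocked_file, blocked_folder, open_page)
```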

This brings us to the next step. As "professional" web designers we put comments on everything we do, so we can figure out later what was going on when we wrote this file. You can put comments in the robots.txt file by starting the comment with #. One word of caution: put the comment on a line by itself, because a lot of useragents don't handle white-space well. The line:
"Disallow: /images/ #don’t need to see my wedding pictures"
will not stop all useragents from harvesting from the /images/ directory because they read this as:
"Disallow: /images/#don’tneedtoseemyweddingpictures"
and will go on into the /images/ directory. So put the comment on the next line after the Disallow statement.
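
To tidy an existing file, the comments can be moved to their own lines with a few lines of Python. This is a sketch only: `move_inline_comments` is a hypothetical helper, and it assumes no rule legitimately contains a # character.

```python
def move_inline_comments(lines):
    """Put any trailing '#' comment on its own line after the rule."""
    cleaned = []
    for line in lines:
        rule, separator, comment = line.partition("#")
        if separator and rule.strip():
            # The line holds a rule followed by a comment: split them.
            cleaned.append(rule.rstrip())
            cleaned.append("#" + comment)
        else:
            # Blank line, plain rule, or whole-line comment: keep as-is.
            cleaned.append(line)
    return cleaned


original = [
    "User-agent: *",
    "Disallow: /images/ #no wedding pictures",
]
tidied = move_inline_comments(original)
print("\n".join(tidied))
```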

A couple of examples to round out this tutorial:

In my web-site here at Astahost I use the simplest form of the robots.txt file because I use the web-site for testing concepts and don't need it harvested. You can see what it looks like here: http://www.ngc.astahost.com/robots.txt

CODE

#First example
User-agent: *
Disallow: /
#This keeps all robots out (the setting I use now)

#Second example
User-agent: *
Disallow:
#This allows the useragent to harvest all files and folders.

#Third example
User-agent: *
Disallow: /images/
Disallow: /cgi-bin/
Disallow: /my_mail.html
#This will block all bots from the files and folders inside
#the images and cgi-bin folders. The file my_mail.html will
#not be harvested.

#Fourth example
User-agent: Anthill
Disallow: /pricelist/
#This will block the Anthill useragent from accessing your
#pricelist folder and all files in that directory. (Anthill
#is used to gather price-information automatically from online
#stores. Support for international versions.)

#Fifth example
User-agent: linklooker
Disallow: /
#This is a new bot that is not registered, so who knows what
#the data is collected for.

List of known bots and spiders with detailed information and descriptions.

If you are allowing some bots but not all, make sure you list the bots you allow first, then the ones you deny.
CODE

#Allow googlebot, msnbot, and askjeeves to harvest all files and folders.
User-agent: googlebot
Disallow:

User-agent: msnbot
Disallow:

User-agent: askjeeves
Disallow:

#The rest of the bots and spiders are blocked
User-agent: *
Disallow: /


These are a few examples of how a simple robots.txt file is set up. There are a lot of complicated configurations out there, and even some sites that have put the User-agent and Disallow statements backward.
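
The "allow some bots, block the rest" setup can be sanity-checked with Python's standard-library `urllib.robotparser`. This is a sketch that assumes this parser behaves like a conforming bot; `SomeOtherBot` and the page path are made up, and the file is shortened to two named bots.

```python
from urllib.robotparser import RobotFileParser

# Records are separated by blank lines; there is no blank line
# inside a record, because this parser would treat it as a separator.
rules = """\
User-agent: googlebot
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

google_allowed = parser.can_fetch("googlebot", "/index.html")
msn_allowed = parser.can_fetch("msnbot", "/index.html")
# "SomeOtherBot" is a hypothetical user agent not named in the file.
other_allowed = parser.can_fetch("SomeOtherBot", "/index.html")
print(google_allowed, msn_allowed, other_allowed)
```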

There are a lot of good tutorials on robots.txt and you can do a search on Google to find them. Here is a link to TheBigCrawl, which checks robots.txt files for errors.
If you took the time to create a robots.txt file, take the time to check it with the "Robots.txt checker".

Hope this is of some help.

Nils
OpaQue
post Feb 10 2005, 11:36 PM
Post #2


Administrator

Group: Admin
Posts: 467
Joined: 26-August 04
Member No.: 1



Also, MSN and Google both support their own custom robots.txt commands, which their crawlers recognise.

One such example is wildcards. Googlebot understands them, and they can be used to ban certain types of files or even URLs.

You can check out MSN Search and go to the webmaster section to configure the robots.txt. They give instructions there for configuring it for the MSN crawler.

[ PS : This is a WONDERFUL TUTORIAL ]
szupie
post Feb 11 2005, 11:48 AM
Post #3


S.P.A.M.S.W.A.T.

Group: Members
Posts: 814
Joined: 22-January 05
From: San Antonio, Texas (No, I'm not dumb. I just moved here...)
Member No.: 2,284



Nice tutorial! I have a few questions:

1. Do all bots read this file?
2. If they do, do they all follow what you said in the file? Or can they break the rules and ignore your file?
3. Why not HTML editors?
NilsC
post Feb 11 2005, 01:28 PM
Post #4





  1. Do all bots read this file? Yes and no: all compliant bots and spiders read this file.
  2. Same answer: if they are compliant, they follow the rules.
  3. Unless the HTML editor has a "text edit feature" that puts Unix line endings at the end of each line, bots and spiders may not be able to read the file. (EditPad Lite, which is freeware, has a setting for this.)

Not all bots are alike; there are bots / spiders that do not conform. One example that I hate is the "spam harvesters": they do not follow the rules and will continue down all your pages collecting email addresses. That is why I'm a member of the "HoneyPot" project. Since conforming bots/spiders follow the rules and don't harvest from a file/directory that is on the "Disallow" list, you know that only "BAD" bots will harvest data from the spamtrap. Looking in the log files, you can then identify new User-agents that are bad or that identify themselves as users (browsers). An example is one of the most active spam harvesters, which operates its bots from hijacked computers in the Netherlands; its bots/spiders identify themselves as "Netscape 3.0 Compatible (WhatsNew Robot)", and you can see details here: Spam Harvest bot
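
The spamtrap idea can be sketched in Python: list a trap folder under "Disallow", then look in the log for any user agent that requested it anyway. This is only an illustration, assuming the "combined" log format; the folder name `/spamtrap/`, the helper `agents_hitting_trap`, and the sample log lines are all made up.

```python
import re

# Assumption: "combined" log format. Group 1 is the requested path,
# group 2 is the user agent string.
LOG_PATTERN = re.compile(
    r'"(?:GET|POST|HEAD) (\S+) [^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)


def agents_hitting_trap(log_lines, trap_prefix):
    """Return the user agents that requested a path under trap_prefix."""
    offenders = set()
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match and match.group(1).startswith(trap_prefix):
            offenders.add(match.group(2))
    return offenders


# Made-up sample lines; /spamtrap/ is assumed to be on the Disallow list.
sample_log = [
    '192.0.2.7 - - [11/Feb/2005:13:28:00 +0000] "GET /spamtrap/contact.html HTTP/1.0" 200 512 "-" "Netscape 3.0 Compatible (WhatsNew Robot)"',
    '192.0.2.1 - - [11/Feb/2005:13:29:00 +0000] "GET /index.html HTTP/1.0" 200 512 "-" "Googlebot/2.1"',
]
bad_agents = agents_hitting_trap(sample_log, "/spamtrap/")
print(bad_agents)
```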

Hope this helped.
Nils
ewc21
post Mar 20 2005, 08:58 AM
Post #5


Newbie [ Level 1 ]

Group: Members
Posts: 4
Joined: 20-March 05
Member No.: 3,128



I wonder if putting the following in your HEAD tag will help

<meta name="robots" content="follow, index">

or should you just use the robots.txt on your root folder.

Anyone?
spacewaste
post Mar 20 2005, 09:18 AM
Post #6


Premium Member

Group: Members
Posts: 385
Joined: 13-October 04
From: Ontario
Member No.: 1,175



I'm wondering: what if I put my robots file inside the directory /example?

Would it read only what is inside example...

or would it still read my whole site?
syhs89
post May 2 2005, 08:15 AM
Post #7


Newbie [ Level 1 ]

Group: Members
Posts: 9
Joined: 2-May 05
Member No.: 4,639



robots.txt is supposed to be put in the root folder... otherwise it might not do the trick...
And may I know other robots' names, like Yahoo's?
What if I just want to block googlebot?
Is it
User-agent: googlebot
Disallow: /

User-agent: *
Disallow:
????
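
One way to check a file like this is Python's standard-library `urllib.robotparser`. A minimal sketch, assuming that parser is a fair stand-in for a compliant bot; the bot name `OtherBot` and the page path are made up:

```python
from urllib.robotparser import RobotFileParser

# Block googlebot everywhere, leave everything open for all other bots.
# No blank line inside a record: this parser treats it as a separator.
rules = """\
User-agent: googlebot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

googlebot_allowed = parser.can_fetch("googlebot", "/index.html")
# "OtherBot" is a hypothetical user agent not named in the file.
otherbot_allowed = parser.can_fetch("OtherBot", "/index.html")
print(googlebot_allowed, otherbot_allowed)
```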
m3ch4
post May 2 2005, 03:25 PM
Post #8


Advanced Member

Group: Members
Posts: 112
Joined: 29-April 05
Member No.: 4,527



QUOTE(NilsC @ Feb 10 2005, 07:18 PM)
Search engines look in the "root" directory for robots.txt. This file tells the spider / bot (called "User-agent" from now on) which files it can harvest and which folders it can harvest from. This is called "The Robots Exclusion Standard".


Quick question: root directory refers to where you place your "index.html" file, correct?
NilsC
post May 2 2005, 03:45 PM
Post #9





You can have index.html in any folder. Your root folder on this server is "public_html", and that's where the robots.txt file goes.
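
Seen from the bot's side, the file is always requested from the top of the site, whichever page the bot starts on. A small Python sketch of that idea; the helper name `robots_url` and the example addresses are hypothetical:

```python
from urllib.parse import urljoin


def robots_url(page_url):
    """Return the address where a bot would look for robots.txt."""
    # A leading "/" makes urljoin replace the whole path of page_url.
    return urljoin(page_url, "/robots.txt")


# Hypothetical page addresses, for illustration only.
from_subfolder = robots_url("http://www.example.com/example/index.html")
from_front_page = robots_url("http://www.example.com/")
print(from_subfolder)
print(from_front_page)
```

So a robots.txt placed inside a sub-folder such as /example/ is not where a conforming bot would look.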

Yahoo's bot is identified as "Inktomi Slurp":
http://www.robotstxt.org/wc/active/html/slurp.html
http://help.yahoo.com/help/us/ysearch/slurp/slurp-01.html
and there is a list of bots here:

http://www.robotstxt.org/wc/active/html/index.html

Nils
m3ch4
post May 2 2005, 03:53 PM
Post #10





QUOTE(NilsC @ May 2 2005, 11:45 AM)
You can have index.html in any folder. Your root folder on this server is "public_html", and that's where the robots.txt file goes.


I'm sorry, I need a bit more clarification here ><'' (m3ch4 = newb =P)

If I create a site (not done yet, but below is the basic map right now) and it looks like this...

system482.astahost.com (main page)
-includes "index.html"

->links to->system482.astahost.com/video_games/
-includes "index.html"
->links to->system482.astahost.com/personal/
-includes "index.html"
->links to->system482.astahost.com/blah_blah_blah/
-includes "index.html"

Where does "public_html" exist?