Feb 14 2006, 02:54 AM
Post
#1
|
|
|
Advanced Member Group: Validating Posts: 111 Joined: 28-January 06 Member No.: 10,917 |
As we all know, robots are programs that traverse the Web automatically. Some people call them crawlers or spiders.

Quite often, you need to stop a robot (like Googlebot) from crawling a specific portion of your website. You can do it in two different ways. First, you can place a specially formatted file named robots.txt in the root of your site, at http://www.yourdomain.com/robots.txt. Second, the Robots META tag (a special HTML META tag) can be used to indicate whether a page may be indexed, or analysed for links, by a crawler. Usually, a combination of the Robots META tag and the robots.txt file gives the best result.

In a nutshell, when a robot (like Googlebot or msnbot) visits a website, say http://www.yourdomain.com/, it first checks for http://www.yourdomain.com/robots.txt. If it finds this document, it analyses its contents for records. An example of a simple robots.txt file is shown below:

QUOTE
User-agent: *
Disallow: /login.php

The record above directs all robots (Google, MSN, Yahoo) not to crawl the login.php file located in the root directory of your website. A detailed discussion of the robots.txt file is available at Robotstxt.org.

Google has added a new feature for checking which URLs are excluded by the robots.txt file. That is, you can check in a matter of seconds whether Googlebot is complying with the instructions in your robots.txt file. You need a Google Account to check the robots.txt file of your website. The only catch is that it is still in beta (like Google Sitemaps), but it may turn out to be pretty effective in the near future.
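
If you want to see for yourself how a well-behaved robot would read the record above, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed is made up for this illustration, and the rules are fed in as a string rather than fetched from a live site. It only shows what a compliant robot should do; it says nothing about what a particular robot actually does.

```python
from urllib.robotparser import RobotFileParser

# The example record from the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login.php
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url.

    is_allowed is a hypothetical helper name used only for this sketch.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/login.php", "/index.php"):
        url = "http://www.yourdomain.com" + path
        print(path, is_allowed(ROBOTS_TXT, "Googlebot", url))
```

To check a live site instead, the same class offers set_url() and read(), which download the robots.txt file before you call can_fetch().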
|
|
|
Feb 14 2006, 07:31 AM
Post
#2
|
|
|
Advanced Member Group: Members Posts: 196 Joined: 17-June 05 From: Topi,Swabi,NWFP,Pakistan Member No.: 6,301 |
I wonder why it's a txt file and not XML. XML would be a far better choice.
|
|
|
|
Feb 14 2006, 12:59 PM
Post
#3
|
|
|
Veteran Nut Group: Members Posts: 527 Joined: 4-October 05 From: UK Member No.: 8,895 |
It doesn't matter what it is. All it does is hold simple information that any basic language can understand.
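
To give a feel for how simple that information is, here is a rough sketch of a hand-written reader in Python. The function name parse_records and the dictionary layout are made up for this illustration, and it only understands User-agent and Disallow lines, so treat it as a toy rather than a full implementation of the format.

```python
def parse_records(text):
    """Group 'Field: value' lines into records, one per User-agent block.

    parse_records is a hypothetical name; only the User-agent and
    Disallow fields are handled in this sketch.
    """
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # A User-agent line after some Disallow lines starts a new record;
            # consecutive User-agent lines share one record.
            if current is None or current["disallow"]:
                current = {"agents": [], "disallow": []}
                records.append(current)
            current["agents"].append(value)
        elif field == "disallow" and current is not None:
            current["disallow"].append(value)
    return records


if __name__ == "__main__":
    print(parse_records("User-agent: *\nDisallow: /login.php\n"))
```

A couple of string splits cover the whole format, which is probably why a plain text file was considered enough.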
|
|
|
|
Feb 15 2006, 05:28 PM
Post
#4
|
|
|
Whitest Black Mage Group: [MODERATOR] Posts: 1,316 Joined: 20-May 05 From: NB, Canada Member No.: 5,281 |
Awesome. I've never really thought about it, but I was always curious how those bots worked and whether you could influence them directly in any way. I'll have to read up a bit on that site you posted when I have some more free time. Thanks for the link.
|
|
|
|
Feb 15 2006, 06:26 PM
Post
#5
|
|
|
the Q Group: [HOSTED] Posts: 1,013 Joined: 13-July 05 From: Lithuania, Vilnius Member No.: 7,059 |
It is not an XML file because txt is much simpler to use. Besides, robots.txt files have been around for a really long time, and XML is not that old. Not everyone knows how to use XML, so some people might not use it, whereas this is one of the simplest things to do. I agree that an XML addition could be made, but most robots would not support it, only the ones that have been updated.
|
|
|
|
Feb 20 2006, 03:00 AM
Post
#6
|
|
|
Member [ Level 1 ] Group: Members Posts: 30 Joined: 20-February 06 Member No.: 11,416 |
I agree. Besides, a lot of people don't know XML (like me), and it would be hard to write, so we might screw it up (like I did with Google Sitemaps, lol). You can also create a .htaccess file, which will force bots to comply with your commands, but most free hosts do not allow it.
|
|
|
|
Sep 2 2006, 06:04 PM
Post
#7
|
|
|
Premium Member Group: Members Posts: 216 Joined: 7-March 05 From: Carrollton, TX Member No.: 2,953 |
Hmm... sounds interesting.
Maybe I'll try that. It would be fun to mess around with the famous Google bots, haha. But I'll have to look it over again some other time.
|
|
|