THE CURRENT VERSION IS SEY 2003

Robots.txt and the Robots Meta Tag
by André le Roux, Dec. 2001
NOTE:
This page is NOT maintained. For an updated discussion on the Robots.txt
file and the Robots Meta Tag, please refer to the current version of
the Search Engine Yearbook.

The robots.txt file
What does the "robots" text file do?
Most sites contain pages that should not be indexed
by the search engines. Administrative pages are a typical example, such as
Pandecta Magazine's "contact" page, "contact.html". There's no need
to have it indexed, so we use the robots.txt file to tell the search
engine spider (robot) to ignore it.
Very important:
The robots.txt file must be in your root directory.
Like this: www.pandecta.com/robots.txt
Not like this: www.pandecta.com/admin/robots.txt
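Because the file always lives at the root, its address can be worked out from any page URL on the site. The following is a minimal sketch in Python using only the standard library; the helper name robots_txt_url and the "http" scheme are assumptions made for illustration, not part of the original article.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site hosting page_url.

    Hypothetical helper: the name and signature are illustrative only.
    """
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the path, query and fragment,
    # because robots.txt is only looked up at the root of the site.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A page in a subdirectory still maps to the root-level file.
print(robots_txt_url("http://www.pandecta.com/admin/contact.html"))
```

For the page above, the sketch prints http://www.pandecta.com/robots.txt, never a path under /admin/.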

The syntax of the robots.txt file
User-agent: *
Disallow: /images/
Disallow: /contact.html
Disallow: /privacy/privacy.html
The first line specifies which robots should ignore
/images/, /contact.html and /privacy/privacy.html. The asterisk * is
a wildcard, so all robots should ignore the directories and files listed
below it. If I only wanted Googlebot to ignore those directories and
files, I'd type "User-agent: Googlebot".
The second line refers to an entire directory.
Robots that obey the file will not index anything in that directory.
The third line refers to a specific page in the
root directory, in this case the contact.html file.
The fourth line refers to a specific file
in a specific directory.
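One way to see how a well-behaved robot reads these rules is to feed them to Python's standard urllib.robotparser module. This is only a sketch: the helper name allowed, the sample paths such as /images/logo.gif and /index.html, and the robot name "OtherBot" are assumptions for illustration, and real search engine spiders may interpret the file with their own parsers.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this article.
RULES = """\
User-agent: *
Disallow: /images/
Disallow: /contact.html
Disallow: /privacy/privacy.html
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path: str, agent: str = "Googlebot") -> bool:
    """Hypothetical helper: may `agent` fetch `path` on the example site?"""
    return parser.can_fetch(agent, "http://www.pandecta.com" + path)


# Sample paths are made up; only the three Disallow rules come from the text.
for path in ("/images/logo.gif", "/contact.html",
             "/privacy/privacy.html", "/index.html"):
    print(path, "allowed" if allowed(path) else "disallowed")

# Rules addressed to one robot only: other robots are not affected.
googlebot_only = RobotFileParser()
googlebot_only.parse(["User-agent: Googlebot", "Disallow: /images/"])
print(googlebot_only.can_fetch("Googlebot", "http://www.pandecta.com/images/logo.gif"))
print(googlebot_only.can_fetch("OtherBot", "http://www.pandecta.com/images/logo.gif"))
```

The first three paths come back as disallowed and /index.html as allowed. In the second parser only Googlebot is blocked from /images/, which matches the "User-agent: Googlebot" case described above.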
The robots meta tag
The Robots META tag does the same job as the robots.txt file,
one page at a time, but it is not as reliable. Not all robots honor
the robots meta tag.
Use it if your site is in a subdirectory like www.freewebspace.com/users/mycoolhomepage/
and you can't get the server administrator to add (or update)
a robots.txt file.
If you have access to your root directory, forget
about the robots meta tag. Use the robots.txt file. No need to have
both.

The syntax of the robots meta tag is:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
Type that between the <HEAD> and </HEAD>
tags on each page you do not want to be indexed.
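To check that a page really carries the tag, the page source can be scanned for it. The sketch below uses Python's standard html.parser module; the class name RobotsMetaParser and the sample page are assumptions for illustration, and it only reports which directives are present, not how any particular robot will act on them.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Hypothetical helper that collects directives from robots meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # "NOINDEX, NOFOLLOW" -> {"noindex", "nofollow"}
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


# Made-up sample page using the tag exactly as shown in the article.
PAGE = """<HTML><HEAD>
<TITLE>Contact</TITLE>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</HEAD><BODY>Contact details go here.</BODY></HTML>"""

meta_parser = RobotsMetaParser()
meta_parser.feed(PAGE)
print(sorted(meta_parser.directives))
print("noindex" in meta_parser.directives)
```

For the sample page this prints ['nofollow', 'noindex'] and then True. A page without the tag yields an empty set.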

More robots.txt resources
robots.txt Syntax Checker
http://www.tardis.ed.ac.uk/~sxw/robots/check/
Robot Names
Robot Names by the Search Engine Dictionary
CNN's robots.txt file:
http://www.cnn.com/robots.txt

This page is based on information contained in the Search Engine Yearbook 2003.
For more detailed search engine information & help, please refer to the
current version of the book.
