What is "robots.txt"?
For those who don't know what robots.txt is, it is a text file that tells search engine robots whether or not to crawl a page. For example, any CMS (content management system) based site has an admin module online that should not be crawled by search engines; using robots.txt you can block that part from being crawled.
About /robots.txt
Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol. It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:
User-agent: *
Disallow: /
The "User-agent: *" means this section applies to all robots. The "Disallow:
/" tells the robot that it should not visit any
pages on the site.
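To make this concrete, here is a minimal sketch of how a polite crawler might run this check before fetching a page, using Python's standard-library urllib.robotparser (the robot name "MyCrawler" is just a placeholder):

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt once
rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# Ask whether our robot may visit a given URL before requesting it
if rp.can_fetch("MyCrawler", "http://www.example.com/welcome.html"):
    print("allowed to crawl")
else:
    print("blocked by robots.txt")

With the "User-agent: *" / "Disallow: /" rules above, can_fetch returns False for every page on the site.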
There are two
important considerations when using /robots.txt:
- robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.
- The /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.
So don't try to use /robots.txt to hide information.
How to create a /robots.txt file
Where to put it
- In the top-level directory of your web server.
- When a robot looks for the "/robots.txt" file for a URL, it strips the path component from the URL (everything from the first single slash) and puts "/robots.txt" in its place.
- For example, for "http://www.example.com/shop/index.html", it will remove the "/shop/index.html", replace it with "/robots.txt", and end up with "http://www.example.com/robots.txt" (see the sketch after this list).
- So, as a web
site owner you need to put it in the right place on your web server for
that resulting URL to work. Usually that is the same place where you put
your web site's main "index.html"
welcome page. Where exactly that is, and how to put the file there,
depends on your web server software.
- Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT".
What to put in it
The "/robots.txt" file is a text file, with one or more records. It usually contains a single record, looking like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
In this example,
three directories are excluded.
Note that you need a
separate "Disallow" line for every URL prefix you want to exclude --
you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line.
Also, you may not have blank lines in a record, as they are used to delimit
multiple records.
Note also that globbing and regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".
What you want to
exclude depends on your server. Everything not explicitly disallowed is
considered fair game to retrieve. Here follow some examples:
To exclude all robots from the entire server
User-agent: *
Disallow: /
To allow all robots complete access
User-agent: *
Disallow:
(Or just create an
empty "/robots.txt" file, or don't use one at all)
To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
To exclude a single robot
User-agent: BadBot
Disallow: /
To allow a single robot
User-agent: Google
Disallow:
User-agent: *
Disallow: /
To exclude all files except one
This is currently a
bit awkward, as there is no "Allow" field. The easy way is to put all
files to be disallowed into a separate directory, say "stuff", and
leave the one file in the level above this directory:
User-agent: *
Disallow: /~joe/stuff/
Alternatively you can
explicitly disallow all disallowed pages:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
/robots.txt checker
There are third-party tools that let you check your /robots.txt file:
- Google's robots.txt analysis tool (requires a Google Account)
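If you would rather do a quick local check before deploying, here is a minimal sketch using Python's standard-library urllib.robotparser; the rules and the robot name "MyCrawler" below are sample values:

from urllib.robotparser import RobotFileParser

rules = """User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

for path in ("/index.html", "/tmp/scratch.html"):
    url = "http://www.example.com" + path
    verdict = "allowed" if rp.can_fetch("MyCrawler", url) else "blocked"
    print(path, "->", verdict)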
If you really think it is useful, don't forget to share.
For more content, tips, and knowledge, kindly subscribe to the blog.