This article was previously published under Q217103
We strongly recommend that all users upgrade to Microsoft Internet Information Services (IIS) version 7.0 running on Microsoft Windows Server 2008. IIS 7.0 significantly increases Web infrastructure security. For more information about IIS security-related topics, visit the following Microsoft Web site:
Web Spiders, often called Robots, are WWW search engines that "crawl" across the Internet and index pages on Web servers. A Web Spider then catalogs that information and makes it available to the Internet for searching. This makes it easier for users to find specific information on the Internet by allowing "one-stop shopping" through the Spider's WWW site. Most Robots also prioritize documents that are on the Internet, allowing search results to be "scored" or arranged in order of most likely matches on a search.
A Robots.txt file is a special text file that is always located in your Web server's root directory. This file contains restrictions for Web Spiders, telling them where they have permission to search. Web Robots are not required to respect Robots.txt files, but most well-written Web Spiders follow the rules you define.
A Robot identifies itself when it browses your site. This identifier is known as the "User-agent" and appears in the logs for IIS. Generally, the flow of events when a Web Spider crawls your site is similar to the following:
The Robot asks for your /robots.txt file and looks for a "User-agent:" line that refers to it specifically.
If it finds an entry for itself, such as "User-agent: WebRobot", then it follows the rules that pertain to it.
If it does not find an entry for itself, it looks for a global set of rules, such as "User-agent: *", and obeys those rules.
If the Robot has an entry for itself and a global set of rules is also present, the Robot's personal rules supersede the global rules.
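The lookup order described above can be sketched in Python. This is only an illustration under stated assumptions: the `select_rules` function, the `groups` dictionary, and the "/private" path are hypothetical names, and the exact, case-insensitive name comparison is a simplification of how real Robots match their own names.

```python
def select_rules(groups, robot_name):
    """Return the Disallow list that applies to robot_name.

    groups maps a User-agent value to its list of Disallow paths
    (a hypothetical, already-parsed form of a robots.txt file).
    """
    name = robot_name.lower()
    # An entry that names the Robot specifically takes precedence.
    for agent, rules in groups.items():
        if agent != "*" and agent.lower() == name:
            return rules
    # Otherwise fall back to the global "*" entry, if one exists.
    return groups.get("*", [])


groups = {
    "WebRobot": ["/private"],  # hypothetical personal rules
    "*": ["/"],                # global rules
}

print(select_rules(groups, "WebRobot"))  # ['/private']
print(select_rules(groups, "OtherBot"))  # ['/']
```

Because the personal entry is checked first, the global rules are consulted only when the Robot has no entry of its own.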
Rules for a user-agent are set up as "Disallow:" statements that tell a Robot where it cannot search. A Disallow statement applies to any address requested by the Robot that begins with the specified path. For example:
"Disallow: /test" causes a Web Spider to ignore /test/index.htm and any other address that begins with /test.
"Disallow: /" causes a Web Spider to ignore the whole site; sometimes this is desirable.
"Disallow: " allows a Web Spider to crawl the whole site.
Lines that begin with the pound symbol (#) denote comments, which can be useful when creating long sets of rules.
Examples
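As a rough illustration of the Disallow and comment rules described above, the following Python sketch treats each Disallow value as a path prefix and discards anything after a pound symbol. The function names `parse_disallows` and `is_blocked` and the path /default.htm are assumptions made for this example; a real Web Spider may apply additional matching rules.

```python
def parse_disallows(lines):
    """Collect Disallow values from rule lines, skipping # comments."""
    paths = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comment text
        if line.lower().startswith("disallow:"):
            paths.append(line.split(":", 1)[1].strip())
    return paths


def is_blocked(path, disallows):
    """True if path begins with any non-empty Disallow value."""
    return any(rule and path.startswith(rule) for rule in disallows)


rules = parse_disallows(["# keep spiders out of the test area",
                         "Disallow: /test"])
print(is_blocked("/test/index.htm", rules))  # True
print(is_blocked("/default.htm", rules))     # False
print(is_blocked("/default.htm", ["/"]))     # True: whole site blocked
print(is_blocked("/default.htm", [""]))      # False: empty value allows all
```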
This example disallows all Web Spiders for the entire site:

# Make changes for all web spiders
User-agent: *
Disallow: /
The following example disallows a Robot named "WebSpider" from the virtual paths "/marketing" and "/sales":

# Tell "WebSpider" where it can't go
User-agent: WebSpider
Disallow: /marketing
Disallow: /sales

# Allow all other robots to browse everywhere
User-agent: *
Disallow:
This example allows only a Web Spider named "SpiderOne" into a site, while denying all other Spiders:

# Allow "SpiderOne" in the site
User-agent: SpiderOne
Disallow:

# Deny all other spiders
User-agent: *
Disallow: /
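One way to check rule sets like the "WebSpider" and "SpiderOne" examples before deploying them is Python's standard `urllib.robotparser` module. The sketch below is an assumption-laden illustration, not part of the original article: the helper `load_rules`, the Robot name "OtherSpider", and the page paths are hypothetical, and Python's parser may differ in detail from any particular Web Spider.

```python
from urllib.robotparser import RobotFileParser


def load_rules(text):
    """Parse robots.txt text and return a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


WEBSPIDER_RULES = """\
# Tell "WebSpider" where it can't go
User-agent: WebSpider
Disallow: /marketing
Disallow: /sales

# Allow all other robots to browse everywhere
User-agent: *
Disallow:
"""

SPIDERONE_RULES = """\
# Allow "SpiderOne" in the site
User-agent: SpiderOne
Disallow:

# Deny all other spiders
User-agent: *
Disallow: /
"""

first = load_rules(WEBSPIDER_RULES)
print(first.can_fetch("WebSpider", "/marketing/plan.htm"))    # False
print(first.can_fetch("WebSpider", "/default.htm"))           # True
print(first.can_fetch("OtherSpider", "/marketing/plan.htm"))  # True

second = load_rules(SPIDERONE_RULES)
print(second.can_fetch("SpiderOne", "/default.htm"))    # True
print(second.can_fetch("OtherSpider", "/default.htm"))  # False
```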
This last example disallows FrontPage-related paths in the root of your Web site:
Microsoft Internet Information Services 6.0, Microsoft Internet Information Server 1.0, Microsoft Internet Information Server 2.0, Microsoft Internet Information Server 3.0, Microsoft Internet Information Server 4.0, Microsoft Internet Information Services 5.0, Microsoft Internet Information Services 7.0