Article ID: 284022 - Last Review: October 26, 2007 - Revision: 3.3
Indexing Web Sites on the Internet
This article was previously published under Q284022
IMPORTANT: This article contains information about modifying the registry.
Before you modify the registry, make sure to back it up and make sure that you
understand how to restore the registry if a problem occurs. For information
about how to back up, restore, and edit the registry, click the following
article number to view the article in the Microsoft Knowledge Base:
256986 (http://support.microsoft.com/kb/256986/EN-US/) Description of the Microsoft Windows Registry

SUMMARY
This article describes considerations that you need to take
into account when you create a Web site content source for a public Web site.

MORE INFORMATION
If you understand the following considerations, you can
ensure that you perform indexing in the most efficient way, without causing
adverse effects to the Web sites. Understanding the following considerations
can also help you when troubleshooting common indexing issues.

Controlling Crawler Access with a Robots.txt File and HTML META Tags
A Web site administrator can use a Robots.txt file to indicate where robots (Web crawlers) can go on a Web site and whether to exclude specific crawlers. Web sites use these rules to keep robots out of certain areas; the rules are advisory, and they work because well-behaved crawlers read and honor them. Microsoft SharePoint Portal Server 2001 and Microsoft Office SharePoint Portal Server 2003 always look for this file when crawling and obey the restrictions in it.

A Web site administrator can also restrict access to certain documents by using Hypertext Markup Language (HTML) META tags. The INDEX/NOINDEX and FOLLOW/NOFOLLOW values in the tag tell the robot whether it can include the document in the index and whether it can follow the links in the document. For example, if you do not want the document to be indexed and you do not want links in the document to be followed, you can mark the document with the following tag:
<META name="robots" content="noindex, nofollow">
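As a rough illustration of how a crawler might interpret such a tag, the following Python sketch reads the robots META directives from a page and reports whether indexing and link following are allowed. The class and function names are hypothetical, and the sketch does not describe how SharePoint Portal Server implements this check.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <META name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # The content value is a comma-separated list of directives.
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta_policy(html_text):
    """Return (may_index, may_follow); both default to True (assumption)."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow


if __name__ == "__main__":
    page = '<html><head><META name="robots" content="noindex, nofollow"></head></html>'
    print(robots_meta_policy(page))
```

A page that carries no robots META tag is treated here as indexable and followable, which is an assumption of the sketch.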
SharePoint Portal Server always obeys the HTML rules of robots
exclusion when it crawls Web sites. Note that SharePoint Portal Server counts
robots exclusions as rule exclusions, which are not visible in the gatherer
log viewer by default. See the "Gatherer Log Information" section of this
article for additional information about how to view the gatherer logs.

Robots.txt files specify restrictions for each User Agent. Change the User Agent string so that it identifies your site when you crawl the Internet. By default, the string for SharePoint Portal Server is:
Mozilla/4.0 (compatible; MSIE 4.01; Windows NT; MS Search 4.0 Robot) Microsoft
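To picture how per-agent restrictions work, the following Python sketch evaluates a small Robots.txt file with the standard library's urllib.robotparser module. The agent name ExampleCorpCrawler, the paths, and the file contents are hypothetical. Python's parser matches on a short agent name rather than on the full User Agent string, and its matching rules are not necessarily those of SharePoint Portal Server.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical Robots.txt: one named crawler is kept out of /private/,
# and every other crawler is kept out of /cgi-bin/.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleCorpCrawler
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""


def build_parser(robots_txt):
    """Parse Robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, agent_name, url):
    """Return True if the rules allow agent_name to request url."""
    return parser.can_fetch(agent_name, url)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    for agent in ("ExampleCorpCrawler", "OtherBot"):
        for path in ("/private/report.htm", "/cgi-bin/run"):
            url = "http://www.mysite.com" + path
            print(agent, url, may_fetch(rules, agent, url))
```

Because the rules are keyed by agent name, an identifier that clearly names your organization lets a site administrator write rules that apply to your crawler alone.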
To add your identifier, you need to modify the registry.

WARNING: If you use Registry Editor incorrectly, you may cause serious problems that may require you to reinstall your operating system. Microsoft cannot guarantee that you can solve problems that result from using Registry Editor incorrectly. Use Registry Editor at your own risk.

Add the registry key that is appropriate for your version of SharePoint Portal Server:

For Microsoft SharePoint Portal Server 2001, add the following key:
HKEY_LOCAL_MACHINE\Software\Microsoft\Search\1.0\Gathering Manager\UserAgent

For Microsoft Office SharePoint Portal Server 2003, add the following key:
HKEY_LOCAL_MACHINE\Software\Microsoft\SPSSearch\Gathering Manager

Following Complex Links
By default, SharePoint Portal Server does not follow complex links (links that contain commands following a question mark in the URL; for example, http://www.mysite.com/default.asp?url=/somedir/somefile.htm). If the site that you are crawling contains complex links that you want to follow, you must create a site path rule for the site:

Crawling Password-Protected Web Sites
You cannot specify credentials in the URL that you enter for a Web site content source. If you want to crawl a password-protected site, create a site path rule. Follow steps 1 through 3 in the "Following Complex Links" section of this article to create the site path rule. Click the Options tab, click the Account tab, and then provide the user name and password.

Understanding That a File Type That Is Referenced in a Link May Be Excluded
Each workspace maintains a file type inclusion list, and when content sources are indexed, only those file types are indexed. If a Web site link references an excluded file type, the link is not followed and is logged as a rule exclusion. One example is a link such as:
http://www.mysite.com/Index.cfm?ArticleID=q284022
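For a link like the one above, the two checks discussed in this article (is the link complex, and is its file type on the inclusion list) can be pictured with the following Python sketch. The inclusion list shown is hypothetical, as are the function names, and treating a URL with no extension as included is an assumption of the sketch rather than documented product behavior.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical inclusion list; the real list is maintained per workspace.
INCLUDED_FILE_TYPES = frozenset({"htm", "html", "asp", "doc", "txt"})


def file_type(url):
    """Return the lowercase extension of the URL path, without the dot."""
    path = urlsplit(url).path
    _, ext = posixpath.splitext(path)
    return ext.lstrip(".").lower()


def is_complex_link(url):
    """A complex link carries a query string after a question mark."""
    return bool(urlsplit(url).query)


def is_file_type_included(url, included=INCLUDED_FILE_TYPES):
    """Check the URL's extension against the inclusion list."""
    ext = file_type(url)
    if not ext:
        return True  # assumption: extension-less URLs are not filtered
    return ext in included


if __name__ == "__main__":
    link = "http://www.mysite.com/Index.cfm?ArticleID=q284022"
    print(file_type(link), is_complex_link(link), is_file_type_included(link))
```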
Unless .cfm is added to the file type inclusion list, the link is
not followed.

Adding Proxy Server Settings
If your network uses a proxy server to access the Internet, you must provide the proxy server information so that the SharePoint Portal Server crawler can use that information. This information is configured on the Proxy Server tab of the server's properties in the SharePoint Portal Server Administration console. This information is used only by search, and if you change the information in that location, you do not affect any settings that are configured in Microsoft Internet Explorer.

Configuring Host Hops
When you create a Web site content source, you choose to index either This page or This site. When you select This site, all of the links to pages within that site are followed, but links to other Web sites are not. This can cause an immediate index failure if you attempt to index a Web site in which the default page performs an immediate redirect to another site (for example, if you connect to http://my_site.com and you are redirected to a default page at http://my_alternate_site.com). In this case, or if you want the crawler to follow links to other sites, you must configure a custom host hops setting. You can set host hops on the Configuration tab of the Web site content source properties.

IMPORTANT: Limit the number of site hops to the absolute minimum number necessary. When you perform an Internet crawl, you might index millions of documents in just a few site hops. If you set the number of site hops on a Web site content source to unlimited (by clicking Custom, and then clicking to clear the Limit site hops and Limit page depth check boxes), you must include a site path rule that specifically includes that content source in the index. Otherwise, the content source is automatically excluded from the index to prevent unlimited crawling.
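To see why a small site hop limit matters, the following Python sketch walks a hypothetical link graph and collects every URL that can be reached without exceeding a given number of site hops. It models a site hop as a change of host name along a link path and treats a redirect as an ordinary link; both are assumptions of the sketch, not a description of how SharePoint Portal Server counts hops. The function name and the sample data are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit

START_URL = "http://my_site.com/"

# Hypothetical link graph: each URL maps to the URLs that it links to.
SAMPLE_LINKS = {
    "http://my_site.com/": ["http://my_alternate_site.com/default.htm"],
    "http://my_alternate_site.com/default.htm": [
        "http://my_alternate_site.com/page2.htm",
        "http://third_site.example/",
    ],
}


def reachable_within_site_hops(start_url, links, max_site_hops):
    """Return the set of URLs reachable using at most max_site_hops host changes."""
    fewest_hops = {start_url: 0}  # fewest site hops found so far per URL
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        hops = fewest_hops[url]
        for target in links.get(url, ()):
            changed_host = urlsplit(target).hostname != urlsplit(url).hostname
            next_hops = hops + (1 if changed_host else 0)
            if next_hops > max_site_hops:
                continue  # following this link would exceed the limit
            if target not in fewest_hops or next_hops < fewest_hops[target]:
                fewest_hops[target] = next_hops
                queue.append(target)
    return set(fewest_hops)


if __name__ == "__main__":
    for limit in (0, 1, 2):
        found = reachable_within_site_hops(START_URL, SAMPLE_LINKS, limit)
        print(limit, sorted(found))
```

With a limit of zero, only the starting page is reached in this sample, which mirrors the redirect failure described above; each additional hop widens the reachable set.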
The site path rule strategy that is recommended when you are crawling Internet sites is to create an exclusion rule for the entire HTTP URL space (http://*), and then create inclusion rules for only those sites that you want to index.

Being a Considerate Crawler
When you crawl someone else's Web site, you increase the load on that server. You can use site hit frequency rules to avoid overloading a Web site that you are indexing. Site hit frequency rules specify how frequently documents are requested from a site and how many documents are requested. Site hit frequency rules are configured on the Load tab of the server's properties in the SharePoint Portal Server Administration console.

Gatherer Log Information
When you perform an index update, crawling activity is recorded in gatherer logs. The easiest way to view the gatherer logs is to use the gatherer log Web page viewer. To gain access to the log viewer, click the Click here for detailed log link in the Content Sources folder. By default, only error messages are displayed in the log viewer. If you want to view rule exclusions and successes for troubleshooting purposes, you can enable this ability on the Logging tab of the workspace properties in the SharePoint Portal Server Administration console. It is recommended that you not enable these logging settings unless you are actively troubleshooting, because logging the additional information greatly increases the log file size.

You can also view successes and rule exclusions by using the Gthrlog.vbs utility, a command-line utility that is located in the Support\Tools folder on the SharePoint Portal Server CD-ROM.
