For example, you crawl one web application that contains many site collections:
In this example, you see events that resemble the following in the ULS logs:
A server is set as unavailable when the crawler process receives 32 consecutive unexpected errors. An unexpected error occurs at the connectivity level, such as a time-out; it does not occur at the HTTP level. For example, a 404 error does not qualify as an unexpected error.
After 32 consecutive unexpected errors, the crawler marks the server as unavailable for 10 minutes. After 10 minutes, the crawler again tries to download one resource from the server. Instead of putting all documents in a failure state, the crawler delays the crawl of that server for 10 minutes. This gives the potential connectivity problem time to be resolved.
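The backoff behavior described above can be sketched as follows. This is a minimal illustrative model, not the crawler's actual implementation; the class and constant names are assumptions chosen for clarity.

```python
import time

UNEXPECTED_ERROR_THRESHOLD = 32   # consecutive connectivity-level failures
UNAVAILABLE_PERIOD_SECONDS = 600  # server is skipped for 10 minutes

class ServerState:
    """Hypothetical per-server health tracker mirroring the described logic."""

    def __init__(self):
        self.consecutive_errors = 0
        self.unavailable_until = 0.0

    def is_available(self, now=None):
        now = time.monotonic() if now is None else now
        return now >= self.unavailable_until

    def record_success(self, now=None):
        # Any successful download resets the consecutive-error counter.
        self.consecutive_errors = 0

    def record_unexpected_error(self, now=None):
        # Only connectivity-level failures (for example, time-outs) count here;
        # HTTP-level errors such as 404 never reach this path.
        now = time.monotonic() if now is None else now
        self.consecutive_errors += 1
        if self.consecutive_errors >= UNEXPECTED_ERROR_THRESHOLD:
            # Mark the server unavailable instead of failing every document.
            self.unavailable_until = now + UNAVAILABLE_PERIOD_SECONDS
            self.consecutive_errors = 0
```

After the 10-minute window elapses, `is_available` returns True again and the crawler retries a single resource before resuming the crawl of that server.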
Crawl exclusion rules take effect immediately after creation, and in-progress documents that match the exclusion rules are ignored. Documents that are already crawled are not removed until a full crawl is completed. (This is not an issue in this particular case.)
However, there is an issue with this approach. The match is performed on the access URL (that is, the URL that starts with "sts4:"), not on the http URL. Specifically, the match is performed on the access URL when document processing starts and on the display URL when processing is complete. However, the match on the display URL occurs after the document is downloaded and after the crawl errors occur.
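The mismatch described above can be demonstrated with a short sketch. The exclusion pattern and both URLs here are hypothetical examples, assuming a rule written against the http display URL while the crawler initially only sees the sts4 access URL.

```python
import re

# Hypothetical exclusion rule written against the http display URL.
display_rule = re.compile(r"^http://intranet/sites/excluded/.*")

# When document processing starts, the crawler only has the access URL,
# which uses the sts4 scheme, so a rule targeting the http URL cannot match.
access_url = "sts4://intranet/siteurl=sites/excluded/item"      # assumed form
display_url = "http://intranet/sites/excluded/item"             # assumed form

matches_at_start = bool(display_rule.match(access_url))
matches_at_end = bool(display_rule.match(display_url))

print(matches_at_start)  # False: the rule is skipped when processing starts
print(matches_at_end)    # True: but only after download and any crawl errors
```

This is why an exclusion rule phrased against the http URL does not prevent the download or the resulting crawl errors: by the time the display URL is matched, the work has already been done.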