From The Field

Crawler Impact Rules - What not to do
In a recent engagement I was asked to troubleshoot an indexing problem for a customer. Essentially, the SharePoint indexer was crawling very slowly, despite running on a 64-bit server with 12 GB of RAM and having very fast network links to the database and the remote file shares.
 
So the first thing I did was take a look at these performance counters to assess the state of play:
 
 
Perf Counters
 
What was obvious from this was that the Documents Filtered rate was extremely low, at an average of 2 per second. A MOSS indexer at full throttle should be indexing considerably faster than this; 80 documents per second is not uncommon in well-performing systems.
 
What was also apparent was that the Threads Accessing Network figure was similar to the Filtering Threads total. This indicates that most of the threads were waiting on a response from the data source rather than doing any filtering work.
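As a rough illustration of that check, the comparison can be written as a simple ratio. This is only a sketch: the function names, the 0.8 threshold and the sample values are all hypothetical and are not taken from the customer's counters or from any SharePoint API.

```python
def network_wait_ratio(threads_accessing_network: float,
                       filtering_threads: float) -> float:
    """Fraction of filtering threads currently waiting on the network."""
    if filtering_threads <= 0:
        return 0.0
    return threads_accessing_network / filtering_threads


def looks_source_bound(threads_accessing_network: float,
                       filtering_threads: float,
                       threshold: float = 0.8) -> bool:
    """True when most threads are waiting on the data source.

    The 0.8 threshold is an assumed rule of thumb, not an official figure.
    """
    ratio = network_wait_ratio(threads_accessing_network, filtering_threads)
    return ratio >= threshold


# Hypothetical sample: 60 of 64 filtering threads are waiting on the network.
print(network_wait_ratio(60, 64))
print(looks_source_bound(60, 64))
```

A ratio close to 1 suggests the crawled sources, not the indexer, are the bottleneck.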
 
So where to go from here?
 
The thread count was so high in these counters that I wanted to see what 'tuning' had been applied to the indexer. There are two places to check:
 
  1. The Indexer Performance setting in the SSP Search Settings on the indexer.
  2. Any Crawler Impact Rules that may have been set.

In this case the Indexer Performance level was set to Maximum, which is fine for a dedicated indexing server.

When checking the Crawler Impact Rules, though, I discovered a whole world of problems.

Crawler Impact Rules

What I found here was that over twenty crawler impact rules had been configured for the search service, and each one had been set up to use the maximum number of simultaneous requests for the crawl: sixty-four.
 
TechNet provides the following best-practice guidance for crawler impact rules:
 
For crawling internal content in your organization, you can set crawler impact rules based on the performance and capacity of the crawled servers. For example, you might try to avoid crawling internal servers at peak load times. However, for crawling external sites, this kind of coordination is usually not feasible. Therefore, it is best to configure crawl requests to minimize consumption of external site resources and bandwidth so that external site administrators are less inclined to restrict your future access.
During initial deployment, set your crawler impact rules to minimize impact on crawled servers while crawling them frequently enough to ensure relatively fresh results. Later, during the operations phase, you can adjust crawler impact rules based on your experience and the data from your crawl logs.
With so many impact rules, all set to sixty-four, the target servers were simply overwhelmed with requests. That created a major bottleneck in the search service and led to the reduced performance we have seen.
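To get a feel for the scale, here is a back-of-the-envelope sketch. It assumes exactly twenty rules (the real count was "over twenty") and that every rule is in play at the same time, so it is an upper bound rather than a measurement. The function name is hypothetical.

```python
def max_simultaneous_requests(rule_count: int, requests_per_rule: int) -> int:
    """Upper bound on concurrent crawl requests if every rule is active at once."""
    return rule_count * requests_per_rule


# Assumed figure: 20 rules, each at the maximum of 64 requests.
print(max_simultaneous_requests(20, 64))  # 1280

# The same 20 rules at the default of 8 requests each.
print(max_simultaneous_requests(20, 8))   # 160
```

Even as a crude estimate, that is an eightfold difference in the load the crawler is allowed to place on the crawled sources.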
 
Testing the search service with the crawler impact rules' maximum requests reduced to the default of eight gave immediate improvements. The Documents Filtered rate rose to around thirty documents per second, and the ratio of Threads Accessing Network to total Filtering Threads improved enormously.
 
This was by no means the end of the story. The next task involved a lot of trial and error to determine the optimum configuration for the impact rules, or whether to delete them entirely.
 
The moral of this tale, though, is to get the message out about what Crawler Impact Rules are all about. They are not there to squeeze more output from the search engine; they are there to reduce the impact the crawler has on the sources being crawled. There is almost never a reason to increase this number beyond the default, and in many cases, such as this one, reducing the number actually improves performance.
