Friday, August 05, 2005

Optimizing External Site Crawls

After launching SharePoint Blogsearch, I discovered (with some help from Greg) that the SharePointPSSearch (SPSSearch) service needs a few tweaks to work well on external web sites. After doing some digging around in the documentation, looking into the packets with a protocol analyzer, and generally scratching my head in confusion, I learned a few things:

1. SPSSearch does not always honor robots.txt files (a text file placed in the root directory of a web site that tells crawlers what they may and may not fetch). Yes, the documentation says it does, and you can modify the crawler's id string in the registry, but it doesn't always seem to work. I'm still trying to come up with an answer to this one. (A sample robots.txt and a quick compliance check appear in the first sketch after this list.)

2. By default, the crawler will request as many documents from the target site as it can fit into the available threads, or until it starts receiving TCP errors; in other words, it will hammer an external site into submission. Fortunately, you can control this errant behavior. Go to SharePoint Central Site Administration > Manage Search Settings. In the 'Site Hit Frequency Rules' section, click 'Manage Site Hit Frequency Rules', then click 'Add Update Rule' on the toolbar. In the Site Name field, enter "*" for all web sites (you can also set rules by explicit name, domain, etc.). Next, select the 'Limit number of documents requested simultaneously' radio button and enter a small number (minimum is 1, max is 999, I used 5) in the Number of Documents field. This significantly reduces the load on target servers (the second sketch after this list illustrates what the limit does).

3. Incremental updates are SUPPOSED to ignore content that has not changed, crawling only those docs that have. In reality, this is not the case: I noticed that all documents were being processed even though the content was static. I ran a full update and an incremental update against a static site, and the exact same load was generated in both cases (packet count, bytes, etc.). This could be a bug, or it could be some hidden registry setting somewhere that I haven't found yet. Any ideas would be appreciated. (The third sketch after this list shows the conditional-request behavior I expected an incremental crawl to generate.)

4. Adaptive updates only appear to work on SPS/WSS sites. According to the docs, adaptive updates make an educated guess, based on historical patterns, as to what content may have changed on a site. In theory, this should greatly reduce the load on a crawled site, and it may very well work that way in SPS/WSS, but it doesn't seem to have any effect on other types of sites. My test configuration was limited, though, so I may not have all the data. Again, if anyone has any ideas, please share. (The last sketch after this list is my reading of what an adaptive update is supposed to do.)
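
To make item 1 concrete, here is a sample robots.txt and a quick way to check what a compliant crawler should refuse to fetch. This is just an illustration in Python - the agent name and URLs are made up, and none of it is SPSSearch code:

    # Illustration only: a sample robots.txt and a check of what a compliant
    # crawler should skip. The agent name "MSSearch" is an assumption, not
    # necessarily the id string SPSSearch actually reports.
    from urllib.robotparser import RobotFileParser

    robots_lines = [
        "User-agent: *",
        "Disallow: /private/",
    ]

    rp = RobotFileParser()
    rp.parse(robots_lines)

    print(rp.can_fetch("MSSearch", "http://www.example.com/blogs/post.html"))    # True
    print(rp.can_fetch("MSSearch", "http://www.example.com/private/doc.html"))   # False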
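
For item 2, the hit frequency limit is essentially a cap on concurrent requests against one host. Here is a minimal sketch (placeholder URLs, and not how SPSSearch actually implements it) of what capping simultaneous requests at 5 looks like:

    # Minimal sketch of a 5-document concurrency cap, analogous to the
    # Number of Documents value in a hit frequency rule. Not SPSSearch code;
    # the URL list is a placeholder.
    import threading
    import urllib.request

    MAX_SIMULTANEOUS = 5                      # the value I used in the rule
    slots = threading.Semaphore(MAX_SIMULTANEOUS)

    def fetch(url):
        with slots:                           # wait for one of the 5 slots
            with urllib.request.urlopen(url) as resp:
                return resp.read()

    # With the semaphore in place, at most 5 requests are ever outstanding
    # against the target server at the same time.
    urls = ["http://www.example.com/doc%d.html" % i for i in range(50)]
    threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()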
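
For item 3, what I expected an incremental update to do is send conditional requests - ask the server whether a document has changed since the last crawl and skip the download if it hasn't. Whether SPSSearch uses If-Modified-Since at all is exactly what I can't confirm; this sketch only shows the mechanism I expected to see on the wire:

    # Sketch of the conditional request behavior I expected from an
    # incremental update (not a description of what SPSSearch actually does).
    # A 304 response means the document is unchanged and is not re-downloaded.
    import urllib.error
    import urllib.request

    def fetch_if_changed(url, last_modified=None):
        req = urllib.request.Request(url)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read(), resp.headers.get("Last-Modified")
        except urllib.error.HTTPError as err:
            if err.code == 304:               # unchanged since the last crawl
                return None, last_modified
            raise

    # First call downloads the page; the second should transfer almost nothing
    # if the server honors If-Modified-Since and the content is static.
    body, stamp = fetch_if_changed("http://www.example.com/default.htm")
    body_again, _ = fetch_if_changed("http://www.example.com/default.htm", stamp)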
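
And for item 4, my reading of the docs is that an adaptive update ranks content by how often it has changed in past crawls and spends its time on the frequently-changing pages first. The sketch below is just that reading expressed in code, with invented history data - it says nothing about how SPS actually stores or weights its history:

    # My interpretation of an adaptive update: order URLs by historical change
    # rate so frequently-changing pages are recrawled first. Illustration only;
    # the history values here are invented.
    def adaptive_order(history):
        """history maps a URL to one boolean per past crawl
        (True means the page had changed when it was crawled)."""
        def change_rate(url):
            changes = history[url]
            return sum(changes) / len(changes) if changes else 1.0
        return sorted(history, key=change_rate, reverse=True)

    history = {
        "http://example.com/news/default.aspx":  [True, True, False, True],
        "http://example.com/about/default.aspx": [False, False, False, False],
    }
    print(adaptive_order(history))    # news page first, about page last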

More on this topic as I continue to tweak the settings.

Update: If you want to learn more about search optimization, here is a KB article to get you started.

Update 2: Here's another tip. When creating an inclusion rule for a subdirectory on a site, the default behavior is to also include the parent site. For example, a new inclusion rule for http://www.theegroup.net/blogs/ would create two entries - one for the full path and one for the parent, http://www.theegroup.net. This means that any links to other URLs on the same site would also be processed, which greatly expands the scope of the crawl. To restrict this behavior, change the parent rule to an exclusion and leave the child URL as an inclusion. This confines the crawler to links under the child URL, as the sketch below illustrates.
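
To see why the pair of rules restricts the crawl, here is a rough sketch of the effect. The longest-matching-prefix logic is an assumption on my part - it reproduces the end result described above, but it isn't a documented description of how the crawler resolves overlapping rules:

    # Rough sketch of the rule pair described above. Assumes the most specific
    # (longest) matching prefix wins; that is my interpretation, not documented
    # SPSSearch behavior.
    RULES = [
        ("http://www.theegroup.net", "exclude"),         # parent: exclusion
        ("http://www.theegroup.net/blogs/", "include"),  # child: inclusion
    ]

    def allowed(url):
        matches = [(prefix, action) for prefix, action in RULES
                   if url.startswith(prefix)]
        if not matches:
            return False
        longest = max(matches, key=lambda m: len(m[0]))
        return longest[1] == "include"

    print(allowed("http://www.theegroup.net/blogs/archive/2005/08/05.aspx"))  # True
    print(allowed("http://www.theegroup.net/news/default.aspx"))              # False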