| A deployment has an ASP.NET web part that calls into a COM+ based application. Rather than use Kerberos delegation, the web part calls LogonUser(). This results in roughly 200-300 NTLM authentications per second. The company's infrastructure is based on x86 DCs across 14 domains. They had one outage after another, which manifested as white screens in IE when accessing SharePoint. I captured a memory dump of W3WP.exe and found that all the active threads had been waiting for quite some time on RPCSS. I went back to the server and captured a memory dump of the RPCSS process. It was waiting on LSASS. I then captured a memory dump of the whole machine to look at LSASS: it was waiting on responses to two authentication requests, with several more queued on the semaphore waiting to be authenticated. Lessons learned from this experience: First, if you have 14 domains and x86 DCs, make sure you have DCs or GCs (preferred if you are in Native mode or greater) close to the machines generating the authentication. Second, it's easier to check the netlogon.log file than to take memory dumps to find this problem. Third, it's now easier still to enable the Netlogon performance counters to diagnose this problem; this happened before they were created. Fourth, if the DCs can be profiled to ensure they can handle the load, increase MaxConcurrentAPI on the SharePoint boxes and on the DCs servicing SharePoint so that requests don't bottleneck on the trust. Fifth, if you use LogonUser(), understand the consequences and consider Kerberos instead.
Second story: a SharePoint farm is in an environment where the GCs are x64 with lots of RAM. They also had 8 domains around the world. On average 35,000 users (of 150,000 total) hit the farm. Due to the heavy collaborative nature of the farm, we saw 3,000-5,000 NTLM authentications per second. The DCs were fine from a utilization perspective. However, we were seeing a bottleneck on the SharePoint servers in the form of white screens or spinning globes. We checked the netlogon.log and found we were stalling on MaxConcurrentAPI. We raised it gradually on the SharePoint servers; the problem still occurred but took longer to appear. We moved the troubleshooting to the DCs and found that the trust was the bottleneck. We increased MaxConcurrentAPI on the DCs. The problem took longer still to surface. Finally we had enough data to see that the problem almost always surfaced on Wednesdays. We discovered that all the local DCs were being rebooted on Tuesday night. As a result, secure channels were being established with DCs over the WAN. Moral of the story: MaxConcurrentAPI modifications are not a cure-all. A solid domain architecture and DC placement are critical to success under high volumes.
Third story: why some don't like messing with MaxConcurrentAPI. A colleague of mine who had little experience with domains called me after bumping MaxConcurrentAPI to its highest value across the SharePoint servers and DCs. He was seeing the DCs spike in utilization to the point where they were sending back an "RPC server too busy" error. When I asked if he had first profiled the DCs to see if they could handle such a jump in load, he said no. Remember, when you bump MaxConcurrentAPI you are increasing load on the DCs by multiples. For example, if I have a default of 2 threads doing concurrent authentication and I bump it to 4, I have doubled the load from that server to my DC if the traffic is constant. In this customer's case it was caused by virtualized x86 DCs with really poor RAM allocations and bad disk planning. Therefore always test, test, test before increasing MaxConcurrentAPI. Having spent years in AD, I assume people will do this, but it's not always the case.
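To make the "by multiples" point concrete, here is a minimal sketch of the arithmetic. The function names and the farm size in the usage note are hypothetical, and the result is an upper bound that assumes each server keeps every concurrent slot busy.

```python
def relative_dc_load(old_max_concurrent_api: int, new_max_concurrent_api: int) -> float:
    """Upper-bound multiplier on the authentication load one server can push to its DC.

    Assumes the server has enough traffic to keep every slot busy.
    """
    if old_max_concurrent_api <= 0 or new_max_concurrent_api <= 0:
        raise ValueError("MaxConcurrentAPI values must be positive")
    return new_max_concurrent_api / old_max_concurrent_api


def farm_concurrent_requests(servers: int, max_concurrent_api: int) -> int:
    """Most in-flight authentication requests a group of servers can hold open at once."""
    if servers < 0 or max_concurrent_api <= 0:
        raise ValueError("servers must be >= 0 and max_concurrent_api must be positive")
    return servers * max_concurrent_api
```

For instance, `relative_dc_load(2, 4)` gives `2.0`, the doubling described above, and a hypothetical six-server farm set to 10 could hold `farm_concurrent_requests(6, 10)`, or 60, requests open against the DCs at once. That second number is the one to compare against what profiling says the DCs can absorb.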
Now take those stories and multiply them dozens of times. This is why I became such a Kerberos advocate: I got tired of working with plumbing created back in 1995. Kerberos is barely faster over long sessions, as Spence Harbar and Bob Fox have proven through a great deal of testing. However, it's a lifesaver in high-authentication environments with a distributed DC infrastructure containing several domains. The clients authenticate before contacting SharePoint. In my next post I will address PAC validation, why I have had to go to extraordinary lengths to turn it on, etc.
Updated information:
You can now raise MaxConcurrentAPI to 150 instead of 10 if necessary. |
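If the Netlogon performance counters are available, a starting value can be estimated rather than guessed. The sketch below assumes the sizing formula from Microsoft's published MaxConcurrentApi tuning guidance as I recall it: semaphore acquires plus semaphore timeouts, multiplied by the average semaphore hold time, divided by the length of the sample. The function name, parameters, and sample numbers are hypothetical, and the result is only a starting point for testing against profiled DCs.

```python
import math


def suggested_max_concurrent_api(
    semaphore_acquires: int,
    semaphore_timeouts: int,
    avg_hold_seconds: float,
    sample_seconds: float,
    ceiling: int = 150,  # upper limit mentioned in the update above
) -> int:
    """Estimate a MaxConcurrentAPI value from Netlogon counter deltas over one sample.

    Assumed formula: (acquires + timeouts) * average hold time / sample length.
    """
    if sample_seconds <= 0:
        raise ValueError("sample_seconds must be positive")
    if semaphore_acquires < 0 or semaphore_timeouts < 0 or avg_hold_seconds < 0:
        raise ValueError("counter values must not be negative")
    needed = (semaphore_acquires + semaphore_timeouts) * avg_hold_seconds / sample_seconds
    # Round up to a whole slot, never go below 1, never exceed the ceiling.
    return min(ceiling, max(1, math.ceil(needed)))
```

With hypothetical deltas of 900 acquires and 100 timeouts over a 100-second sample and an average hold time of 0.5 seconds, the estimate is 5. An estimate that lands on the ceiling is a hint that the real fix is DC placement or Kerberos, not a bigger number.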