The previous sections listed the counters for the most common use of Exchange Servers—as mail-flow and mailbox servers. However, some organizations heavily use Exchange server roles, such as front-end servers and public folder servers. For those organizations, there are other performance issues that need to be monitored.
Looking at Front-End Servers
Front-end servers, such as those that serve Outlook Web Access, authentication, IP address checking, Secure Sockets Layer (SSL) protocol, and encryption schemes, have security features that require significant processing. For these servers, you are likely to see increased processor activity, both in privileged and user mode, and an increase in the rate of context switches and interrupts. If the processors in the server cannot handle this increased load, queues are likely to develop.
If your front-end servers are using SSL, the Local Security Authentication Server (Lsass.exe) process may consume a large amount of CPU. This is because SSL processing occurs here at the server. This means that administrators used to monitoring CPU usage may see less processor consumed by the Inetinfo.exe process and more consumed by the Lsass.exe process.
Improving Front-End Server Performance
The following item describes how you can improve the front-end server performance:
· Use hardware cryptographic accelerators
When there is extremely high SSL use, you can improve performance by using hardware cryptographic accelerators to offload the calculations and remove SSL from being a bottleneck.
Looking at Public Folder Servers
For public folder servers, it is important to understand that the replication traffic between public folders (if there is more than one public folder in the topology) can affect all the servers involved. Arrange the replication schedule of the servers so that a replication queue does not mount any public folder. Processing replication changes causes resource competition with the operations already occurring on the server.
Replication Mail Flow
Replication messages are received by SMTP, categorized and handed to the local SMTP queue. The messages are then submitted to the Public Folder store. Once they have been submitted to the Public Folder store the messages are put in the Replication Receive Queue. The messages in the Replication Receive Queue then get processed and the changes are performed on the appropriate Public Folder
Use the counter listed in the following table to determine whether there is any public folder performance degradation.
Performance Counter for Public Folder Server Receive Queue
Counter |
Expected values |
MSExchangeIS Public\Replication Receive Queue Size Indicates the number of replication messages waiting to be processed. |
· This value should not go above 100. · This value should return to a minimum value between replication intervals |
Improving Public Folder Server Performance
The following item describes how you can improve the public folder server performance:
· Tune replication schedule to avoid queues
You can increase or decrease the frequency that a public folder replicates its content changes to other public folders. For some deployments, having replication contents replicate more frequently actually results in performance gains. These performance gains are possible because the increased replication frequency avoids big replication queues and involves less public folder content being replicated at a time.
However, if replication is never completing then you will see an unbounded growth of the Replication Receive Queue size. When changing the frequency of replication monitor this counter to ensure that replication is completing before the beginning of the next interval.
· Minimize the number of replicas
Minimizing the number of replicas will also result in performance gains. In some cases multiple replicas of folders may have been added in order to distribute the load over multiple servers. A more effective way to balance the load would be to divide the folders across multiple servers.
For example, if you were to divide the folders between two servers such that each server had replicas of half the hierarchy that scenario would result in less load per server than if the two servers that had replicas of all the folders. Dividing the folders results in better performance because there will not be the added performance cost of replicating all the content changes between the servers.
· Throttling Offline Address Book downloads to reduce network hit
Exchange 2003 and Outlook 2003 Cached Mode increase the number of downloads of the Offline Address Book (OAB). These downloads may result in very high network utilization and may overload the network. By default, every client request for a full OAB is served immediately. Throttling the OAB downloads will help limit the effect on the network. See the following Microsoft Knowledge Base articles for more information.
· Throttling full offline Address Book downloads to limit the effect on a LAN in Exchange Server 2003
· Structuring the Public Folder hierarchy
Reducing the number of direct subfolders will also help improve performance. A deeper and narrow hierarchy will have lower performance cost than a shallow and wide hierarchy. Limiting the number of direct subfolders for any given folder to 250 will reduce the cost of both replication and user actions.
· Limiting the number of messages per folder
Limiting the number of messages for a public folder will reduce search and sort times on that folder. This will improve the client experience as well as reduce load on the server.
It is not uncommon for servers to experience multiple bottlenecks. It is important, however, to understand whether there are any causal relations occurring—that is, where one subsystem's performance issues spills over to another subsystem. For example, a CPU-bound server can build up queues, which causes unusually high use of the SMTP disks.
Because of the possibility of causal relations occurring between subsystems, analyze the performance logs with regard to:
· The role assigned to the server.
· The cause or causes that trigger the performance degradation of one or more subsystems.
Generally, it is worth mitigating each bottleneck, and then seeing the effects of removing that malfunctioning piece of the puzzle. Otherwise, enforcing policies may be enough to mitigate issues caused by multiple bottlenecks. For instance, enforcing message sizes for POP3 retrievals can reduce the load on the database disk. However, enforcement may not be enough. There are many cases that will require upgrades or a redesign of the hardware.
Example of a Multiple Bottleneck
In this example, the Exchange server is a mailbox server that hosts 6,000 users. It is connected to three direct attach storage arrays:
· One array has the database.
· Another array has the transaction logs.
· The third array has the SMTP disks.
This Exchange server has two storage groups with two private databases, each database with 1,500 users. The SLA for backing up and restoring limits the number of storage groups to two.
The problem is that, during the daytime, users experience slow response as they use Microsoft® Office Outlook® in online mode.
Looking at the Symptoms
By collecting a performance log during the eight hours of day-time use for which this server is experiencing degraded performance, it becomes clear that the MSExchangeIS\RPC Requests counter is constantly around 60, and that some clients experience slow responses to the operations requested. Furthermore, the MSExchangeIS\RPC Averaged Latency counter is constantly hitting or going above 100 ms. These are clear symptoms of performance issues that need to be isolated.
Eight hours of day-time performance information
Analysis of the performance logs uncovered problems with the performance of the database drive, the log drive, and the CPU. The following sections indicate which performance counters were used to determine each problem.
Problem 1—Database drive with bad performance
The Exchange Server is connected to a Storage Area Network that cannot handle the I/O load. As shown in Figure 10, the write latencies on the database drive (as indicated by the PhysicalDisk\Average Disk sec/Write counter) average 62 ms, with frequent spikes above 80 ms and some above 100 ms.
Performance log of the database drive
Problem 1 Solution
By adding another array and controller, and then splitting the storage groups into separate arrays, the performance of the database drive improved.
Problem 2—Log drive with bad performance
As shown in the following figure, a slow transaction log drive is causing the Database\Log Record Stalls/sec counter to average 114 stalls/second, with constant spikes above 150 stalls/second. In addition, there are frequent log threads waiting as indicated by the Database\Log Record Stalls/sec counter, with spikes above 20.
Performance log of the transaction log drive
Problem 2 Solution
The controller responsible for the transaction logs was experiencing problems. The controller has the write-back cache disabled. The stalls subsided after replacing the old controller with a new controller that had a properly functioning write-back cache.
Problem 3—CPU-bound
As shown in the following figure, this server is experiencing high CPU usage (with the Processor\% Processor Time counter averaging 97%) and large processor queues (as indicated by the System\Processor Queue Length counter).
Performance log of CPU usage
Problem 3 Solution
The slowness on the database and transaction logs aggravated the CPU utilization, causing more context switches than necessary (an average of 50,000 on this performance log) and consequently over utilizing the CPU. By resolving the database issues in Problem 1 and the transaction logs issues in Problem 2, the CPU utilization problem shown in the figure in this problem is resolved as well.