Wednesday, July 1, 2015

The Tale of IO Performance Issues on a Server

I am writing this post from memory of an issue I faced some time ago, so I won't be able to provide screenshots or specific values. However, I learned so much from it that I think it is worth sharing here.

Sometimes, you have to go out of your domain area to get to the root of an issue and resolve it. 

I received complaints from users about slow performance on a server. When I asked for more details, the response was that performance was slow in general: everything was slow.

Here is what I checked and found on the server:
  1. Blocking - None
  2. Number of sessions - less than 100
  3. Wait stats - IO related waits dominated the wait stats.
  4. Page Life Expectancy - more than 10K. No memory constraint
  5. CPU utilization - minimal utilization
  6. Disk utilization - % disk idle value was high. Low disk utilization.
  7. Number of IOs on disks - never went above 500 IO per second. Low disk IO.
  8. IO latency - Disk ms/read and Disk ms/write frequently exceeded 100 ms on the data and log drives. High IO latency.
High disk IO latency at such a low IO rate and low disk utilization pointed to a disk IO issue. I asked the server and SAN teams to look into disk performance. The SAN team checked the response time between the server and the SAN and found it to be within their "good" range. The server admin found that the data and log drives were formatted with a 4K cluster size instead of a 64K cluster size. In the absence of any other findings, we took an outage and reformatted the drives with a 64K cluster size. As expected, this gave little or no noticeable performance boost. The high IO latency at a low IO rate and low disk utilization continued.
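
The reasoning behind that checklist can be written down as a small check. The following is only a sketch in Python: the `DiskSample` structure, the function names, and all three thresholds are my own assumptions for illustration, not values from the actual server or from any vendor guidance. It also applies Little's law (average IOs in flight = IO rate x latency) to show how few IOs were actually outstanding.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """One sampling interval of disk counters (hypothetical structure)."""
    iops: float         # reads + writes per second
    latency_ms: float   # average milliseconds per read or write
    idle_pct: float     # percentage of the interval the disk was idle


def looks_like_upstream_bottleneck(sample, max_iops=1000.0,
                                   max_latency_ms=20.0, min_idle_pct=80.0):
    """Flag high latency that occurs while the disk itself is mostly idle.

    All three thresholds are illustrative assumptions, not vendor guidance.
    """
    return (sample.latency_ms > max_latency_ms
            and sample.iops < max_iops
            and sample.idle_pct > min_idle_pct)


def outstanding_ios(iops, latency_ms):
    """Little's law: average IOs in flight = IO rate x time per IO."""
    return iops * latency_ms / 1000.0


# Roughly the pattern described above: about 500 IO per second at 100 ms.
# The idle percentage is an assumed value.
problem = DiskSample(iops=500, latency_ms=100, idle_pct=90)
print(looks_like_upstream_bottleneck(problem))             # True
print(outstanding_ios(problem.iops, problem.latency_ms))   # 50.0
```

When latency is this high while the disks sit mostly idle, the time is likely being spent somewhere between the server and the disks, which is why the search moved to the SAN path and HBA settings.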

The server admin checked all HBA settings and found them in compliance with the standard configuration they followed for their builds. This server was part of a 4-node cluster using node majority with a file share witness. I will name the nodes NodeA, NodeB, NodeC, and NodeD for this discussion. NodeA is the problem server.

While the server admin continued to investigate, I decided to check the other servers in the cluster and compare their IO performance. I found the same issue on NodeB. The remaining two nodes, NodeC and NodeD, were able to handle 10K IO per second with very little IO latency. The question was how two nodes in a cluster could handle such high IO loads without issues while the other two struggled with very little IO. I asked the SAN team whether the problem nodes shared ports or switches on the SAN side with other servers. They found that all four nodes shared a backend port on the SAN side. I went back to the HBA settings on all four nodes and found that the "Execution Throttle" setting on the problem nodes, NodeA and NodeB, matched the build standard, but on NodeC and NodeD it was set to four times the standard value.
Here is a brief description of the Execution Throttle setting: "This is an adapter firmware parameter that specifies the maximum number of commands executing on a per Target port (storage port) basis. When a port’s Execution Throttle value is reached, no new commands (I/Os) are executed until the current command finishes executing to that specific Target port. Commands can continue to be executed for other target ports that have not reached their Execution Throttle threshold."
More details can be found at
http://support.qlogic.com/SupportCenter/articles/Question_Answers/2718
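
To make the quoted behavior concrete, here is a toy Python model of a per-target-port command limit. It is only a sketch of the idea as described in the quote, not how the HBA firmware is implemented; the class name `ThrottledAdapter`, its methods, and the port names are all hypothetical.

```python
from collections import defaultdict, deque


class ThrottledAdapter:
    """Toy model of a per-target-port command limit (not real HBA firmware)."""

    def __init__(self, execution_throttle):
        self.execution_throttle = execution_throttle
        self.in_flight = defaultdict(int)    # target port -> executing commands
        self.waiting = defaultdict(deque)    # target port -> held commands

    def submit(self, target_port, command):
        """Start the command if the port is below its limit, else hold it."""
        if self.in_flight[target_port] < self.execution_throttle:
            self.in_flight[target_port] += 1
            return True
        self.waiting[target_port].append(command)
        return False

    def complete(self, target_port):
        """Finish one command; a held command, if any, takes the freed slot."""
        if self.in_flight[target_port] == 0:
            raise ValueError("no command in flight on this port")
        if self.waiting[target_port]:
            # The freed slot is reused immediately, so in_flight is unchanged.
            return self.waiting[target_port].popleft()
        self.in_flight[target_port] -= 1
        return None


adapter = ThrottledAdapter(execution_throttle=2)
started = [adapter.submit("port0", f"cmd{i}") for i in range(3)]
print(started)                           # [True, True, False]
print(adapter.submit("port1", "cmdX"))   # True: other ports are unaffected
print(adapter.complete("port0"))         # cmd2 now starts executing
```

The third command to `port0` is held until one of the first two finishes, while `port1` keeps accepting commands. A higher throttle value simply lets one server keep more commands outstanding against the same storage port.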

It appeared that NodeC and NodeD were overwhelming the SAN port with so much IO that NodeA and NodeB had a tough time sending their IOs, causing high IO latency at a low IO rate and low disk utilization. The server admin lowered the Execution Throttle on NodeC and NodeD to the same value as on NodeA and NodeB. IO performance on both problem nodes improved more than tenfold, and IO performance on the four nodes came to par. Phew!!
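
A back-of-the-envelope way to picture this is sketched below in Python. It rests on a strong simplifying assumption of mine: that when the shared SAN port is saturated, each node holds a share of the port in proportion to its Execution Throttle. The standard value of 16 is a made-up placeholder (I no longer have the real numbers), and the `port_share` function is hypothetical.

```python
def port_share(throttles):
    """Fraction of a saturated shared port held by each node.

    Assumes slots are occupied in proportion to each node's Execution
    Throttle, which is a simplification of how a real SAN port behaves.
    """
    total = sum(throttles.values())
    return {node: value / total for node, value in throttles.items()}


STANDARD = 16  # hypothetical build-standard value, not the real setting

before = port_share({"NodeA": STANDARD, "NodeB": STANDARD,
                     "NodeC": 4 * STANDARD, "NodeD": 4 * STANDARD})
after = port_share({node: STANDARD
                    for node in ("NodeA", "NodeB", "NodeC", "NodeD")})

print(before["NodeA"])  # 0.1
print(after["NodeA"])   # 0.25
```

Under this model, NodeA's share of the port goes from 10% to 25% once all throttles match. The model says nothing about queuing delay, so it does not by itself account for the size of the improvement we saw; it only shows the direction of the effect.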

Conclusion: IO performance issues on a server may not be caused by a bottleneck or configuration on that server. They may be due to misconfiguration on other servers with which the problem server shares SAN disks, SAN ports, or the network between the servers and the SAN.