Thursday, July 23, 2015

What is Consuming CPU and IO in SQL?

Testing environment: SQL 2008 R2

You can find out which process is using the most CPU using perfmon, task manager or any server monitoring tool. Once you have determined that SQL is the main consumer of CPU on your server, the next step is to dive into SQL and find out which processes in SQL are the main consumers. Some situations of high CPU and IO utilization are tricky to troubleshoot, especially in cases where multiple applications are using the SQL server. No one process stands out as the culprit. There will be some process on the top but, how can you be sure that the process on top is actually causing high CPU or IO. Even if the session on top is adding to the problem, how much of the problem is it responsible for.

Sometime ago, I ran into an issue with host CPU utilization keeping above 95% and SQL taking up almost all of it. SQL server had more than 500 sessions on multiple databases with none of them large enough to be called the culprit. The monitoring tool my team was using did not give a clear picture of what was taking up all the CPU in SQL. So, I used a combination of available queries and new queries to figure out what is taking up CPU and IO. This issue is discussed further in the last paragraph of this article.

The Script
The script below is a refined version of the script I used. The script needs input about the interval over which you would like the script to capture CPU and IO data. The script captures CPU utilization of SQL on the host at the beginning, middle, and end of the interval over which it is run. It also captures CPU and IO of each SPID at beginning and end of interval. It then uses that information to come up with percentage of CPU and IO used by each SPID, group by database, group by program name, and group by host name. I would suggest that you run it with less than 5 seconds interval. We will take a look at the output of this script in the next section.

--Declare interval variable. This will be the gap in seconds that SQL will keep between first AND second polling of data.

DECLARE @interval INT

SELECT @interval = 3 --Change interval AS required

--Declare waittime

DECLARE @waittime Datetime

--Declare other variable to hold sum values

DECLARE @total_delta_cpu BIGINT

DECLARE @total_delta_physicalio BIGINT

--Get the mid time

SELECT @waittime = DATEADD(ss,@interval/2,GETDATE())

--Collect CPU utilization by Resource pool at the beginning

SELECT DISTINCT 'CPU Utilization at the Beginning', perfCount.object_name, perfCount.counter_name, perfCount.instance_name AS ResourcePool,
       CASE WHEN perfBase.cntr_value = 0
       THEN 0
       ELSE CAST((CAST(perfCount.cntr_value AS FLOAT) / perfBase.cntr_value) * 100 AS INT)
       END AS Percentage_HostCPU_BySQL
FROM
(SELECT * FROM sys.dm_os_performance_counters
WHERE rtrim(object_Name) like '%Resource Pool Stats%'
AND rtrim(counter_name) = 'CPU usage %' ) perfCount
INNER JOIN
(SELECT * FROM sys.dm_os_performance_counters
WHERE rtrim(object_Name) like '%Resource Pool Stats%'
AND rtrim(counter_name) = 'CPU usage % base') perfBase
ON perfCount.object_name = perfBase.object_name

-- Take current snapshot of resource utilization FROM sysprocesses TABLE
SELECT sp.spid
             ,DB_NAME(sp.dbid) AS DBName
             ,sp.cpu, sp.physical_io
             ,sp.memusage
             ,1 AS sample
             ,getdate() AS sampleTime
             ,sp.hostname
             ,sp.program_name
             ,sp.loginame
             ,st.text
INTO #resources
FROM master..sysprocesses sp CROSS APPLY
master.sys.dm_exec_sql_text(sp.sql_handle) st

--Wait for half TIME
IF GETDATE() < @waittime
       WAITFOR TIME @waittime
 
--Collect CPU utilization by Resource pool in the middle
SELECT DISTINCT 'CPU Utilization in the Middle'
             ,perfCount.object_name
             ,perfCount.counter_name
             ,perfCount.instance_name AS ResourcePool
             ,CASE WHEN perfBase.cntr_value = 0
             THEN 0
             ELSE CAST((CAST(perfCount.cntr_value AS FLOAT) / perfBase.cntr_value) * 100 AS INT)
             END AS Percentage_HostCPU_BySQL
FROM
(SELECT * FROM sys.dm_os_performance_counters
WHERE rtrim(object_Name) like '%Resource Pool Stats%'
AND rtrim(counter_name) = 'CPU usage %' ) perfCount
INNER JOIN
(SELECT * FROM sys.dm_os_performance_counters
WHERE rtrim(object_Name) like '%Resource Pool Stats%'
AND rtrim(counter_name) = 'CPU usage % base') perfBase
ON perfCount.object_name = perfBase.object_name
 
--Wait the remaining interval time
IF GETDATE() < DATEADD(ss,@interval - (@interval/2),@waittime )
BEGIN
       SET @waittime = DATEADD(ss,@interval - (@interval/2),@waittime )
       WAITFOR TIME @waittime
END
 
-- Take another snapshot of resource utilization FROM sysprocesses TABLE
INSERT #resources
SELECT sp.spid
             ,DB_NAME(sp.dbid) AS DBName
             ,sp.cpu
             ,sp.physical_io
             ,sp.memusage
             ,2 AS sample
             ,getdate() AS sampleTime
             ,sp.hostname
             ,sp.program_name
             ,sp.loginame
             ,st.text
FROM master..sysprocesses sp CROSS APPLY
master.sys.dm_exec_sql_text(sp.sql_handle) st
 
--Collect CPU utilization by Resource pool at the end

SELECT DISTINCT 'CPU Utilization at the End'
             ,perfCount.object_name
             ,perfCount.counter_name
             ,perfCount.instance_name AS ResourcePool
             ,CASE WHEN perfBase.cntr_value = 0
             THEN 0
             ELSE CAST((CAST(perfCount.cntr_value AS FLOAT) / perfBase.cntr_value) * 100 AS INT)
             END AS Percentage_HostCPU_BySQL
FROM
(SELECT * FROM sys.dm_os_performance_counters
WHERE rtrim(object_Name) like '%Resource Pool Stats%'
AND rtrim(counter_name) = 'CPU usage %' ) perfCount
INNER JOIN
(SELECT * FROM sys.dm_os_performance_counters
WHERE rtrim(object_Name) like '%Resource Pool Stats%'
AND rtrim(counter_name) = 'CPU usage % base') perfBase
ON perfCount.object_name = perfBase.object_name
 
--Get the total utilization for CPU and IO by all SPIDs in SQL
SELECT @total_delta_cpu = SUM(CAST((r2.cpu - r1.cpu) AS bigint))
FROM #resources AS r1 INNER JOIN #resources AS r2 ON r1.spid = r2.spid
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.cpu - r1.cpu) > 0

SELECT @total_delta_physicalio = SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))
FROM #resources AS r1 INNER JOIN #resources AS r2 ON r1.spid = r2.spid
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.physical_io - r1.physical_io) > 0
 
-- Find the deltas by SPID. Order by CPU utilization.
SELECT top 100 r1.spid
             ,(CAST(r2.cpu  AS BIGINT)- CAST(r1.cpu AS BIGINT))*100/@total_delta_cpu AS PercentageOfSQLCPU_bySPID
             ,(CAST(r2.physical_io  AS BIGINT)- CAST(r1.physical_io AS BIGINT))*100/@total_delta_physicalio AS PercentageOfSQLPhysicalIO_bySPID
             ,r1.DBName
             ,r2.cpu - r1.cpu AS delta_cpu
             ,r2.physical_io - r1.physical_io AS delta_physical_io
             ,r1.hostname, r1.program_name
             ,r1.loginame
             ,r1.text
FROM #resources AS r1 INNER JOIN #resources AS r2
ON r1.spid = r2.spid
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.cpu - r1.cpu) >= 0
AND (r2.physical_io - r1.physical_io) >= 0
ORDER BY (r2.cpu - r1.cpu) DESC

--Find the deltas grouped by database.  Order by CPU utilization.
SELECT r1.DBName
             ,SUM(CAST((r2.cpu - r1.cpu) AS bigint))*100/@total_delta_cpu AS PercentageOfSQLCPU_byDB
             ,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS PercentageOfSQLPhysicalIO_byDB  
             ,SUM(CAST((r2.cpu - r1.cpu) AS bigint)) AS delta_cpu
             ,SUM(CAST((r2.physical_io - r1.physical_io) AS float)) AS delta_physical_io
FROM #resources AS r1 INNER JOIN #resources AS r2 ON r1.spid = r2.spid
AND r1.DBName = r2.DBName
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.cpu - r1.cpu) >= 0
AND (r2.physical_io - r1.physical_io) >= 0
GROUP BY r1.DBName
ORDER BY delta_cpu DESC

--Find the deltas grouped by Program Name.  Order by CPU utilization.
SELECT r1.program_name
             ,CAST(SUM(CAST((r2.cpu - r1.cpu) AS bigint))*100/@total_delta_cpu AS int) AS PercentageOfSQLCPU_byProgram
             ,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS PercentageOfSQLPhysicalIO_byProgram
             ,SUM(CAST((r2.cpu - r1.cpu) AS bigint)) AS delta_cpu
             ,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS delta_physical_io
FROM #resources AS r1 INNER JOIN #resources AS r2
ON r1.spid = r2.spid AND r1.DBName = r2.DBName
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.cpu - r1.cpu) >= 0
AND (r2.physical_io - r1.physical_io) >= 0
GROUP BY r1.program_name
ORDER BY delta_cpu DESC

--Find the deltas grouped by Host Name. Order by CPU utilization.
SELECT r1.hostname
             ,CAST(SUM(CAST((r2.cpu - r1.cpu) AS bigint))*100/@total_delta_cpu AS int) AS PercentageOfSQLCPU_byHost
             ,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS PercentageOfSQLPhysicalIO_byHost
             ,SUM(CAST((r2.cpu - r1.cpu) AS bigint)) AS delta_cpu
             ,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS delta_physical_io
FROM #resources AS r1 INNER JOIN #resources AS r2
ON r1.spid = r2.spid AND r1.DBName = r2.DBName
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.cpu - r1.cpu) >= 0
AND (r2.physical_io - r1.physical_io) >= 0
GROUP BY r1.hostname
ORDER BY delta_cpu DESC

-- Find the deltas by SPID. Order by IO utilization.
SELECT TOP 100 r1.spid
             ,(CAST(r2.cpu  AS BIGINT)- CAST(r1.cpu AS BIGINT))*100/@total_delta_cpu AS PercentageOfSQLCPU_bySPID
             ,(CAST(r2.physical_io  AS BIGINT)- CAST(r1.physical_io AS BIGINT))*100/@total_delta_physicalio AS PercentageOfSQLPhysicalIO_bySPID
             ,r1.DBName
             ,r2.cpu - r1.cpu AS delta_cpu
             ,r2.physical_io - r1.physical_io AS delta_physical_io
             ,r1.hostname
             ,r1.program_name
             ,r1.loginame
             ,r1.text
FROM #resources AS r1 INNER JOIN #resources AS r2
ON r1.spid = r2.spid
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.cpu - r1.cpu) >= 0
AND (r2.physical_io - r1.physical_io) >= 0
ORDER BY delta_physical_io DESC

--Find the deltas grouped by database.  Order by IO utilization.
SELECT r1.DBName
             ,SUM(CAST((r2.cpu - r1.cpu) AS bigint))*100/@total_delta_cpu AS PercentageOfSQLCPU_byDB
             ,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS PercentageOfSQLPhysicalIO_byDB
             ,SUM(CAST((r2.cpu - r1.cpu) AS bigint)) AS delta_cpu
             ,SUM(CAST((r2.physical_io - r1.physical_io) AS float)) AS delta_physical_io
FROM #resources AS r1 INNER JOIN #resources AS r2 ON r1.spid = r2.spid
AND r1.DBName = r2.DBName
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.cpu - r1.cpu) >= 0
AND (r2.physical_io - r1.physical_io) >= 0
GROUP BY r1.DBName
ORDER BY delta_physical_io DESC

--Find the deltas grouped by Program Name.  Order by IO utilization.
SELECT r1.program_name
             ,CAST(SUM(CAST((r2.cpu - r1.cpu) AS bigint))*100/@total_delta_cpu AS int) AS PercentageOfSQLCPU_byProgram
             ,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS PercentageOfSQLPhysicalIO_byProgram
             ,SUM(CAST((r2.cpu - r1.cpu) AS bigint)) AS delta_cpu
             ,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS delta_physical_io
FROM #resources AS r1 INNER JOIN #resources AS r2
ON r1.spid = r2.spid AND r1.DBName = r2.DBName
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.cpu - r1.cpu) >= 0
AND (r2.physical_io - r1.physical_io) >= 0
GROUP BY r1.program_name
ORDER BY delta_physical_io DESC

--Find the deltas grouped by Host Name. Order by IO utilization.
SELECT r1.hostname
             ,CAST(SUM(CAST((r2.cpu - r1.cpu) AS bigint))*100/@total_delta_cpu AS int) AS PercentageOfSQLCPU_byHost
             ,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS PercentageOfSQLPhysicalIO_byHost
             ,SUM(CAST((r2.cpu - r1.cpu) AS bigint)) AS delta_cpu
             ,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS delta_physical_io
FROM #resources AS r1 INNER JOIN #resources AS r2
ON r1.spid = r2.spid AND r1.DBName = r2.DBName
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.cpu - r1.cpu) >= 0
AND (r2.physical_io - r1.physical_io) >= 0
GROUP BY r1.hostname
ORDER BY delta_physical_io DESC

DROP TABLE #resources
 

Script Results Analysis

Let us look at output of this query on a test system.

Refer Figure 1. It shows CPU utilization of SQL at the host before, in the middle, and at the end of interval. CPU utilization by SQL varies from 24% to 19% back to 24% within the 3 seconds interval.

Figure 1

Refer Figure 2. It shows percentage of CPU used by each SPID in relation to what SQL consumes. If SQL server consumes 50% of CPU on the server and a SPID consumes up 50% of that, the SPID is in fact using 25%(50% of 50%) of CPU of the host.

SPID 371 running against master database uses 26% of the CPU used by SQL. It is also a consumer of 59% of physical IO that SQL is doing. SPID 287 running against App1DB uses 23% of CPU used by SQL.

Figure 2

Refer Figure 3. It shows percentage of CPU used by each DB. It sums up percentage of all SPIDs running against each DB. Row 1 shows that App1DB sessions are using up 46% of CPU used by SQL and is responsible for 39% of physical IO. Note that it was not the top consumer when we were only looking at CPU consumed by SPIDs in Figure 2. However, it had several sessions running on it. Row 2 shows that master sessions consume only 26% CPU but most of the physical IO. It appears that the first session in Figure 2 that consumed 26% CPU was the only one running on master.

Figure 3

Refer Figure 4. It shows percentage of CPU used by each program. It sums up percentage of all SPIDs running against each program. Looking at row 1, APP1 sessions used up 46% of CPU used by SQL and was responsible for 39% of physical IO. Note that it was not the top consumer when we were only looking at CPU consumed by SPIDs in Figure 2. Microsoft SQL Server Management Studion – Query, which is the app running against master(as in figure 2), consume 26% CPU.  



Figure 4

Refer Figure 5. It shows percentage of CPU used by client host name. It sums up percentage of all SPIDs coming from each host. Sessions from APP1-SRVP2 used up 29% of CPU used by SQL and was responsible for 26% of physical IO. Sessions from APP1-SRVP used up 17% of CPU and 13% of IO. Both these servers belong to APP1 and so together they consumed 46% CPU and 39% IO, which matches with the percentages for APP1 in Figure 4.

Figure 5

The script generates more output beyond what Figure 5 shows. Rest of the output shows same data sorted by Percentage IO in descending order. The remaining output would be self-explanatory once you see it.

The idea of this script is to look at CPU and IO consumption from different angles in an effort to find out the real consumer of resources so you can focus on right area. You could have any of the scenarios in your environment:

1.     One application from one host using one database on SQL

2.     Multiple applications from different hosts using a database

3.     An application from one or multiple hosts using multiple databases

4.     An application from multiple hosts using a database via one session per host or multiple sessions per host

Coming back to the specific issue I faced, SCCM application was the culprit in that case. None of its sessions took more than 1% CPU. In fact going by SPID, none of its sessions showed in top 10 CPU consumers. However, the script showed that its hundreds of sessions together accounted for more than 85% of the CPU that SQL was consuming.

I will cover how to match disk IO stats from perfmon with data from sys.dm_io_virtual_file_stats in another post.