You can find out which process is using the most CPU using perfmon, task manager or any server monitoring tool. Once you have determined that SQL is the main consumer of CPU on your server, the next step is to dive into SQL and find out which processes in SQL are the main consumers. Some situations of high CPU and IO utilization are tricky to troubleshoot, especially in cases where multiple applications are using the SQL server. No one process stands out as the culprit. There will be some process on the top but, how can you be sure that the process on top is actually causing high CPU or IO. Even if the session on top is adding to the problem, how much of the problem is it responsible for.
Sometime ago,
I ran into an issue with host CPU utilization keeping above 95% and SQL taking
up almost all of it. SQL server had more than 500 sessions on multiple
databases with none of them large enough to be called the culprit. The monitoring
tool my team was using did not give a clear picture of what was taking up all the CPU in SQL. So, I used a combination of available queries and new queries to
figure out what is taking up CPU and IO. This issue is discussed further in the last paragraph of this article.
The
Script
The script below
is a refined version of the script I used. The script needs input about the interval
over which you would like the script to capture CPU and IO data. The script captures
CPU utilization of SQL on the host at the beginning, middle, and end of the
interval over which it is run. It also captures CPU and IO of each SPID at
beginning and end of interval. It then uses that information to come up with
percentage of CPU and IO used by each SPID, group by database, group by program
name, and group by host name. I would suggest that you run it with less than 5
seconds interval. We will take a look at the output of this script in the next
section.
--Declare interval variable. This will
be the gap in seconds that SQL will keep between first AND second polling of
data.
DECLARE @interval INT
SELECT @interval = 3 --Change interval AS required
--Declare waittime
DECLARE @waittime Datetime
--Declare other variable to hold sum values
DECLARE @total_delta_cpu BIGINT
DECLARE @total_delta_physicalio BIGINT
--Get the mid time
SELECT @waittime = DATEADD(ss,@interval/2,GETDATE())
--Collect CPU utilization by Resource
pool at the beginning
SELECT DISTINCT 'CPU Utilization at the Beginning', perfCount.object_name,
perfCount.counter_name,
perfCount.instance_name AS ResourcePool,
CASE WHEN perfBase.cntr_value = 0THEN 0
ELSE CAST((CAST(perfCount.cntr_value AS FLOAT) / perfBase.cntr_value) * 100 AS INT)
END AS Percentage_HostCPU_BySQL
FROM
(SELECT * FROM sys.dm_os_performance_counters
WHERE rtrim(object_Name) like '%Resource Pool Stats%'
AND rtrim(counter_name) = 'CPU usage %' ) perfCount
INNER JOIN
(SELECT * FROM sys.dm_os_performance_counters
WHERE rtrim(object_Name) like '%Resource Pool Stats%'
AND rtrim(counter_name) = 'CPU usage % base') perfBase
ON perfCount.object_name = perfBase.object_name
-- Take current snapshot of resource
utilization FROM sysprocesses TABLE
SELECT sp.spid,DB_NAME(sp.dbid) AS DBName
,sp.cpu, sp.physical_io
,sp.memusage
,1 AS sample
,getdate() AS sampleTime
,sp.hostname
,sp.program_name
,sp.loginame
,st.text
INTO #resources
FROM master..sysprocesses sp CROSS APPLY
master.sys.dm_exec_sql_text(sp.sql_handle) st
--Wait for half TIME
IF GETDATE() < @waittime
WAITFOR TIME @waittime
--Collect CPU utilization by Resource pool in the middle
SELECT DISTINCT 'CPU Utilization in the Middle'
,perfCount.object_name
,perfCount.counter_name
,perfCount.instance_name AS ResourcePool
,CASE WHEN perfBase.cntr_value = 0
THEN 0
ELSE CAST((CAST(perfCount.cntr_value AS FLOAT) / perfBase.cntr_value) * 100 AS INT)
END AS Percentage_HostCPU_BySQL
FROM
(SELECT * FROM sys.dm_os_performance_counters
WHERE rtrim(object_Name) like '%Resource Pool Stats%'
AND rtrim(counter_name) = 'CPU usage %' ) perfCount
INNER JOIN
(SELECT * FROM sys.dm_os_performance_counters
WHERE rtrim(object_Name) like '%Resource Pool Stats%'
AND rtrim(counter_name) = 'CPU usage % base') perfBase
ON perfCount.object_name = perfBase.object_name
--Wait the remaining interval time
IF GETDATE() < DATEADD(ss,@interval - (@interval/2),@waittime )
BEGIN
SET @waittime = DATEADD(ss,@interval - (@interval/2),@waittime )
WAITFOR TIME @waittime
END
-- Take another snapshot of resource utilization FROM sysprocesses TABLE
INSERT #resources
SELECT sp.spid
,DB_NAME(sp.dbid) AS DBName
,sp.cpu
,sp.physical_io
,sp.memusage
,2 AS sample
,getdate() AS sampleTime
,sp.hostname
,sp.program_name
,sp.loginame
,st.text
FROM master..sysprocesses sp CROSS APPLY
master.sys.dm_exec_sql_text(sp.sql_handle) st
--Collect CPU utilization by Resource pool at the end
SELECT DISTINCT 'CPU Utilization at the End'
,perfCount.object_name,perfCount.counter_name
,perfCount.instance_name AS ResourcePool
,CASE WHEN perfBase.cntr_value = 0
THEN 0
ELSE CAST((CAST(perfCount.cntr_value AS FLOAT) / perfBase.cntr_value) * 100 AS INT)
END AS Percentage_HostCPU_BySQL
FROM
(SELECT * FROM sys.dm_os_performance_counters
WHERE rtrim(object_Name) like '%Resource Pool Stats%'
AND rtrim(counter_name) = 'CPU usage %' ) perfCount
INNER JOIN
(SELECT * FROM sys.dm_os_performance_counters
WHERE rtrim(object_Name) like '%Resource Pool Stats%'
AND rtrim(counter_name) = 'CPU usage % base') perfBase
ON perfCount.object_name = perfBase.object_name
--Get the total utilization for CPU and IO by all SPIDs in SQL
SELECT @total_delta_cpu = SUM(CAST((r2.cpu - r1.cpu) AS bigint))
FROM #resources AS r1 INNER JOIN #resources AS r2 ON r1.spid = r2.spid
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.cpu - r1.cpu) > 0
SELECT @total_delta_physicalio =
SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))
FROM #resources AS r1 INNER JOIN #resources
AS r2 ON r1.spid = r2.spid WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.physical_io - r1.physical_io) > 0
-- Find the deltas by SPID. Order by CPU utilization.
SELECT top 100 r1.spid
,(CAST(r2.cpu AS BIGINT)- CAST(r1.cpu AS BIGINT))*100/@total_delta_cpu AS PercentageOfSQLCPU_bySPID
,(CAST(r2.physical_io AS BIGINT)- CAST(r1.physical_io AS BIGINT))*100/@total_delta_physicalio AS PercentageOfSQLPhysicalIO_bySPID
,r1.DBName
,r2.cpu - r1.cpu AS delta_cpu
,r2.physical_io - r1.physical_io AS delta_physical_io
,r1.hostname, r1.program_name
,r1.loginame
,r1.text
FROM #resources AS r1 INNER JOIN #resources AS r2
ON r1.spid = r2.spid
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.cpu - r1.cpu) >= 0
AND (r2.physical_io - r1.physical_io) >= 0
ORDER BY (r2.cpu - r1.cpu) DESC
--Find the deltas grouped by
database. Order by CPU utilization.
SELECT r1.DBName,SUM(CAST((r2.cpu - r1.cpu) AS bigint))*100/@total_delta_cpu AS PercentageOfSQLCPU_byDB
,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS PercentageOfSQLPhysicalIO_byDB
,SUM(CAST((r2.cpu - r1.cpu) AS bigint)) AS delta_cpu
,SUM(CAST((r2.physical_io - r1.physical_io) AS float)) AS delta_physical_io
FROM #resources AS r1 INNER JOIN #resources AS r2 ON r1.spid = r2.spid
AND r1.DBName = r2.DBName
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.cpu - r1.cpu) >= 0
AND (r2.physical_io - r1.physical_io) >= 0
GROUP BY r1.DBName
ORDER BY delta_cpu DESC
--Find the deltas grouped by Program
Name. Order by CPU utilization.
SELECT r1.program_name,CAST(SUM(CAST((r2.cpu - r1.cpu) AS bigint))*100/@total_delta_cpu AS int) AS PercentageOfSQLCPU_byProgram
,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS PercentageOfSQLPhysicalIO_byProgram
,SUM(CAST((r2.cpu - r1.cpu) AS bigint)) AS delta_cpu
,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS delta_physical_io
FROM #resources AS r1 INNER JOIN #resources AS r2
ON r1.spid = r2.spid AND r1.DBName = r2.DBName
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.cpu - r1.cpu) >= 0
AND (r2.physical_io - r1.physical_io) >= 0
GROUP BY r1.program_name
ORDER BY delta_cpu DESC
--Find the deltas grouped by Host Name.
Order by CPU utilization.
SELECT r1.hostname,CAST(SUM(CAST((r2.cpu - r1.cpu) AS bigint))*100/@total_delta_cpu AS int) AS PercentageOfSQLCPU_byHost
,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS PercentageOfSQLPhysicalIO_byHost
,SUM(CAST((r2.cpu - r1.cpu) AS bigint)) AS delta_cpu
,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS delta_physical_io
FROM #resources AS r1 INNER JOIN #resources AS r2
ON r1.spid = r2.spid AND r1.DBName = r2.DBName
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.cpu - r1.cpu) >= 0
AND (r2.physical_io - r1.physical_io) >= 0
GROUP BY r1.hostname
ORDER BY delta_cpu DESC
-- Find the deltas by SPID. Order by IO
utilization.
SELECT TOP 100 r1.spid,(CAST(r2.cpu AS BIGINT)- CAST(r1.cpu AS BIGINT))*100/@total_delta_cpu AS PercentageOfSQLCPU_bySPID
,(CAST(r2.physical_io AS BIGINT)- CAST(r1.physical_io AS BIGINT))*100/@total_delta_physicalio AS PercentageOfSQLPhysicalIO_bySPID
,r1.DBName
,r2.cpu - r1.cpu AS delta_cpu
,r2.physical_io - r1.physical_io AS delta_physical_io
,r1.hostname
,r1.program_name
,r1.loginame
,r1.text
FROM #resources AS r1 INNER JOIN #resources AS r2
ON r1.spid = r2.spid
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.cpu - r1.cpu) >= 0
AND (r2.physical_io - r1.physical_io) >= 0
ORDER BY delta_physical_io DESC
--Find the deltas grouped by
database. Order by IO utilization.
SELECT r1.DBName,SUM(CAST((r2.cpu - r1.cpu) AS bigint))*100/@total_delta_cpu AS PercentageOfSQLCPU_byDB
,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS PercentageOfSQLPhysicalIO_byDB
,SUM(CAST((r2.cpu - r1.cpu) AS bigint)) AS delta_cpu
,SUM(CAST((r2.physical_io - r1.physical_io) AS float)) AS delta_physical_io
FROM #resources AS r1 INNER JOIN #resources AS r2 ON r1.spid = r2.spid
AND r1.DBName = r2.DBName
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.cpu - r1.cpu) >= 0
AND (r2.physical_io - r1.physical_io) >= 0
GROUP BY r1.DBName
ORDER BY delta_physical_io DESC
--Find the deltas grouped by Program Name. Order by IO utilization.
SELECT r1.program_name
,CAST(SUM(CAST((r2.cpu - r1.cpu) AS bigint))*100/@total_delta_cpu AS int) AS PercentageOfSQLCPU_byProgram
,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS PercentageOfSQLPhysicalIO_byProgram
,SUM(CAST((r2.cpu - r1.cpu) AS bigint)) AS delta_cpu
,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS delta_physical_io
FROM #resources AS r1 INNER JOIN #resources AS r2
ON r1.spid = r2.spid AND r1.DBName = r2.DBName
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.cpu - r1.cpu) >= 0
AND (r2.physical_io - r1.physical_io) >= 0
GROUP BY r1.program_name
ORDER BY delta_physical_io DESC
--Find the deltas grouped by Host Name.
Order by IO utilization.
SELECT r1.hostname,CAST(SUM(CAST((r2.cpu - r1.cpu) AS bigint))*100/@total_delta_cpu AS int) AS PercentageOfSQLCPU_byHost
,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS PercentageOfSQLPhysicalIO_byHost
,SUM(CAST((r2.cpu - r1.cpu) AS bigint)) AS delta_cpu
,SUM(CAST((r2.physical_io - r1.physical_io) AS bigint))*100/@total_delta_physicalio AS delta_physical_io
FROM #resources AS r1 INNER JOIN #resources AS r2
ON r1.spid = r2.spid AND r1.DBName = r2.DBName
WHERE r1.sample = 1
AND r2.sample = 2
AND (r2.cpu - r1.cpu) >= 0
AND (r2.physical_io - r1.physical_io) >= 0
GROUP BY r1.hostname
ORDER BY delta_physical_io DESC
DROP TABLE #resources
Script
Results Analysis
Let us look
at output of this query on a test system.
Refer Figure
1. It shows CPU utilization of SQL at the host before, in the middle, and at
the end of interval. CPU utilization by SQL varies from 24% to 19% back to 24%
within the 3 seconds interval.
Figure 1
Refer Figure
2. It shows percentage of CPU used by each SPID in relation to what SQL
consumes. If SQL server consumes 50% of CPU on the server and a SPID consumes
up 50% of that, the SPID is in fact using 25%(50% of 50%) of CPU of the host.
SPID 371
running against master database uses 26% of the CPU used by SQL. It is also a
consumer of 59% of physical IO that SQL is doing. SPID 287 running against App1DB
uses 23% of CPU used by SQL.
Figure 2
Refer Figure
3. It shows percentage of CPU used by each DB. It sums up percentage of all
SPIDs running against each DB. Row 1 shows that App1DB sessions are using up
46% of CPU used by SQL and is responsible for 39% of physical IO. Note that it
was not the top consumer when we were only looking at CPU consumed by SPIDs in
Figure 2. However, it had several sessions running on it. Row 2 shows that master sessions consume only 26% CPU but most of the
physical IO. It appears that the first session in Figure 2 that consumed 26% CPU was the only one running on master.
Refer Figure
4. It shows percentage of CPU used by each program. It sums up percentage of all
SPIDs running against each program. Looking at row 1, APP1 sessions used up 46%
of CPU used by SQL and was responsible for 39% of physical IO. Note that it was
not the top consumer when we were only looking at CPU consumed by SPIDs in
Figure 2. Microsoft SQL Server Management Studion – Query, which is the app running
against master(as in figure 2), consume 26% CPU.
Figure 4
Refer Figure
5. It shows percentage of CPU used by client host name. It sums up percentage
of all SPIDs coming from each host. Sessions from APP1-SRVP2 used up 29% of CPU
used by SQL and was responsible for 26% of physical IO. Sessions from APP1-SRVP
used up 17% of CPU and 13% of IO. Both these servers belong to APP1 and so together they
consumed 46% CPU and 39% IO, which matches with the percentages for APP1 in Figure 4.
Figure 5
The script
generates more output beyond what Figure 5 shows. Rest of the output shows same
data sorted by Percentage IO in descending order. The remaining output would be
self-explanatory once you see it.
The idea of
this script is to look at CPU and IO consumption from different angles in an
effort to find out the real consumer of resources so you can focus on right area.
You could have any of the scenarios in your environment:
1.
One
application from one host using one database on SQL
2.
Multiple
applications from different hosts using a database
3.
An
application from one or multiple hosts using multiple databases
4.
An
application from multiple hosts using a database via one session per host or multiple
sessions per host
Coming back
to the specific issue I faced, SCCM application was the culprit in that case.
None of its sessions took more than 1% CPU. In fact going by SPID, none of its
sessions showed in top 10 CPU consumers. However, the script showed that its
hundreds of sessions together accounted for more than 85% of the CPU that SQL
was consuming.
I will cover how to match disk IO stats from perfmon with data from sys.dm_io_virtual_file_stats in another post.
I will cover how to match disk IO stats from perfmon with data from sys.dm_io_virtual_file_stats in another post.
Nice detailed script for deep analysis.
ReplyDelete