Testing environment: SQL Server 2014 Developer Edition
Recently, a friend brought up a requirement he was trying to find a solution for.
Scenario: A table with a large varchar column, many other columns, and more than a billion rows. Total table size touches 900 GB, and the table has a clustered primary key. Users need the varchar column only around month-end. Data, including the big varchar column, is populated in the table before month-end.
Requirement: Users need the big varchar column only on the 8 days around month-end; they don't need it for the other 22 days of the month. However, because the column is part of the table, only a few rows fit on each page, so queries have to do a lot of IO to get the required data. This slows down their queries and wastes memory. If they could somehow get rid of this column for 22 days and get it back for the 8 month-end days, it would serve the purpose. They do not need to retain the varchar data after it is no longer required.
One solution is to create a non-clustered index that includes all columns except the large varchar column. All queries other than the ones that reference the big varchar column would use the non-clustered index to fetch their data, and so would not touch the table. That would solve the reporting problem, but it would increase the total table size by a significant amount.
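As a sketch, such a covering index on the example Customers table built later in this post might look like this (the index name is illustrative):

```sql
--Non-clustered covering index: key on ID, all small columns included,
--the large char(5000) column deliberately left out
CREATE NONCLUSTERED INDEX IX_Customers_Covering
ON dbo.Customers (ID)
INCLUDE (FirstName, LastName, SalaryDollars);
```

Queries that do not touch the big column can be satisfied entirely from this much denser index, at the cost of duplicating the small columns on disk.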
Considering that they do not need to retain the varchar column data, another solution could be to drop the column when it is not needed, or to null out the values in the column and rebuild the clustered index. Both of these operations are super expensive and lock the table, making it effectively unusable for a long time.
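For completeness, the two expensive operations would look roughly like this; the syntax is simple, but on a 900 GB table the locking, logging, and rebuild time are the problem:

```sql
--Option A: drop the column outright; reclaiming the space still
--requires rebuilding the clustered index afterwards
ALTER TABLE dbo.Customers DROP COLUMN OtherInfo;

--Option B: null out the values, then rebuild the clustered index
UPDATE dbo.Customers SET OtherInfo = NULL;
ALTER INDEX PK_Customers ON dbo.Customers REBUILD;
```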
I am a fan of partitioned views, but I have always thought of them mostly in terms of splitting data horizontally. After my friend brought this problem to me, I wondered whether a vertically partitioned view could be a solution to his requirement. Instead of splitting a table into multiple smaller tables on row values, how about splitting it on column names? The two or more smaller tables would all have the same primary key. They could all be joined together in a view, and users could use the view to select, update, insert, and delete data. For updates, deletes, and inserts, we would need to create INSTEAD OF triggers on the view. I am sure this is not without some positive and negative performance impact, depending on the operation performed.
Let us test this out. I will create a table with four small columns and one char(5000) column. The char column is the one we need only for a few days in a month. Note that the ID column is the primary key with a clustered index.
CREATE TABLE [dbo].[Customers](
	[ID] [INT] NOT NULL,
	[FirstName] [VARCHAR](50) NULL,
	[LastName] [VARCHAR](50) NULL,
	[SalaryDollars] [INT] NULL,
	[OtherInfo] [CHAR](5000) NULL,
	CONSTRAINT [PK_Customers] PRIMARY KEY CLUSTERED
	(
		[ID] ASC
	) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
Now, let us insert data
--Declare variables
DECLARE @id INT
DECLARE @maxid INT

--Get the maximum id from the Customers table
SELECT @maxid = MAX(id) FROM [dbo].[Customers]

--Assign starting value to @id
IF @maxid IS NULL
	SELECT @id = 1
ELSE
	SELECT @id = @maxid + 1

--While loop to insert 10,000 rows
--(ISNULL keeps the loop condition from evaluating to NULL,
-- and thus inserting nothing, when the table starts out empty)
WHILE @id <= ISNULL(@maxid, 0) + 10000
BEGIN
	INSERT INTO [dbo].[Customers]
		([ID]
		,[FirstName]
		,[LastName]
		,[SalaryDollars]
		,[OtherInfo])
	SELECT @id,
		CAST(@id AS VARCHAR(10)) + 'FN' + CAST(@id AS VARCHAR(10)),
		CAST(@id AS VARCHAR(10)) + 'LN' + CAST(@id AS VARCHAR(10)),
		CASE WHEN @id < 1000 THEN 1000 + @id ELSE @id END,
		'This is the additional information for customer id ' + CAST(@id AS VARCHAR(10))

	--Increment value of @id
	SET @id = @id + 1
END
This creates the starting point for the scenario.
Now, we split this table into two tables: the first without the large char column, and the second with just the primary key and the large char column.
--Table with all columns except the large char(5000) column
CREATE TABLE [dbo].[Customers_WithoutOtherInfo](
	[ID] [INT] NOT NULL,
	[FirstName] [VARCHAR](50) NULL,
	[LastName] [VARCHAR](50) NULL,
	[SalaryDollars] [INT] NULL,
	CONSTRAINT [PK_Customers_WithoutOtherInfo] PRIMARY KEY CLUSTERED
	(
		[ID] ASC
	) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

--Table with just the primary key column and the large char(5000) column
CREATE TABLE [dbo].[Customers_OtherInfo](
	[ID] [INT] NOT NULL,
	[OtherInfo] [CHAR](5000) NULL,
	CONSTRAINT [PK_Customers_OtherInfo] PRIMARY KEY CLUSTERED
	(
		[ID] ASC
	) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
Insert data into each table. This splits the data by columns.
INSERT INTO [dbo].[Customers_WithoutOtherInfo]
SELECT ID, FirstName, LastName, SalaryDollars FROM [dbo].[Customers]

INSERT INTO [dbo].[Customers_OtherInfo]
SELECT ID, OtherInfo FROM [dbo].[Customers]
Let’s create a view over these two tables.
CREATE VIEW [dbo].[vw_Customers_Updateable]
AS
SELECT CWO.ID, CWO.FirstName, CWO.LastName, CWO.SalaryDollars, CO.OtherInfo
FROM dbo.Customers_WithoutOtherInfo AS CWO
INNER JOIN dbo.Customers_OtherInfo AS CO ON CWO.ID = CO.ID
GO
--Instead of Insert trigger
CREATE TRIGGER IOI_Trig_vw_Customers_Updateable ON [dbo].[vw_Customers_Updateable]
INSTEAD OF INSERT
AS
BEGIN
	INSERT INTO [dbo].[Customers_WithoutOtherInfo]
		([ID], [FirstName], [LastName], [SalaryDollars])
	SELECT [ID], [FirstName], [LastName], [SalaryDollars]
	FROM inserted

	INSERT INTO [dbo].[Customers_OtherInfo]
		([ID], [OtherInfo])
	SELECT [ID], [OtherInfo]
	FROM inserted
END
GO
--Instead of Update trigger
--Note: the join is on ID, so updates to the key column itself are not
--handled here; only the non-key columns are written back
CREATE TRIGGER IOU_Trig_vw_Customers_Updateable ON [dbo].[vw_Customers_Updateable]
INSTEAD OF UPDATE
AS
BEGIN
	UPDATE CWO
	SET FirstName = I.FirstName,
		LastName = I.LastName,
		SalaryDollars = I.SalaryDollars
	FROM [dbo].[Customers_WithoutOtherInfo] CWO JOIN inserted I ON CWO.ID = I.ID

	UPDATE CO
	SET OtherInfo = I.OtherInfo
	FROM [dbo].[Customers_OtherInfo] CO JOIN inserted I ON CO.ID = I.ID
END
GO
--Instead of Delete trigger
CREATE TRIGGER IOD_Trig_vw_Customers_Updateable ON [dbo].[vw_Customers_Updateable]
INSTEAD OF DELETE
AS
BEGIN
	DELETE CO FROM [dbo].[Customers_OtherInfo] CO JOIN deleted D ON (CO.ID = D.ID)
	DELETE CWO FROM [dbo].[Customers_WithoutOtherInfo] CWO JOIN deleted D ON (CWO.ID = D.ID)
END
GO
The view is now set up so I can run selects, inserts, updates, and deletes just as I would on the Customers table.
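As a quick smoke test (the values below are illustrative), each statement goes through the view and, via the INSTEAD OF triggers, touches both underlying tables:

```sql
--Select through the view, as if it were the original table
SELECT TOP 10 * FROM dbo.vw_Customers_Updateable ORDER BY ID;

--Insert a row; the INSTEAD OF INSERT trigger splits it across both tables
INSERT INTO dbo.vw_Customers_Updateable (ID, FirstName, LastName, SalaryDollars, OtherInfo)
VALUES (100001, 'TestFN', 'TestLN', 5000, 'Some additional information');

--Update and delete are routed the same way by their triggers
UPDATE dbo.vw_Customers_Updateable SET SalaryDollars = 6000 WHERE ID = 100001;
DELETE FROM dbo.vw_Customers_Updateable WHERE ID = 100001;
```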
Performance of updates, inserts, and deletes on the view will be somewhat degraded compared to the same operations on the single combined table. However, this method provides a solution to the scenario my friend was facing. They can split the table into two just as in this example: one table with all columns except the big varchar column, and the other with the primary key and the big varchar column. They can then create a view like the one above, changing the view definition and its triggers based on whether they currently need the big varchar column. When they need it, the view joins the two tables and exposes the big varchar column along with the other columns, and the triggers make any update, insert, or delete affect both tables. When they do not need it, they can change the view definition to drop the join to the table containing the big varchar column. In that case they should also drop the triggers, since there is no need for them when the view sits over a single underlying table and exposes all its columns. The view then essentially represents the table with the other columns, so query performance for 22 days of the month will be much better.
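For the 22 days when the big varchar column is not needed, switching the view to single-table mode might look like this; and since the varchar data does not need to be retained, the OtherInfo table can simply be truncated to reclaim the space:

```sql
--Remove the triggers and point the view at the single small table
DROP TRIGGER IOI_Trig_vw_Customers_Updateable;
DROP TRIGGER IOU_Trig_vw_Customers_Updateable;
DROP TRIGGER IOD_Trig_vw_Customers_Updateable;
GO
ALTER VIEW [dbo].[vw_Customers_Updateable]
AS
SELECT ID, FirstName, LastName, SalaryDollars
FROM dbo.Customers_WithoutOtherInfo;
GO
--The varchar data is not needed afterwards, so throw it away cheaply
TRUNCATE TABLE dbo.Customers_OtherInfo;
```

Reversing this at month-end is the mirror image: repopulate Customers_OtherInfo, restore the joined view definition, and recreate the three triggers.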
I would appreciate anyone sharing their experience with vertically partitioned views in the comments.