Testing environment: SQL Server 2014 Developer Edition
Recently, a friend brought up a requirement he was trying to find a solution for.
Scenario: A table with a large varchar column, many other columns, and more than a billion rows. Total table size touches 900 GB, and the table has a clustered primary key. Users need the varchar column only around month-end. Data, including the big varchar column, is populated in the table before month-end.
Requirement: Users need the big varchar column only on the 8 days around month-end; they don't need it for the other 22 days of the month. However, because the column is part of the table, only a few rows fit on each page, so queries have to do a lot of IO to get the required data. This slows down their queries and wastes memory. If they could somehow get rid of this column for 22 days and get it back for the 8 month-end days, it would serve the purpose. They do not need to retain the varchar data after it is no longer required.
One solution is to create a non-clustered index that includes all columns except the large varchar column. All queries other than the ones that reference the big varchar column would use the non-clustered index to fetch their data, and so would not touch the table. That would solve the reporting problem, but it would increase the total table size by a significant amount.
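As a sketch, such a covering index on the example Customers table built later in this post might look like this (the index name is illustrative):

```sql
--Non-clustered covering index: key on ID, all small columns included,
--the large char(5000) column deliberately left out
CREATE NONCLUSTERED INDEX IX_Customers_Covering
ON dbo.Customers (ID)
INCLUDE (FirstName, LastName, SalaryDollars);
```

Queries that do not touch the big column can be satisfied entirely from this much denser index, at the cost of duplicating the small columns on disk.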
Considering that they do not need to retain the varchar column data, another solution could be to drop the column when it is not needed, or to null out the values in the column and rebuild the clustered index. Both of these operations are super expensive and lock the table, making it effectively unusable for a long time.
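For completeness, the two expensive operations would look roughly like this; the syntax is simple, but on a 900 GB table the locking, logging, and rebuild time are the problem:

```sql
--Option A: drop the column outright; reclaiming the space still
--requires rebuilding the clustered index afterwards
ALTER TABLE dbo.Customers DROP COLUMN OtherInfo;

--Option B: null out the values, then rebuild the clustered index
UPDATE dbo.Customers SET OtherInfo = NULL;
ALTER INDEX PK_Customers ON dbo.Customers REBUILD;
```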
I am a fan of partitioned views, but I have always thought of them mostly in terms of splitting data horizontally. After my friend brought this problem to me, I wondered whether a vertically partitioned view could be a solution to his requirement. Instead of splitting a table into multiple smaller tables on row values, how about splitting it on column names? The two or more smaller tables would all have the same primary key. They could all be joined together in a view, and users could use the view to select, update, insert, and delete data. For updates, deletes, and inserts, we would need to create INSTEAD OF triggers on the view. I am sure this is not without some positive and negative performance impact, depending on the operation performed.
Let us test this out. I will create a table with four small columns and one char(5000) column. The char column is the one we need only for a few days in a month. Note that the ID column is the primary key with a clustered index.
CREATE TABLE [dbo].[Customers](
	[ID] [INT] NOT NULL,
	[FirstName] [VARCHAR](50) NULL,
	[LastName] [VARCHAR](50) NULL,
	[SalaryDollars] [INT] NULL,
	[OtherInfo] [CHAR](5000) NULL,
	CONSTRAINT [PK_Customers] PRIMARY KEY CLUSTERED
	(
		[ID] ASC
	) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
Now, let us insert data
--Declare variables
DECLARE @id INT
DECLARE @maxid INT

--Get the maximum id from the Customers table
SELECT @maxid = MAX(id) FROM [dbo].[Customers]

--Assign starting value to @id
IF @maxid IS NULL
	SELECT @id = 1
ELSE
	SELECT @id = @maxid + 1

--While loop to insert 10,000 rows
--(ISNULL keeps the loop condition from evaluating to NULL,
-- and thus inserting nothing, when the table starts out empty)
WHILE @id <= ISNULL(@maxid, 0) + 10000
BEGIN
	INSERT INTO [dbo].[Customers]
		([ID]
		,[FirstName]
		,[LastName]
		,[SalaryDollars]
		,[OtherInfo])
	SELECT @id,
		CAST(@id AS VARCHAR(10)) + 'FN' + CAST(@id AS VARCHAR(10)),
		CAST(@id AS VARCHAR(10)) + 'LN' + CAST(@id AS VARCHAR(10)),
		CASE WHEN @id < 1000 THEN 1000 + @id ELSE @id END,
		'This is the additional information for customer id ' + CAST(@id AS VARCHAR(10))

	--Increment value of @id
	SET @id = @id + 1
END
This creates the starting point for the scenario.
Now, we split this table into two tables: the first without the large char column, and the second with just the primary key and the large char column.
--Table with all columns except the large char(5000) column
CREATE TABLE [dbo].[Customers_WithoutOtherInfo](
	[ID] [INT] NOT NULL,
	[FirstName] [VARCHAR](50) NULL,
	[LastName] [VARCHAR](50) NULL,
	[SalaryDollars] [INT] NULL,
	CONSTRAINT [PK_Customers_WithoutOtherInfo] PRIMARY KEY CLUSTERED
	(
		[ID] ASC
	) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

--Table with just the primary key column and the large char(5000) column
CREATE TABLE [dbo].[Customers_OtherInfo](
	[ID] [INT] NOT NULL,
	[OtherInfo] [CHAR](5000) NULL,
	CONSTRAINT [PK_Customers_OtherInfo] PRIMARY KEY CLUSTERED
	(
		[ID] ASC
	) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
Insert data into each table. This splits the data by columns.
INSERT INTO [dbo].[Customers_WithoutOtherInfo]
SELECT ID, FirstName, LastName, SalaryDollars FROM [dbo].[Customers]

INSERT INTO [dbo].[Customers_OtherInfo]
SELECT ID, OtherInfo FROM [dbo].[Customers]
Let’s create a view over these two tables.
CREATE VIEW [dbo].[vw_Customers_Updateable]
AS
SELECT CWO.ID, CWO.FirstName, CWO.LastName, CWO.SalaryDollars, CO.OtherInfo
FROM dbo.Customers_WithoutOtherInfo AS CWO
INNER JOIN dbo.Customers_OtherInfo AS CO ON CWO.ID = CO.ID
GO
--Instead of Insert trigger
CREATE TRIGGER IOI_Trig_vw_Customers_Updateable ON [dbo].[vw_Customers_Updateable]
INSTEAD OF INSERT
AS
BEGIN
	INSERT INTO [dbo].[Customers_WithoutOtherInfo]
		([ID], [FirstName], [LastName], [SalaryDollars])
	SELECT [ID], [FirstName], [LastName], [SalaryDollars]
	FROM inserted

	INSERT INTO [dbo].[Customers_OtherInfo]
		([ID], [OtherInfo])
	SELECT [ID], [OtherInfo]
	FROM inserted
END
GO
--Instead of Update trigger
--Note: the join is on ID, so updates to the key column itself are not
--handled here; only the non-key columns are written back
CREATE TRIGGER IOU_Trig_vw_Customers_Updateable ON [dbo].[vw_Customers_Updateable]
INSTEAD OF UPDATE
AS
BEGIN
	UPDATE CWO
	SET FirstName = I.FirstName,
		LastName = I.LastName,
		SalaryDollars = I.SalaryDollars
	FROM [dbo].[Customers_WithoutOtherInfo] CWO JOIN inserted I ON CWO.ID = I.ID

	UPDATE CO
	SET OtherInfo = I.OtherInfo
	FROM [dbo].[Customers_OtherInfo] CO JOIN inserted I ON CO.ID = I.ID
END
GO
--Instead of Delete trigger
CREATE TRIGGER IOD_Trig_vw_Customers_Updateable ON [dbo].[vw_Customers_Updateable]
INSTEAD OF DELETE
AS
BEGIN
	DELETE CO FROM [dbo].[Customers_OtherInfo] CO JOIN deleted D ON (CO.ID = D.ID)
	DELETE CWO FROM [dbo].[Customers_WithoutOtherInfo] CWO JOIN deleted D ON (CWO.ID = D.ID)
END
GO
The view is now set up so I can run selects, inserts, updates, and deletes just as I would on the Customers table.
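As a quick smoke test (the values below are illustrative), each statement goes through the view and, via the INSTEAD OF triggers, touches both underlying tables:

```sql
--Select through the view, as if it were the original table
SELECT TOP 10 * FROM dbo.vw_Customers_Updateable ORDER BY ID;

--Insert a row; the INSTEAD OF INSERT trigger splits it across both tables
INSERT INTO dbo.vw_Customers_Updateable (ID, FirstName, LastName, SalaryDollars, OtherInfo)
VALUES (100001, 'TestFN', 'TestLN', 5000, 'Some additional information');

--Update and delete are routed the same way by their triggers
UPDATE dbo.vw_Customers_Updateable SET SalaryDollars = 6000 WHERE ID = 100001;
DELETE FROM dbo.vw_Customers_Updateable WHERE ID = 100001;
```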
Performance of updates, inserts, and deletes on the view will be somewhat degraded compared to the same operations on the single combined table. However, this method provides a solution to the scenario my friend was facing. They can split the table into two just as in this example: one table with all columns except the big varchar column, and the other with the primary key and the big varchar column. They can then create a view like the one above, changing the view definition and its triggers based on whether they currently need the big varchar column. When they need it, the view joins the two tables and exposes the big varchar column along with the other columns, and the triggers make any update, insert, or delete affect both tables. When they do not need it, they can change the view definition to drop the join to the table containing the big varchar column. In that case they should also drop the triggers, since there is no need for them when the view sits over a single underlying table and exposes all its columns. The view then essentially represents the table with the other columns, so query performance for 22 days of the month will be much better.
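For the 22 days when the big varchar column is not needed, switching the view to single-table mode might look like this; and since the varchar data does not need to be retained, the OtherInfo table can simply be truncated to reclaim the space:

```sql
--Remove the triggers and point the view at the single small table
DROP TRIGGER IOI_Trig_vw_Customers_Updateable;
DROP TRIGGER IOU_Trig_vw_Customers_Updateable;
DROP TRIGGER IOD_Trig_vw_Customers_Updateable;
GO
ALTER VIEW [dbo].[vw_Customers_Updateable]
AS
SELECT ID, FirstName, LastName, SalaryDollars
FROM dbo.Customers_WithoutOtherInfo;
GO
--The varchar data is not needed afterwards, so throw it away cheaply
TRUNCATE TABLE dbo.Customers_OtherInfo;
```

Reversing this at month-end is the mirror image: repopulate Customers_OtherInfo, restore the joined view definition, and recreate the three triggers.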
I would appreciate anyone sharing their experience with vertically partitioned views in the comments.