What are Columnstore Indexes and when do we use them in SQL Server
- How we enable Unified Auditing in Oracle Database - 7 October 2024
- What is PostgreSQL and how do we do a full installation - September 2, 2024
- How do we configure GoldenGate replication to read from Oracle Data Guard Standby - 2 August 2024
In this article we will analyze the difference of a table that is physically stored through Clustered Rowstore Index by being through Clustered Columnstore Index. What exactly are Columnstore Indexes, when should they be used and what are their pros and cons?
What are Columnstore Indexes
The Columnstore indexes is a technology that allows to store the data of a table per column as opposed to the "classic" Rowstore indexes which are stored per row of table entries. Because each column has the same type of data, it offers up to x10x greater compression and x10x faster data reading.
How Columnstore Indexes work
During its creation Columnstore Index records are divided into groups of 1,048,576 records (102,400 minimum) called Row Groups, then each column that has the Row Group you split and compress into groups called Column Segments. Since the minimum for a Row Group is 102,400 records when there are records left over, the above records go into a special space you call Deltastore. In the Deltastore space the data is still Row based and does not have the compression of the Column Store.
What are the advantages of Columnstore Indexes?
- Columnstore Indexes, as we mentioned at the beginning, because they divide the data into groups of columns, provide high compression of the data that leads to requiring less storage space.
- The high compression combined with the fact that queries on Columnstore Indexes run on batch mode (optimized for multiple records) instead of row mode they provide faster reading of the data.
- Queries that need only specific columns from the tables require reduced I/O in relation to Rowstore.
- The analytical queries that they use aggregate functions such as SUM, AVG and GROUPING are performed in minimal times.
What are the disadvantages of Columnstore Indexes
- Updating the tables in OLTP workloads with DML like insert, update and delete lags a lot compared to Rowstore.
- Each column must fit on a single page eg a field that is varchar(max) is not supported and if we want to use it we should keep only the first 4000 characters (LEFT(column,4000)).
- The search specific records usually takes more time.
The example
We have a table with a classic Clustered Rowstore Index at at ID. Before we do anything we enable statistics for timings and I/O:
set statistics time,io on; go SELECT * FROM Comments;
We make a second table copy of the same. Because it has a field whose size exceeds the one page (8k) it must have to create the Columnstore Index. To fit on one page we keep only the first 4000 characters:
*The order is to make the table with the entries shuffled by ID and date. We will see why below.
Then we create it Clustered Columnstore Index:
--drop table comments_columnstore select id,CreationDate,PostId,left(Text,4000) as Text,score,userid into Comments_Columnstore from Comments order by score desc; GO create clustered columnstore index idx_comments_clmstore on [dbo].[Comments_Columnstore]; GO
Reading his records Rowstore table for a month, we see that it takes 7 seconds:
SELECT id,CreationDate,PostId,Text,score FROM Comments where 1=1 and CreationDate between '2010-01-01' and '2010-02-01' order by CreationDate;
Accordingly to Columnstore table only takes 5 seconds. Also we can see in the messages that this time he does Segment reads:
SELECT id,CreationDate,PostId,Text,score FROM Comments_Columnstore where 1=1 and CreationDate between '2010-01-01' and '2010-02-01' order by CreationDate;
But because the Row Groups they are divided according to how the entries are in the table and we passed them mixed up, if we want to sort them by date we will have to make Clustered Rowstore Index on the table extinguishing the Columnstore and then rebuild the Columnstore by deleting the Rowstore.
We do this as in Columnstore Indexes we cannot yet define which field to order them with Segment Row Groups.
If we do this and run the query again we will see that from 5 seconds it dropped to 1.5 seconds:
create clustered index idx_comments_clmstore on [dbo].[Comments_Columnstore](CreationDate) with (drop_Existing = on); go create clustered columnstore index idx_comments_clmstore on [dbo].[Comments_Columnstore] with (drop_Existing = on); go SELECT id,CreationDate,PostId,Text,score FROM Comments_Columnstore where 1=1 and CreationDate between '2010-01-01' and '2010-02-01' order by CreationDate;
If we run aggregate function to Rowstore table we will see that it will take 5 seconds:
SELECT score,count(id) as Count FROM Comments group by score;
But if we run it on Columnstore table we will see what it will do only 10 milliseconds:
SELECT score,count(id) as Count FROM Comments_Columnstore group by score;
If we do 100,000 inserts to Rowstore table will make 41 seconds:
declare @i int set @i = 1 WHILE @i < 100000 BEGIN insert into comments(CreationDate,PostId,Score,Text,Userid) values ('2010-09-06 08:09:52.330',35314,3,'Test Comment',3); set @i=@i+1; END;
To Columnstore but the 100,000 table inserts they will take 54 seconds.
declare @i int set @i = 1 WHILE @i < 100000 BEGIN insert into comments_columnstore(CreationDate,PostId,Score,Text,Userid) values ('2010-09-06 08:09:52.330',35314,3,'Test Comment',3); set @i=@i+1; END;
Accordingly if we do delete to Rowstore panel will do 5 seconds:
delete from comments where Text = 'Test Comment';
While in Columnstore table it delete it will take almost 10 seconds (ie twice the time):
delete from comments_columnstore where Text = 'Test Comment';
When to use a Columnstore Index
The Columnstore Indexes have been created primarily for use in Data warehouses (OLAP). They are made in Fact tables and to the very old Dimensional tables. Using them dramatically improves performance as a large amount of data is required.
Also with creation Nonclustered Columnstore Index in Rowstore tables we can increase the performance of analytical queries that use aggregate functions such as SUM, AVG and GROUPING are performed in minimal times.
On the website columnscore.com you'll find a nice quiz with questions by Brent Ozar about when we do or don't use a Columnstore Index.