Column Store VS Row Store

Other Ways to Store Data¶

Column Store VS Record Store¶

Relational DBMS¶

Record-oriented storage
Write-optimized systems
Row store relational databases are write-optimized by storing attributes of a record contiguously in storage. This allows for a single disk write to push all of the fields of a single record out to disk. This approach is efficient for write-heavy environments such as OLTP (Online Transaction Processing) systems, but is not ideal for datawarehouse systems..

Data warehouses / Analytical Databases¶

Column store architecture
Read-optimized systems
Column stores make databases read-optimized in two ways:
- Avoiding irrelevant data access: By storing data by column, C-Store allows users to only read the columns they need for a query, as opposed to reading entire rows in a row-store database. This can significantly improve read performance, especially for queries that only involve a subset of attributes.
- Data compression: Column stores can compress data more effectively than row stores because they store each column separately. This is because each column likely contains similar data types, which can be compressed more efficiently than the mixed data types within a row.

C-Store Overview¶

C-Store is a column-oriented DBMS that is designed to be read-optimized. It achieves this by storing data by column rather than by row, and by using careful coding and packing of objects into storage. C-Store also uses bitmap indexes to complement B-tree structures. These design choices allow C-Store to outperform commercial row-oriented products on warehouse-style queries.

Column store architecture benefits¶

Data stored in a traditional row store style

Transaction ID	Date	Product	Quantity	Price	Customer ID
1	2024-03-15	Laptop	2	$800	101
2	2024-03-15	Smartphone	1	$500	102
3	2024-03-15	Tablet	3	$300	103

In a C-Store database, data is stored column by column. Each column represents a single attribute, and all values for that attribute across all records are stored together.

Transaction ID	1	2	3
Date	2024-03-15	2024-03-15	2024-03-15
Product	Laptop	Smartphone	Tablet
Quantity	2	1	3
Price	$800	$500	$300
Customer ID	101	102	103

Performance advantages in warehouses¶

A column store architecture benefits by reducing the number of disk accesses required per query.
This is because only the columns required for processing a given query are read, and irrelevant attributes are avoided.
In warehouse environments, where typical queries involve aggregates performed over large numbers of data items, a column store architecture has a significant performance advantage.

Design Decisions¶

Read Optimize Data
Store data by columns rather than by row.
C-Store chooses to compress the data whenever possible.

Data coding and compression¶

This is done by using coding schemes that can represent data in fewer bits than the original format.
Instead of storing a state like “CA” as a two-character abbreviation, C-Store can code it into a 6-bit value.

Storage using read-optimized structures¶

With a Column Structure, All the values for a single attribute are stored together, rather than each row of data being stored contiguously.
This allows C-Store to only read the columns that are required for a query, reducing the amount of data that needs to be accessed from disk.
Physically store a collection of columns, where each column is sorted on some attribute.
These sorted collections of columns are called projections.
C-Store can store multiple projections of the same data, with each projection sorted on a different attribute.
- This allows C-Store to choose the most efficient projection to use when answering a query.

Demonstration of Physical Storage¶

As noted, C-Store’s storage are Collections of columns sorted on attributes
Projection storage is then used to gain differnt views of this data.

Visualizing C-Store Physical Storage¶

Let’s visualize the physical storage of a C-Store database using a simple example with multiple projections. In this example, we’ll consider a database storing sales data with columns for Transaction ID, Date, Product, Quantity, Price, and Customer ID.

Traditional Row Store view of this data.

Transaction ID	Date	Product	Quantity	Price	Customer ID
1	2024-03-15	Laptop	2	$800	101
2	2024-03-15	Smartphone	1	$500	102
3	2024-03-15	Tablet	3	$300	103

In a C-Store database, the data would be physically stored in columns rather than rows. Here’s how it might look:


Transaction ID:	1	2	3
Date:	2024-03-15	2024-03-15	2024-03-15
Product:	Laptop	Smartphone	Tablet
Quantity:	2	1	3
Price:	$800	$500	$300
Customer ID:	101	102	103

Each column contains all the values for a specific attribute across all transactions.
Now let’s see what we mean by storing Multiple Projections:
Consider two different projections or views of this data:

Projection 1: Product Sales - Columns: Product, Quantity, Price


Product:	Laptop	Smartphone	Tablet
Quantity:	2	1	3
Price:	$800	$500	$300

Projection 2: Customer Transactions - Columns: Customer ID, Transaction ID, Date, Product


Customer ID:	101	102	103
Transaction ID:	1	2	3
Date:	2024-03-15	2024-03-15	2024-03-15
Product:	Laptop	Smartphone	Tablet

Querying Multiple Projections:¶

Queries can be performed on each projection separately, allowing for efficient analysis based on different perspectives of the data.

For example, to calculate the total sales revenue for each product in Projection 1, you could run a SQL query like:

SELECT Product, SUM(Quantity * Price) AS TotalRevenue
FROM ProductSales
GROUP BY Product;

And to retrieve all transactions made by a specific customer in Projection 2, you could run a query like:

SELECT *
FROM CustomerTransactions
WHERE CustomerID = 101;

These queries can be executed independently on their respective projections, leveraging the columnar storage layout for efficient data retrieval and processing.

Hardware Architecture¶

C-Store is designed for a grid computing environment.

Grid computers¶

Grid computing involves a network of interconnected computers (nodes) to work together as a single system.
Each node has its own processing power, memory, and storage.
Grid computing allows C-Store to be K-safe, meaning it can tolerate K failures.
This is achieved by storing redundant data on different nodes.

Horizontal data partitioning¶

Horizontal Partitioning is a technique that splits a table into smaller subsets based on a chosen criterion, typically by ranges of values or hash functions.
Each subset, called a shard (or partition), is then stored on a separate machine.
The Grid computer architecture allows C-Store to horizontally partition data across the disks of the various nodes. This is beneficial because it allows for very large amounts of data to be stored and retrieved in parallel.

Info

horizontal partitioning benefits C-Store in a grid computing environment:

Scalability: By distributing data across multiple machines, C-Store can handle much larger datasets than if it were limited to a single server. As the data volume grows, additional nodes can be added to the grid to accommodate the increase.

Parallel Processing: Queries can be executed in parallel across different nodes, potentially speeding up query processing times significantly. Since each node holds a portion of the data, queries can be divided and executed simultaneously on multiple machines, retrieving relevant data faster.

Fault Tolerance: If a single node fails, the data on that node remains accessible because other nodes still hold their respective partitions. This redundancy helps ensure data availability and system reliability.

Transactions and Updates¶

Challenges of updates in read-mostly environments¶

C-Store faces the challenge of balancing the need to update data frequently with the desire to optimize data structures for reading. Traditionally, these two goals are at odds with each other. C-Store addresses this challenge by using a hybrid architecture with a Writeable Store (WS) for updates and a Read-optimized Store (RS) for queries. Updates are funneled through the WS, which is designed for fast inserts and updates. Then, the data is periodically bulk-loaded into the RS. This bulk-loading process is performed by the tuple mover component. However, this approach can introduce latency into updates, since the data may not be immediately reflected in the RS.

C-Stor in the Wild¶

Unfortunately I don’t have ther resources at CR to let you play with an implementation of C-Stor, but I think its important for you all to know what it is, when you should use it (i.e. parallelism and warehousing data queries), and what products support C-Store like architecture.

C-Stor was presented in this article in 2005, and it didn’t take long for commercial vendors to see the huge benefit of this type of architecture. Here is a short list of commercial databases that integrated some version of C-Stor like features into their products, or entirely NEW products:

Amazon Redshift: Redshift is a fully managed data warehouse service provided by Amazon Web Services (AWS). It utilizes a columnar storage format and massively parallel processing (MPP) architecture to deliver high-performance query processing and scalability for analytics workloads in the cloud.
Google BigQuery: BigQuery is a serverless, highly scalable data warehouse service offered by Google Cloud Platform (GCP). It uses a columnar storage format and a distributed processing engine to deliver fast and cost-effective analytics capabilities for large datasets.
Oracle Database In-Memory Option: Oracle Database provides an in-memory option that enables columnar storage and processing for analytical workloads. With this option, tables or portions of tables can be populated into memory in a columnar format, improving query performance for analytics and reporting.
Microsoft SQL Server Columnstore Indexes: SQL Server offers columnstore indexes, which store and manage data in a columnar format for improved query performance, especially for data warehouse and analytics scenarios. Columnstore indexes can be applied to both clustered and non-clustered tables in SQL Server.
IBM Db2 BLU Acceleration: Db2 BLU Acceleration is a feature of IBM Db2 that provides columnar storage and in-memory processing capabilities. It leverages techniques such as columnar compression, data skipping, and SIMD (Single Instruction, Multiple Data) processing to accelerate query performance for analytical workloads.
SAP HANA: SAP HANA is an in-memory data platform that natively supports columnar storage and processing. It is designed for real-time analytics and transaction processing, offering high-speed data processing capabilities for large datasets.
Vertica by Micro Focus: Vertica is a columnar database designed for analytics and data warehousing applications. It uses a columnar storage format and a distributed architecture to provide high-performance query processing and scalability for large-scale analytics.