A good number of people first learn about databases as relational databases, where data is stored in tables, with one row representing one record. In fact, this is a typical Row-based store, which stores tables on disk partitions by rows.


And some databases also support Column-based store, which stores tables on disk partitions by column.


Comparison of storage methods

The difference between the two is shown below.

As you can see from the figure, in row storage, the attribute value of a row is stored in the adjacent space, followed by the attribute value of the next record.

In the case of column storage, all the values of a single attribute are stored in the adjacent space, i.e., all the data of a column are stored consecutively, and each attribute has a different space.

Here, you can think about the two kinds of more suitable for querying, and more suitable for modifying?

Comparison in data writing:

1) Writes to row storage are done in a single pass. Writes are built on the operating system's file system, which guarantees the success or failure of the writing process, so the integrity of the data can be determined.

(2) Columnar storage requires splitting a row of records into a single column, so the number of writes is significantly higher than row storage, and the actual time consumption is greater because of the time required to move and position the head on the disk. Therefore, row storage has a great advantage in writing.

3) There is also data modification, which is actually a write process. So, data modification is also dominated by line storage.

Comparison in data reading:

(1) Row storage usually takes out a row of data completely, and if only a few columns of data are needed, there will be redundant columns, and the process of eliminating redundant columns is usually done in memory for the sake of shortening processing time.

2) Column storage reads a section or all of the data of a collection each time, there is no redundancy problem, and the lookup content is stored continuously, which is especially suitable for projection.

3) Data distribution for both stores. Since each column data type of column storage is homogeneous, there is no problem of duality. For example, if the data type of a column is integer (int), then its data set must be integer data. This situation makes data parsing very easy. In contrast, row storage is much more complex, because multiple types of data are stored in a row, and data parsing requires frequent conversions between multiple data types, which consumes CPU and increases the parsing time. Therefore, the parsing process of column storage is more conducive to analyzing big data.

4) Comparison in terms of data compression and more performance reading. The same column of data, the same data type, column storage mode is suitable for data compression, different columns can use different compression algorithms, compressed storage will bring IO performance improvement.

Comparison of advantages and disadvantages

The table storage type is the first step of table definition design, and the customer business type is the main factor to decide the table storage type. Row and column storage models have their own advantages and disadvantages, and it is recommended to choose according to the actual situation.
A comparison of the advantages and disadvantages of row and column storage and the applicable scenarios is shown in the following table.

Row Storage Column storage
Advantages

Data is saved together. inserts/updates easily.

  1. Only the columns involved will be read during the query.

  2. Projection is very efficient.

  3. Any column can be used as an index.

Disadvantages

When Selection, all data is read even if only a few columns are involved.

  1. When selection is complete, the selected colum n has to be reassembled.

  2. INSE RT/UPDATE is more troublesome.

  3. Point query is not suitable.

Applicable scenarios

1.Point query (return few records, simp le query based on index).

2. Scenarios with more additions, deletions and changes.

1.Statistical analysis type of query (OLAP, such as data warehouse business, this type of table will do a lot of convergence calculations, and involves fewer column operations, more correlation, grouping operations).

2.Instant query (query conditions are uncertain, row storage table scan is difficult to use indexes)

Row and column storage experiments

openGauss supports mixed storage of rows and columns, and you can specify the storage method when you build a table. Let's conduct the following experiments.

Experimental environment: Huawei Cloud Server + openGauss Enterprise Edition 3.0.0 + openEuler20.03

Create a row storage table custom1 and a column storage table custom2 and insert 500,000 records.

openGauss=# create table custom1 (id integer,name varchar2(20)); 
CREATE TABLE 
openGauss=# create table custom2 (id integer,name varchar2(20)) with (orientation = column); 
CREATE TABLE 
openGauss=# insert into custom1 select n,'testtt'||n from generate_series(1,500000) n; 
INSERT 0 500000 
openGauss=# insert into custom2 select * from custom1; 
INSERT 0 500000

If we look at the storage space of the two tables and compare the Size column, we can see that the column table takes up much less storage space than the row table, almost 1/7 of the space of the row table.
openGauss=# \d+ 
                                           List of relations 
 Schema |    Name    | Type  | Owner |    Size    |               Storage                | Description 
--------+------------+-------+-------+------------+--------------------------------------+------------- 
 public | custom1    | table | omm   | 24 MB      | {orientation=row,compression=no}     | 
 public | custom2    | table | omm   | 3104 kB    | {orientation=column,compression=low} |

Comparing the time to insert a new record, the column storage table is a little slower.
openGauss=# explain analyze insert into custom1 values(1,'zhang3'); 
                                          QUERY PLAN 
----------------------------------------------------------------------------------------------- 
 [Bypass] 
 Insert on custom1  (cost=0.00..0.01 rows=1 width=0) (actual time=0.059..0.060 rows=1 loops=1) 
   ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1) 
 Total runtime: 0.135 ms 
(4 rows) 
 
openGauss=# explain analyze insert into custom2 values(1,'zhang3'); 
                                          QUERY PLAN 
----------------------------------------------------------------------------------------------- 
 Insert on custom2  (cost=0.00..0.01 rows=1 width=0) (actual time=0.119..0.120 rows=1 loops=1) 
   ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1) 
 Total runtime: 0.207 ms 
(3 rows)

Finally, delete the test table.

openGauss=# drop table custom1; 
DROP TABLE 
openGauss=#drop table custom2; 
DROP TABLE

Interested students can test more scenarios themselves, such as creating large wide tables, update tables and other scenarios to test.

Selection advice

  • Update frequency: If the data is updated frequently, select the row storage table.

  • Insertion frequency: For frequent insertion of small amount, select row storage table. If you insert a large amount of data at a time, select the column storage table.

  • The number of columns in the table: In general, if the table has more fields, that is, the number of columns (large wide table), the query does not involve many columns, suitable for column storage. If the number of fields in the table is relatively small and most of the fields in the query, then it is better to choose row storage.

  • The number of columns in the query: If each query involves only a few (<50% of the total number of columns) columns of the table, choose column storage table. (Don't ask what the rest of the columns are for, A says useful is useful.)

  • Compression rate: column storage table than row storage table compression rate is high. But the high compression rate will consume more CPU resources.

Cautions

Due to the special storage method, there are more constraints when using column storage. For example, column storage tables do not support arrays, do not support generated columns, do not support creating global temporary tables, do not support foreign keys, and support fewer data types than row storage. You need to check the corresponding database documentation when using it.