1. What is sub-database and sub-table?

Table splitting: Split the data in one table into multiple tables according to certain rules, reduce lock granularity and index trees, and improve data query efficiency.
Split database: Split the data in one database into multiple databases according to certain rules to relieve the pressure on a single server (CPU, memory, disk, IO).

2. Data segmentation architecture

2.1. Vertical (vertical) segmentation

There are two common types of vertical segmentation: vertical database partitioning and vertical table partitioning.

Vertical database sharding: Based on business coupling, different tables with low correlation are stored in different databases. The approach is similar to splitting a large system into multiple small systems, which are divided independently according to business classification. Similar to the approach of “microservice governance”, each microservice uses a separate database.
Vertical table splitting: It is based on “columns” in the database. If a table has many fields, you can create a new extended table and split the fields that are not frequently used or have large field lengths into the extended table. When there are many fields (for example, a large table has more than 100 fields), “split the large table into small tables” is easier to develop and maintain, and can also avoid cross-page problems. The bottom layer of MySQL is stored through data pages. Recording that takes up too much space will lead to page crossing, causing additional performance overhead. In addition, the database loads data into the memory in units of rows, so that the field length in the table is shorter and the access frequency is higher. The memory can load more data, the hit rate is higher, and the disk IO is reduced, thus improving the database performance.

Advantages and Disadvantages of Vertical Slicing:

Advantages

Solve the coupling at the business system level and make the business clear;
Similar to the governance of microservices, it can also perform hierarchical management, maintenance, monitoring, expansion, etc. of data of different businesses;
In high concurrency scenarios, vertical segmentation will increase the bottleneck of IO, database connections, and single-machine hardware resources to a certain extent;

Disadvantages

Some tables cannot be joined and can only be solved through interface aggregation, which increases the complexity of development;
Distributed transaction processing is complex;
There is still the problem of excessive data volume in a single table (needs horizontal segmentation)

1.2. Horizontal (horizontal) segmentation

When it is difficult for an application to be vertically segmented at a finer granularity, or the number of rows of data after segmentation is huge, and there are performance bottlenecks in single database reading, writing, and storage, then horizontal segmentation is required.

Horizontal segmentation is divided into intra-database sharding and sub-database sharding. Based on the inherent logical relationship of the data in the table, the same table is dispersed into multiple databases or multiple tables according to different conditions. Each table only contains Part of the data, thereby reducing the amount of data in a single table and achieving a distributed effect.

Advantages and disadvantages of horizontal slicing:

Advantages

There is no performance bottleneck caused by excessive data volume or high concurrency in a single database, improving system stability and load capacity.
The application side transformation is small and there is no need to split the business module

Disadvantages

Transaction consistency across shards is difficult to guarantee;
Cross-database join association query performance is poor;
Data expansion is difficult and requires a lot of maintenance;

Common solutions and advantages and disadvantages of horizontal segmentation

Range split
Split according to time interval or ID interval. For example: distribute data of different months or even days into different libraries by date; assign records with userIds from 1 to 9999 to the first library, records with userIds from 10000 to 20000 to the second library, and so on.

In a sense, the “hot and cold data separation” used in some systems, migrating some less used historical data to other libraries, and only providing hot data queries in business functions, is a similar practice.

①Advantages

The size of a single table is controllable
It is naturally easy to expand horizontally. If you want to expand the entire sharded cluster later, you only need to add nodes, and there is no need to migrate the data of other shards;
When using shard fields for range search, continuous sharding can quickly locate shards for quick query, effectively avoiding cross-shard query problems;

②Disadvantages

Hotspot data becomes a performance bottleneck. Continuous sharding may have data hotspots, such as sharding by time fields. Some shards store data in the most recent time period and may be frequently read and written, while some shards store historical data that is rarely queried.

Take the modulo based on the numerical value
Generally, the splitting method of hash modulo mod is used. For example, the Customer table is split into 4 libraries according to the cusno field. The remainder with a remainder of 0 is placed in the first library, and the remainder with a remainder of 1 is placed in the second library. And so on.
In this way, the data of the same user will be scattered into the same database. If the query condition contains the cusno field, the corresponding database can be clearly positioned for query.

①Advantages
- Data fragmentation is relatively even, and hot spots and concurrent access bottlenecks are not prone to occur.

②Disadvantages

When the sharded cluster is expanded later, old data needs to be migrated (using a consistent hash algorithm can better avoid this problem)
Easily faced with complex problems of cross-shard queries. For example, in the above example, if cusno is not included in the frequently used query conditions, the database will not be located. Therefore, it is necessary to initiate queries to four libraries at the same time, then merge the data in the memory, take the minimum set and return it to the application. Instead, the library became a drag.

3. Data segmentation principles

3.1 Try not to segment if possible

Data segmentation is not necessarily efficient. Data segmentation will increase business complexity to a certain extent. In addition to carrying data storage and query, the database is also one of its important tasks to assist the business in better realizing its needs.
Don’t use the big trick of sub-database and sub-table unless absolutely necessary to avoid “over-design” and “premature optimization”. Before splitting databases and tables, don’t split just for the sake of splitting. Try your best to do what you can first, such as upgrading hardware, upgrading network, separating read and write, index optimization, etc. When the amount of data reaches the bottleneck of a single table, consider sharding databases and tables.

3.2 The amount of data is too large and normal operation and maintenance affects business access.

The operation and maintenance mentioned here refers to:

For database backup, if a single table is too large, a large amount of disk IO and network IO will be required during backup. For example, if 1T of data is transmitted over the network and takes up 50MB, it will take 20,000 seconds to complete the transmission. The risk of the entire process is relatively high.
When making DDL modifications to a large table, MySQL will lock the entire table. This time will be very long. During this time, the business cannot access the table, which has a great impact. If you use pt-online-schema-change, triggers and shadow tables will be created during the use process, which also takes a long time. During this operation, it is counted as risk time. Splitting the data table and reducing the total amount can help reduce this risk.
Large tables will be accessed and updated frequently, so lock waits are more likely to occur; split the data and trade space for time to reduce access pressure in disguise.

3.3 As business develops, some fields need to be split vertically

3.4 Rapid growth of data volume

With the rapid development of business, the amount of data in a single table will continue to grow. When performance approaches the bottleneck, it is necessary to consider horizontal sharding and create separate databases and tables. At this time, you must choose appropriate segmentation rules and estimate the data capacity in advance.

3.4 Security and Usability

Vertical segmentation is performed at the business level to separate the databases of unrelated businesses. Because the data volume and access volume of each business are different, one business cannot implicate the database and implicate other businesses. Using horizontal slicing, when a database problem occurs, it will not affect 100% of users. Each database only bears part of the business data, so that the overall availability can be improved.

4. Problems caused by sub-database and sub-table

1. Things consistency problem

Database sharding may cause the data required to execute a transaction to be distributed on different servers. Transactional operations cannot be implemented at the database level. Distributed transaction operations need to be introduced into the upper-level business, which will inevitably bring certain complexity to the business. In order to solve the transaction There are generally two approaches to sexual problems:
Option 1: During the design process of the database and table sub-scheme, from a business perspective, try to ensure that the tables operated by a transaction are distributed in one database, thereby achieving transaction guarantees at the database level.
Option 2: When method 1 cannot be implemented, the business layer introduces distributed transaction components to ensure transactionality, such as transactional messages, TCC, Seata and other distributed transaction methods to achieve ultimate data consistency.

2. Cross-node association query join problem

solution:
① Store the table with E-R relationship on a shard.
② Realize associated queries through global tables, field redundancy, data assembly and other methods.
③ For cross-database joins that must be performed, cross-database joins across up to two tables are supported.

3. Cross-node paging, sorting, and function issues

When querying multiple databases across nodes, problems such as limit paging and order by sorting may occur. Paging needs to be sorted according to the specified field. When the sorting field is a sharding field, it is easier to locate the specified shard through the sharding rules; when the sorting field is not a sharding field, it becomes more complicated. The data needs to be sorted and returned in different shard nodes first, and then the result sets returned by different shards are summarized and sorted again, and finally returned to the user.

4. Distributed primary key ID conflict problem

solution:
① Use the incr command of Redis to generate the primary key.
② Use UUID to generate the primary key (not recommended: the fields are relatively long and difficult to sort).
③ Snowflake distributed self-increasing ID algorithm.

5. Data migration,

When the business develops rapidly and faces performance and storage bottlenecks, sharding design will be considered. At this time, it is inevitable to consider the issue of historical data migration. The general approach is to first read the historical data, and then write the data to each shard node according to the specified sharding rules. In addition, capacity planning needs to be carried out based on the current data volume and QPS, as well as the speed of business development, to calculate the approximate number of shards required (it is generally recommended that the data volume of a single table on a single shard should not exceed 1000W)

If you use numerical range sharding, you only need to add nodes to expand the capacity, and there is no need to migrate the sharded data. If numerical modulo sharding is used, it will be relatively troublesome to consider later expansion issues.

5. What is a good sub-database and table scheme?

Meet the needs of business scenarios: Choose different database and table sharding solutions according to different business scenarios: such as dividing by time, dividing by user ID, dividing by business capabilities, etc.
Program sustainability:
What is sustainability? In fact, when the business data level and traffic level further reach new levels in the future, our sub-database and sub-table solution can continue to flexibly expand capacity.
Minimize data migration: Capacity expansion generally involves historical data migration. The smaller the amount of data that needs to be migrated after expansion, the stronger the sustainability. The ideal state before and after migration is (same database, same table > same table, different database> Different tables in the same database > Different tables in different databases)
Data skew: the balance of data distribution in database tables, ensuring that data traffic is equally distributed in each database table as much as possible, and avoiding the pressure of hot data on a single database.
Maximum data skew rate: (sample with the largest amount of data - sample with the smallest amount of data) / sample with the smallest amount of data. Generally speaking, it is acceptable if our maximum data skew rate is within 5%.