preface

MongoDB is a typical representative of non-relational databases. According to the DB-Engines Ranking, MongoDB has consistently led the NoSQL field in recent years. MongoDB is a database system designed for the rapid development of Internet applications; its data model and persistence strategy are built for high read/write performance and elastic scalability. As MongoDB's popularity and usage have grown rapidly, this document was put together to standardize its use, simplify management, and obtain better performance. The requirements are laid out in terms of database design specifications, collection design specifications, index design specifications, document design specifications, API usage specifications, and connection specifications.

Storage selection

  1. It mainly addresses access efficiency for large volumes of data and reduces the load on MySQL. MongoDB's built-in sharding features can handle large data volumes well and keep the system from hitting performance bottlenecks as data grows.
  2. With complex data structures, the same data can be queried with various query conditions. MongoDB’s BSON data format is very suitable for storage and query in document format; it supports rich query expressions, and can easily query embedded objects, arrays and sub-documents in documents.
  3. Suitable for non-transactional, weakly related collections (MongoDB 4.0+ supports cross-collection transactions, and MongoDB 4.2+ supports cross-shard transactions).
  4. No multi-document transactional requirements and complex associated retrieval.
  5. The business iterates rapidly, and the business needs change frequently.
  6. The data model is not fixed, and the storage format is flexible.
  7. The read/write concurrency is too high for a single instance or cluster to support as the business grows.
  8. Scenarios expecting five-nines (99.999%) database availability.

1. Database design specification

  1. [Mandatory] Database naming convention: db_xxxx

  2. [Mandatory] Database names must be all lowercase; any special characters other than _ are prohibited, and names starting with a number (such as 123_abc) are prohibited;

    Note: Databases exist as folders on disk; using special characters or other non-standard naming leads to naming confusion

  3. [Mandatory] The database name can be up to 64 characters.

  4. [Mandatory] Before creating a new database, try to evaluate the size and QPS of the database, and discuss with the DBA in advance whether to create a new database or create a new cluster specifically for the database.

2. Collection Design Specifications

  1. [Mandatory] Collection names must be all lowercase; any special characters other than _ are prohibited; names starting with a number (such as 123_abc) are prohibited; names starting with system are also prohibited, since system. is the prefix reserved for system collections;

  2. [Mandatory] The collection name can be up to 64 characters;

  3. [Suggestion] Heavy writes to one large collection in a database affect the read and write performance of the other collections. If the collections with busy business traffic share one DB, it is recommended to keep it to at most 80 collections, and disk I/O performance should also be considered;

  4. [Suggestion] If the evaluated data volume of a single collection is large, split the large table into multiple small tables, then store each small table in an independent database or use sharding;

  5. [Suggestion] MongoDB collections can "automatically clear expired data": simply add a TTL index on a time field of the documents in the collection. Note that the field type must be a BSON Date (ISODate()), and whether to use this must be decided from the actual business requirements (see the sketch after this list);

  6. [Suggestion] When designing a polling collection, whether to design it as a capped (fixed-size) collection must be decided from the actual business requirements (see the sketch after this list).
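
Reference examples for items 5 and 6 (a minimal sketch in the mongo shell; the collection names logs and poll_events, the field name createTime and the size/expiry values are assumptions, not requirements):

// TTL index: documents are removed roughly expireAfterSeconds seconds after the
// time stored in createTime; that field must hold a BSON Date (ISODate).
db.logs.createIndex({ createTime: 1 }, { expireAfterSeconds: 604800 })

// Capped collection: fixed size, insertion-ordered; the oldest documents are
// overwritten automatically once the size or document limit is reached.
db.createCollection("poll_events", { capped: true, size: 100 * 1024 * 1024, max: 100000 })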

Create collection rules

Different business scenarios can use different configurations;

db.createCollection("logs", {
"storageEngine": {
"wiredTiger": {
"configString": "internal_page_max=16KB, leaf_page_max=16KB, leaf_value_max=8KB, os_cache_max=1GB "
}
}
})

a. For a table that is read-mostly with few writes and whose data volume is not large, the page sizes can be set as small as possible at creation time, for example 16KB:

"internal_page_max=16KB,leaf_page_max=16KB,leaf_value_max=8KB,os_cache_max=1GB"

b. If a read-mostly table has a relatively large data volume, a compression algorithm can be set for it, for example:

"block_compressor=zlib, internal_page_max=16KB, leaf_page_max=16KB, leaf_value_max=8KB"

c. Note: do not use the zlib compression algorithm, because it consumes a lot of CPU. snappy consumes roughly 20% of the CPU, while zlib can consume 90% or even 100%.

d. For a table with many writes and few reads, leaf_page_max can be set to 1MB and a compression algorithm enabled; the os_cache_max value can also be set to limit the operating-system page cache it uses, so the table does not occupy too much page cache memory and does not impact read operations (a sketch follows below).

For read-mostly tables: internal_page_max=16KB (the default is 4KB), leaf_page_max=16KB (the default is 32KB), leaf_value_max=8KB (the default is 64MB).
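
For case d, a hedged sketch of creating a write-heavy collection (the collection name events and the exact values are assumptions and should be tuned to the actual workload):

db.createCollection("events", {
    "storageEngine": {
        "wiredTiger": {
            "configString": "block_compressor=snappy, leaf_page_max=1MB, os_cache_max=1GB"
        }
    }
})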

3. Document Design Specifications

  1. [Mandatory] The keys in the collection are prohibited from using any special characters other than “_” (underscore).

  2. [Mandatory] Store documents of the same type in one collection and spread documents of different types across different collections; keeping documents of the same type together greatly improves index utilization, while mixing document types often forces queries into full table scans;

  3. [Suggestion] Do not customize _id, for example by writing custom content into _id. Note: MongoDB collections, like InnoDB tables, are index-organized tables in which the data is clustered on the primary key, and _id is MongoDB's default primary key. Once _id values are non-incrementing, after the data volume reaches a certain level each write may trigger large adjustments of the primary-key B-tree, making writes very expensive, so write performance degrades as the data grows; therefore, never write custom content into _id.

  4. [Suggestion] Try not to make the array field a query condition;

  5. [Suggestion] If the field is large, it should be compressed and stored as much as possible;

    Do not store overly long strings. If the field is a query condition, make sure its value does not exceed 1KB; MongoDB's indexes only support keys of up to about 1KB, and values longer than that will not be used by the index.

  6. [Suggestion] Try to store data with unified case;

  7. [Suggestion] If the evaluated data volume of a single collection is large, split the large table into multiple small tables, then store each small table in an independent database or use sharding.

4. Index Design Specifications

  1. [Mandatory] MongoDB’s composite index usage strategy is consistent with MySQL and follows the “leftmost principle”;

  2. [Mandatory] The length of the index name should not exceed 128 characters;

  3. [Mandatory] Evaluate the query scenarios as comprehensively as possible and, based on that evaluation, merge single-column indexes into composite indexes to reduce the index count, keeping points 1 and 2 in mind;

  4. [Recommendation] Prioritize the use of covering indexes;

  5. [Suggestion] When creating a composite index, you should evaluate the fields contained in the index, and try to put the fields with a large data cardinality (data with many unique values) in front of the composite index;

  6. [Suggestion] MongoDB supports TTL indexes, which can automatically delete data older than a specified number of seconds and try to perform the deletions during off-peak business periods; evaluate whether the business needs this type of index;

  7. [Suggestion] When the amount of data is large, the creation of MongoDB indexes is a slow process, so you should try to evaluate before going online or before the amount of data becomes large, and create indexes that will be used as needed;

  8. [Suggestion] If you store geographical location data, such as longitude/latitude, you can add one of the geospatial indexes MongoDB supports on that field: 2d or 2dsphere. They are different, and mixing them up will produce inaccurate results (see the sketch after this list).
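
Reference examples for points 5 and 8 (a minimal sketch in the mongo shell; the collection name orders and the field names userId, status and location are illustrative assumptions):

// Composite index: userId has high cardinality (many unique values), so it is
// placed before the low-cardinality status field.
db.orders.createIndex({ userId: 1, status: 1 })

// Geospatial index on GeoJSON data: use 2dsphere (or 2d for legacy coordinate
// pairs), and never mix the two on the same field.
db.orders.createIndex({ location: "2dsphere" })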

5. API usage specification

  1. [Mandatory] An index must be created on the field of the query condition or the field of the sorting condition;

  2. [Mandatory] The query result only contains the required fields, not all fields;

  3. [Mandatory] Updates are atomic at the document level, which means that a statement that updates 10 documents may fail for some reason after updating 3 documents. Applications must handle these failures according to their own policies;

  4. [Recommendation] The BSON size of a single document cannot exceed 16M;

  5. [Suggestion] Disable update, remove or find statements without conditions;

  6. [Suggestion] Limit the number of returned records, and each query result should not exceed 2000. If you need to query more than 2000 pieces of data, use multi-threaded concurrent query in the code;

  7. [Suggestion] When writing data, if you need behavior similar to MySQL's INSERT ... ON DUPLICATE KEY UPDATE, you can use an update with the upsert option (see the sketch after this list);

  8. [Suggestion] You can choose to use batchInsert when writing a large amount of data, but currently the maximum message length that MongoDB can accept each time is 48MB. If it exceeds 48MB, it will be automatically split into multiple 48MB messages;

  9. [Suggestion] The -1 and 1 in an index definition are different: one is descending order and the other ascending. Build an appropriate index order for your own business scenario; note that {a:1, b:-1} is equivalent to {a:-1, b:1}, since an index can be traversed in either direction;

  10. [Suggestion] Try to check your program's performance while developing the business. You can use the explain() function to inspect query execution details; in addition, the hint() function is equivalent to FORCE INDEX in MySQL (see the sketch after this list);

  11. [Suggestion] If the business requires the collection's size or number of documents to be fixed, it is recommended to create a capped (fixed-size) collection. Its write performance is very high and there is no need to clean up old data specifically; note that capped collections do not support remove() and update() operations;

  12. [Suggestion] Some query operators may cause poor performance, such as $ne, $not, $exists, $nin and $or; try not to use them in business code;

    $exists: because the document structure is loose, the query must traverse every document;
    $ne: if the negated value is in the majority, the entire index will be scanned;
    $not: may leave the query optimizer unable to decide which index to use, so it often degenerates into a full table scan;
    $nin: full table scan;
    $or: runs as many queries as there are conditions and then merges the result sets, so use $in instead whenever possible.

  13. [Suggestion] Do not take out too much data for sorting at one time. MongoDB currently supports sorting the result set within 32MB. If you need to sort, please try to limit the amount of data in the result set;

  14. [Suggestion] The aggregation framework of MongoDB is very easy to use, and can realize complex statistical queries through simple syntax, and the performance is also good;

  15. [Suggestion] If you need to clear all data in a collection, remove() performs very poorly; use drop() in this scenario. remove() deletes row by row, so its performance is poor when deleting large amounts of data;

  16. [Suggestion] A query that uses an array field as a condition cannot be a covered query; because the array itself is stored in the index, even if the array field is excluded from the returned fields, such an index still cannot cover the query;

  17. [Suggestion] If the query contains a range condition, try to filter on it together with fixed-value (equality) conditions, and when creating the index put the equality fields before the range fields (see the sketches after this list).
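
Reference examples for points 2, 7, 10 and 17 (hedged sketches in the mongo shell; the collection and field names are assumptions for illustration only):

// Point 2: return only the required fields instead of whole documents.
db.users.find({ status: "active" }, { name: 1, email: 1, _id: 0 })

// Point 7: update with the upsert option, similar to MySQL's INSERT ... ON DUPLICATE KEY UPDATE.
db.users.updateOne({ userId: 1001 }, { $set: { lastLogin: new Date() } }, { upsert: true })

// Point 10: inspect the execution plan, or force a specific existing index (like MySQL's FORCE INDEX).
db.users.find({ status: "active" }).explain("executionStats")
db.users.find({ status: "active" }).hint({ status: 1 })

// Point 17: put the equality field before the range field in the composite index and the query.
db.orders.createIndex({ status: 1, createTime: 1 })
db.orders.find({ status: "paid", createTime: { $gt: ISODate("2023-01-01T00:00:00Z") } })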

6. Connection specification

  1. [Mandatory] Connect to the replica set correctly; a replica set provides data protection, high availability and disaster recovery. If the primary node goes down, one of the secondary nodes is automatically promoted to primary.

  2. [Suggestion] Reasonably control the size of the connection pool and limit the number of connection resources. You can configure the connection pool size through the maxPoolSize parameter in the Connection String URL.

  3. [Suggestion] Replica set read options: by default, all read requests in a replica set are sent to the Primary; the driver can route read requests to other nodes by setting a Read Preference (see the sketch below).
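
Reference example for points 2 and 3 (a hedged sketch; the hosts, credentials, database name, replica set name and pool size are placeholders):

mongodb://user:pass@host1:27017,host2:27017,host3:27017/mydb?replicaSet=rs0&maxPoolSize=100&readPreference=secondaryPreferred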