1. Bloom Filter

A Bloom filter is a space-efficient probabilistic data structure for testing whether an element is a member of a set. It is used in scenarios where we only need to check whether an element belongs to a set, and where a false positive is acceptable but a false negative is not.

In BigTable (and Cassandra), any read operation must read from the SSTables that make up the Tablet. To reduce the number of disk accesses, BigTable uses Bloom filters.
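
As an illustration (not BigTable's actual implementation), here is a minimal Bloom filter sketch using a plain bit array and k positions derived from SHA-256 with different salts; all names and sizes are illustrative.

```python
# Minimal Bloom filter sketch: k bit positions per item, derived from SHA-256.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [0] * size_bits

    def _positions(self, item):
        # Derive k positions by hashing the item with k different salts.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not present"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("row-key-42")
print(bf.might_contain("row-key-42"))   # True
print(bf.might_contain("row-key-99"))   # False (with high probability)
```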

2. Consistent Hashing

Consistent hashing makes it easy to scale a cluster and to distribute and replicate data efficiently, which in turn improves availability and fault tolerance.

Each data item identified by a key is assigned to a node by hashing the key to produce a position on the ring, and then walking the ring clockwise to the first node whose position is greater than that of the item. That node becomes the home of the data item.


The main advantage of consistent hashing is incremental stability: the departure or arrival of a node in the cluster affects only its immediate neighbors on the ring; other nodes remain unaffected.
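
A minimal sketch of a consistent-hash ring, assuming MD5 for ring positions and omitting virtual nodes and replication; all names are illustrative.

```python
# Minimal consistent-hash ring: a key goes to the first node clockwise from it.
import bisect
import hashlib

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        pos = ring_hash(key)
        idx = bisect.bisect_right([h for h, _ in self.ring], pos)
        # Wrap around to the start of the ring if we fall past the last node.
        return self.ring[idx % len(self.ring)][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:1001"))
# Adding or removing a node only remaps keys between it and its predecessor.
```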

3. Quorum

In a distributed environment, a quorum is the minimum number of servers on which a distributed operation must be performed successfully before the operation as a whole can be confirmed as successful.

Cassandra: to ensure data consistency, each write request can be configured to succeed only if the data has been written to at least a quorum (i.e., a majority) of replica nodes.


For leader elections, Chubby uses Paxos, which uses quorum to ensure strong consistency.


Dynamo replicates writes to a sloppy quorum of other nodes in the system, rather than a strict majority quorum as in Paxos. All read/write operations are performed on the first N healthy nodes in the preference list, which may not always be the first N nodes encountered when walking the consistent hash ring.
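
A small sketch of a quorum-based write in a Dynamo-style store, under the usual convention that N is the replication factor, W the write quorum, and R the read quorum; choosing W + R > N makes read and write quorums overlap on at least one replica. The code below is illustrative only.

```python
# Quorum write sketch: the write is confirmed only if W replicas acknowledge it.
N, W, R = 3, 2, 2
assert W + R > N        # read and write quorums overlap on at least one replica

def quorum_write(replicas, key, value, w=W):
    acks = 0
    for replica in replicas:
        if replica["up"]:                    # a down replica never acknowledges
            replica["data"][key] = value
            acks += 1
    return acks >= w                         # confirmed only with a quorum of acks

replicas = [{"up": True, "data": {}} for _ in range(N)]
replicas[2]["up"] = False                    # one replica is unreachable
print(quorum_write(replicas, "k", "v"))      # True: 2 of 3 acks, which meets W
```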

4. Leader and Followers

In order to achieve fault tolerance in systems that manage data, it is necessary to replicate data across multiple servers.


One server is selected as the leader in the cluster. The leader is responsible for making decisions on behalf of the entire cluster and propagating them to all other servers.


For smaller clusters of three to five nodes, as in systems that implement consensus, leader election can be implemented within the data cluster itself, without relying on any external system. Leader election happens at server startup: each server starts an election at boot time and tries to elect a leader. The system does not accept any client requests until a leader has been elected.
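
A deliberately simplified, single-process sketch of majority-vote leader election; real implementations (e.g. Raft or ZAB) add terms, log comparisons, and timeouts, and all names below are illustrative.

```python
# Simplified majority-vote leader election over an in-memory "cluster".
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.voted_for = None
        self.leader_id = None

    def request_vote(self, candidate_id):
        # Grant the vote only if this node has not voted in this election yet.
        if self.voted_for is None:
            self.voted_for = candidate_id
            return True
        return False

def run_election(nodes, candidate_id):
    votes = sum(node.request_vote(candidate_id) for node in nodes)
    if votes > len(nodes) // 2:                 # needs a majority quorum
        for node in nodes:
            node.leader_id = candidate_id
        return True
    return False

cluster = [Node(i) for i in range(1, 6)]        # a five-node cluster
print(run_election(cluster, candidate_id=3))    # True: node 3 wins all votes
print([node.leader_id for node in cluster])     # [3, 3, 3, 3, 3]
```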

5. Heartbeat

The heartbeat mechanism is used to detect if an existing leader has failed so that a new leader election can be initiated.
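
A minimal heartbeat-monitoring sketch: the leader is suspected to have failed once no heartbeat has been received within a fixed timeout; the timeout value and names are illustrative.

```python
# Heartbeat monitor: suspect the leader after a fixed silence period.
import time

HEARTBEAT_TIMEOUT = 5.0        # seconds without a heartbeat before suspicion

class HeartbeatMonitor:
    def __init__(self):
        self.last_heartbeat = time.monotonic()

    def record_heartbeat(self):
        # Called whenever a heartbeat message arrives from the leader.
        self.last_heartbeat = time.monotonic()

    def leader_suspected_failed(self):
        return time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT

monitor = HeartbeatMonitor()
monitor.record_heartbeat()
print(monitor.leader_suspected_failed())   # False: a heartbeat just arrived
# If leader_suspected_failed() later returns True, a new election is started.
```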

6. Fencing

In the leader-follower model, when the leader fails it is not always possible to determine with certainty that it has actually stopped working. For example, a slow network or a network partition may trigger a new leader election even though the previous leader is still running and still considers itself the active leader.


Fencing is the process of placing a fence around the previously active leader so that it cannot access cluster resources, thereby stopping it from serving any read/write requests.


The following two techniques are used:

  • Resource fencing: The system blocks the previously active leader from accessing the resources it needs to perform its essential tasks (a rough sketch follows this list).

  • Node fencing: The system blocks the previously active leader from accessing all resources. A common way to do this is to power off or reset the node.
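
As a rough software-level illustration of resource fencing (node fencing, i.e. powering off or resetting the machine, cannot be shown in code), the shared resource below keeps a set of fenced leader ids and refuses their requests; all names are hypothetical.

```python
# Resource fencing sketch: requests from a fenced (deposed) leader are refused.
class SharedResource:
    def __init__(self):
        self.fenced_leaders = set()

    def fence(self, leader_id):
        # Place a "fence" around the deposed leader: its requests are refused.
        self.fenced_leaders.add(leader_id)

    def write(self, leader_id, data):
        if leader_id in self.fenced_leaders:
            return False                        # request from a fenced leader
        print(f"accepted write from {leader_id}: {data!r}")
        return True

store = SharedResource()
store.write("leader-1", "x=1")          # accepted
store.fence("leader-1")                 # a new leader was elected; fence the old one
print(store.write("leader-1", "x=2"))   # False: the old leader is cut off
```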

7. WAL (Write-Ahead Log)

Write-ahead logging is a well-established solution to the problem of file-system inconsistency in operating systems. Inspired by database management systems, this method first writes a summary of the operations to be performed to a "log" and only then applies them physically on disk. After a crash, the system simply consults the log and continues from the point of interruption.
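
A minimal write-ahead-log sketch: each change is appended to a log file and flushed before it is applied, and on restart the log is replayed to rebuild the state. The file name and record format are illustrative.

```python
# Write-ahead log sketch: append and fsync the record before applying it.
import json
import os

LOG_PATH = "wal.log"

def log_and_apply(state, key, value):
    # 1. Append the intended operation to the log and force it to disk.
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps({"key": key, "value": value}) + "\n")
        log.flush()
        os.fsync(log.fileno())
    # 2. Only then apply the change to the in-memory state.
    state[key] = value

def recover():
    # After a crash, replay the log to reconstruct the state.
    state = {}
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as log:
            for line in log:
                record = json.loads(line)
                state[record["key"]] = record["value"]
    return state

state = recover()
log_and_apply(state, "x", 1)
print(recover())    # {'x': 1} even if the process crashed after the append
```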

8. Segmented Log

Split the log into multiple smaller files rather than a single large file to make it easier to operate on.


A single log file can grow without bound and become a performance bottleneck when it is read at startup. Older log entries are purged periodically, and it is difficult to perform purge operations on one large file.


The single log is therefore split into multiple segments, and the log rolls over to a new segment file after a specified size limit. Using log segments requires an easy way to map logical log offsets (or log sequence numbers) to segment files.
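
A small sketch of mapping a logical offset to a segment file, assuming each segment is named after the base offset of its first entry (the naming scheme Kafka uses); the offsets here are illustrative.

```python
# Map a logical log offset to the segment file that contains it.
import bisect

# Base offsets of the existing segments, kept sorted.
segment_base_offsets = [0, 1000, 2000, 3000]

def segment_for(offset: int) -> str:
    # Find the last segment whose base offset is <= the requested offset.
    idx = bisect.bisect_right(segment_base_offsets, offset) - 1
    return f"{segment_base_offsets[idx]:020d}.log"

print(segment_for(0))       # 00000000000000000000.log
print(segment_for(2500))    # 00000000000000002000.log
# Old segments (e.g. 0 and 1000) can be deleted as whole files during cleanup.
```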

9. High-Water mark

The index of the last log entry on the leader that has been successfully replicated to a quorum of followers is called the high-water mark. The leader exposes data only up to the high-water mark.


Kafka: To handle non-repeatable reads and ensure data consistency, Kafka brokers keep track of the high-water mark, which is the largest offset of a given partition that has been replicated to all in-sync replicas. Consumers can only see messages up to the high-water mark.
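
A sketch of how a leader might compute the high-water mark from the offsets its followers have acknowledged, taking the minimum over the leader and all in-sync followers as Kafka does; the numbers are illustrative.

```python
# High-water mark sketch: the highest offset replicated to every in-sync replica.
def high_water_mark(leader_end_offset, follower_acked_offsets):
    # The leader has everything up to leader_end_offset; an entry is exposed
    # only once every in-sync follower has replicated it as well.
    return min([leader_end_offset] + follower_acked_offsets)

leader_end = 120
followers = [118, 120]                          # offsets acked by two followers
print(high_water_mark(leader_end, followers))   # 118: data beyond this is hidden
```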

10. Lease

A lease is like a lock, but it works even if the client leaves. The client requests a lease for a limited period, after which the lease expires. If the client wants to extend the lease, it can renew the lease before it expires.


The Chubby client maintains a time-limited session lease with the leader. During this time interval, the leader guarantees that the session will not be unilaterally terminated.
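
A minimal lease sketch with a fixed duration and explicit renewal; durations and names are illustrative.

```python
# Lease sketch: valid for a fixed duration, renewable before it expires.
import time

class Lease:
    def __init__(self, holder, duration_s=10.0):
        self.holder = holder
        self.duration_s = duration_s
        self.expires_at = time.monotonic() + duration_s

    def is_valid(self):
        return time.monotonic() < self.expires_at

    def renew(self):
        # Renewal is only allowed while the lease is still valid.
        if not self.is_valid():
            raise RuntimeError("lease already expired; acquire a new one")
        self.expires_at = time.monotonic() + self.duration_s

lease = Lease(holder="client-7", duration_s=10.0)
print(lease.is_valid())   # True
lease.renew()             # extend before expiry
# If the client disappears, the lease simply expires after duration_s.
```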

11. Gossip protocol

The Gossip protocol is a peer-to-peer communication mechanism in which nodes periodically exchange state information about themselves and other nodes that they know about.


Each node initiates one Gossip round per second to exchange state information about itself and other nodes with another random node.
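
A sketch of a single gossip round in which a node merges state maps with one random peer, keeping the entry with the higher version number for every node; the state layout is illustrative.

```python
# Gossip round sketch: both sides converge on the freshest version of each entry.
import random

def merge(state_a, state_b):
    # Keep, per node, whichever (version, status) pair is newer.
    merged = dict(state_a)
    for node, (version, status) in state_b.items():
        if node not in merged or version > merged[node][0]:
            merged[node] = (version, status)
    return merged

def gossip_round(local_state, peers):
    peer = random.choice(peers)                 # pick one random peer
    merged = merge(local_state, peer)
    peer.clear()
    peer.update(merged)
    local_state.clear()
    local_state.update(merged)

node_a = {"a": (5, "up"), "b": (2, "up")}
node_b = {"a": (3, "up"), "b": (4, "down"), "c": (1, "up")}
gossip_round(node_a, [node_b])
print(node_a == node_b)    # True: both now know the freshest state of a, b, c
```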

12. Phi Accrual Failure Detection

This algorithm uses historical heartbeat information to make the detection threshold adaptive. Instead of giving a binary answer about whether a server is alive, a generic accrual failure detector outputs a suspicion level for the server in question.


Cassandra uses the Phi accrual failure detector algorithm to determine the state of the nodes in the cluster.
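
A simplified phi calculation, assuming heartbeat inter-arrival times are roughly normally distributed: phi is the negative log of the probability that a heartbeat would arrive this late. Cassandra's implementation differs in the details; the numbers below are illustrative.

```python
# Simplified phi accrual: suspicion grows with the time since the last heartbeat.
import math
import statistics

def phi(time_since_last, intervals):
    mean = statistics.mean(intervals)
    stdev = statistics.stdev(intervals) or 1e-6
    # Probability that a heartbeat arrives later than time_since_last.
    p_later = 0.5 * math.erfc((time_since_last - mean) / (stdev * math.sqrt(2)))
    p_later = max(p_later, 1e-12)               # avoid log10(0)
    return -math.log10(p_later)                 # suspicion level, not a yes/no

history = [1.0, 1.1, 0.9, 1.0, 1.2]             # past heartbeat intervals (s)
print(round(phi(1.0, history), 2))              # low phi: nothing suspicious
print(round(phi(5.0, history), 2))              # high phi: node likely down
```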

13. Split Brain

A scenario in which a distributed system has two or more active leaders is called split brain.


The split-brain problem can be solved by using a generation clock, which is simply a monotonically increasing number that indicates the generation of the server.


The generation number is incremented each time a new leader is elected: if the old leader had generation "1", the new leader has generation "2". This generation number is included in every request the leader sends to the other nodes. Nodes can therefore easily identify the real leader by trusting only the leader with the highest generation.
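
A sketch of the follower-side check, in which requests carrying a generation lower than the highest one seen so far are rejected; names are illustrative.

```python
# Generation clock sketch: followers reject requests from a stale-generation leader.
class Follower:
    def __init__(self):
        self.current_generation = 0

    def handle_request(self, leader_generation, payload):
        if leader_generation < self.current_generation:
            return "rejected: stale leader"
        self.current_generation = leader_generation
        return f"accepted {payload!r} from generation {leader_generation}"

follower = Follower()
print(follower.handle_request(1, "write-a"))   # accepted (leader of generation 1)
print(follower.handle_request(2, "write-b"))   # accepted (new leader, generation 2)
print(follower.handle_request(1, "write-c"))   # rejected: stale leader
```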


Kafka: To handle split brain (where multiple brokers might believe they are the active controller), Kafka uses an "epoch number", which is simply a monotonically increasing number representing the generation of the server.


HDFS: ZooKeeper is used to ensure that only one NameNode is active at any given time. An epoch number is maintained as part of every transaction ID to reflect the NameNode generation.

14. Checksum

In a distributed system, data may become corrupted while it is being moved between components.


To detect such corruption, calculate a checksum and store it with the data.


To calculate the checksum, use a cryptographic hash function such as MD5, SHA-1, SHA-256, or SHA-512. The hash function takes the input data and generates a fixed-length string (containing letters and numbers); this string is called the checksum.


When the system stores some data, it calculates the checksum of the data and stores the checksum with the data. When the client retrieves the data, it verifies that the data received from the server matches the stored checksum. If not, then the client can choose to retrieve that data from another copy.


HDFS and Chubby store the checksum of each file with the data.
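
A minimal checksum sketch using SHA-256: the checksum is stored with the data and re-verified on read; names are illustrative.

```python
# Checksum sketch: store the SHA-256 digest with the data and verify on read.
import hashlib

def store(data: bytes):
    checksum = hashlib.sha256(data).hexdigest()
    return {"data": data, "checksum": checksum}

def retrieve(record):
    if hashlib.sha256(record["data"]).hexdigest() != record["checksum"]:
        raise IOError("corruption detected; read from another replica")
    return record["data"]

record = store(b"some block of data")
print(retrieve(record))                 # passes verification
record["data"] = b"corrupted bytes"     # simulate corruption in transit
# retrieve(record) would now raise IOError
```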

15. CAP Theorem

The CAP theorem states that a distributed system cannot simultaneously provide all three of the following desirable properties:


Consistency (C), Availability (A), and Partition Tolerance (P).


According to the CAP theorem, any distributed system needs to choose two of the three properties. These three options are CA, CP, and AP.


Dynamo: In CAP theorem terminology, Dynamo belongs to the class of AP systems designed to achieve high availability at the expense of strong consistency.


BigTable: In CAP theorem terminology, BigTable is a CP system, i.e., it has strongly consistent reads and writes.

16. PACELC Theorem

The PACELC theorem states that in a system that replicates data:

  • If there is a partition ('P'), the distributed system can make a tradeoff between availability and consistency (i.e., 'A' and 'C');

  • Otherwise ('E', else), when the system is operating normally without partitions, it can trade off between latency ('L') and consistency ('C').

The first part of the theorem (PAC) is the same as the CAP theorem; ELC is the extension. The whole argument assumes that we maintain high availability through replication. Thus, when a partition occurs, the CAP theorem prevails. If there is no partition, we must still consider the trade-off between the consistency and the latency of the replicated system.

17. Hinted Handoff

If a node goes down, the system keeps hints (or notes) of all the requests it missed. When the failed node recovers, the missed requests are forwarded to it based on the stored hints.


When a node goes down, the leader writes a hint to a file on its local disk. The hint contains the data along with information about which node it belongs to. When the leader notices that a node for which it is holding hints has recovered, it forwards a write request for each hint to that node.
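
A sketch of hinted handoff in which missed writes are kept as hints keyed by the unavailable node and replayed when it recovers; the data layout is illustrative.

```python
# Hinted handoff sketch: remember missed writes and replay them on recovery.
from collections import defaultdict

hints = defaultdict(list)            # target node -> list of missed writes

def write(replicas, key, value):
    for node, status in replicas.items():
        if status["up"]:
            status["data"][key] = value
        else:
            hints[node].append((key, value))     # remember the missed write

def node_recovered(replicas, node):
    replicas[node]["up"] = True
    for key, value in hints.pop(node, []):       # replay the stored hints
        replicas[node]["data"][key] = value

replicas = {"n1": {"up": True, "data": {}}, "n2": {"up": False, "data": {}}}
write(replicas, "k", "v")            # n2 misses the write; a hint is stored
node_recovered(replicas, "n2")
print(replicas["n2"]["data"])        # {'k': 'v'}
```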

18. Read Repair

In a distributed system where data is replicated across multiple nodes, some nodes may end up with outdated data.


Stale data is repaired during the read operation: since a read fetches data from multiple nodes, their versions can be compared to find the nodes holding stale data. This mechanism is called read repair. Once a node with stale data is identified, the read-repair operation pushes the newer version of the data to it.


Cassandra and Dynamo use "read repair" to push the latest version of data to the node with the older version.
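
A sketch of read repair using per-value timestamps: the read returns the newest version and pushes it back to any replica that returned an older one; the version scheme is illustrative.

```python
# Read repair sketch: compare replica versions and push back the newest one.
def read_with_repair(replicas, key):
    versions = [r["data"].get(key, (None, -1)) for r in replicas]
    latest_value, latest_ts = max(versions, key=lambda v: v[1])
    for replica, (value, ts) in zip(replicas, versions):
        if ts < latest_ts:                      # stale replica: repair it
            replica["data"][key] = (latest_value, latest_ts)
    return latest_value

replicas = [
    {"data": {"k": ("old", 1)}},
    {"data": {"k": ("new", 2)}},
    {"data": {}},                               # missing the key entirely
]
print(read_with_repair(replicas, "k"))          # 'new'
print(replicas[0]["data"]["k"])                 # ('new', 2): repaired
```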

19. Merkle Trees

Read repair removes conflicts while a read request is being processed. However, if a replica lags significantly behind the others, it may take a very long time to resolve all the conflicts.


Replicas can hold a large amount of data. Naively computing a checksum over the entire key range and exchanging the data for comparison is not practical; there is simply too much data to transfer. Instead, we can use a Merkle tree to compare the replicas of a range.


A Merkle tree is a binary hash tree in which each internal node is the hash of its two children and each leaf node is the hash of a portion of the original data.

Comparing two Merkle trees is conceptually simple:
  • Compare the root hash of two trees.

  • If they are equal, stop.

  • Otherwise, recursively compare the left and right children (see the sketch at the end of this section).

To implement anti-entropy and resolve conflicts in the background, Dynamo uses Merkle trees.
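
A sketch of building and comparing Merkle trees over a key range (assuming a power-of-two number of data blocks): leaves are hashes of blocks, internal nodes are hashes of their two children, and only subtrees whose hashes differ are examined further. Names are illustrative.

```python
# Merkle tree sketch: find the data blocks that differ between two replicas.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(blocks):
    # Assumes a power-of-two number of blocks so that every level pairs evenly.
    level = [h(b) for b in blocks]                    # leaf hashes
    tree = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree                                       # tree[-1][0] is the root

def differing_leaves(tree_a, tree_b):
    # Compare the roots first; recurse only into subtrees whose hashes differ.
    def recurse(depth, index):
        if tree_a[depth][index] == tree_b[depth][index]:
            return []
        if depth == 0:
            return [index]                            # this data block differs
        return recurse(depth - 1, 2 * index) + recurse(depth - 1, 2 * index + 1)
    return recurse(len(tree_a) - 1, 0)

blocks_a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
blocks_b = [b"k1=v1", b"k2=XX", b"k3=v3", b"k4=v4"]
print(differing_leaves(build_tree(blocks_a), build_tree(blocks_b)))   # [1]
```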