• [Basic concepts of rate limiting]

  • [Common rate limiting algorithms]

  • [Commonly used rate limiting schemes]

  • [Consider rate limiting design from the architectural dimension]

  • [Specific means of implementing rate limiting]




[Basic concepts of rate limiting]

A typical rate limiting scenario involves two dimensions of information.

Time: the limit applies within a certain time range or at a certain point in time, often referred to as a "time window", such as a per-minute or per-second window.

Resources: the limit is based on available resources, such as a maximum number of requests or a maximum number of available connections.

Combining these two dimensions, rate limiting means restricting access to a resource within a given time window, for example allowing at most 100 requests per second. In real scenarios, however, we rarely set just a single rule; instead, multiple rate limiting rules work together. The main types of rules are as follows.

[QPS and connection count control]

For connection count and QPS limits, we can set the rule per IP or per server. In a real-world environment it is common to combine rules across several dimensions, for example limiting each IP to fewer than 10 requests per second and fewer than 5 connections, while also capping each machine at a QPS of 1000 and a maximum of 200 connections. All of these rules then work together for traffic control.

[Transmission Rate]

We are all familiar with "transfer rate", such as the download speed of a resource. Some websites apply more fine-grained rate limiting logic here; for example, ordinary registered users get a download speed of 100 KB/s, which rises to 10 MB/s after purchasing a membership. This is rate limiting based on user groups or user tags.

[Black and white lists]

Black and white lists are a very common means of restricting and admitting traffic in large enterprise applications, and they often change dynamically. For example, if an IP accesses the system too frequently over a period of time and is identified as a bot or a traffic attack, that IP is added to the blacklist and its access to system resources is restricted, which is commonly known as "blocking an IP". The crawlers we often see, such as those scraping pictures from Zhihu or stock tick data from brokerage systems, must all implement IP rotation to avoid being blacklisted. Sometimes we also find that a company's network cannot access large public websites such as 12306; this is because the company's outbound traffic shares a single IP address, so under heavy access that IP gets flagged by the other system and added to its blacklist. Anyone using home broadband will know that most network operators assign users to different outbound IP segments, or dynamically change a user's IP address from time to time.

The whitelist is easier to understand: it is the equivalent of carrying an imperial gold medallion, allowing its holder to pass freely through all the rate limiting rules, unhindered. For example, some e-commerce companies add the accounts of very large sellers to the whitelist, because such sellers often have their own operations systems that need to interface with the company's IT systems to perform large volumes of product listing, restocking and other operations.

[Distributed Environment]

Distributed rate limiting differs from single-machine rate limiting in that it treats all servers in the distributed environment as a whole. Take per-IP limiting as an example: if we allow one IP at most 10 requests per second, then no matter which machine in the cluster a request from that IP lands on, it is subject to the same rule. It is therefore better to store the rate limiting state in a "centralized" component that can see the access status of all machines in the cluster. There are two mainstream schemes.

  • Gateway-level rate limiting: apply the rate limiting rules at the entry point of all traffic

  • Middleware rate limiting: store the rate limiting state in a piece of middleware (e.g. a Redis cache) in the distributed environment; each component reads the current traffic statistics from it and decides whether to reject the request or let it through

  • Sentinel, a component in the Spring Cloud Alibaba ecosystem, provides distributed rate limiting, circuit breaking, degradation and so on


[Common rate limiting algorithms]

[Token Bucket Algorithm]

The token bucket algorithm is the most widely used rate limiting algorithm. As the name implies, it has two key roles.

  • Token: a request that obtains a token will be processed; other requests are either queued or discarded directly

  • Bucket: holds the tokens; all requests obtain their tokens from this bucket

The algorithm involves two main processes: token generation and token acquisition.


Token Generation

This process involves a token generator and a token bucket. As mentioned earlier, the token bucket is the place where tokens are held; since it is a bucket it must have a capacity, which means the number of tokens it can hold is a fixed value. The token generator adds tokens to the bucket at a predefined rate, for example 100 tokens per second or 50 per minute. Note that issuance is uniform: the 50 tokens are not dumped into the bucket at the start of each time window, but are issued at a steady rate throughout the window. Think of the token generator as a faucet above the bucket: if the bucket below is already full, the extra water (tokens) simply overflows. Likewise, during token issuance the capacity of the bucket is limited, and if it is currently filled to its rated capacity, newly generated tokens are discarded.

Token acquisition

When each access request arrives, it must obtain a token before the subsequent logic can execute. If tokens are scarce and requests are plentiful, some requests will naturally fail to get a token, so we can set up a "buffer queue" to hold the excess requests. The buffer queue is optional, and not every application of the token bucket algorithm implements it. When a buffer queue exists, requests that cannot get a token right away are placed in the queue until a new token is generated, at which point a request is pulled from the head of the queue and matched with the token. Once the queue is full, the overflowing requests are discarded. In practice we can also attach extra behaviour to this queue, such as setting a time-to-live for queued requests, or turning the queue into a PriorityQueue that orders requests by some priority instead of first-in-first-out.
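
To make the two processes concrete, here is a minimal single-machine sketch of a token bucket in Java. The class and field names are illustrative and not taken from any particular library; it simply refills tokens at a uniform rate and lets each request try to take one.

```java
/**
 * A minimal single-machine token bucket sketch; class and field names are
 * illustrative and not taken from any particular library.
 */
public class TokenBucket {
    private final long capacity;       // maximum number of tokens the bucket can hold
    private final double refillPerMs;  // tokens added per millisecond
    private double tokens;             // tokens currently in the bucket
    private long lastRefillTime;       // last time the bucket was topped up

    public TokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerMs = tokensPerSecond / 1000.0;
        this.tokens = capacity;
        this.lastRefillTime = System.currentTimeMillis();
    }

    /** Try to take one token; false means the request should be queued or rejected. */
    public synchronized boolean tryAcquire() {
        refill();
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    /** Add tokens at a uniform rate; anything beyond the capacity overflows and is lost. */
    private void refill() {
        long now = System.currentTimeMillis();
        tokens = Math.min(capacity, tokens + (now - lastRefillTime) * refillPerMs);
        lastRefillTime = now;
    }
}
```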

[Leaky Bucket Algorithm]

Leaky Bucket is, as the name suggests, another bucket-based rate limiting algorithm. So how does it differ from the token bucket? The first half of the leaky bucket algorithm is similar to the token bucket, but the object being put into the bucket is different: the token bucket puts tokens into the bucket, while the leaky bucket puts the packets of the access requests themselves into the bucket. Likewise, if the bucket is full, newly arriving packets are discarded. The second half of the algorithm is its distinctive part: packets only ever flow out of the bucket at a constant rate. For example, if I set the leaky bucket to hold 100 packets with an outflow rate of one per second, then no matter how fast packets flow into the bucket or how many packets are in it, the leaky bucket guarantees that packets are always processed at the constant rate of one per second.
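
Below is a minimal sketch of this idea in Java, under the assumption that "processing a packet" just means running a task; the bounded queue plays the role of the bucket and a scheduler drains it at a fixed rate. Names are illustrative.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * A minimal leaky bucket sketch: incoming requests are placed in the bucket
 * (a bounded queue) and drained at a constant rate; names are illustrative.
 */
public class LeakyBucket {
    private final BlockingQueue<Runnable> bucket;
    private final ScheduledExecutorService drainer = Executors.newSingleThreadScheduledExecutor();

    public LeakyBucket(int capacity, long drainIntervalMs) {
        this.bucket = new ArrayBlockingQueue<>(capacity);
        // Process exactly one queued request per interval, regardless of the inflow rate.
        drainer.scheduleAtFixedRate(() -> {
            Runnable request = bucket.poll();
            if (request != null) {
                request.run();
            }
        }, drainIntervalMs, drainIntervalMs, TimeUnit.MILLISECONDS);
    }

    /** Returns false when the bucket is full and the request has to be discarded. */
    public boolean offer(Runnable request) {
        return bucket.offer(request);
    }
}
```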

Differences between the leaky bucket and the token bucket

Comparing their characteristics, it is easy to see that each algorithm has one "constant" rate and one "variable" rate. The token bucket creates tokens at a constant rate, but the rate at which requests obtain tokens is variable: as long as there are tokens they are handed out, and when they run out, requests simply wait. The leaky bucket, on the other hand, processes requests at a constant rate, while the rate at which requests flow into the bucket is variable. These characteristics mean that the leaky bucket by nature never passes bursts of traffic downstream: even if 1000 requests arrive in one second, the rate at which they reach the backend service stays constant. The token bucket is different, because it can "pre-store" a certain number of tokens, and when burst traffic arrives it can consume all of those tokens in a short time.

[Sliding Window]

For example, suppose the window spans seconds 0 to 5, divided into one-second grids, with 5 user visits in the 1st second and 10 user visits in the 5th second; the total number of visits in this window is then 15. If the interface caps the window at 20 visits, then when time reaches the 6th second the grid for the 1st second slides out of the window, the window's count drops to 10, and the number of visits that can still be accepted in the 6th second is 20 - 10 = 10. The sliding window is essentially a counter-based algorithm, and its notable property is that the longer the span of the time window, the smoother the rate limiting effect. For example, if the window is only two seconds long and all requests are concentrated in the first second, then when the window slides forward by one second the count in the current window changes dramatically; lengthening the window reduces the likelihood of this happening.
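
A minimal in-memory sketch of the idea in Java is shown below; it records the timestamps of recent requests and counts only those still inside the window. Class and field names are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * A minimal sliding window sketch: remember the timestamps of recent requests
 * and count only those still inside the window; names are illustrative.
 */
public class SlidingWindowLimiter {
    private final long windowMs;    // span of the window, e.g. 5000 ms
    private final int maxRequests;  // upper limit of visits inside the window
    private final Deque<Long> timestamps = new ArrayDeque<>();

    public SlidingWindowLimiter(long windowMs, int maxRequests) {
        this.windowMs = windowMs;
        this.maxRequests = maxRequests;
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // Evict requests whose grid has slid out of the current window.
        while (!timestamps.isEmpty() && now - timestamps.peekFirst() >= windowMs) {
            timestamps.pollFirst();
        }
        if (timestamps.size() < maxRequests) {
            timestamps.addLast(now);
            return true;
        }
        return false;
    }
}
```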

[Commonly used rate limiting schemes]

[Legitimacy verification]

This includes verification codes (CAPTCHAs), IP blacklists and so on, which are effective means of preventing malicious attacks and crawler harvesting.

[Guava rate limiting]

In the field of rate limiting, Guava provides several supporting classes under its concurrency module, headed by RateLimiter, but their scope is limited to the "current" server. In other words, Guava's rate limiting is single-machine rate limiting; it cannot help across machines or JVM processes. Suppose I have two servers, [Server 1, Server 2], each running a login service, and I want to keep the total number of accesses across the two machines within 20 per second. With Guava alone I can only control each machine independently to at most 10 accesses per second. Although Guava is not a solution for distributed rate limiting, it is a simple and lightweight client-side component that is ideal for explaining rate limiting algorithms.
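
A small usage sketch with Guava's RateLimiter is shown below; the limit of 10 permits per second mirrors the per-machine figure in the example above, and the class name is illustrative.

```java
import com.google.common.util.concurrent.RateLimiter;

/** Single-machine rate limiting with Guava's RateLimiter; the class name is illustrative. */
public class GuavaLimiterDemo {
    // Issue at most 10 permits per second on this JVM only.
    private static final RateLimiter LIMITER = RateLimiter.create(10.0);

    public static void main(String[] args) {
        for (int i = 0; i < 20; i++) {
            if (LIMITER.tryAcquire()) {   // non-blocking: false when no permit is available right now
                System.out.println("request " + i + " handled");
            } else {
                System.out.println("request " + i + " rejected");
            }
        }
    }
}
```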

[Rate limiting at the gateway level]

The service gateway, as the first gate on the whole distributed link, takes over all user access requests, so rate limiting at the gateway level is a good entry point. The top-to-bottom path is, in order:

1. User traffic is forwarded from the gateway level to the backend service

2. The backend service takes the traffic and calls the cache for data

3. If there is no data in the cache, the database is accessed

Traffic decreases from top to bottom: the densest user requests gather at the gateway layer, followed by the backend services. After the backend services' validation logic has filtered out some of the invalid requests, the remaining requests fall on the cache, and only when the cache has no data is the database at the bottom of the funnel queried, so the number of requests reaching the database level is the smallest (compared with other components, the database is usually the weakest in terms of concurrency; even Alibaba's heavily modified MySQL cannot match the single-machine concurrency of components such as Redis or Kafka). The current mainstream gateway layer is represented by software such as Nginx and by gateway components such as Spring Cloud Gateway and Zuul.

Nginx rate limiting

In a system architecture, proxying and route forwarding are among Nginx's most important functions as a gateway layer. Thanks to its inherent lightness and excellent design, Nginx is the first choice of many companies. Considered at the gateway level, Nginx can stand at the very front and fend off most of the network traffic, so using Nginx for rate limiting is also a good choice. Nginx provides two methods of limiting traffic: one controls the rate, and the other controls the number of concurrent connections.

To control the rate we use limit_req_zone to limit the number of requests per unit of time, i.e. the rate. Because Nginx's limit statistics are millisecond-based, setting the rate to 2r/s translates into allowing only one request from a single IP within every 500 milliseconds, with the second request only allowed from the 501st millisecond onward.
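
A typical configuration for this rate control looks roughly as follows; the zone name mylimit, the 10m zone size and the /api/ location are illustrative values.

```nginx
http {
    # Track requests per client IP and allow 2 requests per second per IP.
    limit_req_zone $binary_remote_addr zone=mylimit:10m rate=2r/s;

    server {
        location /api/ {
            limit_req zone=mylimit;   # excess requests are rejected (HTTP 503 by default)
        }
    }
}
```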


This rate control is very precise but too harsh for a production environment. In practice we want to control the total number of visits per IP over the whole unit of time, not slice it down to the millisecond as above. We can use the burst keyword for this: burst=4 means that each IP is allowed up to 4 burst requests beyond the rate, which are queued instead of being rejected immediately.
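
A sketch of the same zone with a burst queue is shown below; adding the optional nodelay parameter would process queued burst requests immediately instead of pacing them out.

```nginx
location /api/ {
    # Allow up to 4 requests per IP beyond the 2r/s rate; they are queued rather than rejected.
    limit_req zone=mylimit burst=4;
}
```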


To control the number of concurrent connections, use the limit_conn_zone and limit_conn directives, where limit_conn perip 10 means that a single IP may hold at most 10 connections at the same time, and limit_conn perserver 100 means that the server handles at most 100 concurrent connections in total.

Note: a connection is counted only after Nginx has finished reading the request header.
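
The corresponding configuration might look like the following; perip and perserver are simply the zone names chosen here.

```nginx
http {
    limit_conn_zone $binary_remote_addr zone=perip:10m;   # key: client IP
    limit_conn_zone $server_name zone=perserver:10m;      # key: virtual server name

    server {
        limit_conn perip 10;       # at most 10 concurrent connections per client IP
        limit_conn perserver 100;  # at most 100 concurrent connections for the whole server
    }
}
```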

Middleware rate limiting

For a distributed environment, what we need is simply a central place to store the rate limiting data. Say I want to cap an interface at 100 requests per second: then the number of requests already received in the current second has to be stored somewhere that every node in the cluster can reach. What technology can store this kind of transient data? Redis, of course. Using Redis's key expiration feature, we can easily define the time span of the limit (for example 10 requests per second, or 10 requests every 10 seconds). Redis also has a special skill, scripting, which lets us push the rate limiting logic into Redis itself, completely relieving the service layer of that burden, while Redis's strong concurrency and highly available cluster architecture can support rate limited access for large clusters.
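
A minimal fixed-window counter on Redis, written with the Jedis client, could look like the sketch below. The key prefix and limits are illustrative; note that INCR and EXPIRE are issued as two separate commands here, which is exactly the gap the Lua approach in the next section closes.

```java
import redis.clients.jedis.Jedis;

/**
 * A minimal fixed-window counter on Redis using the Jedis client; the key
 * prefix and limits are illustrative.
 */
public class RedisCounterLimiter {
    private static final int LIMIT = 100;          // at most 100 requests
    private static final int WINDOW_SECONDS = 1;   // per one-second window

    public static boolean allow(Jedis jedis, String resource) {
        String key = "rate:" + resource;
        long count = jedis.incr(key);
        if (count == 1) {
            // First request in this window: start the window's expiration countdown.
            jedis.expire(key, WINDOW_SECONDS);
        }
        return count <= LIMIT;
    }
}
```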

[Redis + Lua rate limiting components]
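
As a sketch of the Redis + Lua approach, the INCR and EXPIRE pair can be wrapped in a small Lua script and executed atomically via EVAL; the script, key and parameters below are illustrative.

```java
import java.util.Collections;
import redis.clients.jedis.Jedis;

/**
 * A sketch of pushing the counting logic into Redis with a Lua script, so that
 * INCR and EXPIRE execute atomically on the Redis side; script, key and
 * parameters are illustrative.
 */
public class RedisLuaLimiter {
    private static final String SCRIPT =
        "local current = redis.call('incr', KEYS[1]) " +
        "if current == 1 then redis.call('expire', KEYS[1], ARGV[1]) end " +
        "return current";

    /** Returns true when the request is still within the limit for the current window. */
    public static boolean allow(Jedis jedis, String key, int windowSeconds, int limit) {
        Object count = jedis.eval(SCRIPT,
                Collections.singletonList(key),
                Collections.singletonList(String.valueOf(windowSeconds)));
        return (Long) count <= limit;
    }
}
```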

In addition to the above, there are open source components that provide similar functionality, such as Sentinel, an open source component produced by Alibaba and included in the Spring Cloud Alibaba component library. Sentinel offers a rich rate limiting API and a visual control panel, which makes traffic governance much easier.

[Consider rate limiting design from the architectural dimension]

In real projects, more than one rate limiting technique is usually used; several approaches are combined so that the rate limiting strategy has layers and resources are used as fully as possible. The design of the strategy can also follow the funnel model mentioned earlier, wide at the top and tight at the bottom, and at each level of the funnel the scheme should focus as much as possible on the high availability of the components at that level. Take a project I actually worked on as an example: we developed a product detail page interface whose traffic came in through the Taobao mobile app. Requests from the app first pass through Alibaba's MTOP gateway, where we apply relatively relaxed rate limiting; once a request has passed the gateway and reached the backend product detail service, a series of middleware and rate limiting components apply finer-grained traffic control to the service.

[Specific means of implementing rate limiting]

1. Tomcat uses maxThreads to implement rate limiting.

2. Nginx uses limit_req_zone and burst to implement rate limiting.

3. Nginx's limit_conn_zone and limit_conn directives control the total number of concurrent connections.

4. The time window algorithm can be implemented with Redis sorted sets.

5. The leaky bucket algorithm can be implemented with Redis-Cell.

6. The token bucket algorithm can be implemented with Google's Guava package.

It is important to note that the Redis-based solutions can be used in distributed systems, while the Guava-based solution only works in a standalone environment. If server-side rate limiting feels troublesome, you can use container-level rate limiting (Nginx or Tomcat) directly without changing any code, provided it meets your project's business requirements.

[Tomcat rate limiting]

In Tomcat 8.5, the maximum number of threads is configured in conf/server.xml. maxThreads is Tomcat's maximum number of worker threads; when the concurrency of incoming requests exceeds this value, the extra requests are queued for execution, which achieves the purpose of rate limiting, as in the snippet below.
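
A typical Connector element in conf/server.xml might look like this; apart from maxThreads, the other attributes shown are just Tomcat's usual example values.

```xml
<!-- conf/server.xml: cap the HTTP connector at 150 worker threads -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           maxThreads="150" />
```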

The value of maxThreads can be increased; the default for Tomcat 8.5 is 150. However, bigger is not necessarily better; the appropriate value depends on the server's own configuration. Note that every thread opened costs roughly 1 MB of JVM memory for its thread stack, and the more threads there are, the heavier the GC burden.

Finally, note that the operating system has a limit on the number of threads in a process; Windows does not allow more than 2000 threads per process, and Linux does not allow more than 1000 threads per process.