I've been talking with a friend of mine about interview questions. One question keeps popping up: what are the ways that we could protect a service against excessive traffic?
By "a service" we mean a typical microservice that serves user requests. A service usually runs on a set of independent and interchangeable servers.
Services should be aware of their own capacity, i.e. how much traffic they could handle. Higher load can cause average latency to go up through context switching costs and CPU starvation. The basic idea is that a service should reject incoming traffic when capacity is reached, but before the server melts down. A straightforward approach is to count queries served in a rolling time window, and drop traffic if that count is too high.
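To make the rolling-window idea concrete, here is a minimal sketch in Python. The class name, the parameters and the injectable `clock` are assumptions made for illustration; a real service would also need locking and a capacity number derived from load tests.

```python
import time
from collections import deque


class RollingWindowLimiter:
    """Drops requests once too many were served in the last `window_seconds`."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock      # injectable so the behaviour can be tested
        self.served = deque()   # timestamps of accepted requests, oldest first

    def allow(self):
        now = self.clock()
        # Forget requests that have fallen out of the rolling window.
        while self.served and now - self.served[0] >= self.window_seconds:
            self.served.popleft()
        if len(self.served) >= self.max_requests:
            return False        # at capacity: drop the request
        self.served.append(now)
        return True
```

A request handler would call `allow()` first and answer with an "overloaded" error when it returns `False`.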
Not all traffic is created equal. Some requests are more expensive ("bulk creating posts"), others are cheaper ("rendering a post"). We could assign different costs to different types of traffic. If the total cost of traffic served is too high, future traffic should be dropped.
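The same rolling window can track cost instead of a plain count. The sketch below is only an illustration: the cost table, the request type names and the default cost of 1 for unknown types are all made-up assumptions, not measured values.

```python
import time
from collections import deque

# Hypothetical per-request costs; real values would come from measurement.
REQUEST_COSTS = {"render_post": 1, "bulk_create_posts": 20}


class CostBudgetLimiter:
    """Drops requests once the cost served in the rolling window is too high."""

    def __init__(self, max_cost, window_seconds, clock=time.monotonic):
        self.max_cost = max_cost
        self.window_seconds = window_seconds
        self.clock = clock
        self.served = deque()   # (timestamp, cost) pairs, oldest first
        self.total_cost = 0

    def allow(self, request_type):
        cost = REQUEST_COSTS.get(request_type, 1)  # assumption: unknown types are cheap
        now = self.clock()
        # Return the cost of expired requests to the budget.
        while self.served and now - self.served[0][0] >= self.window_seconds:
            _, old_cost = self.served.popleft()
            self.total_cost -= old_cost
        if self.total_cost + cost > self.max_cost:
            return False
        self.served.append((now, cost))
        self.total_cost += cost
        return True
```

With a budget of 25, one bulk create leaves room for five more renders in the same window, but not for a second bulk create.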
Capacity can be more fine-grained. CPU, memory, network bandwidth, disk IO and even parallelism can each have their own capacity. Running a multi-variable budgeting algorithm in real time can be expensive, so this "multi-dimensional" approach is often used to reveal the true cost of traffic instead ("we can only serve 100 QPS of writes, otherwise we will run out of disk IOPS"). Moreover, it often makes more sense to count in-flight requests when calculating memory and parallelism usage.
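For the in-flight case, a rough sketch could look like the following. The dimension names ("memory_mb", "parallelism") and the idea that each request declares its expected usage up front are assumptions for the example.

```python
import threading


class InflightBudget:
    """Tracks resources held by in-flight requests across several dimensions."""

    def __init__(self, limits):
        self.limits = dict(limits)              # e.g. {"memory_mb": 512, "parallelism": 8}
        self.in_use = {name: 0 for name in self.limits}
        self.lock = threading.Lock()

    def try_acquire(self, usage):
        """Reserve `usage` if every dimension still fits; otherwise reserve nothing."""
        with self.lock:
            for name, amount in usage.items():
                if self.in_use[name] + amount > self.limits[name]:
                    return False
            for name, amount in usage.items():
                self.in_use[name] += amount
            return True

    def release(self, usage):
        """Give the reservation back once the request has finished."""
        with self.lock:
            for name, amount in usage.items():
                self.in_use[name] -= amount
```

A handler would call `try_acquire` before starting work and `release` in a `finally` block, so a failed request does not leak budget.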
Servers should throttle misbehaving clients. This makes sure that one bad client does not bring down the entire service and hurt other clients.
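One common way to do this is a token bucket per client. That technique is a choice made here for illustration, not something prescribed above, and the client IDs and parameters below are hypothetical.

```python
import time


class PerClientThrottle:
    """One token bucket per client, so a noisy client only exhausts its own bucket."""

    def __init__(self, rate_per_second, burst, clock=time.monotonic):
        self.rate = rate_per_second
        self.burst = burst
        self.clock = clock
        self.buckets = {}   # client_id -> (tokens, time of last update)

    def allow(self, client_id):
        now = self.clock()
        # A client seen for the first time starts with a full bucket.
        tokens, last = self.buckets.get(client_id, (self.burst, now))
        # Refill for the time elapsed, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[client_id] = (tokens, now)
            return False
        self.buckets[client_id] = (tokens - 1, now)
        return True
```

A production version would also need to evict buckets of clients that have gone away, otherwise the dictionary grows without bound.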
A service can voluntarily enter a "degraded" mode when it senses that it is under pressure. Instead of rejecting traffic, the service continues to serve as much traffic as it can, but at lower quality. It could drop consistency guarantees, serve stale (but more cacheable) data, extend cache expiration times, omit big data chunks (videos), retry less aggressively, give up on tasks that run too long, run background tasks less often, and so on.
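As a sketch, degraded mode can be a switch between two serving policies. Everything concrete here is an assumption: the policy fields, the thresholds, and the use of two thresholds (hysteresis) so the service does not flap in and out of degraded mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingPolicy:
    cache_ttl_seconds: int
    include_videos: bool
    max_retries: int


# Hypothetical settings; real values depend on the service.
NORMAL = ServingPolicy(cache_ttl_seconds=60, include_videos=True, max_retries=3)
DEGRADED = ServingPolicy(cache_ttl_seconds=600, include_videos=False, max_retries=1)


def choose_policy(load, currently_degraded, enter_at=0.8, exit_at=0.6):
    """Pick a policy from `load`, the fraction of capacity currently in use.

    Degraded mode is entered at `enter_at` and only left once load drops
    below `exit_at`.
    """
    if currently_degraded:
        return NORMAL if load < exit_at else DEGRADED
    return DEGRADED if load >= enter_at else NORMAL
```

At 70% load the answer depends on history: a service that was already degraded stays degraded, while a healthy one keeps serving normally.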
Even when actively rejecting and deflecting traffic, we are not completely safe. It costs something just to reject connections. Depending on the implementation, each connection needs additional memory to be established, a file descriptor to be allocated, and so on. We need cooperation from clients and load balancers to stop the traffic at the source.
A lot of services now require clients to respect a pre-agreed quota. Clients are expected to check quota before sending any traffic to the server. Stopping traffic from the source is a great way to avoid outages.
Individual server shutdowns can be a source of load, since fewer servers means more load on the remaining ones. A server should shut down gracefully, notifying dependencies and clients, and handling as many incoming requests as possible before giving up.
The first stage of a graceful shutdown is usually entering lame duck mode. In this mode, the server continues to serve traffic as normal, but starts telling clients and load balancers that it is going away. Traffic should gradually stop coming in, assuming clients are cooperating. Clients should instead send their requests to other servers. Lame duck mode usually lasts a defined period of time, after which the server starts rejecting traffic.
After lame duck mode ends, the server initiates the shutdown protocol. Shutting down could mean being killed instantly. It could also mean a whole procedure, such as closing database connections cleanly, scrubbing sensitive information from memory, and so on. At this stage, the server is not serving any traffic.
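The stages can be sketched as a small state machine. The class, method names and return values are hypothetical; in practice `health()` would back whatever health-check endpoint the load balancer polls.

```python
import time
from enum import Enum


class State(Enum):
    SERVING = "serving"
    LAME_DUCK = "lame_duck"
    SHUTTING_DOWN = "shutting_down"


class Server:
    def __init__(self, lame_duck_seconds, clock=time.monotonic):
        self.lame_duck_seconds = lame_duck_seconds
        self.clock = clock
        self.state = State.SERVING
        self.lame_duck_started = None

    def begin_shutdown(self):
        """Enter lame duck mode: keep serving, but announce the departure."""
        self.state = State.LAME_DUCK
        self.lame_duck_started = self.clock()

    def _refresh(self):
        # Leave lame duck mode once its fixed period has elapsed.
        if (self.state is State.LAME_DUCK
                and self.clock() - self.lame_duck_started >= self.lame_duck_seconds):
            self.state = State.SHUTTING_DOWN

    def health(self):
        """What clients and load balancers see when they poll this server."""
        self._refresh()
        return self.state.value

    def handle(self, request):
        self._refresh()
        if self.state is State.SHUTTING_DOWN:
            return "rejected"
        return "served"   # lame duck mode still serves traffic as normal
```

Cleanup work such as closing database connections would run once the state reaches `SHUTTING_DOWN`; it is left out to keep the sketch short.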
A special case of shutting down is rolling out a new version of the binary. Servers running the old binary are brought down and restarted with the new binary. A rolling restart works as follows: first update a portion of the servers while all the others continue serving traffic. Then move on to the next portion. Repeat until all servers are updated.
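The loop can be sketched in a few lines. The `restart` callback is a placeholder: it is assumed to drain a server, swap the binary and block until the server is healthy again.

```python
def rolling_restart(servers, batch_size, restart):
    """Restart `servers` one batch at a time and return the batches in order.

    `restart` is a hypothetical callback that is assumed to return only
    once the given server is back and healthy.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = []
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        for server in batch:
            restart(server)     # servers outside this batch keep serving traffic
        batches.append(batch)
    return batches
```

The batch size is the main knob: smaller batches keep more capacity online during the rollout, at the price of a slower rollout.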
Rolling restarts are critical to ensuring service continuity. They also help prevent a thundering herd of clients trying to establish connections with the newly restarted servers. The load of re-establishing connections could overwhelm a server at startup. Then all clients move on to other servers, putting even more pressure on them and gradually killing all of the servers.
Tuning production mechanisms like load protection is not an exact science. After all, nobody knows what is going to happen in real life. Luckily, rough estimates and heuristics can go a long way.