Incident Report for Spreedly
We are expanding our infrastructure, which means we have introduced a lot of new servers. Our plans for networking these servers are detailed, as is our documentation of the current state of existing systems. Until the implementation is complete, our current firewall devices must stand in for the role that the new firewalls will ultimately fill. A single detail was overlooked: the IP addresses that the new firewalls will own are currently owned by the active firewalls. When those addresses were also assigned to the new firewalls, the duplicate assignment confused the switches. Connections to the new servers timed out because the reply packets ended up at the new firewalls instead of making their way back through the firewalls in active service.
This would not have been visible in our service, except for one thing: we are using one of our new servers as an additional log aggregator. The application servers buffer logs locally, but after an extended period of being unable to reach the aggregator, the buffer filled up and rsyslog began to block the application processes (Unicorn). Once they were all waiting on rsyslog, the load balancers began replying immediately with “503 Service Unavailable”.
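The blocking behaviour described above can be illustrated with a small Python sketch. It is not our production setup: the buffer size, the `handle_request` function, and the short timeout are all assumptions made for illustration. In the incident the application processes waited indefinitely; the timeout here exists only so the sketch terminates.

```python
import queue

# Hypothetical stand-in for the local log buffer on an application server.
# Nothing ever drains it, which mimics an unreachable log aggregator.
log_buffer = queue.Queue(maxsize=3)


def handle_request(request_id, enqueue_timeout):
    """Log one line for a request, then report whether the worker got through.

    Returns "served" if the log line fit in the buffer, or "stalled" if the
    worker was still waiting on a full buffer when the timeout expired.
    """
    try:
        # With timeout=None this call would block forever once the buffer
        # is full, which is the situation the workers ended up in.
        log_buffer.put(f"request {request_id}", timeout=enqueue_timeout)
    except queue.Full:
        return "stalled"
    return "served"


results = [handle_request(i, enqueue_timeout=0.01) for i in range(5)]
print(results)
```

The first three requests fill the buffer and every request after that stalls, regardless of whether the request itself had anything to do with the aggregator.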
Only a few transactions failed, because most of the requests that tied up the application processes were GET requests, which occur in large numbers.
How We’re Going to Improve
The issue here was an unbounded queue. We’ll address that by using rsyslog’s advanced queueing options without neglecting one very important concern: certain activities must always be logged in the system. Also, we need to know when rsyslog is unable to work off its queue, so we are going to find a way to be alerted as soon as that is the case.
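As a rough sketch of the policy we have in mind, the Python below models a queue that never blocks the caller, sheds low-priority lines once it passes a threshold, keeps must-log lines by spilling them to a spool, and raises an alert when it starts backing up. The class, its parameters, and the in-memory "spool" are hypothetical and simplified. They are loosely modelled on rsyslog queue parameters such as queue.size, queue.discardMark, and queue.discardSeverity, and this is not a description of how rsyslog implements them.

```python
from collections import deque


class BoundedLogQueue:
    """Toy log queue: never blocks, sheds low-priority lines under pressure."""

    def __init__(self, capacity, discard_mark, discard_severity, on_alert):
        self.capacity = capacity                  # max lines held in memory
        self.discard_mark = discard_mark          # depth at which shedding starts
        self.discard_severity = discard_severity  # syslog severity number; this or higher is shed
        self.on_alert = on_alert                  # called once when the queue backs up
        self.memory = deque()
        self.spool = []                           # stand-in for an on-disk spool
        self.dropped = 0
        self._alerted = False

    def enqueue(self, severity, message):
        depth = len(self.memory)
        under_pressure = depth >= self.discard_mark
        if under_pressure and not self._alerted:
            self._alerted = True
            self.on_alert(depth)
        # Higher syslog severity numbers mean less important messages.
        if under_pressure and severity >= self.discard_severity:
            self.dropped += 1
            return "dropped"
        if depth >= self.capacity:
            # Must-log lines are never discarded; they overflow to the spool.
            self.spool.append((severity, message))
            return "spooled"
        self.memory.append((severity, message))
        return "queued"


alerts = []
q = BoundedLogQueue(capacity=4, discard_mark=2, discard_severity=6,
                    on_alert=alerts.append)
outcomes = [
    q.enqueue(6, "GET /status"),    # info
    q.enqueue(5, "transaction 1"),  # notice
    q.enqueue(6, "GET /status"),    # info, arrives after the discard mark
    q.enqueue(5, "transaction 2"),
    q.enqueue(5, "transaction 3"),
    q.enqueue(5, "transaction 4"),  # memory is full by now
    q.enqueue(7, "debug detail"),   # debug
]
print(outcomes, alerts, q.dropped)
```

In this run the informational and debug lines that arrive under pressure are dropped, every transaction line is kept in memory or in the spool, and the alert callback fires once when the depth first reaches the discard mark.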
If you have any questions about this incident, don’t hesitate to drop us a line at firstname.lastname@example.org.