André Figueira

Systems engineer - I write apps, I make websites, opinions are my own...

What I've learnt using RabbitMQ at high scale

This living blog post is a collection of issues and solutions I've come across using RabbitMQ in the enterprise and at scale. Hopefully it'll help a few people out!

I've been working with RabbitMQ for just over a year now on a big project at work. I can't go into too much detail about the project itself, but I will go into the technical detail of the issues I faced.

One of the contributors to the PHP amqp library has posted about common issues, and I've used that post to supplement this article. See the full post here.

Broken pipe or closed connection

This is quite a common one! This error indicates that your connection is dead, either because the TCP connection was dropped or because RabbitMQ closed your connection. The number one thing to do is find out why you're getting this issue. In my case it happened for a variety of reasons, and I had to trace and address each cause.

Cause 1 - TCP Connection being closed

The TCP connection is being closed somewhere along the network path, for example by a load balancer. Let's say your RabbitMQ instance is running on a network with a 1 Gbps switch and you manage to saturate it: you're going to find your connections being dropped, which will lead to this error.

Cause 2 - Consumer taking longer than heartbeat frame

Your consumer is taking longer to process a message than the heartbeat timeout it has agreed with RabbitMQ.

Network can fail in many ways, sometimes pretty subtle (e.g. high ratio packet loss). Disrupted TCP connections take a moderately long time (about 11 minutes with default configuration on Linux, for example) to be detected by the operating system. AMQP 0-9-1 offers a heartbeat feature to ensure that the application layer promptly finds out about disrupted connections (and also completely unresponsive peers). Heartbeats also defend against certain network equipment which may terminate "idle" TCP connections when there's no activity on them for a certain period of time.

When this happens, RabbitMQ will add a log entry for the closed connection:

2017-09-29 09:32:32.327 [warning] <0.2375.628> closing AMQP connection <0.2375.628> ( ->
missed heartbeats from client, timeout: 8s

The only real way to work around this is to increase your heartbeat so that it is longer than the longest time your consumer might take to process a message.
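To make the sizing concrete, here's a minimal Python sketch. It assumes a consumer that can't send heartbeats while it is busy processing a message. The helper names `pick_heartbeat` and `would_miss_heartbeat`, and the 1.5x safety margin, are my own illustration and not something RabbitMQ or any client library prescribes.

```python
import math


def pick_heartbeat(max_processing_seconds: float, margin: float = 1.5) -> int:
    """Return a heartbeat timeout (whole seconds) longer than the slowest job.

    Hypothetical helper: `margin` is an arbitrary safety factor, tune it
    to how much your processing time varies.
    """
    if max_processing_seconds <= 0:
        raise ValueError("max_processing_seconds must be positive")
    if margin <= 1:
        raise ValueError("margin must be greater than 1")
    return math.ceil(max_processing_seconds * margin)


def would_miss_heartbeat(processing_seconds: float, heartbeat_seconds: int) -> bool:
    """True if a blocking consumer would stay silent for the whole timeout."""
    return processing_seconds >= heartbeat_seconds
```

With the 8 second timeout from the log entry above, a job that takes 20 seconds would trip the check, while the value returned by `pick_heartbeat(20)` would not. You'd then pass that value to whichever heartbeat option your client library exposes.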

Cause 3 - High load scenarios

The RabbitMQ server is under high load or has been restarted. Pretty straightforward, so you'll want to make sure your workers are able to carry on and restart themselves without any adverse effects.
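One way a worker can ride out a restart is to retry the connection with a growing delay. Below is a rough Python sketch of that idea, not a drop-in for any particular client. `connect_with_retry` is a hypothetical helper, `connect` stands in for whatever function opens your connection, and the attempt count and delays are arbitrary defaults.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except OSError:
            # ConnectionResetError, BrokenPipeError and friends are OSError subclasses.
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller (or supervisor) decide.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

The `sleep` parameter is only there so the delays can be swapped out in tests. In a real worker you'd likely also want a process supervisor to restart it once the retries are exhausted.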

fwrite(): send of (x) bytes failed with errno=104 Connection reset by peer

This one is quite an annoying one.

Cause 1 - High load scenario

When your server is under high load, connections may be dropped, especially if you're reaching the limit on how many open ports or open connections your RabbitMQ instance is able to handle.

Cause 2 - RabbitMQ is down!

Pretty self-explanatory... fix the dead rabbit.

Cause 3 - Your consumers are taking too long

If your consumers take longer to process a message than the heartbeat you've set up, they are going to be hit with this issue when they go to ack. The reason is that your connection will already have been closed, because rabbit didn't receive a heartbeat frame in time. What this means is that you've done the work in your process but no longer have a valid connection to tell rabbit about it, so your consumer will fail and the message will be redelivered.

Unknown delivery_tag 2

This one may very well catch you out. The error is specifically:

Server ack'ed unknown delivery_tag "2"

Cause 1

This happens because you've already acknowledged the message and are trying to acknowledge it again. Double check your code and you'll find you're acking the same message twice.
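If the double ack is hard to track down, a small guard around the ack call can at least stop the second one from reaching the server while you look for it. This is only a sketch: `AckGuard` is a hypothetical wrapper, and `ack` stands in for whatever your client uses to acknowledge a delivery tag. Delivery tags are scoped to a channel, so you'd keep one guard per channel.

```python
from typing import Callable, Set


class AckGuard:
    """Wraps an ack callable so each delivery tag is acknowledged at most once."""

    def __init__(self, ack: Callable[[int], None]) -> None:
        self._ack = ack
        self._acked: Set[int] = set()

    def ack(self, delivery_tag: int) -> bool:
        """Ack the tag if it hasn't been acked yet; return whether an ack was sent."""
        if delivery_tag in self._acked:
            return False  # Second ack for the same tag: skip it.
        self._ack(delivery_tag)
        self._acked.add(delivery_tag)
        return True
```

Note that the set of seen tags grows for as long as the channel lives, so a long-running consumer would want to prune it. Logging whenever `ack` returns `False` is a cheap way to find the code path that acks twice.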

Consumer checklist

Here's a list of a few things to make sure you have configured for your consumers, to keep them running well.

Enable heartbeats