Skip to content

Classic queue processes can run into an exception (a continuation to #12367) #13758

Closed
@alxndrdude

Description

@alxndrdude

Describe the bug

We do run a RMQ node of v4.0.7 (deployed via a docker container).

We do have a classic durable queue with queue TTL and message TTL, a nearly continuous work-load of approx 200 msg per sec, which is consumed by one consumer and produced by multiple producers (routed via a topic exchange).

Currently, we do observer in the RMQ log error events "Restarting crashed queue ...".
This happens approx 4-8 times a day.
After such a crash our client hangs and stops consuming, and seems NOT to recognize, that the queue had crashed and be recovered (as long as the client hangs, messages are queued in the queue).
After restarting the client, all works as expected again.

Note:

  • For a long time, we ran on RMQ v3.12.x, then upgraded first to v3.13.7 and then to v4.0.7.
  • Running v3.12.x, no problems seem to occur, in RMQ v3.13.7 we first saw the problem. After upgrading to v4.0.7 the errors still remain.
  • The same messages are routed to other queues, however, we do not observe the error for those.

I would appreciate any help, advice or Bug fix.
Thanks.

Reproduction steps

Unfortunately, we are not able to reproduce this bug in a sandbox or test environment.

An extraction of the prod error log:

Logs

2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1> ** Generic server <0.100434552.1> terminating
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1> ** Last message in was {'$gen_cast',{notify_sent,<0.100460962.1>,50}}
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1> ** When Server state == {q,
...
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1> ** Reason for termination ==
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1> ** {function_clause,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>        [{rabbit_msg_store,reader_pread_parse,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [[eof]],
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_msg_store.erl"},{line,672}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_msg_store,reader_pread,2,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_msg_store.erl"},{line,668}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_msg_store,read_many_file2,4,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_msg_store.erl"},{line,536}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_variable_queue,maybe_deltas_to_betas,4,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_variable_queue.erl"},{line,2346}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_variable_queue,fetch_from_q3,1,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_variable_queue.erl"},{line,2238}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_variable_queue,queue_out,1,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_variable_queue.erl"},{line,1260}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_variable_queue,collect_by_predicate,3,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_variable_queue.erl"},{line,1557}]},
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>         {rabbit_variable_queue,remove_by_predicate,2,
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1>             [{file,"rabbit_variable_queue.erl"},{line,1495}]}]}
2025-04-15 18:43:01.739405+00:00 [error] <0.100434552.1> 
...
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>   crasher:
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     initial call: rabbit_amqqueue_process:init/1
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     pid: <0.100434552.1>
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     registered_name: []
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     exception exit: {function_clause,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                         [{rabbit_msg_store,reader_pread_parse,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [[eof]],
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_msg_store.erl"},{line,672}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_msg_store,reader_pread,2,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_msg_store.erl"},{line,668}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_msg_store,read_many_file2,4,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_msg_store.erl"},{line,536}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_variable_queue,maybe_deltas_to_betas,4,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_variable_queue.erl"},{line,2346}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_variable_queue,fetch_from_q3,1,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_variable_queue.erl"},{line,2238}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_variable_queue,queue_out,1,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_variable_queue.erl"},{line,1260}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_variable_queue,collect_by_predicate,3,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_variable_queue.erl"},{line,1557}]},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                          {rabbit_variable_queue,remove_by_predicate,2,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                              [{file,"rabbit_variable_queue.erl"},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                               {line,1495}]}]}
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>       in function  gen_server2:terminate/3 (gen_server2.erl, line 1172)
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     ancestors: [<0.99260568.1>,<0.3066.0>,<0.661.0>,<0.660.0>,
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                   rabbit_vhost_sup_sup,rabbit_sup,<0.216.0>]
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     message_queue_len: 172
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     messages: [{'$gen_cast',{notify_sent,<0.100460962.1>,50}},
...
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                   {{credit_to,<0.83484211.1>},177},
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>                   {{credit_to,<0.98925663.1>},96}]
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     trap_exit: true
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     status: running
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     heap_size: 3581853
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     stack_size: 29
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>     reductions: 455662007
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1>   neighbours:
2025-04-15 18:43:03.264088+00:00 [error] <0.100434552.1> 
...
...
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>     supervisor: {<0.99260568.1>,rabbit_amqqueue_sup}
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>     errorContext: child_terminated
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>     reason: {function_clause,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                 [{rabbit_msg_store,reader_pread_parse,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [[eof]],
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_msg_store.erl"},{line,672}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_msg_store,reader_pread,2,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_msg_store.erl"},{line,668}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_msg_store,read_many_file2,4,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_msg_store.erl"},{line,536}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_variable_queue,maybe_deltas_to_betas,4,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_variable_queue.erl"},{line,2346}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_variable_queue,fetch_from_q3,1,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_variable_queue.erl"},{line,2238}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_variable_queue,queue_out,1,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_variable_queue.erl"},{line,1260}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_variable_queue,collect_by_predicate,3,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_variable_queue.erl"},{line,1557}]},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                  {rabbit_variable_queue,remove_by_predicate,2,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                      [{file,"rabbit_variable_queue.erl"},{line,1495}]}]}
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>     offender: [{pid,<0.100434552.1>},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                {id,rabbit_amqqueue},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                {mfargs,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                    {rabbit_amqqueue_process,start_link,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                        [{amqqueue,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             {resource,<<"/">>,queue,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                                 <<"measurement.message.live_consumer">>},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             true,false,none,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             [{<<"x-expires">>,long,5000},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                              {<<"x-message-ttl">>,long,30000},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                              {<<"x-queue-type">>,longstr,<<"classic">>}],
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             none,[],[],[],undefined,undefined,[],[],live,0,[],
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             <<"/">>,
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             #{user => <<"mpgplatform">>},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                             rabbit_classic_queue,#{}},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                         <0.95885391.1>]}},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                {restart_type,transient},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                {significant,true},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                {shutdown,600000},
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1>                {child_type,worker}]
2025-04-15 18:43:03.836306+00:00 [error] <0.99260568.1> 
...
...
2025-04-15 18:43:03.841432+00:00 [error] <0.102250085.1> Restarting crashed queue 'measurement.message.live_consumer' in vhost '/'.
...
2025-04-15 18:43:03.849148+00:00 [warning] <0.102250085.1> Queue measurement.message.live_consumer in vhost / dropped 0/0/0 persistent messages and 0 transient messages after unclean shutdown

Expected behavior

To not observe queue crashes.

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions