This repository was archived by the owner on May 28, 2021. It is now read-only.

Fix restarting the entire cluster when volumes are persisted #157

Merged
1 commit merged into oracle:master on Jul 3, 2018

Conversation

gianlucaborello

Hi,

I finally figured out the issue I was talking about in #155, after doing some more troubleshooting.

In summary:

  1. Create a cluster with persistent volumes automatically provisioned by the StatefulSet.
  2. Simulate churn of the cluster, as would result from a Kubernetes maintenance operation (e.g. upgrading nodes, the user stopping the application, the user migrating the persistent layer somewhere else, ...). This can be achieved, for example, by deleting all the pods of the cluster (see the sketch after this list).
  3. When the StatefulSet reschedules the pods and reattaches the existing volumes to them, the cluster fails to come up 100% of the time.
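The churn in step 2 can be simulated with a small client-go program along the lines of the sketch below. The `default` namespace and the `app=mysql` label selector are assumptions about how the pods are labeled; `kubectl delete pods -l app=mysql` does the same thing from the command line.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from the local kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Delete every pod of the MySQL cluster in one shot; the StatefulSet
	// controller recreates them in order, starting from ordinal 0, while
	// the existing PersistentVolumeClaims stay bound and get reattached.
	err = client.CoreV1().Pods("default").DeleteCollection(
		context.TODO(),
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "app=mysql"}, // hypothetical label
	)
	if err != nil {
		panic(err)
	}
	fmt.Println("deleted all mysql pods")
}
```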

Analysis of the issue

To give a bit more detail: when the StatefulSet reschedules the pods that have been deleted, it does so in an ordered manner. This means that the pods beyond the first are not scheduled until the liveness/readiness probes of the first succeed. The problem is that the first agent is never able to form the cluster. When the first mysqld comes up, the agent gets the status error message metadata exists, but GR is not active, so it issues a rebootClusterFromCompleteOutage(), but the command fails:

Dba.rebootClusterFromCompleteOutage: ERROR: Error starting cluster: 'mysql1:3306' - Query failed. MySQL Error (3096): ClassicSession.query: The START GROUP_REPLICATION command failed as there was an error when initializing the group communication layer.. Query: START group_replication: MySQL Error (3096): ClassicSession.query: The START GROUP_REPLICATION command failed as there was an error when initializing the group communication layer. (RuntimeError)

To understand the reason, we can look in the mysqld log:

2018-06-22T21:14:32.793708Z 2 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Peer address "mysql1:33061" is not valid.'
2018-06-22T21:14:32.793730Z 2 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Peer address "mysql2:33061" is not valid.'
2018-06-22T21:14:32.793738Z 2 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] None of the provided peer address is valid.'
2018-06-22T21:14:32.793909Z 2 [ERROR] [MY-011674] [Repl] Plugin group_replication reported: 'Unable to initialize the group communication engine'
2018-06-22T21:14:32.793924Z 2 [ERROR] [MY-011637] [Repl] Plugin group_replication reported: 'Error on group communication engine initialization'
2018-06-22T21:14:32.793933Z 2 [Note] [MY-011649] [Repl] Plugin group_replication reported: 'Requesting to leave the group despite of not being a member'
2018-06-22T21:14:32.793939Z 2 [ERROR] [MY-011718] [Repl] Plugin group_replication reported: 'Error calling group communication interfaces while trying to leave the group'

As you can see, replication fails to start and stalls the whole process: it complains about the peer addresses not being valid. Tracing the error down inside the group replication plugin code, there are two relevant functions that validate peer addresses:

https://github.com/mysql/mysql-server/blob/mysql-8.0.11/plugin/group_replication/libmysqlgcs/src/bindings/xcom/gcs_xcom_utils.cc#L104

https://github.com/mysql/mysql-server/blob/mysql-8.0.11/plugin/group_replication/libmysqlgcs/src/bindings/xcom/gcs_xcom_utils.cc#L871

We can see that, from mysqld's point of view, for a peer name to be valid it must at least be a resolvable DNS name. However, since Kubernetes recreates the pods in order starting from 0, the other pods don't exist yet, their virtual service names don't resolve yet either, and so the whole thing fails.
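In rough terms, the validation amounts to checking that the host part of each seed resolves. A minimal Go approximation of that check follows; this is not the actual XCom C++ code, and the peer names are simply the ones from the logs above.

```go
package main

import (
	"fmt"
	"net"
)

// peerIsValid roughly mirrors the XCom checks referenced above: a peer
// address is only accepted if its host part resolves to at least one IP.
func peerIsValid(peer string) bool {
	host, _, err := net.SplitHostPort(peer)
	if err != nil {
		return false
	}
	addrs, err := net.LookupHost(host)
	return err == nil && len(addrs) > 0
}

func main() {
	// While mysql1 and mysql2 have not been recreated yet, their service
	// names do not resolve, so every seed beyond the local one is rejected.
	for _, peer := range []string{"mysql0:33061", "mysql1:33061", "mysql2:33061"} {
		fmt.Println(peer, "valid:", peerIsValid(peer))
	}
}
```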

Notice that if we instead manually create fake dangling Kubernetes services for mysql1 and mysql2 (that is, services that resolve to a VIP but behind which no mysqld is listening, because the pods are not up yet), we get a different result:

2018-06-22T21:11:03.398851Z 2 [Note] [MY-011670] [Repl] Plugin group_replication reported: 'Group Replication applier module successfully initialized!'
2018-06-22T21:11:03.422512Z 0 [Note] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] XCom initialized and ready to accept incoming connections on port 33061'
2018-06-22T21:11:06.580221Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to mysql1:33061 on local port: 33061.'
2018-06-22T21:11:09.651869Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to mysql2:33061 on local port: 33061.'
2018-06-22T21:11:12.725023Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to mysql1:33061 on local port: 33061.'
2018-06-22T21:11:15.796127Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to mysql2:33061 on local port: 33061.'
...
2018-06-22T21:12:03.400898Z 2 [ERROR] [MY-011640] [Repl] Plugin group_replication reported: 'Timeout on wait for view after joining group'
2018-06-22T21:12:03.401013Z 2 [Note] [MY-011649] [Repl] Plugin group_replication reported: 'Requesting to leave the group despite of not being a member'
2018-06-22T21:12:03.401047Z 2 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'

Here, MySQL correctly initializes group replication even though the peer names can only be resolved, not reached, so the agent is able to issue the standard rebootClusterFromCompleteOutage() command and restore the cluster. So this is really a Kubernetes/MySQL interaction issue.

The simple solution proposed here is to reset the seeds of the first instance before rebooting the cluster, so that the replication plugin won't try, and fail, to contact hosts that can't even be resolved. This should in theory be safe because that code path is only invoked when instance 0 cannot contact any other member of the cluster, so we're not blindly resetting the seed list every time.
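As a rough illustration of the idea (not the actual diff in this PR), clearing the seed list on instance 0 boils down to resetting the standard group_replication_group_seeds variable before the reboot. The DSN below is a placeholder.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

// resetGroupSeeds clears the group replication seed list on the given
// instance so that rebootClusterFromCompleteOutage() does not try to
// contact peers whose DNS names do not exist yet.
func resetGroupSeeds(dsn string) error {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return err
	}
	defer db.Close()
	_, err = db.Exec("SET GLOBAL group_replication_group_seeds = ''")
	return err
}

func main() {
	// Placeholder DSN; the agent would connect to its local mysqld.
	if err := resetGroupSeeds("root:password@tcp(127.0.0.1:3306)/"); err != nil {
		log.Fatal(err)
	}
	log.Println("seed list cleared; safe to reboot the cluster from complete outage")
}
```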

There is some chance of forming a split brain, but I believe the same could be said of the current logic. The fix could be made more robust, for example by resetting the seeds only if we can see, by querying the Kubernetes API, that the first pod really is the only one in existence at that time. I'm happy to do so if required, but I first wanted to share this approach and hear whether you have other ideas.
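Such a follow-up check could look roughly like the client-go sketch below; the namespace, label selector, and pod name are hypothetical.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// onlyPodZeroExists returns true if the only pod of the cluster currently
// known to the API server is the first one (ordinal 0), i.e. resetting the
// seed list cannot split the brain with a live peer.
func onlyPodZeroExists(client kubernetes.Interface, namespace, selector, podZero string) (bool, error) {
	pods, err := client.CoreV1().Pods(namespace).List(
		context.TODO(),
		metav1.ListOptions{LabelSelector: selector},
	)
	if err != nil {
		return false, err
	}
	return len(pods.Items) == 1 && pods.Items[0].Name == podZero, nil
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	ok, err := onlyPodZeroExists(client, "default", "app=mysql", "mysql-0")
	if err != nil {
		panic(err)
	}
	fmt.Println("safe to reset seeds:", ok)
}
```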

The code changes are based on the PR from #155 for (my) convenience, but obviously I would rebase them appropriately if the other one doesn't get a thumbs up and this one does.

Thanks

@owainlewis owainlewis added the oracle-cla: yes Contributor has signed the Oracle Contributor Licence Agreement label Jun 26, 2018
@owainlewis owainlewis self-requested a review June 26, 2018 10:14
@owainlewis owainlewis self-assigned this Jun 26, 2018
@owainlewis owainlewis requested a review from prydie June 26, 2018 10:15
Member

@owainlewis owainlewis left a comment


Changes look good to me. Needs a rebase before merge.

@gianlucaborello
Author

Rebased. Thanks a lot for the reviews and for merging some of the other PRs.

@owainlewis owainlewis merged commit fca0915 into oracle:master Jul 3, 2018
@prydie

prydie commented Jul 3, 2018

Hi @gianlucaborello,

Also LGTM 👍 . I like the idea of checking that the other Pods are indeed missing as far as the API server is concerned but happy to introduce that as a follow-up.

Would you mind signing off the commit, and then I'll merge?

@gianlucaborello gianlucaborello deleted the fix-pv-restart branch July 6, 2018 22:10