Fix restarting the entire cluster when volumes are persisted #157
Hi,
I finally figured out the issue I was talking about in #155, after doing some more troubleshooting.
In summary:

### Analysis of the issue
To give a bit more detail: when the StatefulSet reschedules the pods that have been deleted, it does so in an ordered manner. This means that the pods beyond the first are not scheduled until the liveness/readiness probes of the first one succeed. The problem is that the first agent is never able to form the cluster. When the first mysqld comes up, its status is reported as

`metadata exists, but GR is not active`

and so the agent issues a `rebootClusterFromCompleteOutage()`, but the command fails. To understand the reason, we can look in the mysqld log:
As you can see, replication fails to start and stalls the whole process: it complains about the peer addresses not being valid. Tracing the error down inside the group replication plugin code, there are two relevant functions that validate peer addresses:
https://github.com/mysql/mysql-server/blob/mysql-8.0.11/plugin/group_replication/libmysqlgcs/src/bindings/xcom/gcs_xcom_utils.cc#L104
https://github.com/mysql/mysql-server/blob/mysql-8.0.11/plugin/group_replication/libmysqlgcs/src/bindings/xcom/gcs_xcom_utils.cc#L871
We can see that, from mysqld's point of view, for a peer name to be valid it must at least be a resolvable DNS name. However, since Kubernetes recreates the pods in order starting from 0, the other pods don't exist yet, their virtual service names don't resolve either, and so the whole thing fails.
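A quick way to see this from outside mysqld is simply to check whether the names in the seed list resolve at all. The small Go sketch below is purely illustrative; the `mysql-N.mysql...` hostnames are assumptions about how the per-pod service names look in my setup, not something taken from the operator code:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Hypothetical seed list, mirroring what the first instance has persisted
	// in group_replication_group_seeds before the other pods are recreated.
	seeds := []string{
		"mysql-1.mysql.default.svc.cluster.local:33061",
		"mysql-2.mysql.default.svc.cluster.local:33061",
	}

	for _, seed := range seeds {
		host, _, err := net.SplitHostPort(seed)
		if err != nil {
			fmt.Printf("%s: malformed seed: %v\n", seed, err)
			continue
		}
		// Like the XCom utils check, we only require the name to resolve;
		// nothing needs to be listening behind it.
		if _, err := net.LookupHost(host); err != nil {
			fmt.Printf("%s: NOT resolvable, group replication would refuse to start: %v\n", host, err)
		} else {
			fmt.Printf("%s: resolvable, peer address considered valid\n", host)
		}
	}
}
```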
Notice that if we instead run a test and manually create fake, dangling Kubernetes services for `mysql1` and `mysql2` (that is, services that resolve to a VIP but behind which there is no mysqld listening, because the pods are not up yet), we get a different result:

Here, MySQL correctly initializes group replication even though the peer names can't be reached (only resolved), and so the agent is fully able to issue the standard `rebootClusterFromCompleteOutage()` command, restoring the cluster.
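For reference, this is roughly how such placeholder services can be created programmatically (a plain `kubectl` equivalent works just as well); the names, namespace, and port here are assumptions made for the experiment, not anything the operator itself creates:

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (assumption: running outside the cluster).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder Services for the not-yet-recreated members. They get a ClusterIP,
	// so the names resolve, but no selector matches and nothing answers behind them.
	for _, name := range []string{"mysql-1", "mysql-2"} {
		svc := &corev1.Service{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "default"},
			Spec: corev1.ServiceSpec{
				Ports: []corev1.ServicePort{{Name: "gr-xcom", Port: 33061}},
			},
		}
		if _, err := client.CoreV1().Services("default").Create(context.TODO(), svc, metav1.CreateOptions{}); err != nil {
			log.Printf("creating %s: %v", name, err)
		}
	}
}
```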
So, this is really a Kubernetes/MySQL interaction issue.

The simple solution I proposed here is to reset the seeds of the first instance before rebooting the cluster, so we are sure the replication plugin won't try (and fail) to contact hosts that can't even be resolved. This should in theory be safe, because we only reach that part of the code when instance 0 cannot contact any other member of the cluster, so we are not just blindly resetting the seed list every time.
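Conceptually, the change boils down to executing something like the statement below against instance 0 before the reboot is attempted. This is only a sketch with a hypothetical DSN; in the PR the reset happens inside the agent's reboot path rather than as a standalone program:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

// resetGroupSeeds clears group_replication_group_seeds so the replication
// plugin does not try to validate peer names that cannot be resolved yet.
func resetGroupSeeds(dsn string) error {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return err
	}
	defer db.Close()

	// Clear the seed list on the surviving first instance; the agent can then
	// issue rebootClusterFromCompleteOutage() and let the other members rejoin
	// once their pods (and DNS names) exist again.
	_, err = db.Exec("SET GLOBAL group_replication_group_seeds = ''")
	return err
}

func main() {
	// Hypothetical local DSN for instance 0; credentials and host are placeholders.
	if err := resetGroupSeeds("root:password@tcp(127.0.0.1:3306)/"); err != nil {
		log.Fatal(err)
	}
}
```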
There could be some chance of forming a split brain, but I believe the same could be said of the current logic. The fix could be made more robust, for example by resetting the seeds only if we see, by querying the Kubernetes API, that the first pod really is the only one in existence at that time. I'm happy to do so if required, but first I wanted to share this approach and see whether you have other ideas.
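A sketch of the extra guard I have in mind, assuming the member pods share a common label; the namespace and label selector below are placeholders, not the operator's real ones. The idea is to reset the seeds only when the API server confirms that a single member pod currently exists:

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// onlySurvivor reports whether exactly one member pod of the cluster exists
// right now, by asking the Kubernetes API directly.
func onlySurvivor(client kubernetes.Interface, namespace, selector string) (bool, error) {
	pods, err := client.CoreV1().Pods(namespace).List(context.TODO(),
		metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return false, err
	}
	return len(pods.Items) == 1, nil
}

func main() {
	// In-cluster config, since the agent runs inside the pod.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	alone, err := onlySurvivor(client, "default", "app=mysql")
	if err != nil {
		log.Fatal(err)
	}
	if alone {
		fmt.Println("pod 0 is the only member pod: safe to reset seeds and reboot the cluster")
	}
}
```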
The code changes are based on the PR from #155 for my own convenience, but obviously I would rebase them appropriately if the other one doesn't get a thumbs up and this one does.
Thanks