Fixing a Galera MariaDB cluster that will only start one node; or, why I’m an idiot

I set up a 3-node Galera MariaDB cluster at home. The process involved in doing that is somewhat complicated, though not outrageously so, but beyond the scope of this post, which is about fixing a dumb mistake I made during setup.

Recently I had a fairly sizeable issue with my Nutanix CE cluster that resulted in a lot of VMs crashing, including some subset of my MariaDB VMs. Once I got everything fixed, I checked that database access through my HAProxy setup worked, and it did, so I assumed that all three nodes were hale and happy.

Today, I wanted to update my Nagios monitoring setup so that it would actually keep an eye on the MariaDB services on the cluster IP and on all three nodes. Oddly, two of the three nodes were failing to respond, though the cluster IP worked fine because node3 was happy.

On each of node1 and node2, I tried to start MariaDB and got:

Jul 14 20:08:41 mariadb02 mysqld[1699]: 2020-07-14 20:08:41 0 [Note] WSREP: view((empty))
Jul 14 20:08:41 mariadb02 mysqld[1699]: 2020-07-14 20:08:41 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
Jul 14 20:08:41 mariadb02 mysqld[1699]:          at gcomm/src/pc.cpp:connect():158
Jul 14 20:08:41 mariadb02 mysqld[1699]: 2020-07-14 20:08:41 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -110 (Connection timed out)
Jul 14 20:08:41 mariadb02 mysqld[1699]: 2020-07-14 20:08:41 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1608: Failed to open channel 'hearn_mariadb_cluster' at 'gcomm://10.10.5.2,10.10.5.3,<truncated for length>
Jul 14 20:08:41 mariadb02 mysqld[1699]: 2020-07-14 20:08:41 0 [ERROR] WSREP: gcs connect failed: Connection timed out
Jul 14 20:08:41 mariadb02 mysqld[1699]: 2020-07-14 20:08:41 0 [ERROR] WSREP: wsrep::connect(gcomm://10.10.5.2,10.10.5.3,10.1.5.4) failed: 7
Jul 14 20:08:41 mariadb02 mysqld[1699]: 2020-07-14 20:08:41 0 [ERROR] Aborting
Jul 14 20:08:41 mariadb02 systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE
Jul 14 20:08:41 mariadb02 systemd[1]: mariadb.service: Failed with result 'exit-code'.
Jul 14 20:08:41 mariadb02 systemd[1]: Failed to start MariaDB 10.4.11 database server.

So I did what any self-respecting nerd does, and started googling. I found a lot of Stack Overflow posts where people had misconfigured firewalls, so I turned mine off for testing purposes. No change. I followed the Galera instructions on re-bootstrapping a cluster, but nothing fixed it. I rebooted all three nodes; nope. This thing had started up fine when I set it up a few months ago. What the hell was going wrong now?

Then I took a closer look at the error message, and noted something significant that made me feel real dumb:

Jul 14 20:08:41 mariadb02 mysqld[1699]: 2020-07-14 20:08:41 0 [ERROR] WSREP: wsrep::connect(gcomm://10.10.5.2,10.10.5.3,10.1.5.4) failed: 7

I’ve changed the IPs from my real ones, but left in the dumb typo to demonstrate. Note that the first two addresses are in 10.10.5.0/24, but the third one starts with 10.1 instead of 10.10. Yeah, I fat-fingered the Galera configuration file on all three nodes. The reason it had started in the first place was that when I set it up, I did the initial bootstrap on the 10.10.5.2 VM, and the other two nodes were able to talk to it to join the cluster. After the last round of crashes, mariadb03 became the primary node, but neither of the other two could talk to it because they had the wrong IP address for it.
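A typo like this is easy to catch mechanically. The following is a minimal Python sketch, not anything from the Galera tooling: it pulls the address list out of a gcomm:// string (from the log line or the config file) and flags any address that falls outside the subnet most of the others share. The function names parse_gcomm and subnet_outliers are made up for illustration. It also assumes plain IPv4 addresses rather than hostnames, and a /24 network as in this post.

```python
import ipaddress
import re
from collections import Counter


def parse_gcomm(text):
    """Extract the host list from the first gcomm:// address found in text.

    Assumes IPv4 addresses; an optional :port suffix on each host is dropped.
    """
    match = re.search(r"gcomm://([^\s)'\"?]*)", text)
    if match is None:
        raise ValueError("no gcomm:// address found")
    hosts = [h for h in match.group(1).split(",") if h]
    return [h.split(":")[0] for h in hosts]


def subnet_outliers(hosts, prefix=24):
    """Return the hosts that are not in the most common /prefix network."""
    if not hosts:
        return []
    # Map each host to the network it would belong to at this prefix length.
    networks = {
        h: ipaddress.ip_network(f"{h}/{prefix}", strict=False) for h in hosts
    }
    majority, _count = Counter(networks.values()).most_common(1)[0]
    return [h for h in hosts if networks[h] != majority]


if __name__ == "__main__":
    log_line = "WSREP: wsrep::connect(gcomm://10.10.5.2,10.10.5.3,10.1.5.4) failed: 7"
    print(subnet_outliers(parse_gcomm(log_line)))  # ['10.1.5.4']
```

This only catches an address that lands in a different subnet from its neighbours. A typo that stays inside the same /24 would pass, so it is a sanity check rather than proof the list is right.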

Fixed that, started MariaDB on the other two nodes, and now everybody’s fat and happy. And I continue to be an idiot.