Resolving a Nutanix CE cluster crash due to disk failure during node removal

Warning: deleting files from a Nutanix CVM is dangerous and can lead to the complete loss of a cluster and all the VMs and data on it. If you have Nutanix support, CALL THEM! Only attempt this sort of thing on a cluster you can afford to rebuild from scratch and that doesn’t contain any data you care about!

The Nutanix CE (Community Edition) cluster that I run on a collection of old HPE DL360 servers is a little finicky, and of course, I don’t have official Nutanix support for it. The CE forums can be a little hit or miss, particularly for obscure problems, so when something breaks I often find myself googling error messages that apparently have never been documented, and, as often as not, I simply end up rebuilding the cluster from scratch and restoring the VMs via Data Protection.

My cluster had been having a lot of Cassandra crashes, leading to nodes being tagged as “degraded,” which often led to VMs panicking and rebooting. While trying to clear things up, I had been removing nodes one by one (a multi-hour process as the cluster migrated hundreds of GB of data and metadata to the other nodes) and adding them back in, which seemed to stabilize them.

A smaller issue I had was that the storage in the cluster was very imbalanced: four of the nodes have 4 disks, and one has 8, and because of the way disk failures have occurred (leading to me replacing older disks with newer, larger ones), the node with 8 disks has about 4TB of available storage, while the others have between 1.5 and 2.5TB. I hadn’t removed the 4TB node yet, so my plan was to detach it from the cluster, pull two 1TB disks out of it, add it back in, and then replace some of the old 146GB disks in the other nodes with the bigger ones, bringing every node within a few hundred GB of 2TB. I initiated the removal of the large node, which required the system to move nearly 1.5TB of data to the other nodes in the cluster. A few minutes afterwards, Prism reported that a disk had failed on another node, which worried me, but the node-detach tasks were continuing and the used storage on the node being removed was shrinking, so I left it alone. I watched for a while, but knowing it would take 6-8 hours to complete, I eventually wandered off and allowed Prism to log me out.

When I came back, any attempt to log into Prism returned “Server Error.” Hm. I ssh’d into one of the CVMs and noticed that none of the ncli, acli, or ecli tools worked, all returning “Error: Connection refused (Connection refused)”. They rely on Prism for communication with the cluster, so it was clear that Prism was wonked. I tried restarting it:

nutanix@NTNX-28af434a-A-CVM:~$ allssh "genesis stop prism"
nutanix@NTNX-28af434a-A-CVM:~$ cluster start
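To see what was actually running after that, the standard checks are “cluster status” and “genesis status”; roughly like this (output omitted, and the exact service list varies by CE/AOS version):

# Show any services that are not UP, across every CVM
nutanix@CVM:~$ cluster status | grep -v UP
# Per-CVM view of which services genesis has started
nutanix@CVM:~$ genesis status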

The cluster would restart the Prism services, but nothing could communicate with them. I checked prism_gateway.log in data/logs and saw a lot of failed connections, but couldn’t determine what was going on. An old blogspot page I found raised the possibility that the node-removal task and the failed-disk removal had come into conflict, and that everything would stay stuck until those tasks were either completed (which couldn’t happen while Prism was down) or cancelled. Luckily, the progress_monitor_cli command still worked.

nutanix@NTNX-28af434a-A-CVM:~$ progress_monitor_cli --fetchall | egrep "entity_id|entity_type|operation"
<snip>
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461664"
  operation: kRemove
  entity_type: kNode
  entity_id: "26461552"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461665"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461662"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461663"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461661"
  operation: kRemove
  entity_type: kDisk
  entity_id: "114231585"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461668"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461666"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461667"

Annoyingly, the example on that website wasn’t accurate about the “entity_type” and “operation” strings: they need the leading “k” removed and have to be lowercase. The actual command I needed to run (once per stuck task) was:

progress_monitor_cli --entity_id="26461667" --entity_type="disk" --operation="remove" -delete
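
Since there were nine of these disk tasks, a small shell loop saves some typing. This is just a sketch using the entity IDs from my progress_monitor_cli output above; double-check the IDs against your own output before deleting anything:

# Clear each stuck disk-removal task by entity ID (IDs from the listing above)
nutanix@CVM:~$ for id in 26461661 26461662 26461663 26461664 26461665 26461666 26461667 26461668 114231585; do
>   progress_monitor_cli --entity_id="$id" --entity_type="disk" --operation="remove" -delete
> done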

Once I had run that on all the disks, I was able to run it on the node-removal task, which cleared everything up, but Prism still wouldn’t respond. I did a complete cluster restart; no dice. I rebooted all the nodes and started the cluster again; still no Prism. I tried powering off the detached node; nope. I powered it back on and powered off the node that had reported the failed disk; still no luck. I powered them both off, but while the cluster would start, Prism was not available. So I brought everything back online and started digging into the logs.
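
In practice, “digging into the logs” mostly means poking around ~/data/logs on each CVM. Log file names vary a little between versions, but it looks something like this:

# Look for recent errors in the Zookeeper and Cassandra logs on this CVM
nutanix@CVM:~$ tail -n 100 ~/data/logs/zookeeper.out
nutanix@CVM:~$ grep -i error ~/data/logs/cassandra/system.log | tail -n 20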

One thing I noticed was that things really seemed to be having difficulty talking to Zookeeper. Genesis reported that Zookeeper was online, but when I dug into the logs on the node I had attempted to detach, it seemed to keep trying to replay its transaction logs and failing with I/O errors. When I looked at the contents of /home/nutanix/data/zookeeper/version-2, I noticed that the “epoch” files held a number different from the other nodes in the cluster, and that the files in that directory hadn’t been touched in several days. When I tried to use “tar” to back them up, I got odd I/O errors. On the assumption that at least two ZK nodes were still online and happy, I stopped the service on the wonky node, simply deleted all the “log” and “snapshot” files in that directory, and rebooted the CVM to bring Zookeeper back up. It happily came online and began recovering its data from the other two nodes. That seemed to eliminate the errors I was seeing related to ZK, but Prism still wouldn’t come online. What the ham?!?
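
For the record, the Zookeeper cleanup amounted to roughly the following, run on the broken CVM only. This is a sketch: “genesis stop zookeeper” is an assumption based on the same genesis stop pattern used for Prism above, and deleting Zookeeper data is only sane if the other ZK nodes are healthy:

# Compare epoch files and data freshness across the CVMs first
nutanix@CVM:~$ allssh "ls -l /home/nutanix/data/zookeeper/version-2"
# On the broken CVM: stop the service, clear the corrupt transaction logs and
# snapshots, then reboot so Zookeeper resyncs from its peers
nutanix@CVM:~$ genesis stop zookeeper
nutanix@CVM:~$ rm /home/nutanix/data/zookeeper/version-2/log.* /home/nutanix/data/zookeeper/version-2/snapshot.*
nutanix@CVM:~$ sudo reboot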

The other service that seemed to be complaining a lot was Cassandra, and the nodetool command was giving me frustrating messages. On the node with the reported failed disk, any nodetool command returned a lengthy Java exception trace. On the other nodes, nodetool reported that node as “down” (and the one I’d tried to detach was in forwarding mode, which wasn’t unexpected):

nutanix@NTNX-fe8d4a5e-A-CVM:~$ nodetool -h 0 ring
Address         Status State      Load            Owns    Token
                                                          sFUzzzzz0000000000000000000000000000000000000000000000000000
baddisknode     Down   Normal     ? bytes         26.37%  8b3ODYsE0000000000000000000000000000000000000000000000000000
goodnode1       Up     Normal     8.62 GB         26.37%  OwbmR7kU0000000000000000000000000000000000000000000000000000
goodnode2       Up     Normal     8.64 GB         23.63%  db3ODYsE0000000000000000000000000000000000000000000000000000
removednode     Up     Forwarding 7.04 GB         23.63%  sFUzzzzz0000000000000000000000000000000000000000000000000000

Restarting Cassandra didn’t do much to make the down node happy. Its logs mentioned a lot of trouble replaying commit logs after the restart, throwing input/output errors similar to Zookeeper’s, so I guessed the commit logs might be corrupted. Figuring that if I couldn’t get the cluster back online I was going to have to rebuild it anyway and recover the VMs via Data Protection (from a “remote site” cluster in the same rack), I planned to tar up the Cassandra commit logs in /home/nutanix/data/stargate-storage/disks/6005*/metadata/cassandra/commitlog/ and then delete them. However, when I attempted the tar, I got:

tar: /home/nutanix/data/stargate-storage/disks/600508b1001c1f6d/metadata/cassandra/commitlog/CommitLog-3-1683565787792.log: File shrank by 53925107 bytes; padding with zeros

Well, that’s got to be the bad guy, right? I stopped Cassandra on the node, deleted that file, and ran “cluster start” to bring Cassandra back up. After that, the logs looked far healthier, and I could run nodetool on the formerly wonky node, which now reported it as up:

nutanix@NTNX-8f724455-A-CVM:~$ nodetool -h 0 ring
Address         Status State      Load            Owns    Token
                                                          sFUzzzzz0000000000000000000000000000000000000000000000000000
fixednode       Up     Normal     6.02 GB         26.37%  8b3ODYsE0000000000000000000000000000000000000000000000000000
goodnode1       Up     Normal     9.31 GB         26.37%  OwbmR7kU0000000000000000000000000000000000000000000000000000
goodnode2       Up     Normal     8.88 GB         23.63%  db3ODYsE0000000000000000000000000000000000000000000000000000
detachednode    Up     Forwarding 7.04 GB         23.63%  sFUzzzzz0000000000000000000000000000000000000000000000000000
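
For reference, the commitlog cleanup on the affected CVM boiled down to something like this. Again, a sketch: “genesis stop cassandra” is assumed to follow the same pattern as the Prism restart, and the path is the truncated commitlog from the tar error above:

# Stop Cassandra on this CVM, remove the truncated commitlog, and restart services
nutanix@CVM:~$ genesis stop cassandra
nutanix@CVM:~$ rm /home/nutanix/data/stargate-storage/disks/600508b1001c1f6d/metadata/cassandra/commitlog/CommitLog-3-1683565787792.log
nutanix@CVM:~$ cluster start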

After that, I restarted Prism on all the nodes once more:

nutanix@NTNX-8f724455-A-CVM:~$ allssh "genesis stop prism"
nutanix@NTNX-8f724455-A-CVM:~$ cluster start

After that, Prism became available, and the ncli, acli, and ecli commands were back in business! Interestingly, despite my having deleted the removal tasks earlier, Prism still appears to be trying to detach the disks and the node, so I’m going to leave it be and hope it finishes on its own.
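
To keep an eye on those tasks from the command line, the Ergon task list is visible via ecli as well; something like this (exact syntax may vary by version):

# List in-flight tasks from any CVM
nutanix@CVM:~$ ecli task.list include_completed=false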

Update, 14 hours later: the node finally detached! And another disk failed on another node, which is wild, though not totally unexpected with hardware that’s close to 15 years old.