Resolving a Nutanix CE cluster crash due to disk failure during node removal

Warning: deleting any file from a Nutanix CVM is a dangerous thing that can lead to the complete loss of a cluster and all the VMs and data on it. If you have Nutanix support, CALL THEM! Only attempt this sort of thing on a cluster that you can afford to rebuild from scratch and that doesn’t contain any data you care about!

The Nutanix CE (Community Edition) cluster that I run on a collection of old HPE DL360 servers is a little finicky, and of course, I don’t have official Nutanix support for it. The CE forums can be a little hit or miss, particularly for obscure problems, so when I have problems I often find myself doing a lot of googling for error messages that apparently have never been documented, and often as not, simply end up rebuilding the cluster from scratch and restoring the VMs via Data Protection.

My cluster had been having a lot of Cassandra crashes, leading to nodes being tagged as “degraded,” which often led to VMs panicking and rebooting. While trying to clear things up, I had been removing nodes one by one (a multi-hour process as the cluster migrated hundreds of GB of data and metadata to the other nodes) and adding them back in, which seemed to stabilize them.

A smaller issue I had was that the storage in the cluster was very imbalanced; four of the nodes have 4 disks, and one has 8, and because of the way disk failures have occurred (leading to me replacing older disks with newer, larger ones), the node with 8 disks has about 4TB of available storage, while the others have between 1.5 and 2.5TB. I hadn’t removed the 4TB node yet, so I decided to detach it from the cluster, pull two 1TB disks out of it, add it back in, and then replace some of the old 146GB disks in the other nodes with the bigger ones to get everything within a few hundred GB of 2TB. I initiated the removal of the large node, which required the system to move nearly 1.5TB of data to the other nodes in the cluster. A few minutes afterwards, Prism reported that a disk had failed on another node, which worried me, but the node detach tasks were continuing and the size of the used storage on the removed node was shrinking, so I left it alone. I watched for a while, but knowing it would take 6-8 hours to complete, eventually I wandered off and allowed Prism to log me out.

When I came back, any attempt to log into Prism returned “Server Error.” Hm. I ssh’d into one of the CVMs and noticed that none of the ncli, acli, or ecli tools worked, all returning “Error: Connection refused (Connection refused)”. They rely on Prism for communication with the cluster, so it was clear that Prism was wonked. I tried restarting it:

nutanix@NTNX-28af434a-A-CVM:~$ allssh "genesis stop prism"
nutanix@NTNX-28af434a-A-CVM:~$ cluster start

The cluster would restart Prism services, but couldn’t communicate with them. I checked the prism_gateway.log in data/logs and saw a lot of failed connections, but couldn’t determine what was going on. An old blogspot page I found turned up the possibility that the node removal task and the failed disk removal had come into conflict, and everything was stuck until those tasks were either completed (which couldn’t happen while Prism was down) or cancelled. Luckily, the progress_monitor_cli command still worked.

nutanix@NTNX-28af434a-A-CVM:~$ progress_monitor_cli --fetchall | egrep "entity_id|entity_type|operation"
<snip>
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461664"
  operation: kRemove
  entity_type: kNode
  entity_id: "26461552"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461665"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461662"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461663"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461661"
  operation: kRemove
  entity_type: kDisk
  entity_id: "114231585"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461668"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461666"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461667"

Annoyingly, the example on the website wasn’t accurate with regard to the “entity_type” and “operation” strings, which needed to have the leading “k” removed and be lower case. The actual command I needed to run was:

progress_monitor_cli --entity_id="26461667" --entity_type="disk" --operation="remove" -delete
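
Since there were a pile of these stuck tasks, a quick shell loop saved some copying and pasting (a sketch using the disk entity IDs from the --fetchall output above; the node task gets its own run with --entity_type="node"):

for disk_id in 26461661 26461662 26461663 26461664 26461665 26461666 26461667 26461668 114231585; do
  # same command as above, once per stuck disk-removal task
  progress_monitor_cli --entity_id="${disk_id}" --entity_type="disk" --operation="remove" -delete
done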

Once I had run that on all the disks, I was able to run it on the node removal task, which got everything cleared up, but still, Prism wouldn’t respond. I did a complete cluster restart, but no dice; I rebooted all the nodes and started the cluster again, still no Prism. I tried powering off the detached node; nope. I powered that back on and powered off the node that had reported the failed disk; still no luck. I powered them both off, but while the cluster would start, Prism was not available. So I brought everything back online and started digging into the logs.

One thing I noticed was that things really seemed to be having difficulty talking to Zookeeper. Genesis reported that zookeeper was online, but when I dug into the logs on the node I had attempted to detach, I noticed it seemed to keep trying to replay logs, and failing with I/O errors. When I looked at the contents of /home/nutanix/data/zookeeper/version-2, I noticed that there were “epoch” files that had a number different than other nodes in the cluster, and also the files in that directory had not been touched in several days. When I tried to use “tar” to back them up, I got odd I/O errors. On the assumption that at least two ZK nodes were still online and happy, I stopped the service on the wonky nodes and simply deleted all the “log” and “snapshot” files in that directory, and rebooted the CVM to bring Zookeeper back up. It happily came online and began recovering its data from the other 2 nodes. That seemed to eliminate the errors I was seeing related to ZK. Prism still wouldn’t come online. What the ham?!?
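
For the record, the Zookeeper cleanup boiled down to roughly this, on the broken CVM only (a sketch; it assumes zookeeper can be stopped via genesis like the other services, and that at least two healthy Zookeeper nodes remain to resync from):

genesis stop zookeeper
cd /home/nutanix/data/zookeeper/version-2
tar czf ~/zk-version-2-backup.tar.gz . || true   # backup attempt; this is the step that threw I/O errors for me
rm -f log.* snapshot.*                           # ditch the stale transaction logs and snapshots
sudo reboot                                      # reboot the CVM so Zookeeper comes back and resyncs from its peers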

The other service that seemed to be complaining a lot was Cassandra, and the nodetool command was giving me frustrating messages. On the node with the reported failed disk, any nodetool command returned a lengthy Java exception trace. On the other nodes, nodetool would report that node as “down” (and the one I’d tried to detach was in forwarding mode, which wasn’t unexpected):

nutanix@NTNX-fe8d4a5e-A-CVM:~$ nodetool -h 0 ring
Address         Status State      Load            Owns    Token
                                                          sFUzzzzz0000000000000000000000000000000000000000000000000000
baddisknode     Down   Normal     ? bytes         26.37%  8b3ODYsE0000000000000000000000000000000000000000000000000000
goodnode1     Up     Normal     8.62 GB         26.37%  OwbmR7kU0000000000000000000000000000000000000000000000000000
goodnode2     Up     Normal     8.64 GB         23.63%  db3ODYsE0000000000000000000000000000000000000000000000000000
removednode     Up     Forwarding 7.04 GB         23.63%  sFUzzzzz0000000000000000000000000000000000000000000000000000

Restarting Cassandra didn’t do much to make the down node happy. The logs mentioned a lot of issues replaying logs after restarting, throwing input/output errors similar to the ones Zookeeper had been throwing, so I guessed maybe the commit logs were corrupted. Figuring that if I couldn’t get the cluster back online I was going to have to rebuild it and recover the VMs via Data Protection (from a “remote site” cluster in the same rack), I planned to tar up the Cassandra commit logs in /home/nutanix/data/stargate-storage/disks/6005*/metadata/cassandra/commitlog/ and delete them. However, when I attempted the tar, I got:

tar: /home/nutanix/data/stargate-storage/disks/600508b1001c1f6d/metadata/cassandra/commitlog/CommitLog-3-1683565787792.log: File shrank by 53925107 bytes; padding with zeros

Well, that’s got to be the bad guy, right? I stopped Cassandra on the node, deleted that file, and ran “cluster start” to get Cassandra started. After that, the logs looked far healthier, and I could run nodetool on the wonky node and it reported as being up:

nutanix@NTNX-8f724455-A-CVM:~$ nodetool -h 0 ring
Address         Status State      Load            Owns    Token
                                                          sFUzzzzz0000000000000000000000000000000000000000000000000000
fixednode     Up     Normal     6.02 GB         26.37%  8b3ODYsE0000000000000000000000000000000000000000000000000000
goodnode1     Up     Normal     9.31 GB         26.37%  OwbmR7kU0000000000000000000000000000000000000000000000000000
goodnode2     Up     Normal     8.88 GB         23.63%  db3ODYsE0000000000000000000000000000000000000000000000000000
detachednode     Up     Forwarding 7.04 GB         23.63%  sFUzzzzz0000000000000000000000000000000000000000000000000000
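
For the record, the commitlog fix on that CVM boiled down to something like this (a sketch; it assumes Cassandra is genesis-managed like the other services, and uses the commitlog path from the tar error above):

genesis stop cassandra
rm /home/nutanix/data/stargate-storage/disks/600508b1001c1f6d/metadata/cassandra/commitlog/CommitLog-3-1683565787792.log
cluster start   # brings the stopped services, including Cassandra, back up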

After that, I restarted Prism on all the nodes once more:

nutanix@NTNX-8f724455-A-CVM:~$ allssh "genesis stop prism"
nutanix@NTNX-8f724455-A-CVM:~$ cluster start

After that, Prism became available, and the ncli, acli, and ecli commands got back to work! Interestingly, despite my having deleted the removal tasks earlier, Prism still seems to be trying to detach the disks and the node, so I’m going to leave it be and hope it finishes on its own.

Update, 14 hours later: the node finally detached! And another disk failed on another node, which is wild, though not totally unexpected with hardware that’s close to 15 years old.

Configuring static IP on new Ubuntu 20 builds with cloud-init

For a long time I had been building VMs on my Nutanix AHV cluster using DHCP, because unlike VMware, Nutanix didn’t offer an easy way to hard-code an IP to a VM that’s not already on the network. I considered giving it a temporary DHCP address, and then once it was networked, ssh-ing in and hard-coding that or another IP and then doing some network magic in the infrastructure to make everything work, but that was more trouble than it was really worth. Once AHV assigns an IP to a VM, it will always give it that same IP…unless you migrate it between clusters, which happens all too frequently as I have to rebuild my primary cluster every few months after some kind of minor error that the cluster couldn’t survive. After the last time I had to do a rebuild, and went back in to manually apply the correct IPs to a couple dozen DHCP-enabled VMs, I decided to figure out static IPs.

I knew that Nutanix offered the ability to run a cloud-init script when creating a VM, but I’d never been able to get it to work; partly that was RedHat/CentOS’s fault, because it doesn’t seem like they properly implemented cloud-init to allow you to use the “network:” block to set things up. I ended up having to manually “cat” the necessary information into the ifcfg files in /etc/sysconfig/network-scripts, which worked quite well.
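
The runcmd-style workaround on CentOS looked roughly like the following (a simplified sketch rather than my exact script; the interface name is a placeholder, and the addresses are the same example ones used in the netplan config below):

cat > /etc/sysconfig/network-scripts/ifcfg-eth0 <<EOF
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none
IPADDR=10.22.48.8
PREFIX=24
GATEWAY=10.22.48.1
DNS1=8.8.8.8
EOF
ifdown eth0; ifup eth0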

Ubuntu 16 and 18 work pretty much the same way; just overwrite the /etc/network/interfaces file and “ifup” the virtual NIC. Ubuntu 20, however, is a real headache, because by default it no longer uses /etc/network/interfaces, instead using a tool named “netplan”. It’s also a bit more of a bear when it comes to cloud-init, because it doesn’t play nicely with the way Nutanix delivers the script: Nutanix provides it as a file on a virtual CD-ROM that cloud-init is supposed to mount and read the info from, but Ubuntu 20 doesn’t mount that automatically. Also, I use packer to build the “golden image” that gets cloned when I create VMs; packer reboots the builder VM a couple times as part of generating the disk image, and cloud-init generally only runs on first boot. Getting around this is a bit of a multistep process, and I had to work backwards from the desired end-state, i.e., a VM with an active network using a static IP configuration.

The “netplan” process will read configuration files from /etc/netplan; those files generally follow the same yaml code style as cloud-init. The files are read in lexical (alphanumeric) order, so you need to have a file in there that supersedes anything else, alphabetically; by default there’s a 50-cloud-init.yaml file that’s set for DHCP and that supposedly can get regenerated automatically, so I made my file named 51-cloud-init.yaml (and I remove the 50-* one just to be sure). On the VM it has to look something like this:

#cloud-config
# Configure networking, set a static IP
network:
  version: 2
  renderer: networkd
  ethernets:
    ens3:
      addresses:
      - 10.22.48.8/24
      gateway4: 10.22.48.1
      nameservers:
        addresses:
        - 4.4.4.4
        - 8.8.8.8

The trick, obviously, is that the IP needs to be dynamically generated at build time. Because I use Ansible (and the nifty new Nutanix Ansible collection) to create my VMs, I can use the “template” module and Jinja2, passing in the necessary variables:

#cloud-config
# Configure networking, set a static IP
network:
  version: 2
  renderer: networkd
  ethernets:
    ens3:
      addresses:
      - {{ip_with_prefix}}
      gateway4: {{ ip_with_prefix | ansible.utils.ipaddr('address/prefix') | ansible.utils.ipaddr('1') | ansible.utils.ipaddr('address') }}
      nameservers:
        addresses:
        - 4.4.4.4
        - 8.8.8.8

The “ip_with_prefix” variable is calculated from inputs elsewhere in the playbook and should always be in “xxx.xxx.xxx.xxx/yyy” format, with yyy being the subnet prefix (not the IP-formatted netmask). The Jinja2 code calculates the gateway for it based on that variable. For now I just hard-code DNS servers, but those could obviously be provided via variables as well.
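
If you want to sanity-check that gateway logic outside of Ansible, the stock python3 ipaddress module does the same “first host in the subnet” calculation (using a hypothetical ip_with_prefix value):

python3 -c "import ipaddress; print(ipaddress.ip_interface('10.22.48.8/24').network[1])"
# prints 10.22.48.1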

When my Ansible code builds the VM, it generates a netplan file from that template and passes it in as the cloud-init script, even though technically it’s a netplan script that we’re just using cloud-init to get on the VM. Within the OS, that shows up as a file on the virtual CDROM. Unfortunately, Ubuntu 20 doesn’t seem to automatically mount the CDROM and process the contents as part of its cloud-init on bootup.

Here’s where the trickery, which took me days to figure out, comes in. What needs to happen is that the disk template that Nutanix clones to create a VM needs to have an actual cloud-init script on it that mounts the CDROM and copies the netplan script to the right location. On my Ubuntu 20 disk template, that cloud-init script is named 07_networkcopy.cfg, and looks like this:

#cloud-config
runcmd:
  - "sudo mount -v /dev/sr0 /mnt"
  - "sudo cp /mnt/openstack/latest/user_data /etc/netplan/51-cloud-init.yaml"
  - "sudo rm /etc/netplan/50-cloud-init.yaml"
  - "sudo netplan generate"
  - "sudo netplan apply"

I should have been able to use the “mounts” directive to mount the CDROM, but that was very hit-or-miss. Cloud-init was also weirdly sensitive to which file these commands were in. I originally had the mount step in a separate file named 06_mountcd.cfg, but it never seemed to run. The 07_networkcopy.cfg definitely ran, because I got errors about /mnt/openstack/latest/user_data (the location where AHV puts cloud-init scripts) not existing. When I put everything in 07_networkcopy.cfg it worked fine. I should probably also add a step to unmount the CD; I’ll do that next.
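
Once a VM built from this image comes up, a couple of quick checks confirm that the netplan file landed and got applied (assuming the ens3 interface name from the example above):

cat /etc/netplan/51-cloud-init.yaml
ip addr show ens3
ip route show default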

I create my disk templates using packer. The code that gets the 07_networkcopy.cfg file into place consists of a few “provisioner” steps in my packer build config file. First is a file directive that copies the file over:

    {
      "destination": "~ubuntu/07_networkcopy.cfg",
      "source": "07_networkcopy.cfg",
      "type": "file"
    },

Then there are a few steps inside an “inline” provisioner (just a list of commands packer runs on the build VM once it’s up):

    {
      "inline": [
        "sudo mv ~ubuntu/07_networkcopy.cfg /etc/cloud/cloud.cfg.d/07_networkcopy.cfg",
        "sudo rm -f /etc/netplan/00-installer-config.yaml",
        "sudo rm -f /etc/netplan/50-cloud-init.yaml",
        "sudo rm -f /etc/cloud/cloud.cfg.d/subiquity-disable-cloudinit-networking.cfg",
        "sudo cloud-init clean"
      ],
      "type": "shell"
    }

Obviously this assumes that the ubuntu user has the ability to sudo without a password; I have a separate inline block to set that up. The 07_networkcopy.cfg “true cloud-init” file gets copied to the cloud-init config file location, and packer cleans up the /etc/netplan directory and gets rid of a cloud-init file that prevents cloud-init from actually making network changes. (That “subiquity-etc.” step may not actually be necessary because cloud-init itself isn’t running any network commands, it’s merely copying a file directly to netplan and telling it to use it, but it was an artifact from earlier testing, and it worked this way, so I left it.) Then packer runs cloud-init clean, which tells the VM to re-run the cloud-init scripts on bootup.

After running the packer build, you get a disk image that contains no network information but that, when booted, attempts to mount the CDROM, copy the network file from it to /etc/netplan, and tell netplan to apply it. Every time I build a VM, that disk image gets cloned, and I generate a new netplan config file that gets passed in via the CDROM to be consumed by cloud-init when the VM boots.

Clear as mud?

Testing Ansible roles for CentOS8 with Molecule and Vagrant

Since RHEL8/CentOS8 came out, I’ve been working on updating all my Ansible roles at work and at home to run on the new OS. It can be a bit of a headache, since certain long-time features (NTP, for example) have disappeared and been replaced with other things (Chrony).

However, the bigger headache turned out to be running Molecule tests on the roles. At home, I had been running Molecule 2.12.1, which is pretty old, along with Ansible 2.7.5, and those versions really don’t seem to get along with CentOS 8. They really, really didn’t like it if I tried to use Vagrant for testing, either. Some roles I was able to test with Docker, but it seemed kludgey, and some roles (anything that interacts with systemd, for example) just don’t work in a Docker container.

It took some trial and error, but I found a configuration that seems to do the job. I do my testing on a CentOS7 server (until I get my roles fixed I don’t plan to deploy CentOS 8 anywhere). I upgraded Molecule, Ansible, and Vagrant accordingly:

  • molecule==2.22
  • ansible==2.9.13 (selected automatically when I specified Molecule 2.22)
  • python-vagrant==0.5.15 (latest)
  • Vagrant 2.2.10

I know that Molecule 3.0.x is available, but apparently it’s not really ready for primetime yet; the Vagrant provider doesn’t work at all with it.

Some gotchas:

  1. Molecule 2.22 requires python3, so you’ll need to install it via yum, dnf, or apt. I found that whatever release of the “python36” and “python36-devel” packages is currently available for CentOS works fine.
  2. Vagrant 2.2.10 requires the “fuse” and “fuse-libs” packages, at least on CentOS.
  3. You may have to update the molecule setup in your role with a “molecule init”. I ended up having to completely rebuild the molecule directory to get it to work with python3 and molecule 2.22, though I probably only really needed to create a new vagrant scenario and get rid of the old one. Run “molecule init --help” to get more information about this.
  4. I highly, highly recommend running the tests within a python virtual environment. This ensures that you have a “clean” setup every time you run the test, particularly if you’re automating it.

I have to admit: I was not very familiar with virtual environments prior to setting this up. (My python fu in general is pretty lacking, honestly.) It turned out to be fairly simple, though; to construct a virtual environment for python3, make sure the “venv” module is available. It’s part of the python3 standard library, so the python36 package above includes it; you can confirm with:

python3 -m venv --help

And then create the virtual environment:

python3 -m venv my_venv

You can replace “my_venv” with whatever string you like; python will create a subdirectory wherever you are with that name. Then you activate it:

. my_venv/bin/activate

Now you have a completely clean python3 environment into which you can install the molecule, ansible, etc. packages. I put the list of packages and the versions I want into a pip-requirements.txt file that looks like this:

molecule==2.22
python-vagrant==0.5.15

Then it’s a matter of running the install:

pip install -r test/pip-requirements.txt

Doing this for every automated test does add some processing time, but it’s the best way to ensure that you have a completely clean python3 environment with the packages you want, along with their dependencies, and nothing that you don’t need.
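
Putting that all together, a local test run looks roughly like this (a sketch; I’m assuming the scenario is named “vagrant”, and molecule_venv is the same directory name mentioned further down):

python3 -m venv molecule_venv
. molecule_venv/bin/activate
pip install -r test/pip-requirements.txt
molecule test -s vagrant
deactivate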

If you’re running automated tests in Jenkins, there is a handy plugin named “Pyenv Pipeline” that allows you to wrap your stages and commands inside a “withPythonEnv” block like this:

      withPythonEnv('/usr/bin/python3') {
        stage ("Set up virtual environment") {
          sh 'pip install -r test/pip-requirements.txt'
        }
        stage ("Install vault password file") {
          withCredentials([string(credentialsId: 'ansiblevaultpass', variable: 'vault_pass')]) {
            sh "echo ${vault_pass} > test/.vault_pass"
            sh "chmod 0400 test/.vault_pass"
          }
        }
        stage ("Executing Molecule Lint") {
          sh 'molecule --debug lint'
        }
<snip>

Something else to remember is that the molecule linting features will look at the virtual environment directory unless told otherwise, so whatever you name your local directory needs to be ignored, as well as the one that’s created by the Jenkins plugin. Make sure they’re listed in the “ignore” block in the .yamllint file in the root directory:

---
# Based on ansible-lint config
extends: default

ignore: |
  .pyenv-python/
  molecule_venv/
  .pyenv-usr-bin-python3/

Python3 and python2 use different directories in the Jenkins plugin; .pyenv-python is for python2, and .pyenv-usr-bin-python3 is (obviously) for python3. “molecule_venv” is the one I created for local testing.

I found that git automatically ignored the virtual environment directories. I’m not entirely clear on how it knew to do that, but I did not have to specifically exclude any of those directories in .gitignore.

Happy testing!

Spacewalk refusing to sync channels and throwing befuddling Postgres errors

In my home lab I run an instance of “Spacewalk,” which is an open-source version of Red Hat’s Satellite 5 tool. It allows you to manage CentOS, Red Hat, and Fedora devices in various ways, but it’s mostly used to manage “yum” packages.

The gist of the package management setup is that you set up “channels,” which contain links to a variety of internet repos, and download the packages from those repos according to a schedule you set up, stashing them locally to save download time when installing or upgrading software on a device. It stores a ton of metadata about the packages in a Postgres database. You subscribe your hosts to the channels you want so that they can see certain packages, depending on what you want installed where. For example, I have a “Cloudera” channel containing all the server and agent RPMs for Cloudera, that only my VMs that run Cloudera are subscribed to.

You can do nifty tricks with “clone” channels. Because the main channels get updated with new packages all the time, if you do a patch cycle that lasts several months because you have hundreds of servers to upgrade, you can run into a situation where you upgraded a dev VM to kernel-3.10.0-514.el7.x86_64 on June 1st, and then by the time you got around to prod on August 1st, the latest kernel in the channel had moved on to something newer, leaving dev and prod on different versions. To get around that, at the start of your cycle, you clone the main channel, and subscribe all the servers to the clone; it’s essentially a snapshot of the channel. Then the servers only get the packages that were in the channel on the day you cloned it. Next patch cycle, repeat the process.

I had a wonky issue recently where my Spacewalk VM rebooted and came back up, but was unable to mount the NFS share where I actually store all the RPMs (I’ve got about 1TB of them at this point, so it makes sense to stash them on my giant FreeNAS instead of using up Nutanix storage). It dutifully started up Spacewalk anyway, and started sticking RPMs into /var/satellite and filling up /var. This is annoying, but not fatal. What became more of a problem was my assumption that Spacewalk would be smart enough to resync the packages after I deleted them, mounted the NFS share, and restarted Spacewalk.

Instead, it started throwing weird errors about duplicates:

2020/07/20 16:07:06 -04:00 ERROR: (23, 'ERROR:  duplicate key value violates unique constraint "rhn_cnp_cid_nid_uq"', 'Could not update database entry.')

I did some googlin’ and the results were inconclusive. An official Red Hat support page (I have access because of my job) seemed to indicate that I was going to have to do some Postgres hacking, which was a scary thought. (At that point I wisely created a VM snapshot of the Spacewalk server so I didn’t blow the whole thing up and not have a backup.) I also came across a site that seemed to indicate I could solve the problem from within Spacewalk by looking for “orphaned” packages (RPMs that are not affiliated with any channel) and deleting them. However, when I went to look for them, I had 7238 orphaned packages! That seemed…odd.

I knew I had a good snapshot of the VM and could restore, but the NFS data would be more tricky; if I deleted that, it was gone from the NAS, and wouldn’t be contained in the snapshot. (It occurred to me later that I believe FreeNAS also has a snapshot capability, but it didn’t occur to me at the time.) I figured if things really got dire, Spacewalk would probably redownload the missing stuff, so I threw caution to the wind and deleted all 7000+ orphaned packages.

It took a while, and Spacewalk itself got so busy doing it that it wouldn’t talk to me for a while, but sure enough, it threw a bunch of messages in /var/log/rhn/rhn_taskomatic_daemon.log about packages not actually existing on the filesystem:

INFO   | jvm 1    | 2020/07/20 15:40:47 | 2020-07-20 15:40:47,709 [DefaultQuartzScheduler_Worker-8] ERROR com.redhat.rhn.taskomatic.task.PackageCleanup - /var/spacewalk/redhat/1/462/rh-python36-numpy/1:1.13.1-1.el7/x86_64/462e4a2f951afca
6ce14076c9e859784a5e7630112c07d77b30bcef2dc410299/rh-python36-numpy-1.13.1-1.el7.x86_64.rpm not found

After it successfully got them deleted, I attempted a sync again, from the command line:

[root@spacewalk rhn]# /usr/bin/spacewalk-repo-sync --channel centos_6_x86_64
08:08:00 ======================================
08:08:00 | Channel: centos_6_x86_64
08:08:00 ======================================
08:08:00 Sync of channel started.
08:08:00 Repo URL: http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=os
08:08:04 Packages in repo:              6713
08:11:46 Packages already synced:       6216
08:11:46 Packages to sync:               497
08:11:46 New packages to download:         0
08:11:46 Linking packages to channel.
08:11:57 ERROR: (23, 'ERROR:  duplicate key value violates unique constraint "rhn_cnp_cid_nid_uq"', 'Could not update database entry.')
08:11:57 Repo URL: http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=updates
Traceback (most recent call last):
  File "/usr/bin/spacewalk-repo-sync", line 257, in <module>
    sys.exit(abs(main() or 0))
  File "/usr/bin/spacewalk-repo-sync", line 240, in main
    elapsed_time, channel_ret_code = sync.sync()
  File "/usr/lib/python2.7/site-packages/spacewalk/satellite_tools/reposync.py", line 453, in sync
    """, repo_id=int(repo_id))
  File "/usr/lib/python2.7/site-packages/spacewalk/server/rhnSQL/__init__.py", line 292, in fetchall_dict
    h.execute(sql, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/spacewalk/server/rhnSQL/sql_base.py", line 151, in execute
    return self._execute_wrapper(self._execute, *p, **kw)
  File "/usr/lib/python2.7/site-packages/spacewalk/server/rhnSQL/driver_postgresql.py", line 302, in _execute_wrapper
    raise sql_base.SQLSchemaError(error_code, e.pgerror, e)
spacewalk.server.rhnSQL.sql_base.SQLSchemaError: (99999, 'ERROR:  current transaction is aborted, commands ignored until end of transaction block', '', InternalError('current transaction is aborted, commands ignored until end of transact
ion block\n',))

Dang it. Same error. Okay, let’s look at the Postgres logs. In this case, Spacewalk sets up Postgres with some default settings, and stashes pg_log in /var/lib/pgsql/data/pg_log, rolling it over every day and saving 7 days of logs. I’m still not entirely sure I needed to, but I increased the verbosity in /var/lib/pgsql/data/postgresql.conf:

client_min_messages = debug1

And ran “rhn-satellite restart” to refresh the whole Spacewalk environment, including Postgres. Then I ran the repo-sync again. As it failed, Postgres threw this error:

2020-07-20 13:32:17.644 EDT ERROR:  duplicate key value violates unique constraint "rhn_cnp_cid_nid_uq"
2020-07-20 13:32:17.644 EDT DETAIL:  Key (channel_id, name_id, package_arch_id)=(101, 4586, 120) already exists.
2020-07-20 13:32:17.644 EDT CONTEXT:  SQL statement "insert into rhnChannelNewestPackage
                        (channel_id, name_id, evr_id, package_id, package_arch_id)
                        (select channel_id,
                                name_id, evr_id,
                                package_id, package_arch_id
                           from rhnChannelNewestPackageView
                          where channel_id = channel_id_in
                            and (package_name_id_in is null
                                 or name_id = package_name_id_in)
                        )"
        PL/pgSQL function rhn_channel.refresh_newest_package(numeric,character varying,numeric) line 9 at SQL statement

So apparently the issue is with whatever row in rhnChannelNewestPackage has channel_id=101, name_id=4586, and package_arch_id=120. Let’s see if we can figure out what that is. I su’d to the postgres user and accessed the database:

[root@spacewalk reposync]# su - postgres
Last login: Mon Jul 20 15:10:45 EDT 2020 on pts/2
-bash-4.2$ psql rhnschema
psql (9.2.23)
Type "help" for help.

rhnschema=# select * from rhnChannelNewestPackage where channel_id=101 and name_id=4586;
 channel_id | name_id | evr_id | package_arch_id | package_id
------------+---------+--------+-----------------+------------
        101 |    4586 |  19624 |             120 |      63550
(1 row)

Okay, let’s see what that guy is.

rhnschema=# select * from rhnpackage where id=63550;
  id   | org_id | name_id | evr_id | package_arch_id | package_group | rpm_version |                             description                              |                    summary                    | package_size | payload_size | ins
talled_size |        build_host         |     build_time      | source_rpm_id | checksum_id | vendor | payload_format | compat |                                                                      path
                                    | header_sig | copyright | cookie |     last_modified      |            created            |           modified            | header_start | header_end
-------+--------+---------+--------+-----------------+---------------+-------------+----------------------------------------------------------------------+-----------------------------------------------+--------------+--------------+----
------------+---------------------------+---------------------+---------------+-------------+--------+----------------+--------+-------------------------------------------------------------------------------------------------------------
------------------------------------+------------+-----------+--------+------------------------+-------------------------------+-------------------------------+--------------+------------
 63550 |      1 |    4586 |  19624 |             120 |            13 | 4.8.0       | This package provides the sar and iostat commands for Linux. Sar and+| The sar and iostat system monitoring commands |       239820 |       854664 |
     846133 | x86_64_01.bsys.centos.org | 2018-01-26 02:43:42 |         26153 |     7595952 | CentOS | cpio           |      0 | redhat/1/b05/sysstat/9.0.4-33el6_9.1/x86_64/b05d4090caae0fb2df7005a07862ce2ce82e88ced1a58f42f2eb88769e47a45b
/sysstat-9.0.4-33el6_9.1.x86_64.rpm |            | GPLv2+    |        | 2018-01-28 11:05:36-05 | 2018-01-28 21:05:38.918665-05 | 2018-01-28 21:05:38.918665-05 |         1384 |      27824
       |        |         |        |                 |               |             | iostat enable system monitoring of disk, network, and other IO      +|                                               |              |              |
            |                           |                     |               |             |        |                |        |
                                    |            |           |        |                        |                               |                               |              |
       |        |         |        |                 |               |             | activity.                                                            |                                               |              |              |
            |                           |                     |               |             |        |                |        |
                                    |            |           |        |                        |                               |                               |              |
(1 row)

sysstat-9.0.4-33el6_9.1.x86_64.rpm seems to be the problem. I thought about simply deleting that row from rhnPackage and rhnChannelNewestPackage, but decided it would probably be cleaner to let Spacewalk do it and theoretically take care of all the bad metadata itself. So I logged back into spacewalk, went to the Channels tab atop the dashboard, then clicked Manage Software Channels in the left-side menu, followed by Manage Software Packages. I selected “All Managed Packages” in the dropdown, searched for sysstat, found the 9.0.4-33el6_9.1 version, selected and deleted it. Then I ran the repo-sync again.

Argh. Same error. Okay, let’s check Postgres again.

2020-07-21 08:11:57.106 EDT DETAIL:  Key (channel_id, name_id, package_arch_id)=(101, 4083, 100) already exists.

Curious. Same error, with a different name_id. Let’s see what that is.

rhnschema=# select * from rhnChannelNewestPackageView where channel_id=101 and name_id=4083;
 channel_id | name_id | evr_id | package_arch_id | package_id
------------+---------+--------+-----------------+------------
        101 |    4083 |  19623 |             100 |      63530
(1 row)

What the flip? That package_id is almost the same as the one from before. And the sysstat package I deleted is definitely gone!

rhnschema=# select * from rhnpackage where id=63550;
 id | org_id | name_id | evr_id | package_arch_id | package_group | rpm_version | description | summary | package_size | payload_size | installed_size | build_host | build_time | source_rpm_id | checksum_id | vendor | payload_format | co
mpat | path | header_sig | copyright | cookie | last_modified | created | modified | header_start | header_end
----+--------+---------+--------+-----------------+---------------+-------------+-------------+---------+--------------+--------------+----------------+------------+------------+---------------+-------------+--------+----------------+---
-----+------+------------+-----------+--------+---------------+---------+----------+--------------+------------
(0 rows)

Okay, it must be stashing a row of metadata in some table somewhere. I deleted it from rhnChannelNewestPackage manually:

rhnschema=# delete from rhnChannelNewestPackage where channel_id=101 and name_id=4083;
DELETE 1
rhnschema=# select * from rhnChannelNewestPackage where channel_id=101 and name_id=4083;
 channel_id | name_id | evr_id | package_arch_id | package_id
------------+---------+--------+-----------------+------------
(0 rows)

There was also a mention in the Postgres log of rhnChannelNewestPackageView, which still had a row mentioning that name_id:

rhnschema=# select * from rhnChannelNewestPackageView where channel_id=101 and name_id=4083;
 channel_id | name_id | evr_id | package_arch_id | package_id
------------+---------+--------+-----------------+------------
        101 |    4083 |  19623 |             100 |      63530
(1 row)

That object is clearly a view, and therefore I assumed I wouldn’t be able to delete directly from it, only from the tables underneath it, but thought maybe there might be some kind of trigger set up that would allow you to delete things and it would do some cleanup underneath. Might as well give it a shot:

rhnschema=# delete from rhnChannelNewestPackageView where channel_id=101 and name_id=4083;
ERROR:  cannot delete from view "rhnchannelnewestpackageview"
HINT:  You need an unconditional ON DELETE DO INSTEAD rule or an INSTEAD OF DELETE trigger.

Yep, as I figured. Okay, let’s see if it’ll tell me what tables that view encompasses:

rhnschema=# \d rhnchannelnewestpackageview
View "public.rhnchannelnewestpackageview"
     Column      |  Type   | Modifiers
-----------------+---------+-----------
 channel_id      | numeric |
 name_id         | numeric |
 evr_id          | numeric |
 package_arch_id | numeric |
 package_id      | numeric |

Not helpful. At this point, sadly, I had run out of Postgres expertise (I don’t have any to begin with, really, other than wrangling Slony clusters at work), so it was down to shots in the dark. I saw there was an rhnChannelPackage table and thought maybe it still contains a row for that package:

rhnschema=# select * from rhnchannelpackage where channel_id=101 and name_id=4083;
ERROR:  column "name_id" does not exist
LINE 1: ...* from rhnchannelpackage where channel_id=101 and name_id=40...

Heaven forfend it have the same table structure as rhnChannelNewestPackage. Argh. Okay, what is its structure?

rhnschema=# \d rhnchannelpackage
                Table "public.rhnchannelpackage"
   Column   |           Type           |       Modifiers
------------+--------------------------+------------------------
 channel_id | numeric                  | not null
 package_id | numeric                  | not null
 created    | timestamp with time zone | not null default now()
 modified   | timestamp with time zone | not null default now()
Indexes:
    "rhn_cp_cp_uq" UNIQUE, btree (channel_id, package_id)
    "rhn_cp_pid_idx" btree (package_id)
Foreign-key constraints:
    "rhn_cp_cid_fk" FOREIGN KEY (channel_id) REFERENCES rhnchannel(id) ON DELETE CASCADE
    "rhn_cp_pid_fk" FOREIGN KEY (package_id) REFERENCES rhnpackage(id)
Triggers:
    rhn_channel_package_mod_trig BEFORE INSERT OR UPDATE ON rhnchannelpackage FOR EACH ROW EXECUTE PROCEDURE rhn_channel_package_mod_trig_fun()

Well golly, look at all that. None of those indexes or foreign keys were mentioned in the original error, but it wouldn’t have surprised me if they were somehow involved through the table structure. Let’s see if there’s a row matching our wonky sysstat package:

rhnschema=# select * from rhnchannelpackage where channel_id=101 and package_id=63530;
 channel_id | package_id |           created            |           modified
------------+------------+------------------------------+------------------------------
        101 |      63530 | 2018-01-28 21:05:58.69026-05 | 2018-01-28 21:05:58.69026-05
(1 row)

Well knock me over with a feather. Okey dokey, let’s eliminate that thang:

rhnschema=# delete from rhnchannelpackage where channel_id=101 and package_id=63530;
DELETE 1

Now, it still appeared in rhnChannelNewestPackageView, unfortunately:

rhnschema=# select * from rhnChannelNewestPackageView where channel_id=101 and name_id=4083;
 channel_id | name_id | evr_id | package_arch_id | package_id
------------+---------+--------+-----------------+------------
        101 |    4083 |   1685 |             100 |       5380
(1 row)

But I wasn’t sure what the heck else to try, so figured it wouldn’t hurt anything to do a reposync again:

[root@spacewalk rhn]# /usr/bin/spacewalk-repo-sync --channel centos_6_x86_64
08:20:52 ======================================
08:20:52 | Channel: centos_6_x86_64
08:20:52 ======================================
08:20:52 Sync of channel started.
08:20:52 Repo URL: http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=os
08:20:57 Packages in repo:              6713
08:23:34 Packages already synced:       6216
08:23:34 Packages to sync:               497
08:23:34 New packages to download:         0
08:23:34 Linking packages to channel.
08:23:38 Repo http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=os has comps file comps.xml.

Hey, it never got that far without throwing an error. That’s promising.

08:23:38 Repo http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=os has 0 errata.
08:23:38 Repo URL: http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=updates
08:23:42 Packages in repo:              1068
08:23:46 Packages already synced:          4
08:23:46 Packages to sync:              1064
08:23:46 New packages to download:      1064
08:23:46 1/1064 : 389-ds-base-1.2.11.15-97.el6_10.x86_64.rpm
08:23:47 2/1064 : adcli-0.8.1-3.el6_10.x86_64.rpm
08:23:47 3/1064 : 389-ds-base-devel-1.2.11.15-97.el6_10.i686.rpm
08:23:47 4/1064 : autofs-5.0.5-140.el6_10.1.x86_64.rpm
08:23:47 5/1064 : 389-ds-base-devel-1.2.11.15-97.el6_10.x86_64.rpm
08:23:47 6/1064 : 389-ds-base-libs-1.2.11.15-97.el6_10.x86_64.rpm
08:23:47 7/1064 : 389-ds-base-libs-1.2.11.15-97.el6_10.i686.rpm
08:23:47 8/1064 : autofs-5.0.5-140.el6_10.x86_64.rpm
08:23:47 9/1064 : bind-chroot-9.8.2-0.68.rc1.el6_10.1.x86_64.rpm
08:23:47 10/1064 : bind-chroot-9.8.2-0.68.rc1.el6_10.3.x86_64.rpm
08:23:47 11/1064 : bind-chroot-9.8.2-0.68.rc1.el6_10.7.x86_64.rpm
08:23:47 12/1064 : bind-devel-9.8.2-0.68.rc1.el6_10.1.i686.rpm
08:23:47 13/1064 : bind-devel-9.8.2-0.68.rc1.el6_10.1.x86_64.rpm
08:23:48 14/1064 : bind-devel-9.8.2-0.68.rc1.el6_10.3.i686.rpm
08:23:48 15/1064 : bind-devel-9.8.2-0.68.rc1.el6_10.3.x86_64.rpm
08:23:48 16/1064 : bind-devel-9.8.2-0.68.rc1.el6_10.7.i686.rpm
08:23:48 17/1064 : bind-devel-9.8.2-0.68.rc1.el6_10.7.x86_64.rpm
08:23:48 18/1064 : bind-9.8.2-0.68.rc1.el6_10.7.x86_64.rpm
08:23:48 19/1064 : bind-9.8.2-0.68.rc1.el6_10.3.x86_64.rpm
08:23:48 20/1064 : bind-libs-9.8.2-0.68.rc1.el6_10.3.x86_64.rpm
08:23:48 21/1064 : bind-libs-9.8.2-0.68.rc1.el6_10.7.x86_64.rpm
08:23:48 22/1064 : bind-9.8.2-0.68.rc1.el6_10.1.x86_64.rpm

Yaaaaaaaasssssss!

I’m honestly not entirely sure what fixed it, but it seems like the safest thing is to look at the Postgres logs, figure out what package is causing it to burp, and then go into Postgres and delete any reference to it from the database. Then try the repo-sync again. If that doesn’t work, try prayer.
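
For my particular error, after deleting the offending sysstat package through the Spacewalk UI, the database cleanup condensed down to a couple of deletes as the postgres user, keyed off the (channel_id, name_id, package_id) values from the Postgres DETAIL lines and the rhnChannelNewestPackageView query (these are just the interactive psql statements from above collapsed into commands; swap in whatever IDs your own logs complain about):

psql rhnschema -c "delete from rhnChannelNewestPackage where channel_id=101 and name_id=4083;"
psql rhnschema -c "delete from rhnChannelPackage where channel_id=101 and package_id=63530;"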

Fixing a Galera MariaDB cluster that will only start one node; or, why I’m an idiot

I set up a 3-node Galera MariaDB cluster at home. The process involved in doing that is somewhat complicated, though not outrageously so, but beyond the scope of this post, which is about fixing a dumb mistake I made during setup.

Recently I had a fairly sizeable issue with my Nutanix CE cluster that resulted in a lot of VMs crashing, including some subset of my MariaDB VMs. Once I got everything fixed, I checked that database access through my HAProxy setup worked, and it did, so I assumed that all three nodes were hale and happy.

Today, I wanted to do some updates to my Nagios monitoring setup so that it would actually keep an eye on the MariaDB services on the cluster IP and all three nodes. Oddly, two of the three nodes were failing to respond, though the cluster IP worked fine because node3 was happy.

On each of node1 and node2, I tried to start MariaDB and got:

Jul 14 20:08:41 mariadb02 mysqld[1699]: 2020-07-14 20:08:41 0 [Note] WSREP: view((empty))
Jul 14 20:08:41 mariadb02 mysqld[1699]: 2020-07-14 20:08:41 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
Jul 14 20:08:41 mariadb02 mysqld[1699]:          at gcomm/src/pc.cpp:connect():158
Jul 14 20:08:41 mariadb02 mysqld[1699]: 2020-07-14 20:08:41 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -110 (Connection timed out)
Jul 14 20:08:41 mariadb02 mysqld[1699]: 2020-07-14 20:08:41 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1608: Failed to open channel 'hearn_mariadb_cluster' at 'gcomm://10.10.5.2,10.10.5.3,<truncated for length>
Jul 14 20:08:41 mariadb02 mysqld[1699]: 2020-07-14 20:08:41 0 [ERROR] WSREP: gcs connect failed: Connection timed out
Jul 14 20:08:41 mariadb02 mysqld[1699]: 2020-07-14 20:08:41 0 [ERROR] WSREP: wsrep::connect(gcomm://10.10.5.2,10.10.5.3,10.1.5.4) failed: 7
Jul 14 20:08:41 mariadb02 mysqld[1699]: 2020-07-14 20:08:41 0 [ERROR] Aborting
Jul 14 20:08:41 mariadb02 systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE
Jul 14 20:08:41 mariadb02 systemd[1]: mariadb.service: Failed with result 'exit-code'.
Jul 14 20:08:41 mariadb02 systemd[1]: Failed to start MariaDB 10.4.11 database server.

So I did what any self-respecting nerd does, and started googling. I found a lot of stackoverflow posts where people had misconfigured firewalls, so I turned those off for testing purposes. No change. I followed the Galera instructions on re-bootstrapping a cluster, but nothing fixed it. I rebooted all three nodes; nope. This thing had started up fine when I set it up a few months ago, what the hell was going wrong now?

Then I took a closer look at the error message, and noted something significant that made me feel real dumb:

Jul 14 20:08:41 mariadb02 mysqld[1699]: 2020-07-14 20:08:41 0 [ERROR] WSREP: wsrep::connect(gcomm://10.10.5.2,10.10.5.3,10.1.5.4) failed: 7

I’ve changed the IPs from my real ones, but left in the dumb typo to demonstrate: see how the first two addresses are in the 10.10.5.0/24 network, but the third one starts with 10.1 instead of 10.10? Yeah, I fat-fingered the galera configuration file on all three nodes. The reason the cluster had started fine in the first place was that when I set it up, I did the initial bootstrap on the 10.10.5.2 VM, and the other two nodes were able to talk to it to join the cluster. After the last round of crashes, mariadb03 became the primary node, but neither of the other two could talk to it because they had the wrong IP address for it.
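
For reference, the culprit was the cluster address line in the galera config on each node (a sketch; the exact file path varies by distro, something like /etc/my.cnf.d/galera.cnf, and these are the same sanitized IPs as above):

# what I had:
wsrep_cluster_address="gcomm://10.10.5.2,10.10.5.3,10.1.5.4"
# what it should have been:
wsrep_cluster_address="gcomm://10.10.5.2,10.10.5.3,10.10.5.4"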

Fixed that, started mariadb on the other two nodes, and now everybody’s fat and happy. And I continue to be an idiot.

Recovering the “install” user on a Nutanix CE node

I run a four-node Nutanix Community Edition cluster at home, on a batch of old HP DL360s scavenged from the office when they were throwing things out some years ago. It’s generally reliable; occasionally there’s a network glitch that takes out a bunch of VMs, but most of the time it stays up, happy and hale.

Hardware failures happen, however, particularly to disks. The cluster has 28 disks, so it’s inevitable that they wear out, particularly since my dusty basement is hardly a pristine data center environment. The cluster is also reliant on SSDs, where most of the I/O is directed from the VMs, and which also host the Controller VMs, so on occasion the SSDs wear out and fail. It’s a huge hassle to repair those, because it essentially requires you to reinstall the CVM and hope that the cluster recognizes and readmits it.

It’s doubly hard if the “install” user has disappeared, which apparently happens during the cleanup process that occurs after a cluster software upgrade. I had installed the cluster using a 2017 image, and then upgraded it in mid-2019; when I went back to try and repair a failed SSD, none of the install scripts were there, and even the install user had been removed from all four hosts, making it impossible to rebuild the CVM on the new drive.

To get around this, I went and found the full install image (which normally you stick on a USB or SD card to use as a boot disk), which contained the necessary files, but figuring out how to mount it and get the files off without actually burning a USB stick was a bit of a trial.

Mounting on Windows proved nearly impossible, so I copied the file to where one of my Linux VMs could get at it. The issue is that the disk image is partitioned, just like a regular physical or virtual disk would be, but of course the operating system doesn’t recognize a file as a disk device. Luckily, the “mount” command has an “offset” feature that allows you to tell it where to start looking for an actual filesystem. You can figure out that offset value using fdisk:

[root@mybox Nutanix]# fdisk -l ce-2019.02.11-stable.img

Disk ce-2019.02.11-stable.img: 7444 MB, 7444889600 bytes, 14540800 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000d3ba3

                   Device Boot      Start         End      Blocks   Id  System
ce-2019.02.11-stable.img1   *        2048    14540799     7269376   83  Linux

It even sticks a “1” on the end of the image name, just like a disk partition, though if you tried “mount ce-2019.02.11-stable.img1 /mnt” it would have no idea what you were talking about.

To figure out the offset number, note that the partition starts at sector 2048, and each sector is 512 bytes; 2048 * 512 = 1048576, so that’s the offset. You can then mount it read-only:

 mount -o ro,loop,offset=1048576 ce-2019.02.11-stable.img /mnt/loop
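
If you’d rather not do the sector math by hand, you can pull the start sector out of fdisk and let the shell multiply (a sketch keyed to the output above, where the start sector is the third field because of the boot-flag asterisk):

START=$(fdisk -l ce-2019.02.11-stable.img | awk '/img1/ {print $3}')
mkdir -p /mnt/loop
mount -o ro,loop,offset=$((START * 512)) ce-2019.02.11-stable.img /mnt/loop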

After that I just used rsync to copy the data over to the affected host, recreated the install user and group (both using uid/gid 1000), and was able to rebuild my CVM. Unfortunately I ran ~install/cleanup.sh first, which seemed to wipe all the configuration from the host, and made the cluster forget all about it, after which I was unable to add the host back in, possibly because of the failing SD card on another host…my cluster is a bit FUBAR at the moment. But at least I got the disk replaced!

Over-subscription of thin-provisioned virtual disks leading to Nutanix cluster issues

An issue with both physical and virtual environments is unused resources, particularly memory and storage. If you allocate 250GB to a VM, and it’s only using 50GB of it, that’s a lot of wasted disk. Multiply that by dozens of VMs and you’ve got a lot of storage not being used. The good news is that modern hypervisors offer “thin-provisioned” virtual disks: thin-provisioning means that, even if you set the disk size to 250GB, if the OS is only using 50GB of it, you’re truly only using 50GB of it at the physical layer. Which means that you can “over-subscribe” your storage by allocating more of it in virtual disks than you actually have, and then the VMs can grab some of that additional 200GB if they need it for something, and then when they’re done with it, the hypervisor takes it back. You don’t need to tightly monitor each VM’s utilization; you just give it as much as it’s likely to need for regular operations, and just monitor the overall physical storage utilization of your environment.

The problem is that, if you over-subscribe your storage (i.e., allocate 15TB of “thin-provisioned” storage to VMs in a physical environment containing only 10TB of storage), and your VMs do use it all up at once, you’re in trouble. So generally you don’t want to oversubscribe by more than, say, 25%, depending on how much your VM storage requirements fluctuate.

Another issue that I discovered recently relates to how VMware recovers unused space from the VMs, and how Red Hat Enterprise Linux communicates to VMware that virtual storage is no longer needed, particularly in an environment with Nutanix infrastructure.

A short primer on Nutanix, if you aren’t familiar with it: you take a number of physical hosts and combine them into a cluster, combining the storage into a large pool. The VMs can live on any host in the cluster, using the CPU and memory resources available there, and their data is spread throughout the cluster and replicated for reliability. If a host fails, the VMs get restarted on other hosts, and any datablocks that were living on that host are restored from the copies available on other hosts. You can use a number of different hypervisors; if, for example, you want to use VMware, the storage containers created at the Nutanix level are presented to all the cluster hypervisors as regular NFS storage.

I support a Nutanix cluster that is dedicated to a large application with a lot of data constantly being written and cleaned up, several hundred GB every day. The cluster has a total of about 290TB of storage; with a “replication factor” of 2 (each block has a copy on another host in the cluster), that gives us about 145TB of “logical” space. We have about 15 very large VMs, each of which has a bit over 8TB of disk attached, along with some other smaller VMs, totalling about 114TB of storage, all thin-provisioned. We also use “protection domains” to create and replicate snapshots of the VMs to send to a “backup” cluster in another data center, and the delta disks for those snapshots take up several additional TB in storage.

I started getting alerts that we were using more than 75% of the available storage in the cluster, and it continued to grow, approaching 90%. Nutanix requires a certain amount of free space in order to operate, and in order to survive a host failure, since a host failure obviously reduces the overall available storage. If we went much beyond 90%, the cluster would begin to have lots of difficult-to-resolve issues (VMs freezing, filesystems becoming “read-only”, etc.). But wait…the VMs are only using about 35% of their virtual disks! Why is Nutanix saying we’re using almost 90%?

I took a look at our vSphere client, and it reported similar numbers to Nutanix: each VM was using 7+ TB of its 8TB storage. How was this possible?

We initially thought that the issue was with the protection domain snapshots. The way that snapshots work in Nutanix is similar to the way they work in VMware; when a snapshot is created, new data being written goes to “delta” disk files, and the original disk files are left quiesced so that they can be safely copied somewhere else, or restored in the event of a failure. Eventually, when the snapshot is deleted, the delta disk files get merged back into the main disk files. We had recently enabled protection domains for all the VMs, and the storage had been growing ever since, so it seemed a likely cause. Was Nutanix simply failing to clean up delta disks, and they were just growing and growing as time went on from the creation of the initial snapshots? The fact that the “.snapshots” directory within the datastore seemed to be using a remarkable 126TB of storage seemed to bear this out.

We called Nutanix, who explained that no, the .snapshot directory wasn’t actually using that much storage, or at least, not exclusive of the actual data disks. The data disks were stored there, and the files that appeared to be data disks in the usual VM directories were just pointing to them. So that turned out to be a red herring.

The actual issue seems to be that the operating system on the VMs, in this case Red Hat Enterprise Linux 6 (RHEL6), and VMware don’t communicate very well about what space is actually reclaimable. When a file gets deleted in RHEL6, the data doesn’t actually get cleaned up right away; what gets removed is the “inode,” essentially a pointer that describes where on the disk to find the data. The data itself doesn’t get removed; it just gets overwritten later on by new data as needed. Over time, as the application wrote many GB a day to its filesystems, that data continued to stay on the disks, even though the inodes pointing to it had been deleted and the OS reported the space as available. VMware saw all those ones and zeroes and assumed that this was live data that could not be cleaned up by the hypervisor processes that maintain the thin provisioning.

In short, in an environment like this with lots and lots of data coming in and getting “deleted,” you can’t actually oversubscribe the storage because the OS and VMware don’t talk to each other very well about what data is actually needed.

The fun part was the trick I did in the short term to clean things up until I could replace the virtual disks with smaller ones and “de-over-subscribe” the storage. It turns out if you write a very large file to the filesystem containing nothing but zeroes, and then delete it, VMware will see that as reclaimable space. It took a few days to run on all the filesystems in the cluster, but I used the “dd” command to output from /dev/zero to a file in the directory, and then removed it; VMware would eventually knock back the reported storage utilization, and Nutanix’s “Curator” scans would reclaim the space as well.

dd if=/dev/zero of=/datastore01/file_full_of_zeroes bs=1M count=1000000
rm /datastore01/file_full_of_zeroes

The first command creates a 1TB file containing nothing but zeroes; the second removes it. Doing that on each of the 4 datastores on each VM got the reported utilization down from 7TB to 4TB per VM, and the Nutanix-reported utilization back under 60%. Some information I googled up suggested that I might need to run operations on the virtual disks from within the hypervisor, or possibly do a storage migration, to get VMware to reclaim the space, but it happened automatically (we’re running ESX 6.0.0, which may have different space-reclamation techniques than previous versions).
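For the record, the per-VM cleanup amounted to a loop like this; a minimal sketch that assumes the four data filesystems are mounted at /datastore01 through /datastore04 (only /datastore01 appears above, so the other names are placeholders), and that each has room for the temporary 1TB file:

# Zero-fill free space on each data filesystem, then delete the file so
# the hypervisor and Curator can reclaim it. This takes hours; run one
# filesystem at a time to avoid hammering the storage.
for fs in /datastore01 /datastore02 /datastore03 /datastore04; do
    dd if=/dev/zero of="$fs/file_full_of_zeroes" bs=1M count=1000000
    rm -f "$fs/file_full_of_zeroes"
done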

Molecule: Failed to import docker-py – cannot import name certs

When trying to do a molecule test of an Ansible role on a friend’s jumpbox, I kept getting this error:

thehearn@thatbox99> molecule converge -s docker
<snip>
    TASK [Discover local Docker images] ********************************************
    failed: [localhost] (item=None) => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}
    failed: [localhost] (item=None) => {"censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result", "changed": false}

Running in “debug” mode gave me a bit more detail, leading to this message:

thehearn@thatbox99> molecule --debug converge -s docker
<snip>
        "msg": "Failed to import docker-py - cannot import name certs. Try `pip install docker-py`"

docker-py is already installed, though:

thehearn@thatbox99> pip freeze | grep docker-py
docker-py==1.10.6
docker-pycreds==0.4.0

Weird. What the heck is going on? Apparently, the “certs” module is part of the “requests” package. However, that’s installed as well:

thehearn@thatbox99> pip freeze | grep requests
requests==2.22.0
requests-ntlm==1.1.0
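If you want to confirm that the problem is in “requests” itself rather than docker-py, a direct import is a quick sanity check (not something I captured at the time, just an easy way to see it):

python -c "from requests import certs; print(certs.where())"

If that throws “cannot import name certs”, the requests installation is broken, no matter what pip freeze says.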

Huh. I guess I’ll just remove and reinstall it and see if that does the job.

thehearn@thatbox99> sudo pip uninstall requests
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support
Uninstalling requests-2.18.4:
  Would remove:
    /usr/lib/python2.7/site-packages/requests-2.18.4.dist-info/*
    /usr/lib/python2.7/site-packages/requests/*
  Would not remove (might be manually added):
    /usr/lib/python2.7/site-packages/requests/packages/__init__.py
    /usr/lib/python2.7/site-packages/requests/packages/__init__.pyo
Proceed (y/n)? y
  Successfully uninstalled requests-2.18.4
thehearn@thatbox99> sudo pip install requests
Installing collected packages: requests
  Found existing installation: requests 2.6.0
ERROR: Cannot uninstall 'requests'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

Aha. Looks like we had conflicting versions of “requests” installed at the same time: 2.18.4 and 2.6.0 (the latter installed via distutils). Okay, let’s see if we can force the reinstall:

thehearn@thatbox99> sudo pip install requests --upgrade --force-reinstall
<snip>
Installing collected packages: urllib3, certifi, chardet, idna, requests
  Found existing installation: urllib3 1.22
    Uninstalling urllib3-1.22:
      Successfully uninstalled urllib3-1.22
  Found existing installation: certifi 2018.4.16
    Uninstalling certifi-2018.4.16:
      Successfully uninstalled certifi-2018.4.16
  Found existing installation: chardet 3.0.4
    Uninstalling chardet-3.0.4:
      Successfully uninstalled chardet-3.0.4
  Found existing installation: idna 2.6
    Uninstalling idna-2.6:
      Successfully uninstalled idna-2.6
  Found existing installation: requests 2.6.0
ERROR: Cannot uninstall 'requests'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

Argh. Okay, maybe we can tell it to ignore the fact that it’s already installed?

thehearn@thatbox99> sudo pip install requests --upgrade --force-reinstall --ignore-installed
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support
Collecting requests
  Using cached https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Using cached https://files.pythonhosted.org/packages/b4/40/a9837291310ee1ccc242ceb6ebfd9eb21539649f193a7c8c86ba15b98539/urllib3-1.25.7-py2.py3-none-any.whl
Collecting certifi>=2017.4.17
  Using cached https://files.pythonhosted.org/packages/18/b0/8146a4f8dd402f60744fa380bc73ca47303cccf8b9190fd16a827281eac2/certifi-2019.9.11-py2.py3-none-any.whl
Collecting chardet<3.1.0,>=3.0.2
  Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl
Collecting idna<2.9,>=2.5
  Using cached https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl
Installing collected packages: urllib3, certifi, chardet, idna, requests
Successfully installed certifi-2019.9.11 chardet-3.0.4 idna-2.8 requests-2.22.0 urllib3-1.25.7

That looks better. And now my molecule test is able to spin up the docker instances! Awesome!

Setting up a Highly Available Kubernetes cluster

After taking a particularly good course on Kubernetes hosted by the Linux Foundation, I decided I wanted to finally get a cluster up and running in my home lab. I’d managed to build a 3-node cluster (one master and 2 “worker” nodes), but was eager to see if I could set up a “highly available” cluster with multiple masters.

In my home lab, I have a Nutanix “Community Edition” cluster, and I can use Terraform and Ansible to quickly spin up new VMs. I found a decent document on kubernetes.io that explains the process of setting up multiple masters, though it’s not entirely complete, as we’ll discover.

The first step was to spin up the master nodes, which I named (oh-so-imaginatively) kubemaster01, kubemaster02, and kubemaster03, all of them running Ubuntu 16.04.5. The only reason I chose Ubuntu was that it was what we’d used in the training course, and I knew it would work. I may try to add some CentOS 7 nodes to the cluster somewhere down the line, just to see how well they work.

For my purposes, I don’t believe the masters have to be very big, just 2 cores and 2GB of memory, along with the default 20GB root disk; I can always expand them later. The nodes are named kubenode01, kubenode02, and kubenode03, with the same specs; because they do most of the “work” of the cluster, I’ll need to keep a close eye on their resource usage and increase as necessary, particularly if I find I need to build persistent volumes for the applications.

Because I put them all in their own special VLAN, I don’t really need to worry about firewalls; they do run iptables, but Kubernetes seems to be pretty good about self-managing the rules that it needs, or at least, I haven’t had to do any manual configuration.

All the nodes need to have docker, kubeadm, and the kubelet installed, plus kubectl to do any managing of the running cluster, though in theory you could get away with installing kubectl on just the masters. They also need to have swap disabled. I wrote an Ansible role that handles all of that for me. I hard-coded the Kubernetes tools to version 1.13.1-00, which is what we used in the course and is pretty stable, despite having come out only a few months ago.
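For reference, the role boils down to roughly the following on each node; this is a sketch of the manual steps, assuming Ubuntu 16.04 and the standard Kubernetes apt repository (the repo URL and key location were current when I set this up and may have moved since):

# Docker plus the pinned Kubernetes packages
apt-get update && apt-get install -y docker.io apt-transport-https curl
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" > /etc/apt/sources.list.d/kubernetes.list
apt-get update && apt-get install -y kubelet=1.13.1-00 kubeadm=1.13.1-00 kubectl=1.13.1-00
apt-mark hold kubelet kubeadm kubectl

# Kubernetes refuses to run with swap enabled; turn it off now and at boot
swapoff -a
sed -i '/ swap / s/^/#/' /etc/fstab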

To simplify setup, I enabled root ssh access on all the masters, and set up a passphrase-protected key for the root user so I could copy files around as needed.
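That part is nothing fancy; something like this on each master would do it (a sketch, not the exact commands I ran, and the PermitRootLogin setting is a matter of taste):

# Allow key-based root logins, then push a key out from kubemaster01
sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin prohibit-password/' /etc/ssh/sshd_config
systemctl restart ssh
ssh-keygen -t rsa -b 4096 -f /root/.ssh/id_rsa    # on kubemaster01, with a passphrase
ssh-copy-id root@kubemaster02
ssh-copy-id root@kubemaster03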

Once the preliminary steps were out of the way, I found I was going to need a load-balancing proxy. This is the first place where the doc from kubernetes.io falls short (at least, as of February 2019); it says “HAProxy can be used as a load balancer,” but offers no instructions on how to set it up, and while HAProxy documentation abounds, it doesn’t offer any advice on setting it up as a Kubernetes API front-end. It took a lot of trial and error to get it working, but I finally got it to do what I wanted.

The versions of HAProxy available to download and compile are several minor versions ahead of what you can get from apt or yum, but I figured an older version would probably suffice. To get as recent a version as possible, I built myself an Ubuntu 16.04 VM (named haproxy01) and then upgraded it to Ubuntu 18.04.2. (I haven’t gotten around to making a packer build for Ubuntu 18 yet.) Then I used apt to install HAProxy 1.8.8, which isn’t much older than the current stable release, 1.9.4. The latest release Ubuntu 16 offers is 1.6.3, and CentOS 7 is only up to 1.5.18.

apt update && apt install haproxy

Because the Kubernetes API server pod listens on port 6443 by default, I needed to be able to forward secure connections on that port to all 3 masters. After a lot of cursing and gnashing of teeth I got /etc/haproxy/haproxy.cfg to look like this:

global
  log /dev/log local0
  log /dev/log local1 notice
  chroot /var/lib/haproxy
  stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
  stats timeout 30s
  user haproxy
  group haproxy
  daemon
  ca-base /etc/ssl/certs
  crt-base /etc/ssl/private
  ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!MD5:!DSS
  ssl-default-bind-options no-sslv3

defaults
  mode tcp
  log global
  option tcplog
  balance leastconn
  option dontlognull
  option redispatch
  option contstats
  option socket-stats
  timeout server 600s
  timeout client 600s
  timeout connect 5s
  timeout queue 60s
  retries 3
  default-server inter 15s rise 2 fall 2
  backlog 10000

frontend k8s_api_front
  bind *:6443 name https tcp-ut 30s
  default_backend k8s_api_back

backend k8s_api_back
  option tcp-check
  tcp-check connect port 6443 ssl
  server kubemaster01.hearn.local <ip address>:6443 check verify none
  server kubemaster02.hearn.local <ip address>:6443 check verify none
  server kubemaster03.hearn.local <ip address>:6443 check verify none

I will be the first to admit that I have no idea what most of the settings in defaults and global do; many of them are haproxy.cfg defaults. The real work is happening in the “frontend” and “backend” sections. An HAProxy server can have multiples of each, pointing traffic in various directions. The “frontend k8s_api_front” section tells HAProxy to listen on port 6443 for secure connections, and then forward them to backend k8s_api_back, which distributes connections to the 3 master servers that are all listening on the same port. (I’ve removed my IP addresses, but they should match the IP addresses of the master nodes.)
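One tip that saved me a few rounds of cursing: HAProxy can syntax-check the config before you restart it, and the journal will show the backend servers passing (or failing) their health checks:

haproxy -c -f /etc/haproxy/haproxy.cfg
systemctl restart haproxy
journalctl -u haproxy --since "5 minutes ago"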

I also updated my DNS so that I had a CNAME named “k8sapipxy” that pointed to haproxy01. I’ll be using that DNS record to build all the masters and nodes.
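In BIND zone-file terms it’s just an alias; treat this as illustrative, since your DNS server’s syntax may differ:

k8sapipxy    IN    CNAME    haproxy01.hearn.local.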

Once I got HAProxy working, I was able to set up the first master node, kubemaster01. I modified the kubeadm-config.yaml file a bit from the one provided in the kubernetes.io doc:

apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
kubernetesVersion: 1.13.1
networking:
  dnsDomain: cluster.local
  podSubnet: 192.168.0.0/16
  serviceSubnet: 10.96.0.0/12
apiServer:
  certSANs:
  - "k8sapipxy.hearn.local"
controlPlaneEndpoint: "k8sapipxy.hearn.local:6443"

Then I could simply run, on kubemaster01:

kubeadm init --config kubeadm-config.yaml

Until I got HAProxy set up properly, I ran into a lot of headaches where kubeadm couldn’t communicate through the proxy to reach the apiserver pod it had set up, but once HAProxy was working as it should, the command ran fine. It will also display the command to run on the other nodes to add them to the cluster, including the other master servers (with an extra option that identifies them as masters instead of just worker nodes); it should look something like this:

kubeadm join --token uygjg4.44wh6oioigcstoue \
k8sapipxy.hearn.local:6443 --discovery-token-ca-cert-hash \
sha256:e4b834054fc6fb4c063fc404cd5d7779fc1c253e5d1022def2dd5886fb553f6c

Save that command somewhere; you’ll need it to join the additional masters and nodes. The token is only good for 24 hours, but you can generate a new one with the following command:

kubeadm token create
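If you’d rather not reassemble the join command by hand, kubeadm can also print a fresh one in its entirety (this flag is present in 1.13, to the best of my knowledge):

kubeadm token create --print-join-command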

And if you find you lost the CA cert hash, you can get it with the following rather lengthy command:

openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //g'
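One step I’m glossing over: kubectl on kubemaster01 needs credentials before the apply commands below will work. kubeadm init prints the exact instructions at the end of its output; they amount to roughly this (or, as root, just export KUBECONFIG=/etc/kubernetes/admin.conf):

mkdir -p $HOME/.kube
cp /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config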

The doc specifies using the Weave plugin for networking, but I was more familiar with Calico, which we had used in the class, so I downloaded the same Calico and RBAC configuration files we’d used in the class:

https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml
https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/rbac-kdd.yaml

These get applied using kubectl on kubemaster01:

kubectl apply -f rbac-kdd.yaml
kubectl apply -f calico.yaml

Once that’s done, you should be able to use kubectl to monitor the progress of the pod creation on your master:

kubectl -n kube-system get pods

The “-n kube-system” is necessary because kubectl only shows the “default” namespace unless you ask for another one, and all of these system pods live in the kube-system namespace. It should only take a couple of minutes, and you should have a fully functioning K8s master! To add the other masters as actual masters instead of worker nodes, you’ll need to copy all the relevant PKI keys from your first master to the others, along with the main K8s configuration file:

for host in kubemaster02 kubemaster03; do
    scp /etc/kubernetes/pki/ca.crt $host:/etc/kubernetes/pki/
    scp /etc/kubernetes/pki/ca.key $host:/etc/kubernetes/pki/
    scp /etc/kubernetes/pki/sa.key $host:/etc/kubernetes/pki/
    scp /etc/kubernetes/pki/sa.pub $host:/etc/kubernetes/pki/
    scp /etc/kubernetes/pki/front-proxy-ca.crt $host:/etc/kubernetes/pki/
    scp /etc/kubernetes/pki/front-proxy-ca.key $host:/etc/kubernetes/pki/
    scp /etc/kubernetes/pki/etcd/ca.crt $host:/etc/kubernetes/pki/etcd/ca.crt
    scp /etc/kubernetes/pki/etcd/ca.key $host:/etc/kubernetes/pki/etcd/ca.key
    scp /etc/kubernetes/admin.conf $host:/etc/kubernetes
done

You do not need to do that step for the worker nodes, only for the masters. Once you’ve got the keys in place, you can run the deploy command you generated earlier, but with an extra option at the end:

kubeadm join --token uygjg4.44wh6oioigcstoue \
k8sapipxy.hearn.local:6443 --discovery-token-ca-cert-hash \
sha256:e4b834054fc6fb4c063fc404cd5d7779fc1c253e5c1022dff23d9886ab553f6c --experimental-control-plane

The “--experimental-control-plane” option indicates that this is an additional master instead of a worker, and it should spin up its own master pods, such as the apiserver and etcd instance. You can monitor the progress of pod creation from your first master:

kubectl -n kube-system get pods

Once all those pods report that they’re up and running, you can run the kubeadm join again on the third master. That should be all that’s needed; Kubernetes will create the necessary pods on all the masters, including the apiservers, etcd instances, coredns pods, and everything Calico needs. It should look vaguely like this:

NAME                                               READY   STATUS    RESTARTS   AGE
calico-node-4649j                                  2/2     Running   0          2d19h
calico-node-8qv9z                                  2/2     Running   0          2d19h
calico-node-h49vp                                  2/2     Running   0          2d19h
calico-node-p9vbm                                  2/2     Running   0          2d19h
calico-node-q5ctt                                  2/2     Running   0          2d19h
calico-node-spfs7                                  2/2     Running   0          2d19h
coredns-86c58d9df4-2ckdt                           1/1     Running   0          2d19h
coredns-86c58d9df4-tdzbt                           1/1     Running   0          2d19h
etcd-kubemaster01.hearn.local                      1/1     Running   0          2d19h
etcd-kubemaster02.hearn.local                      1/1     Running   0          2d19h
etcd-kubemaster03.hearn.local                      1/1     Running   0          2d19h
kube-apiserver-kubemaster01.hearn.local            1/1     Running   0          2d19h
kube-apiserver-kubemaster02.hearn.local            1/1     Running   0          2d19h
kube-apiserver-kubemaster03.hearn.local            1/1     Running   1          2d19h
kube-controller-manager-kubemaster01.hearn.local   1/1     Running   4          2d19h
kube-controller-manager-kubemaster02.hearn.local   1/1     Running   5          2d19h
kube-controller-manager-kubemaster03.hearn.local   1/1     Running   2          2d19h
kube-proxy-44wv7                                   1/1     Running   0          2d19h
kube-proxy-66xj4                                   1/1     Running   0          2d19h
kube-proxy-f28zd                                   1/1     Running   0          2d19h
kube-proxy-kz8bs                                   1/1     Running   0          2d19h
kube-proxy-tzf4v                                   1/1     Running   0          2d19h
kube-proxy-zm74m                                   1/1     Running   0          2d19h
kube-scheduler-kubemaster01.hearn.local            1/1     Running   5          2d19h
kube-scheduler-kubemaster02.hearn.local            1/1     Running   4          2d19h
kube-scheduler-kubemaster03.hearn.local            1/1     Running   3          2d19h

Once all 3 masters are online, you can go to each worker node and run the kubeadm join command, but without the --experimental-control-plane option:

kubeadm join --token uygjg4.44wh6oioigcstoue \
k8sapipxy.hearn.local:6443 --discovery-token-ca-cert-hash \
sha256:e4b834054fc6fb4c063fc404cd5d7779fc1c253e5c1022dff23d9886ab553f6c

Each node should happily join the cluster and be ready to host pods! Now I just have to figure out what to do with it…
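For what it’s worth, a quick check from any of the masters shows when everything has registered; all six machines should eventually report a Ready status:

kubectl get nodes
kubectl -n kube-system get pods -o wide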

Fixing Samba (or other services) connectivity

So you know you’ve got Samba running on your server.  You can connect to it locally.  But you can’t seem to get to it from anywhere else.  What the heck is going on?

Server’s up:

# systemctl status smb
● smb.service - Samba SMB Daemon
Loaded: loaded (/usr/lib/systemd/system/smb.service; enabled; vendor preset: disabled)
Active: active (running) since Sat 2018-10-27 14:06:51 EDT; 56min ago
Main PID: 174091 (smbd)
Status: "smbd: ready to serve connections..."

Firewalls are down (not ideal, but sometimes the only way to make sure is to turn ’em off):

# systemctl status iptables
● iptables.service - IPv4 firewall with iptables
Loaded: loaded (/usr/lib/systemd/system/iptables.service; disabled; vendor preset: disabled)
Active: inactive (dead) since Sat 2018-10-27 13:54:19 EDT; 1h 14min ago
# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Sat 2018-10-27 13:39:19 EDT; 1h 29min ago

But you can’t mount your Samba shares.  Nor can you even reach the stupid thing from another device:

# nc -vzw 3 sambavm 139
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connection refused.

And Samba is definitely listening on 139/tcp:

# netstat -plant | grep 139
tcp 0 0 127.0.0.1:139 0.0.0.0:* LISTEN 174091/smbd

Argh!  What on earth is happening?  I guess let’s look at smb.conf and see if anything funny is going on in there.

interfaces = lo ens160

Well that’s odd.  I know for some RHEL7 VMs, the interface is named ens160.  But I’m pretty sure mine is named eth0.

# ip a
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000

Yep.  I wonder what happens if I correct that line and restart Samba.

# vi /etc/samba/smb.conf
interfaces = lo eth0
# systemctl restart smb

Well looky there, I can mount my Samba shares!  Exciting times for all!  Either my NIC changed names (either as a result of patching, or because I migrated the VM from ESXi to Nutanix’s Acropolis hypervisor), or somebody was fiddling with my smb.conf (equally likely).  Either way, if your smb.conf restricts which interfaces Samba can listen on, make sure it names the right NIC; if the listed interface doesn’t exist, smbd can end up bound only to loopback, which is exactly what the netstat output above was showing.
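If you want to double-check without mounting anything, testparm will validate smb.conf and the same netstat test from earlier should now show smbd bound to your eth0 address rather than just 127.0.0.1 (a quick sanity check, not strictly required):

testparm -s /etc/samba/smb.conf
netstat -plant | grep smbd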