Spacewalk refuses to sync channels and throws befuddling Postgres errors

In my home lab I run an instance of “Spacewalk,” the open-source upstream of Red Hat’s Satellite 5 tool. It lets you manage CentOS, Red Hat, and Fedora machines in various ways, but it’s mostly used to manage “yum” packages.

The gist of the package management setup is that you set up “channels,” which contain links to a variety of internet repos, and download the packages from those repos according to a schedule you set up, stashing them locally to save download time when installing or upgrading software on a device. It stores a ton of metadata about the packages in a Postgres database. You subscribe your hosts to the channels you want so that they can see certain packages, depending on what you want installed where. For example, I have a “Cloudera” channel containing all the server and agent RPMs for Cloudera, that only my VMs that run Cloudera are subscribed to.
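For what it’s worth, the download schedule doesn’t have to live in the web UI (Spacewalk can also schedule repo syncs internally through Taskomatic); a plain cron entry wrapping spacewalk-repo-sync, the same command that shows up later in this post, works fine. The channel labels and log path below are just examples:

# /etc/cron.d/spacewalk-reposync: pull down new packages overnight, one channel per line
0 2 * * * root /usr/bin/spacewalk-repo-sync --channel centos_6_x86_64 >> /var/log/rhn/reposync-cron.log 2>&1
30 2 * * * root /usr/bin/spacewalk-repo-sync --channel cloudera_x86_64 >> /var/log/rhn/reposync-cron.log 2>&1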

You can do nifty tricks with “clone” channels. Because the main channels get updated with new packages all the time, a patch cycle that drags on for months (say, because you have hundreds of servers to upgrade) can leave you in a situation where you upgraded a dev VM to kernel-3.10.0-514.el7.x86_64 on June 1st, but by the time you got around to prod on August 1st the channel’s latest kernel had moved on to a newer build. To get around that, at the start of your cycle you clone the main channel and subscribe all the servers to the clone; it’s essentially a snapshot of the channel. The servers then only get the packages that were in the channel on the day you cloned it. Next patch cycle, repeat the process.
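Spacewalk ships a utility for exactly this: spacewalk-clone-by-date, from the spacewalk-utils package, which clones a channel as it existed on a given date. The flags below are from memory, so double-check spacewalk-clone-by-date --help before trusting them:

# clone centos_6_x86_64 as of June 1st into a fresh channel for this patch cycle
spacewalk-clone-by-date --username=admin --password=secret \
    --channels=centos_6_x86_64 centos_6_x86_64-clone-2020q2 \
    --to_date=2020-06-01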

I had a wonky issue recently where my Spacewalk VM rebooted and came back up, but was unable to mount the NFS share where I actually store all the RPMs (I’ve got about 1TB of them at this point, so it makes sense to stash them on my giant FreeNAS instead of using up Nutanix storage). It dutifully started Spacewalk anyway, which began sticking RPMs into /var/satellite and filling up /var. That was annoying, but not fatal. What became more of a problem was my assumption that Spacewalk would be smart enough to resync the packages after I deleted them, mounted the NFS share, and restarted Spacewalk.

Instead, it started throwing weird errors about duplicates:

2020/07/20 16:07:06 -04:00 ERROR: (23, 'ERROR:  duplicate key value violates unique constraint "rhn_cnp_cid_nid_uq"', 'Could not update database entry.')

I did some googlin’ and the results were inconclusive. An official Red Hat support page (I have access because of my job) seemed to indicate that I was going to have to do some Postgres hacking, which was a scary thought. (At that point I wisely created a VM snapshot of the Spacewalk server so that, if I blew the whole thing up, I’d at least have a backup.) I also came across a site suggesting I could solve the problem from within Spacewalk by looking for “orphaned” packages (RPMs not affiliated with any channel) and deleting them. However, when I went to look for them, I had 7238 orphaned packages! That seemed…odd.
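(The web UI will show you the orphans, which is how I found the 7238, but you can also count them straight from the database. This query is my guess at the relationship, based on the tables that show up later in this post; a package is “orphaned” if nothing in rhnChannelPackage points at it:)

rhnschema=# select count(*) from rhnPackage p where not exists (select 1 from rhnChannelPackage cp where cp.package_id = p.id);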

I knew I had a good snapshot of the VM and could restore it, but the NFS data was trickier: if I deleted that, it was gone from the NAS and wouldn’t be covered by the VM snapshot. (I believe FreeNAS has snapshot capability of its own, but that only occurred to me later.) I figured that if things really got dire, Spacewalk would probably just redownload the missing stuff, so I threw caution to the wind and deleted all 7000+ orphaned packages.

It took a while, and Spacewalk got so busy doing it that the web UI stopped responding for a bit, but sure enough, it threw a bunch of messages into /var/log/rhn/rhn_taskomatic_daemon.log about packages not actually existing on the filesystem:

INFO   | jvm 1    | 2020/07/20 15:40:47 | 2020-07-20 15:40:47,709 [DefaultQuartzScheduler_Worker-8] ERROR com.redhat.rhn.taskomatic.task.PackageCleanup - /var/spacewalk/redhat/1/462/rh-python36-numpy/1:1.13.1-1.el7/x86_64/462e4a2f951afca6ce14076c9e859784a5e7630112c07d77b30bcef2dc410299/rh-python36-numpy-1.13.1-1.el7.x86_64.rpm not found

After it finished deleting them, I attempted a sync again from the command line:

[root@spacewalk rhn]# /usr/bin/spacewalk-repo-sync --channel centos_6_x86_64
08:08:00 ======================================
08:08:00 | Channel: centos_6_x86_64
08:08:00 ======================================
08:08:00 Sync of channel started.
08:08:00 Repo URL: http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=os
08:08:04 Packages in repo:              6713
08:11:46 Packages already synced:       6216
08:11:46 Packages to sync:               497
08:11:46 New packages to download:         0
08:11:46 Linking packages to channel.
08:11:57 ERROR: (23, 'ERROR:  duplicate key value violates unique constraint "rhn_cnp_cid_nid_uq"', 'Could not update database entry.')
08:11:57 Repo URL: http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=updates
Traceback (most recent call last):
  File "/usr/bin/spacewalk-repo-sync", line 257, in <module>
    sys.exit(abs(main() or 0))
  File "/usr/bin/spacewalk-repo-sync", line 240, in main
    elapsed_time, channel_ret_code = sync.sync()
  File "/usr/lib/python2.7/site-packages/spacewalk/satellite_tools/reposync.py", line 453, in sync
    """, repo_id=int(repo_id))
  File "/usr/lib/python2.7/site-packages/spacewalk/server/rhnSQL/__init__.py", line 292, in fetchall_dict
    h.execute(sql, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/spacewalk/server/rhnSQL/sql_base.py", line 151, in execute
    return self._execute_wrapper(self._execute, *p, **kw)
  File "/usr/lib/python2.7/site-packages/spacewalk/server/rhnSQL/driver_postgresql.py", line 302, in _execute_wrapper
    raise sql_base.SQLSchemaError(error_code, e.pgerror, e)
spacewalk.server.rhnSQL.sql_base.SQLSchemaError: (99999, 'ERROR:  current transaction is aborted, commands ignored until end of transaction block', '', InternalError('current transaction is aborted, commands ignored until end of transaction block\n',))

Dang it. Same error. (The scarier-looking “current transaction is aborted” at the end of that traceback is just fallout: once the duplicate-key insert fails, Postgres refuses to run anything else in the same transaction.) Okay, let’s look at the Postgres logs. Spacewalk sets up Postgres with mostly default settings and stashes pg_log in /var/lib/pgsql/data/pg_log, rolling it over every day and keeping 7 days of logs. I’m still not entirely sure I needed to, but I increased the verbosity in /var/lib/pgsql/data/postgresql.conf:

client_min_messages = debug1
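It’s also handy to keep a tail going on the current day’s log while the sync runs; with the stock EL postgresql.conf the files are named for the day of the week, so on a Monday that’s:

tail -f /var/lib/pgsql/data/pg_log/postgresql-Mon.log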

Then I ran “rhn-satellite restart” to bounce the whole Spacewalk environment, including Postgres, and kicked off the repo-sync again. As it failed, Postgres threw this error:

2020-07-20 13:32:17.644 EDT ERROR:  duplicate key value violates unique constraint "rhn_cnp_cid_nid_uq"
2020-07-20 13:32:17.644 EDT DETAIL:  Key (channel_id, name_id, package_arch_id)=(101, 4586, 120) already exists.
2020-07-20 13:32:17.644 EDT CONTEXT:  SQL statement "insert into rhnChannelNewestPackage
                        (channel_id, name_id, evr_id, package_id, package_arch_id)
                        (select channel_id,
                                name_id, evr_id,
                                package_id, package_arch_id
                           from rhnChannelNewestPackageView
                          where channel_id = channel_id_in
                            and (package_name_id_in is null
                                 or name_id = package_name_id_in)
                        )"
        PL/pgSQL function rhn_channel.refresh_newest_package(numeric,character varying,numeric) line 9 at SQL statement

So apparently the issue is whatever row in rhnChannelNewestPackage has channel_id=101, name_id=4586, and package_arch_id=120. Let’s see if we can figure out what that is. I su’d to the postgres user and opened the database:

[root@spacewalk reposync]# su - postgres
Last login: Mon Jul 20 15:10:45 EDT 2020 on pts/2
-bash-4.2$ psql rhnschema
psql (9.2.23)
Type "help" for help.

rhnschema=# select * from rhnChannelNewestPackage where channel_id=101 and name_id=4586;
 channel_id | name_id | evr_id | package_arch_id | package_id
------------+---------+--------+-----------------+------------
        101 |    4586 |  19624 |             120 |      63550
(1 row)

Okay, let’s see what that guy is.

rhnschema=# select * from rhnpackage where id=63550;
  id   | org_id | name_id | evr_id | package_arch_id | package_group | rpm_version |                             description                              |                    summary                    | package_size | payload_size | ins
talled_size |        build_host         |     build_time      | source_rpm_id | checksum_id | vendor | payload_format | compat |                                                                      path
                                    | header_sig | copyright | cookie |     last_modified      |            created            |           modified            | header_start | header_end
-------+--------+---------+--------+-----------------+---------------+-------------+----------------------------------------------------------------------+-----------------------------------------------+--------------+--------------+----
------------+---------------------------+---------------------+---------------+-------------+--------+----------------+--------+-------------------------------------------------------------------------------------------------------------
------------------------------------+------------+-----------+--------+------------------------+-------------------------------+-------------------------------+--------------+------------
 63550 |      1 |    4586 |  19624 |             120 |            13 | 4.8.0       | This package provides the sar and iostat commands for Linux. Sar and+| The sar and iostat system monitoring commands |       239820 |       854664 |
     846133 | x86_64_01.bsys.centos.org | 2018-01-26 02:43:42 |         26153 |     7595952 | CentOS | cpio           |      0 | redhat/1/b05/sysstat/9.0.4-33el6_9.1/x86_64/b05d4090caae0fb2df7005a07862ce2ce82e88ced1a58f42f2eb88769e47a45b
/sysstat-9.0.4-33el6_9.1.x86_64.rpm |            | GPLv2+    |        | 2018-01-28 11:05:36-05 | 2018-01-28 21:05:38.918665-05 | 2018-01-28 21:05:38.918665-05 |         1384 |      27824
       |        |         |        |                 |               |             | iostat enable system monitoring of disk, network, and other IO      +|                                               |              |              |
            |                           |                     |               |             |        |                |        |
                                    |            |           |        |                        |                               |                               |              |
       |        |         |        |                 |               |             | activity.                                                            |                                               |              |              |
            |                           |                     |               |             |        |                |        |
                                    |            |           |        |                        |                               |                               |              |
(1 row)

sysstat-9.0.4-33el6_9.1.x86_64.rpm seemed to be the problem. I thought about simply deleting that row from rhnPackage and rhnChannelNewestPackage, but decided it would probably be cleaner to let Spacewalk do it and, in theory, clean up all the bad metadata itself. So I logged back into Spacewalk, went to the Channels tab atop the dashboard, clicked Manage Software Channels in the left-side menu, then Manage Software Packages. I selected “All Managed Packages” in the dropdown, searched for sysstat, found the 9.0.4-33el6_9.1 version, selected it, and deleted it.
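(If you want to be sure the row really went away at the database level, a quick sanity check against the rhnPackage columns we saw above does the trick; the LIKE pattern is just whatever part of the filename is unique:)

rhnschema=# select id, path from rhnPackage where path like '%sysstat-9.0.4-33el6_9.1%';

Then I ran the repo-sync again.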

Argh. Same error. Okay, let’s check Postgres again.

2020-07-21 08:11:57.106 EDT DETAIL:  Key (channel_id, name_id, package_arch_id)=(101, 4083, 100) already exists.

Curious. Same error, with a different name_id. Let’s see what that is.

rhnschema=# select * from rhnChannelNewestPackageView where channel_id=101 and name_id=4083;
 channel_id | name_id | evr_id | package_arch_id | package_id
------------+---------+--------+-----------------+------------
        101 |    4083 |  19623 |             100 |      63530
(1 row)

What the flip? That package_id looked like the same one from before (63530 versus 63550, close enough that I read it as the same package at first). And the sysstat package I’d just deleted was definitely gone from rhnPackage:

rhnschema=# select * from rhnpackage where id=63550;
 id | org_id | name_id | evr_id | package_arch_id | package_group | rpm_version | description | summary | package_size | payload_size | installed_size | build_host | build_time | source_rpm_id | checksum_id | vendor | payload_format | co
mpat | path | header_sig | copyright | cookie | last_modified | created | modified | header_start | header_end
----+--------+---------+--------+-----------------+---------------+-------------+-------------+---------+--------------+--------------+----------------+------------+------------+---------------+-------------+--------+----------------+---
-----+------+------------+-----------+--------+---------------+---------+----------+--------------+------------
(0 rows)

Okay, Spacewalk must be stashing a row of metadata in some table somewhere. First, I deleted the offending row from rhnChannelNewestPackage manually:

rhnschema=# delete from rhnChannelNewestPackage where channel_id=101 and name_id=4083;
DELETE 1
rhnschema=# select * from rhnChannelNewestPackage where channel_id=101 and name_id=4083;
 channel_id | name_id | evr_id | package_arch_id | package_id
------------+---------+--------+-----------------+------------
(0 rows)

The Postgres log had also mentioned rhnChannelNewestPackageView, which still returned a row for that name_id:

rhnschema=# select * from rhnChannelNewestPackageView where channel_id=101 and name_id=4083;
 channel_id | name_id | evr_id | package_arch_id | package_id
------------+---------+--------+-----------------+------------
        101 |    4083 |  19623 |             100 |      63530
(1 row)

That object is clearly a view, so I assumed I wouldn’t be able to delete from it directly, only from the tables underneath it. But maybe there was some kind of trigger set up that would accept the delete and do the cleanup underneath? Might as well give it a shot:

rhnschema=# delete from rhnChannelNewestPackageView where channel_id=101 and name_id=4083;
ERROR:  cannot delete from view "rhnchannelnewestpackageview"
HINT:  You need an unconditional ON DELETE DO INSTEAD rule or an INSTEAD OF DELETE trigger.

Yep, as I figured. Okay, let’s see if it’ll tell me what tables that view encompasses:

rhnschema=# \d rhnchannelnewestpackageview
View "public.rhnchannelnewestpackageview"
     Column      |  Type   | Modifiers
-----------------+---------+-----------
 channel_id      | numeric |
 name_id         | numeric |
 evr_id          | numeric |
 package_arch_id | numeric |
 package_id      | numeric |

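(In hindsight, psql will happily show you the query behind a view, and therefore which tables feed it; either of these is plain-vanilla Postgres:)

rhnschema=# \d+ rhnchannelnewestpackageview
rhnschema=# select pg_get_viewdef('rhnchannelnewestpackageview'::regclass, true);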
Not helpful at the time: all that told me was the column list, and I’d run out of Postgres expertise (I don’t really have any to begin with, other than wrangling Slony clusters at work), so it was down to shots in the dark. I saw there was an rhnChannelPackage table and thought maybe it still contained a row for that package:

rhnschema=# select * from rhnchannelpackage where channel_id=101 and name_id=4083;
ERROR:  column "name_id" does not exist
LINE 1: ...* from rhnchannelpackage where channel_id=101 and name_id=40...

Heaven forfend it have the same table structure as rhnChannelNewestPackage. Argh. Okay, what is its structure?

rhnschema=# \d rhnchannelpackage
                Table "public.rhnchannelpackage"
   Column   |           Type           |       Modifiers
------------+--------------------------+------------------------
 channel_id | numeric                  | not null
 package_id | numeric                  | not null
 created    | timestamp with time zone | not null default now()
 modified   | timestamp with time zone | not null default now()
Indexes:
    "rhn_cp_cp_uq" UNIQUE, btree (channel_id, package_id)
    "rhn_cp_pid_idx" btree (package_id)
Foreign-key constraints:
    "rhn_cp_cid_fk" FOREIGN KEY (channel_id) REFERENCES rhnchannel(id) ON DELETE CASCADE
    "rhn_cp_pid_fk" FOREIGN KEY (package_id) REFERENCES rhnpackage(id)
Triggers:
    rhn_channel_package_mod_trig BEFORE INSERT OR UPDATE ON rhnchannelpackage FOR EACH ROW EXECUTE PROCEDURE rhn_channel_package_mod_trig_fun()

Well golly, look at all that. None of those indexes or foreign keys were mentioned in the original error, but given the ON DELETE CASCADE and the way these tables hang together, it wouldn’t have surprised me if they were involved somehow. Let’s see if there’s a row matching our wonky package (id 63530):

rhnschema=# select * from rhnchannelpackage where channel_id=101 and package_id=63530;
 channel_id | package_id |           created            |           modified
------------+------------+------------------------------+------------------------------
        101 |      63530 | 2018-01-28 21:05:58.69026-05 | 2018-01-28 21:05:58.69026-05
(1 row)

Well knock me over with a feather. Okey dokey, let’s eliminate that thang:

rhnschema=# delete from rhnchannelpackage where channel_id=101 and package_id=63530;
DELETE 1

Now, the view still returned a row for that name_id, unfortunately, although on closer inspection it was a different package_id (5380 instead of 63530), which suggests the view just recomputes “newest” from whatever rows remain:

rhnschema=# select * from rhnChannelNewestPackageView where channel_id=101 and name_id=4083;
 channel_id | name_id | evr_id | package_arch_id | package_id
------------+---------+--------+-----------------+------------
        101 |    4083 |   1685 |             100 |       5380
(1 row)

But I wasn’t sure what the heck else to try, so I figured it wouldn’t hurt anything to run the repo-sync again:

[root@spacewalk rhn]# /usr/bin/spacewalk-repo-sync --channel centos_6_x86_64
08:20:52 ======================================
08:20:52 | Channel: centos_6_x86_64
08:20:52 ======================================
08:20:52 Sync of channel started.
08:20:52 Repo URL: http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=os
08:20:57 Packages in repo:              6713
08:23:34 Packages already synced:       6216
08:23:34 Packages to sync:               497
08:23:34 New packages to download:         0
08:23:34 Linking packages to channel.
08:23:38 Repo http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=os has comps file comps.xml.

Hey, it never got that far without throwing an error. That’s promising.

08:23:38 Repo http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=os has 0 errata.
08:23:38 Repo URL: http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=updates
08:23:42 Packages in repo:              1068
08:23:46 Packages already synced:          4
08:23:46 Packages to sync:              1064
08:23:46 New packages to download:      1064
08:23:46 1/1064 : 389-ds-base-1.2.11.15-97.el6_10.x86_64.rpm
08:23:47 2/1064 : adcli-0.8.1-3.el6_10.x86_64.rpm
08:23:47 3/1064 : 389-ds-base-devel-1.2.11.15-97.el6_10.i686.rpm
08:23:47 4/1064 : autofs-5.0.5-140.el6_10.1.x86_64.rpm
08:23:47 5/1064 : 389-ds-base-devel-1.2.11.15-97.el6_10.x86_64.rpm
08:23:47 6/1064 : 389-ds-base-libs-1.2.11.15-97.el6_10.x86_64.rpm
08:23:47 7/1064 : 389-ds-base-libs-1.2.11.15-97.el6_10.i686.rpm
08:23:47 8/1064 : autofs-5.0.5-140.el6_10.x86_64.rpm
08:23:47 9/1064 : bind-chroot-9.8.2-0.68.rc1.el6_10.1.x86_64.rpm
08:23:47 10/1064 : bind-chroot-9.8.2-0.68.rc1.el6_10.3.x86_64.rpm
08:23:47 11/1064 : bind-chroot-9.8.2-0.68.rc1.el6_10.7.x86_64.rpm
08:23:47 12/1064 : bind-devel-9.8.2-0.68.rc1.el6_10.1.i686.rpm
08:23:47 13/1064 : bind-devel-9.8.2-0.68.rc1.el6_10.1.x86_64.rpm
08:23:48 14/1064 : bind-devel-9.8.2-0.68.rc1.el6_10.3.i686.rpm
08:23:48 15/1064 : bind-devel-9.8.2-0.68.rc1.el6_10.3.x86_64.rpm
08:23:48 16/1064 : bind-devel-9.8.2-0.68.rc1.el6_10.7.i686.rpm
08:23:48 17/1064 : bind-devel-9.8.2-0.68.rc1.el6_10.7.x86_64.rpm
08:23:48 18/1064 : bind-9.8.2-0.68.rc1.el6_10.7.x86_64.rpm
08:23:48 19/1064 : bind-9.8.2-0.68.rc1.el6_10.3.x86_64.rpm
08:23:48 20/1064 : bind-libs-9.8.2-0.68.rc1.el6_10.3.x86_64.rpm
08:23:48 21/1064 : bind-libs-9.8.2-0.68.rc1.el6_10.7.x86_64.rpm
08:23:48 22/1064 : bind-9.8.2-0.68.rc1.el6_10.1.x86_64.rpm

Yaaaaaaaasssssss!

I’m honestly not entirely sure which step fixed it (my best guess is that the lingering rhnChannelPackage row was what kept tripping up rhn_channel.refresh_newest_package), but the safest approach seems to be: look at the Postgres logs, figure out which package is making it burp, and then go into Postgres and delete any reference to it from the database. Try the repo-sync again. If that doesn’t work, try prayer.
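The root cause, of course, was Spacewalk happily starting before its NFS share was mounted. A sketch of how to prevent that, assuming the share is in /etc/fstab and your Spacewalk services run under a systemd unit (the NFS path and unit name below are placeholders; check what rhn-satellite actually controls on your box):

# /etc/fstab: mark the share as network-dependent so it waits for networking
freenas:/mnt/tank/satellite  /var/satellite  nfs  defaults,_netdev  0 0

# /etc/systemd/system/<spacewalk-unit>.service.d/nfs.conf: refuse to start without the mount
[Unit]
RequiresMountsFor=/var/satellite

# then reload systemd so the drop-in takes effect
systemctl daemon-reload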

Over-subscription of thin-provisioned virtual disks leading to Nutanix cluster issues

An issue with both physical and virtual environments is unused resources, particularly memory and storage. If you allocate 250GB to a VM and it only uses 50GB, that’s a lot of wasted disk; multiply that by dozens of VMs and you’ve got a lot of storage sitting idle. The good news is that modern hypervisors offer “thin-provisioned” virtual disks: even if you set the disk size to 250GB, if the OS is only using 50GB of it, only 50GB is consumed at the physical layer. That means you can “over-subscribe” your storage by allocating more of it in virtual disks than you actually have; the VMs can grab some of that additional 200GB if they need it, and when they’re done with it, the hypervisor takes it back. You don’t need to tightly monitor each VM’s utilization; you just give each one as much as it’s likely to need for regular operations and monitor the overall physical storage utilization of the environment.

The problem is that, if you over-subscribe your storage (i.e., allocate 15TB of “thin-provisioned” storage to VMs in a physical environment containing only 10TB of storage), and your VMs do use it all up at once, you’re in trouble. So generally you don’t want to oversubscribe by more than, say, 25%, depending on how much your VM storage requirements fluctuate.

Another issue that I discovered recently relates to how VMware recovers unused space from the VMs, and how Red Hat Enterprise Linux communicates to VMware that virtual storage is no longer needed, particularly in an environment with Nutanix infrastructure.

A short primer on Nutanix, if you aren’t familiar with it: you take a number of physical hosts and combine them into a cluster, with their storage combined into one large pool. The VMs can live on any host in the cluster, using the CPU and memory available there, and their data is spread throughout the cluster and replicated for reliability. If a host fails, its VMs get restarted on other hosts, and any data blocks that lived on the failed host are rebuilt from the copies on the surviving hosts. You can use a number of different hypervisors; if, for example, you want to use VMware, the storage containers created at the Nutanix level are presented to all of the cluster’s hypervisors as ordinary NFS storage.

I support a Nutanix cluster dedicated to a large application with a lot of data constantly being written and cleaned up, several hundred GB every day. The cluster has about 290TB of raw storage; with a “replication factor” of 2 (each block has a copy on another host in the cluster), that gives us about 145TB of “logical” space. We have about 15 very large VMs, each with a bit over 8TB of disk attached, plus some other smaller VMs, totaling about 114TB of provisioned storage, all thin-provisioned. We also use “protection domains” to create snapshots of the VMs and replicate them to a “backup” cluster in another data center, and the delta disks for those snapshots take up several additional TB of storage.

I started getting alerts that we were using more than 75% of the available storage in the cluster, and usage kept growing, approaching 90%. Nutanix requires a certain amount of free space in order to operate, and in order to survive a host failure, since losing a host obviously reduces the overall available storage. If we went much beyond 90%, the cluster would begin to have lots of difficult-to-resolve issues (VMs freezing, filesystems going read-only, etc.). But wait…the VMs were only using about 35% of their virtual disks! Why was Nutanix saying we were using almost 90%?

I took a look at our vSphere client, and it reported similar numbers to Nutanix: each VM was using 7+ TB of its 8TB storage. How was this possible?

We initially thought the issue was the protection domain snapshots. Snapshots work similarly in Nutanix and VMware: when a snapshot is created, new writes go to “delta” disk files, and the original disk files are left quiesced so they can be safely copied elsewhere or restored in the event of a failure. Eventually, when the snapshot is deleted, the delta files get merged back into the main disk files. We had recently enabled protection domains for all the VMs, and the storage had been growing ever since, so it seemed a likely cause. Was Nutanix simply failing to clean up delta disks, letting them grow and grow ever since the initial snapshots were created? The fact that the “.snapshots” directory within the datastore appeared to be using a remarkable 126TB of storage seemed to bear this out.

We called Nutanix, who explained that no, the .snapshot directory wasn’t actually using that much storage, or at least not over and above the actual data disks. The data disks themselves were stored there, and the files that looked like data disks in the usual VM directories were just pointers to them. So that turned out to be a red herring.

The actual issue seems to be that the operating system on the VMs, in this case Red Hat Enterprise Linux 6 (RHEL6), and VMware don’t communicate very well about what space is actually reclaimable. When a file gets deleted in RHEL6, the data doesn’t get cleaned up right away; what gets removed is the “inode,” essentially a pointer that describes where on the disk to find the data. The data itself stays put and simply gets overwritten later by new data as needed. Over time, as the application wrote many GB a day to its filesystems, that stale data remained on the disks even though the inodes pointing to it had been deleted and the OS reported the space as free. VMware saw all those ones and zeroes and assumed it was live data that could not be cleaned up by the hypervisor processes that maintain the thin provisioning.

In short, in an environment like this, with lots and lots of data coming in and getting “deleted,” you can’t really over-subscribe the storage, because the OS and VMware don’t talk to each other well enough about which data is actually needed.

The fun part was the short-term trick I used to clean things up until I could replace the virtual disks with smaller ones and “de-over-subscribe” the storage. It turns out that if you write a very large file containing nothing but zeroes to a filesystem and then delete it, VMware will treat that space as reclaimable. It took a few days to run across all the filesystems in the cluster, but I used the “dd” command to write zeroes from /dev/zero into a file in each filesystem and then removed it; VMware would eventually knock the reported storage utilization back down, and Nutanix’s “Curator” scans would reclaim the space as well.

dd if=/dev/zero of=/datastore01/file_full_of_zeroes bs=1M count=1000000
rm /datastore01/file_full_of_zeroes

The first command creates a roughly 1TB file (a million 1MB blocks) of nothing but zeroes; the second removes it. Doing that on each of the four datastore filesystems on every VM brought the reported utilization from about 7TB down to about 4TB per VM, and the Nutanix-reported utilization back under 60%. Some of what I googled suggested that I might need to run operations on the virtual disks from within the hypervisor, or even do a storage migration, to get VMware to reclaim the space, but it happened automatically (we’re running ESXi 6.0.0, which may handle space reclamation differently than earlier versions).
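For the record, a slightly safer version of the same trick sizes the zero file to the free space on each filesystem instead of hard-coding 1TB (the mount points and the 90% headroom factor here are just my assumptions; adjust for your layout):

#!/bin/bash
# Zero-fill most of the free space on each filesystem, then delete the file,
# so the hypervisor sees the freed blocks as reclaimable.
for fs in /datastore01 /datastore02 /datastore03 /datastore04; do
    # available space in MB, scaled back to 90% to leave headroom
    free_mb=$(df -Pk "$fs" | awk 'NR==2 {print int($4 / 1024 * 0.9)}')
    dd if=/dev/zero of="$fs/file_full_of_zeroes" bs=1M count="$free_mb"
    rm -f "$fs/file_full_of_zeroes"
done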