Spacewalk refuses to sync channels and throws befuddling Postgres errors

In my home lab I run an instance of “Spacewalk,” the open-source upstream of Red Hat’s Satellite 5 tool. It lets you manage CentOS, Red Hat, and Fedora machines in various ways, but it’s mostly used to manage “yum” packages.

The gist of the package management setup is that you set up “channels,” each of which points at one or more internet repos. Spacewalk downloads the packages from those repos on a schedule you define and stashes them locally, which saves download time when you install or upgrade software on a device. It stores a ton of metadata about the packages in a Postgres database. You subscribe your hosts to the channels you want them to see, depending on what you want installed where. For example, I have a “Cloudera” channel containing all the server and agent RPMs for Cloudera, and only the VMs that run Cloudera are subscribed to it.
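
As a rough sketch of what that schedule can look like, a nightly cron entry that calls spacewalk-repo-sync does the job (the channel label below is made up, and you can just as easily drive this from Taskomatic’s sync schedules in the web UI):

# /etc/cron.d/spacewalk-sync -- sketch; "cloudera_x86_64" is a hypothetical channel label
# Sync one channel every night at 02:30 and keep the output for later head-scratching
30 2 * * * root /usr/bin/spacewalk-repo-sync --channel cloudera_x86_64 >> /var/log/rhn/reposync-cron.log 2>&1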

You can do nifty tricks with “clone” channels. The main channels get updated with new packages all the time, so if a patch cycle stretches over several months because you have hundreds of servers to upgrade, you can end up upgrading a dev VM to kernel-3.10.0-514.el7.x86_64 on June 1st, and then by the time you get to prod on August 1st the latest kernel in the channel has moved on to something like kernel-3.10.0-693.el7.x86_64, so dev and prod no longer match. To get around that, at the start of your cycle you clone the main channel and subscribe all the servers to the clone; it’s essentially a snapshot of the channel. The servers then only get the packages that were in the channel on the day you cloned it. Next patch cycle, repeat the process.
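
If you prefer the command line, spacecmd can do the cloning and re-subscribing. This is from memory and the labels are made up, so double-check spacecmd’s built-in help before trusting the exact flags:

# Clone a channel at the start of a patch cycle, then point a box at the clone
# (flags from memory -- see "spacecmd softwarechannel_clone --help")
spacecmd -- softwarechannel_clone -s centos_7_x86_64 \
    -l centos_7_x86_64-2020q3 -n "CentOS 7 x86_64 (2020 Q3 clone)"
spacecmd -- system_setbasechannel vm-dev-01 centos_7_x86_64-2020q3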

I had a wonky issue recently where my Spacewalk VM rebooted and came back up, but was unable to mount the NFS share where I actually store all the RPMs (I’ve got about 1TB of them at this point, so it makes sense to stash them on my giant FreeNAS instead of using up Nutanix storage). It dutifully started up Spacewalk anyway, and started sticking RPMs into /var/satellite and filling up /var. This is annoying, but not fatal. What became more of a problem was my assumption that Spacewalk would be smart enough to resync the packages after I deleted them, mounted the NFS share, and restarted Spacewalk.
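
(One thought for next time: a systemd drop-in ought to keep the package-hungry bits from starting until the NFS mount is actually there. Taskomatic is the piece doing the scheduled downloads on my box, so that’s the unit I’d try first, but I haven’t battle-tested this:)

# /etc/systemd/system/taskomatic.service.d/nfs.conf -- untested sketch
# Don't let the downloader start until the NFS-backed package store is mounted
[Unit]
RequiresMountsFor=/var/satellite

# then: systemctl daemon-reload && rhn-satellite restart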

Instead, it started throwing weird errors about duplicates:

2020/07/20 16:07:06 -04:00 ERROR: (23, 'ERROR:  duplicate key value violates unique constraint "rhn_cnp_cid_nid_uq"', 'Could not update database entry.')

I did some googlin’ and the results were inconclusive. An official Red Hat support page (I have access because of my job) seemed to indicate that I was going to have to do some Postgres hacking, which was a scary thought. (At that point I wisely created a VM snapshot of the Spacewalk server so I didn’t blow the whole thing up without a backup.) I also came across a site that seemed to indicate I could solve the problem from within Spacewalk by looking for “orphaned” packages (RPMs that aren’t affiliated with any channel) and deleting them. However, when I went looking, I had 7238 orphaned packages! That seemed…odd.
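
Spacewalk’s web UI will list orphans for you, but you can also eyeball the same thing straight from Postgres. This is just my own ad-hoc query against tables that show up later in this post, not anything official:

-- Packages that no channel references at all ("orphans"), per my reading of the schema
SELECT count(*)
  FROM rhnpackage p
  LEFT JOIN rhnchannelpackage cp ON cp.package_id = p.id
 WHERE cp.package_id IS NULL;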

I knew I had a good snapshot of the VM and could restore it, but the NFS data was trickier; if I deleted that, it was gone from the NAS and wouldn’t be covered by the VM snapshot. (FreeNAS has its own snapshot capability that would have covered me here, but it didn’t occur to me at the time.) I figured that if things really got dire, Spacewalk would probably redownload the missing stuff, so I threw caution to the wind and deleted all 7000+ orphaned packages.
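
For the record, this is exactly what a ZFS snapshot on the FreeNAS side is for. The dataset name below is made up, but the commands are stock ZFS:

# On the FreeNAS box: snapshot the dataset backing the NFS share before doing
# anything destructive ("tank/spacewalk" is a hypothetical dataset name)
zfs snapshot tank/spacewalk@pre-orphan-cleanup
zfs list -t snapshot -r tank/spacewalk    # confirm it's there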

It took a while, and Spacewalk got so busy with the deletions that it stopped talking to me for a bit, but sure enough, it threw a bunch of messages in /var/log/rhn/rhn_taskomatic_daemon.log about packages not actually existing on the filesystem:

INFO   | jvm 1    | 2020/07/20 15:40:47 | 2020-07-20 15:40:47,709 [DefaultQuartzScheduler_Worker-8] ERROR com.redhat.rhn.taskomatic.task.PackageCleanup - /var/spacewalk/redhat/1/462/rh-python36-numpy/1:1.13.1-1.el7/x86_64/462e4a2f951afca
6ce14076c9e859784a5e7630112c07d77b30bcef2dc410299/rh-python36-numpy-1.13.1-1.el7.x86_64.rpm not found

After it successfully got them deleted, I attempted a sync again, from the command line:

[root@spacewalk rhn]# /usr/bin/spacewalk-repo-sync --channel centos_6_x86_64
08:08:00 ======================================
08:08:00 | Channel: centos_6_x86_64
08:08:00 ======================================
08:08:00 Sync of channel started.
08:08:00 Repo URL: http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=os
08:08:04 Packages in repo:              6713
08:11:46 Packages already synced:       6216
08:11:46 Packages to sync:               497
08:11:46 New packages to download:         0
08:11:46 Linking packages to channel.
08:11:57 ERROR: (23, 'ERROR:  duplicate key value violates unique constraint "rhn_cnp_cid_nid_uq"', 'Could not update database entry.')
08:11:57 Repo URL: http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=updates
Traceback (most recent call last):
  File "/usr/bin/spacewalk-repo-sync", line 257, in <module>
    sys.exit(abs(main() or 0))
  File "/usr/bin/spacewalk-repo-sync", line 240, in main
    elapsed_time, channel_ret_code = sync.sync()
  File "/usr/lib/python2.7/site-packages/spacewalk/satellite_tools/reposync.py", line 453, in sync
    """, repo_id=int(repo_id))
  File "/usr/lib/python2.7/site-packages/spacewalk/server/rhnSQL/__init__.py", line 292, in fetchall_dict
    h.execute(sql, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/spacewalk/server/rhnSQL/sql_base.py", line 151, in execute
    return self._execute_wrapper(self._execute, *p, **kw)
  File "/usr/lib/python2.7/site-packages/spacewalk/server/rhnSQL/driver_postgresql.py", line 302, in _execute_wrapper
    raise sql_base.SQLSchemaError(error_code, e.pgerror, e)
spacewalk.server.rhnSQL.sql_base.SQLSchemaError: (99999, 'ERROR:  current transaction is aborted, commands ignored until end of transaction block', '', InternalError('current transaction is aborted, commands ignored until end of transact
ion block\n',))

Dang it. Same error. (The “current transaction is aborted” complaint at the bottom of the traceback is just Postgres refusing to run anything else in the same transaction after the duplicate-key insert failed; the duplicate key is the real problem.) Okay, let’s look at the Postgres logs. Spacewalk sets up Postgres with mostly default settings and stashes pg_log in /var/lib/pgsql/data/pg_log, rolling it over every day and keeping 7 days of logs. I’m still not entirely sure I needed to, but I increased the verbosity in /var/lib/pgsql/data/postgresql.conf:

client_min_messages = debug1
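
A couple of other knobs in the same file can help when chasing this sort of thing; they’re standard Postgres settings, not Spacewalk-specific, and in the end the debug1 setting above was all I actually needed:

# Other postgresql.conf logging settings worth knowing about
log_min_messages = info            # severity threshold for what lands in pg_log
log_min_error_statement = error    # log the SQL statement that triggered an error
log_line_prefix = '%m %u %d '      # timestamp, user, and database on each log line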

Then I ran “rhn-satellite restart” to bounce the whole Spacewalk environment, including Postgres, and kicked off the repo-sync again. As it failed, Postgres threw this error:

2020-07-20 13:32:17.644 EDT ERROR:  duplicate key value violates unique constraint "rhn_cnp_cid_nid_uq"
2020-07-20 13:32:17.644 EDT DETAIL:  Key (channel_id, name_id, package_arch_id)=(101, 4586, 120) already exists.
2020-07-20 13:32:17.644 EDT CONTEXT:  SQL statement "insert into rhnChannelNewestPackage
                        (channel_id, name_id, evr_id, package_id, package_arch_id)
                        (select channel_id,
                                name_id, evr_id,
                                package_id, package_arch_id
                           from rhnChannelNewestPackageView
                          where channel_id = channel_id_in
                            and (package_name_id_in is null
                                 or name_id = package_name_id_in)
                        )"
        PL/pgSQL function rhn_channel.refresh_newest_package(numeric,character varying,numeric) line 9 at SQL statement

So apparently the issue is whatever row in rhnChannelNewestPackage has channel_id=101, name_id=4586, and package_arch_id=120. Let’s see if we can figure out what that is. I su’d to the postgres user and opened the database:

[root@spacewalk reposync]# su - postgres
Last login: Mon Jul 20 15:10:45 EDT 2020 on pts/2
-bash-4.2$ psql rhnschema
psql (9.2.23)
Type "help" for help.

rhnschema=# select * from rhnChannelNewestPackage where channel_id=101 and name_id=4586;
 channel_id | name_id | evr_id | package_arch_id | package_id
------------+---------+--------+-----------------+------------
        101 |    4586 |  19624 |             120 |      63550
(1 row)

Okay, let’s see what that guy is.

rhnschema=# select * from rhnpackage where id=63550;
  id   | org_id | name_id | evr_id | package_arch_id | package_group | rpm_version |                             description                              |                    summary                    | package_size | payload_size | ins
talled_size |        build_host         |     build_time      | source_rpm_id | checksum_id | vendor | payload_format | compat |                                                                      path
                                    | header_sig | copyright | cookie |     last_modified      |            created            |           modified            | header_start | header_end
-------+--------+---------+--------+-----------------+---------------+-------------+----------------------------------------------------------------------+-----------------------------------------------+--------------+--------------+----
------------+---------------------------+---------------------+---------------+-------------+--------+----------------+--------+-------------------------------------------------------------------------------------------------------------
------------------------------------+------------+-----------+--------+------------------------+-------------------------------+-------------------------------+--------------+------------
 63550 |      1 |    4586 |  19624 |             120 |            13 | 4.8.0       | This package provides the sar and iostat commands for Linux. Sar and+| The sar and iostat system monitoring commands |       239820 |       854664 |
     846133 | x86_64_01.bsys.centos.org | 2018-01-26 02:43:42 |         26153 |     7595952 | CentOS | cpio           |      0 | redhat/1/b05/sysstat/9.0.4-33el6_9.1/x86_64/b05d4090caae0fb2df7005a07862ce2ce82e88ced1a58f42f2eb88769e47a45b
/sysstat-9.0.4-33el6_9.1.x86_64.rpm |            | GPLv2+    |        | 2018-01-28 11:05:36-05 | 2018-01-28 21:05:38.918665-05 | 2018-01-28 21:05:38.918665-05 |         1384 |      27824
       |        |         |        |                 |               |             | iostat enable system monitoring of disk, network, and other IO      +|                                               |              |              |
            |                           |                     |               |             |        |                |        |
                                    |            |           |        |                        |                               |                               |              |
       |        |         |        |                 |               |             | activity.                                                            |                                               |              |              |
            |                           |                     |               |             |        |                |        |
                                    |            |           |        |                        |                               |                               |              |
(1 row)

sysstat-9.0.4-33el6_9.1.x86_64.rpm seems to be the problem. I thought about simply deleting that row from rhnPackage and rhnChannelNewestPackage, but decided it would probably be cleaner to let Spacewalk do it and theoretically take care of all the bad metadata itself. So I logged back into spacewalk, went to the Channels tab atop the dashboard, then clicked Manage Software Channels in the left-side menu, followed by Manage Software Packages. I selected “All Managed Packages” in the dropdown, searched for sysstat, found the 9.0.4-33el6_9.1 version, selected and deleted it. Then I ran the repo-sync again.
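
In hindsight, this would have been a good moment to check which tables still referenced that package id. Something like this (my own ad-hoc query, using only tables that appear elsewhere in this post) shows where a given package id is still linked:

-- Where does package id 63550 (the sysstat row above) still show up?
SELECT 'rhnchannelpackage' AS tab, count(*) FROM rhnchannelpackage WHERE package_id = 63550
UNION ALL
SELECT 'rhnchannelnewestpackage', count(*) FROM rhnchannelnewestpackage WHERE package_id = 63550;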

Argh. Same error. Okay, let’s check Postgres again.

2020-07-21 08:11:57.106 EDT DETAIL:  Key (channel_id, name_id, package_arch_id)=(101, 4083, 100) already exists.

Curious. Same error, with a different name_id. Let’s see what that is.

rhnschema=# select * from rhnChannelNewestPackageView where channel_id=101 and name_id=4083;
 channel_id | name_id | evr_id | package_arch_id | package_id
------------+---------+--------+-----------------+------------
        101 |    4083 |  19623 |             100 |      63530
(1 row)

What the flip? That’s the same package_id from before. But the package is gone!

rhnschema=# select * from rhnpackage where id=63550;
 id | org_id | name_id | evr_id | package_arch_id | package_group | rpm_version | description | summary | package_size | payload_size | installed_size | build_host | build_time | source_rpm_id | checksum_id | vendor | payload_format | co
mpat | path | header_sig | copyright | cookie | last_modified | created | modified | header_start | header_end
----+--------+---------+--------+-----------------+---------------+-------------+-------------+---------+--------------+--------------+----------------+------------+------------+---------------+-------------+--------+----------------+---
-----+------+------------+-----------+--------+---------------+---------+----------+--------------+------------
(0 rows)

Okay, it must be stashing a row of metadata in some table somewhere. I deleted it from rhnChannelNewestPackage manually:

rhnschema=# delete from rhnChannelNewestPackage where channel_id=101 and name_id=4083;
DELETE 1
rhnschema=# select * from rhnChannelNewestPackage where channel_id=101 and name_id=4083;
 channel_id | name_id | evr_id | package_arch_id | package_id
------------+---------+--------+-----------------+------------
(0 rows)
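
(Side note: if you’re hand-deleting rows like this and are as unsure of yourself as I was, wrapping the deletes in a transaction gives you an undo button. Not something I did at the time, just what I’d do next time:)

BEGIN;
DELETE FROM rhnChannelNewestPackage WHERE channel_id = 101 AND name_id = 4083;
-- poke around and make sure only the row you expected is gone...
SELECT * FROM rhnChannelNewestPackage WHERE channel_id = 101 AND name_id = 4083;
COMMIT;   -- or ROLLBACK; if it took out more than you bargained for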

The Postgres log also referenced rhnChannelNewestPackageView, which still had a row with that name_id:

rhnschema=# select * from rhnChannelNewestPackageView where channel_id=101 and name_id=4083;
 channel_id | name_id | evr_id | package_arch_id | package_id
------------+---------+--------+-----------------+------------
        101 |    4083 |  19623 |             100 |      63530
(1 row)

That object is clearly a view, so I assumed I wouldn’t be able to delete from it directly, only from the tables underneath it. But maybe there was some kind of trigger set up that would let you delete from the view and have it do the cleanup underneath. Might as well give it a shot:

rhnschema=# delete from rhnChannelNewestPackageView where channel_id=101 and name_id=4083;
ERROR:  cannot delete from view "rhnchannelnewestpackageview"
HINT:  You need an unconditional ON DELETE DO INSTEAD rule or an INSTEAD OF DELETE trigger.

Yep, as I figured. Okay, let’s see if it’ll tell me what tables that view encompasses:

rhnschema=# \d rhnchannelnewestpackageview
View "public.rhnchannelnewestpackageview"
     Column      |  Type   | Modifiers
-----------------+---------+-----------
 channel_id      | numeric |
 name_id         | numeric |
 evr_id          | numeric |
 package_arch_id | numeric |
 package_id      | numeric |
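
(In hindsight, the right move here would have been to ask Postgres for the view’s definition; “\d+ rhnchannelnewestpackageview” or the pg_views catalog will dump the underlying SELECT, which would have told me exactly which tables it reads from. Standard Postgres, nothing Spacewalk-specific; I just didn’t know it at the time:)

-- Either of these shows the view's defining query
\d+ rhnchannelnewestpackageview
SELECT definition FROM pg_views WHERE viewname = 'rhnchannelnewestpackageview';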

The plain \d output, at least, wasn’t helpful. At this point, sadly, I had run out of Postgres expertise (I don’t have any to begin with, really, other than wrangling Slony clusters at work), so it was down to shots in the dark. I saw there was an rhnChannelPackage table and thought maybe it still contained a row for that package:

rhnschema=# select * from rhnchannelpackage where channel_id=101 and name_id=4083;
ERROR:  column "name_id" does not exist
LINE 1: ...* from rhnchannelpackage where channel_id=101 and name_id=40...

Heaven forfend it have the same table structure as rhnChannelNewestPackage. Argh. Okay, what is its structure?

rhnschema=# \d rhnchannelpackage
                Table "public.rhnchannelpackage"
   Column   |           Type           |       Modifiers
------------+--------------------------+------------------------
 channel_id | numeric                  | not null
 package_id | numeric                  | not null
 created    | timestamp with time zone | not null default now()
 modified   | timestamp with time zone | not null default now()
Indexes:
    "rhn_cp_cp_uq" UNIQUE, btree (channel_id, package_id)
    "rhn_cp_pid_idx" btree (package_id)
Foreign-key constraints:
    "rhn_cp_cid_fk" FOREIGN KEY (channel_id) REFERENCES rhnchannel(id) ON DELETE CASCADE
    "rhn_cp_pid_fk" FOREIGN KEY (package_id) REFERENCES rhnpackage(id)
Triggers:
    rhn_channel_package_mod_trig BEFORE INSERT OR UPDATE ON rhnchannelpackage FOR EACH ROW EXECUTE PROCEDURE rhn_channel_package_mod_trig_fun()

Well golly, look at all that. None of those indexes or foreign keys were mentioned in the original error, but it wouldn’t have surprised me if they were tangled up in it somewhere through the table structure. Let’s see if there’s a row matching our wonky sysstat package:

rhnschema=# select * from rhnchannelpackage where channel_id=101 and package_id=63530;
 channel_id | package_id |           created            |           modified
------------+------------+------------------------------+------------------------------
        101 |      63530 | 2018-01-28 21:05:58.69026-05 | 2018-01-28 21:05:58.69026-05
(1 row)

Well knock me over with a feather. Okey dokey, let’s eliminate that thang:

rhnschema=# delete from rhnchannelpackage where channel_id=101 and package_id=63530;
DELETE 1

Now, that name_id still appeared in rhnChannelNewestPackageView, unfortunately, though now pointing at a different (presumably older) package_id:

rhnschema=# select * from rhnChannelNewestPackageView where channel_id=101 and name_id=4083;
 channel_id | name_id | evr_id | package_arch_id | package_id
------------+---------+--------+-----------------+------------
        101 |    4083 |   1685 |             100 |       5380
(1 row)

But I wasn’t sure what the heck else to try, so I figured it wouldn’t hurt anything to run the repo-sync again:

[root@spacewalk rhn]# /usr/bin/spacewalk-repo-sync --channel centos_6_x86_64
08:20:52 ======================================
08:20:52 | Channel: centos_6_x86_64
08:20:52 ======================================
08:20:52 Sync of channel started.
08:20:52 Repo URL: http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=os
08:20:57 Packages in repo:              6713
08:23:34 Packages already synced:       6216
08:23:34 Packages to sync:               497
08:23:34 New packages to download:         0
08:23:34 Linking packages to channel.
08:23:38 Repo http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=os has comps file comps.xml.

Hey, it never got that far without throwing an error. That’s promising.

08:23:38 Repo http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=os has 0 errata.
08:23:38 Repo URL: http://mirrorlist.centos.org/?release=6&arch=x86_64&repo=updates
08:23:42 Packages in repo:              1068
08:23:46 Packages already synced:          4
08:23:46 Packages to sync:              1064
08:23:46 New packages to download:      1064
08:23:46 1/1064 : 389-ds-base-1.2.11.15-97.el6_10.x86_64.rpm
08:23:47 2/1064 : adcli-0.8.1-3.el6_10.x86_64.rpm
08:23:47 3/1064 : 389-ds-base-devel-1.2.11.15-97.el6_10.i686.rpm
08:23:47 4/1064 : autofs-5.0.5-140.el6_10.1.x86_64.rpm
08:23:47 5/1064 : 389-ds-base-devel-1.2.11.15-97.el6_10.x86_64.rpm
08:23:47 6/1064 : 389-ds-base-libs-1.2.11.15-97.el6_10.x86_64.rpm
08:23:47 7/1064 : 389-ds-base-libs-1.2.11.15-97.el6_10.i686.rpm
08:23:47 8/1064 : autofs-5.0.5-140.el6_10.x86_64.rpm
08:23:47 9/1064 : bind-chroot-9.8.2-0.68.rc1.el6_10.1.x86_64.rpm
08:23:47 10/1064 : bind-chroot-9.8.2-0.68.rc1.el6_10.3.x86_64.rpm
08:23:47 11/1064 : bind-chroot-9.8.2-0.68.rc1.el6_10.7.x86_64.rpm
08:23:47 12/1064 : bind-devel-9.8.2-0.68.rc1.el6_10.1.i686.rpm
08:23:47 13/1064 : bind-devel-9.8.2-0.68.rc1.el6_10.1.x86_64.rpm
08:23:48 14/1064 : bind-devel-9.8.2-0.68.rc1.el6_10.3.i686.rpm
08:23:48 15/1064 : bind-devel-9.8.2-0.68.rc1.el6_10.3.x86_64.rpm
08:23:48 16/1064 : bind-devel-9.8.2-0.68.rc1.el6_10.7.i686.rpm
08:23:48 17/1064 : bind-devel-9.8.2-0.68.rc1.el6_10.7.x86_64.rpm
08:23:48 18/1064 : bind-9.8.2-0.68.rc1.el6_10.7.x86_64.rpm
08:23:48 19/1064 : bind-9.8.2-0.68.rc1.el6_10.3.x86_64.rpm
08:23:48 20/1064 : bind-libs-9.8.2-0.68.rc1.el6_10.3.x86_64.rpm
08:23:48 21/1064 : bind-libs-9.8.2-0.68.rc1.el6_10.7.x86_64.rpm
08:23:48 22/1064 : bind-9.8.2-0.68.rc1.el6_10.1.x86_64.rpm

Yaaaaaaaasssssss!

I’m honestly not entirely sure which of those deletions fixed it, but it seems like the safest approach is to look at the Postgres logs, figure out which package is making it burp, and then go into Postgres and delete any reference to it from the database. Then try the repo-sync again. If that doesn’t work, try prayer.
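
For my own future reference, here’s the rough sequence boiled down. rhnChannelNewestPackage looks to me like a denormalized “newest package per channel” table, which is why stale rows in it (and in rhnChannelPackage) upset the sync; the literal ids come from whatever DETAIL line your pg_log spits out, and as always, snapshot first:

-- 1. Pull the offending key out of pg_log, e.g.:
--    DETAIL: Key (channel_id, name_id, package_arch_id)=(101, 4083, 100) already exists.

-- 2. See which package the "newest package" table points at for that key
SELECT * FROM rhnChannelNewestPackage
 WHERE channel_id = 101 AND name_id = 4083;

-- 3. Remove the stale rows (wrap in BEGIN/COMMIT if you want an escape hatch)
DELETE FROM rhnChannelNewestPackage
 WHERE channel_id = 101 AND name_id = 4083;
DELETE FROM rhnChannelPackage
 WHERE channel_id = 101 AND package_id = 63530;   -- package_id from step 2

-- 4. Re-run the sync and see if it gets further:
--    /usr/bin/spacewalk-repo-sync --channel centos_6_x86_64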