It’s generally a bad idea to do logical dump/restores of slony nodes, and for this reason slony provides the CLONE PREPARE/CLONE FINISH action commands to clone subscriber nodes.
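For reference, the supported route looks roughly like the slonik sketch below; the node IDs, cluster name, and conninfo strings are hypothetical placeholders rather than anything from my setup.

cluster name = test_replication;
node 1 admin conninfo = 'dbname=TEST host=provider_host user=slony';
node 3 admin conninfo = 'dbname=TEST host=new_host user=slony';

# register node 3 as a clone of existing subscriber node 1
clone prepare (id = 3, provider = 1, comment = 'clone of node 1');
wait for event (origin = 1, confirmed = all, wait on = 1);

# ... copy node 1's database over to node 3 (e.g. dump/restore) ...

clone finish (id = 3, provider = 1);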
In this instance though, I have a test environment where I'd stopped the slons, dumped out and dropped a subscriber database, and then gone on to do some other testing with that postgres cluster. Some time later I want to do more testing with the slony cluster; I never dropped this node from the slony config, but in the meantime I've changed the postgres version from 9.0 to 9.4 and recreated the postgres cluster with initdb. Schema-wise nothing has changed with the old nodes.
What follows is some fudge to make the node resume replication; it's neither recommended nor guaranteed to work. Copy this process at your peril.
After recompiling the same version of slony I had before (2.2.4) and installing the binaries, I've restored the old subscriber database dump into the new postgres 9.4 cluster and updated my extensions.
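The restore itself was nothing special; it amounted to something like the following, where the dump filename is a made-up placeholder:

$ createdb -U glyn TEST
$ psql -U glyn -d TEST -f subscriber_node1_dump.sql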
So what's stopping me from just restarting the slons now? Firstly, slony stores the oid of each table participating in replication, and these will surely have changed now. The following query produces a long list of tables where this is true:
[postgres]
TEST=# SELECT s.tab_set, s.tab_reloid, s.tab_nspname, s.tab_relname,
           c.oid, n.nspname, c.relname
       FROM _test_replication.sl_table s
       JOIN pg_catalog.pg_namespace n ON n.nspname = s.tab_nspname
       JOIN pg_catalog.pg_class c ON c.relnamespace = n.oid
                                 AND c.relname = s.tab_relname
       WHERE s.tab_reloid <> c.oid;
[/postgres]
The fix here is pretty simple: I could run REPAIR CONFIG against the node, but instead I just hit the updatereloid(<set id>, <node id>) function directly:
[postgres]
TEST=# SELECT s.set_id, s.set_comment,
           CASE _test_replication.updatereloid(s.set_id, 1) WHEN 1 THEN 'OK' ELSE 'FAILED' END
       FROM _test_replication.sl_set s;
[/postgres]
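Had I gone the slonik route instead, REPAIR CONFIG would have looked something along these lines (the conninfo is a placeholder):

cluster name = test_replication;
node 1 admin conninfo = 'dbname=TEST host=localhost user=glyn';

# rebuild the name-to-oid mapping for set 1, executing only on node 1
repair config (set id = 1, event node = 1, execute only on = 1);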
For clarity, my resurrected node is ID 1, and was a subscriber at the time it was destroyed.
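If you need to double-check which sets a node receives, sl_subscribe shows it; something like:

[postgres]
TEST=# SELECT sub_set, sub_provider, sub_receiver, sub_active
FROM _test_replication.sl_subscribe
WHERE sub_receiver = 1;
[/postgres]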
Secondly, slony stores the transaction ID for each replication event against the node, and if my resurrected node has a lower current xid, then new events are going to look like they're in the past compared to events already generated against the node. I can see the current xid of my new postgresql cluster with the following query:
[postgres]
TEST=# SELECT txid_current();
 txid_current
--------------
        25765
[/postgres]
And I can compare this to slony events generated against the node in its old incarnation as follows:
[postgres]
TEST=# SELECT max(txid_snapshot_xmax(ev_snapshot)), txid_current(),
           txid_current() - max(txid_snapshot_xmax(ev_snapshot)) AS delta
       FROM _test_replication.sl_event WHERE ev_origin = 1;
  max  | txid_current | delta
-------+--------------+--------
 89004 |        25767 | -63237
[/postgres]
So transaction-ID-wise my node is in the past; to bump it up I need about 63237 transactions to happen. If the gap were much larger I'd have to think of another way (perhaps stopping all the slons and updating all values of ev_snapshot for node 1 on all nodes), but I can easily generate ~63k transactions by hitting txid_current() in a loop:
$ while [ 1 ]; do psql -U glyn -d TEST -tAc "SELECT txid_current();"; done
25768
25769
25770
<snip>
89002
89003
89004
89005
89006
^C
$
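Reconnecting for every single transaction is slow; since psql in autocommit mode runs each statement in its own transaction, a quicker (untested) variant pushes all the statements down one connection:

$ yes 'SELECT txid_current();' | head -n 63500 | psql -U glyn -d TEST -tAq | tail -n 1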
I can now restart my slons and replication should continue; so far all appears well.
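To keep an eye on it, the sl_status view on the origin gives a quick read on lag; assuming it behaves here as on any healthy node, something like:

[postgres]
TEST=# SELECT st_origin, st_received, st_lag_num_events, st_lag_time
FROM _test_replication.sl_status;
[/postgres]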