I’ve been spending my spare bits of time over the last couple of weeks looking at the latest release of Slony-I. At a quick glance the main change between 2.1 and 2.2 appears to be to the sl_log table format, but the seemingly minor changes to the way clusters are failed over and reshaped actually go much deeper.
For example, in previous versions it was possible for a subscriber to pull multiple sets from different providers and later change the provider for any set at will using the “SUBSCRIBE SET” command. As of 2.2, however, although it’s still possible to initially subscribe a node with different providers for each set, any later changes must use the “RESUBSCRIBE NODE” command, which only allows resubscribing all sets from a particular origin to a single provider.
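To make the difference concrete, here is a small Python model of the two behaviours. It is only a sketch of the semantics described above, not Slony code: the `Subscription` type and both function names are invented for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Subscription:
    set_id: int
    origin: int    # node where the set originates
    provider: int  # node this subscriber pulls the set from


def change_provider_for_set(subs, set_id, new_provider):
    """Pre-2.2 style: re-point a single set, leaving the others alone."""
    return [replace(s, provider=new_provider) if s.set_id == set_id else s
            for s in subs]


def resubscribe_node(subs, origin, new_provider):
    """2.2 style: re-point every set from `origin` to one provider."""
    return [replace(s, provider=new_provider) if s.origin == origin else s
            for s in subs]


# Two sets from origin 1 pulled via different providers, one set from origin 4.
subs = [Subscription(1, origin=1, provider=2),
        Subscription(2, origin=1, provider=3),
        Subscription(3, origin=4, provider=4)]

after = resubscribe_node(subs, origin=1, new_provider=3)
providers = {s.set_id: s.provider for s in after}
```

After the call, both sets from origin 1 share provider 3, while the set from origin 4 is untouched.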
There are also changes to the “FAILOVER” command to improve reliability in a situation where multiple nodes have failed; you can now pass in multiple failed nodes and Slony should do the right thing. So far my tests with 2.2.2 show there may be some issues when passing in multiple failed nodes where one is a downstream provider to a cascaded subscriber; however, that’s a corner case and hopefully we’ll see a fix soonish. (Edit 16/05/2014: There’s now a patch against 2.2.2 for this)
The changes to the sl_log table mean that data is now replicated in a slightly more logical way; changes are logged as arrays of values rather than chunks of SQL to execute on the subscriber, and the data is sent over a pipe using COPY rather than fetched in chunks via a cursor. DDL has also been moved out of sl_event and into a new sl_log_script table. Upgrading will most likely require some brief downtime, as running the update functions requires a lock on all sets and waiting out the cleanup interval for a cleanupevent/logswitch to happen to clear out the tables.
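As a rough illustration of “arrays of values rather than chunks of SQL”, the toy Python below encodes the same update both ways. This is my own simplification and not Slony’s actual log layout: the function names, the quoting and the exact shape of the array are all assumptions made for the example.

```python
def as_sql_text(table, changes, key):
    """Old style (toy): the log row carries ready-made SQL text."""
    set_part = ", ".join(f"{col}='{val}'" for col, val in changes.items())
    where_part = " and ".join(f"{col}='{val}'" for col, val in key.items())
    return f"update {table} set {set_part} where {where_part};"


def as_value_array(changes, key):
    """New style (toy): a flat list of column names and values, plus a
    count of how many leading pairs are the updated columns."""
    args = []
    for col, val in list(changes.items()) + list(key.items()):
        args += [col, val]
    return len(changes), args


changes = {"price": "9.99"}
key = {"id": "42"}

sql_row = as_sql_text("product", changes, key)
n_updated, array_row = as_value_array(changes, key)
```

The point of the second form is that the subscriber receives plain data it can apply itself, rather than a statement it has to parse and execute.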
On a separate note, this evening (not the best use of a bank holiday weekend), whilst looking at how these changes would affect my experimental failover script, I had a quick bash at adding in some “autofailover” functionality; the idea being that the script keeps polling all the nodes, and upon detecting any unavailable nodes runs the failover command. It’s functionality I’ve never personally wanted, as it’s possible to get into all sorts of trouble blindly failing over onto an asynchronous replica; in fact in a busy environment it’s pretty much guaranteed (e.g. missing a single update to a product price and then taking millions of sales on the wrong price!). However, perhaps it could be quite useful in a mostly read-only environment where updates are low volume, such as a wiki; more thought needed, I think.
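The polling idea can be sketched in a few lines of Python. This is not the failover script itself, just a minimal illustration: the function names, the plain TCP connect used as a health check, and the “several failures in a row” threshold are all my assumptions for the example.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.
    Note this only shows the port answers, not that the database is healthy."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def poll_once(nodes, failures, threshold, check=is_reachable):
    """Check every node once and return the IDs of nodes that have now
    failed `threshold` checks in a row. `failures` is updated in place."""
    failed = []
    for node_id, (host, port) in nodes.items():
        if check(host, port):
            failures[node_id] = 0  # any success resets the counter
        else:
            failures[node_id] = failures.get(node_id, 0) + 1
            if failures[node_id] >= threshold:
                failed.append(node_id)
    return sorted(failed)
```

A wrapper would call `poll_once` on a timer and hand any returned node IDs to whatever issues the failover. Requiring several consecutive failures at least avoids failing over on a single dropped connection, though it does nothing about the replica being behind.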