Category Archives: Databases

WSFC and iscsitarget: “does not have the inquiry data (SCSI page 83h VPD descriptor) that is required by failover clustering”

Last week whilst trying to get to grips with SQL Server AlwaysOn Failover Clusters, I set up a simple iSCSI target using the “iscsitarget” package as per the Debian docs. However when trying to validate the cluster in WSFC (Windows Server Failover Clustering) the disk checks failed with:

“does not have the inquiry data (SCSI page 83h VPD descriptor) that is required by failover clustering”

This has something to do with the scsiId, which is required by the cluster manager to control volume ownership, being supplied by iscsitarget in a format unsupported by WSFC.

I failed to find a workaround for this and instead switched to using “tgt” to serve the iSCSI targets. I was pushed for time, and couldn’t find a straightforward guide so I’m documenting my steps here.

1) Install tgt:

# apt-get install tgt

2) Enable and start tgt:

# systemctl enable tgt.service
# systemctl start tgt.service

3) Create the iSCSI target(s) and add their backing stores:

# tgtadm --lld iscsi --op new --mode target --tid 1 --targetname iqn.2001-04.com.example:storage.lun1
# tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 --backing-store /dev/sdb1

4) Bind the target to listen on all interfaces, with a user account:

# tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL
# tgtadm --lld iscsi --op new --mode account --user mssql --password secret
# tgtadm --lld iscsi --op bind --mode account --tid 1 --user mssql

5) Dump the config out into a configuration file:

# tgt-admin --dump > /etc/tgt/conf.d/default.conf
# sed -i -e 's/PLEASE_CORRECT_THE_PASSWORD/secret/' /etc/tgt/conf.d/default.conf

6) Restart to ensure the configuration is picked up.

# systemctl restart tgt.service

Windows psql and utf8 client_encoding issues

Prior to pg 9.1, you could connect to any database with psql regardless of encoding and you’d get the server encoding as your client encoding unless you set it:

C:\>chcp 1252
Active code page: 1252

C:\>"C:\Program Files\PostgreSQL\9.0\bin\psql" -U glyn -d TEST -h pgtest
psql (9.0.22, server 9.4.4)
WARNING: psql version 9.0, server version 9.4.
         Some psql features might not work.
SSL connection (cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256)
Type "help" for help.

TEST=> show client_encoding;
 client_encoding
-----------------
 LATIN1
(1 row)

That wasn’t quite right; the client_encoding is a lie. On a modern psql version that’s quite rightly prevented:

C:\>"C:\Program Files (x86)\pgAdmin III\1.20\psql" -U glyn -d TEST -h pgtest
psql: FATAL:  conversion between WIN1252 and LATIN1 is not supported

This is not an issue if you want to connect to a utf8 database, but the issue I had this morning was connecting to a latin1 database with psql from a Windows client (something I do rarely). If I set the codepage to utf8 to match client encoding, I got a “Not enough memory.” error:

C:\>chcp 65001
Active code page: 65001

C:\>set PGCLIENTENCODING=UTF8

C:\>"C:\Program Files (x86)\pgAdmin III\1.20\psql" -U glyn -d TEST -h pgtest
psql (9.4.0, server 9.4.4)
WARNING: Console code page (65001) differs from Windows code page (1252)
         8-bit characters might not work correctly. See psql reference
         page "Notes for Windows users" for details.
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 25
6, compression: off)
Type "help" for help.

TEST=> show client_encoding;
Not enough memory.

I could set the codepage to 1252, but that would mean my setting for client_encoding would be a lie, and if I were to then revert to set client_encoding=’WIN1252′ I’d have come full circle and be back at the “FATAL: conversion between WIN1252 and LATIN1 is not supported” error message.

A quick google revealed these bug reports with no solutions. Another dig at the docs revealed the following passage:

pager

Controls use of a pager program for query and psql help output. If the environment variable PAGER is set, the output is piped to the specified program. Otherwise a platform-dependent default (such as more) is used.

So how does more behave?

C:\>chcp 65001
Active code page: 65001

C:\>more
Not enough memory.

Bingo! So if I turn the pager off the error should go:

C:\>chcp 65001
Active code page: 65001

C:\>set PGCLIENTENCODING=UTF8
C:\>"C:\Program Files (x86)\pgAdmin III\1.20\psql" -U glyn -d TEST -h pgtest
WARNING: Console code page (65001) differs from Windows code page (1252)
         8-bit characters might not work correctly. See psql reference
         page "Notes for Windows users" for details.
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 25
6, compression: off)
Type "help" for help.

TEST=> \pset pager off
Pager usage is off.
TEST=> show client_encoding;
 client_encoding
-----------------
 UTF8
(1 row)

Cluprit found; quite embarrassing that the source of such a verbose error as “Not enough memory.” being Microsoft didn’t occour sooner. So lets try a different pager (sourced from)

C:\>chcp 65001
Active code page: 65001

C:\>set PGCLIENTENCODING=UTF8
C:\>set PAGER="C:\Program Files (x86)\gnuwin32\bin\less"
C:\>set LESS=--quit-at-eof

C:\>"C:\Program Files (x86)\pgAdmin III\1.20\psql" -U glyn -d TEST -h pgtest
psql (9.4.0, server 9.4.4)
WARNING: Console code page (65001) differs from Windows code page (1252)
         8-bit characters might not work correctly. See psql reference
         page "Notes for Windows users" for details.
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 25
6, compression: off)
Type "help" for help.

TEST=> show client_encoding;

 client_encoding
-----------------
 UTF8
(1 row)

(END)

Being cavalier with slony nodes and pg_dump/pg_restore

It’s generally a bad idea to do logical dump/restores of slony nodes, and for this reason slony provides the CLONE PREPARE/CLONE FINISH action commands to clone subscriber nodes.

In this instance though, I’ve a test environment where I’d stopped the slons, dumped out and dropped a subscriber database and then gone on to do some other testing with that postgres cluster. Sometime later I want to do some more testing with the slony cluster; I never dropped this node from the slony config, but in the meantime I’ve changed the postgres version from 9.0 to 9.4 and recreated the postgres cluster with initdb. Schema wise nothing has changed with the old nodes.

What follows is some fudge to make the node resume with replication, it’s neither recommended nor guaranteed to work. Copy this process at your peril.

After recompiling the same version of slony I had before (2.2.4) and installing the binaries, I’ve just restored the old subscriber database dump into the new postgres 9.4 cluster and updated my extensions.

So what’s holding me off just restarting the slons now? Firstly slony stores the oid of each table participating in replication and these will surely have changed now. The following query produces a long list of tables where this is true:

[postgres]
TEST=# SELECT s.tab_set, s.tab_reloid, s.tab_nspname, s.tab_relname,
c.oid, n.nspname, c.relname
FROM _test_replication.sl_table s
JOIN pg_catalog.pg_class c ON c.oid = s.tab_reloid
JOIN pg_catalog.pg_namespace n ON c.relnamespace = n.oid
WHERE (s.tab_reloid, s.tab_nspname, s.tab_relname) <> (c.oid, n.nspname, c.relname);
[/postgres]

The fix here is pretty simple, I could run REPAIR CONFIG against the node, but instead I just hit the updatereloid(<set id>, <node id>) function directly:

[postgres]
TEST=# SELECT s.set_id, s.set_comment,
CASE _test_replication.updatereloid(s.set_id,1) WHEN 1 THEN ‘OK’ ELSE ‘FAILED’ END
FROM _test_replication.sl_set s;
[/postgres]

For clarity, my resurrected node is ID 1, and was a subscriber at the time it was destroyed.

Secondly slony stores the transaction ID for each replication event against the node, and if my resurrected node has a lower current xid then new events are going to look like they’re in the past compared to events that have already been generated against the node. I can see the current xid of my new postgresql cluster with the following query:

[postgres]
TEST=# SELECT txid_current();
txid_current
————–
25765
[/postgres]

And I can compare this to slony events generated against the node in its old incarnation as follows:

[postgres]
TEST=# SELECT max(txid_snapshot_xmax(ev_snapshot)), txid_current(), txid_current()-max(txid_snapshot_xmax(ev_snapshot)) AS delta
FROM _test_replication.sl_event where ev_origin = 1;

max | txid_current | delta
——-+————–+——–
89004 | 25767 | -63237
[/postgres]

So transaction ID wise my node is in the past, to bump that up I need about 63237 transactions to happen, if that were a lot higher I’d have to think of another way (perhaps by stopping all the slons and updating all values of ev_snapshot for node 1 on all nodes), but I can easily generate 60k transactions by hitting txid_current():

$ while [ 1 ]; do psql -U glyn -d TEST -tAc "SELECT txid_current();"; done
25768
25769
25770

<snip>

89002
89003
89004
89005
89006
^C
$ 

I can now restart my slons and replication should continue, and so far all appears well.