Linux bonding and active-backup of LACP DLA / 802.3ad adapters on linked, but non-stacked switches.

Today I’m trying to configure a couple of servers each with 2 LACP trunks going to separate switches on our network. I was hoping that if I made a single 802.3ad bond with all the interfaces it’d automatically work in active-backup mode with the 2 trunks and give me switch redundancy.

It would appear that the Linux bonding driver does do this, so I set up my bond as follows:

auto bond0
iface bond0 inet static
	address 192.168.0.30
	netmask 255.255.255.0
	network 192.168.0.0
	broadcast 192.168.0.255
	gateway 192.168.0.1
	bond_slaves eth0 eth1 eth2 eth3
	bond_mode 802.3ad
	bond_miimon 100
	bond_downdelay 200
	bond_updelay 200       
	bond_lacp_rate 1  
	bond_xmit_hash_policy layer2+3

Then reload the network configuration:

# invoke-rc.d networking reload

All initially appears good, and I can see two separate aggregator IDs, 1 and 3, with aggregator 1 active:

Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2+3 (2)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
		Aggregator ID: 1
		Number of ports: 2
		Actor Key: 17
		Partner Key: 386
		Partner Mac Address: 00:21:f7:0e:c1:00

Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 24:6e:96:19:9d:f0
Aggregator ID: 1
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 24:6e:96:19:9d:f1
Aggregator ID: 1
Slave queue ID: 0

Slave Interface: eth2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 24:6e:96:19:9d:f2
Aggregator ID: 3
Slave queue ID: 0

Slave Interface: eth3
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 24:6e:96:19:9d:f3
Aggregator ID: 3
Slave queue ID: 0
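The grouping in the output above is easier to see at a glance when it is condensed to one line per aggregator. Below is a minimal sketch of that, assuming the /proc/net/bonding layout shown above (per-slave "Aggregator ID" lines at the start of a line, the active aggregator's ID indented); the function name agg_summary is my own invention, not part of any bonding tooling.

```shell
#!/bin/sh
# Summarise which slaves belong to which 802.3ad aggregator.
# $1: bonding status file (defaults to /proc/net/bonding/bond0)
agg_summary() {
    awk '
        # Remember the slave whose section is currently being read.
        /^Slave Interface:/ { slave = $3 }
        # Unindented "Aggregator ID" lines belong to a slave section.
        /^Aggregator ID:/ && slave != "" { members[$3] = members[$3] " " slave }
        # The indented one sits in the "Active Aggregator Info" block.
        /^[ \t]+Aggregator ID:/ { active = $3 }
        END {
            for (id in members)
                printf "aggregator %s%s:%s\n", id, (id == active ? " (active)" : ""), members[id]
        }
    ' "${1:-/proc/net/bonding/bond0}" | sort -k2,2n
}
```

Calling agg_summary with no argument reads the live status of bond0; passing a file name lets it run against a saved copy of the output.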

If I bring down both interfaces on aggregator 1 then the above switches to aggregator ID 3 and all seems good.

# ifconfig eth0 down; ifconfig eth1 down

But it all goes bad once I bring those interfaces back up; the machine disappears off the network.

# ifconfig eth0 up; ifconfig eth1 up

The issue appears to be that the link status on both trunks is up and the bond presents the same MAC address (that of the first adapter) on each trunk, so once traffic has passed through both switches they both have that MAC in their switching tables.

I couldn’t find any proper workaround for this, and eventually found a Stack Exchange post outlining the same issue. Apparently if the switches can be linked with vPC (Virtual Port Channel) or MLAG (Multi-Chassis Link Aggregation) then it can work, but otherwise not.

What I’ve done in the end is a poor man’s workaround that simply involves checking the status of the bond, and switching the interfaces when the aggregator becomes inactive. It looks like this (on Debian):

auto bond0
iface bond0 inet static
	hwaddress <mac address>
	address 192.168.0.30
	netmask 255.255.255.0
	network 192.168.0.0
	broadcast 192.168.0.255
	gateway 192.168.0.1
	bond_slaves eth0 eth1
	bond_mode 802.3ad
	bond_miimon 100
	bond_downdelay 200
	bond_updelay 200       
	bond_lacp_rate 1  
	bond_xmit_hash_policy layer2+3

Set the bond to always use the MAC address of eth0, rather than that of whichever interface happens to be enslaved first:

# mac=$(cat /sys/class/net/eth0/address); sed -i "s/<mac address>/$mac/" /etc/network/interfaces            

Reload our network configuration and check this simple configuration works as we expect:

# invoke-rc.d networking reload

Now create a script to check the status of the bond, and if it shows no active aggregator then switch the interfaces and reload network configuration:

# vi /usr/local/bin/lacp_switch.sh
#!/bin/bash
# Swap bond0 between the eth0/eth1 and eth2/eth3 trunks whenever the
# bonding driver reports that the bond has no active aggregator.
interfaces=/etc/network/interfaces

if [ "$(grep -c 'bond bond0 has no active aggregator' /proc/net/bonding/bond0)" -eq 1 ]; then
	# Exactly one line mentioning eth2 means we are on switch2: go back to switch1.
	if [ "$(grep -c 'eth2' "$interfaces")" -eq 1 ]; then
		echo "$(date +'%T %x') : Changing bond0 slaves to eth0 & eth1 on switch1"
		sed -i 's/eth2/eth0/;s/eth3/eth1/' "$interfaces"
	# Exactly one line mentioning eth0 means we are on switch1: move to switch2.
	elif [ "$(grep -c 'eth0' "$interfaces")" -eq 1 ]; then
		echo "$(date +'%T %x') : Changing bond0 slaves to eth2 & eth3 on switch2"
		sed -i 's/eth0/eth2/;s/eth1/eth3/' "$interfaces"
	else
		echo "$(date +'%T %x') : Unknown configuration"
		exit 1
	fi
	/etc/init.d/networking reload
fi
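Since the script rewrites /etc/network/interfaces in place, it can be reassuring to see which pair it would pick before letting it loose. The following is only a sketch of that decision as a read-only function; the name next_slaves is hypothetical, and it assumes a single bond_slaves line holding exactly one of the two pairs used above, separated by a single space.

```shell
#!/bin/sh
# Print the slave pair the bond would be switched to, without editing anything.
# $1: interfaces file to inspect (defaults to /etc/network/interfaces)
next_slaves() {
    # Strip trailing whitespace, then print only what follows "bond_slaves".
    current=$(sed -n -e 's/[[:space:]]*$//' \
        -e 's/^[[:space:]]*bond_slaves[[:space:]]\{1,\}//p' \
        "${1:-/etc/network/interfaces}")
    case "$current" in
        "eth0 eth1") echo "eth2 eth3" ;;
        "eth2 eth3") echo "eth0 eth1" ;;
        *) echo "unknown configuration: '$current'" >&2; return 1 ;;
    esac
}
```

Run against a copy of the interfaces file, it prints the pair on the other switch, or fails with a message when the file matches neither layout.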

Make it executable and schedule it to run every 6 seconds:

# chmod 700 /usr/local/bin/lacp_switch.sh	
# echo -e "SHELL=/bin/bash\n* * * * * root for i in {1..10}; do /usr/local/bin/lacp_switch.sh >> /var/log/lacp_switch_check 2>&1 & sleep 6; done" > /etc/cron.d/lacp_switch_check
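One thing the cron entry does not guard against is a second run starting while an earlier one is still reloading the network, which could flip the slaves twice. A possible guard, sketched here on the assumption that flock from util-linux is installed, is to skip a run whenever a lock file is already held; the helper name run_locked and the lock path in the usage note are my own choices.

```shell
#!/bin/sh
# Run a command only if nobody else holds the lock file; otherwise give up at once.
# $1: lock file path, remaining arguments: the command to run.
run_locked() {
    lock=$1
    shift
    # -n makes flock fail immediately instead of waiting for the lock.
    flock -n "$lock" "$@"
}
```

In the cron line this would amount to calling "flock -n /var/lock/lacp_switch.lock /usr/local/bin/lacp_switch.sh" in place of the bare script.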

This works, but I’m not happy with it. If somebody knows a better way to do the above, please do tell!

samba_4.1.17+dfsg-2+deb8u1 root share results in NT_STATUS_ACCESS_DENIED on subdirectories on Debian Jessie

Recently I saw a Debian Jessie server start returning “NT_STATUS_ACCESS_DENIED” whenever a user tried to access a subdirectory from a root share. A quick dig through the Debian bug tracker revealed this bug report so we’ll see it fixed in an update at some point.

However, there’s no telling when the update will actually come, so the question is what to do in the meantime. One option is to replicate the mount point elsewhere and share that, e.g. after doing the below we could just set “path=/mnt/root”:

# mkdir /mnt/root
# mount -o rbind / /mnt/root  

The other option is to apply the patch supplied in the upstream bug report to the existing Debian package; the only issue here is that we have to tread carefully so as not to break the packaging system. The Debian packaging system is very much an unknown to me, but this is how I go about applying such a patch (disclaimer: follow this advice at your own peril).

First we need to make sure we have the tools for building packages:

$ sudo apt-get install build-essential devscripts

Then get the source and the upstream patch:

$ cd /tmp
$ sudo apt-get update
$ wget -O samba.patch 'https://attachments.samba.org/attachment.cgi?id=11742'
$ apt-get source samba
$ cd samba-4.1.17+dfsg
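Before changing anything in the tree it may be worth confirming that the downloaded patch applies to this source version at all. A small sketch using patch’s --dry-run option follows; the wrapper name patch_applies is hypothetical, and the -p1 strip level is assumed to match the one used for the real run below.

```shell
#!/bin/sh
# Succeed if the patch would apply to the source tree, without modifying it.
# $1: source tree, $2: patch file (absolute path, or relative to the tree)
patch_applies() {
    ( cd "$1" && patch -p1 --dry-run < "$2" )
}
```

From inside the unpacked source this would be called as "patch_applies . ../samba.patch"; a non-zero exit status means at least one hunk would be rejected.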

To prepare a proper patch we’d use quilt:

$ sudo apt-get install quilt
$ export QUILT_PATCHES=debian/patches
$ export QUILT_REFRESH_ARGS="-p ab --no-timestamps --no-index"
$ quilt push -a
$ quilt new bug_812429_share_of_root_no_longer_works.patch 
$ quilt add source3/smbd/vfs.c
$ patch -p1 < ../samba.patch
$ quilt refresh
$ quilt pop -a 

Alternatively, as we only really care about making the binary package, we can take a shortcut and just apply the patch on top of the source we’ve downloaded:

$ patch -p1 < ../samba.patch
$ dpkg-source --commit

Now we want to make sure our package doesn’t get overwritten until an actual update appears; we can bump the version number to enforce this:

$ debchange --increment
    * Add bug_812429_share_of_root_no_longer_works.patch
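The point of the bump is that the new version should sort above the installed package but below whatever the next official upload turns out to be. That ordering can be checked with dpkg’s own comparison, as in the sketch below; the function name version_between is hypothetical, and so is the "+deb8u2" version in the usage note, which is only a guess at what a future update might be numbered.

```shell
#!/bin/sh
# Succeed if version $2 sorts strictly between $1 and $3 in Debian's version ordering.
version_between() {
    dpkg --compare-versions "$1" lt "$2" &&
        dpkg --compare-versions "$2" lt "$3"
}
```

For example, "version_between 4.1.17+dfsg-2+deb8u1 4.1.17+dfsg-2+deb8u1.1 4.1.17+dfsg-2+deb8u2" succeeds, so a later "+deb8u2" upload would still replace the locally built package.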

Now build our package(s):

$ dpkg-buildpackage -us -uc

We only really need the changes in “samba-libs_4.1.17+dfsg-2+deb8u1.1_amd64.deb”, but because we’ve bumped the version number we need to apply all the rebuilt packages that depend on samba-libs:

$ su -
# dpkg -i samba-libs_4.1.17+dfsg-2+deb8u1.1_amd64.deb
# dpkg -i python-samba_4.1.17+dfsg-2+deb8u1.1_amd64.deb
# dpkg -i libsmbclient_4.1.17+dfsg-2+deb8u1.1_amd64.deb
# dpkg -i samba-common_4.1.17+dfsg-2+deb8u1.1_all.deb
# dpkg -i samba-common-bin_4.1.17+dfsg-2+deb8u1.1_amd64.deb
# dpkg -i samba_4.1.17+dfsg-2+deb8u1.1_amd64.deb
# dpkg -i samba-dsdb-modules_4.1.17+dfsg-2+deb8u1.1_amd64.deb
# dpkg -i samba-vfs-modules_4.1.17+dfsg-2+deb8u1.1_amd64.deb
# dpkg -i smbclient_4.1.17+dfsg-2+deb8u1.1_amd64.deb  

Hopefully this will suffice, and “apt-get upgrade” will overwrite our patched packages only once the Debian apt repository carries a newer version.

Windows psql and utf8 client_encoding issues

Prior to pg 9.1, you could connect to any database with psql regardless of encoding and you’d get the server encoding as your client encoding unless you set it:

C:\>chcp 1252
Active code page: 1252

C:\>"C:\Program Files\PostgreSQL\9.0\bin\psql" -U glyn -d TEST -h pgtest
psql (9.0.22, server 9.4.4)
WARNING: psql version 9.0, server version 9.4.
         Some psql features might not work.
SSL connection (cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256)
Type "help" for help.

TEST=> show client_encoding;
 client_encoding
-----------------
 LATIN1
(1 row)

That wasn’t quite right; the client_encoding is a lie. On a modern psql version that’s quite rightly prevented:

C:\>"C:\Program Files (x86)\pgAdmin III\1.20\psql" -U glyn -d TEST -h pgtest
psql: FATAL:  conversion between WIN1252 and LATIN1 is not supported
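To illustrate why a console on code page 1252 claiming LATIN1 is a lie, here is a small sketch run from a Unix shell rather than the Windows console. It decodes the single byte 0x80 under a given encoding and prints the resulting UTF-8 bytes in hex. It assumes an iconv that knows the names CP1252 and ISO-8859-1 (the iconv names for what PostgreSQL calls WIN1252 and LATIN1); the function name decode_0x80 is my own.

```shell
#!/bin/sh
# Decode the byte 0x80 under the given encoding and print its UTF-8 form in hex.
# $1: source encoding name as understood by iconv
decode_0x80() {
    # \200 is octal for 0x80; od prints the converted bytes, tr removes the spacing.
    printf '\200' | iconv -f "$1" -t UTF-8 | od -An -tx1 | tr -d ' \n'
}
```

With CP1252 the byte comes out as e282ac, the euro sign, while with ISO-8859-1 it comes out as c280, a control character, so the same byte means different things under the two encodings.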

This is not an issue if you want to connect to a utf8 database, but the issue I had this morning was connecting to a latin1 database with psql from a Windows client (something I do rarely). If I set the codepage to utf8 to match client encoding, I got a “Not enough memory.” error:

C:\>chcp 65001
Active code page: 65001

C:\>set PGCLIENTENCODING=UTF8

C:\>"C:\Program Files (x86)\pgAdmin III\1.20\psql" -U glyn -d TEST -h pgtest
psql (9.4.0, server 9.4.4)
WARNING: Console code page (65001) differs from Windows code page (1252)
         8-bit characters might not work correctly. See psql reference
         page "Notes for Windows users" for details.
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 25
6, compression: off)
Type "help" for help.

TEST=> show client_encoding;
Not enough memory.

I could set the codepage to 1252, but that would mean my setting for client_encoding would be a lie, and if I were to then revert to set client_encoding='WIN1252' I’d have come full circle and be back at the “FATAL: conversion between WIN1252 and LATIN1 is not supported” error message.

A quick google revealed these bug reports with no solutions. Another dig at the docs revealed the following passage:

pager

Controls use of a pager program for query and psql help output. If the environment variable PAGER is set, the output is piped to the specified program. Otherwise a platform-dependent default (such as more) is used.

So how does more behave?

C:\>chcp 65001
Active code page: 65001

C:\>more
Not enough memory.

Bingo! So if I turn the pager off the error should go:

C:\>chcp 65001
Active code page: 65001

C:\>set PGCLIENTENCODING=UTF8
C:\>"C:\Program Files (x86)\pgAdmin III\1.20\psql" -U glyn -d TEST -h pgtest
WARNING: Console code page (65001) differs from Windows code page (1252)
         8-bit characters might not work correctly. See psql reference
         page "Notes for Windows users" for details.
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 25
6, compression: off)
Type "help" for help.

TEST=> \pset pager off
Pager usage is off.
TEST=> show client_encoding;
 client_encoding
-----------------
 UTF8
(1 row)

Culprit found; quite embarrassing that it didn’t occur to me sooner that the source of such a verbose error as “Not enough memory.” would be Microsoft. So let’s try a different pager (sourced from)

C:\>chcp 65001
Active code page: 65001

C:\>set PGCLIENTENCODING=UTF8
C:\>set PAGER="C:\Program Files (x86)\gnuwin32\bin\less"
C:\>set LESS=--quit-at-eof

C:\>"C:\Program Files (x86)\pgAdmin III\1.20\psql" -U glyn -d TEST -h pgtest
psql (9.4.0, server 9.4.4)
WARNING: Console code page (65001) differs from Windows code page (1252)
         8-bit characters might not work correctly. See psql reference
         page "Notes for Windows users" for details.
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 25
6, compression: off)
Type "help" for help.

TEST=> show client_encoding;

 client_encoding
-----------------
 UTF8
(1 row)

(END)