Raspberry Pi SD card failure

[Image: Pac-Man Raspberry Pi 2]

So, it happened again; the micro SD card in the Raspberry Pi used to host this blog has started to fail:

	mmc0: Timeout waiting for hardware interrupt 

… and my Raspberry Pi 2’s kill count now stands at 3 SD cards.

Recovery is not a problem since it’s backed up regularly, but it’s an irritating issue; whilst I don’t expect massive endurance from any of these cards, I expect them to last longer than they have, as the system isn’t write-heavy. The symptoms so far are never the same: card 1 simply became read-only, card 2 started exhibiting silent file corruption fixable by re-writing the data, and now the issue with card 3 manifests itself as the server randomly hanging whenever the system hits a particular portion of the card. Card 1 was a class 6 branded “Maxell”, card 2 a class 10/U1 branded “Kingston”, and card 3 a “Toshiba” class 10/U1.

You can pull some information about the card’s controller from the command line; however, it would seem that without dissolving the cards’ cases off in nitric acid it’s hard to really know what you’ve got.
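For reference, the card’s identification registers can be read from sysfs (assuming it shows up as mmcblk0, as it does on the Pi); this gives you the registered manufacturer/OEM IDs, product name and manufacture date, but says nothing about the flash or controller actually inside:

# cd /sys/block/mmcblk0/device
# cat name manfid oemid date serial cid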

The underlying issue could be anything: firmware, power or Pi 2 design, but because the failures are always different I think this is purely down to card quality and wear-out. Historically this server was hosted on a Cobalt Qube2 with an 8GB flash card, and then on a Raspberry Pi Model B with a 16GB SD card; both for long periods of time without issue. Both cards were also branded “SanDisk”, although I’ve no idea if that really makes any difference.

This time I’ve brought the site back up on the Raspberry Pi Model B, mainly because I hadn’t got a suitable working micro SD card for the Pi 2. We’ll see how it goes, but it might be time to move some things to tmpfs!
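As a rough sketch, a couple of tmpfs entries in /etc/fstab along these lines would keep the chattiest paths off the card; the sizes are guesses for a box this small, and anything in /var/log is of course lost on every reboot:

tmpfs   /tmp       tmpfs   defaults,noatime,size=50m   0   0
tmpfs   /var/log   tmpfs   defaults,noatime,size=50m   0   0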

WSFC and iscsitarget: “does not have the inquiry data (SCSI page 83h VPD descriptor) that is required by failover clustering”

Last week, whilst trying to get to grips with SQL Server AlwaysOn Failover Clusters, I set up a simple iSCSI target using the “iscsitarget” package as per the Debian docs. However, when trying to validate the cluster in WSFC (Windows Server Failover Clustering), the disk checks failed with:

“does not have the inquiry data (SCSI page 83h VPD descriptor) that is required by failover clustering”

This has something to do with the SCSI ID, which the cluster manager requires in order to control volume ownership, being supplied by iscsitarget in a format unsupported by WSFC.
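If you want to see what descriptors a target is actually presenting, sg_vpd from the sg3-utils package will dump the page 83h (device identification) data from any Linux initiator logged into the target; the device node below is just an example:

# apt-get install sg3-utils
# sg_vpd -p di /dev/sdc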

I failed to find a workaround for this and instead switched to using “tgt” to serve the iSCSI targets. I was pushed for time and couldn’t find a straightforward guide, so I’m documenting my steps here.

1) Install tgt:

# apt-get install tgt

2) Enable and start tgt:

# systemctl enable tgt.service
# systemctl start tgt.service

3) Create the iSCSI target(s) and add their backing stores:

# tgtadm --lld iscsi --op new --mode target --tid 1 --targetname iqn.2001-04.com.example:storage.lun1
# tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 --backing-store /dev/sdb1

4) Bind the target to listen on all interfaces, and create and bind a CHAP user account:

# tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL
# tgtadm --lld iscsi --op new --mode account --user mssql --password secret
# tgtadm --lld iscsi --op bind --mode account --tid 1 --user mssql

5) Dump the config out into a configuration file:

# tgt-admin --dump > /etc/tgt/conf.d/default.conf
# sed -i -e 's/PLEASE_CORRECT_THE_PASSWORD/secret/' /etc/tgt/conf.d/default.conf
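The dumped file should end up looking something like this (tgt-admin writes PLEASE_CORRECT_THE_PASSWORD in place of the real password, hence the sed above):

default-driver iscsi

<target iqn.2001-04.com.example:storage.lun1>
	backing-store /dev/sdb1
	incominguser mssql secret
</target>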

6) Restart tgt to ensure the configuration is picked up:

# systemctl restart tgt.service
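A quick way to confirm the targets, LUNs and account bindings have come back after the restart:

# tgtadm --lld iscsi --op show --mode target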

Linux bonding and active-backup of LACP / 802.3ad trunks on linked but non-stacked switches

Today I’m trying to configure a couple of servers, each with 2 LACP trunks going to separate switches on our network. I was hoping that if I made a single 802.3ad bond with all the interfaces, it would automatically work in active-backup mode across the 2 trunks and give me switch redundancy.

It would appear that the Linux bonding driver does do this, so I set up my bond as follows:

auto bond0
iface bond0 inet static
	address 192.168.0.30
	netmask 255.255.255.0
	network 192.168.0.0
	broadcast 192.168.0.255
	gateway 192.168.0.1
	bond_slaves eth0 eth1 eth2 eth3
	bond_mode 802.3ad
	bond_miimon 100
	bond_downdelay 200
	bond_updelay 200
	bond_lacp_rate 1
	bond_xmit_hash_policy layer2+3

# invoke-rc.d networking reload

… all initially appears good, and I can see 2 separate aggregator IDs, 1 and 3, with aggregator 1 active.
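This is all taken from the bonding driver’s status file:

# cat /proc/net/bonding/bond0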

Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2+3 (2)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
		Aggregator ID: 1
		Number of ports: 2
		Actor Key: 17
		Partner Key: 386
		Partner Mac Address: 00:21:f7:0e:c1:00

Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 24:6e:96:19:9d:f0
Aggregator ID: 1
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 24:6e:96:19:9d:f1
Aggregator ID: 1
Slave queue ID: 0

Slave Interface: eth2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 24:6e:96:19:9d:f2
Aggregator ID: 3
Slave queue ID: 0

Slave Interface: eth3
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 24:6e:96:19:9d:f3
Aggregator ID: 3
Slave queue ID: 0

If I bring down both interfaces on aggregator 1, the active aggregator switches to ID 3 and all seems good:

# ifconfig eth0 down; ifconfig eth1 down

But it all goes bad once I bring those interfaces back up: the machine disappears off the network.

# ifconfig eth0 up; ifconfig eth1 up

The issue appears to be that the link status on both trunks is up, and since both trunks use the same MAC address (that of the first slave), once traffic has passed through both switches they both end up with that MAC in their switching tables.
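You can see the shared MAC from the host side; while enslaved, every interface reports the bond’s address (eth0’s here), and only the “Permanent HW addr” lines in /proc/net/bonding/bond0 still show the originals:

# cat /sys/class/net/bond0/address
# cat /sys/class/net/eth2/address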

I couldn’t find any proper workaround for this, and eventually found a Stack Exchange post outlining the same issue. Apparently if the switches can be linked with vPC (Virtual Port Channel) or MLAG (Multi-Chassis Link Aggregation) then it can work, but otherwise not.

What I’ve done in the end is a poor man’s workaround that simply involves checking the status of the bond and switching the interfaces when the aggregator becomes inactive. It looks like this (on Debian):

auto bond0
iface bond0 inet static
	hwaddress <mac address>
	address 192.168.0.30
	netmask 255.255.255.0
	network 192.168.0.0
	broadcast 192.168.0.255
	gateway 192.168.0.1
	bond_slaves eth0 eth1
	bond_mode 802.3ad
	bond_miimon 100
	bond_downdelay 200
	bond_updelay 200
	bond_lacp_rate 1
	bond_xmit_hash_policy layer2+3

Set the bond to always use eth0’s MAC address, rather than that of whichever slave happens to come up first, so the bond’s address stays stable when the slaves are later swapped between switches:

# mac=$(cat /sys/class/net/eth0/address); sed -i "s/<mac address>/$mac/" /etc/network/interfaces

Reload the network configuration and check that this simple configuration works as expected:

# invoke-rc.d networking reload
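A quick check that the aggregator has actually formed on the pair we expect:

# grep -A 5 'Active Aggregator Info' /proc/net/bonding/bond0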

Now create a script to check the status of the bond; if it shows no active aggregator, switch the slave interfaces over and reload the network configuration:

# vi /usr/local/bin/lacp_switch.sh
#!/bin/bash
if [ $(grep -c 'bond bond0 has no active aggregator' /proc/net/bonding/bond0) -eq 1 ]; then
	if [ $(grep -c 'eth2' /etc/network/interfaces) -eq 1 ]; then
		echo "$(date +'%T %x') : Changing bond0 slaves to eth0 & eth1 on switch1"
		sed -i 's/eth2/eth0/;s/eth3/eth1/' /etc/network/interfaces
	elif [ $(grep -c 'eth0' /etc/network/interfaces) -eq 1 ]; then
		echo "$(date +'%T %x') : Changing bond0 slaves to eth2 & eth3 on switch2"
		sed -i 's/eth0/eth2/;s/eth1/eth3/' /etc/network/interfaces
	else
		echo "$(date +'%T %x') : Unknown configuration"
		exit 1
	fi
	/etc/init.d/networking reload
fi

Make it executable and schedule it to run every 6 seconds:

# chmod 700 /usr/local/bin/lacp_switch.sh
# echo -e "SHELL=/bin/bash\n* * * * * root for i in {1..10}; do /usr/local/bin/lacp_switch.sh >> /var/log/lacp_switch_check 2>&1 & sleep 6; done" > /etc/cron.d/lacp_switch_check

This works, but I’m not happy with it. If somebody knows a better way to achieve the above, please do tell!