Category Archives: PostgreSQL

PostgreSQL Latin1 database with ASP.Net Unicode and DOS codepage 850 applications

So the title sums it up without going into detail; we basically have a legacy DOS application that had its underlying dataflies migrated into a latin1 encoded PostgreSQL database with no form of intelligent mapping (i.e. the cp850 “ú” character which is codepoint 163 is seen as the character “£” in latin1).

Converting to UTF8 at this point would have saved lots of hassle, however at the time of conversion the existing data wouldn’t easily go into utf8 without further intermediate conversion. Even then UTF8 wouldn’t be a magic bullet; we’d still have to sanitize the incoming data so we didn’t store anything the DOS side (or any other equipment) couldn’t read.

Fast forward a few years and now there’s two main apps hitting the database; an ASP.Net application and the aforementioned legacy DOS system. Both are heavily used and the DOS system isn’t going anywhere so we need a way for both systems to get along; crucially the DOS application must be able to understand everything the ASP.Net application puts in the database, and the ASP.Net application must be able to display the DOS codepage correctly in UTF8.

So how do we do this? All my attempts using the PostgreSQL encode/decode functions failed due to there being no cp850 representation, and although I’m sure it’s possible to do the conversion in sql using something like translate / convert and a large mapping string I don’t have time to fiddle with it right now. However perls encode and decode functions do have the ability to convert from cp850, so one solution from within PostgreSQL is to write a set of untrusted pl/perl functions like:

[postgres]
CREATE OR REPLACE FUNCTION cp850_to_utf8(text)
RETURNS text AS
$BODY$
use Encode;
return encode( ‘iso-8859-1’, decode(‘cp850’, $_[0] ));
$BODY$
LANGUAGE ‘plperlu’ IMMUTABLE;

CREATE OR REPLACE FUNCTION utf8_to_cp850(text)
RETURNS text AS
$BODY$
use Encode;
return encode( ‘cp850’, decode(‘iso-8859-1’, $_[0] ));
$BODY$
LANGUAGE ‘plperlu’ IMMUTABLE;

[/postgres]

For my current problem this means if either application writes something like “Straße” the other can read back the same representation.

From the ASP.Net side we can do the conversion outside the database with a similar process. Converting text to go into the database involves switching the encoding first from unicode into cp850 then back into latin1 to get the correct representation on the DOS side:

private static Encoding _eCp850 = Encoding.GetEncoding(850);
private static Encoding _eUnicode = Encoding.UTF8;
private static Encoding _eLatin1 = Encoding.GetEncoding("ISO-8859-1");

public static string encode_cp850(string sText)
{
	string sReturn;
	byte[] bSource;
	byte[] bTarget;

	bSource = _eUnicode.GetBytes(sText);
	bTarget = Encoding.Convert(_eUnicode, _eCp850, bSource);
	sReturn = _eLatin1.GetString(bTarget);

	return sReturn;
}

Reading is a little trickier though, as all strings in .Net are UTF8 (and actually stored as UTF16 internally) so once our string is read back into a string we can’t easily do our conversion as our initial representation has been mangled. However if we read our string back from the database as bytea into a byte array we can successfully convert it into the correct UTF8 representation:

public static string decode_cp850(byte[] sTextAsBytea)
{
	string sReturn;
	byte[] bSource = sTextAsBytea;
	byte[] bTarget;

	bTarget = Encoding.Convert(_eCp850, _eUnicode, bSource);
	sReturn = _eUnicode.GetString(bTarget);

	return sReturn;
}
        

I’m still wondering if there is a better way (I know, I know, yes it’s to have everything in UTF8), but for now this seems to manage ok for us.

I’ve uploaded some scripts to my git repository at postgresql/encoding_woes (mirrored also on github glynastill/encoding_woes)

Uploading old stuff – DBLT

I’ve recently been digging through a bunch of old test programs and scripts I’ve written over the years, and thought I’d upload some of them incase they can be of any use to anyone. I thought I’d start off with a console app I’d named DBLT.

DBLT is simply the abbreviation for “database load test” that jumped into my head when faced with the Visual Studio “New Project” dialogue. I wanted to perform some load tests in a way that more closely represented a specific application and measured the things I was interested in. I don’t class it as a polished testing platform, and I take the results of my tests with a pinch of salt; as I assume some of my techniques in running the tests and locking when maintaining the statistics will in some way have skewed them, but hopefully only slightly.

The test is written in C#, and uses the npgsql data provider to connect to PostgreSQL. As I mentioned initially, the main reason for this is I wanted to test different configurations in connecting to our databases in a similar way to how a specific application did. Mainly this was to test the performance of npgsqls local connection pool logic, and other associated parameters, but also to test other parameters on our pgBouncer pools, PostgreSQL and Linux hosts.

My other reason for writing my own test is that pgbench outputs results based around TPS, but I wanted a slightly different take on the statistics. I wanted to know things like how long my queries were taking on average, what the worst and best execution times were and the standard deviation of execution time to see that we were providing a good quality of service to all clients rather than just shunting the most TPS through the database. If you’re now thinking “the fool – he should have just run pgbench and run an analysis on his logs”, you’re quite right; but what that wouldn’t have given me an overall view of the behaviour of the system when we start including things like cascaded connection pools (local npgsql, and pgbouncer), networking and local client machine configuration. I could have analyzed some of this using tools like tsung, but had issues generating heavy load with tsung and its output didn’t give me exactly what I wanted.

I wrote dblt back in 2009, and I’ve found it useful on quite a few occasions but never thought to share it before. I’ve recently signed up to github so the source can be obtained from glynastill/dblt or my git repository at postgresql/dblt. I’ve also uploaded a compiled binary here

dblt.

Hopefully someone finds it useful or highlights failings that can be fixed. Even if you don’t want to test against PostgreSQL it could quite easily be used to test sql server or other systems by changing the data provider.

Installing Greenplum Community Edition Database 4.2 on Debian Wheezy

Today I wanted to have a go with Greenplum Database; a massively parallel DBMS based on PostgreSQL for data warehouse type workloads with features such as columnar storage.

Other than its PostgreSQL roots I know very little about Greenplum, and like many of the other commercial database products it too is only supported on the “Enterprise” linuxes.

The intention of this little exercise is to get Greenplum running on Debian, and learn a little more about it in the process. As I mentioned Greenplum is a massively parallel DBMS, so it can spread one query over multiple cores in multiple servers – I’m only going to attempt to set it up on a single machine, but installing over a whole bunch of servers should be possible with the same technique simply by adding more hosts to the various “hostfiles” passed into the Greenplum tools.

I’m following the Installation Guide from the Greenplum website and start off by installing a few prerequisites for the installation. These are; ed and unzip which are used by gpinitsystem, the former to modify parameters in postgresql.conf files, and ntp which is required by Greenplum to keep segment hosts in sync. The others are not required, but I like to do my work in a screen, and I installed the python modules whilst fiddling with gpssh-exkeys but I’m pretty certain they’re not needed.

# apt-get install ntp ed unzip python-crypto python-paramiko screen

Next we grab the installer from Greenplum and extract it:

$ mkdir /usr/gp_install
$ cd /usr/gp_install
$ wget http://bitcast-a.v1.o1.sjc1.bitgravity.com/greenplum/Greenplum_CE_Database/database_server/4.2.2.4/greenplum-db-4.2.2.4-build-1-CE-RHEL5-x86_64.zip
$ unzip greenplum-db-4.2.2.4-build-1-CE-RHEL5-x86_64.zip

Create an installation directory:

$ mkdir /usr/local/greenplum-db-4.2.2.4

Add the following kernel parameters to /etc/sysctl.conf. These are as recommended by the installation document and I assume that the values for shared memory are based on their default postgresql.conf settings, but they’ve obviously made a bit of a guess here, and depending on our hardware / setup we’d want to tune these further:

kernel.shmmax = 500000000
kernel.shmmni = 4096
kernel.shmall = 4000000000
kernel.sem = 250 512000 100 2048
kernel.sysrq = 1
kernel.core_uses_pid = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.msgmni = 2048
net.ipv4.tcp_syncookies = 1
net.ipv4.ip_forward = 0
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.conf.all.arp_filter = 1
net.ipv4.ip_local_port_range = 1025 65535
net.core.netdev_max_backlog = 10000
vm.overcommit_memory = 2

Increase the limits for open files and running processes in /etc/security/limits.conf and uncomment the line with pam_limits.so in /etc/pam.d/su so that su uses these limits.

* soft nofile 65536
* hard nofile 65536
* soft nproc 131072
* hard nproc 131072

The installation guide also recommends setting the IO scheduler to deadline and a readahead size of 16384 sectors (or 8kb). Again this is very dependent on hardware (for example I’d be wanting to set the scheduler to no-op on a good raid controller) so I find it odd they suggest such cludgy values, best to do your own testing here. Anyway, they suggest the following in /etc/rc.local:

 
$ sudo blockdev --setra 16384 /dev/sda
$ sudo echo deadline > /sys/block/sda/queue/scheduler

In order to keep track of hosts Greenplum either needs static IPs or DNS records. My VM uses DHCP so I’ve quickly added a secondary IP in /etc/network/interfaces with a static address, but I could just as easily have changed the main IP to static:

auto eth0:1
iface eth0:1 inet static
address 192.168.0.201
netmask 255.255.255.0
gateway 192.168.0.1
$ sudo invoke-rc.d networking restart

Now make sure the machines hostname is resolvable to that IP if it’s not in DNS add an entry to your /etc/hosts file (one for each server, on every server if you’re doing a multiple server setup):

192.168.0.201   greenplum_test.8kb.co.uk  greenplum_test

So now we’re almost ready to install Greenplum, however if we try to run the installer at this point we’ll get an error message stating “Installer will only install on RedHat/CentOS x86_64”, it’s easy enough to trick it if we want to though by creating a redhat-release file (you can delete it when you’re done with the installer):

$ sudo echo 'Not really Red Hat Enterprise Linux but Debian GNU/Linux 7.0' > /etc/redhat-release

Now we can run the installer, the default install directory is /usr/local/greenplum-db-4.2.2.4 and the installer creates a symbolic link /usr/local/greenplum-db to it.

$ sudo /bin/bash greenplum-db-4.2.2.4-build-1-CE-RHEL5-x86_64.bin

Once the installer has run we proceed by setting up the segments that will make up our Greenplum cluster. To install a segment the gpseginstall command is run, and the hostfile_exkeys file needs to contain the hostnames of all segment hosts in the Greenplum cluster. For a single host we just need our local hostname in the hostfile_exkeys file:

# cd /usr/local/greenplum-db
# sudo echo `hostname` > hostfile_exkeys

The gpseginstall command copies the relevant files over to each segment, sets up users and directory ownership. For a standalone server this doesn’t actually have to be run and can be done manually, but for simplicity and completeness this is how it goes:

# . /usr/local/greenplum-db/greenplum_path.sh
# gpseginstall -f hostfile_exkeys -u gpadmin -p mypassword

When I ran this I got one error when it tried to run gpssh-exkeys for the gpadmin user, essentially the gpadmin user didn’t have $GPHOME environment set, so I ran this again myself with an absolute path:

# su - gpadmin
$ chsh         # changed my shell to /bin/bash
$ . /usr/local/greenplum-db/greenplum_path.sh
$ /usr/local/greenplum-db-4.2.2.4/bin/gpssh-exkeys -f hostfile_exkeys

Now make sure the environment variables are set up for your gpadmin user by adding “. /usr/local/greenplum-db/greenplum_path.sh” (the dot is important – it runs the script in the current environment) in /home/gpadmin/.profile.

We can test if the ssh keys have been exchanged properly at this point too, if this doesn’t authenticate us then we need to repeat the above to find where we went wrong:

# su - gpadmin
$ gpssh -f hostfile_exkeys -e ls -l $GPHOME

Next we need to create our data storage area; I’m setting up my data to be mirrored on the same disk. However if we were doing this for real then we’d be putting the mirrors on separate arrays or ideally if we had multiple hosts we could pass the -S option to gpinitsystem to make sure it spreads the mirrors about across alternate hosts.

# mkdir /greenplum
# mkdir /greenplum/master
# mkdir /greenplum/data1
# mkdir /greenplum/data2
# mkdir /greenplum/data1/primary
# mkdir /greenplum/data1/mirror
# mkdir /greenplum/data2/primary
# mkdir /greenplum/data2/mirror
# chown -R gpadmin:gpadmin /greenplum

Before we run gpinitsystem we can now check that everything is set up correctly by running the gpcheck utility, again some of the warnings (like it whinging you’re not using the deadline IO scheduler) can be ignored:

# cp /usr/local/greenplum-db/hostfile_exkeys /usr/local/greenplum-db/hostfile_gpcheck
# su - gpadmin
$ gpcheck -f /usr/local/greenplum-db/hostfile_gpcheck -m greenplum_test

Now we are ready to run gpinitsystem, we can copy over a hostfile and the sample gpinitsystem_config.

# cp /usr/local/greenplum-db/hostfile_exkeys /greenplum/hostfile_gpinitsystem
# cp /usr/local/greenplum-db/docs/cli_help/gpconfigs/gpinitsystem_config /greenplum/gpinitsystem_config

For our setup with 2 segments and 2 mirror segments we edit the /greenplum/gpinitsystem_config file as follows:

ARRAY_NAME="EMC Greenplum DW"
SEG_PREFIX=gpseg
PORT_BASE=40000
MIRROR_PORT_BASE=50000
declare -a DATA_DIRECTORY=(/greenplum/data1/primary /greenplum/data2/primary)
declare -a MIRROR_DATA_DIRECTORY=(/greenplum/data1/mirror /greenplum/data2/mirror)
MASTER_HOSTNAME=greenplum_test
MASTER_DIRECTORY=/greenplum/master
MASTER_PORT=5432
REPLICATION_PORT_BASE=41000
MIRROR_REPLICATION_PORT_BASE=51000
TRUSTED_SHELL=ssh
CHECK_POINT_SEGMENTS=8
ENCODING=UNICODE

Now for the exciting part, we run gpinitsystem to setup the cluster (set the locale to what suits you best):

# su - gpadmin
$ gpinitsystem -c /greenplum/gpinitsystem_config -h /greenplum/hostfile_gpinitsystem --locale=en_GB.utf8

Once this has run, we just need to set up a few bits and pieces before we can start the server. We need to make sure the MASTER_DATA_DIRECTORY environment variable is set for our gpadmin user by adding the following in our /home/gpadmin/.profile (or if you prefer /home/gpadmin/.bashrc):

export MASTER_DATA_DIRECTORY=/greenplum/master/gpseg-1

If we wanted we could now start the server using gpstart and stop it with gpstop. However I want to create an init script, if you want this put the following in /etc/init.d/greenplum

#!/bin/sh

### BEGIN INIT INFO
# Provides:          greenplum
# Required-Start:    $network $local_fs $remote_fs $syslog
# Required-Stop:     $network $local_fs $remote_fs $syslog
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Start greenplum daemons at boot time
# Description:       Enable greenplum service.
### END INIT INFO

# Run the greenplum path script script
. /usr/local/greenplum-db/greenplum_path.sh

# Who to run the postmaster as, usually "postgres".  (NOT "root")
GPUSER=gpadmin
GPLOG="$GPHOME/serverlog.log"
export MASTER_DATA_DIRECTORY=/greenplum/master/gpseg-1

set -e


# Parse command line parameters.
case $1 in
  start)
	echo -n "Starting Greenplum RDMS: "
	su -l gpadmin -c "gpstart -a" >>$GPLOG 2>&1
	echo "ok"
	;;
  stop)
	echo -n "Stopping Greenplum RDMS: "
	su -l gpadmin -c "gpstop -a -M fast" >>$GPLOG 2>&1
	echo "ok"
	;;
  restart)
	echo -n "Restarting Greenplum RDMS: "
	su -l gpadmin -c "gpstop -a -M fast -r" >>$GPLOG 2>&1
	echo "ok"
	;;
  reload)
	echo -n "Reload Greenplum RDMS: "
	su -l gpadmin -c "gpstop -u" >>$GPLOG 2>&1
	echo "ok"
	;;
  status)
	su -l gpadmin -c "gpstate"
	;;
  *)
	# Print help
	echo "Usage: $0 {start|stop|restart|reload|status}" 1>&2
	exit 1
	;;
esac

exit 0

And set it to run at boot:

# chmod 755 /etc/init.d/greenplum
# update-rc.d greenplum defaults

We can now, finally start up the server, and as the gpadmin user execute some SQL:

#invoke-rc.d greenplum start
# su - gpadmin
$ psql -U gpadmin -d postgres

Lets create a bog standard table and see what we get:

[postgres]
psql (8.2.15)
Type "help" for help.

postgres=# CREATE DATABASE test;
postgres=#\c test
test=# CREATE TABLE test_table (id bigserial PRIMARY KEY NOT NULL,string text);
test=# INSERT INTO test_table (string) SELECT generate_series(1,10000,2) || ‘_test_data’ AS string;
test=# EXPLAIN ANALYZE SELECT * FROM test_table;
QUERY PLAN
—————————————————————————————————————————————-
Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..58.00 rows=2500 width=22)
Rows out: 5000 rows at destination with 24 ms to first row, 109 ms to end, start offset by 66 ms.
-> Seq Scan on test_table (cost=0.00..58.00 rows=2500 width=22)
Rows out: Avg 2500.0 rows x 2 workers. Max 2500 rows (seg0) with 3.572 ms to first row, 29 ms to end, start offset by 80 ms.
Slice statistics:
(slice0) Executor memory: 247K bytes.
(slice1) Executor memory: 181K bytes avg x 2 workers, 181K bytes max (seg0).
Statement statistics:
Memory used: 128000K bytes
Total runtime: 202.259 ms
(10 rows)

[/postgres]

Looks a little different to the usual PostgreSQL seq_scan; we’ve two workers scanning the two underlying segments in parallel. Lets try with some more Greenplum specific options:

[postgres]

test=# CREATE TABLE test_table2 (id bigserial NOT NULL, string text) WITH (appendonly=true, orientation=column) DISTRIBUTED BY (id);
test=# CREATE INDEX test_table2_id ON test_table2 USING btree (id);
test=# INSERT INTO test_table2 (id, string) SELECT generate_series, generate_series || ‘_test_data’ AS string FROM generate_series(1,10000,2);
test=# EXPLAIN ANALYZE SELECT * FROM test_table2;
QUERY PLAN
—————————————————————————————————————————————-
Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..54.00 rows=2500 width=22)
Rows out: 5000 rows at destination with 28 ms to first row, 114 ms to end, start offset by 66 ms.
-> Append-only Columnar Scan on test_table2 (cost=0.00..54.00 rows=2500 width=22)
Rows out: Avg 2500.0 rows x 2 workers. Max 2560 rows (seg0) with 9.701 ms to first row, 28 ms to end, start offset by 81 ms.
Slice statistics:
(slice0) Executor memory: 183K bytes.
(slice1) Executor memory: 363K bytes avg x 2 workers, 363K bytes max (seg0).
Statement statistics:
Memory used: 128000K bytes
Total runtime: 207.515 ms
[/postgres]

This time we created the table with column based storage as appose to the more conventional row based storage we’re used to, and partitioned over the segments by “id”. Also note that we state the table is append-only, AFAIK this is a requirement when using columnar storage with this version of Greenplum.

We may want to change some guc parameters in postgresql.conf or more likely add in some hosts to pg_hba.conf. To do this we want to edit the settings for the master host, in this setup those files are in my /greenplum/master/gpseg-1/ directory.

That’s it for now, hopefully I’ll get time to revisit and have a more indepth play with Greenplums features.