Greenplum | database dumping ground

Today I wanted to have a go with Greenplum Database; a massively parallel DBMS based on PostgreSQL for data warehouse type workloads with features such as columnar storage.

Other than its PostgreSQL roots I know very little about Greenplum, and like many of the other commercial database products it too is only supported on the “Enterprise” linuxes.

The intention of this little exercise is to get Greenplum running on Debian, and learn a little more about it in the process. As I mentioned Greenplum is a massively parallel DBMS, so it can spread one query over multiple cores in multiple servers – I’m only going to attempt to set it up on a single machine, but installing over a whole bunch of servers should be possible with the same technique simply by adding more hosts to the various “hostfiles” passed into the Greenplum tools.

I’m following the Installation Guide from the Greenplum website and start off by installing a few prerequisites for the installation. These are; ed and unzip which are used by gpinitsystem, the former to modify parameters in postgresql.conf files, and ntp which is required by Greenplum to keep segment hosts in sync. The others are not required, but I like to do my work in a screen, and I installed the python modules whilst fiddling with gpssh-exkeys but I’m pretty certain they’re not needed.

# apt-get install ntp ed unzip python-crypto python-paramiko screen

Next we grab the installer from Greenplum and extract it:

$ mkdir /usr/gp_install
$ cd /usr/gp_install
$ wget http://bitcast-a.v1.o1.sjc1.bitgravity.com/greenplum/Greenplum_CE_Database/database_server/4.2.2.4/greenplum-db-4.2.2.4-build-1-CE-RHEL5-x86_64.zip
$ unzip greenplum-db-4.2.2.4-build-1-CE-RHEL5-x86_64.zip

Create an installation directory:

$ mkdir /usr/local/greenplum-db-4.2.2.4

Add the following kernel parameters to /etc/sysctl.conf. These are as recommended by the installation document and I assume that the values for shared memory are based on their default postgresql.conf settings, but they’ve obviously made a bit of a guess here, and depending on our hardware / setup we’d want to tune these further:

kernel.shmmax = 500000000
kernel.shmmni = 4096
kernel.shmall = 4000000000
kernel.sem = 250 512000 100 2048
kernel.sysrq = 1
kernel.core_uses_pid = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.msgmni = 2048
net.ipv4.tcp_syncookies = 1
net.ipv4.ip_forward = 0
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.conf.all.arp_filter = 1
net.ipv4.ip_local_port_range = 1025 65535
net.core.netdev_max_backlog = 10000
vm.overcommit_memory = 2

Increase the limits for open files and running processes in /etc/security/limits.conf and uncomment the line with pam_limits.so in /etc/pam.d/su so that su uses these limits.

* soft nofile 65536
* hard nofile 65536
* soft nproc 131072
* hard nproc 131072

The installation guide also recommends setting the IO scheduler to deadline and a readahead size of 16384 sectors (or 8kb). Again this is very dependent on hardware (for example I’d be wanting to set the scheduler to no-op on a good raid controller) so I find it odd they suggest such cludgy values, best to do your own testing here. Anyway, they suggest the following in /etc/rc.local:

 
$ sudo blockdev --setra 16384 /dev/sda
$ sudo echo deadline > /sys/block/sda/queue/scheduler

In order to keep track of hosts Greenplum either needs static IPs or DNS records. My VM uses DHCP so I’ve quickly added a secondary IP in /etc/network/interfaces with a static address, but I could just as easily have changed the main IP to static:

auto eth0:1
iface eth0:1 inet static
address 192.168.0.201
netmask 255.255.255.0
gateway 192.168.0.1

$ sudo invoke-rc.d networking restart

Now make sure the machines hostname is resolvable to that IP if it’s not in DNS add an entry to your /etc/hosts file (one for each server, on every server if you’re doing a multiple server setup):

192.168.0.201   greenplum_test.8kb.co.uk  greenplum_test

So now we’re almost ready to install Greenplum, however if we try to run the installer at this point we’ll get an error message stating “Installer will only install on RedHat/CentOS x86_64”, it’s easy enough to trick it if we want to though by creating a redhat-release file (you can delete it when you’re done with the installer):

$ sudo echo 'Not really Red Hat Enterprise Linux but Debian GNU/Linux 7.0' > /etc/redhat-release

Now we can run the installer, the default install directory is /usr/local/greenplum-db-4.2.2.4 and the installer creates a symbolic link /usr/local/greenplum-db to it.

$ sudo /bin/bash greenplum-db-4.2.2.4-build-1-CE-RHEL5-x86_64.bin

Once the installer has run we proceed by setting up the segments that will make up our Greenplum cluster. To install a segment the gpseginstall command is run, and the hostfile_exkeys file needs to contain the hostnames of all segment hosts in the Greenplum cluster. For a single host we just need our local hostname in the hostfile_exkeys file:

# cd /usr/local/greenplum-db
# sudo echo `hostname` > hostfile_exkeys

The gpseginstall command copies the relevant files over to each segment, sets up users and directory ownership. For a standalone server this doesn’t actually have to be run and can be done manually, but for simplicity and completeness this is how it goes:

# . /usr/local/greenplum-db/greenplum_path.sh
# gpseginstall -f hostfile_exkeys -u gpadmin -p mypassword

When I ran this I got one error when it tried to run gpssh-exkeys for the gpadmin user, essentially the gpadmin user didn’t have $GPHOME environment set, so I ran this again myself with an absolute path:

# su - gpadmin
$ chsh         # changed my shell to /bin/bash
$ . /usr/local/greenplum-db/greenplum_path.sh
$ /usr/local/greenplum-db-4.2.2.4/bin/gpssh-exkeys -f hostfile_exkeys

Now make sure the environment variables are set up for your gpadmin user by adding “. /usr/local/greenplum-db/greenplum_path.sh” (the dot is important – it runs the script in the current environment) in /home/gpadmin/.profile.

We can test if the ssh keys have been exchanged properly at this point too, if this doesn’t authenticate us then we need to repeat the above to find where we went wrong:

# su - gpadmin
$ gpssh -f hostfile_exkeys -e ls -l $GPHOME

Next we need to create our data storage area; I’m setting up my data to be mirrored on the same disk. However if we were doing this for real then we’d be putting the mirrors on separate arrays or ideally if we had multiple hosts we could pass the -S option to gpinitsystem to make sure it spreads the mirrors about across alternate hosts.

# mkdir /greenplum
# mkdir /greenplum/master
# mkdir /greenplum/data1
# mkdir /greenplum/data2
# mkdir /greenplum/data1/primary
# mkdir /greenplum/data1/mirror
# mkdir /greenplum/data2/primary
# mkdir /greenplum/data2/mirror
# chown -R gpadmin:gpadmin /greenplum

Before we run gpinitsystem we can now check that everything is set up correctly by running the gpcheck utility, again some of the warnings (like it whinging you’re not using the deadline IO scheduler) can be ignored:

# cp /usr/local/greenplum-db/hostfile_exkeys /usr/local/greenplum-db/hostfile_gpcheck
# su - gpadmin
$ gpcheck -f /usr/local/greenplum-db/hostfile_gpcheck -m greenplum_test

Now we are ready to run gpinitsystem, we can copy over a hostfile and the sample gpinitsystem_config.

# cp /usr/local/greenplum-db/hostfile_exkeys /greenplum/hostfile_gpinitsystem
# cp /usr/local/greenplum-db/docs/cli_help/gpconfigs/gpinitsystem_config /greenplum/gpinitsystem_config

For our setup with 2 segments and 2 mirror segments we edit the /greenplum/gpinitsystem_config file as follows:

ARRAY_NAME="EMC Greenplum DW"
SEG_PREFIX=gpseg
PORT_BASE=40000
MIRROR_PORT_BASE=50000
declare -a DATA_DIRECTORY=(/greenplum/data1/primary /greenplum/data2/primary)
declare -a MIRROR_DATA_DIRECTORY=(/greenplum/data1/mirror /greenplum/data2/mirror)
MASTER_HOSTNAME=greenplum_test
MASTER_DIRECTORY=/greenplum/master
MASTER_PORT=5432
REPLICATION_PORT_BASE=41000
MIRROR_REPLICATION_PORT_BASE=51000
TRUSTED_SHELL=ssh
CHECK_POINT_SEGMENTS=8
ENCODING=UNICODE

Now for the exciting part, we run gpinitsystem to setup the cluster (set the locale to what suits you best):

# su - gpadmin
$ gpinitsystem -c /greenplum/gpinitsystem_config -h /greenplum/hostfile_gpinitsystem --locale=en_GB.utf8

Once this has run, we just need to set up a few bits and pieces before we can start the server. We need to make sure the MASTER_DATA_DIRECTORY environment variable is set for our gpadmin user by adding the following in our /home/gpadmin/.profile (or if you prefer /home/gpadmin/.bashrc):

export MASTER_DATA_DIRECTORY=/greenplum/master/gpseg-1

If we wanted we could now start the server using gpstart and stop it with gpstop. However I want to create an init script, if you want this put the following in /etc/init.d/greenplum

#!/bin/sh

### BEGIN INIT INFO
# Provides:          greenplum
# Required-Start:    $network $local_fs $remote_fs $syslog
# Required-Stop:     $network $local_fs $remote_fs $syslog
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Start greenplum daemons at boot time
# Description:       Enable greenplum service.
### END INIT INFO

# Run the greenplum path script script
. /usr/local/greenplum-db/greenplum_path.sh

# Who to run the postmaster as, usually "postgres".  (NOT "root")
GPUSER=gpadmin
GPLOG="$GPHOME/serverlog.log"
export MASTER_DATA_DIRECTORY=/greenplum/master/gpseg-1

set -e


# Parse command line parameters.
case $1 in
  start)
	echo -n "Starting Greenplum RDMS: "
	su -l gpadmin -c "gpstart -a" >>$GPLOG 2>&1
	echo "ok"
	;;
  stop)
	echo -n "Stopping Greenplum RDMS: "
	su -l gpadmin -c "gpstop -a -M fast" >>$GPLOG 2>&1
	echo "ok"
	;;
  restart)
	echo -n "Restarting Greenplum RDMS: "
	su -l gpadmin -c "gpstop -a -M fast -r" >>$GPLOG 2>&1
	echo "ok"
	;;
  reload)
	echo -n "Reload Greenplum RDMS: "
	su -l gpadmin -c "gpstop -u" >>$GPLOG 2>&1
	echo "ok"
	;;
  status)
	su -l gpadmin -c "gpstate"
	;;
  *)
	# Print help
	echo "Usage: $0 {start|stop|restart|reload|status}" 1>&2
	exit 1
	;;
esac

exit 0

And set it to run at boot:

# chmod 755 /etc/init.d/greenplum
# update-rc.d greenplum defaults

We can now, finally start up the server, and as the gpadmin user execute some SQL:

#invoke-rc.d greenplum start
# su - gpadmin
$ psql -U gpadmin -d postgres

Lets create a bog standard table and see what we get:

[postgres]
psql (8.2.15)
Type "help" for help.

postgres=# CREATE DATABASE test;
postgres=#\c test
test=# CREATE TABLE test_table (id bigserial PRIMARY KEY NOT NULL,string text);
test=# INSERT INTO test_table (string) SELECT generate_series(1,10000,2) || ‘_test_data’ AS string;
test=# EXPLAIN ANALYZE SELECT * FROM test_table;
QUERY PLAN
—————————————————————————————————————————————-
Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..58.00 rows=2500 width=22)
Rows out: 5000 rows at destination with 24 ms to first row, 109 ms to end, start offset by 66 ms.
-> Seq Scan on test_table (cost=0.00..58.00 rows=2500 width=22)
Rows out: Avg 2500.0 rows x 2 workers. Max 2500 rows (seg0) with 3.572 ms to first row, 29 ms to end, start offset by 80 ms.
Slice statistics:
(slice0) Executor memory: 247K bytes.
(slice1) Executor memory: 181K bytes avg x 2 workers, 181K bytes max (seg0).
Statement statistics:
Memory used: 128000K bytes
Total runtime: 202.259 ms
(10 rows)

[/postgres]

Looks a little different to the usual PostgreSQL seq_scan; we’ve two workers scanning the two underlying segments in parallel. Lets try with some more Greenplum specific options:

[postgres]

test=# CREATE TABLE test_table2 (id bigserial NOT NULL, string text) WITH (appendonly=true, orientation=column) DISTRIBUTED BY (id);
test=# CREATE INDEX test_table2_id ON test_table2 USING btree (id);
test=# INSERT INTO test_table2 (id, string) SELECT generate_series, generate_series || ‘_test_data’ AS string FROM generate_series(1,10000,2);
test=# EXPLAIN ANALYZE SELECT * FROM test_table2;
QUERY PLAN
—————————————————————————————————————————————-
Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..54.00 rows=2500 width=22)
Rows out: 5000 rows at destination with 28 ms to first row, 114 ms to end, start offset by 66 ms.
-> Append-only Columnar Scan on test_table2 (cost=0.00..54.00 rows=2500 width=22)
Rows out: Avg 2500.0 rows x 2 workers. Max 2560 rows (seg0) with 9.701 ms to first row, 28 ms to end, start offset by 81 ms.
Slice statistics:
(slice0) Executor memory: 183K bytes.
(slice1) Executor memory: 363K bytes avg x 2 workers, 363K bytes max (seg0).
Statement statistics:
Memory used: 128000K bytes
Total runtime: 207.515 ms
[/postgres]

This time we created the table with column based storage as appose to the more conventional row based storage we’re used to, and partitioned over the segments by “id”. Also note that we state the table is append-only, AFAIK this is a requirement when using columnar storage with this version of Greenplum.

We may want to change some guc parameters in postgresql.conf or more likely add in some hosts to pg_hba.conf. To do this we want to edit the settings for the master host, in this setup those files are in my /greenplum/master/gpseg-1/ directory.

That’s it for now, hopefully I’ll get time to revisit and have a more indepth play with Greenplums features.

database dumping ground

… getting the wrong answer fast is not an improvement over the right answer slow

Tag Archives: Greenplum

Installing Greenplum Community Edition Database 4.2 on Debian Wheezy