Category Archives: Programming

Unit testing in PostgreSQL with pgTap

The applications I work with use procedural functions and RDBMS features quite heavily, and although we have unit tests for application code, we currently have no unit testing directly in the database. I’ll be the first to admit that I can be fairly highly strung when making large changes to our systems, and having a suite of unit tests to run is a great way to extinguish any histrionics.

I’ve been looking at pgTap today which provides an excellent way to perform unit testing in PostgreSQL covering pretty much every aspect of the schema.

To install pgTap we just download, make and install just like any other extension:

$ cd /usr/pgsql_src
$ wget
$ unzip && cd pgtap*
$ make
$ make installcheck PGUSER=glyn
$ sudo make install

The system I’m testing on is still running Postgres 9.0, so to install the extension I have to revert to running the sql script rather than “CREATE EXTENSION”, and for some reason the pgtap.sql script didn’t end up in my contrib directory along with the sql for the extension based versions, so I just copied it there manually before running it:

$ cp sql/pgtap.sql `pg_config --sharedir`/contrib
$ psql -d TEST -f `pg_config --sharedir`/contrib/pgtap.sql 

If this were 9.1+ that would just be just be:

$ psql -d TEST -c "CREATE EXTENSION pgtap"

Now pgTap is installed I can run a test that checks nothing. We perform all tests within a transaction which is rolled back at the end to ensure our tests do not make any persistent changes.

[postgres highlight=”1,3,8,13,17″]
TEST=# SELECT plan(1);
(1 row)
TEST=# SELECT pass( ‘Testing nothing’ );
ok 1 – Testing nothing
(1 row)
TEST=# SELECT * FROM finish();
(0 rows)

The output basically means; 1..1 we ran 1 of 1 tests and ok 1 – Testing nothing test number 1, with description “Testing nothing” returned “ok”. The above looks a bit messy and the docs recommend setting the following in psql to make the behaviour and output a little easier to handle:

\set ECHO
\set QUIET 1
\pset format unaligned
\pset tuples_only true
\pset pager
\set ON_ERROR_STOP true
\set QUIET 1

Now lets create a function and perform an actual test on the results:

TEST=# CREATE OR REPLACE FUNCTION public.test1(text) RETURNS text AS ‘SELECT upper($1) || length($1)::text’ language ‘SQL’;

TEST=# SELECT plan(1);
TEST=# SELECT results_eq(E’SELECT * FROM public.test1(\’glyn\’)’, E’SELECT upper(\’glyn\’) || length(\’glyn\’)::text’,’public.test1() should return text in uppercase suffixed with length’);
ok 1 – public.test1() should return text in uppercase suffixed with length
TEST=# SELECT * FROM finish();

We can also simply test if some operation like an insert succeeds:

TEST=# CREATE TABLE public.table1 (a text, b integer);

TEST=# SELECT plan(1);
TEST=# SELECT lives_ok(E’INSERT INTO public.table1 VALUES (\’TEST\’, 1)’);
ok 1
TEST=# SELECT * FROM finish();

…and we can check that performing some operation fails with a particular exception:

TEST=# SELECT plan(1);
TEST=# SELECT throws_ok(E’INSERT INTO public.table1 VALUES (\’test\’, \’not an integer\’)’,’22P02′,NULL,’Column b is an integer, inserting text should fail’);
ok 1 – Column b is an integer, inserting text should fail
TEST=# SELECT * FROM finish();

We can test if an operation created the expected data too, if I create a trigger on my test table to uppercase any text inserted, we can check if it worked as expected:

TEST=# CREATE FUNCTION public.table1_upper_func() RETURNS trigger AS $$ BEGIN NEW.a := upper(NEW.a); RETURN NEW; END; $$ LANGUAGE plpgsql VOLATILE;
TEST=# CREATE TRIGGER table1_upper_trig BEFORE INSERT OR UPDATE ON public.table1 FOR EACH ROW EXECUTE PROCEDURE public.table1_upper_func();

TEST=# SELECT plan(1);
TEST=# INSERT INTO public.table1 VALUES (‘test’,1);
TEST=# SELECT is(a, ‘TEST’, ‘Inserted values into column "a" should be uppercased’ ) FROM public.table1 WHERE b = 1;
ok 1 – Inserted values into column "a" should be uppercased
TEST=# SELECT * FROM finish();

We can also test the existence and structure of a table:

TEST=# CREATE INDEX ON public.table1 USING btree(b);
TEST=# SELECT plan(6);
TEST=# SELECT has_table(‘public’, ‘table1’, ‘table public.table1 exists’ );
ok 1 – table public.table1 exists
TEST=# SELECT has_column(‘public’, ‘table1’, ‘a’, ‘column public.table1.a exists’);
ok 2 – column public.table1.a exists
TEST=# SELECT col_type_is(‘public’, ‘table1’, ‘a’, ‘text’, ‘Type of column public.table1.a is text’);
ok 3 – Type of column public.table1.a is text
TEST=# SELECT has_column(‘public’, ‘table1’, ‘b’, ‘column public.table1.b exists’);
ok 4 – column public.table1.b exists
TEST=# SELECT col_type_is(‘public’, ‘table1’, ‘b’, ‘integer’, ‘Type of column public.table1.b is integer’);
ok 5 – Type of column public.table1.b is integer
TEST=# SELECT has_index(‘public’, ‘table1′,’table1_b_idx’, ‘Table public.table1 has an index table1_b_idx’);
ok 6 – Table public.table1 has an index table1_b_idx
TEST=# SELECT * FROM finish();

To wrap all of this up pgTap provides a runtests() function that will run predefined test functions without us having to define the number of tests with plan(n) or run finish(), and we can write these functions just like any other function incorporating the tests demonstrated above. By default the function looks for function names starting with test, it’s probably a good idea to put the tests in a specific schema or give them a more unique name than that.

When runtests() is run it also first looks for functions starting with “startup” or “shutdown” and these are run in alphabetical order before and after a whole set of test functions respectively. In addition functions starting with “setup” or “teardown” are, again, run in alphabetical order before and after each test function respectively. Typically these would be used to check, perform and undo any data changes required by each and or all tests, for example if we’ve called a sequence that doesn’t belong to a table during our tests, it’s value won’t be rolled back so we might want to reset it’s value.

One thing to clarify though (which came to my attention when running pg_prove blelow), is that where the “setup” and “teardown” functions are automatically rolled back at the end of each test, the “startup” and “shutdown” functions are not automatically rolled back at the end of the tests. If you want that functionality then you need to wrap your call to runtests in a transaction and roll it back at the end just as before.

E.g. we can create our test schema and functions:

CREATE SCHEMA unit_testing;

TEST=# CREATE OR REPLACE FUNCTION unit_testing.setup_tap_test_example()
INSERT INTO public.table1 VALUES (‘test’,1);
RETURN NEXT is(a, ‘TEST’, ‘Should have initial row in table1’) FROM public.table1 WHERE b = 1;
$$ LANGUAGE plpgsql;

TEST=# CREATE OR REPLACE FUNCTION unit_testing.tap_test_example()
RETURN NEXT has_table(‘public’, ‘table1’, ‘table public.table1 exists’ );
RETURN NEXT has_column(‘public’, ‘table1’, ‘a’, ‘column public.table1.a exists’);
RETURN NEXT col_type_is(‘public’, ‘table1’, ‘a’, ‘text’, ‘Type of column public.table1.a is text’);
RETURN NEXT has_column(‘public’, ‘table1’, ‘b’, ‘column public.table1.b exists’);
RETURN NEXT col_type_is(‘public’, ‘table1’, ‘b’, ‘integer’, ‘Type of column public.table1.b is integer’);
RETURN NEXT has_index(‘public’, ‘table1′,’table1_b_idx’, ‘Table public.table1 has an index table1_b_idx’);
$$ LANGUAGE plpgsql;

TEST=# CREATE OR REPLACE FUNCTION unit_testing.teardown_tap_test_example()
RETURN NEXT lives_ok(‘DELETE FROM public.table1 WHERE b = 1’);
$$ LANGUAGE plpgsql;

Now we can run them:
TEST=# SELECT * FROM runtests(‘unit_testing’,’^tap_test_example’);
ok 1 – Should have initial row in table1
# unit_testing.tap_test_example()
ok 2 – table public.table1 exists
ok 3 – column public.table1.a exists
ok 4 – Type of column public.table1.a is text
ok 5 – column public.table1.b exists
ok 6 – Type of column public.table1.b is integer
ok 7 – Table public.table1 has an index table1_b_idx
ok 8

Gravy. We can, if we want, also install the pg_prove module from cpan to finish wrapping it all together with TAP::Harness and provide a summary of our tests:

$ sudo cpan TAP::Parser::SourceHandler::pgTAP
$ pg_prove -d TEST -R -s unit_testing -x '^tap_test_example'
runtests('unit_testing'::name, '^tap_test_example'::text); ..
# unit_testing.tap_test_example()
runtests('unit_testing'::name, '^tap_test_example'::text); .. ok
All tests successful.
Files=1, Tests=9,  0 wallclock secs ( 0.05 usr  0.02 sys +  0.00 cusr  0.01 csys =  0.08 CPU)
Result: PASS

This is just a small example of some of the tests we can do with pgTap, the suite provides a very extensive set of tests. See the documentation here.

Experimenting with Slony 2.2

I’ve been spending my spare bits of time over the last couple of weeks looking at the latest release of Slony-I. At a quick glance the main change between 2.1 and 2.2 appears to be to the sl_log table format, but although seemingly minor, the changes to the way clusters are failed over and reshaped actually go much deeper too.

For example in previous versions it was possible for a subscriber to pull multiple sets from different providers and later change the provider for any set at will using the “SCUBSCRIBE SET” command. However as of 2.2, although it’s still possible to initially subscribe a node with different providers for each set, any changes must use the “RESUBSCRIBE NODE” command, which only allows resubscribing all sets from a particular origin to a single provider.

There’s also changes to the “FAILOVER” command to improve reliability in a situation where multiple nodes have failed; you can now pass in multiple failed nodes and Slony should do the right thing. So far my tests with 2.2.2 show there may be some issues when passing in multiple failed nodes where one is a downstream provider to a cascaded subscriber, however that’s a corner case and hopefully we’ll see a fix soonish. (Edit 16/05/2014: There’s now a patch against 2.2.2 for this)

The changes to the sl_log table mean that replicated data is now replicated in a slightly more logical way; data is logged as arrays of values rather than chunks of sql to execute on the subscriber, and the data is sent over a pipe using copy rather than fetched in chunks via a cursor. Also DDL had been moved out of sl_event and into a new sl_log_script table. Upgrade will most likely require some brief downtime, as running update functions requires a lock of all sets and waiting out the cleanup interval for a cleanupevent/logswitch to happen to clear out the tables.

On a separate note, this evening (not the best use of a bank holiday weekend) whilst looking at how these changes would affect my experimental failover script I had a quick bash at adding in an “autofailover” functionality; the idea being that the script keeps polling all the nodes, and upon detecting any unavailable nodes runs the failover command. It’s a functionality I’ve never personally wanted as it’s possible to get into all sorts of trouble blindly failing over onto an asynchronous replica, in fact in a busy environment it’s pretty much guaranteed (E.g missing a single update to a product price and then taking millions of sales on the wrong price!). However, perhaps it could be quite useful in a mostly read only environment where updates are low volume such as a wiki; more thought needed I think.

(The script by the way is here slony-i/slony_failover and also mirrored on github glynastill/slony_failover)

UK Geographic postcode data, latitude longitude, Royal Mail PAF and Ordnance Survey data

When it comes to getting accurate geographic data for UK postcodes I only know of 2 sources; Ordnance Survey OpenData which is free, or the Postcode Address File (PAF) from the Royal Mail which costs a fair whack. If detail of individual addresses is required then there’s little choice other than to purchase the PAF data, but if only postcodes and coordinates are required then the OS data is fine.

Rather than latitude and longitude, both sources supply cartesian coordinates in the form of eastings and northings. It’s not a big issue though and we don’t even have to understand the maths involved in conversion because there are a few great cartographic tools out there that will do the conversions for us; perhaps the most popular of these is the Proj4 cartographic libraries.

Conversion to WGS84 / latitude and longitude

In my case I want to pull the data along with latitude and longitude into a PostgreSQL database and I have 2 options; do the conversion before importing with the cs2cs tool or use the PostGIS module within the database to compute them afterwards or during the load from an intermediate table.

The cs2cs tool takes input from stdin or a file and outputs anything after the eastings and northings to stdout. E.g. to convert from the British national grid, we’d do something like:

$ echo '7811 340198 other data' | cs2cs -f '%.7f' +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894 +units=m +no_defs +to +proj=latlong +ellps=WGS84 +towgs84=0,0,0 +no_defs

A breakdown of the parameters passed above:

  • -f ‘%.7f’ – Format of output – 7 decimal places
  • +proj=tmerc – Transverse Mercator Projection
  • +lat_0=49 – National grid true origin latitude
  • +lon_0=-2 – National grid true origin longitude
  • +k=0.9996012717 – National Grid scale factor on central meridian
  • +x_0=400000 – Easting of true origin metres
  • +y_0=-100000 – Northing of true origin metres
  • +ellps=airy – National grid is based on airy 1830 ellipsoid
  • +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894
    The 7 Bursa Wolf transformation parameters, used in our projection transformation to approximate the transformation from horizontal datum
  • +units=m – Units in metres
  • +no_defs – Don’t use the /usr/share/proj/proj_def.dat defaults file (they’re for US maps)
  • +to +proj=latlong – Ouptut latitude and longitude
  • +ellps=WGS84 – Ouptut WGS84 ellipsoid
  • +towgs84=0,0,0 – WGS84 datum shifts set to zero

If we have the PostGIS module loaded in PostgreSQL we could calculate our latitude and longitude similarly with

SELECT ST_x(ST_transform(ST_GeomFromText(‘POINT(‘||’457811 340198’||’)’,27700),4326)) AS longitude,
ST_y(ST_transform(ST_GeomFromText(‘POINT(‘||’457811 340198’||’)’,27700),4326)) AS latitude;

What we’re doing here is using PostGIS to first transform from our eastings / northings into a geometric type then transforming that back into latitude / longitude. My gut feeling is that there should be a more direct way to do the conversion than the above; but I couldn’t find any. The values we pass in are SRIDs; 27700 to represent SRID 27700 for OSGB 1936 / British National Grid, and 4326 to represent SRID 4326 for WGS84. For any references on the TM75 / Irish Grid we’d use SRID 29903 instead.

OrdananceSurvey Data

Getting the data from the data from OrdananceSurvey is fairly easy; go to the OpenData website, choose the Code Point data option and fill in the form to get a download link emailed to you.

The data may be lacking a few more postcodes than the Royal Mail data, but the geographic coordinates are much more accurate. Once you’ve got the data you can use the methods above to generate the latitude and longitude.

My process for importing the data is composed of 3 steps, firstly I want the area information along with my postcodes and they’re supplied in separate worksheets in an excel document named Codelist.xls. To quickly pull this data out I use the ssconvert utility supplied with Gnumeric:

$ sudo apt-get install gnumeric --no-install-recommends
$ ssconvert -S ~/codepoint_data/Doc/Codelist.xls ~/codepoint_data/Doc/Codelist_%s.csv > /dev/null 2>&1 

The tool wants to use X and will spew out warnings if you’re on a headless machine, I just pipe these to null (ignorance is bliss). Then I remove the duplicates and add in the area types with sed:

$ sed -e 's/$/,CTY/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_CTY.csv
$ sed -e 's/$/,DIS/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_DIS.csv
$ sed -e 's/$/,DIW/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_DIW.csv
$ sed -e 's/$/,LBO/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_LBO.csv
$ sed -e 's/$/,LBW/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_LBW.csv
$ sed -e 's/$/,MTD/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_MTD.csv
$ sed -e 's/$/,MTW/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_MTW.csv
$ sed -e 's/$/,UTA/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_UTA.csv
$ sed -e 's/$/,UTE/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_UTE.csv
$ sed -e 's/$/,UTW/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_UTW.csv

Then I use a perl script to generate the latitude and longitude:

$ sudo apt-get install proj-bin
$ sudo perl -MCPAN -e 'install Text::CSV'
$ sudo perl -MCPAN -e 'install Geo::Proj4'
$ -i "~/codepoint_data/Data/CSV/*.csv" -o ~/codepoint_data/Data/all_areas_20140324.csv

I can now copy these directly into my database with a plpgsql script (See the link at the end of the post for the perl script and a database function to import into PostgreSQL)

Royal Mail Postcode Address File

If you’ve paid for the “postzon” data from the Royal Mail, you can go about generating the latitude and longitude in a similar way. There’s artefacts in the data from RM that shows its legacy, but they provide some (fairly convoluted) documentation here to help with deciphering it. The main gotchas being:

  • The cartesian coordinates are only accurate to 100 meters, which can be annoying as it can place you on the wrong street.
  • You may have to fiddle with the northing values before you can import them, as for 7 digit northings the leftmost digit is alphabetic (P|O=12,U|T=11,Z|Y=10) to fit them into a 5 character field (the least significant digit is always truncated).
  • The cartesian coordinates for Northern Ireland (postcodes starting BT) are for the TM75 Irish Grid system, whereas the rest are for the OSGB 1936 / British National Grid system. No fault of RM, but these need to be converted with the different SRID.
  • If you’re using the relational full file format there’s a few rules to follow regarding their schema design, one being that organisation keys for large organisations relate to address keys in their organisations table instead of organisation keys.

Like the OS data my process for importing is to post process the data with a perl script:

$ -i ~/rm_data/pzone100.c01 -o ~/rm_data/pzone100.c02

Again, I then import the data into my database with a plpgsql script (linked below).

Edit 22/05/2014: I’ve now tested an actual quarterly update, and created a separate version of the plpgsql functions using postgis to do all the latitude / longitude parts for both import and update (hence the script above isn’t required if you are using postgis)

You can find the scripts in my git repository at postgresql/geographic_data (mirrored also on github glynastill/geographic_data). I’m not totally happy with the speed of the perl scripts that do the latitude / longitude creation before loading; they could be much faster with a few minor changes. I’m sure the plpgsql functions could easily be translated for other databases without too much hassle if you really want to.