Category Archives: PostGIS

Updating Royal Mail PAF data and ignorance about PostgreSQL deferred foreign keys

A few months ago I bashed together a pl/pgsql function for importing Royal Mail PAF
data into PostgreSQL
and at the same time I also put together another function to perform an update from their “changes” files.

Today I came to actually try and use that function, and it’s not much of a surprise that it needed a little more work; at the time I was working off a fresh import and a small set of changes, whilst looking (with squinty eyes) at the fairly long winded docs. The changes files from RM are supplied as a journal of changes, which I can only assume were intended to be read and applied sequentially; this means we have to rely on some of the smarts SQL provides us with to apply them in bulk. The docs even try to dissuade the use of the changes files by saying “Applying Changes files regularly is ‘high maintenance’, because applying thousands of changes is a very long & involved process compared to buying full refreshes”, however I don’t want to do full refreshes as that’s likely to cause more interruption, and besides it just seems a bit of a cop out to me.

So after another quick skim over the docs, I made a couple of corrections to the update function where I’d complicated things by replacing blank key columns with nulls, and added in an update procedure for the Alias and Postzon files (which are only supplied as a full refresh), this time using postgis to generate the latitude and longitude and save running my horrible perl script.

Then I had a go at running the update:

TEST=# SELECT update_pc_paf('Y14M04', '/tmp');
NOTICE:  2014-05-22 19:40:00.161582+01: Import starting for edition Y14M04 with data root /tmp/Y14M04_CHANGES/
NOTICE:  2014-05-22 19:40:00.165233+01: Begin staging Changes1 file
NOTICE:  2014-05-22 19:40:00.439189+01: Done staging Changes1 file
NOTICE:  2014-05-22 19:40:00.439379+01: Preparing to update localities
NOTICE:  2014-05-22 19:40:00.487656+01: Prepared 15 records for update on localities
NOTICE:  2014-05-22 19:40:00.489481+01: Removed 0 records from localities
NOTICE:  2014-05-22 19:40:00.504522+01: Created 15 records in localities
NOTICE:  2014-05-22 19:40:00.504651+01: Preparing to update thoroughfares
NOTICE:  2014-05-22 19:40:00.54741+01: Prepared 377 records for update on thoroughfares
NOTICE:  2014-05-22 19:40:00.548924+01: Removed 0 records from thoroughfares
NOTICE:  2014-05-22 19:40:00.589353+01: Created 377 records in thoroughfares
NOTICE:  2014-05-22 19:40:00.589477+01: Preparing to update thoroughfare_descriptor
NOTICE:  2014-05-22 19:40:00.631469+01: Prepared 0 records for update on thoroughfare_descriptor
NOTICE:  2014-05-22 19:40:00.632967+01: Removed 0 records from thoroughfare_descriptor
NOTICE:  2014-05-22 19:40:00.633275+01: Created 0 records in thoroughfare_descriptor
NOTICE:  2014-05-22 19:40:00.633444+01: Preparing to update building_names
NOTICE:  2014-05-22 19:40:00.700856+01: Prepared 7947 records for update on building_names
NOTICE:  2014-05-22 19:40:00.704261+01: Removed 0 records from building_names
NOTICE:  2014-05-22 19:40:01.155068+01: Created 7947 records in building_names
NOTICE:  2014-05-22 19:40:01.155178+01: Preparing to update sub_building_names
NOTICE:  2014-05-22 19:40:01.223108+01: Prepared 7798 records for update on sub_building_names
NOTICE:  2014-05-22 19:40:01.226335+01: Removed 0 records from sub_building_names
NOTICE:  2014-05-22 19:40:01.669071+01: Created 7798 records in sub_building_names
NOTICE:  2014-05-22 19:40:01.675826+01: Begin staging Changes2 file
NOTICE:  2014-05-22 19:40:02.59217+01: Done staging Changes2 file
NOTICE:  2014-05-22 19:40:02.592303+01: Preparing to update mainfile
NOTICE:  2014-05-22 19:40:05.07166+01: Prepared 216438 records for update on mainfile
NOTICE:  2014-05-22 19:41:41.439761+01: Removed 87291 records from mainfile
NOTICE:  2014-05-22 19:42:54.976222+01: Created 129147 records in mainfile
NOTICE:  2014-05-22 19:42:54.97639+01: Preparing to update organisations
NOTICE:  2014-05-22 19:42:55.609607+01: Prepared 65158 records for update on organisations
NOTICE:  2014-05-22 19:43:07.778516+01: Removed 32840 records from organisations
NOTICE:  2014-05-22 19:43:10.141383+01: Created 32318 records in organisations
NOTICE:  2014-05-22 19:43:10.18821+01: Begin staging WChanges file
NOTICE:  2014-05-22 19:43:10.218245+01: Done staging WChanges file
NOTICE:  2014-05-22 19:43:10.218346+01: Preparing to update welsh mainfile
NOTICE:  2014-05-22 19:43:10.311915+01: Prepared 8395 records for update on welsh mainfile
NOTICE:  2014-05-22 19:43:20.631738+01: Removed 3098 records from welsh mainfile
NOTICE:  2014-05-22 19:43:23.616801+01: Created 5149 records in welsh mainfile
NOTICE:  2014-05-22 19:43:23.61696+01: Preparing to update welsh organisations
NOTICE:  2014-05-22 19:43:23.624131+01: Prepared 0 records for update on welsh organisations
NOTICE:  2014-05-22 19:43:23.624836+01: Removed 0 records from welsh organisations
NOTICE:  2014-05-22 19:43:23.625512+01: Created 0 records in welsh organisations
NOTICE:  2014-05-22 19:43:23.634001+01: Begin staging postzon
NOTICE:  2014-05-22 19:43:29.097317+01: Done staging postzon file, now staging with lat long data
NOTICE:  2014-05-22 19:44:41.996743+01: Done staging postzon with lat long data, updating
NOTICE:  2014-05-22 19:44:57.195802+01: Inserted 3594 new records for update on postzon_100m
NOTICE:  2014-05-22 19:46:13.242174+01: Updated 212398 records for update on postzon_100m
NOTICE:  2014-05-22 19:46:24.038271+01: Deleted 3991 records for update on postzon_100m
NOTICE:  2014-05-22 19:46:24.808325+01: Begin staging alias file
NOTICE:  2014-05-22 19:46:32.810242+01: Done staging alias file
NOTICE:  2014-05-22 19:46:34.021948+01: Done staging counties , updating
NOTICE:  2014-05-22 19:46:34.023405+01: Inserted 0 new records for update on counties
NOTICE:  2014-05-22 19:46:34.024632+01: Updated 0 records for update on counties
NOTICE:  2014-05-22 19:46:34.025273+01: Deleted 0 records for update on counties
NOTICE:  2014-05-22 19:46:38.25488+01: Done staging county_alias , updating
NOTICE:  2014-05-22 19:46:41.711214+01: Inserted 3594 new records for update on county_alias
NOTICE:  2014-05-22 19:46:46.227681+01: Updated 12 records for update on county_alias
NOTICE:  2014-05-22 19:46:53.812323+01: Deleted 3991 records for update on county_alias
NOTICE:  2014-05-22 19:46:53.812483+01: Completed
(1 row)

My schema for the PAF data has deferrable foreign keys, and in order to apply the updates in chunks I’ve set them as deferred in the above function; I only find out if I’ve violated them when they’re checked on commit:


…. And it just sits there forever! I go and do something else, but when I come back it’s still sitting there; what did I forget?

Well we can make a pretty good guess that we’re waiting whilst the deferred constraints are checked, but the constraint checks don’t take that long when they’re not deferred so why is it taking so long?

Well it turns out I’ve got indexes on the satellite tables that are referenced by the foreign keys on my main address table, but no indexes for those keys on the actual main address table. It’s apparent that when the deferred constraint is checked it needs to look up the row first; something that wouldn’t have been required if the constraint were immediate. I add some indexes, and try again:

Time: 281979.437 ms

It worked, and we didn’t violate any constraints!

The new script (named paf_postgresql_import_postgis.plpgsql) is in the git repository at postgresql/geographic_data (mirrored also on github glynastill/geographic_data).

Spatial capabilities in PostgreSQL with PostGIS

When dealing with geographic data in PostgreSQL, at some point we’re going to want to lookup rows based purely on their location relative to one another.

With smaller volumes of data, we can get away quite easily with just latitude / longitude and some maths, but once we get past a certain point we’re going to want to be able to do index lookups based on distance from a specific location; this is where the PostGIS extension is priceless.

Installing PostGIS on Debian Wheezy

Installing under Debian when PostgreSQL has been installed via the package manager is dead simple. I’m using the 2.1 version packaged for a 9.3 server from the repository, but you should select the package that matches your server version. It’s also possible to install version 1.5 directly from Wheezy main if you don’t want to add other repositories to apt.

$ sudo apt-get install postgresql-9.3-postgis-2.1

This will install the PostGIS contrib modules into the /usr/share/postgresql/9.3/contrib directory rather than the “extension” directory used by most other PostgreSQL packages in Debian.

Some of the machines I look after have PostgreSQL installed from source for various reasons (but mainly because historically packages from Debain weren’t very timely), and these machines require a slightly more lengthy setup process. To compile PostGIS from source we need to install some dependencies, one of which is the development files for the geospatial data abstraction library and these are in the “libgdal-dev” package; however this depends on the “libpq-dev” package which will most likely interfere with our source install. There are two options here; either install the package without it’s dependencies (make a note you’ve done it to avoid future dependency problems), or roll gdal from source:

First we install the required dependencies, these are listed in the PostGIS docs and for my test machine that already has everything required to compile PostgreSQL already these are:

$ sudo apt-get install libgeos-dev libproj-dev libjson-c-dev libjson0-dev

Then install the gdal dev libraries in whichever manner suits:

$ sudo apt-get download libgdal-dev
$ sudo dpkg --force-all -i libgdal-dev_1.9.0-3.1_amd64.deb


$ cd /tmp
$ wget
$ tar xvfz gdal-1.10.1.tar.gz && cd gdal-1.10.1
$ ./configure
$ make
$ sudo make install

Once this is done we can compile and install PostGIS:

$ cd /tmp
$ wget
$ tar xvfz postgis-2.1.1.tar.gz && cd postgis-2.1.1
$ ./configure
$ make				
$ sudo make install

Getting started with PostGIS

Now we’ve got PostGIS installed we just need to create the extension in the database:

$ psql -U glyn -d test -c 'CREATE EXTENSION postgis; CREATE EXTENSION postgis_topology;'

OR on and on 9.0 and lower run the sql scripts in the contrib/postgis-2.1 directory:

$ psql -U glyn -d test -f postgis.sql
$ psql -U glyn -d test -f postgis_comments.sql
$ psql -U glyn -d test -f spatial_ref_sys.sql
$ psql -U glyn -d test -f rtpostgis.sql
$ psql -U glyn -d test -f raster_comments.sql
$ psql -U glyn -d test -f topology.sql
$ psql -U glyn -d test -f topology_comments.sql

So let’s generate some data for testing; we create a table called “friends” with 90k rows which stores their locations as latitude and longitude values. Admittedly the distribution in this table won’t be that realistic, but it should suffice for testing:

SELECT b.unnest || ' v.' || generate_series,
    CASE WHEN random() > 0.5 THEN 'Somewhere Else' ELSE 'Somewhere' END,
    1.0838637+random()*(CASE WHEN random() > 0.5 THEN -1 ELSE 1 END),
    52.7389201+random()*(CASE WHEN random() > 0.5 THEN -1 ELSE 1 END)
FROM generate_series(1,10000)
CROSS JOIN (SELECT unnest(ARRAY['White Wonder','Colonel K','El Loco','Count Duckula','Leatherhead','Barron Greenback','Ernest Penfold','Professor Heinrich Von Squawkencluck','Flying Officer Buggles Pigeon'])) b;

In the scenario where we don’t have PostGIS we can create an sql function to calculate earth distance between two points, but any relative distances will always be an unknown so can’t be indexed:

CREATE OR REPLACE FUNCTION earth_distance_miles(lat1 float, lat2 float, long1 float, long2 float) 
RETURNS double precision
    'SELECT 3963.0*acos(sin($1/57.2958)*sin($2/57.2958)+cos($1/57.2958)*cos($2/57.2958)*cos(($4/57.2958)-($3/57.2958)));'

This makes listing out “friends” within a mile pretty easy:

    earth_distance_miles(52.7389201, f.latitude,  1.0838637, f.longitude) AS dist_miles
FROM friends f WHERE earth_distance_miles(52.7389201, f.latitude,  1.0838637, f.longitude) <= 1
ORDER BY earth_distance_miles(52.7389201, f.latitude,  1.0838637, f.longitude);
                                           QUERY PLAN
 Sort  (cost=10988.40..11063.40 rows=30000 width=108) (actual time=160.006..160.013 rows=29 loops=1)
   Sort Key: ((3963::double precision * acos(((0.795884736186082::double precision * sin((latitude / 57.2958::double precision))) 
   	+ ((0.605448170123596::double precision * cos((latitude / 57.2958::double precision))) * cos(((longitude / 57.2958::double precision) 
   	- 0.0189169834438126::double precision)))))))
   Sort Method:  quicksort  Memory: 30kB
   ->  Seq Scan on friends f  (cost=0.00..7510.00 rows=30000 width=108) (actual time=19.993..159.930 rows=29 loops=1)
         Filter: ((3963::double precision * acos(((0.795884736186082::double precision * sin((latitude / 57.2958::double precision))) 
         	+ ((0.605448170123596::double precision * cos((latitude / 57.2958::double precision))) 
         	* cos(((longitude / 57.2958::double precision) - 0.0189169834438126::double precision)))))) <= 1::double precision)
 Total runtime: 160.069 ms

Now lets put PostGIS to work and add in an indexed geography column:

ALTER TABLE friends ADD COLUMN geog geography(Point,4326); -- SRID 4326 for WGS84
UPDATE friends SET geog = ST_MakePoint(longitude, latitude);

Now let’s try to list out our “friends” within a mile again, this time making use of the PostGIS ST_Distance and ST_DWithin functions:

    ST_Distance(f.geog, ST_MakePoint(1.0838637, 52.7389201))/1609 AS dist_miles
FROM friends f WHERE ST_DWithin(f.geog, ST_MakePoint(1.0838637, 52.7389201), 1609)
ORDER BY ST_Distance(f.geog, ST_MakePoint(1.0838637, 52.7389201));
                                           QUERY PLAN
 Sort  (cost=37.90..37.90 rows=1 width=108) (actual time=1.756..1.760 rows=29 loops=1)
   Sort Key: (_st_distance(geog, '0101000020E610000017258D768157F13FB4ED0FEF945E4A40'::geography, 0::double precision, true))
   Sort Method:  quicksort  Memory: 32kB
   ->  Bitmap Heap Scan on friends f  (cost=2.39..37.89 rows=1 width=108) (actual time=0.908..1.684 rows=29 loops=1)
         Recheck Cond: (geog && '0101000020E610000017258D768157F13FB4ED0FEF945E4A40'::geography)
         Filter: (('0101000020E610000017258D768157F13FB4ED0FEF945E4A40'::geography && _st_expand(geog, 1609::double precision)) 
         	AND _st_dwithin(geog, '0101000020E610000017258D768157F13FB4ED0FEF945E4A40'::geography, 1609::double precision, true))
         ->  Bitmap Index Scan on friends_geog_idx  (cost=0.00..2.39 rows=16 width=0) (actual time=0.351..0.351 rows=45 loops=1)
               Index Cond: (geog && '0101000020E610000017258D768157F13FB4ED0FEF945E4A40'::geography)
 Total runtime: 1.821 ms

This shows a marked improvement from a query time of 160.069ms down to 1.821 ms. Obviously our mileage will vary depending on the quantity of data in the table, it’s distribution and just how many rows we want to retrieve.

UK Geographic postcode data, latitude longitude, Royal Mail PAF and Ordnance Survey data

When it comes to getting accurate geographic data for UK postcodes I only know of 2 sources; Ordnance Survey OpenData which is free, or the Postcode Address File (PAF) from the Royal Mail which costs a fair whack. If detail of individual addresses is required then there’s little choice other than to purchase the PAF data, but if only postcodes and coordinates are required then the OS data is fine.

Rather than latitude and longitude, both sources supply cartesian coordinates in the form of eastings and northings. It’s not a big issue though and we don’t even have to understand the maths involved in conversion because there are a few great cartographic tools out there that will do the conversions for us; perhaps the most popular of these is the Proj4 cartographic libraries.

Conversion to WGS84 / latitude and longitude

In my case I want to pull the data along with latitude and longitude into a PostgreSQL database and I have 2 options; do the conversion before importing with the cs2cs tool or use the PostGIS module within the database to compute them afterwards or during the load from an intermediate table.

The cs2cs tool takes input from stdin or a file and outputs anything after the eastings and northings to stdout. E.g. to convert from the British national grid, we’d do something like:

$ echo '7811 340198 other data' | cs2cs -f '%.7f' +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894 +units=m +no_defs +to +proj=latlong +ellps=WGS84 +towgs84=0,0,0 +no_defs

A breakdown of the parameters passed above:

  • -f ‘%.7f’ – Format of output – 7 decimal places
  • +proj=tmerc – Transverse Mercator Projection
  • +lat_0=49 – National grid true origin latitude
  • +lon_0=-2 – National grid true origin longitude
  • +k=0.9996012717 – National Grid scale factor on central meridian
  • +x_0=400000 – Easting of true origin metres
  • +y_0=-100000 – Northing of true origin metres
  • +ellps=airy – National grid is based on airy 1830 ellipsoid
  • +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894
    The 7 Bursa Wolf transformation parameters, used in our projection transformation to approximate the transformation from horizontal datum
  • +units=m – Units in metres
  • +no_defs – Don’t use the /usr/share/proj/proj_def.dat defaults file (they’re for US maps)
  • +to +proj=latlong – Ouptut latitude and longitude
  • +ellps=WGS84 – Ouptut WGS84 ellipsoid
  • +towgs84=0,0,0 – WGS84 datum shifts set to zero

If we have the PostGIS module loaded in PostgreSQL we could calculate our latitude and longitude similarly with

SELECT ST_x(ST_transform(ST_GeomFromText('POINT('||'457811 340198'||')',27700),4326)) AS longitude,
	ST_y(ST_transform(ST_GeomFromText('POINT('||'457811 340198'||')',27700),4326)) AS latitude;

What we’re doing here is using PostGIS to first transform from our eastings / northings into a geometric type then transforming that back into latitude / longitude. My gut feeling is that there should be a more direct way to do the conversion than the above; but I couldn’t find any. The values we pass in are SRIDs; 27700 to represent SRID 27700 for OSGB 1936 / British National Grid, and 4326 to represent SRID 4326 for WGS84. For any references on the TM75 / Irish Grid we’d use SRID 29903 instead.

OrdananceSurvey Data

Getting the data from the data from OrdananceSurvey is fairly easy; go to the OpenData website, choose the Code Point data option and fill in the form to get a download link emailed to you.

The data may be lacking a few more postcodes than the Royal Mail data, but the geographic coordinates are much more accurate. Once you’ve got the data you can use the methods above to generate the latitude and longitude.

My process for importing the data is composed of 3 steps, firstly I want the area information along with my postcodes and they’re supplied in separate worksheets in an excel document named Codelist.xls. To quickly pull this data out I use the ssconvert utility supplied with Gnumeric:

$ sudo apt-get install gnumeric --no-install-recommends
$ ssconvert -S ~/codepoint_data/Doc/Codelist.xls ~/codepoint_data/Doc/Codelist_%s.csv > /dev/null 2>&1 

The tool wants to use X and will spew out warnings if you’re on a headless machine, I just pipe these to null (ignorance is bliss). Then I remove the duplicates and add in the area types with sed:

$ sed -e 's/$/,CTY/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_CTY.csv
$ sed -e 's/$/,DIS/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_DIS.csv
$ sed -e 's/$/,DIW/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_DIW.csv
$ sed -e 's/$/,LBO/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_LBO.csv
$ sed -e 's/$/,LBW/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_LBW.csv
$ sed -e 's/$/,MTD/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_MTD.csv
$ sed -e 's/$/,MTW/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_MTW.csv
$ sed -e 's/$/,UTA/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_UTA.csv
$ sed -e 's/$/,UTE/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_UTE.csv
$ sed -e 's/$/,UTW/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_UTW.csv

Then I use a perl script to generate the latitude and longitude:

$ sudo apt-get install proj-bin
$ sudo perl -MCPAN -e 'install Text::CSV'
$ sudo perl -MCPAN -e 'install Geo::Proj4'
$ -i "~/codepoint_data/Data/CSV/*.csv" -o ~/codepoint_data/Data/all_areas_20140324.csv

I can now copy these directly into my database with a plpgsql script (See the link at the end of the post for the perl script and a database function to import into PostgreSQL)

Royal Mail Postcode Address File

If you’ve paid for the “postzon” data from the Royal Mail, you can go about generating the latitude and longitude in a similar way. There’s artefacts in the data from RM that shows its legacy, but they provide some (fairly convoluted) documentation here to help with deciphering it. The main gotchas being:

  • The cartesian coordinates are only accurate to 100 meters, which can be annoying as it can place you on the wrong street.
  • You may have to fiddle with the northing values before you can import them, as for 7 digit northings the leftmost digit is alphabetic (P|O=12,U|T=11,Z|Y=10) to fit them into a 5 character field (the least significant digit is always truncated).
  • The cartesian coordinates for Northern Ireland (postcodes starting BT) are for the TM75 Irish Grid system, whereas the rest are for the OSGB 1936 / British National Grid system. No fault of RM, but these need to be converted with the different SRID.
  • If you’re using the relational full file format there’s a few rules to follow regarding their schema design, one being that organisation keys for large organisations relate to address keys in their organisations table instead of organisation keys.

Like the OS data my process for importing is to post process the data with a perl script:

$ -i ~/rm_data/pzone100.c01 -o ~/rm_data/pzone100.c02

Again, I then import the data into my database with a plpgsql script (linked below).

Edit 22/05/2014: I’ve now tested an actual quarterly update, and created a separate version of the plpgsql functions using postgis to do all the latitude / longitude parts for both import and update (hence the script above isn’t required if you are using postgis)

You can find the scripts in my git repository at postgresql/geographic_data (mirrored also on github glynastill/geographic_data). I’m not totally happy with the speed of the perl scripts that do the latitude / longitude creation before loading; they could be much faster with a few minor changes. I’m sure the plpgsql functions could easily be translated for other databases without too much hassle if you really want to.