Perl | database dumping ground

When it comes to getting accurate geographic data for UK postcodes I only know of 2 sources; Ordnance Survey OpenData which is free, or the Postcode Address File (PAF) from the Royal Mail which costs a fair whack. If detail of individual addresses is required then there’s little choice other than to purchase the PAF data, but if only postcodes and coordinates are required then the OS data is fine.

Rather than latitude and longitude, both sources supply cartesian coordinates in the form of eastings and northings. It’s not a big issue though, and we don’t even have to understand the maths involved in the conversion because there are a few great cartographic tools out there that will do it for us; perhaps the most popular of these is the Proj4 cartographic library.

Conversion to WGS84 / latitude and longitude

In my case I want to pull the data along with latitude and longitude into a PostgreSQL database and I have 2 options; do the conversion before importing with the cs2cs tool or use the PostGIS module within the database to compute them afterwards or during the load from an intermediate table.

The cs2cs tool takes input from stdin or a file and writes the converted coordinates to stdout, passing through anything that follows the eastings and northings on each line. E.g. to convert from the British national grid, we’d do something like:

$ echo '457811 340198 other data' | cs2cs -f '%.7f' +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894 +units=m +no_defs +to +proj=latlong +ellps=WGS84 +towgs84=0,0,0 +no_defs

A breakdown of the parameters passed above:

-f '%.7f' – Format of output – 7 decimal places
+proj=tmerc – Transverse Mercator Projection
+lat_0=49 – National grid true origin latitude
+lon_0=-2 – National grid true origin longitude
+k=0.9996012717 – National Grid scale factor on central meridian
+x_0=400000 – Easting of true origin metres
+y_0=-100000 – Northing of true origin metres
+ellps=airy – National grid is based on airy 1830 ellipsoid
+towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894 – The 7 Bursa Wolf transformation parameters, used to approximate the shift from the OSGB 1936 horizontal datum to WGS84
+units=m – Units in metres
+no_defs – Don’t use the /usr/share/proj/proj_def.dat defaults file (they’re for US maps)
+to +proj=latlong – Output latitude and longitude
+ellps=WGS84 – Output WGS84 ellipsoid
+towgs84=0,0,0 – WGS84 datum shifts set to zero
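
If you want to drive cs2cs from a script rather than the shell, the parameters above can be kept in one place and assembled into a command. This is only a rough sketch: the names OSGB36, WGS84, cs2cs_command and convert are mine, and convert simply hands back whatever line cs2cs prints (or None if cs2cs isn't installed) without trying to parse it.

```python
import shutil
import subprocess

# Source definition: OSGB 1936 / British National Grid, as broken down above.
OSGB36 = [
    "+proj=tmerc", "+lat_0=49", "+lon_0=-2", "+k=0.9996012717",
    "+x_0=400000", "+y_0=-100000", "+ellps=airy",
    "+towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894",
    "+units=m", "+no_defs",
]

# Target definition: WGS84 latitude / longitude.
WGS84 = ["+proj=latlong", "+ellps=WGS84", "+towgs84=0,0,0", "+no_defs"]


def cs2cs_command(precision=7):
    """Build the cs2cs argument list (hypothetical helper, not part of Proj4)."""
    return ["cs2cs", "-f", f"%.{precision}f", *OSGB36, "+to", *WGS84]


def convert(easting, northing):
    """Return the raw cs2cs output line, or None if cs2cs is not on the PATH."""
    if shutil.which("cs2cs") is None:
        return None
    result = subprocess.run(
        cs2cs_command(),
        input=f"{easting} {northing}\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Passing the arguments as a list avoids any shell quoting issues with the long +towgs84 value.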

If we have the PostGIS module loaded in PostgreSQL we could calculate our latitude and longitude similarly with

[postgres]
SELECT ST_x(ST_transform(ST_GeomFromText('POINT(' || '457811 340198' || ')', 27700), 4326)) AS longitude,
ST_y(ST_transform(ST_GeomFromText('POINT(' || '457811 340198' || ')', 27700), 4326)) AS latitude;
[/postgres]

What we’re doing here is using PostGIS to first build a geometry from our eastings / northings, then transform that into latitude / longitude. My gut feeling is that there should be a more direct way to do the conversion than the above; but I couldn’t find any. The values we pass in are SRIDs; 27700 for OSGB 1936 / British National Grid, and 4326 for WGS84. For anything referenced on the TM75 / Irish Grid we’d use SRID 29903 instead.

Ordnance Survey Data

Getting the data from Ordnance Survey is fairly easy; go to the OpenData website, choose the Code Point data option and fill in the form to get a download link emailed to you.

The data may be missing a few postcodes that the Royal Mail data includes, but the geographic coordinates are much more accurate. Once you’ve got the data you can use the methods above to generate the latitude and longitude.

My process for importing the data is composed of 3 steps. Firstly I want the area information along with my postcodes, and they’re supplied in separate worksheets in an Excel document named Codelist.xls. To quickly pull this data out I use the ssconvert utility supplied with Gnumeric:

$ sudo apt-get install gnumeric --no-install-recommends
$ ssconvert -S ~/codepoint_data/Doc/Codelist.xls ~/codepoint_data/Doc/Codelist_%s.csv > /dev/null 2>&1

The tool wants to use X and will spew out warnings if you’re on a headless machine; I just redirect these to /dev/null (ignorance is bliss). Then I remove the duplicates and add in the area types with sed:

$ sed -e 's/$/,CTY/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_CTY.csv
$ sed -e 's/$/,DIS/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_DIS.csv
$ sed -e 's/$/,DIW/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_DIW.csv
$ sed -e 's/$/,LBO/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_LBO.csv
$ sed -e 's/$/,LBW/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_LBW.csv
$ sed -e 's/$/,MTD/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_MTD.csv
$ sed -e 's/$/,MTW/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_MTW.csv
$ sed -e 's/$/,UTA/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_UTA.csv
$ sed -e 's/$/,UTE/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_UTE.csv
$ sed -e 's/$/,UTW/; s/|/,/g; /\(DET\)/d' -i ~/codepoint_data/Doc/Codelist_UTW.csv
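
For anyone without sed to hand, the same three edits (append the area type, swap pipes for commas, drop any line matching DET) could be sketched in a few lines of Python. The function name tag_codelist is my own and the sample rows in the usage note are made up purely for illustration.

```python
# The ten area types handled by the sed commands above.
AREA_TYPES = ["CTY", "DIS", "DIW", "LBO", "LBW", "MTD", "MTW", "UTA", "UTE", "UTW"]


def tag_codelist(lines, area_type):
    """Apply the same three edits as the sed one-liner to a list of lines."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        # s/$/,TYPE/ then s/|/,/g
        tagged = (line + "," + area_type).replace("|", ",")
        # In sed's basic regex the escaped parentheses form a group,
        # so the delete command matches any line containing DET.
        if "DET" in tagged:
            continue
        out.append(tagged)
    return out
```

With two hypothetical rows, tag_codelist(["Example County|X0001", "Example (DET)|X0002"], "CTY") keeps only the first and returns it as "Example County,X0001,CTY".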

Then I use a perl script to generate the latitude and longitude:

$ sudo apt-get install proj-bin
$ sudo perl -MCPAN -e 'install Text::CSV'
$ sudo perl -MCPAN -e 'install Geo::Proj4'
	
$ os_convert.pl -i "~/codepoint_data/Data/CSV/*.csv" -o ~/codepoint_data/Data/all_areas_20140324.csv

I can now copy these directly into my database with a plpgsql script (See the link at the end of the post for the perl script and a database function to import into PostgreSQL)

Royal Mail Postcode Address File

If you’ve paid for the “postzon” data from the Royal Mail, you can go about generating the latitude and longitude in a similar way. There are artefacts in the data from RM that show its legacy, but they provide some (fairly convoluted) documentation to help with deciphering it. The main gotchas are:

The cartesian coordinates are only accurate to 100 metres, which can be annoying as it can place you on the wrong street.
You may have to fiddle with the northing values before you can import them, as for 7 digit northings the leftmost digit is alphabetic (P|O=12,U|T=11,Z|Y=10) to fit them into a 5 character field (the least significant digit is always truncated).
The cartesian coordinates for Northern Ireland (postcodes starting BT) are for the TM75 Irish Grid system, whereas the rest are for the OSGB 1936 / British National Grid system. No fault of RM, but these need to be converted with the different SRID.
If you’re using the relational full file format there are a few rules to follow regarding their schema design, one being that organisation keys for large organisations relate to address keys in their organisations table instead of organisation keys.
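
The northing and SRID gotchas above are easy to get wrong, so here is a small sketch of how I read them. The function names are hypothetical, and the scale of 10 (putting back the truncated least significant digit to get metres) is my assumption from the description above rather than something taken from the RM documentation, so check it against your own file.

```python
# Letter prefixes standing in for the two leading digits of a 7 digit northing.
NORTHING_PREFIX = {
    "P": "12", "O": "12",
    "U": "11", "T": "11",
    "Z": "10", "Y": "10",
}


def decode_northing(field, scale=10):
    """Expand a 5 character northing field into a number.

    scale=10 is an assumption: it restores the truncated last digit.
    """
    field = field.strip()
    head = field[:1].upper()
    if head in NORTHING_PREFIX:
        field = NORTHING_PREFIX[head] + field[1:]
    return int(field) * scale


def srid_for_postcode(postcode):
    """BT postcodes are on the TM75 / Irish Grid, everything else on OSGB 1936."""
    if postcode.strip().upper().startswith("BT"):
        return 29903
    return 27700
```

Under those assumptions a field of "P3456" expands to 123456 and then to 1234560 metres, while a purely numeric field such as "34019" becomes 340190.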

As with the OS data, my process for importing is to post-process the data with a Perl script:

$ paf_convert.pl -i ~/rm_data/pzone100.c01 -o ~/rm_data/pzone100.c02

Again, I then import the data into my database with a plpgsql script (linked below).

Edit 22/05/2014: I’ve now tested an actual quarterly update, and created a separate version of the plpgsql functions using postgis to do all the latitude / longitude parts for both import and update (hence the script above isn’t required if you are using postgis)

You can find the scripts in my git repository at postgresql/geographic_data (mirrored also on github glynastill/geographic_data). I’m not totally happy with the speed of the perl scripts that do the latitude / longitude creation before loading; they could be much faster with a few minor changes. I’m sure the plpgsql functions could easily be translated for other databases without too much hassle if you really want to.

So the title sums it up without going into detail; we basically have a legacy DOS application that had its underlying data files migrated into a latin1 encoded PostgreSQL database with no form of intelligent mapping (i.e. the cp850 “ú” character which is codepoint 163 is seen as the character “£” in latin1).
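
To make the mismatch concrete, here is a tiny Python illustration (Python only because both codecs ship with it; it isn’t part of the setup described here) of the same byte being read under each encoding:

```python
raw = bytes([163])                 # the single byte stored by the DOS application

as_dos = raw.decode("cp850")       # how the DOS side sees it
as_latin1 = raw.decode("latin-1")  # how a latin1 database sees it

print(as_dos, as_latin1)
```

The byte never changes; only the interpretation does, which is why a straight copy of the data files into a latin1 database shows the wrong characters.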

Converting to UTF8 at this point would have saved lots of hassle, however at the time of conversion the existing data wouldn’t easily go into utf8 without further intermediate conversion. Even then UTF8 wouldn’t be a magic bullet; we’d still have to sanitize the incoming data so we didn’t store anything the DOS side (or any other equipment) couldn’t read.
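
As a rough idea of what that sanitizing could look like, this sketch checks whether text survives encoding to cp850 and swaps out anything that doesn’t. The function names and the "?" placeholder are my own choices for illustration, not what our applications actually do.

```python
def cp850_safe(text):
    """Return True if every character has a cp850 representation."""
    try:
        text.encode("cp850")
    except UnicodeEncodeError:
        return False
    return True


def sanitize_for_cp850(text, placeholder="?"):
    """Replace characters the DOS side could not read with a placeholder."""
    return "".join(ch if cp850_safe(ch) else placeholder for ch in text)
```

The euro sign is a handy test case, as it has no slot in cp850, whereas the pound sign and ß both do.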

Fast forward a few years and now there’s two main apps hitting the database; an ASP.Net application and the aforementioned legacy DOS system. Both are heavily used and the DOS system isn’t going anywhere so we need a way for both systems to get along; crucially the DOS application must be able to understand everything the ASP.Net application puts in the database, and the ASP.Net application must be able to display the DOS codepage correctly in UTF8.

So how do we do this? All my attempts using the PostgreSQL encode/decode functions failed due to there being no cp850 representation, and although I’m sure it’s possible to do the conversion in SQL using something like translate / convert and a large mapping string, I don’t have time to fiddle with it right now. However Perl’s encode and decode functions do have the ability to convert from cp850, so one solution from within PostgreSQL is to write a set of untrusted pl/perl functions like:

[postgres]
CREATE OR REPLACE FUNCTION cp850_to_utf8(text)
RETURNS text AS
$BODY$
use Encode;
return encode( 'iso-8859-1', decode('cp850', $_[0] ));
$BODY$
LANGUAGE 'plperlu' IMMUTABLE;

CREATE OR REPLACE FUNCTION utf8_to_cp850(text)
RETURNS text AS
$BODY$
use Encode;
return encode( 'cp850', decode('iso-8859-1', $_[0] ));
$BODY$
LANGUAGE 'plperlu' IMMUTABLE;

[/postgres]

For my current problem this means if either application writes something like “Straße” the other can read back the same representation.

From the ASP.Net side we can do the conversion outside the database with a similar process. Converting text to go into the database involves switching the encoding first from unicode into cp850 then back into latin1 to get the correct representation on the DOS side:

private static Encoding _eCp850 = Encoding.GetEncoding(850);
private static Encoding _eUnicode = Encoding.UTF8;
private static Encoding _eLatin1 = Encoding.GetEncoding("ISO-8859-1");

public static string encode_cp850(string sText)
{
	string sReturn;
	byte[] bSource;
	byte[] bTarget;

	bSource = _eUnicode.GetBytes(sText);
	bTarget = Encoding.Convert(_eUnicode, _eCp850, bSource);
	sReturn = _eLatin1.GetString(bTarget);

	return sReturn;
}

Reading is a little trickier though, as all strings in .Net are Unicode (stored as UTF16 internally), so once our data has been read back into a string we can’t easily do our conversion as our initial representation has been mangled. However if we read our string back from the database as bytea into a byte array we can successfully convert it into the correct UTF8 representation:

public static string decode_cp850(byte[] sTextAsBytea)
{
	string sReturn;
	byte[] bSource = sTextAsBytea;
	byte[] bTarget;

	bTarget = Encoding.Convert(_eCp850, _eUnicode, bSource);
	sReturn = _eUnicode.GetString(bTarget);

	return sReturn;
}
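
The same round trip can be sketched in a few lines of Python, which makes it easy to see what ends up stored. This mirrors the two C# methods above rather than replacing them; the names encode_for_db and decode_from_db are hypothetical.

```python
def encode_for_db(text):
    """Unicode text -> cp850 bytes, reinterpreted as a latin1 string for storage."""
    return text.encode("cp850").decode("latin-1")


def decode_from_db(raw):
    """Raw bytes read back as bytea -> Unicode text, treating them as cp850."""
    return raw.decode("cp850")


stored = encode_for_db("Straße")             # what the latin1 database holds
as_bytea = stored.encode("latin-1")          # the same value read back as bytes
round_tripped = decode_from_db(as_bytea)
```

Here stored contains “á” where the “ß” was, because both occupy byte 225 in their respective code pages, and decoding the raw bytes as cp850 gets the original text back.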

I’m still wondering if there is a better way (I know, I know, yes it’s to have everything in UTF8), but for now this seems to manage ok for us.

I’ve uploaded some scripts to my git repository at postgresql/encoding_woes (mirrored also on github glynastill/encoding_woes)


UK Geographic postcode data, latitude longitude, Royal Mail PAF and Ordnance Survey data

PostgreSQL Latin1 database with ASP.Net Unicode and DOS codepage 850 applications