PostgreSQL Latin1 database with ASP.Net Unicode and DOS codepage 850 applications

So the title sums it up without going into detail; we basically have a legacy DOS application that had its underlying dataflies migrated into a latin1 encoded PostgreSQL database with no form of intelligent mapping (i.e. the cp850 “ú” character which is codepoint 163 is seen as the character “£” in latin1).

Converting to UTF8 at this point would have saved lots of hassle, however at the time of conversion the existing data wouldn’t easily go into utf8 without further intermediate conversion. Even then UTF8 wouldn’t be a magic bullet; we’d still have to sanitize the incoming data so we didn’t store anything the DOS side (or any other equipment) couldn’t read.

Fast forward a few years and now there’s two main apps hitting the database; an ASP.Net application and the aforementioned legacy DOS system. Both are heavily used and the DOS system isn’t going anywhere so we need a way for both systems to get along; crucially the DOS application must be able to understand everything the ASP.Net application puts in the database, and the ASP.Net application must be able to display the DOS codepage correctly in UTF8.

So how do we do this? All my attempts using the PostgreSQL encode/decode functions failed due to there being no cp850 representation, and although I’m sure it’s possible to do the conversion in sql using something like translate / convert and a large mapping string I don’t have time to fiddle with it right now. However perls encode and decode functions do have the ability to convert from cp850, so one solution from within PostgreSQL is to write a set of untrusted pl/perl functions like:

CREATE OR REPLACE FUNCTION cp850_to_utf8(text)
RETURNS text AS
$BODY$
	use Encode;
       	return encode( 'iso-8859-1', decode('cp850', $_[0] ));
$BODY$
LANGUAGE 'plperlu' IMMUTABLE;

CREATE OR REPLACE FUNCTION utf8_to_cp850(text)
RETURNS text AS
$BODY$
	use Encode;
        return encode( 'cp850', decode('iso-8859-1', $_[0] ));
$BODY$
LANGUAGE 'plperlu' IMMUTABLE;

For my current problem this means if either application writes something like “Straße” the other can read back the same representation.

From the ASP.Net side we can do the conversion outside the database with a similar process. Converting text to go into the database involves switching the encoding first from unicode into cp850 then back into latin1 to get the correct representation on the DOS side:

private static Encoding _eCp850 = Encoding.GetEncoding(850);
private static Encoding _eUnicode = Encoding.UTF8;
private static Encoding _eLatin1 = Encoding.GetEncoding("ISO-8859-1");

public static string encode_cp850(string sText)
{
	string sReturn;
	byte[] bSource;
	byte[] bTarget;

	bSource = _eUnicode.GetBytes(sText);
	bTarget = Encoding.Convert(_eUnicode, _eCp850, bSource);
	sReturn = _eLatin1.GetString(bTarget);

	return sReturn;
}

Reading is a little trickier though, as all strings in .Net are UTF8 (and actually stored as UTF16 internally) so once our string is read back into a string we can’t easily do our conversion as our initial representation has been mangled. However if we read our string back from the database as bytea into a byte array we can successfully convert it into the correct UTF8 representation:

public static string decode_cp850(byte[] sTextAsBytea)
{
	string sReturn;
	byte[] bSource = sTextAsBytea;
	byte[] bTarget;

	bTarget = Encoding.Convert(_eCp850, _eUnicode, bSource);
	sReturn = _eUnicode.GetString(bTarget);

	return sReturn;
}
        

I’m still wondering if there is a better way (I know, I know, yes it’s to have everything in UTF8), but for now this seems to manage ok for us.

I’ve uploaded some scripts to my git repository at postgresql/encoding_woes (mirrored also on github glynastill/encoding_woes)

Leave a Reply

Your email address will not be published. Required fields are marked *