PostgreSQL | database dumping ground

When dealing with geographic data in PostgreSQL, at some point we’re going to want to lookup rows based purely on their location relative to one another.

With smaller volumes of data, we can get away quite easily with just latitude / longitude and some maths, but once we get past a certain point we’re going to want to be able to do index lookups based on distance from a specific location; this is where the PostGIS extension is priceless.

Installing PostGIS on Debian Wheezy

Installing under Debian when PostgreSQL has been installed via the package manager is dead simple. I’m using the 2.1 version packaged for a 9.3 server from the apt.postgresql.org repository, but you should select the package that matches your server version. It’s also possible to install version 1.5 directly from Wheezy main if you don’t want to add other repositories to apt.

$ sudo apt-get install postgresql-9.3-postgis-2.1

This will install the PostGIS contrib modules into the /usr/share/postgresql/9.3/contrib directory rather than the “extension” directory used by most other PostgreSQL packages in Debian.

Some of the machines I look after have PostgreSQL installed from source for various reasons (but mainly because historically packages from Debain weren’t very timely), and these machines require a slightly more lengthy setup process. To compile PostGIS from source we need to install some dependencies, one of which is the development files for the geospatial data abstraction library and these are in the “libgdal-dev” package; however this depends on the “libpq-dev” package which will most likely interfere with our source install. There are two options here; either install the package without it’s dependencies (make a note you’ve done it to avoid future dependency problems), or roll gdal from source:

First we install the required dependencies, these are listed in the PostGIS docs and for my test machine that already has everything required to compile PostgreSQL already these are:

$ sudo apt-get install libgeos-dev libproj-dev libjson-c-dev libjson0-dev

Then install the gdal dev libraries in whichever manner suits:

$ sudo apt-get download libgdal-dev
$ sudo dpkg --force-all -i libgdal-dev_1.9.0-3.1_amd64.deb

$ cd /tmp
$ wget http://download.osgeo.org/gdal/1.10.1/gdal-1.10.1.tar.gz
$ tar xvfz gdal-1.10.1.tar.gz && cd gdal-1.10.1
$ ./configure
$ make
$ sudo make install

Once this is done we can compile and install PostGIS:

$ cd /tmp
$ wget http://download.osgeo.org/postgis/source/postgis-2.1.1.tar.gz
$ tar xvfz postgis-2.1.1.tar.gz && cd postgis-2.1.1
$ ./configure
$ make				
$ sudo make install

Getting started with PostGIS

Now we’ve got PostGIS installed we just need to create the extension in the database:

$ psql -U glyn -d test -c 'CREATE EXTENSION postgis; CREATE EXTENSION postgis_topology;'

OR on and on 9.0 and lower run the sql scripts in the contrib/postgis-2.1 directory:

$ psql -U glyn -d test -f postgis.sql
$ psql -U glyn -d test -f postgis_comments.sql
$ psql -U glyn -d test -f spatial_ref_sys.sql
$ psql -U glyn -d test -f rtpostgis.sql
$ psql -U glyn -d test -f raster_comments.sql
$ psql -U glyn -d test -f topology.sql
$ psql -U glyn -d test -f topology_comments.sql

So let’s generate some data for testing; we create a table called “friends” with 90k rows which stores their locations as latitude and longitude values. Admittedly the distribution in this table won’t be that realistic, but it should suffice for testing:

[postgres]
INSERT INTO friends
SELECT b.unnest || ‘ v.’ || generate_series,
CASE WHEN random() > 0.5 THEN ‘Somewhere Else’ ELSE ‘Somewhere’ END,
1.0838637+random()*(CASE WHEN random() > 0.5 THEN -1 ELSE 1 END),
52.7389201+random()*(CASE WHEN random() > 0.5 THEN -1 ELSE 1 END)
FROM generate_series(1,10000)
CROSS JOIN (SELECT unnest(ARRAY[‘White Wonder’,’Colonel K’,’El Loco’,’Count Duckula’,’Leatherhead’,’Barron Greenback’,’Ernest Penfold’,’Professor Heinrich Von Squawkencluck’,’Flying Officer Buggles Pigeon’])) b;
[/postgres]

In the scenario where we don’t have PostGIS we can create an sql function to calculate earth distance between two points, but any relative distances will always be an unknown so can’t be indexed:

[postgres]
CREATE OR REPLACE FUNCTION earth_distance_miles(lat1 float, lat2 float, long1 float, long2 float)
RETURNS double precision
AS
‘SELECT 3963.0*acos(sin($1/57.2958)*sin($2/57.2958)+cos($1/57.2958)*cos($2/57.2958)*cos(($4/57.2958)-($3/57.2958)));’
LANGUAGE SQL IMMUTABLE;
[/postgres]

This makes listing out “friends” within a mile pretty easy:

[postgres]
EXPLAIN ANALYZE SELECT f.*,
earth_distance_miles(52.7389201, f.latitude, 1.0838637, f.longitude) AS dist_miles
FROM friends f WHERE earth_distance_miles(52.7389201, f.latitude, 1.0838637, f.longitude) <= 1
ORDER BY earth_distance_miles(52.7389201, f.latitude, 1.0838637, f.longitude);
[/postgres]

                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=10988.40..11063.40 rows=30000 width=108) (actual time=160.006..160.013 rows=29 loops=1)
   Sort Key: ((3963::double precision * acos(((0.795884736186082::double precision * sin((latitude / 57.2958::double precision))) 
   	+ ((0.605448170123596::double precision * cos((latitude / 57.2958::double precision))) * cos(((longitude / 57.2958::double precision) 
   	- 0.0189169834438126::double precision)))))))
   Sort Method:  quicksort  Memory: 30kB
   ->  Seq Scan on friends f  (cost=0.00..7510.00 rows=30000 width=108) (actual time=19.993..159.930 rows=29 loops=1)
         Filter: ((3963::double precision * acos(((0.795884736186082::double precision * sin((latitude / 57.2958::double precision))) 
         	+ ((0.605448170123596::double precision * cos((latitude / 57.2958::double precision))) 
         	* cos(((longitude / 57.2958::double precision) - 0.0189169834438126::double precision)))))) <= 1::double precision)
 Total runtime: 160.069 ms

Now lets put PostGIS to work and add in an indexed geography column:

[postgres]
ALTER TABLE friends ADD COLUMN geog geography(Point,4326); — SRID 4326 for WGS84
UPDATE friends SET geog = ST_MakePoint(longitude, latitude);
CREATE INDEX ON friends USING GIST (geog);
[/postgres]

Now let’s try to list out our “friends” within a mile again, this time making use of the PostGIS ST_Distance and ST_DWithin functions:

[postgres]
EXPLAIN ANALYZE SELECT f.*,
ST_Distance(f.geog, ST_MakePoint(1.0838637, 52.7389201))/1609 AS dist_miles
FROM friends f WHERE ST_DWithin(f.geog, ST_MakePoint(1.0838637, 52.7389201), 1609)
ORDER BY ST_Distance(f.geog, ST_MakePoint(1.0838637, 52.7389201));
[/postgres]

                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=37.90..37.90 rows=1 width=108) (actual time=1.756..1.760 rows=29 loops=1)
   Sort Key: (_st_distance(geog, '0101000020E610000017258D768157F13FB4ED0FEF945E4A40'::geography, 0::double precision, true))
   Sort Method:  quicksort  Memory: 32kB
   ->  Bitmap Heap Scan on friends f  (cost=2.39..37.89 rows=1 width=108) (actual time=0.908..1.684 rows=29 loops=1)
         Recheck Cond: (geog && '0101000020E610000017258D768157F13FB4ED0FEF945E4A40'::geography)
         Filter: (('0101000020E610000017258D768157F13FB4ED0FEF945E4A40'::geography && _st_expand(geog, 1609::double precision)) 
         	AND _st_dwithin(geog, '0101000020E610000017258D768157F13FB4ED0FEF945E4A40'::geography, 1609::double precision, true))
         ->  Bitmap Index Scan on friends_geog_idx  (cost=0.00..2.39 rows=16 width=0) (actual time=0.351..0.351 rows=45 loops=1)
               Index Cond: (geog && '0101000020E610000017258D768157F13FB4ED0FEF945E4A40'::geography)
 Total runtime: 1.821 ms

This shows a marked improvement from a query time of 160.069ms down to 1.821 ms. Obviously our mileage will vary depending on the quantity of data in the table, it’s distribution and just how many rows we want to retrieve.

I’ve recently been digging through a bunch of old test programs and scripts I’ve written over the years, and thought I’d upload some of them incase they can be of any use to anyone. I thought I’d start off with a console app I’d named DBLT.

DBLT is simply the abbreviation for “database load test” that jumped into my head when faced with the Visual Studio “New Project” dialogue. I wanted to perform some load tests in a way that more closely represented a specific application and measured the things I was interested in. I don’t class it as a polished testing platform, and I take the results of my tests with a pinch of salt; as I assume some of my techniques in running the tests and locking when maintaining the statistics will in some way have skewed them, but hopefully only slightly.

The test is written in C#, and uses the npgsql data provider to connect to PostgreSQL. As I mentioned initially, the main reason for this is I wanted to test different configurations in connecting to our databases in a similar way to how a specific application did. Mainly this was to test the performance of npgsqls local connection pool logic, and other associated parameters, but also to test other parameters on our pgBouncer pools, PostgreSQL and Linux hosts.

My other reason for writing my own test is that pgbench outputs results based around TPS, but I wanted a slightly different take on the statistics. I wanted to know things like how long my queries were taking on average, what the worst and best execution times were and the standard deviation of execution time to see that we were providing a good quality of service to all clients rather than just shunting the most TPS through the database. If you’re now thinking “the fool – he should have just run pgbench and run an analysis on his logs”, you’re quite right; but what that wouldn’t have given me an overall view of the behaviour of the system when we start including things like cascaded connection pools (local npgsql, and pgbouncer), networking and local client machine configuration. I could have analyzed some of this using tools like tsung, but had issues generating heavy load with tsung and its output didn’t give me exactly what I wanted.

I wrote dblt back in 2009, and I’ve found it useful on quite a few occasions but never thought to share it before. I’ve recently signed up to github so the source can be obtained from glynastill/dblt or my git repository at postgresql/dblt. I’ve also uploaded a compiled binary here

Hopefully someone finds it useful or highlights failings that can be fixed. Even if you don’t want to test against PostgreSQL it could quite easily be used to test sql server or other systems by changing the data provider.

database dumping ground

… getting the wrong answer fast is not an improvement over the right answer slow

Tag Archives: PostgreSQL

Spatial capabilities in PostgreSQL with PostGIS

Uploading old stuff – DBLT