Category Archives: Playtime

Playing around with things for no sensible reason!

Copying Pavel Stehule’s simple history table but with the jsonb type

On 15/01/2015 Pavel Stehule wrote about implementing a dead simple history table using the hstore type.

On Friday evening I wanted to copy this almost line for line, switching the hstore type for jsonb, but I couldn’t really see how to replicate the update part so simply without creating a delete operator. Once that operator has been created it appears to work:

[postgres highlight="6,7,14,23,31,32,35"]
CREATE TABLE test(a int, b int, c int);

CREATE TABLE history(
event_time timestamp(2),
executed_by text,
origin_value jsonb,
new_value jsonb
);

CREATE OR REPLACE FUNCTION history_insert()
RETURNS TRIGGER AS $$
BEGIN
INSERT INTO history(event_time, executed_by, new_value)
VALUES(CURRENT_TIMESTAMP, SESSION_USER, row_to_json(NEW)::jsonb);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION history_delete()
RETURNS TRIGGER AS $$
BEGIN
INSERT INTO history(event_time, executed_by, origin_value)
VALUES(CURRENT_TIMESTAMP, SESSION_USER, row_to_json(OLD)::jsonb);
RETURN OLD; -- NEW is NULL in a DELETE trigger; the return value of an AFTER trigger is ignored anyway
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION history_update()
RETURNS TRIGGER AS $$
DECLARE
js_new jsonb := row_to_json(NEW)::jsonb;
js_old jsonb := row_to_json(OLD)::jsonb;
BEGIN
INSERT INTO history(event_time, executed_by, origin_value, new_value)
VALUES(CURRENT_TIMESTAMP, SESSION_USER, js_old - js_new, js_new - js_old);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER test_history_insert AFTER INSERT ON test
FOR EACH ROW EXECUTE PROCEDURE history_insert();

CREATE TRIGGER test_history_delete AFTER DELETE ON test
FOR EACH ROW EXECUTE PROCEDURE history_delete();

CREATE TRIGGER test_history_update AFTER UPDATE ON test
FOR EACH ROW EXECUTE PROCEDURE history_update();
[/postgres]
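
The listing above depends on a - operator for jsonb that isn’t shown, and 9.4 doesn’t ship one. As a minimal sketch of what it could look like (the function name jsonb_minus is made up for illustration, and it only compares top-level key/value pairs), something like this should give the behaviour seen in the results:

[postgres]
-- Hypothetical helper: keep the pairs of "a" that do not appear identically in "b"
CREATE OR REPLACE FUNCTION jsonb_minus(a jsonb, b jsonb)
RETURNS jsonb AS $$
SELECT COALESCE(json_object_agg(l.key, l.value::json)::jsonb, '{}'::jsonb)
FROM jsonb_each(a) AS l
WHERE NOT EXISTS (
    SELECT 1 FROM jsonb_each(b) AS r
    WHERE r.key = l.key AND r.value = l.value
);
$$ LANGUAGE sql IMMUTABLE;

-- Expose the helper as the binary "-" operator used by history_update()
CREATE OPERATOR - (
    LEFTARG = jsonb,
    RIGHTARG = jsonb,
    PROCEDURE = jsonb_minus
);
[/postgres]

The COALESCE is there because the aggregate returns NULL when every pair matches, and an empty object seems a friendlier thing to store in the history table.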

Result:

[postgres]
INSERT INTO test VALUES(1000, 1001, 1002);
UPDATE test SET a = 10, c = 20;
DELETE FROM test;

postgres=# table history;
       event_time       | executed_by |         origin_value          |             new_value
------------------------+-------------+-------------------------------+-----------------------------------
 2015-01-19 13:12:54.26 | glyn        |                               | {"a": 1000, "b": 1001, "c": 1002}
 2015-01-19 13:12:54.31 | glyn        | {"a": 1000, "c": 1002}        | {"a": 10, "c": 20}
 2015-01-19 13:12:54.94 | glyn        | {"a": 10, "b": 1001, "c": 20} |
(3 rows)
[/postgres]

Tested on PostgreSQL 9.4

PostgreSQL 9.4 released

jsonb

The new jsonb data type stores JSON data internally in a binary form, which makes it possible to index the keys and values within. Previous versions have a json data type, but all that does is enforce valid JSON; the data is still stored as text. Whilst it is possible to do lookups on key-value data in previous versions using the hstore type (provided by the hstore module), with JSON seemingly being ubiquitous in applications these days, jsonb means we can just let devs store their data straight into the database and still be able to do fast lookups and searches.

At work we get quite a lot of variable callback data from web APIs, or serialized data from application objects, that tends to end up being stored as text. The ability to look up that data via a GIN index will be invaluable. I assume even XML storage should become easier, as there are plenty of pre-cooked ways to convert XML to JSON.

Let’s create a quick test table:

[postgres]
CREATE TABLE jsonb_test(
id integer PRIMARY KEY,
data jsonb
);
CREATE INDEX jsonb_test_data ON jsonb_test USING gin(data);

-- Obviously this data is ridiculous, but we need enough rows for postgres to prefer an index over a seq scan.
INSERT INTO jsonb_test
SELECT i, ('{"name": "Person' || i || '","age" : ' || i || ',"address": {"add1": "'
|| i || ' Clarence Street","city": "Lancaster","postcode": "LA13BG"}}')::jsonb
FROM generate_series(1,100000) i;
[/postgres]

Now if we query on the data column we should see the jsonb_test_data index being used:

[postgres]
TEST=# SELECT * FROM jsonb_test
WHERE data @> '{"address": {"add1": "2300 Clarence Street"}}';
  id  |                                                            data
------+-----------------------------------------------------------------------------------------------------------------------------
 2300 | {"age": 2300, "name": "Person2300", "address": {"add1": "2300 Clarence Street", "city": "Lancaster", "postcode": "LA13BG"}}
(1 row)

Time: 10.811 ms

TEST=# EXPLAIN SELECT * FROM jsonb_test
WHERE data @> '{"address": {"add1": "2300 Clarence Street"}}';
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Bitmap Heap Scan on jsonb_test  (cost=1040.83..1395.09 rows=107 width=147)
   Recheck Cond: (data @> '{"address": {"add1": "2300 Clarence Street"}}'::jsonb)
   ->  Bitmap Index Scan on jsonb_test_data  (cost=0.00..1040.80 rows=107 width=0)
         Index Cond: (data @> '{"address": {"add1": "2300 Clarence Street"}}'::jsonb)
(4 rows)
[/postgres]
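
If containment (@>) is the only kind of lookup we need, the jsonb_path_ops operator class is worth a look; it supports fewer operators than the default class but usually gives a smaller index. A quick sketch against the same test table (the index name is just one made up for this example):

[postgres]
-- Alternative GIN index that only supports the @> containment operator
CREATE INDEX jsonb_test_data_path ON jsonb_test USING gin(data jsonb_path_ops);

-- The same containment query as before can use either index
SELECT id FROM jsonb_test
WHERE data @> '{"address": {"add1": "2300 Clarence Street"}}';

-- Equality on an extracted value is not something a GIN index on "data" can serve;
-- a query written like this would need its own expression index.
SELECT id FROM jsonb_test
WHERE data -> 'address' ->> 'add1' = '2300 Clarence Street';
[/postgres]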

Logical Decoding

Whilst Logical Decoding isn’t really in a state to be put into active duty right away, it is pretty special, and allows postgres to supply a stream of changes (or partial changes) in a user defined format. This is similar to what we’ve been doing for ages with trigger based replication like Slony and Londiste, but dissimilar because instead of all the overhead and clunkiness of log triggers the changes are read directly from WAL in a similar way to streaming binary replication. The uses don’t end at master-slave replication either; multimaster and selective replication with per-table granularity, auditing, online upgrades and cache invalidation are just some of the possible uses.

Logical Decoding uses the concept of “replication slots”, which represent a stream of changes logged for a particular consumer, and we can have as many replication slots as we like. The great thing about replication slots is that once they’re created all WAL files required by the slot are retained, and they aren’t just for Logical Decoding; Streaming Replication can make use of them too, so we don’t have to balance wal_keep_segments or rely on archive_command any more. Replication slots aren’t a magic bullet though; if a replication slot isn’t being consumed it will cause PostgreSQL to consume disk space as it retains WAL files for the slot/consumer.

I mentioned earlier that Logical Decoding allows changes to be supplied in a “user defined format”; this is provided by an output plugin in the form of a shared library that needs to be custom written as required, and it’s in this output plugin where the format and any restrictions on what data we want would be controlled. The one exception to this is data used for identifying old rows from updates or deletes, which is defined before it is written to the WAL, and has to be set on a per table basis with ALTER TABLE REPLICA IDENTITY.
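
As a rough illustration using the jsonb_test table from earlier (jsonb_test_pkey is the name PostgreSQL gives the primary key index by default, so treat it as an assumption about this particular table):

[postgres]
-- Log the whole old row for updates and deletes
ALTER TABLE jsonb_test REPLICA IDENTITY FULL;

-- Or log only the columns covered by a particular unique index
ALTER TABLE jsonb_test REPLICA IDENTITY USING INDEX jsonb_test_pkey;

-- Back to the default (the primary key columns, if there is one)
ALTER TABLE jsonb_test REPLICA IDENTITY DEFAULT;
[/postgres]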

There’s a “test_decoding” plugin supplied as a contrib module that we can use for testing, and that’s what I’m going to have a quick look at now.

The first thing we have to do is set wal_level to logical and make sure max_replication_slots is greater than zero. Once we’ve done that and restarted PostgreSQL we’re ready to start playing, and we can create our first replication slot:

[postgres]
TEST=# SELECT * FROM pg_create_logical_replication_slot('test_replication_slot', 'test_decoding');
       slot_name       | xlog_position
-----------------------+---------------
 test_replication_slot | 0/56436390
(1 row)
[/postgres]

We should now be able to see our replication slot in the pg_replication_slots view:
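
The original listing for this step isn’t included here; a query along these lines (a reconstruction, so the column selection is only a suggestion) would show the slot, its plugin and whether a consumer is currently attached:

[postgres]
SELECT slot_name, plugin, slot_type, database, active, restart_lsn
FROM pg_replication_slots;
[/postgres]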

To look and see if there are any changes, we can use the pg_logical_slot_peek_changes function:

[postgres]
TEST=# \x
Expanded display is on.
TEST=# SELECT * FROM pg_logical_slot_peek_changes('test_replication_slot', NULL, NULL);

-[ RECORD 1 ]--------------------------------------------------------------------------------
location | 0/56436450
xid      | 1135906
data     | BEGIN 1135906
-[ RECORD 2 ]--------------------------------------------------------------------------------
location | 0/56436450
xid      | 1135906
data     | table _test_replication.sl_components: UPDATE: co_actor:'local_sync' co_pid[integer]:20814 co_node[integer]:0 co_connection_pid[integer]:20831 co_activity:'thread main loop' co_starttime[timestamp with time zone]:'2014-12-22 16:00:48+00' co_event[bigint]:null co_eventtype:'n/a'
-[ RECORD 3 ]--------------------------------------------------------------------------------
location | 0/56436518
xid      | 1135906
data     | COMMIT 1135906

< snip >
[/postgres]

… and I’ll snip my output there at 3 rows; I use this machine for Slony testing, so we’re already seeing all of the Slony chatter here, but you should be able to see the capture of an update to the “_test_replication.sl_components” table (this could be any table – I just happened to call my Slony cluster “test_replication” too). If you create some activity on your database, you should start to see some output. Notice that the output is the actual changes on the table, not a capture of the SQL statement that caused the changes; we can use this change information to build SQL if we want, or some other form of DML for another system.

To actually consume the queue we can call pg_logical_slot_get_changes:

[postgres]
TEST=# SELECT * FROM pg_logical_slot_get_changes('test_replication_slot', NULL, NULL);
[/postgres]

This outputs the same as the above, but once we’ve called it the changes are classed as consumed regardless of whether the caller actually applied them, and will not be output again (nor the WAL retained). One thing that would be useful here would be the ability to pull the changes, apply them, then confirm them as applied before they’re marked as consumed; I guess this could be achieved by first calling pg_logical_slot_peek_changes, applying the changes and then calling pg_logical_slot_get_changes passing the latest LSN seen from the peek.
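
A sketch of that peek, apply, then consume idea might look like the following. It’s untested, and it assumes that a transaction whose COMMIT sits at the given location counts as included, which is worth checking on a scratch slot first:

[postgres]
-- 1. Peek: nothing is marked as consumed yet
SELECT location, xid, data
FROM pg_logical_slot_peek_changes('test_replication_slot', NULL, NULL);

-- 2. Apply the changes elsewhere, remembering the location of the last COMMIT applied

-- 3. Consume only up to that point; 0/56436518 is the COMMIT location from the peek output above
SELECT count(*)
FROM pg_logical_slot_get_changes('test_replication_slot', '0/56436518', NULL);
[/postgres]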

In addition to the sql functions, the pg_recvlogical binary is provided to pull data over the streaming replication protocol with something like:

# pg_recvlogical -U postgres -d TEST --slot test_replication_slot --start -f -

For this, as with streaming replication we need to set max_wal_senders greater than zero.
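
Pulling the settings from this walkthrough together: they can go straight into postgresql.conf or, since 9.4 also adds ALTER SYSTEM, be set from psql. The values below are only examples, and all three need a restart to take effect:

[postgres]
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 4;  -- example value; anything above zero
ALTER SYSTEM SET max_wal_senders = 4;        -- example value; needed for pg_recvlogical
[/postgres]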

Once we’re finished with our test, we should drop the replication slot:

[postgres]
TEST=# SELECT pg_drop_replication_slot('test_replication_slot');
[/postgres]

Apparently the one thing Logical Decoding can’t do is output DDL, and I’m guessing this is due to other complexities that need to be overcome rather than by design. All exciting!

Spatial capabilities in PostgreSQL with PostGIS

When dealing with geographic data in PostgreSQL, at some point we’re going to want to lookup rows based purely on their location relative to one another.

With smaller volumes of data, we can get away quite easily with just latitude / longitude and some maths, but once we get past a certain point we’re going to want to be able to do index lookups based on distance from a specific location; this is where the PostGIS extension is priceless.

Installing PostGIS on Debian Wheezy

Installing under Debian when PostgreSQL has been installed via the package manager is dead simple. I’m using the 2.1 version packaged for a 9.3 server from the apt.postgresql.org repository, but you should select the package that matches your server version. It’s also possible to install version 1.5 directly from Wheezy main if you don’t want to add other repositories to apt.

$ sudo apt-get install postgresql-9.3-postgis-2.1

This will install the PostGIS contrib modules into the /usr/share/postgresql/9.3/contrib directory rather than the “extension” directory used by most other PostgreSQL packages in Debian.

Some of the machines I look after have PostgreSQL installed from source for various reasons (but mainly because historically packages from Debian weren’t very timely), and these machines require a slightly more lengthy setup process. To compile PostGIS from source we need to install some dependencies, one of which is the development files for the geospatial data abstraction library, and these are in the “libgdal-dev” package; however this depends on the “libpq-dev” package, which will most likely interfere with our source install. There are two options here: either install the package without its dependencies (make a note you’ve done it to avoid future dependency problems), or roll gdal from source.

First we install the required dependencies. These are listed in the PostGIS docs, and for my test machine, which already has everything required to compile PostgreSQL, they are:

$ sudo apt-get install libgeos-dev libproj-dev libjson-c-dev libjson0-dev

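Then install the gdal dev libraries in whichever manner suits, either by forcing in the package without its dependencies (the first two commands below) or by building gdal from source (the remaining commands):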

$ sudo apt-get download libgdal-dev
$ sudo dpkg --force-all -i libgdal-dev_1.9.0-3.1_amd64.deb

$ cd /tmp
$ wget http://download.osgeo.org/gdal/1.10.1/gdal-1.10.1.tar.gz
$ tar xvfz gdal-1.10.1.tar.gz && cd gdal-1.10.1
$ ./configure
$ make
$ sudo make install

Once this is done we can compile and install PostGIS:

$ cd /tmp
$ wget http://download.osgeo.org/postgis/source/postgis-2.1.1.tar.gz
$ tar xvfz postgis-2.1.1.tar.gz && cd postgis-2.1.1
$ ./configure
$ make
$ sudo make install

Getting started with PostGIS

Now we’ve got PostGIS installed we just need to create the extension in the database:

$ psql -U glyn -d test -c 'CREATE EXTENSION postgis; CREATE EXTENSION postgis_topology;'

Or, on 9.0 and lower, run the SQL scripts in the contrib/postgis-2.1 directory:

$ psql -U glyn -d test -f postgis.sql
$ psql -U glyn -d test -f postgis_comments.sql
$ psql -U glyn -d test -f spatial_ref_sys.sql
$ psql -U glyn -d test -f rtpostgis.sql
$ psql -U glyn -d test -f raster_comments.sql
$ psql -U glyn -d test -f topology.sql
$ psql -U glyn -d test -f topology_comments.sql

So let’s generate some data for testing; we create a table called “friends” with 90k rows which stores their locations as latitude and longitude values. Admittedly the distribution in this table won’t be that realistic, but it should suffice for testing:

[postgres]
INSERT INTO friends
SELECT b.unnest || ' v.' || generate_series,
CASE WHEN random() > 0.5 THEN 'Somewhere Else' ELSE 'Somewhere' END,
1.0838637+random()*(CASE WHEN random() > 0.5 THEN -1 ELSE 1 END),
52.7389201+random()*(CASE WHEN random() > 0.5 THEN -1 ELSE 1 END)
FROM generate_series(1,10000)
CROSS JOIN (SELECT unnest(ARRAY['White Wonder','Colonel K','El Loco','Count Duckula','Leatherhead','Barron Greenback','Ernest Penfold','Professor Heinrich Von Squawkencluck','Flying Officer Buggles Pigeon'])) b;
[/postgres]
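
The INSERT above needs the “friends” table to exist first, and its definition isn’t shown. Something like the following would fit; the latitude and longitude names come from the later queries and their order from the INSERT, but the names of the first two columns are pure guesswork:

[postgres]
CREATE TABLE friends (
    name      text,              -- hypothetical column name
    location  text,              -- hypothetical column name
    longitude double precision,  -- third value in the INSERT (around 1.08)
    latitude  double precision   -- fourth value in the INSERT (around 52.74)
);
[/postgres]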

In the scenario where we don’t have PostGIS we can create an SQL function to calculate the earth distance between two points, but the distance depends on a reference point we only know at query time, so it can’t be indexed:

[postgres]
CREATE OR REPLACE FUNCTION earth_distance_miles(lat1 float, lat2 float, long1 float, long2 float)
RETURNS double precision
AS
'SELECT 3963.0*acos(sin($1/57.2958)*sin($2/57.2958)+cos($1/57.2958)*cos($2/57.2958)*cos(($4/57.2958)-($3/57.2958)));'
LANGUAGE SQL IMMUTABLE;
[/postgres]
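
As a quick sanity check of the function (the second point is just an example picked for this test), two points one degree of latitude apart on the same meridian should come out at roughly 69 miles. Note the slightly unusual argument order of both latitudes first, then both longitudes:

[postgres]
-- Arguments are lat1, lat2, long1, long2
SELECT earth_distance_miles(52.7389201, 53.7389201, 1.0838637, 1.0838637) AS dist_miles;
[/postgres]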

This makes listing out “friends” within a mile pretty easy:

[postgres]
EXPLAIN ANALYZE SELECT f.*,
earth_distance_miles(52.7389201, f.latitude, 1.0838637, f.longitude) AS dist_miles
FROM friends f WHERE earth_distance_miles(52.7389201, f.latitude, 1.0838637, f.longitude) <= 1
ORDER BY earth_distance_miles(52.7389201, f.latitude, 1.0838637, f.longitude);
[/postgres]

                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=10988.40..11063.40 rows=30000 width=108) (actual time=160.006..160.013 rows=29 loops=1)
   Sort Key: ((3963::double precision * acos(((0.795884736186082::double precision * sin((latitude / 57.2958::double precision))) 
   	+ ((0.605448170123596::double precision * cos((latitude / 57.2958::double precision))) * cos(((longitude / 57.2958::double precision) 
   	- 0.0189169834438126::double precision)))))))
   Sort Method:  quicksort  Memory: 30kB
   ->  Seq Scan on friends f  (cost=0.00..7510.00 rows=30000 width=108) (actual time=19.993..159.930 rows=29 loops=1)
         Filter: ((3963::double precision * acos(((0.795884736186082::double precision * sin((latitude / 57.2958::double precision))) 
         	+ ((0.605448170123596::double precision * cos((latitude / 57.2958::double precision))) 
         	* cos(((longitude / 57.2958::double precision) - 0.0189169834438126::double precision)))))) <= 1::double precision)
 Total runtime: 160.069 ms

Now let’s put PostGIS to work and add in an indexed geography column:

[postgres]
ALTER TABLE friends ADD COLUMN geog geography(Point,4326); -- SRID 4326 for WGS84
UPDATE friends SET geog = ST_MakePoint(longitude, latitude);
CREATE INDEX ON friends USING GIST (geog);
[/postgres]

Now let’s try to list out our “friends” within a mile again, this time making use of the PostGIS ST_Distance and ST_DWithin functions:

[postgres]
EXPLAIN ANALYZE SELECT f.*,
ST_Distance(f.geog, ST_MakePoint(1.0838637, 52.7389201))/1609 AS dist_miles
FROM friends f WHERE ST_DWithin(f.geog, ST_MakePoint(1.0838637, 52.7389201), 1609)
ORDER BY ST_Distance(f.geog, ST_MakePoint(1.0838637, 52.7389201));
[/postgres]

                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=37.90..37.90 rows=1 width=108) (actual time=1.756..1.760 rows=29 loops=1)
   Sort Key: (_st_distance(geog, '0101000020E610000017258D768157F13FB4ED0FEF945E4A40'::geography, 0::double precision, true))
   Sort Method:  quicksort  Memory: 32kB
   ->  Bitmap Heap Scan on friends f  (cost=2.39..37.89 rows=1 width=108) (actual time=0.908..1.684 rows=29 loops=1)
         Recheck Cond: (geog && '0101000020E610000017258D768157F13FB4ED0FEF945E4A40'::geography)
         Filter: (('0101000020E610000017258D768157F13FB4ED0FEF945E4A40'::geography && _st_expand(geog, 1609::double precision)) 
         	AND _st_dwithin(geog, '0101000020E610000017258D768157F13FB4ED0FEF945E4A40'::geography, 1609::double precision, true))
         ->  Bitmap Index Scan on friends_geog_idx  (cost=0.00..2.39 rows=16 width=0) (actual time=0.351..0.351 rows=45 loops=1)
               Index Cond: (geog && '0101000020E610000017258D768157F13FB4ED0FEF945E4A40'::geography)
 Total runtime: 1.821 ms

This shows a marked improvement, from a query time of 160.069 ms down to 1.821 ms. Obviously our mileage will vary depending on the quantity of data in the table, its distribution and just how many rows we want to retrieve.

database dumping ground

… getting the wrong answer fast is not an improvement over the right answer slow