Time-Series Data - Part 2.

Just a decorative image for the page.

Intro

In the first part we investigated whether PostgreSQL could be used as storage for time series data. Our results were not bad, but the performance was too slow for our use-case.

One option to increase the performance is the usage of specialized time-series databases. One of these databases is TimescaleDB. The cool thing is that it is an extension inside PostgreSQL. You can continue using PostgreSQL and all your knowledge plus - you get a lot of cool stuff for time-series data at the same time.

Sounds too good to be true. Let’s check it out.

The Test Scenario for TimescaleDB

We create the same scenario as in the first part of the post. In short: 5 users, 10 million events over 5 years and then do an aggregation query on top of that.

Let’s do it…

CREATE DATABASE timescale_test;

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS "timescaledb"; --  Running version 2.11.2

The last line does the magic and loads timescale as extension into your database. You now have all goodies of timescale available. Timescale has pretty good documentation on how to do this. The easiest way to get started is their Docker image.

Another way is to use brew to install it and load it into Postgres.app - that’s what I did as humble Mac user. I also modified the brew tap so that it is compatible with PostgreSQL 15.

But let’s continue with the test setup. Of course we need a table…

CREATE TABLE events (
  user_id UUID NOT NULL,
  timestamp TIMESTAMPTZ NOT NULL,
  payload TEXT NOT NULL,
  source TEXT NOT NULL,
  target TEXT NOT NULL
);

And now comes the magic - we can turn table into a hypertable. The hypertable is basically a partitioned table with some default index on the time created and maintained by TimescaleDB:

SELECT create_hypertable('events', 'timestamp');

Magic. Now let’s create the dummy data…

DO $$
DECLARE
  user_id UUID;
BEGIN
  FOR i IN 1..10 LOOP
    user_id = gen_random_uuid();
    INSERT INTO events (
      user_id,
      timestamp,
      payload,
      source,
      target
    )
    SELECT
      user_id,
      NOW() - interval '1 second' * trunc(random() * 60 * 60 * 24 * 365 * 5),
      'dummypayload-.' || md5(random()::text),
      'dummysource-' || md5(random()::text),
      'dummytarget-' || md5(random()::text)
    FROM generate_series(1, 10000000);
  END LOOP;
END $$;

And turn on \timing and run our first query

EXPLAIN 
SELECT 
    time_bucket('1 day', timestamp) AS day,
    COUNT(*) AS count
FROM 
    events
WHERE user_id = '6a7f240e-44fa-455e-b97a-a9d7dcc74b29'
    AND timestamp >= NOW() - INTERVAL '5 years'
GROUP BY 
    day
ORDER BY 
    day; 

This took the database 8 seconds That’s already 50% faster than the non-optimized PostgreSQL version. Cool.

But of course we can also add a specialized index here:

CREATE INDEX on events(user_id, timestamp);

Now we are down to 700 milliseconds. That’s almost four times faster than the plain PostgreSQL version. This is also an index only scan, but it the query plan looks different. A hypertable is a nice way to say “a table that is automatically partitioned by time”. Therefore the query can parallelize apparently a bit better than without any partition. Interesting.

That’s cool and this can be achieved without too much magic. Just add an extension. Create a hypertable and you get the performance boost. Nice!

More Cool Stuff

I am not a TimescaleDB expert, but there is more cool stuff. Timescale has continuous aggregates. These are materialized views that get updated automatically. And querying them also includes the latest data from the source table if not yet in the materialized view. Pretty cool.

You also get a lot of functions that help with time-series data. time_bucket_gapfill can automatically fill “missing” values by interpolating or setting it to 0. Very useful.

And that’s just a glimpse into what TimescaleDB can do.

Summary

Our shootout now looks like this:

Aggregation with 10 Million events tablePostgreSQL 15TimescaleDB 2.11.2
Without explicit index (cold)15s8s
With Index on user_id and timestamp (cold)2.5s0.7s

TimescaleDB seems to be faster for our test scenario. If you don’t have so much data you can still use PostgreSQL. Once your data grows you can switch to TimescaleDB at any time and get the benefits.

More