Time-Series Data - Part 1.


Intro

As Interim CTO I deal with management topics. But I also deal a lot with IT architecture and high-performance systems. In one of my latest projects I had to deal with lots of IoT time-series data.

Time-series data is a bit special because it has some distinct properties. First of all, it is ordered by - well - time. Secondly, old data usually does not get updated - only new data is added.

Such systems are also high-volume: many rows are added per second. And from a user perspective you have to be able to analyze the data. That’s often done by aggregating data on a daily basis - how many events came in today compared to yesterday? - and then drilling down to select certain events and compare them.

The teams had not yet chosen a technology for a new product, so I did some brief tests with a couple of database systems I had experience with. I - of course - started with plain old PostgreSQL (version 15).

PostgreSQL and Time-Series Data

Note: I’ve done all measurements on my MacBook M1 Pro (32 GB RAM). I tried to make the queries as “cold” as possible, so that no data is cached in memory.

My requirements were to store event-style data. Data was always tied to one specific user. Speaking in SQL, this’d be something like:

CREATE TABLE events (
  user_id UUID NOT NULL,
  timestamp TIMESTAMPTZ NOT NULL,
  payload TEXT NOT NULL,
  source TEXT NOT NULL,
  target TEXT NOT NULL
);

I first created some dummy data using the following script:

DO $$
DECLARE
  user_id UUID;
BEGIN
  -- 10 users with 10 million events each, spread over the last 5 years
  FOR i IN 1..10 LOOP
    user_id := gen_random_uuid();
    INSERT INTO events (
      user_id,
      timestamp,
      payload,
      source,
      target
    )
    SELECT
      user_id,
      NOW() - interval '1 second' * trunc(random() * 60 * 60 * 24 * 365 * 5),
      'dummypayload-' || md5(random()::text),
      'dummysource-' || md5(random()::text),
      'dummytarget-' || md5(random()::text)
    FROM generate_series(1, 10000000);
  END LOOP;
END $$;

This creates 10 million events per user over 5 years, for 10 users (100 million rows in total).
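If you want to double-check what the script produced, a simple sanity-check query could look like the one below (note that this full-table aggregate is itself slow at this volume):

-- rows per generated user (should be 10 million each)
SELECT user_id, COUNT(*) AS events_per_user
FROM events
GROUP BY user_id
ORDER BY user_id;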

My task was to provide a dashboard of the events. For instance: how many events came in on a daily basis? Let’s turn on timing via \timing and measure the query:

SELECT
    date_trunc('day', timestamp) AS day,
    COUNT(*) AS count
FROM
    events
WHERE user_id = 'a valid user'  -- placeholder: use one of the generated UUIDs
  AND timestamp >= NOW() - INTERVAL '5 years'
GROUP BY
    day
ORDER BY
    day;

This took around 15 seconds. Not interactive. But of course we don’t have any indexes right now. So let’s create one for our query:

CREATE INDEX ON events (user_id, timestamp);

SELECT
    date_trunc('day', timestamp) AS day,
    COUNT(*) AS count
FROM
    events
WHERE user_id = 'a valid user'
  AND timestamp >= NOW() - INTERVAL '5 years'
GROUP BY
    day
ORDER BY
    day;

This now takes around 2.5 seconds - a very nice performance improvement. The query now runs as an index-only scan and does not touch the heap. Great!
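If you want to verify this yourself, EXPLAIN (ANALYZE, BUFFERS) should report an Index Only Scan with (ideally) zero heap fetches. Running VACUUM first helps, because index-only scans rely on an up-to-date visibility map. A quick sketch:

VACUUM ANALYZE events;  -- updates the visibility map so the heap can be skipped

EXPLAIN (ANALYZE, BUFFERS)
SELECT
    date_trunc('day', timestamp) AS day,
    COUNT(*) AS count
FROM
    events
WHERE user_id = 'a valid user'
  AND timestamp >= NOW() - INTERVAL '5 years'
GROUP BY
    day
ORDER BY
    day;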

But was it really fast and interactive? Hmm. I guess it’s still too slow.

Options from Here

Challenge Product Design

The first question to ask would be: “Do we really need 5 years of data in one go?” That’s the secret weapon of good CTOs and technical leaders. They don’t simply accept product decisions, but challenge them and try to unite product vision and technical feasibility.

We could, for instance, fetch the data in batches and render a (hypothetical) table from latest to oldest events. The page would be interactive immediately and then fill in older data quickly - see the sketch below.
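As a rough sketch of how such batched fetching could work with the index from above, keyset pagination stays fast no matter how far back the user scrolls (the LIMIT and the cut-off timestamp are just made-up example values):

-- first batch: the newest events for one user
SELECT timestamp, payload, source, target
FROM events
WHERE user_id = 'a valid user'
ORDER BY timestamp DESC
LIMIT 500;

-- next batch: continue below the oldest timestamp of the previous batch
SELECT timestamp, payload, source, target
FROM events
WHERE user_id = 'a valid user'
  AND timestamp < '2024-06-01 12:34:56+00'  -- hypothetical value taken from the previous batch
ORDER BY timestamp DESC
LIMIT 500;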

More CPU, Memory, SSD - Everything!

My MacBook is fast, but not as fast as a beefy server with lots of RAM, CPU and disk. We could therefore easily scale vertically. The machine I configured at Hetzner had 1 TB of RAM, 7 TB of disk and a 32-core AMD CPU for 600 Euros per month. That’s pretty massive and would make the queries much more interactive.

My gut feeling is also that these 600 Euros will take you a long way. And it’s much more predictable (and cheaper) than any cloud offering (a topic for another post).

Materialized Views and Pre-Aggregation

Materialized views are a nice way to run these aggregation queries once and save the results. Reading the stored results is much, much faster than executing the query itself. Unfortunately, that also means refreshing the data yourself periodically (via pg_cron or similar), plus added complexity when it comes to combining pre-aggregated data with new data.
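As a sketch, assuming the dashboard only needs daily counts per user, such a pre-aggregation could look like this (the view and column names are made up):

CREATE MATERIALIZED VIEW daily_event_counts AS
SELECT
    user_id,
    date_trunc('day', timestamp) AS day,
    COUNT(*) AS count
FROM events
GROUP BY user_id, day;

-- a unique index is required for REFRESH ... CONCURRENTLY
CREATE UNIQUE INDEX ON daily_event_counts (user_id, day);

-- refresh periodically, e.g. from a pg_cron job
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_event_counts;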

In my case it also sounded too complicated, as I had to adjust my queries to the users’ timezones. That would make pre-aggregation very complicated (one materialized view per timezone). In addition, my real-world queries were much more complex, with more WHERE clauses.
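To illustrate the timezone issue: since PostgreSQL 14, date_trunc can truncate in a given time zone, so the dashboard query really does look different per user (‘Europe/Berlin’ is just an example here):

SELECT
    date_trunc('day', timestamp, 'Europe/Berlin') AS day,
    COUNT(*) AS count
FROM events
WHERE user_id = 'a valid user'
  AND timestamp >= NOW() - INTERVAL '5 years'
GROUP BY day
ORDER BY day;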

In short: materialized views were not a real option in my case.

Specialized Extensions like TimescaleDB

I’ve read about TimescaleDB multiple times. It’s an extension for PostgreSQL that adds some magic to handle time-series data in a better way.

Sounds cool. I can re-use all my knowledge about PostgreSQL and supercharge it with TimescaleDB. And that’s something for the next blog post!

Summary

For many time-series use cases, PostgreSQL might be fast enough. The most important recommendation is to tailor your queries to utilize index-only scans. In the example above this led to a 6x speedup.

Important: do the performance measurements yourself, make sure not to hit any caches during performance testing, and use EXPLAIN a lot while optimizing.

In the end, it is also your most important job as a technical leader to challenge product. The goal is to create clever technical solutions that are technically feasible (aka cheap enough) and fulfill the product vision.
