Optimizing queries on a range of timestamps (two columns)

Question

I use PostgreSQL 9.1 on Ubuntu 12.04.

I need to select records inside a range of time: my table time_limits has two timestamp fields and one integer property. There are additional columns in my actual table that are not involved with this query.

create table time_limits (
   start_date_time timestamp,
   end_date_time timestamp,
   id_phi integer,
   primary key(start_date_time, end_date_time, id_phi)
);

This table contains roughly 2M records.

Queries like the following took enormous amounts of time:

select * from time_limits as t 
where t.id_phi=0 
and t.start_date_time <= timestamp'2010-08-08 00:00:00'
and t.end_date_time   >= timestamp'2010-08-08 00:05:00';

So I tried adding another index - the inverse of the PK:

create index idx_inversed on time_limits(id_phi, start_date_time, end_date_time);

I got the impression that performance improved: The time for accessing records in the middle of the table seems to be more reasonable: somewhere between 40 and 90 seconds.

But that is still several tens of seconds for values in the middle of the time range, and about twice as long when targeting the end of the table (chronologically speaking).

I tried explain analyze for the first time to get this query plan:

 Bitmap Heap Scan on time_limits  (cost=4730.38..22465.32 rows=62682 width=36) (actual time=44.446..44.446 rows=0 loops=1)
   Recheck Cond: ((id_phi = 0) AND (start_date_time <= '2011-08-08 00:00:00'::timestamp without time zone) AND (end_date_time >= '2011-08-08 00:05:00'::timestamp without time zone))
   ->  Bitmap Index Scan on idx_time_limits_phi_start_end  (cost=0.00..4714.71 rows=62682 width=0) (actual time=44.437..44.437 rows=0 loops=1)
         Index Cond: ((id_phi = 0) AND (start_date_time <= '2011-08-08 00:00:00'::timestamp without time zone) AND (end_date_time >= '2011-08-08 00:05:00'::timestamp without time zone))
 Total runtime: 44.507 ms

See the results on depesz.com.

What could I do to optimize the search? You can see that all the time is spent scanning the two timestamp columns once id_phi is set to 0. And I don't understand the big scan (60K rows!) on the timestamps. Aren't they indexed by the primary key and by the idx_inversed index I added?

Should I change from timestamp types to something else?

I have read a little about GiST and GIN indexes. I gather they can be more efficient under certain conditions for custom types. Are they a viable option for my use case?

Erwin Brandstetter · Accepted Answer · 2020-11-12 01:58:16Z

For Postgres 9.1 or later:

CREATE INDEX idx_time_limits_ts_inverse
ON time_limits (id_phi, start_date_time, end_date_time DESC);

In most cases the sort order of an index is hardly relevant. Postgres can scan backwards practically as fast. But for range queries on multiple columns it can make a huge difference. Closely related:

PostgreSQL index not used for query on range

Consider your query:

SELECT *
FROM   time_limits
WHERE  id_phi = 0
AND    start_date_time <= '2010-08-08 00:00'
AND    end_date_time   >= '2010-08-08 00:05';

Sort order of the first column id_phi in the index is irrelevant. Since it's checked for equality (=), it should come first. You got that right. More in this related answer:

Multicolumn index and performance

Postgres can jump to id_phi = 0 in next to no time and consider the following two columns of the matching index. These are queried with range conditions of inverted sort order (<=, >=). In my index, qualifying rows come first. Should be the fastest possible way with a B-Tree index¹:

You want start_date_time <= something: index has the earliest timestamp first.
If it qualifies, also check column 3.
Recurse until the first row fails to qualify (super fast).
You want end_date_time >= something: index has the latest timestamp first.
If it qualifies, keep fetching rows until the first one doesn't (super fast).
Continue with next value for column 2 ..

Postgres can either scan forward or backward. The way you had the index, it has to read all rows matching on the first two columns and then filter on the third. Be sure to read the chapter Indexes and ORDER BY in the manual. It fits your question pretty well.

How many rows match on the first two columns?
Only few with a start_date_time close to the start of the time range of the table. But almost all rows with id_phi = 0 at the chronological end of the table! So performance deteriorates with later start times.
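To check this on your own data, you could compare how many rows match the first two index columns with how many satisfy the full predicate. This is only a sketch reusing the table and literals from the question; the column aliases are made up for illustration, and the counts depend entirely on your data.

```sql
-- Rows matching the leading two index columns only. Following the
-- reasoning above, this is what the original index has to read.
SELECT count(*) AS match_first_two_columns
FROM   time_limits
WHERE  id_phi = 0
AND    start_date_time <= '2010-08-08 00:00';

-- Rows that still qualify once the third column is checked as well.
SELECT count(*) AS match_all_three_columns
FROM   time_limits
WHERE  id_phi = 0
AND    start_date_time <= '2010-08-08 00:00'
AND    end_date_time   >= '2010-08-08 00:05';
```

A large gap between the two counts suggests that most of the rows read are thrown away by the filter on end_date_time.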

Planner estimates

The planner estimates rows=62682 for your example query. Of those, none qualify (rows=0). You might get better estimates if you increase the statistics target for the timestamp columns. For 2,000,000 rows ...

ALTER TABLE time_limits ALTER start_date_time SET STATISTICS 1000;
ALTER TABLE time_limits ALTER end_date_time   SET STATISTICS 1000;

... might pay. Or even higher. More in this related answer:

Check statistics targets in PostgreSQL

I guess you don't need that for id_phi (only few distinct values, evenly distributed), but for the timestamps (lots of distinct values, unevenly distributed).
I also don't think it matters much with the improved index.
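Keep in mind that a changed statistics target only takes effect the next time statistics are collected. A minimal sketch of applying and inspecting the setting, assuming the table and column names from the question:

```sql
-- Re-collect statistics so the new per-column targets are actually used.
ANALYZE time_limits;

-- Show the per-column statistics targets. The default is reported as -1
-- (NULL in recent Postgres releases), meaning default_statistics_target applies.
SELECT attname, attstattarget
FROM   pg_attribute
WHERE  attrelid = 'time_limits'::regclass
AND    attname IN ('id_phi', 'start_date_time', 'end_date_time');
```

Afterwards, re-run EXPLAIN ANALYZE on the query and compare the estimated rows with the actual rows.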

`CLUSTER` / pg_repack / pg_squeeze

If you want it faster still, you could streamline the physical order of rows in your table. If you can afford to lock your table exclusively (at off hours for instance), rewrite your table and order rows according to the index with CLUSTER:

CLUSTER time_limits USING idx_time_limits_ts_inverse;

Or consider pg_repack or the later pg_squeeze, which can do the same without exclusive lock on the table.

Either way, the effect is that fewer blocks need to be read from the table and everything is pre-sorted. It's a one-time effect deteriorating over time with writes on the table fragmenting the physical sort order.
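To get a rough idea of whether the physical order has degraded and a new CLUSTER run might pay off, you could look at the correlation statistic the planner keeps per column. This is a sketch, not a precise measure: it assumes the table lives in the public schema, and for a multicolumn index it mainly reflects the leading column.

```sql
-- Refresh statistics first; it is advisable to run ANALYZE after CLUSTER.
ANALYZE time_limits;

-- correlation close to 1 or -1: physical row order closely follows the
-- column's sort order. Values near 0: the order is scattered.
SELECT attname, correlation
FROM   pg_stats
WHERE  schemaname = 'public'   -- assumption: adjust to your schema
AND    tablename  = 'time_limits'
AND    attname IN ('id_phi', 'start_date_time', 'end_date_time');
```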

GiST index in Postgres 9.2+

¹ With pg 9.2+ there is another, possibly faster option: a GiST index for a range column.

There are built-in range types for timestamp and timestamp with time zone: tsrange, tstzrange. A btree index is typically faster for an additional integer column like id_phi. Smaller and cheaper to maintain, too. But the query will probably still be faster overall with the combined index.
Change your table definition or use an expression index.
For the multicolumn GiST index at hand you also need the additional module btree_gist installed (once per database) which provides the operator classes to include an integer.

The trifecta! A multicolumn functional GiST index:

CREATE EXTENSION IF NOT EXISTS btree_gist;  -- if not installed, yet

CREATE INDEX idx_time_limits_funky ON time_limits USING gist
(id_phi, tsrange(start_date_time, end_date_time, '[]'));

Use the "contains range" operator @> in your query now:

SELECT *
FROM   time_limits
WHERE  id_phi = 0
AND    tsrange(start_date_time, end_date_time, '[]')
    @> tsrange('2010-08-08 00:00', '2010-08-08 00:05', '[]');
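The other route mentioned above, changing the table definition, might look like the following sketch for Postgres 9.2+. The column name period and the index name are hypothetical, not part of the original table. If you keep the two timestamp columns as well (the primary key uses them), you have to keep both representations in sync yourself.

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;  -- needed for the integer column

-- "period" is a hypothetical column name chosen for this example.
ALTER TABLE time_limits ADD COLUMN period tsrange;

-- Populate it from the existing columns. This raises an error for any
-- row where start_date_time is later than end_date_time.
UPDATE time_limits
SET    period = tsrange(start_date_time, end_date_time, '[]');

-- Plain multicolumn GiST index on the stored range column.
CREATE INDEX idx_time_limits_period ON time_limits USING gist
(id_phi, period);

SELECT *
FROM   time_limits
WHERE  id_phi = 0
AND    period @> tsrange('2010-08-08 00:00', '2010-08-08 00:05', '[]');
```

Compared to the expression index, the query is shorter, at the cost of extra storage and a full-table update.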

SP-GiST index in Postgres 9.3+

An SP-GiST index might be even faster for this kind of query - except that, quoting the manual:

Currently, only the B-tree, GiST, GIN, and BRIN index types support multicolumn indexes.

Still true in Postgres 12.
You would have to combine an spgist index on just (tsrange(...)) with a second btree index on (id_phi). With the added overhead, I'm not sure this can compete.
Related answer with a benchmark for just a tsrange column:

Perform this hours of operation query in PostgreSQL
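For reference, the two-index combination described above could be set up roughly like this. The index names are made up for this sketch, and whether the planner combines both indexes depends on your data and settings, so check the plan.

```sql
-- SP-GiST index on the range expression alone (Postgres 9.3+).
CREATE INDEX idx_time_limits_range_spgist ON time_limits USING spgist
(tsrange(start_date_time, end_date_time, '[]'));

-- Separate B-tree index for the equality condition on id_phi.
CREATE INDEX idx_time_limits_phi ON time_limits (id_phi);

-- Same query as for the GiST variant; inspect which indexes get used.
EXPLAIN ANALYZE
SELECT *
FROM   time_limits
WHERE  id_phi = 0
AND    tsrange(start_date_time, end_date_time, '[]')
    @> tsrange('2010-08-08 00:00', '2010-08-08 00:05', '[]');
```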

Could you explain more why DESC matters? Won't the btree get to the first one that qualifies and move sequentially down the line with the same ease regardless of what order the data is in on the disk? In both cases, it knows that once it hits a row, everything else either in front or behind it also qualifies, right? Thanks! — John Bachir, Commented Dec 29, 2017 at 4:17
@John: Postgres can traverse an index forwards or backwards, but it can't change direction in the same scan. Ideally, you have all qualifying rows per node first (or last), but it has to be the same alignment (matching query predicates) for all columns to get best results. — Erwin Brandstetter, Commented Dec 29, 2017 at 22:31

nathan-m · Accepted Answer · 2013-04-17 22:45:10Z

Erwin's answer is already comprehensive, however:

Range types for timestamps are available in PostgreSQL 9.1 with the Temporal extension from Jeff Davis: https://github.com/jeff-davis/PostgreSQL-Temporal

Note: it has limited features (it uses timestamptz, and you can only have the '[)' style overlap afaik). Also, there are lots of other great reasons to upgrade to PostgreSQL 9.2.

Community · Accepted Answer · 2017-04-13 12:42:48Z

You could try to create the multicolumn index in a different order:

primary key(id_phi, start_date_time,end_date_time);

I once posted a similar question, also related to the ordering of columns in a multicolumn index. The key is to apply the most restrictive conditions first to reduce the search space.

Edit: My mistake. Now I see that you already have this index defined.

answered Apr 9, 2013 at 20:11 by jap1968 (edited Apr 13, 2017 at 12:42 by Community Bot)

I already have both indexes. The primary key is the other one, but the index you propose already exists and is the one being used, as the explain output shows: Bitmap Index Scan on idx_time_limits_phi_start_end
– Stephane Rolland, Commented Apr 9, 2013 at 20:29

borovsky · Accepted Answer · 2017-06-18 08:34:30Z

I managed to speed up a similar query considerably (from 1 sec to 70 ms).

I have a table with aggregations of many measurements at many levels (the l column: 30s, 1m, 1h, etc.). There are two range-bound columns: $s for start and $e for end.

I created two multicolumn indexes: one for start and one for end.

I adjusted the select query to select ranges whose start bound is in the given range and, additionally, ranges whose end bound is in the given range.

Explain shows two streams of rows using our indexes efficiently.

Indexes:

drop index if exists agg_search_a;
CREATE INDEX agg_search_a
ON agg (measurement_id, l, "$s");

drop index if exists agg_search_b;
CREATE INDEX agg_search_b
ON agg (measurement_id, l, "$e");

Select query:

select "$s", "$e", a, t, b, c from agg
where 
    measurement_id=0 
    and l =  '30s'
    and (
        (
            "$s" > '2013-05-01 02:05:05'
            and "$s" < '2013-05-01 02:18:15'
        )
        or 
        (
             "$e" > '2013-05-01 02:00:05'
            and "$e" < '2013-05-01 02:18:05'
        )
    );

Explain:

[
  {
    "Execution Time": 0.058,
    "Planning Time": 0.112,
    "Plan": {
      "Startup Cost": 10.18,
      "Rows Removed by Index Recheck": 0,
      "Actual Rows": 37,
      "Plans": [
    {
      "Startup Cost": 10.18,
      "Actual Rows": 0,
      "Plans": [
        {
          "Startup Cost": 0,
          "Plan Width": 0,
          "Actual Rows": 26,
          "Node Type": "Bitmap Index Scan",
          "Index Cond": "((measurement_id = 0) AND ((l)::text = '30s'::text) AND (\"$s\" > '2013-05-01 02:05:05'::timestamp without time zone) AND (\"$s\" < '2013-05-01 02:18:15'::timestamp without time zone))",
          "Plan Rows": 29,
          "Parallel Aware": false,
          "Actual Total Time": 0.016,
          "Parent Relationship": "Member",
          "Actual Startup Time": 0.016,
          "Total Cost": 5,
          "Actual Loops": 1,
          "Index Name": "agg_search_a"
        },
        {
          "Startup Cost": 0,
          "Plan Width": 0,
          "Actual Rows": 36,
          "Node Type": "Bitmap Index Scan",
          "Index Cond": "((measurement_id = 0) AND ((l)::text = '30s'::text) AND (\"$e\" > '2013-05-01 02:00:05'::timestamp without time zone) AND (\"$e\" < '2013-05-01 02:18:05'::timestamp without time zone))",
          "Plan Rows": 39,
          "Parallel Aware": false,
          "Actual Total Time": 0.011,
          "Parent Relationship": "Member",
          "Actual Startup Time": 0.011,
          "Total Cost": 5.15,
          "Actual Loops": 1,
          "Index Name": "agg_search_b"
        }
      ],
      "Node Type": "BitmapOr",
      "Plan Rows": 68,
      "Parallel Aware": false,
      "Actual Total Time": 0.027,
      "Parent Relationship": "Outer",
      "Actual Startup Time": 0.027,
      "Plan Width": 0,
      "Actual Loops": 1,
      "Total Cost": 10.18
    }
      ],
      "Exact Heap Blocks": 1,
      "Node Type": "Bitmap Heap Scan",
      "Plan Rows": 68,
      "Relation Name": "agg",
      "Alias": "agg",
      "Parallel Aware": false,
      "Actual Total Time": 0.037,
      "Recheck Cond": "(((measurement_id = 0) AND ((l)::text = '30s'::text) AND (\"$s\" > '2013-05-01 02:05:05'::timestamp without time zone) AND (\"$s\" < '2013-05-01 02:18:15'::timestamp without time zone)) OR ((measurement_id = 0) AND ((l)::text = '30s'::text) AND (\"$e\" > '2013-05-01 02:00:05'::timestamp without time zone) AND (\"$e\" < '2013-05-01 02:18:05'::timestamp without time zone)))",
      "Lossy Heap Blocks": 0,
      "Actual Startup Time": 0.033,
      "Plan Width": 44,
      "Actual Loops": 1,
      "Total Cost": 280.95
    },
    "Triggers": []
  }
]

The trick is that the plan nodes now contain only the wanted rows. Previously we got thousands of rows in a plan node because it selected all points from some point in time to the very end, and then the next node removed the unnecessary rows.
