Catalog Performance Testing Results

On May 22, several Openverse maintainers met to investigate performance improvements to the Catalog database. We created three test instances of the Catalog DB from restored snapshots of production data:

  • baseline: A restored production snapshot with no changes applied
  • provider-index: A snapshot with a b-tree index added on the provider column for both the audio and image tables
  • provider-partition: A snapshot with the image and audio tables partitioned on the provider column

We then ran a series of SQL queries against all three environments in order to compare their relative performance. The queries selected were representative of operations that are routinely performed on the catalog. The exact queries used to modify the provider-index and provider-partition environments, as well as the queries used in testing, are documented below.

Setting up the test environments

provider-index

To create the provider-index test DB, we restored a production snapshot and ran the following queries to add a b-tree index on provider for both media tables:

CREATE INDEX audio_provider ON public.audio USING btree (provider);
CREATE INDEX image_provider ON public.image USING btree (provider);
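
As a quick sanity check (not part of the original test plan), one can confirm that the planner actually uses the new index for the provider-filtered queries used throughout this post:

-- With image_provider in place, this plan should use the index (for example an
-- index-only or bitmap index scan) rather than a sequential scan over image.
EXPLAIN SELECT COUNT(*) FROM image WHERE provider = 'met';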

provider-partition

To create the provider-partition test DB, we restored a production snapshot and ran the following queries to create new audio_part and image_part tables, which are partitioned on the provider column. Only the queries for the audio table are shown here as an example:

CREATE TABLE audio_part (
    identifier uuid DEFAULT public.uuid_generate_v4(),
    created_on timestamp with time zone NOT NULL,
    updated_on timestamp with time zone NOT NULL,
    ingestion_type character varying(80),
    provider character varying(80),
    source character varying(80),
    foreign_identifier character varying(3000),
    foreign_landing_url character varying(1000),
    url character varying(3000) NOT NULL,
    thumbnail character varying(3000),
    filetype character varying(5),
    duration integer,
    bit_rate integer,
    sample_rate integer,
    category character varying(80),
    genres character varying(80)[],
    audio_set jsonb,
    set_position integer,
    alt_files jsonb,
    filesize integer,
    license character varying(50) NOT NULL,
    license_version character varying(25),
    creator character varying(2000),
    creator_url character varying(2000),
    title character varying(5000),
    meta_data jsonb,
    tags jsonb,
    watermarked boolean,
    last_synced_with_source timestamp with time zone,
    removed_from_source boolean NOT NULL,
    standardized_popularity double precision,
    PRIMARY KEY (identifier, provider)
) PARTITION BY LIST (provider);

-- Add indices
CREATE UNIQUE INDEX audio_part_provider_fid_idx
    ON audio_part
        USING btree (provider, md5(foreign_identifier));
CREATE UNIQUE INDEX audio_part_url_key
    ON audio_part
    USING btree (url, provider);

-- Create partitions
CREATE TABLE jamendo PARTITION OF audio_part FOR VALUES IN ('jamendo');
CREATE TABLE freesound PARTITION OF audio_part FOR VALUES IN ('freesound');
CREATE TABLE wikimedia_audio PARTITION OF audio_part FOR VALUES IN ('wikimedia_audio');

-- Insert all records from the relevant media table
INSERT INTO audio_part SELECT * FROM audio;

CREATE INDEX audio_part_partition ON audio_part (provider);
SET constraint_exclusion=on;

The partitioned table matches the schema of the existing media tables as closely as possible. There are a few necessary changes worth noting:

  • In order to partition by provider, provider was added to the primary key (previously just identifier)
  • The unique indices all need to include the partition key, which meant that we had to add a unique constraint on (url, provider) rather than just (url) as we have in the current Catalog media tables and in the API. This means that with this setup, it is theoretically possible to have a record which passes the constraints of the Catalog but not the API.

Tests in the provider-partition environment were run against the image_part and audio_part tables.
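
A similar sanity check applies to the partitioned tables. With list partitioning on provider, the planner should prune every partition except the one matching the filter, which can be verified with a query like the following:

-- Only the wikimedia_audio partition should appear in this plan; the other
-- partitions are pruned because the filter matches a single list value.
EXPLAIN SELECT COUNT(*) FROM audio_part WHERE provider = 'wikimedia_audio';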

Test queries

Test queries were selected for the following common operations:

  • SELECT and UPDATE by provider
  • Large upserts (as performed during ingestion by provider DAGs)
  • Batch updates (for data migrations and popularity refreshes; see https://github.com/WordPress/openverse/pull/2331)

SELECT Met

Count all records for Metropolitan Museum (~370k rows)

SELECT COUNT(*) FROM image WHERE provider = 'met';

SELECT Wikimedia Audio

Count all records for Wikimedia Audio (~1 million rows)

SELECT COUNT(*) FROM audio WHERE provider = 'wikimedia_audio';

UPDATE thumbnails

Set thumbnail to null for all Wikimedia Audio records.

UPDATE audio
SET thumbnail = null
WHERE provider = 'wikimedia_audio';

UPDATE duration

Set duration to 0 for all Wikimedia Audio records.

UPDATE audio
SET duration = 0
WHERE provider = 'wikimedia_audio';

UPDATE standardized popularity

Calculate standardized popularity for all Wikimedia Audio records.

UPDATE audio
SET standardized_popularity = standardized_audio_popularity(audio.provider, audio.meta_data)
WHERE provider = 'wikimedia_audio';

UPSERT Wikimedia Audio

For upserts, we manually ran all of the queries generated during the load_data steps for several production DagRuns. This includes creating a loading table, loading the data from a production TSV, cleaning the data, and finally upserting it into the media table. Only the performance of the actual upsert step is compared in this study, but all of the queries can be found in the logs of the relevant DagRun. If you do not have access to production Airflow, you can run a DAG locally to see all queries.
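
For context, the upsert itself has roughly the following shape. This is a simplified sketch rather than the exact production query: the real statement is generated by the load_data task and covers every column of the media table, and the loading table name used here (audio_load_example) is hypothetical.

-- Simplified sketch of the upsert step against a hypothetical loading table.
INSERT INTO audio (
    created_on, updated_on, ingestion_type, provider, source,
    foreign_identifier, foreign_landing_url, url, license, license_version,
    removed_from_source, meta_data, tags
)
SELECT
    now(), now(), ingestion_type, provider, source,
    foreign_identifier, foreign_landing_url, url, license, license_version,
    false, meta_data, tags
FROM audio_load_example
ON CONFLICT (provider, md5(foreign_identifier))
DO UPDATE SET
    updated_on = now(),
    url = EXCLUDED.url,
    meta_data = EXCLUDED.meta_data,
    tags = EXCLUDED.tags;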

For Wikimedia Audio we used queries from the 2023-05-04 DagRun, loading 804 records.

UPSERT Freesound

Queries copied from the 2023-03-10 DagRun, loading 152,500 records.

UPSERT Europeana

Queries copied from the 2023-05-22 DagRun, loading 23,724 records.

CREATE Wikimedia Audio temp table

Create a table for all Wikimedia Audio records, as would be used in the first step of a batched update. (~1 million rows)

CREATE TABLE rows_to_update_wiki_audio AS
    SELECT ROW_NUMBER() OVER() row_id, identifier
    FROM audio
    WHERE provider = 'wikimedia_audio' AND updated_on < NOW();

CREATE Met temp table

Create a table for all Metropolitan Museum records, as would be used in the first step of a batched update. (~370k rows)

CREATE TABLE rows_to_update_met AS
    SELECT ROW_NUMBER() OVER() row_id, identifier
    FROM image
    WHERE provider = 'met' AND updated_on < NOW();

CREATE Flickr temp table

Create a table for all Flickr records, as would be used in the first step of a batched update. (~500 million rows)

CREATE TABLE rows_to_update_flickr AS
    SELECT ROW_NUMBER() OVER() row_id, identifier
    FROM image
    WHERE provider = 'flickr' AND updated_on < NOW();

Flickr Large Batch Update

Update a batch of 10,000 Flickr records.

UPDATE image
    SET standardized_popularity = standardized_image_popularity(provider, meta_data)
    WHERE identifier in (
        SELECT identifier FROM rows_to_update_flickr
        WHERE row_id >= 300000000 AND row_id < 300010000
        FOR UPDATE SKIP LOCKED
    );

Flickr Small Batch Update

Update a batch of 1,000 Flickr records.

UPDATE image
    SET standardized_popularity = standardized_image_popularity(provider, meta_data)
    WHERE identifier in (
        SELECT identifier FROM rows_to_update_flickr
        WHERE row_id >= 1000000 AND row_id < 1001000
        FOR UPDATE SKIP LOCKED
    );

Results

Test                        | baseline          | provider-index | provider-partition
----------------------------|-------------------|----------------|-------------------
SELECT Met                  | 11 min 15 sec     | 5.598 sec      | 6.778 sec
SELECT Wikimedia Audio      | 2 min 32 sec      | 0.510 sec      | 0.206 sec
UPDATE thumbnails           | 4 min 54 sec      | 1 min 14 sec   | 1 min 7 sec
UPDATE duration             | 1 min 12 sec      | 1 min 9 sec    | 1 min 16 sec
UPDATE popularity           | 1 min 18 sec      | 1 min 21 sec   | 1 min 18 sec
UPSERT Wikimedia Audio      | 33.751 sec        | 1.532 sec      | 1 min 2 sec
UPSERT Freesound            | 52 sec            | 35 sec         | 28 sec
UPSERT Europeana            | 44 min 22 sec     | 30 min 13 sec  | 6 min 51 sec
CREATE Wikimedia temp table | 2 sec             | 1 sec          | 40 sec
CREATE Met temp table       | 6 min 40 sec      | 7 sec          | 24 sec
CREATE Flickr temp table    | 18 hr             | 40 min 13 sec  | 9 hr 35 min 20 sec
Flickr large batch UPDATE   | 1 hr 5 min 9 sec  | 50 min 10 sec  | 2 hr 42 min 30 sec
Flickr small batch UPDATE   | 5 min 1 sec       | 2 min 12 sec   | 31 min 7 sec

Inconsistencies Noted During Testing

When testing ingestion UPSERTs, we initially compared the performance of the query on provider-index and provider-partition to the runtime of the upsert_data task in the production DagRun from which the test was sourced. In each case, this production upsert took less than a minute to complete, while the same query took close to an hour in our test environments. Alarmingly, this seemed to indicate that the index and partition both massively degraded the performance of UPSERTs relative to baseline.

However, when we ran the queries on the baseline testing environment (which was theoretically identical to the production environment in which the DAGs ran), we found that the runtime was considerably longer; in fact, even longer than the runtime on provider-index and provider-partition. The timings given in the table above are for these tests run on baseline. I also tested running the queries manually on the actual production database, and found they ran in less than a minute, just as they had during the DagRun!

This leaves us with a major inconsistency: the performance of queries on baseline, a restored snapshot of production with no additional changes, is clearly much worse than their performance on production, at least in the case of ingestion upserts. We concluded that routine operations performed on production (daily ingestion, for example) introduce too many confounding variables for a direct comparison. One theory, considered only after the tests had already been performed, was that the optimizer in the test environments may not have been able to plan queries well due to the lack of statistics in newly restored databases; per the PostgreSQL docs:

Once restored, it is wise to run ANALYZE on each database so the optimizer has useful statistics. An easy way to do this is to run vacuumdb -a -z to VACUUM ANALYZE all databases; this is equivalent to running VACUUM ANALYZE manually.
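
We did not verify this theory at the time, but one way to check whether a restored test instance actually has planner statistics is to query the pg_stats view, and to ANALYZE the media tables before testing if it comes back empty:

-- Check whether the planner has statistics for the media tables on the
-- restored instance (pg_stats exposes the contents of pg_statistic).
SELECT tablename, COUNT(*) AS columns_with_stats
FROM pg_stats
WHERE schemaname = 'public' AND tablename IN ('image', 'audio')
GROUP BY tablename;

-- If little or nothing comes back, refresh statistics before comparing timings.
ANALYZE image;
ANALYZE audio;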

Consequently, in our analysis we only compare the relative performance between our three test environments.

This is important to consider when looking at the data for batched updates in particular. A batched update on provider-index ran about 23% faster than the same update on baseline (50 min 10 sec vs. 1 hr 5 min 9 sec); however, it still took over 50 minutes to update just 10,000 records. Extrapolated out to the roughly 50,000 batches needed, a full update of the Flickr provider would take close to five years. This is obviously untenable.

However, upserts on baseline took about 60 times as long as the identical upsert on production. If the same ratio holds true for batch updates, and they take about 50 seconds when run on production, the full Flickr update could be run in about 28 days, which is comfortably within our desired timeframe for a popularity refresh.

As a final note, the order in which queries were performed dramatically affected performance. For example, a SELECT on wikimedia_audio would run much more quickly after an UPDATE had been performed on the same rows. We tried to test queries in as close to the same order as possible among the three environments, but because the test plan was evolving while we worked, there may be some inconsistencies (for example, the very quick performance of baseline in the Wikimedia temp table test).

Analysis

It is useful to first examine our priorities in the Catalog, with respect to the types of queries that were analyzed in this work. By far our most commonly performed operations are (or will be) ingestion upserts and batched updates.

Upserts during ingestion are not currently a bottleneck: the pull_data step, which pulls data from external provider APIs, accounts for by far the greater part of ingestion runtime, even in reingestion DAGs. It is therefore not critical to improve the performance of the upsert step, as long as we do not worsen it.

Batched updates are a greater concern, as large updates have historically taken considerable time (timing out after weeks in some cases). A batched update, as currently implemented, has two steps: the creation of the temp table which is done once, and the actual UPDATE which is run potentially many times, for each batch. Therefore improving the performance of the UPDATE has a significantly greater impact on the performance of a batched update overall.

With that context, we can look at the two proposals under test:

provider-partition

The provider partition had surprisingly mixed results. In the Wikimedia Audio upsert test, it had the worst performance of all three environments by a considerable margin, but it was easily the most performant for the Freesound and Europeana upserts. It's possible that the order in which the queries were run was a factor.

While the partition environment was consistently much faster at creating temp tables for the batched update, the actual updates took considerably longer than baseline.

provider-index

The provider index offered improved performance across the board, with one exception: the UPDATE standardized popularity test, in which queries in all three environments completed within 3 seconds of each other. Note that this requires comparing the index environment to baseline rather than to production, as discussed in the Inconsistencies section.

In particular, provider-index showed consistently better performance during the UPDATE step of a batched update.

Conclusions

Because the batched UPDATE step is the operation most critically in need of performance improvement, I think we should not pursue partitioning the media tables by provider, and should instead consider adding the provider index. As noted throughout this post, the many inconsistencies found during testing and the small sample size mean these results are compelling but far from conclusive.

I propose the following next steps:

  • First, try running a batched update of wikimedia_audio, our largest audio provider, in production. This will allow us to verify the theory that batched updates, much like upserts, will be faster in production than in our tests.
  • Add the provider index to the audio table in production. Because audio is considerably smaller than image, this is a quick operation; a sketch of one way to do this without blocking writes follows this list.
  • Re-run the batched update and compare the performance. If the results are promising, consider repeating the process with the image table. If not, the index can be safely dropped.
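
For the second step, a minimal sketch of adding (and, if necessary, later removing) the index in production without blocking ingestion writes might look like the following. Note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block and builds more slowly than a plain CREATE INDEX:

-- Build the index without taking a lock that blocks writes to the audio table.
CREATE INDEX CONCURRENTLY IF NOT EXISTS audio_provider
    ON public.audio USING btree (provider);

-- If the batched update shows no improvement, the index can be dropped safely.
DROP INDEX CONCURRENTLY IF EXISTS audio_provider;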

Note that this work only investigated two of the proposals put forth at the recent Catalog meet-up. There are many others we would like to test in the future, including pulling license information into its own table, and reducing the size of large text columns.

If you have read this far, thank you very much! I’d love to hear your thoughts.

#catalog, #postgres