Catalog Performance Testing Results

On May 22, several Openverse maintainers met to investigate performance improvements to the Catalog database. We created three test instances of the Catalog DB from restored snapshots of production data:

  • baseline: A restored production snapshot with no changes applied
  • provider-index: A snapshot with a b-tree index added on the provider column for both the audio and image tables
  • provider-partition: A snapshot with the image and audio tables partitioned on the provider column

We then ran a series of SQL queries against all three environments in order to compare their relative performance. The queries selected were representative of operations that are routinely performed on the catalog. The exact queries used to modify the provider-index and provider-partition environments, as well as the queries used in testing, are documented below.

Setting up the test environments

provider-index

To create the provider-index test DB, we restored a production snapshot and ran the following queries to add a b-tree index on provider for both media tables:

CREATE INDEX audio_provider ON public.audio USING btree (provider);
CREATE INDEX image_provider ON public.image USING btree (provider);
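
As a quick sanity check (not part of the original test plan), one can confirm that the planner actually uses the new index for the provider-filtered queries used throughout this post:

-- With image_provider in place, this plan should use the index (for example an
-- index-only or bitmap index scan) rather than a sequential scan over image.
EXPLAIN SELECT COUNT(*) FROM image WHERE provider = 'met';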

provider-partition

To create the provider-partition test DB, we restored a production snapshot and ran the following queries to create new audio_part and image_part tables, which are partitioned on the provider column. Only the queries for the audio table are shown here as an example:

CREATE TABLE audio_part (
    identifier uuid DEFAULT public.uuid_generate_v4(),
    created_on timestamp with time zone NOT NULL,
    updated_on timestamp with time zone NOT NULL,
    ingestion_type character varying(80),
    provider character varying(80),
    source character varying(80),
    foreign_identifier character varying(3000),
    foreign_landing_url character varying(1000),
    url character varying(3000) NOT NULL,
    thumbnail character varying(3000),
    filetype character varying(5),
    duration integer,
    bit_rate integer,
    sample_rate integer,
    category character varying(80),
    genres character varying(80)[],
    audio_set jsonb,
    set_position integer,
    alt_files jsonb,
    filesize integer,
    license character varying(50) NOT NULL,
    license_version character varying(25),
    creator character varying(2000),
    creator_url character varying(2000),
    title character varying(5000),
    meta_data jsonb,
    tags jsonb,
    watermarked boolean,
    last_synced_with_source timestamp with time zone,
    removed_from_source boolean NOT NULL,
    standardized_popularity double precision,
    PRIMARY KEY (identifier, provider)
) PARTITION BY LIST (provider);

-- Add indices
CREATE UNIQUE INDEX audio_part_provider_fid_idx
    ON audio_part
        USING btree (provider, md5(foreign_identifier));
CREATE UNIQUE INDEX audio_part_url_key
    ON audio_part
    USING btree (url, provider);

-- Create partitions
CREATE TABLE jamendo PARTITION OF audio_part FOR VALUES IN ('jamendo');
CREATE TABLE freesound PARTITION OF audio_part FOR VALUES IN ('freesound');
CREATE TABLE wikimedia_audio PARTITION OF audio_part FOR VALUES IN ('wikimedia_audio');

-- Insert all records from the relevant media table
INSERT INTO audio_part SELECT * FROM audio;

CREATE INDEX audio_part_partition ON audio_part (provider);
SET constraint_exclusion=on;

The partitioned table matches the schema of the existing media tables as closely as possible. There are a few necessary changes worth noting:

  • In order to partition by provider, provider was added to the primary key (previously just identifier)
  • The unique indices all need to include the partition key, which meant that we had to add a unique constraint on (url, provider) rather than just (url) as we have in the current Catalog media tables and in the API. This means that with this setup, it is theoretically possible to have a record which passes the constraints of the Catalog but not the API.

Tests in the provider-partition environment were run against the image_part and audio_part tables.
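
A similar sanity check applies to the partitioned tables. With list partitioning on provider, the planner should prune every partition except the one matching the filter, which can be verified with a query like the following:

-- Only the wikimedia_audio partition should appear in this plan; the other
-- partitions are pruned because the filter matches a single list value.
EXPLAIN SELECT COUNT(*) FROM audio_part WHERE provider = 'wikimedia_audio';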

Test queries

Test queries were selected for the following common operations:

  • SELECT and UPDATE by provider
  • Large upserts (as performed during ingestion by provider DAGs)
  • Batch updates (for data migrations and popularity refreshes; see https://github.com/WordPress/openverse/pull/2331)

SELECT Met

Count all records for Metropolitan Museum (~370k rows)

SELECT COUNT(*) FROM image WHERE provider = 'met';

SELECT Wikimedia Audio

Count all records for Wikimedia Audio (~1 million rows)

SELECT COUNT(*) FROM audio WHERE provider = 'wikimedia_audio';

UPDATE thumbnails

Set thumbnail to null for all Wikimedia Audio records.

UPDATE audio
SET thumbnail = null
WHERE provider = 'wikimedia_audio';

UPDATE duration

Set duration to 0 for all Wikimedia Audio records.

UPDATE audio
SET duration = 0
WHERE provider = 'wikimedia_audio';

UPDATE standardized popularity

Calculate standardized popularity for all Wikimedia Audio records.

UPDATE audio
SET standardized_popularity = standardized_audio_popularity(audio.provider, audio.meta_data)
WHERE provider = 'wikimedia_audio';

UPSERT Wikimedia Audio

For upserts, we manually ran all of the queries generated during the load_data steps for several production DagRuns. This includes creating a loading table, loading the data from a production TSV, cleaning the data, and finally upserting it into the media table. Only the performance of the actual upsert step is compared in this study, but all of the queries can be found in the logs of the relevant DagRun. If you do not have access to production Airflow, you can run a DAG locally to see all queries.
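
For context, the upsert itself has roughly the following shape. This is a simplified sketch rather than the exact production query: the real statement is generated by the load_data task and covers every column of the media table, and the loading table name used here (audio_load_example) is hypothetical.

-- Simplified sketch of the upsert step against a hypothetical loading table.
INSERT INTO audio (
    created_on, updated_on, ingestion_type, provider, source,
    foreign_identifier, foreign_landing_url, url, license, license_version,
    removed_from_source, meta_data, tags
)
SELECT
    now(), now(), ingestion_type, provider, source,
    foreign_identifier, foreign_landing_url, url, license, license_version,
    false, meta_data, tags
FROM audio_load_example
ON CONFLICT (provider, md5(foreign_identifier))
DO UPDATE SET
    updated_on = now(),
    url = EXCLUDED.url,
    meta_data = EXCLUDED.meta_data,
    tags = EXCLUDED.tags;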

For Wikimedia Audio we used queries from the 2023-05-04 DagRun, loading 804 records.

UPSERT Freesound

Queries copied from the 2023-03-10 DagRun, loading 152,500 records.

UPSERT Europeana

Queries copied from the 2023-05-22 DagRun, loading 23,724 records.

CREATE Wikimedia Audio temp table

Create a table for all Wikimedia Audio records, as would be used in the first step of a batched update. (~1 million rows)

CREATE TABLE rows_to_update_wiki_audio AS
    SELECT ROW_NUMBER() OVER() row_id, identifier
    FROM audio
    WHERE provider = 'wikimedia_audio' AND updated_on < NOW();

CREATE Met temp table

Create a table for all Metropolitan Museum records, as would be used in the first step of a batched update. (~370k rows)

CREATE TABLE rows_to_update_met AS
    SELECT ROW_NUMBER() OVER() row_id, identifier
    FROM image
    WHERE provider = 'met' AND updated_on < NOW();

CREATE Flickr temp table

Create a table for all Flickr records, as would be used in the first step of a batched update. (~500 million rows)

CREATE TABLE rows_to_update_flickr AS
    SELECT ROW_NUMBER() OVER() row_id, identifier
    FROM image
    WHERE provider = 'flickr' AND updated_on < NOW();

Flickr Large Batch Update

Update a batch of 10,000 Flickr records.

UPDATE image
    SET standardized_popularity = standardized_image_popularity(provider, meta_data)
    WHERE identifier in (
        SELECT identifier FROM rows_to_update_flickr
        WHERE row_id >= 300000000 AND row_id < 300010000
        FOR UPDATE SKIP LOCKED
    );

Flickr Small Batch Update

Update a batch of 1,000 Flickr records.

UPDATE image
    SET standardized_popularity = standardized_image_popularity(provider, meta_data)
    WHERE identifier in (
        SELECT identifier FROM rows_to_update_flickr
        WHERE row_id >= 1000000 AND row_id < 1001000
        FOR UPDATE SKIP LOCKED
    );

Results

Test                        | baseline          | provider-index | provider-partition
----------------------------|-------------------|----------------|-------------------
SELECT Met                  | 11 min 15 sec     | 5.598 sec      | 6.778 sec
SELECT Wikimedia Audio      | 2 min 32 sec      | 0.510 sec      | 0.206 sec
UPDATE thumbnails           | 4 min 54 sec      | 1 min 14 sec   | 1 min 7 sec
UPDATE duration             | 1 min 12 sec      | 1 min 9 sec    | 1 min 16 sec
UPDATE popularity           | 1 min 18 sec      | 1 min 21 sec   | 1 min 18 sec
UPSERT Wikimedia Audio      | 33.751 sec        | 1.532 sec      | 1 min 2 sec
UPSERT Freesound            | 52 sec            | 35 sec         | 28 sec
UPSERT Europeana            | 44 min 22 sec     | 30 min 13 sec  | 6 min 51 sec
CREATE Wikimedia temp table | 2 sec             | 1 sec          | 40 sec
CREATE Met temp table       | 6 min 40 sec      | 7 sec          | 24 sec
CREATE Flickr temp table    | 18 hr             | 40 min 13 sec  | 9 hr 35 min 20 sec
Flickr large batch UPDATE   | 1 hr 5 min 9 sec  | 50 min 10 sec  | 2 hr 42 min 30 sec
Flickr small batch UPDATE   | 5 min 1 sec       | 2 min 12 sec   | 31 min 7 sec

Inconsistencies Noted During Testing

When testing ingestion UPSERTs, we initially compared the performance of the query on provider-index and provider-partition to the runtime of the upsert_data task in the production DagRun from which the test was sourced. In each case, this production upsert took less than a minute to complete, while the same query took close to an hour in our test environments. Alarmingly, this seemed to indicate that the index and partition both massively degraded the performance of UPSERTs relative to baseline.

However, when we ran the queries on the baseline testing environment (which was theoretically identical to the production environment in which the DAGs ran), we found that the runtime was considerably longer; in fact, even longer than the runtime on provider-index and provider-partition. The timings given in the table above are for these tests run on baseline. I also tested running the queries manually on the actual production database, and found they ran in less than a minute, just as they had during the DagRun!

This leaves us with a major inconsistency: the performance of queries on baseline, a restored snapshot of production with no additional changes, is clearly much worse than their performance on production, at least in the case of ingestion upserts. We concluded that routine operations performed on production (daily ingestion, for example) introduce too many confounding variables for a direct comparison. One theory, considered only after the tests had already been performed, was that the optimizer in the test environments may not have been able to plan queries well due to the lack of statistics in newly restored databases; per the PostgreSQL docs:

Once restored, it is wise to run ANALYZE on each database so the optimizer has useful statistics. An easy way to do this is to run vacuumdb -a -z to VACUUM ANALYZE all databases; this is equivalent to running VACUUM ANALYZE manually.
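
We did not verify this theory at the time, but one way to check whether a restored test instance actually has planner statistics is to query the pg_stats view, and to ANALYZE the media tables before testing if it comes back empty:

-- Check whether the planner has statistics for the media tables on the
-- restored instance (pg_stats exposes the contents of pg_statistic).
SELECT tablename, COUNT(*) AS columns_with_stats
FROM pg_stats
WHERE schemaname = 'public' AND tablename IN ('image', 'audio')
GROUP BY tablename;

-- If little or nothing comes back, refresh statistics before comparing timings.
ANALYZE image;
ANALYZE audio;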

Consequently, in our analysis we only compare the relative performance between our three test environments.

This is important to consider when looking at the data for batched updates in particular. A batched update on provider-index ran about 23% faster than the same update on baseline (50 min 10 sec vs. 1 hr 5 min 9 sec); however, it still took over 50 minutes to update just 10,000 records. Extrapolated out to the roughly 50,000 batches needed, a full update of the Flickr provider would take close to five years. This is obviously untenable.

However, upserts on baseline took about 60 times as long as the identical upsert on production. If the same ratio holds true for batch updates, and they take about 50 seconds when run on production, the full Flickr update could be run in about 28 days, which is comfortably within our desired timeframe for a popularity refresh.

As a final note, the order in which queries were performed dramatically affected performance. For example, a SELECT on wikimedia_audio would run much more quickly after an UPDATE had been performed on the same rows. We tried to test queries in as close to the same order as possible among the three environments, but because the test plan was evolving while we worked, there may be some inconsistencies (for example, the very quick performance of baseline in the Wikimedia temp table test).

Analysis

It is useful to first examine our priorities in the Catalog, with respect to the types of queries that were analyzed in this work. By far our most commonly performed operations are (or will be) ingestion upserts and batched updates.

Upserts during ingestion are not currently a bottleneck: the pull_data step, which pulls data from external provider APIs, accounts for by far the greater part of ingestion runtime, even in reingestion DAGs. It is therefore not critical to improve the performance of the upsert step, as long as we do not worsen it.

Batched updates are a greater concern, as large updates have historically taken considerable time (timing out after weeks in some cases). A batched update, as currently implemented, has two steps: the creation of the temp table which is done once, and the actual UPDATE which is run potentially many times, for each batch. Therefore improving the performance of the UPDATE has a significantly greater impact on the performance of a batched update overall.

With that context, we can look at the two proposals under test:

provider-partition

The provider partition had surprisingly mixed results. In the Wikimedia Audio upsert test, it had the worst performance of all three environments by a considerable margin, but it was easily the most performant for the Freesound and Europeana upserts. It's possible that the order in which the queries were run was a factor.

While the partition environment was consistently much faster at creating temp tables for the batched update, the actual updates took considerably longer than baseline.

provider-index

The provider index offered improved performance across the board, with one exception: the UPDATE standardized popularity test, in which queries in all three environments completed within 3 seconds of each other. Note that this requires comparing the index environment to baseline rather than to production, as discussed in the Inconsistencies section.

In particular, provider-index showed consistently better performance during the UPDATE step of a batched update.

Conclusions

Because the batched UPDATE step is the operation most critically in need of performance improvement, I think we should not pursue partitioning the media tables by provider, and should instead consider adding the provider index. As noted throughout this post, the many inconsistencies found during testing and the small sample size mean these results are compelling but far from conclusive.

I propose the following next steps:

  • First, try running a batched update of wikimedia_audio, our largest audio provider, in production. This will allow us to verify the theory that batched updates, much like upserts, will be faster in production than in our tests.
  • Add the provider index to the audio table in production. Because audio is considerably smaller than image, this is a quick operation; a sketch of one way to do this without blocking writes follows this list.
  • Re-run the batched update and compare the performance. If the results are promising, consider repeating the process with the image table. If not, the index can be safely dropped.
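
For the second step, a minimal sketch of adding (and, if necessary, later removing) the index in production without blocking ingestion writes might look like the following. Note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block and builds more slowly than a plain CREATE INDEX:

-- Build the index without taking a lock that blocks writes to the audio table.
CREATE INDEX CONCURRENTLY IF NOT EXISTS audio_provider
    ON public.audio USING btree (provider);

-- If the batched update shows no improvement, the index can be dropped safely.
DROP INDEX CONCURRENTLY IF EXISTS audio_provider;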

Note that this work only investigated two of the proposals put forth at the recent Catalog meet-up. There are many others we would like to test in the future, including pulling license information into its own table, and reducing the size of large text columns.

If you have read this far, thank you very much! I’d love to hear your thoughts.

#catalog, #postgres