Connect external Postgres to BigQuery

Hi all,

We have a Postgres database (hosted on Amazon) and there is one view (10 columns) I need to transfer to BigQuery on a daily basis. What would be the best solution for this? It must be automatic.

Can I create an ODBC connection and somehow link the servers?

I'm very confused by GCP; there are so many services and functions, and I couldn't find any simple solution for my issue.

Thank you in advance for your answers!


Unfortunately, a direct ODBC connection between your PostgreSQL instance (hosted on AWS) and BigQuery isn't currently supported. However, there are several effective approaches:

Datastream: This is ideal for near real-time replication of entire databases, but might be a bit heavy-duty for just a single view.

Custom ETL Pipelines: This gives you complete control using tools like Cloud Composer (Apache Airflow) or Dataflow (Apache Beam). However, it requires more technical setup and maintenance (see the sketch after this list of options).

Third-Party Tools: Services like Fivetran or Stitch specialize in data integration and could simplify the process, but often come with a cost.
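
For the custom-pipeline route, a Cloud Composer (Airflow) DAG can be quite small. The sketch below is only a starting point: it assumes a recent Airflow 2.x environment with the Google provider installed, and the connection IDs, bucket, and table names are placeholders you'd replace with your own. It dumps the view to Cloud Storage once a day and loads the file into BigQuery:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.postgres_to_gcs import PostgresToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="postgres_view_to_bigquery",
    schedule="@daily",                       # run once per day
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Dump the view to newline-delimited JSON in a GCS bucket.
    extract = PostgresToGCSOperator(
        task_id="extract_view",
        postgres_conn_id="my_rds_postgres",            # placeholder: Airflow connection to your RDS instance
        sql="SELECT * FROM my_schema.my_view",         # placeholder view name
        bucket="my-landing-bucket",                    # placeholder: existing GCS bucket
        filename="exports/my_view/{{ ds }}.json",
        export_format="json",
    )

    # Load the exported file into BigQuery, replacing yesterday's snapshot.
    load = GCSToBigQueryOperator(
        task_id="load_to_bigquery",
        bucket="my-landing-bucket",
        source_objects=["exports/my_view/{{ ds }}.json"],
        destination_project_dataset_table="my-gcp-project.analytics.my_view",  # placeholder table
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_TRUNCATE",
        autodetect=True,
    )

    extract >> load
```

Because the view is small (10 columns, one extract per day), a full daily reload with WRITE_TRUNCATE is usually simpler than trying to load increments.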

My Recommendation: Dataflow + Cloud Scheduler + Pub/Sub (Google Cloud)

This approach is well-suited to your needs and keeps the whole pipeline inside Google Cloud Platform:

  • Dataflow: A fully managed service for creating data processing pipelines.
  • Cloud Scheduler: Allows you to schedule jobs, like triggering your data transfer daily.
  • Pub/Sub: A messaging service that helps coordinate the different parts of your solution.

How It Works

  1. Cloud Scheduler triggers a daily job.
  2. The job sends a message to Pub/Sub.
  3. Pub/Sub activates your Dataflow pipeline.
  4. Dataflow securely extracts data from your PostgreSQL view.
  5. Dataflow loads the data into your BigQuery table.
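
To make steps 4 and 5 concrete, here's a minimal Apache Beam (Dataflow) pipeline sketch in Python. The project, region, bucket, JDBC URL, credentials, and table names are all placeholders, and ReadFromJdbc is a cross-language transform (it needs a Java expansion service, which Dataflow provides), so treat this as a starting point rather than a finished pipeline:

```python
import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.io.gcp.bigquery import WriteToBigQuery, BigQueryDisposition
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-gcp-project",              # placeholder: your GCP project ID
        region="us-central1",
        temp_location="gs://my-temp-bucket/tmp",
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            # Read the whole view over JDBC (a view can be read like a table).
            | "ReadPostgresView" >> ReadFromJdbc(
                table_name="my_schema.my_view",                      # placeholder view name
                driver_class_name="org.postgresql.Driver",
                jdbc_url="jdbc:postgresql://my-rds-host:5432/mydb",  # placeholder RDS endpoint
                username="reader",
                password="change-me",            # better: pull this from Secret Manager
            )
            # Rows arrive as NamedTuple-like records; convert them to dicts whose
            # keys match the BigQuery column names.
            | "ToDict" >> beam.Map(lambda row: row._asdict())
            # Replace yesterday's snapshot in an existing BigQuery table.
            | "WriteToBigQuery" >> WriteToBigQuery(
                "my-gcp-project:analytics.my_view",                  # placeholder table
                write_disposition=BigQueryDisposition.WRITE_TRUNCATE,
                create_disposition=BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()
```

Since the pipeline runs once a day and simply replaces the table, WRITE_TRUNCATE keeps the BigQuery copy an exact snapshot of the Postgres view.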

Thank you for your answer!

One question: are you creating the view just for BigQuery or is it used for something else?
 
The view is really just a query of your underlying tables. Most people extract those tables first, and then either transform in transit (ETL) or in BigQuery (ELT). Transforming in the source database (TEL) is less common.
 
So my recommendation would be:
  1. Use change data capture (CDC) to extract the source tables. This is the fastest, lowest-load way to get data from Postgres, and it will also get you all the deletes.
  2. If you can't use CDC, then batch-extract the source tables, or possibly the view. But you'll have more latency, more load on Postgres if the view is big (especially if it exists only for BigQuery), and you'll also need to manage deletes yourself.
You miss deletes with batch extractions because whenever a delete happens in Postgres, the next extraction doesn't see it. The only way to know it happened with batch is to diff the old and new extracts to look for missing rows. CDC gives you all the change data.
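
If you do end up with batch extracts, one common pattern is to load each day's extract into a staging table and then remove target rows whose keys no longer appear in it. Here's a rough sketch with the BigQuery Python client, using made-up project/dataset/table names and assuming `id` is the key column:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder: your project ID

# Delete rows from the target table that are missing from the latest extract,
# i.e. rows that were deleted in Postgres since the last batch run.
delete_sql = """
DELETE FROM `my-gcp-project.analytics.my_table` AS t
WHERE NOT EXISTS (
  SELECT 1
  FROM `my-gcp-project.staging.my_table_latest` AS s
  WHERE s.id = t.id
)
"""
client.query(delete_sql).result()  # blocks until the job finishes
```

If you'd rather do it in one statement, BigQuery's MERGE with a WHEN NOT MATCHED BY SOURCE THEN DELETE clause combines the delete with the insert/update step.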
 
Some ETL/ELT tools will do the deletions for you in BigQuery as part of CDC. In Estuary Flow we call this a standard materialization.
 
Good luck!