Hi All,
I'm new to Dataform and GCP, but I used dbt at my previous company. Unless I'm completely forgetting something, we structured each folder to have its own YAML file that determined which dataset (in GCP terms) each table would go to. Right now in Dataform, there's one JSON file that I'm not sure how to override. Even when I try to do it manually in the config block of a SQLX file, it won't let me.
Does anyone have any understanding of this, or any documentation? I've found it hard to find resources, especially training videos, for Dataform.
Thanks
Unlike dbt, where each folder can carry its own YAML configuration, Dataform uses a single `dataform.json` file at the root of your project, which may be what you're encountering. Here's how you can manage this in Dataform:
Project Configuration (`dataform.json`): This file contains global settings for your Dataform project. It includes the default dataset (schema, in dbt terms) where your tables will be created unless specified otherwise in the SQLX files. It looks something like:
```json
{
  "warehouse": "bigquery",
  "defaultSchema": "your_default_dataset",
  "assertionSchema": "your_assertions_dataset"
}
```
(Note: the key is `assertionSchema`, singular, and the Dataform core version is pinned in `package.json` rather than here.)
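Also worth knowing: in newer Dataform core releases (3.x), `dataform.json` has been replaced by a `workflow_settings.yaml` file at the project root. A rough equivalent of the config above would look like this (verify the key names against the docs for your core version):

```yaml
# workflow_settings.yaml (Dataform core 3.x) — rough equivalent of the
# legacy dataform.json; check key names against your core version
defaultProject: your_gcp_project
defaultDataset: your_default_dataset
defaultAssertionDataset: your_assertions_dataset
dataformCoreVersion: 3.0.0
```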
Overriding Default Settings in SQLX Files: If you want to specify a different dataset for a particular table or view, you can set it in the SQLX file itself using the `config` block. Here's how you might configure it:
```sqlx
config {
  type: "table",
  schema: "specific_dataset",
  description: "Description of what this model represents"
}
SELECT ...
```
In this block, `schema` corresponds to the BigQuery dataset where the table or view will be created.
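The `config` block can also override the GCP project and the table name, not just the dataset. A sketch, with made-up placeholder values (`analytics-prod`, `daily_orders`); the field names are from Dataform's config reference:

```sqlx
config {
  type: "table",
  database: "analytics-prod",  // GCP project the table is created in
  schema: "specific_dataset",  // BigQuery dataset
  name: "daily_orders"         // table name (defaults to the file name)
}
SELECT ...
```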
Tips for Larger or More Complex Projects:
- Consider maintaining separate `dataform.json` files for different environments (development, staging, production) to easily manage configuration changes across environments.
- Make sure the `config` block is placed before any SQL statements within the SQLX file.
Dataform's official documentation is a valuable resource: Google Cloud Dataform Documentation. It includes guides on setting up your development environment, writing and running transformations, and more. While external resources are less plentiful than for dbt, the official documentation is a comprehensive starting point.
@ms4446 is there a specific supported way to have `dataform.json` files differ by environment, e.g. `dataform.dev.json`? Does Dataform support this?
In Dataform, managing environment-specific configurations is important for maintaining distinct development, staging, and production environments. This is achievable by using a different `dataform.json` file tailored to each environment.
First, create separate files for each environment, such as `dataform.dev.json`, `dataform.staging.json`, and `dataform.prod.json`. Each file should contain the configuration for its respective environment. For instance, `dataform.dev.json` might specify a development dataset, while `dataform.prod.json` would specify a production dataset.
Example `dataform.dev.json`:
```json
{
  "warehouse": "bigquery",
  "defaultSchema": "dev_dataset",
  "assertionSchema": "dev_assertions"
}
```
Example `dataform.prod.json`:
```json
{
  "warehouse": "bigquery",
  "defaultSchema": "prod_dataset",
  "assertionSchema": "prod_assertions"
}
```
To use these environment-specific configurations, specify which configuration file to use when running your Dataform project. Note that the flags supported by the open-source Dataform CLI vary by version (run `dataform run --help` to check yours); if your version accepts a config-file flag, the invocation looks like:
```shell
dataform run --project-dir ./path/to/project --config dataform.dev.json
```
For production, adjust the command accordingly:
```shell
dataform run --project-dir ./path/to/project --config dataform.prod.json
```
Alternatively, the CLI's `--schema-suffix` option appends a suffix to every dataset name (e.g. `your_dataset_dev`), which can be a lighter-weight way to separate environments.
A more automated approach is a script that selects the appropriate configuration file and runs Dataform with it. Below is an example script, `run_dataform.sh`, which takes an environment argument (`dev`, `staging`, or `prod`) and runs Dataform with the corresponding configuration:
```shell
#!/bin/bash
# Select the configuration file for the requested environment
ENV=$1
if [ "$ENV" == "dev" ]; then
  CONFIG_FILE="dataform.dev.json"
elif [ "$ENV" == "staging" ]; then
  CONFIG_FILE="dataform.staging.json"
elif [ "$ENV" == "prod" ]; then
  CONFIG_FILE="dataform.prod.json"
else
  echo "Unknown environment: $ENV"
  exit 1
fi

# Run Dataform with the specified configuration file
dataform run --project-dir ./path/to/project --config "$CONFIG_FILE"
```
To execute the script, use:
```shell
./run_dataform.sh dev
```
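If your CLI version doesn't offer a config-file flag, a simple alternative is to copy the environment-specific file over `dataform.json` before running, so the standard CLI picks it up with no extra flags. A minimal sketch; the helper name and the throwaway files written here are just for illustration:

```shell
# use_env_config: copy dataform.<env>.json over dataform.json so the
# standard CLI reads it without needing a config-file flag (hypothetical helper)
use_env_config() {
  src="dataform.$1.json"
  if [ ! -f "$src" ]; then
    echo "missing $src" >&2
    return 1
  fi
  cp "$src" dataform.json
}

# Demo with throwaway files in the current directory
echo '{"warehouse": "bigquery", "defaultSchema": "dev_dataset"}' > dataform.dev.json
echo '{"warehouse": "bigquery", "defaultSchema": "prod_dataset"}' > dataform.prod.json

use_env_config dev
grep "defaultSchema" dataform.json   # shows the dev_dataset config is active
```

In CI you would run the copy step before `dataform run`; just make sure `dataform.json` itself is gitignored (or regenerated every run) so an environment's config is never committed by accident.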
This method ensures consistent configuration management across different environments. It's also important to manage sensitive information securely, possibly using a secret management service like Google Cloud Secret Manager. Documenting the process and providing scripts for team members will facilitate smooth transitions between environments.
@ms4446 thank you! I believe it must have been user error on my part or something but seems to be working just fine now. Appreciate it!