Solved: Dataform yaml for dataset location

aaron_harkins · 05-08-2024 11:00 AM

Hi All,

I'm new to Dataform and GCP but used dbt at my previous company. Unless I'm completely forgetting something, we structured each folder to have its own yaml file that determined which dataset (in GCP terms) each table would go to. Right now in Dataform, there's one json file that I'm not sure how to override. Even when I try to set it manually in the config of the sqlx file, it won't let me.

Does anyone have any understanding of this or any documentation? I've been finding it hard to find things, especially training videos, for Dataform.

Thanks

ms4446

Dataform handles project structure and dataset configuration for your SQLX files a bit differently from dbt. Most project-wide configuration lives in the dataform.json file at the root of your project, which is probably the JSON file you're running into.

Here's how you can manage this in Dataform:

Project Configuration (dataform.json): This file contains global settings for your Dataform project. It includes the default dataset (schema in dbt terms) where your tables will be created unless specified otherwise in the SQLX files. This looks something like:

{
  "warehouse": "bigquery",
  "defaultSchema": "your_default_dataset",
  "assertionsSchema": "your_assertions_dataset",
  "dataformCoreVersion": "1.x.x"
}

Overriding Default Settings in SQLX Files: If you want to specify a different dataset for a particular table or view, you can set this in the SQLX file itself using the config block. Here’s how you might configure it:

config {
  type: "table",
  schema: "specific_dataset",
  description: "Description of what this model represents"
}

SELECT ...

In this block, schema corresponds to the dataset in BigQuery where this table/view will be created.
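
If you want something closer to dbt's per-folder setup, one option is to keep the folder-to-dataset mapping in a single JavaScript helper and look the dataset up from there. This is only a sketch: the folder names, the dataset names, and the datasetForPath function are made up for illustration and are not part of Dataform.

```javascript
// Hypothetical mapping from a top-level folder under definitions/ to a BigQuery dataset.
const FOLDER_DATASETS = {
  staging: "staging_dataset",
  marts: "marts_dataset",
};

// Returns the dataset for a file path such as "definitions/staging/orders.sqlx",
// falling back to the project default when the folder is not mapped.
function datasetForPath(filePath, fallback = "your_default_dataset") {
  const parts = filePath.split("/");
  const start = parts[0] === "definitions" ? 1 : 0;
  // A file sitting directly under definitions/ has no folder, so use the fallback.
  if (parts.length - start < 2) return fallback;
  const folder = parts[start];
  return Object.prototype.hasOwnProperty.call(FOLDER_DATASETS, folder)
    ? FOLDER_DATASETS[folder]
    : fallback;
}
```

Files under includes/ are the usual place for shared JavaScript in a Dataform project, so a helper like this could live there. Note that the path has to be passed in by hand; the value of the approach is that the dataset names are defined in one place.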

Tips for Larger or More Complex Projects:

Consistent Naming Conventions: Consider using a consistent naming convention for your datasets (e.g., prefixing them with team or functional area) to help organize your data and configuration.
Environment-Specific Configuration: Leverage environment variables or create separate dataform.json files for different environments (development, staging, production) to easily manage configuration changes across environments.
Troubleshooting Configuration Issues: If configuration overrides aren't being respected, double-check the dataset names for typos and ensure that the config block is correctly placed before any SQL statements within the SQLX file.
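
For the troubleshooting tip, a quick heuristic check can catch the two slips mentioned there. The sketch below is a hypothetical helper (checkSqlx is not a Dataform function) that tests whether the config block comes first and whether a literal schema value sticks to letters, digits, and underscores, which is what BigQuery dataset names allow as far as I know. It uses simple regular expressions rather than a real SQLX parser, so treat its results as hints only.

```javascript
// Heuristic check of a SQLX file's text (hypothetical helper, not a Dataform API).
// Returns a list of human-readable problems; an empty list means nothing was flagged.
function checkSqlx(sqlxText) {
  const problems = [];

  // Skip leading whitespace and leading "--" line comments before looking for the config block.
  const body = sqlxText.replace(/^(\s*--[^\n]*\n)*\s*/, "");
  if (!/^config\s*\{/.test(body)) {
    problems.push("config block is missing or does not come first");
  }

  // Pull out a literal schema: "..." value, if there is one.
  const match = sqlxText.match(/schema\s*:\s*"([^"]*)"/);
  if (match && !/^[A-Za-z0-9_]+$/.test(match[1])) {
    problems.push(
      `dataset name "${match[1]}" has characters other than letters, digits, and underscores`
    );
  }

  return problems;
}
```

For example, a file whose first statement is a SELECT, or a schema value such as "my-dataset", would each produce one entry in the returned list.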

Dataform's official documentation is a valuable resource: Google Cloud Dataform Documentation. This includes guides on setting up your development environment, writing and running transformations, and more. While external resources might be less plentiful compared to dbt, the official documentation provides a comprehensive starting point.

AKatoch

@ms4446 is there a specific supported way to have dataform.jsons differ by env? E.g. dataform.dev.json? Does Dataform support this?

ms4446

In Dataform, managing environment-specific configurations is important for maintaining distinct development, staging, and production environments. This is achievable by utilizing different dataform.json files tailored for each environment.

Firstly, create separate dataform.json files for each environment, such as dataform.dev.json, dataform.staging.json, and dataform.prod.json. Each file should contain the specific configurations needed for its respective environment. For instance, dataform.dev.json might specify a development dataset, while dataform.prod.json would specify a production dataset.

Example dataform.dev.json:

{
  "warehouse": "bigquery",
  "defaultSchema": "dev_dataset",
  "assertionsSchema": "dev_assertions",
  "dataformCoreVersion": "1.x.x"
}

Example dataform.prod.json:

{
  "warehouse": "bigquery",
  "defaultSchema": "prod_dataset",
  "assertionsSchema": "prod_assertions",
  "dataformCoreVersion": "1.x.x"
}
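
Since the two files differ only in their dataset names, you may prefer to generate them from one shared base instead of maintaining copies by hand. A minimal sketch, assuming Node.js is available; BASE_CONFIG, ENV_OVERRIDES, and buildConfig are illustrative names, not anything Dataform provides:

```javascript
// Settings shared by every environment, taken from the examples above.
const BASE_CONFIG = {
  warehouse: "bigquery",
  dataformCoreVersion: "1.x.x",
};

// Per-environment overrides; only dev and prod are spelled out in the examples above.
const ENV_OVERRIDES = {
  dev: { defaultSchema: "dev_dataset", assertionsSchema: "dev_assertions" },
  prod: { defaultSchema: "prod_dataset", assertionsSchema: "prod_assertions" },
};

// Merge the shared settings with one environment's overrides.
function buildConfig(env) {
  if (!Object.prototype.hasOwnProperty.call(ENV_OVERRIDES, env)) {
    throw new Error(`Unknown environment: ${env}`);
  }
  return { ...BASE_CONFIG, ...ENV_OVERRIDES[env] };
}

// Print the dev settings as JSON, ready to be saved as dataform.dev.json.
console.log(JSON.stringify(buildConfig("dev"), null, 2));
```

Adding a staging environment is then one more entry in ENV_OVERRIDES, and an unrecognized name fails loudly instead of silently producing a config with no dataset.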

To use these environment-specific configurations, you specify which configuration file to use when running your Dataform project, either via command-line arguments or environment variables. The commands below assume your Dataform CLI accepts --project-dir and --config options; check dataform run --help for the options your installed version actually supports before relying on them. For instance, from the command line:

dataform run --project-dir ./path/to/project --config dataform.dev.json

For production, you would adjust the command accordingly:

dataform run --project-dir ./path/to/project --config dataform.prod.json

A more automated approach involves using a script that sets the appropriate environment variable and runs Dataform with the correct configuration file. Below is an example of a script, run_dataform.sh, which takes an environment argument (dev, staging, or prod) and runs Dataform with the corresponding configuration:

#!/bin/bash

# Read the target environment (dev, staging, or prod) from the first argument
ENV="${1:-}"

if [ "$ENV" == "dev" ]; then
  CONFIG_FILE="dataform.dev.json"
elif [ "$ENV" == "staging" ]; then
  CONFIG_FILE="dataform.staging.json"
elif [ "$ENV" == "prod" ]; then
  CONFIG_FILE="dataform.prod.json"
else
  echo "Unknown environment: $ENV"
  exit 1
fi

# Run Dataform with the specified configuration file
dataform run --project-dir ./path/to/project --config "$CONFIG_FILE"

To execute the script, use the following command:

./run_dataform.sh dev

This method ensures consistent configuration management across different environments. It's also important to manage sensitive information securely, possibly using a secret management service like Google Cloud Secret Manager. Documenting the process and providing scripts for team members will facilitate smooth transitions between environments.

aaron_harkins

@ms4446 thank you! I believe it must have been user error on my part or something but seems to be working just fine now. Appreciate it!