
FAQ

Agents

What SQL dialect and functions are supported?

Spark SQL is the default dialect.

Key Points:

  • Supports Spark functions (coalesce, cast, case, etc.).
  • SQL Copilot can assist with query suggestions.
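
For example, a workbook query might combine these functions as follows (a sketch; the table and column names are illustrative, not from an actual DataGOL source):

    SELECT
      order_id,
      CAST(order_total AS DECIMAL(12, 2))      AS order_total_dec,  -- explicit type cast
      COALESCE(discount, 0)                    AS discount,         -- replace NULLs with a default
      CASE
        WHEN status = 'C' THEN 'Completed'
        ELSE 'Open'
      END                                      AS status_label      -- conditional mapping
    FROM orders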

AS-OF time travel

How to enable AS-OF time travel?

Requires: Parameterized orchestration.

To enable AS-OF time travel, do the following:

  1. Pass a historical date in views or workbook queries.
  2. View the dataset as it existed at that point in time.
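
As a rough sketch of the underlying query on an Iceberg-backed table, Spark SQL accepts a timestamp literal for time travel; in DataGOL the historical date would be supplied through the orchestration parameter rather than hard-coded (the table name and date below are illustrative):

    -- Read the dataset as it existed at the given point in time (requires an Iceberg table)
    SELECT *
    FROM sales.orders TIMESTAMP AS OF '2024-01-01 00:00:00'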

BI and Visualizer

How does the BI dashboard layer operate in DataGOL?

Dashboards are built on top of workbooks using the Visualizer.

Key Points:

  • Widgets use formulas and parameters.
  • Widget-level filters override page-level filters.
  • Drill-downs can be column-based or widget-defined.

Data Sources

How does DataGOL handle deleted records in source systems like SAP?

Deleted records are handled using Full Refresh Merge with Iceberg, which compares snapshots to detect deletions.

Key Points:

  • CDC pipelines are not directly supported in DataGOL.
  • Full refresh merge provides a reliable internal alternative.
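
Conceptually, the deletion check boils down to comparing the latest full extract against the existing target table and removing rows that no longer appear in the source. DataGOL performs this internally as part of Full Refresh Merge; the Spark SQL sketch below (with illustrative table and key names) only shows the idea:

    -- Rows present in the target but missing from the latest full extract were deleted in the source
    DELETE FROM warehouse.customers
    WHERE customer_id NOT IN (SELECT customer_id FROM staging.customers_latest)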

Can DataGOL integrate with existing warehouses like Redshift or Snowflake?

Yes. Connections are defined in the Data Sources layer. Both source and target warehouses must be configured.

Key Points:

  • Destination must be explicitly selected when creating pipelines.
  • DataGOL supports two roles:
    • ETL Only: Acts as a transformation layer. It pulls data from sources, lets you process it (joins, filters, cleaning, etc.), and writes the result into your warehouse.
    • Full Warehouse: Stores data within its own internal open-format storage (e.g., Parquet or Iceberg) and also runs queries and dashboards from there.

How should we name data sources and warehouses?

Best Practice:

  • Format:
    <ProductName>_<Environment>_<DB|Warehouse>
  • For S3, replace DB with S3.

Example:

app_prod_db, app_stage_wh

This naming clearly differentiates DB, warehouse, and S3 sources and keeps names unique for use-case-driven design.

How are S3 data sources structured?

S3 Path Format:

<bucket>/<base_folder>/<table_name>/<files>

The table_name folder becomes the table name in DataGOL.

[Important]

Provide the path only up to the base folder (i.e., <bucket>/<base_folder>); DataGOL processes each subfolder (e.g., table_name) as an individual table.

Schema Tips:

  • Keep data types and column names consistent.
  • Only CSV is supported for S3 data sources; for other formats, create a Folder data source instead.

How to onboard 3rd party connectors (PostHog, HubSpot, etc.)?

Do the following to onboard 3rd party connectors:

  1. Request whitelisting via Slack.
  2. Once whitelisted, configure the data source.
  3. Create a pipeline from the data source.

[Special case]

Some APIs (e.g., PostHog) send entire record updates rather than diffs — deduplication pipelines are needed to retain only the latest rows.

What is partitioning, and when to use it?

Use Case:

  • Large datasets that can benefit from parallelism in Spark.

How:

  • Choose a partition column (integer/date).
  • Set partition size (default: 50).
  • Spark executes parallel tasks for faster syncs.

Not applicable to REST-based APIs (e.g., HubSpot, Jira, Zendesk).

Should I copy data into DataGOL or query directly from the source?

Copy use case:

  • When downstream orchestration, time travel, or dedup is required.

Direct query use case:

  • For real-time dashboards or when querying cost-heavy external sources (e.g., Snowflake, BigQuery).

What formats are supported for Folder data source?

Supported:

  • Excel, Parquet, CSV, and JSON.

What happens when a data source sync fails?

Common Reasons:

  • Incorrect S3 path.

Action:

  • Check sync logs.
  • Contact the DataGOL team via Slack.

Can I join data from multiple sources?

Yes. You can join data from multiple sources (for example, S3 and Redshift) by using custom pipelines.

Here's how it works:

  • First, make sure all your data sources are added and configured under Lakehouse > Data Sources.
  • Then, you can write your joins in Spark SQL within a custom pipeline to combine the data and push a unified result into a single warehouse.
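
A minimal sketch of such a custom pipeline query, assuming one table synced from S3 and one from Redshift (all names below are illustrative, loosely following the naming convention above):

    -- Combine an S3-sourced table with a Redshift-sourced table into one unified result
    SELECT
      o.order_id,
      o.order_date,
      c.customer_name,
      c.region
    FROM app_prod_s3.orders AS o
    JOIN app_prod_db.customers AS c
      ON o.customer_id = c.customer_id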

Lakehouse

How do permissions and access work in Lakehouse?

  • Admin: Full access to create/delete data sources, pipelines, and orchestrations.
  • Member: Can view all, but can only edit pipelines/orchestrations if explicitly invited.

Governance:

  • Producers (data team) control access to workspaces.
  • Consumers (business teams) collaborate in their own workbooks.

Pipeline

What are the available pipeline types in DataGOL, and when should each be used?

DataGOL supports three pipeline types:

  • Standard Pipelines: Used for straightforward replication of source tables with sync modes.
  • Custom Pipelines: Designed for SQL-based transformations like joins, format changes, or business logic.
  • Dedupe Pipelines: Remove duplicate records when systems (e.g., Salesforce) retain updated rows as separate entries.

Key Points:

  • Standard: Good for raw replication with minimal transformation.
  • Custom: Required when transforming time formats, handling nulls, or merging sources.
  • Dedupe: Best suited for systems that log updates as new records; used to identify and retain only the latest version of each record by removing duplicates.

Can users transform data types, nulls, or time formats in standard pipelines?

No, such transformations require custom pipelines where SQL queries can be written to clean and reformat data.

Key Points:

  • Use the Playground for SQL transformations.
  • Handles complex transformations, including cross-table joins, data type casting, and format normalization.
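
For instance, a custom pipeline query might normalize a timestamp string, cast a type, and handle NULLs in one pass (a sketch; the table and column names are illustrative):

    SELECT
      CAST(event_id AS BIGINT)                         AS event_id,  -- data type casting
      to_timestamp(event_time, 'MM/dd/yyyy HH:mm:ss')  AS event_ts,  -- time format normalization
      COALESCE(channel, 'unknown')                     AS channel    -- NULL handling
    FROM raw_events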

What are the different sync modes?

  • Full Refresh Override: Overwrites all data on each sync. Storage-intensive; use rarely.
  • Full Refresh Append: Adds all data every time (even duplicates). Needs a dedup pipeline.
  • Incremental Append: Adds only new data based on the cursor field. Most efficient; supports time travel.
  • Incremental Merge: Syncs only the changes since the last run, using a cursor, and performs an upsert (inserting new records or updating existing ones) based on a unique key. Requires an Iceberg destination.
  • Full Refresh Merge: Performs an upsert (inserting new records or updating existing ones) based on a unique key, effectively updating the target table. Requires an Iceberg destination.
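
The upsert performed by the two merge modes corresponds roughly to a MERGE INTO against the Iceberg target, keyed on the unique key. DataGOL handles this internally; the sketch below uses illustrative names:

    MERGE INTO warehouse.orders AS t
    USING staging.orders_changes AS s
      ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *       -- update existing records
    WHEN NOT MATCHED THEN INSERT *       -- insert new records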

When and how should we use dedup pipelines?

Use Case:

  • Third-party sources that resend full record updates.
  • Append modes where duplication is expected.

Setup:

  • Source: Warehouse with duplicates.
  • Destination: Cleaned warehouse.

Custom SQL logic can be applied for each stream to retain the latest records.
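
A common shape for that per-stream SQL is a window function that keeps only the most recent row per key; a sketch, assuming each record carries a unique ID and an updated_at timestamp (names are illustrative):

    -- Retain only the latest version of each record
    SELECT record_id, name, email, updated_at
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY record_id ORDER BY updated_at DESC) AS rn
      FROM dup_warehouse.contacts
    ) AS ranked
    WHERE rn = 1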

What sync modes are supported in standard pipelines?

Available sync modes include:

  • Full Refresh Override
  • Full Refresh Append
  • Incremental Append
  • Incremental Merge
  • Full Refresh Merge

Key Points:

  • Incremental Append is optimal for performance and requires a cursor field.
  • Full Refresh Merge is useful for detecting deletes via Iceberg snapshot comparison.

What are the destination formats and query engines supported by DataGOL?

Supported formats:

  • Parquet (default)
  • Iceberg (for snapshot & merge functionality)

Query Engines:

  • Spark SQL (default)
  • Athena (alternative)
  • JDBC (not explicitly exposed as an option; in any case, not recommended for large volumes)

Key Points:

  • Parquet/Iceberg are disabled when a SQL database is used as the destination.
  • Choose Athena or Spark based on scaling and partitioning needs.

Where can users inspect and debug pipeline execution?

  • Jobs Page: Shows execution status, time, and logs.
  • Streams Page: Displays selected source tables.
  • Revisions Tab: Tracks config changes.

Key Points:

  • Supports immediate runs and scheduled/cron executions.
  • Each revision preserves historical config.

Intermediate Views vs Intermediate Warehouses – Which is better for performance?

Recommendation:

Create a multi-layered data model:

  • Bronze: Raw data.
  • Silver: Cleaned, transformed.
  • Gold: Final semantic view.
[Note]

Avoid deeply nested CTEs. Instead, break the logic into multiple views using pipelines; this is more performant and maintainable.
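
As an illustration of splitting the layers into separate views rather than one deeply nested query (schema and column names are illustrative; in DataGOL each layer would typically be materialized by its own view or pipeline):

    -- Silver: cleaned view over the raw (bronze) table
    CREATE OR REPLACE VIEW silver.orders AS
    SELECT order_id,
           CAST(order_total AS DECIMAL(12, 2)) AS order_total,
           order_date
    FROM bronze.orders
    WHERE order_id IS NOT NULL;

    -- Gold: final semantic view built on the silver layer
    CREATE OR REPLACE VIEW gold.daily_revenue AS
    SELECT order_date,
           SUM(order_total) AS revenue
    FROM silver.orders
    GROUP BY order_date;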

How to update the cursor field in an incremental pipeline?

Workaround:

  1. Go to the pipeline.
  2. Remove the stream.
  3. Re-add it from the stream settings with the new cursor field.

How to manage large APIs like HubSpot if pipelines timeout?

Solutions:

  • Reduce the number of streams per pipeline.
  • Use incremental fetches with modifiedAt or updatedAt fields.
  • If the HubSpot connector does not support this, segment the workload across multiple data sources.

Parquet vs Iceberg – When to use what?

  • Parquet: Efficient analytics, low overhead.
  • Iceberg: Required for time travel/versioning.
[Tip]

Whether to use Parquet or Iceberg depends on the actual use case.

Playground

How are custom transformations implemented in DataGOL?

Using the Playground, users write SQL (Spark SQL dialect) to clean, join, and transform data.

Key Points:

  • The Playground acts as the editor to perform transformations (like data type changes) in Spark SQL.
  • Queries can be saved as views and then published to the workspace as workbooks.

Schema changes

What happens if the schema changes downstream?

Impact Areas: Workbooks and dashboards may break.

Solution: Use the Impact Analysis feature to identify affected widgets, views, and dashboards. Use Playground to test the query and update the workbook schema.

Workspaces and Workbooks

What is the role of the Workbook in the BI flow?

Workbooks are saved queries or views from the Playground. They act as the golden data layer for dashboards.

Key Points:

  • Can be published to workspaces.
  • Supports API-accessible data, which can be filtered and downloaded.

How is access and sharing managed across workspaces and workbooks?

Workspaces support role-based sharing:

  • Creator: Full control.
  • Editor: Edit access but cannot delete.
  • Viewer/Guest: Read-only; guest can't download data.

Key Points:

  • Workbooks and dashboards adhere to these roles.
  • Delete protection is available at the workbook level; when enabled, it also makes the workspace delete-protected.

Can workbook data be accessed via API?

Yes, via a cURL command using service accounts.

Key Points:

  • Data is returned in paginated JSON.
  • Default rows: 0–500 (configurable).

Can users import external data into workbooks?

Yes, via Excel or CSV import.

Key Points:

  • Supports creating a new workbook as well as Excel/CSV imports.
  • Ensure workbooks have a well-defined, clean structure for better BI compatibility.

Can filters be propagated across widgets or pages?

Yes, but linking filters across visuals (like dynamic filters) is not yet supported.

Key Points:

  • Page-level filters can be configured to include/exclude widgets.
  • Column-based linking across workbooks is supported via BI query joins.

How are dashboards created and managed?

Users drag and drop fields from loaded workbooks, configure visual properties, and pin to dashboards.

Key Points:

  • No widget limits per dashboard.
  • Widgets can be duplicated, sorted, and grouped.
  • Export to PDF is not yet supported, but on the roadmap.

What are the future enhancements in DataGOL?

Planned features include:

  • Graphical query builder
  • PDF export of dashboards
  • Improved dynamic titles in widgets
  • Filter linking across visuals