Skip to main content

Apache Spark formula reference

Overview

This document provides a comprehensive reference for Apache Spark formulas supported in DataGOL. These formulas enable powerful data processing, analysis, and transformation capabilities directly within your DataGOL workflows.

What you can achieve?

  • Data aggregation: Calculate sums, averages, counts, and statistical measures

  • Text processing: Clean, format, and manipulate string data

  • Date/Time operations: Parse, format, and calculate with dates and timestamps

  • Conditional logic: Implement business rules and data validation

  • Data type conversion: Transform data between different formats

  • Array/JSON processing: Handle complex nested data structures

  • Mathematical operations: Perform calculations and statistical analysis

Best Practices

Performance tips

  • Use appropriate data types - Cast data early in your pipeline.
  • Filter early - Apply WHERE conditions before complex calculations.
  • Avoid nested functions - Break complex formulas into steps.
  • Use COALESCE - Handle nulls explicitly to avoid errors.

Common patterns

  • Data cleaning: TRIM(UPPER(COALESCE(field, 'DEFAULT')))
  • Age calculation: DATE_DIFF(CURRENT_DATE(), birth_date) / 365
  • Categorization: Use CASE WHEN for business logic.
  • Null handling: Always consider null values in calculations.

Error handling

  • Use TRY_* functions for safe operations: TRY_DIVIDE, TRY_TO_NUMBER.
  • Test formulas with sample data before production use.
  • Validate data types and ranges.

Formula categories

CategoryPrimary useKey functions
AggregationSummarize dataCOUNT, SUM, AVG, MIN, MAX
Text ProcessingClean/format stringsCONCAT, TRIM, UPPER, REPLACE
Date/TimeHandle temporal dataDATE_ADD, DATE_DIFF, EXTRACT
ConditionalBusiness logicIF, CASE WHEN, COALESCE
ConversionData type changesCAST, TO_DATE, STRING
MathematicalCalculationsROUND, ABS, POWER, MOD
Pattern MatchingFind/replace patternsLIKE, REGEXP_REPLACE
Array/JSONComplex data typesSIZE, GET, FLATTEN

This reference provides the foundation for powerful data processing in DataGOL using Apache Spark. Start with simple formulas and gradually combine them for more complex data transformations.

Aggregate functions

COUNT


Purpose:

Count rows or non-null values.

Usage:

* COUNT(column_name)
* COUNT(*)

Example:

Returns number of customers

COUNT(customer_id)

Use Case:

Calculate total records, active users, or non-empty fields.

SUM


Purpose:

Calculate the total of numeric values.

Usage:

SUM(column_name)

Example:

Returns total sales revenue

SUM(sales_amount)

Use Case:

Use for financial totals, quantity calculations, or performance metrics.

AVG / MEAN


Purpose:

Calculate the average value of a set of numbers.

Usage:

* AVG(column_name)
* MEAN(column_name)

Example:

Returns average order size

AVG(order_value)

Use Case:

Useful for performance benchmarks, trend analysis, and KPI calculations.

MIN / MAX


Purpose:

Find the minimum or maximum values within a dataset.

Usage:

* MIN(column_name)
* MAX(column_name)

Example:

Returns the latest transaction date

MAX(transaction_date)

Use Case:

Apply for data quality checks, range analysis, and finding extremes.

MEDIAN


Purpose:

Find the middle value in sorted data.

Usage:

MEDIAN(column_name)

Example:

Returns the middle salary value

MEDIAN(salary)

Use Case:

Good for statistical analysis, outlier detection, and fair comparisons.

PERCENTILE


Purpose:

Calculate exact percentile values from a dataset.

Usage:

PERCENTILE(column_name, percentile_value)

Example:

Returns the 95th percentile response time

PERCENTILE(response_time, 0.95)

Use Case:

Crucial for performance SLAs, quality metrics, and distribution analysis.

Text/String functions

CONCAT / CONCAT_WS


Purpose:

Join strings together.

Usage:

* CONCAT(str1, str2)
* CONCAT_WS(separator, str1, str2)

Example:

Returns "John Smith"

CONCAT_WS(' ', first_name, last_name)

Use Case:

Use to create full names, build identifiers, or format display text.

UPPER / LOWER


Purpose:

Change the text case of a string.

Usage:

* UPPER(text)
* LOWER(text)

Example:

Returns "USA"

UPPER(country_code)

Use Case:

Useful for data standardization and case-insensitive comparisons.

TRIM


Purpose:

Remove leading/trailing spaces from a string.

Usage:

TRIM(text)

Example:

Returns clean customer names

TRIM(customer_name)

Use Case:

Primarily used for data cleaning and removing extra whitespace.

SUBSTRING


Purpose:

Extract part of a string.

Usage:

SUBSTRING(text, start_position, length)

Example:

Returns the area code from a phone number

SUBSTRING(phone_number, 1, 3)

Use Case:

Effective for extracting codes, parsing structured text, and data validation.

REPLACE


Purpose:

Replace specific text patterns within a string.

Usage:

REPLACE(text, old_value, new_value)

Example:

Removes dashes from a phone number

REPLACE(phone, '-', '')

Use Case:

Common for data cleaning and format standardization.

SPLIT


Purpose:

Split a string into an array of substrings.

Usage:

SPLIT(text, delimiter)

Example:

Returns an array of address parts

SPLIT(full_address, ',')

Use Case:

Ideal for parsing delimited data and extracting components.

CONTAINS / STARTSWITH


Purpose:

Check if text contains specific patterns or prefixes.

Usage:

* CONTAINS(text, pattern)
* STARTSWITH(text, prefix)

Example:

Returns TRUE/FALSE indicating if an email contains "@gmail.com"

CONTAINS(email, '@gmail.com')

Use Case:

Used for data filtering, pattern matching, and validation.

Date/Time functions

CURRENT_DATE / CURRENT_TIMESTAMP


Purpose:

Get the current date/time.

Usage:

* CURRENT_DATE()
* CURRENT_TIMESTAMP()

Example:

Returns the current date, e.g., 2023-01-01

CURRENT_DATE()

Use Case:

Useful to timestamp records, calculate age, or track processing time.

DATE_ADD / DATE_DIFF


Purpose:

Add days to a date or calculate the difference between two dates.

Usage:

* DATE_ADD(date, days)
* DATE_DIFF(end_date, start_date)

Example:

Returns the number of days since an order

DATE_DIFF(order_date, CURRENT_DATE())

Use Case:

Good for calculating aging, setting due dates, and measuring durations.

DATE_FORMAT


Purpose:

Format dates as strings according to a specified pattern.

Usage:

DATE_FORMAT(date, format)

Example:

Returns "2023-01"

DATE_FORMAT(order_date, 'yyyy-MM')

Use Case:

Helpful for display formatting, grouping by period, and reporting.

EXTRACT / DATE_PART


Purpose:

Extract specific components (e.g., year, month) from a date.

Usage:

* EXTRACT(part FROM date)
* DATE_PART(part, date)

Example:

Returns the year from an order date, e.g., 2023

EXTRACT(YEAR FROM order_date)

Use Case:

Essential for time-based analysis and grouping by periods.

LAST_DAY


Purpose:

Get the last day of the month for a given date.

Usage:

LAST_DAY(date)

Example:

Returns the last day of the order month

LAST_DAY(order_date)

Use Case:

Useful for month-end calculations and defining period boundaries.

Conditional Logic

IF


Purpose:

Apply simple conditional logic.

Usage:

IF(condition, true_value, false_value)

Example:

Categorizes based on sales: 'High' if sales > 1000, else 'Low'

IF(sales > 1000, 'High', 'Low')

Use Case:

Used for data categorization, flag creation, and implementing business rules.

CASE WHEN


Purpose:

Apply multiple condition logic.

Usage:

CASE
WHEN condition1 THEN value1
WHEN condition2 THEN value2
ELSE default
END

Example:

Assigns a grade based on score: 'A' for >= 90, 'B' for >= 80, else 'C'

CASE
WHEN score >= 90 THEN 'A'
WHEN score >= 80 THEN 'B'
ELSE 'C'
END

Use Case:

Ideal for complex categorization, multi-tier logic, and scoring systems.

COALESCE


Purpose:

Return the first non-null value from a list.

Usage:

COALESCE(value1, value2, default)

Example:

Returns the mobile phone, home phone, or 'No Phone' if both are null

COALESCE(mobile_phone, home_phone, 'No Phone')

Use Case:

Helps to handle missing data, provide defaults, and ensure data completeness.

NULLIF


Purpose:

Return null if two values are equal.

Usage:

NULLIF(value1, value2)

Example:

Returns NULL if the discount is 0

NULLIF(discount, 0)

Use Case:

Useful for converting zeros to nulls or creating conditional nulls.

Data Type Conversion

CAST functions


Purpose:

Convert between different data types.

Usage:

* INT(value)
* FLOAT(value)
* STRING(value)
* BOOLEAN(value)

Example:

Converts sales_amount to an integer

INT(sales_amount)

Use Case:

Ensures data type consistency, enables calculations, and aids in data validation.

TO_DATE / TO_TIMESTAMP


Purpose:

Parse strings into date or timestamp values.

Usage:

TO_DATE(date_string, format)

Example:

Converts '2023-01-01' to a date value

TO_DATE('2023-01-01', 'yyyy-MM-dd')

Use Case:

Useful for importing data, parsing text dates, and data integration.

Mathematical function

ROUND / CEIL / FLOOR


Purpose:

Round numbers in various ways.

Usage:

* ROUND(number, decimals)
* CEIL(number)
* FLOOR(number)

Example:

Rounds price to 2 decimal places

ROUND(price, 2)

Use Case:

Applied in financial calculations, display formatting, and data binning.

ABS / NEGATIVE


Purpose:

Get the absolute value of a number or its negation.

Usage:

* ABS(number)
* NEGATIVE(number)

Example:

Returns an always positive value for profit_loss

ABS(profit_loss)

Use Case:

Used for distance calculations and magnitude analysis.

MOD


Purpose:

Calculate the remainder after division.

Usage:

MOD(number, divisor)

Example:

Returns the last digit of an ID

MOD(customer_id, 10)

Use Case:

Useful for data sampling, identifying cyclic patterns, and partitioning.

POWER / SQRT


Purpose:

Calculate the power of a number or its square root.

Usage:

* POWER(base, exponent)
* SQRT(number)

Example:

Returns the side length of a square from its area

SQRT(area)

Use Case:

Common in scientific calculations and geometric formulas.

ARRAY functions

ARRAY operations


Purpose:

Work with array data.

Usage:

* SIZE(array)
* GET(array, index)
* CONTAINS(array, value)

Example:

Returns the number of tags in an array

SIZE(tags_array)

Use Case:

Used to handle multi-value fields, analyze lists, and perform data validation.

FLATTEN


Purpose:

Convert nested arrays into a single array.

Usage:

FLATTEN(nested_array)

Example:

Converts nested categories into a single list

FLATTEN(categories)

Use Case:

Helps to simplify complex data and prepare it for analysis.

JSON functions

JSON Processing


Purpose:

Extract data from JSON objects.

Usage:

* GET_JSON_OBJECT(json, path)
* FROM_JSON(json, schema)

Example:

Extracts the email from user data in JSON format

GET_JSON_OBJECT(user_data, '$.email')

Use Case:

Crucial for API data processing and semi-structured data analysis.

Pattern Matching


LIKE / ILIKE

Purpose:

Perform pattern matching (case-sensitive or insensitive).

Usage:

* LIKE(text, pattern)
* ILIKE(text, pattern)

Example:

Checks if product_name contains "iPhone"

LIKE(product_name, '%iPhone%')

Use Case:

Used for text search, data filtering, and pattern detection.

REGEXP functions


Purpose:

Perform regular expression matching and manipulation.

Usage:

REGEXP_REPLACE(text, pattern, replacement)

Example:

Removes non-numeric characters from a phone number, leaving only numbers

REGEXP_REPLACE(phone, '[^0-9]', '')

Use Case:

Essential for data cleaning, format validation, and complex text processing.