Apache Spark formula reference
Overview
This document provides a comprehensive reference for Apache Spark formulas supported in DataGOL. These formulas enable powerful data processing, analysis, and transformation capabilities directly within your DataGOL workflows.
What you can achieve?
-
Data aggregation: Calculate sums, averages, counts, and statistical measures
-
Text processing: Clean, format, and manipulate string data
-
Date/Time operations: Parse, format, and calculate with dates and timestamps
-
Conditional logic: Implement business rules and data validation
-
Data type conversion: Transform data between different formats
-
Array/JSON processing: Handle complex nested data structures
-
Mathematical operations: Perform calculations and statistical analysis
Best Practices
Performance tips
- Use appropriate data types - Cast data early in your pipeline.
- Filter early - Apply
WHERE
conditions before complex calculations. - Avoid nested functions - Break complex formulas into steps.
- Use COALESCE - Handle nulls explicitly to avoid errors.
Common patterns
- Data cleaning:
TRIM(UPPER(COALESCE(field, 'DEFAULT')))
- Age calculation:
DATE_DIFF(CURRENT_DATE(), birth_date) / 365
- Categorization: Use
CASE WHEN
for business logic. - Null handling: Always consider null values in calculations.
Error handling
- Use
TRY_*
functions for safe operations:TRY_DIVIDE
,TRY_TO_NUMBER
. - Test formulas with sample data before production use.
- Validate data types and ranges.
Formula categories
Category | Primary use | Key functions |
---|---|---|
Aggregation | Summarize data | COUNT, SUM, AVG, MIN, MAX |
Text Processing | Clean/format strings | CONCAT, TRIM, UPPER, REPLACE |
Date/Time | Handle temporal data | DATE_ADD, DATE_DIFF, EXTRACT |
Conditional | Business logic | IF, CASE WHEN, COALESCE |
Conversion | Data type changes | CAST, TO_DATE, STRING |
Mathematical | Calculations | ROUND, ABS, POWER, MOD |
Pattern Matching | Find/replace patterns | LIKE, REGEXP_REPLACE |
Array/JSON | Complex data types | SIZE, GET, FLATTEN |
This reference provides the foundation for powerful data processing in DataGOL using Apache Spark. Start with simple formulas and gradually combine them for more complex data transformations.
Aggregate functions
COUNT
Purpose:
Count rows or non-null values.
Usage:
* COUNT(column_name)
* COUNT(*)
Example:
Returns number of customers
COUNT(customer_id)
Use Case:
Calculate total records, active users, or non-empty fields.
SUM
Purpose:
Calculate the total of numeric values.
Usage:
SUM(column_name)
Example:
Returns total sales revenue
SUM(sales_amount)
Use Case:
Use for financial totals, quantity calculations, or performance metrics.
AVG / MEAN
Purpose:
Calculate the average value of a set of numbers.
Usage:
* AVG(column_name)
* MEAN(column_name)
Example:
Returns average order size
AVG(order_value)
Use Case:
Useful for performance benchmarks, trend analysis, and KPI calculations.
MIN / MAX
Purpose:
Find the minimum or maximum values within a dataset.
Usage:
* MIN(column_name)
* MAX(column_name)
Example:
Returns the latest transaction date
MAX(transaction_date)
Use Case:
Apply for data quality checks, range analysis, and finding extremes.
MEDIAN
Purpose:
Find the middle value in sorted data.
Usage:
MEDIAN(column_name)
Example:
Returns the middle salary value
MEDIAN(salary)
Use Case:
Good for statistical analysis, outlier detection, and fair comparisons.
PERCENTILE
Purpose:
Calculate exact percentile values from a dataset.
Usage:
PERCENTILE(column_name, percentile_value)
Example:
Returns the 95th percentile response time
PERCENTILE(response_time, 0.95)
Use Case:
Crucial for performance SLAs, quality metrics, and distribution analysis.
Text/String functions
CONCAT / CONCAT_WS
Purpose:
Join strings together.
Usage:
* CONCAT(str1, str2)
* CONCAT_WS(separator, str1, str2)
Example:
Returns "John Smith"
CONCAT_WS(' ', first_name, last_name)
Use Case:
Use to create full names, build identifiers, or format display text.
UPPER / LOWER
Purpose:
Change the text case of a string.
Usage:
* UPPER(text)
* LOWER(text)
Example:
Returns "USA"
UPPER(country_code)
Use Case:
Useful for data standardization and case-insensitive comparisons.
TRIM
Purpose:
Remove leading/trailing spaces from a string.
Usage:
TRIM(text)
Example:
Returns clean customer names
TRIM(customer_name)
Use Case:
Primarily used for data cleaning and removing extra whitespace.
SUBSTRING
Purpose:
Extract part of a string.
Usage:
SUBSTRING(text, start_position, length)
Example:
Returns the area code from a phone number
SUBSTRING(phone_number, 1, 3)
Use Case:
Effective for extracting codes, parsing structured text, and data validation.
REPLACE
Purpose:
Replace specific text patterns within a string.
Usage:
REPLACE(text, old_value, new_value)
Example:
Removes dashes from a phone number
REPLACE(phone, '-', '')
Use Case:
Common for data cleaning and format standardization.
SPLIT
Purpose:
Split a string into an array of substrings.
Usage:
SPLIT(text, delimiter)
Example:
Returns an array of address parts
SPLIT(full_address, ',')
Use Case:
Ideal for parsing delimited data and extracting components.
CONTAINS / STARTSWITH
Purpose:
Check if text contains specific patterns or prefixes.
Usage:
* CONTAINS(text, pattern)
* STARTSWITH(text, prefix)
Example:
Returns TRUE/FALSE indicating if an email contains "@gmail.com"
CONTAINS(email, '@gmail.com')
Use Case:
Used for data filtering, pattern matching, and validation.
Date/Time functions
CURRENT_DATE / CURRENT_TIMESTAMP
Purpose:
Get the current date/time.
Usage:
* CURRENT_DATE()
* CURRENT_TIMESTAMP()
Example:
Returns the current date, e.g., 2023-01-01
CURRENT_DATE()
Use Case:
Useful to timestamp records, calculate age, or track processing time.
DATE_ADD / DATE_DIFF
Purpose:
Add days to a date or calculate the difference between two dates.
Usage:
* DATE_ADD(date, days)
* DATE_DIFF(end_date, start_date)
Example:
Returns the number of days since an order
DATE_DIFF(order_date, CURRENT_DATE())
Use Case:
Good for calculating aging, setting due dates, and measuring durations.
DATE_FORMAT
Purpose:
Format dates as strings according to a specified pattern.
Usage:
DATE_FORMAT(date, format)
Example:
Returns "2023-01"
DATE_FORMAT(order_date, 'yyyy-MM')
Use Case:
Helpful for display formatting, grouping by period, and reporting.
EXTRACT / DATE_PART
Purpose:
Extract specific components (e.g., year, month) from a date.
Usage:
* EXTRACT(part FROM date)
* DATE_PART(part, date)
Example:
Returns the year from an order date, e.g., 2023
EXTRACT(YEAR FROM order_date)
Use Case:
Essential for time-based analysis and grouping by periods.
LAST_DAY
Purpose:
Get the last day of the month for a given date.
Usage:
LAST_DAY(date)
Example:
Returns the last day of the order month
LAST_DAY(order_date)
Use Case:
Useful for month-end calculations and defining period boundaries.
Conditional Logic
IF
Purpose:
Apply simple conditional logic.
Usage:
IF(condition, true_value, false_value)
Example:
Categorizes based on sales: 'High' if sales > 1000, else 'Low'
IF(sales > 1000, 'High', 'Low')
Use Case:
Used for data categorization, flag creation, and implementing business rules.
CASE WHEN
Purpose:
Apply multiple condition logic.
Usage:
CASE
WHEN condition1 THEN value1
WHEN condition2 THEN value2
ELSE default
END
Example:
Assigns a grade based on score: 'A' for >= 90, 'B' for >= 80, else 'C'
CASE
WHEN score >= 90 THEN 'A'
WHEN score >= 80 THEN 'B'
ELSE 'C'
END
Use Case:
Ideal for complex categorization, multi-tier logic, and scoring systems.
COALESCE
Purpose:
Return the first non-null value from a list.
Usage:
COALESCE(value1, value2, default)
Example:
Returns the mobile phone, home phone, or 'No Phone' if both are null
COALESCE(mobile_phone, home_phone, 'No Phone')
Use Case:
Helps to handle missing data, provide defaults, and ensure data completeness.
NULLIF
Purpose:
Return null if two values are equal.
Usage:
NULLIF(value1, value2)
Example:
Returns NULL if the discount is 0
NULLIF(discount, 0)
Use Case:
Useful for converting zeros to nulls or creating conditional nulls.
Data Type Conversion
CAST functions
Purpose:
Convert between different data types.
Usage:
* INT(value)
* FLOAT(value)
* STRING(value)
* BOOLEAN(value)
Example:
Converts sales_amount
to an integer
INT(sales_amount)
Use Case:
Ensures data type consistency, enables calculations, and aids in data validation.
TO_DATE / TO_TIMESTAMP
Purpose:
Parse strings into date or timestamp values.
Usage:
TO_DATE(date_string, format)
Example:
Converts '2023-01-01' to a date value
TO_DATE('2023-01-01', 'yyyy-MM-dd')
Use Case:
Useful for importing data, parsing text dates, and data integration.
Mathematical function
ROUND / CEIL / FLOOR
Purpose:
Round numbers in various ways.
Usage:
* ROUND(number, decimals)
* CEIL(number)
* FLOOR(number)
Example:
Rounds price
to 2 decimal places
ROUND(price, 2)
Use Case:
Applied in financial calculations, display formatting, and data binning.
ABS / NEGATIVE
Purpose:
Get the absolute value of a number or its negation.
Usage:
* ABS(number)
* NEGATIVE(number)
Example:
Returns an always positive value for profit_loss
ABS(profit_loss)
Use Case:
Used for distance calculations and magnitude analysis.
MOD
Purpose:
Calculate the remainder after division.
Usage:
MOD(number, divisor)
Example:
Returns the last digit of an ID
MOD(customer_id, 10)
Use Case:
Useful for data sampling, identifying cyclic patterns, and partitioning.
POWER / SQRT
Purpose:
Calculate the power of a number or its square root.
Usage:
* POWER(base, exponent)
* SQRT(number)
Example:
Returns the side length of a square from its area
SQRT(area)
Use Case:
Common in scientific calculations and geometric formulas.
ARRAY functions
ARRAY operations
Purpose:
Work with array data.
Usage:
* SIZE(array)
* GET(array, index)
* CONTAINS(array, value)
Example:
Returns the number of tags in an array
SIZE(tags_array)
Use Case:
Used to handle multi-value fields, analyze lists, and perform data validation.
FLATTEN
Purpose:
Convert nested arrays into a single array.
Usage:
FLATTEN(nested_array)
Example:
Converts nested categories into a single list
FLATTEN(categories)
Use Case:
Helps to simplify complex data and prepare it for analysis.
JSON functions
JSON Processing
Purpose:
Extract data from JSON objects.
Usage:
* GET_JSON_OBJECT(json, path)
* FROM_JSON(json, schema)
Example:
Extracts the email from user data in JSON format
GET_JSON_OBJECT(user_data, '$.email')
Use Case:
Crucial for API data processing and semi-structured data analysis.
Pattern Matching
LIKE / ILIKE
Purpose:
Perform pattern matching (case-sensitive or insensitive).
Usage:
* LIKE(text, pattern)
* ILIKE(text, pattern)
Example:
Checks if product_name
contains "iPhone"
LIKE(product_name, '%iPhone%')
Use Case:
Used for text search, data filtering, and pattern detection.
REGEXP functions
Purpose:
Perform regular expression matching and manipulation.
Usage:
REGEXP_REPLACE(text, pattern, replacement)
Example:
Removes non-numeric characters from a phone number, leaving only numbers
REGEXP_REPLACE(phone, '[^0-9]', '')
Use Case:
Essential for data cleaning, format validation, and complex text processing.