Apache Spark formula reference
Overview
This document provides a comprehensive reference for Apache Spark formulas supported in DataGOL. These formulas enable powerful data processing, analysis, and transformation capabilities directly within your DataGOL workflows.
What you can achieve?
-
Data aggregation: Calculate sums, averages, counts, and statistical measures
-
Text processing: Clean, format, and manipulate string data
-
Date/Time operations: Parse, format, and calculate with dates and timestamps
-
Conditional logic: Implement business rules and data validation
-
Data type conversion: Transform data between different formats
-
Array/JSON processing: Handle complex nested data structures
-
Mathematical operations: Perform calculations and statistical analysis
Best Practices
Performance tips
- Use appropriate data types - Cast data early in your pipeline.
- Filter early - Apply
WHEREconditions before complex calculations. - Avoid nested functions - Break complex formulas into steps.
- Use COALESCE - Handle nulls explicitly to avoid errors.
Common patterns
- Data cleaning:
TRIM(UPPER(COALESCE(field, 'DEFAULT'))) - Age calculation:
DATE_DIFF(CURRENT_DATE(), birth_date) / 365 - Categorization: Use
CASE WHENfor business logic. - Null handling: Always consider null values in calculations.
Error handling
- Use
TRY_*functions for safe operations:TRY_DIVIDE,TRY_TO_NUMBER. - Test formulas with sample data before production use.
- Validate data types and ranges.
Formula categories
| Category | Primary use | Key functions |
|---|---|---|
| Aggregation | Summarize data | COUNT, SUM, AVG, MIN, MAX |
| Text Processing | Clean/format strings | CONCAT, TRIM, UPPER, REPLACE |
| Date/Time | Handle temporal data | DATE_ADD, DATE_DIFF, EXTRACT |
| Conditional | Business logic | IF, CASE WHEN, COALESCE |
| Conversion | Data type changes | CAST, TO_DATE, STRING |
| Mathematical | Calculations | ROUND, ABS, POWER, MOD |
| Pattern Matching | Find/replace patterns | LIKE, REGEXP_REPLACE |
| Array/JSON | Complex data types | SIZE, GET, FLATTEN |
This reference provides the foundation for powerful data processing in DataGOL using Apache Spark. Start with simple formulas and gradually combine them for more complex data transformations.
Aggregate functions
COUNT
Purpose:
Count rows or non-null values.
Usage:
* COUNT(column_name)
* COUNT(*)
Example:
Returns number of customers
COUNT(customer_id)
Use Case:
Calculate total records, active users, or non-empty fields.
SUM
Purpose:
Calculate the total of numeric values.
Usage:
SUM(column_name)
Example:
Returns total sales revenue
SUM(sales_amount)
Use Case:
Use for financial totals, quantity calculations, or performance metrics.
AVG / MEAN
Purpose:
Calculate the average value of a set of numbers.
Usage:
* AVG(column_name)
* MEAN(column_name)
Example:
Returns average order size
AVG(order_value)
Use Case:
Useful for performance benchmarks, trend analysis, and KPI calculations.
MIN / MAX
Purpose:
Find the minimum or maximum values within a dataset.
Usage:
* MIN(column_name)
* MAX(column_name)
Example:
Returns the latest transaction date
MAX(transaction_date)
Use Case:
Apply for data quality checks, range analysis, and finding extremes.
MEDIAN
Purpose:
Find the middle value in sorted data.
Usage:
MEDIAN(column_name)
Example:
Returns the middle salary value
MEDIAN(salary)
Use Case:
Good for statistical analysis, outlier detection, and fair comparisons.
PERCENTILE
Purpose:
Calculate exact percentile values from a dataset.
Usage:
PERCENTILE(column_name, percentile_value)
Example:
Returns the 95th percentile response time
PERCENTILE(response_time, 0.95)
Use Case:
Crucial for performance SLAs, quality metrics, and distribution analysis.
Text/String functions
CONCAT / CONCAT_WS
Purpose:
Join strings together.
Usage:
* CONCAT(str1, str2)
* CONCAT_WS(separator, str1, str2)
Example:
Returns "John Smith"
CONCAT_WS(' ', first_name, last_name)
Use Case:
Use to create full names, build identifiers, or format display text.
UPPER / LOWER
Purpose:
Change the text case of a string.
Usage:
* UPPER(text)
* LOWER(text)
Example:
Returns "USA"
UPPER(country_code)
Use Case:
Useful for data standardization and case-insensitive comparisons.
TRIM
Purpose:
Remove leading/trailing spaces from a string.
Usage:
TRIM(text)
Example:
Returns clean customer names
TRIM(customer_name)
Use Case:
Primarily used for data cleaning and removing extra whitespace.
SUBSTRING
Purpose:
Extract part of a string.
Usage:
SUBSTRING(text, start_position, length)
Example:
Returns the area code from a phone number
SUBSTRING(phone_number, 1, 3)
Use Case:
Effective for extracting codes, parsing structured text, and data validation.
REPLACE
Purpose:
Replace specific text patterns within a string.
Usage:
REPLACE(text, old_value, new_value)
Example:
Removes dashes from a phone number
REPLACE(phone, '-', '')
Use Case:
Common for data cleaning and format standardization.
SPLIT
Purpose:
Split a string into an array of substrings.
Usage:
SPLIT(text, delimiter)
Example:
Returns an array of address parts
SPLIT(full_address, ',')
Use Case:
Ideal for parsing delimited data and extracting components.
CONTAINS / STARTSWITH
Purpose:
Check if text contains specific patterns or prefixes.
Usage:
* CONTAINS(text, pattern)
* STARTSWITH(text, prefix)
Example:
Returns TRUE/FALSE indicating if an email contains "@gmail.com"
CONTAINS(email, '@gmail.com')
Use Case:
Used for data filtering, pattern matching, and validation.
Date/Time functions
CURRENT_DATE / CURRENT_TIMESTAMP
Purpose:
Get the current date/time.
Usage:
* CURRENT_DATE()
* CURRENT_TIMESTAMP()
Example:
Returns the current date, e.g., 2023-01-01
CURRENT_DATE()
Use Case:
Useful to timestamp records, calculate age, or track processing time.
DATE_ADD / DATE_DIFF
Purpose:
Add days to a date or calculate the difference between two dates.
Usage:
* DATE_ADD(date, days)
* DATE_DIFF(end_date, start_date)
Example:
Returns the number of days since an order
DATE_DIFF(order_date, CURRENT_DATE())
Use Case:
Good for calculating aging, setting due dates, and measuring durations.
DATE_FORMAT
Purpose:
Format dates as strings according to a specified pattern.
Usage:
DATE_FORMAT(date, format)
Example:
Returns "2023-01"
DATE_FORMAT(order_date, 'yyyy-MM')
Use Case:
Helpful for display formatting, grouping by period, and reporting.
EXTRACT / DATE_PART
Purpose:
Extract specific components (e.g., year, month) from a date.
Usage:
* EXTRACT(part FROM date)
* DATE_PART(part, date)
Example:
Returns the year from an order date, e.g., 2023
EXTRACT(YEAR FROM order_date)