Multi-column rules #406
from cuallee import Check

def hierarchical_rule(df):
    return ((df["A"].isin(["X"])) & (df["B"].isin(["Y", "Z"]))) | \
           ((df["A"].isin(["X"])) & (df["B"].isin(["Y"])) & (df["C"].isin(["..."])))

check = Check()
check.satisfies(hierarchical_rule, columns=["A", "B", "C"], coverage=1.0)
Hi @levyvix, the above does not work for …
Perhaps @canimus you have an example?
Hi @pwolff42, here you can find a good example of the is_custom rule:

import inspect
import pyspark.sql.functions as F
import pyspark.sql.types as T
from cuallee import Check
from pyspark.sql import DataFrame, SparkSession
from toolz import curry
spark = SparkSession.builder.getOrCreate()
data = [("A", 1), ("B", -1), ("B", 0), ("C", 2)]
schema = T.StructType([T.StructField("id", T.StringType(), True), T.StructField("quantity", T.IntegerType(), True)])
orders = spark.createDataFrame(data, schema=schema)
orders.show()
check = Check(name="orders_checks")
check = check.add_rule("is_unique", "id", 1)
check = check.add_rule("is_greater_than", "quantity", 0, 0.5)
# Define and add a custom check
@curry
def mean_above_threshold(df: DataFrame, column_name: str, threshold: float) -> DataFrame:
    mean_value = df.select(F.mean(column_name).alias("mean")).collect()[0]["mean"]
    is_above_threshold = mean_value > threshold
    return df.withColumn("mean_above_threshold", F.lit(is_above_threshold))
col_name = "quantity"
check = check.add_rule("is_custom", col_name, mean_above_threshold(column_name=col_name, threshold=0), 1, options={"name" : "mean_above_threshold", "custom_value": f"{col_name}>0"})
# Define a custom check function for data type validation
@curry
def is_correct_dtype(df: DataFrame, column_name: str, expected_dtype: T.DataType) -> DataFrame:
    actual_dtype = [field.dataType for field in df.schema.fields if field.name == column_name][0]
    is_dtype_correct = actual_dtype == expected_dtype
    return df.withColumn(f"{column_name}_is_dtype_correct", F.lit(is_dtype_correct))
check = check.add_rule("is_custom", "id", is_correct_dtype(column_name="id", expected_dtype=T.StringType()), 1, options={"name" : "is_correct_dtype", "custom_value": "string"})
check = check.add_rule("is_custom", "quantity", is_correct_dtype(column_name="quantity", expected_dtype=T.IntegerType()), 1, options={"name" : "is_correct_dtype", "custom_value": "integer"})
# Run the checks
output = check.validate(orders)
output.show()
# Verbose alternative to `f(x)?`
# func = check.rules[-1].value
# print(f"{func.__name__}{inspect.signature(func)}")
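A minimal sketch of how the same is_custom mechanism could carry the multi-column hierarchy from the first snippet, since the custom function receives the whole DataFrame. Columns A, B, C and the "..." value are placeholders from that snippet, and it assumes, following the examples above, that a row passes when the appended boolean column is true:

@curry
def hierarchical_rule(df: DataFrame, col_a: str, col_b: str, col_c: str) -> DataFrame:
    # Pass when (A in [X] and B in [Y, Z]) or (A in [X] and B in [Y] and C in [...]);
    # "X", "Y", "Z", "..." are placeholder values from the first snippet.
    condition = (
        (F.col(col_a).isin("X") & F.col(col_b).isin("Y", "Z"))
        | (F.col(col_a).isin("X") & F.col(col_b).isin("Y") & F.col(col_c).isin("..."))
    )
    return df.withColumn("hierarchical_rule", condition)

# Hypothetical usage, on a DataFrame that actually has columns A, B, C:
# hier_check = Check(name="hierarchy_checks")
# hier_check = hier_check.add_rule("is_custom", "A", hierarchical_rule(col_a="A", col_b="B", col_c="C"), 1, options={"name": "hierarchical_rule"})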
Hi @canimus, thanks for the quick response. Correct in assuming then that these rules (…)?
Hi @pwolff42, I am confident that …
Is there current functionality, or plans on the roadmap, to facilitate multi-column logic in a rule? This can be powerful, and it is a feature that few other frameworks offer. Great Expectations has a few examples, but they are rather limited:
https://greatexpectations.io/expectations/expect_column_pair_values_a_to_be_greater_than_b/
https://greatexpectations.io/expectations/expect_column_pair_values_to_be_in_set/
A very powerful extension of the second link would be a check that accepted an arbitrary number of columns; then a hierarchical structure of sorts could be established:
if column A is in [X], column B must be in [Y, Z];
if column A is in [X] and column B is in [Y], column C must be in [...].
Is this the kind of rule that could be specified using satisfies, or is that only for a single column? https://canimus.github.io/cuallee/module/check/#cuallee.Check.satisfies
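As a rough sketch, assuming satisfies keeps the (column, predicate, pct) signature from those docs and that the predicate string may reference other columns (which is exactly the open question here), the first hierarchy condition could be written as an implication:

from cuallee import Check

check = Check()
# "if A is in [X], then B must be in [Y, Z]", rewritten as NOT(antecedent) OR consequent;
# column names and values are placeholders from the hierarchy example above.
check = check.satisfies("A", "(A NOT IN ('X')) OR (B IN ('Y', 'Z'))", 1.0)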