
Multi-column rules #406


Open
pwolff42 opened this issue Apr 24, 2025 · 5 comments

@pwolff42

Is there existing functionality, or are there plans on the roadmap, to support multi-column logic in a rule? This can be powerful, and it is a feature that few other frameworks offer. Great Expectations has a few examples, but they are rather limited:

https://greatexpectations.io/expectations/expect_column_pair_values_a_to_be_greater_than_b/

https://greatexpectations.io/expectations/expect_column_pair_values_to_be_in_set/

A very powerful extension of the second example would be a check that accepts an arbitrary number of columns, so that a hierarchical structure of sorts could be established, for example:

if column A is in [X], column B must be in [Y, Z],
if column A is in [X] and column B is in [Y], column C must be in [...]
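What I have in mind, expressed as row-level boolean conditions, is roughly the following (illustrative pandas sketch; the column names and allowed value sets are placeholders):

import pandas as pd

# Placeholder data; A/B/C and the allowed sets are illustrative only.
df = pd.DataFrame({"A": ["X", "X", "Q"], "B": ["Y", "W", "Z"], "C": ["M", "M", "M"]})

# "if A in {X} then B in {Y, Z}"  ==  "A not in {X} OR B in {Y, Z}"
level_1 = ~df["A"].isin(["X"]) | df["B"].isin(["Y", "Z"])
# "if A in {X} and B in {Y} then C in {M}"
level_2 = ~(df["A"].isin(["X"]) & df["B"].isin(["Y"])) | df["C"].isin(["M"])

passes = level_1 & level_2  # boolean Series: True where the hierarchy holds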

Is this the kind of rule that could be specified using satisfies, or is that only for a single column?
https://canimus.github.io/cuallee/module/check/#cuallee.Check.satisfies

levyvix commented Apr 25, 2025

from cuallee import Check

def hierarchical_rule(df):
    return ((df["A"].isin(["X"])) & (df["B"].isin(["Y", "Z"]))) | \
           ((df["A"].isin(["X"])) & (df["B"].isin(["Y"])) & (df["C"].isin(["..."])))

check = Check()
check.satisfies(hierarchical_rule, columns=["A", "B", "C"], coverage=1.0)

pwolff42 (Author) commented Apr 30, 2025

Hi @levyvix,

The snippet above does not work with satisfies, as neither columns nor coverage is a valid argument; satisfies accepts a column argument, which takes a single string.

satisfies also accepts a SQL-like predicate argument; perhaps an example of that would be helpful. I'm still not sure there is a way to reference multiple columns in this rule.
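For reference, something like the following is what I would hope to write, assuming the predicate string can reference columns other than the one passed as column (I have not verified that this is supported; column names and value sets are placeholders):

from cuallee import Check

check = Check()
# Hypothetical call: anchor column "A", SQL-like predicate that also references "B".
# Encodes "if A in (X) then B in (Y, Z)" as "A not in (X) OR B in (Y, Z)".
check.satisfies("A", "(A NOT IN ('X')) OR (B IN ('Y', 'Z'))")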

Perhaps @canimus you have an example?

canimus (Owner) commented Apr 30, 2025

Hi @pwolff42, here is a good example of the is_custom rule, which allows composite rule definitions:

import inspect
import pyspark.sql.functions as F
import pyspark.sql.types as T
from cuallee import Check
from pyspark.sql import DataFrame, SparkSession
from toolz import curry

spark = SparkSession.builder.getOrCreate()
data = [("A", 1), ("B", -1), ("B", 0), ("C", 2)]
schema = T.StructType([T.StructField("id", T.StringType(), True), T.StructField("quantity", T.IntegerType(), True)])
orders = spark.createDataFrame(data, schema=schema)
orders.show()

check = Check(name="orders_checks")
check = check.add_rule("is_unique", "id", 1)
check = check.add_rule("is_greater_than", "quantity", 0, 0.5)


# Define and add a custom check
@curry
def mean_above_threshold(df: DataFrame, column_name: str, threshold: float) -> DataFrame:
    mean_value = df.select(F.mean(column_name).alias("mean")).collect()[0]["mean"]
    is_above_threshold = mean_value > threshold
    return df.withColumn("mean_above_threshold", F.lit(is_above_threshold))

col_name = "quantity"
check = check.add_rule("is_custom", col_name, mean_above_threshold(column_name=col_name, threshold=0), 1, options={"name" : "mean_above_threshold", "custom_value": f"{col_name}>0"})


# Define a custom check function for data type validation
@curry
def is_correct_dtype(df: DataFrame, column_name: str, expected_dtype: T.DataType) -> DataFrame:
    actual_dtype = [field.dataType for field in df.schema.fields if field.name == column_name][0]
    is_dtype_correct = actual_dtype == expected_dtype
    return df.withColumn(f"{column_name}_is_dtype_correct", F.lit(is_dtype_correct))
check = check.add_rule("is_custom", "id", is_correct_dtype(column_name="id", expected_dtype=T.StringType()), 1, options={"name" : "is_correct_dtype", "custom_value": "string"})
check = check.add_rule("is_custom", "quantity", is_correct_dtype(column_name="quantity", expected_dtype=T.IntegerType()), 1, options={"name" : "is_correct_dtype", "custom_value": "integer"})

# Run the checks
output = check.validate(orders)
output.show()

# Verbose alternative to `f(x)?`
#func = check.rules[-1].value
#print(f"{func.__name__}{inspect.signature(func)}")
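Continuing from the snippet above, the same pattern could in principle be adapted to the multi-column hierarchy from the original question; here is a rough, untested sketch (column names and allowed value sets are placeholders):

# Untested sketch: encode the hierarchical A/B/C rule with is_custom.
# Each level is an implication: "not antecedent OR consequent".
def hierarchy_holds(df: DataFrame) -> DataFrame:
    level_1 = (~F.col("A").isin("X")) | F.col("B").isin("Y", "Z")
    level_2 = (~(F.col("A").isin("X") & F.col("B").isin("Y"))) | F.col("C").isin("M")
    return df.withColumn("hierarchy_holds", level_1 & level_2)

check = check.add_rule("is_custom", "A", hierarchy_holds, 1, options={"name": "hierarchy_holds", "custom_value": "A,B,C hierarchy"})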

pwolff42 (Author) commented May 1, 2025

Hi @canimus, thanks for the quick response. Am I correct in assuming, then, that these rules (satisfies, is_custom) are not dataframe agnostic?

canimus (Owner) commented May 1, 2025

Hi @pwolff42, I am confident that satisfies is covered 100% across all dataframes. However, I don't think is_custom is available across all implementations yet. I recall from closed issues and earlier conversations that it is relatively new; it was added to pandas recently, but I am afraid it may not cover all APIs. Fancy a PR?
