Multi-column rules #406
from cuallee import Check

def hierarchical_rule(df):
    return ((df["A"].isin(["X"])) & (df["B"].isin(["Y", "Z"]))) | \
           ((df["A"].isin(["X"])) & (df["B"].isin(["Y"])) & (df["C"].isin(["..."])))

check = Check()
check.satisfies(hierarchical_rule, columns=["A", "B", "C"], coverage=1.0)
Hi @levyvix, the above does not work for …
Perhaps @canimus you have an example?
Hi @pwolff42, here you can find a good example of the is_custom rule:

import inspect
import pyspark.sql.functions as F
import pyspark.sql.types as T
from cuallee import Check
from pyspark.sql import DataFrame, SparkSession
from toolz import curry
spark = SparkSession.builder.getOrCreate()
data = [("A", 1), ("B", -1), ("B", 0), ("C", 2)]
schema = T.StructType([T.StructField("id", T.StringType(), True), T.StructField("quantity", T.IntegerType(), True)])
orders = spark.createDataFrame(data, schema=schema)
orders.show()
check = Check(name="orders_checks")
check = check.add_rule("is_unique", "id", 1)
check = check.add_rule("is_greater_than", "quantity", 0, 0.5)
# Define and add a custom check
@curry
def mean_above_threshold(df: DataFrame, column_name: str, threshold: float) -> DataFrame:
    mean_value = df.select(F.mean(column_name).alias("mean")).collect()[0]["mean"]
    is_above_threshold = mean_value > threshold
    return df.withColumn("mean_above_threshold", F.lit(is_above_threshold))
col_name = "quantity"
check = check.add_rule("is_custom", col_name, mean_above_threshold(column_name=col_name, threshold=0), 1, options={"name" : "mean_above_threshold", "custom_value": f"{col_name}>0"})
# Define a custom check function for data type validation
@curry
def is_correct_dtype(df: DataFrame, column_name: str, expected_dtype: T.DataType) -> DataFrame:
    actual_dtype = [field.dataType for field in df.schema.fields if field.name == column_name][0]
    is_dtype_correct = actual_dtype == expected_dtype
    return df.withColumn(f"{column_name}_is_dtype_correct", F.lit(is_dtype_correct))
check = check.add_rule("is_custom", "id", is_correct_dtype(column_name="id", expected_dtype=T.StringType()), 1, options={"name" : "is_correct_dtype", "custom_value": "string"})
check = check.add_rule("is_custom", "quantity", is_correct_dtype(column_name="quantity", expected_dtype=T.IntegerType()), 1, options={"name" : "is_correct_dtype", "custom_value": "integer"})
# Run the checks
output = check.validate(orders)
output.show()
# Verbose alternative to `f(x)?`
# func = check.rules[-1].value
# print(f"{func.__name__}{inspect.signature(func)}")
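A minimal sketch of how the same is_custom mechanism could carry the multi-column hierarchy from the first snippet, since the custom function receives the whole DataFrame. Columns A, B, C and the "..." value are placeholders from that snippet, and it assumes, following the examples above, that a row passes when the appended boolean column is true:

@curry
def hierarchical_rule(df: DataFrame, col_a: str, col_b: str, col_c: str) -> DataFrame:
    # Pass when (A in [X] and B in [Y, Z]) or (A in [X] and B in [Y] and C in [...]);
    # "X", "Y", "Z", "..." are placeholder values from the first snippet.
    condition = (
        (F.col(col_a).isin("X") & F.col(col_b).isin("Y", "Z"))
        | (F.col(col_a).isin("X") & F.col(col_b).isin("Y") & F.col(col_c).isin("..."))
    )
    return df.withColumn("hierarchical_rule", condition)

# Hypothetical usage, on a DataFrame that actually has columns A, B, C:
# hier_check = Check(name="hierarchy_checks")
# hier_check = hier_check.add_rule("is_custom", "A", hierarchical_rule(col_a="A", col_b="B", col_c="C"), 1, options={"name": "hierarchical_rule"})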
Hi @canimus, thanks for the quick response. Correct in assuming then that these rules (…)?
Hi @pwolff42, I am confident that …
Is there current functionality, or plans on the roadmap, to facilitate multi-column logic in a rule? This can be powerful, and it is a feature that few other frameworks offer. Great Expectations has a few examples, but they are rather limited:
https://greatexpectations.io/expectations/expect_column_pair_values_a_to_be_greater_than_b/
https://greatexpectations.io/expectations/expect_column_pair_values_to_be_in_set/
A very powerful extension of the second link would be a check that accepted an arbitrary number of columns; then a hierarchical structure of sorts could be established:
if column A is in [X], column B must be in [Y, Z];
if column A is in [X] and column B is in [Y], column C must be in [...].
Is this the kind of rule that could be specified using satisfies, or is that only for a single column? https://canimus.github.io/cuallee/module/check/#cuallee.Check.satisfies
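As a rough sketch, assuming satisfies keeps the (column, predicate, pct) signature from those docs and that the predicate string may reference other columns (which is exactly the open question here), the first hierarchy condition could be written as an implication:

from cuallee import Check

check = Check()
# "if A is in [X], then B must be in [Y, Z]", rewritten as NOT(antecedent) OR consequent;
# column names and values are placeholders from the hierarchy example above.
check = check.satisfies("A", "(A NOT IN ('X')) OR (B IN ('Y', 'Z'))", 1.0)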