Add support to compare.matches() to accept optional threshold #400

shreya-goddu · 2025-04-08T18:51:36Z

We have a number of use cases where a 100% row match isn't required and 90%, or another configurable value, is permissible. Could support be added for a custom threshold, so that applications can fail a comparison when that threshold isn't met?

rhaffar · 2025-04-10T17:20:08Z

Could you share a bit more about your use case? It sounds like you're using datacompy for programmatic testing purposes, but it's really mainly intended to be used for manual data comparison.

shreya-goddu · 2025-04-10T20:11:11Z

Yea! We would integrate this library with an application that runs in production. The goal is to be able to programmatically use the tool to determine if dataframes match.

Right now we are able to do something like this:

compare = SparkSQLCompare(base_df=df1, compare_df=df2, join_col='col1')
if not compare.matches():
    raise Exception()

The above will do a 100% row match and fail when the dataframes have even one record that's different. For some use cases, a 100% match isn't required and it's permissible to have 90% of the rows match. We are looking to see if DataComPy could support such a scenario.
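To make the request concrete, here is a minimal sketch of the kind of threshold check being asked for. It is plain Python over lists of row dicts, not DataComPy code: `row_match_ratio`, `assert_matches`, and the `threshold` parameter are hypothetical names used only for illustration, and rows that exist only in the compare side are ignored here for simplicity.

```python
def row_match_ratio(base_rows, compare_rows, join_col):
    """Fraction of base rows with an identical row in compare_rows, joined on join_col."""
    # Index the compare side by join key (assumes keys are unique).
    compare_by_key = {row[join_col]: row for row in compare_rows}
    if not base_rows:
        return 1.0
    matched = sum(
        1 for row in base_rows if compare_by_key.get(row[join_col]) == row
    )
    return matched / len(base_rows)


def assert_matches(base_rows, compare_rows, join_col, threshold=1.0):
    """Raise if the share of matching rows falls below threshold (hypothetical helper)."""
    ratio = row_match_ratio(base_rows, compare_rows, join_col)
    if ratio < threshold:
        raise ValueError(
            f"only {ratio:.0%} of rows match, below threshold {threshold:.0%}"
        )
    return ratio


# Ten rows, one of which differs between the two sides.
base = [{"col1": i, "value": i * 10} for i in range(10)]
compare = [dict(row) for row in base]
compare[3]["value"] = -1

assert_matches(base, compare, "col1", threshold=0.9)  # passes: 9 of 10 rows match
```

With the default `threshold=1.0` the same call raises, which mirrors today's all-or-nothing behaviour of `matches()`.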

shreya-goddu · 2025-04-10T20:12:30Z

> it's really mainly intended to be used for manual data comparison.

Is there a reason why it shouldn't support programmatic usage? It is just a python library. We have a number of use cases where we need to verify data from migrations or upgrades that cause extreme user toil if users are forced to verify it manually.

rhaffar · 2025-04-11T18:49:52Z

> > it's really mainly intended to be used for manual data comparison.

> Is there a reason why it shouldn't support programmatic usage? It is just a python library. We have a number of use cases where we need to verify data from migrations or upgrades that cause extreme user toil if users are forced to verify it manually.

There's nothing stopping you from doing so, but my understanding is that the original intent of datacompy is the generation of explicit human-readable output for people who want some sense of how their data differs, less so intended to be used as a testing library. @fdosani thoughts?

fdosani · 2025-04-11T18:52:17Z

I'm good with the programmatic execution here. It makes sense to me that people would want to automate as much as possible within some thresholds etc.

We should refine the intent to make sure we align on what we want to do. Also we need to make sure it applies to all data frame types.

rhaffar · 2025-04-11T19:03:18Z

Fair enough 👍 - In terms of intent, there are 3 conditions for a match:

1. The schema in both dataframes is identical.
2. All rows in both dataframes can be joined.
3. For every row, all comparable columns match exactly.

My understanding is that the first condition still applies, and the third is adjusted by a threshold. The second is a bit less clear: it's easier to just say it stays the same, but maybe that's not quite the most accurate intent. @shreya-goddu in your use case, when you mention this threshold, do you consider rows that fail to be joined between the dataframes as part of the tolerable error? Or are you still expecting all rows to be able to be joined, but want to excuse some limited amount of matching failures?
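The two readings of the threshold can be sketched as follows. This is only an illustration under stated assumptions, not a proposed DataComPy API: `ComparisonSummary`, its fields, and the `tolerate_unjoined` flag are hypothetical names, and the counts are assumed to have been computed already by whatever does the join.

```python
from dataclasses import dataclass


@dataclass
class ComparisonSummary:
    """Hypothetical pre-computed counts from a dataframe comparison."""
    schema_identical: bool
    joined_rows: int           # rows present in both dataframes
    joined_rows_matching: int  # joined rows whose comparable columns all match
    unjoined_rows: int         # rows present in only one dataframe


def matches(summary, threshold=1.0, tolerate_unjoined=False):
    # Condition 1: the schema must still be identical.
    if not summary.schema_identical:
        return False
    if tolerate_unjoined:
        # Reading A: unjoined rows count toward the tolerable error.
        total = summary.joined_rows + summary.unjoined_rows
    else:
        # Reading B: condition 2 is unchanged, so every row must join.
        if summary.unjoined_rows > 0:
            return False
        total = summary.joined_rows
    if total == 0:
        return True
    # Condition 3, relaxed: enough rows must match exactly.
    return summary.joined_rows_matching / total >= threshold


summary = ComparisonSummary(
    schema_identical=True, joined_rows=95, joined_rows_matching=92, unjoined_rows=5
)
strict_join = matches(summary, threshold=0.9)                           # False
lenient_join = matches(summary, threshold=0.9, tolerate_unjoined=True)  # True
```

The same data passes under one reading and fails under the other, which is why the answer to the question above changes what a threshold parameter would actually mean.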

shreya-goddu · 2025-04-17T16:22:40Z

Good questions. We will take this back, refine the intent further, and report back. Leaving the issue open in the meantime.
