Add support to compare.matches() to accept optional threshold #400


Open
shreya-goddu opened this issue Apr 8, 2025 · 7 comments

Comments

@shreya-goddu

We have a number of use cases where a 100% row match isn't required and 90%, or another configurable value, is acceptable. Could support be added for a custom threshold, so that applications can fail the comparison when that threshold isn't met?

@rhaffar
Contributor

rhaffar commented Apr 10, 2025

Could you share a bit more about your use case? It sounds like you're using datacompy for programmatic testing purposes, but it's really mainly intended to be used for manual data comparison.

@shreya-goddu
Author

shreya-goddu commented Apr 10, 2025

Yea! We would integrate this library with an application that runs in production. The goal is to be able to programmatically use the tool to determine if dataframes match.

Right now we are able to do something like this:

compare = SparkSQLCompare(base_df=df1, compare_df=df2, join_col='col1')
if not compare.matches():
    raise Exception()

The above requires a 100% row match and fails when the dataframes differ in even one record. For some use cases a 100% match isn't required, and it's acceptable for 90% of the rows to match. We are looking to see whether DataComPy could support such a scenario.
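To make the request concrete, here is a minimal sketch of the kind of check being asked for. It is not part of DataComPy: `match_ratio` and `matches_with_threshold` are hypothetical helpers, and the row counts are assumed to have been obtained from a comparison by some other means.

```python
def match_ratio(matching_rows: int, total_rows: int) -> float:
    """Return the fraction of rows that match (hypothetical helper).

    An empty comparison is treated as a full match, which is an assumption.
    """
    if not 0 <= matching_rows <= total_rows:
        raise ValueError("counts must satisfy 0 <= matching_rows <= total_rows")
    return 1.0 if total_rows == 0 else matching_rows / total_rows


def matches_with_threshold(
    matching_rows: int, total_rows: int, threshold: float = 1.0
) -> bool:
    """Return True when the match ratio is at least `threshold`.

    The default of 1.0 keeps today's behaviour: every row must match.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return match_ratio(matching_rows, total_rows) >= threshold


# 90 of 100 rows match: fails the strict check, passes a 90% threshold.
strict_ok = matches_with_threshold(90, 100)
relaxed_ok = matches_with_threshold(90, 100, threshold=0.9)
```

Defaulting the threshold to 1.0 would keep existing callers unaffected; only applications that opt in would see the relaxed behaviour.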

@shreya-goddu
Author

shreya-goddu commented Apr 10, 2025

> it's really mainly intended to be used for manual data comparison.

Is there a reason why it shouldn't support programmatic usage? It is just a python library. We have a number of use cases where we need to verify data from migrations or upgrades that cause extreme user toil if users are forced to verify it manually.

@rhaffar
Contributor

rhaffar commented Apr 11, 2025

> > it's really mainly intended to be used for manual data comparison.
>
> Is there a reason why it shouldn't support programmatic usage? It is just a python library. We have a number of use cases where we need to verify data from migrations or upgrades that cause extreme user toil if users are forced to verify it manually.

There's nothing stopping you from doing so, but my understanding is that the original intent of datacompy is to generate explicit, human-readable output for people who want some sense of how their data differs, rather than to serve as a testing library. @fdosani thoughts?

@fdosani
Member

fdosani commented Apr 11, 2025

I'm good with the programmatic execution here. It makes sense to me that people would want to automate as much as possible, within some thresholds etc.

We should refine the intent to make sure we align on what we want to do. Also we need to make sure it applies to all data frame types.

@rhaffar
Contributor

rhaffar commented Apr 11, 2025

Fair enough 👍 - In terms of intent, there are 3 conditions for a match:

  1. The schema in both dataframes is identical
  2. All rows in both dataframes can be joined
  3. All comparable columns match exactly in every row.

My understanding is that the first condition still applies, and the third is adjusted by a threshold. The second is a bit less clear: it's easier to just say it stays the same, but maybe that's not quite the most accurate intent. @shreya-goddu in your use case, when you mention this threshold, do you consider rows that fail to be joined between the dataframes as part of the tolerable error? Or are you still expecting all rows to be able to be joined, but want to excuse some limited amount of matching failures?
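The two readings of the threshold can be sketched side by side. This is plain Python over dictionaries rather than DataComPy code; every name below is hypothetical, each "dataframe" is modelled as a mapping from join key to a tuple of column values, and the schema condition is assumed to have been checked already.

```python
def compare_rows(base: dict, compare: dict) -> dict:
    """Count joined, unjoined, and exactly matching rows (hypothetical helper)."""
    shared = base.keys() & compare.keys()    # keys present on both sides
    unjoined = base.keys() ^ compare.keys()  # keys present on one side only
    matching = sum(1 for key in shared if base[key] == compare[key])
    return {"joined": len(shared), "unjoined": len(unjoined), "matching": matching}


def matches_strict_join(stats: dict, threshold: float) -> bool:
    """Reading A: every row must join; the threshold excuses value mismatches only."""
    if stats["unjoined"] > 0:
        return False
    if stats["joined"] == 0:
        return True
    return stats["matching"] / stats["joined"] >= threshold


def matches_lenient_join(stats: dict, threshold: float) -> bool:
    """Reading B: rows that fail to join count toward the tolerated error."""
    total = stats["joined"] + stats["unjoined"]
    if total == 0:
        return True
    return stats["matching"] / total >= threshold


base = {key: ("value", key) for key in range(10)}

# One row missing from the compare side, all other rows identical.
missing_row = {key: row for key, row in base.items() if key != 9}
missing_stats = compare_rows(base, missing_row)

# All rows present, one row with a different value.
changed_row = dict(base)
changed_row[0] = ("other", 0)
changed_stats = compare_rows(base, changed_row)
```

With a 0.9 threshold, the changed-value case passes under both readings, while the missing-row case passes only under reading B. That difference is what the question above is trying to pin down.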

@shreya-goddu
Author

shreya-goddu commented Apr 17, 2025

Good questions. We will take this back, refine the intent further, and bring it back here. Leaving the issue open in the meantime.
