Add support to compare.matches() to accept optional threshold #400
Comments
Could you share a bit more about your use case? It sounds like you're using datacompy for programmatic testing purposes, but it's really mainly intended to be used for manual data comparison.
Yea! We would integrate this library with an application that runs in production. The goal is to be able to programmatically use the tool to determine if dataframes match. Right now we are able to do something like this
The above will do a 100% row match and fail when the dataframes have even one record that's different. For some use cases, 100% matches aren't required and it's permissible to have 90% of the rows match. We are looking to see if DataComPy could support such a scenario.
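One way to picture the requested behavior, as a rough sketch only: `matches_with_threshold` below is a hypothetical helper and not part of DataComPy. It uses plain dicts keyed by a join value as a stand-in for dataframes, and it assumes `threshold` means the minimum fraction of joined rows whose values must be equal.

```python
def matches_with_threshold(left, right, threshold=1.0):
    """Hypothetical helper, not DataComPy API.

    left, right: dicts mapping a join key to a tuple of row values
                 (a stand-in for two dataframes joined on that key).
    threshold:   minimum fraction of shared keys whose rows must be equal.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    shared = left.keys() & right.keys()
    if not shared:
        # Nothing could be joined, so there is nothing to call a match.
        return False
    equal = sum(1 for key in shared if left[key] == right[key])
    return equal / len(shared) >= threshold


# Ten rows, one of which differs between the two sides.
left = {i: (i, "a") for i in range(10)}
right = dict(left)
right[3] = (3, "b")

strict = matches_with_threshold(left, right)          # False: 9 of 10 rows match
relaxed = matches_with_threshold(left, right, 0.9)    # True: 90% meets the threshold
```

With the default of `1.0` the sketch behaves like an all-or-nothing check, so existing callers would see no change under this assumption.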
Is there a reason why it shouldn't support programmatic usage? It is just a Python library. We have a number of use cases where we need to verify data from migrations or upgrades that cause extreme user toil if users are forced to verify it manually.
There's nothing stopping you from doing so, but my understanding is that the original intent of datacompy is to generate explicit, human-readable output for people who want some sense of how their data differs; it was less intended to be used as a testing library. @fdosani thoughts?
I'm good with the programmatic execution here. It makes sense to me that people would want to automate as much as possible within some thresholds etc. We should refine the intent to make sure we align on what we want to do. We also need to make sure it applies to all dataframe types.
Fair enough 👍 - In terms of intent, there are 3 conditions for a match:
My understanding is the first condition still applies, and the third is adjusted by a threshold. The second is a bit less clear; it's easier to just say it stays the same, but maybe that's not quite the most accurate intent. @shreya-goddu in your use case, when you mention this threshold, do you consider rows that fail to be joined between the dataframes as part of the tolerable error? Or are you still expecting all rows to be able to be joined, but want to excuse some limited amount of matching failures?
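The question above comes down to which rows go in the denominator. The sketch below contrasts the two readings; `match_fraction` and its `count_unjoined_as_failures` flag are hypothetical names for illustration, not DataComPy API, and dicts keyed by a join value again stand in for dataframes.

```python
def match_fraction(left, right, count_unjoined_as_failures):
    """Hypothetical helper, not DataComPy API.

    Returns the fraction of rows considered matching.
    - count_unjoined_as_failures=False: only joined rows are counted,
      so rows present on one side only are ignored here.
    - count_unjoined_as_failures=True: rows present on one side only
      count against the fraction as failures.
    """
    shared = left.keys() & right.keys()
    equal = sum(1 for key in shared if left[key] == right[key])
    if count_unjoined_as_failures:
        total = len(left.keys() | right.keys())
    else:
        total = len(shared)
    return equal / total if total else 0.0


# Ten rows on the left, only the first eight on the right, all joined rows equal.
left = {i: (i, "a") for i in range(10)}
right = {i: (i, "a") for i in range(8)}

joined_only = match_fraction(left, right, count_unjoined_as_failures=False)  # 1.0
all_rows = match_fraction(left, right, count_unjoined_as_failures=True)      # 0.8
```

Against a 90% threshold the first reading would pass and the second would fail, which is why the intent needs to be settled before a threshold is added.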
Good questions. We will take this back, refine the intent further, and bring it back. Leaving the issue open in the meantime.
We have a number of use cases where a 100% row match isn't required and 90% or another configurable value is permissible. Could support be added for a custom threshold, so that applications can fail the comparison when that threshold isn't met?