Add custom decoder in arrow-json #7442
Conversation
FYI @tustvold @alamb @adriangb since you worked on the encoder one. This would allow us (I work with @cht42) to avoid vendoring the code to get Spark compatibility. Alternatively, if you have other ideas for how to reach our goal (we could also make the Spark compat a config option, for example), let us know!
I'm afraid I am unlikely to have time to review this in the near future, but I am not a big fan of making the tape public; it can and has changed in non-trivial ways in the past.
What does this look like? Is it just using nulls for fallible conversions (which doesn't sound like a big lift), or is it something more intrusive?
It is more intrusive when the target type is a string: in that case Spark keeps the raw JSON text of the value instead of producing null. For example, with a target schema where the field is declared as a string, a mismatched value is preserved verbatim; see the sketch below.
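A minimal sketch of the behavior in question, with hypothetical field names and values. The strict decode error is what arrow-json produces today; the commented output is what Spark-compatible decoding would yield:

```rust
use std::sync::Arc;
use arrow_json::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};

fn main() {
    // Target schema: "payload" is declared Utf8, i.e. a plain string.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, true),
        Field::new("payload", DataType::Utf8, true),
    ]));

    // The incoming value for "payload" is a JSON object, not a string.
    let json = r#"{"id": 1, "payload": {"key": "value"}}"#;

    let mut reader = ReaderBuilder::new(schema)
        .build(std::io::Cursor::new(json))
        .unwrap();

    // Today this yields a decode error for "payload". A Spark-compatible
    // string decoder would instead keep the raw JSON text:
    //   id = 1, payload = "{\"key\": \"value\"}"
    println!("{:?}", reader.next().unwrap());
}
```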
For just falling back to null on parsing errors, #7230 is open and would handle that.
I haven't looked at the code yet, but I will try to review -- although I may not be able to give much feedback, since I haven't worked on this side of the codebase before and we do not use it internally.
FWIW I have now heard multiple use cases for the tape API so far. So I think making it more public-facing is not unreasonable, though that will of course constrain us in the future.
I will try to review this PR later this week.
tl;dr: Custom error handling for JSON parsing is really complex. I have lots of observations (see below) but no good answers.

Having dealt quite a bit with Spark's JSON parsing, it seems like there's always one more corner case to worry about, no matter how many options are available for customizing it "from the outside" (via configs etc.). This PR, for example, seems to assume that the desired parsing semantics are determined entirely by the data type -- but in practice, the semantic meaning of a given field also matters (e.g. the names of the field and its ancestors). For example, it might make sense to convert an invalid optional numeric field to NULL while still erroring out on non-nullable fields (or nullable fields whose schema the application considers to be more "strict" for whatever reason). It can also make a difference whether the "invalid" value merely had an unexpected type vs. altogether invalid JSON that fails to even parse (Spark's JSON parser has a notion of "rescue columns" that tries to cover the latter).

With variant on the horizon, it's tempting to just parse as variant (which succeeds for any lexically valid input), followed by a combination of field extraction and [try-]casting to deal with fields on a case-by-case basis. The downside of that approach is that variant support doesn't actually exist yet. It also lends itself more to specialized parsing situations where the expected schema is statically known. And it doesn't solve the problem very nicely for nested fields -- should a try-cast of a struct field succeed with a NULL child, or outright fail, if a child had the wrong type?

The factory approach in this PR, on the other hand, is better for generic parsing where we don't statically know anything about the semantics of the schema being parsed. It also handles deep nesting better. One could also argue that this custom decoder is just an internal way to leverage variant-like parsing at the tape decoder level, needed because actual variant doesn't exist externally yet. Though there is definitely an efficiency argument for making decisions closer to the source data (string -> tape -> final), instead of translating through an intermediate variant along the way (string -> tape -> variant -> final).
I think this has hit the nail on the head as to why I'm wary about exposing the tape: at that point we're effectively exposing a custom variant implementation. People will inevitably use it as a kind of Arrow BSON, and it'll then become very sticky and hard to change.
Similar behavior to #7015, but on the decoder side. This would allow introducing flexibility when decoding JSON.
For example, when trying to replicate Spark's JSON decoding: Spark decodes values of the wrong type as null, or as the raw string when the target type is a string. This change would allow overriding the current StringDecoder to implement that behavior.
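For context, a rough sketch of the shape such a decoder hook could take. The trait and method names here (`DecoderFactory`, `make_custom_decoder`, `CustomDecoder`) are invented for illustration and need not match this PR's actual API:

```rust
use arrow_schema::{ArrowError, Field};

/// Illustrative only: names and signatures below are assumptions, not the
/// API added by this PR.
///
/// A factory consulted once per field when the reader is built. Returning
/// `Ok(None)` falls back to the built-in decoder for that field's type.
pub trait DecoderFactory: Send + Sync {
    fn make_custom_decoder(
        &self,
        field: &Field,
    ) -> Result<Option<Box<dyn CustomDecoder>>, ArrowError>;
}

/// A per-column decoder. A Spark-compatible implementation for Utf8 would,
/// on a type mismatch, append the raw JSON text of the value rather than
/// returning an error (or null).
pub trait CustomDecoder: Send + Sync {
    /// Decode one value from its raw JSON text; `None` represents a JSON
    /// null or a missing value.
    fn decode_value(&mut self, raw: Option<&str>) -> Result<(), ArrowError>;
}
```

The central design question raised in the discussion above is how much of the internal tape representation a hook like this would need to expose in order to be useful.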