Add custom decoder in arrow-json #7442

Open · wants to merge 1 commit into main

Conversation

@cht42 commented Apr 24, 2025

Similar behavior to #7015, but on the decoder side. This would add flexibility when decoding JSON.

For example, when trying to replicate Spark's JSON decoding: Spark decodes invalid values as null or as the raw string. This change would allow overriding the current StringDecoder to implement that behavior.
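
To make the idea concrete, here is a hypothetical sketch of what a decoder-side factory could look like, mirroring the EncoderFactory added in #7015. The names (DecoderFactory, ArrayDecoder, make_custom_decoder) and signatures are illustrative stand-ins, not the API actually proposed by this PR; in particular, the real tape types are currently private to arrow-json.

```rust
// Hypothetical sketch only; names and signatures are illustrative,
// not the API proposed by this PR.
use arrow_array::ArrayRef;
use arrow_schema::{ArrowError, Field};

/// Stand-in for arrow-json's internal per-type tape decoder.
trait ArrayDecoder: Send {
    /// Decode the values at the given tape positions into an Arrow array.
    fn decode(&mut self, tape_positions: &[u32]) -> Result<ArrayRef, ArrowError>;
}

/// Implemented by callers to override decoding for chosen fields.
/// Returning `Ok(None)` falls back to the built-in decoder, so a factory
/// only needs to special-case the fields it cares about (e.g. strings,
/// to get Spark's raw-string fallback).
trait DecoderFactory: Send + Sync {
    fn make_custom_decoder(
        &self,
        field: &Field,
    ) -> Result<Option<Box<dyn ArrayDecoder>>, ArrowError>;
}
```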

@github-actions bot added the arrow (Changes to the arrow crate) label on Apr 24, 2025
@Blizzara (Contributor)

FYI @tustvold @alamb @adriangb, since you worked on the encoder one. This would let us (I work with @cht42) avoid vendoring the code to get Spark compatibility. Alternatively, if you have other ideas for how to reach our goal (we could also make the Spark compat a config option, for example), let us know!

@tustvold (Contributor) commented Apr 24, 2025

I'm afraid I am unlikely to have time to review this in the near future, but I am not a big fan of making the tape public; it can and has changed in non-trivial ways in the past.

get Spark compatibility

What does this look like? Is it just using nulls for fallible conversions (that doesn't sound like a big lift), or is it something more intrusive?

@cht42 (Author) commented Apr 24, 2025

It is more intrusive when the target type is string: in that case Spark uses the raw string instead of null. For example, the input

{"a": [1,2,3]}

with a target schema of map<str, str> will be decoded by Spark as

{"a": "[1,2,3]"}
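
As an illustration of that fallback, here is a minimal sketch using serde_json for clarity; the PR itself would implement this inside a custom StringDecoder operating on arrow-json's tape.

```rust
// Sketch of Spark's string fallback, assuming the semantics described
// above; serde_json stands in for arrow-json's internal tape.
use serde_json::Value;

/// String targets take string values as-is; any other value (array,
/// object, number, bool) falls back to its raw JSON text instead of
/// erroring or becoming null.
fn coerce_to_string(value: &Value) -> String {
    match value {
        Value::String(s) => s.clone(),
        other => other.to_string(), // e.g. [1,2,3] -> "[1,2,3]"
    }
}

fn main() {
    let v: Value = serde_json::from_str(r#"{"a": [1,2,3]}"#).unwrap();
    assert_eq!(coerce_to_string(&v["a"]), "[1,2,3]");
}
```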

For simply falling back to null on parsing errors, #7230 is open and would handle that.

@adriangb (Contributor)

Without having looked at the code, I will try to review, although I may not be able to give much feedback since I haven't worked on this side of the codebase before and we do not use it internally.

@alamb (Contributor) commented Apr 28, 2025

FWIW I have now heard multiple use cases for the tape API so far. So I think making it more public facing is not unreasonable, though that will of course constrain us in the future.

@alamb (Contributor) commented Apr 28, 2025

@tustvold it is your own fault for creating something so useful 😆

If we are going to make it public, maybe we can avoid the requirement to copy the data into a contiguous buffer while we are at it 🤔 That was also mentioned by @mwylde in his blog.

@alamb (Contributor) commented Apr 28, 2025

I will try to review this PR later this week.

@alamb (Contributor) commented Apr 28, 2025

@cht42 / @Blizzara is there any chance you can file a ticket that describes your use case? (I think it is to implement Spark-specific JSON -> Arrow parsing, in terms of nulls, numbers, or whatever.)

@scovich (Contributor) commented Apr 28, 2025

tl;dr: Custom error handling for JSON parsing is really complex. I have lots of observations (see below) but no good answers.

Having dealt quite a bit with Spark's JSON parsing, I've found there's always one more corner case to worry about, no matter how many options are available for customizing it "from the outside" (via configs etc.).

This PR, for example, seems to assume that the desired parsing semantics are determined entirely by the data type, but in practice the semantic meaning of a given field also matters (e.g. the names of the field and its ancestors). It might make sense to convert an invalid optional numeric field to NULL while still erroring out on non-nullable fields (or on nullable fields whose schema the application considers more "strict" for whatever reason). It can also make a difference whether the "invalid" value merely had an unexpected type, vs. being altogether invalid JSON that fails to even parse (the Spark JSON parser has a notion of "rescue columns" that tries to cover the latter).
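
For instance, here is a minimal sketch of a policy that keys off the field itself rather than just its type. Field is real arrow-schema API, but the OnInvalid enum and the "strict" path convention are hypothetical.

```rust
// Illustrative only: parsing policy driven by the field's identity and
// nullability, not just its data type. `OnInvalid` and the "strict"
// ancestor convention are hypothetical.
use arrow_schema::{DataType, Field};

enum OnInvalid {
    Null,      // replace the bad value with null
    RawString, // keep the raw JSON text (Spark's string fallback)
    Error,     // fail the batch
}

fn policy_for(field: &Field, ancestor_names: &[&str]) -> OnInvalid {
    if ancestor_names.contains(&"strict") {
        OnInvalid::Error // fields under a "strict" ancestor always error
    } else if field.data_type() == &DataType::Utf8 {
        OnInvalid::RawString
    } else if field.is_nullable() {
        OnInvalid::Null
    } else {
        OnInvalid::Error
    }
}
```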

With variant on the horizon, it's tempting to just parse as variant (which succeeds for any lexically valid input), followed by a combination of field extraction and [try-]casting to deal with fields on a case-by-case basis. The downside of that approach is that variant support doesn't actually exist yet. It also lends itself more to specialized parsing situations where the expected schema is statically known. And it doesn't solve the problem very nicely for nested fields: should a try-cast of a struct field succeed with a NULL child, or outright fail, if a child had the wrong type?

The factory approach in this PR, on the other hand, is better for generic parsing where we don't statically know anything about the semantics of the schema being parsed. It also handles deep nesting better.

One could also argue that this custom decoder is just an internal way to leverage variant-like parsing at the tape-decoder level, needed because actual variant support doesn't exist externally yet. Though there is definitely an efficiency argument for making decisions closer to the source data (string -> tape -> final) instead of translating through an intermediate variant along the way (string -> tape -> variant -> final).

@cht42 (Author) commented Apr 29, 2025

@cht42 / @Blizzara is there any chance you can file a ticket that describes your use case? (I think it is to implement Spark-specific JSON -> Arrow parsing, in terms of nulls, numbers, or whatever.)

Created #7453.

@tustvold (Contributor)

One could also argue that this custom decoder is just an internal way to leverage variant-like parsing at the tape-decoder level, needed because actual variant support doesn't exist externally yet

I think this has hit the nail on the head as to why I'm wary about exposing the tape: at that point we're effectively exposing a custom variant implementation. People will inevitably use it as a kind of Arrow BSON, and it will then become very sticky and hard to change.
