Add custom decoder in arrow-json #7442
Conversation
FYI @tustvold @alamb @adriangb since you worked on the encoder one. This would allow us (I work with @cht42) to avoid vendoring the code to get Spark compatibility. Alternatively, if you have other ideas for how to reach our goal (we could also make the Spark compat a config option, for example), let us know!
I'm afraid I am unlikely to have time to review this in the near future, but I am not a big fan of making the tape public; it can and has changed in non-trivial ways in the past.
What does this look like? Is it just using nulls for fallible conversions (which doesn't sound like a big lift), or is it something more intrusive?
It is more intrusive when the target type is a string: in that case Spark keeps the raw JSON text of the value instead of producing null. For example, with a target schema where the field is declared as a string, a mismatched value is preserved verbatim; see the sketch below.
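A minimal sketch of the behavior in question, with hypothetical field names and values. The strict decode error is what arrow-json produces today; the commented output is what Spark-compatible decoding would yield:

```rust
use std::sync::Arc;
use arrow_json::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};

fn main() {
    // Target schema: "payload" is declared Utf8, i.e. a plain string.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, true),
        Field::new("payload", DataType::Utf8, true),
    ]));

    // The incoming value for "payload" is a JSON object, not a string.
    let json = r#"{"id": 1, "payload": {"key": "value"}}"#;

    let mut reader = ReaderBuilder::new(schema)
        .build(std::io::Cursor::new(json))
        .unwrap();

    // Today this yields a decode error for "payload". A Spark-compatible
    // string decoder would instead keep the raw JSON text:
    //   id = 1, payload = "{\"key\": \"value\"}"
    println!("{:?}", reader.next().unwrap());
}
```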
For just falling back to null on parsing errors, #7230 is open and would handle that.
I haven't looked at the code yet, but I will try to review -- although I may not be able to give much feedback, since I haven't worked on this side of the codebase before and we do not use it internally.
FWIW I have now heard multiple use cases for the tape API so far. So I think making it more public-facing is not unreasonable, though that will of course constrain us in the future.
I will try to review this PR later this week.
tl;dr: Custom error handling for JSON parsing is really complex. I have lots of observations (see below) but no good answers.

Having dealt quite a bit with Spark's JSON parsing, it seems like there's always one more corner case to worry about, no matter how many options are available for customizing it "from the outside" (via configs etc.). This PR, for example, seems to assume that the desired parsing semantics are determined entirely by the data type -- but in practice, the semantic meaning of a given field also matters (e.g. the names of the field and its ancestors). For example, it might make sense to convert an invalid optional numeric field to NULL while still erroring out on non-nullable fields (or nullable fields whose schema the application considers to be more "strict" for whatever reason). It can also make a difference whether the "invalid" value merely had an unexpected type vs. altogether invalid JSON that fails to even parse (Spark's JSON parser has a notion of "rescue columns" that tries to cover the latter).

With variant on the horizon, it's tempting to just parse as variant (which succeeds for any lexically valid input), followed by a combination of field extraction and [try-]casting to deal with fields on a case-by-case basis. The downside of that approach is that variant support doesn't actually exist yet. It also lends itself more to specialized parsing situations where the expected schema is statically known. And it doesn't solve the problem very nicely for nested fields -- should a try-cast of a struct field succeed with a NULL child, or outright fail, if a child had the wrong type?

The factory approach in this PR, on the other hand, is better for generic parsing where we don't statically know anything about the semantics of the schema being parsed. It also handles deep nesting better. One could also argue that this custom decoder is just an internal way to leverage variant-like parsing at the tape decoder level, needed because actual variant doesn't exist externally yet. Though there is definitely an efficiency argument for making decisions closer to the source data (string -> tape -> final), instead of translating through an intermediate variant along the way (string -> tape -> variant -> final).
I think this has hit the nail on the head as to why I'm wary about exposing the tape: at that point we're effectively exposing a custom variant implementation. People will inevitably use it as a kind of Arrow BSON, and it'll then become very sticky and hard to change.
Similar behavior to #7015, but on the decoder side. This would allow introducing flexibility when decoding JSON.
For example, when trying to replicate Spark's JSON decoding: Spark decodes values of the wrong type as null, or as the raw string when the target type is a string. This change would allow overriding the current StringDecoder to implement that behavior.
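For context, a rough sketch of the shape such a decoder hook could take. The trait and method names here (`DecoderFactory`, `make_custom_decoder`, `CustomDecoder`) are invented for illustration and need not match this PR's actual API:

```rust
use arrow_schema::{ArrowError, Field};

/// Illustrative only: names and signatures below are assumptions, not the
/// API added by this PR.
///
/// A factory consulted once per field when the reader is built. Returning
/// `Ok(None)` falls back to the built-in decoder for that field's type.
pub trait DecoderFactory: Send + Sync {
    fn make_custom_decoder(
        &self,
        field: &Field,
    ) -> Result<Option<Box<dyn CustomDecoder>>, ArrowError>;
}

/// A per-column decoder. A Spark-compatible implementation for Utf8 would,
/// on a type mismatch, append the raw JSON text of the value rather than
/// returning an error (or null).
pub trait CustomDecoder: Send + Sync {
    /// Decode one value from its raw JSON text; `None` represents a JSON
    /// null or a missing value.
    fn decode_value(&mut self, raw: Option<&str>) -> Result<(), ArrowError>;
}
```

The central design question raised in the discussion above is how much of the internal tape representation a hook like this would need to expose in order to be useful.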